Put your name and project on meta:Research

Daniel Kinzler
Hi all

Apparently, as a side effect of the "about wikipedia projects" thread, some
people (including myself) have started to put their names and projects on
<http://meta.wikimedia.org/wiki/Research>.

I encourage everyone to do the same. It's a great way to get an overview and to
find people to talk to.

-- Daniel

_______________________________________________
Wiki-research-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l

non-techie in need of guidance - Template namespace

Quse Guy
Greetings all,
Please excuse me if my message is inappropriate for this list, but I'm looking for a place to start and unsure where to begin.

I'm trying to get some sense of the scope of the Template namespace on the English-language Wikipedia: anything from sheer numbers, to which templates are the most edited, to which are the most used (either in terms of total number of transclusions/What Links Here, or else actual number of "hits").

To be a bit more specific, I'm particularly interested in those Templates which are in the following two categories:
* http://en.wikipedia.org/wiki/Category:Navbox_(navigational)_templates
* http://en.wikipedia.org/wiki/Category:Infobox_templates
But both of these categories consist of a large number of subcategories, sub-sub-categories, sub-sub-sub-...categories, making it difficult to attempt even a basic count of all the Navbox and Infobox templates.  Of course, determining which Navboxes and Infoboxes are the most edited or the most used would be impossible to ascertain manually.

I haven't written an SQL statement since taking a database course in 1998, so I'm wary of downloading one of the database dumps and attempting to manipulate things on my own.  Nor am I even certain that the data included in the dumps would allow me to aggregate across sub-sub-sub...categories, or derive edit counts or use counts.

Perhaps there's a GUI tool or interface that would be helpful in compiling these stats? Or perhaps these statistics are readily available and I simply haven't looked in the right places   :)

Again, any advice on this matter would be most appreciated.

Regards,
David


Re: non-techie in need of guidance - Template namespace

Brianna Laugher
2008/9/18 Quse Guy <[hidden email]>:

> Greetings all,
> Please excuse me if my message is inappropriate for this list, but I'm
> looking for a place to start and unsure where to begin.
>
> I'm trying to get some sense of the scope of the Template namespace on the
> English-language Wikipedia: anything from sheer numbers, to which templates
> are the most edited, to which are the most used (either in terms of total
> number of transclusions/What Links Here, or else actual number of "hits").
>
> To be a bit more specific, I'm particularly interested in those Templates
> which are in the following two categories:
> * http://en.wikipedia.org/wiki/Category:Navbox_(navigational)_templates
> * http://en.wikipedia.org/wiki/Category:Infobox_templates
> But both of these categories consist of a large number of subcategories,
> sub-sub-categories, sub-sub-sub-...categories, making it difficult to
> attempt even a basic count of all the Navbox and Infobox templates.  Of
> course, determining which Navboxes and Infoboxes are the most edited or the
> most used would be impossible to ascertain manually.
>
> I haven't written an SQL statement since taking a database course in 1998,
> so I'm wary of downloading one of the database dumps and attempting to
> manipulate things on my own.  Nor am I even certain that the data included
> in the dumps would allow me to aggregate across sub-sub-sub...categories, or
> derive edit counts or use counts.
>
> Perhaps there's a GUI tool or interface that would be helpful in compiling
> these stats? Or perhaps these statistics are readily available and I simply
> haven't looked in the right places   :)

Your best bet is really to befriend a techie to construct the SQL
query for you, then submit it via the Query service:
<https://wiki.toolserver.org/view/Query_service>

However, you could kinda-sorta hack something together using the API:
<http://en.wikipedia.org/w/api.php>
<http://www.mediawiki.org/wiki/API>

Here are my examples using the Python client mwclient
<https://mwclient.svn.sourceforge.net/svnroot/mwclient/trunk/mwclient>
and the interactive Python command line.

First, initialise stuff:

>>> import mwclient
>>> site = mwclient.Site('en.wikipedia.org')
>>> topcat = 'Category:Infobox templates'

So first, try to get all the subcats below the topcat in the tree by
recursing (very crudely) through them. Note that namespace 14 is the
Category namespace.

>>> allcats = []
>>> allcats.append(topcat)
>>> subcats = [p.name for p in site.Pages[topcat] if p.namespace==14]
>>> while len(subcats) > 0:
...  newsubcats = []
...  for s in subcats:
...   allcats.append(s)
...   newsubcats += [p.name for p in site.Pages[s] if p.namespace==14]
...  subcats = newsubcats
...

(I'm actually not sure this works. I got bored and killed it, and
len(allcats) was already 429. It would be better to be more strict and
record which categories we have checked, to cut off potential cycles
in the category graph.)
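
The stricter traversal suggested above could look something like the
following. This is only a sketch, written for Python 3: the function name
collect_categories, the get_subcats callback, and the toy graph are all
made up for illustration, so that the cycle handling can be checked without
touching the live site.

```python
from collections import deque

def collect_categories(top, get_subcats):
    """Return every category reachable from `top`, visiting each one once."""
    seen = {top}            # categories already recorded
    order = [top]           # same categories, in discovery order
    queue = deque([top])    # categories whose subcats are still unchecked
    while queue:
        current = queue.popleft()
        for sub in get_subcats(current):
            if sub not in seen:   # this check is what cuts off cycles
                seen.add(sub)
                order.append(sub)
                queue.append(sub)
    return order

# Hypothetical category graph containing a cycle (A -> B -> A).
toy_graph = {
    'Category:A': ['Category:B', 'Category:C'],
    'Category:B': ['Category:A', 'Category:D'],
    'Category:C': ['Category:D'],
    'Category:D': [],
}
result = collect_categories('Category:A', lambda c: toy_graph.get(c, []))
print(result)
```

Against the real site, get_subcats could presumably be the same list
comprehension used in the session above, i.e.
lambda c: [p.name for p in site.Pages[c] if p.namespace == 14].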

Anyway, assuming that actually works, now we want to only get the
templates in those categories. Note the Template namespace is 10.

>>> alltemplates = []
>>> for cat in allcats:
...  for p in site.Pages[cat]:
...   if p.namespace == 10:
...    alltemplates.append(p.name)
...

OK so now, for each template, we want to find out how many pages it is
embedded in. (Templates are embedded when they are referenced in
{{curly brackets}}, as opposed to regular old [[links]]). To be more
careful here there is probably a way to only get the embeddedin
results for the main namespace (ie, use in articles).

>>> embeddict = {}
>>> for t in alltemplates:
...   template = site.Pages[t]
...   embedtotal = len(list(template.embeddedin()))
...   embeddict[t] = embedtotal
...

This also takes a particularly long time. (I killed my process so my
examples below are truncated)
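
On restricting the embeddedin results to the main namespace: the MediaWiki
API's list=embeddedin query accepts an einamespace parameter, so one option
is to ask the API directly. The sketch below (Python 3) only builds the
request URL and does not fetch anything; the helper name embeddedin_url is
made up, and results come back in pages, so a real script would still have
to follow the continuation values the API returns.

```python
from urllib.parse import urlencode

API_URL = 'http://en.wikipedia.org/w/api.php'

def embeddedin_url(template, namespace=0, limit='max'):
    """Build a list=embeddedin query URL limited to one namespace."""
    params = [
        ('action', 'query'),
        ('list', 'embeddedin'),
        ('eititle', template),       # the template being transcluded
        ('einamespace', namespace),  # 0 is the main (article) namespace
        ('eilimit', limit),          # as many results per request as allowed
        ('format', 'json'),
    ]
    return API_URL + '?' + urlencode(params)

url = embeddedin_url('Template:Infobox Website')
print(url)
```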

>>> embeddict.values()
[1, 6, 60, 141, 2, 2, 47, 0, 88, 19, 2, 212, 186, 1, 595, 76, 17, 444,
13, 70, 15, 87, 5, 11, 0, 102, 25, 356, 289, 1, 272, 184, 14, 2, 0,
14, 7, 2, 1407, 20, 0, 7, 32, 19, 0, 63, 1065, 31, 57, 72, 0, 2, 47,
5, 797, 3, 16, 3, 43, 99, 295, 14, 22, 0, 10, 9, 150, 6, 1, 1, 132, 5,
6, 110, 7, 42, 200, 58]

We can get a bit of an idea of really high usage.

>>> for k in embeddict.keys():
...  v = embeddict[k]
...  if v > 1000:
...   print k, v
...
Template:Infobox Website 1407
Template:Infobox Organization 1065

embeddict.keys() will also serve as a list of all the templates in
that whole category tree.
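
To rank the templates rather than just filter on a threshold, the dictionary
can be sorted by count. A small Python 3 sketch: the function name
top_templates is made up, and the sample dictionary holds the two counts
printed above plus two invented entries for illustration.

```python
def top_templates(embeddict, n=10):
    """Return the n (template, count) pairs with the highest counts."""
    return sorted(embeddict.items(), key=lambda kv: kv[1], reverse=True)[:n]

sample = {
    'Template:Infobox Organization': 1065,
    'Template:Infobox Website': 1407,
    'Template:Hypothetical A': 60,   # invented entry
    'Template:Hypothetical B': 0,    # invented entry
}
ranked = top_templates(sample, n=2)
for name, count in ranked:
    print(name, count)
```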

Determining the most edits could also be done via the API. But I would
question the relevance of this, as (A) all of these templates will be
highly nested and dependent on other templates, so perhaps they should
be considered too, and (B) all these templates are very likely to be
highly complex and the vast majority of users will be discouraged
implicitly and usually explicitly from editing them.
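
If edit counts are still wanted, one way to get them is to page through
prop=revisions for each template and count the entries. The Python 3 sketch
below assumes the JSON response shape of the present-day API (a "continue"
block that is merged back into the next request); older API versions
signalled continuation differently. The names count_revisions, fetch and
fake_fetch are made up, and fake_fetch returns canned data so the loop can
be run offline.

```python
def count_revisions(title, fetch):
    """Count revisions of `title`; `fetch` maps request params to parsed JSON."""
    params = {
        'action': 'query',
        'prop': 'revisions',
        'titles': title,
        'rvprop': 'ids',      # only revision ids are needed for counting
        'rvlimit': 'max',
        'format': 'json',
    }
    total = 0
    while True:
        data = fetch(params)
        for page in data.get('query', {}).get('pages', {}).values():
            total += len(page.get('revisions', []))
        if 'continue' not in data:
            return total
        # Merge the continuation values into the next request.
        params = dict(params, **data['continue'])

def fake_fetch(params):
    """Stand-in for a real HTTP call: two batches holding 2 + 1 revisions."""
    if 'rvcontinue' not in params:
        return {
            'query': {'pages': {'1': {'revisions': [{'revid': 3}, {'revid': 2}]}}},
            'continue': {'rvcontinue': 'x', 'continue': '||'},
        }
    return {'query': {'pages': {'1': {'revisions': [{'revid': 1}]}}}}

edits = count_revisions('Template:Infobox Website', fake_fetch)
print(edits)
```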

As for "hits", do you mean views of the Template: page? This can be
determined from some recently released pageview statistics
(<http://dammit.lt/wikistats/>, see <http://stats.grok.se/> as an
example), but probably more relevant is how many times the template is
viewed when it is used on articles.

You could try adding up the pageviews of the various articles a
template is embedded in (and again this is available via the API) and
this would probably be reasonably reliable, given that these templates
are at the top of pages and so if the article is loaded/viewed, very
likely the template is too. There is a slight difficulty in that it is
hard to figure out when the template was added to a particular
article; obviously pageviews before that time wouldn't have seen the
template. I suspect the only way to figure that out is to wade through
page revisions and that is also possible via the API :) but it gives
me a bit of a headache just thinking about it.
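
Once the per-article pageviews and the date each article gained the template
are in hand, the adding-up step itself is simple. A Python 3 sketch with
entirely invented numbers, dates and article names; views_since_added is a
made-up helper, and articles with no known date are skipped rather than
guessed at.

```python
from datetime import date

def views_since_added(daily_views, added_on):
    """Sum each article's views from the day the template was added onward."""
    total = 0
    for article, per_day in daily_views.items():
        start = added_on.get(article)
        if start is None:
            continue  # date unknown: leave this article out
        total += sum(views for day, views in per_day.items() if day >= start)
    return total

# Invented data: two articles, two days of pageviews each.
daily_views = {
    'Article A': {date(2008, 9, 1): 100, date(2008, 9, 2): 120},
    'Article B': {date(2008, 9, 1): 10, date(2008, 9, 2): 30},
}
added_on = {
    'Article A': date(2008, 9, 1),   # template present on both days
    'Article B': date(2008, 9, 2),   # template added on the second day
}
total = views_since_added(daily_views, added_on)
print(total)
```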

The people at DBPedia might be able to help you.
<http://dbpedia.org/About> I am pretty sure most of the data they
extract from Wikipedia is from these kinds of templates, so they
probably have a good bunch of tips and tricks to share, too.

<plug> I also wrote a blog post about the history of and attitudes to
templates on English Wikipedia, which you might find interesting,
although probably not statistically relevant. :)
<http://brianna.modernthings.org/article/83/templatology-an-essay>
</plug>

cheers,
Brianna

--
They've just been waiting in a mountain for the right moment:
http://modernthings.org/

Re: non-techie in need of guidance - Template namespace

Phoebe Ayers-2
In reply to this post by Quse Guy
You might ask the folks at Freebase http://www.freebase.com/ for help.
They gave a presentation at one of the SF-Bay area meetups recently
and described how they've managed to extract data about templates and
infoboxes from Wikipedia. I am fuzzy on the details but they can
probably help... I believe most of their code is open source.

-- Phoebe

On Wed, Sep 17, 2008 at 6:49 PM, Quse Guy <[hidden email]> wrote:

> Greetings all,
> Please excuse me if my message is inappropriate for this list, but I'm
> looking for a place to start and unsure where to begin.
>
> I'm trying to get some sense of the scope of the Template namespace on the
> English-language Wikipedia: anything from sheer numbers, to which templates
> are the most edited, to which are the most used (either in terms of total
> number of transclusions/What Links Here, or else actual number of "hits").
>
> To be a bit more specific, I'm particularly interested in those Templates
> which are in the following two categories:
> * http://en.wikipedia.org/wiki/Category:Navbox_(navigational)_templates
> * http://en.wikipedia.org/wiki/Category:Infobox_templates
> But both of these categories consist of a large number of subcategories,
> sub-sub-categories, sub-sub-sub-...categories, making it difficult to
> attempt even a basic count of all the Navbox and Infobox templates.  Of
> course, determining which Navboxes and Infoboxes are the most edited or the
> most used would be impossible to ascertain manually.
>
> I haven't written an SQL statement since taking a database course in 1998,
> so I'm wary of downloading one of the database dumps and attempting to
> manipulate things on my own.  Nor am I even certain that the data included
> in the dumps would allow me to aggregate across sub-sub-sub...categories, or
> derive edit counts or use counts.
>
> Perhaps there's a GUI tool or interface that would be helpful in compiling
> these stats? Or perhaps these statistics are readily available and I simply
> haven't looked in the right places   :)
>
> Again, any advice on this matter would be most appreciated.
>
> Regards,
> David

--
- phoebe s. ayers | [hidden email]
