Updates: XML parser, weekly stats production, server, readership data

Jeremy Tobacman

[1] Thanks to superb work by Erik Garrison, we now have an efficient, C-based parser that extracts header data from WMF XML dumps into csv files readable by standard statistical software packages.
  * Source for this parser will soon be web-available; stay tuned.
  * The csv files will also be available online, either from http://download.wikimedia.org (if the parser can be run on the WMF servers) or from a webserver on karma or at NBER (see below).
  * If you just can't wait, let us know and we'll offer express service :)
  * The csv files consist of these variables with these types:
names:  title,articleid,revid,date,time,anon,editor,editorid,minor
types:  str,int,int,str,str,[0/1],str,int,[0/1]
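
For anyone who wants to start on the csv files outside a statistics package, here is a minimal Python sketch of reading rows with the variables and types listed above. It is an illustration only, not the parser's own code: it assumes the files have no header row, that fields are comma-separated in the order shown, that every int field is populated, and that the 0/1 flags are written as the characters 0 and 1. The name read_revisions, the Revision type, and the two sample rows (including their date and time formats) are made up for the example.

```python
import csv
import io
from typing import Iterator, NamedTuple, TextIO


class Revision(NamedTuple):
    """One row of the csv output, using the variable names from the list above."""
    title: str
    articleid: int
    revid: int
    date: str
    time: str
    anon: bool
    editor: str
    editorid: int
    minor: bool


def read_revisions(handle: TextIO) -> Iterator[Revision]:
    """Yield typed rows; assumes no header row and the column order shown above."""
    for row in csv.reader(handle):
        title, articleid, revid, date, time, anon, editor, editorid, minor = row
        yield Revision(
            title=title,
            articleid=int(articleid),
            revid=int(revid),
            date=date,
            time=time,
            anon=(anon == "1"),    # assumed encoding of the [0/1] flag
            editor=editor,
            editorid=int(editorid),
            minor=(minor == "1"),  # assumed encoding of the [0/1] flag
        )


# Made-up rows for illustration only; real values and formats may differ.
SAMPLE = (
    "Example article,12,345,2006-07-03,14:05:09,0,ExampleUser,678,1\n"
    "Example article,12,346,2006-07-04,09:30:00,1,192.0.2.1,0,0\n"
)

revisions = list(read_revisions(io.StringIO(SAMPLE)))
anon_share = sum(r.anon for r in revisions) / len(revisions)
print(f"{len(revisions)} revisions, share anonymous: {anon_share:.2f}")
```

To read a real file instead of the sample string, pass open(path, newline="") to read_revisions; the csv module then handles any quoted titles that contain commas.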

[2] We have begun to use these csv files to produce weekly sets of statistics.
See last week's work here:
http://en.wikipedia.org/wiki/Wikipedia:WikiProject_Wikidemia/Quant/Stats2006-07-03
This week we will finish out that set of stats.
Next week's list needs your creative suggestions:  Please edit directly!
http://en.wikipedia.org/wiki/Wikipedia:WikiProject_Wikidemia/Quant/Stats2006-07-17

[3] NBER has set us up with a pretty good Linux box, wikiq.nber.org, running Fedora Core 5.  Within two weeks, we hope to have Xen instances available for researchers interested in doing statistical analysis on the csv files.

[4] WMF readership data continues to be irretrievably lost.  What can we do to begin saving at least some of it as soon as possible?  If we were to save only articleid for one of every hundred squid requests, and include some indicator in the file at the end of each day, privacy concerns and computational burdens would be minimized, and this would still be a great start.
     How can we make this happen?
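
To make the proposal concrete, here is a rough Python sketch of the sampling idea: keep only the articleid for one of every hundred requests, and write a marker line at the end of each day. This is not a description of how the squids log anything. The input is assumed to be a chronological stream of (day, articleid) pairs, and the names sample_article_ids, SAMPLE_EVERY, and END_OF_DAY, as well as the marker format, are invented for the example.

```python
from typing import Iterable, Iterator, Optional, Tuple

SAMPLE_EVERY = 100    # keep one of every hundred requests
END_OF_DAY = "#END"   # hypothetical end-of-day indicator


def sample_article_ids(
    requests: Iterable[Tuple[str, int]],
    every: int = SAMPLE_EVERY,
) -> Iterator[str]:
    """Yield output lines: sampled articleids plus one marker per day.

    `requests` is assumed to be (day, articleid) pairs in chronological order.
    Nothing but the articleid of a sampled request is ever written out.
    """
    current_day: Optional[str] = None
    for count, (day, articleid) in enumerate(requests):
        # Close out the previous day before handling the first request of a new one.
        if current_day is not None and day != current_day:
            yield f"{END_OF_DAY} {current_day}"
        current_day = day
        # Systematic sampling: requests number 0, 100, 200, ... are kept.
        if count % every == 0:
            yield str(articleid)
    if current_day is not None:
        yield f"{END_OF_DAY} {current_day}"


# Made-up request stream: 250 requests on one day, 50 on the next.
fake_requests = [("2006-07-10", i) for i in range(250)]
fake_requests += [("2006-07-11", i) for i in range(250, 300)]

lines = list(sample_article_ids(fake_requests))
print("\n".join(lines))
```

Counting requests rather than drawing random numbers keeps the per-request cost to a single increment and comparison; a random one-in-a-hundred draw would work as well if a fixed stride turned out to interact badly with request patterns.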


Wiki-research-l mailing list