Wikimedia Research, Quantitative Analysis, General User Survey and more


This mail (including pictures) was sent to attendants of Wikimania 2006 and
some others that recently showed active interest in quantitative research.
Crossposting here. I hope you will find at least something in this mail that
is to your liking.

Wikimania 2006 was, like its predecessor in Frankfurt, a source of
Several official and impromptu meetings were held that were related to
research and quantitative analysis.
At a conference with 6 parallel sessions one has to make difficult choices,
and for me it was impossible to attend several highly interesting research


Wikimedia Research

I am very much looking forward to a transcript or at least speaker
notes and/or personal observations of several presentations.
Foremost among them James' Research about Wikimedia: A workshop [1]

I also hope that James as Chief Research Officer could give us a sense of
direction and timing: the mission of the Wikimedia Research Network [2] is
lofty, the number of Wikimedians that subscribed large, but the current
status for most activities seems to be 'idle' [3] [4]? Also, is there any
coordination with external research groups, such as those mentioned on [5]
and elsewhere [6]?

Would it be useful to divide Wikimedia Research Network activities into
A Quantitative Analysis
B Social Research Collaborations [7]
C Other Activities
and coordinate these separately?

C would still cover 50%+ of the WRN mission statement, like: identify the
needs of the individual Wikimedia projects, make recommendations for
targeted development, guide and motivate outside developers, assist in the
study of new project proposals.

I expect that most social science sessions at Wikimania [8] presented
relevant material and either used or added to quantitative research. So
there is synergy between A and B.



There was no IRC meeting of the Research Team after December 2005. There are
pretty active Wikimedia researchers outside the team though. For me
Wikimania 2006 confirmed that more exchange of ideas would be helpful.

I'm not sure more IRC discussions are a panacea. Personally I prefer
discussion via wiki and mailing list: it is less spontaneous, but one can
more easily formulate a coherent proposal or comment on it in a thoughtful
manner and, no less important, it is much easier to follow for others who
read the discussion later.

Part of the information flow is now on meta, some of it on the research
mailing list [8] (which is largely dormant [9], though recent posts are very
useful). And some of it on the freelogy list [10] and probably elsewhere.

What about making the Wikimedia research list the central forum for all
broad and conceptual discussions and link from there to meta for detailed
discussions? I will post this mail there anyway, of course without the




I personally very much enjoyed the session Can Visualization Help? [11]
= IBM researcher Fernanda Viégas [12] talked about the famous Wikipedia
History Flow tool [13], which was recently extended. She announced a free
edition and said that Tim Starling had pledged to reinstate the relevant
export function so that we can use the tool on our projects.
= IBM researcher Martin Wattenberg [14] showed his newest toy where one can
see all contributions of one single Wikimedia editor, presented as an
association cloud (titles grouped per namespace and sorted by number of
edits, font size varied per title to express relative number of edits). It
is somewhat scary though: I feel that a quantitative improvement (exposing
data that are already online in a much more efficient manner) can lead to a
qualitative setback (exposing one's character and interests in a way that
was never expected). People may after all regret that they edited under their
real name. Although personally I will happily continue to do so, it is a
matter of responsibility towards the community to at least discuss whether
we should actively promote such a tool. I know I'm partially guilty in this
respect myself with mailing list stats but feel that did not cross the
= Visualization guru Ben Shneiderman [15] made a case for more advanced
data visualisation tools to spice up wikistats. I am a long time admirer of
several of his UI inventions and happy to take up the challenge.
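
The association cloud shown by Martin Wattenberg can be approximated in a
few lines. This is only a minimal sketch, assuming edits are available as
(namespace, title) pairs. The function name, the font size range and the
linear scaling are assumptions made for illustration and say nothing about
how his tool actually works.

```python
from collections import Counter, defaultdict


def association_cloud(edits, min_pt=8, max_pt=32):
    """Group edited titles per namespace, sort by edit count, assign font sizes.

    edits: iterable of (namespace, title) pairs, one per edit (assumed shape).
    Returns {namespace: [(title, edit_count, font_size_pt), ...]}.
    """
    counts = Counter(edits)
    if not counts:
        return {}
    top = max(counts.values())
    cloud = defaultdict(list)
    for (namespace, title), n in counts.items():
        # Linear scaling relative to the most edited title (an arbitrary choice).
        size = min_pt + round((max_pt - min_pt) * n / top)
        cloud[namespace].append((title, n, size))
    for entries in cloud.values():
        # Most edited titles first; ties are broken alphabetically.
        entries.sort(key=lambda e: (-e[1], e[0]))
    return dict(cloud)
```

Even this toy version shows the privacy point made above: nothing new is
collected, the edit history is merely rearranged so that one person's
interests can be read at a glance.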



General User Survey

One promising but sleeping WRT project, that I initiated myself, is the
'General User Survey' [21]. A few Wikimania participants interested in
wikistats gathered ad hoc at lunch time on Saturday (others interested in
the project, Cormaggio and Piotrus, were at the conference, but not in the
vicinity at that moment). Kevin Gamble, associate director of 75 Land-Grant
Universities, expressed his continued interest and said he might be able to
offer programming support.

A project definition plus rationale [21] and a mockup questionnaire form
[22] have been created and discussed for more than a year. I started the
transition towards technical design [23] and with Kevin's support and
resources coding might follow later this year. Once we have a proof of
concept in e.g. English and German (at least two languages to show
multilingual aspects) I'm sure more people will start to take notice, and
help to discuss and fine-tune the questionnaire. At a later stage, before
going live with a multilingual golden edition, we will probably have to
discuss matters with the board (Anthere already stated her support) in order
to make this an official survey, hopefully with coverage on the project
pages themselves (banner announcement?). Mind you, the implementation is
not exactly trivial: lots of issues are involved that require critical
discussion, code and coordination. I invite everyone to comment on tech
notes, especially of course Kevin, and hope to learn from him whether coding
this project fits within his budget.



Quantitative Analysis

Saturday I met Jeremy Tobacman. We had a long and very interesting
discussion, mainly on new initiatives centered around the freelogy servers.
Jeremy proposed to hold an impromptu lunch meeting on Sunday and gathered a
room full of people.

[pictures removed]

Several mails have already been written about this, but to a smaller
audience. So here are a few highlights.

Issues that were discussed:

1 Hardware
The two tool servers [32] are very crowded and insufficient for all stats
jobs we might want to run. The tool servers run a mirror of the live
database so well-behaved SQL queries are possible. Well-behaved meaning they
should not try to emulate the xml dump process, where extracting the English
Wikipedia (all revisions) already takes a full week.

Alexander Wait (Sasha) has access to huge hardware resources, enough to
calculate how many parallel universes it takes to find at least one zebra
couple where a black-and-white mother and a white-and-black father have
exactly mirrored patterns and thus produce offspring that is either all
black or all white (mind you, albinos are false positives).

Since in reality Sasha is merely interested in unraveling the secrets of DNA
he has some cpu cycles to spare. Upon request virtual machines can be
catered for. The freelogy-discuss mailing list archives have information
about hardware availability [33]

By the way, Jeremy and Erik Tobacman have a server at The National Bureau of
Economic Research (NBER) for quantitative research on Wikipedia.

Also I am urged by the Communications Subcommittee to spend more of my time
on publishable stats (in terms of time spent, the TomeRaider offline edition
of Wikipedia easily dominated, but the time for offline browsing is nearly
over) and they want me to have a dedicated server. I would like it to be well
utilised, but of course it should produce timely wikistats in the first
place, as that is what it is offered for. To be discussed.

2 Real time data collection / Performance / Storage
It would be useful to learn when a page is being slashdotted or otherwise in
the news, at the moment of the actual event, so that vandal patrols can be
summoned in time and article improvement can commence right away.

Major performance issues need to be addressed.

Do we gather and keep every page hit? Hardly practicable. Wikimedia visitor
stats were not disabled for no reason. It seems we are getting switches that
can log accesses stochastically (e.g. every 100th access, plus for a
selected subset of IP addresses all hits to monitor navigation patterns).
There might be a need to store data in aggregated (condensed) form, as
volumes will be huge. At least tapping from switches directly puts no burden
on squids (=web proxies/caches).
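
To make the sampling idea concrete, here is a minimal sketch in Python. It
assumes hits arrive as (ip, page) pairs and keeps one in every `rate` hits
at random, plus every hit from a watched set of IP addresses. The function
name and data shapes are hypothetical, and whether the switches sample
randomly or take exactly every 100th access is not known here.

```python
import random
from collections import Counter


def sample_hits(hits, watched_ips, rate=100, rng=None):
    """Return (estimated hits per page, full page trail per watched IP).

    hits: iterable of (ip, page) pairs (assumed input shape).
    watched_ips: set of IP addresses for which every hit is kept.
    rate: other hits are kept with probability 1/rate.
    """
    rng = rng or random.Random()
    estimated = Counter()
    trails = {}
    for ip, page in hits:
        if ip in watched_ips:
            # Watched IPs: keep the full navigation trail, count exactly.
            trails.setdefault(ip, []).append(page)
            estimated[page] += 1
        elif rng.randrange(rate) == 0:
            # Each sampled hit stands for `rate` real hits.
            estimated[page] += rate
    return estimated, trails
```

Only the aggregated counter and the short trails need to be stored, which is
the condensed form mentioned above. The estimates get noisier for pages with
few hits, so a low-traffic page may not show up at all.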

Brion will be asked to drop bz2 compression on the xml dump job, as it is so
much slower and compresses so much less than 7zip. Brion had to develop a
distributed version of bzip to get it working at all on the 800 GB enwiki
dump file. The bz2 format is however supported on more platforms, so Brion
may not comply.

Specifically about wikistats: I explained why I always process the full
historic dump instead of doing incremental steps: new functionality in
wikistats means processing it all anyway. Data for older months are not
really static due to frequent deletions and moves. Could I speed up the
counts section of wikistats by splitting the job over several servers? I'll
have to look into it.
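
One way such a split could work, as a rough sketch only: assign every page
to a shard by a stable hash of its title, let each server count its own
shard, and sum the partial results afterwards. The function names and the
(title, month) input shape are assumptions for illustration. This is not how
wikistats is organised today.

```python
import zlib
from collections import Counter


def shard_of(title, n_shards):
    """Stable shard number for a page title (crc32 is the same on every machine)."""
    return zlib.crc32(title.encode("utf-8")) % n_shards


def count_shard(revisions, shard, n_shards):
    """Count revisions per month for the pages that belong to one shard.

    revisions: iterable of (title, "YYYY-MM") pairs (assumed input shape).
    """
    counts = Counter()
    for title, month in revisions:
        if shard_of(title, n_shards) == shard:
            counts[month] += 1
    return counts


def merge(partials):
    """Sum the per-shard counters into one overall result."""
    total = Counter()
    for part in partials:
        total.update(part)
    return total
```

This only works for counts that add up across pages, such as revisions per
month. Counts like distinct contributors per month cannot simply be summed,
because the same editor may appear in several shards.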

3 Data publishing
We should be careful not to publish very granular data for outside
inspection. It is a well known fact that China wants complete control over
its citizens. Less known is that they have the latest technology (mainly
bought in the US) and lots of it, and about 30,000 IT professionals
(estimate by Reporters without Borders/Reporters sans Frontières) working on
concealment of internet resources, redirection of internet requests and
spying on internet usage patterns in general. They would love to see our raw
access logs. Cathy, will you attend the Chinese Wikimania? [34] If you happen
to hear about these things, I hope you will blog about it. See also [35]

See also the well-timed scoop [36] about the AOL privacy disaster.

4 Measuring quality quantitatively
It may be impossible to define quality, let alone measure it, but it will be
fun to zoom in on it and see how far we can get. Spurred by Jimbo's
excellent Wikimania kick off speech, where he stressed we will need more
attention to quality, I started a project to extend wikistats. Brian offered
lots of ideas and hopefully will prove me wrong in my belief that adding
spelling, grammar and readability assessments is not to be taken too lightly
in a multilingual environment [37] [38]

[31] mp3 audio
2.html (registration needed:
[35] (I wonder if he
is the person who gave a smashing full hour speech on this at 20c3 Berlin)
(data were anonymized but some users had searched for their own name several
times and were easily recognized, lots of very embarrassing stuff was
(conceptual overview)



By the way Angela Beasley and Jakob Voss will give a workshop on Wikipedia
research at WikiSym 2006 [41] [42]


Regards, Erik Zachte

Wiki-research-l mailing list