[Wikimedia-l] article bytes more meaningful than users or revisions (was Re: Updates on VE data analysis)

[Wikimedia-l] article bytes more meaningful than users or revisions (was Re: Updates on VE data analysis)

James Salsman-2
MZMcBride wrote:
>... the number of non-deleted revisions per day for the
> English Wikipedia. The results are here:
> https://en.wikipedia.org/wiki/Special:Permalink/565971356

So, that looks terrible: http://i.imgur.com/Z9lYCWj.png

It looks terrible in the same way that every other graph of active
users and several other related measures does.

But it isn't. It doesn't account for the power law of practice which
causes everyone who has ever edited Wikipedia to get better at it with
time. And since so many IP editors are obviously returning, that means
a lot more than under the false but very common assumption that every
IP editor is new.

Here's what really matters, articlespace size:  http://i.imgur.com/TfaD99V.png

The size of the article text in bytes has been marching on linearly
since the beginning of Wikipedia, with extremely low variation, just
like the short popular vital articles and every other measure of
quality content.

There is no legitimate basis to worry about anything until the total
article bytes break out of their 12-year linear trend.

(If you multiply columns 'E' and 'I' from
http://stats.wikimedia.org/EN/TablesWikipediaEN.htm the database size
shows a cusp at around 2006, corresponding to the growth modes, but
two separate linear trends fit both modes far better than any growth
model fits the entire curve.)

_______________________________________________
Wikimedia-l mailing list
[hidden email]
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l, <mailto:[hidden email]?subject=unsubscribe>

Re: [Wikimedia-l] article bytes more meaningful than users or revisions (was Re: Updates on VE data analysis)

Denny Vrandečić
Thank you for the observation.

Is the graph <http://i.imgur.com/TfaD99V.png> based on actual data? Because
it looks just a tad bit too linear to me. (I do not disagree with the
finding, just wondering about the graph itself).

I still would worry, though: our content is increasing linearly, as you
say, but the number of active contributors is not. If we take for granted
that active contributors are the ones who provide quality control for the
articles, this means that since 2006 or so the ratio of content per
contributor is linearly declining, which would mean that our quality would
suffer.

I see two effects to counter that:

1) as you already mentioned, contributors are getting increasingly more
experienced and more effective in fulfilling their tasks.

2) we continue to have a strong increase in readers and even stronger in
pageviews (i.e. more and more people consult Wikipedia more and more). They
probably also provide a layer of quality assurance, even though they might
not qualify to be counted as active contributors.

I have the gut feeling that 1) cannot be sufficient, and I would be curious
about the effects of 2), especially considering that much of the Foundation
development work can be seen as improving 2) further (visual editor,
article rating, mobile editing, etc.).





--
Project director Wikidata
Wikimedia Deutschland e.V. | Obentrautstr. 72 | 10963 Berlin
Tel. +49-30-219 158 26-0 | http://wikimedia.de

Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e.V.
Registered in the register of associations of the Amtsgericht
Berlin-Charlottenburg under number 23855 B. Recognized as a non-profit by
the Finanzamt für Körperschaften I Berlin, tax number 27/681/51985.

Re: [Wikimedia-l] article bytes more meaningful than users or revisions (was Re: Updates on VE data analysis)

John Mark Vandenberg
On Sat, Jul 27, 2013 at 6:29 PM, Denny Vrandečić
<[hidden email]> wrote:

> Thank you for the observation.
>
> Is the graph <http://i.imgur.com/TfaD99V.png> based on actual data? Because
> it looks just a tad bit too linear to me. (I do not disagree with the
> finding, just wondering about the graph itself).
>
> I still would worry, though: our content is increasing linearly, as you
> say, but the number of active contributors is not. If we take for granted
> that active contributors are the ones who provide quality control for the
> articles, this means that since 2006 or so the ratio of content per
> contributor is linearly declining, which would mean that our quality would
> suffer.

There are a few parts of this that I don't think can be taken for
granted, and I would love to see stats about quality rather than
quantity; you're talking about quality, and that should be a
significant component of our analysis.

1) 'active contributors are the ones who provide quality control'

   Bots do a lot of what used to be done by humans back in 2007,
rolling back most silly edits,
   and it is a small subset of active contributors who do the majority
of the maintenance.

2) the number of active contributors _doing quality control_ has declined.

   We know the number of overall editors is declining, and I think you
are right that the number doing quality control is declining, but is
there evidence to support it?  And does it show that this decline is a
problem?

My gut feeling is that the decline in 'quality control' edits is
tightly linked to the increase in bots doing quality control.

i.e. do we have research to support total article-to-editor ratio
having a bearing on average quality of content?
A proxy could be average number of references per article ..?
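
The references-per-article proxy could be computed along these lines. This is only a minimal sketch: it assumes article wikitext is already available as strings, counts opening <ref> tags with a regular expression, and uses function names invented for illustration. Reuses of a named reference are counted as separate citations, and references generated by templates are missed.

```python
import re

# Matches an opening <ref>, <ref name="x"> or <ref name="x" />,
# but not </ref> and not <references />.
REF_TAG = re.compile(r"<ref(?=[\s>/])", re.IGNORECASE)

def count_refs(wikitext):
    """Number of opening <ref> tags in one article's wikitext."""
    return len(REF_TAG.findall(wikitext))

def average_refs_per_article(articles):
    """Mean number of <ref> tags over a non-empty list of wikitext strings."""
    return sum(count_refs(text) for text in articles) / len(articles)

# Hypothetical sample articles, for illustration only.
sample = [
    'Fact.<ref>Source A</ref> More.<ref name="b">Source B</ref> '
    'Again.<ref name="b" />\n<references />',
    "No citations here.",
]
print(average_refs_per_article(sample))
```

Tracking this average over successive dumps would give a time series to set against the editor counts, though it measures citation density rather than quality as such.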

It seems unlikely, as our content over the last five years has
increased in quality, and our number of editors has declined.

> I see two effects to counter that:
>
> 1) as you already mentioned, contributors are getting increasingly more
> experienced and more effective in fulfilling their tasks.
>
> 2) we continue to have a strong increase in readers and even stronger in
> pageviews (i.e. more and more people consult Wikipedia more and more). They
> probably also provide a layer of quality assurance, even though they might
> not qualify to be counted as active contributors.
>
> I have the gut feeling that 1) cannot be sufficient, and I would be curious
> about the effects of 2), especially considering that much of the Foundation
> development work can be seen as improving 2) further (visual editor,
> article rating, mobile editing, etc.).

I agree with James that (1) is significant, and (2 - 'the future')
brings many unknowns with it.

(1) consists of our entire potential editor base, which includes
all our currently active editors, and all of our inactive editors who
are able to resume editing at any time - i.e. not blocked, not ^&%ed
off, etc.  They all know the syntax, and have demonstrated their
commitment to the vision, _and_ the writers have a personal connection
to the articles that they worked on.  I see lots of them come back
occasionally to touch up or expand their work.

(2) brings different editors, for good or ill.  There are some
concerns in the community that simplifying editing will bring more
non-trivial vandalism that bots can't handle, and even more
well-meaning editors who are discouraged when they can't understand why
their edit has disappeared, because they don't read the history, the
talk pages, etc.  The ratio of experienced editors to newbies could
be a significant factor in the maintenance of a friendly environment.

More is not always better.

Don't get me wrong; a good VE will be very helpful, and the projects'
defensive mechanisms will adapt.  But I predict that if we see lots of
poor quality articles from VE, without adequate references, and the
community backlogs become problematic, the community will want to develop
tools to limit new poor quality articles.

Does anyone have stats for the number of blocked users per month over
the years (as blocking shrinks our potential editor base), and for the
number of reverts of edits by new users?

--
John Vandenberg


Re: [Wikimedia-l] article bytes more meaningful than users or revisions (was Re: Updates on VE data analysis)

James Salsman-2
Denny Vrandečić wrote:
>...
> Is the graph <http://i.imgur.com/TfaD99V.png> based on actual data?

Yes, the precise sizes for the
dumps.wikimedia.org/enwiki/YYYYMMDD/enwiki-YYYYMMDD-pages-articles-multistream.xml.bz2
files are:

2012-07-02 9524994664
2012-08-02 9824345489
2012-09-02 9929910893
2012-10-01 10015876877
2012-11-01 10124555675
2012-12-01 10220499338
2013-01-02 10315766966
2013-02-04 10425240648
2013-03-04 10430830645
2013-04-03 10433658645
2013-05-03 10525475953
2013-06-04 10617572833
2013-07-08 10721955835
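
As a rough check on how linear these thirteen dump sizes are, a least-squares line can be fitted to them. The following is a minimal sketch in plain Python; the function name is illustrative, and the figures are the compressed .bz2 file sizes listed above rather than raw article text.

```python
from datetime import date

# Dump dates and compressed file sizes in bytes, copied from the list above.
DUMPS = [
    ("2012-07-02", 9524994664),
    ("2012-08-02", 9824345489),
    ("2012-09-02", 9929910893),
    ("2012-10-01", 10015876877),
    ("2012-11-01", 10124555675),
    ("2012-12-01", 10220499338),
    ("2013-01-02", 10315766966),
    ("2013-02-04", 10425240648),
    ("2013-03-04", 10430830645),
    ("2013-04-03", 10433658645),
    ("2013-05-03", 10525475953),
    ("2013-06-04", 10617572833),
    ("2013-07-08", 10721955835),
]

def fit_line(points):
    """Ordinary least-squares fit of size against days since the first dump.

    Returns (slope in bytes/day, intercept in bytes, R^2).
    """
    first = date.fromisoformat(points[0][0])
    xs = [(date.fromisoformat(d) - first).days for d, _ in points]
    ys = [size for _, size in points]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, intercept, 1 - ss_res / ss_tot

slope, intercept, r2 = fit_line(DUMPS)
print(f"{slope:,.0f} bytes/day, R^2 = {r2:.3f}")
```

An R^2 close to 1 would support the linear reading for this one-year window; it says nothing about the years before mid-2012.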

The byte count approximations from multiplying columns 'E' and 'I'
from http://stats.wikimedia.org/EN/TablesWikipediaEN.htm are at the
end of this message. Again, that data best fits two linear trends,
with a cusp around 2006.

> our content is increasing... but the number of active
> contributors is not.

I'm becoming increasingly convinced that as contributors become more
experienced, they choose to do most of their work logged out. What are
the advantages of using a registered account? Theoretically you can
prove that you made contributions, but as far as I know only one
person so far has ever obtained professional credit for their
contributions (there is a recent thread on wiki-research-l about
this.) What are the disadvantages of using a registered account to
edit? Anyone who opposes an edit politically is likely to examine the
entirety of the editor's contribution history and will all too often
stalk, punish by reverting old edits, or dispute the contributor's
work. Anonymous IP editors rarely face such time wasting scrutiny and
hassles. For anyone whose primary goal is to build an encyclopedia as
opposed to socializing, amassing administrative power, or obtaining a
job with the Foundation, the choice is obvious.  Those who wish their
contributions to be remembered for posterity are more likely to become
serial puppeteers than registered editors, unless they want to spend
most of their time being hassled in article space.

John Vandenberg wrote:
>...
> I would love to see stats about quality rather than quantity....

It would be a mistake to rely on volunteer or Foundation assessments
of quality, because the likelihood that they would be biased is far too
great. We should rely only on third party assessments of article
quality, such as those in
http://en.wikipedia.org/wiki/Reliability_of_Wikipedia#Assessments
nearly all of which show continuous ongoing improvement.

Automatic measures of quality proposed so far have not really
impressed me, but I think http://arxiv.org/pdf/1206.2517.pdf has huge
potential and I am confident that the ideas it promotes will be easily
automated by bots after it is proven through peer review.

> Does anyone have stats for the number of blocked users per month

Yes, but it's almost meaningless because the vast majority of blocks
are for persistent vandalism, often at schools or libraries where we
really have no way to determine whether the editors involved ever
returned to do productive work.

---

Products of columns 'E' and 'I' from
http://stats.wikimedia.org/EN/TablesWikipediaEN.htm :

Jan-10 11330500000
Dec-09 11262300000
Nov-09 11206500000
Oct-09 10788000000
Sep-09 10725000000
Aug-09 10653000000
Jul-09 10263100000
Jun-09 10213800000
May-09 9791600000
Apr-09 9718800000
Mar-09 9328500000
Feb-09 9301500000
Jan-09 9250200000
Dec-08 8855600000
Nov-08 8806200000
Oct-08 8415000000
Sep-08 8375000000
Aug-08 8317500000
Jul-08 7960800000
Jun-08 7941600000
May-08 7557800000
Apr-08 7498000000
Mar-08 7112600000
Feb-08 7068600000
Jan-08 6738900000
Dec-07 6699000000
Nov-07 6318000000
Oct-07 6256000000
Sep-07 5859600000
Aug-07 5823500000
Jul-07 5499000000
Jun-07 5181600000
May-07 5140800000
Apr-07 4793600000
Mar-07 4724800000
Feb-07 4662400000
Jan-07 4320000000
Dec-06 4257000000
Nov-06 3917200000
Oct-06 3871000000
Sep-06 3551600000
Aug-06 3510000000
Jul-06 3195600000
Jun-06 2896300000
May-06 2856700000
Apr-06 2557000000
Mar-06 2476177000
Feb-06 2312907000
Jan-06 2170049000
Dec-05 2013600000
Nov-05 1869076000
Oct-05 1746960000
Sep-05 1627864000
Aug-05 1526784000
Jul-05 1407976000
Jun-05 1300334000
May-05 1209984000
Apr-05 1002925000
Mar-05 924630000
Feb-05 872320000
Jan-05 838272000
Dec-04 861724000
Nov-04 806195000
Oct-04 743904000
Sep-04 689924000
Aug-04 644502000
Jul-04 595665000
Jun-04 552900000
May-04 511038000
Apr-04 476750000
Mar-04 440286000
Feb-04 403010000
Jan-04 375536000
Dec-03 350336000
Nov-03 329219000
Oct-03 310616000
Sep-03 294689000
Aug-03 278630000
Jul-03 261555000
Jun-03 244454000
May-03 230328000
Apr-03 217200000
Mar-03 204630000
Feb-03 193475000
Jan-03 182936000
Dec-02 171010000
Nov-02 162150000
Oct-02 150480000
Sep-02 80733000
Aug-02 66990000
Jul-02 59755000
Jun-02 55420000
May-02 49259000
Apr-02 47790000
Mar-02 44968000
Feb-02 39350000
Jan-02 30582000
Dec-01 26832000
Nov-01 21994000
Oct-01 17244000
Sep-01 10982000
Aug-01 7100000
Jul-01 4186000
Jun-01 3240000
May-01 2373600
Apr-01 1295800
Mar-01 596904
Feb-01 186636
Jan-01 33800
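
The two-trend claim can be probed with a simple comparison: fit one straight line to the series, then fit two separate lines on either side of each candidate breakpoint and keep the split with the smallest total squared error. The sketch below is an assumption-laden illustration, not the method used to produce the graph: it uses only the January values from the table above, and the function names and the minimum segment length are invented for the example. It compares against a single straight line only, not against other growth models.

```python
# January values (estimated bytes) taken from the table above, 2001 to 2010.
JANUARY_BYTES = [
    (2001, 33800),
    (2002, 30582000),
    (2003, 182936000),
    (2004, 375536000),
    (2005, 838272000),
    (2006, 2170049000),
    (2007, 4320000000),
    (2008, 6738900000),
    (2009, 9250200000),
    (2010, 11330500000),
]

def line_sse(points):
    """Sum of squared residuals of the least-squares line through points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return sum((y - (intercept + slope * x)) ** 2 for x, y in points)

def best_two_segment_fit(points, min_points=3):
    """Try every split; return (total SSE, first year of the second segment)."""
    best = None
    for i in range(min_points, len(points) - min_points + 1):
        sse = line_sse(points[:i]) + line_sse(points[i:])
        if best is None or sse < best[0]:
            best = (sse, points[i][0])
    return best

single_sse = line_sse(JANUARY_BYTES)
two_sse, break_year = best_two_segment_fit(JANUARY_BYTES)
print(f"one line:  SSE = {single_sse:.3e}")
print(f"two lines: SSE = {two_sse:.3e}, second segment starts {break_year}")
```

Two independently fitted lines can never fit worse than one, so a fair comparison would also penalize the extra parameters, for example with an information criterion.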


Re: [Wikimedia-l] article bytes more meaningful than users or revisions (was Re: Updates on VE data analysis)

Mark
On 7/27/13 10:29 AM, Denny Vrandečić wrote:
> I still would worry, though: our content is increasing linearly, as you
> say, but the number of active contributors is not. If we take for granted
> that active contributors are the ones who provide quality control for the
> articles, this means that since 2006 or so the ratio of content per
> contributor is linearly declining, which would mean that our quality would
> suffer.
>

One useful bit of information is what *kind* of editors there are, not
just the raw numbers.

For example, here is a hypothetical situation, which I think James and
John are contemplating, which would result in a numerical decline in
editors-per-article with no real change in actual editorial attention to
the article:

* Article in 2007, with 19 editors: Initial content written by 1 person,
moderate expansions from 3 people, copyediting from 5 people,
vandalism-rollback from 10 people

* Similar article in 2013, with 12 editors: Initial content written by 1
person, moderate expansions from 3 people, copyediting from 3 people and
1 typo-fixing bot, vandalism-rollback from 2 people and 2 anti-vandal bots

Basically all that happened in this hypothetical is that two of the
typo-fixers were replaced by a typo-fixing bot, and 8 rollbacks that
would've once been done by recent-changes patrollers were instead done
by a smaller number of anti-vandal bots. Maybe that's not what the
change looks like, but I don't think the raw edit-count data can tell us
either way.

I think this is also a potential issue with the definition of active
users, which is 5 edits/month for "active" and 100
edits/month for "very active". The latter in particular much more
heavily favors people who make many smaller edits versus fewer large
edits. And are there editors contributing substantial amounts of content
to Wikipedia who don't even hit the lower threshold? One possible group
is people whose main contribution is to write new articles, and who do
little to no other editing. Some people write offline and then
contribute a new, well-referenced article in a single edit. If that's
their only involvement in Wikipedia, they wouldn't be counted as active
Wikipedians in the numbers, even if they're sending us a steady stream
of 1-2 new articles/month.

I'm not sure how to best answer those questions automatically. Bytes, as
James suggests, could be one possible proxy, but in addition to total
bytes, we could look at the editor level. Has there been a decline in
"active editors" if we define active editing as changing more than N
bytes in the encyclopedia in a month, not counting rollbacks? That would
count people who wrote substantial new articles as active, even if they
did it in only 1 or 2 edits/month (although on the other hand, it
wouldn't count people who made 100 rollbacks and no other edits).
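
The two definitions can be compared side by side on the same edit log. The following is a minimal sketch with a made-up record type: the Edit fields, the function names, and the 5000-byte value used for N in the example are all assumptions for illustration, not an existing Wikimedia schema or threshold.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class Edit:
    editor: str
    month: str          # e.g. "2013-07"
    bytes_changed: int  # size of the change in bytes (hypothetical field)
    is_rollback: bool

def active_by_edit_count(edits, min_edits=5):
    """(month, editor) pairs with at least min_edits edits in that month."""
    counts = defaultdict(int)
    for e in edits:
        counts[(e.month, e.editor)] += 1
    return {key for key, n in counts.items() if n >= min_edits}

def active_by_bytes(edits, min_bytes):
    """(month, editor) pairs changing more than min_bytes, rollbacks excluded."""
    totals = defaultdict(int)
    for e in edits:
        if not e.is_rollback:
            totals[(e.month, e.editor)] += abs(e.bytes_changed)
    return {key for key, total in totals.items() if total > min_bytes}

# One offline writer posting a full article, one patroller doing 100 rollbacks.
edits = [Edit("writer", "2013-07", 12000, False)]
edits += [Edit("patroller", "2013-07", 40, True) for _ in range(100)]

print(active_by_edit_count(edits))      # only the patroller qualifies
print(active_by_bytes(edits, 5000))     # only the writer qualifies
```

Running both counts over the same months would show whether the decline in "active editors" survives the change of definition.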

Another possibility could be to sample a subset of either articles, or
of editors, and manually annotate what kind of editing is going on. More
tedious and would of necessity be on a small subset of the encyclopedia,
but might avoid papering over things that are obvious when you look at
them but tend to get lost in big-data analyses.

-Mark
