List of all authors via API

Johannes Beigel
We're heavily using the MediaWiki API in our open source project mwlib
(http://code.pediapress.com/), so first of all: thanks to you all for
implementing this functionality in MediaWiki!

Maybe you're following the discussion initiated by Erik Möller on
Foundation-l about appropriate attribution. As no consensus has been
reached yet, we plan to include all authors (minus minor edits, minus
bots) after each article in documents (PDFs, ODFs) rendered from
article collections.

Currently we're using an API query with prop=revisions, requesting
rvprop=user|ids|flags. Afterwards we filter out minor edits,
anonymous/IP edits and bot edits (via regular expressions on username
and comment) and combine edits by the same author. Retrieving the
data for all revisions of heavily edited articles (e.g.
[[en:Physics]]) requires lots of API requests with rvlimit=500.
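
The client-side filtering described above could be sketched roughly as follows. This is not mwlib's actual code: the bot pattern is a made-up placeholder, "comment" is added to rvprop on the assumption that the comment is needed for the bot check, and the sketch assumes that the "minor" and "anon" flags show up as keys on each revision object in the JSON result.

```python
import re

# Hypothetical pattern: the post only says that bots are detected with a
# regular expression on username and comment, not which expression is used.
BOT_PATTERN = re.compile(r"bot\b", re.IGNORECASE)


def build_params(title, limit=500):
    """Query parameters for one batch of revisions of `title`."""
    return {
        "action": "query",
        "format": "json",
        "prop": "revisions",
        "titles": title,
        # "comment" is an addition so the bot check can look at it.
        "rvprop": "user|ids|flags|comment",
        "rvlimit": str(limit),
    }


def flag_set(rev, name):
    """True if a flag is set; accepts an empty-string marker or a boolean."""
    return rev.get(name, False) is not False


def unique_authors(revisions):
    """Usernames in first-seen order, skipping minor, anonymous and bot edits."""
    seen = {}
    for rev in revisions:
        user = rev.get("user", "")
        comment = rev.get("comment", "")
        if not user or flag_set(rev, "minor") or flag_set(rev, "anon"):
            continue
        if BOT_PATTERN.search(user) or BOT_PATTERN.search(comment):
            continue
        seen.setdefault(user, None)
    return list(seen)


# Invented sample data shaped like one batch of revisions.
SAMPLE = [
    {"revid": 5, "user": "Alice", "comment": "expand intro"},
    {"revid": 4, "user": "192.0.2.7", "anon": "", "comment": ""},
    {"revid": 3, "user": "SineBot", "comment": "signing"},
    {"revid": 2, "user": "Alice", "minor": "", "comment": "typo"},
    {"revid": 1, "user": "Bob", "comment": "create page"},
]
print(unique_authors(SAMPLE))
```

Every batch still has to be fetched and walked on the client, which is the overhead the question below is about.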

Is there a way (or a plan to implement one) to retrieve the list of  
unique contributors for a given article (from a given revision down to  
the first one)? Ideally this would accept parameters for the mentioned  
filtering. I guess inside of MediaWiki code this can be handled very  
efficiently (using appropriate database queries) and would eliminate  
the need to transfer lots of redundant data over the socket.

-- Johannes Beigel


_______________________________________________
Mediawiki-api mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/mediawiki-api

Re: List of all authors via API

Brion Vibber

Johannes Beigel wrote:
> Is there a way (or a plan to implement one) to retrieve the list of  
> unique contributors for a given article (from a given revision down to  
> the first one)? Ideally this would accept parameters for the mentioned  
> filtering. I guess inside of MediaWiki code this can be handled very  
> efficiently (using appropriate database queries) and would eliminate  
> the need to transfer lots of redundant data over the socket.

Given that this could require filtering through hundreds of thousands of
unique revisions for a single request, I don't think we currently have a
good plan for that. :)

Doing it sensibly would require adding infrastructure for storing a
"major authors list" for later retrieval with minimal incremental
processing, and this has not yet been done.

-- brion


Re: List of all authors via API

Magnus Manske
On Fri, Oct 24, 2008 at 5:59 PM, Brion Vibber <[hidden email]> wrote:

> Johannes Beigel wrote:
>> Is there a way (or a plan to implement one) to retrieve the list of
>> unique contributors for a given article (from a given revision down to
>> the first one)? Ideally this would accept parameters for the mentioned
>> filtering. I guess inside of MediaWiki code this can be handled very
>> efficiently (using appropriate database queries) and would eliminate
>> the need to transfer lots of redundant data over the socket.
>
> Given that this could require filtering through hundreds of thousands of
> unique revisions for a single request, I don't think we currently have a
> good plan for that. :)

I just ran a DISTINCT MySQL query for all non-IP editors of
[[en:George W. Bush]] on the toolserver, and that took 3 seconds.
There are 41790 revisions.

Considering that this would be a worst-case article, and that it ran
on the overtaxed toolserver, it does seem possible. Maybe if we had
one MySQL slave / Apache dedicated to this task?
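
For illustration only, a DISTINCT query of the kind mentioned above might look like this. The exact query that was run is not given in this thread; the sketch uses an in-memory SQLite table instead of MySQL, the column names (rev_page, rev_user, rev_user_text) are assumed to mirror MediaWiki's revision table of that era, and rev_user = 0 is assumed to mark IP edits.

```python
import sqlite3


def make_demo_db():
    """Build a tiny in-memory stand-in for the revision table (invented rows)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE revision ("
        " rev_id INTEGER PRIMARY KEY,"
        " rev_page INTEGER,"
        " rev_user INTEGER,"      # assumed: 0 for anonymous/IP edits
        " rev_user_text TEXT)"
    )
    conn.executemany(
        "INSERT INTO revision VALUES (?, ?, ?, ?)",
        [
            (1, 10, 7, "Alice"),
            (2, 10, 0, "192.0.2.7"),
            (3, 10, 8, "Bob"),
            (4, 10, 7, "Alice"),
            (5, 11, 9, "Carol"),
        ],
    )
    return conn


def distinct_registered_editors(conn, page_id):
    """Unique non-IP editors of one page, sorted by name."""
    rows = conn.execute(
        "SELECT DISTINCT rev_user_text FROM revision"
        " WHERE rev_page = ? AND rev_user != 0"
        " ORDER BY rev_user_text",
        (page_id,),
    )
    return [name for (name,) in rows]


print(distinct_registered_editors(make_demo_db(), 10))
```

A toy table says nothing about timing on 41790 revisions; it only shows the shape of the query.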

Made-up URL: http://authors.wikimedia.org/en.wikipedia/George_W._Bush

Magnus


Re: List of all authors via API

Brion Vibber
Magnus Manske wrote:

> On Fri, Oct 24, 2008 at 5:59 PM, Brion Vibber <[hidden email]> wrote:
>> Johannes Beigel wrote:
>>> Is there a way (or a plan to implement one) to retrieve the list of
>>> unique contributors for a given article (from a given revision down to
>>> the first one)? Ideally this would accept parameters for the mentioned
>>> filtering. I guess inside of MediaWiki code this can be handled very
>>> efficiently (using appropriate database queries) and would eliminate
>>> the need to transfer lots of redundant data over the socket.
>> Given that this could require filtering through hundreds of thousands of
>> unique revisions for a single request, I don't think we currently have a
>> good plan for that. :)
>
> I just ran a DISTINCT mysql query for all non-IP editors of
> [[en:George W. Bush]] on the toolserver, and that took 3 seconds.
> There are 41790 revisions.

Indeed, it's not as bad as I was afraid. I'm still a little leery that
the EXPLAIN lists "Using temporary" though. :P

> Considering that this would be a worst case article, and that it ran
> on the overtaxed toolserver, it does seem possible. Maybe if we'd have
> one MySQL slave / Apache dedicated for this task?

Probably fine to pull from the same slaves already dedicated for
contributions queries (relevant indexes are already pulled into memory).

Figuring out how to get something other than a raw list of thousands of
editors for a "nice" author list remains a harder task. :)

-- brion


Re: List of all authors via API

Guess Who?
--- On Wed, 10/29/08, Brion Vibber <[hidden email]> wrote:

> From: Brion Vibber <[hidden email]>
> Subject: Re: [Mediawiki-api] List of all authors via API
> To: "MediaWiki API announcements & discussion" <[hidden email]>
> Date: Wednesday, October 29, 2008, 10:24 AM
>
> Magnus Manske wrote:
> > On Fri, Oct 24, 2008 at 5:59 PM, Brion Vibber <[hidden email]> wrote:
> >> Johannes Beigel wrote:
> >>> Is there a way (or a plan to implement one) to retrieve the list of
> >>> unique contributors for a given article (from a given revision down to
> >>> the first one)? Ideally this would accept parameters for the mentioned
> >>> filtering. I guess inside of MediaWiki code this can be handled very
> >>> efficiently (using appropriate database queries) and would eliminate
> >>> the need to transfer lots of redundant data over the socket.
> >> Given that this could require filtering through hundreds of thousands of
> >> unique revisions for a single request, I don't think we currently have a
> >> good plan for that. :)
> >
> > I just ran a DISTINCT mysql query for all non-IP editors of
> > [[en:George W. Bush]] on the toolserver, and that took 3 seconds.
> > There are 41790 revisions.
>
> Indeed, it's not as bad as I was afraid. I'm still a little leery that
> the EXPLAIN lists "Using temporary" though. :P
>
> > Considering that this would be a worst case article, and that it ran
> > on the overtaxed toolserver, it does seem possible. Maybe if we'd have
> > one MySQL slave / Apache dedicated for this task?
>
> Probably fine to pull from the same slaves already dedicated for
> contributions queries (relevant indexes are already pulled into memory).
>
> Figuring out how to get something other than a raw list of thousands of
> editors for a "nice" author list remains a harder task. :)

Wouldn't that be a snap using GROUP BY? Sorry, I don't know the database structure, but generically:

  SELECT contributors, COUNT(*) FROM database GROUP BY contributors

would return a list of all contributors and the number of contributions they've made; it could be tweaked to return only those contributors who've made over X contributions. Of course, I've only worked on small databases, so I have no idea what the overhead on this would be...
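
A runnable version of that generic idea, including the "over X contributions" tweak via HAVING, could look like the sketch below. It runs against an in-memory SQLite table with invented rows; the table and column names (revision, rev_page, rev_user_text) are assumptions modelled on MediaWiki's revision table, not something taken from this thread.

```python
import sqlite3


def make_demo_db():
    """In-memory stand-in for the revision table, filled with invented rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE revision ("
        " rev_id INTEGER PRIMARY KEY, rev_page INTEGER, rev_user_text TEXT)"
    )
    conn.executemany(
        "INSERT INTO revision VALUES (?, ?, ?)",
        [
            (1, 10, "Alice"),
            (2, 10, "Bob"),
            (3, 10, "Alice"),
            (4, 10, "Carol"),
            (5, 10, "Alice"),
            (6, 10, "Carol"),
            (7, 11, "Bob"),
        ],
    )
    return conn


def edit_counts(conn, page_id, more_than=0):
    """(contributor, edit count) pairs for contributors with more than `more_than` edits."""
    rows = conn.execute(
        "SELECT rev_user_text, COUNT(*) AS edits FROM revision"
        " WHERE rev_page = ?"
        " GROUP BY rev_user_text"
        " HAVING COUNT(*) > ?"
        " ORDER BY edits DESC, rev_user_text",
        (page_id, more_than),
    )
    return rows.fetchall()


print(edit_counts(make_demo_db(), 10, more_than=1))
```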


     


Re: List of all authors via API

Brion Vibber

Guess Who? wrote:

> --- On Wed, 10/29/08, Brion Vibber wrote:
>> Figuring out how to get something other than a raw list of
>> thousands of
>> editors for a "nice" author list remains a harder
>> task. :)
>
> wouldn't that be a snap using the group_by function?  sorry, I don't
> know the database structure, but generically:
>
> SELECT contributors, COUNT(*) FROM database GROUP BY contributors
>
> would return a list of all contributors and the number of
> contributions they've made; it could be tweaked to return only those
> contributors who've made over X contributions.  of course, I've only
> worked on small databases, so I have no idea what the overhead on
> this would be...

Making a lot of edits to the page doesn't necessarily mean you've
contributed a lot of text to it. It might mean you've vandalized it a
lot (always reverted) or that you've reverted a lot of vandalism.

-- brion


Re: List of all authors via API

Platonides
Brion Vibber wrote:

> Guess Who? wrote:
>> it could be tweaked to return only those
>> contributors who've made over X contributions.  of course, I've only
>> worked on small databases, so I have no idea what the overhead on
>> this would be...
>
> Making a lot of edits to the page doesn't necessarily mean you've
> contributed a lot of text to it. It might mean you've vandalized it a
> lot (always reverted) or that you've reverted a lot of vandalism.
>
> - -- brion

Or simply that you don't know how to use the preview button.


Re: List of all authors via API

Guess Who?
--- On Fri, 10/31/08, Platonides <[hidden email]> wrote:

> From: Platonides <[hidden email]>
> Subject: Re: [Mediawiki-api] List of all authors via API
> To: "MediaWiki API announcements & discussion" <[hidden email]>
> Date: Friday, October 31, 2008, 3:19 PM
> Brion Vibber wrote:
> > Guess Who? wrote:
> >> it could be tweaked to return only those
> >> contributors who've made over X contributions.  of course, I've only
> >> worked on small databases, so I have no idea what the overhead on
> >> this would be...
> >
> > Making a lot of edits to the page doesn't necessarily mean you've
> > contributed a lot of text to it. It might mean you've vandalized it a
> > lot (always reverted) or that you've reverted a lot of vandalism.
> >
> > - -- brion
>
> Or simply that you don't know how to use the preview button.

lol - true, but it's the quick and dirty metric. If you want something more sophisticated you could play with weighted means on the positive character counts (or maybe weighted geometric means; weighted arithmetic means might be easier to do in SQL, though), which would play up people who make moderate, consistent additions. But really, anything you do is going to have false positives and false negatives, and would need to be vetted for obvious mistakes.
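
To make the difference between the two kinds of mean concrete, here is a small sketch of one possible reading of that idea. It is unweighted for simplicity, since no weights are specified here, and it assumes per-edit size deltas (characters added or removed) have already been computed somehow; the names and numbers are invented.

```python
import math
from collections import defaultdict


def addition_scores(edits):
    """Map each user to (arithmetic mean, geometric mean) of their positive deltas.

    `edits` is an iterable of (user, size_delta) pairs; removals and
    zero-size edits are ignored.
    """
    additions = defaultdict(list)
    for user, delta in edits:
        if delta > 0:
            additions[user].append(delta)

    scores = {}
    for user, values in additions.items():
        arithmetic = sum(values) / len(values)
        # Geometric mean via logarithms to avoid huge intermediate products.
        geometric = math.exp(sum(math.log(v) for v in values) / len(values))
        scores[user] = (arithmetic, geometric)
    return scores


# Invented data: both users add 200 characters in total over two edits.
EDITS = [
    ("Alice", 100), ("Alice", 100),            # moderate, consistent
    ("Bob", 1), ("Bob", 199), ("Bob", -50),    # one tiny edit, one big one
]
for user, (arith, geo) in sorted(addition_scores(EDITS).items()):
    print(f"{user}: arithmetic={arith:.1f} geometric={geo:.1f}")
```

In this toy data the arithmetic means are identical, while the geometric mean is much lower for the uneven contributor, which is the "moderate, consistent additions" effect. It still cannot tell restored text from new text, so the caveat about false positives and negatives stands.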


     
