Wikipedia Users database

Wikipedia Users database

Rami Al-Rfou'
Hi,

I am planning to study differences in users' editing styles and their spelling errors in English Wikipedia as part of a research project I am involved in.

So I downloaded some of the Wikipedia partial XML dumps and converted them to SQL. My understanding is that Wikipedia stores every version of each page in the database.

  • I cannot see the users table! Is the users table stored in a special partial dump?
  • Does the user table contain any properties related to the user's country, preferred Wikipedias, or their skill in different languages?
  • I am interested in user modifications that only add text to articles, as opposed to modifying or deleting it. I am now planning to diff consecutive revisions to get such data. Are you aware of any tool or effort that can help?
  • Are you aware of any tools that extract plain text from Wikipedia markup?
Regards.

--
Rami Al-Rfou'
PhD student at Stony Brook University

_______________________________________________
Wiki-research-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l

Re: Wikipedia Users database

Torsten Zesch

Hi,

You might want to have a look at the JWPL Revision Toolkit:

http://code.google.com/p/jwpl/

It should provide most of the information you are looking for, especially access to all the modifications and a parser to extract the plain text from Wikipedia.

The UIMA toolkit

http://code.google.com/p/dkpro-core-asl/

also contains a component that gets you all pairs of adjacent revisions, which makes it quite easy to spot the ones that are additions only.
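
As a rough illustration of what spotting "additions only" could look like once pairs of adjacent revisions are in hand, here is a minimal Python sketch. It does not use JWPL or DKPro; it relies only on the standard-library difflib module, compares revisions word by word, and the function names classify_edit and added_words are hypothetical.

```python
import difflib


def _opcodes(old_text, new_text):
    """Word-level diff between two revision texts."""
    old_words = old_text.split()
    new_words = new_text.split()
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    return new_words, matcher.get_opcodes()


def classify_edit(old_text, new_text):
    """Label a revision pair as 'unchanged', 'addition', 'deletion' or 'modification'."""
    _, opcodes = _opcodes(old_text, new_text)
    # Keep only the kinds of change that actually occurred.
    tags = {tag for tag, *_ in opcodes if tag != "equal"}
    if not tags:
        return "unchanged"
    if tags == {"insert"}:
        return "addition"      # nothing was removed or replaced
    if tags == {"delete"}:
        return "deletion"
    return "modification"      # replacements, or a mix of inserts and deletes


def added_words(old_text, new_text):
    """Return the words that were purely inserted by the newer revision."""
    new_words, opcodes = _opcodes(old_text, new_text)
    words = []
    for tag, _i1, _i2, j1, j2 in opcodes:
        if tag == "insert":
            words.extend(new_words[j1:j2])
    return words
```

For example, classify_edit("the cat sat", "the black cat sat") yields "addition", while changing "cat" to "dog" yields "modification". Splitting on whitespace is a simplification; real wikitext would probably need markup-aware tokenization first.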

 

-Torsten

From: [hidden email] [mailto:[hidden email]] On Behalf Of Rami Al-Rfou'
Sent: Tuesday, October 18, 2011 9:29 PM
To: Research into Wikimedia content and communities
Cc: Yanqing Chen
Subject: [Wiki-research-l] Wikipedia Users database

 


Re: Wikipedia Users database

Rami Al-Rfou'
Hi All,

After more investigation I discovered that I can get a list of users by their skill in a specific language. For example: http://en.wikipedia.org/w/index.php?title=Category:User_zh-N

It seems that such a list is populated from a database. Does anyone know where I can find that database?

My other questions are about the partial dumps of Wikipedia. Are the dumps sorted by any field? How can I get all the user pages? Are they stored in a specific dump? Or are the dumps organized by page title or category?

Regards.

--
Rami Al-Rfou'



Re: Wikipedia Users database

Kim Bruning
On Mon, Oct 24, 2011 at 04:11:22PM -0400, Rami Al-Rfou' wrote:
> Hi All,
>
> So with more investigation I discovered that I can get a list of users
> by their skill in a specific language. For example:
> http://en.wikipedia.org/w/index.php?title=Category:User_zh-N
>
> It seems that such a list is populated from a database. Does anyone know
> where I can find that database?

Everything is eventually filled from a database of course :-P , but
specifically, this is a category.
http://www.mediawiki.org/wiki/Help:Categories
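
For anyone who wants the members of such a category without touching the dumps, the MediaWiki web API offers a list=categorymembers query. The following is a minimal Python sketch under stated assumptions: the endpoint is the public English Wikipedia api.php, the helper names are hypothetical, and no request is actually sent here. The second function only unpacks a response of the shape that query returns.

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://en.wikipedia.org/w/api.php"  # assumed endpoint
USER_NAMESPACE = 2  # MediaWiki namespace number for "User:" pages


def category_members_url(category, limit=500, continue_token=None):
    """Build a categorymembers query URL restricted to user pages."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": f"Category:{category}",
        "cmnamespace": USER_NAMESPACE,
        "cmlimit": limit,
        "format": "json",
    }
    if continue_token:
        # Token taken from a previous response to fetch the next batch.
        params["cmcontinue"] = continue_token
    return f"{API_ENDPOINT}?{urlencode(params)}"


def usernames_from_response(data):
    """Extract user names from a decoded categorymembers JSON response."""
    names = set()
    for member in data.get("query", {}).get("categorymembers", []):
        _, _, rest = member["title"].partition(":")
        if rest:
            # "User:Bob/Babel" is a subpage that belongs to user "Bob".
            names.add(rest.split("/", 1)[0])
    return sorted(names)
```

Calling category_members_url("User_zh-N") gives a URL that can be fetched with any HTTP client; large categories need repeated requests using the continuation token from each response.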

sincerely,
        Kim Bruning



Re: Wikipedia Users database

Laura Hale


On Tue, Oct 25, 2011 at 7:11 AM, Rami Al-Rfou' <[hidden email]> wrote:
> Hi All,
>
> After more investigation I discovered that I can get a list of users by their skill in a specific language. For example: http://en.wikipedia.org/w/index.php?title=Category:User_zh-N
>
> It seems that such a list is populated from a database. Does anyone know where I can find that database?
>
> My other questions are about the partial dumps of Wikipedia. Are the dumps sorted by any field? How can I get all the user pages? Are they stored in a specific dump? Or are the dumps organized by page title or category?



http://csv.ozziesport.com/October%209%20-%20Wikipedia%20English%20Data.csv is a file I have related to that. It is about a year old and the result of manual data mining: I looked for userboxes and recorded which users had transcluded them into their user space. My file covers only English Wikipedia and doesn't include every userbox around, but it might be a good place to start. I don't think that userbox information is stored in a separate user table, so I doubt that you would be able to get access to it through that route. :/


--
twitter: purplepopple
blog: ozziesport.com



Re: Wikipedia Users database

Jon Davis-5
  • That example you posted isn't a list of all users, just the ones who have added a "Babel" template to their user pages [1].
  • That data is stored in the database, in the category and categorylinks tables (possibly elsewhere, I can't remember offhand).
  • I don't think the dumps are sorted by anything more than the current row order in the database (so in order of creation).
  • The user pages will be included in the "All pages" dumps (as opposed to the "Articles, templates, image descriptions, and primary meta-pages" dumps).
As for your original set of questions:
  • IIRC, no user data is included in any dumps. This is to protect user privacy.
  • No on all counts; the only related thing is the interface language. If you click "My Preferences" on any wiki, the options you see there are what is stored in the user table (more or less).
  • All edits are "modifications" technically. You'd have to programmatically figure out which ones are _just_ adding content.
  • Yes, that "tool" would be called MediaWiki, if you want the most accurate parser of MediaWiki markup [2]. There are some alternative parsers [3], but their output can be of variable quality.
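
To make the caveat about alternative parsers concrete, here is a deliberately crude Python sketch of text extraction with regular expressions. It is not MediaWiki's parser and not any of the listed alternatives; the function name strip_wikitext is hypothetical, and it only handles comments, <ref> footnotes, templates, simple links and bold/italic quotes. Tables, files, HTML tags and many other constructs are left untouched, which is the kind of gap that makes output quality vary.

```python
import re


def strip_wikitext(text):
    """Very rough wikitext-to-plain-text conversion (illustrative only)."""
    # Drop HTML comments and <ref> footnotes (self-closing and paired).
    text = re.sub(r"<!--.*?-->", "", text, flags=re.DOTALL)
    text = re.sub(r"<ref[^>]*?/>", "", text)
    text = re.sub(r"<ref[^>]*>.*?</ref>", "", text, flags=re.DOTALL)
    # Remove templates from the inside out so nested ones disappear too.
    previous = None
    while previous != text:
        previous = text
        text = re.sub(r"\{\{[^{}]*\}\}", "", text)
    # [[target|label]] becomes "label"; [[target]] becomes "target".
    text = re.sub(r"\[\[(?:[^\[\]|]*\|)?([^\[\]|]*)\]\]", r"\1", text)
    # Strip bold/italic quote markup.
    text = re.sub(r"'{2,}", "", text)
    # Collapse runs of spaces and tabs.
    return re.sub(r"[ \t]+", " ", text).strip()
```

Given "'''Stony Brook''' is a [[hamlet (place)|hamlet]] in [[New York]].{{citation needed}}", this returns "Stony Brook is a hamlet in New York."; anything more complex is better handled by MediaWiki itself or a dedicated parser.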

-Jon
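[1] http://meta.wikimedia.org/wiki/Meta:Babel_templates
[2] http://www.mediawiki.org/wiki/Markup_spec
[3] http://www.mediawiki.org/wiki/Alternative_parsers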

--
Jon
[[User:ShakataGaNai]] / KJ6FNQ
http://snowulf.com/
http://ipv6wiki.net/


Re: Wikipedia Users database

R.Stuart Geiger
FYI, a lot of the database tables are archived on
http://dumps.wikimedia.org -- see
http://dumps.wikimedia.org/enwiki/20111007/ for the latest dump. The
user table is private, but it doesn't seem like you need it. You're
looking for what people have publicly posted on their own user pages,
which MediaWiki treats as pages in a specific namespace, barely
connected to a user at all from a database standpoint.

So if you're looking for category members (like the Babel category you
linked), they can be found in enwiki-DATE-categorylinks.sql.gz. Import
that into MySQL -- it is about 10 GB uncompressed, with the indexes
making up another 25 GB. However, that will just give you the page_ids
of the user pages containing the template. You also have to download
and import the page table (also public and archived) and join to it in
MySQL if you want to get the usernames of everyone who has put
themselves in those categories. The page table is much more manageable --
uncompressed, it is about 3 GB of data and 2.5 GB of indexes.
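
The join described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the exact procedure: it uses Python's built-in sqlite3 as a stand-in for MySQL, a handful of made-up rows instead of the imported dumps, and only the columns needed for the join (cl_from and cl_to in categorylinks; page_id, page_namespace and page_title in page, as in the MediaWiki schema of that period). The function name users_in_category is hypothetical.

```python
import sqlite3

USER_NAMESPACE = 2  # MediaWiki namespace number for "User:" pages


def users_in_category(conn, category):
    """Return user names whose user pages (or subpages) are in the category."""
    rows = conn.execute(
        """
        SELECT DISTINCT p.page_title
        FROM categorylinks AS cl
        JOIN page AS p ON p.page_id = cl.cl_from
        WHERE cl.cl_to = ? AND p.page_namespace = ?
        ORDER BY p.page_title
        """,
        (category, USER_NAMESPACE),
    ).fetchall()
    # A subpage such as "Bob/Babel" belongs to user "Bob".
    return sorted({title.split("/", 1)[0] for (title,) in rows})


# Tiny in-memory stand-in for the imported tables (rows are invented).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE page (
        page_id INTEGER PRIMARY KEY,
        page_namespace INTEGER,
        page_title TEXT
    );
    CREATE TABLE categorylinks (cl_from INTEGER, cl_to TEXT);
    INSERT INTO page VALUES
        (1, 2, 'Alice'), (2, 2, 'Bob/Babel'), (3, 0, 'Chinese_language');
    INSERT INTO categorylinks VALUES
        (1, 'User_zh-N'), (2, 'User_zh-N'),
        (2, 'User_en-3'), (3, 'Languages_of_China');
    """
)

print(users_in_category(conn, "User_zh-N"))
```

Against a real MySQL import the same SELECT should work with that driver's placeholder syntax in place of the question marks; filtering on the user namespace keeps article and category pages out of the result.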
Stuart

--
R. Stuart Geiger
UC-Berkeley School of Information
User:Staeiou / @staeiou

On Mon, Oct 24, 2011 at 1:38 PM, Jon Davis <[hidden email]> wrote:

> That example you posted isn't a list of all users, just the ones who have
> added a "Babel" template to their user pages [1].
> That data is stored in the database, in the category and categorylinks
> tables (possibly elsewhere, I can't remember offhand).
> I don't think the dumps are sorted by anything more than the current row
> order in the database (so in order of creation).
> The user pages will be included in the "All pages" dumps (as opposed to the
> "Articles, templates, image descriptions, and primary meta-pages" dumps).
>
> As for your original set of questions:
>
> IIRC, no user data is included in any dumps. This is to protect user privacy.
> No on all counts; the only related thing is the interface language. If you
> click "My Preferences" on any wiki, the options you see there are what is
> stored in the user table (more or less).
> All edits are "modifications" technically. You'd have to programmatically
> figure out which ones are _just_ adding content.
> Yes, that "tool" would be called MediaWiki, if you want the most accurate
> parser of MediaWiki markup [2]. There are some alternative parsers [3], but
> their output can be of variable quality.
>
> -Jon
> [1] http://meta.wikimedia.org/wiki/Meta:Babel_templates
> [2] http://www.mediawiki.org/wiki/Markup_spec
> [3] http://www.mediawiki.org/wiki/Alternative_parsers