How to mount a local copy of the English Wikipedia for researchers?


How to mount a local copy of the English Wikipedia for researchers?

Steve Bennett-8
Hi all,
  I've been tasked with setting up a local copy of the English
Wikipedia for researchers - sort of like another Toolserver. I'm not
having much luck, and I wondered whether anyone has done this recently
and what approach they used. We only really need the current article
text - history and meta pages aren't needed.

Things I have tried:
1) Downloading and mounting the SQL dumps

No good because they don't contain article text

2) Downloading and mounting other SQL "research dumps" (e.g.
ftp://ftp.rediris.es/mirror/WKP_research)

No good because they're years out of date

3) Using WikiXRay on the enwiki-latest-pages-meta-history?.xml-.....xml files

No good because they decompress to an astronomical size. I got about
halfway through decompressing them and was already over 7 TB.

Also, WikiXRay appears to be old and out of date (although,
interestingly, its author Felipe Ortega committed to the Gitorious
repository [1] on Monday for the first time in over a year).

4) Using MWDumper (http://www.mediawiki.org/wiki/Manual:MWDumper)

No good because it's old and out of date: it only supports export
version 0.3, and the current dumps are 0.6

5) Using importDump.php on a latest-pages-articles.xml dump [2]

No good because it just spews out 7.6 GB of output like this:

PHP Warning:  xml_parse(): Unable to call handler in_() in
/usr/share/mediawiki/includes/Import.php on line 437
PHP Warning:  xml_parse(): Unable to call handler out_() in
/usr/share/mediawiki/includes/Import.php on line 437
PHP Warning:  xml_parse(): Unable to call handler in_() in
/usr/share/mediawiki/includes/Import.php on line 437
PHP Warning:  xml_parse(): Unable to call handler in_() in
/usr/share/mediawiki/includes/Import.php on line 437
...
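
For reference, the invocation is essentially the standard one from the
manual - roughly this, with illustrative paths, run from the MediaWiki
root (piping through bzcat just avoids keeping the decompressed XML on
disk):

  # stream the compressed dump straight into the importer so the
  # decompressed XML never has to sit on disk
  bzcat enwiki-latest-pages-articles.xml.bz2 | php maintenance/importDump.php

  # then rebuild the derived tables
  php maintenance/rebuildrecentchanges.php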


So, any suggestions for approaches that might work? Or suggestions for
fixing the errors in step 5?

Steve


[1] http://gitorious.org/wikixray
[2] http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2


Re: How to mount a local copy of the English Wikipedia for researchers?

Lars Aronsson
On 2012-06-12 23:19, Steve Bennett wrote:
>    I've been tasked with setting up a local copy of the English
> Wikipedia for researchers - sort of like another Toolserver. I'm not
> having much luck,

Have your researchers learn Icelandic. Importing the
small Icelandic Wikipedia is fast. They can test their
theories and see if their hypotheses make any sense.
When they've done their research on Icelandic, have
them learn Danish, then Norwegian, Swedish, Dutch,
before going to German and finally English. There's
a fine spiral of language sizes around the North Sea.

It's when they're frustrated waiting for an analysis that takes
15 minutes on Norwegian that they'll find smarter algorithms, which
will let them take on the larger languages.


--
   Lars Aronsson ([hidden email])
   Aronsson Datateknik - http://aronsson.se





Re: How to mount a local copy of the English Wikipedia for researchers?

Jona Christopher Sahnwaldt
In reply to this post by Steve Bennett-8
mwdumper seems to work for recent dumps:
http://lists.wikimedia.org/pipermail/mediawiki-l/2012-May/039347.html
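
Roughly like this (untested on a full enwiki dump here; the database
name and credentials are placeholders, and the target database needs
the MediaWiki schema in place with empty page, revision and text
tables):

  # convert the dump to SQL inserts and pipe them straight into MySQL
  java -jar mwdumper.jar --format=sql:1.5 enwiki-latest-pages-articles.xml.bz2 \
    | mysql -u wikiuser -p wikidb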


Re: How to mount a local copy of the English Wikipedia for researchers?

Adam Wight
I ran into this problem recently. A Python script is available at
https://svn.wikimedia.org/viewvc/mediawiki/trunk/extensions/Offline/mwimport.py;
it converts .xml.bz2 dumps into flat fast-import files that can be
loaded into most databases. Sorry, the tool is still alpha quality.
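
Loading the resulting files is then roughly this - the file and table
names below are only placeholders, and assume tab-separated output:

  # one LOAD DATA per generated file; adjust the names to whatever
  # the script actually emits
  mysql --local-infile=1 -u wikiuser -p wikidb \
    -e "LOAD DATA LOCAL INFILE 'page.txt' INTO TABLE page;"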

Feel free to contact me with problems.

-Adam Wight


Re: How to mount a local copy of the English Wikipedia for researchers?

Steve Bennett-8
Thanks, I'm trying this. It consumes phenomenal amounts of memory,
though - I keep getting a "Killed" message from Ubuntu, even with a
20 GB swap file. I'll keep trying with an even bigger one.
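
For what it's worth, the "Killed" message looks like the kernel's OOM
killer rather than anything in the script itself; something like this
should confirm it:

  # look for the OOM killer in the kernel log
  dmesg | grep -iE "out of memory|killed process"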

I'll also give mwdumper another go.

Steve
