"Quick" request


Bruno Goncalves-2
Hi,

I was wondering if there is any place where I can find text-only versions of Wikipedia (without markup, etc.) suitable for NLP tasks? I've been able to find a couple of old ones for the English Wikipedia, but I would like to analyze different languages (Mandarin, Arabic, etc.).

Of course, any pointers to software that I can use to convert the usual XML dumps to text would be great as well. 

Best,

Bruno

*******************************************
Bruno Miguel Tavares Gonçalves, PhD
Homepage: www.bgoncalves.com
Email: [hidden email]
*******************************************

Re: "Quick" request

Scott Hale
VisualEditor uses Parsoid to convert wiki markup to HTML, so you could then strip the HTML with a standard library: https://m.mediawiki.org/wiki/Parsoid

There are some alternative parsers listed here, but I have no idea how well any of them perform or scale:
https://m.mediawiki.org/wiki/Alternative_parsers

Would love to hear if anyone has a better answer. Obviously a plain text dump or even an HTML dump could save a good amount of processing.
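
If it helps, here is a rough sketch of the stripping step using only Python's standard library. It is untested and assumes you already have the Parsoid HTML for one page saved to a file ("article.html" is just a placeholder name):

from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the text content of an HTML document, skipping <script> and <style>."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0  # how deep we are inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)

def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    # collapse the runs of whitespace left behind by the removed tags
    return " ".join(" ".join(parser.parts).split())

with open("article.html", encoding="utf-8") as f:
    print(html_to_text(f.read()))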

Cheers,
Scott


--
Dr. Scott Hale
Data Scientist
Oxford Internet Institute
University of Oxford
http://www.scotthale.net/

Re: "Quick" request

Bruno Goncalves-2
Thanks for the suggestions. I'll take a look.

There used to be official HTML dumps at https://dumps.wikimedia.org/other/static_html_dumps/ but they haven't been updated in almost a decade :) HTML or plain-text dumps would be a boon for the NLP world.

Best,

B



*******************************************
Bruno Miguel Tavares Gonçalves, PhD
Homepage: www.bgoncalves.com
Email: [hidden email]
*******************************************

Re: "Quick" request

Federico Leva (Nemo)
Bruno Goncalves, 22/02/2016 22:58:
> There used to be official HTML dumps
> https://dumps.wikimedia.org/other/static_html_dumps/ but they haven't
> been updated in almost a decade :)

The job is effectively done by Kiwix now.
http://download.kiwix.org/zim/wikipedia/
For instance:
   wikipedia_en_all_nopic_2015-05.zim        17-May-2015 10:27   15G

There are several tools to extract the HTML from a ZIM file:
http://www.openzim.org/wiki/Readers
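
For instance, something along these lines should work with the Python bindings for libzim (not one of the readers listed above; I'm going from memory on the API, and the in-ZIM article paths differ between files, so double-check both):

# Rough sketch only: assumes the python-libzim bindings and that the article
# lives at the path "A/Earth" inside this particular ZIM file.
from libzim.reader import Archive

zim = Archive("wikipedia_en_all_nopic_2015-05.zim")
entry = zim.get_entry_by_path("A/Earth")  # path layout varies by ZIM version
html = bytes(entry.get_item().content).decode("utf-8")
print(html[:500])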

Nemo


Re: "Quick" request

Bruno Goncalves-2

> The job is effectively done by Kiwix now. http://download.kiwix.org/zim/wikipedia/
> For instance:
>   wikipedia_en_all_nopic_2015-05.zim        17-May-2015 10:27   15G

Humm... It seems like they are all several months old? 

*******************************************
Bruno Miguel Tavares Gonçalves, PhD
Homepage: www.bgoncalves.com
Email: [hidden email]
*******************************************

Re: "Quick" request

Federico Leva (Nemo)
Bruno Goncalves, 23/02/2016 00:19:
>
>        wikipedia_en_all_nopic_2015-05.zim        17-May-2015 10:27   15G
>
>
> Humm... It seems like they are all several months old?

As you can see, Kelson has recently focused on other things, like the "wp1" releases. ZIM dump production is now orders of magnitude easier than it was years ago with the dumpHTML methods, so if you have a cogent need for a more recent dump you can tell Kelson (cc) and he'll probably be able to help.

Feel free to send patches as well. ;-)
https://sourceforge.net/p/kiwix/other/ci/master/tree/mwoffliner/

Nemo

Re: "Quick" request

Marco Fossati-2
Hi Bruno,

I have been using the WikiExtractor for this task:
https://github.com/attardi/wikiextractor
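
If I remember the flags correctly, a typical run is something like "python WikiExtractor.py -o extracted dump.xml.bz2", and the output files wrap each article in <doc ...> ... </doc> blocks. Here is a rough, untested sketch for pulling the plain text back out of that output (the flag names and the exact output format may differ between versions):

import os
import re

# Matches one article in WikiExtractor's output; DOTALL lets "." span newlines.
DOC_RE = re.compile(r"<doc[^>]*>(.*?)</doc>", re.DOTALL)

def iter_articles(extracted_dir="extracted"):
    """Yield the plain text of every article found under extracted_dir."""
    for root, _dirs, files in os.walk(extracted_dir):
        for name in files:
            with open(os.path.join(root, name), encoding="utf-8") as f:
                for match in DOC_RE.finditer(f.read()):
                    yield match.group(1).strip()

for text in iter_articles():
    print(text[:200])  # first article's first 200 characters
    break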

Hope this helps.
Cheers,

Marco

