GSoC 2018 Introduction: Hagar Shilo


GSoC 2018 Introduction: Hagar Shilo

Hagar Shilo
Hi All,

My name is Hagar Shilo. I'm a web developer and a student at Tel Aviv
University, Israel.

This summer I will be working on a user search menu and user filters for
Wikipedia's "Recent changes" section. Here is the workplan:
https://phabricator.wikimedia.org/T190714

My mentors are Moriel and Roan.

I am looking forward to becoming a Wikimedia developer and an open source
contributor.

Cheers,
Hagar

Re: GSoC 2018 Introduction: Hagar Shilo

Amir E. Aharoni
Welcome / ברוכה הבאה! ("Welcome!" in Hebrew)


Getting a local dump of Wikipedia in HTML

Aidan Hogan
Hi all,

I am wondering what is the fastest/best way to get a local dump of
English Wikipedia in HTML? We are looking just for the current versions
(no edit history) of articles for the purposes of a research project.

We have been exploring bliki [1] to convert the source markup in the
Wikipedia dumps to HTML, but the latest version seems to take several
seconds per article on average (even after the most common templates
have been downloaded and stored locally). This means it would take
several months to convert the dump.

We also considered using Nutch to crawl Wikipedia, but with a reasonable
crawl delay (5 seconds) it would take several months to get a copy of
every article in HTML (or at least the "reachable" ones).

Hence we are a bit stuck right now and not sure how to proceed. Any
help, pointers or advice would be greatly appreciated!!

Best,
Aidan

[1] https://bitbucket.org/axelclk/info.bliki.wiki/wiki/Home


Re: Getting a local dump of Wikipedia in HTML

Fæ

Just in case you have not thought of it, how about taking the XML dump
and converting it to the format you are looking for?

Ref https://en.wikipedia.org/wiki/Wikipedia:Database_download#English-language_Wikipedia
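
For instance, a minimal sketch in Python that streams page titles and
wikitext out of the pages-articles dump with just the standard library
(the file name and the export namespace version are assumptions; check
them against the dump you download):

    import bz2
    import xml.etree.ElementTree as ET

    # MediaWiki export namespace; the exact version can differ per dump.
    NS = "{http://www.mediawiki.org/xml/export-0.10/}"

    # Stream the compressed dump without loading it all into memory.
    with bz2.open("enwiki-latest-pages-articles.xml.bz2", "rb") as f:
        for _, elem in ET.iterparse(f):
            if elem.tag == NS + "page":
                title = elem.findtext(NS + "title")
                wikitext = elem.findtext(NS + "revision/" + NS + "text") or ""
                # ... hand the wikitext to a converter here ...
                elem.clear()  # free memory as we go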

Fae
--
[hidden email] https://commons.wikimedia.org/wiki/User:Fae


Re: GSoC 2018 Introduction: Hagar Shilo

Eran Rosenthal
In reply to this post by Amir E. Aharoni
Good luck / בהצלחה! ("Good luck!" in Hebrew)



Re: Getting a local dump of Wikipedia in HTML

Aidan Hogan
In reply to this post by Fæ
Hi Fae,

On 03-05-2018 16:18, Fæ wrote:
> Just in case you have not thought of it, how about taking the XML dump
> and converting it to the format you are looking for?

Thanks for the pointer! We are currently attempting to do something like
that with bliki. The issue is that we are interested in the
semi-structured HTML elements (lists, tables, etc.), which are often
generated by external templates with complex structures. From the
invocation of a template in an article, we often cannot even tell
whether it will generate a table, a list, a box, etc. E.g., the markup
might say "Weather box", which gets converted to a table.

Although bliki can help us to interpret and expand those templates, each
page takes quite a while to process, meaning months of computation time
to get the semi-structured data we want from the dump. Because of these
templates, we have not had much success yet with this route of taking
the XML dump and converting it to HTML (or even parsing it directly);
hence we're still looking for other options. :)

Cheers,
Aidan


Re: Getting a local dump of Wikipedia in HTML

Neil Patel Quinn
Hey Aidan!

I would suggest checking out RESTBase (
https://www.mediawiki.org/wiki/RESTBase), which offers an API for
retrieving HTML versions of Wikipedia pages. It's maintained by the
Wikimedia Foundation and used by a number of production Wikimedia services,
so you can rely on it.

I don't believe there are any prepared dumps of this HTML, but you should
be able to iterate through the RESTBase API, as long as you follow the
rules (from https://en.wikipedia.org/api/rest_v1/); a sketch of such a
client follows the list:

   - Limit your clients to no more than 200 requests/s to this API. Each
     API endpoint's documentation may detail more specific usage limits.
   - Set a unique User-Agent or Api-User-Agent header that allows us to
     contact you quickly. Email addresses or URLs of contact pages work well.
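
For example, a minimal sketch of such a client in Python (the User-Agent
contact address and the titles are placeholders to replace with your
own):

    import time
    import requests

    # Per the API rules: identify yourself with contact information.
    HEADERS = {"User-Agent": "research-crawler/0.1 (you@example.org)"}

    def fetch_html(title):
        """Fetch the HTML of the current revision of one article."""
        url = ("https://en.wikipedia.org/api/rest_v1/page/html/"
               + requests.utils.quote(title.replace(" ", "_"), safe=""))
        resp = requests.get(url, headers=HEADERS, timeout=30)
        resp.raise_for_status()
        return resp.text

    for title in ["Albert Einstein", "Chile"]:  # your full title list here
        html = fetch_html(title)
        time.sleep(0.01)  # stay well below the 200 requests/s limit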




--
Neil Patel Quinn <https://meta.wikimedia.org/wiki/User:Neil_P._Quinn-WMF>
(he/him/his)
product analyst, Wikimedia Foundation

Re: Getting a local dump of Wikipedia in HTML

Bartosz Dziewoński
In reply to this post by Aidan Hogan
On 2018-05-03 20:54, Aidan Hogan wrote:
> I am wondering what is the fastest/best way to get a local dump of
> English Wikipedia in HTML? We are looking just for the current versions
> (no edit history) of articles for the purposes of a research project.

The Kiwix project provides HTML dumps of Wikipedia for offline reading:
http://www.kiwix.org/downloads/

Their downloads use the ZIM file format; it looks like there are
libraries available for reading it in many programming languages:
http://www.openzim.org/wiki/Readers
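
For instance, a sketch using the python-libzim bindings (the method
names here are from my reading of its docs, and the file name and the
"A/" article namespace are assumptions that depend on the ZIM file):

    from libzim.reader import Archive

    # Open a downloaded Kiwix ZIM file.
    zim = Archive("wikipedia_en_all_nopic.zim")

    # Look up one article by its path inside the archive.
    entry = zim.get_entry_by_path("A/Albert_Einstein")
    html = bytes(entry.get_item().content).decode("utf-8")
    print(entry.title, len(html))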

--
Bartosz Dziewoński


Re: Getting a local dump of Wikipedia in HTML

Neil Patel Quinn
Also, for the curious, the request for dedicated HTML dumps is tracked in
this Phabricator task: https://phabricator.wikimedia.org/T182351


--
Neil Patel Quinn <https://meta.wikimedia.org/wiki/User:Neil_P._Quinn-WMF>
(he/him/his)
product analyst, Wikimedia Foundation

Re: Getting a local dump of Wikipedia in HTML

Kaartic Sivaraam
In reply to this post by Bartosz Dziewoński
On Friday 04 May 2018 03:49 AM, Bartosz Dziewoński wrote:
> The Kiwix project provides HTML dumps of Wikipedia for offline reading:
> http://www.kiwix.org/downloads/

In case you need pure HTML and not the ZIM file format, you could check
out mwoffliner[1], the tool used to generate ZIM files. It dumps HTML
files locally before generating the ZIM file; though HTML is only an
intermediate format for the tool, those files can be kept if you wish.
See [2] for more information about what options the tool accepts.

I'm not sure if it's possible to instruct the tool to stop immediately
after it has dumped the pages, avoiding the creation of the ZIM file
altogether. But you could work around that by watching the verbose
output (turned on through the '--verbose' option) to identify when
dumping has completed and then stopping the tool manually.
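
A rough sketch of that workaround in Python (the mwoffliner flags other
than '--verbose' and the exact log line to watch for are assumptions;
see [2] for the real parameter names):

    import subprocess

    # Hypothetical invocation; check [2] for mwoffliner's actual flags.
    proc = subprocess.Popen(
        ["mwoffliner", "--mwUrl=https://en.wikipedia.org/",
         "--adminEmail=you@example.org", "--verbose"],
        stdout=subprocess.PIPE, text=True)

    for line in proc.stdout:
        print(line, end="")
        # Hypothetical marker: stop once the verbose log reports that
        # HTML dumping has finished, before ZIM creation starts.
        if "Creating ZIM" in line:
            proc.terminate()
            break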

In case of any doubts about using the tool, feel free to reach out.

References:
[1]: https://github.com/openzim/mwoffliner
[2]: https://github.com/openzim/mwoffliner/blob/master/lib/parameterList.js


--
Sivaraam

QUOTE:

“The most valuable person on any team is the person who makes everyone
else on the team more valuable, not the person who knows the most.”

      - Joel Spolsky


Sivaraam?

You might have noticed that my signature recently changed from 'Kaartic'
to 'Sivaraam', both of which are parts of my name. I find the new
signature better for several reasons, one of which is that the former is
ambiguous where I live, as it is a common name (NOTE: it's not a common
spelling, just a common name). So I switched signatures before it was
too late.

That said, I won't mind you calling me 'Kaartic' if you like it [of
course ;-)]. You can always call me using either of the names.


KIND NOTE TO THE NATIVE ENGLISH SPEAKER:

As I'm not a native English speaker myself, there might be mistakes in
my usage of English. I apologise for any mistakes that I make.

It would be "helpful" if you took the time to point out the mistakes.

It would be "super helpful" if you could provide suggestions about how
to correct those mistakes.

Thanks in advance!



Re: Getting a local dump of Wikipedia in HTML

Kaartic Sivaraam
On Tuesday 08 May 2018 05:53 PM, Kaartic Sivaraam wrote:

> In case you need pure HTML and not the ZIM file format, you could check
> out mwoffliner[1], ...
Note that the HTML is (of course) not the same as the one you see when
visiting Wikipedia. For example, the sidebar links and the table of
contents are not present.


--
Sivaraam




Re: Getting a local dump of Wikipedia in HTML

Aidan Hogan
Hi all,

Many thanks for all the pointers! In the end we wrote a small client to
grab documents from RESTBase (https://www.mediawiki.org/wiki/RESTBase)
as suggested by Neil. The HTML looks perfect, and with the generous 200
requests/second limit (which we could not even manage to reach with our
local machine), it only took a couple of days to grab all current
English Wikipedia articles.

@Kaartic, many thanks for the offer of help with extracting HTML from
ZIM! We investigated this option in parallel, converting ZIM to HTML
using Zimreader-Java [1], and indeed it looked promising, but we had
some issues with extracting links. We did not try the mwoffliner tool
you mentioned since we got what we needed through RESTBase in the end.
In any case, we appreciate the offers of help. :)

Best,
Aidan

[1] https://github.com/openzim/zimreader-java
