Hi All,
My name is Hagar Shilo. I'm a web developer and a student at Tel Aviv University, Israel.

This summer I will be working on a user search menu and user filters for Wikipedia's "Recent changes" section. Here is the workplan: https://phabricator.wikimedia.org/T190714

My mentors are Moriel and Roan.

I am looking forward to becoming a Wikimedia developer and an open source contributor.

Cheers,
Hagar
Welcome / ברוכה הבאה!
On Thu, 3 May 2018 at 19:27, Hagar Shilo <[hidden email]> wrote:
> [...]
Hi all,
I am wondering what is the fastest/best way to get a local dump of English Wikipedia in HTML? We are looking just for the current versions (no edit history) of articles for the purposes of a research project.

We have been exploring using bliki [1] to do the conversion of the source markup in the Wikipedia dumps to HTML, but the latest version seems to take on average several seconds per article (including after the most common templates have been downloaded and stored locally). This means it would take several months to convert the dump.

We also considered using Nutch to crawl Wikipedia, but with a reasonable crawl delay (5 seconds) it would take several months to get a copy of every article in HTML (or at least the "reachable" ones).

Hence we are a bit stuck right now and not sure how to proceed. Any help, pointers or advice would be greatly appreciated!

Best,
Aidan

[1] https://bitbucket.org/axelclk/info.bliki.wiki/wiki/Home
On 3 May 2018 at 19:54, Aidan Hogan <[hidden email]> wrote:
> [...]

Just in case you have not thought of it, how about taking the XML dump and converting it to the format you are looking for?

Ref https://en.wikipedia.org/wiki/Wikipedia:Database_download#English-language_Wikipedia

Fae
--
[hidden email]
https://commons.wikimedia.org/wiki/User:Fae
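For reference, here is a minimal sketch of what streaming pages out of that XML dump can look like in Python (standard library only). The dump filename is a placeholder, and each page's wikitext would still need to go through a wikitext-to-HTML converter such as bliki:

# Sketch: stream (title, wikitext) pairs out of a pages-articles dump without
# loading it into memory. The filename below is a placeholder.
import bz2
import xml.etree.ElementTree as ET

def iter_pages(dump_path):
    """Yield (title, wikitext) for each <page> in a MediaWiki XML export."""
    with bz2.open(dump_path, "rb") as f:
        title, text = None, ""
        for _event, elem in ET.iterparse(f, events=("end",)):
            tag = elem.tag.rsplit("}", 1)[-1]  # drop the export-schema namespace
            if tag == "title":
                title = elem.text
            elif tag == "text":
                text = elem.text or ""
            elif tag == "page":
                yield title, text
                elem.clear()  # free the finished subtree to keep memory flat

if __name__ == "__main__":
    for title, wikitext in iter_pages("enwiki-latest-pages-articles.xml.bz2"):
        # Hand the wikitext to your converter of choice (e.g. bliki) here.
        print(title, len(wikitext))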
In reply to this post by Amir E. Aharoni
Good luck / בהצלחה!
On Thu, May 3, 2018 at 7:39 PM, Amir E. Aharoni <[hidden email]> wrote:
> Welcome / ברוכה הבאה!
> [...]
In reply to this post by Fæ
Hi Fae,
On 03-05-2018 16:18, Fæ wrote:
> Just in case you have not thought of it, how about taking the XML dump
> and converting it to the format you are looking for?
>
> Ref https://en.wikipedia.org/wiki/Wikipedia:Database_download#English-language_Wikipedia

Thanks for the pointer! We are currently attempting to do something like that with bliki. The issue is that we are interested in the semi-structured HTML elements (like lists, tables, etc.), which are often generated through external templates with complex structures. Often, from the invocation of a template in an article, we cannot even tell whether it will generate a table, a list, a box, etc. E.g., it might say "Weather box" in the markup, which gets converted to a table.

Although bliki can help us to interpret and expand those templates, each page takes quite a long time, meaning months of computation time to get the semi-structured data we want from the dump. Due to these templates, we have not had much success yet with this route of taking the XML dump and converting it to HTML (or even parsing it directly); hence we're still looking for other options. :)

Cheers,
Aidan
Hey Aidan!
I would suggest checking out RESTBase (https://www.mediawiki.org/wiki/RESTBase), which offers an API for retrieving HTML versions of Wikipedia pages. It's maintained by the Wikimedia Foundation and used by a number of production Wikimedia services, so you can rely on it.

I don't believe there are any prepared dumps of this HTML, but you should be able to iterate through the RESTBase API, as long as you follow the rules (from https://en.wikipedia.org/api/rest_v1/):

- Limit your clients to no more than 200 requests/s to this API. Each API endpoint's documentation may detail more specific usage limits.
- Set a unique User-Agent or Api-User-Agent header that allows us to contact you quickly. Email addresses or URLs of contact pages work well.

On Thu, 3 May 2018 at 14:26, Aidan Hogan <[hidden email]> wrote:
> [...]

--
Neil Patel Quinn <https://meta.wikimedia.org/wiki/User:Neil_P._Quinn-WMF>
(he/him/his)
product analyst, Wikimedia Foundation
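For what it's worth, a minimal sketch of fetching a single page's HTML from that endpoint in Python (using the requests library), following the two rules quoted above; the example title and the contact address in the User-Agent are placeholders:

# Sketch: fetch the rendered HTML of one article from the REST API.
# The contact address and example title are placeholders.
import requests
from urllib.parse import quote

HEADERS = {"User-Agent": "html-dump-research/0.1 (contact: you@example.org)"}

def fetch_html(title):
    """Return the HTML of an English Wikipedia article by title."""
    path = quote(title.replace(" ", "_"), safe="")
    url = "https://en.wikipedia.org/api/rest_v1/page/html/" + path
    resp = requests.get(url, headers=HEADERS, timeout=30)
    resp.raise_for_status()
    return resp.text

if __name__ == "__main__":
    print(fetch_html("Albert Einstein")[:300])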
In reply to this post by Aidan Hogan
On 2018-05-03 20:54, Aidan Hogan wrote:
> I am wondering what is the fastest/best way to get a local dump of
> English Wikipedia in HTML? We are looking just for the current versions
> (no edit history) of articles for the purposes of a research project.

The Kiwix project provides HTML dumps of Wikipedia for offline reading:
http://www.kiwix.org/downloads/

Their downloads use the ZIM file format; it looks like there are libraries available for reading it in many programming languages:
http://www.openzim.org/wiki/Readers

--
Bartosz Dziewoński
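In case it helps, a small sketch of pulling one article's HTML out of a downloaded ZIM file using the openZIM project's python-libzim bindings (pip install libzim). The filename and article path are placeholders, and the internal path scheme differs between ZIM releases, so it is worth inspecting a few entries first:

# Sketch: read one article's HTML from a ZIM file with python-libzim.
# Filename and entry path are placeholders; older Wikipedia ZIMs prefix
# article paths with "A/", newer ones may not.
from libzim.reader import Archive

zim = Archive("wikipedia_en_all_nopic.zim")
print("main entry:", zim.main_entry.path)

entry = zim.get_entry_by_path("A/Albert_Einstein")
html = bytes(entry.get_item().content).decode("utf-8")
print(entry.title, len(html))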
Also, for the curious, the request for dedicated HTML dumps is tracked in
this Phabricator task: https://phabricator.wikimedia.org/T182351

On Thu, 3 May 2018 at 15:19, Bartosz Dziewoński <[hidden email]> wrote:
> [...]

--
Neil Patel Quinn <https://meta.wikimedia.org/wiki/User:Neil_P._Quinn-WMF>
(he/him/his)
product analyst, Wikimedia Foundation
In reply to this post by Bartosz Dziewoński
On Friday 04 May 2018 03:49 AM, Bartosz Dziewoński wrote:
> [...]
> The Kiwix project provides HTML dumps of Wikipedia for offline reading:
> http://www.kiwix.org/downloads/

In case you need pure HTML and not the ZIM file format, you could check out mwoffliner [1], the tool used to generate ZIM files. It dumps HTML files locally before generating the ZIM file. Though the HTML is only an intermediary for the tool, it could be kept if you wish. See [2] for more information about the options the tool accepts.

I'm not sure if it's possible to instruct the tool to stop immediately after the dumping of the pages, thus avoiding the creation of the ZIM file altogether. But you could work around that by watching the verbose output (turned on through the '--verbose' option) to identify when dumping has been completed and then stop the tool manually.

In case of any doubts about using the tool, feel free to reach out.

References:
[1]: https://github.com/openzim/mwoffliner
[2]: https://github.com/openzim/mwoffliner/blob/master/lib/parameterList.js

--
Sivaraam

QUOTE: “The most valuable person on any team is the person who makes everyone else on the team more valuable, not the person who knows the most.” - Joel Spolsky

Sivaraam? You possibly might have noticed that my signature recently changed from 'Kaartic' to 'Sivaraam', both of which are parts of my name. I find the new signature to be better for several reasons, one of which is that the former has a lot of ambiguities in the place I live, as it is a common name (NOTE: it's not a common spelling, just a common name). So I switched signatures before it's too late. That said, I won't mind you calling me 'Kaartic' if you like it [of course ;-)]. You can always call me by either of the names.

KIND NOTE TO THE NATIVE ENGLISH SPEAKER: As I'm not a native English speaker myself, there might be mistakes in my usage of English. I apologise for any mistakes that I make. It would be "helpful" if you take the time to point them out. It would be "super helpful" if you could provide suggestions about how to correct those mistakes. Thanks in advance!
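If it helps, here is a rough sketch of driving mwoffliner from Python and watching its verbose output along the lines described above. It assumes mwoffliner is installed via npm and on PATH; the admin email is a placeholder, and the log-matching stop condition is a hand-rolled heuristic rather than anything the tool documents, so check `mwoffliner --help` and the parameter list in [2] for the exact flags and output of your version:

# Rough sketch: run mwoffliner with --verbose and stop it manually once the
# HTML dumping phase appears to be done (i.e. before/while ZIM creation).
# Assumes mwoffliner is installed (e.g. npm i -g mwoffliner); the admin email
# is a placeholder and the stop condition below is only a heuristic.
import subprocess

proc = subprocess.Popen(
    [
        "mwoffliner",
        "--mwUrl=https://en.wikipedia.org/",
        "--adminEmail=you@example.org",
        "--verbose",
    ],
    stdout=subprocess.PIPE,
    stderr=subprocess.STDOUT,
    text=True,
)
for line in proc.stdout:
    print(line, end="")
    # Heuristic: once the log starts talking about creating the ZIM file,
    # the local HTML dump should already be on disk, so stop the process.
    if "zim" in line.lower() and "creat" in line.lower():
        proc.terminate()
        break
proc.wait()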
On Tuesday 08 May 2018 05:53 PM, Kaartic Sivaraam wrote:
> In case you need pure HTML and not the ZIM file format, you could check
> out mwoffliner [1], ...

Note that the HTML is (of course) not the same as the one you see when visiting Wikipedia. For example, the sidebar links are not present, and the ToC would not be present.

--
Sivaraam
Hi all,
Many thanks for all the pointers! In the end we wrote a small client to grab documents from RESTBase (https://www.mediawiki.org/wiki/RESTBase), as suggested by Neil. The HTML looks perfect, and with the generous 200 requests/second limit (which we could not even manage to reach with our local machine), it only took a couple of days to grab all current English Wikipedia articles.

@Kaartic, many thanks for the offers of help with extracting HTML from ZIM! We also investigated this option in parallel, converting ZIM to HTML using Zimreader-Java [1], and indeed it looked promising, but we had some issues with extracting links. We did not try the mwoffliner tool you mentioned since we got what we needed through RESTBase in the end. In any case, we appreciate the offers of help. :)

Best,
Aidan

[1] https://github.com/openzim/zimreader-java

On 08-05-2018 9:34, Kaartic Sivaraam wrote:
> Note that the HTML is (of course) not the same as the one you see when
> visiting Wikipedia. For example, the sidebar links are not present, and
> the ToC would not be present.
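For readers who find this thread later, a stripped-down sketch of that kind of bulk client (an illustration, not the actual client mentioned above), throttled well below the documented 200 requests/second limit; the titles file, output directory and User-Agent contact are placeholders:

# Sketch of a throttled bulk fetcher for REST API page HTML. Titles file,
# output directory and contact address are placeholders; titles containing
# "/" would need extra path sanitising.
import pathlib
import time
from urllib.parse import quote

import requests

API = "https://en.wikipedia.org/api/rest_v1/page/html/"
HEADERS = {"User-Agent": "html-dump-research/0.1 (contact: you@example.org)"}
MAX_RPS = 50  # stay well below the documented 200 requests/s ceiling

def fetch_all(titles_file, out_dir):
    out = pathlib.Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    session = requests.Session()
    session.headers.update(HEADERS)
    min_interval = 1.0 / MAX_RPS
    last = 0.0
    for raw in open(titles_file, encoding="utf-8"):
        title = raw.strip().replace(" ", "_")
        if not title:
            continue
        # Simple client-side throttle.
        wait = min_interval - (time.monotonic() - last)
        if wait > 0:
            time.sleep(wait)
        last = time.monotonic()
        resp = session.get(API + quote(title, safe=""), timeout=30)
        if resp.status_code == 200:
            (out / (title + ".html")).write_text(resp.text, encoding="utf-8")
        else:
            print("skipped", title, resp.status_code)

if __name__ == "__main__":
    fetch_all("titles.txt", "html_dump")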