The Whole Wikipedia in English with pictures in one 40GB big file

The Whole Wikipedia in English with pictures in one 40GB big file

Emmanuel Engelhart-5
Hi

For the first time, we have managed to release a complete dump of all
encyclopedic articles of the English Wikipedia, *with thumbnails*.

This ZIM file is 40 GB in size and contains the current 4.5 million
articles with their 3.5 million pictures:
http://download.kiwix.org/zim/wikipedia_en_all.zim.torrent

This ZIM file is directly and easily usable on many types of devices
like Android smartphones and Win/OSX/Linux PCs with Kiwix, or Symbian
with Wikionboard.

You don't need a modern computer with a powerful CPU. You can, for
example, create a (read-only) Wikipedia mirror on a Raspberry Pi for
~100 USD by using our dedicated ZIM web server, kiwix-serve. A demo is
available here: http://library.kiwix.org/wikipedia_en_all/
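
For example, on the Pi, something like the following should be enough
to serve the file on port 8000 (check "kiwix-serve --help" for the
exact options of your version):

  kiwix-serve --port=8000 wikipedia_en_all.zim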

As always, we also provide a packaged version (for the main PC
platforms) which includes the full-text search index, the ZIM file and
the binaries:

Also interesting: this file was generated in less than 2 weeks thanks
to multiple recent innovations:
* The Parsoid cluster, which gives us HTML output with additional
semantic RDF tags
* mwoffliner, a Node.js script able to dump pages based on the
MediaWiki API (and the Parsoid API); see the sketch below
* zimwriterfs, a tool able to compile any local HTML directory into a
ZIM file
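
To give an idea of the kind of step mwoffliner automates, here is a
very rough Node.js sketch which only fetches the rendered HTML of one
article through the public MediaWiki action API and writes it to disk,
so that zimwriterfs could later compile the directory into a ZIM file
(the real script talks to the Parsoid API, rewrites links, downloads
the thumbnails, etc.; the endpoint and file name below are purely
illustrative):

  const https = require('https');
  const fs = require('fs');

  // Illustrative only: dump the rendered HTML of a single article.
  const title = 'Albert_Einstein';
  const url = 'https://en.wikipedia.org/w/api.php' +
              '?action=parse&format=json&page=' + title;

  https.get(url, function (res) {
    let body = '';
    res.on('data', function (chunk) { body += chunk; });
    res.on('end', function () {
      // action=parse returns the page HTML in parse.text['*']
      const html = JSON.parse(body).parse.text['*'];
      fs.writeFileSync(title + '.html', html);
      console.log('Saved ' + title + '.html');
    });
  });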

We now have an efficient way to generate new ZIM files. Consequently,
we will work to industrialize and automate the ZIM file generation
process, which is probably the oldest and most important problem we
still face at Kiwix.

All this would not have been possible without the support of:
* Wikimedia CH and the "ZIM autobuild" project
* Wikimedia France and the Afripedia project
* Gwicke from the WMF Parsoid dev team.

BTW, we need additional developer help with JavaScript/Node.js skills
to fix a few issues in mwoffliner:
* Recreate the "table of contents" based on the HTML DOM (*); a rough
sketch follows after this list
* Scrape the MediaWiki ResourceLoader output in such a way that it
continues to work offline (***)
* Scrape categories (**)
* Localize the script (*)
* Improve the overall performance by introducing workers (**)
* Create nodezim, a libzim Node.js binding, and use it (***, also
needs compilation and C++ skills)
* Evaluate the work necessary to merge mwoffliner and the new WMF PDF
renderer (***)
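
For the "table of contents" item, the idea is roughly the following
sketch: walk the heading elements of the article HTML and build a
nested list from them. (jsdom is only an assumption here; mwoffliner
could use any DOM library.)

  const { JSDOM } = require('jsdom');

  // Collect the section headings of an article and return a flat TOC
  // structure which can then be rendered as a nested list.
  function buildToc(html) {
    const document = new JSDOM(html).window.document;
    const toc = [];
    document.querySelectorAll('h2, h3').forEach(function (heading) {
      toc.push({
        level: heading.tagName === 'H2' ? 1 : 2,
        id: heading.id,                  // target of the "#anchor" link
        text: heading.textContent.trim()
      });
    });
    return toc;
  }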

Emmanuel
--
Kiwix - Wikipedia Offline & more
* Web: http://www.kiwix.org
* Twitter: https://twitter.com/KiwixOffline
* more: http://www.kiwix.org/wiki/Communication

_______________________________________________
Wikitech-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: The Whole Wikipedia in English with pictures in one 40GB big file

James Forrester-4
On Saturday, March 1, 2014, Emmanuel Engelhart <[hidden email]> wrote:

> Hi
>
> For the first time, we have managed to release a complete dump of all
> encyclopedic articles of the English Wikipedia, *with thumbnails*.


Great news, Emmanuel – congratulations!

[Snip]

> BTW, we need additional developer help with JavaScript/Node.js skills
> to fix a few issues in mwoffliner:
> * Recreate the "table of contents" based on the HTML DOM (*)


We are currently working on doing similar work to this in VisualEditor (to
provide for a Table of Contents that can change 'live' as the document is
edited); this code may ultimately be used to generate the "real" Tables of
Contents for the reading HTML, as part of the plans to replace the output
of the PHP parser with Parsoid everywhere.

It should be possible for Kiwix to re-use this in some way (rather than
have to re-implement it!). We hope to have something to show in the next
few weeks, if that's helpful.

J.


--
James D. Forrester
Product Manager, VisualEditor
Wikimedia Foundation, Inc.

[hidden email] | @jdforrester

Re: The Whole Wikipedia in English with pictures in one 40GB big file

Emmanuel Engelhart-5
On 01/03/2014 19:26, James Forrester wrote:

> On Saturday, March 1, 2014, Emmanuel Engelhart <[hidden email]> wrote:
>> fix a few issues in mwoffliner:
>> * Recreate the "table of contents" based on the HTML DOM (*)
>
> We are currently working on doing similar work to this in VisualEditor (to
> provide for a Table of Contents that can change 'live' as the document is
> edited); this code may ultimately be used to generate the "real" Tables of
> Contents for the reading HTML, as part of the plans to replace the output
> of the PHP parser with Parsoid everywhere.
>
> It should be possible for Kiwix to re-use this in some way (rather than
> have to re-implement it!). We hope to have something to show in the next
> few weeks, if that's helpful.

Nice news!

Yes, indeed, it would be great to be able to re-use your work.

I have subscribed to this Bugzilla entry, which I guess is what you mean:
https://bugzilla.wikimedia.org/show_bug.cgi?id=49224

Of course, for our usage, the best solution would be to have the
initial TOC rendering done on the Parsoid side, which would also offer
the advantage of speeding up the initial VE rendering.

Emmanuel

--
Kiwix - Wikipedia Offline & more
* Web: http://www.kiwix.org
* Twitter: https://twitter.com/KiwixOffline
* more: http://www.kiwix.org/wiki/Communication


Re: [Offline-l] The Whole Wikipedia in English with pictures in one 40GB big file

Emmanuel Engelhart-5
In reply to this post by Emmanuel Engelhart-5
On 02/03/2014 01:33, Samuel Klein wrote:
> Brilliant.  Congrats to everyone who is working on this!
> What is needed to scrape categories?

0 - For all dumped pages (so at least NS_MAIN and NS_CATEGORY pages),
download the list of categories they belong to (with the MW API).
1 - For each dumped page, implement the HTML rendering of the category
list at the bottom.
2 - For each category page, get the content HTML rendering from Parsoid
and compute and render sorted lists of articles and sub-categories in a
similar fashion to the online version (with multiple pages if necessary).

All of this must be integrated into the Node.js script, and the
category graph must be stored in Redis.
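
To make step 0 more concrete, here is a minimal Node.js sketch (only an
illustration, not the final code) which asks the MediaWiki API for the
categories of a single page; the real script would then feed this into
the category graph kept in Redis:

  const https = require('https');

  // Fetch the categories one dumped page belongs to (step 0 above).
  function fetchCategories(title, callback) {
    const url = 'https://en.wikipedia.org/w/api.php?action=query' +
                '&format=json&prop=categories&cllimit=max' +
                '&titles=' + encodeURIComponent(title);
    https.get(url, function (res) {
      let body = '';
      res.on('data', function (chunk) { body += chunk; });
      res.on('end', function () {
        const pages = JSON.parse(body).query.pages;
        const page = pages[Object.keys(pages)[0]];
        callback((page.categories || []).map(function (c) { return c.title; }));
      });
    });
  }

  fetchCategories('Albert Einstein', function (categories) {
    console.log(categories);   // e.g. [ 'Category:1879 births', ... ]
  });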

Emmanuel
--
Kiwix - Wikipedia Offline & more
* Web: http://www.kiwix.org
* Twitter: https://twitter.com/KiwixOffline
* more: http://www.kiwix.org/wiki/Communication


Re: The Whole Wikipedia in English with pictures in one 40GB big file

Gryllida
In reply to this post by Emmanuel Engelhart-5


On Sun, 2 Mar 2014, at 4:01, Emmanuel Engelhart wrote:

> Hi
>
> For the first time, we have managed to release a complete dump of all
> encyclopedic articles of the English Wikipedia, *with thumbnails*.
>
> This ZIM file is 40 GB in size and contains the current 4.5 million
> articles with their 3.5 million pictures:
> http://download.kiwix.org/zim/wikipedia_en_all.zim.torrent
>
> This ZIM file is directly and easily usable on many types of devices
> like Android smartphones and Win/OSX/Linux PCs with Kiwix, or Symbian
> with Wikionboard.
>
> You don't need a modern computer with a powerful CPU. You can, for
> example, create a (read-only) Wikipedia mirror on a Raspberry Pi for
> ~100 USD by using our dedicated ZIM web server, kiwix-serve. A demo is
> available here: http://library.kiwix.org/wikipedia_en_all/

Do this for other sister projects perhaps.


Re: The Whole Wikipedia in English with pictures in one 40GB big file

Gabriel Wicke-3
In reply to this post by Emmanuel Engelhart-5
Emmanuel,

On 03/01/2014 09:01 AM, Emmanuel Engelhart wrote:
> Hi
>
> For the first time, we have managed to release a complete dump of all
> encyclopedic articles of the English Wikipedia, *with thumbnails*.

this is great news. Congratulations!

Gabriel


Re: [Offline-l] The Whole Wikipedia in English with pictures in one 40GB big file

Emmanuel Engelhart-5
In reply to this post by Emmanuel Engelhart-5
On 07/03/2014 19:25, Asaf Bartov wrote:
> btw, are these new improved tools documented anywhere?
> http://kiwix.org/wiki/Development does not seem to point in the right
> direction.

The usage is pretty straightforward (for IT people) and IMO everything
necessary is explained in the READMEs:
* mwoffliner:
https://sourceforge.net/p/kiwix/other/ci/master/tree/mwoffliner/
* zimwriterfs:
https://sourceforge.net/p/kiwix/other/ci/master/tree/zimwriterfs/
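
For instance, once mwoffliner has produced a directory of HTML, the ZIM
compilation is a single zimwriterfs command, roughly like the one below
(the exact arguments are documented in its README; treat the values
here as placeholders):

  zimwriterfs --welcome=index.html --favicon=favicon.png --language=eng \
    --title="Wikipedia" --description="Offline Wikipedia" \
    --creator="Wikipedia" --publisher="Kiwix" ./html_directory wikipedia.zim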

NB: The goal is not that everybody creates their own full Wikipedia ZIM
file. The goal is that we (Wikimedia) provide these files, often enough
to always have up-to-date ZIM content (so at least once per month).
Thus, the challenge is now to set up an infrastructure similar to the
one which creates the XML dumps.

Emmanuel

PS: We really want to make a post on blog.wikimedia.org (so in English).
If someone volunteers to write this, I would really appreciate their help.
--
Kiwix - Wikipedia Offline & more
* Web: http://www.kiwix.org
* Twitter: https://twitter.com/KiwixOffline
* more: http://www.kiwix.org/wiki/Communication


Re: [Offline-l] The Whole Wikipedia in English with pictures in one 40GB big file

Jay Ashworth-2
----- Original Message -----
> From: "Emmanuel Engelhart" <[hidden email]>

> PS: We really want to make a post on blog.wikimedia.org (so in English).
> If someone volunteers to write this, I would really appreciate their
> help.

If you write such a blog post in what English you have handy, I'd be happy
to English it up for you; you know what points you want to make better than
I would.  :-)

Cheers,
-- jra
--
Jay R. Ashworth                  Baylink                       [hidden email]
Designer                     The Things I Think                       RFC 2100
Ashworth & Associates       http://www.bcp38.info          2000 Land Rover DII
St Petersburg FL USA      BCP38: Ask For It By Name!           +1 727 647 1274
