Status update on new Collections PDF Renderer


Status update on new Collections PDF Renderer

Matthew Walker
Hey All,

For those who are not aware, the WMF is currently attempting to replace the
backend renderer for the Collection extension (mwlib). This is the renderer
that creates the PDFs for the 'Download to PDF' sidebar link and creates
books (downloadable in multiple formats and printable via PediaPress) using
Special:Book. We're taking the data centre migration as our cue to replace
mwlib for several reasons, chief among them the desire to use Parsoid
to do the parsing from wikitext into something usable by an external tool
-- mwlib currently does this conversion internally. This should allow us to
solve several other long-standing mwlib issues with respect to the
rendering of non-Latin languages.

Last week we started work on the new renderer, which we're calling the
'Collection Offline Content Generator' or OCG-C. Today I can say that our
progress is promising but by no means complete. We as yet have only basic
support for rendering articles, and a lot of complex articles are failing
to render. For the curious, we have an alpha product [1] and a public
coordination / documentation page [2] -- you can also join us in
#mediawiki-pdfhack.

In broad strokes [3]: our solution is an LVS-fronted Node.js backend cluster
with a Redis job queue. Bundling (content gathering from the wiki) and
rendering are two distinct processes with an intermediate file [4] in
between. Any renderer should be able to pick the intermediate file up and
produce output [5]. We will store bundle files and generated documents
under a short timeout in Swift, and have a somewhat longer frontend cache
period in Varnish for the final documents. Deployments will be happening
via Trebuchet, and node dependencies are stored in a separate git
repository -- much like Parsoid and eventually Mathoid [6].
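The bundle/render split can be sketched roughly as follows. This is a minimal illustration only: the function names, the JSON shape, and the use of Python are all invented for the example (the real service is Node.js, and the real intermediate format is described in [4]).

```python
import json
import tempfile

def bundle(article_titles):
    """Stage 1 (bundling): gather content from the wiki and write it
    to an intermediate bundle file on disk. (Hypothetical sketch.)"""
    bundle_data = {
        "articles": [{"title": t, "html": "<p>...</p>"} for t in article_titles],
        "metadata": {"version": 1},
    }
    with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
        json.dump(bundle_data, f)
        return f.name

def render(bundle_path, backend):
    """Stage 2 (rendering): any backend can pick up the intermediate
    file and produce output in its own format."""
    with open(bundle_path) as f:
        bundle_data = json.load(f)
    titles = ", ".join(a["title"] for a in bundle_data["articles"])
    return f"[{backend} output for: {titles}]"

path = bundle(["Alan Turing", "Ada Lovelace"])
print(render(path, "latex"))
```

The point of the intermediate file is that the two stages share nothing but that file, so a new renderer can be dropped in without touching the bundler.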

The Foundation is still partnering with PediaPress to provide print on
demand books. However, bundling and rendering will in future be performed
on their servers.

The team will continue to work on this project over the coming weeks. Big
mileposts, in no particular order, are table support, puppetization into beta
labs, load testing, and multilingual support. Our plan is to have something
that the community can reliably beta test soon, with final deployment into
production happening, probably, in early January [7]. Decommissioning of the
old servers is expected to happen by late January, so that's our hard
deadline to wrap things up.

Big thanks to Max, Scott, Brad & Jeff for all their help so far, and to
Faidon, Ryan and other ops team members for their support.

If you'd like to help, ping me on IRC, and you'll continue to find us on
#mediawiki-pdfhack !

~ Matt Walker

[1] http://mwalker-enwikinews.instance-proxy.wmflabs.org/Special:Book
[2] https://www.mediawiki.org/wiki/PDF_rendering
[3] More detail available at
https://www.mediawiki.org/wiki/PDF_rendering/Architecture
[4] The format is almost exactly the same as the format mwlib uses, just
with RDF instead of HTML
    https://www.mediawiki.org/wiki/PDF_rendering/Bundle_format
[5] Right now the alpha solution only has a LaTeX renderer, but we have
plans for a native HTML renderer (both for PDF and epub) and the ZIM
community has been in contact with us about their RDF to ZIM renderer.
[6] Mathoid is the LaTeX math renderer that Gabriel wrote, which will run on
the same servers as this service. Both fall under the nebulous category of
node-based 'Offline Content Generators'.
[7] I'm being hazy here because we have duties to other teams as well.
Tuesdays until Jan 1 are my dedicated days for working on this project, and
I return to it full time come Jan 1. Erik will reach out to
organize a follow-up sprint.
_______________________________________________
Wikitech-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Engineering] Status update on new Collections PDF Renderer

C. Scott Ananian
Let me talk a little bit about the bundle format, briefly:

* It is intended to be a complete copy of all wiki resources required to
make an offline dump, in any format.  That means that all the articles are
spidered and template-expanded, and all related images and other media are
fetched and stored in a zip archive.  The archive will also contain all
license and authorship information needed to make the attributions, etc.,
needed for a license-compliant rendering.  This should give developers
of rendering backends a substantial head start.
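As a rough illustration of what such an archive contains, here is a toy bundle built with Python's zipfile module. The entry names and the attribution layout are invented for the example; the real layout is documented on the bundle-format page linked in [4].

```python
import io
import json
import zipfile

# Build a toy bundle in memory: an expanded article, a fetched media
# file, and the attribution data needed for license-compliant rendering.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("articles/Alan_Turing.html", "<p>expanded article body</p>")
    zf.writestr("media/Turing.jpg", b"\xff\xd8...")  # raw image bytes
    zf.writestr("attribution.json", json.dumps({
        "Alan_Turing": {"license": "CC BY-SA 3.0", "authors": ["..."]},
    }))

# A renderer only needs the archive; everything else was fetched up front.
with zipfile.ZipFile(buf) as zf:
    print(sorted(zf.namelist()))
```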

* The current bundle format is backwards compatible with the PediaPress
bundles.  We have made some additions, primarily having to do with better
disambiguating table keys/filenames/etc. to deal with collections which span
multiple wikis.  We also add the Parsoid parser output.

* The backwards-compatibility features are somewhat experimental.  As
Matthew noted, the plan is for PediaPress to eventually begin hosting their
bundler on their own servers.  We hope that they will be able to share our
bundles, but that decision is up to them.  We may deprecate some of the
backwards-compatibility content of the bundles (for example, removing the
PHP parser output) if no one ends up using it.  (Nonetheless, having
PediaPress' working bundle format was very helpful to me in writing the new
bundler, and I want to thank them!)

* I've made a conscious effort to support *very large* bundles in this
format.  That is, I try not to hold complete data relating to a bundle in
memory, and we use sqlite databases wherever possible to support
article-at-a-time access during rendering.  The MW-hosted servers will
probably have reasonably-small resource limits, but it is my intention that
if you want to create an offline dump of an entire wiki (or large subset
thereof), then you should be able to use the existing renderers and bundler
to do so.  I'd encourage people interested in making large slices to get in
touch and hopefully start playing with the code, so we can identify any
bundle-format related bottlenecks and eliminate them before the bundle
format is too firmly established.
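The article-at-a-time idea can be sketched with the stdlib sqlite3 module. The schema below is made up for illustration (it is not the actual bundle schema); the point is that the renderer iterates over a cursor and holds only the current row in memory, rather than loading every article at once.

```python
import sqlite3

# A real bundle would carry a database file; in-memory is fine for a demo.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE articles (title TEXT PRIMARY KEY, html TEXT)")
con.executemany(
    "INSERT INTO articles VALUES (?, ?)",
    [(f"Article {i}", f"<p>body {i}</p>") for i in range(10000)],
)

# Render one article at a time; memory use stays flat regardless of
# how many articles the bundle contains.
rendered = 0
for title, html in con.execute("SELECT title, html FROM articles ORDER BY title"):
    rendered += 1  # a real backend would emit LaTeX/HTML for this row here
print(rendered)
```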

* The bundler (and latex renderer) are independent npm modules, loosely
coupled to the Collection extension.  Again, this should encourage reuse of
the bundler and renderer in other projects.  Patches welcome!

http://git.wikimedia.org/summary/mediawiki%2Fextensions%2FCollection%2FOfflineContentGenerator%2Fbundler.git

http://git.wikimedia.org/summary/mediawiki%2Fextensions%2FCollection%2FOfflineContentGenerator%2Flatex_renderer.git
The npm module names are still in flux.  They are currently mw-bundler and
mw-latexer; maybe mw-ocg-bundler etc. would be better.
  --scott

Re: [Ops] Status update on new Collections PDF Renderer

Erik Moeller-4
In reply to this post by Matthew Walker
Thanks, Matt, for the detailed update, as well as for your leadership
throughout the project, and thanks to everyone who's helped with the
effort so far. :-)

As Matt outlined, we're going to keep moving on critical-path issues
until January and will do a second sprint then to get things ready for
production. Currently we're targeting January 6-January 17 for the
second sprint. Will keep you posted.

All best,
Erik
--
Erik Möller
VP of Engineering and Product Development, Wikimedia Foundation


Re: [Engineering] Status update on new Collections PDF Renderer

C. Scott Ananian
In reply to this post by C. Scott Ananian
On Wed, Nov 27, 2013 at 9:52 AM, Bjoern Hassler <[hidden email]> wrote:

> could I check whether this new process would pick up formatting inserted via
> css styles, e.g. attached to a <span> or <div>?
>
> On our mediawiki (http://www.oer4schools.org) we use a handful of different
> css styles to provide boxes for different types of text (such as facilitator
> notes or background reading). With the PediaPress tools this didn't render
> at all (because the wiki text was parsed directly), but also <blockquote>
> and table background colors did not render nicely, leaving us very few
> options for highlighting blocks of text. (See here for an example of two
> types of boxed text:
> http://orbit.educ.cam.ac.uk/wiki/OER4Schools/ICTs_in_interactive_teaching.)

The current plan is for the latex renderer *NOT* to pick up CSS
styles, in general.  The latex renderer will be a 'semantic renderer'
-- it will normalize the formatting to make it conform to house style.
It will be tuned to the needs of the Wikipedias.  It knows about
certain CSS classes and templates, but is not particularly
extensible...

...which is why it won't be the only backend!  We also expect to have
an "HTML" renderer, which will apply CSS styles to the Parsoid output
and render to PDF via PhantomJS (i.e., WebKit).

This gives you two options, "faithful" and "beautiful".  In my
experience so far, the LaTeX output, when it works, produces superior
output -- the typesetting is better, the ligatures and non-Latin
support ought to be superior, the justification is nicer, and math
rendering should be stellar.  We also use a two column layout and
normalize figure sizes to match the column widths, which helps
maintain a clean appearance.  However, as you have noted, the LaTeX
renderer isn't particularly extensible, and there are cases where we
need to preserve the author's styling even at the cost of somewhat
less 'clean' output.  Some articles can't easily be shoehorned into
our 'house style'.  The Parsoid->HTML->webkit->PDF render path should
be a good solution in these cases, even if (for instance) the
paragraph justification and page splitting isn't quite as pretty.
(Browser technology continues to improve; one day it may be possible
to make the HTML->PDF pipeline just as pretty.  So the "faithful"
approach is also our "forward-looking" renderer.)

Our architecture allows multiple 'backends' to be plugged in, so it is
possible there could be other options as well.  I hope to refactor the
LaTeX backend at some point, for instance, to make it more extensible
so that you could in theory add special 'tweaks' for your wiki's
"house style".  I could also add a CSS engine so that the LaTeX
backend could pick up certain CSS styles -- like table background
color, for instance.
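A class-to-environment mapping of the kind described might look like the sketch below. The CSS class names and the LaTeX environments they map to are made up for illustration; this is not the actual renderer code.

```python
# Hypothetical mapping from CSS classes the renderer knows about
# to the LaTeX environments used for "house style" output.
CLASS_TO_ENV = {
    "facilitator-note": "quotation",
    "background-reading": "framed",
}

def wrap_block(css_class, body):
    """Wrap a block's text in the LaTeX environment registered for its
    CSS class; unknown classes fall back to plain text (house style)."""
    env = CLASS_TO_ENV.get(css_class)
    if env is None:
        return body
    return f"\\begin{{{env}}}\n{body}\n\\end{{{env}}}"

print(wrap_block("facilitator-note", "Ask the group to pair up."))
```

Making the table per-wiki configurable is essentially what "adding special 'tweaks' for your wiki's house style" would mean in practice.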

It's all a work in progress, of course!  But the
"faithful"/"beautiful" split is the principle we're working with.
 --scott

--
(http://cscott.net)


Re: [Engineering] Status update on new Collections PDF Renderer

Željko Filipin
In reply to this post by Matthew Walker
On Mon, Nov 25, 2013 at 12:52 AM, Matthew Walker <[hidden email]> wrote:

> For those who are not aware, the WMF is currently attempting to replace
> the backend renderer for the Collection extension (mwlib). This is the
> renderer that creates the PDFs for the 'Download to PDF' sidebar link and
> creates books (downloadable in multiple formats and printable via
> PediaPress), using Special:Book.


Would you like to see Selenium tests for this feature? We have a few Google
Code-in students who are resolving the tasks we give them faster than we
can create new ones. If you would like to see tests for this, please let
me know as soon as possible. Now is already too late! :)

Even better, create a bug (and send me the link) with examples of what needs
to be tested.

Something like this:

- when I create a PDF from an empty page, an empty PDF file should be
created
- when I create a PDF from a page that has a title and one paragraph, the
PDF should contain the title and one paragraph of text
- and so on...

I do not know what really needs to be tested; these were just a few simple
ideas.
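Checks like those could start with something as simple as verifying that the service returned a well-formed PDF at all. For example (a generic sketch, not tied to Selenium or any particular framework):

```python
def looks_like_pdf(data: bytes) -> bool:
    """Cheap smoke test: a PDF file starts with the '%PDF-' magic
    bytes and ends with an '%%EOF' marker (trailing whitespace aside)."""
    return data.startswith(b"%PDF-") and data.rstrip().endswith(b"%%EOF")

# A render of even an empty page should still be structurally valid.
fake_output = b"%PDF-1.4\n...page objects...\n%%EOF\n"
print(looks_like_pdf(fake_output))
```

Content-level assertions (title present, paragraph text present) would then need a PDF text extractor on top of a check like this.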

Željko

Re: [Engineering] Status update on new Collections PDF Renderer

Helder .
In reply to this post by C. Scott Ananian
On Mon, Nov 25, 2013 at 2:24 PM, C. Scott Ananian
<[hidden email]> wrote:
> Let me talk a little bit about the bundle format, briefly:
>
> * It is intended to be a complete copy of all wiki resources required to
> make an offline dump, in any format.  That means that all the articles are
> spidered and template-expanded and all related images and other media are
> fetched and stored in a zip archive.  The archive also will contain all
> license and authorship information needed to make the attributions, etc,
> needed for a license-compliant rendering.  This should provide developers
> of rendering backends a substantial headstart.

Except for these, I suppose:
https://bugzilla.wikimedia.org/show_bug.cgi?id=28064
https://bugzilla.wikimedia.org/show_bug.cgi?id=27629

Helder


Re: [Engineering] Status update on new Collections PDF Renderer

Željko Filipin
In reply to this post by Željko Filipin
On Wed, Nov 27, 2013 at 8:12 PM, Željko Filipin <[hidden email]> wrote:

> Even better, create a bug (and send me the link) with examples of what
> needs to be tested.


Looks like the bug [1] already exists. I will create a task for Code-in
students to create a few simple tests. If more tests are needed, let me
know.

Željko
--
1: https://bugzilla.wikimedia.org/show_bug.cgi?id=46224