Re-implementing PDF support

Re-implementing PDF support

Erik Moeller-4
Hi folks,

for a long time we've relied on the mwlib libraries by PediaPress to
generate PDFs on Wikimedia sites. These have served us well (we
generate >200K PDFs/day), but they architecturally pre-date a lot of
important developments in MediaWiki, and actually re-implement the
MediaWiki parser (!) in Python. The occasion of moving the entire PDF
service to a new data-center has given us reason to re-think the
architecture and come up with a minimally viable alternative that we
can support long term.

Most likely, we'll end up using Parsoid's HTML5 output, transform it
to add required bits like licensing info and prettify it, and then
render it to PDF via phantomjs, but we're still looking at various
rendering options.
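Since the "transform" step is the architectural core here, a minimal sketch of what it might look like (the function name, CSS class, and structure are illustrative assumptions, not the actual service code): take Parsoid's HTML5 output and splice in a licensing footer before handing the page to a renderer such as phantomjs.

```javascript
// Sketch only: the real service's transform will differ.
// Takes Parsoid HTML5 output and appends a licensing notice
// before the closing </body>, so the renderer (e.g. phantomjs
// printing the page to PDF) emits it as part of the document.
function addLicensingFooter(html, licenseText) {
  const footer = '<footer class="mw-pdf-license">' + licenseText + '</footer>';
  // Insert before </body> if present, otherwise append at the end.
  if (html.includes('</body>')) {
    return html.replace('</body>', footer + '</body>');
  }
  return html + footer;
}
```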

Thanks to Matt Walker, C. Scott Ananian, Max Semenik, Brad Jorsch and
Jeff Green for joining the effort, and thanks to the PediaPress folks
for giving background as needed. Ideally we'd like to continue to
support printed book generation via PediaPress' web service, while
completely replacing the rendering tech stack on the WMF side of
things (still using the Collection extension to manage books). We may
need to deprecate some output formats - more on that as we go.

We've got the collection-alt-renderer project set up on Labs (thanks
Andrew) and can hopefully get a plan to our ops team soon as to how
the new setup could work.

If you want to peek, the work channel is #mediawiki-pdfhack on freenode.

Live notes here:
http://etherpad.wikimedia.org/p/pdfhack

Stuff will be consolidated here:
https://www.mediawiki.org/wiki/PDF_rendering

Some early experiments with different rendering strategies here:
https://github.com/cscott/pdf-research

Some improvements to Collection extension underway:
https://gerrit.wikimedia.org/r/#/q/status:open+project:mediawiki/extensions/Collection,n,z

More soon,
Erik

--
Erik Möller
VP of Engineering and Product Development, Wikimedia Foundation

_______________________________________________
Wikitech-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: Re-implementing PDF support

Strainu
Hi,

I'm taking this opportunity to bring up three bugs related to mwlib
that deserve a larger discussion and should perhaps be handled
differently in the new version.

1. https://bugzilla.wikimedia.org/show_bug.cgi?id=56560 - PDF creation
tool considers IPv6 addresses as users, not anonymous.

I've pushed a patch for this and it was merged; however, the
detection was based on a regex and, as a quick Google search will tell
you, it's not trivial to write a regex that covers all IPv6 cases.
Perhaps the anonymous/logged-in status could be sent from MediaWiki
instead.

2. https://bugzilla.wikimedia.org/show_bug.cgi?id=56219 - PDF creation
tool excludes contributors with a "bot" substring in their username

I've also pushed a pull request for this one, but it was rejected
based on the en.wp policy that prevents bot-like usernames for humans.
The problem is more complex though:

a. Should bots be credited for their edits? While most of them do
simple tasks, we have recently seen an increase in bot-created
content. On ro.wp we even have a few lists only edited by robots.
b. If robots should _not_ be credited, how do we detect them?
Ideally there would be an automatic way to do so, but according to
http://www.mediawiki.org/wiki/Bots it only works for recent changes.
Less ideally, only users with "bot" at the end of the name would be
removed, in order to keep users like
https://ro.wikipedia.org/wiki/Utilizator:Vitalie_Ciubotaru (who is
not a robot, but has "bot" in the name) in the contributor list.
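To make the difference between the two heuristics concrete (illustrative only; neither is reliable policy), a substring check wrongly flags Vitalie_Ciubotaru, while a suffix check keeps him:

```javascript
// Illustrative comparison of the two name-based heuristics discussed
// above. They differ only in how many false positives they produce;
// neither actually knows whether an account is a bot.
function hasBotSubstring(name) {
  return name.toLowerCase().includes('bot');
}

function hasBotSuffix(name) {
  return name.toLowerCase().endsWith('bot');
}
```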


3. https://bugzilla.wikimedia.org/show_bug.cgi?id=2994 - Automatically
generated count and list of contributors to an article (authorship
tracking)

This is an old enhancement request, which I revived last month in a
wikimedia-l thread:
http://lists.wikimedia.org/pipermail/wikimedia-l/2013-October/128575.html
The idea is to decide if and how to credit:
a. vandals
b. reverters
c. contributors whose valid contributions were rephrased or
replaced in the article.
d. contributors with valid contributions but invalid names

I hope the people working on this feature will take the time to
consider these issues and come up with solutions for them.

Thanks,
   Strainu


2013/11/13 Erik Moeller <[hidden email]>

Re: Re-implementing PDF support

Brad Jorsch (Anomie)
Note these are my own thoughts and not anything representative of the team.

On Wed, Nov 13, 2013 at 6:55 AM, Strainu <[hidden email]> wrote:
> b. If robots should _not_ be credited, how do we detect them?
> Ideally there would be an automatic way to do so, but according to
> http://www.mediawiki.org/wiki/Bots it only works for recent changes.
> Less ideally, only users with "bot" at the end of the name would be
> removed, in order to keep users like
> https://ro.wikipedia.org/wiki/Utilizator:Vitalie_Ciubotaru (who is
> not a robot, but has "bot" in the name) in the contributor list.

Another way to exclude (most) bots would be to skip any user with the
"bot" user right. Note though that this would still include edits by
unflagged bots, or by bots that have since been decommissioned and had
the bot flag removed.
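A sketch of that check against the MediaWiki web API (the query parameters `list=users` and `usprop=groups` are real API ones; the helper function and sample data are illustrative; strictly speaking, "bot" here is the user group that grants the bot right, and `usprop=rights` would list the rights themselves):

```javascript
// Sketch: given the parsed JSON from a MediaWiki API request like
//   api.php?action=query&list=users&ususers=A|B&usprop=groups&format=json
// keep only contributors who do NOT hold the "bot" group.
// (As noted above, this still misses unflagged or since-deflagged bots.)
function filterOutFlaggedBots(apiResponse) {
  return apiResponse.query.users
    .filter(u => !(u.groups || []).includes('bot'))
    .map(u => u.name);
}
```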

Personally, though, I do agree that excluding any user with "bot" in
the name (or even with a name ending in "bot") is a bad idea even if
just applied to enwiki, and worse when applied to other wikis that may
have different naming conventions.

> The idea is to decide if and how to credit:
> a. vandals
> b. reverters
> c. contributors which had their valid contributions rephrased or
> replaced from the article.
> d. contributors with valid contributions but invalid names

The hard part there is detecting these, particularly case (c). And
even then, the article may still be based on the original work in a
copyright sense even if no single word of the original edit remains.

Then there's also the situation where A makes an edit that is
partially useful and partially bad, B reverts, and then C comes along
and reincorporates parts of A's edit.


--
Brad Jorsch (Anomie)
Software Engineer
Wikimedia Foundation


Re: Re-implementing PDF support

Tyler Romeo
In reply to this post by Erik Moeller-4
On Wed, Nov 13, 2013 at 12:45 AM, Erik Moeller <[hidden email]> wrote:

> Most likely, we'll end up using Parsoid's HTML5 output, transform it
> to add required bits like licensing info and prettify it, and then
> render it to PDF via phantomjs, but we're still looking at various
> rendering options.
>

I don't have anything against this, but what's the reasoning? You now have
to parse the wikitext into HTML5 and then parse the HTML5 into PDF. I'm
guessing you've found some library that automatically "prints" HTML5, which
would make sense since browsers do that already, but I'm just curious.

--
Tyler Romeo
Stevens Institute of Technology, Class of 2016
Major in Computer Science

Re: Re-implementing PDF support

Emmanuel Engelhart-5
On 13/11/2013 at 17:10, Tyler Romeo wrote:

> On Wed, Nov 13, 2013 at 12:45 AM, Erik Moeller <[hidden email]> wrote:
>
>> Most likely, we'll end up using Parsoid's HTML5 output, transform it
>> to add required bits like licensing info and prettify it, and then
>> render it to PDF via phantomjs, but we're still looking at various
>> rendering options.
>>
>
> I don't have anything against this, but what's the reasoning? You now have
> to parse the wikitext into HTML5 and then parse the HTML5 into PDF. I'm
> guessing you've found some library that automatically "prints" HTML5, which
> would make sense since browsers do that already, but I'm just curious.

Here is an example of how this works:
https://github.com/ariya/phantomjs/blob/master/examples/rasterize.js

Emmanuel
--
Kiwix - Wikipedia Offline & more
* Web: http://www.kiwix.org
* Twitter: https://twitter.com/KiwixOffline
* more: http://www.kiwix.org/wiki/Communication


Re: Re-implementing PDF support

Brad Jorsch (Anomie)
In reply to this post by Tyler Romeo
On Wed, Nov 13, 2013 at 11:10 AM, Tyler Romeo <[hidden email]> wrote:
> I'm
> guessing you've found some library that automatically "prints" HTML5, which
> would make sense since browsers do that already, but I'm just curious.

Yes, phantomjs, as mentioned in the original message.

To be more specific, phantomjs is basically WebKit without a GUI, so
the output would be roughly equivalent to opening the page in Chrome
or Safari and printing to a PDF. Future plans include using bookjs or
the like to improve the rendering.


--
Brad Jorsch (Anomie)
Software Engineer
Wikimedia Foundation


Re: Re-implementing PDF support

Tyler Romeo
On Wed, Nov 13, 2013 at 11:16 AM, Brad Jorsch (Anomie) <
[hidden email]> wrote:

> Yes, phantomjs, as mentioned in the original message.
>
> To be more specific, phantomjs is basically WebKit without a GUI, so
> the output would be roughly equivalent to opening the page in Chrome
> or Safari and printing to a PDF. Future plans include using bookjs or
> the like to improve the rendering.
>

Aha awesome. Thanks for explaining.

--
Tyler Romeo
Stevens Institute of Technology, Class of 2016
Major in Computer Science

Re: Re-implementing PDF support

Gabriel Wicke-3
In reply to this post by Tyler Romeo
On 11/13/2013 08:10 AM, Tyler Romeo wrote:

> On Wed, Nov 13, 2013 at 12:45 AM, Erik Moeller <[hidden email]> wrote:
>
>> Most likely, we'll end up using Parsoid's HTML5 output, transform it
>> to add required bits like licensing info and prettify it, and then
>> render it to PDF via phantomjs, but we're still looking at various
>> rendering options.
>>
>
> I don't have anything against this, but what's the reasoning? You now have
> to parse the wikitext into HTML5 and then parse the HTML5 into PDF.

We are already parsing all edited pages to HTML5 and will also start
storing (rather than just caching) this HTML very soon, so there will
not be any extra parsing involved in the longer term. Getting the HTML
will basically be a request for a static HTML page.

Gabriel


Re: Re-implementing PDF support

Strainu
In reply to this post by Brad Jorsch (Anomie)
Thanks Brad,

I'm wondering if it wouldn't make sense to have a dedicated bug day
at the end of the sprint?

Strainu

2013/11/13 Brad Jorsch (Anomie) <[hidden email]>


Re: Re-implementing PDF support

C. Scott Ananian
Let's see what sorts of bugs crop up. In my (limited) experience, the
most common issues are article content that renders poorly as a PDF for
some reason. Those bugs aren't easy to fix in a bug-day sprint, since
they tend to crop up slowly over time, as people use the service and
collect lists of suboptimal pages. (And some of these issues may
eventually be traced to Parsoid, and we know from experience that
fixing those ends up being a gradual collaboration between authors and
developers to determine whether the wikitext should be rewritten or the
parser extended, etc.)

On the other hand, if our servers are crashing or the UI code is buggy,
etc, then a bug day would probably be useful to squash those sorts of
things.
 --scott


On Thu, Nov 14, 2013 at 9:49 AM, Strainu <[hidden email]> wrote:

> Thanks Brad,
>
> I'm wondering if it wouldn't make sense to have a dedicated bugday at
> the end of the sprint?
>
> Strainu



--
(http://cscott.net)