thumb generation

thumb generation

wp mirror
Dear Brian,

On 9/13/15, Brian Wolff <[hidden email]> wrote:
> On 9/12/15, wp mirror <[hidden email]> wrote:
>> 0) Context
>>
>> I am currently developing new features for WP-MIRROR (see <
>> https://www.mediawiki.org/wiki/Wp-mirror>).
>>
>> 1) Objective
>>
>> I would like WP-MIRROR to generate all image thumbs during the mirror
>> build process. This is so that MediaWiki can render pages quickly using
>> precomputed thumbs.
>>
>> 2) Dump importation
>>
>> maintenance/importDump.php - this computes thumbs during importation, but
>> is too slow.
>> mwxml2sql - loads databases quickly, but does not compute thumbs.
>>
>> 3) Question
>>
>> Is there a way to compute all the thumbs after loading databases quickly
>> with mwxml2sql?
>>
>> Sincerely Yours,
>> Kent
>
> Hi. My understanding is that wp-mirror sets up a MediaWiki instance
> for rendering the mirror. One solution would be to set up 404-thumb
> rendering. This makes it so that instead of pre-rendering the needed
> thumbs, MediaWiki will render the thumbs on-demand whenever the web
> browser requests a thumb. There are instructions for how this works at
> https://www.mediawiki.org/wiki/Manual:Thumb.php. This is probably
> the best solution to your problem.

Right. Currently, wp-mirror does set up MediaWiki to use 404-thumb
rendering.
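
For reference, the setup looks roughly like this (a sketch; paths are
illustrative, and Manual:Thumb.php is the authoritative source):

    # Apache config (or .htaccess) covering the images/ directory:
    # serve any missing thumbnail via MediaWiki's 404 handler
    ErrorDocument 404 /w/thumb_handler.php

    # LocalSettings.php: skip thumbnail rendering at parse time
    $wgGenerateThumbnailOnParse = false;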

This works fine, but can cause a few seconds of latency when rendering
pages. Also, it would be nice to be able to generate thumb dump tarballs,
just like we used to generate original-size media dump tarballs. I would
like wp-mirror to have such dump features.

> Otherwise, MW needs to know what thumbs are needed for all pages,
> which involves parsing pages (e.g. via refreshLinks.php). This is a
> very slow process. If you already had all the thumbnails generated,
> you could perhaps just copy over the thumb directory, but I'm not sure
> where you would get a pre-generated thumb directory.

Wp-mirror does load the *links.sql.gz dump files into the *links tables,
because this method is two orders of magnitude faster than
maintenance/refreshLinks.php.
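
(Concretely, each load is just something like the following, with the wiki
name, dump date, and database name as placeholders:)

    # load the imagelinks dump (filename and dbname are placeholders)
    zcat enwiki-YYYYMMDD-imagelinks.sql.gz \
      | mysql --default-character-set=binary wikidb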

> --
> -bawolff

Idea: I am thinking of piping the *pages-articles.xml.bz2 dump file
through an AWK script to write all unique [[File:*]] tags into a file. This
can be done quickly. The question then is: given a file with all the media
tags, how can I generate all the thumbs? Which MediaWiki function should I
call? Can this be done using the web API? Any other ideas?
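
To make the extraction step concrete, a rough gawk sketch (the dump
filename is a placeholder, and wikitext quirks such as templates and
localized namespace names are ignored):

    bzcat enwiki-YYYYMMDD-pages-articles.xml.bz2 \
      | gawk 'BEGIN { IGNORECASE = 1 }
              { while (match($0, /\[\[(File|Image):[^]|]+/)) {
                  # print the link target without the leading [[
                  print substr($0, RSTART + 2, RLENGTH - 2)
                  $0 = substr($0, RSTART + RLENGTH)
                } }' \
      | sort -u > media-tags.txt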

Sincerely Yours,
Kent

Re: thumb generation

Platonides
On 15/09/15 01:34, wp mirror wrote:
> Idea: I am thinking of piping the *pages-articles.xml.bz2 dump file
> through an AWK script to write all unique [[File:*]] tags into a file. This
> can be done quickly. The question then is: given a file with all the media
> tags, how can I generate all the thumbs? Which MediaWiki function should I
> call? Can this be done using the web API? Any other ideas?
>
> Sincerely Yours,
> Kent

You know it will fail for all kinds of images included through templates
(particularly infoboxes), right?



Re: thumb generation

Gergo Tisza
On Mon, Sep 14, 2015 at 4:49 PM, Platonides <[hidden email]> wrote:

> You know it will fail for all kinds of images included through templates
> (particularly infoboxes), right?


Indeed, it is not possible to find out which thumbnails a page uses
without actually parsing it. Your best bet is to wait until Parsoid dumps
become available (T17017 <https://phabricator.wikimedia.org/T17017>), then
go through those with an XML parser and extract the thumb URLs. That's
still slow, but not as slow as the MediaWiki parser. (Or you can try to
find a regexp that matches thumbnail URLs, but we all know what happens
<http://stackoverflow.com/a/1732454/323407> when you use a regexp to parse
HTML.) After that, just throw those URLs at the 404 handler.
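
A minimal sketch of that last step, assuming the extracted URLs sit one
per line in a file (thumb-urls.txt is a hypothetical name):

    # fetch each thumb URL once; the 404 handler renders and stores it
    xargs -n 1 -P 4 curl -s -o /dev/null < thumb-urls.txt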