Populating PageImages data

Max Semenik
A month ago, the PageImages extension[1] was black-deployed, intended to
automatically associate images with articles. It populates its data
when LinksUpdate is run, i.e. when a page or a template it transcludes
is edited or purged. Since then, most pages have been re-parsed; however,
slightly less than a million English WP articles remain:

select count(*), avg(page_len) from page where page_namespace=0 and page_is_redirect=0 and page_touched < '20121229000000';
+----------+---------------+
| count(*) | avg(page_len) |
+----------+---------------+
|   977568 |     3172.0948 |
+----------+---------------+
1 row in set (5 min 59.55 sec)

Waiting for these pages to be updated naturally could take forever:

select min(page_touched) from page where page_namespace=0 and page_is_redirect=0;
+-------------------+
| min(page_touched) |
+-------------------+
| 20090714142954    |
+-------------------+
1 row in set (2 min 15.13 sec)

That was [2] before I purged it: obscure topic, no templates.
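
For readers unfamiliar with the page_touched values above: they are 14-digit strings in YYYYMMDDHHMMSS form, which is why a plain `<` against '20121229000000' works as a date cutoff. A minimal Python sketch of that reasoning follows; the helper names are made up for illustration and are not part of MediaWiki.

```python
from datetime import datetime

# Layout of the 14-digit timestamps seen in page_touched.
MW_TS_FORMAT = "%Y%m%d%H%M%S"


def parse_mw_timestamp(ts: str) -> datetime:
    """Turn a 14-digit timestamp string into a datetime (hypothetical helper)."""
    return datetime.strptime(ts, MW_TS_FORMAT)


def is_older_than(ts: str, cutoff: str) -> bool:
    """Fixed-width, zero-padded digits sort chronologically as plain strings."""
    return ts < cutoff


oldest = "20090714142954"   # min(page_touched) from the query above
cutoff = "20121229000000"   # cutoff used in the first query

print(parse_mw_timestamp(oldest))
print(is_older_than(oldest, cutoff))
```

The string comparison and the datetime comparison agree only because every field is zero-padded to a fixed width.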

Thus, I would like to populate this data with a script[3]. To ease
concerns, note that these pages have almost no templates and are
significantly smaller than average (3172 bytes vs. 5673), so most of
them should be fast to parse.

Is running it a good idea?
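
To put the numbers above in perspective, here is a rough back-of-the-envelope calculation in Python. The page count and averages are the ones quoted in this message; the `batches` helper is purely hypothetical and only illustrates working through a large ID list in fixed-size chunks. It is not taken from the script in [3].

```python
from typing import Iterable, Iterator, List

STALE_PAGES = 977568        # count(*) from the first query
AVG_STALE_LEN = 3172.0948   # avg(page_len) of those pages, in bytes
AVG_ALL_LEN = 5673          # average article length quoted above, in bytes

# Total wikitext that a full re-parse of the stale pages would cover.
total_bytes = STALE_PAGES * AVG_STALE_LEN
# How the stale pages compare in size with the average article.
size_ratio = AVG_STALE_LEN / AVG_ALL_LEN


def batches(page_ids: Iterable[int], size: int) -> Iterator[List[int]]:
    """Yield consecutive chunks of at most `size` IDs (hypothetical helper)."""
    batch: List[int] = []
    for page_id in page_ids:
        batch.append(page_id)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


print(f"about {total_bytes / 1e9:.1f} GB of wikitext in total")
print(f"stale pages are about {size_ratio:.0%} of the average size")
print(list(batches(range(5), 2)))
```

So the whole job covers roughly 3 GB of wikitext, on pages a bit more than half the average size.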

-----
[1] https://www.mediawiki.org/wiki/Extension:PageImages
[2] https://en.wikipedia.org/wiki/City_of_Melbourne_election,_2008
[3] https://gerrit.wikimedia.org/r/gitweb?p=mediawiki/extensions/PageImages.git;a=blob;f=initImageData.php;hb=HEAD

--
Best regards,
  Max Semenik ([[User:MaxSem]])


_______________________________________________
Wikitech-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: Populating PageImages data

MZMcBride-2
Max Semenik wrote:
>A month ago, PageImages extension was black-deployed, intended to
>automatically associate images with articles.

I looked at <https://www.mediawiki.org/wiki/Extension:PageImages> and I'm
still having difficulty understanding this extension's purpose. Is there a
related bug or request for comment (RFC) for this?

>select count(*), avg(page_len) from page where page_namespace=0 and
>page_is_redirect=0 and page_touched < '20121229000000';
>+----------+---------------+
>| count(*) | avg(page_len) |
>+----------+---------------+
>|   977568 |     3172.0948 |
>+----------+---------------+
>1 row in set (5 min 59.55 sec)

select count(*) from page where page_namespace=0 and page_is_redirect=1
and page_touched < '20120101000000';
+----------+
| count(*) |
+----------+
|       16 |
+----------+
1 row in set (26.61 sec)

I ran a script in December 2012 on the English Wikipedia that updated the
page_touched date of every redirect in NS:0 (and a few other namespaces, I
believe) where the page_touched date was not like '2012%'. I'd considered
running the same script on non-redirects. It turns out that if you take
the stored wikitext of pages and echo (post) it back at the wiki via the
edit action a few million times, you can discover some interesting bugs.

>Thus, I would like to populate this data with a script[3]. To reduce
>the scare, let me remark that these pages have almost no templates and
>are significantly smaller than average: 3172 bytes vs. 5673 so they
>should be mostly fast to parse.

I don't think there's any reason to be scared here.

MZMcBride




Re: Populating PageImages data

Brian Wolff
In reply to this post by Max Semenik
On 1/31/13, Max Semenik <[hidden email]> wrote:

> A month ago, the PageImages extension[1] was black-deployed, intended to
> automatically associate images with articles. It populates its data
> when LinksUpdate is run, i.e. when a page or a template it transcludes
> is edited or purged. Since then, most pages have been re-parsed; however,
> slightly less than a million English WP articles remain:
>
> select count(*), avg(page_len) from page where page_namespace=0 and
> page_is_redirect=0 and page_touched < '20121229000000';
> +----------+---------------+
> | count(*) | avg(page_len) |
> +----------+---------------+
> |   977568 |     3172.0948 |
> +----------+---------------+
> 1 row in set (5 min 59.55 sec)
[..]

You do realize that page_touched gets updated by a bunch of things,
many of which do not cause a LinksUpdate to happen? So running the
script as you proposed will not populate the table for all data.

Of course there really isn't any way to figure out when the last
LinksUpdate happened, so I suppose page_touched is as close as we can
get. I guess in most cases if something has had its page_touched
updated by a non-LinksUpdate event, that probably means people
actually look at the article, so someone has or will probably edit the
article soon.
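
As an aside on forcing a LinksUpdate for pages one already knows about: the web API's action=purge accepts a forcelinkupdate parameter. The Python sketch below only builds the request parameters and sends nothing; the 50-titles-per-request chunk size is an assumption about typical non-bot limits, and the function name is made up. It is not what the script proposed in this thread does.

```python
from typing import Dict, Iterator, List

# Assumed per-request cap on titles for accounts without high API limits.
MAX_TITLES_PER_REQUEST = 50


def purge_requests(titles: List[str]) -> Iterator[Dict[str, str]]:
    """Yield POST parameter sets asking the API to purge pages and
    update their links tables, one set per chunk of titles."""
    for start in range(0, len(titles), MAX_TITLES_PER_REQUEST):
        chunk = titles[start:start + MAX_TITLES_PER_REQUEST]
        yield {
            "action": "purge",
            "forcelinkupdate": "1",
            "titles": "|".join(chunk),   # the API separates titles with "|"
            "format": "json",
        }


requests_to_send = list(purge_requests([f"Page {i}" for i in range(120)]))
print(len(requests_to_send))
```

Going through the web API this way would be far slower than a maintenance script for a million pages; it is only practical for small, known sets of titles.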

--bawolff


Re: Populating PageImages data

Max Semenik
In reply to this post by MZMcBride-2
On 01.02.2013, 9:21 MZMcBride wrote:

> Max Semenik wrote:
>>A month ago, PageImages extension was black-deployed, intended to
>>automatically associate images with articles.

> I looked at <https://www.mediawiki.org/wiki/Extension:PageImages> and I'm
> still having difficulty understanding this extension's purpose.

It returns thumbnails associated with articles, attempting to return
only meaningful images, not ones from maintenance templates, stubs or
flag icons.

>  Is there a
> related bug or request for comment (RFC) for this?

A bug or an RFC is not required for WMF devs to work on something; we
tend to do what our bosses say :)

--
Best regards,
  Max Semenik ([[User:MaxSem]])



Re: Populating PageImages data

John Doe-27
I think there are still some serious issues with this extension. I
have checked several pages using the max limit parameter, and all
it returns is a single thumb.



Re: Populating PageImages data

Max Semenik
On 01.02.2013, 18:14 John wrote:

> I think there are still some serious issues with this extension, I
> have checked several pages, and used the max limit parameter and all
> it returns is a single thumb

That's the point. If you want to enumerate all images on a page,
there's prop=images. PageImages returns just one thumbnail, the most
appropriate one.
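
To make the difference concrete, here is a small Python sketch that builds the two kinds of query URL without sending them. The pi* and im* parameter names are the ones I believe the two modules use; the thumbnail size, limit, and function names are arbitrary choices for illustration.

```python
from urllib.parse import urlencode

API = "https://en.wikipedia.org/w/api.php"


def pageimages_url(title: str, thumb_size: int = 100) -> str:
    """URL asking PageImages for the single representative thumbnail."""
    params = {
        "action": "query",
        "format": "json",
        "prop": "pageimages",
        "piprop": "thumbnail|name",
        "pithumbsize": thumb_size,
        "titles": title,
    }
    return API + "?" + urlencode(params)


def all_images_url(title: str, limit: int = 50) -> str:
    """URL asking core prop=images for every file used on the page."""
    params = {
        "action": "query",
        "format": "json",
        "prop": "images",
        "imlimit": limit,
        "titles": title,
    }
    return API + "?" + urlencode(params)


print(pageimages_url("Louis Bonaparte"))
print(all_images_url("Louis Bonaparte"))
```

The first URL should return at most one thumbnail per title; the second lists files regardless of whether they are icons or content images.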

--
Best regards,
  Max Semenik ([[User:MaxSem]])


_______________________________________________
Wikitech-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: Populating PageImages data

John Doe-27
It's broken: on pages where there are multiple images, it just shows the
first one.


Re: Populating PageImages data

Happy Melon-2
But not simply the first image to be found in the source, which in many
cases is the icon in a maintenance template or top icon.  For
https://en.wikipedia.org/wiki/Louis_Bonaparte, for instance, the image
returned is correctly the one from the infobox, not the
book-with-question-mark icon from the needs-more-references template.
There's still room for improvement, for sure; but it's definitely a
legitimate piece of data to want to collect.

--HM



Re: Populating PageImages data

Brad Jorsch (Anomie)
In reply to this post by Max Semenik
On Fri, Feb 1, 2013 at 10:03 AM, Max Semenik <[hidden email]> wrote:
> On 01.02.2013, 18:14 John wrote:
>
>> I think there are still some serious issues with this extension, I
>> have checked several pages, and used the max limit parameter and all
>> it returns is a single thumb
>
> That's the point. If you want to enumerate all images on a page,
> there's prop=images. PageImages returns just 1, most appropriate,
> thumb.

That could have been made a lot more clear, both in the documentation
and in the name of the module itself ("pageimages" implies more than
one per page).

Also, BTW, the API module could use implementing getExamples() and
getHelpUrls(). And I wonder if there's a reason it uses 50 and 100
rather than ApiBase::LIMIT_SMALL1 and ApiBase::LIMIT_SMALL2 for the
limit. And why it defaults to 1 rather than 10 like pretty much
everything else.
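
For context on the limit question, here is a rough Python model of how a small-limit API parameter is usually resolved. The values 50 and 100 match the literals mentioned above and, as far as I recall, the values of ApiBase::LIMIT_SMALL1 and ApiBase::LIMIT_SMALL2; the default of 1 is the one discussed above. The function is an illustration only, not MediaWiki code.

```python
from typing import Optional, Union

LIMIT_SMALL1 = 50    # ceiling for ordinary users (assumed value)
LIMIT_SMALL2 = 100   # ceiling for users with high API limits (assumed value)
DEFAULT_LIMIT = 1    # the module's default, per the discussion above


def effective_limit(requested: Optional[Union[int, str]],
                    has_high_limits: bool) -> int:
    """Resolve a requested limit to the value that would be applied."""
    ceiling = LIMIT_SMALL2 if has_high_limits else LIMIT_SMALL1
    if requested is None:
        return DEFAULT_LIMIT          # parameter omitted
    if requested == "max":
        return ceiling                # "max" means the caller's ceiling
    return max(1, min(int(requested), ceiling))


print(effective_limit(None, False))
print(effective_limit("max", False))
print(effective_limit(500, True))
```

Using the named constants instead of the literals changes nothing numerically; it only ties the module to the shared definitions.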


Re: Populating PageImages data

Max Semenik
On 01.02.2013, 19:40 Brad wrote:

> On Fri, Feb 1, 2013 at 10:03 AM, Max Semenik <[hidden email]> wrote:
>> On 01.02.2013, 18:14 John wrote:
>>
>>> I think there are still some serious issues with this extension, I
>>> have checked several pages, and used the max limit parameter and all
>>> it returns is a single thumb
>>
>> That's the point. If you want to enumerate all images on a page,
>> there's prop=images. PageImages returns just 1, most appropriate,
>> thumb.

> That could have been made a lot more clear, both in the documentation
> and in the name of the module itself ("pageimages" implies more than
> one per page).


> And I wonder if there's a reason it uses 50 and 100
> rather than ApiBase::LIMIT_SMALL1 and ApiBase::LIMIT_SMALL2 for the
> limit. And why it defaults to 1 rather than 10 like pretty much
> everything else.

Because with File::transform()'s worst-case performance, 500 is too
much.



--
Best regards,
  Max Semenik ([[User:MaxSem]])



Re: Populating PageImages data

Brad Jorsch (Anomie)
On Fri, Feb 1, 2013 at 4:31 PM, Brad Jorsch <[hidden email]> wrote:
> On Fri, Feb 1, 2013 at 11:16 AM, Max Semenik <[hidden email]> wrote:
>> Because with File::transform()'s worst-case performance, 500 is too
>> much.
>
> Perhaps we should patch ApiQueryImageInfo too then.
>
> Although https://en.wikipedia.org/w/api.php?format=jsonfm&action=query&generator=allimages&gailimit=500&prop=imageinfo&iiurlwidth=50&iiprop=url|size|dimensions
> didn't seem bad. Must not be hitting the worst case.

https://gerrit.wikimedia.org/r/#/c/47189/
