Identifying Wikipedia stubs in various languages


Identifying Wikipedia stubs in various languages

Robert West
Hi everyone,

Does anyone know if there's a straightforward (ideally language-independent) way of identifying stub articles in Wikipedia?

Whatever works is ok, whether it's publicly available data or data accessible only on the WMF cluster.

I've found lists for various languages (e.g., Italian or English), but the lists are in different formats, so separate code is required for each language, which doesn't scale.

I guess in the worst case, I'll have to grep for the respective stub templates in the respective wikitext dumps, but even this requires knowing, for each language, what the respective template is. So if anyone could point me to a list of stub templates in different languages, that would also be appreciated.
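To make that concrete, here is the kind of dump scan I have in mind (just a sketch; the per-language template names are exactly the list I'm missing):

```python
import bz2
import re

def find_stub_titles(dump_path, template_name):
    """Scan a bz2-compressed pages-articles XML dump line by line and
    yield titles of pages whose wikitext uses the given stub template.
    A sketch only: it assumes <title> appears on its own line before
    the page text, as in the standard dumps."""
    stub_re = re.compile(r"\{\{\s*%s\s*(?:\||\}\})" % re.escape(template_name),
                         re.IGNORECASE)
    title = None
    with bz2.open(dump_path, "rt", encoding="utf-8") as f:
        for line in f:
            m = re.search(r"<title>(.*?)</title>", line)
            if m:
                title = m.group(1)
            elif title and stub_re.search(line):
                yield title
                title = None  # report each page at most once
```

(On itwiki I believe the stub template is {{S}}, but that's precisely the sort of per-language knowledge I'd rather not have to hard-code.)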

Thanks!
Bob

--
Up for a little language game? -- http://www.unfun.me

_______________________________________________
Wiki-research-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l

Re: Identifying Wikipedia stubs in various languages

Stuart A. Yeates
en:WP:DYK has a measure of 1,500+ characters of prose, which is a useful cutoff. There is weaponised JavaScript to measure that at en:WP:Did you know/DYKcheck.

It probably doesn't translate to CJK languages, which have radically different information content per character.
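The gist, approximated over raw wikitext (DYKcheck itself measures the rendered page, so treat this as a crude stand-in):

```python
import re

PROSE_CUTOFF = 1500  # characters of prose, per en:WP:DYK

def prose_length(wikitext):
    """Crude prose-length estimate over raw wikitext: strip refs,
    templates, tables, file/category links and inline markup, then
    count the characters that remain."""
    text = re.sub(r"<ref[^>]*>.*?</ref>", "", wikitext, flags=re.S)
    text = re.sub(r"<ref[^>]*/>", "", text)
    for _ in range(3):  # peel a few levels of template nesting
        text = re.sub(r"\{\{[^{}]*\}\}", "", text)
    text = re.sub(r"\{\|.*?\|\}", "", text, flags=re.S)             # tables
    text = re.sub(r"\[\[(?:File|Image|Category):[^\]]*\]\]", "", text)
    text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]", r"\1", text)   # keep link labels
    text = re.sub(r"'{2,}", "", text)                               # bold/italic
    text = re.sub(r"^=+.*?=+\s*$", "", text, flags=re.M)            # headings
    return len(" ".join(text.split()))

def below_dyk_cutoff(wikitext):
    return prose_length(wikitext) < PROSE_CUTOFF
```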

cheers
stuart

--
...let us be heard from red core to black sky

On Tue, Sep 20, 2016 at 9:26 PM, Robert West <[hidden email]> wrote:

Re: Identifying Wikipedia stubs in various languages

Morten Wang
I don't know of a clean, language-independent way of grabbing all stubs. Stuart's suggestion is quite sensible, at least for English Wikipedia. When I last checked a few years ago, the mean length of an English-language stub (on a log scale) was around 1 kB (including all markup), and they're much smaller than any other class.

I'd also see if the category system allows for some straightforward retrieval. English has https://en.wikipedia.org/wiki/Category:Stub_categories and https://en.wikipedia.org/wiki/Category:Stubs, with quite a lot of links to other languages, which could be a good starting point. For some of the research we've done on quality, exploiting regularities in the category system through database access (in other words, LIKE queries) is a quick way to grab most articles.
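As a sketch against the replica schema (the "_stubs" suffix is the English naming convention and a per-language assumption):

```python
def stub_category_sql(suffix="_stubs"):
    """Build a query for the Wikimedia database replicas that leans on
    the regular naming of stub categories (every English stub category
    name ends in '_stubs'). Table and column names follow the MediaWiki
    schema (page, categorylinks); the suffix varies per language.
    N.b. '_' is itself a single-character LIKE wildcard, so escape it
    if you need an exact match."""
    return (
        "SELECT DISTINCT page_title "
        "FROM page "
        "JOIN categorylinks ON cl_from = page_id "
        "WHERE page_namespace = 0 "
        "AND cl_to LIKE '%{}'".format(suffix)
    )
```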

A combination of both approaches might work well. If you're looking for even more thorough classification, grabbing a labelled set and training a classifier might be the way to go.


Cheers,
Morten


On 20 September 2016 at 02:40, Stuart A. Yeates <[hidden email]> wrote:

Re: Identifying Wikipedia stubs in various languages

Stuart A. Yeates
You _really_ need to exclude markup and include only body text when measuring stubs. It's not uncommon for mass-produced articles with only one or two sentences of text to approach 1K characters once you include maintenance templates, content templates, categories, infoboxes, references, etc.
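A toy illustration, with deliberately crude stripping:

```python
import re

# A made-up, markup-heavy one-sentence article of the mass-produced kind.
article = (
    "{{Infobox settlement|name=Foo|population=123}}\n"
    "'''Foo''' is a village.<ref>{{cite web|url=http://example.org"
    "|title=Foo}}</ref>\n"
    "{{France-geo-stub}}\n"
    "[[Category:Villages in France]]"
)

def body_text(wikitext):
    """Strip refs, templates and category links; keep the prose."""
    text = re.sub(r"<ref[^>]*>.*?</ref>", "", wikitext, flags=re.S)
    for _ in range(2):  # peel two levels of template nesting
        text = re.sub(r"\{\{[^{}]*\}\}", "", text)
    text = re.sub(r"\[\[Category:[^\]]*\]\]", "", text)
    return " ".join(text.replace("'''", "").split())

print(len(article), "->", len(body_text(article)))
```

Here the raw length is roughly ten times the prose length, so a 1K byte cutoff on raw wikitext would badly misjudge this page.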

cheers
stuart

--
...let us be heard from red core to black sky

On Wed, Sep 21, 2016 at 5:01 AM, Morten Wang <[hidden email]> wrote:

Re: Identifying Wikipedia stubs in various languages

Andrew Gray-3
In reply to this post by Morten Wang
Hi all,

I'd strongly caution against using the stub categories without *also*
doing some kind of filtering on size. There's a real problem with
"stub lag": articles get tagged, incrementally improve, and no one
thinks they've done enough to justify removing the tag (or notices
the tag is there, or thinks they're allowed to remove it)... and you
end up with a lot of multi-section pages with a good hundred words of
text still labelled "stub".

(Talkpage ratings are even worse for this, but that's another issue.)

Andrew.

On 20 September 2016 at 18:01, Morten Wang <[hidden email]> wrote:

--
- Andrew Gray
  [hidden email]


Fwd: [Analytics] Identifying Wikipedia stubs in various languages

Giuseppe Profiti
In reply to this post by Robert West
[forwarding my answer from the analytics mailing list; I forgot to subscribe to this list too]

Hi Robert,
one solution may be to use a Wikidata query to retrieve the name of
the stub category in all the different languages. Then you could use
a tool like PetScan to retrieve all the pages in those categories, or
write your own tool using either a database query or the MediaWiki
API.
You can find a sample solution here:
http://paws-public.wmflabs.org/paws-public/3270/Stub%20categories.ipynb
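A stdlib-only sketch of the same idea through the MediaWiki API's interlanguage links (an alternative to the Wikidata query; the JSON parsing is split out so it can be sanity-checked offline, and the default JSON format is assumed):

```python
import json
import urllib.parse
import urllib.request

def langlinks_url(title="Category:Stubs", lang="en"):
    """Build a MediaWiki API URL asking for the interlanguage links
    of a page (action=query, prop=langlinks)."""
    params = urllib.parse.urlencode({
        "action": "query",
        "titles": title,
        "prop": "langlinks",
        "lllimit": "max",
        "format": "json",
    })
    return "https://{}.wikipedia.org/w/api.php?{}".format(lang, params)

def parse_langlinks(response):
    """Extract {language code: category title} from the API response."""
    page = next(iter(response["query"]["pages"].values()))
    return {ll["lang"]: ll["*"] for ll in page.get("langlinks", [])}

# Network usage (not run here):
# with urllib.request.urlopen(langlinks_url()) as f:
#     names = parse_langlinks(json.load(f))
```

From there, each per-language category name can be fed to PetScan or a database query as above.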

I wrote that thing while on a train, so it may be messy and/or sub-optimal.
I would like to thank Alex Monk and Yuvi Panda for their help with SQL
on paws today.

Best,
Giuseppe

2016-09-20 11:26 GMT+02:00 Robert West <[hidden email]>:


Re: Fwd: [Analytics] Identifying Wikipedia stubs in various languages

Robert West
Thanks a bunch to everyone who chimed in here. These hints moved us
forward quite a bit!

Bob

On Wed, Sep 21, 2016 at 12:50 AM, Giuseppe Profiti
<[hidden email]> wrote:


--
Up for a little language game? -- http://www.unfun.me
