finding the "most recognizable" page names


Michael Katz-4
I'm making a crossword-style word game, and I'm trying to automate the process of creating the puzzles, at least somewhat.

I am hoping to find or create a list of English Wikipedia page titles, sorted roughly by how "recognizable" they are, where by recognizable I mean something like, "how likely it is that the average American on the street will be familiar with the name/phrase/subject".


For instance, just to take a random example, on a recognizability scale from 0 to 100, I might score (just guessing here):


    Lady_Gaga = 90

    Lady_Jane_Grey = 10

    Lady_and_the_Tramp = 90

    Lady_Antebellum = 5

    Lady-in-waiting = 70

    Lady_Bird_Johnson = 65

    Lady_Marmalade = 10

    Ladysmith_Black_Mambazo = 10


One suggestion would just be to use the page length (either number of characters or physical rendered page length) as a proxy for recognizability. That might work, but it feels kind of crude, and certainly would get many false positives, such as Bose-Einstein_condensation.

Someone suggested to me that I might count incoming page links, and referred me to http://dumps.wikimedia.org/enwiki/latest/ and in particular the file enwiki-latest-pagelinks.sql.gz. I downloaded and looked at that file but couldn't understand whether/how the linking structure was represented.

So my questions are:

(1) Do you know if a list like the one I'm trying to make already exists?

(2) If you were going to make a list like this, how would you do it? If it was based on page length, which files would you download and process to make it as efficient as possible? If it was based on incoming links, which files specifically would you use, and how would you determine the link count?

Thanks for any help.
_______________________________________________
WikiEN-l mailing list
[hidden email]
To unsubscribe from this mailing list, visit:
https://lists.wikimedia.org/mailman/listinfo/wikien-l

Re: finding the "most recognizable" page names

WereSpielChequers-2
Hi Michael,

I don't know if such a list exists, other than lists by largest numbers of
views.

Size of article probably relates to the interest of one or a few editors and
the complexity of the information; I doubt it would closely relate to
recognisability. Incoming links are probably better, but they can get awfully
skewed by templates, and some links are more meaningful than others.
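
A minimal sketch of the incoming-link count, for anyone unsure how the links are represented in enwiki-latest-pagelinks.sql.gz. It assumes the older three-column row layout (pl_from, pl_namespace, pl_title) inside ordinary SQL INSERT statements, where pl_namespace and pl_title describe the link target; newer dumps may be laid out differently, so check a few lines of the file first. The function name and sample rows are made up for illustration.

```python
import re
from collections import Counter

# Matches one (pl_from, pl_namespace, 'pl_title') tuple.
# Assumption: three-column layout with MySQL-style backslash escapes.
ROW = re.compile(r"\((\d+),(-?\d+),'((?:[^'\\]|\\.)*)'\)")

def count_incoming_links(sql_lines):
    """Count how many link rows point at each article-namespace title."""
    counts = Counter()
    for line in sql_lines:
        if not line.startswith("INSERT INTO"):
            continue  # skip schema definitions and comments
        for _source_id, namespace, title in ROW.findall(line):
            if namespace == "0":  # namespace 0 holds ordinary articles
                counts[title] += 1
    return counts

# Made-up sample rows; a real run would iterate over
# gzip.open(path, "rt", encoding="utf-8", errors="replace") instead.
sample = [
    "INSERT INTO `pagelinks` VALUES (10,0,'Lady_Gaga'),(11,0,'Lady_Gaga'),"
    "(12,0,'Lady_Jane_Grey'),(13,10,'Infobox');",
]
link_counts = count_incoming_links(sample)
print(link_counts.most_common(2))
```

This counts every link row equally, so it does nothing about the template skew mentioned above; links that arrive via a navigation template are indistinguishable from links written in the article text.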

Recognisable in the USA is not necessarily the same as recognisable
globally. Ideally, if you want a US-specific list you need US-specific data;
if you use a global list you could wind up asking Americans about Johnny
Vegas, Abi Titmuss, Jack Straw and Kevin Pietersen. You might also consider
the generation you are targeting. Lady_Bird_Johnson would be better known
among Americans and older people.

I'd suggest using metrics of page views per article, and if you want a
specifically US product, screen out articles that don't use American English
spelling. Better still would be to get page views from the USA, or at least
page views ignoring the 6 hours when the US is most likely to be asleep.
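
The "ignore the hours when the US is asleep" idea could be sketched as a filter over hourly page-view files. Everything specific here is an assumption: the file naming (pagecounts-YYYYMMDD-HHMMSS.gz, with the time in UTC) and the choice of 07:00 to 12:59 UTC as the six hours to drop, which is roughly 2 a.m. to 8 a.m. US Eastern standard time.

```python
import re

# Hypothetical "US asleep" window: the six UTC hours 07 through 12.
US_ASLEEP_UTC_HOURS = set(range(7, 13))

# Assumed naming scheme for hourly files: pagecounts-YYYYMMDD-HHMMSS.gz
NAME = re.compile(r"pagecounts-(\d{8})-(\d{2})\d{4}\.gz$")

def keep_file(filename):
    """Return True if the file's UTC hour is outside the asleep window."""
    match = NAME.search(filename)
    if match is None:
        return False  # unrecognised name: leave it out rather than guess
    return int(match.group(2)) not in US_ASLEEP_UTC_HOURS

files = [
    "pagecounts-20110930-030000.gz",
    "pagecounts-20110930-090000.gz",
    "pagecounts-20110930-180000.gz",
]
kept = [name for name in files if keep_file(name)]
print(kept)
```

This only down-weights non-US traffic; views from the rest of the world during US waking hours are still counted.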

WereSpielChequers

On 30 September 2011 04:17, Michael Katz <[hidden email]> wrote:

> I'm making a crossword-style word game, and I'm trying to automate the
> process of creating the puzzles, at least somewhat.

Re: finding the "most recognizable" page names

Doc glasgow


> -----Original Message-----
> From: [hidden email] [mailto:wikien-l-
> [hidden email]] On Behalf Of WereSpielChequers
> Sent: 30 September 2011 10:56
> To: Michael Katz; English Wikipedia
> Subject: Re: [WikiEN-l] finding the "most recognizable" page names
>
> I'd suggest using metrics of page views per article, and if you want a
> specifically US product screen out articles that don't use American
> English
> spelling. Better still would be to get page views from the USA, or at
> least
> page views ignoring the 6 hours when the US is most likely to be asleep.
>
> WereSpielChequers
>

Removing non-US spellings would also distort. You would dismiss "Tony
Blair" while keeping "Tony Boselli", and also remove articles like
"Scotland", "Queen Elizabeth II" and "George III of the United Kingdom", all
of which might conceivably have some recognition in the US.



Re: finding the "most recognizable" page names

WereSpielChequers-2
In reply to this post by WereSpielChequers-2
Yes, but for the purpose of creating a game that may not be an
issue. Michael asked how to get a list of recognisable topics to build a
game with, not how to list all 3.7 million article names in order of
recognisability.

WereSpielChequers

On 30 September 2011 11:11, Scott MacDonald <[hidden email]> wrote:

> Removing non-US spellings would also distort. You would dismiss  "Tony
> Blair", while keeping "Tony Boselli" and also remove articles like
> "Scotland" "Queen Elizabeth II" and "George III of the United Kingdom" -
> all
> of which might conceivably have some recognition in the US.

Re: finding the "most recognizable" page names

Michael Katz-4
In reply to this post by WereSpielChequers-2
Thanks for the reply. Can you tell me exactly which dump files you'd look in to find the number of page views, plus any information about finding the page views within those files, if it's not obvious? Is there a way to distinguish between editor page views and user page views? (Perhaps subtract the number of edits made? If so, how can I find the number of edits made?)

Something about page views seems a little funny, because it seems like there are some very recognizable things that just aren't looked up much. But perhaps it's my best hope...




Re: finding the "most recognizable" page names

Ian Woollard
The raw dumps are here:

http://dammit.lt/wikistats/

IIRC the compressed files consist of the list of the articles that were
accessed, in the order they were retrieved. You have to process them to
count how often each article was read.
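
A minimal sketch of that counting step. It assumes each line of an hourly file has four space-separated fields, `project title views bytes` (for example `en Lady_Gaga 1200 34567890`); that layout is an assumption, so check a file before relying on it. If the files turn out to be a plain list of accessed titles, as recalled above, feeding the titles straight into a Counter does the same job. The function name and sample lines are made up for illustration.

```python
from collections import Counter

def add_hourly_counts(lines, totals, project="en"):
    """Add one hourly file's view counts for `project` into `totals`."""
    for line in lines:
        parts = line.rstrip("\n").split(" ")
        if len(parts) != 4 or parts[0] != project:
            continue  # malformed line or a different wiki
        _project, title, views, _size = parts
        if views.isdigit():
            totals[title] += int(views)
    return totals

# Two made-up "hours" of data; a real run would loop over
# gzip.open(path, "rt", encoding="utf-8", errors="replace") per file.
hour_1 = ["en Lady_Gaga 1200 34567890\n", "en Lady_Jane_Grey 40 123456\n",
          "de Lady_Gaga 300 999999\n"]
hour_2 = ["en Lady_Gaga 900 23456789\n", "garbled line\n"]

totals = Counter()
for hour in (hour_1, hour_2):
    add_hourly_counts(hour, totals)
print(totals.most_common())
```

Summing over a month or more of hourly files should smooth out news-driven spikes, though it still measures what people look up rather than what they would recognise.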

Of course:

http://stats.grok.se/

has done that heavy lifting already and they keep lists of the most popular
articles.

On 30 September 2011 18:53, Michael Katz <[hidden email]> wrote:

> Thanks for the reply. Can you tell me exactly which dump files you'd look
> in to find the number of page views, plus any information about finding the
> page views within those files, if it's not obvious? Is there a way to
> distinguish between editor page views and user page views? (Perhaps subtract
> the number of edits made? If so, how can I find the number of edits made?)
>
> Something about page views seems a little funny, because it seems like
> there are some very recognizable things that just aren't looked up much. But
> perhaps it's my best hope...



--
-Ian Woollard

Re: finding the "most recognizable" page names

Bob the Wikipedian
You might also consider http://buzzlog.yahoo.com/overall/ which lists
the topics the world is searching for.

Bob

On 9/30/2011 1:24 PM, Ian Woollard wrote:

> The raw dumps are here:
>
> http://dammit.lt/wikistats/
>
> IIRC the compressed files consist of the list of the articles that were
> accessed, in the order they were retrieved. You have to process them to
> count how often each article was read.
>
> Of course:
>
> http://stats.grok.se/
>
> has done that heavy lifting already and they keep lists of the most popular
> articles.