Project exploring automated classification of article importance

Morten Wang
Hello everyone,

I am currently working with Aaron Halfaker and Dario Taraborelli at the
Wikimedia Foundation on a project exploring automated classification of
article importance. Our goal is to characterize the importance of an
article within a given context and design a system to predict a relative
importance rank. We have a project page on meta[1] and welcome comments or
thoughts on our talk page. You can of course also respond here on
wiki-research-l, or send me an email.

Before moving on to model-building I did a fairly thorough literature
review, finding a myriad of papers spanning several disciplines. We have a
draft literature review also up on meta[2], which should give you a
reasonable introduction to the topic. Again, comments or thoughts (e.g.
papers we’ve missed) on the talk page, mailing list, or through email are
welcome.

Links:

   1. https://meta.wikimedia.org/wiki/Research:Automated_classification_of_article_importance
   2. https://meta.wikimedia.org/wiki/Research:Studies_of_Importance

Regards,
Morten
[[User:Nettrom]] aka [[User:SuggestBot]]
_______________________________________________
Wiki-research-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l

Re: Project exploring automated classification of article importance

Pine W
Hi Nettrom,

A few resources from English Wikipedia regarding article importance as
ranked by humans:

https://en.wikipedia.org/wiki/Wikipedia:Vital_articles

https://en.wikipedia.org/wiki/Wikipedia:Version_1.0_Editorial_Team/Release_Version_Criteria#Priority_of_topic

https://en.wikipedia.org/wiki/Wikipedia:WikiProject_assessment#Statistics

I infer from the ENWP Wikicup's scoring protocol that for purposes of the
competition, an article's "importance" is loosely inferred from the number
of language editions of Wikipedia in which the article appears:
https://en.wikipedia.org/wiki/Wikipedia:WikiCup/Scoring#Bonus_points.

HTH,

Pine


On Tue, Apr 18, 2017 at 4:17 PM, Morten Wang <[hidden email]> wrote:


Re: Project exploring automated classification of article importance

Morten Wang
Hi Pine,

These are great pointers to existing practices on enwiki, some of which
I've been looking for and/or missed, thanks!


Cheers,
Morten

On 19 April 2017 at 22:35, Pine W <[hidden email]> wrote:


Re: Project exploring automated classification of article importance

Kerry Raymond
Just a few musings on the issue of Importance and how to research it ...

I agree it is intuitive that importance is likely to be linked to pageviews and inbound links but, as the preliminary experiment showed, it's probably not that simple.

Pageviews tell us something about importance to readers of Wikipedia, while inbound links tell us something about importance to writers of Wikipedia. I suspect that writers are not a proxy for readers, as the editor surveys suggest that Wikipedia writers are not typical of broader society on at least two variables: gender and level of education (there might be others, I can't remember).

But I think importance is a relative metric rather than an absolute one. I think that, by taking the mean value of importance across a number of WikiProjects, the preliminary experiment may have lost something, because it tried (through averaging) to look at importance "generally". I would suspect that an experiment considering only the importance ratings with respect to a single WikiProject would be more likely to show correlation with pageviews (relative to other articles in that same WikiProject) and inbound links. And I think there are two kinds of inbound links to be considered: those coming from other articles within the same WikiProject and those coming from outside that WikiProject. I suspect different insights will be obtained by looking at the two types of inbound links separately rather than treating them as an aggregate. I note also that WikiProjects are not entirely independent of one another but have relationships between them. For example, WikiProject Australian Roads describes itself as an "intersection" (ha ha!) of WikiProject Highways and WikiProject Australia, so I expect that we would find greater correlation in importance between related WikiProjects than between unrelated WikiProjects.
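As a rough illustration of the per-WikiProject comparison suggested above, here is a minimal Python sketch that computes a Spearman rank correlation between importance rating and pageviews separately for each WikiProject. Everything in it is an assumption made for illustration: the rows in `RATINGS` are invented, the numeric encoding in `IMPORTANCE_RANK` is just one possible ordering of the Low/Mid/High/Top scale, and the function names are hypothetical. It is not the method used in the preliminary experiment.

```python
from statistics import mean

# Assumed ordinal encoding of the WikiProject importance scale.
IMPORTANCE_RANK = {"Low": 0, "Mid": 1, "High": 2, "Top": 3}


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation, i.e. Pearson correlation of tie-averaged ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when one variable is constant
    return cov / (var_x * var_y) ** 0.5


def per_project_correlation(rows):
    """Group (project, importance, views) rows and correlate within each project."""
    by_project = {}
    for project, importance, views in rows:
        by_project.setdefault(project, []).append((IMPORTANCE_RANK[importance], views))
    return {
        project: spearman([r for r, _ in pairs], [v for _, v in pairs])
        for project, pairs in by_project.items()
    }


# Invented rows: (WikiProject, importance rating, pageviews over some period).
RATINGS = [
    ("Food and Drink", "High", 900),
    ("Food and Drink", "Mid", 400),
    ("Food and Drink", "Low", 350),
    ("Food and Drink", "Low", 50),
    ("Australian Roads", "Top", 120),
    ("Australian Roads", "Mid", 60),
    ("Australian Roads", "Low", 80),
    ("Australian Roads", "Low", 10),
]

correlations = per_project_correlation(RATINGS)
for project, rho in correlations.items():
    print(f"{project}: rho = {rho:.2f}")
```

A rank correlation is used here because the importance scale is ordinal and has many ties. The same grouping could be repeated with inbound-link counts split into within-project and outside-project links.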

When thinking about readers and pageviews, I think we have to ask ourselves whether there is a difference between popularity and importance, or whether popularity *is* importance. I sense that, as a group of educated people, those of us reading this research mailing list probably do think there is a difference. Certainly if there is no difference, then this research can stop now -- just judge importance by pageviews. Let's assume a difference then. The pageviews of an article are not always consistent over time. Here are the pageviews for Drottninggatan:

https://tools.wmflabs.org/pageviews/?project=en.wikipedia.org&platform=all-access&agent=user&range=latest-90&pages=Drottninggatan

Why so interesting on 8 April? A terrorist attack occurred there. This kind of spike in pageviews occurs all the time when some topic is in the news (even peripherally, as in this case, where it is not the article about the terrorist attack but the one about the street in which it occurred). Did the street become more "important"? I think it became more interesting but not more important. So I think we do have to be careful to understand that pageviews probably reflect interest rather than importance. I note that The Chainsmokers (a music group with a number of songs in the current USA music charts) gets many more pageviews than the Wikipedia article on Pasteurization, but The Chainsmokers are not rated as being of high importance by the relevant WikiProjects, while Pasteurization is very important in WikiProject Food and Drink. Since pasteurisation prevents a lot of deaths, I think we might agree that in the real world pasteurisation is more important than a music group, regardless of what pageviews tell us.

https://tools.wmflabs.org/pageviews/?project=en.wikipedia.org&platform=all-access&agent=user&range=latest-90&pages=The_Chainsmokers|Pasteurization
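To make the "interest versus importance" point concrete, here is a small Python sketch showing how a single news-driven spike distorts the mean of a daily pageview series while barely moving the median. The series `quiet_with_spike` is invented (it is not the actual Drottninggatan data), and `summarize_views` is a hypothetical helper name.

```python
from statistics import mean, median


def summarize_views(daily_views):
    """Return (mean, median, peak-to-median ratio) for a list of daily counts."""
    med = median(daily_views)
    peak_ratio = max(daily_views) / med if med else float("inf")
    return mean(daily_views), med, peak_ratio


# Invented 90-day series: a quiet page with one spike that decays over a few days.
quiet_with_spike = [40] * 44 + [12000, 300, 150, 80] + [45] * 42

avg, med, peak_ratio = summarize_views(quiet_with_spike)
print(f"mean={avg:.1f} median={med} peak/median={peak_ratio:.0f}")
```

If pageviews are used as a feature at all, a spike-robust summary such as the median, or a flag for a very high peak-to-median ratio, may be a safer choice than a raw total. That is a design suggestion, not something the thread has tested.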

Of course it matters for Wikipedia's success that our *popular* articles are of high quality, but I think we have to be cautious about treating pageviews as a proxy for importance.

When we look at Wikipedia writers' decisions in tagging the importance of articles to WikiProjects, what do we find? As we know, project tags are often placed on new articles (and often not subsequently reviewed). So while I find that quality tags are often out-of-date, the importance seems to be pretty accurate even on new stub articles. This is because it is the importance of the *topic* that is being assessed, which is independent of the Wikipedia article itself. Provided the article is clear enough about what it is about and why it matters (which is the traditional content of that first paragraph or two, and failing to provide it will likely result in speedy deletion of the new article), assessment of the topic's importance can be made even at new stub level. This tells us that importance for Wikipedia writers is determined by something outside of Wikipedia (probably their real-world knowledge of that topic space -- one assumes that project taggers are quite interested in the topic space of that project). While article quality hopefully improves over time, I would be surprised if article importance greatly changed over time. Obviously there are counter-examples. I am guessing Donald Trump's article may have grown in importance over time, but that's probably because his lede para changed. The addition of President of the USA to the lede paragraph reflects that he is now much more important in the real world than he was before, and internal to Wikipedia he has acquired an inbound link from the presumably high-importance President of the USA article. So I think it might be interesting to study those articles whose importance does change over time to see if there are any strong correlations with what is happening to the article inside Wikipedia. I think this set of importance-changing articles may be where we really learn which Wikipedia article characteristics are strongly correlated with "importance", given that importance itself appears to be pretty stable for most articles.

Although not stated explicitly, I imagine we believe that generally less important articles tend to link to more important articles but more important articles don't link to less important articles. Hence in-bound links are likely to matter in assessing importance, and in-bound links from "important" articles are more valuable than in-bound links from less important articles (which creates something of a bootstrapping problem, similar to the issue addressed by Google's PageRank algorithm). But I think we do have some information that Google doesn't have. The average webpage does not have a lede paragraph that situates the topic relative to other topics; a Wikipedia article does. If I choose to define Thing X in terms of Thing Y, it tends to suggest that Y is more important than X. If Y also defines itself in terms of X, then it tends to suggest they are equivalent in importance in some way. Indeed I suspect when we get to the VERY IMPORTANT topics we will see this kind of circular definition (e.g. you see circular definitions in Wikipedia around Philosophy and Knowledge). As an aside, if you have never done this before, try this experiment. Choose a random article (left hand tool bar in Desktop Wikipedia), then click the first link in the article that matters (i.e. ignore links in hatnotes or links inside parentheses). Repeat this first-link clicking and sooner or later you will reach articles like Knowledge and Philosophy, which all sit inside circular definition groups.
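The first-link experiment described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `FIRST_LINK` is an invented toy mapping, not real Wikipedia link data, and `follow_first_links` is a hypothetical helper. A real version would need to extract the first qualifying link from each article, skipping hatnotes and parenthesised links.

```python
def follow_first_links(start, first_link, max_steps=100):
    """Follow first links from `start` until an article repeats or the chain ends.

    Returns (path, cycle): `path` is the list of articles visited in order, and
    `cycle` is the circular definition group reached (empty if the chain hit a
    dead end or ran past `max_steps`).
    """
    path = []
    seen = {}  # article title -> position in path
    current = start
    while current is not None and current not in seen and len(path) < max_steps:
        seen[current] = len(path)
        path.append(current)
        current = first_link.get(current)
    cycle = path[seen[current]:] if current in seen else []
    return path, cycle


# Invented toy mapping from an article to the first link in its lede.
FIRST_LINK = {
    "Some street": "Road",
    "Road": "Transport",
    "Transport": "Knowledge",
    "Knowledge": "Philosophy",
    "Philosophy": "Knowledge",
}

path, cycle = follow_first_links("Some street", FIRST_LINK)
print(" -> ".join(path))
print("circular group:", cycle)
```

The length of the path before the cycle is reached could be explored as a rough distance-from-the-core feature, although whether it tracks importance is an open question.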

If we look at the Donald Trump article, his first sentence contains only two links, one to List of Presidents of the USA and the other to President of the USA. If we look at those two articles, we find that both of them mention Donald Trump in their lede paras (although not as early as the first sentence) and before mentions of any other US President elsewhere in the article. This is consistent with what we know about the real world: the role of the President is more important than its officeholders, and the current officeholder has more importance than a past officeholder. So topic importance does seem to be skewed towards the "present day".

So I suspect the links in the lede paras are of greater relevance to the assessment of importance than links further down in the article, which are more likely to relate to details of a topic and may include examples and counter-examples (this is a way in which a high-importance article may mention much lower-importance articles). However, we do have to be a little bit careful here because of the MoS practice of not linking very common terms. For example, an Australian article will often refer to Australia in the lede para, but it will almost certainly not be linked to the Australia article (and any attempt to add such a link will likely see it removed with an edit summary that mentions [[WP:Overlinking]]), whereas there is no problem if you link to an Australian state article, e.g. New South Wales. So we might find that some very important topics that often appear in ledes get fewer links than you might expect because of the MoS policies on overlinking, which may be a problem when working with inbound links. It may be that for "very common topics" the presence of the article title (or its synonyms) in the lede has to be treated as if it were an in-bound link for statistical research purposes.
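One way to picture the two ideas above (lede links counting for more than body links, and unlinked lede mentions of very common terms counting as if they were links) is the following Python sketch. The weights, the `ARTICLES` data, the `common_terms` parameter and the function name are all assumptions chosen for illustration, and synonyms of the title are not handled.

```python
import re

LEDE_WEIGHT = 3.0     # assumed: a link from a lede counts more than a body link
BODY_WEIGHT = 1.0
MENTION_WEIGHT = 3.0  # assumed: an unlinked lede mention counts like a lede link


def inbound_score(target, articles, common_terms=()):
    """Weighted inbound-link score for `target` over a dict of article records."""
    mention = re.compile(r"\b" + re.escape(target) + r"\b")
    score = 0.0
    for title, art in articles.items():
        if title == target:
            continue
        if target in art["lede_links"]:
            score += LEDE_WEIGHT
        elif target in common_terms and mention.search(art["lede_text"]):
            # Overlinking rules suppress the link, so count the plain mention.
            score += MENTION_WEIGHT
        if target in art["body_links"]:
            score += BODY_WEIGHT
    return score


# Invented article records for illustration only.
ARTICLES = {
    "Town A": {
        "lede_text": "Town A is a town in New South Wales, Australia.",
        "lede_links": {"New South Wales"},
        "body_links": set(),
    },
    "Town B": {
        "lede_text": "Town B is an Australian town in New South Wales.",
        "lede_links": {"New South Wales"},
        "body_links": {"Australia"},
    },
    "Road C": {
        "lede_text": "Road C is a road in Australia.",
        "lede_links": set(),
        "body_links": {"Town A"},
    },
}

print(inbound_score("New South Wales", ARTICLES))
print(inbound_score("Australia", ARTICLES))
print(inbound_score("Australia", ARTICLES, common_terms={"Australia"}))
```

Note that the word-boundary match deliberately does not count "Australian" as a mention of "Australia". Whether adjectival forms should count is one of the choices a real study would have to make.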

Given all of the above, perhaps the most interesting group of articles to study in Wikipedia are those articles whose manually-assessed importance has changed over the life of the article AND which were NOT current topics in the lifetime of Wikipedia (given the influence of "current" on importance). But having said that, I wonder if that group of articles actually exists. Recently a newish Australian contributor expressed disappointment that all the new articles they had created were tagged (by others) as of Low Importance. My instinctive reply was "that's normal, I think of the thousands of articles I have started only a couple are even rated as Mid importance; this is because the really important articles were all started long ago precisely because they were important". I suspect topics that are very important (for reasons other than short-lived importance due to being "current" in the lifetime of Wikipedia) will generally show up as having started early in Wikipedia's life, and that those that become more or less important over time will be largely linked to becoming or ceasing to be "current" topics. E.g. the article Pasteurization started in May 2001 saying nothing more than "Pasteurization is the process of killing off bacteria in milk by quickly heating it to a near boiling temperature, then quickly cooling it again before the taste and other desirable properties are affected. The process was named after its inventor, French scientist Louis Pasteur. See also dairy products." The links in this very first version are still present in its lede paragraph today, suggesting our understanding of "non-current" topics is stable and hence initial importance determinations can probably be made accurately. For Pasteurization the Talk page shows it was not project-tagged until 2007, when it was assigned High Importance as its first assessment.

I suspect we will find that initial manual assessment of article importance will be pretty accurate for most articles. And I suspect that if we plot initial importance assessments against time of assessment, we will find the higher importance articles commenced life on Wikipedia earlier than the lower importance articles. If I am correct, then there isn't a lot of value in machine-assessment of importance of topics, because it relates to factors external to Wikipedia, often does not change over time, and therefore can often be correctly assessed manually even on new stub articles (and any unassessed articles can probably be rated as Low Importance, as statistically that's almost certainly going to be correct). If a topic becomes more important due to "current" events, then invariably that article will be updated by many people and one of them will sooner or later manually adjust its importance. What is less likely to happen is a downward re-assessment of importance when an important "current" topic is no longer current, e.g. are former American presidents like Barack Obama or George W Bush or those further back less important now? These articles will not be updated frequently once the topic is no longer in the news, and therefore it is less likely an editor will notice and manually downgrade the importance, so there may be a greater role for machine-assessment in downgrading importance than in upgrading it.
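The remark that unassessed articles can probably be rated Low amounts to a majority-class baseline, which any machine-assessment would have to beat to be useful. Here is a minimal Python sketch of how that baseline could be measured. The label distribution in `labels` is invented, and `majority_baseline_accuracy` is a hypothetical helper name.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Return the most common label and the accuracy of always predicting it."""
    counts = Counter(labels)
    label, n = counts.most_common(1)[0]
    return label, n / len(labels)


# Invented distribution of manual importance ratings within one WikiProject.
labels = ["Low"] * 80 + ["Mid"] * 14 + ["High"] * 5 + ["Top"] * 1

baseline_label, baseline_accuracy = majority_baseline_accuracy(labels)
print(f"always predict {baseline_label!r}: accuracy {baseline_accuracy:.0%}")
```

With a distribution this skewed, raw accuracy says little, so per-class measures (for example recall on High and Top articles) are probably more informative when evaluating a classifier.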

Another area where there might be a role for machine-assessed importance is in regard to POV-pushing, where a POV-motivated editor might change the manually-assessed importance of articles to be higher or lower based on their POV (e.g. my political party is Top Importance, other parties are of Low Importance). I suspect that often a page watcher would correct or at least question that kind of re-assessment. However, on articles with few active pagewatchers you might get away with POV-pushing the article's importance tag because nobody notices. In this situation, a machine assessment could be useful in spotting this kind of thing.
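A sketch of how such a check might look, assuming a model's predicted rating is available for each page: flag pages where the manual rating and the prediction disagree by a wide margin and few people are watching. The thresholds, the field names in `PAGES`, and the data itself are all hypothetical.

```python
# Assumed ordinal encoding of the importance scale.
RANK = {"Low": 0, "Mid": 1, "High": 2, "Top": 3}


def flag_for_review(pages, min_gap=2, max_watchers=5):
    """Return (title, gap) for thinly watched pages whose manual rating is far
    from the machine prediction, largest gap first."""
    flagged = []
    for page in pages:
        gap = abs(RANK[page["manual"]] - RANK[page["predicted"]])
        if gap >= min_gap and page["active_watchers"] <= max_watchers:
            flagged.append((page["title"], gap))
    return sorted(flagged, key=lambda item: -item[1])


# Invented example records.
PAGES = [
    {"title": "Party X", "manual": "Top", "predicted": "Low", "active_watchers": 2},
    {"title": "Party Y", "manual": "Low", "predicted": "High", "active_watchers": 1},
    {"title": "Party Z", "manual": "Top", "predicted": "Low", "active_watchers": 40},
    {"title": "Town A", "manual": "Low", "predicted": "Mid", "active_watchers": 0},
]

for title, gap in flag_for_review(PAGES):
    print(f"{title}: manual and predicted ratings differ by {gap} levels")
```

The output is meant as a queue for human review, not an automatic correction, since a large gap can just as easily mean the model is wrong.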

This suggests that another metric of interest to importance might be the number of pagewatchers, although I suspect that pagewatching may relate more to caring about the article than to caring about the topic. And one has to be careful to distinguish active pagewatchers (those who actually do review changes on their watchlists) from those who don't, as that may make a difference. I am not sure we can really tell which pagewatchers are truly actively reviewing, though, as a "satisfactory" review doesn't leave a trace, whereas an "unsatisfactory" review is likely to lead fairly soon to a revert or some other change to the article, the article Talk page or the User Talk page of the reviewed contributor, which may be detectable.

The other aspect of articles that occurs to me as being possibly linked to importance of the topic would be use of the article as the "main" article for a category or as the title of a navbox (as it suggests that the articles in the category or navbox are in some way subordinate to the main/title article). Similarly for list articles, the "type" of the list is often more important than its instances.

Kerry

-----Original Message-----
From: Wiki-research-l [mailto:[hidden email]] On Behalf Of Morten Wang
Sent: Friday, 21 April 2017 6:04 AM
To: Research into Wikimedia content and communities <[hidden email]>
Subject: Re: [Wiki-research-l] Project exploring automated classification of article importance


Re: Project exploring automated classification of article importance

Jane Darnell
Yes I totally agree that "importance is a relative metric rather than
absolute." I also agree that incoming links and pageviews are not accurate
measurements of "importance" for all of the reasons you mention. However,
we are still a project that is actively exploring the universe of
knowledge, and leaning heavily on academia and other established sources we
must "boldly go where no man has gone before" (and please feel free to
insert "white, euro-centric" before the man part). So do you have any
suggestions what we could measure going forward that would cough up some
interesting stats to monitor? Pagewatching is useful, but problematic
because these are only assigned at page-creation, while some marginal
editor interest might be expanded to whole categories (speaking as someone
who has thousands of pages watchlisted on multiple projects). I like your
thoughts about looking for key articles such as those used as the "main"
article for a category or as the title of a navbox. I am
looking for similar usages of paintings as a way to find popular painters
or paintings rather than just those paintings which have articles written
about them (which are often written for totally random reasons such as
theft/sale/wikiproject).

On Wed, Apr 26, 2017 at 5:39 AM, Kerry Raymond <[hidden email]>
wrote:

> Just a few musings on the issue of Importance and how to research it ...
>
> I agree it is intuitive that importance is likely to be linked to
> pageviews and inbound links but, as the preliminary experiment showed, it's
> probably not that simple.
>
> Pageviews tells us something about importance to readers of Wikipedia,
> while inbound links tells us something about importance to writers of
> Wikipedia, and I suspect that writers are not a proxy for readers as the
> editor surveys suggest that Wikipedia writers are not typical of broader
> society on at least two variables: gender and level of education (might be
> others, I can't remember).
>
> But I think importance is a relative metric rather than  absolute. I think
> by taking the mean value of importance across a number of WikiProjects in
> the preliminary experiment may have lost something because it tried
> (through averaging) to look at importance "generally". I would suspect
> conducting an experiment considering only the importance ratings wrt to a
> single WikiProject would be more likely to show correlation with pageviews
> (wrt to other articles in that same WikiProject) and inbound links. And I
> think there are two kinds of inbound links to be considered, those coming
> from other articles within the same WikiProject and those coming from
> outside that Wikiproject. I suspect different insights will be obtained by
> looking at both types of inbound links separately rather than treating them
> as an aggregate. I note also that WikiProjects are not entirely independent
> of one another but have relationships between them. For example, The
> WikiProject Australian Roads describes itself as an "intersection" (ha ha!)
> of WikiProject Highways and WikiProject Australia, so I expect that we
> would find greater correlation in importance between related WikiProjects
> than between unrelated WikiProjects.
>
> When thinking about readers and pageviews, I think we have to ask
> ourselves is there a difference between popularity and importance. Or
> whether popularity *is* importance. I sense that, as a group of educated
> people, those of us reading this research mailing list probably do think
> there is a difference. Certainly if there is no difference, then this
> research can stop now -- just judge importance by  pageviews. Let's assume
> a difference then. When looking at pageviews of an article, they are not
> always consistent over time. Here are the pageviews for Drottninggatan
>
> https://tools.wmflabs.org/pageviews/?project=en.
> wikipedia.org&platform=all-access&agent=user&range=
> latest-90&pages=Drottninggatan
>
> Why so interesting on 8 April? A terrorist attack occurred there. This
> spike in pageviews occurs all the time when some topic is in the news (even
> peripherally as in this case where it is not the article about the
> terrorist attack but about the street in which it occurred). Did the street
> become more "important"? I think it became more interesting but not more
> important. So I think we do have to be careful to understand that pageviews
> probably reflect interest rather than importance.  I note that The
> Chainsmokers (a music group with a number of songs in the current USA music
> charts) gets many more Wikipedia article pageviews  than the Wikipedia
> article on Pasteurization but The Chainsmokers are not rated as being of
> high importance by the relevant WikiProjects while Pasteurization is very
> important in WikiProject Food and Drink. Since pasteurisation prevents a
> lot of deaths, I think we might agree that in the real world pasteurisation
> is more important than a music group regardless of what pageviews tell us.
>
> https://tools.wmflabs.org/pageviews/?project=en.
> wikipedia.org&platform=all-access&agent=user&range=latest-90&pages=The_
> Chainsmokers|Pasteurization
>
> Of course it is matters for Wikipedia's success that our *popular*
> articles are of high quality, but I think we have be cautious about
> pageviews being a proxy for importance.
>
> When we look at Wikipedia writers' decisions in tagging the importance of
> articles to WikiProjects, what do we find? As we know, project tags are
> often placed on new articles (and often not subsequently reviewed). So
> while I find that quality tags are often out-of-date, the importance seems
> to be pretty accurate even on a new stub articles. This is because it is
> the importance of the *topic* that is being assessed which is independent
> of the Wikipedia article itself. Provided the article is clear enough about
> what it is about and why it matters (which is the traditional content of
> that first paragraph or two and failing to provide it will likely result in
> speedy deletion of the new article), assessment of the topic's importance
> can be made even at new stub level. This tells us that importance for
> Wikipedia writers is determined by something outside of Wikipedia (probably
> their real-world knowledge of that topic space -- one assumes that project
> taggers are quite interested in the topic space of that project). While
> article quality hopefully improves over time, I would be surprised if
> article importance greatly changed over time. Obviously there are
> counter-examples.  I am guessing Donald Trump's article may have grown in
> importance over time but that's probably because his lede para changed.
> Adding President of the USA into the lede paragraph makes him much more
> important than he was before in the real world and internal to Wikipedia he
> has acquired an inbound link from the presumably high-importance President
> of the USA article. So I think it might be interesting to study those
> articles whose importance does change over time to see if there are any
> strong correlations with what is happening to the article inside Wikipedia.
> I think it is this set of importance-changing articles may be where we
> really learn what Wikipedia article characteristics are strongly correlated
> to "importance" given that importance itself appears to be pretty stable
> for most articles.
>
> Although not stated explicitly, I imagine we believe that generally less
> important articles tend to link to more important articles but more
> important articles don't link to less important articles. And hence
> in-bound links are likely to matter in assessing importance and that
> in-bound links from "important" articles are more valuable than in-bound
> links from less important articles (which creates something of a
> bootstrapping problem), similar to the issue with Google's PageRank
> algorithms. But I think we do have some information that Google doesn't
> have. The average webpage does not have a lede paragraph that situates the
> topic relative to other topics; a Wikipedia article does. If I have to
> choose to define Thing X in terms of Thing Y, it tends to suggest that Y is
> more important than X. If Y also defines itself in terms of X, then it
> tends to suggest they are equivalent in importance in some way. Indeed I
> suspect when we get to the VERY IMPORTANT topics we will see this kind of
> circular definition (e.g. you see circular definitions in Wikipedia around
> Philosophy and Knowledge). Aside, if you have never done this before, try
> this experiment. Choose a random article (left hand tool bar in Desktop
> Wikipedia), then click the first link in the article that matters (i.e.
> ignore links in hatnotes or links inside parentheses). Repeat this first link
> clicking and sooner or later you will reach articles like Knowledge and
> Philosophy, which all sit inside circular definition groups.
>
> If we look at the Donald Trump article, his first sentence contains only
> two links, one to List of Presidents of the USA and the other to President
> of the USA. If we look at those two articles, we find that both of them
> mention Donald Trump in their lede paras (although not as early as the
> first sentence) and before mentions of any other US President elsewhere in
> the article. This is consistent with what we know about the real world:
> the role of the President is more important than its officeholders and
> the current officeholder has more importance than a past officeholder. So
> topic importance does seem to be skewed towards the "present day".
>
> So I suspect the links in the lede paras are of greater relevance to the
> assessment of importance than links further down in the article, which will
> be more likely to relate to details of a topic and may include examples and
> counter-examples (this is a way in which a high importance article may
> mention much lower importance articles). However, we do have to be a little
> bit careful here because of the MoS practice of not linking very common
> terms. For example, an Australian article will often refer to Australia in
> the lede para but it will almost certainly not be linked to the Australia
> article (and any attempt to add such a link will likely see it removed with
> an edit summary that mentions [[WP:Overlinking]]) whereas there is no
> problem if you link to an Australian state article, e.g. New South Wales.
> So we might find that some very important topics that often appear in ledes
> might get fewer links than you might expect because of the MoS policies on
> overlinking, which may be a problem when working with inbound links. It may
> be that for "very common topics" the presence of the article title (or its
> synonyms) in the lede may have to be considered as if it were an in-bound
> link for statistical research purposes.
>
> Given all of the above, perhaps the most interesting group of articles to
> study in Wikipedia are those articles whose manually-assessed importance
> has changed over the life of the article AND which were NOT current topics
> in the lifetime of Wikipedia (given the influence of "current" on
> importance). But having said that, I wonder if that group of articles
> actually exists. Recently a newish Australian contributor expressed
> disappointment that all the new articles they had created were tagged (by
> others) as of Low Importance. My instinctive reply was "that's normal, I
> think of the thousands of articles I have started only a couple are even rated
> as Mid importance, this is because the really important articles were all
> started long ago precisely because they were important". I suspect topics
> that are very important (for reasons other than short-lived
> importance due to being "current" in the lifetime of Wikipedia) will
> generally show up as having started early in Wikipedia's life and that
> those that become more/less important over time will be largely linked to
> becoming or ceasing to be "current" topics. E.g. article Pasteurization
> started in May 2001 saying nothing more than "Pasteurization is the
> process of killing off bacteria in milk by quickly heating it to a near
> boiling temperature, then quickly cooling it again before the taste and
> other desirable properties are affected. The process was named after its
> inventor, French scientist Louis Pasteur. See also dairy products." The
> links in this very first version are still present in its lede paragraph
> today, suggesting our understanding of "non-current" topics is stable and
> hence initial importance determinations can probably be accurately made.
> For Pasteurization the Talk page shows it was not project-tagged until 2007
> when it was assigned High Importance as its first assessment.
>
> I suspect we will find that initial manual assessment of article
> importance will be pretty accurate for most articles. And I suspect if we
> plot initial importance assessments against time of assessment, we will
> find the higher importance articles commenced life on Wikipedia earlier
> than the lower importance articles. If I am correct, then there isn't a lot
> of value in machine-assessment of importance of topics because it relates
> to factors external to Wikipedia and often does not change over time and
> therefore can often be correctly assessed manually even on new stub
> articles (and any unassessed articles can probably be rated as Low
> Importance as statistically that's almost certainly going to be correct).
> If a topic becomes more important due to "current" events, then invariably
> that article will be updated by many people and one of them will sooner or
> later manually adjust its importance. What is less likely to happen is
> re-assessing downwards of Importance when an important "current" topic
> loses its importance when it is no longer current, e.g. are former American
> presidents like Barack Obama or George W Bush or further back less
> important now? These articles will not be updated frequently once the topic
> is no longer in the news and therefore it is less likely an editor will
> notice and manually downgrade the importance, so there may be a greater
> role for machine-assessment in downgrading importance rather than upgrading
> importance.
>
> Another area where there might be a role for machine-assessed importance
> is in regard to POV-pushing, where a POV-motivated editor might change the
> manual-assessment importance of articles to be higher or lower based on
> their POV (e.g. my political party is Top Importance, other parties are of
> Low Importance). I suspect that often a page watcher would correct or at
> least question that kind of re-assessment. However, on articles with few
> active pagewatchers you might get away with POV-pushing the article's
> importance tag because nobody noticed. In this situation, a machine
> assessment could be useful in spotting this kind of thing.
>
> This suggests that another metric of interest to importance might be
> number of pagewatchers, although I suspect that pagewatching may relate
> more to caring about the article than to caring about the topic. And one
> has to be careful to distinguish active pagewatchers (those who actually do
> review changes on their watchlists) from those who don't, as that may make
> a difference (although I am not sure we can really tell which pagewatchers
> are truly actively reviewing as a "satisfactory review" doesn't leave a
> trace whereas an "unsatisfactory" review is likely to lead fairly soon to a
> revert or some other change to the article, the article Talk or the
> User Talk of the reviewed contributor, which may be detectable).
>
> The other aspect of articles that occurs to me as being possibly linked to
> importance of the topic would be use of the article as the "main" article
> for a category or as the title of a navbox (as it suggests that the
> articles in the category or navbox are in some way subordinate to the
> main/title article). Similarly for list articles, the "type" of the list is
> often more important than its instances.
>
> Kerry
>
Re: Project exploring automated classification of article importance

Jonathan Cardy
I like to think that in time importance will win out over popularity. If Wikipedia still exists in fifty or five hundred years' time and we are still using pasteurisation and indeed still eating hydrocarbon-based foods, then I suspect the pop group you mention will be less frequently read about than the pasteurisation process.

In the meantime if we try to work it out at all it has to be something of a judgement call, and one we will occasionally get wrong. Any guesses as to which current branches of science will be as forgotten in a century as phrenology is today?

At an extreme the weekly top ten most viewed articles are a good guide to what is trending in the popular cultures of India and the USA. I'm assuming that most modern pop culture is inherently ephemeral. Of course digital historians of future centuries may be rolling on the floor laughing at this email, and the TV dramas currently being filmed may still be widely studied and universally known classics while our leading edge science lies buried in the foundations of their science.

Regards

Jonathan


> On 26 Apr 2017, at 08:50, Jane Darnell <[hidden email]> wrote:
>
> Yes I totally agree that "importance is a relative metric rather than
> absolute." I also agree that incoming links and pageviews are not accurate
> measurements of "importance" for all of the reasons you mention. However,
> we are still a project that is actively exploring the universe of
> knowledge, and leaning heavily on academia and other established sources we
> must "boldly go where no man has gone before" (and please feel free to
> insert "white, euro-centric" before the man part). So do you have any
> suggestions what we could measure going forward that would cough up some
> interesting stats to monitor? Pagewatching is useful, but problematic
> because these are only assigned at page-creation, while some marginal
> editor interest might be expanded to whole categories (speaking as someone
> who has thousands of pages watchlisted on multiple projects). I like your
> thoughts about looking for key articles such as those used as the "main"
> article for a category or as the title of a navbox. I am
> looking for similar usages of paintings as a way to find popular painters
> or paintings rather than just those paintings which have articles written
> about them (which are often written for totally random reasons such as
> theft/sale/wikiproject).
>
> On Wed, Apr 26, 2017 at 5:39 AM, Kerry Raymond <[hidden email]>
> wrote:
>
>> Just a few musings on the issue of Importance and how to research it ...
>>
>> I agree it is intuitive that importance is likely to be linked to
>> pageviews and inbound links but, as the preliminary experiment showed, it's
>> probably not that simple.
>>
>> Pageviews tells us something about importance to readers of Wikipedia,
>> while inbound links tells us something about importance to writers of
>> Wikipedia, and I suspect that writers are not a proxy for readers as the
>> editor surveys suggest that Wikipedia writers are not typical of broader
>> society on at least two variables: gender and level of education (might be
>> others, I can't remember).
>>
>> But I think importance is a relative metric rather than absolute. I think
>> taking the mean value of importance across a number of WikiProjects in
>> the preliminary experiment may have lost something because it tried
>> (through averaging) to look at importance "generally". I would suspect
>> conducting an experiment considering only the importance ratings wrt a
>> single WikiProject would be more likely to show correlation with pageviews
>> (wrt other articles in that same WikiProject) and inbound links. And I
>> think there are two kinds of inbound links to be considered, those coming
>> from other articles within the same WikiProject and those coming from
>> outside that WikiProject. I suspect different insights will be obtained by
>> looking at both types of inbound links separately rather than treating them
>> as an aggregate. I note also that WikiProjects are not entirely independent
>> of one another but have relationships between them. For example, the
>> WikiProject Australian Roads describes itself as an "intersection" (ha ha!)
>> of WikiProject Highways and WikiProject Australia, so I expect that we
>> would find greater correlation in importance between related WikiProjects
>> than between unrelated WikiProjects.
>>
>> When thinking about readers and pageviews, I think we have to ask
>> ourselves whether there is a difference between popularity and importance,
>> or whether popularity *is* importance. I sense that, as a group of educated
>> people, those of us reading this research mailing list probably do think
>> there is a difference. Certainly if there is no difference, then this
>> research can stop now -- just judge importance by pageviews. Let's assume
>> a difference then. When looking at pageviews of an article, they are not
>> always consistent over time. Here are the pageviews for Drottninggatan
>>
>> https://tools.wmflabs.org/pageviews/?project=en.wikipedia.org&platform=all-access&agent=user&range=latest-90&pages=Drottninggatan
>>
>> Why so interesting on 8 April? A terrorist attack occurred there. This
>> spike in pageviews occurs all the time when some topic is in the news (even
>> peripherally as in this case where it is not the article about the
>> terrorist attack but about the street in which it occurred). Did the street
>> become more "important"? I think it became more interesting but not more
>> important. So I think we do have to be careful to understand that pageviews
>> probably reflect interest rather than importance. I note that The
>> Chainsmokers (a music group with a number of songs in the current USA music
>> charts) gets many more Wikipedia article pageviews than the Wikipedia
>> article on Pasteurization but The Chainsmokers are not rated as being of
>> high importance by the relevant WikiProjects while Pasteurization is very
>> important in WikiProject Food and Drink. Since pasteurisation prevents a
>> lot of deaths, I think we might agree that in the real world pasteurisation
>> is more important than a music group regardless of what pageviews tell us.
>>
>> https://tools.wmflabs.org/pageviews/?project=en.wikipedia.org&platform=all-access&agent=user&range=latest-90&pages=The_Chainsmokers|Pasteurization
>>
>> Of course it matters for Wikipedia's success that our *popular*
>> articles are of high quality, but I think we have to be cautious about
>> pageviews being a proxy for importance.
>>


Re: Project exploring automated classification of article importance

Kerry Raymond
In reply to this post by Jane Darnell
I think you are reading my comments too negatively. I’m not saying to ignore pageviews or incoming links. I’m saying that a naïve look at their stats may not be as useful as some of the variations I mention. I think it is worth looking at pageviews relative to those of other articles in the same WikiProject. I think it is worth looking at inbound links, but split into two groups: those coming from the same WikiProject(s) and those coming from other WikiProjects. I think the position of the incoming links within their source articles is also significant: first sentence, first para, whole of lede, or the absolute/relative position of the link in the article (e.g. 2000 bytes from start, or 40% from start).
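
As a rough illustration of the link-position idea, the sketch below locates the first link to a target inside a source article's wikitext and reports its absolute and relative position. Everything here is an assumption made for illustration: the function name, the simple wikilink regex (which ignores templates, references and hatnotes), the use of characters rather than bytes, and the crude "first `. `" and "first blank line" tests for sentence and paragraph boundaries.

```python
import re

# Matches [[Target]], [[Target|label]] and [[Target#Section|label]].
WIKILINK = re.compile(r"\[\[([^\]|#]+)(?:#[^\]|]*)?(?:\|[^\]]*)?\]\]")


def link_position_features(wikitext, target):
    """Describe where the first link to `target` sits in `wikitext`.

    Returns None when the source article does not link to the target.
    """
    # Crude boundaries: the first ". " ends the first sentence and the
    # first blank line ends the first paragraph (an assumption, not a parser).
    first_sentence_end = wikitext.find(". ")
    if first_sentence_end == -1:
        first_sentence_end = len(wikitext)
    first_paragraph_end = wikitext.find("\n\n")
    if first_paragraph_end == -1:
        first_paragraph_end = len(wikitext)

    for match in WIKILINK.finditer(wikitext):
        if match.group(1).strip() != target:
            continue
        offset = match.start()
        return {
            "offset_chars": offset,                          # absolute position
            "relative_position": offset / len(wikitext),     # 0.0 = start
            "in_first_sentence": offset < first_sentence_end,
            "in_first_paragraph": offset < first_paragraph_end,
        }
    return None


# Hypothetical source text, loosely modelled on an early stub.
SAMPLE = (
    "Pasteurization is a process named after [[Louis Pasteur]]. "
    "It is used for [[milk]].\n\n"
    "See also [[Dairy product|dairy products]]."
)

if __name__ == "__main__":
    for name in ("Louis Pasteur", "milk", "Dairy product"):
        print(name, link_position_features(SAMPLE, name))
```

Features like these could then be aggregated per target article, for example counting how many inbound links sit in a first sentence versus further down the page.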

 

The big difference between machine-assessment of article quality and article importance is that quality is a metric on the article but importance is a metric on the topic. Also, my informal observation is that article quality does improve and degrade over time and hence is much more dynamic than topic importance, which seems to me to be much more stable. So I think there is less scope for dramatically improving the situation by determining topic importance than by automating quality assessment, but there may be benefit if there are heuristics to spot the relatively few articles which do need their importance re-assessed due to “current events”. In that case “editor activity” may be a metric, particularly “editor activity” on the lede para or other more critical areas of the article.

 

I am not too worried about the 22nd century. I think we should look more at the next decade. Who would have predicted the demise of Usenet? It seemed pretty sexy at the time, etc. Wikipedia, like many things, will pass. That’s not to say it will pass into oblivion, but it may morph into something very different to what it is today. Being CC-BY-SA improves the chances that any successor can build on it, but maybe we should put into WMF’s constitution, “if WMF shuts down, we release the contents of the projects as CC0” (to increase the likelihood that the content has a future). Having had to shut down a number of research institutes when the funding ran out, I know the utter stupidity that occurs when they retain a skeleton staff to “sell off all our valuable IP”, which every closing-down institution seems to want to do, and the result is that the IP gets wasted because it isn’t sold or it’s sold to one of those companies who buy IP for tuppence on the off-chance they can potentially engage in patent litigation (or other IP litigation) downstream. We waste so much IP with this kind of “make a buck” thinking. <end of rant>

 

Kerry

 

From: Jane Darnell [mailto:[hidden email]]
Sent: Wednesday, 26 April 2017 5:51 PM
To: [hidden email]; Research into Wikimedia content and communities <[hidden email]>
Subject: Re: [Wiki-research-l] Project exploring automated classification of article importance

 

Yes I totally agree that "importance is a relative metric rather than absolute." I also agree that incoming links and pageviews are not accurate measurements of "importance" for all of the reasons you mention. However, we are still a project that is actively exploring the universe of knowledge, and leaning heavily on academia and other established sources we must "boldly go where no man has gone before" (and please feel free to insert "white, euro-centric" before the man part). So do you have any suggestions for what we could measure going forward that would cough up some interesting stats to monitor? Pagewatching is useful, but problematic because these are only assigned at page-creation, while some marginal editor interest might be expanded to whole categories (speaking as someone who has thousands of pages watchlisted on multiple projects). I like your thoughts about looking for key articles such as those used as the "main" article for a category or as the title of a navbox. I am looking for similar usages of paintings as a way to find popular painters or paintings rather than just those paintings which have articles written about them (which are often written for totally random reasons such as theft/sale/wikiproject).

 

On Wed, Apr 26, 2017 at 5:39 AM, Kerry Raymond <[hidden email]> wrote:

Just a few musings on the issue of Importance and how to research it ...

I agree it is intuitive that importance is likely to be linked to pageviews and inbound links but, as the preliminary experiment showed, it's probably not that simple.

Pageviews tell us something about importance to readers of Wikipedia, while inbound links tell us something about importance to writers of Wikipedia, and I suspect that writers are not a proxy for readers as the editor surveys suggest that Wikipedia writers are not typical of broader society on at least two variables: gender and level of education (might be others, I can't remember).

But I think importance is a relative metric rather than absolute. I think that by taking the mean value of importance across a number of WikiProjects, the preliminary experiment may have lost something because it tried (through averaging) to look at importance "generally". I would suspect that an experiment considering only the importance ratings with respect to a single WikiProject would be more likely to show correlation with pageviews (relative to other articles in that same WikiProject) and inbound links. And I think there are two kinds of inbound links to be considered, those coming from other articles within the same WikiProject and those coming from outside that WikiProject. I suspect different insights will be obtained by looking at both types of inbound links separately rather than treating them as an aggregate. I note also that WikiProjects are not entirely independent of one another but have relationships between them. For example, WikiProject Australian Roads describes itself as an "intersection" (ha ha!) of WikiProject Highways and WikiProject Australia, so I expect that we would find greater correlation in importance between related WikiProjects than between unrelated WikiProjects.
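
To make the per-WikiProject framing concrete, here is a minimal sketch that (a) splits an article's inbound links into those from the same WikiProject and those from elsewhere, and (b) ranks its pageviews only against other articles in that project. The function names, the data layout (plain dicts and sets), and every number below are invented for illustration; real counts would have to come from the link tables and pageview data.

```python
def split_inbound_links(target, inbound, project_members):
    """Count inbound links from inside vs. outside one WikiProject.

    `inbound` maps an article title to the titles of articles linking to it.
    """
    sources = inbound.get(target, ())
    same_project = sum(1 for source in sources if source in project_members)
    other_projects = len(sources) - same_project
    return same_project, other_projects


def within_project_percentile(target, pageviews, project_members):
    """Share of project articles whose pageviews do not exceed the target's."""
    views = [pageviews[title] for title in project_members]
    return sum(1 for v in views if v <= pageviews[target]) / len(views)


# Toy data: all titles are real articles, all figures are made up.
FOOD_AND_DRINK = {"Pasteurization", "Milk", "Cheese", "Yogurt"}
PAGEVIEWS = {
    "Pasteurization": 1200,
    "Milk": 3000,
    "Cheese": 2500,
    "Yogurt": 900,
    "The Chainsmokers": 50000,  # outside the project, ignored by the ranking
}
INBOUND = {"Pasteurization": ["Milk", "Cheese", "Louis Pasteur"]}

if __name__ == "__main__":
    print(split_inbound_links("Pasteurization", INBOUND, FOOD_AND_DRINK))
    print(within_project_percentile("Pasteurization", PAGEVIEWS, FOOD_AND_DRINK))
```

The point of the second function is that a globally popular but unrelated article has no effect on how an article ranks within its own project.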

When thinking about readers and pageviews, I think we have to ask ourselves is there a difference between popularity and importance. Or whether popularity *is* importance. I sense that, as a group of educated people, those of us reading this research mailing list probably do think there is a difference. Certainly if there is no difference, then this research can stop now -- just judge importance by  pageviews. Let's assume a difference then. When looking at pageviews of an article, they are not always consistent over time. Here are the pageviews for Drottninggatan

https://tools.wmflabs.org/pageviews/?project=en.wikipedia.org&platform=all-access&agent=user&range=latest-90&pages=Drottninggatan

Why so interesting on 8 April? A terrorist attack occurred there. This spike in pageviews occurs all the time when some topic is in the news (even peripherally as in this case where it is not the article about the terrorist attack but about the street in which it occurred). Did the street become more "important"? I think it became more interesting but not more important. So I think we do have to be careful to understand that pageviews probably reflect interest rather than importance.  I note that The Chainsmokers (a music group with a number of songs in the current USA music charts) gets many more Wikipedia article pageviews  than the Wikipedia article on Pasteurization but The Chainsmokers are not rated as being of high importance by the relevant WikiProjects while Pasteurization is very important in WikiProject Food and Drink. Since pasteurisation prevents a lot of deaths, I think we might agree that in the real world pasteurisation is more important than a music group regardless of what pageviews tell us.

https://tools.wmflabs.org/pageviews/?project=en.wikipedia.org&platform=all-access&agent=user&range=latest-90&pages=The_Chainsmokers|Pasteurization

Of course it matters for Wikipedia's success that our *popular* articles are of high quality, but I think we have to be cautious about pageviews being a proxy for importance.

When we look at Wikipedia writers' decisions in tagging the importance of articles to WikiProjects, what do we find? As we know, project tags are often placed on new articles (and often not subsequently reviewed). So while I find that quality tags are often out-of-date, the importance seems to be pretty accurate even on new stub articles. This is because it is the importance of the *topic* that is being assessed, which is independent of the Wikipedia article itself. Provided the article is clear enough about what it is about and why it matters (which is the traditional content of that first paragraph or two, and failing to provide it will likely result in speedy deletion of the new article), assessment of the topic's importance can be made even at new stub level. This tells us that importance for Wikipedia writers is determined by something outside of Wikipedia (probably their real-world knowledge of that topic space -- one assumes that project taggers are quite interested in the topic space of that project). While article quality hopefully improves over time, I would be surprised if article importance greatly changed over time. Obviously there are counter-examples. I am guessing Donald Trump's article may have grown in importance over time but that's probably because his lede para changed. Adding President of the USA into the lede paragraph makes him much more important than he was before in the real world, and internal to Wikipedia he has acquired an inbound link from the presumably high-importance President of the USA article. So I think it might be interesting to study those articles whose importance does change over time to see if there are any strong correlations with what is happening to the article inside Wikipedia. I think this set of importance-changing articles may be where we really learn what Wikipedia article characteristics are strongly correlated to "importance", given that importance itself appears to be pretty stable for most articles.

Although not stated explicitly, I imagine we believe that generally less important articles tend to link to more important articles but more important articles don't link to less important articles. And hence in-bound links are likely to matter in assessing importance, and in-bound links from "important" articles are more valuable than in-bound links from less important articles (which creates something of a bootstrapping problem, similar to the issue with Google's PageRank algorithm). But I think we do have some information that Google doesn't have. The average webpage does not have a lede paragraph that situates the topic relative to other topics; a Wikipedia article does. If I have to choose to define Thing X in terms of Thing Y, it tends to suggest that Y is more important than X. If Y also defines itself in terms of X, then it tends to suggest they are equivalent in importance in some way. Indeed I suspect when we get to the VERY IMPORTANT topics we will see this kind of circular definition (e.g. you see circular definitions in Wikipedia around Philosophy and Knowledge). As an aside, if you have never done this before, try this experiment. Choose a random article (left hand tool bar in Desktop Wikipedia), then click the first link in the article that matters (i.e. ignore links in hatnotes or links inside parentheses). Repeat this first link clicking and sooner or later you will reach articles like Knowledge and Philosophy, which all sit inside circular definition groups.
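
The bootstrapping problem mentioned above (a link is worth more when it comes from an important article, but importance is what we are trying to compute) is usually handled by iterating to a fixed point. The following is a minimal PageRank-style power iteration over a toy link graph, offered only as a sketch: the three-article graph is invented, the damping factor of 0.85 is simply the commonly quoted default, and nothing here is claimed to be what the project actually uses.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively score pages so that links from high-scoring pages count more.

    `links` maps each page to the list of pages it links to.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}  # start from a uniform guess

    for _ in range(iterations):
        # Every page receives a small baseline share regardless of links.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its current score on, split across its links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its score evenly.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Toy graph (assumed for illustration): two officeholders link to the office,
# and the office links back only to the current officeholder.
TOY_LINKS = {
    "Donald Trump": ["President of the USA"],
    "Barack Obama": ["President of the USA"],
    "President of the USA": ["Donald Trump"],
}

if __name__ == "__main__":
    for page, score in sorted(pagerank(TOY_LINKS).items(), key=lambda kv: -kv[1]):
        print(f"{score:.3f}  {page}")
```

On this toy graph the office ends up scoring above either officeholder, and the officeholder it links back to scores above the one it does not, which mirrors the pattern described for the lede links.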

If we look at the Donald Trump article, his first sentence contains only two links, one to List of Presidents of the USA and the other to President of the USA. If we look at those two articles, we find that both of them mention Donald Trump in their lede paras (although not as early as the first sentence) and before mentions of any other US President elsewhere in the article. This is consistent with what we know about the real world: the role of the President is more important than its officeholders, and the current officeholder has more importance than a past officeholder. So topic importance does seem to be skewed towards the "present day".

So I suspect the links in the lede paras are of greater relevance to the assessment of importance than links further down in the article, which will be more likely to relate to details of a topic and may include examples and counter-examples (this is a way in which a high-importance article may mention much lower-importance articles). However, we do have to be a little bit careful here because of the MoS practice of not linking very common terms. For example, an Australian article will often refer to Australia in the lede para but it will almost certainly not be linked to the Australia article (and any attempt to add such a link will likely see it removed with an edit summary that mentions [[WP:Overlinking]]), whereas there is no problem if you link to an Australian state article, e.g. New South Wales. So we might find that some very important topics that often appear in ledes get fewer links than you might expect because of the MoS policies on overlinking, which may be a problem when working with inbound links. It may be that for "very common topics" the presence of the article title (or its synonyms) in the lede may have to be considered as if it were an in-bound link for statistical research purposes.
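
One way to approximate "treat an unlinked mention as if it were an in-bound link" is sketched below: count explicit wikilinks to a topic in a lede, then strip all links and count plain-text mentions of the title or its synonyms. The function name, the regexes and the sample lede are assumptions for illustration only. The matching is deliberately naive: it is case-sensitive, it would double count overlapping synonyms, and it knows nothing about templates or redirects.

```python
import re


def lede_mentions(lede_wikitext, title, synonyms=()):
    """Return (linked, unlinked) mention counts of a topic in a lede."""
    names = [title, *synonyms]

    # Explicit links: [[Title]], [[Title|label]] or [[Title#Section]].
    linked = 0
    for match in re.finditer(r"\[\[([^\]|#]+)[^\]]*\]\]", lede_wikitext):
        if match.group(1).strip() in names:
            linked += 1

    # Remove every link so linked text is not counted a second time,
    # then look for whole-word plain-text mentions.
    plain = re.sub(r"\[\[[^\]]*\]\]", " ", lede_wikitext)
    unlinked = sum(
        len(re.findall(r"\b" + re.escape(name) + r"\b", plain)) for name in names
    )
    return linked, unlinked


# Hypothetical lede in the style described above: the state is linked,
# the country is mentioned but not linked.
SAMPLE_LEDE = (
    "The '''Pacific Highway''' is a highway in [[New South Wales]], "
    "Australia, linking [[Sydney]] and Brisbane."
)

if __name__ == "__main__":
    print("Australia:", lede_mentions(SAMPLE_LEDE, "Australia"))
    print("New South Wales:", lede_mentions(SAMPLE_LEDE, "New South Wales"))
```

The whole-word match means "Australian" on its own would not count as a mention of "Australia"; whether it should is a research decision rather than a coding one.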

Given all of the above, perhaps the most interesting group of articles to study in Wikipedia are those articles whose manually-assessed importance has changed over the life of the article AND which were NOT current topics in the lifetime of Wikipedia (given the influence of "current" on importance). But having said that, I wonder if that group of articles actually exists. Recently a newish Australian contributor expressed disappointment that all the new articles they had created were tagged (by others) as of Low Importance. My instinctive reply was "that's normal, I think that of the thousands of articles I have started only a couple were even rated as Mid importance; this is because the really important articles were all started long ago precisely because they were important". I suspect topics that are very important (for reasons other than short-lived importance due to being "current" in the lifetime of Wikipedia) will generally show up as having started early in Wikipedia's life, and that those that become more/less important over time will be largely linked to becoming or ceasing to be "current" topics. E.g. the article Pasteurization started in May 2001 saying nothing more than "Pasteurization is the process of killing off bacteria in milk by quickly heating it to a near boiling temperature, then quickly cooling it again before the taste and other desirable properties are affected. The process was named after its inventor, French scientist Louis Pasteur. See also dairy products." The links in this very first version are still present in its lede paragraph today, suggesting our understanding of "non-current" topics is stable and hence initial importance determinations can probably be accurately made. For Pasteurization the Talk page shows it was not project-tagged until 2007, when it was assigned High Importance as its first assessment.

I suspect we will find that initial manual assessment of article importance will be pretty accurate for most articles. And I suspect if we plot initial importance assessments against time of assessment, we will find the higher importance articles commenced life on Wikipedia earlier than the lower importance articles. If I am correct, then there isn't a lot of value in machine-assessment of importance of topics because it relates to factors external to Wikipedia and often does not change over time and therefore can often be correctly assessed manually even on new stub articles (and any unassessed articles can probably be rated as Low Importance as statistically that's almost certainly going to be correct). If a topic becomes more important due to "current" events, then invariably that article will be updated by many people and one of them will sooner or later manually adjust its importance. What is less likely to happen is re-assessing downwards of Importance when an important "current" topic loses its importance when it is no longer current, e.g. are former American presidents like Barack Obama or George W Bush or further back less important now? These articles will not be updated frequently once the topic is no longer in the news and therefore it is less likely an editor will notice and manually downgrade the importance, so there may be a greater role for machine-assessment in downgrading importance rather than upgrading importance.

Another area where there might be a role for machine-assessed importance is POV-pushing, where a POV-motivated editor might change the manually-assessed importance of articles to be higher or lower based on their POV (e.g. my political party is Top Importance, other parties are of Low Importance). I suspect that often a page watcher would correct or at least question that kind of re-assessment. However, on articles with few active pagewatchers you might get away with POV-pushing the article's importance tag because nobody noticed. In this situation, a machine assessment could be useful in spotting this kind of thing.

This suggests that another metric of interest to importance might be number of pagewatchers, although I suspect that pagewatching may relate more to caring about the article than to caring about the topic. And one has to be careful to distinguish active pagewatchers (those who actually do review changes on their watchlists) from those who don't, as that may make a difference (although I am not sure we can really tell which pagewatchers are truly actively reviewing as a "satisfactory review" doesn't leave a trace whereas an "unsatisfactory" review is likely to lead to a relatively soon revert or some other change to the article, the article Talk or the User Talk of reviewed contributor which may be detectable).

The other aspect of articles that occurs to me as being possibly linked to importance of the topic would be use of the article as the "main" article for a category or as the title of a navbox (as it suggests that the articles in the category or navbox are in some way subordinate to the main/title article). Similarly for list articles, the "type" of the list is often more important than its instances.

Kerry


 


Re: Project exploring automated classification of article importance

Jane Darnell
Sorry if I seemed negative! I am just responding to your comments in the
same way I have been trying to decide how to measure stuff to enable my
wikiprojects to move forward. This is very frustrating stuff! I also agree
that editor activity is probably a very good way to measure all sorts of
things, and it just seems sad that any attempts in this area seem to come
up hard against a wall of "privacy issues". Privacy is also linked to
ownership, and as it stands now, Wikipedia editors still own their own
words & media, which means we can't let go of the cc-by-sa licensing yet. I
do agree however that we should move to a model of "cc0 by default" rather
than "cc-by-sa" by default. Most people don't care and if you explain the
difference they are surprised that there is an option that is more open
than the one they thought they were using. We can't retroactively turn
cc-by-sa into cc0 without the consent of the original
uploaders/writers, but we can try to get documents and data released as cc0 in
Wikisource and more cc0 material uploaded to Commons!

On Wed, Apr 26, 2017 at 2:32 PM, Kerry Raymond <[hidden email]>
wrote:

> I think you are reading my comments too negatively. I’m not saying to
> ignore pageviews or incoming links. I’m saying that a naïve look at their
> stats may not be as useful as some of the variations I mention. I think it
> is worth looking at pageviews relative to those articles in the same
> WikiProject. I think it is worth looking at inbound links but to consider
> two groups, those coming from the same WikiProject(s) and from other
> WikiProjects. I think the position of the incoming links within their
> source articles is also significant, either first sentence, first para,
> whole of lede, or absolute/relative position of the link in the article
> (e.g. 2000 bytes from start, or 40% from start).
>
>
>
> The big difference between machine-assessment of article quality and
> article importance is that quality is a metric on the article but
> importance is a metric on the topic. Also, my informal observation is that
> article quality does improve and degrade over time and hence is much more
> dynamic than topic importance, which seems to me to be much more stable. So
> I think there is less scope for dramatically improving the situation by
> being able to determine topic importance than the benefits likely to be
> achieved from automated quality assessment, but there may be benefit if
> there are heuristics to spot the relatively few articles which do need
>  importance re-assessed due to “current events”. In which case “editor
> activity” may be a metric, particularly “editor activity” on the lede para
> or other more critical areas of the article.
>
>
>
> I am not too worried about 22nd century. I think we should look more at
> the next decade. Who would have predicted the demise of Usenet? It seemed
> pretty sexy at the time, etc. Wikipedia, like many things, will pass. It’s
> not to say it will pass into oblivion but it may morph into something very
> different to what it is today. Being CC-BY-SA improves the chances that any
> successor can build on it, but maybe we should put into WMF’s constitution,
> “if WMF shuts down, we release the contents of the projects as CC0” (to
> increase the likelihood that the content has a future). Having had to shut
> down a number of research institutes when the funding ran out, I know the
> utter stupidity occurs when they retain a skeleton of staff to “sell off
> all our valuable IP” which every closing-down institution seems to wants to
> do and the result is that the IP gets wasted because it isn’t sold or it’s
> sold to one of those companies who buy IP for tuppence on the off-chance
> they can potentially engage in patent litigation (or other IP litigation)
> downstream. We waste so much IP with this kind of “make a buck” thinking.
> <end of rant>
>
>
>
> Kerry
>
>
>
> *From:* Jane Darnell [mailto:[hidden email]]
> *Sent:* Wednesday, 26 April 2017 5:51 PM
> *To:* [hidden email]; Research into Wikimedia content and
> communities <[hidden email]>
> *Subject:* Re: [Wiki-research-l] Project exploring automated
> classification of article importance
>
>
>
> Yes I totally agree that "importance is a relative metric rather than
> absolute." I also agree that incoming links and pageviews are not accurate
> measurements of "importance" for all of the reasons you mention. However,
> we are still a project that is actively exploring the universe of
> knowledge, and leaning heavily on academia and other established sources we
> must "boldly go where no man has gone before" (and please feel free to
> insert "white, euro-centric" before the man part). So do you have any
> suggestions what we could measure going forward that would cough up some
> interesting stats to monitor? Pagewatching is useful , but problematic
> because these are only assigned at page-creation, while some marginal
> editor interest might be expanded to whole categories (speaking as someone
> who has thousands of pages watchlisted on multiple projects). I like your
> thoughts about looking for key articles such as those used as the "article
> as the "main" article for a category or as the title of a navbox ".  I am
> looking for similar usages of paintings as a way to find popular painters
> or paintings rather than just those paintings which have articles written
> about them (which are often written for totally random reasons such as
> theft/sale/wikiproject).
>
>
>
> On Wed, Apr 26, 2017 at 5:39 AM, Kerry Raymond <[hidden email]>
> wrote:
>
> Just a few musings on the issue of Importance and how to research it ...
>
> I agree it is intuitive that importance is likely to be linked to
> pageviews and inbound links but, as the preliminary experiment showed, it's
> probably not that simple.
>
> Pageviews tells us something about importance to readers of Wikipedia,
> while inbound links tells us something about importance to writers of
> Wikipedia, and I suspect that writers are not a proxy for readers as the
> editor surveys suggest that Wikipedia writers are not typical of broader
> society on at least two variables: gender and level of education (might be
> others, I can't remember).
>
> But I think importance is a relative metric rather than  absolute. I think
> by taking the mean value of importance across a number of WikiProjects in
> the preliminary experiment may have lost something because it tried
> (through averaging) to look at importance "generally". I would suspect
> conducting an experiment considering only the importance ratings wrt to a
> single WikiProject would be more likely to show correlation with pageviews
> (wrt to other articles in that same WikiProject) and inbound links. And I
> think there are two kinds of inbound links to be considered, those coming
> from other articles within the same WikiProject and those coming from
> outside that Wikiproject. I suspect different insights will be obtained by
> looking at both types of inbound links separately rather than treating them
> as an aggregate. I note also that WikiProjects are not entirely independent
> of one another but have relationships between them. For example, The
> WikiProject Australian Roads describes itself as an "intersection" (ha ha!)
> of WikiProject Highways and WikiProject Australia, so I expect that we
> would find greater correlation in importance between related WikiProjects
> than between unrelated WikiProjects.
>
> When thinking about readers and pageviews, I think we have to ask
> ourselves whether there is a difference between popularity and importance,
> or whether popularity *is* importance. I sense that, as a group of educated
> people, those of us reading this research mailing list probably do think
> there is a difference. Certainly if there is no difference, then this
> research can stop now -- just judge importance by pageviews. Let's assume
> a difference then. The pageviews of an article are not always consistent
> over time. Here are the pageviews for Drottninggatan:
>
> https://tools.wmflabs.org/pageviews/?project=en.wikipedia.org&platform=all-access&agent=user&range=latest-90&pages=Drottninggatan
>
> Why so interesting on 8 April? A terrorist attack occurred there. This
> kind of spike in pageviews occurs all the time when some topic is in the news
> (even peripherally, as in this case, where it is not the article about the
> terrorist attack but about the street in which it occurred). Did the street
> become more "important"? I think it became more interesting but not more
> important. So I think we do have to be careful to understand that pageviews
> probably reflect interest rather than importance. I note that The
> Chainsmokers (a music group with a number of songs in the current USA music
> charts) get many more Wikipedia article pageviews than the Wikipedia
> article on Pasteurization but The Chainsmokers are not rated as being of
> high importance by the relevant WikiProjects while Pasteurization is very
> important in WikiProject Food and Drink. Since pasteurisation prevents a
> lot of deaths, I think we might agree that in the real world pasteurisation
> is more important than a music group regardless of what pageviews tell us.
>
> https://tools.wmflabs.org/pageviews/?project=en.wikipedia.org&platform=all-access&agent=user&range=latest-90&pages=The_Chainsmokers|Pasteurization
>
> Of course it matters for Wikipedia's success that our *popular*
> articles are of high quality, but I think we have to be cautious about
> pageviews being a proxy for importance.
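One hedged way to keep news-driven spikes from dominating a pageview feature is to summarise a series by its median rather than its mean. The sketch below assumes a list of daily view counts is already in hand; the numbers are invented for illustration and are not the real Drottninggatan figures, and the `factor` threshold is an arbitrary assumption.

```python
from statistics import mean, median

def baseline_views(daily_views):
    """Median daily views: a summary that a short spike barely moves."""
    return median(daily_views)

def spike_days(daily_views, factor=5.0):
    """Indices of days whose views exceed `factor` times the median."""
    base = median(daily_views)
    return [i for i, v in enumerate(daily_views) if v > factor * base]

# Hypothetical series: a quiet article that is briefly in the news.
views = [40, 42, 38, 41, 5000, 900, 45, 39]

typical = baseline_views(views)   # stays close to the quiet-day level
average = mean(views)             # pulled far upwards by two spike days
spikes = spike_days(views)
```

The median reflects sustained interest, while the flagged days could be studied separately as "current events" traffic.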
>
> When we look at Wikipedia writers' decisions in tagging the importance of
> articles to WikiProjects, what do we find? As we know, project tags are
> often placed on new articles (and often not subsequently reviewed). So
> while I find that quality tags are often out-of-date, the importance seems
> to be pretty accurate even on new stub articles. This is because it is
> the importance of the *topic* that is being assessed which is independent
> of the Wikipedia article itself. Provided the article is clear enough about
> what it is about and why it matters (which is the traditional content of
> that first paragraph or two and failing to provide it will likely result in
> speedy deletion of the new article), assessment of the topic's importance
> can be made even at new stub level. This tells us that importance for
> Wikipedia writers is determined by something outside of Wikipedia (probably
> their real-world knowledge of that topic space -- one assumes that project
> taggers are quite interested in the topic space of that project). While
> article quality hopefully improves over time, I would be surprised if
> article importance greatly changed over time. Obviously there are
> counter-examples. I am guessing Donald Trump's article may have grown in
> importance over time but that's probably because his lede para changed.
> Adding President of the USA into the lede paragraph makes him much more
> important than he was before in the real world and internal to Wikipedia he
> has acquired an inbound link from the presumably high-importance President
> of the USA article. So I think it might be interesting to study those
> articles whose importance does change over time to see if there are any
> strong correlations with what is happening to the article inside Wikipedia.
> I think this set of importance-changing articles may be where we
> really learn what Wikipedia article characteristics are strongly correlated
> to "importance", given that importance itself appears to be pretty stable
> for most articles.
>
> Although not stated explicitly, I imagine we believe that generally less
> important articles tend to link to more important articles but more
> important articles don't link to less important articles. And hence
> in-bound links are likely to matter in assessing importance, and
> in-bound links from "important" articles are more valuable than in-bound
> links from less important articles (which creates something of a
> bootstrapping problem), similar to the issue in Google's PageRank
> algorithm. But I think we do have some information that Google doesn't
> have. The average webpage does not have a lede paragraph that situates the
> topic relative to other topics; a Wikipedia article does. If I have to
> choose to define Thing X in terms of Thing Y, it tends to suggest that Y is
> more important than X. If Y also defines itself in terms of X, then it
> tends to suggest they are equivalent in importance in some way. Indeed I
> suspect when we get to the VERY IMPORTANT topics we will see this kind of
> circular definition (e.g. you see circular definitions in Wikipedia around
> Philosophy and Knowledge). As an aside, if you have never done this before,
> try this experiment. Choose a random article (left hand tool bar in Desktop
> Wikipedia), then click the first link in the article that matters (i.e.
> ignore links in hatnotes or links inside parentheses). Repeat this first-link
> clicking and sooner or later you will reach articles like Knowledge and
> Philosophy, which all sit inside circular definition groups.
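The first-link experiment can be sketched as a walk over a map from each article to its first "real" link, stopping when an article repeats. The map below is a hypothetical toy, not extracted from Wikipedia; extracting the true first link (skipping hatnotes and parentheses) is assumed to be done elsewhere.

```python
def follow_first_links(start, first_link, max_steps=100):
    """Follow first links from `start`.

    Returns (path, cycle): `path` is the list of articles visited in order,
    and `cycle` is the circular definition group reached, or [] if the walk
    hits an article with no known first link or runs out of steps.
    """
    path = [start]
    seen = {start}
    current = start
    for _ in range(max_steps):
        nxt = first_link.get(current)
        if nxt is None:
            return path, []
        if nxt in seen:
            # The walk has looped back: everything from the first visit
            # to `nxt` onwards forms the cycle.
            return path, path[path.index(nxt):]
        path.append(nxt)
        seen.add(nxt)
        current = nxt
    return path, []

# Hypothetical first-link map (illustrative only).
toy_first_link = {
    "Stub": "Topic",
    "Topic": "Science",
    "Science": "Knowledge",
    "Knowledge": "Philosophy",
    "Philosophy": "Knowledge",
}

path, cycle = follow_first_links("Stub", toy_first_link)
```

Under the argument above, articles that end up inside such a cycle would be candidates for the highest importance levels, and the number of steps needed to reach the cycle is one possible distance measure.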
>
> If we look at the Donald Trump article, his first sentence contains only
> two links, one to List of Presidents of the USA and the other to President
> of the USA. If we look at those two articles, we find that both of them
> mention Donald Trump in their lede paras (although not as early as the
> first sentence) and before mentions of any other US President elsewhere in
> the article. This is consistent with what we know about the real world:
> the role of the President is more important than its officeholders, and
> the current officeholder has more importance than a past officeholder. So
> topic importance does seem to be skewed towards the "present day".
>
> So I suspect the links in the lede paras are of greater relevance to the
> assessment of importance than links further down in the article, which will
> more likely relate to details of a topic and may include examples and
> counter-examples (this is a way in which a high-importance article may
> mention much lower-importance articles). However, we do have to be a little
> bit careful here because of the MoS practice of not linking very common
> terms. For example, an Australian article will often refer to Australia in
> the lede para but it will almost certainly not be linked to the Australia
> article (and any attempt to add such a link will likely see it removed with
> an edit summary that mentions [[WP:Overlinking]]) whereas there is no
> problem if you link to an Australian state article, e.g. New South Wales.
> So we might find that some very important topics that often appear in ledes
> might get fewer links than you might expect because of the MoS policies on
> overlinking, which may be a problem when working with inbound links. It may
> be that for "very common topics" the presence of the article title (or its
> synonyms) in the lede may have to be considered as if it were an in-bound
> link for statistical research purposes.
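A rough sketch of treating an unlinked lede mention as a pseudo inbound link, assuming the lede is available as wikitext. The regular expression only handles simple `[[Target]]`, `[[Target#Section]]` and `[[Target|label]]` links, and the function name and return labels are hypothetical.

```python
import re

# Matches [[Target]], [[Target#Section]] and [[Target|label]]; group 1 is
# the link target without any section or label.
LINK_RE = re.compile(r"\[\[([^\]|#]+)(?:#[^\]|]*)?(?:\|[^\]]*)?\]\]")

def lede_reference(lede_wikitext, title, synonyms=()):
    """Classify how a lede refers to `title`: 'linked', 'mentioned', 'absent'."""
    names = {n.lower() for n in (title, *synonyms)}
    targets = {m.group(1).strip().lower()
               for m in LINK_RE.finditer(lede_wikitext)}
    if names & targets:
        return "linked"
    # Drop all links (including their labels) and search the remaining
    # plain text for the title or a synonym as a whole word.
    plain = LINK_RE.sub(" ", lede_wikitext).lower()
    for name in names:
        if re.search(r"\b" + re.escape(name) + r"\b", plain):
            return "mentioned"
    return "absent"

lede = "'''Wagga Wagga''' is a city in [[New South Wales]], Australia."
```

For statistical purposes, "mentioned" results for very common topics could then be added to the inbound link count, or kept as a separate feature so the two can be compared.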
>
> Given all of the above, perhaps the most interesting group of articles to
> study in Wikipedia are those articles whose manually-assessed importance
> has changed over the life of the article AND which were NOT current topics
> in the lifetime of Wikipedia (given the influence of "current" on
> importance). But having said that, I wonder if that group of articles
> actually exists. Recently a newish Australian contributor expressed
> disappointment that all the new articles they had created were tagged (by
> others) as of Low Importance. My instinctive reply was "that's normal, I
> think of the thousands of articles I have started only a couple are even
> rated as Mid importance, this is because the really important articles were
> all started long ago precisely because they were important". I suspect topics
> that are very important (for reasons other than short-lived
> importance due to being "current" in the lifetime of Wikipedia) will
> generally show up as having started early in Wikipedia's life and that
> those that become more/less important over time will be largely linked to
> becoming or ceasing to be "current" topics. E.g. the article Pasteurization
> started in May 2001 saying nothing more than "Pasteurization is the
> process of killing off bacteria in milk by quickly heating it to a near
> boiling temperature, then quickly cooling it again before the taste and
> other desirable properties are affected. The process was named after its
> inventor, French scientist Louis Pasteur. See also dairy products." The
> links in this very first version are still present in its lede paragraph
> today, suggesting our understanding of "non-current" topics is stable and
> hence initial importance determinations can probably be accurately made.
> For Pasteurization the Talk page shows it was not project-tagged until 2007
> when it was assigned High Importance as its first assessment.
>
> I suspect we will find that initial manual assessment of article
> importance will be pretty accurate for most articles. And I suspect if we
> plot initial importance assessments against time of assessment, we will
> find the higher importance articles commenced life on Wikipedia earlier
> than the lower importance articles. If I am correct, then there isn't a lot
> of value in machine-assessment of importance of topics because it relates
> to factors external to Wikipedia and often does not change over time and
> therefore can often be correctly assessed manually even on new stub
> articles (and any unassessed articles can probably be rated as Low
> Importance as statistically that's almost certainly going to be correct).
> If a topic becomes more important due to "current" events, then invariably
> that article will be updated by many people and one of them will sooner or
> later manually adjust its importance. What is less likely to happen is
> a downward re-assessment of Importance when an important "current" topic
> loses its importance because it is no longer current, e.g. are former American
> presidents like Barack Obama or George W Bush or further back less
> important now? These articles will not be updated frequently once the topic
> is no longer in the news and therefore it is less likely an editor will
> notice and manually downgrade the importance, so there may be a greater
> role for machine-assessment in downgrading importance rather than upgrading
> importance.
>
> Another area where there might be a role for machine-assessed importance
> is POV-pushing, where a POV-motivated editor might change the
> manually-assessed importance of articles to be higher or lower based on
> their POV (e.g. my political party is Top Importance, other parties are of
> Low Importance). I suspect that often a page watcher would correct or at
> least question that kind of re-assessment. However, on articles with few
> active pagewatchers you might get away with POV-pushing the article's
> importance tag because nobody notices. In this situation, a machine
> assessment could be useful in spotting this kind of thing.
>
> This suggests that another metric of interest to importance might be
> number of pagewatchers, although I suspect that pagewatching may relate
> more to caring about the article than to caring about the topic. And one
> has to be careful to distinguish active pagewatchers (those who actually do
> review changes on their watchlists) from those who don't, as that may make
> a difference (although I am not sure we can really tell which pagewatchers
> are truly actively reviewing, as a "satisfactory review" doesn't leave a
> trace whereas an "unsatisfactory" review is likely to lead relatively
> soon to a revert or some other change to the article, the article Talk or the
> User Talk of the reviewed contributor, which may be detectable).
>
> The other aspect of articles that occurs to me as being possibly linked to
> importance of the topic would be use of the article as the "main" article
> for a category or as the title of a navbox (as it suggests that the
> articles in the category or navbox are in some way subordinate to the
> main/title article). Similarly for list articles, the "type" of the list is
> often more important than its instances.
>
> Kerry
>
> -----Original Message-----
> From: Wiki-research-l [mailto:[hidden email]]
> On Behalf Of Morten Wang
> Sent: Friday, 21 April 2017 6:04 AM
> To: Research into Wikimedia content and communities <
> [hidden email]>
> Subject: Re: [Wiki-research-l] Project exploring automated classification
> of article importance
>
> Hi Pine,
>
> These are great pointers to existing practices on enwiki, some of which
> I've been looking for and/or missed, thanks!
>
>
> Cheers,
> Morten
>
> On 19 April 2017 at 22:35, Pine W <[hidden email]> wrote:
>
> > Hi Nettrom,
> >
> > A few resources from English Wikipedia regarding article importance as
> > ranked by humans:
> >
> > https://en.wikipedia.org/wiki/Wikipedia:Vital_articles
> >
> > https://en.wikipedia.org/wiki/Wikipedia:Version_1.0_Editorial_Team/Release_Version_Criteria#Priority_of_topic
> >
> > https://en.wikipedia.org/wiki/Wikipedia:WikiProject_assessment#Statistics
> >
> > I infer from the ENWP Wikicup's scoring protocol that for purposes of
> > the competition, an article's "importance" is loosely inferred from
> > the number of language editions of Wikipedia in which the article
> > appears:
> > https://en.wikipedia.org/wiki/Wikipedia:WikiCup/Scoring#Bonus_points.
> >
> > HTH,
> >
> > Pine
> >
> >
> > On Tue, Apr 18, 2017 at 4:17 PM, Morten Wang <[hidden email]> wrote:
> >
> > > Hello everyone,
> > >
> > > I am currently working with Aaron Halfaker and Dario Taraborelli at
> > > the Wikimedia Foundation on a project exploring automated
> > > classification of article importance. Our goal is to characterize
> > > the importance of an article within a given context and design a
> > > system to predict a relative importance rank. We have a project page
> > > on meta[1] and welcome comments or thoughts on our talk page. You can
> > > of course also respond here on wiki-research-l, or send me an email.
> > >
> > > Before moving on to model-building I did a fairly thorough
> > > literature review, finding a myriad of papers spanning several
> > > disciplines. We have a draft literature review also up on meta[2],
> > > which should give you a reasonable introduction to the topic. Again,
> > > comments or thoughts (e.g. papers we’ve missed) on the talk page,
> > > mailing list, or through email are welcome.
> > >
> > > Links:
> > >
> > >    1. https://meta.wikimedia.org/wiki/Research:Automated_classification_of_article_importance
> > >    2. https://meta.wikimedia.org/wiki/Research:Studies_of_Importance
> > >
> > > Regards,
> > > Morten
> > > [[User:Nettrom]] aka [[User:SuggestBot]]

Re: Project exploring automated classification of article importance

Gerard Meijssen-3
In reply to this post by Jonathan Cardy
Hoi,
I have read the proposal and it leaves me wondering. The notion of
importance is indeed neither easy nor obvious. I think the question of what
is most important may itself be irrelevant, depending on how you look at it.
A subject can be irrelevant from a personal perspective or from some other
particular perspective, and what seems relevant may become irrelevant, or
the reverse, over time. Whatever metric you use, there will always be some
way in which it is found to be problematic.

What sets Wikipedia apart from similar resources is that its long tail is
so much longer, and still it is easy to show that the English Wikipedia's
long tail is not long enough [1]. When you are looking for links and
relevance, Wikidata includes data on all Wikipedias and thereby offers more
avenues to establish relevance.

Research has been done that shows that suggesting articles for people to
write or amend works best when the suggestions are about subjects they
care about. In that research, what people are interested in was inferred
from past behaviour. What we could do is flip this and ask people, based on
categories, on projects, or on whatever else people use to categorise their
interests. This will work on a micro level. On a meta level, it may drive
cooperation when we enable people to share their interest (at that moment
in time). On a macro level, data may arrive at Wikidata and this will allow
us to find which articles include specific data (think date of death, for
instance). On a meta and macro level, we could ask readers what subjects
they are missing. This would provide an additional incentive for people to
write. For this last suggestion we could measure what people are missing.

Anyway, relevance and importance depend on a point of view. When our
community is enabled to make a difference, it will help us with our
content. As a movement we know that there is plenty that we do not cover
properly. Advocating these issues and targeting and educating potential
communities is where the WMF could play more of a role.
Thanks,
       GerardM



[1]
http://ultimategerardm.blogspot.nl/2017/04/wikidata-user-stories-sum-of-all.html

On 26 April 2017 at 13:48, Jonathan Cardy <[hidden email]>
wrote:

> I like to think that in time importance will win out over popularity. If
> Wikipedia still exists in fifty or five hundred years' time and we are still
> using pasteurisation and indeed still eating hydrocarbon based foods, then
> I suspect the pop group you mention will be less frequently read about than
> the pasteurisation process.
>
> In the meantime if we try to work it out at all it has to be something of
> a judgement call, and one we will occasionally get wrong. Any guesses as to
> which current branches of science will be as forgotten in a century as
> phrenology is today?
>
> At an extreme the weekly top ten most viewed articles are a good guide to
> what is trending in the popular cultures of India and the USA. I'm assuming
> that most modern pop culture is inherently ephemeral. Of course digital
> historians of future centuries may be rolling on the floor laughing at this
> email, and the TV dramas currently being filmed may still be widely studied
> and universally known classics while our leading edge science lies buried
> in the foundations of their science.
>
> Regards
>
> Jonathan
>
>
> > On 26 Apr 2017, at 08:50, Jane Darnell <[hidden email]> wrote:
> >
> > Yes I totally agree that "importance is a relative metric rather than
> > absolute." I also agree that incoming links and pageviews are not accurate
> > measurements of "importance" for all of the reasons you mention. However,
> > we are still a project that is actively exploring the universe of
> > knowledge, and leaning heavily on academia and other established sources we
> > must "boldly go where no man has gone before" (and please feel free to
> > insert "white, euro-centric" before the man part). So do you have any
> > suggestions what we could measure going forward that would cough up some
> > interesting stats to monitor? Pagewatching is useful, but problematic
> > because these are only assigned at page-creation, while some marginal
> > editor interest might be expanded to whole categories (speaking as someone
> > who has thousands of pages watchlisted on multiple projects). I like your
> > thoughts about looking for key articles such as those used as the "main"
> > article for a category or as the title of a navbox. I am
> > looking for similar usages of paintings as a way to find popular painters
> > or paintings rather than just those paintings which have articles written
> > about them (which are often written for totally random reasons such as
> > theft/sale/wikiproject).
> >
> > On Wed, Apr 26, 2017 at 5:39 AM, Kerry Raymond <[hidden email]>
> > wrote:
> >
> >> Just a few musings on the issue of Importance and how to research it ...
> >>
> >> I agree it is intuitive that importance is likely to be linked to
> >> pageviews and inbound links but, as the preliminary experiment showed,
> >> it's probably not that simple.
> >>
> >> Pageviews tell us something about importance to readers of Wikipedia,
> >> while inbound links tell us something about importance to writers of
> >> Wikipedia, and I suspect that writers are not a proxy for readers as the
> >> editor surveys suggest that Wikipedia writers are not typical of broader
> >> society on at least two variables: gender and level of education (might
> >> be others, I can't remember).
> >>
> >> But I think importance is a relative metric rather than an absolute one.
> >> I think that, by taking the mean value of importance across a number of
> >> WikiProjects, the preliminary experiment may have lost something because
> >> it tried (through averaging) to look at importance "generally". I would
> >> suspect that an experiment considering only the importance ratings with
> >> respect to a single WikiProject would be more likely to show correlation
> >> with pageviews (relative to other articles in that same WikiProject) and
> >> inbound links. And I think there are two kinds of inbound links to be
> >> considered, those coming from other articles within the same WikiProject
> >> and those coming from outside that WikiProject. I suspect different
> >> insights will be obtained by looking at both types of inbound links
> >> separately rather than treating them as an aggregate. I note also that
> >> WikiProjects are not entirely independent of one another but have
> >> relationships between them. For example, The WikiProject Australian Roads
> >> describes itself as an "intersection" (ha ha!) of WikiProject Highways
> >> and WikiProject Australia, so I expect that we would find greater
> >> correlation in importance between related WikiProjects than between
> >> unrelated WikiProjects.
> >>
> >> When thinking about readers and pageviews, I think we have to ask
> >> ourselves whether there is a difference between popularity and
> >> importance, or whether popularity *is* importance. I sense that, as a
> >> group of educated people, those of us reading this research mailing list
> >> probably do think there is a difference. Certainly if there is no
> >> difference, then this research can stop now -- just judge importance by
> >> pageviews. Let's assume a difference then. The pageviews of an article
> >> are not always consistent over time. Here are the pageviews for
> >> Drottninggatan:
> >>
> >> https://tools.wmflabs.org/pageviews/?project=en.wikipedia.org&platform=all-access&agent=user&range=latest-90&pages=Drottninggatan
> >>
> >> Why so interesting on 8 April? A terrorist attack occurred there. This
> >> kind of spike in pageviews occurs all the time when some topic is in the
> >> news (even peripherally, as in this case, where it is not the article
> >> about the terrorist attack but about the street in which it occurred).
> >> Did the street become more "important"? I think it became more
> >> interesting but not more important. So I think we do have to be careful
> >> to understand that pageviews probably reflect interest rather than
> >> importance. I note that The Chainsmokers (a music group with a number of
> >> songs in the current USA music charts) get many more Wikipedia article
> >> pageviews than the Wikipedia article on Pasteurization but The
> >> Chainsmokers are not rated as being of high importance by the relevant
> >> WikiProjects while Pasteurization is very important in WikiProject Food
> >> and Drink. Since pasteurisation prevents a lot of deaths, I think we
> >> might agree that in the real world pasteurisation is more important than
> >> a music group regardless of what pageviews tell us.
> >>
> >> https://tools.wmflabs.org/pageviews/?project=en.wikipedia.org&platform=all-access&agent=user&range=latest-90&pages=The_Chainsmokers|Pasteurization
> >>
> >> Of course it matters for Wikipedia's success that our *popular*
> >> articles are of high quality, but I think we have to be cautious about
> >> pageviews being a proxy for importance.
> >>
> >> When we look at Wikipedia writers' decisions in tagging the importance of
> >> articles to WikiProjects, what do we find? As we know, project tags are
> >> often placed on new articles (and often not subsequently reviewed). So
> >> while I find that quality tags are often out-of-date, the importance
> >> seems to be pretty accurate even on new stub articles. This is because it
> >> is the importance of the *topic* that is being assessed, which is
> >> independent of the Wikipedia article itself. Provided the article is
> >> clear enough about what it is about and why it matters (which is the
> >> traditional content of that first paragraph or two and failing to provide
> >> it will likely result in speedy deletion of the new article), assessment
> >> of the topic's importance can be made even at new stub level. This tells
> >> us that importance for Wikipedia writers is determined by something
> >> outside of Wikipedia (probably their real-world knowledge of that topic
> >> space -- one assumes that project taggers are quite interested in the
> >> topic space of that project). While article quality hopefully improves
> >> over time, I would be surprised if article importance greatly changed
> >> over time. Obviously there are counter-examples. I am guessing Donald
> >> Trump's article may have grown in importance over time but that's
> >> probably because his lede para changed. Adding President of the USA into
> >> the lede paragraph makes him much more important than he was before in
> >> the real world and internal to Wikipedia he has acquired an inbound link
> >> from the presumably high-importance President of the USA article. So I
> >> think it might be interesting to study those articles whose importance
> >> does change over time to see if there are any strong correlations with
> >> what is happening to the article inside Wikipedia. I think this set of
> >> importance-changing articles may be where we really learn what Wikipedia
> >> article characteristics are strongly correlated to "importance", given
> >> that importance itself appears to be pretty stable for most articles.
> >>
> >> Although not stated explicitly, I imagine we believe that generally less
> >> important articles tend to link to more important articles but more
> >> important articles don't link to less important articles. And hence
> >> in-bound links are likely to matter in assessing importance, and
> >> in-bound links from "important" articles are more valuable than in-bound
> >> links from less important articles (which creates something of a
> >> bootstrapping problem), similar to the issue in Google's PageRank
> >> algorithm. But I think we do have some information that Google doesn't
> >> have. The average webpage does not have a lede paragraph that situates
> >> the topic relative to other topics; a Wikipedia article does. If I have
> >> to choose to define Thing X in terms of Thing Y, it tends to suggest that
> >> Y is more important than X. If Y also defines itself in terms of X, then
> >> it tends to suggest they are equivalent in importance in some way. Indeed
> >> I suspect when we get to the VERY IMPORTANT topics we will see this kind
> >> of circular definition (e.g. you see circular definitions in Wikipedia
> >> around Philosophy and Knowledge). As an aside, if you have never done
> >> this before, try this experiment. Choose a random article (left hand tool
> >> bar in Desktop Wikipedia), then click the first link in the article that
> >> matters (i.e. ignore links in hatnotes or links inside parentheses).
> >> Repeat this first-link clicking and sooner or later you will reach
> >> articles like Knowledge and Philosophy, which all sit inside circular
> >> definition groups.
> >>
> >> If we look at the Donald Trump article, his first sentence contains only
> >> two links, one to List of Presidents of the USA and the other to
> >> President of the USA. If we look at those two articles, we find that both
> >> of them mention Donald Trump in their lede paras (although not as early
> >> as the first sentence) and before mentions of any other US President
> >> elsewhere in the article. This is consistent with what we know about the
> >> real world: the role of the President is more important than its
> >> officeholders, and the current officeholder has more importance than a
> >> past officeholder. So topic importance does seem to be skewed towards the
> >> "present day".
> >>
> >> So I suspect the links in the lede paras are of greater relevance to the
> >> assessment of importance than links further down in the article, which
> >> will more likely relate to details of a topic and may include examples
> >> and counter-examples (this is a way in which a high-importance article
> >> may mention much lower-importance articles). However, we do have to be a
> >> little bit careful here because of the MoS practice of not linking very
> >> common terms. For example, an Australian article will often refer to
> >> Australia in the lede para but it will almost certainly not be linked to
> >> the Australia article (and any attempt to add such a link will likely see
> >> it removed with an edit summary that mentions [[WP:Overlinking]]) whereas
> >> there is no problem if you link to an Australian state article, e.g. New
> >> South Wales. So we might find that some very important topics that often
> >> appear in ledes might get fewer links than you might expect because of
> >> the MoS policies on overlinking, which may be a problem when working with
> >> inbound links. It may be that for "very common topics" the presence of
> >> the article title (or its synonyms) in the lede may have to be considered
> >> as if it were an in-bound link for statistical research purposes.
> >>
> >> Given all of the above, perhaps the most interesting group of articles to
> >> study in Wikipedia are those articles whose manually-assessed importance
> >> has changed over the life of the article AND which were NOT current
> >> topics in the lifetime of Wikipedia (given the influence of "current" on
> >> importance). But having said that, I wonder if that group of articles
> >> actually exists. Recently a newish Australian contributor expressed
> >> disappointment that all the new articles they had created were tagged (by
> >> others) as of Low Importance. My instinctive reply was "that's normal, I
> >> think of the thousands of articles I have started only a couple are even
> >> rated as Mid importance, this is because the really important articles
> >> were all started long ago precisely because they were important". I
> >> suspect topics that are very important (for reasons other than
> >> short-lived importance due to being "current" in the lifetime of
> >> Wikipedia) will generally show up as having started early in Wikipedia's
> >> life and that those that become more/less important over time will be
> >> largely linked to becoming or ceasing to be "current" topics. E.g. the
> >> article Pasteurization started in May 2001 saying nothing more than
> >> "Pasteurization is the process of killing off bacteria in milk by quickly
> >> heating it to a near boiling temperature, then quickly cooling it again
> >> before the taste and other desirable properties are affected. The process
> >> was named after its inventor, French scientist Louis Pasteur. See also
> >> dairy products." The links in this very first version are still present
> >> in its lede paragraph today, suggesting our understanding of
> >> "non-current" topics is stable and hence initial importance
> >> determinations can probably be accurately made. For Pasteurization the
> >> Talk page shows it was not project-tagged until 2007 when it was assigned
> >> High Importance as its first assessment.
> >>
> >> I suspect we will find that initial manual assessment of article
> >> importance will be pretty accurate for most articles. And I suspect if we
> >> plot initial importance assessments against time of assessment, we will
> >> find the higher importance articles commenced life on Wikipedia earlier
> >> than the lower importance articles. If I am correct, then there isn't a
> >> lot of value in machine-assessment of importance of topics because it
> >> relates to factors external to Wikipedia and often does not change over
> >> time and therefore can often be correctly assessed manually even on new
> >> stub articles (and any unassessed articles can probably be rated as Low
> >> Importance as statistically that's almost certainly going to be correct).
> >> If a topic becomes more important due to "current" events, then
> >> invariably that article will be updated by many people and one of them
> >> will sooner or later manually adjust its importance. What is less likely
> >> to happen is a downward re-assessment of Importance when an important
> >> "current" topic loses its importance once it is no longer current, e.g.
> >> are former American presidents like Barack Obama or George W Bush or
> >> those further back less important now? These articles will not be updated
> >> frequently once the topic is no longer in the news and therefore it is
> >> less likely an editor will notice and manually downgrade the importance,
> >> so there may be a greater role for machine-assessment in downgrading
> >> importance rather than upgrading importance.
> >>
> >> Another area where there might be a role for machine-assessed importance
> >> is in regard to POV-pushing, where a POV-motivated editor might change
> >> the manual-assessment importance of articles to be higher or lower based
> >> on their POV (e.g. my political party is Top Importance, other parties
> >> are of Low Importance). I suspect that often a page watcher would correct
> >> or at least question that kind of re-assessment. However, on articles
> >> with few active pagewatchers you might get away with POV-pushing the
> >> article's importance tag because nobody noticed. In this situation, a
> >> machine assessment could be useful in spotting this kind of thing.
> >>
> >> This suggests that another metric of interest to importance might be
> >> number of pagewatchers, although I suspect that pagewatching may relate
> >> more to caring about the article than to caring about the topic. And one
> >> has to be careful to distinguish active pagewatchers (those who actually
> >> do review changes on their watchlists) from those who don't, as that may
> >> make a difference (although I am not sure we can really tell which
> >> pagewatchers are truly actively reviewing, as a "satisfactory review"
> >> doesn't leave a trace whereas an "unsatisfactory" review is likely to
> >> lead fairly soon to a revert or some other change to the article, the
> >> article Talk or the User Talk of the reviewed contributor, which may be
> >> detectable).
> >>
> >> The other aspect of articles that occurs to me as being possibly linked
> >> to importance of the topic would be use of the article as the "main"
> >> article for a category or as the title of a navbox (as it suggests that
> >> the articles in the category or navbox are in some way subordinate to the
> >> main/title article). Similarly for list articles, the "type" of the list
> >> is often more important than its instances.
> >>
> >> Kerry
> >>
> >> -----Original Message-----
> >> From: Wiki-research-l [mailto:wiki-research-l-[hidden email]]
> >> On Behalf Of Morten Wang
> >> Sent: Friday, 21 April 2017 6:04 AM
> >> To: Research into Wikimedia content and communities <[hidden email]>
> >> Subject: Re: [Wiki-research-l] Project exploring automated classification
> >> of article importance
> >>
> >> Hi Pine,
> >>
> >> These are great pointers to existing practices on enwiki, some of which
> >> I've been looking for and/or missed, thanks!
> >>
> >>
> >> Cheers,
> >> Morten

Re: Project exploring automated classification of article importance

Stuart A. Yeates
On en.wiki, article importance is relative to a particular WikiProject. This
is encoded in https://en.wikipedia.org/wiki/Template:WPBannerMeta which
appears on 16% of all Wikipedia pages via specialisations such as
https://en.wikipedia.org/wiki/Template:WikiProject_New_Zealand

Within WikiProject New Zealand, there are articles which we think are very
important to us, but which we would never argue are even marginally
important on a global scale. Take for example
https://en.wikipedia.org/wiki/Pavlova_(food)

For the mathematically inclined, this is a classic case of a graph and its
many subgraphs.
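
To make the graph/subgraph picture concrete, here is a minimal Python sketch. It is an illustration only: the link list and the project membership below are made-up toy data, and inlink_counts is a hypothetical helper, not anything from the project or from MediaWiki. It simply splits each article's inbound links into those coming from inside the WikiProject's subgraph and those coming from the rest of the graph.

```python
from collections import defaultdict

# Hypothetical toy data: directed wiki links as (source, target) pairs.
LINKS = [
    ("Anzac biscuit", "Pavlova (food)"),
    ("New Zealand cuisine", "Pavlova (food)"),
    ("Meringue", "Pavlova (food)"),
    ("New Zealand cuisine", "New Zealand"),
    ("Pavlova (food)", "New Zealand"),
    ("Meringue", "Egg white"),
]

# Hypothetical WikiProject membership; in practice this would be read
# from the project banners on article talk pages.
PROJECTS = {
    "New Zealand": {
        "Pavlova (food)",
        "New Zealand cuisine",
        "New Zealand",
        "Anzac biscuit",
    },
}


def inlink_counts(links, members):
    """Map each linked-to project article to (in-project, external) inbound link counts."""
    counts = defaultdict(lambda: [0, 0])
    for source, target in links:
        if target not in members:
            continue  # only articles inside the subgraph are scored
        index = 0 if source in members else 1
        counts[target][index] += 1
    return {article: tuple(pair) for article, pair in counts.items()}


if __name__ == "__main__":
    result = inlink_counts(LINKS, PROJECTS["New Zealand"])
    for article, (inside, outside) in sorted(result.items()):
        print(f"{article}: {inside} in-project, {outside} external")
```

Running the same function once per WikiProject gives one pair of counts per subgraph, so an article can look central within one project and peripheral in the graph as a whole.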

cheers
stuart


--
...let us be heard from red core to black sky

On 27 April 2017 at 21:44, Gerard Meijssen <[hidden email]> wrote:

> Hoi,
> I have read the proposal and it leaves me wondering. The notion of
> importance is indeed neither easy nor obvious. I think the question of what
> is most important can be irrelevant, depending on how you look at it. A
> subject can be irrelevant from a personal perspective or from some other
> particular perspective, and what seems relevant may become irrelevant, or
> the reverse, over time. Whatever metrics you use, there will always be some
> way in which they are found to be problematic.
>
> When you consider Wikipedia, what sets it apart from similar resources is
> that its long tail is so much longer, and still it is easy to show that the
> English Wikipedia's long tail is not long enough [1]. When you are looking
> for links and relevance, Wikidata includes data on all Wikipedias and
> thereby offers more avenues to establish relevance.
>
> Research has been done that shows that when people are asked to write or
> amend articles, it works best when it is about subjects they care about. In
> that research, what people are interested in was inferred from past
> behaviour. What we could do is flip this and ask people, based on
> categories, on projects, on whatever people use to categorise their
> interests. This will work on a micro level. On a meta level, it may drive
> cooperation when we enable people to share their interest (at that moment
> in time). On a macro level, data may arrive at Wikidata and this will allow
> us to find which articles include specific data (think date of death, for
> instance). On a meta and macro level, we could ask readers what subjects
> they are missing. This would provide an additional incentive for people to
> write. For this last suggestion we could measure what people are missing.
>
> Anyway, relevance and importance depend on a point of view. When our
> community is enabled to make a difference, it will help us with our
> content. As a movement we know that there is enough that we do not properly
> cover. Advocating these issues and targeting and educating potential
> communities is where the WMF could play more of a role.
> Thanks,
>        GerardM
>
>
>
> [1]
> http://ultimategerardm.blogspot.nl/2017/04/wikidata-user-stories-sum-of-all.html
>
> On 26 April 2017 at 13:48, Jonathan Cardy <[hidden email]> wrote:
>
> > I like to think that in time importance will win out over popularity. If
> > Wikipedia still exists in fifty or five hundred years' time and we are
> > still using pasteurisation and indeed still eating hydrocarbon based
> > foods, then I suspect the pop group you mention will be less frequently
> > read about than the pasteurisation process.
> >
> > In the meantime if we try to work it out at all it has to be something of
> > a judgement call, and one we will occasionally get wrong. Any guesses as
> > to which current branches of science will be as forgotten in a century as
> > phrenology is today?
> >
> > At an extreme the weekly top ten most viewed articles are a good guide to
> > what is trending in the popular cultures of India and the USA. I'm
> > assuming that most modern pop culture is inherently ephemeral. Of course
> > digital historians of future centuries may be rolling on the floor
> > laughing at this email, and the TV dramas currently being filmed may still
> > be widely studied and universally known classics while our leading edge
> > science lies buried in the foundations of their science.
> >
> > Regards
> >
> > Jonathan
> >
> >
> > > On 26 Apr 2017, at 08:50, Jane Darnell <[hidden email]> wrote:
> > >
> > > Yes I totally agree that "importance is a relative metric rather than
> > > absolute." I also agree that incoming links and pageviews are not
> > > accurate measurements of "importance" for all of the reasons you
> > > mention. However, we are still a project that is actively exploring the
> > > universe of knowledge, and leaning heavily on academia and other
> > > established sources we must "boldly go where no man has gone before"
> > > (and please feel free to insert "white, euro-centric" before the man
> > > part). So do you have any suggestions for what we could measure going
> > > forward that would cough up some interesting stats to monitor?
> > > Pagewatching is useful, but problematic because these are only assigned
> > > at page-creation, while some marginal editor interest might be expanded
> > > to whole categories (speaking as someone who has thousands of pages
> > > watchlisted on multiple projects). I like your thoughts about looking
> > > for key articles such as those used as the "main" article for a category
> > > or as the title of a navbox. I am looking for similar usages of
> > > paintings as a way to find popular painters or paintings rather than
> > > just those paintings which have articles written about them (which are
> > > often written for totally random reasons such as
> > > theft/sale/wikiproject).
> > >
> > > On Wed, Apr 26, 2017 at 5:39 AM, Kerry Raymond <
> [hidden email]>
> > > wrote:
> > >
> > >> Just a few musings on the issue of Importance and how to research it
> ...
> > >>
> > >> I agree it is intuitive that importance is likely to be linked to
> > >> pageviews and inbound links but, as the preliminary experiment showed,
> > it's
> > >> probably not that simple.
> > >>
> > >> Pageviews tells us something about importance to readers of Wikipedia,
> > >> while inbound links tells us something about importance to writers of
> > >> Wikipedia, and I suspect that writers are not a proxy for readers as
> the
> > >> editor surveys suggest that Wikipedia writers are not typical of
> broader
> > >> society on at least two variables: gender and level of education
> (might
> > be
> > >> others, I can't remember).
> > >>
> > >> But I think importance is a relative metric rather than  absolute. I
> > think
> > >> by taking the mean value of importance across a number of WikiProjects
> > in
> > >> the preliminary experiment may have lost something because it tried
> > >> (through averaging) to look at importance "generally". I would suspect
> > >> conducting an experiment considering only the importance ratings wrt
> to
> > a
> > >> single WikiProject would be more likely to show correlation with
> > pageviews
> > >> (wrt to other articles in that same WikiProject) and inbound links.
> And
> > I
> > >> think there are two kinds of inbound links to be considered, those
> > coming
> > >> from other articles within the same WikiProject and those coming from
> > >> outside that Wikiproject. I suspect different insights will be
> obtained
> > by
> > >> looking at both types of inbound links separately rather than treating
> > them
> > >> as an aggregate. I note also that WikiProjects are not entirely
> > independent
> > >> of one another but have relationships between them. For example, The
> > >> WikiProject Australian Roads describes itself as an "intersection" (ha
> > ha!)
> > >> of WikiProject Highways and WikiProject Australia, so I expect that we
> > >> would find greater correlation in importance between related
> > WikiProjects
> > >> than between unrelated WikiProjects.
> > >>
> > >> When thinking about readers and pageviews, I think we have to ask
> > >> ourselves is there a difference between popularity and importance. Or
> > >> whether popularity *is* importance. I sense that, as a group of
> educated
> > >> people, those of us reading this research mailing list probably do
> think
> > >> there is a difference. Certainly if there is no difference, then this
> > >> research can stop now -- just judge importance by  pageviews. Let's
> > assume
> > >> a difference then. When looking at pageviews of an article, they are
> not
> > >> always consistent over time. Here are the pageviews for Drottninggatan
> > >>
> > >> https://tools.wmflabs.org/pageviews/?project=en.
> > >> wikipedia.org&platform=all-access&agent=user&range=
> > >> latest-90&pages=Drottninggatan
> > >>
> > >> Why so interesting on 8 April? A terrorist attack occurred there. This
> > >> spike in pageviews occurs all the time when some topic is in the news
> > (even
> > >> peripherally as in this case where it is not the article about the
> > >> terrorist attack but about the street in which it occurred). Did the
> > street
> > >> become more "important"? I think it became more interesting but not
> more
> > >> important. So I think we do have to be careful to understand that
> > pageviews
> > >> probably reflect interest rather than importance.  I note that The
> > >> Chainsmokers (a music group with a number of songs in the current USA
> > music
> > >> charts) gets many more Wikipedia article pageviews  than the Wikipedia
> > >> article on Pasteurization but The Chainsmokers are not rated as being
> of
> > >> high importance by the relevant WikiProjects while Pasteurization is
> > very
> > >> important in WikiProject Food and Drink. Since pasteurisation
> prevents a
> > >> lot of deaths, I think we might agree that in the real world
> > pasteurisation
> > >> is more important than a music group regardless of what pageviews tell
> > us.
> > >>
> > >> https://tools.wmflabs.org/pageviews/?project=en.
> > >> wikipedia.org&platform=all-access&agent=user&range=
> latest-90&pages=The_
> > >> Chainsmokers|Pasteurization
> > >>
> > >> Of course it is matters for Wikipedia's success that our *popular*
> > >> articles are of high quality, but I think we have be cautious about
> > >> pageviews being a proxy for importance.
> > >>
> > >> When we look at Wikipedia writers' decisions in tagging the importance
> > of
> > >> articles to WikiProjects, what do we find? As we know, project tags
> are
> > >> often placed on new articles (and often not subsequently reviewed). So
> > >> while I find that quality tags are often out-of-date, the importance
> > seems
> > >> to be pretty accurate even on a new stub articles. This is because it
> is
> > >> the importance of the *topic* that is being assessed which is
> > independent
> > >> of the Wikipedia article itself. Provided the article is clear enough
> > about
> > >> what it is about and why it matters (which is the traditional content
> of
> > >> that first paragraph or two and failing to provide it will likely
> > result in
> > >> speedy deletion of the new article), assessment of the topic's
> > importance
> > >> can be made even at new stub level. This tells us that importance for
> > >> Wikipedia writers is determined by something outside of Wikipedia
> > (probably
> > >> their real-world knowledge of that topic space -- one assumes that
> > project
> > >> taggers are quite interested in the topic space of that project).
> While
> > >> article quality hopefully improves over time, I would be surprised if
> > >> article importance greatly changed over time. Obviously there are
> > >> counter-examples.  I am guessing Donald Trump's article may have grown
> > in
> > >> importance over time but that's probably because his lede para
> changed.
> > >> Adding President of the USA into the lede paragraph makes him much
> more
> > >> important than he was before in the real world and internal to
> > Wikipedia he
> > >> has acquired an inbound link from the presumably high-importance
> > President
> > >> of the USA article. So I think it might be interesting to study those
> > >> articles whose importance does change over time to see if there are
> any
> > >> strong correlations with what is happening to the article inside
> > Wikipedia.
> > >> I think it is this set of importance-changing articles may be where we
> > >> really learn what Wikipedia article characteristics are strongly
> > correlated
> > >> to "importance" given that importance itself appears to be pretty
> stable
> > >> for most articles.
> > >>
> > >> Although not stated explicitly, I imagine we believe that generally
> less
> > >> important articles tend to link to more important articles but more
> > >> important articles don't link to less important articles. And hence
> > >> in-bound links are likely to matter in assessing importance and that
> > >> in-bound links from "important" articles are more valuable than
> in-bound
> > >> links from less important articles (which creates something of a
> > >> bootstrapping problem) similar to the issue to Google's PageRank
> > >> algorithms. But I think we do have some information that Google
> doesn't
> > >> have. The average webpage does not have a lede paragraph that situates
> > the
> > >> topic relative to other topics; a Wikipedia article does. If I have to
> > >> choose to define Thing X in terms of Thing Y, it tends to suggest that
> > Y is
> > >> more important than X. If Y also defines itself in terms of X, then it
> > >> tends to suggest they are equivalent in importance at some way.
> Indeed I
> > >> suspect when we get to the VERY IMPORTANT topics we will see this kind
> > of
> > >> circular definition (e.g. you see circular definitions in Wikipedia
> > around
> > >> Philosophy and Knowledge). Aside, if you have never done this before,
> > try
> > >> this experiment. Choose a random article (left hand tool bar in
> Desktop
> > >> Wikipedia), then click the first link in the article that matters
> (i.e.
> > >> ignore links hatnotes or links inside parentheses). Repeat this first
> > link
> > >> clicking and sooner or later you will reach articles like Knowledge
> and
> > >> Philosophy, which all sit inside circular definition groups.
> > >>
> > >> If we look at the Donald Trump article, his first sentence contains
> only
> > >> two links, one to List of Presidents of the USA and the other to
> > President
> > >> of the USA. If we look at the those two articles, we find that both of
> > them
> > >> mention Donald Trump in their lede paras (although not as early as the
> > >> first sentence) and before mentions of any other US President
> elsewhere
> > in
> > >> the article. Which is consistent with what we know about the real
> world,
> > >> the role of the President is more important than its officeholders and
> > that
> > >> the current officeholder has more importance than a past officeholder.
> > So
> > >> topic importance does seems to be skewed towards the "present day".
> > >>
> > >> So I suspect the links in the lede paras are of greater relevance to
> the
> > >> assessment of importance than links further down in the article which
> > will
> > >> be more likely relate to details of a topic and may include examples
> and
> > >> counter-examples (this is a way in which high importance article may
> > >> mention much lower importance articles). However, we do have to be a
> > little
> > >> bit careful here because of the MoS practice of not linking very
> common
> > >> terms. For example, an Australian article will often refer to
> Australia
> > in
> > >> the lede para but it will almost certainly not be linked to the
> > Australia
> > >> article (and any attempt to add such a link will likely see it removed
> > with
> > >> an edit summary that mentions [[WP:Overlinking]]) whereas there is no
> > >> problem if you link to an Australian state article, e.g. New South
> > Wales.
> > >> So we might find that some very important topics that often appear in
> > ledes
> > >> might get fewer links that you might expect because of the MoS
> policies
> > on
> > >> overlinking, which may be problem when working with inbound links. It
> > may
> > >> be that for "very common topics" the presence of the article title (or
> > its
> > >> synonyms) in the lede may have to be considered as if it were an
> > in-bound
> > >> link for statistical research purposes.
> > >>
> > >> Given all of the above, perhaps the most interesting group of articles
> > to
> > >> study in Wikipedia are those articles whose manually-assessed
> importance
> > >> has changed over the life of the article AND which were NOT current
> > topics
> > >> in the lifetime of Wikipedia (given the influence of "current" on
> > >> importance). But having said that, I wonder if that group of articles
> > >> actually exists. Recently a newish Australian contributor expressed
> > >> disappointment that all the new articles they had created were tagged
> > (by
> > >> others) as of Low Importance. My instinctive reply was "that's
> normal, I
> > >> think of the thousands of articles I have started only a couple even
> > rated
> > >> as Mid importance, this is because the really important articles were
> > all
> > >> started long ago precisely because they were important". I suspect
> > topics
> > >> that are very important (for reasons other than being short-lived
> > >> importance due in being "current" in the lifetime of Wikipedia) will
> > >> generally show up as having started early in Wikipedia's life and that
> > >> those that become more/less important over time will be largely linked
> > to
> > >> becoming or ceasing to be "current" topics). E.g. article
> Pasteurization
> > >> started in May 2001 saying nothing more than " Pasteurization is the
> > >> process of killing off bacteria in milk by quickly heating it to a
> near
> > >> boiling temperature, then quickly cooling it again before the taste
> and
> > >> other desirable properties are affected. The process was named after
> its
> > >> inventor, French scientist Louis Pasteur. See also dairy products."
> The
> > >> links in this very first version are still present in its lede
> paragraph
> > >> today, suggesting our understanding of "non-current" topics is stable
> > and
> > >> hence initial importance determinations can probably be accurately
> made.
> > >> For Pasteurization the Talk page shows it was not project-tagged until
> > 2007
> > >> when it was assigned High Importance as its first assessment.
> > >>
> > >> I suspect we will find that initial manual assessment of article
> > >> importance will be pretty accurate for most articles. And I suspect if
> > we
> > >> plot initial importance assessments against time of assessment, we
> will
> > >> find the higher importance articles commenced life on Wikipedia
> earlier
> > >> than the lower importance articles. If I am correct, then there isn't
> a
> > lot
> > >> of value in machine-assessment of importance of topics because it
> > relates
> > >> to factors external to Wikipedia and often does not change over time
> and
> > >> therefore can often be correctly assessed manually even on new stub
> > >> articles (and any unassessed articles can probably be rated as Low
> > >> Importance as statistically that's almost certainly going to be
> > correct).
> > >> If a topic becomes more important due to "current" events, then
> > invariably
> > >> that article will be updated by many people and one of them will
> sooner
> > or
> > >> later manually adjust its importance. What is less likely to happen is
> > >> re-assessing downwards of Importance when an important "current" topic
> > >> loses its importance when it is no longer current, e.g. are former
> > American
> > >> presidents like Barack Obama or George W Bush or further back less
> > >> important now? These articles will not be updated frequently once the
> > topic
> > >> is no longer in the news and therefore it is less likely an editor
> will
> > >> notice and manually downgrade the importance, so there may be a
> greater
> > >> role for machine-assessment in downgrading importance rather than
> > upgrading
> > >> importance.
> > >>
> > >> Another area where there might be a role for machine-assessed
> importance
> > >> in regards to POV-pushing where an POV-motivated editor might change
> the
> > >> manual-assessment importance of articles to be higher or lower based
> on
> > >> their POV (e.g. my political party is Top Importance, other parties
> are
> > of
> > >> Low Importance). I suspect that often a page watcher would correct or
> at
> > >> least question that kind of re-assessment. However, articles with few
> > >> active pagewatchers you might get away with POV-pushing the article's
> > >> importance tag because nobody noticed. In this situation, a machine
> > >> assessment could be useful in spotting this kind of thing.
> > >>
> > >> This suggests that another metric of interest to importance might be
> > >> number of pagewatchers, although I suspect that pagewatching may
> relate
> > >> more to caring about the article than to caring about the topic. And
> one
> > >> has to be careful to distinguish active pagewatchers (those who
> > actually do
> > >> review changes on their watchlists) from those who don't, as that may
> > make
> > >> a difference (although I am not sure we can really tell which
> > pagewatchers
> > >> are truly actively reviewing as a "satisfactory review" doesn't leave
> a
> > >> trace whereas an "unsatisfactory" review is likely to lead to a
> > relatively
> > >> soon revert or some other change to the article, the article Talk or
> the
> > >> User Talk of reviewed contributor which may be detectable).
> > >>
> > >> The other aspect of articles that occurs to me as being possibly
> linked
> > to
> > >> importance of the topic would be use of the article as the "main"
> > article
> > >> for a category or as the title of a navbox (as it suggests that the
> > >> articles in the category or navbox are in some way subordinate to the
> > >> main/title article). Similarly for list articles, the "type" of the
> > list is
> > >> often more important than its instances).
> > >>
> > >> Kerry
> > >>

Re: Project exploring automated classification of article importance

Kerry Raymond
I observe (and am unsurprised) that WikiProject Australia also rates the Pavlova article as High importance, which ties into Stuart's comments about graphs and subgraphs. If there are relationships between WikiProjects, there is probably some correlation between the importance of articles as seen by those projects. As it happens, WikiProject Australia and WikiProject New Zealand are related on Wikipedia only by both being within the category "WikiProject Countries projects" (along with every other national WikiProject), so this is an example where you cannot see the connection between these projects "on-wiki", but anyone who knows anything about the geography, history, and culture of the two countries will understand the close connection (e.g. ANZAC, sheep, pavlova, rugby union). But, as the project tagging will show, we do have our differences, e.g. Whitebait is a High Importance article for NZ but Oz doesn't even tag it (we don't share the NZ passion for these small fish). And perhaps more seriously, our two countries have different indigenous peoples, so our project tagging around Maori (NZ) and Aboriginal and Torres Strait Islander (Oz) articles would usually be quite disjoint.

So if there are correlations between project tagging, it may be something exploitable in machine assessment of importance.
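
As a rough illustration of how such a correlation could be checked, here is a minimal Python sketch. It assumes each project's tags have already been extracted into a dict mapping article title to importance label; the function names, the numeric encoding of the labels, and the "Article A/B/C" entries are hypothetical placeholders for illustration only (just the Pavlova and Whitebait ratings come from the discussion above). It reports how much of the two projects' tagging overlaps and the Spearman rank correlation of importance on the shared articles.

```python
from statistics import mean

# Assumed ordinal encoding of the usual WikiProject importance labels.
IMPORTANCE = {"Low": 1, "Mid": 2, "High": 3, "Top": 4}


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation; None if either side has no variance."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None
    return cov / (var_x * var_y) ** 0.5


def compare_projects(ratings_a, ratings_b):
    """Return (tag overlap, Spearman rho) for two projects' importance tags.

    Overlap is the Jaccard index of the two sets of tagged articles;
    rho is computed only over articles tagged by both projects.
    """
    shared = sorted(set(ratings_a) & set(ratings_b))
    union = set(ratings_a) | set(ratings_b)
    overlap = len(shared) / len(union) if union else 0.0
    if len(shared) < 2:
        return overlap, None
    xs = [IMPORTANCE[ratings_a[t]] for t in shared]
    ys = [IMPORTANCE[ratings_b[t]] for t in shared]
    return overlap, pearson(average_ranks(xs), average_ranks(ys))


# Toy input: only Pavlova and Whitebait reflect ratings mentioned in this
# thread; "Article A/B/C" are made-up placeholders, not real data.
nz = {"Pavlova (food)": "High", "Whitebait": "High",
      "Article A": "Low", "Article B": "Mid"}
au = {"Pavlova (food)": "High", "Article A": "Low",
      "Article B": "Mid", "Article C": "Top"}

overlap, rho = compare_projects(nz, au)
print(f"tag overlap: {overlap:.2f}, Spearman rho on shared articles: {rho}")
```

Spearman is used here rather than a plain Pearson correlation on the raw labels because the importance scale is ordinal; the low-overlap case (articles one project tags and the other ignores, like Whitebait) is reported separately since it carries its own signal.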

Kerry

-----Original Message-----
From: Wiki-research-l [mailto:[hidden email]] On Behalf Of Stuart A. Yeates
Sent: Friday, 28 April 2017 6:18 AM
To: Research into Wikimedia content and communities <[hidden email]>
Subject: Re: [Wiki-research-l] Project exploring automated classification of article importance

On en.wiki, article importance is relative to some WikiProject. This is encoded in https://en.wikipedia.org/wiki/Template:WPBannerMeta which appears on 16% of all Wikipedia pages via specialisations such as https://en.wikipedia.org/wiki/Template:WikiProject_New_Zealand

Within Wikiproject New Zealand, there are articles which we think are very important to us, which we would never argue are even marginally important on a global scale. Take for example
https://en.wikipedia.org/wiki/Pavlova_(food)

For the mathematically inclined, this is a classic case of a graph and many subgraphs.

cheers
stuart


--
...let us be heard from red core to black sky


Re: Project exploring automated classification of article importance

Morten Wang
In reply to this post by Kerry Raymond
Thanks for the thoughtful comments, Kerry! There were many great points in
your email, and I'd like to focus on some of them.

Your likening of viewership to readers and inlinks to writers echoes how we
think about this as well. I agree that these two groups differ on many
characteristics, something that both the contributor surveys you mention
and published research show. For example, West et al.'s 2012 paper (see
citation below) looks at how browsing history shows differing interests
between readers and contributors, and the "WP:Clubhouse" paper (Lam et al.,
2011) starts getting at how the gender proportions differ (there are of
course other papers as well; these were the first that came to mind). By
combining views and inlinks, we capture signal from both groups.

This also touches on the discussion of how popularity is related to
importance, and whether importance changes over time. The article about
Drottninggatan in Stockholm is but one example of an article that becomes
the center of attention due to a breaking news event. We analysed a
dataset of very popular articles in our 2015 ICWSM paper, finding that
about half of them show this kind of transient behaviour. In that paper we
argue that the more popular articles are more important and should have
higher quality, which means that improving quality is partly chasing a
moving target and partly a focused effort on the long-term important
content (of which pasteurization is probably one example). For some topics
it is easier to predict their shifts in importance because they are
seasonal, e.g. Christmas, Easter, or sporting events like world
championships. For others it might be harder, e.g. Trump, or Google Flu
Trends, which I recently came across. How important is the latter article
now that the website is no longer available?
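
To make the transient-versus-sustained distinction concrete, here is a
minimal sketch of one possible check on a series of daily pageview counts.
It is only an illustration and not the method from the ICWSM paper: the
function names, the peak-to-median ratio, and the threshold of 10 are all
assumptions, and the view counts are made-up toy numbers.

```python
from statistics import median


def peak_to_median_ratio(daily_views):
    """Return the largest daily view count divided by the median daily count."""
    baseline = median(daily_views)
    if baseline == 0:
        # No typical traffic at all: any non-zero peak counts as an extreme spike.
        return float("inf") if max(daily_views) > 0 else 0.0
    return max(daily_views) / baseline


def looks_transient(daily_views, threshold=10.0):
    """Flag a series whose peak dwarfs its typical day (threshold is arbitrary)."""
    return peak_to_median_ratio(daily_views) >= threshold


# Toy data (hypothetical): one steadily read article, one with a news spike.
steady = [900, 950, 1000, 1020, 980, 990, 1010]
spiky = [40, 35, 50, 45, 9000, 3000, 60]

print(looks_transient(steady))  # False
print(looks_transient(spiky))   # True
```

A seasonal topic would need a longer window than this, since a yearly peak
looks like a one-off spike when only a few weeks of views are considered.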

When it comes to links, you point out that they are not all equal. This is
something we're incorporating in our work. Currently we have a model for
WikiProject Medicine, and it accounts both for inlinks from all across
English Wikipedia and for the extent to which they come from other articles
tagged by the project. We also use the clickstream dataset to add
information about whether an article's traffic comes from other Wikipedia
articles, meaning it is useful as supporting information for those, or
whether it comes from elsewhere. Lastly, we use the clickstream dataset to
get an idea of how many inlinks to an article are actually used. As you
write, the links in the lede are more important, something at least one
research paper points to (Dimitrov et al., 2016), and something the
clickstream dataset allows us to estimate. I think it's great to see these
ideas pop up in the discussion and to be able to show that we're
incorporating them into what we're doing and that they affect our results.
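
As a rough illustration of the kinds of link features described above, here
is a minimal sketch. It is not the actual model code: the function name, the
feature names, and the simplified referral tuples (referrer, kind, count)
are assumptions made for this example, and the article titles and counts
are toy values.

```python
def link_features(inlink_sources, project_articles, referrals):
    """Compute simple link-based features for one target article.

    inlink_sources:   set of titles of articles that link to the target
    project_articles: set of titles tagged by the WikiProject of interest
    referrals:        list of (referrer, kind, count) tuples, where kind is
                      "link" for traffic arriving from another article and
                      "external" for everything else (a simplified stand-in
                      for clickstream rows)
    """
    n_inlinks = len(inlink_sources)
    from_project = len(inlink_sources & project_articles)
    total_views = sum(count for _, _, count in referrals)
    internal_views = sum(count for _, kind, count in referrals if kind == "link")
    # Linking articles that actually sent at least one reader to the target.
    used_sources = {ref for ref, kind, count in referrals
                    if kind == "link" and count > 0}
    return {
        "n_inlinks": n_inlinks,
        "prop_inlinks_from_project": from_project / n_inlinks if n_inlinks else 0.0,
        "prop_views_from_wikipedia": internal_views / total_views if total_views else 0.0,
        "prop_inlinks_used": len(used_sources & inlink_sources) / n_inlinks if n_inlinks else 0.0,
    }


# Toy example (hypothetical numbers).
features = link_features(
    inlink_sources={"Aspirin", "Fever", "Headache", "Bayer"},
    project_articles={"Aspirin", "Fever", "Headache"},
    referrals=[
        ("Aspirin", "link", 300),
        ("Fever", "link", 100),
        ("search-engine", "external", 600),
    ],
)
print(features)
```

Keeping the project-internal share separate from the overall inlink count
follows Kerry's suggestion to look at the two kinds of inbound links
separately rather than as an aggregate.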

As I wrap up, I would like to challenge the assertion that initial
importance ratings are "pretty accurate". I'm not sure we really know that.
They might be, but it might be because the vast majority of them are newly
created stubs that get rated "low importance". More interesting are perhaps
other types of articles, where I suspect that importance ratings are copied
from one WikiProject template to another, and one could argue that they
need updating. Our collaboration with WikiProject Medicine has resulted in
updated ratings of a couple of hundred or so articles so far, although most
of them were corrections that increase consistency in the ratings. As I
continue working on this project I hope to expand our collaborations to
other WikiProjects, and I'm looking forward to seeing how well we fare with
those!


Citations:
West, R.; Weber, I.; and Castillo, C. 2012. Drawing a Data-driven Portrait
of Wikipedia Editors. In Proc. of OpenSym/WikiSym, 3:1–3:10.

Lam, S. T. K.; Uduwage, A.; Dong, Z.; Sen, S.; Musicant, D. R.; Terveen,
L.; and Riedl, J. 2011. WP:Clubhouse?: An Exploration of Wikipedia's Gender
Imbalance. In Proc. of WikiSym, 1–10.

Warncke-Wang, M.; Ranjan, V.; Terveen, L.; and Hecht, B. 2015. Misalignment
Between Supply and Demand of Quality Content in Peer Production
Communities. In Proc. of ICWSM.

Dimitrov, D.; Singer, P.; Lemmerich, F.; and Strohmaier, M. 2016. Visual
Positions of Links and Clicks on Wikipedia. In Proc. of the 25th
International Conference Companion on WWW, 27–28.


Cheers,
Morten

On 25 April 2017 at 20:39, Kerry Raymond <[hidden email]> wrote:

> Just a few musings on the issue of Importance and how to research it ...
>
> I agree it is intuitive that importance is likely to be linked to
> pageviews and inbound links but, as the preliminary experiment showed, it's
> probably not that simple.
>
> Pageviews tell us something about importance to readers of Wikipedia,
> while inbound links tell us something about importance to writers of
> Wikipedia, and I suspect that writers are not a proxy for readers, as the
> editor surveys suggest that Wikipedia writers are not typical of broader
> society on at least two variables: gender and level of education (might be
> others, I can't remember).
>
> But I think importance is a relative metric rather than absolute. I think
> that taking the mean value of importance across a number of WikiProjects in
> the preliminary experiment may have lost something, because it tried
> (through averaging) to look at importance "generally". I would suspect
> that an experiment considering only the importance ratings with respect to
> a single WikiProject would be more likely to show correlation with
> pageviews (relative to other articles in that same WikiProject) and inbound
> links. And I think there are two kinds of inbound links to be considered:
> those coming from other articles within the same WikiProject and those
> coming from outside that WikiProject. I suspect different insights will be
> obtained by looking at both types of inbound links separately rather than
> treating them as an aggregate. I note also that WikiProjects are not
> entirely independent of one another but have relationships between them.
> For example, WikiProject Australian Roads describes itself as an
> "intersection" (ha ha!) of WikiProject Highways and WikiProject Australia,
> so I expect that we would find greater correlation in importance between
> related WikiProjects than between unrelated WikiProjects.
>
> When thinking about readers and pageviews, I think we have to ask
> ourselves is there a difference between popularity and importance. Or
> whether popularity *is* importance. I sense that, as a group of educated
> people, those of us reading this research mailing list probably do think
> there is a difference. Certainly if there is no difference, then this
> research can stop now -- just judge importance by  pageviews. Let's assume
> a difference then. When looking at pageviews of an article, they are not
> always consistent over time. Here are the pageviews for Drottninggatan
>
> https://tools.wmflabs.org/pageviews/?project=en.wikipedia.org&platform=all-access&agent=user&range=latest-90&pages=Drottninggatan
>
> Why so interesting on 8 April? A terrorist attack occurred there. This
> spike in pageviews occurs all the time when some topic is in the news (even
> peripherally as in this case where it is not the article about the
> terrorist attack but about the street in which it occurred). Did the street
> become more "important"? I think it became more interesting but not more
> important. So I think we do have to be careful to understand that pageviews
> probably reflect interest rather than importance.  I note that The
> Chainsmokers (a music group with a number of songs in the current USA music
> charts) gets many more Wikipedia article pageviews  than the Wikipedia
> article on Pasteurization but The Chainsmokers are not rated as being of
> high importance by the relevant WikiProjects while Pasteurization is very
> important in WikiProject Food and Drink. Since pasteurisation prevents a
> lot of deaths, I think we might agree that in the real world pasteurisation
> is more important than a music group regardless of what pageviews tell us.
>
> https://tools.wmflabs.org/pageviews/?project=en.wikipedia.org&platform=all-access&agent=user&range=latest-90&pages=The_Chainsmokers|Pasteurization
>
> Of course it matters for Wikipedia's success that our *popular*
> articles are of high quality, but I think we have to be cautious about
> pageviews being a proxy for importance.
>
> When we look at Wikipedia writers' decisions in tagging the importance of
> articles to WikiProjects, what do we find? As we know, project tags are
> often placed on new articles (and often not subsequently reviewed). So
> while I find that quality tags are often out-of-date, the importance seems
> to be pretty accurate even on new stub articles. This is because it is
> the importance of the *topic* that is being assessed which is independent
> of the Wikipedia article itself. Provided the article is clear enough about
> what it is about and why it matters (which is the traditional content of
> that first paragraph or two and failing to provide it will likely result in
> speedy deletion of the new article), assessment of the topic's importance
> can be made even at new stub level. This tells us that importance for
> Wikipedia writers is determined by something outside of Wikipedia (probably
> their real-world knowledge of that topic space -- one assumes that project
> taggers are quite interested in the topic space of that project). While
> article quality hopefully improves over time, I would be surprised if
> article importance greatly changed over time. Obviously there are
> counter-examples.  I am guessing Donald Trump's article may have grown in
> importance over time but that's probably because his lede para changed.
> Adding President of the USA into the lede paragraph makes him much more
> important than he was before in the real world and internal to Wikipedia he
> has acquired an inbound link from the presumably high-importance President
> of the USA article. So I think it might be interesting to study those
> articles whose importance does change over time to see if there are any
> strong correlations with what is happening to the article inside Wikipedia.
> I think this set of importance-changing articles may be where we
> really learn what Wikipedia article characteristics are strongly correlated
> to "importance" given that importance itself appears to be pretty stable
> for most articles.
>
> Although not stated explicitly, I imagine we believe that generally less
> important articles tend to link to more important articles but more
> important articles don't link to less important articles. Hence
> in-bound links are likely to matter in assessing importance, and
> in-bound links from "important" articles are more valuable than in-bound
> links from less important articles (which creates something of a
> bootstrapping problem, similar to the issue faced by Google's PageRank
> algorithm). But I think we do have some information that Google doesn't
> have. The average webpage does not have a lede paragraph that situates the
> topic relative to other topics; a Wikipedia article does. If I have to
> choose to define Thing X in terms of Thing Y, it tends to suggest that Y is
> more important than X. If Y also defines itself in terms of X, then it
> tends to suggest they are equivalent in importance in some way. Indeed I
> suspect when we get to the VERY IMPORTANT topics we will see this kind of
> circular definition (e.g. you see circular definitions in Wikipedia around
> Philosophy and Knowledge). As an aside, if you have never done this before,
> try this experiment. Choose a random article (left hand tool bar in Desktop
> Wikipedia), then click the first link in the article that matters (i.e.
> ignore links in hatnotes or links inside parentheses). Repeat this first
> link clicking and sooner or later you will reach articles like Knowledge
> and Philosophy, which all sit inside circular definition groups.
>
> If we look at the Donald Trump article, his first sentence contains only
> two links, one to List of Presidents of the USA and the other to President
> of the USA. If we look at those two articles, we find that both of them
> mention Donald Trump in their lede paras (although not as early as the
> first sentence) and before mentions of any other US President elsewhere in
> the article. This is consistent with what we know about the real world:
> the role of the President is more important than its officeholders, and
> the current officeholder has more importance than a past officeholder. So
> topic importance does seem to be skewed towards the "present day".
>
> So I suspect the links in the lede paras are of greater relevance to the
> assessment of importance than links further down in the article, which
> will more likely relate to details of a topic and may include examples and
> counter-examples (this is a way in which a high importance article may
> mention much lower importance articles). However, we do have to be a little
> bit careful here because of the MoS practice of not linking very common
> terms. For example, an Australian article will often refer to Australia in
> the lede para but it will almost certainly not be linked to the Australia
> article (and any attempt to add such a link will likely see it removed with
> an edit summary that mentions [[WP:Overlinking]]) whereas there is no
> problem if you link to an Australian state article, e.g. New South Wales.
> So we might find that some very important topics that often appear in ledes
> get fewer links than you might expect because of the MoS policies on
> overlinking, which may be a problem when working with inbound links. It may
> be that for "very common topics" the presence of the article title (or its
> synonyms) in the lede may have to be considered as if it were an in-bound
> link for statistical research purposes.
>
> Given all of the above, perhaps the most interesting group of articles to
> study in Wikipedia are those articles whose manually-assessed importance
> has changed over the life of the article AND which were NOT current topics
> in the lifetime of Wikipedia (given the influence of "current" on
> importance). But having said that, I wonder if that group of articles
> actually exists. Recently a newish Australian contributor expressed
> disappointment that all the new articles they had created were tagged (by
> others) as of Low Importance. My instinctive reply was "that's normal; I
> think of the thousands of articles I have started, only a couple are even
> rated as Mid importance. This is because the really important articles were
> all started long ago, precisely because they were important". I suspect
> topics that are very important (for reasons other than short-lived
> importance due to being "current" in the lifetime of Wikipedia) will
> generally show up as having started early in Wikipedia's life, and that
> changes in importance over time will largely be linked to topics
> becoming or ceasing to be "current". E.g. the article Pasteurization
> started in May 2001 saying nothing more than " Pasteurization is the
> process of killing off bacteria in milk by quickly heating it to a near
> boiling temperature, then quickly cooling it again before the taste and
> other desirable properties are affected. The process was named after its
> inventor, French scientist Louis Pasteur. See also dairy products." The
> links in this very first version are still present in its lede paragraph
> today, suggesting our understanding of "non-current" topics is stable and
> hence initial importance determinations can probably be accurately made.
> For Pasteurization the Talk page shows it was not project-tagged until 2007
> when it was assigned High Importance as its first assessment.
>
> I suspect we will find that initial manual assessment of article
> importance will be pretty accurate for most articles. And I suspect if we
> plot initial importance assessments against time of assessment, we will
> find the higher importance articles commenced life on Wikipedia earlier
> than the lower importance articles. If I am correct, then there isn't a lot
> of value in machine-assessment of importance of topics because it relates
> to factors external to Wikipedia and often does not change over time and
> therefore can often be correctly assessed manually even on new stub
> articles (and any unassessed articles can probably be rated as Low
> Importance as statistically that's almost certainly going to be correct).
> If a topic becomes more important due to "current" events, then invariably
> that article will be updated by many people and one of them will sooner or
> later manually adjust its importance. What is less likely to happen is
> re-assessing downwards of Importance when an important "current" topic
> loses its importance when it is no longer current, e.g. are former American
> presidents like Barack Obama or George W Bush or further back less
> important now? These articles will not be updated frequently once the topic
> is no longer in the news and therefore it is less likely an editor will
> notice and manually downgrade the importance, so there may be a greater
> role for machine-assessment in downgrading importance rather than upgrading
> importance.
>
> Another area where there might be a role for machine-assessed importance
> is POV-pushing, where a POV-motivated editor might change the
> manually assessed importance of articles to be higher or lower based on
> their POV (e.g. my political party is Top Importance, other parties are of
> Low Importance). I suspect that often a page watcher would correct or at
> least question that kind of re-assessment. However, on articles with few
> active pagewatchers you might get away with POV-pushing the article's
> importance tag because nobody noticed. In this situation, a machine
> assessment could be useful in spotting this kind of thing.
>
> This suggests that another metric of interest to importance might be
> number of pagewatchers, although I suspect that pagewatching may relate
> more to caring about the article than to caring about the topic. And one
> has to be careful to distinguish active pagewatchers (those who actually do
> review changes on their watchlists) from those who don't, as that may make
> a difference (although I am not sure we can really tell which pagewatchers
> are truly actively reviewing as a "satisfactory review" doesn't leave a
> trace whereas an "unsatisfactory" review is likely to lead to a relatively
> soon revert or some other change to the article, the article Talk or the
> User Talk of reviewed contributor which may be detectable).
>
> The other aspect of articles that occurs to me as being possibly linked to
> importance of the topic would be use of the article as the "main" article
> for a category or as the title of a navbox (as it suggests that the
> articles in the category or navbox are in some way subordinate to the
> main/title article). Similarly for list articles, the "type" of the list is
> often more important than its instances.
>
> Kerry

Re: Project exploring automated classification of article importance

Stuart A. Yeates
In reply to this post by Kerry Raymond
Following up on Kerry's comments: far more useful to our encyclopedia
building project than a global importance assessor would be an assessor of
which wikiprojects a page is likely to be of interest to. There are
hundreds of thousands of en.wiki pages which are not tagged properly to
their wikiprojects and are thus effectively invisible to the community of
editors who care about them.

This is a classic example of statistical classification, so it shouldn't be
too technically difficult...
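
To illustrate the classification framing, here is a minimal sketch of a
naive Bayes classifier that suggests a wikiproject from a page's tokens
(for instance words from its categories or lede). It is a toy under stated
assumptions: the function names, the training examples, and the choice of
naive Bayes with add-one smoothing are all illustrative, and a real tagger
would need far more data and would likely assign several projects per page.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """Build token counts per project from (tokens, project) pairs."""
    label_counts = Counter()
    token_counts = defaultdict(Counter)
    vocabulary = set()
    for tokens, label in examples:
        label_counts[label] += 1
        token_counts[label].update(tokens)
        vocabulary.update(tokens)
    return label_counts, token_counts, vocabulary


def predict(model, tokens):
    """Return the project with the highest naive Bayes log-score."""
    label_counts, token_counts, vocabulary = model
    total_examples = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in label_counts.items():
        score = math.log(n / total_examples)  # prior for this project
        label_total = sum(token_counts[label].values())
        for token in tokens:
            # Add-one smoothing so unseen tokens do not zero out the score.
            count = token_counts[label][token]
            score += math.log((count + 1) / (label_total + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training data: tokens from already-tagged pages.
examples = [
    (["dessert", "meringue", "new", "zealand"], "WikiProject New Zealand"),
    (["rugby", "auckland", "new", "zealand"], "WikiProject New Zealand"),
    (["vaccine", "disease", "treatment"], "WikiProject Medicine"),
    (["symptom", "disease", "drug"], "WikiProject Medicine"),
]
model = train(examples)

print(predict(model, ["whitebait", "fish", "new", "zealand"]))
print(predict(model, ["disease", "drug"]))
```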

cheers
stuart

--
...let us be heard from red core to black sky

On 28 April 2017 at 12:28, Kerry Raymond <[hidden email]> wrote:

> I observe (and am unsurprised) that WikiProject Australia also rates the
> Pavlova article as High importance, which ties into Stuart's
> comments about graphs and subgraphs. If there are relationships between
> WikiProjects, there is probably some correlation about importance of
> articles as seen by those projects. As it happens, WikiProject Australia
> and WikiProject New Zealand are related on Wikipedia only by both being
> within the category "WikiProject Countries projects" (along with every
> other national WikiProject), so this is an example where you cannot see the
> connection between these projects "on-wiki" but anyone who knows anything
> about the geography, history, and culture of the two countries will
> understand the close connection (e.g. ANZAC, sheep, pavlova, rugby union)
> but, as the project tagging will show, we do have our differences, e.g.
> Whitebait is a High Importance article for NZ but Oz doesn't even tag it
> (we don't share the NZ passion for these small fish). And perhaps more
> seriously, our two countries have different indigenous peoples so our
> project tagging around Maori (NZ) and Aboriginal and Torres Strait Islander
> (Oz) articles would usually be quite disjoint.
>
> So if there are correlations between project tagging, it may be something
> exploitable in machine assessment of importance.
>
> Kerry
>
> -----Original Message-----
> From: Wiki-research-l [mailto:[hidden email]]
> On Behalf Of Stuart A. Yeates
> Sent: Friday, 28 April 2017 6:18 AM
> To: Research into Wikimedia content and communities <
> [hidden email]>
> Subject: Re: [Wiki-research-l] Project exploring automated classification
> of article importance
>
> On en.wiki article importance is relative to some wikiproject. This is
> encoded in https://en.wikipedia.org/wiki/Template:WPBannerMeta which
> appears on 16% of all wikipedia pages via specialisations such as
> https://en.wikipedia.org/wiki/Template:WikiProject_New_Zealand
>
> Within Wikiproject New Zealand, there are articles which we think are very
> important to us, which we would never argue are even marginally important
> on a global scale. Take for example
> https://en.wikipedia.org/wiki/Pavlova_(food)
>
> For the mathematically inclined, this is a classic case of graph and many
> subgraphs.
>
> cheers
> stuart
>
>
> --
> ...let us be heard from red core to black sky
>
> On 27 April 2017 at 21:44, Gerard Meijssen <[hidden email]>
> wrote:
>
> > Hoi,
> > I have read the proposal and it leaves me wondering. Also the notion
> > of importance is indeed neither easy nor obvious. I think the question
> > what is most important is irrelevant depending on how you look at it.
> > Subject can be irrelevant when you look at it from a personal
> > perspective, looking at it from a particular perspective and indeed
> > what seems relevant may become irrelevant or relevant over time. When
> > you use metrics there will always be one way or another why it will be
> > found to be problematic.
> >
> > When you consider Wikipedia, the difference it makes with similar
> > resources is that its long tail is so much longer and still it is easy
> > and obvious to show how the English Wikipedia's long tail is not long
> > enough [1]. When you are looking for links and relevance, Wikidata
> > includes data on all Wikipedias and thereby more avenues to establish
> > relevance.
> >
> > Research has been done that shows that when people are suggested to
> > write articles or amend articles, it works best when it is about
> > subjects they care about. What people are interested in was based in
> > the research on past behaviour. What we could do is flip this and ask
> > people. Based on categories, on projects, whatever people do to
> > categorise what is their interest. This will work on a micro level. On
> > a meta level, it may drive cooperation when we enable people to share
> > their interest (at that moment in time). On a macro level data may
> > arrive at Wikidata and this will allow us to seek what articles
> > include specific data (think date of death for instance). On a meta
> > and macro level, we could ask readers what subjects they are missing.
> > This would provide an additional incentive for people to write. For this
> > last suggestion we could measure what people are missing.
> >
> > Anyway, relevance and importance depend on a point of view. When our
> > community is enabled to make a difference, it will help us with our
> > content. As a movement we know that there is enough that we do not
> > properly cover. Advocating these issues and targeting and educating
> > potential communities is where the WMF could play more of a role.
> > Thanks,
> >        GerardM
> >
> >
> >
> > [1]
> > http://ultimategerardm.blogspot.nl/2017/04/wikidata-
> > user-stories-sum-of-all.html
> >
> > On 26 April 2017 at 13:48, Jonathan Cardy
> > <[hidden email]>
> > wrote:
> >
> > > I like to think that in time importance will win out over
> > > popularity. If Wikipedia still exists in fifty or five hundred years'
> > > time and we are still using pasteurisation and indeed still eating
> > > hydrocarbon based foods, then I suspect the pop group you mention will
> > > be less frequently read about than the pasteurisation process.
> > >
> > > In the meantime if we try to work it out at all it has to be
> > > something of a judgement call, and one we will occasionally get
> > > wrong. Any guesses as to which current branches of science will be as
> > > forgotten in a century as phrenology is today?
> > >
> > > At an extreme the weekly top ten most viewed articles are a good
> > > guide to what is trending in the popular cultures of India and the
> > > USA. I'm assuming that most modern pop culture is inherently
> > > ephemeral. Of course digital historians of future centuries may be
> > > rolling on the floor laughing at this email, and the TV dramas
> > > currently being filmed may still be widely studied and universally
> > > known classics while our leading edge science lies buried in the
> > > foundations of their science.
> > >
> > > Regards
> > >
> > > Jonathan
> > >
> > >
> > > > On 26 Apr 2017, at 08:50, Jane Darnell <[hidden email]> wrote:
> > > >
> > > > Yes I totally agree that "importance is a relative metric rather
> > > > than absolute." I also agree that incoming links and pageviews are
> > > > not accurate measurements of "importance" for all of the reasons you
> > > > mention. However, we are still a project that is actively exploring
> > > > the universe of knowledge, and leaning heavily on academia and other
> > > > established sources we must "boldly go where no man has gone before"
> > > > (and please feel free to insert "white, euro-centric" before the man
> > > > part). So do you have any suggestions what we could measure going
> > > > forward that would cough up some interesting stats to monitor?
> > > > Pagewatching is useful, but problematic because these are only
> > > > assigned at page-creation, while some marginal editor interest might
> > > > be expanded to whole categories (speaking as someone who has
> > > > thousands of pages watchlisted on multiple projects). I like your
> > > > thoughts about looking for key articles such as those used as the
> > > > "main" article for a category or as the title of a navbox. I am
> > > > looking for similar usages of paintings as a way to find popular
> > > > painters or paintings rather than just those paintings which have
> > > > articles written about them (which are often written for totally
> > > > random reasons such as theft/sale/wikiproject).
> > > >
> > > > >> -----Original Message-----
> > > > >> From: Wiki-research-l [mailto:wiki-research-l-[hidden email]]
> > > > >> On Behalf Of Morten Wang
> > > > >> Sent: Friday, 21 April 2017 6:04 AM
> > > > >> To: Research into Wikimedia content and communities <[hidden email]>
> > > > >> Subject: Re: [Wiki-research-l] Project exploring automated
> > > > >> classification of article importance
> > > > >>
> > > > >> Hi Pine,
> > > > >>
> > > > >> These are great pointers to existing practices on enwiki, some of
> > > > >> which I've been looking for and/or missed, thanks!
> > > >>
> > > >>
> > > >> Cheers,
> > > >> Morten
> > > >>

Re: Project exploring automated classification of article importance

Kerry Raymond
In reply to this post by Morten Wang
Re: initial setting of Importance in project tags.

I don't offer any evidence for my claim that initial project tagging often gets importance correct; it's just my observation that this is so. Since importance is about the importance of the topic rather than of the article, I suspect it can be reliably assigned even on a stub article.

However, if we looked at a project which is known to be diligent in its tagging (your collaboration with WikiProject Medicine might have this data), I would still be very interested to compare the start dates of the articles with their current importance, to test my hypothesis that the more important an article is, the more likely it is to have been started earlier.
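
A minimal sketch of how that comparison could be run, assuming we already have each article's current importance rating and the date of its first revision. The Low/Mid/High/Top ordering, the sample articles and the dates below are illustrative assumptions, not real project data. It computes a Spearman rank correlation; a clearly negative value would be consistent with the hypothesis that higher-importance articles started earlier.

```python
from datetime import date

# Ordinal encoding of the importance scale (assumed ordering for this sketch).
IMPORTANCE_RANK = {"Low": 0, "Mid": 1, "High": 2, "Top": 3}

def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the average ranks.

    Not defined (division by zero) if either input is constant.
    """
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical sample: (importance rating, date of first revision).
articles = [
    ("Top", date(2001, 5, 1)),
    ("High", date(2002, 3, 14)),
    ("Mid", date(2006, 8, 2)),
    ("Low", date(2011, 1, 20)),
    ("Low", date(2014, 11, 5)),
]

importance = [IMPORTANCE_RANK[rating] for rating, _ in articles]
created = [first_edit.toordinal() for _, first_edit in articles]
rho = spearman(importance, created)
print(f"Spearman rho between importance and creation date: {rho:.3f}")
```

A rank correlation is used here because importance is an ordinal scale with many ties at "Low", so a linear correlation on the raw codes would say little.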

And for those articles which have had their importance raised over time, did the articles have increasing pageviews between the initial tagging and the re-tagging, either in a sustained way or as a series of upward spikes (which might suggest a growing real-world interest in the topic)? And for those articles which had their importance reduced over time, did it correspond to diminishing levels of pageviews, suggesting declining real-world interest? (I presume downward spikes are an unlikely phenomenon, although that would be an interesting question to ask across Wikipedia generally to confirm my theory that they don't occur.) Or to put it another way, did the re-assignment of article importance reflect the topic's changing importance in the real world (for which I think pageviews are the best proxy) or, if it occurs in a way apparently unrelated to real-world interest, is it a case of the original tagging simply being "wrong"? Obviously some project taggers may be more knowledgeable about the topic space than others, while some taggers may have POV or COI reasons for overstating/understating a topic's importance.
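
One rough way to separate "sustained" change from "spikes" in the pageviews between the initial tagging and the re-tagging is sketched below. The spike threshold (five times the median), the half-window comparison and the two daily series are all assumptions chosen for illustration; none of this is an established method from the project.

```python
from statistics import median

def upward_spikes(daily_views, factor=5.0):
    """Indices of days whose views exceed `factor` times the series median."""
    baseline = median(daily_views)
    return [i for i, views in enumerate(daily_views) if views > factor * baseline]

def sustained_change(daily_views):
    """Ratio of the median of the second half of the window to the first half.

    Medians are used so that a single spike does not masquerade as a trend.
    """
    half = len(daily_views) // 2
    first, second = median(daily_views[:half]), median(daily_views[half:])
    if first == 0:
        return float("inf") if second > 0 else 1.0
    return second / first

# Hypothetical daily pageviews between initial tagging and re-tagging.
news_spike = [100, 110, 95, 105, 900, 100, 98, 102, 104, 99]
growing = [100, 105, 110, 120, 130, 150, 170, 190, 210, 240]

for name, series in (("news_spike", news_spike), ("growing", growing)):
    print(name, upward_spikes(series), round(sustained_change(series), 2))
```

With these made-up numbers the first series shows one spike and an essentially flat trend, while the second shows no spikes and a clear sustained rise, which is the distinction the questions above are asking about.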

So many interesting questions, inquiring minds want to know ....

Kerry

-----Original Message-----
From: Wiki-research-l [mailto:[hidden email]] On Behalf Of Morten Wang
Sent: Friday, 28 April 2017 10:48 AM
To: Research into Wikimedia content and communities <[hidden email]>
Subject: Re: [Wiki-research-l] Project exploring automated classification of article importance

Thanks for the thoughtful comments, Kerry! There were many great points in your email, and I'd like to focus on some of them.

Your likening of viewership to readers and inlinks to writers echoes how we think about this as well. I agree that these two groups differ on many characteristics, something that both the contributor surveys you mention and research show. For example, West et al.'s 2012 paper (see citation below) looks at how browsing history shows differing interests between readers and contributors, and the "WP:Clubhouse" paper (Lam et al., 2011) starts getting at how the gender proportions differ (there are of course other papers as well; these were the first that came to mind). By combining both, we get more signal.

This also touches on the discussion of how popularity is related to importance, and whether importance changes over time. The article about Drottninggatan in Stockholm is but one example of an article that becomes the center of attention due to a breaking news event. We did an analysis of a dataset of very popular articles in our 2015 ICWSM paper, finding that about half of them show this kind of transient behaviour. In that paper we argue that the more popular articles are more important and should have higher quality, which means that it's partly chasing a moving target and partly a focused effort on the long-term important content (of which pasteurization is probably one example). For some topics it is easier to predict their shifts in importance because they are seasonal, e.g. Christmas, Easter, or sporting events like world championships. When it comes to others it might be harder, e.g. Trump, or Google Flu Trends, which I recently came across. How important is the latter article now that the website is no longer available?

When it comes to links, you point out that they are not all equal. This is something we're incorporating in our work. Currently we have a model for WikiProject Medicine, and it accounts for both inlinks from all across English Wikipedia, as well as to what extent they come from other articles tagged by the project. We also use the clickstream dataset to add information about whether an article's traffic comes from other Wikipedia articles, meaning it is useful as supporting information for those, or whether it comes from elsewhere. Lastly, we use the clickstream dataset to get an idea about how many inlinks to an article are actually used. As you write, the links in the lede are more important, something at least one research paper points to (Dimitrov et al, 2016), and something the clickstream dataset allows us to estimate. I think it's great to see these ideas pop up in the discussion and be able to show how we're incorporating these into what we're doing and that they affect our results.
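
To make the kinds of link and traffic signals described above concrete, here is a small sketch of how such features could be derived for one article. It is not the project's actual model or the clickstream dataset's real format: the function name, the input structures, the "external" referrer label and all numbers are hypothetical.

```python
def link_features(inlink_sources, project_articles, referrer_views):
    """Summarise where an article's inlinks and traffic come from.

    inlink_sources: titles of articles that link to the target article.
    project_articles: set of titles tagged by the WikiProject.
    referrer_views: dict of referrer label -> views; the label "external"
        (a made-up convention for this sketch) covers all non-Wikipedia traffic.
    """
    n_inlinks = len(inlink_sources)
    from_project = sum(1 for title in inlink_sources if title in project_articles)
    total_views = sum(referrer_views.values())
    internal_views = sum(
        views for label, views in referrer_views.items() if label != "external"
    )
    # An inlink counts as "used" if at least one view arrived through it.
    used_links = sum(1 for title in inlink_sources if referrer_views.get(title, 0) > 0)

    def share(part, whole):
        return part / whole if whole else 0.0

    return {
        "n_inlinks": n_inlinks,
        "project_inlink_share": share(from_project, n_inlinks),
        "internal_traffic_share": share(internal_views, total_views),
        "used_inlink_share": share(used_links, n_inlinks),
    }

# Hypothetical inputs for a single target article.
features = link_features(
    inlink_sources=["Milk", "Louis Pasteur", "Food safety", "Dairy product"],
    project_articles={"Milk", "Food safety", "Dairy product", "Pasteurization"},
    referrer_views={"Milk": 120, "Louis Pasteur": 80, "external": 800},
)
print(features)
```

Keeping the project-internal share separate from the global inlink count follows the point that links from inside and outside a WikiProject may carry different information.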

As I wrap up, I would like to challenge the assertion that initial importance ratings are "pretty accurate". I'm not sure we really know that. They might be, but it might be because the vast majority of them are newly created stubs that get rated "low importance". More interesting are perhaps other types of articles, where I suspect that importance ratings are copied from one WikiProject template to another, and one could argue that they need updating. Our collaboration with WikiProject Medicine has resulted in updated ratings of a couple of hundred or so articles so far, although most of them were corrections that increase consistency in the ratings. As I continue working on this project I hope to expand our collaborations to other WikiProjects, and I'm looking forward to seeing how well we fare with those!


Citations:
West, R.; Weber, I.; and Castillo, C. 2012. Drawing a Data-driven Portrait of Wikipedia Editors. In Proc. of OpenSym/WikiSym, 3:1–3:10.

Lam, S. T. K.; Uduwage, A.; Dong, Z.; Sen, S.; Musicant, D. R.; Terveen, L.; and Riedl, J. 2011. WP:Clubhouse?: An Exploration of Wikipedia's Gender Imbalance. In Proc. of WikiSym, 1–10.

Warncke-Wang, M.; Ranjan, V.; Terveen, L.; and Hecht, B. 2015. Misalignment Between Supply and Demand of Quality Content in Peer Production Communities. In Proc. of ICWSM.

Dimitrov, D.; Singer, P.; Lemmerich, F.; and Strohmaier, M. 2016. Visual Positions of Links and Clicks on Wikipedia. In Proc. of the 25th International Conference Companion on WWW, 27–28.


Cheers,
Morten

On 25 April 2017 at 20:39, Kerry Raymond <[hidden email]> wrote:

> Just a few musings on the issue of Importance and how to research it ...
>
> I agree it is intuitive that importance is likely to be linked to
> pageviews and inbound links but, as the preliminary experiment showed,
> it's probably not that simple.
>
> Pageviews tells us something about importance to readers of Wikipedia,
> while inbound links tells us something about importance to writers of
> Wikipedia, and I suspect that writers are not a proxy for readers as
> the editor surveys suggest that Wikipedia writers are not typical of
> broader society on at least two variables: gender and level of
> education (might be others, I can't remember).
>
> But I think importance is a relative metric rather than absolute. I
> think that taking the mean value of importance across a number of
> WikiProjects in the preliminary experiment may have lost something
> because it tried (through averaging) to look at importance
> "generally". I suspect that an experiment considering only the
> importance ratings with respect to a single WikiProject would be more
> likely to show correlation with pageviews (relative to other articles
> in that same WikiProject) and inbound links. And I think there are two
> kinds of inbound links to be considered, those coming from other
> articles within the same WikiProject and those coming from outside
> that WikiProject. I suspect different insights will be obtained by
> looking at both types of inbound links separately rather than treating
> them as an aggregate. I note also that WikiProjects are not entirely
> independent of one another but have relationships between them. For
> example, WikiProject Australian Roads describes itself as an
> "intersection" (ha ha!) of WikiProject Highways and WikiProject
> Australia, so I expect that we would find greater correlation in
> importance between related WikiProjects than between unrelated
> WikiProjects.
>
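
One way to try the per-WikiProject comparison suggested above is a rank correlation computed within a single project, with internal and external inbound links kept as separate columns. The following is a minimal Python sketch under stated assumptions: the five articles and their numbers are invented, the 0-3 coding of the importance classes is an assumption, and Spearman's rho is just one reasonable choice for ordinal ratings, not necessarily what the preliminary experiment used.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Assumed numeric coding of the WikiProject importance classes.
IMPORTANCE = {"Low": 0, "Mid": 1, "High": 2, "Top": 3}

# Invented articles from ONE WikiProject:
# (rating, pageviews, inlinks from inside the project, inlinks from outside)
articles = [
    ("Top", 9000, 40, 120),
    ("High", 4000, 25, 30),
    ("Mid", 5000, 10, 45),
    ("Low", 300, 2, 5),
    ("Low", 800, 1, 20),
]

rating = [IMPORTANCE[a[0]] for a in articles]
for name, col in (("pageviews", 1), ("internal inlinks", 2), ("external inlinks", 3)):
    rho = spearman(rating, [a[col] for a in articles])
    print(f"{name}: rho = {rho:.2f}")
```

Running the same loop once per WikiProject, instead of on ratings averaged across projects, would show whether the correlations are stronger inside a project than they are "generally".
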
> When thinking about readers and pageviews, I think we have to ask
> ourselves whether there is a difference between popularity and
> importance, or whether popularity *is* importance. I sense that, as a
> group of educated people, those of us reading this research mailing
> list probably do think there is a difference. Certainly if there is no
> difference, then this research can stop now -- just judge importance
> by pageviews. Let's assume a difference then. The pageviews of an
> article are not always consistent over time.
> Here are the pageviews for Drottninggatan:
>
> https://tools.wmflabs.org/pageviews/?project=en.wikipedia.org&platform=all-access&agent=user&range=latest-90&pages=Drottninggatan
>
> Why so interesting on 8 April? A terrorist attack occurred there. This
> kind of spike in pageviews occurs all the time when some topic is in
> the news (even peripherally, as in this case, where it is not the
> article about the terrorist attack but the one about the street in
> which it occurred). Did the street become more "important"? I think it
> became more interesting but not more important. So I think we do have
> to be careful to understand that pageviews probably reflect interest
> rather than importance. I note that The Chainsmokers (a music group
> with a number of songs in the current USA music charts) gets many more
> pageviews than the Wikipedia article on Pasteurization, but The
> Chainsmokers are not rated as being of high importance by the relevant
> WikiProjects, while Pasteurization is very important in WikiProject
> Food and Drink. Since pasteurisation prevents a lot of deaths, I think
> we might agree that in the real world pasteurisation is more important
> than a music group, regardless of what pageviews tell us.
>
> https://tools.wmflabs.org/pageviews/?project=en.wikipedia.org&platform=all-access&agent=user&range=latest-90&pages=The_Chainsmokers|Pasteurization
>
> Of course it matters for Wikipedia's success that our *popular*
> articles are of high quality, but I think we have to be cautious about
> treating pageviews as a proxy for importance.
>
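
The Drottninggatan case suggests that, if pageviews are used at all, a statistic that is insensitive to short news-driven spikes may be a better starting point than a total or a mean. Below is a small Python sketch of that idea. The ten-day series is invented, and using the median (plus a peak-to-median ratio as a spike indicator) is an assumption made for illustration, not something the project has adopted.

```python
from statistics import mean, median


def typical_daily_views(daily_views):
    """Median of daily pageviews; a short spike barely moves it."""
    return median(daily_views)


def spike_ratio(daily_views):
    """How far the busiest day sits above a typical day."""
    return max(daily_views) / median(daily_views)


# Invented series: a quiet article with a two-day burst of news interest.
quiet_street = [40, 35, 42, 38, 5000, 900, 41, 39, 37, 40]

print("mean:", mean(quiet_street))
print("median:", typical_daily_views(quiet_street))
print("spike ratio:", spike_ratio(quiet_street))
```

A large spike ratio marks an article whose raw view count mostly reflects a moment of interest, which is exactly the case where pageviews and importance are likely to diverge.
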
> When we look at Wikipedia writers' decisions in tagging the importance
> of articles to WikiProjects, what do we find? As we know, project tags
> are often placed on new articles (and often not subsequently
> reviewed). So while I find that quality tags are often out-of-date,
> the importance seems to be pretty accurate even on new stub
> articles. This is because it is the importance of the *topic* that is
> being assessed, which is independent of the Wikipedia article itself.
> Provided the article is clear enough about what it is about and why it
> matters (which is the traditional content of that first paragraph or
> two, and failing to provide it will likely result in speedy deletion of
> the new article), assessment of the topic's importance can be made
> even at new stub level. This tells us that importance for Wikipedia
> writers is determined by something outside of Wikipedia (probably
> their real-world knowledge of that topic space -- one assumes that
> project taggers are quite interested in the topic space of that
> project). While article quality hopefully improves over time, I would
> be surprised if article importance greatly changed over time.
> Obviously there are counter-examples. I am guessing Donald Trump's
> article may have grown in importance over time, but that's probably
> because his lede para changed.
> Adding President of the USA into the lede paragraph makes him much
> more important than he was before in the real world, and internal to
> Wikipedia he has acquired an inbound link from the presumably
> high-importance President of the USA article. So I think it might be
> interesting to study those articles whose importance does change over
> time to see if there are any strong correlations with what is
> happening to the article inside Wikipedia.
> I think this set of importance-changing articles may be where we
> really learn what Wikipedia article characteristics are strongly
> correlated to "importance", given that importance itself appears to be
> pretty stable for most articles.
>
> Although not stated explicitly, I imagine we believe that generally
> less important articles tend to link to more important articles but
> more important articles don't link to less important articles. Hence
> in-bound links are likely to matter in assessing importance, and
> in-bound links from "important" articles are more valuable than
> in-bound links from less important articles (which creates something
> of a bootstrapping problem), similar to the issue with Google's PageRank
> algorithms. But I think we do have some information that Google
> doesn't have. The average webpage does not have a lede paragraph that
> situates the topic relative to other topics; a Wikipedia article does.
> If I choose to define Thing X in terms of Thing Y, it tends to
> suggest that Y is more important than X. If Y also defines itself in
> terms of X, then it tends to suggest they are equivalent in importance
> in some way. Indeed I suspect when we get to the VERY IMPORTANT topics
> we will see this kind of circular definition (e.g. you see circular
> definitions in Wikipedia around Philosophy and Knowledge). As an aside,
> if you have never done this before, try this experiment. Choose a
> random article (left hand tool bar in Desktop Wikipedia), then click
> the first link in the article that matters (i.e. ignore links in
> hatnotes or links inside parentheses). Repeat this first-link clicking
> and sooner or later you will reach articles like Knowledge and
> Philosophy, which all sit inside circular definition groups.
>
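
The first-link experiment described above is easy to express as a loop that stops when an article repeats. Here is a minimal Python sketch, assuming the first "real" link of each article has already been extracted into a dictionary. The `first_link` mapping below is invented for illustration and does not claim to reflect the actual first links of these articles; extracting them (skipping hatnotes and parenthesised links) is the hard part and is not shown.

```python
def follow_first_links(start, first_link, max_steps=100):
    """Follow first links from `start` until an article repeats.

    Returns (path, cycle): the articles visited in order, and the circular
    group that was reached (empty if the chain dead-ends or hits max_steps).
    """
    path = []
    seen = {}  # article title -> position in path
    current = start
    while current is not None and current not in seen and len(path) < max_steps:
        seen[current] = len(path)
        path.append(current)
        current = first_link.get(current)
    cycle = path[seen[current]:] if current in seen else []
    return path, cycle


# Invented first-link map, for illustration only.
first_link = {
    "Pavlova (food)": "Dessert",
    "Dessert": "Course (food)",
    "Course (food)": "Food",
    "Food": "Knowledge",
    "Knowledge": "Philosophy",
    "Philosophy": "Knowledge",
}

path, cycle = follow_first_links("Pavlova (food)", first_link)
print(" -> ".join(path))
print("circular group:", cycle)
```

Articles that end up inside such a circular group, or only a few steps away from one, would be candidates for the "defined in terms of each other" kind of importance discussed above.
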
> If we look at the Donald Trump article, his first sentence contains
> only two links, one to List of Presidents of the USA and the other to
> President of the USA. If we look at those two articles, we find
> that both of them mention Donald Trump in their lede paras (although
> not as early as the first sentence) and before mentions of any other
> US President elsewhere in the article. This is consistent with what
> we know about the real world: the role of the President is more
> important than its officeholders, and the current officeholder has
> more importance than a past officeholder. So topic importance does
> seem to be skewed towards the "present day".
>
> So I suspect the links in the lede paras are of greater relevance to
> the assessment of importance than links further down in the article,
> which are more likely to relate to details of a topic and may include
> examples and counter-examples (this is a way in which a high-importance
> article may mention much lower-importance articles). However, we do
> have to be a little bit careful here because of the MoS practice of
> not linking very common terms. For example, an Australian article will
> often refer to Australia in the lede para but it will almost certainly
> not be linked to the Australia article (and any attempt to add such a
> link will likely see it removed with an edit summary that mentions
> [[WP:Overlinking]]), whereas there is no problem if you link to an
> Australian state article, e.g. New South Wales.
> So we might find that some very important topics that often appear in
> ledes get fewer links than you might expect because of the MoS
> policies on overlinking, which may be a problem when working with
> inbound links. It may be that for "very common topics" the presence of
> the article title (or its synonyms) in the lede may have to be
> considered as if it were an in-bound link for statistical research
> purposes.
>
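
The suggestion of treating an unlinked lede mention of a very common topic as if it were an inbound link can be sketched as follows. This is a minimal Python illustration under stated assumptions: the lede texts, the synonym list and the `effective_inlinks` helper are all hypothetical, and plain word-boundary matching is a crude stand-in for proper entity matching (it would, for example, also count "South Australia" as a mention of "Australia").

```python
import re


def effective_inlinks(target, synonyms, ledes, linked_from):
    """Count articles that link to `target` or merely name it in their lede."""
    names = [target, *synonyms]
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(n) for n in names) + r")\b",
        re.IGNORECASE,
    )
    sources = set(linked_from)  # articles with a real wikilink to target
    for title, lede in ledes.items():
        if title != target and pattern.search(lede):
            sources.add(title)  # unlinked mention counted as a pseudo-link
    return len(sources)


# Invented lede texts, for illustration only.
ledes = {
    "Wagga Wagga": "Wagga Wagga is a city in New South Wales, Australia.",
    "Pavlova (food)": "Pavlova is a meringue-based dessert.",
    "Canberra": "Canberra is the capital city of Australia.",
}
linked_from = {"Canberra"}  # only this article actually links to Australia

print(effective_inlinks("Australia", ["Commonwealth of Australia"], ledes, linked_from))
```

Comparing this count with the raw inbound link count for the same article gives a rough idea of how much the overlinking guideline suppresses links to very common topics.
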
> Given all of the above, perhaps the most interesting group of articles
> to study in Wikipedia are those articles whose manually-assessed
> importance has changed over the life of the article AND which were NOT
> current topics in the lifetime of Wikipedia (given the influence of
> "current" on importance). But having said that, I wonder if that group
> of articles actually exists. Recently a newish Australian contributor
> expressed disappointment that all the new articles they had created
> were tagged (by others) as of Low Importance. My instinctive reply was
> "that's normal, I think of the thousands of articles I have started
> only a couple are even rated as Mid importance, this is because the
> really important articles were all started long ago precisely because
> they were important". I suspect topics that are very important (for
> reasons other than short-lived importance due to being "current" in
> the lifetime of Wikipedia) will generally show up as having started
> early in Wikipedia's life, and that those that become more/less
> important over time will be largely linked to becoming or ceasing to
> be "current" topics. E.g. the article Pasteurization started in May
> 2001 saying nothing more than "Pasteurization is the process of
> killing off bacteria in milk by quickly heating it to a near boiling
> temperature, then quickly cooling it again before the taste and other
> desirable properties are affected. The process was named after its
> inventor, French scientist Louis Pasteur. See also dairy products."
> The links in this very first version are still present in its lede
> paragraph today, suggesting our understanding of "non-current" topics
> is stable and hence initial importance determinations can probably be
> accurately made.
> For Pasteurization the Talk page shows it was not project-tagged until
> 2007 when it was assigned High Importance as its first assessment.
>
> I suspect we will find that initial manual assessment of article
> importance will be pretty accurate for most articles. And I suspect if
> we plot initial importance assessments against time of assessment, we
> will find the higher importance articles commenced life on Wikipedia
> earlier than the lower importance articles. If I am correct, then
> there isn't a lot of value in machine-assessment of importance of
> topics because it relates to factors external to Wikipedia and often
> does not change over time and therefore can often be correctly
> assessed manually even on new stub articles (and any unassessed
> articles can probably be rated as Low Importance, as statistically
> that's almost certainly going to be correct).
> If a topic becomes more important due to "current" events, then
> invariably that article will be updated by many people and one of them
> will sooner or later manually adjust its importance. What is less
> likely to happen is a downward re-assessment of Importance when an
> important "current" topic loses its importance because it is no longer
> current, e.g. are former American presidents like Barack Obama or
> George W Bush or further back less important now? These articles will
> not be updated frequently once the topic is no longer in the news and
> therefore it is less likely an editor will notice and manually
> downgrade the importance, so there may be a greater role for
> machine-assessment in downgrading importance rather than upgrading
> importance.
>
> Another area where there might be a role for machine-assessed
> importance is in regard to POV-pushing, where a POV-motivated editor
> might change the manually assessed importance of articles to be higher
> or lower based on their POV (e.g. my political party is Top
> Importance, other parties are of Low Importance). I suspect that often
> a page watcher would correct or at least question that kind of
> re-assessment. However, on articles with few active pagewatchers you
> might get away with POV-pushing the article's importance tag because
> nobody noticed. In this situation, a machine assessment could be
> useful in spotting this kind of thing.
>
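
As a rough illustration of the POV-pushing use case, a machine assessment could be used purely as a second opinion: flag an article for human review when the manual rating sits far from the predicted one and few people are watching the page. The Python sketch below assumes such predictions and active-watcher counts are available. The records, the two-class gap and the watcher threshold are all invented for illustration.

```python
LEVELS = ["Low", "Mid", "High", "Top"]


def flag_for_review(articles, min_gap=2, max_watchers=5):
    """Titles whose manual rating is far from the prediction on thinly watched pages."""
    flagged = []
    for a in articles:
        gap = abs(LEVELS.index(a["manual"]) - LEVELS.index(a["predicted"]))
        if gap >= min_gap and a["active_watchers"] <= max_watchers:
            flagged.append(a["title"])
    return flagged


# Invented records: manual tag, hypothetical model prediction, active watchers.
articles = [
    {"title": "Party A", "manual": "Top", "predicted": "Low", "active_watchers": 1},
    {"title": "Party B", "manual": "Low", "predicted": "Mid", "active_watchers": 0},
    {"title": "Pasteurization", "manual": "High", "predicted": "Low", "active_watchers": 40},
]

print(flag_for_review(articles))
```

The output is only a list of candidates for a human to look at. A large gap on a well-watched page is left alone here on the assumption that its watchers would already have questioned a dubious re-assessment.
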
> This suggests that another metric of interest to importance might be
> number of pagewatchers, although I suspect that pagewatching may
> relate more to caring about the article than to caring about the
> topic. And one has to be careful to distinguish active pagewatchers
> (those who actually do review changes on their watchlists) from those
> who don't, as that may make a difference (although I am not sure we
> can really tell which pagewatchers are truly actively reviewing, as a
> "satisfactory review" doesn't leave a trace whereas an
> "unsatisfactory" review is likely to lead fairly soon to a revert
> or some other change to the article, the article Talk or the User Talk
> of the reviewed contributor, which may be detectable).
>
> The other aspect of articles that occurs to me as being possibly
> linked to importance of the topic would be use of the article as the
> "main" article for a category or as the title of a navbox (as it
> suggests that the articles in the category or navbox are in some way
> subordinate to the main/title article). Similarly for list articles:
> the "type" of the list is often more important than its instances.
>
> Kerry

Re: Project exploring automated classification of article importance

Kerry Raymond
In reply to this post by Stuart A. Yeates
Yes, under-categorised and/or under-tagged articles could probably be detected by inbound/outbound link analysis and presented as candidates to the relevant WikiProjects for categorising and tagging. So long as you didn't deliver up too many false positives, people would probably still deal with the false positives by a best-efforts categorisation or tagging, or at least pass them off to a more relevant project based on their human intelligence.

 

On a related theme, outgoing link analysis could be used to draw orphan articles to the attention of likely WikiProjects.
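
To give a feel for how link analysis could surface candidate WikiProjects, here is a minimal Python sketch that lets an article's link neighbours "vote" with their existing project tags. It is an illustration only: the `links` and `tags` data are invented, the 50% vote share is an arbitrary assumption, and a real classifier would use far richer features than this neighbour count.

```python
from collections import Counter


def suggest_projects(article, links, tags, min_share=0.5):
    """Suggest WikiProjects tagged on at least `min_share` of the link neighbours."""
    neighbours = links.get(article, set())
    if not neighbours:
        return []
    votes = Counter(p for n in neighbours for p in tags.get(n, ()))
    already = tags.get(article, set())
    return sorted(
        p for p, c in votes.items()
        if c / len(neighbours) >= min_share and p not in already
    )


# Invented data: inbound and outbound links combined, and current project tags.
links = {
    "Whitebait fritter": {"Whitebait", "Fritter", "West Coast, New Zealand"},
}
tags = {
    "Whitebait": {"New Zealand", "Food and drink"},
    "Fritter": {"Food and drink"},
    "West Coast, New Zealand": {"New Zealand"},
}

print(suggest_projects("Whitebait fritter", links, tags))
```

Raising `min_share` trades recall for fewer false positives, which matters if the suggestions are to be handed to WikiProject volunteers. For an orphan article the `links` entry would hold outgoing links only.
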

Kerry

 

From: Stuart A. Yeates [mailto:[hidden email]]
Sent: Friday, 28 April 2017 10:59 AM
To: [hidden email]; Research into Wikimedia content and communities <[hidden email]>
Subject: Re: [Wiki-research-l] Project exploring automated classification of article importance

 

Following up Kerry's comments: far more useful to our encyclopedia-building project than a global importance assessor would be an assessor of which wikiprojects a page is likely to be of interest to. There are hundreds of thousands of en.wiki pages which are not tagged properly to their wikiprojects and are thus effectively invisible to the community of editors who care about them.

This is a classic example of statistical classification, so it shouldn't be too technically difficult...

 

cheers

stuart




--
...let us be heard from red core to black sky

 

On 28 April 2017 at 12:28, Kerry Raymond <[hidden email]> wrote:

I observe (and am unsurprised) that WikiProject Australia also rates the Pavlova article as High importance, which ties into Stuart's comments about graphs and subgraphs. If there are relationships between WikiProjects, there is probably some correlation in the importance of articles as seen by those projects. As it happens, WikiProject Australia and WikiProject New Zealand are related on Wikipedia only by both being within the category "WikiProject Countries projects" (along with every other national WikiProject), so this is an example where you cannot see the connection between these projects "on-wiki", but anyone who knows anything about the geography, history, and culture of the two countries will understand the close connection (e.g. ANZAC, sheep, pavlova, rugby union).

But, as the project tagging will show, we do have our differences, e.g. Whitebait is a High Importance article for NZ but Oz doesn't even tag it (we don't share the NZ passion for these small fish). And perhaps more seriously, our two countries have different indigenous peoples, so our project tagging around Maori (NZ) and Aboriginal and Torres Strait Islander (Oz) articles would usually be quite disjoint.

So if there are correlations between project tagging, it may be something exploitable in machine assessment of importance.

Kerry

-----Original Message-----
From: Wiki-research-l [mailto:[hidden email]] On Behalf Of Stuart A. Yeates
Sent: Friday, 28 April 2017 6:18 AM
To: Research into Wikimedia content and communities <[hidden email]>
Subject: Re: [Wiki-research-l] Project exploring automated classification of article importance

On en.wiki, article importance is relative to some wikiproject. This is encoded in https://en.wikipedia.org/wiki/Template:WPBannerMeta which appears on 16% of all Wikipedia pages via specialisations such as https://en.wikipedia.org/wiki/Template:WikiProject_New_Zealand

Within Wikiproject New Zealand, there are articles which we think are very important to us, which we would never argue are even marginally important on a global scale. Take for example
https://en.wikipedia.org/wiki/Pavlova_(food)

For the mathematically inclined, this is a classic case of a graph and many subgraphs.

cheers
stuart


--
...let us be heard from red core to black sky

On 27 April 2017 at 21:44, Gerard Meijssen <[hidden email]> wrote:

> Hoi,
> I have read the proposal and it leaves me wondering. The notion
> of importance is indeed neither easy nor obvious. I think the question
> of what is most important can be irrelevant, depending on how you look
> at it. A subject can be irrelevant when you look at it from a personal
> perspective or from a particular perspective, and indeed
> what seems relevant may become irrelevant, or the reverse, over time.
> Whatever metrics you use, there will always be some way in which they
> are found to be problematic.
>
> When you consider Wikipedia, what sets it apart from similar
> resources is that its long tail is so much longer, and still it is easy
> and obvious to show how the English Wikipedia's long tail is not long
> enough [1]. When you are looking for links and relevance, Wikidata
> includes data on all Wikipedias and thereby offers more avenues to
> establish relevance.
>
> Research has shown that suggesting that people write or amend
> articles works best when the suggestions are about subjects they care
> about. In that research, what people are interested in was inferred
> from past behaviour. What we could do is flip this and ask
> people, based on categories, on projects, or whatever people use to
> categorise their interests. This will work on a micro level. On
> a meta level, it may drive cooperation when we enable people to share
> their interest (at that moment in time). On a macro level data may
> arrive at Wikidata and this will allow us to seek what articles
> include specific data (think date of death for instance). On a meta
> and macro level, we could ask readers what subjects they are missing.
> This would provide an additional incentive for people to write. For
> this last suggestion we could measure what people are missing.
>
> Anyway, relevance and importance depend on a point of view. When our
> community is enabled to make a difference, it will help us with our
> content. As a movement we know that there is enough that we do not
> properly cover. Advocating these issues and targeting and educating
> potential communities is where the WMF could play more of a role.
> Thanks,
>        GerardM
>
>
>
> [1]
> http://ultimategerardm.blogspot.nl/2017/04/wikidata-user-stories-sum-of-all.html
>
> On 26 April 2017 at 13:48, Jonathan Cardy <[hidden email]> wrote:
>
> > I like to think that in time importance will win out over
> > popularity. If Wikipedia still exists in fifty or five hundred years'
> > time and we are still using pasteurisation and indeed still eating
> > hydrocarbon based foods, then I suspect the pop group you mention will
> > be less frequently read about than the pasteurisation process.
> >
> > In the meantime if we try to work it out at all it has to be
> > something of a judgement call, and one we will occasionally get
> > wrong. Any guesses as to which current branches of science will be as
> > forgotten in a century as phrenology is today?
> >
> > At an extreme the weekly top ten most viewed articles are a good
> > guide to what is trending in the popular cultures of India and the
> > USA. I'm assuming that most modern pop culture is inherently
> > ephemeral. Of course digital historians of future centuries may be
> > rolling on the floor laughing at this email, and the TV dramas
> > currently being filmed may still be widely studied and universally
> > known classics while our leading edge science lies buried in the
> > foundations of their science.
> >
> > Regards
> >
> > Jonathan
> >
> >
> > > On 26 Apr 2017, at 08:50, Jane Darnell <[hidden email]> wrote:
> > >
> > > Yes I totally agree that "importance is a relative metric rather
> > > than absolute." I also agree that incoming links and pageviews are
> > > not accurate measurements of "importance" for all of the reasons you
> > > mention. However, we are still a project that is actively exploring
> > > the universe of knowledge, and leaning heavily on academia and other
> > > established sources we must "boldly go where no man has gone before"
> > > (and please feel free to insert "white, euro-centric" before the man
> > > part). So do you have any suggestions what we could measure going
> > > forward that would cough up some interesting stats to monitor?
> > > Pagewatching is useful, but problematic because these are only
> > > assigned at page-creation, while some marginal editor interest might
> > > be expanded to whole categories (speaking as someone who has
> > > thousands of pages watchlisted on multiple projects). I like your
> > > thoughts about looking for key articles such as those used as the
> > > "main" article for a category or as the title of a navbox. I am
> > > looking for similar usages of paintings as a way to find popular
> > > painters or paintings rather than just those paintings which have
> > > articles written about them (which are often written for totally
> > > random reasons such as theft/sale/wikiproject).
> > >
> > > On Wed, Apr 26, 2017 at 5:39 AM, Kerry Raymond <[hidden email]> wrote:
> > >
> > >> Just a few musings on the issue of Importance and how to research
> > >> it ...
> > >>
> > >> I agree it is intuitive that importance is likely to be linked to
> > >> pageviews and inbound links but, as the preliminary experiment
> > >> showed, it's probably not that simple.
> > >>
> > >> Pageviews tell us something about importance to readers of
> > >> Wikipedia, while inbound links tell us something about
> > >> importance to writers of Wikipedia. I suspect that writers
> > >> are not a proxy for readers, as the editor surveys suggest that
> > >> Wikipedia writers are not typical of broader society on at least
> > >> two variables: gender and level of education (there might be
> > >> others, I can't remember).
> > >>
> > >> But I think importance is a relative metric rather than an absolute
> > >> one. I think that taking the mean value of importance across a number
> > >> of WikiProjects in the preliminary experiment may have lost something,
> > >> because it tried (through averaging) to look at importance
> > >> "generally". I suspect that an experiment considering only the
> > >> importance ratings with respect to a single WikiProject would be more
> > >> likely to show correlation with pageviews (relative to other articles
> > >> in that same WikiProject) and inbound links. And I think there are two
> > >> kinds of inbound links to be considered: those coming from other
> > >> articles within the same WikiProject and those coming from outside
> > >> that WikiProject. I suspect different insights will be obtained by
> > >> looking at the two types of inbound links separately rather than
> > >> treating them as an aggregate. I note also that WikiProjects are not
> > >> entirely independent of one another but have relationships between
> > >> them. For example, WikiProject Australian Roads describes itself as an
> > >> "intersection" (ha ha!) of WikiProject Highways and WikiProject
> > >> Australia, so I expect that we would find greater correlation in
> > >> importance between related WikiProjects than between unrelated
> > >> WikiProjects.
> > >>
> > >> When thinking about readers and pageviews, I think we have to ask
> > >> ourselves whether there is a difference between popularity and
> > >> importance, or whether popularity *is* importance. I sense that, as a
> > >> group of educated people, those of us reading this research mailing
> > >> list probably do think there is a difference. Certainly if there is no
> > >> difference, then this research can stop now -- just judge importance
> > >> by pageviews. Let's assume a difference then. The pageviews of an
> > >> article are not always consistent over time.
> > >> Here are the pageviews for Drottninggatan:
> > >>
> > >> https://tools.wmflabs.org/pageviews/?project=en.wikipedia.org&platform=all-access&agent=user&range=latest-90&pages=Drottninggatan
> > >>
> > >> Why so interesting on 8 April? A terrorist attack occurred there. This
> > >> kind of spike in pageviews occurs all the time when some topic is in
> > >> the news (even peripherally, as in this case, where it is not the
> > >> article about the terrorist attack but the one about the street in
> > >> which it occurred). Did the street become more "important"? I think it
> > >> became more interesting but not more important. So I think we do have
> > >> to be careful to understand that pageviews probably reflect interest
> > >> rather than importance. I note that The Chainsmokers (a music group
> > >> with a number of songs in the current USA music charts) gets many more
> > >> pageviews than the Wikipedia article on Pasteurization, but The
> > >> Chainsmokers are not rated as being of high importance by the relevant
> > >> WikiProjects, while Pasteurization is very important in WikiProject
> > >> Food and Drink. Since pasteurisation prevents a lot of deaths, I think
> > >> we might agree that in the real world pasteurisation is more important
> > >> than a music group, regardless of what pageviews tell us.
> > >>
> > >> https://tools.wmflabs.org/pageviews/?project=en.wikipedia.org&platform=all-access&agent=user&range=latest-90&pages=The_Chainsmokers|Pasteurization
> > >>
> > >> Of course it matters for Wikipedia's success that our *popular*
> > >> articles are of high quality, but I think we have to be cautious
> > >> about treating pageviews as a proxy for importance.
> > >>
> > >> When we look at Wikipedia writers' decisions in tagging the
> > >> importance
> > of
> > >> articles to WikiProjects, what do we find? As we know, project
> > >> tags
> are
> > >> often placed on new articles (and often not subsequently
> > >> reviewed). So while I find that quality tags are often
> > >> out-of-date, the importance
> > seems
> > >> to be pretty accurate even on a new stub articles. This is
> > >> because it
> is
> > >> the importance of the *topic* that is being assessed which is
> > independent
> > >> of the Wikipedia article itself. Provided the article is clear
> > >> enough
> > about
> > >> what it is about and why it matters (which is the traditional
> > >> content
> of
> > >> that first paragraph or two and failing to provide it will likely
> > result in
> > >> speedy deletion of the new article), assessment of the topic's
> > importance
> > >> can be made even at new stub level. This tells us that importance
> > >> for Wikipedia writers is determined by something outside of
> > >> Wikipedia
> > (probably
> > >> their real-world knowledge of that topic space -- one assumes
> > >> that
> > project
> > >> taggers are quite interested in the topic space of that project).
> While
> > >> article quality hopefully improves over time, I would be
> > >> surprised if article importance greatly changed over time.
> > >> Obviously there are counter-examples.  I am guessing Donald
> > >> Trump's article may have grown
> > in
> > >> importance over time but that's probably because his lede para
> changed.
> > >> Adding President of the USA into the lede paragraph makes him
> > >> much
> more
> > >> important than he was before in the real world and internal to
> > Wikipedia he
> > >> has acquired an inbound link from the presumably high-importance
> > President
> > >> of the USA article. So I think it might be interesting to study
> > >> those articles whose importance does change over time to see if
> > >> there are
> any
> > >> strong correlations with what is happening to the article inside
> > Wikipedia.
> > >> I think this set of importance-changing articles may be where we
> > >> really learn which Wikipedia article characteristics are strongly
> > >> correlated with "importance", given that importance itself
> > >> appears to be pretty stable for most articles.
> > >>
> > >> Although not stated explicitly, I imagine we believe that
> > >> generally
> less
> > >> important articles tend to link to more important articles but
> > >> more important articles don't link to less important articles.
> > >> Hence in-bound links are likely to matter in assessing
> > >> importance, and in-bound links from "important" articles are
> > >> more valuable than in-bound links from less important articles
> > >> (which creates something of a bootstrapping problem, similar to
> > >> the one faced by Google's PageRank algorithm). But I think we do
> > >> have some information that Google doesn't have. The average
> > >> webpage does not have a lede paragraph that
> > >> situates
> > the
> > >> topic relative to other topics; a Wikipedia article does. If I
> > >> have to choose to define Thing X in terms of Thing Y, it tends to
> > >> suggest that
> > Y is
> > >> more important than X. If Y also defines itself in terms of X,
> > >> then it tends to suggest they are equivalent in importance in
> > >> some way. Indeed I
> > >> suspect when we get to the VERY IMPORTANT topics we will see this
> > >> kind
> > of
> > >> circular definition (e.g. you see circular definitions in
> > >> Wikipedia
> > around
> > >> Philosophy and Knowledge). As an aside, if you have never done
> > >> this before, try this experiment. Choose a random article (left
> > >> hand tool bar in Desktop Wikipedia), then click the first link in
> > >> the article that matters (i.e. ignore links in hatnotes or links
> > >> inside parentheses). Repeat this first-link clicking and sooner or
> > >> later you will reach articles like Knowledge and Philosophy,
> > >> which all sit inside circular definition groups.
> > >>
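The bootstrapping problem described above (an in-bound link is worth more when it comes from an important article, but importance is what we are trying to compute) is usually resolved by iterating until the scores settle. A minimal sketch in Python, assuming a made-up four-article link graph and the commonly used damping value of 0.85; the article names, the function name and the parameters are illustrative assumptions, not anything taken from the project discussed in this thread:

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict {page: [pages it links to]}."""
    pages = sorted(set(links) | {t for ts in links.values() for t in ts})
    n = len(pages)
    # Start from a uniform guess; repeated updates remove the need to
    # know any page's importance in advance.
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if not targets:
                # A page with no outgoing links spreads its rank evenly.
                targets = pages
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical toy graph: two stubs point at a hub, and the hub and a
# broader topic define themselves in terms of each other.
toy_links = {
    "Stub A": ["Hub"],
    "Stub B": ["Hub"],
    "Hub": ["Topic"],
    "Topic": ["Hub"],
}
ranks = pagerank(toy_links)
for title in sorted(ranks, key=ranks.get, reverse=True):
    print(f"{title}: {ranks[title]:.3f}")
```

In this toy graph the two mutually linking articles end up with most of the score, which is at least consistent with the circular-definition observation above.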
> > >> If we look at the Donald Trump article, his first sentence
> > >> contains
> only
> > >> two links, one to List of Presidents of the USA and the other to
> > President
> > >> of the USA. If we look at those two articles, we find that both
> > >> of them mention Donald Trump in their lede paras (although not as
> > >> early as the first sentence) and before mentions of any other US
> > >> President elsewhere in the article. This is consistent with what
> > >> we know about the real world: the role of the President is more
> > >> important than its officeholders, and the current officeholder
> > >> has more importance than a past officeholder. So topic importance
> > >> does seem to be skewed towards the "present day".
> > >>
> > >> So I suspect the links in the lede paras are of greater relevance
> > >> to
> the
> > >> assessment of importance than links further down in the article,
> > >> which are more likely to relate to details of a topic and may
> > >> include examples and counter-examples (this is a way in which a
> > >> high-importance article may mention much lower-importance
> > >> articles). However, we do have to be a little
> > >> bit careful here because of the MoS practice of not linking very
> common
> > >> terms. For example, an Australian article will often refer to
> Australia
> > in
> > >> the lede para but it will almost certainly not be linked to the
> > Australia
> > >> article (and any attempt to add such a link will likely see it
> > >> removed
> > with
> > >> an edit summary that mentions [[WP:Overlinking]]) whereas there
> > >> is no problem if you link to an Australian state article, e.g.
> > >> New South
> > Wales.
> > >> So we might find that some very important topics that often
> > >> appear in ledes get fewer links than you might expect because of
> > >> the MoS policies on overlinking, which may be a problem when
> > >> working with inbound links. It may
> > >> be that for "very common topics" the presence of the article
> > >> title (or
> > its
> > >> synonyms) in the lede may have to be considered as if it were an
> > in-bound
> > >> link for statistical research purposes.
> > >>
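The idea above of treating an unlinked lede mention of a very common topic as if it were an in-bound link can be sketched roughly as follows. This is a minimal Python sketch under stated assumptions: the lede wikitext is already available as plain strings, the set of "very common" titles is supplied by hand, and the town names and the function name `lede_inbound_counts` are hypothetical; only the `[[Target|label]]` wikilink syntax is taken as given.

```python
import re
from collections import Counter

# Matches [[Target]], [[Target|label]] and [[Target#Section|label]],
# capturing only the target title.
WIKILINK = re.compile(r"\[\[([^\]|#]+)(?:#[^\]|]*)?(?:\|[^\]]*)?\]\]")


def lede_inbound_counts(ledes, common_titles):
    """Count, per target title, how many ledes link to or mention it.

    ledes: {article title: wikitext of its lede paragraph}
    common_titles: titles that are rarely linked because of overlinking
    rules, so a plain-text mention is counted as a pseudo in-bound link.
    """
    counts = Counter()
    for source, wikitext in ledes.items():
        linked = {target.strip() for target in WIKILINK.findall(wikitext)}
        # Remove the links so that link labels are not counted as mentions.
        plain = WIKILINK.sub(" ", wikitext)
        mentioned = {
            title
            for title in common_titles
            if re.search(r"\b" + re.escape(title) + r"\b", plain)
        }
        # Each source article counts at most once per target.
        for target in (linked | mentioned) - {source}:
            counts[target] += 1
    return counts


# Hypothetical ledes in the style described above: the state is linked,
# the country is mentioned but not linked.
ledes = {
    "Town X": "Town X is a town in [[New South Wales]], Australia.",
    "Town Y": "Town Y is a town in [[New South Wales|NSW]], Australia.",
}
counts = lede_inbound_counts(ledes, {"Australia"})
print(counts["New South Wales"], counts["Australia"])
```

Synonyms of a title are not handled here; they would need an extra mapping from each synonym to its article.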
> > >> Given all of the above, perhaps the most interesting group of
> > >> articles
> > to
> > >> study in Wikipedia are those articles whose manually-assessed
> importance
> > >> has changed over the life of the article AND which were NOT
> > >> current
> > topics
> > >> in the lifetime of Wikipedia (given the influence of "current" on
> > >> importance). But having said that, I wonder if that group of
> > >> articles actually exists. Recently a newish Australian
> > >> contributor expressed disappointment that all the new articles
> > >> they had created were tagged
> > (by
> > >> others) as of Low Importance. My instinctive reply was "that's
> normal, I
> > >> think of the thousands of articles I have started only a couple
> > >> even
> > rated
> > >> as Mid importance, this is because the really important articles
> > >> were
> > all
> > >> started long ago precisely because they were important". I
> > >> suspect topics that are very important (for reasons other than
> > >> short-lived importance due to being "current" in the lifetime of
> > >> Wikipedia) will generally show up as having started early in
> > >> Wikipedia's life, and that those that become more or less
> > >> important over time will largely be linked to becoming or ceasing
> > >> to be "current" topics. E.g. the article Pasteurization
> > >> started in May 2001 saying nothing more than "Pasteurization is
> > >> the process of killing off bacteria in milk by quickly heating it
> > >> to a
> near
> > >> boiling temperature, then quickly cooling it again before the
> > >> taste
> and
> > >> other desirable properties are affected. The process was named
> > >> after
> its
> > >> inventor, French scientist Louis Pasteur. See also dairy products."
> The
> > >> links in this very first version are still present in its lede
> paragraph
> > >> today, suggesting our understanding of "non-current" topics is
> > >> stable
> > and
> > >> hence initial importance determinations can probably be
> > >> accurately
> made.
> > >> For Pasteurization the Talk page shows it was not project-tagged
> > >> until
> > 2007
> > >> when it was assigned High Importance as its first assessment.
> > >>
> > >> I suspect we will find that initial manual assessment of article
> > >> importance will be pretty accurate for most articles. And I
> > >> suspect if
> > we
> > >> plot initial importance assessments against time of assessment,
> > >> we
> will
> > >> find the higher importance articles commenced life on Wikipedia
> earlier
> > >> than the lower importance articles. If I am correct, then there
> > >> isn't
> a
> > lot
> > >> of value in machine-assessment of importance of topics because it
> > relates
> > >> to factors external to Wikipedia and often does not change over
> > >> time
> and
> > >> therefore can often be correctly assessed manually even on new
> > >> stub articles (and any unassessed articles can probably be rated
> > >> as Low Importance as statistically that's almost certainly going
> > >> to be
> > correct).
> > >> If a topic becomes more important due to "current" events, then
> > invariably
> > >> that article will be updated by many people and one of them will
> sooner
> > or
> > >> later manually adjust its importance. What is less likely to
> > >> happen is a downward re-assessment of importance when an
> > >> important "current" topic is no longer current, e.g. are former
> > >> American presidents like Barack Obama or George W Bush (or those
> > >> further back) less important now? These articles will not be
> > >> updated frequently
> > >> once the
> > topic
> > >> is no longer in the news and therefore it is less likely an
> > >> editor
> will
> > >> notice and manually downgrade the importance, so there may be a
> greater
> > >> role for machine-assessment in downgrading importance rather than
> > upgrading
> > >> importance.
> > >>
> > >> Another area where there might be a role for machine-assessed
> > >> importance is POV-pushing, where a POV-motivated editor might
> > >> change the manually assessed importance of articles to be higher
> > >> or lower based on their POV (e.g. my political party is Top
> > >> Importance, other parties are of Low Importance). I suspect that
> > >> often a page watcher would correct or at least question that kind
> > >> of re-assessment. However, on articles with few active
> > >> pagewatchers you might get away with POV-pushing the article's
> > >> importance tag because nobody notices. In this situation, a
> > >> machine assessment could be useful in spotting this kind of thing.
> > >>
> > >> This suggests that another metric of interest to importance might
> > >> be number of pagewatchers, although I suspect that pagewatching
> > >> may
> relate
> > >> more to caring about the article than to caring about the topic.
> > >> And
> one
> > >> has to be careful to distinguish active pagewatchers (those who
> > actually do
> > >> review changes on their watchlists) from those who don't, as that
> > >> may
> > make
> > >> a difference (although I am not sure we can really tell which
> > pagewatchers
> > >> are truly actively reviewing as a "satisfactory review" doesn't
> > >> leave
> a
> > >> trace, whereas an "unsatisfactory" review is likely to lead
> > >> fairly soon to a revert or some other change to the article, the
> > >> article Talk or the User Talk of the reviewed contributor, which
> > >> may be detectable).
> > >>
> > >> The other aspect of articles that occurs to me as being possibly
> linked
> > to
> > >> importance of the topic would be use of the article as the "main"
> > article
> > >> for a category or as the title of a navbox (as it suggests that
> > >> the articles in the category or navbox are in some way
> > >> subordinate to the main/title article). Similarly for list
> > >> articles, the "type" of the list is often more important than
> > >> its instances.
> > >>
> > >> Kerry
> > >>