Content similarity between two Wikipedia articles


Content similarity between two Wikipedia articles

Haifeng Zhang
Dear folks,

Is there a way to compute content similarity between two Wikipedia articles?

For example, I can think of representing each article as a vector of likelihoods over possible topics.
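
For instance, a rough sketch of that representation in Python, assuming
gensim's LDA and tokenized plain-text article bodies obtained elsewhere
(texts, tokens_a, and tokens_b are placeholder names):

from gensim import corpora, models
import numpy as np

# texts = list of tokenized article bodies used to fit the model.
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]
lda = models.LdaModel(corpus, num_topics=100, id2word=dictionary)

def topic_vector(tokens):
    # Dense vector of topic likelihoods for one article.
    bow = dictionary.doc2bow(tokens)
    return np.array([p for _, p in
                     lda.get_document_topics(bow, minimum_probability=0.0)])

def topic_similarity(tokens_a, tokens_b):
    # Cosine similarity of the two articles' topic distributions.
    a, b = topic_vector(tokens_a), topic_vector(tokens_b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))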

But I wonder what other work people have already explored in the past.


Thanks,

Haifeng

Re: Content similarity between two Wikipedia articles

RhinosF1
The comparison tool on
https://tools.wmflabs.org/copyvios/ can look for repeated phrases.

You might be able to tweak that a bit.
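
If you want to experiment locally, here is a rough analogue of
repeated-phrase detection via word n-gram overlap (the tool itself is more
sophisticated than this sketch):

import re

def ngrams(text, n=5):
    # Set of word n-grams in a lowercased text.
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def shared_phrase_ratio(text_a, text_b, n=5):
    # Jaccard overlap of word n-grams between the two article texts.
    a, b = ngrams(text_a, n), ngrams(text_b, n)
    return len(a & b) / len(a | b) if (a or b) else 0.0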

Re: Content similarity between two Wikipedia articles

Morten Wang
In reply to this post by Haifeng Zhang
Hi Haifeng,

Yes, you might want to look into some of the work done by Hecht et al. on
content similarity between languages, as well as work by Sen et al. on
semantic relatedness algorithms (which are implemented in the WikiBrain
framework <http://wikibrainapi.org/>, by the way; see the references
below). Some papers to start with:

- Bao, P., Hecht, B., Carton, S., Quaderi, M., Horn, M., and Gergle, D.
  "Omnipedia: Bridging the Wikipedia Language Gap." CHI 2012.
  <http://www.brenthecht.com/publications/bhecht_CHI2012_omnipedia.pdf>
- Hecht, B., and Gergle, D. "The Tower of Babel Meets Web 2.0:
  User-Generated Content and Its Applications in a Multilingual Context."
  CHI 2010.
  <http://www.brenthecht.com/publications/bhecht_chi2010_towerofbabel.pdf>
- Sen, S., Swoap, A. B., Li, Q., Boatman, B., Dippenaar, I., Gold, R.,
  Ngo, M., Pujol, S., Jackson, B., and Hecht, B. "Cartograph: Unlocking
  Spatial Visualization Through Semantic Enhancement." IUI 2017.
  <http://www.shilad.com/static/cartograph-iui-2017-final.pdf>
- Sen, S., Johnson, I., Harper, R., Mai, H., Horlbeck Olsen, S.,
  Mathers, B., Souza Vonessen, L., Wright, M., and Hecht, B. "Towards
  Domain-Specific Semantic Relatedness: A Case Study in Geography."
  IJCAI 2015. <http://ijcai.org/papers15/Papers/IJCAI15-334.pdf>
- Sen, S., Lesicko, M., Giesel, M., Gold, R., Hillmann, B., Naden, S.,
  Russell, J., Wang, Z. "Ken", and Hecht, B. "Turkers, Scholars, 'Arafat'
  and 'Peace': Cultural Communities and Algorithmic Gold Standards."
  CSCW 2015.
  <http://www-users.cs.umn.edu/~bhecht/publications/goldstandards_CSCW2015.pdf>
- Sen, S., Li, T. J.-J., Lesicko, M., Weiland, A., Gold, R., Li, Y.,
  Hillmann, B., and Hecht, B. "WikiBrain: Democratizing Computation on
  Wikipedia." OpenSym 2014.
  <http://www-users.cs.umn.edu/~bhecht/publications/WikiBrain-WikiSym2014.pdf>

You can of course also utilize similarity measures from the recommender
systems and information retrieval fields, e.g. use edit histories to
identify articles that have been edited by the same users, or apply search
engine techniques like TF-IDF and content vectors.
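
For the TF-IDF route, a minimal sketch with scikit-learn, assuming text_a
and text_b hold the two articles' plain text fetched elsewhere:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Fit TF-IDF on the two texts and compare their vectors.
vectors = TfidfVectorizer(stop_words="english").fit_transform([text_a, text_b])
print(cosine_similarity(vectors[0], vectors[1])[0, 0])

(In practice you would fit the vectorizer on a larger corpus so the IDF
weights are meaningful, but the comparison step is the same.)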


Cheers,
Morten

Re: Content similarity between two Wikipedia articles

fn
In reply to this post by Haifeng Zhang
Dear Haifeng,


Would you not be able to use ordinary information retrieval techniques
such as bag-of-words/phrases and TF-IDF? Explicit semantic analysis (ESA)
uses this approach (though its primary focus is word-level semantic
similarity).
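
To illustrate, a toy ESA-style sketch: each text becomes a vector of
TF-IDF similarities to a set of Wikipedia "concept" articles. Here
concept_texts is a placeholder for the concept articles' plain text,
fetched elsewhere; real ESA uses (much of) the full Wikipedia as the
concept space.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

vectorizer = TfidfVectorizer(stop_words="english")
concept_matrix = vectorizer.fit_transform(concept_texts)

def esa_vector(text):
    # Similarity of the text to every concept article.
    return cosine_similarity(vectorizer.transform([text]), concept_matrix)

def esa_similarity(text_a, text_b):
    # Compare the two texts in concept space.
    return cosine_similarity(esa_vector(text_a), esa_vector(text_b))[0, 0]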

There are a few papers on ESA:
https://tools.wmflabs.org/scholia/topic/Q5421270

I have also used it in "Open semantic analysis: The case of word level
semantics in Danish"
http://www2.compute.dtu.dk/pubdb/views/edoc_download.php/7029/pdf/imm7029.pdf


Finn Årup Nielsen
http://people.compute.dtu.dk/faan/




Re: Content similarity between two Wikipedia articles

Isaac Johnson
Hey Haifeng,
On top of all the excellent answers provided, I'd also add that the answer
to your question depends on what you want to use the similarity scores for.
For some insight into what it might mean to choose one approach over
another, see this recent publication:
https://dl.acm.org/citation.cfm?id=3213769

At a high level, I'd say that there are three ways you might approach
article similarity on Wikipedia:
* Reader similarity: two articles are similar if the same people who read
one also frequently read the other. Navigation embeddings that implement
this definition based on page views were last generated in 2017, so newer
articles will not be represented, but here is the dataset
[ https://figshare.com/articles/Wikipedia_Vectors/3146878 ] and meta page
[ https://meta.wikimedia.org/wiki/Research:Wikipedia_Navigation_Vectors ].
The clickstream dataset
[ https://dumps.wikimedia.org/other/clickstream/readme.html ], which is
more recent, might be used in a similar way.
* Content similarity: two articles are similar if they contain similar
content -- i.e., in most cases, similar text. This covers most of the
suggestions provided to you in this email chain. Some are simpler but
language-specific unless you make substantial modifications (e.g., ESA, or
the LDA model described here:
https://cs.stanford.edu/people/jure/pubs/wikipedia-www17.pdf), while
others are more complicated but work across multiple languages (e.g., a
recent WSDM paper:
https://twitter.com/cervisiarius/status/1115510356976242688).
* Link similarity: two articles are similar if they link to similar
articles. Generally, this approach involves creating a graph of
Wikipedia's link structure and then using an approach such as node2vec to
reduce the graph to article embeddings. I know less about the current
approaches in this space, but some searching should turn up a variety --
e.g., Milne and Witten's 2008 approach
[ http://www.aaai.org/Papers/Workshops/2008/WS-08-15/WS08-15-005.pdf ],
which is implemented in WikiBrain as Morten mentioned; a small sketch of
it follows this list.
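
A minimal sketch of Milne and Witten's inlink measure, which is modeled on
Normalized Google Distance. The inlink sets and the wiki's article count
are assumed to come from elsewhere, e.g. the pagelinks dump:

from math import log

def mw_distance(inlinks_a, inlinks_b, n_articles):
    # Distance over inlink sets: 0 = identical neighbourhoods,
    # 1 = nothing in common.
    common = len(inlinks_a & inlinks_b)
    if common == 0:
        return 1.0
    big = max(len(inlinks_a), len(inlinks_b))
    small = min(len(inlinks_a), len(inlinks_b))
    d = (log(big) - log(common)) / (log(n_articles) - log(small))
    return min(max(d, 0.0), 1.0)

# relatedness = 1 - mw_distance(inlinks_a, inlinks_b, n_articles)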

There are also other, more structured approaches like ORES drafttopic,
which predicts which topics (based on WikiProjects) are most likely to
apply to a given English Wikipedia article:
https://www.mediawiki.org/wiki/Talk:ORES/Draft_topic
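
For example, a hedged sketch of comparing two articles' drafttopic
probability vectors. This assumes the ORES v3 scores endpoint and its
current response layout, plus revision IDs for the two articles obtained
elsewhere:

import requests
import numpy as np

def drafttopic_probabilities(revid):
    # Topic label -> probability, as scored by the drafttopic model.
    url = f"https://ores.wikimedia.org/v3/scores/enwiki/{revid}/drafttopic"
    response = requests.get(url).json()
    return response["enwiki"]["scores"][str(revid)]["drafttopic"]["score"]["probability"]

def topic_similarity(revid_a, revid_b):
    # Cosine similarity over the union of predicted topic labels.
    pa, pb = drafttopic_probabilities(revid_a), drafttopic_probabilities(revid_b)
    topics = sorted(set(pa) | set(pb))
    a = np.array([pa.get(t, 0.0) for t in topics])
    b = np.array([pb.get(t, 0.0) for t in topics])
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

This is close in spirit to your original idea of representing each article
as a vector of likelihoods over possible topics.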

--
Isaac Johnson -- Research Scientist -- Wikimedia Foundation

Re: Content similarity between two Wikipedia articles

Kerry Raymond
Indeed, the purpose does matter. Is the end goal the content similarity of
the articles themselves (perhaps, say, to detect articles that might be
merged), or is it the relatedness of the topics represented by those
articles? If the latter, then the Wikipedia category system relates
articles with some commonality of topic, and distance between articles via
the category hierarchy is an indicator of the level of relatedness.
Similarly, navboxes relate articles that have something in common, as do
list articles. All three of these are manually curated, and may be a much
cheaper way to determine relatedness of topics than messing about with
bags of words, etc. But it all really depends on the end goal.
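
As a cheap starting point for the category route, a minimal sketch using
the MediaWiki API: Jaccard overlap of the two articles' visible category
sets (walking the category hierarchy for graph distances is left out):

import requests

API = "https://en.wikipedia.org/w/api.php"

def visible_categories(title):
    # Non-hidden categories of one article, via prop=categories.
    params = {"action": "query", "prop": "categories", "titles": title,
              "clshow": "!hidden", "cllimit": "max", "format": "json"}
    pages = requests.get(API, params=params).json()["query"]["pages"]
    page = next(iter(pages.values()))
    return {c["title"] for c in page.get("categories", [])}

def category_similarity(title_a, title_b):
    # Jaccard overlap of the two category sets.
    a, b = visible_categories(title_a), visible_categories(title_b)
    return len(a & b) / len(a | b) if (a or b) else 0.0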

Kerry
