Finding the number of links between two Wikipedia pages

Finding the number of links between two Wikipedia pages

Mara Sorella
Hi everybody, I'm new to the list and was referred here by a comment from a Stack Overflow user on my question [1], which I quote below:


I have been successfully able to use the Wikipedia pagelinks SQL dump to obtain hyperlinks between Wikipedia pages for a specific revision time.

However, there are cases where multiple instances of such links exist between the same pair of pages, e.g. between https://en.wikipedia.org/wiki/Wikipedia and https://en.wikipedia.org/wiki/Wikimedia_Foundation. I'm interested in finding the number of links between pairs of pages for a specific revision.

Ideal solutions would involve dump files other than pagelinks (though I'm not aware of any suitable ones), or the MediaWiki API.



To elaborate, I need this information to weight (almost) every hyperlink between article pages (that is, in NS0) that was present in a specific Wikipedia revision (end of 2015); therefore, I would prefer not to follow the solution suggested by the SO user, which would be rather impractical.
 
Indeed, my final aim is to use this weight in a thresholding fashion to sparsify the Wikipedia graph (which, due to its short diameter, is more or less one giant connected component), in a way that should reflect the "relatedness" of the linked pages (where relatedness is not intended as strictly semantic, but at a higher "concept" level, if I may say so).
For this reason, other suggestions on how to determine such weights (possibly using other data sources -- ontologies?) are more than welcome.

The graph will be used as a dataset to test an event-tracking algorithm I am doing research on.


Thanks,


Re: Finding the number of links between two Wikipedia pages

Giuseppe Profiti
2017-02-19 20:56 GMT+01:00 Mara Sorella <[hidden email]>:
> [...]

Hi Mara,
The MediaWiki API does not return the multiplicity of the links [1]. As far
as I can see from the database layout, you can't get the multiplicity of
links from it either [2]. The only solution that occurs to me is to parse
the wikitext of the page, as the SO user suggested.
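For example, a rough sketch of that wikitext route in Python (untested and only meant as an illustration: it assumes the requests library, fetches one revision by its id, and counts only plain [[...]] links, so links produced by templates are missed):

    import re
    from collections import Counter

    import requests  # assumed to be available

    API = "https://en.wikipedia.org/w/api.php"

    def wikitext_for_revision(revid):
        """Fetch the wikitext of one specific revision via the MediaWiki API."""
        params = {
            "action": "query",
            "prop": "revisions",
            "revids": revid,
            "rvprop": "content",
            "rvslots": "main",
            "format": "json",
            "formatversion": "2",
        }
        data = requests.get(API, params=params).json()
        return data["query"]["pages"][0]["revisions"][0]["slots"]["main"]["content"]

    # Plain wikilinks: [[Target]], [[Target|label]], [[Target#Section|label]]
    WIKILINK = re.compile(r"\[\[([^|\]#]+)[^\]]*\]\]")

    def link_multiplicities(wikitext):
        """Count how many times each target title is linked from the wikitext."""
        counts = Counter()
        for match in WIKILINK.finditer(wikitext):
            target = match.group(1).strip().replace("_", " ")
            if target:
                counts[target[0].upper() + target[1:]] += 1  # normalise first letter
        return counts

    # Usage (the revision id here is a made-up placeholder):
    # counts = link_multiplicities(wikitext_for_revision(123456789))
    # print(counts.get("Wikimedia Foundation", 0))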

In any case, some communities have established writing styles that
discourage multiple links to the same article (e.g. on the Italian
Wikipedia a link is attached only to the first occurrence of a word). The
numbers you get may therefore vary depending on the style of the community
and/or the last editor.

> [...] For this reason, other suggestions on how to determine such weights
> (possibly using other data sources -- ontologies?) are more than welcome.

Once you have the graph of connections, instead of using the multiplicity
as a weight, you could try community detection methods to isolate
subclusters of strongly connected articles. Another approach may be to use
centrality measures; however, the only one that can be applied to edges
instead of just nodes is betweenness centrality, if I remember correctly.
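Both ideas are a few lines with networkx, for what it's worth (a toy sketch only; the real graph would of course come from the pagelinks data):

    import networkx as nx
    from networkx.algorithms.community import greedy_modularity_communities

    # Toy directed link graph standing in for the real one.
    G = nx.DiGraph()
    G.add_edges_from([
        ("Wikipedia", "Wikimedia Foundation"),
        ("Wikimedia Foundation", "Wikipedia"),
        ("Wikipedia", "Encyclopedia"),
        ("Encyclopedia", "Wikipedia"),
        ("Wikipedia", "Jimmy Wales"),
    ])

    # Community detection on the undirected projection of the graph.
    communities = greedy_modularity_communities(G.to_undirected())

    # Betweenness centrality is one of the few measures defined on edges;
    # low-betweenness edges could be candidates for pruning.
    edge_bc = nx.edge_betweenness_centrality(G)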

If a quick technical solution comes to mind, I'll write here again.

Best,
Giuseppe

[1] https://en.wikipedia.org/w/api.php?action=query&prop=links&titles=Wikipedia&plnamespace=0&pllimit=500&pltitles=Wikimedia_Foundation
[2] https://upload.wikimedia.org/wikipedia/commons/9/94/MediaWiki_1.28.0_database_schema.svg

Re: Finding the number of links between two Wikipedia pages

Ward Cunningham
I've built and open-sourced technology that can extract these sorts of unplanned features from dumps. The system would be primed with specific dumps and trained for the feature of interest. This might take a day. Then a full run would take an hour to produce a CSV file of features for downstream study.

Would this be of interest?

Best regards -- Ward

> On Feb 21, 2017, at 8:48 AM, Giuseppe Profiti <[hidden email]> wrote:
> [...]


Re: Finding the number of links between two Wikipedia pages

Mara Sorella
Hi Giuseppe, Ward

On Tue, Feb 21, 2017 at 5:48 PM, Giuseppe Profiti <[hidden email]> wrote:
> [...]
>
> Hi Mara,
> The MediaWiki API does not return the multiplicity of the links. As far
> as I can see from the database layout, you can't get the multiplicity of
> links from it either. The only solution that occurs to me is to parse the
> wikitext of the page, as the SO user suggested.
>
> In any case, some communities have established writing styles that
> discourage multiple links to the same article (e.g. on the Italian
> Wikipedia a link is attached only to the first occurrence of a word). The
> numbers you get may therefore vary depending on the style of the community
> and/or the last editor.
Yes, this is a good practice that I have noticed is very widespread. Indeed, it would cause the link-multiplicity-based weighting approach to fail.
A (costly) option would be to inspect the actual article text (possibly only the abstract). I guess this can be done starting from the dump files.
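(For instance, something along these lines should stream article wikitext out of a pages-articles dump without loading it all into memory; untested sketch, and the dump filename is just an example:)

    import bz2
    import xml.etree.ElementTree as ET

    DUMP = "enwiki-20151201-pages-articles.xml.bz2"  # example filename

    def article_texts(path):
        """Yield (title, wikitext) pairs for NS0 pages from a pages-articles dump."""
        with bz2.open(path, "rb") as f:
            title = ns = None
            for _, elem in ET.iterparse(f, events=("end",)):
                tag = elem.tag.rsplit("}", 1)[-1]  # strip the XML namespace
                if tag == "title":
                    title = elem.text
                elif tag == "ns":
                    ns = elem.text
                elif tag == "text" and ns == "0":
                    yield title, elem.text or ""
                elif tag == "page":
                    elem.clear()  # free memory for the finished page
                    title = ns = None

    # Crude "abstract": the wikitext before the first section heading.
    # for title, text in article_texts(DUMP):
    #     lead = text.split("\n==", 1)[0]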

@Ward: could your technology be of help for this task?


> [...]
>
> Once you have the graph of connections, instead of using the multiplicity
> as a weight, you could try community detection methods to isolate
> subclusters of strongly connected articles. Another approach may be to use
> centrality measures; however, the only one that can be applied to edges
> instead of just nodes is betweenness centrality, if I remember correctly.
Currently, I have resorted to keeping only reciprocal links, but I still get quite big connected components (despite the fact that I am actually carrying out a temporal analysis, where, for each time instant, I consider only pages exhibiting unusually high traffic).
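(For illustration, the reciprocal-link filtering itself is a one-liner in networkx on a toy graph; this is not my actual pipeline:)

    import networkx as nx

    G = nx.DiGraph()
    G.add_edges_from([
        ("Wikipedia", "Wikimedia Foundation"),
        ("Wikimedia Foundation", "Wikipedia"),  # mutual pair: kept
        ("Wikipedia", "Encyclopedia"),          # one-way link: dropped
    ])

    # Keep an undirected edge only where links exist in both directions.
    mutual = G.to_undirected(reciprocal=True)
    print(mutual.edges())  # only the Wikipedia -- Wikimedia Foundation edge remains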
Concerning community detection techniques/centrality: I discarded them because I don't want to "impose" connectedness (reachability) at the subgraph level, but only between single entities (since my algorithm aims to find some sort of temporally persistent subgraphs with certain properties).


Thank you both for your feedback!

Best,

Mara 


Re: Finding the number of links between two Wikipedia pages

Giovanni Luca Ciampaglia
Hi Mara, 

Since you were asking about ontologies, let me point you to our work on computational fact checking from knowledge networks, published in PLoS ONE. We developed a measure of semantic similarity based on shortest paths between any two concepts of Wikipedia, using the linked data from DBpedia; these are the links found in the infoboxes of Wikipedia articles, so they are a subset of the hyperlinks of the whole page.

In the article we use it to check simple relational statements, but it could be put to other uses too. There are also a couple of other approaches in the literature, which we cite in the paper, that could be relevant for what you are doing.
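As a very rough illustration of the idea (this is not the exact measure from the paper, just a shortest-path proxy on a made-up toy graph):

    import networkx as nx

    # Toy undirected knowledge graph; in the paper the nodes are DBpedia
    # entities and the edges come from Wikipedia infobox links.
    K = nx.Graph()
    K.add_edges_from([
        ("Barack Obama", "United States"),
        ("United States", "Washington, D.C."),
        ("Barack Obama", "Harvard Law School"),
    ])

    def path_relatedness(graph, a, b):
        """Crude proxy: relatedness decays with shortest-path distance.
        (The published measure also discounts paths that pass through
        very high-degree hub nodes; that refinement is omitted here.)"""
        try:
            d = nx.shortest_path_length(graph, a, b)
        except nx.NetworkXNoPath:
            return 0.0
        return 1.0 / (1.0 + d)

    print(path_relatedness(K, "Barack Obama", "Washington, D.C."))  # 1 / (1 + 2)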

HTH!

Giovanni 


Giovanni Luca Ciampaglia, Assistant Research Scientist, Indiana University


On Sun, Feb 19, 2017 at 2:56 PM, Mara Sorella <[hidden email]> wrote:
> [...]



Re: Finding the number of links between two Wikipedia pages

Mara Sorella
Thank you, Giovanni, I'll check it out!

Mara

On Fri, Feb 24, 2017 at 10:59 PM, Giovanni Luca Ciampaglia <[hidden email]> wrote:
> [...]


