Incoming and outgoing links enquiry

Incoming and outgoing links enquiry

Nick Bell
Hi there,

I'm a final year Mathematics student at the University of Bristol, and I'm
studying Wikipedia as a graph for my project.

I'd like to get data regarding the number of outgoing links on each page,
and the number of pages with links to each page. I have already
inquired about this with the Analytics Team mailing list, who gave me a few
suggestions.

One of these was to run the query at https://quarry.wmflabs.org/query/25400
with these instructions:

"You will have to fork it and remove the "LIMIT 10" to get it to run on
all the English Wikipedia articles. It may take too long or produce
too much data, in which case please ask on this list for someone who
can run it for you."

I ran the query as instructed, but it was killed because it took longer
than 30 minutes to run. I asked if anyone on the mailing list could run it
for me, but no one replied saying they could. The person who wrote the query
suggested I try this mailing list to see if anyone can help.

I'm a beginner at programming, so any and all help you can
give me would be greatly appreciated.

Many thanks,
Nick Bell
University of Bristol

Re: Incoming and outgoing links enquiry

Brian Wolff
Hi,

You can run longer queries by getting access to Toolforge
(https://wikitech.wikimedia.org/wiki/Portal:Toolforge) and running them from
the command line.

However, the query in question might still take an excessively long time
if you are doing all of Wikipedia. I would expect that query to produce
about 150 MB of data and possibly take days to complete.

You can also break it down into parts by adding a clause like WHERE
page_title >= 'a' AND page_title < 'b' and running one range at a time; a
rough sketch of that is below.
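
For illustration, here is roughly what one range could look like when run
from Toolforge in Python (this is not the actual Quarry query, which I
haven't copied here; the page/pagelinks column usage, the replica host name
and the ~/replica.my.cnf credentials file are just the usual Toolforge
conventions, so adjust them to your own setup):

import os
import pymysql

# Count outgoing and incoming links for articles in one title range.
# Correlated subqueries keep the example short; for a real run you would
# probably split this into two GROUP BY queries instead.
QUERY = """
SELECT p.page_title,
       (SELECT COUNT(*) FROM pagelinks
         WHERE pl_from = p.page_id)            AS outgoing,
       (SELECT COUNT(*) FROM pagelinks
         WHERE pl_namespace = p.page_namespace
           AND pl_title = p.page_title)        AS incoming
FROM page AS p
WHERE p.page_namespace = 0        -- articles only
  AND p.page_title >= %s          -- one slice of the alphabet at a time
  AND p.page_title < %s
"""

conn = pymysql.connect(
    host="enwiki.analytics.db.svc.eqiad.wmflabs",   # usual Toolforge replica host
    database="enwiki_p",
    read_default_file=os.path.expanduser("~/replica.my.cnf"),
)
with conn.cursor() as cur:
    cur.execute(QUERY, ("A", "B"))
    for title, outgoing, incoming in cur:
        # page_title comes back as bytes from the replica
        print(title.decode("utf-8"), outgoing, incoming, sep="\t")
conn.close()

Even split up like this it will not be fast, so start with a narrow title
range to get a feel for the run time.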

Note, also of interest: full dumps of all the links are available at
https://dumps.wikimedia.org/enwiki/20180301/enwiki-20180301-pagelinks.sql.gz
(you would also need
https://dumps.wikimedia.org/enwiki/20180301/enwiki-20180301-page.sql.gz to
convert page IDs to page names).
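
If you go the dump route instead, a minimal standard-library sketch for
counting links straight from the pagelinks dump looks something like the
following. The column order (pl_from, pl_namespace, pl_title,
pl_from_namespace) is my recollection of the current pagelinks schema, so
double-check it against the CREATE TABLE statement at the top of the file,
and the quote handling is deliberately simplified:

import gzip
import re
from collections import Counter

# Matches one (pl_from, pl_namespace, pl_title, pl_from_namespace) tuple
# inside an INSERT statement; handles backslash-escaped quotes in titles.
ROW = re.compile(rb"\((\d+),(\d+),'((?:[^'\\]|\\.)*)',(\d+)\)")

outgoing = Counter()   # source page_id -> number of links out
incoming = Counter()   # target (namespace, title) -> number of links in

with gzip.open("enwiki-20180301-pagelinks.sql.gz", "rb") as f:
    for line in f:
        if not line.startswith(b"INSERT INTO"):
            continue
        for pl_from, pl_namespace, pl_title, _ns_from in ROW.findall(line):
            outgoing[int(pl_from)] += 1
            incoming[(int(pl_namespace), pl_title)] += 1

# outgoing is keyed by page_id, so join it against the page dump
# (page_id -> page_title) to get readable names for the source pages.
print(outgoing.most_common(5))
print(incoming.most_common(5))
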
--
Brian

Re: Incoming and outgoing links enquiry

John Doe-27
I would second the recommendation of using the dumps for such a large
graphing project. If it's more than a couple hundred pages, the API and
database queries can get unwieldy.


Re: Incoming and outgoing links enquiry

Erik Bernhardson
This information is available, mostly pre-calculated, in the CirrusSearch
dumps at http://dumps.wikimedia.your.org/other/cirrussearch/current/

Each article is represented by a line of JSON in those dumps. There is a
field called 'incoming_links', which is the number of unique articles with
links from the content namespace(s) to that article. Each article
additionally contains an `outgoing_link` field, which is a list of strings
naming the pages the article links to (incoming_links is calculated by
querying the outgoing_link field). I've done graph work on Wikipedia before
using this, and the outgoing_link field is typically enough to build the
full graph.
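
For example, something like the following pulls the whole graph out of
those files. It's a quick sketch: as far as I remember the dumps are in
Elasticsearch bulk format, so small {"index": ...} metadata lines alternate
with the page documents, and the 'title' field name and the file name here
are from memory, so check the first few lines of an actual dump:

import gzip
import json

edges = {}        # title -> list of titles it links to
in_degree = {}    # title -> pre-computed incoming link count

with gzip.open("enwiki-cirrussearch-content.json.gz", "rt",
               encoding="utf-8") as f:
    for line in f:
        doc = json.loads(line)
        if "index" in doc:          # bulk-format metadata line, skip it
            continue
        title = doc.get("title")
        edges[title] = doc.get("outgoing_link", [])
        in_degree[title] = doc.get("incoming_links", 0)

print(len(edges), "articles,",
      sum(len(v) for v in edges.values()), "outgoing edges")

Holding every outgoing_link list for all of English Wikipedia in memory
adds up quickly, so for the full graph you may want to write the edges out
to disk as you read them rather than keeping them all in a dict.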


