
Get Wikipedia Page Titles using API looks Endless


Get Wikipedia Page Titles using API looks Endless

Abdulfattah Safa
I'm trying to get all the page titles in Wikipedia in namespace 0 using the
API, as follows:

https://en.wikipedia.org/w/api.php?action=query&format=xml&list=allpages&apnamespace=0&apfilterredir=nonredirects&aplimit=max&$continue=-||$apcontinue=BASE_PAGE_TITLE

I keep requesting this URL and checking whether the response contains a
continue tag. If it does, I issue the same request again, but with
BASE_PAGE_TITLE changed to the value of the apcontinue attribute in the
response.
My application has been running for 3 days now and the number of titles
retrieved exceeds 30M, whereas the dumps contain about 13M.
Any idea?
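
[For reference, a minimal sketch of such a continuation loop in Python. This
is illustrative only, not the poster's actual code: the requests library,
the function name and the User-Agent string are assumptions, and it uses
format=json for easier parsing. Note that the API documentation recommends
echoing back every key of the returned continue object, not only apcontinue.

    # Illustrative sketch -- not the original poster's code.
    # Requires: pip install requests
    import requests

    API = "https://en.wikipedia.org/w/api.php"

    def fetch_all_titles(namespace=0):
        """Yield all non-redirect page titles in one namespace."""
        session = requests.Session()
        session.headers["User-Agent"] = "title-harvester-example/0.1"
        params = {
            "action": "query",
            "format": "json",
            "list": "allpages",
            "apnamespace": namespace,
            "apfilterredir": "nonredirects",
            "aplimit": "max",
            "continue": "",  # opt in to the modern continuation format
        }
        while True:
            data = session.get(API, params=params, timeout=30).json()
            for page in data["query"]["allpages"]:
                yield page["title"]
            if "continue" not in data:
                break
            # Copy back ALL keys of the continue object (continue,
            # apcontinue, ...), as the API documentation recommends;
            # sending only apcontinue is a common source of bugs.
            params.update(data["continue"])
]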

Re: Get Wikipedia Page Titles using API looks Endless

Abdulfattah Safa
Regarding the $ in $continue=-||: it's a typo, it doesn't exist in the code.
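
[So the intended request was presumably, without the $ characters and with
an & before apcontinue:

https://en.wikipedia.org/w/api.php?action=query&format=xml&list=allpages&apnamespace=0&apfilterredir=nonredirects&aplimit=max&continue=-||&apcontinue=BASE_PAGE_TITLE

with BASE_PAGE_TITLE replaced by the latest apcontinue value.]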


Re: Get Wikipedia Page Titles using API looks Endless

Eran Rosenthal
1. You can use the aplimit parameter to get more titles in each request.
2. For retrieving large numbers of entries, it is recommended to extract
them from the dumps, or from the database using Quarry.


Re: Get Wikipedia Page Titles using API looks Endless

Abdulfattah Safa
1. I'm using max as the limit parameter.
2. I'm not sure the dumps have the data I need. I need the titles of all
articles (namespace = 0), with no redirects, and also the titles of all
categories (namespace = 14), without redirects.
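
[A hypothetical usage sketch, reusing the illustrative fetch_all_titles
helper shown after the first message, to harvest both namespaces:

    for ns in (0, 14):  # 0 = articles, 14 = categories
        with open(f"titles_ns{ns}.txt", "w", encoding="utf-8") as out:
            for title in fetch_all_titles(namespace=ns):
                out.write(title + "\n")
]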


Re: Get Wikipedia Page Titles using API looks Endless

John Doe-27
Give me a few minutes and I can get you a database dump of what you need.


Re: Get Wikipedia Page Titles using API looks Endless

John Doe-27
Here you go
ns_0.7z <http://tools.wmflabs.org/betacommand-dev/reports/ns_0.7z>
ns_14.7z <http://tools.wmflabs.org/betacommand-dev/reports/ns_14.7z>


Re: Get Wikipedia Page Titles using API looks Endless

Abdulfattah Safa
Hello John,
Thanks for your effort. Actually, I need official dumps, as I need to use
them in my thesis.
Could you please point me to how you generated these?
Also, any idea why the API doesn't work properly for the English Wikipedia?
I used the same code for other languages and it worked.

Thanks,
Abed


Re: Get Wikipedia Page Titles using API looks Endless

John Doe-27
Those are official. I ran the report from Tool Labs, which is Wikimedia's
developer platform and includes a copy of en.Wikipedia's database (with
sensitive fields removed). Without looking at your code and doing some
testing, which unfortunately I don't have the time for, I cannot help
debug why your code isn't working. Those two files were created by running

    sql enwiki_p "select page_title from page where page_is_redirect = 0 and page_namespace = 0;" > ns_0.txt

(the email originally said "select page_namespace", evidently a typo for
page_title) and then compressing the resulting text file with 7zip. For the
category namespace I just changed page_namespace = 0 to page_namespace = 14.


Re: Get Wikipedia Page Titles using API looks Endless

Jaime Crespo

Please do not scrape the web for this kind of request: it is a waste of
resources both for you and for the Wikimedia servers, given that there is a
faster and more reliable alternative.

Looking at https://dumps.wikimedia.org/enwiki/20170501/ you can find:

2017-05-03 07:26:20 done List of all page titles
https://dumps.wikimedia.org/enwiki/20170501/enwiki-20170501-all-titles.gz
(221.7 MB)
2017-05-03 07:22:02 done List of page titles in main namespace
https://dumps.wikimedia.org/enwiki/20170501/enwiki-20170501-all-titles-in-ns0.gz
(70.8 MB)

Use one of the above. Not only is it faster, you will also get consistent
results: by the time your loop finishes, pages will have been created and
deleted under it. The above exports are generated to capture the most
consistent state practically possible, and they are actively monitored by
WMF staff.
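
[A minimal sketch of consuming the main-namespace titles dump; the URL is
taken from the listing above, everything else is illustrative:

    import gzip
    import urllib.request

    URL = ("https://dumps.wikimedia.org/enwiki/20170501/"
           "enwiki-20170501-all-titles-in-ns0.gz")
    urllib.request.urlretrieve(URL, "titles-ns0.gz")

    count = 0
    with gzip.open("titles-ns0.gz", "rt", encoding="utf-8") as f:
        next(f)  # skip the first line, assumed to be a "page_title" header
        for line in f:
            count += 1  # one title per line, underscores instead of spaces
    print(count, "titles")
]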

Re: Get Wikipedia Page Titles using API looks Endless

Jaime Crespo

If you want to do the namespace and redirect analysis on your own, you can
also use:
https://dumps.wikimedia.org/enwiki/20170501/enwiki-20170501-page.sql.gz
It is larger, but you can filter on the page_is_redirect and page_namespace
columns on your own terms.
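
[A rough sketch of such filtering in Python. It assumes the column order
page_id, page_namespace, page_title, page_restrictions, page_is_redirect,
...; verify this against the CREATE TABLE statement at the top of the dump,
since the schema changes between MediaWiki versions:

    import gzip
    import re

    # Leading columns of one row tuple inside the INSERT statements:
    # (page_id, page_namespace, 'page_title', 'restrictions', page_is_redirect
    ROW = re.compile(
        r"\((\d+),(\d+),'((?:[^'\\]|\\.)*)','(?:[^'\\]|\\.)*',(\d+)"
    )

    keep = []
    with gzip.open("enwiki-20170501-page.sql.gz", "rt",
                   encoding="utf-8", errors="replace") as f:
        for line in f:
            if line.startswith("INSERT INTO"):
                for _id, ns, title, is_redirect in ROW.findall(line):
                    if ns in ("0", "14") and is_redirect == "0":
                        keep.append(title)
    print(len(keep), "non-redirect titles in ns 0 and 14")
]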