How can I programmatically get the pages in a list?

Mike MacHenry
Hello everyone,

I am trying to use the MediaWiki API to create a dictionary based on categories or lists on Wikipedia. I would like to be able to select a category, or perhaps a list page, and get all members of that list.

I've done some reading of the API and implemented a prototype. It works, but only when the data is structured exactly right for my purposes. For example, I can easily get a list of all of the English-language films. I'm using action=query with list=categorymembers for this. I get 500 films at a time, and I can continue as needed to get all 60k or so. This works because there is a category applied to each English-language film's individual page.
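A minimal sketch of the request loop described above, using only the Python standard library. The category title, the User-Agent string, and the helper names `fetch_json` and `category_members` are illustrative assumptions, not part of the original prototype.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://en.wikipedia.org/w/api.php"


def fetch_json(params):
    """Send one GET request to the MediaWiki API and decode the JSON reply."""
    url = API_URL + "?" + urllib.parse.urlencode(params)
    # Placeholder User-Agent; replace it with one that identifies your tool.
    request = urllib.request.Request(
        url, headers={"User-Agent": "category-members-sketch/0.1 (example)"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def category_members(category, fetch=fetch_json):
    """Yield every direct member of a category, following API continuation."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": category,
        "cmlimit": "max",  # 500 per request for ordinary clients
        "format": "json",
    }
    while True:
        data = fetch(params)
        yield from data["query"]["categorymembers"]
        if "continue" not in data:
            break
        # Copy the continuation values (e.g. cmcontinue) into the next request.
        params = {**params, **data["continue"]}


if __name__ == "__main__":
    # Assumed category title, used only for illustration.
    for member in category_members("Category:English-language films"):
        print(member["title"])
```

The `fetch` parameter exists only so the loop can be exercised without network access; in normal use the default is enough.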

On the other hand, if I want to get a list of all National Hockey League (NHL) players, this is a lot more difficult. The category "Category:Lists of National Hockey League players" exists, but it's a category of lists of players. Much of the categorization of Wikipedia turns out to be in lists, not categories. I could write a web scraper for this, but that would probably be very unreliable.

Is there a standardized way to deal with lists and sublists that I might have missed? I don't mind writing a bunch of code to recursively crawl sublists and expand them. But I would like to avoid something as non-standard as web scraping the content, because it will be very fragile.

Thank you for the help,
-mike

_______________________________________________
Mediawiki-api mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/mediawiki-api

Re: How can I programmatically get the pages in a list?

Gergo Tisza
On Wed, Jan 10, 2018 at 2:19 PM, Mike MacHenry <[hidden email]> wrote:
On the other hand, if I want to get a list of all National Hockey League (NHL) players, this is a lot more difficult. The category "Category:Lists of National Hockey League players" exists, but it's a category of lists of players. Much of the categorization of Wikipedia turns out to be in lists, not categories. I could write a web scraper for this, but that would probably be very unreliable.

There is a Category:National Hockey League players. You'll have to handle subcategories on your own but that's still a lot less messy than parsing HTML.
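One way to handle subcategories is a breadth-first crawl over list=categorymembers results, treating namespace 14 entries as subcategories and namespace 0 entries as articles. The sketch below is an illustration under assumptions: the helper names, the User-Agent string, and the `max_depth` cutoff are hypothetical choices, and the depth limit is there because category trees can be deep and can contain cycles.

```python
import json
import urllib.parse
import urllib.request
from collections import deque

API_URL = "https://en.wikipedia.org/w/api.php"
ARTICLE_NS = 0    # main (article) namespace
CATEGORY_NS = 14  # "Category:" namespace


def fetch_json(params):
    """Send one GET request to the MediaWiki API and decode the JSON reply."""
    url = API_URL + "?" + urllib.parse.urlencode(params)
    # Placeholder User-Agent; replace it with one that identifies your tool.
    request = urllib.request.Request(
        url, headers={"User-Agent": "category-crawl-sketch/0.1 (example)"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def direct_members(category, fetch):
    """Return all direct members of one category, following continuation."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": category,
        "cmlimit": "max",
        "format": "json",
    }
    members = []
    while True:
        data = fetch(params)
        members.extend(data["query"]["categorymembers"])
        if "continue" not in data:
            return members
        params = {**params, **data["continue"]}


def crawl_category(root, max_depth=3, fetch=fetch_json):
    """Return sorted article titles found in root and its subcategories."""
    seen_categories = {root}  # guards against cycles in the category graph
    articles = set()          # a page may sit in several subcategories
    queue = deque([(root, 0)])
    while queue:
        category, depth = queue.popleft()
        for member in direct_members(category, fetch):
            title = member["title"]
            if member["ns"] == ARTICLE_NS:
                articles.add(title)
            elif (
                member["ns"] == CATEGORY_NS
                and depth < max_depth
                and title not in seen_categories
            ):
                seen_categories.add(title)
                queue.append((title, depth + 1))
    return sorted(articles)


if __name__ == "__main__":
    for title in crawl_category("Category:National Hockey League players"):
        print(title)
```

Subcategories do not always hold the same kind of page as their parent, so the result may need filtering before use.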

Is there a standardized way to deal with lists and sublists that I might have missed? I don't mind writing a bunch of code to recursively crawl sublists and expand them. But I would like to avoid something as non-standard as web scraping the content, because it will be very fragile.

There is not. You can check whether Wikidata has something appropriate (e.g. all humans with the P3522 (NHL.com player ID) property), but otherwise you are on your own. Also, there is no guarantee that Wikipedia and Wikidata have the same data (every Wikipedia article has an item in Wikidata, but often the properties are not fleshed out yet).
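The Wikidata suggestion could look roughly like the following query against the public SPARQL endpoint at query.wikidata.org. This is a sketch, not a tested recipe: the query shape, the helper names `run_query` and `parse_players`, and the User-Agent string are assumptions, and the result is only as complete as the P3522 data in Wikidata.

```python
import json
import urllib.parse
import urllib.request

SPARQL_URL = "https://query.wikidata.org/sparql"

# Humans (Q5) that have an NHL.com player ID (P3522), with English labels.
QUERY = """
SELECT ?player ?playerLabel ?nhlId WHERE {
  ?player wdt:P31 wd:Q5 ;
          wdt:P3522 ?nhlId .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
"""


def run_query(query):
    """Run a SPARQL query and return the decoded JSON result document."""
    url = SPARQL_URL + "?" + urllib.parse.urlencode(
        {"query": query, "format": "json"}
    )
    # Placeholder User-Agent; replace it with one that identifies your tool.
    request = urllib.request.Request(
        url, headers={"User-Agent": "wikidata-nhl-sketch/0.1 (example)"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def parse_players(result):
    """Flatten SPARQL JSON bindings into a list of plain dictionaries."""
    rows = []
    for binding in result["results"]["bindings"]:
        rows.append({
            # The entity URI ends in the item ID, e.g. ".../entity/Q42".
            "item": binding["player"]["value"].rsplit("/", 1)[-1],
            "name": binding.get("playerLabel", {}).get("value"),
            "nhl_id": binding["nhlId"]["value"],
        })
    return rows


if __name__ == "__main__":
    for row in parse_players(run_query(QUERY)):
        print(row["item"], row["name"], row["nhl_id"])
```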

Re: How can I programmatically get the pages in a list?

Steve Siznax
In reply to this post by Mike MacHenry
It’s not quite ready for “gold standard” evaluation, but I wonder if wptools could be helpful?

We just added support for category continuations.



