snapshot of categories content and template usage

Piotr Jagielski
Hello,

I'm trying to get information from a Wikipedia dump based on
categorization or template usage. I first query the MediaWiki API with
an embeddedin or categorymembers query to get a list of articles I'm
interested in. Then I retrieve them from the dump and extract the
information I need. The problem is that sometimes the current titles
retrieved using the API don't match what's in the dump, for example
because the article has been moved since the dump was made.
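
For reference, the API step described above might look roughly like the sketch below. It only builds the request URL for one page of results. The endpoint (English Wikipedia) and the helper name build_list_query are assumptions for illustration; the parameter names (list, cmtitle/eititle, cmlimit/eilimit) are the standard ones for these two list modules.

```python
from urllib.parse import urlencode

# Assumption: English Wikipedia; any MediaWiki api.php endpoint works the same way.
API_URL = "https://en.wikipedia.org/w/api.php"

# Each list module uses its own parameter prefix.
PREFIXES = {"categorymembers": "cm", "embeddedin": "ei"}


def build_list_query(kind, title, limit=500):
    """Build the URL for one page of a categorymembers or embeddedin query.

    `kind` is "categorymembers" (title like "Category:Foo") or
    "embeddedin" (title like "Template:Foo").
    """
    prefix = PREFIXES[kind]
    params = {
        "action": "query",
        "list": kind,
        prefix + "title": title,
        prefix + "limit": limit,
        "format": "json",
    }
    return API_URL + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_list_query("categorymembers", "Category:Physics"))
    print(build_list_query("embeddedin", "Template:Infobox person"))
```

The titles returned this way reflect the live wiki at query time, not the state of the wiki when the dump was taken, which is where the mismatch comes from.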

I can think of two options to solve the problem:
- Parse the categorization and template usage information from all
articles in the dump and build the list of all articles in a given
category or using given templates myself. This might be prone to errors
because it requires custom parsing.
- Import the dump into a local MediaWiki installation and query the
API locally. But from what I read in the documentation, importing the
dump into a database can take an excessive amount of time.
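
To make the first option concrete, here is a minimal sketch of the kind of custom parsing it would need. The regular expressions and the function name extract_links are my own illustration, not anything MediaWiki provides, and they assume the English namespace name "Category".

```python
import re

# Matches [[Category:Name]] and [[Category:Name|sort key]], case-insensitively.
CATEGORY_RE = re.compile(
    r"\[\[\s*Category\s*:\s*([^\]|]+?)\s*(?:\|[^\]]*)?\]\]", re.IGNORECASE
)

# Matches the name of a template call such as {{Name}} or {{Name|arg=...}}.
# Parser functions like {{#if:...}} are skipped because '#' is excluded.
TEMPLATE_RE = re.compile(r"\{\{\s*([^{}|#]+?)\s*(?:\||\}\})")


def extract_links(wikitext):
    """Return (categories, templates) named directly in the wikitext."""
    categories = [m.group(1) for m in CATEGORY_RE.finditer(wikitext)]
    templates = [m.group(1) for m in TEMPLATE_RE.finditer(wikitext)]
    return categories, templates


if __name__ == "__main__":
    sample = (
        "{{Infobox person|name=X}} Some text. {{stub}}\n"
        "[[Category:Polish physicists|Sort]] [[category: Living people ]]"
    )
    print(extract_links(sample))
```

This shows why the approach is fragile: it does not see categories that are added by a template rather than written in the article text, it ignores nowiki sections and HTML comments, and it does not handle localized namespace names.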

Is there any easier option? Is there a dump of categorization and
template usage kept somewhere? Or perhaps I missed something and this
information can be retrieved from the dump without parsing it?

Thanks,
Piotr

_______________________________________________
Wikitech-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: snapshot of categories content and template usage

Mormegil
Hi,

On 15 January 2012 12:26, Piotr Jagielski <[hidden email]> wrote:
> Is there any easier option? Is there a dump of categorization and
> template usage kept somewhere?

Yes, there is. Raw SQL dumps of the templatelinks and
categorylinks tables are available in the dumps next to the XML dump (look
for ...-categorylinks.sql.gz and ...-templatelinks.sql.gz). Unlike the
XML dumps, these do not have a stable, clearly defined format, but
they are quite a good choice for your needs, I guess. (See also
https://www.mediawiki.org/wiki/Manual:Database_layout.)
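
A rough sketch of reading such a file without importing it into a database follows. It assumes the usual mysqldump layout (one or more "INSERT INTO `categorylinks` VALUES (...),(...);" lines) and that the first two columns are cl_from (page ID) and cl_to (category title with underscores). Since the format is not guaranteed, check the column order against the CREATE TABLE statement at the top of the file. The function names are made up for this example.

```python
import gzip
import re

INSERT_RE = re.compile(r"INSERT INTO `categorylinks` VALUES ")


def parse_rows(values):
    """Yield each parenthesised row of a VALUES list as a list of fields.

    Fields are returned as strings, or None for an unquoted NULL.
    Simplification: a backslash escape keeps the next character literally,
    which is right for \\' and \\\\ but not for \\n and similar.
    """
    row = None        # None while we are between rows
    buf = []          # characters of the field being read
    quoted = False    # True if the current field was a quoted string
    in_string = False
    i = 0
    while i < len(values):
        ch = values[i]
        if in_string:
            if ch == "\\" and i + 1 < len(values):
                buf.append(values[i + 1])
                i += 1
            elif ch == "'":
                in_string = False
            else:
                buf.append(ch)
        elif ch == "'":
            in_string = True
            quoted = True
        elif ch == "(" and row is None:
            row = []
        elif ch in ",)" and row is not None:
            text = "".join(buf)
            row.append(text if quoted or text != "NULL" else None)
            buf, quoted = [], False
            if ch == ")":
                yield row
                row = None
        elif row is not None:
            buf.append(ch)
        i += 1


def category_members(lines, category):
    """Yield page IDs (assumed cl_from) whose assumed cl_to equals `category`."""
    for line in lines:
        match = INSERT_RE.match(line)
        if not match:
            continue
        for row in parse_rows(line[match.end():]):
            if len(row) >= 2 and row[1] == category:
                yield int(row[0])


def open_dump(path):
    """Open a gzipped SQL dump as text; `path` is wherever you saved the file."""
    return gzip.open(path, "rt", encoding="utf-8", errors="replace")
```

With the file downloaded, something like list(category_members(open_dump(path), "Living_people")) would give the page IDs in that category. Because these are page IDs rather than titles, they can be matched against the id element of each page in the XML dump of the same date, so a later page move should not matter.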

HTH,
-- [[cs:User:Mormegil | Petr Kadlec]]

Re: snapshot of categories content and template usage

Piotr Jagielski
It looks like exactly what I was looking for. Thank you!

Piotr

On 2012-01-16 17:17, Petr Kadlec wrote:

> Hi,
>
> On 15 January 2012 12:26, Piotr Jagielski <[hidden email]> wrote:
>> Is there any easier option? Is there a dump of categorization and
>> template usage kept somewhere?
> Yes, there is. Raw SQL dumps of the templatelinks and
> categorylinks tables are available in the dumps next to the XML dump (look
> for ...-categorylinks.sql.gz and ...-templatelinks.sql.gz). Unlike the
> XML dumps, these do not have a stable, clearly defined format, but
> they are quite a good choice for your needs, I guess. (See also
> https://www.mediawiki.org/wiki/Manual:Database_layout.)
>
> HTH,
> -- [[cs:User:Mormegil | Petr Kadlec]]
>
> _______________________________________________
> Wikitech-l mailing list
> [hidden email]
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>


_______________________________________________
Wikitech-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l