Datamining infoboxes


Andrew Dunbar
Infoboxes in Wikipedia often contain information which is quite useful
outside Wikipedia but can be surprisingly difficult to data-mine.

I would like to find all Wikipedia pages that use
Template:Infobox_Language and parse the parameters iso3 and
fam1...fam15.

But my attempts to find such pages using either the Toolserver's
Wikipedia database or the MediaWiki API have not been fruitful. In
particular, SQL queries on the templatelinks table are intractably
slow. Why are there no keys on tl_from or tl_title?

Andrew Dunbar (hippietrail)

--
http://wiktionarydev.leuksman.com http://linguaphile.sf.net


Re: Datamining infoboxes

Daniel Schwen-2
> particular, SQL queries on the templatelinks table are intractably
> slow. Why are there no keys on tl_from or tl_title?

How are you planning to get the template parameters? Have I missed a
recent schema change?
I'd be interested in following your progress. I'm not extracting
infobox data, but parameters of the coordinate template. Maybe a
similar approach could be interesting for you:

 The coordinate template stuffs all its parameters into an external
link (which can easily be obtained from the externallinks table).
Creating dummy links containing parameters for some infoboxes could be
one way of making the data available for automatic extraction (yes,
it's a hack, but I'd prefer better suggestions over flames).

The link could actually be made useful, it could point to a query page
for the data in these infoboxes.
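
Roughly, something like this could dig those parameters back out of
the externallinks table (just a sketch: it assumes the coordinate
template emits a GeoHack URL with its parameters in the query string,
and the leading wildcard forces a full scan, so it's illustrative
rather than fast):

SELECT el_from, el_to
FROM externallinks
WHERE el_to LIKE '%geohack%params=%';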

[[User:Dschwen]]

_______________________________________________
Wikitech-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Reply | Threaded
Open this post in threaded view
|

Re: Datamining infoboxes

Andrew Dunbar
2009/10/22 Daniel Schwen <[hidden email]>:
>> particular, SQL queries on the templatelinks table are intractably
>> slow. Why are there no keys on tl_from or tl_title?
>
> How are you planning to get the template parameters? Have I missed a
> recent schema change?

I've been trying to parse the wikitext of section 0 with a minimal
parser that uses just the tokens {{ }} {{{ and }}} but it already has
problems when it sees }}}}.

> I'd be interested in following your progress. I'm not extracting
> infobox data, but parameters of the coordinate template. Maybe a
> similar approach could be interesting for you:
>
>  The coordinate template stuffs all its parameters into an external
> link (which can easily be obtained from the externallinks table).
> Creating dummy links containing parameters for some infoboxes could be
> one way of making the data available for automatic extraction (yes,
> it's a hack, but I'd prefer better suggestions over flames).
>
> The link could actually be made useful, it could point to a query page
> for the data in these infoboxes.

The template and parameters I'm interested in don't generate any such
external links and probably couldn't very easily...

But I have just discovered the rvgeneratexml parameter to
action=query&prop=revisions. This includes a <part> element for each
template parameter, with a <name> and a <value> for each...
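
For anyone following along, a rough sketch of that approach (the
<template>, <part>, <name> and <value> element names are as above;
the "parsetree" key in the JSON response and the exact response shape
are assumptions, so treat this as illustrative):

import requests
import xml.etree.ElementTree as ET

API = "https://en.wikipedia.org/w/api.php"

def infobox_params(page, template="infobox language"):
    r = requests.get(API, params={
        "action": "query", "prop": "revisions", "titles": page,
        "rvprop": "content", "rvgeneratexml": 1, "format": "json",
    })
    rev = next(iter(r.json()["query"]["pages"].values()))["revisions"][0]
    root = ET.fromstring(rev["parsetree"])
    # iter() walks nested templates too, so an infobox wrapped in
    # another template is still found
    for tmpl in root.iter("template"):
        title = (tmpl.findtext("title") or "").strip()
        if title.lower().replace("_", " ") != template:
            continue
        return {(part.findtext("name") or "").strip():
                (part.findtext("value") or "").strip()
                for part in tmpl.findall("part")}

print(infobox_params("Basque language"))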

Andrew Dunbar (hippietrail)

> [[User:Dschwen]]



--
http://wiktionarydev.leuksman.com http://linguaphile.sf.net


Re: Datamining infoboxes

George William Herbert
This discussion brings to mind several historical threads.

I wonder if a project to simply mine the whole article contents and
provide a DB of some sort with the articles and infobox contents would
be worthwhile.  Develop a specific parser and generate and publish the
complete set of article-infobox-(key-value) sets...


On Thu, Oct 22, 2009 at 11:13 PM, Andrew Dunbar <[hidden email]> wrote:

> 2009/10/22 Daniel Schwen <[hidden email]>:
>>> particular, SQL queries on the templatelinks table are intractably
>>> slow. Why are there no keys on tl_from or tl_title?
>>
>> How are you planning to get the template parameters? Have I missed a
>> recent schema change?
>
> I've been trying to parse the wikitext of section 0 with a minimal
> parser that uses just the tokens {{ }} {{{ and }}} but it already has
> problems when it sees }}}}.
>
>> I'd be interested in following your progress. I'm not extracting
>> infobox data, but parameters of the coordinate template. Maybe a
>> similar approach could be interesting for you:
>>
>>  The coordinate template stuffs all its parameters into an external
>> link (which can easily be obtained from the externallinks table).
>> Creating dummy links containing parameters for some infoboxes could be
>> one way of making the data available for automatic extraction (yes,
>> it's a hack, but I'd prefer better suggestions over flames).
>>
>> The link could actually be made useful, it could point to a query page
>> for the data in these infoboxes.
>
> The template and parameters I'm interested in don't generate any such
> external links and probably couldn't very easily...
>
> But I have just discovered the rvgeneratexml parameter to
> action=query&prop=revisions
> This includes a <part> field for each template parameter with a <name>
> and a <value> for each...
>
> Andrew Dunbar (hippietrail)
>
>> [[User:Dschwen]]
>>
> --
> http://wiktionarydev.leuksman.com http://linguaphile.sf.net
>



--
-george william herbert
[hidden email]


Re: Datamining infoboxes

William Pietri
George Herbert wrote:
> This discussion brings to mind several historical threads.
>
> I wonder if a project to simply mine the whole article contents and
> provide a DB of some sort with the articles and infobox contents would
> be worthwhile.  Develop a specific parser and generate and publish the
> complete set of article-infobox-(key-value) sets...
>  

I don't know anybody on the data side at Metaweb anymore, but I know
that they did something like that to import a lot of structured
Wikipedia data into their Freebase project. They publish some sort of
data dump here:

http://download.freebase.com/wex/

Perhaps they'd be willing to open-source their parser.

William


Re: Datamining infoboxes

Roan Kattouw-2
In reply to this post by Andrew Dunbar
2009/10/23 Andrew Dunbar <[hidden email]>:
> But my attempts to find such pages using either the Toolserver's
> Wikipedia database or the Mediawiki API have not been fruitful. In
> particular, SQL queries on the templatelinks table are intractably
> slow. Why are there no keys on tl_from or tl_title?
>
There are:
CREATE UNIQUE INDEX /*i*/tl_from ON /*_*/templatelinks
(tl_from,tl_namespace,tl_title);
CREATE UNIQUE INDEX /*i*/tl_namespace ON /*_*/templatelinks
(tl_namespace,tl_title,tl_from);

It's just that tl_title is always coupled with tl_namespace because
that's how you should be using it (tl_namespace=10 for the template
namespace). Note that the former index can be used as an index on
(tl_from) as well.
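
So a lookup shaped like this should be able to use the second index
(the underscore/space spellings cover the common title variants):

SELECT tl_from
FROM templatelinks
WHERE tl_namespace = 10
  AND tl_title IN ('Infobox_Language', 'Infobox_language');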

Roan Kattouw (Catrope)


Re: Datamining infoboxes

Christopher (jcsahnwaldt@gmail.com)
In reply to this post by George William Herbert
On Fri, Oct 23, 2009 at 08:37, George Herbert <[hidden email]> wrote:
> I wonder if a project to simply mine the whole article contents and
> provide a DB of some sort with the articles and infobox contents would
> be worthwhile.  Develop a specific parser and generate and publish the
> complete set of article-infobox-(key-value) sets...

That's what DBpedia is doing.

The extracted data can be found here, in N-Triples and CSV format:

http://wiki.dbpedia.org/Downloads

The entries in the row labelled 'Infoboxes' are files
that contain the extracted values of all template
properties in each page of a Wikipedia instance.
For large Wikipedias like en, the unzipped files are
pretty big (several GB).

Most of the extraction code can be found in these
PHP classes:

https://dbpedia.svn.sourceforge.net/svnroot/dbpedia/extraction/extractors/InfoboxExtractor.php
https://dbpedia.svn.sourceforge.net/svnroot/dbpedia/extraction/extractors/infobox/


Christopher


Re: Datamining infoboxes

Robert Ullmann
In reply to this post by Roan Kattouw-2
Hi Hippietrail!

What do you mean by "intractably slow"? Just how fast must it be?

If I do
http://en.wikipedia.org/w/api.php?action=query&list=embeddedin&eititle=Template:Infobox_Language&eilimit=100&einamespace=0
it says (on one given try) that it was served in 0.047 seconds. How
long can it take to read them all? A few minutes?
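
Reading them all is just a matter of following the continuation, e.g.
(a sketch; it assumes the JSON format and the query-continue /
eicontinue continuation keys of the current API):

import requests

API = "https://en.wikipedia.org/w/api.php"

def embedded_in(template):
    params = {"action": "query", "list": "embeddedin",
              "eititle": template, "eilimit": 500, "einamespace": 0,
              "format": "json"}
    while True:
        data = requests.get(API, params=params).json()
        for page in data["query"]["embeddedin"]:
            yield page["title"]
        cont = data.get("query-continue", {}).get("embeddedin")
        if not cont:
            break
        params.update(cont)  # carries eicontinue into the next request

print(len(list(embedded_in("Template:Infobox language"))))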

Seems to me that time would be swamped by the time it takes to pull
the wikitext for the pages?

And methinks you might be trying too hard to parse the text; some
fairly simple regex or such can extract the template invocation and
the parameters; people use it in a pretty regular way.

Oh, and do remember to look for "Template:Infobox language" as well,
depending on which way you find them.

Robert


Re: Datamining infoboxes

Daniel Schwen-2
In reply to this post by George William Herbert
> I wonder if a project to simply mine the whole article contents and
> provide a DB of some sort with the articles and infobox contents would
> be worthwhile.  Develop a specific parser and generate and publish the
> complete set of article-infobox-(key-value) sets...

That is a brilliant idea...
...that somebody else already had and implemented

Templatetiger
http://toolserver.org/~kolossos/templatetiger/template-choice.php?lang=enwiki

Should have mentioned that earlier.


Re: Datamining infoboxes

Andrew Dunbar
In reply to this post by Robert Ullmann
2009/10/23 Robert Ullmann <[hidden email]>:
> Hi Hippietrail!
>
> What do you mean by "intractably slow"? Just how fast must it be?
>
> If I do
> http://en.wikipedia.org/w/api.php?action=query&list=embeddedin&eititle=Template:Infobox_Language&eilimit=100&einamespace=0
> it says (on one given try) that it was served in 0.047 seconds. How
> long can it take to read them all? A few minutes?

Yes, I found out how to get it through the API now. It was actually just
the Toolserver database that was intractably slow.

> Seems to me that time would be swamped by the time it takes to pull
> the wikitext for the pages?
>
> And methinks you might be trying too hard to parse the text, some
> fairly simple regex or such can extract the template invocation and
> the parameters; people use it in a pretty regular way.

I've been spending hours on the parsing now and don't find it simple
at all due to the fact that templates can be nested. Just extracting
the Infobox as one big lump is hard due to the need to match nested {{
and }}.

Andrew Dunbar (hippietrail)

> Oh, and do remember to look for "Template:Infobox language" as well,
> depending on which way you find them.
>
> Robert



--
http://wiktionarydev.leuksman.com http://linguaphile.sf.net


Re: Datamining infoboxes

Robert Rohde
Given the fairly obvious utility for data mining, it might make sense
for someone to extend the Mediawiki API to generate a list of template
calls and the parameters sent in each case.

-Robert Rohde


Re: Datamining infoboxes

Magnus Manske-2
In reply to this post by Andrew Dunbar
I am so glad that someone re-re-resurrects this topic :-)


On Fri, Oct 23, 2009 at 1:27 PM, Andrew Dunbar <[hidden email]> wrote:
> I've been spending hours on the parsing now and don't find it simple
> at all due to the fact that templates can be nested. Just extracting
> the Infobox as one big lump is hard due to the need to match nested {{
> and }}

Not perfect, but try
http://toolserver.org/~magnus/wiki2xml/w2x.php

1. Uncheck "Use API", choose "Do not use templates"
2. Enter article name(s)
3. Get XML
4. Parse XML, re-submit the wiki text in templates to process the next
level of templates

I should really offer #4 in this...

Caveat: Will break on things like HTML attributes that are filled by
templates etc.

Cheers,
Magnus


Re: Datamining infoboxes

Robert Ullmann
In reply to this post by Andrew Dunbar
> I've been spending hours on the parsing now and don't find it simple
> at all due to the fact that templates can be nested. Just extracting
> the Infobox as one big lump is hard due to the need to match nested {{
> and }}
>
> Andrew Dunbar (hippietrail)

Hi,

Come now, you are over-thinking it. Find "{{Infobox [Ll]anguage" in
the text, then count braces. Start at depth=2, count up and down till
you reach 0, and you are at the end of the template. (You can be picky
about only counting them when paired, if you like ;-)
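
In Python the counting is only a few lines (a sketch; as noted
downthread, a single stray brace will throw the count off):

import re

def extract_infobox(wikitext, pattern=r"\{\{Infobox [Ll]anguage"):
    m = re.search(pattern, wikitext)
    if not m:
        return None
    depth = 0
    for i in range(m.start(), len(wikitext)):
        if wikitext[i] == "{":
            depth += 1      # reaches 2 just after the opening {{
        elif wikitext[i] == "}":
            depth -= 1
            if depth == 0:  # back to 0: end of the template
                return wikitext[m.start():i + 1]
    return None  # unbalanced braces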

Then just regex match the lines/parameters you want.

However, if you are pulling the wikitext with the API, the XML parse
tree option sounds good; then you can just use ElementTree (or the
like) and pull out the parameters directly.

Robert


Re: Datamining infoboxes

Andrew Dunbar
2009/10/23 Robert Ullmann <[hidden email]>:

>> I've been spending hours on the parsing now and don't find it simple
>> at all due to the fact that templates can be nested. Just extracting
>> the Infobox as one big lump is hard due to the need to match nested {{
>> and }}
>>
>> Andrew Dunbar (hippietrail)
>
> Hi,
>
> Come now, you are over-thinking it. Find "{{Infobox [Ll]anguage" in
> the text, then count braces. Start at depth=2, count up and down 'till
> you reach 0, and you are at the end of the template. (you can be picky
> about only counting them if paired if you like ;-)

Actually you have to find "{{[Ii]nfobox[ _][Ll]anguage".
And I wanted to be robust. It's perfectly legal for single unmatched
braces to appear anywhere and I didn't want them to break my code. As
it happens there don't seem to currently be any in the language
infoboxes.
I couldn't be sure whether there would be any cases where a {{{ or }}}
might show up either. And there are a few other edge cases, such as HTML
comments, <nowiki> and friends, template invocations in values, and
even possibly template invocations in names.

> Then just regex match the lines/parameters you want.
>
> However, if you are pulling the wikitext with the API, the XML parse
> tree option sounds good; then you can just use elementTree (or the
> like) and pull out the parameters directly

I've got it extracting the name/value pairs from the XML finally but
parsing XML is always a pain. And it still misses Norwegian, Bokmal,
and Nynorsk which wrap the infobox in another template...

Andrew Dunbar (hippietrail)

> Robert



--
http://wiktionarydev.leuksman.com http://linguaphile.sf.net


Re: Datamining infoboxes

Roan Kattouw-2
In reply to this post by Robert Rohde
2009/10/23 Robert Rohde <[hidden email]>:
> Given the fairly obvious utility for data mining, it might make sense
> for someone to extend the Mediawiki API to generate a list of template
> calls and the parameters sent in each case.
>
We had a discussion about this on Tuesday in the tech staff meeting, and
decided that we want to put this data mining possibility in core at
some point (using a table like pagelinks to store these key/value
pairs and modifying the parser). As you may understand, this is not a
very high priority project, and I don't know if any of the paid
developers are gonna do it any time soon.
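
Purely as a sketch, such a table might look something like this (a
hypothetical schema with made-up names, not anything that exists in
core; it borrows the tables.sql conventions quoted earlier):

CREATE TABLE /*_*/templateparams (
  -- page_id of the page making the template call
  tp_from      INT UNSIGNED NOT NULL,
  -- namespace and title of the called template, as in templatelinks
  tp_namespace INT NOT NULL,
  tp_title     VARCHAR(255) NOT NULL,
  -- one row per parameter
  tp_param     VARCHAR(255) NOT NULL,
  tp_value     BLOB NOT NULL
);
CREATE UNIQUE INDEX /*i*/tp_from ON /*_*/templateparams
(tp_from,tp_namespace,tp_title,tp_param);
CREATE INDEX /*i*/tp_namespace ON /*_*/templateparams
(tp_namespace,tp_title,tp_param);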

Roan Kattouw (Catrope)


Re: Datamining infoboxes

Neil Harris
In reply to this post by Robert Ullmann
Robert Ullmann wrote:

>> I've been spending hours on the parsing now and don't find it simple
>> at all due to the fact that templates can be nested. Just extracting
>> the Infobox as one big lump is hard due to the need to match nested {{
>> and }}
>>
>> Andrew Dunbar (hippietrail)
>>    
>
> Hi,
>
> Come now, you are over-thinking it. Find "{{Infobox [Ll]anguage" in
> the text, then count braces. Start at depth=2, count up and down 'till
> you reach 0, and you are at the end of the template. (you can be picky
> about only counting them if paired if you like ;-)
>
> Then just regex match the lines/parameters you want.
>
> However, if you are pulling the wikitext with the API, the XML parse
> tree option sounds good; then you can just use elementTree (or the
> like) and pull out the parameters directly
>
> Robert
>  

Or you could use the pyparsing Python library, with which you can
implement the grammar of your choice, making nested template
extraction trivial. Using the psyco package to accelerate it, you can
parse a whole en: dump in a few hours.

See the code below for a sample grammar...

-- Neil

------------------------------------------------

# Use pyparsing, enablePackrat() _and_ psyco for a considerable speed-up
from pyparsing import *
import psyco

# These two must be in the correct order, or bad things will happen
ParserElement.enablePackrat()
psyco.full()

wikitemplate = Forward()

wikilink = Combine("[[" + SkipTo("]]") + "]]")

wikiargname = CharsNotIn("|{}=")
wikiargval = ZeroOrMore(wikilink | Group(wikitemplate) |
                        CharsNotIn("[|{}") | "[" | "{" | Regex("}[^}]"))

wikiarg = Group(Optional(wikiargname + Suppress("="), default="??") +
                wikiargval)

wikitemplate << (Suppress("{{") + wikiargname +
                 Optional(Suppress("|") + delimitedList(wikiarg, "|")) +
                 Suppress("}}"))

wikitext = ZeroOrMore(CharsNotIn("{") | Group(wikitemplate) | "{")

def parse_page(text):
    return wikitext.parseString(text)



Re: Datamining infoboxes

Aryeh Gregor
In reply to this post by Andrew Dunbar
On Fri, Oct 23, 2009 at 8:27 AM, Andrew Dunbar <[hidden email]> wrote:
> Yes I found how to get it through the API now. It was actually just
> the Toolserver database that was intractably slow.

There's nothing slow about the TS database here:

mysql> pager true
PAGER set to 'true'
mysql> SELECT tl_from FROM templatelinks WHERE tl_namespace=10 AND
tl_title IN ('Infobox_Language', 'Infobox_language');
3144 rows in set (0.12 sec)

Your query might have been what was slow.


Re: Datamining infoboxes

Daniel Schwen-2
In reply to this post by Neil Harris
Fascinating!
It seems to be a repeating pattern on these mailing lists that people
ignore existing solutions and discuss re-inventing wheels (please
correct me if I'm wrong here).
While I agree this is fun for some, it rarely helps the OP...

[[User:Dschwen]]


Re: Datamining infoboxes

David Gerard-2
In reply to this post by William Pietri
2009/10/23 William Pietri <[hidden email]>:
> George Herbert wrote:

>> This discussion brings to mind several historical threads.
>> I wonder if a project to simply mine the whole article contents and
>> provide a DB of some sort with the articles and infobox contents would
>> be worthwhile.  Develop a specific parser and generate and publish the
>> complete set of article-infobox-(key-value) sets...

> I don't know anybody on the data side at Metaweb anymore, but I know
> that they did something like that to import a lot of structured
> Wikipedia data into their Freebase project. They publish some sort of
> data dump here:
> http://download.freebase.com/wex/
> Perhaps they'd be willing to open-source their parser.


They're right into open source; I suspect they would.


- d.


Re: Datamining infoboxes

Andrew Dunbar
In reply to this post by Aryeh Gregor
2009/10/23 Aryeh Gregor <[hidden email]>:

> On Fri, Oct 23, 2009 at 8:27 AM, Andrew Dunbar <[hidden email]> wrote:
>> Yes I found how to get it through the API now. It was actually just
>> the Toolserver database that was intractably slow.
>
> There's nothing slow about the TS database here:
>
> mysql> pager true
> PAGER set to 'true'
> mysql> SELECT tl_from FROM templatelinks WHERE tl_namespace=10 AND
> tl_title IN ('Infobox_Language', 'Infobox_language');
> 3144 rows in set (0.12 sec)
>
> Your query might have been what was slow.

Yes, I didn't specify tl_namespace, and when I checked which columns
have keys I could see none:
mysql> describe templatelinks;
+--------------+-----------------+------+-----+---------+-------+
| Field        | Type            | Null | Key | Default | Extra |
+--------------+-----------------+------+-----+---------+-------+
| tl_from      | int(8) unsigned | NO   |     | 0       |       |
| tl_namespace | int(11)         | NO   |     | 0       |       |
| tl_title     | varchar(255)    | NO   |     |         |       |
+--------------+-----------------+------+-----+---------+-------+
3 rows in set (0.01 sec)

But I don't know much about databases and SQL...

By the way, I have reached an important milestone: extracting all the
name/value pairs for the language infobox's ISO 639 language codes and
language family strings.

But the values still need some work before I can try to match them
against ISO 639-5 language family codes, which is my ultimate goal.

Thanks for all the tips.

Andrew Dunbar (hippietrail)




--
http://wiktionarydev.leuksman.com http://linguaphile.sf.net
