category extraction question

Leila Zia
Hi all,

[If you are not interested in discussions related to the category system
(on English Wikipedia), you can stop here. :)]

We have run into a problem that some of you may have thought about or
addressed before. We are trying to clean up the category system on English
Wikipedia by turning the category structure to an IS-A hierarchy. (The
output of this work can be useful for the research on template
recommendation [1], for example, but the use-cases won't stop there). One
issue that we are facing is the following:

We are currently using SQL dumps to extract the categories associated with
every article on English Wikipedia (main namespace). [2] Using this approach,
we get 5 categories associated with the Flow cytometry bioinformatics
article [3]:

Flow_cytometry
Bioinformatics

Wikipedia_articles_published_in_peer-reviewed_literature
Wikipedia_articles_published_in_PLOS_Computational_Biology
CS1_maint:_Multiple_names:_authors_list

​The problem is that only the first two categories are the ones we are
interested in. We have one cleaning step in which we keep only the
categories that belong to category Article; that step removes the last
category above, but the other two Wikipedia_... categories remain. We need to
somehow prune the data to remove those two categories as well.

One way to do this would be to parse the wikitext instead of the SQL dumps
and extract only the categories marked by the pattern [[Category:XX]], but in
that case we would lose a good category such as Guided_missiles_of_Norway,
because that's generated by a template.

Any ideas on how we can start with a "cleaner" dataset of categories
related to the topic of the articles as opposed to maintenance related or
other types of categories?

Thanks,
Leila

[1] https://meta.wikimedia.org/wiki/Research:Expanding_Wikipedia_stubs_across_languages

[2] The exact code we use is

SELECT p.page_id AS id, p.page_title AS title, cl.cl_to AS category
FROM categorylinks cl
JOIN page p
  ON cl.cl_from = p.page_id
WHERE cl.cl_type = 'page'
  AND p.page_namespace = 0
  AND p.page_is_redirect = 0

​and the edges of the category graph are extracted with

SELECT p.page_title AS category, cl.cl_to AS parent
FROM categorylinks cl
JOIN page p
  ON p.page_id = cl.cl_from
WHERE p.page_namespace = 14


​[3] https://en.wikipedia.org/wiki/Flow_cytometry_bioinformatics​
_______________________________________________
Wiki-research-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l

Re: category extraction question

Stuart A. Yeates
The category system on en.wiki is not an IS-A system and there have been
several discussions about making it based on mathematical principles,
which have come to nothing because the consensus of editors is against it.
The best way to think about categories is as a locally-faceted related
links system.

Having said that, Category:Wikipedia maintenance is an important root and is
probably useful for separating the wheat from the chaff. Most of these are
also hidden categories. I'm not sure whether this flag appears in the SQL,
but see
https://en.wikipedia.org/wiki/Wikipedia:Categorization#Hiding_categories
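
On the question of whether the hidden flag is visible in SQL: MediaWiki
records the __HIDDENCAT__ magic word as a row in the page_props table with
pp_propname = 'hiddencat'. The following is only a minimal sketch of how that
could be used, not the setup from this thread. It loads a few made-up rows
into an in-memory SQLite database (standing in for the imported dumps) and
filters hidden categories out of the article query with an anti-join. The toy
rows, and which category is marked hidden, are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE page (page_id INTEGER, page_namespace INTEGER,
                   page_title TEXT, page_is_redirect INTEGER);
CREATE TABLE categorylinks (cl_from INTEGER, cl_to TEXT, cl_type TEXT);
CREATE TABLE page_props (pp_page INTEGER, pp_propname TEXT, pp_value TEXT);

-- Toy rows (assumed for illustration): one article, three category pages.
INSERT INTO page VALUES
  (1, 0, 'Flow_cytometry_bioinformatics', 0),
  (10, 14, 'Flow_cytometry', 0),
  (11, 14, 'Bioinformatics', 0),
  (12, 14, 'CS1_maint:_Multiple_names:_authors_list', 0);
INSERT INTO categorylinks VALUES
  (1, 'Flow_cytometry', 'page'),
  (1, 'Bioinformatics', 'page'),
  (1, 'CS1_maint:_Multiple_names:_authors_list', 'page');
-- Category page 12 is marked hidden in this toy data.
INSERT INTO page_props VALUES (12, 'hiddencat', '');
""")

# Same article query as in the thread, plus two LEFT JOINs that find the
# category's own page and its 'hiddencat' property (if any).
VISIBLE_CATEGORIES = """
SELECT p.page_title AS title, cl.cl_to AS category
FROM categorylinks cl
JOIN page p ON cl.cl_from = p.page_id
LEFT JOIN page cat
       ON cat.page_namespace = 14 AND cat.page_title = cl.cl_to
LEFT JOIN page_props pp
       ON pp.pp_page = cat.page_id AND pp.pp_propname = 'hiddencat'
WHERE cl.cl_type = 'page'
  AND p.page_namespace = 0
  AND p.page_is_redirect = 0
  AND pp.pp_page IS NULL   -- keep only categories without the hidden flag
ORDER BY cl.cl_to
"""

visible = [row[1] for row in conn.execute(VISIBLE_CATEGORIES)]
print(visible)
```

Categories without a category page of their own are kept by this query,
since they cannot carry the flag. Not every maintenance category is hidden,
so a filter like this would likely need to be combined with pruning
everything under Category:Wikipedia maintenance.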

cheers
stuart

--
...let us be heard from red core to black sky

Re: category extraction question

Bowen Yu
Hi Leila,

I did something similar before. I was trying to create "top-level" category
labels for the articles, like history, society, technology, etc. I parsed
the wikitext in the dump data to extract all the subcategory labels of each
article. Also, by parsing pages in namespace 14, I created a
category-relation graph for all the category labels, where, ideally, each
subcategory can reach some "top-level" category. Then, for each article,
you can look up its subcategory labels in the graph to find the top-level
categories they reach. More detail can be found in the subsection "3.3.2
Independent Variables - Identity-based Attachment" of the paper. Hope it helps!
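
The lookup described above can be sketched as a breadth-first walk up the
category graph. This is an illustrative sketch under assumptions, not the
code from the paper: the function name, the toy graph and the set of
top-level labels are all hypothetical.

```python
from collections import deque


def top_level_categories(start, parents, top_level):
    """Return the top-level categories reachable from `start` via parent edges."""
    seen = {start}
    queue = deque([start])
    reached = set()
    while queue:
        cat = queue.popleft()
        if cat in top_level:
            reached.add(cat)
            continue  # do not climb above a top-level category
        for parent in parents.get(cat, ()):
            if parent not in seen:  # the category graph can contain cycles
                seen.add(parent)
                queue.append(parent)
    return reached


# Hypothetical toy graph: child category -> list of parent categories.
parents = {
    "Flow_cytometry": ["Laboratory_techniques"],
    "Laboratory_techniques": ["Science", "Technology"],
    "Bioinformatics": ["Biology", "Flow_cytometry"],
    "Biology": ["Science"],
}
top_level = {"Science", "Technology", "History"}

print(sorted(top_level_categories("Bioinformatics", parents, top_level)))
```

The `seen` set matters in practice because the category graph is not
guaranteed to be acyclic. A category that reaches several top-level labels
gets all of them, so a tie-breaking rule may be needed if exactly one label
per article is wanted.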

Re: category extraction question

Marco Fossati-2
In reply to this post by Leila Zia
Hi Leila,

I've been working on taxonomy learning from Wikipedia categories in my
past research.
Here's a recap of the approach I proposed to address the pruning problem
you faced. It's a pipeline with a bottom-up direction, i.e., from the
leaves up to the root.

Stage 1: leaf nodes
INPUT = category + category links SQL dumps, like you do
1.1. extract the full set of article pages;
1.2. extract categories that are linked to article pages only, by
looking at the outgoing links for each article;
1.3. identify the set of categories with no sub-categories.
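
Steps 1.2 and 1.3 can be sketched over the two result sets produced by the
queries earlier in the thread. This is a rough sketch that reads step 1.2 as
"categories with at least one article attached"; the function name and the
toy rows are hypothetical.

```python
def leaf_categories(article_links, category_edges):
    """
    article_links:  iterable of (article_title, category) pairs
    category_edges: iterable of (category, parent) pairs
    """
    # Step 1.2: categories that appear as the target of an article link.
    linked = {category for _, category in article_links}
    # Any category that appears as a parent has at least one sub-category.
    has_subcategory = {parent for _, parent in category_edges}
    # Step 1.3: linked categories with no sub-categories.
    return linked - has_subcategory


# Toy rows, assumed for illustration only.
article_links = [
    ("Article_1", "Flow_cytometry"),
    ("Article_1", "Bioinformatics"),
    ("Article_2", "Guided_missiles_of_Norway"),
]
category_edges = [
    ("Flow_cytometry", "Bioinformatics"),
    ("Guided_missiles_of_Norway", "Missiles_by_country"),
]

print(sorted(leaf_categories(article_links, category_edges)))
```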

Stage 2: prominent nodes
INPUT = stage 1 output
2.1. traverse the leaf graph, see the algorithm [1];
2.2. NLP to identify categories that hold is-a relations, i.e., *noun
phrases* with *plural head*, inspired by the YAGO approach [2, 3];
2.3. (optional) set a usage weight based on the number of category
interlanguage links (more links = more usage across language chapters).
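
For step 2.2, a very crude stand-in for the plural-head test is sketched
below. It is an assumption-laden heuristic, not the YAGO implementation: it
takes the word before the first preposition as the head and treats a trailing
"s" as plural. A real pipeline would more likely use a noun-phrase parser or
a POS tagger. The preposition list and function names are hypothetical.

```python
PREPOSITIONS = {"of", "in", "by", "from", "at", "for", "with", "on"}


def head_word(category_title):
    """Approximate the head noun as the last word before the first preposition."""
    words = category_title.replace("_", " ").split()
    for i, word in enumerate(words):
        if i > 0 and word.lower() in PREPOSITIONS:
            return words[i - 1]
    return words[-1] if words else ""


def looks_like_isa_category(category_title):
    head = head_word(category_title).lower()
    # Crude plural test; singular words ending in "s", such as
    # "Bioinformatics", are false positives of this heuristic.
    return head.endswith("s") and not head.endswith("ss")


for title in [
    "Guided_missiles_of_Norway",
    "Flow_cytometry",
    "Wikipedia_articles_published_in_PLOS_Computational_Biology",
    "Bioinformatics",
]:
    print(title, looks_like_isa_category(title))
```

The last title shows why a suffix test alone is not enough: "Bioinformatics"
is accepted even though it is not a plural.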

These 2 stages should output the clean dataset you're looking for.
Based on that, you can then build the taxonomy.

Feel free to ping me if you need more information.
Best,

Marco

[1] Input: L (leaf nodes set) Output: PN (prominent nodes set)
PN = empty set;
for all l in L do
        isProminent = true;
        P = getTransitiveParents(l);
        for all p in P do
                C = getChildren(p);
                areAllLeaves = true;
                for all c in C do
                        if c not in L then
                                areAllLeaves = false;
                                break;
                        end if
                end for
                if areAllLeaves then
                        PN.add(p);
                        isProminent = false;
                end if
        end for
        if isProminent then
                PN.add(l);
        end if
end for
return PN
[2] F. M. Suchanek, G. Kasneci, and G. Weikum. YAGO: a core of semantic
knowledge. In Proceedings of the 16th International Conference on World
Wide Web, pages 697–706. ACM, 2007.
[3] J. Hoffart, F. M. Suchanek, K. Berberich, and G. Weikum. YAGO2: a
spatially and temporally enhanced knowledge base from Wikipedia.
Artificial Intelligence, 194:28–61, 2013.
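
For readers who want to experiment with the pseudocode in [1], here is one
possible Python rendering. It is a sketch under assumptions: the graph is
given as a list of (child, parent) edges, getTransitiveParents is read as
"all ancestors", and the node names in the toy example are hypothetical.

```python
from collections import defaultdict


def transitive_parents(node, parents_of):
    """All ancestors of `node`, following parent edges (cycle-safe)."""
    seen = set()
    stack = list(parents_of.get(node, ()))
    while stack:
        cat = stack.pop()
        if cat not in seen:
            seen.add(cat)
            stack.extend(parents_of.get(cat, ()))
    return seen


def prominent_nodes(leaves, edges):
    """Python rendering of pseudocode [1]; `edges` holds (child, parent) pairs."""
    parents_of = defaultdict(set)
    children_of = defaultdict(set)
    for child, parent in edges:
        parents_of[child].add(parent)
        children_of[parent].add(child)

    leaves = set(leaves)
    prominent = set()
    for leaf in leaves:
        is_prominent = True
        for parent in transitive_parents(leaf, parents_of):
            # A parent whose children are all leaves replaces those leaves.
            if all(child in leaves for child in children_of[parent]):
                prominent.add(parent)
                is_prominent = False
        if is_prominent:
            prominent.add(leaf)
    return prominent


# Hypothetical toy graph: A has only leaf children; B also has a non-leaf child X.
edges = [("A1", "A"), ("A2", "A"), ("B1", "B"), ("X", "B"),
         ("A", "Root"), ("B", "Root")]
leaves = {"A1", "A2", "B1"}

print(sorted(prominent_nodes(leaves, edges)))
```

In the toy graph, A1 and A2 collapse into their parent A, while B1 stays
because B also has the non-leaf child X.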

Re: category extraction question

Leila Zia
Hi all,

Thank you very much for your input. I will report back here which
approach we took to further clean up the data.

Best,
Leila


Re: category extraction question

Leila Zia
In reply to this post by Stuart A. Yeates
Hi Stuart,

On Mon, Jul 10, 2017 at 6:45 PM, Stuart A. Yeates <[hidden email]> wrote:
> The category system on en.wiki is not an IS-A system and there have been
> several discussions about making it based on mathematical principles
> which have come to nothing because the consensus of editors is against it.
> The best way to think about categories is as a locally-faceted related
> links system.

It would be great if you could share a link to one or more of those
conversations, if it's not too hard to find them. This is a
conversation that comes up often and I'd like to educate myself on
this background. (And to confirm: on our end the goal is not to change
the category system on enwiki, but to make it machine understandable
for specific applications.)

> Having said that, Category:Wikipedia maintenance is an important root
> probably useful for separating  the wheat from the chaff. Most of these are
> also hidden categories. I'm not sure whether this flag appears in the SQL,
> but see
> https://en.wikipedia.org/wiki/Wikipedia:Categorization#Hiding_categories

Looking into these. thanks!

Best,
Leila

Re: category extraction question

Cristian Consonni-3
In reply to this post by Leila Zia
Hi Leila,

On 11/07/2017 03:20, Leila Zia wrote:
> ​ Using this approach, we get 5 categories associated with Flow cytometry
> bioinformatics article [3]:
>
> Flow_cytometry
> Bioinformatics
>
> Wikipedia_articles_published_in_peer-reviewed_literature
> Wikipedia_articles_published_in_PLOS_Computational_Biology
> CS1_maint:_Multiple_names:_authors_list

I wanted to point out that to me the main difference between the first
two categories and the last three is that the latter are automatically
added by templates. In fact, if you look at the page source you will
only find the first two.
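
This observation can be turned into a check: extract the categories written
explicitly in the wikitext and subtract them from the categorylinks result.
Whatever is left was added by a template. The sketch below assumes a
simplified regular expression (it does not handle comments, nowiki blocks or
localized namespace names), and the wikitext snippet and template name are
made up.

```python
import re

# Matches [[Category:Title]] and [[Category:Title|sort key]], but not
# [[:Category:Title]], which is a plain link rather than a membership.
CATEGORY_LINK = re.compile(
    r"\[\[\s*Category\s*:\s*([^\]|]+?)\s*(?:\|[^\]]*)?\]\]", re.IGNORECASE
)


def explicit_categories(wikitext):
    """Categories written directly in the page source, in dump title form."""
    titles = set()
    for match in CATEGORY_LINK.finditer(wikitext):
        title = match.group(1).strip().replace(" ", "_")
        titles.add(title[:1].upper() + title[1:])
    return titles


# Made-up page source; the template name is hypothetical.
wikitext = """Some article text.
{{Some peer review template}}
See also [[:Category:Cell biology]].
[[Category:Flow cytometry]]
[[Category:Bioinformatics| ]]
"""

# The five categories reported for the article from the SQL dumps.
sql_categories = {
    "Flow_cytometry",
    "Bioinformatics",
    "Wikipedia_articles_published_in_peer-reviewed_literature",
    "Wikipedia_articles_published_in_PLOS_Computational_Biology",
    "CS1_maint:_Multiple_names:_authors_list",
}

template_added = sql_categories - explicit_categories(wikitext)
print(sorted(template_added))
```

As noted at the start of the thread, not every template-added category is a
maintenance category, so this split is a signal rather than a complete
filter.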

Cristian


Re: category extraction question

Stuart A. Yeates
In reply to this post by Leila Zia
Sorry it's taken me so long to get back to this.

https://pdfs.semanticscholar.org/dea9/142b39bdc2c3738e0f9cb7c6d117750ef2f7.pdf
and https://meta.wikimedia.org/wiki/Beyond_categories are good places to
start on the issues with cats on en.wiki.

cheers
stuart

--
...let us be heard from red core to black sky

On 12 July 2017 at 02:53, Leila Zia <[hidden email]> wrote:

> Hi Stuart,
>
> On Mon, Jul 10, 2017 at 6:45 PM, Stuart A. Yeates <[hidden email]>
> wrote:
> > The category system on en.wiki is not an IS-A system and there have been
> > several discussions about making it it based on mathematical principals
> > which have come to nothing because the consensus of editors is against
> it.
> > The best way to think about categories is as a locally-faceted related
> > links system.
>
> It would be great if you can share a link to one or more of those
> conversations, if it's not too hard to find them. This is a
> conversation that comes up often and I'd like to educate myself with
> this background. (and to confirm: on our end the goal is not to change
> the category system on enwiki, but to make it machine understandable
> for specific applications.)
>
> > Having said that, Category:Wikipedia maintenance is an important root
> > probably useful for separating  the wheat from the chaff. Most of these
> are
> > also hidden categories. I'm not sure whether this flag appears in the
> SQL,
> > but see
> > https://en.wikipedia.org/wiki/Wikipedia:Categorization#Hiding_categories
>
> Looking into these. thanks!
>
> Best,
> Leila
>
> > cheers
> > stuart
> >
> > --
> > ...let us be heard from red core to black sky
> >
> > On 11 July 2017 at 13:20, Leila Zia <[hidden email]> wrote:
> >
> >> Hi all,
> >>
> >> [If you are not interested in discussions related to the category system
> >> (on English Wikipedia)
> >> , you can stop here. :)]
> >>
> >> We have run into a problem that some of you may have thought about or
> >> addressed before. We are trying to clean up the category system on
> English
> >> Wikipedia by turning the category structure to an IS-A hierarchy. (The
> >> output of this work can be useful for the research on template
> >> recommendation [1], for example, but the use-cases won't stop there).
> One
> >> issue that we are facing is the following:
> >>
> >> We are currently
> >> using
> >>  SQL dumps to extract categories associated with every article on
> English
> >> Wikipedia (main namespace). [2]
> >> Using this approach, we get 5 categories associated with Flow cytometry
> >> bioinformatics article [3]:
> >>
> >> Flow_cytometry
> >> Bioinformatics
> >>
> >> Wikipedia_articles_published_in_peer-reviewed_literature
> >> Wikipedia_articles_published_in_PLOS_Computational_Biology
> >> CS1_maint:_Multiple_names:_authors_list
> >>
> >> The problem is that only the first two categories are the ones we are
> >> interested in. We have one cleaning step through which we only keep
> >> categories that belong to category Article and that step removes the
> last
> >> category above, but the other two Wikipedia_... remain there. We need to
> >> somehow prune the data and clean it from those two categories.
> >>
> >> One way we could do the above would be to parse wikitext instead of the
> SQL
> >> dumps and focus on extracting categories marked by pattern
> [[Category:XX]],
> >> but in that case, we would lose a good category such as
> >> Guided_missiles_of_Norway
> >> because that's generated by a template.
> >>
> >> Any ideas on how we can start with a "cleaner" dataset of categories
> >> related to the topic of the articles as opposed to maintenance related
> or
> >> other types of categories?
> >>
> >> Thanks,
> >> Leila
> >>
> >> [1] https://meta.wikimedia.org/wiki/Research:Expanding_Wikipedia
> >> _stubs_across_languages
> >>
> >> [2] The exact code we use is
> >>
> >> SELECT p.page_id id, p.page_title title, cl.cl_to category
> >> FROM categorylinks cl
> >> JOIN page p
> >> on cl.cl_from = p.page_id
> >> where cl_type = 'page'
> >> and page_namespace = 0
> >> and page_is_redirect = 0
> >>
> >> and the edges of the category graph are extracted with
> >>
> >> *SELECT p.page_title category, cl.cl_to parent *
> >> *FROM categorylinks cl *
> >> *JOIN page p *
> >> *ON p.page_id = cl.cl_from *
> >> *where p.page_namespace = 14*
> >>
> >>
> >> [3] https://en.wikipedia.org/wiki/Flow_cytometry_bioinformatics

Re: category extraction question

Leila Zia
In reply to this post by Cristian Consonni-3
Hi Cristian,

On Thu, Jul 20, 2017 at 8:21 AM, Cristian Consonni <[hidden email]> wrote:

> Hi Leila,
>
> On 11/07/2017 03:20, Leila Zia wrote:
>> Using this approach, we get 5 categories associated with Flow cytometry
>> bioinformatics article [3]:
>>
>> Flow_cytometry
>> Bioinformatics
>>
>> Wikipedia_articles_published_in_peer-reviewed_literature
>> Wikipedia_articles_published_in_PLOS_Computational_Biology
>> CS1_maint:_Multiple_names:_authors_list
>
> I wanted to point out that to me the main difference between the first
> two categories and the last three is that the latter are automatically
> added by templates. In fact, if you look at the page source you will
> only find the first two.

This makes sense. Here is why we ended up in this place:
* If we used XML dumps (which we initially did) for category
extraction (based on link extraction), we would treat a category
such as Guided_missiles_of_Norway as a root category (which is wrong).
The issue with this category is that its parent categories are
generated by templates, and we could not (at least not easily)
pick this information up from the XML dumps. As a result, we decided
to go with SQL dumps.
* The nice thing about using SQL dumps is that we can capture the parents
of a category such as Guided_missiles_of_Norway. The downside is that
we lose the information about which categories are generated by a
template and which ones are added directly in the wikitext.
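
One possible way to get the lost distinction back is to combine the two sources: take the full category list from the SQL dumps and mark a category as explicit when it also appears as a [[Category:XX]] link in the page's wikitext. The sketch below assumes exactly that; the function names and the sample wikitext are hypothetical, and the regular expression ignores edge cases such as category links inside comments or nowiki tags.

```python
import re

# Matches [[Category:Name]] and [[Category:Name|sort key]].
CATEGORY_LINK = re.compile(
    r"\[\[\s*Category\s*:\s*([^\]|]+?)\s*(?:\|[^\]]*)?\]\]", re.IGNORECASE
)


def normalize(title):
    """Convert a category title to the underscore form used in the SQL dumps."""
    title = "_".join(title.strip().split())
    return title[:1].upper() + title[1:]


def explicit_categories(wikitext):
    """Return the categories written directly in the wikitext."""
    return {normalize(name) for name in CATEGORY_LINK.findall(wikitext)}


def split_by_origin(sql_categories, wikitext):
    """Split the SQL-dump categories into (in page source, not in page source)."""
    explicit = explicit_categories(wikitext)
    in_source = [c for c in sql_categories if c in explicit]
    not_in_source = [c for c in sql_categories if c not in explicit]
    return in_source, not_in_source


# Hypothetical wikitext for illustration; the template name is made up.
SAMPLE_WIKITEXT = (
    "Some text {{Some template}}\n"
    "[[Category:Flow cytometry]]\n"
    "[[category:bioinformatics| ]]\n"
)

SQL_CATEGORIES = [
    "Flow_cytometry",
    "Bioinformatics",
    "Wikipedia_articles_published_in_peer-reviewed_literature",
    "Wikipedia_articles_published_in_PLOS_Computational_Biology",
    "CS1_maint:_Multiple_names:_authors_list",
]

in_source, not_in_source = split_by_origin(SQL_CATEGORIES, SAMPLE_WIKITEXT)
```

Categories that land in the second list are presumably added by templates, which covers maintenance categories but also legitimate ones, so this flag is a signal to combine with the graph-based filter rather than a filter on its own.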

Two more things to add:
* Focusing on categories that belong to Main_topic_articles seems to
address the issue we ran into.
* We discussed whether a category such as
"Wikipedia_articles_published_in_PLOS_Computational_Biology" is a good
one or not. Given that its path looks reasonable (by eyeballing it), we
now consider it a category that should stay in the category graph.
Here is its path:

Wikipedia_articles_published_in_PLOS_Computational_Biology
Public_Library_of_Science
Open_access_publishers
Academic_publishing_companies
Academic_publishing
Academia
Education
Euthenics
Social_sciences
... and up to the root

So now we know that it's good that our approach for building the graph
of categories doesn't exclude this category immediately.
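
A path like the one above can be reproduced from the (category, parent) edge list that the second query in the original post returns. The following is a minimal sketch, assuming the edges have been loaded as pairs of strings; the function name is hypothetical, and the toy edge list contains only the chain listed above, whereas real categories usually have several parents.

```python
from collections import deque


def path_to_ancestor(edges, start, target):
    """Return one shortest chain of parent links from start up to target, or None."""
    parents = {}
    for category, parent in edges:
        parents.setdefault(category, []).append(parent)

    previous = {start: None}  # also serves as the visited set
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if current == target:
            path = []
            while current is not None:
                path.append(current)
                current = previous[current]
            return path[::-1]
        for parent in parents.get(current, []):
            if parent not in previous:  # the category graph can contain cycles
                previous[parent] = current
                queue.append(parent)
    return None


# Toy edge list built from the chain in this message.
CHAIN = [
    "Wikipedia_articles_published_in_PLOS_Computational_Biology",
    "Public_Library_of_Science",
    "Open_access_publishers",
    "Academic_publishing_companies",
    "Academic_publishing",
    "Academia",
    "Education",
    "Euthenics",
    "Social_sciences",
]
EXAMPLE_EDGES = list(zip(CHAIN, CHAIN[1:]))

found = path_to_ancestor(EXAMPLE_EDGES, CHAIN[0], "Social_sciences")
```

The same function can be used as a membership test: a category belongs under Main_topic_articles (or under a maintenance root) exactly when the call with that target returns a path instead of None.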

Best,
Leila

>
> Cristian
>
>

Re: category extraction question

Leila Zia
In reply to this post by Stuart A. Yeates
On Mon, Jul 24, 2017 at 5:22 PM, Stuart A. Yeates <[hidden email]> wrote:
> Sorry it's taken me so long to get back to this.
> https://pdfs.semanticscholar.org/dea9/142b39bdc2c3738e0f9cb7c6d117750ef2f7.pdf
> and https://meta.wikimedia.org/wiki/Beyond_categories are good places to
> start on the issues with cats on en.wiki.

Very helpful. Thanks!

Leila

> cheers
> stuart
>
> --
> ...let us be heard from red core to black sky
>
> On 12 July 2017 at 02:53, Leila Zia <[hidden email]> wrote:
>
>> Hi Stuart,
>>
>> On Mon, Jul 10, 2017 at 6:45 PM, Stuart A. Yeates <[hidden email]>
>> wrote:
>> > The category system on en.wiki is not an IS-A system and there have been
>> > several discussions about making it based on mathematical principles,
>> > which have come to nothing because the consensus of editors is against it.
>> > The best way to think about categories is as a locally-faceted related
>> > links system.
>>
>> It would be great if you can share a link to one or more of those
>> conversations, if it's not too hard to find them. This is a
>> conversation that comes up often and I'd like to educate myself with
>> this background. (and to confirm: on our end the goal is not to change
>> the category system on enwiki, but to make it machine understandable
>> for specific applications.)
>>
>> > Having said that, Category:Wikipedia maintenance is an important root
>> > probably useful for separating the wheat from the chaff. Most of these are
>> > also hidden categories. I'm not sure whether this flag appears in the SQL,
>> > but see
>> > https://en.wikipedia.org/wiki/Wikipedia:Categorization#Hiding_categories
>>
>> Looking into these. Thanks!
>>
>> Best,
>> Leila
>>
>> > cheers
>> > stuart
>> >
>> > --
>> > ...let us be heard from red core to black sky