Flattening a wikimedia category

Flattening a wikimedia category

Rayson Ho
It seems there is no easy way to display all the media files under a
Wikimedia category -- for example, if someone wants a picture of a
library, he or she will need to go into each sub-category under
"Libraries":

http://commons.wikimedia.org/wiki/Category:Libraries

While Wikimedia is not yet the most popular stock photo source, IMO
having this flattening functionality would be useful to those who are
looking for stock photos.

Rayson

_______________________________________________
Wikitech-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: Flattening a wikimedia category

Daniel Schwen-2
> While Wikimedia is not yet the most popular stock photo source, IMO
> having this flattening functionality would be useful to those who are
> looking for stock photos.

Just because I love this recurring debate sooo much, I'll drop in two more bits:

* atomic categorization would solve this
* category intersection would be useful (imagine a user searching for
a picture of a library in Asia)

open fire!

Re: Flattening a wikimedia category

Aryeh Gregor
In reply to this post by Rayson Ho
On Wed, Feb 3, 2010 at 10:10 PM, Rayson Ho <[hidden email]> wrote:

> Seems like it is no easy way to display all the media files under a
> wikimedia category -- for example if someone wants a picture of a
> library, he or she will need to go into each sub-category under
> "Libraries":
>
> http://commons.wikimedia.org/wiki/Category:Libraries
>
> While Wikimedia is not yet the most popular stock photo source, IMO
> having this flattening functionality would be useful to those who are
> looking for stock photos.

This is a regular request.  There are two major problems:

1) Our database schema is not set up to handle this efficiently for
large result sets.  At least I don't think so, off the top of my head.

2) In practice, collapsing categories like this can often lead to
crazy stuff being included, because subcategory relations aren't used
strictly in an "everything in category A is also in category B" sense.
It's easy to come up with examples.  For instance:
[[Category:Punishments in religion]] -> [[Category:Religion and
capital punishment]] -> [[Category:People executed for heresy]] ->
[[Category:Joan of Arc]] -> [[English claims to the French throne]].
Thus, if you try to get all articles in [[Category:Punishments in
religion]] or subcategories, you'll get results like [[English claims
to the French throne]].

However, this is definitely on the long-term "it would be nice if
someone did this someday" list.
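
The chain in point 2 can be reproduced with a minimal sketch. The graph below is hypothetical toy data built only from the category names quoted above, and `flatten` is an illustrative helper, not MediaWiki code; it simply follows every subcategory link, which is what naive flattening would do.

```python
from collections import deque

# Hypothetical toy data: category -> direct members (subcategories or pages),
# following the chain quoted in the message above.
MEMBERS = {
    "Category:Punishments in religion": ["Category:Religion and capital punishment"],
    "Category:Religion and capital punishment": ["Category:People executed for heresy"],
    "Category:People executed for heresy": ["Category:Joan of Arc"],
    "Category:Joan of Arc": ["English claims to the French throne"],
}


def flatten(root, members):
    """Return every page reachable from root through subcategory links."""
    pages, seen, queue = set(), {root}, deque([root])
    while queue:
        category = queue.popleft()
        for member in members.get(category, []):
            if member.startswith("Category:"):
                if member not in seen:  # guard against category cycles
                    seen.add(member)
                    queue.append(member)
            else:
                pages.add(member)
    return pages


print(flatten("Category:Punishments in religion", MEMBERS))
```

Every individual link in the chain is defensible, yet the flattened result contains a page that has little to do with the root category.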

Re: Flattening a wikimedia category

Gregory Maxwell
On Thu, Feb 4, 2010 at 9:27 AM, Aryeh Gregor
<[hidden email]> wrote:
> 1) Our database schema is not set up to handle this efficiently for
> large result sets.  At least I don't think so, off the top of my head.

I've never been able to come up with an acceptable data-structure for
flattening on the fly.
(I think acceptable is something like O(1) or O(log something) on
insert, delete, and no worse than something like O(results log
something) on query).

But if you do atomic categories explicitly enumerated on the pages
then you get the right properties, and fast search with intersections
is the same problem as full text search. I.e. solved.
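
A minimal sketch of why this reduces to the full-text search problem, assuming each page lists its atomic categories directly. The page names, category names, and helper functions are hypothetical; the point is only that intersection becomes a lookup in an inverted index followed by set intersection, with no recursion over a category tree.

```python
from collections import defaultdict

# Hypothetical pages, each explicitly tagged with atomic categories.
PAGE_CATEGORIES = {
    "File:A.jpg": {"Libraries", "Europe"},
    "File:B.jpg": {"Libraries", "Asia"},
    "File:C.jpg": {"Museums", "Asia"},
}


def build_index(page_categories):
    """Build an inverted index: category -> set of pages carrying it."""
    index = defaultdict(set)
    for page, categories in page_categories.items():
        for category in categories:
            index[category].add(page)
    return index


def intersect(index, *categories):
    """Pages that carry all of the given categories."""
    postings = sorted((index.get(c, set()) for c in categories), key=len)
    if not postings:
        return set()
    result = set(postings[0])  # start from the smallest posting list
    for posting in postings[1:]:
        result &= posting
    return result


index = build_index(PAGE_CATEGORIES)
print(intersect(index, "Libraries", "Asia"))
```

Inserting or removing a category on one page touches only that page's entries in the index, which is the property the on-the-fly flattening approach lacks.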

> 2) In practice, collapsing categories like this can often lead to
> crazy stuff being included, because subcategory relations aren't used
> strictly in a "everything in category A is also in category B" sense.

Yeah, automatic collapsing is mostly good for hilarious results...
manual collapsing OTOH.

Re: Flattening a wikimedia category

Aryeh Gregor
On Thu, Feb 4, 2010 at 10:02 AM, Gregory Maxwell <[hidden email]> wrote:
> But if you do atomic categories explicitly enumerated on the pages
> then you get the right properties, and fast search with intersections
> is the same problem as full text search. I.e. solved.

Right.  Supporting category intersection and search in category with
better UI (we already sort of support it if you know the right magic
terms) is what we should be aiming for here.

Re: Flattening a wikimedia category

Robert Stojnic-2

Aryeh Gregor wrote:
> Right.  Supporting category intersection and search in category with
> better UI (we already sort of support it if you know the right magic
> terms) is what we should be aiming for here.
>  

Last year, just around this time, we came to exactly the same
conclusion. And just like then, there is no shortage of good
opinions on how to do it, only of people to actually do the programming.

r.


Re: Flattening a wikimedia category

Aryeh Gregor
On Thu, Feb 4, 2010 at 11:03 AM, Robert Stojnic <[hidden email]> wrote:
> Last year, just around this time, we came to the exactly same
> conclusion. And similarly like then, there is no shortage of good
> opinions on how to do it, but people to actually do the programming.

Yup.  Any volunteers?  My understanding is that right now, the backend
supports category searches as long as the categories are spelled out
literally in the wikitext (not via template).  That's not a big
restriction, so what we could really use right now is UI, which
shouldn't require such specialized skills.

So, does anyone want to:

1) Mock up basic UI for category intersections/search in category?

2) Implement it?

After that we can talk about fancy things like automatically
suggesting categories to intersect with or whatever . . . we don't
even have the most basic UI right now.

Re: Flattening a wikimedia category

Neil Harris-2
In reply to this post by Robert Stojnic-2
On 04/02/10 16:03, Robert Stojnic wrote:

> Aryeh Gregor wrote:
>    
>> Right.  Supporting category intersection and search in category with
>> better UI (we already sort of support it if you know the right magic
>> terms) is what we should be aiming for here.
>>
>>      
> Last year, just around this time, we came to the exactly same
> conclusion. And similarly like then, there is no shortage of good
> opinions on how to do it, but people to actually do the programming.
>
> r.
>
>    
I'm working on it.

-- Neil


Re: Flattening a wikimedia category

Conrad Irwin
In reply to this post by Aryeh Gregor
On 02/04/2010 04:10 PM, Aryeh Gregor wrote:
>
> Yup.  Any volunteers?  My understanding is that right now, the backend
> supports category searches as long as the categories are spelled out
> literally in the wikitext (not via template).  

Presumably it would not be too hard to append the full category list to
the blob that gets sent to the search engine, (perhaps as part of
fixing: https://bugzilla.wikimedia.org/show_bug.cgi?id=18861 -nudge-nudge)

Whether this is a big restriction or not depends a lot on your wiki; I
estimate that 90% or more of categories on en.wiktionary are added by
templates (but then so is most of our output anyway).

Conrad

Re: Flattening a wikimedia category

Daniel Schwen-2
In reply to this post by Neil Harris-2
This is putting the cart in front of the ox yet again. A few mails up
Aryeh and Gregory both come to the conclusion that automatic
flattening is useless.
Yet category flattening would be a prerequisite to intersections.
The only way to get proper intersection is manual flattening i.e.
atomic categorization. As long as nobody is pushing commons _hard_ to
change their categorization system _nothing_ will happen and we'll
meet on this list again in about one year repeating the same
discussion.

Re: Flattening a wikimedia category

Aryeh Gregor
In reply to this post by Conrad Irwin
On Thu, Feb 4, 2010 at 11:28 AM, Conrad Irwin
<[hidden email]> wrote:
> Presumably it would not be too hard to append the full category list to
> the blob that gets sent to the search engine

No, probably not, but it would be even easier to not worry about it
yet (unless someone wants to!).

On Thu, Feb 4, 2010 at 11:37 AM, Daniel Schwen <[hidden email]> wrote:
> Yet category flattening would be a prerequisite to intersections.
> The only way to get proper intersection is manual flattening i.e.
> atomic categorization.

Correct.  Automatic flattening is not good enough -- manual flattening
is necessary.  Maybe if we had a better category intersect feature,
more wikis would do manual flattening.  If they don't, I guess they
won't get the feature.  Automatic flattening is not a substitute.

Re: Flattening a wikimedia category

Daniel Kinzler
In reply to this post by Robert Stojnic-2
Robert Stojnic wrote:

> Aryeh Gregor wrote:
>> Right.  Supporting category intersection and search in category with
>> better UI (we already sort of support it if you know the right magic
>> terms) is what we should be aiming for here.
>>  
>
> Last year, just around this time, we came to the exactly same
> conclusion. And similarly like then, there is no shortage of good
> opinions on how to do it, but people to actually do the programming.
>
> r.

Wikimedia Germany has contracted Neil Harris to work on implementing deep
category intersection. The goal is basically a rewrite of my sucky CatScan tool.
The result is hopefully fast & generic enough so it can be used as a service
that integrates with the current search infrastructure.

The project has started; there is funding and a project plan. I expect to see
usable results soon. In fact, I hope to present this at the developer meeting in
April (Neil, contact me about attending) and discuss the integration into Lucene
search.

I agree that full recursive flattening of the current category structure sometimes
leads to bad results (especially on the English Wikipedia; Commons is quite
bad too). A depth of 5, however, is generally useful. One common use case is
intersecting a content category with a maintenance category, for organizing
editorial work in a wiki project. In that case, at least one category comes from
a template.
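
A minimal sketch of depth-limited ("deep") intersection as described above, assuming the category graph is available as plain dictionaries. The data and the function names are hypothetical, and this is not how CatScan or its planned rewrite is implemented; it only illustrates the effect of capping the descent at a fixed depth.

```python
# Hypothetical toy data: category -> direct subcategories, category -> direct pages.
CHILDREN = {
    "Libraries": ["Libraries by country"],
    "Libraries by country": ["Libraries in Japan"],
}
PAGES = {
    "Libraries in Japan": {"Page A", "Page B"},
    "Needs cleanup": {"Page B", "Page C"},  # stands in for a maintenance category
}


def subcategories_within(root, children, max_depth):
    """Categories reachable from root in at most max_depth subcategory steps."""
    seen, frontier = {root}, [root]
    for _ in range(max_depth):
        next_frontier = []
        for category in frontier:
            for child in children.get(category, ()):
                if child not in seen:  # guard against category cycles
                    seen.add(child)
                    next_frontier.append(child)
        frontier = next_frontier
    return seen


def pages_within(root, children, pages, max_depth):
    """Pages in root or in any subcategory up to max_depth levels down."""
    result = set()
    for category in subcategories_within(root, children, max_depth):
        result |= pages.get(category, set())
    return result


def deep_intersection(cat_a, cat_b, children, pages, max_depth=5):
    """Pages found under both categories, descending at most max_depth levels."""
    return (pages_within(cat_a, children, pages, max_depth)
            & pages_within(cat_b, children, pages, max_depth))


print(deep_intersection("Libraries", "Needs cleanup", CHILDREN, PAGES))
```

Lowering `max_depth` trades recall for precision: with `max_depth=1` the example finds nothing, because the matching page sits two levels below "Libraries".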

Atomic categorization, aka tagging, however also sucks: the tags are either too
generic (so it's hard to find stuff) or too specific (you never know what to
search for). Tags implying/including other tags is very useful, which is exactly
what categories with deep intersection will provide.


-- daniel

Re: Flattening a wikimedia category

David Gerard-2
In reply to this post by Daniel Schwen-2
On 4 February 2010 16:37, Daniel Schwen <[hidden email]> wrote:

> The only way to get proper intersection is manual flattening i.e.
> atomic categorization. As long as nobody is pushing commons _hard_ to
> change their categorization system _nothing_ will happen and we'll
> meet on this list again in about one year repeating the same
> discussion.


Commons really wants this. LOTS AND LOTS.

But we need the functionality there first, so we can *then* flatten.


- d.

Re: Flattening a wikimedia category

Daniel Schwen-2
> But we need the functionality there first, so we can *then* flatten.

Ahh, the good old chicken and egg ;-)
I won't let that count. We have plenty of working category
intersection tools already. Their usefulness is limited however
because the category system is so screwed up.
The ball is definitely in the categorization-court!

Re: Flattening a wikimedia category

David Gerard-2
On 4 February 2010 17:38, Daniel Schwen <[hidden email]> wrote:

>> But we need the functionality there first, so we can *then* flatten.

> Ahh, the good old chicken and egg ;-)
> I don't let that count. We have plenty of working category
> intersection tools already.


Yes, but they're not part of the interface.

The technology needs to work with the data - the six million files and
their categories, carefully added by hand by humans.

If category intersections worked, they could then be broken down to
work better with category intersections.

Demanding that all six million files be de-categorised before you'll
even allow a category intersection tool to *possibly* be deployed is
backward.

People need to be able to go gradually.


- d.

Re: Flattening a wikimedia category

Magnus Manske-2
In reply to this post by Daniel Kinzler
On Thu, Feb 4, 2010 at 5:02 PM, Daniel Kinzler <[hidden email]> wrote:

> Robert Stojnic wrote:
>> Aryeh Gregor wrote:
>>> Right.  Supporting category intersection and search in category with
>>> better UI (we already sort of support it if you know the right magic
>>> terms) is what we should be aiming for here.
>>>
>>
>> Last year, just around this time, we came to the exactly same
>> conclusion. And similarly like then, there is no shortage of good
>> opinions on how to do it, but people to actually do the programming.
>>
>> r.
>
> Wikimedia Germany has contracted Neil Harris to work on implementing deep
> category intersection. The goal is basically a rewrite of my sucky CatScan tool.

In the meantime:
http://toolserver.org/~magnus/catscan_rewrite.php

(toolserver seems to have a problem ATM, though...)

Magnus

Re: Flattening a wikimedia category

Daniel Kinzler
Magnus Manske wrote:
> In the meantime:
> http://toolserver.org/~magnus/catscan_rewrite.php
>
> (toolserver seems to have a problem ATM, though...)

Yes, lots more options than my old thingy, thanks Magnus :) But it is still bound to
recursive calls to the database, which is what I really want to get rid of. The
lookup needs to be snappy.

-- daniel

Re: Flattening a wikimedia category

Tim Landscheidt
Daniel Kinzler <[hidden email]> wrote:

>> In the meantime:
>> http://toolserver.org/~magnus/catscan_rewrite.php

>> (toolserver seems to have a problem ATM, though...)

> Yes, lots more options than my old thingy, thanks magnus :) but still bound to
> recursive calls to the database, which is what i really want to get rid of. the
> lookup needs to be snappy.

Is there any reason not to have a flattened structure somewhere
on the toolserver (or, in the long run, in MediaWiki)?
A quick look at recentchanges for dewp shows about
22000 changes per month, about one every two minutes. With
about 80000 categories in all, it should be feasible to update
the structure incrementally, with daily/weekly/monthly
clean new full "dumps" (or even dispense with up-to-the-second
data and just dump the flat structure hourly).
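
The "one every two minutes" figure can be checked with a couple of lines of arithmetic. The 30-day month is an assumption, and `minutes_between_changes` is a hypothetical helper used only for this check.

```python
def minutes_between_changes(changes_per_month, days_per_month=30):
    """Average gap between changes, in minutes, assuming an even spread."""
    minutes_per_month = days_per_month * 24 * 60
    return minutes_per_month / changes_per_month


print(round(minutes_between_changes(22000), 2))
```

This gives an average gap of just under two minutes, which matches the estimate; it says nothing about how much work each individual change triggers.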

Tim


Re: Flattening a wikimedia category

Gregory Maxwell
On Thu, Feb 4, 2010 at 6:40 PM, Tim Landscheidt <[hidden email]> wrote:
> Is there any reason not to have a flattened structure somewhere
> on the toolserver (or, in the long run, in MediaWiki)?
> A quick look at recentchanges for dewp shows about
> 22000 changes per month, about one every two minutes. With
> about 80000 categories in all, it should be feasible to update
> the structure incrementally, with daily/weekly/monthly
> clean new full "dumps" (or even dispense with up-to-the-second
> data and just dump the flat structure hourly).

Incremental updates for a 'flattened copy' aren't especially
realistic... as one user operation can produce millions of operations
on the server.
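
A minimal sketch of that fan-out, assuming the flattened copy stores, for every category, the full set of pages at or below it. The graph is hypothetical toy data (a chain of ten ancestor categories and one category with 1000 pages) and the helpers are illustrative only.

```python
def descendants(root, children):
    """Root plus every category reachable from it through subcategory links."""
    seen, stack = {root}, [root]
    while stack:
        for child in children.get(stack.pop(), ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def flatten_all(children, pages):
    """Materialise, for every category, the set of pages at or below it."""
    categories = set(children) | set(pages)
    for subs in children.values():
        categories.update(subs)
    return {
        cat: set().union(*(pages.get(d, set()) for d in descendants(cat, children)))
        for cat in categories
    }


def rows_changed(before, after):
    """Number of (category, page) rows that differ between two flattened copies."""
    return sum(len(after[cat] ^ before.get(cat, set())) for cat in after)


# Hypothetical graph: A0 -> A1 -> ... -> A9, plus an unrelated category "Big".
children = {f"A{i}": [f"A{i + 1}"] for i in range(9)}
pages = {"Big": {f"Page {n}" for n in range(1000)}}
before = flatten_all(children, pages)

# One user edit: make "Big" a subcategory of "A9".
children_after = {**children, "A9": ["Big"]}
after = flatten_all(children_after, pages)

print(rows_changed(before, after))
```

A single edit adds 1000 rows to each of the ten ancestors. With deeper hierarchies and larger categories, the same multiplication is what makes incremental maintenance of a flattened copy expensive.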

I won't bother saying much more; Daniel Schwen pretty much speaks for my view.

Re: Flattening a wikimedia category

Daniel Kinzler
In reply to this post by Tim Landscheidt
Tim Landscheidt wrote:

> Daniel Kinzler <[hidden email]> wrote:
>
>>> In the meantime:
>>> http://toolserver.org/~magnus/catscan_rewrite.php
>
>>> (toolserver seems to have a problem ATM, though...)
>
>> Yes, lots more options than my old thingy, thanks magnus :) but still bound to
>> recursive calls to the database, which is what i really want to get rid of. the
>> lookup needs to be snappy.
>
> Is there any reason not to have a flattened structure somewhere
> on the toolserver (or, in the long run, in MediaWiki)?
> A quick look at recentchanges for dewp shows about
> 22000 changes per month, about one every two minutes. With
> about 80000 categories in all, it should be feasible to update
> the structure incrementally, with daily/weekly/monthly
> clean new full "dumps" (or even dispense with up-to-the-second
> data and just dump the flat structure hourly).

Basically: yes, this is the idea, but detecting categorization changes isn't
trivial. Also, really keeping a copy of the flat content of each category would
be redundant in the extreme. It would result in hundreds of millions of entries
and would be hard to handle. A data structure for fast recursive lookup makes
more sense. Neil is working on this.

As to the general approach: I hope that by providing a way to intersect
categories, we can get rid of most of the "Foo in Bar" cross-section categories.
I still believe hierarchical structuring/inclusion of categories is useful. Or,
to put it differently: let people use "flat tagging", but let's keep the notion
of one tag implying another, i.e. math implying science and Texas implying America.
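
The notion of one tag implying another can be sketched as follows. The implication table reuses the two examples from the paragraph above; the pages and helper functions are hypothetical. Pages carry only flat tags, and the implied tags are added at search time.

```python
# Implications taken from the examples above; everything else is hypothetical.
IMPLIES = {"math": {"science"}, "texas": {"america"}}

PAGE_TAGS = {
    "Page A": {"math"},
    "Page B": {"texas", "math"},
    "Page C": {"america"},
}


def expand_tags(tags, implies):
    """Return tags plus everything they imply, directly or transitively."""
    expanded, stack = set(tags), list(tags)
    while stack:
        for implied in implies.get(stack.pop(), ()):
            if implied not in expanded:  # also protects against implication cycles
                expanded.add(implied)
                stack.append(implied)
    return expanded


def search(page_tags, implies, *wanted):
    """Pages whose expanded tag set contains every wanted tag."""
    return {
        page
        for page, tags in page_tags.items()
        if set(wanted) <= expand_tags(tags, implies)
    }


print(search(PAGE_TAGS, IMPLIES, "science", "america"))
```

"Page B" is tagged only with "texas" and "math", yet it matches a search for "science" and "america", so no "Science in America" cross-section category is needed.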

-- daniel
