Category finder

classic Classic list List threaded Threaded
3 messages Options
Reply | Threaded
Open this post in threaded view
|

Category finder

Magnus Manske
Hi,

I worte a little PHP object that takes
* a list of article IDs
* a list of categories
and returns those article IDs which belong to these categories *or their
subcategories*.

The algorithm is written to minimize the number of read queries when
parsing the category tree. The maximum number of read queries is the
"tallest" category tree in the set of articles (times two, one for the
"parents", one for their page_ids), which IMHO should mostly be below 10
(I guess it's usually 5 or less, but I have no data to back that up).

So, if the tree depth for an article is 8, and I add more articles to
search for which all have a depth of 8 or less, no additional database
queries are neccessary. The individual query will grow in size, though.

I intend to use it on the Tasks feature search, which is why I put it
into that extension ("extensions" module, "Tasks/categoryfinder.php").
But I hope this can be used for other searches as well. The idea is,
instead of searching for, say, 25 articles, to internally search for
more (e.g. a few hundred), then filter that set through the
categoryfinder, until there are 25 matching articles.

Before I start implementing the actual search interface, can anyone tell
me if that would put too much stress on the DB slaves? Keep in mind that
limiting several searches to "articles in [[Category:Physics]] and its
subcategories" appears to be immensely useful, so it might be well worth
the DB stress from a user standpoint. (Damn you, users! ;-)

Magnus

_______________________________________________
Wikitech-l mailing list
[hidden email]
http://mail.wikipedia.org/mailman/listinfo/wikitech-l
Reply | Threaded
Open this post in threaded view
|

Re: Category finder

Domas Mituzas
Magnus,

> So, if the tree depth for an article is 8, and I add more articles to
> search for which all have a depth of 8 or less, no additional database
> queries are neccessary. The individual query will grow in size,  
> though.

Databases aren't about number of queries sent, with our workloads  
(ergh, 20000 SQL queries per second on a cluster) query count does  
not matter that much. What matters though, is number of rows looked  
up by each query, and how hot that data is. Whenever we give users  
possibility to hit our disks, we slow our systems down.

> limiting several searches to "articles in [[Category:Physics]] and its
> subcategories" appears to be immensely useful, so it might be well  
> worth
> the DB stress from a user standpoint. (Damn you, users! ;-)

For sake of user experience we are in constant fight for and against  
features. ;)

*shrug*, just EXPLAIN your queries and take a look how hard it's  
going to hit our databases. And try that on our biggest category trees.

Domas
_______________________________________________
Wikitech-l mailing list
[hidden email]
http://mail.wikipedia.org/mailman/listinfo/wikitech-l
Reply | Threaded
Open this post in threaded view
|

Re: Category finder

Ray Saintonge
In reply to this post by Magnus Manske
Magnus Manske wrote:

>Hi,
>
>I worte a little PHP object that takes
>* a list of article IDs
>* a list of categories
>and returns those article IDs which belong to these categories *or their
>subcategories*.
>
>The algorithm is written to minimize the number of read queries when
>parsing the category tree. The maximum number of read queries is the
>"tallest" category tree in the set of articles (times two, one for the
>"parents", one for their page_ids), which IMHO should mostly be below 10
>(I guess it's usually 5 or less, but I have no data to back that up).
>
This problem seems like the top-down counterpart of bottom-up problem of
category tracing an article to the one top category.  Or the similar
problem of using "What links here" to trace back to the Main Page.  In
either case the path length should be the same, and many will find it
amazingly short.  It can probably be estimated from the average number
of direct subcategories in a category.  This yields the simple
exponential equation n^x=T where "n" is the average number of
subcategories in a category, and T is the total number of articles in
the database.  In other words x = log T/log n.  In a simple example with
base 10 logarithms: If each category has 10 sub-categories and there are
1,000,000 articles in the database the average path length will only be
6.  Growing to 10,000,000 only increases that path length to 7.

>So, if the tree depth for an article is 8, and I add more articles to
>search for which all have a depth of 8 or less, no additional database
>queries are neccessary. The individual query will grow in size, though.
>
Yes

>I intend to use it on the Tasks feature search, which is why I put it
>into that extension ("extensions" module, "Tasks/categoryfinder.php").
>But I hope this can be used for other searches as well. The idea is,
>instead of searching for, say, 25 articles, to internally search for
>more (e.g. a few hundred), then filter that set through the
>categoryfinder, until there are 25 matching articles.
>
>Before I start implementing the actual search interface, can anyone tell
>me if that would put too much stress on the DB slaves? Keep in mind that
>limiting several searches to "articles in [[Category:Physics]] and its
>subcategories" appears to be immensely useful, so it might be well worth
>the DB stress from a user standpoint.
>
I suspect that the bottom up problem should be easier, especially in a
strict herarchy where a sub-category only belongs to one category.  This
would give a single tracing back to the source.  Ethnologue uses it in
its website, and so do taxonomic databases.  We show this in our species
taxoboxes, and it is a significant feature in Wikispecies.

When you go in the other direction the size of your results can grow
exponentially, and practically you can only show a limited depth of
categories.  After an initial query you can perform a second more
limited query.  In a sense it suggests that a more organized approach to
categories would be superior to the higgledy-piggledy one that currently
prevails.

Ec

_______________________________________________
Wikitech-l mailing list
[hidden email]
http://mail.wikipedia.org/mailman/listinfo/wikitech-l