How bad is a category with 21,008 pages for the servers?

Ligulem
How bad is a category with 21,008 pages for the servers?

http://en.wikipedia.org/wiki/Category:Articles_with_unsourced_statements

--Ligulem

_______________________________________________
Wikitech-l mailing list
[hidden email]
http://mail.wikipedia.org/mailman/listinfo/wikitech-l

Re: How bad is a category with 21,008 pages for the servers?

Simetrical
On 8/20/06, Ligulem <[hidden email]> wrote:
> How bad is a category with 21,008 pages for the servers?
>
> http://en.wikipedia.org/wiki/Category:Articles_with_unsourced_statements

IIRC, it's not a problem now that we have category paging, but my
recollection may be faulty.
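
For illustration, paging of this kind can be sketched as keyset pagination: each request reads one fixed-size slice of the sorted member list, starting just after the last title already shown, so the cost of a page view does not grow with the size of the category. This is only a minimal sketch and not MediaWiki's actual code; the name `category_page` and the page size of 200 are assumptions, and a sorted in-memory list stands in for a database index.

```python
from bisect import bisect_right

def category_page(sorted_members, after=None, limit=200):
    """Return one page of members plus the key to continue from.

    sorted_members: member titles in sort order (stands in for an index).
    after: last title of the previous page, or None for the first page.
    """
    start = 0 if after is None else bisect_right(sorted_members, after)
    page = sorted_members[start:start + limit]
    # A continuation key is only needed when more members remain.
    has_more = start + limit < len(sorted_members)
    next_key = page[-1] if page and has_more else None
    return page, next_key
```

A caller would pass the returned key back as `after` to fetch the next page, and stop when the key is None.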

Re: How bad is a category with 21,008 pages for the servers?

Gregory Maxwell
On 8/20/06, Simetrical <[hidden email]> wrote:
> On 8/20/06, Ligulem <[hidden email]> wrote:
> > How bad is a category with 21,008 pages for the servers?
> >
> > http://en.wikipedia.org/wiki/Category:Articles_with_unsourced_statements
>
> IIRC, it's not a problem now that we have category paging, but my
> recollection may be faulty.

20k isn't that large.. we have quite a few with a lot more than that...

Re: How bad is a category with 21,008 pages for the servers?

Simetrical
On 8/20/06, Gregory Maxwell <[hidden email]> wrote:
> 20k isn't that large.. we have quite a few with a lot more than that...

Now that you mention it, we do, and even a special page to account for
them: http://en.wikipedia.org/wiki/Special:Mostlinkedcategories.  The
largest weighs in at over 110,000 pages.

Re: How bad is a category with 21,008 pages for the servers?

Rob Church
On 21/08/06, Simetrical <[hidden email]> wrote:
> On 8/20/06, Gregory Maxwell <[hidden email]> wrote:
> > 20k isn't that large.. we have quite a few with a lot more than that...
>
> Now that you mention it, we do, and even a special page to account for
> them: http://en.wikipedia.org/wiki/Special:Mostlinkedcategories.  The
> largest weighs in at over 110,000 pages.

Is that the one with an article for each faux pas of a well-known US
political figure in?


Rob Church

Re: How bad is a category with 21,008 pages for the servers?

Ligulem
In reply to this post by Gregory Maxwell
Gregory Maxwell wrote:
> On 8/20/06, Simetrical <[hidden email]> wrote:
>> On 8/20/06, Ligulem <[hidden email]> wrote:
>>> How bad is a category with 21,008 pages for the servers?
>>>
>>> http://en.wikipedia.org/wiki/Category:Articles_with_unsourced_statements
>> IIRC, it's not a problem now that we have category paging, but my
>> recollection may be faulty.
>
> 20k isn't that large.. we have quite a few with a lot more than that...

Thanks for the responses. So this is not a technical problem then. I
just wonder what's the benefit of having such huge categories...

I thought categories were meant as a tool for editors to iterate over
the articles contained in them. I can't imagine that a human would ever
iterate over a set of 20K pages (at least not without using specialized
tools like AWB [1] or bots).

[1] http://en.wikipedia.org/wiki/WP:AWB


Re: How bad is a category with 21,008 pages for the servers?

Steve Bennett-4
On 8/21/06, Ligulem <[hidden email]> wrote:
> Thanks for the responses. So this is not a technical problem then. I
> just wonder what's the benefit of having such huge categories...
>
> I thought categories were meant as a tool for editors to iterate over
> the articles contained in them. I can't imagine that a human would ever
> iterate over a set of 20K pages (at least not without using specialized
> tools like AWB [1] or bots).

There are several distinct uses of categories:
a) To allow human readers to browse related articles
b) To organise articles for future distribution, publishing etc
c) To assist quality control, such as labelling articles that need cleanup etc
d) To allow bots to work some kind of magic.

Steve

Re: How bad is a category with 21,008 pages for the servers?

Andrew Dunbar
On 8/21/06, Steve Bennett <[hidden email]> wrote:

> On 8/21/06, Ligulem <[hidden email]> wrote:
> > Thanks for the responses. So this is not a technical problem then. I
> > just wonder what's the benefit of having such huge categories...
> >
> > I thought categories were meant as a tool for editors to iterate over
> > the articles contained in them. I can't imagine that a human would ever
> > iterate over a set of 20K pages (at least not without using specialized
> > tools like AWB [1] or bots).
>
> There are several distinct uses of categories:
> a) To allow human readers to browse related articles
> b) To organise articles for future distribution, publishing etc
> c) To assist quality control, such as labelling articles that need cleanup etc
> d) To allow bots to work some kind of magic.

I've been thinking for some weeks or more now that a good feature to improve
the usefulness of large categories would be "Random article in this
category". It would be excellent for maintenance where nobody is going to
iterate through the whole lot but does like to try to keep order etc.

Andrew Dunbar (hippietrail)


--
http://linguaphile.sf.net

Re: How bad is a category with 21,008 pages for the servers?

Steve Bennett-4
On 8/21/06, Andrew Dunbar <[hidden email]> wrote:
> I've been thinking for some weeks or more now that a good feature to improve
> the usefulness of large categories would be "Random article in this
> category". It would be excellent for maintenance where nobody is going to
> iterate through the whole lot but does like to try to keep order etc.

Yes. Definitely support that.

And, going even further, we could do what most "funny photo" sites do:
find "similar articles" to the current one by looking for articles that
share most of the same categories. Given the differences between our
"categories" and the "tags" on other sites, it might not work as well,
but it would still be interesting...
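
As a rough illustration of the "similar articles" idea, the sketch below ranks other articles by the number of categories they share with a given one. It is only a sketch under assumptions: the function name `similar_articles` and the plain dict of category sets are hypothetical stand-ins, not anything in MediaWiki.

```python
from collections import Counter

def similar_articles(title, categories_of, top=5):
    """Rank other articles by how many categories they share with `title`.

    categories_of: dict mapping article title -> set of category names.
    """
    own = categories_of[title]
    # Invert the mapping once: category -> articles in it.
    members = {}
    for article, cats in categories_of.items():
        for cat in cats:
            members.setdefault(cat, set()).add(article)
    shared = Counter()
    for cat in own:
        for other in members[cat]:
            if other != title:
                shared[other] += 1
    # Sort by shared count (descending), then title, for a stable order.
    ranked = sorted(shared.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top]
```

Note that the inner loop visits every member of every category the article is in, so a single very large maintenance category would dominate both the running time and the ranking; a real version would probably need to skip or down-weight such categories.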

Steve

Re: How bad is a category with 21,008 pages for the servers?

Gregory Maxwell
In reply to this post by Steve Bennett-4
On 8/21/06, Steve Bennett <[hidden email]> wrote:
> There are several distinct uses of categories:
> a) To allow human readers to browse related articles
> b) To organise articles for future distribution, publishing etc
> c) To assist quality control, such as labelling articles that need cleanup etc
> d) To allow bots to work some kind of magic.

Surprised that you didn't name this one, since it is one of the more
useful human oriented ones (and a primary application for the living
people cat):

e) Produce a filtered recent changes feed
(http://en.wikipedia.org/w/index.php?title=Special:Recentchangeslinked&target=Category%3ALiving_people)
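
For any other category, a URL of the same shape can be built as in the sketch below. The `title` and `target` parameters are taken from the URL above; the function name `category_changes_url` is only an illustrative assumption.

```python
from urllib.parse import urlencode

def category_changes_url(category, base="http://en.wikipedia.org/w/index.php"):
    """Build a Special:Recentchangeslinked URL targeting a category page."""
    query = urlencode({
        "title": "Special:Recentchangeslinked",
        "target": f"Category:{category}",
    })
    return f"{base}?{query}"
```

The standard encoder also percent-encodes the colon in the `title` value, which decodes to the same request.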

Re: How bad is a category with 21,008 pages for the servers?

Steve Bennett-4
On 8/21/06, Gregory Maxwell <[hidden email]> wrote:
> Surprised that you didn't name this one, since it is one of the more
> useful human oriented ones (and a primary application for the living
> people cat):
>
> e) Produce a filtered recent changes feed
> (http://en.wikipedia.org/w/index.php?title=Special:Recentchangeslinked&target=Category%3ALiving_people)

Heh, didn't know you could do that.

Lots of these functionalities would be better if they handled
subcategories, but for that to work we really need a better subcatting
system. But I haven't got a solution yet.

Steve

Re: How bad is a category with 21,008 pages for the servers?

Ligulem
In reply to this post by Steve Bennett-4
Steve Bennett wrote:
>  ..
> c) To assist quality control, such as labelling articles that need cleanup etc

It seems rather pointless to me to label 20k pages needing citation. But
that's probably not a technical question.

If I do a random walk on en Wikipedia, almost every page seems to have
some tag needing something ("This page is in need of <insert your pet
peeve here>"). This reminds me of all these "under construction" pages
in the earlier days of the web.


Re: How bad is a category with 21,008 pages for the servers?

Steve Bennett-4
On 8/21/06, Ligulem <[hidden email]> wrote:
> Steve Bennett wrote:
> >  ..
> > c) To assist quality control, such as labelling articles that need cleanup etc
>
> It seems rather pointless to me to label 20k pages needing citation. But
> that's probably not a technical question.

Totally agree on that one. There are some useful ones though, like
"pages needing categorisation", "pages needing LaTeX formatting" etc.
Any task that can be performed by a non-subject specialist, in
particular.

>
> If I do a random walk on en Wikipedia, almost every page seems to have
> some tag needing something ("This page is in need of <insert your pet
> peeve here>"). This reminds me of all these "under construction" pages
> in the earlier days of the web.

Yeah. I can't stand {{cleanup}} - what's the point? But I occasionally
use {{wfy}} or {{globalize-USA}} or whatever, the latter mostly to let
off steam :)

Steve

Re: How bad is a category with 21,008 pages for the servers?

Magnus Manske
In reply to this post by Steve Bennett-4
Steve Bennett wrote:

> On 8/21/06, Gregory Maxwell <[hidden email]> wrote:
>  
>> Surprised that you didn't name this one, since it is one of the more
>> useful human oriented ones (and a primary application for the living
>> people cat):
>>
>> e) Produce a filtered recent changes feed
>> (http://en.wikipedia.org/w/index.php?title=Special:Recentchangeslinked&target=Category%3ALiving_people)
>>    
>
> Heh, didn't know you could do that.
>
> Lots of these functionalities would be better if they handled
> subcategories, but for that to work we really need a better subcatting
> system. But I haven't got a solution yet.
>  
I have. It's in the current code, turned off. It's a filter to mass-sift
through articles fast. The only implemented use (also turned off) is to
filter Recent Changes to show only articles in a category *and its
subcategories*. I have asked to turn it on for testing some time ago,
but was more or less ignored (as usual;-).

It could be tested on one of the smaller wikis to check what impact it
would have on en or de.

Magnus



Re: How bad is a category with 21,008 pages for the servers?

Steve Bennett-4
On 8/21/06, Magnus Manske <[hidden email]> wrote:
> I have. It's in the current code, turned off. It's a filter to mass-sift
> through articles fast. The only implemented use (also turned off) is to
> filter Recent Changes to show only articles in a category *and its
> subcategories*. I have asked to turn it on for testing some time ago,
> but was more or less ignored (as usual;-).
>
> It could be tested on one of the smaller wikis to check what impact it
> would have on en or de.

How does it cope with category cycles? Basically I feel that since we
have no real definition of what "subcategory" means (on en, at least),
it's not that meaningful at this stage to presume that subcategories
should be searched along with the main category. Or perhaps the user
should be able to choose whether or not that's meaningful for the
category he's searching...

Steve

Re: How bad is a category with 21,008 pages for the servers?

Rob Church
In reply to this post by Andrew Dunbar
On 21/08/06, Andrew Dunbar <[hidden email]> wrote:
> I've been thinking for some weeks or more now that a good feature to improve
> the usefulness of large categories would be "Random article in this
> category". It would be excellent for maintenance where nobody is going to
> iterate through the whole lot but does like to try to keep order etc.

Someone tell me why the bollocking hell I never implemented that? I'm
pretty sure I set out to do so at least once in the past.

** adds it to the big list(tm) **


Rob Church

Re: How bad is a category with 21,008 pages for the servers?

Tim Starling
Rob Church wrote:
> On 21/08/06, Andrew Dunbar <[hidden email]> wrote:
>> I've been thinking for some weeks or more now that a good feature to improve
>> the usefulness of large categories would be "Random article in this
>> category". It would be excellent for maintenance where nobody is going to
>> iterate through the whole lot but does like to try to keep order etc.
>
> Someone tell me why the bollocking hell I never implemented that? I'm
> pretty sure I set out to do so at least once in the past.

Because it would require a DB query with an execution time proportional to
the number of articles in the category? Having categories with lots of
members is fine, as long as we don't try to traverse them all the time.

The efficient way to do it would be to enumerate the category members,
saving the results to a table for later lookup. The entries in this special
table could have an expiry time. Is that how you were planning on doing it?
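
The approach described above can be sketched roughly as follows: enumerate the members once, keep the result with an expiry time, and serve random picks from the stored list until it expires. This is a minimal sketch under assumptions, not an actual implementation: an in-memory dict stands in for the special table, and the class name, the `enumerate_members` callback and the one-hour expiry are all hypothetical.

```python
import random
import time

class CategoryMemberCache:
    """Cache enumerated category members so random picks avoid a full scan."""

    def __init__(self, enumerate_members, ttl_seconds=3600, clock=time.time):
        self._enumerate = enumerate_members  # the expensive O(n) traversal
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # category -> (expiry timestamp, list of members)

    def random_member(self, category, rng=random):
        now = self._clock()
        entry = self._entries.get(category)
        if entry is None or entry[0] <= now:
            # Expired or never built: enumerate once and store the result.
            members = list(self._enumerate(category))
            entry = (now + self._ttl, members)
            self._entries[category] = entry
        members = entry[1]
        return rng.choice(members) if members else None
```

The trade-off is staleness: pages added to or removed from the category are not reflected until the stored entry expires.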

-- Tim Starling


Re: How bad is a category with 21,008 pages for the servers?

Rob Church
On 21/08/06, Tim Starling <[hidden email]> wrote:
> Because it would require a DB query with an execution time proportional to
> the number of articles in the category? Having categories with lots of
> members is fine, as long as we don't try to traverse them all the time.

That would be it. I couldn't find an effective method which Domas
would agree with.

> The efficient way to do it would be to enumerate the category members,
> saving the results to a table for later lookup. The entries in this special
> table could have an expiry time. Is that how you were planning on doing it?

I don't bother planning things any more. Having four months' worth of
notes turn useless is not a fun experience.

More details? Enumerate them when, and save which results?


Rob Church

Re: How bad is a category with 21,008 pages for the servers?

Magnus Manske
In reply to this post by Steve Bennett-4
Steve Bennett wrote:

> On 8/21/06, Magnus Manske <[hidden email]> wrote:
>  
>> I have. It's in the current code, turned off. It's a filter to mass-sift
>> through articles fast. The only implemented use (also turned off) is to
>> filter Recent Changes to show only articles in a category *and its
>> subcategories*. I have asked to turn it on for testing some time ago,
>> but was more or less ignored (as usual;-).
>>
>> It could be tested on one of the smaller wikis to check what impact it
>> would have on en or de.
>>    
>
> How does it cope with category cycles?
IIRC it remembers which categories were already checked, and doesn't
cycle forever ;-)

Also, it starts with the categories a given list of articles is in,
then goes *down* through the tree, towards the parent/root (or was that
"up"? I keep forgetting), and checks whether it finds a given category.
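
A traversal of that kind can be sketched as a breadth-first walk from an article's categories towards the root, with a set of already-checked categories so that cycles terminate. This is only an illustration of the idea as described, not the code that is in MediaWiki; the function name and the `parents_of` dict are assumptions.

```python
from collections import deque

def is_in_category_tree(article_categories, parents_of, target):
    """Return True if `target` is reachable by walking up from the given categories.

    article_categories: categories the article is directly in.
    parents_of: dict mapping a category to the categories it belongs to.
    """
    seen = set(article_categories)
    queue = deque(article_categories)
    while queue:
        cat = queue.popleft()
        if cat == target:
            return True
        for parent in parents_of.get(cat, ()):
            if parent not in seen:  # skip already-checked categories
                seen.add(parent)
                queue.append(parent)
    return False
```

Each category is visited at most once, so the work is bounded by the number of supercategories reachable from the article, even when the category graph contains cycles.
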
> Basically I feel that since we
> have no real definition of what "subcategory" means (on en, at least),
> it's not that meaningful at this stage to presume that subcategories
> should be searched along with the main category. Or perhaps the user
> should be able to choose whether or not that's meaningful for the
> category he's searching...
>  
Well, saying "show me recent changes in biology and subcategories" is
definitely something I'd enjoy.


Magnus



Re: How bad is a category with 21,008 pages for the servers?

Andre Engels
2006/8/21, Magnus Manske <[hidden email]>:

> > Basically I feel that since we
> > have no real definition of what "subcategory" means (on en, at least),
> > it's not that meaningful at this stage to presume that subcategories
> > should be searched along with the main category. Or perhaps the user
> > should be able to choose whether or not that's meaningful for the
> > category he's searching...
> >
> Well, saying "show me recent changes in biology and subcategories" is
> definitely something I'd enjoy.

I'm afraid that it would be awfully time-consuming. I recently checked
the categories of one article and all their supercategories, and the
total ran into the hundreds, perhaps over 1,000. Subcategories, certainly
those of a high-level category like Biology, might well have the same
problem.


--
Andre Engels, [hidden email]
ICQ: 6260644  --  Skype: a_engels