Note: some load problems on upload & image scaler servers


Brion Vibber-3
A couple quick notes I tossed up on the tech blog:
http://techblog.wikimedia.org/2009/07/intermittent-media-server-load-problems/

Domas thinks it's related to this problem with ZFS snapshots badly
affecting NFS server performance in some cases:
http://www.opensolaris.org/jive/thread.jspa?messageID=64379

Actual load from clients doesn't seem problematic, but the NFS horror
can cause things to time out badly, which sometimes affects the main
apaches as well as the image scalers. (Especially when, say, deleting a
category of 100 image pages. :)

We've got it behaving reasonably well at the moment, but we'll want to
keep an eye on things until we've reduced the coupling between things a
bit...

-- brion

_______________________________________________
Wikitech-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: Note: some load problems on upload & image scaler servers

Domas Mituzas
Hi!

This is my view of what was happening. It may not be accurate, as it was
the first time I touched this part of the cluster ;-)

1. ms1 had just 128 web threads configured, and each could be tied up
looking up a file, serving it (and blocking if squid doesn't consume it
fast enough), or blocking on fastcgi handlers
2. even though ms1 I/O was loaded, it wasn't loaded enough to justify
20s waits on empty file creations via NFS
3. read operations via NFS were relatively fast compared with write
operations
4. pybal was quite aggressive in depooling scalers
5. scalers were in one of two states, blocked on NFS write or
depooled, due to a stampede of requests hitting one server while all
the others were depooled
6. if a scaler was depooled, not only would it not get requests, it
also wouldn't be able to write output back (my assumption, based on
their frozen states, not a verified fact)
7. due to 4-6, ms1 http threads would be stuck in the 404-handler fastcgi
(waiting for scalers to respond), thus also blocking the way out for
existing files
8. ms1 was spending lots of CPU in the zfs`metaslab_alloc() tree (the full
tree, if anyone is interested, is at http://flack.defau.lt/ms1.svg -
use FF3.5, then you can search for metaslab_alloc in it)
9. some digging around the internals (it is amazing how I forgot
pretty much everything I learned in 6h of ZFS sessions last
November ;-) showed that the costs could have been increased by the
number of our snapshots.

What was done:

1. Increased the number of worker threads on ms1 (why the heck was it
that small anyway)
2. Made balancing way less eager to depool servers (thanks sir Mark)
3. Disabled ZIL (it didn't have much of the expected effect though, as
the problem was elsewhere)
4. Dropped a few of the oldest snapshots - thus targeting the
metaslab_alloc() issue.
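
The depooling stampede in points 4-5 can be illustrated with a guard that refuses to depool once too few servers would remain. This is only a minimal sketch: the `can_depool` function and its `min_pooled_fraction` parameter are hypothetical names for illustration, not PyBal's actual code or configuration.

```python
import math


def can_depool(pooled: int, total: int, min_pooled_fraction: float = 0.5) -> bool:
    """Return True if one more server may be depooled.

    Hypothetical rule: never let the pooled set drop below
    min_pooled_fraction of all servers, even if health checks fail.
    """
    if total <= 0 or pooled <= 0:
        return False
    # Smallest number of servers that must stay pooled (at least one).
    floor = max(1, math.ceil(total * min_pooled_fraction))
    return pooled - 1 >= floor


if __name__ == "__main__":
    # With 8 scalers and a 50% floor, depooling stops at 4 pooled servers.
    for pooled in (8, 5, 4, 1):
        print(pooled, can_depool(pooled, 8))
```

With a floor like this, slow health checks during an NFS stall would leave several slow scalers in rotation instead of funnelling every request to the last one standing.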

Cheers,
Domas

P.S. For anyone Solaris savvy (I am not, despite where I work), you  
know what this means:

               unix`mutex_delay_default+0x7
               unix`mutex_vector_enter+0x99
               genunix`cv_wait+0x70
               zfs`space_map_load_wait+0x20
               zfs`space_map_load+0x36
               zfs`metaslab_activate+0x6f
               zfs`metaslab_group_alloc+0x18d
               zfs`metaslab_alloc_dva+0xdb
               zfs`metaslab_alloc+0x6d
               zfs`zio_dva_allocate+0x62
               zfs`zio_execute+0x60
               genunix`taskq_thread+0xbc
               unix`thread_start+0x8
              3816

               unix`mutex_delay_default+0xa
               unix`mutex_vector_enter+0x99
               genunix`cv_wait+0x70
               zfs`space_map_load_wait+0x20
               zfs`space_map_load+0x36
               zfs`metaslab_activate+0x6f
               zfs`metaslab_group_alloc+0x18d
               zfs`metaslab_alloc_dva+0xdb
               zfs`metaslab_alloc+0x6d
               zfs`zio_dva_allocate+0x62
               zfs`zio_execute+0x60
               genunix`taskq_thread+0xbc
               unix`thread_start+0x8
              4068

               unix`mutex_delay_default+0xa
               unix`mutex_vector_enter+0x99
               zfs`metaslab_group_alloc+0x136
               zfs`metaslab_alloc_dva+0xdb
               zfs`metaslab_alloc+0x6d
               zfs`zio_write_allocate_gang_members+0x171
               zfs`zio_dva_allocate+0xcc
               zfs`zio_execute+0x60
               genunix`taskq_thread+0xbc
               unix`thread_start+0x8
              4615

              7500

               unix`mutex_delay_default+0x7
               unix`mutex_vector_enter+0x99
               zfs`metaslab_group_alloc+0x136
               zfs`metaslab_alloc_dva+0xdb
               zfs`metaslab_alloc+0x6d
               zfs`zio_dva_allocate+0x62
               zfs`zio_execute+0x60
               genunix`taskq_thread+0xbc
               unix`thread_start+0x8
              7785

               unix`mutex_delay_default+0xa
               unix`mutex_vector_enter+0x99
               zfs`metaslab_group_alloc+0x136
               zfs`metaslab_alloc_dva+0xdb
               zfs`metaslab_alloc+0x6d
               zfs`zio_dva_allocate+0x62
               zfs`zio_execute+0x60
               genunix`taskq_thread+0xbc
               unix`thread_start+0x8
             10487

               genunix`avl_walk+0x39
               zfs`space_map_alloc+0x21
               zfs`metaslab_group_alloc+0x1a2
               zfs`metaslab_alloc_dva+0xdb
               zfs`metaslab_alloc+0x6d
               zfs`zio_write_allocate_gang_members+0x171
               zfs`zio_dva_allocate+0xcc
               zfs`zio_execute+0x60
               genunix`taskq_thread+0xbc
               unix`thread_start+0x8
             16297

               genunix`avl_walk+0x39
               zfs`space_map_alloc+0x21
               zfs`metaslab_group_alloc+0x1a2
               zfs`metaslab_alloc_dva+0xdb
               zfs`metaslab_alloc+0x6d
               zfs`zio_dva_allocate+0x62
               zfs`zio_execute+0x60
               genunix`taskq_thread+0xbc
               unix`thread_start+0x8
             26149
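
For readers who want to total such output per hot function, here is a minimal sketch. It assumes the layout shown above: one frame per line, then a bare sample count that closes the stack. The `leaf_totals` name and the sample text are made up for illustration.

```python
from collections import Counter


def leaf_totals(dtrace_output: str) -> Counter:
    """Sum sample counts per innermost (first-listed) frame."""
    totals = Counter()
    frames = []
    for line in dtrace_output.splitlines():
        token = line.strip()
        if not token:
            continue
        if token.isdigit():
            # A bare number closes the current stack and is its count.
            leaf = frames[0].split("+")[0] if frames else "(no stack)"
            totals[leaf] += int(token)
            frames = []
        else:
            frames.append(token)
    return totals


if __name__ == "__main__":
    # Abbreviated, made-up sample in the same layout as the output above.
    sample = """
        unix`mutex_delay_default+0x7
        zfs`metaslab_alloc+0x6d
        10

        genunix`avl_walk+0x39
        zfs`metaslab_alloc+0x6d
        25
    """
    print(leaf_totals(sample).most_common())
```

A count that appears without any frames is filed under "(no stack)" rather than being guessed at.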




Re: Note: some load problems on upload & image scaler servers

William Allen Simpson-2
While editing, I've never seen this odd error before:

   302 (Moved Temporarily)

Also, still seeing a fair number of:

   Proxy Error

   The proxy server received an invalid response from an upstream server.
   The proxy server could not handle the request POST /wikipedia/en/w/index.php.

   Reason: Error reading from remote server

In either case, when I show history, the edit *has* been posted.


Re: Note: some load problems on upload & image scaler servers

David Gerard-2
In reply to this post by Domas Mituzas
2009/7/13 Domas Mituzas <[hidden email]>:

> 8. ms1 was spending lots of CPU in zfs`metaslab_alloc() tree (full
> tree, if anyone is interested, is at http://flack.defau.lt/ms1.svg -
> use FF3.5, then you can search for metaslab_alloc in it.


This sounds *very* like a ZFS bug in Solaris 10 that we struck at work
a while ago:

Precis: if the file system is very busy (being hammered) *and* it's
over 85% full, the block allocator can get stuck trying to work out
the *very best* allocation rather than one that'll do and let it get
on with other work. To the point where you see CPU go through the
roof, with 80% system CPU and a very unresponsive system. You can't
stop this without rebooting the box.

Sun acknowledged it as a bug and it'll be fixed in a future release;
they gave us a hotpatch. The workaround? Keep the ZFS filesystem in
question under 70% full ...
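
The two thresholds mentioned here (trouble above 85% full, workaround of staying under 70%) can be turned into a simple check. This is a rough sketch: `check_capacity` is a hypothetical helper, and `shutil.disk_usage` reports usage for a mounted filesystem, which is only a stand-in for whatever pool-level figure the storage host actually exposes.

```python
import shutil


def check_capacity(used: int, total: int,
                   warn_at: float = 0.70, danger_at: float = 0.85) -> str:
    """Classify how full a filesystem is against the two thresholds."""
    if total <= 0:
        raise ValueError("total must be positive")
    fraction = used / total
    if fraction >= danger_at:
        return "danger"
    if fraction >= warn_at:
        return "warn"
    return "ok"


if __name__ == "__main__":
    usage = shutil.disk_usage("/")  # named tuple: total, used, free
    print(check_capacity(usage.used, usage.total))
```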

This is an obscure bug and isn't reason to avoid ZFS in general - the
bug only gets tickled in particular circumstances, when ZFS is having
the heck beaten out of it. I'd still happily recommend ZFS for almost
anything, because it really is *that cool*.


- d.


Re: Note: some load problems on upload & image scaler servers

Domas Mituzas
Dude,

> Precis: if the file system is very busy (being hammered) *and* it's
> over 85% full, the block allocator can get stuck trying to work out
> the *very best* allocation rather than one that'll do and let it get
> on with other work. To the point where you see CPU go through the
> roof, with 80% system CPU and a very unresponsive system. You can't
> stop this without rebooting the box.

This is exactly what we're seeing, except that we could get out of it  
by dropping older snapshots.

> Sun acknowledged it as a bug and it'll be fixed in a future release;
> they gave us a hotpatch. The workaround? Keep the ZFS filesystem in
> question under 70% full ...

:-)

> This is an obscure bug and isn't reason to avoid ZFS in general - the
> bug only gets tickled in particular circumstances, when ZFS is having
> the heck beaten out of it. I'd still happily recommend ZFS for almost
> anything, because it really is *that cool*.

hehehehehe, 'the heck beaten out of it' sounds like what we tend to do  
to our systems at wikimedia ;-)
by the way, if you know such details, what are you doing in editing  
community. get over to the dark side ;-))

Domas


Re: Note: some load problems on upload & image scaler servers

David Gerard-2
2009/7/13 Domas Mituzas <[hidden email]>:

> hehehehehe, 'the heck beaten out of it' sounds like what we tend to do
> to our systems at wikimedia ;-)
> by the way, if you know such details, what are you doing in editing
> community. get over to the dark side ;-))


If the WMF can pay me £35k to sysadmin (currently looking for £45k but
this is for charity), I am SO THERE.

If not, I have a family to feed and rent to pay ;-p


- d.


Re: Note: some load problems on upload & image scaler servers

David Gerard-2
In reply to this post by Domas Mituzas
2009/7/13 Domas Mituzas <[hidden email]>:

>> Precis: if the file system is very busy (being hammered) *and* it's
>> over 85% full, the block allocator can get stuck trying to work out
>> the *very best* allocation rather than one that'll do and let it get
>> on with other work. To the point where you see CPU go through the
>> roof, with 80% system CPU and a very unresponsive system. You can't
>> stop this without rebooting the box.

> This is exactly what we're seeing, except that we could get out of it
> by dropping older snapshots.


Yeah - cutting down how full the file system is.


>> Sun acknowledged it as a bug and it'll be fixed in a future release;
>> they gave us a hotpatch. The workaround? Keep the ZFS filesystem in
>> question under 70% full ...

> :-)
> hehehehehe, 'the heck beaten out of it' sounds like what we tend to do
> to our systems at wikimedia ;-)


It's useful testing, and you can be sure Sun will be interested in
your results in detail; we're a reasonably famous site! A coworker
spoke to the Sun kernel engineer tearing his hair out over this one
...

I fear the answer re: ZFS is to some extent "don't do that then" until
it's fixed. Of course, you want snapshots. It's a tricky one.


- d.


Re: Note: some load problems on upload & image scaler servers

David Gerard-2
In reply to this post by David Gerard-2
2009/7/13 David Gerard <[hidden email]>:
> 2009/7/13 Domas Mituzas <[hidden email]>:

>> hehehehehe, 'the heck beaten out of it' sounds like what we tend to do
>> to our systems at wikimedia ;-)
>> by the way, if you know such details, what are you doing in editing
>> community. get over to the dark side ;-))

> If the WMF can pay me £35k to sysadmin (currently looking for £45k but
> this is for charity), I am SO THERE.
> If not, I have a family to feed and rent to pay ;-p


And of course if the WMF has £0, as is more likely, feel free to ask
me about Solaris horrors and I'll do what I can when I can ;-)

(good lord, 8 yrs Solaris on my CV. Let's hope Oracle keeps it
well-fed and it doesn't become the next VMS jobmarketwise.)


- d.


URLs that aren't cool...

Paul Houle
In reply to this post by David Gerard-2
I've been looking at the id structure of dbpedia and wikipedia and
finally found an example where case sensitivity issues really bite.

Cases like this with a "redirect" are a little obnoxious,

http://en.wikipedia.org/wiki/New_York_City
http://en.wikipedia.org/wiki/New_york_city

largely because there isn't a redirect... The same page gets displayed
at each URL. (OK, the "redirect" has a little extra stuff at the top
saying that it's a redirect.)

dbpedia has separate resource pages for the above cases, so at least
it's explaining the situation clearly -- reasoning systems that work
with dbpedia need to be able to read this.

Here's a case that's just plain bad...

http://en.wikipedia.org/wiki/Direct_instruction
http://en.wikipedia.org/wiki/Direct_Instruction

Last time I looked there were about 10,000 Wikipedia URLs that varied
only by case. In this particular case, there are two articles about the
same topic, but there could be some cases where the two articles are
about something different.


Re: URLs that aren't cool...

Aryeh Gregor
On Tue, Jul 28, 2009 at 11:53 AM, Paul Houle<[hidden email]> wrote:
> I've been looking at the id structure of dbpedia and wikipedia and
> finally found an example where case sensitivity issues really bite.

We should keep in mind that case isn't so clear-cut if you move away
from English, though -- is "groß" the same as "GROSS" and thus the
same as "gross"?  How about languages that don't even have bijections
between uppercase and lowercase if you stick to the same dialect?
(I'm pretty sure there are some; don't some languages strip diacritics
from uppercase letters?)  There's probably some Unicode standard on
normalization with respect to case, but it's not actually so simple in
an international context.
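
The "groß" question can be tried out directly. A small sketch using Python's built-in `str.casefold()`, which applies locale-independent Unicode case folding; the helper name `same_ignoring_case` is made up for illustration.

```python
def same_ignoring_case(a: str, b: str) -> bool:
    """Compare two strings using Unicode case folding."""
    return a.casefold() == b.casefold()


if __name__ == "__main__":
    # Case folding treats all three spellings as equal.
    print(same_ignoring_case("groß", "GROSS"))   # True
    print(same_ignoring_case("GROSS", "gross"))  # True
    # Upper/lower casing is not a bijection: the round trip loses the ß.
    print("groß".upper().lower() == "groß")      # False
```

Whether folding "groß" and "gross" together is the desired behaviour for page titles is exactly the policy question raised above; the code only shows what the default folding does.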

That said, I think case-insensitivity would be a good thing to support
in the long run, optionally, and that it would probably be suitable
for all Wikipedias.  Or at least almost all, if there are languages
out there where case insensitivity is a real headache -- hopefully
not, since most languages don't have letter case at all.  At any rate
it would be good on enwiki.

But it would require a lot of tedious and error-prone conversion of
old code.  Everything tends to assume that a)
$title->getPrefixedText() is what should be displayed to the user, but
b) two titles are equal if and only if their
$title->getPrefixedText()s are equal.  Likewise for
$title->getPrefixedDbKey().  Those would need to be systematically and
thoroughly fixed.  We'd also have to add a field to the page table or
such to store the normalized form of the title, and fiddle with the
indexes appropriately, and update all other tables to use the
normalized form.  A lot of work.
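
The extra normalized field could look roughly like this. The sketch uses an in-memory SQLite table rather than MediaWiki's real schema; the column name `page_title_norm`, the index name, and the choice of NFC plus case folding as the normalized form are all assumptions for illustration.

```python
import sqlite3
import unicodedata


def normalize_title(title: str) -> str:
    """Hypothetical normalized form: NFC, then Unicode case folding."""
    return unicodedata.normalize("NFC", title).casefold()


conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE page (
        page_title      TEXT PRIMARY KEY,  -- form displayed to the user
        page_title_norm TEXT NOT NULL      -- form used for comparisons
    )
    """
)
conn.execute("CREATE INDEX page_norm_idx ON page (page_title_norm)")


def add_page(title: str) -> None:
    conn.execute("INSERT INTO page VALUES (?, ?)",
                 (title, normalize_title(title)))


def find_pages(title: str) -> list:
    """Return every stored title whose normalized form matches."""
    rows = conn.execute(
        "SELECT page_title FROM page WHERE page_title_norm = ? "
        "ORDER BY page_title",
        (normalize_title(title),),
    )
    return [row[0] for row in rows]


for t in ("Direct instruction", "Direct Instruction", "Frog"):
    add_page(t)
```

Here `find_pages("DIRECT INSTRUCTION")` returns both stored spellings, which shows the conflict a migration would have to resolve before the normalized column could be made unique.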

(But at least we could get rid of the silly Text/DbKey distinction
while we're doing this.  I've heard recent MySQL versions actually
support storage of ASCII space characters in text fields!)


Re: URLs that aren't cool...

M. Williamson
Case insensitivity shouldn't be a problem for any language, as long as
you do it properly.

Turkish and other languages using dotless i, for example, will need a
special rule - Turkish lowercase dotted i capitalizes to a capital
dotted İ while lowercase undotted ı capitalizes to regular undotted I.
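
The special rule can be sketched in a few lines. The function names are made up for illustration; the code simply maps the two Turkish i letters by hand before applying Python's default, locale-independent case conversion.

```python
def turkish_upper(text: str) -> str:
    """Uppercase with the Turkish rule: i -> İ and ı -> I."""
    return text.replace("i", "İ").replace("ı", "I").upper()


def turkish_lower(text: str) -> str:
    """Lowercase with the Turkish rule: İ -> i and I -> ı."""
    return text.replace("İ", "i").replace("I", "ı").lower()


if __name__ == "__main__":
    print(turkish_upper("diyarbakır"))
    print(turkish_lower("DİYARBAKIR"))
```

The rule only gives the right answer when the text is known to be Turkish: applied to an English word, `turkish_lower("I")` yields "ı" rather than "i".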

skype: node.ue

On Tue, Jul 28, 2009 at 9:26 AM, Aryeh
Gregor<[hidden email]> wrote:

> On Tue, Jul 28, 2009 at 11:53 AM, Paul Houle<[hidden email]> wrote:
>> I've been looking at the id structure of dbpedia and wikipedia and
>> finally found an example where case sensitivity issues really bite.
>
> We should keep in mind that case isn't so clear-cut if you move away
> from English, though -- is "groß" the same as "GROSS" and thus the
> same as "gross"?  How about languages that don't even have bijections
> between uppercase and lowercase if you stick to the same dialect?
> (I'm pretty sure there are some; don't some languages strip diacritics
> from uppercase letters?)  There's probably some Unicode standard on
> normalization with respect to case, but it's not actually so simple in
> an international context.
>
> That said, I think case-insensitivity would be a good thing to support
> in the long run, optionally, and that it would probably be suitable
> for all Wikipedias.  Or at least almost all, if there are languages
> out there where case insensitivity is a real headache -- hopefully
> not, since most languages don't have letter case at all.  At any rate
> it would be good on enwiki.
>
> But it would require a lot of tedious and error-prone conversion of
> old code.  Everything tends to assume that a)
> $title->getPrefixedText() is what should be displayed to the user, but
> b) two titles are equal if and only if their
> $title->getPrefixedText()s are equal.  Likewise for
> $title->getPrefixedDbKey().  Those would need to be systematically and
> thoroughly fixed.  We'd also have to add a field to the page table or
> such to store the normalized form of the title, and fiddle with the
> indexes appropriately, and update all other tables to use the
> normalized form.  A lot of work.
>
> (But at least we could get rid of the silly Text/DbKey distinction
> while we're doing this.  I've heard recent MySQL versions actually
> support storage of ASCII space characters in text fields!)


Re: URLs that aren't cool...

Aryeh Gregor
On Tue, Jul 28, 2009 at 12:52 PM, Mark Williamson<[hidden email]> wrote:
> Case insensitivity shouldn't be a problem for any language, as long as
> you do it properly.
>
> Turkish and other languages using dotless i, for example, will need a
> special rule - Turkish lowercase dotted i capitalizes to a capital
> dotted İ while lowercase undotted ı capitalizes to regular undotted I.

And so what if a wiki is multilingual and you don't know what language
the page name is in?  What if a Turkish wiki contains some English
page names as loan words, for instance?


Re: URLs that aren't cool...

Brion Vibber-3
On 7/28/09 10:04 AM, Aryeh Gregor wrote:

> On Tue, Jul 28, 2009 at 12:52 PM, Mark Williamson<[hidden email]>  wrote:
>> Case insensitivity shouldn't be a problem for any language, as long as
>> you do it properly.
>>
>> Turkish and other languages using dotless i, for example, will need a
>> special rule - Turkish lowercase dotted i capitalizes to a capital
>> dotted İ while lowercase undotted ı capitalizes to regular undotted I.
>
> And so what if a wiki is multilingual and you don't know what language
> the page name is in?  What if a Turkish wiki contains some English
> page names as loan words, for instance?

Indeed, good handling of case-insensitive matchings would be a big win
for human usability, but it's not easy to get right in all cases.

The main problems are:

1) Conflicts when we really do consider something separate, but the case
folding rules match them together

2) Language-specific case folding rules in a multilingual environment

Turkish I with/without dot and German ß not always matching to SS are
the primary examples off the top of my head. Also, some languages tend
to drop accent markers in capital form (eg, Spanish). What can or should
we do here?


A nearer-term help would be to go ahead and implement what we talked
about a billion years ago but never got around to -- a decent "did you
mean X?" message to display when you go to an empty page but there's
something similar nearby.

If it's at least trivial to click through from [[New york city]] to
[[New York City]], that's better than having to search for it anew.

Of course we have some case-insensitive matching for near-matches on
"go" searches... we could pull from that easily. [Note this is done via
TitleKey for full case-insensitivity at present... and it probably
doesn't handle Turkish correctly yet.]
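
A "did you mean X?" lookup can be sketched as an index keyed by the case-folded title. This is an illustration only and does not describe how TitleKey works; `build_index` and `did_you_mean` are hypothetical names, and simple Unicode case folding is assumed as the matching rule.

```python
from collections import defaultdict


def build_index(titles):
    """Group existing titles by their case-folded form."""
    index = defaultdict(list)
    for title in titles:
        index[title.casefold()].append(title)
    return index


def did_you_mean(requested, index):
    """Return existing titles that differ from `requested` only by case."""
    candidates = index.get(requested.casefold(), [])
    return [title for title in candidates if title != requested]


if __name__ == "__main__":
    index = build_index(["New York City", "Frog", "FROG"])
    print(did_you_mean("New york city", index))
    print(did_you_mean("frog", index))
```

Because the lookup only produces suggestions, titles that collide under folding can all be offered to the reader and nothing has to be merged.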

-- brion


Re: URLs that aren't cool...

M. Williamson
Since when does Spanish drop accent markers in capital form? If you
have seen anybody do this, it is just a misspelling. For example:
http://es.wikipedia.org/wiki/Ópera or
http://es.wikipedia.org/wiki/África or
http://es.wikipedia.org/wiki/Océano_Índico

I have been told that Greek drops accents in capital form but this may
not be true. Other than that, though, I am not acquainted with any
language that does such a thing (but of course that doesn't mean none
exist).

Mark

skype: node.ue



On Tue, Jul 28, 2009 at 10:16 AM, Brion Vibber<[hidden email]> wrote:

> On 7/28/09 10:04 AM, Aryeh Gregor wrote:
>> On Tue, Jul 28, 2009 at 12:52 PM, Mark Williamson<[hidden email]>  wrote:
>>> Case insensitivity shouldn't be a problem for any language, as long as
>>> you do it properly.
>>>
>>> Turkish and other languages using dotless i, for example, will need a
>>> special rule - Turkish lowercase dotted i capitalizes to a capital
>>> dotted İ while lowercase undotted ı capitalizes to regular undotted I.
>>
>> And so what if a wiki is multilingual and you don't know what language
>> the page name is in?  What if a Turkish wiki contains some English
>> page names as loan words, for instance?
>
> Indeed, good handling of case-insensitive matchings would be a big win
> for human usability, but it's not easy to get right in all cases.
>
> The main problems are:
>
> 1) Conflicts when we really do consider something separate, but the case
> folding rules match them together
>
> 2) Language-specific case folding rules in a multilingual environment
>
> Turkish I with/without dot and German ß not always matching to SS are
> the primary examples off the top of my head. Also, some languages tend
> to drop accent markers in capital form (eg, Spanish). What can or should
> we do here?
>
>
> A nearer-term help would be to go ahead and implement what we talked
> about a billion years ago but never got around to -- a decent "did you
> mean X?" message to display when you go to an empty page but there's
> something similar nearby.
>
> If it's at least trivial to click through from [[New york city]] to
> [[New York City]], that's better than having to search for it anew.
>
> Of course we have some case-insensitive matching for near-matches on
> "go" searches... we could pull from that easily. [Note this is done via
> TitleKey for full case-insensitivity at present... and it probably
> doesn't handle Turkish correctly yet.]
>
> -- brion


Re: URLs that aren't cool...

Tei-2
The related Wikipedia article says that it was an urban legend:

http://es.wikipedia.org/wiki/Acentuaci%C3%B3n_de_las_may%C3%BAsculas

So it is wrong to drop these accents.


On Tue, Jul 28, 2009 at 7:21 PM, Mark Williamson<[hidden email]> wrote:

> Since when does Spanish drop accent markers in capital form? If you
> have seen anybody do this, it is just a misspelling. For example:
> http://es.wikipedia.org/wiki/Ópera or
> http://es.wikipedia.org/wiki/África or
> http://es.wikipedia.org/wiki/Océano_Índico
>
> I have been told that Greek drops accents in capital form but this may
> not be true. Other than that, though, I am not acquainted with any
> language that does such a thing (but of course that doesn't mean none
> exist).
>
> Mark
>
> skype: node.ue
>
>
>
> On Tue, Jul 28, 2009 at 10:16 AM, Brion Vibber<[hidden email]> wrote:
>> On 7/28/09 10:04 AM, Aryeh Gregor wrote:
>>> On Tue, Jul 28, 2009 at 12:52 PM, Mark Williamson<[hidden email]>  wrote:
>>>> Case insensitivity shouldn't be a problem for any language, as long as
>>>> you do it properly.
>>>>
>>>> Turkish and other languages using dotless i, for example, will need a
>>>> special rule - Turkish lowercase dotted i capitalizes to a capital
>>>> dotted İ while lowercase undotted ı capitalizes to regular undotted I.
>>>
>>> And so what if a wiki is multilingual and you don't know what language
>>> the page name is in?  What if a Turkish wiki contains some English
>>> page names as loan words, for instance?
>>
>> Indeed, good handling of case-insensitive matchings would be a big win
>> for human usability, but it's not easy to get right in all cases.
>>
>> The main problems are:
>>
>> 1) Conflicts when we really do consider something separate, but the case
>> folding rules match them together
>>
>> 2) Language-specific case folding rules in a multilingual environment
>>
>> Turkish I with/without dot and German ß not always matching to SS are
>> the primary examples off the top of my head. Also, some languages tend
>> to drop accent markers in capital form (eg, Spanish). What can or should
>> we do here?



--
--
ℱin del ℳensaje.


Re: URLs that aren't cool...

Brion Vibber-3
On 7/28/09 10:30 AM, Tei wrote:
> The related Wikipedia article says that it was an urban legend:
>
> http://es.wikipedia.org/wiki/Acentuaci%C3%B3n_de_las_may%C3%BAsculas

Dang! I've been taken in again by exposure to real-world practice
instead of what's correct. ;)

(In any case, handling that case nicely is wise too.)

-- brion


Re: URLs that aren't cool...

Roan Kattouw-2
In reply to this post by M. Williamson
2009/7/28 Mark Williamson <[hidden email]>:

> Since when does Spanish drop accent markers in capital form? If you
> have seen anybody do this, it is just a misspelling. For example:
> http://es.wikipedia.org/wiki/Ópera or
> http://es.wikipedia.org/wiki/África or
> http://es.wikipedia.org/wiki/Océano_Índico
>
> I have been told that Greek drops accents in capital form but this may
> not be true. Other than that, though, I am not acquainted with any
> language that does such a thing (but of course that doesn't mean none
> exist).
>
Frisian (fy) does drop accents in capitals, FWIW.

Roan Kattouw (Catrope)


Re: [Dbpedia-discussion] URLs that aren't cool...

Paul Houle
In reply to this post by Paul Houle
Georgi Kobilarov wrote:

>> In this particular one, it's two articles about the same
>> topic, but there could be some cases where the two articles are about
>> something different.
>
> Yes, such as http://en.wikipedia.org/wiki/FROG
> and http://en.wikipedia.org/wiki/Frog
>
> I agree that this can be annoying. One has to make sure not to lose the
> case information (as happened to me with lookup.dbpedia.org once, hence
> merging FROG and Frog).
>
> But what do you suggest to do about that, Paul? Should Wikipedia make URLs
> case-insensitive and then enforce disambiguation with ()?

If Wikipedia were my site, I'd do two things:

(i) map all case-variant forms to a single form (New yOrK cITy -> New
York City); "FROG" gets renamed to "FROG Cipher" or "Frog (Cipher)"
(ii) do a permanent redirect from variant forms to the canonical form

I think what dbpedia is doing is reasonable considering the situation.

My own system for handling generic databases has both a VARBINARY
and a VARCHAR field for dbpedia URLs/labels. It does a case-insensitive
lookup first, and if that fails, looks at the alternatives that turn
up. It's also got some heuristics for dealing with redirects,
disambiguation, and all that. In the big picture I see "naming and
identity" as a specific functional module for this kind of system...
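
Steps (i) and (ii) can be sketched as a small resolver. It is a sketch under stated assumptions: the `resolve` function and the `canonical_by_key` mapping are hypothetical, case folding stands in for the mapping rule, and the sample titles assume the FROG article has already been renamed.

```python
def resolve(path, canonical_by_key):
    """Return (status, location) for a requested title.

    canonical_by_key maps a case-folded title to its single canonical form.
    """
    canonical = canonical_by_key.get(path.casefold())
    if canonical is None:
        return 404, None
    if canonical == path:
        return 200, path
    # Any other spelling gets a permanent redirect to the canonical form.
    return 301, canonical


if __name__ == "__main__":
    titles = ["New_York_City", "Frog", "Frog_(Cipher)"]
    canonical_by_key = {title.casefold(): title for title in titles}
    print(resolve("New_yOrK_cITy", canonical_by_key))
    print(resolve("Frog", canonical_by_key))
```

The mapping only works if step (i) has been completed, so that no two canonical titles share a folded key.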


Re: URLs that aren't cool...

Helder Geovane Gomes de Lima
In reply to this post by Brion Vibber-3
2009/7/28 Brion Vibber <[hidden email]>:
> A nearer-term help would be to go ahead and implement what we talked
> about a billion years ago but never got around to -- a decent "did you
> mean X?" message to display when you go to an empty page but there's
> something similar nearby.
>
> If it's at least trivial to click through from [[New york city]] to
> [[New York City]], that's better than having to search for it anew.

I think it would be really good to implement this, since it would also
help us when creating and following interwiki links (see also point 3
of what I was talking about here:
http://lists.wikimedia.org/pipermail/wikitech-l/2009-July/044007.html)

Helder


Re: URLs that aren't cool...

Platonides
In reply to this post by Brion Vibber-3
Brion Vibber wrote:
> On 7/28/09 10:30 AM, Tei wrote:
>> The related Wikipedia article says that it was an urban legend:
>>
>> http://es.wikipedia.org/wiki/Acentuaci%C3%B3n_de_las_may%C3%BAsculas
>
> Dang! I've been taken in again by exposure to real-world practice
> instead of what's correct. ;)

Once upon a time, mechanical typewriters weren't able to properly
accentuate capital letters.

> (In any case, handling that case nicely is wise too.)
>
> -- brion

On the Spanish Wikipedia there are some bots creating redirects from
lowercased titles with accents dropped, so that the article shows up
when searching without the exact spelling.
I don't really like it, but where the software doesn't work, users get
inventive.
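
The kind of key such a bot would derive can be sketched with Python's standard `unicodedata` module. The `search_key` name is made up, and this is not the bots' actual code: it decomposes each character, drops the combining marks, and lowercases the result.

```python
import unicodedata


def search_key(title: str) -> str:
    """Lowercase a title and drop its accents (combining marks)."""
    decomposed = unicodedata.normalize("NFD", title)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return unicodedata.normalize("NFC", stripped).lower()


if __name__ == "__main__":
    for title in ("Ópera", "África", "Océano Índico"):
        print(title, "->", search_key(title))
```

One caveat of this approach is that it also turns "ñ" into "n", which in Spanish is a different letter, not an accented one.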

