Note: some load problems on upload & image scaler servers

Re: URLs that aren't cool...

Nikola Smolenski
On Tuesday 28 July 2009 19:16:22, Brion Vibber wrote:
> On 7/28/09 10:04 AM, Aryeh Gregor wrote:
> > On Tue, Jul 28, 2009 at 12:52 PM, Mark Williamson <[hidden email]> wrote:
> >> Case insensitivity shouldn't be a problem for any language, as long as
> >> you do it properly.
> >>
> >> Turkish and other languages using dotless i, for example, will need a
> >> special rule - Turkish lowercase dotted i capitalizes to a capital
> >> dotted İ while lowercase undotted ı capitalizes to regular undotted I.
> >
> > And so what if a wiki is multilingual and you don't know what language
> > the page name is in?  What if a Turkish wiki contains some English
> > page names as loan words, for instance?
>
> Indeed, good handling of case-insensitive matchings would be a big win
> for human usability, but it's not easy to get right in all cases.
>
> The main problems are:
>
> 1) Conflicts when we really do consider something separate, but the case
> folding rules match them together
>
> 2) Language-specific case folding rules in a multilingual environment
>
> Turkish I with/without dot and German ß not always matching to SS are
> the primary examples off the top of my head. Also, some languages tend
> to drop accent markers in capital form (eg, Spanish). What can or should
> we do here?

Similar to an automatic redirect, we could build an automatic disambiguation
page. For example, someone on srwiki going to [[Dj]] would get:

Did you mean:

* [[Đ]]
* [[DJ]]
* [[D.J.]]

> A nearer-term help would be to go ahead and implement what we talked
> about a billion years ago but never got around to -- a decent "did you
> mean X?" message to display when you go to an empty page but there's
> something similar nearby.

I have been thinking a lot about this. The best solution I came up with is to
add a column, "page_title_canonical", to the page table. When an article is
created or moved, this canonical title is built from the real title. When an
article is looked up and there is no match in page_title, build the canonical
title from the URL and look for a match in page_title_canonical. If there is
exactly one match, display "did you mean X" or even go there automatically, as
if from a redirect; if there are multiple matches, display
"did you mean *X, *X1".
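
A minimal Python sketch of that lookup order, using an in-memory index in
place of the page table. The class name, method names, and result labels are
made up for illustration, and the canonicalisation function is passed in
rather than fixed; none of this is existing MediaWiki code.

```python
from collections import defaultdict


class TitleIndex:
    """In-memory stand-in for a page table with a page_title_canonical column."""

    def __init__(self, canonicalize):
        self.canonicalize = canonicalize       # function: title -> canonical title
        self.titles = set()                    # plays the role of page_title
        self.by_canonical = defaultdict(set)   # page_title_canonical -> real titles

    def add(self, title):
        # Called when an article is created (or is the target of a move).
        self.titles.add(title)
        self.by_canonical[self.canonicalize(title)].add(title)

    def lookup(self, requested):
        # 1. An exact match in page_title always wins.
        if requested in self.titles:
            return ("exact", [requested])
        # 2. Otherwise fall back to the canonical form of the requested title.
        key = self.canonicalize(requested)
        matches = sorted(self.by_canonical.get(key, ()))
        if len(matches) == 1:
            return ("redirect", matches)       # single candidate: go there
        if matches:
            return ("did_you_mean", matches)   # several candidates: list them
        return ("missing", [])
```

With plain lowercasing as a stand-in canonicaliser, TitleIndex(str.lower)
resolves a request for "dj" to "DJ" when that is the only page, and returns a
"did you mean" list once both "DJ" and "Dj" exist.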

This canonical title would be made like this:
* Remove disambiguator from the title if it exists
* Remove punctuation and the like
* Transliterate the title to Latin alphabet
* Transliterate to pure ASCII
* Lowercase
* Order the words alphabetically

What could possibly go wrong?
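
The steps above could be sketched in Python roughly as follows. The two
transliteration tables are tiny, hypothetical samples (a real wiki would need
per-script data), the disambiguator is assumed to be a trailing parenthesised
suffix, and lowercasing is done earlier than in the list so the tables only
need lowercase keys.

```python
import re
import unicodedata

# Hypothetical sample tables, for illustration only.
CYRILLIC_TO_LATIN = {"ђ": "đ", "д": "d", "ј": "j"}
LATIN_TO_ASCII = {"đ": "dj", "ß": "ss", "ı": "i"}


def canonical_title(title: str) -> str:
    # 1. Remove a trailing disambiguator such as " (planet)".
    title = re.sub(r"\s*\([^()]*\)\s*$", "", title)
    # 2. Remove punctuation: anything that is not a word character or whitespace.
    title = re.sub(r"[^\w\s]", "", title)
    # 5. Lowercase (moved earlier than in the list, see the note above).
    title = title.lower()
    # 3. Transliterate to the Latin alphabet.
    title = "".join(CYRILLIC_TO_LATIN.get(ch, ch) for ch in title)
    # 4. Transliterate to pure ASCII: explicit table first, then strip accents
    #    by decomposing and dropping everything that is not ASCII.
    title = "".join(LATIN_TO_ASCII.get(ch, ch) for ch in title)
    title = unicodedata.normalize("NFKD", title)
    title = title.encode("ascii", "ignore").decode("ascii")
    # 6. Order the words alphabetically.
    return " ".join(sorted(title.split()))
```

Under these assumptions "Đ", "DJ" and "D.J." all map to "dj", as in the srwiki
example. One thing that can go wrong is visible immediately: "Dog bites man"
and "Man bites dog" also collapse to the same canonical title.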

Note that this would also be very helpful for non-Latin wikis - people often
want Latin-only URLs since non-Latin URLs are toooo long. I also recall a
recent discussion about a wiki in a language with nonstandard spelling (nds?)
where they use bots to create dozens or even hundreds of redirects to an
article title - this would also make that unneeded.

_______________________________________________
Wikitech-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: URLs that aren't cool...

Andrew Dunbar
2009/7/29 Nikola Smolenski <[hidden email]>:

> On Tuesday 28 July 2009 19:16:22, Brion Vibber wrote:
>> On 7/28/09 10:04 AM, Aryeh Gregor wrote:
>> > On Tue, Jul 28, 2009 at 12:52 PM, Mark Williamson <[hidden email]> wrote:
>> >> Case insensitivity shouldn't be a problem for any language, as long as
>> >> you do it properly.
>> >>
>> >> Turkish and other languages using dotless i, for example, will need a
>> >> special rule - Turkish lowercase dotted i capitalizes to a capital
>> >> dotted İ while lowercase undotted ı capitalizes to regular undotted I.
>> >
>> > And so what if a wiki is multilingual and you don't know what language
>> > the page name is in?  What if a Turkish wiki contains some English
>> > page names as loan words, for instance?
>>
>> Indeed, good handling of case-insensitive matchings would be a big win
>> for human usability, but it's not easy to get right in all cases.
>>
>> The main problems are:
>>
>> 1) Conflicts when we really do consider something separate, but the case
>> folding rules match them together
>>
>> 2) Language-specific case folding rules in a multilingual environment
>>
>> Turkish I with/without dot and German ß not always matching to SS are
>> the primary examples off the top of my head. Also, some languages tend
>> to drop accent markers in capital form (eg, Spanish). What can or should
>> we do here?
>
> Similar to an automatic redirect, we could build an automatic disambiguation
> page. For example, someone on srwiki going to [[Dj]] would get:
>
> Did you mean:
>
> * [[Đ]]
> * [[DJ]]
> * [[D.J.]]
>
>> A nearer-term help would be to go ahead and implement what we talked
>> about a billion years ago but never got around to -- a decent "did you
>> mean X?" message to display when you go to an empty page but there's
>> something similar nearby.
>
> Was thinking a lot about this. The best solution I thought of would be to add
> a column to page table "page_title_canonical". When an article is
> created/moved, this canonical title is built from the real title. When an
> article is looked up, if there is no match in page_title, build the canonical
> title from the URL and see if there is a match in page_title_canonical and if
> yes, display "did you mean X" or even go there automatically as if from a
> redirect (if there is only one match) or "did you mean *X, *X1" if there are
> multiple matches.
>
> This canonical title would be made like this:
> * Remove disambiguator from the title if it exists
> * Remove punctuation and the like
> * Transliterate the title to Latin alphabet
> * Transliterate to pure ASCII
> * Lowercase
> * Order the words alphabetically
>
> What could possibly go wrong?
>
> Note that this would also be very helpful for non-Latin wikis - people often
> want Latin-only URLs since non-Latin URLs are toooo long. I also recall a
> recent discussion about a wiki in a language with nonstandard spelling (nds?)
> where they use bots to create dozens or even hundreds of redirects to an
> article title - this would also make that unneeded.

I actually did make this extension a couple of years ago, intended for the
English Wiktionary, where we manually add an {{also}} template to the
top of pages to link to other pages whose titles differ in minor ways
such as capitalization, hyphenation, apostrophes, accents, and periods. I
think I had it working with Hebrew and Arabic and a few other exotic
languages besides.

It was running on Brion's test box for some time but getting little
interest. It's been offline and unmaintained since Brion moved and I
did a couple of overseas trips.

http://www.mediawiki.org/wiki/Extension:DidYouMean
http://svn.wikimedia.org/viewvc/mediawiki/trunk/extensions/DidYouMean/
https://bugzilla.wikimedia.org/show_bug.cgi?id=8648

It hooked every way to create, delete, or move a page in order to maintain a
separate table of normalized page titles which it consulted when
displaying a page.
The code for display was designed for compatibility with the
then-current Wiktionary templates and would need to be implemented in
a more general way.
A core version would probably just add a field to the existing table.
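
A rough Python sketch of that hook-based approach, using an in-memory SQLite
table. The table name, column names, hook method names, and the placeholder
normalize() function are all assumptions made for illustration; this is not
the DidYouMean extension's actual schema or code.

```python
import sqlite3


def normalize(title):
    # Placeholder normalisation: drop hyphens, apostrophes and periods, then
    # fold case. Accent handling is left out to keep the sketch short.
    return "".join(ch for ch in title if ch not in "-'.").casefold()


class NormalizedTitles:
    """Separate table of normalised titles, kept in sync by page hooks."""

    def __init__(self):
        self.db = sqlite3.connect(":memory:")
        self.db.execute(
            "CREATE TABLE norm_title (title TEXT PRIMARY KEY, norm TEXT NOT NULL)"
        )
        self.db.execute("CREATE INDEX norm_idx ON norm_title (norm)")

    def on_create(self, title):
        self.db.execute(
            "INSERT OR REPLACE INTO norm_title VALUES (?, ?)",
            (title, normalize(title)),
        )

    def on_delete(self, title):
        self.db.execute("DELETE FROM norm_title WHERE title = ?", (title,))

    def on_move(self, old_title, new_title):
        # A move is treated as deleting the old row and creating a new one.
        self.on_delete(old_title)
        self.on_create(new_title)

    def see_also(self, title):
        # Other pages whose normalised title matches; consulted on display.
        rows = self.db.execute(
            "SELECT title FROM norm_title WHERE norm = ? AND title != ? "
            "ORDER BY title",
            (normalize(title), title),
        )
        return [row[0] for row in rows]
```

A core version that adds a field to the existing table would replace the
separate norm_title table with one extra indexed column, but the three hooks
would stay the same.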

Andrew Dunbar (hippietrail)


--
http://wiktionarydev.leuksman.com http://linguaphile.sf.net

Re: URLs that aren't cool...

Tim Starling-2
In reply to this post by Aryeh Gregor
Aryeh Gregor wrote:
> (But at least we could get rid of the silly Text/DbKey distinction
> while we're doing this.  I've heard recent MySQL versions actually
> support storage of ASCII space characters in text fields!)

Apparently this poor design choice was made due to some bogus concept
of backwards compatibility with UseMod, or some similarly crappy wiki
engine that stores articles in the filesystem, with filenames chosen
to avoid distressing shellscript fanboys.

-- Tim Starling

