For title normalization, what characters are converted to uppercase ?

Nicolas Vervelle-4
Hello,

On most wikis, MediaWiki is configured to convert the first letter of a
title to uppercase, but apparently it doesn't convert every Unicode
character: for example, on frwiki ɽ
<https://fr.wikipedia.org/w/index.php?title=%C9%BD&redirect=no> is a
different article from Ɽ <https://fr.wikipedia.org/wiki/%E2%B1%A4>, even though
the second character is the uppercase version of the first one in Unicode.

So, which characters are actually converted to uppercase by title
normalization?

I need this information to stop reporting some false positives in
WPCleaner <https://fr.wikipedia.org/wiki/Wikip%C3%A9dia:WPCleaner>.

Thanks, Nico
_______________________________________________
Wikitech-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: For title normalization, what characters are converted to uppercase ?

bawolff
MediaWiki uses PHP's mb_strtoupper.

I believe this uses the normal Unicode uppercase algorithm. However, the
result can vary depending on the version of Unicode. We are currently in the
process of switching to PHP 7, but for the moment we are still using HHVM's
uppercasing code. There's a list of the differences between HHVM and PHP 7.2
uppercasing at
https://github.com/wikimedia/operations-mediawiki-config/blob/master/wmf-config/Php72ToUpper.php
[All this is probably subject to change]

However, I am at a loss as to why HHVM and PHP < 5.6 [1] wouldn't map that
character, since the ɽ -> Ɽ mapping has been present since Unicode 5
(2006). I guess it was using really old Unicode data or something.

See also bug T219279 [2]

--
Brian

[1] https://3v4l.org/GHt3b
[2] https://phabricator.wikimedia.org/T219279
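
To see what a recent Unicode table says about a given character, here is a
minimal sketch in Python. It assumes Python's built-in Unicode data as a
stand-in for "current Unicode"; these are not the tables PHP or HHVM use, so
it only shows what the mapping should be, not what a given wiki does. The
helper name upper_report is made up for this example.

```python
import unicodedata


def upper_report(ch: str):
    """Return the code point of ch and the code points of its uppercase form.

    Uses Python's own Unicode tables (an assumption of this sketch),
    not the tables of PHP or HHVM.
    """
    source = f"U+{ord(ch):04X}"
    target = [f"U+{ord(c):04X}" for c in ch.upper()]
    return source, target


# Unicode version bundled with this Python build.
print(unicodedata.unidata_version)

# The character from the original question: ɽ (U+027D).
print(upper_report("\u027d"))  # -> ('U+027D', ['U+2C64'])
```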

Re: For title normalization, what characters are converted to uppercase ?

Yuri Astrakhan
In reply to this post by Nicolas Vervelle-4
Hi Nico, if possible, could your tool actually use the MW API to normalize
titles? It's a very quick API call, and you can do multiple titles at once;
it will save you a lot of grief over incompatibilities.
--Yuri
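
As a rough illustration of the API route, here is a minimal sketch in Python.
It assumes the standard action=query module, which reports title changes in a
"normalized" list of from/to pairs. The endpoint URL, the helper names, and
the sample response below are assumptions made up for this example, and no
network request is performed.

```python
from urllib.parse import urlencode

# Assumption: the usual api.php endpoint of the wiki being checked.
API_ENDPOINT = "https://fr.wikipedia.org/w/api.php"


def build_normalize_url(titles):
    """Build one action=query URL covering several titles at once."""
    params = {
        "action": "query",
        "format": "json",
        "titles": "|".join(titles),  # titles are separated by a pipe
    }
    return API_ENDPOINT + "?" + urlencode(params)


def parse_normalized(response):
    """Extract {original title: normalized title} from a decoded JSON response."""
    entries = response.get("query", {}).get("normalized", [])
    return {entry["from"]: entry["to"] for entry in entries}


# Hypothetical response, written by hand to show the expected shape only.
sample = {"query": {"normalized": [{"from": "paris", "to": "Paris"}]}}

print(build_normalize_url(["paris", "lyon"]))
print(parse_normalized(sample))
```

Titles that the wiki leaves unchanged do not appear in the "normalized" list,
so a missing key means the title was already in normalized form.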

Re: For title normalization, what characters are converted to uppercase ?

Nicolas Vervelle-4
Thanks Yuri,

I know about the normalization done through the API, but it doesn't work for
the case I'm working on: it's a dump analysis, and I want it to
work offline...

Nico

Re: For title normalization, what characters are converted to uppercase ?

Nicolas Vervelle-4
In reply to this post by bawolff
Thanks Brian,

Thanks for the link to Php72ToUpper.php!
I think I understand it now: for example, the first line says 'ƀ' => 'ƀ',
which should mean that this letter isn't converted to uppercase by
MW?
That's one of the letters I found that wasn't converted to uppercase and
that was generating a false positive in my code, so it's because specific
MW code is preventing the conversion :-)

Nico

Re: For title normalization, what characters are converted to uppercase ?

Giuseppe Lavagetto
Hi!

No, that file is a temporary measure during a transition between two
versions of PHP.

In HHVM and PHP 5.x, calling mb_strtoupper("ƀ") would give the erroneous
result "ƀ".

In PHP 7.x, the result is the correct capitalization.

The issue is that the titles of wiki articles get normalized, so under PHP 7
we would have

ƀar => Ƀar

which would prevent you from reaching the existing page, which is still
stored under the title ƀar.

Once we're done with the transition and have gone through the process of
converting the (several hundred) pages/users that have the wrong title
normalization, we will remove that table and obtain the correct behaviour.

You just need to subscribe to https://phabricator.wikimedia.org/T219279 and
wait for its resolution, I think. Most Unicode horrors are fixed in recent
versions of PHP, including the one you were citing.

Cheers,

Giuseppe
--
Giuseppe Lavagetto
Principal Site Reliability Engineer, Wikimedia Foundation
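
For an offline tool, the behaviour described above can be approximated with a
small override table consulted before the normal uppercase mapping. This is a
minimal sketch in Python, not MediaWiki's actual code: the function name is
hypothetical, Python's str.upper() stands in for the PHP 7 mapping, and the
table holds only the one entry quoted in this thread (the real
Php72ToUpper.php lists more).

```python
# Characters whose legacy (HHVM / PHP 5.x) result differs from current Unicode.
# Only this entry is quoted in the thread; the real table is longer.
LEGACY_OVERRIDES = {"\u0180": "\u0180"}  # 'ƀ' stays 'ƀ'


def normalize_first_letter(title, overrides=LEGACY_OVERRIDES):
    """Uppercase the first character of title, honouring the override table.

    Assumption: Python's str.upper() approximates the newer PHP mapping.
    It applies full case mappings, so a few characters may still differ.
    """
    if not title:
        return title
    first, rest = title[0], title[1:]
    return overrides.get(first, first.upper()) + rest


print(normalize_first_letter("\u0180ar"))      # legacy behaviour: unchanged
print(normalize_first_letter("\u0180ar", {}))  # without the table: uppercased
```

Passing an empty table mimics the state after the transition, once the
override file has been removed.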

Re: For title normalization, what characters are converted to uppercase ?

Nicolas Vervelle-4
Thanks, Giuseppe!

I've subscribed to T219279 to know when the pages are properly converted,
and when I can remove the hack in my code.

Nico

Re: For title normalization, what characters are converted to uppercase ?

Nicolas Vervelle-4
Last question (I believe):
I've implemented something similar to Php72ToUpper in WPCleaner, and it
seems to work fine for removing false positives.
I have only one left on frwiki: ⅷ
<https://fr.wikipedia.org/w/index.php?title=%E2%85%B7&redirect=no>.
My code still converts it to uppercase, but on frwiki there is one page for
the lowercase letter and one page for the uppercase letter, so this letter
is not converted to uppercase by the current MediaWiki version.
Is it missing from Php72ToUpper, which would prevent it from being converted
under PHP 7.2?

Nico
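
The observation above (one page for the lowercase letter and one for the
uppercase letter) can be turned into an offline check over the titles in a
dump. This is a minimal sketch in Python under stated assumptions: the
function name is made up, the titles are assumed to be already loaded into a
list, and Python's str.upper() stands in for the Unicode mapping.

```python
def unconverted_first_letters(titles):
    """Return first characters that the wiki evidently did not uppercase.

    A character is reported when a title starting with it coexists with the
    same title starting with its uppercase form, which could not happen if
    the wiki had normalized that character.
    """
    title_set = set(titles)
    found = set()
    for title in title_set:
        if not title:
            continue
        first = title[0]
        upper = first.upper()  # assumption: Python's Unicode tables
        if upper != first and upper + title[1:] in title_set:
            found.add(first)
    return sorted(found)


# Toy data using the two pairs mentioned in this thread plus a control title.
titles = ["\u2177", "\u2167", "\u027d", "\u2c64", "Paris"]
print(unconverted_first_letters(titles))  # -> ['ɽ', 'ⅷ']
```

The check only detects characters for which both pages happen to exist, so an
empty result does not prove that every character is converted.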

Re: For title normalization, what characters are converted to uppercase ?

bawolff
Apparently that will change in PHP 7.3, which we will move to eventually but
probably not anytime soon: https://3v4l.org/W7TiC

--
bawolff