[MediaWiki-l] Normalization Code

[MediaWiki-l] Normalization Code

Al Johnson
Hello,

I need to make sure a backend Java process is doing the same Unicode normalization that MediaWiki does for edit text.  Grepping for 'normaliz' brings up a lot of hits, and I'm not a PHP dev.  Can someone point me to a key PHP module and/or function?

Thank you,
Al

Re: Normalization Code

Jeremy Baron
On Wed, Dec 26, 2012 at 10:20 PM, Al Johnson <[hidden email]> wrote:
> I need to make sure a backend Java process is doing the same Unicode normalization that MediaWiki does for edit text.  Grepping for 'normaliz' brings up a lot of hits, and I'm not a PHP dev.  Can someone point me to a key PHP module and/or function?

I guess (it really is a guess) start with
includes/normal/UtfNormal.php and
https://gerrit.wikimedia.org/r/gitweb?p=mediawiki/core.git;a=blob;f=includes/installer/Installer.i18n.php;hb=81ea9d1492c296902e9325f3b571cc4e5b39a272#l94


Re: Normalization Code

Brion Vibber
In reply to this post by Al Johnson
On Wed, Dec 26, 2012 at 2:20 PM, Al Johnson wrote:

> I need to make sure a backend Java process is doing the same Unicode
> normalization that MediaWiki does for edit text.  Grepping for 'normaliz'
> brings up a lot of hits, and I'm not a PHP dev.  Can someone point me to a
> key PHP module and/or function?
>

The PHP code for this is in includes/normal -- luckily you shouldn't have
to replicate most of that code, which is nasty and low-level.

For the most part, you want to do two things:
* make sure the input is valid UTF-8
* normalize any composition character sequences to 'normalization form C'

Reading data in from a UTF-8 input stream into a Java string should already
take care of making sure it's valid UTF-8. :) If you want to treat invalid
input the same way MediaWiki does, make sure that invalid UTF-8 sequences
get converted to the 'replacement character' U+FFFD rather than throwing an
exception.
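
A minimal sketch of that decoding step in Java (Utf8Lenient and
decodeLenient are illustrative names, not MediaWiki code), using the
standard java.nio CharsetDecoder API:

    import java.nio.ByteBuffer;
    import java.nio.charset.CharacterCodingException;
    import java.nio.charset.CharsetDecoder;
    import java.nio.charset.CodingErrorAction;
    import java.nio.charset.StandardCharsets;

    public class Utf8Lenient {
        // Decode raw bytes as UTF-8, turning each malformed or
        // unmappable sequence into U+FFFD instead of throwing.
        public static String decodeLenient(byte[] bytes)
                throws CharacterCodingException {
            CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPLACE)
                    .onUnmappableCharacter(CodingErrorAction.REPLACE);
            return decoder.decode(ByteBuffer.wrap(bytes)).toString();
        }
    }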

It looks like you should be able to use the java.text.Normalizer class to
convert to NFC: <
http://docs.oracle.com/javase/tutorial/i18n/text/normalizerapi.html>

You might or might not prefer to use the Java version of the ICU library to
do the same thing; it might be more up to date: <
http://icu-project.org/apiref/icu4j/com/ibm/icu/text/Normalizer.html>
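
With the JDK class the conversion itself is tiny; a minimal sketch
(NfcUtil and toNfc are illustrative names):

    import java.text.Normalizer;

    public class NfcUtil {
        // Convert a string to Unicode Normalization Form C.
        public static String toNfc(String s) {
            // isNormalized() cheaply skips the common case where the
            // input is already in NFC.
            if (Normalizer.isNormalized(s, Normalizer.Form.NFC)) {
                return s;
            }
            return Normalizer.normalize(s, Normalizer.Form.NFC);
        }
    }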

-- brion

Re: Normalization Code

Al Johnson
Thanks for the Java API ref.  But I'm curious as to how or where invalid UTF-8 sequences come about; is it primarily a hacker thing?  I see the Java Character API has an isValidCodePoint() method.  Do I just run each code point through that?

Thanks,
al

Re: Normalization Code

Brion Vibber
On Wed, Dec 26, 2012 at 9:14 PM, Al Johnson <[hidden email]> wrote:

> Thanks for the Java API ref.  But I'm curious as to how or where invalid
> UTF-8 sequences come about; is it primarily a hacker thing?


Most frequently due to buggy bot tools or reaaaally old browsers that
didn't support UTF-8 correctly.


>   I see the Java Character API has an isValidCodePoint() method.  Do I
> just run each code point through that?
>

By the time your data is in Java String objects or 'char's, it's already
been decoded from UTF-8 (an 8-bit byte stream) into UTF-16 (a 16-bit
character string). I don't remember offhand enough about Java I/O to tell
you exactly which class in the input stack is doing that, though. :)
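
If it helps as a starting point, InputStreamReader is the usual spot: its
Charset-based constructor configures its internal decoder to replace
malformed input rather than report it, so bad bytes arrive as U+FFFD. A
minimal sketch (Utf8Reader is an illustrative name):

    import java.io.BufferedReader;
    import java.io.InputStream;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;

    public class Utf8Reader {
        // Wrap a raw byte stream in a UTF-8 reader.  This constructor
        // substitutes U+FFFD for malformed sequences instead of
        // throwing an exception.
        public static BufferedReader open(InputStream in) {
            return new BufferedReader(
                    new InputStreamReader(in, StandardCharsets.UTF_8));
        }
    }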

-- brion