database encoding for field with mathematical expressions

Moritz Schubotz-2
Hi,

I'm testing a new rendering option for the <math /> element and had
problems storing MathML elements in the database field
math_mathml, which is of type text.
The MathML elements contain a wide range of Unicode characters, such as
INVISIBLE TIMES, which is encoded as 0xE2 0x81 0xA2 in UTF-8, or even 4-byte
characters such as MATHEMATICAL BOLD CAPITAL A (0xF0 0x9D 0x90 0x80).
In some rare cases I had problems retrieving the stored value correctly from
MySQL.
To work around that problem I'm now using the PHP functions utf8_encode and
utf8_decode, which is not a very intuitive solution.
Do you know a better way to solve this issue without changing the
database layout?
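
For reference, the byte sequences quoted above can be reproduced with a short script. This is only an illustrative sketch in Python rather than PHP; the helper name utf8_hex is made up for this example and is not part of MediaWiki or the Math extension.

```python
import unicodedata


def utf8_hex(ch: str) -> str:
    """Return the UTF-8 encoding of one character as space-separated 0x.. bytes.

    Hypothetical helper, used only for this illustration.
    """
    return " ".join(f"0x{b:02X}" for b in ch.encode("utf-8"))


# The two characters mentioned in the post.
for ch in ("\u2062", "\U0001D400"):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {utf8_hex(ch)}")
```

The first character needs three bytes in UTF-8 and the second needs four, which is the distinction that matters for the storage problem described here.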

Best
Physikerwelt




_______________________________________________
Wikitech-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: database encoding for field with mathematical expressions

Kevin Israel
On 05/23/2013 11:31 PM, [hidden email] wrote:

> Hi,
>
> I'm testing a new rendering option for the <math /> element and had
> problems storing MathML elements in the database field
> math_mathml, which is of type text.
> The MathML elements contain a wide range of Unicode characters, such as
> INVISIBLE TIMES, which is encoded as 0xE2 0x81 0xA2 in UTF-8, or even 4-byte
> characters such as MATHEMATICAL BOLD CAPITAL A (0xF0 0x9D 0x90 0x80).
> In some rare cases I had problems retrieving the stored value correctly from
> MySQL.
> To work around that problem I'm now using the PHP functions utf8_encode and
> utf8_decode, which is not a very intuitive solution.
> Do you know a better way to solve this issue without changing the
> database layout?
>
> Best
> Physikerwelt

If you use MySQL: when you installed MediaWiki (or created the table),
did you choose the "UTF-8" option instead of "binary"? The underlying
MySQL character set is "utf8"[1], which stores at most three bytes per
character and so does not support characters above U+FFFF (which need
four bytes in UTF-8).
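
To check whether a given MathML string is affected by this limit, one can look for code points above U+FFFF. The following is a minimal sketch in Python, not MediaWiki code; the function names non_bmp_chars and fits_mysql_utf8 are hypothetical, and the check only covers the three-byte limit described above.

```python
from typing import List


def non_bmp_chars(text: str) -> List[str]:
    """Return the characters in text that lie above U+FFFF.

    These are the characters that need four bytes in UTF-8.
    """
    return [ch for ch in text if ord(ch) > 0xFFFF]


def fits_mysql_utf8(text: str) -> bool:
    """True if every character fits in at most three UTF-8 bytes."""
    return not non_bmp_chars(text)


# INVISIBLE TIMES (U+2062) is inside the Basic Multilingual Plane.
print(fits_mysql_utf8("a\u2062b"))
# MATHEMATICAL BOLD CAPITAL A (U+1D400) is not.
print(fits_mysql_utf8("\U0001D400"))
```

Under this check, INVISIBLE TIMES would be storable in a "utf8" column, while MATHEMATICAL BOLD CAPITAL A would not.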

This is mentioned in the web installer (message 'config-charset-help'):

> In binary mode, MediaWiki stores UTF-8 text to the database in binary
> fields. This is more efficient than MySQL's UTF-8 mode, and allows
> you to use the full range of Unicode characters. In UTF-8 mode, MySQL
> will know what character set your data is in, and can present and
> convert it appropriately, but it will not let you store characters
> above the Basic Multilingual Plane[2].

MySQL 5.5 did introduce a new "utf8mb4" character set, which does
support four-byte characters; however, MediaWiki does not currently
support that option (now filed as bug 48767).

The WMF of course has to use the 'binary' option (actually, UTF-8 stored
in latin1 columns, as mentioned in bug 32217) to allow storage
of all sorts of obscure characters from different languages.

utf8_encode()/utf8_decode() work around the problem because utf8_encode()
treats its input as ISO-8859-1 and replaces each byte value from 80 to FF
with a two-byte character from U+0080 to U+00FF (encoded as C2 80 to
C3 BF), and the 'utf8' option does allow those characters.
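
The effect can be sketched outside PHP as well. The Python snippet below only imitates the behaviour described above; php_utf8_encode and php_utf8_decode are hypothetical stand-ins written for this example, not the real PHP functions. One known difference: PHP's utf8_decode() substitutes "?" for characters it cannot represent, whereas this stand-in would raise an error.

```python
def php_utf8_encode(data: bytes) -> bytes:
    """Imitate PHP utf8_encode(): read bytes as ISO-8859-1, write UTF-8."""
    return data.decode("latin-1").encode("utf-8")


def php_utf8_decode(data: bytes) -> bytes:
    """Imitate PHP utf8_decode() for input produced by php_utf8_encode()."""
    return data.decode("utf-8").encode("latin-1")


# UTF-8 bytes of MATHEMATICAL BOLD CAPITAL A: F0 9D 90 80.
raw = "\U0001D400".encode("utf-8")
wrapped = php_utf8_encode(raw)

print(raw.hex(" "))      # the original four bytes
print(wrapped.hex(" "))  # each byte becomes a two-byte sequence

# After wrapping, no code point exceeds U+00FF, so the three-byte limit
# of the 'utf8' character set is never reached.
print(max(ord(ch) for ch in wrapped.decode("utf-8")) <= 0xFF)

# Decoding restores the original bytes.
print(php_utf8_decode(wrapped) == raw)
```

The cost of this workaround is that every byte from 80 to FF is stored as two bytes, and the database sees a string of Latin-1 characters rather than the actual MathML text.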

[1]: https://dev.mysql.com/doc/refman/5.5/en/charset-unicode-utf8.html
[2]: http://en.wikipedia.org/wiki/Mapping_of_Unicode_character_planes
--
Wikipedia user PleaseStand
http://en.wikipedia.org/wiki/User:PleaseStand

Re: database encoding for field with mathematical expressions

Matthew Flaschen-2
In reply to this post by Moritz Schubotz-2
On 05/23/2013 11:31 PM, [hidden email] wrote:
> Hi,
>
> I'm testing a new rendering option for the <math /> element and had
> problems storing MathML elements in the database field
> math_mathml, which is of type text.

The Gerrit change for this is https://gerrit.wikimedia.org/r/#/c/61987/

Matt Flaschen
