Mysql, UTF-8: How is it supposed to work?

Mysql, UTF-8: How is it supposed to work?

Dorthe Luebbert
Hi,

I wonder how the UTF-8-Support in Mediawiki works and what valid
combinations of database charsets and output charsets are.

As far as I understand in version 1.5 the default character set has
changed to UTF-8. Therefore I suppose Mediawiki stores HTML-entities in
the database per default (because Mysql 4.0 does not fully support
UTF-8). Right?

Yesterday we tried to upgrade a 1.5x-Media-Wiki to Mysql 4.1 (the server
was upgraded and the wiki was unfortunately affected). We found a
character set mess within the latin1-database, which we cleaned up by
find/replace in the dump file. Now we have UTF8 content in the database,
the character set for the tables is set to UTF-8 and utf8 is used as
charset in the output. We also enabled the Mysql5-experimental flag.
Some parts of the page work all right, some do not (e.g. page titles),
this was mentioned in the changelog file as todo.

Now it's broken and I would like to know which combination is supposed to
work. Is this one a possible combination?
Database: Mysql 4.1
PHP: 5.1
Database-charset: Latin1, all content in the database is latin1
Output-charset: UTF-8

Thanks for any hint.

Regards

  Dorthe

_______________________________________________
MediaWiki-l mailing list
[hidden email]
http://mail.wikipedia.org/mailman/listinfo/mediawiki-l

Re: Mysql, UTF-8: How is it supposed to work?

Brion Vibber
Dorthe Luebbert wrote:
> I wonder how the UTF-8-Support in Mediawiki works and what valid
> combinations of database charsets and output charsets are.
>
> As far as I understand in version 1.5 the default character set has
> changed to UTF-8.

The default has been UTF-8 since a long long time ago. In some older versions
(possibly as late as 1.3), a handful of European languages had to be installed
in Latin-1, English defaulted to UTF-8 but could optionally be Latin-1, and
every other language was UTF-8.

As of 1.4, UTF-8 was the default for all languages.

As of 1.5, Latin-1 is no longer supported.

> Therefore I suppose Mediawiki stores HTML-entities in
> the database per default (because Mysql 4.0 does not fully support
> UTF-8). Right?

MySQL through 4.0 doesn't have native support for Unicode, so we just treat the
fields as binary and store UTF-8 data in them directly.
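
A minimal sketch of that approach, using Python's built-in sqlite3 module and a BLOB column as a stand-in for a binary MySQL field. The table and column names are made up for illustration and are not MediaWiki's actual schema. The point is only that the application does the encoding and decoding itself, and the database sees opaque bytes:

```python
import sqlite3

# Stand-in for a binary field: the database never interprets these bytes.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo_page (title BLOB)")

title = "Übersicht 日本語"
stored = title.encode("utf-8")      # the application encodes on the way in
conn.execute("INSERT INTO demo_page (title) VALUES (?)", (stored,))

(raw,) = conn.execute("SELECT title FROM demo_page").fetchone()
roundtrip = raw.decode("utf-8")     # and decodes on the way out
```

Because no character set is attached to the column, nothing on the server side can re-interpret or convert the data.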

MySQL 4.1 and later have somewhat fancier character set options including some
broken Unicode support. By default, MediaWiki continues to treat it as on 4.0
and earlier; data is chucked in and retrieved as raw UTF-8 without worrying
about the server's character set configuration.

Generally this works fine, though sometimes you'll get surprises if you let
MySQL do implicit character conversion based on what it _thinks_ your tables
contain.
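
The kind of surprise meant here can be reproduced without a database. This sketch assumes the stored bytes are really UTF-8 while the column is declared latin1, and that something then converts the column to utf8. Python's latin-1 codec stands in for MySQL's latin1 (which is closer to cp1252, though the two agree for the bytes used here):

```python
original = "Müller"
raw_utf8 = original.encode("utf-8")     # what the application stored

# The server believes the column is latin1, so each byte is read as one character.
misread = raw_utf8.decode("latin-1")

# Converting that misreading to UTF-8 encodes every non-ASCII byte a second time.
double_encoded = misread.encode("utf-8")

# Reversing the two steps recovers the text, as long as no byte was lost on the way.
repaired = double_encoded.decode("utf-8").encode("latin-1").decode("utf-8")
```

The doubly encoded form is what shows up as "MÃ¼ller" on a page served as UTF-8.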


In current 1.5 releases you may optionally have the tables created with the
UTF-8 character set explicitly set, and UTF-8 explicitly set on the db connection.

This may or may not be helpful for some people for some reason; but mostly it will:
* Make indexes larger (3 bytes per character)
* Cause failures if you use characters outside the BOM in page titles,
usernames, etc.
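
Both points follow from MySQL's utf8 character set allowing at most three bytes per character, which covers only code points up to U+FFFF, the Basic Multilingual Plane (BMP). A small sketch of the arithmetic; the helper names are made up for illustration, and 255 is just an example column width:

```python
MAX_BYTES_PER_CHAR = 3   # limit of MySQL's "utf8" character set

def utf8_byte_length(ch):
    """Number of bytes one character occupies in UTF-8."""
    return len(ch.encode("utf-8"))

def chars_outside_bmp(text):
    """Characters above U+FFFF, which need four bytes in UTF-8."""
    return [ch for ch in text if ord(ch) > 0xFFFF]

def worst_case_index_bytes(column_chars):
    """Space an index must allow for a column of the given character width."""
    return column_chars * MAX_BYTES_PER_CHAR

sizes = {ch: utf8_byte_length(ch) for ch in ("a", "ü", "€", "𝄞")}
rejected = chars_outside_bmp("Clef 𝄞")
index_bytes = worst_case_index_bytes(255)
```

Here the musical symbol U+1D11E is the only character that exceeds the three-byte limit, so a title containing it is the kind that would fail.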

-- brion vibber (brion @ pobox.com)



Re: Mysql, UTF-8: How is it supposed to work?

Brion Vibber
Brion Vibber wrote:
> * Cause failures if you use characters outside the BOM in page titles,
> usernames, etc.

That of course should be BMP, not BOM.

*needs more sleep*

-- brion vibber (brion @ pobox.com)

