accents not appearing correctly

accents not appearing correctly

Hugh Prior
I am trying to create wiki pages via a program.  I have been partially
successful, but I cannot seem to get past problems of accented characters
not appearing correctly.  Below I have a self-contained example.  It creates
a page called "Page Test 1" fine, except that instead of the page text
reading "Fédération" (with two accented "e"s), I get complete junk for that
part.  In Internet Explorer it shows as a Chinese character (!), and in
Firefox I get two nasty blobs with question marks in them.

What can I do to ensure the code-page translation stuff works correctly?
There is a whole bunch of stuff for dealing with funny chars, but which
should I use, and how should I be pre-processing 'user' input?

Thanks!


Hugh Prior


<?php

require_once("../includes/Article.php");
require_once("../includes/Title.php");
require_once("../includes/EditPage.php");
require_once("../includes/GlobalFunctions.php");


/**
 * Test page creation
 */
function pageCreate() {
   global $wgLoadBalancer;
   global $wgUser;

   // Create the page text
   $pageText = "Fédération";
   $wikiPageName = "Page Test 1";

   // Code adapted from "maintenance/InitialiseMessages.inc"
   $dbw =& wfGetDB( DB_MASTER );

   $title = Title::newFromText( $wikiPageName );  // newFromText() is static

   $article = new Article( $title );
   $newid = $article->insertOn( $dbw, 'sysop' );

   $revision = new Revision( array(
   'page'      => $newid,
   'text'      => $pageText,
   'user'      => 0,
   'user_text' => "My user text",
   'comment'   => '',
   ) );
   $revid = $revision->insertOn( $dbw );
   $article->updateRevisionOn( $dbw, $revision );

   $dbw->commit();

}

// Call the page creation
pageCreate();

?>



_______________________________________________
MediaWiki-l mailing list
[hidden email]
http://mail.wikipedia.org/mailman/listinfo/mediawiki-l

Re: accents not appearing correctly

Brion Vibber
Hugh Prior wrote:
> In Internet Explorer it shows as a Chinese character (!), and
> in Firefox I get two nasty blobs with question marks in them.

Save your file as UTF-8.

-- brion vibber (brion @ pobox.com)



Re: accents not appearing correctly

Hugh Prior
Thank you Brion for your answer.  But I am not much the wiser.  I know of
course that I need to do some special character treatment.

How?  Is there some sort of "preprocessTextForUTF8()" function which I need
to call?  Is there some sort of "$revision->saveAsUTF8()" function which I
need to call?

Thanks.




Re: Re: accents not appearing correctly

Brion Vibber
Hugh Prior wrote:
> Thank you Brion for your answer.  But I am not much the wiser.  I know of
> course that I need to do some special character treatment.
>
> How?  Is there some sort of "preprocessTextForUTF8()" function which I need
> to call?  Is there some sort of "$revision->saveAsUTF8()" function which I
> need to call?

Your text editor will have some sort of encoding setting. Use it.

-- brion vibber (brion @ pobox.com)



Re: Re: accents not appearing correctly

Hugh Prior
"Brion Vibber" <[hidden email]> wrote:
> Your text editor will have some sort of encoding setting. Use it.

Thanks for trying, Brion.

However, in view of the actual problem, what you suggest is, sorry to say,
complete nonsense.  The whole idea is that the page is created via a
PROGRAM and not via the browser, so browser settings are totally
irrelevant.  Sure, the page needs to display correctly in a browser
afterwards, but it should not be for the end user to fudge the browser into
some bizarre setting just because a letter "e" has a simple accent.

If you look at the sample code you will see the sample text which causes a
problem:

 $pageText = "Fédération";


It is not complex text.  It is not as if I am trying to input Chinese via a
program into a wiki.

If you think that the code, being PHP, still has to be run by a browser,
ask yourself how the sample code could run and generate correct output when
the PHP program is run from the command line.

To reiterate: how can I get the simple program shown in my original message
to create wiki pages with the accents correct?




Re: Re: Re: accents not appearing correctly

muyuubyou
I think it isn't nonsense, actually.

MediaWiki is UTF-8.  UTF-8 has no problem with plain ASCII as long as it's
in the common English subset.

Latin-1 characters are not transparent to this.  If you edit your PHP with
an ASCII editor, it won't be proper UTF-8.

To be on the safe side, since I don't know which platform you're running,
my humble advice is to try jEdit (Java) and change the buffer encoding
setting to UTF-8, under Utilities >> Buffer Options >> Character Encoding.

Just load the file, change the buffer encoding to UTF-8, and save again.
You can save the result to a different file and see that the two are not
binary-identical.

Hope that helps.
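To see what that re-save actually does to the bytes, here is a quick
illustration (in Python rather than PHP, purely for brevity; PHP's iconv()
performs the same conversion):

```python
# What "re-save as UTF-8" does: the bytes a Latin-1 editor wrote are decoded
# to characters, then written back out in the UTF-8 encoding.
latin1_bytes = b'$pageText = "F\xe9d\xe9ration";'  # e-acute saved as the single byte 0xE9

text = latin1_bytes.decode("latin-1")    # bytes -> characters
utf8_bytes = text.encode("utf-8")        # characters -> UTF-8 bytes

# Each 0xE9 byte becomes the two-byte sequence 0xC3 0xA9, which is why the
# two files are not binary-identical:
print(latin1_bytes == utf8_bytes)        # False
print(b"\xc3\xa9" in utf8_bytes)         # True
```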



Re: Re: Re: accents not appearing correctly

Brion Vibber
Hugh Prior wrote:
> "Brion Vibber" <[hidden email]> wrote:
>> Your text editor will have some sort of encoding setting. Use it.
>
> Thanks for trying Brion.
>
> However, in view of the actual problem, what you suggest is,
> sorry to say, complete nonsense.

That's only, sorry to say, because you have no idea what you're talking about.

>  $pageText = "Fédération";
>
> It is not complex text.
>
>  It is not as if I am trying to input Chinese via a
> program into a wiki.

Actually, it's exactly like that. Your string contains two non-ASCII characters,
which will need to be properly encoded or you'll get some data corruption.
Specifically, they must be UTF-8 encoded.

There's *no* qualitative difference between "é" and something like "本"; both
are non-ASCII characters and therefore must be properly encoded in the UTF-8
source file.

The symptoms you described are *exactly* the symptoms of a miscoded 8-bit ISO
8859-1 (or Windows "ANSI" or whatever they call it) character in what should be
a UTF-8 text stream.
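The mis-decoding is easy to reproduce; a small sketch (Python used purely
for illustration) shows that the Latin-1 bytes for this string are simply
not valid UTF-8:

```python
# The Latin-1 bytes for "Fédération" are not valid UTF-8: 0xE9 announces a
# multi-byte sequence, but the byte after it ('d') is not a continuation byte.
latin1_bytes = "Fédération".encode("latin-1")

try:
    latin1_bytes.decode("utf-8")
except UnicodeDecodeError as err:
    print("invalid UTF-8 starting at byte", err.start)   # byte 1, the first 0xE9

# A lenient decoder substitutes replacement characters -- the "blobs with
# question marks" seen in Firefox:
print(latin1_bytes.decode("utf-8", errors="replace"))
```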

> If you think that the code, being PHP, still has to be run by a browser,

I'm talking about the text editor you used to save the PHP source file
containing literal strings. There's no "browser" involved in your problem.

-- brion vibber (brion @ pobox.com)



Re: Re: Re: accents not appearing correctly

muyuubyou
A couple of mistakes there.

There is a difference between 'é' and '木' for many editors, including
non-windows editors that default to ASCII. The character 'é' is indeed
ASCII, it's just not 7-bit ASCII but "extended ASCII" or "8-bit ASCII" .
Many popular editors default to 8-bit ASCII, and others default to 8859-1 ,
also known as "latin 1" ; some use "Windows encoding" which is not exactly
the same thing, but it's close. There is also "Mac encoding" which is also
close but again it's different. Those just to mention the "popular" ones.

ASCII values from 128 on, those used with the first bit set, are the
problematic ones. UTF8 reserves those to indicate more bytes are needed for
displaying a char. UTF8 is variable-length while all the others I've
mentioned here are "1-byte-1-char" so to speak.

Brion is right about 'é' not being "UTF-8 friendly": only lower ASCII
(0-127, or in hexadecimal 0x00-0x7F) is encoded the same in all the popular
8-bit representations and in UTF-8.
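The variable lengths are easy to check; for example (Python, purely for
illustration):

```python
# UTF-8 is variable-length: 7-bit ASCII stays at one byte, Latin-1 letters
# like 'é' take two, and CJK characters like '本' take three.
for ch in ("e", "é", "本"):
    print(repr(ch), len(ch.encode("utf-8")), "byte(s)")
```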

In other words, Hugh will have to check the encoding of the file, and Brion
is right about this not being a browser problem whatsoever.

Hope that helped, Hugh. Also read my email from yesterday where I tried to
give you a solution instead of scolding you ;)

UTF8 is, by the way, not the best encoding for Asian text. UTF8 is meant to
display English text effectively (1 byte) while still being able to map all
Unicode. This is nice, but since all Japanese and Chinese characters (at
least all I tried, I'd have to check the tables to make sure) take 3 BYTES
OR MORE (sorry for shouting) that alone is reason enough to use another Wiki
like the popular japanese pukiwiki (using EUC-Japanese), or others using
typically SJIS or EUC-Japanese, EUC-Chinese, Big5 etc. It would be very nice
to have an UTF16 version, which would only take 2-bytes for each character
most of the time, 33%+- better space-wise. I'm aware it's bad to have just
one thing more to care about (different encodings) so I really understand
this is not being done. For me UTF8 is more or less okay, since my Wiki will
be mixed latin1+asian text.


For those who made it to the end of this message, thanks for your patience
:-) Now back to my busy-ass life as a game developer... I'm late for my
commute.



Re: Re: Re: accents not appearing correctly

Brion Vibber
muyuubyou wrote:
> A couple of mistakes there.
>
> There is a difference between 'é' and '木' for many editors, including
> non-windows editors that default to ASCII. The character 'é' is indeed
> ASCII, it's just not 7-bit ASCII but "extended ASCII" or "8-bit ASCII" .

False. ASCII is 7-bits only. Anything that's 8 bits is *not* ASCII, but some
other encoding.

Many/most 8-bit character encodings other than EBCDIC are *supersets* of ASCII,
which incorporate the 7-bit ASCII character set in the lower 128 code points and
various other characters in the high 128 code points.

Many people erroneously call any mapping from a number to a character that can
fit in 8 bits an "ASCII code", however this is incorrect.

> Many popular editors default to 8-bit ASCII,

There's no such thing.

> and others default to 8859-1, also known as "latin 1" ;

That part is reasonably true for Windows operating systems and some older
Unix/Linux systems in North America and western Europe.

Mac OS X and most modern Linux systems default to UTF-8.

> ASCII values from 128 on,

No such thing; there are no ASCII values from 128 on. However many 8-bit
character encodings which are supersets of ASCII contain *non*-ASCII characters
in the 128-256 range. Since these represent wildly different characters for each
such encoding (perhaps an accented Latin letter, perhaps a Greek letter, perhaps
a Thai letter, perhaps an Arabic letter...) it's unwise to simply assume that it
will have any meaning in a program that doesn't know about your favorite
encoding selection.

> UTF8 is, by the way, not the best encoding for Asian text.

That depends on what you mean by "best". If by "best" you mean only "as compact
as possible for the particular data I want to use at the moment" then yes, there
are other encodings which are more compact.

If, however, compatibility is an issue, UTF-8 is extremely functional and works
very well with UNIX/C-style string handling, pathnames, and byte-oriented
communications protocols at a minor 50% increase in uncompressed size for such
languages.

If space were an issue, though, you'd be using data compression.

> UTF8 is meant to
> display English text effectively (1 byte)

False; UTF-8 is meant to be compatible with 7-bit ASCII and the treatment of
specially meaningful bytes such as 0 and the '/' path separator in string
handling in Unix-like environments. (It was created for Bell Labs' Plan 9
operating system, an experimental successor to Unix.)

That it happens to also be compact for English is nice, too.

> It would be very nice
> to have an UTF16 version, which would only take 2-bytes for each character
> most of the time, 33%+- better space-wise.

Much of the time, the raw amount of space taken up by text files is fairly
insignificant. Text is small compared to image and multimedia data, and it
compresses very well.

Modern memory and hard disk prices strongly favor accessibility and
compatibility in most cases over squeezing a few percentage points out of
uncompressed text size.

-- brion vibber (brion @ pobox.com)



Re: Re: Re: accents not appearing correctly

muyuubyou
"Extended ASCII" is "accepted" and thus exists, regardless it came from the
ASCII board or not. The fact that bytes are 8-bits and almost everything in
computers is in bytes or multiples thereof has created this nightmare of
8-bit encodings we're still suffering today. IBM's first extension is what
many people call "extended ASCII" we like it or not, and that is what I was
talking about. Namely the DOS representation of the higher 128 codes. It
came with "IBM PC".

"There is no such thing" can only be true if you ignore the gazillion lines
of legacy code that assume otherwise.

I agree with you that it's unwise to assume programs will map your
non-ASCII right, but since many do, it's a common thing.  99% (in amount)
of one-byte text is Latin-1.  The other "important" languages are
impossible to represent in 1 byte anyway, except for Arabic and Hebrew, but
those are usually isolated from our "computer isle" in the West.  For
instance, 90% of the "interweb" that isn't Chinese or Japanese belongs to a
language covered by Latin-1.

>False; UTF-8 is meant to be compatible with 7-bit ASCII and the treatment
of
>specially meaningful bytes such as 0 and the '/' path separator in string
>handling in Unix-like environments. (It was created for Bell Labs' Plan 9
>operating system, an experimental successor to Unix.)
>
>That it happens to also be compact for English is nice, too.

In the real world, that it "happens to be compact for English" is crucial.
Unix was developed "in English" mainly, and therefore encoding of the
English language plus some extra codes was all that fell into
consideration.  They simply didn't need to comment code in Japanese.  What
you say is factually true, but I was just pointing out the most important
reason in regards to the topic at hand.  By saying "UTF8 is meant to
display English correctly" I didn't imply it isn't meant to do anything
else, or that that was the basis of it.  I could have said "UTF8 is meant
to encode English correctly and effectively, among other things" but I just
didn't want to shift the focus.

>> and others default to 8859-1, also known as "latin 1" ;
>
>That part is reasonably true for Windows operating systems and some older
>Unix/Linux systems in North America and western Europe.
>
>Mac OS X and most modern Linux systems default to UTF-8.

Yeah, but his editor of choice most probably isn't, or there wouldn't be a
problem in the first place.  Let's keep the focus.

It's a very common scenario, for minor changes, that people connect via
telnet or SSH and quickly edit something directly on their test server,
instead of editing locally and then uploading via FTP (or using an
FTP-capable editor like gvim with the FTP plugin, for instance).  It's also
very common that consoles are set to ISO-8859-1, and thus vi, pico or nano
will use that.  It can also happen that it's a shared environment and the
user just can't install stuff... also, many telnet/SSH clients are not
UTF-8 compatible, or he may have any sort of configuration problem I can't
even imagine now.  Shit happens.

>> UTF8 is, by the way, not the best encoding for Asian text.
>
>That depends on what you mean by "best". If by "best" you mean only "as
compact
>as possible for the particular data I want to use at the moment" then yes,
there
>are other encodings which are more compact.
>
>If, however, compatibility is an issue, UTF-8 is extremely functional and
works
>very well with UNIX/C-style string handling, pathnames, and byte-oriented
>communications protocols at a minor 50% increase in uncompressed size for
such
>languages.
>
>If space were an issue, though, you'd be using data compression.

Compatibility is always an issue, I'm afraid, and for this project UTF-8 is
IMO the best choice if we have to stick to just one encoding.  For
Wikipedia this is undoubtedly true.  Other wikis, I'm sure, would use a
different thing.  But it still "Just Works", so I'm not complaining.  It
also makes things nice for the developers, because many IDEs and editors
support UTF-8 out of the box.

But, of course, space is always an issue.  Using data compression has an
impact on processor performance.  Having a better encoding for your text is
"compression without processing penalty", to put it in layman's terms, and
having to retrieve more data slows down your wiki for several reasons: more
data to retrieve from the database, and more bandwidth needed/longer
transmission time.  For instance, for the average Japanese wiki it would
save 30% space on the server, 15-20% in bandwidth even with mod-gz, and
give 30% better memory usage in database caching (caching is good for
MediaWiki, as you know better than me for sure) - equivalent to having 30%+
more memory for caching.  Those are rough figures.  I'm not asking you to
change this, as it would involve a lot of time I'm sure you can use, just
to keep it in consideration if at some point you had time to support more
than one encoding for MediaWiki.  Many wikis hardly use any images at all,
and when they do, they keep them somewhere outside the database (I haven't
looked into this in MediaWiki - are you storing them in BLOBs?).

So, "UTF8 is not the best for Asian text" as in, "by using exclusively UTF8,
you're bogging your performance down 20%+ for many people" . And extra
tweaks are not realistic for the joe-wiki-admin who most probably won't have
caching at all.

This is not a critique. For me, the wiki works well, it's fast enough and
UTF8 happens to suit me fine. This direction just keeps mediawiki from being
more popular in Asia. Stability and functionality are over performance in my
consideration list.



Re: Re: Re: accents not appearing correctly

Brion Vibber
muyuubyou wrote:
[snipped excuses for making false statements claimed as corrections for
"mistakes" which were true statements]

The only relevant thing in this discussion is that you always have to save your
text files in the proper encoding (which for MediaWiki is always UTF-8, the
standard Unicode encoding for Unix and text-based communication protocols).

-- brion vibber (brion @ pobox.com)



Re: Re: Re: accents not appearing correctly

muyuubyou
[ 'é' and the Japanese 'ki' not being in the same league for the vast
majority of editors in the world - that wasn't a false statement.

I can agree to take back my claim of 'é' being ASCII ("extended" or not),
because strictly it isn't. ]

Hugh, can you reply and let me know if my suggestion worked?

Mr Vibber, pissed or not, it would be wise to reply to questions from users
with more diplomacy, regardless of the tone used in the question in the
first place.  Granted, you were told you were "speaking nonsense" when you
were right, but instead of "you have no idea what you're talking about; é
and Japanese code are both the same for UTF8" you could have said
"actually, both of them have to be encoded properly in UTF8" and nothing
would have happened.  Please don't take this personally.

Is this the right list for suggestions? if so, please take my previous
comment about UTF8 and UTF16 as a suggestion, please don't snip it
out. Just by having UTF8 AND UTF16 things would improve. Sure it's a
lot of work, but it's just a thing to consider for the future.

Different issue (sorry to mix stuff, but the list is busy enough already)

My issue with Firefox is happening in my installation but not in
wikipedia. Not sure what it is, but I'll try to find out when I have
more time.

The following only occurs with Chinese and Japanese text in page titles:

Basically, when I pass the script an existing page, it opens it no problem
in all browsers; but when I pass the script a nonexistent one, it mangles
the title only in Firefox (I don't have other Mozilla browsers installed at
the moment at home; I must check it out with Mozilla, SeaMonkey,
Netscape...).  Opera works just fine.  IE and IE-based ones too.  It's
probably some strange behavior from the browser... but then again it
doesn't happen with Wikipedia.  Just in case someone has any pointers.

If it's something easy please don't scold me, I'm just a user who
hasn't looked too much into the code *hides away*



Re: Re: Re: accents not appearing correctly

Hugh Prior
Thanks muyuubyou and Brion for all your information.

I now understand clearly that I need to make sure my editor is working in
UTF-8, and that the accented e (é) is going to have problems because it is
not part of the 128 characters that make up original ASCII.

At present I haven't yet found an editor-type solution but I have made a
note that jEdit is one option.
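Whatever editor ends up being used, the saved file can be sanity-checked by
trying to decode it as UTF-8; a rough sketch (Python for illustration, with
a placeholder filename):

```python
# Returns True if the file's bytes form valid UTF-8, False otherwise.
def is_valid_utf8(path):
    with open(path, "rb") as f:
        data = f.read()
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# e.g. is_valid_utf8("pageCreate.php")  # hypothetical script filename
```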

When I get the problem fixed I'll let you know.

Thanks again for all your help. :-)


Hugh Prior



_______________________________________________
MediaWiki-l mailing list
[hidden email]
http://mail.wikipedia.org/mailman/listinfo/mediawiki-l

Re: Re: Re: accents not appearing correctly

Hugh Prior
I use Dreamweaver MX as my editor, and I changed the "Document Encoding" to
be "UTF-8 (Unicode)" instead of the default "Western (Latin 1)", and now
this works fine (at least as a hard-coded example anyway).

i.e. in Dreamweaver:
Modify->Properties->Page Properties->Document Encoding

also I have changed the default for new documents to be UTF-8 within
Dreamweaver:
Edit->Preferences->New Document->Default Encoding

Hope that helps somebody out there.


Hugh Prior




Re: Re: Re: accents not appearing correctly

Brion Vibber
In reply to this post by muyuubyou
muyuubyou wrote:
> Mr Vibber, pissed or not, it would be wise to reply to questions from
> users with more diplomacy, regardless of the tone used in the question
> in the first place. Granted you were told to be "speaking nonsense"
> when you were right, but instead of "you have no idea what you're
> talking about, é and japanese code are both the same for UTF8 " you
> could have said "actually, both of them have to be encoded properly in
> UTF8" and nothing would have happened. Please don't take this
> personal.

I'm sorry if I was a bit snappy.

> Is this the right list for suggestions? if so, please take my previous
> comment about UTF8 and UTF16 as a suggestion, please don't snip it
> out. Just by having UTF8 AND UTF16 things would improve. Sure it's a
> lot of work, but it's just a thing to consider for the future.

If you're using MySQL 4.1 or 5.0 and MediaWiki's experimental MySQL 5 mode, and
you know for sure that you aren't going to use compressed text storage, you
might be able to get away with changing text.old_text to a TEXT field type and
assigning it the ucs2 charset. This will store its data as UCS-2 instead of UTF-8.

You can do the same for any of the various name, comment, etc fields.

Unfortunately MySQL doesn't support UTF-16 at this time, and its UTF-8 storage
is also limited so that characters outside the basic multilingual plane (the
classic 16-bit range) can't be stored at all. Attempting to insert these
characters will cause the field to become truncated (in UTF-8) or just corrupt
the character (in UCS-2).

If MySQL supported it, my preference would be to use UTF-16 with 16-bit
collation for the non-bulk-text fields; that is, allow clean translation to/from
compliant UTF-8 but keep the indexes at 2 bytes per code point. This would keep
the size of the indexes down compared to their UTF-8 support (which currently
needs 3 bytes per character and would need 4 if they made it actually support
full UTF-8).
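The per-character sizes cited above are easy to verify. A small Python sketch (illustration only) comparing UTF-8 and UTF-16/UCS-2 widths for a BMP character and one outside the BMP:

```python
bmp = "中"     # U+4E2D, inside the Basic Multilingual Plane
astral = "𐍈"  # U+10348, outside the BMP

print(len(bmp.encode("utf-8")))      # 3 bytes in UTF-8
print(len(bmp.encode("utf-16-be")))  # 2 bytes in UTF-16 (same as UCS-2 for BMP chars)

print(len(astral.encode("utf-8")))      # 4 bytes in UTF-8
print(len(astral.encode("utf-16-be")))  # 4 bytes: a surrogate pair, which UCS-2 cannot represent
```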

Index size directly relates to key caching and index scanning performance, so on
a large-scale setup that can be relevant. (Bulk text storage is much less
significant in this respect; individual records are picked out cheaply based on
an integer index lookup.)

Alternatively, you could potentially whip up some kind of text storage handler
for MediaWiki that would convert the internal UTF-8 data into UTF-16 for storage
in the blob. I doubt this would be significantly pleasant though. :)

Using UTF-16 internally or for output isn't really possible.

> My issue with Firefox is happening in my installation but not in
> wikipedia. Not sure what it is, but I'll try to find out when I have
> more time.
>
> The following only occurs with Chinese and Japanese text in page titles:
>
> Basically when I pass the script an existing page, it opens it no
> problem in all browsers; but when I pass the script an unexisting one,
> it mangles the title only on firefox (don't have other mozilla
> browsers installed at the moment at home, must check it out with
> Mozilla, Seamonkey, Netscape...). Opera works just fine. IE and IE
> based ones too. It's probably some strange behavior from the
> browser... but then again it doesn't happen with Wikipedia. Just in
> case someone has any pointers.

Do you have an example? How is it mangled, exactly?

In what way are you passing the data?
* Typing on the URL bar
* From an <a href> link on a web page
* From a <form> on a web page

If in a URL or link, is the title:
* percent-encoded UTF-8 bytes, per RFC 3987
* percent-encoded bytes in some other encoding, such as EUC-JP or Shift-JIS
* raw typed text

Current versions of IE are, I think, set to send unencoded characters in URLs as
percent-encoded UTF-8. Mozilla for some reason has left this option off, so
sometimes it'll send unencoded characters in <a href> links in the source page's
character set. I'm not sure what it'll do in the URL bar (locale encoding?) but
it seems to be happy to send UTF-8 from the URL bar on my Windows XP box if I
paste in some random Chinese text.

LanguageJa and LanguageZh don't set fallback encoding checks, so non-Unicode
encodings of Japanese and Chinese won't be detected or automatically converted.
(There are multiple character sets in use for these, making it extra difficult
compared with most European languages.)

-- brion vibber (brion @ pobox.com)




Re: Re: Re: accents not appearing correctly

muyuubyou
That was a very interesting update about Unicode support in MySQL. Thanks!

Well, about my little issue with Firefox: I just type in the URL bar, and no
cookie. Following links works (for instance, links to pages created with
Opera also work under Firefox).

For instance:

http://someIPhere/wiki/index.php/中文
I press enter, then the URL bar turns to:
http://someIPhere/wiki/index.php/%C3%96%C3%90%C3%8E%C3%84

, which is page "ÖÐÎÄ"

Ö is UTF-16 => 00D6, UTF-8 => C3 96

Those are 8 bytes there in the URL, 4 for each character... that shouldn't be.

If I Google 中文, Google returns me this page:
http://www.google.com/search?q=%E4%B8%AD%E6%96%87
which looks more like UTF-8 to me. And it works, too.


Been browsing the Unicode chart and the first character is 4e2d:
http://unicode.org/cgi-bin/GetUnihanData.pl?codepoint=4e2d
UTF8 => E4 B8 AD

The second one is 6587 (God, unicode.org is hell to browse this stuff)
which in turn is UTF8 => E6 96 87

I wonder what the browser is doing, because 中 has nothing to do with anything
starting with C3 96 in any encoding.

Hope that helps somehow.
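The byte arithmetic quoted above checks out (Ö being U+00D6). A Python sketch, for illustration, confirming the values:

```python
import urllib.parse

print("Ö".encode("utf-8").hex())  # c396 -- why the mangled URL starts with %C3%96
print("中".encode("utf-8").hex())  # e4b8ad
print("文".encode("utf-8").hex())  # e69687

# Percent-encoding the proper UTF-8 bytes gives exactly Google's query string
print(urllib.parse.quote("中文"))  # %E4%B8%AD%E6%96%87
```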



Re: Re: Re: accents not appearing correctly

Brion Vibber
muyuubyou wrote:
> Well, about my little issue with Firefox: I just type in the URL bar, and no
> cookie. Following links works (for instance, link to pages created with
> Opera, work also under Firefox).
>
> For instance:
>
> http://someIPhere/wiki/index.php/中文

In GB 18030, that's: d6 d0 ce c4

> I press enter, then the URL bar turns to:
> http://someIPhere/wiki/index.php/%C3%96%C3%90%C3%8E%C3%84
>
> , which is page "ÖÐÎÄ"

In ISO 8859-1, that's: d6 d0 ce c4

By any chance is your desktop set to a Chinese locale? It sounds like Firefox is
taking the non-ASCII chars in the URL you type and encoding them as GB 18030.
MediaWiki sees these unexpected non-UTF-8 characters and tries to convert them
from a fallback, which is the default of ISO 8859-1 or Windows-1252.
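This diagnosis can be reproduced end to end. A Python sketch (illustration only; the decode-with-fallback step stands in for MediaWiki's actual conversion code):

```python
import urllib.parse

# Step 1 (browser): the URL-bar text is encoded in the GB 18030 locale encoding
gb_bytes = "中文".encode("gb18030")  # d6 d0 ce c4

# Step 2 (MediaWiki): the bytes are not valid UTF-8...
try:
    gb_bytes.decode("utf-8")
except UnicodeDecodeError:
    # ...so the default ISO 8859-1 fallback kicks in
    mangled = gb_bytes.decode("iso-8859-1")  # 'ÖÐÎÄ'

# Re-encoding the mangled title as UTF-8 yields exactly the URL seen earlier
print(urllib.parse.quote(mangled))  # %C3%96%C3%90%C3%8E%C3%84

# A GB 18030 fallback instead would recover the intended title
print(gb_bytes.decode("gb18030"))  # 中文
```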

Try setting the network.standard-url.encode-utf8 hidden preference in Firefox to
true, see if that fixes it. (Who knows why they haven't turned this on yet, it's
been standard on even MSIE for years...)

-- brion vibber (brion @ pobox.com)




Re: Re: Re: accents not appearing correctly

muyuubyou
My locale is indeed set to Chinese... because most of my users will be
Chinese. Changing the configuration in Firefox didn't work.

Great find BTW :-)

... but I guess my little problem is here to stay, since I can't change my
users' locale. I wish they'd fix that in Firefox; it accounts for almost 10%
of my visits.

To top it off, it doesn't work with locales set to latin-1 either.

Thanks a lot, Brion. I should probably mail somebody at Mozilla :D I'm
stubborn.



Re: Re: Re: accents not appearing correctly

Brion Vibber
muyuubyou wrote:
> My locale is indeed set to Chinese... because most of my users will be
> Chinese. Changing the configuration in Firefox didn't work.
>
> Great find BTW :-)
>
> ... but I guess my little problem is there to stay, since I can't change my
> users' locale. Wish they fixed that in Firefox, it's almost 10% of my
> visits.

If it's reasonably consistent, and you have working iconv or mbstring on PHP,
you might be able to set the conversion by adding a fallbackEncoding() method on
LanguageZh_cn.

-- brion vibber (brion @ pobox.com)


