Wrong database, maybe

classic Classic list List threaded Threaded
14 messages Options
Reply | Threaded
Open this post in threaded view
|

Wrong database, maybe

Javier Bezos-2
Hi all,

This is the background. A year ago we hired a person to
update and add some improvements we needed, with an estimated
time of 8 weeks. After months of delays and a lot of problems,
we've fired him. Unfortunately, I'm afraid the database has
been damaged or not properly set up.

Now, two questions--

1) In the short term, I'd like to fix an issue with searches.
The site is in Spanish but it's unable to find words with
diacriticals in the article body (in the title they are found).
I've rebuilt the search index, with no luck. Fortunately, I
had access to old dumps and I've noticed a difference --
formerly tables were like

CREATE TABLE `wl_archive` (
  `ar_namespace` int(11) NOT NULL default '0',
  `ar_title` varchar(255) character set latin1 collate latin1_bin NOT
NULL default '',
[...]
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

and now they are like

SET @saved_cs_client     = @@character_set_client;
SET character_set_client = utf8;
CREATE TABLE `wl_archive` (
  `ar_namespace` int(11) NOT NULL default '0',
  `ar_title` varchar(255) character set latin1 collate latin1_bin NOT
NULL default '',
[...]
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
SET character_set_client = @saved_cs_client;

2) In the medium term, and with the background above, which
is the best approach to have a "correct" database and system
again.

Thanx
Javier
--------------------------------
www.wikilengua.org
www.tex-tipografia.com

_______________________________________________
MediaWiki-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/mediawiki-l
Reply | Threaded
Open this post in threaded view
|

Re: Wrong database, maybe

Raymond Wan-2

Hi Javier,

The short answer is that I can't help you...I can only tell you what I did and maybe someone else
will give a better reply so that I can learn something, too.


Javier Bezos wrote:

> CREATE TABLE `wl_archive` (
>   `ar_namespace` int(11) NOT NULL default '0',
>   `ar_title` varchar(255) character set latin1 collate latin1_bin NOT
> NULL default '',
> [...]
> ) ENGINE=InnoDB DEFAULT CHARSET=latin1;
>
> and now they are like
>
> SET @saved_cs_client     = @@character_set_client;
> SET character_set_client = utf8;
> CREATE TABLE `wl_archive` (
>   `ar_namespace` int(11) NOT NULL default '0',
>   `ar_title` varchar(255) character set latin1 collate latin1_bin NOT
> NULL default '',
> [...]
> ) ENGINE=InnoDB DEFAULT CHARSET=utf8;
> SET character_set_client = @saved_cs_client;


This is the exact problem that I had.  I had a database with a mix of English and Japanese and could
see everything fine.  I then checked both the database and the system and found out they were both
set to "latin1".  If so, I don't know why it had worked for so long...

So, what I did was change the database and the MySQL system to utf8.  But then the database also has
to be converted and this failed miserably for me.  I followed these steps:

http://en.gentoo-wiki.com/wiki/Convert_latin1_to_UTF-8_in_MySQL

which did not work (more specifically, the "Convert dump" step).  In the end, I re-typed the
Japanese (which wasn't a lot).  So maybe you might have better luck or someone else can help you.
Or maybe you can make some sense of the above link and succeed where I failed.  :-)

Good luck...

Ray


_______________________________________________
MediaWiki-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/mediawiki-l
Reply | Threaded
Open this post in threaded view
|

Re: Wrong database, maybe

Platonides
El 09/03/10 10:20, Raymond Wan wrote:

> Javier Bezos wrote:
>> CREATE TABLE `wl_archive` (
>>    `ar_namespace` int(11) NOT NULL default '0',
>>    `ar_title` varchar(255) character set latin1 collate latin1_bin NOT
>> NULL default '',
>> [...]
>> ) ENGINE=InnoDB DEFAULT CHARSET=latin1;
>>
>> and now they are like
>>
>> SET @saved_cs_client     = @@character_set_client;
>> SET character_set_client = utf8;
>> CREATE TABLE `wl_archive` (
>>    `ar_namespace` int(11) NOT NULL default '0',
>>    `ar_title` varchar(255) character set latin1 collate latin1_bin NOT
>> NULL default '',
>> [...]
>> ) ENGINE=InnoDB DEFAULT CHARSET=utf8;
>> SET character_set_client = @saved_cs_client;
>
>
> This is the exact problem that I had.  I had a database with a mix of English and Japanese and could
> see everything fine.  I then checked both the database and the system and found out they were both
> set to "latin1".  If so, I don't know why it had worked for so long...
>
> So, what I did was change the database and the MySQL system to utf8.  But then the database also has
> to be converted and this failed miserably for me.  I followed these steps:
>
> http://en.gentoo-wiki.com/wiki/Convert_latin1_to_UTF-8_in_MySQL
>
> which did not work (more specifically, the "Convert dump" step).

That's wrong. Even when the database shows charset=latin1, mediawiki
stores utf8 inside.
So you shouldn't have tried to convert the content, that would have
broken it.

>  In the end, I re-typed the
> Japanese (which wasn't a lot).  So maybe you might have better luck or someone else can help you.
> Or maybe you can make some sense of the above link and succeed where I failed.  :-)
>
> Good luck...
>
> Ray

Javier, you need to what's the data representation and set to one
(either latin1 or utf8), I think you had some tables latin1 and others
utf8 now. With $wgDBmysql5 set accordingly.

Without knowing the exact way your db is broken, it'd be hard to fix it.
What search are you using? The internal one? Maybe you can share a
one-revision sql dump?


PS: Update to 1.15.2, you are vulnerable to CSS image inclusion.


_______________________________________________
MediaWiki-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/mediawiki-l
Reply | Threaded
Open this post in threaded view
|

Re: Wrong database, maybe

Javier Bezos-2
Thank yoy for your interest and answers.

Platonides:

> So you shouldn't have tried to convert the content, that would have
> broken it.

Very likely this is what they did.

> Without knowing the exact way your db is broken, it'd be hard to fix it.
> What search are you using? The internal one? Maybe you can share a
> one-revision sql dump?

And your crystall ball? ;-) Well, as to the problem with the search I've
discovered the key for the searchindex table is generated with U8<hex>
while the table has u8<hex>. So, if I replace in SeachMySQL.php

$searchon = $this->db->strencode( $searchon );

by

$searchon = $this->db->strencode( $wgContLang->lc ($searchon) );

words are found. I'm using this as a workaround, but I'm investigating
the origin of the problem (any hints?).

> PS: Update to 1.15.2, you are vulnerable to CSS image inclusion.

Thank you. Please, could you give a pointer explaining it?

Cheers
Javier

_______________________________________________
MediaWiki-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/mediawiki-l
Reply | Threaded
Open this post in threaded view
|

Re: Wrong database, maybe

Chad
On Wed, Mar 10, 2010 at 9:57 AM, Javier Bezos <[hidden email]> wrote:

>> PS: Update to 1.15.2, you are vulnerable to CSS image inclusion.
>
> Thank you. Please, could you give a pointer explaining it?
>
> Cheers
> Javier
>
> _______________________________________________
> MediaWiki-l mailing list
> [hidden email]
> https://lists.wikimedia.org/mailman/listinfo/mediawiki-l
>

http://lists.wikimedia.org/pipermail/mediawiki-announce/2010-March/000088.html

Which was also cross-posted to this list and wikitech-l.

-Chad

_______________________________________________
MediaWiki-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/mediawiki-l
Reply | Threaded
Open this post in threaded view
|

Re: Wrong database, maybe

Javier Bezos-2
> http://lists.wikimedia.org/pipermail/mediawiki-announce/2010-March/000088.html
>
> Which was also cross-posted to this list and wikitech-l.

Thank you, Chad, but what I actually wanted was an explanation of
the bug. If the problem are external images, I presume a link has
to be given somehow, but in our wiki only wiki-like links (ie, [http://...])
are allowed, so I was interested in the details. I presume the fix is
this one:

http://svn.wikimedia.org/viewvc/mediawiki?view=rev&revision=63424

Javier

_______________________________________________
MediaWiki-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/mediawiki-l
Reply | Threaded
Open this post in threaded view
|

Re: Wrong database, maybe

Platonides
Javier Bezos wrote:

>> http://lists.wikimedia.org/pipermail/mediawiki-announce/2010-March/000088.html
>>
>> Which was also cross-posted to this list and wikitech-l.
>
> Thank you, Chad, but what I actually wanted was an explanation of
> the bug. If the problem are external images, I presume a link has
> to be given somehow, but in our wiki only wiki-like links (ie, [http://...])
> are allowed, so I was interested in the details. I presume the fix is
> this one:
>
> http://svn.wikimedia.org/viewvc/mediawiki?view=rev&revision=63424
>
> Javier

Yes. The problem is not the external images feature, but that you could
embed an image inside CSS with mediawiki not noticing that it was an
external link.
Per the advisory, all installs since 1.5 are affected.


_______________________________________________
MediaWiki-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/mediawiki-l
Reply | Threaded
Open this post in threaded view
|

Re: Wrong database, maybe

Raymond Wan-2
In reply to this post by Platonides

Hi,


Platonides wrote:

>> So, what I did was change the database and the MySQL system to utf8.  But then the database also has
>> to be converted and this failed miserably for me.  I followed these steps:
>>
>> http://en.gentoo-wiki.com/wiki/Convert_latin1_to_UTF-8_in_MySQL
>>
>> which did not work (more specifically, the "Convert dump" step).
>
> That's wrong. Even when the database shows charset=latin1, mediawiki
> stores utf8 inside.
> So you shouldn't have tried to convert the content, that would have
> broken it.


Oh...I see -- thanks for this!

Then I guess there are two combinations:  Mediawiki with latin1 MySQL ; Mediawiki with UTF MySQL.
What are the advantages / disadvantages of either choice?

I *guess* that if someone were to login to mysql directly, and did a SELECT, then the UTF would look
like gibberish.  Likewise when a dump is done of the data.  Of course, neither "problem" affects
Mediawiki's functionality...

Any other pros/cons?

Thanks!

Ray


_______________________________________________
MediaWiki-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/mediawiki-l
Reply | Threaded
Open this post in threaded view
|

Re: Wrong database, maybe

Javier Bezos-2
In reply to this post by Javier Bezos-2
I wrote:

> $searchon = $this->db->strencode( $searchon );
> by
> $searchon = $this->db->strencode( $wgContLang->lc ($searchon) );
>
> words are found. I'm using this as a workaround, but I'm investigating
> the origin of the problem (any hints?).

Found. The problem is in the database. For some reason they have
decided the table searchindex should have a collation using latin1 :-(.
So I repeat my second question:

> 2) In the medium term, and with the background above, which
> is the best approach to have a "correct" database and system
> again.

Cheers
Javier

_______________________________________________
MediaWiki-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/mediawiki-l
Reply | Threaded
Open this post in threaded view
|

Re: Wrong database, maybe

Dave butlerdi
When we have had problems it was easier to dump database to sql statements,
drop database, recreate with media wiki setup with all correct settings and
restore the data without drop table instructions.

This worked for us in the past anyway.

On Thu, Mar 11, 2010 at 9:14 AM, Javier Bezos <[hidden email]> wrote:

> I wrote:
>
> > $searchon = $this->db->strencode( $searchon );
> > by
> > $searchon = $this->db->strencode( $wgContLang->lc ($searchon) );
> >
> > words are found. I'm using this as a workaround, but I'm investigating
> > the origin of the problem (any hints?).
>
> Found. The problem is in the database. For some reason they have
> decided the table searchindex should have a collation using latin1 :-(.
> So I repeat my second question:
>
> > 2) In the medium term, and with the background above, which
> > is the best approach to have a "correct" database and system
> > again.
>
> Cheers
> Javier
>
> _______________________________________________
> MediaWiki-l mailing list
> [hidden email]
> https://lists.wikimedia.org/mailman/listinfo/mediawiki-l
>



--
Regards

Dave Butler
butlerdi-at-gmail-dot-com

Also on Skype as butlerdi

Get Skype here http://www.skype.com/download.html


**********************************************************************
This email and any files transmitted with it are confidential and
intended solely for the use of the individual or entity to whom they
are addressed. If you have received this email in error please notify
the system manager.

This footnote also confirms that this email message has been swept by
MIMEsweeper for the presence of computer viruses.

www.mimesweeper.com
**********************************************************************
_______________________________________________
MediaWiki-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/mediawiki-l
Reply | Threaded
Open this post in threaded view
|

Re: Wrong database, maybe

Platonides
In reply to this post by Raymond Wan-2
Raymond Wan wrote:

> Oh...I see -- thanks for this!
>
> Then I guess there are two combinations:  Mediawiki with latin1 MySQL ; Mediawiki with UTF MySQL.
> What are the advantages / disadvantages of either choice?
>
> I *guess* that if someone were to login to mysql directly, and did a SELECT, then the UTF would look
> like gibberish.  Likewise when a dump is done of the data.  Of course, neither "problem" affects
> Mediawiki's functionality...
>
> Any other pros/cons?
>
> Thanks!
>
> Ray

MediaWiki offers you three character sets for MySQL:
  * MySQL 4.1/5.0 binary
  * MySQL 4.1/5.0 UTF-8
  * MySQL 4.0 backwards-compatible UTF-8

In the three modes MediaWiki is storing utf-8 characters. It all depends
on how MySQL treats them.

In "backwards-compatible UTF-8" mysql thinks it's latin1. The data  will
"look wrong" and if you don't provide --default-character-set for
mysqldump 4.1 and newer, it will corrupt the text (it will "helpfully"
transform it to utf-8). This is the only one which works with mysql 4.0,
and it supports the full unicode.

UTF-8 uses MySQL support for UTF-8, which currentyl limits you to the
Basic Multilingual Plane. The data will "look right". The indexes will
be larger.

With binary, it works almost like backwards utf-8, but mysql will treat
it as opaque data and won't mess with it. Representation will be messy.
You have the full unicode.


_______________________________________________
MediaWiki-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/mediawiki-l
Reply | Threaded
Open this post in threaded view
|

Re: UTF support (was: Wrong database, maybe)

Raymond Wan-2

Hi,


Platonides wrote:
> MediaWiki offers you three character sets for MySQL:
>   * MySQL 4.1/5.0 binary
>   * MySQL 4.1/5.0 UTF-8
>   * MySQL 4.0 backwards-compatible UTF-8
...


Thank you for your explanation!  I think I understand the differences between the 3 options.  Also,
looking at the installation instructions, it seems I was asked that before and probably just clicked
"OK".  I installed Mediawiki 3-4 years ago and have been upgrading since...so maybe back then I
wasn't asked?  I honestly don't remember...

Is there a way to find out what it is set at?  Since I've forced mysql to use UTF-8, I'm not sure if
I've made a mistake...but since it's retrieving Japanese characters fine, it must be ok...  I'm just
concerned that I've broken something that will show up later on...

Thanks a lot!

Ray


_______________________________________________
MediaWiki-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/mediawiki-l
Reply | Threaded
Open this post in threaded view
|

Re: UTF support

Platonides
Raymond Wan wrote:

> Platonides wrote:
>> MediaWiki offers you three character sets for MySQL:
>>    * MySQL 4.1/5.0 binary
>>    * MySQL 4.1/5.0 UTF-8
>>    * MySQL 4.0 backwards-compatible UTF-8
> ...
>
>
> Thank you for your explanation!  I think I understand the differences between the 3 options.  Also,
> looking at the installation instructions, it seems I was asked that before and probably just clicked
> "OK".  I installed Mediawiki 3-4 years ago and have been upgrading since...so maybe back then I
> wasn't asked?  I honestly don't remember...
>
> Is there a way to find out what it is set at?  Since I've forced mysql to use UTF-8, I'm not sure if
> I've made a mistake...but since it's retrieving Japanese characters fine, it must be ok...  I'm just
> concerned that I've broken something that will show up later on...
>
> Thanks a lot!
>
> Ray


Look at your table definitions. If the CREATE TABLE contains DEFAULT
CHARSET=binary or DEFAULT CHARSET=utf8, you are using the first two, and
$wgDBmysql5 should be set. Else you are using the third one and
$wgDBmysql5 should be false.

The three options are supported, if it works, don't fix it  :)



_______________________________________________
MediaWiki-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/mediawiki-l
Reply | Threaded
Open this post in threaded view
|

Re: UTF support

Raymond Wan-2

Hi,


Platonides wrote:
> Look at your table definitions. If the CREATE TABLE contains DEFAULT
> CHARSET=binary or DEFAULT CHARSET=utf8, you are using the first two, and
> $wgDBmysql5 should be set. Else you are using the third one and
> $wgDBmysql5 should be false.


Ah, thanks -- I'll take a look for it -- thanks!


> The three options are supported, if it works, don't fix it  :)


Yes...good point -- I should have asked on this list before attempting to "fix" something that
wasn't broken!

Thanks for the advice!

Ray


_______________________________________________
MediaWiki-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/mediawiki-l