Duplicate Articles in Database Dump -- MYSQL errors

classic Classic list List threaded Threaded
5 messages Options
Reply | Threaded
Open this post in threaded view
|

Duplicate Articles in Database Dump -- MYSQL errors

jmerkey-3
The last two runs of database dumps are severely broken and have a large number of duplicate titles which cause mysql to throw errors.   I have managed to get some good translation runs and full imports, but not without a lot of work, and a lot of wasted time.  

How about we fix the database dumps for those of us who need them and stop introfucing breakage.  

Thanks.   Latest errors and duplicate titles attached.   I did not attach the full listing since this will exceed the limits of the mail server for a single message since there are hundreds of errors.  

The 20070802 dumps do not seem to have this problem as bad.  
ERROR 1062 (23000) at line 96187: Duplicate entry '0-Nyquist_theorem' for key 2

ERROR 1062 (23000) at line 113658: Duplicate entry '0-Urysohn_lemma' for key 2

ERROR 1062 (23000) at line 114417: Duplicate entry '0-Turner_syndrome' for key 2

ERROR 1062 (23000) at line 116050: Duplicate entry '0-World_fair' for key 2

ERROR 1062 (23000) at line 125448: Duplicate entry '0-Dining_cryptographer_protocol' for key 2

ERROR 1062 (23000) at line 132820: Duplicate entry '0-Rothmund-Thompson_syndrome' for key 2

ERROR 1062 (23000) at line 136377: Duplicate entry '0-Hansen_disease' for key 2

ERROR 1062 (23000) at line 139130: Duplicate entry '0-Wilson_disease' for key 2

ERROR 1062 (23000) at line 147900: Duplicate entry '0-Falkner_Island' for key 2

ERROR 1062 (23000) at line 170901: Duplicate entry '0-Microsoft_.NET' for key 2

ERROR 1062 (23000) at line 184074: Duplicate entry '0-Ohm_Law' for key 2

ERROR 1062 (23000) at line 204307: Duplicate entry '0-Kaposi_Sarcoma' for key 2

ERROR 1062 (23000) at line 257406: Duplicate entry '0-Bell_inequality' for key 2

ERROR 1062 (23000) at line 289396: Duplicate entry '0-Hollywood_Walk_of_Fame' for key 2

ERROR 1062 (23000) at line 343974: Duplicate entry '0-Sgt._Pepper_Lonely_Hearts_Club_Band' for key 2

ERROR 1062 (23000) at line 350134: Duplicate entry '0-Boltzmann_constant' for key 2

ERROR 1062 (23000) at line 361687: Duplicate entry '0-Gauss_theorem' for key 2

ERROR 1062 (23000) at line 369568: Duplicate entry '0-Charles_law' for key 2

ERROR 1062 (23000) at line 374524: Duplicate entry '0-Lavender_Blue' for key 2

ERROR 1062 (23000) at line 375710: Duplicate entry '0-DeMorgan' for key 2

ERROR 1062 (23000) at line 378180: Duplicate entry '0-Kaposi_sarcoma' for key 2

ERROR 1062 (23000) at line 383521: Duplicate entry '0-Down_syndrome' for key 2

ERROR 1062 (23000) at line 384832: Duplicate entry '0-Maslow_hierarchy_of_needs' for key 2

ERROR 1062 (23000) at line 389129: Duplicate entry '0-Japan_copyright_law' for key 2

ERROR 1062 (23000) at line 397556: Duplicate entry '0-Sainsbury' for key 2

ERROR 1062 (23000) at line 409151: Duplicate entry '0-St._John' for key 2

ERROR 1062 (23000) at line 413091: Duplicate entry '0-Milgram_experiment' for key 2

ERROR 1062 (23000) at line 418176: Duplicate entry '0-Hudson_Bay_Company' for key 2

ERROR 1062 (23000) at line 419421: Duplicate entry '0-Tourette_syndrome' for key 2

ERROR 1062 (23000) at line 423098: Duplicate entry '0-Benny_Goodman_Orchestra' for key 2

ERROR 1062 (23000) at line 427487: Duplicate entry '0-Long_John_Silver' for key 2

ERROR 1062 (23000) at line 432944: Duplicate entry '0-Schroedinger_equation' for key 2

ERROR 1062 (23000) at line 442032: Duplicate entry '0-Hölder_inequality' for key 2

ERROR 1062 (23000) at line 444903: Duplicate entry '0-Jay_Treaty' for key 2

ERROR 1062 (23000) at line 446458: Duplicate entry '0-Back_River' for key 2

ERROR 1062 (23000) at line 453610: Duplicate entry '0-1980' for key 2

ERROR 1062 (23000) at line 478337: Duplicate entry '0-Hudson_Bay' for key 2

ERROR 1062 (23000) at line 479826: Duplicate entry '0-Hilbert_basis_theorem' for key 2

ERROR 1062 (23000) at line 484443: Duplicate entry '0-Saint_Mary' for key 2


_______________________________________________
Wikitech-l mailing list
[hidden email]
http://lists.wikimedia.org/mailman/listinfo/wikitech-l
Reply | Threaded
Open this post in threaded view
|

Re: Duplicate Articles in Database Dump -- MYSQL errors

jmerkey-3
NOTE:  Many of these errors are self-referencing #REDIRECT statements which cause database corruption if not applied in the right order.

In short, the DUMPS are broken again.

Jeff
  ----- Original Message -----
  From: Jeffrey Vernon Merkey
  To: [hidden email]
  Sent: Monday, October 01, 2007 1:11 PM
  Subject: Duplicate Articles in Database Dump -- MYSQL errors


  The last two runs of database dumps are severely broken and have a large number of duplicate titles which cause mysql to throw errors.   I have managed to get some good translation runs and full imports, but not without a lot of work, and a lot of wasted time.  

  How about we fix the database dumps for those of us who need them and stop introfucing breakage.  

  Thanks.   Latest errors and duplicate titles attached.   I did not attach the full listing since this will exceed the limits of the mail server for a single message since there are hundreds of errors.  

  The 20070802 dumps do not seem to have this problem as bad.  
  ERROR 1062 (23000) at line 96187: Duplicate entry '0-Nyquist_theorem' for key 2

  ERROR 1062 (23000) at line 113658: Duplicate entry '0-Urysohn_lemma' for key 2

  ERROR 1062 (23000) at line 114417: Duplicate entry '0-Turner_syndrome' for key 2

  ERROR 1062 (23000) at line 116050: Duplicate entry '0-World_fair' for key 2

  ERROR 1062 (23000) at line 125448: Duplicate entry '0-Dining_cryptographer_protocol' for key 2

  ERROR 1062 (23000) at line 132820: Duplicate entry '0-Rothmund-Thompson_syndrome' for key 2

  ERROR 1062 (23000) at line 136377: Duplicate entry '0-Hansen_disease' for key 2

  ERROR 1062 (23000) at line 139130: Duplicate entry '0-Wilson_disease' for key 2

  ERROR 1062 (23000) at line 147900: Duplicate entry '0-Falkner_Island' for key 2

  ERROR 1062 (23000) at line 170901: Duplicate entry '0-Microsoft_.NET' for key 2

  ERROR 1062 (23000) at line 184074: Duplicate entry '0-Ohm_Law' for key 2

  ERROR 1062 (23000) at line 204307: Duplicate entry '0-Kaposi_Sarcoma' for key 2

  ERROR 1062 (23000) at line 257406: Duplicate entry '0-Bell_inequality' for key 2

  ERROR 1062 (23000) at line 289396: Duplicate entry '0-Hollywood_Walk_of_Fame' for key 2

  ERROR 1062 (23000) at line 343974: Duplicate entry '0-Sgt._Pepper_Lonely_Hearts_Club_Band' for key 2

  ERROR 1062 (23000) at line 350134: Duplicate entry '0-Boltzmann_constant' for key 2

  ERROR 1062 (23000) at line 361687: Duplicate entry '0-Gauss_theorem' for key 2

  ERROR 1062 (23000) at line 369568: Duplicate entry '0-Charles_law' for key 2

  ERROR 1062 (23000) at line 374524: Duplicate entry '0-Lavender_Blue' for key 2

  ERROR 1062 (23000) at line 375710: Duplicate entry '0-DeMorgan' for key 2

  ERROR 1062 (23000) at line 378180: Duplicate entry '0-Kaposi_sarcoma' for key 2

  ERROR 1062 (23000) at line 383521: Duplicate entry '0-Down_syndrome' for key 2

  ERROR 1062 (23000) at line 384832: Duplicate entry '0-Maslow_hierarchy_of_needs' for key 2

  ERROR 1062 (23000) at line 389129: Duplicate entry '0-Japan_copyright_law' for key 2

  ERROR 1062 (23000) at line 397556: Duplicate entry '0-Sainsbury' for key 2

  ERROR 1062 (23000) at line 409151: Duplicate entry '0-St._John' for key 2

  ERROR 1062 (23000) at line 413091: Duplicate entry '0-Milgram_experiment' for key 2

  ERROR 1062 (23000) at line 418176: Duplicate entry '0-Hudson_Bay_Company' for key 2

  ERROR 1062 (23000) at line 419421: Duplicate entry '0-Tourette_syndrome' for key 2

  ERROR 1062 (23000) at line 423098: Duplicate entry '0-Benny_Goodman_Orchestra' for key 2

  ERROR 1062 (23000) at line 427487: Duplicate entry '0-Long_John_Silver' for key 2

  ERROR 1062 (23000) at line 432944: Duplicate entry '0-Schroedinger_equation' for key 2

  ERROR 1062 (23000) at line 442032: Duplicate entry '0-Hölder_inequality' for key 2

  ERROR 1062 (23000) at line 444903: Duplicate entry '0-Jay_Treaty' for key 2

  ERROR 1062 (23000) at line 446458: Duplicate entry '0-Back_River' for key 2

  ERROR 1062 (23000) at line 453610: Duplicate entry '0-1980' for key 2

  ERROR 1062 (23000) at line 478337: Duplicate entry '0-Hudson_Bay' for key 2

  ERROR 1062 (23000) at line 479826: Duplicate entry '0-Hilbert_basis_theorem' for key 2

  ERROR 1062 (23000) at line 484443: Duplicate entry '0-Saint_Mary' for key 2


_______________________________________________
Wikitech-l mailing list
[hidden email]
http://lists.wikimedia.org/mailman/listinfo/wikitech-l
Reply | Threaded
Open this post in threaded view
|

Re: Duplicate Articles in Database Dump -- MYSQL errors

Brion Vibber-3
Jeffrey Vernon Merkey wrote:
> NOTE:  Many of these errors are self-referencing #REDIRECT statements which cause database corruption if not applied in the right order.

Offhand I'd guess that your table schemas are wrong, using
case-insensitive collation. Page title fields must be set as binary
(varbinary or varchar binary) to ensure you don't get duplicate key errors.

Can you double-check?

-- brion vibber (brion @ wikimedia.org)

_______________________________________________
Wikitech-l mailing list
[hidden email]
http://lists.wikimedia.org/mailman/listinfo/wikitech-l
Reply | Threaded
Open this post in threaded view
|

Re: Duplicate Articles in Database Dump -- MYSQL errors

jmerkey-3
> Jeffrey Vernon Merkey wrote:
>> NOTE:  Many of these errors are self-referencing #REDIRECT statements
>> which cause database corruption if not applied in the right order.
>
> Offhand I'd guess that your table schemas are wrong, using
> case-insensitive collation. Page title fields must be set as binary
> (varbinary or varchar binary) to ensure you don't get duplicate key
> errors.
>
> Can you double-check?
>

Table Schema's are those produced by tables.sql.  I use the following
script  (passwords removed) to create the shell database.  Opening the XML
dump with hexedit reveals there are in fact a large number of duplicate
titles.

I am guessing this may be due to the clustering setup you are using not
checking for duplicate titles.  <title>Ohm Law</title> is one example to
check.

Here is the method I use for each MediaWiki version to setup the base tables.

mysqladmin drop endb --password=XXXX
mysqladmin create endb --password=XXXX
echo "grant all privileges on endb.* to wgchr@localhost identified by
'dhbowt';" | mysql --password=XXXX
echo "flush privileges" | mysql --password=XXXX
mysql --password=XXXX endb < /wikidump/en/maintenance/tables.sql
mysql --password=XXXX endb < /wikidump/en/maintenance/wikipedia-interwiki.sql
php maintenance/createBcrat.php WikiSysop XXXX
php maintenance/changePassword.php --user=WikiSysop --password=XXXX


Jeff



> -- brion vibber (brion @ wikimedia.org)
>
> _______________________________________________
> Wikitech-l mailing list
> [hidden email]
> http://lists.wikimedia.org/mailman/listinfo/wikitech-l
>



_______________________________________________
Wikitech-l mailing list
[hidden email]
http://lists.wikimedia.org/mailman/listinfo/wikitech-l
Reply | Threaded
Open this post in threaded view
|

Re: Duplicate Articles in Database Dump -- MYSQL errors

Brion Vibber-3
[hidden email] wrote:

>> Jeffrey Vernon Merkey wrote:
>>> NOTE:  Many of these errors are self-referencing #REDIRECT statements
>>> which cause database corruption if not applied in the right order.
>> Offhand I'd guess that your table schemas are wrong, using
>> case-insensitive collation. Page title fields must be set as binary
>> (varbinary or varchar binary) to ensure you don't get duplicate key
>> errors.
>>
>> Can you double-check?
>>
>
> Table Schema's are those produced by tables.sql.  I use the following
> script  (passwords removed) to create the shell database.  Opening the XML
> dump with hexedit reveals there are in fact a large number of duplicate
> titles.
>
> I am guessing this may be due to the clustering setup you are using not
> checking for duplicate titles.  <title>Ohm Law</title> is one example to
> check.

Duplicate titles in the source should be impossible, as the title
records are read straight out of the page table in a single pass.


I've confirmed:

[brion@benet 20070908]$ gzip -dc enwiki-20070908-stub-articles.xml.gz |
grep '<title>Ohm Law</title>'
    <title>Ohm Law</title>
[brion@benet 20070908]$ bzip2 -dc enwiki-20070908-pages-articles.xml.bz2
| grep '<title>Ohm Law</title>'
    <title>Ohm Law</title>


[brion@benet 20070908]$ gzip -dc
enwiki-20070908-stub-meta-current.xml.gz | grep '<title>Ohm Law</title>'
    <title>Ohm Law</title>
[brion@benet 20070908]$ bzip2 -dc
enwiki-20070908-pages-meta-current.xml.bz2 | grep '<title>Ohm Law</title>'
    <title>Ohm Law</title>


So, no problem in the dump files. Can you confirm these are the affected
files? I notice that you didn't mention which dump files you are using
("the latest", but of which wiki, and which data set?), nor have you
told how you got SQL out of the XML dumps, with what software, what
version of it, what if any options, what if any processing...


Another possibility might be if you're using a dump-to-SQL tool that's
broken, perhaps in its handling of namespaces, or if you've done some
processing on the XML dump file which damages the namespace list.

mwdumper, for instance, will require the <namespaces> info to properly
split titles.

MW's internal importDump.php (which is much slower) may have unexpected
results if your local namespaces don't match, particularly where some
namespaces in the dump match local interwiki prefixes (eg 'Wikipedia:')
and are not defined locally.


Yet another possibility is a corrupt download, where parts of the file
have been duplicated within itself.


Or of course you may just be importing the dump twice by mistake somehow.


However given your note about #redirect entries, the most likely
explanation is mismarked indexes.

Can you provide the exact 'page' table definition you're using?

Trunk's maintenance/tables.sql (as of r26282) defines page_title as:

  -- The rest of the title, as text.
  -- Spaces are transformed into underscores in title storage.
  page_title varchar(255) binary NOT NULL,

which should be nicely binary-safe for sorting and unique index matches.
Please confirm that your table has the same definition for the field.

> Here is the method I use for each MediaWiki version to setup the base tables.
>
> mysqladmin drop endb --password=XXXX
> mysqladmin create endb --password=XXXX
> echo "grant all privileges on endb.* to wgchr@localhost identified by
> 'dhbowt';" | mysql --password=XXXX
> echo "flush privileges" | mysql --password=XXXX
> mysql --password=XXXX endb < /wikidump/en/maintenance/tables.sql
> mysql --password=XXXX endb < /wikidump/en/maintenance/wikipedia-interwiki.sql
> php maintenance/createBcrat.php WikiSysop XXXX
> php maintenance/changePassword.php --user=WikiSysop --password=XXXX

Confirm the version and the table definition, and your XML-to-SQL
conversion.

-- brion vibber (brion @ wikimedia.org)

_______________________________________________
Wikitech-l mailing list
[hidden email]
http://lists.wikimedia.org/mailman/listinfo/wikitech-l