inconsistencies in dumps ?

inconsistencies in dumps ?

Colin Delacroix
[repost, sorry if it ends up duplicated]

Hi

it seems to me that there are some inconsistencies between at least the
page and revision tables, in the 20060303 enwiki dump.

The first problematic page would be page_id 12, Anarchism (sorry for the
raw mysql formatting):
| page_id | page_namespace | page_title | page_restrictions | page_counter | page_is_redirect | page_is_new | page_random       | page_touched   | page_latest | page_len |
+---------+----------------+------------+-------------------+--------------+------------------+-------------+-------------------+----------------+-------------+----------+
|      12 |              0 | Anarchism  |                   |         5252 |                0 |           0 | 0.786172332974311 | 20060303031540 |    41982999 |    67537 |

The page_latest value in this row points to revision 41982999.

But there is no line with rev_id=41982999 in the revision table.

(this can be verified by grepping for 41982999 directly in
enwiki-20060303-pages-articles.xml.bz2 and in
enwiki-20060303-page.sql.gz)
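
A structured alternative to grepping the XML dump, as a minimal sketch: it
streams a MediaWiki XML export and yields each (page id, revision id) pair, so
a given revision id can be looked up without matching unrelated numbers. The
function names, the sample document and its revision id 1000 are made up for
illustration; only the page id 12 and the file name come from the post above.
It covers the XML dump only, not page.sql.gz.

```python
import bz2
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def revision_ids(xml_file):
    """Yield (page_id, rev_id) for every <revision> in an export stream."""
    for _event, elem in ET.iterparse(xml_file, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        page_id = None
        rev_ids = []
        for child in elem:
            name = local_name(child.tag)
            if name == "id":
                page_id = int(child.text)
            elif name == "revision":
                # Only direct children: the contributor's <id> is nested deeper.
                for sub in child:
                    if local_name(sub.tag) == "id":
                        rev_ids.append(int(sub.text))
                        break
        for rev_id in rev_ids:
            yield page_id, rev_id
        elem.clear()  # drop the page's children to limit memory use


def dump_contains_revision(path, wanted_rev_id):
    """Scan a .xml.bz2 dump and report whether wanted_rev_id occurs in it."""
    with bz2.open(path, "rb") as f:
        return any(rev == wanted_rev_id for _page, rev in revision_ids(f))


# Tiny made-up export for trying the parser; the namespace URI is illustrative.
SAMPLE_EXPORT = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.3/">
  <page>
    <title>Anarchism</title>
    <id>12</id>
    <revision>
      <id>1000</id>
      <contributor><username>Example</username><id>7</id></contributor>
      <text>placeholder</text>
    </revision>
  </page>
</mediawiki>"""

if __name__ == "__main__":
    print(list(revision_ids(io.BytesIO(SAMPLE_EXPORT))))
```

Against the real file this would be called as
dump_contains_revision("enwiki-20060303-pages-articles.xml.bz2", 41982999),
which reads the whole archive once.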


Now:
- am I missing something here ?
- it might be that the revision has changed between the dumps of those 2
tables (page has been edited)
- it ends in empty pages (i.e. with the usual stub text), for ~ 5% of
the pages (that seems huge, but I don't see where the problem lies)
- is it a temporary problem (I don't recall getting so many empty
articles with earlier dumps) ?
- is there a simple way to fix it ? (if no better idea emerges, I will
try to fix the page_latest column in the page table by doing a lookup on
rev_page in the revision table - is it right ?)
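
The lookup proposed in the last point can be sketched as follows. This is an
illustration only, not a recommended procedure: it uses an in-memory SQLite
database as a stand-in for the MySQL tables, keeps only the columns involved,
and assumes that the highest rev_id for a page is its latest revision. The
function names and all revision ids except 41982999 are made up.

```python
import sqlite3


def demo_db():
    """Build a tiny stand-in for the page and revision tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE page (
            page_id INTEGER PRIMARY KEY, page_title TEXT, page_latest INTEGER
        );
        CREATE TABLE revision (rev_id INTEGER PRIMARY KEY, rev_page INTEGER);
        INSERT INTO page VALUES (12, 'Anarchism', 41982999);  -- dangling
        INSERT INTO page VALUES (25, 'Example', 500);         -- consistent
        INSERT INTO revision VALUES (400, 12);
        INSERT INTO revision VALUES (450, 12);
        INSERT INTO revision VALUES (500, 25);
        """
    )
    return conn


def dangling_pages(conn):
    """Return (page_id, page_latest) rows with no matching revision row."""
    return conn.execute(
        """
        SELECT p.page_id, p.page_latest
        FROM page AS p
        LEFT JOIN revision AS r ON r.rev_id = p.page_latest
        WHERE r.rev_id IS NULL
        ORDER BY p.page_id
        """
    ).fetchall()


def repair_page_latest(conn):
    """Point each dangling page_latest at the page's highest rev_id.

    Pages with no revision rows at all are left untouched.
    Returns the number of rows updated.
    """
    cur = conn.execute(
        """
        UPDATE page
        SET page_latest = (
            SELECT MAX(rev_id) FROM revision WHERE rev_page = page.page_id
        )
        WHERE EXISTS (SELECT 1 FROM revision WHERE rev_page = page.page_id)
          AND page_latest NOT IN (SELECT rev_id FROM revision)
        """
    )
    conn.commit()
    return cur.rowcount


if __name__ == "__main__":
    conn = demo_db()
    print("before:", dangling_pages(conn))
    print("rows repaired:", repair_page_latest(conn))
    print("after:", dangling_pages(conn))
```

Running dangling_pages first gives a count of affected pages, which is one way
to check the ~5% estimate before changing anything.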


Thanks
--
Colin

_______________________________________________
Wikitech-l mailing list
[hidden email]
http://mail.wikipedia.org/mailman/listinfo/wikitech-l

Re: inconsistencies in dumps ?

Brion Vibber
Colin Delacroix wrote:
> - it might be that the revision has changed between the dumps of those 2
> tables (page has been edited)

They're dumped at different times, yes. A copy of the 'page' table is for
reference only, and is not guaranteed to be consistent with the XML data dumps.
If you're using an XML data dump you don't need a separate download of the
'page' table, just use the one created by the import.

> - it ends in empty pages (i.e. with the usual stub text), for ~ 5% of
> the pages (that seems huge, but I don't see where the problem lies)

I don't know what this means. What "ends" in empty pages, and in what way?

> - is there a simple way to fix it ? (if no better idea emerges, I will
> try to fix the page_latest column in the page table by doing a lookup on
> rev_page in the revision table - is it right ?)

Just use the page table generated from the import; it's guaranteed to be consistent.

-- brion vibber (brion @ pobox.com)



Re: inconsistencies in dumps ?

Colin Delacroix
Brion Vibber <[hidden email]> wrote:

> > - it might be that the revision has changed between the dumps of those 2
> > tables (page has been edited)
>
> They're dumped at different times, yes. A copy of the 'page' table is for
> reference only, and is not guaranteed to be consistent with the XML data
> dumps. If you're using an XML data dump you don't need a separate download
> of the 'page' table, just use the one created by the import.

Ok. I had missed the fact that the page.sql archive was superfluous for
a simple import, sorry.

> > - it ends in empty pages (i.e. with the usual stub text), for ~ 5% of
> > the pages (that seems huge, but I don't see where the problem lies)
>
> I don't know what this means. What "ends" in empty pages, and in what way?

I meant the renderer output, but it's irrelevant now.

> Just use the page table generated from the import; it's guaranteed to be
> consistent.

Thanks a lot, Brion, now it works just right.

--
Colin
