Proposal: slight change to the XML dump format

Daniel Kinzler
tl;dr:

In the XML dumps, I want to change
<text> <sha1> <model> <format>
to
<model> <format> <text> <sha1>

However, this is a breaking change to our XML schema.
See https://bugzilla.wikimedia.org/show_bug.cgi?id=72417


Background:

While trying to fix bug 72361, I ran into an issue with our current XML dump format:

The <model> and <format> tags are placed *after* the <text> tag.
This means that we don't know how to handle the text when we process XML events
in a stream - we'd have to buffer the text, wait until we know model and format,
and then process it. A pain.
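
To illustrate the buffering problem, here is a minimal sketch using Python's standard xml.sax module. The RevisionHandler class, the process callback, and the sample fragment are hypothetical and made up for illustration; real dumps carry more elements and an XML namespace. With <text> ahead of <model> and <format>, the handler cannot hand the text over until the whole revision has been read:

```python
import xml.sax


class RevisionHandler(xml.sax.ContentHandler):
    """Collects <text>, <model> and <format> of each <revision>.

    Hypothetical sketch: because <text> may arrive before <model> and
    <format>, the text is buffered until the revision element closes.
    """

    def __init__(self, process):
        super().__init__()
        self.process = process   # callback(model, format, text)
        self.current = None      # tracked element currently being read
        self.fields = {}         # buffered character data per element

    def startElement(self, name, attrs):
        if name == "revision":
            self.fields = {}
        if name in ("text", "model", "format"):
            self.current = name
            self.fields[name] = []

    def characters(self, content):
        if self.current is not None:
            self.fields[self.current].append(content)

    def endElement(self, name):
        if name == self.current:
            self.current = None
        if name == "revision":
            # Only at this point are model and format guaranteed to be known.
            self.process(
                "".join(self.fields.get("model", [])),
                "".join(self.fields.get("format", [])),
                "".join(self.fields.get("text", [])),
            )


# Made-up fragment using the current tag order.
OLD_ORDER = (
    "<revision><text>Hello</text><sha1>abc</sha1>"
    "<model>wikitext</model><format>text/x-wiki</format></revision>"
)

seen = []
handler = RevisionHandler(lambda model, fmt, text: seen.append((model, fmt, text)))
xml.sax.parseString(OLD_ORDER.encode("utf-8"), handler)
```

If <model> and <format> came first, a handler like this could choose how to treat the content as soon as <text> opens and pass the characters on directly instead of accumulating them.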

The current order has no deeper meaning - it is, indeed, my own fault: I didn't
think this through when adding these tags. I propose to change the order of the
tags now, to make stream processing easier.

That would technically be a breaking change to the dump format, incompatible
with <https://www.mediawiki.org/xml/export-0.8.xsd> and export-0.9.xsd. I doubt,
however, that any consumers rely on the current placement of <model> and
<format>, as it is extremely inconvenient (compare bug 72361), but you never know.
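
As a rough sketch of why many consumers may not care about the placement: a reader that looks children up by name once a revision is complete gets the same result for either order. The iter_revisions helper and both sample fragments below are hypothetical, and namespaces are left out for brevity. Note that this approach holds the whole revision in memory, which is exactly the buffering a streaming consumer wants to avoid.

```python
import io
import xml.etree.ElementTree as ET


def iter_revisions(stream):
    """Yield (model, format, text) per <revision>, whatever the child order."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == "revision":
            # Children are looked up by name, so their order is irrelevant.
            record = (
                elem.findtext("model"),
                elem.findtext("format"),
                elem.findtext("text"),
            )
            elem.clear()  # drop the finished revision's children to save memory
            yield record


# Made-up fragments: current order versus proposed order.
OLD_ORDER = (
    b"<revision><text>Hello</text><sha1>abc</sha1>"
    b"<model>wikitext</model><format>text/x-wiki</format></revision>"
)
NEW_ORDER = (
    b"<revision><model>wikitext</model><format>text/x-wiki</format>"
    b"<text>Hello</text><sha1>abc</sha1></revision>"
)

old_records = list(iter_revisions(io.BytesIO(OLD_ORDER)))
new_records = list(iter_revisions(io.BytesIO(NEW_ORDER)))
```

Only tools that validate against the XSD, or that read children strictly by position, would be expected to notice the change.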

I propose to release a new XSD version 0.10 with the order changed, and mention
it in the release notes. Should be fine.

Any objections?

-- daniel

_______________________________________________
Wikitech-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: Proposal: slight change to the XML dump format

Aaron Halfaker-3
I spend a lot of time processing the XML dumps that this will affect.  I
just wanted to chime in to say that this change makes sense to me and it
won't affect my work.

-Aaron

On Thu, Oct 23, 2014 at 9:06 AM, Daniel Kinzler <[hidden email]>
wrote:

> tl;dr:
>
> In the xml dumps, I want to change
> <text> <sha1> <model> <format>
> to
> <model> <format> <text> <sha1>
>
> However, this is a breaking change to our XML schema.
> See https://bugzilla.wikimedia.org/show_bug.cgi?id=72417

Re: Proposal: slight change to the XML dump format

Daniel Kinzler
In reply to this post by Daniel Kinzler
On 23.10.2014 16:06, Daniel Kinzler wrote:
> tl;dr:
>
> In the xml dumps, I want to change
> <text> <sha1> <model> <format>
> to
> <model> <format> <text> <sha1>
>
> However, this is a breaking change to our XML schema.
> See https://bugzilla.wikimedia.org/show_bug.cgi?id=72417

There is now a patch up for review:

https://gerrit.wikimedia.org/r/#/c/168583/



Re: Proposal: slight change to the XML dump format

Ariel T. Glenn-3
In reply to this post by Aaron Halfaker-3
Thank you Google for hiding the start of this thread in my spam folder
>_<

I'm going to have to change my import tools for the new format, but
that's the way it goes; it's a reasonable change.  Have you checked with
folks on the xml data dumps list to see who might be affected?

Ariel


On Thu, 23-10-2014, at 09:52 -0500, Aaron Halfaker wrote:

> I spend a lot of time processing the XML dumps that this will affect.  I
> just wanted to chime in to say that this change makes sense to me and it
> won't affect my work.
>
> -Aaron

Re: Proposal: slight change to the XML dump format

Gerard Meijssen-3
Hi,
You may want to wait until the dumps are fixed. Magnus fixed the
second-to-last dump by hand. The following dump is still broken. Wait until
we KNOW the dumps are ok.
Thanks,
     GerardM

On 27 October 2014 21:58, Ariel T. Glenn <[hidden email]> wrote:

> Thank you Google for hiding the start of this thread in my spam folder
> >_<
>
> I'm going to have to change my import tools for the new format, but
> that's the way it goes; it's a reasonable change.  Have you checked with
> folks on the xml data dumps list to see who might be affected?
>
> Ariel

Re: Proposal: slight change to the XML dump format

Daniel Kinzler
On 27.10.2014 22:08, Gerard Meijssen wrote:
> Hi,
> You may want to wait until the dumps are fixed. Magnus fixed the
> second-to-last dump by hand. The following dump is still broken. Wait until
> we KNOW the dumps are ok.

Gerard, what exactly do you mean? The only problem I know of is the fact that we
are still outputting content using the old serialization format for some
revisions. Changing the tag order is, strange as it may sound, needed to fix
that problem. Bugs:

https://bugzilla.wikimedia.org/show_bug.cgi?id=72348
https://bugzilla.wikimedia.org/show_bug.cgi?id=72361
https://bugzilla.wikimedia.org/show_bug.cgi?id=72417



Re: Proposal: slight change to the XML dump format

Daniel Kinzler
In reply to this post by Ariel T. Glenn-3
On 27.10.2014 21:58, Ariel T. Glenn wrote:
> Thank you Google for hiding the start of this thread in my spam folder
> >_<
>
> I'm going to have to change my import tools for the new format, but
> that's the way it goes; it's a reasonable change.  Have you checked with
> folks on the xml data dumps list to see who might be affected?

Not yet, shall do that now.

Thanks!
-- daniel



Re: Proposal: slight change to the XML dump format

Andrew Dunbar
I noticed that the dump format version number went from "0.9" to "0.10".

I wonder if this format is documented somewhere or if some code might
expect "1.0"?
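
One way a consumer could trip over this, as a sketch: code that compares the version as a decimal number sees "0.10" as 0.1, which sorts before 0.9. Comparing the dotted components as integers avoids that. The parse_version helper is hypothetical and not taken from any MediaWiki tool.

```python
def parse_version(version):
    """Split a dotted version string such as "0.10" into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


# Read as a decimal number, "0.10" becomes 0.1 and looks older than 0.9.
numeric_order_is_wrong = float("0.10") < float("0.9")

# Compared component by component, 0.10 is the successor of 0.9.
newer_than_0_9 = parse_version("0.10") > parse_version("0.9")

# "0.10" and "1.0" are distinct versions under this scheme.
same_as_1_0 = parse_version("0.10") == parse_version("1.0")
```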

Andrew Dunbar (hippietrail)

On 28 October 2014 20:45, Daniel Kinzler <[hidden email]> wrote:

> On 27.10.2014 21:58, Ariel T. Glenn wrote:
> > Thank you Google for hiding the start of this thread in my spam folder
> > >_<
> >
> > I'm going to have to change my import tools for the new format, but
> > that's the way it goes; it's a reasonable change.  Have you checked with
> > folks on the xml data dumps list to see who might be affected?
>
> Not yet, shall do that now.
>
> Thanks!
> -- daniel