Suggested file format of new incremental dumps

Petr Onderka
For my GSoC project Incremental data dumps [1], I'm creating a new file
format to replace Wikimedia's XML data dumps.
A sketch of how I imagine the file format will look is at
http://www.mediawiki.org/wiki/User:Svick/Incremental_dumps/File_format.

What do you think? Does it make sense? Would it work for your use case?
Any comments or suggestions are welcome.

Petr Onderka
[[User:Svick]]

[1]: http://www.mediawiki.org/wiki/User:Svick/Incremental_dumps

Re: Suggested file format of new incremental dumps

Tyler Romeo
What is the intended format of the dump files? The page makes it sound like
it will have a binary format, which I'm not opposed to, but is definitely
something you should decide on.

Also, I really like the idea of writing it in a low level language and then
having bindings for something higher. However, unless you plan on having
multiple language bindings (e.g., *both* C# and Python), you may want to
pick a different route. For example, if you decide to only bind to Python,
you can use something like Cython, which would allow you to write
pseudo-Python that is still compiled to C. Of course, if you want multiple
language bindings, this is likely no longer an option.

--
Tyler Romeo
Stevens Institute of Technology, Class of 2016
Major in Computer Science
www.whizkidztech.com | [hidden email]



Re: Suggested file format of new incremental dumps

Ariel Glenn WMF
In reply to this post by Petr Onderka

Dumps v 2.0 finally on the horizon!

A few comments/questions:

I was envisioning that we would produce "diff dumps" in one pass
(presumably in a much shorter time than the fulls we generate now) and
would apply those against previous fulls (in the new format) to produce
new fulls, hopefully also in less time.  What do you have in mind for
the production of the new fulls?

It might be worth seeing how large the resulting en wp history files are
going to be if you compress each revision separately for version 1 of
this project.  My fear is that even with 7z it's going to make the size
unwieldy.  If the thought is that it's a first round prototype, not
meant to be run on large projects, that's another story.

I'm not sure about removing the restrictions data; someone must have
wanted it, like the other various fields that have crept in over time.
And we should expect there will be more such fields over time...

We need to get some of the wikidata users in on the model/format
discussion, to see what use they plan to make of those fields and what
would be most convenient for them.

It's quite likely that these new fulls will need to be split into chunks
much as we do with the current en wp files.  I don't know what that
would mean for the diff files.  Currently we split in an arbitrary way
based on sequences of page numbers, writing out separate stub files and
using those for the content dumps.  Any thoughts?

Ariel





Re: Suggested file format of new incremental dumps

Petr Onderka
In reply to this post by Tyler Romeo
>
> What is the intended format of the dump files? The page makes it sound like
> it will have a binary format, which I'm not opposed to, but is definitely
> something you should decide on.
>

Yes, it is a binary format; I will make that clearer on the page.

The advantage of a binary format is that it's smaller, which I think is
quite important.

I think the main advantages of text-based formats are that there are lots of
tools for the common ones (XML and JSON) and that they are human readable.
But those tools wouldn't be very useful, because we certainly want to have
some sort of custom compression scheme and the tools wouldn't be able to
work with that.
And I think human readability is mostly useful if we want others to be able
to write their own code that directly accesses the data.
And, because of the custom compression, doing that won't be that easy
anyway. And hopefully, it won't be necessary, because there will be a nice
library usable by everyone (see below).


> Also, I really like the idea of writing it in a low level language and then
> having bindings for something higher. However, unless you plan on having
> multiple language bindings (e.g., *both* C# and Python), you may want to
> pick a different route. For example, if you decide to only bind to Python,
> you can use something like Cython, which would allow you to write
> pseudo-Python that is still compiled to C. Of course, if you want multiple
> language bindings, this is likely no longer an option.
>

Right now, everyone can read the dumps in their favorite language.
If I write the library interface well, writing bindings for it for another
language should be relatively trivial, so everyone can keep using their
favorite language.

And I admit, I'm proposing doing it this way partially because of selfish
reasons: I'd like to use this library in my future C# code.
But I realize creating something that works only in C# doesn't make sense,
because most people in this community don't use it.
So, to me, writing the code so that it can be used from anywhere makes the
most sense.

Petr Onderka



Re: Suggested file format of new incremental dumps

Petr Onderka
In reply to this post by Ariel Glenn WMF
>
> I was envisioning that we would produce "diff dumps" in one pass
> (presumably in a much shorter time than the fulls we generate now) and
> would apply those against previous fulls (in the new format) to produce
> new fulls, hopefully also in less time.  What do you have in mind for
> the production of the new fulls?
>

What I originally imagined is that the full dump will be modified directly
and a description of the changes made to it will also be written to the
diff dump.
But now I think that creating the diff and then applying it makes more
sense, because it's simpler.
I also think that doing the two at the same time will be faster,
because it's less work (no need to read and parse the diff).
So what I imagine now is something like this:

1. Read information about a change in a page/revision
2. Create diff object in memory
3. Write the diff object to the diff file
4. Apply the diff object to the full dump
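
In C++ (assuming the low-level implementation language ends up being C or
C++, as discussed earlier in the thread), a minimal sketch of that loop
could look like the following. All type and function names here are
hypothetical, invented just to illustrate the four steps; nothing like
this exists in the project yet.

    #include <fstream>

    // Illustrative stand-ins only; the real structures are not designed yet.
    struct PageChange { /* data about one changed page or revision */ };
    struct DiffObject { /* in-memory representation of one change */ };

    struct ChangeSource {
        bool next(PageChange& out);          // 1. read one change; false at end
    };
    struct FullDump {
        void apply(const DiffObject& diff);  // 4. update the full dump in place
    };

    DiffObject makeDiff(const PageChange& change);                // 2.
    void writeDiff(std::ofstream& diffFile, const DiffObject& d); // 3.

    void updateDumps(ChangeSource& src, std::ofstream& diffFile,
                     FullDump& full) {
        PageChange change;
        while (src.next(change)) {               // 1. read a change
            DiffObject diff = makeDiff(change);  // 2. build the diff object
            writeDiff(diffFile, diff);           // 3. append to the diff file
            full.apply(diff);                    // 4. apply to the full dump
        }
    }

This way the diff file is a byproduct of the same pass that updates the
full dump, so nothing has to be read back and re-parsed.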


> It might be worth seeing how large the resulting en wp history files are
> going to be if you compress each revision separately for version 1 of
> this project.  My fear is that even with 7z it's going to make the size
> unwieldy.  If the thought is that it's a first round prototype, not
> meant to be run on large projects, that's another story.
>

I do expect that a full dump of enwiki using this compression would be way
too big.
So yes, this was meant just to have something working, so that I can
concentrate on doing compression properly later (after the mid-term).


> I'm not sure about removing the restrictions data; someone must have
> wanted it, like the other various fields that have crept in over time.
> And we should expect there will be more such fields over time...
>

If I understand the code in XmlDumpWriter.openPage correctly, that data
comes from the page_restrictions field [1], which doesn't seem to be used in
non-ancient versions of MediaWiki.

I did think about versioning the page and revision objects in the dump, but
I'm not sure how exactly to handle upgrades from one version to another.
For now, I think I'll have just one global "data version" per file, but
I'll make sure that adding a version to each object in the future will be
possible.
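
For illustration, the file header could carry that global version along
these lines; this is just a hypothetical sketch, and the actual layout
and field names are not decided:

    #include <cstdint>

    // Hypothetical dump file header with one global data version.
    #pragma pack(push, 1)
    struct DumpHeader {
        char     magic[4];         // identifies the file type
        uint8_t  formatVersion;    // version of the container format itself
        uint8_t  dataVersion;      // global version of page/revision objects
        uint64_t pageIndexOffset;  // where the page index starts
    };
    #pragma pack(pop)

Bumping dataVersion then covers every object in the file, and a later
formatVersion could introduce a per-object version byte without breaking
old readers.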


> We need to get some of the wikidata users in on the model/format
> discussion, to see what use they plan to make of those fields and what
> would be most convenient for them.
>
> It's quite likely that these new fulls will need to be split into chunks
> much as we do with the current en wp files.  I don't know what that
> would mean for the diff files.  Currently we split in an arbitrary way
> based on sequences of page numbers, writing out separate stub files and
> using those for the content dumps.  Any thoughts?
>

If possible, I would prefer to keep everything in a single file.
If that won't be possible, I think it makes sense to split on page ids, but
make the split id visible (probably in the file name) and unchanging from
month to month.
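
Purely as an illustration of what a visible, unchanging split id could
look like: the chunk file name could be derived from the first page ID
the chunk contains, so the same chunk keeps the same name every month
(the naming scheme below is hypothetical):

    #include <cstdio>

    // Hypothetical naming: the chunk starting at page ID 1 would always
    // be "enwiki-history-p000000001.id", regardless of the month.
    void chunkName(char* buf, std::size_t n, unsigned firstPageId) {
        std::snprintf(buf, n, "enwiki-history-p%09u.id", firstPageId);
    }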
If it turns out that a single chunk grows too big, we might consider adding
a "split" instruction to diff dumps, but that's probably not necessary now.

Petr Onderka

[1]: http://www.mediawiki.org/wiki/Manual:Page_table#page_restrictions

Re: Suggested file format of new incremental dumps

Petr Onderka
In reply to this post by Petr Onderka
Compressed XML is what the current dumps use and it doesn't work well
because:
* it can't be edited
* it doesn't support seeking

I think the only way to solve this is "obscure" and requires special code
to read and write.
(And endianness is not a problem if the specification says which one it
uses and the implementation sticks to it.)
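
To make the seeking point concrete: a binary format can keep a sorted,
fixed-width index of (page ID, file offset) pairs, which can be binary
searched with a handful of seeks instead of scanning and decompressing
the whole file. A minimal sketch, with hypothetical names and a layout
the spec would have to pin down, including the endianness mentioned
above:

    #include <cstdint>
    #include <fstream>

    #pragma pack(push, 1)
    struct IndexEntry {
        uint32_t pageId;   // byte order fixed by the spec
        uint64_t offset;   // where this page's data starts in the dump
    };
    #pragma pack(pop)

    // Binary search over `count` sorted entries starting at `indexStart`.
    bool findPage(std::ifstream& f, uint64_t indexStart, uint64_t count,
                  uint32_t pageId, uint64_t& offsetOut) {
        uint64_t lo = 0, hi = count;
        while (lo < hi) {
            uint64_t mid = lo + (hi - lo) / 2;
            IndexEntry e;
            f.seekg(indexStart + mid * sizeof(IndexEntry));
            f.read(reinterpret_cast<char*>(&e), sizeof e);
            if (e.pageId == pageId) { offsetOut = e.offset; return true; }
            if (e.pageId < pageId) lo = mid + 1;
            else hi = mid;
        }
        return false;  // page not present in this dump
    }

Getting the same effect out of a monolithic compressed XML stream needs
block-aligned compression plus a separate index file, which is already
most of the way to a custom binary format.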

Theoretically, I could use compressed XML in internal data structures, but
I think that just combines the disadvantages of both.

So, the size is not the main reason not to use XML, it's just one of the
reasons.

Petr Onderka


On Mon, Jul 1, 2013 at 7:26 PM, <[hidden email]> wrote:

> On 07/01/2013 12:48:11 PM, Petr Onderka - [hidden email] wrote:
>
>> >
>> > What is the intended format of the dump files? The page makes it sound
>> like
>> > it will have a binary format, which I'm not opposed to, but is
>> definitely
>> > something you should decide on.
>> >
>>
>> Yes, it is a binary format, I will make that clearer on the page.
>>
>> The advantage of a binary format is that it's smaller, which I think is
>> quite important.
>>
>
> In my experience binary formats have very little to recommend them.
>
> They are definitely more obscure. They sometimes suffer from endian
> problems. They require special code to read and write.
>
> In my experience I have found that the notion that they offer an advantage
> by being "smaller" is somewhat misguided.
>
> In particular, with XML, there is generally a very high degree of
> redundancy in the text, far more than in normal writing.
>
> The consequence of this regularity is that text based XML often compresses
> very, very well.
>
> I remember one particular instance where we were generating 30-50
> Megabytes of XML a day and needed to send it from the USA to the UK every
> day, in a situation where our leased data rate was really limiting. We were
> surprised and pleased to discover that zipping the files reduced them to
> only 1-2 MB. I have been skeptical of claims that binary formats are more
> efficient on the wire (where it matters most) ever since.
>
> I think you should do some experiments versus compressed XML to justify
> your claimed benefits of using a binary format.
>
> Jim
>
> <snip>
>
> --
> Jim Laurino
> [hidden email]
> Please direct any reply to the list.
> Only mail from the listserver reaches this address.

Re: Suggested file format of new incremental dumps

Tyler Romeo
Petr is right on par with this one. The purpose of this version 2 for dumps
is to allow protocol-specific incremental updating of the dump, which would
be significantly more difficult in a non-binary format.

--
Tyler Romeo
Stevens Institute of Technology, Class of 2016
Major in Computer Science
www.whizkidztech.com | [hidden email]



Re: Suggested file format of new incremental dumps

Dmitriy Sintsov
On 01.07.2013 22:56, Tyler Romeo wrote:
> Petr is right on par with this one. The purpose of this version 2 for dumps
> is to allow protocol-specific incremental updating of the dump, which would
> be significantly more difficult in a non-binary format.
>
>
Why can't the dumps just be split into daily or weekly XML files
(optionally compressed)? That way, seeking would be performed by simply
opening the YYYY.MM.DD.xml file.
It is so much simpler than going for binary git-like formats, which
would take a bit less space but are more prone to bugs and impossible to
extract and analyze/edit via text/XML processing utilities.
Dmitriy



Re: Suggested file format of new incremental dumps

Daniel Friesen-2
In reply to this post by Tyler Romeo
Instead of XML "or" a proprietary binary format, could we try using a
standard binary format such as Protocol Buffers as a base, to reduce the
issues with having to implement the reading/writing in multiple languages?

--
~Daniel Friesen (Dantman, Nadir-Seen-Fire) [http://danielfriesen.name/]


Re: Suggested file format of new incremental dumps

Petr Onderka
In reply to this post by Dmitriy Sintsov
I think this would work well only for the use case where you're always
looking through the whole history of all pages.

How would you find the current revision of a specific page? Or all
revisions of a page?
What if you don't want the whole history, just current versions of all
pages?
And don't forget about deletions (and undeletions).

You could somewhat solve some of these problems (e.g. by adding indexes),
but I don't think you can solve all of them.

Petr Onderka



Re: Suggested file format of new incremental dumps

Daniel Friesen-2
In reply to this post by Petr Onderka
How are you dealing with extensibility?

We need to be able to extend the format. The fields of data we need to
export change over time (just look at the changelog for our export's XSD
file https://www.mediawiki.org/xml/export-0.7.xsd).

Here are some things in that XML format that are missing from the
incremental format:
- Redirect info
- Upload info
- Log items
- Liquid Threads support

And here's something I don't think we've thought about supporting in our
current export format: ContentHandler. There's metadata for it missing
from our dumps, and the data format is somewhat different from what our
text dumps have traditionally expected.

--
~Daniel Friesen (Dantman, Nadir-Seen-Fire) [http://danielfriesen.name/]


Re: Suggested file format of new incremental dumps

Petr Onderka
In reply to this post by Daniel Friesen-2
Protocol Buffers are not a bad idea, but I'm not sure about their overhead.

AFAIK, PB has an overhead of at least 1 byte per field.
If I'm counting correctly, with enwiki's 600M revisions and 8 fields per
revision, that means a total overhead of more than 4 GB.
The fixed-size part of all revisions (i.e. without comment and text)
amounts to ~22 GB.
I think this means PB has too much overhead.
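
A quick back-of-the-envelope check of that arithmetic, using the rough
figures above:

    #include <cstdio>

    int main() {
        // ~600M enwiki revisions, 8 fields each, and at least one
        // tag byte of Protocol Buffers overhead per field.
        const double revisions    = 600e6;
        const int    fieldsPerRev = 8;
        const double tagBytes     = 1.0;
        const double overheadGB   = revisions * fieldsPerRev * tagBytes / 1e9;
        std::printf("protobuf tag overhead: ~%.1f GB\n", overheadGB); // ~4.8
        return 0;
    }

Against the ~22 GB of fixed-size revision data, that is over 20% of pure
framing overhead.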

The overhead could be alleviated by using compression, but I didn't intend
to compress metadata.

So, I think I will start without PB. If I later decide to compress
metadata, I will also try to use PB and see if it works.

Also, I think that reading the binary format isn't going to be the biggest
issue if you're implementing your own library for incremental dumps,
especially if I'm going to use delta compression of revision texts.

Petr Onderka



Re: Problem with SVG thumbnails

Aran Dunkley
In reply to this post by Daniel Friesen-2
Hello,

My wiki's giving an error generating SVG thumbnails, e.g.

Cannot parse integer value '-h214' for -w

Has anyone come across a solution for this? I'm seeing it on many sites
around the net, including my own. I think it started after I upgraded to
1.19.

Here's a live example:
http://www.organicdesign.co.nz/File:Nginx-logo.svg

Thanks,
Aran


Re: Problem with SVG thumbnails

Andre Klapper-2
Hi,

On Mon, 2013-07-01 at 20:54 -0300, Aran wrote:
> My wiki's giving an error generating SVG thumbnails, e.g.
> Cannot parse integer value '-h214' for -w

Does this refer to creating bitmap thumbnails from SVG files? In that
case, which SVGConverter is used to generate thumbnails?

andre
--
Andre Klapper | Wikimedia Bugwrangler
http://blogs.gnome.org/aklapper/



Re: Problem with SVG thumbnails

Bartosz Dziewoński
In reply to this post by Aran Dunkley
This is most likely bug 45054, fixed in MediaWiki 1.21. It has a rather simple workaround, too; see https://bugzilla.wikimedia.org/show_bug.cgi?id=45054.

--
Matma Rex


Re: Problem with SVG thumbnails

Aran Dunkley
Yep that's my problem, thanks a lot :-)

On 02/07/13 07:28, Bartosz Dziewoński wrote:
> This is most likely bug 45054, fixed in MediaWiki 1.21. It has a
> rather simple workaround, too, see
> https://bugzilla.wikimedia.org/show_bug.cgi?id=45054 .
>



Re: Math rendering problem

Aran Dunkley
Hi Guys,

I've just upgraded my wiki from 1.19.2 to 1.21.1 to fix the SVG
rendering problem, which is now fine, but my Math rendering has
broken. I'm getting the following error:

Failed to parse (PNG conversion failed; check for correct installation
of latex and dvipng (or dvips + gs + convert))

This error seems very common, but none of the solutions I've found have
worked (creating latex.fmt, running fmtutil-sys --all, setting $wgTexvc
etc).

All the packages are installed and were running fine for 1.19. I've
downloaded Extension:Math for 1.21 and ran 'make', which generated a
texvc binary with no errors.

Any ideas what may be wrong?

Thanks,
Aran


Re: Suggested file format of new incremental dumps

Petr Onderka
In reply to this post by Daniel Friesen-2
On Mon, Jul 1, 2013 at 10:15 PM, Daniel Friesen
<[hidden email]> wrote:

> How are you dealing with extensibility?
>
> We need to be able to extend the format. The fields of data we need to
> export change over time (just look at the changelog for our export's XSD
> file https://www.mediawiki.org/xml/export-0.7.xsd).
>

I have touched on this in my answer to Ariel's email.
I think that for now, there will be just a single data version number in
the header of the dump file.
But I will make sure to leave the possibility of having a version number on
each object open.


> Here are some things in that XML format that are missing from the
> incremental format:
> - Redirect info
> - Upload info
> - Log items
> - Liquid Threads support
>

I should have gone to the source instead of assuming that looking at a few
samples is enough.
I will add redirect and upload info to the format description.

As far as I know, log items are in a separate XML dump and I'm not planning
to replace that one.

Unless I'm mistaken, Liquid Threads don't have much of a future and are
used only on a few wikis like mediawiki.org.
Does anyone actually use this information from the dumps?


> And something that I don't think we've thought about support for in our
> current export format, ContentHandler. There's metadata for it missing
>     from our dumps and the data format is somewhat different than our text
> dumps have traditionally expected.


The current dumps already store model and format.
Is there something else needed for ContentHandler?
The dumps don't really care what the format or encoding of the revision
text is; it's just a byte stream to them.

Petr Onderka

Re: Suggested file format of new incremental dumps

Tyler Romeo
On Tue, Jul 2, 2013 at 2:18 PM, Petr Onderka <[hidden email]> wrote:

> Unless I'm mistaken, Liquid Threads don't have much of a future and are
> used only on few wikis like mediawiki.org.
> Does anyone actually use this information from the dumps?
>

LiquidThreads is an extension. I don't think extension dumps are within the
scope of this, unless we provide some sort of generic "extensions can add
stuff to the dump" hook.

> The current dumps already store model and format.
> Is there something else needed for ContentHandler?
> The dumps don't really care what the format or encoding of the revision
> text is; it's just a byte stream to them.


I'm not familiar with the current dump format, but what is being referred
to here is that if you set $wgContentHandlerUseDB to true, then the content
type (i.e., whether it is Wikitext, or JS/CSS, etc.) can be stored in the
database rather than being determined statically by namespace.

--
Tyler Romeo
Stevens Institute of Technology, Class of 2016
Major in Computer Science
www.whizkidztech.com | [hidden email]

Re: Math rendering problem

Aran Dunkley
In reply to this post by Aran Dunkley
I've found that the logged shell command actually does execute properly
and creates the .png when executed manually from the shell, even when I
execute it as the www-data user that the web server runs as.

But from the wiki it creates the tmp/hash.tex file but not the .png, and
there's nothing logged anywhere to say why it wasn't able to do it.


