Wikidiff2

Wikidiff2

Tim Starling
An enhanced version of the C++ diff extension, wikidiff2, is now running on both clusters. It now
does character-level diffs on Chinese, Japanese and Thai, so it produces much better results than
the PHP diff algorithm, in a much shorter time to boot. Chinese had an ad-hoc segmentation scheme
based on inserting a space between every character before the diff, then removing the spaces
afterwards, but unfortunately that left spaces all over the place where there shouldn't have been
spaces. Anyway, it's fixed now.
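
As a rough illustration of why character-level tokens help for scripts that do not put spaces between words, here is a minimal Python sketch. It uses Python's difflib, not wikidiff2's actual C++ code, and the function names (word_tokens, char_tokens, changed_spans) and the sample sentences are made up for this example.

```python
import difflib


def word_tokens(text):
    # Whitespace splitting: a Chinese sentence with no spaces stays one token.
    return text.split()


def char_tokens(text):
    # Character-level tokens: every character is its own unit.
    return list(text)


def changed_spans(old_tokens, new_tokens):
    # Return (old, new) token runs for every non-equal region of the diff.
    matcher = difflib.SequenceMatcher(None, old_tokens, new_tokens, autojunk=False)
    return [
        (old_tokens[i1:i2], new_tokens[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Hypothetical sample text: only the last two characters differ.
old = "我喜欢苹果"
new = "我喜欢香蕉"

# Word-level: the whole sentence is reported as changed.
word_level = changed_spans(word_tokens(old), word_tokens(new))

# Character-level: only the two characters that differ are reported.
char_level = changed_spans(char_tokens(old), char_tokens(new))
```

With whitespace tokens the entire sentence is flagged, whereas character tokens narrow the change down to the characters that actually differ, and no spaces ever have to be inserted into or removed from the text.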

We're still calling dl() every time a diff is needed, and I'm still waiting for profiling results on
the effect of that. The performance of the algorithm is quite good, though: on our Opterons it can
diff 2MB (each side) of the most pathological input text I've yet been able to devise in 5.2
seconds, using only about 15MB of memory.

-- Tim Starling

_______________________________________________
Wikitech-l mailing list
[hidden email]
http://mail.wikipedia.org/mailman/listinfo/wikitech-l

Re: Wikidiff2

Phil Boswell
"Tim Starling" <[hidden email]> wrote in
message news:dtbtss$tkf$[hidden email]...

> An enhanced version of the C++ diff extension, wikidiff2, is now running
> on both clusters. It now does character-level diffs on Chinese, Japanese
> and Thai, so it produces much better results than the PHP diff algorithm,
> in a much shorter time to boot. Chinese had an ad-hoc segmentation scheme
> based on inserting a space between every character before the diff, then
> removing the spaces afterwards, but unfortunately that left spaces all
> over the place where there shouldn't have been spaces. Anyway, it's fixed
> now.


Any chances of improving the behaviour when several successive paragraphs
are split apart?

e.g. assuming that A, B, C, D, etc are words, when:
----
ABCDEF

GHIJKLM
----
becomes
----
A'
B'
C'
D'
E'
F'

G'
H'
I'
J'
K'
L'
----

At present the component parts of the second paragraph do not line up with
that paragraph, so it is difficult to compare the versions:
--Old edit--         --New edit--
ABCDEF               A'
                     B'
GHIJKLM              C'
                     D'
                     E'
                     F'

                     G'
                     H'
                     I'
                     J'
                     K'
                     L'
------               ------
As you can see, by the time the fragments G', H', I', J', K', L' appear,
the original might have scrolled off the top. If the alterations were subtle
it might be difficult to check them adequately.

Would it be possible to represent this in this fashion?
--Old edit--         --New edit--
ABCDEF               A'
                     B'
                     C'
                     D'
                     E'
                     F'

GHIJKLM              G'
                     H'
                     I'
                     J'
                     K'
                     L'
------               ------
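
The layout asked for above could be sketched roughly as follows. This is only a minimal Python illustration, not anything taken from wikidiff2: it assumes that paragraphs are separated by blank lines and that old and new paragraphs correspond one-to-one in order, which a real diff could not take for granted. The helper names split_paragraphs and side_by_side are hypothetical.

```python
from itertools import zip_longest


def split_paragraphs(text):
    # Assumption: paragraphs are separated by exactly one blank line.
    return [para.splitlines() for para in text.strip().split("\n\n")]


def side_by_side(old_paragraphs, new_paragraphs):
    # Pair paragraphs in order, so each new paragraph starts on the
    # same row as the old paragraph it is assumed to come from.
    rows = []
    for old_para, new_para in zip_longest(old_paragraphs, new_paragraphs, fillvalue=[]):
        for old_line, new_line in zip_longest(old_para, new_para, fillvalue=""):
            rows.append((old_line, new_line))
        rows.append(("", ""))  # blank row between paragraph groups
    if rows:
        rows.pop()  # drop the trailing blank row
    return rows


old_text = "ABCDEF\n\nGHIJKLM"
new_text = "A'\nB'\nC'\nD'\nE'\nF'\n\nG'\nH'\nI'\nJ'\nK'\nL'"

rows = side_by_side(split_paragraphs(old_text), split_paragraphs(new_text))

for old_line, new_line in rows:
    # Left column padded to 21 characters, as in the tables above.
    print(f"{old_line:<21}{new_line}")
```

The hard part in practice is deciding which new lines belong to which old paragraph; the sketch sidesteps that by pairing paragraphs in order.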

HTH HAND
--
Phil
[[en:User:Phil Boswell]]




Re: Wikidiff2

Adrian Buehlmann
In reply to this post by Tim Starling
"Tim Starling" wrote:
> An enhanced version of the C++ diff extension, wikidiff2, is now
> running on both clusters.

This one could be better:

http://en.wikipedia.org/w/index.php?title=Binding_of_Isaac&diff=40442877&oldid=39680991






Re: Wikidiff2

Phil Boswell
"Adrian Buehlmann" <[hidden email]> wrote in
message news:dtcq8s$33l$[hidden email]...
> "Tim Starling" wrote:
>> An enhanced version of the C++ diff extension, wikidiff2, is now
>> running on both clusters.
>
> This one could be better:
>
> http://en.wikipedia.org/w/index.php?title=Binding_of_Isaac&diff=40442877&oldid=39680991

At first I couldn't see what you meant, they all seemed to line up fine. But
then I realised what the trouble was: that *is* subtle isn't it!

For those checking this to see what the problem is, there is a series of
multi-line paragraphs (i.e. each paragraph is made up of several separate
lines), all very similar. Some have been changed, some not. The DIFF gets
out of step, and the changes are not shown opposite the relevant paragraph.

HTH HAND
--
Phil
[[en:User:Phil Boswell]]




Re: Wikidiff2

Adrian Buehlmann
"Phil Boswell" wrote:

> "Adrian Buehlmann" wrote:
>> "Tim Starling" wrote:
>>> An enhanced version of the C++ diff extension, wikidiff2, is now
>>> running on both clusters.
>>
>> This one could be better:
>>
>> http://en.wikipedia.org/w/index.php?title=Binding_of_Isaac&diff=40442877&oldid=39680991
>
> At first I couldn't see what you meant, they all seemed to line up
> fine. But then I realised what the trouble was: that *is* subtle
> isn't it!
> For those checking this to see what the problem is, there is a series
> of multi-line paragraphs (i.e. each paragraph is made up of several
> separate lines), all very similar. Some have been changed, some not.
> The DIFF gets out of step, and the changes are not shown opposite the
> relevant paragraph.

This one is even more extreme:

http://en.wikipedia.org/w/index.php?title=AIDS&diff=40206516&oldid=40206062

The text there didn't change at all; only spaces and newlines were added.

I don't know if this is possible, but a bit less red in such a case
would be good.
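
One way to spot such a case before rendering is to compare the two revisions with all whitespace runs collapsed. The following is a minimal Python sketch of that idea, not part of wikidiff2; the function names and the sample strings are made up for illustration.

```python
def collapse_whitespace(text: str) -> str:
    # Replace every run of spaces, tabs and newlines with a single space.
    return " ".join(text.split())


def is_whitespace_only_change(old: str, new: str) -> bool:
    # True when the revisions differ, but only in their whitespace.
    return old != new and collapse_whitespace(old) == collapse_whitespace(new)


# Hypothetical revisions: same words, extra spaces and newlines added.
old_rev = "First sentence. Second sentence."
new_rev = "First  sentence.\n\nSecond sentence."

whitespace_only = is_whitespace_only_change(old_rev, new_rev)
real_change = is_whitespace_only_change(old_rev, "First sentence. Third sentence.")
```

Note that in wikitext a blank line starts a new paragraph, so a whitespace-only edit can still change the rendered page; a check like this could at most be used to tone down the highlighting, not to hide the change.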




Re: Wikidiff2

Phil Boswell
"Adrian Buehlmann" <[hidden email]> wrote in
message news:dtcvbt$mr9$[hidden email]...
[snip]
> This one is even more extreme:
> http://en.wikipedia.org/w/index.php?title=AIDS&diff=40206516&oldid=40206062
> The text there didn't change at all; only spaces and newlines were added.
> I don't know if this is possible, but a bit less red in such a case
> would be good.

I have been trying in a desultory sort of way to figure out how to make the
altered white-space show up properly in DIFFs.

I'm not convinced it's actually possible with the current implementation :-(
--
Phil
[[en:User:Phil Boswell]]


