Big problem to solve: good WYSIWYG on WMF wikis


Re: What do we want to accomplish? (was Re: WikiCreole)

Jay Ashworth-2
----- Original Message -----
> From: "George Herbert" <[hidden email]>

> On Wed, Jan 5, 2011 at 7:35 PM, Jay Ashworth <[hidden email]> wrote:
> > Did anyone ever pull statistics about exactly how many instances of
> > that Last Five Percent there really were, as I suspect I suggested at the
> > time?
>
> Expansion off "how many instances..?" -

The thing you want expanded, George, is "Last Five Percent"; I refer
there to (I think it was) David Gerard's comment earlier that the
first 95% of wikisyntax fits reasonably well into current parser
building frameworks, and the last 5% causes well adjusted programmers
to consider heroin... or something like that. :-)

> At some point in the corner, the fix is to change the templates and
> pages to match a more sane parser's capabilities or a more standard
> specification for the markup, rather than make the parser match the
> insanity that's already out there.
>
> If we know what we're looking at, we can assign corner cases to an
> on-wiki cleanup "hit squad". Who knows how many of the corners we can
> outright assassinate that way, but it's worth a go... The less used
> it is and harder to code for it is, the easier it is for us to justify
> taking it out.

Yup; that's the point I was making.

The argument advanced was always "there's too much usage of that ugly
stuff to consider Just Not Supporting It" and I always asked whether
anyone with larger computers than me had ever extracted actual statistics,
and no one ever answered.

Cheers,
-- jra


Re: What do we want to accomplish?

Mark A. Hershberger-2
In reply to this post by Mark A. Hershberger-2

Thinking about this question from the other day and the apparently deep
conviction that XML is the magic elixir, I had to wonder: what about the
existing Preprocessor_DOM class?

I'm asking out of ignorance.  I realize that the preprocessor is not the
parser, but it does turn the WikiText into a DOM (right?), and that
could, conceivably, be used to create different parsers.

What am I missing?

Mark.

--
http://hexmode.com/

War begins by calling for the annihilation of the Other,
    but ends ultimately in self-annihilation.



Re: What would be a perfect wiki syntax? (Re: WYSIWYG)

Dmitriy Sintsov
In reply to this post by George William Herbert
* George Herbert <[hidden email]> [Wed, 5 Jan 2011 19:52:18 -0800]:
> On Wed, Jan 5, 2011 at 7:37 PM, Jay Ashworth <[hidden email]> wrote:
> > ---- Original Message -----
> >> From: "Daniel Kinzler" <[hidden email]>
> >
> >> On 05.01.2011 05:25, Jay Ashworth wrote:
> >> > I believe the snap reaction here is "you haven't tried to diff XML,
> >> > have you?"
> >>
> >> A text-based diff of XML sucks, but how about a DOM based
> >> (structural) diff?
> >
> > Sure, but how much more processor horsepower is that going to take.
> >
> > Scale is a driver in Mediawiki, for obvious reasons.
>
> I suspect that diffs are relatively rare events in the day to day WMF
> processing, though non-trivial.
>
> That said, and as much of a fan of some sort of conceptually object
> oriented page data approach... DOM?  Really??
>
> We're not trying to do 99% of what that does; we just need object /
> element contents, style and perhaps minimal other attributes, and
> order within a page.
>
>
DOM manipulation at the template level is not a bad thing. It could also
be partially unified with parsing, because trees are used there as well.
I just hope there is a chance to have an XML-to-wikitext mapping (at
least partially compatible for the basic markup).
Dmitriy


Re: What would be a perfect wiki syntax? (Re: WYSIWYG)

Jay Ashworth-2
In reply to this post by George William Herbert
----- Original Message -----
> From: "George Herbert" <[hidden email]>

> >> A text-based diff of XML sucks, but how about a DOM based
> >> (structural)
> >> diff?
> >
> > Sure, but how much more processor horsepower is that going to take.
> >
> > Scale is a driver in Mediawiki, for obvious reasons.
>
> I suspect that diffs are relatively rare events in the day to day WMF
> processing, though non-trivial.

Every single time you make an edit, unless I badly misunderstand the current
architecture; that's how it's possible for multiple people editing the
same article not to collide unless their edits actually collide at the
paragraph level.

Not to mention pulling old versions.

Can someone who knows the current code better than me confirm or deny?

Cheers,
-- jra


Re: What would be a perfect wiki syntax? (Re: WYSIWYG)

Brion Vibber
On Thu, Jan 6, 2011 at 11:01 AM, Jay Ashworth <[hidden email]> wrote:

> ----- Original Message -----
> > From: "George Herbert" <[hidden email]>
>
> > >> A text-based diff of XML sucks, but how about a DOM based
> > >> (structural)
> > >> diff?
> > >
> > > Sure, but how much more processor horsepower is that going to take.
> > >
> > > Scale is a driver in Mediawiki, for obvious reasons.
> >
> > I suspect that diffs are relatively rare events in the day to day WMF
> > processing, though non-trivial.
>
> Every single time you make an edit, unless I badly misunderstand the
> current
> architecture; that's how it's possible for multiple people editing the
> same article not to collide unless their edits actually collide at the
> paragraph level.
>
> Not to mention pulling old versions.
>
> Can someone who knows the current code better than me confirm or deny?
>

There's a few separate issues mixed up here, I think.


First: diffs for viewing and the external diff3 merging for resolving edit
conflicts are actually unrelated code paths and use separate diff engines.
(Nor does diff3 get used at all unless there actually is a conflict to
resolve -- if nobody else edited since your change, it's not called.)


Second: the notion that diffing a structured document must inherently be
very slow is, I think, not right.

A well-structured document should be pretty diff-friendly actually; our
diffs are already working on two separate levels (paragraphs as a whole,
then words within matched paragraphs). In the most common cases, the diffing
might actually work pretty much the same -- look for nodes that match, then
move on to nodes that don't; within changed nodes, look for sub-nodes that
can be highlighted. Comparisons between nodes may be slower than straight
strings, but the basic algorithms don't need to be hugely different, and the
implementation can be in heavily-optimized C++ just like our text diffs are
today.
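
To make the two-level idea concrete, here is a minimal PHP sketch (not
our actual diff engine, and skipping proper longest-common-subsequence
matching): it pairs up top-level blocks positionally, then does a crude
word-level comparison inside blocks that differ.

<?php
// Rough sketch of a two-level diff: compare top-level blocks first, then
// fall back to word-level comparison inside blocks that changed. This is
// an illustration of the idea only, not MediaWiki's diff code.

function diffBlocks(array $old, array $new): array {
    $result = [];
    $max = max(count($old), count($new));
    for ($i = 0; $i < $max; $i++) {
        $a = $old[$i] ?? '';
        $b = $new[$i] ?? '';
        if ($a === $b) {
            $result[] = ['type' => 'same', 'text' => $a];
        } else {
            $result[] = ['type' => 'changed', 'words' => diffWords($a, $b)];
        }
    }
    return $result;
}

// Word-level pass inside a changed block: report words removed and added.
function diffWords(string $a, string $b): array {
    $oldWords = preg_split('/\s+/', trim($a));
    $newWords = preg_split('/\s+/', trim($b));
    return [
        'removed' => array_values(array_diff($oldWords, $newWords)),
        'added'   => array_values(array_diff($newWords, $oldWords)),
    ];
}

$old = ["First paragraph.", "Second paragraph stays the same."];
$new = ["First paragraph, now edited.", "Second paragraph stays the same."];
print_r(diffBlocks($old, $new));

The same shape works whether the "blocks" are wikitext paragraphs or
nodes in a structured document; only the comparison function changes.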


Third: the most common diff view cases are likely adjacent revisions of
recent edits, which smells like cache. :) Heck, these could be made once and
then simply *stored*, never needing to be recalculated again.


Fourth: the notion that diffing structured documents would be overwhelming
for the entire Wikimedia infrastructure... even if we assume such diffs are
much slower, I think this is not really an issue compared to the huge CPU
savings that it could bring elsewhere.

The biggest user of CPU has long been parsing and re-parsing of wikitext.
Every time someone comes along with different view preferences, we have to
parse again. Every time a template or image changes, we have to parse again.
Every time there's an edit, we have to parse again. Every time something
fell out of cache, we have to parse again.

And that parsing is *really expensive* on large, complex pages. Much of the
history of MediaWiki's parser development has been in figuring out how to
avoid parsing quite as much, or setting limits to keep the worst corner
cases from bringing down the server farm.

We parse *way*, *wayyyyy* more than we diff.


Part of what makes these things slow is that we have to do a lot of work
from scratch every time, and we have to do it in slow PHP code, and we have
to keep going back and fetching more stuff halfway through. Expanding
templates can change the document structure at the next parsing level, so
referenced files and templates have to be fetched or recalculated, often one
at a time because it's hard to batch up a list of everything we need at
once.

I think there would be some very valuable savings to using a document model
that can be stored in a machine-readable way up front. A data structure that
can be described as JSON or XML (for examples) allows leaving the low-level
"how do I turn a string into a structure" details to highly-tuned native C
code. A document model that is easily traversed and mapped to/from
hierarchical HTML allows code to process just the parts of the document it
needs at any given time, and would make it easier to share intermediate data
between variants if that's still needed.
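
As a rough sketch of what I mean by a machine-readable document model
(the node types below are invented for illustration, not an existing
MediaWiki format), something like this lets code walk just the parts of
a page it needs:

<?php
// A page stored as a node tree (JSON here) instead of raw wikitext.
$json = <<<'JSON'
{
  "type": "page",
  "children": [
    { "type": "heading", "level": 2, "text": "History" },
    { "type": "paragraph", "text": "Some prose about the topic." },
    { "type": "template", "name": "Infobox", "params": { "name": "Example" } }
  ]
}
JSON;
$page = json_decode($json, true);

// Render only the node types we care about; a template node could be
// expanded lazily (or even client-side) rather than during a full
// server-side parse of the whole page.
function renderNode(array $node): string {
    switch ($node['type']) {
        case 'heading':
            $h = 'h' . (int)$node['level'];
            return "<$h>" . htmlspecialchars($node['text']) . "</$h>";
        case 'paragraph':
            return '<p>' . htmlspecialchars($node['text']) . '</p>';
        case 'template':
            return '<!-- expand template: ' . htmlspecialchars($node['name']) . ' -->';
        default:
            return '';
    }
}

echo implode("\n", array_map('renderNode', $page['children'])), "\n";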

In some cases, work that is today done in the 'parser' could even be done by
client-side JavaScript (on supporting user-agents), moving little bits of
work from the server farm (where CPU time is vast but sharply limited) to
end-user browsers (where there's often a local surplus -- CPU's not doing
much while it's waiting on the network to transfer big JPEG images).


It may be easier to prototype a lot of this outside of MediaWiki, though, or
in specific areas such as media or interactive extensions, before we all go
trying to redo the full core.

-- brion

Re: What would be a perfect wiki syntax? (Re: WYSIWYG)

George William Herbert
On Thu, Jan 6, 2011 at 11:38 AM, Brion Vibber <[hidden email]> wrote:

> On Thu, Jan 6, 2011 at 11:01 AM, Jay Ashworth <[hidden email]> wrote:
>> > From: "George Herbert" <[hidden email]>
>> > I suspect that diffs are relatively rare events in the day to day WMF
>> > processing, though non-trivial.
>>
>> Every single time you make an edit, unless I badly misunderstand the
>> current
>> architecture; that's how it's possible for multiple people editing the
>> same article not to collide unless their edits actually collide at the
>> paragraph level.
>>
>> Not to mention pulling old versions.
>>
>> Can someone who knows the current code better than me confirm or deny?
>>
>
> There's a few separate issues mixed up here, I think.
>
>
> First: diffs for viewing and the external diff3 merging for resolving edit
> conflicts are actually unrelated code paths and use separate diff engines.
> (Nor does diff3 get used at all unless there actually is a conflict to
> resolve -- if nobody else edited since your change, it's not called.)
>
>
> Second: the notion that diffing a structured document must inherently be
> very slow is, I think, not right.
>
> A well-structured document should be pretty diff-friendly actually; our
> diffs are already working on two separate levels (paragraphs as a whole,
> then words within matched paragraphs). In the most common cases, the diffing
> might actually work pretty much the same -- look for nodes that match, then
> move on to nodes that don't; within changed nodes, look for sub-nodes that
> can be highlighted. Comparisons between nodes may be slower than straight
> strings, but the basic algorithms don't need to be hugely different, and the
> implementation can be in heavily-optimized C++ just like our text diffs are
> today.
>
>
> Third: the most common diff view cases are likely adjacent revisions of
> recent edits, which smells like cache. :) Heck, these could be made once and
> then simply *stored*, never needing to be recalculated again.
>
>
> Fourth: the notion that diffing structured documents would be overwhelming
> for the entire Wikimedia infrastructure... even if we assume such diffs are
> much slower, I think this is not really an issue compared to the huge CPU
> savings that it could bring elsewhere.
>
> The biggest user of CPU has long been parsing and re-parsing of wikitext.
> Every time someone comes along with different view preferences, we have to
> parse again. Every time a template or image changes, we have to parse again.
> Every time there's an edit, we have to parse again. Every time something
> fell out of cache, we have to parse again.
>
> And that parsing is *really expensive* on large, complex pages. Much of the
> history of MediaWiki's parser development has been in figuring out how to
> avoid parsing quite as much, or setting limits to keep the worst corner
> cases from bringing down the server farm.
>
> We parse *way*, *wayyyyy* more than we diff.
>[...]

Even if we diff on average 2-3x per edit, we're only doing order ten
edits a second across the projects, right?  Not going to dig up the
current stats, but that's what I remember from last time I looked.

So the priority remains the parser and cleanup of the syntax actually in
use, from a sanity point of view (being able to describe the syntax
usefully, and in a way that allows multiple parsers to be written), with
diff management as a distant, low-impact priority...


--
-george william herbert
[hidden email]


Re: What do we want to accomplish?

Daniel Friesen-4
In reply to this post by Mark A. Hershberger-2
On 11-01-06 10:15 AM, Mark A. Hershberger wrote:

> Thinking about this question from the other day and the apparently deep
> conviction that XML is the magic elixir, I had to wonder: what about the
> existing Preprocessor_DOM class?
>
> I'm asking out of ignorance.  I realize that the preprocessor is not the
> parser, but it does turn the WikiText into a DOM (right?), and that
> could, conceivably, be used to create different parsers.
>
> What am I missing?
>
> Mark.
The preprocessor only handles a minimal subset of WikiText... its only
function is things like template hierarchy, parser functions, and
perhaps tags. It doesn't do any of the pile of other WikiText syntax.
It's also not really anything special to do with XML; we have a
Preprocessor_Hash too, which IIRC uses PHP arrays instead of a DOM.
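
For reference, the output for a template call is roughly shaped like
this (the element names follow my recollection of Preprocessor_DOM, so
treat it as a sketch of "wikitext in, small XML tree out" rather than a
spec):

<?php
// Roughly what the preprocessor turns
//   Intro text {{Infobox|name=Example}} more text
// into. A different parser could walk these nodes instead of re-scanning
// the raw text, but note everything outside the braces stays plain text.
$tree = <<<'XML'
<root>Intro text <template><title>Infobox</title>
  <part><name>name</name>=<value>Example</value></part>
</template> more text</root>
XML;

$doc = new DOMDocument();
$doc->loadXML($tree);

foreach ($doc->getElementsByTagName('template') as $tpl) {
    $title = $tpl->getElementsByTagName('title')->item(0)->textContent;
    echo "Template call: $title\n";
}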

--
~Daniel Friesen (Dantman, Nadir-Seen-Fire) [http://daniel.friesen.name]



Re: What would be a perfect wiki syntax? (Re: WYSIWYG)

Roan Kattouw-2
In reply to this post by Brion Vibber
2011/1/6 Brion Vibber <[hidden email]>:
> Third: the most common diff view cases are likely adjacent revisions of
> recent edits, which smells like cache. :) Heck, these could be made once and
> then simply *stored*, never needing to be recalculated again.
>
We already do this for text diffs between revisions; we cache them in memcached.
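
Conceptually it's just a lookup keyed on the revision pair; a
stripped-down sketch (the key scheme and TTL here are made up for
illustration, not the real cache keys):

<?php
// Cache a rendered diff keyed by the pair of revision IDs, so it is
// computed once and then simply re-served.
$cache = new Memcached();
$cache->addServer('127.0.0.1', 11211);

function getDiffHtml(Memcached $cache, int $oldRev, int $newRev, callable $compute): string {
    $key = "diffdemo:$oldRev:$newRev";
    $html = $cache->get($key);
    if ($html === false) {                       // cache miss: compute and store
        $html = $compute($oldRev, $newRev);
        $cache->set($key, $html, 7 * 24 * 3600); // keep for a week
    }
    return $html;
}

// The expensive diff computation only runs on a miss.
echo getDiffHtml($cache, 1001, 1002, function ($a, $b) {
    return "<table class=\"diff\"><!-- diff of r$a vs r$b --></table>";
});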

Roan Kattouw (Catrope)


Re: What do we want to accomplish? (was Re: WikiCreole)

Happy-melon
In reply to this post by Jay Ashworth-2

"Jay Ashworth" <[hidden email]> wrote in message
news:[hidden email]...

> ----- Original Message -----
>
> The thing you want expanded, George, is "Last Five Percent"; I refer
> there to (I think it was) David Gerard's comment earlier that the
> first 95% of wikisyntax fits reasonably well into current parser
> building frameworks, and the last 5% causes well adjusted programmers
> to consider heroin... or something like that. :-)
>
> The argument advanced was always "there's too much usage of that ugly
> stuff to consider Just Not Supporting It" and I always asked whether
> anyone with larger computers than me had ever extracted actual statistics,
> and no one ever answered.

This is a key point.  Every other parser discussion has floundered *before*
the stage of saying "here is a working parser which does *something*
interesting, now we can see how it behaves".  Everyone before has got to
that last 5% and said "I can't make this work; I can do *this* which is
kinda similar, but when you combine it with *this* and *that* and *the
other* we're now in a totally different set of edge cases".  And stopped
there.  Obviously it's impossible to quantify all the edge cases of the
current parser *because of* the lack of a schema, but until we actually get
a new parser churning through real wikitext, we're in the dark as to
whether those edge cases make up 5%, 0.5% or 50% of the corpus that's out
there.

--HM
 




Re: What do we want to accomplish? (was Re: WikiCreole)

Neil Harris
On 07/01/11 00:49, Happy-melon wrote:

> "Jay Ashworth"<[hidden email]>  wrote in message
> news:[hidden email]...
>> ----- Original Message -----
>>
>> The thing you want expanded, George, is "Last Five Percent"; I refer
>> there to (I think it was) David Gerard's comment earlier that the
>> first 95% of wikisyntax fits reasonably well into current parser
>> building frameworks, and the last 5% causes well adjusted programmers
>> to consider heroin... or something like that. :-)
>>
>> The argument advanced was always "there's too much usage of that ugly
>> stuff to consider Just Not Supporting It" and I always asked whether
>> anyone with larger computers than me had ever extracted actual statistics,
>> and no one ever answered.
> This is a key point.  Every other parser discussion has floundered *before*
> the stage of saying "here is a working parser which does *something*
> interesting, now we can see how it behaves".  Everyone before has got to
> that last 5% and said "I can't make this work; I can do *this* which is
> kinda similar, but when you combine it with *this* and *that* and *the
> other* we're now in a totally different set of edge cases".  And stopped
> there.  Obviously it's impossible to quantify all the edge cases of the
> current parser *because of* the lack of a schema, but until we actually get
> a new parser churning through real wikitext, we're in the dark as to
> whether those edge cases make up 5%, 0.5% or 50% of the corpus that's out
> there.
>
> --HM
>

Am I right in assuming that "working" in this case means:

(a) being able to parse an article as a valid production of its grammar,
and then
(b) being able to complete the round trip by generating
character-for-character identical wikitext output from that parse tree?

If so, what would count as a statistically useful sample of articles to
test? 1000? 10,000? 100,000? Or, if someone has access to serious
computing resources, and a recent dump, is it worth just trying all of
them? In any case, it would be interesting to have a list of failed
revisions, so developers can study the problems involved.

Given the generality of wikimarkup, and the fact that user-editability
means editors can provide absolutely any string as input, it might
also make sense to try it on random garbage inputs and on "fuzzed"
versions of articles, as well as on real articles.
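
The harness itself can be tiny - a minimal PHP sketch, where
parseToTree() and treeToWikitext() stand in for whichever experimental
parser is under test:

<?php
// Round-trip test: parse each revision, render the tree back to wikitext,
// and record every revision that does not come back character-for-character
// identical (or that fails to parse at all).
function roundTripFailures(iterable $revisions, callable $parseToTree, callable $treeToWikitext): array {
    $failures = [];
    foreach ($revisions as $id => $wikitext) {
        try {
            $tree = $parseToTree($wikitext);
            if ($treeToWikitext($tree) !== $wikitext) {
                $failures[] = $id;   // lossy round trip
            }
        } catch (Throwable $e) {
            $failures[] = $id;       // outright parse failure
        }
    }
    return $failures;
}

// Trivial stand-in "parser" (it round-trips everything), just to show the
// harness running; a real run would iterate over revisions from a dump.
$revs = [101 => "== Heading ==\nSome text.", 102 => "{{weird|{{nested}}"];
$fails = roundTripFailures(
    $revs,
    fn ($text) => ['text' => $text],
    fn ($tree) => $tree['text']
);
echo 'Failed revisions: ', implode(', ', $fails), "\n";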

Flexbisonparser looks like the most plausible candidate for testing.
Does anyone know if it is currently buildable?

-- Neil




Re: How would you disrupt Wikipedia?

Antoine Musso-3
In reply to this post by David Gerard-2
On 01/01/11 16:06, David Gerard wrote:
> Because MediaWiki is very little work. And we like to be treated like
> heroes every now and then.

This is my exact experience.  I have been a "hero" for 4 years at my
current company.  Almost all departments now have a MediaWiki
installation, and nobody has complained about the lack of ACLs or WYSIWTF :b

The main issues users encountered were:
  - installing the ParserFunctions extension
  - getting the Wikipedia look and feel (just add some CSS)
  - single sign-on (install Ryan Lane's LDAP authentication extension)


--
Ashar Voultoiz



Re: How would you disrupt Wikipedia?

Dmitriy Sintsov
* Ashar Voultoiz <[hidden email]> [Sat, 08 Jan 2011 23:08:23 +0100]:
> On 01/01/11 16:06, David Gerard wrote:
> > Because MediaWiki is very little work. And we like to be treated like
> > heroes every now and then.
>
MediaWiki is not a little work. Not everybody can set up a farm with
its own (not WMF's) shared repository "commons" (I am especially
speaking of the pre-InstantCommons era, when you had to alter many global
settings). Not everybody can have a path-based farm instead of a
DNS-based one. Even memorizing these wg* globals is a lot of work. 99% of
users do not even know that one can add JS scripts to the MediaWiki
namespace.

Everything has been done at my primary job to undermine my MediaWiki
deployment efforts - that it "can easily be installed via the Linux
package - so why is he installing it manually", "the markup is
primitive", "inflexible", "PHP is an inferior language, use ASP.NET
instead" and so on.

> This is my exact experience.  I have been a "hero" for 4 years at my
> current company.  Almost all departments now have a MediaWiki
> installation, and nobody has complained about the lack of ACLs or WYSIWTF :b
>
BTW, there's HaloACL nowadays, although I haven't deployed it yet.
Unfortunately my own experience with earning a living from MediaWiki
work is not so bright - perhaps because this is a third-world country.

> The main issues users encountered were:
>   - installing the ParserFunctions extension
>   - getting the Wikipedia look and feel (just add some CSS)
>   - single sign-on (install Ryan Lane's LDAP authentication extension)
>
Yes, that is simple. However, not everything is simple, and sometimes you
have to write your own extension. For example, there were no flexible
poll extensions some years ago.
Dmitriy


Re: How would you disrupt Wikipedia?

Marco Schuster-2
On Mon, Jan 10, 2011 at 7:25 PM, Dmitriy Sintsov <[hidden email]> wrote:
> Everything has been done at my primary job to undermine my MediaWiki
> deployment efforts - that it "can easily be installed via the Linux
> package - so why is he installing it manually", "the markup is
> primitive", "inflexible", "PHP is an inferior language, use ASP.NET
> instead" and so on.
ASP.NET? Only if you want all your source code exposed.

Marco

--
VMSoft GbR
Nabburger Str. 15
81737 München
Geschäftsführer: Marco Schuster, Volker Hemmert
http://vmsoft-gbr.de
