Parser implementation for MediaWiki syntax


Parser implementation for MediaWiki syntax

Andreas Jonsson-5
Hi,

I have written a parser for MediaWiki syntax and have set up a test
site for it here:

http://libmwparser.kreablo.se/index.php/Libmwparsertest

and the source code is available here:

http://svn.wikimedia.org/svnroot/mediawiki/trunk/parsers/libmwparser

A preprocessor will take care of parser functions, magic words,
comment removal, and transclusion.  But as it wasn't possible to
cleanly separate these functions from the existing preprocessor, some
preprocessing is disabled at the test site.  It should be
straightforward to write a new preprocessor that provides only the required
functionality, however.

The parser is not feature complete, but the hard parts are solved.  I
consider "the hard parts" to be:

* parsing apostrophes
* parsing html mixed with wikitext
* parsing headings and links
* parsing image links

And when I say "solved" I mean producing the same or equivalent output
as the original parser, as long as the behavior of the original parser
is well defined and produces valid html.

Here is a schematic overview of the design:

+-----------------------+
|                       |              Wikitext
|  client application   +---------------------------------------+
|                       |                                       |
+-----------------------+                                       |
            ^                                                    |
            | Event stream                                       |
+----------+------------+        +-------------------------+    |
|                       |        |                         |    |
|    parser context     |<------>|         Parser          |    |
|                       |        |                         |    |
+-----------------------+        +-------------------------+    |
                                               ^                 |
                                               | Token stream    |
+-----------------------+        +------------+------------+    |
|                       |        |                         |    |
|    lexer context      |<------>|         Lexer           |<---+
|                       |        |                         |
+-----------------------+        +-------------------------+
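
To make the event-stream interface a bit more concrete, here is a
minimal sketch of the kind of callback-based consumer the diagram
implies (Python, purely illustrative; the real API is C and all names
below are invented):

  # Hypothetical SAX-style event consumer; the parser walks the wikitext
  # and calls back into the client instead of returning an HTML string.
  class WikitextEventHandler:
      def begin_element(self, name, attrs=None):
          print("begin", name, attrs or {})

      def end_element(self, name):
          print("end", name)

      def text(self, data):
          print("text", repr(data))

  # A client application registers such a handler with the parser and
  # renders, indexes, or transforms the events however it likes.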


The design is described in more detail in a series of posts on the
wikitext-l mailing list.  The most important "trick" is to make sure
that the lexer never produces a spurious token.  An end token for a
production will not appear unless the corresponding begin token has
already been produced, and the lexer maintains a block context so that
it only produces tokens that make sense in the current block.
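
As a rough illustration of that token discipline, here is a Python
sketch of the idea (my own simplification, not the generated lexer):

  # The lexer only emits an end token if the corresponding begin token
  # has already been emitted; a stray end marker is downgraded to plain
  # text, so the parser never sees an unbalanced pair.
  def filter_tokens(raw_tokens):
      open_productions = []                  # productions begun so far
      for kind, prod, text in raw_tokens:
          if kind == "begin":
              open_productions.append(prod)
              yield ("begin", prod, text)
          elif kind == "end" and prod in open_productions:
              open_productions.remove(prod)
              yield ("end", prod, text)
          else:                              # spurious end, or plain text
              yield ("text", None, text)

  # A lone end marker with no preceding begin comes out as ordinary text:
  print(list(filter_tokens([("end", "italic", "''")])))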

I have used Antlr for generating both the parser and the lexer, as
Antlr supports semantic predicates that can be used for
context-sensitive parsing.  I am also using a slightly patched version
of Antlr's C runtime environment, because the lexer needs to support
speculative execution in order to do context-sensitive lookahead.

A SWIG-generated interface is used for providing the PHP API.  The
parser processes the buffer of the PHP string directly, and writes its
output to an array of PHP strings.  Only UTF-8 is supported at the
moment.

The performance seems to be about the same as for the original parser
on plain text.  But with an increasing amount of markup, the original
parser runs slower.  This new parser implementation maintains roughly
the same performance regardless of input.

I think that this demonstrates the feasibility of replacing the
MediaWiki parser.  There is still a lot of work to do in order to turn
it into a full replacement, however.

Best regards,

Andreas



Re: Parser implementation for MediaWiki syntax

Bryan Tong Minh
Hi,


Pretty awesome work you've done!

On Thu, Sep 23, 2010 at 11:27 AM, Andreas Jonsson
<[hidden email]> wrote:
> I think that this demonstrates the feasability of replacing the
> MediaWiki parser.  There is still a lot of work to do in order to turn
> it into a full replacement, however.
>

Have you already tried to run the parsertests that come with
MediaWiki? Do they produce (roughly) the same output as with the PHP
parser?



Bryan


Re: Parser implementation for MediaWiki syntax

Robin Krahl
On 23.09.2010 11:34, Bryan Tong Minh wrote:
> Pretty awesome work you've done!

+1.

I was just about to write my own parser for an application, but this is
really great. :)

  Robin



Re: Parser implementation for MediaWiki syntax

Andreas Jonsson-5
In reply to this post by Bryan Tong Minh
2010-09-23 11:34, Bryan Tong Minh wrote:

> Hi,
>
>
> Pretty awesome work you've done!
>
> On Thu, Sep 23, 2010 at 11:27 AM, Andreas Jonsson
> <[hidden email]>  wrote:
>    
>> I think that this demonstrates the feasability of replacing the
>> MediaWiki parser.  There is still a lot of work to do in order to turn
>> it into a full replacement, however.
>>
>>      
> Have you already tried to run the parsertests that come with
> MediaWiki? Do they produce (roughly) the same output as with the PHP
> parser?
>
>    

No, I haven't.  I have produced my own set of unit tests that are
based on the original parser.  For the features that I have
implemented, the output should be roughly the same under "normal"
circumstances.

But the original parser has tons of border cases where the behavior
is not very well defined.  For instance, the table on the test page
will render very differently with the original parser (it will
actually turn into two separate tables).

I am employing a consistent and easily understood strategy for
handling html intermixed with wikitext markup; it is easy to explain
that the |} token is disabled in the context of an html-table.  There
is no such simple explanation for the behavior of the original parser,
even though in this particular example the produced html code happens
to be valid (which isn't always the case).

So, what I'm trying to say is that for the border cases where my
implementation differs from the original, the behavior of my parser
should be considered the correct one. :-)

/Andreas



Re: Parser implementation for MediaWiki syntax

Krinkle
On 23 Sep 2010, at 14:14, Andreas Jonsson wrote:

> 2010-09-23 11:34, Bryan Tong Minh skrev:
>> Hi,
>>
>>
>> Pretty awesome work you've done!
>>
>> On Thu, Sep 23, 2010 at 11:27 AM, Andreas Jonsson
>> <[hidden email]>  wrote:
>>
>>> I think that this demonstrates the feasability of replacing the
>>> MediaWiki parser.  There is still a lot of work to do in order to  
>>> turn
>>> it into a full replacement, however.
>>>
>>>
>> Have you already tried to run the parsertests that come with
>> MediaWiki? Do they produce (roughly) the same output as with the PHP
>> parser?
>>
>>
>
> No, I haven't.  I have produced my own set of unit tests that are
> based on the original parser.  For the features that I have
> implemented, the output should be roughly the same under "normal"
> circumstances.
>
> But the original parser have tons of border cases where the behavior
> is not very well defined.  For instance, the table on the test page
> will render very differently with the original parser (it will
> actually turn into two separate tables).
>
> I am employing a consistent and easily understood strategy for
> handling html intermixed with wikitext markup; it is easy to explain
> that the |} token is disabled in the context of an html-table.  There
> is no such simple explanation for the behavior of the original parser,
> even though in this particular example the produced html code happens
> to be valid (which isn't always the case).
>
> So, what I'm trying to say is that for the border cases where my
> implementation differs from the original, the behavior of my parser
> should be considered the correct one. :-)
>
> /Andreas
>
>
> _______________________________________________
> Wikitech-l mailing list
> [hidden email]
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Hm...
Depending on how 'edge' those edge cases are, and on how well known
they are, doing that may render it unusable for established wikis and
would never become the default anytime soon, right?

--
Krinkle




Re: Parser implementation for MediaWiki syntax

Andreas Jonsson-5
2010-09-23 14:17, Krinkle wrote:

> Op 23 sep 2010, om 14:14 heeft Andreas Jonsson het volgende geschreven:
>
>    
>> 2010-09-23 11:34, Bryan Tong Minh skrev:
>>      
>>> Hi,
>>>
>>>
>>> Pretty awesome work you've done!
>>>
>>> On Thu, Sep 23, 2010 at 11:27 AM, Andreas Jonsson
>>> <[hidden email]>   wrote:
>>>
>>>        
>>>> I think that this demonstrates the feasability of replacing the
>>>> MediaWiki parser.  There is still a lot of work to do in order to
>>>> turn
>>>> it into a full replacement, however.
>>>>
>>>>
>>>>          
>>> Have you already tried to run the parsertests that come with
>>> MediaWiki? Do they produce (roughly) the same output as with the PHP
>>> parser?
>>>
>>>
>>>        
>> No, I haven't.  I have produced my own set of unit tests that are
>> based on the original parser.  For the features that I have
>> implemented, the output should be roughly the same under "normal"
>> circumstances.
>>
>> But the original parser have tons of border cases where the behavior
>> is not very well defined.  For instance, the table on the test page
>> will render very differently with the original parser (it will
>> actually turn into two separate tables).
>>
>> I am employing a consistent and easily understood strategy for
>> handling html intermixed with wikitext markup; it is easy to explain
>> that the |} token is disabled in the context of an html-table.  There
>> is no such simple explanation for the behavior of the original parser,
>> even though in this particular example the produced html code happens
>> to be valid (which isn't always the case).
>>
>> So, what I'm trying to say is that for the border cases where my
>> implementation differs from the original, the behavior of my parser
>> should be considered the correct one. :-)
>>
>> /Andreas
>>
>>
>> _______________________________________________
>> Wikitech-l mailing list
>> [hidden email]
>> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>>      
> Hm...
> Depending on how 'edge' those edge cases are, and on how much they are
> known.
> Doing that would may render it unusable for established wikis and
> would never become the default anytime soon, right ?
>
>    
We are talking about the edge cases that arise when intermixing
wikitext and html code in "creative" ways.  This is for instance ok
with the original parser:

* item 1 <li> item 2
* item 3

That may seem harmless and easy to handle, but surprise!  Explicitly
adding the </li> token doesn't work as expected:

* item 1 <li> item 2 </li>
* item 3

And what happens when you add a new html list inside a wikitext list
item without closing it?

* item 1 <ul><li> item 2
* item 3

Which list should item 3 belong to?  You can come up with
thousands of situations like this, and without a consistent plan on
how to handle them, you will need to add thousands of border cases to
the code to handle them all.

I have avoided this by simply disabling all html block tokens inside
wikitext list items.  Of course, it may be that someone is actually
relying on being able to mix in this way, but it doesn't seem likely
as the result tends to be strange.
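
The rule itself is easy to state in code.  Here is a minimal Python
sketch of what I mean by disabling tokens per block context (the token
sets are illustrative, not the exact ones the lexer uses):

  # Tokens the lexer refuses to produce in a given block context; a
  # suppressed token simply falls through as plain text.
  DISABLED_TOKENS = {
      "wikitext-list-item": {"<ul>", "<ol>", "<li>", "<div>", "<table>"},
      "html-table": {"|", "|-", "|}"},
  }

  def token_enabled(token, block_context):
      return token not in DISABLED_TOKENS.get(block_context, set())

  # Inside "* item 1 <li> item 2" the lexer is in the wikitext-list-item
  # context, so "<li>" never becomes an html token:
  print(token_enabled("<li>", "wikitext-list-item"))   # False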

/Andreas



Re: Parser implementation for MediaWiki syntax

Krinkle
On 23 Sep 2010, at 14:47, Andreas Jonsson wrote:

> 2010-09-23 14:17, Krinkle skrev:
>> Op 23 sep 2010, om 14:14 heeft Andreas Jonsson het volgende  
>> geschreven:
>>
>>
>>> 2010-09-23 11:34, Bryan Tong Minh skrev:
>>>
>>>> Hi,
>>>>
>>>>
>>>> Pretty awesome work you've done!
>>>>
>>>> On Thu, Sep 23, 2010 at 11:27 AM, Andreas Jonsson
>>>> <[hidden email]>   wrote:
>>>>
>>>>
>>>>> I think that this demonstrates the feasability of replacing the
>>>>> MediaWiki parser.  There is still a lot of work to do in order to
>>>>> turn
>>>>> it into a full replacement, however.
>>>>>
>>>>>
>>>>>
>>>> Have you already tried to run the parsertests that come with
>>>> MediaWiki? Do they produce (roughly) the same output as with the  
>>>> PHP
>>>> parser?
>>>>
>>>>
>>>>
>>> No, I haven't.  I have produced my own set of unit tests that are
>>> based on the original parser.  For the features that I have
>>> implemented, the output should be roughly the same under "normal"
>>> circumstances.
>>>
>>> But the original parser have tons of border cases where the behavior
>>> is not very well defined.  For instance, the table on the test page
>>> will render very differently with the original parser (it will
>>> actually turn into two separate tables).
>>>
>>> I am employing a consistent and easily understood strategy for
>>> handling html intermixed with wikitext markup; it is easy to explain
>>> that the |} token is disabled in the context of an html-table.  
>>> There
>>> is no such simple explanation for the behavior of the original  
>>> parser,
>>> even though in this particular example the produced html code  
>>> happens
>>> to be valid (which isn't always the case).
>>>
>>> So, what I'm trying to say is that for the border cases where my
>>> implementation differs from the original, the behavior of my parser
>>> should be considered the correct one. :-)
>>>
>>> /Andreas
>>>
>>>
>>> _______________________________________________
>>> Wikitech-l mailing list
>>> [hidden email]
>>> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>>>
>> Hm...
>> Depending on how 'edge' those edge cases are, and on how much they  
>> are
>> known.
>> Doing that would may render it unusable for established wikis and
>> would never become the default anytime soon, right ?
>>
>>
> We are talking about the edge cases that arise when intermixing
> wikitext and html code in "creative" ways.  This is for instance ok
> with the original parser:
>
> * item 1 <li> item 2
> * item 3
>
> That may seem harmless and easy to handle, but suprise!  explicitly
> adding the </li> token doesn't work as expected:
>
> * item 1 <li> item 2 </li>
> * item 3
>
> And what happens when you add a new html list inside a wikitext list
> item without closing it?
>
> * item 1 <ul><li> item 2
> * item 3
>
> Which list should item 3 belong to?  You can can come up with
> thousands of situations like this, and without a consistent plan on
> how to handle them, you will need to add thousands of border cases to
> the code to handle them all.
>
> I have avoided this by simply disabling all html block tokens inside
> wikitext list items.  Of course, it may be that someone is actually
> relying on being able to mix in this way, but it doesn't seem likely
> as the result tends to be strange.
>
> /Andreas
>

I agree that making it consistent is important and will only lead to
good things (such as people getting used to the behaviour and being
able to predict what something will logically do).

About the HTML-in-wikitext mixup: although it isn't done directly, it
is most certainly done indirectly.

Imagine a template which consists of a table in wikitext.  A certain
parameter's value is output in a table cell.
On some page that template is called, and the parameter is filled in
with the help of a parser function (like #if or #expr).
To avoid a mess of escape templates (the pipe problem, think {{!}}),
the table inside this table cell is in a lot of cases built in HTML
instead of wikitext.

The result is an HTML table inside a wikitext table.

Or, for example, the issue with whitespace and parser functions /
template parameters.  Starting something like a table or list requires
the block-level hack (such as <br /> or <div></div> after the pipe, and
then the {| table |} or * list on the next line).  To avoid those,
complex templates often use HTML instead.
If such a template were called on a page with an already existing
wikitext list in place, there would be an HTML list inside a wikitext
list.

I don't know in which order the parser works, but I think that if that
behaviour changes, lots of complicated templates will break, and not
just on Wikimedia projects.

--
Krinkle


Re: Parser implementation for MediaWiki syntax

Andreas Jonsson-5
2010-09-23 14:56, Krinkle wrote:

> Op 23 sep 2010, om 14:47 heeft Andreas Jonsson het volgende geschreven:
>
>    
>> 2010-09-23 14:17, Krinkle skrev:
>>      
>>> Op 23 sep 2010, om 14:14 heeft Andreas Jonsson het volgende
>>> geschreven:
>>>
>>>
>>>        
>>>> 2010-09-23 11:34, Bryan Tong Minh skrev:
>>>>
>>>>          
>>>>> Hi,
>>>>>
>>>>>
>>>>> Pretty awesome work you've done!
>>>>>
>>>>> On Thu, Sep 23, 2010 at 11:27 AM, Andreas Jonsson
>>>>> <[hidden email]>    wrote:
>>>>>
>>>>>
>>>>>            
>>>>>> I think that this demonstrates the feasability of replacing the
>>>>>> MediaWiki parser.  There is still a lot of work to do in order to
>>>>>> turn
>>>>>> it into a full replacement, however.
>>>>>>
>>>>>>
>>>>>>
>>>>>>              
>>>>> Have you already tried to run the parsertests that come with
>>>>> MediaWiki? Do they produce (roughly) the same output as with the
>>>>> PHP
>>>>> parser?
>>>>>
>>>>>
>>>>>
>>>>>            
>>>> No, I haven't.  I have produced my own set of unit tests that are
>>>> based on the original parser.  For the features that I have
>>>> implemented, the output should be roughly the same under "normal"
>>>> circumstances.
>>>>
>>>> But the original parser have tons of border cases where the behavior
>>>> is not very well defined.  For instance, the table on the test page
>>>> will render very differently with the original parser (it will
>>>> actually turn into two separate tables).
>>>>
>>>> I am employing a consistent and easily understood strategy for
>>>> handling html intermixed with wikitext markup; it is easy to explain
>>>> that the |} token is disabled in the context of an html-table.
>>>> There
>>>> is no such simple explanation for the behavior of the original
>>>> parser,
>>>> even though in this particular example the produced html code
>>>> happens
>>>> to be valid (which isn't always the case).
>>>>
>>>> So, what I'm trying to say is that for the border cases where my
>>>> implementation differs from the original, the behavior of my parser
>>>> should be considered the correct one. :-)
>>>>
>>>> /Andreas
>>>>
>>>>
>>>> _______________________________________________
>>>> Wikitech-l mailing list
>>>> [hidden email]
>>>> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>>>>
>>>>          
>>> Hm...
>>> Depending on how 'edge' those edge cases are, and on how much they
>>> are
>>> known.
>>> Doing that would may render it unusable for established wikis and
>>> would never become the default anytime soon, right ?
>>>
>>>
>>>        
>> We are talking about the edge cases that arise when intermixing
>> wikitext and html code in "creative" ways.  This is for instance ok
>> with the original parser:
>>
>> * item 1<li>  item 2
>> * item 3
>>
>> That may seem harmless and easy to handle, but suprise!  explicitly
>> adding the</li>  token doesn't work as expected:
>>
>> * item 1<li>  item 2</li>
>> * item 3
>>
>> And what happens when you add a new html list inside a wikitext list
>> item without closing it?
>>
>> * item 1<ul><li>  item 2
>> * item 3
>>
>> Which list should item 3 belong to?  You can can come up with
>> thousands of situations like this, and without a consistent plan on
>> how to handle them, you will need to add thousands of border cases to
>> the code to handle them all.
>>
>> I have avoided this by simply disabling all html block tokens inside
>> wikitext list items.  Of course, it may be that someone is actually
>> relying on being able to mix in this way, but it doesn't seem likely
>> as the result tends to be strange.
>>
>> /Andreas
>>
>>      
> I agree that making in consistant is important and will only cause
> good things (such as people getting used to behaviour and being able
> to predict what something would logically do).
>
> About the html in wikitext mixup: Although not directly, it is most
> certainly done indirectly.
>
> Imagine a template which is consists of a table in wikitext. A certain
> parameters value is outputted in a table cel.
> On some page that template is called and the parameter is filled with
> the help of a parser function (like #if or #expr).
> To avoid mess and escape templates, the table inside this table cell
> is build from there in HTML in a lot of cases instead of wiki text
> (pipe problem, think {{!}})
>
> Result is html table in wikitext table.
>
>    
Yes, but that is supported by the parser.  What isn't supported is
mixing tokens from html tables with tokens from a wikitext table.  So
you have:

<table><td>this is a column inside an html table, and as such,
| token and
|- token are disabled.  However,
{|
| opens up a wikitext table, which changes the context so that now <td>
<tr> and </table> tokens are disabled.  But it is still possible to once
again
<table><td> open up a html table and thus the context is switched so
that the
|} token is disabled.
</table>
|}
</table>

And here we're back to an ordinary paragraph.
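
In other words, the lexer context behaves roughly like a stack of table
contexts.  A simplified Python sketch of that bookkeeping (not the
actual implementation):

  # Each table construct pushes its own context; only the closing token
  # that matches the innermost context is enabled.
  class TableContext:
      def __init__(self):
          self.stack = []

      def open_html_table(self):
          self.stack.append("html")      # |, |- and |} are now disabled

      def open_wikitext_table(self):
          self.stack.append("wikitext")  # <td>, <tr>, </table> disabled

      def close_enabled(self, token):
          if not self.stack:
              return False
          if self.stack[-1] == "html":
              return token == "</table>"
          return token == "|}"

      def close(self):
          self.stack.pop()

  ctx = TableContext()
  ctx.open_html_table()
  ctx.open_wikitext_table()
  print(ctx.close_enabled("|}"), ctx.close_enabled("</table>"))  # True False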

> Or for example the thing with whitespace and parser functions /
> template parameters. Starting something like a table or list requires
> the block level hack (like<br />  or<div></div>  after the pipe, and
> then the {| table |} or *list on the next time). To avoid those in
> complex templates often HTML is used.
> If that template would be called on a page with an already existing
> wikitext list in place there would be a html list inside a wikitext
> list.
>
>    
A feasible alternative is to parse these as inline block elements
inside wikitext list elements, which I'm already doing for image links
with caption.  But I think that it is preferable to just disable them.

> I dont know in which order the parser works, but I think if the
> behaviour changes of that lots of complicated templates will break,
> and not just on Wikimedia projects.
>    
That's possible, but I believe that the set of broken templates can be
limited to a great extent.  To deploy a new parser on an existing
site, one would need a tool that walks the existing pages and warns
about suspected problems.

/Andreas


Re: Parser implementation for MediaWiki syntax

David Gerard-2
On 23 September 2010 14:25, Andreas Jonsson <[hidden email]> wrote:

> That's possible, but I believe that the set of broken templates can be
> limited to a great extent.  To deploy a new parser on an existing
> site, one would need a tool that walks the existing pages and warns
> about suspected problems.


Presumably it would be feasible to get a current pages (and templates)
dump and compare the results of the old parser and the new parser for
problematic constructs that are used in practice. In fact, you'd
pretty much have to do something like that for a replacement parser to
be considered a substitute.
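
Something along those lines would be straightforward to script.  A
sketch, assuming two callable renderers and an iterator over the dump
(all names are placeholders, not an existing tool):

  import re

  def normalize(html):
      # collapse whitespace so harmless formatting differences don't count
      return re.sub(r"\s+", " ", html).strip()

  def find_suspect_pages(pages, render_old, render_new):
      """pages: iterable of (title, wikitext); render_*: callables."""
      for title, wikitext in pages:
          if normalize(render_old(wikitext)) != normalize(render_new(wikitext)):
              yield title

  # for title in find_suspect_pages(iter_dump("pages-articles.xml"),
  #                                 old_parser.render, new_parser.render):
  #     print("check:", title)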


- d.


Re: Parser implementation for MediaWiki syntax

Aryeh Gregor
In reply to this post by Andreas Jonsson-5
On Thu, Sep 23, 2010 at 8:47 AM, Andreas Jonsson
<[hidden email]> wrote:
> . . . You can can come up with
> thousands of situations like this, and without a consistent plan on
> how to handle them, you will need to add thousands of border cases to
> the code to handle them all.
>
> I have avoided this by simply disabling all html block tokens inside
> wikitext list items.  Of course, it may be that someone is actually
> relying on being able to mix in this way, but it doesn't seem likely
> as the result tends to be strange.

The way the parser is used in real life is that people just write
random stuff until it looks right.  They wind up hitting all sorts of
bizarre edge cases, and these are propagated to thousands of pages by
templates.  A pure-PHP parser is needed for end users who can't
install binaries, and any replacement parser must be compatible with
it in practice, not just on the cases where the pure-PHP parser
behaves sanely.  In principle, we might be able to change parser
behavior in lots of edge cases and let users fix the broken stuff, if
the benefit is large enough.  But we'd have to have a pure-PHP parser
that implements the new behavior too.

The parts you considered to be the hard parts are not that hard.
We've had lots of parser projects, and I'm sure some have handled
those.  The hard part is coming up with a practical way to integrate
the new and simplified parser into MediaWiki in such a way that it can
actually be used at some point on sites like Wikipedia.  Do you have
any plans for how to do that?


Re: Parser implementation for MediaWiki syntax

Andreas Jonsson-5
2010-09-26 20:57, Aryeh Gregor wrote:

> On Thu, Sep 23, 2010 at 8:47 AM, Andreas Jonsson
> <[hidden email]>  wrote:
>    
>> . . . You can can come up with
>> thousands of situations like this, and without a consistent plan on
>> how to handle them, you will need to add thousands of border cases to
>> the code to handle them all.
>>
>> I have avoided this by simply disabling all html block tokens inside
>> wikitext list items.  Of course, it may be that someone is actually
>> relying on being able to mix in this way, but it doesn't seem likely
>> as the result tends to be strange.
>>      
> The way the parser is used in real life is that people just write
> random stuff until it looks right.  They wind up hitting all sorts of
> bizarre edge cases, and these are propagated to thousands of pages by
> templates.

Yes, this is a problem.

> A pure-PHP parser is needed for end users who can't
> install binaries, and any replacement parser must be compatible with
> it in practice, not just on the cases where the pure-PHP parser
> behaves sanely.  In principle, we might be able to change parser
> behavior in lots of edge cases and let users fix the broken stuff, if
> the benefit is large enough.  But we'd have to have a pure-PHP parser
> that implements the new behavior too.
>    

Antlr is a multi-language parser generator.  Unfortunately,
PHP is not currently on the list of target languages.  Porting the
back end to PHP is a feasible task, however.  Likewise, porting my
parser implementation to PHP is feasible.  The question then becomes
whether you want to maintain two language versions in order to also
have the performance advantage of the C parser.

> The parts you considered to be the hard parts are not that hard.
>    

What support do you have for this claim?  Parsing wikitext is
difficult, because of the any-input-is-valid-wikitext philosophy.
Parsing MediaWiki wikitext is very difficult, since it is not designed
to be parsable.

I consider the parts I pointed out to be hard because
they cannot be implemented with standard parser techniques.  I've
developed a state machine for enabling/disabling individual token
productions depending on context; I've employed speculative execution
in the lexical analysator to support context sensitive lookahead.  I
don't believe that you will find these techniques in any text book on
compiler design.  So I consider these items hard in the sense that
before I started working on my implementation, it was not at all clear
that I would be able to find a workable algorithm.
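
To give a feel for what I mean by speculative execution, here is a toy
Python sketch of the mark/scan-ahead/rewind pattern (the real mechanism
lives in the patched Antlr C runtime):

  # The lexer marks its position, speculatively scans ahead to decide
  # whether a context-dependent construct really applies, then rewinds;
  # no tokens are emitted during the speculative scan.
  class SpeculativeScanner:
      def __init__(self, text):
          self.text = text
          self.pos = 0

      def mark(self):
          return self.pos

      def rewind(self, marker):
          self.pos = marker

      def ahead_contains(self, needle, limit=200):
          marker = self.mark()
          window = self.text[self.pos:self.pos + limit]
          self.rewind(marker)
          return needle in window

  s = SpeculativeScanner("[[Image:Foo.png|thumb|a [[link]] in a caption]]")
  print(s.ahead_contains("]]"))  # lookahead decides how "[[" is tokenized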

As for the apostrophe heuristics, as much as 30% of the CPU time seems
to be spent on them, regardless of whether there are any apostrophes in
the text.  So I consider that hard in the sense that it is a very high
cost for very little functionality.  It might be possible to get rid
of this overhead at the cost of higher implementation complexity,
though.
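
For those who have not looked at it: the heuristics decide, for each
run of apostrophes, whether it means italics, bold, both, or literal
apostrophes, and then rebalance unmatched quotes over the line.  A
much-simplified Python sketch of only the classification step (not the
original parser's algorithm):

  import re

  # Simplified: classify each run of two or more apostrophes by length.
  # The real heuristic additionally rebalances unmatched quotes per line.
  def classify_quote_runs(line):
      runs = []
      for m in re.finditer(r"'{2,}", line):
          n = len(m.group())
          if n == 2:
              runs.append(("italic-toggle", m.start()))
          elif n == 3:
              runs.append(("bold-toggle", m.start()))
          elif n == 4:              # one literal apostrophe, then bold
              runs.append(("bold-toggle", m.start() + 1))
          else:                     # five or more: bold + italic toggle
              runs.append(("bold-italic-toggle", m.start()))
      return runs

  print(classify_quote_runs("'''bold''' and ''italic''"))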

> We've had lots of parser projects, and I'm sure some have handled
> those.

Point me to one that has.

> The hard part is coming up with a practical way to integrate
> the new and simplified parser into MediaWiki in such a way that it can
> actually be used at some point on sites like Wikipedia.  Do you have
> any plans for how to do that?
>    

Developing a fully featured integration would require a large amount
of work.  But I wouldn't call it hard.  I haven't analysed all of the
hooks, and it is possible that some difficulties will turn up when
implementing emulation of these.  But other than that, I cannot see
that the integration work would consist of anything but a long list of
relatively simple tasks.  If a project were to be launched to perform
the integration, I would feel confident that it would reach its
goal.  Considering the large body of information that is encoded in
MediaWiki syntax, I would guess that there is a strong
interest in actually spending effort on this.

/Andreas


Re: Parser implementation for MediaWiki syntax

Aryeh Gregor
On Mon, Sep 27, 2010 at 3:38 AM, Andreas Jonsson
<[hidden email]> wrote:
> Point me to one that has.

Maybe I'm wrong.  I've never looked at them in depth.  I don't mean to
be discouraging here.  If you can replace the MediaWiki parser with
something sane, my hat is off to you.  But if you don't receive a very
enthusiastic response from established developers, it's probably
because we've had various people trying to replace MediaWiki's parser
with a more conventional one since like 2003, and it's never produced
anything usable in practice.  The prevailing sentiment is reflected
pretty well in Tim's commit summary from shortly before giving you
commit access:

http://www.mediawiki.org/wiki/Special:Code/MediaWiki/71620

Maybe we're just pessimistic, though.  I'd be happy to be proven wrong!

> Developing a fully featured integration would require a large amount
> of work.  But I wouldn't call it hard.  I haven't analysed all of the
> hooks, and it is possible that some difficulties will turn up when
> implementing emulation of these.  But other than that, I cannot see
> that the integration work would consist of anything but a long list of
> relatively simple tasks.  If a project were to be launced to perform
> the integration, I would feel confidident that it would reach its
> goal.

For practical purposes, something that requires a large amount of
fiddly and boring work counts as hard too.  It can stall a project as
easily as anything else.


Re: Parser implementation for MediaWiki syntax

Antoine Musso-3
On 27/09/10 19:42, Aryeh Gregor wrote:
<snip>
> :: we've had various people trying to replace MediaWiki's parser
> with a more conventional one since like 2003, and it's never produced
> anything usable in practice.

JeLuF wrote a token-based parser back in 1.3. I am almost sure it was
used on the live site. It eventually got disabled after a couple of
days due to performance issues:

http://thread.gmane.org/gmane.science.linguistics.wikipedia.technical/10496

I have found some old code in REL1_3:
http://svn.wikimedia.org/viewvc/mediawiki/branches/REL1_3/phase3/includes/Tokenizer.php?revision=4179&view=markup

That was before transclusions, categories ...


--
Ashar Voultoiz



Re: Parser implementation for MediaWiki syntax

Chad
In reply to this post by Aryeh Gregor
On Mon, Sep 27, 2010 at 1:42 PM, Aryeh Gregor
<[hidden email]> wrote:

> On Mon, Sep 27, 2010 at 3:38 AM, Andreas Jonsson
> <[hidden email]> wrote:
>> Point me to one that has.
>
> Maybe I'm wrong.  I've never looked at them in depth.  I don't mean to
> be discouraging here.  If you can replace the MediaWiki parser with
> something sane, my hat is off to you.  But if you don't receive a very
> enthusiastic response from established developers, it's probably
> because we've had various people trying to replace MediaWiki's parser
> with a more conventional one since like 2003, and it's never produced
> anything usable in practice.  The prevailing sentiment is reflected
> pretty well in Tim's commit summary from shortly before giving you
> commit access:
>
> http://www.mediawiki.org/wiki/Special:Code/MediaWiki/71620
>
> Maybe we're just pessimistic, though.  I'd be happy to be proven wrong!
>

This. Tim sums up the consensus very well with that commit summary.
He also made some comments on the history of wikitext and alternative
parsers on foundation-l back in Jan '09[0]. Worth a read (starting mainly
at ""Parser" is a convenient and short name for it").

While a real parser is a nice pipe dream, in practice not a single project
to "rewrite the parser" has succeeded in all the years people have been
trying. Like Aryeh says, if you can pull it off and make it practical, hats
off to you.

-Chad

[0] http://article.gmane.org/gmane.org.wikimedia.foundation/35876/


Re: Parser implementation for MediaWiki syntax

Paul Houle
  On 9/27/2010 2:58 PM, Chad wrote:
> This. Tim sums up the consensus very well with that commit summary.
> He also made some comments on the history of wikitext and alternative
> parsers on foundation-l back in Jan '09[0]. Worth a read (starting mainly
> at ""Parser" is a convenient and short name for it").
>
> While a real parser is a nice pipe dream, in practice not a single project
> to "rewrite the parser" has succeeded in the years of people  trying. Like
> Aryeh says, if you can pull it off and make it practical, hats off to you.
>
     For my own IX work I've written a MediaWiki markup parser in C#
based on the Irony framework.  It fails to parse about 0.5% of pages in
Wikipedia and is oblivious to a lot of the stranger stuff [like the HTML
intrusions], but it does a good job of eating infoboxes and making sense
of internal and external links.  Now, the strange stuff plus the parse
failures would probably be impossible to handle in a rational way...


Re: Parser implementation for MediaWiki syntax

Andreas Jonsson-5
In reply to this post by Aryeh Gregor
2010-09-27 19:42, Aryeh Gregor wrote:

> On Mon, Sep 27, 2010 at 3:38 AM, Andreas Jonsson
> <[hidden email]>  wrote:
>    
>> Point me to one that has.
>>      
> Maybe I'm wrong.  I've never looked at them in depth.  I don't mean to
> be discouraging here.  If you can replace the MediaWiki parser with
> something sane, my hat is off to you.  But if you don't receive a very
> enthusiastic response from established developers, it's probably
> because we've had various people trying to replace MediaWiki's parser
> with a more conventional one since like 2003, and it's never produced
> anything usable in practice.  The prevailing sentiment is reflected
> pretty well in Tim's commit summary from shortly before giving you
> commit access:
>
> http://www.mediawiki.org/wiki/Special:Code/MediaWiki/71620
>
> Maybe we're just pessimistic, though.  I'd be happy to be proven wrong!
>
>    

I'm aware of the previous attempts, and I understand that people are
sceptical.  But now I'm claiming that I have made significant
progress, and the implementation I'm presenting is my proof.


>> Developing a fully featured integration would require a large amount
>> of work.  But I wouldn't call it hard.  I haven't analysed all of the
>> hooks, and it is possible that some difficulties will turn up when
>> implementing emulation of these.  But other than that, I cannot see
>> that the integration work would consist of anything but a long list of
>> relatively simple tasks.  If a project were to be launced to perform
>> the integration, I would feel confidident that it would reach its
>> goal.
>>      
> For practical purposes, something that requires a large amount of
> fiddly and boring work counts as hard too.  It can stall a project as
> easily as anything else.
>
>    
I agree.  But I don't believe that any of the previous attempts at writing
a parser have failed because of the integration.

/Andreas


Re: Parser implementation for MediaWiki syntax

Andreas Jonsson-5
In reply to this post by Antoine Musso-3
2010-09-27 20:05, Ashar Voultoiz wrote:

> On 27/09/10 19:42, Aryeh Gregor wrote:
> <snip>
>    
>> :: we've had various people trying to replace MediaWiki's parser
>> with a more conventional one since like 2003, and it's never produced
>> anything usable in practice.
>>      
> JeLuF wrote a token based parser back in 1.3. I am almost sure it has
> been used on live site. It eventually got disabled after a couple of
> days due to perf issues :
>
> http://thread.gmane.org/gmane.science.linguistics.wikipedia.technical/10496
>
> I have found some old code in REL1_3 :
> http://svn.wikimedia.org/viewvc/mediawiki/branches/REL1_3/phase3/includes/Tokenizer.php?revision=4179&view=markup
>    

295 lines of code?  I don't see how that could be anywhere near compatible.

/Andreas


Re: Parser implementation for MediaWiki syntax

Andreas Jonsson-5
In reply to this post by Chad
2010-09-27 20:58, Chad wrote:

> On Mon, Sep 27, 2010 at 1:42 PM, Aryeh Gregor
> <[hidden email]>  wrote:
>    
>> On Mon, Sep 27, 2010 at 3:38 AM, Andreas Jonsson
>> <[hidden email]>  wrote:
>>      
>>> Point me to one that has.
>>>        
>> Maybe I'm wrong.  I've never looked at them in depth.  I don't mean to
>> be discouraging here.  If you can replace the MediaWiki parser with
>> something sane, my hat is off to you.  But if you don't receive a very
>> enthusiastic response from established developers, it's probably
>> because we've had various people trying to replace MediaWiki's parser
>> with a more conventional one since like 2003, and it's never produced
>> anything usable in practice.  The prevailing sentiment is reflected
>> pretty well in Tim's commit summary from shortly before giving you
>> commit access:
>>
>> http://www.mediawiki.org/wiki/Special:Code/MediaWiki/71620
>>
>> Maybe we're just pessimistic, though.  I'd be happy to be proven wrong!
>>
>>      
> This. Tim sums up the consensus very well with that commit summary.
> He also made some comments on the history of wikitext and alternative
> parsers on foundation-l back in Jan '09[0]. Worth a read (starting mainly
> at ""Parser" is a convenient and short name for it").
>
> While a real parser is a nice pipe dream, in practice not a single project
> to "rewrite the parser" has succeeded in the years of people  trying. Like
> Aryeh says, if you can pull it off and make it practical, hats off to you.
>
> -Chad
>
> [0] http://article.gmane.org/gmane.org.wikimedia.foundation/35876/
>    
So, Tim raises three objections against a more formalized parser:

1. Formal grammars are too restricted for wikitext.

    My implementation represents a greater class of grammars than the
    class of context-free grammars.  I believe that this gives
    sufficient space for wikitext.

2. Previous parser implementations had performance issues.

    I have not rigorously tested the performance of my parser, but it
    is linear in the size of the input and seems to be comparable to
    the original parser on plain text.  With an increasing amount of
    markup, the original parser seems to degrade in performance, while
    my implementation maintains a fairly constant speed, regardless of
    input.  It is possible to construct malicious input that causes
    the performance of my parser to be offset by a constant factor
    (the same content scanned up to 13 times).  But this is not a
    situation that would occur on a normal page.

3. Some aspects of the existing parser follow well-known parser
    algorithms, but are better optimized.  In particular, the
    preprocessor.

    My parser implementation does not preprocess the content.  I
    acknowledge that preprocessing is better done by the current
    preprocessor.  One just needs to disentangle the independent
    preprocessing (parser functions, transclusion, magic words, etc.)
    from the parser-preparation preprocessing (e.g., replacing <nowiki>
    ... </nowiki> with a "magic" string); a sketch of that kind of
    marker replacement follows below.

    Regarding optimization, it doesn't matter that the current parser
    is "optimized" if my unoptimized implementation outperforms the
    existing optimized one.
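
A sketch of that kind of marker replacement (Python; the marker format
and the function names are mine, not MediaWiki's):

  import hashlib
  import re

  # Replace <nowiki> sections with unique "magic" markers before parsing
  # and put the original text back into the rendered output afterwards.
  def strip_nowiki(text):
      saved = {}
      def repl(match):
          body = match.group(1)
          key = "\x7fUNIQ-" + hashlib.md5(body.encode("utf-8")).hexdigest()
          saved[key] = body
          return key
      return re.sub(r"<nowiki>(.*?)</nowiki>", repl, text, flags=re.S), saved

  def restore_nowiki(rendered, saved):
      for key, body in saved.items():
          rendered = rendered.replace(key, body)
      return rendered

  prepared, saved = strip_nowiki("a <nowiki>''not italic''</nowiki> b")
  print(prepared)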


/Andreas


Re: Parser implementation for MediaWiki syntax

Andreas Jonsson-5
In reply to this post by Paul Houle
2010-09-27 22:46, Paul Houle wrote:

>    On 9/27/2010 2:58 PM, Chad wrote:
>    
>> This. Tim sums up the consensus very well with that commit summary.
>> He also made some comments on the history of wikitext and alternative
>> parsers on foundation-l back in Jan '09[0]. Worth a read (starting mainly
>> at ""Parser" is a convenient and short name for it").
>>
>> While a real parser is a nice pipe dream, in practice not a single project
>> to "rewrite the parser" has succeeded in the years of people  trying. Like
>> Aryeh says, if you can pull it off and make it practical, hats off to you.
>>
>>      
>       For my own IX work I've written a wikimedia markup parser in C#
> based on the Irony framework.  It fails to parse about 0.5% of pages in
> wikipedia

What do you mean by "fail"?  It assigns slightly incorrect semantics to a
construction?  It fails to accept the input?  It crashes?

>   and is oblivious to a lot of the stranger stuff [like the HTML
> intrusions] but it does a good job of eating infoboxes and making sense
> of internal and external links.  Now,  the strange stuff + the parse
> fails would probably be impossible to handle in a rational way...
>    

I disagree.  I believe that there is a rational way to handle all kinds
of input.

/Andreas



Re: Parser implementation for MediaWiki syntax

Paul Houle
  On 9/28/2010 3:53 AM, Andreas Jonsson wrote:
>>        For my own IX work I've written a wikimedia markup parser in C#
>> based on the Irony framework.  It fails to parse about 0.5% of pages in
>> wikipedia
> What do you mean with "fail".  It assigns slightly incorrect semantic to a
> construction? It fails to accept the input?  It crashes?
>

Fails to accept the input -- that is, the text doesn't match the grammar.

Now, the toolchain above the parser gets between 30% and 80% recall at
the moment doing the things it has to do, so making the grammar better
isn't the highest priority on my list.
