Emancipate the Parser

Emancipate the Parser

Gregory Szorc-2
Has it ever been considered to separate the MediaWiki markup parser from the
core MediaWiki project?  It seems to me that if the parser stood on its own,
it would help wiki adoption by allowing others to use the same syntax as the
most popular wikis in the world (Wikipedia).  As MediaWiki has pledged
support for Creole, it seems that eventually the wiki parsing in MediaWiki
will have to be converted to accommodate a common interface that works for
both MediaWiki and Creole, so why not use this opportunity to free the
core parser?

Gregory Szorc
[hidden email]
_______________________________________________
MediaWiki-l mailing list
[hidden email]
http://mail.wikipedia.org/mailman/listinfo/mediawiki-l

Re: Emancipate the Parser

naikrovek
I think this is an awesome idea.  Especially since I spent all day
writing a parser in Java.

I work at an organization where using PHP is not allowed, so this is
my only option.  I don't even know PHP well enough to port MediaWiki's
parser.

If anyone knows which PHP classes (is 'classes' an appropriate term
for PHP?) contain the parsing logic, I can port it, then open source
the parser.

If anyone is interested, please let me know.  I'm writing a parser for
the small amount of markup my organization would use, but I'd be quite
happy to port the MediaWiki parser to Java at home.

jeremiah();

On 9/28/06, Gregory Szorc <[hidden email]> wrote:

> Has it ever been considered to separate the MediaWiki markup parser from the
> core MediaWiki project?  It seems to me that if the parser stood on its own,
> it would help wiki adoption by allowing others to use the same syntax as the
> most popular wikis in the world (Wikipedia).  As MediaWiki has pledged
> support for Creole, it seems that eventually the wiki parsing in MediaWiki
> will have to be converted to accommodate a common interface that works for
> both MediaWiki and Creole, so why not use this opportunity to free the
> core parser?
>
> Gregory Szorc
> [hidden email]

Re: Emancipate the Parser

Magnus Manske-2
I've been among those writing parsers (many half-baked ones ;-) and
IMHO the only viable option, in the long run, is to make an abstract
grammar (whatever style) and have parser generators for many languages
implement it.

This would also be an improvement over the current parser, which renders

   '''bold [[link|bold ''' not bold ''italics]] still italics

as

<b>bold <a href="/mediawiki/index.php?title=Link">bold </b> not bold
<i>italics</a> still italics</i>

which is about as broken as it gets, and only HTML Tidy saves Wikipedia
from serving such embarrassments.
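
To make the nesting problem concrete, here is a minimal sketch in Python (not
MediaWiki code; `is_well_nested` is a hypothetical helper, and it assumes every
tag has an explicit close, so void elements like <br> are not handled). It
walks the tags with a stack and reports whether each closing tag matches the
most recently opened one:

```python
import re

# Matches an opening or closing tag and captures the slash and the tag name.
TAG_RE = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>")


def is_well_nested(html):
    """Return True if every closing tag matches the innermost open tag."""
    stack = []
    for closing, name in TAG_RE.findall(html):
        name = name.lower()
        if not closing:
            stack.append(name)
        elif not stack or stack.pop() != name:
            # A close arrived for a tag that is not the innermost open one.
            return False
    # Anything left on the stack was never closed.
    return not stack


# The output quoted above: </b> arrives while <a> is still the innermost tag.
broken = ('<b>bold <a href="/mediawiki/index.php?title=Link">bold </b> not bold '
          '<i>italics</a> still italics</i>')

print(is_well_nested(broken))
print(is_well_nested("<b>bold <i>both</i></b>"))
```

The first call fails at `</b>`, because `<a>` is still open at that point.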

Magnus



On 9/28/06, jeremiah johnson <[hidden email]> wrote:

> I think this is an awesome idea.  Especially since I spent all day
> writing a parser in Java.
>
> I work at an organization where using PHP is not allowed, so this is
> my only option.  I don't even know PHP well enough to port MediaWiki's
> parser.
>
> If anyone knows which PHP classes (is 'classes' an appropriate term
> for PHP?) contain the parsing logic, I can port it, then open source
> the parser.
>
> If anyone is interested, please let me know.  I'm writing a parser for
> the small amount of markup my organization would use, but I'd be quite
> happy to port the MediaWiki parser to Java at home.
>
> jeremiah();

Re: Emancipate the Parser

Kasimir Gabert
Hm... When I previewed this in MW 1.7.1, I got:

<p><b>bold </b><a href="/w/index.php?title=Link&amp;action=edit"
class="new" title="Link"><b>bold </b> not bold <i>italics</i></a><i>
still italics</i>

This is completely valid, and exactly what the user wants.

I am not sure what the problem is...

On 9/29/06, Magnus Manske <[hidden email]> wrote:

> I've been among those writing parsers (many half-baked ones ;-) and
> IMHO the only viable option, in the long run, is to make an abstract
> grammar (whatever style) and have parser generators for many languages
> implement it.
>
> This would also be an improvement over the current parser, which renders
>
>    '''bold [[link|bold ''' not bold ''italics]] still italics
>
> as
>
> <b>bold <a href="/mediawiki/index.php?title=Link">bold </b> not bold
> <i>italics</a> still italics</i>
>
> which is about as broken as it gets, and only HTML Tidy saves Wikipedia
> from serving such embarrassments.
>
> Magnus

--
Kasimir Gabert

Re: Emancipate the Parser

Gregory Szorc-2
On 9/29/06, Magnus Manske <[hidden email]> wrote:
>
> I've been among those writing parsers (many half-baked ones ;-) and
> IMHO the only viable option, in the long run, is to make an abstract
> grammar (whatever style) and have parser generators for many languages
> implement it.
>
>
I agree that down the road a formal grammar should be adopted, but first
things first:  separate the parser.  I would love to see the MediaWiki
parser become something like Radeox (http://radeox.org/space/start).  This
rendering engine is used by Confluence, XWiki, and others.  It is currently
only written in Java, but that is fine.  The MediaWiki parser would only
initially be available in PHP.  This is much better than it only being
available in MediaWiki.

Also, the parser could still be maintained by the MediaWiki team.  They
would not have to give up control of the parser or their vision for it.
The only change is that the parser could stand on its own, and its power and
popular syntax could be utilized by scores of other (PHP) wikis.

On another positive note, the decoupling of the parser would also bring a
great opportunity to fix any quirks with the current parser, including
rendering issues.

Greg

Re: Emancipate the Parser

Tim Starling-2
Gregory Szorc wrote:

> On 9/29/06, Magnus Manske <[hidden email]> wrote:
>> I've been among those writing parsers (many half-baked ones ;-) and
>> IMHO the only viable option, in the long run, is to make an abstract
>> grammar (whatever style) and have parser generators for many languages
>> implement it.
>>
>>
> I agree that down the road a formal grammar should be adopted, but first
> things first:  separate the parser.  I would love to see the MediaWiki
> parser become something like Radeox (http://radeox.org/space/start).  This
> rendering engine is used by Confluence, XWiki, and others.  It is currently
> only written in Java, but that is fine.  The MediaWiki parser would only
> initially be available in PHP.  This is much better than it only being
> available in MediaWiki.
>
> Also, the parser could still be maintained by the MediaWiki team.  They
> would not have to give up control of the parser or their vision for it.
> The only change is that the parser could stand on its own, and its power and
> popular syntax could be utilized by scores of other (PHP) wikis.
>
> On another positive note, the decoupling of the parser would also bring a
> great opportunity to fix any quirks with the current parser, including
> rendering issues.

A parser that implements a subset of the native MediaWiki parser's features
is entirely possible, and has been done several times before, but a complete
decoupling is rather more challenging. I imagine it would be much like the
separation between the Zend Engine and PHP. Features such as the following
rely on diverse parts of the MediaWiki framework and would have to be dealt
with by hooks or callbacks:

* link colouring
* interlanguage link recognition
* URL generation
* template text fetch
* image rendering
* double-underscore properties such as __NEWSECTIONLINK__
* core parser functions
* variables, e.g. {{NUMBEROFARTICLES}}
* language conversion
* extensions
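
As a rough illustration of the hooks-or-callbacks idea, here is a minimal
sketch in Python. It is not MediaWiki's API: `HostCallbacks`, `render`, and the
three callback names are hypothetical, and only two of the items above (link
colouring with URL generation, and variables) are covered. The parser knows the
syntax; everything that needs the wiki's database or configuration is asked of
the host:

```python
import re
from dataclasses import dataclass
from html import escape
from typing import Callable


@dataclass
class HostCallbacks:
    """Hypothetical hooks the embedding application must supply."""
    page_exists: Callable[[str], bool]   # drives link colouring
    url_for: Callable[[str], str]        # URL generation
    variable: Callable[[str], str]       # e.g. NUMBEROFARTICLES


# Either [[target]] / [[target|label]] or an upper-case {{VARIABLE}}.
TOKEN_RE = re.compile(
    r"\[\[(?P<target>[^\]|]+)(?:\|(?P<label>[^\]]+))?\]\]"
    r"|\{\{(?P<var>[A-Z]+)\}\}"
)


def render(wikitext, host):
    """Render a tiny subset of wikitext, delegating to the host callbacks."""
    out, pos = [], 0
    for m in TOKEN_RE.finditer(wikitext):
        # Plain text between tokens is HTML-escaped as-is.
        out.append(escape(wikitext[pos:m.start()], quote=False))
        if m.group("var"):
            out.append(escape(host.variable(m.group("var")), quote=False))
        else:
            target = m.group("target").strip()
            label = m.group("label") or target
            css = "" if host.page_exists(target) else ' class="new"'
            out.append('<a href="%s"%s>%s</a>' % (
                escape(host.url_for(target)), css, escape(label, quote=False)))
        pos = m.end()
    out.append(escape(wikitext[pos:], quote=False))
    return "".join(out)


# A stand-in host; a real one would query the wiki's database.
host = HostCallbacks(
    page_exists=lambda title: title == "Main Page",
    url_for=lambda title: "/wiki/" + title.replace(" ", "_"),
    variable=lambda name: {"NUMBEROFARTICLES": "42"}.get(name, ""),
)
print(render("See [[Main Page]] and [[Missing|a stub]]: {{NUMBEROFARTICLES}} articles.", host))
```

Even in this toy, three callbacks are needed for two features, which gives a
feel for how wide the interface would become for the full list.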

Some of these items could be deferred, by having the parser output an
intermediate representation which can then be converted to HTML by a
feature-rich output phase. But that doesn't exempt you from writing that
output phase, if you want the parser to be useful for anything at all. Few
people realise what a large proportion of the MediaWiki codebase is accessed
by the present parser module.
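
The two-phase split can be sketched as follows, again in Python and again with
hypothetical names (`Text`, `Link`, `parse`, `to_html`) and an assumed
`/wiki/` URL prefix. The parse phase needs nothing from the host and emits a
flat list of nodes; the output phase is where the host-specific knowledge
(here, whether a page exists) comes back in:

```python
import re
from dataclasses import dataclass
from html import escape


@dataclass
class Text:
    value: str


@dataclass
class Link:
    target: str
    label: str


LINK_RE = re.compile(r"\[\[([^\]|]+)(?:\|([^\]]+))?\]\]")


def parse(wikitext):
    """Phase 1: wikitext to an intermediate node list, with no host access."""
    nodes, pos = [], 0
    for m in LINK_RE.finditer(wikitext):
        if m.start() > pos:
            nodes.append(Text(wikitext[pos:m.start()]))
        target = m.group(1).strip()
        nodes.append(Link(target, m.group(2) or target))
        pos = m.end()
    if pos < len(wikitext):
        nodes.append(Text(wikitext[pos:]))
    return nodes


def to_html(nodes, page_exists):
    """Phase 2: node list to HTML; link colouring is decided only here."""
    parts = []
    for node in nodes:
        if isinstance(node, Link):
            css = "" if page_exists(node.target) else ' class="new"'
            parts.append('<a href="/wiki/%s"%s>%s</a>' % (
                escape(node.target), css, escape(node.label, quote=False)))
        else:
            parts.append(escape(node.value, quote=False))
    return "".join(parts)


nodes = parse("a [[B|b]] c")
print(nodes)
print(to_html(nodes, page_exists=lambda title: False))
```

The parse phase is portable, but the output phase still has to be written for
each host, which is the point made above.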

I'm in favour of a C/C++ module closely coupled with the existing PHP
framework, to speed up wikitext to HTML transformation. I can also see that
feature-reduced parsers may be occasionally useful, such as an embeddable
PHP parser along the lines of Gregory's original post. But for
fully-featured wikitext to HTML conversion, including access to
MediaWiki-specific features like those listed above, the parser has to be
coupled with MediaWiki itself.

It may be possible to decouple the parser, as Gregory suggests, and to add
MediaWiki-specific features back in with callbacks or post-processing.
However, it would be a lot of work, and any performance losses due to the
abstraction would have to be offset with gains elsewhere, if Wikimedia is
going to buy in. You might be better off just using an independent
feature-reduced parser like PEAR's Text_Wiki_Mediawiki.

-- Tim Starling
