GSoC project advice: port texvc to Python?

Damon Wang-2
Hello everyone,

I'm interested in porting texvc to Python, and I was hoping this list
here might help me hash out the plan. Please let me know if I should
take my questions elsewhere.

Roughly, my plan of attack would be something like this:

1. Collect test cases and write a testing script
Thanks to avar from #wikimedia, I already have the <math>...</math> bits
from enwiki and dewiki. I would also construct some simpler ones by hand
to test each of the acceptable LaTeX commands.

Would there be any possibility of logging the input seen by texvc on a
production instance of Mediawiki, so I could get some invalid input
submitted by actual users?

This could also be useful to future maintainers for regression testing.

2. Implement an AMS-TeX validator
I'll probably use PLY because it's rumored to have helpful debugging
features (designed for a first-year compilers class, apparently). ANTLR
is another popular option, but this guy
    http://www.bearcave.com/software/antlr/antlr_expr.html
thinks it's complicated and hard to debug. I've never used either, so if
anyone on this list knows of a good Python parsing package I'd welcome
suggestions.

3. Port over the existing tex->dvi->png rendering.
This is probably just a few calls into the subprocess module. Yeah, I
just jinxed it.

4. Add HTML rendering to texvc and test script
I don't even understand how the existing texvc decides whether HTML is
good enough. It looks like the original programmer just decreed that
certain LaTeX commands could be rendered to HTML, and defaults to PNG if
it sees anything not on that list. How important is this feature?

5. Repackage the entire Math thing as an extension
I might do this if I have time left at the end. I'm sure the project
will change over the summer.
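
A minimal sketch of step 1 above, assuming the test corpus arrives as raw wikitext strings (for example, pages from a dump). It pulls out every <math> body and deduplicates them. The names extract_math and unique_cases are illustrative only and are not part of texvc or MediaWiki.

```python
import re

# Non-greedy match of <math ...>...</math>; DOTALL lets formulas span lines.
MATH_RE = re.compile(r"<math\b[^>]*>(.*?)</math>", re.DOTALL | re.IGNORECASE)


def extract_math(wikitext):
    """Return the stripped, non-empty bodies of every <math> element."""
    return [body.strip() for body in MATH_RE.findall(wikitext) if body.strip()]


def unique_cases(pages):
    """Collect unique formulas across pages, keeping first-seen order."""
    seen = set()
    cases = []
    for text in pages:
        for formula in extract_math(text):
            if formula not in seen:
                seen.add(formula)
                cases.append(formula)
    return cases
```

Each unique formula could then be fed to both the existing texvc and the port, with the test script comparing accept/reject decisions.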

Python doesn't have parsing just locked right down the way C does with
flex/bison, but there are some good options, I have the most experience
with it, and I think I'd be able to complete the port faster in Python
than in either of the other languages. I was tempted at first to port to
PHP, to conform with the rest of Mediawiki, but there don't seem to be
any good parsing packages for PHP. (Please tell me if that's wrong.)

I'd appreciate any advice or criticism. Since my only previous
experience has been using Wikipedia and setting up a test Mediawiki
instance for my ACM chapter, I'm only just now learning my way around
the code base and it's not always evident why things were done as they
are. Does this look like a reasonable and worthwhile project?

Yours,
Damon Wang

P.S. Some of you may remember me on IRC a couple of days ago getting a
little panicky about not knowing OCaml, but I'm a bit more hopeful now
after looking around the source. I definitely have to keep the OCaml
manual open for reference, but I've written Scheme, Common Lisp, and
Haskell before, so I think I might be able to fake it.  These are just
Famous Last Words waiting to happen, I know.

_______________________________________________
Wikitech-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: GSoC project advice: port texvc to Python?

Conrad Irwin

On 03/23/2010 08:06 AM, Damon Wang wrote:

> Hello everyone,
>
> I'm interested in porting texvc to Python, and I was hoping this list
> here might help me hash out the plan. Please let me know if I should
> take my questions elsewhere.
>
> Roughly, my plan of attack would be something like this:
>
> 1. Collect test cases and write a testing script
> Thanks to avar from #wikimedia, I already have the <math>...</math> bits
> from enwiki and dewiki. I would also construct some simpler ones by hand
> to test each of the acceptable LaTeX commands.
>
It is not too challenging to create a test file that checks most
existing commands with some hacky regexes on the existing parser (I
can't find where mine has gone though); what is much harder is proving
that your script cannot ever let through invalid or potentially harmful
LaTeX (the moment anyone gets a \catcode past you, you're doomed, and
there are many other commands that are "not wanted" to say the least).
Obviously, the current implementation makes this reasonably pleasant to
verify: the syntax of the parser is exceedingly light, so any
reimplementation should strive to have as little syntactic overhead as
possible.

>
> 2. Implement an AMS-TeX validator

How different would this be from the current validator?

> 3. Port over the existing tex->dvi->png rendering.
> This is probably just a few calls into the subprocess module. Yeah, I
> just jinxed it.
> 4. Add HTML rendering to texvc and test script
> I don't even understand how the existing texvc decides whether HTML is
> good enough. It looks like the original programmer just decreed that
> certain LaTeX commands could be rendered to HTML, and defaults to PNG if
> it sees anything not on that list. How important is this feature?

I am not too fussed about the HTML output, though I can't speak for
everyone, at the moment it seems that many more of the Unicode
characters should be let through (at least at some level of HTML),
though I don't know enough about worldwide unicode support. Some things,
like \sqrt for example, are pretty hard to render nicely in HTML, so
images are still sensible for some expressions.

>
> 5. Repackage the entire Math thing as an extension
> I might do this if I have time left at the end. I'm sure the project
> will change over the summer.

This would be very amazing.

> Python doesn't have parsing just locked right down the way C does with
> flex/bison, but there are some good options, I have the most experience
> with it, and I think I'd be able to complete the port faster in Python
> than in either of the other languages. I was tempted at first to port to
> PHP, to conform with the rest of Mediawiki, but there don't seem to be
> any good parsing packages for PHP. (Please tell me if that's wrong.)

A good PHP parser library would be exceptionally useful for MediaWiki
(and many extensions), at the moment we have loads of methods that do
regex "parsing", so if you felt like writing one... :D.

>
> I'd appreciate any advice or criticism. Since my only previous
> experience has been using Wikipedia and setting up a test Mediawiki
> instance for my ACM chapter, I'm only just now learning my way around
> the code base and it's not always evident why things were done as they
> are. Does this look like a reasonable and worthwhile project?
>

Step 5. has been a "we really should do this" for a while, the shipping
of OCaml code which many users won't be able to use is very messy. I am
less convinced of the utility of a Python port, OCaml is a great
language for implementing this, and I fear a lot of your time would be
wasted trying to make the Python similarly nice. As you note, MediaWiki
is not written in Python, doing this in PHP would be a larger step in
the right direction, though without such nice frameworks, maybe less
nice to do.

Instead of rewriting the <math> parser, it might be more productive to
create parsers for some of the other languages that extensions use,
hopefully with a view to adding additional extensions to Wikipedia. The
ones I can think of immediately are <chem> tags (bug 3252/5856),
<gnuplot>, <lilypond>/<ABC> (bug 189!), <graphviz> (bug 2403).

Yours
Conrad

(PS. I'm no-one official, so can be ignored safely)


Re: GSoC project advice: port texvc to Python?

Roan Kattouw-2
2010/3/23 Conrad Irwin <[hidden email]>:
> Instead of rewriting the <math> parser, it might be more productive to
> create parsers for some of the other languages that extensions use,
> hopefully with a view to adding additional extensions to Wikipedia. The
> ones I can think of immediately are <chem> tags (bug 3252/5856),
> <gnuplot>, <lilypond>/<ABC> (bug 189!), <graphviz> (bug 2403).
>
Note that there's already an ABC extension, as linked on bug 189,
which AFAIK is pretty much ready for WMF deployment already. As
mentioned on the same bug, shelling out to Lilypond has certain issues
with unbounded time/CPU/memory usage. I'm not familiar with any of the
other programs mentioned, so I can't comment on those.

Roan Kattouw (Catrope)


Re: GSoC project advice: port texvc to Python?

Damon Wang-2
In reply to this post by Conrad Irwin
Hello Conrad,

>> 2. Implement an AMS-TeX validator
>
> How different would this be from the current validator?

It should be exactly the same, except written in Python.

>> 5. Repackage the entire Math thing as an extension
>> I might do this if I have time left at the end. I'm sure the project
>> will change over the summer.
>
> This would be very amazing.

Maybe this should be my project, then.

>> Python doesn't have parsing just locked right down the way C does with
>> flex/bison, but there are some good options, I have the most experience
>> with it, and I think I'd be able to complete the port faster in Python
>> than in either of the other languages. I was tempted at first to port to
>> PHP, to conform with the rest of Mediawiki, but there don't seem to be
>> any good parsing packages for PHP. (Please tell me if that's wrong.)
>
> A good PHP parser library would be exceptionally useful for MediaWiki
> (and many extensions), at the moment we have loads of methods that do
> regex "parsing", so if you felt like writing one... :D.

Actually...

I've never used PHP for real programming, but how difficult would it be
to write a really simple, stupid first pass at a DFA parser? I suspect
I'd need much more than three months to make it useful, but would it be
possible to implement some coherent subset of the features? E.g.,
building the LR0 automaton, at least?

>> I'd appreciate any advice or criticism. Since my only previous
>> experience has been using Wikipedia and setting up a test Mediawiki
>> instance for my ACM chapter, I'm only just now learning my way around
>> the code base and it's not always evident why things were done as they
>> are. Does this look like a reasonable and worthwhile project?
>>
>
> Step 5. has been a "we really should do this" for a while, the shipping
> of OCaml code which many users won't be able to use is very messy. I am
> less convinced of the utility of a Python port, OCaml is a great
> language for implementing this, and I fear a lot of your time would be
> wasted trying to make the Python similarly nice. As you note, MediaWiki
> is not written in Python, doing this in PHP would be a larger step in
> the right direction, though without such nice frameworks, maybe less
> nice to do.

I suggested a Python port because
    http://www.mediawiki.org/wiki/Summer_of_Code_2010#MediaWiki_core
lists it as a potential project idea. I was under the impression that
people around here did not want to leave texvc in OCaml. Is this wrong?

Yours,
Damon Wang


Re: GSoC project advice: port texvc to Python?

Aryeh Gregor
In reply to this post by Damon Wang-2
On Tue, Mar 23, 2010 at 4:06 AM, Damon Wang <[hidden email]> wrote:
> I'm interested in porting texvc to Python, and I was hoping this list
> here might help me hash out the plan. Please let me know if I should
> take my questions elsewhere.

Python is much better than OCaml, and I prefer Python to PHP, but a
PHP implementation would be preferable for core IMO.  Not all
MediaWiki developers know Python, but all obviously know PHP.  If you
did a Python implementation, though, then at least someone could
translate it to PHP pretty easily.

> 1. Collect test cases and write a testing script
> Thanks to avar from #wikimedia, I already have the <math>...</math> bits
> from enwiki and dewiki. I would also construct some simpler ones by hand
> to test each of the acceptable LaTeX commands.
>
> Would there be any possibility of logging the input seen by texvc on a
> production instance of Mediawiki, so I could get some invalid input
> submitted by actual users?
>
> This could also be useful to future maintainers for regression testing.

If you have a Unix box handy, it's pretty easy to install MediaWiki
with math support so you can test yourself.  sudo apt-get install
mediawiki mediawiki-math should do it on anything Debian-based, for
example.

> 2. Implement an AMS-TeX validator
> I'll probably use PLY because it's rumored to have helpful debugging
> features (designed for a first-year compilers class, apparently). ANTLR
> is another popular option, but this guy
>    http://www.bearcave.com/software/antlr/antlr_expr.html
> thinks it's complicated and hard to debug. I've never used either, so if
> anyone on this list knows of a good Python parsing package I'd welcome
> suggestions.

If it's in PHP, you'd probably have to write a parser yourself, but
LaTeX is pretty easy to parse, I'd think.

> 4. Add HTML rendering to texvc and test script
> I don't even understand how the existing texvc decides whether HTML is
> good enough. It looks like the original programmer just decreed that
> certain LaTeX commands could be rendered to HTML, and defaults to PNG if
> it sees anything not on that list. How important is this feature?

Fairly important, IMO, if the goal is to replace texvc, although not
critical.  <math>x</math> shouldn't render x as a PNG -- that's silly.
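
One way to picture the HTML-or-PNG decision described in the thread is a sketch like the following: use HTML only when every token has an obvious text equivalent, and fall back to PNG otherwise. The HTML_SAFE table and the render_mode name are assumptions made for illustration; this is not how texvc's actual rules are written.

```python
import re

# Hypothetical allow list: commands with an obvious Unicode equivalent.
HTML_SAFE = {r"\alpha": "α", r"\beta": "β", r"\times": "×", r"\le": "≤"}

# A control word, a backslash plus one character, a whitespace run, or one char.
TOKEN_RE = re.compile(r"\\[A-Za-z]+|\\.|\s+|.", re.DOTALL)


def render_mode(formula):
    """Return 'html' if every token is trivially renderable, else 'png'."""
    for tok in TOKEN_RE.findall(formula):
        if tok.isspace():
            continue
        if tok.startswith("\\"):
            if tok not in HTML_SAFE:
                return "png"  # unknown command: be conservative
        elif not (tok.isalnum() or tok in "+-=(),."):
            return "png"  # braces, ^, _ and so on need real layout
    return "html"
```

Under these assumptions a bare variable stays text, while anything involving \sqrt, superscripts, or grouping goes to an image.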

> Python doesn't have parsing just locked right down the way C does with
> flex/bison, but there are some good options, I have the most experience
> with it, and I think I'd be able to complete the port faster in Python
> than in either of the other languages. I was tempted at first to port to
> PHP, to conform with the rest of Mediawiki, but there don't seem to be
> any good parsing packages for PHP. (Please tell me if that's wrong.)

Would it really be very hard to write a LaTeX parser in PHP?  I'd
think it could be done easily, if you permit only a carefully-selected
subset.  I don't think you'd need any parser theory, just use
preg_split() and loop through all the tokens.
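
The split-and-loop approach can be sketched in Python, with re.findall standing in for preg_split(). The allow list below is a made-up handful of commands, not texvc's real whitelist, and tokenize and rejected_commands are hypothetical names.

```python
import re

# Hypothetical allow list; a real one would be far longer.
ALLOWED_COMMANDS = {r"\frac", r"\sqrt", r"\alpha", r"\sum", r"\left", r"\right"}

# A control word, a control symbol (backslash + non-letter), or one other char.
TOKEN_RE = re.compile(r"\\[A-Za-z]+|\\[^A-Za-z]|[^\\]", re.DOTALL)


def tokenize(formula):
    """Split a formula into tokens, refusing input the pattern cannot cover."""
    tokens = TOKEN_RE.findall(formula)
    # findall silently skips unmatched characters (a trailing lone backslash),
    # so check that the tokens reassemble into the original input.
    if "".join(tokens) != formula:
        raise ValueError("dangling backslash")
    return tokens


def rejected_commands(formula):
    """Return every backslash command that is not on the allow list."""
    return [
        tok
        for tok in tokenize(formula)
        if len(tok) > 1 and tok[0] == "\\" and tok not in ALLOWED_COMMANDS
    ]
```

Because the check is deny-by-default, a command such as \catcode is rejected simply for being absent from the list. Nesting of braces and environments is not checked here.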

> I'd appreciate any advice or criticism. Since my only previous
> experience has been using Wikipedia and setting up a test Mediawiki
> instance for my ACM chapter, I'm only just now learning my way around
> the code base and it's not always evident why things were done as they
> are. Does this look like a reasonable and worthwhile project?

Rewriting texvc in PHP would be a nice project to have, which is small
enough in scope that I'm optimistic that it could be done in a summer.
I'd say it's a good choice.

On Tue, Mar 23, 2010 at 6:23 AM, Conrad Irwin
<[hidden email]> wrote:
> I am not too fussed about the HTML output, though I can't speak for
> everyone, at the moment it seems that many more of the Unicode
> characters should be let through (at least at some level of HTML),
> though I don't know enough about worldwide unicode support.

I suspect we need to be about as conservative as we currently are for
platforms like IE6 on XP.  We should be able to expand the range of
HTML characters in the future, though.

> A good PHP parser library would be exceptionally useful for MediaWiki
> (and many extensions), at the moment we have loads of methods that do
> regex "parsing", so if you felt like writing one... :D.

Wouldn't a real generic parser implementation written in PHP be too
slow to be useful?  preg_replace() has the advantage of being
implemented in C.

> I am
> less convinced of the utility of a Python port, OCaml is a great
> language for implementing this, and I fear a lot of your time would be
> wasted trying to make the Python similarly nice. As you note, MediaWiki
> is not written in Python, doing this in PHP would be a larger step in
> the right direction, though without such nice frameworks, maybe less
> nice to do.

OCaml might be a great language for implementing this, but very few of
us understand it.  texvc has been totally unmaintained for years,
other than new things being added to the whitelist sometimes by means
of cargo-culting what previous commits do.  Rewriting texvc in
*anything* that more people understand would be a step forward.

On Tue, Mar 23, 2010 at 8:31 AM, Roan Kattouw <[hidden email]> wrote:
> As
> mentioned on the same bug, shelling out to Lilypond has certain issues
> with unbounded time/CPU/memory usage.

The same is true for LaTeX.  Lilypond would just need a parser and
filter to whitelist safe constructs, like LaTeX does.
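
As a rough illustration of bounding the shell-out, here is a sketch that builds the two commands of a tex -> dvi -> png pipeline and runs a command under a wall-clock limit. It assumes the latex and dvipng binaries and that the .tex file lives in workdir; the existing texvc may drive different tools. render_commands and run_bounded are hypothetical helper names.

```python
import os
import subprocess
import sys


def render_commands(tex_path, workdir, dpi=120):
    """Build (but do not run) the latex and dvipng command lines."""
    base = os.path.splitext(os.path.basename(tex_path))[0]
    dvi = os.path.join(workdir, base + ".dvi")
    png = os.path.join(workdir, base + ".png")
    latex = ["latex", "-interaction=nonstopmode",
             "-output-directory=" + workdir, tex_path]
    dvipng = ["dvipng", "-T", "tight", "-D", str(dpi), "-o", png, dvi]
    return latex, dvipng


def run_bounded(cmd, timeout_s):
    """Run cmd; return its exit code, or None if it exceeded timeout_s."""
    try:
        done = subprocess.run(
            cmd,
            stdin=subprocess.DEVNULL,   # never let TeX wait for input
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,          # child is killed when this expires
        )
    except subprocess.TimeoutExpired:
        return None
    return done.returncode
```

The timeout only bounds wall-clock time. CPU and memory would need OS-level limits in addition (for example resource.setrlimit on Unix), and the input filter remains the main defence.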

On Tue, Mar 23, 2010 at 12:25 PM, Damon Wang <[hidden email]> wrote:
> I've never used PHP for real programming, but how difficult would it be
> to write a really simple, stupid first pass at a DFA parser? I suspect
> I'd need much more than three months to make it useful, but would it be
> possible to implement some coherent subset of the features? E.g.,
> building the LR0 automaton, at least?

I don't think you'd need a "real" parser here.  Mostly we just use
preg_split() for this sort of thing.  I'm not familiar with formal
grammars and such, so I can't say what the concrete disadvantages of
that approach are.

> I suggested a Python port because
>    http://www.mediawiki.org/wiki/Summer_of_Code_2010#MediaWiki_core
> lists it as a potential project idea. I was under the impression that
> people around here did not want to leave texvc in OCaml. Is this wrong?

No, it's right.  Conrad is crazy.  :P


Re: GSoC project advice: port texvc to Python?

Roan Kattouw-2
2010/3/23 Aryeh Gregor <[hidden email]>:

>> I've never used PHP for real programming, but how difficult would it be
>> to write a really simple, stupid first pass at a DFA parser? I suspect
>> I'd need much more than three months to make it useful, but would it be
>> possible to implement some coherent subset of the features? E.g.,
>> building the LR0 automaton, at least?
>
> I don't think you'd need a "real" parser here.  Mostly we just use
> preg_split() for this sort of thing.  I'm not familiar with formal
> grammars and such, so I can't say what the concrete disadvantages of
> that approach are.
>
DFAs parse regular languages, which means those languages can also be
expressed as regexes. In fact, the regexes accepted by the preg_*()
functions allow certain extensions to the language theory definition
of regular expressions, allowing them to describe certain non-regular
languages as well. In short: preg_split() can do everything a DFA can
do, and more. The only reason to use a DFA parser would be
performance, but since the preg_*() functions are so heavily optimized
I don't think that'll be an issue.

>> I suggested a Python port because
>>    http://www.mediawiki.org/wiki/Summer_of_Code_2010#MediaWiki_core
>> lists it as a potential project idea. I was under the impression that
>> people around here did not want to leave texvc in OCaml. Is this wrong?
>
> No, it's right.  Conrad is crazy.  :P
>
Having it in a language no one understands is a bad thing and leads to
maintenance not happening, so yeah, we definitely want it rewritten in
PHP. If the PHP implementation turns out to be too slow to run on WMF,
for instance, we could do a C++ port à la wikidiff2 (a C++ port of our
ludicrously slow PHP diff implementation).

Roan Kattouw (Catrope)


Re: GSoC project advice: port texvc to Python?

Conrad Irwin

On 03/23/2010 05:00 PM, Roan Kattouw wrote:

>>> I suggested a Python port because
>>>    http://www.mediawiki.org/wiki/Summer_of_Code_2010#MediaWiki_core
>>> lists it as a potential project idea. I was under the impression that
>>> people around here did not want to leave texvc in OCaml. Is this wrong?
>>
>> No, it's right.  Conrad is crazy.  :P
>>
> Having it in a language no one understands is a bad thing and leads to
> maintenance not happening, so yeah, we definitely want it rewritten in
> PHP. If the PHP implementation turns out to be too slow to run on WMF,
> for instance, we could do a C++ port à la wikidiff2 (a C++ port of our
> ludicrously slow PHP diff implementation).
>

And here was me thinking that maintenance didn't happen because making
changes to security critical sections of the code is dangerous :). The
current implementation is just over a thousand lines of exceedingly
concise code, while I agree that a re-implementation in PHP is probably
sensible, I'll stubbornly maintain that the existing OCaml is more
suited to the task. (Oh, and it seems I misread that proposal; I could
not imagine a language other than LaTeX being useful for doing maths :p).

While re-implementing the syntax whitelister would not be too hard,
LaTeX, with its wonderfully re-definable syntax, is incredibly
dangerous. Have fun, and be careful!

Conrad

http://tug.ctan.org/cgi-bin/ctanPackageInformation.py?id=xii


Re: GSoC project advice: port texvc to Python?

Aryeh Gregor
In reply to this post by Roan Kattouw-2
On Tue, Mar 23, 2010 at 1:00 PM, Roan Kattouw <[hidden email]> wrote:
> DFAs parse regular languages, which means those languages can also be
> expressed as regexes. In fact, the regexes accepted by the preg_*()
> functions allow certain extensions to the language theory definition
> of regular expressions, allowing them to describe certain non-regular
> languages as well. In short: preg_split() can do everything a DFA can
> do, and more. The only reason to use a DFA parser would be
> performance, but since the preg_*() functions are so heavily optimized
> I don't think that'll be an issue.

This much I know, but is LaTeX actually a regular language?

On Tue, Mar 23, 2010 at 1:13 PM, Conrad Irwin
<[hidden email]> wrote:
> And here was me thinking that maintenance didn't happen because making
> changes to security critical sections of the code is dangerous :).

It's not security-critical.  The worst you could possibly do is DoS,
and any DoS could be instantly shut off by just turning off math
briefly.  Furthermore, the part that makes DoS impossible is a quite
small portion of the code that would need to change effectively never.
No, the problem is that most PHP programmers have never even heard of
OCaml, let alone used it.


Re: GSoC project advice: port texvc to Python?

Roan Kattouw-2
2010/3/23 Aryeh Gregor <[hidden email]>:
> This much I know, but is LaTeX actually a regular language?
>
I don't know; I was just making the point that writing a DFA parser in
PHP is probably not very useful.

Roan Kattouw (Catrope)


Re: GSoC project advice: port texvc to Python?

Damon Wang-2
2010/3/23 Roan Kattouw <[hidden email]>:
> 2010/3/23 Aryeh Gregor <[hidden email]>:
>> This much I know, but is LaTeX actually a regular language?
>>
> I don't know; I was just making the point that writing a DFA parser in
> PHP is probably not very useful.

Sorry, I got confused and wrote DFA when I should have written LALR.
DFAs cannot parse even the allowed subset of AMS-LaTeX, because there
are some permitted environments.

Without claiming to know much formal language theory, a rule of thumb is
that languages with arbitrarily nested matched delimiters are never
regular, because of the pumping lemma:
    http://en.wikipedia.org/wiki/Pumping_lemma_for_regular_languages

So, for example, it's theoretically impossible to check that parentheses
nest correctly using regular expressions (in the formal-language sense),
and similarly it'd be impossible to check that the \begin and \end
commands match up.

In practice there might be ways to hack around that by using multiple
regular expressions and manually tracking how they nest, but at that
point we're basically writing half of a bad LALR parser.
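
The "regular expressions plus manual tracking" workaround can be sketched as follows: a regex only locates the \begin{...} and \end{...} markers, and an explicit stack does the nesting check that a regular expression alone cannot. The function name environments_balanced is illustrative.

```python
import re

# Finds \begin{name} or \end{name}; captures the kind and the environment name.
ENV_RE = re.compile(r"\\(begin|end)\{([A-Za-z*]+)\}")


def environments_balanced(formula):
    """Return True if every \\begin{env} is closed by a matching \\end{env}."""
    stack = []
    for kind, name in ENV_RE.findall(formula):
        if kind == "begin":
            stack.append(name)
        elif not stack or stack.pop() != name:
            # An \end with nothing open, or closing the wrong environment.
            return False
    return not stack  # anything left on the stack was never closed
```

The stack is exactly the piece of state a DFA lacks. A generated LALR parser would maintain it automatically and handle brace groups the same way.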

Fortunately, though, Python has parser generators! And if we're really
concerned about speed, there's PyBison, which does the parsing in C and
apparently produces (at least) five-fold improvements over Python-native
alternatives.

Yours,
Damon Wang


Re: GSoC project advice: port texvc to Python?

Rob Lanphier
In reply to this post by Damon Wang-2
Hi Damon,

Thank you so much for floating your GSoC ideas early here on the mailing
list!  Putting out concrete examples we can weigh in on is really helpful,
and engaging in this way is a fantastic way of demonstrating how you'll be
able to engage with us if we select your project.


On Tue, Mar 23, 2010 at 1:06 AM, Damon Wang <[hidden email]> wrote:

> I'm interested in porting texvc to Python, and I was hoping this list
> here might help me hash out the plan.



As I'm sure you've already gathered from the other responses, this is
exactly the right place.  I'm a little skeptical myself that porting that
particular piece of code from OCaml to Python is going to be a really big
win for us (because it's still a "foreign" language as far as PHP-based
MediaWiki is concerned, so integration is still a little clunky and
performance may take a hit due to yet another interpreter needing to load),
but I'll let others weigh in on whether I'm making too big a deal about
that.

Stepping back from the specifics of your proposal (which I think the others
on this list have responded to pretty well), I'd like to find out more about
what general sorts of projects interest you the most, which may help us
figure out if we should keep going in this direction.  Some questions:
1.  Are you most interested in having a Python-based project, or would you
be *equally* happy and productive programming something in PHP?
2.  Are you zeroing in on <math> parsing and parsing in general because
that's an area that you're already developing expertise in and/or are deeply
interested in getting into, or is that just something that looked kinda
interesting to learn about relative to other opportunities you considered?
3.  Are you coming at this as someone who is already deep into
Wikipedia/MediaWiki usage who is looking to resolve particular things (like
<math> parsing) that are painful as an end user, or are you more casually
involved and more interested in applying in this project because it looks
like we've got a lot of interesting programming problems to solve?

Just to be really clear, I'm not looking for a "right" answer on any of
those questions.  It's not necessary for you to be even interested in
getting deeply involved in the Wikipedia user community to have a really
successful project.  The purpose of this line of questions is to figure out
if we should continue helping you refine your current idea, or suggest some
other direction that's a bigger payoff and/or easier sell.

Rob

Re: GSoC project advice: port texvc to Python?

Damon Wang-2
Hello Rob,

> Just to be really clear, I'm not looking for a "right" answer on any of
> those questions.  It's not necessary for you to be even interested in
> getting deeply involved in the Wikipedia user community to have a really
> successful project.  The purpose of this line of questions is to figure out
> if we should continue helping you refine your current idea, or suggest some
> other direction that's a bigger payoff and/or easier sell.

I understand, and that'd be very helpful. To be honest, I'm not
passionately committed to any project at all. I've been writing projects
for university and for a computer lab I work at, but it's mostly small,
one-off sysadmin things and usually the emphasis is more on "xyz server
has to be back up before we open tomorrow" than writing good, clean code.
So, yes, I'd welcome other suggestions.

> As I'm sure you've already gathered from the other responses, this is
> exactly the right place.  I'm a little skeptical myself that porting that
> particular piece of code from OCaml to Python is going to be a really big
> win for us (because it's still a "foreign" language as far as PHP-based
> MediaWiki is concerned, so integration is still a little clunky and
> performance may take a hit due to yet another interpreter needing to load),
> but I'll let others weigh in on whether I'm making too big a deal about
> that.

There are ways to make this run faster if performance is a concern. For
example, mod_python or mod_wsgi, or explicitly pulling the Python out
into a standalone daemon that listens for requests from the webserver.

Another possibility would be writing it in C to avoid all interpreter
overhead, and using a foreign function interface. Unfortunately, I'm not
familiar with PHP's FFI. Google takes me to
    http://wiki.php.net/rfc/php_native_interface
which seems to think that as of a year ago there weren't any good ones,
but this doesn't look too painful:
    http://theserverpages.com/php/manual/en/zend.creating.php

> Stepping back from the specifics of your proposal (which I think the others
> on this list have responded to pretty well), I'd like to find out more about
> what general sorts of projects interest you the most, which may help us
> figure out if we should keep going in this direction.  Some questions:
> 1.  Are you most interested in having a Python-based project, or would you
> be *equally* happy and productive programming something in PHP?

I'm most familiar with Python and C, for whatever that's worth coming
from an undergrad who didn't know Python existed five years ago. I
learned PHP to maintain the web interfaces of an in-house print system
at work, but I haven't used it for anything as involved as what we're
discussing here. So, in terms of productivity, yes, if I have to work in
PHP my mentor will probably get asked a few more newbie questions.

In terms of happiness, though, it'd be a great opportunity to dig into
PHP and finally learn to use it as more than really smart CSS with a
database connection. Although I prefer Python or even C because I think
I'd be more useful, I wouldn't be very upset at all if it turned out you
guys were willing to let me learn PHP on your time.

> 2.  Are you zeroing in on <math> parsing and parsing in general because
> that's an area that you're already developing expertise in and/or are deeply
> interested in getting into, or is that just something that looked kinda
> interesting to learn about relative to other opportunities you considered?

I like the <math> parsing project because it seems well-suited for a
third-year undergrad who knows LaTeX and reads a few other functional
languages and has studied lex/yacc before in his coursework. The goals
are clear, and I know how to break them down into smaller problems and
how to tackle each one. It's a little isolated from the rest of
Mediawiki, so I don't need to grok the entire code base.

Basically, this looks like a way to make a concrete contribution despite
being a newcomer to the project. That doesn't mean I'm not happy to
entertain alternatives, just that they have a pretty high bar to clear.

> 3.  Are you coming at this as someone who is already deep into
> Wikipedia/MediaWiki usage who is looking to resolve particular things (like
> <math> parsing) that are painful as an end user, or are you more casually
> involved and more interested in applying in this project because it looks
> like we've got a lot of interesting programming problems to solve?

The second. I just want to tackle a problem that's near but not quite
beyond my limits, and if I can help out a site I use daily, so much the
better.

Yours,
Damon Wang

_______________________________________________
Wikitech-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: GSoC project advice: port texvc to Python?

Conrad Irwin
In reply to this post by Aryeh Gregor

On 03/23/2010 05:23 PM, Aryeh Gregor wrote:

> On Tue, Mar 23, 2010 at 1:00 PM, Roan Kattouw <[hidden email]> wrote:
>> DFAs parse regular languages, which means those languages can also be
>> expressed as regexes. In fact, the regexes accepted by the preg_*()
>> functions allow certain extensions to the language theory definition
>> of regular expressions, allowing them to describe certain non-regular
>> languages as well. In short: preg_split() can do everything a DFA can
>> do, and more. The only reason to use a DFA parser would be
>> performance, but since the preg_*() functions are so heavily optimized
>> I don't think that'll be an issue.
>
> This much I know, but is LaTeX actually a regular language?

It's not even context-free; luckily, the subset we are interested in is
(as clearly shown by the texvc parser :p).
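
To make the regular versus context-free distinction concrete, here is a small sketch in Python. A regular expression is enough to split the input into tokens, but checking that braces nest correctly needs a counter (or a stack), which a classical regular expression cannot express. The token pattern and the function name are illustrative assumptions, not texvc's actual lexer.

```python
import re

# Tokens: a named command (\frac), an escaped character (\{), a brace,
# or any other single character. Splitting into tokens is a regular task.
TOKEN = re.compile(r"\\[A-Za-z]+|\\.|[{}]|[^\\{}]")


def braces_balanced(tex: str) -> bool:
    """Return True if every '{' has a matching '}' in the right order."""
    depth = 0
    for tok in TOKEN.findall(tex):
        if tok == "{":
            depth += 1
        elif tok == "}":
            depth -= 1
            if depth < 0:  # a '}' appeared before its '{'
                return False
    return depth == 0
```

Escaped braces such as `\{` come out of the tokenizer as a single token, so they do not affect the depth count.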

>
> On Tue, Mar 23, 2010 at 1:13 PM, Conrad Irwin
> <[hidden email]> wrote:
>> And here was me thinking that maintenance didn't happen because making
>> changes to security critical sections of the code is dangerous :).
>
> It's not security-critical.  The worst you could possibly do is DoS,
> and any DoS could be instantly shut off by just turning off math
> briefly.  Furthermore, the part that makes DoS impossible is a quite
> small portion of the code that would need to change effectively never.
>  No, the problem is that most PHP programmers have never even heard of
> OCaml, let alone used it.

Many LaTeX installations can read, write, and execute anything by
default. LaTeX also allows you to redefine the meaning of characters in
the input, so if you accidentally let a single such command through, all
the whitelisting becomes pointless. It certainly is a security issue.
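
A rough sketch of what command whitelisting means in practice, written in Python. The allowed set below is made up for illustration and is not texvc's real list, and this is not a complete sanitizer; it only shows the shape of the check: anything not explicitly allowed is rejected.

```python
import re

# Illustrative whitelist only; the real texvc list is much longer.
ALLOWED_COMMANDS = frozenset({"frac", "sqrt", "alpha", "beta", "sum", "int"})

# A command is a backslash followed by letters, or by one other character.
COMMAND = re.compile(r"\\([A-Za-z]+|.)", re.DOTALL)


def rejected_commands(tex: str) -> list:
    """Return the sorted names of all commands that are not whitelisted."""
    found = {match.group(1) for match in COMMAND.finditer(tex)}
    return sorted(found - ALLOWED_COMMANDS)
```

An empty result means every command in the input was on the list. Commands such as `\input` or `\catcode` are reported because they are absent from the list, not because the code knows they are dangerous, which is the property that makes a whitelist safer than a blacklist.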

Conrad


Re: GSoC project advice: port texvc to Python?

Platonides
In reply to this post by Damon Wang-2
Python is a nice language. PHP (portability) or C/C++ (speed) would be
better but Python is preferable to OCaml.

You mention ANTLR. Something like that could be a good choice because it
should let you generate the same parser in a different language without
much effort (you probably won't have enough time in GSoC for that, but a
design taking that option into account would be interesting).

So you could do (please don't take this as a list of requirements):
*Figure out what the current texvc is doing.
*Document it heavily.
*Design the next texvc.
*Build whatever parser it needs.
*Actual implementation.

You seem to be thinking about creating a PHP extension. I don't think
you should go that route. A binary is good enough, we don't need it to
be in a PHP extension. That glue could be added later if needed, but
would increase the complexity to write and debug.
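
For comparison, the "plain binary" route looks roughly like this from the calling side. The sketch is in Python purely for illustration (MediaWiki would do the equivalent from PHP), and the stand-in command is hypothetical: it just upper-cases its input so the example runs without any real texvc binary or knowledge of its command-line arguments.

```python
import subprocess
import sys


def run_validator(command: list, tex: str, timeout: float = 5.0) -> tuple:
    """Feed TeX to an external program on stdin; return (exit code, stdout)."""
    result = subprocess.run(
        command,
        input=tex,
        capture_output=True,
        text=True,
        timeout=timeout,  # a hung child must not stall the page render
        check=False,      # a non-zero exit is a normal "invalid input" answer
    )
    return result.returncode, result.stdout


# Hypothetical stand-in for the real binary: echoes stdin in upper case.
STAND_IN = [
    sys.executable,
    "-c",
    "import sys; sys.stdout.write(sys.stdin.read().upper())",
]
```

Calling `run_validator(STAND_IN, "x^2")` exercises the whole round trip. The glue is a few lines and needs no knowledge of Zend internals, which is the argument for starting with a binary.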



Re: GSoC project advice: port texvc to Python?

Tim Starling-2
In reply to this post by Conrad Irwin
Conrad Irwin wrote:

> On 03/23/2010 05:23 PM, Aryeh Gregor wrote:
>> On Tue, Mar 23, 2010 at 1:00 PM, Roan Kattouw <[hidden email]> wrote:
>>> DFAs parse regular languages, which means those languages can also be
>>> expressed as regexes. In fact, the regexes accepted by the preg_*()
>>> functions allow certain extensions to the language theory definition
>>> of regular expressions, allowing them to describe certain non-regular
>>> languages as well. In short: preg_split() can do everything a DFA can
>>> do, and more. The only reason to use a DFA parser would be
>>> performance, but since the preg_*() functions are so heavily optimized
>>> I don't think that'll be an issue.
>> This much I know, but is LaTeX actually a regular language?
>
> It's not even context free, luckily the subset we are interested in is
> (as clearly shown by the texvc parser :p).

Just because a language is context-sensitive doesn't mean it will be
hard to write a parser for it. That's just a myth propagated by
computer scientists who, strangely enough given their profession, have
a disdain for the algorithm as a descriptive framework.

In the last few decades, pure mathematicians have been exploring the
power of algorithms as a general description of an axiomatic system.
And simultaneously, computer scientists have embraced the idea that
the best way to process text is by trying to shoehorn all computer
languages into some Chomsky-inspired representation, regardless of how
awkward that representation is, or how inefficient the resulting
algorithm becomes, when compared to an algorithm constructed a priori.
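
One way to read this: a parser can be written directly as an algorithm without first forcing the language into a formal grammar class. Below is a minimal hand-written recursive-descent sketch in Python that turns brace groups into nested lists. It is only an illustration of the approach under that reading, not a TeX parser, and the function names are invented for the example.

```python
def parse_groups(tex: str) -> list:
    """Parse brace groups into nested lists; other characters are leaves."""
    pos = 0

    def parse_sequence(inside_group: bool) -> list:
        nonlocal pos
        items = []
        while pos < len(tex):
            ch = tex[pos]
            pos += 1
            if ch == "{":
                # Recurse: the group's contents become a nested list.
                items.append(parse_sequence(True))
            elif ch == "}":
                if not inside_group:
                    raise ValueError(f"unexpected '}}' at offset {pos - 1}")
                return items
            else:
                items.append(ch)
        if inside_group:
            raise ValueError("missing '}' at end of input")
        return items

    return parse_sequence(False)
```

Because the parser is ordinary code, context-dependent behaviour (for example, treating the contents of one particular group differently) can be added with a plain `if` rather than a change of grammar formalism.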

-- Tim Starling



Re: GSoC project advice: port texvc to Python?

Trevor Parscal-2
I think we should really consider LOLCODE for this sort of thing.

http://en.wikipedia.org/wiki/Lolcode

It's just more fun!

- Trevor

On 3/23/10 3:44 PM, Tim Starling wrote:

> Conrad Irwin wrote:
>> On 03/23/2010 05:23 PM, Aryeh Gregor wrote:
>>> On Tue, Mar 23, 2010 at 1:00 PM, Roan Kattouw <[hidden email]> wrote:
>>>> DFAs parse regular languages, which means those languages can also be
>>>> expressed as regexes. In fact, the regexes accepted by the preg_*()
>>>> functions allow certain extensions to the language theory definition
>>>> of regular expressions, allowing them to describe certain non-regular
>>>> languages as well. In short: preg_split() can do everything a DFA can
>>>> do, and more. The only reason to use a DFA parser would be
>>>> performance, but since the preg_*() functions are so heavily optimized
>>>> I don't think that'll be an issue.
>>> This much I know, but is LaTeX actually a regular language?
>> It's not even context free, luckily the subset we are interested in is
>> (as clearly shown by the texvc parser :p).
>
> Just because a language is context-sensitive doesn't mean it will be
> hard to write a parser for it. That's just a myth propagated by
> computer scientists who, strangely enough given their profession, have
> a disdain for the algorithm as a descriptive framework.
>
> In the last few decades, pure mathematicians have been exploring the
> power of algorithms as a general description of an axiomatic system.
> And simultaneously, computer scientists have embraced the idea that
> the best way to process text is by trying to shoehorn all computer
> languages into some Chomsky-inspired representation, regardless of how
> awkward that representation is, or how inefficient the resulting
> algorithm becomes, when compared to an algorithm constructed a priori.
>
> -- Tim Starling



Re: GSoC project advice: port texvc to Python?

K. Peachey
On Wed, Mar 24, 2010 at 9:16 AM, Trevor Parscal <[hidden email]> wrote:
> I think we should really consider LOLCODE for this sort of thing.
>
> http://en.wikipedia.org/wiki/Lolcode
>
> It's just more fun!
>
> - Trevor
Also rewrite parser functions to use it? That would be interesting on
en.wiki since they are always complaining about the syntax.



jks

-Peachey


Re: GSoC project advice: port texvc to Python?

Rob Lanphier
In reply to this post by Damon Wang-2
On Tue, Mar 23, 2010 at 2:00 PM, Damon Wang <[hidden email]> wrote:

> I've been writing projects
> for university and for a computer lab I work at, but it's mostly small,
> one-off sysadmin things and usually the emphasis is more on "xyz server
> has to be back up before we open tomorrow" than writing good, clean code.
> So, yes, I'd welcome other suggestions.
>


Cool!  So, I'm assuming you're looking forward to an opportunity to write
good, clean code as a summer project.  :)


> There are ways to make [Python-based extensions] run faster if
> performance is a concern. For example, mod_python or mod_wsgi, or
> explicitly pulling the Python out into a standalone daemon that listens
> for requests from the webserver.
>


Personally, I'd avoid trying to make that pitch for a GSoC project.  While
you're right that Python is a pretty defensible choice when embarking on a
large project, trading one dependency for another for this size/scale of
project won't be as compelling as eliminating a dependency altogether.

Of course, as I say that, I see Platonides disagrees with me here.  Choosing
Python is not a huge disadvantage in this context, but it's not going to
have the same unanimous(-ish) approval as using PHP.



> Another possibility would be writing it in C to avoid all interpreter
> overhead, and using a foreign function interface. Unfortunately, I'm not
> familiar with PHP's FFI. Google takes me to
>    http://wiki.php.net/rfc/php_native_interface
> which seems to think that as of a year ago there weren't any good ones,
> but this doesn't look too painful:
>    http://theserverpages.com/php/manual/en/zend.creating.php
>
>
I think straight PHP would be fine for this particular project.  The
downside of a C implementation is that, while it's almost certainly going to
have the best performance characteristics, it also makes it more likely to
fall into disrepair and be a possible source of buffer overruns and other
security issues.

The nice thing about a PHP port (if done correctly) is that it would be a
trivial install for small wikis and Wikipedia alike.  That translates into
more usage, which in turn translates into higher likelihood that it stays
maintained.

That said, there have got to be a ton of projects that could benefit from
PHP->native C bindings.  I'm going to leave it to some other folks to
suggest projects in this area.


> I'm most familiar with Python and C, for whatever that's worth coming
> from an undergrad who didn't know Python existed five years ago. I
> learned PHP to maintain the web interfaces of an in-house print system
> at work, but I haven't used it for anything as involved as what we're
> discussing here. So, in terms of productivity, yes, if I have to work in
> PHP my mentor will probably get asked a few more newbie questions.
>
> In terms of happiness, though, it'd be a great opportunity to dig into
> PHP and finally learn to use it as more than really smart CSS with a
> database connection. Although I prefer Python or even C because I think
> I'd be more useful, I wouldn't be very upset at all if it turned out you
> guys were willing to let me learn PHP on your time.
>


There are a few Python-based things that might be interesting, but I think
you'll get a lot more love for doing something in PHP or C.  Since this is a
student internship, you shouldn't be bashful about using this as a learning
opportunity.

I'd only caution against convincing yourself (and us) that you'll be more
interested in learning something like PHP than you truly are.  It might help
you land a spot, but it will work against you in having a successful
project, and this has such high visibility that you'll really want to be
successful.  So, if you find yourself thinking about doing this in PHP and
having your inner voice say "meh", then I'd recommend sticking to your guns
and propose doing this or something else in Python and/or C.



> > 2.  Are you zeroing in on <math> parsing and parsing in general because
> > that's an area that you're already developing expertise in and/or are
> > deeply interested in getting into, or is that just something that looked
> > kinda interesting to learn about relative to other opportunities you
> > considered?
>
> I like the <math> parsing project because it seems well-suited for a
> third-year undergrad who knows LaTeX and reads a few other functional
> languages and has studied lex/yacc before in his coursework. The goals
> are clear, and I know how to break them down into smaller problems and
> how to tackle each one. It's a little isolated from the rest of
> Mediawiki, so I don't need to grok the entire code base.
>
> Basically, this looks like a way to make a concrete contribution despite
> being a newcomer to the project. That doesn't mean I'm not happy to
> entertain alternatives, just that they have a pretty high bar to clear.
>

This is a really smart way of thinking about it, so it's great that
you're thinking the right way about the project scope.  I agree with you
that finding something reasonably well-contained is going to be the best
strategy for success.



> > 3.  Are you coming at this as someone who is already deep into
> > Wikipedia/MediaWiki usage who is looking to resolve particular things
> > (like <math> parsing) that are painful as an end user, or are you more
> > casually involved and more interested in applying in this project because
> > it looks like we've got a lot of interesting programming problems to solve?
>
> The second. I just want to tackle a problem that's near but not quite
> beyond my limits, and if I can help out a site I use daily, so much the
> better.



Wonderful!  Great reason to get involved!

Rob

Re: GSoC project advice: port texvc to Python?

Happy-melon
In reply to this post by Platonides

"Platonides" <[hidden email]> wrote in message
news:hobfpi$4ud$[hidden email]...
> You seem to be thinking about creating a PHP extension. I don't think
> you should go that route. A binary is good enough, we don't need it to
> be in a PHP extension. That glue could be added later if needed, but
> would increase the complexity to write and debug.

I took it to mean that he wanted to split the math parsing out as a
**MediaWiki** extension, implementing <math> as a parser tag hook in the
usual way.  Which is definitely highly desirable.

--HM




Re: GSoC project advice: port texvc to Python?

Platonides
Happy-melon wrote:
> I took it to mean that he wanted to split the math parsing out as a
> **MediaWiki** extension, implementing <math> as a parser tag hook in the
> usual way.  Which is definitely highly desirable.
>
> --HM

Making it a MediaWiki extension is of course desirable (moving texvc out
of core is a pending issue, at least now <math> can be used by extensions).

But Damon wrote:
> Another possibility would be writing it in C to avoid all interpreter
> overhead, and using a foreign function interface. Unfortunately, I'm not
> familiar with PHP's FFI. Google takes me to
>     http://wiki.php.net/rfc/php_native_interface
> which seems to think that as of a year ago there weren't any good ones,
> but this doesn't look too painful:
>     http://theserverpages.com/php/manual/en/zend.creating.php

That's about PHP extensions (which are written in C).
So, instead of going that path, he should make a C program which does
what texvc does. It can then be moved into a PHP extension if really
needed, but starting with Zend extensions would be an unneeded pain for
this project.

