# GSoC project advice: port texvc to Python?

## GSoC project advice: port texvc to Python?

Hello everyone,

I'm interested in porting texvc to Python, and I was hoping this list here might help me hash out the plan. Please let me know if I should take my questions elsewhere.

Roughly, my plan of attack would be something like this:

1. Collect test cases and write a testing script

   Thanks to avar from #wikimedia, I already have the `<math>...</math>` bits from enwiki and dewiki. I would also construct some simpler ones by hand to test each of the acceptable LaTeX commands.

   Would there be any possibility of logging the input seen by texvc on a production instance of Mediawiki, so I could get some invalid input submitted by actual users?

   This could also be useful to future maintainers for regression testing.

2. Implement an AMS-TeX validator

   I'll probably use PLY because it's rumored to have helpful debugging features (designed for a first-year compilers class, apparently). ANTLR is another popular option, but this guy

   http://www.bearcave.com/software/antlr/antlr_expr.html

   thinks it's complicated and hard to debug. I've never used either, so if anyone on this list knows of a good Python parsing package I'd welcome suggestions.

3. Port over the existing tex->dvi->png rendering.

   This is probably just a few calls into the subprocess module. Yeah, I just jinxed it.

4. Add HTML rendering to texvc and test script

   I don't even understand how the existing texvc decides whether HTML is good enough. It looks like the original programmer just decreed that certain LaTeX commands could be rendered to HTML, and defaults to PNG if it sees anything not on that list. How important is this feature?

5. Repackage the entire Math thing as an extension

   I might do this if I have time left at the end. I'm sure the project will change over the summer.

Python doesn't have parsing just locked right down the way C does with flex/bison, but there are some good options, I have the most experience with it, and I think I'd be able to complete the port faster in Python than in either of the other languages. I was tempted at first to port to PHP, to conform with the rest of Mediawiki, but there don't seem to be any good parsing packages for PHP. (Please tell me if that's wrong.)

I'd appreciate any advice or criticism. Since my only previous experience has been using Wikipedia and setting up a test Mediawiki instance for my ACM chapter, I'm only just now learning my way around the code base and it's not always evident why things were done as they are. Does this look like a reasonable and worthwhile project?

Yours,
Damon Wang

P.S. Some of you may remember me on IRC a couple of days ago getting a little panicky about not knowing OCaml, but I'm a bit more hopeful now after looking around the source. I definitely have to keep the OCaml manual open for reference, but I've written Scheme, Common Lisp, and Haskell before, so I think I might be able to fake it. These are just Famous Last Words waiting to happen, I know.

_______________________________________________
Wikitech-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
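Step 3 of the plan ("a few calls into the subprocess module") might look roughly like the sketch below. It is not taken from texvc: it assumes the `latex` and `dvipng` programs are installed, and the function names `render_commands` and `render`, the default DPI, and the timeout are hypothetical choices made for illustration.

```python
import subprocess


def render_commands(basename, dpi=120):
    """Build the two command lines for a tex -> dvi -> png pipeline.

    Assumption: `latex` and `dvipng` are on PATH. The existing texvc
    may well use different tools or flags.
    """
    return [
        # Do not stop and prompt on errors; let the exit status report them.
        ["latex", "-interaction=nonstopmode", basename + ".tex"],
        # Crop the image tightly around the formula.
        ["dvipng", "-T", "tight", "-D", str(dpi),
         "-o", basename + ".png", basename + ".dvi"],
    ]


def render(basename, workdir, timeout=10):
    """Run both steps in workdir; raises if a step fails or times out."""
    for argv in render_commands(basename):
        subprocess.run(argv, cwd=workdir, check=True,
                       timeout=timeout, capture_output=True)
```

Keeping the command construction separate from the execution makes the pipeline testable on a machine without a TeX installation.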

## Re: GSoC project advice: port texvc to Python?

On 03/23/2010 08:06 AM, Damon Wang wrote:

> Hello everyone,
>
> I'm interested in porting texvc to Python, and I was hoping this list here might help me hash out the plan. Please let me know if I should take my questions elsewhere.
>
> Roughly, my plan of attack would be something like this:
>
> 1. Collect test cases and write a testing script
> Thanks to avar from #wikimedia, I already have the `<math>...</math>` bits from enwiki and dewiki. I would also construct some simpler ones by hand to test each of the acceptable LaTeX commands.

It is not too challenging to create a test file that checks most existing commands with some hacky regexes on the existing parser (I can't find where mine has gone though); what is much harder is proving that your script cannot ever let through invalid or potentially harmful LaTeX (the moment anyone gets a \catcode past you, you're doomed, and there are many other commands that are "not wanted" to say the least). Obviously, the current implementation makes this reasonably pleasant to verify, the syntax of the parser is exceedingly light, so any reimplementation should strive to have as little syntactic overhead as possible.

> 2. Implement an AMS-TeX validator

How different would this be from the current validator?

> 3. Port over the existing tex->dvi->png rendering.
> This is probably just a few calls into the subprocess module. Yeah, I just jinxed it.
>
> 4. Add HTML rendering to texvc and test script
> I don't even understand how the existing texvc decides whether HTML is good enough. It looks like the original programmer just decreed that certain LaTeX commands could be rendered to HTML, and defaults to PNG if it sees anything not on that list. How important is this feature?

I am not too fussed about the HTML output, though I can't speak for everyone, at the moment it seems that many more of the Unicode characters should be let through (at least at some level of HTML), though I don't know enough about worldwide Unicode support. Some things, like \sqrt for example, are pretty hard to render nicely in HTML, so images are still sensible for some expressions.

> 5. Repackage the entire Math thing as an extension
> I might do this if I have time left at the end. I'm sure the project will change over the summer.

This would be very amazing.

> Python doesn't have parsing just locked right down the way C does with flex/bison, but there are some good options, I have the most experience with it, and I think I'd be able to complete the port faster in Python than in either of the other languages. I was tempted at first to port to PHP, to conform with the rest of Mediawiki, but there don't seem to be any good parsing packages for PHP. (Please tell me if that's wrong.)

A good PHP parser library would be exceptionally useful for MediaWiki (and many extensions), at the moment we have loads of methods that do regex "parsing", so if you felt like writing one... :D.

> I'd appreciate any advice or criticism. Since my only previous experience has been using Wikipedia and setting up a test Mediawiki instance for my ACM chapter, I'm only just now learning my way around the code base and it's not always evident why things were done as they are. Does this look like a reasonable and worthwhile project?

Step 5. has been a "we really should do this" for a while, the shipping of OCaml code which many users won't be able to use is very messy. I am less convinced of the utility of a Python port, OCaml is a great language for implementing this, and I fear a lot of your time would be wasted trying to make the Python similarly nice. As you note, MediaWiki is not written in Python, doing this in PHP would be a larger step in the right direction, though without such nice frameworks, maybe less nice to do.

Instead of rewriting the `<math>` parser, it might be more productive to create parsers for some of the other languages that extensions use, hopefully with a view to adding additional extensions to Wikipedia. The ones I can think of immediately are tags (bug 3252/5856), , / (bug 189!), (bug 2403).

Yours
Conrad

(PS. I'm no-one official, so can be ignored safely)

## Re: GSoC project advice: port texvc to Python?

2010/3/23 Conrad Irwin <[hidden email]>:

> Instead of rewriting the `<math>` parser, it might be more productive to create parsers for some of the other languages that extensions use, hopefully with a view to adding additional extensions to Wikipedia. The ones I can think of immediately are tags (bug 3252/5856), , / (bug 189!), (bug 2403).

Note that there's already an ABC extension, as linked on bug 189, which AFAIK is pretty much ready for WMF deployment already. As mentioned on the same bug, shelling out to Lilypond has certain issues with unbounded time/CPU/memory usage. I'm not familiar with any of the other programs mentioned, so I can't comment on those.

Roan Kattouw (Catrope)

## Re: GSoC project advice: port texvc to Python?

In reply to this post by Conrad Irwin

Hello Conrad,

>> 2. Implement an AMS-TeX validator
>
> How different would this be from the current validator?

It should be exactly the same, except written in Python.

>> 5. Repackage the entire Math thing as an extension
>> I might do this if I have time left at the end. I'm sure the project will change over the summer.
>
> This would be very amazing.

Maybe this should be my project, then.

>> Python doesn't have parsing just locked right down the way C does with flex/bison, but there are some good options, I have the most experience with it, and I think I'd be able to complete the port faster in Python than in either of the other languages. I was tempted at first to port to PHP, to conform with the rest of Mediawiki, but there don't seem to be any good parsing packages for PHP. (Please tell me if that's wrong.)
>
> A good PHP parser library would be exceptionally useful for MediaWiki (and many extensions), at the moment we have loads of methods that do regex "parsing", so if you felt like writing one... :D.

Actually... I've never used PHP for real programming, but how difficult would it be to write a really simple, stupid first pass at a DFA parser? I suspect I'd need much more than three months to make it useful, but would it be possible to implement some coherent subset of the features? E.g., building the LR0 automaton, at least?

>> I'd appreciate any advice or criticism. Since my only previous experience has been using Wikipedia and setting up a test Mediawiki instance for my ACM chapter, I'm only just now learning my way around the code base and it's not always evident why things were done as they are. Does this look like a reasonable and worthwhile project?
>
> Step 5. has been a "we really should do this" for a while, the shipping of OCaml code which many users won't be able to use is very messy. I am less convinced of the utility of a Python port, OCaml is a great language for implementing this, and I fear a lot of your time would be wasted trying to make the Python similarly nice. As you note, MediaWiki is not written in Python, doing this in PHP would be a larger step in the right direction, though without such nice frameworks, maybe less nice to do.

I suggested a Python port because

http://www.mediawiki.org/wiki/Summer_of_Code_2010#MediaWiki_core

lists it as a potential project idea. I was under the impression that people around here did not want to leave texvc in OCaml. Is this wrong?

Yours,
Damon Wang

## Re: GSoC project advice: port texvc to Python?

In reply to this post by Damon Wang-2

On Tue, Mar 23, 2010 at 4:06 AM, Damon Wang <[hidden email]> wrote:

> I'm interested in porting texvc to Python, and I was hoping this list here might help me hash out the plan. Please let me know if I should take my questions elsewhere.

Python is much better than OCaml, and I prefer Python to PHP, but a PHP implementation would be preferable for core IMO. Not all MediaWiki developers know Python, but all obviously know PHP. If you did a Python implementation, though, then at least someone could translate it to PHP pretty easily.

> 1. Collect test cases and write a testing script
> Thanks to avar from #wikimedia, I already have the `<math>...</math>` bits from enwiki and dewiki. I would also construct some simpler ones by hand to test each of the acceptable LaTeX commands.
>
> Would there be any possibility of logging the input seen by texvc on a production instance of Mediawiki, so I could get some invalid input submitted by actual users?
>
> This could also be useful to future maintainers for regression testing.

If you have a Unix box handy, it's pretty easy to install MediaWiki with math support so you can test yourself. `sudo apt-get install mediawiki mediawiki-math` should do it on anything Debian-based, for example.

> 2. Implement an AMS-TeX validator
> I'll probably use PLY because it's rumored to have helpful debugging features (designed for a first-year compilers class, apparently). ANTLR is another popular option, but this guy
> http://www.bearcave.com/software/antlr/antlr_expr.html
> thinks it's complicated and hard to debug. I've never used either, so if anyone on this list knows of a good Python parsing package I'd welcome suggestions.

If it's in PHP, you'd probably have to write a parser yourself, but LaTeX is pretty easy to parse, I'd think.

> 4. Add HTML rendering to texvc and test script
> I don't even understand how the existing texvc decides whether HTML is good enough. It looks like the original programmer just decreed that certain LaTeX commands could be rendered to HTML, and defaults to PNG if it sees anything not on that list. How important is this feature?

Fairly important, IMO, if the goal is to replace texvc, although not critical. `<math>x</math>` shouldn't render x as a PNG -- that's silly.

> Python doesn't have parsing just locked right down the way C does with flex/bison, but there are some good options, I have the most experience with it, and I think I'd be able to complete the port faster in Python than in either of the other languages. I was tempted at first to port to PHP, to conform with the rest of Mediawiki, but there don't seem to be any good parsing packages for PHP. (Please tell me if that's wrong.)

Would it really be very hard to write a LaTeX parser in PHP? I'd think it could be done easily, if you permit only a carefully-selected subset. I don't think you'd need any parser theory, just use preg_split() and loop through all the tokens.

> I'd appreciate any advice or criticism. Since my only previous experience has been using Wikipedia and setting up a test Mediawiki instance for my ACM chapter, I'm only just now learning my way around the code base and it's not always evident why things were done as they are. Does this look like a reasonable and worthwhile project?

Rewriting texvc in PHP would be a nice project to have, which is small enough in scope that I'm optimistic that it could be done in a summer. I'd say it's a good choice.

On Tue, Mar 23, 2010 at 6:23 AM, Conrad Irwin <[hidden email]> wrote:

> I am not too fussed about the HTML output, though I can't speak for everyone, at the moment it seems that many more of the Unicode characters should be let through (at least at some level of HTML), though I don't know enough about worldwide unicode support.

I suspect we need to be about as conservative as we currently are for platforms like IE6 on XP. We should be able to expand the range of HTML characters in the future, though.

> A good PHP parser library would be exceptionally useful for MediaWiki (and many extensions), at the moment we have loads of methods that do regex "parsing", so if you felt like writing one... :D.

Wouldn't a real generic parser implementation written in PHP be too slow to be useful? preg_replace() has the advantage of being implemented in C.

> I am less convinced of the utility of a Python port, OCaml is a great language for implementing this, and I fear a lot of your time would be wasted trying to make the Python similarly nice. As you note, MediaWiki is not written in Python, doing this in PHP would be a larger step in the right direction, though without such nice frameworks, maybe less nice to do.

OCaml might be a great language for implementing this, but very few of us understand it. texvc has been totally unmaintained for years, other than new things being added to the whitelist sometimes by means of cargo-culting what previous commits do. Rewriting texvc in *anything* that more people understand would be a step forward.

On Tue, Mar 23, 2010 at 8:31 AM, Roan Kattouw <[hidden email]> wrote:

> As mentioned on the same bug, shelling out to Lilypond has certain issues with unbounded time/CPU/memory usage.

The same is true for LaTeX. Lilypond would just need a parser and filter to whitelist safe constructs, like LaTeX does.

On Tue, Mar 23, 2010 at 12:25 PM, Damon Wang <[hidden email]> wrote:

> I've never used PHP for real programming, but how difficult would it be to write a really simple, stupid first pass at a DFA parser? I suspect I'd need much more than three months to make it useful, but would it be possible to implement some coherent subset of the features? E.g., building the LR0 automaton, at least?

I don't think you'd need a "real" parser here. Mostly we just use preg_split() for this sort of thing. I'm not familiar with formal grammars and such, so I can't say what the concrete disadvantages of that approach are.

> I suggested a Python port because
> http://www.mediawiki.org/wiki/Summer_of_Code_2010#MediaWiki_core
> lists it as a potential project idea. I was under the impression that people around here did not want to leave texvc in OCaml. Is this wrong?

No, it's right. Conrad is crazy. :P
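The "split into tokens and loop through them" approach with a command whitelist can be sketched in Python using the standard `re` module in place of PHP's preg_split(). This is only an illustration under stated assumptions: `ALLOWED_WORDS` and `ALLOWED_SYMBOLS` are tiny made-up lists, not texvc's real whitelist, and `rejected_tokens` is a hypothetical helper name.

```python
import re

# Hypothetical whitelists for illustration; a real validator needs far more
# entries and far more care about which ordinary characters it accepts.
ALLOWED_WORDS = {"frac", "sqrt", "alpha", "beta", "sum"}
ALLOWED_SYMBOLS = {"\\{", "\\}", "\\,", "\\\\"}

# A control word (backslash + letters), or a backslash followed by any one
# character, or any single character. Every input character lands in a token,
# including a stray backslash at the very end.
TOKEN = re.compile(r"\\[A-Za-z]+|\\?.", re.DOTALL)


def rejected_tokens(tex):
    """Return every backslash token that is not explicitly whitelisted."""
    bad = []
    for tok in TOKEN.findall(tex):
        if not tok.startswith("\\"):
            continue  # ordinary character; this sketch does not filter these
        if tok[1:] in ALLOWED_WORDS or tok in ALLOWED_SYMBOLS:
            continue
        bad.append(tok)  # deny by default, e.g. \catcode
    return bad
```

For example, `rejected_tokens(r"\frac{a}{b}")` returns an empty list, while input containing `\catcode` is reported. The deny-by-default loop is the point here; checking that braces and environments nest correctly needs more than this flat token scan.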

## Re: GSoC project advice: port texvc to Python?

2010/3/23 Aryeh Gregor <[hidden email]>:

>> I've never used PHP for real programming, but how difficult would it be to write a really simple, stupid first pass at a DFA parser? I suspect I'd need much more than three months to make it useful, but would it be possible to implement some coherent subset of the features? E.g., building the LR0 automaton, at least?
>
> I don't think you'd need a "real" parser here. Mostly we just use preg_split() for this sort of thing. I'm not familiar with formal grammars and such, so I can't say what the concrete disadvantages of that approach are.

DFAs parse regular languages, which means those languages can also be expressed as regexes. In fact, the regexes accepted by the preg_*() functions allow certain extensions to the language theory definition of regular expressions, allowing them to describe certain non-regular languages as well. In short: preg_split() can do everything a DFA can do, and more. The only reason to use a DFA parser would be performance, but since the preg_*() functions are so heavily optimized I don't think that'll be an issue.

>> I suggested a Python port because
>> http://www.mediawiki.org/wiki/Summer_of_Code_2010#MediaWiki_core
>> lists it as a potential project idea. I was under the impression that people around here did not want to leave texvc in OCaml. Is this wrong?
>
> No, it's right. Conrad is crazy. :P

Having it in a language no one understands is a bad thing and leads to maintenance not happening, so yeah, we definitely want it rewritten in PHP. If the PHP implementation turns out to be too slow to run on WMF, for instance, we could do a C++ port à la wikidiff2 (a C++ port of our ludicrously slow PHP diff implementation).

Roan Kattouw (Catrope)
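One concrete instance of regex engines going beyond regular languages is the backreference. The sketch below uses Python's `re` module as a stand-in for the preg_*() functions (an assumption: both support backreferences, but they are different engines). The language "some a's, one b, then the same number of a's" is not regular, yet a backreference pattern accepts exactly that language.

```python
import re

# (a+) captures a run of a's; \1 then demands the identical run again.
# No DFA can do this, because it would have to remember an unbounded count.
SAME_RUN_TWICE = re.compile(r"(a+)b\1")


def same_on_both_sides(text):
    """True if text is a^n b a^n for some n >= 1."""
    return SAME_RUN_TWICE.fullmatch(text) is not None
```

This trick compares two copies of the same text. It does not help with arbitrarily deep nesting, which is what matched delimiters require.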

## Re: GSoC project advice: port texvc to Python?

On 03/23/2010 05:00 PM, Roan Kattouw wrote:

>>> I suggested a Python port because
>>> http://www.mediawiki.org/wiki/Summer_of_Code_2010#MediaWiki_core
>>> lists it as a potential project idea. I was under the impression that people around here did not want to leave texvc in OCaml. Is this wrong?
>>
>> No, it's right. Conrad is crazy. :P
>
> Having it in a language no one understands is a bad thing and leads to maintenance not happening, so yeah, we definitely want it rewritten in PHP. If the PHP implementation turns out to be too slow to run on WMF, for instance, we could do a C++ port à la wikidiff2 (a C++ port of our ludicrously slow PHP diff implementation).

And here was me thinking that maintenance didn't happen because making changes to security critical sections of the code is dangerous :).

The current implementation is just over a thousand lines of exceedingly concise code. While I agree that a re-implementation in PHP is probably sensible, I'll stubbornly maintain that the existing OCaml is more suited to the task. (Oh, and it seems I misread that proposal; I could not imagine a language other than LaTeX being useful for doing maths :p).

While re-implementing the syntax whitelister would not be too hard, LaTeX, with its wonderfully re-definable syntax, is incredibly dangerous. Have fun, and be careful!

Conrad

http://tug.ctan.org/cgi-bin/ctanPackageInformation.py?id=xii

## Re: GSoC project advice: port texvc to Python?

In reply to this post by Roan Kattouw-2

On Tue, Mar 23, 2010 at 1:00 PM, Roan Kattouw <[hidden email]> wrote:

> DFAs parse regular languages, which means those languages can also be expressed as regexes. In fact, the regexes accepted by the preg_*() functions allow certain extensions to the language theory definition of regular expressions, allowing them to describe certain non-regular languages as well. In short: preg_split() can do everything a DFA can do, and more. The only reason to use a DFA parser would be performance, but since the preg_*() functions are so heavily optimized I don't think that'll be an issue.

This much I know, but is LaTeX actually a regular language?

On Tue, Mar 23, 2010 at 1:13 PM, Conrad Irwin <[hidden email]> wrote:

> And here was me thinking that maintenance didn't happen because making changes to security critical sections of the code is dangerous :).

It's not security-critical. The worst you could possibly do is DoS, and any DoS could be instantly shut off by just turning off math briefly. Furthermore, the part that makes DoS impossible is a quite small portion of the code that would need to change effectively never. No, the problem is that most PHP programmers have never even heard of OCaml, let alone used it.
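On the DoS point, one common guard when shelling out to an external renderer is a hard wall-clock limit on the child process. The sketch below shows that idea with Python's `subprocess.run`; it is not how texvc or MediaWiki actually limit LaTeX, and `run_with_timeout` is a hypothetical helper name. CPU and memory limits would need a separate mechanism and are not covered here.

```python
import subprocess


def run_with_timeout(argv, seconds):
    """Run argv and return its exit status, or None if it ran too long.

    subprocess.run kills the child and waits for it when the timeout
    expires, so a runaway process does not linger.
    """
    try:
        done = subprocess.run(argv, capture_output=True, timeout=seconds)
    except subprocess.TimeoutExpired:
        return None
    return done.returncode
```

A caller would treat `None` the same way as a rendering failure and show an error instead of an image.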

## Re: GSoC project advice: port texvc to Python?

2010/3/23 Aryeh Gregor <[hidden email]>:

> This much I know, but is LaTeX actually a regular language?

I don't know; I was just making the point that writing a DFA parser in PHP is probably not very useful.

Roan Kattouw (Catrope)

## Re: GSoC project advice: port texvc to Python?

2010/3/23 Roan Kattouw <[hidden email]>:

> 2010/3/23 Aryeh Gregor <[hidden email]>:
>> This much I know, but is LaTeX actually a regular language?
>
> I don't know; I was just making the point that writing a DFA parser in PHP is probably not very useful.

Sorry, I got confused and wrote DFA when I should have written LALR. DFAs cannot parse even the allowed subset of AMS-LaTeX, because there are some permitted environments.

Without claiming to know much formal language theory, a rule of thumb is that languages with matched delimiters are never regular, because of the pumping lemma:

http://en.wikipedia.org/wiki/Pumping_lemma_for_regular_languages

So, for example, it's theoretically impossible to check that parentheses nest correctly using regular expressions, and similarly it'd be impossible to check that the \begin and \end commands match up. In practice there might be ways to hack around that by using multiple regular expressions and manually tracking how they nest, but at that point we're basically writing half of a bad LALR parser.

Fortunately, though, Python has parser generators! And if we're really concerned about speed, there's PyBison, which does the parsing in C and apparently produces (at least) five-fold improvements over Python-native alternatives.

Yours,
Damon Wang
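The "regular expressions plus manual tracking of nesting" workaround can be sketched as follows: a regex finds the individual `\begin{...}` and `\end{...}` tokens, and an explicit stack supplies the memory that a regular expression lacks. The name `environments_balanced` and the accepted environment-name characters are assumptions made for this illustration.

```python
import re

# Finds \begin{name} and \end{name}; the name pattern is a simplification.
ENV_TOKEN = re.compile(r"\\(begin|end)\{([A-Za-z*]+)\}")


def environments_balanced(tex):
    """Check that every \\begin{x} is closed by a matching \\end{x}."""
    stack = []
    for kind, name in ENV_TOKEN.findall(tex):
        if kind == "begin":
            stack.append(name)  # remember which environment is open
        elif not stack or stack.pop() != name:
            return False  # \end with nothing open, or the wrong name
    return not stack  # anything left on the stack was never closed
```

The stack is exactly the part a DFA cannot provide, since the nesting depth is unbounded.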

## Re: GSoC project advice: port texvc to Python?

In reply to this post by Aryeh Gregor

On 03/23/2010 05:23 PM, Aryeh Gregor wrote:

> On Tue, Mar 23, 2010 at 1:00 PM, Roan Kattouw <[hidden email]> wrote:
>> DFAs parse regular languages, which means those languages can also be expressed as regexes. In fact, the regexes accepted by the preg_*() functions allow certain extensions to the language theory definition of regular expressions, allowing them to describe certain non-regular languages as well. In short: preg_split() can do everything a DFA can do, and more. The only reason to use a DFA parser would be performance, but since the preg_*() functions are so heavily optimized I don't think that'll be an issue.
>
> This much I know, but is LaTeX actually a regular language?

It's not even context free, luckily the subset we are interested in is (as clearly shown by the texvc parser :p).

> On Tue, Mar 23, 2010 at 1:13 PM, Conrad Irwin <[hidden email]> wrote:
>> And here was me thinking that maintenance didn't happen because making changes to security critical sections of the code is dangerous :).
>
> It's not security-critical. The worst you could possibly do is DoS, and any DoS could be instantly shut off by just turning off math briefly. Furthermore, the part that makes DoS impossible is a quite small portion of the code that would need to change effectively never. No, the problem is that most PHP programmers have never even heard of OCaml, let alone used it.

Many LaTeX installations can be made read/write/execute anything by default. LaTeX also allows you to redefine the meaning of characters in the input, if you accidentally let a single command through, then all the whitelisting becomes pointless. It certainly is a security issue.

Conrad

## Re: GSoC project advice: port texvc to Python?

In reply to this post by Damon Wang-2

Python is a nice language. PHP (portability) or C/C++ (speed) would be better, but Python is preferable to OCaml.

You mention ANTLR. Something like that could be a good choice because it should allow generating the same parser in a different language without much effort (you probably won't have enough time in GSoC for that, but a design taking that option into account would be interesting).

So you could do (please don't take this as a list of requirements):

* Figure out wth the current texvc is doing.
* Document it heavily.
* Design how to create the next texvc.
* Any parser you make for it.
* Actual implementation.

You seem to be thinking about creating a PHP extension. I don't think you should go that route. A binary is good enough, we don't need it to be in a PHP extension. That glue could be added later if needed, but it would make the code more complex to write and debug.

## Re: GSoC project advice: port texvc to Python?

In reply to this post by Conrad Irwin

Conrad Irwin wrote:

> On 03/23/2010 05:23 PM, Aryeh Gregor wrote:
>> On Tue, Mar 23, 2010 at 1:00 PM, Roan Kattouw <[hidden email]> wrote:
>>> DFAs parse regular languages, which means those languages can also be expressed as regexes. In fact, the regexes accepted by the preg_*() functions allow certain extensions to the language theory definition of regular expressions, allowing them to describe certain non-regular languages as well. In short: preg_split() can do everything a DFA can do, and more. The only reason to use a DFA parser would be performance, but since the preg_*() functions are so heavily optimized I don't think that'll be an issue.
>>
>> This much I know, but is LaTeX actually a regular language?
>
> It's not even context free, luckily the subset we are interested in is (as clearly shown by the texvc parser :p).

Just because a language is context-sensitive doesn't mean it will be hard to write a parser for it. That's just a myth propagated by computer scientists who, strangely enough given their profession, have a disdain for the algorithm as a descriptive framework.

In the last few decades, pure mathematicians have been exploring the power of algorithms as a general description of an axiomatic system. And simultaneously, computer scientists have embraced the idea that the best way to process text is by trying to shoehorn all computer languages into some Chomsky-inspired representation, regardless of how awkward that representation is, or how inefficient the resulting algorithm becomes, when compared to an algorithm constructed a priori.

-- Tim Starling
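As a small illustration of a parser written directly as an algorithm rather than generated from a grammar, the sketch below parses nested `{...}` groups by plain recursive descent. It handles only brace grouping, which is far less than LaTeX or the texvc subset, and `parse_groups` is a hypothetical name chosen for this example.

```python
def parse_groups(tex):
    """Turn nested {...} groups into nested lists of single characters."""
    pos = 0

    def parse_until(inside_group):
        nonlocal pos
        items = []
        while pos < len(tex):
            ch = tex[pos]
            if ch == "{":
                pos += 1
                items.append(parse_until(True))  # recurse into the group
            elif ch == "}":
                if not inside_group:
                    raise ValueError("unexpected '}' at offset %d" % pos)
                pos += 1
                return items  # this group is complete
            else:
                items.append(ch)
                pos += 1
        if inside_group:
            raise ValueError("missing '}' before end of input")
        return items

    return parse_until(False)
```

For instance, `parse_groups("a{b{c}}d")` yields `["a", ["b", ["c"]], "d"]`. The call stack of the recursion plays the role that an explicit grammar and parse table would play in a generated parser.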

## Re: GSoC project advice: port texvc to Python?

I think we should really consider LOLCODE for this sort of thing.

http://en.wikipedia.org/wiki/Lolcode

It's just more fun!

- Trevor

On 3/23/10 3:44 PM, Tim Starling wrote:

> Conrad Irwin wrote:
>> On 03/23/2010 05:23 PM, Aryeh Gregor wrote:
>>> On Tue, Mar 23, 2010 at 1:00 PM, Roan Kattouw <[hidden email]> wrote:
>>>> DFAs parse regular languages, which means those languages can also be expressed as regexes. In fact, the regexes accepted by the preg_*() functions allow certain extensions to the language theory definition of regular expressions, allowing them to describe certain non-regular languages as well. In short: preg_split() can do everything a DFA can do, and more. The only reason to use a DFA parser would be performance, but since the preg_*() functions are so heavily optimized I don't think that'll be an issue.
>>>
>>> This much I know, but is LaTeX actually a regular language?
>>
>> It's not even context free, luckily the subset we are interested in is (as clearly shown by the texvc parser :p).
>
> Just because a language is context-sensitive doesn't mean it will be hard to write a parser for it. That's just a myth propagated by computer scientists who, strangely enough given their profession, have a disdain for the algorithm as a descriptive framework.
>
> In the last few decades, pure mathematicians have been exploring the power of algorithms as a general description of an axiomatic system. And simultaneously, computer scientists have embraced the idea that the best way to process text is by trying to shoehorn all computer languages into some Chomsky-inspired representation, regardless of how awkward that representation is, or how inefficient the resulting algorithm becomes, when compared to an algorithm constructed a priori.
>
> -- Tim Starling

## Re: GSoC project advice: port texvc to Python?

On Wed, Mar 24, 2010 at 9:16 AM, Trevor Parscal <[hidden email]> wrote:

> I think we should really consider LOLCODE for this sort of thing.
>
> http://en.wikipedia.org/wiki/Lolcode
>
> It's just more fun!
>
> - Trevor

Also rewrite parser functions to use it? That would be interesting on en.wiki since they are always complaining about the syntax.

jks

-Peacvhey

## Re: GSoC project advice: port texvc to Python?

In reply to this post by Platonides

"Platonides" <[hidden email]> wrote in message news:hobfpi$4ud$[hidden email]...

> You seem to be thinking about creating a PHP extension. I don't think you should go that route. A binary is good enough, we don't need it to be in a PHP extension. That glue could be added later if needed, but would increase the complexity to write and debug.

I took it to mean that he wanted to split the math parsing out as a **MediaWiki** extension, implementing `<math>` as a parser tag hook in the usual way. Which is definitely highly desirable.

--HM