20071018 dumps have more problems. "United States" does not render.

20071018 dumps have more problems. "United States" does not render.

jmerkey-3

The article "United States" and "Antarctica" (and lots of others) do not
render on MediaWiki 1.9.3 release with the 20071018 dumps.  I also have a
test setup with MediaWiki 1.11 and the performance is very poor vs. 1.9.3.

The errors are MySQL timeouts, with no useful output in the HTTP logs.  I
will attempt to debug this next.

Any ideas on what this could be? The page is:

http://www.wikigadugi.org/wiki/United_States

Dumps prior to September do not exhibit this breakage.

Jeff



Re: 20071018 dumps have more problems. "United States" does not render.

Brion Vibber-3
[hidden email] wrote:
> The article "United States" and "Antarctica" (and lots of others) do not

Ugly slow templates. Please look at the actual content before
complaining about dumps.

-- brion vibber (brion @ wikimedia.org)



Re: 20071018 dumps have more problems. "United States" does not render.

David A. Desrosiers-2
In reply to this post by jmerkey-3

> The article "United States" and "Antarctica" (and lots of others) do
> not render on MediaWiki 1.9.3 release with the 20071018 dumps.  I
> also have a test setup with MediaWiki 1.11 and the performance is
> very poor vs. 1.9.3.

I just tried on my install here, running the latest trunk of phase3, and
although it took about 10 seconds on the first query, it did come up.
There's so much on that page that it took my browser a lot longer to
render it than a fetch with wget did, but the content is there and does
display.

This is enwiki fetched on September 10th. You want me to try with a
newer dump?



Re: 20071018 dumps have more problems. "United States" does not render.

jmerkey-3
>
>> The article "United States" and "Antarctica" (and lots of others) do
>> not render on MediaWiki 1.9.3 release with the 20071018 dumps.  I
>> also have a test setup with MediaWiki 1.11 and the performance is
>> very poor vs. 1.9.3.
>
> I just tried on my install here, running latest trunk of phase3, and
> although it took about 10 seconds at the first query, it did come up.
> There's so much on that page that it took my browser a lot longer to
> render it than a fetch with wget did, but the content is there, and
> did display.
>
> This is enwiki fetched on September 10th. You want me to try with a
> newer dump?
>
>

The September dump has the same problem, just not as bad.  It appears the
templates are exceeding the 30-second timeout and causing the connection
to abort (based on tracing through the wfDebug logs).

It appears the fix is to raise that 30-second timeout.  On my system,
rendering looks to be exceeding the preconfigured timeouts, and 1.11 seems
to exacerbate the problem.
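
For reference, a minimal sketch of the sort of change I mean, assuming the
30-second limit being hit is PHP's max_execution_time rather than a MySQL-side
timeout (120 is just an example value, not a recommendation):

    // LocalSettings.php -- sketch only; assumes the limit being hit is PHP's
    // max_execution_time and not a MySQL timeout.
    ini_set( 'max_execution_time', 120 );  // or raise it in php.ini instead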

Jeff




Re: 20071018 dumps have more problems. "United States" does not render.

Thomas Dalton
> It appears the fix is to set the 30 second timeout higher.

The fix is to remove/fix the templates. 30 seconds is a long time for
a page to take to render...


Re: 20071018 dumps have more problems. "United States" does not render.

jmerkey-3
>> It appears the fix is to set the 30 second timeout higher.
>
> The fix is to remove/fix the templates. 30 seconds is a long time for
> a page to take to render...

Yep.  Since Wikimedia publishes the templates and the dumps, the fixing
needs to happen at the source -- the English Wikipedia.  Some sort of
constraint should be placed in MediaWiki to limit the call depth and
complexity of these templates, by refusing to save changes to templates
which are so obviously broken.
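
Something along these lines, as a rough sketch -- registering a save hook is
the normal way an extension can veto an edit, but the "complexity" check here
(a raw count of '{{' in the template source) and the limit are invented for
illustration; a real check would have to measure expansion depth properly:

    // Rough sketch of a pre-save constraint on template complexity.
    // The brace-count check and the limit of 200 are placeholders only.
    $wgHooks['ArticleSave'][] = 'rejectOverlyComplexTemplates';

    function rejectOverlyComplexTemplates( $article, $user, $text ) {
        if ( $article->getTitle()->getNamespace() == NS_TEMPLATE
            && substr_count( $text, '{{' ) > 200 ) {
            return false;  // returning false vetoes the save
        }
        return true;
    }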

Jeff







Re: 20071018 dumps have more problems. "United States" does not render.

Steve Summit
Jeff Merkey wrote:
> ...Some sort of constraint should be placed into MediaWiki to limit
> the call depth and complexity for some of these templates by refusing
> to save changes for templates which are so obviously broken.

Yup.  See this message from Simetrical, over in the "slowness
tonight?" thread:

                        * * *

Date: Thu, 25 Oct 2007 11:42:23 -0400
From: Simetrical <[hidden email]>
To: "Wikimedia developers" <[hidden email]>
Subject: Re: [Wikitech-l] slowness tonight?

Yes, those load bog-slowly.  I tested WP:RD/S.  If I leave in the
header, but remove the archives, it loads in 40191 ms on preview
according to Tamper Data.  With the archives but not the header, it's
just 6766 ms.  Without either, it's 5836 ms.  With both, it's 49430 ms.

I think the conclusion is clear: the header adds over 30 seconds to
the page load time.  It needs to be killed, stone-dead.  This is the
sort of thing that the wikitext include limit was supposed to prevent
-- it seems not to be doing its job.


Re: 20071018 dumps have more problems. "United States" does not render.

RLS-2
In reply to this post by jmerkey-3
[hidden email] wrote:
> Yep.  Since Wikimedia publishes the templates and the dumps, the fixing
> needs to happen at the source -- the English Wikipedia.     Some sort of
> constraint should be placed into MediaWiki to limit the call depth and
> complexity for some of these templates by refusing to save changes for
> templates which are so obviously broken.

Thing is, they're not "broken" on en.wikipedia.org.  The WMF hardware is
capable of rendering these pages in a reasonable amount of time, ~12s
for me for either [[United States]] or [[Antarctica]], including
download time for the images etc.  I agree that's higher than for most
pages, but I wouldn't call it "broken."

Broken is relative, and I don't see why the English Wikipedia needs to
be crippled because you, as a downstream user, don't have hardware
capable of matching the performance of the Foundation's hardware.

--en.wp Darkwind


Re: 20071018 dumps have more problems. "United States" does not render.

Gerard Meijssen-3
Hoi,
There is another reason why such templates might be reconsidered.  Their
complexity has become such that it is increasingly difficult to find
people who still understand them.  The technical sophistication needed is
now beyond many advanced users of MediaWiki.  This in itself is a problem,
because it also raises the threshold for new Wikimedians getting started.
Thanks,
     GerardM

On 10/25/07, RLS <[hidden email]> wrote:

>
> [hidden email] wrote:
> > Yep.  Since Wikimedia publishes the templates and the dumps, the fixing
> > needs to happen at the source -- the English Wikipedia.     Some sort of
> > constraint should be placed into MediaWiki to limit the call depth and
> > complexity for some of these templates by refusing to save changes for
> > templates which are so obviously broken.
>
> Thing is, they're not "broken" on en.wikipedia.org.  The WMF hardware is
> capable of rendering these pages in a reasonable amount of time, ~12s
> for me for either [[United States]] and [[Antarctica]], including
> download time for the images etc.  I agree that's higher than most
> pages, but I wouldn't call it "broken."
>
> Broken is relative, and I don't see why the English Wikipedia needs to
> be crippled because your hardware, as a downstream user, isn't capable
> of matching the performance of the Foundation's hardware.
>
> --en.wp Darkwind
>

Re: 20071018 dumps have more problems. "United States" does not render.

RLS-2
GerardM wrote:
> Hoi,
> There is another reason why such templates might be reconsidered. The
> complexity has become such that it become increasingly difficult to find
> people who still understand it. The technical sophistication needed has
> become such that it is beyond many advanced users of MediaWiki. This in
> itself is a problem because it also raises the threshold for starting
> Wikimedians.

Well, I certainly agree that template programming is entirely beyond the
pale, but that's a separate issue that actually has several possible
solutions (an extremely simplified programming language that could be
enabled only for certain namespaces, artificially limiting template
recursion to improve simplicity, keeping the current style of functions
but developing an easier-to-read syntax, etc.).

--en.wp Darkwind




Re: 20071018 dumps have more problems. "United States" does not render.

Nick Jenkins
In reply to this post by RLS-2
> > Yep.  Since Wikimedia publishes the templates and the dumps, the fixing
> > needs to happen at the source -- the English Wikipedia.     Some sort of
> > constraint should be placed into MediaWiki to limit the call depth and
> > complexity for some of these templates by refusing to save changes for
> > templates which are so obviously broken.
>
> Thing is, they're not "broken" on en.wikipedia.org.  The WMF hardware is
> capable of rendering these pages in a reasonable amount of time, ~12s
> for me for either [[United States]] and [[Antarctica]], including
> download time for the images etc.  I agree that's higher than most
> pages, but I wouldn't call it "broken."
>
> Broken is relative, and I don't see why the English Wikipedia needs to
> be crippled because your hardware, as a downstream user, isn't capable
> of matching the performance of the Foundation's hardware.

"Not our problem" is potentially a dangerous argument. Let's take as a given that
some normal non-malicious pages as currently written take 12 seconds to render on WMF
hardware. Suppose that an actively malicious user then systematically identifies and
repeatedly calls the slowest operations contributing to that render time, and eliminates
all the fast operations, thus allowing them to increase the "efficacy"/slowness of the
wikitext rendering, such that a page that's only 5% of the size takes 4 times longer
to render (so we're up to around 50 seconds to render 8 KB of wikitext). We then take
the number of MediaWiki Apache servers (let's assume 170 for the sake of argument).
So for a DoS we need to request from each server (say) two preview renderings of each
attack page per 50 seconds, and assuming 170 servers, that's 170/(50/2) seconds * 8 KB
= 54 KB per second upstream bandwidth required. Downstream bandwidth doesn't matter
because we don't care about the response, and we won't be listening anyway. My connection
now for example is 1017 kilobits per second upstream, equals 127 KB per second. So,
if the above assumptions are reasonable and my maths is okay, then any single reasonably
modern broadband connection is more than sufficient to make every Wikipedia unusable.
... remind me again of how this is not our problem?  ;-)
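
Spelled out, the back-of-the-envelope arithmetic under those assumptions
(every number below is one of the assumptions above, not a measurement):

    // DoS bandwidth estimate from the assumed figures above.
    $servers        = 170;  // assumed MediaWiki Apache servers
    $renderSeconds  = 50;   // assumed render time of one attack page
    $previewsEach   = 2;    // preview renders requested per server per period
    $pageKB         = 8;    // wikitext size of the attack page

    $requestsPerSec = $servers * $previewsEach / $renderSeconds;  // 6.8
    $upstreamKBps   = $requestsPerSec * $pageKB;                  // ~54 KB/s

    echo "$upstreamKBps KB/s upstream needed\n";  // vs. ~127 KB/s on my uplink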

It might be better to think of Jeff's servers as the gasping canary, WMF servers as the
miner, and slow render time as the toxic gas, the Internet as the mine, the people who
will make useful contributions as the gold, the trolls as trolls, and ... actually I
think I'm overextending the metaphor, so I'll stop there!

-- All the best,
Nick.


Re: 20071018 dumps have more problems. "United States" does not render.

RLS-2
Nick Jenkins wrote:
> It might be better to think of Jeff's servers as the gasping canary, WMF servers as the
> miner, and slow render time as the toxic gas, the Internet as the mine, the people who
> will make useful contributions as the gold, the trolls as trolls, and ... actually I
> think I'm overextending the metaphor, so I'll stop there!

I agree there are limits to even what the WMF hardware can do; I just
don't necessarily see that we've reached them, when we don't know
anything about his servers or configuration and the issue is not causing
problems on en.wp, which is likely to hit such problems first as the
largest WMF wiki.

There *are* template-related problems on en.wp at the moment, discussed
in the thread starting at
http://lists.wikimedia.org/pipermail/wikitech-l/2007-October/034323.html,
but I'm still not sure that's an indication that additional limits are
needed -- though it might be.

--Darkwind


Re: 20071018 dumps have more problems. "United States" does not render.

Steve Summit
Darkwind wrote:
> There *are* template-related problems on en.wp at the moment, discussed
> in the thread from
> http://lists.wikimedia.org/pipermail/wikitech-l/2007-October/034323.html,
> but I'm still not sure that's an indication that additional limits are
> needed - but it might be.

Also relevant are the threads at [[Wikipedia:Village pump
(technical)#any work on the incredible sluggishness of deeply
edited articles?]] and [[Wikipedia talk:Reference desk#Houston,
we have a problem.]], and Simetrical's statement at
<http://lists.wikimedia.org/pipermail/wikitech-l/2007-October/034338.html>
in the aforementioned thread that "This is the sort of thing that
the wikitext include limit was supposed to prevent".

The question is, how true is it that "almost every very-high-traffic
page on Wikipedia is having extreme problems right now"?  I suspect it's
not, but if it is, is that because there are more pages with, say, heavy
use of the {cite} template, or because templates like {cite} have gotten
more complicated, or because template interpolation has somehow gotten
slower, or simply because there are more hits and edits being processed
every day, such that our headroom is going down?


Re: 20071018 dumps have more problems. "United States" does not render.

Aryeh Gregor
On 10/26/07, Steve Summit <[hidden email]> wrote:
> The question is, how true is it that "almost every
> very-high-traffic page on Wikipedia is having extreme problems
> right now".  I suspect not, but if so, is it because there are
> more pages with say, heavy use of the {cite} template, or because
> templates like {cite} have gotten more complicated, or because
> template interpolation has somehow gotten slower, or simply
> because there are more hits and edits being processed every day,
> such that our headroom is going down?

Well, whatever the problem is, I suspect I know one way that would fix
it: rewriting the parser in C(++).  Unfortunately, that's a whole lot
easier said than done.  Rewriting even part of it, though, say
replaceVariables, might be a big benefit.

For now it might be best to refine our heuristics of what's slow to
render.  Currently we use a simple text-length heuristic, but perhaps
it would make more sense to incorporate additional criteria.  Maximum
number of template inclusions?  Maximum template depth?  It would
require testing to see what would be effective.
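
As a rough illustration of what an extra criterion could look like (nothing
below is existing MediaWiki code -- the function, the pattern and the limits
are invented for the example, and real template depth can't be read off the
source text without running the preprocessor):

    // Hypothetical pre-render heuristic combining cheap criteria.
    function looksTooExpensiveToRender( $wikitext ) {
        $maxLength     = 500 * 1024;  // example text-length cap
        $maxInclusions = 500;         // example cap on template inclusions
        if ( strlen( $wikitext ) > $maxLength ) {
            return true;
        }
        // Crude inclusion count: every "{{" that doesn't start a parser function.
        $inclusions = preg_match_all( '/\{\{(?!#)/', $wikitext, $m );
        return $inclusions > $maxInclusions;
    }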


Re: 20071018 dumps have more problems. "United States" does not render.

Thomas Dalton
> Well, whatever the problem is, I suspect I know one way that would fix
> it: rewriting the parser in C(++).  Unfortunately, that's a whole lot
> easier said than done.  Rewriting even part of it, though, say
> replaceVariables, might be a big benefit.

Working out what the parser is actually meant to do would be required
first, though.  At the moment it does what it does, and that's the best
anyone can say.  Trying to translate that idiosyncratic behaviour into
a new language would be a nightmare.

> For now it might be best to refine our heuristics of what's slow to
> render.  Currently we use a simple text-length heuristic, but perhaps
> it would make more sense to incorporate additional criteria.  Maximum
> number of template inclusions?  Maximum template depth?  It would
> require testing to see what would be effective.

I suspect depth would be the best one to try.  People can tell by
looking at an article's source how many templates there are, and can
keep that under control.  Telling how deep templates go is often
impossible for anyone who isn't an expert on MediaWiki template
syntax, so they could easily end up with hundreds of templates being
processed without noticing.


Re: 20071018 dumps have more problems. "United States" does not render.

Steve Sanbeg
In reply to this post by Aryeh Gregor
On Fri, 26 Oct 2007 14:09:38 -0400, Simetrical wrote:

> On 10/26/07, Steve Summit <[hidden email]>
> wrote:
>> The question is, how true is it that "almost every very-high-traffic
>> page on Wikipedia is having extreme problems right now".  I suspect not,
>> but if so, is it because there are more pages with say, heavy use of the
>> {cite} template, or because templates like {cite} have gotten more
>> complicated, or because template interpolation has somehow gotten
>> slower, or simply because there are more hits and edits being processed
>> every day, such that our headroom is going down?
>
> Well, whatever the problem is, I suspect I know one way that would fix it:
> rewriting the parser in C(++).  Unfortunately, that's a whole lot easier
> said than done.  Rewriting even part of it, though, say replaceVariables,
> might be a big benefit.
>

I'm not sure simply porting to a different language would have such a huge
effect, and it certainly isn't easy with a grammar that's not well defined.
Currently, even if you were to render a large plain-text page with no
markup, MW would still have to make about a dozen passes over the text to
determine that there's really nothing to do; that's going to be slow, no
matter what language it's done in.  I think a much simpler interpreted
parser would beat a complex compiled one, unless you're dealing with small
pages where the initial overhead is significant.

> For now it might be best to refine our heuristics of what's slow to
> render.  Currently we use a simple text-length heuristic, but perhaps it
> would make more sense to incorporate additional criteria.  Maximum number
> of template inclusions?  Maximum template depth?  It would require testing
> to see what would be effective.

I don't think text length is a very accurate measure; we definitely need
something better.  Also, I think a big part of the problem is the
parser functions: they tend to first expand every template passed into
them and only then decide which one to keep.  Deferring that expansion,
which could be done by adding a keyword to each nested template call,
should help there, although there may be a better way.
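
A toy illustration of the eager-versus-deferred difference being described
(expandTemplates() here is just a stand-in for the real, expensive expansion
step; none of this is actual parser code):

    // Stand-in for the real (expensive) template expansion step.
    function expandTemplates( $src ) {
        return $src;  // imagine a full, slow expansion here
    }

    // Eager, roughly what is complained about above: both branches are
    // expanded, then one result is thrown away.
    function ifEager( $cond, $thenSrc, $elseSrc ) {
        $then = expandTemplates( $thenSrc );
        $else = expandTemplates( $elseSrc );
        return trim( $cond ) !== '' ? $then : $else;
    }

    // Deferred: pick the branch first, expand only that one.
    function ifDeferred( $cond, $thenSrc, $elseSrc ) {
        $src = trim( $cond ) !== '' ? $thenSrc : $elseSrc;
        return expandTemplates( $src );
    }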




Re: 20071018 dumps have more problems. "United States" does not render.

Aryeh Gregor
On 10/26/07, Steve Sanbeg <[hidden email]> wrote:
> I'm not sure simply porting to a different language would have such a huge
> affect, and certainly isn't easy with a grammar that's not well defined.
> Currently, even if you were to render a large plain-text page with no
> markup, MW would still have to make about dozen passes over the text to
> determine that there's really nothing to do; that's going to be slow, no
> matter what language it's done in.

That depends on a number of things.  Twelve passes in C is certainly a
*lot* faster than twelve passes in PHP.  Remember that the difference
engine used to be one of the slowest components of MediaWiki, until it
was rewritten (using an identical algorithm) in C++ -- now it's far
faster than rendering the exact same page.

> I think a much simpler interpreted
> parser would beat a complex compiled one, unless you're dealing with small
> pages where initial overhead is significant.

Tim once remarked to me on IRC that he suspected a one-pass PHP parser
would be slower than our current one, simply because the current one
avoids going through each character in PHP.  Something like preg_split
is fast precisely because it's executed in C: then PHP only has to
deal with ten or twenty or two hundred chunks of text, rather than a
hundred thousand individual characters.
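
To make the preg_split point concrete (a toy example -- the pattern is
invented for illustration and is nothing like the parser's real one):

    // One preg_split call, executed in C, hands PHP a handful of chunks
    // instead of a PHP-level loop over every character.
    $wikitext = "Some text {{Infobox}} more text [[Link]] end.";

    $chunks = preg_split(
        '/(\{\{|\}\}|\[\[|\]\])/',   // split on markup-significant tokens
        $wikitext,
        -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
    );

    foreach ( $chunks as $chunk ) {
        // ... dispatch on whether $chunk is a delimiter or plain text ...
    }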

> I don't think the text length is very accurate; we definitely need
> something better.  Also, I think a big part of the problem is with the
> parser functions; they tend to first expand every template passed into
> them, then decide which one to keep.  Deferring that expansion, which
> could be done by adding a keyword to each nested template call, should
> help there, although there may be a better way.

Well, if the expansion is deferred that should be decided by the
individual parser function, not by the call syntax for the template.
Either way, I think some more careful benchmarking is needed here
before anyone can say what limits are best to add.  One thing that's
for sure is that it's the templates/conditionals specifically that are
the problem, not refs or links or whatever: replaceVariables takes up
something like 50% of CPU time now, or what?  There are charts around
somewhere.


Re: 20071018 dumps have more problems. "United States" does not render.

Steve Sanbeg
On Fri, 26 Oct 2007 15:05:44 -0400, Simetrical wrote:

> On 10/26/07, Steve Sanbeg <[hidden email]> wrote:
>> I'm not sure simply porting to a different language would have such a
>> huge affect, and certainly isn't easy with a grammar that's not well
>> defined. Currently, even if you were to render a large plain-text page
>> with no markup, MW would still have to make about dozen passes over the
>> text to determine that there's really nothing to do; that's going to be
>> slow, no matter what language it's done in.
>
> That depends on a number of things.  Twelve passes in C is certainly a
> *lot* faster than twelve passes in PHP.  Remember that the difference
> engine used to be one of the slowest components of MediaWiki, until it was
> rewritten (using an identical algorithm) in C++ -- now it's far faster
> than rendering the exact same page.
>

My own experience with Perl and C hasn't shown such dramatic differences,
and some operations scale linearly with the number of passes.  I was
assuming PHP would be similar, although I haven't benchmarked the
differences in language or number of passes for this.

>> I think a much simpler interpreted
>> parser would beat a complex compiled one, unless you're dealing with
>> small pages where initial overhead is significant.
>
> Tim once remarked to me on IRC that he suspected a one-pass PHP parser
> would be slower than our current one, simply because the current one
> avoids going through each character in PHP.  Something like preg_split is
> fast precisely because it's executed in C: then PHP only has to deal with
> ten or twenty or two hundred chunks of text, rather than a hundred
> thousand individual characters.
>

The number of individual characters that are significant to wiki markup is
actually fairly small.  Changing it to one pass would significantly alter
the language in a lot of cases.  But I still think if we could do it in
three or so passes it would be faster, even if we did have to deal with
dozens, or even hundreds, of individual characters.

>> I don't think the text length is very accurate; we definitely need
>> something better.  Also, I think a big part of the problem is with the
>> parser functions; they tend to first expand every template passed into
>> them, then decide which one to keep.  Deferring that expansion, which
>> could be done by adding a keyword to each nested template call, should
>> help there, although there may be a better way.
>
> Well, if the expansion is deferred that should be decided by the
> individual parser function, not by the call syntax for the template.
> Either way, I think some more careful benchmarking is needed here before
> anyone can say what limits are best to add.  One thing that's for sure is
> that it's the templates/conditionals specifically that are the problem,
> not refs or links or whatever: replaceVariables takes up something like
> 50% of CPU time now, or what?  There are charts around somewhere.

Yes, certainly variable replacement.  I think it's clear that
something like {{#if:{{a}}|{{defer:b}}|{{defer:c}}}} would be more
efficient than {{#if:{{a}}|{{b}}|{{c}}}}.  If that behavior were implicit
in #if, rather than adding a new modifier and plugging it into all the
templates, so much the better.

I agree that there should be benchmarking to suggest new limits.  Really,
we should have a cost per transclusion/function, which could vary by
function, that the caller would be charged.  This would address the issue
much more accurately.  The side effect might be that large classes
of those spaghetti templates become inoperable.
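
A sketch of the kind of cost accounting meant here -- the class, the cost
table and the budget figure are all invented for the example, not anything
that exists in MediaWiki:

    // Per-call expansion budget; the caller is "charged" for each expansion.
    class ExpansionBudget {
        private $spent = 0;
        private $limit;
        private $costs = array(
            'transclusion' => 1,  // plain {{Template}} call
            '#if'          => 2,  // parser functions cost more
            '#switch'      => 4,
        );

        public function __construct( $limit = 1000 ) {
            $this->limit = $limit;
        }

        // Returns false once the budget is exhausted ("stop expanding").
        public function charge( $kind ) {
            $cost = isset( $this->costs[$kind] ) ? $this->costs[$kind] : 1;
            $this->spent += $cost;
            return $this->spent <= $this->limit;
        }
    }

    // Usage sketch inside an expansion loop:
    //   $budget = new ExpansionBudget( 1000 );
    //   if ( !$budget->charge( '#if' ) ) { /* refuse further expansion */ }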



Re: 20071018 dumps have more problems. "United States" does not render.

Aryeh Gregor
On 10/26/07, Steve Sanbeg <[hidden email]> wrote:

> On Fri, 26 Oct 2007 15:05:44 -0400, Simetrical wrote:
>
> > On 10/26/07, Steve Sanbeg <[hidden email]> wrote:
> > That depends on a number of things.  Twelve passes in C is certainly a
> > *lot* faster than twelve passes in PHP.  Remember that the difference
> > engine used to be one of the slowest components of MediaWiki, until it was
> > rewritten (using an identical algorithm) in C++ -- now it's far faster
> > than rendering the exact same page.
> >
>
> My own experiences with perl & C haven't shown such dramatic differences,
> and that some operations scale linearly with the number of passes. I
> was assuming PHP would be similar, although I haven't benchmarked
> differences in language or passes for this.

It really depends on what you're doing.  If you're doing some simple
regex processing of input data, almost all the heavy lifting is done in C
anyway.  But the Parser is 5000 lines of PHP code, the most troublesome parts
of which are called repeatedly for complicated templates.  Computation
tends to be between ten and a hundred times faster in C than in
interpreted languages, according to various benchmarks, depending on
the exact task.  The differences in performance when using wikidiff2
versus the built-in diff engine aren't made up.

Of course, there would be many other possible parser optimizations.
If templates inserted HTML rather than wikitext, for instance, they
could be cached separately from the including articles, so that a
header or infobox template wouldn't need to be rerendered every time
there was a change to article content.  But that would be a major
change to functionality, I suspect.
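
A toy sketch of that fragment-caching idea, assuming it runs inside MediaWiki
where $wgMemc and wfMemcKey() are available; renderTemplateToHtml() is a
made-up stand-in for "render this template to HTML on its own":

    // Made-up stand-in for rendering a template to HTML in isolation.
    function renderTemplateToHtml( $name, array $args ) {
        return '<div class="tpl">...</div>';
    }

    // Cache the rendered HTML keyed by template name and arguments, so an
    // infobox isn't re-expanded for every article edit that includes it.
    function getTemplateHtml( $name, array $args ) {
        global $wgMemc;
        $key  = wfMemcKey( 'tpl-html', md5( $name . serialize( $args ) ) );
        $html = $wgMemc->get( $key );
        if ( $html === false ) {
            $html = renderTemplateToHtml( $name, $args );
            $wgMemc->set( $key, $html, 3600 );  // cache for an hour
        }
        return $html;
    }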

> The number of individual characters that are significant to wiki markup is
> actually fairly small.  Changing it to one pass would significantly alter
> the language in a lot of cases.  But I still think if we could do it in
> three or so passes it would be faster, even if we did have to deal with
> dozens, or even hundreds, of individual characters.

So preg_split on every significant character, and iterate through each
of those?  Maybe.  I'm really overstepping my expertise by venturing
to comment much here.

> The side effect might be that large classes
> of those spaghetti templates become inoperable.

Which is really the idea, isn't it?  It's not what I'd call a side
effect; the point is to kill them.


Re: 20071018 dumps have more problems. "United States" does not render.

Rolf Lampa [RIL]
In reply to this post by Aryeh Gregor
Simetrical wrote:

> Well, if the expansion is deferred that should be decided by the
> individual parser function, not by the call syntax for the template.
> Either way, I think some more careful benchmarking is needed here
> before anyone can say what limits are best to add.  One thing that's
> for sure is that it's the templates/conditionals specifically that are
> the problem, not refs or links or whatever: replaceVariables takes up
> something like 50% of CPU time now, or what?  There are charts around
> somewhere.

Interestingly enough, I'm coding the ReplaceVariables step in Delphi
Pascal right now, using highly optimized code.  Although I'm implementing
it "my way", I'll soon be able to produce "any metrics" about exactly what
is time-consuming in which template, on which page, across the entire enWP.

I can already say (after having profiled it "hundreds of times") that
you are perfectly right: the "links or whatever" are NOT taking
much CPU at all, especially not in comparison with ReplaceVariables.

My dump processor expands templates (most of them), but does not parse
HTML, at least not yet.

Regards,

// Rolf Lampa

