Transcluding non-text content as HTML on wikitext pages

classic Classic list List threaded Threaded
31 messages Options
12
Reply | Threaded
Open this post in threaded view
|

Re: Transcluding non-text content as HTML on wikitext pages

Subramanya Sastry
On 05/19/2014 04:52 AM, Daniel Kinzler wrote:
> I'm getting the impression there is a fundamental misunderstanding here.

You are correct. I completely misunderstood what you said in your last
response about expandtemplates. So, the rest of my response to your last
email is irrelevant ... and let me reboot :-).

>> All that said, if you want to provide the wrapper with <html model="whatever"
>> ....>fully-expanded-HTML</html>, we can handle that as well. We'll use the model
>> attribute of the wrapper, discard the wrapper and use the contents in our pipeline.
> Why use the model attribute? Why would you care about the original model? All
> you need to know is that you'll get HTML. Exposing the original model in this
> context seems useless if not misleading.
Given that I misunderstood your larger observation about
expandtemplates, this is not relevant now. But, I was basing this on
your proposal from the previous email which I'll now go back to.

On 05/17/2014 06:14 PM, Daniel Kinzler wrote:
> I think something like <html transclusion="{{T}}"
> model="whatever">...</html> would work best.

I see what you are getting at here. Parsoid can treat this like a
regular tag-extension and send it back to the api=parse endpoint for
processing. Except if you provided the full expansion as the content of
the html-wrapper in which case the extra api call can be skipped. The
extra api call is not really an issue for occasional uses, but on pages
with a lot of non-wikitext transclusion uses, this is an extra api call
for each such use. I don't have a sense for how common this would be, so
maybe that is a premature worry.

That said, for other clients, this content would be deadweight (if they
are going to discard it and go back to the api=parse endpoint anyway or
worse send back the entire response to the parser that is going to just
discard it after the network transfer).

So, looks like there are some conflicting perf. requirements for
different clients wrt expandtemplates response here. In that context, at
least from a solely parsoid-centric point of view, the new api endpoint
'expand=Foo|x|y|z' you proposed would work well as well.

Subbu.

>> A separate property in the JSON/XML structure avoids the need for escaping
>> (and associated security risks if not done thoroughly), and should be
>> relatively straightforward to implement and consume.
> As explained above, I do not see how this would work except for the very special
> case of using expandtemplates to expand just a single template. This could be
> solved by introducing a new, single template mode for expandtemplates, e.g.
> using expand="Foo|x|y|z" instead of text="{{Foo|x|y|z}}".
>
> Another way would be to use hints the structure returned by generatexml. There,
> we have an opportunity to declare a content type for a *part* of the output (or
> rather, input).

_______________________________________________
Wikitech-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Reply | Threaded
Open this post in threaded view
|

Re: Transcluding non-text content as HTML on wikitext pages

Daniel Kinzler
Am 19.05.2014 14:21, schrieb Subramanya Sastry:
> On 05/19/2014 04:52 AM, Daniel Kinzler wrote:
>> I'm getting the impression there is a fundamental misunderstanding here.
>
> You are correct. I completely misunderstood what you said in your last response
> about expandtemplates. So, the rest of my response to your last email is
> irrelevant ... and let me reboot :-).

Glad we got that out of the way :)

> On 05/17/2014 06:14 PM, Daniel Kinzler wrote:
>> I think something like <html transclusion="{{T}}" model="whatever">...</html>
>> would work best.
>
> I see what you are getting at here. Parsoid can treat this like a regular
> tag-extension and send it back to the api=parse endpoint for processing.
> Except
> if you provided the full expansion as the content of the html-wrapper in which
> case the extra api call can be skipped. The extra api call is not really an
> issue for occasional uses, but on pages with a lot of non-wikitext transclusion
> uses, this is an extra api call for each such use. I don't have a sense for how
> common this would be, so maybe that is a premature worry.

I would probably go for always including the expanded HTML for now.

> That said, for other clients, this content would be deadweight (if they are
> going to discard it and go back to the api=parse endpoint anyway or worse send
> back the entire response to the parser that is going to just discard it after
> the network transfer).

Yes. There could be an option to omit it. That makes the implementation more
complex, but it's doable.

> So, looks like there are some conflicting perf. requirements for different
> clients wrt expandtemplates response here. In that context, at least from a
> solely parsoid-centric point of view, the new api endpoint 'expand=Foo|x|y|z'
> you proposed would work well as well.

That seems the cleanest solution for the parsoid use case - however, the
implementation is complicated by how parameter substitution works. For HTML
based transclusion, it doesn't work at all at the moment - we would need tighter
integration with the preprocessor for doing that.

Basically, there would be two cases: convert expand=Foo|x|y|z to {{Foo|x|y|z}}
internally an call Parser::preprocess on that, so parameter subsitution is done
correctly; or get the HTML from Foo, and discard the parameters. We would have
to somehow know in advance which mode to use, handle the appropriate case, and
then set the Content-Type header accordingly. Pretty messy...

I think <html transclusion="{{T}}"> is the simplest and most robust solution for
now.

-- daniel


_______________________________________________
Wikitech-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Reply | Threaded
Open this post in threaded view
|

Re: Transcluding non-text content as HTML on wikitext pages

Gabriel Wicke-3
In reply to this post by Daniel Kinzler
On 05/19/2014 09:52 AM, Daniel Kinzler wrote:

> Am 18.05.2014 16:29, schrieb Gabriel Wicke:
>> The difference between wrapper and property is actually that using inline
>> wrappers in the returned wikitext would force us to escape similar wrappers
>> from normal template content to avoid opening a gaping XSS hole.
>
> Please explain, I do not see the hole you mention.
>
> If the input contained <html>evil stuff</html>, it would just get escaped by the
> preprocessor (unless $wgRawHtml is enabled), as it is now:
> https://de.wikipedia.org/w/api.php?action=expandtemplates&text=%3Chtml%3E%3Cscript%3Ealert%28%27evil%27%29%3C/script%3E%3C/html%3E

What you see there is just unescaped HTML embedded in the XML result format.
It's clearer that there's in fact no escaping on the HTML when looking at
the JSON:

https://de.wikipedia.org/w/api.php?action=expandtemplates&text=%3Chtml%3E%3Cscript%3Ealert%28%27evil%27%29%3C/script%3E%3C/html%3E&format=json

Parsoid depends on there being no escaping for unknown tags (and known
extension tags) in the preprocessor.

So if you use tags, you'll have to add escaping for those.

The move to HTML-based (self-contained) transclusions expansions will avoid
this issue completely. That's a few months out though. Maybe we can find a
stop-gap solution that moves in that direction, without introducing special
tags in expandtemplates that we'll have to support for a long time.

Gabriel

_______________________________________________
Wikitech-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Reply | Threaded
Open this post in threaded view
|

Re: Transcluding non-text content as HTML on wikitext pages

Gabriel Wicke-3
On 05/19/2014 04:54 PM, Gabriel Wicke wrote:
> The move to HTML-based (self-contained) transclusions expansions will avoid
> this issue completely. That's a few months out though. Maybe we can find a
> stop-gap solution that moves in that direction, without introducing special
> tags in expandtemplates that we'll have to support for a long time.

Here's a proposal:

* Introduce a <domparse> extension tag that causes its content to be parsed
all the way to a self-contained DOM structure. Example:
<domparse>{{T}}</domparse>

* Emit this tag for HTML page transclusions. Avoids the security issue as
there's no way to inject verbatim HTML. Works with Parsoid out of the box.

* Use <domparse> to support parsing unbalanced templates by inserting it
into wikitext:
<domparse>
{{table-start}}
{{table-row}}
{{table-end}}
</domparse>

* Build a solid HTML-only expansion API end point, and start using that for
all transclusions that are not wrapped in <domparse>

* Stop wrapping non-wikitext transclusions into <domparse> in
action=expandtemplates once those can be directly expanded to a
self-contained DOM.

Gabriel

_______________________________________________
Wikitech-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Reply | Threaded
Open this post in threaded view
|

Re: Transcluding non-text content as HTML on wikitext pages

Gabriel Wicke-3
On 05/19/2014 10:19 AM, Gabriel Wicke wrote:

> On 05/19/2014 04:54 PM, Gabriel Wicke wrote:
>> The move to HTML-based (self-contained) transclusions expansions will avoid
>> this issue completely. That's a few months out though. Maybe we can find a
>> stop-gap solution that moves in that direction, without introducing special
>> tags in expandtemplates that we'll have to support for a long time.
>
> Here's a proposal:
>
> * Introduce a <domparse> extension tag that causes its content to be parsed
> all the way to a self-contained DOM structure. Example:
> <domparse>{{T}}</domparse>
>
> * Emit this tag for HTML page transclusions. Avoids the security issue as
> there's no way to inject verbatim HTML. Works with Parsoid out of the box.
>
> * Use <domparse> to support parsing unbalanced templates by inserting it
> into wikitext:
> <domparse>
> {{table-start}}
> {{table-row}}
> {{table-end}}
> </domparse>
>
> * Build a solid HTML-only expansion API end point, and start using that for
> all transclusions that are not wrapped in <domparse>
>
> * Stop wrapping non-wikitext transclusions into <domparse> in
> action=expandtemplates once those can be directly expanded to a
> self-contained DOM.

Here's a possible division of labor:

You (Daniel) could start with the second step (emitting the tag). Since not
much escaping is needed (only nested <domparse> tags in the transclusion)
this should be fairly straightforward.

We could work on the extension implementation (first bullet point) together,
or tackle it completely on the Parsoid side. We planned to work on this in
any case as part of our longer-term migration to well-balanced HTML
transclusions.

The advantage of using <domparse> to support both unbalanced templates &
special transclusions is that we'll only have to implement this once, and
won't introduce another tag only to deprecate it fairly quickly. Phasing out
unbalanced templates will take longer, as we'll first have to come up with
alternative means to support the same use cases.

Gabriel

_______________________________________________
Wikitech-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Reply | Threaded
Open this post in threaded view
|

Re: Transcluding non-text content as HTML on wikitext pages

Bartosz Dziewoński
In reply to this post by Gabriel Wicke-3
I am kind of lost in this discussion, but let me just ask one question.

Won't all of the proposed solutions, other than the one of just not expanding transclusions that can't be expanded to wikitext, break the original and primary purpose of ExpandTemplates: providing valid parsable wikitext, for understanding by humans and for pasting back into articles in order to bypass transclusion limits?

I feel that Parsoid should be using a separate API for whatever it's doing with the wikitext. I'm sure that would give you more flexibility with internal design as well.

--
Matma Rex

_______________________________________________
Wikitech-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Reply | Threaded
Open this post in threaded view
|

Re: Transcluding non-text content as HTML on wikitext pages

Gabriel Wicke-3
On 05/19/2014 10:55 AM, Bartosz Dziewoński wrote:
> I am kind of lost in this discussion, but let me just ask one question.
>
> Won't all of the proposed solutions, other than the one of just not
> expanding transclusions that can't be expanded to wikitext, break the
> original and primary purpose of ExpandTemplates: providing valid parsable
> wikitext, for understanding by humans and for pasting back into articles in
> order to bypass transclusion limits?

Yup. But that's the case with <domparse>, while it's not the case with
<html> unless $wgRawHtml is true (which is impossible for publicly-editable
wikis).

> I feel that Parsoid should be using a separate API for whatever it's doing
> with the wikitext. I'm sure that would give you more flexibility with
> internal design as well.

We are moving towards that, but will still need to support unbalanced
transclusions for a while. Since special transclusions can be nested inside
of those we will need some form of inline support even if we expand most
transclusions all the way to DOM with a different end point. Also, as Daniel
pointed out, most other users are using action=expandtemplates for entire
pages and expect that to work as well.

Gabriel

_______________________________________________
Wikitech-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Reply | Threaded
Open this post in threaded view
|

Re: Transcluding non-text content as HTML on wikitext pages

Daniel Kinzler
Am 19.05.2014 20:01, schrieb Gabriel Wicke:

> On 05/19/2014 10:55 AM, Bartosz Dziewoński wrote:
>> I am kind of lost in this discussion, but let me just ask one question.
>>
>> Won't all of the proposed solutions, other than the one of just not
>> expanding transclusions that can't be expanded to wikitext, break the
>> original and primary purpose of ExpandTemplates: providing valid parsable
>> wikitext, for understanding by humans and for pasting back into articles in
>> order to bypass transclusion limits?
>
> Yup. But that's the case with <domparse>, while it's not the case with
> <html> unless $wgRawHtml is true (which is impossible for publicly-editable
> wikis).

<html transclusion="{{T}}"> would work transparently. It would contain HTML, for
direct use by the client, and could be passed back to the parser, which would
ignore the HTML and execute the transclusion. It should be 100% compatible with
existing clients (unless the look for verbatim "<html>" for some reason).

I'll have to re-read Gabriel's <domparse> proposal tomorrow - right now, I don't
see why it would be necessary, or how it would improve the situation.

>> I feel that Parsoid should be using a separate API for whatever it's doing
>> with the wikitext. I'm sure that would give you more flexibility with
>> internal design as well.
>
> We are moving towards that, but will still need to support unbalanced
> transclusions for a while.

But for HTML based transclusions you could ignore that - you could already
resolve these using a separate API call, if needed.

But still - I do not see why that would be necessary. If expandtemplates returns
<html transclusion="{{T}}">, clients can pass that back to the parser safely, or
use the contained HTML directly, safely.

Parsoid would keep working as before: it would treat <html> as a tag extension
(it does that, right?) and pass it back to the parser (which would expand it
again, this time fully, if action=parse is used). If parsoid knows about the
special properties of <html>, it could just use the contents verbatim - I see no
reason why that would be any more unsafe as any other HTML returned by the parser.

But perhaps I'm missing something obvious. I'll re-read the proposal tomorrow.

-- daniel


_______________________________________________
Wikitech-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Reply | Threaded
Open this post in threaded view
|

Re: Transcluding non-text content as HTML on wikitext pages

Gabriel Wicke-3
On 05/19/2014 12:46 PM, Daniel Kinzler wrote:

> Am 19.05.2014 20:01, schrieb Gabriel Wicke:
>> On 05/19/2014 10:55 AM, Bartosz Dziewoński wrote:
>>> I am kind of lost in this discussion, but let me just ask one question.
>>>
>>> Won't all of the proposed solutions, other than the one of just not
>>> expanding transclusions that can't be expanded to wikitext, break the
>>> original and primary purpose of ExpandTemplates: providing valid parsable
>>> wikitext, for understanding by humans and for pasting back into articles in
>>> order to bypass transclusion limits?
>>
>> Yup. But that's the case with <domparse>, while it's not the case with
>> <html> unless $wgRawHtml is true (which is impossible for publicly-editable
>> wikis).
>
> <html transclusion="{{T}}"> would work transparently. It would contain HTML, for
> direct use by the client, and could be passed back to the parser, which would
> ignore the HTML and execute the transclusion. It should be 100% compatible with
> existing clients (unless the look for verbatim "<html>" for some reason).

Currently <html> tags are escaped when $wgRawHtml is disabled. We could
change the implementation to stop doing so *iff* the transclusion parameter
is supplied, but IMO that would be fairly unexpected and inconsistent behavior.

>>> I feel that Parsoid should be using a separate API for whatever it's doing
>>> with the wikitext. I'm sure that would give you more flexibility with
>>> internal design as well.
>>
>> We are moving towards that, but will still need to support unbalanced
>> transclusions for a while.
>
> But for HTML based transclusions you could ignore that - you could already
> resolve these using a separate API call, if needed.

Yes, and they are going to be the common case once we have marked up the
exceptions with tags like <domparse>. As you correctly pointed out, inline
tags are primarily needed for expandtemplates calls on compound content,
which we need to do as long as we support unbalanced templates. We can't
know a priori whether some transclusions in turn transclude special HTML
content.

I think we have agreement that some kind of tag is still needed. The main
point still under discussion is on which tag to use, and how to implement
this tag in the parser.

Originally, <domparse> was conceived to be used in actual page content to
wrap wikitext that is supposed to be parsed to a balanced DOM *as a unit*
rather than transclusion by transclusion. Once unbalanced compound
transclusion content is wrapped in <domparse> tags (manually or via bots
using Parsoid info), we can start to enforce nesting of all other
transclusions by default. This will make editing safer and more accurate,
and improve performance by letting us reuse expansions and avoid
re-rendering the entire page during refreshLinks. See
https://bugzilla.wikimedia.org/show_bug.cgi?id=55524 for more background.

The use of <domparse> to mark up special HTML transclusions in
expandtemplates output will be temporary (until HTML transclusions are the
default), but even if such output is pasted into the actual wikitext it
would be harmless, and would work as expected.

Now back to the syntax. Encoding complex transclusions in a HTML parameter
would be rather cumbersome, and would entail a lot of attribute-specific
escaping. Wrapping such transclusions in <domparse> tags on the other hand
normally does not entail any escaping, as only nested <domparse> tags are
problematic.

> Parsoid would keep working as before: it would treat <html> as a tag extension
> (it does that, right?)

$wgRawHtml is disabled in all wikis we are currently interested in.
MediaWiki does properly report the <html> extension tag from siteinfo when
$wgRawHtml is enabled, so it ought to work with Parsoid for private wikis.
It will be harder to support the <html
transclusion="<transclusions>"></html> exception.

Gabriel

_______________________________________________
Wikitech-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Reply | Threaded
Open this post in threaded view
|

Re: Transcluding non-text content as HTML on wikitext pages

Daniel Kinzler
Am 19.05.2014 23:05, schrieb Gabriel Wicke:
> I think we have agreement that some kind of tag is still needed. The main
> point still under discussion is on which tag to use, and how to implement
> this tag in the parser.

Indeed.

> Originally, <domparse> was conceived to be used in actual page content to
> wrap wikitext that is supposed to be parsed to a balanced DOM *as a unit*
> rather than transclusion by transclusion. Once unbalanced compound
> transclusion content is wrapped in <domparse> tags (manually or via bots
> using Parsoid info), we can start to enforce nesting of all other
> transclusions by default. This will make editing safer and more accurate,
> and improve performance by letting us reuse expansions and avoid
> re-rendering the entire page during refreshLinks. See
> https://bugzilla.wikimedia.org/show_bug.cgi?id=55524 for more background.


Ah, I though you just pulled that out of your hat :)

My main reason for recycling the <html> tag was to not introduce a new tag
extension. <domparse> may occur verbatim in existing wikitext, and would break
when the tag is introduces.

Other than that, I'm find with outputting whatever tag you like for the
transclusion. Implementing the tag is something else, though - I could implement
it so it will work for HTML transclusion, but I'm not sure I understand the
original domparse stuff well enough to get that right. Would domparse be in
core, btw?


> Now back to the syntax. Encoding complex transclusions in a HTML parameter
> would be rather cumbersome, and would entail a lot of attribute-specific
> escaping.

Why would it involve any escaping? It should be handled as a tag extension, like
any other.

> $wgRawHtml is disabled in all wikis we are currently interested in.
> MediaWiki does properly report the <html> extension tag from siteinfo when
> $wgRawHtml is enabled, so it ought to work with Parsoid for private wikis.
> It will be harder to support the <html
> transclusion="<transclusions>"></html> exception.

I should try what expandtemplates does with <html> with $wgRawHtml enabled.
Nothing, probably. It will just come back containing raw HTML. Which would be
fine, I think.

By the way: once we agree on a mechanism, it would be trivial to use the same
mechanism for special page transclusion. My patch actually already covers that.
Do you agree that this is the Right Thing? It's just transclusion of HTML
content, after all.

-- daniel


_______________________________________________
Wikitech-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Reply | Threaded
Open this post in threaded view
|

Re: Transcluding non-text content as HTML on wikitext pages

Gabriel Wicke-3
On 05/20/2014 02:46 AM, Daniel Kinzler wrote:
> My main reason for recycling the <html> tag was to not introduce a new tag
> extension. <domparse> may occur verbatim in existing wikitext, and would break
> when the tag is introduces.

The only existing mentions of this are probably us discussing it ;) In any
case, it's easy to grep for it & nowikify existing uses.

> Other than that, I'm find with outputting whatever tag you like for the
> transclusion.

Great!

> Implementing the tag is something else, though - I could implement
> it so it will work for HTML transclusion, but I'm not sure I understand the
> original domparse stuff well enough to get that right. Would domparse be in
> core, btw?

Yes, it should be in core. I believe that a very simple implementation
(without actual DOM balancing, using Parser::recursiveTagParse()) would not
be too hard. The guts of it are described in [1]. The limitations of
recursiveTagParse should not matter much for this use case.

>> Now back to the syntax. Encoding complex transclusions in a HTML parameter
>> would be rather cumbersome, and would entail a lot of attribute-specific
>> escaping.
>
> Why would it involve any escaping? It should be handled as a tag extension, like
> any other.

Transclusions can contain quotes, which need to be escaped in attribute
values to make sure that the attribute is in fact an attribute. Since quotes
tend to be more common than <domparse> tags this means that there's going to
be more escaping. I also find it harder to scan for quotes ending a long
attribute value. Tags are easier to spot.

>> $wgRawHtml is disabled in all wikis we are currently interested in.
>> MediaWiki does properly report the <html> extension tag from siteinfo when
>> $wgRawHtml is enabled, so it ought to work with Parsoid for private wikis.
>> It will be harder to support the <html
>> transclusion="<transclusions>"></html> exception.
>
> I should try what expandtemplates does with <html> with $wgRawHtml enabled.
> Nothing, probably. It will just come back containing raw HTML. Which would be
> fine, I think.

Yes, that case will work. But $wgRawHtml enabled is the exception, and not
something I'd like to encourage.

> By the way: once we agree on a mechanism, it would be trivial to use the same
> mechanism for special page transclusion. My patch actually already covers that.
> Do you agree that this is the Right Thing? It's just transclusion of HTML
> content, after all.

Yes, that sounds good to me.

Gabriel

[1]:
https://www.mediawiki.org/wiki/Manual:Tag_extensions#How_do_I_render_wikitext_in_my_extension.3F

_______________________________________________
Wikitech-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
12