Transcluding non-text content as HTML on wikitext pages


Transcluding non-text content as HTML on wikitext pages

Daniel Kinzler
Hi all!

During the hackathon, I worked on a patch that would make it possible for
non-textual content to be included on wikitext pages using the template syntax.
The idea is that if we have a content handler that e.g. generates awesome
diagrams from JSON data, like the extension Dan Andreescu wrote, we want to be
able to use that output on a wiki page. But until now, that would have required
the content handler to generate wikitext for the transclusion - not easily done.

So, I came up with a way for ContentHandler to wrap the HTML generated by
another ContentHandler so it can be used for transclusion.

Have a look at the patch at <https://gerrit.wikimedia.org/r/#/c/132710/>. Note
that I have completely rewritten it since my first version at the hackathon.

It would be great to get some feedback on this, and have it merged soon, so we
can start using non-textual content to its full potential.

Here is a quick overview of the information flow. Let's assume we have a
"template" page T that is supposed to be transcluded on a "target" page P; the
template page uses the non-text content model X, while the target page is
wikitext. So:

* When Parser parses P, it encounters {{T}}
* Parser loads the Content object for T (an XContent object, for model X), and
calls getTextForTransclusion() on it, with CONTENT_MODEL_WIKITEXT as the target
format.
* getTextForTransclusion() calls getContentForTransclusion()
* getContentForTransclusion() calls convert( CONTENT_MODEL_WIKITEXT ) which
fails (because content model X doesn't provide a wikitext representation).
* getContentForTransclusion() then calls convertContentViaHtml()
* convertContentViaHtml() calls getTextForTransclusion( CONTENT_MODEL_HTML ) to
get the HTML representation.
* getTextForTransclusion() calls getContentForTransclusion(), which calls
convert(); this time the conversion succeeds, because convert() handles the
conversion to HTML by calling getHtml() directly.
* convertContentViaHtml() takes the HTML and calls makeContentFromHtml() on the
ContentHandler for wikitext.
* makeContentFromHtml() replaces the actual HTML with a parser strip mark, and
returns a WikitextContent containing this strip mark.
* The strip mark is eventually returned to the original Parser instance and
used to replace {{T}} on the original page.
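The flow above can be sketched in Python (the real code is PHP; method names mirror the patch, while the strip-mark mechanism is simplified to a dictionary and only the wikitext target handler is supported):

```python
# Illustrative sketch of the conversion chain for transcluding a
# non-text content model X onto a wikitext page.

CONTENT_MODEL_WIKITEXT = "wikitext"
CONTENT_MODEL_HTML = "html"

strip_marks = {}  # stand-in for the parser's strip-mark state

def register_strip_mark(html):
    mark = f"\x7fUNIQ-html-{len(strip_marks)}-QINU\x7f"
    strip_marks[mark] = html
    return mark

class TextContent:
    def __init__(self, text, model):
        self.text = text
        self.model = model

    def get_text(self):
        return self.text

class WikitextContentHandler:
    def make_content_from_html(self, html):
        # Replace the actual HTML with a strip mark; the parser restores
        # the HTML at the end of parsing, where {{T}} used to be.
        return TextContent(register_strip_mark(html), CONTENT_MODEL_WIKITEXT)

class XContent:
    """Content of the non-text model X: it can render itself as HTML,
    but has no wikitext representation."""

    def __init__(self, data):
        self.data = data

    def get_html(self):
        return f'<div class="diagram">{self.data}</div>'

    def convert(self, target_model):
        if target_model == CONTENT_MODEL_HTML:
            # Conversion to HTML works, via getHtml().
            return TextContent(self.get_html(), CONTENT_MODEL_HTML)
        return None  # no wikitext representation: conversion fails

    def get_content_for_transclusion(self, target_model):
        content = self.convert(target_model)
        if content is None:
            # Fall back to the HTML-based route.
            content = self.convert_content_via_html(target_model)
        return content

    def convert_content_via_html(self, target_model):
        html = self.get_text_for_transclusion(CONTENT_MODEL_HTML)
        # The real code looks up the ContentHandler for target_model;
        # this sketch supports only the wikitext target.
        return WikitextContentHandler().make_content_from_html(html)

    def get_text_for_transclusion(self, target_model):
        return self.get_content_for_transclusion(target_model).get_text()

# What the parser does when it encounters {{T}} on a wikitext page P:
mark = XContent("json data").get_text_for_transclusion(CONTENT_MODEL_WIKITEXT)
# `mark` is now a strip mark; strip_marks[mark] holds the rendered HTML.
```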

This essentially means that any content can be converted to HTML, and can be
transcluded into any content that provides an implementation of
makeContentFromHtml(). This actually changes how transclusion of JS and CSS
pages into wikitext pages works. You can try this out by transcluding a JS page
like MediaWiki:Test.js as a template on a wikitext page.


The old getWikitextForTransclusion() is now a shorthand for
getTextForTransclusion( CONTENT_MODEL_WIKITEXT ).


As Brion pointed out in a comment on my original patch, there is another
caveat: what should the expandtemplates module do when expanding non-wikitext
templates? I decided to just wrap the HTML in <html>...</html> tags instead of
using a strip mark in this case. The resulting wikitext is however only
"correct" if $wgRawHtml is enabled; otherwise, the HTML will get
mangled/escaped by wikitext parsing. This seems acceptable to me, but please
let me know if you have a better idea.


So, let me know what you think!
Daniel

_______________________________________________
Wikitech-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: Transcluding non-text content as HTML on wikitext pages

Brad Jorsch (Anomie)
On Tue, May 13, 2014 at 11:37 AM, Daniel Kinzler <[hidden email]> wrote:

> As Brion pointed out in a comment to my original, there is another caveat:
> what
> should the expandtemplates module do when expanding non-wikitext
> templates? I
> decided to just wrap the HTML in <html>...</html> tags instead of using a
> strip
> mark in this case. The resulting wikitext is however only "correct" if
> $wgRawHtml is enabled, otherwise, the HTML will get mangled/escaped by
> wikitext
> parsing. This seems acceptable to me, but please let me know if you have a
> better idea.
>

Just brainstorming:

To avoid the wikitext mangling, you could wrap it in some tag that works
like <html> if $wgRawHtml is set and <pre> otherwise.

Or one step further, maybe a tag <foo wikitext="{{P}}">html goes here</foo>
that parses just as {{P}} does (and ignores "html goes here" entirely),
which preserves the property that the output of expandtemplates will mostly
work when passed back to the parser.


--
Brad Jorsch (Anomie)
Software Engineer
Wikimedia Foundation

Re: Transcluding non-text content as HTML on wikitext pages

Matthew Flaschen-2
In reply to this post by Daniel Kinzler
On 05/13/2014 11:37 AM, Daniel Kinzler wrote:
> Hi all!
>
> During the hackathon, I worked on a patch that would make it possible for
> non-textual content to be included on wikitext pages using the template syntax.
> The idea is that if we have a content handler that e.g. generates awesome
> diagrams from JSON data, like the extension Dan Andreescu wrote, we want to be
> able to use that output on a wiki page. But until now, that would have required
> the content handler to generate wikitext for the transclusion - not easily done.

From working with Dan on this, the main issue is the ResourceLoader
module that the diagrams require (it uses a JavaScript library called
Vega, plus a couple of supporting libraries, and simple MW setup code).

The container element that it needs can be as simple as:

<div data-something="..."></div>

which is actually valid wikitext.

Can you outline how RL modules would be handled in the transclusion
scenario?

Matt Flaschen


Re: Transcluding non-text content as HTML on wikitext pages

Gabriel Wicke-3
In reply to this post by Daniel Kinzler
On 05/13/2014 05:37 PM, Daniel Kinzler wrote:
> Hi all!
>
> During the hackathon, I worked on a patch that would make it possible for
> non-textual content to be included on wikitext pages using the template syntax.
> The idea is that if we have a content handler that e.g. generates awesome
> diagrams from JSON data, like the extension Dan Andreescu wrote, we want to be
> able to use that output on a wiki page. But until now, that would have required
> the content handler to generate wikitext for the transclusion - not easily done.


It sounds like this won't work well with current Parsoid. We are using
action=expandtemplates for the preprocessing of transclusions, and then
parse the contents using Parsoid. The content is finally
passed through the sanitizer to keep XSS at bay.

This means that HTML returned from the preprocessor needs to be valid in
wikitext to avoid being stripped out by the sanitizer. Maybe that's actually
possible, but my impression is that you are shooting for something that's
closer to the behavior of a tag extension. Those already bypass the
sanitizer, so would be less troublesome in the short term. We currently also
can't process transclusions independently to HTML, as we still have to
support unbalanced templates. We are moving into that direction though,
which should also make it easier to support non-wikitext transclusion content.

In the longer term, Parsoid will request pre-sanitized and balanced HTML
from the content API [1,2] for everything but unbalanced wikitext content
[3]. The content API will treat it like any other request, and ask the
storage service for the HTML. If that's found, then it is directly returned
and no rendering happens. This is going to be the typical and fast case. If
there is however no HTML in storage for that revision the content API will
just call the renderer service and save the HTML back / return it to clients
like Parsoid.

So it is important to think of renderers as services, so that they are
usable from the content API and Parsoid. For existing PHP code this could
even be action=parse, but for new renderers without a need or desire to tie
themselves to MediaWiki internals I'd recommend to think of them as their
own service. This can also make them more attractive to third party
contributors from outside the MediaWiki world, as has for example recently
happened with Mathoid.

Gabriel

[1]: https://www.mediawiki.org/wiki/Requests_for_comment/Content_API
[2]: https://github.com/gwicke/restface
[3]: We are currently mentoring a GSoC project to collect statistics on
issues like unbalanced templates, which should allow us to systematically
mark those transclusions by wrapping them in a <domparse> tag in wikitext.
All transclusions outside of <domparse> will then be expected to yield
stand-alone HTML.
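The storage-first flow described above (return stored HTML if present; otherwise call the renderer, save the result back, and return it) can be sketched as follows. All class and method names here are illustrative, not actual content API interfaces:

```python
class DictStorage:
    """Hypothetical revision -> HTML store (a dict stands in for the
    storage service)."""
    def __init__(self):
        self._html = {}

    def get(self, revision_id):
        return self._html.get(revision_id)

    def set(self, revision_id, html):
        self._html[revision_id] = html

class CountingRenderer:
    """Hypothetical renderer service; counts calls to show caching."""
    def __init__(self):
        self.calls = 0

    def render(self, revision_id):
        self.calls += 1
        return f"<p>rendered revision {revision_id}</p>"

class ContentAPI:
    """Sketch of the render-or-fetch logic described in the mail."""
    def __init__(self, storage, renderer):
        self.storage = storage
        self.renderer = renderer

    def get_html(self, revision_id):
        html = self.storage.get(revision_id)
        if html is not None:
            return html  # typical, fast case: no rendering happens
        # No HTML in storage: call the renderer, save the result back.
        html = self.renderer.render(revision_id)
        self.storage.set(revision_id, html)
        return html

api = ContentAPI(DictStorage(), CountingRenderer())
first = api.get_html(42)   # miss: renders and stores
second = api.get_html(42)  # hit: served straight from storage
```

On the second call no rendering happens at all, which is what makes the stored-HTML path the fast, typical case.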


Re: Transcluding non-text content as HTML on wikitext pages

Daniel Kinzler
In reply to this post by Brad Jorsch (Anomie)
Thanks all for the input!

On 14.05.2014 10:17, Gabriel Wicke wrote:
> On 05/13/2014 05:37 PM, Daniel Kinzler wrote:

> It sounds like this won't work well with current Parsoid. We are using
> action=expandtemplates for the preprocessing of transclusions, and then
> parse the contents using Parsoid. The content is finally
> passed through the sanitizer to keep XSS at bay.
>
> This means that HTML returned from the preprocessor needs to be valid in
> wikitext to avoid being stripped out by the sanitizer. Maybe that's actually
> possible, but my impression is that you are shooting for something that's
> closer to the behavior of a tag extension. Those already bypass the
> sanitizer, so would be less troublesome in the short term.

Yes. Just treat <html>...</html> like a tag extension, and it should work fine.
Do you see any problems with that?

> So it is important to think of renderers as services, so that they are
> usable from the content API and Parsoid. For existing PHP code this could
> even be action=parse, but for new renderers without a need or desire to tie
> themselves to MediaWiki internals I'd recommend to think of them as their
> own service. This can also make them more attractive to third party
> contributors from outside the MediaWiki world, as has for example recently
> happened with Mathoid.

True, but that has little to do with my patch. It just means that 3rd party
Content objects should preferably implement getHtml() by calling out to a
service object.

On 13.05.2014 21:38, Brad Jorsch (Anomie) wrote:
> To avoid the wikitext mangling, you could wrap it in some tag that works
> like <html> if $wgRawHtml is set and <pre> otherwise.

But <pre> will result in *escaped* HTML. That's just another kind of mangling.
It is, after all, the "normal" result of parsing.

Basically, the <html> mode is for expandtemplates only, and not intended to be
followed up by "actual" parsing.

On 13.05.2014 21:38, Brad Jorsch (Anomie) wrote:
> Or one step further, maybe a tag <foo wikitext="{{P}}">html goes here</foo>
> that parses just as {{P}} does (and ignores "html goes here" entirely),
> which preserves the property that the output of expandtemplates will mostly
> work when passed back to the parser.

Hm... that's an interesting idea, I'll think about it!

Btw, just so this is mentioned somewhere: it would be very easy to simply not
expand such templates at all in expandtemplates mode, keeping them as {{T}} or
[[T]].

On 14.05.2014 00:11, Matthew Flaschen wrote:
> From working with Dan on this, the main issue is the ResourceLoader module
> that the diagrams require (it uses a JavaScript library called Vega, plus a
> couple supporting libraries, and simple MW setup code).
>
> The container element that it needs can be as simple as:
>
> <div data-something="..."></div>
>
> which is actually valid wikitext.

So, there is no server side rendering at all? It's all done using JS on the
client? Ok then, HTML transclusion isn't the solution.

> Can you outline how RL modules would be handled in the transclusion
> scenario?

The current patch does not really address that problem, I'm afraid. I can think
of two solutions:

* Create a SyntheticHtmlContent class that would hold meta info about modules
etc., just like ParserOutput - perhaps it would just contain a ParserOutput
object. And an equivalent SyntheticWikitextContent class, perhaps. That would
allow us to pass such meta-info around as needed.

* Move the entire logic for HTML based transclusion into the wikitext parser,
where it can just call getParserOutput() on the respective Content object. We
would then no longer need the generic infrastructure for HTML transclusion.
Maybe that would be a better solution in the end.

Hm... yes, I should make an alternative patch using that approach, so we can
compare.


Thanks for your input!
-- daniel



Re: Transcluding non-text content as HTML on wikitext pages

Gabriel Wicke-3
On 05/14/2014 01:40 PM, Daniel Kinzler wrote:
>> This means that HTML returned from the preprocessor needs to be valid in
>> wikitext to avoid being stripped out by the sanitizer. Maybe that's actually
>> possible, but my impression is that you are shooting for something that's
>> closer to the behavior of a tag extension. Those already bypass the
>> sanitizer, so would be less troublesome in the short term.
>
> Yes. Just treat <html>...</html> like a tag extension, and it should work fine.
> Do you see any problems with that?

First of all, you'll have to make sure that users cannot inject <html> tags,
as that would enable arbitrary XSS. I might have missed it, but I believe
that this is not yet done in your current patch.

In contrast to normal tag extensions <html> would also contain fully
rendered HTML, and should not be piped through action=parse as is done in
Parsoid for tag extensions (in absence of a direct tag extension expansion
API end point). We and other users of the expandtemplates API will have to
add special-case handling for this pseudo tag extension.

In HTML, the <html> tag is also not meant to be used inside the body of a
page. I'd suggest using a different tag name to avoid issues with HTML
parsers and potential name conflicts with existing tag extensions.

Overall it does not feel like a very clean way to do this. My preference
would be to let the consumer directly ask for pre-expanded wikitext *or*
HTML, without overloading action=expandtemplates. Even indicating the
content type explicitly in the API response (rather than inline with an HTML
tag) would be a better stop-gap as it would avoid some of the security and
compatibility issues described above.

>> So it is important to think of renderers as services, so that they are
>> usable from the content API and Parsoid. For existing PHP code this could
>> even be action=parse, but for new renderers without a need or desire to tie
>> themselves to MediaWiki internals I'd recommend to think of them as their
>> own service. This can also make them more attractive to third party
>> contributors from outside the MediaWiki world, as has for example recently
>> happened with Mathoid.
>
> True, but that has little to do with my patch. It just means that 3rd party
> Content objects should preferably implement getHtml() by calling out to a
> service object.

You are right that it is not an immediate issue with your patch. The point
is about the *longer-term* role of the ContentHandler vs. the content API.
The ContentHandler could either try to be the central piece of our new
content API, or could become an integration point that normally calls out to
the content API and other services to retrieve HTML.

To me the latter is preferable as it enables us to optimize the content API
for high request rates by concentrating on doing one job well, and lets us
leverage this API from the server-side MediaWiki front-end through
ContentHandler.

Gabriel


Re: Transcluding non-text content as HTML on wikitext pages

Daniel Kinzler
On 14.05.2014 15:11, Gabriel Wicke wrote:

> On 05/14/2014 01:40 PM, Daniel Kinzler wrote:
>>> This means that HTML returned from the preprocessor needs to be valid in
>>> wikitext to avoid being stripped out by the sanitizer. Maybe that's actually
>>> possible, but my impression is that you are shooting for something that's
>>> closer to the behavior of a tag extension. Those already bypass the
>>> sanitizer, so would be less troublesome in the short term.
>>
>> Yes. Just treat <html>...</html> like a tag extension, and it should work fine.
>> Do you see any problems with that?
>
> First of all you'll have to make sure that users cannot inject <html> tags
> as that would enable arbitrary XSS. I might have missed it, but I believe
> that this is not yet done in your current patch.

My patch doesn't change the handling of <html>...</html> by the parser. As
before, the parser will pass HTML code in <html>...</html> through only if
wgRawHtml is enabled, and will mangle/sanitize it otherwise.

My patch does mean, however, that the text returned by expandtemplates may not
render as expected when processed by the parser. Perhaps anomie's approach of
preserving the original template call would work, something like:

  <html template="{{T}}">...</html>

Then, the parser could apply the normal expansion when encountering the tag,
ignoring the pre-rendered HTML.
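That idea could look roughly like the following sketch (illustrative Python; real MediaWiki tag hooks are PHP callbacks registered on the Parser, and FakeParser, handle_html_tag, and the "template" attribute name are all hypothetical):

```python
from html import escape

class FakeParser:
    """Minimal stand-in for the parser: 'expands' {{Name}} calls from a
    fixed template table."""
    def __init__(self, templates):
        self.templates = templates

    def expand(self, template_call):
        name = template_call.strip("{}")
        return self.templates[name]

def handle_html_tag(attributes, inner_html, parser, raw_html_enabled):
    """Sketch of handling <html template="{{T}}">...</html>: if the
    original template call was preserved in an attribute, re-expand it
    and ignore the pre-rendered HTML entirely."""
    template_call = attributes.get("template")
    if template_call is not None:
        return parser.expand(template_call)  # normal expansion of {{T}}
    if raw_html_enabled:
        return inner_html                    # $wgRawHtml: pass through
    return escape(inner_html)                # otherwise the HTML is escaped

parser = FakeParser({"T": "expanded content of T"})
```

With the attribute present, the pre-rendered body is ignored and the normal expansion result is used; without it, the behavior falls back to the $wgRawHtml rules.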

> In contrast to normal tag extensions <html> would also contain fully
> rendered HTML, and should not be piped through action=parse as is done in
> Parsoid for tag extensions (in absence of a direct tag extension expansion
> API end point). We and other users of the expandtemplates API will have to
> add special-case handling for this pseudo tag extension.

Handling for the <html> tag should already be in place, since it's part of the
core spec. The issue is only to know when to allow/trust such <html> tags, and
when to treat them as plain text (or like a <pre> tag).

> In HTML, the <html> tag is also not meant to be used inside the body of a
> page. I'd suggest using a different tag name to avoid issues with HTML
> parsers and potential name conflicts with existing tag extensions.

As above: <html> is part of the core syntax, to support $wgRawHtml. It's just
disabled per default.

> Overall it does not feel like a very clean way to do this. My preference
> would be to let the consumer directly ask for pre-expanded wikitext *or*
> HTML, without overloading action=expandtemplates.

The question is how to represent non-wikitext transclusions in the output of
expandtemplates. We'll need an answer to this question in any case.

For the main purpose of my patch, expandtemplates is irrelevant. I added the
special mode that generates <html> specifically to have a consistent wikitext
representation for use by expandtemplates. I could simply disable it just as
well, so no expansion would apply for such templates when calling
expandtemplates (as is done for special page inclusion).

> Even indicating the
> content type explicitly in the API response (rather than inline with an HTML
> tag) would be a better stop-gap as it would avoid some of the security and
> compatibility issues described above.

The content type did not change. It's wikitext.

-- daniel


Re: Transcluding non-text content as HTML on wikitext pages

Gabriel Wicke-3
On 05/14/2014 03:22 PM, Daniel Kinzler wrote:
> My patch doesn't change the handling of <html>...</html> by the parser. As
> before, the parser will pass HTML code in <html>...</html> through only if
> wgRawHtml is enabled, and will mangle/sanitize it otherwise.


Oh, I thought that you wanted to support normal wikis with $wgRawHtml disabled.

> The content type did not change. It's wikitext.
Anything is wikitext ;)

Gabriel


Re: Transcluding non-text content as HTML on wikitext pages

Dan Andreescu
In reply to this post by Daniel Kinzler
>
> > Can you outline how RL modules would be handled in the transclusion
> > scenario?
>
> The current patch does not really address that problem, I'm afraid. I can
> think
> of two solutions:
>
> * Create an SyntheticHtmlContent class that would hold meta info about
> modules
> etc, just like ParserOutput - perhaps it would just contain a ParserOutput
> object.  And an equvalent SyntheticWikitextContent class, perhaps. That
> would
> allow us to pass such meta-info around as needed.
>
> * Move the entire logic for HTML based transclusion into the wikitext
> parser,
> where it can just call getParserOutput() on the respective Content object.
> We
> would then no longer need the generic infrastructure for HTML transclusion.
> Maybe that would be a better solution in the end.
>
> Hm... yes, I should make an alternative patch using that approach, so we
> can
> compare.
>

Thanks a lot Daniel, I'm happy to help test / try out any solutions you
want to experiment with.  I've moved my work to gerrit:
https://gerrit.wikimedia.org/r/#/admin/projects/mediawiki/extensions/Limn
and the last commit (with a lot of help from Matt F.) may be ready for you
to use as a use case.  Let me know if it'd be helpful to install this
somewhere in labs.

Re: Transcluding non-text content as HTML on wikitext pages

Daniel Kinzler
In reply to this post by Gabriel Wicke-3
On 14.05.2014 16:04, Gabriel Wicke wrote:
> On 05/14/2014 03:22 PM, Daniel Kinzler wrote:
>> My patch doesn't change the handling of <html>...</html> by the parser. As
>> before, the parser will pass HTML code in <html>...</html> through only if
>> wgRawHtml is enabled, and will mangle/sanitize it otherwise.
>
>
> Oh, I thought that you wanted to support normal wikis with $wgRawHtml disabled.

I want to, and I do. <html> is not used for normal rendering; it is used by
expandtemplates only. During normal rendering, a strip mark is inserted, which
will work on all wikis. The one thing that will not work on wikis with
$wgRawHtml disabled is parsing the output of expandtemplates.

-- daniel



Re: Transcluding non-text content as HTML on wikitext pages

Daniel Kinzler
In reply to this post by Daniel Kinzler
Hi again!

I have rewritten the patch that enabled HTML based transclusion:

https://gerrit.wikimedia.org/r/#/c/132710/

I tried to address the concerns raised about my previous attempt, namely how
HTML based transclusion is handled in expandtemplates, and how page metadata
such as resource modules gets passed from the transcluded content to the main
parser output (this should work now).

For expandtemplates, I decided to just keep HTML based transclusions as they are
- including special page transclusions. So, expandtemplates will simply leave
{{Special:Foo}} and {{MediaWiki:Foo.js}} in the expanded text, while in the XML
output, you can still see them as template calls.

Cheers,
Daniel


Re: Transcluding non-text content as HTML on wikitext pages

Gabriel Wicke-3
In reply to this post by Daniel Kinzler
On 05/15/2014 04:42 PM, Daniel Kinzler wrote:
> The one thing that will not work on wikis with
> $wgRawHtml disabled is parsing the output of expandtemplates.

Yes, which means that it won't work with Parsoid, Flow, VE and other users.

I do think that we can do better, and I pointed out possible ways to do so
in my earlier mail:

> My preference
> would be to let the consumer directly ask for pre-expanded wikitext *or*
> HTML, without overloading action=expandtemplates. Even indicating the
> content type explicitly in the API response (rather than inline with an HTML
> tag) would be a better stop-gap as it would avoid some of the security and
> compatibility issues described above.

Gabriel


Re: Transcluding non-text content as HTML on wikitext pages

Daniel Kinzler
On 16.05.2014 21:07, Gabriel Wicke wrote:
> On 05/15/2014 04:42 PM, Daniel Kinzler wrote:
>> The one thing that will not work on wikis with
>> $wgRawHtml disabled is parsing the output of expandtemplates.
>
> Yes, which means that it won't work with Parsoid, Flow, VE and other users.

And it has been fixed now. In the latest version, expandtemplates will just
return {{Foo}} as it was if {{Foo}} can't be expanded to wikitext.

> I do think that we can do better, and I pointed out possible ways to do so
> in my earlier mail:
>
>> My preference
>> would be to let the consumer directly ask for pre-expanded wikitext *or*
>> HTML, without overloading action=expandtemplates. Even indicating the
>> content type explicitly in the API response (rather than inline with an HTML
>> tag) would be a better stop-gap as it would avoid some of the security and
>> compatibility issues described above.

I don't quite understand what you are asking for... action=parse returns HTML,
action=expandtemplates returns wikitext. The issue was with "mixed" output, that
is, representing the expansion of templates that generate HTML in wikitext. The
solution I'm going for now is to simply not expand them.

-- daniel



Re: Transcluding non-text content as HTML on wikitext pages

Subramanya Sastry
(Top posting to quickly summarize what I gathered from the discussion
and what would be required for Parsoid to expand pages with these
transclusions).

Parsoid currently relies on the MediaWiki API to preprocess
transclusions and return wikitext (uses action=expandtemplates for this)
which it then parses using native Parsoid pipeline.  Parsoid processes
extension tags via action=parse and weaves the result back into the
top-level content of the page.

As per your original email, I am assuming the T is a page with a special
content model that generates HTML and another page P has a transclusion
{{T}}.

So, when Parsoid encounters {{T}}, it should be able to replace {{T}}
with the HTML to generate the right parse output for P.

So, I am listing below 4 possible ways action=expandtemplates can
process {{T}}:

1. Your newest implementation (that just returns back {{T}}):

* If Parsoid gets back {{T}}, one of two things can happen:
--- Parsoid, as usual, tries to parse it as wikitext, and it gets stuck
in an infinite loop (query MW api for expansion of {{T}}, get back
{{T}}, parse it as {{T}}, query MW api for expansion of {{T}}, .... ).
So, this will definitely not work.
--- Parsoid adds a special-case check to see if the API sent back {{T}},
in which case it requires a different API endpoint
(action=expandtohtml maybe?) to send back the HTML expansion, rather than
relying on assumptions about the output of expandtemplates. This would work,
but would require the new endpoint to be implemented, and feels hacky.

So, going back to your original implementation, here are at least 3 ways
I see this working:

2. action=expandtemplates returns <html>...</html> for the expansion
of {{T}}, but also provides an additional API response header that tells
Parsoid that T was a special content model page and that the raw HTML
that it received should not be sanitized.

3. action=expandtemplates returns <html>...</html> for the expansion of
{{T}} and no other indication about T being a special content model page
or not. However, if Parsoid (and other clients) are to trust these html
output always without sanitization, expandtemplates implementation
should have a conditional sanitization of <html> tags encountered in
wikitext to prevent XSS. As far as I understand, expandtemplates (on
master, not your patch) does not do this tag sanitization. But,
independent of that, what Parsoid and clients need is a guarantee that
it is safe to blindly splice the contents of any <html>...</html> it
receives for any {{T}}, no matter what content model T implements.

4. Parsoid first queries the MW API to find out the content model of T
for every transclusion {{T}} it encounters on the page P and based on
the content-model info, knows how to process the output of
action=expandtemplates.

Clearly 4. is expensive and 3. seems hacky, but if it can be made to
work, we can work with that.

But, both Gabriel and I think that solution 2. is the cleanest solution
for now that would work. The PHP parser (in your patch to handle {{T}})
already has information about the content model of T when it is
expanding {{T}} and it seems simplest and cleanest to return this
information back to clients for non-default content-model
expansions. That gives clients like Parsoid the cleanest way of handling
these.
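Client-side, option 2 would reduce to a check like the following sketch (the header name X-Content-Model is the example from this mail, not an existing MediaWiki API header, and the toy sanitizer is purely illustrative):

```python
def handle_expandtemplates_response(headers, body, sanitize):
    """Sketch: a client such as Parsoid inspects the proposed content
    model indicator to decide whether the body is trusted, pre-rendered
    HTML (splice in as-is) or wikitext (parse and sanitize as usual)."""
    model = headers.get("X-Content-Model", "wikitext")
    if model == "HTML":
        return body          # trusted HTML from a non-text content model
    return sanitize(body)    # normal wikitext path

# Toy 'sanitizer' that just escapes angle brackets, for demonstration.
def toy_sanitize(text):
    return text.replace("<", "&lt;").replace(">", "&gt;")
```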

If I am missing something or this is unclear, and this is getting into too
much back and forth on email, it may be simpler to discuss this on IRC: I
can hop onto any IRC channel on Monday, or we can do this on
#mediawiki-parsoid, and one of us could later summarize the discussion
back onto this thread.

Thanks,
Subbu.


On 05/17/2014 02:54 AM, Daniel Kinzler wrote:

> Am 16.05.2014 21:07, schrieb Gabriel Wicke:
>> On 05/15/2014 04:42 PM, Daniel Kinzler wrote:
>>> The one thing that will not work on wikis with
>>> $wgRawHtml disabled is parsing the output of expandtemplates.
>> Yes, which means that it won't work with Parsoid, Flow, VE and other users.
> And it has been fixed now. In the latest version, expandtemplates will just
> return {{Foo}} as it was if {{Foo}} can't be expanded to wikitext.
>
>> I do think that we can do better, and I pointed out possible ways to do so
>> in my earlier mail:
>>
>>> My preference
>>> would be to let the consumer directly ask for pre-expanded wikitext *or*
>>> HTML, without overloading action=expandtemplates. Even indicating the
>>> content type explicitly in the API response (rather than inline with an HTML
>>> tag) would be a better stop-gap as it would avoid some of the security and
>>> compatibility issues described above.
> I don't quite understand what you are asking for... action=parse returns HTML,
> action=expandtemplates returns wikitext. The issue was with "mixed" output, that
> is, representing the expansion of templates that generate HTML in wikitext. The
> solution I'm going for now is to simply not expand them.
>
> -- daniel
>
>


_______________________________________________
Wikitech-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: Transcluding non-text content as HTML on wikitext pages

Subramanya Sastry
On 05/17/2014 10:51 AM, Subramanya Sastry wrote:
> So, going back to your original implementation, here are at least 3
> ways I see this working:
>
> 2. action=expandtemplates returns a <html>...</html> for the expansion
> of {{T}}, but also provides an additional API response header that
> tells Parsoid that T was a special content model page and that the raw
> HTML that it received should not be sanitized.

Actually, the <html></html> wrapper is not even required here since the
new API response header (for example, X-Content-Model: HTML) is
sufficient to know what to do with the response body.

Subbu.


Re: Transcluding non-text content as HTML on wikitext pages

Gabriel Wicke-3
On 05/17/2014 05:57 PM, Subramanya Sastry wrote:

> On 05/17/2014 10:51 AM, Subramanya Sastry wrote:
>> So, going back to your original implementation, here are at least 3 ways I
>> see this working:
>>
>> 2. action=expandtemplates returns a <html>...</html> for the expansion of
>> {{T}}, but also provides an additional API response header that tells
>> Parsoid that T was a special content model page and that the raw HTML that
>> it received should not be sanitized.
>
> Actually, the <html></html> wrapper is not even required here since the new
> API response header (for example, X-Content-Model: HTML) is sufficient to
> know what to do with the response body.

Indeed.

Also, instead of the header we can just set a property / attribute in the
JSON/XML response structure. This will also work for multi-part responses,
for example when calling action=expandtemplates on multiple titles.
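As a sketch of that multi-part case, assuming a hypothetical per-part "contentmodel" property in the JSON response (the field names and shape are illustrative, not the real API schema):

```python
# Illustrative multi-part response where a single HTTP header could not
# distinguish per-part models, but a per-part "contentmodel" property
# can. Field names and structure are assumptions for illustration only.
import json

multi_part = json.loads("""
{
  "expandtemplates": [
    {"title": "T1", "text": "<div>diagram</div>", "contentmodel": "html"},
    {"title": "T2", "text": "'''bold'''", "contentmodel": "wikitext"}
  ]
}
""")

# Each part carries its own model, so mixed expansions stay unambiguous.
models = {p["title"]: p["contentmodel"] for p in multi_part["expandtemplates"]}
```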

Gabriel


Re: Transcluding non-text content as HTML on wikitext pages

Daniel Kinzler
In reply to this post by Subramanya Sastry
Am 17.05.2014 17:57, schrieb Subramanya Sastry:

> On 05/17/2014 10:51 AM, Subramanya Sastry wrote:
>> So, going back to your original implementation, here are at least 3 ways I see
>> this working:
>>
>> 2. action=expandtemplates returns a <html>...</html> for the expansion of
>> {{T}}, but also provides an additional API response header that tells Parsoid
>> that T was a special content model page and that the raw HTML that it received
>> should not be sanitized.
>
> Actually, the <html></html> wrapper is not even required here since the new API
> response header (for example, X-Content-Model: HTML) is sufficient to know what
> to do with the response body.

But that would only work if {{T}} was the whole text being expanded (I
guess that's what you do with Parsoid, right? Took me a minute to realize that).
expandtemplates operates on full wikitext. If the input is something like

  == Foo ==
  {{T}}

  [[Category:Bla]]

Then expanding {{T}} without a wrapper and pretending the result was HTML would
just be wrong.

Regarding trusting the output: MediaWiki core trusts the generated HTML for
direct output. It's no different from the HTML generated by e.g. special pages
in that regard.

I think something like <html transclusion="{{T}}" model="whatever">...</html>
would work best.

-- daniel


Re: Transcluding non-text content as HTML on wikitext pages

Subramanya Sastry
On 05/17/2014 06:14 PM, Daniel Kinzler wrote:

> Am 17.05.2014 17:57, schrieb Subramanya Sastry:
>> On 05/17/2014 10:51 AM, Subramanya Sastry wrote:
>>> So, going back to your original implementation, here are at least 3 ways I see
>>> this working:
>>>
>>> 2. action=expandtemplates returns a <html>...</html> for the expansion of
>>> {{T}}, but also provides an additional API response header that tells Parsoid
>>> that T was a special content model page and that the raw HTML that it received
>>> should not be sanitized.
>> Actually, the <html></html> wrapper is not even required here since the new API
>> response header (for example, X-Content-Model: HTML) is sufficient to know what
>> to do with the response body.
>   But that would only work if {{T}} was the whole text that was being expanded (I
> guess that's what you do with parsoid, right? Took me a minute to realize that).
> expandtemplates operates on full wikitext. If the input is something like
>
>    == Foo ==
>    {{T}}
>
>    [[Category:Bla]]
>
> Then expanding {{T}} without a wrapper and pretending the result was HTML would
> just be wrong.

Parsoid handles this correctly. We have mechanisms for injecting HTML as
well as wikitext into the top-level page. For example, tag extensions
currently return fully expanded HTML (we use the action=parse API
endpoint) and we inject that HTML into the page. So, consider this
wikitext for page P.

== Foo ==
{{wikitext-transclusion}}
   *a1
<map ..> ... </map>
   *a2
{{T}} (the html-content-model-transclusion)
   *a3

Parsoid gets wikitext from the API for {{wikitext-transclusion}}, parses
it and injects the tokens into P's content. Parsoid gets HTML from
the API for <map..>...</map> and injects the HTML into the
not-fully-processed wikitext of P (by adding an appropriate token
wrapper). So, if {{T}} returns HTML (i.e. the MW API lets Parsoid know
that it is HTML), Parsoid can inject the HTML into the
not-fully-processed wikitext and ensure that the final output comes out
right (in this case, the HTML from both the map extension and {{T}}
would not get sanitized, which is the desired behaviour).

Does that help explain why we said we don't need the html wrapper?

All that said, if you want to provide the wrapper with <html
model="whatever" ....>fully-expanded-HTML</html>, we can handle that as
well. We'll use the model attribute of the wrapper, discard the wrapper
and use the contents in our pipeline.
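That unwrapping step could look roughly like this (a sketch only: the <html model="..."> wrapper is the proposal under discussion, not an implemented format, and a real implementation would use an HTML tokenizer rather than a regex):

```python
# Sketch of unwrapping a proposed <html model="...">...</html>
# transclusion wrapper: keep the model attribute, discard the wrapper,
# and pass the contents on to the pipeline. The wrapper format is the
# proposal under discussion in this thread, not an implemented API.
import re

WRAPPER = re.compile(
    r'^<html\b[^>]*\bmodel="(?P<model>[^"]*)"[^>]*>(?P<body>.*)</html>$',
    re.DOTALL,
)

def unwrap(expansion):
    """Return (model, contents); plain wikitext falls through unchanged."""
    m = WRAPPER.match(expansion)
    if m:
        return m.group("model"), m.group("body")
    return "wikitext", expansion

model, body = unwrap('<html model="mapdata"><div>map</div></html>')
```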

So, model information either as an attribute on the wrapper, an API
response header, or a property in the JSON/XML response structure would
all work for us. I don't have clarity on which of these three is the
best mechanism for providing the template-page content-model information
to clients, so until I understand that better I don't have an opinion
about the specific mechanism. However, in his previous message, Gabriel
indicated that a property in the JSON/XML response structure might work
better for multi-part responses.

Subbu.

> Regarding trusting the output: MediaWiki core trusts the generated HTML for
> direct output. It's no different from the HTML generated by e.g. special pages
> in that regard.
>
> I think something like <html transclusion="{{T}}" model="whatever">...</html>
> would work best.
>
> -- daniel
>



Re: Transcluding non-text content as HTML on wikitext pages

Gabriel Wicke-3
On 05/18/2014 02:28 AM, Subramanya Sastry wrote:
> However, in his previous message, Gabriel indicated that
> a property in the JSON/XML response structure might work better for
> multi-part responses.

The difference between wrapper and property is actually that using inline
wrappers in the returned wikitext would force us to escape similar wrappers
from normal template content to avoid opening a gaping XSS hole.

A separate property in the JSON/XML structure avoids the need for escaping
(and associated security risks if not done thoroughly), and should be
relatively straightforward to implement and consume.

Gabriel


Re: Transcluding non-text content as HTML on wikitext pages

Daniel Kinzler
In reply to this post by Subramanya Sastry
I'm getting the impression there is a fundamental misunderstanding here.

Am 18.05.2014 04:28, schrieb Subramanya Sastry:

> So, consider this wikitext for page P.
>
> == Foo ==
> {{wikitext-transclusion}}
>   *a1
> <map ..> ... </map>
>   *a2
> {{T}} (the html-content-model-transclusion)
>   *a3
>
> Parsoid gets wikitext from the API for {{wikitext-transclusion}}, parses it and
> injects the tokens into the P's content. Parsoid gets HTML from the API for
> <map..>...</map> and injects the HTML into the not-fully-processed wikitext of P
> (by adding an appropriate token wrapper). So, if {{T}} returns HTML (i.e. the MW
> API lets Parsoid know that it is HTML), Parsoid can inject the HTML into the
> not-fully-processed wikitext and ensure that the final output comes out right
> (in this case, the HTML from both the map extension and {{T}} would not get
> sanitized as it should be).
>
> Does that help explain why we said we don't need the html wrapper?

No, it actually misses my point completely. My point is that this may
work with the way Parsoid uses expandtemplates, but it does not work for
expandtemplates in general, because expandtemplates takes full wikitext
as input and only partially replaces it.

So, let me phrase it this way:

If expandtemplates is called with text=

   == Foo ==
   {{T}}

   [[Category:Bla]]

What should it return, and what content type should be declared in the HTTP header?

Note that I'm not talking about how parsoid processes this text. That's not my
point - my point is that expandtemplates can be and is used on full wikitext. In
that context, the return type cannot be HTML.

> All that said, if you want to provide the wrapper with <html model="whatever"
> ....>fully-expanded-HTML</html>, we can handle that as well. We'll use the model
> attribute of the wrapper, discard the wrapper and use the contents in our pipeline.

Why use the model attribute? Why would you care about the original model? All
you need to know is that you'll get HTML. Exposing the original model in this
context seems useless if not misleading. <html transclude="{{T}}"></html> would
give the backend parser a way to discard the HTML (as unsafe) and execute the
transclusion instead (generating trusted HTML). In fact, we could just omit the
content of the <html> tag.

> So, model information either as an attribute on the wrapper, api response
> header, or a property in the JSON/XML response structure would all work for us.

As explained above, the return type cannot be HTML for the full text, because
any "plain" wikitext would stay unprocessed. There needs to be a marker for
"html transclusion *here*" in the text.

Am 18.05.2014 16:29, schrieb Gabriel Wicke:
> The difference between wrapper and property is actually that using inline
> wrappers in the returned wikitext would force us to escape similar wrappers
> from normal template content to avoid opening a gaping XSS hole.

Please explain, I do not see the hole you mention.

If the input contained <html>evil stuff</html>, it would just get escaped by the
preprocessor (unless $wgRawHtml is enabled), as it is now:
https://de.wikipedia.org/w/api.php?action=expandtemplates&text=%3Chtml%3E%3Cscript%3Ealert%28%27evil%27%29%3C/script%3E%3C/html%3E

If <html transclude="{{T}}"> was passed, the parser/preprocessor would treat it
like it would treat {{T}} - it would get trusted, backend-generated HTML from
the respective Content object.

I see no change, and no opportunity to inject anything. Am I missing something?

> A separate property in the JSON/XML structure avoids the need for escaping
> (and associated security risks if not done thoroughly), and should be
> relatively straightforward to implement and consume.

As explained above, I do not see how this would work except for the very
special case of using expandtemplates to expand just a single template. That
case could be solved by introducing a new single-template mode for
expandtemplates, e.g. using expand="Foo|x|y|z" instead of text="{{Foo|x|y|z}}".
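To illustrate, a request in this hypothetical single-template mode might be built like so (the "expand" parameter does not exist in the real API, and the endpoint URL is a placeholder; only the URL construction is shown):

```python
# Sketch of the proposed single-template mode: expand=Foo|x|y|z instead
# of text={{Foo|x|y|z}}. The "expand" parameter is hypothetical and the
# endpoint is a placeholder; only the request construction is shown.
from urllib.parse import urlencode

API = "https://example.org/w/api.php"  # placeholder endpoint

def single_template_request(title, *args):
    params = {
        "action": "expandtemplates",
        "expand": "|".join((title,) + args),  # hypothetical parameter
        "format": "json",
    }
    return API + "?" + urlencode(params)

url = single_template_request("Foo", "x", "y", "z")
```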

Another way would be to use hints in the structure returned by generatexml.
There, we have an opportunity to declare a content type for a *part* of the
output (or rather, the input).

-- daniel
