Getting rid of $wgWellFormedXml = false;

Brian Wolff
Currently, we have two ways of outputting HTML. $wgWellFormedXml =
true (the default) outputs HTML that also happens to conform to the
rules of XML. $wgWellFormedXml = false, on the other hand, uses the
laxer HTML5 rules to save a few bytes.

Having two modes of output feels rather silly to me. Originally, I
think this was meant as a feature flag while $wgWellFormedXml=false
stabilized, but it never got turned on, and here we are 7 years later.

Having $wgWellFormedXml=false increases the complexity of the code,
and not many people use it (a notable exception is translatewiki).
I think it's important that security-critical code be as simple as
possible. Furthermore, there seems to be very little benefit to having
the second mode (once you account for gzip, saving a few bytes by
writing <img> instead of <img/> really doesn't matter, imo).
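
A rough way to check the gzip point is sketched below in Python. The
toy page, the helper names build_page and size_report, and the choice
of 200 tags are all made up for illustration; real pages will give
different numbers.

```python
import zlib


def build_page(self_closing: bool, count: int = 200) -> bytes:
    """Build a toy page of <img> tags, with or without the XML-style slash."""
    close = "/>" if self_closing else ">"
    tags = [f'<img src="File_{i}.png" alt=""{close}' for i in range(count)]
    return "\n".join(tags).encode("utf-8")


def size_report(count: int = 200) -> dict:
    """Compare bytes saved by dropping the slash, before and after compression."""
    xml_style = build_page(True, count)
    html_style = build_page(False, count)
    return {
        # One byte per tag before compression.
        "raw_saving": len(xml_style) - len(html_style),
        # zlib uses DEFLATE, the same algorithm gzip applies to HTTP responses.
        "compressed_saving": len(zlib.compress(xml_style))
        - len(zlib.compress(html_style)),
    }


if __name__ == "__main__":
    print(size_report())
```

The raw saving is exactly one byte per tag. The compressed saving
should come out much smaller, because the repeated "/>" is just one
more byte inside a sequence the compressor has already seen.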

With that in mind, I would like to propose killing $wgWellFormedXml =
false. I'm not particularly attached to the true mode (although I do
feel it is significantly more sane); I simply want there to be a
single mode. Changing the default to false was vetoed in T52040, so I
think true is the best choice going forward if we are getting rid of
one of the modes.

If there are aspects of the other mode that people really want, then I
think we should simply merge them into the default behavior instead
of having two separate modes.

See Gerrit patch https://gerrit.wikimedia.org/r/286495 for the change.
I would appreciate everyone's feedback.

Thanks,
Brian

_______________________________________________
Wikitech-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: Getting rid of $wgWellFormedXml = false;

Brion Vibber-4
I'd say an HTML5 output mode *ought* to work like this:

*Don't try to be clever.*
* Consistency and predictability are key to both security review and data
consumability.

*Quote attributes consistently and predictably.*
* Always use double-quotes on attributes in output.

*Output specced empty tags in HTML style.*
* <img>, <hr>, <br> are fine and not ambiguous at all to an HTML parser.
There's no need to go adding a "/" in at the end!
* These are already whitelisted in the Html class so it's easy to not mess
this up.

*Don't do other silly things for old-school XHTML 1.*
* CDATA wrapping of <script>s and <style>s is not needed.
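
The rules above can be sketched as a tiny serializer. This is a Python
illustration only, not MediaWiki's Html class; the function name
element and the VOID_ELEMENTS set are hypothetical names chosen for
the example, and raw-text elements such as <script> and <style> are
deliberately left out.

```python
from html import escape
from typing import Mapping, Optional

# Void elements: an HTML parser never expects a closing tag or a "/" for these.
VOID_ELEMENTS = frozenset({
    "area", "base", "br", "col", "embed", "hr", "img", "input",
    "link", "meta", "source", "track", "wbr",
})


def element(name: str, attrs: Optional[Mapping[str, str]] = None,
            text: str = "") -> str:
    """Serialize one element, always double-quoting attribute values."""
    parts = [name]
    for key, value in (attrs or {}).items():
        # quote=True also escapes " and ', so the value cannot end the attribute.
        parts.append(f'{key}="{escape(value, quote=True)}"')
    open_tag = "<" + " ".join(parts) + ">"
    if name in VOID_ELEMENTS:
        # HTML style: no trailing "/" and no closing tag.
        return open_tag
    return f"{open_tag}{escape(text, quote=False)}</{name}>"


if __name__ == "__main__":
    print(element("img", {"src": "a.png", "alt": 'say "hi"'}))
    print(element("p", {"class": "x"}, "1 < 2"))
```

There is exactly one code path for quoting and one for void elements,
which is the "don't try to be clever" point: the output is the same
shape every time, so it is easier to review.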

The only benefit of $wgWellFormedXml was that you could toss your
"well-formed" tag soup into an XML parser that didn't grok HTML. I have no
idea if that worked reliably or was actually useful to anyone, but it's
probably worth confirming that before actually removing the funky
self-closing tags.

-- brion



Re: Getting rid of $wgWellFormedXml = false;

Brian Wolff
>
> The only benefit of $wgWellFormedXml was that you could toss your
> "well-formed" tag soup into an XML parser that didn't grok HTML. I have no
> idea if that worked reliably or was actually useful to anyone, but it's
> probably worth confirming that before actually removing the funky
> self-closing tags.
>

There are references to it breaking people's screen scraping bots last time
it was turned on. That was like 5 years ago though.

--bawolff

Re: Getting rid of $wgWellFormedXml = false;

Max Semenik
On Mon, May 2, 2016 at 3:04 PM, Brian Wolff <[hidden email]> wrote:

>
> There are references to it breaking people's screen scraping bots last time
> it was turned on. That was like 5 years ago though.
>

At this point, I would say that everybody who screen-scrapes saw it coming,
and breaking them is a good thing; sometimes lessons just have to be
learned.


Best regards,
Max Semenik ([[User:MaxSem]])

Re: Getting rid of $wgWellFormedXml = false;

Gergo Tisza
On Tue, May 3, 2016 at 2:43 AM, Max Semenik <[hidden email]> wrote:

> At this point, I would say that everybody who screen-scrapes saw it coming
> and breaking them is a good thing as sometimes, lessons just have to be
> learned.
>

There aren't many options other than content-scraping if you want to
transform Wikipedia articles into some semblance of structured data. We
even do it ourselves, for media metadata (and use an XML parser for it, as
PHP doesn't offer much in the way of parsing HTML5, so outputting
HTML5-style empty tags might break it - although IIRC there is a hack to
work around that as file pages can contain ill-formed HTML anyway).

Re: Getting rid of $wgWellFormedXml = false;

Gergo Tisza
On Tue, May 3, 2016 at 4:34 PM, Gergo Tisza <[hidden email]> wrote:
>
> There aren't many options other than content-scraping if you want to
> transform Wikipedia articles into some semblance of structured data. We
> even do it ourselves, for media metadata (and use an XML parser for it
>

Actually, the XML parser was replaced with DOMDocument a while ago,
which can handle HTML5 fine. But the point stands: HTML scraping is hardly
an unusual requirement for reusers of our content.
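
The difference between the two kinds of parser can be shown with a
small Python sketch. It uses Python's standard-library parsers as
stand-ins, not the PHP XML parser or DOMDocument discussed here, and
the helper names xml_parses, html_tags and TagCollector are
hypothetical.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Record the name of every start tag the HTML parser sees."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def xml_parses(markup: str) -> bool:
    """Return True if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


def html_tags(markup: str) -> list:
    """Return the start tags found by a lenient HTML parser."""
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.tags


if __name__ == "__main__":
    for markup in ("<p>one<br/>two</p>", "<p>one<br>two</p>"):
        print(markup, xml_parses(markup), html_tags(markup))
```

The XML parser rejects the bare <br> because it never sees a matching
close tag, while the HTML parser reports the same tags for both
spellings. A scraper built on an XML parser is the kind that would
notice a switch to HTML-style empty tags.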

Re: Getting rid of $wgWellFormedXml = false;

Brian Wolff
In reply to this post by Max Semenik
On Monday, May 2, 2016, Max Semenik <[hidden email]> wrote:

> At this point, I would say that everybody who screen-scrapes saw it coming
> and breaking them is a good thing as sometimes, lessons just have to be
> learned.
>

Personally, I don't think we should shy away from breaking screen scrapers
if we get something out of it, but in this case I don't see the benefit.
Breaking things because we can, without getting any benefit (or only trivial
benefits), seems rather pointless and kind of mean to those who do scrape.

--
bawolff

Re: Getting rid of $wgWellFormedXml = false;

Legoktm
In reply to this post by Brian Wolff
Hi,

On 05/02/2016 11:42 AM, Brian Wolff wrote:
> See gerrit patch https://gerrit.wikimedia.org/r/286495 I would
> appreciate everyone's feedback.

Given the lack of objections here and on Gerrit, I went ahead and merged
it today.

-- Legoktm


Re: Getting rid of $wgWellFormedXml = false;

Strainu
2016-05-14 4:07 GMT+03:00 Legoktm <[hidden email]>:
> Hi,
>
> On 05/02/2016 11:42 AM, Brian Wolff wrote:
>> See gerrit patch https://gerrit.wikimedia.org/r/286495 I would
>> appreciate everyone's feedback.
>
> Given the lack of objections here and on Gerrit, I went ahead and merged
> it today.

Can you please clarify if this change will have any effect on
non-valid HTML in the Wikitext? I suppose no change will occur, since
this was the default anyway, but I'd like a confirmation.

Strainu


Re: Getting rid of $wgWellFormedXml = false;

Antoine Musso-3
In reply to this post by Legoktm
On 14/05/2016 at 03:07, Legoktm wrote:
> Hi,
>
> On 05/02/2016 11:42 AM, Brian Wolff wrote:
>> See gerrit patch https://gerrit.wikimedia.org/r/286495 I would
>> appreciate everyone's feedback.
>
> Given the lack of objections here and on Gerrit, I went ahead and merged
> it today.

Hello,

That sounds good. I would suggest applying it to REL1_27 as well.

--
Antoine "hashar" Musso



Re: Getting rid of $wgWellFormedXml = false;

Brian Wolff
In reply to this post by Strainu
On Saturday, May 14, 2016, Strainu <[hidden email]> wrote:

> 2016-05-14 4:07 GMT+03:00 Legoktm <[hidden email]>:
>> Hi,
>>
>> On 05/02/2016 11:42 AM, Brian Wolff wrote:
>>> See gerrit patch https://gerrit.wikimedia.org/r/286495 I would
>>> appreciate everyone's feedback.
>>
>> Given the lack of objections here and on Gerrit, I went ahead and merged
>> it today.
>
> Can you please clarify if this change will have any effect on
> non-valid HTML in the Wikitext? I suppose no change will occur, since
> this was the default anyway, but I'd like a confirmation.
>
> Strainu
>

That is correct. Nothing will change about invalid HTML: if you have tidy
enabled, the invalid HTML gets fixed; if you don't, it does not.

--
bawolff