Tidy will be replaced by RemexHTML on Wikimedia wikis latest by June 2018


Re: Tidy will be replaced by RemexHTML on Wikimedia wikis latest by June 2018

Nicolas Vervelle-4
On Tue, Jul 11, 2017 at 5:05 PM, Subramanya Sastry <[hidden email]>
wrote:

> On 07/11/2017 05:13 AM, Nicolas Vervelle wrote:
>
>> - In the page dedicated to a category, there's a column telling if the
>>   problem is due to one template (and which one) or by several
>>   templates, but I don't get this information in the REST API for
>>   Linter. Is it possible to have it in the API result or should I deduce
>>   it myself where the offset given by the API matches a call to a
>>   template?
>
> Look for this in the template response.
>
> "templateInfo": { "multiPartTemplateBlock": true }
>

Thanks ! I have updated WPCleaner to display the information about the
template (template name or multiple templates).
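
As a rough illustration of the distinction above (one named template versus several templates), here is a minimal Python sketch. The function name is hypothetical, the sample entries are only shaped like the Linter output quoted in this thread, and treating a missing "templateInfo" as "directly in the page wikitext" is an assumption rather than documented behaviour.

```python
from typing import Any, Dict


def describe_template_source(lint_entry: Dict[str, Any]) -> str:
    """Summarise where a lint entry comes from, based on its templateInfo."""
    info = lint_entry.get("templateInfo")
    if not info:
        # Assumption: no templateInfo means the issue is in the page wikitext.
        return "page wikitext"
    if info.get("multiPartTemplateBlock"):
        return "multiple templates"
    name = info.get("name")
    return f"template: {name}" if name else "template (unnamed)"


# Illustrative entries only; the second one is made up for the example.
single = {
    "type": "self-closed-tag",
    "params": {"name": "span"},
    "templateInfo": {"name": "Modèle:Censuré"},
}
multi = {"type": "missing-end-tag", "templateInfo": {"multiPartTemplateBlock": True}}
plain = {"type": "obsolete-tag", "params": {"name": "font"}}

for entry in (single, multi, plain):
    print(entry["type"], "->", describe_template_source(entry))
```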


I think I've found a discrepancy between Linter reports. On frwiki, the
page "Discussion:Yasser Arafat" is reported in the list for self-closed-tag
[1], but when I run the text of the page through the transform API [2], I
only get errors for obsolete-tag and mixed-content and nothing for
self-closed-tag.

[1] https://fr.wikipedia.org/wiki/Sp%C3%A9cial:LintErrors/self-closed-tag
[2]
https://fr.wikipedia.org/api/rest_v1/#!/Transforms/post_transform_wikitext_to_lint_title_revision
_______________________________________________
Wikitech-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: Tidy will be replaced by RemexHTML on Wikimedia wikis latest by June 2018

Subramanya Sastry
On 07/13/2017 02:18 AM, Nicolas Vervelle wrote:

>
> I think I've found some discrepancy between Linter reports. On frwiki, the
> page "Discussion:Yasser Arafat" is reported in the list for self-closed-tag
> [1], but when run the text of the page through the transform API [2], I
> only get errors for obsolete-tag and mixed-content and nothing for
> self-closed-tag.

When I pasted the wikitext for Discussion:Yasser_Arafat page in the
wikitext box AND entered the page title in the title box on
https://fr.wikipedia.org/api/rest_v1/#!/Transforms/post_transform_wikitext_to_lint_title_revision,
I do see the following among others:
...

{ "type": "self-closed-tag", "params": { "name": "span" }, "dsr": [ 183063, 183134, null, null ], "templateInfo": { "name": "Modèle:Censuré" } },

...

However, if I don't add the page title in the title box, I can reproduce
your problem ... so, clearly this is something to do with a template
depending on the page title.

I can reproduce this on the command line with the specific wikitext
substring that the Linter interface shows you. The output below shows
that the linter error depends on the page title being supplied.

---
[subbu@earth parsoid] echo '{{Censuré|Tu remarqueras que je ne te retourne pas la question.<br />}}' | parse.js --page Discussion:Yasser_Arafat --prefix frwiki --lint > /dev/null
[info/lint/self-closed-tag][frwiki/Discussion:Yasser_Arafat]
{"type":"self-closed-tag","params":{"name":"span"},"dsr":[0,71,null,null],"templateInfo":{"name":"Modèle:Censuré"}}
[info/lint/stripped-tag][frwiki/Discussion:Yasser_Arafat]
{"type":"stripped-tag","params":{"name":"SPAN"},"dsr":[0,71,null,null],"templateInfo":{"name":"Modèle:Censuré"}}
[subbu@earth parsoid] echo '{{Censuré|Tu remarqueras que je ne te retourne pas la question.<br />}}' | parse.js --prefix frwiki --lint > /dev/null
[subbu@earth parsoid]
---

When I add a --dump tplsrc flag to Parsoid (you can also get this output
from the expandtemplates Action API endpoint), I see the following:

---
<span class="censure" style="background-color:#EEF;color:#EEF;"
title="Tu remarqueras que je ne te retourne pas la question.<br
/>"><span style="visibility:hidden">Tu remarqueras que je ne te retourne
pas la question.<br /></span></span>
---
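
For readers who want to fetch the same expanded source through the Action API instead of Parsoid, here is a minimal Python sketch that only builds the request URL and makes no network call. The helper name is hypothetical; the parameters (action=expandtemplates, text, title, prop=wikitext) are the standard ones for that module as far as I know, so please check them against the API documentation before relying on this.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumption: the usual Action API entry point for frwiki.
FRWIKI_API = "https://fr.wikipedia.org/w/api.php"


def build_expandtemplates_url(wikitext: str, title: str, api: str = FRWIKI_API) -> str:
    """Build an expandtemplates URL; the title gives the template its page context."""
    params = {
        "action": "expandtemplates",
        "format": "json",
        "prop": "wikitext",
        "title": title,
        "text": wikitext,
    }
    return api + "?" + urlencode(params)


snippet = "{{Censuré|Tu remarqueras que je ne te retourne pas la question.<br />}}"
url = build_expandtemplates_url(snippet, "Discussion:Yasser_Arafat")
print(url)
# Round-trip check that the title survives URL encoding.
print(parse_qs(urlsplit(url).query)["title"])
```

Passing the title matters here for the same reason as in the lint run above: the template output depends on the page it is expanded for.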

So, it looks like Parsoid's tokenizer is tripping on the /> that is
present in the span title attribute and falsely assumes it is a
self-closing tag.
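
To make the effect concrete, here is a small Python sketch contrasting a scan that ignores attribute quoting with one that skips over quoted values. This is only an illustration of the pattern described above, not Parsoid's actual tokenizer, and both function names are made up for the example.

```python
import re


def looks_self_closed_naive(html: str, tag: str) -> bool:
    """Treat the first '/>' before any '>' as the end of the tag, ignoring quotes."""
    pattern = rf"<{re.escape(tag)}\b[^>]*/>"
    return re.match(pattern, html) is not None


def looks_self_closed_quote_aware(html: str, tag: str) -> bool:
    """Skip quoted attribute values before deciding whether the tag ends in '/>'."""
    pattern = rf"""<{re.escape(tag)}\b(?:[^>"']|"[^"]*"|'[^']*')*?(/?)>"""
    m = re.match(pattern, html)
    return bool(m and m.group(1) == "/")


# The expanded template source shown above, joined onto one line.
expanded = (
    '<span class="censure" style="background-color:#EEF;color:#EEF;" '
    'title="Tu remarqueras que je ne te retourne pas la question.<br />">'
    '<span style="visibility:hidden">Tu remarqueras que je ne te retourne '
    "pas la question.<br /></span></span>"
)

print(looks_self_closed_naive(expanded, "span"))        # True
print(looks_self_closed_quote_aware(expanded, "span"))  # False
```

The naive scan stops at the "/>" inside the title attribute and so reports the outer span as self-closed, which is the same kind of result the linter output above shows.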

In any case, in conclusion:

(1) Please provide the page title when you use the API
(2) There is a Parsoid bug in the detection of self-closing tags where the
presence of a "/>" in an HTML attribute triggers a false positive. This
has been reported previously ... so I suppose it is not as uncommon as I
thought. We'll take a look at that.
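
For point (1), a minimal Python sketch of building the transform request with the title included. It only constructs the URL and body and sends nothing. The path is inferred from the operation name in the API explorer link above (transform/wikitext/to/lint/{title}), and the "wikitext" body field and JSON encoding are assumptions; the helper name is hypothetical.

```python
import json
from typing import Optional, Tuple
from urllib.parse import quote

# Assumption: REST API base for frwiki, as in the explorer URL above.
REST_BASE = "https://fr.wikipedia.org/api/rest_v1"


def build_lint_request(wikitext: str, title: Optional[str] = None) -> Tuple[str, bytes]:
    """Return (url, body) for a wikitext-to-lint transform request."""
    url = f"{REST_BASE}/transform/wikitext/to/lint"
    if title:
        # Percent-encode the title so characters such as ':' are safe in the path.
        url += "/" + quote(title, safe="")
    body = json.dumps({"wikitext": wikitext}).encode("utf-8")
    return url, body


snippet = "{{Censuré|Tu remarqueras que je ne te retourne pas la question.<br />}}"
with_title, body = build_lint_request(snippet, "Discussion:Yasser_Arafat")
without_title, _ = build_lint_request(snippet)
print(with_title)
print(without_title)
```

Only the first form gives the templates their page context, which is what made the self-closed-tag entry appear in the run above.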

Subbu.

Re: Tidy will be replaced by RemexHTML on Wikimedia wikis latest by June 2018

Arlo Breault

> On Jul 13, 2017, at 10:35 AM, Subramanya Sastry <[hidden email]> wrote:
>
> (2) There is a Parsoid bug in detection of self-closing tags where presence of a "/>" in an HTML attribute triggers a false positive. This has been reported previously ... so I suppose it is not as uncommon as I thought. We'll take a look at that.

No, Parsoid is doing that by design to match the PHP parser.

See T97157 and https://phabricator.wikimedia.org/T170582#3435855



Re: Tidy will be replaced by RemexHTML on Wikimedia wikis latest by June 2018

Pine W
Hi folks,

Do you think that the implementation discussion should move to Phabricator?

Pine

Re: Tidy will be replaced by RemexHTML on Wikimedia wikis latest by June 2018

Nicolas Vervelle-4
On Thu, Jul 13, 2017 at 9:18 AM, Nicolas Vervelle <[hidden email]>
wrote:

> On Tue, Jul 11, 2017 at 5:05 PM, Subramanya Sastry <[hidden email]>
> wrote:
>
>> On 07/11/2017 05:13 AM, Nicolas Vervelle wrote:
>>
>>> - In the page dedicated to a category, there's a column telling if the
>>>   problem is due to one template (and which one) or by several
>>>   templates, but I don't get this information in the REST API for
>>>   Linter. Is it possible to have it in the API result or should I deduce
>>>   it myself where the offset given by the API matches a call to a
>>>   template?
>>
>> Look for this in the template response.
>>
>> "templateInfo": { "multiPartTemplateBlock": true }
>
> Thanks ! I have updated WPCleaner to display the information about the
> template (template name or multiple templates).
>

I've started adding a detection in WPCleaner (error #532) for the
missing-end-tag error reported by Linter (I'm starting with easy ones).

Is it normal that errors inside a gallery tag are reported as being
in a "multiPartTemplateBlock" when they are directly inside the page
wikitext?
Examples on frwiki: Manali
<https://fr.wikipedia.org/w/index.php?title=Manali&action=edit&lintid=4555235>,
Zillis-Reischen
<https://fr.wikipedia.org/w/index.php?title=Zillis-Reischen&action=edit&lintid=4555585>
...

Nico

Re: Tidy will be replaced by RemexHTML on Wikimedia wikis latest by June 2018

Subramanya Sastry
Hi Nico,

If you don't mind, let us move this more bug/feature-specific discussion
to Phabricator by filing bugs where appropriate. Or, we can have
discussions on-wiki at
https://www.mediawiki.org/wiki/Help_talk:Extension:Linter. I'll copy
your query to the talk page there and we can discuss it there.

Subbu.


On 07/17/2017 04:10 AM, Nicolas Vervelle wrote:

> I've started adding a detection in WPCleaner (error #532) for the
> missing-end-tag error reported by Linter (I'm starting with easy ones).
>
> Is it normal that errors inside a gallery tag are reported as being
> in a "multiPartTemplateBlock" when they are directly inside the page
> wikitext?
> Examples on frwiki: Manali
> <https://fr.wikipedia.org/w/index.php?title=Manali&action=edit&lintid=4555235>,
> Zillis-Reischen
> <https://fr.wikipedia.org/w/index.php?title=Zillis-Reischen&action=edit&lintid=4555585>
> ...
>
> Nico



Followup (Re: Tidy will be replaced by RemexHTML on Wikimedia wikis latest by June 2018)

Subramanya Sastry


On 07/06/2017 08:02 AM, Subramanya Sastry wrote:
>
> TL;DR
> -----
> The Parsing team wants to replace Tidy with a RemexHTML-based solution on the
> Wikimedia cluster by June 2018. This will require editors to fix pages and
> templates to address wikitext patterns that behave differently with
> RemexHTML.  Please see 'What editors will need to do' section on the Tidy
> replacement FAQ [1].
>
......
>
> 9. Monitoring progress
> ----------------------
> In order to monitor progress, we plan to do a weekly (or some such periodic
> frequency) test run that compares the rendering of pages with Tidy and with
> RemexHTML on a large sample of pages (in the 50K range) from a large subset
> of Wikimedia wikis (~50 or so).  This will give us a pulse of how fixups are
> going, and when we might be able to flip the switch on different wikis.

I wanted to post some followups on this.

1. We have a revived dashboard that tracks linter error counts on wikis
    for all linter categories.

    See https://tools.wmflabs.org/wikitext-deprecation/

2. We track the error counts as they change and publish weekly snapshots
    comparing counts to a July 24th baseline (which is when I first
    started collecting stats)

    See https://www.mediawiki.org/wiki/Parsing/Replacing_Tidy/Linter/Stats

3. We also have a pixel-diffs test run (previously called visual diffs)
    that compares page rendering with Tidy and with RemexHTML. The test
    set has 73K pages sampled from 60 wikis. These diffs more accurately
    reflect what kind of rendering differences we can expect to see if
    pages are not fixed.

    See http://mw-expt-tests.wmflabs.org/

4. Based on the runs above, I identified one more high priority linter
    category which is a Tidy whitespace bug and needs to be fixed (expect
    mostly templates, especially navboxes based on what I've seen in the
    test run above). Once the code is reviewed and deployed to the
    cluster, we'll start populating this category.

    See https://gerrit.wikimedia.org/r/#/c/371068/ and
https://gerrit.wikimedia.org/r/#/c/371071/
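
As a small worked example of the baseline comparison in item 2, here is a Python sketch that computes the percentage change of per-category counts against a baseline. The counts are invented purely for illustration and are not taken from the stats page; the function name is hypothetical.

```python
from typing import Dict


def percent_change(baseline: int, current: int) -> float:
    """Percentage change of current relative to baseline (negative means fewer errors)."""
    if baseline == 0:
        raise ValueError("baseline count must be non-zero")
    return (current - baseline) * 100.0 / baseline


# Hypothetical counts, for illustration only.
baseline_counts: Dict[str, int] = {"self-closed-tag": 1000, "missing-end-tag": 4000}
current_counts: Dict[str, int] = {"self-closed-tag": 750, "missing-end-tag": 4200}

for category, base in baseline_counts.items():
    change = percent_change(base, current_counts[category])
    print(f"{category}: {base} -> {current_counts[category]} ({change:+.1f}%)")
```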

Thanks,
Subbu.


Re: Followup (Re: Tidy will be replaced by RemexHTML on Wikimedia wikis latest by June 2018)

יגאל חיטרון
Hello and thank you for this. Is there a phab ticket to follow the
deployment process?
Igal (User:IKhitron)


2017-08-10 21:42 GMT+03:00 Subramanya Sastry <[hidden email]>:

> I wanted to post some followups on this.
>
> 1. We have a revived dashboard that tracks linter error counts on wikis
>    for all linter categories.
>
>    See https://tools.wmflabs.org/wikitext-deprecation/
>
> 2. We track the error counts as they change and publish weekly snapshots
>    comparing counts to a July 24th baseline (which is when I first
>    started collecting stats)
>
>    See https://www.mediawiki.org/wiki/Parsing/Replacing_Tidy/Linter/Stats
>
> 3. We also have a pixel-diffs test run (previously called visual diffs)
>    that compares page rendering with Tidy and with RemexHTML. The test
>    set has 73K pages sampled from 60 wikis. These diffs more accurately
>    reflect what kind of rendering differences we can expect to see if
>    pages are not fixed.
>
>    See http://mw-expt-tests.wmflabs.org/
>
> 4. Based on the runs above, I identified one more high priority linter
>    category which is a Tidy whitespace bug and needs to be fixed (expect
>    mostly templates, especially navboxes based on what I've seen in the
>    test run above). Once the code is reviewed and deployed to the
>    cluster, we'll start populating this category.
>
>    See https://gerrit.wikimedia.org/r/#/c/371068/ and
> https://gerrit.wikimedia.org/r/#/c/371071/
>
> Thanks,
> Subbu.

Re: Followup (Re: Tidy will be replaced by RemexHTML on Wikimedia wikis latest by June 2018)

Subramanya Sastry

On 08/10/2017 02:49 PM, יגאל חיטרון wrote:
> Hello and thank you for this. Is there a phab ticket to follow the
> deployment process?
> Igal (User:IKhitron)
We have the original Tidy replacement ticket
(https://phabricator.wikimedia.org/T89331) but, as we get closer to
starting phased deployments, we'll create phab tickets to track
deployments separately.

Subbu.



Re: Followup (Re: Tidy will be replaced by RemexHTML on Wikimedia wikis latest by June 2018)

יגאל חיטרון
Sorry for the misunderstanding, I was asking about the whitespace linter category.
Igal


2017-08-10 22:06 GMT+03:00 Subramanya Sastry <[hidden email]>:

>
> On 08/10/2017 02:49 PM, יגאל חיטרון wrote:
>
>> Hello and thank you for this. Is there a phab ticket to follow the
>> deployment process?
>> Igal (User:IKhitron)
>>
> We have the original Tidy replacement ticket (
> https://phabricator.wikimedia.org/T89331) but, as we get closer to start
> making phased deployments, we'll create phab tickets to track deployments
> separately.
>
>
> Subbu.

Re: Followup (Re: Tidy will be replaced by RemexHTML on Wikimedia wikis latest by June 2018)

Subramanya Sastry
Ah! No, there wasn't one. But I have now created
https://phabricator.wikimedia.org/T173096 and added you to the
ticket. We expect to deploy it by the end of next week.

Subbu.


On 08/10/2017 03:09 PM, יגאל חיטרון wrote:

> Sorry for misunderstanding, I spoke about the whitespace.
> Igal

Re: Followup (Re: Tidy will be replaced by RemexHTML on Wikimedia wikis latest by June 2018)

יגאל חיטרון
Saw it now. Thank you very much.
Igal


2017-08-11 17:26 GMT+03:00 Subramanya Sastry <[hidden email]>:

> Ah! No, there wasn't one. But, I created
> https://phabricator.wikimedia.org/T173096 now and added you to the
> ticket. We are expecting to deploy it by end of next week.
>
> Subbu.