I love Parsoid but it doesn't want me

I love Parsoid but it doesn't want me

Ricordisamoa
Are there any stable APIs for an application to get a parse tree in
machine-readable format, manipulate it and send the result back without
touching HTML?
I'm sorry if this question doesn't make any sense.

_______________________________________________
Wikitech-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: I love Parsoid but it doesn't want me

Antoine Musso-3
On 23/07/2015 08:15, Ricordisamoa wrote:
> Are there any stable APIs for an application to get a parse tree in
> machine-readable format, manipulate it and send the result back without
> touching HTML?
> I'm sorry if this question doesn't make any sense.

You might want to explain what you are trying to do and which wall you
have hit when attempting to use Parsoid :-)

--
Antoine "hashar" Musso



Re: I love Parsoid but it doesn't want me

Ricordisamoa
On 23/07/2015 15:28, Antoine Musso wrote:
> On 23/07/2015 08:15, Ricordisamoa wrote:
>> Are there any stable APIs for an application to get a parse tree in
>> machine-readable format, manipulate it and send the result back without
>> touching HTML?
>> I'm sorry if this question doesn't make any sense.
> You might want to explain what you are trying to do and which wall you
> have hit when attempting to use Parsoid :-)
>

For example, adding a template transclusion as new parameter in another
template.
XHTML5+RDFa is the wall :-(
Can't Parsoid's deserialization be caught at some point to get a
higher-level structure like mwparserfromhell
<https://github.com/earwig/mwparserfromhell>'s?

Re: I love Parsoid but it doesn't want me

C. Scott Ananian
HTML5+RDFa is a machine-readable format.  But I think what you are asking
for is either better documentation of the template-related stuff (did you
read through the slides in https://phabricator.wikimedia.org/T105175 ?) or
HTML template parameter support (https://phabricator.wikimedia.org/T52587)
which is in the codebase but not enabled by default in production.
 --scott


Re: I love Parsoid but it doesn't want me

Ricordisamoa
The slides are interesting, but for now it seems VisualEditor-focused
and not nearly as powerful as mwparserfromhell.
I don't care about presentation. I don't want HTML.
And I hate getting all edits tagged as "VisualEditor".

On 23/07/2015 22:02, C. Scott Ananian wrote:

> HTML5+RDFa is a machine-readable format.  But I think what you are asking
> for is either better documentation of the template-related stuff (did you
> read through the slides in https://phabricator.wikimedia.org/T105175 ?) or
> HTML template parameter support (https://phabricator.wikimedia.org/T52587)
> which is in the codebase but not enabled by default in production.
>   --scott

Re: I love Parsoid but it doesn't want me

C. Scott Ananian
Well, it's really just a different way of thinking about things.  Instead
of:
```
>>> import mwparserfromhell
>>> text = "I has a template! {{foo|bar|baz|eggs=spam}} See it?"
>>> wikicode = mwparserfromhell.parse(text)
>>> templates = wikicode.filter_templates()
```
you would write:
```
js> Parsoid = require('parsoid');
js> text = "I has a template! {{foo|bar|baz|eggs=spam}} See it?";
js> Parsoid.parse(text, { document: true }).then(function(res) {
      templates = res.out.querySelectorAll('[typeof~="mw:Transclusion"]');
      console.log(templates);
     }).done();
```

That said, it wouldn't be hard to clone the API of
http://mwparserfromhell.readthedocs.org/en/latest/api/mwparserfromhell.html
and that would probably be a great addition to the parsoid package API.

HTML is just a tree structured data representation.  Think of it as XML if
it makes you happier.  It just happens to come with well-defined semantics
and lots of manipulation libraries.

I don't know about edits tagged as "VisualEditor".  That seems like
something that should only be done by VE.  I take it you would like an easy
workflow to fetch a page, make edits, and then write the new revision back?
mwparserfromhell doesn't actually seem to have that functionality, but it
would also be nice to facilitate that use case if we can.
  --scott



Re: I love Parsoid but it doesn't want me

Ricordisamoa
On 24/07/2015 06:35, C. Scott Ananian wrote:

> Well, it's really just a different way of thinking about things.  Instead
> of:
> ```
> >>> import mwparserfromhell
> >>> text = "I has a template! {{foo|bar|baz|eggs=spam}} See it?"
> >>> wikicode = mwparserfromhell.parse(text)
> >>> templates = wikicode.filter_templates()
> ```
> you would write:
> ```
> js> Parsoid = require('parsoid');
> js> text = "I has a template! {{foo|bar|baz|eggs=spam}} See it?";
> js> Parsoid.parse(text, { document: true }).then(function(res) {
>        templates = res.out.querySelectorAll('[typeof~="mw:Transclusion"]');
>        console.log(templates);
>       }).done();
> ```
>
> That said, it wouldn't be hard to clone the API of
> http://mwparserfromhell.readthedocs.org/en/latest/api/mwparserfromhell.html

Parsoid's expressiveness seems to convey useless information, overlook
important details, or duplicate them in different places.
If I want to resize an image, am I supposed to change "data-file-width"
and "data-file-height"? "width" and "height"? Or "src"?
I think what I'm looking for is sort of an 'enhanced wikitext' rather
than 'annotated HTML'.

> and that would probably be a great addition to the parsoid package API.
>
> HTML is just a tree structured data representation.  Think of it as XML if
> it makes you happier.  It just happens to come with well-defined semantics
> and lots of manipulation libraries.
>
> I don't know about edits tagged as "VisualEditor".  That seems like that
> should only be done by VE.

All edits made via visualeditoredit
<https://www.mediawiki.org/w/api.php?action=help&modules=visualeditoredit>
are tagged.

> I take it you would like an easy work flow to
> fetch a page, make edits, and then write the new revision back?

Right.

>   mwparserfromhell doesn't actually seem to have that functionality

It is actually pretty easy to do with Pywikibot.
But since Parsoid happens to work server-side, it makes sense to request
and send back the structured tree directly.

> , but it
> would also be nice to facilitate that use case if we can.
>    --scott

Thanks for your time.

Re: I love Parsoid but it doesn't want me

Marko Obrovac
On 24 July 2015 at 07:34, Ricordisamoa <[hidden email]> wrote:

> On 24/07/2015 06:35, C. Scott Ananian wrote:
>
>> Well, it's really just a different way of thinking about things.  Instead
>> of:
>> ```
>> >>> import mwparserfromhell
>> >>> text = "I has a template! {{foo|bar|baz|eggs=spam}} See it?"
>> >>> wikicode = mwparserfromhell.parse(text)
>> >>> templates = wikicode.filter_templates()
>> ```
>> you would write:
>> ```
>> js> Parsoid = require('parsoid');
>> js> text = "I has a template! {{foo|bar|baz|eggs=spam}} See it?";
>> js> Parsoid.parse(text, { document: true }).then(function(res) {
>>        templates = res.out.querySelectorAll('[typeof~="mw:Transclusion"]');
>>        console.log(templates);
>>       }).done();
>> ```
>>
>> That said, it wouldn't be hard to clone the API of
>>
>> http://mwparserfromhell.readthedocs.org/en/latest/api/mwparserfromhell.html
>>
>
> Parsoid's expressiveness seems to convey useless information, overlook
> important details, or duplicate them in different places.
> If I want to resize an image, am I supposed to change "data-file-width"
> and "data-file-height"? "width" and "height"? Or "src"?
> I think what I'm looking for is sort of an 'enhanced wikitext' rather than
> 'annotated HTML'.
>
>> and that would probably be a great addition to the parsoid package API.
>>
>> HTML is just a tree structured data representation.  Think of it as XML if
>> it makes you happier.  It just happens to come with well-defined semantics
>> and lots of manipulation libraries.
>>
>> I don't know about edits tagged as "VisualEditor".  That seems like that
>> should only be done by VE.
>
> All edits made via visualeditoredit
> <https://www.mediawiki.org/w/api.php?action=help&modules=visualeditoredit>
> are tagged.
>
>> I take it you would like an easy work flow to
>> fetch a page, make edits, and then write the new revision back?
>
> Right.


RESTBase could help you there. With one API call, you can get the (stored)
latest HTML revision of a page in Parsoid format [1], but without the need
to wait for Parsoid to parse it (if the latest revision is in RESTBase's
storage). There is also section API support (you can get individual HTML
fragments of a page by ID, and send only those back for transformation into
wikitext [2]). There is also support for page editing (aka saving), but
these endpoints have not yet been enabled for WMF wikis in production due
to security concerns.

[1]
https://en.wikipedia.org/api/rest_v1/?doc#!/Page_content/page_html__title__get
[2]
https://en.wikipedia.org/api/rest_v1/?doc#!/Transforms/transform_sections_to_wikitext__title___revision__post
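
A minimal Python sketch of the read side described above, using only the standard library. It assumes the page HTML route has the form `/api/rest_v1/page/html/{title}`, as link [1] suggests. The names `page_html_url` and `fetch_page_html` are made up for illustration, and the fetch helper is defined but not called here.

```python
from urllib.parse import quote
from urllib.request import Request, urlopen

# Assumption: base URL of the REST API for English Wikipedia, taken from link [1].
REST_BASE = "https://en.wikipedia.org/api/rest_v1"


def page_html_url(title: str) -> str:
    """Build the URL for the stored Parsoid HTML of a page's latest revision."""
    # MediaWiki titles use underscores for spaces; everything else is
    # percent-encoded, including "/" so a subpage stays in one path segment.
    return f"{REST_BASE}/page/html/{quote(title.replace(' ', '_'), safe='')}"


def fetch_page_html(title: str) -> str:
    """Fetch the HTML in one network call; the User-Agent value is a placeholder."""
    request = Request(page_html_url(title), headers={"User-Agent": "example-bot/0.1"})
    with urlopen(request) as response:
        return response.read().decode("utf-8")


print(page_html_url("Main Page"))
```

Keeping URL construction separate from the network call makes the first part easy to check offline.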

Cheers,
Marko





--
Marko Obrovac, PhD
Senior Services Engineer
Wikimedia Foundation

Re: I love Parsoid but it doesn't want me

Subramanya Sastry
In reply to this post by Ricordisamoa
On 07/23/2015 01:07 PM, Ricordisamoa wrote:

> On 23/07/2015 15:28, Antoine Musso wrote:
>> On 23/07/2015 08:15, Ricordisamoa wrote:
>>> Are there any stable APIs for an application to get a parse tree in
>>> machine-readable format, manipulate it and send the result back without
>>> touching HTML?
>>> I'm sorry if this question doesn't make any sense.
>> You might want to explain what you are trying to do and which wall you
>> have hit when attempting to use Parsoid :-)
>>
>
> For example, adding a template transclusion as new parameter in
> another template.
> XHTML5+RDFa is the wall :-(
> Can't Parsoid's deserialization be caught at some point to get a
> higher-level structure like mwparserfromhell
> <https://github.com/earwig/mwparserfromhell>'s?

Parsoid and mwparserfromhell have different design goals and hence do
things differently.

Parsoid is meant to support HTML editing and hence provides semantic
information as annotations over the HTML document. It effectively
maintains a bidirectional/reversible mapping between segments of
wikitext and DOM trees. You can manipulate the DOM trees and get back
wikitext that represents the edited tree. As for useless information and
duplicate information -- I think if you look at the Parsoid DOM spec
[1], you will know what to look for and what to manipulate. The
information on the DOM is meant to (a) render accurately, (b) support the
various bots / clients / gadgets that look for specific kinds of
information, and (c) be editable easily. If that spec has holes or needs
updates or fixing, we are happy to do that. Do let us know.

mwparserfromhell is an entirely wikitext-centric library as far as I can
tell. It is meant to manipulate wikitext directly. It is a neat library
which provides a lot of utilities and makes it easy to do wikitext
transformations. It doesn't know about or care about HTML because it
doesn't need to. It also seems to effectively give you some kind of
wikitext-centric AST. These are all impressions based on a quick scan of
its docs -- so pardon any misunderstandings.

Parsoid does not provide you a wikitext AST directly since it doesn't
construct one. All wikitext information shows up indirectly as DOM
annotations (either attributes or JSON information in attributes). As
Scott showed, you can still do document ("wikitext") manipulations using
DOM libraries, CSS-style queries, or directly by walking the DOM. There
are lots of ways you can edit mediawiki pages without knowing about
wikitext and using the vast array of HTML libraries. That happens to be
our tagline: "we deal with wikitext so you don't have to".
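
To make the "JSON information in attributes" point concrete, here is a rough Python sketch of the use case raised earlier in the thread (adding a template transclusion as a new parameter of another template). It assumes the `data-mw` layout described in the DOM spec [1], where `parts[...].template.params` maps parameter names to `{"wt": ...}` values. The sample element and the helper `add_param` are illustrative only, and a real page would need a proper HTML5 parser rather than `xml.etree`.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical, simplified annotation for {{foo|bar|baz|eggs=spam}}.
DATA_MW = {"parts": [{"template": {
    "target": {"wt": "foo", "href": "./Template:Foo"},
    "params": {"1": {"wt": "bar"}, "2": {"wt": "baz"}, "eggs": {"wt": "spam"}},
    "i": 0,
}}]}

# Build a tiny DOM by hand instead of parsing real Parsoid output.
root = ET.Element("p")
span = ET.SubElement(root, "span", {
    "typeof": "mw:Transclusion",
    "about": "#mwt1",
    "data-mw": json.dumps(DATA_MW),
})


def add_param(element, name, wikitext):
    """Add a parameter to the first template recorded in the element's data-mw."""
    data = json.loads(element.get("data-mw"))
    template = next(part["template"] for part in data["parts"]
                    if isinstance(part, dict) and "template" in part)
    template["params"][name] = {"wt": wikitext}
    element.set("data-mw", json.dumps(data))


# Rough equivalent of querySelectorAll('[typeof~="mw:Transclusion"]').
transclusions = [el for el in root.iter()
                 if "mw:Transclusion" in el.get("typeof", "").split()]
add_param(transclusions[0], "extra", "{{other|x}}")

params = json.loads(span.get("data-mw"))["parts"][0]["template"]["params"]
print(sorted(params))
```

The modified HTML would then be handed back to Parsoid for conversion to wikitext; that step is not shown.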

But, you are right. It can indeed seem cumbersome if you want to
directly manipulate wikitext without the DOM getting in between or
having to deal with DOM libraries. But that is not the use case we
target. There are a vastly greater number of libraries in all kinds of
languages (and developers) that know about HTML and can render, handle,
and manipulate HTML easily than know how to (or want to) manipulate
wikitext programmatically. Kind of the difference between the wikitext
editor and the visual editor. They each have their constituencies and roles.

All that said, as Scott noted, it is possible to develop a
mwparserfromhell-like layer on top of the Parsoid DOM annotations if you
want a wikitext-centric view (as opposed to a DOM-centric view that most
editing clients seem to want). But, since that is not a use case that we
target, that hasn't been on our radar. If someone does want to take that
on, and thinks it would be useful, we are happy to provide assistance.
It should not be too difficult.

Does that help summarize this issue and clarify the differences and
approaches of these two tools? I am "on vacation" :-)  so responses will
be delayed.

Subbu.

[1] http://www.mediawiki.org/wiki/Parsoid/MediaWiki_DOM_spec


Re: I love Parsoid but it doesn't want me

C. Scott Ananian
In reply to this post by Ricordisamoa
On Fri, Jul 24, 2015 at 12:34 AM, Ricordisamoa <[hidden email]> wrote:

> Parsoid's expressiveness seems to convey useless information, overlook
> important details, or duplicate them in different places.
> If I want to resize an image, am I supposed to change "data-file-width"
> and "data-file-height"? "width" and "height"? Or "src"?
>

These are great points, and reports from folks like you will help to
improve our documentation.  My goal for Parsoid's DOM[1] is that every bit
of information from the wikitext is represented exactly *once* in the
result.

In your example, `data-file-width` and `data-file-height` represent the
*unscaled* size of the *source* image.  Many image scaling operations want
to know this, so we include it in the DOM.  It is ignored when you convert
back to wikitext.

The `width` and `height` attributes are what you should modify if you want
to resize an image, just like you would do for any naive html editor.

The `src` attribute is again mostly ignored (sigh); the 'resource'
attribute specifies the url of the unscaled image.  Of course if 'resource'
is missing we'll try to make do with `src`; we really try hard to do
something reasonable with whatever we're given.
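
As a hedged illustration of the attribute roles above, in Python: the `<img>` fragment below is a made-up example rather than actual Parsoid output, and `resize` is a hypothetical helper. It changes only `width` and `height`, and leaves `data-file-width`, `data-file-height`, `resource` and `src` alone. Keeping the aspect ratio of the source image is a choice made for this sketch.

```python
import xml.etree.ElementTree as ET

# Made-up fragment for illustration; attribute names follow the message above.
FRAGMENT = (
    '<img resource="./File:Foo.jpg" src="//example.org/thumb/220px-Foo.jpg" '
    'data-file-width="1000" data-file-height="500" width="220" height="110"/>'
)


def resize(img, new_width):
    """Set width/height, keeping the aspect ratio of the unscaled source image."""
    file_width = int(img.get("data-file-width"))
    file_height = int(img.get("data-file-height"))
    img.set("width", str(new_width))
    img.set("height", str(round(new_width * file_height / file_width)))


img = ET.fromstring(FRAGMENT)
resize(img, 400)
print(img.get("width"), img.get("height"))
```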
  --scott

[1] There is a tension between "don't repeat yourself" and the use of
Parsoid DOM for read views.  Certain attributes (like "alt" and "title")
get duplicated by default by the PHP parser.  So far I think we've been
mostly successful in not letting this sort of thing infect the Parsoid DOM,
but there may be corner cases we accommodate for the sake of ease-of-use for
viewers.

--
(http://cscott.net)

Re: I love Parsoid but it doesn't want me

James Forrester-4
In reply to this post by Ricordisamoa
On 23 July 2015 at 22:34, Ricordisamoa <[hidden email]> wrote:

> On 24/07/2015 06:35, C. Scott Ananian wrote:
>
>> I don't know about edits tagged as "VisualEditor".  That seems like that
>> should only be done by VE.
>
> All edits made via visualeditoredit
> <https://www.mediawiki.org/w/api.php?action=help&modules=visualeditoredit>
> are tagged.


Yes. That's because that is the *private* API for VisualEditor. It
absolutely should not ever be used by anyone else. It's not like any of
the 'real' APIs in MediaWiki – it is designed for exactly one use case
(VisualEditor), makes huge assumptions about the world and what is needed
(like tagging edits), and we make breaking changes all the time.
Unfortunately, the request to badge internal APIs got turned into flagging
it and similar APIs in MediaWiki as "This module is internal or unstable.",
which isn't strong enough on just how bad an idea it is to use it. I would
extremely strongly suggest that you do not use it, ever.

As Marko, Subbu and Scott point out, we have actual public APIs for this
kind of stuff, in the form of RESTBase and Parsoid, and that's what you
should use.

Yours,
--
James D. Forrester
Lead Product Manager, Editing
Wikimedia Foundation, Inc.

[hidden email] | @jdforrester

Re: I love Parsoid but it doesn't want me

C. Scott Ananian
As a proof of concept, I started to build a `mwparserfromhell`-like
interface to the Parsoid DOM.

You can see it at https://gerrit.wikimedia.org/r/226734

I started by translating the template examples from the mwparserfromhell
documentation, which means I'm really jumping in at the deep end.  Most
non-template manipulations should be much easier!
 --scott


Re: I love Parsoid but it doesn't want me

Ricordisamoa
In reply to this post by Marko Obrovac
Thanks Marko. Replies inline

On 24/07/2015 15:07, Marko Obrovac wrote:

> On 24 July 2015 at 07:34, Ricordisamoa <[hidden email]> wrote:
>
>> On 24/07/2015 06:35, C. Scott Ananian wrote:
>>
>>> Well, it's really just a different way of thinking about things.  Instead
>>> of:
>>> ```
>>> >>> import mwparserfromhell
>>> >>> text = "I has a template! {{foo|bar|baz|eggs=spam}} See it?"
>>> >>> wikicode = mwparserfromhell.parse(text)
>>> >>> templates = wikicode.filter_templates()
>>> ```
>>> you would write:
>>> ```
>>> js> Parsoid = require('parsoid');
>>> js> text = "I has a template! {{foo|bar|baz|eggs=spam}} See it?";
>>> js> Parsoid.parse(text, { document: true }).then(function(res) {
>>>         templates = res.out.querySelectorAll('[typeof~="mw:Transclusion"]');
>>>         console.log(templates);
>>>        }).done();
>>> ```
>>>
>>> That said, it wouldn't be hard to clone the API of
>>>
>>> http://mwparserfromhell.readthedocs.org/en/latest/api/mwparserfromhell.html
>>>
>> Parsoid's expressiveness seems to convey useless information, overlook
>> important details, or duplicate them in different places.
>> If I want to resize an image, am I supposed to change "data-file-width"
>> and "data-file-height"? "width" and "height"? Or "src"?
>> I think what I'm looking for is sort of an 'enhanced wikitext' rather than
>> 'annotated HTML'.
>>
>>> and that would probably be a great addition to the parsoid package API.
>>>
>>> HTML is just a tree structured data representation.  Think of it as XML if
>>> it makes you happier.  It just happens to come with well-defined semantics
>>> and lots of manipulation libraries.
>>>
>>> I don't know about edits tagged as "VisualEditor".  That seems like that
>>> should only be done by VE.
>>>
>> All edits made via visualeditoredit
>> <https://www.mediawiki.org/w/api.php?action=help&modules=visualeditoredit>
>> are tagged.
>>
>>> I take it you would like an easy work flow to
>>> fetch a page, make edits, and then write the new revision back?
>>>
>> Right.
>
> RESTBase could help you there. With one API call, you can get the (stored)
> latest HTML revision of a page in Parsoid format [1], but without the need
> to wait for Parsoid to parse it (if the latest revision is in RESTBase's
> storage).

What if it isn't?

> There is also section API support (you can get individual HTML
> fragments of a page by ID, and send only those back for transformation into
> wikitext [2]). There is also support for page editing (aka saving), but
> these endpoints have not yet been enabled for WMF wikis in production due
> to security concerns.

Then I guess HTML would have to be converted into wikitext before
saving? +1 API call

>
> [1]
> https://en.wikipedia.org/api/rest_v1/?doc#!/Page_content/page_html__title__get
> [2]
> https://en.wikipedia.org/api/rest_v1/?doc#!/Transforms/transform_sections_to_wikitext__title___revision__post
>
> Cheers,
> Marko
>

Re: I love Parsoid but it doesn't want me

Ricordisamoa
In reply to this post by C. Scott Ananian
On 24/07/2015 15:56, C. Scott Ananian wrote:

> On Fri, Jul 24, 2015 at 12:34 AM, Ricordisamoa <[hidden email]> wrote:
>> Parsoid's expressiveness seems to convey useless information, overlook
>> important details, or duplicate them in different places.
>> If I want to resize an image, am I supposed to change "data-file-width"
>> and "data-file-height"? "width" and "height"? Or "src"?
>>
> These are great points, and reports from folks like you will help to
> improve our documentation.  My goal for Parsoid's DOM[1] is that every bit
> of information from the wikitext is represented exactly *once* in the
> result.

Be it so!

>
> In your example, `data-file-width` and `data-file-height` represent the
> *unscaled* size of the *source* image.  Many image scaling operations want
> to know this, so we include it in the DOM.  It is ignored when you convert
> back to wikitext.
>
> The `width` and `height` attributes are what you should modify if you want
> to resize an image, just like you would do for any naive html editor.

AFAICS there's still no way to know exactly how an image's size was
specified in the original wikitext.

>
> The `src` attribute is again mostly ignored (sigh); the 'resource'
> attribute specifies the url of the unscaled image.  Of course if 'resource'
> is missing we'll try to make do with `src`; we really try hard to do
> something reasonable with whatever we're given.
>    --scott
>
> [1] There is a tension between "don't repeat yourself" and the use of
> Parsoid DOM for read views.  Certain attributes (like "alt" and "title")
> get duplicated by default by the PHP parser.  So far I think we've been
> mostly successful in not letting this sort of thing infect the Parsoid DOM,
> but there may be corner cases we accommodate for the sake of ease-of-use for
> viewers.
>



Re: I love Parsoid but it doesn't want me

Ricordisamoa
In reply to this post by James Forrester-4
On 24/07/2015 17:18, James Forrester wrote:

> On 23 July 2015 at 22:34, Ricordisamoa <[hidden email]> wrote:
>
>> On 24/07/2015 06:35, C. Scott Ananian wrote:
>>
>>> I don't know about edits tagged as "VisualEditor".  That seems like that
>>> should only be done by VE.
>>
>> All edits made via visualeditoredit
>> <https://www.mediawiki.org/w/api.php?action=help&modules=visualeditoredit>
>> are tagged.
>
> Yes. That's because that is the *private* API for VisualEditor. It
> absolutely should not ever be used by anyone else. It's not like any of
> the 'real' APIs in MediaWiki – it is designed for exactly one use case
> (VisualEditor), makes huge assumptions about the world and what is needed
> (like tagging edits), and we make breaking changes all the time.
> Unfortunately, the request to badge internal APIs got turned into flagging
> it and similar APIs in MediaWiki as "This module is internal or unstable.",
> which isn't strong enough on just how bad an idea it is to use it. I would
> extremely strongly suggest that you do not use it, ever.

Oops. https://test.wikipedia.org/w/index.php?title=Tablez&action=history

>
> As Marko, Subbu and Scott point out, we have actual public APIs for this
> kind of stuff, in the form of RESTBase and Parsoid, and that's what you
> should use.
>
> Yours,



Re: I love Parsoid but it doesn't want me

Gabriel Wicke-3
In reply to this post by Ricordisamoa
On Fri, Jul 24, 2015 at 10:58 AM, Ricordisamoa <[hidden email]> wrote:

>> RESTBase could help you there. With one API call, you can get the (stored)
>> latest HTML revision of a page in Parsoid format [1], but without the need
>> to wait for Parsoid to parse it (if the latest revision is in RESTBase's
>> storage).
>
> What if it isn't?



If it is not in storage, then it will be generated transparently. This
should happen only occasionally, when you request a revision within a few
seconds of it being saved.


>> There is also section API support (you can get individual HTML
>> fragments of a page by ID, and send only those back for transformation
>> into wikitext [2]). There is also support for page editing (aka saving), but
>> these endpoints have not yet been enabled for WMF wikis in production due
>> to security concerns.
>
> Then I guess HTML would have to be converted into wikitext before saving?
> +1 API call
>

As Marko mentioned, the HTML save endpoint is not yet enabled in
production. Once it is, you will be able to directly POST modified HTML to
save it, without adding a VisualEditor tag or having to perform extra API
requests.

Gabriel

Re: I love Parsoid but it doesn't want me

Ricordisamoa
In reply to this post by Subramanya Sastry
On 24/07/2015 15:53, Subramanya Sastry wrote:

> On 07/23/2015 01:07 PM, Ricordisamoa wrote:
>> On 23/07/2015 15:28, Antoine Musso wrote:
>>> On 23/07/2015 08:15, Ricordisamoa wrote:
>>>> Are there any stable APIs for an application to get a parse tree in
>>>> machine-readable format, manipulate it and send the result back
>>>> without
>>>> touching HTML?
>>>> I'm sorry if this question doesn't make any sense.
>>> You might want to explain what you are trying to do and which wall you
>>> have hit when attempting to use Parsoid :-)
>>>
>>
>> For example, adding a template transclusion as new parameter in
>> another template.
>> XHTML5+RDFa is the wall :-(
>> Can't Parsoid's deserialization be caught at some point to get a
>> higher-level structure like mwparserfromhell
>> <https://github.com/earwig/mwparserfromhell>'s?
>
> Parsoid and mwparserfromhell have different design goals and hence do
> things differently.
>
> Parsoid is meant to support HTML editing and hence provides semantic
> information as annotations over the HTML document. It effectively
> maintains a bidirectional/reversible mapping between segments of
> wikitext and DOM trees. You can manipulate the DOM trees and get back
> wikitext that represents the edited tree. As for useless information
> and duplicate information -- I think if you look at the Parsoid DOM
> spec [1], you will know what to look for and what to manipulate. The
> information on the DOM is meant to (a) render accurately, (b) support
> the various bots / clients / gadgets that look for specific kinds of
> information, and (c) be editable easily. If that spec has holes or
> needs updates or fixing, we are happy to do that. Do let us know.
>
> mwparserfromhell is an entirely wikitext-centric library as far as I
> can tell. It is meant to manipulate wikitext directly. It is a neat
> library which provides a lot of utilities and makes it easy to do
> wikitext transformations. It doesn't know about or care about HTML
> because it doesn't need to. It also seems to effectively give you
> some kind of wikitext-centric AST. These are all impressions based on
> a quick scan of its docs -- so pardon any misunderstandings.
>
> Parsoid does not provide you a wikitext AST directly since it doesn't
> construct one. All wikitext information shows up indirectly as DOM
> annotations (either attributes or JSON information in attributes). As
> Scott showed, you can still do document ("wikitext") manipulations
> using DOM libraries, CSS-style queries, or directly by walking the
> DOM. There are lots of ways you can edit mediawiki pages without
> knowing about wikitext and using the vast array of HTML libraries.
> That happens to be our tagline: "we deal with wikitext so you don't
> have to".
>
> But, you are right. It can indeed seem cumbersome if you want to
> directly manipulate wikitext without the DOM getting in between or
> having to deal with DOM libraries. But that is not the use case we
> target. There are a vastly greater number of libraries in all kinds of
> languages (and developers) that know about HTML and can render,
> handle, and manipulate HTML easily than know how to (or want to)
> manipulate wikitext programmatically. Kind of the difference between
> the wikitext editor and the visual editor. They each have their
> constituencies and roles.
>
> All that said, as Scott noted, it is possible to develop a
> mwparserfromhell-like layer on top of the Parsoid DOM annotations if
> you want a wikitext-centric view (as opposed to a DOM-centric view
> that most editing clients seem to want). But, since that is not a use
> case that we target, that hasn't been on our radar. If someone does
> want to take that on, and thinks it would be useful, we are happy to
> provide assistance. It should not be too difficult.
>
> Does that help summarize this issue and clarify the differences and
> approaches of these two tools? I am "on vacation" :-)  so responses
> will be delayed.
>
> Subbu.
>
> [1] http://www.mediawiki.org/wiki/Parsoid/MediaWiki_DOM_spec
>

Hi Subbu,
thank you for this thoughtful insight.
HTML is not a barrier by itself. The problem seems to be Parsoid being
built primarily with VisualEditor in mind. It is not clear to me how a
single DOM serving both view and edit modes can avoid redundancy.
I see huge demand for alternative wikignome-style editors. The more
predictable, concise and documented Parsoid's DOM is, the more users you
get. I hope we can meet in the middle :-)


Re: I love Parsoid but it doesn't want me

C. Scott Ananian
I agree that we have not (to date) spent a lot of time on APIs supporting
direct editing of the Parsoid DOM.  I tend to do things directly using the
low-level DOM methods myself (and that's how I presented my Parsoid
tutorial at Wikimania this year) but I can see the attractiveness of the
`mwparserfromhell` API in abstracting some of the details of the
representation.
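
To make the "low-level" route concrete, here is a minimal sketch of the use case raised at the start of this thread (adding a template transclusion as a new parameter of another template). It assumes the data-mw layout described in the MediaWiki DOM spec (a `parts` array whose `template` entries carry `target` and `params`); the sample markup, the `Infobox` template, and the `coords` parameter are made up for illustration, and the step of writing the edited JSON back into the document and serializing it through Parsoid is not shown.

```python
import json
from html.parser import HTMLParser


class TransclusionFinder(HTMLParser):
    """Collect the decoded data-mw JSON of every mw:Transclusion element."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # typeof can hold several space-separated values
        if "mw:Transclusion" in (attrs.get("typeof") or "").split():
            self.found.append(json.loads(attrs.get("data-mw") or "{}"))


def add_param(data_mw, name, wikitext):
    """Set parameter `name` on the first template part (assumed layout)."""
    for part in data_mw.get("parts", []):
        # parts may also be plain strings (literal wikitext between templates)
        if isinstance(part, dict) and "template" in part:
            part["template"].setdefault("params", {})[name] = {"wt": wikitext}
            return data_mw
    raise ValueError("no template part found")


# Hypothetical sample; real Parsoid output carries more attributes.
SAMPLE = (
    '<span about="#mwt1" typeof="mw:Transclusion" '
    'data-mw=\'{"parts":[{"template":{"target":{"wt":"Infobox",'
    '"href":"./Template:Infobox"},"params":{"name":{"wt":"Example"}},'
    '"i":0}}]}\'>...</span>'
)

finder = TransclusionFinder()
finder.feed(SAMPLE)
edited = add_param(finder.found[0], "coords", "{{Coord|1|2}}")
print(json.dumps(edited["parts"][0]["template"]["params"], indent=2))
```

The nested transclusion is passed as plain wikitext in the `wt` field, so nothing here needs to understand wikitext beyond the template boundary.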

Thankfully you can have it both ways!  Over the past week I've cloned the
`mwparserfromhell` API, built on top of the Parsoid DOM.  The initial
patches have been merged, but there's a little work to do to get the API
docs up on docs.wikimedia.org properly.  Once that's done I'll post here
with pointers.

Eventually I'd like to put the pieces together and implement something like
a `pywikibot` clone based on this API and using the RESTBase APIs for
read/write access to the wiki.  As has been mentioned, the RESTBase API for
saving edits is not yet quite complete (
https://phabricator.wikimedia.org/T101501); once that is done there should
be no problem connecting the dots.  (In the meantime you can use the API I
just implemented to reserialize the wikitext and then use the standard PHP
APIs, but that's a little bit clunky.)
 --scott


Re: I love Parsoid but it doesn't want me

Subramanya Sastry
In reply to this post by Ricordisamoa
On 07/31/2015 12:55 PM, Ricordisamoa wrote:
>
> Hi Subbu,
> thank you for this thoughtful insight.

And thank you for starting this thread. :-)

> HTML is not a barrier by itself. The problem seems to be Parsoid being
> built primarily with VisualEditor in mind.

While we want the DOM to be VE-friendly, we definitely don't want the
DOM to be VE-centric, and that has been the intention from the very
beginning. Flow and CX also use the Parsoid DOM for their functionality.
There are other users too [1]. We definitely want Parsoid's output to be
useful and usable more broadly as the canonical output representation of
wikitext and are open to fixing whatever prevents that.

As Scott noted in the other email on the thread, inspired (and maybe
challenged :-) ) by mwparserfromhell's utilities, he has already
whipped up a layer that provides an easier interface for manipulating
the DOM.

> It is not clear to me how a single DOM serving both view and edit
> modes can avoid redundancy.

You are right that there are some redundancies in information
representation (because of having to serve multiple needs), but as far
as I know, it is mostly around image attributes. If there is anything
else specific (beyond image attributes) that is bothering you, can you
flag that?

> I see huge demand for alternative wikignome-style editors. The more
> predictable, concise and documented Parsoid's DOM is, the more users
> you get.

I think Parsoid's DOM is predictable :-) but, can you say more about
what prompted you to say that? As for documentation, we document the DOM
we generate and its semantics here [2]. As for size, I just looked at
the Barack Obama page and here are some size numbers.

1540407 /tmp/Barack_Obama.parsoid.html
1197318 /tmp/Barack_Obama.parsoid.no-data-mw.html
1045161 /tmp/Barack_Obama.php-parser.output.footer-stripped.html

Right now, because we inline template and other editable information (as
inline JSON attributes of the DOM), it is a bit bulky. However, we have
always had plans to move the data-mw attribute into its own bucket. If
we do that at some point, the size will be closer to that of the current
PHP parser output. If we also moved page properties and other metadata
out, it would shrink a little bit more.

For views that don't need to support editing or any other manipulation
or analyses, we can strip the HTML more aggressively without affecting
the rendering and get close to, or even below, the PHP parser output
size (there might be use cases where that is the appropriate thing to
do). I could get this down to under 1M by stripping rel attributes,
element ids, and about ids for identifying template output.
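
As a rough illustration of that kind of stripping (a sketch, not the script used for the numbers above), the snippet below drops `rel` and `about` attributes and any `id` that looks auto-generated. It assumes the fragment is well-formed XML and that generated ids start with "mw"; both are assumptions, and the sample markup is made up.

```python
import xml.etree.ElementTree as ET

# Attributes removed unconditionally in this sketch.
STRIP_ALWAYS = ("rel", "about")


def strip_for_view(xhtml_fragment):
    """Return the fragment without rel/about attributes and generated ids."""
    root = ET.fromstring(xhtml_fragment)
    for el in root.iter():
        for name in STRIP_ALWAYS:
            el.attrib.pop(name, None)
        # Assumption: generated element ids start with "mw"; other ids
        # (e.g. heading anchors) are kept.
        if el.get("id", "").startswith("mw"):
            del el.attrib["id"]
    return ET.tostring(root, encoding="unicode")


# Hypothetical sample fragment.
SAMPLE = (
    '<p id="mwAg"><a rel="mw:WikiLink" href="./Foo" id="keep">Foo</a>'
    '<span about="#mwt1" typeof="mw:Transclusion">x</span></p>'
)

stripped = strip_for_view(SAMPLE)
print(len(SAMPLE), "->", len(stripped))
print(stripped)
```

A full page may contain constructs that a strict XML parser rejects, so a real tool would more likely use an HTML5-aware DOM library.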

But, for editing (not just in VE) use cases, because of additional
markup in place on the page (element ids, other markup for
transclusions, extensions, links, etc.), the output will probably be
somewhat larger than the corresponding PHP parser output. If we can keep
it under 1.1x of the PHP parser output size, I think we are good.
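
For reference, the ratios implied by the three sizes quoted earlier can be worked out as below (a small sketch; the dictionary keys are just labels chosen here). The full Parsoid output comes to roughly 1.47x the PHP parser output, the variant without data-mw to roughly 1.15x, and data-mw alone accounts for 343089 of the 1540407, so the 1.1x target still needs some of the further stripping described above.

```python
# Sizes as quoted in this message for the Barack Obama page.
sizes = {
    "parsoid": 1540407,
    "parsoid_no_data_mw": 1197318,
    "php_parser": 1045161,
}

baseline = sizes["php_parser"]
ratios = {name: size / baseline for name, size in sizes.items()}

# Portion of the full Parsoid output taken up by data-mw attributes.
data_mw_size = sizes["parsoid"] - sizes["parsoid_no_data_mw"]
data_mw_share = data_mw_size / sizes["parsoid"]

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.3f}x of PHP parser output")
print(f"data-mw: {data_mw_size} ({data_mw_share:.1%} of Parsoid output)")
print(f"1.1x target: {round(1.1 * baseline)}")
```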

> I hope we can meet in the middle :-)

Please file bugs and continue to report things that get in the way of
using Parsoid.

Subbu.

[1] https://www.mediawiki.org/wiki/Parsoid/Users
[2] http://www.mediawiki.org/wiki/Parsoid/MediaWiki_DOM_spec


Re: I love Parsoid but it doesn't want me

Ricordisamoa
In reply to this post by C. Scott Ananian
On 31/07/2015 21:08, C. Scott Ananian wrote:

> I agree that we have not (to date) spent a lot of time on APIs supporting
> direct editing of the Parsoid DOM.  I tend to do things directly using the
> low-level DOM methods myself (and that's how I presented my Parsoid
> tutorial at Wikimania this year) but I can see the attractiveness of the
> `mwparserfromhell` API in abstracting some of the details of the
> representation.
>
> Thankfully you can have it both ways!  Over the past week I've cloned the
> `mwparserfromhell` API, built on top of the Parsoid DOM.  The initial
> patches have been merged, but there's a little work to do to get the API
> docs up on docs.wikimedia.org properly.  Once that's done I'll post here
> with pointers.

Thanks!
Unfortunately, that still requires using Node.js and depending on the
parsoid package.
Were the mwparserfromhell-like 'AST' exposed by RESTBase directly,
there'd easily be lots of thin manipulation libraries in different
programming languages.

>
> Eventually I'd like to put the pieces together and implement something like
> a `pywikibot` clone based on this API and using the RESTBase APIs for
> read/write access to the wiki.  As has been mentioned, the RESTBase API for
> saving edits is not yet quite complete (
> https://phabricator.wikimedia.org/T101501); once that is done there should
> be no problem connecting the dots.  (In the meantime you can use the API I
> just implemented to reserialize the wikitext and then use the standard PHP
> APIs, but that's a little bit clunky.)
>   --scott