Enabling shared tabular data pages

Yuri Astrakhan-2
We have had some good feedback for the new shared tabular data feature, and
we are getting ready to deploy it in production. It would be amazing if you
can give it a final look-over to see if there are any blockers left.

The first stage will be to enable  Data:*.tab  pages on Commons, and allow
all other wikis direct access to it via Lua code and Graph extension. All
data at this point must be licensed under CC0. More licensing options are
still under discussion, and can be easily added later.

In line with the "release early, release often", we will not have any
elaborate data editing interface beyond the raw JSON code editor for the
first release. Our initial target audience is the more experienced users
who will evaluate and test the new technology. Once the underlying tech is
stable and proven, we will work on making it more accessible to the
general audience.

Links:
* Task: https://phabricator.wikimedia.org/T134426
* Demo: http://data.wmflabs.org
* Technical: https://www.mediawiki.org/wiki/Extension:JsonConfig/Tabular
* Discussion:
https://commons.wikimedia.org/wiki/Commons:Village_pump/Proposals#Tabular_data_storage_for_Commons.21
* Facebook:
https://www.facebook.com/groups/wikipediaweekly/permalink/997545366959961/
_______________________________________________
Wikitech-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: Enabling shared tabular data pages

Daniel Kinzler-2
On 04.06.2016 at 18:47, Yuri Astrakhan wrote:
> In line with the "release early, release often", we will not have any
> elaborate data editing interface beyond the raw JSON code editor for the
> first release.

A word of caution about this strategy: this is great for user facing things, but
it really sucks if you are creating artifacts, such as page revisions. You will
have to stay compatible with your very first data format, and the second, and
the third, etc, forever. Similarly, once you have an ecosystem of tools that
rely on your API and data model, changing it becomes rather troublesome.

So, for anything that is supposed to offer a stable API, or creates persistent
data, "release early, release often" is not a good strategy in my experience. A
lot of pain lies this way. Remember: wikitext syntax was once a "let's just make
it work, we will fix it later" hack...

-- daniel


Re: Enabling shared tabular data pages

Yuri Astrakhan-2
Daniel, I agree about the data/api versioning. I was mostly talking about
features and capabilities. For example, we could spend the next year
developing a visual table editor, implement support for unlimited table
sizes, provide import/export from other table formats, introduce elaborate
schema validation, and many other cool features. And after that year
realize that users don't need this whole thing at all, or need something
similar but very different.  Or we could release one small, well defined,
stable subset of that functionality, get feedback, and move forward.

Do you have any thoughts about the proposed data structure?

On Mon, Jun 6, 2016 at 4:09 PM, Daniel Kinzler <[hidden email]>
wrote:

> On 04.06.2016 at 18:47, Yuri Astrakhan wrote:
> > In line with the "release early, release often", we will not have any
> > elaborate data editing interface beyond the raw JSON code editor for the
> > first release.
>
> A word of caution about this strategy: this is great for user facing things, but
> it really sucks if you are creating artifacts, such as page revisions. You will
> have to stay compatible with your very first data format, and the second, and
> the third, etc, forever. Similarly, once you have an ecosystem of tools that
> rely on your API and data model, changing it becomes rather troublesome.
>
> So, for anything that is supposed to offer a stable API, or creates persistent
> data, "release early, release often" is not a good strategy in my experience. A
> lot of pain lies this way. Remember: wikitext syntax was once a "let's just make
> it work, we will fix it later" hack...
>
> -- daniel

Re: Enabling shared tabular data pages

Rob Lanphier-4
On Mon, Jun 6, 2016 at 6:40 AM, Yuri Astrakhan <[hidden email]>
wrote:

> Daniel, I agree about the data/api versioning. I was mostly talking about
> features and capabilities. For example, we could spend the next year
> developing a visual table editor, implement support for unlimited table
> sizes, provide import/export from other table formats, introduce elaborate
> schema validation, and many other cool features. And after that year
> realize that users don't need this whole thing at all, or need something
> similar but very different.  Or we could release one small, well defined,
> stable subset of that functionality, get feedback, and move forward.
>

Hi Yuri,

I think one thing that would be helpful for me (and I suspect many people
who want to help) is some more specifics about this statement from your
original email: "We have had some good feedback for the new shared tabular
data feature, and we are getting ready to deploy it in production."  Which
"we" are you referring to, and by "getting ready to deploy it in
production", does that mean it's about to be usable where someone could
upload gigabytes of production data in this format to Commons by the end of
the week?  Is there a more measured plan published somewhere?

This all sounds very cool, but also an area where we could accidentally
accrue a crushing load of technical debt without fully realizing it (per
Daniel's comment).  I'll confess to being ignorant on everything that's
been going on, and I'm wondering now how desperately I should study your
documentation to make up for it (and how important it is to drop other work
to make time for this).

Rob

Re: Enabling shared tabular data pages

Daniel Kinzler-2
On 06.06.2016 at 15:40, Yuri Astrakhan wrote:
> Do you have any thoughts about the proposed data structure?

The structure looks sane and future-proof to me, but since it's all-in-one-blob,
it'll be hard to scale it to more than a few tens of thousands of lines or so. I
like this model, but if you want to go beyond that (DO we want to go beyond
that?!) you will need a different approach, which may be incompatible.

One thing that should be specified very rigorously from the start are the
supported data types, along with their exact syntax and semantics. Your example
has string, number, boolean, and localized. So:

* what's the length limit for string?
* what's the range and precision of number? Is it the same as for JSON?
* does boolean only accept JSON primitives, or also strings?
* what language codes are valid for localized? Is language fallback applied for
display?

Not answering these questions now may lead to having data that can later no
longer be properly interpreted. If you get into quantities with precision or
date, this becomes a lot more fun. In that case, you would want to re-use the
DataValues module(s) that Wikidata uses.

You write in your proposal "Hard to define types like Wikidata ID, datetime, and
URL could be stored as a string until we can reuse Wikidata's type system".
Well, what's keeping you from using it now? DataValue and friends are standalone
composer modules, you can find them on github.

-- daniel


Re: Enabling shared tabular data pages

Yuri Astrakhan-2
Daniel, thanks, inline:

> The structure looks sane and future-proof to me, but since it's all-in-one-blob,
> it'll be hard to scale it to more than a few ten thousand lines or so. I like
> this model, but if you want to go beyond that (DO we want to go beyond that?!)
> you will need a different approach, which may be incompatible.
>

We do *eventually* want to go beyond that towards large data. We had this
discussion with Brion, see here:
*  https://phabricator.wikimedia.org/T120452#2224764

I do not think my approach is a blocker for larger datasets, because you
can add a simple SQL-like interface capable of reading data from these pages
and from large backend databases. The 2MB page limit will prevent page data
from growing too large. Also, larger datasets are a different target, which
we should approach when we are ready.
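
To make the size constraint concrete, here is a minimal Python sketch. It
assumes the 2MB limit applies to the UTF-8 encoded JSON text of the page; the
constant and the function name are hypothetical, and how the wiki actually
measures page size is not specified in this thread.

```python
import json

# Assumption: "2MB" means 2 * 1024 * 1024 bytes of UTF-8 encoded page text.
MAX_PAGE_BYTES = 2 * 1024 * 1024


def fits_in_page(table) -> bool:
    """Return True if the serialized table stays within the assumed limit."""
    text = json.dumps(table, ensure_ascii=False)
    return len(text.encode("utf-8")) <= MAX_PAGE_BYTES
```

A check like this could run before saving, so oversized tables are rejected
early rather than truncated.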

> One thing that should be specified very rigorously from the start are the
> supported data types, along with their exact syntax and semantics. Your example
> has string, number, boolean, and localized. So:
>
> * what's the length limit for string?
>
Good question. Do you have a limit for Wikidata labels and other string
values?

> * what's the range and precision of number? Is it the same as for JSON?
>
For now, same as JSON.
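
One caveat with "same as JSON": the JSON grammar itself sets no range or
precision for numbers, so the effective limits depend on the parser. The
Python sketch below only illustrates that gap; it says nothing about how the
extension or Lua will actually parse numbers.

```python
import json

# 2**53 + 1 cannot be represented exactly as an IEEE 754 double.
big = 2**53 + 1

# Python's json module parses integer literals as exact integers...
parsed = json.loads(str(big))
exact = parsed == big

# ...but a consumer that stores every number as a double rounds it to 2**53.
as_double = float(big)
rounded = int(as_double) == 2**53
```

Two tools reading the same page could therefore disagree on a large integer
unless the accepted range is written down.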

> * does boolean only accept JSON primitives, or also strings?
>
true/false only, no strings
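
A strict check along these lines could look like the following Python sketch.
The function name and the type labels are illustrative only (the labels are
taken from the types named in this thread), and this is not the extension's
actual validation code.

```python
def is_valid_cell(value, col_type: str) -> bool:
    """Check one cell against a column type, accepting JSON primitives only."""
    if col_type == "boolean":
        # Only true/false; the strings "true" and "false" are rejected.
        return isinstance(value, bool)
    if col_type == "number":
        # bool is a subclass of int in Python, so exclude it explicitly.
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    if col_type == "string":
        return isinstance(value, str)
    raise ValueError(f"unknown column type: {col_type}")
```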

> * what language codes are valid for localized? Is language fallback applied for
> display?
>
Same rules as for wiki language codes (but without validation against the
actual list). Automatic fallback is already implemented, using the Language
class.  If everything else fails and there is no English value, it takes the
first value present (unlike Language, which stops at English and fails otherwise).
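
The lookup order described above can be sketched in Python as follows. The
function name and the explicit fallback list are hypothetical stand-ins; the
real implementation relies on the Language class, whose fallback chains are
not reproduced here.

```python
def pick_localized(values: dict, fallback_chain: list):
    """Pick a display string from a {language code: text} mapping."""
    # 1. Try the requested language and its fallbacks, in order.
    for code in fallback_chain:
        if code in values:
            return values[code]
    # 2. Fall back to English if present.
    if "en" in values:
        return values["en"]
    # 3. Last resort: whichever value comes first; None if the mapping is empty.
    return next(iter(values.values()), None)
```

For example, `pick_localized({"de": "Hallo", "fr": "Salut"}, ["ru", "uk"])`
falls through to step 3 and returns the first stored value.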


> You write in your proposal "Hard to define types like Wikidata ID, datetime, and
> URL could be stored as a string until we can reuse Wikidata's type system".
> Well, what's keeping you from using it now? DataValue and friends are standalone
> composer modules, you can find them on github.

I was told by the Wikidata team at the Jerusalem hackathon that the
JavaScript code is too entangled, and that I won't be able to reuse it for
non-Wikidata stuff.  I will be very happy to adapt it if possible. Yet, I
do not think this is a requirement for the first release.

Re: Enabling shared tabular data pages

Yuri Astrakhan-2
Rob, thanks for your offer to help! Always welcome :)

By discussion and positive feedback I meant Facebook and Commons comments,
and a very old and elaborate phab ticket discussion:
* https://phabricator.wikimedia.org/T120452
* https://commons.wikimedia.org/wiki/Commons:Village_pump/Proposals#Tabular_data_storage_for_Commons.21
* https://www.facebook.com/groups/wikipediaweekly/permalink/997545366959961/

I do not think this feature will immediately have a huge uptake. It will
simplify graph design, as it will be possible to store data outside of the
graph. It will also allow data from tables and lists to be moved into
separate wiki pages. The work of moving existing data into these pages
might not be very fast. In short, the data will only be accessible from Lua and
graphs, each page will be smaller than 2MB, and editing the JSON will require
some technical skills until better tools are created.


On Mon, Jun 6, 2016 at 9:14 PM, Rob Lanphier <[hidden email]> wrote:

> On Mon, Jun 6, 2016 at 6:40 AM, Yuri Astrakhan <[hidden email]>
> wrote:
>
> > Daniel, I agree about the data/api versioning. I was mostly talking about
> > features and capabilities. For example, we could spend the next year
> > developing a visual table editor, implement support for unlimited table
> > sizes, provide import/export from other table formats, introduce elaborate
> > schema validation, and many other cool features. And after that year
> > realize that users don't need this whole thing at all, or need something
> > similar but very different.  Or we could release one small, well defined,
> > stable subset of that functionality, get feedback, and move forward.
> >
>
> Hi Yuri,
>
> I think one thing that would be helpful for me (and I suspect many people
> who want to help) is some more specifics about this statement from your
> original email: "We have had some good feedback for the new shared tabular
> data feature, and we are getting ready to deploy it in production."  Which
> "we" are you referring to, and by "getting ready to deploy it in
> production", does that mean it's about to be usable where someone could
> upload gigabytes of production data in this format to Commons by the end of
> the week?  Is there a more measured plan published somewhere?
>
> This all sounds very cool, but also an area where we could accidentally
> accrue a crushing load of technical debt without fully realizing it (per
> Daniel's comment).  I'll confess to being ignorant on everything that's
> been going on, and I'm wondering now how desperately I should study your
> documentation to make up for it (and how important it is to drop other work
> to make time for this).
>
> Rob

Re: Enabling shared tabular data pages

Rob Lanphier-4
Let's revive this thread for this week's ArchCom RFC meeting.  I'll
doll up a more formal announcement as I finish cleaning up some of our
notes documents, but for now, the short version:
URL: <https://phabricator.wikimedia.org/E213>
Time: 2016-06-15, Wednesday 21:00 UTC (2pm PDT, 23:00 CEST)
Location: #wikimedia-office IRC channel

Rob

On Sat, Jun 4, 2016 at 9:47 AM, Yuri Astrakhan <[hidden email]> wrote:

> We have had some good feedback for the new shared tabular data feature, and
> we are getting ready to deploy it in production. It would be amazing if you
> can give it a final look-over to see if there are any blockers left.
>
> The first stage will be to enable  Data:*.tab  pages on Commons, and allow
> all other wikis direct access to it via Lua code and Graph extension. All
> data at this point must be licensed under CC0. More licensing options are
> still under discussion, and can be easily added later.
>
> In line with the "release early, release often", we will not have any
> elaborate data editing interface beyond the raw JSON code editor for the
> first release. Our initial target audience is the more experienced users
> who will evaluate and test the new technology. Once the underlying tech is
> stable and proven, we will work on making it more accessible to the
> general audience.
>
> Links:
> * Task: https://phabricator.wikimedia.org/T134426
> * Demo: http://data.wmflabs.org
> * Technical: https://www.mediawiki.org/wiki/Extension:JsonConfig/Tabular
> * Discussion:
> https://commons.wikimedia.org/wiki/Commons:Village_pump/Proposals#Tabular_data_storage_for_Commons.21
> * Facebook:
> https://www.facebook.com/groups/wikipediaweekly/permalink/997545366959961/