Acquiring list of templates including external links


Acquiring list of templates including external links

Takashi OTA-5
Hoi,

This is an inquiry from a friend of mine in academia who is researching Wikipedia.

He would like to know whether there is a way to acquire a list of templates
that include external links. Here are some examples of such templates:

https://ja.wikipedia.org/wiki/Template:JOI/doc
https://ja.wikipedia.org/wiki/Template:Twitter/doc

Such links are stored in externallinks.sql.gz, in an expanded form.

When you want to check the increase/decrease of linked domains in
chronological order through the edit history, you have to check
pages-meta-history1.xml and so on. In such a case, ordinary links and
links generated by templates are mixed together; the latter therefore
need to be expanded into ordinary link form.
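
For illustration, here is a rough sketch of what scanning the history dump
could look like (a sketch only, assuming Python's standard library; the dump
file name and the export-schema namespace are placeholders). Note that it only
finds links written literally in the wikitext - links produced by templates
stay invisible until the templates are expanded:

import bz2
import re
import xml.etree.ElementTree as ET

# Placeholder file name; real history dumps are split into many numbered parts.
DUMP = "jawiki-latest-pages-meta-history1.xml.bz2"
# The export namespace depends on the dump's schema version; adjust as needed.
NS = "{http://www.mediawiki.org/xml/export-0.10/}"
URL_RE = re.compile(r"https?://[^\s\]|}<>]+")

revisions = []  # (timestamp, set of URLs written literally in the wikitext)

with bz2.open(DUMP, "rb") as f:
    # iterparse streams the multi-gigabyte dump instead of loading it whole
    for _, elem in ET.iterparse(f, events=("end",)):
        if elem.tag == NS + "revision":
            ts = elem.findtext(NS + "timestamp")
            text = elem.findtext(NS + "text") or ""
            # Links generated by templates such as {{Twitter|...}} do not
            # appear here; they only exist after template expansion.
            revisions.append((ts, set(URL_RE.findall(text))))
            elem.clear()  # free the processed subtree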

Sorry if what I am saying does not make sense.
Thanks in advance,

--Takashi Ota [[U:Takot]]

Re: Acquiring list of templates including external links

Marc-Andre
On 2016-07-31 10:53 AM, Takashi OTA wrote:

> When you want to check increase/decrease of linked domains in chronological
> order through edit history

This is actually a harder problem than it seems, even at first glance:
if you want to examine the links over time then, when you are looking at
an old revision of an article, you have to contrive to expand the
templates /as they existed at that time/ and not those that exist /now/,
as the MediaWiki engine would do.

Clearly, all the data to do so is there in the database - and I seem to
recall that there exists an extension that will allow you to use the
parser in that way - but the Foundation projects do not have such an
extension installed and cannot be convinced to render a page for you
that would accurately show what ELs it might have had at a given date.
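
For illustration only: finding which revision of a given template was current
at a particular moment can be done with the query API (a sketch assuming
Python with the requests library; the template name and timestamp below are
placeholders):

import requests

API = "https://ja.wikipedia.org/w/api.php"  # the wiki from the examples above

def template_revision_at(title, timestamp):
    """Newest revision of `title` at or before `timestamp` (ISO 8601)."""
    r = requests.get(API, params={
        "action": "query",
        "prop": "revisions",
        "titles": title,
        "rvlimit": 1,
        "rvdir": "older",      # walk backwards in time...
        "rvstart": timestamp,  # ...starting from this moment
        "rvprop": "ids|timestamp",
        "format": "json",
    })
    page = next(iter(r.json()["query"]["pages"].values()))
    return page.get("revisions", [None])[0]

# Hypothetical example: which revision of {{Twitter}} was live on 2014-07-01?
print(template_revision_at("Template:Twitter", "2014-07-01T00:00:00Z"))

Repeating this for every template (and nested template) used by an old article
revision, and then expanding with those revisions, is the part that the wikis
will not do for you.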

-- Coren / Marc



Re: Acquiring list of templates including external links

Gergo Tisza
On Mon, Aug 1, 2016 at 7:46 AM, Marc-Andre <[hidden email]> wrote:

> Clearly, all the data to do so is there in the database - and I seem to
> recall that there exists an extension that will allow you to use the parser
> in that way - but the Foundation projects do not have such an extension
> installed and cannot be convinced to render a page for you that would
> accurately show what ELs it might have had at a given date.
>

That would be the Memento [1] extension. I'm not sure this is even
theoretically possible - the parser has changed over time and old templates
might not work anymore.

Your best bet is probably to find some old dumps. (Kiwix [2] maybe? I don't
know if they preserve templates.)


[1] https://www.mediawiki.org/wiki/Extension:Memento
[2] https://dumps.wikimedia.org/other/kiwix/zim/wikipedia/

Losing the history of our projects to bitrot. Was: Acquiring list of templates including external links

Marc-Andre
On 2016-08-01 12:21 PM, Gergo Tisza wrote:

> the parser has changed over time and old templates
> might not work anymore

Aaah.  Good point.  Also, the changes in extensions (or, indeed, what
extensions are installed at all) might break attempts to parse the past,
as it were.

You know, this is actually quite troublesome: as the platform evolves
the older data becomes increasingly hard to use at all - making it
effectively lost even if we kept the bits around.  This is a rather
widespread issue in computing as a rule; but I now find myself
distressed at its unavoidable effect on what we've always intended to be
a permanent contribution to humanity.

We need to take a long-term view toward a solution.  I don't mean just
keeping old versions of the software around - that would be of limited
help.  It'd be an interesting nightmare to try to run early versions of
phase3 nowadays; it would probably require getting a very, very old
distro to work and finding the right versions of an ancient Apache and
PHP.  Even *building* those might end up being a challenge... when was
the last time you saw a working egcs install? I shudder to think how
nigh-impossible the task might be 100 years from now.

Is there something we can do to make the passage of years hurt less?  
Should we be laying groundwork now to prevent issues decades away?

At the very least, I think those questions are worth asking.

-- Coren / Marc



Re: Losing the history of our projects to bitrot. Was: Acquiring list of templates including external links

Subramanya Sastry
On 08/01/2016 11:37 AM, Marc-Andre wrote:
> ...
> Is there something we can do to make the passage of years hurt less?  
> Should we be laying groundwork now to prevent issues decades away?

One possibility is to store rendered HTML for old revisions. That lets
wikitext (and hence the parser) evolve without breaking old revisions.
In addition, the rendered HTML uses the template revisions as of the
time of rendering rather than the latest ones (this is the problem
Memento tries to solve).

HTML storage comes with its own can of worms, but it seems like a
solution worth thinking about in some form.

1. storage costs (fully rendered HTML would be 5-10 times bigger than
the wikitext of the same page, and much larger still compared to
wikitext stored as diffs)
2. evolution of the HTML spec and its effect on old content (this
affects the entire web, so whatever solution works there will work for
us as well)
3. newly discovered security holes and retroactively fixing them in
stored HTML and released dumps (not sure)
... and maybe others.

Subbu.


Re: Losing the history of our projects to bitrot. Was: Acquiring list of templates including external links

Pine W
In reply to this post by Marc-Andre
"Should we be laying groundwork now to prevent issues decades away?" I'll
answer that with "Yes". I could provide some interesting stories about
technological and budgetary headaches that result from repeatedly delaying
efforts to make legacy software be forwards-compatible. The technical
details of the tools mentioned here are beyond me, but I saw what happened
in another org that was dealing with legacy software and it wasn't pretty.

Pine


Re: Losing the history of our projects to bitrot. Was: Acquiring list of templates including external links

Rob Lanphier-4
In reply to this post by Subramanya Sastry
On Mon, Aug 1, 2016 at 9:51 AM, Subramanya Sastry <[hidden email]> wrote:
> On 08/01/2016 11:37 AM, Marc-Andre wrote:
>> Is there something we can do to make the passage of years hurt less?
>> Should we be laying groundwork now to prevent issues decades away?
>
>
> One possibility is considering storing rendered HTML for old revisions. It
> lets wikitext (and hence parser) evolve without breaking old revisions. Plus
> rendered HTML will use the template revision at the time it was rendered vs.
> the latest revision (this is the problem Memento tries to solve).


This is a seductive path to choose.  Maintaining backwards
compatibility for poorly conceived (in retrospect) engineering
decisions is really hard work.  A lot of the cruft and awfulness of
enterprise-focused software comes from dealing with the seemingly
endless torrent of edge cases which are often backwards-compatibility
issues in the systems/formats/databases/protocols that the software
depends on.  The [Y2K problem][1] was a global lesson in the
importance of intelligently paying down technical debt.

You outline the problems with this approach in the remainder of your email....

> HTML storage comes with its own can of worms, but it seems like a solution
> worth thinking about in some form.
>
> 1. storage costs (fully rendered HTML would be 5-10 times bigger than
> wikitext for that same page, and much larger if stored as wikitext diffs)
> 2. evolution of HTML spec and its affect on old content (this affects the
> entire web, so, whatever solution works there will work for us as well)
> 3. newly discovered security holes and retroactively fixing them in stored
> html and released dumps (not sure).
> ... and maybe others.

I think these are all reasons why I chose the word "seductive" as
opposed to more unambiguous praise  :-)  Beyond these reasons, the
bigger issue is that it's an invitation to be sloppy about our
formats.  We should endeavor to make our wikitext to html conversion
more scientifically reproducible (i.e. "Nachvollziehbarkeit" as Daniel
Kinzler taught me).  Holding a large data store of snapshots seems
like a crutch to avoid the hard work of specifying how this conversion
ought to work.  Let's actually nail down the spec for this[2][3]
rather than kidding ourselves into believing we can just store enough
HTML snapshots to make the problem moot.

Rob

[1]: https://en.wikipedia.org/wiki/Year_2000_problem
[2]: https://www.mediawiki.org/wiki/Markup_spec
[3]: https://www.mediawiki.org/wiki/Parsoid/MediaWiki_DOM_spec


Re: Losing the history of our projects to bitrot. Was: Acquiring list of templates including external links

Gergo Tisza
On Mon, Aug 1, 2016 at 11:47 AM, Rob Lanphier <[hidden email]> wrote:

> > HTML storage comes with its own can of worms, but it seems like a solution
> > worth thinking about in some form.
> >
> > 1. storage costs (fully rendered HTML would be 5-10 times bigger than
> > wikitext for that same page, and much larger if stored as wikitext diffs)
> > 2. evolution of HTML spec and its affect on old content (this affects the
> > entire web, so, whatever solution works there will work for us as well)
> > 3. newly discovered security holes and retroactively fixing them in stored
> > html and released dumps (not sure).
> > ... and maybe others.
>
> I think these are all reasons why I chose the word "seductive" as
> opposed to more unambiguous praise  :-)  Beyond these reasons, the
> bigger issue is that it's an invitation to be sloppy about our
> formats.  We should endeavor to make our wikitext to html conversion
> more scientifically reproducible (i.e. "Nachvollziehbarkeit" as Daniel
> Kinzler taught me).  Holding a large data store of snapshots seems
> like a crutch to avoid the hard work of specifying how this conversion
> ought to work.  Let's actually nail down the spec for this[2][3]
> rather than kidding ourselves into believing we can just store enough
> HTML snapshots to make the problem moot.
>

Specifying wikitext-to-HTML conversion sounds like a MediaWiki 2.0 type of
project (i.e. I wouldn't expect it to happen in this decade), and even then it
would not fully solve the problem - e.g. very old versions relied on the
default CSS of a different MediaWiki skin; you need site scripts for some
things, such as infobox show/hide functionality, to work, but the standard
library those scripts rely on has changed; same for Scribunto scripts.

HTML storage is actually not that bad - browsers are very good at backwards
compatibility with older versions of the HTML spec, and there is very little
security footprint in serving static HTML from a separate domain. Storage is a
problem, but there is no need to store every page revision - monthly or
yearly snapshots would be fine IMO. (cf. T17017 - again, Kiwix seems to do
this already, so maybe it's just a matter of coordination.) The only other
practical problem I can think of is that it would preserve
deleted/oversighted information - that problem already exists with the
dumps, but those are not kept for very long (on WMF servers at least).

Re: Acquiring list of templates including external links

Legoktm
In reply to this post by Takashi OTA-5
Hi,

On 07/31/2016 07:53 AM, Takashi OTA wrote:
> Such links are stored in externallinks.sql.gz, in an expanded form.
>
> When you want to check increase/decrease of linked domains in chronological
> order through edit history, you have to check pages-meta-history1.xml etc.
> In a such case, traditional links and links by templates are mixed,
> Therefore, the latter ones (links by templates) should be expanded to
> traditional link forms.

If you have the revision ID, you can make an API query like:
<https://en.wikipedia.org/w/api.php?action=parse&oldid=387276926&prop=externallinks>.

This will expand all templates and give you the same set of
externallinks that would have ended up in the dump.
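
A minimal sketch of calling that endpoint from a script (assuming Python with
the requests library; the revision ID is the one from the example URL above):

import requests

def external_links_for_revision(oldid, api="https://en.wikipedia.org/w/api.php"):
    """Fetch the fully expanded external links of a single old revision."""
    r = requests.get(api, params={
        "action": "parse",
        "oldid": oldid,
        "prop": "externallinks",
        "format": "json",
    })
    return r.json()["parse"]["externallinks"]

print(external_links_for_revision(387276926))

Note that, as discussed earlier in the thread, the templates are expanded with
their current revisions, not the ones that existed when the revision was saved.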

-- Legoktm


Re: Losing the history of our projects to bitrot. Was: Acquiring list of templates including external links

Rob Lanphier-4
In reply to this post by Gergo Tisza
On Mon, Aug 1, 2016 at 12:19 PM, Gergo Tisza <[hidden email]> wrote:
> Specifying wikitext-html conversion sounds like a MediaWiki 2.0 type of
> project (ie. wouldn't expect it to happen in this decade), and even then it
> would not fully solve the problem[...]

You seem to be suggesting that
1.  Specifying wikitext-html conversion is really hard
2.  It's not a silver bullet (i.e. it doesn't "fully solve the problem")
3.  HTML storage looks more like a silver bullet, and is cheaper
4.  Therefore, a specification is not really worth doing, or if it is,
it's really low priority

Is that an accurate way of paraphrasing your email?

Rob


Re: Losing the history of our projects to bitrot. Was: Acquiring list of templates including external links

Gergo Tisza
On Mon, Aug 1, 2016 at 1:01 PM, Rob Lanphier <[hidden email]> wrote:

> On Mon, Aug 1, 2016 at 12:19 PM, Gergo Tisza <[hidden email]> wrote:
> > Specifying wikitext-html conversion sounds like a MediaWiki 2.0 type of
> > project (ie. wouldn't expect it to happen in this decade), and even then it
> > would not fully solve the problem[...]
>
> You seem to be suggesting that
> 1.  Specifying wikitext-html conversion is really hard
> 2.  It's not a silver bullet (i.e. it doesn't "fully solve the problem")
> 3.  HTML storage looks more like a silver bullet, and is cheaper
> 4.  Therefore, a specification is not really worth doing, or if it is,
> it's really low priority
>
> Is that an accurate way of paraphrasing your email?
>

Yes. The main problem with specifying wikitext-to-html is that extensions
get to extend it in arbitrary ways; e.g. the specification for Scribunto
would have to include the whole Lua compiler semantics.

Re: Losing the history of our projects to bitrot. Was: Acquiring list of templates including external links

David Gerard-2
In reply to this post by Marc-Andre
On 1 August 2016 at 17:37, Marc-Andre <[hidden email]> wrote:

> We need to find a long-term view to a solution.  I don't mean just keeping
> old versions of the software around - that would be of limited help.  It's
> be an interesting nightmare to try to run early versions of phase3 nowadays,
> and probably require managing to make a very very old distro work and
> finding the right versions of an ancient apache and PHP.  Even *building*
> those might end up being a challenge... when is the last time you saw a
> working egcs install? I shudder how nigh-impossible the task might be 100
> years from now.


oh god yes. I'm having this now, trying to revive an old Slash
installation. I'm not sure I could even reconstruct a box to run it
without compiling half of CPAN circa 2002 from source.

Suggestion: set up a copy of WMF's setup on a VM (or two or three),
save that VM and bundle it off to the Internet Archive as a dated
archive resource. Do this regularly.


- d.


Re: Losing the history of our projects to bitrot. Was: Acquiring list of templates including external links

Gabriel Wicke-3
> One possibility is considering storing rendered HTML for old revisions. It
> lets wikitext (and hence parser) evolve without breaking old revisions. Plus
> rendered HTML will use the template revision at the time it was rendered vs.
> the latest revision (this is the problem Memento tries to solve).

Long-term HTML archival is something we have been gradually working
towards with RESTBase.

Since HTML is about 10x larger than wikitext, a major concern is storage
cost. Old estimates <https://phabricator.wikimedia.org/T97710> put the
total storage needed to store one HTML copy of each revision at roughly
120T. To reduce this cost, we have since implemented several improvements
<https://phabricator.wikimedia.org/T93751>:


   - Brotli compression <https://en.wikipedia.org/wiki/Brotli>, once
   deployed, is expected to reduce the total storage needs to about
   1/4-1/5x over gzip <https://phabricator.wikimedia.org/T122028#2004953>
   (see the back-of-envelope sketch after this list).
   - The ability to split latest revisions from old revisions lets us use
   cheaper and slower storage for old revisions.
   - Retention policies let us specify how many renders per revision we
   want to archive. We currently only archive one (the latest) render per
   revision, but have the option to store one render per $time_unit. This is
   especially important for pages like [[Main Page]], which are rarely edited,
   but constantly change their content in meaningful ways via templates. It is
   currently not possible to reliably cite such pages, without resorting to
   external services like archive.org.
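
As a back-of-envelope reading of how those numbers combine (the figures are
only the estimates quoted in this message, and treating the 120T estimate as
the gzip-era baseline is an assumption):

# ~120 TB for one HTML copy of every revision (T97710); Brotli is expected
# to bring that to roughly 1/4-1/5 of the gzip-based figure (T122028).
baseline_tb = 120
for divisor in (4, 5):
    print(f"at 1/{divisor} of the baseline: ~{baseline_tb / divisor:.0f} TB")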


Another important requirement for making HTML a useful long-term archival
medium is to establish a clear standard for HTML structures used. The
versioned Parsoid HTML spec
<https://www.mediawiki.org/wiki/Specs/HTML/1.2.1>, along with format
migration logic for old content, is designed to make the stored HTML as
future-proof as possible.

While we currently only have space for a few months' worth of HTML
revisions, we do expect the changes above to make it possible to push this
to years in the foreseeable future without unreasonable hardware needs.
This means that we can start building up an archive of our content in a
format that is not tied to the software.

Faithfully re-rendering old revisions is harder in retrospect. We will
likely have to make some trade-offs between fidelity & effort.

Gabriel





--
Gabriel Wicke
Principal Engineer, Wikimedia Foundation

Re: Losing the history of our projects to bitrot. Was: Acquiring list of templates including external links

Rob Lanphier-4
In reply to this post by Gergo Tisza
On Mon, Aug 1, 2016 at 1:56 PM, Gergo Tisza <[hidden email]> wrote:

> On Mon, Aug 1, 2016 at 1:01 PM, Rob Lanphier <[hidden email]> wrote:
>> On Mon, Aug 1, 2016 at 12:19 PM, Gergo Tisza <[hidden email]> wrote:
>> > Specifying wikitext-html conversion sounds like a MediaWiki 2.0 type of
>> > project (ie. wouldn't expect it to happen in this decade), and even then it
>> > would not fully solve the problem[...]
>>
>> You seem to be suggesting that
>> 1.  Specifying wikitext-html conversion is really hard
>> 2.  It's not a silver bullet (i.e. it doesn't "fully solve the problem")
>> 3.  HTML storage looks more like a silver bullet, and is cheaper
>> 4.  Therefore, a specification is not really worth doing, or if it is,
>> it's really low priority
>>
>> Is that an accurate way of paraphrasing your email?
>
> Yes. The main problem with specifying wikitext-to-html is that extensions
> get to extend it in arbitrary ways; e.g. the specification for Scribunto
> would have to include the whole Lua compiler semantics.


Do you believe that declaring "the implementation is the spec" is a
sustainable way of encouraging contribution to our projects?

Rob


Reload SQLite from MySQL dump?

Jefsey
In reply to this post by Gabriel Wicke-3
I am not familiar with databases. I have old MySQL-based wiki sites
that I cannot access anymore due to a change in PHP and MySQL versions.
I have old XML dumps. Is it possible to reload them into SQLite-backed
wikis? These were working-group wikis: we are only interested in
restoring the texts. We have the images. We are not interested in the
access rights: we will have to rebuild them anyway.

Thank you for the help!
jefsey

PS. We are dedicated to light wikis, which are OK under SQLite. Would
there be a dedicated list for SQLite management (and, further on, development)?



Re: Reload SQLite from MySQL dump?

John Doe-27
For mass imports, use importDump.php - see
<http://www.mediawiki.org/wiki/Manual:Importing_XML_dumps> for details.
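
A minimal sketch of driving the maintenance scripts from Python (paths and the
dump file name are placeholders; the rebuildrecentchanges.php step is an
extra, commonly used follow-up, not something mentioned above):

import subprocess

MEDIAWIKI_DIR = "/var/www/newwiki"        # placeholder: root of the new wiki
DUMP_FILE = "/backups/oldwiki-pages.xml"  # placeholder: one of the old XML dumps

# importDump.php reads the XML dump and recreates the pages and revisions.
subprocess.run(["php", "maintenance/importDump.php", DUMP_FILE],
               cwd=MEDIAWIKI_DIR, check=True)

# Rebuilding recent changes afterwards is a commonly recommended follow-up step.
subprocess.run(["php", "maintenance/rebuildrecentchanges.php"],
               cwd=MEDIAWIKI_DIR, check=True)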


Re: Losing the history of our projects to bitrot. Was: Acquiring list of templates including external links

John Mark Vandenberg
In reply to this post by Marc-Andre
There is a slow moving discussion about this at
https://www.mediawiki.org/wiki/Talk:Requests_for_comment/Markdown

The bigger risk is that the rest of the world settles on using
CommonMark Markdown once it is properly specified.  That will mean in
the short term that MediaWiki will need to support Markdown, and
eventually it would need to adopt Markdown as the primary text format,
and ultimately we would lose our own ability to render old revisions,
because the parser would bit rot.

One practical way to add more discipline around this problem is to
introduce a "mediawiki-wikitext-announce" list, similar to the
mediawiki-api-announce list, and require that *every* breaking change
to the wikitext parser is announced there.

Wikitext is a file format, and there are alternative parsers, which need
to be updated any time the PHP parser changes.

https://www.mediawiki.org/wiki/Alternative_parsers

It should be managed just like the MediaWiki API, with appropriate
notices sent out, so that other tools can be kept up to date, and so
there is an accurate record of when breaking changes occurred.

--
John Vandenberg


Re: Losing the history of our projects to bitrot. Was: Acquiring list of templates including external links

Gergo Tisza
In reply to this post by Rob Lanphier-4
On Mon, Aug 1, 2016 at 5:27 PM, Rob Lanphier <[hidden email]> wrote:

> Do you believe that declaring "the implementation is the spec" is a
> sustainable way of encouraging contribution to our projects?


Reimplementing Wikipedia's parser (complete with template inclusions,
Wikidata fetches, Lua scripts, LaTeX snippets and whatever else) is
practically impossible. What we do or do not declare won't change that.

There are many other, more realistic ways to encourage contribution by
users who are interested in wikis, but not in Wikimedia projects.
(Supporting Markdown would certainly be one of them.) But historically the
WMF has shown zero interest in the wiki-but-not-Wikimedia userbase, and no
other actor has been both willing and able to step up in its place.

Re: Losing the history of our projects to bitrot. Was: Acquiring list of templates including external links

John Mark Vandenberg
On Tue, Aug 2, 2016 at 8:34 AM, Gergo Tisza <[hidden email]> wrote:
> On Mon, Aug 1, 2016 at 5:27 PM, Rob Lanphier <[hidden email]> wrote:
>
>> Do you believe that declaring "the implementation is the spec" is a
>> sustainable way of encouraging contribution to our projects?
>
>
> Reimplementing Wikipedia's parser (complete with template inclusions,
> Wikidata fetches, Lua scripts, LaTeX snippets and whatever else) is
> practically impossible. What we do or do not declare won't change that.

Correct, re-implementing the MediaWiki parser is a mission from hell.
And yet, WMF is doing that with parsoid ... ;-)
And, WMF will no doubt do it again in the future.
Changing infrastructure is normal for systems that last many generations.

But the real problem of not using a versioned spec is that nobody can
reliably do anything, at all, with the content.

Even basic tokenizing of wikitext has many undocumented gotchas, and
even with the correct voodoo today there is no guarantee that WMF
engineers won't break it tomorrow without informing everyone that the
spec has changed.

> There are many other, more realistic ways to encourage contribution by
> users who are interested in wikis, but not in Wikimedia projects.
> (Supporting Markdown would certainly be one of them.) But historically the
> WMF has shown zero interest in the wiki-but-not-Wikimedia userbase, and no
> other actor has been both willing and able to step up in its place.

The main reason for a spec should be the sanity of the Wikimedia
technical user base, including WMF engineers paid by donors, who build
parsers in other languages for various reasons - including supporting
tools that account for a very large percentage of the total edits to
Wikimedia, that are critical in preventing abuse, and that assist admins
in the critical tasks that keep the sites from falling apart.

--
John Vandenberg


Re: Losing the history of our projects to bitrot. Was: Acquiring list of templates including external links

Subramanya Sastry
In reply to this post by Gergo Tisza

TL;DR: You get to a spec by paying down the technical debt that ties
wikitext parsing to the internals of the MediaWiki implementation and
its state.

In discussions, there is far too much focus on the fact that you cannot
write a BNF grammar (or use yacc / lex / bison / whatever) or that quote
parsing is context-sensitive. I don't think that is as big a deal. For
example, you could use Markdown for parsing, but that wouldn't change
much of the picture outlined below ... I think all of that is less of an
issue than the following:

Right now, MediaWiki HTML output depends on the following:
* input wikitext
* wiki config (including installed extensions)
* installed templates
* media resources (images, audio, video)
* PHP parser hooks that expose parsing internals and implementation
details (not replicable in other parsers)
* wiki messages (ex: cite output)
* state of the corpus and other db state (ex: red links, bad images)
* user state (prefs, etc.)
* Tidy

So, one reason for the complexity of implementing a wikitext parser is
that the output HTML is not simply a straightforward transformation of
the input wikitext (and some config). There is far too much other state
that gets in the way.

The second reason for complexity is that markup errors aren't confined
to narrow contexts but can leak out and affect the output of the entire
page. Some user pages even seem to exploit this as a feature (unclosed
div tags).

The third source of complexity is that some parser hooks expose
internals of the implementation (Before/After Strip/Tidy and other such
hooks). An implementation without Tidy, or one that handles wikitext
differently, might not have the same pipeline.

However, we can still get to a spec that is much more replicable if we
start cleaning up some of this incrementally and paying down technical
debt. Here are some things going on right now towards that.

* We are close to getting rid of Tidy, which removes it from the equation.
* There are RFCs that propose defining DOM scopes and propose that the
output of templates (and extensions) be a DOM (vs. a string), with some
caveats (that I will ignore here). If we can get to implementing these,
we immediately isolate the parsing of a top-level page from the details
of how extensions and transclusions are processed.
* There are RFCs that propose that things like red links, bad images,
user state, and site messages not be an input into the core wikitext
parse. From a spec point of view, they should be viewed as
post-processing transformations. However, for efficiency reasons, an
implementation might choose to integrate that as part of the parse, but
that is not a requirement.

Separately, here is one other thing we can consider:
* Deprecate and replace tag hooks that expose parser internals.

When all of these are done, it becomes far more feasible to think of
defining a spec for wikitext parsing that is not tied to the internals
of MediaWiki or its extensions. At that point, you could implement
templating via Lua or via JS or via Ruby ... the specifics are
immaterial. What matters is that those templating implementations and
extensions produce output with certain properties. You can then specify
that MediaWiki HTML is a series of transformations applied to the
output of the wikitext parser ... and there can be multiple
spec-compliant implementations of that parser.

I think it is feasible to get there. But, whether we want a spec for
wikitext and should work towards that is a different question.

Subbu.



