WikiHist.html: English Wikipedia's Full Revision History in HTML Format

Robert West
Hi all,

*TL;DR:*

So far, Wikipedia's full revision history has been available only in wiki
markup, not in HTML -- a big limitation for researchers. We are changing
this by releasing WikiHist.html, Wikipedia's full history (up until March
2019) in HTML:
https://zenodo.org/record/3605388
Caveat emptor: 7 TB!

Tweet: https://twitter.com/cervisiarius/status/1301791239558311936

*More details:*

Wikipedia is written in the wikitext markup language. When serving content,
the MediaWiki software that powers Wikipedia parses wikitext to HTML,
thereby inserting additional content by expanding macros (templates and
modules). Hence, researchers who intend to analyze Wikipedia as seen by its
readers should work with HTML, rather than wikitext. Since Wikipedia’s
revision history is made publicly available by the Wikimedia Foundation
exclusively in wikitext format, researchers have had to produce HTML
themselves, typically by using Wikipedia’s REST API for ad-hoc
wikitext-to-HTML parsing. This approach, however, (1) does not scale to
very large amounts of data and (2) does not correctly expand macros in
historical article revisions.
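
To make the ad-hoc approach concrete, here is a minimal sketch of a
wikitext-to-HTML request. It assumes the MediaWiki Action API's
`action=parse` module on English Wikipedia rather than the REST API mentioned
above; the function names, the default title, and the User-Agent string are
made up for illustration. The server expands templates with the versions that
exist at request time, which is limitation (2).

```python
import json
import urllib.parse
import urllib.request

# English Wikipedia's Action API endpoint.
API_URL = "https://en.wikipedia.org/w/api.php"


def build_parse_request(wikitext: str, title: str = "Sandbox") -> urllib.request.Request:
    """Build a POST request asking MediaWiki to render `wikitext` as HTML."""
    params = {
        "action": "parse",
        "format": "json",
        "formatversion": "2",
        "contentmodel": "wikitext",
        "prop": "text",
        "title": title,  # page context for magic words such as {{PAGENAME}}
        "text": wikitext,
    }
    data = urllib.parse.urlencode(params).encode("utf-8")
    # Hypothetical User-Agent; use one that identifies your own tool.
    headers = {"User-Agent": "wikihist-example/0.1"}
    return urllib.request.Request(API_URL, data=data, headers=headers)


def parse_wikitext(wikitext: str, title: str = "Sandbox") -> str:
    """Send the request (needs network access) and return the rendered HTML."""
    with urllib.request.urlopen(build_parse_request(wikitext, title)) as resp:
        return json.load(resp)["parse"]["text"]
```

One HTTP round trip per revision is what makes this impractical for hundreds
of millions of revisions.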

We have solved these problems by developing a parallelized architecture for
parsing massive amounts of wikitext using local instances of MediaWiki,
enhanced with the capacity of correct historical macro expansion. By
deploying our system, we produce and hereby release WikiHist.html, English
Wikipedia’s full revision history in HTML format. It comprises the HTML
content of 580M revisions of 5.8M articles generated from the full English
Wikipedia history spanning 18 years from 1 January 2001 to 1 March 2019.
Boilerplate content such as page headers, footers, and navigation sidebars
is not included in the HTML.
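
The idea behind historical macro expansion can be sketched as a lookup: for an
article revision with a given timestamp, use the most recent template revision
that is not newer than it. This is only an illustration, not our actual code;
the function name and the data layout (a list of timestamp/wikitext pairs per
template) are assumptions, as is the choice to count a template revision with
an identical timestamp as already in effect.

```python
from bisect import bisect_right
from typing import Optional, Sequence, Tuple

# One template's history: (timestamp, wikitext) pairs sorted by timestamp.
# Timestamps are ISO 8601 strings in one fixed format, e.g.
# "2010-06-01T00:00:00Z", so they sort chronologically as plain strings.
TemplateHistory = Sequence[Tuple[str, str]]


def template_version_at(history: TemplateHistory, revision_ts: str) -> Optional[str]:
    """Return the template wikitext in effect at `revision_ts`.

    Returns None if the template did not exist yet at that time.
    """
    timestamps = [ts for ts, _ in history]
    # Index of the first template revision strictly newer than the article revision.
    idx = bisect_right(timestamps, revision_ts)
    if idx == 0:
        return None
    return history[idx - 1][1]


history = [
    ("2005-01-01T00:00:00Z", "template text v1"),
    ("2010-06-01T00:00:00Z", "template text v2"),
]
print(template_version_at(history, "2007-03-04T05:06:07Z"))
```

Here an article revision from 2007 is matched with the 2005 version of the
template, not with the 2010 one that a live parser would use today.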

For more details, please refer to https://zenodo.org/record/3605388 and to
the dataset paper:

Blagoj Mitrevski, Tiziano Piccardi, and Robert West: WikiHist.html: English
Wikipedia’s Full Revision History in HTML Format. In *Proceedings of the
14th International AAAI Conference on Web and Social Media,* 2020.
https://arxiv.org/abs/2001.10256

Best regards,
Bob
_______________________________________________
Wiki-research-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l

Re: WikiHist.html: English Wikipedia's Full Revision History in HTML Format

Federico Leva (Nemo)
Robert West, 11/09/20 11:29:
> local instances of MediaWiki,
> enhanced with the capacity of correct historical macro expansion.

Interesting. I see this doesn't include deleted templates. Have you
considered using historical dumps?

«We emphasize that the limitation of deleted pages, templates, and
modules is not introduced by our parsing process. Rather, it is
inherited from Wikipedia’s deliberate policy of permanently deleting the
entire history of deleted pages.»

A relevant task is
https://phabricator.wikimedia.org/T2851

See also the various discussions about Memento, like
https://phabricator.wikimedia.org/T164654

Federico

Re: WikiHist.html: English Wikipedia's Full Revision History in HTML Format

Robert West
Thanks Federico.
I'm cc'ing Tiziano, who has been leading this project and can chime in.

All the best,
Bob

Re: WikiHist.html: English Wikipedia's Full Revision History in HTML Format

WereSpielChequers-2
In reply to this post by Federico Leva (Nemo)
I wouldn't use the phrase "Wikipedia’s deliberate policy of permanently
deleting the entire history of deleted pages". Quite a few "deleted" pages
do actually get restored, and depending on the deletion process it can be
quite easy to get much deleted content back, especially if someone volunteers
to reference an unreferenced page, a budding footballer actually gets to
play at professional or international level, or a political candidate is
elected. Almost all "deleted" content still exists and could be restored by
a volunteer admin in the right circumstances. However, Wikipedia's deletion
processes are more than a little complex: many articles have incomplete
histories because admins have revision-deleted particular revisions that
include copyright violations and/or some really libellous stuff. Some of the
really nasty stuff gets "oversighted" - those revisions are not even visible
to administrators.

There is also the issue that some of the earliest material is not
available. Stats on admin actions only go back to December 2004, and while
there is some content from before then, I am not sure if all the stuff
deleted before then is available.

Regards

WSC

Re: WikiHist.html: English Wikipedia's Full Revision History in HTML Format

Tiziano Piccardi
Thanks Federico and WSC for the interest!

I want to clarify that we used only public data released in the XML dump.
As WSC said, deleted content is not always permanently removed from the
database, but it is available only to users with privileged access. Our goal
is not only to release the dataset, but also to give anyone the
possibility to (1) reproduce the results, and (2) generate the HTML history
in other languages without any special access requirements.

Tiziano

Re: WikiHist.html: English Wikipedia's Full Revision History in HTML Format

Denny Vrandečić-2
Three questions:

1) Assume a page P that uses a template T.

P was modified at times T2 and T4.
T was modified at times T1 and T3.

Will P be available as of T2 and T4 only, or also as of T3 (at which point
its rendering will differ from that at T2 or T4)?


2) What about changes to Wikidata, Commons, or UI message strings?


3) Possibly interesting to look into TimeMachine, Memento, and related work

https://www.mediawiki.org/wiki/Extension:TimeMachine
https://www.mediawiki.org/wiki/Extension:Memento


Re: WikiHist.html: English Wikipedia's Full Revision History in HTML Format

Tiziano Piccardi
Hi Denny, thanks for the questions!

1) The unit of time is the article revision (namespace 0). This means that in
your example, the article would be available at T2 and T4. Adding the pages
also at T1 or T3 would mean regenerating all the pages that include the
template, and the resulting dataset would be significantly larger than the
current 7 TB. If there is a specific need to have the complete history at
such a level of granularity, the code could be adapted to store every
possible change.

2) No, we used only the wikitext available in the static XML dump. The date
matching is applied to templates and Lua modules. Regarding the UI message
strings, if you are referring to MediaWiki interface labels, note that
we included only the content of the article, as if you had retrieved the page
with the parameter *action=render*.

3) Thank you for these pointers. I confirm that WikiPDA can be seen as a
downloadable version of Memento, with the bonus of having the templates
matched to the time of revision creation.
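
The granularity described in point 1 can be illustrated with Denny's example.
The sketch below is a toy model, not our pipeline: the function names are
invented, times are plain integers, and it assumes that P's first revision is
the one at T2, so template edits before that never produce a rendering of P.

```python
from bisect import bisect_right


def latest_at_or_before(times, t):
    """Return the largest element of sorted `times` that is <= t, or None."""
    i = bisect_right(times, t)
    return times[i - 1] if i else None


def snapshots(article_times, template_times, include_template_edits=False):
    """List (time, article revision, template revision) for each stored rendering.

    By default there is one rendering per article revision. With
    include_template_edits=True, template edits made after the article's
    first revision also produce a rendering.
    """
    article_times = sorted(article_times)
    template_times = sorted(template_times)
    if not article_times:
        return []
    events = set(article_times)
    if include_template_edits:
        events |= {t for t in template_times if t >= article_times[0]}
    return [
        (t, latest_at_or_before(article_times, t), latest_at_or_before(template_times, t))
        for t in sorted(events)
    ]


# P modified at T2 and T4; T modified at T1 and T3.
print(snapshots([2, 4], [1, 3]))
# [(2, 2, 1), (4, 4, 3)]
print(snapshots([2, 4], [1, 3], include_template_edits=True))
# [(2, 2, 1), (3, 2, 3), (4, 4, 3)]
```

At article-revision granularity P appears twice, rendered with the template as
of T1 and T3 respectively. The finer granularity adds the rendering at T3, and
the same would happen for every page that uses T each time T changes.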

Re: WikiHist.html: English Wikipedia's Full Revision History in HTML Format

Denny Vrandečić-2
Thanks for the info! Yes, I was mostly wondering about #1. Thanks for your
work!
