Announcement - Mediawiki History Dumps

Announcement - Mediawiki History Dumps

Joseph Allemandou
Hi Analytics People,

The Wikimedia Analytics Team is pleased to announce the release of the most
complete dataset we have to date to analyze content and contributors
metadata: Mediawiki History [1] [2].

Data is in TSV format, released monthly (usually around the 3rd of the
month), and every new release contains the full history of metadata.
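
For readers new to the dumps, here is a minimal sketch of reading one of
these TSV files in Python. The field names below are illustrative, not the
dump's actual schema; see the schema documentation [2] for the real column
list.

```python
import csv
import io

# Illustrative sample in the dump's TSV shape; real files are much wider
# (one row per event, tab-separated, see the schema docs for actual columns).
sample = (
    "revision\t2020-01-15T12:00:00Z\tExample_page\tExample_user\n"
    "revision\t2020-01-16T08:30:00Z\tOther_page\tOther_user\n"
)

# csv.reader with a tab delimiter handles the format directly.
rows = list(csv.reader(io.StringIO(sample), delimiter="\t"))
for event_type, timestamp, page_title, user_name in rows:
    print(event_type, timestamp, page_title, user_name)
```

In practice you would stream a (large, compressed) dump file rather than an
in-memory string, but the parsing step is the same.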

The dataset contains an enhanced [3] and historified [4] version of user,
page and revision metadata, and serves as the basis for the Wikistats APIs
on edits, users and pages [5] [6].

We hope you will have as much fun playing with the data as we have building
it, and we're eager to hear from you [7], whether for issues, ideas or
usage of the data.

Analytically yours,

--
Joseph Allemandou (joal) (he / him)
Sr Data Engineer
Wikimedia Foundation

[1] https://dumps.wikimedia.org/other/mediawiki_history/readme.html
[2]
https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/Mediawiki_history_dumps
[3] Many pre-computed fields are present in the dataset, from edit-counts
by user and page to reverts and reverted information, as well as time
between events.
[4] Historical usernames and page-titles (as well as user-groups and
blocks), as accurate as we can make them, are available in addition to
current values, and are provided in a denormalized way on every event of
the dataset.
[5] https://wikitech.wikimedia.org/wiki/Analytics/AQS/Wikistats_2
[6] https://wikimedia.org/api/rest_v1/
[7]
https://phabricator.wikimedia.org/maniphest/task/edit/?title=Mediawiki%20History%20Dumps&projectPHIDs=Analytics-Wikistats,Analytics
_______________________________________________
Wiki-research-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
Re: Announcement - Mediawiki History Dumps

Nate E TeBlunthuis
Thank you so much Joal! I've been happily using this data for some time and I'm optimistic that it can make doing thorough analyses of Wikimedia projects much more accessible to the community, students, and researchers.

-- Nate
________________________________
From: Wiki-research-l <[hidden email]> on behalf of Joseph Allemandou <[hidden email]>
Sent: Monday, February 10, 2020 8:27 AM
To: A mailing list for the Analytics Team at WMF and everybody who has an interest in Wikipedia and analytics. <[hidden email]>; Research into Wikimedia content and communities <[hidden email]>; Product Analytics <[hidden email]>
Subject: [Wiki-research-l] Announcement - Mediawiki History Dumps

Re: Announcement - Mediawiki History Dumps

Neil Shah-Quinn
I want to echo what Nate said. We've been using this for more than a year
within the Wikimedia Foundation, and it has made analyses of editing
behavior much, much easier and faster, not to mention a lot less annoying.

This is the product of years of expert work by the Analytics team, and they
deserve plenty of congratulations for it 😊

Re: [Analytics] Announcement - Mediawiki History Dumps

Pine W
In reply to this post by Joseph Allemandou
Hi Joseph,

Thanks for this announcement.

I am looking for license information regarding the dumps, and I'm not
finding it in the pages that you linked at [1] or [2]. The license
that applies to text on Wikimedia sites is often CC-BY-SA 3.0, and the
WMF Terms of Use at https://foundation.wikimedia.org/wiki/Terms_of_Use
do not appear to provide any exception for metadata. In the absence of
a specific license, I think that the CC-BY-SA or other relevant
licenses would apply to the metadata, and that the licensing
information should be prominently included on relevant pages and in
the dumps themselves.

What do you think?

Pine
( https://meta.wikimedia.org/wiki/User:Pine )

Re: [Analytics] Announcement - Mediawiki History Dumps

Pine W
I was thinking about the licensing issue some more. Apparently there
was a relevant court case regarding metadata in the United States
several years ago, but it's unclear to me from my brief web search
whether that holding would apply to metadata from every nation. Also,
I don't know whether the underlying statutes have changed since the
time of that ruling. I think that WMF Legal should be consulted
regarding the copyright status of the metadata. Also, I think that the
licensing of metadata should be explicitly addressed in the Terms of
Use or a similar document which is easily accessible to all
contributors to Wikimedia sites.

Pine
( https://meta.wikimedia.org/wiki/User:Pine )

Re: [Analytics] Announcement - Mediawiki History Dumps

Leila Zia-2
Hi Joseph and team,

summary: congratulations and some suggestions/requests.

I second and third Nate and Neil. Congratulations on meeting this
milestone. This effort can empower the research community to spend
less time joining datasets and resolving long-standing, complex issues
(well known to some) with mediawiki history data, and more time doing
the research. Nice! :)

I'm eager to see what the dataset(s) will be used for by others. On my
end, I am looking forward to more research on how Wiki(m|p)edia
projects have evolved over the past almost two decades, now that this
data is more readily available for study. What we learn from the
Wikimedia projects and their evolution can also help us understand the
broader web ecosystem and its evolution (the Web itself is only about
30 years old).

I have some requests if I may:

* Pine brings up a good point about licenses. It would be great to
make that clear in the documentation page(s). There are many examples
of this (that you know better than I); just in case, I find the
License section of
https://iccl.inf.tu-dresden.de/web/Wikidata/Maps-06-2015/en
informative, for example.

* The other request I have is that you make the template for citing
this dataset clear to the end-user in your documentation pages
(including the readme). You can do this in a few different ways:

** In the documentation pages, put a suggested citation link. For
example (for bibtex):

@misc{wmfanalytics2020mediawikihistory,
  title        = {MediaWiki History},
  author       = {nameoftheauthors},
  howpublished = {\url{https://dumps.wikimedia.org/other/mediawiki_history/}},
  note         = {Accessed on date x},
  year         = {2020}
}

** Upload a paper about the work to arxiv.org. This way, your work
gets a DOI that you can use in your documentation pages for folks to
cite. Note that this step can be relatively lightweight (no peer
review in this case, and it's relatively quick).

** Submit a paper to a conference. Some conferences have a dataset
paper track where you publish about the dataset you release. Research
is happy to support you with guidance if you need it and you choose to
go down this path. This takes more time, and in return it gives you a
"peer-review" stamp and more experience in publishing, if you like
that.

Unless you like publishing your work in a peer-reviewed venue, I
suggest one of the first two approaches.

* I'm not sure if you intend to make the dataset more discoverable
through places such as https://datasetsearch.research.google.com/ .
You may want to consider that.

Thanks,
Leila

--
Leila Zia
Head of Research
Wikimedia Foundation

Re: Announcement - Mediawiki History Dumps

Giovanni Luca Ciampaglia-5
In reply to this post by Joseph Allemandou
Hi Joseph,

Thanks a lot for creating and sharing such a valuable resource. I went
through the schema and from what I understand there is no information about
page-to-page links, correct? Are there any resources that would provide
such historical data?

Best,

*Giovanni Luca Ciampaglia* ∙ glciampaglia.com
Assistant Professor
Computer Science and Engineering <https://www.usf.edu/engineering/cse/> ∙
University of South Florida <https://www.usf.edu/>

*Due to Florida’s broad open records law, email to or from university
employees is public record, available to the public and the media upon
request.*


Re: Announcement - Mediawiki History Dumps

Aaron Halfaker-2
+1 to Leila.  Really good suggestions re. making the dataset cite-able and
providing an in-depth discussion of how it was produced.  That's a lot of
work, but it could produce a bunch of additional value.

Thanks for working on this, A-team.  I wish I could transport it back to
the past so I could use it to finish my dissertation faster!

Re: Announcement - Mediawiki History Dumps

Joseph Allemandou
In reply to this post by Giovanni Luca Ciampaglia-5
Hi Giovanni,
Thank you for your message :)
You are correct: there is no information on page-to-page links as of
today, nor, for instance, on the historical redirect status of revisions.
We share your view that such information is extremely valuable, and we
have it in mind to extract it at some point.
The reason it has not been done yet is that those pieces of information
are only available by parsing the wikitext of every revision, which is not
only resource-intensive but also technically complicated (templates,
syntax changes across versions, etc.).
You can be sure we will send another announcement when we release that
data :)
Best,
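
As a concrete illustration of why this is hard (an editor's toy sketch,
not the Analytics team's code): naive wikilink extraction from raw
wikitext handles only the simple cases, and deliberately ignores
templates, nested markup, and historical syntax variants. A dedicated
parser such as mwparserfromhell is more robust in practice.

```python
import re

# Naive wikilink extraction: matches [[Target]], [[Target|label]],
# and [[Target#Section]]. It ignores templates, nesting, and syntax
# changes over MediaWiki versions -- the hard cases mentioned above.
WIKILINK = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|[^\[\]]*)?\]\]")

def extract_links(wikitext: str) -> list[str]:
    """Return link targets found in the wikitext, whitespace-stripped."""
    return [m.group(1).strip() for m in WIKILINK.finditer(wikitext)]

text = "See [[Python (programming language)|Python]] and [[Wikipedia#History]]."
print(extract_links(text))  # -> ['Python (programming language)', 'Wikipedia']
```

Even this ignores links produced by template expansion, which is where the
real cost (parsing and expanding every revision) comes in.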



--
Joseph Allemandou (joal) (he / him)
Sr Data Engineer
Wikimedia Foundation
Re: Announcement - Mediawiki History Dumps

Giovanni Luca Ciampaglia-5
Thank you Joseph; great to hear there is interest in building such a
dataset. You say that the link information would need to be parsed from
wikitext, which is complicated; would the pagelinks table help as an
alternative source of data?
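
For context on the pagelinks suggestion: that table stores only the
current link graph (roughly, a source page id plus the target's namespace
and title), so it cannot answer historical questions on its own. A toy
illustration of its shape, using an in-memory SQLite table; the column
names follow my reading of the MediaWiki schema and should be checked
against the live schema documentation.

```python
import sqlite3

# Toy stand-in for MediaWiki's pagelinks table: pl_from = source page id,
# pl_namespace / pl_title = link target. Verify names against the live
# schema before relying on them.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pagelinks ("
    " pl_from INTEGER, pl_namespace INTEGER, pl_title TEXT)"
)
conn.executemany(
    "INSERT INTO pagelinks VALUES (?, ?, ?)",
    [(1, 0, "Physics"), (1, 0, "Chemistry"), (2, 0, "Physics")],
)

# Pages linking to 'Physics' *right now* -- there is no time dimension,
# which is why it can't substitute for parsing historical revisions.
rows = conn.execute(
    "SELECT pl_from FROM pagelinks"
    " WHERE pl_namespace = 0 AND pl_title = ? ORDER BY pl_from",
    ("Physics",),
).fetchall()
print([r[0] for r in rows])  # -> [1, 2]
```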



On Thu, Feb 13, 2020 at 9:27 AM Joseph Allemandou <[hidden email]>
wrote:

> Hi Giovanni,
> Thank you for your message :)
> You are correct: there is currently no information on page-to-page links,
> nor, for instance, on the historical redirect status of revisions.
> We agree that such information is extremely valuable, and we plan to
> extract it at some point.
> It has not been done yet because those pieces of information are only
> available by parsing the wikitext of every revision, which is not only
> resource-intensive but also technically complicated (templates, syntax
> changes over time, etc.).
> You can be sure we will send another announcement when we release that
> data :)
> Best,
> --
> Joseph Allemandou (joal) (he / him)
> Sr Data Engineer
> Wikimedia Foundation

Re: Announcement - Mediawiki History Dumps

Joseph Allemandou
Hi Giovanni,
The pagelinks table is great for point-in-time snapshots: it tells you which
links exist between pages at the time of the query. Parsing the wikitext is
needed to provide a historical view of the links :)
Cheers
Joseph
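[Editor's note: a minimal sketch, not part of the dumps or the thread, of why the distinction above matters. The pagelinks table only reflects current state, so recovering historical links means extracting link targets from each revision's stored wikitext. The regex and function below are my own simplified illustration; real parsing must also handle templates, transclusion, redirects, and syntax changes over time.]

```python
import re

# Match [[Target]], [[Target|label]], and [[Target#Section]] wikilinks.
# Deliberately simplified: ignores templates, nested markup, and escaping.
WIKILINK_RE = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|[^\[\]]*)?\]\]")

def extract_links(wikitext):
    """Return the page titles linked from a snippet of wikitext,
    normalized with underscores as in page titles."""
    titles = []
    for match in WIKILINK_RE.finditer(wikitext):
        title = match.group(1).strip()
        if title:
            titles.append(title.replace(" ", "_"))
    return titles

print(extract_links("See [[Main Page|the main page]] and [[Help:Links#Internal]]."))
# → ['Main_Page', 'Help:Links']
```

Running this over every revision of every page (rather than once over current content) is what makes the historical-links extraction resource-intensive, as Joseph notes above.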

--
Joseph Allemandou (joal) (he / him)
Sr Data Engineer
Wikimedia Foundation