Can we drop revision hashes (rev_sha1)?

Can we drop revision hashes (rev_sha1)?

Daniel Kinzler-2
Hi all!

I'm working on the database schema for Multi-Content-Revisions (MCR)
<https://www.mediawiki.org/wiki/Multi-Content_Revisions/Database_Schema> and I'd
like to get rid of the rev_sha1 field:

Maintaining revision hashes (the rev_sha1 field) is expensive, and becomes more
expensive with MCR. With multiple content objects per revision, we need to track
the hash for each slot, and then re-calculate the sha1 for each revision.

That's expensive, especially in terms of bytes per database row, which impacts
query performance.

So, what do we need the rev_sha1 field for? As far as I know, nothing in core
uses it, and I'm not aware of any extension using it either. It seems to be used
primarily in offline analysis for detecting (manual) reverts by looking for
revisions with the same hash.

Is that reason enough for dragging all the hashes around the database with every
revision update? Or can we just compute the hashes on the fly for the offline
analysis? Computing hashes is slow since the content needs to be loaded first,
but it would only have to be done for pairs of revisions of the same page with
the same size, which should be a pretty good optimization.
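
To sketch what I mean (Python, with a hypothetical load_content() standing in
for whatever loads revision text), the size pre-filter would look roughly like:

    import hashlib
    from collections import defaultdict

    def find_candidate_groups(revisions):
        """revisions: iterable of (rev_id, page_id, size) metadata rows.
        Only same-page, same-size revisions can be identical, so only
        these groups ever need their content loaded and hashed."""
        groups = defaultdict(list)
        for rev_id, page_id, size in revisions:
            groups[(page_id, size)].append(rev_id)
        return {key: ids for key, ids in groups.items() if len(ids) > 1}

    def find_identical_pairs(candidate_groups, load_content):
        """load_content(rev_id) -> bytes is the hypothetical loader.
        Hashes only the candidates and yields (later, earlier) rev pairs."""
        for rev_ids in candidate_groups.values():
            seen = {}
            for rev_id in sorted(rev_ids):
                digest = hashlib.sha1(load_content(rev_id)).digest()
                if digest in seen:
                    yield rev_id, seen[digest]
                else:
                    seen[digest] = rev_id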

Also, I believe Roan is currently looking for a better mechanism for tracking
all kinds of reverts directly.

So, can we drop rev_sha1?

--
Daniel Kinzler
Principal Platform Engineer

Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.


Re: Can we drop revision hashes (rev_sha1)?

Erik Zachte-3
Computing the hashes on the fly for the offline analysis doesn't work for Wikistats 1.0, as it only parses the stub dumps, which have no article content, just metadata.
Parsing the full archive dumps is quite expensive, time-wise.

This may change with Wikistats 2.0, which has a totally different process flow. That I can't tell.

Erik Zachte


Re: Can we drop revision hashes (rev_sha1)?

Andrew Otto
We should hear from Joseph, Dan, Marcel, and Aaron H on this, I think, but
from the little I know:

Most analytical computations (for things like reverts, as you say) don’t
have easy access to content, so computing SHAs on the fly is pretty hard.
MediaWiki history reconstruction relies on the SHA to figure out what
revisions revert other revisions, as there is no reliable way to know if
something is a revert other than by comparing SHAs.

See
https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/Mediawiki_history
(particularly the *revert* fields).

Re: Can we drop revision hashes (rev_sha1)?

James Hare-4
What I wonder is – does this *need* to be a part of the database table, or
can it be a dataset generated from each revision and then published
separately? This way each user wouldn’t have to individually compute the
hashes while we also get the (ostensible) benefit of getting them out of
the table.


Re: Can we drop revision hashes (rev_sha1)?

Stas Malyshev
In reply to this post by Andrew Otto
Hi!

> We should hear from Joseph, Dan, Marcel, and Aaron H on this I think, but
> from the little I know:
>
> Most analytical computations (for things like reverts, as you say) don’t
> have easy access to content, so computing SHAs on the fly is pretty hard.
> MediaWiki history reconstruction relies on the SHA to figure out what
> revisions revert other revisions, as there is no reliable way to know if
> something is a revert other than by comparing SHAs.

As a random idea - would it be possible to calculate the hashes when
data is transitioned from SQL to Hadoop storage? I imagine that would
slow down the transition, but I'm not sure whether it'd be substantial. If
we're using the hash just to compare revisions, we could also use a
different hash (maybe a non-crypto hash?), which may be faster.
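
For illustration only (the concrete choice would need real benchmarking), the
Python standard library already covers the spectrum:

    import hashlib
    import zlib

    text = b"revision content as loaded during the SQL-to-Hadoop transition"

    # Current scheme: full 160-bit SHA-1.
    sha1 = hashlib.sha1(text).hexdigest()

    # Non-crypto checksum: very fast, but 32 bits is probably too
    # collision-prone for anything beyond per-page comparison.
    crc = zlib.crc32(text)

    # Middle ground: BLAKE2 with a truncated digest -- typically faster
    # than SHA-1, and the output size is tunable to the uniqueness needed.
    blake = hashlib.blake2b(text, digest_size=10).hexdigest()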

--
Stas Malyshev
[hidden email]


Re: Can we drop revision hashes (rev_sha1)?

Andrew Otto
> As a random idea - would it be possible to calculate the hashes when data
> is transitioned from SQL to Hadoop storage?

We take monthly snapshots of the entire history, so every month we’d have
to pull the content of every revision ever made :o

Re: Can we drop revision hashes (rev_sha1)?

Andrew Otto
> can it be a dataset generated from each revision and then published
> separately?

Perhaps it could be generated asynchronously via a job? Either stored in the
revision table or in a separate table.


Re: Can we drop revision hashes (rev_sha1)?

Stas Malyshev
In reply to this post by Andrew Otto
Hi!

On 9/15/17 1:06 PM, Andrew Otto wrote:
>> As a random idea - would it be possible to calculate the hashes
>> when data is transitioned from SQL to Hadoop storage?
>
> We take monthly snapshots of the entire history, so every month we’d
> have to pull the content of every revision ever made :o

Why? If you've already seen that revision in a previous snapshot, you'd
already have its hash. Admittedly, I have no idea how the process works,
so I am just talking from general knowledge and may be missing some things.
Also, of course, you already have hashes for all revisions up to the day we
decide to turn the hash off. Starting that day, hashes would have to be
generated, but I see no reason to generate one more than once.
--
Stas Malyshev
[hidden email]


Re: Can we drop revision hashes (rev_sha1)?

C. Scott Ananian
Alternatively, perhaps "hash" could be an optional part of an MCR chunk?
We could keep it for the wikitext, but drop the hash for the metadata, and
drop any support for a "combined" hash over wikitext + all-other-pieces.

...which raises the question of how reverts work in MCR.  Is it just the
wikitext which is reverted, or do categories and other metadata revert as
well?  And perhaps we can just mark these at revert time instead of trying
to reconstruct them after the fact?
 --scott

--
(http://cscott.net)

Re: Can we drop revision hashes (rev_sha1)?

Chad
In reply to this post by James Hare-4
We could keep it in the XML dumps (it's part of the XSD after all)...just
compute it at export time. Not terribly hard, I don't think; we should have
the parsed content already on hand....

-Chad


Re: Can we drop revision hashes (rev_sha1)?

Daniel Kinzler-2
In reply to this post by Erik Zachte-3
On 15.09.2017 19:49, Erik Zachte wrote:
> Computing the hashes on the fly for the offline analysis doesn't work for Wikistats 1.0, as it only parses the stub dumps, which have no article content, just metadata.
> Parsing the full archive dumps is quite expensive, time-wise.

We can always compute the hash when outputting XML dumps that contain the full
content (it's already loaded, so no big deal), and then generate the XML dump
with only metadata from the full dump.
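
As a sketch (hypothetical writer, element layout abbreviated) - the text is in
memory at that point anyway, so emitting the hash costs one extra pass over
bytes we already hold, not an extra content fetch:

    import hashlib

    def write_revision(out, rev_id, text):
        """Hypothetical full-dump writer: hash the text while it's in hand.
        The real dumps store the SHA-1 in base-36; plain hex is used here
        only to keep the sketch short."""
        sha1 = hashlib.sha1(text.encode("utf-8")).hexdigest()
        out.write("    <revision>\n")
        out.write(f"      <id>{rev_id}</id>\n")
        out.write(f"      <sha1>{sha1}</sha1>\n")
        out.write(f"      <text>{text}</text>\n")
        out.write("    </revision>\n")

The stub dump can then be produced by copying everything but the <text>
element over from the full dump.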


--
Daniel Kinzler
Principal Platform Engineer

Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.


Re: Can we drop revision hashes (rev_sha1)?

Daniel Kinzler-2
In reply to this post by C. Scott Ananian
A revert restores a previous revision. It covers all slots.

The fact that reverts, watching, protecting, etc. still work per page, while you
can have multiple kinds of different content on the page, is indeed the point of
MCR.

On 15.09.2017 22:23, C. Scott Ananian wrote:

> Alternatively, perhaps "hash" could be an optional part of an MCR chunk?
> We could keep it for the wikitext, but drop the hash for the metadata, and
> drop any support for a "combined" hash over wikitext + all-other-pieces.
>
> ...which raises the question of how reverts work in MCR.  Is it just the
> wikitext which is reverted, or do categories and other metadata revert as
> well?  And perhaps we can just mark these at revert time instead of trying
> to reconstruct them after the fact?
>  --scott

--
Daniel Kinzler
Principal Platform Engineer

Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.


Re: Can we drop revision hashes (rev_sha1)?

Daniel Kinzler-2
In reply to this post by Stas Malyshev
Ok, a little more detail here:

For MCR, we would have to keep around the hash of each content object ("slot")
AND of each revision. This makes the revision and content tables "wider", which
is a problem because they grow quite "tall", too. It also means we have to
compute a hash of hashes for each revision, but that's not horrible.
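
For what it's worth, the hash-of-hashes itself is cheap; something like this
(the scheme is purely illustrative, not a settled design):

    import hashlib

    def revision_hash(slot_hashes):
        """slot_hashes: mapping of slot role -> content hash.
        Feeding roles and hashes in a canonical order, with separators,
        keeps the result stable however the slots were enumerated."""
        h = hashlib.sha1()
        for role in sorted(slot_hashes):
            h.update(role.encode("utf-8") + b"\x00")
            h.update(slot_hashes[role].encode("utf-8") + b"\x00")
        return h.hexdigest()

    # e.g. a main wikitext slot plus a hypothetical metadata slot (dummy hashes)
    revision_hash({"main": "2fd4e1c67a2d28fc", "meta": "9c1185a5c5e9fc54"})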

I'm hoping we can remove the hash from both tables. Keeping the hash of each
content object and/or each revision somewhere else is fine with me. Perhaps it's
sufficient to generate it when generating XML dumps. Maybe we want it in Hadoop.
Maybe we want to have it in a separate SQL database. But perhaps we don't
actually need it.

Can someone explain *why* they want the hash at all?

On 15.09.2017 22:01, Stas Malyshev wrote:

> As a random idea - would it be possible to calculate the hashes when
> data is transitioned from SQL to Hadoop storage? I imagine that would
> slow down the transition, but I'm not sure whether it'd be substantial. If
> we're using the hash just to compare revisions, we could also use a
> different hash (maybe a non-crypto hash?), which may be faster.


--
Daniel Kinzler
Principal Platform Engineer

Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.


Re: Can we drop revision hashes (rev_sha1)?

Matthew Flaschen-2
In reply to this post by Daniel Kinzler-2
On 09/15/2017 06:51 AM, Daniel Kinzler wrote:
> Also, I believe Roan is currently looking for a better mechanism for tracking
> all kinds of reverts directly.

Let's see if we want to use rev_sha1 for that better solution (a way to
track reverts within MW itself) before we drop it.

I know Roan is planning to write an RFC on reverts.

Matt


Re: Can we drop revision hashes (rev_sha1)?

Gergo Tisza
At a quick glance, EventBus and FlaggedRevs are the two extensions using
the hashes. EventBus just puts them into the emitted data; FlaggedRevs
detects reverts to the latest stable revision that way (so there is no
rev_sha1 based lookup in either case, although in the case of FlaggedRevs I
could imagine a use case for something like that).

Files on the other hand use hash lookups a lot, and AIUI they are planned
to become MCR slots eventually.

For a quick win, you could just reduce the hash size. We have around a
billion revisions, and probably won't ever have more than a trillion;
square that for birthday effect and add a couple extra zeros just to be
sure, and it still fits comfortably into 80 bits. If hashes only need to be
unique within the same page then maybe 30-40.
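
The sizing can be sanity-checked with the usual birthday approximation (a
sketch; the revision counts are just the guesses above):

    import math

    def collision_prob(n, bits):
        """Birthday approximation: p ~= 1 - exp(-n^2 / 2^(bits + 1))."""
        return -math.expm1(-n * n / 2.0 ** (bits + 1))

    print(collision_prob(1e9, 80))   # ~10^9 revisions (today): ~4e-7
    print(collision_prob(1e12, 80))  # 10^12 revisions: ~0.34
    print(collision_prob(1e5, 40))   # per-page, a very long history: ~5e-3

At today's ~10^9 revisions, 80 bits leaves a collision probability around
4e-7; per-page uniqueness indeed needs far less.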

Re: Can we drop revision hashes (rev_sha1)?

Antoine Musso-3
In reply to this post by Daniel Kinzler-2
On 15/09/2017 12:51, Daniel Kinzler wrote:
>
> I'm working on the database schema for Multi-Content-Revisions (MCR)
> <https://www.mediawiki.org/wiki/Multi-Content_Revisions/Database_Schema>  and I'd
> like to get rid of the rev_sha1 field:
>
> Maintaining revision hashes (the rev_sha1 field) is expensive, and becomes more
> expensive with MCR. With multiple content objects per revision, we need to track
> the hash for each slot, and then re-calculate the sha1 for each revision.
<snip>

Hello,

That was introduced by Aaron Schulz. The purpose is to have them
precomputed, since that is quite expensive to do on millions of rows.

A use case was to easily detect reverts.

See for reference:
https://phabricator.wikimedia.org/T23860
https://phabricator.wikimedia.org/T27312

I guess Aaron Halfaker, Brion Vibber, and Aaron Schulz would have some
insights about it.

Re: Can we drop revision hashes (rev_sha1)?

MZMcBride-2
Antoine Musso wrote:
>I guess Aaron Halfaker, Brion Vibber, Aaron Schulz would have some
>insights about it.

Yes. Brion started a thread about the use of SHA-1 in February 2017:

https://lists.wikimedia.org/pipermail/wikitech-l/2017-February/087664.html
https://lists.wikimedia.org/pipermail/wikitech-l/2017-February/087666.html

Of note, we have <https://www.mediawiki.org/wiki/Manual:Hashing>.

The use of base-36 SHA-1 instead of base-16 SHA-1 for revision.rev_sha1
has always perplexed me. It'd be nice to better(?) document that design
decision. It's referenced here:
https://lists.wikimedia.org/pipermail/wikitech-l/2012-September/063445.html
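
One plausible reason is simply column width: re-encoding the same 160 bits in
base 36 shortens the value from 40 characters to at most 31. A quick sketch
(I believe MediaWiki zero-pads to a fixed 31 characters, but treat that as an
assumption):

    import hashlib

    hex_form = hashlib.sha1(b"example revision text").hexdigest()  # 40 chars

    n = int(hex_form, 16)
    digits = "0123456789abcdefghijklmnopqrstuvwxyz"
    b36 = ""
    while n:
        n, r = divmod(n, 36)
        b36 = digits[r] + b36
    b36 = b36.rjust(31, "0")  # 160 bits never needs more than 31 base-36 digits

    print(len(hex_form), len(b36))  # 40 31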

MZMcBride

Re: Can we drop revision hashes (rev_sha1)?

Daniel Kinzler-2
In reply to this post by Matthew Flaschen-2
On 16.09.2017 01:22, Matthew Flaschen wrote:
> On 09/15/2017 06:51 AM, Daniel Kinzler wrote:
>> Also, I believe Roan is currently looking for a better mechanism for tracking
>> all kinds of reverts directly.
>
> Let's see if we want to use rev_sha1 for that better solution (a way to track
> reverts within MW itself) before we drop it.


The problem is that if we don't drop it, we have to *introduce* it for the new
content table for MCR. I'd like to avoid that.

I guess we can define the field and just null it, but... well. I'd like to avoid
that.


--
Daniel Kinzler
Principal Platform Engineer

Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.


Re: Can we drop revision hashes (rev_sha1)?

Dan Andreescu
So, as things stand, rev_sha1 in the database is used for:

1. the XML dumps process and all the researchers depending on the XML dumps
(probably just for revert detection)
2. revert detection for libraries like python-mwreverts [1]
3. revert detection in mediawiki history reconstruction processes in Hadoop
(Wikistats 2.0)
4. revert detection in Wikistats 1.0
5. revert detection for tools that run on labs, like Wikimetrics
?. I think Aaron also uses rev_sha1 in ORES, but I can't seem to find the
latest code for that service

If you think about this list above as a flow of data, you'll see that
rev_sha1 is replicated to XML, labs databases, Hadoop, ML models, etc.  So
removing it and adding it back downstream from the main MediaWiki database
somewhere, like in XML, cuts off the other places that need it.  That means
it must be available either in the MediaWiki database or in some other
central database which all those other consumers can pull from.

I defer to your expertise when you say it's expensive to keep in the db,
and I can see how that would get much worse with MCR.  I'm sure we can
figure something out, though.  Right now it seems like our options are, as
others have pointed out:

* compute async and store in DB or somewhere else that's central and easy
to access from all the branches I mentioned
* update how we detect reverts and keep a revert database with good
references to wiki_db, rev_id so it can be brought back in context.

Personally, I would love to get better revert detection, using sha1 exact
matches doesn't really get to the heart of the issue.  Important phenomena
like revert wars, bullying, and stalking are hiding behind bad revert
detection.  I'm happy to brainstorm ways we can use Analytics
infrastructure to do this.  We definitely have the tools necessary, but not
so much the man-power.  That said, please don't strip out rev_sha1 until
we've accounted for all its "data customers".

So, put another way, I think it's totally fine if we say ok everyone, from
date XYZ, you will no longer have rev_sha1 in the database, but if you want
to know whether an edit reverts a previous edit or a series of edits, go
*HERE*.  That's fine.  And just for context, here's how we do our revert
detection in Hadoop (it's pretty fancy) [2].
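
For anyone who doesn't read Scala, a much-simplified rendition of the core
idea (the real job in [2] handles many more cases):

    def label_identity_reverts(page_history):
        """page_history: chronological list of (rev_id, sha1) for one page.
        A revision whose hash matches an earlier revision is an identity
        revert; everything strictly in between is marked as reverted."""
        first_seen = {}  # sha1 -> index of first revision with that hash
        reverts = []     # (reverting rev, restored rev, [reverted revs])
        for i, (rev_id, sha1) in enumerate(page_history):
            j = first_seen.setdefault(sha1, i)
            if j < i - 1:  # matches an earlier, non-adjacent revision
                reverted = [r for r, _ in page_history[j + 1:i]]
                reverts.append((rev_id, page_history[j][0], reverted))
        return reverts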


[1] https://github.com/mediawiki-utilities/python-mwreverts
[2]
https://github.com/wikimedia/analytics-refinery-source/blob/1d38b8e4acfd10dc811279826ffdff236e8b0f2d/refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/mediawikihistory/denormalized/DenormalizedRevisionsBuilder.scala#L174-L317


Re: Can we drop revision hashes (rev_sha1)?

Danny B.-2

---------- Original message ----------
From: Dan Andreescu <[hidden email]>
To: Wikimedia developers <[hidden email]>
Date: 18. 9. 2017 16:26:18
Subject: Re: [Wikitech-l] Can we drop revision hashes (rev_sha1)?

I use rev_sha1 on replicas to check the consistency of modules, templates, or
other pages (typically help pages) which should be the same across projects
(either within one language or even cross-language, if the page is not
language-dependent) - in other words, to detect possible changes in them and
sync them.

Also, I haven't noticed it mentioned in the thread: Flow also notifies users
on reverts, but I don't know whether it uses rev_sha1 or not, so I'm
mentioning it just in case.

Kind regards

Danny B.