use Linus' git SHA1 mapping to unbloat the text table

jidanni
Gentlemen, why doesn't the database's text table use Linus Torvalds'
"git" style mapping of identical contents to the same row, as they
have the same SHA1 hash?

Currently even undoing a user's edit just points to a fresh row in the
text table instead of pointing to the identical old one.

This despite these words in tables.sql:
  -- It's possible for multiple revisions to use the same text,
  -- for instance revisions where only metadata is altered
  -- or a rollback to a previous version.
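For illustration only, here is a minimal sketch of the kind of mapping being asked about, written in Python against an in-memory SQLite database. The table and column names (`text_blob`, `revision`, `text_sha1`) and the helper names are hypothetical and are not MediaWiki's schema; the sketch also assumes that equal SHA-1 digests mean equal content, which a real implementation might want to verify.

```python
import hashlib
import sqlite3


def open_store() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical deduplicating schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE text_blob (
            sha1    TEXT PRIMARY KEY,   -- hex SHA-1 of the content
            content BLOB NOT NULL
        );
        CREATE TABLE revision (
            rev_id    INTEGER PRIMARY KEY AUTOINCREMENT,
            text_sha1 TEXT NOT NULL REFERENCES text_blob(sha1)
        );
        """
    )
    return conn


def save_revision(conn: sqlite3.Connection, text: bytes) -> int:
    """Store a revision; identical text reuses the existing text_blob row."""
    digest = hashlib.sha1(text).hexdigest()
    # INSERT OR IGNORE does nothing when a row with this sha1 already exists.
    conn.execute(
        "INSERT OR IGNORE INTO text_blob (sha1, content) VALUES (?, ?)",
        (digest, text),
    )
    cur = conn.execute("INSERT INTO revision (text_sha1) VALUES (?)", (digest,))
    return cur.lastrowid


def demo() -> tuple:
    """Save an edit, an undo back to the old text, and two page blankings."""
    conn = open_store()
    for text in (b"original", b"vandalism", b"original", b"", b""):
        save_revision(conn, text)
    revisions = conn.execute("SELECT COUNT(*) FROM revision").fetchone()[0]
    blobs = conn.execute("SELECT COUNT(*) FROM text_blob").fetchone()[0]
    return revisions, blobs
```

Here `demo()` records five revisions but only three stored texts, because the undo and the second blanking point at rows that already exist.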

Examining my wiki,
echo "SELECT old_text FROM text;"|mysql --default-character-set=binary radioscanningtw -N|\
        perl -nwle 'use Digest::SHA1 qw/sha1_hex/;$h{sha1_hex($_)}++;END{for(keys %h){print $h{$_}}}'|sort -nr|uniq -c
      1 247
      1 5
      2 4
     10 3
    261 2
   1206 1
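In each output line the second number is how many copies a text has and
the first is how many distinct texts have that many copies. So only the
last 1206 texts are stored once; all the others are duplicated.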
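The same tally can be sketched in Python, which also makes it easy to work out how many rows are redundant. The function names are hypothetical, and `POSTED` simply re-enters the six output lines above as copies mapped to number of distinct texts.

```python
import hashlib
from collections import Counter


def duplicate_histogram(texts):
    """Map 'number of copies' to 'how many distinct texts have that many copies'."""
    per_hash = Counter(hashlib.sha1(t).hexdigest() for t in texts)
    return Counter(per_hash.values())


def total_rows(histogram):
    """Total number of rows represented by the histogram."""
    return sum(copies * n for copies, n in histogram.items())


def redundant_rows(histogram):
    """Rows that would disappear if every distinct text were stored once."""
    return sum((copies - 1) * n for copies, n in histogram.items())


# The figures posted above, as copies -> number of distinct texts.
POSTED = {247: 1, 5: 1, 4: 2, 3: 10, 2: 261, 1: 1206}
```

With the posted figures, `total_rows(POSTED)` is 2018 and `redundant_rows(POSTED)` is 537, so roughly a quarter of the rows in this text table repeat content that is already stored.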

echo "SELECT old_text FROM text;"|mysql --default-character-set=binary radioscanningtw -N|\
        perl -lnwe 'use Digest::SHA1 qw/sha1_hex/;print sha1_hex($_),"\t", $_'|sort|uniq -c|sort -nr|\
        perl -C -nwle '/.{0,88}/;print $&;exit if $.==5'
    247 da39a3ee5e6b4b0d3255bfef95601890afd80709
      5 bf36408b7db0ea4b834b935ae2992e97fd438539 請問台中港務警察局的頻率、頻道,有人知道嗎?可以分享嗎?
      4 fa21b2d9a4ace2bb86917e7a83ad20a1f5301917 {{c|486.1000}}|{{c|DCS 065}}|{{c|呼 8xx}
      4 a860f97b87c81344239766c2f243bfff05ae7cdd #REDIRECT [[Project:幫助]]
      3 edabb0d4f0f21cd9dfba867ef9cdbc584c8937c1 全國監獄 {{c|150.6750}}
I even have 247 separate rows holding empty (0-byte) text, from a page
blanking incident. One would be enough.
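The first hash in that listing has no text after it because it is the SHA-1 of zero bytes of input. A quick check in Python (the constant name is only illustrative):

```python
import hashlib

# SHA-1 of the empty byte string, i.e. the digest of a blanked page's text.
EMPTY_SHA1 = hashlib.sha1(b"").hexdigest()
print(EMPTY_SHA1)
```

This prints da39a3ee5e6b4b0d3255bfef95601890afd80709, matching the hash that appears 247 times.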

_______________________________________________
Wikitech-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: use Linus' git SHA1 mapping to unbloat the text table

Domas Mituzas
Hello,

> Gentlemen, why doesn't the database's text table use Linus Torvalds'
> "git" style mapping of identical contents to the same row, as they
> have the same SHA1 hash?

Shouldn't this go to mediawiki-l? Anyway, you can use external storage
to get CAS-based storage, if you really want.

--
Domas Mituzas -- http://dammit.lt/ -- [[user:midom]]

Re: use Linus' git SHA1 mapping to unbloat the text table

jidanni
DM> you can use external storage to use CAS-based storage, if you really want.
Ah,
http://en.wikipedia.org/wiki/Content-addressable_storage#Open_Source_Implementations
http://en.wikipedia.org/wiki/Git_(software)#Implementation
And while you're at it, he says subversion is for goners,
http://www.google.com/search?q=torvalds+subversion+git

Re: use Linus' git SHA1 mapping to unbloat the text table

Domas Mituzas
Hi!

> Ah,
> http://en.wikipedia.org/wiki/Content-addressable_storage#Open_Source_Implementations
> http://en.wikipedia.org/wiki/Git_(software)#Implementation

Thanks for sharing these extremely valuable links. How did you find
them?

> And while you're at it, he says subversion is for goners,
> http://www.google.com/search?q=torvalds+subversion+git

Good for him. Should we store all our content in GIT, from now on?

--
Domas Mituzas -- http://dammit.lt/ -- [[user:midom]]

Re: use Linus' git SHA1 mapping to unbloat the text table

Andrew Garrett
In reply to this post by jidanni
On Mon, Mar 30, 2009 at 12:49 PM,  <[hidden email]> wrote:
> DM> you can use external storage to use CAS-based storage, if you really want.
> Ah,
> http://en.wikipedia.org/wiki/Content-addressable_storage#Open_Source_Implementations
> http://en.wikipedia.org/wiki/Git_(software)#Implementation

Neither git nor Linus Torvalds invented content-addressable storage.
It has been around for years, but we haven't ever needed it enough to
implement it. I assume that if we did need it, we would implement it,
as Tim Starling, one of our staff developers, has been working actively
on a history recompression project.

> And while you're at it, he says subversion is for goners,
> http://www.google.com/search?q=torvalds+subversion+git

While we would like to move to a distributed RCS, we're not doing it
because Linus Torvalds told us to.

--
Andrew Garrett
