|
Dear all;

First it was Encarta, then printed Britannica. Tomorrow, Knol.[1][2] It is not a good moment for Wikipedia "rivals".

We at Archive Team are attempting to download all the 700,000 Knols.[3] For the sake of history. Join us, #archiveteam EFNET.

Regards,
emijrp

[1] http://knol.google.com/k
[2] http://news.bbc.co.uk/2/hi/technology/7144970.stm
[3] http://db.tt/GNrEh61y

--
Emilio J. Rodríguez-Posada. E-mail: emijrp AT gmail DOT com
Pre-doctoral student at the University of Cádiz (Spain)
Projects: AVBOT <http://code.google.com/p/avbot/> | StatMediaWiki <http://statmediawiki.forja.rediris.es> | WikiEvidens <http://code.google.com/p/wikievidens/> | WikiPapers <http://wikipapers.referata.com> | WikiTeam <http://code.google.com/p/wikiteam/>
Personal website: https://sites.google.com/site/emijrp/
_______________________________________________
Wikimedia-l mailing list [hidden email]
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l |
|
> We at Archive Team are attempting to download all the 700,000 Knols.[3] For
> the sake of history. Join us, #archiveteam EFNET.

I did some follow-up. I'm not sure I can help out with Knol anymore, but I discovered that AT is having some trouble making good archives of Wikimedia sites.

Theoretically, Wikipedia et al. SHOULD be easy to reconstitute, right? That's why we're using CC licenses and all. Otherwise, if we drop the ball, WP will be gone. This seems like a priority to me!

The main problem seems to be obtaining Commons images:
http://archiveteam.org/index.php?title=Wikiteam

So at the very least, we don't appear to have very good documentation. Who could best help Archive Team out? Has anyone done/written documentation on completely restoring one or more Wikimedia wikis from 'public backup' [1]?

What can we do to help them?

sincerely,
Kim Bruning

[1] "Real Men don't make backups. They upload it via ftp and let the world mirror it." - Linus Torvalds |
|
I know from experience that a wiki can be rebuilt from any one of the dumps that are provided. pages-meta-current, for example, contains everything needed to reboot a site except its user database (names/passwords, etc.). See http://www.mediawiki.org/wiki/Manual:Moving_a_wiki

On Wed, May 16, 2012 at 10:01 PM, Kim Bruning <[hidden email]> wrote:
> [...] |
|
On Wed, May 16, 2012 at 11:11:04PM -0400, John wrote:
> I know from experience that a wiki can be re-built from any one of the
> dumps that are provided, (pages-meta-current) for example contains
> everything needed to reboot a site except its user database
> (names/passwords ect). see
> http://www.mediawiki.org/wiki/Manual:Moving_a_wiki

Sure. Does this include all images, including Commons images, eventually converted to operate locally?

I'm thinking about a full snapshot-and-later-restore, say 25 or 50 years from now, or in an academic setting (or, FSM forbid, in a worst-case scenario <knock on wood>). That's what the AT folks are most interested in.

==Fire Drill==

Has anyone recently set up a full external duplicate of (for instance) en.wp? This includes all images, all discussions, and all page history (excepting the user accounts and deleted pages).

This would be a useful and important exercise, possibly to be repeated once per year.

I get a sneaky feeling that the first few iterations won't go so well.

I'm sure AT would be glad to help out with the running of these fire drills, as it seems to be in line with their mission.

sincerely,
Kim Bruning |
|
In reply to this post by John Doe-27
On Thu, May 17, 2012 at 1:11 PM, John <[hidden email]> wrote:
> I know from experience that a wiki can be re-built from any one of the
> dumps that are provided, (pages-meta-current) for example contains
> everything needed to reboot a site except its user database
> (names/passwords ect). see
> http://www.mediawiki.org/wiki/Manual:Moving_a_wiki

How would we regain control of our existing usernames in the event that the user database was lost in the move?

--
John Vandenberg |
|
In reply to this post by Kim Bruning
Except for files, getting a content clone up is relatively easy, and can be done in fairly short order (i.e. less than two weeks for everything). I know there is talk about getting an rsync setup for images. |
|
On Thu, May 17, 2012 at 12:03:02AM -0400, John wrote:
> Except for files, getting a content clone up is relativity easy, and can be
> done in a fairly quick order (aka less than two weeks for everything). I
> know there is talk about getting a rsync setup for images.

Ouch, 2 weeks. We need the images to be replicable too though. <scratches head>

sincerely,
Kim Bruning |
|
That two-week estimate was a worst-case scenario. In the best case we are talking about as little as a few hours for the smaller wikis, up to 5 days or so for a project the size of enwiki. (See http://lists.wikimedia.org/pipermail/xmldatadumps-l/2012-May/000491.html for progress on image dumps.)

On Wed, May 16, 2012 at 11:10 PM, Kim Bruning <[hidden email]> wrote:
> [...] |
|
Take a look at http://www.mediawiki.org/wiki/Manual:Importing_XML_dumps for exactly how to import an existing dump. I know that re-importing a cluster for the Toolserver normally takes just a few days once they have the needed dumps.

On Thu, May 17, 2012 at 12:13 AM, John <[hidden email]> wrote:
> [...] |
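To get a feel for how much work an import involves, a dump can be streamed and its pages and revisions counted before anything is loaded into a database. The following is only a minimal sketch, not the MediaWiki importer: the sample XML, its namespace URI, and the function names are made up for illustration, and it assumes the dump uses `<page>` and `<revision>` elements.

```python
import bz2
import io
import xml.etree.ElementTree as ET

# Tiny made-up sample in the general shape of a MediaWiki XML export.
# The namespace URI is illustrative; the code below ignores namespaces.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.6/">
  <page>
    <title>Example A</title>
    <revision><id>1</id><text>first</text></revision>
    <revision><id>2</id><text>second</text></revision>
  </page>
  <page>
    <title>Example B</title>
    <revision><id>3</id><text>only</text></revision>
  </page>
</mediawiki>"""


def local_name(tag):
    """Strip the XML namespace, e.g. '{http://...}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def count_pages_and_revisions(stream):
    """Stream through a dump and return (pages, revisions)."""
    pages = 0
    revisions = 0
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "revision":
            revisions += 1
        elif name == "page":
            pages += 1
            # Drop the finished page's children so revision text is released.
            elem.clear()
    return pages, revisions


def count_dump(path):
    """Count a dump file on disk, transparently handling .bz2 files."""
    opener = bz2.open if path.endswith(".bz2") else open
    with opener(path, "rb") as handle:
        return count_pages_and_revisions(handle)


if __name__ == "__main__":
    print(count_pages_and_revisions(io.BytesIO(SAMPLE)))  # (2, 3)
```

Running `count_dump` on a real dump file gives the revision total that any per-revision time estimate would be based on.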
|
In reply to this post by John Doe-27
On Thu, May 17, 2012 at 12:13 AM, John <[hidden email]> wrote:
> that two week estimate was given worst case scenario. Given the best case
> we are talking as little as a few hours for the smaller wikis to 5 days or
> so for a project the size of enwiki. (see
> http://lists.wikimedia.org/pipermail/xmldatadumps-l/2012-May/000491.html for
> progress on image dumps)

Where are you getting these figures from? Are you talking about a full history copy?

Also, what about the copyright issues (especially, attribution)? |
|
In reply to this post by John Doe-27
On Thu, May 17, 2012 at 12:18 AM, John <[hidden email]> wrote:
> take a look at http://www.mediawiki.org/wiki/Manual:Importing_XML_dumps for
> exactly how to import an existing dump, I know the process of re-importing
> a cluster for the toolserver is normally just a few days when they have the
> needed dumps.

Toolserver doesn't have full history, does it? |
|
Toolserver is a clone of the WMF servers minus files. They run a database replication of all wikis. These times are dependent on available hardware and may vary, but should provide a decent estimate.

On Thu, May 17, 2012 at 12:23 AM, Anthony <[hidden email]> wrote:
> [...] |
|
I'll run a quick benchmark, importing the full history of simple.wikipedia to my laptop wiki-on-a-stick, and give an exact duration.

On Thu, May 17, 2012 at 12:26 AM, John <[hidden email]> wrote:
> [...] |
|
On Thu, May 17, 2012 at 12:30 AM, John <[hidden email]> wrote:
> Ill run a quick benchmark and import the full history of simple.wikipedia to
> my laptop wiki on a stick, and give an exact duration

Simple.wikipedia is nothing like en.wikipedia. For one thing, there's no need to turn on $wgCompressRevisions with simple.wikipedia.

Is $wgCompressRevisions still used? I haven't followed this in quite a while. |
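For readers who have not met the setting: $wgCompressRevisions asks MediaWiki to store revision text compressed, which as far as I know is done with PHP's gzdeflate. The sketch below is a Python analogue, not MediaWiki code. It uses raw DEFLATE through the standard `zlib` module to show the kind of saving involved; the function names and sample text are invented for illustration.

```python
import zlib


def deflate(text):
    """Compress text with raw DEFLATE (no zlib header), similar in spirit to gzdeflate."""
    compressor = zlib.compressobj(9, zlib.DEFLATED, -15)
    return compressor.compress(text.encode("utf-8")) + compressor.flush()


def inflate(blob):
    """Reverse deflate(): raw DEFLATE bytes back to text."""
    return zlib.decompress(blob, -15).decode("utf-8")


if __name__ == "__main__":
    # Made-up, repetitive wikitext standing in for a real revision.
    revision = "== Heading ==\nSome '''wikitext''' with [[links]].\n" * 200
    packed = deflate(revision)
    print(len(revision.encode("utf-8")), "bytes raw,", len(packed), "bytes compressed")
    assert inflate(packed) == revision
```

The trade-off is the usual one: less storage, at the cost of CPU time on every write and read, which is part of why import speed on a small wiki may not carry over directly to a large one.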
|
*Simple.wikipedia is nothing like en.wikipedia*

I care to dispute that statement. All WMF wikis are set up basically the same (an odd extension here or there is different, and namespace names differ at times), but for the purpose of recovery simplewiki_p is a very standard example. This issue isn't just about enwiki_p but *all* WMF wikis. Doing a data recovery for enwiki vs. simplewiki is just a matter of time: for enwiki a 5-day estimate would be fairly standard (depending on server setup), with lower times for smaller databases. Typically you can express it as a rate of X revisions processed per Y time unit, regardless of the project, and that rate should be similar for everything given the same hardware setup.

On Thu, May 17, 2012 at 12:37 AM, Anthony <[hidden email]> wrote:
> [...] |
|
On Thu, May 17, 2012 at 12:45 AM, John <[hidden email]> wrote:
> Simple.wikipedia is nothing like en.wikipedia I care to dispute that
> statement, All WMF wikis are setup basically the same (an odd extension here
> or there is different, and different namespace names at times) but for the
> purpose of recovery simplewiki_p is a very standard example. this issue isnt
> just about enwiki_p but *all* wmf wikis. Doing a data recovery for enwiki vs
> simplewiki is just a matter of time, for enwiki a 5 day estimate would be
> fairly standard (depending on server setup) and lower times for smaller
> databases. typically you can explain it in a rate of X revisions processed
> per Y time unit, regardless of the project. and that rate should be similar
> for everything given the same hardware setup.

Are you compressing old revisions, or not? Does the WMF database compress old revisions, or not?

In any case, I'm sorry, a 20 gig mysql database does not scale linearly to a 20 terabyte mysql database. |
|
In reply to this post by John Vandenberg
On Thu, May 17, 2012 at 1:14 PM, John Vandenberg <[hidden email]> wrote:
> How would we regain control of our existing usernames in the event
> that the user database was lost in the move?

That would be up to the end project to decide, although ideally they shouldn't hand an account back unless you can somehow prove it was yours; otherwise there are possible issues with misattribution if someone else managed to regain the account. |
|
If both are accessible, I've seen an extension that allowed you to claim your username. I saw it in action when Wowpedia forked from the Wikia WoWWiki: they let people claim their old usernames with an edit (and a code in the edit summary, IIRC) on the other wiki.

James

On Wed, May 16, 2012 at 10:03 PM, K. Peachey <[hidden email]> wrote:
> [...] |
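The claim-by-edit idea can be sketched roughly as follows. This is not the extension mentioned above, whose internals are not described in this thread; it is a hypothetical illustration in which the new wiki issues a one-time code and then checks that an edit on the old wiki, made by the claimed username, carries that code in its summary. All names here are invented.

```python
import secrets


def new_claim_code(nbytes=8):
    """Issue a hard-to-guess one-time code for the user to paste into an edit summary."""
    return "claim-" + secrets.token_hex(nbytes)


def edit_proves_claim(edit_author, edit_summary, claimed_username, expected_code):
    """True only if the old-wiki edit was made by the claimed user and contains the code."""
    return edit_author == claimed_username and expected_code in edit_summary


if __name__ == "__main__":
    code = new_claim_code()
    # The old-wiki account still works, so only its real owner can make this edit.
    print(edit_proves_claim("Alice", "claiming my account: " + code, "Alice", code))
    # Someone else pasting the code from their own account proves nothing.
    print(edit_proves_claim("Mallory", "claiming my account: " + code, "Alice", code))
```

The scheme only works while the old wiki is still reachable and its accounts are intact, which matches the "if both are accessible" condition.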
|
In reply to this post by Anthony-73
Anthony, the process is linear: you have a PHP script inserting X number of rows per Y time frame. Yes, rebuilding the externallinks, links, and langlinks tables will take some additional time and won't scale. However, I have been working with the Toolserver since 2007 and I've lost count of the number of times that the TS has needed to re-import a cluster (s1-s7), and even enwiki can be done in a semi-reasonable timeframe.

The WMF actually compresses all text blobs, not just old versions. A complete download and decompression of simple took only 20 minutes on my 2-year-old consumer-grade laptop with a standard home cable internet connection; the same download on the Toolserver (minus decompression) took 88 s. Yes, importing will take a little longer, but it shouldn't be that big of a deal. There will also be some needed cleanup tasks.

However, the main point stands: archiving and restoring WMF wikis isn't an issue, and with moderately recent hardware it is no big deal. I'm putting my money where my mouth is and getting actual valid stats and figures. Yes, it may not be an exactly 1:1 ratio when scaling up; however, given the basics of how importing a dump functions, it should remain close to the same ratio.

On Thu, May 17, 2012 at 12:54 AM, Anthony <[hidden email]> wrote:
> [...] |
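The disagreement here is about whether a rate measured on a small wiki can be projected onto a large one. A minimal sketch of that projection follows, with an explicit slowdown factor to represent the objection that throughput may degrade at scale. Every number in it is a made-up placeholder, not a measurement from this thread, and the function names are invented.

```python
SECONDS_PER_DAY = 86_400


def revisions_per_second(revisions_done, elapsed_seconds):
    """Throughput observed in a benchmark import."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return revisions_done / elapsed_seconds


def estimate_import_days(total_revisions, rate, slowdown=1.0):
    """Project import time in days.

    slowdown=1.0 is the purely linear model; values above 1.0 model
    throughput getting worse as the database grows.
    """
    if rate <= 0:
        raise ValueError("rate must be positive")
    if slowdown < 1.0:
        raise ValueError("slowdown must be at least 1.0")
    return total_revisions * slowdown / rate / SECONDS_PER_DAY


if __name__ == "__main__":
    # Placeholder benchmark: 1,000,000 revisions imported in 2,000 seconds.
    rate = revisions_per_second(1_000_000, 2_000)
    # Placeholder target size: 500,000,000 revisions.
    print(round(estimate_import_days(500_000_000, rate), 1), "days if linear")
    print(round(estimate_import_days(500_000_000, rate, slowdown=2.0), 1), "days at half throughput")
```

A benchmark on a small wiki pins down `rate`; it says nothing about `slowdown`, which is the quantity actually in dispute and can only be measured on a database of comparable size.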
|
In reply to this post by Kim Bruning
Well, to be honest, I am still upset about how much data is deleted from Wikipedia because it is not "notable". There are so many articles that I might be interested in that are lost in the same garbage as spam and other things. We should make non-notable, non-harmful articles available in the backups as well.

mike

On Thu, May 17, 2012 at 2:28 AM, Kim Bruning <[hidden email]> wrote:
> [...]

--
James Michael DuPont
Member of Free Libre Open Source Software Kosova http://flossk.org |
