[Wikimedia-l] Knol is closing tomorrow

42 messages

Re: [Wikimedia-l] Fire Drill Re: Wikimedia sites not easy to archive (Was Re: Knol is closing tomorrow )

Anthony-73
On Thu, May 17, 2012 at 1:22 AM, John <[hidden email]> wrote:
> Anthony, the process is linear: you have a PHP script inserting X number of rows per
> Y time frame.

Amazing.  I need to switch all my databases to MySQL.  It can insert X
rows per Y time frame, regardless of whether the database is 20
gigabytes or 20 terabytes in size, regardless of whether the average
row is 3K or 1.5K, regardless of whether I'm using a thumb drive or a
RAID array or a cluster of servers, etc.

> Yes, rebuilding the externallinks, links, and langlinks tables
> will take some additional time and won't scale.

And this is part of the process too, right?

> However, I have been working
> with the toolserver since 2007 and I've lost count of the number of times
> that the TS has needed to re-import a cluster (s1-s7), and even enwiki can
> be done in a semi-reasonable timeframe.

Re-importing how?  From the compressed XML full history dumps?

> The WMF actually compresses all text
> blobs, not just old versions.

Is http://www.mediawiki.org/wiki/Manual:Text_table still accurate?  Is
WMF using gzip or object?

> A complete download and decompression of simple
> only took 20 minutes on my 2-year-old consumer-grade laptop with a standard
> home cable internet connection; the same download on the toolserver (minus
> decompression) took 88s. Yeah, importing will take a little longer but
> shouldn't be that big of a deal.

For the full history English Wikipedia it *is* a big deal.

If you think it isn't, stop playing with simple.wikipedia, and tell us
how long it takes to get a mirror up and running of en.wikipedia.

Do you plan to run compressOld.php?  Are you going to import
everything in plain text first, and *then* start compressing?  Seems
like an awful lot of wasted hard drive space.

> There will also be some needed cleanup tasks.
> However, the main issue, archiving and restoring WMF wikis, isn't an issue, and
> with moderately recent hardware is no big deal. I'm putting my money where my
> mouth is, and getting actual valid stats and figures. Yes, it may not be an
> exactly 1:1 ratio when scaling up; however, given the basics of how importing
> a dump functions, it should remain close to the same ratio.

If you want to put your money where your mouth is, import
en.wikipedia.  It'll only take 5 days, right?

_______________________________________________
Wikimedia-l mailing list
[hidden email]
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l

Re: [Wikimedia-l] Fire Drill Re: Wikimedia sites not easy to archive (Was Re: Knol is closing tomorrow )

John Doe-27
On Thu, May 17, 2012 at 1:52 AM, Anthony <[hidden email]> wrote:

> On Thu, May 17, 2012 at 1:22 AM, John <[hidden email]> wrote:
> > Anthony, the process is linear: you have a PHP script inserting X number of rows
> > per Y time frame.
>
> Amazing.  I need to switch all my databases to MySQL.  It can insert X
> rows per Y time frame, regardless of whether the database is 20
> gigabytes or 20 terabytes in size, regardless of whether the average
> row is 3K or 1.5K, regardless of whether I'm using a thumb drive or a
> RAID array or a cluster of servers, etc.
>

When referring to X over Y time, it's an average of, say, 1000 revisions
per 1 minute. Any X over Y period must be considered with averages in mind,
or getting a count wouldn't be possible.
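
As a rough illustration of the linear estimate described above, the import time is just the revision count divided by an average insert rate. This is only a sketch: the function name and the revision count in the example are hypothetical, and the 1000 revisions per minute is the "say" figure from the paragraph above, not a measurement.

```python
def linear_import_minutes(revisions, revisions_per_minute=1000):
    """Estimate import time assuming a constant average insert rate."""
    if revisions_per_minute <= 0:
        raise ValueError("rate must be positive")
    return revisions / revisions_per_minute


# Hypothetical wiki with 5,000,000 revisions at the assumed average rate.
minutes = linear_import_minutes(5_000_000)
print(f"{minutes:.0f} minutes (~{minutes / 60 / 24:.1f} days)")
```

The estimate only holds if the average rate stays the same as the database grows, which is exactly the point in dispute in this thread.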



> > Yes, rebuilding the externallinks, links, and langlinks tables
> > will take some additional time and won't scale.
>
> And this is part of the process too, right?

That does not need to be completed prior to the site going live; it can be
done after making it public.

> That part isnt
> > However, I have been working
> > with the toolserver since 2007 and I've lost count of the number of times
> > that the TS has needed to re-import a cluster (s1-s7), and even enwiki
> > can be done in a semi-reasonable timeframe.
>
> Re-importing how?  From the compressed XML full history dumps?


> > The WMF actually compresses all text
> > blobs not just old versions.
>
> Is http://www.mediawiki.org/wiki/Manual:Text_table still accurate?  Is
> WMF using gzip or object?
>
> > A complete download and decompression of simple
> > only took 20 minutes on my 2-year-old consumer-grade laptop with a
> > standard home cable internet connection; the same download on the
> > toolserver (minus decompression) took 88s. Yeah, importing will take a
> > little longer but shouldn't be that big of a deal.
>
> For the full history English Wikipedia it *is* a big deal.
>
> If you think it isn't, stop playing with simple.wikipedia, and tell us
> how long it takes to get a mirror up and running of en.wikipedia.
>
> Do you plan to run compressOld.php?  Are you going to import
> everything in plain text first, and *then* start compressing?  Seems
> like an awful lot of wasted hard drive space.
>

If you set up your server/hardware correctly, it will compress the text
information during insertion into the database. compressOld.php is
actually designed only for cases where you start with an uncompressed
configuration.


> > There will also be some needed cleanup tasks.
> > However, the main issue, archiving and restoring WMF wikis, isn't an issue,
> > and with moderately recent hardware is no big deal. I'm putting my money
> > where my mouth is, and getting actual valid stats and figures. Yes, it may
> > not be an exactly 1:1 ratio when scaling up; however, given the basics of
> > how importing a dump functions, it should remain close to the same ratio.
>
> If you want to put your money where your mouth is, import
> en.wikipedia.  It'll only take 5 days, right?
>

If I actually had a server or the disc space to do it I would, just to
prove your smartass comments as stupid as they actually are. However, given
my current resource limitations (fairly crappy internet connection, older
laptops, and lack of HDD) I tried to select something that could give
reliable benchmarks. If you're willing to foot the bill for the new hardware,
I'll gladly prove my point.

Re: [Wikimedia-l] Fire Drill Re: Wikimedia sites not easy to archive (Was Re: Knol is closing tomorrow )

Mike  Dupont
On Thu, May 17, 2012 at 6:06 AM, John <[hidden email]> wrote:
> If you're willing to foot the bill for the new hardware,
> I'll gladly prove my point.

Given the millions of dollars that Wikipedia has, it should not be a
problem to provide such resources for a good cause like that.

--
James Michael DuPont
Member of Free Libre Open Source Software Kosova http://flossk.org


Re: [Wikimedia-l] Fire Drill Re: Wikimedia sites not easy to archive (Was Re: Knol is closing tomorrow )

Anthony-73
In reply to this post by John Doe-27
On Thu, May 17, 2012 at 2:06 AM, John <[hidden email]> wrote:

> On Thu, May 17, 2012 at 1:52 AM, Anthony <[hidden email]> wrote:
>> On Thu, May 17, 2012 at 1:22 AM, John <[hidden email]> wrote:
>> > Anthony the process is linear, you have a php inserting X number of rows
>> > per
>> > Y time frame.
>>
>> Amazing.  I need to switch all my databases to MySQL.  It can insert X
>> rows per Y time frame, regardless of whether the database is 20
>> gigabytes or 20 terabytes in size, regardless of whether the average
>> row is 3K or 1.5K, regardless of whether I'm using a thumb drive or a
>> RAID array or a cluster of servers, etc.
>
> When referring to X over Y time, it's an average of, say, 1000 revisions
> per 1 minute. Any X over Y period must be considered with averages in mind,
> or getting a count wouldn't be possible.

The *average* en.wikipedia revision is more than twice the size of the
*average* simple.wikipedia revision.  The *average* performance of a
20 gig database is faster than the *average* performance of a 20
terabyte database.  The *average* performance of your laptop's thumb
drive is different from the *average* performance of a(n array of)
drive(s) which can handle 20 terabytes of data.

> If you set up your server/hardware correctly, it will compress the text
> information during insertion into the database

Is this how you set up your simple.wikipedia test?  How long does it
take to import the data if you're using the same compression mechanism as
WMF (which you didn't answer, but I assume is concatenation and
compression)?  How exactly does this work "during insertion" anyway?
Does it intelligently group sets of revisions together to avoid
decompressing and recompressing the same revision several times?  I
suppose it's possible, but that would introduce quite a lot of
complication into the import script, slowing things down dramatically.

What about the answers to my other questions?

>> If you want to put your money where your mouth is, import
>> en.wikipedia.  It'll only take 5 days, right?
>
> If I actually had a server or the disc space to do it I would, just to prove
> your smartass comments as stupid as they actually are. However, given my
> current resource limitations (fairly crappy internet connection, older
> laptops, and lack of HDD) I tried to select something that could give
> reliable benchmarks. If you're willing to foot the bill for the new hardware,
> I'll gladly prove my point.

What you seem to be saying is that you're *not* putting your money
where your mouth is.

Anyway, if you want, I'll make a deal with you.  A neutral third party
rents the hardware at Amazon Web Services (AWS).  We import
simple.wikipedia full history (concatenating and compressing during
import).  We take the ratio of revisions in en.wikipedia to
revisions in simple.wikipedia.  We import en.wikipedia full
history (concatenating and compressing during import).  If the ratio
of time it takes to import en.wikipedia vs simple.wikipedia is greater
than or equal to twice the ratio of revisions, then you reimburse the
third party.  If the ratio of import time is less than twice the ratio
of revisions (you claim it is linear, therefore it'll be the same
ratio), then I reimburse the third party.
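
The settlement rule above could be written down roughly as follows. This is a minimal sketch, assuming the two ratios are taken as en.wikipedia over simple.wikipedia; the function name and all numbers in the example are hypothetical, not measurements.

```python
def who_reimburses(simple_revisions, en_revisions, simple_hours, en_hours):
    """Apply the proposed rule: John pays if the import-time ratio is at
    least twice the revision ratio, otherwise Anthony pays."""
    revision_ratio = en_revisions / simple_revisions
    time_ratio = en_hours / simple_hours
    if time_ratio >= 2 * revision_ratio:
        return "John"
    return "Anthony"


# Hypothetical figures: 100x the revisions, 150x the import time.
print(who_reimburses(1_000, 100_000, 1.0, 150.0))
```

Putting the threshold at twice the revision ratio leaves a wide margin for "close to linear", so a modest slowdown at scale would still count in John's favour.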

Either way, we save the new dump, with the processing already done,
and send it to archive.org (and WMF if they're willing to host it).
So we actually get a useful result out of this.  It's not just for the
purpose of settling an argument.

Either of us can concede defeat at any point, and stop the experiment.
 At that point if the neutral third party wishes to pay to continue
the job, s/he would be responsible for the additional costs.

Shouldn't be too expensive.  If you concede defeat after 5 days, then
your CPU-time costs are $54 (assuming Extra Large High Memory
Instance).  Including 4 terabytes of EBS (which should be enough if
you compress on the fly) for 5 days should be less than $100.
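
The arithmetic behind those figures can be checked with a short sketch. The $54 for 5 days comes from the paragraph above; the hourly rate is only what that figure implies. The storage price per GB-month is an assumption made for illustration, not a quoted AWS price, and the helper names are hypothetical.

```python
HOURS_PER_DAY = 24


def implied_hourly_rate(cpu_cost_usd, days):
    """Hourly instance price implied by a total CPU cost over some days."""
    return cpu_cost_usd / (days * HOURS_PER_DAY)


def storage_cost(terabytes, days, usd_per_gb_month, days_per_month=30):
    """Pro-rated storage cost for a volume held for part of a month."""
    gigabytes = terabytes * 1024
    return gigabytes * usd_per_gb_month * days / days_per_month


ASSUMED_EBS_USD_PER_GB_MONTH = 0.10  # assumption, not taken from the post

rate = implied_hourly_rate(54, 5)
ebs = storage_cost(4, 5, ASSUMED_EBS_USD_PER_GB_MONTH)
print(f"implied instance rate: ${rate:.2f}/hour")
print(f"4 TB for 5 days at the assumed rate: ${ebs:.2f}")
```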

I'm tempted to do it even if you don't take the bet.


Re: [Wikimedia-l] Fire Drill Re: Wikimedia sites not easy to archive (Was Re: Knol is closing tomorrow )

J Alexandr Ledbury-Romanov
I'd like to point out that the increasingly technical nature of this
conversation probably belongs either on wikitech-l, or off-list, and that
the strident nature of the comments is fast approaching inappropriate.

Alex
Wikimedia-l list administrator



Re: [Wikimedia-l] Fire Drill Re: Wikimedia sites not easy to archive (Was Re: Knol is closing tomorrow )

Anthony-73
On Thu, May 17, 2012 at 7:27 AM, J Alexandr Ledbury-Romanov
<[hidden email]> wrote:
> I'd like to point out that the increasingly technical nature of this
> conversation probably belongs either on wikitech-l, or off-list, and that
> the strident nature of the comments is fast approaching inappropriate.

Really?  I think we're really getting somewhere.

In fact, I think someone at WMF should contact Amazon and see if
they'll let us conduct the experiment for free, in exchange for us
creating the dump for them to host as a public data set
(http://aws.amazon.com/publicdatasets/).

In case you got lost in the technical details, the original post was
asking "Has anyone recently set up a full-external-duplicate of (for
instance) en.wp?" and suggesting that we should do this on a yearly
basis as a fire drill.

My latest post was a concrete proposal for doing exactly that.


Re: [Wikimedia-l] Fire Drill Re: Wikimedia sites not easy to archive (Was Re: Knol is closing tomorrow )

Anthony-73
Please have someone at WMF coordinate this so that there aren't
multiple requests made.  In my opinion, it should preferably be made
by a WMF employee.

Fill out the form at
https://aws-portal.amazon.com/gp/aws/html-forms-controller/aws-dataset-inquiry

Tell them you want to create a public data set which is a snapshot of
the English Wikipedia.  We can coordinate any questions, and any
implementation details, on a separate list.



Re: [Wikimedia-l] Wikimedia sites not easy to archive (Was Re: Knol is closing tomorrow )

emijrp
In reply to this post by Kim Bruning
The only issues for preservation/forking of Wikimedia projects by third
parties are the missing image dumps (which have been in progress for a few
days now, thanks Ariel) and the usernames/passwords table (not a big
problem in an apocalyptic scenario, where articles and images have top
priority).

We at WikiTeam are uploading wiki dumps to Internet Archive, and recently
some official mirrors of Wikimedia dumps (articles + images) are being
created around the globe (currently in 3 different locations).

I think we have taken great steps over the last year.

2012/5/17 Kim Bruning <[hidden email]>

> >
> > We at Archive Team are attempting to download all the 700,000 Knols.[3]
> For
> > the sake of history. Join us, #archiveteam EFNET.
> >
>
> I did some followup. I'm not sure I can help out with Knol
> anymore, but I discovered that AT is having some trouble
> making good archives of wikimedia sites.
>
> Theoretically, wikipedia et al SHOULD be easy to
> reconstitute, right?  That's why we're using CC licenses
> and all. Else if we drop the ball, WP will be gone.
> This seems like a priority to me!
>
> The main problem seems to be obtaining commons images:
> http://archiveteam.org/index.php?title=Wikiteam
>
> So at the very least, we don't appear to have very good
> documentation. Who could best help Archive Team out? Has
> anyone done/written documentation on completely restoring 1
> or more wikimedia wikis from 'public backup' [1]?
>
> What can we do to help them?
>
> sincerely,
>        Kim Bruning
>
> [1] "Real Men don't make backups. They upload it via ftp and
> let the world mirror it." - Linus Torvalds
>



--
Emilio J. Rodríguez-Posada. E-mail: emijrp AT gmail DOT com
Pre-doctoral student at the University of Cádiz (Spain)
Projects: AVBOT <http://code.google.com/p/avbot/> |
StatMediaWiki <http://statmediawiki.forja.rediris.es> |
WikiEvidens <http://code.google.com/p/wikievidens/> |
WikiPapers <http://wikipapers.referata.com> |
WikiTeam <http://code.google.com/p/wikiteam/>
Personal website: https://sites.google.com/site/emijrp/

Re: [Wikimedia-l] Fire Drill Re: Wikimedia sites not easy to archive (Was Re: Knol is closing tomorrow )

Thomas Dalton
In reply to this post by Anthony-73
On 17 May 2012 12:43, Anthony <[hidden email]> wrote:
> In fact, I think someone at WMF should contact Amazon and see if
> they'll let us conduct the experiment for free, in exchange for us
> creating the dump for them to host as a public data set
> (http://aws.amazon.com/publicdatasets/).

What dump are you going to create? You are starting from a dump, why
can't Amazon just host that?


Re: [Wikimedia-l] Fire Drill Re: Wikimedia sites not easy to archive (Was Re: Knol is closing tomorrow )

Anthony-73
On Thu, May 17, 2012 at 8:11 AM, Thomas Dalton <[hidden email]> wrote:
> On 17 May 2012 12:43, Anthony <[hidden email]> wrote:
>> In fact, I think someone at WMF should contact Amazon and see if
>> they'll let us conduct the experiment for free, in exchange for us
>> creating the dump for them to host as a public data set
>> (http://aws.amazon.com/publicdatasets/).
>
> What dump are you going to create? You are starting from a dump, why
> can't Amazon just host that?

Because the XML dump is semi-useless - it's compressed in all the
wrong places to use for an actual running system.

Anyway, looking at how the AWS Public Data Sets work, it probably
would be best not to even create a dump, but just put up the running
(object compressed) database.


Re: [Wikimedia-l] Wikimedia sites not easy to archive (Was Re: Knol is closing tomorrow )

emijrp
In reply to this post by emijrp
They are XML dumps. Why did you say they are semi-useless?

I'm not sure if all the MediaWiki revision table parameters are available
in the XML dumps, but most of them are.

2012/5/17 Anthony <[hidden email]>

> On Thu, May 17, 2012 at 8:01 AM, emijrp <[hidden email]> wrote:
> > We at WikiTeam are uploading wiki dumps to Internet Archive, and recently
> > some official mirrors of Wikimedia dumps (articles + images) are being
> > created around the globe (currently in 3 different locations).
>
> Are these actual database dumps, or are they those semi-useless XML dumps?
>




Re: [Wikimedia-l] Wikimedia sites not easy to archive (Was Re: Knol is closing tomorrow )

Tom Morris-5
In reply to this post by emijrp
WARNING: The following post is a work of technical fantasy rather than practical reality.

On the usernames and passwords thing, if we imagine our doomsday scenario (meteor hits the WMF data centre, the Foundation turn into evil psychopathic Nazis, whatever), one thing that might be useful and archive-oriented developers might want to consider would be some way of 'namespacing' usernames. That way, we could have it so a fork/new version could specify that, say, all the usernames on all the existing content are usernames on en.wikipedia.org, and distinguish those from the usernames on post-apocalyptic Wikipedia. That way we can keep the attribution chain to the old usernames without the issue of identity theft.

It'd also be a good step towards attribution in distributed wikis. This might be for something like a future attempt at Citizendium (or perhaps someone wants to make a version of Wikipedia with pending changes or the image filter or one of the other many things the community cannot agree on).

In addition, it would be useful to be able to distinguish usernames on sites that reuse Commons images (if I upload an image to Commons with the username 'Tom Morris' and then some non-WMF wiki reuses it, it may be attributing it to the local user 'Tom Morris' rather than the Commons user).

Finally, it'd be potentially useful for wikis which use some Wikipedia content combined with some local content. For instance, I know wikiqueer.org uses Wikipedia content with attribution, and combines the encyclopaedic content of Wikipedia with non-encyclopedic community content that wouldn't meet up with Wikipedia's mission or NPOV (they have the supposedly very controversial POV that LGBT people deserve equal rights).

In all these cases, as well as our potential doomsday scenario, being able to clearly distinguish between local usernames and usernames on other wikis might be quite useful. The inner semantic web dork suggests that perhaps we could consider using something like a uniform resource identifier (URI) to identify users. ;-)
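
To make the namespacing idea a bit more concrete, here is a minimal sketch of turning a (wiki, username) pair into a URI. It assumes user pages live at `/wiki/User:<name>` with spaces written as underscores, as on Wikimedia wikis; the function name and the `wiki.example.org` host are hypothetical.

```python
from urllib.parse import quote


def user_uri(wiki_host, username):
    """Build a URI that identifies a username on one specific wiki."""
    # Assumed convention: spaces become underscores in user page titles.
    page_name = quote(username.replace(" ", "_"))
    return f"https://{wiki_host}/wiki/User:{page_name}"


commons_tom = user_uri("commons.wikimedia.org", "Tom Morris")
local_tom = user_uri("wiki.example.org", "Tom Morris")

# Same username, different wikis, so different identities.
print(commons_tom)
print(local_tom)
print(commons_tom == local_tom)
```

With identifiers like these, an imported revision can keep its original attribution while a local account of the same name stays distinct.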

We could also consider the possibility of allowing users to use OpenID or OAuth or whatever the web identity mechanism du jour is to allow loose affiliation of usernames between MediaWiki installs. That way you can establish the link between identities across wikis (of course, if you don't want to, you don't have to).

--
Tom Morris
<http://tommorris.org/>




Re: [Wikimedia-l] Wikimedia sites not easy to archive (Was Re: Knol is closing tomorrow )

Anthony-73
In reply to this post by emijrp
On Thu, May 17, 2012 at 8:22 AM, emijrp <[hidden email]> wrote:
> They are XML dumps. Why did you say they are semi-useless?

Because they are XML dumps, mainly.  The data in the WMF database is
compressed in a format which can be easily randomly accessed.  The
dump procedure is to uncompress it, convert it to XML, and then
recompress it, in a format which can't be easily randomly accessed.
The import procedure is to uncompress the "dump", convert it from XML,
and then recompress it in a format which is easily randomly accessed.

There are some hacks to get around this with the bz2 version of the
"dump", but this is far less efficient than the format which the data
already is in before the "dump" process takes place.
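
To illustrate the difference between the two kinds of compression, here is a minimal sketch using zlib. It is not MediaWiki's storage format or the actual dump format, just a toy: one layout compresses each revision on its own and keeps an offset index, the other compresses everything as a single stream. All names are hypothetical.

```python
import zlib


def build_store(revisions):
    """Compress each revision separately and record (offset, length)."""
    blob = bytearray()
    index = []
    for text in revisions:
        chunk = zlib.compress(text.encode("utf-8"))
        index.append((len(blob), len(chunk)))
        blob.extend(chunk)
    return bytes(blob), index


def read_revision(blob, index, i):
    """Random access: decompress only the bytes of revision i."""
    offset, length = index[i]
    return zlib.decompress(blob[offset:offset + length]).decode("utf-8")


def read_from_single_stream(stream, i):
    """No random access: the whole stream is decompressed to reach item i."""
    return zlib.decompress(stream).decode("utf-8").split("\x00")[i]


revisions = ["first draft", "second draft with more text", "third"]
blob, index = build_store(revisions)
stream = zlib.compress("\x00".join(revisions).encode("utf-8"))

print(read_revision(blob, index, 1))
print(read_from_single_stream(stream, 1))
```

Both calls return the same text, but the second one has to decompress everything that comes before (and after) the revision it wants, which is the cost being described for a running system.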

> I'm not sure if all the MediaWiki revision table parameters are available in
> the XML dumps, but most of them are.

The main problem is that they are compressed in a format which is
terrible for actual use.  The missing information (mostly indexes)
is a secondary problem.


Re: [Wikimedia-l] Wikimedia sites not easy to archive (Was Re: Knol is closing tomorrow )

Anthony-73
In reply to this post by Tom Morris-5
On Thu, May 17, 2012 at 8:31 AM, Tom Morris <[hidden email]> wrote:
> We could also consider the possibility of allowing users to use OpenID or OAuth or whatever the web identity mechanism du jour is to allow loose affiliation of usernames between MediaWiki installs. That way you can establish the link between identities across wikis (of course, if you don't want to, you don't have to).

Also, there's http://en.wikipedia.org/wiki/Template:User_committed_identity

But most people don't seem to care about these things.


Re: [Wikimedia-l] Wikimedia sites not easy to archive (Was Re: Knol is closing tomorrow )

Tom Morris-5
On Thursday, 17 May 2012 at 13:34, Anthony wrote:
> On Thu, May 17, 2012 at 8:31 AM, Tom Morris <[hidden email]> wrote:
> > We could also consider the possibility of allowing users to use OpenID or OAuth or whatever the web identity mechanism du jour is to allow loose affiliation of usernames between MediaWiki installs. That way you can establish the link between identities across wikis (of course, if you don't want to, you don't have to).
>
>
> Also, there's http://en.wikipedia.org/wiki/Template:User_committed_identity
>
> But most people don't seem to care about these things.

Sure, the use cases of Committed Identities are slightly different.

--
Tom Morris
<http://tommorris.org/>




Re: [Wikimedia-l] Wikimedia sites not easy to archive (Was Re: Knol is closing tomorrow )

Thomas Dalton
In reply to this post by Anthony-73
On 17 May 2012 13:32, Anthony <[hidden email]> wrote:
> Because they are XML dumps, mainly.  The data in the WMF database is
> compressed in a format which can be easily randomly accessed.

It's a dump. It's not supposed to be randomly accessed. We're talking
about archives, not mirrors.


Re: [Wikimedia-l] Wikimedia sites not easy to archive (Was Re: Knol is closing tomorrow )

Anthony-73
On Thu, May 17, 2012 at 8:38 AM, Thomas Dalton <[hidden email]> wrote:
> On 17 May 2012 13:32, Anthony <[hidden email]> wrote:
>> Because they are XML dumps, mainly.  The data in the WMF database is
>> compressed in a format which can be easily randomly accessed.
>
> It's a dump.

Not really.  Yes, it's called that.  And historically, it was that,
but the XML "dumps" aren't really dumps at all.

> It's not supposed to be randomly accessed. We're talking
> about archives, not mirrors.

That's why I said they're semi-useless (i.e. half-useless), not useless.


Re: [Wikimedia-l] Fire Drill Re: Wikimedia sites not easy to archive (Was Re: Knol is closing tomorrow )

Kim Bruning
In reply to this post by Anthony-73
On Thu, May 17, 2012 at 07:43:09AM -0400, Anthony wrote:
>
> In fact, I think someone at WMF should contact Amazon and see if
> they'll let us conduct the experiment for free, in exchange for us
> creating the dump for them to host as a public data set
> (http://aws.amazon.com/publicdatasets/).


That sounds like an excellent plan. At the same time, it might be useful to get Archive Team
involved.

* They have warm bodies. (always useful, one can never have enough volunteers ;)
* They have experience with very large datasets
* They'd be very happy to help (it's their mission)
* Some of them may be able to provide Sufficient Storage(tm) and server capacity. Saves us the Amazon AWS bill.
* We might set a precedent where others might provide their data to AT directly too.

AT's mission dovetails nicely with ours. We provide the sum of all human knowledge to people.
AT ensures that the sum of all human knowledge is not subtracted from.


sincerely,
        Kim Bruning


Re: [Wikimedia-l] Fire Drill Re: Wikimedia sites not easy to archive (Was Re: Knol is closing tomorrow )

Neil Harris
In reply to this post by Anthony-73
On 17/05/12 12:49, Anthony wrote:

> Please have someone at WMF coordinate this so that there aren't
> multiple requests made.  In my opinion, it should preferably be made
> by a WMF employee.
>
> Fill out the form at
> https://aws-portal.amazon.com/gp/aws/html-forms-controller/aws-dataset-inquiry
>
> Tell them you want to create a public data set which is a snapshot of
> the English Wikipedia.  We can coordinate any questions, and any
> implementation details, on a separate list.
>

That's a fantastic idea, and would give en: Wikipedia yet another public
replica for very little effort. I would imagine that if they are willing
to host enwiki, they may also be willing to host most, or all, of the
other projects.

It will also mean that running Wikipedia data-munching experiments on
EC2 will become much easier.

Neil



Re: [Wikimedia-l] Fire Drill Re: Wikimedia sites not easy to archive (Was Re: Knol is closing tomorrow )

Mike Dupont
In reply to this post by Kim Bruning
Hello People,
I have finished uploading my first set, the osm/fosm dataset (350 GB
unpacked), to archive.org:
http://osmopenlayers.blogspot.de/2012/05/upload-finished.html

We can do something similar with Wikipedia. The bucket size of
archive.org is 10 GB, so we need to split up the data in a way that is
useful. I have done this by putting each object on one line; each
file contains its full data records plus the parts that belong to the
previous block and the next block, so you are able to process the blocks
almost stand-alone.
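
To make the splitting idea above concrete, here is a rough sketch in Python. It is not the script used for the osm/fosm upload; the function name `split_with_overlap`, the `overlap` parameter, and the reading of "the parts that belong to the previous block and next block" as a fixed number of neighbouring records are all assumptions made for illustration.

```python
from typing import Iterator, List


def split_with_overlap(
    records: List[str], max_bytes: int, overlap: int = 1
) -> Iterator[List[str]]:
    """Split one-record-per-line data into blocks padded with neighbours.

    Hypothetical helper: the core of each block stays under max_bytes
    (unless a single record is larger), and `overlap` records from the
    previous and next block are copied in so a block can be processed
    almost on its own. The padding means a block may exceed max_bytes.
    """
    # First pass: find the [start, end) index range of each core block.
    cores = []
    start, size = 0, 0
    for i, rec in enumerate(records):
        rec_size = len(rec.encode("utf-8")) + 1  # +1 for the newline
        if i > start and size + rec_size > max_bytes:
            cores.append((start, i))
            start, size = i, 0
        size += rec_size
    if start < len(records):
        cores.append((start, len(records)))

    # Second pass: widen each core by `overlap` records on both sides.
    for lo, hi in cores:
        yield records[max(0, lo - overlap):min(len(records), hi + overlap)]


if __name__ == "__main__":
    demo = ["a", "b", "c", "d", "e", "f"]
    for block in split_with_overlap(demo, max_bytes=4, overlap=1):
        print(block)
```

For a real upload, max_bytes would be set somewhat below the 10 GB bucket size to leave room for the overlapping records, and the records would be streamed from disk rather than held in a list.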

mike
