34 TB Wikimedia Commons files on archive.org: you can help

34 TB Wikimedia Commons files on archive.org: you can help

Federico Leva (Nemo)
WikiTeam[1] has released an update of the chronological archive of all
Wikimedia Commons files, up to 2013. Now at ~34 TB total.
<https://archive.org/details/wikimediacommons>
        I wrote to – I think – all the mirrors in the world, but apparently
nobody is interested in such a mass of media apart from the Internet
Archive (and mirrorservice.org, which took the Kiwix files).
        The solution is simple: take a small bite and preserve a copy yourself.
Downloading one slice takes a single click, from your browser to your
torrent client, and typically 20-40 GB on your disk (the biggest slice
is 1400 GB, the smallest 216 MB).
<https://en.wikipedia.org/wiki/User:Emijrp/Wikipedia_Archive#Image_tarballs>
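
For anyone who'd rather script it than click, a rough (untested) Python
sketch along these lines could list the collection and pull down a single
slice's .torrent file. The collection identifier comes from the archive.org
URL above; everything else, including the use of the third-party
internetarchive package (pip install internetarchive), is just an
illustration:

    # Illustrative sketch only: enumerate the archive.org collection and fetch
    # the .torrent for one slice, to be handed to whatever torrent client you use.
    from internetarchive import search_items, download

    COLLECTION = 'wikimediacommons'  # from https://archive.org/details/wikimediacommons

    # Each search result is a dict carrying at least an 'identifier' key.
    identifiers = [r['identifier'] for r in search_items('collection:' + COLLECTION)]
    print('%d items in the collection' % len(identifiers))

    # Download only the .torrent file of the first slice; pick whichever slice
    # you can actually host -- sizes vary from a few hundred MB to over a TB.
    download(identifiers[0], glob_pattern='*.torrent', destdir='.', verbose=True)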

Nemo

P.s.: Please help spread the word everywhere.

[1] https://github.com/WikiTeam/wikiteam


Re: 34 TB Wikimedia Commons files on archive.org: you can help

Antoine Musso-3
On 01/08/2014 16:42, Federico Leva (Nemo) wrote:

> WikiTeam[1] has released an update of the chronological archive of all
> Wikimedia Commons files, up to 2013. Now at ~34 TB total.
> <https://archive.org/details/wikimediacommons>
> I wrote to – I think – all the mirrors in the world, but apparently
> nobody is interested in such a mass of media apart from the Internet
> Archive (and the mirrorservice.org which took Kiwix).
> The solution is simple: take a small bite and preserve a copy yourself.
> One slice only takes one click, from your browser to your torrent
> client, and typically 20-40 GB on your disk (biggest slice 1400 GB,
> smallest 216 MB).
> <https://en.wikipedia.org/wiki/User:Emijrp/Wikipedia_Archive#Image_tarballs>

Hello,

Have you thought about contacting companies with massive storage, such
as Dropbox? Maybe they would be happy to share a few TB :-]


--
Antoine "hashar" Musso



Re: 34 TB Wikimedia Commons files on archive.org: you can help

MZMcBride-2
Antoine Musso wrote:

>On 01/08/2014 16:42, Federico Leva (Nemo) wrote:
>> WikiTeam[1] has released an update of the chronological archive of all
>> Wikimedia Commons files, up to 2013. Now at ~34 TB total.
>> <https://archive.org/details/wikimediacommons>
>> I wrote to – I think – all the mirrors in the world, but apparently
>> nobody is interested in such a mass of media apart from the Internet
>> Archive (and the mirrorservice.org which took Kiwix).
>> The solution is simple: take a small bite and preserve a copy yourself.
>> One slice only takes one click, from your browser to your torrent
>> client, and typically 20-40 GB on your disk (biggest slice 1400 GB,
>> smallest 216 MB).
>>
>> <https://en.wikipedia.org/wiki/User:Emijrp/Wikipedia_Archive#Image_tarballs>
>
>Hello,
>
>Have you thought about contacting companies having massive storage such
>as Dropbox ?  Maybe they will be happy to share a few TB :-]

I believe Amazon has donated space at some point, but I don't know (m)any
details. I very briefly searched around and found
<https://wikitech.wikimedia.org/wiki/Amazon_Public_Data_Sets>.

MZMcBride




Re: 34 TB Wikimedia Commons files on archive.org: you can help

Federico Leva (Nemo)
Yes, "all the mirrors in the world" included Amazon. No reply from them
either and I'm not going to write companies who don't have a mirroring
program/an explicit interest in the offer. It's appreciated if others
do, though.

Nemo


Re: 34 TB Wikimedia Commons files on archive.org: you can help

MZMcBride-2
Federico Leva (Nemo) wrote:
>Yes, "all the mirrors in the world" included Amazon. No reply from them
>either and I'm not going to write companies who don't have a mirroring
>program/an explicit interest in the offer. It's appreciated if others
>do, though.

Yeah, 34 TB is still a lot of data, unfortunately. I think most people
reading this list recognize and appreciate this. (I actually drafted an
e-mail just a few weeks ago about Dispenser requesting 24 TB....)

I'd personally like to see a price breakdown for this project. Doing a bit
of quick research, it sounds like storage alone would probably cost maybe
$4,000 USD, but it depends on whether you're buying many individual 2 TB
drives or fewer larger drives. More than this, though, are the ongoing
and recurring costs, assuming you want to keep this data online. Is having
this (backup) data available online an explicit goal here? Or is the
primary goal simply to have an offline backup of this data?
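
As a very rough illustration of how such a breakdown might start, here is a
back-of-the-envelope sketch in Python; the drive capacities, prices and
replica count are placeholder assumptions for illustration, not the figures
mentioned above:

    # Placeholder arithmetic only: capacities, prices and the replica count are
    # assumptions, not quotes -- swap in real numbers before drawing conclusions.
    import math

    DATASET_TB = 34
    REPLICAS = 2  # keep at least two copies of every slice

    def drives_and_cost(capacity_tb, price_usd):
        count = math.ceil(DATASET_TB * REPLICAS / float(capacity_tb))
        return count, count * price_usd

    for capacity_tb, price_usd in [(2, 90), (4, 160)]:  # assumed per-drive prices
        count, total = drives_and_cost(capacity_tb, price_usd)
        print('%d x %d TB drives: about $%d (drives only; no chassis, power or '
              'bandwidth)' % (count, capacity_tb, total))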

In either case (online or offline), a price breakdown would help nearly
any volunteer organization (such as a Wikimedia chapter) decide whether
to help in this effort. Crowd-sourcing the funding for this project is
also a possibility, either via individual donations (Kickstarter, perhaps)
or via small grants from various Internet-related or free content-related
organizations (EFF, Mozilla, Wikimedia, et al.).

Soliciting money for this project requires a much clearer, more detailed
plan. The current shoestring strategy of everyone downloading a piece of
the 34 TB is certainly romantic, but it also seems impractical and silly.

MZMcBride




Re: 34 TB Wikimedia Commons files on archive.org: you can help

Federico Leva (Nemo)
Thanks MZ for your suggestion to ask for money. I'm not interested. For
those who are:
<https://meta.wikimedia.org/wiki/Grants:IdeaLab/Commons_tarballs_seedbox>

Nemo


Re: 34 TB Wikimedia Commons files on archive.org: you can help

Pine W
When we get Wikimedia Cascadia (name decision still pending) approved and
have our legal paperwork in order, we could potentially host a Commons or
other Wikimedia backup. I think this would be doable if we can work out
the legal issues and WMF approves a GAC request for some cheap storage.

Pine
On Aug 2, 2014 9:51 AM, "Federico Leva (Nemo)" <[hidden email]> wrote:

> Thanks MZ for your suggestion to ask money. I'm not interested. For
> those who are:
> <https://meta.wikimedia.org/wiki/Grants:IdeaLab/Commons_tarballs_seedbox>
>
> Nemo
>

Re: 34 TB Wikimedia Commons files on archive.org: you can help

Jeremy Baron
On Aug 2, 2014 8:17 PM, "Pine W" <[hidden email]> wrote:
> I think this would be doable if we can work out the
> legal issues and WMF approves a GAC request for some cheap storage.

What legal issues do you envision?

-Jeremy

Re: 34 TB Wikimedia Commons files on archive.org: you can help

Pine W
No offense, but I would prefer that Cascadia discuss potential legal issues
privately with WMF before we start speculating in public.

There is probably a way to make this successful in the end.

Pine
On Aug 2, 2014 5:19 PM, "Jeremy Baron" <[hidden email]> wrote:

> On Aug 2, 2014 8:17 PM, "Pine W" <[hidden email]> wrote:
> > I think this would be doable if we can work out the
> > legal issues and WMF approves a GAC request for some cheap storage.
>
> What legal issues do you envision?
>
> -Jeremy

Re: 34 TB Wikimedia Commons files on archive.org: you can help

OQ
In reply to this post by Federico Leva (Nemo)
S3 prices have dropped since that page was last modified, so the $2,200 a
month quoted there is now about $1,200 a month. If the data doesn't need
to be retrieved except in rare cases, putting it on Glacier would drop
that to roughly $350-400 a month.
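
For what it's worth, those figures roughly check out with a back-of-the-envelope
calculation; the per-GB rates below are approximate 2014 list prices and
should be treated as assumptions:

    # Rough sanity check of the monthly figures; per-GB rates are approximations.
    DATASET_GB = 34 * 1024               # ~34 TB expressed in GB

    S3_STANDARD_PER_GB_MONTH = 0.03      # USD, approx. 2014 list price
    GLACIER_PER_GB_MONTH = 0.01          # USD, approx. 2014 list price

    print('S3 standard: about $%d per month' % (DATASET_GB * S3_STANDARD_PER_GB_MONTH))
    print('Glacier:     about $%d per month' % (DATASET_GB * GLACIER_PER_GB_MONTH))
    # Glacier retrievals cost extra, so it only suits a copy that is rarely read.
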
On Aug 2, 2014 12:51 PM, "Federico Leva (Nemo)" <[hidden email]> wrote:

> Thanks MZ for your suggestion to ask money. I'm not interested. For
> those who are:
> <https://meta.wikimedia.org/wiki/Grants:IdeaLab/Commons_tarballs_seedbox>
>
> Nemo
>

Re: 34 TB Wikimedia Commons files on archive.org: you can help

Jeremy Baron
In reply to this post by Pine W
On Aug 2, 2014 8:25 PM, "Pine W" <[hidden email]> wrote:
> No offense, I would prefer that Cascadia discuss potential legal issues
> privately with WMF before we start speculating in public.
>
> There is probably a way to make this successful in the end.

I find that rather confusing.

As the legal team's email footers say, they are not your lawyer (and not
your chapter's lawyer either).

I can't think of any legal issues you'd encounter besides the ones WMF
already deals with. (copyright/trademark/defamation/trade secrets/national
security/CDA 230/DMCA/etc.) If you want legal advice on any of those issues
then you need to consult counsel outside WMF.

Or maybe there's a concern I haven't imagined yet.

-Jeremy

Re: 34 TB Wikimedia Commons files on archive.org: you can help

Pine W
Yes, those plus a few others. Yes, WMF counsel can't act as Cascadia's
counsel, but I would want to see what terms WMF might offer, like
indemnifying Cascadia for any issues relating to archiving the Commons
content.

Pine
 On Aug 2, 2014 5:32 PM, "Jeremy Baron" <[hidden email]> wrote:

> On Aug 2, 2014 8:25 PM, "Pine W" <[hidden email]> wrote:
> > No offense, I would prefer that Cascadia discuss potential legal issues
> > privately with WMF before we start speculating in public.
> >
> > There is probably a way to make this successful in the end.
>
> I find that rather confusing.
>
> As the legal team's email footers say, they are not your lawyer (and not
> your chapter's lawyer either).
>
> I can't think of any legal issues you'd encounter besides the ones WMF
> already deals with. (copyright/trademark/defamation/trade secrets/national
> security/CDA 230/DMCA/etc.) If you want legal advice on any of those issues
> then you need to consult counsel outside WMF.
>
> Or maybe there's a concern I haven't imagined yet.
>
> -Jeremy

Re: [Commons-l] 34 TB Wikimedia Commons files on archive.org: you can help

Daniel Mietchen
In reply to this post by Federico Leva (Nemo)
The issue of mirroring Wikimedia content has been discussed with a
number of scholarly institutions engaged in data-rich research, and
the response was generally of the "send us the specs, and we will see
what we can do" kind.

I would be interested in giving this another go if someone could
provide me with those specs, preferably for Wikimedia projects as a
whole as well as broken down by individual projects or languages or
timestamps etc.

WikiTeam's Commons archive would make for a good test dataset.
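
If it helps with the specs, a small (untested) Python sketch like the
following could total up the archive's size directly from archive.org's
public metadata endpoint; it assumes only the requests and internetarchive
packages, nothing Wikimedia-specific:

    # Sketch: walk the archive.org collection and sum the per-file sizes, as a
    # starting point for the "send us the specs" conversations mentioned above.
    import requests
    from internetarchive import search_items

    total_bytes = 0
    for result in search_items('collection:wikimediacommons'):
        url = 'https://archive.org/metadata/' + result['identifier']
        meta = requests.get(url).json()
        # File entries carry their size in bytes as strings; a few omit it.
        total_bytes += sum(int(f['size']) for f in meta.get('files', [])
                           if 'size' in f)

    print('Total: about %.1f TB' % (total_bytes / 1024.0 ** 4))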

Daniel

--
http://www.naturkundemuseum-berlin.de/en/institution/mitarbeiter/mietchen-daniel/
https://en.wikipedia.org/wiki/User:Daniel_Mietchen/Publications
http://okfn.org
http://wikimedia.org


On Fri, Aug 1, 2014 at 4:42 PM, Federico Leva (Nemo) <[hidden email]> wrote:

> WikiTeam[1] has released an update of the chronological archive of all
> Wikimedia Commons files, up to 2013. Now at ~34 TB total.
> <https://archive.org/details/wikimediacommons>
>         I wrote to – I think – all the mirrors in the world, but apparently
> nobody is interested in such a mass of media apart from the Internet
> Archive (and the mirrorservice.org which took Kiwix).
>         The solution is simple: take a small bite and preserve a copy yourself.
> One slice only takes one click, from your browser to your torrent
> client, and typically 20-40 GB on your disk (biggest slice 1400 GB,
> smallest 216 MB).
> <https://en.wikipedia.org/wiki/User:Emijrp/Wikipedia_Archive#Image_tarballs>
>
> Nemo
>
> P.s.: Please help spread the word everywhere.
>
> [1] https://github.com/WikiTeam/wikiteam
>


Re: [Commons-l] 34 TB Wikimedia Commons files on archive.org: you can help

Federico Leva (Nemo)
Daniel Mietchen, 03/08/2014 03:57:

> The issue of mirroring Wikimedia content has been discussed with a
> number of scholarly institutions engaged in data-rich research, and
> the response was generally of the "send us the specs, and we will see
> what we can do" kind.
>
> I would be interested in giving this another go if someone could
> provide me with those specs, preferably for Wikimedia projects as a
> whole as well as broken down by individual projects or languages or
> timestamps etc.
>
> The WikiTeam's Commons archive would make for a good test dataset.

Ariel keeps
https://meta.wikimedia.org/wiki/Mirroring_Wikimedia_project_XML_dumps#Requirements
up to date. Anything else needed?

Nemo


Re: [Commons-l] 34 TB Wikimedia Commons files on archive.org: you can help

Daniel Mietchen
That seems to be sufficient to get things rolling. I will give it a try. Thanks!
--
http://www.naturkundemuseum-berlin.de/en/institution/mitarbeiter/mietchen-daniel/
https://en.wikipedia.org/wiki/User:Daniel_Mietchen/Publications
http://okfn.org
http://wikimedia.org


On Sun, Aug 3, 2014 at 9:18 AM, Federico Leva (Nemo) <[hidden email]> wrote:

> Daniel Mietchen, 03/08/2014 03:57:
>> The issue of mirroring Wikimedia content has been discussed with a
>> number of scholarly institutions engaged in data-rich research, and
>> the response was generally of the "send us the specs, and we will see
>> what we can do" kind.
>>
>> I would be interested in giving this another go if someone could
>> provide me with those specs, preferably for Wikimedia projects as a
>> whole as well as broken down by individual projects or languages or
>> timestamps etc.
>>
>> The WikiTeam's Commons archive would make for a good test dataset.
>
> Ariel keeps
> https://meta.wikimedia.org/wiki/Mirroring_Wikimedia_project_XML_dumps#Requirements
> up to date. Anything else needed?
>
> Nemo
>
