images torrent

images torrent

Yousef Ourabi-3
Google dug this up: http://www.nabble.com/BitTorrent-Downloads-Posted-for-enwiki-20070402-images-TAR-Archives-t3581071.html

When I fire up rTorrent, it says "can't resolve host" -- is there an updated version of this floating around? Did anyone ever successfully download it?

Thanks!

Re: images torrent

jmerkey-3
> Google dug this up:
> http://www.nabble.com/BitTorrent-Downloads-Posted-for-enwiki-20070402-images-TAR-Archives-t3581071.html
>
> When I fire up rTorrent, it says "can't resolve host" -- is there an
> updated version of this floating around? Did anyone ever successfully
> download it?

Download:

http://meta.wikimedia.org/wiki/Wikix

You can get the images faster with wikix.  You will need to download the
latest XML dump first.  If you post the images anywhere, you will also
need the image tags from the dump, since some of them are fair use.
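For instance, grabbing the latest English dump might look like this (a
minimal sketch -- the URL pattern and file name are assumptions, so
check the dump site for the real paths):

$ # Hypothetical dump URL -- verify against download.wikimedia.org.
$ wget http://download.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
$ bunzip2 enwiki-latest-pages-articles.xml.bz2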

Jeff



Re: images torrent

Yousef Ourabi-3
Jeff -- thanks, that is exactly the sort of thing I'm looking for.

What I don't understand is how this fits in with the "no-crawler" policy?

Also, I appreciate you sharing the tool with the community.

Thanks,
Yousef


Re: images torrent

jmerkey-3
> What I don't understand is how this fits in with the "no-crawler" policy?


The number of people with 2.5 TB lying around to host Wikipedia's images,
plus workspace for thumbnails and rendering, does not seem to be that
large.  Wikix is not very intrusive in any event.

You'd better have a lot of space.  As of the 20071018 dump, the total
for all images is:

[root@gadugi ~]# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/sdb3             473G  9.1G  440G   3% /
/dev/sdb1             388M   14M  354M   4% /boot
tmpfs                 4.0G     0  4.0G   0% /dev/shm
/dev/sda1             1.1T  330G  716G  32% /wikidump
/dev/hda3             112G   36G   71G  34% /w
/dev/sdb4             616G  406G  179G  70% /image
[root@gadugi ~]#


406 GB is the total image payload for the English Wikipedia.

Jeff

Re: images torrent

Anthony-73
In reply to this post by Yousef Ourabi-3
On 10/26/07, Yousef Ourabi <[hidden email]> wrote:
> What I don't understand is how this fits in with the "no-crawler" policy?
>
What "no-crawler" policy?


Re: images torrent

RLS-2
Anthony wrote:
> On 10/26/07, Yousef Ourabi <[hidden email]> wrote:
>> What I don't understand is how this fits in with the "no-crawler" policy?
>>
> What "no-crawler" policy?

[[Wikipedia:Database download#Please do not use a web crawler]]

--Darkwind


Re: images torrent

jmerkey-3
> Anthony wrote:
>> What "no-crawler" policy?
>
> [[Wikipedia:Database download#Please do not use a web crawler]]
>
> --Darkwind

Wikix downloads images in a non-intrusive manner; in fact, it's no more
intrusive than an average workstation browsing the site.  This was by
design: I intentionally made the tool slow in order to avoid any impact
on the site.
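
To illustrate the idea (this is not wikix's actual code -- just a sketch
of a polite, rate-limited fetch loop with a hypothetical URL list):

$ # One request per second from a list of image URLs -- illustrative only.
$ while read url; do
>     wget -q "$url"
>     sleep 1
> done < image_urls.txt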

Given Google's high rankings of Wikipedia pages, and the relationship
with ask.com and others, it's obvious that massive web crawling of the
site is permitted for these search engines and in fact is encouraged.

Needless to say, wikix is nowhere near as intense as these other
applications.  Provided it is not being used in a malicious manner, it
does not appear to impinge on this policy.  Especially since Wikimedia
states at download.wikimedia.org that image tarballs will be supported
at some point.

Jeff

Re: images torrent

Yousef Ourabi-3
Well, I've already started the scripts.

I would be interested in mirroring these images so people from the community who need access to them have multiple choices for download -- and thus remove some of the burden from Wikipedia.

What is the process for this (other than sending out an email)? How receptive would Wikipedia be to something like that?

Thanks.



Re: images torrent

Anthony-73
In reply to this post by RLS-2
On 10/26/07, RLS <[hidden email]> wrote:
> Anthony wrote:
> > On 10/26/07, Yousef Ourabi <[hidden email]> wrote:
> >> What I don't understand is how this fits in with the "no-crawler" policy?
> >>
> > What "no-crawler" policy?
> >
> [[Wikipedia:Database download#Please do not use a web crawler]]
>
Have Google and Yahoo been informed of this policy?

BTW, that talks about articles, not images.  And it contradicts
robots.txt, especially "## we're disabling this experimentally
11-09-2006\n#Crawl-delay: 1"

It seems to stem from something said on the Village Pump back in 2003.
I for one am going to go with robots.txt, not something someone said
on some Wikipedia page.
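
Easy enough to check for yourself (a minimal sketch; the grep pattern
matches the comment quoted above, and the output shown assumes
robots.txt still contains it):

$ curl -s http://en.wikipedia.org/robots.txt | grep -A1 "disabling this experimentally"
## we're disabling this experimentally 11-09-2006
#Crawl-delay: 1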


Re: images torrent

Anthony-73
On 10/26/07, Anthony <[hidden email]> wrote:
> It seems to stem from something said on the Village Pump back in 2003.
Here's the diff:
http://en.wikipedia.org/w/index.php?title=Wikipedia_talk:Village_pump&diff=prev&oldid=989600

Some other fun stuff from the village pump circa 2003:
*"I suggest that all articles about movies and tv shows be scrapped,
and instead have the links point to the apropriate page on the
Internet Movie Database." - Vroman
*"There is never a good reason to delete perfectly good material from
the Wikipedia. Wikipedia isn't paper." - Zoe
*"A wiki devoted just to movies and TV shows would not be a bad thing.
We're probably not there yet, though." - Wapcaplet (inventor of
Wikia?)
*"I fear we can't ban this range. Banning this is banning all of AOL." - JeLuF
*"As nobody else had edited this article, its arbitrary deletion was
uncontroversial. However, deleting articles that someone else has
edited (beyond blanking/reverts) is more controversial, with strong
opinions on both sides." - Martin


Re: images torrent

Gregory Maxwell
In reply to this post by Yousef Ourabi-3
On 10/25/07, Yousef Ourabi <[hidden email]> wrote:
> Google dug this up: http://www.nabble.com/BitTorrent-Downloads-Posted-for-enwiki-20070402-images-TAR-Archives-t3581071.html
>
> When I fire up rTorrent, it says "can't resolve host" -- is there an updated version of this floating around? Did anyone ever successfully download it?
>
> Thanks!


Anyone who wants copies of the image collection and seriously has the
storage to take one, please contact me.  I've been doing one-off rsync
image feeds off one of my own systems.

Transferring the files via HTTP is miserably slow, especially if you
want the full 1.6 TB-ish collection. ;)
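
An rsync pull of that sort might look something like this (the host and
module names are purely hypothetical -- you would get the real ones when
you make contact):

$ # Hypothetical rsync source -- illustrative only.
$ rsync -av --partial --progress rsync://example.org/wikipedia-images/ /image/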


Re: images torrent

Anthony-73
On 10/26/07, Gregory Maxwell <[hidden email]> wrote:
> Anyone who wants copies of the image collection and seriously has the
> storage to take one, please contact me.  I've been doing one-off rsync
> image feeds off one of my own systems.
>
Hmm, depends.  How would you be transferring them to me?

> Transferring the files via HTTP is miserably slow, especially if you
> want the full 1.6 TB-ish collection. ;)
>
I'd imagine it's bandwidth-limited no matter how you do it.  15-16
days or so at 10 Mbps.  Add 4-5 days for HTTP/1.1 handshaking and say
20 days.  Assuming you're using some sort of pipelining, of course.

http://www.google.com/search?q=%281.6+terabytes+divided+by+10+megabits%29+divided+by+days+per+second
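
Or do the same back-of-the-envelope math locally (binary terabytes
assumed; decimal TB gives roughly 15 days):

$ # 1.6 TiB at a sustained 10 Mbps, expressed in days (bc truncates).
$ echo "scale=1; 1.6 * 1024^4 * 8 / (10 * 10^6) / 86400" | bc
16.2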


Re: images torrent

Aryeh Gregor
In reply to this post by Anthony-73
On 10/26/07, Anthony <[hidden email]> wrote:
> Have Google and Yahoo been informed of this policy?

No, since they're our number-one referrers.

> BTW, that talks about articles, not images.  And it contradicts
> robots.txt, especially "## we're disabling this experimentally
> 11-09-2006\n#Crawl-delay: 1"
>
> It seems to stem from something said on the Village Pump back in 2003.
>  I for one am going to go with robots.txt, not something someone said
> on some Wikipedia page.

I believe a more accurate story would be as follows:

1) Live mirrors of the site, however big or small, are discouraged
without prior agreement.  You're supposed to use the dumps for this.
If you want to provide some kind of useful value-added "gateway" or
framing that, for instance, marks up the pages in some useful way,
*and* you very clearly acknowledge the source and give a link, *and*
you don't run ads or similar, *and* you don't use too much bandwidth,
that's probably fine (although it's best to ask first).  If you don't
meet the preceding conditions, you may be asked to pay a fee for the
mirroring service, or face blocking.

2) Anything that uses enough server resources to slow down the site
will probably be blocked or killed if it's noticed.  In the old days
this was a concern, but nowadays it's probably not.

There was a page I once saw where someone had put up the statement
that bots should only request pages once every ten seconds or
something.  When I looked in the histories, I saw that Brion had added
it in like 2003, along with a description of the hardware Wikipedia
was being run on: a single server with one Pentium CPU.  Later someone
removed the part of that edit with the grossly-outdated server
description, but neglected to remove the by then ludicrous blanket
restriction on crawlers.

Anyway, it comes down to this: it's always courteous to ask, but if
you don't cause any actual damage, probably nobody will notice or care.
Don't take that as any official party line -- I'm not a sysadmin -- but
that seems to hold as far as I can tell.


Re: images torrent

Andrew Garrett
In reply to this post by Anthony-73
On 10/27/07, Anthony <[hidden email]> wrote:
> > [[Wikipedia:Database download#Please do not use a web crawler]]
> >
> Have Google and Yahoo been informed of this policy?
>

Context: "Please do not use a web crawler to download large numbers of
articles."

As in "Don't use a web crawler to get big amounts of data for your own
personal use" (i.e. for mirroring).  And it's quite valid: if lots of
people downloaded the entire site one article at a time, we'd end up
with big problems -- especially since the load would be evenly
distributed across many articles, and requests spread over rarely-viewed
pages miss the caches, so there'd be a lot of extra parsing happening.

Google and Yahoo have nothing to do with this, as search engines
represent a tiny portion of our requests (whereas many users doing a
lot of requesting would not), and they use the data obtained for the
public benefit.

--
Andrew Garrett


Re: images torrent

Anthony-73
On 10/28/07, Andrew Garrett <[hidden email]> wrote:
> Google and Yahoo have nothing to do with this, as search engines
> represent a tiny portion of our requests (whereas many users doing a
> lot of requesting would not), and they use the data obtained for the
> public benefit.
>
The same could be said about Yousef Ourabi, though.  He's only one
person, and he's "interested in mirroring these images so people from
the community who need access to them have multiple choices for
download".

I think Simetrical has the de facto policy right.  Don't run a live
mirror, don't slow down or break anything, and no one's going to care
or even notice.

_______________________________________________
Wikitech-l mailing list
[hidden email]
http://lists.wikimedia.org/mailman/listinfo/wikitech-l