Info/Law blog: Using Wikisource as an Alternative Open Access Repository for Legal Scholarship

Re: Info/Law blog: Using Wikisource as an Alternative Open Access Repository for Legal Scholarship

Platonides
Anthony wrote:
> (although I still haven't seen the WMF step up
> to the plate and make it easy for people to make a full history fork, or
> even to download all the images)

You'll find full history dumps of almost all wikis at
http://download.wikimedia.org/

Although not trivial, downloading all images is in fact quite easy. You
can find scripts to do that already made. You can also ask Brion to
rsync3 them.
But do you have enough space to dedicate?
How many wikis do you want to mirror? Just commons is more than 3 TB...

That's the reason so few people were interested in the images when the
image dump was available.


_______________________________________________
foundation-l mailing list
[hidden email]
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/foundation-l

Re: Info/Law blog: Using Wikisource as an Alternative Open Access Repository for Legal Scholarship

Peter Gervai-5
On Tue, Jun 23, 2009 at 03:15, Platonides <[hidden email]> wrote:
> Although not trivial, downloading all images is in fact quite easy. You
> can find scripts to do that already made. You can also ask Brion to
> rsync3 them.
> But do you have enough space to dedicate?
> How many wikis do you want to mirror? Just commons is more than 3 TB...

Well, disks are cheap nowadays. If it's really just a question of
asking, I may be interested, for example.

The more complex question is the parameters of such usage, meaning
what can I do with the images after I've got them. This is the main
reason behind not publishing them in the first place: the images themselves
don't indicate any particular license.

Now that I wrote this, it would be possible (not sure if feasible,
though) to publish CC-BY-SA pictures with author info in the comment
of the image itself. Most image formats support sizeable comment
blocks, and standardised templates make it possible to select media by
license, and get author/copyright info to put into the file.
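
As a rough illustration of the idea above, here is a minimal sketch for one format only. PNG files can carry tEXt chunks, and "Author" and "Copyright" are among the keywords the PNG specification predefines. The helper names (add_text, read_text, tiny_png) and the field values are made up for the example; nothing here is an existing Wikimedia tool, and other formats (JPEG comments, EXIF, XMP) would need their own handling.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def _chunk(chunk_type: bytes, data: bytes) -> bytes:
    """Serialize one PNG chunk: length, type, data, CRC over type+data."""
    crc = zlib.crc32(chunk_type + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + chunk_type + data + struct.pack(">I", crc)


def iter_chunks(png: bytes):
    """Yield (type, data) for every chunk in a PNG byte string."""
    if not png.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG file")
    pos = len(PNG_SIGNATURE)
    while pos < len(png):
        (length,) = struct.unpack(">I", png[pos:pos + 4])
        chunk_type = png[pos + 4:pos + 8]
        yield chunk_type, png[pos + 8:pos + 8 + length]
        pos += 12 + length  # 4 length + 4 type + data + 4 CRC


def add_text(png: bytes, fields: dict) -> bytes:
    """Return a copy of png with one tEXt chunk per field, placed before IEND."""
    out = [PNG_SIGNATURE]
    for chunk_type, data in iter_chunks(png):
        if chunk_type == b"IEND":
            for key, value in fields.items():
                # tEXt payload: Latin-1 keyword, NUL separator, Latin-1 text
                payload = key.encode("latin-1") + b"\x00" + value.encode("latin-1")
                out.append(_chunk(b"tEXt", payload))
        out.append(_chunk(chunk_type, data))
    return b"".join(out)


def read_text(png: bytes) -> dict:
    """Collect all tEXt chunks as a keyword -> text mapping."""
    result = {}
    for chunk_type, data in iter_chunks(png):
        if chunk_type == b"tEXt":
            key, _, value = data.partition(b"\x00")
            result[key.decode("latin-1")] = value.decode("latin-1")
    return result


def tiny_png() -> bytes:
    """Build a 1x1 grayscale PNG so the example needs no external file."""
    ihdr = struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0)
    idat = zlib.compress(b"\x00\x00")  # filter byte + one black pixel
    return PNG_SIGNATURE + _chunk(b"IHDR", ihdr) + _chunk(b"IDAT", idat) + _chunk(b"IEND", b"")


tagged = add_text(tiny_png(), {
    "Author": "Example Uploader",   # placeholder; would come from the file page
    "Copyright": "CC-BY-SA 3.0",    # placeholder; would come from the license template
})
print(read_text(tagged))
```

In practice the author and license strings would have to be pulled from each file's description page, which is the part that depends on the standardised templates.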

> That's the reason so few people were interested in the images when the
> image dump was available.

People are interested, generally, but not in mirroring the whole shebang. :-)

grin

Re: Info/Law blog: Using Wikisource as an Alternative Open Access Repository for Legal Scholarship

metasj
In reply to this post by Nikola Smolenski
Yes, but my understanding is that while google provided part of the mbp data
and scans, its continued updates to ocr since then are not being shared.  I
would be glad to learn this was not the case...

samuel klein.  [hidden email].  +1 617 529 4266

On Jun 21, 2009 3:14 AM, "Nikola Smolenski" <[hidden email]> wrote:

On Saturday 20 June 2009 18:29:24 Brian wrote:

> This has reminded me to complain about Google Books. Google has the
> world's best OCR (in virtue ...

Often, these books are available in the Million Books Project too.

Re: Info/Law blog: Using Wikisource as an Alternative Open Access Repository for Legal Scholarship

Anthony-73
In reply to this post by Platonides
On Mon, Jun 22, 2009 at 9:15 PM, Platonides <[hidden email]> wrote:

> Anthony wrote:
> > (although I still haven't seen the WMF step up
> > to the plate and make it easy for people to make a full history fork, or
> > even to download all the images)
>
> You'll find full history dumps of almost all wikis at
> http://download.wikimedia.org/


Key word being "almost".

> Although not trivial, downloading all images is in fact quite easy.


Yep.  All I need is permission.


> But do you have enough space to dedicate?


Not at the moment.  No sense in buying the drives when I don't have
permission to fill them up.


> How many wikis do you want to mirror? Just commons is more than 3 TB...


Commons and En.wikipedia would probably be good for starters.

The main thing I want is permission to scrape en.wikipedia, though.  (Not
really scraping, as I'd probably use the API and Special:Export.  Basically
I just would like someone official to tell me how *fast* I'm allowed to use
the API and Special:Export.  Special:Export especially, because I could
easily overwhelm the servers using that, due to a bug in the script.)
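
To make the "how fast" question concrete, here is a minimal sketch of client-side throttling. The one-second interval is purely an illustrative guess, not an official limit, and Throttle, export_url and plan_fetches are names invented for this example. The code only builds Special:Export URLs and paces itself; it makes no network requests.

```python
import time
from urllib.parse import quote


class Throttle:
    """Enforce a minimum delay between consecutive requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def export_url(title, base="https://en.wikipedia.org/wiki/Special:Export/"):
    """Build a Special:Export URL for one page title (spaces become underscores)."""
    return base + quote(title.replace(" ", "_"))


def plan_fetches(titles, throttle):
    """Yield export URLs one at a time, pausing between them."""
    for title in titles:
        throttle.wait()
        yield export_url(title)  # a real client would fetch this URL here


# One request per second is an assumption for illustration only.
for url in plan_fetches(["Open access", "Wikisource"], Throttle(min_interval=1.0)):
    print(url)
```

The clock and sleep functions are injectable so the pacing can be checked without actually waiting; whatever interval the operators consider acceptable would simply replace the placeholder value.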

> That's the reason so few people were interested in the images when the
> image dump was available.


I downloaded it.  It was well under 1 TB at the time.

Re: Info/Law blog: Using Wikisource as an Alternative Open Access Repository for Legal Scholarship

Brian J Mingus
In reply to this post by metasj
2009/6/23 Samuel Klein <[hidden email]>

> Yes, but my understanding is that while google provided part of the mbp
> data
> and scans, its continued updates to ocr since then are not being shared.  I
> would be glad to learn this was not the case...
>

The dataset you need to train an OCR system to be as good as theirs is the
raw images and the plain text. They aren't making it easy to get either of
those things :( They have presumably improved the software in other ways as
well..

WTF GOOG?

Re: Info/Law blog: Using Wikisource as an Alternative Open Access Repository for Legal Scholarship

Michael Snow-3
Brian wrote:

> 2009/6/23 Samuel Klein <[hidden email]>
>  
>> Yes, but my understanding is that while google provided part of the mbp
>> data
>> and scans, its continued updates to ocr since then are not being shared.  I
>> would be glad to learn this was not the case...
>>    
> The dataset you need to train an OCR system to be as good as theirs is the
> raw images and the plain text. They aren't making it easy to get either of
> those things :( They have presumably improved the software in other ways as
> well..
>
> WTF GOOG?
>  
Well, when your shorthand uses their stock ticker symbol, your argument
has already been coopted.

--Michael Snow

Re: Info/Law blog: Using Wikisource as an Alternative Open Access Repository for Legal Scholarship

Brian J Mingus
On Tue, Jun 23, 2009 at 11:44 AM, Michael Snow <[hidden email]> wrote:

>
> > The dataset you need to train an OCR system to be as good as theirs is the
> > raw images and the plain text. They aren't making it easy to get either of
> > those things :( They have presumably improved the software in other ways as
> > well..
> >
> > WTF GOOG?
> >
> Well, when your shorthand uses their stock ticker symbol, your argument
> has already been coopted.
>
> --Michael Snow
>

I get the joke, but um, I used it on purpose, and which one of my arguments
has been "coopted"??

Re: Info/Law blog: Using Wikisource as an Alternative Open Access Repository for Legal Scholarship

Michael Snow-3
Brian wrote:

> On Tue, Jun 23, 2009 at 11:44 AM, Michael Snow <[hidden email]>wrote:
>  
>>> The dataset you need to train an OCR system to be as good as theirs is the
>>> raw images and the plain text. They aren't making it easy to get either of
>>> those things :( They have presumably improved the software in other ways as
>>> well..
>>>
>>> WTF GOOG?
>>>      
>> Well, when your shorthand uses their stock ticker symbol, your argument
>> has already been coopted.
>>
>> --Michael Snow
>>    
> I get the joke but um, I used it on purpose and which one of my arguments
> been "coopted" ??
>  
Coopting is not like rebutting; it does not bite chunks out of specific
pieces, it swallows whole. Symbols are powerful things, perhaps even
more so outside the mathematical logic of argument. They do not serve
only your purposes, even if you use them purposefully. My observations
may be wry, but they are not entirely in jest.

--Michael Snow

Re: Info/Law blog: Using Wikisource as an Alternative Open Access Repository for Legal Scholarship

Brian J Mingus
OK, Shakespeare. But in plain English you appear to be saying that
corporations are inherently greedy and have a tendency to be evil. Sure, but
we expect more out of GOOG. This is not MSFT we are talking about.

On Tue, Jun 23, 2009 at 12:13 PM, Michael Snow <[hidden email]> wrote:

> Brian wrote:
> > On Tue, Jun 23, 2009 at 11:44 AM, Michael Snow <[hidden email]> wrote:
> >
> >>> The dataset you need to train an OCR system to be as good as theirs is the
> >>> raw images and the plain text. They aren't making it easy to get either of
> >>> those things :( They have presumably improved the software in other ways as
> >>> well..
> >>>
> >>> WTF GOOG?
> >>>
> >> Well, when your shorthand uses their stock ticker symbol, your argument
> >> has already been coopted.
> >>
> >> --Michael Snow
> >>
> > I get the joke but um, I used it on purpose and which one of my arguments
> > been "coopted" ??
> >
> Coopting is not like rebutting; it does not bite chunks out of specific
> pieces, it swallows whole. Symbols are powerful things, perhaps even
> more so outside the mathematical logic of argument. They do not serve
> only your purposes, even if you use them purposefully. My observations
> may be wry, but they are not entirely in jest.
>
> --Michael Snow
>

Re: Info/Law blog: Using Wikisource as an Alternative Open Access Repository for Legal Scholarship

Anthony-73
In reply to this post by Brian J Mingus
On Tue, Jun 23, 2009 at 1:09 PM, Brian <[hidden email]> wrote:

> 2009/6/23 Samuel Klein <[hidden email]>
>
> > Yes, but my understanding is that while google provided part of the mbp data
> > and scans, its continued updates to ocr since then are not being shared.  I
> > would be glad to learn this was not the case...
> >
>
> The dataset you need to train an OCR system to be as good as theirs is the
> raw images and the plain text. They aren't making it easy to get either of
> those things :( They have presumably improved the software in other ways as
> well..
>
> WTF GOOG?


It's almost like they're trying to run a business or something.

Re: Info/Law blog: Using Wikisource as an Alternative Open Access Repository for Legal Scholarship

Anthony-73
In reply to this post by Brian J Mingus
On Tue, Jun 23, 2009 at 2:24 PM, Brian <[hidden email]> wrote:

> Ok Shakespeare. But in plain english you appear to be saying that
> corporations are inherently greedy and have a tendency to be evil. Sure,
> but
> we expect more out of GOOG. This is not MSFT we are talking about.


Of course they're inherently greedy.  That's the whole purpose of a
for-profit corporation - to make as much money as possible for its
shareholders.  As for "tendency to be evil", I think that rests on your
definition of "evil".

Re: Info/Law blog: Using Wikisource as an Alternative Open Access Repository for Legal Scholarship

Anthony-73
On Tue, Jun 23, 2009 at 3:58 PM, Anthony <[hidden email]> wrote:

> On Tue, Jun 23, 2009 at 2:24 PM, Brian <[hidden email]> wrote:
>
>> Ok Shakespeare. But in plain english you appear to be saying that
>> corporations are inherently greedy and have a tendency to be evil. Sure,
>> but
>> we expect more out of GOOG. This is not MSFT we are talking about.
>
>
> Of course they're inherently greedy.  That's the whole purpose of a
> for-profit corporation - to make as much money as possible for its
> shareholders.
>

I guess even a non-profit is inherently greedy, it's just greedy for
something other than money.  The WMF is greedy for the spread of free
knowledge.

But this is off-topic.  Let's take it to another list or something.

Re: Info/Law blog: Using Wikisource as an Alternative Open Access Repository for Legal Scholarship

John Mark Vandenberg
On Wed, Jun 24, 2009 at 6:10 AM, Anthony <[hidden email]> wrote:

>
> On Tue, Jun 23, 2009 at 3:58 PM, Anthony <[hidden email]> wrote:
>
> > On Tue, Jun 23, 2009 at 2:24 PM, Brian <[hidden email]> wrote:
> >
> >> Ok Shakespeare. But in plain english you appear to be saying that
> >> corporations are inherently greedy and have a tendency to be evil. Sure,
> >> but
> >> we expect more out of GOOG. This is not MSFT we are talking about.
> >
> >
> > Of course they're inherently greedy.  That's the whole purpose of a
> > for-profit corporation - to make as much money as possible for its
> > shareholders.
> >
>
> I guess even a non-profit is inherently greedy, it's just greedy for
> something other than money.  The WMF is greedy for the spread of free
> knowledge.
>
> But this is off-topic.  Let's take it to another list or something.

off-topic?? ... surely you jest!!

I think about _three_ of the 50+ emails in this thread have been on
the topic of open access journal articles on Wikisource.

--
John Vandenberg
