Re: [Wikimedia-l] Catching copy and pasting early

Re: [Wikimedia-l] Catching copy and pasting early

Pine W
It should be relatively easy to catch a significant percentage of those copyright violations with the assistance of automated search tools. The trick is to do it at large scale in near real time, which might require some computationally and bandwidth-intensive work. James, can I suggest that you take this discussion to Wiki-Research-l? There are a number of ways the copyright violation problem could be addressed, and I think this would be a good subject for discussion on that list, or at Wikimania. Depending on how the discussion on Wiki-Research-l goes, it might be good to invite some dev or tech ops people to participate as well.
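
For a concrete sense of what near-real-time checking involves, here is a minimal sketch in Python that polls the MediaWiki recentchanges API for the English Wikipedia and surfaces edits that add a lot of text; the size threshold and polling interval are arbitrary placeholders, not production values.

    # Minimal sketch: poll the English Wikipedia recent-changes feed and print
    # edits that add a lot of new text (candidates for a copyright check).
    # The 500-byte threshold and 30-second interval are arbitrary placeholders.
    import time
    import requests

    API = "https://en.wikipedia.org/w/api.php"
    MIN_BYTES_ADDED = 500   # ignore small edits
    POLL_SECONDS = 30

    def poll_recent_changes():
        params = {
            "action": "query",
            "list": "recentchanges",
            "rcprop": "title|ids|sizes|user|timestamp",
            "rctype": "edit|new",
            "rclimit": 100,
            "format": "json",
        }
        seen = set()  # rcids already reported (grows without bound in this sketch)
        while True:
            result = requests.get(API, params=params).json()
            for rc in result["query"]["recentchanges"]:
                added = rc["newlen"] - rc["oldlen"]
                if rc["rcid"] not in seen and added >= MIN_BYTES_ADDED:
                    seen.add(rc["rcid"])
                    print(f'{rc["timestamp"]} {rc["title"]} '
                          f'(+{added} bytes by {rc["user"]})')
            time.sleep(POLL_SECONDS)

    if __name__ == "__main__":
        poll_recent_changes()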

Pine


On Sun, Jul 20, 2014 at 7:05 PM, Leigh Thelmadatter <[hidden email]> wrote:
This is one of the best ideas I've read on here!


> Date: Sun, 20 Jul 2014 20:00:28 -0600
> From: [hidden email]
> To: [hidden email]; [hidden email]; [hidden email]; [hidden email]; [hidden email]; [hidden email]; [hidden email]
> Subject: [Wikimedia-l] Catching copy and pasting early
>
> I have come across another few thousand edits of copy-and-paste violations
> again today. These have occurred over more than a year. It is wearing me
> out. Really, what is the point of collaborating on Wikipedia if it is
> simply a copyright violation? We need a solution, and one was proposed
> here a couple of years ago: https://en.wikipedia.org/wiki/Wikipedia:Turnitin
>
> We now need programmers to carry it out. The Wiki Education Foundation has
> expressed interest. We will need support from the foundation, as this
> software will likely need to mesh closely with edits as they come in. I am
> willing to offer $5,000 Canadian (almost at par with the US dollar) for a
> working solution that tags potential copyright issues in near real time
> with greater than 90% accuracy. It is to function on at least all medical
> and pharmacology articles, but I would not complain if it worked on all of
> Wikipedia. The WMF is free to apply.
>
> --
> James Heilman
> MD, CCFP-EM, Wikipedian
>
> The Wikipedia Open Textbook of Medicine
> www.opentextbookofmedicine.com

Re: [Wikimedia-l] Catching copy and pasting early

Jane Darnell
Isn't that what Corenbot does/did? I always found it very confusing whenever I ran into it, though, and the false positive rate is huge (so many sites copy Wikimedia content these days).

Re: [Wikimedia-l] Catching copy and pasting early

Andrew G. West
Having dabbled in this initiative a couple of years back when it first
started to gain some traction, I'll make some comments.

Yes, CorenSearchBot (CSB) did/does(?) operate in this space. It
basically took the title of a new article, searched for that term via
the Yahoo! Search API, and looked for nearly-exact text matches among
the first results (using an edit distance metric).
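
For anyone unfamiliar with that style of check, a toy sketch of the idea (in Python) follows; difflib's similarity ratio stands in for CSB's actual edit-distance metric, the threshold is a placeholder, and obtaining candidate URLs from a search API is assumed to happen elsewhere.

    # Toy sketch of the CSB-style check described above: compare a new article's
    # plain text against candidate source pages and report near-exact matches.
    # difflib's ratio stands in for the edit-distance metric, the 0.7 threshold
    # is an arbitrary placeholder, and candidate URLs (e.g. from a search API)
    # are assumed to be supplied by the caller.
    import difflib
    import re
    import requests

    API = "https://en.wikipedia.org/w/api.php"

    def article_plain_text(title):
        """Fetch a plain-text extract of an article via the MediaWiki API."""
        params = {
            "action": "query",
            "prop": "extracts",
            "explaintext": 1,
            "titles": title,
            "format": "json",
        }
        pages = requests.get(API, params=params).json()["query"]["pages"]
        return next(iter(pages.values())).get("extract", "")

    def strip_markup(html):
        """Crude tag stripper; a real tool would use a proper HTML parser."""
        return re.sub(r"<[^>]+>", " ", html)

    def similarity(a, b):
        """Word-level sequence similarity in [0, 1]."""
        return difflib.SequenceMatcher(None, a.split(), b.split()).ratio()

    def check_against_candidates(title, candidate_urls, threshold=0.7):
        """Return (url, score) pairs whose text closely matches the article."""
        article = article_plain_text(title)
        hits = []
        for url in candidate_urls:
            page = strip_markup(requests.get(url, timeout=10).text)
            score = similarity(article, page)
            if score >= threshold:
                hits.append((url, score))
        return hits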

Through the hard work of Jake Orlowitz and others, we got free access
to the TurnItIn API (academic plagiarism detection). Their tool is much
more sophisticated in terms of text matching and has access to material
behind many paywalls.

In terms of Jane's concern, we are (rather, "we imagine being")
primarily limited to finding violations originating at new article
creation or massive text insertions, because content already on WP has
been scraped and re-copied so many times.

*I want to emphasize that this is a gift-wrapped academic research
project*. Jake, User:Madman, and I even began amassing ground truth to
evaluate our approach. This was nearly a chapter in my dissertation. I
would be very pleased for someone to come along, build a practical
tool, and also get themselves a WikiSym/CSCW paper in the process. I
don't have the free cycles to do low-level coding, but I'd be happy to
advise, comment, etc. to whatever degree someone would desire. Thanks, -AW

--
Andrew G. West, PhD
Research Scientist
Verisign Labs - Reston, VA
Website: http://www.andrew-g-west.com



Re: [Wikimedia-l] Catching copy and pasting early

Nathan Awrich



Some questions that aren't answered by the Wikipedia:Turnitin page:

#Has any testing been done on a set of edits to see what the results might look like? I'm a little unconvinced by the idea of comparing edits with tens of millions of term papers or other submissions. If testing hasn't begun, why not? What's lacking?

#The page says there will be no formal or contractual relationship between Turnitin and WMF, but I don't see how this can necessarily be true if it's assumed Turnitin will be able to use the "Wikipedia" name in marketing material. Thoughts?

#What's the value of running the process against all edits (many of which may be minor, or not involve any substantial text insertions) vs. skimming all or a subset of pages each day? (I'm assuming a few million more pageloads per day won't affect the Wikimedia servers substantially.)

#What mechanism would be used to add the report link to the talk pages? A bot account operated by Turnitin? Would access to the Turnitin database be restricted/proprietary, or could other bot developers query it for various purposes?

It sounds like there's a desire to just skip to the end and agree to switch Turnitin on as a scan for all edits, but I think these questions and more will need to be answered before people will agree to anything like full-scale implementation.


Re: [Wikimedia-l] Catching copy and pasting early

Jane Darnell
It's been a while, but as I recall, my problem with Corenbot was the text that it inserted on the page (some loud banner with a link to the original text on some website, which was often not at all related to the matter at hand). My confusion was over the instructional text in the banner, and I wasn't sure whether I should leave it or delete it (ah, those were the days, back when I thought my submissions were thoughtfully read the moment I pressed publish!). The problem with implementing this sort of idea is that you need a bunch of field workers to sift through all of the positives, so you can be sure you are not needlessly confusing some newbie somewhere. The bot is one thing; the workflow is something else entirely.



Re: [Wikimedia-l] Catching copy and pasting early

Sage Ross
Hey folks.

As James noted, Wiki Education Foundation is planning to do some work
on this problem. I'll be the project manager for it, and I'll be
grateful for all the help and advice I can get. I'm in the process now
of finding a development company to work with.

Our current plan is to complete a "feasibility study" by February
2015. Basically, that means doing enough exploratory development to
get a clear picture of just how big a project it will be. The first
goal would be to scratch our own itch: to set up a system that checks
all edits made by student editors in our courses, and which highlights
apparent plagiarism on a course dashboard (on wikiedu.org) and alerts
the instructor and/or the student via email. However, if we can do
that, it should provide a good starting point for scaling up to all
of Wikipedia.
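
As a rough sketch of the first piece of that, under my own assumptions about how a course roster might be represented (just an iterable of usernames, with a placeholder size threshold), the checking loop could pull each student's recent contributions from the MediaWiki API and keep the edits large enough to be worth sending to a similarity service:

    # Sketch of the per-course checking loop described above: for each student
    # username on a course roster, list recent contributions via the MediaWiki
    # API and keep the edits that added enough text to justify a similarity
    # check. The roster format and 300-byte threshold are assumptions.
    import requests

    API = "https://en.wikipedia.org/w/api.php"
    MIN_BYTES_ADDED = 300

    def contributions(username, limit=50):
        params = {
            "action": "query",
            "list": "usercontribs",
            "ucuser": username,
            "ucprop": "ids|title|timestamp|sizediff",
            "uclimit": limit,
            "format": "json",
        }
        return requests.get(API, params=params).json()["query"]["usercontribs"]

    def edits_to_check(course_roster):
        """course_roster: iterable of Wikipedia usernames enrolled in a course."""
        flagged = []
        for username in course_roster:
            for contrib in contributions(username):
                if contrib.get("sizediff", 0) >= MIN_BYTES_ADDED:
                    flagged.append({
                        "user": username,
                        "title": contrib["title"],
                        "revid": contrib["revid"],
                        "timestamp": contrib["timestamp"],
                    })
        return flagged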

I think Jane is right to highlight the workflow problem. That's also a
workflow that would be very different for a Wikipedia-wide system
versus what I describe above, where we're working with editors in a
specific context (course assignments) and we can communicate with them
offwiki. My first idea would be something that notifies the
responsible editors directly so that they can fix the problems
themselves, rather than one that requires "field workers" to sift
through the positives to clean up after others. The point would be to
catch problems early, so that users correct their own behavior before
they've done the same thing over and over again.

Nathan, finding answers to some of those questions will be part of the
feasibility study. One of the key goals Wiki Ed has for this is to
minimize false positives, so we'll want to spend some time
experimenting with what kinds of edits we can reliably detect as true
positives. It may be that only edits of a certain size are worth
checking, or only blocks of text that don't rewrite existing content.
Regarding term papers, it might be a little confusing to refer to
"Turnitin", as the working plan has been to use a different service
from the same company, called iThenticate. This one is different from
Turnitin in that it's more focused on checking content against
published sources (on the web and in academic databases), and it
doesn't include Turnitin's database of previously submitted papers.
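
To make the "only edits that add substantial new text" idea concrete, one possible pre-filter is sketched below; the threshold is a placeholder, and a line-based diff is only a crude stand-in for whatever text-comparison machinery a real implementation would use:

    # Sketch of a pre-filter for the idea above: only send an edit for an
    # external similarity check if it adds a substantial block of genuinely new
    # text, not text that was merely moved or rearranged. Threshold is a placeholder.
    import difflib

    MIN_NEW_CHARS = 500

    def added_lines(old_wikitext, new_wikitext):
        """Lines present in the new revision but not in the old one."""
        diff = difflib.ndiff(old_wikitext.splitlines(), new_wikitext.splitlines())
        return [line[2:] for line in diff if line.startswith("+ ")]

    def worth_checking(old_wikitext, new_wikitext):
        """True if the edit adds enough new prose to justify an external check."""
        new_chars = sum(len(line) for line in added_lines(old_wikitext, new_wikitext))
        return new_chars >= MIN_NEW_CHARS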

Andrew: when we get closer to breaking ground, I'd love to talk it
over with you.

Sage Ross

User:Sage (Wiki Ed) / User:Ragesoss
Product Manager, Digital Services
Wiki Education Foundation


Re: [Wikimedia-l] Catching copy and pasting early

Kerry Raymond
In reply to this post by Nathan Awrich

In light of the editor retention problem, I suggest we have to be very careful with any kind of "plagiarism detector" software, because we have real subject matter experts among our editors. I'm aware of members of local history societies who have had issues with copyright violation because they have content on their own websites which they then contribute to Wikipedia. It's not a copyright violation, because it's their own work, but it was deleted, they were accused of copyright violation, and they were naturally very unhappy about both. Being new users, they did not know any way to get this redressed; they asked me for help, and I got nowhere with the editor who had deleted the material, who would not accept their assertion that they were the original authors (how on earth could they prove it?). As a result, none of them are now active editors. Having had a whole bunch of my own images nearly deleted from Commons because they appear on my own website (despite my user name being my real name, and my real name being all over my website), I know how they feel about having accusations of copyright violation all over their contributions; it's really offensive. Strangely, we have no way to whitelist particular websites in relation to particular users (in theory, you'd want to be able to whitelist books and offline resources too, but in practice "copies" from these are far less likely to be noticed), so the same problem can arise again and again for an individual contributor.
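
For what it's worth, a per-editor whitelist would be a very small layer in front of any detector; a sketch follows, with a made-up data structure purely for illustration:

    # Sketch of a per-editor whitelist check: before flagging an apparent match,
    # see whether the matched source domain is one the editor has declared as
    # their own. The data structure and names are made up purely for illustration.
    from urllib.parse import urlparse

    # Hypothetical mapping: Wikipedia username -> domains they have declared they own.
    USER_WHITELIST = {
        "ExampleLocalHistorian": {"examplehistoricalsociety.org"},
    }

    def is_whitelisted(username, matched_url):
        """True if the matched source belongs to a site the editor says is theirs."""
        domain = urlparse(matched_url).netloc.lower()
        if domain.startswith("www."):
            domain = domain[4:]
        return domain in USER_WHITELIST.get(username, set())

    def should_flag(username, matched_url):
        return not is_whitelisted(username, matched_url)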

 

So I would be very hesitant about putting any visible tag on an article suggesting it is a copyright violation (as it seems to me it is both offensive and potentially libellous to an editor who has in good faith contributed their own work). I think any concern about copyright has to be raised first with the editor involved, as a question NOT an accusation. And I note that it is often very difficult to communicate with new or occasional editors, as they often have no email address associated with their account and they don't see talk page message banners unless they are remember-me logged in. It's ironic that at the time a contributor is most likely to want or need help, we are in the worst position to know they want it or to offer it if we see they need it.

 

So, I’m with Jane on this one. It’s easy enough to detect a lot of potential copyright violations automatically. What’s hard and very much a manual task is confirming it really is a copyright violation and, where required, educating the contributor. I think there’s a real danger to automating the first part without a good solution to the second part. We have far too many editors who use tools as weapons already, so I am reluctant to give them more weapons.

 

Kerry

 

 

