Research on automatically created articles

Research on automatically created articles

Denny Vrandečić-2
Hi all,

I found a paper at IJCAI 2016, which left me quite curious: https://siddbanpsu.github.io/publications/ijcai16-banerjee.pdf 

In short, they find red links, classify them, find the closest similar articles, use the section titles from these articles to decide on sections, search for content for the sections, paraphrase it, and write complete Wikipedia articles.
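Read as an algorithm, the pipeline described above could be sketched roughly like this. This is a minimal sketch of my reading of the paper, not the authors' actual code; every helper (classify, find_similar, search, paraphrase) is a hypothetical stand-in injected by the caller:

```python
# Hedged sketch of the described pipeline: red link -> category ->
# section template from similar articles -> searched-and-paraphrased content.
# All helper functions are hypothetical stand-ins passed in as parameters.

def generate_article(red_link_title, classify, find_similar, search, paraphrase):
    """Assemble a draft article for a red link from similar articles' sections."""
    category = classify(red_link_title)                 # e.g. "plant tribe"
    similar = find_similar(red_link_title, category)    # closest existing articles
    # Take the union of the similar articles' section titles, in first-seen order
    sections = []
    for article in similar:
        for title in article["sections"]:
            if title not in sections:
                sections.append(title)
    # Retrieve candidate content for each section and paraphrase it
    return {
        title: [paraphrase(hit) for hit in search(f"{red_link_title} {title}")]
        for title in sections
    }
```

With stub helpers this produces a dict mapping section titles to paraphrased snippets; the real system's classifiers, retrieval, and paraphrasing are of course far more involved.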

Then they uploaded the articles to Wikipedia, and of the 50 uploaded articles, only 3 got deleted. The rest stayed. I was rather excited when I heard that - were the articles really that good?

Then I took a look at the articles and... well, judge for yourself. The paper only mentions three articles of the 47 survivors:


https://en.wikipedia.org/wiki/Atripliceae (here is the last version as created by the bot before significant human clean-up: https://en.wikipedia.org/w/index.php?title=Atripliceae&oldid=697456858 )


I have been in contact with the first author, and he promised to send me a list of all the articles as soon as he can get to it, which will be in a few weeks because he is away from his university computer right now. He was able to produce one more article, though:


(Also, see history for the extent of human clean-up)

I am not writing to talk badly about the authors or about the reviewing practice at IJCAI, or about the state of research in that area. Also, I really do not want to discourage research in this area.

I have a few questions, though:

1) The fact that so many of these articles have survived for half a year indicates that there are some problems with our review processes. Does someone want to investigate why these articles survived in the given state?

2) As far as I know we don't have rules for this kind of experiment, but maybe we should. In particular, I feel that BLPs should not be created by an experimental approach like this one. Should we set up rules for this kind of experiment?

3) Wikipedia contributors are participating in these experiments without consent. I find that worrisome, and would like to hear what others think.

I have invited the first author to join this list.

I understand the motivation: had they disclosed from the beginning that these articles were created by bots, the articles would have been scrutinized differently than articles written by humans. Therefore they remained quiet about it (but are willing to reveal it now that the experiment is over - they also explicitly have no intention of expanding the scope of the experiment at this point in time).

Cheers,
Denny




_______________________________________________
Wiki-research-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l

Re: Research on automatically created articles

Ziko van Dijk-3
Hello Denny,

I agree with all three points. The experiment reminds me of the "babelfish accidents", as we called them on de.WP, and the experiments of Google and Microsoft to "support" "translations" between Wikipedias.

Very strange, this repeating "Dick Barbour is legendary in..."

Kind regards
Ziko



Re: Research on automatically created articles

Federico Leva (Nemo)
In reply to this post by Denny Vrandečić-2
Denny Vrandečić, 09/08/2016 20:29:
> 1) the fact that so many of these articles have survived for half a year
> indicates that there are some problems with our review processes. Does
> someone want to make an investigation why these articles survived in the
> given state?

Looks like the good old trick of making sure that the most prominent
parts are ok (first line, headers, footnotes) and then adding mere
fillers for the rest...

Nemo


Re: Research on automatically created articles

Stuart A. Yeates
In reply to this post by Ziko van Dijk-3
There appears to be out-of-policy use of multiple accounts involved in this work. If that is the case, all of these articles may be subject to deletion on procedural grounds, completely independently of any real or perceived notability, article quality, or research quality.

I HIGHLY recommend against such use of multiple accounts. 

cheers
stuart

--
...let us be heard from red core to black sky


Re: Research on automatically created articles

siddhartha banerjee
In reply to this post by Denny Vrandečić-2
Hello Everyone,

I am the first author of the paper that Denny referred to. Firstly, I want to thank Denny for inviting me to join this list and learn more about this discussion.

1. Regarding quality, we know that there are issues, and even at the conference I repeatedly told the audience that I am not satisfied with the quality of the generated content. However, the percentage of articles that had been removed when the paper was submitted was minimal. I have sent Denny a list of the accounts that were used; it is possible that several articles created from those accounts have been removed within the last couple of months. I was not aware of the multiple-account policy.

2. The area of Wikipedia article generation has been explored by others in the past [http://www.aclweb.org/anthology/P09-1024, http://wwwconference.org/proceedings/www2011/companion/p161.pdf]. We were not aware of any rules regarding this sort of experiment. However, we do understand that such experiments can harm the general quality of this great encyclopedic resource, hence we did our analysis on a bare minimum of articles. In fact, we did our initial work on this back in 2014, and the Wikimedia research newsletter even covered our paper here -- https://blog.wikimedia.org/2015/02/02/wikimedia-research-newsletter-january-2015/#Bot_detects_theatre_play_scripts_on_the_web_and_writes_Wikipedia_articles_about_them

If questions had been raised at that point, we would surely not have done anything further on this, or rather would have done things offline without creating or adding any content on Wikipedia.

I understand your point about imposing rules, and I think it makes sense. However, during this research we were not aware of any rules, hence we continued our work.
As I have told Denny, our purpose was to check whether we could create bare-minimum articles that could eventually be improved by authors on Wikipedia, and also to see whether they would be removed entirely. But it was done with only a few articles, and we did not create anything beyond that point. Also, we did not make any manual modifications to the articles, although we saw quality issues, because that would void our analysis and claims.

Thanks everyone for your time and the great work you are doing for the Wikipedia community. 

Regards,
Sidd



 


Re: Research on automatically created articles

WereSpielChequers-2
I have proposed https://en.wikipedia.org/wiki/Mazaua for deletion - I assume it was one of the other articles involved.


Our new-page patrollers are pretty experienced at tagging for deletion the spam and clearly non-notable articles that get created by the hundred every day. If someone were to waste everyone's time by creating a bunch of articles that look like press releases from an over-enthusiastic marketing department, or appeals for a drummer in time for the first rehearsal of the next big thing on the Bournemouth grunge scene, then I've no doubt they would be deleted pdq. Easier still: watch a hundred articles at the start of the NPP process, predict how they'd fare, and then test your predictions against the results.

If you successfully produce a bunch of flawed articles that look like the sort we accept from good-faith newbies with idiosyncratic English, that doesn't tell us anything about our ability to filter out the stuff we need to delete. But it could mean that patrollers become less tolerant of what appears to be someone with limited English writing an article about an island that probably merits one: https://en.wikipedia.org/wiki/Mazaua

Jonathan


Re: Research on automatically created articles

Stuart A. Yeates
In reply to this post by siddhartha banerjee
* The previous work you cite appears to have created articles in the draft namespace rather than the article namespace. This is a very important and very relevant detail; from my point of view it means your situation is in no way comparable to the previous work.
* You appear to be solving a problem that the community of Wikipedia editors does not have. We have enough low-quality stub articles that need human effort to improve, and we're not really interested in more unless either (a) they demonstrably combat some of the systematic biases we're struggling with or (b) they demonstrably attract new cohorts of users to do that improvement. Note that the examples discussed in the research newsletter are a non-English writer and a woman writer. These are important details.
* Your paper appears not to make any attempt to measure the statistical significance of your results; this isn't science.
* Most of your sources are _really_ _really_ bad. https://en.wikipedia.org/wiki/Talonid contains 8 unique refs, one of which is good, one of which is passable, and the others should be removed immediately (but I won't remove them, because it would make it harder for third parties reading this conversation to follow it).

If you want to properly evaluate your technique, try this: randomly pick N articles from the subcategories of https://en.wikipedia.org/wiki/Category:Articles_lacking_sources, splitting them randomly into control and subject groups. Parse each subject article for sentences that your system appears to understand. For each sentence your system understands, look for reliable sources to support it. Add a single ref to a single statement in each article. Add all the refs using a single account, with a message on the user page about the nature of the edits. If you're not able to add any refs, mark the article as a failure. Measure article lifespan for each group.

If you're in a hurry and want fast results, work with articles less than a week old (hint: article IDs form a numerically increasing sequence) or with the intersection of the https://en.wikipedia.org/wiki/Category:Articles_lacking_sources subcats and Category:Articles_for_deletion. Both of these groups of articles are actively being considered for deletion.
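The proposed evaluation could be outlined in code roughly as follows. This is only a hypothetical sketch: `add_ref` and `lifespan_days` stand in for the manual and on-wiki work described above, and are injected by the caller rather than tied to any real API:

```python
# Hedged sketch of the proposed evaluation: random control/subject split,
# at most one added ref per subject article, lifespan measured per group.
# add_ref(aid) -> True if a reliable source was found and added (hypothetical).
# lifespan_days(aid) -> observed article lifespan in days (hypothetical).
import random

def run_evaluation(article_ids, add_ref, lifespan_days, seed=0):
    """Split articles into control and subject groups and compare survival."""
    rng = random.Random(seed)
    ids = list(article_ids)
    rng.shuffle(ids)
    half = len(ids) // 2
    control, subjects = ids[:half], ids[half:]
    results = {"control": [], "subject": [], "failed": []}
    for aid in subjects:
        if add_ref(aid):                      # a supporting ref could be added
            results["subject"].append(lifespan_days(aid))
        else:
            results["failed"].append(aid)     # no supportable sentence: failure
    for aid in control:
        results["control"].append(lifespan_days(aid))
    return results
```

Comparing the lifespan distributions of the two groups (with an actual significance test) would address the "this isn't science" objection raised above.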

cheers
stuart
 

--
...let us be heard from red core to black sky


Re: Research on automatically created articles

Denny Vrandečić-2
So here's the list of accounts that were used to create the articles:


Also some edits may have been done through IPs.

In discussion with Sidd it was clear that they never planned to mass-create a large number of articles, and it is only these 50 articles or so that we can clean up now. I am not terribly worried about this particular work (according to the paper there were 47 surviving articles at the time of writing, i.e., in spring).

What I am concerned about is that there will be more such experiments from other groups. It would be great to set up a few rules for this kind of behavior, so that we can at least point to them. If the only rule that was broken here was the "don't use multiple accounts" rule, I am not sure that would be sufficient.

Cheers,
Denny




Re: Research on automatically created articles

Ziko van Dijk-3
Hello,

Do we have a collection of already existing and relevant policies and statements, at least for the English Wikipedia? On Meta I found this page
https://meta.wikimedia.org/wiki/Research:Wikipedia_Research_Management
whose main statement is that research is too various and complex for a few general recommendations.

At first sight, I find it difficult to read something relevant from https://en.wikipedia.org/wiki/Wikipedia:What_Wikipedia_is_not

I imagine that guidelines could be helpful with regard to a) research that includes editing wiki pages, and b) editing by students or pupils for educational purposes.

Research and educational activity should not disturb the efforts of the Wikipedia community to create and improve encyclopedic content. Disturbance can occur from creating substandard content and from engaging in activities that disrupt workflows. ...

These guidelines could be only a recommendation, as long as the Wikipedia communities don't change their rules. But it would be great, anyway, if the guidelines could somehow be based on existing Wikipedia rules.

Kind regards
Ziko





2016-08-12 0:41 GMT+02:00 Denny Vrandečić <[hidden email]>:
So here's the list of accounts that were used in order to create the articles:


Also some edits may have been done through IPs.

In discussion with Sidd it was clear that they did not plan to ever mass-create a large number of articles, and it is only these 50 articles or so we can clean up now. I am not terribly worried about this particular work (according to the paper there were 47 surviving articles at the time of writing, i.e. in Spring).

What I am concerned about is the fact that there will be more such experiments from other groups. It would be great to set up a few rules for this kind of behavior, so that we can at least point to them. If the only rule that was broken here was the "don't use multiple accounts" rule, I am not sure whether that would be sufficient.

Cheers,
Denny



On Wed, Aug 10, 2016 at 1:47 AM Stuart A. Yeates <[hidden email]> wrote:
* The previous work you cite appears to have created articles in the draft namespace rather than the article namespace. This is a very important and very relevant detail, meaning your situation is in no way comparable to the previous work from my point of view
* You appear to be solving a problem that the community of wikipedia editors does not have. We have enough low-quality stub articles that need human effort to improve and we're not really interested in more unless either (a) they demonstrably combat some of the systematic biases we're struggling with or (b) they demonstrably attract new cohorts users to do that improvement. Note that the examples discussed in the research newsletter are a non-English writer and a women writer. These are important details.
* Your paper appears not to attempt to make any attempt to measure the statistical significance of your results; this isn't science.
* Most of your sources are _really_ _really_ bad. https://en.wikipedia.org/wiki/Talonid Contains 8 unique refs, one of which is good, one of which is a passable and the others should be removed immediately (but I won't because it'll make it harder for third parties reading this conversation to follow it.).

If you want to properly evaluate your technique, try this: Randomly pick N articles from https://en.wikipedia.org/wiki/Category:Articles_lacking_sources subcats, splitting them randomly into control and subject groups. Parse each subject article for sentences that your system appears to understand. For each such sentence, look for reliable sources to support it. Add a single ref to a single statement in each article. Add all the refs using a single account, with a message on the user page about the nature of the edits. If you're not able to add any refs, mark it as a failure. Measure article lifespan for each group.

If you're in a hurry and want fast results, work with articles less than a week old (hint: article IDs form a numerically increasing sequence) or the intersection of https://en.wikipedia.org/wiki/Category:Articles_lacking_sources subcats and Category:Articles_for_deletion. Both of these groups of articles are actively being considered for deletion.
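[A minimal sketch, not part of Stuart's message, of the random control/subject split he describes. The article titles here are placeholders; in a real run they would be drawn from the Category:Articles_lacking_sources subcategories.]

```python
import random

def split_control_subject(titles, seed=0):
    """Shuffle the candidate articles and split them into two equal groups."""
    rng = random.Random(seed)          # fixed seed so the assignment is reproducible
    pool = list(titles)
    rng.shuffle(pool)
    half = len(pool) // 2
    return pool[:half], pool[half:]    # (control, subject)

# Placeholder titles; a real run would fetch these from the category subcats.
articles = ["Article_%d" % i for i in range(10)]
control, subject = split_control_subject(articles)
```

The lifespan of each article in both groups would then be tracked separately, with the control group left untouched.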

cheers
stuart
 

--
...let us be heard from red core to black sky

On Wed, Aug 10, 2016 at 9:30 AM, siddhartha banerjee <[hidden email]> wrote:
Hello Everyone,

I am the first author of the paper that Denny has referred to. Firstly, I want to thank Denny for asking me to join this list and learn more about this discussion. 

1. Regarding quality, we know that there are issues, and even at the conference I repeatedly told the audience that I am not satisfied with the quality of the content generated. However, the percentage of articles that had been removed when the paper was submitted was minimal. I have sent Denny a list of the accounts that were used, and it is possible that several of the articles created by those accounts have been removed within the last couple of months. I was not aware of the multiple-account policy. 

2. The area of Wikipedia article generation has been explored by others in the past [http://www.aclweb.org/anthology/P09-1024, http://wwwconference.org/proceedings/www2011/companion/p161.pdf]. We were not aware of any rules regarding this sort of experiment. However, we do understand that such experiments can harm the general quality of this great encyclopedic resource, hence we did our analysis on a bare minimum of articles. In fact, we did our initial work on this back in 2014, and Wikimedia research even covered details of our paper here -- https://blog.wikimedia.org/2015/02/02/wikimedia-research-newsletter-january-2015/#Bot_detects_theatre_play_scripts_on_the_web_and_writes_Wikipedia_articles_about_them 

If questions had been raised at that point, we would surely not have done anything further on this, or rather would have done things offline without creating or adding any content on Wikipedia. 

I understand your point about imposing rules and I think it makes sense. However, during this research we were not aware of any rules, and hence continued our work. 
As I have told Denny, our purpose was to check whether we could create bare minimal articles which could eventually be improved by authors on Wikipedia, and also to see whether they would be removed entirely. But it was done with only a few articles, and we did not create anything beyond that point. Also, we did not make any manual modifications to the articles, although we saw quality issues, because that would have invalidated our analysis and claims. 

Thanks everyone for your time and the great work you are doing for the Wikipedia community. 

Regards,
Sidd



 

_______________________________________________
Wiki-research-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l



Re: Research on automatically created articles

Stuart A. Yeates
I think you misunderstand the nature of en.wiki.

en.wiki is not a rule-based automaton; en.wiki is an autonomous community that works by consensus. 

I cannot imagine a set of research rules constructed outside en.wiki that lets you 'safely' interact with it. Observe it, maybe, but not interact with it. I can also imagine certain kinds of observation (or certain results coming out of observation) making further observation difficult.

The best advice I can provide is to team up with an experienced editor or two.

[For editing for educational rather than research purposes see https://en.wikipedia.org/wiki/Wikipedia:Education_program ]

cheers
stuart

--
...let us be heard from red core to black sky

On Fri, Aug 12, 2016 at 11:04 AM, Ziko van Dijk <[hidden email]> wrote:
Hello,

Do we have a collection of already existing and relevant policies and statements, at least for English Wikipedia? On Meta I found this page
https://meta.wikimedia.org/wiki/Research:Wikipedia_Research_Management
whose main statement is that research is too varied and complex to allow more than a few general recommendations.

At first sight, I find it difficult to read something relevant from https://en.wikipedia.org/wiki/Wikipedia:What_Wikipedia_is_not

I imagine that guidelines could be helpful with regard to a) research that includes editing wiki pages, b) the editing of students or pupils for educational purposes.

Research and educational activity should not disturb the efforts of the Wikipedia community to create and improve encyclopedic content. Disturbance can occur from creating substandard content and from engaging in activities that disrupt workflows. ...

These guidelines could only be a recommendation, as long as the Wikipedia communities don't change their rules. But it'd be great, anyway, if the guidelines could somehow be based on existing Wikipedia rules.

Kind regards
Ziko






Re: Research on automatically created articles

Kerry Raymond
In reply to this post by Stuart A. Yeates
Presuming the research is being conducted under the usual ethics regimes: putting articles into mainspace Wikipedia puts them in front of readers and into the workflows and activities of editors. This would appear to me to constitute an experiment on human subjects, which usually raises issues of informed consent and the potential to harm them. Can we be shown the ethical approval documents for this particular project, to see how these concerns were addressed?

Kerry

Sent from my iPad


Re: Research on automatically created articles

siddhartha banerjee
In reply to this post by Denny Vrandečić-2
As I mentioned earlier, I was not sure about the multiple-account policy. I got the notification about the incident being raised, and I will be happy with whatever decision the Wiki administrators make. 

As Denny mentioned, we did not plan anything large-scale, only a small group of edits. Furthermore, we stated that the results were only valid up to a particular date before the submission of that conference paper, and things may already have changed a lot (articles removed, edited further, etc.). We have not made any additions since February, nor do we plan to do anything further. Whatever we do would be offline. 

To Denny's point about other researchers trying to do the same kind of research: I do see research coming up in this area, and it might make sense to have certain rules (although I do not have much idea of how rules work on Wikipedia in general). I know this because some researchers have contacted me previously about this work, and they are also looking into similar areas. One very recent example in this area of work is the following: http://snap.stanford.edu/wikiworkshop2016/papers/wikiworkshop_icwsm2016_pochampally.pdf

Regarding human subjects, no reviewer at the conferences, nor any other person from Wikimedia, mentioned anything about that earlier. Our previous work was featured in the Wikimedia newsletters (links in earlier emails), and still nothing was mentioned about it, nor did we find any information about it on Wikipedia in general. As per the requirements, approval would be necessary for: data about living individuals obtained through intervention or interaction, or identifiable private information about living individuals. As mentioned, the "about" part is very important -- because no data about editors was used or collected in the research. 

If rules do change, I will keep following the thread and also please let me know -- I will try to inform to all researchers who work in this area if they get in touch with me.

-- Siddhartha


_______________________________________________
Wiki-research-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
Re: Research on automatically created articles

Stuart A. Yeates
You interacted with living people through Wikipedia, and you recorded information about whether they deleted your references. Many people (like myself) are directly traceable from their Wikipedia accounts to their real-life identities.

How is a record of someone doing something not about them?

cheers
stuart


--
...let us be heard from red core to black sky

Re: Research on automatically created articles

siddhartha banerjee
In reply to this post by Denny Vrandečić-2
As I have mentioned earlier, this is not the first work on article generation. This is one of the earliest works we know of: https://people.csail.mit.edu/csauper/pubs/sauper-sm-thesis.pdf
None of these mentioned anything about human subjects, since ultimately no personal information is used (about the person who is deleting, etc.). Nor did any reviewers or attendees at conferences in this area raise questions on this aspect.
Also, https://en.wikipedia.org/wiki/Wikipedia:Wikipedia_Signpost/2015-01-28/Recent_research is relevant here, as it talks about our previous work.

If a "record of someone doing something" is relevant from a human-subjects point of view, then any data on Wikipedia can be used to find the editors (if not the real person). For example, I have met several researchers who work with such data (revisions from Wikipedia), and nothing about IRB ever came up.

Nevertheless, as I said, if there are concrete rules, I think it would help the research community as a whole to know what can or cannot be done, and also to ask for permission.
I appreciate the suggestions Stuart made in a previous email about experimenting on articles that would be deleted or articles lacking sources. But as of now we are not planning anything, and if we do, we will certainly get in touch with Denny (who had a video chat with me before starting this thread) and try to find the best way of doing it.

I have asked my PhD advisor (the other author on the paper) to check this thread, and he will be able to give more input, as I am not very qualified to comment on these aspects.

Thanks,
Sidd

Re: Research on automatically created articles

siddhartha banerjee
I thought I should add this too, as I missed it in the previous email.
It talks about Content Analysis (counting the number of references removed, or content removed) -- which is what we did (with the few articles), and that is what we followed, as it says such analysis is "generally considered exempt from such requirements and does not require an IRB approval."
My advisor should be able to add more thoughts on it (I have asked him to reply on this thread).

Thanks,
Sidd




Re: Research on automatically created articles

Kerry Raymond
I am asking you to share the documentation of the ethical clearance or exemption your institution would have required, not what people did or didn't say to you as part of conference reviewing or at conferences. Ethical clearance is a process that should have been undertaken before your research commenced, not when you are writing the paper or attending a conference. Are you saying you undertook the research without any consideration of the ethics? Does your university have no guidelines about this?

The Wikipedia guidelines about content analysis are not particularly relevant here. You were not analysing existing Wikipedia articles but injecting new articles of dubious quality into Wikipedia.

Nor is the data about individuals my point. If you wasted people's time reacting to the articles created, you did them harm. If people derived incorrect information from reading your articles, you did them harm. None of those people were aware they were part of your research experiment; that means they did not give informed consent to participate in your experiment. You could have generated the articles and sought the opinions of Wikipedia readers and editors on those articles without placing them into Wikipedia itself. That way would have enabled informed consent; those not wishing to take part would not have been misled into doing so.

Sent from my iPad

Re: Research on automatically created articles

Kerry Raymond
I draw attention to Penn State's IRB website


Re: Research on automatically created articles

Kerry Raymond
And to its policies


With particular reference to

"Intervention includes both physical procedures by which data are gathered (for example, venipuncture) and manipulations of the participant or the participant’s environment that are performed for research purposes."

Putting those articles into Wikipedia manipulated the environment of Wikipedia readers and editors.

Now, I am not saying that huge harm was done; you would have to ask those who subsequently edited the articles (a known group) and those who read them (an unknown group) to find out whether they are unhappy about what took place.

What I am saying is that if consideration had been given to the question of who is impacted by this research plan, then maybe the plan would have been redesigned to prevent the problem, and we would not need to have this conversation.

Kerry

Sent from my iPad

Re: Research on automatically created articles

Stuart A. Yeates
It's worth noting that https://en.wikipedia.org/wiki/Talonid appears (insofar as I can grasp what it's about) to at least verge on being medical information.

Medical information is subject to specific laws and is an exceedingly brave place to start a research project like this. In terms of potential harm to research subjects (= readers of Wikipedia), it pretty much hits the jackpot.

cheers
stuart



--
...let us be heard from red core to black sky
