[Wikimedia-l] Internet Archive BOT

[Wikimedia-l] Internet Archive BOT

Martin Pascal-2
Hi,

My native language is French; this is an automatic translation into English.
This message follows the many false 404 detections made by the Internet Archive bot, which occur because the bot is blacklisted on a lot of servers. Here are a few details about the Wikiwix archiving service ( https://nl.wikipedia.org/wiki/Wikipedia:De_kroeg#Internet_Archive_Bot )
It relies solely on this JavaScript gadget, in place on French Wikipedia since 2008: https://fr.wikipedia.org/wiki/MediaWiki:Gadget-ArchiveLinks.js
The advantage of this solution is that other archive sources can be added, and that it does not modify the content of Wikipedia articles.
New links are detected by 3 different means:
• Annual retrieval from the dumps: https://dumps.wikimedia.org/backup-index.html,
• Retrieval of Recent Changes over IRC and over the web.
We also recommend clicking on the archive link as soon as a contributor adds a source: this immediately triggers storage of the link and lets you test the rendering of the archived page.
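
As a rough illustration of how newly added sources could be picked up from a recent change, here is a minimal Python sketch. It is not Wikiwix's code: the function names, the simplified URL pattern, and the sample revisions are all assumptions made for the example. It compares the external URLs found in two revisions of a page and keeps those that appear only in the newer one.

```python
import re

# Simplified pattern for http(s) URLs in wikitext; an assumption for this
# sketch, not the pattern MediaWiki or Wikiwix actually uses.
URL_PATTERN = re.compile(r'https?://[^\s\[\]<>|}"]+')


def external_links(wikitext):
    """Return the set of external URLs found in a piece of wikitext."""
    return set(URL_PATTERN.findall(wikitext))


def new_external_links(old_text, new_text):
    """Return, sorted, the URLs present in the new revision but not the old one."""
    return sorted(external_links(new_text) - external_links(old_text))


# Hypothetical revisions of the same page, before and after an edit.
old_revision = "See [https://example.org/a Source A]."
new_revision = (
    "See [https://example.org/a Source A] and "
    "<ref>{{cite web|url=https://example.com/b|title=B}}</ref>."
)
print(new_external_links(old_revision, new_revision))
```

Each URL returned this way would be a candidate for immediate archiving, in the same spirit as clicking the archive link right after adding a source.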
In addition to fighting 404 errors, this solution also protects against later changes to the content of the archived pages.
Wikiwix strictly respects copyright: archiving is done only with the author's approval, which is determined from the noarchive tag.
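
A minimal Python sketch of how an archiver could honor the noarchive tag before storing a page. It assumes the usual opt-out meaning of the tag, expressed as a `<meta name="robots" content="noarchive">` element in the page's HTML. The class and function names are hypothetical, and this is not Wikiwix's actual implementation.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives of every <meta name="robots"> tag in a page."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values can be None (e.g. <meta name>), so default to "".
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("name", "").lower() == "robots":
            for directive in attributes.get("content", "").split(","):
                self.directives.add(directive.strip().lower())


def may_archive(html):
    """Return False when the page opts out of archiving with 'noarchive'."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noarchive" not in parser.directives


opted_out = '<html><head><meta name="robots" content="noindex, noarchive"></head></html>'
open_page = "<html><head><title>Open page</title></head></html>"
print(may_archive(opted_out))
print(may_archive(open_page))
```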
Since 2015, I have been raising concerns about the deployment of the IA bot:
• 2015: https://meta.wikimedia.org/wiki/Community_Wishlist_Survey_2015/Bots_and_gadgets : the bot solution, with its modification of the template cache, is currently exclusive to the Wayback Machine.
• 2017: https://fr.wikipedia.org/wiki/Discussion_user:Pmartin#I_left_you_a_message! : the attempted collaboration was abandoned by the bot operator, and the bot was stopped following numerous false 404 detections.
The role of IABot is to detect links in Wikipedia that return 404 errors, to find an archive, with priority given to the Wayback Machine, and to modify the articles to replace the dead link.
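
The false detections mentioned at the start of this message arise when a blocked request is mistaken for a dead page. A minimal Python sketch of a more cautious classification follows. It is not IABot's logic: the function name, the three result labels, and the rule that a link counts as dead only when every probe returns 404 or 410 are assumptions chosen for illustration.

```python
def classify_link(status_codes):
    """Classify a URL from the HTTP status codes seen by several probes.

    A value of None stands for a probe that got no response (timeout, reset).
    """
    if not status_codes:
        return "unknown"
    if any(code is not None and 200 <= code < 400 for code in status_codes):
        return "alive"
    if all(code in (404, 410) for code in status_codes):
        return "dead"
    # 403, 429, 5xx or no response may mean the checker itself is blocked,
    # so the result stays undecided instead of being reported as a 404.
    return "unknown"


print(classify_link([404, 404, 404]))
print(classify_link([403, None, 403]))
print(classify_link([404, 200]))
```

Under these assumptions, a server that blacklists the checker yields "unknown" and the article is left untouched.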
This process is flawed because IABot allows only one archive URL to be stored across all languages, which greatly favors the Wayback Machine to the detriment of the different versions of the page. Instead, the template should link to a page that lists all of the available archives for a 404 page.
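
A minimal Python sketch of the idea of listing several archives for one dead link instead of storing a single archive URL. The function name and the two URL patterns are assumptions made for illustration and should be checked against each service's documentation; other archive services could be added to the dictionary in the same way.

```python
from urllib.parse import quote

# Lookup URL patterns are assumptions for this sketch, not verified
# specifications of either service.
ARCHIVE_PATTERNS = {
    "Wayback Machine": "https://web.archive.org/web/{raw}",
    "Wikiwix": "https://archive.wikiwix.com/cache/?url={encoded}",
}


def archive_candidates(url):
    """Return (service, lookup URL) pairs for one dead link."""
    encoded = quote(url, safe="")  # percent-encode for use as a query value
    return [
        (service, pattern.format(raw=url, encoded=encoded))
        for service, pattern in ARCHIVE_PATTERNS.items()
    ]


for service, lookup in archive_candidates("http://example.com/page?id=1"):
    print(f"{service}: {lookup}")
```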
A week has been set aside at the end of July 2020 to resolve the few stability problems that Wikiwix currently has; these are linked to the new solution, which consumes only 30 euros of electricity per month. We can also provide support that week for deploying the solution on the NL part of Wikipedia.

Could someone stop this bot? Otherwise, the false detection of links will spread to all projects.

Pascal Martin
_______________________________________________
Wikimedia-l mailing list, guidelines at: https://meta.wikimedia.org/wiki/Mailing_lists/Guidelines and https://meta.wikimedia.org/wiki/Wikimedia-l
New messages to: [hidden email]
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l, <mailto:[hidden email]?subject=unsubscribe>

Re: [Wikimedia-l] Internet Archive BOT

Effe iets anders
Hi Pascal, all,

This is being discussed here:
https://en.wikipedia.org/wiki/User_talk:Cyberpower678
The last response was June 16, and it seems to focus on geo-blocking as the cause of the
blacklisting (in case anyone feels called to help out the developer).

This bot performs incredible work and I hope it gets fixed soon!

Best,
Lodewijk

On Tue, Jun 23, 2020 at 5:04 AM Pascal Martin <[hidden email]> wrote:

> [...]

[Wikimedia-l] RE: Internet Archive BOT

Martin Pascal-2
Good evening Lodewijk,

I completely agree with you about the work done by this bot, but I had warned Danny Horn, the WMF employee who was leading the project for the WMF, that it was a dead end:
- energy consumption: is it the role of the WMF to host a bot that pollutes the web in order to detect 404 errors?
- modifying articles: is it the role of the WMF to host a bot that edits articles, at the risk of being considered a contributor?
- hosting of archives: is it the role of the WMF to provide an exclusive hosting solution for Internet Archive?

I set out these three points on my Wikipedia talk page; we are still waiting for a reply from the bot operator, and they will remain unanswered.

The solution in place through the Internet Archive bot in no way solves the problem of sources being modified. Such modifications are numerous, and they frequently affect sources linked from high-visibility Wikipedia articles.

The solution I have proposed is durable, non-exclusive, non-polluting, in the hands of the Wikipedia community, and does not degrade Wikipedia articles. The solution is there, but it is ignored by the Wikimedia authorities, to the detriment of the Foundation's own founding principles.

In short, I have only one employee and I am going to leave his mission, because the time we lost working for Internet Archive has never been acknowledged, and the time spent trying to collaborate has ended up degrading our solution.

The future will prove me right. I am not seeking a fortune; otherwise I would have sold the positions of Linterweb, which in the good old days managed the Kiwix project and the external link archiving project.

As a result, we can provide archives in the Zeno archive format, which offline solutions can use over the long term:
https://blog.wikiwix.com/2009/12/07/okawix-et-openzim/

So what do I do next week? Do I put my only employee to work on recovering archives for all the languages of Wikipedia? We have a single configuration setting to change, and Wikipedia would have a backup solution for external links, hosted in Europe in a data center managed with European funds.

These wars of influence wear me out, and I did not want to arm myself to fight them. My daughter will always remember that Linterweb was the archiver of French Wikipedia's external links for only 30 euros of energy per month: a small step for Linterweb, but a big step for the ecological transition that awaits us.


Regards,
"If I don't have a bad deal to bite into, I invent one, and after having liquidated it, I give up the credit to someone else, so I can continue to be myself, that is, no one. It's clever."
My name is Nobody.


From: effe iets anders
Sent: Wednesday, 24 June 2020 07:16
To: Wikimedia Mailing List
Subject: Re: [Wikimedia-l] Internet Archive BOT

[...]