Retrieving page source without editing


Retrieving page source without editing

Mark Wagner-2
I'm working on a bot to deal with the flood of no-source and untagged images
on the English Wikipedia.  My current design calls for, once a day,
downloading the upload log for the previous 24 hours, then checking each
image description page and adding a template as appropriate.  About 2000
images are uploaded each day, and only around 15% need tagging.  What's the
best way of getting the wikitext of an article if there's an 85% chance that
you won't be editing it?  Is Special:Export faster than starting an edit, or
is there some other method?

Thanks,
Mark
[[en:User:Carnildo]]
_______________________________________________
Wikibots-l mailing list
[hidden email]
http://mail.wikipedia.org/mailman/listinfo/wikibots-l

Re: Retrieving page source without editing

Marco Schuster
Mark Wagner wrote:

> I'm working on a bot to deal with the flood of no-source and untagged images
> on the English Wikipedia.  My current design calls for, once a day,
> downloading the upload log for the previous 24 hours, then checking each
> image description page and adding a template as appropriate.  About 2000
> images are uploaded each day, and only around 15% need tagging.  What's the
> best way of getting the wikitext of an article if there's an 85% chance that
> you won't be editing it?  Is Special:Export faster than starting an edit, or
> is there some other method?
>
> Thanks,
> Mark
> [[en:User:Carnildo]]
Use this code:
<?php
// Fetch the raw wikitext of a page via index.php?action=raw.
// MakeCookieString() and UpdateSessionCookie() are the bot's own
// session-cookie helpers.
function GetPageSource($page) {
        $wikiIndexPHP = "/w/index.php";
        $wikiSrv = "en.wikipedia.org";
        $fp = fsockopen($wikiSrv, 80, $errno, $errstr, 30);
        if (!$fp) {
                echo "$errstr ($errno)<br />\n";
                return false;
        }
        // HTTP header lines must end in CRLF; "Connection: close" lets
        // the feof() loop below terminate once the server is done sending.
        fputs($fp, "GET " . $wikiIndexPHP . "?title=" . urlencode($page) . "&action=raw HTTP/1.0\r\n"
                . "Host: " . $wikiSrv . "\r\n"
                . "User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.8.0.1) Gecko/20060111 Firefox/1.5.0.1\r\n"
                . "Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5\r\n"
                . "Accept-Language: de-de,de;q=0.8,en-us;q=0.5,en;q=0.3\r\n"
                . "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7\r\n"
                . "Cookie: " . MakeCookieString() . "\r\n"
                . "Cache-Control: max-age=0\r\n"
                . "Connection: close\r\n\r\n");
        // Read the full response.
        $buf = "";
        while (!feof($fp)) {
                $buf .= fgets($fp, 128);
        }
        fclose($fp);
        UpdateSessionCookie($buf);
        // The body (the wikitext) starts after the blank line that ends
        // the response headers.
        if (preg_match("/\r\n\r\n(.*)$/s", $buf, $hit)) {
                return $hit[1];
        }
        return false;
}
?>
When you call this function, it returns the page's raw wikitext.
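
For comparison, the URL that the PHP above requests can be built with Python's standard library. This sketch only constructs the URL (a real fetch would go through urllib.request plus the bot's own cookie handling, which is omitted here):

```python
from urllib.parse import urlencode

def raw_url(page, server="en.wikipedia.org", index_php="/w/index.php"):
    """Build the index.php?action=raw URL used in the PHP snippet above."""
    query = urlencode({"title": page, "action": "raw"})
    return "http://%s%s?%s" % (server, index_php, query)

# e.g. raw_url("Image:Foo bar.jpg")
# -> http://en.wikipedia.org/w/index.php?title=Image%3AFoo+bar.jpg&action=raw
```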

Greets,

Marco
[[:de:User:HardDisk]]

Re: Retrieving page source without editing

Daniel Herding
In reply to this post by Mark Wagner-2
Mark Wagner wrote:
> I'm working on a bot to deal with the flood of no-source and untagged images
> on the English Wikipedia.  My current design calls for, once a day,
> downloading the upload log for the previous 24 hours, then checking each
> image description page and adding a template as appropriate.

Sounds useful. Are you using the Python Wikipedia Bot Framework? If so,
we should add it to the repository as soon as your script is working.

> About 2000
> images are uploaded each day, and only around 15% need tagging.  What's the
> best way of getting the wikitext of an article if there's an 85% chance that
> you won't be editing it?

My suggestion is that you first add a method called newimages() to the
Site class. If you haven't already done it, you can copy Site.newpages()
and modify it to make it look up new images.

Then you can add a generator called NewImagesGenerator() to
pagegenerators.py. It will look a bit like AllpagesPageGenerator.

When you then have such a generator, you can just wrap the existing
PreloadingGenerator around it, and it will do all the work for you.
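
The pipeline described above can be sketched with plain Python generators. The names below (new_images_generator, preloading_generator) and the stub log entries are stand-ins to show the shape only; the real pywikibot Site.newpages() / PreloadingGenerator APIs differ:

```python
def new_images_generator(log_entries):
    """Yield one page title per upload-log entry (stands in for a
    Site.newimages()-based NewImagesGenerator)."""
    for entry in log_entries:
        yield entry["title"]

def preloading_generator(titles, batch_size=50):
    """Group titles into batches, so the text for a whole batch can be
    fetched in one Special:Export request (stands in for
    pagegenerators.PreloadingGenerator)."""
    batch = []
    for title in titles:
        batch.append(title)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

# Stub data: 120 fake upload-log entries -> batches of 50, 50, and 20.
log = [{"title": "Image:%d.jpg" % i} for i in range(120)]
batches = list(preloading_generator(new_images_generator(log), batch_size=50))
```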

> Is Special:Export faster than starting an edit, or
> is there some other method?

Using Special:Export is much faster because it lets you load several
pages in a single request, so you make far fewer requests to the
server. That's especially true in this case, where you don't need edit
tokens and the like for most of the pages.
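
A minimal sketch of consuming an Export response, assuming the page / title / revision / text element structure of the export XML (a real response also carries an xmlns declaration, which the local-name matching below tolerates; the sample document here is hand-made for illustration):

```python
import xml.etree.ElementTree as ET

SAMPLE = """<mediawiki>
  <page>
    <title>Image:Example.jpg</title>
    <revision><text>{{no source}}</text></revision>
  </page>
  <page>
    <title>Image:Other.png</title>
    <revision><text>{{GFDL}}</text></revision>
  </page>
</mediawiki>"""

def parse_export(xml_text):
    """Map page title -> wikitext from a Special:Export document.
    Tags are matched by local name so a namespaced response would
    not break the lookup."""
    pages = {}
    root = ET.fromstring(xml_text)
    for page in root:
        if not page.tag.endswith("page"):
            continue
        title, text = None, None
        for el in page.iter():
            local = el.tag.rsplit("}", 1)[-1]
            if local == "title":
                title = el.text
            elif local == "text":
                text = el.text
        pages[title] = text
    return pages
```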

If you need help, just send me what you've got and I can help you.


Daniel

Re: Retrieving page source without editing

Andre Engels
In reply to this post by Mark Wagner-2
2006/3/11, Mark Wagner <[hidden email]>:
> I'm working on a bot to deal with the flood of no-source and untagged images
> on the English Wikipedia.  My current design calls for, once a day,
> downloading the upload log for the previous 24 hours, then checking each
> image description page and adding a template as appropriate.  About 2000
> images are uploaded each day, and only around 15% need tagging.  What's the
> best way of getting the wikitext of an article if there's an 85% chance that
> you won't be editing it?  Is Special:Export faster than starting an edit, or
> is there some other method?

Special:Export allows you to get more than one page at once, which
speeds up the loading of pages considerably. Done the normal way, this
job would mean 2300 requests to the server (2000 page loads and 300
edits); done through Special:Export with, say, 50 pages at a time (100
or 200 also work without problems), you're down to 340 requests.
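
The arithmetic above, as a quick check (one Export request per batch of pages, plus one request per edit):

```python
import math

def request_count(uploads, edits, batch_size):
    """Total server requests: one Special:Export request per batch of
    page loads, plus one request per edit."""
    return math.ceil(uploads / batch_size) + edits

# Figures from the thread: 2000 uploads/day, ~15% (300) needing an edit.
# batch_size=1 is the "normal way": one request per page.
```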

If you use the Python Wikipediabot framework (and if you haven't done
much programming yet, I would advise that, since it gives you many
useful things already programmed in), there is a method site.getall()
to do this.

Andre Engels

Re: Retrieving page source without editing

Marco Schuster
Andre Engels wrote:
> If you use the Python Wikipediabot framework (and if you haven't done
> much programming yet, I would advise that, since it gives you many
> things already programmed-in that could be useful), there is a method
> site.getall() to do this.
Not everyone is experienced enough to get it running.

Marco

Re: Retrieving page source without editing

Mark Wagner-2
In reply to this post by Daniel Herding
On 3/11/06, Daniel Herding <[hidden email]> wrote:

>
> Mark Wagner wrote:
> > I'm working on a bot to deal with the flood of no-source and untagged
> > images
> > on the English Wikipedia.  My current design calls for, once a day,
> > downloading the upload log for the previous 24 hours, then checking each
> > image description page and adding a template as appropriate.
>
> Sounds useful. Are you using the Python Wikipedia Bot Framework? If so,
> we should add it to the repository as soon as your script is working.


I probably should have mentioned that I'm using Perl, with a framework based
on the code that Pearle uses to access Wikipedia.

> > Is Special:Export faster than starting an edit, or
> > is there some other method?
>
> Using Special:Export is much faster because it allows you to load
> several pages at the same time. So you need much less requests to the
> server. Especially in this case, where you don't need edit tokens and
> stuff for most of the pages.


Sounds like the best way to go, then.  Thanks.

--
Mark
[[en:User:Carnildo]]

Re: Retrieving page source without editing

Daniel Herding
>> Mark Wagner wrote:
>>> I'm working on a bot to deal with the flood of no-source and untagged
>>> images on the English Wikipedia.  My current design calls for, once a day,
>>> downloading the upload log for the previous 24 hours, then checking each
>>> image description page and adding a template as appropriate.

You should talk to Anders Wegge Jakobsen <[hidden email]> who just wrote
on [hidden email] that he wants to do
something very similar, using the Python Wikipedia Bot Framework:

 > I'm working on a script for tagging images without license
 > information for the Danish Wikipedia