(Large) Discrepancies between API-Statistics and Stepwise API-based extraction [Update]


Rüdiger Gleim
Hello,

thanks to Brad for the information regarding the discrepancies between
the actual numbers fetched from the API and the statistics, which stem
from a dedicated table that is not necessarily in sync.

Running

initSiteStats.php --update --active --use-master

solved most issues. Since I neither own the wiki nor have access to the
system, I politely asked the admin to run it.

What remains is a gap between the number of edits/revisions from the
statistics page and the numbers I get when iterating over the pages and
fetching queries via:

http://wiki.muenster.org/api.php?action=query&prop=revisions&pageids=2085&rvprop=ids|user|comment|timestamp&rvlimit=100&format=xml

The numbers of pages and users do match, so I am sure I do not miss any pages.

Statistics report 37660 revisions, querying via API over all pages sums
up to 32659, thus missing 5001 revisions.
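
For illustration, the comparison above can be sketched in a few lines of
Python. This is only a sketch under assumptions: it supposes the statistics
were fetched with action=query&meta=siteinfo&siprop=statistics&format=json,
and that the per-page revision counts were collected separately. The function
name revision_gap and the shape of revisions_per_page are hypothetical and
not taken from the thread.

```python
from typing import Dict


def revision_gap(siteinfo_response: dict, revisions_per_page: Dict[int, int]) -> int:
    """Return reported edits minus the revisions actually counted via the API.

    siteinfo_response: parsed JSON from meta=siteinfo&siprop=statistics (assumed).
    revisions_per_page: hypothetical mapping of page id -> revisions counted.
    """
    # The statistics block carries the cached "edits" counter.
    reported = siteinfo_response["query"]["statistics"]["edits"]
    # Sum of what the stepwise extraction actually returned.
    counted = sum(revisions_per_page.values())
    return reported - counted
```

A positive result means the statistics table reports more edits than the
stepwise extraction found.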

Running initSiteStats.php did also update the Statistics report from
46329 to 37660.

=> Is my assumption correct that the sum of all revisions queried via
api.php?action=query&prop=revisions&pageids= should match the statistics
number of revisions?
=> At first I also queried for content which resulted in 968 cases where
I lacked permissions to query. So I omitted the content for test
purposes. But this did not solve the problem.
=> What is the difference between continue=|| and rvcontinue=123456?
Until now I have only been using rvcontinue. Including continue did not
make a difference, but I could not find out what the meaning of
continue=|| is.

Thank you (again) for your time and support.

Best Wishes,

Rüdiger


_______________________________________________
Wikitech-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: (Large) Discrepancies between API-Statistics and Stepwise API-based extraction [Update]

Brad Jorsch (Anomie)
On Fri, Feb 2, 2018 at 4:21 AM, Rüdiger Gleim <[hidden email]> wrote:

> => Is my assumption correct that the sum of all Revisions queried via
> api.php?action=query&prop=revisions&pageids= should match the statistics
> number of revisions?
>

No, the "edits" number also includes deleted revisions.

Another possibility is revisions that aren't attached to a valid page, if
some bug allowed that to happen.


> => What is the difference between continue=|| and rvcontinue=123456? Until
> now I have only been using rvcontinue. Including continue did not make a
> difference, but I could not find out what the meaning of continue=|| is.
>

It won't make a difference in modern MediaWiki. On outdated wikis running
MediaWiki 1.21 to 1.25, it will change the format in which the continuation
data is returned. In any case, when you're manually adding it you should
use an empty value rather than "||".

Note that the values of all continuation parameters, whether old or new format,
should be treated as opaque tokens by clients (even though they usually
have obvious structure). The API may change the format of these
continuation tokens at any time without warning, and this will not be
considered a breaking change.

The old format for continuation returns data in a query-continue node to be
combined with the previous requests' parameters. When using generators or
multiple query submodules, the client has to do some non-obvious processing
of that continuation data to avoid missing data or looping. See
https://www.mediawiki.org/wiki/API:Raw_query_continue for details.

In 1.21, a new, easier to use format was introduced, enabled by passing an
empty "continue" parameter when making the initial query. In this mode the
API handles the tricky parts of generators and multiple query submodules
for you; all you have to do is combine everything under the returned
"continue" node with the original request's parameters. This was made the
default in MediaWiki 1.26. See
https://www.mediawiki.org/wiki/API:Query#Continuing_queries for details.
New clients should use this new format since it's much harder to handle it
incorrectly.
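
A minimal sketch of the new-format continuation loop described above, in
Python. The names query_all and count_revisions, and the idea of passing in
a fetch callable (any function that takes a parameter dict and returns the
parsed JSON response), are assumptions made for illustration and are not part
of MediaWiki. The sketch also assumes format=json with the default layout,
where "pages" is a mapping keyed by page id.

```python
from typing import Callable, Dict, Iterator

Fetch = Callable[[Dict[str, str]], dict]


def query_all(fetch: Fetch, params: Dict[str, str]) -> Iterator[dict]:
    """Yield the "query" node of every response until no "continue" is returned."""
    request = dict(params)
    # An empty "continue" opts in to the new format on MediaWiki 1.21 to 1.25;
    # on newer versions it is already the default.
    request["continue"] = ""
    while True:
        response = fetch(request)
        if "query" in response:
            yield response["query"]
        cont = response.get("continue")
        if not cont:
            break
        # Treat the continuation values as opaque tokens: merge everything
        # under "continue" into the original parameters without inspecting it.
        request = {**params, **cont}


def count_revisions(fetch: Fetch, pageid: int) -> int:
    """Count all revisions of one page (hypothetical helper)."""
    params = {
        "action": "query",
        "prop": "revisions",
        "pageids": str(pageid),
        "rvprop": "ids",
        "rvlimit": "100",
        "format": "json",
    }
    total = 0
    for chunk in query_all(fetch, params):
        for page in chunk.get("pages", {}).values():
            total += len(page.get("revisions", []))
    return total
```

In practice fetch would wrap an HTTP GET against api.php; keeping it as a
parameter lets the loop be exercised against canned responses first.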

The "continue=||" is part of the new format's handling of the tricky bits
of generators and multiple query modules.

--
Brad Jorsch (Anomie)
Senior Software Engineer
Wikimedia Foundation