Migrating to "dumb query-continue"

Migrating to "dumb query-continue"

Yuri Astrakhan
Hi everyone, there seem to have been many great changes in the API, so I decided to take a look at improving my old bots a bit, together with the rest of the pywiki framework. While looking, a few thoughts and questions have occurred that I hope someone could comment on.

I have been out of the loop for a long time, so do forgive me if I misunderstand some recent changes and how they are supposed to work, or if this is a non-issue. Also, I apologise if these issues have already been discussed and/or resolved.


My first idea for this email is "dumb continue":

Can we change continuation so that clients are not required to understand the continue parameters once the first request has been made? This way a user of a client library could iterate over query results without knowing how continuation works, and the library would not need to understand what to do with each parameter (iterator scenario):

for datablock in mwapi.Query(generator='allpages', prop='links|categories', **otherParams):
    # Process the returned data blocks one at a time

The way it is done now, the Query() method must understand continuation in depth: which parameters to look at first, which at second, and how to handle the case where there are no more links while there are still more categories to enumerate.
There is also high bug potential -- if there are no more links, the API returns just two continues - clcontinue & gapcontinue - which means that if the client makes the same request with the two additional "continue" parameters, the API will return the same result again, possibly producing duplicates and consuming extra server resources.


Proposal:
The Query() method from above should be able to take ALL continue values and append ALL of them to the next query, without knowing anything about them and without removing or changing any of the original request parameters. Query() would repeat this until the server returns a data block with no <query-continue> section.
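
For illustration, a minimal sketch of the client loop this would allow, assuming the server adopts the proposed semantics (the requests-based helper and the parameter values below are illustrative only, not an existing library):

import requests

API_URL = 'https://en.wikipedia.org/w/api.php'   # example endpoint

def query(**request):
    """Yield <query> data blocks, blindly echoing back every continue value."""
    base = dict(request, action='query', format='json')
    continues = {}
    while True:
        data = requests.get(API_URL, params=dict(base, **continues)).json()
        yield data.get('query', {})
        if 'query-continue' not in data:
            break
        # Proposed behavior: carry over ALL continue values as-is, without
        # knowing which module each one belongs to.
        continues = {key: value
                     for module in data['query-continue'].values()
                     for key, value in module.items()}

for datablock in query(generator='allpages', prop='links|categories',
                       gaplimit=10, pllimit='max', cllimit='max'):
    pass  # process each returned data block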

Also, because "page" objects might be incomplete across different data blocks, the user might need to know when a complete "page" object has been returned. The API should probably introduce an "incomplete" attribute on the page to indicate that the client should merge it with pages with the same ID from the following data blocks, until the "incomplete" flag is no longer present. The page's revision number could be used on the client to see if the page has changed between calls:

for page in mwapi.QueryCompletePages(...):  # same parameters as the example above
    # process each page
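
A rough sketch of how a library could do the merging, assuming the proposed "incomplete" attribute existed (query() is the helper sketched above; none of this reflects current API behavior):

def query_complete_pages(**request):
    """Merge page fragments across data blocks; yield each page only once it is complete."""
    pending = {}                                    # pageid -> partially merged page
    for block in query(**request):                  # reuses the query() sketch above
        for pageid, page in block.get('pages', {}).items():
            merged = pending.setdefault(pageid, {})
            for key, value in page.items():
                if isinstance(value, list):         # e.g. links, categories: accumulate
                    merged.setdefault(key, []).extend(value)
                else:                               # e.g. title, lastrevid: overwrite
                    merged[key] = value
            if 'incomplete' not in page:            # proposed flag absent: page is done
                merged.pop('incomplete', None)      # drop the stale flag from earlier fragments
                yield pending.pop(pageid)
    for leftover in pending.values():               # should not happen, but don't lose data
        yield leftover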

API Implementation details:
In the example above, where we have a generator & two properties, the generator's continue would be set to the very first item that had any of its properties incomplete. The property continues would work as before, except that if there are no more categories, clcontinue is set to some magic value like '|' to indicate that it is done and that no more SQL requests to the categories tables are needed on subsequent calls.
The server should not return the maximum number of pages from the generator if the properties enumeration has not reached them yet (e.g. with generatorLimit=max & linksLimit=1, it would return just the first page with one link per response).

Backwards compatibility:
This change might impact any client that uses the presence of the "plcontinue" or "clcontinue" fields as a signal not to use the next "gapcontinue" yet. The simplest (and long overdue) solution is to add a "version=" parameter.

While we are at it, we might want to expand action=paraminfo to include meaningful version data. Better yet, add a new "moduleinfo" action that returns any requested specifics about each module, e.g.:
action=moduleinfo & modules=parse|query|query+allpages & props=version|params



Thanks! Please let me know what you think.

--Yuri


Re: Migrating to "dumb query-continue"

Platonides
On 15/12/12 11:37, Yuri Astrakhan wrote:
> Hi everyone, there seem to have been many great changes in the API,
> so I decided to take a look at improving my old bots a bit, together
> with the rest of the pywiki framework. While looking, a few thoughts
> and questions have occured that I hope someone could comment on.

Hi Yuri!
It's nice to see you.


> *Proposal:*
> Query() method from above should be able to take ALL continue values and
> append ALL of them to the next query, without knowing anything about
> them, and without removing or changing any of the original request
> parameters. Query() will do this until server returns a data block with
> no more <query-continue> section.
+1


I'm not sure about the incomplete-page case you mention, can you
provide an example?

> Also, because the "page" objects might be incomplete between different
> data blocks, the user might need to know when a complete "page" object
> is returned. API should probably introduce an "incomplete" attribute on
> the page to indicate that the client should merge it with the page from
> the following data blocks with the same ID until there is no more
> "incomplete" flag. Page revision number could be used on the client to
> see if the page has been changed between calls:


Re: Migrating to "dumb query-continue"

Yuri Astrakhan
Hi Platonides! Good to be back :)

By incomplete I meant that when you run an allpages generator and request links, the pllimit applies to the total number of links combined from all pages. That means that if there are 20 pages with 3 links each and pllimit=10, the first result will have 3 complete pages and one page with just one link out of three.

The next result will return the missing 2 links for that page, two more complete pages, and another page with two links out of three. So my proposal is to mark those partial pages "incomplete", so that users can handle this intelligently and not assume that whatever links are listed are all the links there are.



Re: Migrating to "dumb query-continue"

Brad Jorsch (Anomie)
In reply to this post by Yuri Astrakhan
On Sat, Dec 15, 2012 at 5:37 AM, Yuri Astrakhan <[hidden email]> wrote:
> My first idea for this email is "dumb continue":

Continuing *is* confusing. In fact, I think you have made an error in
your example:

> Now there is even a high bug potential -- if there are no more links, API
> returns just two continues - clcontinue & gapcontinue - which means that if
> the client makes the same request with the two additional "continue"
> parameters, API will return the same result again, possibly producing
> duplicate errors and consuming extra server resources.

Actually, if the client makes the request with both the clcontinue and
gapcontinue parameters, it will wind up skipping some results.

Say gaplimit was 3, so the original query returns pages A, B, and C
but manages to include only the categories for A and B. A correct
continue would return the remaining categories for B and C. But if you
include gapcontinue, you'll instead get pages D, E, and F and never
see the categories for C.

> Proposal:
> Query() method from above should be able to take ALL continue values and
> append ALL of them to the next query, without knowing anything about them,
> and without removing or changing any of the original request parameters.
> Query() will do this until server returns a data block with no more
> <query-continue> section.

That would be quite a change. It would mean the API wouldn't return
gapcontinue at all until plcontinue and clcontinue are both exhausted,
and then would keep returning the *old* gapcontinue until plcontinue
and clcontinue are both exhausted again.

This would break some possible use cases which I'm not entirely sure
we should break. For example, I can imagine a bot that would use
generator=foo&gfoolimit=1&prop=revisions, follow rvcontinue until it
finds whichever revision it is looking for, and then ignore rvcontinue
in favor of gfoocontinue to move on to the next page. With "dumb
continue", it wouldn't be able to do that.


If I were to redesign continuing right now, I'd just structure it a
little more. Instead of something like this, which we get now:

  <query-continue>
    <links plcontinue="..." />
    <categories clcontinue="..." gclcontinue="..." />
    <watchlist wlstart="..." />
    <allmessages amfrom="..." />
  </query-continue>

I'd return something like this:

  <query-continue>
    <prop>
      <links plcontinue="..." />
      <categories clcontinue="..." />
    </prop>
    <generator>
      <categories gclcontinue="..." />
    </generator>
    <list>
      <watchlist wlstart="..." />
    </list>
    <meta>
      <allmessages amfrom="..." />
    </meta>
  </query-continue>

The client would still have to know how to manipulate
list=/meta=/generator=/prop=, particularly when using more than one of
these in the same query. But the rules would be simpler: it wouldn't
have to know that gclcontinue is for generator=categories while
clcontinue is for prop=categories, and it would be easy to know
exactly what to include in prop= when continuing to avoid repeated
results.
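
For illustration, a client could process this proposed structured layout roughly as follows (the dict mirrors the XML above; restoring the full prop= list when following a generator continue is left out for brevity, and nothing here is an existing API feature):

def next_request(original, query_continue):
    """Build the follow-up request from the proposed structured query-continue, e.g.
    {'prop': {'links': {'plcontinue': '...'}, 'categories': {'clcontinue': '...'}},
     'list': {'watchlist': {'wlstart': '...'}},
     'generator': {'categories': {'gclcontinue': '...'}}}"""
    params = dict(original)
    for kind, modules in query_continue.items():    # 'prop', 'list', 'meta', 'generator'
        if kind in ('prop', 'list', 'meta'):
            # re-request only the modules that still have something to continue
            params[kind] = '|'.join(modules)
        for cont in modules.values():
            params.update(cont)                     # plcontinue=..., wlstart=..., gclcontinue=...
    return params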

> API Implementation details:
> In the example above where we have a generator & two properties, the next
> continue would be set to the very first item that had any of the properties
> incomplete. The properties continue will be as before, except that if there
> is no more categories, clcategory is set to some magic value like '|' to
> indicate that it is done and no more SQL requests to categories tables are
> needed on subsequent calls.
> The server should not return the maximum number of pages from the generator,
> if properties enumeration have not reached them yet (e.g. if
> generatorLimit=max & linksLimit=1 -> will return just the first page with
> one link on each return)

You can't get away with changing the generator's continue like that
and still get correct results, because you can't assume the generator
generates pages in the same order that every prop module processes
them. Nor can you assume each prop module will process pages in the
same order. For example, many prop modules order by page_id but may be
ASC or DESC depending on their "dir" parameter.

IMO, if a client wants to ensure it has complete results for any page
objects in the result, it should just process all of the prop
continuation parameters to completion.

> Backwards compatibility:
> This change might impact any client that will use the presence of the
> "plcontinue" or "clcontinue" fields as a guide to not use the next
> "gapcontinue".

That at least is easy enough to avoid: when all non-generator
continues are at whatever magic value means "ignore", then don't
output any of them. You have to be able to detect this anyway in order
to know when to output the new value for the generator's continue.

A less solvable problem is the one I raised above.


Re: Migrating to "dumb query-continue"

Yuri Astrakhan
Assume a wiki has pages A and B with links and categories: A(l1,l2,l3,l4,l5,c1,c2,c3), B(l1,c1). This is how the API behaves now:

1 req)  prop=categories|links & generator=allpages & gaplimit=1 & pllimit=2 & cllimit=2
1 res)  A(l1,l2,c1,c2), gapcontinue=B, plcontinue=l3, clcontinue=c3

The client ignores gapcontinue because there are other continues, and adds the pl & cl continues:
2 req)  initial & plcontinue=l3 & clcontinue=c3
2 res)  A(l3,l4,c3), gapcontinue=B, plcontinue=l5

This is where the *potential* for the bug is: the client must understand that since there is no more clcontinue but there is still a plcontinue, there are no more categories in this set of pages, so it should not ask for prop=categories until it finishes with plcontinue. Once done, it should resume prop=categories and also add gapcontinue=B.

3 bad req)  initial & plcontinue=l5
3 bad res)  A(l5,c1,c2), gapcontinue=B, clcontinue=c3

3 good req)  initial but with prop=links only & plcontinue=l5
3 good res)  A(l5) & gapcontinue=B

4 req) initial & gapcontinue=B
4 res) B(l1,c1)  -- done
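
For illustration, the correct client-side handling of this case under the current behavior looks roughly like the following sketch (api_get() is an assumed low-level helper, and the module names are hard-coded for this links/categories example):

def query_all(initial, props=('links', 'categories'), generator='allpages'):
    """Correct continuation under the CURRENT behavior: exhaust the prop continues
    for the current page set before advancing the generator."""
    params = dict(initial)
    while True:
        result = api_get(params)                    # assumed low-level request helper
        yield result
        qc = result.get('query-continue', {})
        unfinished = [m for m in props if m in qc]
        gap = qc.get(generator)                     # present while another page set exists
        params = dict(initial)
        if unfinished:
            # still finishing this page set: query only the unfinished props and
            # do NOT send gapcontinue yet, otherwise results would be skipped
            params['prop'] = '|'.join(unfinished)
            for m in unfinished:
                params.update(qc[m])
        elif gap:
            params.update(gap)                      # all props done: move to the next set
        else:
            return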

I think this puts too much unneeded burden on the client code to handle these cases correctly. Instead, the API should be simplified to return clcontinue=| in result #2, and results 1 and 2 should have gapcontinue=A. The client could then simply merge all returned continue values into the following request, which greatly simplifies the code for the most common "get everything I requested" scenario and hence should be the default behavior:

1 req)  prop=categories|links & generator=allpages & gaplimit=1 & pllimit=2 & cllimit=2
1 res)  A(l1,l2,c1,c2), gapcontinue=, plcontinue=l3, clcontinue=c3

2 req)  initial & gapcontinue= & plcontinue=l3 & clcontinue=c3
2 res)  A(l3,l4,c3), gapcontinue=, plcontinue=l5, clcontinue=|

3 req)  initial & gapcontinue= & plcontinue=l5 & clcontinue=|
3 res)  A(l5) & gapcontinue=B, plcontinue=, clcontinue=

4 req) initial & gapcontinue=B & plcontinue= & clcontinue=
4 res) B(l1,c1)  -- no continue section, done


That would be quite a change. It would mean the API wouldn't return
gapcontinue at all until plcontinue and clcontinue are both exhausted,
and then would keep returning the *old* gapcontinue until plcontinue
and clcontinue are both exhausted again.

Correct. The API would return an empty gapcontinue until it finishes with the first set, then it would return the beginning of the next set until that is exhausted as well, etc.
 
This would break some possible use cases which I'm not entirely sure
we should break. For example, I can imagine a bot that would use
generator=foo&gfoolimit=1&prop=revisions, follow rvcontinue until it
finds whichever revision it is looking for, and then ignore rvcontinue
in favor of gfoocontinue to move on to the next page. With "dumb
continue", it wouldn't be able to do that.


I do not think the API should support the case you described with gaplimit=1, because that fundamentally breaks the original API goal of "get data about many pages with lots of elements on them in one request". I would prefer the client do two separate queries: 1) list the pages, 2) many "list revisions for page X" queries. Using a generator with gaplimit=1 does not improve server performance or minimize traffic.

But even if we do find compelling reasons to include that, for the advanced scenario "skip the subquery and follow on with the generator" it might make sense to introduce an appendable "|next" keyword (gapcontinue=A|next) or a gcommand=skipcurrent parameter. I am not sure it is the cleanest solution, but it is certainly cleaner than forcing every client out there to implement the complex logic above for all the common cases.

1 req)  prop=categories|links & generator=allpages & gaplimit=1 & pllimit=2 & cllimit=2
1 res)  A(l1,l2,c1,c2), gapcontinue=, plcontinue=l3, clcontinue=c3

The client decides it does not need anything else from A, so it appends |next to gapcontinue. The API then ignores all other property continues.
2 req)  initial & gapcontinue=|next, plcontinue=l3, clcontinue=c3
2 res)   B(l1,c1) -- done

The client would still have to know how to manipulate
list=/meta=/generator=/prop=, particularly when using more than one of
these in the same query. But the rules are simpler, it wouldn't have
to know that gclcontinue is for generator=categories while clcontinue
is for prop=categories, and it would be easy to know what exactly to
include in prop= when continuing to avoid repeated results.

Complex client logic is exactly what I am trying to avoid. Ideally all "continue" values would be joined into a single "query-continue=magic-value" with no interesting user-passable properties.
 
You can't get away with changing the generator's continue like that
and still get correct results, because you can't assume the generator
generates pages in the same order every prop module processes them.
Nor can you assume each prop module will process pages in the same
order. For example, many prop modules order by page_id but may be ASC
or DESC on their "dir" parameter.

Totally agree - I forgot about the sub-ordering. So we simply keep the same gapcontinue until the set is exhausted. The key here is that if we do not let the client manipulate the continue parameters, the server could later be optimized to return fewer results when they cannot yet be fully populated.

 
IMO, if a client wants to ensure it has complete results for any page
objects in the result, it should just process all of the prop
continuation parameters to completion.

The result set might be huge. It wouldn't be nice to have a 12GB, x64-only client lib requirement :)


Re: Migrating to "dumb query-continue"

Brad Jorsch (Anomie)
On Tue, Dec 18, 2012 at 10:03 AM, Yuri Astrakhan
<[hidden email]> wrote:
> I do not think API should support the case you described with gaplimit=1,
> because that fundamentally breaks the original API goal of "get data about
> many pages with lots of elements on them in one request".

Oh? I thought the goal of the API was to provide a machine-usable
interface to MediaWiki so people don't have to screen-scrape the HTML
pages, which alleviates the worry about whether changes to the user
interface are going to break screen-scrapers. I never knew it was all
about *bulk* data access *only*.

> But even if we do find compelling reasons to include that, for the advanced
> scenario "skip subquery and follow on with the generator" it might make
> sense to introduce appendable "|next" value keyword gapcontinue=A|next

How do things decide whether "foocontinue=A|next" is saying "the next
foocontinue after A" or really means "A|next"? For example,
https://en.wiktionary.org/w/api.php?action=query&titles=secundus&prop=links&pllimit=1&plcontinue=46486|0|nautical
currently returns plcontinue "46486|0|next".

Or are you proposing every module be individually coded to recognize
this "|next"?

> Ideally all
> "continue" values should be joined into a single "query-continue =
> magic-value"  of no interesting user-passable properties.

So clients can make absolutely no decisions about processing the data
they get back? No thanks.

Why not propose adding something like that as an option, instead of
trying to force everyone to do things your way? Say have a parameter
dumbcontinue=1 that replaces query-continue with

  <query-dumb-continue>prop=links|categories&plcontinue=...&clcontinue=...&wlstart=...&allmessages=...</query-dumb-continue>

Entirely compatible.
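
For illustration, a client loop for this hypothetical option might look like the following sketch (dumbcontinue and query-dumb-continue are only proposed here, not existing parameters, and api_get() is an assumed helper):

from urllib.parse import parse_qsl

def dumb_query(initial):
    """Iterate a query with the hypothetical dumbcontinue=1 option."""
    params = dict(initial, dumbcontinue=1)
    while True:
        result = api_get(params)                    # assumed low-level request helper
        yield result
        cont = result.get('query-dumb-continue')
        if cont is None:
            return
        # the continue value is an opaque query-string fragment: just merge it in
        params = dict(initial, **dict(parse_qsl(cont)))
        params['dumbcontinue'] = 1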

>> IMO, if a client wants to ensure it has complete results for any page
>> objects in the result, it should just process all of the prop
>> continuation parameters to completion.
>
> The result set might be huge. It wouldn't be nice to have a 12GB x64 only
> client lib requirement :)

Then use a smaller limit on your generator. And don't do this for
prop=revisions&rvprop=content.


Re: Migrating to "dumb query-continue"

Yuri Astrakhan

> I do not think API should support the case you described with gaplimit=1,
> because that fundamentally breaks the original API goal of "get data about
> many pages with lots of elements on them in one request".

Oh? I thought the goal of the API was to provide a machine-usable
interface to MediaWiki so people don't have to screen-scrape the HTML
pages, which alleviates the worry about whether changes to the user
interface are going to break screen-scrapers. I never knew it was all
about *bulk* data access *only*.

Brad, the API has a clear goal (at least that was my goal when I wrote it): to provide access to all the functionality the wiki UI offers. The point here is how *easy* and at the same time how *efficient* the API is. Continuation has gotten too complex in the past few years for client libraries to use efficiently when combining multiple requests. The example you gave seems very uncommon, and it can easily be solved by making one extra API call to get the list of articles first - which in a case of O(N) requests would only make it O(N+1), still O(N). That's why I don't think we should even go into the "|next" ability - it would be used very rarely and can be done easily with another call without a generator. See below on the iteration point.


> But even if we do find compelling reasons to include that, for the advanced
> scenario "skip subquery and follow on with the generator" it might make
> sense to introduce appendable "|next" value keyword gapcontinue=A|next

How do things decide whether "foocontinue=A|next" is saying "the next
foocontinue after A" or really means "A|next"? For example,
https://en.wiktionary.org/w/api.php?action=query&titles=secundus&prop=links&pllimit=1&plcontinue=46486|0|nautical
currently returns plcontinue "46486|0|next".

Or are you proposing every module be individually coded to recognize
this "|next"?


Again, unless there are good usage scenarios for it, I don't think we need this "|next" feature - it was a "just in case" idea, which I doubt we will need.

 
> Ideally all
> "continue" values should be joined into a single "query-continue =
> magic-value"  of no interesting user-passable properties.

So clients can make absolutely no decisions about processing the data
they get back? No thanks.

When you make a SQL query to the server, you don't get to control the "continue" process. You can stop and make another query with different initial parameters. The same goes for iterating through a collection - none of the programming languages offering IEnumerable provide stream-control functionality; it's too complicated without clear benefits. The API can be seen as a server returning a stream, with some "continue" parameter. If you don't like the result, you make another query. That's how you control it. Documenting the "continue" properties is a sure way to over-complicate API usage and to remove the server's ability to optimize the process in the future, without adding any significant benefit.
 

Why not propose adding something like that as an option, instead of
trying to force everyone to do things your way? Say have a parameter
dumbcontinue=1 that replaces query-continue with

  <query-dumb-continue>prop=links|categories&plcontinue=...&clcontinue=...&wlstart=...&allmessages=...</query-dumb-continue>

Entirely compatible.

This might be a good solution. Need community feedback on this.
 

>> IMO, if a client wants to ensure it has complete results for any page
>> objects in the result, it should just process all of the prop
>> continuation parameters to completion.
>
> The result set might be huge. It wouldn't be nice to have a 12GB x64 only
> client lib requirement :)

Then use a smaller limit on your generator. And don't do this for
prop=revisions&rvprop=content.

My bad, I didn't see the "prop" qualifier in the continuation - I thought you meant all of them. Lastly, let's try to keep sarcasm to a minimum in a technical discussion. We have Wikipedia talk pages for that.



Re: Migrating to "dumb query-continue"

Brad Jorsch (Anomie)
On Tue, Dec 18, 2012 at 11:39 AM, Yuri Astrakhan
<[hidden email]> wrote:
>
> When you make a SQL query to the server, you don't get to control the
> "continue" process. You can stop and make another query with different
> initial parameters. Same goes for iterating through a collection - none of
> the programming languages offering IEnumerable have stream control
> functionality - too complicated without clear benefits.

The difference in all those examples is that you're iterating over one
list of results. You're not iterating over a list of results and at
the same time over multiple sublists of results inside each of the
results in the main list.

> Documenting the "continue" properties is a sure way to over-complicate
> API usage and remove server's ability to optimize the process in the
> future, without adding any significant benefit.

No one is documenting the values of the continue properties, just how
the properties are supposed to be used to manipulate the original
query.

It seems to me that you're removing the ability for the client to
optimize the queries issued (besides forgoing the use of generators
entirely and having to make 10× as many queries using titles= or
pageids=) for no proposed benefit.


Re: Migrating to "dumb query-continue"

Yuri Astrakhan

It seems to me that you're removing the ability for the client to
optimize the queries issued (besides forgoing the use of generators
entirely and having to make 10× as many queries using titles= or
pageids=) for no proposed benefit.


Not 10× the queries --- one additional query per 5000+ requests, for the extremely edge-case scenario you have given.

Your example - run the allpages generator with gaplimit=1 and, for each page, get a list of revisions. That means you do at least one API request per page. With the change, you would need just one extra query per 5000+ requests to get the list first. A tiny load increase, for a very rare case. I tried to come up with more use cases, but nothing came to mind. Feel free to suggest other use cases.

On the other hand, the proposed benefit is huge for the vast majority of API users:
one simple, "no-brainer" way to continue a query once it's issued, without any complex code in API client frameworks. Right now a client framework must understand what is being queried, which params should be set and removed to exhaust all properties, and what to add later. And *every* framework must handle this, without any major benefit, but with an additional chance of doing it in an inefficient or possibly buggy way. My previous email listed all the complex steps frameworks have to go through.


Besides, if we introduce versions (which we definitely should, as they give us a way to move forward, optimize, and rework the API structure), we can always keep the old way for compatibility's sake. I think versioning is a better overall way to move forward and to warn about incompatible changes than adding extra URL parameters.


Re: Migrating to "dumb query-continue"

Petr Onderka
> not 10x queries ---  one additional query per 5000+ requests, for an
> extremely edge case scenario you have given.

I believe what Brad is talking about is that when you use pageids (or titles),
you are usually limited to 50 of them per query.
But if you use generator, the limit is usually 500.
Which means your approach would lead to 10× as many queries.

Petr Onderka
[[en:User:Svick]]


Re: Migrating to "dumb query-continue"

Yuri Astrakhan
Petr, in Brad's example he used gaplimit=1, which means he would get one page per result, with many revisions.

This is no different from writing titles= or pageids= with just one value.

So instead of using the generator, the client would make just one extra API request to get a list of 5000 pages, and then continue as before. Total extra cost: +1 request per 5000 for a rare edge case, while getting a major benefit for all other usage cases.



Re: Migrating to "dumb query-continue"

Petr Onderka
In reply to this post by Yuri Astrakhan
On Tue, Dec 18, 2012 at 5:39 PM, Yuri Astrakhan <[hidden email]> wrote:
> Same goes for iterating through a collection - none of the programming
> languages offering IEnumerable have stream control functionality - too
> complicated without clear benefits.

Actually in my C# library [1] (I plan to publicize it more later)
a query like generator=allpages&prop=links might result in something
like IEnumerable<IEnumerable<Link>> [2].
And iterating the outer IEnumerable corresponds to iterating gapcontinue,
while iterating the inner IEnumerable corresponds to plcontinue
(of course it's not that simple, since I'm not using limit=1, but I
hope you get the idea).

And while this means some more work for the library writer (in this case, me)
than your alternative, it also means the user has more control over
what exactly is retrieved.

Petr Onderka
[[en:User:Svick]]

[1] https://github.com/svick/LINQ-to-Wiki/
[2] Or, more realistically, IEnumerable<Tuple<Page, IEnumerable<Link>>>,
but I didn't want to complicate it with even more generics.


Re: Migrating to "dumb query-continue"

Yuri Astrakhan
Petr, thanks, I will look closely at your library and post my thoughts.  

Could you take a look at http://www.mediawiki.org/wiki/Manual:Pywikipediabot/Recipes and see how your library would solve these? Also, if you can think of other common use cases from your library users (not your library internals, as it is just an intermediary), please post them too. I posted the cases I saw in the interwiki & casechecker bots.

Thanks!



Re: Migrating to "dumb query-continue"

Petr Onderka
Well, I can't tell you any use cases from my library users, because
there aren't any
(like I said, I didn't actually publicize it yet).

And my library would solve most of those cases the way I explained before:
IEnumerable inside IEnumerable (the exact shape depends on the user).

In the case of more than one prop being used,
it always continues all props, even if the user iterates only one of them.

Petr Onderka
[[en:User:Svick]]


Re: Migrating to "dumb query-continue"

Yuri Astrakhan
Petr, I played with your library a bit. It has some interesting and creative pieces and uses some cool tech (love Roslyn). It might need a bit of love and polishing, as I think the syntax is too verbose, but that's irrelevant here.

This is your code to list link titles from all non-redirect pages in a wiki.

var source = wiki.Query.allpages()
    .Where(p => p.filterredir == allpagesfilterredir.nonredirects)
    .Pages
    .Select(p => PageResult.Create(p.info, p.links().Select(l => l.title).ToEnumerable()));

foreach (var page in source.Take(2000))           // just the first 2000 pages
    foreach (var linkTitle in page.Data.Take(1))  // the first link from each page
        Console.WriteLine(linkTitle);

The "page" foreach starts by getting
http://en.wikipedia.org/w/api.php: action=query & meta=siteinfo & siprop=namespaces

The linkTitle foreach causes 18 more API calls to start getting the links, all with plcontinue, before it yields even a single link.

And the reason for it, as Brad correctly noted, is that links are sorted in a different order than titles. At this point you are halfway through the current block, you have made 19 fairly expensive API calls, and if (and that's a big if) you decide to continue with the next gapcontinue based on the first link you get, you still need to follow each "plcontinue" so that you don't miss any pages.

The only thing you can really do with minimal calls is get a block of data, take a RANDOM page that has links on it, check the first link, and decide whether to go on to the next block. I see absolutely no sense in that usage.

In short, there is no way you can say "next page" until you iterate through every plcontinue in the current set - EXCEPT if you go one page at a time (gaplimit=1), in which case you can safely skip to the next gapcontinue. But this is exactly what I am trying to avoid, because it gives no benefit whatsoever over not using the generator. I even suspect it costs much more, because running a generator, even with limit=1, has a bigger cost than just querying one specific page's info and filling it out.



Re: Migrating to "dumb query-continue"

Yuri Astrakhan
Sorry, hit send too fast:

Petr, when you say you have two nested foreach() loops, the outer foreach does not iterate through the blocks, it iterates through pages. Which means you still must iterate through every plcontinue in the set before issuing the next gapcontinue. In other words, your library does exactly that - a simple iteration. You don't skip blocks of results midway, and your lib would benefit from the change. (All this assumes I understood your code correctly.)



Re: Migrating to "dumb query-continue"

Brad Jorsch (Anomie)
In reply to this post by Yuri Astrakhan
On Tue, Dec 18, 2012 at 1:13 PM, Yuri Astrakhan <[hidden email]> wrote:
>
> On the other hand, the proposed benefit is huge for the vast majority of the
> API users.

The vast majority of API users use a framework like pywikipedia that
has already solved the continuation problem.

> your example - run allpages generator with the gaplimit=1

Yes, that was a contrived example to show that someone might not want
to be forced into following the prop continues to the end.

> Besides, if we introduce versions (which we definetly should, as it gives us
> a way to move forward, optimize and rework the api structure), we can always
> keep the old way for the compatibility sake. I think versions is a better
> overall way to move forward and to give warnings about incompatible changes
> than adding extra URL parameters.

The problem with versions is this: what if someone wants "version 1"
of query-continue (because "version 2" removed all features), but the
latest version of the rest of the API?


Re: Migrating to "dumb query-continue"

Yuri Astrakhan

The vast majority of API users use a framework like pywikipedia that
has already solved the continuation problem.

Not exactly correct - pywikipediabot has not solved it at all; instead they structured their library to avoid the whole problem and to get each property's data separately with generators, which causes a much heavier server load. You can't ask pywiki to get page properties (links, categories, etc.) - in fact it doesn't even perform prop=links|categories|.... Instead it uses a much more expensive generator=pagelinks query for each page you query. Bots that want this functionality have to go low-level with direct API calls and handle this issue by hand.

There are 30 frameworks listed on the docs site. If even the top one ignores this fundamental issue, how many do you think implement it correctly? I just spent considerable time trying to implement a generic, query-agnostic continue, and was forced to do it in a very hacky way (detecting a /g..continue/ parameter, cutting it out, removing some prop=.. values, and ignoring the warnings the server sends because of the unneeded parameters I pass). Not a good generic solution.

> your example - run allpages generator with the gaplimit=1

Yes, that was a contrived example to show that someone might not want
to be forced into following the prop continues to the end..

The problem with versions is this: what if someone wants "version 1"
of query-continue (because "version 2" removed all features), but the
latest version of the rest of the API?

But wouldn't we want people NOT to use the API in an inefficient way, to prevent extra server load? Anyway, I agree, let's not remove abilities - let's introduce a version parameter and make the simple approach the default. Those who want the old legacy behavior can add a &legacycontinue="" parameter.

IMO, we need version support regardless - it would allow us to restructure parameters and result data based on the client's version=xx request. Plus we could finally require an 'agent' from all clients (current JavaScript clients have no way to pass in the agent string). There was a discussion with Roan a few years ago about this, and versioning is needed in order to do most of these things. http://www.mediawiki.org/wiki/API/REST_proposal/Kickoff_meeting_notes


Re: Migrating to "dumb query-continue"

Petr Onderka
In reply to this post by Yuri Astrakhan
On Tue, Dec 18, 2012 at 11:10 PM, Yuri Astrakhan
<[hidden email]> wrote:
> The linkTitle foreach causes 18 more api calls to start getting the links,
> all with plcontinue, before it yeilds even a single link.

Yeah, it's certainly possible there will be many plcontinue calls just
to get the first link.
But that doesn't mean you have to get all plcontinues when you want
only some links.

On Tue, Dec 18, 2012 at 11:16 PM, Yuri Astrakhan
<[hidden email]> wrote:
> Petr, when you say you have two nested foreach(), the outer foreach does not
> iterate through the blocks, it iterates through pages. Which means you still
> must iterate through every plcontinue in the set before issuing next
> gapcontinue.

It doesn't mean that.
For example, in the extreme case where you don't want to know any
links from this page
(say, because you want to filter the articles in a way that cannot be
expressed directly by the API),
you don't have to use plcontinue for this page at all.

A specific example might be changing your code into
(yeah, built specifically to make my point):

foreach (var page in source.Where(p => p.Info.title.Contains("\u2014")).Take(2000))

In this case, the link for "—All You Zombies—" will be retrieved from
the first call,
so no plcontinue is needed.
The link for "—And He Built a Crooked House—" will be retrieved using
one plcontinue call.
But there are no more articles with that character in their title in
the first page of results, so no more plcontinue calls are necessary,
and gapcontinue can be used now. The second page of results contains
no articles with that character at all, so gapcontinue will be used
right away, without any plcontinues.

With your “dumb query-continue”, doing this would require many more calls.

Petr Onderka
[[en:User:Svick]]


Re: Migrating to "dumb query-continue"

Yuri Astrakhan
Background:

max aplimit = max pllimit = 500 (5000 for bots)

Server SQL (roughly):
  pageset = SELECT * FROM pages WHERE start = '!' LIMIT 5000
  SELECT * FROM links WHERE id IN (pageset) LIMIT 5000

Since each wiki page has more than one link, you need to make about 50-100 API calls to get all the links in a block. By the way, it also means that it is far more efficient to set gaplimit = 50-100, because otherwise the server populates and returns 5000 page headers each time, hugely wasting both SQL and network bandwidth.

Links are sorted by pageid, pages by title. If you need the links for the first page in a block, chances are you have to iterate through 50% of all the other pages' links first.

Now lets look at your example:

* If you set your gaplimit=100 & pllimit=5000, you get all the links for 100 pages in one call, which is no different than simple-continue.

* If you set "max" to both, and you want 80% of the pages per block, you most likely will have to download 99% of the links - same as downloading everything -- simple-continue.

* If you want a small percentage of pages, like 1 per block, then on average you still have to download 50+% of the links. Even in the best-case scenario, if you are lucky, you need one additional links block to know that the first page has no more links.

The proper way to do the last case is to use allpages without links, go through them, and make a list of the 500 page ids you want. Afterwards, download all the links with a different query -- pageids=<list>, not a generator. Assuming 1 needed page per block, you just saved the time and bandwidth of 250 link blocks! A huge saving, without even counting how much less the SQL servers had to work. That's 250 queries you didn't have to make.
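
For illustration, a sketch of that two-step approach (reusing the query() helper sketched at the start of the thread; the batch of 50 pageids per request would be 500 for bots, and want_title() stands for whatever selection the client actually needs):

def links_for_selected_pages(want_title, batch=50):
    """Step 1: list page ids cheaply with list=allpages (no props).
    Step 2: fetch links only for the selected ids via pageids=, in batches."""
    selected = []
    for block in query(list='allpages', aplimit='max'):
        for page in block.get('allpages', []):
            if want_title(page['title']):
                selected.append(str(page['pageid']))
    for i in range(0, len(selected), batch):
        ids = '|'.join(selected[i:i + batch])
        for block in query(pageids=ids, prop='links', pllimit='max'):
            yield block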



So you see, no matter how you look at this problem, you either 1) simple-stream the whole result set, or 2) do two separate queries - one to get the list of all titles and select the ones you need, and another to get the links for them. A much, much faster, more efficient, greener solution.


Lastly, if we enable the simple query-continue by default, the server can apply much smarter logic - if gaplimit=max & pllimit=max, reduce gaplimit to pllimit/50. In other words, the server would return only the pages it can fill with links, but not many more. This is open for discussion of course, and I haven't finalized how to do this properly.

I hope all this explains it enough. If you have other thoughts, please contact me privately; there is no need to involve the whole list in this.
