Access to MediaWiki API with 900 RPS

Access to MediaWiki API with 900 RPS

Eric Kuo
Hi,

This is Eric from Yahoo. My team develops mobile apps for users in Taiwan and Hong Kong. We want to show wiki descriptions for keywords in our content, and we are considering MediaWiki API:OpenSearch and/or API:Query to achieve this. Our estimated request rate is 900 requests per second (RPS), and we will cache query results on our side. We would like to know whether this request rate raises any concerns and, if so, what the best practice is.

Any comments and suggestions are welcome. Thank you for your time. 

Best regards,
Eric

_______________________________________________
Mediawiki-api mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/mediawiki-api

Re: Access to MediaWiki API with 900 RPS

Erik Bernhardson
(cc'ing the discovery mailing list, as that team owns both the implementation and operation of search.)

I can partially answer this as one of the people responsible for search, but I have to defer to others about API, bots, and such.

This would be a noticeable portion of our traffic. For reference:

action=opensearch (and generator variants): 1.5k RPS
action=query&list=search (and generator variants): 600 RPS
all api: 8k RPS (might be a bit higher, this is averaged over an hour)
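
To put the request in proportion, the figures above can be compared directly. This is only back-of-the-envelope arithmetic on numbers quoted in this thread (900 RPS requested, 1.5k RPS of opensearch, 8k RPS across the whole API). The variable names are illustrative, and the calculation assumes the full 900 RPS would reach the API uncached.

```python
# Figures quoted in this thread, in requests per second.
OPENSEARCH_RPS = 1500   # action=opensearch and generator variants
FULLTEXT_RPS = 600      # action=query&list=search and generator variants
ALL_API_RPS = 8000      # all API traffic, averaged over an hour
REQUESTED_RPS = 900     # rate from the original request (assumed uncached)


def share(extra: float, baseline: float) -> float:
    """Return extra traffic as a percentage of an existing baseline."""
    return 100.0 * extra / baseline


opensearch_increase = share(REQUESTED_RPS, OPENSEARCH_RPS)
all_api_increase = share(REQUESTED_RPS, ALL_API_RPS)

print(f"+{opensearch_increase:.1f}% on top of current opensearch traffic")
print(f"+{all_api_increase:.2f}% on top of all API traffic")
```

Under these assumptions the request would add 60% to current opensearch traffic and a little over 11% to total API traffic.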

opensearch is relatively cheap: the p95 to our search servers is ~30ms, with p50 at 7ms. So 600 RPS of opensearch traffic wouldn't be too hard on our search cluster. Using action=query&list=search is going to be too heavy; full-text searches are computationally more expensive to serve.

Might I ask which wiki(s) you would be querying against? opensearch traffic is spread across our search cluster, but individual wikis only hit portions of it. For example, opensearch on en.wikipedia.org is served by ~40% of the cluster, but zh.wikipedia.org (Chinese) is only served by ~13%. If you are going to send heavy traffic to zh, I might need to adjust those numbers to spread the load to more servers (easy enough, I just need to know).

Additionally, you mentioned descriptions and keywords. These would not be provided directly by the opensearch api so you might be thinking of using the generator version of it (action=query&generator=prefixsearch) to get the results augmented (ex: /w/api.php?action=query&format=json&prop=extracts&generator=prefixsearch&exlimit=5&exintro=1&explaintext=1&gpssearch=yah&gpslimit=5). I'm not personally sure how expensive that is, someone else would have to chime in. 
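
The two request shapes discussed above can be sketched as follows. This is a minimal illustration using only the Python standard library. The helper names are hypothetical, zh.wikipedia.org is used purely as an example host, and the parameters of the generator query are copied from the example URL in the message. The opensearch helper uses that module's `search` and `limit` parameters.

```python
from urllib.parse import urlencode

API_PATH = "/w/api.php"


def opensearch_url(host: str, prefix: str, limit: int = 5) -> str:
    """Build a plain action=opensearch request (titles only, no extracts)."""
    params = {
        "action": "opensearch",
        "format": "json",
        "search": prefix,
        "limit": limit,
    }
    return f"https://{host}{API_PATH}?{urlencode(params)}"


def prefixsearch_extracts_url(host: str, prefix: str, limit: int = 5) -> str:
    """Build the generator=prefixsearch request that also returns intro extracts."""
    params = {
        "action": "query",
        "format": "json",
        "prop": "extracts",
        "generator": "prefixsearch",
        "exlimit": limit,
        "exintro": 1,
        "explaintext": 1,
        "gpssearch": prefix,
        "gpslimit": limit,
    }
    return f"https://{host}{API_PATH}?{urlencode(params)}"


print(opensearch_url("zh.wikipedia.org", "yah"))
print(prefixsearch_extracts_url("zh.wikipedia.org", "yah"))
```

The first URL only touches the search cluster; the second also asks for extracts, so its cost depends on more than search.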

So, from a computational point of view and only with respect to the search portion of our cluster, this seems plausible as long as we coordinate so that we know the traffic is coming. Others will have to chime in about the wider picture.

Erik B.


Re: [discovery] Access to MediaWiki API with 900 RPS

Max Semenik
On Mon, Nov 14, 2016 at 6:07 PM, Erik Bernhardson <[hidden email]> wrote:
Additionally, you mentioned descriptions and keywords. These would not be provided directly by the opensearch api so you might be thinking of using the generator version of it (action=query&generator=prefixsearch) to get the results augmented (ex: /w/api.php?action=query&format=json&prop=extracts&generator=prefixsearch&exlimit=5&exintro=1&explaintext=1&gpssearch=yah&gpslimit=5). I'm not personally sure how expensive that is, someone else would have to chime in.

This depends heavily on page size and cache hit ratio, and the worst case is not very pleasant:
* Cache hit at the extracts level - costs a memcached read
* Parser cache hit - process head HTML into extract
* Parser cache miss - parse lede wikitext into HTML, process HTML

Overall, we're talking somewhere between 30 and 300 milliseconds (I can't find query API module stats in Graphite, so I can't be more precise statistically).
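
The three cost tiers listed above can be pictured as a fall-through lookup. The sketch below is a toy model only, not the actual extension code: plain dicts stand in for memcached and the parser cache, every function and variable name is hypothetical, and writing results back to the caches on a miss is an assumption made for the illustration.

```python
import re
from typing import Callable, Dict, Tuple


def get_extract(
    title: str,
    extract_cache: Dict[str, str],
    parser_cache: Dict[str, str],
    parse_lede: Callable[[str], str],
    html_to_extract: Callable[[str], str],
) -> Tuple[str, str]:
    """Return (extract, tier), where tier names the most expensive step taken."""
    # Cheapest tier: the finished extract is already cached.
    if title in extract_cache:
        return extract_cache[title], "extract-cache hit"

    # Middle tier: rendered HTML is cached and only needs post-processing.
    html = parser_cache.get(title)
    tier = "parser-cache hit"
    if html is None:
        # Most expensive tier: parse wikitext into HTML first.
        html = parse_lede(title)
        parser_cache[title] = html
        tier = "parser-cache miss"

    extract = html_to_extract(html)
    extract_cache[title] = extract
    return extract, tier


def strip_tags(html: str) -> str:
    """Toy HTML-to-text step: drop anything that looks like a tag."""
    return re.sub(r"<[^>]+>", "", html)


def fake_parse(title: str) -> str:
    """Toy stand-in for parsing the lead section of a page."""
    return f"<p>{title} lede.</p>"


extracts: Dict[str, str] = {}
parsed: Dict[str, str] = {"Yahoo": "<p>Yahoo is a web company.</p>"}

for page in ["Yahoo", "Yahoo", "Taipei"]:
    text, tier = get_extract(page, extracts, parsed, fake_parse, strip_tags)
    print(f"{page}: {tier}: {text}")
```

In this model the second request for the same title is served from the cheapest tier, which is why the real cost depends so strongly on the cache hit ratio.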

--
Best regards,
Max Semenik ([[User:MaxSem]])


Re: Access to MediaWiki API with 900 RPS

Lee-Wei Mar
In reply to this post by Erik Bernhardson
Resending this mail to [hidden email].

Hi Wikimedia team,

This is Lee-Wei from Yahoo. Many thanks to Erik for the traffic numbers and to Max for the response time estimate. For now, we will use only the opensearch API, called from our backend with a cache in front of it.

Here are more detailed traffic numbers:
  - Expected RPS to opensearch: 900 RPS at peak. We will use a cache, so if the cache layer works well, the average should be about 15 RPS.

We plan to release this feature on Jan. 5th, 2017. Please let us know if you have any questions or concerns.

Thanks again for the team's kind help and support.

Cheers,
Lee-Wei
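
A caching layer of the kind described in this message might look roughly like the sketch below. It is a minimal in-process example, not Yahoo's actual design: the class name, the 300-second TTL, and the fake fetch function are all assumptions made for illustration. The helper at the top computes the hit ratio that would be needed if 900 RPS arrived at the cache and only 15 RPS were passed upstream. The message gives 900 as a peak and 15 as an average, so treat that ratio as a rough reading.

```python
import time
from typing import Callable, Dict, List, Tuple


def required_hit_ratio(incoming_rps: float, upstream_rps: float) -> float:
    """Fraction of requests the cache must absorb to reach the upstream rate."""
    return 1.0 - upstream_rps / incoming_rps


class TTLCache:
    """Tiny time-to-live cache placed in front of a slow fetch function."""

    def __init__(
        self,
        fetch: Callable[[str], object],
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._store: Dict[str, Tuple[float, object]] = {}
        self.hits = 0
        self.misses = 0

    def get(self, key: str) -> object:
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            # Entry exists and has not expired yet.
            self.hits += 1
            return entry[1]
        # Missing or expired: go upstream and remember the result.
        self.misses += 1
        value = self._fetch(key)
        self._store[key] = (now + self._ttl, value)
        return value


upstream_calls: List[str] = []


def fake_fetch(prefix: str) -> object:
    """Stand-in for an opensearch request; records what went upstream."""
    upstream_calls.append(prefix)
    return [prefix, []]


cache = TTLCache(fake_fetch, ttl_seconds=300)
for prefix in ["yah", "yah", "tai", "yah"]:
    cache.get(prefix)

print(f"hits={cache.hits} misses={cache.misses} upstream={upstream_calls}")
print(f"hit ratio needed for 900 -> 15 RPS: {required_hit_ratio(900, 15):.1%}")
```

A production setup would more likely use a shared cache so that all backend instances benefit from each other's lookups.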


