Efficient caching of large data sets for wikidata

classic Classic list List threaded Threaded
2 messages Options
Reply | Threaded
Open this post in threaded view
|

Efficient caching of large data sets for wikidata

Daniel Kinzler
Hi all!

I'd like to get some input on a tricky problem regarding caching in memcached
(or accelerator cache). For Wikidata, we often need to look up the label (name)
of an item in a given language - on wikidata itself, as well as on wikis that
use wikidata. So, if something somewhere references Q5, we need to somehow look
up the label "Human" in English resp "Mensch" in German, etc.

The typical access pattern is to look up labels for a dozen or so items in a
handful of languages in order to generate a single response. In some situations
we can batch that into a single lookup, but at other times, we do not have
sufficient context, and it will be one label at a time. Also, some items (and
thus, their labels) are references a lot more than others.

Anyway, to get the labels, we can either fetch the full data item (several KB of
data) from the page content store (external store). This is what we currently
do, and it's quite bad. Full data items are cached in memcached and shared
between wikis - but it's still a lot of data to move, and may swamp memcached.

Alternatively, we can fetch the labels from the wb_terms table - we have a
mechanism for that, but no caching layer.

And now, the point of this email: how to make a caching layer for lookups to the
wb_terms table?

Naively, we could just give each label (one per item and language) a cache key,
and put it into memcached. I'm not sure this would improve performance much, and
it would mean massive overhead to memcached; also, putting *all* labels into
memcached would likely swamp it (we have in the order of 100 million labels and
descriptions).

We could put all "terms" for a given entity under one cache key, shared between
wikis.But then we'd still be moving a lot of pointless data around. Or we could
group using some hashing mechanism. But then we would not be able to take
advantage of the fact that some items are used a lot more often than others.

I'd like a good mechanism to cache just the 1000 or so most used labels,
preferably locally in the accelerator cache. Would it be possible to make a
"compartment" with LRU semantics in our caching infrastructure? As far as I
know, all of a memcached server, or all of an APC instance, acts as a single LRU
cache.

In order to make it less likely for rarely used labels to hog the cache, I can
think of two strategies:

1) Use a low expiry time. If set to 1 hour, only stuff accessed every hour stays
in the cache.

2) Use randomized writing: put things into the cache only 1/3 of the time; this
makes it more likely for frequently used labels to get into the cache... I'm no
good at probabilities and statistics, but I'd love to discuss this option with
someone who can actually calculate how well this might work.


So, which strategy should we use? We have:

* Full item data from external store + memcached
* Individual simple database queries, no cache
* DB query + memcached, low duration, one key per label
* DB query + memcached, randomized, one key per label
* Group cache entries by item (similar to caching full entities)
* ...

Are there other options, or other aspects that should be considered? Which
strategy would you recommend?

-- daniel

_______________________________________________
Wikitech-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Reply | Threaded
Open this post in threaded view
|

Re: Efficient caching of large data sets for wikidata

Aaron Schulz
A few things to note:

* APC is not LRU, it just detects expired items on get() and clears everything when full (https://groups.drupal.org/node/397938)
* APC has a low max keys config on production, so using key-per-item would require that to change
* Implementing LRU groups for BagOStuff would require heavy CAS use and would definitely be bad over the wire (and not great locally either)

Just how high is the label traffic/queries? Do we profile this?

If it is super high, I'd suggest the following as a possibility:
a) Install a tiny redis instance on each app server.
b) Have a sorted set in redis containing (label key => score) and individual redis keys for label strings (with label keys). Label keys would be like P33-en. The sorted set and string values would use a common key prefix in redis. The sorted-set key would mention the max size.
c) Cache get() method would use the normal redis GET method. Once every 10 times it could send a Lua command to bump the label key's score in the sorted-set (ZSCORE) to that of the highest score +1 (find via ZRANGE key -1 -1 WITHSCORES).
d) Cache set() method would be a no-op except once every 10 times. When it does anything, it would send a Lua command to remove the lowest scored key if there is no room (ZREMRANGEBYRANK key 0 1) and in any case add the label key with a score equal to the highest score + 1. It would also add the value in the separate key for that value with a TTL (likewise deleting it on eviction). The sorted-set TTL would be set to max(current TTL, new value TTL).
e) Cache misses would fetch from the DB rather than text store

If high traffic causes flooding, the "10" number can be tweaked (or eliminated) or the "highest rank + 1" logic could be tweaked to insert new labels with a score that's better than only 3/8 of the stuff rather than all of it (borrowing from MySQL). The above method just uses O(logN) redis stuff.

Such a thing could probably be useful for at least a few more use cases I'd bet.