This is a quick reminder that tonight, at the TechCom IRC hour, we will be
talking about the job queue. There have been several issues with it lately, and
we want to make sure that we have all relevant aspects on the radar.
As always, the discussion will take place in the IRC channel
#wikimedia-office on Wednesday 21:00 UTC (2pm PDT, 23:00 CEST).
This is not an RFC meeting, as there is no concrete proposal. Rather, it's an
opportunity to further our understanding of the problems at hand, and to float
ideas for possible improvements.
* With 600k jobs in the commonswiki backlog, only 7k were processed in a day.
* For wikis with just a few thousand pages, we sometimes see millions of
UpdateHtmlCache jobs sitting in the queue.
* Jobs that were triggered months ago were found to still be failing and re-trying.
Issues and considerations:
* Jobs re-trying indefinitely
** The deduplication mechanism is obscure/undocumented. Some jobs rely on
rootJob parameters, some use other means.
** Batching prevents deduplication. When and how should jobs do batch
operations? Can we automatically break up large batches?
** Delaying jobs may improve deduplication, but support for delayed jobs is
optional, depending on the queue backend.
** Custom coalescing could improve the chances of deduplication.
* The scope and purpose of some jobs is unclear. E.g. UpdateHtmlCache
invalidates the parser cache, and RefreshLinks re-parses the page - but does
not trigger an UpdateHtmlCache, which it probably should.
* The throttling mechanism does not take into account the nature and run-time of
different job types.
* Scaling is achieved by running more cron jobs.
* A Kafka-based job queue is being tested by the Services team. It is generally
saner, and should improve our ability to track causality (which job was
triggered by which other job). See T157088.
* No support for recurring jobs. Should we keep using cron?
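To make the deduplication discussion above more concrete, here is a minimal
sketch of rootJob-style deduplication. This is illustrative only, not
MediaWiki's actual implementation: the idea is that each job carries a
signature and a timestamp for the logical "root" operation, and a push is
skipped when a same-or-newer root job with that signature is already known.

```python
import hashlib
import time


class JobQueue:
    """Toy queue with root-job deduplication (illustrative only)."""

    def __init__(self):
        self.jobs = []
        # signature -> timestamp of the newest root job seen so far
        self.latest_root = {}

    @staticmethod
    def root_signature(job_type, params):
        """Signature identifying the logical 'root' operation."""
        key = job_type + ":" + repr(sorted(params.items()))
        return hashlib.sha1(key.encode()).hexdigest()

    def push(self, job_type, params, timestamp=None):
        ts = timestamp if timestamp is not None else time.time()
        sig = self.root_signature(job_type, params)
        if self.latest_root.get(sig, -1) >= ts:
            # A same-or-newer root job already covers this work.
            return False
        self.latest_root[sig] = ts
        self.jobs.append((job_type, params, ts))
        return True


q = JobQueue()
q.push("UpdateHtmlCache", {"page": "Foo"}, timestamp=100)
q.push("UpdateHtmlCache", {"page": "Foo"}, timestamp=100)  # duplicate, skipped
q.push("UpdateHtmlCache", {"page": "Foo"}, timestamp=200)  # newer root, accepted
```

Note how batching would defeat this: a job covering pages {Foo, Bar} and a job
covering {Foo, Baz} get different signatures even though they overlap, which is
one reason batching prevents deduplication.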
Principal Platform Engineer
Gesellschaft zur Förderung Freien Wissens e.V.