Wikimedia production excellence (August 2019)

Krinkle
📘 Read on Phabricator at
https://phabricator.wikimedia.org/phame/post/view/172
-------

How’d we do in striving for operational excellence in August? Read on to
find out!

##  📊 Month in numbers

* 3 documented incidents. [1]
* 42 new Wikimedia-prod-error reports. [2]
* 31 Wikimedia-prod-error reports closed. [3]
* 210 currently open Wikimedia-prod-error reports in total. [4]

The number of recorded incidents in August, at three, was below average for
the year so far. However, August also had only 2-3 incidents in previous
years (2017-2018). – Explore the data at https://codepen.io/Krinkle/full/wbYMZK

To read more about these incidents, their investigations, and pending
actionables, check
https://wikitech.wikimedia.org/wiki/Incident_documentation#2019

##  *️⃣ When you have eliminated the impossible...

Reports from Logstash indicated that some user requests were aborted by a
fatal PHP error from the MessageCache class. The user would be shown a
generic system error page. The affected requests didn’t seem to have
anything obvious in common, however. This made it difficult to diagnose.

MessageCache is responsible for fetching interface messages, such as the
localised word “Edit” on the edit button. It calls a “load()” function and
then tries to access the loaded information. However, sometimes the load
function would claim to have finished its work, and yet the information
was not there.

When the load function initialises all the messages for a particular
language, it keeps track of this, so as to not do the same a second time.
No matter which angle I looked at this code from, no obvious mistakes stood
out. A deeper investigation revealed that two unrelated changes (more than
a year apart) each broke one assumption that was safe to break on its own.
Put together, however, they produced this seemingly impossible problem. Check out the
details of the investigation at
https://phabricator.wikimedia.org/T208897#5373846.
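
To illustrate the general shape of such a bug (a load function that reports
success while the data is missing), here is a minimal Python sketch. It is
not MediaWiki code and does not claim to reproduce the actual root cause,
which is documented in the task linked above. The class `TinyMessageCache`,
its `max_languages` limit, and the eviction behaviour are all hypothetical,
chosen only to show how two individually reasonable assumptions can
conflict.

```python
from collections import OrderedDict


class TinyMessageCache:
    """Hypothetical stand-in for a message cache; not MediaWiki code."""

    def __init__(self, max_languages=2):
        self.max_languages = max_languages
        self.cache = OrderedDict()  # language code -> {message key: text}
        self.loaded = set()         # languages load() believes are ready

    def _fetch(self, lang):
        # Placeholder for the expensive lookup (database, shared cache, ...).
        return {"edit": f"Edit ({lang})"}

    def load(self, lang):
        # Assumption 1: "already loaded" implies the data is still in memory.
        if lang in self.loaded:
            return True
        self.cache[lang] = self._fetch(lang)
        self.loaded.add(lang)
        # Assumption 2: evicting old entries is harmless, since they can
        # simply be loaded again. Note that self.loaded is not updated here.
        while len(self.cache) > self.max_languages:
            self.cache.popitem(last=False)
        return True

    def get(self, lang, key):
        self.load(lang)
        # Raises KeyError if load() reported success but the data was evicted.
        return self.cache[lang][key]


def demo():
    mc = TinyMessageCache(max_languages=2)
    for lang in ("en", "nl", "de"):
        mc.get(lang, "edit")  # loading "de" evicts "en"
    try:
        mc.get("en", "edit")
    except KeyError:
        return "load() said 'en' was ready, but its messages were gone"
    return "ok"
```

Each assumption is fine in isolation: tracking loaded languages avoids
repeated work, and bounding the cache saves memory. Only together do they
let `load()` return successfully for a language whose messages are no longer
there. In this sketch, one fix would be to remove a language from `loaded`
whenever its entry is evicted.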

##  📉  Outstanding reports

Take a look at the workboard and look for tasks that might need your help.
The workboard lists error reports, grouped by the month in which they were
first observed.

→  https://phabricator.wikimedia.org/tag/wikimedia-production-error/

Or help someone who has already started on a patch:
→  https://phabricator.wikimedia.org/maniphest/query/pzVPXPeMfRIz/#R

Breakdown of recent months (past two weeks not included):

* January: 1 report left (unchanged).
* February: 2 reports left (unchanged). ⚠️
* March: 4 reports left (unchanged). ⚠️
* April: 2 reports got fixed! (8 of 14 reports left).
* May: 4 of 10 reports left (unchanged). ⚠️
* June: 1 report got fixed! (8 of 11 reports left).
* July: 2 reports got fixed (17 of 18 reports left).
* August: 14 new reports remain unsolved.
* September: 11 new reports remain unsolved.

-------

##  🎉 Thanks!

Thank you to Aaron Schulz, Daimona, David Barratt, James Forrester, Kosta
Harlan, Piotr Miazga, Roan Kattouw, Tom Arrow, Željko Filipin, and everyone
else who helped by reporting, investigating, or resolving problems in
Wikimedia production. Thanks!

Until next time,

– Timo Tijhof

-------


Footnotes:

[1] Incidents. –
https://wikitech.wikimedia.org/wiki/Special:PrefixIndex?prefix=Incident+documentation%2F201908&namespace=0&hideredirects=1&stripprefix=1

[2] Tasks created. –
https://phabricator.wikimedia.org/maniphest/query/8fpsoBLrmlFu/#R

[3] Tasks closed. –
https://phabricator.wikimedia.org/maniphest/query/U9.KRVNW52Yb/#R

[4] Open tasks. –
https://phabricator.wikimedia.org/maniphest/query/47MGY8BUDvRD/#R
_______________________________________________
Wikitech-l mailing list
https://lists.wikimedia.org/mailman/listinfo/wikitech-l