Wikimedia production excellence (January 2019)

Krinkle
📘 (est. 3 minute read)
https://phabricator.wikimedia.org/phame/live/1/post/140/
-------

How’d we do in striving for operational excellence last month? Read on to
find out!

- Month in numbers.
- Highlighted stories.
- Current problems.

##  📊 *Month in numbers*

* 4 documented incidents in January 2019. [1]
* 16 Wikimedia-prod-error tasks closed. [2]
* 17 Wikimedia-prod-error tasks created. [3]

##  *️⃣ *Unable to move certain file pages*

Xiplus reported that renaming a File page on zh.wikipedia.org led to a
fatal database exception. Andre Klapper identified the stack trace from the
logs, and Brad (Anomie) investigated.

The File renaming failed because the File page did not have a media file
associated with it (such move action is not currently allowed in
MediaWiki). But, while handling this error the code caused a different
error. The impact was that the user didn't get informed about why the move
failed. Instead, they received a generic error page about a fatal database
exception.

Brad fixed the code a few hours later, and it was deployed by Roan later
that same day.
Thanks! —  https://phabricator.wikimedia.org/T213168

##  *️⃣ *DBPerformance regression detected and fixed*

During a routine audit of Logstash dashboards, I found a DBPerformance
warning. The warning indicated that the limit of 0 for “master connections”
was violated. That's a cryptic way of saying it found code in MediaWiki
that uses a database master connection on a regular page view.

MediaWiki can have many replica database servers, but there can be only one
master database at any given moment. To reduce the chance of overload,
delayed edits, or network congestion, we use replicas whenever possible. We
usually involve the master only when source data is being changed, or is
about to be changed, for example when editing a page or saving changes.

As the vast majority of traffic is page views, we have lower thresholds for
latency and dependency on page views. In particular, page views may (in the
future) be routed to secondary data centres that don’t even have a master
DB.
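
As a rough illustration of what a “limit of 0 for master connections” check
could look like, here is a minimal Python sketch. It is not MediaWiki's
actual implementation; the class name, method names, and message wording are
all hypothetical and chosen only for this example.

```python
import logging

logger = logging.getLogger("DBPerformance")


class RequestProfiler:
    """Hypothetical per-request counter of master connections."""

    def __init__(self, max_master_conns):
        # For a regular page view the limit described above would be 0.
        self.max_master_conns = max_master_conns
        self.master_conns = 0
        self.violations = []

    def record_connection(self, role):
        """Record a new DB connection; role is "master" or "replica"."""
        if role != "master":
            return
        self.master_conns += 1
        if self.master_conns > self.max_master_conns:
            message = (
                f"master connections: limit {self.max_master_conns}, "
                f"actual {self.master_conns}"
            )
            self.violations.append(message)
            # Only a warning: the request itself is allowed to continue.
            logger.warning(message)
```

With a limit of 0, any number of replica connections passes silently, while
the first master connection during a page view produces a warning that can
then show up on a dashboard.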

Tchanders (from the Anti-Harassment team) investigated the issue, found the
culprit, and fixed it in time for the next MediaWiki train. Thanks! —
https://phabricator.wikimedia.org/T214735

##  *️⃣ *TemplateData missing in action*

Tacsipacsi and Evad37 both independently reported the same TemplateData
issue. TemplateData powers the template insertion dialog in VisualEditor.
It wasn't working for some templates after we deployed the 1.33-wmf.13
branch.

The error was “Argument 1 passed to ApiResult::setIndexedTagName() must be
an instance of array, null given”. This means there was code that called a
function with the wrong argument. For example, the variable name may have
been misspelled, or it may have been the wrong variable, or (in this case)
the variable didn't exist. In such a case, PHP implicitly assumes “null”.

Bartosz (Matmarex) found the culprit. The week before, I made a change to
TemplateData that changed the “template parameter order” feature to be
optional. This allows users to decide whether VisualEditor should force an
order for the parameters in the wikitext. It turned out I forgot to update
one of the references to this variable, which still assumed it was always
present.
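
The same class of bug can be sketched in Python. This is an analogy, not the
TemplateData code or the actual fix: the function names and the "paramOrder"
and "params" keys are assumptions made for illustration, and Python's `None`
stands in for PHP's “null”.

```python
def set_indexed_tag_name(values, tag):
    """Stand-in for a function that only accepts a list."""
    if not isinstance(values, list):
        raise TypeError(
            f"Argument 1 must be a list, {type(values).__name__} given"
        )
    return {"tag": tag, "items": values}


def format_buggy(template_data):
    # Bug: still assumes "paramOrder" is always present.
    # dict.get() returns None for a missing key, which is then passed on.
    return set_indexed_tag_name(template_data.get("paramOrder"), "p")


def format_fixed(template_data):
    # Treat the parameter order as optional and skip it when absent.
    result = {"params": template_data["params"]}
    if "paramOrder" in template_data:
        result["paramOrder"] = set_indexed_tag_name(
            template_data["paramOrder"], "p"
        )
    return result
```

Calling `format_buggy({"params": {}})` raises a `TypeError`, while
`format_fixed({"params": {}})` simply omits the order from its result.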

Brad (Anomie) fixed it later that week, and it was deployed the next day.
Thanks! — https://phabricator.wikimedia.org/T213953

##  📈 *Current problems*

Take a look at the workboard and look for tasks that might need your help.
The workboard lists known issues, grouped by the week in which they were
first observed.

→  https://phabricator.wikimedia.org/tag/wikimedia-production-error/

There are currently 188 open Wikimedia-prod-error tasks as of 12 February
2019. (We’ve had a slight increase since November; 165 in December, 172 in
January.)

For this month’s edition, I’d like to draw attention to a few older issues
that are still reproducible:

* [2013; Collection extension] Special:Book fatal error for blocked users.
https://phabricator.wikimedia.org/T56179
* [2013; CentralNotice] Fatal error when placeholder key contains a space.
https://phabricator.wikimedia.org/T58105
* [2014; LQT] Fatal error when attempting to view certain threads. —
https://phabricator.wikimedia.org/T61791
* [2015; MassMessage] Warning about Invalid message parameters. —
https://phabricator.wikimedia.org/T93110
* [2015; Wikibase] Warning “UnresolvedRedirectException” for some pages on
Wikidata (and Commons). — https://phabricator.wikimedia.org/T93273

##  💡 Terminology

A “Fatal error” (or uncaught exception) prevents a user action. For example,
a page might display “MWException: Unknown class NotificationCount.”
instead of the article content.
A “Warning” (or non-fatal, or PHP error) lets the program continue to
display a mostly complete page regardless. This may cause corrupt,
incorrect, or incomplete information to be shown. For example, a user may
receive a notification that says “You have (null) new messages”.
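
The difference can be sketched in Python. This is only an illustration under
assumed names (`render_article`, `render_notification`, and `handle_request`
are hypothetical), not how MediaWiki handles errors internally.

```python
import logging

logger = logging.getLogger("prod-error")


def render_notification(count):
    # Warning: the problem is logged and rendering carries on,
    # so the user sees incorrect output.
    if count is None:
        logger.warning("Notification count is missing")
        return "You have (null) new messages"
    return f"You have {count} new messages"


def render_article(body):
    # Fatal: nothing in this function catches the exception,
    # so rendering of the article stops here.
    if body is None:
        raise RuntimeError("article content could not be loaded")
    return body


def handle_request(render):
    # Last-resort handler: show an error page instead of the content.
    try:
        return render()
    except Exception as exc:
        return f"Error page: {type(exc).__name__}: {exc}"
```

Here `handle_request(lambda: render_article(None))` yields an error page in
place of the article, whereas `render_notification(None)` still returns a
message, just a wrong one.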

##  🎉 Thanks!

Thank you to everyone who has helped by reporting, investigating, or
resolving problems in Wikimedia production, including: Xiplus, Anomie,
Daimona Gilles, He7d3r, Jdforrester, MatmaRex, MModell, Nikerabbit,
Catrope, Tchanders, Tgr, and Thiemo.

Thanks!

Until next time,

– Timo Tijhof

👢*There's a snake in my boot. Reach for the sky!*

-------

Footnotes:

[1] Incidents. –
https://wikitech.wikimedia.org/wiki/Special:AllPages?from=Incident+documentation%2F20190100&to=Incident+documentation%2F20190200&namespace=0


[2] Tasks closed. –
https://phabricator.wikimedia.org/maniphest/query/COTGbmxGcm_l/#R

[3] Tasks created. –
https://phabricator.wikimedia.org/maniphest/query/DLRuzOg9bSJA/#R
_______________________________________________
Wikitech-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: Wikimedia production excellence (January 2019)

Brad Jorsch (Anomie)
On Tue, Feb 12, 2019 at 10:54 PM Krinkle <[hidden email]> wrote:

> Brad fixed the code a few hours later, and it was deployed by Roan later
> that same day.
> Thanks! —  https://phabricator.wikimedia.org/T213168
>

Correction: It was Gergő Tisza who submitted the patch to fix the code for
this one, not me.


--
Brad Jorsch (Anomie)
Senior Software Engineer
Wikimedia Foundation