Canary Deploys for MediaWiki

Tyler Cipriani
tl;dr: Scap will deploy to canary servers and check for error-log spikes in the next version (to be released Soon™).

In light of recent incidents[0] that caused outages accompanied by large, easily detectable error-rate spikes, a patch has recently landed in Scap[1] that will:

    1. Push changes to a set of canary servers[2] before syncing to proxy servers
    2. Wait a configurable length of time (currently 20 seconds[3]) for any errors to have time to make themselves known
    3. Query Logstash (using a script written by Gabriel Wicke[4]) to determine if the error rate has increased over a configurable threshold (currently 10-fold[5])
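
The three steps above can be sketched roughly as follows. This is a minimal illustration, not Scap's actual code: `canary_deploy`, `error_rate_spiked`, `push` and `get_error_rate` are hypothetical names, the handling of a zero baseline is an assumption, and so is the choice to still push to the canaries when `force` is set. Only the 20-second wait and the 10-fold threshold come from the announcement.

```python
import time
from typing import Callable, Sequence


def error_rate_spiked(before: float, after: float, threshold: float = 10.0) -> bool:
    """Return True if the error rate grew by more than `threshold`-fold."""
    # Assumption: a baseline below 1 is treated as 1, so that a quiet period
    # followed by a handful of errors does not count as an infinite spike.
    baseline = max(before, 1.0)
    return after > baseline * threshold


def canary_deploy(
    canaries: Sequence[str],
    push: Callable[[str], None],
    get_error_rate: Callable[[], float],
    wait_seconds: float = 20.0,
    threshold: float = 10.0,
    force: bool = False,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Push to the canaries, wait, and report whether the sync may continue."""
    before = get_error_rate()
    for host in canaries:
        push(host)  # step 1: canaries receive the change first
    if force:
        # Assumption: force skips the wait and the error-rate check only.
        return True
    sleep(wait_seconds)  # step 2: give errors time to show up
    after = get_error_rate()  # step 3: compare against the threshold
    return not error_rate_spiked(before, after, threshold)
```

In this sketch a `False` return means the full sync should be aborted; passing `get_error_rate` in as a callable keeps the Logstash query out of the decision logic.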

Big thanks to the folks that helped in this effort: Gabriel Wicke, Filippo Giunchedi and Giuseppe Lavagetto, Bryan Davis and Erik Bernhardson (for their mad Logstash skillz)!

Note that when expedience is required (we're in the middle of an outage and who cares what Logstash has to say), the `--force` flag can be added to skip canary checks altogether (e.g. `scap sync-file --force wmf-config/InitialiseSettings 'Panic!!'`).

The RelEng team's eventual goal is still to move MediaWiki deployments to the more robust and resilient Scap3 deployment framework. There is some high-priority work that has to happen before the Scap3 move. In the interim, we are taking steps (like this one) to respond to incidents and keep deployments safe.

Hopefully, this work and the error-rate alert work from Ori last week[6] will allow everyone to be more conscientious and more keenly aware of deployments that cause large aberrations in the rate of errors.

<3,
Your Friendly Neighborhood Release Engineering Team

[0]. https://wikitech.wikimedia.org/wiki/Incident_documentation/20160601-MediaWiki is the most recent example I could find, but there have been others.
[1]. https://phabricator.wikimedia.org/D248
[2]. https://gerrit.wikimedia.org/r/#/c/294742/
[3]. https://github.com/wikimedia/scap/blob/master/scap/config.py#L19
[4]. https://gerrit.wikimedia.org/r/#/c/292505/
[5]. https://github.com/wikimedia/scap/blob/master/scap/config.py#L18
[6]. https://gerrit.wikimedia.org/r/#/c/300327/

_______________________________________________
Wikitech-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Ops] Canary Deploys for MediaWiki

Roan Kattouw-2
Note to deployers: when syncing certain config changes (e.g. adding a new
variable) that touch both InitialiseSettings and CommonSettings, you will
now need to use sync-dir wmf-config, because individual sync-files will
likely fail if the intermediate state throws notices/errors.

(It was a good idea to do this before, but it'll be more strongly enforced
now.)

On Jul 25, 2016 12:35, "Tyler Cipriani" <[hidden email]> wrote:

> <snip>

Re: [Ops] Canary Deploys for MediaWiki

Alex Monk
If the intermediate state throws notices/errors, wouldn't it be a better
idea to sync-file in the correct order to prevent such notices/errors?

On 25 July 2016 at 21:54, Roan Kattouw <[hidden email]> wrote:

> Note to deployers: when syncing certain config changes (e.g. adding a new
> variable) that touch both InitialiseSettings and CommonSettings, you will
> now need to use sync-dir wmf-config, because individual sync-files will
> likely fail if the intermediate state throws notices/errors.
>
> (It was a good idea to do this before, but it'll be more strongly enforced
> now.)

Re: [Ops] Canary Deploys for MediaWiki

Bryan Davis
On Mon, Jul 25, 2016 at 3:07 PM, Alex Monk <[hidden email]> wrote:

> On 25 July 2016 at 21:54, Roan Kattouw <[hidden email]> wrote:
>>
>> Note to deployers: when syncing certain config changes (e.g. adding a new
>> variable) that touch both InitialiseSettings and CommonSettings, you will
>> now need to use sync-dir wmf-config, because individual sync-files will
>> likely fail if the intermediate state throws notices/errors.
>>
>> (It was a good idea to do this before, but it'll be more strongly enforced
>> now.)
>
> If the intermediate state throws notices/errors, wouldn't it be a better
> idea to sync-file in the correct order to prevent such notices/errors?

I think Alex is "more right" here. If you are introducing a new $wmgX
var you really should always sync-file the changed InitialiseSettings
file first and then the CommonSettings that uses it. There's no really
good reason to spew a bunch of "undefined X" warnings and there is no
guarantee with sync-dir that the files will be sent in the proper
order.
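
A minimal sketch of that ordering for the "new variable" case, assuming the file that defines a setting is synced before the file that uses it. `SYNC_PRIORITY` and `sync_commands` are hypothetical helpers, not part of Scap; only the `scap sync-file <file> <message>` shape comes from the announcement.

```python
import shlex

# Assumption: lower number = synced earlier. Files not listed go last.
SYNC_PRIORITY = {
    "wmf-config/InitialiseSettings.php": 0,  # defines the new $wmgX value
    "wmf-config/CommonSettings.php": 1,      # uses it
}


def sync_commands(changed_files, message):
    """Return one sync-file command per file, definitions before uses."""
    ordered = sorted(
        changed_files,
        key=lambda f: SYNC_PRIORITY.get(f, len(SYNC_PRIORITY)),
    )
    return [
        f"scap sync-file {shlex.quote(f)} {shlex.quote(message)}"
        for f in ordered
    ]
```

This order only suits adding a variable; removing one would need the reverse (stop using it first, then drop the definition).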

Bryan
--
Bryan Davis              Wikimedia Foundation    <[hidden email]>
[[m:User:BDavis_(WMF)]]  Sr Software Engineer            Boise, ID USA
irc: bd808                                        v:415.839.6885 x6855

Re: [Ops] Canary Deploys for MediaWiki

Legoktm
Hi,

On 07/25/2016 04:12 PM, Bryan Davis wrote:
> On Mon, Jul 25, 2016 at 3:07 PM, Alex Monk <[hidden email]> wrote:
>> On 25 July 2016 at 21:54, Roan Kattouw <[hidden email]> wrote:
> I think Alex is "more right" here. If you are introducing a new $wmgX
> var
> <snip>

And to continue tooting the extension.json horn, if you're using
extension.json, there shouldn't be any need for $wmg*[1] variables
anymore. You can set 'wgFooBar' in InitialiseSettings.php and it'll just
work, because loading through extension.json doesn't set the defaults to
global scope upon loading like the old PHP entry points did, requiring
the $wg = $wmg hack.

Since a lot of already-deployed extensions use the $wg = $wmg pattern,
we're tracking this cleanup at <https://phabricator.wikimedia.org/T119117>.

[1] Okay, you'll still need the $wmgUseExtensionName variable, but that
is only introduced when deploying the extension for the first time.

-- Legoktm

Re: [Ops] Canary Deploys for MediaWiki

Roan Kattouw-2
In reply to this post by Bryan Davis
On Jul 25, 2016 16:12, "Bryan Davis" <[hidden email]> wrote:
>
> I think Alex is "more right" here. If you are introducing a new $wmgX
> var you really should always sync-file the changed InitialiseSettings
> file first and then the CommonSettings that uses it. There's no really
> good reason to spew a bunch of "undefined X" warnings and there is no
> guarantee with sync-dir that the files will be sent in the proper
> order.
>

That's true, but sometimes (either because multiple changes are made, or
things are removed, or whatever) there is no sync order that does not
produce errors. This isn't common but I have had it happen. In those cases,
you'll now be forced to use sync-dir.
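
A toy model of that situation, as a sketch only: each file is reduced to the set of variables it defines and the set it uses, and an intermediate state counts as clean when every used variable is defined somewhere. The function names and the `wmgOld`/`wmgNew` variables are made up for illustration, and real config files are of course not this simple.

```python
from itertools import permutations


def state_is_clean(state):
    """`state` maps file name -> (defined variables, used variables)."""
    defined = set().union(*(d for d, _ in state.values()))
    used = set().union(*(u for _, u in state.values()))
    return used <= defined


def clean_sync_orders(old, new):
    """Return every per-file sync order whose intermediate states are all clean."""
    orders = []
    for order in permutations(new):
        state = dict(old)
        ok = True
        for name in order:
            state[name] = new[name]  # this file is now at its new version
            if not state_is_clean(state):
                ok = False
                break
        if ok:
            orders.append(order)
    return orders


# Replacing wmgOld with wmgNew in both files in a single change:
old = {
    "InitialiseSettings.php": ({"wmgOld"}, set()),
    "CommonSettings.php": (set(), {"wmgOld"}),
}
new = {
    "InitialiseSettings.php": ({"wmgNew"}, set()),
    "CommonSettings.php": (set(), {"wmgNew"}),
}
no_clean_order = clean_sync_orders(old, new) == []
```

Here `no_clean_order` is true: whichever file goes first, the other one briefly refers to a variable that is not defined.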