eqiad->codfw datacenter switchover, weeks of Apr 17th/May 1st

Faidon Liambotis
Hi all,

You may have heard already that, like last year, we are planning to
switch our active datacenter from eqiad to codfw in the week of April
17th and back to eqiad two weeks later, in the week of May 1st. We do
this periodically in order to exercise our ability to run from the
backup site in case of a disaster, as well as our ability to switch
seamlessly to it with little user impact.

Switching will be a gradual, multi-step process, the most visible step
of which will be the switch of MediaWiki application servers and
associated data stores. This will happen on April 19th (eqiad->codfw)
and May 3rd (codfw->eqiad), both at 14:00 UTC. During those windows, the
sites will be placed into read-only mode, for a period that we estimate
to last approximately 20 to 30 minutes.

Furthermore, the deployment train will freeze for the weeks of April
17th and May 1st[1], but operate normally on the week of April 24th, in
order to exercise our ability to deploy code while operating from the
backup datacenter.

1: https://wikitech.wikimedia.org/wiki/Deployments

Compared to last year we have improved our processes considerably[2], in
particular by making more services operate in an active/active manner,
as well as by working on an automation and orchestration framework[3] to
perform parallel executions across the fleet. The core of the MediaWiki
switchover will be performed semi-automatically using new software[4]
that will execute all the necessary commands in sequence with little
human involvement, thus lowering the risk of introducing errors and
delays.

2: https://wikitech.wikimedia.org/wiki/Switch_Datacenter
3: https://github.com/wikimedia/cumin
4: https://github.com/wikimedia/operations-switchdc
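
As an illustration only, the idea of executing commands in sequence with
little human involvement can be sketched as below. This is not the
actual switchdc or cumin code: the Step class, the run_steps function
and the stop-at-first-failure behaviour are all assumptions made for the
example.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    """One hypothetical switchover step (name and action are illustrative)."""
    name: str
    action: Callable[[], bool]  # returns True on success, False on failure


def run_steps(steps: List[Step]) -> List[str]:
    """Run the steps in order and stop at the first one that fails.

    Returns the names of the steps that completed successfully, so that
    an operator can see where the sequence stopped.
    """
    completed: List[str] = []
    for step in steps:
        if not step.action():
            break  # assumption: a failed step halts the whole sequence
        completed.append(step.name)
    return completed
```

Stopping at the first failure is a design choice of this sketch: it
leaves the decision on how to proceed to a human rather than continuing
from an unknown state.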

Improving and automating our processes means that we're not going to be
following the exact same steps as last year. Because of that, and
because of other changes introduced in our environment over the course
of the year, there is a possibility of errors creeping into the process.
We'll certainly try to fix any issues that arise during those weeks and
we'd like to ask everyone to be on high alert and vigilant.

To report any issues, please use one of the following channels:

1. File a Phabricator issue with project #codfw-rollout
2. Report issues on IRC: Freenode channel #wikimedia-tech (if urgent, or
during the migration)
3. Send an e-mail to the Operations list: [hidden email] (any time)

Thanks,
Faidon
--
Faidon Liambotis
Principal Operations Engineer
Acting Director of Technical Operations
Wikimedia Foundation

_______________________________________________
Wikitech-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: eqiad->codfw datacenter switchover, weeks of Apr 17th/May 1st

Faidon Liambotis
Hi all,

The first part of this switch completed successfully today. The
projects were in read-only mode for approximately 17 minutes in total.

There were a few hiccups that were or still are being dealt with, the
most significant ones identified so far being:

- The switch to codfw, in combination with what looks like a
  ContentTranslation bug, resulted in an overload of the x1 database
  shard, which in turn affected Echo & Flow. The total outage for Echo &
  Flow was from 15:36 until 16:18 UTC.

  ContentTranslation had also been misbehaving since 15:36 and was
  disabled on all Wikipedias at 15:57 UTC. The Language team and Roan
  are trying to identify the root cause. It will be gradually re-enabled
  once the cause is found and the issue is fixed. [T163344]

- ORES Watchlist items appear duplicated. The ORES team is still
  investigating this. [T163337]

- The codfw API database slaves were overloaded, and an additional one
  had to be pooled in to handle the load, but it doesn't look like this
  was enough to fully alleviate the problem. The API is and has been
  available throughout this work, albeit with reduced performance. This
  is still being dealt with by the DBA team. [T163351]

- The IPs of the Redis servers used for locking were misconfigured in
  mediawiki-config, which caused file uploads (e.g. to Commons) and
  deletions to fail until the configuration was manually fixed. These
  operations were broken between 14:30 and 14:44 UTC. [T163354]
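
For reference, the durations implied by the timestamps in the list above
work out to 42, 21 and 14 minutes. A minimal sketch of that arithmetic
follows; the minutes_between helper and the labels are made up for the
example, and it assumes all times fall on the same UTC day.

```python
from datetime import datetime

FMT = "%H:%M"


def minutes_between(start: str, end: str) -> int:
    """Whole minutes from start to end, both "HH:MM" on the same UTC day."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds() // 60)


# Timestamps as reported in the list above; the labels are illustrative.
windows = {
    "Echo & Flow outage": ("15:36", "16:18"),
    "ContentTranslation misbehaving before being disabled": ("15:36", "15:57"),
    "Uploads and deletions broken": ("14:30", "14:44"),
}

for label, (start, end) in windows.items():
    print(f"{label}: {minutes_between(start, end)} minutes")
```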

The issues above are still being worked on and the situation is
evolving, so some of the above may be inaccurate already. Phabricator
is the authoritative place for both the root cause and mitigation action
of all of those issues, with #codfw-rollout being the common tag.

Please follow up there either on existing issues or new ones that you
may discover over the course of the next 2-3 weeks :)

Thanks to everyone in and outside of ops for both the substantial amount
of work that has gone into preparing for this day, as well as for all
the firefighting for the better part of today. Expect to hear more from
us when this project concludes.

Best,
Faidon
--
Faidon Liambotis
Principal Operations Engineer
Acting Director of Technical Operations
Wikimedia Foundation

On Fri, Apr 07, 2017 at 04:58:09PM +0300, Faidon Liambotis wrote:



Re: eqiad->codfw datacenter switchover, weeks of Apr 17th/May 1st

Faidon Liambotis
Hi all,

The switchback to eqiad was successfully completed this week. The main
read-only phase of the switch took place on Wednesday at 14:30 UTC,
as originally scheduled.

The read-only time was approximately 13 minutes in this run (down from
17), and the run was less eventful than the switchover two weeks ago.
Multiple bugs were fixed and small features were added over the course
of the past two weeks, which explains the drop in runtime and the
increased resilience.

Short summary of where we're at:

- Extension:Cognate caused a brief x1 outage; it's still unclear whether
  this was switchover-related or not and it's still being investigated.
  [T164407]

- The job queue corruption issue that was found during the first
  switchover was worked around, but a long-term fix to the issue is
  still pending. [T163337]

- The Content Translation issues are still being worked on, but didn't
  cause an issue this time. [T163344]

- Unfortunately, for stability reasons, we were unable to use the new
  MediaWiki etcd integration this time either, despite Tim's herculean
  last-minute efforts. [T156924]

The workboard for the project is still at #codfw-rollout and will
continue to be updated as we go through these issues.

This was, overall, a success. A number of issues were identified and
most of them have already been fixed -- this was the purpose of this
whole endeavour :)

Many thanks to everyone who has contributed to this goal! This was
really an effort across multiple teams and many individuals, all of
whom worked hard and under strict deadlines to contribute to this
project. I am personally grateful to all of you, you know who you are :)

Work on this project will continue throughout the next fiscal year (July
2017 - June 2018) across the Technology department, with the ultimate
holy-grail goal of an active-active setup for all of our services. We'll
keep you all up to date on the progress.

Best regards,
Faidon
--
Faidon Liambotis
Principal Operations Engineer
Acting Director of Technical Operations
Wikimedia Foundation

On Wed, Apr 19, 2017 at 08:33:49PM +0300, Faidon Liambotis wrote:



Re: eqiad->codfw datacenter switchover, weeks of Apr 17th/May 1st

Pine W
Thanks for publishing the email summaries.

This is one of those situations where, contrary to some of our perfectionist
tendencies (including mine), it's a surprise if everything goes 100%
according to plan, and discovering problems through a live-fire exercise is
a characteristic of a successful exercise. I'm reminded a bit of some
discussions at the Wikimedia Conference and elsewhere where people talk
about risks and failures; I think it's nice to be reminded that measured
risks and failures are OK in some scenarios. Nothing ventured, nothing
gained (or learned!)

Pine