[ANNOUNCEMENT] RESTBase and related services DC switch-over test

Marko Obrovac
Hello,

The WMF’s technology department has set a goal for this quarter of testing
and temporarily switching the main operational data centre from Eqiad
(located in Ashburn, Virginia) to Codfw (located in Dallas) [1,2]. This
includes both back-end processing and serving live traffic from Codfw.

As a part of this effort, we are scheduling a switch-over for RESTBase and
its back-end services, including: Parsoid, the Mobile Content Service,
CXServer, Mathoid, Citoid, Apertium and Zotero [3]. Technically, it will
not be a real switch-over per se, because we will keep all of those
services active in both DCs. However, external traffic will be directed to
the Dallas DC only.

=== When is it and what does it mean for me? ===
The switch-over test is planned for this Thursday, 2016-03-17. We have
allotted a three-hour window for this [4]. There is nothing users should
do before or after the switch; it will be transparent for them. There are
two things users should note, though:

1) At the time of the switch-over, users might briefly receive error
responses (both 4xx and 5xx status codes). While we will test most things
ahead of time, we cannot test the actual traffic shifting, so small bumps
might be noticed.
2) After the switch to the Dallas DC, users will likely see slightly
elevated response latencies for some requests. This will occur because all
of the services that will be responding to live requests still need to
contact the main MediaWiki cluster, which will remain in Eqiad (the other
DC) until a complete switch-over of the infrastructure is performed.
However, given the multiple levels of caching, the 40 ms penalty for going
cross-DC on an uncached API request does not seem too taxing.
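
To make the caching argument in point 2 concrete, here is a rough sketch of
the average added latency per request. It assumes that only cache misses pay
the 40 ms cross-DC penalty and that each miss makes exactly one cross-DC
round trip; the hit ratios are hypothetical illustrations, not measured
values, and the function name is made up for this example.

```python
def expected_added_latency_ms(hit_ratio: float,
                              cross_dc_penalty_ms: float = 40.0) -> float:
    """Average extra latency per request, assuming only cache misses
    pay the cross-DC penalty (one round trip per miss)."""
    if not 0.0 <= hit_ratio <= 1.0:
        raise ValueError("hit_ratio must be between 0 and 1")
    return (1.0 - hit_ratio) * cross_dc_penalty_ms


if __name__ == "__main__":
    # Hypothetical cache hit ratios, for illustration only.
    for hit_ratio in (0.0, 0.5, 0.9, 0.99):
        added = expected_added_latency_ms(hit_ratio)
        print(f"hit ratio {hit_ratio:.0%}: about {added:.1f} ms added on average")
```

The higher the share of requests answered from cache, the smaller the average
penalty becomes, although each individual uncached request still pays the
full 40 ms.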

=== Wait, what about my service X running in WMF production? ===
If you are a service owner of one of the aforementioned services, there are
no explicit actions you should take prior to, during or after the
switch-over test. This test could, however, affect your service depending on
whether it usually serves live traffic or is mostly operational during
various internal updates. MediaWiki and JobQueue processing will still be
performed in Eqiad, so in the latter case your service should not see a
change in the usage pattern. If, however, your service is mostly in charge
of responding to live requests coming through RESTBase, those will be
handled by instances in Codfw. As these instances are full replicas of their
Eqiad counterparts and are stateless, we do not expect any major breakage.

Should you have any questions or concerns, don’t hesitate to contact us
here or on IRC (#wikimedia-services @ freenode).

Best,
Marko Obrovac, PhD
Senior Services Engineer
Wikimedia Foundation

[1] https://www.mediawiki.org/wiki/Wikimedia_Engineering/2015-16_Q3_Goals#Technology
[2] https://phabricator.wikimedia.org/project/profile/1723/
[3] https://phabricator.wikimedia.org/T127974
[4] https://wikitech.wikimedia.org/wiki/Deployments#Thursday.2C.C2.A0March.C2.A017
_______________________________________________
Wikitech-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [ANNOUNCEMENT] RESTBase and related services DC switch-over test

Marko Obrovac
FYI, the test has started. We are in the process of switching the traffic
to the Dallas DC.

Cheers,
Marko


Re: [Services] [ANNOUNCEMENT] RESTBase and related services DC switch-over test

Gabriel Wicke
The Services DC fail-over test finished without user impact, and
traffic is now switched back to eqiad.

We found an issue with one of the Cassandra nodes running out of
memory after switching update processing to codfw, as well as a
hiccup with one instance in eqiad. Due to the redundant set-up, this
did not affect operations. We are investigating these issues, and will
address them before the general fail-over test in April.

Many thanks to Marko Obrovac, Eric Evans (Services), Filippo
Giunchedi, Giuseppe Lavagetto and Emanuele Rocca (Operations), who
prepared the infrastructure to make the switch-over this smooth.

Gabriel

On Thu, Mar 17, 2016 at 3:10 AM, Marko Obrovac <[hidden email]> wrote:

> FYI, the test has started. We are in the process of switching the traffic to
> the Dallas DC.
>
> Cheers,
> Marko



--
Gabriel Wicke
Principal Engineer, Wikimedia Foundation


Re: [Services] [ANNOUNCEMENT] RESTBase and related services DC switch-over test

Giuseppe Lavagetto
In reply to this post by Marko Obrovac
On Mon, Mar 14, 2016 at 10:54 PM, Marko Obrovac <[hidden email]> wrote:

> Hello,
>
> The WMF’s technology department has for this quarter the goal of testing and
> temporarily switching the main operational data centre from Eqiad (located
> in Ashburn, Virginia) to Codfw (located in Dallas) [1,2]. This includes both
> back-end-processing as well as serving live traffic from it.
>
> As a part of this effort, we are scheduling a switch-over for RESTBase and
> its back-end services, including: Parsoid, the Mobile Content Service,
> CXServer, Mathoid, Citoid, Apertium and Zotero [3]. Technically, it will not
> be a real switch-over per se, because we will keep all of those services
> active in both DCs. However, external traffic will be directed to the Dallas
> DC only.
>

Hi all, just a quick heads-up:

given the small issues we experienced last time, which we've found to
be unrelated to the switch itself, we have scheduled a new 24-hour
switch-over test, starting tomorrow (April 5th) at 14:00 UTC. We don't
expect any significant user impact.

Anyway, should you have any questions or concerns, don’t hesitate to
contact us here or on IRC (#wikimedia-services / #wikimedia-operations
@ freenode).

Cheers,

Giuseppe
--
Giuseppe Lavagetto, Ph.d.
Senior Technical Operations Engineer, Wikimedia Foundation
