Re: Kafka Main Eqiad outage and failover of Eventbus/Eventstreams to codfw

Luca Toscano
[Adding some other mailing lists in Cc]

Hi everybody,

as many of you probably noticed yesterday on the operations@ mailing list,
the Kafka Main eqiad cluster had an outage that forced us to switch the
Eventbus and Eventstreams services to codfw.

The precise timings will be listed in
https://wikitech.wikimedia.org/wiki/Incident_documentation/20180711-kafka-eqiad,
but here is a quick summary:

2018-07-11 17:00 UTC - Eventbus service switched to codfw
2018-07-11 18:44 UTC - Eventstreams service switched to codfw

We are going to switch those services back to eqiad within the next couple
of hours. Consumers of the Eventstreams service may see some failures or
dropped data; apologies in advance for the trouble.

Cheers,

Luca

On Thu, 12 Jul 2018 at 00:00, Luca Toscano <
[hidden email]> wrote:

> Hi everybody,
>
> as you might have seen on the operations channel on IRC, the Kafka Main
> Eqiad cluster (kafka100[1-3].eqiad.wmnet) suffered a long outage caused by
> new topics created with names that were too long (leading to filesystem
> operation issues, among other things). I'll update this email thread
> tomorrow, EU time, with more details, tasks, the precise root cause, etc.,
> but the important thing to know is that Eventbus and Eventstreams have been
> failed over to the Kafka Main Codfw cluster. This should be transparent to
> everybody, but please let us know if it is not.
>
> Thanks for your patience!
>
> (a very sleepy :) Luca
>
>
_______________________________________________
Wiki-research-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l