Gerrit was down today

Gerrit was down today

Greg Grossmeier-2
(It wasn't just you)

Gerrit was down today starting around 17:49 UTC. It is now back up and
services are coming back online.

A full investigation into the cause of the outage is still on-going.[0]

Apologies for the downtime.

WMF Release Engineering

[0] https://etherpad.wikimedia.org/p/gerrit-outage-20161006
    But this is missing a lot of the information/discussion that is
    happening in #wikimedia-operations on Freenode. A link to the
    incident report will be pasted into that etherpad when it is
    created.

--
| Greg Grossmeier            GPG: B2FA 27B1 F7EB D327 6B8E |
| Release Team Manager            A18D 1138 8E47 FAC8 1C7D |

_______________________________________________
Wikitech-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Engineering] Gerrit was down today

Chad Horohoe
Hi!

Sorry for the extended downtime! From what we can tell, it appears as though
the machine that Gerrit is running on (lead) is having some hardware issues
that are making the CPU misbehave. We've worked around it for now, so things
should be up (and Zuul is processing CI events just fine).

However, since it appears it's a hardware problem, we're planning to migrate
off of lead to a new machine (cobalt). The public IP addresses will not be
changing. The plan right now is to do this migration tomorrow with a scheduled
downtime at 17:00 UTC (10:00 PDT).

We'll be keeping a close eye on things in the meantime, so if things
deteriorate again we can start the migration sooner.

(and yeah, wikitech incident report to follow, I'm a little burnt out right
now though)

Thanks again for bearing with us!

-Chad


Re: [Engineering] Gerrit was down today

Gergo Tisza
Thanks a lot for the quick recovery!

Would it be possible to use something other than a redirect next time when
traffic needs to be blocked? An Apache deny rule or a 404 would work, but a
redirect means that reloading the page (or reopening the browser) will
cause the URL to be lost with little hope of recovery (browsers don't
record redirects in the history). That can be very annoying when one uses
tabs as bookmarks (bad habit as it is).


Re: [Engineering] Gerrit was down today

Amir Sarabadani-2
It bothered me too, but I'm guessing this is one of the many flaws of Gerrit
itself and probably not easily fixable (other people are more qualified to
comment). I want to suggest speeding up the move to Differential, which
handles such downtime much better, alongside its other benefits.

Best


Re: [Engineering] Gerrit was down today

Chad Horohoe
This is actually how we have Apache configured to respond to Gerrit being
unavailable - that error page is served with a 503 when Gerrit is really
down.

Today I hacked it to always show that page, so even when it was "up" people
wouldn't be hitting it -- we were still debugging and restarting things so I
didn't want to give false hopes or end up with half-completed transactions.

This can all be improved I think with some Apache config tweaks.
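
For illustration only, here is a minimal Python sketch of the behaviour asked
for earlier in the thread: answer every request with a 503 at the URL that was
requested, without a redirect, so the address in the browser tab is preserved.
This is not the Apache configuration discussed above (which is not shown in
this thread); the names maintenance_response and MaintenanceHandler, the
message text, the port, and the Retry-After value are all assumptions.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical maintenance message; the real error page is not shown in the thread.
MAINTENANCE_BODY = b"Gerrit is temporarily unavailable. Please try again later.\n"


def maintenance_response(retry_after_seconds=300):
    """Build a 503 response that is served in place, with no Location header."""
    headers = {
        "Content-Type": "text/plain; charset=utf-8",
        "Content-Length": str(len(MAINTENANCE_BODY)),
        # Retry-After hints to clients when to try again (value is an assumption).
        "Retry-After": str(retry_after_seconds),
    }
    return 503, headers, MAINTENANCE_BODY


class MaintenanceHandler(BaseHTTPRequestHandler):
    """Answer every request with the same 503, whatever path was requested."""

    def _respond(self):
        status, headers, body = maintenance_response()
        self.send_response(status)
        for name, value in headers.items():
            self.send_header(name, value)
        self.end_headers()
        if self.command != "HEAD":
            self.wfile.write(body)

    do_GET = do_POST = do_HEAD = _respond


if __name__ == "__main__":
    # Local demo only; address and port are arbitrary.
    HTTPServer(("127.0.0.1", 8080), MaintenanceHandler).serve_forever()
```

Because the response carries no Location header, reloading the tab after the
outage requests the original URL again rather than the error page's address.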

-Chad


Re: Gerrit was down today

Roan Kattouw-2
In reply to this post by Greg Grossmeier-2
Looks like it's down again? I was going to ask on IRC, but due to netsplits
(caused by freenode maintenance), IRCCloud is down too.

(IRC and Gerrit both down... clearly I should just go to lunch now :) )


Re: Gerrit was down today

Amir Sarabadani-2
Chad wrote:
> However, since it appears it's a hardware problem, we're planning to migrate
> off of lead to a new machine (cobalt). The public IP addresses will not be
> changing. The plan right now is to do this migration tomorrow with a
> scheduled downtime at 17:00 UTC (10:00 PDT).

TL;DR: scheduled downtime.

Best


Re: [Engineering] Gerrit was down today

Chad Horohoe
In reply to this post by Roan Kattouw-2
Yeah, we're working on a migration right now. It didn't go as smoothly as I
would have hoped.

Also Freenode is netsplitting which is not very helpful right now :(

Everything will be back soon!

-Chad


Re: [Engineering] Gerrit was down today

Daniel Zahn-2
In reply to this post by Roan Kattouw-2
The Gerrit migration is over. It is back up and is now served from the new
server "cobalt". It feels faster than before as well. Many thanks to
Brandon Black for the help.


Re: [Engineering] Gerrit was down today

Chad Horohoe
Heya!

Gonna reboot Gerrit real quick this morning. Turns out "cobalt" did not
have hyperthreading turned on. Services should be back momentarily!

-Chad
