Fwd: Another labs outage - curse of the accursed hardware failure continues
---------- Forwarded message ----------
From: Yuvi Panda <[hidden email]>
Date: Fri, Feb 27, 2015 at 11:42 AM
Subject: Another labs outage - curse of the accursed hardware failure continues
To: Wikimedia Labs <[hidden email]>
This is a repeat of the failure that happened a few days ago. The cause
is flaky underlying hardware, and andrewbogott is looking into it atm.
== Why is everything so terrible? ==
Labs instances are Virtual Machines that run on physical hardware.
When the underlying hardware dies, the virtual machines on it also
die. This is similar to AWS or other cloud providers. We had one spare
machine (virt1012) in case any of the machines currently in use died
and its instances needed a lifeboat.
A week or so ago one of the machines (virt1005) died, and we migrated
things to virt1012. This week, the new machine, virt1012, has been
having issues, and that's why the outages are all so similar. So the
current instability is basically caused by *two* different
hardware-related issues happening to two different machines with
And specifically for toollabs, it would be awesome for it to be able
to survive one virt* node being down. This is not an easy problem to
solve, but here's the tracking ticket for it:
Andrew is working through his night (again) to diagnose / fix this
issue (thanks!) and we'll keep you updated as things progress. Thank
you for your patience.