CI outage - ongoing

CI outage - ongoing

Chad Horohoe
Hi folks,

Right now our CI infrastructure (Zuul/Jenkins/Nodepool) is having a bad
day and isn't able to spawn new instances to perform tests. The outage is
ongoing and there isn't an ETA for restoration of service just yet.

In the meantime: please avoid force-merging (doing the Verified+2 check
yourself) and skipping Jenkins unless you're dealing with an urgent
production issue that must land today. Doing so makes Zuul extra noisy,
which makes further diagnosis difficult.

Thanks for your patience!

-Chad & rest of RelEng
_______________________________________________
Wikitech-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: CI outage - ongoing

Antoine Musso-3
On 05/07/16 23:28, Chad Horohoe wrote:

> Hi folks,
>
> Right now our CI infrastructure (Zuul/Jenkins/Nodepool) is having a bad
> day and isn't able to spawn new instances to perform tests. The outage is
> ongoing and there isn't an ETA for restoration of service just yet.
>
> In the meantime: please avoid force-merging (doing the Verified+2 check
> yourself) and skipping Jenkins unless you're dealing with an urgent
> production issue that must land today. Doing so makes Zuul extra noisy,
> which makes further diagnosis difficult.
>
> Thanks for your patience!
>
> -Chad & rest of RelEng

Hello,

The issue is resolved now and the backlog has been processed.

It started around 19:40 UTC when labs lost the ability to create
instances. That fully recovered at 21:40 UTC, and the backlog was
completely processed by 22:30 UTC.

--
Antoine "hashar" Musso




Re: CI outage - ongoing

Andrew Bogott
On 7/5/16 5:47 PM, Antoine Musso wrote:

> Hello,
>
> The issue is resolved now and the backlog has been processed.
>
> It started around 19:40 UTC when labs lost the ability to create
> instances. That fully recovered at 21:40 UTC, and the backlog was
> completely processed by 22:30 UTC.
>
The incident report for this outage is here:
https://wikitech.wikimedia.org/wiki/Incident_documentation/20160706-CI-Outage
It was complicated!

-Andrew

