[Wikitech-l] Database problem post-mortem

[Wikitech-l] Database problem post-mortem

Brion Vibber
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

== Summary ==

Full disk on the database master for non-en.wikipedia.org made most of
our wikis uneditable for about 2.5 hours, during Europe midday / US morning.

Immediate problem is repaired; some minor further cleanup needed;
procedural changes recommended.


== Disruption and data loss ==

The last edits to make it to the slave servers were at:
2007-01-19 11:50:29 UTC

A few more made it through on samuel before it stopped accepting
data, the last at:
2007-01-19 11:51:03 UTC
(23 broken edits on de.wikipedia.)

After that point the database accepted no more writes; the resulting
read-only state kept any further consistency problems from developing.

There _may_ be some minor problems related to caching of revision data
where ID numbers overlap from the old server, but this is unclear.


== Inspection and repair ==

I was woken up around 13:50 to take a look, informed that samuel
(non-enwiki master) was out of disk space and wikis were read-only.

After a few minutes to check that the slaves were consistent and that
there wasn't _too_ bad a lag between them and the master, I decided to
go ahead with a master switch to adler, leaving samuel out of service
until it gets re-cloned.

By 14:26 the master switch was done, and read-write service restored.


== Further work: immediate ==

If really desired, we may be able to clone the small number of 'lost'
edits from samuel.

Once we no longer need samuel's data, it should have its database
re-cloned from one of the slaves consistent with the new state, and it
can be restored to slave service.


== Further work: long-term ==

Our procedure for monitoring disk space and cleaning up binlogs is terrible.

Low-disk warnings from Nagios are routinely ignored, in part because the
thresholds seem much too high.
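
As a rough illustration of the kind of check involved, here is a minimal
Python sketch of a free-space check that follows the Nagios plugin exit-code
convention (0 = OK, 1 = WARNING, 2 = CRITICAL). The threshold values and the
function names are placeholders chosen for the example, not the settings
actually in use on the cluster.

```python
import shutil

# Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def classify(free_fraction, warn_below=0.20, crit_below=0.10):
    """Map the free fraction of a disk (0.0-1.0) to a Nagios status code.

    The default thresholds are illustrative assumptions only.
    """
    if free_fraction < crit_below:
        return CRITICAL
    if free_fraction < warn_below:
        return WARNING
    return OK


def check_path(path, **thresholds):
    """Measure free space on the filesystem holding `path` and classify it."""
    usage = shutil.disk_usage(path)
    return classify(usage.free / usage.total, **thresholds)


# Report the status code for the root filesystem.
print(check_path("/"))
```

Keeping the classification separate from the measurement makes it easy to
give the masters their own, stricter thresholds than the slaves.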

Binlog cleanup appears to be entirely manual and ad-hoc; there is no set
schedule or assignment to do this.

The good news is this task is easy to automate.

Recommendations:
* automate cleanup of binlogs on the db masters.
* make low-disk warnings more reasonable and visible for the masters
specifically (where it really, really matters).
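
One possible shape for the automated cleanup, sketched in Python under some
assumptions: each slave reports the master binlog file it is currently
reading (the Master_Log_File field of SHOW SLAVE STATUS), and everything
older than the oldest of those files is safe to drop with MySQL's
PURGE BINARY LOGS TO statement. The file names and function names below are
hypothetical; this only builds the statement and does not connect to a
server.

```python
import re


def binlog_index(name):
    """Return the numeric suffix of a binlog file name like 'mysql-bin.000123'."""
    match = re.fullmatch(r"[\w.-]+\.(\d+)", name)
    if match is None:
        raise ValueError(f"not a binlog file name: {name!r}")
    return int(match.group(1))


def purge_statement(slave_log_files):
    """Build a PURGE statement that keeps every binlog a slave still needs.

    slave_log_files: the Master_Log_File value reported by each slave.
    Returns None when no slave positions are known, so nothing is purged.
    """
    if not slave_log_files:
        return None
    # PURGE ... TO 'x' removes logs older than x and keeps x itself.
    oldest_needed = min(slave_log_files, key=binlog_index)
    return f"PURGE BINARY LOGS TO '{oldest_needed}';"


# Hypothetical positions reported by three slaves.
print(purge_statement(["mysql-bin.000412", "mysql-bin.000409", "mysql-bin.000411"]))
```

Run from cron on each master, something along these lines would stop logs
from piling up while never removing a file that a lagging slave has yet to
fetch.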

- -- brion vibber (brion @ pobox.com)
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (Darwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFFsOS1wRnhpk1wk44RAjujAKDLga9UHrs9Z5o0E6DM24puZvkSMwCeO9N0
/TIoWOSKKdUMOO3Lu5Bdn0M=
=R6SD
-----END PGP SIGNATURE-----

_______________________________________________
Wikitech-l mailing list
[hidden email]
http://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Database problem post-mortem

Brad Patrick
Is a hardware solution also necessary? That is, was the problem a
function of size as well as cleanup, or just the cleanup?

--
Brad Patrick
General Counsel & Interim Executive Director
Wikimedia Foundation, Inc.
[hidden email]
727-231-0101

Re: [Wikitech-l] Database problem post-mortem

Brion Vibber
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Brad Patrick wrote:
> Is a hardware solution also necessary? That is, was the problem a
> function of size as well as cleanup, or just the cleanup?

Just cleanup. Binlogs had been accumulating since September 30 on
samuel, totaling about 65 GB. Had they been cleaned up more regularly
there would have been no disk shortage.

(Incidentally, I removed the earliest 10 GB during emergency cleanup to
free up some work space on the drive.)
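
For a sense of scale, the figures above work out to a little over half a
gigabyte of binlogs per day. A small Python sketch of the arithmetic,
assuming "September 30" means 2006-09-30 and taking the day of the outage,
2007-01-19, as the end date:

```python
from datetime import date


def daily_growth_gb(total_gb, start, end):
    """Average growth in GB per day between two dates."""
    days = (end - start).days
    if days <= 0:
        raise ValueError("end date must be after start date")
    return total_gb / days


# 65 GB accumulated between 2006-09-30 (assumed year) and 2007-01-19.
rate = daily_growth_gb(65, date(2006, 9, 30), date(2007, 1, 19))
print(f"{rate:.2f} GB/day")
```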

- -- brion vibber (brion @ pobox.com / brion @ wikimedia.org)
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (Darwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFFsPrHwRnhpk1wk44RAgvyAKCDijMh4yO8RQiuK3ysNgZst2T6PgCfTA5Y
mtWmsMlJzfiCw5pXLk9fq4s=
=NDvw
-----END PGP SIGNATURE-----
