Email notification

Tim Starling-2
I've deleted all the slow refreshLinks2 jobs which have apparently been
preventing the job queue from making any headway for the last few months.
Some people report that they have received hundreds of edit notification
emails in the last few hours, due to the months of backlog now being cleared.

-- Tim Starling


_______________________________________________
Wikitech-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: Email notification

Aryeh Gregor
On Mon, Feb 16, 2009 at 9:18 AM, Tim Starling <[hidden email]> wrote:
> I've deleted all the slow refreshLinks2 jobs which have apparently been
> preventing the job queue from making any headway for the last few months.
> Some people report that they have received hundreds of edit notification
> emails in the last few hours, due to the months of backlog now being cleared.

So are there no alarm bells that go off when the job queue is
unreasonably long, or do people just not listen to them?  Perhaps we
could have a bot in #wikimedia-tech that would complain every hour if
the oldest job in the queue is more than X days old?
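A minimal sketch of the check such a bot could run each hour. It assumes the age of the oldest job can be obtained somehow; the function name, the message text and the threshold are hypothetical, and nothing here reflects an existing bot.

```python
from datetime import datetime, timedelta
from typing import Optional


def backlog_warning(oldest_job: Optional[datetime], now: datetime,
                    max_age: timedelta) -> Optional[str]:
    """Return a warning line for the channel, or None if nothing needs saying.

    oldest_job is the (assumed available) queue time of the oldest pending
    job; None means the queue is empty.
    """
    if oldest_job is None:
        return None
    age = now - oldest_job
    if age <= max_age:
        return None
    return (f"WARNING: oldest job in the queue is {age.days} days old "
            f"(limit {max_age.days})")
```

A scheduler would call this once an hour and post the returned line, if any, to the channel.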


Re: Email notification

Thomas Dalton
2009/2/16 Aryeh Gregor <[hidden email]>:

> On Mon, Feb 16, 2009 at 9:18 AM, Tim Starling <[hidden email]> wrote:
>> I've deleted all the slow refreshLinks2 jobs which have apparently been
>> preventing the job queue from making any headway for the last few months.
>> Some people report that they have received hundreds of edit notification
>> emails in the last few hours, due to the months of backlog now being cleared.
>
> So are there no alarm bells that go off when the job queue is
> unreasonably long, or do people just not listen to them?  Perhaps we
> could have a bot in #wikimedia-tech that would complain every hour if
> the oldest job in the queue is more than X days old?

Alternatively, the number of jobs processed per request could be made
a function of the length of the backlog (in terms of time) - the
longer the backlog is, the faster we process jobs. Then if the job
queue gets to be months behind we would all notice it because
everything would start running really slowly. (Obviously, the length
of the job queue needs to be added to whatever diagnostic screen the
devs first check when the site slows down, otherwise it won't help
much.)
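As an illustration only, the proposal could look roughly like this, assuming the age of the backlog is known. The function name, the linear scaling and the upper cap are all assumptions made for the sketch, not anything MediaWiki provides.

```python
from datetime import timedelta


def jobs_per_request(backlog_age: timedelta, base_rate: int = 1,
                     max_rate: int = 50) -> int:
    """Scale the number of jobs run per request with the age of the backlog.

    Hypothetical policy: one extra job per request for every full day the
    queue is behind, capped at max_rate so the rate cannot grow unbounded.
    """
    days_behind = max(0, backlog_age.days)
    return min(max_rate, base_rate + days_behind)
```

With these defaults a queue that is a few hours behind still runs one job per request, while one that is ten days behind runs eleven.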


Re: Email notification

Aryeh Gregor
On Mon, Feb 16, 2009 at 11:02 AM, Thomas Dalton <[hidden email]> wrote:
> Alternatively, the number of jobs processed per request could be made
> a function of the length of the backlog (in terms of time) - the
> longer the backlog is, the faster we process jobs. Then if the job
> queue gets to be months behind we would all notice it because
> everything would start running really slowly.

Jobs are not processed on requests.  They're processed by a cron job.
You can't just automatically run them at a crazy rate, because that
will cause slave lag and other bad stuff.  If too many are
accumulating, it's probably due to a programming error that needs to
be found and fixed by human inspection.  (Tim just made several
commits fixing things that were spewing out too many jobs.)

> (Obviously, the length
> of the job queue needs to be added to whatever diagnostic screen the
> devs first check when the site slows down, otherwise it won't help
> much.)

#wikimedia-tech has enough people that regular warnings posted there
would probably get noticed.


Re: Email notification

Thomas Dalton
2009/2/16 Aryeh Gregor <[hidden email]>:
> On Mon, Feb 16, 2009 at 11:02 AM, Thomas Dalton <[hidden email]> wrote:
>> Alternatively, the number of jobs processed per request could be made
>> a function of the length of the backlog (in terms of time) - the
>> longer the backlog is, the faster we process jobs. Then if the job
>> queue gets to be months behind we would all notice it because
>> everything would start running really slowly.
>
> Jobs are not processed on requests.  They're processed by a cron job.

According to the documentation, by default they are run on requests.
Does Wikimedia not use that default?


Re: Email notification

Aryeh Gregor
On Mon, Feb 16, 2009 at 11:33 AM, Thomas Dalton <[hidden email]> wrote:
> According to the documentation, by default they are run on requests.
> Does Wikimedia not use that default?

That's correct, it doesn't.  The default is really only for easier
installation on shared hosting where cron might not be available (and
perhaps Windows, although I imagine that has some cron equivalent).


Re: Email notification

Andrew Garrett
In reply to this post by Aryeh Gregor

On Feb 16, 2009, at 7:32 AM, Aryeh Gregor <Simetrical+[hidden email]> wrote:

> So are there no alarm bells that go off when the job queue is
> unreasonably long, or do people just not listen to them?  Perhaps we
> could have a bot in #wikimedia-tech that would complain every hour if
> the oldest job in the queue is more than X days old?

The job queue does not have a timestamp field.

Andrew Garrett


Re: Email notification

Thomas Dalton
2009/2/16 Andrew Garrett <[hidden email]>:

> On Feb 16, 2009, at 7:32 AM, Aryeh Gregor <Simetrical+[hidden email]> wrote:
>> So are there no alarm bells that go off when the job queue is
>> unreasonably long, or do people just not listen to them?  Perhaps we
>> could have a bot in #wikimedia-tech that would complain every hour if
>> the oldest job in the queue is more than X days old?
>
> The job queue does not have a timestamp field.

That would be a mistake, then.


Re: Email notification

Platonides
In reply to this post by Aryeh Gregor
Aryeh Gregor wrote:

> If too many are
> accumulating, it's probably due to a programming error that needs to
> be found and fixed by human inspection.  (Tim just made several
> commits fixing things that were spewing out too many jobs.)
>
>> (Obviously, the length
>> of the job queue needs to be added to whatever diagnostic screen the
>> devs first check when the site slows down, otherwise it won't help
>> much.)
>
> #wikimedia-tech has enough people that regular warnings posted there
> would probably get noticed.

People did complain about the long job queue on #wikimedia-tech. I don't
think they were taken too seriously.



Re: Email notification

Aryeh Gregor
On Mon, Feb 16, 2009 at 6:20 PM, Platonides <[hidden email]> wrote:
> People did complain about the long job queue on #wikimedia-tech. I don't
> think they were taken too seriously.

Yes, because they're not a bot who a) we know is actually noting a
real problem instead of subjective impressions, and who b) spams the
complaint on an ongoing basis like nagios does.

Part of the problem is that the measure of job queue length we really
care about is "what was the last job executed?", not "how many jobs
are in the queue?".  If we added a job_timestamp column and put an
index on it, we could replace (or supplement) the cruddy poor-quality
estimate we have now with a probably more useful and certainly more
accurate one.
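A rough sketch of what such a column would enable, using an in-memory SQLite table as a stand-in for the real job table. The schema is simplified, the sample rows and the index name are made up for illustration, and job_timestamp is the proposed column, not an existing one.

```python
import sqlite3
from datetime import datetime

conn = sqlite3.connect(":memory:")
# Simplified stand-in for the job table, plus the proposed timestamp column.
conn.execute(
    "CREATE TABLE job (job_id INTEGER PRIMARY KEY, job_cmd TEXT, "
    "job_timestamp TEXT)"
)
# With an index on the column, finding the oldest job is a cheap lookup.
conn.execute("CREATE INDEX job_timestamp_idx ON job (job_timestamp)")
conn.executemany(
    "INSERT INTO job (job_cmd, job_timestamp) VALUES (?, ?)",
    [("refreshLinks2", "20081116090000"),    # illustrative rows only
     ("htmlCacheUpdate", "20090215120000")],
)


def oldest_job_age(conn, now):
    """Return how long the oldest queued job has waited, or None if empty."""
    row = conn.execute("SELECT MIN(job_timestamp) FROM job").fetchone()
    if row[0] is None:
        return None
    # 14-digit YYYYMMDDHHMMSS strings sort chronologically, so MIN() works.
    return now - datetime.strptime(row[0], "%Y%m%d%H%M%S")


age = oldest_job_age(conn, datetime(2009, 2, 16, 9, 0, 0))
```

The resulting age measures how far behind the queue is in time, which a plain row count cannot show.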


Re: Email notification

Tim Starling-2
In reply to this post by Aryeh Gregor
Aryeh Gregor wrote:

> On Mon, Feb 16, 2009 at 9:18 AM, Tim Starling <[hidden email]> wrote:
>> I've deleted all the slow refreshLinks2 jobs which have apparently been
>> preventing the job queue from making any headway for the last few months.
>> Some people report that they have received hundreds of edit notification
>> emails in the last few hours, due to the months of backlog now being cleared.
>
> So are there no alarm bells that go off when the job queue is
> unreasonably long, or do people just not listen to them?  Perhaps we
> could have a bot in #wikimedia-tech that would complain every hour if
> the oldest job in the queue is more than X days old?

If you check the server admin log, you'll find that this is the latest in
a long series of attempts to fix this problem. I don't think it's
completely fixed yet.

I'm not sure what good a complaining bot would do, any more than the
complaining users we seem to have plenty of. Deleting the jobs was
not a solution, and can't really be repeated without breaking things.
There's still a fair bit more programming to do.

-- Tim Starling



Re: Email notification

Platonides
In reply to this post by Aryeh Gregor
Aryeh Gregor wrote:
> Part of the problem is that the measure of job queue length we really
> care about is "what was the last job executed?", not "how many jobs
> are in the queue?".  If we added a job_timestamp column and put an
> index on it, we could replace (or supplement) the cruddy poor-quality
> estimate we have now with a probably more useful and certainly more
> accurate one.

Agreed. Adding job queue lag to Special:Statistics would benefit both
users and sysadmins.



Re: Email notification

Aryeh Gregor
In reply to this post by Tim Starling-2
On Mon, Feb 16, 2009 at 8:20 PM, Tim Starling <[hidden email]> wrote:
> I'm not sure what good a complaining bot would do, any more than the
> complaining users we seem to have plenty of. Deleting the jobs was
> not a solution, and can't really be repeated without breaking things.
> There's still a fair bit more programming to do.

I misunderstood the problem, evidently.  I thought it was a one-off
thing due to software bugs the sysadmins didn't know about.  It seems
it's more like a known, ongoing problem whose cause isn't understood
yet, so there's no point in bugging people about it all the time, no.


Re: Email notification

Aryeh Gregor
On Tue, Feb 17, 2009 at 9:27 AM, Aryeh Gregor
<[hidden email]> wrote:
> I misunderstood the problem, evidently.  I thought it was a one-off
> thing due to software bugs the sysadmins didn't know about.  It seems
> it's more like a known, ongoing problem whose cause isn't understood
> yet, so there's no point in bugging people about it all the time, no.

Although, I still think an "oldest job" statistic might be useful.


Re: Email notification

Tim Landscheidt
Aryeh Gregor <[hidden email]> wrote:

> On Tue, Feb 17, 2009 at 9:27 AM, Aryeh Gregor
> <[hidden email]> wrote:
>> I misunderstood the problem, evidently.  I thought it was a one-off
>> thing due to software bugs the sysadmins didn't know about.  It seems
>> it's more like a known, ongoing problem whose cause isn't understood
>> yet, so there's no point in bugging people about it all the time, no.

> Although, I still think an "oldest job" statistic might be useful.

It would be useful beyond statistics: when a category membership or a
link/template relationship is not updated soon after an edit to the
article, users could see whether the job queue or a bug is to blame.
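To make the idea concrete, here is a small sketch of that comparison. It assumes jobs are processed roughly in the order they were queued and that an oldest-pending-job timestamp is exposed somewhere; the function name and return strings are hypothetical.

```python
from datetime import datetime
from typing import Optional


def likely_cause(edit_time: datetime,
                 oldest_pending_job: Optional[datetime]) -> str:
    """Guess why an update triggered by an edit has not shown up yet.

    Assumes roughly first-in, first-out processing: if the oldest pending
    job was queued at or before the edit, the edit's own jobs may still be
    waiting behind it.
    """
    if oldest_pending_job is not None and oldest_pending_job <= edit_time:
        return "job queue"
    # Queue is empty or already past the edit: the update should have run.
    return "possible bug"
```

For an edit made on 10 February, an oldest pending job from November points at the queue, whereas an empty queue points at a bug.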

In the same way, it would be nice to extend "?action=purge"
to log any discrepancies it encounters between the pre- and
post-purge states to ease debugging.
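One way to picture such a log entry is a plain set difference between what was recorded for the page before the purge and what is recorded afterwards. The function name and the example titles below are hypothetical; how the two states would be gathered is left open.

```python
from typing import Dict, List, Set


def purge_discrepancies(before: Set[str], after: Set[str]) -> Dict[str, List[str]]:
    """List entries (e.g. categories or links) that changed across a purge."""
    return {
        "added": sorted(after - before),    # present only after the purge
        "removed": sorted(before - after),  # present only before the purge
    }


# Example with made-up category names.
report = purge_discrepancies({"Category:A", "Category:B"},
                             {"Category:B", "Category:C"})
```

An empty "added" and "removed" list would mean the purge changed nothing, so only non-empty reports would need logging.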

Tim


Re: Email notification

Tomasz Finc-2
In reply to this post by Platonides
Platonides wrote:

> Aryeh Gregor wrote:
>> Part of the problem is that the measure of job queue length we really
>> care about is "what was the last job executed?", not "how many jobs
>> are in the queue?".  If we added a job_timestamp column and put an
>> index on it, we could replace (or supplement) the cruddy poor-quality
>> estimate we have now with a probably more useful and certainly more
>> accurate one.
>
> Agreed. Adding job queue lag to Special:Statistics would benefit both
> users and sysadmins.

I've been toying with some additions to the job queue so that we have
some sense of what is going on: a timestamp, actual progress stats
and, if we want to get extra fancy, a better view of what the job
workers are doing.

Just simple things to help humans analyze what is going on. And if we're
lucky, maybe tell us why.

Now that I'm back, I'm hoping to have something ready for Brion and
everyone to look at soon.

--tomasz
