📈 Wikimedia production errors help

📈 Wikimedia production errors help

Tyler Cipriani
Hello all!

Over the past few months we've reached the ignominious milestone of
the most open tasks of all time on the wikimedia-production-error
dashboard[0].

Background: The wikimedia-production-error dashboard is a workboard of
tasks created while digging through the Wikimedia production error
logs. All tasks there are log messages that have originated on
production servers.

The number of new tasks being created with this tag in a given week is
outpacing the number of tasks being closed in a given week: this past
week we added 41 tasks and only closed 22.

This is beginning to be unsustainable :(

There are currently 281 open tasks filed for errors in production.
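
As a rough illustration of why that trend is worrying, the sketch below
projects the backlog forward. It assumes, purely for illustration, that
last week's rates (41 added, 22 closed) stay constant, which real weeks
will not do; the function name is hypothetical.

```python
# Hypothetical projection: assumes the weekly rates quoted above stay constant.
OPEN_TASKS = 281      # open tasks at the time of this message
ADDED_PER_WEEK = 41   # tasks added this past week
CLOSED_PER_WEEK = 22  # tasks closed this past week


def projected_open_tasks(weeks: int) -> int:
    """Open tasks after `weeks` weeks at a constant net weekly change."""
    net_change = ADDED_PER_WEEK - CLOSED_PER_WEEK
    return OPEN_TASKS + net_change * weeks


for weeks in (1, 4, 12):
    print(f"after {weeks:2d} week(s): {projected_open_tasks(weeks)} open tasks")
```

At a net gain of 19 tasks per week, the board only shrinks if closures
outpace new reports.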

Although we're triaging this workboard weekly, we rely on the
expertise of developers most familiar with the error messages to
triage them, prioritize them, and "fix" them (for whatever value of
"fix" is appropriate).

Below is a smattering of selected issues that could use some attention:

  1. PHP Fatal error: Out of memory in cdb/src/Reader/DBA.php[1]
  2. Uncaught ReferenceError: collectionCall is not defined[2]
  3. Flow: PHP Notice: Undefined index: flow-workflow-change[3]
  4. PHP Warning: unpack(): Type H: not enough input, need 4, have 0[4]
  5. TypeError: undefined is not an object (evaluating 'this.getMIMEType')[5]
  6. Elastica\Exception\ResponseException from line 56 of
GeoData/includes/Searcher.php[6]
  7. Wikimedia\CSS\Objects\ComponentValueList may not contain tokens
of type "[".[7]

Please help to triage or resolve these problems or any of the other
166 tasks needing triage[8] if you are able.

<3
-- Tyler

[0]: <https://phabricator.wikimedia.org/tag/wikimedia-production-error/>
[1]: <https://phabricator.wikimedia.org/T260234>
[2]: <https://phabricator.wikimedia.org/T259809>
[3]: <https://phabricator.wikimedia.org/T259739>
[4]: <https://phabricator.wikimedia.org/T259592>
[5]: <https://phabricator.wikimedia.org/T259419>
[6]: <https://phabricator.wikimedia.org/T258641>
[7]: <https://phabricator.wikimedia.org/T258093>
[8]: <https://phabricator.wikimedia.org/maniphest/query/LW5WTEnToXDn/#R.>

_______________________________________________
Wikitech-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: 📈 Wikimedia production errors help

Niklas Laxström
On Mon, Sep 14, 2020 at 23:49, Tyler Cipriani ([hidden email]) wrote:
> The number of new tasks being created with this tag in a given week is
> outpacing the number of tasks being closed in a given week: this past
> week we added 41 tasks and only closed 22.

The majority of the recently created tasks are frontend JavaScript errors.
Logging of these errors only started recently. These issues may have
been present for years already, but they are only being reported now.

> This is beginning to be unsustainable :(

If there is an increase in the number of real new issues and/or a
decrease in the number of issues fixed, then I would be worried. Given
what I said above, it's difficult to see if this is the case.

Regardless, I do agree that we should aim to minimize production
errors to make it easier to spot any new issues. I would encourage all
maintainers and development teams to ensure that they have a regular
process to check for and triage any production issues in code they
maintain.

I think we should expect the number to go up while the backlog of
previously unreported frontend errors is being reported, and then to
start going down as developers work to reduce the backlog of reported
issues. It will probably stabilize at some level, higher than
previously, indicating that some areas of code lack maintainers or
maintenance resources.

Ending with a question: do we want to have both frontend and backend
errors on the same tag/board, or should they be on separate ones?

  -Niklas


Re: 📈 Wikimedia production errors help

Derk-Jan Hartman
In particular, I count 13 frontend problems with the old TMH Kaltura player.
There is clearly no intent to fix those (volunteer or employee), as the
Kaltura player has been unmaintained for 8 years.
The choices, as far as I can tell, are to ignore them, undeploy a/v playback,
or direct C-level management to get the audio and video stuff together.

DJ


Re: 📈 Wikimedia production errors help

Tyler Cipriani
In reply to this post by Niklas Laxström
Hi!

Thanks for the feedback, this is useful information.

On Tue, Sep 15, 2020 at 3:00 AM Niklas Laxström
<[hidden email]> wrote:
> On Mon, Sep 14, 2020 at 23:49, Tyler Cipriani ([hidden email]) wrote:
> If there is an increase in the amount of real new issues and/or
> decrease in the amount of issues fixed, then I would be worried. Given
> what I said above, it's difficult to see if this is the case.

Indeed, a trendline for production quality is difficult to compare if
a large backlog is being added.

> Regardless, I do agree that we should aim to minimize production
> errors to make it easier to spot any new issues. I would encourage all
> maintainers and development teams to ensure that they have a regular
> process to check for and triage any production issues in code they
> maintain.

+100 to checking for production errors. It's my hope that folks who
have code that is going out on a train are:

1. Aware their code is going to production that week
2. Watching for related logs and alerts (where possible)
3. Performing other software quality assurance activities on their
code as it rolls out (manual testing, for example)

My assessment of risk as a person deploying software to production is
necessarily linked to my view into quality assurance activities. If
production errors are growing, I worry about sustainability. The
production error dashboard's past stability has provided assurances
about shared awareness and priority of a given week's deployment.

That is, I know there are software quality activities that take place
sometime after code hits group0 or group1 or group2; however, much of
that activity remains opaque. This is why this dashboard is crucial
for deployment.

Having the explicit assurances of folks whose code is going to
production that week would be preferable to any inference I can make
from this dashboard. It's my hope that maintainers and teams triaging
and grooming this dashboard will create an emergent process that can
be used to provide real insight. That is, if we all are keeping this
dashboard up-to-date collectively, it will be easier to see when
quality assurance activities have taken place. Further, if we
collectively fret over this dashboard then we'll share a collective
awareness of anomalies.

> Ending with a question: do we want to have both frontend and backend
> errors on the same tag/board, or should they be on separate ones?

That's a good question. I think that having a single workboard is nice
as there are reporting features[0] that provide some insights about
the overall health of production. Those insights are, as evidenced,
only as good as their inputs, but they remain valuable to me.
Additionally, a single tag may be used in saved searches and custom
dashboards to make it easy to stay on top of issues seen in production
(that is my hope, though it may not align with how folks triage in
practice).

Thanks for the feedback. This anomaly makes more sense to me than it did :)

-- Tyler

[0]: <https://phabricator.wikimedia.org/project/reports/1055/>


Re: 📈 Wikimedia production errors help

Tyler Cipriani
In reply to this post by Derk-Jan Hartman
On Tue, Sep 15, 2020 at 5:24 AM Derk-Jan Hartman
<[hidden email]> wrote:
>
> In particular I count 13 frontend problems with the old TMH kaltura player.
> There is clearly no intent to fix those (volunteer or employee), as the
> Kaltura player has been unmaintained for 8 years.
> The choices as far as I can tell are to ignore them, undeploy a/v playback
> or to direct C-level management to get the audio and video stuff together.

The tasks that I mentioned in my original message are, likewise, tasks
that I'm not sure belong to any team or any particular person.

I have been using the phab tag/milestone "Release Engineering
(Logspam)" to ensure that we don't lose track of tasks that are:

1. problems in production
2. tagged in phabricator with a team or component (in contrast to
problems with unknown components/team tags)
3. no longer resourced or maintained in a discernible way

Feel free to apply that tag if those 3 conditions apply to these
tasks. Tracking these will make it easier to raise awareness later.

Thanks!
-- Tyler


Re: 📈 Wikimedia production errors help

Alex Ezell
Hi y'all,
Do we use levels for any of these error log outputs? That is, are they
classified on output as High, Medium, Low, Info, or something like that?

Or do we have to triage each of them as we examine them?

I was just thinking that if they were somehow leveled, we could measure
the number of each type and set targets for lowering the number of those
kinds of logs. That would potentially help visualize and prioritize the
work. It might be easier to say something like, "Let's set a goal to
produce 10% fewer High errors in the next two months," than to have a more
nebulous approach that seems to require Tyler or someone from his team to
highlight tasks that are especially impactful.
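
To make the idea concrete, here is a minimal sketch of checking such a
target. Everything in it is an assumption for illustration: the level
names are the ones floated above rather than an existing classification
in the production logs, the counts are made up, and the function name is
hypothetical.

```python
from collections import Counter


def meets_reduction_target(before: Counter, after: Counter,
                           level: str, reduction: float) -> bool:
    """True if `level` errors dropped by at least `reduction` (0.10 means 10%)."""
    baseline = before[level]
    if baseline == 0:
        # Nothing to reduce; the target holds only if the count is still zero.
        return after[level] == 0
    return after[level] <= baseline * (1 - reduction)


# Made-up error counts per level for two consecutive two-month periods.
before = Counter({"High": 20, "Medium": 35, "Low": 50})
after = Counter({"High": 17, "Medium": 36, "Low": 41})

for level in ("High", "Medium", "Low"):
    print(level, meets_reduction_target(before, after, level, 0.10))
```

A check like this only says whether a count moved; it says nothing about
whether the level assigned to each message was meaningful in the first
place.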

I'm mostly ignorant of exactly how these processes work now so if I'm
telling y'all something you already know, forgive me. I was mostly thinking
out loud about how we could start to approach the work more systematically.

Alex Ezell (he/him)
Senior Engineering Manager
Wikimedia Foundation



Re: 📈 Wikimedia production errors help

Brennen Bearnes
On 9/15/20 9:43 AM, Alex Ezell wrote:

> Do we use levels for any of these error log outputs? That is, are they
> classified on output as High, Medium, Low, Info, or something like that?

To an extent, yes.  We have separate channels for PHP errors and
exceptions, for example, and although I don't think we currently
differentiate in logstash, maybe we could plausibly draw a further
distinction between PHP error levels.  Intuitively, a low number of PHP
notices probably indicates something of lower severity than a high
number of fatals, and so forth.

Teasing out more detail about reported error severity could be a useful
exercise, but I'm not sure it would result in much more meaningful
signals than we currently have about production health.  Serious
problems can manifest as trivial-seeming notices, some issues start out
that way and cascade over time, and generally any form of recurring
logspam needs human evaluation before we can easily say much more than
"this is a problem".

> Or do we have to triage each of them as we examine them?

Yeah.  There are doubtless a lot of ways to improve the tooling we use
for that process, but right now I think it would be most helpful if we
just had more eyes _routinely_ on the logs and the workboard.  (See
Tyler's earlier and much more detailed/thoughtful response to this thread.)

--
Brennen Bearnes
Release Engineering


Re: 📈 Wikimedia production errors help

Krinkle
In reply to this post by Niklas Laxström
On Tue, Sep 15, 2020 at 10:00 AM Niklas Laxström <[hidden email]>
wrote:

> On Mon, Sep 14, 2020 at 23:49, Tyler Cipriani ([hidden email])
> wrote:
> > The number of new tasks being created with this tag in a given week is
> > outpacing the number of tasks being closed in a given week: this past
> > week we added 41 tasks and only closed 22.
>
> The majority of the recently created tasks are frontend JavaScript errors.
> Logging of these errors only started recently.


Aye, this is indeed a distraction currently. In talking with Tyler prior to
this email I failed to highlight what I think the main area of concern is,
which is indeed not just the total number of reports from this and last
month.

Rather, my main concern is that over the past six months (including long
before the JS stuff came along), we've fallen quite a bit behind in
addressing ongoing production errors.

For example, of the 30-odd backend errors reported in June, 14 were still
open a month later in July [1], and 12 were still open – three months later
– in September. The majority of these haven't even yet been triaged,
assigned or otherwise acknowledged. And meanwhile we've got more
(non-JavaScript) stuff from July, August and September adding pressure. We
have to do better.

-- Timo

[1]
https://phabricator.wikimedia.org/phame/post/view/203/production_excellence_22_june_2020/

Re: 📈 Wikimedia production errors help

Tyler Cipriani
In reply to this post by Brennen Bearnes
On Tue, Sep 15, 2020 at 11:06 AM Brennen Bearnes <[hidden email]> wrote:

> On 9/15/20 9:43 AM, Alex Ezell wrote:
> > Do we use levels for any of these error log outputs? That is, are they
> > classified on output as High, Medium, Low, Info, or something like that?
>
> Teasing out more detail about reported error severity could be a useful
> exercise, but I'm not sure it would result in much more meaningful
> signals than we currently have about production health.  Serious
> problems can manifest as trivial-seeming notices, some issues start out
> that way and cascade over time, and generally any form of recurring
> logspam needs human evaluation before we can easily say much more than
> "this is a problem".

This aligns with my view of our team's ability to assign meaningful
priorities. High-level general knowledge about our deployment, errors,
and error logging can't substitute for domain expertise. Teams with
expertise in a particular codebase are best positioned to understand the
impact of a particular message and derive a useful priority.

> it would be most helpful if we
> just had more eyes _routinely_ on the logs and the workboard.  (See
> Tyler's earlier and much more detailed/thoughtful response to this thread.)

+1. An interface connecting the log triage workboard and process to
team/maintainer workflows is a missing component of assigning
priorities.

There is a long developer feedback loop past integration. Hopefully,
this process helps to shorten the feedback loop to developers and
reduce the opacity of the process beyond integration through release
and monitoring. Having the expertise of developers writing the code be
a part of the deployment and monitoring of that code in production is
the goal of this process and the key to its utility.

-- Tyler


Re: 📈 Wikimedia production errors help

Dan Andreescu
In reply to this post by Krinkle
>
> For example, of the 30-odd backend errors reported in June, 14 were still
> open a month later in July [1], and 12 were still open – three months later
> – in September. The majority of these haven't even yet been triaged,
> assigned or otherwise acknowledged. And meanwhile we've got more
> (non-JavaScript) stuff from July, August and September adding pressure. We
> have to do better.
>
> -- Timo
>

This feels like it needs some higher level coordination.  Like perhaps
managers getting together and deciding production issues are a priority and
diverting resources dynamically to address them.  Building an awesome new
feature will have a lot less impact if the users are hurting from growing
disrepair.  It seems to me like if individual contributors and maintainers
could have solved this problem, they would have by now.  I'm a little
worried that the only viable solution right now seems like heroes stepping
up to fix these bugs.

Concretely, I think expanding something like the Core Platform Team's
clinic duty might work.  Does anyone have a very rough idea of the time it
would take to tackle 293 (wow we went up by a dozen since this thread
started) tasks?

Re: 📈 Wikimedia production errors help

AntiCompositeNumber
There is an impression among many community members, myself included,
that Foundation development generally prioritizes new features over
fixing existing problems. Foundation teams will sprint for a few
months to put together a minimum viable product, release it, then move
on to the new hotness, leaving user requests, bugfixes, and the like
behind. It often seems that the only way to get a bug fixed is to get
a volunteer developer to look at it. This is likely unintentional, but
it happens nonetheless.

Putting a higher priority within the Foundation on cleaning up old
toys before taking out new ones is necessary for the long-term
stability of the projects.

ACN


Re: 📈 Wikimedia production errors help

C. Scott Ananian
ACN -- for what it's worth, I've been working for the Foundation for a
while now, and I can report from the inside that the trend is definitely in
a positive direction.  There is a lot more internal focus on addressing
code debt and giving maintenance a fair spot at the table.  (In fact, my
entire team is now sitting inside 'maintenance', apparently; we used to
be 'platform evolution'.)  This email thread is one visible aspect of that
focus on code quality, not just features.

That said, the one aspect which hasn't improved much in my time at the
foundation has been the tendency of teams to work in silos.  This thread
also seems to be a symptom of that: a bunch of production issues are being
dropped on the floor ('not resolved in over a month') because they are
falling between the silos and nobody knows who is best able to fix them.
There are knowledge/expertise gaps among the silos as well: someone
qualified to fix a DB issue might be at sea trying to track down a
front-end bug, and vice versa.  A number of generalists in the org could
technically tackle a bug no matter where it lies, but it will take them
much longer to grok an unfamiliar codebase than it would for someone more
familiar with that silo.  So bug triage is an increasingly technical task
in its own right.

This thread, as I read it sitting inside the org, isn't so much asking for
more attention to be paid to maintenance -- we're winning that battle,
internally -- as it is a plea for those folks on the edges of their silos
to keep an eye out for these things which are currently falling between
them and help with the triage.
  --scott, speaking only for myself and my view here



--
(http://cscott.net)

Re: 📈 Wikimedia production errors help

Ed Sanders-2
Speaking specifically about the new JavaScript error logging, and
specifically to Alex's point about triaging these tasks, it would be very
helpful if the reports included some indication of how often the error is
occurring.

For example, VisualEditor is loaded several hundred thousand times per
day. If an error has occurred 4 times in the last 30 days (based on a
recent example) then it is probably very low priority.
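
A back-of-the-envelope sketch of that comparison follows. The load figure
is an assumption: 300,000 per day stands in for "several hundred
thousand", and the function name is hypothetical.

```python
# Hypothetical stand-in for "several hundred thousand" loads per day.
ASSUMED_LOADS_PER_DAY = 300_000
ERRORS_SEEN = 4    # occurrences of the error in the window
WINDOW_DAYS = 30   # length of the observation window in days


def errors_per_million_loads(errors: int, loads_per_day: int, days: int) -> float:
    """Error occurrences per one million loads over the observation window."""
    total_loads = loads_per_day * days
    return errors / total_loads * 1_000_000


rate = errors_per_million_loads(ERRORS_SEEN, ASSUMED_LOADS_PER_DAY, WINDOW_DAYS)
print(f"{rate:.2f} errors per million loads")
```

Under that assumption the error shows up in fewer than one in a million
loads. The rate says nothing about how bad each occurrence is.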

On Thu, 17 Sep 2020 at 16:40, C. Scott Ananian <[hidden email]>
wrote:

> ACN -- for what it's worth, I've been working for the foundation for a
> while now, and I can report from the inside that the trend is definitely in
> a positive direction.  There is a lot more internal focus on addressing
> code debt and giving maintenance a fair spot at the table.  (In fact, my
> entire team is now sitting inside 'maintenance' now, apparently; we used to
> be 'platform evolution'.)  This email thread is one visible aspect of that
> focus on code quality, not just features.
>
> That said, the one aspect which hasn't improved much in my time at the
> foundation has been the tendency of teams to work in silos.  This thread
> also seems to be a symptom of that: a bunch of production issues are being
> dropped on the floor ('not resolved in over a month') because they are
> falling between the silos and nobody knows who is best able to fix them.
> There are knowledge/expertise gaps among the silos as well: someone
> qualified to fix a DB issue might be at sea trying to track down a front
> end bug, and vice-versa---a number of generalists in the org could
> technically tackle a bug no matter where it lies, but it will take them
> much longer to grok an unfamiliar codebase than it would for someone more
> familiar with that silo.  So bug triage is an increasingly technical task
> in its own right.
>
> This thread, as I read it sitting inside the org, isn't so much asking for
> more attention to be paid to maintenance -- we're winning that battle,
> internally -- as it is a plea for those folks on the edges of their silos
> to keep an eye out for these things which are currently falling between
> them and help with the triage.
>   --scott, speaking only for myself and my view here
>
>
>
> On Wed, Sep 16, 2020 at 11:25 PM AntiCompositeNumber <
> [hidden email]> wrote:
>
> > There is an impression among many community members, myself included,
> > that Foundation development generally prioritizes new features over
> > fixing existing problems. Foundation teams will sprint for a few
> > months to put together a minimum viable product, release it, then move
> > on to the new hotness, leaving user requests, bugfixes, and the like
> > behind. It often seems that the only way to get a bug fixed is to get
> > a volunteer developer to look at it. This is likely unintentional, but
> > it happens nonetheless.
> >
> > Putting a higher priority within the Foundation on cleaning up old
> > toys before taking out new ones is necessary for the long-term
> > stability of the projects.
> >
> > ACN
> >
> > On Wed, Sep 16, 2020 at 9:05 PM Dan Andreescu <[hidden email]>
> > wrote:
> > >
> > > >
> > > > For example, of the 30 odd backend errors reported in June, 14 were
> > still
> > > > open a month later in July [1], and 12 were still open – three months
> > later
> > > > – in September. The majority of these haven't even yet been triaged,
> > > > assigned assigned or otherwise acknowledged. And meanwhile we've got
> > more
> > > > (non-JavaScript) stuff from July, August and September adding
> > pressure. We
> > > > have to do better.
> > > >
> > > > -- Timo
> > > >
> > >
> > > This feels like it needs some higher level coordination.  Like perhaps
> > > managers getting together and deciding production issues are a priority
> > > and diverting resources dynamically to address them.  Building an
> > > awesome new feature will have a lot less impact if the users are hurting
> > > from growing disrepair.  It seems to me like if individual contributors
> > > and maintainers could have solved this problem, they would have by now.
> > > I'm a little worried that the only viable solution right now seems like
> > > heroes stepping up to fix these bugs.
> > >
> > > Concretely, I think expanding something like the Core Platform Team's
> > > clinic duty might work.  Does anyone have a very rough idea of the time
> > > it would take to tackle 293 (wow we went up by a dozen since this thread
> > > started) tasks?
> > > _______________________________________________
> > > Wikitech-l mailing list
> > > [hidden email]
> > > https://lists.wikimedia.org/mailman/listinfo/wikitech-l
> >
>
>
> --
> (http://cscott.net)

Re: 📈 Wikimedia production errors help

Jon Robson-2
I'd be careful about using numbers in triage right now. The numbers are a little misleading, as the error logging is only enabled on smaller wikis. Also, if an error results in data loss but only impacts a small number of people, I would say that's worse than a benign error that occurs for lots.

We rolled out to Spanish, German and Japanese Wikipedia yesterday, so these numbers will start becoming more useful, but English Wikipedia will severely skew these numbers when we finally enable it.
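
To make the frequency-versus-severity point concrete, here is a rough sketch of one way a triager might combine the two. It is only an illustration, not anything the error logging actually does: the severity weights, the field names, the sample reports, and the 300,000 loads/day figure (a stand-in for the "several hundred thousand" in the VisualEditor example quoted below) are all assumptions.

```python
from dataclasses import dataclass

# Hypothetical severity weights; the thread does not define any such scale.
SEVERITY_WEIGHT = {"benign": 1, "degraded": 10, "data_loss": 1000}


@dataclass
class ErrorReport:
    name: str
    occurrences: int  # errors logged in the window
    exposures: int    # loads in the same window, on wikis where logging is enabled
    severity: str     # key into SEVERITY_WEIGHT


def error_rate(report: ErrorReport) -> float:
    """Errors per exposure, so counts from small and large wikis are comparable."""
    if report.exposures <= 0:
        raise ValueError("exposures must be positive")
    return report.occurrences / report.exposures


def triage_score(report: ErrorReport) -> float:
    """Rate scaled by severity, so a rare data-loss error can outrank a common benign one."""
    return error_rate(report) * SEVERITY_WEIGHT[report.severity]


# Assumed 30-day window at 300,000 loads per day.
LOADS = 300_000 * 30

reports = [
    ErrorReport("rare-benign", 4, LOADS, "benign"),           # 4 errors in 30 days
    ErrorReport("common-benign", 9_000, LOADS, "benign"),     # made-up count
    ErrorReport("rare-data-loss", 40, LOADS, "data_loss"),    # made-up count
]

for r in sorted(reports, key=triage_score, reverse=True):
    print(f"{r.name}: rate={error_rate(r):.2e} score={triage_score(r):.2e}")
```

With these made-up inputs the data-loss error sorts first even though it is far rarer than the common benign one. Dividing by exposures is also what would keep English Wikipedia from swamping the ranking once it is enabled, since its larger error counts come with proportionally larger load counts.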

On Tue, Sep 22, 2020, 9:59 AM Ed Sanders <[hidden email]> wrote:
Speaking specifically about the new JavaScript error logging, and
specifically to Alex's point about triaging these tasks, it would be very
helpful if the reports included some indication of how often the error is
occurring.

For example, VisualEditor is loaded several hundred thousand times per
day. If an error has occurred 4 times in the last 30 days (based on a
recent example) then it is probably very low priority.

On Thu, 17 Sep 2020 at 16:40, C. Scott Ananian <[hidden email]>
wrote:

> ACN -- for what it's worth, I've been working for the foundation for a
> while now, and I can report from the inside that the trend is definitely in
> a positive direction.  There is a lot more internal focus on addressing
> code debt and giving maintenance a fair spot at the table.  (In fact, my
> entire team is sitting inside 'maintenance' now, apparently; we used to
> be 'platform evolution'.)  This email thread is one visible aspect of that
> focus on code quality, not just features.
>
> That said, the one aspect which hasn't improved much in my time at the
> foundation has been the tendency of teams to work in silos.  This thread
> also seems to be a symptom of that: a bunch of production issues are being
> dropped on the floor ('not resolved in over a month') because they are
> falling between the silos and nobody knows who is best able to fix them.
> There are knowledge/expertise gaps among the silos as well: someone
> qualified to fix a DB issue might be at sea trying to track down a front
> end bug, and vice-versa---a number of generalists in the org could
> technically tackle a bug no matter where it lies, but it will take them
> much longer to grok an unfamiliar codebase than it would for someone more
> familiar with that silo.  So bug triage is an increasingly technical task
> in its own right.
>
> This thread, as I read it sitting inside the org, isn't so much asking for
> more attention to be paid to maintenance -- we're winning that battle,
> internally -- as it is a plea for those folks on the edges of their silos
> to keep an eye out for these things which are currently falling between
> them and help with the triage.
>   --scott, speaking only for myself and my view here
_______________________________________________
Wikitech-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l