Slow loading/timeouts on attendance register

Incident Report for SignOnSite

Resolved

A fix has been implemented and deployed.

Sites that reported unresponsiveness or extreme slowness of the desktop web attendance screen through our Support & CSM channels were tested after the fix was released. Testing indicates that service has been restored to normal.

The issue only impacted the site attendance screen on desktop web. Our mobile applications & other areas of the web portal were operating as normal. If any further issues on the site attendance screen occur, please report them as soon as possible so we can further investigate.

This was a tricky one to resolve. We have been working on it continuously since the first reports of the issue came in early this morning. Our first response was to double the capacity of our database cluster, which bought us some headroom during this morning's peak-load period and prevented the issue from spreading to more people or degrading the SignOnSite service overall.

For the more technical observers among you, the cause of the issue was the following:
- Over the weekend, we deployed some new composite indexes into our production database, which gave us some great performance improvements across a number of common query use-cases.
- Unfortunately, once these new indexes were in our production environment, the SQL query optimiser selected one of them for an extremely important, extremely hot-path query. We were not expecting this at all, because the new index does not improve this particular query _overall_.
- The optimiser was speeding up one part of the query at the expense of needing some very expensive looped full-table scans to compute the result in another part of the query.
- The optimiser detected that the new composite index would reduce the cost of one part of the query by enough to select it. This was a very bad choice for the query as a whole, because the alternative index could have been re-used in multiple places throughout the query and the new one could not.
- We aren't sure yet exactly why the optimiser was mis-computing the cost of moving the second part of the query to full-table scans, but we think it was because of a small amount of dynamic logic that the optimiser wasn't able to penetrate.
- The end result was that a change designed to speed up our application ended up significantly degrading performance on sites which have had high numbers of workers on them _over their lifetime_. The profile of these sites varied: some had high activity levels and many active workers each day, while others had less foot traffic but had been active for a long time.
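
For anyone who wants a concrete picture of the failure mode described in the list above, here is a toy cost model in Python. It is only a sketch: it is not the real query, schema, or the database's actual cost function, and the plan names, cost numbers, and helper functions are all invented for illustration. It assumes a query with two parts, where the second part either re-uses an index or falls back to one full scan per outer row.

```python
from dataclasses import dataclass

# All values below are hypothetical cost units chosen for illustration only.
INDEX_LOOKUP_COST = 2


@dataclass(frozen=True)
class Plan:
    name: str
    first_part_cost: int          # cost of the first part of the query
    second_part_uses_index: bool  # False means a full scan per outer row


def second_part_cost(plan: Plan, lifetime_workers: int) -> int:
    """Cost of the second part for a site with this many lifetime workers."""
    if plan.second_part_uses_index:
        return lifetime_workers * INDEX_LOOKUP_COST
    # Looped full-table scans: one scan of every row, for every outer row.
    return lifetime_workers * lifetime_workers


def whole_query_cost(plan: Plan, lifetime_workers: int) -> int:
    return plan.first_part_cost + second_part_cost(plan, lifetime_workers)


def pick_by_first_part(plans: list[Plan]) -> Plan:
    """Mimics a choice that only sees the saving in one part of the query."""
    return min(plans, key=lambda p: p.first_part_cost)


def pick_by_whole_query(plans: list[Plan], lifetime_workers: int) -> Plan:
    """Chooses the plan with the lowest cost across the entire query."""
    return min(plans, key=lambda p: whole_query_cost(p, lifetime_workers))


EXISTING = Plan("existing_index", first_part_cost=40, second_part_uses_index=True)
NEW = Plan("new_composite_index", first_part_cost=5, second_part_uses_index=False)
PLANS = [EXISTING, NEW]

if __name__ == "__main__":
    for workers in (5, 2000):
        chosen = pick_by_first_part(PLANS)
        best = pick_by_whole_query(PLANS, workers)
        print(
            f"{workers} lifetime workers: "
            f"chosen={chosen.name} ({whole_query_cost(chosen, workers)}), "
            f"best={best.name} ({whole_query_cost(best, workers)})"
        )
```

In this sketch the locally cheaper plan costs about the same as the alternative on a small site, but becomes far more expensive as the lifetime worker count grows, which matches the pattern of which sites were affected.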

Thank you everyone for your patience today while we worked through safely performing an online schema roll-back.

SignOnSite Engineering

Posted Mar 18, 2024 - 13:17 UTC

Identified

We are still working on a fix.

Posted Mar 18, 2024 - 11:04 UTC

Update

We've identified the issue and are working on a fix.

Posted Mar 18, 2024 - 02:42 UTC

Update

We are continuing to investigate this issue.

Posted Mar 18, 2024 - 01:56 UTC

Investigating

We are currently investigating the cause of slow loading of the attendance register for sites with more than a few attendees.

Posted Mar 18, 2024 - 00:46 UTC

This incident affected: SignOnSite Field Platform (Website).