Summary
Incident Start Time: 09:45am
Incident End Time: 11:15am
Business Services Impacted: CRM SPA app for a subset of customers
A scaling change made to the Azure infrastructure to mitigate database performance issues appears to have caused degraded CRM performance. The scaling change was reverted and one server was restarted, which caused a spike in response times for 5-10 minutes while it warmed up, but performance remained slow on a second server.
The second server was then restarted as well. It experienced a smaller spike while it warmed up, after which normal CRM performance resumed.
Timeline
09:45am CRM slowness was reported by customers. This was raised with our infrastructure team, who began investigating the issue.
10:12am The scaling changes made on Friday were reverted and one database instance was restarted.
10:25am The issue was raised again internally by staff, as the instance restart had not fully resolved the problem.
10:39am The DevOps team reported that a second instance, wn1mdwk0000V8, was still suffering from performance issues; it was rebooted at around 10:46am.
Staff were advised internally that the instances had been restarted and would continue to be slow for 5-10 minutes until they had warmed up.
The incident was marked as resolved at 11:17am, once several customers had confirmed that their performance issues were resolved.
Going Forward
We have identified the root cause of the issues caused by scaling the services, and have an updated deployment plan to prevent future scaling changes from causing performance spikes whenever a service is restarted.
The DevOps team has an update to review and deploy that works around the issue of Microsoft Azure routing requests to servers before they are ready. This should help resolve the performance spikes seen when updates are deployed or services are restarted. A rough sketch of the approach is shown below.
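As an illustration only (not the exact change the DevOps team will deploy), the sketch below assumes the CRM backend runs as a Node/Express service behind Azure App Service's Health check feature, which removes instances that fail a configured health path from load-balancer rotation. The /healthz path, the warmUp() helper, and the warm-up steps are hypothetical placeholders for whatever the real service needs to do before it can serve traffic quickly.

```typescript
// Readiness-gate sketch: a freshly restarted instance reports itself unhealthy
// until warm-up completes, so Azure stops routing customer requests to it
// while it is still cold. Assumes App Service Health check points at /healthz.
import express from "express";

const app = express();
let warmedUp = false;

// Illustrative warm-up: open connections and prime caches before declaring
// the instance ready. Replace with the real CRM warm-up steps.
async function warmUp(): Promise<void> {
  // await db.connect();                    // e.g. establish DB connection pool
  // await cache.preloadFrequentLookups();  // e.g. prime application caches
  warmedUp = true;
}

// Health check endpoint: return 503 until warm-up completes, 200 afterwards.
app.get("/healthz", (_req, res) => {
  res.status(warmedUp ? 200 : 503).send(warmedUp ? "ok" : "warming up");
});

const port = Number(process.env.PORT ?? 8080);
app.listen(port, () => {
  // Start warming up as soon as the process is listening.
  warmUp().catch((err) => console.error("warm-up failed", err));
});
```

With this pattern, a restarted instance still takes a few minutes to warm up, but customers are not routed to it until the health check passes, which should remove the response-time spikes seen during this incident.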