Summary
Incident Start Time: 8:30am
Incident End Time: 9:30am
Business Services Impacted: Cloud integration services (including order confirmation) for a subset of customers
The Microsoft West Europe datacentre experienced an issue that forced a switch from utility power to generator power. A subset of the generators failed to take over as expected during the switch, which led to the impact. Microsoft engineers restored power at 9am and most services were restored by 9:15; however, the incident remained live on their status page (https://azure.status.microsoft/en-gb/status/) for hours after this time. Microsoft have posted their own post-mortem covering the incident and their plans going forward: https://azure.status.microsoft/en-gb/status/history/
The product team quickly identified that Orchestrated Integration Ring 2 was timing out requests (due to the outage) and switched all cloud ERP integrations to point at Ring 1, which was still answering requests successfully. However, any order confirmation requests sent to Ring 2 during the outage had to be terminated, and customers were advised to retry the confirmation.
Timeline
At approximately 8:30am on 20/10/23, requests sent to Ring 2 of the Orchestrated cloud integration servers stopped completing, due to a power issue at Microsoft's West Europe datacentre.
The issue was flagged both by in-house alerting and by users. The product team were already aware of it and had notified the support team by the time user reports started coming in.
The Product Team Lead switched the cloud integrations pointing at Ring 2 over to Ring 1, which was not experiencing this issue, and subsequent requests completed successfully. The switch was completed by 9:30am. Any order confirmation requests already sent to Ring 2 were unable to complete and were terminated to avoid duplication. This meant that the affected orders had to be requeued.
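The failover and termination steps above can be sketched roughly as follows. This is an illustrative sketch only, not the product team's actual tooling: the `Integration` and `ConfirmationRequest` types, the `fail_over` and `terminate_in_flight` functions, and the sample customers and orders are all hypothetical names introduced for the example.

```python
from dataclasses import dataclass


@dataclass
class Integration:
    customer: str  # hypothetical customer identifier
    ring: int      # ring the integration currently points at


@dataclass
class ConfirmationRequest:
    order_id: str
    ring: int
    status: str = "pending"  # "pending", "completed" or "terminated"


def fail_over(integrations, failed_ring, healthy_ring):
    """Repoint every integration on the failed ring; return the customers moved."""
    moved = []
    for integration in integrations:
        if integration.ring == failed_ring:
            integration.ring = healthy_ring
            moved.append(integration.customer)
    return moved


def terminate_in_flight(requests, failed_ring):
    """Terminate pending requests on the failed ring; return affected order ids.

    Terminating (rather than replaying) avoids confirming an order twice;
    the returned ids are the orders that need to be requeued.
    """
    affected = []
    for request in requests:
        if request.ring == failed_ring and request.status == "pending":
            request.status = "terminated"
            affected.append(request.order_id)
    return affected


# Made-up sample data for illustration only.
integrations = [Integration("customer-a", 1), Integration("customer-b", 2)]
requests = [
    ConfirmationRequest("order-1", 2),
    ConfirmationRequest("order-2", 1),
    ConfirmationRequest("order-3", 2, status="completed"),
]

moved = fail_over(integrations, failed_ring=2, healthy_ring=1)
affected_orders = terminate_in_flight(requests, failed_ring=2)
print(moved)
print(affected_orders)
```

In this sketch the list returned by `terminate_in_flight` plays the role of the affected-orders list handed to the support team.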
A list of affected orders was provided to the support team to ensure that all directly affected customers had been informed and were able to resolve orders that had not confirmed successfully.
This incident was left open until Microsoft closed theirs at around 3pm. The product team plan to let the regular update process disperse integrations back to Ring 2 rather than perform any manual migration.