Microsoft suffered a major outage on Monday night that affected Microsoft 365 and related services such as Outlook. The company’s official health status page said that users “may be unable to access multiple Microsoft 365 services”.
The company’s Azure status page revealed a problem with Azure Active Directory, the built-in solution for managing identities in Office 365.
The outage affected users for almost five hours. In a later update on its Twitter account, Microsoft said: “We’ve identified a recent change that appears to be the source of the issue. We’re rolling back the change to mitigate impact.”
The rollback did not fix the issue, so Microsoft rerouted traffic to alternate infrastructure while it continued to investigate. Microsoft also said that existing Microsoft 365 sessions were still working, so users did not need to close them.
A code change appears to have caused the issue. “A code issue caused a portion of our infrastructure to experience delays processing authentication requests, which prevented users from being able to access multiple M365 services,” Microsoft said in an email update to administrators affected by the outage.
Later, Azure’s Public Status library said in an update: “We have identified the preliminary root cause and the extended impact as a combination of three separate and unrelated issues:
A code defect in a service update.
A tooling error in the Azure AD safe deployment system that impacted regional scoping.
A code defect in Azure AD’s rollback mechanism, resulting in a delay in reverting the service update.
Our monitoring automatically detected the issue within a minute of initial impact, and our engineering teams engaged immediately to initiate troubleshooting. Impact was variable based on regional load patterns and we immediately scaled out the services to help process the increased volume as a result of authentication retries due to the issue. Upon the successful rollback, full recovery for most customers was confirmed at 00:23 UTC on September 29. Our engineers are engaged and monitoring the system to help ensure it continues to operate within normal parameters”.
The systems now appear to be fully operational.