The engineering team at Microsoft responsible for Office 365 published a lengthy blog post on Tuesday about the outage. Microsoft is not always transparent, so it is refreshing to get an explanation and, more importantly, to get Promises To Do Better instead of defensiveness.
There is more in the blog post, but here are the details about Tuesday’s outage:
“From 9:08AM to 2:10PM PST today, November 13th, some customers in North and South America were unable to access email services. The service incident resulted from a combination of issues related to maintenance, network element failures, and increased load on the service. This morning, the Office 365 team was performing planned non-impacting network maintenance by shifting some load out of the datacenters under maintenance. In combination with this standard process, we experienced a ‘gray’ failure of some active network elements; the elements failed, but did not alert us to their failure. Additionally, we have an increasing load of customers on-boarding to the service. These three issues in combination caused customer access to email services to be degraded for an extended period of time. By 10:42am PST, remediation work was underway to balance users to healthy sites, broaden the service access points and remediate the failed network devices. At 2:10PM PST all services were fully restored. Significant capacity increase has already been well underway, but we are also adding automated handling on these gray failures to speed recovery time. Across the organization, we are executing a full review of our processes to proactively identify further actions needed to avoid these situations.”
Outages happen. They are part of the process of strengthening cloud services. The serious online service providers – Microsoft, Amazon, Google, Apple – are making their systems more redundant and less likely to crack under pressure. We can only hope they stay ahead of the ever-growing demand.