Microsoft’s Exchange-based Office 365 mail system went down for four hours on Wednesday afternoon. Service has been fully restored.
If you’re an Office 365 user, your incoming mail was slow to arrive, in some cases delayed by hours, and outgoing messages were stuck in the Outbox. Even webmail was down for at least part of the time. As service was restored, the outgoing messages were sent, and all incoming messages were eventually delivered. No mail was lost.
According to Microsoft, the outage began at about 2:30pm PST; service began to recover for some people within a couple of hours, and everything was working normally by 6:30pm. Complaints were flying on Twitter, and the outage appeared to affect customers all over North America and possibly into Latin America.
Office 365 administrators can log into the Office 365 portal and click on “System Health” for current status and notes about problems. At 4pm Wednesday, this was the summary:
“Current Status: Engineers have determined that this issue may be related to a recent update to the service and are currently working to revert the update.
“User Experience: Affected users are unable to connect to the Exchange Online service when using multiple protocols including Outlook, Outlook Web App (OWA), Exchange ActiveSync (EAS), and Exchange Web Services (EWS). Affected users may also experience delays when sending and receiving messages.
“Customer Impact: A higher than average number of customers are reporting this issue. Analysis indicates that customers will likely have some users experiencing this issue.”
In other words, a patch that was supposed to be routine brought the whole system down. Microsoft rolled back the patch and everything returned to normal.
In the closing note, Microsoft lists its action item: “Review procedures for validating service updates to improve the change management process.” Yes, I bet they will. Somewhere engineers are saying, oh, crap, and managers are really irritable.
This is the first significant Office 365 outage I’m aware of in a couple of years. There were one or two a year for a while, but it’s been a remarkably stable system recently. Let’s hope this is just an aberration. As I said a few years ago: “The most important thing is simply to recognize that outages happen. There is no technology that is capable of being delivered 100% of the time, 24×7, 365 days/year. Try to get past anger, try to get past blame, because outages are going to happen when there’s no one to blame and no one who deserves the anger.”