Thursday, 29 May 2014
Joyent Blames Operator Error for Data Centre Crash
Joyent, the California company specialising in cloud computing and virtualisation, suffered a crash at its US-East-1 data centre earlier this week. The crash left the company scrambling to get all of its servers and VMs back up and running as quickly as possible. In the aftermath, CTO Bryan Cantrill stated that an operator error “took down the data centre.”
Cantrill did not elaborate on what that meant, except to say that it was an honest admin mistake rather than an actual hardware failure. Whatever the administrator did caused every server in the data centre to reboot simultaneously. Given the number of servers and the amount of data involved, getting everything back up and running was a slow process that took about an hour to complete.
A system-wide data centre reboot is serious in terms of customer satisfaction; however, it is not as bad as simultaneously losing multiple data centres or inadvertently deleting customer data. Fortunately, Joyent customers will be reimbursed for downtime under their existing service level agreements.
In the meantime, Cantrill says the incident is a great opportunity for his company to learn important lessons about how its systems work. The company will use the knowledge it gleans to improve data centre training and develop new safeguards to prevent similar incidents in the future. Cantrill says no data centre jobs are at risk because of the error; his company is not out to punish the administrator, but rather to learn and get better.
When the data centre first went down, company officials were quick to respond with status updates, and they continued to post them as the recovery process worked its way through each individual node. Joyent promised a full post-mortem to determine how a single error by an administrator could have resulted in a complete centre shutdown.
The nature of cloud computing does not necessarily make a data centre more vulnerable to these sorts of shutdowns than a traditional, non-cloud environment. The difference lies in how data is stored and accessed. In a traditional environment, the data belonging to a single customer is all contained in one place, allowing everything to start functioning again as soon as the rebooted server initialises. Things are different in the cloud environment.
Cloud-based data and applications are split up and distributed across multiple servers simultaneously. That means every node has to be up and running before a customer's system is fully functional. This is not a problem on a small scale, but a wide-scale reboot like Joyent's takes much longer to recover from because there are substantially more nodes to deal with.
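The dependency described above can be sketched in a toy model. The `Cluster` class below is purely illustrative and has nothing to do with Joyent's actual stack: it assumes a customer's keys are spread across storage nodes by a simple modulo placement, so a request touching many keys succeeds only once every node holding one of those keys is back online.

```python
# Illustrative sketch only (hypothetical class and methods, not Joyent's
# architecture): data is sharded across nodes, so one straggler node
# that has not finished rebooting blocks a customer's whole request.

class Cluster:
    def __init__(self, num_nodes):
        self.num_nodes = num_nodes
        self.online = [True] * num_nodes  # start with all nodes up

    def node_for(self, key):
        # Simple modulo placement of an integer key onto a node.
        return key % self.num_nodes

    def reboot_all(self):
        # The data-centre-wide event: every node goes down at once.
        self.online = [False] * self.num_nodes

    def recover(self, node):
        # Recovery proceeds node by node, as in Joyent's incident.
        self.online[node] = True

    def read(self, keys):
        # A request is served only if EVERY node hosting one of its
        # shards is online again.
        return all(self.online[self.node_for(k)] for k in keys)


cluster = Cluster(4)
keys = list(range(8))          # a customer whose data spans all 4 nodes
cluster.reboot_all()
for node in range(3):          # 3 of 4 nodes recovered...
    cluster.recover(node)
print(cluster.read(keys))      # ...but the request still fails: False
cluster.recover(3)
print(cluster.read(keys))      # all nodes up, request succeeds: True
```

In a traditional single-server setup the same customer would be served as soon as their one machine rebooted; in the sharded model, recovery time is governed by the slowest node in the set, which is why a fleet-wide reboot stretches out.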
In the end, Joyent will learn from this in order to get better. Like Google, Microsoft and Amazon before it, the company will figure it out and put new controls in place. However, it is only a matter of time before the next 'big one' hits…