Thursday, 7 May 2015

Johannesburg Data Centre Failure Raises Many Questions

The owners of a Johannesburg data centre have a lot more than just equipment damage to repair after an epic failure resulted in some customers being offline for 18 hours or longer.  MTN SA has issued repeated apologies for the incident, which began around 6pm local time on Sunday, May 3, 2015.

Numerous news reports say the company was attempting a load shedding procedure when backup generators failed.  That failure led to a subsequent power spike when energy was restored, causing physical damage to an entire cluster at the company's Gallo Manor data centre.  The race was then on to repair the damaged equipment as quickly as possible.

Unfortunately, MTN did not have all of the spare parts they needed on hand to make repairs.  Some of the parts were not located or delivered until 3am on May 4, so repair efforts did not even get under way until more than nine hours after the outage.  Meanwhile, customers were left in the dark as to what had happened and when their servers would be back online.

In light of the failure and subsequent fallout, the company must now answer a number of questions:

  • Spare Parts – Why were there not enough spare parts on hand at the data centre to immediately effect repairs?  This is non-negotiable for the modern facility.  Infrastructure is not perfect, so the means must always be available to immediately repair it when there is a failure.

  • Resilience – Global networking in the modern era requires networks with built-in resilience and redundancy.  Why was this not already in place at MTN?  Even a basic amount of resilience could have provided uninterrupted service for customers, even as repairs were made.

  • Communications – MTN customers complained that communications from the company were lacking.  For their part, MTN said the outage affected the same portions of the data centre hosting their communications tools.  They were unable to effectively communicate until the first round of repairs had been started.  But why?  Should the company not have alternate means of communication in place?

It appears that the Johannesburg failure is as much about management as it is about hardware.  Somewhere along the way, those responsible for maintaining the kind of service customers expect fell down on the job.  We suspect MTN will be looking at ways to improve their service long after the physical repairs to the data centre are complete.

Service Providers

The environment in South Africa may be such that data centre customers do not have many options in terms of service providers.  In the UK, we have no such problem.  Therefore, one of the lessons to be learned here is that of choosing a service provider wisely.

It is no longer acceptable to work with a data centre that does not have proven uptime of 98% to 99%.  It is equally unacceptable for a service provider to lack immediate repair capabilities.  Choose your provider wisely; the health of your business could depend on it.
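
To put those uptime figures into hours, here is a rough sketch of the arithmetic.  It assumes a 365-day year and treats the 18-hour outage described above as the only downtime in that year; the function names are ours, chosen purely for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, assuming a non-leap year


def allowed_downtime_hours(uptime_percent: float) -> float:
    """Hours per year a service can be down and still meet the uptime figure."""
    return HOURS_PER_YEAR * (1 - uptime_percent / 100)


def uptime_percent_after_outage(outage_hours: float) -> float:
    """Annual uptime percentage if the given outage is the year's only downtime."""
    return 100 * (1 - outage_hours / HOURS_PER_YEAR)


if __name__ == "__main__":
    # The two uptime figures quoted in this post.
    for target in (98.0, 99.0):
        hours = allowed_downtime_hours(target)
        print(f"{target}% uptime allows {hours:.1f} hours of downtime per year")

    # The outage length reported for some MTN customers.
    print(f"An 18-hour outage alone gives {uptime_percent_after_outage(18):.2f}% annual uptime")
```

When comparing providers, it may be worth asking for the uptime figure as hours of downtime per year, since a small difference in percentage can hide a large difference in hours.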
