10 August Dallas power outage report

Posted: Fri, 12 Aug 2011 03:08 UTC (Thu, 11 Aug 2011 23:08 EDT)
Last Update: Sun, 14 Aug 2011 22:12 UTC (Sun, 14 Aug 2011 18:12 EDT)
Status: Closed
Affected Data Center: Dallas (TX)

[Update Sun 14 Aug, 10:08 UTC: the data center's outage report is appended at the bottom of this notice]

Summary

Colo4 lost power at their Dallas facility at 11:01 AM CDT on 10 August.  See http://accounts.colo4.com/status/ for the technical details.

RimuHosting's website was affected by the outage and was also unavailable.

We started pushing out information on the outage via Twitter and Facebook.  A good number of customers joined us on our live chat page.  We brought up a temporary server in Auckland and used it to serve a temporary http://rimuhosting.com page with information on the outage.

Colo4 brought some temporary electrical switching equipment online.

Power was restored.

Some of our core infrastructure servers did not come back up.  We needed to replace some power supplies to get them back up.

We then needed to restart some network switches.

Then there were some physical servers that had not powered up, so we needed to go server by server to get those back up and running.

We also had a large number of niggly little issues: some host servers misconfigured not to automatically restart their VPSs after a reboot, a VLAN configuration issue, and failed power supplies, blown hard drives, and broken GRUB configs on some physical servers.
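
For anyone curious what the auto-restart misconfiguration looks like, here is a rough sketch of the kind of check that can be run on a host.  It assumes a Xen host using the standard xendomains convention, where a guest only starts at boot if its config file is linked under /etc/xen/auto; the paths and the *.cfg naming are assumptions for the example, not a description of how every host is laid out.

    #!/usr/bin/env python
    # Illustrative sketch only: list Xen guests whose configs are not linked
    # under /etc/xen/auto, i.e. guests that would NOT be restarted by the
    # xendomains init script after a host reboot.  Paths and the *.cfg naming
    # are assumptions for the example; adjust them for your hosts.
    import os

    XEN_CONF_DIR = "/etc/xen"
    AUTO_DIR = "/etc/xen/auto"

    def domain_configs(path):
        """Return the Xen domain config file names found directly in `path`."""
        configs = []
        for name in os.listdir(path):
            full = os.path.join(path, name)
            if os.path.isfile(full) and name.endswith(".cfg"):
                configs.append(name)
        return configs

    def auto_start_targets(path):
        """Return the config names that the auto-start symlinks point at."""
        targets = set()
        if not os.path.isdir(path):
            return targets
        for name in os.listdir(path):
            real = os.path.realpath(os.path.join(path, name))
            targets.add(os.path.basename(real))
        return targets

    if __name__ == "__main__":
        auto = auto_start_targets(AUTO_DIR)
        for cfg in domain_configs(XEN_CONF_DIR):
            if cfg not in auto:
                print("WARNING: %s is not linked in %s; it will not auto-start" % (cfg, AUTO_DIR))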

We received a large number of support requests, e.g. Tomcat or Liferay not set to run on startup, SSL certificates installed with a passphrase on the private key (so the server needed manual intervention to finish booting), and MySQL tables that needed REPAIR TABLE run against them.
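
Most of those were quick fixes once we could get onto the server.  The sketch below is only an illustration of that checklist, not a script to run blindly: the service name, key paths and database credentials are placeholders, and the exact commands differ between distros.

    #!/usr/bin/env python
    # Illustrative post-reboot checklist, not a turnkey script.  The service
    # name, key paths and database credentials are placeholders.
    import subprocess

    # 1. Make sure the app server starts on boot (RHEL/CentOS style shown;
    #    on Debian/Ubuntu use `update-rc.d tomcat6 defaults` instead).
    subprocess.call(["chkconfig", "tomcat6", "on"])

    # 2. Strip the passphrase from an SSL private key so the web server can
    #    start unattended after a reboot.  Keep a copy of the original key,
    #    and keep the decrypted key readable by root only.
    subprocess.call(["cp", "-a", "/etc/ssl/private/example.key",
                     "/etc/ssl/private/example.key.orig"])
    subprocess.call(["openssl", "rsa",
                     "-in", "/etc/ssl/private/example.key.orig",
                     "-out", "/etc/ssl/private/example.key"])
    subprocess.call(["chmod", "600", "/etc/ssl/private/example.key"])

    # 3. Check and auto-repair MyISAM tables that were mid-write when power
    #    dropped (mysqlcheck prompts for the password; InnoDB normally
    #    recovers itself from its log on startup).
    subprocess.call(["mysqlcheck", "--check", "--auto-repair",
                     "--all-databases", "-u", "root", "-p"])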

We have worked through most of the issues now.  If you are experiencing any problems you would like our help with, please pop in a support ticket and we will try to help.

Issues

Our core routers are on A+B power and should not have been impacted by the outage.  They were.  We will be investigating further.

We have some cabinets with A+B power feeds.  Some servers in those cabinets did not lose power, but because of the core router issue they still had no network connectivity.  About half of our VPS host servers did not lose power at all.

Core RimuHosting servers were affected by the outage, impacting our website, email, support tickets, and our equipment/location databases.

One of the things that was unavailable was our DNS management UI/API.  This prevented customers from repointing their DNS records from unresponsive servers to running ones.

After power was restored some servers remained down and needed hands-on work, e.g. restarts, re-grubbing, changing kernels, changing firewall settings via KVMoIPs, and replacing power supplies.  These required data center staff to act, and at times we were not getting the typically excellent, near-instant responses we normally expect from the Colo4 staff.

Improvements

We need better fault tolerance and failover on our RimuHosting core infrastructure servers, so that we are better able to communicate with customers during an outage and so that customers can still access key services like DNS.

We have had a number of requests from customers about how to reduce the impact of a fault on any one server or at one of the data centers they use.  Our howto on this at http://rimuhosting.com/knowledgebase/rimuhosting/load-balancing-and-failover is probably worth a refresh.  I invite your suggestions and advice.
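
To give a flavour of what that howto covers, here is a bare-bones external failover monitor: it watches a primary site from a third location and, after a few consecutive failed checks, repoints a low-TTL DNS record at a standby server in another data center.  The hostnames, IPs and the update_dns_record() hook are placeholders for the example; wire the hook to whichever DNS API you use.

    #!/usr/bin/env python
    # Bare-bones failover monitor sketch.  Run it from a third location, not
    # from either of the two servers it watches.  Hostnames, IPs and the DNS
    # update hook below are placeholders for the example.
    import time

    try:
        from urllib.request import urlopen   # Python 3
    except ImportError:
        from urllib2 import urlopen          # Python 2

    CHECK_URL = "http://www.example.com/healthcheck"
    STANDBY_IP = "198.51.100.20"   # server in a second data center
    CHECK_EVERY = 60               # seconds between checks
    FAILS_BEFORE_SWITCH = 3        # require several failures to avoid flapping

    def site_is_up(url, timeout=10):
        """Return True if the URL answers with HTTP 200 within the timeout."""
        try:
            return urlopen(url, timeout=timeout).getcode() == 200
        except Exception:
            return False

    def update_dns_record(name, ip):
        """Hypothetical hook: repoint the A record for `name` at `ip` via your
        DNS provider's API.  A low TTL (e.g. 300s) on the record is what lets
        the switch take effect quickly for clients."""
        print("Would repoint %s to %s" % (name, ip))

    if __name__ == "__main__":
        failures = 0
        while True:
            if site_is_up(CHECK_URL):
                failures = 0
            else:
                failures += 1
                if failures == FAILS_BEFORE_SWITCH:
                    update_dns_record("www.example.com", STANDBY_IP)
            time.sleep(CHECK_EVERY)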

Some customers have requested more information about servers with redundant power.  We can do that for dedicated server customers.  Your server will need to be located in a cabinet with A+B power feeds.  The server model will need to have a redundant dual power supply (currently only available on high end servers).  Just let us know and we can work through the details with you.

SLA credits

We will be offering credits to all customers with Dallas-based servers: a standard credit percentage or amount (to be determined).  What do you think is appropriate in your case?

We know some customers were more impacted by the outage than others so we want to have the flexibility to ensure those customers get an appropriate credit.

To do that we will be letting customers claim the standard credit or apply for a different amount that best reflects the impact of the outage.  We will be setting up a web page to help us manage this, and we want it up before the end of next week.  Once it is ready you will be able to claim your credit at https://rimuhosting.com/cp/hostingcredit.jsp

Any questions?

Thank you to everyone who contacted us on http://twitter.com/rimuhosting and http://facebook.com/rimuhostingcom with supportive comments or constructive advice.  We appreciate your feedback.  It really helped our sysadmin crew during a long and stressful day.

If you feel this notice is missing any information or you need further details, email us and we can update this notice or provide whatever extra details you need.  You are also welcome to join a public discussion on our blog at http://blog.rimuhosting.com/2011/08/12/10-august-report/

Outage report issued by the data center

What Happened: On Wednesday, August 10, 2011 at 11:01AM CDT, the Colo4 facility at 3000 Irving Boulevard experienced an equipment failure with one of the automatic transfer switches (ATS) at service entrance #2, which supports some of our long-term customers. The ATS device was damaged and would not pass either commercial or generator power automatically, nor through bypass mode. Thus, to restore the power connection, a temporary replacement ATS had to be put into service.

Colo4's standard redundant power offering has commercial power backed up by diesel generator and UPS.  Each of our six ATSs reports to its own generator and service entrance. The five other ATSs and service entrances at the facility were unaffected.

The ATS failure at service entrance #2 affected customers who had single-circuit connectivity (one power supply). Customers with redundant circuits (A/B dual power supplies) connect to two ATSs, so the B circuit automatically handled the load. (A few customers with A/B power experienced initial downtime due to a separate switch that was connected to two PDUs on the same service entrance. Power was quickly restored.)

Response Actions: As soon as this incident occurred we worked to mobilize the proper professionals in our facility and extended team. Our on-site electrical contractors and technical team worked quickly with the general contractors and UPS contractors to assess the situation and determine the fastest course of action to bring customers back online.

As part of our protocol, we first conducted a thorough check of the affected ATS as well as the supporting PDU, UPS, transformer, generator, service entrance, HVAC, and electrical.  It was determined that all other equipment was functioning properly and that the failure was limited to the ATS device. This step was important for us to ensure that the problem did not affect other equipment or replicate at other service entrances.

It was further determined that the ATS would need extensive repairs and that the best scenario for our customers would be to install a temporary ATS. As the ATS changeover involved high-voltage power, it was important that we moved cautiously and deliberately to ensure the safety of our employees, contractors and customers in the building as well as our customers' equipment. Safely bringing the new unit online was our top priority.

After the temporary ATS was installed and tested, the team brought up the HVAC, UPS and PDU individually to ensure that there was no damage to those devices.  Then, the team restored power to customer equipment. Power was restored as of 6:31PM CDT.

The UPSs were placed in bypass mode on the diesel generator to allow the batteries to fully charge. The transition from diesel generator to commercial power occurred at 9:00PM CDT with no customer impact.

Colo4 technicians worked with customers to help bring any equipment online that did not come back on with the power restore or to help reset devices where breakers tripped during the power restoration. This process continued throughout the evening.

Assessment: As part of our after-action assessment, the Colo4 management team has debriefed with the entire on-site technical team and electrical contractors, as well as the equipment manufacturer, UPS contractors and general contractors, to assess the ATS failure. While an ATS failure is rare, it is even rarer for an ATS to fail and not be able to go into bypass mode.

While the ATS could be repaired, we made the decision to order a new replacement ATS. This is certainly a more expensive option, but it is the option that best provides long-term stability for our customers.

Lessons Learned: Thankfully we have experienced few issues during our 11 years in business, though any issue is one too many. As part of our after-action review, we have made additional improvements to our existing emergency/disaster recovery plans.

Our technical team, HVAC, electrical and general contractors brought exceptionally fast, sophisticated thinking and action to get our customers back in business as quickly as possible. Working with power of that size and scale is complex at any time, especially under pressure, and it shows the level of merit, knowledge and resolve that these individuals have. Thank you to the technical team and all our contractors for a job well done in safely restoring power for our customers.

As part of the debrief, all Colo4 network gear in both facilities was checked to ensure that all equipment is on redundant power and connected properly.

Unfortunately, we were not well prepared on the customer service side. Our customers were stressed and needed more frequent updates from us along the way. We very much wanted to provide you with an ETA earlier, but due to the extent and complexity of the failure we were unable to provide a proper ETA quickly, and we did not want to send out false information or set the wrong expectation.

Next Steps: Once we receive and test the new ATS, we will schedule a maintenance window to replace the equipment. We will provide at least three days advance notice and timelines to minimize any disruption.
