Posted | Wed, 2 Dec 2009 19:23 UTC / Wed, 2 Dec 2009 14:23 EST |
---|---|
Last Update | Tue, 8 Dec 2009 03:57 UTC / Mon, 7 Dec 2009 22:57 EST |
Status | Closed |
Affected Data Center | Dallas (TX) |
On Monday December 7 we will be performing network maintenance on our Dallas core. We have scheduled the maintenance to occur between 0000 UTC and 0200 UTC: http://tinyurl.com/ylp842v. If we need to reschedule it for any reason, we will update this notice with the new time.

For some customers there will be just a single 1-5 minute network outage. Other customers may be affected by one or more of the multiple tasks being performed, with each task resulting in a network 'blip' of up to a few minutes, e.g. 2-5 minutes for a switch stack restart, or a few seconds to add an extra network cable/bonded port.

We will be performing the following maintenance:

- Moving core1-2 to a new power feed. This is to provide extra power redundancy to switches attached to core1. This will affect a small number of customers' public network connections, and the majority of our private network customers.
- Adding additional redundancy between our core and Colo4 on our 'Premium' network. This may cause a short outage (1-2 minutes) while the new uplinks are installed.
- Adding additional redundancy between our two core stacks, core1 and core2. This may cause a short outage (1-2 minutes) while the new links are added.
- Restarting core2-1 to apply some configuration changes. This is not expected to be service affecting, but may cause a 5-10 minute period where connectivity via the public network is interrupted.

See also: http://blog.rimuhosting.com/2009/12/03/dallas-networking-improvements/

Progress log

@2335 UTC: A data center technician preparing a patch cable required for the maintenance somehow managed to knock loose a network cable, causing a network issue for some customers. We have re-requested that the core stack gear not be touched until the appropriate time.
@0002 UTC: We are proceeding with the scheduled maintenance now.
@0012 UTC: Cross connects are in. Got a network loop, resolving.
@0013 UTC: Configuring a 4x1Gbps trunk between the core switch stacks.
@0021 UTC: We are now going to work on the next step, which will cause downtime for private networks.
@0029 UTC: The work on the private networking has now been completed.
@0031 UTC: We are working on adding the redundant links back into the network.
@0044 UTC: Some switches are not responding; we are investigating.
@0104 UTC: Most switches are now back up and we are working on the remaining few.
@0114 UTC: All switches except one are back up and running; we are investigating this switch now.
@0120 UTC: All switches are back up and all servers are back up. We will continue with the rest of the maintenance work.
@0151 UTC: At this time we are working on the redundant links between the core stacks.
@0209 UTC: We are configuring HSRP on premium bandwidth.
@0226 UTC: HSRP links are in and we are going to retry connecting the redundant links between our cores once more. This will cause an outage.
@0230 UTC: All the work is done and everything should be working.

Post maintenance report

This work involved a number of significant improvements: we added extra providers, configured HSRP, added power redundancy, added more redundant connections between switches, and increased our capacity. This work adds some very important protection to our network. It will provide long term benefits in performance and reliability.

We estimated there would be a few short outages, totaling just a few minutes of downtime. The maintenance took longer than anticipated. By way of explanation (not excuse), the reasons were 1) a human error (where a technician at the data center dislodged a cable he should not have been anywhere near) and 2) some technical issues (where switches refused to operate as advertised and as we had tested them beforehand).

In future we will continue to work well as a team internally to discuss all changes, and to help avoid the incidence of unplanned human and technical issues.
We will also work to provide better timing on the potential duration of maintenance work. (Murphy's law has once again proven that we need to provide an 'expected' maintenance time, but also be clear on what the potential duration could be if things do not go smoothly. That is always difficult to give good guidance on, as the unknowns and unexpected events are typically unknown and unexpected.)