Hi all.
Posting a new thread to discuss this as the original is getting a bit long now.
Last night our third-party data centre suffered a total power outage.
We have found out this morning that the owners of the data centre had contractors in doing some maintenance. They failed to notify us of this work.
During that work it appears that the contractors killed all power to the centre, which was just like hitting the big red fire button in the centre. That action caused power to all services to die, including the generators and UPS. Failover couldn't work as the power to everything was killed. We believe that this was human error; however, we are awaiting a full report from our suppliers.
Having said that, not all services can be automatically failed over. DBs etc. have to be failed over manually to ensure that data integrity is maintained. As we were unaware that this work was taking place, we did not have anyone on standby to react. Thankfully our incident management processes and call out system kicked in very effectively. We had engineers on site within about 20 minutes and 97% of services were back up and running in around 4 hours.
We experienced intermittent timeouts with IMAP and webmail for a period, but this was identified and resolved. This was due to the servers coming back online before the network to the other data centre did. This meant that half of the servers failed to mount the storage correctly, while the servers in the other data centre were working correctly.
I'll reiterate that this was a catastrophic failure which we believe was caused by human error. All the resiliency in the world could not have prevented this situation. However, the swiftness of our reaction ensured that this was resolved in super quick time.
We currently have a case open with our suppliers pursuing this issue. We will not be able to report fully on their findings due to the nature of our contract with them but we will share as much information as we possibly can.
Where we did fall down on this one was the lack of call out for the Comms team, which meant that we failed to communicate effectively. We could have used the UserGroup site and TBB to post information, but we were all blissfully unaware that this had happened until I got in at 0800 this morning. I am dealing with that aspect internally this morning.
I'll post with updates as soon as I get them.