Update on this morning's outage

Mark
Grafter
Posts: 1,852
Registered: ‎04-04-2007

Update on this morning's outage

Hi all.
Posting a new thread to discuss this as the original is getting a bit long now.

Last night our 3rd party data centre suffered a total power outage.
We have found out this morning that the owners of the data centre had contractors in doing some maintenance. They failed to notify us of this work.
During that work it appears that the contractors killed all power to the centre, which was just like hitting the big red fire button in the centre. That action caused all power to all services to die, including the generators and UPS. Failover couldn't work as the power to everything was killed. We believe that this was human error; however, we are awaiting a full report from our suppliers.
Having said that, not all services can be automatically failed over. DBs etc. have to be failed over manually to ensure that data integrity is maintained. As we were unaware that this work was taking place, we did not have anyone on standby to react. Thankfully our incident management processes and call-out system kicked in very effectively. We had engineers on site within about 20 mins and 97% of services were back up and running in around 4 hours.
We experienced intermittent timeouts with IMAP and webmail for a period but this was identified and resolved. This was due to the servers coming back on-line before the network to the other data centre. This meant that half of the servers failed to mount the storage correctly but the servers in the other data centre were working correctly.
I'll reiterate that this was a catastrophic failure which we believe was caused by human error. All the resiliency in the world could not have prevented this situation. However the swiftness of our reaction ensured that this was resolved in super quick time.
We currently have a case open with our suppliers pursuing this issue. We will not be able to report fully on their findings due to the nature of our contract with them but we will share as much information as we possibly can.
Where we did fall down on this one was the lack of call-out for the Comms team, which meant we failed to communicate effectively. We could have used the UserGroup site and TBB to post information, but we were all blissfully unaware that this had happened until I got in at 0800 this morning. I am dealing with that aspect internally this morning.
I'll post with updates as soon as I get them.
72 REPLIES
shutter
Community Veteran
Posts: 22,206
Thanks: 3,769
Fixes: 65
Registered: ‎06-11-2007

Re: Update on this morning's outage

THANK YOU for this update. Let's hope future problems can be posted this way, to save a lot of aggravated posting. ;)
LiamM
Grafter
Posts: 5,636
Registered: ‎12-08-2007

Re: Update on this morning's outage

Do you think there would ever be any merit in Networks Nightshift?
Mark
Grafter
Posts: 1,852
Registered: ‎04-04-2007

Re: Update on this morning's outage

Liam, to be honest, the speed at which the networks guys got to the data centre would not have improved even if they were on nights. Bear in mind that the data centre is not in IH, so some travelling was involved.
Having worked closely with the guys this morning I have to say that I am massively impressed with the speed of reaction and what was achieved in such a short period of time.
One of the most efficient incident management operations I have seen to date and the fact that we achieved so much in such a short period deserves recognition and credit.
LiamM
Grafter
Posts: 5,636
Registered: ‎12-08-2007

Re: Update on this morning's outage

That's fair enough and I agree it's probably not sensible right now. But with growth?
Thanks for the update anyway. I agree the Comms aspect was probably the most important of this morning's outage.
ChemicalBrother
Grafter
Posts: 1,887
Thanks: 5
Registered: ‎05-04-2007

Re: Update on this morning's outage

Moderator's Note:
Thread stickied to give the appropriate exposure.
pierre_pierre
Grafter
Posts: 19,757
Thanks: 3
Registered: ‎30-07-2007

Re: Update on this morning's outage

Service status says the problem is resolved, but:
I can get webmail via SquirrelMail.
I can't get IMAP email via Thunderbird; the password is still not authenticating.
I am a free-online customer.
Chris
Legend
Posts: 17,724
Thanks: 600
Fixes: 169
Registered: ‎05-04-2007

Re: Update on this morning's outage

@pierre_pierre,
I'll have a look into this now.
*edit*
Any particular mailbox? (PM me.) I've tried 3 of them so far and can log in via telnet and POP through OE.
Former Plusnet Staff member. Posts after 31st Jan 2020 are not on behalf of Plusnet.
Chris
Legend
Posts: 17,724
Thanks: 600
Fixes: 169
Registered: ‎05-04-2007

Re: Update on this morning's outage

We think we've found the cause; I've popped another service status out to keep you updated.
eugeneg
Grafter
Posts: 38
Registered: ‎23-07-2007

Re: Update on this morning's outage

Quote from: Mark
During that work it appears that the contractors killed all power to the centre, which was just like hitting the big red fire button in the centre. That action caused all power to all services to die, including the generators and UPS. Failover couldn't work as the power to everything was killed. We believe that this was human error; however, we are awaiting a full report from our suppliers.
Having said that, not all services can be automatically failed over. DBs etc. have to be failed over manually to ensure that data integrity is maintained.

This leaves me very much in the dark as I don't know the layout of the data centre, but surely it should be designed so that there is no single act that can cause everything to fail. 
I don't know which DB is employed, but Oracle databases (with the Replication option) do not need to be manually failed over.
Techcom
Newbie
Posts: 1
Registered: ‎21-02-2008

Re: Update on this morning's outage

I reported early this morning that all of my email addresses were inaccessible as the passwords were rejected.
I have asked a number of times when this fault will be rectified as I have very important emails now locked inside PlusNet and am unable to retrieve them.
PlusNet have not given me the courtesy of a reply.
This is outrageous. Last year the security of one of their servers was breached, allowing indecent spam mail to reach our email addresses, so I had to change some, and now today there is no reply from PlusNet as to the timescale to rectify the problem, while the service status page shows email as operating normally.
PlusNet it is NOT.
Chris
Legend
Posts: 17,724
Thanks: 600
Fixes: 169
Registered: ‎05-04-2007

Re: Update on this morning's outage

As far as we were aware the email problem was resolved. There were a couple of isolated instances of this, but having contacted the customers in question, they were resolved once they re-entered their mailbox passwords into their email clients. Can you try this and see if it works?
Are you able to access your email via webmail?

*edit*
I've just tested 4 of your mailboxes via telnet and can access them without issues.
Tony_W
Grafter
Posts: 745
Registered: ‎11-08-2007

Re: Update on this morning's outage

Quote from: Mark
That action caused all power to all services to die, including the generators and UPS.

UPS stands for Uninterruptible Power Supply - they work on batteries - generally several car batteries or lorry batteries in each unit plus inverters. How can they die when power is removed - they are 'uninterruptible' by definition?
Presumably PN ensured that enough power was stored (sufficient UPS boxes) to get them over the temporary interruption and allow sufficient time to have a controlled shutdown of their systems. That would only be normal good practice.
As for generators, they run independently on petroleum-based fuels. They have an automatic cut-in when power is removed from a system. I find it difficult to believe that the generators died when the power was removed - that is the mode which they are designed to work in.
Strange....

Mark
Grafter
Posts: 1,852
Registered: ‎04-04-2007

Re: Update on this morning's outage

Yip Tony, that's how they work in normal circumstances. However, that's what the "big red fire kill switch" is for: to take down everything in the event of an emergency or major incident.
The ability to kill all power exists in Data Centres and this is what occurred last night.
Not quite what happened to us, but this article shows what can happen in Data Centres Link
Thankfully ours wasn't as catastrophic.

jelv
Seasoned Hero
Posts: 26,785
Thanks: 971
Fixes: 10
Registered: ‎10-04-2007

Re: Update on this morning's outage

Quote from: Tony

UPS stands for Uninterruptible Power Supply - they work on batteries - generally several car batteries or lorry batteries in each unit plus inverters. How can they die when power is removed - they are 'uninterruptible' by definition?

There is a huge difference between a power fail when the backup systems will kick in and
Quote
During that work it appears that the contractors killed all power to the centre which was just like hitting the big red fire button in the centre.

Of course if you are saying that you can do work on the electrics within a data centre while it is all live I suggest you get in touch with the data centre and make your services available to them - just let us know where and when your funeral is to be held.
jelv (a.k.a Spoon Whittler)