Update on this morning's outage
21-02-2008 11:29 AM
Posting a new thread to discuss this as the original is getting a bit long now.
Last night our 3rd party data centre suffered a total power outage.
We have found out this morning that the owners of the data centre had contractors in doing some maintenance. They failed to notify us of this work.
During that work it appears that the contractors killed all power to the centre, which was just like hitting the big red fire button in the centre. That action caused all power to all services to die, including the generators and UPS. Failover couldn't work as the power to everything was killed. We believe that this was human error; however, we are awaiting a full report from our suppliers.
Having said that, not all services can be automatically failed over. DBs etc. have to be failed over manually to ensure that data integrity is maintained. As we were unaware that this work was taking place we did not have anyone on standby to react. Thankfully our incident management processes and call out system kicked in very effectively. We had engineers on site within about 20 minutes and 97% of services were back up and running in around 4 hours.
We experienced intermittent timeouts with IMAP and webmail for a period but this was identified and resolved. This was due to the servers coming back on-line before the network to the other data centre. This meant that half of the servers failed to mount the storage correctly but the servers in the other data centre were working correctly.
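As a rough illustration of the ordering problem described above, here is a minimal sketch of a start-up guard that only mounts storage once a network check passes. The names `wait_until_ready` and `mount_when_network_up`, and the `network_up` and `mount` callables, are hypothetical and are not taken from our actual systems.

```python
import time
from typing import Callable


def wait_until_ready(check: Callable[[], bool], attempts: int = 5,
                     delay_s: float = 2.0,
                     sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll `check` until it returns True or the attempts run out."""
    for attempt in range(attempts):
        if check():
            return True
        if attempt < attempts - 1:
            sleep(delay_s)  # back off before the next probe
    return False


def mount_when_network_up(network_up: Callable[[], bool],
                          mount: Callable[[], None], **kwargs) -> bool:
    """Attempt the mount only after the network check has passed."""
    if not wait_until_ready(network_up, **kwargs):
        # Leave storage unmounted rather than mount over a dead path.
        return False
    mount()
    return True
```

With a guard like this, a server that boots before the inter-site network would stay out of service and report the failure, instead of starting with storage that has not mounted correctly.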
I'll reiterate that this was a catastrophic failure which we believe was caused by human error. All the resiliency in the world could not have prevented this situation. However the swiftness of our reaction ensured that this was resolved in super quick time.
We currently have a case open with our suppliers pursuing this issue. We will not be able to report fully on their findings due to the nature of our contract with them but we will share as much information as we possibly can.
Where we did fall down on this one was the lack of call out for the Comms team, which meant we failed to communicate effectively. We could have used the UserGroup site and TBB to post information, but we were all blissfully unaware that this had happened until I got in at 0800 this morning. I am dealing with that aspect internally this morning.
I'll post with updates as soon as I get them.
Re: Update on this morning's outage
21-02-2008 12:55 PM
Re: Update on this morning's outage
21-02-2008 12:59 PM
Re: Update on this morning's outage
21-02-2008 1:08 PM
Having worked closely with the guys this morning I have to say that I am massively impressed with the speed of reaction and what was achieved in such a short period of time.
One of the most efficient incident management operations I have seen to date and the fact that we achieved so much in such a short period deserves recognition and credit.
Re: Update on this morning's outage
21-02-2008 1:13 PM
Thanks for the update anyway. I agree the Comms aspect was probably the most important part of this morning's outage.
Re: Update on this morning's outage
21-02-2008 1:23 PM
Thread stickied to give the appropriate exposure.
Re: Update on this morning's outage
21-02-2008 1:34 PM
I can get webmail via SquirrelMail.
I can't get IMAP email via Thunderbird; the password is still not authenticating.
I am a free-online customer.
Re: Update on this morning's outage
21-02-2008 1:42 PM
I'll have a look into this now.
*edit*
Any particular mailbox? (PM me.) I've tried 3 of them so far and can log in via telnet and POP through OE.
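For anyone who wants to run a similar check themselves, here is a minimal sketch of an IMAP login test using Python's standard `imaplib`. This is not the telnet/POP test I ran, just an equivalent check over IMAP. The function name `can_log_in` is made up for illustration, the host, username and password are whatever your own mail settings use, and connecting over SSL is an assumption.

```python
import imaplib
from typing import Callable


def can_log_in(host: str, user: str, password: str,
               connect: Callable[[str], object] = imaplib.IMAP4_SSL) -> bool:
    """Return True if the IMAP server accepts the credentials."""
    conn = connect(host)  # assumption: the server accepts IMAP over SSL
    try:
        conn.login(user, password)
        return True
    except imaplib.IMAP4.error:
        # The server answered but rejected the login.
        return False
    finally:
        conn.logout()  # LOGOUT is valid whether or not login succeeded
```

If this returns True while Thunderbird still fails, the problem is more likely in the client's settings or saved password than on the server.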
Re: Update on this morning's outage
21-02-2008 1:58 PM
Re: Update on this morning's outage
21-02-2008 4:52 PM
Quote from: Mark
During that work it appears that the contractors killed all power to the centre, which was just like hitting the big red fire button in the centre. That action caused all power to all services to die, including the generators and UPS. Failover couldn't work as the power to everything was killed. We believe that this was human error; however, we are awaiting a full report from our suppliers.
Having said that, not all services can be automatically failed over. DBs etc. have to be failed over manually to ensure that data integrity is maintained.
This leaves me very much in the dark as I don't know the layout of the data centre, but surely it should be designed so that there is no single act that can cause everything to fail.
I don't know which DB is employed, but Oracle databases (with the Replication option) do not need to be manually failed over.
Re: Update on this morning's outage
21-02-2008 5:10 PM
I have asked a number of times when this fault will be rectified as I have very important emails now locked inside PlusNet and am unable to retrieve them.
PlusNet have not given me the courtesy of a reply.
This is outrageous. Last year the security of one of their servers was breached, allowing indecent spam mail to reach our email addresses, so I had to change some. Now today there is no reply from PlusNet as to the timescale to rectify the problem, and the service status page shows email as operating normally.
PlusNet it is NOT.
Re: Update on this morning's outage
21-02-2008 5:35 PM
Are you able to access your email via webmail?
*edit*
I've just tested 4 of your mailboxes via telnet and can access them without issues.
Re: Update on this morning's outage
21-02-2008 5:59 PM
Quote from: Mark
That action caused all power to all services to die, including the generators and UPS.
UPS stands for Uninterruptible Power Supply - they work on batteries - generally several car batteries or lorry batteries in each unit plus inverters. How can they die when power is removed - they are 'uninterruptible' by definition?
Presumably PN ensured that enough power was stored (sufficient UPS boxes) to get them over the temporary interruption and allow sufficient time to have a controlled shutdown of their systems. That would only be normal good practice.
As for generators, they run independently on petroleum-based fuels. They have an automatic cut-in when power is removed from a system. I find it difficult to believe that the generators died when the power was removed - that is the mode which they are designed to work in.
Strange....
Re: Update on this morning's outage
21-02-2008 6:39 PM
The ability to kill all power exists in Data Centres and this is what occurred last night.
Not quite what happened to us, but this article shows what can happen in Data Centres Link
Thankfully ours wasn't as catastrophic.
Re: Update on this morning's outage
21-02-2008 6:51 PM
Quote from: Tony
UPS stands for Uninterruptible Power Supply - they work on batteries - generally several car batteries or lorry batteries in each unit plus inverters. How can they die when power is removed - they are 'uninterruptible' by definition?
There is a huge difference between a power fail when the backup systems will kick in and
Quote During that work it appears that the contractors killed all power to the centre which was just like hitting the big red fire button in the centre.
Of course if you are saying that you can do work on the electrics within a data centre while it is all live I suggest you get in touch with the data centre and make your services available to them - just let us know where and when your funeral is to be held.
jelv (a.k.a. Spoon Whittler) Why I have left Plusnet (warning: long post!) Broadband: Andrews & Arnold Home::1 (FTTC 80/20) Line rental: Pulse 8 Home Line Rental (£14.40/month) Mobile: iD mobile (£4/month)