Plusnet

Update on this morning's outage

  • Mark
  • Plusnet Staff
  • Posts: 1341
« on 21/02/2008, 11:29 »
Hi all.

Posting a new thread to discuss this as the original is getting a bit long now.


Last night our 3rd party data centre suffered a total power outage.

We have found out this morning that the owners of the data centre had contractors in doing some maintenance. They failed to notify us of this work.

During that work it appears that the contractors killed all power to the centre, which was just like hitting the big red fire button in the centre. That action caused all power to all services to die, including the generators and UPS. Failover couldn't work as the power to everything was killed. We believe that this was human error; however, we are awaiting a full report from our suppliers.

Having said that, not all services can be automatically failed over. Databases etc. have to be failed over manually to ensure that data integrity is maintained. As we were unaware that this work was taking place, we did not have anyone on standby to react. Thankfully our incident management processes and call-out system kicked in very effectively. We had engineers on site within about 20 minutes, and 97% of services were back up and running in around 4 hours.

We experienced intermittent timeouts with IMAP and webmail for a period, but this was identified and resolved. This was due to the servers coming back online before the network link to the other data centre. This meant that half of the servers failed to mount the storage correctly, while the servers in the other data centre were working correctly.
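
To illustrate the ordering problem described above (servers booting before the network to the other data centre was back), here is a minimal Python sketch of one way a start-up script could wait for the storage network before mounting. This is not what Plusnet actually runs: `wait_then_mount`, `is_reachable` and `mount` are hypothetical names, and the attempt count and delays are arbitrary assumptions.

```python
import time
from typing import Callable


def wait_then_mount(
    is_reachable: Callable[[], bool],  # hypothetical probe of the storage network
    mount: Callable[[], None],         # hypothetical action that mounts the storage
    attempts: int = 5,                 # assumed retry budget
    delay_s: float = 1.0,              # assumed initial delay between probes
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Mount remote storage only once the network path to it is up.

    Returns True if the storage was mounted, False if the network
    never became reachable within the retry budget.
    """
    for attempt in range(attempts):
        if is_reachable():
            mount()
            return True
        if attempt < attempts - 1:
            # Exponential backoff: wait longer after each failed probe.
            sleep(delay_s * (2 ** attempt))
    return False
```

The probe and the mount step are passed in as functions so the ordering logic can be exercised without a real network; in practice the probe might be a TCP connection attempt to the storage host, which is not shown here.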

I'll reiterate that this was a catastrophic failure which we believe was caused by human error. All the resiliency in the world could not have prevented this situation. However the swiftness of our reaction ensured that this was resolved in super quick time.

We currently have a case open with our suppliers pursuing this issue. We will not be able to report fully on their findings due to the nature of our contract with them but we will share as much information as we possibly can.

Where we did fall down on this one was the lack of call-out for the Comms team, which meant we failed to communicate effectively. We could have used the UserGroup site and TBB to post information, but as we were all blissfully unaware that this had happened until I got in at 0800 this morning, nothing was posted. I am dealing with that aspect internally this morning.

I'll post with updates as soon as I get them.

« Reply #1 on 21/02/2008, 12:55 »
THANK YOU for this update... let's hope future problems can be posted this way, to save a lot of aggravated posting... ;)
Get paid by the government to go on cruises...... Join the Royal Navy
my website ...http://www.nemosphotography.co.uk        my blogsite.....http://nemosphotography.blogspot.com/        my RedBubble site ...http://lumixfz28.redbubble.com/     also      
The Peoples Gallery ....http://www.point101.com/peoples_gallery/    (click "Gallery" then "search" then enter  G.Emson)...  also.... http://www.imagekind.com/...69-41e8-bee9-372ac6a99d6a .... .... and for our American Readers....  http://www.americanframe....ch.aspx?keyword=LumixFZ28
  • Liam.
  • Usergroup Member
  • Posts: 5533
« Reply #2 on 21/02/2008, 12:59 »
Do you think there would ever be any merit in a Networks night shift?
Liam Martin
PlusNet UserGroup Member & Ex-PlusNet Comms Team Staffer!
BBYWPro! & Business Premier User | DG834G Lover
Wormeries from the inventors!
  • Mark
  • Plusnet Staff
  • Posts: 1341
« Reply #3 on 21/02/2008, 13:08 »
Liam, to be honest, the speed at which the networks guys got to the data centre would not have improved even if they were on nights. Bear in mind that the data centre is not in IH, so some travelling was involved.

Having worked closely with the guys this morning I have to say that I am massively impressed with the speed of reaction and what was achieved in such a short period of time.

One of the most efficient incident management operations I have seen to date and the fact that we achieved so much in such a short period deserves recognition and credit.

  • Liam.
  • Usergroup Member
  • Posts: 5533
« Reply #4 on 21/02/2008, 13:13 »
That's fair enough, and I agree it probably doesn't make sense right now. But with growth?

Thanks for the update anyway. I agree the Comms aspect was probably the most important part of this morning's outage.
« Reply #5 on 21/02/2008, 13:23 »
Moderator's Note:

Thread stickied to give the appropriate exposure.
« Reply #6 on 21/02/2008, 13:34 »
Service status says the problem is resolved, but:

I can get webmail via SquirrelMail.

I can't get IMAP email via Thunderbird; the password is still not authenticating.

I am a free-online customer.
Free-online member since 15 Dec 1998
You dont have to be mad to understand what PN are up to, but it helps
  • Chris
  • Plusnet Staff
  • Posts: 4650
« Reply #7 on 21/02/2008, 13:42 »
@pierre_pierre,

I'll have a look into this now.

*edit*

Any particular mailbox? (PM me.) I've tried 3 of them so far and can log in via telnet and via POP through OE.

« Last Edit: 21/02/2008, 13:46 by Chris »

Chris Parr
Plusnet Comms Team
  • Chris
  • Plusnet Staff
  • Posts: 4650
« Reply #8 on 21/02/2008, 13:58 »
We think we've found the cause. I've popped another service status out to keep you updated.
« Reply #9 on 21/02/2008, 16:52 »
Quote
During that work it appears that the contractors killed all power to the centre, which was just like hitting the big red fire button in the centre. That action caused all power to all services to die, including the generators and UPS. Failover couldn't work as the power to everything was killed. We believe that this was human error; however, we are awaiting a full report from our suppliers.

Having said that, not all services can be automatically failed over. Databases etc. have to be failed over manually to ensure that data integrity is maintained.

This leaves me very much in the dark as I don't know the layout of the data centre, but surely it should be designed so that there is no single act that can cause everything to fail. 

I don't know which DB is employed, but Oracle databases (with the Replication option) do not need to be manually failed over.
« Reply #10 on 21/02/2008, 17:10 »
I reported early this morning that all of my email addresses were inaccessible as the passwords were rejected.

I have asked a number of times when this fault will be rectified as I have very important emails now locked inside PlusNet and am unable to retrieve them.

PlusNet have not given me the courtesy of a reply.

This is outrageous. Last year the security of one of their servers was breached, allowing indecent spam mail to reach our email addresses, so I had to change some. And now today there is no reply from PlusNet as to the timescale to rectify the problem, and the service status page shows email as operating normally.

PlusNet it is NOT.
  • Chris
  • Plusnet Staff
  • Posts: 4650
« Reply #11 on 21/02/2008, 17:35 »
As far as we were aware, the email problem was resolved. There were a couple of isolated instances of this, but when we contacted the customers in question, the problem cleared once they re-entered their mailbox passwords into their email clients. Can you try this and see if it works?
Are you able to access your email via webmail?


*edit*

I've just tested 4 of your mailboxes via telnet and can access them without issues.

« Last Edit: 21/02/2008, 17:37 by Chris »

« Reply #12 on 21/02/2008, 17:59 »
Quote
That action caused all power to all services to die, including the generators and UPS.


UPS stands for Uninterruptible Power Supply - they work on batteries - generally several car batteries or lorry batteries in each unit plus inverters. How can they die when power is removed - they are 'uninterruptible' by definition?

Presumably PN ensured that enough power was stored (sufficient UPS boxes) to get them over the temporary interruption and allow sufficient time to have a controlled shutdown of their systems. That would only be normal good practice.

As for generators, they run independently on petroleum-based fuels. They have an automatic cut-in when power is removed from a system. I find it difficult to believe that the generators died when the power was removed - that is the mode which they are designed to work in.

Strange....


« Last Edit: 21/02/2008, 18:35 by Tony W »

  • Mark
  • Plusnet Staff
  • Posts: 1341
« Reply #13 on 21/02/2008, 18:39 »
Yip Tony, that's how they work in normal circumstances; however, that's what the "big red fire kill switch" is for: to take down everything in the event of an emergency or major incident.

The ability to kill all power exists in Data Centres and this is what occurred last night.
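
As a toy model of that difference (a sketch under stated assumptions, not the actual electrical design of the data centre in question): in an ordinary mains failure the load can still be fed by the generator or the UPS batteries, whereas an emergency power off (EPO) is assumed here to isolate the load from every source at once. The class and field names below are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PowerChain:
    """Hypothetical, simplified view of what feeds the server load."""
    mains_on: bool = True
    generator_ok: bool = True
    ups_charged: bool = True
    epo_pressed: bool = False  # emergency power off, the "big red button"

    def load_powered(self) -> bool:
        # Assumption: the EPO disconnects the load from all sources,
        # including UPS output, so no backup can take over.
        if self.epo_pressed:
            return False
        # Otherwise any one healthy source is enough to carry the load.
        return self.mains_on or self.generator_ok or self.ups_charged


# Ordinary mains failure: backups carry the load.
mains_failure = PowerChain(mains_on=False)

# Kill switch: every source is healthy, but the load is isolated.
kill_switch = PowerChain(epo_pressed=True)
```

Calling `load_powered()` on `mains_failure` gives `True`, while on `kill_switch` it gives `False` even though mains, generator and UPS are all nominally available.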

Not quite what happened to us, but this article shows what can happen in Data Centres Link

Thankfully ours wasn't as catastrophic.



  • jelv
  • Bright Spark
  • Posts: 10522
« Reply #14 on 21/02/2008, 18:51 »

Quote
UPS stands for Uninterruptible Power Supply - they work on batteries - generally several car batteries or lorry batteries in each unit plus inverters. How can they die when power is removed - they are 'uninterruptible' by definition?


There is a huge difference between a power fail when the backup systems will kick in and
Quote
During that work it appears that the contractors killed all power to the centre which was just like hitting the big red fire button in the centre.

Of course if you are saying that you can do work on the electrics within a data centre while it is all live I suggest you get in touch with the data centre and make your services available to them - just let us know where and when your funeral is to be held.
jelv
12/18 month broadband contracts have been abolished - all Plusnet residential contracts (including for existing users) are now 10 days (however deferred charges such as activation or hardware may have to be paid if you leave within a year)
Plusnet chatroom: /server usertools.plus.net   /join #usertools
Plusnet Unlimited is not without limits
« Reply #15 on 21/02/2008, 19:07 »
@Jelv

If that was aimed at me then I am not sure  what you mean.

When a UPS kicks in, my understanding is that it provides output to a unit (computer) and isolates it from the mains. No power would have got back to the building's mains supply.