Translation - As usual Plusnet have engaged a cheapskate cowboy data centre so badly and unprofessionally managed that they can in the first place ever let the all too predictable night-time cowboy contractors they have employed to do some dodgy on the cheap electrical work to shut everything down, presumably because only the night security guard (who has not a clue about how computers work) is actually on duty.
The facility being discussed here is a 'lights-out' facility, in that it is not manned 24x7, although access is always available. The owners of this facility are Sky (although it was Easynet when we moved in), and I'd therefore suggest your 'cheapskate cowboy data centre' description is complete rubbish. We are still awaiting the full incident report from them; this can take 10 days, which is standard in cases like this as there are contractual and legal issues to cover. We have had a verbal account of the events, but before we can comment we need the formal written report.
You are clearly lieing to us by saying that the outage was only 4 hours as I became unable to pick up email around 1am and had crucial email that had been sent by someone who had then gone to bed for a business meeting the following morning.
What would we gain by lieing [sic]? If we were going to do that we'd just say nothing at all, which is just not what we are about. We've also got over 200k customers who would be quite able to see if the events were not as stated. It's no secret, and there are Service Status posts detailing some intermittent issues with e-mail during the morning. Those issues were caused by the servers coming back before the network and therefore not mounting their storage correctly. This is one thing we have taken from the incident, and we are looking at changes to the load balancer to stop that happening again.
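To illustrate the kind of load balancer change being considered, here is a minimal sketch of a health check that only reports a server as ready once its storage is mounted. This is an assumption about how such a check could look, not a description of our actual setup: the path `/mnt/mailstore` and the function names are hypothetical.

```python
import os


def storage_ready(mount_point, ismount=os.path.ismount):
    """True only if something is actually mounted at mount_point."""
    return ismount(mount_point)


def health_status(mount_point, ismount=os.path.ismount):
    """HTTP-style status a load balancer could poll: 200 ready, 503 not ready."""
    return 200 if storage_ready(mount_point, ismount) else 503


# Hypothetical mount point for the mail store. A server that booted before
# the network came back, and so never mounted its storage, would report 503
# and could be kept out of the pool until the mount succeeds.
MAIL_STORE = "/mnt/mailstore"
print(health_status(MAIL_STORE))
```

The `ismount` parameter is only there so the check can be exercised without a real mount.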
I telephoned and spoke to one of your call centre technical support staff at 1.30am.
I've just looked at the audit trail on your account and the first time it was accessed was at 07:56 on the 21st?
As usual there was the typical complacence, indifference and arrogance that has always characterised Plusnet telephone technical support staff who then gave the impression this was a trivial problem, was under control and all systems would be back no later than 3am. Whereas the reality was that these staff were so complacent that they didn't think a major ISP having its email down for 8 or 12 hours mattered and that it was more important for on call staff to be allowed to have their beauty sleep.
About half of my team were in overnight recovering from this incident. I believe all services were restored very quickly, and within the timescales that you quote, with the exception of e-mail collection: although the servers were restored, there was an issue with them. The reason we didn't get any more staff up is that we needed cover during the day too, in case more issues were encountered.
One hesitates to ask the other obvious questions such as where is your second or even third data centres that you cut over to routing your email through in the event of a failure in the first. What happens if the first is flooded or hit by a plane crash? Does that mean no Plusnet email for a week?
Absolutely not! No e-mail was lost during this incident, as simple things like e-mail delivery fail over automatically. The things that need manual intervention are the things which connect to the database. We do this because with automatic failover there is the potential for data loss and integrity issues, and we want someone to have made the decision to fail over. Take for example the scenario where our data centres become unable to communicate with each other. With automatic failover, the servers in one site would fail over to their local database and the servers in the other would fail over to theirs. This would lead to a complete mess of data being written to each database, and us not knowing which was accurate.
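The scenario above can be sketched as a toy simulation. This is purely illustrative and assumes two sites with made-up names ("north" and "south") and a made-up record key; it is not how our systems are implemented.

```python
from dataclasses import dataclass, field


@dataclass
class Site:
    name: str
    db: dict = field(default_factory=dict)  # this site's local database copy


def write_during_partition(sites, writes, automatic_failover, chosen_primary=None):
    """Apply writes while the sites cannot reach each other.

    writes is a list of (site_name, key, value) tuples.
    With automatic failover, every site promotes its own local database.
    Without it, only the site an operator chose accepts writes.
    Returns the writes that were refused.
    """
    by_name = {s.name: s for s in sites}
    refused = []
    for site_name, key, value in writes:
        if automatic_failover or site_name == chosen_primary:
            by_name[site_name].db[key] = value
        else:
            refused.append((site_name, key, value))
    return refused


def conflicts(a, b):
    """Keys that both databases hold with different values."""
    return sorted(k for k in a.db.keys() & b.db.keys() if a.db[k] != b.db[k])


writes = [("north", "user42.quota", "1GB"), ("south", "user42.quota", "5GB")]

# Automatic failover: both sites accept writes and the databases diverge.
auto_n, auto_s = Site("north"), Site("south")
write_during_partition([auto_n, auto_s], writes, automatic_failover=True)
print(conflicts(auto_n, auto_s))  # ['user42.quota']

# Manual failover: an operator picks "north", so "south" refuses its write.
man_n, man_s = Site("north"), Site("south")
refused = write_during_partition(
    [man_n, man_s], writes, automatic_failover=False, chosen_primary="north"
)
print(conflicts(man_n, man_s), refused)
```

In the manual case one write is refused until the sites can talk again, but there is only ever one database that could be the accurate one.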
I have been told by the current MD of Entanet reseller CCS Leeds that ensuring 24/7 resilience in POP and SMTP email is a simple matter and that he has engineered a number of setups that have made this possible in all circumstances. Any company like Plusnet that frequently lets all email go off for many hours at a time does so because it doesn't think its customers have the same importance as a major business customer and therefore hasn't bothered to invest adequately in sufficiently resilient alternative backup systems.
The proof is in the pudding, as they say, and I'm sure they will have tested that in all circumstances. As for your comments about frequently letting e-mail go off for many hours at a time, I'll accept that it has happened more than anyone would like in the past 12 months, but I don't think it's frequent. In terms of investment, we have made significant investment in our infrastructure over the past 12 months, with a large proportion of that on e-mail.
If by passing the buck you mean reporting on the events to our customers, then we're certainly guilty of that. Ultimately I am responsible for the team that supports the infrastructure in our facilities, and am therefore responsible for the service that we offer to all of our customers. Does an event like this cause me to review all of that? Yes, of course it does, but in this case it was an event that could not have been planned for, and it is one that I've seen many times in data centres, from the major ones in London to ones owned by the high street banks. These things do happen, and in my mind how you deal with them is as important as how they happened. In this case my team did an outstanding job in restoring both service and servers in the affected data centre.
Phil