The following report provides a detailed account of the incident that resulted in the loss of legitimate customer email during the day and evening of Wednesday 22nd August. We would again like to express our sincere apologies for this disruption to your email service. Everyone at PlusNet understands the importance of email, and we do recognise the inconvenience this problem has caused.
Throughout these events, we have provided regular updates via Service Status and have been discussing the incident in our Community Site discussion forums.
After a successful internal trial, we started the work to introduce a new spam detection system to our customer e-mail platform. Following the first stage of this deployment, a previously unidentified issue occurred which required us to make a responsive change to our mail delivery platform. This was not planned as part of the original upgrade and in making the change an error was introduced to a configuration file on the mail platform. As a direct result, a large number of legitimate e-mails sent to our customers on 22nd August were incorrectly dropped from our mail platform.
The remainder of this report provides background and a detailed explanation as to what happened and why. It hopefully answers the remaining questions posed by our customers during the last week.
We have, for some time now, been seeking an appropriate solution to deal with the problem of unsolicited spam email. During this time we have looked at various anti-spam measures ranging from refusing to accept email that is malformed or clearly generated by a Botnet, to completely outsourcing our mail handling to a specialist spam partner who could perform higher quality spam cleansing than our existing systems are capable of.
One promising solution we tested was a spam detection appliance from Critical Path. This system has been in operation for over three months on ‘Gatekeeper’, our internal email platform. Gatekeeper has always been a target for spammers due to the sheer volume of PlusNet email addresses which are in the public domain, and addresses like “support@plus.net” which are in many people’s address books. The trial proved successful, with about 97% of all email Gatekeeper handles being correctly identified as spam, making this significantly more effective than any of our other anti-spam measures. On this basis, we decided to move ahead with implementing the same anti-spam system for all customer email.
We initially planned a small trial for volunteer customers as the first stage of implementing the solution. The work to install the Critical Path appliance in front of our customer mail platform was carried out during a planned maintenance window on the morning of 22nd August. At this point the appliance was set in a pass through mode, meaning that it did not actively change or block any emails, instead recording what action it would have taken before passing the mails onto the existing mail delivery platform. This diagram showing how the platform was reconfigured provides a reference for the timeline below:

As announced beforehand, work to deploy the new spam appliance in monitor-only mode onto our live network began early on Wednesday morning. Initially the device was activated in front of one mail server, where it was fully tested before being applied to all mail servers. At this time in the morning, with relatively low volumes of email, all tests proved successful and mail was passing through the spam appliance and into our mail delivery platform without difficulty. We knew at this stage, however, that the real test would come after 9AM when the platform began to get busier.
Mail queues began to form on the Spam Appliance from about 09:30. This had been anticipated to some degree, due to the normal burst in volume of mail at this time of the morning. During our internal trial and planning stage we had established mail throughput rates should be around four times that of the normal load on our mail platform. It was therefore expected that the queues formed during the busiest period would clear quickly once the morning mail rush was over.
By 10AM, the amount of queuing mail on the spam device was still rising, and it was felt that further action was needed. The decision was either to roll-back immediately or to investigate further and see if we could identify and resolve whatever problem was causing mail to queue. During this period we were working with the Critical Path engineers, and upon initial investigation we found a large amount of undeliverable bounce messages had formed on the Critical Path appliance.
The bounce messages in question were formed as a result of an existing ‘Clam AV’ process that deals specifically with known phishing and image spam attacks. Normally, this mail is refused and the sending mail server is sent an error message to explain why (Bob Pullen explained this further in a recent Usergroup forum post). In this case, Critical Path had accepted the mails, but they were then being refused by our mail delivery platform and were queuing up on the appliance. This problem had not been identified during load testing, either in the vendor’s facility or during the local testing at PlusNet.
The engineer managing the Critical Path deployment held a workshop with colleagues and a plan of action was agreed. At this stage they believed strongly that a simple solution could be found which would provide a work-around to the queuing mail. Finding a solution was considered favourable to performing a full roll-back of the implementation and it was believed that this would have the least negative impact on email delivery and our customers. The idea was to change the configuration of the mail delivery platform so that it stopped rejecting the known phishing and image spam, instead accepting and silently dropping this mail. This would prevent further mails queuing on the spam appliance.
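The difference between rejecting and silently dropping can be sketched in a few lines of Python. This is only an illustration of the reasoning above, not the configuration of either system: the Relay class, the two policy functions and the sample messages are all hypothetical names invented for the sketch. It shows why a front-end that has already accepted a message ends up holding a bounce when the back-end rejects it, and why nothing queues when the back-end accepts and discards instead.

```python
from dataclasses import dataclass, field


@dataclass
class Relay:
    """Hypothetical stand-in for a front-end that has already accepted mail."""
    bounce_queue: list = field(default_factory=list)

    def forward(self, message, backend_policy):
        verdict = backend_policy(message)
        if verdict == "reject":
            # The relay accepted the message earlier, so it now owns the
            # failure and has to queue a bounce for the original sender.
            self.bounce_queue.append(message)
        return verdict


def reject_known_spam(message):
    """Back-end policy before the change: refuse known spam."""
    return "reject" if message["known_spam"] else "deliver"


def drop_known_spam(message):
    """Back-end policy after the change: accept known spam, then discard it."""
    return "discard" if message["known_spam"] else "deliver"


# Made-up sample traffic, purely for illustration.
messages = [
    {"id": 1, "known_spam": True},
    {"id": 2, "known_spam": False},
    {"id": 3, "known_spam": True},
]

rejecting = Relay()
rejecting_verdicts = [rejecting.forward(m, reject_known_spam) for m in messages]

dropping = Relay()
dropping_verdicts = [dropping.forward(m, drop_known_spam) for m in messages]

print(len(rejecting.bounce_queue), len(dropping.bounce_queue))
```

The trade-off is visible in the sketch: dropping keeps the relay's queue empty, but the sender is never told, so any message wrongly classified as known spam disappears without a trace.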
At this point, having tested the proposed new configuration within the PlusNet staging (test) environment and with a view to finding a quick solution, the engineering team decided not to seek formal change control in advance of making the change on the live platform. The testing had appeared to be successful and mail was passing through correctly on the test platform and being delivered without any issue. The configuration change was regarded as both urgent and low risk.
The configuration adjustment itself was simply a change within a part of the mail server configuration known as the ACL (Access Control List). Once the change was made to the live mail servers, the platform was monitored by checking the activity logs, mail queues and number of external connections, all of which indicated to the engineers that the changes had been applied successfully.
Having made this change, it soon became apparent that the queues on the Critical Path appliance were still rising. At this point it was clear that there was a more fundamental problem with mail throughput from the Critical Path servers, in that they were not passing messages to the mail delivery platform quickly enough. The Critical Path engineers, who had also been involved in the roll-out, were then asked to investigate and tune the configuration in order to try and resolve the problem. Several further changes were made during this period, but we continued to see increasing mail queues.
We should add that the Critical Path boxes had been tested with our normal mail volumes in the vendor’s labs. Although we believe the queues were caused by a local delivery issue, we have not been able to perform further diagnosis at this point.
As the throughput issue was still unresolved, the decision was made to re-route any new e-mail away from the Critical Path appliance so that the queues which had built up could be cleared. At this time, even though it was realised that the root cause was not the rejected spam messages, the configuration changes made to the mail delivery platform were left in place. The view was that this would allow the remaining mail to dequeue from the Critical Path boxes more quickly.
Continued monitoring of the logs, mail queues and external connections again indicated that mail was flowing correctly and at this point it was assumed that with time the mail platform would return to normal operation.
Our engineering team were alerted by the Customer Support Centre that a number of calls and tickets were being received from customers who were reporting missing e-mails. Initially this had been put down to the mail queues, but it was felt that further investigation was warranted by the on-call engineer. The engineer who investigated this issue could not initially find any problem, and after around 90 minutes it was decided to call-out the engineer responsible for the Critical Path trial, who had performed the original work that morning. It was at this point that it became apparent that mail was being lost from the platform.
It was decided to systematically roll back all changes made during the day, including the ACL rule changes on the mail delivery platform. Once this had been done a large number of test messages were sent, all of which were received. This proved that the problem had been resolved for all new mail arriving on the mail platform. Some older email was still queued on the Critical Path appliances and this was cleared successfully over the following days.
Once we understood that email had been lost we started a full investigation into the causes of this issue. It was quickly recognised that while all of the work for the deployment of the Critical Path appliances was planned and authorised in accordance with our change control and peer review procedures, the way we handled the first problem following the deployment was incorrect.
The investigation revealed that a formatting error within the ACL rule change had caused the mail delivery platform to start processing mail incorrectly. This specifically was a sequencing issue, whereby the order of the commands written into the ACL rule meant that the variable set when a message was known spam was not being reset correctly for each new mail. This resulted in legitimate messages being seen by the mail platform as known Phishing or Image spam and, because of the rule to drop instead of reject this type of mail, they were removed. Although the new configuration was checked informally by another engineer before being applied, operational procedure was breached when the changes made to the rule were not formally peer reviewed or documented via change control.
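For readers who want to see how this class of fault behaves, here is a minimal Python sketch. It is not the actual ACL rule, which is not reproduced in this report. It assumes, purely for illustration, that several messages can share one processing context (modelled here as a "batch") when the platform is busy. The function names, the matches_signature field and the sample messages are all hypothetical.

```python
def process_buggy(batch):
    """The spam flag is initialised once per batch, so it leaks between messages."""
    delivered = []
    known_spam = False  # set once, outside the per-message loop
    for msg in batch:
        if msg["matches_signature"]:
            known_spam = True
        if known_spam:
            continue  # accept and silently drop
        delivered.append(msg["id"])
    return delivered


def process_fixed(batch):
    """The spam flag is recomputed for every message."""
    delivered = []
    for msg in batch:
        known_spam = msg["matches_signature"]  # reset for each new mail
        if known_spam:
            continue
        delivered.append(msg["id"])
    return delivered


# Assumption: under load, messages share a context; when quiet, each is alone.
busy = [
    {"id": "a", "matches_signature": False},
    {"id": "b", "matches_signature": True},
    {"id": "c", "matches_signature": False},
]
quiet = [[m] for m in busy]

print(process_buggy(busy))   # legitimate message "c" is lost after spam "b"
print(process_fixed(busy))
print([process_buggy(batch) for batch in quiet])
```

In the quiet case the buggy and fixed versions give identical results, which illustrates how a fault of this kind could pass on a lightly loaded test platform and only show itself on a busy live one. Nothing is logged as an error either, because from the platform's point of view it is simply dropping "known spam".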
Josh, the principal engineer working on this project (who made the fatal change) was the first to hold his hands up and apologise to customers for the impact this problem caused. The mistake itself was one that wasn’t picked up on our test platform (it is impossible to replicate the volume of email on the live platform and under less load the problematic condition was not triggered and no issue was apparent). Furthermore, the nature of the problem meant that the mail platform logs didn’t demonstrate any obvious faults (on a mail platform that handles 70 messages a second, where a minimum of 60% is known Spam, logged errors are easy enough to spot but incorrectly marked Spam messages are not).
The biggest procedural issue here was that the correct peer review and change control procedure was not followed, although the process was overridden for what the engineer considered to be valid reasons at the time. We are now looking to streamline both these processes with a view to making them more agile and easier to follow, especially while working reactively on problems. We plan to produce an article to explain our change control processes in the near future. Obviously there has also been an internal process with those involved in the work on that day to address this appropriately.
In terms of the other questions we’ve been asked, one big comment is that customers don’t want us to drop or refuse any mail at all. That misunderstands a reality of much Botnet generated email traffic today. When obvious spam is recognised it’s perfectly normal for mail providers to prevent the delivery of that mail, and it is what almost all email providers do. We will continue to reject mail which is recognised as known Spam, but it’s important to recognise that this is a different process to that of tagging suspected Spam and placing it in the Spam folder. Spam tagging is only performed on mail that has been accepted onto the mail platform because it has a valid form and we can’t be absolutely certain that it is spam.
Another question asked was about the type of logging we have on the mail platform, and whether that could be used to inform mail senders that their mail could have been incorrectly dropped. Unfortunately due to the nature of the error and the volume of mail involved, it is genuinely impossible for us to do this. The way that the logs record data does not allow us to see which mail was correctly handled and which was not. In addition to this, the way the ACL was configured meant that we were unable to identify sender addresses. Customers should be sure that were there a practical way for us to achieve this, we would have gone to any lengths to make it so.
We hope this report and the details provided here do answer the valid questions customers raised in relation to this email problem. Like everyone affected by this we are extremely disappointed to be reporting a further set-back with email. We are as committed as it is possible to be to providing a stable and quality email solution for our customers, and will provide a further update regarding our plans in this regard shortly.
Kind Regards,
Bob Pullen
On Behalf of Team PlusNet
Thanks for the report Bob.
I've certainly been in the position of having to "think on my feet" during a rollout and I extend some sympathy to Josh.
I think the important thing to remember is that a mistake was made, has been identified, and processes have been altered to prevent a recurrence of a similar problem.
Obviously it would have been better to prevent the problem in the first place. However, I'm firmly of the belief that we are all human, and mistakes are inevitable.
I applaud the openness that Plusnet have shown throughout the incident, and the openness they now show in publishing the incident report for everyone to see and critique. I hope lessons will be learned.
Barry
Thank you for a very open report - it certainly explains the mail behaviour I have seen. I trust Josh will not be overly censured - the man who never made a mistake...never made an improvement either....and is probably a liar as well!
Best wishes......Trevor
The trouble is that these types of incident keep happening with PlusNet. The comments on 'The Register' are of the 'laughing stock' variety and, given the history, that's hardly surprising.
Forgive my cynicism but I detect 'weasel words' in the penultimate paragraph. The sort of thing that we have come to expect from politicians.
"Unfortunately due to the nature of the error and the volume of mail involved, it is genuinely impossible for us to do this."
Which is it, the nature of the error or the volume? Should we infer that had the volume been smaller recovery would have been possible and, if so, why was it impossible with the actual volume?
"Customers should be sure that were there a practical way for us to achieve this, we would have gone to any lengths to make it so."
What is the significance of the word 'practical' here? Are we to infer that there was a way 'to achieve this' but it was not 'practical'.... in other words too great a hassle?
I note that in one thread a member of staff at PlusNet points to an external e-mail provider. Is it PlusNet policy to advise customers to seek their e-mail services elsewhere?
Chris
Thanks Chris. We all share a frustration about this happening and would agree it's unsurprising that the register has used it as an opportunity to have a dig. This incident reset the clock for us, and now we can only go back and build up another period of stability...
The report is very honest, and I'd say the opposite of weasel words have been used. With technology, most things are generally possible, but not everything is practical. We receive 70 emails a second, and that sheer volume means we can't log all of the details of every email and what happened to it. For these purposes, we could perhaps have found a way to send an email to every From address that hit our platform on 22nd August, but we have no way of knowing whether that was a faked spam address or whether the mail itself was actually dropped as a result of the problem. Even if that were an appropriate solution (which it isn't, because the vast majority of the email addresses there would be faked, mailing lists, or otherwise unhelpful), you would be talking about sending 10s of millions of emails, which is impractical to say the least. I hope that explains why we didn't just say it was impossible though!
We don't have a policy on what services staff recommend to customers. We expect staff to be open and honest with customers, and to help them as much as possible on that basis. If it was appropriate for a customer's needs to use a different email service, I don't think anyone here should get into trouble for telling a customer that.
Ian
I am very encouraged by the honesty of this report. Anyone in the IT world knows how difficult it is to stick to procedures when involved in a 'live' incident.
It is in stark contrast to the frustrating and worrying secrecy of Onetel, who obviously had a problem with their email service for over 10 days, but wouldn't give any explanation or prognosis either at the time or afterwards.
Keep up the honest approach - it really boosts my confidence in PlusNet.
The first I heard of this loss of emails was today when I received the PlusNet Newsletter. It may explain why I never received notification that an item I had ordered online was out of stock, so I wasted time chasing it up later.
Perhaps, as soon as the issue was resolved, you could have sent all customers an email warning us that emails which should have been received at PlusNet between x and y hours may have been lost. I would certainly have found it helpful.