It’s been a while since we last provided an update about our anti-spam platform back at the beginning of February. Since then you’d be forgiven for thinking that it’s all been a bit quiet on the spam front, but rest assured we’ve definitely not been resting on our laurels. Over the last two months our developers and network engineers have been busy beavering away at the configuration, code and script changes that will allow us to introduce the functionality promised when we last blogged about the subject.
You may recall back in November we announced our partnership with email security experts Postini. We’ve steadily been migrating customers across to the new anti-spam platform and we now have just under three quarters of our customers with their email being filtered by the new Postini systems. More…
It should be no secret by now that we are currently in the midst of migrating all of our customers over to a new anti-spam platform that we have been developing in conjunction with Postini.
Spam has been a very hot topic in our forums of late and our Customer Support Centre have also reported large increases in the number of customers who are getting in touch to report spam related problems.
A lot of customers have already been moved to Postini, but the work is a multi-stage project and some customers have been unsure what to expect after migration. Many may not know that they’re on the new platform yet, whilst others will be curious as to when their accounts will be getting migrated. This blog post is intended to clear up a lot of this confusion, answer some of the more commonly asked questions and provide an overall update regarding the progress of the project…
*For the latest Postini anti-spam developments please see the more recently published update that can be found here*
This blog is designed to give you an overview of Postini’s insights into the previous month’s spam.
I’m hoping to provide you all with a blog similar to this one each month, showing the previous month’s trend in spam messages.
In a somewhat bus-like moment, two customer trials were announced yesterday. Both trials, one for a new spam management solution and the other for a new Broadband Phone service, will be available to all customers on an opt-in basis. Registrations for both trials are now being accepted.
*For the latest Postini anti-spam developments please see the more recently published update that can be found here*
Spam is undeniably one of the biggest challenges we face as a service provider when it comes to building a stable email platform that our customers can rely on. Some have even argued that the ever-increasing torrent of unsolicited email that now plagues the Internet has almost called into question the usefulness of email as a reliable tool altogether.
Providing bandwidth, dealing with all the problems caused by spam (not least mail delays!) and maintaining constant house-keeping regimes is extremely costly. Spammers are continuously changing their techniques to circumvent the anti-spam precautions providers put in place, and things are made extremely difficult due to the lack of consistency in the way different email servers are set up around the globe.
So what does this mean I hear you cry? Are you turning email off? Well, fortunately enough I don’t think it’s quite come to that, although we would like to ask for your help…
*Please note that the Critical Path trial has now ended, so the information below should be read in context*
If you’re reading this blog-post then there’s a good chance that you’ve already seen the Service Status announcement that’s been published about the work we’ll be doing on our email platform next week. For those that haven’t though, you can see a basic overview of the work here.
For those about to continue reading, be warned as this is a fairly lengthy post and not for the faint-of-heart! (although hopefully you’ll find the information it contains useful!)
Since the Webmail Incident we’ve been working hard to improve our spam detection capabilities. There’s been the new Manage My Mail API, the ability to turn off email to virtual domains, improved spam detection rates, and more intuitive handling of spam messages at the server level to name but a few.
Whilst these things have certainly helped, they all still tie up resources across our mail delivery platform. We routinely see problems with email delays and more often than not it’s due to issues that have stemmed from the sheer amount of (often junk) email our mail servers are having to process and deliver.
We’ve attempted ACL blocking, made a multitude of Exim configuration changes and altered/upgraded our spam/virus processing. We’ve been fighting with the mail platform for too long now and we’re only too aware of the negative impact the ensuing problems are having on our customers.
Spam isn’t going to stop. In fact far from it, it’s going to get worse. If the previous years are anything to go by then as we approach Christmas things are going to get particularly nasty. We’re already seeing a significant rise in the volumes of spam reported and we absolutely must take proactive steps to avoid the worst happening.
As has been mentioned in the Planned Maintenance announcement, we’re going to be re-deploying the Critical Path appliances in front of the customer mail platform next week. This will form part of a trial that is expected to last at least three weeks if successful. In addition to re-trialling Critical Path, we’re also continuing to look at alternative/additional solutions. Whilst Critical Path may well become a permanent thing, it does not mean we are bound to exclusively using Critical Path for spam protection and does not deter us from the work we’re doing elsewhere.
Now it’s no secret that we have twice before attempted to introduce the Critical Path anti-abuse appliances in front of the customer mail platform and on both occasions our efforts have resulted in negative repercussions for our customers.
The first time we ended up losing emails and the second time we were chastised for poor advance communication and the subsequent email delays that arose.
It’s very important to note that the problems we encountered back then were mainly caused by the interaction between Critical Path’s equipment and ours, failure to follow procedural guidelines and a poorly defined set of roll-back criteria.
We’ve been working very hard over the last month alongside Critical Path’s most senior technical staff and we’re now confident that we have fully addressed and overcompensated for the things that bit us last time. We’ve very much got the customer at the centre of all of this and we’ll be rolling any changes back at the first hint of any trouble.
So what exactly happened last time?
OK, it makes sense at this point to elaborate on what caused the problems last time. This will help you understand what we’ve done to safeguard against similar things happening again.
The main problems with the previous implementations can be summarised as follows:
Critical Path were on site during the last trial and they saw the pain that was born from the problems that were encountered. They left that day with a conviction to help us resolve what had gone wrong, and as has already been mentioned we’ve been working closely alongside their most senior platform architects ever since.
How are we going to make sure it doesn’t happen again?
We’ve been careful to ensure that all of the above points have been addressed as follows:
The above changes have been tested by both ourselves and Critical Path and both parties are confident that the issues have been resolved.
Last week we also performed a full stress-test on a single sunmxcore mail server in an isolated environment. During this test 750,000 emails were successfully processed during a three hour period. None of the aforementioned issues were encountered.
On average a single sunmxcore server in its present state will process approximately 1.2 million emails a day. If you consider what we achieved during the above test then you should have an idea as to why we’re so eager for this to work.
During testing, we also managed to max the CPU on the sunmxcore (there was still plenty of processing potential remaining on the Critical Path appliance). We managed 240 concurrent connections. We only managed 8 the last time we implemented these changes so this is a good indication that there are no longer issues feeding messages from the CP appliances to our platform.
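To put those figures side by side, here is a rough back-of-the-envelope calculation using only the numbers quoted above. It is a sketch rather than a formal benchmark: the daily figure is an average spread over 24 hours, whereas the stress test was a sustained three-hour run, so the two rates are not strictly like-for-like. The variable and function names are made up for illustration.

```python
# Figures quoted in the post.
TEST_EMAILS = 750_000        # emails processed during the stress test
TEST_HOURS = 3               # duration of the stress test in hours
DAILY_EMAILS = 1_200_000     # approximate daily load on one sunmxcore today
CONNECTIONS_NOW = 240        # concurrent connections reached in this test
CONNECTIONS_BEFORE = 8       # concurrent connections reached last time


def hourly_rate(emails: int, hours: float) -> float:
    """Average number of emails handled per hour."""
    return emails / hours


test_rate = hourly_rate(TEST_EMAILS, TEST_HOURS)         # 250,000 per hour
current_rate = hourly_rate(DAILY_EMAILS, 24)             # 50,000 per hour
throughput_ratio = test_rate / current_rate              # 5.0
connection_ratio = CONNECTIONS_NOW / CONNECTIONS_BEFORE  # 30.0

print(f"Stress test:  {test_rate:,.0f} emails/hour")
print(f"Present load: {current_rate:,.0f} emails/hour (daily average)")
print(f"Throughput ratio: {throughput_ratio:.1f}x")
print(f"Concurrent connection ratio: {connection_ratio:.0f}x")
```

In other words, the test server sustained roughly five times its present average hourly load, and handled thirty times as many concurrent connections as during the previous attempt.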
The roll-out
The roll-out is currently scheduled for Tuesday next week (30th October) and will last for several days dependent on whether or not certain success criteria are met.
We will start by replacing one mx.core with a Critical Path device. All traffic from this device will be routed to the removed mx.core server which will then handle the final delivery.
After the first server goes live the platform will be closely monitored. Graphs showing the latency and queues on the Critical Path devices alongside the queues on the sunmxcores will be made available to customers via an isolated portal page that will be visible here following the roll-out.
If all success criteria are met and no problems are encountered then we will introduce a second server on Wednesday, a third server on Thursday and a fourth on Friday.
Once we have reached this point, a decision will be made regarding our deployment to the remaining servers the following week (there are 22 servers in total). No more servers will be added over the weekend and there will be a dedicated resource monitoring the platform throughout this time.
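For anyone who wants to see how cautious the first week is, the staged schedule can be sketched as below. This assumes one additional server goes live on each of the four days and that nothing is rolled back; the function name is purely illustrative.

```python
TOTAL_SERVERS = 22  # total number of servers, as stated above
ROLLOUT_DAYS = ["Tuesday", "Wednesday", "Thursday", "Friday"]


def rollout_plan(days, total):
    """Return (day, servers live, percentage of platform) for each stage."""
    plan = []
    for count, day in enumerate(days, start=1):
        percentage = round(100 * count / total, 1)
        plan.append((day, count, percentage))
    return plan


plan = rollout_plan(ROLLOUT_DAYS, TOTAL_SERVERS)
for day, count, percentage in plan:
    print(f"{day}: {count} of {TOTAL_SERVERS} servers ({percentage}%)")
```

By Friday, if every stage succeeds, only 4 of the 22 servers (a little under a fifth of the platform) will be sitting behind a Critical Path device.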
There will be a Critical Path employee on site throughout the trial, and we will also be in contact with a further two senior engineers based in Germany and Ireland.
Roll-back
A decision to roll-back will be arrived at should any of the following criteria be met:
The proposed maintenance work that will be carried out should any of these conditions be met is as follows:
* These steps are to allow for the collation of statistics for post roll-out analysis.
The values above are not arbitrary as it took just one hour for a single Critical Path appliance to accumulate a queue of 100,000 emails the last time we rolled it to the live platform. By taking such a cautious, staged approach we’re hoping to protect customers.
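To give a feel for how quickly a queue like that builds, here is a small sketch based on the one figure quoted above (100,000 emails in one hour on a single appliance). It assumes the queue grew at a steady rate, which is a simplification, and the 10,000 threshold in the example is hypothetical and is not one of the roll-back criteria.

```python
LAST_TIME_QUEUE = 100_000  # emails queued on one appliance last time
LAST_TIME_HOURS = 1        # time it took to reach that queue size


def growth_per_minute(queued: int, hours: float) -> float:
    """Average queue growth per minute, assuming steady growth."""
    return queued / (hours * 60)


def minutes_to_reach(threshold: int, per_minute: float) -> float:
    """Minutes until a queue growing at per_minute reaches threshold."""
    return threshold / per_minute


rate = growth_per_minute(LAST_TIME_QUEUE, LAST_TIME_HOURS)

# Hypothetical threshold, chosen only to illustrate the calculation.
example_threshold = 10_000
minutes = minutes_to_reach(example_threshold, rate)

print(f"Growth rate: about {rate:,.0f} emails per minute")
print(f"A queue of {example_threshold:,} would build in about {minutes:.0f} minutes")
```

At that pace there is very little time to react, which is why a staged approach with close monitoring matters.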
What will the Critical Path boxes do?
There are a number of things the Critical Path boxes will do once they are live in front of the mail delivery servers:
Risks?
There are two risks associated with this work that are worth mentioning. These are what we based our roll-back criteria on and are the reason we’ve allowed for tweaking of the connection limit in the load balancer as part of the test plan.
What next?
As previously mentioned, we’re still exploring the possibility of using other vendors/suppliers. We’ve been working with a number of other third parties and hope to announce details regarding future trials before long.
We’re all hoping for a successful roll-out next week and are confident we’ve done all we can to safeguard our customers from any potential upset. Ultimately we hope this work proves to be a large step towards overcoming the problems spam email causes us and stabilising the platform for our customers once more.
Any questions, feedback or concerns regarding this work are welcomed as always over on our Community Site discussion forums.
Regards,
Bob Pullen.
In light of recent email problems, particularly delays, I thought I’d provide a quick update on the current state of play.
The good news is that we’re no longer suffering from email delays and all customers should be able to send and receive email in a timely fashion.
A number of things have contributed to the delays over the last week or so, and it’s been pretty tricky trying to keep abreast of them all. Here’s a quick summary:
Critical Path trial - On the 22nd August we encountered an unfortunate problem that led to extended email delays and some customers’ email being incorrectly rejected as spam. The last Service Status post can be seen here and there’s a detailed incident report regarding the problem that you can find here.
Outbound email delays - Late last week we started seeing reports in the forums of customers whose email was being delayed on our outgoing relay servers. This was narrowed down to file system errors that we found in our mail logs. Moving the database from disk storage to a separate RAM drive soon saw this problem resolved. This was last reported on Service Status here.
Inbound email delays - We encountered two separate issues this week that had the potential to delay some messages for customers. One was an unforeseen result of the work to debug the problems we experienced with the Critical Path boxes. These issues were last reported on Service Status here and have since been resolved. We also identified a problem that we suspect to have always existed with one of the spam filtering processes on the delivery servers. This was fixed this morning following the introduction of a new housekeeping script as announced here.
Housekeeping - We will always encounter problems that we have to respond to reactively. That’s only half of it though. It’s important that we’re running regular reporting to proactively identify those customers that have the potential to start negatively impacting the service for others. Over recent weeks we’ve been running daily reports showing the top users of our relay servers by IP address. This is normally populated with customers who have a virus or misconfigured mail server and most if not all appreciate us getting in touch to let them know. Today’s top offender had sent in excess of 53,000 emails over our relay server in a 24 hour period. Now that’s a lot of email!
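For the curious, a "top users by IP address" report of the kind mentioned under Housekeeping can be sketched in a few lines. This is only an illustration and not the script we actually run: it assumes the relay logs have already been parsed into (IP address, message ID) pairs, and the function name, record format and sample addresses are all made up.

```python
from collections import Counter


def top_senders(records, limit=10):
    """Count messages per sending IP and return the busiest senders.

    records is an iterable of (ip_address, message_id) pairs, a
    hypothetical format assumed for this sketch.
    """
    counts = Counter(ip for ip, _message_id in records)
    return counts.most_common(limit)


# Made-up sample data using documentation-only IP addresses.
sample = [
    ("192.0.2.10", "msg1"),
    ("192.0.2.20", "msg2"),
    ("192.0.2.10", "msg3"),
    ("192.0.2.30", "msg4"),
    ("192.0.2.30", "msg5"),
    ("192.0.2.10", "msg6"),
]

report = top_senders(sample, limit=2)
for ip, total in report:
    print(f"{ip}: {total} messages")
```

Run daily over a full 24 hours of relay traffic, a report like this makes an address sending tens of thousands of messages stand out immediately.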
Not only have we been beavering away at the above but we’ve also seized the opportunity to increase the capacity of our relay servers. Yesterday we added an additional 2 high-end servers to the platform bringing the total to 8. We’ve seen no problems so far and the reduction in load on the platform since their deployment is very promising indeed.
Hopefully we’ve seen the last of email delays for a while, but make sure you let our support team know if you see any problems, or give one of us Comms folk a prod over on the forums if you suspect anything is awry.
Don’t forget that you can keep up to date with all the latest Service Status information by subscribing to the Usertool’s RSS or Email Feed.
Bob
With unsolicited email still very much a hot topic across the Community, I felt it was about time for another update on where we are with regards to the ongoing battle against spam.
Since the last update, we’ve shifted from working on the Manage My Mail tool and email API, to focussing on upgrades and configuration changes that we can make at the server level to help reduce spam volumes and improve the reliability of our anti-spam detection. More…
About PlusNet
Winner of 9 out of 11 Categories in the 2008 USwitch survey. Winner of "Best Consumer ISP" at 2008 ISPA awards. Voted number 1 in the Broadband Choices 2008 survey.
© PlusNet plc All Rights Reserved. E&OE