Spam: Critical Path Learning
*Please note that the Critical Path trial has now ended, so the information below should be read in context.*

If you're reading this blog post then there's a good chance you've already seen the Service Status announcement about the work we'll be doing on our email platform next week. For those who haven't, you can see a basic overview of the work here. For those about to continue reading, be warned: this is a fairly lengthy post and not for the faint-of-heart (although hopefully you'll find the information it contains useful!).

Since the Webmail Incident we've been working hard to improve our spam detection capabilities. There's been the new Manage My Mail API, the ability to turn off email to virtual domains, improved spam detection rates, and more intuitive handling of spam messages at the server level, to name but a few. Whilst these things have certainly helped, they all still tie up resources across our mail delivery platform. We routinely see problems with email delays, and more often than not they stem from the sheer amount of (often junk) email our mail servers have to process and deliver. We've attempted ACL blocking, made a multitude of Exim configuration changes and altered/upgraded our spam/virus processing. We've been fighting with the mail platform for too long now and we're only too aware of the negative impact the ensuing problems are having on our customers.

Spam isn't going to stop. In fact, far from it: it's going to get worse. If previous years are anything to go by, then as we approach Christmas things are going to get particularly nasty. We're already seeing a significant rise in the volumes of spam reported and we absolutely must take proactive steps to avoid the worst happening.

As mentioned in the Planned Maintenance announcement, we're going to be re-deploying the Critical Path appliances in front of the customer mail platform next week.
This will form part of a trial that is expected to last at least three weeks if successful. In addition to re-trialling Critical Path, we're also continuing to look at alternative/additional solutions. Whilst Critical Path may well become a permanent fixture, that does not mean we are bound to using Critical Path exclusively for spam protection, and it does not deter us from the work we're doing elsewhere.

Now it's no secret that we have twice before attempted to introduce the Critical Path anti-abuse appliances in front of the customer mail platform, and on both occasions our efforts have had negative repercussions for our customers. The first time we ended up losing emails, and the second time we were chastised for poor advance communication and the email delays that followed. It's very important to note that the problems we encountered back then were mainly caused by the interaction between Critical Path's equipment and ours, a failure to follow procedural guidelines and a poorly defined set of roll-back criteria. We've been working very hard over the last month alongside Critical Path's most senior technical staff and we're now confident that we have fully addressed, and overcompensated for, the things that bit us last time. We've very much got the customer at the centre of all of this and we'll be rolling any changes back at the first hint of trouble.

So what exactly happened last time?

OK, it makes sense at this point to elaborate on what caused the problems last time. This will help you understand what we've done to safeguard against similar things happening again. The main problems with the previous implementations can be summarised as follows:
- The PlusNet Mail servers, tuned for Internet access, were tar pitting the Critical Path servers.
- Emails our servers were detecting as spam were being bounced back to the Critical Path boxes. These messages began queuing on the Critical Path appliances which made it very hard to diagnose issues as they were reported.
- The PlusNet mail servers were incorrectly handling connection limiting between ourselves and authorised hosts; i.e. the Critical Path devices.
- The Critical Path server, when presented with a large number of available connections, did not scale out sideways as well as expected.
- Even though we were pipelining emails between the servers, whenever the PlusNet servers detected a spam email and returned a 550, the Critical Path machine tore down the connection, which then took several seconds to re-establish.
- We made a change on-the-fly to address the spam rejection issues that resulted in customers' emails getting inadvertently deleted.

To safeguard against these problems happening again, we have since:
- Made some configuration changes to optimise the handling of connections by both the load balancers and the mail delivery servers.
- Fixed the rejection of spam messages by handing these off to isolated relay servers that will manage the failed delivery reports, allow us to monitor the queue more carefully and more importantly keep it separate from the Critical Path boxes.
- Configured Critical Path as an authorised host to prevent the tar pitting problems.
- Tested all fixes using one of the Critical Path appliances and a single mx.core mail delivery server in an isolated environment.
- Prepared a full roll-out plan detailing decision points and criteria to influence the decision to roll-back.
- Reinforced a strict change control policy preventing unplanned remedial work from being carried out on the platform. A roll-back will be favoured in this situation.

The criteria that will influence a roll-back decision are:
- The average latency for an email within the Critical Path appliances is greater than 1 minute over a ten minute period.
- The pending queue on the Critical Path appliance is greater than 10,000, and increases by more than 1,000 every 5 minutes.

When these thresholds are reached, the plan is to:
- Drop the maximum number of connections to the MAA through the load balancer in steps of 50. If this gets as low as 200 and the problem still exists after 20 minutes, then a full roll-back will be initiated.
- Before rolling back drop to 100 connections.*
- Drop, or rise by an increment of 50 after 5 minutes.*
- Effect a complete roll-back after 15 minutes.
- Configure the Critical Path boxes to drain via the mx.last servers to ensure any queues are dissipated as quickly as possible.
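
As a rough illustration of the thresholds above, here is a minimal sketch of how they might be checked against monitoring readings. The `Sample` structure, the function names and the once-per-reading sampling are all hypothetical and are not taken from our monitoring tooling; only the numbers come from the criteria listed above. The step marked with an asterisk (dropping to 100 connections) is not modelled.

```python
from dataclasses import dataclass
from typing import Sequence

LATENCY_LIMIT_SECONDS = 60    # average latency "greater than 1 minute"
LATENCY_WINDOW_MINUTES = 10   # "over a ten minute period"
QUEUE_LIMIT = 10_000          # pending queue "greater than 10,000"
QUEUE_GROWTH_LIMIT = 1_000    # "increases by more than 1,000"
QUEUE_GROWTH_MINUTES = 5      # "every 5 minutes"
CONNECTION_STEP = 50          # connection limit is dropped in steps of 50
CONNECTION_FLOOR = 200        # at 200 the 20-minute roll-back clock starts


@dataclass
class Sample:
    """One hypothetical monitoring reading from a Critical Path appliance."""
    minute: int             # minutes since monitoring began
    latency_seconds: float  # average email latency at that reading
    pending_queue: int      # size of the pending queue at that reading


def latency_breached(samples: Sequence[Sample]) -> bool:
    """True if mean latency over the last ten minutes exceeds one minute.

    Assumes samples are in chronological order.
    """
    if not samples:
        return False
    latest = samples[-1].minute
    window = [s.latency_seconds for s in samples
              if latest - s.minute < LATENCY_WINDOW_MINUTES]
    return sum(window) / len(window) > LATENCY_LIMIT_SECONDS


def queue_breached(samples: Sequence[Sample]) -> bool:
    """True if the queue is over 10,000 and grew by over 1,000 in 5 minutes."""
    if not samples:
        return False
    latest = samples[-1]
    earlier = [s for s in samples
               if s.minute <= latest.minute - QUEUE_GROWTH_MINUTES]
    if not earlier:
        return False  # not enough history to measure growth yet
    growth = latest.pending_queue - earlier[-1].pending_queue
    return latest.pending_queue > QUEUE_LIMIT and growth > QUEUE_GROWTH_LIMIT


def next_connection_limit(current: int) -> int:
    """Step the load balancer connection limit down by 50, stopping at 200."""
    return max(CONNECTION_FLOOR, current - CONNECTION_STEP)
```

Keeping the two checks as separate functions is deliberate: the criteria above do not say whether one breach is enough or both are needed, so the sketch leaves that decision to whoever is watching the platform.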
- The Critical Path appliances will perform sender-verify checks. Sender-verify is currently the duty of the sunmxcores. It involves checking the envelope sender address of each email received and ensuring that there are valid mail exchanger (MX) records associated with its domain. If none are found then the email is rejected. Having Critical Path deal with this part of the mail transaction means that the resource normally required to perform all the DNS lookups is removed from our mail servers. It has long been suspected that these lookups cause the occasional email delay.
- Any email that the Critical Path boxes identify as spam will be handled in accordance with customers' existing anti-spam preferences. It will be deleted at source; tagged as [-SPAM-] and delivered to the customer's mailbox; tagged as [-SPAM-] and delivered to the customer's 'Spam' folder; or not tagged at all. Customers can check their anti-spam preferences using the Manage My Mail tool found in the Member Centre.
- The headers of the email will show that it has come via the Critical Path appliance.
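
For the curious, the sender-verify check described above can be sketched roughly as follows. This is an illustration only, not the Exim or Critical Path implementation: the function names are made up, the DNS lookup is passed in as a plain function so the example runs without a resolver library, and accepting the empty sender used by bounce messages is an assumption on our part.

```python
from typing import Callable, Sequence

# Hypothetical lookup type: takes a domain, returns its MX host names.
MxLookup = Callable[[str], Sequence[str]]


def sender_verify(envelope_sender: str, lookup_mx: MxLookup) -> bool:
    """Return True if the email should be accepted, False if rejected."""
    # Assumption: the null sender used by bounce messages is let through.
    if envelope_sender in ("", "<>"):
        return True
    if "@" not in envelope_sender:
        return False
    # Take everything after the last "@" as the sender's domain.
    domain = envelope_sender.rsplit("@", 1)[1].strip().rstrip(".").lower()
    if not domain:
        return False
    # Reject when no mail exchanger records are found for the domain.
    return len(lookup_mx(domain)) > 0


# A stand-in for real DNS so the sketch is self-contained.
FAKE_DNS = {"example.com": ["mx1.example.com", "mx2.example.com"]}


def fake_lookup(domain: str) -> Sequence[str]:
    """Look the domain up in the stand-in table instead of real DNS."""
    return FAKE_DNS.get(domain, [])
```

Every call to `lookup_mx` stands for a real DNS query in production, which is why moving this check onto the Critical Path appliances takes that lookup work off our own mail servers.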
- During the stress-testing we were deleting messages as opposed to delivering them. This removes the load that the mx.core platform would normally come under and is therefore different to how things would be in the live environment. The reason we did this was to push the servers to their maximum and allow the CPU to hit 100%. It's worth noting that in the live environment we would expect only 40% of the volume of messages we processed during the test.
- The way the load balancers work is to look for a sunmxcore server (there are 22) with available sessions. The Critical Path devices are much better equipped to handle incoming connections, so there is a risk that the appliance may get swamped with connection requests from the load balancers. It's at this point that we would start tweaking the connection limit in the load balancers.
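
To show why the connection limit is the lever we would reach for, here is a simplified sketch of a load balancer that looks for a server with available sessions while honouring an overall connection limit. The `Backend` class, the server naming and the per-server session cap are hypothetical; the real load balancers are configured rather than programmed like this.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Backend:
    """A hypothetical mail server sitting behind the load balancer."""
    name: str
    max_sessions: int
    active_sessions: int = 0


def open_connection(backends: Sequence[Backend],
                    connection_limit: int) -> Optional[Backend]:
    """Hand a new connection to the first server with a free session.

    Returns None when the overall connection limit has been reached or
    every server is full, in which case the connection has to wait.
    """
    total_active = sum(b.active_sessions for b in backends)
    if total_active >= connection_limit:
        return None
    for backend in backends:
        if backend.active_sessions < backend.max_sessions:
            backend.active_sessions += 1
            return backend
    return None


# 22 servers, as in the post; the session cap of 2 is purely illustrative.
servers = [Backend(f"sunmxcore{i:02d}", max_sessions=2) for i in range(1, 23)]
```

Lowering `connection_limit` caps the number of simultaneous connections however many servers have free sessions, which is the effect we would be after if connection requests started to pile up.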