Managing IronPort Clusters

Nov 26th, 2007

Folks,

We are contemplating a rollout of 8+ X1050 boxes from mid to late 2008, with the boxes distributed across 2 data centers with geographical load balancing. We will be handling about 300k mailboxes across ~1000 domains.

I'm interested in hearing thoughts about effectively managing such a large deployment. (Is this a large deployment?) Our main concerns are general troubleshooting tools and methodologies, as well as reporting, both for our management and for our clients' management.

How do the big installations out there manage their environment? Is the methodology to manage at the appliance level or has anybody deployed an enterprise management solution?

We will most likely be using AsyncOS 6.? and the M series for some of the consolidated reporting features.

I'd be interested in hearing your solutions.

Cheers,
Richard

Donald Nash Mon, 11/26/2007 - 21:42

We have nine MGAs here, a mixture of C60s and C600s. Eight are production and one is a hot spare. We're not geographically diverse, and being a university, we don't have clients demanding reports. Here are some lessons we've learned:

  • IronPort's Centralized Management feature is your friend. Make good use of it and it'll save you a lot of time. Be sure to plan out your structure carefully ahead of time. We're lucky in that all nine of our MGAs are running identical configurations (local stuff like IP addresses notwithstanding), but CM is more flexible than what we're using.
  • Use a hardware load balancer. It simplifies your DNS immensely, and having the ability to take MGAs in and out of production via a few commands on the load balancer is a major time saver when maintenance time rolls around.
  • If you use a hardware load balancer, be sure to turn on connection persistence or you will dilute the benefit of rate limiting. This is because otherwise incoming connections from the same source will be spread over multiple MGAs, effectively multiplying the rate limit by the number of MGAs involved. Since AsyncOS expresses rate limiting in terms of recipients per hour, you only need a persistence window of one hour.
  • When using a hardware load balancer with connection persistence, don't be shocked when you occasionally see the work queue spike up on one unit. It just means a jabberer is nailed down to that unit. This usually self-corrects, but if it doesn't then you can identify the jabberer and block it. If the jabberer turns out to be a botnet that by happenstance managed to get sent mostly to one unit, then toggling the affected unit in the load balancer will rearrange the load balancing and usually spread them back out.
  • Have a hot spare. IronPort's hardware is good, but it isn't bulletproof. Sometimes it fails, and sometimes you get unexpected traffic spikes. Having some spare capacity waiting to go is handy.
  • The hot spare can also double as a test bed, where you try out new configurations before pushing them out to the cluster. Centralized Management makes this easy. You can also try out new AsyncOS releases on the hot spare first, to make sure that the upgrade process doesn't botch your configuration (I've never seen this happen, but better safe than sorry).
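
The point about connection persistence and rate limiting can be sketched with a little arithmetic. This is only an illustration: the per-unit limit and the cluster size below are made-up numbers, and `effective_rate_limit` is a hypothetical helper, not anything from AsyncOS.

```python
def effective_rate_limit(per_unit_limit: int, units_reached: int) -> int:
    """Recipients per hour one sender can push through the cluster.

    Each unit enforces its own limit independently, so a sender whose
    connections land on several units gets the limit once per unit.
    """
    return per_unit_limit * units_reached


# Assumed values, for illustration only.
PER_UNIT_LIMIT = 500   # recipients per hour allowed per sender on one unit
CLUSTER_SIZE = 8       # production units behind the load balancer

# Without persistence, a sender's connections may be spread over every unit.
without_persistence = effective_rate_limit(PER_UNIT_LIMIT, CLUSTER_SIZE)

# With a one-hour persistence window, the sender stays on a single unit.
with_persistence = effective_rate_limit(PER_UNIT_LIMIT, 1)

print(f"without persistence: up to {without_persistence} recipients/hour")
print(f"with persistence:    up to {with_persistence} recipients/hour")
```

In the worst case the limit a sender actually experiences scales with the number of units it reaches, which is why persistence matters more as the cluster grows.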

We don't have an M-series yet, but it's on my list of things to get once the consolidated reporting stuff ships.

richard.doiron_... Tue, 11/27/2007 - 00:18

Thanks Don, all good things to think about.

Our configuration is 4 X1050s at each of 2 datacenters, with Cisco high availability and geographic load balancing. Each of the datacenters can theoretically handle our full load, so we will normally balance out the traffic using the Cisco hardware.

Each datacenter has 2 IronPorts facing the internet, located feet away from our carrier's internet peering points. We call these boxes the Edge and they are responsible for reputation filtering and providing a proxy for what's downstream.

The Edge connects to the Core, again 2 IronPorts in each datacenter. The Core connects via VPNs to our clients' MTAs. With this layout, the clients' MTAs are removed from the internet and are totally unreachable, except by passing through the Core. The Core is responsible for mail scrubbing.

Basically two concentric rings, with different responsibilities.

Centralized Management is essential, as we will have 2 basic configurations over 8 boxes.

So the challenge that I see is troubleshooting this type of setup. Mail Flow Central dies in this type of environment at our traffic load (> 8 M messages per day), and we'll have to see about the M series...

So on an average day, how do you tell how well the clusters are running? How do you tell if you are under attack? All of this before the clients start calling...

Donald Nash Tue, 11/27/2007 - 17:00

Disclaimer: I don't know all the factors that went into your design process, so my judgements and questions below are based only on what you've written. Take them for what they're worth.

That's an interesting division of responsibility. I'm curious as to why you separated the Edge and Core units instead of consolidating all those functions at one layer. By dividing them up that way, you lessen your redundancy. A single failure cuts your capacity at that data center by 50% instead of 25%. Also, IronPort Anti-Spam uses the IP address of the SMTP sender as part of its calculations. If you're running IPAS on the Core units (which is what it sounds like), then it's missing out on that key piece of information unless you use the Incoming Relays feature. And finally, that's not an equal division of labor. You've got equal amounts of horsepower in both the Edge and the Core, but the Core is where all the hard work is done.
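
The 50% versus 25% comparison follows from how many units share the load in parallel. A minimal sketch, assuming identical units and assuming a data center's throughput is limited by its smallest tier (`capacity_lost_by_one_failure` is a hypothetical helper):

```python
def capacity_lost_by_one_failure(units_in_tier: int) -> float:
    """Fraction of a tier's capacity lost when one of its units fails."""
    return 1 / units_in_tier


# Two-tier layout: 2 Edge units feed 2 Core units, so either tier
# is a bottleneck of 2 parallel units.
tiered_loss = capacity_lost_by_one_failure(2)

# Flat layout: the same 4 units all do the same job in parallel.
flat_loss = capacity_lost_by_one_failure(4)

print(f"tiered (2 + 2): {tiered_loss:.0%} lost per failure")
print(f"flat (4):       {flat_loss:.0%} lost per failure")
```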

We do reputation filtering and content scanning on the same units without a problem. That simplifies our overall configuration (we have only one global configuration instead of two, and don't need to mess with Incoming Relays), makes for a more efficient hand-off from the reputation filtering to the content scanning (all internal rather than another SMTP hop), and maximizes the headroom of our existing hardware build-out by putting all units in parallel across the load.

We don't need the complexity of a VPN since all of our destination MTAs are on our internal network (sounds like that's not the case for you), but we do configure them to reject all SMTP connections except those from our IronPort units. That's sufficient to lock the spammers out.

The single biggest metric that we use to tell when we're under attack is the size of the work queue. All the heavy lifting (IPAS, Sophos, etc.) is done in the work queue, so if a large enough attack hits you then that's where the backlog happens. We monitor the work queues on all our units very closely. As I said before, there are occasional spikes on a single unit due to jabberers or lame attacks from a single attacking host. Simultaneous spikes on all units usually indicate a major distributed attack. Those are tough, since you can't just block the offender. We haven't seen one in quite some time, but in the past we have just ridden them out. However, you can also switch to more aggressive SenderBase reputation filtering to try to take the edge off.
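
The single-unit versus all-units distinction can be written down as a simple rule. This is a sketch under stated assumptions: how the work queue sizes are collected is not covered here, and the unit names, sizes, and threshold are all hypothetical.

```python
def classify_work_queues(queue_sizes: dict, spike_threshold: int):
    """Return (verdict, names of units at or above the threshold)."""
    spiking = sorted(
        name for name, size in queue_sizes.items() if size >= spike_threshold
    )
    if not spiking:
        return "normal", spiking
    if len(spiking) == 1:
        # Likely a jabberer or a single attacking host pinned to one unit.
        return "single-unit spike", spiking
    if len(spiking) == len(queue_sizes):
        # Every unit backed up at once points to a distributed attack.
        return "distributed attack suspected", spiking
    return "partial spike", spiking


# Made-up samples: one snapshot of work queue sizes per unit.
quiet = {"mga1": 12, "mga2": 8, "mga3": 30, "mga4": 5}
one_hot = {"mga1": 12, "mga2": 4200, "mga3": 30, "mga4": 5}
all_hot = {"mga1": 3900, "mga2": 4200, "mga3": 5100, "mga4": 4700}

for sample in (quiet, one_hot, all_hot):
    print(classify_work_queues(sample, spike_threshold=1000))
```

A real monitor would probably look at trends over several samples rather than one snapshot, but the classification logic stays the same.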

Beyond that, they just work. Occasionally some piece of hardware fails (usually a disk), and then the affected unit will send e-mail about it if you have that feature turned on. Units which are being heavily stressed by an attack may go into resource conservation mode, which also triggers e-mail notifications. Resource conservation mode can also result in weird application failures due to insufficient memory, which for us always manifest as filter scripts not running to completion. This also results in notifications. If a unit dies completely then the SNMP probes would start failing, but we've never seen that.

Speaking of SNMP, be prepared to see occasional SNMP timeouts. SNMP apparently runs at a lower priority in AsyncOS, so requests occasionally get dropped. I've never seen it not correct itself on the next probe.
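
One way to live with the occasional dropped request is to alert only after several consecutive failed probes. A minimal sketch, assuming the probe results are already available as a list of booleans; the function name and the default of two consecutive failures are illustrative choices, not a recommendation from IronPort.

```python
def should_alert(probe_results, consecutive_failures_required=2):
    """Decide whether to raise an alert for one unit.

    probe_results: booleans, oldest first; True means the probe was answered.
    An alert fires only when the most recent N probes all failed, so a
    single dropped request followed by a good one is ignored.
    """
    recent = probe_results[-consecutive_failures_required:]
    return len(recent) == consecutive_failures_required and not any(recent)


print(should_alert([True, True, False]))         # one miss: no alert
print(should_alert([True, False, True]))         # self-corrected: no alert
print(should_alert([True, False, False]))        # two misses in a row: alert
```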

Yep, MFC just can't keep up with big traffic loads. It's a shame, because it's a useful product. I'm not privy to IronPort internal decision-making, but my educated guess is that its limitations are due to the fact that they wanted to make it a simple product to install. That meant putting the data collector, database, and user interface all on the same box where they compete with each other for resources. Once upon a time I gave serious thought to writing our own MFC knock-off, but with those three operations spread across separate hardware. That's now on hold waiting to see how consolidated reporting works out.

richard.doiron_... Tue, 11/27/2007 - 18:17

Our application needs are unique. Our clients must be able to communicate between themselves even if the Internet is on fire. Communication between clients is essential; communication between clients and the Internet is secondary.

Our clients have several thousand MTAs with over 1000 domains, and those MTAs need to be able to exchange SMTP traffic between themselves.

Basically we have built a hub and spoke SMTP network with the IronPorts at the hub.

The IronPorts at the outer edge may be overkill but we are constantly being hit by DDOS attacks, much as the country of Estonia was back in the spring.

http://www.infoworld.com/article/07/11/02/44OPsecadvise-denial-of-servic...

Donald Nash Tue, 11/27/2007 - 21:12

Picturing the Internet being on fire is rather amusing. :-)

Those are some pretty stringent requirements, so I understand now why you're putting an "air gap" between your core and edge ESAs. But reputation filtering is really cheap. Get your local SE to give you messages per hour figures for an X1050 for straight pass-through, reputation filtering, and reputation filtering + content scanning.

I'm curious about the Estonia comparison. From what I've read, it was Estonian web sites that were DDOS'ed, not mail servers. Don't get me wrong, I don't doubt that you get pummeled. I'm just wondering about the nature of the attacks. Are you being attacked with the intent of knocking you off the Internet, or are they spam attacks? You can't stop someone who is determined to knock you off the air with a packet flood of some sort, assuming his botnet is large enough. All you can do then is either take yourself offline to escape, or have some front-end hardware take the abuse. It sounds like your edge units serve the latter purpose, but X1050s are awfully expensive boxes to be used as glorified ablation shields. I'd be surprised if your Cisco load balancing gear couldn't do that job for you. I know, for example, that our NetScaler load balancers can deflect certain attacks like SYN floods.

On the other hand, if your threat is crippling spam attacks, then reputation filtering can deflect a huge amount of that with very little cost: a SenderBase lookup if the sending IP isn't already in the cache, and then either a "554 go away" greeting banner or a TCP RST if you really want to be rude.
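
The reason this is cheap can be seen in a sketch of the decision path: one lookup per new sending IP, then a cached answer for every later connection. Everything here is hypothetical; `lookup_score` stands in for the reputation query, and the scores and the rejection threshold are made-up values, not AsyncOS defaults.

```python
def make_reputation_filter(lookup_score, reject_below):
    """Build a connection-time filter that caches reputation lookups."""
    cache = {}

    def decide(sender_ip):
        # Only query the reputation service the first time an IP is seen.
        if sender_ip not in cache:
            cache[sender_ip] = lookup_score(sender_ip)
        return "reject" if cache[sender_ip] < reject_below else "accept"

    return decide


# Stand-in lookup that records how often it is called.
lookups = []

def fake_lookup(sender_ip):
    lookups.append(sender_ip)
    return {"192.0.2.10": -7.5}.get(sender_ip, 0.0)


decide = make_reputation_filter(fake_lookup, reject_below=-3.0)
first = decide("192.0.2.10")     # triggers a lookup, then rejects
second = decide("192.0.2.10")    # served from the cache
third = decide("198.51.100.5")   # unknown sender, neutral score, accepted

print(first, second, third, "lookups made:", len(lookups))
```

The rejected connection never reaches content scanning, so the expensive work is spent only on senders that pass this first check.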

If your clients are all trustworthy, then you have another option: Deploy all your defenses (content scanning included) at the edge, and leave the core with only the job of routing messages. This is what we do. All our incoming mail goes to inbound.mail.utexas.edu, and all our major MTAs (not all departments have jumped on board this part yet) punt their outbound mail to outbound.mail.utexas.edu. This gives us a hub and spoke design like yours for our internal traffic. All nine of our ESAs have the same configuration, and can act in either role. But we use the load balancer to keep the inbound and outbound traffic on separate units, so our internal traffic never hits the ESAs that might be under duress from an attack. And we can shuffle ESAs between roles using the load balancer, although we've never needed to do that. We have six inbound units, two outbound units, and one hot spare/test system. If things were to get really bad (unlike in your case, external connectivity isn't something we can sacrifice without much complaining from our users), we could bring the hot spare into service and convert one outbound unit to inbound. That would give us eight inbound units, at the expense of no redundancy for outbound/internal traffic. That's about a 30% increase in inbound capacity, taking into account the fact that we'd be throwing two C60s into a role currently being done by three C60s and three C600s. Not having internal redundancy is obviously not a situation we'd like to be in for very long, but it does buy us some time.
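
The "about 30%" figure depends on how much faster a C600 is than a C60, which is not stated above. A rough sketch of the arithmetic, treating that throughput ratio as an assumed input (`inbound_capacity_increase` is a hypothetical helper):

```python
def inbound_capacity_increase(c600_to_c60_ratio: float) -> float:
    """Fractional gain from adding two C60s to three C60s plus three C600s.

    Capacity is measured in "C60 equivalents"; the ratio says how many
    C60s one C600 is worth and is an assumption, not a published figure.
    """
    current = 3 * 1.0 + 3 * c600_to_c60_ratio
    added = 2 * 1.0
    return added / current


# Illustrative ratios only.
for ratio in (1.0, 11 / 9, 1.5):
    gain = inbound_capacity_increase(ratio)
    print(f"C600 = {ratio:.2f} x C60 -> {gain:.1%} more inbound capacity")
```

Under this model, a 30% gain corresponds to a C600 being worth roughly 1.2 C60s; if the two models were equal the gain would be one third.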

Incidentally, all it took to make our initially inbound-only configuration handle both inbound and outbound/internal traffic was to add a second listener. The SMTP routes needed to get internal mail to the right MTA were already in place as part of our inbound-only configuration. The second listener is private of course, and exempts its traffic from anti-spam defenses. It was a real lightbulb-over-the-head moment (or maybe a "Doh!" forehead-smacking one) when I realized how easy it would be to solve the problem of our internal traffic being delayed when our ESAs are backlogged due to a spam attack.
