Following on from Part 1 of my revision of an old Server Fault post, we'll continue by looking at remediation after an intrusion.

(Part 3 available here)

Understand the problem fully:

  1. Do NOT put the affected systems back online until this stage is fully complete, unless you want to be the person whose post was the tipping point for me actually deciding to write this article. I'm not going to link to that post so that people can get a cheap laugh, but the real tragedy is when people fail to learn from their mistakes.
  2. Examine the 'attacked' systems to understand how the attacks succeeded in compromising your security. Make every effort to find out where the attacks "came from", so that you understand what problems you have and need to address to make your system safe in the future.
  3. Examine the 'attacked' systems again, this time to understand where the attacks went, so that you understand what systems were compromised in the attack. Ensure you follow up any pointers that suggest compromised systems could become a springboard to attack your systems further.
  4. Ensure the "gateways" used in any and all attacks are fully understood, so that you may begin to close them properly. (e.g. if your systems were compromised by a SQL injection attack, then not only do you need to close the particular flawed line of code they broke in through, you would want to audit all of your code to see if the same type of mistake was made elsewhere; see the sketch after this list).
  5. Understand that attacks might succeed because of more than one flaw. Often, attacks succeed not through finding one major bug in a system but by stringing together several issues (sometimes minor and trivial by themselves) to compromise a system. For example, using SQL injection attacks to send commands to a database server, discovering the website/application you're attacking is running in the context of an administrative user and using the rights of that account as a stepping-stone to compromise other parts of a system. Or as hackers like to call it: "another day in the office taking advantage of common mistakes people make".
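To make the SQL injection example in point 4 a bit more concrete: the audit is about hunting down every place where untrusted input is glued directly into a query string. Here's a minimal sketch using Python's built-in sqlite3 module purely as a stand-in for whatever database and driver you actually run; the table and input are invented for illustration:

```python
import sqlite3

# Stand-in database; substitute your real connection handling here.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE users (name TEXT)")
cur.execute("INSERT INTO users VALUES ('alice')")

user_input = "alice' OR '1'='1"  # the sort of thing an attacker sends

# The class of mistake you're auditing for: input concatenated into SQL.
# cur.execute("SELECT * FROM users WHERE name = '" + user_input + "'")

# The fix: a parameterised query, so the driver treats the input as data only.
cur.execute("SELECT * FROM users WHERE name = ?", (user_input,))
print(cur.fetchall())  # [] : the injection attempt matches nothing
```

And while you're in there, check which account the application uses to connect to the database; the scenario in point 5 only works because that account had far more rights than the website ever needed.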
This is where having a good incident manager is invaluable. It's her job to deal with pressure to get services back online quickly and balance that against the complexity of determining the extent of an intrusion.

The technical team investigating the intrusion, determining damage, etc. should be carefully shielded from immediate interference by well-meaning executives who may be making decisions based on emotions or bad publicity. That isn’t to say that the technical team can take all the time in the world either; nothing undermines customer confidence like a slow response or services that are offline for a very long time. There’s a balance between the various priorities and it’s up to the incident manager to make sure everyone is aware of that balance.

It’s important to be thorough when investigating the root cause and the extent of an intrusion so that you can give a correct and concise response when customers (or the press if you’re a larger organisation) press you for information. No one likes uncertainty and people will want an accurate response to questions about what happened to their data and what they need to do next.

It’s also important to understand the nature and extent of any intrusion in order to be sure you have correctly dealt with it. This again is a matter of allowing competent people the time to be thorough in their examination of all potentially affected systems. This brings me back to my point in part 1 about worrying about services and “scale out” rather than small numbers of custom servers. This can both minimise exposure and improve your return to production; e.g. if intruders break into your front end web servers and the only thing they can directly access is web site code, you’ve hopefully minimised the impact of the intrusion (don’t misunderstand me, this is still very bad!) and made it relatively easy to update the vulnerable front end code and redeploy new web servers.

Why not just "repair" the exploit or rootkit you've detected and put the system back online?

In situations like this the problem is that you don't have control of that system any more. It's not your computer any more.

The only way to be certain that you’ve got control of the system is to rebuild it. While there’s a lot of value in finding and fixing the exploit used to break into the system, you can’t be sure what else has been done to the system once the intruders gained control (indeed, it’s not unheard of for hackers who recruit systems into a botnet to patch the exploits they used themselves, to safeguard “their” new computer from other hackers, as well as installing their rootkit).

I can’t think of much to change here, except to say that if anything this is more true now than it ever was. Once you’ve lost control of a server it can no longer be trusted. These days there are just too many ways to hide exploit code.

And again, we now have greatly reduced cost and complexity for rebuilding servers. It’s now not only possible but easy to completely script the build of an entire web application stack and have a new web service spun up before the hipsters in marketing have finished arguing about wht vwls 2 drp (srry) from the new domain name.
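To give a flavour of what I mean, here's a deliberately tiny sketch of a scripted rebuild. Everything in it is a placeholder: the repository URL, tag, image name and port are invented, and in real life you'd reach for whatever provisioning or orchestration tooling you already use, deploying from a known-good, audited revision rather than from anything that was sitting on the compromised box:

```python
#!/usr/bin/env python3
"""Sketch of a scripted front-end rebuild. All names below are illustrative."""
import subprocess

STEPS = [
    # Build a fresh image from a known-good, audited tag in source control
    # (hypothetical repository URL and tag).
    ["docker", "build", "-t", "frontend:rebuild",
     "https://example.com/frontend.git#v1.2.4"],
    # Run the replacement service rather than patching the compromised host.
    ["docker", "run", "-d", "--name", "frontend-rebuild",
     "-p", "8080:80", "frontend:rebuild"],
]

for step in STEPS:
    print("running:", " ".join(step))
    subprocess.run(step, check=True)  # abort the rebuild if any step fails
```

The point isn't the specific tooling; it's that once the whole build is captured as code, "nuke it from orbit and redeploy" stops being a heroic effort and becomes the obvious default.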

Make a plan for recovery and to bring your website back online and stick to it:

Nobody wants to be offline for longer than they have to be. That's a given. If this website is a revenue-generating mechanism then the pressure to bring it back online quickly will be intense. Even if the only thing at stake is your / your company's reputation, this is still going to generate a lot of pressure to put things back up quickly.

However, don’t give in to the temptation to go back online too quickly. Instead, move as fast as possible to understand what caused the problem and to solve it before you go back online, or else you will almost certainly fall victim to an intrusion once again. And remember, “to get hacked once can be classed as misfortune; to get hacked again straight afterwards looks like carelessness” (with apologies to Oscar Wilde).

  1. I'm assuming you've understood all the issues that led to the successful intrusion in the first place before you even start this section. I don't want to overstate the case but if you haven't done that first then you really do need to. Sorry.
  2. Never pay blackmail / protection money. This is the sign of an easy mark and you don't want that phrase ever used to describe you.
  3. Don't be tempted to put the same server(s) back online without a full rebuild. It should be far quicker to build a new box or "nuke the server from orbit and do a clean install" on the old hardware than it would be to audit every single corner of the old system to make sure it is clean before putting it back online again. If you disagree with that then you probably don't know what it really means to ensure a system is fully cleaned, or your website deployment procedures are an unholy mess. You presumably have backups and test deployments of your site that you can just use to build the live site, and if you don't then being hacked is not your biggest problem.
  4. Be very careful about re-using data that was "live" on the system at the time of the hack. I won't say "never ever do it" because you'll just ignore me, but frankly I think you do need to consider the consequences of keeping data around when you know you cannot guarantee its integrity. Ideally, you should restore this from a backup made prior to the intrusion. If you cannot or will not do that, you should be very careful with that data because it's tainted. You should especially be aware of the consequences to others if this data belongs to customers or site visitors rather than directly to you.
  5. Monitor the system(s) carefully. You should resolve to do this as an ongoing process in the future (more below), but take extra pains to be vigilant during the period immediately following your site coming back online. The intruders will almost certainly be back, and if you can spot them trying to break in again you will quickly see whether you really have closed all the holes they used before, plus any they made for themselves, and you might gather useful information you can pass on to your local law enforcement. (A minimal example of what I mean follows this list.)
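To make that last point a little more concrete, here's the kind of crude first pass I have in mind while proper monitoring is being put in place. The log path and patterns are assumptions for a generic nginx-style access log; in practice you'd tune them to the specific holes that were used against you, and feed the results into real alerting rather than a print statement:

```python
#!/usr/bin/env python3
"""Crude post-incident log watch; the path and patterns are illustrative only."""
import re

LOG_PATH = "/var/log/nginx/access.log"  # assumption: adjust to your own stack

SUSPICIOUS = [
    re.compile(r"union\s+select", re.I),  # crude SQL injection probes
    re.compile(r"\.\./\.\."),             # path traversal attempts
    re.compile(r"\.php\?", re.I),         # scanners probing software you may not even run
]

with open(LOG_PATH, encoding="utf-8", errors="replace") as log:
    for line in log:
        if any(pattern.search(line) for pattern in SUSPICIOUS):
            print("ALERT:", line.strip())  # in real life: alert, don't just print
```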
I'm mostly summarising earlier points here so I won't rehash them again. I will say that the plan I talk about for getting back into production should ideally be produced and wargamed before an incident occurs. We should be rehearsing all our business continuity plans anyway, and a successful hack is really just another class of risk to the business.