5. “Nuke from orbit” is still the best approach to a rooted system. See http://serverfault.com/a/218011/7783
I’ve talked about this in the Server Fault answer above, and I might do another post diving into some of the details behind my beliefs here, but the drive to rebuild after a system has been compromised comes down to trust. For an illustration of why I use that word, please read “Reflections on Trusting Trust”, an old paper but one that makes it very clear that once a system has been compromised you can no longer trust it.
You must be able to trust your systems; you must have confidence in the data you store and process, or it is worthless. If you have customers then they need to be able to trust you with their information. Heck, even for a piddling little blog like this, visitors expect that the site won’t try to exploit their browser, and that any contact details they leave in their comments won’t promptly be sold to a spammer.
This seems simple enough, but it’s a fundamental part of any “contract” between a site and its users. The cost of re-tooling a compromised site properly is always going to be lower than the cost of your visitors losing faith in you, your business and your web presence (imagine if people lost faith in Amazon’s or Apple’s ability to securely store their credit cards…).
4. Even when there’s lots of pressure to react quickly, “measure twice, and cut once” is a good rule to work by.
In IT there is a lot of pressure on us to act quickly. If a service is down then there is pressure on the operations staff to get it working again. If a code project is behind schedule then there is a lot of pressure on the programming team to crank out code as quickly as possible.
This is a natural part of any business, and to some degree it is healthy if it indicates which areas the business considers most urgent. But it should not be allowed to drive how you act: it’s not possible to fix a problem properly until you fully understand it.
Use a tool such as the 5 Whys system to help you fully diagnose a problem. Try to develop a “working style” that allows you to be aware of pressures (as I said earlier, understanding the pressures you and your managers are under can be an important tool for understanding how systems matter to the business) but also allows you to put them at the back of your mind while you concentrate on the task in hand.
When working on a system issue I try to approach the system and its problem from a holistic as well as a purely analytical point of view. This often allows me to spot root causes that I would miss if I employed just one approach. It also encourages other IT staff to contribute to the diagnosis, because everyone in the team can add something to a holistic view of a system, and that improves both their general troubleshooting experience and their understanding of the system we’re looking at.
To an outsider, especially one who doesn’t understand the approach I use, this appears to take more time because I’m spending more time thinking “up front” about a problem rather than reacting to it. In the end it saves time, because it allows me to solve a problem with fewer, but more comprehensive, actions.
3. Putting stuff ‘in the cloud’ isn’t about fixing problems, just changing new for old.
This is a current manifestation of an issue in IT: our obsession with finding a magic button. My comment about the cloud is not a criticism of cloud solutions or of the organisations that use them, but of the drive some people in IT have to find some kind of magic button they can just push to make everything alright, without having to waste time understanding what the “magic button” actually does.
Migrating a service to the cloud doesn’t remove the need to understand the service. For example, if you host a service on Amazon Web Services (AWS) and use a backup service provided by a company that is essentially reselling Amazon S3, then you’re doing it wrong; putting all your eggs in one basket is a great way to end up scrambled.
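To make the eggs-in-one-basket point concrete, here is a minimal sketch of the kind of check you could do on paper or in a script. The vendor names and the `UNDERLYING_PROVIDER` table are entirely made up for illustration; in practice you would have to fill in that mapping yourself from what your vendors tell you about where they actually run.

```python
# Hypothetical mapping from a vendor to the infrastructure it ultimately runs on.
# Every name here is an assumption for the sake of the example.
UNDERLYING_PROVIDER = {
    "aws": "aws",
    "example-backup-reseller": "aws",      # resells S3 under its own brand
    "office-tape-library": "on-premises",
}


def shares_a_basket(service_vendor: str, backup_vendor: str) -> bool:
    """Return True when the service and its backups rest on the same provider."""
    # Vendors missing from the table are assumed to run their own infrastructure.
    service_base = UNDERLYING_PROVIDER.get(service_vendor, service_vendor)
    backup_base = UNDERLYING_PROVIDER.get(backup_vendor, backup_vendor)
    return service_base == backup_base


if __name__ == "__main__":
    print(shares_a_basket("aws", "example-backup-reseller"))  # True
    print(shares_a_basket("aws", "office-tape-library"))      # False
```

The code is trivial on purpose: the hard part is knowing what goes in the table, which is exactly the understanding that moving to the cloud doesn’t let you skip.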
2. No-one complains that you spent too much time & money on backups *after* a major disaster.
This is pretty much self-explanatory. Backups, as a concept at least, are one of the simplest tasks a sysadmin/operations team can deal with. It sounds so easy, yet time and time again people turn out to have not bothered in the first place, or to have not tested the backups for months on end, or (as per my comment above) to be storing backups right next to the item they’re protecting, either physically or logically, which costs the backups a large part of their usefulness.
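As a rough illustration of the first two failure modes (no backup at all, and backups nobody has test-restored in months), a small check might look something like this. The `BackupStatus` record, the function name and the one-day and thirty-day thresholds are all assumptions of mine rather than any standard; where the timestamps come from depends entirely on your backup tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class BackupStatus:
    """Hypothetical summary of one backup job; every field is an assumption."""
    name: str
    last_backup: Optional[datetime]        # None means no backup was ever taken
    last_restore_test: Optional[datetime]  # None means never restore-tested


def backup_warnings(
    status: BackupStatus,
    now: datetime,
    max_backup_age: timedelta = timedelta(days=1),
    max_test_age: timedelta = timedelta(days=30),
) -> List[str]:
    """Return human-readable warnings; an empty list means nothing was flagged."""
    warnings = []
    if status.last_backup is None:
        warnings.append(f"{status.name}: no backup has ever been taken")
    elif now - status.last_backup > max_backup_age:
        warnings.append(f"{status.name}: last backup is too old")
    if status.last_restore_test is None:
        warnings.append(f"{status.name}: restore has never been tested")
    elif now - status.last_restore_test > max_test_age:
        warnings.append(f"{status.name}: restore test is overdue")
    return warnings


if __name__ == "__main__":
    # Made-up dates: the backup is fresh, but nobody has test-restored it lately.
    now = datetime(2024, 1, 31, 12, 0)
    blog_db = BackupStatus("blog-db", datetime(2024, 1, 31, 2, 0), datetime(2023, 10, 1))
    for warning in backup_warnings(blog_db, now):
        print(warning)
```

A check like this only tells you what your records say; the restore test itself still has to be done by actually restoring something.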
1. Nothing magical happened just because someone said “virtualisation” or any other buzzword during the design phase.
This is arguably another shout-out to the magic button issue I describe under the point about “cloud” solutions above. Moving your servers off physical hardware to a virtual platform (whether cloud-based or local) doesn’t alter the need to plan storage, network VLANs and routes, etc. Nor does it usually do anything to make a problem magically more or less complex.
What changes is how you might approach these problems. If your virtualised file server needs more storage then you need to take it from the storage pool allocated to your virtual servers, rather than simply adding more physical disks to a conventional server as you would have in the past. If your virtualised server has more memory or CPU resources than it needs then you can return these resources to the pool available for other virtual guests, instead of ignoring the wasted resources or installing another role onto a physical server box as you might have done in the past.
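The pool idea can be sketched as a toy model. This is not how any particular hypervisor accounts for storage; `StoragePool`, `grow` and `shrink` are names I have invented purely to show that the planning problem is still there, just expressed as a shared budget instead of a pile of disks.

```python
from typing import Dict


class StoragePool:
    """Toy model of a shared storage pool for virtual guests (illustrative only)."""

    def __init__(self, total_gb: int) -> None:
        self.total_gb = total_gb
        self.allocated_gb: Dict[str, int] = {}

    @property
    def free_gb(self) -> int:
        # Whatever has not been handed to a guest is still available.
        return self.total_gb - sum(self.allocated_gb.values())

    def grow(self, guest: str, gb: int) -> None:
        """Give a guest more storage, taken from the shared pool."""
        if gb > self.free_gb:
            raise ValueError(f"pool has only {self.free_gb} GB free")
        self.allocated_gb[guest] = self.allocated_gb.get(guest, 0) + gb

    def shrink(self, guest: str, gb: int) -> None:
        """Return storage a guest does not need to the shared pool."""
        current = self.allocated_gb.get(guest, 0)
        if gb > current:
            raise ValueError(f"{guest} only holds {current} GB")
        self.allocated_gb[guest] = current - gb


if __name__ == "__main__":
    pool = StoragePool(total_gb=1000)
    pool.grow("file-server", 400)
    pool.shrink("file-server", 100)
    print(pool.free_gb)  # 700
```

The `ValueError` in `grow` is the virtual equivalent of running out of drive bays: somebody still had to decide how big the pool should be.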