What we’re going to do here is go back. Way back. A long time ago, I made a brief comment on Mr Angry’s blog article about project management disasters, where I suggested a reason for the difference between a high level management view of IT projects and a lower level IT “Engineer” view of those projects. Mr Angry spun my comment into an entire article and made quite a few good points about how people at different levels look at these kinds of projects. It's still true and well worth a read today.
Which brings me to another friend’s post - the nubby sysadmin posted about a culture of “failure is not an option” that he’s seen happen in various projects. If you read the Mr Angry posts I mentioned above, this might sound familiar.
IT Religion
I think this is a perfect example of IT religion at a high level that tries to trample over IT science principles held by the team actually delivering the project - the executive can’t contemplate failure because the project is important to the company, so they refuse to even discuss the risks the project might face. It might also be a result of an executive who wants to lend support to a project without the technical knowledge to understand how it works, and who thinks that maintaining an ‘aggressively positive’ outlook will inspire people. It won’t.
The IT workers actually carrying out the work know that they must be aware of the risks and must be able to discuss them frankly with project stakeholders in order to maximise the chances of the project succeeding. Avoiding potholes in the road is much easier when you can see them.
The ironic thing of course is that both sides probably believe in the project. Yet someone who holds the “failure is not an option” point of view tends to treat any discussion of risks or fallback positions as ‘defeatist’, and the same goes for merely breaking a large project down into separate smaller projects that can be implemented and tested in turn… which is fine until the project does hit a rock.
This applies to day-to-day coding or sysadmin tasks too - if you’re working to a standard (e.g. complete a code update to the website by $date or achieve x% uptime on the production network) then it’s helpful to know what areas to concentrate on.
Is it better to roll out a website update with 90% of the features 100% complete, or 100% of the features 90% complete, or to delay the release entirely until you can deliver 100% of both?
If a major power issue in the datacentre means the operations team have to power off half the server infrastructure, would the business prefer to lose email entirely? Or lose telephone support? Or keep those available and have the website perform slowly because half the load balancers are off?
Those are all lousy alternatives of course, but it’s much better for the business if management realise that failure sometimes is an option, and discuss these kinds of things in advance rather than waiting until a disaster hits without any idea of what to do about it.
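One way to make that advance discussion concrete is to write the agreed priorities down as data, so that the operations team isn't guessing in the middle of an outage. What follows is only a minimal sketch in Python: the service names come from the example above, but the priorities, the server counts and the `plan_shutdown` helper are all made up for illustration, and a real plan would need to account for dependencies between services.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    name: str
    priority: int        # 1 = last to be switched off (hypothetical, agreed with the business)
    servers_needed: int  # hypothetical capacity figure


def plan_shutdown(services, servers_available):
    """Greedily keep the highest-priority services that still fit.

    Returns a (kept, shed) pair of lists of service names.
    """
    kept, shed = [], []
    remaining = servers_available
    for svc in sorted(services, key=lambda s: s.priority):
        if svc.servers_needed <= remaining:
            kept.append(svc.name)
            remaining -= svc.servers_needed
        else:
            shed.append(svc.name)
    return kept, shed


# Illustrative numbers only: 15 servers in total, roughly half survive the power issue.
SERVICES = [
    Service("email", priority=2, servers_needed=4),
    Service("telephone support", priority=1, servers_needed=3),
    Service("website", priority=3, servers_needed=8),
]

kept, shed = plan_shutdown(SERVICES, servers_available=8)
print("keep:", kept)
print("shed:", shed)
```

The code itself is trivial. The point is that the `priority` column has to be filled in by someone, and that conversation is far easier to have before the power goes out than after.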