A system failover scheme
In many enterprises, backup systems are seen as a necessary but resented overhead, and many will take risky measures to minimise that cost. But is that wise?
Some try to save by using lower-specced systems, or make the backups do double duty as training or development environments. Backup systems are insurance, and as with regular insurance, not spending enough can be severely debilitating when the worst does occur.
The justification for spending properly should be as simple as estimating how much the business would lose for every minute, hour or day that the main system is down. If using lower-specced backup systems, estimate how much business would be lost because they cannot handle the demand that the main systems can.
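As a back-of-envelope illustration, the comparison takes only a few lines of arithmetic. The figures below are purely hypothetical placeholders; substitute your own revenue, outage and capacity estimates.

```python
# Hypothetical back-of-envelope comparison of downtime losses versus
# the cost of a properly specced backup set. All figures are illustrative.

revenue_per_hour = 50_000            # estimated business lost per hour of total outage
outage_hours_per_year = 8            # estimated unplanned downtime without adequate backups
degraded_capacity_shortfall = 0.4    # fraction of demand a lower-specced backup cannot serve
degraded_hours_per_year = 40         # hours per year spent running on that degraded backup

full_outage_loss = revenue_per_hour * outage_hours_per_year
degraded_loss = revenue_per_hour * degraded_capacity_shortfall * degraded_hours_per_year
backup_cost_per_year = 250_000       # annualised cost of a fully specced backup set

print(f"Loss while completely down:     {full_outage_loss:,} per year")
print(f"Loss from under-specced backup: {degraded_loss:,.0f} per year")
print(f"Cost of a full-spec backup set: {backup_cost_per_year:,} per year")
```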
Doing these opportunity-cost calculations will, I think, show that skimping on backup systems is folly: their cost is very minor compared to the potential losses.
The failover scheme
Having established the financial case for spending sufficiently on backup systems, it then comes down to deciding on the best topology.
Many run critical systems in tandem, either master-slave or round-robin, but these are usually at the same location and tightly coupled for fast failover should one server fail. However, data centres also face site-wide threats, including fires, such as those that have started in air-conditioning units, and terror attacks. The latter need not involve explosive devices; cyber attacks on the supporting infrastructure are enough. If systems are duplicated in only one other data centre, losing one centre leaves the enterprise operating from a single centre, which is a huge risk to the business.
Therefore, I propose three complete system sets in three separate locations. Since each is a potential main system, each must be fully specified for full-load operation, including full security, so non-secure uses such as training and development should still run on separate systems from the business-critical ones. More than three sets can be provided, of course, but three is enough for the main system to retain a full backup set even if one set goes down. This presents far less risk to the enterprise.
| Sets available | Risk profile | Comments |
|---|---|---|
| 0 | Catastrophic | Completely offline |
| 1 | Critical | Cannot afford to lose the remaining system |
| 2 | At risk | Still one system set in reserve for backup |
| 3 | Safe | Normal operation, with maintenance of backup systems |
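The table above can also be expressed as a trivial check to feed monitoring or alerting. This is a minimal sketch; the function name and messages are illustrative, not taken from any particular monitoring product.

```python
def risk_profile(healthy_sets: int) -> str:
    # Mirrors the risk table: classify the enterprise's exposure by the
    # number of fully operational system sets currently available.
    if healthy_sets <= 0:
        return "Catastrophic: completely offline"
    if healthy_sets == 1:
        return "Critical: cannot afford to lose the remaining system"
    if healthy_sets == 2:
        return "At risk: one system set still in reserve for backup"
    return "Safe: normal operation, with backup sets under maintenance"

print(risk_profile(2))  # -> "At risk: one system set still in reserve for backup"
```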
A common problem is that many systems are not implemented so that failover can be fully automatic. This often means a lot of manual emergency work to cut over. It also makes disaster-recovery testing a major and disruptive effort, and puts the business at risk of delays if any settings are in error.
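To make "fully automatic" concrete, here is a minimal sketch of an automated cutover loop. The probe and promotion functions are hypothetical stand-ins for real monitoring checks and DNS or load-balancer changes.

```python
import time

SITES = ["site-a", "site-b", "site-c"]   # three fully specced system sets
active = SITES[0]

def is_healthy(site: str) -> bool:
    # Hypothetical probe; in practice this would query monitoring endpoints
    # or load-balancer health checks for the named site.
    return True

def promote(site: str) -> None:
    # Hypothetical cutover step: repoint DNS or load balancers, switch
    # replication direction, and so on, with no manual emergency work.
    print(f"Promoting {site} to active")

while True:
    if not is_healthy(active):
        # Fail over to the first healthy standby, with no manual steps.
        standby = next((s for s in SITES if s != active and is_healthy(s)), None)
        if standby is not None:
            promote(standby)
            active = standby
    time.sleep(30)   # probe interval in seconds
```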
Making systems fail over completely automatically unlocks a clear advantage of having more than two system sets: round-robin failover at set intervals through the year. This gives the last-used set plenty of time for thorough scheduled maintenance, without hastily trying to fit into tight windows of opportunity and unnecessarily putting the enterprise at risk.
Thus, the optimal setup is three system sets, with round-robin failover every four months and maintenance proceeding all year round at an even pace.
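As a sketch of the rotation itself, the active set can be derived purely from the calendar, so each planned failover happens on schedule rather than as an emergency. The epoch date and site names below are assumptions for illustration.

```python
from datetime import date

SITES = ["site-a", "site-b", "site-c"]   # three fully specced system sets
ROTATION_MONTHS = 4                       # planned failover interval
EPOCH = date(2024, 1, 1)                  # hypothetical start of the rotation schedule

def active_site(today: date) -> str:
    # Whole months elapsed since the epoch, grouped into four-month periods;
    # each period is served by the next set in round-robin order.
    months = (today.year - EPOCH.year) * 12 + (today.month - EPOCH.month)
    return SITES[(months // ROTATION_MONTHS) % len(SITES)]

print(active_site(date(2025, 6, 1)))   # which set should be live in June 2025
```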