A system failover scheme
In many enterprises, backup systems are seen as a necessary but resented overhead, and many will take risky measures to minimise that cost. But is that wise?
Some try to save by using lower-specced systems, or make the backups do double duty as training or development environments. Backup systems are insurance, and as with regular insurance, not spending enough can be severely debilitating when the worst does occur.
The justification for spending properly should be as simple as estimating how much the business would lose for every minute, hour or day that the main system is down. If using lower-specced backup systems, estimate how much business would be lost because they cannot handle the demand that the main systems can.
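As a back-of-envelope illustration, the comparison takes only a few lines of arithmetic. The figures below are purely hypothetical placeholders; substitute your own revenue, outage and capacity estimates.

```python
# Hypothetical back-of-envelope comparison of downtime losses versus
# the cost of a properly specced backup set. All figures are illustrative.

revenue_per_hour = 50_000            # estimated business lost per hour of total outage
outage_hours_per_year = 8            # estimated unplanned downtime without adequate backups
degraded_capacity_shortfall = 0.4    # fraction of demand a lower-specced backup cannot serve
degraded_hours_per_year = 40         # hours per year spent running on that degraded backup

full_outage_loss = revenue_per_hour * outage_hours_per_year
degraded_loss = revenue_per_hour * degraded_capacity_shortfall * degraded_hours_per_year
backup_cost_per_year = 250_000       # annualised cost of a fully specced backup set

print(f"Loss while completely down:     {full_outage_loss:,} per year")
print(f"Loss from under-specced backup: {degraded_loss:,.0f} per year")
print(f"Cost of a full-spec backup set: {backup_cost_per_year:,} per year")
```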
Doing these opportunity-cost calculations will, I think, show that skimping on backup systems is folly: their cost is very minor compared to the potential losses.
The failover scheme
Having established the financial case for spending sufficiently on backup systems, it then comes down to deciding on the best topology.
Many run critical systems in tandem, either master-slave or round-robin, but these are usually at the same location and tightly coupled for fast failover should one server fail. However, data centres also face site-wide threats, including fires, such as those that have started in air-conditioning units, and terror attacks. The latter need not involve explosive devices; cyber attacks on the supporting infrastructure are enough. If systems are duplicated in only one other data centre, losing one centre leaves the enterprise operating from a single centre, which is a huge risk to the business.
Therefore, I propose three complete system sets in three separate locations. Since each is a potential main system, each must be fully specified for full-load operation, including full security, so non-secure uses such as training and development should still run on separate systems from the business-critical ones. More than three sets can be provided, of course, but three is enough for the main system to retain a full backup set even if one set goes down. This presents far less risk to the enterprise.
| Sets available | Risk profile | Comments |
|---|---|---|
| 0 | Catastrophic | Completely offline |
| 1 | Critical | Cannot afford to lose the remaining system |
| 2 | At risk | Still one system set in reserve for backup |
| 3 | Safe | Normal operation, with maintenance of backup systems |
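The table above can also be expressed as a trivial check to feed monitoring or alerting. This is a minimal sketch; the function name and messages are illustrative, not taken from any particular monitoring product.

```python
def risk_profile(healthy_sets: int) -> str:
    # Mirrors the risk table: classify the enterprise's exposure by the
    # number of fully operational system sets currently available.
    if healthy_sets <= 0:
        return "Catastrophic: completely offline"
    if healthy_sets == 1:
        return "Critical: cannot afford to lose the remaining system"
    if healthy_sets == 2:
        return "At risk: one system set still in reserve for backup"
    return "Safe: normal operation, with backup sets under maintenance"

print(risk_profile(2))  # -> "At risk: one system set still in reserve for backup"
```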
A common problem is that many systems are not implemented so that failover can be fully automatic. This often means a lot of manual emergency work to cut over. It also makes disaster-recovery testing a major and disruptive effort, and puts the business at risk of delays if any settings are in error.
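To make "fully automatic" concrete, here is a minimal sketch of an automated cutover loop. The probe and promotion functions are hypothetical stand-ins for real monitoring checks and DNS or load-balancer changes.

```python
import time

SITES = ["site-a", "site-b", "site-c"]   # three fully specced system sets
active = SITES[0]

def is_healthy(site: str) -> bool:
    # Hypothetical probe; in practice this would query monitoring endpoints
    # or load-balancer health checks for the named site.
    return True

def promote(site: str) -> None:
    # Hypothetical cutover step: repoint DNS or load balancers, switch
    # replication direction, and so on, with no manual emergency work.
    print(f"Promoting {site} to active")

while True:
    if not is_healthy(active):
        # Fail over to the first healthy standby, with no manual steps.
        standby = next((s for s in SITES if s != active and is_healthy(s)), None)
        if standby is not None:
            promote(standby)
            active = standby
    time.sleep(30)   # probe interval in seconds
```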
Making systems fail over completely automatically unlocks a clear advantage of having more than two system sets: round-robin failover at set intervals through the year. This gives the last-used set plenty of time for thorough scheduled maintenance, without hastily trying to fit into tight windows of opportunity and unnecessarily putting the enterprise at risk.
Thus, the optimal setup is three system sets, with round-robin failover every four months and maintenance proceeding all year round at an even pace.
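As a sketch of the rotation itself, the active set can be derived purely from the calendar, so each planned failover happens on schedule rather than as an emergency. The epoch date and site names below are assumptions for illustration.

```python
from datetime import date

SITES = ["site-a", "site-b", "site-c"]   # three fully specced system sets
ROTATION_MONTHS = 4                       # planned failover interval
EPOCH = date(2024, 1, 1)                  # hypothetical start of the rotation schedule

def active_site(today: date) -> str:
    # Whole months elapsed since the epoch, grouped into four-month periods;
    # each period is served by the next set in round-robin order.
    months = (today.year - EPOCH.year) * 12 + (today.month - EPOCH.month)
    return SITES[(months // ROTATION_MONTHS) % len(SITES)]

print(active_site(date(2025, 6, 1)))   # which set should be live in June 2025
```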