To cover that off requires testing the whole ecosystem from the operating system up. Even if the system itself passes every imaginable test first time, does it still pass after it's been running for a day, or a week, or a year? What about when other applications are running, or after a vital security patch is applied? Even if you test all that every time, a hardware component might simply overheat or fail. That's why it's vital to design as much redundancy into the architecture as possible, and it's pretty much unforgivable for NATS to lose a backup system at the same time as a primary on such an important application.
I have no knowledge of Island Energy's systems, but I'd be surprised if the important stuff was developed in-house. That doesn't absolve them of responsibility for appropriate testing, of course, or for having a proven and well-rehearsed recovery plan.