In contrast with the fault-injection aspect of failure testing, the ACM Queue case study on the evolution of Hootsuite’s reactive system is more understandable at a 40,000-foot level. A great many architectural, organizational, and process considerations shaped how the system evolved over time. What stands out for me, among all of this, is the observation that the investigated platforms “appear to be the only platforms or libraries that put an emphasis on embracing and managing failure – which is to say they’re basically designed for resilience.”
After pondering the notion of fault injection, still to no meaningful conclusion, I recalled that I had undertaken something that might have some kinship with that idea.
Automatic Software Fire Drills
In the mid-70s, colleague Dick Morse and I were pondering how to ensure that the failure-handling paths of a distributed communication system actually operated and were resilient. The nodes were connected using bisync over leased lines, and the concern was the integrity of end-point-to-end-point communication in a protocol layer above that level.
The idea was to have a way for a node, from time to time, to automatically force an exception case within the assured-delivery procedure in order to trigger resilient recovery. Recovery was expected, and this was a way to design in the demonstration and logging of successful recovery. We had the idea that this might also help avoid the Maytag syndrome, since network operators would observe actual fault-handling.
The extension of the private protocols was overtaken by advances and changes in communication technology, and the fire-drill mechanism was never implemented for the distributed system.
Impacting Design Thinking
That did not prevent my using the idea in local communication arrangements at branch-office end-points, though. I designed new software for management of the terminal controllers in the offices. The idea of software fire drills informed that design, and I had a means to safely force certain kinds of recovery at the office minicomputer so that terminal displays would be retried and recovered. The drill would be noticed, but the outcome would be benign.
Ultimately, I did not have to enable the fire drill. The equipment we used failed often enough that there was no need for drills. The home-brew connection between the minicomputer and a separate, closed-firmware multi-terminal controller was also so limited that the minicomputer could not diagnose and reset the controller; it could only gracefully shut down terminal requests and allow the terminal-using applications to do likewise. I credit that influence on design thinking for the actual faults being handled as resiliently as the equipment allowed.
In the late 60s, Sperry Univac was bringing up its own operating system on its System/360 plug-compatible processor, the Univac 9400. For software development on prototype hardware, the developer system used IBM disk drives. On replacement with Univac’s early-production compatible disk drives, the operating system began crashing.
It turned out that the IBM disk drives were so reliable that the failure cases in the operating system’s disk drivers had never been exercised. The unreliability of the early-production drives exposed the fact that the OS error handling had never been tested.
I don’t know how ideas about fault injection could have been applied in that case. All I can conclude is that one should design for the ability to cause error paths to run benignly, perhaps on demand, in the production system.
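As a rough illustration of that conclusion, a driver could route a deliberately forced error through exactly the same handling used for real faults, switched on demand. The `FORCE_DISK_ERROR` environment variable and the helper names below are hypothetical, chosen only for the sketch:

```python
import os


class DiskError(Exception):
    """Stand-in for a device-level read failure."""


def read_block(device: str, block_no: int) -> bytes:
    """Stand-in for the real device read; returns a zeroed 512-byte block."""
    return b"\x00" * 512


def read_block_checked(device: str, block_no: int) -> bytes:
    """Read a block, with on-demand exercising of the error path.

    Setting FORCE_DISK_ERROR=1 (a hypothetical switch) makes the first
    read appear to fail, so the recovery code runs benignly in the
    production build and its behavior can be confirmed.
    """
    forced = os.environ.get("FORCE_DISK_ERROR") == "1"
    try:
        if forced:
            raise DiskError("forced drill error")
        return read_block(device, block_no)
    except DiskError:
        # Recovery path: log that it ran, then retry the read once.
        print("disk error handled; retrying")
        return read_block(device, block_no)
```

Had something like this existed in the 9400’s disk drivers, the error handling could have been demonstrated long before unreliable drives exercised it by accident.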
I suppose this experience also lurks behind my insistence that all development testing be done on production builds and that developers test nothing different. That has a desirable impact on how software is developed: in a form that allows its operation to be demonstrated and confirmed in production.
This has arisen in my current thinking around development and verification of interactive PC games. I only just realized the connection.