The article “Abstracting the Geniuses Away from Failure Testing” puzzles me.
There appears to be agreement that the inability to demonstrate the durability and graceful degradation of distributed protocols is a matter of failed abstractions. At the same time, there is consternation over the impossibility of perfect failure detectors, because “it is impossible to distinguish between delay and failure” in distributed systems. The article explains this situation further.
A key issue is the difficulty of ascertaining whether compositions of fault-tolerant components result in distributed fault tolerance, and how that is to be confirmed (or demonstrated to fail).
Enter Fault Injection
Fault injection appears to be the practice of intentionally perturbing a system to see how it responds. My brain is apparently too small to cotton to this notion. In any case, the limitation is that it takes brains to conceive of promising perturbations. There is also the matter of this being a form of black-box testing: working from the outside of the system and attempting to isolate cases where assured behavior is violated.
It turns out that it takes considerable ingenuity and labor to come up with revealing fault injections.
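To make the idea concrete for myself, here is a minimal sketch of the practice, assuming a toy key-value store and a retry mechanism; all the names here are my own invention, not from the article. A proxy deliberately injects timeouts, and the question is whether the fault-tolerance mechanism (retry) still achieves the successful outcome despite that class of fault.

```python
import random

class FaultInjectingStore:
    """Wraps a store so that get() raises TimeoutError at a given rate.

    This is the "intentional perturbation": the store underneath is
    correct; only the injected faults make it misbehave.
    """
    def __init__(self, store, fail_rate, seed=0):
        self.store = store
        self.fail_rate = fail_rate
        self.rng = random.Random(seed)  # seeded, so runs are reproducible

    def get(self, key):
        if self.rng.random() < self.fail_rate:
            raise TimeoutError(f"injected fault on get({key!r})")
        return self.store.get(key)

def get_with_retry(store, key, attempts=5):
    """The fault-tolerance mechanism under test: retry on timeout."""
    for _ in range(attempts):
        try:
            return store.get(key)
        except TimeoutError:
            continue
    raise TimeoutError(f"get({key!r}) failed after {attempts} attempts")

# Inject faults on half the calls and check the assured behavior holds.
flaky = FaultInjectingStore({"a": 1}, fail_rate=0.5, seed=42)
print(get_with_retry(flaky, "a"))
```

Even this toy case shows where the ingenuity goes: someone had to decide that timeouts (rather than, say, stale reads or duplicated responses) were the promising perturbation, and that a 50% rate would be revealing.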
To have a chance at automating this kind of testing, advances in instrumentation are presupposed. The money quote seems to be this: “A system is fault tolerant if it provides sufficient mechanisms to achieve its successful outcomes despite the given class of faults.”
At this point, it is not clear to me how this doesn’t revert to the previously unsolved problem of establishing what counts as success and as a class of faults. The limitation is mine, I’m thinking. There is promising success involving a kind of automated big-data analysis of the system’s detailed operation, combined with fault-injection infrastructure. System redundancy seems to be a factor in the case of large-scale distributed systems.
I would prefer to leave such large-scale considerations to others.
Small is Not So Beautiful Either?
I do wonder whether I need to comprehend this with regard to scaling a tiny system up to a grid of distributed ones.
Right now, the critical-for-me problem concerns a tiny non-distributed system. I need to confirm that a specified API and the behavior behind it are satisfied by an implementation. This is the classic testing situation: one cannot demonstrate correctness of the implementation, but one can possibly demonstrate a failure. The problem is to figure out what tests provide a kind of cover of correct behavior, and where one can probe for likely bad outcomes that would reveal a mistake.
This is a white-box problem. I have the code of the implementation and I have the specification. If there is a “bug,” it is likely in the specification, because the code is relatively easy to inspect for faithful embodiment of the specification. It is a very small system, kept small and rigorously specified for that reason.
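One way I can picture probing for mistakes in a small, rigorously specified system is to restate the specification independently as an executable oracle and compare the two at boundary-hugging inputs, where mistakes tend to hide. This is only a sketch with a made-up function (`clamp`), not my actual API.

```python
def clamp(x, lo, hi):
    """Implementation under test (illustrative, not a real API)."""
    return max(lo, min(x, hi))

def clamp_spec(x, lo, hi):
    """The specification restated independently as an oracle:
    the result is lo if x < lo, hi if x > hi, otherwise x."""
    if x < lo:
        return lo
    if x > hi:
        return hi
    return x

# Probe at and just beyond each boundary of the specified range.
cases = [(-1, 0, 10), (0, 0, 10), (1, 0, 10),
         (9, 0, 10), (10, 0, 10), (11, 0, 10)]
for x, lo, hi in cases:
    assert clamp(x, lo, hi) == clamp_spec(x, lo, hi), (x, lo, hi)
print("all boundary probes agree")
```

The appeal, for a white-box case like mine, is that a disagreement points at one of two specific artifacts: the implementation or my restatement of the specification, and inspecting which one is wrong is exactly the inspection I can afford in a small system.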
For a defect to be apparent, it seems that I must have higher-order requirements to satisfy with respect to successful outcomes and “classes of faults.” Here there is food for thought, although not because I have a distributed case to deal with. Yet.
I am left to ponder whether there is something about the scaled-up distribution cases that I must anticipate fairly early in the miniature cases I work on first.