1 files changed, 234 insertions, 0 deletions
diff --git a/_articles/2021-02-17-ann-fallible-fault-injection-library-for-stress-testing-failure-scenarios.md b/_articles/2021-02-17-ann-fallible-fault-injection-library-for-stress-testing-failure-scenarios.md
new file mode 100644
index 0000000..77dfa38
--- /dev/null
+++ b/_articles/2021-02-17-ann-fallible-fault-injection-library-for-stress-testing-failure-scenarios.md
@@ -0,0 +1,234 @@
+---
+
+title: "ANN: fallible - Fault injection library for stress-testing failure scenarios"
+
+date: 2021-02-17
+
+layout: post
+
+lang: en
+
+ref: ann-fallible-fault-injection-library-for-stress-testing-failure-scenarios
+
+---
+
+Yesterday I pushed v0.1.0 of [fallible], a minuscule library for fault injection and stress-testing of C programs.
+
+[fallible]: https://fallible.euandreh.xyz
+
+## Existing solutions
+
+Writing robust code can be challenging, and tools like static analyzers, fuzzers and friends can help you get there with more certainty.
+While trying to make some of my C code more robust against system crashes, full disks, out-of-memory conditions and similar scenarios, I didn't find the tooling I expected to exist.
+In particular, I couldn't find tools to explicitly stress-test those failure scenarios.
+
+Take the "[Writing Robust Programs][gnu-std]" section of the GNU Coding Standards:
+
+[gnu-std]: https://www.gnu.org/prep/standards/standards.html#Semantics
+
+> Check every system call for an error return, unless you know you wish to ignore errors.
+> (...) Check every call to malloc or realloc to see if it returned NULL.
+
+From a robustness standpoint, this is a reasonable stance: if you want to have a robust program that knows how to fail when you're out of memory and `malloc` returns `NULL`, then you ought to check every call to `malloc`.
+
+Take a sample code snippet for clarity:
+
+```c
+void a_function() {
+  char *s1 = malloc(A_NUMBER);
+  strcpy(s1, "some string");
+
+  char *s2 = malloc(A_NUMBER);
+  strcpy(s2, "another string");
+}
+```
+
+At first glance, this code is unsafe: if any of the calls to `malloc` returns `NULL`, `strcpy` will be given a `NULL` pointer.
+
+My first instinct was to change this code to something like this:
+
+```diff
+@@ -1,7 +1,15 @@
+ void a_function() {
+   char *s1 = malloc(A_NUMBER);
++  if (!s1) {
++    fprintf(stderr, "out of memory, exitting\n");
++    exit(1);
++  }
+   strcpy(s1, "some string");
+
+   char *s2 = malloc(A_NUMBER);
++  if (!s2) {
++    fprintf(stderr, "out of memory, exitting\n");
++    exit(1);
++  }
+   strcpy(s2, "another string");
+ }
+```
+
+As I later found out, there are at least 2 problems with this approach:
+
+1. **it doesn't compose**: this could arguably work if `a_function` was `main`.
+   But if `a_function` lives inside a library, an `exit(1);` is an inelegant way of handling failures, and will catch the top-level `main` consuming the library by surprise;
+2. **it gives up instead of handling failures**: the actual handling goes a bit beyond stopping.
+   What about open file handles, in-memory caches, unflushed bytes, etc.?
+
+If you could force only the second call to `malloc` to fail, [Valgrind] would correctly complain that the program exited with unfreed memory.
+
+[Valgrind]: https://www.valgrind.org/
+
+So the last change to make the best version of the above code is:
+
+```diff
+@@ -1,15 +1,15 @@
+-void a_function() {
++bool a_function() {
+   char *s1 = malloc(A_NUMBER);
+   if (!s1) {
+-    fprintf(stderr, "out of memory, exitting\n");
+-    exit(1);
++    return false;
+   }
+   strcpy(s1, "some string");
+
+   char *s2 = malloc(A_NUMBER);
+   if (!s2) {
+-    fprintf(stderr, "out of memory, exitting\n");
+-    exit(1);
++    free(s1);
++    return false;
+   }
+   strcpy(s2, "another string");
++  return true;
+ }
+```
+
+Instead of returning `void`, `a_function` now returns `bool` to indicate whether an error occurred during its execution.
+If `a_function` returned a pointer to something, it could signal failure by returning `NULL`; another option is to return an `int` that represents an error code.
+
+The code is now a) safe and b) failing gracefully, returning control to the caller so that it can properly handle the error case.
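+
+As a rough sketch of what handling the error on the caller's side can look like: the caller `run` and the value of `A_NUMBER` below are hypothetical, and `a_function` is completed with a `return true;` and the matching `free` calls only so that the example is self-contained.
+
+```c
+#include <stdbool.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+
+#define A_NUMBER 64 /* arbitrary size, chosen for illustration */
+
+/* Same shape as the function above, completed so that it returns
+ * true on success and frees its buffers (additions for this sketch). */
+bool a_function(void) {
+  char *s1 = malloc(A_NUMBER);
+  if (!s1) {
+    return false;
+  }
+  strcpy(s1, "some string");
+
+  char *s2 = malloc(A_NUMBER);
+  if (!s2) {
+    free(s1);
+    return false;
+  }
+  strcpy(s2, "another string");
+
+  free(s2);
+  free(s1);
+  return true;
+}
+
+/* Hypothetical caller: it decides how to report the failure,
+ * flushes its log, and propagates an error code upwards. */
+int run(FILE *log) {
+  if (!a_function()) {
+    fprintf(log, "a_function failed: out of memory\n");
+    fflush(log);
+    return 1;
+  }
+  return 0;
+}
+```
+
+The library function no longer decides the fate of the whole program: a `main` could call `run(stderr)` and use its return value as the exit status.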
+
+After seeing similar patterns in well-designed APIs, I adopted this practice for my own code, but was still left with manually verifying its correctness and robustness.
+
+How could I add assertions around my code that would help me make sure the `free(s1);` exists, before getting an error report?
+How do other people and projects solve this?
+
+From what I could see, people either a) hope for the best, b) write safe code but don't stress-test it or c) write ad-hoc code to stress it.
+
+The most prominent case of c) is SQLite: it has a few wrappers around the familiar `malloc` to do fault injection, check for memory limits, add warnings, create shim layers for other environments, etc.
+All of that, however, is tightly coupled with SQLite itself, and couldn't be easily extracted for use elsewhere.
+
+When searching for it online, an [interesting thread] caught my attention: fail each call to `malloc` the first time it is reached, and when the same stack trace appears again, allow it to proceed.
+
+[interesting thread]: https://stackoverflow.com/questions/1711170/unit-testing-for-failed-malloc
+
+## Implementation
+
+A working implementation of that already exists: [mallocfail].
+It uses `LD_PRELOAD` to replace `malloc` at run-time, computes the SHA of the stacktrace and fails once for each SHA.
+
+I initially envisioned and started implementing something very similar to mallocfail.
+However, I wanted it to go beyond out-of-memory scenarios, and using `LD_PRELOAD` for every possible corner that could fail wasn't a good idea in the long run.
+
+Also, mallocfail won't work together with tools such as Valgrind, which want to do their own override of `malloc` with `LD_PRELOAD`.
+
+I instead went with a less automatic approach: starting with a `fallible_should_fail(char *filename, int lineno)` function that fails once for each `filename`+`lineno` combination, I created macro wrappers around common functions such as `malloc`:
+
+```c
+void *fallible_malloc(size_t size, const char *const filename, int lineno) {
+#ifdef FALLIBLE
+  if (fallible_should_fail(filename, lineno)) {
+    return NULL;
+  }
+#else
+  (void)filename;
+  (void)lineno;
+#endif
+  return malloc(size);
+}
+
+#define MALLOC(size) fallible_malloc(size, __FILE__, __LINE__)
+```
+
+With this definition, I could replace the calls to `malloc` with `MALLOC` (or any other name that you want to `#define`):
+
+```diff
+--- 3.c	2021-02-17 00:15:38.019706074 -0300
++++ 4.c	2021-02-17 00:44:32.306885590 -0300
+@@ -1,11 +1,11 @@
+ bool a_function() {
+-  char *s1 = malloc(A_NUMBER);
++  char *s1 = MALLOC(A_NUMBER);
+   if (!s1) {
+     return false;
+   }
+   strcpy(s1, "some string");
+
+-  char *s2 = malloc(A_NUMBER);
++  char *s2 = MALLOC(A_NUMBER);
+   if (!s2) {
+     free(s1);
+     return false;
+```
+
+With this change, if the program gets compiled with the `-DFALLIBLE` flag, the fault-injection mechanism will run, and `MALLOC` will fail once for each `filename`+`lineno` combination.
+When the flag is missing, `MALLOC` is a very thin wrapper around `malloc`, which compilers could remove entirely, and the `-lfallible` flag can be omitted.
+
+This applies not only to `malloc` or other `stdlib.h` functions.
+If `a_function` is important or relevant, I could add a wrapper around it too, one that checks `fallible_should_fail` to exercise whether its callers are also doing the proper clean-up.
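+
+The fail-once rule and a wrapper for `a_function` can be sketched together as follows.
+This is not the library's code: `sketch_should_fail`, `sketch_a_function`, `A_FUNCTION` and `MAX_SITES` are hypothetical names, `a_function` is reduced to a stub, and the table of call sites is kept in memory only, which may differ from how fallible tracks them.
+
+```c
+#include <stdbool.h>
+#include <string.h>
+
+#define MAX_SITES 64 /* arbitrary capacity for this sketch */
+
+struct site {
+  const char *filename;
+  int lineno;
+};
+
+static struct site seen[MAX_SITES];
+static int seen_count = 0;
+
+/* Hypothetical stand-in for fallible_should_fail: returns true the
+ * first time a filename+lineno pair is seen, false on later calls.
+ * It stores the pointer, so it expects literals such as __FILE__. */
+bool sketch_should_fail(const char *filename, int lineno) {
+  for (int i = 0; i < seen_count; i++) {
+    if (seen[i].lineno == lineno &&
+        strcmp(seen[i].filename, filename) == 0) {
+      return false; /* this call site has already failed once */
+    }
+  }
+  if (seen_count == MAX_SITES) {
+    return false; /* table is full: stop injecting failures */
+  }
+  seen[seen_count].filename = filename;
+  seen[seen_count].lineno = lineno;
+  seen_count++;
+  return true;
+}
+
+/* Stub standing in for the real a_function; it always succeeds. */
+bool a_function(void) {
+  return true;
+}
+
+/* Wrapper in the same style as fallible_malloc above. */
+bool sketch_a_function(const char *filename, int lineno) {
+#ifdef FALLIBLE
+  if (sketch_should_fail(filename, lineno)) {
+    return false;
+  }
+#else
+  (void)filename;
+  (void)lineno;
+#endif
+  return a_function();
+}
+
+#define A_FUNCTION() sketch_a_function(__FILE__, __LINE__)
+```
+
+Compiled with `-DFALLIBLE`, the first `A_FUNCTION()` call at a given line returns `false` without running `a_function`, and later calls from that same line go through, so each caller's clean-up path gets exercised once.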
+
+The actual code is just this single function, [`fallible_should_fail`], which ended up taking only ~40 lines.
+In fact, this first version has more lines of Makefile (111), README.md (82) and troff (306) than of library code.
+
+The price for such fine-grained control is that this approach requires more manual work.
+
+[mallocfail]: https://github.com/ralight/mallocfail
+[`fallible_should_fail`]: https://git.euandreh.xyz/fallible/tree/src/fallible.c?id=v0.1.0#n16
+
+## Usage examples
+
+### `MALLOC` from the `README.md`
+
+```c
+// leaky.c
+#include <string.h>
+#include <fallible_alloc.h>
+
+int main() {
+  char *aaa = MALLOC(100);
+  if (!aaa) {
+    return 1;
+  }
+  strcpy(aaa, "a safe use of strcpy");
+
+  char *bbb = MALLOC(100);
+  if (!bbb) {
+    // free(aaa);
+    return 1;
+  }
+  strcpy(bbb, "not unsafe, but aaa is leaking");
+
+  free(bbb);
+  free(aaa);
+  return 0;
+}
+```
+
+Compile with `-DFALLIBLE` and run [`fallible-check.1`][fallible-check]:
+```shell
+$ c99 -DFALLIBLE -o leaky leaky.c -lfallible
+$ fallible-check ./leaky
+Valgrind failed when we did not expect it to:
+(...suppressed output...)
+# exit status is 1
+```
+
+[fallible-check]: https://fallible.euandreh.xyz/fallible-check.1.html
+
+## Conclusion
+
+For my personal use, I'll [package] it for GNU Guix and Nix.
+Packaging it for any other distribution should be trivial; alternatively, just download the tarball and run `[sudo] make install`.
+
+Patches welcome!
+
+[package]: https://git.euandreh.xyz/package-repository/about/