
title: "ANN: fallible - Fault injection library for stress-testing failure scenarios"

date: 2021-02-17

updated_at: 2022-03-06

layout: post

lang: en

ref: ann-fallible-fault-injection-library-for-stress-testing-failure-scenarios

Yesterday I pushed v0.1.0 of fallible, a minuscule library for fault injection and stress testing of C programs.

EDIT

2021-06-12: As of 0.3.0 (and beyond), the macro interface improved and is a bit different from what is presented in this article. If you're interested, I encourage you to take a look at it.

2022-03-06: I've archived the project for now. It still needs some maturing before being usable.

Existing solutions

Writing robust code can be challenging, and tools like static analyzers, fuzzers and friends can help you get there with more certainty. When I tried to make some of my C code more robust, so that it could handle system crashes, full disks, out-of-memory conditions and similar scenarios, I couldn't find existing tools that would help me explicitly stress-test those failure scenarios.

Take the "Writing Robust Programs" section of the GNU Coding Standards:

Check every system call for an error return, unless you know you wish to ignore errors. (...) Check every call to malloc or realloc to see if it returned NULL.

From a robustness standpoint, this is a reasonable stance: if you want a robust program that knows how to fail when you're out of memory and malloc returns NULL, then you ought to check every call to malloc.

Take a sample code snippet for clarity:

void a_function() {
  char *s1 = malloc(A_NUMBER);
  strcpy(s1, "some string");

  char *s2 = malloc(A_NUMBER);
  strcpy(s2, "another string");
}

At first glance, this code is unsafe: if either call to malloc returns NULL, strcpy will be given a NULL pointer.

My first instinct was to change this code to something like this:

@@ -1,7 +1,15 @@
 void a_function() {
   char *s1 = malloc(A_NUMBER);
+  if (!s1) {
+    fprintf(stderr, "out of memory, exitting\n");
+    exit(1);
+  }
   strcpy(s1, "some string");

   char *s2 = malloc(A_NUMBER);
+  if (!s2) {
+    fprintf(stderr, "out of memory, exitting\n");
+    exit(1);
+  }
   strcpy(s2, "another string");
 }

As I later found out, there are at least 2 problems with this approach:

it doesn't compose: this could arguably work if a_function was main. But if a_function lives inside a library, an exit(1); is an inelegant way of handling failures, and will catch the top-level main consuming the library by surprise;
it gives up instead of handling failures: the actual handling goes a bit beyond stopping. What about open file handles, in-memory caches, unflushed bytes, etc.?

If you could force only the second call to malloc to fail, Valgrind would correctly complain that the program exited with unfreed memory.

So the last change to make the best version of the above code is:

@@ -1,15 +1,15 @@
-void a_function() {
+bool a_function() {
   char *s1 = malloc(A_NUMBER);
   if (!s1) {
-    fprintf(stderr, "out of memory, exitting\n");
-    exit(1);
+    return false;
   }
   strcpy(s1, "some string");

   char *s2 = malloc(A_NUMBER);
   if (!s2) {
-    fprintf(stderr, "out of memory, exitting\n");
-    exit(1);
+    free(s1);
+    return false;
   }
   strcpy(s2, "another string");
+  return true;
 }

Instead of returning void, a_function now returns bool to indicate whether an error occurred during its execution. If a_function returned a pointer to something, NULL could signal the error; another option is an int that represents an error code.

The code is now a) safe and b) failing gracefully, returning the control to the caller to properly handle the error case.
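To make the "returning the control to the caller" part concrete, here is a minimal sketch of a complete version together with a caller. The value of A_NUMBER, the freeing of both strings on the success path and the run function are assumptions made for illustration; the article defines none of them.

```c
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define A_NUMBER 32 /* assumed value: the article never defines it */

/* Returns false on allocation failure, leaving nothing allocated behind. */
bool a_function(void) {
  char *s1 = malloc(A_NUMBER);
  if (!s1) {
    return false;
  }
  strcpy(s1, "some string");

  char *s2 = malloc(A_NUMBER);
  if (!s2) {
    free(s1); /* undo the first allocation before reporting the failure */
    return false;
  }
  strcpy(s2, "another string");

  /* Assumption: the strings aren't needed after this point. */
  free(s2);
  free(s1);
  return true;
}

/* Hypothetical caller: it, not a_function, decides how to react. */
int run(void) {
  if (!a_function()) {
    fprintf(stderr, "a_function failed: out of memory\n");
    return EXIT_FAILURE;
  }
  return EXIT_SUCCESS;
}
```

Only the outermost caller prints or exits, so a_function stays usable from inside a library.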

After seeing similar patterns in well-designed APIs, I adopted this practice for my own code, but was still left with manually verifying its correctness and robustness.

How could I add assertions around my code that would help me make sure the free(s1); exists, before getting an error report? How do other people and projects solve this?

From what I could see, people either a) hope for the best, b) write safe code but don't stress-test it or c) write ad-hoc code to stress it.

The most prominent case of c) is SQLite: it has a few wrappers around the familiar malloc to do fault injection, check for memory limits, add warnings, create shim layers for other environments, etc. All of that, however, is tightly coupled with SQLite itself, and couldn't be easily pulled out for use somewhere else.

When searching for it online, an interesting thread caught my attention: fail each call to malloc the first time its stacktrace is seen, and when the same stacktrace appears again, allow it to proceed.

Implementation

A working implementation of that already exists: mallocfail. It uses LD_PRELOAD to replace malloc at run-time, computes the SHA of the stacktrace and fails once for each SHA.

I initially envisioned and started implementing something very similar to mallocfail. However, I wanted it to go beyond out-of-memory scenarios, and using LD_PRELOAD for every possible corner that could fail wasn't a good idea in the long run.

Also, mallocfail won't work together with tools such as Valgrind, which want to do their own override of malloc with LD_PRELOAD.

I instead went with something less automatic: starting with a fallible_should_fail(char *filename, int lineno) function that fails once for each filename+lineno combination, I created macro wrappers around common functions such as malloc:

void *fallible_malloc(size_t size, const char *const filename, int lineno) {
#ifdef FALLIBLE
  if (fallible_should_fail(filename, lineno)) {
    return NULL;
  }
#else
  (void)filename;
  (void)lineno;
#endif
  return malloc(size);
}

#define MALLOC(size) fallible_malloc(size, __FILE__, __LINE__)

With this definition, I could replace the calls to malloc with MALLOC (or any other name that you want to #define):

--- 3.c 2021-02-17 00:15:38.019706074 -0300
+++ 4.c 2021-02-17 00:44:32.306885590 -0300
@@ -1,11 +1,11 @@
 bool a_function() {
-  char *s1 = malloc(A_NUMBER);
+  char *s1 = MALLOC(A_NUMBER);
   if (!s1) {
     return false;
   }
   strcpy(s1, "some string");

-  char *s2 = malloc(A_NUMBER);
+  char *s2 = MALLOC(A_NUMBER);
   if (!s2) {
     free(s1);
     return false;

With this change, if the program gets compiled with the -DFALLIBLE flag, the fault-injection mechanism will run, and MALLOC will fail once for each filename+lineno combination. When the flag is missing, MALLOC is a very thin wrapper around malloc, which compilers could remove entirely, and the -lfallible flag can be omitted.
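The "fail once for each filename+lineno combination" rule can be sketched as below. This is not the library's actual fallible_should_fail: the sketch_ prefixed names, the fixed-size table and its MAX_SITES capacity are all assumptions chosen to keep the example short.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define MAX_SITES 256 /* arbitrary capacity chosen for this sketch */

struct site {
  const char *filename;
  int lineno;
};

static struct site seen[MAX_SITES];
static size_t seen_count = 0;

/* Returns true the first time a filename+lineno pair shows up, and false on
   every later call for that pair. Assumes filename points to storage that
   outlives the call, as __FILE__ string literals do. */
bool sketch_should_fail(const char *filename, int lineno) {
  for (size_t i = 0; i < seen_count; i++) {
    if (seen[i].lineno == lineno && strcmp(seen[i].filename, filename) == 0) {
      return false; /* this site already had its one failure */
    }
  }
  if (seen_count == MAX_SITES) {
    return false; /* table is full: stop injecting rather than overflow */
  }
  seen[seen_count].filename = filename;
  seen[seen_count].lineno = lineno;
  seen_count++;
  return true;
}

/* Same shape as fallible_malloc, minus the #ifdef FALLIBLE guard. */
void *sketch_malloc(size_t size, const char *filename, int lineno) {
  if (sketch_should_fail(filename, lineno)) {
    return NULL; /* injected failure */
  }
  return malloc(size);
}

#define SKETCH_MALLOC(size) sketch_malloc((size), __FILE__, __LINE__)
```

With this sketch, the first SKETCH_MALLOC executed at a given source line returns NULL and later executions of that same line reach the real malloc, so each error branch gets taken once.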

This doesn't apply only to malloc or other stdlib.h functions. If a_function is important or relevant, I could add a wrapper around it too, one that consults fallible_should_fail, to exercise whether its callers also do the proper clean-up.
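A rough sketch of such a wrapper follows. Everything in it is hypothetical: stand_in_should_fail is a simplified replacement for the library function that fails only the very first call, a_function is reduced to a placeholder, and fallible_a_function, A_FUNCTION, caller and held_resources are names invented for the example. The #ifdef FALLIBLE guard is left out for brevity.

```c
#include <stdbool.h>

/* Stand-in for the library's fallible_should_fail, simplified for this
   sketch: it fails only the very first call and ignores the call site. */
static bool stand_in_should_fail(const char *filename, int lineno) {
  static bool already_failed = false;
  (void)filename;
  (void)lineno;
  if (already_failed) {
    return false;
  }
  already_failed = true;
  return true;
}

/* Placeholder for the real a_function, which always succeeds here. */
static bool a_function(void) {
  return true;
}

/* Hypothetical wrapper: reports an injected failure without running
   a_function, mirroring the shape of fallible_malloc. */
bool fallible_a_function(const char *filename, int lineno) {
  if (stand_in_should_fail(filename, lineno)) {
    return false;
  }
  return a_function();
}

#define A_FUNCTION() fallible_a_function(__FILE__, __LINE__)

/* Counts resources currently held, so a missing release is observable. */
int held_resources = 0;

/* A caller whose clean-up path is the thing being exercised. */
bool caller(void) {
  held_resources++; /* acquire something before the fallible call */
  if (!A_FUNCTION()) {
    held_resources--; /* release it on the failure path */
    return false;
  }
  held_resources--; /* release it on the success path */
  return true;
}
```

If the release on the failure path were missing, held_resources would stay above zero after the first call to caller, which is the kind of mistake the injected failure is meant to expose.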

The actual code is just this single function, [fallible_should_fail], which ended up taking only ~40 lines. In fact, the Makefile (111 lines), the README.md (82) and the troff (306) are each longer than that in this first version.

The price for such fine-grained control is that this approach requires more manual work.

Usage examples

`MALLOC` from the `README.md`

// leaky.c
#include <stdlib.h>
#include <string.h>
#include <fallible_alloc.h>

int main() {
  char *aaa = MALLOC(100);
  if (!aaa) {
    return 1;
  }
  strcpy(aaa, "a safe use of strcpy");

  char *bbb = MALLOC(100);
  if (!bbb) {
    // free(aaa);
    return 1;
  }
  strcpy(bbb, "not unsafe, but aaa is leaking");

  free(bbb);
  free(aaa);
  return 0;
}

Compile with -DFALLIBLE and run fallible-check(1):

$ c99 -DFALLIBLE -o leaky leaky.c -lfallible
$ fallible-check ./leaky
Valgrind failed when we did not expect it to:
(...suppressed output...)
# exit status is 1

Conclusion

For my personal use, I'll package it for GNU Guix and Nix. Packaging it for any other distribution should be trivial; alternatively, just download the tarball and run [sudo] make install.

Patches welcome!