---

title: "ANN: fallible - Fault injection library for stress-testing failure scenarios"

date: 2021-02-17

layout: post

lang: en

ref: ann-fallible-fault-injection-library-for-stress-testing-failure-scenarios

---

Yesterday I pushed v0.1.0 of [fallible], a minuscule library for fault injection and stress-testing of C programs.

[fallible]: https://euandreh.xyz/fallible/

## Existing solutions

Writing robust code can be challenging, and tools like static analyzers, fuzzers and friends can help you get there with more certainty.
As I tried to make some of my C code more robust against system crashes, full disks, out-of-memory conditions and similar scenarios, I couldn't find existing tools that would let me explicitly stress-test those failure paths.

Take the "[Writing Robust Programs][gnu-std]" section of the GNU Coding Standards:

[gnu-std]: https://www.gnu.org/prep/standards/standards.html#Semantics

> Check every system call for an error return, unless you know you wish to ignore errors.
> (...) Check every call to malloc or realloc to see if it returned NULL.

From a robustness standpoint, this is a reasonable stance: if you want to have a robust program that knows how to fail when you're out of memory and `malloc` returns `NULL`, then you ought to check every call to `malloc`.

Take a sample code snippet for clarity:

```c
void a_function() {
  char *s1 = malloc(A_NUMBER);
  strcpy(s1, "some string");

  char *s2 = malloc(A_NUMBER);
  strcpy(s2, "another string");
}
```

At first glance, this code is unsafe: if either call to `malloc` returns `NULL`, `strcpy` will be given a `NULL` pointer.

My first instinct was to change this code to something like this:

```diff
@@ -1,7 +1,15 @@
 void a_function() {
   char *s1 = malloc(A_NUMBER);
+  if (!s1) {
+    fprintf(stderr, "out of memory, exiting\n");
+    exit(1);
+  }
   strcpy(s1, "some string");

   char *s2 = malloc(A_NUMBER);
+  if (!s2) {
+    fprintf(stderr, "out of memory, exiting\n");
+    exit(1);
+  }
   strcpy(s2, "another string");
 }
```

As I later found out, there are at least two problems with this approach:

1. **it doesn't compose**: this could arguably work if `a_function` were `main`.
   But if `a_function` lives inside a library, an `exit(1);` is an inelegant way of handling failures, and will catch the top-level `main` consuming the library by surprise;
2. **it gives up instead of handling failures**: the actual handling goes a bit beyond stopping.
   What about open file handles, in-memory caches, unflushed bytes, etc.?

If you could force only the second call to `malloc` to fail, [Valgrind] would correctly complain that the program exited with unfreed memory, since `s1` is never released.

[Valgrind]: https://www.valgrind.org/
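
One hand-rolled way to force a specific allocation to fail is a countdown wrapper around `malloc`.
The sketch below is only an illustration: `test_malloc` and `fail_nth_malloc` are hypothetical names that don't come from any library mentioned in this post.

```c
#include <assert.h>
#include <stdlib.h>

/* Calls left before the injected failure; negative means "never fail". */
static int calls_until_failure = -1;

/* Hypothetical test hook: make the n-th upcoming call to test_malloc fail. */
void fail_nth_malloc(int n) {
  assert(n >= 1);
  calls_until_failure = n;
}

/* Hypothetical drop-in for malloc that honours the countdown above. */
void *test_malloc(size_t size) {
  if (calls_until_failure > 0 && --calls_until_failure == 0) {
    calls_until_failure = -1; /* fail exactly once, then behave normally */
    return NULL;
  }
  return malloc(size);
}
```

Calling `fail_nth_malloc(2)` before running a version of `a_function` that allocates through `test_malloc` would take the second error branch while `s1` is still allocated, which is the situation a leak checker can then flag.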

So the last change, which gives the best version of `a_function`, is:

```diff
@@ -1,15 +1,15 @@
-void a_function() {
+bool a_function() {
   char *s1 = malloc(A_NUMBER);
   if (!s1) {
-    fprintf(stderr, "out of memory, exiting\n");
-    exit(1);
+    return false;
   }
   strcpy(s1, "some string");

   char *s2 = malloc(A_NUMBER);
   if (!s2) {
-    fprintf(stderr, "out of memory, exiting\n");
-    exit(1);
+    free(s1);
+    return false;
   }
   strcpy(s2, "another string");
+  return true;
 }
```

Instead of returning `void`, `a_function` now returns `bool` to indicate whether an error occurred during its execution.
Had `a_function` returned a pointer, `NULL` could signal the failure instead; an `int` error code is another option.

The code is now a) safe and b) fails gracefully, returning control to the caller so that it can properly handle the error case.
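
For reference, here is a self-contained sketch of the same idea, including a caller.
It uses a single clean-up label, a common C idiom that differs in layout from the diff above, and it also frees both strings on success.
The value of `A_NUMBER` and the `run` function are assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define A_NUMBER 32 /* assumed size, large enough for both strings */

/* Returns true on success; either way, nothing allocated here is left behind. */
bool a_function(void) {
  bool ok = false;
  char *s1 = NULL;
  char *s2 = NULL;

  assert(sizeof "another string" <= A_NUMBER);

  s1 = malloc(A_NUMBER);
  if (!s1) {
    goto out;
  }
  strcpy(s1, "some string");

  s2 = malloc(A_NUMBER);
  if (!s2) {
    goto out;
  }
  strcpy(s2, "another string");

  ok = true;
out:
  free(s2); /* free(NULL) is a no-op, so this is safe on every path */
  free(s1);
  return ok;
}

/* Hypothetical caller: it decides what a failure means, not a_function. */
int run(void) {
  if (!a_function()) {
    fprintf(stderr, "a_function failed\n");
    return EXIT_FAILURE;
  }
  return EXIT_SUCCESS;
}
```

With one exit path, adding a third allocation later only requires one more `free` next to the others, rather than one more `free` in every error branch.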

After seeing similar patterns in well-designed APIs, I adopted this practice for my own code, but was still left with manually verifying its correctness and robustness.

How could I add assertions around my code that would help me make sure the `free(s1);` exists, before getting an error report?
How do other people and projects solve this?

From what I could see, people either a) hope for the best, b) write safe code but don't stress-test it, or c) write ad-hoc code to stress it.

The most prominent case of c) is SQLite: it has a few wrappers around the familiar `malloc` to do fault injection, check for memory limits, add warnings, create shim layers for other environments, etc.
All of that, however, is tightly coupled with SQLite itself, and couldn't be easily extracted for use somewhere else.

While searching for it online, an [interesting thread] caught my attention: fail each call to `malloc` the first time it is reached, and when the same stacktrace appears again, allow it to proceed.

[interesting thread]: https://stackoverflow.com/questions/1711170/unit-testing-for-failed-malloc

## Implementation

A working implementation of that already exists: [mallocfail].
It uses `LD_PRELOAD` to replace `malloc` at run-time, computes the SHA of the stacktrace and fails once for each SHA.

I initially envisioned and started implementing something very similar to mallocfail.
However, I wanted it to go beyond out-of-memory scenarios, and using `LD_PRELOAD` for every possible corner that could fail wasn't a good idea in the long run.

Also, mallocfail won't work together with tools such as Valgrind, which want to do their own override of `malloc` with `LD_PRELOAD`.

I instead went with something less automatic: starting with a `fallible_should_fail(char *filename, int lineno)` function that fails once for each `filename`+`lineno` combination, I created macro wrappers around common functions such as `malloc`:

```c
void *fallible_malloc(size_t size, const char *const filename, int lineno) {
#ifdef FALLIBLE
  if (fallible_should_fail(filename, lineno)) {
    return NULL;
  }
#else
  (void)filename;
  (void)lineno;
#endif
  return malloc(size);
}

#define MALLOC(size) fallible_malloc(size, __FILE__, __LINE__)
```

With this definition, I could replace the calls to `malloc` with `MALLOC` (or any other name that you want to `#define`):

```diff
--- 3.c 2021-02-17 00:15:38.019706074 -0300
+++ 4.c 2021-02-17 00:44:32.306885590 -0300
@@ -1,11 +1,11 @@
 bool a_function() {
-  char *s1 = malloc(A_NUMBER);
+  char *s1 = MALLOC(A_NUMBER);
   if (!s1) {
     return false;
   }
   strcpy(s1, "some string");

-  char *s2 = malloc(A_NUMBER);
+  char *s2 = MALLOC(A_NUMBER);
   if (!s2) {
     free(s1);
     return false;
```

With this change, if the program gets compiled with the `-DFALLIBLE` flag, the fault-injection mechanism will run, and `MALLOC` will fail once for each `filename`+`lineno` combination.
When the flag is missing, `MALLOC` is a very thin wrapper around `malloc`, which compilers could remove entirely, and the `-lfallible` flag can be omitted.

This isn't limited to `malloc` or other `stdlib.h` functions.
If `a_function` is important or relevant, I could add a wrapper around it too, one that consults `fallible_should_fail`, to check whether its callers are also doing the proper clean-up.
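
Such a wrapper could mirror the shape of `fallible_malloc`.
The sketch below is an assumption-laden illustration: `fallible_a_function`, `A_FUNCTION` and `caller` are hypothetical names, `a_function` is replaced by a stand-in so that the block compiles on its own, and the prototype of `fallible_should_fail` is inferred from its use above rather than copied from the library's header.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for the real a_function, so that this sketch compiles alone. */
bool a_function(void) { return true; }

#ifdef FALLIBLE
/* Assumed prototype; real code would include the library's header instead. */
bool fallible_should_fail(const char *filename, int lineno);
#endif

/* Hypothetical wrapper: reports a failure once per call site under -DFALLIBLE. */
bool fallible_a_function(const char *const filename, int lineno) {
  assert(filename != NULL);
#ifdef FALLIBLE
  if (fallible_should_fail(filename, lineno)) {
    return false;
  }
#else
  (void)filename;
  (void)lineno;
#endif
  return a_function();
}

#define A_FUNCTION() fallible_a_function(__FILE__, __LINE__)

/* Hypothetical caller whose clean-up the injected failure would exercise. */
int caller(void) {
  if (!A_FUNCTION()) {
    return 1; /* release whatever the caller holds, then report the failure */
  }
  return 0;
}
```

Without `-DFALLIBLE`, `A_FUNCTION()` simply forwards to `a_function`, just like `MALLOC` forwards to `malloc`.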

The actual code is just this single function, [`fallible_should_fail`], which ended up taking only ~40 lines.
In fact, in this first version the Makefile (111 lines), the README.md (82) and the troff sources (306) are each longer than that.

The price for such fine-grained control is that this approach requires more manual work.
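
To make the "fails once for each `filename`+`lineno` combination" behaviour concrete, here is a minimal in-process sketch of the idea.
It is not the code from fallible: `should_fail_once`, the fixed-size table and `MAX_SITES` are all assumptions made for the example, and it keeps no state between separate runs of a program.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_SITES 256 /* arbitrary capacity for this sketch */

struct site {
  const char *filename;
  int lineno;
};

static struct site seen_sites[MAX_SITES];
static size_t seen_count = 0;

/* Hypothetical helper: true the first time a call site is seen, false after. */
bool should_fail_once(const char *filename, int lineno) {
  assert(filename != NULL);
  for (size_t i = 0; i < seen_count; i++) {
    if (seen_sites[i].lineno == lineno &&
        strcmp(seen_sites[i].filename, filename) == 0) {
      return false; /* this site already failed once */
    }
  }
  if (seen_count == MAX_SITES) {
    return false; /* table full: stop injecting rather than overflow */
  }
  /* Stores the pointer itself, so the string must outlive the table,
     which holds for __FILE__ string literals. */
  seen_sites[seen_count].filename = filename;
  seen_sites[seen_count].lineno = lineno;
  seen_count++;
  return true;
}
```

A linear scan is enough here because the number of instrumented call sites in a program is usually small.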

[mallocfail]: https://github.com/ralight/mallocfail
[`fallible_should_fail`]: https://git.euandreh.xyz/fallible/tree/src/fallible.c?id=v0.1.0#n16

## Usage examples

### `MALLOC` from the `README.md`

```c
// leaky.c
#include <string.h>
#include <fallible_alloc.h>

int main() {
  char *aaa = MALLOC(100);
  if (!aaa) {
    return 1;
  }
  strcpy(aaa, "a safe use of strcpy");

  char *bbb = MALLOC(100);
  if (!bbb) {
    // free(aaa);
    return 1;
  }
  strcpy(bbb, "not unsafe, but aaa is leaking");

  free(bbb);
  free(aaa);
  return 0;
}
```

Compile with `-DFALLIBLE` and run [`fallible-check.1`][fallible-check]:
```shell
$ c99 -DFALLIBLE -o leaky leaky.c -lfallible
$ fallible-check ./leaky
Valgrind failed when we did not expect it to:
(...suppressed output...)
# exit status is 1
```

[fallible-check]: https://euandreh.xyz/fallible/fallible-check.1.html

## Conclusion

For my personal use, I'll [package] it for GNU Guix and Nix.
Packaging it for any other distribution should be trivial; otherwise, just download the tarball and run `[sudo] make install`.

Patches welcome!

[package]: https://git.euandreh.xyz/package-repository/