ANN: fallible - Fault injection library for stress-testing failure scenarios
Updated on March 6, 2022
Yesterday I pushed v0.1.0 of fallible, a minuscule library for fault-injection and stress-testing C programs.
2021-06-12: As of 0.3.0 (and beyond), the macro interface improved and is a bit different from what is presented in this article. If you’re interested, I encourage you to take a look at it.
2022-03-06: I’ve archived the project for now. It still needs some maturing before being usable.
Writing robust code can be challenging, and tools like static analyzers, fuzzers and friends can help you get there with more certainty. While trying to make some of my C code more robust against system crashes, full disks, out-of-memory conditions and similar scenarios, I couldn’t find existing tooling that would let me explicitly stress-test those failure scenarios.
Take the “Writing Robust Programs” section of the GNU Coding Standards:
Check every system call for an error return, unless you know you wish to ignore errors. (…) Check every call to malloc or realloc to see if it returned NULL.
From a robustness standpoint, this is a reasonable stance: if you want a robust program that knows how to fail when you’re out of memory and malloc returns NULL, then you ought to check every call to malloc.
Take a sample code snippet for clarity:
void a_function() {
    char *s1 = malloc(A_NUMBER);
    strcpy(s1, "some string");

    char *s2 = malloc(A_NUMBER);
    strcpy(s2, "another string");
}
At first glance, this code is unsafe: if any of the calls to malloc returns NULL, strcpy will be given a NULL pointer.
My first instinct was to change this code to something like this:
@@ -1,7 +1,15 @@
 void a_function() {
     char *s1 = malloc(A_NUMBER);
+    if (!s1) {
+        fprintf(stderr, "out of memory, exitting\n");
+        exit(1);
+    }
     strcpy(s1, "some string");

     char *s2 = malloc(A_NUMBER);
+    if (!s2) {
+        fprintf(stderr, "out of memory, exitting\n");
+        exit(1);
+    }
     strcpy(s2, "another string");
 }
As I later found out, there are at least 2 problems with this approach:
- it doesn’t compose: this could arguably work if a_function was main. But if a_function lives inside a library, an exit(1); is an inelegant way of handling failures, and will catch the top-level main consuming the library by surprise;
- it gives up instead of handling failures: the actual handling goes a bit beyond stopping. What about open file handles, in-memory caches, unflushed bytes, etc.?
If you could force only the second call to malloc to fail, Valgrind would correctly complain that the program exited with unfreed memory.
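One way to force only the second call to fail is a counting wrapper. The sketch below is an illustration and not part of fallible: test_malloc, fail_on_call and call_count are hypothetical names, and the code under test would have to call test_malloc instead of malloc.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical test hook, not part of fallible: the Nth call to
 * test_malloc (1-based) returns NULL; 0 means "never fail". */
int fail_on_call = 0;
int call_count = 0;

void *test_malloc(size_t size) {
    call_count++;
    if (fail_on_call != 0 && call_count == fail_on_call) {
        return NULL; /* simulated out-of-memory */
    }
    return malloc(size);
}
```

With fail_on_call set to 2, the first allocation succeeds and the second returns NULL, so a run under Valgrind reaches the second error branch while s1 is still allocated.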
So the final change, which gives the best version of the above code, is:
@@ -1,15 +1,15 @@
-void a_function() {
+bool a_function() {
     char *s1 = malloc(A_NUMBER);
     if (!s1) {
-        fprintf(stderr, "out of memory, exitting\n");
-        exit(1);
+        return false;
     }
     strcpy(s1, "some string");

     char *s2 = malloc(A_NUMBER);
     if (!s2) {
-        fprintf(stderr, "out of memory, exitting\n");
-        exit(1);
+        free(s1);
+        return false;
     }
     strcpy(s2, "another string");
+    return true;
 }
Instead of returning void, a_function now returns bool to indicate whether an error occurred during its execution.
If a_function needed to return a pointer to something, failure could be signalled by returning NULL; alternatively, it could return an int that represents an error code.
The code is now a) safe and b) failing gracefully, returning control to the caller so that it can properly handle the error case.
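The pointer and error-code conventions could look roughly like the sketch below. The names make_label and make_two_labels are hypothetical, and using ENOMEM as the error code is an assumption made for the example.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Pointer-returning variant: NULL signals failure, and the caller
 * owns (and must free) a non-NULL result. */
char *make_label(const char *text) {
    size_t len = strlen(text) + 1;
    char *copy = malloc(len);
    if (!copy) {
        return NULL;
    }
    memcpy(copy, text, len);
    return copy;
}

/* Error-code variant: 0 on success, a nonzero errno-style code on
 * failure. On failure *first and *second are left untouched. */
int make_two_labels(char **first, char **second) {
    char *a = make_label("some string");
    if (!a) {
        return ENOMEM;
    }
    char *b = make_label("another string");
    if (!b) {
        free(a); /* undo the first allocation before reporting failure */
        return ENOMEM;
    }
    *first = a;
    *second = b;
    return 0;
}
```

Either way, the function cleans up what it allocated itself and leaves the decision about what to do next to the caller.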
After seeing similar patterns in well-designed APIs, I adopted this practice for my own code, but was still left with manually verifying its correctness and robustness.
How could I add assertions around my code that would help me make sure the free(s1); exists, before getting an error report?
How do other people and projects solve this?
From what I could see, either people a) hope for the best, b) write safe code but don’t stress-test it or c) write ad-hoc code to stress it.
The most prominent case of c) is SQLite: it has a few wrappers around the familiar malloc to do fault injection, check for memory limits, add warnings, create shim layers for other environments, etc.
All of that, however, is tightly coupled with SQLite itself, and couldn’t easily be extracted for use somewhere else.
When searching for it online, an interesting thread caught my attention: fail each call to malloc the first time its stacktrace is seen, and when the same stacktrace appears again, allow it to proceed.
A working implementation of that already exists: mallocfail.
It uses LD_PRELOAD to replace malloc at run-time, computes the SHA of the stacktrace and fails once for each SHA.
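The "fail once per stacktrace" idea can be sketched as follows. This is not mallocfail's code: it assumes glibc's backtrace() from execinfo.h, swaps the SHA for a simple FNV-1a hash over the return addresses, and leaves out the LD_PRELOAD interposition entirely. The name stack_should_fail and the table sizes are made up for the example.

```c
#include <execinfo.h> /* backtrace(): a glibc extension, not standard C */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_FRAMES 32
#define MAX_SEEN 1024

static uint64_t seen_hashes[MAX_SEEN];
static size_t seen_count = 0;

/* FNV-1a over the raw return addresses; a stand-in for the SHA. */
static uint64_t hash_frames(void *const *frames, int n) {
    uint64_t h = 14695981039346656037ULL;
    for (int i = 0; i < n; i++) {
        uintptr_t addr = (uintptr_t)frames[i];
        for (size_t b = 0; b < sizeof addr; b++) {
            h ^= (addr >> (8 * b)) & 0xff;
            h *= 1099511628211ULL;
        }
    }
    return h;
}

/* Returns true the first time a given call stack is seen within this
 * process, false on every later call from the same stack. */
bool stack_should_fail(void) {
    void *frames[MAX_FRAMES];
    int n = backtrace(frames, MAX_FRAMES);
    uint64_t h = hash_frames(frames, n);

    for (size_t i = 0; i < seen_count; i++) {
        if (seen_hashes[i] == h) {
            return false; /* already failed once for this stack */
        }
    }
    if (seen_count < MAX_SEEN) {
        seen_hashes[seen_count++] = h;
        return true;
    }
    return false; /* table full: stop injecting failures */
}
```

A replacement malloc would consult such a function and return NULL whenever it answers true, so every distinct path to an allocation gets to fail exactly once.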
I initially envisioned and started implementing something very similar to mallocfail.
However, I wanted it to go beyond out-of-memory scenarios, and using LD_PRELOAD for every possible corner that could fail wasn’t a good idea in the long run.
Also, mallocfail won’t work together with tools such as Valgrind, which want to do their own override of malloc with LD_PRELOAD.
I instead went with something less automatic: starting with a fallible_should_fail(char *filename, int lineno) function that fails once for each filename+lineno combination, I created macro wrappers around common functions such as malloc:
void *fallible_malloc(size_t size, const char *const filename, int lineno) {
#ifdef FALLIBLE
    if (fallible_should_fail(filename, lineno)) {
        return NULL;
    }
#else
    (void)filename;
    (void)lineno;
#endif
    return malloc(size);
}

#define MALLOC(size) fallible_malloc(size, __FILE__, __LINE__)
With this definition, I could replace the calls to malloc with MALLOC (or any other name that you want to #define):
--- 3.c 2021-02-17 00:15:38.019706074 -0300
+++ 4.c 2021-02-17 00:44:32.306885590 -0300
@@ -1,11 +1,11 @@
 bool a_function() {
-    char *s1 = malloc(A_NUMBER);
+    char *s1 = MALLOC(A_NUMBER);
     if (!s1) {
         return false;
     }
     strcpy(s1, "some string");

-    char *s2 = malloc(A_NUMBER);
+    char *s2 = MALLOC(A_NUMBER);
     if (!s2) {
         free(s1);
         return false;
With this change, if the program gets compiled with the -DFALLIBLE flag, the fault-injection mechanism will run, and MALLOC will fail once for each filename+lineno combination.
When the flag is missing, MALLOC is a very thin wrapper around malloc, which compilers could remove entirely, and the -lfallible flag can be omitted.
This applies not only to malloc or other stdlib.h functions.
If a_function is important or relevant, I could add a wrapper around it too, one that consults fallible_should_fail, to exercise whether its callers are also doing the proper clean-up.
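Such a wrapper could follow the same shape as fallible_malloc. In the sketch below, fallible_a_function, A_FUNCTION, caller and demo_should_fail are hypothetical names; demo_should_fail is a deliberately simplified stand-in that fails only on its very first call, and a_function is shortened to a single allocation.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define FALLIBLE 1 /* normally supplied on the command line as -DFALLIBLE */
#define A_NUMBER 64

/* Hypothetical stand-in for fallible_should_fail: it fails only the
 * first time it is called and ignores the call site. */
bool demo_should_fail(const char *filename, int lineno) {
    static bool already_failed = false;
    (void)filename;
    (void)lineno;
    if (already_failed) {
        return false;
    }
    already_failed = true;
    return true;
}

/* Shortened version of the function being wrapped. */
bool a_function(void) {
    char *s1 = malloc(A_NUMBER);
    if (!s1) {
        return false;
    }
    strcpy(s1, "some string");
    free(s1);
    return true;
}

bool fallible_a_function(const char *filename, int lineno) {
#ifdef FALLIBLE
    if (demo_should_fail(filename, lineno)) {
        return false; /* injected failure: a_function is never called */
    }
#else
    (void)filename;
    (void)lineno;
#endif
    return a_function();
}

#define A_FUNCTION() fallible_a_function(__FILE__, __LINE__)

/* A caller whose own error path is now exercised by the injected failure. */
int caller(void) {
    if (!A_FUNCTION()) {
        return -1;
    }
    return 0;
}
```

The first call to caller takes the error branch because of the injected failure, and the next one runs a_function for real.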
The actual code is just this single function, fallible_should_fail, which ended up taking only ~40 lines.
In fact, there are more lines of either Makefile (111), README.md (82) or troff (306) in this first version.
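The body of fallible_should_fail is not reproduced in this article. As a rough idea of what "fail once for each filename+lineno combination" can mean, here is a minimal sketch with a fixed-size table; sketch_should_fail, struct site and MAX_SITES are assumptions made for the example and are not taken from the library.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_SITES 256

struct site {
    const char *filename; /* expected to point at a __FILE__ literal */
    int lineno;
};

static struct site seen_sites[MAX_SITES];
static size_t seen_sites_count = 0;

/* Returns true the first time a filename+lineno pair is seen, and
 * false on every later call with the same pair. */
bool sketch_should_fail(const char *filename, int lineno) {
    for (size_t i = 0; i < seen_sites_count; i++) {
        if (seen_sites[i].lineno == lineno &&
            strcmp(seen_sites[i].filename, filename) == 0) {
            return false;
        }
    }
    if (seen_sites_count == MAX_SITES) {
        return false; /* table full: stop injecting rather than overflow */
    }
    seen_sites[seen_sites_count].filename = filename;
    seen_sites[seen_sites_count].lineno = lineno;
    seen_sites_count++;
    return true;
}
```

In this sketch, calling the code under test repeatedly within one process walks through each wrapped call site's failure path once and then reaches the success path.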
The price for such fine-grained control is that this approach requires more manual work.
// leaky.c
#include <stdlib.h>
#include <string.h>
#include <fallible_alloc.h>

int main() {
    char *aaa = MALLOC(100);
    if (!aaa) {
        return 1;
    }
    strcpy(aaa, "a safe use of strcpy");

    char *bbb = MALLOC(100);
    if (!bbb) {
        // free(aaa);
        return 1;
    }
    strcpy(bbb, "not unsafe, but aaa is leaking");

    free(bbb);
    free(aaa);
    return 0;
}
Compile with -DFALLIBLE and run fallible-check.1.
For my personal use, I’ll package it for GNU Guix and Nix.
Packaging it for any other distribution should be trivial, or you can just download the tarball and run [sudo] make install.
Patches welcome!