src/content/blog/2021/02/17/fallible.adoc



= ANN: fallible - Fault injection library for stress-testing failure scenarios
:updatedat: 2022-03-06

:fallible: https://euandreh.xyz/fallible/

Yesterday I pushed v0.1.0 of {fallible}[fallible], a minuscule library for
fault-injection and stress-testing C programs.

== _EDIT_

:changelog: https://euandreh.xyz/fallible/CHANGELOG.html
:tarball: https://euandre.org/static/attachments/fallible.tar.gz

2021-06-12: As of {changelog}[0.3.0] (and beyond), the macro interface has
improved and differs a bit from what is presented in this article.  If you're
interested, I encourage you to take a look at it.

2022-03-06: I've {tarball}[archived] the project for now.  It still needs some
maturing before being usable.

== Existing solutions

:gnu-std: https://www.gnu.org/prep/standards/standards.html#Semantics
:valgrind: https://www.valgrind.org/
:so-alloc: https://stackoverflow.com/questions/1711170/unit-testing-for-failed-malloc

Writing robust code can be challenging, and tools like static analyzers, fuzzers
and friends can help you get there with more certainty.  As I tried to make some
of my C code more robust, so that it could handle system crashes, full disks,
out-of-memory conditions and similar scenarios, I couldn't find existing tools
to help me explicitly stress-test those failure scenarios.

Take the "{gnu-std}[Writing Robust Programs]" section of the GNU Coding
Standards:

____
Check every system call for an error return, unless you know you wish to ignore
errors.  (...) Check every call to malloc or realloc to see if it returned NULL.
____

From a robustness standpoint, this is a reasonable stance: if you want to have a
robust program that knows how to fail when you're out of memory and `malloc`
returns `NULL`, then you ought to check every call to `malloc`.

Take a sample code snippet for clarity:

[source,c]
----
void a_function() {
  char *s1 = malloc(A_NUMBER);
  strcpy(s1, "some string");

  char *s2 = malloc(A_NUMBER);
  strcpy(s2, "another string");
}
----

At first glance, this code is unsafe: if either call to `malloc` returns
`NULL`, `strcpy` will be given a `NULL` pointer.

My first instinct was to change this code to something like this:

[source,diff]
----
@@ -1,7 +1,15 @@
 void a_function() {
   char *s1 = malloc(A_NUMBER);
+  if (!s1) {
+    fprintf(stderr, "out of memory, exitting\n");
+    exit(1);
+  }
   strcpy(s1, "some string");

   char *s2 = malloc(A_NUMBER);
+  if (!s2) {
+    fprintf(stderr, "out of memory, exitting\n");
+    exit(1);
+  }
   strcpy(s2, "another string");
 }
----

As I later found out, there are at least 2 problems with this approach:

. *it doesn't compose*: this could arguably work if `a_function` was `main`.
  But if `a_function` lives inside a library, an `exit(1);` is an inelegant way
  of handling failures, and will catch the top-level `main` consuming the
  library by surprise;
. *it gives up instead of handling failures*: the actual handling goes a bit
  beyond stopping.  What about open file handles, in-memory caches, unflushed
  bytes, etc.?

If you could force only the second call to `malloc` to fail,
{valgrind}[Valgrind] would correctly complain that the program exited with
unfreed memory.

So the final change, which yields the best version of the code above, is:

[source,diff]
----
@@ -1,15 +1,15 @@
-void a_function() {
+bool a_function() {
   char *s1 = malloc(A_NUMBER);
   if (!s1) {
-    fprintf(stderr, "out of memory, exitting\n");
-    exit(1);
+    return false;
   }
   strcpy(s1, "some string");

   char *s2 = malloc(A_NUMBER);
   if (!s2) {
-    fprintf(stderr, "out of memory, exitting\n");
-    exit(1);
+    free(s1);
+    return false;
   }
   strcpy(s2, "another string");
+  return true;
 }
----

Instead of returning `void`, `a_function` now returns `bool` to indicate whether
an error occurred during its execution.  If `a_function` returned a pointer to
something, the return value could be `NULL`, or an `int` that represents an
error code.
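
To make those two alternatives concrete, here is a minimal sketch of both
styles.  The names `copy_string`, `copy_string_code`, `COPY_OK` and
`COPY_ENOMEM` are hypothetical and exist only for this illustration:

[source,c]
----
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a heap-allocated copy of `src`, or NULL when
   malloc fails.  The caller owns the result and must free it. */
char *copy_string(const char *src) {
  size_t size = strlen(src) + 1; /* include the terminating '\0' */
  char *copy = malloc(size);
  if (!copy) {
    return NULL; /* report the failure instead of exiting */
  }
  memcpy(copy, src, size);
  return copy;
}

/* Hypothetical error codes for the int-returning style. */
enum { COPY_OK = 0, COPY_ENOMEM = 1 };

/* Same operation, reporting through an int code and an out-parameter. */
int copy_string_code(const char *src, char **out) {
  *out = copy_string(src);
  return *out ? COPY_OK : COPY_ENOMEM;
}
----

In both styles the function only reports the failure, and the decision about
what to do next stays with the caller.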

The code is now a) safe and b) failing gracefully, returning the control to the
caller to properly handle the error case.

After seeing similar patterns in well-designed APIs, I adopted this practice for
my own code, but I still had to verify its correctness and robustness manually.

How could I add assertions around my code that would help me make sure the
`free(s1);` exists, before getting an error report?  How do other people and
projects solve this?

From what I could see, people either a) hope for the best, b) write safe code
but don't stress-test it or c) write ad-hoc code to stress it.

The most prominent case of c) is SQLite: it has a few wrappers around the
familiar `malloc` to do fault injection, check for memory limits, add warnings,
create shim layers for other environments, etc.  All of that, however, is
tightly coupled with SQLite itself, and couldn't be easily extracted for use
somewhere else.

While searching online, an {so-alloc}[interesting thread] caught my attention.
It suggests failing each call to `malloc` the first time it is made and, when
the same stacktrace appears again, allowing it to proceed.

== Implementation

:mallocfail: https://github.com/ralight/mallocfail
:should-fail-fn: https://euandre.org/git/fallible/tree/src/fallible.c?id=v0.1.0#n16

A working implementation of that already exists: {mallocfail}[mallocfail].  It
uses `LD_PRELOAD` to replace `malloc` at run-time, computes the SHA of the
stacktrace and fails once for each SHA.

I initially envisioned and started implementing something very similar to
mallocfail.  However, I wanted it to go beyond out-of-memory scenarios, and using
`LD_PRELOAD` for every possible corner that could fail wasn't a good idea in the
long run.

Also, mallocfail won't work together with tools such as Valgrind, which want to
do their own override of `malloc` with `LD_PRELOAD`.

I instead went with something less automatic: starting with a
`fallible_should_fail(char *filename, int lineno)` function that fails once for
each `filename`+`lineno` combination, I created macro wrappers around common
functions such as `malloc`:

[source,c]
----
void *fallible_malloc(size_t size, const char *const filename, int lineno) {
#ifdef FALLIBLE
  if (fallible_should_fail(filename, lineno)) {
    return NULL;
  }
#else
  (void)filename;
  (void)lineno;
#endif
  return malloc(size);
}

#define MALLOC(size) fallible_malloc(size, __FILE__, __LINE__)
----

With this definition, I could replace the calls to `malloc` with `MALLOC` (or
any other name that you want to `#define`):

[source,diff]
----
--- 3.c 2021-02-17 00:15:38.019706074 -0300
+++ 4.c 2021-02-17 00:44:32.306885590 -0300
@@ -1,11 +1,11 @@
 bool a_function() {
-  char *s1 = malloc(A_NUMBER);
+  char *s1 = MALLOC(A_NUMBER);
   if (!s1) {
     return false;
   }
   strcpy(s1, "some string");

-  char *s2 = malloc(A_NUMBER);
+  char *s2 = MALLOC(A_NUMBER);
   if (!s2) {
     free(s1);
     return false;
----

With this change, if the program is compiled with the `-DFALLIBLE` flag, the
fault-injection mechanism will run, and `MALLOC` will fail once for each
`filename`+`lineno` combination.  When the flag is missing, `MALLOC` is a very
thin wrapper around `malloc`, which compilers could remove entirely, and the
`-lfallible` flag can be omitted.

This applies not only to `malloc` or other `stdlib.h` functions.  If
`a_function` is important or relevant, I could add a wrapper around it too, one
that checks `fallible_should_fail` to exercise whether its callers are also
doing the proper clean-up.
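
As a rough sketch of what such a wrapper could look like: the names
`demo_should_fail`, `wrapped_a_function`, `A_FUNCTION` and `caller` below are
hypothetical, and `demo_should_fail` is only a stand-in that fails on its very
first call, so that the example compiles without the library.  The
`#ifdef FALLIBLE` guard is left out for brevity:

[source,c]
----
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the library's should-fail check: it fails on the
   first call only, regardless of the call site. */
static bool demo_should_fail(const char *filename, int lineno) {
  static bool already_failed = false;
  (void)filename;
  (void)lineno;
  if (already_failed) {
    return false;
  }
  already_failed = true;
  return true;
}

/* The function being wrapped; it reports failure through its return value. */
static bool a_function(void) {
  char *s = malloc(32);
  if (!s) {
    return false;
  }
  strcpy(s, "some string");
  free(s);
  return true;
}

/* Wrapper: injects a failure before delegating to the real function. */
bool wrapped_a_function(const char *filename, int lineno) {
  if (demo_should_fail(filename, lineno)) {
    return false;
  }
  return a_function();
}

#define A_FUNCTION() wrapped_a_function(__FILE__, __LINE__)

/* A caller whose clean-up path now gets exercised by the injected failure. */
bool caller(void) {
  char *buf = malloc(16);
  if (!buf) {
    return false;
  }
  if (!A_FUNCTION()) {
    free(buf); /* the clean-up under test */
    return false;
  }
  free(buf);
  return true;
}
----

With this stand-in, the first call to `caller` takes the failure branch and the
second one succeeds, so a leak checker would see both paths.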

The actual code is just this single function,
{should-fail-fn}[`fallible_should_fail`], which ended up taking only ~40 lines.
In fact, in this first version the Makefile (111 lines), the README.md (82) and
the troff (306) are each longer than that.
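
To give an idea of the "fail once for each `filename`+`lineno`" behaviour, here
is a minimal sketch.  It is not the code from fallible: the name
`sketch_should_fail`, the fixed-size table and the choice to keep state only in
memory are all assumptions made for this illustration, and the linked source
remains the reference for how the library really does it.

[source,c]
----
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical limit on how many distinct call sites are tracked. */
#define SKETCH_MAX_SITES 256

struct sketch_site {
  const char *filename; /* assumed to outlive the table, as __FILE__ does */
  int lineno;
};

static struct sketch_site sketch_seen[SKETCH_MAX_SITES];
static size_t sketch_seen_count = 0;

/* Returns true the first time a filename+lineno pair is seen, and false on
   every later call with the same pair. */
bool sketch_should_fail(const char *filename, int lineno) {
  for (size_t i = 0; i < sketch_seen_count; i++) {
    if (sketch_seen[i].lineno == lineno &&
        strcmp(sketch_seen[i].filename, filename) == 0) {
      return false; /* this call site has already failed once */
    }
  }
  if (sketch_seen_count == SKETCH_MAX_SITES) {
    return false; /* table is full: stop injecting failures */
  }
  sketch_seen[sketch_seen_count].filename = filename;
  sketch_seen[sketch_seen_count].lineno = lineno;
  sketch_seen_count++;
  return true;
}
----

Because this sketch forgets everything when the process exits, it only helps
when the caller recovers from the injected failure and keeps running.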

The price for such fine-grained control is that this approach requires more
manual work.

== Usage examples

=== `MALLOC` from the `README.md`

:fallible-check: https://euandreh.xyz/fallible/fallible-check.1.html

[source,c]
----
// leaky.c
#include <stdlib.h> // for free()
#include <string.h>
#include <fallible_alloc.h>

int main() {
  char *aaa = MALLOC(100);
  if (!aaa) {
    return 1;
  }
  strcpy(aaa, "a safe use of strcpy");

  char *bbb = MALLOC(100);
  if (!bbb) {
    // free(aaa);
    return 1;
  }
  strcpy(bbb, "not unsafe, but aaa is leaking");

  free(bbb);
  free(aaa);
  return 0;
}
----

Compile with `-DFALLIBLE` and run {fallible-check}[`fallible-check.1`]:

[source,shell]
----
$ c99 -DFALLIBLE -o leaky leaky.c -lfallible
$ fallible-check ./leaky
Valgrind failed when we did not expect it to:
(...suppressed output...)
# exit status is 1
----

== Conclusion

:package: https://euandre.org/git/package-repository/

For my personal use, I'll {package}[package] it for GNU Guix and Nix.
Packaging it for any other distribution should be trivial, and so should
downloading the tarball and running `[sudo] make install`.

Patches welcome!