This article is derived from a presentation on the same subject.
When discussing feature flags, I find that their costs and benefits are often well exposed and addressed. Online articles like “Feature Toggles (aka Feature Flags)” do a great job of explaining them in detail, giving great general guidance on how to apply techniques to adopt them.
However, the weight of those costs and benefits applies differently on backend, frontend or mobile, and those differences aren’t covered. In fact, many of them stop making sense, or the decision of whether to adopt a feature flag may change depending on the environment.
In this article I try to make the distinction between environments and how feature flags apply to them, with some final best practices I’ve acquired when using them in production.
Feature flags in general tend to be cited in the context of continuous deployment:
A: With continuous deployment, you deploy to production automatically
B: But how do I handle deployment failures, partial features, etc.?
A: With techniques like canary, monitoring and alarms, feature flags, etc.
Though adopting continuous deployment doesn’t force you to use feature flags, it creates a demand for them. The inverse is also true: using feature flags in the code points you more obviously towards continuous deployment. Take the following code sample, for example, which we will reference later in the article:
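A minimal sketch of what such a sample could look like, written in Python. The `Document` class, its `save` method and the `notified` list are hypothetical names used only for illustration, and `notify_listeners` stands in for the `notifyListeners()` function discussed next:

```python
class Document:
    """Hypothetical class, used only to give notify_listeners() a home."""

    def __init__(self):
        self.saved = False
        self.notified = []  # records notifications, for illustration only

    def notify_listeners(self):
        # The new piece of behaviour that is still being worked on.
        self.notified.append("listener")

    def save(self):
        self.saved = True
        # Called unconditionally: the new behaviour ships as soon as
        # this code is deployed.
        self.notify_listeners()
```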
While it is being developed, tested for suitability or something similar, notifyListeners() may not be ready to be included in the code all at once. So instead of keeping it on a separate, long-lived branch, a feature flag can decide when the new, partially implemented function will be called:
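One way this could look, again as a sketch under assumptions: `flags` is a hypothetical mapping from flag name to boolean that is resolved at runtime, and the flag name `notify-listeners` is made up for the example.

```python
class Document:
    def __init__(self, flags):
        # `flags` is a hypothetical dict of flag name -> bool, e.g. loaded
        # from configuration or fetched from a flag service at runtime.
        self.flags = flags
        self.saved = False
        self.notified = []

    def notify_listeners(self):
        # New, partially implemented behaviour.
        self.notified.append("listener")

    def save(self):
        self.saved = True
        # A missing flag defaults to False, so the old behaviour is kept
        # unless the flag is explicitly turned on.
        if self.flags.get("notify-listeners", False):
            self.notify_listeners()
```

The extra `if` and the `flags` dependency are the "extra things around the code" that buy the runtime decision.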
This allows your code to include notifyListeners(), and decide when to call it at runtime. For the price of extra things around the code, you get more dynamicity.
So the fundamental question to ask yourself when considering adding a feature flag should be:
Am I willing to pay with code complexity to get dynamicity?
It is true that you can make the management of feature flags as straightforward as possible, but having no feature flags is simpler than having any. What you get in return is the ability to parameterize the behaviour of the application at runtime, without doing any code changes.
Sometimes this added complexity may tilt the balance towards not using a feature flag, and sometimes the flexibility of changing behaviour at runtime is absolutely worth the added complexity. This can vary a lot by code base and by feature, but fundamentally by environment: it’s much cheaper to deploy a new version of a service than to release a new version of an app.
So the question of which environment is being targeted is key when reasoning about costs and benefits of feature flags.
The key differentiator that makes the trade-offs apply differently is how much control you have over the environment.
When running a backend service, you usually are paying for the servers themselves, and can tweak them as you wish. This means you have full control to make code changes as you wish. Not only that, you decide when to do it, and for how long the transition will last.
On the frontend you have less control: even though you can choose to make a new version available any time you wish, you can’t force clients to immediately switch to the new version. That means that a) clients could skip upgrades at any time and b) you always have to keep backward and forward compatibility in mind.
Even though I’m mentioning frontend directly, it applies to other environments with similar characteristics: desktop applications, command-line programs, etc.
On mobile you have even less control: app stores need to allow your app to be updated, which could bite you when least desired. Theoretically you could make your APK available on third-party stores like F-Droid, or even make the APK itself available for direct download, which would give you the same characteristics as a frontend application, but that happens less often.
On iOS you can’t even do that. You have to get Apple’s blessing on every single update. Even though we have known for over a decade that this is a bad idea, there isn’t a way around it. This is where you have the least control.
In practice, the amount of control you have will change how much you value dynamicity: the less control you have, the more valuable it is. In other words, having a dynamic flag on the backend may or may not be worth it since you could always update the code immediately after, but on iOS it is basically always worth it.
A rollout is used to roll out a new version of software.
Rollouts are usually short-lived, being relevant only as long as the new code is being deployed. The most common rule is percentages.
On the backend, it is common to find it in the deployment infrastructure itself, like canary servers, blue/green deployments, a Kubernetes deployment rollout, etc. You could do those manually, by having a dynamic control in the code itself, but rollbacks are cheap enough that people usually do a normal deployment and just give some extra attention to the metrics dashboard.
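A percentage rule is commonly implemented by hashing a stable identifier into a bucket, so that the same user always gets the same answer and raising the percentage only ever adds users. The sketch below assumes that approach; the function name, the 100-bucket scheme and the choice of SHA-256 are illustrative, not something prescribed here.

```python
import hashlib


def in_rollout(user_id: str, flag: str, percentage: int) -> bool:
    """Return True if `user_id` falls inside the first `percentage` buckets."""
    # Hash the flag name together with the user id so that different
    # flags do not always select the same subset of users.
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # value in 0..99
    return bucket < percentage
```

Because the bucket depends only on the inputs, no per-user state needs to be stored to keep the decision consistent across requests.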
Any time you see a blue/green deployment, there is a rollout happening: most likely a load balancer is starting to direct traffic to the new server, until reaching 100% of the traffic. Effectively, that is a rollout.
On the frontend, you can selectively pick which users will be able to download the new version of a page. You could use geographical region, IP, cookie or something similar to make this decision.
CDN propagation delays and people not refreshing their web pages are also rollouts by themselves, since old and new versions of the software will coexist.
On mobile, the Play Store allows you to perform fine-grained staged rollouts, and the App Store allows you to perform limited phased releases.
On both Android and iOS, it is the user who ultimately performs the download.
In summary: since you control the servers on the backend, you can do rollouts at will, and those are often found automated away in base infrastructure. On the frontend and on mobile, there are ways to make new versions available, but users may not download them immediately, and many different versions of the software end up coexisting.
A feature flag is a flag that tells the application at runtime to turn a given feature on or off. That means that the actual production code will have more than one possible code path to go through, and that a new version of a feature coexists with the old version. The feature flag tells which part of the code to go through.
They are usually medium-lived, being relevant as long as the new code is being developed. The most common rules are percentages, allow/deny lists, A/B groups and client version.
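These rules can be combined into a single evaluation. The sketch below is one possible arrangement, with assumptions stated up front: the `Rule` fields, the order of precedence (deny list, then minimum client version, then allow list, then percentage) and the CRC32 bucketing are all illustrative choices, and A/B groups are left out for brevity.

```python
import zlib
from dataclasses import dataclass, field


@dataclass
class Rule:
    percentage: int = 0                        # 0..100
    allow: set = field(default_factory=set)    # always on for these users
    deny: set = field(default_factory=set)     # always off for these users
    min_version: tuple = (0, 0, 0)             # oldest client able to run the feature


def is_enabled(rule: Rule, user_id: str, client_version: tuple) -> bool:
    if user_id in rule.deny:
        return False
    # A client that is too old cannot run the feature, even if allow-listed.
    if client_version < rule.min_version:
        return False
    if user_id in rule.allow:
        return True
    bucket = zlib.crc32(user_id.encode("utf-8")) % 100  # stable bucket, 0..99
    return bucket < rule.percentage
```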
On the backend, those are useful for things that have a long development cycle, or that need to be done in steps. Consider loading the feature flag rules in memory when the application starts, so that you avoid querying a database or an external service every time a feature flag rule is applied, and avoid flaky results due to intermittent network failures.
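A sketch of the in-memory approach, assuming the rules can be fetched once by some `loader` callable. `FlagStore`, `loader` and the flag name `new-checkout` are hypothetical; a real service would plug in its own source, such as a database query or a configuration file.

```python
import json


class FlagStore:
    """Loads every flag once, at startup, and serves lookups from memory."""

    def __init__(self, loader):
        # `loader` is any zero-argument callable returning a dict of
        # flag name -> bool. It is called exactly once, here.
        self._rules = dict(loader())

    def is_enabled(self, name: str, default: bool = False) -> bool:
        # No I/O on the hot path: a lookup cannot fail because of the network.
        return bool(self._rules.get(name, default))


# Example wiring, using an inline JSON document as the source.
store = FlagStore(lambda: json.loads('{"new-checkout": true}'))
```

The trade-off is that a changed rule is only picked up on restart, unless the store is also refreshed periodically.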
Since on the frontend you don’t control when the client software is updated, you’re left with applying the feature flag rule on the server, and exposing the value through an API for maximum dynamicity. Alternatively, the rule could live in the frontend code itself, falling back to a “just refresh the page”/“just update to the latest version” strategy for less dynamic scenarios.
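The split between server and client could be sketched as follows. Everything here is an assumption made for illustration: the JSON shape with a top-level `flags` object, the flag name `new-profile-screen` and the `beta_users` rule. The point is that the server evaluates the rule and the client only reads the resulting value, with a safe default.

```python
import json


def flags_response(user_id: str, beta_users: set) -> str:
    """Server side: evaluate the rules and return only the resulting values."""
    flags = {"new-profile-screen": user_id in beta_users}
    return json.dumps({"flags": flags})


def read_flag(response_body: str, name: str, default: bool = False) -> bool:
    """Client side: never evaluates rules; falls back to `default` on bad data."""
    try:
        value = json.loads(response_body)["flags"][name]
    except (ValueError, KeyError, TypeError):
        # Malformed JSON, unexpected shape or unknown flag.
        return default
    return value if isinstance(value, bool) else default
```

Changing the rule on the server changes the client's behaviour on the next API call, without shipping a new client.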
On mobile you can’t even rely on a “just update to the latest version” strategy, since the app’s code could be updated with a new feature and then be blocked on the store. Those cases aren’t frequent, but you should always assume the store will deny updates at critical moments so you don’t find yourself with no cards to play. That means the only control you actually have is via the backend, by parameterizing the runtime of the application using the API. In practice, you should always have a feature flag to control any relevant piece of code. There is no such thing as “too small a code change for a feature flag”. What you should ask yourself is:
If the code I’m writing breaks and stays broken for around a month, do I care?
If you’re doing an experimental screen, or something that will have a very small impact, you might answer “no” to the above question. For everything else, the answer will be “yes”: bug fixes, layout changes, refactoring, new screens, filesystem/database changes, etc.
An experiment is a feature flag where you care about the analytical value of the flag, and how it might impact users’ behaviour. In short: a feature flag with analytics.
They are also usually medium-lived, being relevant as long as the new code is being developed. The most common rule is the A/B test.
On the backend, an experiment relies on an analytical environment that will pick the A/B test groups and distributions, which means those can’t be held in memory easily. That also means that you’ll need a fallback value in case fetching the group for a given customer fails.
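The fallback could be sketched like this, assuming a hypothetical `fetch_group` callable that asks the analytical environment for a customer's group and may raise on network errors. The group names `control` and `treatment` are also assumptions.

```python
def experiment_group(fetch_group, customer_id: str, fallback: str = "control") -> str:
    """Return the customer's A/B group, or `fallback` if it cannot be fetched."""
    try:
        group = fetch_group(customer_id)
    except Exception:
        # Any failure talking to the analytical environment: do not let the
        # experiment break the request, serve the fallback experience.
        return fallback
    # Guard against unexpected answers as well.
    return group if group in ("control", "treatment") else fallback
```

Falling back to the control group keeps the existing behaviour for the customer; whether such requests should be excluded from the experiment's analysis is a separate decision.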
On the frontend and on mobile they are no different from feature flags.
An operational toggle is like a system-level manual circuit breaker, where you turn on/off a feature, fail over the load to a different server, etc. They are useful switches to have during an incident.
They are usually long-lived, being relevant as long as the code is in production. The most common rule is percentages.
They can be feature flags that are promoted to operational toggles on the backend, or may be purposefully put in place preventively or after a postmortem analysis.
On the frontend and on mobile they are similar to feature flags, where the “feature” is being turned on and off, and the client interprets this value to show if the “feature” is available or unavailable.
Even though feature flags give you more dynamicity, they’re still somewhat manual: you have to create one for a specific feature and change it by hand.
If you find yourself manually updating a feature flag every other day, or tweaking the percentages frequently, consider making it fully dynamic. Try using a dataset that is generated automatically, or computing the content on the fly.
Say you have a configuration screen with a list of options and sub-options, and you’re trying to find how to better structure this list. Instead of using a feature flag for switching between 3 and 5 options, make it fully dynamic. This way you’ll be able to perform other tests that you didn’t plan, and get more flexibility out of it.
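As a sketch of the fully dynamic variant, the client could render whatever structure the server sends instead of switching between hard-coded layouts. The payload shape (`label` and optional `children` keys) is an assumption made for this example.

```python
def render(options, depth=0):
    """Turn a server-provided option tree into indented lines of text."""
    lines = []
    for option in options:
        lines.append("  " * depth + option["label"])
        # Sub-options are optional; recurse with one more level of indent.
        lines.extend(render(option.get("children", []), depth + 1))
    return lines
```

With this in place, going from 3 to 5 options, reordering them or regrouping sub-options only requires changing the data the server returns.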
After effectively finishing a feature, the old code that coexisted with the new one will be deleted, and all traces of the transition will vanish from the code base. However, if you just remove the feature flag from the API, all of the old versions of clients that relied on that value to show the new feature will downgrade to the old feature.
This means that you should avoid deleting client-facing feature flags, and retire them instead: use the client version to decide when the feature is stable, and return true for every client with a version greater than or equal to that. This way you can stop thinking about the feature flag, and you don’t break or downgrade clients that didn’t upgrade past the transition.
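A retired flag could be sketched as below. The version `3.2.0` and the function names are hypothetical, and returning a fixed `value_for_older` for clients below the stable version is an assumption about how the last state of the flag gets frozen.

```python
def parse_version(text: str) -> tuple:
    """Parse "3.10.1" into (3, 10, 1) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


STABLE_SINCE = parse_version("3.2.0")  # hypothetical first stable version


def retired_flag(client_version: str, value_for_older: bool = False) -> bool:
    # Clients at or past the stable version always get the feature.
    if parse_version(client_version) >= STABLE_SINCE:
        return True
    # Older clients keep whatever the flag last said for them.
    return value_for_older
```

Comparing tuples of integers rather than raw strings matters: as strings, "3.10.1" would sort before "3.2.0".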
Nested flags combine exponentially.
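To make the growth concrete: n independent boolean flags on the same code path give 2^n possible combinations to reason about and test. A small sketch that enumerates them, with made-up flag names in the usage:

```python
from itertools import product


def flag_combinations(flag_names):
    """Return every on/off assignment for the given flags."""
    return [
        dict(zip(flag_names, values))
        for values in product([False, True], repeat=len(flag_names))
    ]
```

Three nested flags already yield 8 combinations, and each additional flag doubles that number.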
Pick strategic entry points or transitions eligible for feature flags, and beware of their nesting.
Add feature flags to the list of things to think about during whiteboarding, and delete or retire each feature flag at the end of the development.
Again, there is no such thing as “too small for a feature flag”. Too many feature flags is a good problem to have, not the opposite. Automate the process of creating a feature flag to lower its cost.