High availability with feature flags

14 Dec 2022•7 min read

As a developer, you know that the reliability of your app is crucial to its success. But what about the reliability of the services that your app relies on? In particular, the reliability of a feature flags service is critical to the overall health and stability of your app since it lies in the critical path between your users and your features.

A feature flags service allows you to manage and control the rollout of new features in your app. By using a feature flags service, you can turn features on or off without the need for a full deployment. This means that you can test new features in a controlled environment and roll them out gradually to your users.

But what happens if your feature flags service goes down? What if someone accidentally removes a flag? What if someone rotates the API key without updating the correct environment variable? The list of potential hiccups goes on: if the service becomes unavailable, your app may become unstable or even crash. This is why it is so important to design your app with this in mind and embrace possible failures rather than hide behind a 99.9% SLA.

This is what the "Graceful degradation" design philosophy is about. Your app should continue to operate, albeit at a reduced level of performance, until the problem can be resolved. This can help prevent user frustration and ensure that the system remains usable even in the face of adversity.

Choosing the right service

Regardless of the feature flags solution you are using, it does not have control over certain factors that can affect your uptime. These factors can include:

Someone accidentally removes a flag
Someone rotates the API key without updating the correct environment variable
Someone messes up the CI configuration
The client has network issues

The most reliable provider will be the one that recognizes these potential issues and allows you to design your app with graceful degradation in mind. This way, even if some inevitable issue occurs, the app can still function properly.

The key to resilience

To achieve graceful degradation, it is important to handle potential errors in the same way you would handle a disabled feature. This approach is at the foundation of Tggl's design. By following this principle, using Tggl allows you to maintain a high level of reliability and user experience, even in the event of errors or failures.

The "off" state of a flag should yield the default behavior. This means that if all conditions are right, we use the value returned from Tggl, otherwise we fall back to a hardcoded value.

All of the following conditions have to be true to even receive a value from Tggl:

✅ Tggl server can be reached
✅ Flag exists
✅ Flag is active
✅ Flag's conditions are met

If we do not receive a value, it is impossible for us to determine the cause. It could be due to a network error, the absence of the flag in question, or the flag being inactive. This uncertainty is actually a positive aspect, as it requires us to handle all potential scenarios in the same way.

The second key to resilience is to hardcode the fallback value, basically answering the question: What should happen if anything goes wrong, or if the flag is inactive? Hardcoding the answer means two things:

In all scenarios where an unexpected error occurs, you do not impact the user, you simply fall back to a well-defined behavior. Since it is right there in the code even a network error cannot break your app.
Changing the default behavior requires deploying new code. For instance, when releasing a feature, at first you probably want the feature to be hidden if something goes wrong. But when you start to be more confident, you probably want to keep it visible if an unexpected error occurs.

With this strategy, you get the best of both worlds. You can still update the visibility of your features instantly from the Tggl admin, but you are immuned to unpredictable errors and your reliability does not depend on the reliability of an external service.

Releasing a new feature

Go to tggl.io and create a new flag using the Private beta template:

This template makes the flag active only for a few users that you can handpick, for everyone else the flag is off. In your code you can evaluate the flag like so:

if (client.get('new_login_page', false)) {
  // show new login page
}

If any issues arise, such as a network error or the accidental deletion of a flag, your app will automatically revert to the previous login page. This ensures that only a small number of users, typically those who are part of your beta testing team, will be affected. This can help to minimize disruptions and ensure that the majority of your users continue to have a smooth experience.

With just a few clicks, you can update the flag to be active for all users without the need for developer intervention. It is recommended to keep the flag active for a few days as a precautionary measure, in case any issues arise. This allows you to quickly and easily disable the feature if necessary. Once you are confident that everything is working as intended, you can remove the check entirely to maintain a clean and streamlined codebase.

Kill switches

Kill switches can be a useful tool in a variety of situations where you need to quickly and easily disable a feature in your application. Some examples of when you might use a kill switch include:

When you need to quickly fix a bug or issue that is affecting a particular feature.
When you want to temporarily disable a feature for maintenance purposes.
When you experience an unusual amount of traffic and want to propose a degraded version of your solution.

While it may be tempting to set up a flag that is always active and only deactivate it when necessary, this approach can be risky.

If a network error occurs, if there is a typo in the flag's name, or if the flag is accidentally deleted, your feature will be disabled. To avoid this, it is recommended to do the opposite: use a kill switch that can be activated as needed to kill parts of your app. This can help to ensure that your application remains stable and functional even in the event of unexpected issues.

Here the flag evaluation would need to be reversed:

if (client.get('my-kill-switch', true)) {
  // kill switch is not explicitly set to false => show the feature
}

Unlike release flags, kill switches are meant to stay in your code.

A/B testing

A/B testing is a method of comparing multiple versions of a product or feature simultaneously in order to determine which version performs the best. In contrast to the previous examples, A/B tests require the flag to return a value rather than just a boolean. In order to ensure that your application remains resilient to unexpected issues, you should include a default value in your code in case the flag result cannot be retrieved.

const variation = client.get('my-feature', 'Variation A')
 
if (variation === 'Variation A') {
  // => Default variation or explicit value
} else if (variation === 'Variation B') {
  // => Explicit value
}

Conclusion

It is important to remember that both technical and human errors are inevitable in the development and deployment of any application. As a developer, it is your responsibility to design and build an application that is resilient to these types of issues.

Tools like Tggl help you focus on what matters for your business while making sure you can provide the most reliable service to your customers at a lower cost.

High availability with feature flags

Choosing the right service

The key to resilience

Releasing a new feature

Kill switches

A/B testing

Conclusion

More articles

Implementing feature-flags in your agile development workflow

How QA teams use Tggl to test in production safely

How to remove old feature flags without breaking your code

The easiest way to start with feature flags