Case study: How Sellsy achieved zero-downtime migration with Tggl
Introduction
Sellsy is a leading French company offering a customer relationship management (CRM) tool used by over 48,000 daily active users across more than 18,000 companies. With a team of over 220 employees, Sellsy secured 55 million euros in Series B funding in early 2022.
Sellsy uses Tggl for a variety of use cases, such as progressive rollouts and scheduled releases. This case study explores one of the most critical: technical migrations. Specifically, it focuses on the successful migration of an Elasticsearch cluster from one cloud provider to another, a risky undertaking that Sellsy executed flawlessly by leveraging Tggl's feature flagging capabilities.
The challenge
Technical migrations, such as switching database providers, come with their own set of challenges:
- Ensuring zero downtime to maintain a seamless user experience
- Mitigating the high risk of data loss during the migration process
- Addressing the risk of the new provider failing under a sudden load increase
- Providing a swift rollback option in case of migration failure
These challenges are not easy to overcome and could have a significant impact on both Sellsy's operations and customer satisfaction.
Before Tggl: the old approach
Previously, Sellsy used a "one-shot" strategy for this type of migration due to the lack of a better alternative. This process involved updating configuration files in production, which required multiple deployments and took approximately 5-6 minutes each time.
If any errors occurred, rolling back to the previous version required initiating a new deployment, which again took 5-6 minutes. With every change and every rollback that slow, a seamless transition without data loss could not be guaranteed, and the result was temporary service interruptions, precisely the disruptions they wanted to avoid.
Using Tggl: a step-by-step strategy
Phase 1: double-writing
Using Tggl’s feature flagging capabilities, Sellsy adopted a phased approach to the migration. They started with a "double-write" strategy, where data was written to both the old and new clusters simultaneously to prepare for subsequent steps.
During this phase, they implemented a kill switch to halt writing to the new cluster if necessary, preventing any potential escalation of issues.
Fortunately, they didn't need to rely on this safety net. However, the added peace of mind provided by this extra layer of security had a significant positive impact on the team's confidence. They knew that with a simple button press, they could instantly prevent a disaster from occurring.
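To make the double-write pattern concrete, here is a minimal sketch rather than Sellsy's actual code. The `InMemoryCluster` class, the `is_flag_active` callback, and the flag name `double_write_new_cluster` are all hypothetical stand-ins: a real setup would use an Elasticsearch client and a flag lookup backed by Tggl.

```python
import logging
from typing import Callable, Dict

logger = logging.getLogger(__name__)


class InMemoryCluster:
    """Hypothetical stand-in for a search cluster client."""

    def __init__(self) -> None:
        self.docs: Dict[str, dict] = {}

    def index(self, doc_id: str, doc: dict) -> None:
        self.docs[doc_id] = doc


def make_double_writer(
    old: InMemoryCluster,
    new: InMemoryCluster,
    is_flag_active: Callable[[str], bool],
) -> Callable[[str, dict], None]:
    """Build a write function that always writes to the old cluster and
    mirrors each write to the new cluster while the flag is on."""

    def write(doc_id: str, doc: dict) -> None:
        # The old cluster remains the source of truth during this phase.
        old.index(doc_id, doc)

        # Hypothetical flag name. Switching the flag off is the kill switch:
        # writes to the new cluster stop without any deployment.
        if is_flag_active("double_write_new_cluster"):
            try:
                new.index(doc_id, doc)
            except Exception:
                # A failure on the new cluster must never break user writes.
                logger.exception("mirrored write failed for %s", doc_id)

    return write
```

The flag is checked on every write rather than once at startup, which is what lets a dashboard change take effect immediately.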
Phase 2: gradual traffic routing
Once double-writing was established and functioning as expected, Sellsy began directing read traffic to the new cluster using a second feature flag. This flag determined whether a read request would be routed to the old cluster or the new one. Because the routing rule lived in Tggl rather than in the code, the logic could be changed on the fly to adapt to the current situation, without hardcoding anything.
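A read path driven by such a flag could look roughly like the sketch below. All names here are illustrative assumptions: `search_old` and `search_new` stand for the two cluster clients, and `use_new_cluster` stands for whatever function evaluates the flag for a given request context.

```python
from typing import Callable, Dict, List

Context = Dict[str, str]
SearchFn = Callable[[str], List[str]]


def make_read_router(
    search_old: SearchFn,
    search_new: SearchFn,
    use_new_cluster: Callable[[Context], bool],
) -> Callable[[Context, str], List[str]]:
    """Build a search function that picks a cluster per request."""

    def search(context: Context, query: str) -> List[str]:
        # The decision is made for every request, so a change to the
        # flag's rules applies to the very next read.
        if use_new_cluster(context):
            return search_new(query)
        return search_old(query)

    return search
```

Keeping the decision behind a single function means the calling code never needs to know which cluster served a request, and removing the flag later only touches this one place.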
The transition to the new cluster occurred in incremental steps:
- Initial phase: testing on specific accounts. By selecting a few client IDs in Tggl, the flag redirected only a few requests to the new cluster. This allowed Sellsy to validate that the read requests executed as expected, without being overwhelmed with logs during debugging.
- Next: gradually increasing traffic load. Using Tggl's random traffic sampling, Sellsy started redirecting 1-2% of clients to the new cluster, while carefully monitoring system performance. It is crucial to slowly increase traffic for proper auto-scaling, especially on new systems. This prevents existing instances from crashing before new instances have time to boot.
- Final phase: reaching 100% of traffic. Throughout the day, engineers increased the share of traffic routed to the new cluster directly from the Tggl dashboard, without making any code changes or deployments. This meant that any changes were instantly reflected on the servers.
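The three steps above map onto a simple targeting rule: an allow-list of client IDs plus a percentage. The sketch below shows one common way to implement such a rule with a stable hash. It is an assumption for illustration, not a description of how Tggl computes its sampling, and `RolloutRule`, `bucket`, and `routes_to_new_cluster` are hypothetical names.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Set


@dataclass
class RolloutRule:
    """Hypothetical rule: explicit client IDs plus a traffic percentage."""

    allowed_client_ids: Set[str] = field(default_factory=set)
    percentage: float = 0.0  # share of clients routed to the new cluster, 0 to 100


def bucket(client_id: str) -> float:
    """Map a client ID to a stable value between 0.00 and 99.99."""
    digest = hashlib.sha256(client_id.encode("utf-8")).digest()
    return (int.from_bytes(digest[:8], "big") % 10_000) / 100


def routes_to_new_cluster(rule: RolloutRule, client_id: str) -> bool:
    # Step 1: hand-picked accounts always go to the new cluster.
    if client_id in rule.allowed_client_ids:
        return True
    # Steps 2 and 3: a client is included once the percentage passes its
    # bucket, so raising the percentage only ever adds clients.
    return bucket(client_id) < rule.percentage
```

Hashing the client ID instead of drawing a random number per request keeps each client on the same cluster between requests, and a client included at 2% is still included at 50% or 100%.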
Phase 3: monitoring & cleanup
After successfully redirecting 100% of the traffic to the new cluster, they continued to closely monitor the system for a few more days. Finally, they retired the feature flags to conclude the migration. They first removed the flag checks from the code and deployed that change to production. Once the deployed code no longer referenced the flags, they could safely delete them from Tggl.
Results and impact
By using Tggl to execute a technical migration in phases, Sellsy significantly improved their existing process:
- Zero downtime: The transition was seamless, and the 1-hour maintenance window previously required was no longer needed.
- Immediate rollbacks: If any issues had arisen, Sellsy could have instantly rolled back the changes, instead of experiencing a 5-6 minute delay as before.
- Increased confidence: The staged process gave the team confidence in the migration, reducing the stress typically associated with high-stakes operations.
Conclusion
The adoption of Tggl has drastically improved how Sellsy handles complex technical migrations. By leveraging Tggl's feature flagging capabilities, Sellsy has not only mitigated risks but also made the entire process more efficient and transparent for users. This aligns perfectly with their objective of providing uninterrupted, high-quality service to their clients.