The Hard Truth about GitOps and Database Rollbacks
Prepared for KubeCon North America 2024
Introduction
For two decades now, the common practice for handling rollbacks of database schema migrations has been pre-planned "down migration scripts". A closer examination of this widely accepted truth reveals critical gaps that result in teams relying on risky, manual operations to roll back schema migrations in times of crisis.
In this post, we show why our existing tools and practices cannot deliver on the GitOps promise of "declarative" and "continuously reconciled" workflows and how we can use the Operator Pattern to build a new solution for robust and safe schema rollbacks.
The Undo Button
One of the most liberating aspects of working on digital products is the ability to roll back changes. The Undo Button, I would argue, is one of the most transformative features of modern computing.
Correcting mistakes on typewriters was arduous. You would roll the carriage back and type over any errors, leaving messy, visible corrections. For bigger changes, entire pages had to be retyped. Correction fluid like whiteout offered a temporary fix, but it was slow and imprecise, requiring careful application and drying time.
Digital tools changed everything. The Undo Button turned corrections into a simple keystroke, freeing creators to experiment without fear of permanent mistakes. This shift replaced the stress of perfection with the freedom to try, fail, and refine ideas.
Rollbacks and Software Delivery
When it comes to software delivery, having an Undo Button is essential as well. The ability to roll back changes to a previous state is a critical safety net for teams deploying new features, updates, or bug fixes. Specifically, rollbacks impact one of the key metrics of software delivery: Mean Time to Recovery (MTTR).
MTTR is a measure of how quickly a system can recover from failures. When a deployment fails, or a bug is discovered in production, teams generally have two options: triage and fix the issue (roll forward), or roll back to a previous known stable state.
When the fix to an issue is not immediately clear, or the issue is severe, rolling back is often the fastest way to restore service. This is why having a reliable rollback mechanism is crucial for reducing MTTR and ensuring high availability of services.
How are rollbacks even possible?
Undoing a change in a local environment like a word processor is straightforward. There are multiple ways to implement an Undo Button, but they all rely on the same basic principle: the system keeps track of every change made and can revert to a previous state.
In a distributed system like modern, cloud-native applications, things are not so simple. Changes are made across multiple components with complex dependencies and configurations.
The key capability that enables rolling back changes is described in the seminal book "Accelerate: The Science of Lean Software and DevOps". The authors identify "Comprehensive Configuration Management" as one of the technical practices that enable high performance in software delivery:
"It should be possible to provision our environments and build, test, and deploy our software in a fully automated fashion purely from information stored in version control.” 1
In theory, this means that if we can store everything there is to know about our system in version control, and have an automated way to apply these changes, we should be able to roll back to a previous state by simply applying a previous commit.
GitOps and Rollbacks
The principle of "Comprehensive Configuration Management" evolved over the years into ideas like "Infrastructure as Code" and "GitOps". These practices advocate for storing all configuration and infrastructure definitions in version control in a declarative format, and using automation to apply these changes to the system.
Projects like ArgoCD and Flux have popularized the GitOps approach to managing Kubernetes clusters. By providing a structured way to define the desired state of your system in Git (e.g., as Kubernetes manifests) and automatically reconciling the actual state with it, GitOps tools give us a standardized way to satisfy this principle.
On paper, GitOps has finally brought us a working solution for rollbacks. Revert the commit that introduced the issue, and all of your problems are gone!
Problem solved. End of talk. Right?
The Hard Truth
Teams that have tried to fully commit to the GitOps philosophy usually find that the promise of "declarative" and "continuously reconciled" workflows is not as straightforward as it seems. Let's consider why.
Declarative resource management works exceptionally well for stateless resources like containers. The way Kubernetes handles deployments, services, and other resources is a perfect fit for GitOps. Consider how a typical deployment rollout works:
- A new replica set is created with the new version of the application.
- Health checks are performed to ensure the new version is healthy.
- Traffic is gradually shifted to healthy instances of the new version.
- As the new version proves stable, the old version is scaled down and eventually removed.
But will this work for stateful resources like databases? Suppose we want to change the schema of a database. Could we apply the same principles to roll back a schema migration? Here's what it would look like:
- A new database is spun up with the new schema.
- Health checks are performed to ensure the new schema is healthy.
- Traffic is gradually shifted to the new database instance.
- The old database is removed.
This would get the job done... but you would probably find yourself out of a job soon after.
Stateless resources are great to manage because we can always throw out whatever isn't working for us and start fresh. But databases are different. They are stateful, and they consist not only of a software component (the database engine) and its configuration (server parameters and schema), but also of the data itself. That data, by definition, cannot be provisioned from version control.
Stateful resources like databases require a different approach to manage changes.
Up and Down Migrations
The common practice for managing schema changes in databases is to use "up" and "down" migration scripts in tandem with a migration tool (like Flyway or Liquibase). The idea is simple: when you want to make a change to the schema, you write a script that describes how to apply the change (the "up" migration). Additionally, you write a script that describes how to undo the change (the "down" migration).
For example, suppose you wanted to add a column named "short_bio" to a table named "users". Your up migration script might look like this:
ALTER TABLE users
ADD COLUMN short_bio TEXT;
And your down migration script might look like this:
ALTER TABLE users DROP COLUMN short_bio;
In theory, this concept is sound and satisfies the requirements of "Comprehensive Configuration Management". All information needed to apply and roll back the change is stored in version control.
Theory, once again, is quite different from practice.
The Myth of Down Migrations
After interviewing hundreds of engineers on this topic, we found that despite being a widely accepted concept, down migrations are rarely used in practice. Why?
Naive assumptions
When you write a down migration, you are essentially writing a script that will be executed in the future to revert the changes you are about to make. By definition, this script is written before the "up" changes are applied. This means that the down migration is based on the assumption that the changes will be applied correctly.
But what if they are not?
Suppose the "up" migration was supposed to add two columns, the down file would be written to remove these two columns. But what if the migration was partially applied and only one column was added? Running the down file would fail, and we would be stuck in an unknown state.
Yes, some databases like PostgreSQL support transactional DDL, which means that if the migration fails, the changes are rolled back and you end up with a state that is consistent with a specific revision. But even in PostgreSQL, some operations cannot run inside a transaction, and the database can still end up in an inconsistent state.
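As an illustration, here is a sketch of what transactional DDL does and does not cover in PostgreSQL, reusing the hypothetical names from above:
-- Transactional DDL: if any statement fails, PostgreSQL rolls back the
-- whole block and the schema stays at the previous revision.
BEGIN;
ALTER TABLE users ADD COLUMN email_verified BOOLEAN;
ALTER TABLE users ADD COLUMN last_login_at TIMESTAMP;
COMMIT;
-- Not transactional: CREATE INDEX CONCURRENTLY cannot run inside a
-- transaction block, and if it fails it leaves behind an INVALID index
-- that must be cleaned up by hand.
CREATE INDEX CONCURRENTLY idx_users_last_login ON users (last_login_at);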
For MySQL, which does not support transactional DDL, the situation is even worse. If a migration fails halfway through, you are left with a partially applied migration and no way to roll it back.
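A sketch of the same two-column migration in MySQL (again with hypothetical names) shows why:
-- In MySQL, each DDL statement causes an implicit commit, so there is no
-- enclosing transaction to roll back if a later statement fails.
ALTER TABLE users ADD COLUMN email_verified BOOLEAN;  -- committed immediately
ALTER TABLE users ADD COLUMN last_login_at TIMESTAMP; -- suppose this one fails
-- At this point email_verified already exists and will remain, even though
-- the migration as a whole never completed and the down file assumes it did.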