Scaling challenges are inherently fun. Properly motivated, they mean so many people find value in what you're building that you're compelled to address them. They knock elbows with some of the most technically interesting topics in computer engineering. Maybe above all else, they tend to be ruthlessly measurable, so your success — when it comes — is obvious.
For the first four years of our business and platform (in essence, a set of APIs and clients used by the companies we partner with), we at Button resisted the siren song of an elaborate architecture that could scale to billions of customers with zero downtime. We had other, more urgent priorities at the time, like getting our first thousands of customers by shipping and iterating on our product. Yet somewhere in the salad days of 2019 it became obvious we had outgrown aspects of our initial design and, for reasons we'll discuss, concluded the moment was right to re-author our topology and go "multi-region".
Now, with two years of operational experience running the result of that effort, we feel qualified to share the juiciest nuggets of that experience. In this post we'll try to cover the most salient and transferable aspects, namely:
This will often be through the lens of our journey, with an earnest attempt to generalize where possible. Let's dive in!
Before we address this question directly, let's take a quick detour to an even more fundamental question: what do we even mean when we say "multi-region"?
Consider the following relatively simple hosted service, running out of a single data center in Oregon:
Requests from the internet arrive at a public-facing load balancer in a data center. The load balancer forwards requests to an application (there may be many instances running to choose from), which in turn consults a database and another internal service to respond.
Even with such a simple setup, we're already in a great position to motivate the idea behind going multi-region.
First, consider the resiliency of this design. Put your thumb over the database and imagine if it were down. If every request depends on access to that database, then we've identified a single point of failure: it goes down and our business goes down. The same goes for the internal service and many things in between: networking devices, physical machines, power supplies, etc. Going multi-region isn't the only way to address these problems, but as we'll see it's a very effective way to defend against an entire class of them in one fell swoop.
Second, consider the latency of this design. Our customers on the west coast of the United States are probably perfectly happy with how long it takes to talk to our servers in Oregon (likely in the low tens of milliseconds per request). Folks in South Africa probably feel differently, however, since the physical distance bits have to travel is so much larger. They likely experience hundreds of milliseconds of latency per request, which, multiplied a few times to account for establishing a secure connection and actually transferring data, yields noticeably peeved humans. Unlike resiliency, latency seems to be uniquely solvable by going multi-region. There's just no wiggling around the speed of light.
Given these deficiencies, the concept of a multi-region design shouldn't be all that surprising: host your software in many data centers!
Here we've copied everything from our Oregon data center and fork-lifted it into another running in South Africa. With careful design of our application, we can be completely indifferent as to which data center serves any given request. From this property, we can develop a policy that determines which data center a given request is routed to. Something like:
Route a request to the nearest data center that is fully operational.
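To make the policy concrete, here's a minimal sketch of the selection logic in Python, assuming we already know each region's health and a latency estimate from the client (the data structure and function are illustrative; in practice, as we'll see, a managed DNS service makes this decision for us):

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Region:
    name: str
    healthy: bool      # aggregate result of health checks against the region
    latency_ms: float  # estimated latency from this client to the region

def pick_region(regions: List[Region]) -> Optional[Region]:
    """Route to the nearest (lowest-latency) region that is fully operational."""
    operational = [r for r in regions if r.healthy]
    if not operational:
        return None  # nothing healthy anywhere: time to page someone
    return min(operational, key=lambda r: r.latency_ms)

# Example: Oregon is down, so the client still gets served from South Africa.
regions = [
    Region("oregon", healthy=False, latency_ms=20.0),
    Region("south-africa", healthy=True, latency_ms=180.0),
]
print(pick_region(regions).name)  # -> "south-africa"
```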
Our resiliency characteristics are dramatically different under this design. Not only can we place a thumb over the database in Oregon and imagine it offline, we can place our hand over the entire region (maybe intrepid Oregon state decides to disconnect from the continental US and float off into the Pacific) and still see a way for our service to be operational. Per our routing rule, as long as we can detect malfunctions in Oregon, traffic should be routed away from the failing region and to the nearest operational one.
This is profound. Out of the dizzying possibilities of things that could fail catastrophically in a region, all have the exact same fix: detect the failure and route folks away while we correct it. And while the chance that a specific component fails may be low, the chance that any component fails may be quite high.
The careful reader may note that we could achieve a similar resiliency story by spinning up redundant copies of a system in a single region. While true in a sense, it's also the case that catastrophic failures tend to be strongly correlated by geography. Data centers, it turns out, aren't immune to fires, and cloud providers like AWS opt to communicate the health of their services along one facet: region.
With respect to latency, we've solved the problem physically. Light (and thus, a byte) just doesn't have to travel as far. Depending on where our customers are, we can identify where new data centers will yield the most upside. Notably, the benefits of reducing latency are often non-linear. Button, for example, partners with large, established companies who may only sign a deal subject to an agreement on acceptable latency.
At this point maybe you're convinced that running an additional region (or N additional regions) has some understandable upside. While probably not sufficient to motivate the effort in themselves, there are a handful of other benefits to be reaped as well:
We've notably trivialized some major challenges in our description, namely:
If you're convinced of the upside and can defer your skepticism on these points, read on! All will be answered.
If the last section reads mostly as why you should go multi-region, this one will read mostly as its rebuttal. After all, answering when is in many ways the same as answering should.
Regrettably, when will predominantly depend on requirements completely unique to your team and your business. To make progress we'll have to dispense with any commitment to precision and resort to generalities.
On one hand, there's the time-value of refactoring. Your system will only get more complex, and it'll never be cheaper to overhaul than now. Designing features for multiple regions from the start will always be easier than designing them for a single region and retrofitting them later.
On the other, it's an enormous undertaking that could consume your best engineers for months. It will pull from your complexity budget. Designing features for multiple regions will always be harder than designing them for a single region.
So we're looking for a sweet spot. Can you afford to ship less product? Are you confident enough in the shape of your business that you're willing to invest in scaling it? How consequential would a prolonged outage be on your business' probability of success? If you're looking for a rule of thumb, this is about as good as we can offer:
You should go multi-region at the exact instant it would be irresponsible to your customers not to, and not a moment earlier.
A number of signals helped us determine we had crossed the threshold, which we present in case they have some power as patterns to match against:
Finally, going multi-region is not an all-or-nothing proposition. The answer may be a hard "no" — or at best unclear — taking in your technical surface area as a whole, but an enthusiastic "yes!" for a strict subset of it. If you can identify a subset to take multi-region, doing so reduces many of the costs suggested above while also increasing your likelihood of pulling it off.
For our part, we aggressively scoped the project to just the essential components of our stack that would enjoy the most upside of the new architecture. The split breaks along who the client of our API is:
One large portion of traffic is from partners relaying various bits of data to us. On the other side of a connection is another computer, meaning if we have a lasting outage, the computer can replay all of the updates we have missed (need it be said, we still take great pains for this never to be the case). It also turns out computers rarely care all that much about the end-to-end latency of a request.
Conversely, the other large bucket of traffic has a human on the other end: people beginning a shopping journey and requiring various resources from our API, like customized links. Here there is little hope folks will patiently wait out our downtime and try again when we recover. Outages translate directly to lost revenue and lost goodwill. This is all the more acute when our downtime makes our partners look bad in front of their own customers.
Very confidently, we took only the latter cohort of traffic multi-region.
Those disappointed by our hand waving in the last section are in for another bout. Of course there is no universal playbook detailing how you actually pull this off. But there are lots of patterns, and patterns are the way to enlightenment.
We'll walk through some of the most consequential moves we played, which roughly fell into two tiers:
Infrastructure Move 1: Provisioning and Management
Ultimately, our ability to create fungible data centers in regions around the world without prohibitive investment is attributable to a marvel of the modern world: cloud hosting. There's plenty to say about its costs, quirks, and limitations, but being an API call away from provisioning a computer in Ohio as easily as one in Singapore is approaching a miracle.
In standing up new infrastructure, we strove to satisfy two goals:
Our cloud hosting provider offers programmatic access to nearly every capability they offer (these days, a table stakes feature). This enables higher-level tooling on top of their APIs, the most impactful of which for us has been Terraform.
Terraform exposes a declarative API well suited to our goals. You specify in code what resources you want (e.g., a computer, a networking subnet, a queue) and the relationships between those resources (e.g., a service referencing the load balancer that fronts it), then instruct Terraform to apply what you've written. Terraform will observe what already exists and then create, update, or delete whatever it needs to so that what's provisioned matches your code.
We organized big chunks of resources under "modules" (a parameterized grouping of resources) and composed them all together in a single definition for everything running in a region.
Spinning up a new region amounts to creating a new instance of our module, passing the region and a CIDR range as arguments, and applying.
Auditing for consistency across all our regions amounts to asking Terraform if any diff exists between our code and the live infrastructure.
Changing something across all regions amounts to submitting a code change for review, then applying the difference once it's been accepted and merged.
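As a loose illustration of the audit step (not our actual tooling), a thin wrapper over the Terraform CLI might look like the sketch below. The directory layout is a hypothetical one-root-configuration-per-region arrangement; the useful bit is that `terraform plan -detailed-exitcode` exits with code 2 whenever a diff exists between the code and live infrastructure.

```python
import subprocess
from pathlib import Path

# Hypothetical layout: one thin root configuration per region, each instantiating
# the same shared "region" module with its own region and CIDR arguments.
REGION_DIRS = sorted(Path("infrastructure/regions").iterdir())

def drifted_regions() -> list:
    """Return the regions whose live infrastructure differs from the code.

    `terraform plan -detailed-exitcode` exits 0 when nothing would change and
    2 when a diff exists between the code and what is provisioned.
    """
    drifted = []
    for region_dir in REGION_DIRS:
        result = subprocess.run(
            ["terraform", f"-chdir={region_dir}", "plan", "-detailed-exitcode"]
        )
        if result.returncode == 2:
            drifted.append(region_dir.name)
    return drifted
```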
Infrastructure Move 2: Implementing a Routing Policy
We use DNS to load balance between regions. Clients looking to send us a request make a DNS query to translate our hostname to an IP address, and then initiate a connection with that address.
The DNS resolver is capable of knowing some interesting facts, which lets it make an intelligent decision about which IP to return:
From these, the resolver can implement our desired policy:
Importantly, we're able to configure exactly what checks are required to pass for the resolver to consider a region operational. If something awful happens in the middle of the night, the DNS system is sophisticated enough to have already drained traffic from the troublesome region while our on-call team is still drowsily trying to open their laptops.
That last fact, however, probably looks like an odd duck: why would you care whether a DNS resolver can flip a coin? Well, it lets you create a "weighted" routing policy for gradually rolling out a change. For example, if we wanted to slowly introduce traffic to a new region, we would create two parallel sets of records, one with the new region and one without. We'd then start trickling out 0.1%, 1%, 5% of traffic to the new set, on up until we got to 100%.
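Here's a toy model of what that weighted coin flip buys you (the record sets, weights, and region names are illustrative, and in reality the DNS provider performs this selection, not our own code):

```python
import random

# Two parallel record sets: one that includes the new region, one that doesn't.
OLD_SET = ["oregon"]
NEW_SET = ["oregon", "south-africa"]  # includes the region being introduced

def resolve(new_set_weight: float) -> str:
    """Simulate the resolver: a weighted choice between the two record sets,
    then a region from whichever set won (latency- and health-based in reality;
    uniformly random here for simplicity)."""
    record_set = random.choices(
        [NEW_SET, OLD_SET], weights=[new_set_weight, 1.0 - new_set_weight]
    )[0]
    return random.choice(record_set)

# Trickle traffic toward the new set: 0.1%, 1%, 5%, ... on up to 100%.
for weight in (0.001, 0.01, 0.05, 1.0):
    sample = [resolve(weight) for _ in range(100_000)]
    print(weight, sample.count("south-africa") / len(sample))
```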
This move ought to stress the importance of hostname allocation — something that is best gotten right on day one of a new company. We depended on different usage patterns of our API being discernible by hostname, such that we could route the things that needed multi-region support differently than other traffic. Once you publish a hostname to clients it can be hard or impossible to change, so even if you plan to build for years out of a single region, it makes a lot of sense to create different hostnames for different workloads and route them all to the same place until the time for multi-region comes.
Infrastructure Move 3: Peering Connections
As noted earlier, we deployed a strict subset of our application to multiple regions. That subset still has dependencies back to other private resources in our original "home" region (everything that we didn't plan to move with the project) though.
To resolve this, we reached for a technique generally known as VPC peering, which builds a private tunnel between regions. Once activated, services in one region can communicate over private IP addresses with services in our home region, so internal traffic never hits the public internet.
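As a rough sketch of what establishing such a connection involves (assuming AWS, with placeholder VPC IDs, regions, route table, and CIDR range), the sequence is request, accept, then route:

```python
import boto3

HOME_REGION, NEW_REGION = "us-west-2", "af-south-1"  # placeholders
HOME_VPC, NEW_VPC = "vpc-aaaa1111", "vpc-bbbb2222"   # placeholders
HOME_CIDR = "10.0.0.0/16"                            # placeholder

new_ec2 = boto3.client("ec2", region_name=NEW_REGION)
home_ec2 = boto3.client("ec2", region_name=HOME_REGION)

# 1. Request a peering connection from the new region back to the home region.
peering = new_ec2.create_vpc_peering_connection(
    VpcId=NEW_VPC, PeerVpcId=HOME_VPC, PeerRegion=HOME_REGION
)["VpcPeeringConnection"]

# 2. Accept it on the home-region side (in practice, wait for the request to
#    propagate across regions before accepting).
home_ec2.accept_vpc_peering_connection(
    VpcPeeringConnectionId=peering["VpcPeeringConnectionId"]
)

# 3. Route the home region's private CIDR over the peering connection
#    (one route per route table; a single placeholder table shown here).
new_ec2.create_route(
    RouteTableId="rtb-cccc3333",
    DestinationCidrBlock=HOME_CIDR,
    VpcPeeringConnectionId=peering["VpcPeeringConnectionId"],
)
```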
While not optimal from an isolation perspective, the ability to selectively reach back into our home region for resources we weren't ready to migrate was critical to keeping the project's scope manageable. In the years since, we've one-by-one removed many of the reasons for the peering connection by deploying alternatives. A handful of requirements will likely always necessitate it, though.
Application Move 1: Topology Rules to Limit Fate Sharing
With all the new serving regions, your graph of connections and dependencies can quickly grow unwieldy. We arrived at a few hard and fast rules that cut down the chaos and make it easy to reason about the consequences of losing a region.
With these properties, it would take all of our regional data centers going offline to cause a service disruption for our customers.
Application Move 2: Monolith Mitosis
A favorite of all those who grew their company around a monolith: what do you do if half of a really big service needs to go multi-region and the other doesn't?
Enter monolith mitosis: a time-honored technique for dividing and conquering. We used this technique to factor out a highly specialized service more appropriate for multi-region deployment.
Executed carefully, each step of this process is stress free. Note that if a service has a dependency like a database, it should be clear who will inherit ownership of it. At the end of the process only one service should have connectivity to it.
Application Move 3: State Management
Nearly all the work we took on at the application tier of our rewrite reduced to finding alternative ways to handle various bits of state. In plenty of ways that's not surprising — all state answers to where, and when you spread your compute across multiple regions you're compelled to find alternative wheres.
Here's the short list of our favorite techniques:
3a: In-Memory Caches
If you require read access to low-cardinality, low-write data (think simple configuration values), an in-memory cache may be the way to go.
We had just this requirement in many places in our application. In the days of yore, we would simply ask a config service for whatever values we needed as a hard dependency of serving each request.
To remove the dependency, we instead require each service to maintain an in-memory cache of the values it needs, and consult only that cache when serving customers. Out of band, the service then polls for updates on some interval. As a further optimization, we used a highly redundant document storage layer (S3) to cheaply escrow the data. We were quite willing to introduce a bit of replication lag for the resiliency and latency improvements this model affords.
Since we require access to this data to coherently serve requests, we made our health checks sensitive to at least one complete warming of this cache. This health check bubbles all the way up to our DNS resolver when it determines which regions are operational.
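A minimal sketch of the pattern, assuming the escrowed document is a JSON blob in S3 (the bucket name, key, and poll interval below are illustrative):

```python
import json
import threading
import time

import boto3

class ConfigCache:
    """Serves config values from memory; refreshes from S3 out of band."""

    def __init__(self, bucket: str, key: str, poll_seconds: int = 60):
        self._s3 = boto3.client("s3")
        self._bucket, self._key = bucket, key
        self._poll_seconds = poll_seconds
        self._values = {}
        self._warmed = threading.Event()
        threading.Thread(target=self._poll_forever, daemon=True).start()

    def _poll_forever(self):
        while True:
            try:
                body = self._s3.get_object(Bucket=self._bucket, Key=self._key)["Body"].read()
                self._values = json.loads(body)
                self._warmed.set()  # at least one complete warming has happened
            except Exception:
                pass  # keep serving the last good copy; alert out of band
            time.sleep(self._poll_seconds)

    def get(self, name, default=None):
        # The request path only ever touches memory, never the network.
        return self._values.get(name, default)

    def is_warm(self) -> bool:
        # Wired into the service's health check, which bubbles up to DNS.
        return self._warmed.is_set()

# cache = ConfigCache("example-config-escrow", "config.json")  # hypothetical bucket/key
```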
If the data is high-cardinality but still low-write, the same pattern can be applied but with an external cache instead of an in-memory one.
3b: Direct Encoding
If you require a database lookup of a small, bounded number of fields to process a request, encoding those fields directly in an ID supplied by a client may be the way to go.
The canonical example for this is a "session ID", much like you'd find in a session cookie supplied by your web browser while on a site you've logged into.
We use a session ID as a part of our authentication scheme, and originally wrote to a single database table when authenticating a new client and read from the same table when verifying an existing client. This gave us the basic properties we wanted (ability to calculate business metrics like DAUs, know which partner had initiated the session, block malicious clients, etc.) at the cost of a very precarious single point of failure.
The alternative was to drop the database entirely and encode what data we needed right in the value of the session ID:
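As a hedged sketch of the idea (the payload fields, JSON-plus-base64 format, and HMAC signature below are illustrative assumptions rather than our actual scheme; the before/after boxes show the real shape of the change):

```python
import base64
import hashlib
import hmac
import json

SECRET = b"rotate-me"  # signing key known only to our services; placeholder value

def encode_session(fields: dict) -> str:
    """Pack the fields we'd otherwise look up in a database into the ID itself,
    signed so clients can't forge or tamper with them."""
    payload = base64.urlsafe_b64encode(json.dumps(fields).encode())
    signature = hmac.new(SECRET, payload, hashlib.sha256).digest()
    return (payload + b"." + base64.urlsafe_b64encode(signature)).decode()

def decode_session(session_id: str) -> dict:
    payload, signature = session_id.encode().split(b".")
    expected = hmac.new(SECRET, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(base64.urlsafe_b64decode(signature), expected):
        raise ValueError("tampered or forged session ID")
    return json.loads(base64.urlsafe_b64decode(payload))

# No database round trip required, in any region:
sid = encode_session({"partner": "acme", "created_at": 1577836800})
print(decode_session(sid))  # -> {'partner': 'acme', 'created_at': 1577836800}
```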
In the before box, you can imagine "s-1234" being the primary key for some row in our database, and in the after box the lengthier ID being some base-x encoded equivalent of the entire row. Since we'd otherwise lose the ability to discern which session IDs were guaranteed to