Every engineering team is a little different. We're all trying to solve problems with technology, but the way we build our systems, the tools we choose, and the practices we follow all vary.
As an engineer who came from a big company where the culture and principles were a bit more set in stone, one of the most exciting parts of my time at Button has been watching the team evolve with the company as we grow and experiment. What I've seen is how the team's values shape the choices we make and, in turn, the technologies we choose.
In this post, I will attempt to outline a few of the broader principles I believe we live by as an engineering team and the choices that have resulted from them. I don't claim that anything we do is the "One True Way", the right way, or that this is all-inclusive. All I claim is that this is the perspective of one engineer on how we do things.
The way you write, run, and build software is a defining part of an engineering team. So why not constantly evaluate how your team does those things? We have made experimentation and improvement a core part of how we build software.
For example, for our Node and Python projects, every service has its own repo. That's how we built software from the beginning, and it served us well for a long time. As we grew, however, we started to see issues with sharing code across repos. Sharing was tedious enough that we never built out as robust a standard library as we would have liked.
When we added Go as a new language at Button, we wanted to experiment with ways to avoid those problems. What we tried was having our Go code live in a monorepo called btngo. Each service lives in a directory in btngo/apps. We have a set of shared libraries for things like SQS clients, logging, caching, HTTP code, and anything else we need.
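To make that concrete, the repo layout looks roughly like the sketch below. The apps directory is real; the shared-library directory name is just shorthand for illustration.

```
btngo/
  apps/
    service-a/      # each service lives in its own directory under btngo/apps
      main.go
    service-b/
      ...
  lib/              # illustrative name: shared SQS, logging, caching, and HTTP packages
    ...
```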
The monorepo removed a whole class of problems we previously had. It has been so successful that we are considering migrating our Node code over to a monorepo.
It should be easy for a new developer (or a new EC2 instance) to run every service your company has.
If every service needs its own specific magic incantations to run and deploy, there's a tax to pay every time you hire a new engineer or try to change your pipelines. It gets even worse when there are services in multiple languages (as we have); then there's a different set of tricks to learn for each one.
Our solution to that has been to put every service into a Docker container, no exceptions. Docker isn't a fix-all, though. When I started at Button, running a service still meant bringing up a database with the right parameters for that particular service, and Node and Python services each had their own way of being run.
To fix this, we built an internal tool called Pint to manage building, running, and testing services. Using Docker and Pint lets us build our tooling against the interface of a Docker container instead of the specifics of each service.
Behind the scenes, Pint generates the commands to run the service, bring up databases, bring up Redis instances, migrate the database, or run tests. All the user has to know is pint test, pint setup, and pint run. Those commands are the same regardless of what language the service is written in; we let the tool take care of the language specifics.
This makes setting up a new engineer as easy as git clone and pint setup for any repo in our codebase.
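Pint itself is internal and I won't reproduce it here, but the shape of the idea is easy to sketch: a thin wrapper that maps a few uniform commands onto Docker invocations so nothing language-specific leaks out to the user. The toy Go version below only illustrates that pattern; the compose file and the migrate/test commands run inside the container are assumptions, not Button's actual tool.

```go
// Toy sketch of a Pint-like wrapper: NOT Button's actual tool. It maps the
// uniform commands (setup, run, test) onto docker-compose invocations so the
// caller never needs language-specific knowledge.
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// run executes a command, streaming its output to the terminal.
func run(args ...string) error {
	cmd := exec.Command(args[0], args[1:]...)
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	return cmd.Run()
}

func exitOn(err error) {
	if err != nil {
		os.Exit(1)
	}
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: pint [setup|run|test]")
		os.Exit(1)
	}
	// Assumption for this sketch: each service ships a docker-compose.yml
	// describing its databases, Redis instances, and the service container.
	switch os.Args[1] {
	case "setup":
		// Build images, then run migrations in a one-off container.
		err := run("docker-compose", "build")
		if err == nil {
			err = run("docker-compose", "run", "--rm", "app", "migrate")
		}
		exitOn(err)
	case "run":
		exitOn(run("docker-compose", "up"))
	case "test":
		exitOn(run("docker-compose", "run", "--rm", "app", "test"))
	default:
		fmt.Printf("unknown command %q\n", os.Args[1])
		os.Exit(1)
	}
}
```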
In our AWS infrastructure, we use Amazon ECS to orchestrate our containers. We have Go services, Python services, and Node services all running together on the same machines with no extra effort by us. Adding Go services into our infrastructure a year ago was as easy as writing a Dockerfile for them. We even use Pint to run our CI suite and build/push Docker containers.
The combination of ECS and Docker has been great for our team, because above all else we value simplicity. While ECS is not very extensible, it is simple to use and frees up our time to think about problems other than making the software run in production.
When you finish a feature and get your PR merged, what do you want to do next? Do you want a blazing-fast deploy to prod so you can move on with your life, or do you want to hop into the hour-long build/deploy queue? I can tell you what we prefer.
Quick and easy deploys are something our engineering team holds dear. We strive for a system where deploying code doesn't break up your day, and where there's no need to worry about running things in the right order or passing the right configuration flags to some finicky tool.
We currently use Hubot, GitHub's friendly chat bot, as a deploy frontend. Our bot "Botton" sits in a Slack channel waiting for deploy commands and forwards deploy requests from Slack to our deploy server, where our custom scripts run.
We don't trade off quality for speed though. We always require our full suite of tests to pass before any deploy happens.
One of my coolest experiences as an engineer at Button was making a deploy in my first week. It was so easy and fast that I could barely believe it.
That setup has kept on improving since then. Over the years we have gone from engineers deploying from their laptops with Ansible playbooks, to using Hubot and a Heaven server, to writing our own deploy service to sit in the middle. The constant has always been making the tools developers use every day work well.
A core value of the Engineering Team at Button is to have high visibility into all of our software. One of our favorite sayings is "mystery free prod" (there's even a corresponding emoji in Slack). Getting to "mystery free prod" takes several layers of monitoring, and we have grown those layers over time to meet our needs as the company has grown.
We have grouped our monitoring tools into three classes: exceptional behavior, logs, and metrics.
For application error logging, we use Sentry. Sentry aggregates all uncaught exceptions from our applications. We also report directly to it when we find error states in the code. We've used Sentry for years and it's been a great bang for the buck. If a service gives a 500, we can bet the details are in Sentry.
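The post doesn't tie us to a particular SDK, but reporting a handled error state from Go with the public sentry-go client looks roughly like this; the DSN and the error itself are placeholders.

```go
// Rough sketch of reporting a handled error state to Sentry using the public
// sentry-go SDK. The DSN and the error below are placeholders.
package main

import (
	"errors"
	"log"
	"time"

	"github.com/getsentry/sentry-go"
)

func doWork() error {
	return errors.New("example error state")
}

func main() {
	if err := sentry.Init(sentry.ClientOptions{
		Dsn: "https://examplePublicKey@o0.ingest.sentry.io/0", // placeholder DSN
	}); err != nil {
		log.Fatalf("sentry.Init: %v", err)
	}
	// Flush buffered events before the program exits.
	defer sentry.Flush(2 * time.Second)

	// Uncaught panics can be captured with a deferred sentry.Recover(); for
	// error states we detect ourselves, we report explicitly:
	if err := doWork(); err != nil {
		sentry.CaptureException(err)
	}
}
```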
For more general logging, we use a logging cluster we built in-house for system and application logs. All of our services' Docker logs and instance system logs are shipped to Elasticsearch and S3 by Fluentd. A Kibana dashboard sits on top of this, giving us fast searches and aggregations over our logs.
For metrics, we run Prometheus and Grafana. All of our services report a set of standard metrics to Prometheus, such as HTTP 5xx rate, HTTP request rate, and whether or not they are up. We also scrape CloudWatch metrics for all of our tasks so we can see memory and CPU usage in Grafana.
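Our actual middleware isn't shown in this post, but a minimal sketch with the public client_golang library gives the flavor: a request counter labeled by status code (which is where a 5xx rate comes from) and a latency histogram, exposed on /metrics for Prometheus to scrape. The metric names here are illustrative, not our real standard set.

```go
// Illustrative instrumentation with the public Prometheus Go client; metric
// names and labels are examples, not Button's actual standard metrics.
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	// Request count by method and status code; a 5xx rate is then a PromQL
	// query over this series, e.g. rate(http_requests_total{code=~"5.."}[5m]).
	requests = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "http_requests_total",
		Help: "HTTP requests by method and status code.",
	}, []string{"method", "code"})

	// Request latency, from which percentiles are derived in Grafana.
	latency = promauto.NewHistogramVec(prometheus.HistogramOpts{
		Name:    "http_request_duration_seconds",
		Help:    "HTTP request latency.",
		Buckets: prometheus.DefBuckets,
	}, []string{"method", "code"})
)

func main() {
	ping := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("pong"))
	})
	// Wrap the handler so status codes (and hence the 5xx rate) and latency
	// are recorded automatically.
	http.Handle("/ping", promhttp.InstrumentHandlerDuration(latency,
		promhttp.InstrumentHandlerCounter(requests, ping)))
	// Prometheus scrapes this endpoint; the "up" metric comes from the scrape
	// itself, not from the application.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```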
All of our services get a generic Grafana API dashboard for free. This dashboard includes 5xx and 4xx rates for all endpoints, latency per endpoint by percentile, and downstream API latency (response time for calls this service makes to other services).
We have our Prometheus Alertmanager hooked up to PagerDuty, Slack, and email, and we route alerts to each based on the severity level defined for the alert. I won't go into much detail on our on-call and alerting setup here, but if you're interested, my colleague Jiaqi published an excellent post on the subject.
We also use Prometheus for useful custom application metrics such as reporting traffic for different code paths. This is very helpful for verifying that new code is working as expected.
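As a hedged example of what such a custom metric might look like in Go (the names are made up, not real Button metrics), a counter labeled by code path lets you graph how much traffic the new path is actually taking:

```go
// Illustrative custom metric: count how often each code path is taken so a
// Grafana graph can confirm new code is actually being exercised.
package main

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

var codePathHits = promauto.NewCounterVec(prometheus.CounterOpts{
	Name: "checkout_code_path_total", // example name only
	Help: "Requests handled by each checkout code path.",
}, []string{"path"})

func handleCheckout(useNewFlow bool) {
	if useNewFlow {
		codePathHits.WithLabelValues("new").Inc()
		// ... new implementation ...
		return
	}
	codePathHits.WithLabelValues("legacy").Inc()
	// ... existing implementation ...
}

func main() {
	handleCheckout(true)
	handleCheckout(false)
}
```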
One of our company values is to "speak boldly and honestly." Within the Engineering Team, we live by that principle and actively encourage every engineer to make proposals they believe will improve the team.
Our usual process for a proposal is an RFC (Request for Comments) writeup emailed to the engineering Google group. We comment asynchronously, and for larger proposals we hold an in-person review to hammer out the open questions.
Sharing designs asynchronously has several big benefits. People get a chance to think through the design and their comments instead of having to respond instantly in a meeting, which I believe improves the quality of the discussion. It also builds a culture where people who are less inclined to speak up in meetings (or who are dissuaded from speaking up) get an equal chance to contribute. For more thoughts on asynchronous design sharing and how it helps build an inclusive, thoughtful culture, I recommend this post.
The focus on writing in Button Engineering is one of my favorite parts of our culture. We take great pride in knowledge sharing, and a persuasive document can accomplish a lot. For example, adding Go as a standard language at Button started as one engineer's (very detailed) document proposing giving it a try.
You can even go back through the years and see the design docs for every service we have (and some that are long gone). Writing has always been essential to how we work and will stay that way going forward.
Many things at Button have changed since the day I started. One of the main constants has been thoughtful engineers communicating openly and respectfully to make our company better, to build our culture, and to do great engineering work.
If you'd like to be a part of an engineering team that operates that way, take a look at our open roles!