Every quarter, the Infrastructure division at Vendasta gets together for an all-day offsite meeting. The goal is to review the most recent quarter’s plan and results to see how we did, and then to review team plans for the next quarter. It’s a great opportunity for all members of the division (currently 18) to openly discuss issues and coordinate our efforts and metrics before the quarter begins. At the 2020 Q1 offsite, the whole team agreed that Infrastructure was seeing a significant drag on velocity due to the tech debt caused by incomplete migrations. We were so convinced that the SRE team dedicated a significant portion of their Q2 roadmap to migrations and added two new members who were formerly on our backend infrastructure team (aptly named TeamyMcTeamFace).
Breakdown of a Migration
All migrations tend to follow a basic common process:
- A new way of doing things is identified (the old way is the only way)
- Some people start doing things the new way (there are now 2 ways)
- People doing things the old way migrate to doing things the new way (there are still 2 ways)
- Zero people are doing things the old way (there are still 2 ways)
- Infrastructure, people, and processes supporting the old way are sunset (the new way is the only way)
At Vendasta, we found that many (most?) migrations stop before completing all 5 steps. Stopping early causes a number of problems, but the interesting part is that people don’t do this on purpose. Instead, we found a handful of common patterns that, despite good intentions, leave a migration unfinished.
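To make the five steps concrete, here is a minimal Python sketch that works out which step a migration has reached from a few counts. The `MigrationStatus` fields and the `current_step` function are hypothetical names made up for illustration, not something we actually run at Vendasta, and the rule for telling step 2 from step 3 (whether any client has moved off the old way yet) is an assumption.

```python
from dataclasses import dataclass


@dataclass
class MigrationStatus:
    """Hypothetical snapshot of a migration; all field names are illustrative."""
    old_way_clients: int        # clients still doing things the old way
    new_way_clients: int        # clients doing things the new way
    clients_moved_off_old: int  # clients that switched from old to new
    old_way_supported: bool     # old infrastructure/processes still exist


def current_step(status: MigrationStatus) -> int:
    """Return which of the five steps (1-5) the migration has reached."""
    if status.old_way_clients == 0:
        # Nobody uses the old way; the only open question is the sunset.
        return 4 if status.old_way_supported else 5
    if status.new_way_clients == 0:
        return 1  # new way identified, but the old way is the only way in use
    if status.clients_moved_off_old == 0:
        return 2  # only early adopters so far (assumed distinction)
    return 3      # old-way clients have started migrating


def is_complete(status: MigrationStatus) -> bool:
    """A migration only counts as done once the old way is sunset."""
    return current_step(status) == 5
```

For example, `current_step(MigrationStatus(0, 12, 10, True))` returns 4: everyone has moved, but there are still two ways until the old one is sunset.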
The Snowflake
Developers at Vendasta are naturally curious and, as such, frequently try new technologies to improve efficiency. This is great, and it represents the first steps in the migration process. The reality is that these types of migrations often fail to progress beyond step two. Often the new way of doing things isn’t significantly better, so a full migration does not seem worthwhile. Other times the new way is significantly better, but other priorities or perceived difficulty make the full migration seem not worthwhile. In these cases, you’re left with a snowflake (something completely unique) that is inconsistent with the rest of the code base and is rarely documented as well as the old way. This is a liability because, as the system evolves, snowflakes are rarely considered and rarely fully covered by tests. They are a great source of hard-to-find and harder-to-fix bugs.
As a piece of advice from the future: it is almost always worth the effort to complete the migration, either by killing the snowflake and going back to the old way or by migrating everyone to the new way.
New vs Old
Most of the time, when a great new way of doing things is discovered at Vendasta, people get excited about it and become evangelists. They push for others to join them, and the migration progresses to the third step. This means that you no longer have a snowflake and both methods tend to be well understood. It is at this point that victory is often declared prematurely. People often forget to migrate “legacy” or “niche” projects. The thought is that the more active projects benefit from the new way and that is good enough. As usual, the truth is more complicated.
The problem is that as long as there are two ways to do things, there is a 2x training overhead, and there are two different libraries or processes that need to be maintained and/or patched. When the library, process, or API is internally maintained, the maintaining team has to split its focus across multiple active versions.
The Land of the Forgotten
When all clients have been migrated, people tend to dust off their hands and declare it a “job well done”. Again, this premature declaration of victory means that value is left on the table. Oftentimes dead code is not deleted, resources are not freed up, documentation is not deleted or updated, and dependencies are not removed. Excess dependencies slow build times, excess code slows all development activities and training, out-of-date documentation causes all kinds of issues, and excess resources cost time and money.
This is definitely the least harmful of the cases, but as your company and code grow, its impact grows too. It is also particularly problematic because it is largely invisible to developers, even though it makes the code harder to follow, organize, and update.
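One way to make this invisible leftover work visible is to write it down as an explicit checklist. The sketch below is only an illustration: `SunsetChecklist` and `remaining_work` are hypothetical names, and the four items are simply the ones mentioned above (dead code, resources, documentation, dependencies).

```python
from dataclasses import dataclass, fields


@dataclass
class SunsetChecklist:
    """Hypothetical 'definition of done' for the final step of a migration."""
    dead_code_deleted: bool = False
    resources_freed: bool = False
    docs_updated_or_deleted: bool = False
    dependencies_removed: bool = False


def remaining_work(checklist: SunsetChecklist) -> list[str]:
    """List the checklist items that are still outstanding, in declaration order."""
    return [f.name for f in fields(checklist) if not getattr(checklist, f.name)]


def is_sunset(checklist: SunsetChecklist) -> bool:
    """The old way is only truly gone once nothing is outstanding."""
    return not remaining_work(checklist)
```

With a checklist like this, a migration where only the code and resources have been cleaned up would still report the documentation and dependencies as open items rather than quietly counting as finished.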
Many types of migrations
The infrastructure group found itself the owner of a variety of the above types of incomplete migrations. Here are a few examples:
- Multiple active DNS providers, left over from when we migrated our DNS from Amazon Route 53 to Google Cloud DNS.
- Multiple actively supported major versions of client libraries for internal services.
- Projects using our proprietary Go workflow engine instead of the newly minted open-source Cadence project.
The result of these incomplete migrations was that our team was supporting projects that had been made obsolete by open-source alternatives, supporting multiple versions of internal libraries, and supporting non-differentiated infrastructure on multiple clouds! These were the causes of the drag on our velocity, which the team had become keenly aware of as its magnitude grew over the years.
So what did we learn?
The first thing we learned is that migrations take a variety of paths and require different approaches. That’s a topic unto itself, so I’ll save it for another time. The real question is whether things improved. Did we recover some of the velocity that we’d lost?
This migration effort started in earnest partway through Q2, and here we are halfway through Q3. The change is remarkable. We’ve been able to sunset (i.e. delete) multiple codebases and eliminate multiple legacy APIs and technologies, and the team has seen a reduction in the number of parallel streams of work. No doubt it was a big undertaking, but the general consensus is that it was a positive change, and we are looking forward to sharing our approach and processes with the rest of the organization so more people can take back their velocity!