## Why Twilio Segment moved from microservices back to a monolith

Original link: https://www.twilio.com/en-us/blog/developers/best-practices/goodbye-microservices

The team originally built separate microservices to solve performance bottlenecks, but found that their 140+ destination integrations eventually became a huge operational burden. Slow deployments, messy dependency management, and unreliable tests that frequently failed because of external endpoint issues severely hampered developer productivity. To fix this, they consolidated all destination code into a single repository and standardized dependencies to simplify maintenance. Crucially, they built a "Traffic Recorder" that records and replays network traffic to eliminate flaky HTTP-based tests, cutting the test suite's runtime from as much as an hour to milliseconds. This made a successful transition to a monolith possible, dramatically improving development velocity (shared-library improvements increased by 43%) and simplifying scaling. While acknowledging the trade-offs (reduced fault isolation and less effective in-memory caching), the team concluded that the operational and productivity gains outweighed those concerns. The experience shows that while microservices can work, a monolith was the better solution for their particular server-side destination integrations.

## Twilio Segment's return to the monolith: discussion summary

Twilio Segment recently moved from a microservice architecture back to a monolith, citing improved developer productivity. The core problem was not microservices themselves but how they were implemented: the team ended up with 140+ services where even a shared-library update required coordinated deployments, effectively a "distributed monolith." Commenters point out that true microservices should allow independent deployment; frequent, coordinated deployments cancel out the benefits, and security updates requiring broad redeployment were another pain point. Some argue the problems stemmed from organizational and engineering-quality issues rather than the architectural choice itself. The discussion highlights appropriate service granularity, aligning services with business capabilities, and understanding the "distributed systems premium," the extra complexity that distributed systems bring. Many feel microservices are often misapplied, especially when organizations lack the necessary domain-modeling expertise. The article also sparked debate over whether advances in deployment tooling over the past seven years have made microservices more viable. Ultimately, the consensus leans toward building what fits the specific use case rather than blindly following architectural trends.

## Original article

Given that there would only be one service, it made sense to move all the destination code into one repo, which meant merging all the different dependencies and tests into a single repo. We knew this was going to be messy.

For each of the 120 unique dependencies, we committed to having one version for all our destinations. As we moved destinations over, we’d check the dependencies it was using and update them to the latest versions. We fixed anything in the destinations that broke with the newer versions.

With this transition, we no longer needed to keep track of the differences between dependency versions. All our destinations were using the same version, which significantly reduced the complexity across the codebase. Maintaining destinations now became less time consuming and less risky.
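
As a rough illustration of that "one version everywhere" rule, a hypothetical helper like the sketch below could scan each destination's package.json in the merged repo and flag any dependency still pinned at more than one version. The destinations/ layout and names are assumptions for the example, not Segment's actual tooling.

```typescript
// Hypothetical helper: report dependencies that still resolve to more than
// one version across the merged destinations, so they can be consolidated.
import * as fs from 'fs';
import * as path from 'path';

const destinationsDir = path.join(__dirname, 'destinations'); // assumed layout
const versionsByDep = new Map<string, Set<string>>();

for (const dest of fs.readdirSync(destinationsDir)) {
  const pkgPath = path.join(destinationsDir, dest, 'package.json');
  if (!fs.existsSync(pkgPath)) continue;
  const pkg = JSON.parse(fs.readFileSync(pkgPath, 'utf8'));
  for (const [dep, version] of Object.entries<string>(pkg.dependencies ?? {})) {
    if (!versionsByDep.has(dep)) versionsByDep.set(dep, new Set());
    versionsByDep.get(dep)!.add(version);
  }
}

for (const [dep, versions] of versionsByDep) {
  if (versions.size > 1) {
    console.log(`${dep} is pinned at ${versions.size} versions: ${[...versions].join(', ')}`);
  }
}
```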

We also wanted a test suite that allowed us to quickly and easily run all our destination tests. Running all the tests was one of the main blockers when making updates to the shared libraries we discussed earlier.

Fortunately, the destination tests all had a similar structure. They had basic unit tests to verify our custom transform logic was correct and would execute HTTP requests to the partner’s endpoint to verify that events showed up in the destination as expected.
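
As a rough sketch of that structure (the transform, endpoint, and payload shape below are invented for the example, not Segment's actual code), a destination test paired a unit assertion on the custom transform with a live HTTP call to the partner:

```typescript
import * as assert from 'assert';
import * as https from 'https';

// Hypothetical transform under test: map an incoming event onto the
// partner's expected payload shape.
function transform(event: { event: string; userId: string }) {
  return { name: event.event, external_id: event.userId };
}

// Unit test: verify the custom transform logic.
assert.deepStrictEqual(
  transform({ event: 'Order Completed', userId: 'u_123' }),
  { name: 'Order Completed', external_id: 'u_123' }
);

// Integration test: send the transformed event to the partner endpoint and
// check it was accepted. This is the slow, flaky part that Traffic Recorder
// later replaces with recorded tapes.
const payload = JSON.stringify(transform({ event: 'Order Completed', userId: 'u_123' }));
const req = https.request(
  {
    hostname: 'api.example-destination.com', // illustrative partner endpoint
    path: '/v1/track',
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
  },
  (res) => assert.strictEqual(res.statusCode, 200)
);
req.end(payload);
```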

Recall that the original motivation for separating each destination codebase into its own repo was to isolate test failures. However, it turned out this was a false advantage. Tests that made HTTP requests were still failing with some frequency. With destinations separated into their own repos, there was little motivation to clean up failing tests. This poor hygiene led to a constant source of frustrating technical debt. Often a small change that should have only taken an hour or two would end up requiring a couple of days to a week to complete.

The outbound HTTP requests to destination endpoints during the test run was the primary cause of failing tests. Unrelated issues like expired credentials shouldn’t fail tests. We also knew from experience that some destination endpoints were much slower than others. Some destinations took up to 5 minutes to run their tests. With over 140 destinations, our test suite could take up to an hour to run.

To solve for both of these, we created Traffic Recorder. Traffic Recorder is built on top of yakbak, and is responsible for recording and saving destinations’ test traffic. Whenever a test runs for the first time, any requests and their corresponding responses are recorded to a file. On subsequent test runs, the request and response in the file are played back instead of requesting the destination’s endpoint. These files are checked into the repo so that the tests are consistent across every change. Now that the test suite is no longer dependent on these HTTP requests over the internet, our tests became significantly more resilient, a must-have for the migration to a single repo.
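
The post doesn't include Traffic Recorder's code, but a minimal sketch of the record-and-replay idea on top of yakbak might look like the following; the partner hostname, port, and tape directory are illustrative assumptions, not Segment's actual setup.

```typescript
import * as http from 'http';
import * as path from 'path';
// yakbak ships without bundled TypeScript definitions, hence the require().
const yakbak = require('yakbak');

// yakbak proxies requests to the real host the first time and writes each
// request/response pair to a "tape" file; later runs replay the tape instead
// of hitting the network.
const proxy = http.createServer(
  yakbak('https://api.example-destination.com', {
    dirname: path.join(__dirname, 'tapes'), // tapes are checked into the repo
  })
);

proxy.listen(4567, () => {
  // Tests point their HTTP client at http://localhost:4567 instead of the
  // partner endpoint, making the suite deterministic and fast.
  console.log('Traffic recorder proxy listening on http://localhost:4567');
});
```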

It took milliseconds to complete running the tests for all 140+ of our destinations after we integrated Traffic Recorder. In the past, just one destination could have taken a couple of minutes to complete. It felt like magic.

Once the code for all destinations lived in a single repo, they could be merged into a single service. With every destination living in one service, our developer productivity substantially improved. We no longer had to deploy 140+ services for a change to one of the shared libraries. One engineer can deploy the service in a matter of minutes.

The proof was in the improved velocity. When our microservice architecture was still in place, we made 32 improvements to our shared libraries. One year later, we’ve made 46 improvements.

The change also benefited our operational story. With every destination living in one service, we had a good mix of CPU and memory-intense destinations, which made scaling the service to meet demand significantly easier. The large worker pool can absorb spikes in load, so we no longer get paged for destinations that process small amounts of load.

Moving from our microservice architecture to a monolith was overall a huge improvement; however, there are trade-offs:

  1. Fault isolation is difficult. With everything running in a monolith, if a bug is introduced in one destination that causes the service to crash, the service will crash for all destinations. We have comprehensive automated testing in place, but tests can only get you so far. We are currently working on a much more robust way to prevent one destination from taking down the entire service while still keeping all the destinations in a monolith. A minimal sketch of the basic error-boundary idea, and why it falls short, follows this list.

  2. In-memory caching is less effective. Previously, with one service per destination, our low traffic destinations only had a handful of processes, which meant their in-memory caches of control plane data would stay hot. Now that cache is spread thinly across 3000+ processes so it’s much less likely to be hit. We could use something like Redis to solve for this, but then that’s another point of scaling for which we’d have to account. In the end, we accepted this loss of efficiency given the substantial operational benefits.

  3. Updating the version of a dependency may break multiple destinations. While moving everything to one repo solved the previous dependency mess we were in, it means that if we want to use the newest version of a library, we’ll potentially have to update other destinations to work with the newer version. In our opinion though, the simplicity of this approach is worth the trade-off. And with our comprehensive automated test suite, we can quickly see what breaks with a newer dependency version.
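
To make the first trade-off concrete, here is a minimal sketch (an illustration under assumed names, not Segment's code) of a per-destination error boundary. It contains ordinary exceptions, but a native crash, unbounded loop, or memory exhaustion in any one destination still takes down the shared process, which is why a try/catch alone is not a complete answer.

```typescript
type Event = { destination: string; payload: unknown };
type Handler = (payload: unknown) => Promise<void>;

const handlers = new Map<string, Handler>(); // one handler per destination

async function dispatch(event: Event): Promise<void> {
  const handler = handlers.get(event.destination);
  if (!handler) return;
  try {
    await handler(event.payload);
  } catch (err) {
    // Contain the failure to this destination and event; log and move on
    // rather than letting the exception propagate and crash the whole service.
    console.error(`destination ${event.destination} failed`, err);
  }
}
```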

Our initial microservice architecture worked for a time, solving the immediate performance issues in our pipeline by isolating the destinations from each other. However, we weren’t set up to scale. We lacked the proper tooling for testing and deploying the microservices when bulk updates were needed. As a result, our developer productivity quickly declined.

Moving to a monolith allowed us to rid our pipeline of operational issues while significantly increasing developer productivity. We didn’t make this transition lightly though and knew there were things we had to consider if it was going to work.

  1. We needed a rock solid testing suite to put everything into one repo. Without this, we would have been in the same situation as when we originally decided to break them apart. Constant failing tests hurt our productivity in the past, and we didn’t want that happening again.
  2. We accepted the trade-offs inherent in a monolithic architecture and made sure we had a good story around each. We had to be comfortable with some of the sacrifices that came with this change.

When deciding between microservices or a monolith, there are different factors to consider with each. In some parts of our infrastructure, microservices work well but our server-side destinations were a perfect example of how this popular trend can actually hurt productivity and performance. It turns out, the solution for us was a monolith.

The transition to a monolith was made possible by Stephen Mathieson, Rick Branson, Achille Roussel, Tom Holmes, and many more.

Special thanks to Rick Branson for helping review and edit this post at every stage.
