网飞:让服务具备高可用性的一些提示

Jul 7, 2018 00:00 · 5136 words · 11 minute read DevOps

译文

过去四年里,Netflix 的订阅用户从不到 5000 万增长到了 1.25 亿。这种增长给我们带来了不小的扩展性挑战,但实际上我们在这段时间里还是设法提高了服务的整体可用性。一路走来我们学到了很多,现在对如何让系统更具高可用性有了更好的理解。但并不全是好消息:事实上,很多教训我们是以艰难的方式学到的,靠临场救火、靠出事后的手忙脚乱,有时甚至是通过面向客户的事故。尽管我们还没有解决所有问题,系统也仍有改进的空间,但我们想分享一些积累下来的经验、提示和最佳实践。但愿你们能有所收获,免得凌晨三点被电话叫醒去处理面向客户的突发事件。

在 Netflix,我们打造并使用 Spinnaker 作为持续集成和交付的平台。这里讨论的许多最佳实践已经被编码进 Spinnaker,因此很容易遵循。虽然本文展示的是我们在内部如何把这些最佳实践编码进 Spinnaker,但这些提示和最佳实践具有普遍意义,能帮助任何人让自己的系统更具高可用性。

优先考虑区域部署而不是全球部署

我们的目标是尽可能提供最佳的用户体验。因此,我们会限制任何系统变更的影响范围,先验证变更,再把它发布给更广泛的客户群。更具体地说,我们一次只部署一个 AWS 区域。这为我们的生产部署提供了额外的安全保障,也让我们能够迅速把受影响客户的流量切换走,这是我们最重要的补救手段之一。

我们也建议在每个区域部署之间验证程序的功能,避免在目标区域的流量高峰期发布。

在 Spinnaker 中可以直接指定部署的目标区域。
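
按区域逐个发布、并在区域之间做验证的流程,可以用下面的 Python 草图示意。其中 deploy 和 validate 都是假设的占位函数,并非 Spinnaker 的真实 API,仅用来说明"限制影响范围、验证失败立即停止"的思路:

```python
# 按区域逐个发布的草图:一次只部署一个 AWS 区域,
# 每个区域部署后先验证,验证失败立即停止,限制影响范围。
# deploy 和 validate 均为假设的占位回调,并非 Spinnaker API。

def rolling_regional_deploy(regions, deploy, validate):
    """依次在每个区域部署;任一区域验证失败则停止。

    返回 (已成功的区域列表, 失败的区域或 None),便于后续回滚。"""
    completed = []
    for region in regions:
        deploy(region)                 # 仅部署到当前区域
        if not validate(region):       # 进入下一个区域前先验证功能
            return completed, region
        completed.append(region)
    return completed, None

# 用法示例:模拟 us-east-1 验证失败,发布在该区域停住
ok, failed = rolling_regional_deploy(
    ["us-west-2", "us-east-1", "eu-west-1"],
    deploy=lambda r: None,
    validate=lambda r: r != "us-east-1",
)
print(ok, failed)  # ['us-west-2'] us-east-1
```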

使用红/黑策略来部署生产系统

在红黑部署(也称蓝绿部署)中,应用程序的新版本(红版)一旦通过健康检查就开始接收流量。红版就绪后,之前的版本(黑版)即被停用,不再接收流量。如果需要回滚,只需重新启用之前的版本即可。对我们的服务而言,这个模型让我们既能快速迭代,又能在出问题时恢复到一个已知可用的状态。

为了能够使用 Spinnaker 完成红黑部署,我们的工程师仅仅需要在管道中指定策略(选择性地设置参数),Spinnaker 就会接管部署。
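
红黑部署的核心切换逻辑可以用下面的极简草图说明。集群用一个简单字典建模,函数和数据结构均为示意性假设,并非 Spinnaker 的内部实现:

```python
# 红黑(蓝绿)部署的简化草图:新版必须先通过健康检查才接收流量,
# 旧版保留但停用,回滚即重新启用旧版。均为示意,非 Spinnaker 实现。

def red_black_deploy(cluster, new_version, health_check):
    """cluster 是 {"active": 版本, "standby": 版本或 None} 的简单模型。"""
    if not health_check(new_version):          # 新版(红)必须先通过健康检查
        raise RuntimeError("new version failed health check")
    cluster["standby"] = cluster["active"]     # 旧版(黑)停用但保留
    cluster["active"] = new_version            # 新版开始接收全部流量
    return cluster

def rollback(cluster):
    """回滚:重新启用之前的版本,就这么简单。"""
    cluster["active"], cluster["standby"] = cluster["standby"], cluster["active"]
    return cluster
```

由于旧版只是停用而没有销毁,回滚只是一次指针交换,这正是红黑模型"快速恢复到已知可用状态"的来源。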

使用部署窗口

无论何时部署新版本的应用程序,都要记住两件事:第一,你(或你的同事)是否能够观察这次部署造成的影响,并在必要时进行修复?第二,如果这次发布出了问题,你是否把影响范围限制在了尽可能少的用户?

这两点都提醒我们要留意部署新软件或新版本的时机。就我们而言,流媒体流量遵循一个相对可预测的模式:无论住在哪里,大多数人都在晚上观看视频。我们建议为所选区域选择工作时间内、且处于流量低谷的部署窗口。

Spinnaker 通过其管道界面提供了图形化配置,使你可以轻松指定该管道允许运行的日期和时间。
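
部署窗口本质上就是一个时间判断。下面是一个假设性的 Python 草图,窗口定义(周一至周五本地时间 10:00 至 16:00)只是示例,并非 Netflix 或 Spinnaker 的默认配置:

```python
# 部署窗口检查的草图:只允许在工作日的指定时段触发部署。
# 具体窗口(周一至周五 10:00-16:00)为示例假设。

from datetime import datetime

ALLOWED_WEEKDAYS = {0, 1, 2, 3, 4}   # 周一=0 … 周五=4
WINDOW_START, WINDOW_END = 10, 16    # 本地时间 10:00 至 16:00

def in_deploy_window(now: datetime) -> bool:
    """判断给定时间是否落在部署窗口内;落在窗口外则应推迟部署。"""
    return now.weekday() in ALLOWED_WEEKDAYS and WINDOW_START <= now.hour < WINDOW_END

print(in_deploy_window(datetime(2018, 7, 4, 11)))  # 周三上午 -> True
print(in_deploy_window(datetime(2018, 7, 7, 11)))  # 周六 -> False
```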

确保自动触发的部署不会在非工作时间或周末执行

部署窗口同样适用于自动触发的事件。Spinnaker 允许使用 cron 表达式作为管道触发器,这有助于减少用户的手动操作和心智负担。但这也可能带来风险:很容易写出一个会在非工作时间或周末执行管道的 cron 表达式,而这未必是你的本意。无论使用何种自动化方式,都要确保任何自动触发(例如由 cron 触发)的管道能够在无人值守的情况下安全运行。
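
要在配置阶段就发现"会在周末触发"的 cron 表达式,可以检查其"星期"字段。下面的草图只解析标准五段式 cron 的星期字段(0 或 7 均表示周日),仅支持 *、逗号列表与区间,足以演示思路,并不是完整的 cron 解析器:

```python
# 检查 cron 表达式是否可能在周末触发的草图。
# 仅处理标准五段式 cron 的第五段(day-of-week),
# 支持 *、逗号列表与区间;不支持步进等扩展语法,仅为示意。

def may_fire_on_weekend(cron_expr: str) -> bool:
    dow_field = cron_expr.split()[4]          # 第五段是 day-of-week
    if dow_field == "*":
        return True                           # 不限星期,周末也会触发
    days = set()
    for part in dow_field.split(","):
        if "-" in part:
            lo, hi = map(int, part.split("-"))
            days.update(range(lo, hi + 1))
        else:
            days.add(int(part))
    weekend = {0, 6, 7}                       # 周日(0 或 7)与周六(6)
    return bool(days & weekend)

print(may_fire_on_weekend("0 12 * * 1-5"))    # 仅工作日 -> False
print(may_fire_on_weekend("0 3 * * *"))       # 每天凌晨 -> True
```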

启用 Chaos Monkey

Netflix 打造并开源了 Chaos Monkey,它是我们混沌工程工具套件的一部分。Chaos Monkey 会随机地自动终止生产环境中的实例,以此作为一种强制手段,促使服务在设计上能够容忍单实例故障。如果服务做不到这一点,Chaos Monkey 会提前暴露这个弱点,让服务所有者能在它演变成大范围的面向客户的事故之前修复它。在 Netflix,我们的最佳实践是:所有生产环境中的服务都应启用 Chaos Monkey,并且服务所有者应确认 Chaos Monkey 终止应用实例时不会引发任何问题。
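
Chaos Monkey 的核心思路(以一定概率随机终止实例)可以用下面的极简草图说明。这并不是 Netflix 开源实现的代码,terminate 回调和概率参数都是示意性假设:

```python
# Chaos Monkey 思路的极简草图:以一定概率随机挑一个生产实例终止,
# 以此逼迫服务在设计上容忍单实例故障。并非 Netflix 开源实现。

import random

def chaos_monkey(instances, terminate, probability=0.2, rng=random):
    """以给定概率从实例列表中随机挑一个终止。

    返回被终止的实例,若本轮未触发则返回 None。
    rng 可注入以便测试(默认使用 random 模块)。"""
    if instances and rng.random() < probability:
        victim = rng.choice(instances)
        terminate(victim)
        return victim
    return None
```

把它安排成定时任务、只在工作时间运行,就得到了"随机但可控"的故障注入;真正的价值在于由此倒逼出来的服务韧性设计。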

在推送到生产环境之前,使用(单元、集成、冒烟)测试和金丝雀分析来验证代码

快速开发的关键是在部署前自动验证新版的软件。理想情况下,所有必要的测试套件不需要人工干预就可以运行。

另外,我们推荐使用金丝雀分析。金丝雀分析是用实时流量验证服务新改动的有效方法。我们在 Netflix 内部打造了一个工具 Kayenta,最近已将其开源。Kayenta 可以轻松集成到 Spinnaker 中;结合人工判断,Kayenta 是承接全量生产流量之前的最后一道关卡。
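
金丝雀分析的基本思路是把金丝雀实例的指标与基线实例对比,恶化超过容忍度就判定失败。真实的 Kayenta 做的是更严谨的统计学判断,下面只是一个示意性的简化草图,指标名与容忍度均为假设:

```python
# 金丝雀分析思路的简化草图:逐项比较金丝雀与基线的指标,
# 任一指标相对基线恶化超过容忍度即判定失败。
# 真实的 Kayenta 使用统计检验,这里仅为示意。

def canary_passes(baseline: dict, canary: dict, tolerance: float = 0.05) -> bool:
    """baseline/canary 是指标名到数值的映射(数值越小越好,如错误率、延迟)。"""
    for metric, base in baseline.items():
        if base == 0:
            continue  # 基线为 0 时跳过相对比较(示意性简化)
        if (canary[metric] - base) / base > tolerance:
            return False  # 该指标恶化超出容忍度,金丝雀不通过
    return True
```

通过则继续放量,不通过则停止发布并回滚金丝雀,这一步正是全量生产流量之前的最后一道门。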

自行决定是否要人工干预

尽可能自动化,但在适当的地方保留人工干预。例如,在将新版本发布到生产环境之前,人工检查一下金丝雀的运行结果可能是合适的。

尽可能把你测试过的内容原样部署到生产环境

既然你已经对新版本做了大量测试和验证,我们强烈建议把你测试过的内容原样部署到生产环境。就我们而言,这意味着我们更倾向于从测试环境复制一份经过验证的镜像,而不是在生产环境中重新构建一份新镜像。

定期检查呼叫(paging)设置

有时,你能为应用程序的可用性做的最好的事情很简单,却不那么显而易见。当应用程序出了问题(而它多半迟早会出问题)时,重要的是能修复它的人会被及时呼叫到。所以要检查你的呼叫(paging)设置,而且要定期检查。这能确保事故发生时迅速联系到合适的人。

Spinnaker 内部有一个方便的 "Page owner"(呼叫所有者)按钮,因此保持这些信息最新尤为重要;否则我们会有一种随时能联系上应用所有者的虚假安全感,直到最需要的时候才发现配置早已过时。

知道怎么快速回滚部署

相信很多人都会同意:即使有扎实的测试、金丝雀和其他验证,有时部署到生产环境后还是会出问题。也许是一个由竞态条件引发、只在一定规模下才会触发的罕见 bug。无论是什么情况,重要的是你要知道在必要时如何快速回滚到一个已知可用的状态。

在 Spinnaker 中,如果新版本的应用程序在生产环境出现问题,可以通过 Server Group 操作下的 Rollback 选项进行回滚。回滚会启用你选择的 ASG(通常是上一个版本)并停用故障的 ASG。Spinnaker 也支持创建回滚管道,它们可以在管道失败时自动执行,甚至可以手动触发。
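
"启用所选 ASG、停用故障 ASG"这一回滚操作可以用下面的草图示意。ASG 用简单字典建模,仅演示"回滚到已知可用状态"的思路,并非 Spinnaker 的内部实现:

```python
# Server Group 回滚操作的示意草图:启用目标(通常是上一个)ASG,
# 并停用故障 ASG。ASG 用 {名称: {"enabled": bool}} 建模,仅为示意。

def rollback_server_group(asgs, faulty_name, target_name):
    """启用 target_name 对应的 ASG,停用 faulty_name 对应的 ASG。"""
    if target_name not in asgs or faulty_name not in asgs:
        raise KeyError("unknown ASG name")
    asgs[target_name]["enabled"] = True    # 上一个已知可用版本重新接收流量
    asgs[faulty_name]["enabled"] = False   # 故障版本不再接收流量
    return asgs
```

把这一步做成可以一键(或在管道失败时自动)执行的管道,就不必在凌晨三点手工摸索回滚步骤了。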

当实例启动后不健康时,让部署失败

多年来,有几次我们在部署"成功"后陷入了糟糕的状态:实例起来了,但它们并不健康,实际上无法正确处理流量。这种"成功"的部署给了我们一种虚假的安全感;当一个关键服务实际上并不健康时,这种安全感很快就会消失,失败的请求迅速堆积,有时甚至引发重试风暴和各种破坏。从这些经验中,我们认识到当实例启动后不健康时让部署失败有多么重要。

Spinnaker 提供了灵活的方式来关联实例健康状态。当实例不健康时,Spinnaker 会相应地标注它;更重要的是,不健康的实例将不再接收流量。如果一个 ASG 中的所有实例都不健康,Spinnaker 会让这次部署失败。为了方便操作者,Spinnaker 还会用不同的颜色清晰地标注实例状态:启动中、正在启动、等待发现、不健康、健康。
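
"实例起来但不健康就让部署失败"的判定逻辑,可以用一个带超时的健康轮询来示意。check_health 回调和超时参数都是假设,并非 Spinnaker 的实际实现:

```python
# "实例启动后不健康则让部署失败"的草图:在超时时间内轮询健康状态,
# 超时仍未全部健康则返回 False,调用方据此判定部署失败。
# check_health 为假设的健康检查回调;clock 可注入以便测试。

import time

def wait_until_healthy(instances, check_health, timeout_s=300, interval_s=5, clock=time):
    """轮询直到所有实例健康;超时返回 False,表示应让部署失败。"""
    deadline = clock.monotonic() + timeout_s
    while clock.monotonic() < deadline:
        if all(check_health(i) for i in instances):
            return True               # 全部健康,部署可以继续
        clock.sleep(interval_s)       # 尚未全部健康,稍后重试
    return False                      # 超时:实例起来了但不健康,判定失败
```

关键在于把"进程已启动"和"能正确处理流量"区分开:只有后者成立,部署才算成功。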

对于自动部署,告知团队即将完成的部署和已完成的部署

让人们知道一次部署已经成功进入生产环境,同样是值得推荐的做法,这对我们的成功运营至关重要。我们的最佳实践是:部署变更时始终盯着你的系统。当出现问题时,重要的是知道什么改变了、何时改变的。因此对于自动化部署,通知团队尤为重要,这样他们才知道要关注服务的健康状态。我们在内部使用 Slack 频道来接收通知。在 Spinnaker 中,管道可以在完成时向操作者想要通知的任何频道发送消息。

自动执行非典型情况的部署而不是进行一次性的手动工作

每个工程师都曾为非典型情况写过一次性脚本。很多人也都遇到过这种"一次性"的情况再次发生,而此时团队的其他成员并不知道当初那位工程师在他那个"只跑一次就扔掉"的脚本里到底做了什么!

管道是自动化任意一系列步骤的有效方法,即使这些步骤并非每天都会执行。一个例子是为紧急推送构建管道,通过参数来控制条件执行,比如跳过部署窗口。

别忘了在非关键的情况下,定期测试你的非典型(以及典型)部署管道!

使用先决条件验证预期状态

我们都处在一个周围系统频繁变化的世界里;在 Netflix,我们的数百个微服务在持续变化。做出假设(例如关于其他系统状态的假设)可能是危险的。从自己的错误中吸取教训后,我们现在使用先决条件来确保在部署新代码或做出其他变更时,那些假设依旧成立。这对执行时间很长的管道(可能因为人工判断或部署窗口造成的延迟)来说尤其重要。使用先决条件阶段,可以在潜在的破坏性操作之前验证预期状态。
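
先决条件阶段的本质是:在破坏性操作之前,逐条重新验证假设,任何一条不成立就中止。下面是一个假设性的 Python 草图,检查项的名字和内容均为示例:

```python
# 先决条件检查的草图:在执行可能造成破坏的操作之前,
# 逐一验证假设是否仍然成立,任何一条失败就中止管道。均为示意。

def check_preconditions(preconditions):
    """preconditions: {描述: 返回布尔值的无参检查函数}。

    返回 (是否全部通过, 失败项描述列表);失败列表可直接用于告警。"""
    failed = [name for name, check in preconditions.items() if not check()]
    return (not failed), failed

# 用法示例:管道可能已等待了数小时,执行前重新确认假设
passed, failed = check_preconditions({
    "target region is healthy": lambda: True,     # 示例检查,实际应查询系统状态
    "previous ASG still exists": lambda: True,
})
```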

结论

这篇文章总结了我们多年来在 Netflix 积累的各种提示与最佳实践,其中许多都是从自己犯的错误中学到的。我们的做法是尽可能围绕这些最佳实践构建工具,因为我们发现,只有当"落入成功的陷阱"变得容易,也就是把脆弱又令人厌烦的手工劳动减到最少时,最佳实践才往往会被真正遵循。

我们的目标是不断提高服务的可用性。对我们来说,这意味着只有在确实需要人工判断时才让人介入,其他时候则不必。这样一来,我们既把工程师的时间用在那些能提高可用性的任务上,又让他们在不需要参与时可以腾出手来专注于其他事情。


原文

Over the past four years, Netflix has gone from less than 50 Million subscribers to 125 Million subscribers. While this kind of growth has caused us no shortage of scaling challenges, we actually managed to improve the overall availability of our service in that time frame. Along the way, we have learned a lot and now have a much better understanding of what it takes to make our system more highly available. But the news is not all good. The truth is that we learned many of our lessons the hard way: through heroics, through mad scrambles when things went wrong, and sometimes unfortunately through customer-facing incidents. Even though we haven’t figured everything out and still have many opportunities to improve our systems, we want to share some of the experience we have gained and the tips or best practices we derived. Hopefully some of you will take something away that will save you a wake-up call at 3am for a customer-facing incident.

At Netflix, we have built and use Spinnaker as a platform for continuous integration and delivery. Many of the best practices discussed here have been encoded into Spinnaker, so that they are easy to follow. While in this article we show how we internally encode the best practices in Spinnaker, the tips and best practices are more general and will help anyone make their systems be highly available.

Prefer regional deploys over global ones

Our goal is to provide the best customer experience possible. As a result, we aim to limit the blast radius of any change to our systems, validate the change, and then roll it out to a broader set of customers. More specifically, we roll out deployments to one AWS region at a time. This gives us an extra layer of safety in our production deployments. It also gives us the ability to quickly shift traffic away from affected customers, which is one of our most important remediation tools.

We also recommend verifying application functionality between each regional deployment, and avoid publishing during peak hours in the targeted region.

In Spinnaker, it’s straightforward to specify the region a deploy is targeted for.

Use Red/Black deployment strategy for production deploys

In a Red/Black (also called Blue/Green) deployment a new version of an app (red) will start to receive traffic as soon as it passes health checks. Once the red version is healthy, the previous (black) version is disabled and receives no traffic. If a rollback is needed, making a change is as simple as enabling the previous version. For our services, this is a model that allows us to move fast and get back to a known good state if something goes wrong.

To accomplish a Red/Black deployment with Spinnaker, our engineers simply have to specify the strategy in their pipeline (and optionally set parameters on the strategy), and Spinnaker will take care of the rollout.

Use deployment windows

Whenever you deploy a new version of your app, there are two things to keep in mind: first, are you (and/or your colleagues) able to watch the impacts of the deployment and available to remediate if need be? And second, should there be a problem with your rollout, are you limiting the blast radius to the fewest customers possible?

Both of these reasons point to being mindful about when you deploy new software or new versions. In our case, our streaming traffic follows a relatively predictable pattern where most people stream in the evenings wherever they live. We recommend choosing deployment windows during working hours and at off-peak times for a selected region.

Spinnaker provides an interface to this through its pipeline UI. This makes it easy for you to specify the days and hours this pipeline can run.

Ensure automatically triggered deploys are not executed during off-hours or weekends

Deployment windows also apply to automatically triggered events. Spinnaker permits cron expressions as pipeline triggers. This can be useful in reducing the hand-holding our users have to do and reduces mental overhead. But this can also be a risky strategy: it’s easy to fashion an aggressive cron expression that’ll execute a pipeline during off-hours or weekends, which may not be what was expected. No matter what kind of automation you use, ensure that any pipeline triggered automatically (e.g., by cron) can be run unattended.

Enable Chaos Monkey

Chaos Monkey is built and open-sourced by Netflix and is part of our Chaos engineering suite of tools. Chaos Monkey unpredictably and automatically terminates instances in production. This serves as a forcing function to design services in a way that are resilient to single-instance failures. If they are not, Chaos Monkey will expose this vulnerability, so that service owners can fix it before it turns into a widespread customer-facing incident. At Netflix, our best practice is that all services in production should have Chaos Monkey enabled and owners of these services should detect no issues with Chaos Monkey terminating application instances.

Use (unit, integration, smoke) testing and canary analysis to validate code before it is pushed to production

The key to moving fast is automatically validating new versions of software before they are deployed. Ideally, running all necessary suites of tests can be done without any manual intervention.

In addition, we recommend the use of canary analysis. Canary analysis is an effective means to validate live traffic against new changes to a service. We have built a tool internally at Netflix, Kayenta, which we have recently open-sourced. Kayenta easily integrates within Spinnaker; in combination with manual judgement, Kayenta is a final gate before full fledged production traffic.

Use your judgement about manual intervention

Automate where possible, but use manual intervention where appropriate. For instance, it may be appropriate to check the results of a canary run before pushing a new version to production.

Where possible, deploy exactly what you tested to production

Now that you’ve done a lot of testing and validation of your new version, we highly recommend that you deploy to production exactly what you tested. In our case this means that we prefer to copy a validated image from a test environment rather than baking a new image in a production environment.

Regularly Review paging settings

Sometimes, the best thing you can do for the availability of your app is simple, but not obvious. When something goes wrong with your app, and more than likely it will at some point, it’s important that the people who can fix it will be paged. So review your paging settings, and do so regularly. This will help ensure that the right people will be called quickly in the case of an incident.

Internally, Spinnaker has a handy “Page owner” button, so it’s especially important that this information is up-to-date, so that we do not get a false sense of security of being able to reach the app owner but then finding out that the configuration is outdated when we most need it.

Know how to roll back your deploy quickly

Many of you will agree that even with solid testing, canaries, and other validation, sometimes something will be deployed to production that causes problems. Maybe it’s a rare bug exposed by a race condition that only gets triggered at scale. Whatever the case may be, it’s important that you know how to roll back to a good known state quickly if need be.

In Spinnaker, in the event of a production issue with a new version of your app, rollbacks are possible via the Rollback option under Server Group actions. Rollback will enable the ASG of your choice (usually the previous) and disable the faulty ASG. Spinnaker also supports creating rollback pipelines that can be executed upon pipeline failure or even triggered manually.

Fail a deployment when instances are not coming up healthy

Over the years, a few times we fell into a bad state when a deployment succeeded, and instances came up, but they weren’t healthy and were not in fact able to handle traffic appropriately. The “successful” deployment gave us a false sense of security which quickly vanished when a critical service wasn’t actually healthy, and requests were quickly stacking up and failing, sometimes causing a retry storm and all kinds of havoc. From these experiences, we learned how important it is to fail a deployment when instances come up, but aren’t healthy.

Spinnaker has a flexible means for associating instance health. When an instance is unhealthy, Spinnaker will note it accordingly; what’s more, unhealthy instances won’t receive traffic. In the event of all instances in an ASG being unhealthy, Spinnaker will fail a deployment. To make it easier for operators, Spinnaker also clearly marks instance state as coming up, starting, waiting for discovery, unhealthy, and healthy with different colors, as can be seen below.

For automated deployments, notify the team of impending and completed deployments

Letting people know that a deployment has successfully gone into production is also recommended. This is critical for us operating successfully. Our best practice is to always watch your systems when changes are deployed. When something goes wrong, it’s important to know what has changed and when. For automated deployments, then, it is particularly important to notify the team so that they know to keep an eye on service health. We use Slack channels internally for notifications. In Spinnaker, Pipelines can notify any channel upon completion that operators want to notify.

Automate non-typical deployment situations rather than doing one-off manual work

Every engineer has written one-off scripts for non-typical situations. And many have had situations where those “one-off” situations happen again, and now the rest of the team doesn’t know what the one engineer did in their script that was intended to run once and be thrown away!

Pipelines are an effective means for automating any series of steps, even those steps not executed day-to-day. One example is fashioning a pipeline for an emergency push where parameters are provided to control conditional execution such as skipping deployment windows.

Don’t forget to regularly test your non-typical (and typical) deployment pipelines in non-critical situations!

Use preconditions to verify expected state

We all operate in a world where systems change around us frequently. At Netflix, our hundreds of microservices change on an ongoing basis. Making assumptions, for instance about the state of other systems, can be dangerous. Learning from our own mistakes, we now use preconditions to ensure that assumptions are still valid when we deploy new code or make other changes. This is particularly important for pipelines that execute over long periods of time (possibly from a delay in manual judgement and/or deployment windows). Using precondition stages can verify expected state before potentially destructive actions.

Conclusion

This post summarizes a variety of tips and best practices that we have accumulated over the years at Netflix, often learning from our own mistakes. Our approach is to build tooling around these best practices whenever possible, because we have found that best practices are often followed when it’s easy to “fall into the pit of success” — when manual toil, which is brittle and unpleasant, is kept to a minimum.

Our goal is to always improve the availability of our service. For us, this means keeping the human in the loop when manual judgement is really needed, but not otherwise. By doing this, we target our engineers’ time towards those tasks that improve availability, while also freeing them up to focus on other things when their involvement is not needed.