Translated by @Rays, 2019-02-18

4 Techniques Serverless Platforms Use to Balance Performance and Cost



Abstract: There are two aspects that have been key to the rapid adoption of serverless computing: the performance and the cost model. This article looks at those aspects, the trade-offs, and the opportunity ahead.

Author: Erwin van Eyk

Reviewer: Richard Seroter


Key Takeaways

  • The cost and performance models are two of the key drivers of the popularity of serverless and Function-as-a-Service (FaaS).
  • Cold starts have come down a lot, from multiple seconds to hundreds of milliseconds, but there is still much room for improvement.
  • Various techniques are being used to improve the performance of serverless functions, most of which focus on reducing or avoiding cold starts.
  • These optimizations are not free; each is a trade-off between performance and cost that depends on the requirements of your application.
  • Currently, the closed-source serverless services offered by public clouds give users few options to influence these trade-offs, while open-source FaaS frameworks that can run anywhere (such as Fission) offer full flexibility to tweak them.
  • Serverless computing is not just about paying for the resources that you use; it is about paying only for the performance you actually need.

Serverless computing is for many the logical next step in cloud computing, moving applications to a set of higher-level abstractions and offloading more of the low-level operational work to the cloud provider (regardless of whether that is a public one or an internal infrastructure team). It promises reliable performance on demand while directly linking the pricing to the resources used.

This blog post is a synthesis of a talk that I gave at a couple of conferences in late 2018 (a recording is available of the version I gave at KubeCon China 2018). Being active as both a researcher and a software engineer in the serverless computing domain, my aim is to give you an idea of what is going on under the covers of the current state-of-the-art serverless platforms, especially with regards to performance and how you can influence it.

The Performance and Cost Models

Two aspects have been key to the rapid adoption of serverless computing: the performance model and the cost model.

The Performance Model

Serverless functions are designed to have almost no performance tuning knobs; the performance model is supposed to give the impression of an infinitely scalable, infinitely reliable computer.

However, in reality there are practical limits. For example, all serverless computing systems have the "cold start" problem: the latency of starting a function (more on this later). Even so, a large number of real-world applications find these constraints acceptable.

We can think of the performance model of serverless as abstracting over three important characteristics:

  1. Throughput: the most prominent feature of serverless performance is its fully managed autoscaling. As a user, you do not have to worry about provisioning resources, nor do you have to scale these resources up or down yourself. These concerns are managed by the cloud provider. It comes with the added benefit that you can rely on the (near-)infinite infrastructure resources of the cloud provider.

  2. Availability: similar to autoscaling, you have clear expectations for the availability of your serverless applications. Although this is still a relatively unexamined aspect of serverless computing, you can generally expect your serverless application to have an uptime similar to other cloud services offered by the vendor.

  3. Latency: latency overhead is a hot topic in serverless computing, and it is certainly not yet low enough for most latency-sensitive use cases. Yet, it is still one of the fastest ways to serve workloads without having a permanent deployment. Additionally, recent advances in serverless performance (such as pre-warmed containers or infrastructure resources, cached functions, and more, which we'll explore below) are leading to serverless being applied more and more in latency-sensitive use cases.

The Cost Model

Arguably even more important to the popularity of serverless than performance is its cost model. Any serverless offering ought to have the following three characteristics in its cost model:

  1. No costs when idle: the agreed-upon characteristic of serverless solutions is the usage-based payment model. You pay only for the resources that you actually use for your applications, instead of paying for all resources, used or reserved, as in other cloud models.

  2. No upfront or recurring operational costs: not only do you pay only for the resources that you actually use, you should not have to pay any upfront or recurring fees for operational costs. In other words, if you do not use your serverless application in a month, your cloud costs should be zero.

  3. Granular billing: when your serverless application is in use, you pay at an extremely granular level. You pay for the resources that you actually consumed by the millisecond, instead of by the hour or longer as in traditional cloud models.
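To make the granular-billing point concrete, here is a toy cost comparison in Python. All prices are hypothetical, chosen for easy arithmetic; they are not any provider's actual rates.

```python
# Toy comparison of per-millisecond serverless billing against an always-on
# VM billed per hour. All prices are hypothetical, for illustration only.

def serverless_cost(executions, duration_ms, microdollars_per_ms=1):
    """Pay only for the milliseconds your function actually executes."""
    return executions * duration_ms * microdollars_per_ms / 1_000_000

def vm_cost(hours_reserved, cents_per_hour=10):
    """Pay for every hour the VM is reserved, whether used or idle."""
    return hours_reserved * cents_per_hour / 100

# 10,000 executions of a 50 ms function in a month, vs. a month-long VM:
print(serverless_cost(10_000, 50))  # 0.5  - and zero in an idle month
print(vm_cost(24 * 30))             # 72.0 - paid even if no request arrives
```

The exact numbers are beside the point; what matters is that the serverless bill tracks actual executions, while the VM bill tracks reserved time.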

The Central Trade-off in Serverless Computing

As we can see, these two main aspects of serverless computing are in conflict. As we'll show later in this post, increasing performance requires increasing costs, and reducing costs affects performance. In general, when you think of high-performance systems, you don't expect them to be very cost-effective, and vice versa.

And, of course, we do not suddenly get all kinds of supercomputing resources for free just by adopting serverless. As any critic of serverless computing will point out: there are still servers in serverless computing, and someone still needs to pay for them.

Instead, the point of serverless computing is that its cost and performance model allows you to tie performance directly to a price. This explicit link between cost and performance has forced serverless providers to find techniques to optimize performance within this strict cost model. And you can, too.

A Serverless Platform

Before we dive into the optimizations, it is useful to understand what the most basic Function-as-a-Service (FaaS) platform looks like under the covers, as functions are the building blocks and execution units of serverless computing. Let's review a reference architecture for a 'representative' FaaS platform, which we have been developing in collaboration with a number of companies and universities within the SPEC RG CLOUD group.

Covering the entire reference architecture is worth an article on its own (which we are working on!). For the scope of this article, let's discuss the FaaS part of serverless, focusing on how FaaS functions are executed; we won't be covering the development, build, and monitoring workflows of serverless functions.

Data Model

Starting with the data model, a FaaS platform uses two datastores for the functions:

  1. Function metadata store: holds the small-sized metadata describing each function.
  2. Function store: holds the actual, potentially large, function sources.

The reason for these two conceptual stores is the difference in access patterns: we need to look up the small-sized metadata frequently and as fast as possible, whereas we only need the actual, potentially large, function sources when we deploy a function. In practice, however, some FaaS platforms keep these two conceptual stores in the same database, for example, with the questionable approach of using Docker images as functions.


Figure 1 - The anatomy of the runtime of a FaaS platform.

Execution Model

To be able to deploy and execute these functions we need a set of components, which together make up the runtime of a FaaS platform (Figure 1).

Although event triggering and routing are also a fundamental part of serverless, these are not the concern of the FaaS platform runtime. From the runtime's point of view, there is no difference between events, whether they come from a message queue, an HTTP request, or a modification in a database. All these events arrive at the Router, the component responsible for accepting events and deciding which function should be executed. However, since we only deploy functions when they are needed, it often happens that the Router has to request a function instance to be deployed through the Deployer.

The Deployer component has a single task: it takes the demand for a function together with the function metadata to decide how the function should be deployed. However, the actual deployment of the resources is typically handed off to a Resource Manager.

The Resource Manager is typically a conceptual layer below the FaaS platform, managing the deployment of generic cloud resources, such as containers, networks, and storage. Today this has become synonymous with Kubernetes; in the open-source FaaS space, nearly all platforms can be deployed on Kubernetes. Within our FaaS platform model, Kubernetes (or another resource manager) is responsible for receiving the decision of the Deployer and deploying the resources accordingly. In the process it fetches the needed function sources from the function store to deploy the function instances.

The final product is a Function Instance (also frequently referred to as a Worker): the actual deployed function, capable of executing the function requests it receives from the Router.
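To make the flow above tangible, here is a minimal in-memory sketch of the Router/Deployer interaction. All class and function names are stand-ins for the reference architecture's components, not any platform's actual API.

```python
# Minimal sketch of the runtime described above: the Router forwards a
# request to a live function instance if one exists; otherwise it asks the
# Deployer for one first (the cold-start path). Everything is in-memory.

class Deployer:
    def __init__(self, function_store):
        self.function_store = function_store   # name -> function sources

    def deploy(self, name):
        # In a real platform this hands the decision to a resource manager
        # (e.g. Kubernetes), which fetches the sources and starts a container.
        return self.function_store[name]       # our "instance" is a callable

class Router:
    def __init__(self, deployer):
        self.deployer = deployer
        self.instances = {}                    # name -> live function instance

    def handle_event(self, name, event):
        instance = self.instances.get(name)
        if instance is None:                   # cold start: deploy, then run
            instance = self.deployer.deploy(name)
            self.instances[name] = instance
        return instance(event)                 # warm path: forward directly

store = {"hello": lambda event: f"Hello, {event}!"}
router = Router(Deployer(store))
print(router.handle_event("hello", "world"))   # cold start on the first call
print(router.handle_event("hello", "again"))   # warm: the instance is reused
```

The branch in `handle_event` is exactly where the cold-start problem of the next section lives: everything on the `deploy` path adds latency to the first request.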

Fission: Fast Serverless Computing on Kubernetes

Although the reference architecture is pretty simple, it can be—and in practice is—implemented in a number of different ways. For example, a variety of databases is used to store functions, different resource managers are used, and the communication between the components is implemented anywhere from using HTTP requests, to using message queues, to a central database.

To give you an idea of how this reference architecture is implemented in practice let’s unpack Fission: a popular, open-source platform for fast serverless computing on Kubernetes.


Figure 2 - The runtime architecture of Fission (without advanced features and optimizations).

Excluding all optimizations and more advanced components, Fission’s architecture roughly implements the FaaS reference architecture.

Although it has a number of additional features, such as canary deployments, Record-Replay, and more, Fission's router, too, is primarily concerned with accepting HTTP requests (in Fission, all events are converted to HTTP requests) and routing them to the correct function instances.

A component called the Executor implements the Deployer component of the reference architecture. For its function deployments, it accesses the function metadata store, which is implemented using Kubernetes CRDs (which are generally stored in an etcd cluster).

Fission is built as a Kubernetes-native FaaS platform. It relies heavily on various features of Kubernetes for the bulk of the resource management, with a simple function store (storagesvc) component deployed in the cluster. Function instances are deployed as Kubernetes deployments, allowing them to integrate easily with existing non-serverless Kubernetes deployments, such as microservices or other container-based applications.

Cold Starts

This reference architecture also allows us to address the elephant in the room: cold starts. A cold start is, in essence, the worst-case time that a function execution will take: it cannot take advantage of shortcuts or other optimizations.


Figure 3: The typical lifecycle of a cold start and warm execution.

A cold start typically occurs when a request arrives at the Router without a function instance being available to handle it. The Router has to signal the Deployer to start the deployment of a new function instance, and the Deployer in turn signals the Resource Manager to deploy the resources that comprise the function instance. Only after the function instance is fully deployed can the Router forward the request to the new function instance to be executed. Typically, a cold start takes anywhere from hundreds of milliseconds to multiple seconds in less-optimized platforms.

In contrast, (regular) warm executions are the best-case scenario: a function instance is already completely deployed and ready to handle the request. This allows the router to directly forward the request to the function instance without having to wait for any part of the deployment process. Typically, the latency added by the FaaS platform is a couple of milliseconds.

Why should I care?

Cold starts are not just an artifact of our reference architecture or of Kubernetes-based platforms; they are currently a fundamental characteristic of serverless computing. Reducing cold starts is a hot topic in academic research as well as a prime concern of production-ready FaaS platforms.

Figure 4 - Cold starts of cloud providers over a 7-day period in 2017 (source: Wang et al., Peeking Behind the Curtains of Serverless Platforms)

In the summer of 2018, researchers presented a comprehensive investigation at the USENIX ATC conference into, among other things, the cold start behavior of the FaaS platforms of the major cloud providers (Amazon Web Services, Google Cloud, and Microsoft Azure).

As Figure 4 shows, even on AWS Lambda, the longest-running serverless platform, cold starts are still present, with a minimum cold start latency of around 200 ms. Although this latency is going down as platforms mature, latencies of this magnitude are still significant in any user-facing, latency-sensitive application.

Reducing Cold Starts

FaaS platforms—open-source and hosted—are trying to mitigate these cold starts using a variety of techniques. The most straightforward approach is to minimize the overhead of all components involved in the function execution. For example, AWS recently open-sourced Firecracker, a highly optimized virtualization runtime specifically built to reduce the cold start latency of AWS Lambda and AWS Fargate.

However, reducing the overhead of the components only gets you so far, which is why serverless platforms employ a number of techniques that trade off performance against (added) costs.

Let’s review four of the most used techniques:

  1. Function resource reusing
  2. Function runtime pooling
  3. Function prefetching
  4. Function prewarming

1. Function Resource Reusing

The first optimization might seem a bit redundant, but that is because we take it for granted in today's serverless ecosystem.

Taking a note from functional programming and general computing theory, one execution of a function should never be able to influence another function execution; a function execution is atomic, self-contained, and isolated from other executions.

However, if we stuck strictly to this notion, then to ensure the independence of executions, each function execution would require its own independent set of resources: its own function instance. Each function execution would therefore have to go through a cold start.


Figure 5 - FaaS function executions in theory (left) and in practice (right).

Obviously, this is not ideal in practice, nor do we need such strict performance isolation in most cases. So one of the first trade-offs made in serverless computing is to have function executions share function instances: a function instance handles requests one after the other.

This reusing of function instances leads us to an interesting question: how long should we keep these function instances around before cleaning them up? The answer to this question is the same as with nearly any question in computer science: it depends. Like all of the optimizations in the rest of this article, this optimization involves a trade-off between performance and cost.

To maximize the chances that future function executions can benefit from reusing an existing function instance, the platform could choose to keep the function instance around for a long time. However, the downside of taking this approach to the extreme is that you have no guarantee that these function instances will be needed. So, you might be keeping these function instances alive, taking up resources and costing you money, unnecessarily.

Instead, a cost-focused FaaS platform or user could choose to keep the function instance alive for little or no time at all. This would minimize the operational cost, since the function instances are not kept (idly) around. But, performance will likely be impacted when taking this to the extreme, since few function executions will be able to benefit from existing function instances.

The choices cloud providers make in this trade-off were also investigated in the USENIX ATC publication. The researchers found that all of the big three cloud providers opt to keep function instances alive for a long time, from multiple hours to days (Azure has an estimated average keep-alive of 6 days). Likely, these cloud providers keep function instances around for as long as the resources are not needed by other services.
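To see the trade-off in numbers, here is a toy simulation of the keep-alive policy. The arrival times are made up, and execution time is assumed to be negligible; this is a sketch of the trade-off, not a model of any real platform.

```python
# Toy model of the keep-alive trade-off: a longer keep-alive avoids cold
# starts for closely spaced requests, but accrues idle time that someone
# pays for. Arrival times are illustrative; execution time is ignored.

def simulate(arrivals, keep_alive):
    """arrivals: sorted request times (s); returns (cold_starts, idle_time)."""
    cold_starts, idle_time = 0, 0.0
    last = None
    for t in arrivals:
        if last is None or t - last > keep_alive:
            cold_starts += 1              # the previous instance was reclaimed
            if last is not None:
                idle_time += keep_alive   # ...after idling its full keep-alive
        else:
            idle_time += t - last         # idled only until this request
        last = t
    idle_time += keep_alive               # the final instance idles out too
    return cold_starts, idle_time

arrivals = [0, 10, 100]                   # a quick follow-up, then a late request
print(simulate(arrivals, keep_alive=30))  # (2, 70.0): cheaper, one more cold start
print(simulate(arrivals, keep_alive=200)) # (1, 300.0): no reclaim, more idle cost
```

Sweeping `keep_alive` over a real trace is exactly the analysis a platform (or its operator) has to do to pick a reclaim policy.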

2. Function Runtime Pooling

Next to improving performance by sharing resources during or after function executions, FaaS platforms employ several techniques to optimize performance by sharing resources beforehand. One of these techniques is called function runtime pooling.


Figure 6 - The deployment process of a function instance which consists of a user-defined function and a generic function runtime

This optimization is based on the insight that a function instance is comprised of two distinct parts:

  1. User-provided function: the part which the user provides to the FaaS platform. It contains all the business logic of a specific function.
  2. Runtime: contains all the code that takes care of the plumbing of the function. It ensures that your function can handle requests, and provides monitoring and all other operational aspects. The function runtime is typically provided by the FaaS platform and can be specific to a programming language.

With this notion of a runtime and the actual function, you can also see the deployment of a function as a two-step process. First, in the function runtime deployment, the platform deploys a runtime without a function, resulting in a generic (or unspecialized) runtime. Then, in the second step of the deployment process we deploy the user-defined function onto the generic runtime—specializing it—which results in a function instance.

Using this multistep deployment process, the FaaS platform can now employ a technique that is common in many fields: resource pooling. The idea behind resource pooling is that you create or prepare resources ahead of time to reduce costly creation at run time. A typical example is in multithreaded applications, where thread pools reduce the cost of thread management by creating threads ahead of time and sharing them among multiple actors.


Figure 7 - Pooling function runtimes to reduce cold starts: keeping a pool with generic runtimes around (left), taking out runtimes to deploy function instances quickly (center), and meanwhile rebalancing the pool to the desired state (right).

In serverless computing, this technique can be employed fairly similarly, which was one of the key ideas that led to the creation of Fission, the first open-source FaaS platform to employ runtime pooling.

A FaaS platform employs function runtime pooling by maintaining pools of generic runtimes. Whenever a function instance needs to be deployed, we can take a runtime from this pool. We then only need to perform the last step of the deployment process to create a new function instance, dramatically reducing the cold start. Independent of the function deployment, the FaaS platform can then deploy new runtimes to rebalance the runtime pool.

However, this again leads us to a trade-off, and we need to answer the question: how large do we keep this pool?

Performance-focused FaaS platforms could opt for a larger pool of generic runtimes. This would ensure that even during spikes in the workload, the pool has enough generic runtimes to provide for the deployment of function instances. Yet it also means that the large pool will occupy resources permanently, resulting in larger operational costs.

Therefore, a cost-conscious FaaS platform could opt for a smaller pool, or no runtime pooling at all, ensuring that operational costs are minimized. However, as you decrease the pool size, the chance increases that a sudden burst of requests will require more runtimes than are present in the pool, depleting it. When the runtime pool cannot keep up with demand, subsequent function executions have to wait for a runtime to be deployed, increasing the cold start.
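A runtime pool can be sketched in a few lines. The dicts below stand in for pre-started containers; starting one is the slow step that pooling moves off the critical path. None of this is Fission's actual implementation.

```python
# Sketch of a generic-runtime pool: deploying a function instance becomes
# "take a pre-started runtime and specialize it", which skips the slowest
# part of the cold start. The dicts stand in for real containers.

from collections import deque

class RuntimePool:
    def __init__(self, target_size):
        self.target = target_size
        self.pool = deque(self._start_runtime() for _ in range(target_size))

    def _start_runtime(self):
        # Stand-in for the slow step: a container with the language runtime
        # up and running, but no user function loaded yet.
        return {"specialized": False, "function": None}

    def take(self, function_name):
        if self.pool:                        # fast path: pooled generic runtime
            runtime = self.pool.popleft()
        else:                                # pool depleted: full cold start
            runtime = self._start_runtime()
        runtime.update(specialized=True, function=function_name)
        return runtime

    def rebalance(self):                     # runs in the background, off the
        while len(self.pool) < self.target:  # critical path of requests
            self.pool.append(self._start_runtime())

pool = RuntimePool(target_size=2)
instance = pool.take("resize-image")         # near-instant: only specialization
print(instance["function"], len(pool.pool))  # resize-image 1
pool.rebalance()                             # refill the pool back to 2
```

The `target_size` parameter is precisely the knob discussed above: raise it for burst headroom, lower it to cut idle resource costs.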

3. Function Prefetching

Next to preparing the runtime ahead of time, we can also prepare the deployment of the function itself in advance. With function prefetching, we get the function to the runtime where it is needed faster by caching the function sources nearby.

This optimization might not make much sense if you have just a single cluster with a couple of nodes and a couple of functions. However, in larger enterprise FaaS environments, functions start to depend on large libraries or contain many static assets, and with these they can quickly grow to hundreds of MBs or even GBs in size. At these sizes, even transferring functions between co-located servers can add seconds of delay to your cold start.

Next to large functions, your serverless application might need to be geo-distributed or deployed at the edge (for example, with Cloudflare Workers or AWS Lambda@Edge). Transferring even small functions halfway across the world to the desired location adds hundreds of milliseconds to your cold start.

With function prefetching, we can alleviate this cost of transferring function sources by caching the function sources near the runtimes that will need them.


Figure 8 - A hierarchy of levels to cache functions sources.

There are many options for where to cache the FaaS functions, options which you can view as a hierarchy. At the top of this hierarchy is a single remote store, the actual datastore holding your functions. Below that are several layers, closer and closer to the runtimes, in which you can cache the functions. You can cache functions conservatively, once at the cluster level, or take caching to its extreme, caching (some) functions at the runtimes that might need them in the future.

Ideally, in a world where caching is free, we would cache all functions at all possible locations, ensuring that the function transfer cost is zero. However, in practice, storage is not free, and caching everything everywhere will quickly inflate your costs.

You can take the other extreme in this performance-versus-cost trade-off by caching only minimally, or not at all. This minimizes your cost, but it also means that all functions need to be fetched from remote storage. Especially in FaaS platforms that rely on external providers, such as Docker Hub, to store their functions, this can quickly become a source of performance degradation.
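The hierarchy in Figure 8 boils down to an ordered cache lookup. Here is a sketch; the level names and the archive contents are purely illustrative.

```python
# Sketch of hierarchical function prefetching: look for the function sources
# at the nearest level first, fall back to remote storage, and fill the
# nearer caches on the way back. Level names are illustrative only.

def fetch_sources(name, levels, remote):
    """levels: caches ordered nearest-to-farthest; remote: the real store."""
    for i, cache in enumerate(levels):
        if name in cache:
            source = cache[name]                # hit at level i
            break
    else:
        i, source = len(levels), remote[name]   # miss everywhere: slow fetch
    for cache in levels[:i]:                    # populate the levels we missed
        cache[name] = source
    return source

remote_store = {"thumbnail": b"<function archive bytes>"}
node_cache, cluster_cache = {}, {}
fetch_sources("thumbnail", [node_cache, cluster_cache], remote_store)
print("thumbnail" in node_cache)                # True: the next cold start is local
```

How many levels you populate, and which functions you evict first, is where the cost side of the trade-off comes back in.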

4. Function Prewarming

With most of these optimizations, the cold start will still be there regardless of how much we optimize the process. But can we avoid the cold start entirely?

With prewarming we try exactly this: to avoid cold starts entirely by anticipating the demand for a function and deploying functions ahead of time.

This is not a novel idea. Prewarming (or as it is known in academia: predictive scheduling) has been introduced in many domains: in processors we have the branch predictor, in autoscaling research proactive autoscalers are an active field of research, and in cache management there is the notion of predictive caching.

Prewarming (or predictive scheduling) in FaaS platforms is not much different. Instead of waiting for a request to arrive at the platform before deploying the function (the cold start), in the ideal scenario we perfectly predict that there will be a request arriving at a certain time. This allows us to go through the cold start process ahead of time, completing the deployment of the function just before the request arrives. Instead of going through the entire cold start process, the request can immediately be executed; avoiding the cold start problem in its entirety.

Accurate predictions are difficult

Having a good predictor is key to employing effective prewarming. Yet—like predicting anything—predicting function demand is difficult. In the related, more mature domains, such as in CPU branch prediction and autoscaling, predictive scheduling remains an active field of research.

The approaches to this problem can be subdivided into two categories. With runtime analysis, the platform monitors the runtime behavior of the function and its demand, trying to answer a number of questions: How long do function executions typically take? What kind of pattern do we see in the demand over time? Based on these observations, the platform builds a model of both the function and the demand behavior, which it then uses to predict future executions and prewarm accordingly. The techniques used for runtime analysis vary widely: from simple rule-based predictors, to complex time series analysis, to various types of machine learning.
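As a toy illustration of the runtime-analysis category, a rule-based predictor might trust only regular arrival patterns. The regularity threshold below is an arbitrary assumption, not a recommendation.

```python
# Sketch of a simple rule-based runtime-analysis predictor: if a function's
# recent inter-arrival gaps are regular enough, predict the next arrival so
# the platform can prewarm just before it. The threshold is arbitrary.

from statistics import mean, pstdev

def predict_next_arrival(arrivals, regularity=0.25):
    """Return a predicted next arrival time, or None if demand looks bursty."""
    gaps = [b - a for a, b in zip(arrivals, arrivals[1:])]
    if len(gaps) < 2:
        return None                          # not enough history to model
    avg = mean(gaps)
    if pstdev(gaps) > regularity * avg:      # too irregular: don't prewarm
        return None
    return arrivals[-1] + avg

print(predict_next_arrival([0, 61, 119, 180]))  # predicts 240: prewarm before it
print(predict_next_arrival([0, 5, 200, 202]))   # None: bursty, no prediction
```

Real platforms use far richer models (time series analysis, machine learning), but the shape is the same: a model of past demand, a prediction, and a confidence cut-off.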

The other category of approaches is static analysis. Here the platform exploits the (additional) knowledge it has of a function to decide on accurate times to prewarm. For example, you might know ahead of time that function B will be executed right after function A completes. Or the platform might be aware of a trigger set to execute the function every hour. In general, static analysis provides more reliable predictions, but there is a clear limit on how much you can know about a function ahead of time.

Optimistic vs. conservative prewarming

Not only is predicting function execution difficult, it also involves a trade-off. Since a prediction is always a probability of an event occurring, you have to decide at what threshold the prediction of a function execution needs to result in actual prewarming.

You can be very optimistic about this decision, prewarming functions at the slightest hint that the function will be needed. This is great news for performance, because the chances that functions will be prewarmed increase as you lower this prewarming threshold. However, being optimistic is not great for your costs: a low threshold also means that you will prewarm a lot of functions based on mispredictions, where the expected demand never arrives.

An example of optimistic prewarming, and one of the earliest optimizations in FaaS, is function pinging. Early users of AWS Lambda figured out that they could avoid these pesky cold starts by sending artificial requests to their functions every couple of minutes, preventing AWS from cleaning up the function instances. Despite the downsides and limitations of this approach, it ensured that there would always be a function instance alive, making it an extremely optimistic form of prewarming.
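The cost side of pinging is easy to estimate. In this sketch, the reclaim window and per-invocation price are assumptions for illustration, not documented provider behavior.

```python
# Sketch of the overhead of "function pinging": artificial requests sent just
# inside the provider's reclaim window to keep one instance warm forever.
# The reclaim window and price below are assumptions, not documented values.

def pinging_overhead(reclaim_after_min, days=30, price_per_invocation=2e-7):
    """Wasted invocations (and their cost) to keep one instance always warm."""
    period = reclaim_after_min - 1           # ping just before reclamation
    pings = (24 * 60 // period) * days       # artificial requests per month
    return pings, pings * price_per_invocation

pings, cost = pinging_overhead(reclaim_after_min=15)
print(pings)   # 3060 wasted invocations a month, to dodge cold starts
```

Note that the pings also keep only a single instance warm; they do nothing for concurrent requests that spill over to a second instance.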

You can also take a conservative stance on prewarming. By setting the threshold for prewarming higher, you ensure that fewer resources are wasted on misguided prewarming. However, this comes at the cost of performance: the higher the threshold, the more functions cannot benefit from prewarming because of a lack of certainty in their predictions.

Fission Workflows: Prewarming with Function Compositions

For an example of this conservative prewarming, let me introduce a project that we have been working on for the Fission serverless platform. Fission Workflows is a system for composing your existing FaaS functions into more complex functions, allowing you to reuse functions instead of having to completely rewrite each new function from scratch. It builds on the best practices of the well-established workflow field, allowing you to define your workflows without having to worry about discovery, data transfer, and fault tolerance.


Figure 9 - An example of a workflow, showing parallel and sequential executions.

Since these workflows essentially form a graph of dependencies between the different functions, we know exactly which functions will be needed when. This allows us to predict relatively accurately if and when to prewarm instances and deploy functions ahead of time, anticipating that they will be triggered based on the workflow sequence.

There are a lot of possibilities with predictive prewarming based on function composition. We started with a simple prototype of this predictive prewarming, called horizon-based prewarming: we prewarm all the functions on the 'horizon', which consists of all tasks that will be executed right after the currently executing functions complete.


Figure 10 - Horizon-based prewarming with function executing (yellow), functions prewarmed (blue), functions not started (red).

Figure 10 shows an example of this prewarming. Functions B and C will be prewarmed, because they both depend on currently executing functions. Functions D and E will not be prewarmed, because they depend on functions which have not started executing.

In the ideal case, even with a simple algorithm like this, you can effectively reduce the number of cold starts in your function compositions to one (for the first function).
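Horizon-based prewarming falls out almost directly from the workflow's dependency graph. Here is a sketch; the graph mirrors the example in Figure 10, and the task names are illustrative.

```python
# Sketch of horizon-based prewarming: prewarm exactly the tasks all of whose
# dependencies are currently executing or already finished. The graph below
# mirrors the example workflow in Figure 10.

def prewarm_horizon(deps, running, done=frozenset()):
    """deps: task -> set of tasks it depends on; returns the tasks to prewarm."""
    active = set(running) | set(done)
    return {
        task for task, parents in deps.items()
        if task not in active and parents and parents <= active
    }

# A runs first; B and C depend on A; D depends on B; E depends on C.
deps = {"A": set(), "B": {"A"}, "C": {"A"}, "D": {"B"}, "E": {"C"}}
print(sorted(prewarm_horizon(deps, running={"A"})))  # ['B', 'C']: D, E stay cold
```

As the workflow progresses, the horizon advances with it: once B and C are running, D and E become the new prewarm candidates.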

Conclusion

By allowing a FaaS platform to handle the full lifecycle of your functions or applications, the serverless platform gains a lot of insight into your workload and control of the resources employed. This allows FaaS platforms to use various techniques to improve performance, which would be less effective or even impossible in other, more traditional, cloud models.

The major serverless platforms offered by the public cloud providers continuously work to optimize these trade-offs of cost and performance for the average user. However, these providers do not give you any opportunity to make your own trade-offs. Aside from changing the memory and CPU requirements of your functions, which can also impact performance, none of the major cloud providers offers you options to, for example, alter pool sizes or cooldown durations in exchange for higher or lower costs.

In that respect, open-source serverless platforms such as Fission are interesting, since they give you the freedom to tweak all of these trade-offs to fit your specific use case. The cost savings might be less explicit when you run your own serverless applications on-premises or on your own cloud infrastructure, without using the managed serverless services of the likes of AWS. But in the end, your trade-offs still result in increased or decreased resource usage, which impacts datacenter/cloud costs and infrastructure utilization.

Being able to make these trade-offs leads us to one of the most promising aspects of serverless computing: serverless computing is not just about paying for the resources that you use; it is about only paying for the performance you actually need.


About the Author

Erwin van Eyk works at the intersection between industry and academia. As a software engineer at Platform9, he contributes to Fission: an open-source, Kubernetes-native, Serverless platform. At the same time, he is a researcher investigating “Function Scheduling and Composition in FaaS Deployments” in the International Research Team @large at the Delft University of Technology. As a part of this, he leads the industry and academia combined serverless research effort at the SPEC Cloud Research Group.

Read the original English article: 4 Techniques Serverless Platforms Use to Balance Performance and Cost
