Amazon SageMaker Serverless Inference allows you to serve model inference requests in real time without having to explicitly provision compute instances or configure scaling policies to handle traffic variations. You can let AWS handle the undifferentiated heavy lifting of managing the underlying infrastructure and save costs in the process. A Serverless Inference endpoint spins up the relevant infrastructure, including compute, storage, and network, to stage your container and model for on-demand inference. You simply select the amount of memory to allocate and the maximum number of concurrent invocations to have a production-ready endpoint that services inference requests.
With on-demand serverless endpoints, if your endpoint doesn’t receive traffic for a while and then suddenly receives new requests, it can take some time for your endpoint to spin up the compute resources to process the requests. This is called a cold start. A cold start can also occur if your concurrent requests exceed the current concurrent request usage. With provisioned concurrency on Serverless Inference, you can mitigate cold starts and get predictable performance characteristics for your workloads. You can add provisioned concurrency to your serverless endpoints, and for the predefined amount of provisioned concurrency, Amazon SageMaker will keep the endpoints warm and ready to respond to requests instantaneously. In addition, you can now use Application Auto Scaling with provisioned concurrency to adjust to inference traffic dynamically based on target metrics or a schedule.
In this post, we discuss what provisioned concurrency and Application Auto Scaling are, how to use them, and some best practices and guidance for your inference workloads.
With provisioned concurrency on Serverless Inference endpoints, SageMaker manages the infrastructure so that it can serve multiple concurrent requests without incurring cold starts. SageMaker uses the ProvisionedConcurrency value specified in your endpoint configuration, which applies when you create or update an endpoint. With provisioned concurrency enabled on the serverless endpoint, you can expect SageMaker to serve the number of requests you have set without a cold start. See the following code:
endpoint_config_response_pc = client.create_endpoint_config(
    EndpointConfigName=xgboost_epc_name_pc,
    ProductionVariants=[
        {
            "VariantName": "byoVariant",
            "ModelName": model_name,
            "ServerlessConfig": {
                "MemorySizeInMB": 4096,
                "MaxConcurrency": 1,
                # Provisioned Concurrency value setting example
                "ProvisionedConcurrency": 1,
            },
        },
    ],
)

By understanding your workloads and knowing how many cold starts you want to mitigate, you can set this to a preferred value.
Serverless Inference with provisioned concurrency also supports Application Auto Scaling, which allows you to dynamically set the amount of provisioned concurrency based on your traffic profile or a schedule to optimize costs. This is configured through a scaling policy applied to the endpoint.
To specify the metrics and target values for a scaling policy, you can configure a target-tracking scaling policy. Define the scaling policy as a JSON block in a text file. You can then use that text file when invoking the AWS Command Line Interface (AWS CLI) or the Application Auto Scaling API. To define a target-tracking scaling policy for a serverless endpoint, use the SageMakerVariantProvisionedConcurrencyUtilization predefined metric:
{
    "TargetValue": 0.5,
    "PredefinedMetricSpecification": {
        "PredefinedMetricType": "SageMakerVariantProvisionedConcurrencyUtilization"
    },
    "ScaleOutCooldown": 1,
    "ScaleInCooldown": 1
}

You can also specify a scaling policy based on a schedule (for example, every day at 12:15 PM UTC). If the current capacity is below the value specified for MinCapacity, Application Auto Scaling scales out to the value specified by MinCapacity. The following code is an example of how to set this via the AWS CLI:
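As a sketch of how the target-tracking policy above might be applied programmatically (the endpoint and variant names, and the policy name, are illustrative; invoking the function requires AWS credentials), you could register the scalable target and attach the policy with Boto3:

```python
import json

# The same target-tracking policy shown above
policy = {
    "TargetValue": 0.5,
    "PredefinedMetricSpecification": {
        "PredefinedMetricType": "SageMakerVariantProvisionedConcurrencyUtilization"
    },
    "ScaleOutCooldown": 1,
    "ScaleInCooldown": 1,
}

def apply_scaling_policy(endpoint_name, variant_name, min_capacity=1, max_capacity=10):
    """Register the endpoint variant as a scalable target and attach the
    target-tracking policy. Names are placeholders; requires AWS credentials."""
    import boto3

    client = boto3.client("application-autoscaling")
    resource_id = f"endpoint/{endpoint_name}/variant/{variant_name}"
    client.register_scalable_target(
        ServiceNamespace="sagemaker",
        ResourceId=resource_id,
        ScalableDimension="sagemaker:variant:DesiredProvisionedConcurrency",
        MinCapacity=min_capacity,
        MaxCapacity=max_capacity,
    )
    client.put_scaling_policy(
        PolicyName="ProvisionedConcurrencyTargetTracking",
        ServiceNamespace="sagemaker",
        ResourceId=resource_id,
        ScalableDimension="sagemaker:variant:DesiredProvisionedConcurrency",
        PolicyType="TargetTrackingScaling",
        TargetTrackingScalingPolicyConfiguration=policy,
    )

print(json.dumps(policy, indent=2))
```

With a target value of 0.5, Application Auto Scaling adds provisioned concurrency once about half of the currently provisioned capacity is in use, and scales it back in when utilization drops.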
aws application-autoscaling put-scheduled-action \
    --service-namespace sagemaker \
    --schedule 'cron(15 12 * * ? *)' \
    --scheduled-action-name 'ScheduledScalingTest' \
    --resource-id endpoint/MyEndpoint/variant/MyVariant \
    --scalable-dimension sagemaker:variant:DesiredProvisionedConcurrency \
    --scalable-target-action 'MinCapacity=10'

With Application Auto Scaling, you can ensure that your workloads can mitigate cold starts, meet business objectives, and optimize cost in the process.
You can monitor your endpoints and their provisioned concurrency-specific metrics using Amazon CloudWatch. There are four metrics to focus on that are specific to provisioned concurrency:
By monitoring and making decisions based on these metrics, you can tune your configuration with cost and performance in mind and optimize your SageMaker Serverless Inference endpoint.
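As a sketch of how such monitoring might look with Boto3 (the CloudWatch metric name below is an assumption based on the predefined auto scaling metric; verify the exact metric and dimension names published for your endpoint in the CloudWatch console), you could pull recent utilization datapoints and summarize them:

```python
from datetime import datetime, timedelta, timezone

def summarize_datapoints(datapoints):
    """Average the 'Average' statistic across CloudWatch datapoints."""
    values = [dp["Average"] for dp in datapoints]
    return sum(values) / len(values) if values else 0.0

def fetch_utilization(endpoint_name, variant_name, hours=1):
    """Fetch recent provisioned concurrency utilization for an endpoint variant.
    Requires AWS credentials; the metric name is an assumption -- confirm it in
    the CloudWatch console before relying on it."""
    import boto3

    cw = boto3.client("cloudwatch")
    now = datetime.now(timezone.utc)
    resp = cw.get_metric_statistics(
        Namespace="AWS/SageMaker",
        MetricName="ServerlessProvisionedConcurrencyUtilization",  # assumed name
        Dimensions=[
            {"Name": "EndpointName", "Value": endpoint_name},
            {"Name": "VariantName", "Value": variant_name},
        ],
        StartTime=now - timedelta(hours=hours),
        EndTime=now,
        Period=300,
        Statistics=["Average"],
    )
    return summarize_datapoints(resp["Datapoints"])

# Pure-helper example with sample datapoints
sample = [{"Average": 0.4}, {"Average": 0.6}]
print(summarize_datapoints(sample))
```

If average utilization stays near zero, you may be over-provisioned; if it stays near 1, requests may spill over to on-demand capacity and incur cold starts.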
For SageMaker Serverless Inference, you can choose either a SageMaker-provided container or bring your own. SageMaker provides containers for its built-in algorithms and prebuilt Docker images for some of the most common machine learning (ML) frameworks, such as Apache MXNet, TensorFlow, PyTorch, and Chainer. For a list of available SageMaker images, see Available Deep Learning Containers Images. If you’re bringing your own container, you must modify it to work with SageMaker. For more information about bringing your own container, see Adapting Your Own Inference Container.
Creating a serverless endpoint with provisioned concurrency is a very similar process to creating an on-demand serverless endpoint. For this example, we use a model trained with the SageMaker built-in XGBoost algorithm. We work with the Boto3 Python SDK to create three SageMaker inference entities: a model, an endpoint configuration, and an endpoint.
In this post, we don’t cover the training and SageMaker model creation; you can find all these steps in the complete notebook. We focus primarily on how you can specify provisioned concurrency in the endpoint configuration and compare performance metrics for an on-demand serverless endpoint with a provisioned concurrency enabled serverless endpoint.
In the endpoint configuration, you can specify the serverless configuration options. For Serverless Inference, there are two required inputs, MemorySizeInMB and MaxConcurrency, and they can be configured to meet your use case.
For this example, we create two endpoint configurations: one for an on-demand serverless endpoint and one for a provisioned concurrency enabled serverless endpoint. You can see an example of both configurations in the following code:
xgboost_epc_name_pc = "xgboost-serverless-epc-pc" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
xgboost_epc_name_on_demand = "xgboost-serverless-epc-on-demand" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())

endpoint_config_response_pc = client.create_endpoint_config(
    EndpointConfigName=xgboost_epc_name_pc,
    ProductionVariants=[
        {
            "VariantName": "byoVariant",
            "ModelName": model_name,
            "ServerlessConfig": {
                "MemorySizeInMB": 4096,
                "MaxConcurrency": 1,
                # Providing Provisioned Concurrency in EPC
                "ProvisionedConcurrency": 1,
            },
        },
    ],
)

endpoint_config_response_on_demand = client.create_endpoint_config(
    EndpointConfigName=xgboost_epc_name_on_demand,
    ProductionVariants=[
        {
            "VariantName": "byoVariant",
            "ModelName": model_name,
            "ServerlessConfig": {
                "MemorySizeInMB": 4096,
                "MaxConcurrency": 1,
            },
        },
    ],
)

print("Endpoint Configuration Arn Provisioned Concurrency: " + endpoint_config_response_pc["EndpointConfigArn"])
print("Endpoint Configuration Arn On Demand Serverless: " + endpoint_config_response_on_demand["EndpointConfigArn"])

For the SageMaker Serverless Inference endpoint with provisioned concurrency, you also need to set the ProvisionedConcurrency parameter, as reflected in the preceding code.
We use our two different endpoint configurations to create two endpoints: an on-demand serverless endpoint with no provisioned concurrency enabled and a serverless endpoint with provisioned concurrency enabled. See the following code:
endpoint_name_pc = "xgboost-serverless-ep-pc" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
create_endpoint_response = client.create_endpoint(
    EndpointName=endpoint_name_pc,
    EndpointConfigName=xgboost_epc_name_pc,
)
print("Endpoint Arn Provisioned Concurrency: " + create_endpoint_response["EndpointArn"])

endpoint_name_on_demand = "xgboost-serverless-ep-on-demand" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
create_endpoint_response = client.create_endpoint(
    EndpointName=endpoint_name_on_demand,
    EndpointConfigName=xgboost_epc_name_on_demand,
)
print("Endpoint Arn On Demand Serverless: " + create_endpoint_response["EndpointArn"])
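Both endpoints take a few minutes to deploy, so before invoking them you typically wait until DescribeEndpoint reports a status of InService. A minimal polling sketch (the describe function is injected so the loop is easy to exercise; with Boto3 you would pass `lambda name: client.describe_endpoint(EndpointName=name)`):

```python
import time

def wait_until_in_service(describe_fn, endpoint_name, poll_seconds=15, timeout_seconds=900):
    """Poll until the endpoint status is 'InService'.

    describe_fn(endpoint_name) should return a dict with an 'EndpointStatus'
    key, e.g. lambda name: client.describe_endpoint(EndpointName=name) with a
    Boto3 SageMaker client.
    """
    deadline = time.time() + timeout_seconds
    while time.time() < deadline:
        status = describe_fn(endpoint_name)["EndpointStatus"]
        if status == "InService":
            return status
        if status == "Failed":
            raise RuntimeError(f"Endpoint {endpoint_name} failed to deploy")
        time.sleep(poll_seconds)
    raise TimeoutError(f"Endpoint {endpoint_name} not InService after {timeout_seconds}s")

# Example with a fake describe function that reaches InService on the second poll
statuses = iter(["Creating", "InService"])
fake_describe = lambda name: {"EndpointStatus": next(statuses)}
print(wait_until_in_service(fake_describe, "my-endpoint", poll_seconds=0))
```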
Next, we can invoke both endpoints with the same payload:
%%time
# On-demand serverless endpoint test
response = runtime.invoke_endpoint(
    EndpointName=endpoint_name_on_demand,
    Body=b".345,0.224414,.131102,0.042329,.279923,-0.110329,-0.099358,0.0",
    ContentType="text/csv",
)
print(response["Body"].read())

%%time
# Provisioned concurrency endpoint test
response = runtime.invoke_endpoint(
    EndpointName=endpoint_name_pc,
    Body=b".345,0.224414,.131102,0.042329,.279923,-0.110329,-0.099358,0.0",
    ContentType="text/csv",
)
print(response["Body"].read())

When timing both cells for the first request, we immediately notice a drastic improvement in end-to-end latency for the provisioned concurrency enabled serverless endpoint. To validate this, we can send five requests to each endpoint with 10-minute intervals between each request. With the 10-minute gap, we can ensure that the on-demand endpoint is cold, which lets us fairly compare cold start performance between the on-demand and provisioned concurrency serverless endpoints. See the following code:
import time
import numpy as np

print("Testing cold start for serverless inference with PC vs no PC")
pc_times = []
non_pc_times = []

# ~50 minutes total
for i in range(5):
    time.sleep(600)

    start_pc = time.time()
    pc_response = runtime.invoke_endpoint(
        EndpointName=endpoint_name_pc,
        Body=b".345,0.224414,.131102,0.042329,.279923,-0.110329,-0.099358,0.0",
        ContentType="text/csv",
    )
    end_pc = time.time() - start_pc
    pc_times.append(end_pc)

    start_no_pc = time.time()
    response = runtime.invoke_endpoint(
        EndpointName=endpoint_name_on_demand,
        Body=b".345,0.224414,.131102,0.042329,.279923,-0.110329,-0.099358,0.0",
        ContentType="text/csv",
    )
    end_no_pc = time.time() - start_no_pc
    non_pc_times.append(end_no_pc)

pc_cold_start = np.mean(pc_times)
non_pc_cold_start = np.mean(non_pc_times)

print("Provisioned Concurrency Serverless Inference Average Cold Start: {}".format(pc_cold_start))
print("On Demand Serverless Inference Average Cold Start: {}".format(non_pc_cold_start))

We can then plot these average end-to-end latency values across five requests and see that the average cold start for provisioned concurrency was approximately 200 milliseconds end to end as opposed to nearly 6 seconds with the on-demand serverless endpoint.
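The plotting step itself isn't shown in the snippet above; a minimal sketch using matplotlib might look like the following (the latency values here are sample numbers consistent with the averages described, not measured results):

```python
from statistics import mean

# Sample per-request latencies in seconds (illustrative only, not measured results)
pc_times = [0.21, 0.19, 0.20, 0.22, 0.18]
non_pc_times = [5.8, 6.1, 5.9, 6.0, 6.2]

pc_avg = mean(pc_times)
non_pc_avg = mean(non_pc_times)

def plot_cold_start(pc_avg, non_pc_avg, path="cold_start_comparison.png"):
    """Save a bar chart comparing average end-to-end cold start latency."""
    import matplotlib
    matplotlib.use("Agg")  # render without a display
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.bar(["Provisioned Concurrency", "On-Demand"], [pc_avg, non_pc_avg])
    ax.set_ylabel("Average end-to-end latency (s)")
    fig.savefig(path)

print(round(pc_avg, 2), round(non_pc_avg, 2))
```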
Provisioned concurrency is a cost-effective solution for low-throughput and spiky workloads requiring low latency guarantees. It is suitable when throughput is low and you want to reduce costs compared with instance-based endpoints while still having predictable performance, or for workloads with predictable traffic bursts and low latency requirements. For example, a chatbot application run by a tax filing software company typically sees high demand during the last week of March from 10:00 AM to 5:00 PM because it’s close to the tax filing deadline. You can choose on-demand Serverless Inference for the rest of the year to serve requests from end-users, but for the last week of March, you can add provisioned concurrency to handle the spike in demand. As a result, you can reduce costs during idle time while still meeting your performance goals.
On the other hand, if your inference workload is steady, has high throughput (enough traffic to keep the instances saturated and busy), has a predictable traffic pattern, and requires ultra-low latency, or it includes large or complex models that require GPUs, Serverless Inference isn’t the right option for you, and you should use real-time inference instead. Synchronous use cases with burst behavior that don’t require performance guarantees are better suited to on-demand Serverless Inference. The traffic patterns and the right hosting option (serverless or real-time inference) are depicted in the following figures:
Use the following factors to determine which hosting option (real-time inference, on-demand Serverless Inference, or Serverless Inference with provisioned concurrency) is right for your ML workloads:
The following figure illustrates this decision tree.
With Serverless Inference with provisioned concurrency, you should still adhere to best practices for workloads that don’t use provisioned concurrency.
In addition, when configuring ProvisionedConcurrency, set this value to the number of concurrent requests you want to serve without cold starts when requests arrive shortly after a period of inactivity. The metrics in CloudWatch can help you tune this value to be optimal for your workload.
As with on-demand Serverless Inference, when provisioned concurrency is enabled, you pay for the compute capacity used to process inference requests, billed by the millisecond, and the amount of data processed. You also pay for provisioned concurrency usage based on the memory configured, the duration provisioned, and the amount of concurrency enabled.
Pricing can be broken down into two components: provisioned concurrency charges and inference duration charges. For more details, refer to Amazon SageMaker Pricing.
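As a rough sketch of how the two components combine (the rates below are placeholders, not actual SageMaker prices; see the pricing page for current numbers): provisioned concurrency is billed for the time it is configured, whether or not it is used, while inference duration is billed per millisecond of request processing:

```python
# Hypothetical rates -- NOT actual SageMaker prices; check the pricing page.
PC_RATE_PER_GB_SECOND = 0.000010        # provisioned concurrency, per GB-second configured
DURATION_RATE_PER_GB_SECOND = 0.000020  # inference duration, per GB-second of processing

def endpoint_cost(memory_gb, provisioned_concurrency, hours_provisioned,
                  requests, avg_duration_ms):
    """Estimate the two cost components for a provisioned concurrency endpoint."""
    # Charged for provisioned capacity for as long as it is configured
    pc_seconds = hours_provisioned * 3600 * provisioned_concurrency
    pc_cost = pc_seconds * memory_gb * PC_RATE_PER_GB_SECOND
    # Charged per millisecond of actual request processing
    duration_seconds = requests * (avg_duration_ms / 1000.0)
    duration_cost = duration_seconds * memory_gb * DURATION_RATE_PER_GB_SECOND
    return pc_cost, duration_cost

# Example: 4 GB endpoint, 1 provisioned concurrency for 8 hours on 22 business days,
# 100,000 requests averaging 100 ms each
pc_cost, duration_cost = endpoint_cost(
    memory_gb=4, provisioned_concurrency=1, hours_provisioned=8 * 22,
    requests=100_000, avg_duration_ms=100,
)
print(round(pc_cost, 2), round(duration_cost, 2))
```

This kind of back-of-the-envelope calculation can help you decide how many hours of provisioned concurrency (for example, only during predictable peak windows) are worth paying for.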
SageMaker Serverless Inference with provisioned concurrency provides a powerful capability for workloads where cold starts need to be mitigated and managed. With this capability, you can better balance cost and performance characteristics while providing a better experience for your end-users. We encourage you to consider whether provisioned concurrency with Application Auto Scaling is a good fit for your workloads, and we look forward to your feedback in the comments!
Stay tuned for follow-up posts where we will provide more insight into the benefits, best practices, and cost comparisons of using Serverless Inference with provisioned concurrency.
James Park is a Solutions Architect at Amazon Web Services. He works with Amazon.com to design, build, and deploy technology solutions on AWS, and has a particular interest in AI and machine learning. In his spare time he enjoys seeking out new cultures, new experiences, and staying up to date with the latest technology trends. You can find him on LinkedIn.
Dhawal Patel is a Principal Machine Learning Architect at AWS. He has worked with organizations ranging from large enterprises to mid-sized startups on problems related to distributed computing and artificial intelligence. He focuses on deep learning, including NLP and computer vision domains. He helps customers achieve high-performance model inference on Amazon SageMaker.
Ram Vegiraju is an ML Architect with the SageMaker Service team. He focuses on helping customers build and optimize their AI/ML solutions on Amazon SageMaker. In his spare time, he loves traveling and writing.
Rupinder Grewal is a Sr. AI/ML Specialist Solutions Architect with AWS. He currently focuses on serving of models and MLOps on SageMaker. Prior to this role, he worked as a Machine Learning Engineer building and hosting models. Outside of work he enjoys playing tennis and biking on mountain trails.
Rishabh Ray Chaudhury is a Senior Product Manager with Amazon SageMaker, focusing on Machine Learning inference. He is passionate about innovating and building new experiences for Machine Learning customers on AWS to help scale their workloads. In his spare time, he enjoys traveling and cooking. You can find him on LinkedIn.
Shruti Sharma is a Sr. Software Development Engineer on the AWS SageMaker team. Her current work focuses on helping developers efficiently host machine learning models on Amazon SageMaker. In her spare time she enjoys traveling, skiing and playing chess. You can find her on LinkedIn.
Hao Zhu is a Software Development Engineer with Amazon Web Services. In his spare time he loves to hit the slopes and ski. He also enjoys exploring new places, trying different foods, experiencing different cultures and is always up for a new adventure.