More customers are finding the need to build larger, scalable, and more cost-effective machine learning (ML) inference pipelines in the cloud. Outside of these base prerequisites, the requirements of ML inference pipelines in production vary based on the business use case. A typical inference architecture for applications like recommendation engines, sentiment analysis, and ad ranking needs to serve a large number of models, with a mix of classical ML and deep learning (DL) models. Each model has to be accessible through an application programming interface (API) endpoint and be able to respond within a predefined latency budget from the time it receives a request.
In this post, we describe an inference architecture, developed in collaboration with the Commerce Einstein Team at Salesforce, built on Amazon Elastic Kubernetes Service (Amazon EKS) to not only address the base prerequisites, but also pack thousands of unique DL models in a scalable architecture. We explore a mix of Amazon Elastic Compute Cloud (Amazon EC2) instance families (c5, g4dn, Inf1) to develop an optimal design from a cost and performance perspective. To meet these requirements, we build the DL inference service on Amazon EKS using FastAPI, a lightweight and efficient Python-based API server, and develop a model bin packing strategy to efficiently share compute and memory resources between models. To load test the architecture, we use a natural language processing (NLP) open-source model from huggingface.co (bert-base-cased, approximately 800 MB) and simulate thousands of clients sending simultaneous requests to the service pool. We use AWS Inferentia, the custom ML chip on Inf1 instances, to package and serve 3,000 unique ML models while keeping the cost below $50/hour (On-Demand pricing), with a round trip latency of 70 milliseconds (P90) versus our target of 100 milliseconds. You can extend this architecture and optimization approach to any custom DL model.
The following is a simple, scalable, and highly available architecture based on a standard Amazon EKS infrastructure that can be deployed across Availability Zones.
The Amazon EKS cluster has several node groups, with one EC2 instance family per node group. Each node group can support different instance types, such as CPU (c5), GPU (g4dn), and AWS Inferentia (Inf1), and can pack multiple models per instance to maximize the number of served models. Next, the model serving application and DL framework dependencies are containerized, and these container images are stored on Amazon Elastic Container Registry (Amazon ECR). The container images are deployed to the cluster using deployment and service manifests, customized for each instance family. The model serving application downloads the model artifacts from Amazon Simple Storage Service (Amazon S3) at server initialization, which reduces the size of the container images on Amazon ECR and decouples the model data from the service definition.
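To illustrate this pattern, the following is a minimal sketch of a FastAPI server that pulls traced models from Amazon S3 at startup and routes requests by model name. It is not the exact server in the GitHub repo; the bucket name, key layout, endpoint path, and model count are assumptions made here for illustration.

# Sketch only: assumes TorchScript model artifacts and an illustrative S3 layout.
import os

import boto3
import torch
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from transformers import AutoTokenizer

MODEL_BUCKET = os.environ.get("MODEL_BUCKET", "my-model-artifacts")  # assumed bucket name
NUM_MODELS = int(os.environ.get("NUM_MODELS", "10"))

app = FastAPI()
models = {}
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

class PredictRequest(BaseModel):
    text: str

@app.on_event("startup")
def load_models():
    """Download traced model artifacts from Amazon S3 at server initialization."""
    s3 = boto3.client("s3")
    for i in range(NUM_MODELS):
        name = f"model_{i}"
        local_path = f"/tmp/{name}.pt"
        s3.download_file(MODEL_BUCKET, f"traced/{name}.pt", local_path)  # assumed key layout
        models[name] = torch.jit.load(local_path)

@app.post("/predict/{model_name}")
def predict(model_name: str, request: PredictRequest):
    """Serve one of the packed models, addressed by model name in the URL."""
    if model_name not in models:
        raise HTTPException(status_code=404, detail=f"{model_name} is not loaded on this server")
    inputs = tokenizer(request.text, return_tensors="pt", padding="max_length",
                       truncation=True, max_length=128)
    with torch.no_grad():
        outputs = models[model_name](inputs["input_ids"], inputs["attention_mask"])
    return {"model": model_name, "logits": outputs[0].tolist()}

Because the model data lives in Amazon S3 rather than in the container image, the same image can serve any subset of models by changing only its environment configuration.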
Services like cluster-autoscaler, horizontal-pod-autoscaler, aws-load-balancer-controller, metrics-server, nginx-ingress-controller, neuron-device-plugin-daemonset, and nvidia-device-plugin-daemonset are deployed to the cluster as required. This design relies on Kubernetes CoreDNS for name resolution of the service endpoints within the cluster. Each model is addressable through a combination of DNS name and model name. For example, http://
For customization of the DNS names, we can use an ingress manifest.
The architecture is set up to run in five simple steps: build, trace, pack, deploy, and test. The code is available in the GitHub repo.
# Trace the model for AWS Inferentia (torch.neuron) or for CPU/GPU (torch.jit)
print('\nTracing model ...')
example_inputs = (
    torch.cat([inputs['input_ids']] * batch_size, 0),
    torch.cat([inputs['attention_mask']] * batch_size, 0)
)
os.makedirs(f'traced-{model_name}', exist_ok=True)
torch.set_num_threads(6)
if 'inf' in processor:
    model_traced = torch.neuron.trace(
        model,
        example_inputs,
        verbose=1,
        compiler_workdir=f'./traced-{model_name}/compile_wd_{processor}_bs{batch_size}_seq{sequence_length}_pc{pipeline_cores}',
        compiler_args=['--neuroncore-pipeline-cores', str(pipeline_cores)])
else:
    model_traced = torch.jit.trace(model, example_inputs)
echo "Runtime: $runtime"
echo "Processor: $processor"

if [ "$runtime" == "docker" ]; then
    server=0
    while [ $server -lt $num_servers ]; do
        run_opts="--name ${app_name}-${server} -e NUM_MODELS=$num_models -e POSTPROCESS=$postprocess -e QUIET=$quiet -P"
        if [ "$processor" == "gpu" ]; then
            run_opts="--gpus 0 ${run_opts}"
        fi
        CMD="docker run -d ${run_opts} ${registry}${model_image_name}${model_image_tag}"
        echo "$CMD"
        eval "$CMD"
        server=$((server+1))
    done
elif [ "$runtime" == "kubernetes" ]; then
    kubectl create namespace ${namespace} --dry-run=client -o yaml | kubectl apply -f -
    ./generate-yaml.sh
    kubectl apply -f ${app_dir}
else
    echo "Runtime $runtime not recognized"
fi
Bin packing of ML models across EC2 instances is essential to efficiently share compute and memory resources between models. The task of bin packing the models can be formulated and solved as a 0-1 knapsack problem using combinatorial optimization. The following is the mathematical formulation for bin packing: n models are packed subject to constraints on maximum processor memory (Mmax) and utilization (Cmax) per instance. This approach recommends an optimal bin packing of the models across a minimal set of EC2 instances.
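As a sketch of that formulation (the symbols below are notation assumed here: m_i and c_i are the memory and compute requirements of model i, x_ij = 1 if model i is placed on instance j, and y_j = 1 if instance j is used), the problem can be written as:

\[
\begin{aligned}
\min \quad & \sum_{j=1}^{n_{\mathrm{bins}}} y_j \\
\text{subject to} \quad & \sum_{j=1}^{n_{\mathrm{bins}}} x_{ij} = 1, \quad i = 1, \dots, n_{\mathrm{models}} \\
& \sum_{i=1}^{n_{\mathrm{models}}} m_i \, x_{ij} \le M_{\mathrm{max}} \, y_j, \quad j = 1, \dots, n_{\mathrm{bins}} \\
& \sum_{i=1}^{n_{\mathrm{models}}} c_i \, x_{ij} \le C_{\mathrm{max}} \, y_j, \quad j = 1, \dots, n_{\mathrm{bins}} \\
& x_{ij}, \; y_j \in \{0, 1\}
\end{aligned}
\]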
The number of bins is initially oversized so that the problem starts with a trivial solution of one model per bin, and is then pruned to reach the target with a minimal number of bins.
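The following is a minimal sketch of such a solver using the open-source PuLP library (PuLP was not necessarily used in the original architecture; the function and variable names here are illustrative). It oversizes the number of candidate bins to one per model and lets the objective prune the unused bins:

from pulp import LpBinary, LpMinimize, LpProblem, LpVariable, lpSum

def pack_models(mem, cpu, mem_max, cpu_max):
    """Assign each model to an instance (bin) so that no instance exceeds
    mem_max or cpu_max, minimizing the number of instances used."""
    n = len(mem)
    models, bins = range(n), range(n)       # oversize: one candidate bin per model
    prob = LpProblem("model_bin_packing", LpMinimize)
    y = LpVariable.dicts("bin_used", bins, cat=LpBinary)
    x = LpVariable.dicts("assign", (models, bins), cat=LpBinary)
    prob += lpSum(y[j] for j in bins)       # minimize the number of bins used
    for i in models:                        # each model is placed in exactly one bin
        prob += lpSum(x[i][j] for j in bins) == 1
    for j in bins:                          # memory and compute capacity per bin
        prob += lpSum(mem[i] * x[i][j] for i in models) <= mem_max * y[j]
        prob += lpSum(cpu[i] * x[i][j] for i in models) <= cpu_max * y[j]
    prob.solve()
    return [[i for i in models if x[i][j].value() > 0.5]
            for j in bins if y[j].value() > 0.5]

if __name__ == "__main__":
    # Illustrative values: per-model memory (GB) and compute (% utilization)
    mem = [0.8, 1.2, 0.8, 2.0, 0.8, 1.5]
    cpu = [10, 15, 10, 25, 10, 20]
    print(pack_models(mem, cpu, mem_max=7.4, cpu_max=83))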
The following visualization shows a sample bin packing allocation for 78 models, each with unique memory and compute requirements. In this example, 78 models were packed into 23 instances (g4dn, Inf1) with a specified target maximum memory (7.4 GB) and compute capacity (83%). The color legend indicates the model index.
We can use the preceding method to optimally bin pack models across instances. The following table summarizes bin packing results for Inf1 and g4dn instances. We chose these instance families because the transformer-based NLP model requires hardware acceleration to achieve the expected latency. We were able to bin pack more models on inf1.6xlarge (48 GiB memory) than on g4dn.12xlarge (192 GiB memory) because the Neuron compiler's Auto Casting capability automatically casts FP32 models to bfloat16 to maximize throughput.
| Model | EC2 Instance | Server Type | Number of Models Bin Packed per Instance | Price per Instance (On-Demand), $/hr | Price per Model-hour ($) |
| --- | --- | --- | --- | --- | --- |
| bert-base-cased | inf1.2xlarge | FastAPI | 24 | 0.362 | 0.015 |
| bert-base-cased | g4dn.xlarge | FastAPI | 18 | 0.526 | 0.029 |
| bert-base-cased | inf1.xlarge | FastAPI | 11 | 0.228 | 0.020 |
| bert-base-cased | inf1.6xlarge | FastAPI | 76 | 1.180 | 0.015 |
| bert-base-cased | g4dn.12xlarge | FastAPI | 72 | 3.912 | 0.054 |
To load test at scale, we simulated over 40 clients sending simultaneous requests to the test pool. Loading the system with more user requests increases throughput at the expense of latency; the tests were designed to sweep the throughput-latency curve to find data points with optimized resource usage. We measured throughput and latency (P50, P90, P95) and summarized the results across four metrics: latency, throughput (inferences per second), number of models served per instance, and cost. In addition, we created tests to simulate single sequential and single random requests by using curl to send a GET or POST request to each model in the test pool. The purpose of these experiments was to measure the best-case latency we can expect, as a baseline.
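As an illustration of this kind of client simulation, the following is a rough sketch only; the service DNS name, endpoint path, model count, and request counts here are assumptions rather than the actual test harness.

import random
import time
from concurrent.futures import ThreadPoolExecutor
from statistics import quantiles

import requests

BASE_URL = "http://model-service.example.internal"  # assumed service DNS name
NUM_MODELS = 144          # models in the test pool
NUM_CLIENTS = 40          # simultaneous clients
REQUESTS_PER_CLIENT = 100
PAYLOAD = {"text": "Packing thousands of models on Amazon EKS"}

def client_worker(_):
    """One simulated client sending requests to randomly chosen models."""
    latencies = []
    for _ in range(REQUESTS_PER_CLIENT):
        model = f"model_{random.randrange(NUM_MODELS)}"
        start = time.perf_counter()
        requests.post(f"{BASE_URL}/predict/{model}", json=PAYLOAD, timeout=5)
        latencies.append((time.perf_counter() - start) * 1000)  # milliseconds
    return latencies

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=NUM_CLIENTS) as pool:
    all_latencies = [l for result in pool.map(client_worker, range(NUM_CLIENTS)) for l in result]
elapsed = time.perf_counter() - start

cuts = quantiles(all_latencies, n=100)  # 99 percentile cut points
print(f"throughput: {len(all_latencies) / elapsed:.1f} req/s, "
      f"P50 {cuts[49]:.1f} ms, P90 {cuts[89]:.1f} ms, P95 {cuts[94]:.1f} ms")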
We ran several experiments to find the optimal model packing across minimal sets of EC2 instances that would result in the highest performance at the lowest cost. The following table summarizes some of these test results. The best results were observed when using 40 inf1.6xl instances with DNS caching enabled, serving 3,040 models with a throughput of 6,230 requests per second in 66 milliseconds (P90 latency) at a cost of $47.2/hour (On-Demand). The best mixed instance deployment, using 32 inf1.6xl and 21 g4dn.12xl instances, resulted in 3,048 models served, but with much lower throughput at 248 requests per second and an increased cost of $119.91/hour. Although we didn't use EC2 Spot Instances as a lever for cost optimization in this architecture, we highly recommend using Spot if your inference workload is time flexible and fault tolerant. The following table summarizes our observations across instance types.
| Exp. # | Instances (num x type) | Models (num) | Sequential Response (ms) | Random Response (ms) | Throughput (req/s) | Latency with Load, P90 (ms) | On-Demand Cost ($/hr) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 3 x inf1.6xl | 144 | 21–32 | 21–30 | 142 | 57 | 3.54 |
| 2 | 5 x inf1.6xl | 240 | 23–35 | 21–35 | 173 | 56 | 5.9 |
| 3 | 21 x inf1.6xl | 1008 | 23–39 | 23–35 | 218 | 24 | 24.78 |
| 4 | 32 x inf1.6xl | 1536 | 26–33 | 25–37 | 217 | 23 | 37.76 |
| 5 | 4 x g4dn.12xl | 288 | 27–34 | 28–37 | 178 | 30 | 15.64 |
| 6 | 14 x g4dn.12xl | 1008 | 26–35 | 31–41 | 154 | 30 | 54.76 |
| 7 | 32 x inf1.6xl + 21 x g4dn.12xl | 3048 | 27–35 | 24–45 | 248 | 28 | 119.91 |
| 8 | 40 x inf1.6xl | 3002 | 24–31 | 25–38 | 1536 | 33 | 47.2 |
| 9 | 40 x inf1.6xl (with DNS caching) | 3040 | 24–31 | 25–38 | 6230 | 66 | 47.2 |
With this scalable architecture, we were able to scale inference across 3,000 models, meeting our latency target of 100 milliseconds while keeping costs under $50/hour (On-Demand) by efficiently bin packing models across a minimal set of EC2 instances. Across all the tests, Inf1 instances yielded the highest throughput, lowest cost, fastest response time, and maximum bin packing ratio compared to the other instances. AWS customers like Snap, Airbnb, Sprinklr, and many more have been using AWS Inferentia to achieve the highest performance and lowest cost on a variety of deployments. Although the DL model we tested requires hardware acceleration on the Inf1 and g4dn instance types, you can match other model types with different instance types (Inf1, CPU, GPU) and bin pack models accordingly by using the methodology described.
Learn more about the AWS Inferentia chip and EC2 Inf1 instances to get started running your own custom ML pipelines on AWS Inferentia using the Neuron SDK.
Alex Iankoulski is a Principal Solutions Architect with a focus on autonomous workloads using containers. Alex is a hands-on full-stack infrastructure and software architect and has been building platforms using Docker to help accelerate the pace of innovation by applying container technologies to engineering, data science, and AI problems. Over the past 10 years, he has worked on combating climate change, democratizing AI and ML, and making travel safer, healthcare better, and energy smarter.
Mahadevan Balasubramaniam is a Principal Solutions Architect for Autonomous Computing with nearly 20 years of experience in the area of physics-infused deep learning, building and deploying digital twins for industrial systems at scale. Mahadevan obtained his PhD in Mechanical Engineering from Massachusetts Institute of Technology and has over 25 patents and publications to his credit.
Sundar Ranganathan is the Head of Business Development, ML Frameworks on the Amazon EC2 team. He focuses on large-scale ML workloads across AWS services like Amazon EKS, Amazon ECS, Elastic Fabric Adapter, AWS Batch, and Amazon SageMaker. His experience includes leadership roles in product management and product development at NetApp, Micron Technology, Qualcomm, and Mentor Graphics.
Joshua Correa is a Salesforce Principal Member of Technical Staff working on the Commerce Einstein Team. Joshua has a deep passion for building scalable, resilient, and cost-effective infrastructure for machine learning. Joshua enjoys working at the intersection of software engineering and data science to bring cutting-edge models into production.