New generations of CPUs offer a significant performance improvement in machine learning (ML) inference due to specialized built-in instructions. Combined with their flexibility, high speed of development, and low operating cost, these general-purpose processors offer an alternative to other existing hardware solutions.
AWS, Arm, Meta, and others helped optimize the performance of PyTorch 2.0 inference for Arm-based processors. As a result, we are delighted to announce that AWS Graviton-based instance inference performance for PyTorch 2.0 is up to 3.5 times the speed of the previous PyTorch release for ResNet50 (see the following graph), and up to 1.4 times the speed for BERT, making Graviton-based instances the fastest compute-optimized instances on AWS for these models.
AWS measured up to 50% cost savings for PyTorch inference with AWS Graviton3-based Amazon Elastic Compute Cloud C7g instances across Torch Hub ResNet50 and multiple Hugging Face models, relative to comparable EC2 instances, as shown in the following figure.
Inference latency is also reduced, as shown in the following figure.
We have seen a similar price-performance advantage for other workloads on Graviton, such as video encoding with FFmpeg.
The optimizations focused on three key areas.
The simplest way to get started is by using the AWS Deep Learning Containers (DLCs) on Amazon Elastic Compute Cloud (Amazon EC2) C7g instances or Amazon SageMaker. DLCs are available on Amazon Elastic Container Registry (Amazon ECR) for AWS Graviton or x86. For more details on SageMaker, refer to Run machine learning inference workloads on AWS Graviton-based instances with Amazon SageMaker and Amazon SageMaker adds eight new Graviton-based instances for model deployment.
To use AWS DLCs, use the following code:
```shell
sudo apt-get update
sudo apt-get -y install awscli docker

# Log in to ECR to avoid image download throttling
aws ecr get-login-password --region us-east-1 \
| docker login --username AWS --password-stdin 763104351884.dkr.ecr.us-east-1.amazonaws.com

# Pull the AWS DLC for PyTorch
# Graviton
docker pull 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference-graviton:2.0.0-cpu-py310-ubuntu20.04-ec2

# x86
docker pull 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference:2.0.0-cpu-py310-ubuntu20.04-ec2
```
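After pulling, you can sanity-check the image before deploying it. The following sketch is an assumption, not part of the official DLC instructions: it inspects the locally pulled Graviton image and, if present, runs it once to print the bundled PyTorch version.

```shell
# Hypothetical smoke test for the pulled Graviton DLC. The inspect guard makes
# this a no-op on machines where Docker or the image isn't available.
IMAGE=763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference-graviton:2.0.0-cpu-py310-ubuntu20.04-ec2
if docker image inspect "${IMAGE}" >/dev/null 2>&1; then
    # Print the PyTorch version baked into the container
    docker run --rm "${IMAGE}" python3 -c 'import torch; print(torch.__version__)'
fi
```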
If you prefer to install PyTorch via pip, install the PyTorch 2.0 wheel from the official repo. In this case, you must set two environment variables before launching PyTorch, as shown in the following code, to activate the Graviton optimizations.
To use the Python wheel, refer to the following code:
```shell
# Install Python
sudo apt-get update
sudo apt-get install -y python3 python3-pip

# Upgrade pip3 to the latest version
python3 -m pip install --upgrade pip

# Install PyTorch and extensions
python3 -m pip install torch
python3 -m pip install torchvision torchaudio torchtext

# Turn on Graviton3 optimization
export DNNL_DEFAULT_FPMATH_MODE=BF16
export LRU_CACHE_CAPACITY=1024
```
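These variables must be in the environment before PyTorch initializes, so if you launch from a Python script rather than a shell that has already exported them, set them before the first `import torch`. A minimal sketch (using `os.environ` this way is one option, not the only one):

```python
import os

# Set the Graviton3 optimization flags before torch is imported.
# DNNL_DEFAULT_FPMATH_MODE=BF16 lets the oneDNN backend use bfloat16
# fast-math kernels; LRU_CACHE_CAPACITY=1024 is the cache size
# recommended in the shell snippet above.
os.environ.setdefault("DNNL_DEFAULT_FPMATH_MODE", "BF16")
os.environ.setdefault("LRU_CACHE_CAPACITY", "1024")

# import torch  # safe to import now: it will pick up the flags above
```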
You can use PyTorch TorchBench to measure the CPU inference performance improvements, or to compare different instance types:
```shell
# Pre-requisite:
# pull and run the AWS DLC
# or
# pip install the PyTorch 2.0 wheels and set the previously mentioned environment variables

# Clone the PyTorch benchmark repo
git clone https://github.com/pytorch/benchmark.git

# Set up the Resnet50 benchmark
cd benchmark
python3 install.py resnet50

# Install the dependent wheels
python3 -m pip install numba

# Run Resnet50 inference in jit mode. On successful completion of the inference runs,
# the script prints the inference latency and accuracy results
python3 run.py resnet50 -d cpu -m jit -t eval --use_cosine_similarity
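If you want a quick latency number outside TorchBench, a small timing harness is often enough. The sketch below is an illustration, not part of TorchBench: it is framework-agnostic, and the stand-in workload at the bottom would be replaced by your model call (for example, a jit-traced ResNet50 applied to a fixed input batch).

```python
import time
import statistics

def measure_latency_ms(fn, warmup=10, iters=50):
    """Return the median latency, in milliseconds, of calling fn()."""
    # Warm-up runs let caches and any JIT compilation settle before measuring
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1e3)
    return statistics.median(samples)

# Stand-in workload; replace with e.g. `lambda: traced_model(batch)`
latency = measure_latency_ms(lambda: sum(i * i for i in range(10_000)))
print(f"median latency: {latency:.3f} ms")
```

Reporting the median rather than the mean makes the number less sensitive to occasional outlier iterations (GC pauses, CPU frequency transitions).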
You can use the Amazon SageMaker Inference Recommender utility to automate performance benchmarking across different instances. With Inference Recommender, you can find the real-time inference endpoint that delivers the best performance at the lowest cost for a given ML model. We collected the preceding data using the Inference Recommender notebooks by deploying the models on production endpoints. For more details on Inference Recommender, refer to the GitHub repo. We benchmarked the following models for this post: ResNet50 image classification, DistilBERT sentiment analysis, RoBERTa fill mask, and RoBERTa sentiment analysis.
AWS measured up to 50% cost savings for PyTorch inference with AWS Graviton3-based Amazon Elastic Compute Cloud C7g instances across Torch Hub ResNet50 and multiple Hugging Face models, relative to comparable EC2 instances. These instances are available on SageMaker and Amazon EC2. The AWS Graviton Technical Guide provides the list of optimized libraries and best practices that will help you achieve cost benefits with Graviton instances across different workloads.
If you find use cases where similar performance gains aren’t observed on AWS Graviton, please open an issue on the AWS Graviton Technical Guide to let us know about it. We will continue to add more performance improvements to make Graviton the most cost-effective and efficient general-purpose processor for inference using PyTorch.
Sunita Nadampalli is a Software Development Manager at AWS. She leads Graviton software performance optimizations for machine learning, HPC, and multimedia workloads. She is passionate about open-source development and delivering cost-effective software solutions with Arm SoCs.