The size of machine learning (ML) models, such as large language models (LLMs) and foundation models (FMs), is growing fast year over year, and these models need faster and more powerful accelerators, especially for generative AI. AWS Inferentia2 was designed from the ground up to deliver higher performance while lowering the cost of LLM and generative AI inference.
In this post, we show how the second generation of AWS Inferentia builds on the capabilities introduced with AWS Inferentia1 and meets the unique demands of deploying and running LLMs and FMs.
The first generation of AWS Inferentia, a purpose-built accelerator launched in 2019, is optimized to accelerate deep learning inference. AWS Inferentia helped ML users reduce their inference costs and improve their prediction throughput and latency. With AWS Inferentia1, customers saw up to 2.3x higher throughput and up to 70% lower cost per inference than comparable inference-optimized Amazon Elastic Compute Cloud (Amazon EC2) instances.
AWS Inferentia2, featured in the new Amazon EC2 Inf2 instances and supported in Amazon SageMaker, is optimized for large-scale generative AI inference. Inf2 is the first inference-focused instance from AWS that is optimized for distributed inference, with high-speed, low-latency connectivity between accelerators.
You can now efficiently deploy a 175-billion-parameter model for inference across multiple accelerators on a single Inf2 instance without requiring expensive training instances. Until now, customers who had large models could only use instances that were built for training, which wastes resources: they're more expensive, consume more energy, and the inference workload doesn't use all the available resources (such as faster networking and storage). With AWS Inferentia2, you can achieve 4 times higher throughput and up to 10 times lower latency compared to AWS Inferentia1. Also, the second generation of AWS Inferentia adds enhanced support for more data types, custom operators, dynamic tensors, and more.
AWS Inferentia2 has 4 times the memory capacity and 16.4 times the memory bandwidth of AWS Inferentia1, plus native support for sharding large models across multiple accelerators. The accelerators use NeuronLink and Neuron Collective Communication to maximize the speed of data transfer between them or between an accelerator and the network adapter. AWS Inferentia2 is better suited for larger models, which require sharding across multiple accelerators, although AWS Inferentia1 is still a great option for smaller models because it provides better price-performance compared to alternatives.
To compare both generations of AWS Inferentia, let's first review the architecture of AWS Inferentia1. It has four NeuronCore-v1 cores per chip, as shown in the following diagram.
Specifications per chip:
Let’s look at how AWS Inferentia2 is organized. Each AWS Inferentia2 chip has two upgraded cores based on the NeuronCore-v2 architecture. Like AWS Inferentia1, you can run different models on each NeuronCore or combine multiple cores to shard big models.
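One way to control this placement is through Neuron runtime environment variables such as NEURON_RT_VISIBLE_CORES. The following minimal sketch (the core index and model file name are illustrative placeholders) pins a process to a single NeuronCore so that each process can serve a different model:

```python
import os

# Minimal sketch: restrict this process to NeuronCore 0 before the Neuron
# runtime initializes; a second process started with NEURON_RT_VISIBLE_CORES='1'
# would use core 1 and could serve a different model.
os.environ['NEURON_RT_VISIBLE_CORES'] = '0'

import torch
import torch_neuronx  # noqa: F401  # activates the Neuron runtime integration

# 'neuron_model_a.pt' is a placeholder for a model traced with torch_neuronx
model_a = torch.jit.load('neuron_model_a.pt')
```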
Specifications per chip:
NeuronCore-v2 has a modular design with four independent engines:
AWS Inferentia2 NeuronCore-v2 is faster and more optimized. Also, it's capable of accelerating different types and sizes of models, ranging from simple models such as ResNet-50 to large language models or foundation models with billions of parameters, such as GPT-3 (175 billion parameters). AWS Inferentia2 also has larger and faster internal memory compared to AWS Inferentia1, as shown in the following table.
| Chip | NeuronCores | Memory Type | Memory Size | Memory Bandwidth |
| --- | --- | --- | --- | --- |
| AWS Inferentia | 4x NeuronCore-v1 | DDR4 | 8 GB | 50 GB/s |
| AWS Inferentia2 | 2x NeuronCore-v2 | HBM | 32 GB | 820 GB/s |
AWS Inferentia2 uses High Bandwidth Memory (HBM). Each AWS Inferentia2 chip has 32 GB, which can be combined with the memory of other chips to distribute very large models using NeuronLink (a device-to-device interconnect). An inf2.48xlarge, for instance, has 12 AWS Inferentia2 accelerators with a total of 384 GB of accelerated memory. AWS Inferentia2 memory bandwidth is 16.4 times higher than that of AWS Inferentia1, as shown in the previous table.
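As a quick back-of-the-envelope sketch (it ignores the KV cache, activations, and runtime overhead), you can estimate whether the weights of a 175-billion-parameter model fit in that 384 GB of accelerated memory:

```python
# Back-of-the-envelope sketch: do the weights of a 175B-parameter model fit
# across the 12 AWS Inferentia2 chips of an inf2.48xlarge?
params = 175e9
bytes_per_param = 2                      # BF16/FP16 weights
weights_gb = params * bytes_per_param / 1e9

total_accelerator_memory_gb = 12 * 32    # 12 chips x 32 GB HBM = 384 GB

print(f'Weights: ~{weights_gb:.0f} GB of {total_accelerator_memory_gb} GB available')
# Weights: ~350 GB of 384 GB available -> fits when sharded over the 12 chips
```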
AWS Inferentia2 offers the following additional features:
The AWS Neuron SDK is used to compile and run your model. It is natively integrated with PyTorch and TensorFlow, so you don't need to run an additional tool. Use your original code, written in one of these ML frameworks, and with a few lines of code changes, you're good to go with AWS Inferentia.
Let’s look at how to compile and run a model on AWS Inferentia1 and AWS Inferentia2 using PyTorch.
Load a pre-trained model and run it one time to warm it up:
```python
import torch
import torchvision

model = torchvision.models.resnet50(weights='IMAGENET1K_V1').eval().cpu()
x = torch.rand(1, 3, 224, 224).float().cpu()  # dummy input
y = model(x)  # warmup model
```
To trace the model for AWS Inferentia, import torch_neuron and invoke the tracing function. Keep in mind that the model needs to be PyTorch JIT traceable to work.
At the end of the tracing process, save the model as a normal PyTorch model. Compile the model one time and load it back as many times as you need. The Neuron SDK runtime is already integrated with PyTorch and is responsible for automatically sending the operators to the AWS Inferentia1 chip to accelerate your model.
In your inference code, you always need to import torch_neuron to activate the integrated runtime.
You can pass additional parameters to the compiler to customize the way it optimizes the model or to enable special features such as NeuronCore pipelining, which shards your model across multiple cores to increase throughput (see the sketch after the following code block).
```python
import torch_neuron

# Trace the model using the AWS Neuron SDK
neuron_model = torch_neuron.trace(model, x)  # trace model for Inferentia

# Save for future use
neuron_model.save('neuron_resnet50.pt')

# Next time you don't need to trace the model again;
# just load it and the AWS Neuron SDK will send it to Inferentia automatically
neuron_model = torch.jit.load('neuron_resnet50.pt')

# Accelerated inference on Inferentia
y = neuron_model(x)
```
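Building on that, the following is a minimal sketch of passing compiler arguments through the trace call. The --neuroncore-pipeline-cores flag and the value of 4 are assumptions based on the neuron-cc compiler, so verify them against your Neuron SDK version:

```python
import torch_neuron

# Sketch: pipeline the model across 4 NeuronCores to increase throughput.
# The flag name and value are assumptions based on neuron-cc; verify them
# against the Neuron SDK version you are using.
neuron_model = torch_neuron.trace(
    model,
    x,
    compiler_args=['--neuroncore-pipeline-cores', '4'],
)
neuron_model.save('neuron_resnet50_pipelined.pt')  # placeholder file name
```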
For AWS Inferentia2, the process is similar. The only difference is that the package you import ends with x: torch_neuronx. The Neuron SDK takes care of compiling and running the model for you transparently. You can also pass additional parameters to the compiler to fine-tune the operation or activate specific functionalities.
```python
import torch_neuronx

# Trace the model using the Neuron SDK
neuron_model = torch_neuronx.trace(model, x)  # trace model for Inferentia

# Save for future use
neuron_model.save('neuron_resnet50.pt')

# Next time you don't need to trace the model again;
# just load it and the Neuron SDK will send it to Inferentia automatically
neuron_model = torch.jit.load('neuron_resnet50.pt')

# Accelerated inference on Inferentia
y = neuron_model(x)
```
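Similarly, compiler arguments can be passed to torch_neuronx.trace. In the sketch below, the --auto-cast and --auto-cast-type options of neuronx-cc are assumptions used to illustrate the idea, so verify the exact option names and values for your Neuron SDK version:

```python
import torch_neuronx

# Sketch: keep matrix multiplications in BF16 to trade a little precision
# for throughput. The --auto-cast options are assumptions based on neuronx-cc;
# verify the exact names and values for your Neuron SDK version.
neuron_model = torch_neuronx.trace(
    model,
    x,
    compiler_args=['--auto-cast', 'matmult', '--auto-cast-type', 'bf16'],
)
```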
AWS Inferentia2 also offers a second approach for running a model, called Lazy Tensor inference. In this mode, you don't trace or compile the model beforehand; instead, the compiler runs on the fly every time you run your code. This mode isn't recommended for production, given that traced mode has many advantages over Lazy Tensor inference. However, if you're still developing your model and need to test it faster, Lazy Tensor inference can be a good alternative. Here's how to compile and run a model using Lazy Tensor:
```python
import torch
import torchvision
import torch_neuronx
import torch_xla.core.xla_model as xm

device = xm.xla_device()  # create XLA device

model = torchvision.models.resnet50(weights='IMAGENET1K_V1').eval().cpu()
model.to(device)

x = torch.rand((1, 3, 224, 224), device=device)  # dummy input
with torch.no_grad():
    y = model(x)
    xm.mark_step()  # compilation occurs here
```
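Continuing the previous snippet, subsequent calls with the same input shapes reuse the graph that was compiled on the first call, so only the first iteration pays the compilation cost. Here is a minimal sketch of that behavior:

```python
# Sketch, continuing the previous snippet: the first call compiled the graph,
# so further calls with the same input shape reuse it and run much faster.
with torch.no_grad():
    for _ in range(5):
        y = model(x)
        xm.mark_step()  # materialize the lazily recorded graph on the device
```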
Now that you're familiar with AWS Inferentia2, a good next step is to get started with PyTorch or TensorFlow and learn how to set up a dev environment and run tutorials and examples. Also, check the AWS Neuron Samples GitHub repo, where you can find multiple examples of how to prepare models to run on Inf2, Inf1, and Trn1.
The AWS Inferentia2 compiler is XLA-based, and AWS is part of the OpenXLA initiative. This is the biggest difference from AWS Inferentia1, and it's relevant because PyTorch, TensorFlow, and JAX have native XLA integrations. XLA brings many performance improvements, given that it optimizes the graph to compute the results in a single kernel launch. It fuses successive tensor operations together and outputs optimal machine code for accelerating model runs on AWS Inferentia2. Other parts of the Neuron SDK were also improved for AWS Inferentia2, while keeping the user experience as simple as possible when tracing and running models. The following table shows the features available in both versions of the compiler and runtime.
| Feature | torch-neuron | torch-neuronx |
| --- | --- | --- |
| Tensorboard | Yes | Yes |
| Supported Instances | Inf1 | Inf2 & Trn1 |
| Inference Support | Yes | Yes |
| Training Support | No | Yes |
| Architecture | NeuronCore-v1 | NeuronCore-v2 |
| Trace API | torch_neuron.trace() | torch_neuronx.trace() |
| Distributed Inference | NeuronCore Pipeline | Collective Communications |
| IR | GraphDef | HLO |
| Compiler | neuron-cc | neuronx-cc |
| Monitoring | neuron-monitor / monitor-top | neuron-monitor / monitor-top |
For a more detailed comparison between torch-neuron (Inf1) and torch-neuronx (Inf2), refer to Comparison of torch-neuron (Inf1) versus torch-neuronx (Inf2 & Trn1) for Inference.
After tracing a model to deploy to Inf2, you have many deployment options. You can run real-time predictions or batch predictions in different ways. Inf2 is available as EC2 instances that are natively integrated with other AWS services that make use of Deep Learning Containers (DLCs), such as Amazon Elastic Container Service (Amazon ECS), Amazon Elastic Kubernetes Service (Amazon EKS), and SageMaker.
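For example, a traced model can be hosted on a SageMaker real-time endpoint backed by an Inf2 instance. The following sketch uses the SageMaker Python SDK; the S3 path, entry point script, and framework versions are illustrative placeholders, and depending on the framework version you may need to pass a Neuron DLC explicitly via image_uri:

```python
import sagemaker
from sagemaker.pytorch import PyTorchModel

role = sagemaker.get_execution_role()

# Sketch: model_data and entry_point are placeholders; inference.py would load
# the traced model with torch.jit.load and implement the prediction handlers.
pytorch_model = PyTorchModel(
    model_data='s3://my-bucket/neuron_resnet50.tar.gz',
    role=role,
    entry_point='inference.py',
    framework_version='1.13.1',   # illustrative; match an available Neuron DLC
    py_version='py39',
)

predictor = pytorch_model.deploy(
    initial_instance_count=1,
    instance_type='ml.inf2.xlarge',  # AWS Inferentia2-backed SageMaker instance
)
```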
AWS Inferentia2 is compatible with the most popular deployment technologies. Here is a list of some of the options you have for deploying models using AWS Inferentia2:
The following table highlights the improvements AWS Inferentia2 brings over AWS Inferentia1. Specifically, we measure latency (how fast the model can make a prediction using each accelerator), throughput (how many inferences per second), and cost per inference (how much each inference costs in US dollars). Lower latency (in milliseconds) and lower cost (in US dollars) are better; higher throughput is better.
Two models were used in this process, both large language models: ELECTRA large discriminator and BERT large uncased. PyTorch (1.13.1) and Hugging Face transformers (v4.7.0), the main libraries used in this experiment, ran on Python 3.8. After compiling the models for batch sizes of 1 and 10 (using the code from the previous section as a reference), each model was warmed up (invoked one time to initialize the context) and then invoked 10 times in a row. The following table shows the average numbers collected in this simple benchmark.
| Model Name | Batch Size | Sentence Length | Latency Inf1 (ms) | Latency Inf2 (ms) | Improvement Inf2 over Inf1 (x times) | Throughput Inf1 (inferences per second) | Throughput Inf2 (inferences per second) | Cost per Inference Inf1 (EC2 us-east-1)** | Cost per Inference Inf2 (EC2 us-east-1)** |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ElectraLargeDiscriminator | 1 | 256 | 35.7 | 8.31 | 4.30 | 28.01 | 120.34 | $0.0000023 | $0.0000018 |
| ElectraLargeDiscriminator | 10 | 256 | 343.7 | 72.9 | 4.71 | 2.91 | 13.72 | $0.0000022 | $0.0000015 |
| BertLargeUncased | 1 | 128 | 28.2 | 3.1 | 9.10 | 35.46 | 322.58 | $0.0000018 | $0.0000007 |
| BertLargeUncased | 10 | 128 | 121.1 | 23.6 | 5.13 | 8.26 | 42.37 | $0.0000008 | $0.0000005 |
* c6a.8xlarge with 32 AMD Epyc 7313 CPU was used in this benchmark.
** EC2 public pricing in us-east-1 on April 20: inf2.xlarge: $0.7582/hr; inf1.xlarge: $0.228/hr. Cost per inference considers the cost per element in a batch (cost per inference equals the total cost of the model invocation divided by the batch size).
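For reference, the measurement and cost calculation described above can be reproduced with a few lines of Python. The following sketch uses the inf2.xlarge hourly price from the footnote; the helper names are illustrative:

```python
import time

def benchmark(neuron_model, x, warmup=1, iters=10):
    # Warm up: the first call initializes the Neuron runtime context
    for _ in range(warmup):
        neuron_model(x)
    start = time.time()
    for _ in range(iters):
        neuron_model(x)
    return (time.time() - start) / iters * 1000  # average latency in ms

def cost_per_inference(latency_ms, hourly_price_usd, batch_size):
    # Cost per inference = (price per second x invocation latency) / batch size
    return (hourly_price_usd / 3600) * (latency_ms / 1000) / batch_size

# Example: inf2.xlarge at $0.7582/hr, batch of 10, 72.9 ms average latency
print(cost_per_inference(latency_ms=72.9, hourly_price_usd=0.7582, batch_size=10))
# ~0.0000015 USD per inference, matching the table above
```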
For additional information about training and inference performance, refer to Trn1/Trn1n Performance.
AWS Inferentia2 is a powerful technology designed to improve performance and reduce the cost of deep learning model inference. More performant than AWS Inferentia1, it offers up to 4 times higher throughput, up to 10 times lower latency, and up to 50% better performance per watt than other comparable inference-optimized EC2 instances. In the end, you pay less, have a faster application, and meet your sustainability goals.
It’s simple and straightforward to migrate your inference code to AWS Inferentia2, which also supports a broader variety of models, including large language models and foundation models for generative AI.
You can get started by following the AWS Neuron SDK documentation to set up a development environment and start your accelerated deep learning project. To help you get started, Hugging Face has added Neuron support to their Optimum library, which optimizes models for faster training and inference, and they have many example tasks ready to run on Inf2. Also, check our post Deploy large language models on AWS Inferentia2 using large model inference containers to learn more about that deployment option. For additional examples, see the AWS Neuron Samples GitHub repo.
Samir Araújo is an AI/ML Solutions Architect at AWS. He helps customers create AI/ML solutions that solve their business challenges using AWS. He has worked on several AI/ML projects related to computer vision, natural language processing, forecasting, ML at the edge, and more. He likes playing with hardware and automation projects in his free time, and he has a particular interest in robotics.