ONNX (Open Neural Network Exchange) is an open-source standard for representing deep learning models that is widely supported by many providers. ONNX provides tools for optimizing and quantizing models to reduce the memory and compute needed to run machine learning (ML) models. One of the biggest benefits of ONNX is that it provides a standardized format for representing and exchanging ML models between different frameworks and tools. This allows developers to train their models in one framework and deploy them in another without the need for extensive model conversion or retraining. For these reasons, ONNX has gained significant importance in the ML community.
In this post, we showcase how to deploy ONNX-based models for multi-model endpoints (MMEs) that use GPUs. This is a continuation of the post Run multiple deep learning models on GPU with Amazon SageMaker multi-model endpoints, where we showed how to deploy PyTorch and TensorRT versions of ResNet50 models on Nvidia’s Triton Inference server. In this post, we use the same ResNet50 model in ONNX format along with an additional natural language processing (NLP) example model in ONNX format to show how it can be deployed on Triton. Furthermore, we benchmark the ResNet50 model and see the performance benefits that ONNX provides when compared to PyTorch and TensorRT versions of the same model, using the same input.
ONNX Runtime is a runtime engine for ML inference designed to optimize the performance of models across multiple hardware platforms, including CPUs and GPUs. It supports models exported from ML frameworks like PyTorch and TensorFlow. It facilitates performance tuning to run models cost-efficiently on the target hardware and has support for features like quantization and hardware acceleration, making it one of the ideal choices for deploying efficient, high-performance ML applications. For examples of how ONNX models can be optimized for Nvidia GPUs with TensorRT, refer to TensorRT Optimization (ORT-TRT) and ONNX Runtime with TensorRT optimization.
The Amazon SageMaker Triton container flow is depicted in the following diagram.
Users can send an HTTPS request with the input payload for real-time inference to a SageMaker endpoint. The user can specify a TargetModel header that contains the name of the model that the request is intended to invoke. Internally, the SageMaker Triton container implements an HTTP server with the same contracts as mentioned in How Containers Serve Requests. It has support for dynamic batching and supports all the backends that Triton provides. Based on the configuration, the ONNX runtime is invoked and the request is processed on CPU or GPU as predefined in the model configuration provided by the user.
To use the ONNX backend, complete the following steps:
Ensure that you have access to an AWS account with sufficient AWS Identity and Access Management (IAM) permissions to create a notebook, access an Amazon Simple Storage Service (Amazon S3) bucket, and deploy models to SageMaker endpoints. See Create execution role for more information.
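For example, in a SageMaker notebook the role, default S3 bucket, and Boto3 clients used throughout this post can typically be retrieved as follows. This is a minimal sketch; it assumes the notebook runs with a SageMaker execution role that has the permissions described above.

import boto3
import sagemaker

# Minimal setup sketch: assumes a SageMaker notebook with an attached execution role
sess = sagemaker.Session()
role = sagemaker.get_execution_role()   # IAM role SageMaker uses to access S3 and create endpoints
bucket = sess.default_bucket()          # S3 bucket used to stage model artifacts
region = sess.boto_region_name

sm_client = boto3.client("sagemaker", region_name=region)
runtime_sm_client = boto3.client("sagemaker-runtime", region_name=region)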
The transformers library provides a convenient method to export the PyTorch model to ONNX format. The following code achieves the transformation for the NLP model:
onnx_inputs, onnx_outputs = transformers.onnx.export(
    preprocessor=tokenizer,
    model=model,
    config=onnx_config,
    opset=12,
    output=save_path,
)
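The tokenizer, model, and onnx_config objects passed to this call can be prepared with the standard transformers utilities. The following is a sketch of one way to do this, assuming a RoBERTa checkpoint and the default feature; the model ID and output path are illustrative.

from pathlib import Path

from transformers import AutoModel, AutoTokenizer
from transformers.onnx import FeaturesManager

# Illustrative preparation for the export call above
model_id = "roberta-large"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

# Look up the ONNX export configuration registered for this architecture
model_kind, model_onnx_config = FeaturesManager.check_supported_model_or_raise(
    model, feature="default"
)
onnx_config = model_onnx_config(model.config)

# Illustrative output location inside the Triton model repository
save_path = Path("model_repository/nlp-onnx/1/model.onnx")
save_path.parent.mkdir(parents=True, exist_ok=True)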
Exporting models (either PyTorch or TensorFlow) is easily achieved through the conversion tool provided as part of the Hugging Face transformers repository.
The following is what happens under the hood:
A similar mechanism is followed for the computer vision use case from the torchvision model zoo:
torch.onnx.export(
    resnet50,
    dummy_input,
    args.save,
    export_params=True,
    opset_version=11,
    do_constant_folding=True,
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={"input": {0: "batch_size"}, "output": {0: "batch_size"}},
)
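In this call, resnet50 and dummy_input can be obtained from torchvision; args.save is the output path parsed from the command line. The following sketch shows one way to set these up (the output path shown is illustrative).

import torch
import torchvision.models as models

# Illustrative setup for the torch.onnx.export call above
resnet50 = models.resnet50(pretrained=True)
resnet50.eval()

# Dummy input with the NCHW shape declared in the model configuration; the batch dimension is dynamic
dummy_input = torch.randn(1, 3, 224, 224)

# Illustrative equivalent of args.save
save_path = "model_repository/resenet_onnx/1/model.onnx"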
In this section, we configure the computer vision and NLP models. We show how to deploy pre-trained ResNet50 and RoBERTa large models on a SageMaker MME by using Triton Inference Server model configurations. The ResNet50 notebook is available on GitHub. The RoBERTa notebook is also available on GitHub. For ResNet50, we use the Docker approach to create an environment that already has all the dependencies required to build our ONNX model and generate the model artifacts needed for this exercise. This approach makes it much easier to share dependencies and create the exact environment needed to accomplish this task.
The first step is to create the ONNX model package per the directory structure specified in ONNX Models. Our aim is to use the minimal model repository for an ONNX model contained in a single file as follows:
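Following the Triton conventions, the minimal layout places a single model.onnx file under a numbered version directory next to the model configuration. The model name shown here is illustrative:

model_repository/
└── nlp-onnx/
    ├── config.pbtxt
    └── 1/
        └── model.onnx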
Next, we create the model configuration file that describes the inputs, outputs, and backend configuration for the Triton Server to pick up and invoke the appropriate kernels for ONNX. This file is known as config.pbtxt and is shown in the following code for the RoBERTa use case. Note that the batch dimension is omitted from config.pbtxt; however, when sending the data to the model, we include the batch dimension. The configuration also shows how to set dynamic batching with a preferred batch size of 5 for the actual inference. With these settings, the model instance is invoked as soon as the preferred batch size of 5 is met or the delay time of 100 microseconds has elapsed since the first request reached the dynamic batcher.
name: "nlp-onnx"
platform: "onnxruntime_onnx"
backend: "onnxruntime"
max_batch_size: 32

input {
  name: "input_ids"
  data_type: TYPE_INT64
  dims: [512]
}
input {
  name: "attention_mask"
  data_type: TYPE_INT64
  dims: [512]
}
output {
  name: "last_hidden_state"
  data_type: TYPE_FP32
  dims: [-1, 768]
}
output {
  name: "1550"
  data_type: TYPE_FP32
  dims: [768]
}
instance_group {
  count: 1
  kind: KIND_GPU
}
dynamic_batching {
  max_queue_delay_microseconds: 100
  preferred_batch_size: 5
}
The following is the similar configuration file for the computer vision use case:
name: "resenet_onnx"
platform: "onnxruntime_onnx"
max_batch_size: 128
input [
  {
    name: "input"
    data_type: TYPE_FP32
    format: FORMAT_NCHW
    dims: [ 3, 224, 224 ]
  }
]
output [
  {
    name: "output"
    data_type: TYPE_FP32
    dims: [ 1000 ]
  }
]
We use the Boto3 APIs to create the SageMaker endpoint. For this post, we show the steps for the RoBERTa notebook, but these are common steps and will be the same for the ResNet50 model as well.
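Before the SageMaker model can be created, the Triton model repository is packaged as a tar.gz archive and uploaded to the S3 prefix that backs the MME. The following is a minimal sketch of that step; the archive name and S3 prefix are illustrative.

import tarfile

import sagemaker

# Package the Triton model repository for the RoBERTa model (names are illustrative)
tar_file_name = "roberta-large-v0.tar.gz"
with tarfile.open(tar_file_name, "w:gz") as tar:
    tar.add("model_repository/nlp-onnx", arcname="nlp-onnx")

# Upload the archive to the S3 prefix the multi-model endpoint will load models from
sess = sagemaker.Session()
mme_prefix = "mme-onnx-models"
mme_path = f"s3://{sess.default_bucket()}/{mme_prefix}/"
sess.upload_data(path=tar_file_name, key_prefix=mme_prefix)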
We now create a SageMaker model. We use the Amazon Elastic Container Registry (Amazon ECR) image and the model artifact from the previous step to create the SageMaker model.
To create the container, we pull the appropriate image from Amazon ECR for Triton Server. SageMaker allows us to customize and inject various environment variables. One key capability is setting the batch size; we can set this per model in the config.pbtxt file, or we can define a default value here. For models that can benefit from a larger shared memory size, we can set those values with the SHM variables. To enable logging, set the log verbose level to true. We use the following code to create the model to use in our endpoint:
mme_triton_image_uri = (
    f"{account_id_map[region]}.dkr.ecr.{region}.{base}"
    + "/sagemaker-tritonserver:22.12-py3"
)

container = {
    "Image": mme_triton_image_uri,
    "ModelDataUrl": mme_path,
    "Mode": "MultiModel",
    "Environment": {
        "SAGEMAKER_TRITON_SHM_DEFAULT_BYTE_SIZE": "16777216000",
        "SAGEMAKER_TRITON_SHM_GROWTH_BYTE_SIZE": "10485760",
    },
}

from sagemaker.utils import name_from_base

# Model name prefix is illustrative
model_name = name_from_base("triton-mme-onnx")
print(model_name)

create_model_response = sm_client.create_model(
    ModelName=model_name,
    ExecutionRoleArn=role,
    PrimaryContainer=container,
)
model_arn = create_model_response["ModelArn"]
print(f"Created Model: {model_arn}")
You can use any of the supported GPU instances for testing. In this post, we use a g4dn.4xlarge instance. We don't set the VolumeSizeInGB parameter because this instance comes with local instance storage. The VolumeSizeInGB parameter is applicable to GPU instances that support an Amazon Elastic Block Store (Amazon EBS) volume attachment. We can leave the model download timeout and container startup health check at their default values. For more details, refer to CreateEndpointConfig.
endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "VariantName": "AllTraffic",
            "ModelName": model_name,
            "InstanceType": "ml.g4dn.4xlarge",
            "InitialInstanceCount": 1,
            # "VolumeSizeInGB": 200,
            # "ModelDataDownloadTimeoutInSeconds": 600,
            # "ContainerStartupHealthCheckTimeoutInSeconds": 600,
        },
    ],
)
Lastly, we create a SageMaker endpoint:
create_endpoint_response = sm_client.create_endpoint(
    EndpointName=endpoint_name,
    EndpointConfigName=endpoint_config_name,
)
The model expects tokenized text, so we pass the input_ids and attention_mask to the model as part of the payload. The following code shows how to create the tensors:
tokenizer("This is a sample", padding="max_length", max_length=max_seq_len)
We now create the appropriate payload by ensuring the data type matches what we configured in the config.pbtxt. This also gives us the tensors with the batch dimension included, which is what Triton expects. We use the JSON format to invoke the model. Triton also provides a native binary invocation method for the model.
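The payload follows the Triton (KServe v2) JSON inference format. The following is a minimal sketch that builds on the tokenizer call shown earlier and assumes a batch size of 1; the variable names are illustrative.

# Build a Triton-style JSON inference request; the batch dimension is added explicitly
tokens = tokenizer("This is a sample", padding="max_length", max_length=max_seq_len)

payload = {
    "inputs": [
        {
            "name": "input_ids",
            "shape": [1, max_seq_len],
            "datatype": "INT64",          # matches TYPE_INT64 in config.pbtxt
            "data": [tokens["input_ids"]],
        },
        {
            "name": "attention_mask",
            "shape": [1, max_seq_len],
            "datatype": "INT64",
            "data": [tokens["attention_mask"]],
        },
    ]
}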
response = runtime_sm_client.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/octet-stream",
    Body=json.dumps(payload),
    TargetModel=f"{tar_file_name}",
    # TargetModel="roberta-large-v0.tar.gz",
)
Note the TargetModel parameter in the preceding code. Because this is a multi-model endpoint, we send the name of the model to be invoked as a request header; by changing this parameter, we can invoke multiple models at runtime on an already deployed inference endpoint. This shows the power of multi-model endpoints!
To output the response, we can use the following code:
import json

import numpy as np

resp_bin = response["Body"].read().decode("utf8")
# Response keys look like:
# "outputs": [{"name": "1550", "datatype": "FP32", "shape": [1, 768], "data": [0.0013, 0.3433, ...]}]
for data in json.loads(resp_bin)["outputs"]:
    shape_1 = list(data["shape"])
    dat_1 = np.array(data["data"])
    dat_1.resize(shape_1)
    print(f"Data outputs received back: shape {dat_1.shape}")
The ONNX backend uses C++ arena memory allocation. Arena allocation is a C++-only feature that helps you optimize your memory usage and improve performance. Memory allocation and deallocation constitute a significant fraction of CPU time spent in protocol buffers code. By default, new object creation performs heap allocations for each object, each of its sub-objects, and several field types, such as strings. These allocations occur in bulk when parsing a message and when building new messages in memory, and associated deallocations happen when messages and their sub-object trees are freed.
Arena-based allocation has been designed to reduce this performance cost. With arena allocation, new objects are allocated out of a large piece of pre-allocated memory called the arena. Objects can all be freed at once by discarding the entire arena, ideally without running destructors of any contained object (though an arena can still maintain a destructor list when required). This makes object allocation faster by reducing it to a simple pointer increment, and makes deallocation almost free. Arena allocation also provides greater cache efficiency: when messages are parsed, they are more likely to be allocated in continuous memory, which makes traversing messages more likely to hit hot cache lines. The downside of arena-based allocation is that the C++ heap memory will be over-allocated and stay allocated even after the objects are deallocated. This might lead to out-of-memory errors or high CPU memory usage. To achieve the best of both worlds, we use the following configurations provided by Triton and ONNX:
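The exact settings are listed in the ONNX configuration section of the sample notebook. As an illustrative sketch, the Triton ONNX Runtime backend exposes these knobs as model configuration parameters along the following lines; verify the parameter names and values against the Triton onnxruntime backend documentation for your container version.

optimization {
  graph: { level: 1 }
  execution_accelerators {
    gpu_execution_accelerator: [
      {
        name: "cuda"
        parameters { key: "cudnn_conv_algo_search" value: "1" }
      }
    ]
  }
}
parameters { key: "execution_mode" value: { string_value: "1" } }
parameters { key: "memory.enable_memory_arena_shrinkage" value: { string_value: "gpu:0" } }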
For a deeper understanding of the opportunities for performance tuning in ONNX, refer to the following figure.
Source: https://static.linaro.org/connect/san19/presentations/san19-211.pdf
By turning on the graph optimizations, cudnn_conv_algo_search, and parallel run mode parameters in our testing of the ResNet50 model, we saw the cold start time of the ONNX model graph drop from 4.4 seconds to 1.61 seconds. An example of a complete model configuration file is provided in the ONNX configuration section of the sample notebook.
The testing benchmark results are as follows:
The following graphs visualize these metrics.
Furthermore, based on our testing of computer vision use cases, we recommend sending the request payload in binary format using the HTTP client provided by Triton, because it significantly improves model invoke latency.
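The following is a sketch of this approach with the tritonclient package. It assumes a NumPy batch shaped like the ResNet50 input, the SageMaker Triton binary content type, and an illustrative archive name for TargetModel; check the details against your container and client versions.

import numpy as np
import tritonclient.http as httpclient
from tritonclient.utils import np_to_triton_dtype

# Illustrative batch matching the ResNet50 config (NCHW)
batch = np.random.rand(1, 3, 224, 224).astype(np.float32)

inputs = [httpclient.InferInput("input", list(batch.shape), np_to_triton_dtype(batch.dtype))]
inputs[0].set_data_from_numpy(batch, binary_data=True)
outputs = [httpclient.InferRequestedOutput("output", binary_data=True)]

# Serialize a request body consisting of a JSON header followed by binary tensor data
request_body, header_length = httpclient.InferenceServerClient.generate_request_body(
    inputs, outputs=outputs
)

response = runtime_sm_client.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/vnd.sagemaker-triton.binary+json;json-header-size="
    + str(header_length),
    Body=request_body,
    TargetModel="resnet-v0.tar.gz",  # illustrative archive name
)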
Other parameters that SageMaker exposes for ONNX on Triton are as follows:
Ensure that you delete the model, model configuration, and model endpoint after running the notebook. The steps to do this are provided at the end of the sample notebook in the GitHub repo.
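At a high level, the cleanup boils down to the following Boto3 calls (using the names created earlier in this post):

# Remove the endpoint, its configuration, and the model to stop incurring charges
sm_client.delete_endpoint(EndpointName=endpoint_name)
sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
sm_client.delete_model(ModelName=model_name)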
In this post, we dove deep into the ONNX backend that Triton Inference Server supports on SageMaker. This backend provides GPU acceleration for your ONNX models. There are many options to consider to get the best performance for inference, such as batch sizes, data input formats, and other factors that can be tuned to meet your needs. SageMaker lets you use this capability with both single-model and multi-model endpoints. MMEs allow a better balance of performance and cost savings. To get started with MME support for GPU, see Host multiple models in one container behind one endpoint.
We invite you to try Triton Inference Server containers in SageMaker, and share your feedback and questions in the comments.
Abhi Shivaditya is a Senior Solutions Architect at AWS, working with strategic global enterprise organizations to facilitate the adoption of AWS services in areas such as Artificial Intelligence, distributed computing, networking, and storage. His expertise lies in Deep Learning in the domains of Natural Language Processing (NLP) and Computer Vision. Abhi assists customers in deploying high-performance machine learning models efficiently within the AWS ecosystem.
James Park is a Solutions Architect at Amazon Web Services. He works with Amazon.com to design, build, and deploy technology solutions on AWS, and has a particular interest in AI and machine learning. In his spare time, he enjoys seeking out new cultures and new experiences, and staying up to date with the latest technology trends. You can find him on LinkedIn.
Rupinder Grewal is a Sr. AI/ML Specialist Solutions Architect with AWS. He currently focuses on model serving and MLOps on SageMaker. Prior to this role, he worked as a Machine Learning Engineer building and hosting models. Outside of work, he enjoys playing tennis and biking on mountain trails.
Dhawal Patel is a Principal Machine Learning Architect at AWS. He has worked with organizations ranging from large enterprises to mid-sized startups on problems related to distributed computing and artificial intelligence. He focuses on deep learning, including the NLP and computer vision domains. He helps customers achieve high-performance model inference on SageMaker.