Amazon SageMaker is a fully managed service for data science and machine learning (ML) workflows. It helps data scientists and developers prepare, build, train, and deploy high-quality ML models quickly by bringing together a broad set of capabilities purpose-built for ML.
In 2021, AWS announced the integration of NVIDIA Triton Inference Server in SageMaker. You can use NVIDIA Triton Inference Server to serve models for inference in SageMaker. By using an NVIDIA Triton container image, you can easily serve ML models and benefit from the performance optimizations, dynamic batching, and multi-framework support provided by NVIDIA Triton. Triton helps maximize the utilization of GPU and CPU, further lowering the cost of inference.
In some scenarios, users want to deploy multiple models. For example, an application that revises English compositions typically includes several models, such as BERT for text classification and GECToR for grammar checking. A typical request may flow through several steps, such as data preprocessing, BERT, GECToR, and postprocessing, which run serially as an inference pipeline. If these models are hosted on different instances, the network latency between the instances adds to the overall latency. For an application with unpredictable traffic, deploying multiple models on different instances also leads to inefficient use of resources.
Consider another scenario, in which users develop multiple models with different versions, and each model uses a different training framework. A common practice is to use multiple containers, each of which deploys one model. However, this increases the workload and cost of development, operation, and maintenance. In this post, we discuss how SageMaker and NVIDIA Triton Inference Server can solve this problem.
Let's look at how SageMaker inference works. SageMaker invokes the hosting service by running a Docker container. The Docker container launches a RESTful inference server (such as Flask) to serve HTTP requests for inference. The inference server loads the model and listens on port 8080 to serve external requests. The client application sends a POST request to the SageMaker endpoint. SageMaker passes the request to the container and returns the inference result from the container to the client.
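As a rough illustration of this contract, the following sketch uses only the Python standard library. SageMaker hosting expects the container to answer health checks on `/ping` and inference requests on `/invocations`, both on port 8080. The `handle` function, the `Handler` class, and the echo response are placeholders introduced for this example, not part of the solution code.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def handle(method, path, body=b""):
    """Return (status_code, response_dict) for a single request."""
    if method == "GET" and path == "/ping":
        # Health check that SageMaker calls to see if the container is up
        return 200, {"status": "ok"}
    if method == "POST" and path == "/invocations":
        request = json.loads(body or b"{}")
        # Placeholder: a real server would run inference here
        return 200, {"echo": request}
    return 404, {"error": "not found"}


class Handler(BaseHTTPRequestHandler):
    def _reply(self, status, payload):
        data = json.dumps(payload).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

    def do_GET(self):
        self._reply(*handle("GET", self.path))

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        self._reply(*handle("POST", self.path, self.rfile.read(length)))


def serve(port=8080):
    """Block and serve requests on the port SageMaker expects."""
    HTTPServer(("0.0.0.0", port), Handler).serve_forever()
```

Keeping the routing logic in a plain function makes it easy to exercise without opening a socket. A production container would typically put a WSGI server such as gunicorn behind NGINX instead of `http.server`.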
In our architecture, we use NVIDIA Triton Inference Server, which runs multiple models from different frameworks concurrently, and a Flask server that processes client requests and dispatches them to the backend Triton server. When the Docker container launches, the Triton server and the Flask server start automatically. The Triton server loads multiple models and exposes port 8000 for HTTP, port 8001 for gRPC, and port 8002 for metrics. The Flask server listens on port 8080, parses the original request and payload, and then invokes the local Triton backend using the model name and version information. The client adds the model name and model version to the request in addition to the original payload, so that Flask can route the inference request to the correct model on the Triton server.
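The routing step can be sketched as a small dispatch function. This is a minimal illustration, not the actual Flask handler: the `modelname` and `payload` keys and the three model names are taken from the client requests shown later in this post, while `EchoClient`, `CLIENTS`, and `dispatch` are hypothetical names, and `EchoClient` merely stands in for a real Triton client class.

```python
import json


class EchoClient:
    """Placeholder for a real Triton client; it only echoes its input."""

    def __init__(self, name):
        self.name = name

    def inference(self, payload):
        return {"handled_by": self.name, "payload": payload}


# Registry that maps the model name sent by the client to a client object
CLIENTS = {
    "resnet": EchoClient("resnet"),
    "yolov5": EchoClient("yolov5"),
    "bert_base": EchoClient("bert_base"),
}


def dispatch(raw_body, clients=CLIENTS):
    """Parse the request body and route it to the matching model client."""
    request = json.loads(raw_body)
    model_name = request.get("modelname")
    if model_name not in clients:
        return 400, {"error": f"unknown model: {model_name!r}"}
    result = clients[model_name].inference(request.get("payload", {}))
    return 200, {"modelname": model_name, "result": result}
```

A Flask view bound to `/invocations` could call `dispatch(flask.request.data)` and serialize the returned dictionary as the response body.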
The following diagram illustrates this process.
A complete API call from the client is as follows:
In the following sections, we introduce the steps needed to prepare a model and build the TensorRT engine, prepare a Docker image, create a SageMaker endpoint, and verify the result.
We demonstrate hosting three typical ML models in our solution: image classification (ResNet50), object detection (YOLOv5), and a natural language processing (NLP) model (BERT-base). NVIDIA Triton Inference Server supports multiple formats, including TensorFlow 1.x and 2.x, TensorFlow SavedModel, TensorFlow GraphDef, TensorRT, ONNX, OpenVINO, and PyTorch TorchScript.
The following table summarizes our model details.
Model Name | Model Size | Format |
ResNet50 | 52M | TensorRT |
YOLOv5 | 38M | TensorRT |
BERT-base | 133M | ONNX Runtime |
NVIDIA provides detailed documentation describing how to generate the TensorRT engine. To achieve the best performance, the TensorRT engine must be built on the same type of GPU that it will run on. This means the build environment and the runtime environment need the same compute capability. For example, a TensorRT engine built on a g4dn instance can't be deployed on a g5 instance.
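The following sketch shows one way to guard against this mismatch before deployment. It assumes that comparing the GPU model and compute capability of the two instance families is a sufficient check, which is a simplification because engines also depend on the TensorRT version. The lookup table covers only the two families named above (g4dn uses the NVIDIA T4, g5 uses the NVIDIA A10G), and the function names are hypothetical.

```python
# GPU model and CUDA compute capability per instance family.
# Only the two families mentioned in this post are listed.
GPU_BY_INSTANCE_FAMILY = {
    "g4dn": ("NVIDIA T4", (7, 5)),
    "g5": ("NVIDIA A10G", (8, 6)),
}


def instance_family(instance_type):
    """Extract the family from 'g4dn.xlarge' or 'ml.g4dn.xlarge'."""
    parts = instance_type.split(".")
    return parts[1] if parts[0] == "ml" else parts[0]


def can_reuse_engine(build_instance, deploy_instance):
    """Return True if both instances use the same GPU in the lookup table."""
    build_gpu = GPU_BY_INSTANCE_FAMILY[instance_family(build_instance)]
    deploy_gpu = GPU_BY_INSTANCE_FAMILY[instance_family(deploy_instance)]
    return build_gpu == deploy_gpu
```

For example, `can_reuse_engine("g4dn.xlarge", "ml.g4dn.xlarge")` is true, whereas an engine built on `g4dn.xlarge` is rejected for `ml.g5.xlarge`.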
You can generate your own TensorRT engines according to your needs. For test purposes, we prepared sample code and deployable models with the TensorRT engine. The source code is also available on GitHub.
Next, we use an Amazon Elastic Compute Cloud (Amazon EC2) G4dn instance to generate the TensorRT engine with the following steps. We use YOLOv5 as an example.
Before we deploy to SageMaker, we start a Triton server to verify that these three models are configured correctly. Use the following command to start a Triton server and load the models:
docker run --gpus all --rm -p8000:8000 -p8001:8001 -v
Enter nvidia-smi in the terminal to see GPU memory usage.
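Besides checking GPU memory, you can ask the server directly whether it is ready. The sketch below queries the readiness endpoints that Triton exposes on its HTTP port (8000). The helper names are hypothetical, and the code assumes that the model names used later in this post (`resnet`, `yolov5`, `bert_base`) match the directory names in your model repository.

```python
import urllib.error
import urllib.request


def ready_url(model_name=None, host="localhost", port=8000):
    """Build the readiness URL for the server or for a single model."""
    base = f"http://{host}:{port}/v2"
    if model_name is None:
        return f"{base}/health/ready"
    return f"{base}/models/{model_name}/ready"


def is_ready(model_name=None, host="localhost", port=8000, timeout=5.0):
    """Return True if the server (or the named model) reports ready."""
    try:
        url = ready_url(model_name, host, port)
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status
        return False
```

Calling `is_ready()` checks the server as a whole, and `is_ready("resnet")` checks a single model.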
The file structure is as follows:
We define an abstract method to implement the inference interface, and each client implements this method:
from abc import ABC, abstractmethod


class Base(ABC):
    """Common interface that every model client implements."""

    @abstractmethod
    def inference(self, img):
        pass

The Triton server exposes an HTTP endpoint on port 8000, a gRPC endpoint on port 8001, and a Prometheus metrics endpoint on port 8002. The following is a sample ResNet client with a gRPC call. You can implement the HTTP interface or gRPC interface according to your use case.
from base import Base
import numpy as np
import tritonclient.grpc as grpcclient
from PIL import Image
import cv2


class Resnet(Base):
    def image_transform_onnx(self, image, size: int) -> np.ndarray:
        '''Image transform helper for onnx runtime inference.'''
        # OpenCV follows the BGR convention and PIL follows RGB
        img = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
        image = Image.fromarray(img)
        image = image.resize((size, size))
        # The image is now represented by 3 layers (red, green, blue),
        # each holding 224 x 224 values
        image = np.array(image)
        # Dummy input for the model at export: torch.randn(1, 3, 224, 224)
        image = image.transpose(2, 0, 1).astype(np.float32)
        # Pixel values currently range between 0-255;
        # the model expects values between 0.0-1.0
        image /= 255
        # Add the batch dimension
        image = image[None, ...]
        return image

    def inference(self, img):
        INPUT_SHAPE = (224, 224)
        TRITON_IP = "localhost"
        TRITON_PORT = 8001
        MODEL_NAME = "resnet"
        INPUTS = []
        OUTPUTS = []
        INPUT_LAYER_NAME = "input"
        OUTPUT_LAYER_NAME = "output"
        INPUTS.append(grpcclient.InferInput(INPUT_LAYER_NAME, [1, 3, INPUT_SHAPE[0], INPUT_SHAPE[1]], "FP32"))
        # class_count=3 asks Triton for the top three classification results
        OUTPUTS.append(grpcclient.InferRequestedOutput(OUTPUT_LAYER_NAME, class_count=3))
        TRITON_CLIENT = grpcclient.InferenceServerClient(url=f"{TRITON_IP}:{TRITON_PORT}")
        INPUTS[0].set_data_from_numpy(self.image_transform_onnx(img, 224))
        results = TRITON_CLIENT.infer(model_name=MODEL_NAME, inputs=INPUTS, outputs=OUTPUTS, headers={})
        output = np.squeeze(results.as_numpy(OUTPUT_LAYER_NAME))
        # Each classification entry is returned as bytes; decode it to str
        lista = [x.decode('utf-8') for x in output.tolist()]
        return lista

In this architecture, the NGINX, Flask, and Triton servers must all be started when the container launches. Edit the serve file and add a line to start the Triton server.
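The exact contents of the serve file are not shown here, so the following is only a sketch of what the added line might look like in a Python serve script. It assumes the `tritonserver` binary is on the path (as it is in the NVIDIA Triton base image) and that the models live in `/opt/program/models`, the directory created in the Dockerfile. The function names are hypothetical.

```python
import subprocess


def triton_command(model_repository="/opt/program/models"):
    """Build the command line that launches Triton on the model repository."""
    return ["tritonserver", f"--model-repository={model_repository}"]


def start_triton(model_repository="/opt/program/models"):
    """Start Triton as a child process and return the process handle."""
    return subprocess.Popen(triton_command(model_repository))
```

The serve script would call `start_triton()` before it launches NGINX and the Flask application, and keep the returned handle so that it can stop Triton when the container shuts down.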
The Dockerfile looks as follows:
FROM nvcr.io/nvidia/tritonserver:22.04-py3

# Build arguments for the Python interpreter and pip
ARG PYTHON=python3
ARG PYTHON_PIP=python3-pip
ARG PIP=pip3

ENV LANG=C.UTF-8

RUN apt-key adv --keyserver keyserver.ubuntu.com --recv-keys A4B469963BF863CC \
    && apt-get update \
    && apt-get install -y nginx \
    && apt-get install -y libgl1-mesa-glx \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*

RUN ${PIP} install -U --no-cache-dir \
    "tritonclient[all]" \
    torch \
    torchvision \
    pillow==9.1.1 \
    scipy==1.8.1 \
    transformers==4.20.1 \
    opencv-python==4.6.0.66 \
    flask \
    gunicorn \
    && ldconfig \
    && apt-get clean \
    && apt-get autoremove -y \
    && rm -rf /var/lib/apt/lists/* /tmp/* ~/* \
    && mkdir -p /opt/program/models/

COPY sm /opt/program
COPY model /opt/program/models
WORKDIR /opt/program

ENTRYPOINT ["python3", "serve"]

Install and configure the AWS CLI with the following code:
sudo apt install awscli
sudo apt install git-all
aws configure
# Enter your AWS access key ID, AWS secret access key, default Region name, and default output format

Run the following command to build the Docker image and push the image to Amazon Elastic Container Registry (Amazon ECR). Provide your Region and account ID.
aws ecr get-login-password --region
Now it's time to verify the result. Launch an ml.c5.xlarge notebook instance from the SageMaker console, and create a notebook with the conda_python3 kernel. The following code snippet shows an example deployment of an inference endpoint. The source code is available in the GitHub repo.
import sagemaker as sage
from sagemaker import get_execution_role

role = get_execution_role()
sess = sage.Session()
account = sess.boto_session.client('sts').get_caller_identity()['Account']
region = sess.boto_session.region_name
# URI of the image pushed to Amazon ECR in the previous step
image = '{}.dkr.ecr.{}.amazonaws.com/inference/mytriton:latest'.format(account, region)

model = sess.create_model(
    name="mytriton",
    role=role,
    container_defs=image)

endpoint_cfg = sess.create_endpoint_config(
    name="MYTRITONCFG",
    model_name="mytriton",
    initial_instance_count=1,
    instance_type="ml.g4dn.xlarge"
)

endpoint = sess.create_endpoint(
    endpoint_name="MyTritonEndpoint",
    config_name="MYTRITONCFG")

Wait about 3 minutes for the inference server to start before you verify the result.
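Instead of waiting a fixed time, you can poll the endpoint status. The following sketch assumes a boto3 SageMaker client (for example, `boto3.client("sagemaker")`) is passed in as `sm_client`. It relies on the `EndpointStatus` field returned by `describe_endpoint`. The function name, polling interval, and timeout are illustrative choices, not part of the original solution.

```python
import time


def wait_until_in_service(sm_client, endpoint_name, poll_seconds=15,
                          timeout_seconds=900, sleep=time.sleep):
    """Poll describe_endpoint until the endpoint is InService."""
    waited = 0
    while True:
        response = sm_client.describe_endpoint(EndpointName=endpoint_name)
        status = response["EndpointStatus"]
        if status == "InService":
            return status
        if status == "Failed":
            raise RuntimeError(f"Endpoint {endpoint_name} failed to deploy")
        if waited >= timeout_seconds:
            raise TimeoutError(f"Endpoint {endpoint_name} is still {status}")
        sleep(poll_seconds)
        waited += poll_seconds
```

The `sleep` parameter exists only so the loop can be exercised without real delays. In a notebook you would call `wait_until_in_service(boto3.client("sagemaker"), "MyTritonEndpoint")`.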
The following code is the ResNet client request:
import base64
import json

import boto3
import cv2

## resnet client
runtime = boto3.Session().client('runtime.sagemaker')
img = cv2.imread('dog.jpg')
# Encode the image as a base64 JPEG string so it can travel in a JSON body
string_img = base64.b64encode(cv2.imencode('.jpg', img)[1]).decode()
payload = json.dumps({"modelname": "resnet", "payload": {"img": string_img}})
endpoint = "MyTritonEndpoint"
response = runtime.invoke_endpoint(EndpointName=endpoint, ContentType="application/json", Body=payload, Accept='application/json')
out = response['Body'].read()
# Parse the JSON response; json.loads is safer than eval for data returned by a service
res = json.loads(out)
print(res)

We get the following response:
{'modelname': 'resnet', 'result': ['11.250000:250:250:malamute, malemute, Alaskan malamute', '9.914062:249:249:Eskimo dog, husky', '9.906250:248:248:Saint Bernard, St Bernard']}

The following code is the YOLOv5 client request:
# yolov5 client
payload = json.dumps({"modelname": "yolov5", "payload": {"img": string_img}})
endpoint = "MyTritonEndpoint"
response = runtime.invoke_endpoint(EndpointName=endpoint, ContentType="application/json", Body=payload, Accept='application/json')
out = response['Body'].read()
# Parse the JSON response; json.loads is safer than eval for data returned by a service
res = json.loads(out)
# Print the raw response bytes
print(str(out))

We get the following response:
b'{"modelname": "yolov5", "result": [[16, 0.9168673157691956, 111.92530059814453, 258.53240966796875, 262.0159606933594, 533.407958984375, 768, 576], [2, 0.6941519379615784, 392.20037841796875, 573.6005249023438, 142.55178833007812, 224.56454467773438, 768, 576], [1, 0.5813695788383484, 131.8942413330078, 473.7420654296875, 179.61459350585938, 427.0913391113281, 768, 576], [7, 0.5316226482391357, 392.82275390625, 572.4647216796875, 144.685546875, 223.052734375, 768, 576]]}'

The following code is the BERT client request:
# bert client
text = "The world has [MASK] people."
payload = json.dumps({"modelname": "bert_base", "payload": {"text": text}})
endpoint = "MyTritonEndpoint"
response = runtime.invoke_endpoint(EndpointName=endpoint, ContentType="application/json", Body=payload, Accept='application/json')
out = response['Body'].read()
# Parse the JSON response; json.loads is safer than eval for data returned by a service
res = json.loads(out)
print(res)

We get the following response:
{'modelname': 'bert_base', 'result': [{'token': 'The world has many people.', 'score': 0.16609132289886475}, {'token': 'The world has no people.', 'score': 0.07334889471530914}, {'token': 'The world has few people.', 'score': 0.0617995485663414}, {'token': 'The world has two people.', 'score': 0.03924647718667984}, {'token': 'The world has its people.', 'score': 0.023465465754270554}]}

These responses show that the architecture works as expected: one endpoint serves all three models.
Note that a running endpoint incurs costs. Therefore, delete the endpoint after you complete the test. The runtime client only invokes endpoints, so use the SageMaker client to delete one:
boto3.client('sagemaker').delete_endpoint(EndpointName=endpoint)
To estimate cost, assume that you have three models, but not all of them are long-running. You're using one endpoint for each model, and the online time of each endpoint is different. Using ml.g4dn.xlarge as an example, the total cost is about $971.52/month. The following table lists the details.
Model Name | Endpoint Running Time/Day | Instance Type | Cost/Month (us-east-1) |
ResNet | 24 hours | ml.g4dn.xlarge | 0.736 * 24 * 30 = $529.92 |
BERT | 8 hours | ml.g4dn.xlarge | 0.736 * 8 * 30 = $176.64 |
YOLOv5 | 12 hours | ml.g4dn.xlarge | 0.736 * 12 * 30 = $264.96 |
The following table shows the cost of sharing one endpoint among the three models using the preceding architecture. The total cost is about $676.80/month. Compared with $971.52/month for three separate endpoints, you save about 30% in costs while also having 24/7 service from your endpoint for all three models.
Model Name | Endpoint Running Time/Day | Instance Type | Cost/Month (us-east-1) |
ResNet, YOLOv5, BERT | 24 hours | ml.g4dn.2xlarge | 0.94 * 24 * 30 = $676.80 |
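The arithmetic behind both tables can be reproduced with a few lines of Python. The hourly rates and running hours are the ones listed in the tables; the variable and function names are illustrative only.

```python
def monthly_cost(hourly_rate, hours_per_day, days=30):
    """Cost of one endpoint instance for a month of the given daily usage."""
    return hourly_rate * hours_per_day * days


# One ml.g4dn.xlarge endpoint per model, each with a different daily uptime
hours_per_model = {"ResNet": 24, "BERT": 8, "YOLOv5": 12}
separate_total = sum(monthly_cost(0.736, h) for h in hours_per_model.values())

# One shared ml.g4dn.2xlarge endpoint that runs all day
shared_total = monthly_cost(0.94, 24)

savings = 1 - shared_total / separate_total
print(f"separate: ${separate_total:.2f}, shared: ${shared_total:.2f}, savings: {savings:.1%}")
# prints "separate: $971.52, shared: $676.80, savings: 30.3%"
```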
In this post, we introduced an improved architecture in which multiple models share one endpoint in SageMaker. Under some conditions, this solution can help you save costs and improve resource utilization. It is suitable for business scenarios with low concurrency and latency-insensitive requirements.
To learn more about SageMaker and AI/ML solutions, refer to Amazon SageMaker.
Zheng Zhang is a Senior Specialist Solutions Architect at AWS. He focuses on helping customers accelerate model training, inference, and deployment for machine learning solutions. He also has extensive experience in large-scale distributed training and in designing AI/ML solutions.
Yinuo He is an AI/ML specialist at AWS. She has experience in designing and developing machine learning-based products that provide better user experiences. She now works to help customers succeed in their ML journey.