Amazon SageMaker and SageMaker inference endpoints provide the capability to train and deploy your AI and machine learning (ML) workloads. With inference endpoints, you can deploy your models for real-time or batch inference. The endpoints support various types of ML models hosted using AWS Deep Learning Containers or your own containers with custom AI/ML algorithms. When you launch SageMaker inference endpoints with multiple instances, SageMaker distributes the instances across multiple Availability Zones (in a single Region) for high availability.
In some cases, however, you may need to deploy inference endpoints in multiple Regions to ensure the lowest possible latency for customers in diverse geographic areas. Multi-Regional deployment of SageMaker endpoints and other related application and infrastructure components can also be part of a disaster recovery strategy for your mission-critical workloads, aimed at mitigating the risk of a Regional failure.
SageMaker Projects provides a set of prebuilt MLOps templates that can help manage endpoint deployments. In this post, we show how you can extend an MLOps SageMaker Projects pipeline to enable multi-Regional deployment of your AI/ML inference endpoints.
SageMaker Projects deploys both training and deployment MLOps pipelines; you can use these to train a model and deploy it using an inference endpoint. To reduce the complexity and cost of a multi-Region solution, we assume that you train the model in a single Region and deploy inference endpoints in two or more Regions.
This post presents a solution that slightly modifies a SageMaker project template to support multi-Region deployment. To better illustrate the changes, the following figure displays both the standard MLOps pipeline created automatically by SageMaker (Steps 1–5) and the changes required to extend it to a secondary Region (Steps 6–11).
The SageMaker Projects template automatically deploys a boilerplate MLOps solution, which includes the following components:
The following additional steps modify the MLOps Projects template to enable the AI/ML model deployment in the secondary Region:
Steps 9–11: Optionally, create resources in Amazon Route 53, Amazon API Gateway, and AWS Lambda to route application traffic to inference endpoints in the secondary Region.
Create a SageMaker project in your primary Region (us-east-2 in this post). Complete the steps in Building, automating, managing, and scaling ML workflows using Amazon SageMaker Pipelines until the section Modifying the sample code for a custom use case.
In this section, we discuss how to add manual CodePipeline approval and cross-Region model deployment stages to the existing pipeline that SageMaker created for you.
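In this solution the stages are added through the project's deployment template, but the following minimal sketch illustrates what a manual approval stage looks like in a CodePipeline structure if you were to add it with boto3. The pipeline name and stage names are assumptions for illustration only.

```python
# Sketch: append a manual approval stage to the model deploy pipeline (illustrative only;
# in the post the stages come from the modified project template, not from this script).
import boto3

codepipeline = boto3.client("codepipeline", region_name="us-east-2")  # primary Region

# "sagemaker-my-project-modeldeploy" is a placeholder pipeline name.
pipeline = codepipeline.get_pipeline(name="sagemaker-my-project-modeldeploy")["pipeline"]

approval_stage = {
    "name": "ApproveSecondaryRegionDeployment",
    "actions": [
        {
            "name": "ManualApproval",
            "actionTypeId": {
                "category": "Approval",
                "owner": "AWS",
                "provider": "Manual",
                "version": "1",
            },
            "runOrder": 1,
        }
    ],
}

# Add the approval stage ahead of the secondary Region deployment stage.
pipeline["stages"].append(approval_stage)
codepipeline.update_pipeline(pipeline=pipeline)
```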
We need to grant additional permissions to the AWS Identity and Access Management (IAM) role AmazonSageMakerServiceCatalogProductsUseRole, created by AWS Service Catalog, to enable CodePipeline and S3 bucket access for cross-Region deployment.
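The following is a minimal sketch of attaching such permissions as an inline policy with boto3. The specific actions, the replication bucket name, and the policy name are assumptions; scope them to the resources your pipeline actually uses.

```python
# Sketch: grant the Service Catalog product role S3 and CodePipeline access needed
# for cross-Region deployment. Actions and ARNs below are illustrative assumptions.
import json
import boto3

iam = boto3.client("iam")

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
            # Replication bucket in the secondary Region (placeholder name)
            "Resource": [
                "arn:aws:s3:::sagemaker-my-project-secondary-artifacts",
                "arn:aws:s3:::sagemaker-my-project-secondary-artifacts/*",
            ],
        },
        {
            "Effect": "Allow",
            "Action": [
                "codepipeline:PutJobSuccessResult",
                "codepipeline:PutJobFailureResult",
            ],
            "Resource": "*",
        },
    ],
}

iam.put_role_policy(
    RoleName="AmazonSageMakerServiceCatalogProductsUseRole",
    PolicyName="CrossRegionDeploymentAccess",
    PolicyDocument=json.dumps(policy),
)
```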
To spin up an inference endpoint in the secondary Region, the SecondaryRegionDeployment stage needs a CloudFormation template (endpoint-config-template-secondary-region.yml) and a configuration file (secondary-region-config.json).
The CloudFormation template is configured entirely through parameters; you can further modify it to fit your needs. Similarly, you can use the config file to define the parameters for the endpoint launch configuration, such as the instance type and instance count:
{
  "Parameters": {
    "StageName": "secondary-prod",
    "EndpointInstanceCount": "1",
    "EndpointInstanceType": "ml.m5.large",
    "SamplingPercentage": "100",
    "EnableDataCapture": "true"
  }
}
To add these files to your project, download them from the provided links and upload them to Amazon SageMaker Studio in the primary Region. In Studio, choose File Browser and then the folder containing your project name and ending with modeldeploy.
Upload these files to the deployment repository’s root folder by choosing the upload icon. Make sure the files are located in the root folder as shown in the following screenshot.
Next, we need to adjust the deployment build.py file to enable SageMaker endpoint deployment in the secondary Region by doing the following:
You can download the updated build.py file and replace the existing one in your folder. In Studio, choose File Browser and then the folder containing your project name and ending with modeldeploy. Locate the build.py file and replace it with the one you downloaded.
The CloudFormation template uses the model artifacts stored in an S3 bucket and the Amazon ECR image path to deploy the inference endpoint in the secondary Region. This differs from the deployment from the model registry in the primary Region, because you don't need a model registry in the secondary Region.
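To make the idea behind the build.py changes concrete, the following sketch looks up the latest approved model package in the primary Region's model registry and extracts the artifact location and inference image, which is the kind of information the secondary Region stack is parameterized with. The model package group name and Region are placeholders; the actual logic lives in the updated build.py you downloaded.

```python
# Sketch: resolve the latest approved model's artifact URL and ECR image from the
# primary Region model registry (names and Region below are assumptions).
import boto3

sm_client = boto3.client("sagemaker", region_name="us-east-2")  # primary Region

latest_approved = sm_client.list_model_packages(
    ModelPackageGroupName="my-project-model-group",
    ModelApprovalStatus="Approved",
    SortBy="CreationTime",
    SortOrder="Descending",
    MaxResults=1,
)["ModelPackageSummaryList"][0]

details = sm_client.describe_model_package(
    ModelPackageName=latest_approved["ModelPackageArn"]
)
container = details["InferenceSpecification"]["Containers"][0]

model_data_url = container["ModelDataUrl"]  # S3 location of the model artifacts
image_uri = container["Image"]              # ECR image path used for inference

print(model_data_url, image_uri)
```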
buildspec.yml contains instructions run by CodeBuild. We modify this file to do the following:
Open the buildspec.yml file from the model deploy folder and make the highlighted modifications as shown in the following screenshot.
Alternatively, you can download the following buildspec.yml file to replace the default file.
In this step, you add configuration parameters required for CodeBuild to create the model deployment configuration files in the secondary Region.
You need the values that you specified for model training for the FRAMEWORK and MODEL_VERSION parameters. For example, to find these values for the Abalone model used in the MLOps boilerplate deployment, open Studio and on the File Browser menu, open the folder with your project name and ending with modelbuild. Navigate to pipelines/abalone and open the pipeline.py file. Search for sagemaker.image_uris.retrieve and copy the relevant values.
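For reference, the call in pipeline.py looks roughly like the following sketch; the framework and version arguments are the values to reuse. The xgboost/1.0-1 values shown here match the Abalone sample, but verify them against your own pipeline.py.

```python
# Sketch: the training image lookup in the modelbuild pipeline; the framework and
# version arguments map to the FRAMEWORK and MODEL_VERSION configuration parameters.
import sagemaker

image_uri = sagemaker.image_uris.retrieve(
    framework="xgboost",   # -> FRAMEWORK
    region="us-east-2",
    version="1.0-1",       # -> MODEL_VERSION
    py_version="py3",
    instance_type="ml.m5.xlarge",
)
```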
We need to create an S3 bucket to hold the model artifacts in the secondary Region. SageMaker uses this bucket to get the latest version of the model to spin up an inference endpoint. You only need to do this once. CodeBuild automatically syncs the content of the bucket in the primary Region to the replication bucket with each pipeline run.
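You can create the bucket from the console or with a short script such as the following sketch. The bucket name and secondary Region are assumptions; bucket names must be globally unique.

```python
# Sketch: one-time creation of the replication bucket in the secondary Region.
import boto3

secondary_region = "us-west-2"  # placeholder secondary Region
s3 = boto3.client("s3", region_name=secondary_region)

s3.create_bucket(
    Bucket="sagemaker-my-project-secondary-artifacts",  # placeholder name
    CreateBucketConfiguration={"LocationConstraint": secondary_region},
)
```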
The deployment stage of the pipeline requires an approved model to start. This is required for the deployment in the primary Region.
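You can approve the model package version from the Studio model registry UI, or programmatically as in the following sketch. The model package ARN is a placeholder.

```python
# Sketch: approve the latest model package version so the deploy pipeline can proceed.
import boto3

sm_client = boto3.client("sagemaker", region_name="us-east-2")

sm_client.update_model_package(
    ModelPackageArn="arn:aws:sagemaker:us-east-2:111122223333:model-package/my-project-model-group/1",
    ModelApprovalStatus="Approved",
)
```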
All the changes required for multi-Region deployment of your SageMaker inference endpoint are now complete and you can start the deployment process.
Pushing these changes to the CodeCommit repository triggers a new pipeline run, because an EventBridge rule monitors the repository for pushed commits. After a few moments, you can monitor the run by navigating to the pipeline on the CodePipeline console.
Make sure to provide manual approval for deployment to production and the secondary Region.
You can verify that the secondary Region endpoint is created on the SageMaker console by choosing Dashboard in the navigation pane and confirming the endpoint status under Recent activity.
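Alternatively, you can check the endpoint status programmatically, as in the following sketch. The endpoint name and secondary Region are assumptions; use the name generated by your project.

```python
# Sketch: confirm the secondary Region endpoint is InService.
import boto3

sm_client = boto3.client("sagemaker", region_name="us-west-2")  # secondary Region
response = sm_client.describe_endpoint(EndpointName="my-project-secondary-prod")
print(response["EndpointStatus"])  # expect "InService"
```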
You can optionally follow the instructions in Call an Amazon SageMaker model endpoint using Amazon API Gateway and AWS Lambda to expose the SageMaker inference endpoint in the secondary Region as an API using API Gateway and Lambda.
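A minimal sketch of such a Lambda function behind API Gateway is shown below, along the lines of the referenced post. The endpoint name environment variable and the CSV payload format are assumptions that depend on your model.

```python
# Sketch: Lambda handler that forwards API Gateway requests to the SageMaker
# inference endpoint in the secondary Region.
import os
import boto3

runtime = boto3.client("sagemaker-runtime")
ENDPOINT_NAME = os.environ["ENDPOINT_NAME"]  # set in the Lambda configuration

def lambda_handler(event, context):
    payload = event["body"]  # e.g., a CSV row of features (assumed format)
    response = runtime.invoke_endpoint(
        EndpointName=ENDPOINT_NAME,
        ContentType="text/csv",
        Body=payload,
    )
    prediction = response["Body"].read().decode("utf-8")
    return {"statusCode": 200, "body": prediction}
```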
To delete the SageMaker project, see Delete an MLOps Project using Amazon SageMaker Studio. To ensure the secondary inference endpoint is destroyed, go to the AWS CloudFormation console and delete the related stacks in your primary and secondary Regions; this destroys the SageMaker inference endpoints.
In this post, we showed how an MLOps specialist can modify a preconfigured MLOps template for their own multi-Region deployment use case, such as deploying workloads in multiple geographies or as part of implementing a multi-Regional disaster recovery strategy. With this deployment approach, you don't need to configure services in the secondary Region and can reuse the CodePipeline and CodeBuild setup in the primary Region for cross-Regional deployment. Additionally, you can save on costs by continuing to train your models in the primary Region while utilizing SageMaker inference in multiple Regions to scale your AI/ML deployment globally.
Please let us know your feedback in the comments section.
Mehran Najafi, PhD, is a Senior Solutions Architect for AWS focused on AI/ML and SaaS solutions at Scale.
Steven Alyekhin is a Senior Solutions Architect for AWS focused on MLOps at Scale.