With recent advancements in generative AI, there are lot of discussions happening on how to use generative AI across different industries to solve specific business problems. Generative AI is a type of AI that can create new content and ideas, including conversations, stories, images, videos, and music. It is all backed by very large models that are pre-trained on vast amounts of data and commonly referred to as foundation models (FMs). These FMs can perform a wide range of tasks that span multiple domains, like writing blog posts, generating images, solving math problems, engaging in dialog, and answering questions based on a document. The size and general-purpose nature of FMs make them different from traditional ML models, which typically perform specific tasks, like analyzing text for sentiment, classifying images, and forecasting trends.
While organizations are looking to use the power of these FMs, they also want the FM-based solutions to be running in their own protected environments. Organizations operating in heavily regulated spaces like global financial services and healthcare and life sciences have auditory and compliance requirements to run their environment in their VPCs. In fact, a lot of times, even direct internet access is disabled in these environments to avoid exposure to any unintended traffic, both ingress and egress.
Amazon SageMaker JumpStart is an ML hub offering algorithms, models, and ML solutions. With SageMaker JumpStart, ML practitioners can choose from a growing list of best performing open source FMs. It also provides the ability to deploy these models in your own Virtual Private Cloud (VPC).
In this post, we demonstrate how to use JumpStart to deploy a Flan-T5 XXL model in a VPC with no internet connectivity. We discuss the following topics:
Apart from FLAN-T5 XXL, JumpStart provides lot of different foundation models for various tasks. For the complete list, check out Getting started with Amazon SageMaker JumpStart.
As part of the solution, we cover the following steps:
The following is an architecture diagram of the solution.
Let’s walk through the different steps to implement this solution.
To follow along with this post, you need the following:
Create a new CloudFormation stack by using the 01_networking.yaml template. This template creates a new VPC and adds two private subnets across two Availability Zones with no internet connectivity. It then deploys gateway VPC endpoints for accessing Amazon Simple Storage Service (Amazon S3) and interface VPC endpoints for SageMaker and a few other services to allow the resources in the VPC to connect to AWS services via AWS PrivateLink.
Provide a stack name, such as No-Internet, and complete the stack creation process.
This solution is not highly available because the CloudFormation template creates interface VPC endpoints only in one subnet to reduce costs when following the steps in this post.
Create another CloudFormation stack using 02_sagemaker_studio.yaml, which creates a Studio domain, Studio user profile, and supporting resources like IAM roles. Choose a name for the stack; for this post, we use the name SageMaker-Studio-VPC-No-Internet. Provide the name of the VPC stack you created earlier (No-Internet) as the CoreNetworkingStackName parameter and leave everything else as default.
Wait until AWS CloudFormation reports that the stack creation is complete. You can confirm the Studio domain is available to use on the SageMaker console.
To verify the Studio domain user has no internet access, launch Studio using the SageMaker console. Choose File, New, and Terminal, then attempt to access an internet resource. As shown in the following screenshot, the terminal will keep waiting for the resource and eventually time out.
This proves that Studio is operating in a VPC that doesn’t have internet access.
We can deploy this model via Studio as well as via API. JumpStart provides all the code to deploy the model via a SageMaker notebook accessible from within Studio. For this post, we showcase this capability from the Studio.
NOTE: Irrespective of whether the SageMaker JumpStart model is deployed in the VPC or not, the model always runs in network isolation mode, which isolates the model container so no inbound or outbound network calls can be made to or from the model container. Because we’re using a VPC, SageMaker downloads the model artifact through our specified VPC. Running the model container in network isolation doesn’t prevent your SageMaker endpoint from responding to inference requests. A server process runs alongside the model container and forwards it the inference requests, but the model container doesn’t have network access.
Observe the value of the field Model data location on this page. All the SageMaker JumpStart models are hosted on a SageMaker managed S3 bucket (s3://jumpstart-cache-prod-{region}). Therefore, irrespective of which model is picked from JumpStart, the model gets deployed from the publicly accessible SageMaker JumpStart S3 bucket and the traffic never goes to the public model zoo APIs to download the model. This is why the model endpoint creation started successfully even when we’re creating the endpoint in a VPC that doesn’t have direct internet access.
The model artifact can also be copied to any private model zoo or your own S3 bucket to control and secure model source location further. You can use the following command to download the model locally using the AWS Command Line Interface (AWS CLI):
aws s3 cp s3://jumpstart-cache-prod-eu-west-1/huggingface-infer/prepack/v1.0.2/infer-prepack-huggingface-text2text-flan-t5-xxl.tar.gz .
This concludes the steps to deploy the Flan-T5 XXL model using JumpStart within a VPC with no internet access.
The following are some of the advantages of deploying SageMaker JumpStart models in VPC mode:
In this section, we share some alternate ways to deploy the model.
Models provided by SageMaker JumpStart don’t require you to access Studio. You can deploy them to SageMaker endpoints from any IDE, thanks to the JumpStart APIs. You could skip the Studio setup step discussed earlier in this post and use the JumpStart APIs to deploy the model. These APIs provide arguments where VPC configurations can be supplied as well. The APIs are part of the SageMaker Python SDK itself. For more information, refer to Pre-trained models.
SageMaker JumpStart also provides notebooks to deploy the model directly. On the model detail page, choose Open notebook to open a sample notebook containing the code to deploy the endpoint. The notebook uses SageMaker JumpStart Industry APIs that allow you to list and filter the models, retrieve the artifacts, and deploy and query the endpoints. You can also edit the notebook code per your use case-specific requirements.
Check out the CLEANUP.md file to find detailed steps to delete the Studio, VPC, and other resources created as part of this post.
If you encounter any issues in creating the CloudFormation stacks, refer to Troubleshooting CloudFormation.
Generative AI powered by large language models is changing how people acquire and apply insights from information. However, organizations operating in heavily regulated spaces are required to use the generative AI capabilities in a way that allows them to innovate faster but also simplifies the access patterns to such capabilities.
We encourage you to try out the approach provided in this post to embed generative AI capabilities in your existing environment while still keeping it inside your own VPC with no internet access. For further reading on SageMaker JumpStart foundation models, check out the following:
Vikesh Pandey is a Machine Learning Specialist Solutions Architect at AWS, helping customers from financial industries design and build solutions on generative AI and ML. Outside of work, Vikesh enjoys trying out different cuisines and playing outdoor sports.
Mehran Nikoo is a Senior Solutions Architect at AWS, working with Digital Native businesses in the UK and helping them achieve their goals. Passionate about applying his software engineering experience to machine learning, he specializes in end-to-end machine learning and MLOps practices.