Today, many AWS customers are building enterprise-ready machine learning (ML) platforms on Amazon Elastic Kubernetes Service (Amazon EKS) using Kubeflow on AWS (an AWS-specific distribution of Kubeflow) across many use cases, including computer vision, natural language understanding, speech translation, and financial modeling.
With the latest release of open-source Kubeflow v1.6.1, the Kubeflow community continues to support this large-scale adoption of Kubeflow for enterprise use cases. The latest release includes many exciting new features, such as support for Kubernetes v1.22, a combined Python SDK for PyTorch, MXNet, MPI, and XGBoost in Kubeflow’s distributed Training Operator, new ClusterServingRuntime and ServingRuntime CRDs for model serving, and many more.
AWS contributions to Kubeflow, with the recent launch of Kubeflow on AWS 1.6.1, support all upstream open-source Kubeflow features and add many new integrations with highly optimized, cloud-native, enterprise-ready AWS services that help you build highly reliable, secure, portable, and scalable ML systems.
In this post, we discuss new Kubeflow on AWS v1.6.1 features and highlight three important integrations that have been bundled on one platform to offer you:
The use case in this post focuses specifically on the SageMaker integration with Kubeflow on AWS, which can be added to your existing Kubernetes workflows to enable you to build hybrid machine learning architectures.
Kubeflow on AWS 1.6.1 provides a clear path to use Kubeflow, with the addition of the following AWS services on top of existing capabilities:
The following architecture diagram is a quick snapshot of all the service integrations (including the ones already mentioned) that are available for Kubeflow control and data plane components in Kubeflow on AWS. The Kubeflow control plane is installed on top of Amazon EKS, which is a managed container service used to run and scale Kubernetes applications in the cloud. These AWS service integrations allow you to decouple critical parts of the Kubeflow control plane from Kubernetes, providing a secure, scalable, resilient, and cost-optimized design. For more details on the value that these service integrations add over open-source Kubeflow, refer to Build and deploy a scalable machine learning system on Kubernetes with Kubeflow on AWS.
Let’s discuss in more detail how the key features of Kubeflow on AWS 1.6.1 can help your organization.
With the Kubeflow 1.6.1 release, we tried to provide better tools for different kinds of customers that make it easy to get started with Kubeflow no matter which options you choose. These tools provide a good starting point and can be modified to fit your exact needs.
We provide different deployment options for different customer use cases. Here you get to choose which AWS services you want to integrate your Kubeflow deployment with. If you decide to change deployment options later, we recommend that you do a fresh installation for the new deployment. The following deployment options are available:
If you want to deploy Kubeflow with minimal changes, consider the vanilla deployment option. All available deployment options can be installed using Kustomize, Helm, or Terraform.
We also have different add-on deployments that can be installed on top of any of these deployment options:
After you have decided which deployment option best suits your needs, you can choose how you want to install these deployments. In an effort to serve experts and newcomers alike, we have different levels of automation and configuration.
This creates an EKS cluster and all the related AWS infrastructure resources, and then deploys Kubeflow all in one command using Terraform. Internally, this uses EKS blueprints and Helm charts.
This option has the following advantages:
This option allows you to deploy Kubeflow in a two-step process:
This option has the following advantages:
The following diagram illustrates the architectures of both options.
SageMaker is a fully managed service designed and optimized specifically for managing ML workflows. It removes the undifferentiated heavy lifting of infrastructure management and eliminates the need to invest in IT and DevOps to manage clusters for ML model building, training, and inference.
Many AWS customers who have portability requirements or on-premises standard restrictions use Amazon EKS to set up repeatable ML pipelines running training and inference workloads. However, this requires developers to write custom code to optimize the underlying ML infrastructure, provide high availability and reliability, and comply with appropriate security and regulatory requirements. These customers therefore want to use SageMaker for cost-optimized and managed infrastructure for model training and deployments and continue using Kubernetes for orchestration and ML pipelines to retain standardization and portability.
To address this need, AWS allows you to train, tune, and deploy models in SageMaker from Amazon EKS by using two options: SageMaker ACK Operators for Kubernetes and SageMaker Components for Kubeflow Pipelines.
Starting with Kubeflow on AWS v1.6.1, all of the available Kubeflow deployment options bring together both SageMaker integration options by default on one platform. This means you can now submit SageMaker jobs either from a Kubeflow notebook server itself, by applying the SageMaker custom resource handled by the SageMaker ACK operators, or from a Kubeflow pipeline step using SageMaker components.
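To illustrate the first option, the following is a minimal sketch of submitting a SageMaker training job from a notebook by creating a TrainingJob custom resource with the Kubernetes Python client. The API group, version, and spec field names shown here are assumptions, so confirm them against the TrainingJob CRD installed by your SageMaker ACK controller, and treat the angle-bracket values as placeholders:

from kubernetes import client, config

# Inside a Kubeflow notebook pod, the in-cluster configuration is typically available.
config.load_incluster_config()

# Minimal TrainingJob custom resource. The apiVersion and field names are
# assumptions -- verify them against `kubectl explain trainingjob` for your
# installed SageMaker ACK controller.
training_job = {
    "apiVersion": "sagemaker.services.k8s.aws/v1alpha1",
    "kind": "TrainingJob",
    "metadata": {"name": "cifar10-distributed-training"},
    "spec": {
        "trainingJobName": "cifar10-distributed-training",
        "roleARN": "<sagemaker-execution-role-arn>",
        "algorithmSpecification": {
            "trainingImage": "<training-image-uri>",
            "trainingInputMode": "File",
        },
        "resourceConfig": {
            "instanceType": "ml.p3.2xlarge",
            "instanceCount": 2,
            "volumeSizeInGB": 50,
        },
        "outputDataConfig": {"s3OutputPath": "s3://<bucket>/output"},
        "stoppingCondition": {"maxRuntimeInSeconds": 3600},
    },
}

# Submit the custom resource; the ACK controller reconciles it into a SageMaker training job.
client.CustomObjectsApi().create_namespaced_custom_object(
    group="sagemaker.services.k8s.aws",
    version="v1alpha1",
    namespace="kubeflow-user-example-com",
    plural="trainingjobs",
    body=training_job,
)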
There are two versions of SageMaker Components: version 1 components based on Boto3 (the AWS SDK for Python), and version 2 components based on the SageMaker Operator for Kubernetes (ACK). The new version 2 components support the latest SageMaker training APIs, and we will continue to add more SageMaker features to this version of the components. You do, however, have the flexibility to combine SageMaker components version 2 for training with version 1 components for other SageMaker features, such as hyperparameter tuning, processing jobs, and hosting.
Prometheus is an open-source metrics aggregation tool that you can configure to run on Kubernetes clusters. When running on Kubernetes clusters, a main Prometheus server periodically scrapes pod endpoints.
Kubeflow components, such as Kubeflow Pipelines (KFP) and Notebook, emit Prometheus metrics to allow monitoring component resources such as the number of running experiments or notebook count.
These metrics can be aggregated by a Prometheus server running in the Kubernetes cluster and queried using Prometheus Query Language (PromQL). For more details on the features that Prometheus supports, check out the Prometheus documentation.
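As an illustration, once a Prometheus server is reachable (for example, port-forwarded from the cluster), you can run a PromQL query against its HTTP API from Python. This is a minimal sketch; the service name, port, and metric name below are placeholder assumptions, so substitute values from your own cluster:

import requests

# Assumed address of a Prometheus server, e.g. after running
# `kubectl port-forward svc/prometheus-server 9090:9090`.
PROMETHEUS_URL = "http://localhost:9090"

# Illustrative PromQL query; replace the metric with one actually exposed
# by your Kubeflow components (the metric name here is a placeholder).
query = 'sum(process_resident_memory_bytes{namespace="kubeflow"})'

# Instant query against the Prometheus HTTP API
resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": query})
resp.raise_for_status()

# Print each returned series with its current value
for result in resp.json()["data"]["result"]:
    print(result["metric"], result["value"])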
The Kubeflow on AWS distribution provides support for the integration with the following AWS managed services:
The Kubeflow on AWS distribution provides support for the integration of Amazon Managed Service for Prometheus and Amazon Managed Grafana to facilitate the ingestion and visualization of Prometheus metrics securely at scale.
The following metrics are ingested and can be visualized:
To configure Amazon Managed Service for Prometheus and Amazon Managed Grafana for your Kubeflow cluster, refer to Use Prometheus, Amazon Managed Service for Prometheus, and Amazon Managed Grafana to monitor metrics with Kubeflow on AWS.
In this use case, we use the Kubeflow vanilla deployment with the Terraform installation option. When installation is complete, we log in to the Kubeflow dashboard. From the dashboard, we spin up a Kubeflow Jupyter notebook server to build a Kubeflow pipeline that uses SageMaker to run distributed training for an image classification model and a SageMaker endpoint for model deployment.
Make sure you meet the following prerequisites:
You also configure an AWS Command Line Interface (AWS CLI) profile. To do so, you need an access key ID and secret access key of an AWS Identity and Access Management (IAM) user account with administrative privileges (attach the existing managed policy) and programmatic access. See the following code:
aws configure --profile=kubeflow
AWS Access Key ID [None]:
Verify the permissions that AWS Cloud9 will use to call AWS resources:
aws sts get-caller-identity
Verify from the following output that you see the ARN of the admin user that you configured in the AWS CLI profile. In this example, it is kubeflow-user.
{
    "UserId": "*******",
    "Account": "********",
    "Arn": "arn:aws:iam::*******:user/kubeflow-user"
}
To install Amazon EKS and Kubeflow on AWS, complete the following steps:
To access the Kubeflow dashboard, complete the following steps:
Once you’re logged in to the Kubeflow dashboard, ensure you have the right namespace (kubeflow-user-example-com) chosen. Complete the following steps to set up your Kubeflow on AWS environment:
After you set up the Jupyter notebook, you can run the entire demo using the following high-level steps from the folder eks-kubeflow-cloudformation-quick-start/workshop/pytorch-distributed-training in the cloned repository:
In the subsequent sections, we discuss each of these steps in detail.
As part of the distributed training, we train a classification model created by a simple convolutional neural network that operates on the CIFAR10 dataset. The training script cifar10-distributed-gpu-final.py contains only open-source libraries and is compatible with both Kubernetes and SageMaker training clusters, on either GPU devices or CPU instances. Let’s look at a few important aspects of the training script before we run our notebook examples.
We use the torch.distributed module, which contains PyTorch support and communication primitives for multi-process parallelism across nodes in the cluster:
...
import torch
import torch.distributed as dist
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torch.utils.data
import torch.utils.data.distributed
import torchvision
from torchvision import datasets, transforms
...
We create a simple image classification model using a combination of convolutional, max pooling, and linear layers, with a ReLU activation function applied in the forward pass of the model training:
# Define models
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 16 * 5 * 5)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x
If the training cluster has GPUs, the script runs the training on CUDA devices and the device variable holds the default CUDA device:
device = "cuda" if torch.cuda.is_available() else "cpu"
...
Before you use PyTorch DistributedDataParallel to run distributed processing on multiple nodes, you need to initialize the distributed environment by calling init_process_group. This is initialized on each machine of the training cluster.
dist.init_process_group(backend=args.backend, rank=host_rank, world_size=world_size)
...
We instantiate the classifier model and copy over the model to the target device. If distributed training is enabled to run on multiple nodes, the DistributedDataParallel class is used as a wrapper object around the model object, which allows synchronous distributed training across multiple machines. The input data is split on the batch dimension and a replica of the model is placed on each machine and each device. See the following code:
model = Net().to(device)

if is_distributed:
    model = torch.nn.parallel.DistributedDataParallel(model)
...
The notebook uses the Kubeflow Pipelines SDK and its provided set of Python packages to specify and run the ML workflow pipelines. As part of this SDK, we use the domain-specific language (DSL) package decorator dsl.pipeline, which decorates the Python functions to return a pipeline.
The Kubeflow pipeline uses the SageMaker components version 2 for submitting training to SageMaker via the SageMaker ACK operators. SageMaker model creation and model deployment use the SageMaker components version 1, which are Boto3-based. We use a combination of both components in this example to demonstrate the flexibility you have in this choice.
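The pipeline ops used in the following code (sagemaker_train_ack_op, sagemaker_model_op, and sagemaker_deploy_op) are created by loading the corresponding component definitions with the Kubeflow Pipelines SDK. The following is a minimal sketch of how that loading typically looks; the component.yaml paths are assumptions, so check the kubeflow/pipelines repository for the exact paths that match your release:

from kfp import components

# Base location of the SageMaker components in the kubeflow/pipelines repository.
# The branch and paths below are assumptions -- verify them against the release
# of the SageMaker components you intend to use.
REPO = "https://raw.githubusercontent.com/kubeflow/pipelines/master/components/aws/sagemaker"

# Version 2 (ACK-based) training component
sagemaker_train_ack_op = components.load_component_from_url(
    f"{REPO}/TrainingJob/component.yaml"
)

# Version 1 (Boto3-based) components for model creation and endpoint deployment
sagemaker_model_op = components.load_component_from_url(f"{REPO}/model/component.yaml")
sagemaker_deploy_op = components.load_component_from_url(f"{REPO}/deploy/component.yaml")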
In the following code, we create the Kubeflow pipeline where we run SageMaker distributed training using two ml.p3.2xlarge instances:
# Create Kubeflow Pipeline using Amazon SageMaker Service
@dsl.pipeline(name="PyTorch Training pipeline", description="Sample training job test")
def pytorch_cnn_pipeline(region=target_region,
                         train_image=aws_dlc_sagemaker_train_image,
                         serving_image=aws_dlc_sagemaker_serving_image,
                         learning_rate='0.01',
                         pytorch_backend='gloo',
                         training_job_name=pytorch_distributed_jobname,
                         instance_type='ml.p3.2xlarge',
                         instance_count='2',
                         network_isolation='False',
                         traffic_encryption='False',
                         ):

    # Step to run training on SageMaker using SageMaker Components V2 for Pipeline.
    training = sagemaker_train_ack_op(
        region=region,
        algorithm_specification=(f'{{ '
                                 f'"trainingImage": "{train_image}",'
                                 '"trainingInputMode": "File"'
                                 f'}}'),
        training_job_name=training_job_name,
        hyper_parameters=(f'{{ '
                          f'"backend": "{pytorch_backend}",'
                          '"batch-size": "64",'
                          '"epochs": "10",'
                          f'"lr": "{learning_rate}",'
                          '"model-type": "custom",'
                          '"sagemaker_container_log_level": "20",'
                          '"sagemaker_program": "cifar10-distributed-gpu-final.py",'
                          f'"sagemaker_region": "{region}",'
                          f'"sagemaker_submit_directory": "{source_s3}"'
                          f'}}'),
        resource_config=(f'{{ '
                         f'"instanceType": "{instance_type}",'
                         f'"instanceCount": {instance_count},'
                         '"volumeSizeInGB": 50'
                         f'}}'),
        input_data_config=training_input(datasets),
        output_data_config=training_output(bucket_name),
        enable_network_isolation=network_isolation,
        enable_inter_container_traffic_encryption=traffic_encryption,
        role_arn=role,
        stopping_condition={"maxRuntimeInSeconds": 3600}
    )

    model_artifact_url = get_s3_model_artifact_op(
        training.outputs["model_artifacts"]
    ).output

    # This step creates SageMaker Model which refers to model artifacts and
    # inference script to deserialize the input image
    create_model = sagemaker_model_op(
        region=region,
        model_name=training_job_name,
        image=serving_image,
        model_artifact_url=model_artifact_url,
        network_isolation=network_isolation,
        environment=(f'{{ '
                     '"SAGEMAKER_CONTAINER_LOG_LEVEL": "20",'
                     '"SAGEMAKER_PROGRAM": "inference.py",'
                     f'"SAGEMAKER_REGION": "{region}",'
                     f'"SAGEMAKER_SUBMIT_DIRECTORY": "{model_artifact_url}"'
                     f'}}'),
        role=role
    )

    # This step creates SageMaker Endpoint which will be called to run inference
    prediction = sagemaker_deploy_op(
        region=region,
        model_name_1=create_model.output,
        instance_type_1='ml.c5.xlarge'
    )

    # Disable pipeline cache
    training.execution_options.caching_strategy.max_cache_staleness = "P0D"
After the pipeline is defined, you can compile the pipeline to an Argo YAML specification using the Kubeflow Pipelines SDK’s kfp.compiler package. You can run this pipeline using the Kubeflow Pipelines SDK client, which calls the Pipelines service endpoint and passes in appropriate authentication headers right from the notebook. See the following code:
# DSL Compiler that compiles pipeline functions into workflow yaml.
kfp.compiler.Compiler().compile(pytorch_cnn_pipeline, "pytorch_cnn_pipeline.yaml")

# Connect to Kubeflow Pipelines using the Kubeflow Pipelines SDK client
client = kfp.Client()
experiment = client.create_experiment(name="ml_workflow")

# Run a specified pipeline
my_run = client.run_pipeline(experiment.id, "pytorch_cnn_pipeline", "pytorch_cnn_pipeline.yaml")

# Please click the "Run details" link generated below this cell to view your pipeline.
# You can click every pipeline step to see logs.
The notebook STEP1.1_invoke_sagemaker_endpoint.ipynb invokes the SageMaker inference endpoint created in the previous step. Ensure you update the endpoint name:
# Invoke SageMaker Endpoint. Ensure you update the endpoint name.
# You can grab the SageMaker endpoint name by either 1) going to the pipeline
# visualization in the Kubeflow console and clicking the deployment component, or
# 2) going to the SageMaker console, opening the list of endpoints, and then
# substituting the name into EndpointName='...' in this cell.
endpointName='
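If you prefer to invoke the endpoint outside the notebook, the following is a minimal sketch using Boto3. The endpoint name, payload file, and content type are placeholder assumptions; they must match the endpoint created by the pipeline and what the inference script (inference.py) expects to deserialize:

import boto3

# Placeholder -- replace with the endpoint created by the deployment step
# of the Kubeflow pipeline.
endpoint_name = "<your-sagemaker-endpoint-name>"

runtime = boto3.client("sagemaker-runtime")

# The payload file and ContentType are assumptions; align them with the
# request format that inference.py deserializes.
with open("sample_image.json", "rb") as f:
    payload = f.read()

response = runtime.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/json",
    Body=payload,
)

# Print the raw prediction returned by the endpoint
print(response["Body"].read().decode("utf-8"))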
To clean up your resources, complete the following steps:
In this post, we highlighted the value that Kubeflow on AWS 1.6.1 provides through native AWS-managed service integrations to address the needs of enterprise-level AI and ML use cases. You can choose from several deployment options to install Kubeflow on AWS with various service integrations using Terraform, Kustomize, or Helm. The use case in this post demonstrated a Kubeflow integration with SageMaker that uses a SageMaker managed training cluster to run distributed training for an image classification model and a SageMaker endpoint for model deployment.
We have also made available a sample pipeline example that uses the latest SageMaker components; you can run it directly from the Kubeflow dashboard. This pipeline requires your Amazon S3 data and a SageMaker execution IAM role as inputs.
To get started with Kubeflow on AWS, refer to the available AWS-integrated deployment options in Kubeflow on AWS. You can follow the AWS Labs repository to track all AWS contributions to Kubeflow. You can also find us on the Kubeflow #AWS Slack Channel; your feedback there will help us prioritize the next features to contribute to the Kubeflow project.
Kanwaljit Khurmi is a Senior Solutions Architect at Amazon Web Services. He works with AWS customers to provide guidance and technical assistance, helping them improve the value of their solutions when using AWS. Kanwaljit specializes in helping customers with containerized and machine learning applications.
Kartik Kalamadi is a Software Development Engineer at Amazon AI. He currently focuses on machine learning Kubernetes open-source projects such as Kubeflow and the AWS SageMaker Controller for Kubernetes. In his spare time, he likes playing PC games and fiddling with VR using the Unity engine.
Rahul Kharse is a Software Development Engineer at Amazon Web Services. His work focuses on integrating AWS services with open source containerized ML Ops platforms to improve their scalability, reliability, and security. In addition to focusing on customer requests for features, Rahul also enjoys experimenting with the latest technological developments in the field.