SageMaker Distribution is a pre-built Docker image containing many popular packages for machine learning (ML), data science, and data visualization. This includes deep learning frameworks like PyTorch, TensorFlow, and Keras; popular Python packages like NumPy, scikit-learn, and pandas; and IDEs like JupyterLab. In addition to this, SageMaker Distribution supports conda, micromamba, and pip as Python package managers.
In May 2023, we launched SageMaker Distribution as an open-source project at JupyterCon. That launch let you run experiments with SageMaker Distribution in your local environments. We now provide the image natively in Amazon SageMaker Studio, so you gain the performance, compute, and security benefits of running your experiments on Amazon SageMaker.
Compared to the earlier open-source launch, you have the following additional capabilities:
In this post, we show the features and advantages of using the SageMaker Distribution image.
If you have access to an existing Studio domain, you can launch SageMaker Studio. To create a Studio domain, follow the directions in Onboard to Amazon SageMaker Domain.
You can now start running your commands without needing to install common ML packages and frameworks! You can also run notebooks from the SageMaker examples repository that use supported frameworks such as PyTorch and TensorFlow, without having to switch the active kernel.
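As a quick sanity check, the following sketch reports which of the packages named in this post can be found in the current kernel. The list of import names is our own assumption based on the packages mentioned above (for example, scikit-learn is imported as sklearn and PyTorch as torch); it is not an official manifest of the image.

```python
from importlib.util import find_spec

# Import names for packages this post says ship with the image.
# Assumption: these are the usual top-level module names, not an official list.
EXPECTED_MODULES = [
    "torch",
    "tensorflow",
    "keras",
    "numpy",
    "sklearn",
    "pandas",
    "jupyterlab",
]


def check_modules(names):
    """Map each top-level module name to True if it can be located.

    find_spec only locates the module; it does not import it, so the
    check stays fast even for heavy frameworks.
    """
    return {name: find_spec(name) is not None for name in names}


availability = check_modules(EXPECTED_MODULES)
for name, found in availability.items():
    print(f"{name}: {'available' if found else 'missing'}")
```

On a kernel backed by the SageMaker Distribution image, we would expect every entry to be reported as available; elsewhere, the output simply tells you what is missing.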
In the public beta announcement, we discussed graduating notebooks from local compute environments to SageMaker Studio, and also operationalizing the notebook using notebook jobs.
Additionally, you can directly run your local notebook code as a SageMaker training job by simply adding a @remote decorator to your function.
Let’s try an example. Add the following code to your Studio notebook running on the SageMaker Distribution image:
from sagemaker.remote_function import remote

@remote(instance_type="ml.m5.xlarge", dependencies='./requirements.txt')
def divide(x, y):
    return x / y

divide(2, 3.0)
When you run the cell, the function runs as a remote SageMaker training job on an ml.m5.xlarge instance, and the SDK automatically uses the SageMaker Distribution image, hosted in Amazon Elastic Container Registry (Amazon ECR), as the training image. For deep learning workloads, you can also run your script on multiple parallel instances.
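If you want the same notebook cell to work both locally and as a training job, one option is to make the decorator opt-in. The following is a minimal sketch, not part of the SageMaker SDK: the maybe_remote helper and the SM_DIST_RUN_REMOTE environment variable are hypothetical names we made up for illustration. Only the remote decorator and its instance_type and dependencies arguments come from the example above.

```python
import os
from functools import wraps


def maybe_remote(**remote_kwargs):
    """Return the SDK's @remote decorator when opted in, else a pass-through.

    Hypothetical helper: SM_DIST_RUN_REMOTE is an illustrative variable
    name, not something the SageMaker SDK reads.
    """
    if os.environ.get("SM_DIST_RUN_REMOTE") == "1":
        # Imported lazily so the cell still runs where the SDK
        # or AWS credentials are not available.
        from sagemaker.remote_function import remote

        return remote(**remote_kwargs)

    def passthrough(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            # Local mode: call the function directly in this process.
            return func(*args, **kwargs)

        return wrapper

    return passthrough


@maybe_remote(instance_type="ml.m5.xlarge", dependencies="./requirements.txt")
def divide(x, y):
    return x / y


result = divide(2, 3.0)
print(result)
```

With the variable unset, divide runs in the notebook process; setting it to 1 before running the cell hands the same function to the @remote decorator.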
SageMaker Distribution is available as a public Docker image. However, for data scientists more familiar with Conda environments than Docker, the GitHub repository also provides the environment files for each image build so you can build Conda environments for both CPU and GPU versions.
The build artifacts for each version are stored under the sagemaker-distribution/build_artifacts directory. To create the same environment as any of the available SageMaker Distribution versions, run the following commands, replacing the --file parameter with the right environment files:
conda create --name conda-sagemaker-distribution --file sagemaker-distribution/build_artifacts/v0/v0.2/v0.2.1/cpu.env.out
# activate the environment
conda activate conda-sagemaker-distribution
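The path in the command above follows a major, minor, patch directory layout. The following sketch builds that path for a given version so you don't have to type it by hand. The env_file_path helper is hypothetical, and the gpu.env.out file name is an assumption inferred from the cpu.env.out name shown above; check the repository for the files that actually exist for your version.

```python
from pathlib import PurePosixPath


def env_file_path(version, processor="cpu",
                  root="sagemaker-distribution/build_artifacts"):
    """Build the environment file path for a version such as "0.2.1".

    Assumption: GPU builds use "gpu.env.out", mirroring the
    "cpu.env.out" name used in the command above.
    """
    if processor not in ("cpu", "gpu"):
        raise ValueError(f"processor must be 'cpu' or 'gpu', got {processor!r}")

    parts = version.split(".")
    if len(parts) != 3 or not all(p.isdigit() for p in parts):
        raise ValueError(f"expected a version like '0.2.1', got {version!r}")
    major, minor, patch = parts

    path = (
        PurePosixPath(root)
        / f"v{major}"
        / f"v{major}.{minor}"
        / f"v{major}.{minor}.{patch}"
        / f"{processor}.env.out"
    )
    return str(path)


print(env_file_path("0.2.1"))
```

You can pass the returned string as the value of --file in the conda create command.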
The open-source SageMaker Distribution image has the most commonly used packages for data science and ML. However, data scientists might require access to additional packages, and enterprise customers might have proprietary packages that provide additional capabilities for their users. In such cases, there are multiple options to have a runtime environment with all required packages. In order of increasing complexity, they are listed as follows:
If you experimented with SageMaker Studio, shut down all Studio apps to avoid paying for unused compute. See Shut down and Update Studio Apps for instructions.
Today, we announced the launch of the open-source SageMaker Distribution image within SageMaker Studio. We showed you how to use the image in SageMaker Studio as one of the available first-party images, how to operationalize your scripts using the SageMaker Python SDK @remote decorator, how to reproduce the Conda environments from SageMaker Distribution outside Studio, and how to customize the image. We encourage you to try out SageMaker Distribution and share your feedback through GitHub!
Durga Sury is an ML Solutions Architect in the Amazon SageMaker Service SA team. She is passionate about making machine learning accessible to everyone. In her 4 years at AWS, she has helped set up AI/ML platforms for enterprise customers. When she isn’t working, she loves motorcycle rides, mystery novels, and hiking with her 5-year-old husky.
Ketan Vijayvargiya is a Senior Software Development Engineer in Amazon Web Services (AWS). His focus areas are machine learning, distributed systems and open source. Outside work, he likes to spend his time self-hosting and enjoying nature.