Until recently, customers who wanted to use a deep learning (DL) framework with Amazon SageMaker Processing faced increased complexity compared to those using scikit-learn or Apache Spark. This post shows you how SageMaker Processing has simplified running machine learning (ML) preprocessing and postprocessing tasks with popular frameworks such as PyTorch, TensorFlow, Hugging Face, MXNet, and XGBoost.
Training an ML model takes many steps. One of them, data preparation, is paramount to creating an accurate ML model. A typical preprocessing step includes operations such as the following:
Likewise, you often need to run postprocessing jobs (for example, filtering or collating) and model evaluation jobs (scoring models against different test sets) as part of your ML model development lifecycle.
All these tasks involve running custom scripts on your dataset and saving the processed version for later use by your training jobs. In 2019, we launched SageMaker Processing, a capability of Amazon SageMaker that lets you run your preprocessing, postprocessing, and model evaluation workloads on a fully managed infrastructure. It does the heavy lifting for you, managing the infrastructure that runs your bespoke scripts. It spins up the necessary resources to do the job and tears them down when it’s done.
The SageMaker Python SDK provides a SageMaker Processing library that lets you do the following:
Before release 2.52 of the SageMaker Python SDK, using SageMaker Processing in combination with popular ML frameworks such as PyTorch, TensorFlow, Hugging Face, MXNet, and XGBoost required you to bring your own container. You had to first build a container and then make sure that it included the relevant framework and all its dependencies. We wanted to simplify data scientists’ lives by removing the need to create a custom container image for these popular frameworks. And we wanted to deliver the same consistent experience people already had with Processing when using scikit-learn or Spark.
In the following sections, we show you how to natively use popular ML frameworks such as PyTorch, TensorFlow, Hugging Face, or MXNet with SageMaker Processing, without having to build a single container.
With the introduction of FrameworkProcessor in release 2.52 of the SageMaker Python SDK in August 2021, you no longer need to build a custom container. You can use SageMaker Processing with your preferred ML framework among PyTorch, TensorFlow, Hugging Face, MXNet, and XGBoost by relying on the built-in containers and classes that SageMaker provides. ML practitioners can focus on their data processing code instead of spending energy on maintaining the lifecycle of custom containers. For this post, we only test one framework: PyTorch. However, you can reproduce the same procedure with any of the four other supported frameworks. The differences from one framework to the next are the FrameworkProcessor subclass being used, the framework release, and the framework-specific details of the data processing script.
To illustrate our solution, let’s imagine that we plan to train a model to classify animal pictures. We rely on a publicly available dataset, the COCO dataset, which contains images from Flickr and represents a real-world dataset that isn’t pre-formatted or resized specifically for deep learning. This makes it a good fit for our example scenario. Before we even get to the training stage, our initial problem is that the images we want to use to train our model come in all shapes and sizes. Therefore, to make sure that this doesn’t affect our training or impact the quality of our model, we preprocess the images. In particular, we make sure that they’re the same shape and size before moving any further.
The COCO dataset provides an annotation file that contains information on each image in the dataset, such as the class, superclass, file name, and URL to download the file. We limit the scope of the dataset for the sake of this example by only using animal images. For the train and validation sets, the data we need for the image labels and the file paths are under different headings in the annotations. We only use a small fraction of the dataset, sufficient for this example.
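The filtering described above can be sketched as follows. This is a minimal illustration, not the script used in this post: it assumes the annotation file follows the usual COCO layout with `categories`, `images`, and `annotations` lists, and the function name `select_images_by_supercategory` and the tiny `sample` dictionary are hypothetical stand-ins for the real data.

```python
from collections import defaultdict


def select_images_by_supercategory(coco, supercategory="animal"):
    """Return {file_name: sorted class names} for images that contain at
    least one object whose category belongs to `supercategory`."""
    # Map category id -> class name, keeping only the requested supercategory.
    wanted = {
        cat["id"]: cat["name"]
        for cat in coco["categories"]
        if cat["supercategory"] == supercategory
    }
    # Map image id -> file name so annotations can be joined to files.
    file_names = {img["id"]: img["file_name"] for img in coco["images"]}

    labels = defaultdict(set)
    for ann in coco["annotations"]:
        if ann["category_id"] in wanted:
            labels[file_names[ann["image_id"]]].add(wanted[ann["category_id"]])
    return {name: sorted(classes) for name, classes in labels.items()}


# Hand-written stand-in for a parsed annotation file (assumption, not real data).
sample = {
    "categories": [
        {"id": 17, "name": "cat", "supercategory": "animal"},
        {"id": 18, "name": "dog", "supercategory": "animal"},
        {"id": 3, "name": "car", "supercategory": "vehicle"},
    ],
    "images": [
        {"id": 1, "file_name": "a.jpg"},
        {"id": 2, "file_name": "b.jpg"},
    ],
    "annotations": [
        {"image_id": 1, "category_id": 17},
        {"image_id": 1, "category_id": 3},
        {"image_id": 2, "category_id": 3},
    ],
}

animal_images = select_images_by_supercategory(sample)
print(animal_images)
```

With a real annotation file, you would load the JSON with `json.load` and pass the resulting dictionary to the same function.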
Before we train our model, all image data must have the same dimensions for height, width, and number of channels. Typically, algorithms use a square format, with identical height and width. However, most real-world datasets such as ours contain images in many different dimensions and ratios. To prepare our dataset for training, we need to resize and crop the images if they aren’t already square.
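To make the cropping step concrete, here is a small sketch of how the largest centered square can be computed for an image of arbitrary size. The helper name `center_crop_box` is hypothetical, and the actual script may rely on a library transform instead.

```python
def center_crop_box(width, height):
    """Return (left, upper, right, lower) for the largest centered square."""
    side = min(width, height)
    # Split the excess evenly between the two sides of the longer dimension.
    left = (width - side) // 2
    upper = (height - side) // 2
    return (left, upper, left + side, upper + side)


landscape_box = center_crop_box(640, 480)
portrait_box = center_crop_box(300, 500)
print(landscape_box, portrait_box)
```

The returned tuple uses the same (left, upper, right, lower) convention as Pillow's `Image.crop`, so the square crop could then be scaled to a fixed training size with `Image.resize`.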
We also randomly augment the images to help our training algorithm generalize better. We only augment the training data, not the validation or test data, because we want to generate a prediction on the image as it normally would be presented for inference.
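The rule of augmenting only the training split can be sketched as follows. This is an illustrative assumption rather than the post's code: the image is a plain list of pixel rows, the only augmentation is a random horizontal flip, and the names `horizontal_flip` and `prepare` are hypothetical.

```python
import random


def horizontal_flip(image):
    """Mirror an image stored as a list of pixel rows."""
    return [list(reversed(row)) for row in image]


def prepare(image, split, rng, flip_probability=0.5):
    """Apply random augmentation to training images only."""
    if split == "train" and rng.random() < flip_probability:
        return horizontal_flip(image)
    # Validation and test images are returned untouched.
    return image


image = [[1, 2, 3], [4, 5, 6]]
rng = random.Random(0)  # seeded so the sketch is reproducible

validation_image = prepare(image, "validation", rng)
flipped_image = prepare(image, "train", rng, flip_probability=1.0)
print(validation_image, flipped_image)
```

Keeping the split check in one place makes it harder to accidentally augment validation or test data when more transforms are added later.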
Our processing stage consists of two steps.
First, we instantiate the PyTorchProcessor class needed to run our bespoke data processing script:
import boto3
import sagemaker
from sagemaker import get_execution_role
from sagemaker.pytorch.processing import PyTorchProcessor

region = boto3.session.Session().region_name
role = get_execution_role()

# Processor backed by the built-in SageMaker PyTorch container
pytorch_processor = PyTorchProcessor(
    framework_version="1.8",
    role=role,
    instance_type="ml.m5.xlarge",
    instance_count=1,
)
Second, we need to pass it the instructions to conduct the actual data processing tasks that are contained in our script:
We run this step with the following code:
from sagemaker.processing import ProcessingInput, ProcessingOutput

pytorch_processor.run(
    code="preprocessing.py",
    source_dir="scripts",
    arguments=["Debug", "Not used"],
    # Input made available inside the container under the destination path
    inputs=[
        ProcessingInput(
            source="coco-annotations.zip",
            destination="/opt/ml/processing/input",
        )
    ],
    # Container directories whose contents are saved when the job finishes
    outputs=[
        ProcessingOutput(source="/opt/ml/processing/tmp/data_structured", output_name="data_structured"),
        ProcessingOutput(source="/opt/ml/processing/output/train", output_name="train"),
        ProcessingOutput(source="/opt/ml/processing/output/val", output_name="validation"),
        ProcessingOutput(source="/opt/ml/processing/output/test", output_name="test"),
        ProcessingOutput(source="/opt/ml/processing/logs", output_name="logs"),
    ],
)
In this processing step, we sample our initial dataset and restructure it to fit the structure expected by the major ML frameworks. We also center-crop and augment the images. We add an extra output (the data_structured folder) to save the restructured source data, which lets us reuse the same dataset for further processing or training without restarting the whole preparation from scratch (that is, from the annotations file). More details on this can be found in the script. At the end of this step, we’re ready to proceed to the next stage and train our model.
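The processing script itself isn't reproduced in this post, so the following is only a rough sketch of the part that writes the train, val, and test folders. The function name `split_files`, the split fractions, and the empty placeholder files are all assumptions. Temporary directories stand in for `/opt/ml/processing/input` and `/opt/ml/processing/output` so the sketch runs outside a Processing job.

```python
import random
import shutil
import tempfile
from pathlib import Path


def split_files(input_dir, output_dir, val_fraction=0.1, test_fraction=0.1, seed=0):
    """Copy files from input_dir into train/val/test folders under output_dir."""
    files = sorted(p for p in Path(input_dir).iterdir() if p.is_file())
    random.Random(seed).shuffle(files)  # seeded shuffle for reproducible splits

    n_val = int(len(files) * val_fraction)
    n_test = int(len(files) * test_fraction)
    splits = {
        "val": files[:n_val],
        "test": files[n_val:n_val + n_test],
        "train": files[n_val + n_test:],
    }

    counts = {}
    for name, members in splits.items():
        target = Path(output_dir) / name
        target.mkdir(parents=True, exist_ok=True)
        for path in members:
            shutil.copy(path, target / path.name)
        counts[name] = len(members)
    return counts


# Inside a Processing job, these would be the container paths
# /opt/ml/processing/input and /opt/ml/processing/output.
with tempfile.TemporaryDirectory() as base:
    input_dir = Path(base) / "input"
    output_dir = Path(base) / "output"
    input_dir.mkdir()
    for i in range(10):
        (input_dir / f"img_{i}.jpg").write_bytes(b"")  # empty placeholder files
    counts = split_files(input_dir, output_dir)

print(counts)
```

Because each folder written under the output path matches a `ProcessingOutput` source declared in the `run` call, SageMaker uploads its contents to Amazon S3 when the job completes.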
In this post, we showed you how SageMaker Processing has simplified the use of the most popular ML frameworks, such as PyTorch, TensorFlow, MXNet, Hugging Face, and XGBoost. This is possible thanks to the introduction of FrameworkProcessor in the recent releases (2.52+) of the SageMaker Python SDK. You can now use the existing SageMaker containers provided natively for these frameworks with SageMaker Processing, and focus solely on your data processing code. Behind the scenes, SageMaker Processing manages the necessary infrastructure for you.
We hope this gave you a glimpse into the possibilities offered by SageMaker Processing. As a next step, you can look beyond preprocessing and postprocessing steps and consider the full lifecycle of an ML model. SageMaker Processing can play an active role before training takes place, and also after training for any postprocessing tasks. You may also want to look at SageMaker Pipelines to automate the entire model lifecycle by chaining these steps together into a model pipeline.
This post was inspired by the post Amazon SageMaker Processing – Fully Managed Data Processing and Model Evaluation, published when SageMaker Processing first launched. Check out the SageMaker Python SDK for more details on the other supported frameworks: Hugging Face, TensorFlow, MXNet, and XGBoost.
Sample notebooks and scripts for four of the supported frameworks are available on GitHub: PyTorch example, Hugging Face example, TensorFlow example, MXNet example.

