
Amazon SageMaker Data Wrangler reduces the time to aggregate and prepare data for machine learning (ML) from weeks to minutes in Amazon SageMaker Studio. Data Wrangler can simplify your data preparation and feature engineering processes and help you with data selection, cleaning, exploration, and visualization. Data Wrangler has over 300 built-in transforms written in PySpark, so you can process datasets up to hundreds of gigabytes efficiently on the default instance, ml.m5.4xlarge.
However, when you work with datasets up to terabytes of data using built-in transforms, you might experience longer processing time or potential out-of-memory errors. Based on your data requirements, you can now use additional Amazon Elastic Compute Cloud (Amazon EC2) M5 instances and R5 instances. For example, you can start with a default instance (ml.m5.4xlarge) and then switch to ml.m5.24xlarge or ml.r5.24xlarge. You have the option of picking different instance types and finding the best trade-off of running cost and processing times. The next time you’re working on time series transformation and running heavy transformers to balance your data, you can right-size your Data Wrangler instance to run these processes faster.
When processing tens of gigabytes or even more with a custom Pandas transform, you might experience out-of-memory errors. You can switch from the default instance (ml.m5.4xlarge) to ml.m5.24xlarge, and the transform will finish without any errors. We thoroughly benchmarked and observed linear speedup as we increased instance size across a portfolio of datasets.
In this post, we share our findings from two benchmark tests to demonstrate how you can process larger and wider datasets with Data Wrangler.
Let’s review two tests we ran, aggregation queries and one-hot encoding, with different instance types using PySpark built-in transformers and custom Pandas transforms. Transformations that don’t require aggregation finish quickly and work well with the default instance type, so we focused on aggregation queries and transformations with aggregation. We stored our test dataset on Amazon Simple Storage Service (Amazon S3). This dataset’s expanded size is around 100 GB with 80 million rows and 300 columns. We used UI metrics to time benchmark tests and measure end-to-end customer-facing latency. When importing our test dataset, we disabled sampling. Sampling is enabled by default, and Data Wrangler only processes the first 100 rows when enabled.x
As we increased the Data Wrangler instance size, we observed a roughly linear speedup of Data Wrangler built-in transforms and custom Spark SQL. Pandas aggregation query tests only finished when we used instances larger than ml.m5.16xl, and Pandas needed 180 GB of memory to process aggregation queries for this dataset.
The following table summarizes the aggregation query test results.
| Instance | vCPU | Memory (GiB) | Data Wrangler built-in Spark transform time | Pandas Time (Custom Transform) |
| ml.m5.4xl | 16 | 64 | 229 seconds | Out of memory |
| ml.m5.8xl | 32 | 128 | 130 seconds | Out of memory |
| ml.m5.16xl | 64 | 256 | 52 seconds | 30 minutes |
The following table summarizes the one-hot encoding test results.
| Instance | vCPU | Memory (GiB) | Data Wrangler built-in Spark transform time | Pandas Time (Custom Transform) |
| ml.m5.4xl | 16 | 64 | 228 seconds | Out of memory |
| ml.m5.8xl | 32 | 128 | 130 seconds | Out of memory |
| ml.m5.16xl | 64 | 256 | 52 seconds | Out of memory |
To switch the instance type of your flow, complete the following steps:


A progress message appears.
When the switch is complete, a success message appears.
Data Wrangler uses the selected instance type for data analysis and data transformations. The default instance and the instance you switched to (ml.m5.16xlarge) are both running. You can change the instance type or switch back to the default instance before running a specific transformation.
You are charged for all running instances. To avoid incurring additional charges, shut down the instances that you aren’t using manually. To shut down an instance that is running, complete the following steps:

If you shut down an instance used to run a flow, you can’t access the flow temporarily. If you get an error in opening the flow running an instance you previously shut down, wait for approximately 5 minutes and try opening it again.
In this post, we demonstrated how to process larger and wider datasets with Data Wrangler by switching instances to larger M5 or R5 instance types. M5 instances offer a balance of compute, memory, and networking resources. R5 instances are memory-optimized instances. Both M5 and R5 provide instance types to optimize cost and performance for your workloads.
To learn more about using data flows with Data Wrangler, refer to Create and Use a Data Wrangler Flow and Amazon SageMaker Pricing. To get started with Data Wrangler, see Prepare ML Data with Amazon SageMaker Data Wrangler.




