In machine learning (ML), the quality of the dataset strongly affects a model's predictive performance. Although more data is usually better, large datasets with a high number of features can sometimes lead to non-optimal model performance due to the curse of dimensionality. Analysts can spend a significant amount of time transforming data to improve model performance. Additionally, models trained on large datasets cost more and take longer to train. If time is a constraint, model performance may be limited as a result.
Dimension reduction techniques can help reduce the size of your data while maintaining its information, resulting in quicker training times, lower cost, and potentially higher-performing models.
Amazon SageMaker Data Wrangler is a purpose-built data aggregation and preparation tool for ML. Data Wrangler simplifies data preparation and feature engineering tasks such as data selection, cleansing, exploration, and visualization from a single visual interface. Data Wrangler has more than 300 preconfigured data transformations that you can use to transform your data. In addition, you can write custom transformations in PySpark, SQL, and pandas.
Today, we’re excited to add a new transformation technique that is commonly used in the ML world to the list of Data Wrangler pre-built transformations: dimensionality reduction using Principal Component Analysis. With this new feature, you can reduce a high-dimensional dataset to a smaller number of dimensions that popular ML algorithms can work with, in just a few clicks on the Data Wrangler console. This can significantly improve training time and cost, and potentially model performance, with minimal effort.
In this post, we provide an overview of this new feature and show how to use it in your data transformation. We will show how to use dimensionality reduction on large sparse datasets.
Principal Component Analysis (PCA) is a method that transforms a dataset with many numerical features into one with fewer features while retaining as much information as possible from the original dataset. It does this by finding a new set of features called components, which are composites of the original features and are uncorrelated with one another. Several features in a dataset often have little impact on the final result and may increase the processing time of ML models. High-dimensional problems can also be difficult for humans to understand and solve. Dimensionality reduction techniques like PCA can help address both issues.
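To make the idea of uncorrelated components concrete, the following is a minimal NumPy sketch of PCA computed through a singular value decomposition. It is an illustration only, not how Data Wrangler implements the transform; the function names `pca_fit` and `pca_transform` and the synthetic data are assumptions made for this example.

```python
import numpy as np


def pca_fit(X, n_components):
    """Learn the column means and top principal directions of X (rows = samples)."""
    mean = X.mean(axis=0)
    # SVD of the centered data: the rows of vt are the principal directions,
    # ordered by decreasing singular value.
    _, singular_values, vt = np.linalg.svd(X - mean, full_matrices=False)
    explained_variance = singular_values ** 2 / (X.shape[0] - 1)
    return mean, vt[:n_components], explained_variance[:n_components]


def pca_transform(X, mean, components):
    """Project X onto the learned components."""
    return (X - mean) @ components.T


# Synthetic data (hypothetical): 500 rows, 10 correlated features
# driven by 3 latent factors plus a little noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 3))
mixing = rng.normal(size=(3, 10))
X = latent @ mixing + 0.01 * rng.normal(size=(500, 10))

mean, components, variance = pca_fit(X, n_components=3)
Z = pca_transform(X, mean, components)

print(Z.shape)  # 10 features reduced to 3 components

# The covariance matrix of the components is diagonal up to rounding error,
# which is what "uncorrelated with one another" means.
cov = np.cov(Z, rowvar=False)
off_diagonal = cov - np.diag(np.diag(cov))
print(np.abs(off_diagonal).max())
```

The learned `mean` and `components` are the parameters of the transform: once fitted, they can be applied to any new rows that have the same columns.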
In this post, we show how you can use the dimensionality reduction transform in Data Wrangler on the MNIST dataset to reduce the number of features by 85% and still achieve accuracy similar to that on the original dataset. The MNIST (Modified National Institute of Standards and Technology) dataset, which is the de facto “hello world” dataset in computer vision, is a dataset of handwritten digit images. Each row of the dataset corresponds to a single image that is 28 x 28 pixels, for a total of 784 pixels. Each pixel is represented by a single feature in the dataset with a pixel value ranging from 0–255.
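As a small illustration of this row layout, the sketch below splits one row into its label and a 28 x 28 grid of pixels. It assumes the label is the first value in the row, followed by the 784 pixel values, which matches the dataset used in this post; the helper name `split_row` and the example row are hypothetical.

```python
IMAGE_SIDE = 28
N_PIXELS = IMAGE_SIDE * IMAGE_SIDE  # 784 pixel features per image


def split_row(row):
    """Split one row into (label, 28 x 28 grid of pixel values)."""
    if len(row) != N_PIXELS + 1:
        raise ValueError(f"expected {N_PIXELS + 1} values, got {len(row)}")
    label, pixels = row[0], row[1:]
    if not all(0 <= p <= 255 for p in pixels):
        raise ValueError("pixel values must be in the range 0-255")
    # Rebuild the image one row of 28 pixels at a time.
    image = [pixels[r * IMAGE_SIDE:(r + 1) * IMAGE_SIDE] for r in range(IMAGE_SIDE)]
    return label, image


# Hypothetical row: label 7 followed by 784 pixels, with one bright pixel.
row = [7] + [0] * N_PIXELS
row[1 + 3 * IMAGE_SIDE + 5] = 255  # pixel at image row 3, column 5

label, image = split_row(row)
print(label, len(image), len(image[0]), image[3][5])
```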
To learn more about the new dimensionality reduction feature, refer to Reduce Dimensionality within a Dataset.
This post assumes that you have an Amazon SageMaker Studio domain set up. For details on how to set it up, refer to Onboard to Amazon SageMaker Domain Using Quick setup.
To get started with the new capabilities of Data Wrangler, open Studio after upgrading to the latest release and choose the File menu, New, and Flow, or choose New data flow from the Studio launcher.
The dataset we use in this post contains 60,000 training examples and labels. Each row consists of 785 values: the first value is the label (a number from 0–9) and the remaining 784 values are the pixel values (a number from 0–255). First, we perform a Quick Model analysis on the raw data to get performance metrics and compare them with the model metrics post-PCA transformations for evaluation. Complete the following steps:
After the data is imported, Data Wrangler automatically validates the dataset and detects the data types for all the columns based on its sampling. In the MNIST dataset, because all the columns are of type long, we leave this step as is and go back to the data flow.
The flow editor now shows two blocks showcasing that the data was imported from a source and the data types recognized. You can also edit the data types if needed.
After confirming that the data quality is acceptable, we go back to the data flow and use Data Wrangler’s Data Quality and Insights Report. This report performs an analysis on the imported dataset and provides information about missing values, outliers, target leakage, imbalanced data, and a Quick Model analysis. Refer to Get Insights On Data and Data Quality for more information.
For this analysis, we only focus on the Quick Model part of the Data Quality report.
For this post, we use the Data Quality and Insights Report to show how the model performance is mostly preserved using PCA. We recommend that you use a deep learning-based approach for better performance.
The following screenshot shows a summary of the dataset from the report. Fortunately, we don’t have any missing values. The time taken for the report to generate depends on the size of the dataset, number of features, and the instance size used by Data Wrangler.
The following screenshot shows how the model performed on the raw dataset. Here we notice that the model has an accuracy of 93.7% utilizing 784 features.
Now let’s use the Data Wrangler dimensionality reduction transform to reduce the number of features in this dataset.
If you don’t see the dimensionality reduction option listed, you need to update Data Wrangler. For instructions, refer to Update Data Wrangler.
After applying PCA, the number of feature columns is reduced from 784 to 115, an 85% reduction in the number of features.
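The 85% figure follows directly from the column counts, as the short sketch below shows. This post doesn't state how the 115 components were selected, so the second helper illustrates one common approach, assumed here purely for illustration: keep the smallest number of components whose cumulative explained variance reaches a chosen threshold. Both function names and the variance values are hypothetical.

```python
from itertools import accumulate


def feature_reduction(n_before, n_after):
    """Fraction of features removed by the transform."""
    return (n_before - n_after) / n_before


def components_for_variance(explained_variance, threshold):
    """Smallest k whose top-k components explain at least `threshold` of the variance.

    `explained_variance` must be sorted in decreasing order.
    """
    total = sum(explained_variance)
    for k, cumulative in enumerate(accumulate(explained_variance), start=1):
        if cumulative / total >= threshold:
            return k
    return len(explained_variance)


# Column counts from this post: 784 pixel features reduced to 115 components.
reduction = feature_reduction(784, 115)
print(f"{reduction:.1%}")  # 85.3%

# Hypothetical per-component variances, for illustration only.
variances = [5.0, 3.0, 1.0, 0.5, 0.5]
k = components_for_variance(variances, threshold=0.85)
print(k)  # 3
```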
We can now use the transformed dataset and generate another Data Quality and Insights Report as shown in the following screenshot to observe the model performance.
In the second analysis, the Quick Model accuracy is 91.8%, compared with 93.7% in the first Quick Model report. PCA reduced the number of features in our dataset by 85% while keeping model accuracy at a similar level, within about two percentage points of the original. For better results, you can try deep learning models, which might offer even better performance.
We found the following comparison in training time using Amazon SageMaker Autopilot with and without PCA dimensionality reduction:
As data changes over time, it’s often desirable to refit our parameters on new, unseen data. Data Wrangler offers this capability through the use of refitting parameters. For more information on refitting trained parameters, refer to Refit trained parameters on large datasets using Amazon SageMaker Data Wrangler.
Previously, we applied PCA to a sample of the MNIST dataset containing 50,000 sample rows. Consequently, our flow file contains a model that has been trained on this sample and used for all created jobs unless we specify that we want to relearn those parameters.
To refit your model parameters on the MNIST training dataset, complete the following steps:
The Trained parameters section shows that there are 784 parameters. That is one parameter for each column because we excluded the label column in our PCA reduction.
Note that if we don’t select Refit in this step, the trained parameters learned during interactive mode will be used.
This flow file contains the model learned on the entire MNIST train dataset.
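The difference between reusing parameters learned on a sample and refitting them on the full dataset can be sketched in NumPy as follows. This is a conceptual illustration under stated assumptions, not Data Wrangler's implementation: the functions `fit_pca` and `apply_pca`, the synthetic data, and the sample size are all hypothetical.

```python
import numpy as np


def fit_pca(X, n_components):
    """Learn PCA parameters (column means and principal directions) from X."""
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return {"mean": mean, "components": vt[:n_components]}


def apply_pca(X, params):
    """Apply previously learned parameters without relearning them."""
    return (X - params["mean"]) @ params["components"].T


# Hypothetical data: a full dataset and a smaller interactive sample of it.
rng = np.random.default_rng(42)
full = rng.normal(size=(6000, 20)) * np.linspace(1.0, 5.0, 20)
sample = full[:500]

sample_params = fit_pca(sample, n_components=5)  # learned on the sample
refit_params = fit_pca(full, n_components=5)     # relearned on all rows

reused = apply_pca(full, sample_params)  # without refit: sample parameters reused
refit = apply_pca(full, refit_params)    # with refit: parameters match the full data

print(reused.shape, refit.shape)
# Only the refit parameters center the full dataset exactly; the sample
# parameters leave a small offset because the sample mean differs.
print(np.abs(refit.mean(axis=0)).max(), np.abs(reused.mean(axis=0)).max())
```

Both outputs have the same shape, so downstream steps work either way; refitting only changes which rows the parameters were learned from.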
To clean up the environment so you don’t incur additional charges, delete the datasets and artifacts in Amazon S3. Additionally, delete the data flow file in Studio and shut down the instance it runs on. Refer to Shut Down Data Wrangler for more information.
Dimensionality reduction is a useful technique for removing unwanted variables from a dataset before modeling. It can be used to reduce model complexity and noise in the data, thereby mitigating the common problem of overfitting in machine learning and deep learning models. In this post, we demonstrated that after reducing the number of features by 85%, we were still able to achieve similar accuracy for our models.
For more information about using PCA, refer to Principal Component Analysis (PCA) Algorithm. To learn more about the dimensionality reduction transform, refer to Reduce Dimensionality within a Dataset.
Adeleke Coker is a Global Solutions Architect with AWS. He works with customers globally to provide guidance and technical assistance in deploying production workloads at scale on AWS. In his spare time, he enjoys learning, reading, gaming, and watching sporting events.
Abigail is a Software Development Engineer at Amazon SageMaker. She is passionate about helping customers prepare their data in Data Wrangler and building distributed machine learning systems. In her free time, Abigail enjoys traveling, hiking, skiing, and baking.
Vishaal Kapoor is a Senior Applied Scientist with AWS AI. He is passionate about helping customers understand their data in Data Wrangler. In his spare time, he mountain bikes, snowboards, and spends time with his family.
Raviteja Yelamanchili is an Enterprise Solutions Architect with Amazon Web Services based in New York. He works with large financial services enterprise customers to design and deploy highly secure, scalable, reliable, and cost-effective applications on the cloud. He brings over 11 years of risk management, technology consulting, data analytics, and machine learning experience. When he is not helping customers, he enjoys traveling and playing PS5.