Data generates new value for businesses through insights and predictive models. However, although data is plentiful, skilled data scientists are few and far between. Despite recent efforts by academia and industry to train more data scientists, the shortage remains significant and is likely to persist for the foreseeable future.
To accelerate model building, data scientists and ML practitioners often take advantage of AutoML (automated machine learning) tools that augment their work. These tools take over the tedious, iterative work of data preparation, model training, and tuning, helping data scientists improve their productivity when developing ML models.
In this post, we discuss how data scientists and other advanced analytics users can use Amazon SageMaker Data Wrangler and Amazon SageMaker Autopilot to analyze their data sets and build highly predictive ML models. To demonstrate these capabilities, we use the Pima Indian Diabetes public data set from UCI.
The Pima Indian Diabetes data set contains information about 768 women from a population near Phoenix, Arizona. The outcome tested was diabetes: 268 observations tested positive and 500 tested negative. The data set has one target and eight attributes: pregnancies, glucose, blood pressure, skin thickness, insulin, BMI (body mass index), age, and diabetes pedigree function. We use this data set to demonstrate how to use Autopilot and Data Wrangler to build highly predictive ML models without having to write any code.
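Although the walkthrough is code-free, it can help to know what the raw data looks like. The sketch below shows the data set's schema in pandas, using a few illustrative rows (the file path is a placeholder for wherever you keep your copy of the CSV):

```python
import pandas as pd

# Column names used by the Pima Indian Diabetes data set
# (Outcome is the binary target: 1 = tested positive, 0 = tested negative).
columns = [
    "Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
    "Insulin", "BMI", "DiabetesPedigreeFunction", "Age", "Outcome",
]

# To load the full file, substitute your local path:
# df = pd.read_csv("diabetes.csv", names=columns)

# A few illustrative rows so this snippet is self-contained:
df = pd.DataFrame(
    [
        [6, 148, 72, 35, 0, 33.6, 0.627, 50, 1],
        [1, 85, 66, 29, 0, 26.6, 0.351, 31, 0],
        [8, 183, 64, 0, 0, 23.3, 0.672, 32, 1],
    ],
    columns=columns,
)

# Check the class balance of the target.
print(df["Outcome"].value_counts())
```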
The high-level steps for building an ML model are as follows:
We walk through these steps as we build a binary classification model using the Pima Indian Diabetes data set.
Data Wrangler is a feature of Amazon SageMaker Studio that provides an end-to-end solution to import, prepare, transform, featurize, and analyze data. You can integrate a Data Wrangler data flow into your ML workflows to simplify and streamline data preprocessing and feature engineering using little to no coding.
If this is your first time opening Data Wrangler, you may have to wait a few minutes for it to be ready.
You can now preview your data set.
You now have a flow diagram.
If not, you can easily modify them through the UI. If multiple data sources are present, you can join or concatenate them.
We can now create an analysis and add transformations.
Exploratory data analysis is an important step when building ML models. In this step, data scientists analyze data to listen to its story. If you have the patience to listen, data is a great storyteller. This step involves statistical analysis, summarization tables, histograms, scatter plots, outlier analysis, finding missing values, and more. We demonstrate some of these in this post.
The count summary shows that all columns have 768 entries. But on closer examination, we find that the minimum value is 0 for columns such as Glucose and BloodPressure. Missing values are stored as 0 in this data set. Let’s fix that.
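Data Wrangler applies this transform through the UI, but as a rough pandas equivalent (a sketch with small illustrative values, not the full data set), marking the zeros as missing looks like this:

```python
import numpy as np
import pandas as pd

# Illustrative values; in the real data set each column has 768 entries.
df = pd.DataFrame({
    "Glucose": [148, 0, 183, 89],
    "BloodPressure": [72, 66, 0, 66],
})

# In this data set a value of 0 in these columns means "not recorded",
# so convert those zeros into proper missing values (NaN).
for col in ["Glucose", "BloodPressure"]:
    df[col] = df[col].replace(0, np.nan)

print(df["Glucose"].isna().sum())  # number of missing Glucose entries
```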
The 0 entries under Glucose are now missing entries.
Data Wrangler gives you a couple of options to fix missing values.
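One common option is imputation. As a hedged pandas sketch of what a median-imputation transform does under the hood (the sample values are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Glucose": [148.0, np.nan, 183.0, 89.0]})

# Impute missing entries with the column median, which is
# robust to outliers compared with the mean.
median = df["Glucose"].median()          # median of the observed values
df["Glucose"] = df["Glucose"].fillna(median)
print(df["Glucose"].tolist())
```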
This completes one iteration of analysis and transformation.
Data Wrangler gives you an option to build a quick model to see how predictive your features are.
The following chart shows the F1 score and the importance of the predictive features.
The F1 score is a commonly used metric in classification problems; it is the harmonic mean of precision and recall. If we build a model with this data at this stage, we get an approximate F1 score of 0.735 (1 being the best possible F1 score) and find that Glucose is the most important explanatory feature.
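The harmonic-mean definition can be written as a small function (the 0.7/0.8 values below are just an example, not numbers from this data set):

```python
def f1_score(precision: float, recall: float) -> float:
    """F1 is the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# For example, precision of 0.7 and recall of 0.8 give:
print(round(f1_score(0.7, 0.8), 3))  # 0.747
```

Because it is a harmonic mean, F1 is pulled toward the smaller of the two values, so a model can't score well by excelling at only precision or only recall.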
Another valuable feature of Data Wrangler is checking for target leakage. Target leakage is a phenomenon in which the target that you’re trying to predict has leaked into one or more of your features, and this feature isn’t available at prediction time.
We don’t have a target leakage situation in this data set, but if we did, we would need to remove that column from the data set so that the model doesn’t falsely show a perfect model during training.
Women with Glucose below 100 and BloodPressure below 80 seem to have a lower chance of diabetes. Let’s create a new feature using that information.
This custom formula creates a new column in the data set.
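The exact formula used in the Data Wrangler UI isn't shown here, but a pandas sketch of one reasonable version, a binary flag for the low-Glucose, low-BloodPressure group (the column name `LowGlucoseLowBP` is our own choice), could look like this:

```python
import pandas as pd

df = pd.DataFrame({
    "Glucose": [85, 148, 95, 183],
    "BloodPressure": [66, 72, 90, 64],
})

# New binary feature: 1 when both Glucose < 100 and BloodPressure < 80,
# capturing the observation that this group seems less likely to have diabetes.
df["LowGlucoseLowBP"] = (
    (df["Glucose"] < 100) & (df["BloodPressure"] < 80)
).astype(int)

print(df["LowGlucoseLowBP"].tolist())  # [1, 0, 0, 0]
```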
Next, let’s check if Pregnancies/Age could have some effect on the target.
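A ratio feature like this is a one-liner in pandas; this sketch (with illustrative values, and a column name of our own choosing) shows the idea:

```python
import pandas as pd

df = pd.DataFrame({"Pregnancies": [6, 1, 8], "Age": [50, 31, 32]})

# New ratio feature: number of pregnancies relative to age.
df["PregnanciesPerAge"] = df["Pregnancies"] / df["Age"]
print(df["PregnanciesPerAge"].round(3).tolist())  # [0.12, 0.032, 0.25]
```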
As we can see, this new feature could have an influence on our target.
A quick model after adding these two features shows an improvement in our model’s F1 score.
Data Wrangler offers other transformations that also don’t require any coding, such as finding outliers and scaling features, but we don’t need them for this data set.
The notebook creates output for this flow as a CSV file in Amazon S3. You can see the S3 path for the output file in the notebook. Depending on your input data file, Data Wrangler might split the output into multiple files. If so, you need to combine them into a single CSV file with a single header, which you then feed into Autopilot.
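If Data Wrangler does split the output, each part file typically carries its own header row, so a simple file concatenation would duplicate headers. A small stdlib sketch of the merge step (file names here are illustrative, not Data Wrangler's actual naming scheme):

```python
import csv
import glob
import os
import tempfile

def combine_csv_parts(parts, combined_path):
    """Merge several CSV part files, keeping only the first header row."""
    header_written = False
    with open(combined_path, "w", newline="") as out:
        writer = csv.writer(out)
        for part in parts:
            with open(part, newline="") as f:
                reader = csv.reader(f)
                header = next(reader)  # every part starts with a header
                if not header_written:
                    writer.writerow(header)
                    header_written = True
                writer.writerows(reader)

# Example with two small part files in a temp directory:
tmp = tempfile.mkdtemp()
for i, row in enumerate(["1,0", "0,1"]):
    with open(os.path.join(tmp, f"part-{i}.csv"), "w") as f:
        f.write("Glucose,Outcome\n" + row + "\n")

parts = sorted(glob.glob(os.path.join(tmp, "part-*.csv")))
combined = os.path.join(tmp, "combined.csv")
combine_csv_parts(parts, combined)
```

The combined file then has a single header followed by all data rows, ready to feed into Autopilot.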
Autopilot allows you to automatically build ML models. It explores your data, selects the algorithms relevant to your problem type, and prepares the data for model training and tuning. It ranks all of the optimized models it tests by performance and finds the best performing model, which you can deploy in a fraction of the time normally required.
We can either run Autopilot directly on the raw data or feed it with the enhanced data set that we generated with Data Wrangler.
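Besides the Studio UI, you can launch the job programmatically through the SageMaker `CreateAutoMLJob` API. This is a sketch of the request payload only (the bucket names, job name, and role ARN are placeholders you must replace), with the actual boto3 call left commented out since it requires AWS credentials:

```python
# Request payload for the SageMaker CreateAutoMLJob API
# (boto3's create_auto_ml_job). All s3:// paths, the job name,
# and the role ARN below are placeholders.
request = {
    "AutoMLJobName": "pima-diabetes-autopilot",
    "InputDataConfig": [
        {
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": "s3://your-bucket/data-wrangler-output/combined.csv",
                }
            },
            "TargetAttributeName": "Outcome",  # column Autopilot should predict
        }
    ],
    "OutputDataConfig": {"S3OutputPath": "s3://your-bucket/autopilot-output/"},
    # Omit ProblemType to let Autopilot infer it (Auto); we set it explicitly:
    "ProblemType": "BinaryClassification",
    "AutoMLJobObjective": {"MetricName": "F1"},
    "RoleArn": "arn:aws:iam::123456789012:role/YourSageMakerRole",
}

# To launch the job (requires AWS credentials):
# import boto3
# boto3.client("sagemaker").create_auto_ml_job(**request)
```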
If you’re not sure what problem type to use, you can leave it as Auto and Autopilot will figure it out.
Autopilot analyzes the input data, processes it, selects the right ML algorithm, and runs several experiment trials to tune the model for best performance. It then ranks these trials and presents you with the best model.
You can see the creation of this endpoint on the SageMaker console.
You see an endpoint URL that you can use to make predictions in real time.
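For a real-time prediction, the endpoint accepts a CSV row of the input features (no header, no target column). This sketch builds such a payload from illustrative values; the endpoint name is a placeholder, and the boto3 call is commented out because it requires AWS credentials and a live endpoint:

```python
# One row of the eight input features, in the same order as training
# (Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI,
#  DiabetesPedigreeFunction, Age). The values here are illustrative.
features = [2, 112, 75, 32, 0, 35.7, 0.148, 21]
payload = ",".join(str(v) for v in features)
print(payload)

# To call the endpoint in real time (requires AWS credentials; the
# endpoint name below is a placeholder):
# import boto3
# runtime = boto3.client("sagemaker-runtime")
# response = runtime.invoke_endpoint(
#     EndpointName="pima-diabetes-endpoint",
#     ContentType="text/csv",
#     Body=payload,
# )
# print(response["Body"].read().decode())  # the predicted label
```

Note that if your training data included features engineered in Data Wrangler, the request must include those columns too, in the same order.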
When the batch transformation job is complete, you can see your inference job’s output in the S3 bucket.
In this post, you learned an easy way to conduct exploratory data analysis, develop an ML model, deploy it, and run a batch transformation to make predictions. Anyone with access to data who wants to quickly build powerful machine learning models can use this technique to increase their productivity. To learn more, visit the Amazon SageMaker Data Wrangler and Amazon SageMaker Autopilot product pages.
Raju Penmatcha is a Senior AI/ML Specialist Solutions Architect at AWS. He works with education, government, and nonprofit customers on machine learning and artificial intelligence related projects, helping them build solutions using AWS. When not helping customers, he likes traveling to new places.