You can now retrain machine learning (ML) models and automate batch prediction workflows with updated datasets in Amazon SageMaker Canvas, making it easier to continuously improve model performance and drive efficiency. An ML model’s effectiveness depends on the quality and relevance of the data it’s trained on. As time progresses, the underlying patterns, trends, and distributions in the data may change. By updating the dataset, you ensure that the model learns from the most recent and representative data, thereby improving its ability to make accurate predictions. Canvas now supports updating datasets both automatically and manually, enabling you to use the latest version of a tabular, image, or document dataset for training ML models.
After the model is trained, you may want to run predictions on it. Running batch predictions on an ML model enables processing multiple data points simultaneously instead of making predictions one by one. Automating this process provides efficiency, scalability, and timely decision-making. After the predictions are generated, they can be further analyzed, aggregated, or visualized to gain insights, identify patterns, or make informed decisions based on the predicted outcomes. Canvas now supports setting up an automated batch prediction configuration and associating a dataset to it. When the associated dataset is refreshed, either manually or on a schedule, a batch prediction workflow will be triggered automatically on the corresponding model. Results of the predictions can be viewed inline or downloaded for later review.
In this post, we show how to retrain ML models and automate batch predictions using updated datasets in Canvas.
For our use case, we play the part of a business analyst for an ecommerce company. Our product team wants us to determine the most critical metrics that influence a shopper’s purchase decision. For this, we train an ML model in Canvas with a dataset of customer online sessions on the company’s website. We evaluate the model’s performance and, if needed, retrain the model with additional data to see whether it improves on the existing model. To do so, we use the auto update dataset capability in Canvas and retrain our existing ML model with the latest version of the training dataset. Then we configure automatic batch prediction workflows: when the corresponding prediction dataset is updated, a batch prediction job is automatically triggered on the model and the results are made available for us to review.
The workflow steps are as follows:
You can perform these steps in Canvas without writing a single line of code.
The dataset consists of feature vectors belonging to 12,330 sessions. The dataset was formed so that each session would belong to a different user in a 1-year period to avoid any tendency to a specific campaign, special day, user profile, or period. The following table outlines the data schema.
Column Name | Data Type | Description |
Administrative | Numeric | Number of pages visited by the user for user account management-related activities. |
Administrative_Duration | Numeric | Amount of time spent in this category of pages. |
Informational | Numeric | Number of pages of this type (informational) that the user visited. |
Informational_Duration | Numeric | Amount of time spent in this category of pages. |
ProductRelated | Numeric | Number of pages of this type (product related) that the user visited. |
ProductRelated_Duration | Numeric | Amount of time spent in this category of pages. |
BounceRates | Numeric | Percentage of visitors who enter the website through that page and exit without triggering any additional tasks. |
ExitRates | Numeric | Average exit rate of the pages visited by the user. This is the percentage of people who left your site from that page. |
PageValues | Numeric | Average page value of the pages visited by the user. This is the average value for a page that a user visited before landing on the goal page or completing an ecommerce transaction (or both). |
SpecialDay | Numeric | The “Special Day” feature indicates the closeness of the site visiting time to a specific special day (such as Mother’s Day or Valentine’s Day) in which the sessions are more likely to be finalized with a transaction. |
Month | Categorical | Month of the visit. |
OperatingSystems | Categorical | Operating systems of the visitor. |
Browser | Categorical | Browser used by the user. |
Region | Categorical | Geographic region from which the session has been started by the visitor. |
TrafficType | Categorical | Traffic source through which user has entered the website. |
VisitorType | Categorical | Whether the customer is a new user, returning user, or other. |
Weekend | Binary | If the customer visited the website on the weekend. |
Revenue | Binary | If a purchase was made. |
Revenue is the target column, which we use to predict whether a shopper will purchase a product.
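Although the walkthrough itself needs no code, it can be handy to sanity-check a local CSV file before importing it into Canvas. The following is a minimal sketch under stated assumptions: the sample rows are illustrative (not taken from the real dataset), only a subset of the columns from the table is used, and Revenue is assumed to be stored as TRUE/FALSE text. The helper names `missing_columns` and `target_distribution` are hypothetical.

```python
import csv
import io
from collections import Counter

# Illustrative rows only (not real data); a subset of the schema columns,
# with Revenue assumed to be encoded as TRUE/FALSE text.
SAMPLE_CSV = """Administrative,BounceRates,ExitRates,Revenue
0,0.20,0.20,FALSE
1,0.00,0.10,TRUE
2,0.00,0.05,FALSE
"""


def missing_columns(header, expected):
    """Return the expected column names that are absent from the header."""
    return [name for name in expected if name not in header]


def target_distribution(csv_text, target="Revenue"):
    """Count how often each value of the target column occurs."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row[target] for row in reader)


if __name__ == "__main__":
    header = next(csv.reader(io.StringIO(SAMPLE_CSV)))
    # Report schema columns that the file lacks, then the class balance.
    print(missing_columns(header, ["Administrative", "Revenue", "Month"]))
    print(target_distribution(SAMPLE_CSV))
```

A heavily skewed target distribution is worth knowing about before training, because it affects how a headline score should be read.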
The first step is to download the dataset that we will use. Note that this dataset is courtesy of the UCI Machine Learning Repository.
For this walkthrough, complete the following prerequisite steps:
This is so that we can showcase the dataset update functionality. Ensure all the CSV files have the same headers; otherwise, you may run into schema mismatch errors while creating a training dataset in Canvas.
Ensure all the predict*.csv files have the same headers; otherwise, you may run into schema mismatch errors while creating a prediction (inference) dataset in Canvas.
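The header requirement above can be checked locally before uploading anything. The following sketch compares the header row of each file with that of the first file and reports any that differ. The function names and the `data` folder are hypothetical, and the check is an exact, order-sensitive match, which may be stricter than what Canvas itself enforces.

```python
import csv
from pathlib import Path


def read_header(path):
    """Return the first row of a CSV file as a list of column names."""
    with open(path, newline="", encoding="utf-8") as handle:
        return next(csv.reader(handle), [])


def find_header_mismatches(paths):
    """Compare each file's header with the header of the first file.

    Returns a dict mapping each mismatching file path to its header.
    """
    paths = [Path(p) for p in paths]
    if not paths:
        return {}
    reference = read_header(paths[0])
    mismatches = {}
    for path in paths[1:]:
        header = read_header(path)
        if header != reference:
            mismatches[str(path)] = header
    return mismatches


if __name__ == "__main__":
    # "data" is a hypothetical local folder holding the files to upload.
    files = sorted(Path("data").glob("predict*.csv"))
    for name, header in find_header_mismatches(files).items():
        print(f"{name}: unexpected header {header}")
```

The same function works for the training files; only the glob pattern changes.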
To create a dataset in Canvas, complete the following steps:
Note that as of this writing, the dataset update functionality is only supported for Amazon S3 and locally uploaded data sources.
You can now create a dataset with multiple files.
We now have version 1 of the OnlineShoppersIntentions dataset with three files created.
The Data tab shows a preview of the dataset.
The Dataset files pane lists the available files.
We can see our first dataset version has three files. Any subsequent version will include all the files from previous versions and will provide a cumulative view of the data.
Let’s train an ML model with version 1 of our dataset.
By default, Canvas will pick up the most current dataset version for training.
The model training will take 2–5 minutes to complete. In our case, the trained model gives us a score of 89%.
Let’s update our dataset using the auto update functionality to bring in more data and see whether model performance improves with the new version of the dataset. Datasets can also be updated manually.
You’re redirected to the Auto update tab for the corresponding dataset. We can see that Enable auto update is currently disabled.
An auto update dataset configuration has been created. It can be edited at any time. When a corresponding dataset update job is triggered on the specified schedule, the job will appear in the Job history section.
We can view our files in the dataset-update-demo S3 bucket.
The dataset update job is triggered on the specified schedule and creates a new version of the dataset.
When the job is complete, dataset version 2 will have all the files from version 1 and the additional files processed by the dataset update job. In our case, version 1 has three files and the update job picked up three additional files, so the final dataset version has six files.
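The cumulative versioning described above can be pictured with a small model. This is only an illustration of the bookkeeping, not how Canvas is implemented. The file names are hypothetical and `next_version` is a made-up helper.

```python
def next_version(previous_files, new_files):
    """Return the file list of the next dataset version.

    Models the cumulative behavior described above: every file from the
    previous version is kept, and the newly processed files are appended.
    """
    return list(previous_files) + list(new_files)


# Hypothetical file names, used only for illustration.
version_1 = ["file1.csv", "file2.csv", "file3.csv"]
picked_up = ["file4.csv", "file5.csv", "file6.csv"]
version_2 = next_version(version_1, picked_up)

# Three files in version 1 plus three picked up by the update job.
print(len(version_1), len(version_2))
```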
We can view the new version that was created on the Version history tab.
The Data tab contains a preview of the dataset and provides a list of all the files in the latest version of the dataset.
Let’s retrain our ML model with the latest version of the dataset.
When the training is complete, let’s evaluate the model performance. The following screenshot shows that adding more data and retraining our ML model has improved its performance.
With an ML model trained, let’s create a dataset for predictions and run batch predictions on it.
Next, we set up auto updates on the prediction dataset.
In this step, we configure our auto batch prediction workflows.
We now have an automatic batch prediction workflow. This will be triggered when the Predict dataset is automatically updated.
Now let’s upload more CSV files to the predict S3 folder.
This operation will trigger an auto update of the predict dataset.
This will in turn trigger the automatic batch prediction workflow and generate predictions for us to view.
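The upload step above can be done in the Amazon S3 console, but it can also be scripted. The following is a minimal sketch, assuming the boto3 S3 client and its `upload_file(Filename, Bucket, Key)` method. The function name `upload_csv_files`, the local folder, and the assumption that the predict folder lives in the dataset-update-demo bucket are illustrative and should be adjusted to your setup.

```python
from pathlib import Path


def upload_csv_files(s3_client, bucket, prefix, local_dir):
    """Upload every *.csv file in local_dir under s3://bucket/prefix/.

    s3_client is expected to expose upload_file(Filename, Bucket, Key),
    as the boto3 S3 client does. Returns the list of object keys written.
    """
    keys = []
    for path in sorted(Path(local_dir).glob("*.csv")):
        key = f"{prefix.rstrip('/')}/{path.name}"
        s3_client.upload_file(str(path), bucket, key)
        keys.append(key)
    return keys


# Example usage (requires boto3 and AWS credentials; the bucket, prefix,
# and folder names below are assumptions):
#   import boto3
#   upload_csv_files(boto3.client("s3"), "dataset-update-demo", "predict", "new_files")
```

Passing the client in as an argument keeps the function easy to exercise with a stub before pointing it at a real bucket.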
We can view all automations on the Automations page.
Thanks to the automatic dataset update and automatic batch prediction workflows, we can use the latest version of a tabular, image, or document dataset for training ML models, and build batch prediction workflows that are triggered automatically on every dataset update.
To avoid incurring future charges, log out of Canvas. Canvas bills you for the duration of the session, and we recommend logging out of Canvas when you’re not using it. Refer to Logging out of Amazon SageMaker Canvas for more details.
In this post, we discussed how we can use the new dataset update capability to build new dataset versions and train our ML models with the latest data in Canvas. We also showed how we can efficiently automate the process of running batch predictions on updated data.
To start your low-code/no-code ML journey, refer to the Amazon SageMaker Canvas Developer Guide.
Special thanks to everyone who contributed to the launch.
Janisha Anand is a Senior Product Manager on the SageMaker No/Low-Code ML team, which includes SageMaker Canvas and SageMaker Autopilot. She enjoys coffee, staying active, and spending time with her family.
Prashanth is a Software Development Engineer at Amazon SageMaker and mainly works with SageMaker low-code and no-code products.
Esha Dutta is a Software Development Engineer at Amazon SageMaker. She focuses on building ML tools and products for customers. Outside of work, she enjoys the outdoors, yoga, and hiking.