With the rapid growth of data, many organizations are finding it difficult to analyze their large datasets to gain insights. As businesses rely more and more on automation algorithms, machine learning (ML) has become a necessity to stay ahead of the competition.
Amazon Redshift, a fast, fully managed, widely used cloud data warehouse, natively integrates with Amazon SageMaker for ML. With Amazon Redshift ML, you can use simple SQL statements to create and train ML models from your data in Amazon Redshift and then use these models for a variety of use cases, such as classification of a binary or multiclass outcome or predicting a numeric value through regression. Amazon SageMaker Autopilot provides all the benefits of automatic model creation, but as an advanced user, you can also influence the model training by providing different parameters such as model type, objective, and so on.
Amazon Redshift ML allows you to address several ML challenges, such as the following:
You can use Amazon Redshift ML to automate data preparation, pre-processing, and selection of problem type as depicted in this blog post. In this post, we assume that you have a good understanding of your data and what problem type you want to use for your use case. We demonstrate how to use Amazon Redshift ML to solve a regression problem predicting bike rental counts. We also provide some best practices for creating test data, validating your model, and using it for inference. We also show you how you can use the SageMaker console to troubleshoot the training process as an advanced user.
As a prerequisite for implementing the example in this post, you need to set up an Amazon Redshift cluster with ML enabled on it. For the preliminary steps to get started, see Create, train, and deploy machine learning models in Amazon Redshift using SQL with Amazon Redshift ML.
In this post, we use Amazon Redshift ML to build a regression model that predicts the number of people that may use the city of Toronto’s bike sharing service at any given hour of a day. The model accounts for various aspects, including holidays and weather conditions. Because we need to predict a numerical outcome, we create a regression model.
We walk you through the following high-level steps:
To prepare the raw data for this model, we populated three tables in Amazon Redshift using different public datasets.
You can use the following script to create and load the data in these tables in Amazon Redshift, and use the Amazon Redshift Query Editor to run these SQL scripts:
CREATE TABLE IF NOT EXISTS ridership ( trip_id INT , trip_duration_seconds INT , trip_start_time timestamp , trip_stop_time timestamp , from_station_name VARCHAR(50) , to_station_name VARCHAR(50) , from_station_id SMALLINT , to_station_id SMALLINT , user_type VARCHAR(20)); CREATE TABLE IF NOT EXISTS weather ( longitude_x DECIMAL(5,2) , latitude_y DECIMAL(5,2) , station_name VARCHAR(20) , climate_id BIGINT , datetime_utc TIMESTAMP , weather_year SMALLINT , weather_month SMALLINT , weather_day SMALLINT , time_utc VARCHAR(5) , temp_c DECIMAL(5,2) , temp_flag VARCHAR(1) , dew_point_temp_c DECIMAL(5,2) , dew_point_temp_flag VARCHAR(1) , rel_hum SMALLINT , rel_hum_flag VARCHAR(1) , precip_amount_mm DECIMAL(5,2) , precip_amount_flag VARCHAR(1) , wind_dir_10s_deg VARCHAR(10) , wind_dir_flag VARCHAR(1) , wind_spd_kmh VARCHAR(10) , wind_spd_flag VARCHAR(1) , visibility_km VARCHAR(10) , visibility_flag VARCHAR(1) , stn_press_kpa DECIMAL(5,2) , stn_press_flag VARCHAR(1) , hmdx SMALLINT , hmdx_flag VARCHAR(1) , wind_chill VARCHAR(10) , wind_chill_flag VARCHAR(1) , weather VARCHAR(10)); CREATE TABLE IF NOT EXISTS holiday ( holiday_date DATE , description VARCHAR(100));
To load the data, use the following COPY commands. Replace the AWS Identity and Access Management (IAM) role with the IAM role that you created as part of the prerequisite steps earlier.
COPY ridership FROM ‘s3://redshift-ml-bikesharing-data/bike-sharing-data/ridership/’ IAM_ROLE ‘arn:aws:iam::
Let’s discuss about how the data can be biased and how selecting the right distribution of data impacts accuracy. For most ML problems, data preparation is the most time-consuming process; it involves preparing the data, finding relevant attributes, and cleaning and curating it to be used as input to the ML model. Bias or anomalies in the input data distribution also play a key role in model accuracy, therefore it’s very important to curate that as much as possible. Let’s explore and prepare our input dataset.
The ridership table contains Bike Share Toronto’s ridership information for 2017 and 2018. We performed the following data preparation steps to make it more meaningful for our ML model:
We carried out all these transformations using the following simple view:
CREATE OR REPLACE VEIW ridership_view AS SELECT trip_time , trip_count , TO_CHAR(trip_time,’hh24′) ::INT trip_hour , TO_CHAR(trip_time, ‘dd’) :: INT trip_day , TO_CHAR(trip_time, ‘mm’) :: INT trip_month , TO_CHAR(trip_time, ‘yy’) :: INT trip_year , TO_CHAR(trip_time, ‘q’) :: INT trip_quarter , TO_CHAR(trip_time, ‘w’) :: INT trip_month_week , TO_CHAR(trip_time, ‘d’) :: INT trip_week_day FROM (SELECT CASE WHEN TRUNC(r.trip_start_time) < '2017-07-01'::DATE THEN CONVERT_TIMEZONE('US/Eastern', DATE_TRUNC('hour',r.trip_start_time)) ELSE DATE_TRUNC('hour',r.trip_start_time) END trip_time , COUNT(1) trip_count FROM ridership r WHERE r.trip_duration_seconds BETWEEN 60 AND 60 * 60 * 24 GROUP BY 1);
The weather table contains Government of Canada historical weather data for 2017 and 2018 in UTC format. We performed the following data preparation activities:
We use the following view in Amazon Redshift to apply these transformations:
CREATE OR REPLACE VEIW weather_view AS SELECT CONVERT_TIMEZONE(‘US/Eastern’, DATE_TRUNC(‘hour’,datetime_utc)) daytime , ROUND(AVG(temp_c)) temp_c , ROUND(AVG(precip_amount_mm)) precip_amount_mm FROM weather GROUP BY 1;
In this step, we create a new table, trip_data, in Amazon Redshift to combine all relevant attributes from the input tables. We also create a random ordered attribute named serial_number to denote the serial number of each record:
CREATE TABLE trip_data AS SELECT r.trip_time ,r.trip_count ,r.trip_hour ,r.trip_day ,r.trip_month ,r.trip_year ,r.trip_quarter ,r.trip_month_week ,r.trip_week_day ,w.temp_c ,w.precip_amount_mm ,CASE WHEN h.holiday_date IS NOT NULL THEN 1 WHEN TO_CHAR(r.trip_time,’D’)::INT IN (1,7) THEN 1 ELSE 0 END is_holiday , ROW_NUMBER() OVER (ORDER BY RANDOM()) serial_number FROM ridership_view r JOIN weather_view w ON ( r.trip_time = w.daytime ) LEFT OUTER JOIN holiday h ON ( TRUNC(r.trip_time) = h.holiday_date );
The following is an example SELECT query:
select * from trip_data limit 5;
The following table summarizes our output.
Trip time | trip count | trip hour | trip day | trip month | trip year | trip quarter | trip month week | trip week day | temp c | precip amount mm | is holiday | serial number |
2017-07-04 23:00 | 105 | 23 | 4 | 7 | 17 | 3 | 1 | 3 | 20 | 0 | 0 | 1 |
2018-09-03 9:00 | 233 | 9 | 3 | 9 | 18 | 3 | 1 | 2 | 25 | 0 | 1 | 2 |
2017-09-29 20:00 | 185 | 20 | 29 | 9 | 17 | 3 | 5 | 6 | 12 | 0 | 0 | 3 |
2017-04-09 23:00 | 28 | 23 | 9 | 4 | 17 | 2 | 2 | 1 | 12 | 0 | 1 | 4 |
2017-02-01 16:00 | 175 | 16 | 1 | 2 | 17 | 1 | 1 | 4 | 2 | 0 | 0 | 5 |
We can create an ML model in Amazon Redshift to predict the trip_count column in the preceding table for any given trip time. But before that, we may want to view the relationship of different attributes of this table with our target attribute, trip_count. The Pearson correlation coefficient is a popular method to find how strong a relationship is between two attributes. It returns a value between -1 and 1, in which 1 indicates a strong positive relationship and -1 indicates a strong negative relationship. A result of zero indicates no relationship at all.
We use the stored procedure sp_correlation, available in the Amazon Redshift Utilities GitHub repo, to view the relationship between our target column trip_count and the other numeric attributes in our dataset. This stored procedure allows us to get correlation of an attribute with other attributes in a table in Amazon Redshift.
call sp_correlation(‘public’,’trip_data’,’trip_count’,’tmp_corr_table’); select * from tmp_corr_table;
The following table summarizes our output.
trip count | trip hour | trip day | trip month | trip year | trip quarter | trip month week | trip week day | temp c | precip amount mm | is holiday | serial number |
1 | 0.32 | 0.02 | 0.19 | 0.12 | 0.20 | 0.02 | 0.02 | 0.52 | -0.07 | -0.12 | 0.00 |
The output shows that temperature and hour of the day have a strong positive correlation with the number of bike rentals. Similarly, precipitation and holidays have a negative correlation with the number of bike rentals. But serial number data has zero correlation because it’s a random number and we can safely exclude such columns in our ML model.
Now that our data preparation steps are complete, we can use the create model statement in Amazon Redshift to create our ML regression model. But before that, we want to divide our data into two sets by splitting 80% of the dataset for training and 20% for validation, which a common practice in ML. The training data is input to the ML model to identify the best possible algorithm for the model. After the model is created, we use the validation data to validate the model accuracy. We create the tables training_data and validation_data using the 20% threshold for our random ordered column serial_number with the following code:
CREATE TABLE training_data AS SELECT trip_count,trip_hour,trip_day,trip_month,trip_year,trip_quarter, trip_month_week, trip_week_day, temp_c, precip_amount_mm, is_holiday FROM trip_data WHERE serial_number > (SELECT COUNT(1) * 0.2 FROM trip_data); CREATE TABLE validation_data AS SELECT trip_count,trip_hour,trip_day,trip_month,trip_year,trip_quarter, trip_month_week, trip_week_day, temp_c, precip_amount_mm, is_holiday FROM trip_data WHERE serial_number <= (SELECT COUNT(1) * 0.2 FROM trip_data);
After that, we run the create model command in Amazon Redshift, which creates a regression model to predict the trip_count value for any input date and time:
CREATE MODEL predict_rental_count FROM training_data TARGET trip_count FUNCTION predict_rental_count IAM_ROLE ‘arn:aws:iam::
We define the following parameters in our create model statement:
The preceding statement takes a few seconds to complete. It initiates an Autopilot process in the background to automatically build, train, and tune the best ML model for the input data. It then uses Amazon SageMaker Neo to deploy that model locally in the Amazon Redshift cluster as a user-defined function (UDF). You can use the SHOW MODEL command in Amazon Redshift to track the progress of your model creation, which should be in the READY state within the max_runtime parameter you defined while creating the model.
In this step, we evaluate the accuracy of our ML model against our validation data.
For regression problems, typical accuracy metrics are mean square error (MSE) and root mean square error (RMSE), which measure the distance between the predicted numeric target and the actual numeric answer, also known as ground truth. We square the differences between the actual and predicted values and then find the average to calculate the MSE. The square root of MSE computes the RMSE. A model with a low MSE and RMSE score indicates a good model.
While creating the model, Autopilot automatically splits the input data into train and validation sets and selects the model with the best objective metric, which is deployed in the Amazon Redshift cluster. We use the show model statement to view various metrics, including the calculated MSE by SageMaker while creating the model:
show model predict_rental_count;
The following output shows an MSE of 1960.272217.
Key | Value |
Model Name | predict_rental_count |
Schema Name | Public |
Owner | Awsuser |
Creation Time | Mon, 15.02.2021 16:13:18 |
Model State | READY |
validation:mse | 1960.272217 |
Estimated Cost | 4.215379 |
TRAINING DATA: | |
Query | SELECT TRIP_COUNT, TRIP_HOUR, TRIP_DAY, TRIP_MONTH, TRIP_YEAR, TRIP_QUARTER, TRIP_MONTH_WEEK, TRIP_WEEK_DAY, TEMP_C, PRECIP_AMOUNT_MM, IS_HOLIDAY |
FROM TRAINING_DATA | |
Target Column | TRIP_COUNT |
PARAMETERS: | |
Model Type | Auto |
Problem Type | Regression |
Objective | MSE |
Function Name | predict_rental_count |
Function Parameters | trip_hour trip_day trip_month trip_year trip_quarter trip_month_week trip_week_day temp_c precip_amount_mm is_holiday |
Function Parameter Types | int4 int4 int4 int4 int4 int4 int4 numeric numeric int4 |
IAM Role | arn:aws:iam::xxxxxxxxxxxx:role/redshift-sagemaker-role |
s3 Bucket | redshift-xxxxxxxxx-bucket |
Max Runtime | 5400 |
Let’s run inference queries against our validation data using the following SQL query:
SELECT trip_time , actual_count , predicted_count , ( actual_count – predicted_count ) difference FROM (SELECT trip_time , trip_count AS actual_count , PREDICT_RENTAL_COUNT (trip_hour, trip_day, trip_month, trip_year, trip_quarter, trip_month_week, trip_week_day, temp_c, precip_amount_mm, is_holiday) predicted_count FROM validation_data) LIMIT 5;
The following table summarizes our output.
trip_time | actual_count | predicted_count | difference |
2017-09-29 20:00 | 185 | 246 | -61 |
2017-01-02 17:00 | 76 | 75 | 1 |
2018-12-24 18:00 | 70 | 111 | -41 |
2018-02-02 14:00 | 70 | 64 | 6 |
2018-07-08 22:00 | 273 | 299 | -26 |
We can also calculate the MSE and RMSE metrics based on our validation data:
SELECT ROUND(AVG(POWER(( actual_count – predicted_count ),2)),2) mse , ROUND(SQRT(AVG(POWER(( actual_count – predicted_count ),2))),2) rmse FROM (SELECT trip_time , trip_count AS actual_count , PREDICT_RENTAL_COUNT (trip_hour, trip_day, trip_month, trip_year, trip_quarter, trip_month_week, trip_week_day, temp_c, precip_amount_mm, is_holiday) predicted_count FROM validation_data);
We get the following output.
The RMSE value indicates that our model is accurate enough to the actual values from our validation dataset.
Although the Create Model statement in Amazon Redshift automatically takes care of initiating Autopilot process to build, train, and tune the best ML model and deploy that model in Amazon Redshift, you can also view the intermediate steps performed in this process, which may help you with troubleshooting if something goes wrong.
When you create the model, you need to enter an Amazon Simple Storage Service (Amazon S3) bucket name as the value for the parameter s3_bucket. This bucket is used to share training data and artifacts between Amazon Redshift and SageMaker. Amazon Redshift creates a subfolder in this bucket prior to unloading the training data. When training is complete, it deletes this subfolder and its contents unless you set the parameter s3_garbage_collect to off, which may be used for troubleshooting purposes. For more information, see CREATE MODEL.
If you navigate to the SageMaker console, you can view the training jobs and hyperparameter tuning jobs initiated by the Create Model process. You need appropriate access permissions, like the AmazonSageMakerReadOnly policy, to view these details. For more information, see AWS Managed (Predefined) Policies for Amazon SageMaker.
In the hyperparameter tuning jobs section on the SageMaker console, you can view the best training job selected by SageMaker based on the MSE metrics, along with different hyperparameter values. The following screenshot shows that our best training job had an MSE value of 1960.272217, similar to what we saw in the Amazon Redshift show model statement.
Amazon SageMaker Studio is a web-based integrated development environment (IDE) for ML. You can view your model training activities on the Experiments and trials page. You can also view job metrics by choosing the option Describe AutoML Job, as shown in the following screenshot.
Studio also allows you to view two useful notebooks:
For more information, see Amazon SageMaker Autopilot notebooks generated to manage AutoML tasks.
Amazon Redshift ML makes it easy for users of all skill levels to use ML technology. With no prior ML knowledge, you can use Amazon Redshift ML to gain business insights for your data. You provide the data that you want to train a model. Then Amazon Redshift ML creates a model that capture patterns in the input data. You can then use that model to make predictions using familiar SQL commands. To learn more about Amazon RedShift ML, visit the webpage.
Manash Deb is a Senior Analytics Specialist Solutions Architect at AWS. He has worked on building end-to-end data driven solutions in different database and data warehousing technologies for over fifteen years. He loves to learn new technologies and solving, automating and simplifying customer problems with easy-to-use cloud data solutions on AWS.
Debu Panda, a principal product manager at AWS, is an industry leader in analytics, application platform, and database technologies and has more than 25 years of experience in the IT world.
Gokul Soundararajan is a principal engineer at AWS and received a PhD from University of Toronto and has been working in the areas of storage, databases, and analytics.
Piali Das is a Senior Software Engineer in the Amazon SageMaker Autopilot team. She previously contributed to building SageMaker Algorithms. She enjoys scientific programming in general and has developed an interest in machine learning and distributed systems.