Amazon SageMaker is a fully managed machine learning (ML) service. With SageMaker, data scientists and developers can quickly build and train ML models, and then deploy them directly into a production-ready hosted environment. SageMaker provides an integrated Jupyter notebook instance for easy access to your data sources for exploration and analysis, so you don’t have to manage servers. It also provides common ML algorithms that are optimized to run efficiently against extremely large datasets in a distributed environment.
SageMaker requires that the training data for an ML model be present in Amazon Simple Storage Service (Amazon S3), Amazon Elastic File System (Amazon EFS), or Amazon FSx for Lustre (for more information, refer to Access Training Data). To train a model using data stored outside of these three storage services, the data first needs to be ingested into one of them (typically Amazon S3). This requires building a data pipeline (using tools such as Amazon SageMaker Data Wrangler) to move data into Amazon S3. However, this approach can create a data management challenge: you must manage the lifecycle of the staged data, craft access controls, audit the data, and so on, all to stage training data only for the duration of the training job. In such situations, it may be desirable to make the data accessible to SageMaker in the ephemeral storage attached to the training instances, without intermediate storage in Amazon S3.
This post shows a way to do this using Snowflake as the data source and by downloading the data directly from Snowflake into a SageMaker Training job instance.
We use the California Housing Dataset as a training dataset for this post and train an ML model to predict the median house value for each district. We add this data to Snowflake as a new table. We create a custom training container that downloads data directly from the Snowflake table into the training instance rather than first downloading the data into an S3 bucket. After the data is downloaded into the training instance, the custom training script performs data preparation tasks and then trains the ML model using the XGBoost Estimator. All code for this post is available in the GitHub repo.
Figure 1: Architecture
The following figure represents the high-level architecture of the proposed solution to use Snowflake as a data source to train ML models with SageMaker.
The workflow steps are as follows:
To implement the solution provided in this post, you should have an AWS account, a Snowflake account and familiarity with SageMaker.
We use AWS CloudFormation to create a SageMaker notebook called aws-aiml-blogpost-sagemaker-snowflake-example and an IAM role called SageMakerSnowFlakeExample. Choose Launch Stack for the Region you want to deploy resources to.
Store your Snowflake credentials as a secret in Secrets Manager. For instructions on how to create a secret, refer to Create an AWS Secrets Manager secret.
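As a rough sketch of how a notebook or training script might read that secret back, the following uses the Secrets Manager `get_secret_value` call from boto3. The secret name, the Region default, and the JSON keys (`username`, `password`, `account`) are assumptions made for illustration; use whatever names you chose when creating the secret.

```python
import json


def get_snowflake_credentials(secret_name, client=None, region_name="us-west-2"):
    """Return the Snowflake credentials stored as a JSON secret.

    Assumes the secret value is a JSON object, for example
    {"username": "...", "password": "...", "account": "..."} (hypothetical keys).
    """
    if client is None:
        # boto3 is imported lazily so the function can also be called with a stub client.
        import boto3

        client = boto3.client("secretsmanager", region_name=region_name)
    response = client.get_secret_value(SecretId=secret_name)
    return json.loads(response["SecretString"])
```

Reading the secret at run time keeps the Snowflake password out of the notebook and out of the container image.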
To ingest the data, complete the following steps:
Figure 2: Open JupyterLab
Figure 3: Notebook Run All Cells
The following code snippet in the notebook ingests the dataset into Snowflake. See the snowflake-load-dataset.ipynb notebook for the full code.
# connect to Snowflake Table schema
conn.cursor().execute(f"CREATE SCHEMA IF NOT EXISTS {schema}")
conn.cursor().execute(f"USE SCHEMA {schema}")

create_table_sql = f"CREATE TABLE IF NOT EXISTS {db}.{schema}.{table}\n ("

california_housing.rename(columns=str.upper, inplace=True)
# iterating through the columns
for col in california_housing.columns:
    column_name = col.upper()

    if (california_housing[col].dtype.name == "int" or california_housing[col].dtype.name == "int64"):
        create_table_sql = create_table_sql + column_name + " int"
    elif california_housing[col].dtype.name == "object":
        create_table_sql = create_table_sql + column_name + " varchar(16777216)"
    elif california_housing[col].dtype.name == "datetime64[ns]":
        create_table_sql = create_table_sql + column_name + " datetime"
    elif california_housing[col].dtype.name == "float64":
        create_table_sql = create_table_sql + column_name + " float8"
    elif california_housing[col].dtype.name == "bool":
        create_table_sql = create_table_sql + column_name + " boolean"
    else:
        create_table_sql = create_table_sql + column_name + " varchar(16777216)"

    # Deciding next steps. Either column is not the last column (add comma) else end create_tbl_statement
    if california_housing[col].name != california_housing.columns[-1]:
        create_table_sql = create_table_sql + ",\n"
    else:
        create_table_sql = create_table_sql + ")"

# execute the SQL statement to create the table
print(f"create_table_sql={create_table_sql}")
conn.cursor().execute(create_table_sql)
print(f"snowflake_table={snowflake_table}")
conn.cursor().execute('TRUNCATE TABLE IF EXISTS ' + snowflake_table)
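The type-mapping logic in the snippet above can also be written as a lookup table. The following is a minimal sketch of that idea, not the code from the notebook; the names `SNOWFLAKE_TYPES`, `DEFAULT_TYPE`, and `build_create_table_sql` are introduced here for illustration, and the function works on plain dtype names so it does not need pandas to run.

```python
# Mapping from pandas dtype names to Snowflake column types, mirroring the
# branches of the snippet above; any dtype not listed falls back to varchar.
SNOWFLAKE_TYPES = {
    "int": "int",
    "int64": "int",
    "object": "varchar(16777216)",
    "datetime64[ns]": "datetime",
    "float64": "float8",
    "bool": "boolean",
}
DEFAULT_TYPE = "varchar(16777216)"


def build_create_table_sql(qualified_table, column_dtypes):
    """Build a CREATE TABLE statement from {column name: dtype name} pairs."""
    columns = [
        f"{name.upper()} {SNOWFLAKE_TYPES.get(dtype, DEFAULT_TYPE)}"
        for name, dtype in column_dtypes.items()
    ]
    return (
        f"CREATE TABLE IF NOT EXISTS {qualified_table}\n ("
        + ",\n".join(columns)
        + ")"
    )
```

With a DataFrame, the `column_dtypes` argument could be built as `{c: california_housing[c].dtype.name for c in california_housing.columns}`. Joining the column definitions with `",\n"` removes the need for the last-column check.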
Figure 4: Snowflake Table
This notebook creates a custom training container with a Snowflake connection, extracts data from Snowflake into the training instance’s ephemeral storage without staging it in Amazon S3, and performs Distributed Data Parallel (DDP) XGBoost model training on the data. DDP training is not required for model training on such a small dataset; it is included here for illustration of yet another recently released SageMaker feature.
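To illustrate the data-parallel idea only, the sketch below shows one way a training script could give each training instance its own slice of the downloaded rows. It assumes the `SM_HOSTS` and `SM_CURRENT_HOST` environment variables that the SageMaker training toolkit sets inside training containers; the helper names are hypothetical, and this is not necessarily how the notebook or the XGBoost container divides the data.

```python
import json
import os


def host_rank_and_count(environ=os.environ):
    """Return (rank, number of hosts) for the current training instance.

    Assumes SM_HOSTS holds a JSON list of host names and SM_CURRENT_HOST the
    name of this host; falls back to a single host when they are not set.
    """
    hosts = sorted(json.loads(environ.get("SM_HOSTS", '["algo-1"]')))
    current = environ.get("SM_CURRENT_HOST", hosts[0])
    return hosts.index(current), len(hosts)


def shard_rows(rows, rank, num_hosts):
    """Keep every num_hosts-th row, starting at this host's rank."""
    return rows[rank::num_hosts]
```

Sorting the host names gives every instance the same ordering, so the shards do not overlap and together cover all rows.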
Figure 5: Open SageMaker Snowflake Example Notebook
We now create a custom container for the ML model training job. Note that root access is required for creating a Docker container. This SageMaker notebook was deployed with root access enabled. If your enterprise organization policies don’t allow root access to cloud resources, you may want to use the following Dockerfile and shell scripts to build a Docker container elsewhere (for example, your laptop) and then push it to Amazon ECR. We use a container based on the SageMaker XGBoost container image 246618743249.dkr.ecr.us-west-2.amazonaws.com/sagemaker-xgboost:1.5-1 with the following additions:
Using the Snowflake connector and Python script ensures that users who use this container image for ML model training don’t have to write this code as part of their training script and can use this functionality that is already available to them.
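A script of this kind might look roughly like the following. This is a sketch under stated assumptions rather than the script shipped in the container: the function names, the credential keys, and the output path are hypothetical, while `snowflake.connector.connect` and the cursor methods come from the Snowflake Connector for Python.

```python
import csv


def connect_to_snowflake(credentials, warehouse, database, schema):
    """Open a connection with snowflake-connector-python (imported lazily)."""
    import snowflake.connector

    return snowflake.connector.connect(
        user=credentials["username"],      # hypothetical secret keys
        password=credentials["password"],
        account=credentials["account"],
        warehouse=warehouse,
        database=database,
        schema=schema,
    )


def download_table(conn, qualified_table, out_path):
    """Write every row of a table to a local CSV file and return the row count."""
    cursor = conn.cursor()
    try:
        cursor.execute(f"SELECT * FROM {qualified_table}")
        # The first field of each description entry is the column name.
        header = [col[0] for col in cursor.description]
        rows = cursor.fetchall()
    finally:
        cursor.close()
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)
    return len(rows)
```

Because the file is written to the training instance’s local disk, it disappears when the training job ends and nothing is staged in Amazon S3. For tables that do not fit in memory, fetching in batches with `cursor.fetchmany` would be a safer choice than `fetchall`.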
The following is the Dockerfile for the training container:
# Build an image that can be used for training in Amazon SageMaker, we use
# the SageMaker XGBoost as the base image as it contains support for distributed
# training.
FROM 246618743249.dkr.ecr.us-west-2.amazonaws.com/sagemaker-xgboost:1.5-1

MAINTAINER Amazon AI
The container image is built and pushed to Amazon ECR. This image is used for training the ML model.
After we successfully create the container image and push it to Amazon ECR, we can start using it for model training.
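As a hedged sketch of what starting the training job could look like with the SageMaker Python SDK, the following builds the arguments for a generic `Estimator` and calls `fit()` without any input channels, because the container fetches its own data from Snowflake. The helper names, the default instance type, and the hyperparameter values are illustrative assumptions; the notebook in the GitHub repo may configure its estimator differently.

```python
def build_estimator_kwargs(image_uri, role, instance_count=1,
                           instance_type="ml.m5.xlarge", hyperparameters=None):
    """Collect the keyword arguments for sagemaker.estimator.Estimator."""
    return {
        "image_uri": image_uri,            # custom image pushed to Amazon ECR
        "role": role,                      # IAM role ARN used by the training job
        "instance_count": instance_count,  # more than 1 for distributed training
        "instance_type": instance_type,
        "hyperparameters": dict(hyperparameters or {}),
    }


def launch_training(estimator_kwargs):
    """Create the estimator and start the job (requires the sagemaker package)."""
    from sagemaker.estimator import Estimator

    estimator = Estimator(**estimator_kwargs)
    # No S3 input channels are passed: the data comes from Snowflake.
    estimator.fit()
    return estimator
```

A call might pass XGBoost settings such as `{"num_round": 100, "max_depth": 5}` as `hyperparameters`; these values are examples, not tuned recommendations.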
For more information, refer to Prepare a Scikit-Learn Training Script.
print(f"the trained model is available in Amazon S3 -> {xgb_script_mode_estimator.model_data}")
You can now deploy the trained model to get inferences on new data. For instructions, refer to Create your endpoint and deploy your model.
To avoid incurring future charges, delete the resources. You can do this by deleting the CloudFormation stack that created the IAM role and SageMaker notebook.
Figure 6: Cleaning Up
You will have to delete the Snowflake resources manually from the Snowflake console.
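If you prefer to script the cleanup instead of using the consoles, a minimal sketch could look like the following. It assumes you know the stack name you chose at launch and the fully qualified name of the Snowflake table; both helper functions are hypothetical, while `delete_stack` is the boto3 CloudFormation client call and `DROP TABLE IF EXISTS` is standard Snowflake SQL.

```python
def delete_stack(stack_name, client=None):
    """Delete the CloudFormation stack that created the notebook and IAM role."""
    if client is None:
        # boto3 is imported lazily so the function can also be called with a stub client.
        import boto3

        client = boto3.client("cloudformation")
    client.delete_stack(StackName=stack_name)


def drop_snowflake_table(conn, qualified_table):
    """Drop the table that was created for this walkthrough."""
    cursor = conn.cursor()
    try:
        cursor.execute(f"DROP TABLE IF EXISTS {qualified_table}")
    finally:
        cursor.close()
```

Stack deletion is asynchronous, so the resources may take a few minutes to disappear after the call returns.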
In this post, we showed how to download data stored in a Snowflake table to a SageMaker Training job instance and train an XGBoost model using a custom training container. This approach allows us to directly integrate Snowflake as a data source with a SageMaker notebook without having the data staged in Amazon S3.
We encourage you to learn more by exploring the Amazon SageMaker Python SDK and building a solution using the sample implementation provided in this post and a dataset relevant to your business. If you have questions or suggestions, leave a comment.
Amit Arora is an AI and ML specialist architect at Amazon Web Services, helping enterprise customers use cloud-based machine learning services to rapidly scale their innovations. He is also an adjunct lecturer in the MS data science and analytics program at Georgetown University in Washington D.C.
Divya Muralidharan is a Solutions Architect at Amazon Web Services. She is passionate about helping enterprise customers solve business problems with technology. She has a Master’s in Computer Science from Rochester Institute of Technology. Outside of the office, she spends time cooking, singing, and growing plants.
Sergey Ermolin is a Principal AIML Solutions Architect at AWS. Previously, he was a software solutions architect for deep learning, analytics, and big data technologies at Intel. A Silicon Valley veteran with a passion for machine learning and artificial intelligence, Sergey has been interested in neural networks since pre-GPU days, when he used them to predict aging behavior of quartz crystals and cesium atomic clocks at Hewlett-Packard. Sergey holds an MSEE and a CS certificate from Stanford and a BS degree in physics and mechanical engineering from California State University, Sacramento. Outside of work, Sergey enjoys wine-making, skiing, biking, sailing, and scuba-diving. Sergey is also a volunteer pilot for Angel Flight.