In this post, we show how to configure a new OAuth-based authentication feature for using Snowflake in Amazon SageMaker Data Wrangler. Snowflake is a cloud data platform that provides data solutions for data warehousing to data science. Snowflake is an AWS Partner with multiple AWS accreditations, including AWS competencies in machine learning (ML), retail, and data and analytics.
Data Wrangler simplifies the data preparation and feature engineering process, reducing the time it takes from weeks to minutes by providing a single visual interface for data scientists to select and clean data, create features, and automate data preparation in ML workflows without writing any code. You can import data from multiple data sources, such as Amazon Simple Storage Service (Amazon S3), Amazon Athena, Amazon Redshift, Amazon EMR, and Snowflake. With this new feature, you can use your own identity provider (IdP) such as Okta, Azure AD, or Ping Federate to connect to Snowflake via Data Wrangler.
In the following sections, we provide steps for an administrator to set up the IdP, Snowflake, and Studio. We also detail the steps that data scientists can take to configure the data flow, analyze the data quality, and add data transformations. Finally, we show how to export the data flow and train a model using SageMaker Autopilot.
For this walkthrough, you should have the following prerequisites:
Instead of having your users directly enter their Snowflake credentials into Data Wrangler, you can have them use an IdP to access Snowflake.
The following steps are involved to enable Data Wrangler OAuth access to Snowflake:
To set up your IdP, you must register the Data Wrangler application and set up your authorization server or API.
Refer to the following documentation for the IdPs that Data Wrangler supports:
Use the documentation provided by your IdP to register your Data Wrangler application. The information and procedures in this section help you understand how to properly use the documentation provided by your IdP.
Specific customizations in addition to the steps in the respective guides are called out in the subsections.
You’re specifying the SageMaker domain ID and AWS Region that you’re using to run Data Wrangler. You must register a URL for each domain and Region where you’re running Data Wrangler. Users from a domain and Region that don’t have redirect URLs set up for them won’t be able to authenticate with the IdP to access the Snowflake connection.
Within your IdP, you must set up an authorization server or an application programming interface (API). For each user, the authorization server or the API sends tokens to Data Wrangler with Snowflake as the audience.
Snowflake uses the concept of roles that are distinct from IAM roles used in AWS. You must configure the IdP to use ANY Role to use the default role associated with the Snowflake account. For example, if a user has systems administrator as the default role in their Snowflake profile, the connection from Data Wrangler to Snowflake uses systems administrator as the role.
Use the following procedure to set up the authorization server or API within your IdP:
The idle timeout is the time that the refresh token expires if it’s not used. If you’re scheduling jobs in Data Wrangler, we recommend making the idle timeout time greater than the frequency of the processing job. Otherwise, some processing jobs might fail because the refresh token expired before they could run. When the refresh token expires, the user must re-authenticate by accessing the connection that they’ve made to Snowflake through Data Wrangler.
Note that Data Wrangler doesn’t support rotating refresh tokens. Using rotating refresh tokens might result in access failures or users needing to log in frequently.
If the refresh token expires, your users must reauthenticate by accessing the connection that they’ve made to Snowflake through Data Wrangler.
For Azure AD, you must also specify a unique identifier for the scope.
After you’ve set up the OAuth provider, you provide Data Wrangler with the information it needs to connect to the provider. You can use the documentation from your IdP to get values for the following fields:
To configure Snowflake, complete the instructions in Import data from Snowflake.
Use the Snowflake documentation for your IdP to set up an external OAuth integration in Snowflake. See the previous section Register the Data Wrangler application within the IdP for more information on how to set up an external OAuth integration.
When you’re setting up the security integration in Snowflake, make sure you activate external_oauth_any_role_mode.
You store the fields and values in a Secrets Manager secret and add it to the Studio Lifecycle Configuration that you’re using for Data Wrangler. A Lifecycle Configuration is a shell script that automatically loads the credentials stored in the secret when the user logs into Studio. For information about creating secrets, see Move hardcoded secrets to AWS Secrets Manager. For information about using Lifecycle Configurations in Studio, see Use Lifecycle Configurations with Amazon SageMaker Studio.
To create your secret for Snowflake credentials, complete the following steps:
Key names require lowercase letters due to case sensitivity. Data Wrangler gives a warning if you enter any of these incorrectly. Input the secret values as key-value pairs Key/value if you’d like, or use the Plaintext option.
The following is the format of the secret used for Okta. If you are using Azure AD, you need to add the datasource_oauth_scope field.
{ “token_url”:”https://identityprovider.com/oauth2/example-portion-of-URL-path/v2/token”, “client_id”:”example-client-id”, “client_secret”:”example-client-secretsovfDSUoOKAiLe4V6DiZrCLpW44x”, “identity_provider”:”OKTA”|”AZURE_AD”|”PING_FEDERATE”, “authorization_url”:”https://identityprovider.com/oauth2/example-portion-of-URL-path/v2/authorize” }
After you store the secret, you’re returned to the Secrets Manager console.
To create a Lifecycle Configuration in Studio, complete the following steps:
The configuration creates a file with the name “.snowflake_identity_provider_oauth_config”, containing the secret in the user’s home folder.
Complete the following steps to set the Lifecycle Configuration you just created as the default:
Your new settings should now be visible under Lifecycle configurations for personal Studio apps as default.
In this section, we cover how data scientists can connect to Snowflake as a data source in Data Wrangler and prepare data for ML.
To create your data flow, complete the following steps:
Creating a new flow can take a few minutes.
If you don’t see OAuth, verify the preceding Lifecycle Configuration steps.
You’re redirected to an IdP authentication page. For this example, we’re using Okta.
After the authentication is successful, you’re redirected to the Studio data flow page.
In the following example, we load Loan Data and retrieve all columns from 5,000 rows.
You’re redirected to the Prepare page, where you can add transformations and analyses to the data.
Data Wrangler makes it easy to ingest data and perform data preparation tasks such as exploratory data analysis, feature selection, and feature engineering. We’ve only covered a few of the capabilities of Data Wrangler in this post on data preparation; you can use Data Wrangler for more advanced data analysis such as feature importance, target leakage, and model explainability using an easy and intuitive user interface.
Use the Data Quality and Insights Report to perform an analysis of the data that you’ve imported into Data Wrangler. Data Wrangler creates the report from the sampled data.
The insights report has a brief summary of the data, which includes general information such as missing values, invalid values, feature types, outlier counts, and more. You can either download the report or view it online.
Data Wrangler has over 300 built-in transformations. In this section, we use some of these transformations to prepare the dataset for an ML model.
If you’re following the steps in the post, you’re directed here automatically after adding your dataset.
Looking through the columns, we identify that MNTHS_SINCE_LAST_DELINQ and MNTHS_SINCE_LAST_RECORD should most likely be represented as a number type rather than string.
Looking through the data, we can see that the fields EMP_TITLE, URL, DESCRIPTION, and TITLE will likely not provide value to our model in our use case, so we can drop them.
Next, we want to look for categorical data in our dataset. Data Wrangler has a built-in functionality to encode categorical data using both ordinal and one-hot encodings. Looking at our dataset, we can see that the TERM, HOME_OWNERSHIP, and PURPOSE columns all appear to be categorical in nature.
The HOME_OWNERSHIP column has four possible values: RENT, MORTGAGE, OWN, and other.
Lastly, the PURPOSE column has several possible values. For this data, we use a one-hot encoding approach as well, but we set the output to a vector rather than columns.
This keeps the original PURPOSE column, if we decide to use it later.
Finally, we export this whole data flow to a feature store with a SageMaker Processing job, which creates a Jupyter notebook with the code pre-populated.
The exported notebook is now ready to run.
Now we can train the model using Amazon SageMaker Autopilot.
If your work with Data Wrangler is complete, shut down your Data Wrangler instance to avoid incurring additional fees.
In this post, we demonstrated connecting Data Wrangler to Snowflake using OAuth, transforming and analyzing a dataset, and finally exporting it to the data flow so that it could be used in a Jupyter notebook. Most notably, we created a pipeline for data preparation without having to write any code at all.
To get started with Data Wrangler, see Prepare ML Data with Amazon SageMaker Data Wrangler.
Ajjay Govindaram is a Senior Solutions Architect at AWS. He works with strategic customers who are using AI/ML to solve complex business problems. His experience lies in providing technical direction as well as design assistance for modest to large-scale AI/ML application deployments. His knowledge ranges from application architecture to big data, analytics, and machine learning. He enjoys listening to music while resting, experiencing the outdoors, and spending time with his loved ones.
Bosco Albuquerque is a Sr. Partner Solutions Architect at AWS and has over 20 years of experience in working with database and analytics products from enterprise database vendors and cloud providers. He has helped large technology companies design data analytics solutions and has led engineering teams in designing and implementing data analytics platforms and data products.
Matt Marzillo is a Sr. Partner Sales Engineer at Snowflake. He has 10 years of experience in data science and machine learning roles both in consulting and with industry organizations. Matt has experience developing and deploying AI and ML models across many different organizations in areas such as marketing, sales, operations, clinical, and finance, as well as advising in consultative roles.
Huong Nguyen is a product leader for Amazon SageMaker Data Wrangler at AWS. She has 15 years of experience creating customer-obsessed and data-driven products for both enterprise and consumer spaces. In her spare time, she enjoys audio books, gardening, hiking, and spending time with her family and friends.