Data preparation remains a major challenge in the machine learning (ML) space. Data scientists and engineers need to write queries and code to pull data from source data stores, and then write more queries to transform that data and create features for model development and training. None of this pipeline development work is about building ML models; it's about building the data pipelines needed to make the data available to the models. Amazon SageMaker Data Wrangler makes it easier for data scientists and engineers to prepare data in the early phase of developing ML applications by using a visual interface.
Data Wrangler simplifies the process of data preparation and feature engineering using a single visual interface. Data Wrangler comes with over 300 built-in data transformations to help normalize, transform, and combine features without writing any code. You can now use Snowflake as a data source in Data Wrangler to easily prepare data in Snowflake for ML.
In this post, we use a simulated dataset that represents loans from a financial services provider, which has been provided by Snowflake. This dataset contains lender data about loans granted to individuals. We use Data Wrangler to transform and prepare the data for later use in ML models, first building a data flow in Data Wrangler, then exporting it to Amazon SageMaker Pipelines. First, we walk through setting up Snowflake as the data source, then explore and transform the data using Data Wrangler.
This post assumes you have the following: an AWS account, a Snowflake account, an S3 bucket and prefix to use for the Data Wrangler output, and access to an Amazon SageMaker Studio instance with a Studio user.
In this section, we cover the permissions required to set up Snowflake as a data source for Data Wrangler. This section requires you to perform steps in both the AWS Management Console and Snowflake. The user in each environment should have permission to create policies, roles, and secrets in AWS, and the ability to create storage integrations in Snowflake.
All permissions for AWS resources are managed via your IAM role attached to your Amazon SageMaker Studio instance. Snowflake-specific permissions are managed by the Snowflake admin; they can grant granular permissions and privileges to each Snowflake user. This includes databases, schemas, tables, warehouses, and storage integration objects. Make sure that the correct permissions are set up outside of Data Wrangler.
Snowflake requires the following permissions on your output S3 bucket and prefix to be able to access objects in the prefix: s3:GetObject, s3:GetObjectVersion, s3:PutObject, s3:DeleteObject, s3:DeleteObjectVersion, s3:ListBucket, and s3:GetBucketLocation.
You can add a bucket policy to ensure that Snowflake only communicates with your bucket over HTTPS. For instructions, see What S3 bucket policy should I use to comply with the AWS Config rule s3-bucket-ssl-requests-only?
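If you'd rather apply that bucket policy programmatically than through the console, a minimal boto3 sketch (the bucket name is a placeholder, not a value from this post) looks like the following; it denies any request to the bucket that doesn't use HTTPS.

import json
import boto3

BUCKET = "<your-output-bucket>"  # placeholder

# Deny all S3 actions on the bucket and its objects when the request is not made over TLS.
ssl_only_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyInsecureTransport",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            "Resource": [
                f"arn:aws:s3:::{BUCKET}",
                f"arn:aws:s3:::{BUCKET}/*",
            ],
            "Condition": {"Bool": {"aws:SecureTransport": "false"}},
        }
    ],
}

s3 = boto3.client("s3")
s3.put_bucket_policy(Bucket=BUCKET, Policy=json.dumps(ssl_only_policy))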
In this section, we cover creating the policy required for Snowflake to access data in an S3 bucket of your choosing. If you already have a policy and role that allows access to the S3 bucket you plan to use for the Data Wrangler output, you can skip this section and the next section, and start creating your storage integration in Snowflake.
The policy grants Snowflake read and write access to objects in your output bucket and prefix. Update the placeholder values for your environment, and be sure to remove the angle brackets around them.
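If you're scripting this rather than using the IAM console, the following boto3 sketch shows one way to create the policy; the statements mirror the permissions listed earlier, and the bucket, prefix, and policy name are placeholders, not values from this post.

import json
import boto3

BUCKET = "<your-output-bucket>"  # placeholder
PREFIX = "<your-output-prefix>"  # placeholder

# Read and write permissions Snowflake needs on the output bucket and prefix.
policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:GetObjectVersion",
                "s3:PutObject",
                "s3:DeleteObject",
                "s3:DeleteObjectVersion",
            ],
            "Resource": f"arn:aws:s3:::{BUCKET}/{PREFIX}/*",
        },
        {
            "Effect": "Allow",
            "Action": ["s3:ListBucket", "s3:GetBucketLocation"],
            "Resource": f"arn:aws:s3:::{BUCKET}",
            "Condition": {"StringLike": {"s3:prefix": [f"{PREFIX}/*"]}},
        },
    ],
}

iam = boto3.client("iam")
iam.create_policy(
    PolicyName="snowflake-datawrangler-s3-access",  # hypothetical name
    PolicyDocument=json.dumps(policy_document),
)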
In this section, we create an IAM role and attach it to the policy we created.
You modify the trust relationship and grant access to Snowflake later.
Later, we modify the trust relationship and specify the external ID for your Snowflake stage. An external ID is required to grant access to your AWS resources (such as Amazon S3) to a third party (Snowflake).
You now have an IAM policy created for an IAM role, and the policy is attached to the role.
In the next step, you create a Snowflake integration that references this role.
A storage integration in Snowflake stores a generated IAM entity for your external cloud storage, along with an optional set of allowed or blocked storage locations in Amazon S3. An AWS administrator in your organization grants permissions on the storage location to the generated IAM entity. With this feature, users don’t need to supply credentials when creating stages or when loading or unloading data.
Create the storage integration with the following code:
CREATE STORAGE INTEGRATION IF NOT EXISTS SAGEMAKER_DATAWRANGLER_INTEGRATION
  TYPE = EXTERNAL_STAGE
  STORAGE_PROVIDER = S3
  STORAGE_AWS_ROLE_ARN = '<iam_role_arn>'
  ENABLED = TRUE
  STORAGE_ALLOWED_LOCATIONS = ('s3://<bucket>/<prefix>/');
Run the following DESCRIBE INTEGRATION command to retrieve the ARN for the IAM user that was created automatically for your Snowflake account:
DESC INTEGRATION SAGEMAKER_DATAWRANGLER_INTEGRATION;
Record the following values from the output: STORAGE_AWS_IAM_USER_ARN and STORAGE_AWS_EXTERNAL_ID.
Now we update the trust policy.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "",
      "Effect": "Allow",
      "Principal": {
        "AWS": "<STORAGE_AWS_IAM_USER_ARN>"
      },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": {
          "sts:ExternalId": "<STORAGE_AWS_EXTERNAL_ID>"
        }
      }
    }
  ]
}
We use an external stage within Snowflake for loading data from an S3 bucket in your own account into Snowflake. In this step, we create an external (Amazon S3) stage that references the storage integration you created. For more information, see Creating an S3 Stage.
This requires a role that has the CREATE STAGE privilege on the schema as well as the USAGE privilege on the storage integration. You can grant these privileges to the role as shown in the code in the next step.
Create the stage using the CREATE STAGE command, with placeholders for the external stage name and the S3 bucket and prefix. The stage also references a named file format object called my_csv_format:
grant create stage on schema public to role <snowflake_role>;
grant usage on integration SAGEMAKER_DATAWRANGLER_INTEGRATION to role <snowflake_role>;

create stage <external_stage_name>
  storage_integration = SAGEMAKER_DATAWRANGLER_INTEGRATION
  url = 's3://<bucket>/<prefix>'
  file_format = my_csv_format;
Data Wrangler allows users to use either the ARN of an AWS Secrets Manager secret or a Snowflake account name, user name, and password to access Snowflake. If you intend to use the Snowflake account name, user name, and password option, skip to the next section, which covers adding the data source. When you use that option, Data Wrangler creates a Secrets Manager secret on your behalf by default.
To create a Secrets Manager secret manually, complete the following steps:
The key names are case-sensitive and must be lowercase. If you enter any of these incorrectly, Data Wrangler raises an error.
If you prefer, you can use the plaintext option and enter the secret values as JSON:
{
  "username": "<snowflake_username>",
  "password": "<snowflake_password>",
  "accountid": "<snowflake_account_id>"
}
After you store the secret, you’re returned to the Secrets Manager console.
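If you prefer to script this step instead of using the console, a minimal boto3 sketch looks like the following; the secret name is a placeholder, and the key names mirror the values entered in the preceding steps.

import json
import boto3

secretsmanager = boto3.client("secretsmanager")

secretsmanager.create_secret(
    Name="snowflake-datawrangler-credentials",  # hypothetical name
    SecretString=json.dumps(
        {
            "username": "<snowflake_username>",
            "password": "<snowflake_password>",
            "accountid": "<snowflake_account_id>",
        }
    ),
)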
In this section, we cover setting up Snowflake as a data source in Data Wrangler. This post assumes that you have access to SageMaker, an instance of Studio, and a user for Studio. For more information about prerequisites, see Get Started with Data Wrangler.
To create your data flow, complete the following steps:
Alternatively, on the File drop-down, choose New, then choose Data Wrangler Flow.
Creating a new flow can take a few minutes. After the flow has been created, you see the Import data page.
Next, we add Snowflake as a data source.
Data Wrangler uses HTTPS to connect to Snowflake.
You’re redirected to the import menu.
Now that Snowflake is set up as a data source, you can access your data in Snowflake directly from the Data Wrangler query editor. The query we write in the editor is what Data Wrangler uses to import data from Snowflake to start our data flow.
For this post, our dataset is in the database FIN_LOANS, the schema is DEV, and the table is LOAN_INT_HV. Our data warehouse is called MOONMAXW_DEV_WH; depending on your setup, these names will likely differ.
Alternatively, you can specify the full path to the dataset in the query editor. Make sure you still choose the database and schema on the drop-down menus.
For this post, we retrieve all columns from 1,000 rows.
You’re taken to the Prepare page, where you can add transformations and analyses to the data.
Data Wrangler has over 300 built-in transformations. In this section, we use some of these transformations to prepare the dataset for an ML model.
On the Data Wrangler flow page, make sure you have chosen the Prepare tab. If you’re following the steps in the post, you’re directed here automatically after adding your dataset.
The first step we want to perform is to check that the correct data type was inferred on ingest for each column.
Looking through the columns, we identify that MNTHS_SINCE_LAST_DELINQ and MNTHS_SINCE_LAST_RECORD should most likely be represented as a number type, rather than string.
Looking through the dataset, we can confirm that the rest of the columns appear to have been correctly inferred.
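Conceptually, this step is the same as casting those two columns to a numeric type. If you were doing it by hand in pandas, it would look roughly like the following sketch (df is assumed to hold the imported dataset).

import pandas as pd

# Columns that were inferred as strings but actually hold numbers.
for col in ["MNTHS_SINCE_LAST_DELINQ", "MNTHS_SINCE_LAST_RECORD"]:
    # errors="coerce" turns non-numeric values (such as empty strings) into NaN.
    df[col] = pd.to_numeric(df[col], errors="coerce")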
The dataset we’re using has several columns that likely aren’t beneficial to future models, so we start our transformation process by dropping the columns that aren’t useful.
The transformation console opens. Here you can preview your dataset, select from the available transformations, and preview the results of those transformations.
Looking through the data, we can see that the fields EMP_TITLE, URL, DESCRIPTION, and TITLE will likely not provide value to our model in our use case, so we drop them.
In the data flow view, we can see that this node in the flow has four steps, which represent the four columns we’re dropping for this part of the flow.
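For comparison, the equivalent operation written by hand in pandas would be a single call (a sketch; df is assumed to hold the imported dataset):

# Drop the columns that are unlikely to add value to the model.
df = df.drop(columns=["EMP_TITLE", "URL", "DESCRIPTION", "TITLE"])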
Next, we look for columns that are string data that can be formatted to be more beneficial to use later. Looking through our dataset, we can see that INT_RATE might be useful in a future model as a float, but has a trailing % character. Before we can use another built-in transformation (parse as type) to convert this to a float, we must strip the trailing character.
For this post, we keep the original column and set the output column to INT_RATE_PERCENTAGE to denote to future users of this data that this column is the interest rate as a percentage. Later, we convert this to a float.
When Data Wrangler adds a new column, it’s automatically added as the rightmost column.
Continuing with the preceding example, we’ve identified that INT_RATE_PERCENTAGE should be converted to a float type.
The From field is automatically populated.
As you can see, we now have six steps in this portion of the flow, four that represent columns being dropped, one that represents string formatting, and one that represents parse column as type.
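For reference, those two steps translate roughly to the following pandas operations (a sketch; df is assumed to hold the imported dataset, and the values in INT_RATE are assumed to look like 13.5%):

# Strip the trailing % while keeping the original column, writing to a new column.
df["INT_RATE_PERCENTAGE"] = df["INT_RATE"].str.rstrip("%")

# Parse the new column as a float so it can be used numerically downstream.
df["INT_RATE_PERCENTAGE"] = df["INT_RATE_PERCENTAGE"].astype(float)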
Next, we want to look for categorical data in our dataset. Data Wrangler has a built-in functionality to encode categorical data using both ordinal and one-hot encodings. Looking at our dataset, we can see that the TERM, HOME_OWNERSHIP, and PURPOSE columns all appear to be categorical in nature.
The first column in our list, TERM, has two possible values: 60 months and 36 months. Perhaps our future model would benefit from having these values one-hot encoded and placed into new columns.
We can now see two columns, TERM_36 months and TERM_60 months, are one-hot encoded to represent the corresponding value in the TERM column.
The HOME_OWNERSHIP column has four possible values: RENT, MORTGAGE, OWN, and other.
Lastly, the PURPOSE column has several possible values. For this data, we use a one-hot encoding approach as well, but we set the output to a vector, rather than columns.
This keeps the original PURPOSE column in case we decide to use it later.
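For comparison, writing these encodings yourself in pandas and scikit-learn would look roughly like the following sketch (df is assumed to hold the imported dataset); Data Wrangler generates equivalent steps for us without any code.

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# One-hot encode TERM into separate columns (for example, TERM_36 months and TERM_60 months).
term_dummies = pd.get_dummies(df["TERM"], prefix="TERM")
df = pd.concat([df, term_dummies], axis=1)

# One-hot encode PURPOSE into a single vector-valued column, keeping the original column.
encoder = OneHotEncoder()
purpose_vectors = encoder.fit_transform(df[["PURPOSE"]]).toarray()
df["PURPOSE_VECTOR"] = list(purpose_vectors)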
We can now see nine different transformations in this flow, and we still haven’t written a single line of code.
As our last step in this flow, we want to handle outliers in our dataset. As part of the data exploration process, we can create an analysis (which we cover in the next section). In the following example scatter plot, we explore whether we can gain insights from the relationship between annual income, interest rate, and employment length. On the graph, the loan receivers’ INT_RATE_PERCENTAGE is on the X axis, ANNUAL_INC is on the Y axis, and the data is color-coded by EMP_LENGTH. The dataset has some outliers that might skew the results of our model later. To address this, we use Data Wrangler’s built-in transformation for handling outliers.
This is optional, but it’s good practice to notate that a column has been transformed for later consumers.
This option automatically clips values to the corresponding outlier detection bound, which we set next.
This allows values within four standard deviations of the mean to be considered valid (and therefore not clipped). Values outside of this bound are clipped.
The output includes an object type. We need to convert this to a float for it to be valid within our dataset and visualization.
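Conceptually, the clip strategy behaves like the following sketch (using ANNUAL_INC as the example column; df is assumed to hold the dataset, and the output column name is just an illustration):

# Compute the bounds as the mean plus or minus four standard deviations.
mean = df["ANNUAL_INC"].mean()
std = df["ANNUAL_INC"].std()
lower, upper = mean - 4 * std, mean + 4 * std

# Clip values outside the bounds and store the result as a float column.
df["ANNUAL_INC_CLIPPED"] = df["ANNUAL_INC"].clip(lower, upper).astype(float)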
In this section, we walk through adding analyses to the dataset. We focus on visualizations, but there are several other options, including detecting target leakage, generating a bias report, or adding your own custom visualizations using the Altair library.
To create a scatter plot, complete the following steps:
The following screenshot shows our scatter plot.
We can compare this to the earlier version of the plot, before the outliers were handled.
So far this is looking good, but let’s add a facet to break out each category in the Grade column into its own graph.
The following screenshot has been trimmed down for display purposes. The Y axis still represents ANNUAL_INC. For faceted plots, this is displayed on the bottommost plot.
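The built-in scatter plot needs no code, but if you wanted to reproduce a similar faceted view as a custom visualization with the Altair library mentioned earlier, a minimal sketch might look like this (df is assumed to hold the prepared dataset, and GRADE is assumed to be the name of the loan grade column):

import altair as alt

chart = (
    alt.Chart(df)
    .mark_circle(size=20)
    .encode(
        x="INT_RATE_PERCENTAGE:Q",
        y="ANNUAL_INC:Q",
        color="EMP_LENGTH:N",
    )
    .facet(column="GRADE:N")  # one panel per loan grade
)
chart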
Finally, we export this whole data flow as a pipeline, which creates a Jupyter notebook with the code pre-populated. With Data Wrangler, you can also export your data flow to a Jupyter notebook as a SageMaker processing job, export it to SageMaker Feature Store, or export it directly to Python code.
The pre-populated Jupyter notebook loads and opens automatically, displaying all the generated steps and code for your data flow. The following screenshot shows the input section that defines the data source.
If your work with Data Wrangler is complete, shut down your Data Wrangler instance to avoid incurring additional fees.
In this post, we covered setting up Snowflake as a data source for Data Wrangler, adding transformations and analyses to a dataset, and then exporting the data flow for further use in a Jupyter notebook. We further improved our data flow after visualizing our dataset using Data Wrangler’s built-in analysis functionality. Most notably, we built a data preparation pipeline without having to write a single line of code.
To get started with Data Wrangler, see Prepare ML Data with Amazon SageMaker Data Wrangler, and see the latest information on the Data Wrangler product page.
Data Wrangler makes it easy to ingest data and perform data preparation tasks such as exploratory data analysis, feature selection, and feature engineering. We’ve only covered a few of Data Wrangler’s data preparation capabilities in this post; you can also use Data Wrangler for more advanced analysis such as feature importance, target leakage, and model explainability, all through an easy and intuitive user interface.
Maxwell Moon is a Senior Solutions Architect at AWS working with Independent Software Vendors (ISVs) to design and scale their applications on AWS. Outside of work, Maxwell is a dad to two cats, is an avid supporter of the Wolverhampton Wanderers Football Club, and tries to spend as much time playing music as possible.
Bosco Albuquerque is a Sr. Partner Solutions Architect at AWS and has over 20 years of experience working with database and analytics products from enterprise database vendors and cloud providers. He has helped large technology companies design data analytics solutions and has led engineering teams in designing and implementing data analytics platforms and data products.