Amazon SageMaker Data Wrangler is the fastest and easiest way for data scientists to prepare data for machine learning (ML) applications. With Data Wrangler, you can simplify the process of feature engineering and complete each step of the data preparation workflow, including data selection, cleansing, exploration, and visualization through a single visual interface. Data Wrangler comes with 300 built-in data transformation recipes that you can use to quickly normalize, transform, and combine features. With the data selection tool in Data Wrangler, you can quickly select data from different data sources, such as Amazon Simple Storage Service (Amazon S3), Amazon Athena, and Amazon Redshift.
AWS Lake Formation cross-account capabilities simplify securing and managing distributed data lakes across multiple accounts through a centralized approach, providing fine-grained access control to Athena tables.
In this post, we demonstrate how to enable cross-account access for Data Wrangler using Athena as a source and Lake Formation as a central data governance capability. As shown in the following architecture diagram, Account A is the data lake account that holds all the ML-ready data derived from ETL pipelines. Account B is the data science account where a team of data scientists uses Data Wrangler to compile and run data transformations. We need to enable cross-account permissions for Data Wrangler in Account B to access the data tables located in Account A’s data lake via Lake Formation permissions.
With this architecture, data scientists and engineers outside the data lake account can access data from the lake and create data transformations via Data Wrangler.
Before you dive into the setup process, ensure the data to be shared across accounts is crawled and cataloged as detailed in this post. We assume this process is complete and that the databases and tables already exist in Lake Formation.
The following are the high-level steps to implement this solution:
To get started, create a central data lake in Account A. You can control the access to the data lake with policies and permissions, and define permissions at the database, table, or column level.
To kickstart the setup process, download the titanic dataset .csv file and upload it to your S3 bucket. After you upload the file, you need to register the bucket in Lake Formation. Lake Formation permissions enable fine-grained access control for data in your data lake.
Note: If the titanic dataset has already been cataloged, you can skip the registration step below.
To register your data store, complete the following steps:
If this is the first time you’re accessing Lake Formation, you need to add administrators to the account.
You now add AWS Identity and Access Management (IAM) users or roles specific to Account A as data lake administrators.
This can also be the IAM admin role of Account A.
For more information about security settings, see Changing the Default Security Settings for Your Data Lake.
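If you prefer to script this step rather than use the console, the following is a minimal sketch using the AWS SDK for Python (boto3). It assumes that `put_data_lake_settings` writes the full settings object, so it first reads the current settings and merges the new administrator into them. The helper names `add_data_lake_admin` and `apply_admin` and the role ARNs are hypothetical, not part of the original walkthrough.

```python
import copy


def add_data_lake_admin(settings: dict, principal_arn: str) -> dict:
    """Return a copy of a DataLakeSettings dict with principal_arn added as an admin.

    The input dict is not modified, and an ARN that is already present is not duplicated.
    """
    updated = copy.deepcopy(settings)
    admins = updated.setdefault("DataLakeAdmins", [])
    already_admin = any(
        admin.get("DataLakePrincipalIdentifier") == principal_arn for admin in admins
    )
    if not already_admin:
        admins.append({"DataLakePrincipalIdentifier": principal_arn})
    return updated


def apply_admin(principal_arn: str) -> None:
    """Read the current settings in Account A, merge the admin, and write them back."""
    import boto3  # imported lazily so the helper above works without AWS access

    lakeformation = boto3.client("lakeformation")
    current = lakeformation.get_data_lake_settings()["DataLakeSettings"]
    lakeformation.put_data_lake_settings(
        DataLakeSettings=add_data_lake_admin(current, principal_arn)
    )


# Example with a hypothetical role ARN; no AWS call is made here.
existing = {
    "DataLakeAdmins": [
        {"DataLakePrincipalIdentifier": "arn:aws:iam::111111111111:role/ExistingAdmin"}
    ]
}
merged = add_data_lake_admin(existing, "arn:aws:iam::111111111111:role/DataLakeAdmin")
```

Merging before writing keeps any administrators that were already configured in the account.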
Next, you need to register the S3 bucket as the data lake location.
This page should display a list of S3 buckets that are marked as data lake storage resources for Lake Formation. A single S3 bucket may act as the repository for many datasets, or you could use separate buckets for separate data sources.
After this step, you should be able to see your S3 bucket under Data lake locations.
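The same registration can be sketched programmatically. The following example builds the arguments for the Lake Formation `register_resource` call in boto3. The bucket name, prefix, and helper names are hypothetical, and using the service-linked role is an assumption; pass your own role ARN if your setup requires one.

```python
from typing import Optional


def register_location_request(
    bucket: str, prefix: str = "", role_arn: Optional[str] = None
) -> dict:
    """Build keyword arguments for lakeformation.register_resource."""
    resource_arn = f"arn:aws:s3:::{bucket}"
    if prefix:
        resource_arn += "/" + prefix.strip("/")
    request = {"ResourceArn": resource_arn}
    if role_arn:
        # Register the location with a custom IAM role.
        request["RoleArn"] = role_arn
    else:
        # Assumption: fall back to the Lake Formation service-linked role.
        request["UseServiceLinkedRole"] = True
    return request


def register_location(bucket: str, prefix: str = "") -> None:
    """Register the S3 location as data lake storage in Account A."""
    import boto3  # imported lazily so the builder above works without AWS access

    boto3.client("lakeformation").register_resource(
        **register_location_request(bucket, prefix)
    )


# Example with a hypothetical bucket; no AWS call is made here.
request = register_location_request("my-datalake-bucket", "titanic/")
```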
This step is optional. Skip it if the titanic dataset has already been crawled and cataloged, in which case the database and table for the dataset already exist in the data lake.
Complete the following steps to register the database if it does not exist:
If the IAMAllowedPrincipals group is listed, make sure you revoke access to this group.
You should now be able to view the created database listed under Databases.
You should also be able to see the table on the Lake Formation console: in the navigation pane, under Data catalog, choose Tables. For this demo, we assume the table is named titanic_datalake_bucket_as.
To grant table permissions to Account A, complete the following steps:
You can also set a column filter.
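A column filter corresponds to a column-level grant in the Lake Formation API. The following is a rough sketch of the arguments for a boto3 `grant_permissions` call that limits a principal to a subset of columns. The database name, principal ARN, and column names are assumptions for illustration; substitute the names from your own catalog.

```python
from typing import List


def column_grant_request(
    principal_arn: str, database: str, table: str, columns: List[str]
) -> dict:
    """Build keyword arguments for lakeformation.grant_permissions with a column filter."""
    if not columns:
        raise ValueError("A column filter needs at least one column name")
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                "ColumnNames": list(columns),
            }
        },
        # This sketch grants read access only.
        "Permissions": ["SELECT"],
    }


def grant_columns(principal_arn: str, database: str, table: str, columns: List[str]) -> None:
    """Apply the column-level grant in Account A."""
    import boto3  # imported lazily so the builder above works without AWS access

    boto3.client("lakeformation").grant_permissions(
        **column_grant_request(principal_arn, database, table, columns)
    )


# Hypothetical values: the role, database, and column names are assumptions.
grant = column_grant_request(
    "arn:aws:iam::111111111111:role/AccountAAnalyst",
    "titanic_db",
    "titanic_datalake_bucket_as",
    ["passengerid", "survived", "pclass", "name", "sex"],
)
```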
You should be able to see the first five columns of the titanic_datalake_bucket_as table as per the granted permissions in the previous steps.
We have validated local access to the data lake table within Account A via this Athena step. Next, let’s grant access to an external account, in our case, Account B for the same table.
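The Athena validation can also be run from code. The following sketch builds a preview query and the arguments for the boto3 Athena `start_query_execution` call. The database name and the S3 output location are hypothetical; only the table name comes from this walkthrough.

```python
def preview_query(database: str, table: str, limit: int = 10) -> str:
    """Build a SELECT statement that previews the columns the caller is allowed to read."""
    if limit < 1:
        raise ValueError("limit must be a positive integer")
    return f'SELECT * FROM "{database}"."{table}" LIMIT {int(limit)}'


def athena_request(database: str, table: str, output_location: str, limit: int = 10) -> dict:
    """Build keyword arguments for athena.start_query_execution."""
    return {
        "QueryString": preview_query(database, table, limit),
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def run_preview(database: str, table: str, output_location: str) -> str:
    """Start the query and return its execution ID."""
    import boto3  # imported lazily so the builders above work without AWS access

    response = boto3.client("athena").start_query_execution(
        **athena_request(database, table, output_location)
    )
    return response["QueryExecutionId"]


# Hypothetical database name and results bucket; no AWS call is made here.
request = athena_request(
    "titanic_db", "titanic_datalake_bucket_as", "s3://my-athena-results/validation/"
)
```

With a column filter in place, `SELECT *` should return only the columns covered by the grant.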
This external account is the account running Data Wrangler. To grant table permissions, complete the following steps:
You must revoke the Super permission from the IAMAllowedPrincipals group for this table before granting it external access. You can do this on the Actions menu under View permissions, then choose IAMAllowedPrincipals and choose Revoke.
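The revoke-then-grant sequence can be sketched with boto3 as follows. In the Lake Formation API, the IAMAllowedPrincipals group is addressed as `IAM_ALLOWED_PRINCIPALS` and the Super permission as `ALL`. The account ID, database name, and helper names are hypothetical, and granting only `SELECT` is an assumption about what the external account needs.

```python
def revoke_super_request(database: str, table: str) -> dict:
    """Build keyword arguments for lakeformation.revoke_permissions."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": "IAM_ALLOWED_PRINCIPALS"},
        "Resource": {"Table": {"DatabaseName": database, "Name": table}},
        "Permissions": ["ALL"],  # "ALL" is the API name for the Super permission
    }


def cross_account_grant_request(
    database: str, table: str, consumer_account_id: str, with_grant_option: bool = False
) -> dict:
    """Build keyword arguments for lakeformation.grant_permissions to an external account."""
    if not (len(consumer_account_id) == 12 and consumer_account_id.isdigit()):
        raise ValueError("AWS account IDs are 12 digits")
    request = {
        "Principal": {"DataLakePrincipalIdentifier": consumer_account_id},
        "Resource": {"Table": {"DatabaseName": database, "Name": table}},
        "Permissions": ["SELECT"],
    }
    if with_grant_option:
        request["PermissionsWithGrantOption"] = ["SELECT"]
    return request


def share_table(database: str, table: str, consumer_account_id: str) -> None:
    """Run in Account A: revoke Super first, then grant to Account B."""
    import boto3  # imported lazily so the builders above work without AWS access

    lakeformation = boto3.client("lakeformation")
    lakeformation.revoke_permissions(**revoke_super_request(database, table))
    lakeformation.grant_permissions(
        **cross_account_grant_request(database, table, consumer_account_id)
    )


# Hypothetical database name and Account B ID; no AWS call is made here.
revoke = revoke_super_request("titanic_db", "titanic_datalake_bucket_as")
grant = cross_account_grant_request("titanic_db", "titanic_datalake_bucket_as", "222222222222")
```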
We can find a Lake Formation entry on this page.
After you accept it, on the Resource shares page, you should see the shared Lake Formation entry, which encapsulates the catalog, database, and table information.
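Accepting the share can also be scripted from Account B. The following sketch filters the invitations returned by the AWS RAM `get_resource_share_invitations` call and accepts the pending ones from Account A. The account ID and helper names are hypothetical, and pagination is left out for brevity.

```python
from typing import List


def pending_invitation_arns(invitations: List[dict], sender_account_id: str) -> List[str]:
    """Return ARNs of invitations that are still pending and were sent by the given account."""
    return [
        invitation["resourceShareInvitationArn"]
        for invitation in invitations
        if invitation.get("status") == "PENDING"
        and invitation.get("senderAccountId") == sender_account_id
    ]


def accept_shares_from(sender_account_id: str) -> List[str]:
    """Run in Account B: accept every pending invitation from the sender account."""
    import boto3  # imported lazily so the filter above works without AWS access

    ram = boto3.client("ram")
    invitations = ram.get_resource_share_invitations()["resourceShareInvitations"]
    arns = pending_invitation_arns(invitations, sender_account_id)
    for arn in arns:
        ram.accept_resource_share_invitation(resourceShareInvitationArn=arn)
    return arns


# Example with made-up invitation records; no AWS call is made here.
sample = [
    {"resourceShareInvitationArn": "arn:inv/1", "status": "PENDING", "senderAccountId": "111111111111"},
    {"resourceShareInvitationArn": "arn:inv/2", "status": "ACCEPTED", "senderAccountId": "111111111111"},
    {"resourceShareInvitationArn": "arn:inv/3", "status": "PENDING", "senderAccountId": "333333333333"},
]
to_accept = pending_invitation_arns(sample, "111111111111")
```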
On the Lake Formation console in Account B, you can find the shared table owned by Account A on the Tables page. If you don’t see it, you can refresh your screen and the resource should appear shortly.
To use this shared table inside Account B, you need to create a database local to Account B in Lake Formation.
Next, for the shared titanic table in Lake Formation, you need to create a resource link. Resource links are Data Catalog objects that link to metadata databases and tables, typically to shared databases and tables from other AWS accounts. They help enable cross-account access to data in the data lake.
This is to make sure that Lake Formation manages the database and table permissions.
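The local database and the resource link can be created through the AWS Glue Data Catalog API as well. The following is a minimal sketch of the arguments for the boto3 `create_database` and `create_table` calls, where the resource link is a table whose `TargetTable` points at the shared table in Account A. Only `titanic_local` and `titanic_datalake_bucket_as` come from this walkthrough; the database names and account ID are hypothetical.

```python
def local_database_request(name: str) -> dict:
    """Build keyword arguments for glue.create_database in Account B."""
    return {"DatabaseInput": {"Name": name}}


def resource_link_request(
    local_database: str,
    link_name: str,
    owner_account_id: str,
    shared_database: str,
    shared_table: str,
) -> dict:
    """Build keyword arguments for glue.create_table that create a resource link."""
    return {
        "DatabaseName": local_database,
        "TableInput": {
            "Name": link_name,
            # TargetTable identifies the shared table in the owning account's catalog.
            "TargetTable": {
                "CatalogId": owner_account_id,
                "DatabaseName": shared_database,
                "Name": shared_table,
            },
        },
    }


def create_link(local_database: str, link: dict) -> None:
    """Run in Account B: create the local database, then the resource link."""
    import boto3  # imported lazily so the builders above work without AWS access

    glue = boto3.client("glue")
    glue.create_database(**local_database_request(local_database))
    glue.create_table(**link)


# Hypothetical database names and Account A ID; no AWS call is made here.
link = resource_link_request(
    "titanic_local_db", "titanic_local", "111111111111",
    "titanic_db", "titanic_datalake_bucket_as",
)
```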
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "lakeformation:GetDataAccess",
        "glue:GetPartitions"
      ],
      "Resource": [
        "*"
      ]
    }
  ]
}
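If you manage roles in code, this policy can be attached as an inline policy with the boto3 IAM `put_role_policy` call. The following sketch builds the request. The role name and policy name are hypothetical; use the execution role that your Data Wrangler environment in Account B actually runs under.

```python
import json


def data_wrangler_policy() -> dict:
    """Return the policy document from this post as a Python dict."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["lakeformation:GetDataAccess", "glue:GetPartitions"],
                "Resource": ["*"],
            }
        ],
    }


def put_role_policy_request(
    role_name: str, policy_name: str = "DataWranglerLakeFormationAccess"
) -> dict:
    """Build keyword arguments for iam.put_role_policy."""
    return {
        "RoleName": role_name,
        "PolicyName": policy_name,  # hypothetical name, choose your own
        "PolicyDocument": json.dumps(data_wrangler_policy()),
    }


def attach_policy(role_name: str) -> None:
    """Run in Account B: attach the inline policy to the given role."""
    import boto3  # imported lazily so the builders above work without AWS access

    boto3.client("iam").put_role_policy(**put_role_policy_request(role_name))


# Hypothetical role name; no AWS call is made here.
request = put_role_policy_request("MySageMakerExecutionRole")
```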
This is the table that you shared to Account B from Account A via AWS RAM.
In this final stage, you should be ready to validate the steps deployed so far by testing this in the Data Wrangler interface.
You should be able to see the local table (titanic_local) in the right pane.
This imports the titanic dataset, and you should be able to see the data flow page with the visual blocks on the Prepare tab.
In this post, we demonstrated how to enable cross-account access for Data Wrangler using Lake Formation and AWS RAM. Following this methodology, organizations can allow multiple data science and engineering teams to access data from a central data lake and build feature pipelines and transformation recipes consistently. For more information about Data Wrangler, see Introducing Amazon SageMaker Data Wrangler, a Visual Interface to Prepare Data for Machine Learning and Exploratory data analysis, feature engineering, and operationalizing your data flow into your ML pipeline with Amazon SageMaker Data Wrangler.
Give Data Wrangler a try and share your feedback and questions in the comments section.
Rizwan Gilani is a Software Development Engineer at Amazon SageMaker. His passion lies with making machine learning more interactive and accessible at scale. Before that, he worked on Amazon Alexa as part of the core team that launched Alexa Communications.
Phi Nguyen is a solutions architect at AWS helping customers with their cloud journey, with a special focus on data lakes, analytics, semantic technologies, and machine learning. In his spare time, you can find him biking to work, coaching his son’s soccer team, or enjoying nature walks with his family.
Arunprasath Shankar is an Artificial Intelligence and Machine Learning (AI/ML) Specialist Solutions Architect with AWS, helping global customers scale their AI solutions effectively and efficiently in the cloud. In his spare time, Arun enjoys watching sci-fi movies and listening to classical music.