Regulatory mandates, audit requirements, and security policies often call for data visibility and granular data control while using Amazon Simple Storage Service (Amazon S3) for shared datasets. Because data on Amazon S3 is often accessible by multiple applications and teams, fine-grained access controls should be implemented to restrict privileged information such as personally identifiable information (PII) to only authorized entities. For example, PII data used by a marketing application may need to be masked to meet data privacy requirements. Similarly, an order inventory dataset used by a production ordering application may include customer credit card information that shouldn’t be accessed by a business analytics application, so this data should be suppressed to prevent unintended data leakage.
In this post, we show you how to implement Amazon S3 Object Lambda to process and modify data retrieved from Amazon S3.
Currently, organizations employ some combination of manual processes and rules-based automation to identify and protect PII. Manual processes are slow, expensive, and can’t scale to address large amounts of data with accuracy. Manual processes also exacerbate human risk because sensitive data is in the hands of more human users and applications during the PII management processes. Rules-based automation is often used to augment manual processes, but this automation requires continued investment to keep it relevant and effective. These automation investments also have diminishing returns because they often require human support to sufficiently protect PII due to the context-driven nature of many PII scenarios that can’t be effectively addressed by rules-based automation alone.
From an implementation perspective, organizations typically either create and manage a proxy in front of Amazon S3 to intercept and redact data or create and store additional redacted derivative copies of datasets to provide multiple users and applications with redacted and unredacted versions of the same datasets. In both implementation models, you need to build and operate custom data processing software on additional infrastructure and storage, which adds complexity, data risk, and cost. These circumstances make it challenging for organizations to affordably and accurately protect PII at scale.
AWS customers manage many S3 buckets containing shared datasets that are accessed by multiple applications and users. You can use Amazon S3 Access Points to simplify data access management at scale. S3 Access Points have unique hostnames with dedicated access policies that describe how data can be accessed using the S3 Access Point. Before S3 Access Points, least privilege shared access to data meant managing permissions directly on the bucket using a single bucket policy document and bucket ACLs. These policies could represent hundreds of applications and users with various access needs and permissions. S3 Access Points simplify and streamline data access by creating individualized access permissions that easily scale with your data while providing management transparency and auditability.
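As an illustration of a dedicated access policy, the following minimal sketch builds an access point policy that lets a single IAM role read objects through one access point. The account ID, access point name, and helper function are hypothetical placeholders, not values from this post's templates.

```python
import json


def access_point_policy(region, account_id, access_point, role_name):
    """Build an access point policy that lets one IAM role read objects."""
    ap_arn = f"arn:aws:s3:{region}:{account_id}:accesspoint/{access_point}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": f"arn:aws:iam::{account_id}:role/{role_name}"},
                "Action": "s3:GetObject",
                # Objects behind an access point are addressed as <arn>/object/<key>.
                "Resource": f"{ap_arn}/object/*",
            }
        ],
    }


# Placeholder account ID and access point name, for illustration only.
policy = access_point_policy("us-east-1", "111122223333", "example-ap", "GeneralRole")
print(json.dumps(policy, indent=2))
```

You could attach a document like this with the S3 Control `put_access_point_policy` operation, passing `json.dumps(policy)` as the `Policy` argument.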
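With S3 Object Lambda, organizations can transform S3 objects in-flight as they are being retrieved through a standard Amazon S3 GET request by using S3 Object Lambda Access Points. AWS has provided two new pre-built AWS Lambda functions to help you detect, redact, and govern PII. Both functions are now available on the AWS Serverless Application Repository to be deployed at no license cost: ComprehendPiiAccessControlS3ObjectLambda, which blocks retrieval of objects in which PII is detected, and ComprehendPiiRedactionS3ObjectLambda, which redacts configured types of PII from objects as they are retrieved.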
Unlike a human workforce, these capabilities can scale to large amounts of data without affecting accuracy and can reduce the number of humans and entities in contact with known and unknown PII data.
These S3 Object Lambda functions are powered by Amazon Comprehend, a fully managed service that uses state-of-the-art natural language processing (NLP) techniques to accurately identify PII. This means that the two new functions can capture variations in how PII is represented, regardless of how PII exists in text (such as numerically or as a combination of words and numbers). Amazon Comprehend can even use context in the text to determine whether a 4-digit number is a PIN, the last four numbers of a Social Security number, or a year. With S3 Object Lambda, you don’t have to operate custom software or maintain additional infrastructure and storage to deploy this processing around your data. With just a few clicks on the AWS Management Console or through the AWS Command Line Interface (AWS CLI), you can configure and deploy the Amazon Comprehend-powered PII Lambda functions to control and manage your PII.
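To make the detection output concrete: the Amazon Comprehend `DetectPiiEntities` operation returns a list of entities, each with a `Type`, a `Score`, and `BeginOffset`/`EndOffset` character positions. The following sketch masks those spans in a string. The sample entity list is hand-written rather than real service output, and the pre-built functions may redact differently.

```python
def mask_entities(text, entities, mask_char="*"):
    """Replace each detected span with mask characters of equal length."""
    chars = list(text)
    for entity in entities:
        # EndOffset is exclusive, matching Python's range semantics.
        for i in range(entity["BeginOffset"], entity["EndOffset"]):
            chars[i] = mask_char
    return "".join(chars)


# Hand-written stand-in for the "Entities" list in a DetectPiiEntities response.
sample_text = "My SSN ends in 1234."
sample_entities = [
    {"Type": "SSN", "Score": 0.99, "BeginOffset": 15, "EndOffset": 19}
]

print(mask_entities(sample_text, sample_entities))  # My SSN ends in ****.
```

With boto3, the entity list would come from `comprehend.detect_pii_entities(Text=text, LanguageCode="en")["Entities"]`.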
The following diagram shows a basic data flow of how accessing an S3 object from an S3 Object Lambda Access Point uses S3 Object Lambda functions to detect and act on data as it’s being retrieved.
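In code, this flow comes down to a Lambda handler that reads the original object from the presigned URL S3 supplies in the event, transforms it, and returns the result with `WriteGetObjectResponse`. The sketch below is a generic illustration, not the pre-built functions' source. The fetcher and client are passed in so the example runs offline, and `RecordingClient` is a hypothetical stand-in for a boto3 S3 client.

```python
def handle_get_object(event, fetch_original, transform, s3_client):
    """Fetch the original object, transform it, and return it to the caller."""
    context = event["getObjectContext"]
    # Presigned URL that S3 supplies for reading the original object.
    original = fetch_original(context["inputS3Url"])
    s3_client.write_get_object_response(
        RequestRoute=context["outputRoute"],
        RequestToken=context["outputToken"],
        Body=transform(original),
    )
    return {"statusCode": 200}


class RecordingClient:
    """Stand-in for a boto3 S3 client; records the call instead of sending it."""

    def __init__(self):
        self.calls = []

    def write_get_object_response(self, **kwargs):
        self.calls.append(kwargs)


# Fabricated event with only the fields this sketch reads.
event = {
    "getObjectContext": {
        "inputS3Url": "https://example.invalid/presigned",
        "outputRoute": "example-route",
        "outputToken": "example-token",
    }
}
client = RecordingClient()
result = handle_get_object(event, lambda url: b"hello", bytes.upper, client)
print(client.calls[0]["Body"])  # b'HELLO'
```

In a deployed function, `fetch_original` would typically be an HTTP GET against the URL and `s3_client` would be `boto3.client("s3")`.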
The solution contains the following steps:
In this post, we present two use cases to demonstrate how to configure and use the pre-built Lambda functions to detect and protect sensitive data.
This solution includes two architectures. You create the resources for each with AWS CloudFormation templates and with manual operations on the console. The first use case focuses on access control for PII. The second use case focuses on selective redaction of data for multiple personas.
The CloudFormation templates and Lambda function code are available in the GitHub repo.
In the first use case, you create an S3 Object Lambda Access Point and attach a pre-built Lambda function for access control. This pre-built Amazon Comprehend-powered function allows for just-in-time interception and denial of access to unknown PII in a scalable and cost-effective way, which reduces the risk of accidental PII exposure to unauthorized users. You can easily configure and deploy this function through the AWS Serverless Application Repository.
After you deploy the function, you validate that access to an S3 object is blocked if PII is detected during the object retrieval process. This scenario simulates a standard business user who may need to access existing data in S3 buckets but isn’t authorized to access PII data. By enabling S3 Object Lambda to discover, intercept, and block access to objects with unexpected PII, you can effectively discover unknown PII in your environments and protect against unintended PII leakage. If unknown PII is discovered in an object, a data governance user or data owner typically should review the object and decide whether to redact the information or remove it before granting access to the business user.
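The blocking decision can be sketched as follows: given the entities Amazon Comprehend detected, either return nothing (allow the object through) or return the error fields that `WriteGetObjectResponse` accepts (`StatusCode`, `ErrorCode`, `ErrorMessage`). The confidence threshold, error code, and message here are assumptions for illustration. The pre-built function's actual values aren't described in this post.

```python
def access_control_response(entities, min_score=0.5):
    """Return WriteGetObjectResponse error fields if PII is found, else None."""
    # Keep only entity types detected with at least the assumed confidence.
    found = sorted({e["Type"] for e in entities if e["Score"] >= min_score})
    if not found:
        return None
    return {
        "StatusCode": 403,
        "ErrorCode": "PiiDetected",  # hypothetical error code
        "ErrorMessage": "Object contains PII: " + ", ".join(found),
    }


# Hand-written detection results: one confident hit, one below the threshold.
detected = [
    {"Type": "SSN", "Score": 0.97},
    {"Type": "NAME", "Score": 0.20},
]
print(access_control_response(detected))
print(access_control_response([]))  # None
```

A handler would pass the returned fields to `write_get_object_response` together with the request route and token instead of a `Body`.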
The following diagram illustrates this architecture.
To implement this architecture, you complete the following high-level steps:
You attach this function to the S3 Object Lambda Access Point for access control.
The template creates AWS Identity and Access Management (IAM) resources, standard S3 Access Points, and an S3 bucket.
You download two files from the GitHub repo and upload them to the bucket:
You do this by associating a supporting standard S3 Access Point and the previously created and configured ComprehendPiiAccessControlS3ObjectLambda function as the S3 Object Lambda Access Point.
This role simulates a business user role that many people use to access data as part of their day-to-day responsibilities. The assumption is that these users have no business need to view sensitive PII and expect that the customer data they’re accessing doesn’t contain any. The GeneralRole IAM role makes a GetObject call to the S3 Object Lambda Access Point to retrieve survey-results.txt. During retrieval, the associated Lambda function is invoked; when the function detects PII, it blocks retrieval and responds that the object can’t be retrieved. After you’re denied access to the file, you use GeneralRole to retrieve the innocuous.txt file using a GetObject call to validate that you can retrieve files without PII.
In the second use case, you create multiple S3 Object Lambda Access Points and enable them with pre-built Lambda functions that are configured to redact specific types of PII depending on the accessors’ business needs. These functions are configured and deployed from the AWS Serverless Application Repository. Next, you validate that an S3 object is being properly redacted for each user based on the S3 Object Lambda Access Point performing the object retrieval.
This redaction example use case has three personas: an administrator, a billing user, and a customer support user. Each persona requires access to the same data with varying levels of redaction to achieve least privilege and still access the information necessary for their role:
Each user only has access to one S3 Object Lambda Access Point (managed through IAM permissions).
This use case demonstrates how S3 Object Lambda enables configurable user-specific redaction for data. The following diagram illustrates our architecture.
We deploy the architecture with the following high-level steps:
You attach this function to each S3 Object Lambda Access Point for the redaction use cases. You deploy three redaction functions, each configured differently to support the specific personas.
The template creates IAM resources, standard S3 Access Points, and an S3 bucket.
This file is an example of a sensitive call transcript containing known PII of various types, including phone numbers, banking info, and SSNs. This data simulates PII data that was recorded as a part of a call center interaction that is known to be sensitive and has been protected accordingly. A variety of personas have a valid business need to access this information, but each persona’s needs differ based on their role, and the business wants to implement the best practice of least privilege. The personas access the file through S3 Object Lambda Access Points to give them the appropriate level of information to do their job.
You associate the supporting standard S3 Access Point for admin access and the previously created admin ComprehendPiiRedactionS3ObjectLambda function as the S3 Object Lambda Access Point. Make sure it’s the function you configured for admin redaction.
You associate the supporting standard S3 Access Point for billing access and the previously created billing ComprehendPiiRedactionS3ObjectLambda function as the S3 Object Lambda Access Point. Make sure it’s the function you configured for billing redaction.
You associate the supporting standard S3 Access Point for customer support access and the previously created customer support ComprehendPiiRedactionS3ObjectLambda function as the S3 Object Lambda Access Point. Make sure it’s the function you configured for customer support redaction.
The download should complete successfully without modification to the file.
The download should complete with redaction of sensitive non-financial PII.
The download should complete with redaction of all financial PII while preserving contact information.
Permissions should be established such that roles can only access their corresponding S3 Object Lambda Access Points.
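One way to express this restriction is an IAM statement per persona that names exactly one S3 Object Lambda Access Point as its resource. The sketch below generates such statements. The access point names and account ID are hypothetical, and a complete policy typically also needs permissions on the supporting access point and the Lambda function, which are omitted here.

```python
def object_lambda_read_statement(region, account_id, olap_name):
    """Allow GetObject through exactly one S3 Object Lambda Access Point."""
    return {
        "Effect": "Allow",
        "Action": "s3-object-lambda:GetObject",
        "Resource": (
            f"arn:aws:s3-object-lambda:{region}:{account_id}"
            f":accesspoint/{olap_name}"
        ),
    }


# Hypothetical access point names, one per persona.
PERSONA_ACCESS_POINTS = {
    "admin": "example-admin-olap",
    "billing": "example-billing-olap",
    "customer_support": "example-support-olap",
}

policies = {
    persona: {
        "Version": "2012-10-17",
        "Statement": [
            object_lambda_read_statement("us-east-1", "111122223333", name)
        ],
    }
    for persona, name in PERSONA_ACCESS_POINTS.items()
}

for persona, policy in policies.items():
    print(persona, "->", policy["Statement"][0]["Resource"])
```

Because each resource is a full ARN, a role holding the billing policy can't read through the admin access point even though both front the same bucket.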
The following diagram indicates the permissions, features, and access control functionality that you use to manage how the S3 Object Lambda solutions work.
Using S3 Object Lambda for PII access control or redaction incurs costs from Amazon S3, Lambda, and Amazon Comprehend.
For more information about pricing, see Amazon S3 pricing, AWS Lambda pricing, and Amazon Comprehend pricing.
Use case #1: Detection and denial of data retrieval for objects with PII
In this section, we walk you through the steps to implement the first use case, which restricts access to objects containing PII.
To deploy the ComprehendPiiAccessControlS3ObjectLambda function, complete the following steps:
This is the function we attach to the S3 Object Lambda Access Point.
You should now be able to review your deployed function on the Lambda functions page.
You use the console to launch a CloudFormation stack (s3olap-access-control-foundation) that sets up the following resources:
Choose Launch Stack to deploy the resources, and make sure you’re in the US East (N. Virginia) Region (us-east-1):
Next, we upload our example PII data.
In this step, we create an S3 Object Lambda Access Point using the ComprehendPiiAccessControlS3ObjectLambda function to test our access control.
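If you prefer to script this step rather than use the console, the S3 Control API exposes `CreateAccessPointForObjectLambda`. The following sketch builds the configuration that ties a supporting access point to a Lambda function. The ARNs and access point name are placeholders, and `FakeS3Control` is a hypothetical stand-in for `boto3.client("s3control")` so the example runs without AWS credentials.

```python
def object_lambda_configuration(supporting_access_point_arn, function_arn):
    """Configuration that routes GetObject requests through a Lambda function."""
    return {
        "SupportingAccessPoint": supporting_access_point_arn,
        "TransformationConfigurations": [
            {
                "Actions": ["GetObject"],
                "ContentTransformation": {
                    "AwsLambda": {"FunctionArn": function_arn}
                },
            }
        ],
    }


def create_object_lambda_access_point(s3control_client, account_id, name, config):
    """Create the S3 Object Lambda Access Point with the given configuration."""
    return s3control_client.create_access_point_for_object_lambda(
        AccountId=account_id, Name=name, Configuration=config
    )


class FakeS3Control:
    """Offline stand-in that echoes the request it receives."""

    def create_access_point_for_object_lambda(self, **kwargs):
        return kwargs


# Placeholder ARNs and name, for illustration only.
config = object_lambda_configuration(
    "arn:aws:s3:us-east-1:111122223333:accesspoint/example-supporting-ap",
    "arn:aws:lambda:us-east-1:111122223333:function:example-function",
)
request = create_object_lambda_access_point(
    FakeS3Control(), "111122223333", "example-olap", config
)
print(request["Configuration"]["TransformationConfigurations"][0]["Actions"])
```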
We can now test the solution by attempting to retrieve S3 objects with the Access Point.
You should receive a message that the download is denied.
The file shouldn’t contain any PII and should download successfully. If you see any unexpected characters in the file, download the file and open it in a text editor (some browsers experience text encoding issues).
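The same two checks can be run programmatically by passing the S3 Object Lambda Access Point ARN as the `Bucket` argument of `GetObject`. The sketch below reports whether a download succeeded or was denied, and decodes the body explicitly as UTF-8 (an assumption about the sample files) to sidestep browser encoding issues. The fake client and its `AccessDenied` error code are hypothetical so the example runs offline. With boto3, the raised exception would be `botocore.exceptions.ClientError`.

```python
import io


def try_get_object(s3_client, access_point_arn, key):
    """Return (True, text) on success or (False, error_code) if denied."""
    try:
        response = s3_client.get_object(Bucket=access_point_arn, Key=key)
    except Exception as exc:
        # botocore's ClientError carries the service error in exc.response.
        error_response = getattr(exc, "response", None)
        if error_response is None:
            raise
        return False, error_response["Error"]["Code"]
    return True, response["Body"].read().decode("utf-8")


class DeniedError(Exception):
    """Mimics the shape of botocore's ClientError for the offline demo."""

    def __init__(self, code):
        super().__init__(code)
        self.response = {"Error": {"Code": code}}


class FakeS3Client:
    """Offline stand-in: denies survey-results.txt, serves everything else."""

    def get_object(self, Bucket, Key):
        if Key == "survey-results.txt":
            raise DeniedError("AccessDenied")  # hypothetical error code
        return {"Body": io.BytesIO("No PII here.".encode("utf-8"))}


client = FakeS3Client()
arn = "arn:aws:s3-object-lambda:us-east-1:111122223333:accesspoint/example-olap"
print(try_get_object(client, arn, "survey-results.txt"))
print(try_get_object(client, arn, "innocuous.txt"))
```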
Use case #2: Redaction of known PII data for multiple personas
In this section, we walk you through the steps to create multiple S3 Object Lambda Access Points and enable them with pre-built Lambda functions that are configured to redact specific types of PII depending on the accessors’ business needs.
Your first step is to deploy the ComprehendPiiRedactionS3ObjectLambda function for use cases with admin access.
You can change this value and redeploy the stack in the future to test other redaction scenarios.
You should now be able to review your deployed function on the Lambda functions page.
In this section, you deploy the ComprehendPiiRedactionS3ObjectLambda function for use cases with billing access.
We attach this function to the S3 Object Lambda Access Points for redaction use cases.
This field configures the Lambda function to redact the specified types of information discovered in the object. For more information about supported entity types, see Detect Personally Identifiable Information (PII).
You should now be able to review your deployed function on the Lambda functions page.
Finally, we deploy the ComprehendPiiRedactionS3ObjectLambda function for use cases with customer support access.
You should now be able to review your deployed function on the Lambda functions page.
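The three deployments differ only in which PII entity types each one redacts. The following sketch models that idea with one profile per persona and a filter on the entity `Type`. The profile contents are assumptions inferred from the expected results described in this post (no redaction for admin, SSN for billing, SSN plus financial data for customer support), not the exact values you enter during deployment. The type names are Amazon Comprehend PII entity types.

```python
# Assumed redaction profiles, one per persona.
REDACTION_PROFILES = {
    "admin": set(),
    "billing": {"SSN"},
    "customer_support": {
        "SSN",
        "CREDIT_DEBIT_NUMBER",
        "CREDIT_DEBIT_CVV",
        "CREDIT_DEBIT_EXPIRY",
        "BANK_ACCOUNT_NUMBER",
        "BANK_ROUTING",
        "PIN",
    },
}


def redact_for(persona, text, entities, mask_char="*"):
    """Mask only the entity types listed in the persona's profile."""
    wanted = REDACTION_PROFILES[persona]
    chars = list(text)
    for entity in entities:
        if entity["Type"] in wanted:
            for i in range(entity["BeginOffset"], entity["EndOffset"]):
                chars[i] = mask_char
    return "".join(chars)


# Hand-written text and offsets standing in for real detection output.
text = "Card 4111 and SSN 1234"
entities = [
    {"Type": "CREDIT_DEBIT_NUMBER", "BeginOffset": 5, "EndOffset": 9},
    {"Type": "SSN", "BeginOffset": 18, "EndOffset": 22},
]

for persona in REDACTION_PROFILES:
    print(persona, "->", redact_for(persona, text, entities))
```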
You now launch the s3olap-redaction-foundation CloudFormation stack to set up the following resources:
Choose Launch Stack to deploy the resources, and make sure you’re in the US East (N. Virginia) Region (us-east-1).
Next, we upload our sample data.
You should now see transcript.txt listed in the call-transcripts-known-pii-[postfix] S3 bucket.
You now have the necessary IAM and Amazon S3 foundation to set up the redaction use cases. Next, we deploy the S3 Object Lambda Access Points.
We create the S3 Object Lambda Access Point for admin access using the ComprehendPiiRedactionS3ObjectLambda function.
Make sure to use this exact name. If any Object Lambda Access Points aren’t named properly, the provided IAM policies don’t allow access because they’re restricted by resource name.
We now create the S3 Object Lambda Access Point for billing access using the ComprehendPiiRedactionS3ObjectLambda function.
Make sure to use this exact name.
Finally, we create the S3 Object Lambda Access Point for customer support access using the ComprehendPiiRedactionS3ObjectLambda function.
Make sure to use this exact name.
To test the solution, we retrieve S3 objects using the S3 Object Lambda Access Points we just created.
No information should be redacted.
Sensitive data like the last four numbers of the SSN should be redacted.
All the financial data and the last four numbers of the SSN should be redacted.
If you see any unexpected characters in the files, download the file and open it in a text editor (some browsers experience text encoding issues).
If you want to implement this solution effectively, give each team or persona access to only one specific role (such as the billing role) and make sure teams only have access to an IAM role that corresponds to the level of data access they should have.
Finally, delete the resources you created in the earlier steps, in order to avoid additional charges.
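Cleanup can also be scripted. The sketch below deletes the S3 Object Lambda Access Points first and then the CloudFormation stacks. That ordering is an assumption, chosen because the access points reference resources the stacks own. The access point name is a placeholder, `RecordingClient` is a hypothetical stand-in for the boto3 `s3control` and `cloudformation` clients, and you may need to empty the S3 buckets before the stacks can be deleted.

```python
def clean_up(s3control_client, cloudformation_client, account_id,
             olap_names, stack_names):
    """Delete Object Lambda Access Points, then the CloudFormation stacks."""
    for name in olap_names:
        s3control_client.delete_access_point_for_object_lambda(
            AccountId=account_id, Name=name
        )
    for stack in stack_names:
        cloudformation_client.delete_stack(StackName=stack)


class RecordingClient:
    """Offline stand-in that records every method call made on it."""

    def __init__(self):
        self.calls = []

    def __getattr__(self, method_name):
        def record(**kwargs):
            self.calls.append((method_name, kwargs))
        return record


s3control = RecordingClient()
cloudformation = RecordingClient()
clean_up(
    s3control,
    cloudformation,
    "111122223333",                      # placeholder account ID
    ["example-olap"],                    # placeholder access point name
    ["s3olap-access-control-foundation", "s3olap-redaction-foundation"],
)
print(s3control.calls)
print(cloudformation.calls)
```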
In this post, we demonstrated how you can use S3 Object Lambda with Amazon Comprehend to detect, redact, and protect PII data. You can build your own Lambda functions and customize them further to meet your specific data protection needs and improve data value by using additional Amazon Comprehend features like entity recognition, key phrase recognition, sentiment analysis, and document classification. Also, consider Amazon Comprehend Medical as a HIPAA-eligible NLP service to analyze and extract data in a context-aware manner.
Use S3 Object Lambda throughout your AWS footprint to give you scalable and intelligent protection of data to help you mitigate data risk and manage access.
If you have any feedback about this post, please provide it in the comments section.
Ram Ramani joined Amazon in 2017 and is part of the core security specialist team with a deep focus on data protection and privacy. Ram’s work includes enabling customer adoption of the AWS Cloud by educating and evangelizing security best practices while the customers continue to innovate on their business. Prior to joining AWS, Ram spent 10 years working on various machine learning and security problems in the Telecom space.
Austin Quam is a Security Solutions Architect specializing in solving data security problems. Austin works with a diverse set of customers across North America, and is obsessed with helping customers achieve their business and security objectives on the AWS Cloud. Austin’s work includes security strategy, thought leadership, and detailed security design for cloud environments and workloads. Prior to joining AWS, Austin worked with several leading consulting firms serving clients across the US in many different cloud and security roles.