Many companies extract data from scanned documents containing tables and forms, such as PDFs. Some examples are audit documents, tax documents, whitepapers, or customer review documents. For customer reviews, you might be extracting text such as product reviews, movie reviews, or feedback. Further understanding of the individual and overall sentiment of the user base from the extracted text can be very useful.
You can extract data through manual data entry, which is slow, expensive, and prone to errors. Alternatively you can use simple optical character recognition (OCR) techniques, which require manual configuration and changes for different inputs. The process of extracting meaningful information from this data is often manual, time-consuming, and may require expert knowledge and skills around data science, machine learning (ML), and natural language processing (NLP) techniques.
To overcome these manual processes, we have AWS AI services such as Amazon Textract and Amazon Comprehend. AWS pre-trained AI services provide ready-made intelligence for your applications and workflows. Because we use the same deep learning technology that powers Amazon.com, you get quality and accuracy from continuously learning APIs. And best of all, AI services on AWS don’t require ML experience.
Amazon Textract uses ML to extract data from documents such as printed text, handwriting, forms, and tables without the need for any manual effort or custom code. Amazon Textract extracts complete text from given documents and provides key information such as page numbers and bounding boxes.
Based on the document layout, you may need to separate paragraphs and headers into logical sections to get more insights from the document at a granular level. This is more useful than simply extracting all of the text. Amazon Textract provides information such as the bounding box location of each detected text and its size and indentation. This information can be very useful for segmenting text responses from Amazon Textract in the form of paragraphs.
In this post, we cover some key paragraph segmentation techniques to postprocess responses from Amazon Textract, and use Amazon Comprehend to generate insights such as sentiment and entity extraction:
After you segment the paragraphs using any of these techniques, you can gain further insights from the segmented text by using Amazon Comprehend for the following use cases:
In this post, we cover the sentiment analysis use case specifically.
We use two different sample movie review PDFs for this use case, which are available on GitHub. The document contains movie names as the headers for individual paragraphs and reviews as the paragraph content. We identify the overall sentiment of each movie as well as the sentiment for each review. However, testing an entire page as a single entity isn’t ideal for getting an overall sentiment. Therefore, we extract the text and identify reviewer names and comments and generate the sentiment of each review.
This solution uses the following AI services, serverless technologies, and managed services to implement a scalable and cost-effective architecture:
The following diagram illustrates the architecture of the solution.
Our workflow includes the following steps:
You deploy an AWS CloudFormation template to provision the necessary AWS Identity and Access Management (IAM) roles, services, and components of the solution, including Amazon S3, Lambda, Amazon Textract, Amazon Comprehend.
This template uses AWS Serverless Application Model (AWS SAM), which simplifies how to define functions and APIs for serverless applications, and also has features for these services, like environment variables.
The following screenshot of the stack details page shows the status of the stack as CREATE_IN_PROGRESS. It can take up to 5 minutes for the status to change to CREATE_COMPLETE. When it’s complete, you can view the outputs on the Outputs tab.
When the setup is complete, the next step is to walk through the process of uploading a file and validating the results after the file is processed through the pipeline.
To process a file and get the results, upload your documents to your new S3 bucket, then choose the S3 bucket URL corresponding to the s3BucketForTextractDemo key on the stack Outputs tab.
You can download the sample document used in this post from the GitHub repo and upload it to the s3BucketForTextractDemo S3 URL. For more information about uploading files, see How do I upload files and folders to an S3 bucket?
After the document is uploaded, the textraction-inovcation.py Lambda function is invoked. This function calls the Amazon Textract StartDocumentTextDetection API, which sets up an asynchronous job to detect text from the PDF you uploaded. The code uses the S3 object location, IAM role, and SNS topic created by the CloudFormation stack. The role ARN and SNS topic ARN were set as environment variables to the function by AWS CloudFormation. The code can be found in textract-post-processing-CFN.yml.
When the document is submitted to Amazon Textract for text detection, we get pages, lines, words, or tables as a response. Amazon Textract also provides bounding box data, which is derived based on the position of the text in the document. The bounding box data provides information about where the text position from the left and top, the size of the characters, and the width of the text.
We can use the bounding box data to identify lots of segments of the document, for example, identifying paragraphs from a whitepaper, movie reviews, auditing documents, or items on a menu. After these segments are identified, you can use Amazon Comprehend to find sentiment or key phrases to get insights from the document. For example, we can identify the technologies or algorithms used in a whitepaper or understand the sentiment of each reviewer for a movie.
In this section, we demonstrate the following techniques to identify the paragraphs:
The first technique we discuss is identifying headers and paragraphs based on the font size. If the headers in your document are bigger than the text, you can use font size for the extraction. For example, see the following sample document, which you can download from GitHub.
First, we need to extract all the lines from the Amazon Textract response and the corresponding bounding box data to understand font size. Because the response has a lot of additional information, we’re only extracting lines and bounding box data. We separate the text with different font sizes and order them based on size to determine headers and paragraphs. This process of extracting headers is done as part of the get_headers_to_child_mapping method in the lambda-helpery.py function.
The step-by-step flow is as follows:
As a second technique, we explain how to derive headers and paragraphs based on the left indentation of the text. In the following document, headers are aligned at the left of the page, and all the paragraph text is a bit further in the document. You can download this sample PDF on GitHub.
In this document, the main difference between the header and paragraph is left indentation. Similar to the process described earlier, first we need to get line numbers and indentation information. After we this information, all we have to do is separate the text based on the indentation and extract the text between two headers by using line numbers.
The step-by-step flow is as follows:
Similar to the preceding approach, we can use line spaces to get the paragraphs only from a page. We calculate line spacing using the top indent. The difference in top indentation of the current line and the next line or previous line provides us with the line spacing. We can separate segments if the line spacing is bigger. The detailed code can be found on GitHub. You can also download the sample document from GitHub.
We also provide a technique to extract segments or paragraphs of the document based on full stops. Consider preceding document as an example. After the Amazon Textract response is parsed and the lines are separated, we can iterate through each line and whenever we find a line that ends with a full stop, we can consider it as end of paragraph and any line thereafter is part of next paragraph. This is another helpful technique to identify various segments of the document. The code to perform this can be found on GitHub
As we described in the preceding processes, we can collect the text using various techniques. After the list of paragraphs are identified, we can use Amazon Comprehend to get the sentiment of the paragraph and key phrases of the text. Amazon Comprehend can give intelligent insights based on the text, which is very valuable to businesses because understanding the sentiment at each segment is very useful.
After you process the file, you can query the results for each paragraph.
You should see two tables:
You can see a mix of review sentiments.
The DynamoDB table data looks like the following screenshot, file path, header, paragraph data, and sentiment for paragraph.
This post demonstrated how to extract and process data from a PDF and visualize it to review sentiments. We separated the headers and paragraphs via custom coding and ran sentiment analysis for each section separately.
Processing scanned image documents helps you uncover large amounts of data, which can provide meaningful insights. With managed ML services like Amazon Textract and Amazon Comprehend, you can gain insights into your previously undiscovered data. For example, you can build a custom application to get text from a scanned legal document, purchase receipts, and purchase orders.
If this post helps you or inspires you to solve a problem, we would love to hear about it! The code for this solution is available on the GitHub repo for you to use and extend. Contributions are always welcome!
Srinivasarao Daruna is a Data Lab Architect at Amazon Web Services and comes from strong big data and analytics background. In his role, he helps customers with architecture and solutions to their business problem. He enjoys learning new things and solving complex problems for customers.
Mona Mona is a Senior AI/ML Specialist Solutions Architect based out of Arlington, VA. She works with public sector customers and helps them adopt machine learning on a large scale. She is passionate about NLP and ML explainability areas in AI/ML and has published multiple blog posts on these topics in the AWS AI/ML Blogs.
Divyesh Sah is as a Sr. Enterprise Solutions Architect in AWS focusing on financial services customers, helping them with cloud transformation initiatives in the areas of migrations, application modernization, and cloud native solutions. He has over 18 years of technical experience specializing in AI/ML, databases, big data, containers, and BI and analytics. Prior to AWS, he has experience in areas of sales, program management, and professional services.
Sandeep Kariro is an Enterprise Solutions Architect in the Telecom space. Having worked in cloud technologies for over 7 years, Sandeep provides strategic and tactical guidance to enterprise customers around the world. Sandeep also has in-depth experience in data-centric design solutions optimal for cloud deployments while keeping cost, security, compliance, and operations as top design principles. He loves traveling around the world and has traveled to several countries around the globe in the last decade.