Amazon Textract is a fully managed machine learning (ML) service that automatically extracts printed text, handwriting, and other data from scanned documents. It goes beyond simple optical character recognition (OCR) to identify, understand, and extract data from forms and tables. Amazon Textract can detect text in a variety of documents, including financial reports, medical records, and tax forms.
In many use cases, you need to extract and analyze documents that contain various visuals, such as logos, photos, and charts. These visuals contain embedded text that clutters the Amazon Textract output or isn’t required for your downstream process. For example, many real estate evaluation forms and documents contain pictures of houses or charts of historical prices. This information isn’t needed in downstream processes, and you have to remove it before using Amazon Textract to analyze the document. In this post, we illustrate two effective methods to remove these visuals as part of your preprocessing.
For this post, we use a PDF that contains a logo and a chart as an example. We use two different processes to convert and detect these visuals, then redact them.
In the first method, we use the Canny edge detector from the OpenCV library to detect the edges of the visuals. For the second method, we write a custom pixel concentration analyzer to detect the location of these visuals.
You can also extract these visuals for further processing, and easily modify the code to fit your use case.
Searchable PDFs are native PDF files usually generated by other applications, such as text processors, virtual PDF printers, and native editors. These types of PDFs retain metadata, text, and image information inside the document. You can easily use libraries like PyMuPDF/fitz to navigate the PDF structure and identify images and text. In this post, we focus on non-searchable or image-based documents.
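For example, the following snippet is a minimal sketch (assuming a recent PyMuPDF version and a placeholder file name) of how you can list the text and embedded images on each page of a searchable PDF:

```python
import fitz  # PyMuPDF

# "sample.pdf" is a placeholder; point this at your own searchable PDF
doc = fitz.open("sample.pdf")

for page_number, page in enumerate(doc, start=1):
    text = page.get_text("text")            # plain text stored in the PDF
    images = page.get_images(full=True)     # embedded images, one tuple per image
    print(f"Page {page_number}: {len(text)} characters, {len(images)} image(s)")
    for img in images:
        xref = img[0]                       # cross-reference number of the image object
        info = doc.extract_image(xref)
        print(f"  image xref {xref}: {info['width']}x{info['height']} ({info['ext']})")
```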
In this approach, we convert the PDF into PNG format, grayscale the document with the OpenCV-Python library, and use the Canny edge detector to detect the visual locations. You can follow the detailed steps in the following notebook.
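The following is a minimal sketch of this approach, not the exact code from the notebook. It assumes OpenCV 4.x and that the PDF page has already been rendered to a PNG file (for example, with the pdf2image library); the file name, Canny thresholds, and area filter are illustrative values you would tune for your documents:

```python
import cv2
import numpy as np

# Load the rendered page; "page.png" is a placeholder file name
image = cv2.imread("page.png")
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
blurred = cv2.GaussianBlur(gray, (5, 5), 0)      # smooth out scan noise
edges = cv2.Canny(blurred, 50, 150)              # low/high thresholds are tunable

# Dilate the edges so the outline of a visual merges into one solid region
dilated = cv2.dilate(edges, np.ones((5, 5), np.uint8), iterations=2)
contours, _ = cv2.findContours(dilated, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)

redacted = image.copy()
for contour in contours:
    x, y, w, h = cv2.boundingRect(contour)
    # Treat only large regions as visuals; small boxes are usually text lines
    if w * h > 50_000:
        cv2.rectangle(redacted, (x, y), (x + w, y + h), (255, 255, 255), -1)

cv2.imwrite("page_redacted.png", redacted)
```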
You can further tune and optimize a few parameters to increase detection accuracy depending on your use case:
This approach has the following advantages:
However, the approach has the following drawbacks:
We implement our second approach by analyzing the image pixels. Normal text paragraphs have a consistent pixel concentration signature across their lines. We can measure and analyze the pixel densities to identify areas whose densities differ from the rest of the document. You can follow the detailed steps in the following notebook.
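The following is a minimal sketch of the pixel concentration idea, not the exact analyzer from the notebook: it binarizes the page, measures the share of dark pixels per row, and whites out horizontal bands whose density is much higher than typical text lines. The file name, density threshold, and minimum band height are illustrative values:

```python
import cv2
import numpy as np

# Load the rendered page as grayscale; "page.png" is a placeholder file name
image = cv2.imread("page.png", cv2.IMREAD_GRAYSCALE)

# Binarize with Otsu so "ink" pixels become 255 and background becomes 0
_, binary = cv2.threshold(image, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

# Fraction of ink pixels in each row of the page
row_density = binary.mean(axis=1) / 255.0

# Text lines rarely exceed a modest ink density; dense bands usually belong to visuals
density_threshold = 0.4          # illustrative; tune per document type
min_band_height = 20             # ignore very thin dense bands
dense_rows = row_density > density_threshold

redacted = image.copy()
start = None
for y, is_dense in enumerate(np.append(dense_rows, False)):
    if is_dense and start is None:
        start = y                              # a band of dense rows begins
    elif not is_dense and start is not None:
        if y - start >= min_band_height:
            redacted[start:y, :] = 255         # white out the suspected visual
        start = None

cv2.imwrite("page_redacted.png", redacted)
```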
You can tune the following parameters to optimize the accuracy of identifying non-text areas:
This approach is highly customizable. However, it has the following drawbacks:
In this post, we showed how you can implement two approaches to redact visuals from different documents. Both approaches are easy to implement. You can get high-quality results and customize either method according to your use case.
To learn more about different techniques in Amazon Textract, visit the public AWS Samples GitHub repo.
Yuan Jiang is a Sr. Solutions Architect with a focus on machine learning. He’s a member of the Amazon Computer Vision Hero program and the Amazon Machine Learning Technical Field Community.
Victor Rojo is a Sr. Partner Solutions Architect with a focus on Conversational AI. He’s also a member of the Amazon Computer Vision Hero program.
Luis Pineda is a Sr. Partner Management Solutions Architect. He’s also a member of the Amazon Computer Vision Hero program.
Miguel Romero Calvo is a Data Scientist at the AWS Machine Learning Solutions Lab.