Processing large documents like PDFs and static images is a cornerstone of today’s highly regulated industries. From healthcare information like doctor-patient visits and bills of health, to financial documents like loan applications, tax filings, research reports, and regulatory filings, these documents are integral to how these industries conduct business. The mechanisms by which these documents are processed and analyzed, however, are often manual, time-consuming, error-prone, expensive, and not easily scalable.
Fortunately, recent innovations in this space are helping companies improve these methods. Machine learning (ML) techniques such as optical character recognition (OCR) and natural language processing (NLP) enable firms to digitize and extract text from millions of documents and understand the content, including contextual nuances of the language within them. Furthermore, you can then transform the extracted text by merging it with supplemental data to produce additional business insights.
This step-by-step method is called a document processing pipeline. The pipeline includes various components to extract, transform, enrich, and conform the data. New data domains are often introduced and used for numerous downstream business purposes. For example, in financial services, you could be identifying connected financial events, calculating environmental risk scores, and developing risk models. Because these documents help inform or even dictate such important data-driven decisions, it’s imperative for regulated industry companies to establish and maintain a robust data governance framework as part of these document processing pipelines. Without governance, pipelines become a dumping ground where documents are inconsistently stored, duplicated, and processed, and the business is unable to explain to potential auditors where the data that fed their decisions came from, or what that data was used for.
A data governance framework is made up of people, processes, and technology. It enables business users to work collaboratively with technologists to drive clean, certified, and trusted data. It consists of several components including data quality, data catalog, data ownership, data lineage, operation, and compliance. In this post, we discuss data catalog, data ownership, and data lineage, and how they tie together with building document processing pipelines for regulated industries.
For more information about design patterns on data quality, see How to Architect Data Quality on the AWS Cloud.
Data lineage is the part of data governance that refers to the practice of providing GPS services for data. At any point in time, it can explain where the data originated, what happened to it, what its latest status is, and where it’s headed from this point on.
It provides visibility while simplifying the ability to trace financial numbers back to their origin, and provides transparency on potential errors and their root cause analyses.
Furthermore, you can use data lineage captured over time as analytical inputs to drive accuracy scores.
It’s imperative for a document processing pipeline to have a well-defined data lineage framework. The framework should include an end-to-end lifecycle, responsibility model, and the technology to enable data transformation transparency. Without lineage, the data can’t be trusted.
To illustrate this end-to-end data lineage concept, we walk you through creating an NLP-powered document search engine with built-in lineage at each step. Every object and piece of data processed by this ML pipeline can be traced back to the original document.
Each processing component can be replaced by your choice of tooling or bespoke ML model. Furthermore, you can customize the solution to include other use cases, such as central document data lakes or supplemental tabular data feed to an online transaction processing (OLTP) application.
The solution follows an event-driven architecture in which the completion of each stage within the pipeline triggers the next step, while providing self-service lineage for traceability and monitoring. In addition, hooks have been included to provide capabilities to extend the pipeline to additional workloads.
This design uses the following AWS services (you can also follow along in the GitHub repo):
The overall design is grouped into five segments:
All components interact via asynchronous events to allow for scalability. The following diagram illustrates the conceptual design.
The following diagram illustrates the physical design.
This is an encapsulated module to register, track, and trace incoming documents. It’s designed to be used across many different document processing pipelines. In your organization, one team might decide to use the OCR and NLP modules designed in this post. Another team might decide to use a different pipeline. However, governance practices of each pipeline should be consistent, and documents should be registered one time with full transparency on movement and downstream usage. Each document can be processed several times. You can extend the catalog and lineage services designed in this post to keep track of many pipelines, from multiple sources of data.
At the core, the metadata services module contains four reference tables, an SNS topic, three SQS queues, and three self-contained Lambda functions. Tables are created in DynamoDB, and schemas can be easily extended to include additional data attributes deemed important for your pipeline.
In addition, you can extend this design to include additional data governance components such as data quality.
The tables are defined as follows.
Table Name | Purpose | DynamoDB Stream Enabled? | Data Governance component | Sample Use |
Document Registry | Keeps track of all incoming documents. Each document is assigned a unique document ID and registered one time in this table. | Yes | Catalog | Provides the ability to quickly look up and understand the document source and context metadata. |
Document Ownership | Covers responsibility model of the data governance in which each document acquired to the pipeline has a defined owner. | No | Ownership | Provides notification services and can be extended to manage data quality controls. |
Document Lineage | Keeps track of all data movements. It provides detailed lineage info that includes the source S3 bucket name, destination S3 bucket name, source file name, target file name, ARN ID of the AWS service that processed the document, and timestamp. | No |
A simple PartiQL query against this table based on the document ID provides a list of all steps the original document has taken. Query output can include the following columns: · Document ID · Original document name · Timestamp · Source S3 bucket · Source file name · Destination S3 bucket · Destination file name |
|
Pipeline Operations | Keeps a record of all pipeline actions taken on a document ID, including the current pipeline stage and its status, and keeps a timeline of the stages in chronological order. | Yes | An operational query on a document ID to determine where in the pipeline the current document processing is. |
DynamoDB Streams allows downstream application code to react to updates to objects in DynamoDB. It provides a mechanism to keep an event-based microservices architecture in place by triggering subsequent steps of a workflow whenever new documents are written to our Document Registry table, and subsequently when new document references are created in the Pipeline Operations table.
In addition, DynamoDB Streams provides developer teams with an efficient way of connecting your application logic to various updates in the tables (for example, to keep track of a particular document ID based on owner tags, or alert when certain unexpected problems arise while processing some documents).
The Lambda functions provide microservices API call capabilities for the document pipeline to self-register its movements and actions undertaken by the pipeline code:
The SNS topic is used as a sink for incoming messages from all pipeline movements and document registrations. It disseminates the messages to each downstream subscribed SQS queue according to what type of message was received. In this model, the number of consumers of the messages coming through the SNS topic could be greatly expanded as needed, and all messages are guaranteed to stay in order, because both the SNS topics and SQS queues are created in a First-In-First-Out (FIFO) configuration to prevent duplicates and maintain single-threaded processing in the pipeline.
Using Amazon SNS in the design provides scalability by creating a pub/sub architecture. A pub/sub architecture design is a pattern that provides a framework to decouple the services that produce an event from services that process the event. Many subscribers can subscribe to the same event and trigger different pipelines. For example, this design can easily be extended to process incoming XML file formats by subscribing an additional XML process pipeline for the same event.
The following table provides schema information. The document ID is identical and unique for each document and is part of the composite primary key used to identify movement of each document within the pipeline.
The following diagram shows the architecture of our metadata services.
The ingestion workload is triggered when a new document is uploaded to the NLP/Raw S3 bucket (or the bucket where raw documents are placed from users or front-end applications).
The ingestion module follows a four-step process (as shown in the following diagram):
All steps are registered in metadata services. The red dotted lines in the following diagram represent the metadata asynchronous API calls.
This module detects the incoming file format and uses Amazon Textract in this implementation to convert the incoming documents into text. Amazon Textract can process image files synchronously, and PDF and other documents asynchronously, to allow time for the service to complete its analysis.
The OCR module consists of the following process, as illustrated in the architecture diagram:
All the progress is registered in metadata services. The red dotted lines in the diagram represent the metadata asynchronous API calls.
This module detects key phrases and entities within the document by using the text output from the OCR module. A key phrase is a string containing a noun phrase that describes a particular thing. It generally consists of a noun and the modifiers that distinguish it. For example, “day” is a noun; “a beautiful day” is a noun phrase that includes an article (“a”) and an adjective (“beautiful”).
Once key phrases are understood, it’s quite likely that indexing them in an analytical tool would let you find this article quickly and accurately. For example, if you want to analyze corporate social responsibility (CSR) reports, you can find attributes such as “reducing carbon footprints,” “improving labor policies,” “participating in fair-trade,” and “charitable giving” by indexing results of this module.
We use Amazon Comprehend to perform this function in this pipeline. However, as we explained earlier, you can easily swap the tooling used for this design with your preferred tool. For example, you can replace Amazon Comprehend with an Amazon SageMaker custom model as an alternative to extract key phrases and entities in a more domain-focused way. SageMaker is an ML service that you can use to build, train, and deploy ML models for virtually any use case.
Amazon Comprehend is called on a synchronous basis to extract key phrases in the following steps (as illustrated in the following diagram):
All steps are being registered in metadata services. The red dotted lines in the diagram represent the metadata asynchronous API calls.
This module is responsible for the consumption and analytics segment of the pipeline. The steps are illustrated in the following diagram:
The analytics module has many potential implementations. For example, you can use a relational datastore like Amazon Relational Database (Amazon RDS) or Amazon Aurora to analyze extracted tabular data using SQL.
In this post, we architected an end-to-end document processing pipeline using AWS managed ML services. In addition, we introduced metadata services to help organizations create a centralized document repository to store documents one time but process multiple times. A data governance framework as illustrated in this design provides you with necessary guardrails to ensure documents are governed in a standard fashion across the organization, while providing lines of business with autonomy to decide your NLP and OCR models and choice of tooling.
The architecture discussed in this post has been coded and is available for deployment in the GitHub repo. You can download the code and create your pipeline within a few days.
David Kheyman is a Solutions Architect at Amazon Web Services based out of New York City, where he designs and implements repeatable AWS architecture patterns and solutions for large organizations.
Mojgan Ahmadi is a Principal Solutions Architect with Amazon Web Services based in New York, where she guides global financial services customers to build highly secure, scalable, reliable, and cost-efficient applications on the cloud. She brings over 20 years of technology experience on Software Development and Architecture, Data Governance and Engineering, and IT Management.
Anirudh Menon is a Solutions Architect with Amazon Web Services based in New York, where he helps financial services customers drive innovation with AWS solutions and industry-specific patterns.