Receipts and invoices are documents that are critical to small and medium businesses (SMBs), startups, and enterprises for managing their accounts payable processes. These types of documents are difficult to process at scale because they follow no set design rules, yet any individual customer encounters thousands of distinct types of these documents.
In this post, we show how you can use Amazon Textract's new AnalyzeExpense API to extract line-item details in addition to key-value pairs from invoices and receipts, which is a frequent request we hear from customers. Amazon Textract uses machine learning (ML) to understand the context of invoices and receipts, and automatically extracts specific information like vendor name, price, and payment terms. We walk you through processing an invoice or receipt using Amazon Textract and extracting a set of fields and line-item details. While AWS takes care of building, training, and deploying advanced ML models in a highly available and scalable environment, you take advantage of these models with simple-to-use API actions.
We cover the following topics in this post:
SMBs, startups, and enterprises process paper-based invoices and receipts as part of their accounts payable process to reconcile their goods received and for auditing purposes. Employees who submit expense reports also submit scans or images of the associated receipts. Companies try to standardize electronic invoicing, but some vendors only offer paper invoices, and some countries legally require paper invoices.
The peculiarities of invoices and receipts also make this a difficult problem to solve at scale: invoices and receipts all look different, because each vendor designs its own documents independently. The labels are imperfect and inconsistent. The vendor name is often not explicitly labeled and has to be interpreted based on context. Other important information, such as customer number, customer ID, or account ID, is labeled differently from document to document.
To solve this problem, you can use Amazon Textract to process invoices and receipts at scale. Amazon Textract works with any style of invoice or receipt, no templates or configuration required, and extracts relevant data that can be tricky to extract such as contact information, items purchased, and vendor name from those documents. That includes the line-item details, not just the headline amounts.
Amazon Textract also identifies vendor names that are critical for your workflows but may not be explicitly labeled. For example, Amazon Textract can find the vendor name on a receipt even if it’s only indicated within a logo at the top of the page without an explicit key-value pair combination.
Amazon Textract also makes it easy to consolidate input from diverse receipts and invoices. Different documents use different words for the same concept. For example, Amazon Textract maps relationships between field names in different documents such as customer no., customer number, and account ID, and outputs a standard taxonomy (in this case, INVOICE_RECEIPT_ID), thereby representing data consistently across document types.
Before we get started with the API and code samples, let’s review the Amazon Textract console. The following images show examples of both an invoice and a receipt document on the Analyze Expense output tab of the Amazon Textract console.
Amazon Textract automatically detects the vendor name, invoice number, ship to address, and more from the sample invoice and displays them on the Summary Fields tab. It also represents the standard taxonomy of fields in brackets next to the actual value on the document. For example, it identifies “INVOICE #” as the standard field INVOICE_RECEIPT_ID.
Additionally, Amazon Textract detects the items purchased and displays them on the Line Item Fields tab.
The following is a similar example of a receipt. Amazon Textract detects “Whole Foods Market” as VENDOR_NAME even though the receipt doesn’t explicitly mention it as the vendor name.
In this section, we explain the AnalyzeExpense API response structure using sample images. The following is a sample receipt and the corresponding AnalyzeExpense response JSON.
AnalyzeExpense JSON response for SummaryFields:

    {
      "DocumentMetadata": { "Pages": 1 },
      "ExpenseDocuments": [
        {
          "ExpenseIndex": 1,
          "SummaryFields": [
            {
              "Type": { "Text": "VENDOR_NAME", "Confidence": 97.0633544921875 },
              "ValueDetection": {
                "Text": "New Store X1",
                "Geometry": { … },
                "Confidence": 96.65239715576172
              },
              "PageNumber": 1
            },
            {
              "Type": { "Text": "OTHER", "Confidence": 81.0 },
              "LabelDetection": {
                "Text": "Order type:",
                "Geometry": { … },
                "Confidence": 80.8591079711914
              },
              "ValueDetection": {
                "Text": "Quick Sale",
                "Geometry": { … },
                "Confidence": 80.82302856445312
              },
              "PageNumber": 1
            }
            …
AnalyzeExpense JSON response for LineItemGroups:
    "LineItemGroups": [
      {
        "LineItemGroupIndex": 1,
        "LineItems": [
          {
            "LineItemExpenseFields": [
              {
                "Type": { "Text": "ITEM", "Confidence": 99.95216369628906 },
                "ValueDetection": {
                  "Text": "Red Banana is in\nbusiness ",
                  "Geometry": { … },
                  "Confidence": 99.81525421142578
                },
                "PageNumber": 1
              },
              {
                "Type": { "Text": "PRICE", "Confidence": 99.95216369628906 },
                "ValueDetection": {
                  "Text": "$66.96",
                  "Geometry": { … },
The AnalyzeExpense JSON output contains ExpenseDocuments, and each ExpenseDocument contains SummaryFields and LineItemGroups. The ExpenseIndex field uniquely identifies the expense, and associates the appropriate SummaryFields or LineItemGroups detected to that expense.
The most granular level of data in the AnalyzeExpense response consists of Type, ValueDetection, and LabelDetection (optional). Let’s call this set of data an AnalyzeExpense element. The preceding example illustrates an AnalyzeExpense element that contains Type, ValueDetection, and LabelDetection.
In the preceding example, Amazon Textract detected 16 SummaryField key-value pairs, including VENDOR_NAME: New Store X1 and Order type: Quick Sale. AnalyzeExpense detects this key-value pair and displays it as shown in the preceding example. The individual entities are as follows:
The AnalyzeExpense API also detects ITEM, QUANTITY, and PRICE within line items as normalized fields. If other text is in a line item on the receipt image, such as SKU or a detailed description, it’s included in the JSON as EXPENSE_ROW, as shown in the following example:
    {
      "Type": { "Text": "EXPENSE_ROW", "Confidence": 99.95216369628906 },
      "ValueDetection": {
        "Text": "Red Banana is in x3 $66.96\nbusiness ",
        "Geometry": { … },
        "Confidence": 98.11214447021484
      }
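To make the structure described above concrete, the following is a minimal sketch of how you might walk an AnalyzeExpense response with plain Python dictionaries. The function name summarize_expense and the trimmed sample_response are our own illustrations, not part of the API; the sample reuses values from the excerpts above with Geometry omitted.

```python
from typing import Any, Dict, List


def summarize_expense(response: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Flatten an AnalyzeExpense response into one dict per expense document."""
    results = []
    for doc in response.get("ExpenseDocuments", []):
        summary = []
        for field in doc.get("SummaryFields", []):
            summary.append({
                "type": field.get("Type", {}).get("Text"),
                # LabelDetection is optional, so this may be None.
                "label": field.get("LabelDetection", {}).get("Text"),
                "value": field.get("ValueDetection", {}).get("Text"),
            })
        line_items = []
        for group in doc.get("LineItemGroups", []):
            for item in group.get("LineItems", []):
                # Map each field type (ITEM, PRICE, EXPENSE_ROW, ...) to its text.
                line_items.append({
                    f.get("Type", {}).get("Text"): f.get("ValueDetection", {}).get("Text")
                    for f in item.get("LineItemExpenseFields", [])
                })
        results.append({
            "expense_index": doc.get("ExpenseIndex"),
            "summary": summary,
            "line_items": line_items,
        })
    return results


# Hypothetical, trimmed sample that mirrors the excerpts above.
sample_response = {
    "DocumentMetadata": {"Pages": 1},
    "ExpenseDocuments": [{
        "ExpenseIndex": 1,
        "SummaryFields": [
            {"Type": {"Text": "VENDOR_NAME"},
             "ValueDetection": {"Text": "New Store X1"}},
            {"Type": {"Text": "OTHER"},
             "LabelDetection": {"Text": "Order type:"},
             "ValueDetection": {"Text": "Quick Sale"}},
        ],
        "LineItemGroups": [{
            "LineItemGroupIndex": 1,
            "LineItems": [{
                "LineItemExpenseFields": [
                    {"Type": {"Text": "ITEM"},
                     "ValueDetection": {"Text": "Red Banana is in\nbusiness "}},
                    {"Type": {"Text": "PRICE"},
                     "ValueDetection": {"Text": "$66.96"}},
                ],
            }],
        }],
    }],
}

parsed = summarize_expense(sample_response)
for field in parsed[0]["summary"]:
    print(field)
print(parsed[0]["line_items"])
```

Using .get with defaults keeps the walk tolerant of elements that have no LabelDetection, such as the VENDOR_NAME field in the sample.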
In addition to the detected content, the AnalyzeExpense API provides information like confidence scores and bounding boxes for detected elements. This gives you control over how you consume extracted content and integrate it into your applications. For example, you can flag any elements that have a confidence score under a certain threshold for manual review.
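As a rough illustration of that kind of review step, the following sketch collects every summary and line-item element whose confidence falls below a threshold. The helper names, the 90.0 threshold, and the choice to take the lower of the Type and ValueDetection confidences are our assumptions, not behavior defined by the API; the sample values come from the SummaryFields excerpt earlier in this post.

```python
from typing import Any, Dict, Iterator, List


def iter_fields(response: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield every summary field and line-item field in the response."""
    for doc in response.get("ExpenseDocuments", []):
        yield from doc.get("SummaryFields", [])
        for group in doc.get("LineItemGroups", []):
            for item in group.get("LineItems", []):
                yield from item.get("LineItemExpenseFields", [])


def fields_needing_review(response: Dict[str, Any],
                          threshold: float = 90.0) -> List[Dict[str, Any]]:
    """Return the elements whose confidence is below the threshold."""
    flagged = []
    for field in iter_fields(response):
        type_info = field.get("Type", {})
        value_info = field.get("ValueDetection", {})
        # Assumption: score an element by the lower of its two confidences.
        # A missing score counts as 0.0, so the element is sent for review.
        confidence = min(type_info.get("Confidence", 0.0),
                         value_info.get("Confidence", 0.0))
        if confidence < threshold:
            flagged.append({
                "type": type_info.get("Text"),
                "value": value_info.get("Text"),
                "confidence": confidence,
            })
    return flagged


# Trimmed sample using the confidence scores from the SummaryFields excerpt.
sample_response = {
    "ExpenseDocuments": [{
        "ExpenseIndex": 1,
        "SummaryFields": [
            {"Type": {"Text": "VENDOR_NAME", "Confidence": 97.0633544921875},
             "ValueDetection": {"Text": "New Store X1",
                                "Confidence": 96.65239715576172}},
            {"Type": {"Text": "OTHER", "Confidence": 81.0},
             "ValueDetection": {"Text": "Quick Sale",
                                "Confidence": 80.82302856445312}},
        ],
        "LineItemGroups": [],
    }],
}

flagged = fields_needing_review(sample_response, threshold=90.0)
for element in flagged:
    print(element)
```

The right threshold depends on your workflow; a stricter value sends more elements to manual review.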
The input document is either bytes or an Amazon Simple Storage Service (Amazon S3) object. You pass image bytes to an Amazon Textract API operation by using the Bytes property. For example, you use the Bytes property to pass a document loaded from a local file system.
Image bytes passed by using the Bytes property must be base64 encoded. Your code might not need to encode document file bytes if you’re using an AWS SDK to call Amazon Textract API operations. Alternatively, you can pass images stored in an S3 bucket to an Amazon Textract API operation by using the S3Object property. Documents stored in an S3 bucket don’t need to be base64 encoded.
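The two input options can be sketched as a small helper that builds the Document argument. The function build_document and the placeholder bucket and file names are hypothetical. The sketch assumes you call the API through an AWS SDK such as Boto3, in which case you pass the raw bytes and the SDK takes care of the encoding.

```python
from typing import Any, Dict, Optional


def build_document(image_bytes: Optional[bytes] = None,
                   bucket: Optional[str] = None,
                   name: Optional[str] = None) -> Dict[str, Any]:
    """Build the Document argument from either raw bytes or an S3 location."""
    if image_bytes is not None and (bucket or name):
        raise ValueError("Provide either image bytes or an S3 location, not both.")
    if image_bytes is not None:
        # Raw bytes, for example read from a local file.
        return {"Bytes": image_bytes}
    if bucket and name:
        # Object already stored in Amazon S3; no encoding needed.
        return {"S3Object": {"Bucket": bucket, "Name": name}}
    raise ValueError("Provide image bytes, or both an S3 bucket and object name.")


# Example with an S3 location (placeholder values).
document = build_document(bucket="YOUR S3 BUCKET NAME",
                          name="YOUR S3 EXPENSE IMAGE FILENAME")
print(document)

# With a Boto3 client, the result would be passed as:
#   response = client.analyze_expense(Document=document)
```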
You can call the AnalyzeExpense API using the AWS Command Line Interface (AWS CLI), as shown in the following code. Make sure you have the latest AWS CLI version installed.
    aws textract analyze-expense --document '{"S3Object": {"Bucket": "
Apart from working with the JSON output as-is, you can use the Amazon Textract response parser library to parse the JSON returned by the AnalyzeExpense API. The library parses the JSON and provides programming language-specific constructs for working with different parts of the document, which makes it easier to deserialize the response and use it in your application. Similarly, the Amazon Textract PrettyPrinter library lets you print the parsed response in different formats. For more details and examples of parsing Amazon Textract responses, refer to the GitHub repo. You can parse SummaryFields and LineItemGroups for every ExpenseDocument in the AnalyzeExpense response JSON using the AnalyzeExpense parser. The following code installs the required libraries and calls the AnalyzeExpense API on a local image:
Install the latest Boto3 Python SDK:

    python3 -m pip install boto3 --upgrade

Install the latest version of the Amazon Textract response parser:

    python3 -m pip install amazon-textract-response-parser --upgrade

Then call the AnalyzeExpense API with the image bytes:

    import boto3

    client = boto3.client(
        service_name='textract',
        region_name='us-east-1',
        endpoint_url='https://textract.us-east-1.amazonaws.com',
    )

    # Path to a local receipt or invoice image (placeholder; set this to your file)
    documentName = 'YOUR EXPENSE IMAGE FILENAME'

    with open(documentName, 'rb') as file:
        img_test = file.read()
        bytes_test = bytearray(img_test)
        print('Image loaded', documentName)

    # Process using image bytes
    response = client.analyze_expense(Document={'Bytes': bytes_test})
You can further use the serializer and deserializer to validate the response JSON and convert it into the Python object representation, and vice versa.
The following code deserializes the response JSON:
    import json

    from trp.trp2_expense import TAnalyzeExpenseDocument, TAnalyzeExpenseDocumentSchema

    # j holds the Textract JSON
    t_doc = TAnalyzeExpenseDocumentSchema().load(json.loads(j))
The following code serializes the Python object back into the response JSON structure:
    from trp.trp2_expense import TAnalyzeExpenseDocument, TAnalyzeExpenseDocumentSchema

    t_doc = TAnalyzeExpenseDocumentSchema().dump(t_doc)
You can also convert the output to formats like CSV, Presto, TSV, HTML, LaTeX, and more by using the Amazon Textract PrettyPrinter library.
Install the PrettyPrinter library with the following code:
    python3 -m pip install amazon-textract-prettyprinter --upgrade
Call the get_string method of textractprettyprinter.t_pretty_print_expense with the output_type as SUMMARY or LINEITEMGROUPS and table_format set to whichever format you want to output. The following example code outputs both SUMMARY and LINEITEMGROUPS in the fancy grid format:
    import boto3
    from textractprettyprinter.t_pretty_print_expense import get_string
    from textractprettyprinter.t_pretty_print_expense import Textract_Expense_Pretty_Print, Pretty_Print_Table_Format

    # Boto3 client for Amazon Textract
    textract = boto3.client(service_name='textract')

    # Set the S3 bucket name and file name.
    # Please set the variables below to your S3 location.
    s3_source_bucket_name = "YOUR S3 BUCKET NAME"
    s3_request_file_name = "YOUR S3 EXPENSE IMAGE FILENAME"

    try:
        # Call the Textract AnalyzeExpense API with the input expense image in Amazon S3
        response = textract.analyze_expense(
            Document={
                'S3Object': {
                    'Bucket': s3_source_bucket_name,
                    'Name': s3_request_file_name
                }
            })

        # Call the PrettyPrinter get_string method to parse the response and
        # print it in fancy_grid format. You can set the pretty print format
        # to other types as well, like csv, latex, etc.
        pretty_printed_string = get_string(
            textract_json=response,
            output_type=[Textract_Expense_Pretty_Print.SUMMARY,
                         Textract_Expense_Pretty_Print.LINEITEMGROUPS],
            table_format=Pretty_Print_Table_Format.fancy_grid)

        # Use the pretty printed string to save the response in storage of
        # your choice. Below is just printing it on stdout.
        print(pretty_printed_string)
    except Exception as e_raise:
        print(e_raise)
        raise e_raise
The following is the PrettyPrinter output for a sample receipt.
The following is another example of detecting structured data from an invoice.
AnalyzeExpense detects the various normalized summary fields like PAYMENT_TERMS, INVOICE_RECEIPT_ID, TOTAL, TAX, and RECEIVER_ADDRESS.
It also detected one LineItemGroup with one LineItem having DESCRIPTION, QUANTITY, and PRICE, as shown in the following PrettyPrinter output.
The following diagram is a common solution architecture pattern you can use to process documents using Amazon Textract. The solution uses the new AnalyzeExpense API to process receipts and invoices on Amazon S3 and stores the results back in Amazon S3.
The solution architecture includes the following steps:
You can deploy the solution architecture using an AWS CloudFormation template that performs much of the setup work for you.
To use the solution, upload the receipt and invoice images to the S3 bucket referred to by SourceBucket in the CloudFormation template. This triggers an event that invokes the Lambda function, which calls the AnalyzeExpense API, parses the response JSON, converts the parsed response into CSV or fancy_grid format, and stores it in another S3 bucket (referred to by OutputBucket in the CloudFormation template).
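The following is a simplified sketch of what such a Lambda function could look like; it is not the function shipped in the CloudFormation template. The OUTPUT_BUCKET environment variable name and the helper extract_s3_location are our assumptions. To stay short, the sketch stores the raw response JSON rather than converting it to CSV or fancy_grid, and it accepts both the direct S3 notification event shape and the EventBridge S3 event shape because the trigger wiring depends on your deployment.

```python
import json
import os
from typing import Any, Dict, Tuple
from urllib.parse import unquote_plus

# Hypothetical environment variable holding the output bucket name.
OUTPUT_BUCKET = os.environ.get("OUTPUT_BUCKET", "")


def extract_s3_location(event: Dict[str, Any]) -> Tuple[str, str]:
    """Return (bucket, key) from an S3 notification or an EventBridge S3 event."""
    if "Records" in event:
        # Direct S3 event notification; object keys arrive URL-encoded.
        s3_info = event["Records"][0]["s3"]
        return s3_info["bucket"]["name"], unquote_plus(s3_info["object"]["key"])
    # EventBridge event for S3; the key is assumed not to be URL-encoded here.
    detail = event["detail"]
    return detail["bucket"]["name"], detail["object"]["key"]


def lambda_handler(event: Dict[str, Any], context: Any) -> Dict[str, str]:
    """Analyze the uploaded image and store the raw response JSON in Amazon S3."""
    import boto3  # Imported here so the helper above works without Boto3.

    bucket, key = extract_s3_location(event)
    response = boto3.client("textract").analyze_expense(
        Document={"S3Object": {"Bucket": bucket, "Name": key}})

    output_key = f"{key}.json"
    boto3.client("s3").put_object(
        Bucket=OUTPUT_BUCKET,
        Key=output_key,
        Body=json.dumps(response).encode("utf-8"))
    return {"source": f"s3://{bucket}/{key}", "output_key": output_key}


# Local check of the event parsing with a hypothetical S3 notification.
sample_event = {
    "Records": [{
        "s3": {"bucket": {"name": "source-bucket"},
               "object": {"key": "receipts/my+receipt.png"}},
    }],
}
print(extract_s3_location(sample_event))
```

Keeping the event parsing in its own function lets you test it locally without AWS credentials.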
You can extend the provided Lambda function further based on your requirements and also change the output format to other types like TSV, grid, LaTeX, and many more by setting the appropriate value of table_format when calling the get_string method of textractprettyprinter.t_pretty_print_expense in Amazon Textract PrettyPrinter.
The sample Lambda function deployment package included in this CloudFormation template also includes the Boto3 SDK. If you want to upgrade the Boto3 SDK in the future, you either need to create a new deployment package with the upgraded Boto3 SDK or use the latest Boto3 SDK provided by the Lambda Python runtime.
To delete the resources that the CloudFormation template created, complete the following steps:
In this post, we provided an overview of the new Amazon Textract AnalyzeExpense API to quickly and easily retrieve structured data from receipts and invoices. We also described how you can parse the AnalyzeExpense response JSON using the Amazon Textract parser library and save the output in different formats using Amazon Textract PrettyPrinter. Finally, we provided a solution architecture for processing invoices and receipts using Amazon S3, EventBridge, and a Lambda function.
For more information, see the Amazon Textract Developer Guide.
Dhawalkumar Patel is a Sr. Startups Machine Learning Solutions Architect at AWS with expertise in Machine Learning and Serverless domains. He has worked with organizations ranging from large enterprises to startups on problems related to distributed computing and artificial intelligence.
Raj Copparapu is a Product Manager focused on putting machine learning in the hands of every developer.
Manish Chugh is a Sr. Solutions Architect at AWS based in San Francisco, CA. He has worked with organizations ranging from large enterprises to early-stage startups. He is responsible for helping customers architect scalable, secure, and cost-effective workloads on AWS. In his free time, he enjoys hiking East Bay trails, road biking, and watching (and playing) cricket.