Amazon Kendra is an easy-to-use intelligent search service that allows you to integrate search capabilities with your applications so users can find information stored across data sources like Amazon Simple Storage Service (Amazon S3), OneDrive, and Google Drive; applications such as Salesforce, SharePoint, and ServiceNow; and relational databases like Amazon Relational Database Service (Amazon RDS). Amazon Kendra connectors enable you to synchronize data from multiple content repositories with your Amazon Kendra index. When end users ask natural language questions, Amazon Kendra uses machine learning (ML) algorithms to understand the context and return the most relevant answers.
The Amazon Kendra S3 connector supports indexing documents and their associated metadata stored in an S3 bucket. You often want applications running inside a virtual private cloud (VPC) to have access only to specific S3 buckets, and in many cases the connection must not traverse the internet to reach public endpoints. Many customers, however, own multiple S3 buckets, some of which are accessible by VPC endpoints for Amazon S3. In this post, we describe how to use the updated Amazon Kendra S3 connector, which supports VPC endpoints.
This post provides the steps to help you create an enterprise search engine on AWS using Amazon Kendra by connecting documents stored in an S3 bucket that is only accessible from within a VPC. For more information, see enhancing enterprise search with Amazon Kendra. The post also demonstrates how to configure your connector for Amazon S3 and how your index syncs with your data source when the data source content changes.
There are three main improvements to the Amazon Kendra S3 connector:
For this walkthrough, you should have the following prerequisites:
Before you can create an index in Amazon Kendra, you need to load documents into an S3 bucket. This section contains instructions to create an S3 bucket, get the files, and load them into the bucket. After completing all the steps in this section, you have a data source that Amazon Kendra can use.
Inside your bucket, you should now see four folders.
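The upload step can also be scripted. The following is a minimal sketch, assuming the sample files were already downloaded to a local folder and that the caller supplies a boto3 S3 client. The function names `plan_uploads` and `upload_all`, the bucket name, and the local folder are hypothetical and not part of the original walkthrough.

```python
from pathlib import Path
from typing import List, Tuple


def plan_uploads(local_root: str, key_prefix: str = "") -> List[Tuple[str, str]]:
    """Pair every file under local_root with the S3 key it would be stored at."""
    root = Path(local_root)
    pairs = []
    for path in sorted(root.rglob("*")):
        if path.is_file():
            # Preserve the folder layout: <key_prefix><relative path, forward slashes>
            key = key_prefix + path.relative_to(root).as_posix()
            pairs.append((str(path), key))
    return pairs


def upload_all(s3_client, bucket: str, local_root: str, key_prefix: str = "") -> int:
    """Upload each planned file and return how many files were sent.

    s3_client is assumed to be a boto3 S3 client, for example boto3.client("s3").
    """
    pairs = plan_uploads(local_root, key_prefix)
    for local_path, key in pairs:
        s3_client.upload_file(local_path, bucket, key)
    return len(pairs)
```

Calling `plan_uploads` on its own first is a cheap way to confirm that the folder structure you expect to see in the bucket matches what would actually be uploaded.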
A data source is a location that stores the documents for indexing. You can synchronize data sources automatically with an Amazon Kendra index to make sure that searches correctly reflect new, updated, or deleted documents in the source repositories.
After completing all the steps in this section, you’ll have a data source linked to Amazon Kendra. For more information, see Adding documents from a data source.
Before continuing, make sure that the index creation is complete and the index shows as Active. For more information, see Creating an Index.
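If you script the setup instead of watching the console, the same check can be done by polling the index status. This is a sketch, assuming a boto3 Amazon Kendra client is passed in by the caller; `wait_until_index_active`, the polling interval, and the retry limit are illustrative choices, not values from this post.

```python
import time


def wait_until_index_active(kendra_client, index_id, poll_seconds=30,
                            max_polls=60, sleep=time.sleep):
    """Poll DescribeIndex until the index reports ACTIVE.

    Returns True once the index is ACTIVE and False if max_polls is
    exhausted first. Raises RuntimeError if the index reports FAILED.
    kendra_client is assumed to be boto3.client("kendra").
    """
    for _ in range(max_polls):
        status = kendra_client.describe_index(Id=index_id)["Status"]
        if status == "ACTIVE":
            return True
        if status == "FAILED":
            raise RuntimeError("Index %s failed to create" % index_id)
        sleep(poll_seconds)
    return False
```

The `sleep` parameter exists only so the loop can be exercised without real delays.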
For more information about the different data sources that Amazon Kendra supports, see Adding documents from a data source.
Now you create an AWS Identity and Access Management (IAM) role for Amazon Kendra.
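For readers who prefer to define the role as code, the following sketch builds the two policy documents such a role generally needs: a trust policy that lets Amazon Kendra assume the role, and a permissions policy for reading the bucket and writing to the index. The helper names, the bucket name, and the index ARN are hypothetical, and the exact set of permissions the console generates may differ.

```python
import json


def kendra_trust_policy():
    """Trust policy that allows the Amazon Kendra service to assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "kendra.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def s3_data_source_permissions(bucket_name, index_arn):
    """Permissions to read one bucket and add or remove documents in one index."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": ["arn:aws:s3:::%s/*" % bucket_name],
            },
            {
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": ["arn:aws:s3:::%s" % bucket_name],
            },
            {
                "Effect": "Allow",
                "Action": ["kendra:BatchPutDocument", "kendra:BatchDeleteDocument"],
                "Resource": [index_arn],
            },
        ],
    }


def to_policy_json(policy):
    """Serialize a policy dict to the JSON string that IAM APIs expect."""
    return json.dumps(policy, indent=2)
```

The JSON strings can be passed to IAM, for example as the `AssumeRolePolicyDocument` argument of `create_role`. A connector that runs inside a VPC typically needs additional Amazon EC2 network interface permissions, which this sketch leaves out; check the IAM role documentation for Amazon Kendra data sources before relying on it.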
For more information about connecting Amazon Kendra to your Amazon Virtual Private Cloud (Amazon VPC), see Configuring Amazon Kendra to use a VPC.
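To make the VPC settings concrete, here is a sketch of a `CreateDataSource` request that includes a `VpcConfiguration` block. It assumes the `S3Configuration` request shape from the Amazon Kendra API; the console may create the updated connector through a different configuration format, so treat this as illustrative. The function name and all IDs in the usage note are placeholders.

```python
def build_s3_data_source_request(index_id, name, bucket_name, role_arn,
                                 subnet_ids, security_group_ids):
    """Build keyword arguments for a Kendra create_data_source call.

    The subnets and security groups identify where Amazon Kendra should
    connect from inside your VPC. Both lists must be non-empty.
    """
    if not subnet_ids or not security_group_ids:
        raise ValueError("At least one subnet and one security group are required")
    return {
        "IndexId": index_id,
        "Name": name,
        "Type": "S3",
        "RoleArn": role_arn,
        "Configuration": {
            "S3Configuration": {"BucketName": bucket_name},
        },
        "VpcConfiguration": {
            "SubnetIds": list(subnet_ids),
            "SecurityGroupIds": list(security_group_ids),
        },
    }
```

With a boto3 client, the request would be sent as `kendra_client.create_data_source(**request)`. Building the dictionary separately makes it easy to review the network settings before anything is created.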
By default, metadata files are stored in the same directory as the documents. If you want to place these files in a different folder, you can add a prefix. For more information, see Amazon S3 document metadata.
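The naming rule is easier to see with an example. The sketch below assumes the convention described in the Amazon S3 document metadata documentation: a metadata file is named after its document with `.metadata.json` appended, and when a prefix is set, the metadata files sit under that prefix in a folder structure parallel to the documents. The helper names and sample keys are hypothetical.

```python
import posixpath


def metadata_key_for(document_key, metadata_prefix=""):
    """Return the S3 key where the metadata file for document_key is expected."""
    if not metadata_prefix:
        # Default: the metadata file sits next to the document.
        return document_key + ".metadata.json"
    # With a prefix: same relative path, rooted under the prefix.
    return posixpath.join(metadata_prefix.rstrip("/"), document_key) + ".metadata.json"


def sample_metadata(title, source_uri):
    """Minimal metadata body with a title and the reserved _source_uri attribute."""
    return {
        "Title": title,
        "Attributes": {"_source_uri": source_uri},
    }
```

For example, with the prefix `metadata/`, the metadata for `Data/guide.pdf` would be looked up at `metadata/Data/guide.pdf.metadata.json`.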
This step defines the frequency with which the data source is synchronized with the Amazon Kendra index.
The duration of this process depends on the number of documents that you index. For this use case, it may take 15 minutes, after which you should see a message that the sync was successful. In the Sync run history section, you can see that 40 documents were synchronized.
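The same sync can be started and monitored from code. This is a sketch, assuming a boto3 Amazon Kendra client supplied by the caller and that the job started here appears on the first page of the sync history; `run_sync`, the polling interval, and the retry limit are illustrative.

```python
import time

# Statuses after which a sync job will not change any further.
FINISHED_STATUSES = {"SUCCEEDED", "FAILED", "INCOMPLETE", "ABORTED"}


def run_sync(kendra_client, index_id, data_source_id, poll_seconds=60,
             max_polls=30, sleep=time.sleep):
    """Start a sync job and poll its status until it finishes.

    Returns the final status string, or None if the job has not
    finished after max_polls checks.
    """
    started = kendra_client.start_data_source_sync_job(
        Id=data_source_id, IndexId=index_id
    )
    execution_id = started["ExecutionId"]
    for _ in range(max_polls):
        response = kendra_client.list_data_source_sync_jobs(
            Id=data_source_id, IndexId=index_id
        )
        for job in response.get("History", []):
            if job.get("ExecutionId") == execution_id and job.get("Status") in FINISHED_STATUSES:
                return job["Status"]
        sleep(poll_seconds)
    return None
```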
Your Amazon Kendra index is now ready for natural language queries. When you search your index, Amazon Kendra uses all the data and metadata provided to return the most accurate answers to your search query. On the Amazon Kendra console, choose Search indexed content. In the query field, start with a query such as “Which AWS service has 11 nines of durability?”
For more information about querying the index, see Querying an Index.
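The console search has a programmatic equivalent in the Query API. The following sketch, assuming a boto3 Amazon Kendra client passed in by the caller, sends a question and flattens the top results into simple dictionaries; `top_results` and the output field names are hypothetical.

```python
def top_results(kendra_client, index_id, question, limit=3):
    """Query the index and return up to `limit` simplified result items."""
    response = kendra_client.query(IndexId=index_id, QueryText=question)
    results = []
    for item in response.get("ResultItems", [])[:limit]:
        results.append({
            # Result types include ANSWER, QUESTION_ANSWER, and DOCUMENT.
            "type": item.get("Type"),
            "title": item.get("DocumentTitle", {}).get("Text", ""),
            "excerpt": item.get("DocumentExcerpt", {}).get("Text", ""),
            "uri": item.get("DocumentURI", ""),
        })
    return results
```

A call such as `top_results(kendra_client, index_id, "Which AWS service has 11 nines of durability?")` would return the same kind of excerpts that the console displays.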
Your data source is set up to sync any new, modified or deleted data. Before you can synchronize your data source incrementally with an index in Amazon Kendra, you need to load new documents into an S3 bucket.
Now you can synchronize the new documents added to the S3 bucket:
The duration of this process depends on the number of documents that you index. For this use case, it may take 15 minutes, after which you should see a message that the sync was successful.
In the Sync run history section, you can see that 20 documents were synchronized.
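The counts shown in the Sync run history section are also available through the API. This sketch, assuming a boto3 Amazon Kendra client and that the most recent run is on the first page of results, picks the latest run by start time and converts its metrics to integers. The function name `summarize_latest_sync` is hypothetical.

```python
METRIC_NAMES = (
    "DocumentsAdded",
    "DocumentsModified",
    "DocumentsDeleted",
    "DocumentsFailed",
    "DocumentsScanned",
)


def summarize_latest_sync(kendra_client, index_id, data_source_id):
    """Return the status and document counts of the most recent sync run.

    Returns None if the data source has no sync history yet.
    """
    response = kendra_client.list_data_source_sync_jobs(
        Id=data_source_id, IndexId=index_id
    )
    history = response.get("History", [])
    if not history:
        return None
    latest = max(history, key=lambda job: job["StartTime"])
    metrics = latest.get("Metrics", {})
    summary = {"Status": latest.get("Status")}
    for name in METRIC_NAMES:
        # The API reports each metric as a string; missing values count as 0.
        summary[name] = int(metrics.get(name) or 0)
    return summary
```

After an incremental sync, comparing `DocumentsAdded`, `DocumentsModified`, and `DocumentsDeleted` against what you changed in the bucket is a quick sanity check.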
In a scenario where the data source has stale information, you can now re-index the data source without having to delete and create a new data source. To modify the sync mode and re-index the data source, complete the following steps:
Now you can synchronize the new documents added to the S3 bucket.
In the Sync run history section, the Modified column shows that all documents were synchronized, irrespective of their previous sync status.
To avoid incurring future charges and to clean out unused roles and policies, delete the resources you created:
Wait until you get the confirmation message; the process can take up to 15 minutes.
In this post, you learned how to use Amazon Kendra to deploy an enterprise search service using a secure connection to Amazon S3 that doesn’t require an internet gateway or Network Address Translation (NAT) device. You can enable quicker syncs for your documents using sync mode.
There are many additional features that we didn’t cover. For example:
To learn more about Amazon Kendra, refer to the Amazon Kendra Developer Guide.
Maran Chandrasekaran is a Senior Solutions Architect at Amazon Web Services, working with our enterprise customers. Outside of work, he loves to travel.
Arjun Agrawal is a Software Engineer at AWS, currently working with an Amazon Kendra team on an enterprise search engine. He is passionate about new technology and solving real-world problems. Outside of work, he loves to hike and travel.