Amazon Translate is a neural machine translation service that delivers fast, high-quality, and affordable language translation in 71 languages and 4,970 language pairs. Amazon Translate is great for performing batch translation when you have large quantities of pre-existing text to translate and real-time translation when you want to deliver on-demand translations of content as a feature of your applications. It can also handle documents that are written in multiple languages.
Document automation is a common use case where machine learning (ML) can be applied to simplify storing, managing, and extracting insights from documents. In this post, we look at how to run batch translation jobs using the Boto3 Python library from an Amazon SageMaker notebook instance. You can also generalize this process to run batch translation jobs from other AWS compute services.
We start by creating an AWS Identity and Access Management (IAM) role and access policy to allow SageMaker to run batch translation jobs. For a simple text translation (for example, text under 5,000 bytes), the call is synchronous and the notebook code passes the data to Amazon Translate as bytes. A batch translation job works differently: the files live in an Amazon Simple Storage Service (Amazon S3) bucket, and Amazon Translate reads them directly instead of receiving the data from the code running in the SageMaker notebook.
This section creates the permissions needed to allow Amazon Translate to access the files in Amazon S3.
For this post, we create a policy that is scoped to a single bucket rather than granting broad Amazon S3 access.
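The split between synchronous and batch translation can be sketched as a small size check. This is only an illustration: the 5,000-byte threshold is the example figure quoted in this post, and the names `SYNC_LIMIT_BYTES`, `utf8_size`, and `choose_mode` are hypothetical helpers, not part of Boto3. Check the current Amazon Translate quotas before relying on the number.

```python
# Threshold taken from this post's "under 5,000 bytes" example (an assumption;
# verify against the current Amazon Translate service quotas).
SYNC_LIMIT_BYTES = 5000


def utf8_size(text: str) -> int:
    """Return the size of the text in bytes once encoded as UTF-8."""
    return len(text.encode("utf-8"))


def choose_mode(text: str, limit: int = SYNC_LIMIT_BYTES) -> str:
    """Return "sync" for text under the limit, otherwise "batch"."""
    return "sync" if utf8_size(text) < limit else "batch"
```

The check measures bytes rather than characters because non-ASCII text can take several bytes per character.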
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::your-bucket"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:GetObject",
        "s3:DeleteObject"
      ],
      "Resource": [
        "arn:aws:s3:::your-bucket/*"
      ]
    }
  ]
}
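If you prefer to generate this policy document from code rather than paste it, a minimal sketch could look like the following. The function name `build_s3_access_policy` is a hypothetical helper, and `your-bucket` is the same placeholder used in the policy above.

```python
import json


def build_s3_access_policy(bucket: str) -> dict:
    """Build the bucket-scoped policy shown above for the given bucket name."""
    bucket_arn = f"arn:aws:s3:::{bucket}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing applies to the bucket itself.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [bucket_arn],
            },
            {
                # Object operations apply to the keys inside the bucket.
                "Effect": "Allow",
                "Action": ["s3:PutObject", "s3:GetObject", "s3:DeleteObject"],
                "Resource": [f"{bucket_arn}/*"],
            },
        ],
    }


# Serialize with straight quotes, ready to paste into the IAM policy editor.
policy_json = json.dumps(build_s3_access_policy("your-bucket"), indent=2)
```

Generating the JSON this way also avoids the curly quotes that word processors tend to introduce, which IAM rejects.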
Everything up to this point is a common IAM workflow. Next, we edit the role’s trust relationship so that Amazon Translate can assume the role.
For example, the following screenshot shows the code with Service defined as lambda.amazonaws.com.
The following screenshot shows the updated code, with Service set to translate.amazonaws.com.
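Because the screenshots are not reproduced here, the following sketch shows what a trust policy with the Service principal set to translate.amazonaws.com could look like, along with one way to apply it from code. The helper names and the `iam_client` argument are assumptions for illustration; `update_assume_role_policy` is the Boto3 IAM call that replaces a role's trust policy.

```python
import json


def build_trust_policy(service: str = "translate.amazonaws.com") -> dict:
    """Build a trust policy that lets the given AWS service assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": service},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def apply_trust_policy(iam_client, role_name: str) -> None:
    """Replace the role's trust relationship (iam_client is a Boto3 IAM client)."""
    iam_client.update_assume_role_policy(
        RoleName=role_name,
        PolicyDocument=json.dumps(build_trust_policy()),
    )
```

In a notebook you might call `apply_trust_policy(boto3.client("iam"), "your-role-name")`, where the role name is a placeholder for the role you created.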
We can now run a Jupyter notebook on SageMaker. Every notebook instance has an execution role, which we use to grant permissions for Amazon Translate. If you’re performing a synchronous translation with a short text, all you need to do is attach the TranslateFullAccess policy to this role. In production, you can narrow down the permissions with granular Amazon Translate access.
If you haven’t already configured this role to have access to Amazon S3, you can do so following the same steps.
You can also choose to give access to all S3 buckets or specific S3 buckets when you create a SageMaker notebook instance and create a new role.
For this post, we attach the AmazonS3FullAccess policy to the role.
You can now run a simple synchronous Amazon Translate job on your SageMaker notebook.
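A synchronous call could be sketched as follows. `translate_text` is the Boto3 Amazon Translate operation for short text; the helper names, the default language codes, and the Region are illustrative assumptions. Boto3 is imported inside `make_translate_client` so the translation helper itself can be exercised with any object that exposes `translate_text`.

```python
def make_translate_client(region_name: str = "us-east-1"):
    """Create a Boto3 Amazon Translate client (Region is a placeholder)."""
    import boto3  # imported here so the helper below has no hard dependency

    return boto3.client("translate", region_name=region_name)


def translate_short_text(client, text: str,
                         source_lang: str = "en",
                         target_lang: str = "es") -> str:
    """Translate a short string synchronously and return the translated text."""
    response = client.translate_text(
        Text=text,
        SourceLanguageCode=source_lang,
        TargetLanguageCode=target_lang,
    )
    return response["TranslatedText"]
```

In the notebook, `translate_short_text(make_translate_client(), "Hello, world")` would use the execution role's credentials automatically.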
If you run a batch translation job using Boto3, as in the following screenshot, the call takes a parameter called DataAccessRoleArn. This is the SageMaker execution role we identified earlier; we need to be able to pass this role to Amazon Translate, which allows Amazon Translate to access data in the S3 bucket. We can configure this on the console, where the role is passed directly to Amazon Translate instead of through code run from a SageMaker notebook.
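Since the screenshot is not available here, this is a hedged sketch of what such a batch call might look like. `start_text_translation_job` and its parameter names come from the Boto3 Amazon Translate API; the helper functions, bucket paths, job name, and role ARN are placeholders you would replace with your own values.

```python
def build_batch_job_request(job_name, input_s3_uri, output_s3_uri,
                            data_access_role_arn, source_lang, target_langs,
                            content_type="text/plain"):
    """Assemble the keyword arguments for start_text_translation_job."""
    return {
        "JobName": job_name,
        "InputDataConfig": {"S3Uri": input_s3_uri, "ContentType": content_type},
        "OutputDataConfig": {"S3Uri": output_s3_uri},
        # The role Amazon Translate uses to read and write the S3 data.
        "DataAccessRoleArn": data_access_role_arn,
        "SourceLanguageCode": source_lang,
        "TargetLanguageCodes": list(target_langs),
    }


def start_batch_job(client, **job_args) -> str:
    """Start the batch job with a Boto3 Translate client and return its job ID."""
    response = client.start_text_translation_job(
        **build_batch_job_request(**job_args)
    )
    return response["JobId"]
```

If the caller is not allowed to pass the role named in `DataAccessRoleArn`, this call fails with an access error, which is why the pass-role permission matters.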
You first need to locate your role ARN.
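As an alternative to looking up the ARN on the console, you could read it with the IAM API. This is a sketch, assuming the caller has `iam:GetRole` permission; `get_role_arn` is a hypothetical helper and `iam_client` stands for a Boto3 IAM client.

```python
def get_role_arn(iam_client, role_name: str) -> str:
    """Return the ARN of the named IAM role using the IAM GetRole call."""
    response = iam_client.get_role(RoleName=role_name)
    return response["Role"]["Arn"]
```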
This policy can now pass the translates3access2 role.
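A policy that grants permission to pass a single role might look like the following sketch. The `iam:PassRole` action is the standard IAM permission for this purpose; the helper name and the account ID in the usage line are placeholders, not values from the original console steps.

```python
import json


def build_pass_role_policy(role_arn: str) -> dict:
    """Build a policy allowing the holder to pass exactly one role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "iam:PassRole",
                # Scoped to one role instead of "*" to keep the grant narrow.
                "Resource": role_arn,
            }
        ],
    }


# 123456789012 is a placeholder account ID.
pass_role_json = json.dumps(
    build_pass_role_policy("arn:aws:iam::123456789012:role/translates3access2"),
    indent=2,
)
```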
The next step is to attach this policy to the SageMaker execution role.
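If you would rather attach the policy from code than on the console, a minimal sketch follows. `attach_role_policy` and `list_attached_role_policies` are Boto3 IAM operations; the helper name and its arguments are assumptions, and the listing is not paginated here, which is fine for a role with only a handful of policies.

```python
def attach_policy(iam_client, role_name: str, policy_arn: str) -> list:
    """Attach a managed policy to a role and return the attached policy ARNs."""
    iam_client.attach_role_policy(RoleName=role_name, PolicyArn=policy_arn)
    # Read back the attachments to confirm the policy is now on the role.
    attached = iam_client.list_attached_role_policies(RoleName=role_name)
    return [p["PolicyArn"] for p in attached["AttachedPolicies"]]
```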
You can now run the code in the SageMaker notebook instance.
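Batch jobs run asynchronously, so after starting one you may want to wait for it to finish. The following sketch polls `describe_text_translation_job`, a Boto3 Amazon Translate operation; the helper name, polling interval, and retry cap are arbitrary assumptions.

```python
import time

# Job states after which the status no longer changes.
TERMINAL_STATES = {"COMPLETED", "COMPLETED_WITH_ERROR", "FAILED", "STOPPED"}


def wait_for_job(client, job_id: str, poll_seconds: int = 30,
                 max_polls: int = 120, sleep=time.sleep) -> str:
    """Poll a batch translation job until it reaches a terminal state."""
    for _ in range(max_polls):
        response = client.describe_text_translation_job(JobId=job_id)
        status = response["TextTranslationJobProperties"]["JobStatus"]
        if status in TERMINAL_STATES:
            return status
        sleep(poll_seconds)
    raise TimeoutError(f"Job {job_id} did not finish after {max_polls} polls")
```

The `sleep` parameter exists only so the wait can be replaced in tests; in a notebook you would leave it at the default.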
You have seen how to run batch jobs using Amazon Translate in a SageMaker notebook. You can apply the same process to run the code on Amazon Elastic Container Service (Amazon ECS), Amazon Elastic Kubernetes Service (Amazon EKS), Amazon Elastic Compute Cloud (Amazon EC2), or other services. As a next step, you can also combine services like Amazon Comprehend, Amazon Transcribe, or Amazon Kendra to automate managing, searching, and adding metadata to your documents or textual data.
For more information about Amazon Translate, see Amazon Translate resources.
Raj Kadiyala is an AI/ML Tech Business Development Manager in the AWS WWPS Partner Organization. Raj has over 12 years of experience in machine learning and likes to spend his free time exploring machine learning for practical everyday solutions and staying active in the great outdoors of Colorado.
Watson G. Srivathsan is the Sr. Product Manager for Amazon Translate, the AWS natural language processing service. On weekends you will find him exploring the outdoors in the Pacific Northwest.