Many AWS customers have been successfully using Amazon Transcribe to accurately, efficiently, and automatically convert their customer audio conversations to text, and extract actionable insights from them. These insights can help you continuously enhance the processes and products that directly improve the quality and experience for your customers.
In many countries, such as India, English is not the primary language of communication. Indian customer conversations contain regional languages like Hindi, with English words and phrases spoken randomly throughout the calls. In the source media files, there can be proper nouns, domain-specific acronyms, words, or phrases that the default Amazon Transcribe model isn’t aware of. Transcriptions for such media files can have inaccurate spellings for those words.
In this post, we demonstrate how you can provide more information to Amazon Transcribe with custom vocabularies to update the way Amazon Transcribe handles transcription of your audio files with business-specific terminology. We show the steps to improve the accuracy of transcriptions for Hinglish calls (Indian Hindi calls containing Indian English words and phrases). You can use the same process to transcribe audio calls with any language supported by Amazon Transcribe. After you create custom vocabularies, you can transcribe audio calls with accuracy and at scale by using our post call analytics solution, which we discuss more later in this post.
We use the following Indian Hindi audio call (SampleAudio.wav) with random English words to demonstrate the process.
We then walk you through the following high-level steps:

1. Confirm that the input audio file meets the Amazon Transcribe data input requirements.
2. Transcribe the audio file using an Amazon Transcribe call analytics job.
3. Measure the accuracy of the transcription using the word error rate (WER).
4. Train Amazon Transcribe with a custom vocabulary.
5. Transcribe the audio file again and measure the improvement in WER.
Before we get started, we need to confirm that the input audio file meets the Amazon Transcribe data input requirements.
A monophonic recording, also referred to as mono, contains one audio signal, in which all the audio elements of the agent and the customer are combined into one channel. A stereophonic recording, also referred to as stereo, contains two audio signals, which capture the audio elements of the agent and the customer in two separate channels. In a stereo agent-customer recording, the file therefore contains two audio channels, one for the agent and one for the customer.
Low-fidelity audio recordings, such as telephone recordings, typically use 8,000 Hz sample rates. Amazon Transcribe supports processing both mono recordings and high-fidelity audio files with sample rates between 16,000 Hz and 48,000 Hz.
For improved transcription results, and to clearly distinguish the words spoken by the agent and the customer, we recommend using audio files recorded at an 8,000 Hz sample rate with stereo channel separation.
You can use a tool like ffmpeg to validate your input audio files from the command line:
ffmpeg -i SampleAudio.wav
In the returned response, check the line starting with Stream in the Input section, and confirm that the audio files are 8,000 Hz and stereo channel separated:
Input #0, wav, from 'SampleAudio.wav':
  Duration: 00:01:06.36, bitrate: 256 kb/s
  Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 8000 Hz, stereo, s16, 256 kb/s
When you build a pipeline to process a large number of audio files, you can automate this step to filter files that don’t meet the requirements.
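For example, the following Python sketch (an illustration, not part of the original solution) uses ffprobe from the FFmpeg suite to keep only the files that are 8,000 Hz and stereo; the file name is assumed.

import json
import subprocess

def meets_requirements(path: str) -> bool:
    """Check with ffprobe that the first audio stream is 8,000 Hz and stereo."""
    result = subprocess.run(
        ["ffprobe", "-v", "error", "-select_streams", "a:0",
         "-show_entries", "stream=sample_rate,channels", "-of", "json", path],
        capture_output=True, text=True, check=True,
    )
    stream = json.loads(result.stdout)["streams"][0]
    return int(stream["sample_rate"]) == 8000 and int(stream["channels"]) == 2

if meets_requirements("SampleAudio.wav"):
    print("File meets the audio requirements")
else:
    print("Skipping: file doesn't meet the audio requirements")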
As an additional prerequisite step, create an Amazon Simple Storage Service (Amazon S3) bucket to host the audio files to be transcribed. For instructions, refer to Create your first S3 bucket. Then upload the audio file to the S3 bucket.
Now we can start an Amazon Transcribe call analytics job using the audio file we uploaded. In this example, we use the AWS Management Console to transcribe the audio file. You can also use the AWS Command Line Interface (AWS CLI) or an AWS SDK.
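If you automate this step with the AWS SDK for Python (Boto3), a minimal sketch could look like the following; the S3 bucket name, IAM role ARN, and channel-to-speaker mapping are assumptions that you need to adapt to your environment.

import boto3

transcribe = boto3.client("transcribe")

# Start a call analytics job on the uploaded audio file
transcribe.start_call_analytics_job(
    CallAnalyticsJobName="SampleAudio",
    Media={"MediaFileUri": "s3://amzn-s3-demo-bucket/SampleAudio.wav"},  # assumed bucket
    DataAccessRoleArn="arn:aws:iam::111122223333:role/TranscribeCallAnalyticsRole",  # assumed role
    ChannelDefinitions=[
        {"ChannelId": 0, "ParticipantRole": "AGENT"},
        {"ChannelId": 1, "ParticipantRole": "CUSTOMER"},
    ],
)

# Check the job status; poll until it reaches COMPLETED
job = transcribe.get_call_analytics_job(CallAnalyticsJobName="SampleAudio")
print(job["CallAnalyticsJob"]["CallAnalyticsJobStatus"])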
When the status of the job is Complete, you can review the transcription by choosing the job (SampleAudio).
The customer and the agent sentences are clearly separated out, which helps us identify whether the customer or the agent spoke any specific words or phrases.
Word error rate (WER) is the recommended and most commonly used metric for evaluating the accuracy of Automatic Speech Recognition (ASR) systems. The goal is to reduce the WER as much as possible to improve the accuracy of the ASR system.
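WER compares the transcript produced by the ASR system (the hypothesis) against a human-verified, ground-truth transcript (the reference):

WER = (S + D + I) / N

where S is the number of substituted words, D the number of deleted words, I the number of inserted words, and N the total number of words in the reference. For example, 13 errors against a 132-word reference yield a WER of 13/132 ≈ 9.848%.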
To calculate WER, complete the following steps. This post uses the open-source asr-evaluation tool to calculate WER, but other tools such as SCTK or JiWER are also available.

1. Install the asr-evaluation tool, which makes the wer command available on your command line.
2. Create a text file named reference.txt containing the human-verified, ground-truth transcript of the call.
3. Create a text file named hypothesis.txt containing the transcript from the Amazon Transcribe call analytics job.
4. Compare the two files with the wer command (the -i option makes the comparison case-insensitive):

wer -i reference.txt hypothesis.txt
You get the following output:
REF: customer : हेलो,
HYP: customer : हेलो,
SENTENCE 1
Correct = 100.0% 3 ( 3)
Errors = 0.0% 0 ( 3)

REF: agent : गुड मोर्निग सौथ इंडिया ट्रेवल एजेंसी से मैं । लावन्या बात कर रही हूँ किस तरह से मैं आपकी सहायता कर सकती हूँ।
HYP: agent : गुड मोर्निग *** इंडिया ट्रेवल एजेंसी ** सेम है। लावन्या बात कर रही हूँ किस तरह से मैं आपकी सहायता कर सकती हूँ।
SENTENCE 2
Correct = 84.0% 21 ( 25)
Errors = 16.0% 4 ( 25)

REF: customer : मैं बहुत ***** दिनोंसे हैदराबाद ट्रेवल के बारे में सोच रहा था। क्या आप मुझे कुछ अच्छे लोकेशन के बारे में बता सकती हैं?
HYP: customer : मैं बहुत दिनों उनसे हैदराबाद ट्रेवल के बारे में सोच रहा था। क्या आप मुझे कुछ अच्छे लोकेशन के बारे में बता सकती हैं?
SENTENCE 3
Correct = 96.0% 24 ( 25)
Errors = 8.0% 2 ( 25)

REF: agent : हाँ बिल्कुल। हैदराबाद में बहुत सारे प्लेस है। उनमें से चार मिनार गोलकोंडा फोर्ट सालार जंग म्यूजियम और बिरला प्लेनेटोरियम मशहूर है।
HYP: agent : हाँ बिल्कुल। हैदराबाद में बहुत सारे प्लेस है। उनमें से चार महीना गोलकुंडा फोर सलार जंग म्यूजियम और बिरला प्लेनेटोरियम मशहूर है।
SENTENCE 4
Correct = 83.3% 20 ( 24)
Errors = 16.7% 4 ( 24)

REF: customer : हाँ बढिया थैंक यू मैं अगले सैटरडे और संडे को ट्राई करूँगा।
HYP: customer : हाँ बढिया थैंक यू मैं अगले सैटरडे और संडे को ट्राई करूँगा।
SENTENCE 5
Correct = 100.0% 14 ( 14)
Errors = 0.0% 0 ( 14)

REF: agent : एक सजेशन वीकेंड में ट्रैफिक ज्यादा रहने के चांसेज है।
HYP: agent : एक सजेशन वीकेंड में ट्रैफिक ज्यादा रहने के चांसेज है।
SENTENCE 6
Correct = 100.0% 12 ( 12)
Errors = 0.0% 0 ( 12)

REF: customer : सिरियसली एनी टिप्स यू केन शेर
HYP: customer : सिरियसली एनी टिप्स ** चिकन शेर
SENTENCE 7
Correct = 75.0% 6 ( 8)
Errors = 25.0% 2 ( 8)

REF: agent : आप टेक्सी यूस कर लो ड्रैव और पार्किंग का प्राब्लम नहीं होगा।
HYP: agent : आप टेक्सी यूस कर लो ड्रैब और पार्किंग का प्राब्लम नहीं होगा।
SENTENCE 8
Correct = 92.9% 13 ( 14)
Errors = 7.1% 1 ( 14)

REF: customer : ग्रेट आइडिया थैंक्यू सो मच।
HYP: customer : ग्रेट आइडिया थैंक्यू सो मच।
SENTENCE 9
Correct = 100.0% 7 ( 7)
Errors = 0.0% 0 ( 7)

Sentence count: 9
WER: 9.848% ( 13 / 132)
WRR: 90.909% ( 120 / 132)
SER: 55.556% ( 5 / 9)
The wer command compares text from the files reference.txt and hypothesis.txt. It reports errors for each sentence and also the total number of errors (WER: 9.848% ( 13 / 132)) in the entire transcript.
From the preceding output, wer reported 13 errors out of 132 words in the transcript. These errors can be of three types:

- Substitution errors: Amazon Transcribe replaces a reference word with a different word. For example, in sentence 7, केन was transcribed as चिकन.
- Deletion errors: A reference word is missing from the transcript, shown as asterisks in the HYP line. For example, in sentence 2, सौथ was dropped.
- Insertion errors: An extra word appears in the transcript that isn't in the reference, shown as asterisks in the REF line. For example, in sentence 3, the extra word दिनों was inserted.
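If you prefer to compute WER programmatically instead of from the command line, the JiWER library mentioned earlier exposes a simple Python API; this minimal sketch assumes reference.txt and hypothesis.txt contain one sentence per line, as in the asr-evaluation workflow above.

import jiwer

with open("reference.txt", encoding="utf-8") as f:
    reference = f.read().splitlines()
with open("hypothesis.txt", encoding="utf-8") as f:
    hypothesis = f.read().splitlines()

# jiwer.wer returns the word error rate as a fraction, for example 0.09848
print(f"WER: {jiwer.wer(reference, hypothesis):.3%}")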
We can make the following observations based on the transcript:

- In sentence 4, the names of the places चार मिनार (Charminar), गोलकोंडा फोर्ट (Golconda Fort), and सालार जंग (Salar Jung) were transcribed incorrectly as चार महीना, गोलकुंडा फोर, and सलार जंग.
- Sentences 2 and 7 contain additional deletion and substitution errors, such as सौथ being dropped and यू केन being transcribed as चिकन.

In the next step, we demonstrate how to correct these incorrectly transcribed words, such as the place names in sentence 4, using a custom vocabulary in Amazon Transcribe.
To create a custom vocabulary, you need to build a text file in a tabular format with the words and phrases to train the default Amazon Transcribe model. Your table must contain all four columns (Phrase, IPA, SoundsLike, and DisplayAs), but the Phrase column is the only one that must contain an entry on each row. You can leave the other columns empty. Each column must be separated by a tab character, even if some columns are left empty. For example, if you leave the IPA and SoundsLike columns empty for a row, the Phrase and DisplayAs columns in that row must be separated with three tab characters (between Phrase and IPA, IPA and SoundsLike, and SoundsLike and DisplayAs).
To train the model with a custom vocabulary, complete the following steps:

1. Create a text file containing the custom vocabulary table described above, with the columns separated by tab characters.
2. Create the custom vocabulary in Amazon Transcribe using the text file.
3. Wait for the status of the custom vocabulary to change to Ready.
4. Run the call analytics job again on the same audio file, this time selecting the custom vocabulary in the job settings.
You can only use characters that are supported for your language. Refer to your language’s character set for details.
The columns contain the following information:

- Phrase – The word or phrase that you want Amazon Transcribe to recognize. This is the only required column.
- IPA – The pronunciation of the word or phrase, expressed with International Phonetic Alphabet characters.
- SoundsLike – The pronunciation of the word or phrase, broken into smaller, hyphen-separated pieces that mimic how it sounds.
- DisplayAs – How you want the word or phrase to appear in the transcription output.
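If you create the custom vocabulary with the AWS SDK instead of the console, a minimal boto3 sketch could look like the following; the vocabulary name and S3 location are assumptions. After the vocabulary reaches the READY state, you can reference it in the call analytics job settings (the VocabularyName field in the Settings structure of the API request).

import boto3

transcribe = boto3.client("transcribe")

# Create the custom vocabulary from the tab-separated table uploaded to Amazon S3
transcribe.create_vocabulary(
    VocabularyName="hinglish-custom-vocabulary",  # assumed name
    LanguageCode="hi-IN",
    VocabularyFileUri="s3://amzn-s3-demo-bucket/custom-vocabulary.txt",  # assumed location
)

# The vocabulary must be READY before you can use it in a transcription job
status = transcribe.get_vocabulary(VocabularyName="hinglish-custom-vocabulary")
print(status["VocabularyState"])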
Copy the transcript from the Amazon Transcribe job details page to a text file named hypothesis-custom-vocabulary.txt:
Customer : हेलो,
Agent : गुड मोर्निग इंडिया ट्रेवल एजेंसी सेम है। लावन्या बात कर रही हूँ किस तरह से मैं आपकी सहायता कर सकती हूँ।
Customer : मैं बहुत दिनों उनसे हैदराबाद ट्रेवल के बारे में सोच रहा था। क्या आप मुझे कुछ अच्छे लोकेशन के बारे में बता सकती हैं?
Agent : हाँ बिल्कुल। हैदराबाद में बहुत सारे प्लेस है। उनमें से चार मिनार गोलकोंडा फोर्ट सालार जंग म्यूजियम और बिरला प्लेनेटोरियम मशहूर है।
Customer : हाँ बढिया थैंक यू मैं अगले सैटरडे और संडे को ट्राई करूँगा।
Agent : एक सजेशन वीकेंड में ट्रैफिक ज्यादा रहने के चांसेज है।
Customer : सिरियसली एनी टिप्स चिकन शेर
Agent : आप टेक्सी यूस कर लो ड्रैव और पार्किंग का प्राब्लम नहीं होगा।
Customer : ग्रेट आइडिया थैंक्यू सो मच।
Note that the place names चार मिनार, गोलकोंडा फोर्ट, and सालार जंग are now transcribed as desired.
Run the wer command again with the new transcript:
wer -i reference.txt hypothesis-custom-vocabulary.txt
You get the following output:
REF: customer : हेलो,
HYP: customer : हेलो,
SENTENCE 1
Correct = 100.0% 3 ( 3)
Errors = 0.0% 0 ( 3)

REF: agent : गुड मोर्निग सौथ इंडिया ट्रेवल एजेंसी से मैं । लावन्या बात कर रही हूँ किस तरह से मैं आपकी सहायता कर सकती हूँ।
HYP: agent : गुड मोर्निग *** इंडिया ट्रेवल एजेंसी ** सेम है। लावन्या बात कर रही हूँ किस तरह से मैं आपकी सहायता कर सकती हूँ।
SENTENCE 2
Correct = 84.0% 21 ( 25)
Errors = 16.0% 4 ( 25)

REF: customer : मैं बहुत ***** दिनोंसे हैदराबाद ट्रेवल के बारे में सोच रहा था। क्या आप मुझे कुछ अच्छे लोकेशन के बारे में बता सकती हैं?
HYP: customer : मैं बहुत दिनों उनसे हैदराबाद ट्रेवल के बारे में सोच रहा था। क्या आप मुझे कुछ अच्छे लोकेशन के बारे में बता सकती हैं?
SENTENCE 3
Correct = 96.0% 24 ( 25)
Errors = 8.0% 2 ( 25)

REF: agent : हाँ बिल्कुल। हैदराबाद में बहुत सारे प्लेस है। उनमें से चार मिनार गोलकोंडा फोर्ट सालार जंग म्यूजियम और बिरला प्लेनेटोरियम मशहूर है।
HYP: agent : हाँ बिल्कुल। हैदराबाद में बहुत सारे प्लेस है। उनमें से चार मिनार गोलकोंडा फोर्ट सालार जंग म्यूजियम और बिरला प्लेनेटोरियम मशहूर है।
SENTENCE 4
Correct = 100.0% 24 ( 24)
Errors = 0.0% 0 ( 24)

REF: customer : हाँ बढिया थैंक यू मैं अगले सैटरडे और संडे को ट्राई करूँगा।
HYP: customer : हाँ बढिया थैंक यू मैं अगले सैटरडे और संडे को ट्राई करूँगा।
SENTENCE 5
Correct = 100.0% 14 ( 14)
Errors = 0.0% 0 ( 14)

REF: agent : एक सजेशन वीकेंड में ट्रैफिक ज्यादा रहने के चांसेज है।
HYP: agent : एक सजेशन वीकेंड में ट्रैफिक ज्यादा रहने के चांसेज है।
SENTENCE 6
Correct = 100.0% 12 ( 12)
Errors = 0.0% 0 ( 12)

REF: customer : सिरियसली एनी टिप्स यू केन शेर
HYP: customer : सिरियसली एनी टिप्स ** चिकन शेर
SENTENCE 7
Correct = 75.0% 6 ( 8)
Errors = 25.0% 2 ( 8)

REF: agent : आप टेक्सी यूस कर लो ड्रैव और पार्किंग का प्राब्लम नहीं होगा।
HYP: agent : आप टेक्सी यूस कर लो ड्रैव और पार्किंग का प्राब्लम नहीं होगा।
SENTENCE 8
Correct = 100.0% 14 ( 14)
Errors = 0.0% 0 ( 14)

REF: customer : ग्रेट आइडिया थैंक्यू सो मच।
HYP: customer : ग्रेट आइडिया थैंक्यू सो मच।
SENTENCE 9
Correct = 100.0% 7 ( 7)
Errors = 0.0% 0 ( 7)

Sentence count: 9
WER: 6.061% ( 8 / 132)
WRR: 94.697% ( 125 / 132)
SER: 33.333% ( 3 / 9)
The total WER is 6.061%, meaning 93.939% of the words are transcribed accurately.
Let’s compare the wer output for sentence 4 with and without custom vocabulary. The following is without custom vocabulary:
REF: agent : हाँ बिल्कुल। हैदराबाद में बहुत सारे प्लेस है। उनमें से चार मिनार गोलकोंडा फोर्ट सालार जंग म्यूजियम और बिरला प्लेनेटोरियम मशहूर है।
HYP: agent : हाँ बिल्कुल। हैदराबाद में बहुत सारे प्लेस है। उनमें से चार महीना गोलकुंडा फोर सलार जंग म्यूजियम और बिरला प्लेनेटोरियम मशहूर है।
SENTENCE 4
Correct = 83.3% 20 ( 24)
Errors = 16.7% 4 ( 24)
The following is with custom vocabulary:
REF: agent : हाँ बिल्कुल। हैदराबाद में बहुत सारे प्लेस है। उनमें से चार मिनार गोलकोंडा फोर्ट सालार जंग म्यूजियम और बिरला प्लेनेटोरियम मशहूर है।
HYP: agent : हाँ बिल्कुल। हैदराबाद में बहुत सारे प्लेस है। उनमें से चार मिनार गोलकोंडा फोर्ट सालार जंग म्यूजियम और बिरला प्लेनेटोरियम मशहूर है।
SENTENCE 4
Correct = 100.0% 24 ( 24)
Errors = 0.0% 0 ( 24)
There are no errors in sentence 4. The names of the places are transcribed accurately with the help of custom vocabulary, thereby reducing the overall WER from 9.848% to 6.061% for this audio file. This means that the accuracy of transcription improved by nearly 4 percentage points.
We used the following custom vocabulary:
Phrase          IPA     SoundsLike      DisplayAs
गोलकुंडा-फोर                            गोलकोंडा फोर्ट
सालार-जंग               सा-लार-जंग      सालार जंग
चार-महीना                               चार मिनार
Amazon Transcribe checks whether there are any words in the audio file that sound like the words listed in the Phrase column. Then the model uses the entries in the IPA, SoundsLike, and DisplayAs columns for those specific words to transcribe with the desired spellings.
With this custom vocabulary, when Amazon Transcribe identifies a word that sounds like “गोलकुंडा-फोर (Golcunda-Four),” it transcribes that word as “गोलकोंडा फोर्ट (Golconda Fort).”
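Because empty columns still need their tab separators, it can be less error-prone to generate the vocabulary file programmatically. The following Python sketch writes the same table shown above; the output file name is an assumption.

# Each tuple is one row: (Phrase, IPA, SoundsLike, DisplayAs)
rows = [
    ("Phrase", "IPA", "SoundsLike", "DisplayAs"),
    ("गोलकुंडा-फोर", "", "", "गोलकोंडा फोर्ट"),
    ("सालार-जंग", "", "सा-लार-जंग", "सालार जंग"),
    ("चार-महीना", "", "", "चार मिनार"),
]

# Join every column with a tab, even the empty ones, as Amazon Transcribe requires
with open("custom-vocabulary.txt", "w", encoding="utf-8") as f:
    for row in rows:
        f.write("\t".join(row) + "\n")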
The accuracy of transcription also depends on parameters like the speakers' pronunciation, overlapping speakers, talking speed, and background noise. Therefore, we recommend that you follow the process with a variety of calls (with different customers, agents, interruptions, and so on) that cover the most commonly used domain-specific words in order to build a comprehensive custom vocabulary.
In this post, we learned the process to improve accuracy of transcribing one audio call using custom vocabulary. To process thousands of your contact center call recordings every day, you can use post call analytics, a fully automated, scalable, and cost-efficient end-to-end solution that takes care of most of the heavy lifting. You simply upload your audio files to an S3 bucket, and within minutes, the solution provides call analytics like sentiment in a web UI. Post call analytics provides actionable insights to spot emerging trends, identify agent coaching opportunities, and assess the general sentiment of calls. Post call analytics is an open-source solution that you can deploy using AWS CloudFormation.
Note that custom vocabularies don't use the context in which the words were spoken; they only focus on individual words that you provide. To further improve the accuracy, you can use custom language models. Unlike custom vocabularies, which associate pronunciation with spelling, custom language models learn the context associated with a given word. This includes how and when a word is used, and the relationship a word has with other words. To create a custom language model, you can use the transcriptions derived from the process we learned for a variety of calls, and combine them with content from your websites or user manuals that contains domain-specific words and phrases.
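For reference, creating a custom language model with boto3 could look like the following sketch; the model name, training-data location, and IAM role are assumptions, and you should confirm that your language and base model are supported for custom language models before using this.

import boto3

transcribe = boto3.client("transcribe")

# Train a custom language model on domain-specific text stored in Amazon S3
transcribe.create_language_model(
    ModelName="travel-agency-clm",  # assumed name
    LanguageCode="hi-IN",  # confirm custom language model support for your language
    BaseModelName="NarrowBand",  # suited to low-fidelity 8,000 Hz telephone audio
    InputDataConfig={
        "S3Uri": "s3://amzn-s3-demo-bucket/clm-training-data/",  # assumed location
        "DataAccessRoleArn": "arn:aws:iam::111122223333:role/TranscribeClmRole",  # assumed role
    },
)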
To achieve the highest transcription accuracy with batch transcriptions, you can use custom vocabularies in conjunction with your custom language models.
In this post, we provided detailed steps to accurately process Hindi audio files containing English words using call analytics and custom vocabularies in Amazon Transcribe. You can use these same steps to process audio calls with any language supported by Amazon Transcribe.
After you derive the transcriptions with your desired accuracy, you can improve your agent-customer conversations by training your agents. You can also understand your customer sentiments and trends. With the help of speaker diarization, loudness detection, and vocabulary filtering features in the call analytics, you can identify whether it was the agent or customer who raised their tone or spoke any specific words. You can categorize calls based on domain-specific words, capture actionable insights, and run analytics to improve your products. Finally, you can translate your transcripts to English or other supported languages of your choice using Amazon Translate.
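For example, translating a sentence of the sample transcript to English with the AWS SDK is a one-call operation; this minimal sketch uses Amazon Translate's real-time API.

import boto3

translate = boto3.client("translate")

# Translate a Hindi transcript snippet to English
response = translate.translate_text(
    Text="ग्रेट आइडिया थैंक्यू सो मच।",
    SourceLanguageCode="hi",
    TargetLanguageCode="en",
)
print(response["TranslatedText"])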
Sarat Guttikonda is a Sr. Solutions Architect in AWS World Wide Public Sector. Sarat enjoys helping customers automate, manage, and govern their cloud resources without sacrificing business agility. In his free time, he loves building Legos with his son and playing table tennis.
Lavanya Sood is a Solutions Architect in AWS World Wide Public Sector based out of New Delhi, India. Lavanya enjoys learning new technologies and helping customers in their cloud adoption journey. In her free time, she loves traveling and trying different foods.