AWS delivers services that meet customers' artificial intelligence (AI) and machine learning (ML) needs, ranging from custom hardware like AWS Trainium and AWS Inferentia to generative AI foundation models (FMs) on Amazon Bedrock. In February 2022, AWS and Hugging Face announced a collaboration to make generative AI more accessible and cost-efficient.
Generative AI has grown at an accelerating rate: the largest pre-trained model in 2019 had 330 million parameters, whereas today's largest models exceed 500 billion. Model performance and quality have also improved drastically with the number of parameters. These models span tasks like text-to-text, text-to-image, text-to-embedding, and more. You can use large language models (LLMs), more specifically, for tasks including summarization, metadata extraction, and question answering.
Amazon SageMaker JumpStart is an ML hub that helps you accelerate your ML journey. With JumpStart, you can access pre-trained models and foundation models from the Foundation Model Hub to perform tasks like article summarization and image generation. Pre-trained models are fully customizable for your use cases and can be easily deployed into production with the user interface or SDK. Most importantly, none of your data is used to train the underlying models. Because all data is encrypted and doesn't leave the virtual private cloud (VPC), you can trust that your data will remain private and confidential.
This post focuses on building a serverless meeting summarization pipeline that uses Amazon Transcribe to transcribe meeting audio and the Flan-T5-XL model from Hugging Face (available on JumpStart) for summarization.
The Meeting Notes Generator solution creates an automated serverless pipeline that uses AWS Lambda to transcribe and summarize audio and video recordings of meetings. The solution can also be deployed with other FMs available on JumpStart.
The solution includes the following components:
- Amazon Simple Storage Service (Amazon S3) buckets for the meeting recordings, transcripts, and notes
- AWS Lambda functions triggered by S3 upload events
- Amazon Transcribe for converting recordings to text
- A SageMaker real-time inference endpoint hosting the Flan-T5-XL model
The following diagram illustrates this architecture.
[]
As shown in the architecture diagram, the meeting recordings, transcripts, and notes are stored in respective Amazon Simple Storage Service (Amazon S3) buckets. The solution takes an event-driven approach, transcribing and summarizing upon S3 upload events. The events trigger Lambda functions, which make API calls to Amazon Transcribe and invoke the real-time endpoint hosting the Flan-T5-XL model.
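To illustrate the event-driven flow, a Lambda handler that starts a transcription job when a recording lands in S3 might look like the following. This is a minimal sketch, not the solution's actual function; the job-naming scheme and the `TRANSCRIPTS_BUCKET` environment variable are assumptions.

```python
import os
import urllib.parse


def build_job_args(bucket: str, key: str, output_bucket: str) -> dict:
    """Build StartTranscriptionJob arguments from an S3 object reference.

    The job name here is simply the file name without its extension;
    the deployed solution may name jobs differently.
    """
    job_name = os.path.splitext(os.path.basename(key))[0]
    return {
        "TranscriptionJobName": job_name,
        "Media": {"MediaFileUri": f"s3://{bucket}/{key}"},
        "IdentifyLanguage": True,
        "OutputBucketName": output_bucket,
    }


def handler(event, context):
    # S3 upload event records carry the bucket name and URL-encoded object key
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
    args = build_job_args(bucket, key, os.environ.get("TRANSCRIPTS_BUCKET", ""))
    # In the deployed function this would call:
    # boto3.client("transcribe").start_transcription_job(**args)
    return args
```

The actual API call is left commented out so the sketch stands alone; in production it would run inside the Lambda function with the appropriate IAM permissions.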
The CloudFormation template and instructions for deploying the solution can be found in the GitHub repository.
Real-time inference on SageMaker is designed for workloads with low latency requirements. SageMaker endpoints are fully managed and support multiple hosting options and auto scaling. Once created, an endpoint can be invoked with the InvokeEndpoint API. The provided CloudFormation template creates a real-time endpoint with a default instance count of 1, but you can adjust this based on the expected load on the endpoint and as the service quota for the instance type permits. You can request service quota increases on the Service Quotas page of the AWS Management Console.
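For example, the endpoint could be invoked from Python through the SageMaker runtime client. This is a sketch: the endpoint name is a placeholder, and the payload shape follows the JumpStart Flan-T5 text-generation schema used later in this post.

```python
import json


def build_invoke_args(endpoint_name: str, text: str) -> dict:
    """Build keyword arguments for the sagemaker-runtime InvokeEndpoint call."""
    payload = {"text_inputs": text, "max_length": 100, "do_sample": True}
    return {
        "EndpointName": endpoint_name,
        "ContentType": "application/json",
        "Body": json.dumps(payload).encode("utf-8"),
    }


# In a deployed function (endpoint name is a placeholder):
# runtime = boto3.client("sagemaker-runtime")
# response = runtime.invoke_endpoint(**build_invoke_args("my-endpoint", "Summarize: ..."))
# generated = json.loads(response["Body"].read())
```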
The following snippet of the CloudFormation template defines the SageMaker model, endpoint configuration, and endpoint using the ModelData and ImageURI of Flan-T5-XL from JumpStart. You can explore more FMs on Getting started with Amazon SageMaker JumpStart. To deploy the solution with a different model, replace the ModelData and ImageURI parameters in the CloudFormation template with the desired model S3 artifact and container image URI, respectively. Check out the sample notebook on GitHub for sample code on how to retrieve the latest JumpStart model artifact on Amazon S3 and the corresponding public container image provided by SageMaker.
```yaml
# SageMaker Model
SageMakerModel:
  Type: AWS::SageMaker::Model
  Properties:
    ModelName: !Sub ${AWS::StackName}-SageMakerModel
    Containers:
      - Image: !Ref ImageURI
        ModelDataUrl: !Ref ModelData
        Mode: SingleModel
        Environment: {
          "MODEL_CACHE_ROOT": "/opt/ml/model",
          "SAGEMAKER_ENV": "1",
          "SAGEMAKER_MODEL_SERVER_TIMEOUT": "3600",
          "SAGEMAKER_MODEL_SERVER_WORKERS": "1",
          "SAGEMAKER_PROGRAM": "inference.py",
          "SAGEMAKER_SUBMIT_DIRECTORY": "/opt/ml/model/code/",
          "TS_DEFAULT_WORKERS_PER_MODEL": 1
        }
    EnableNetworkIsolation: true
    ExecutionRoleArn: !GetAtt SageMakerExecutionRole.Arn

# SageMaker Endpoint Config
SageMakerEndpointConfig:
  Type: AWS::SageMaker::EndpointConfig
  Properties:
    EndpointConfigName: !Sub ${AWS::StackName}-SageMakerEndpointConfig
    ProductionVariants:
      - ModelName: !GetAtt SageMakerModel.ModelName
        VariantName: !Sub ${SageMakerModel.ModelName}-1
        InitialInstanceCount: !Ref InstanceCount
        InstanceType: !Ref InstanceType
        InitialVariantWeight: 1.0
        VolumeSizeInGB: 40

# SageMaker Endpoint
SageMakerEndpoint:
  Type: AWS::SageMaker::Endpoint
  Properties:
    EndpointName: !Sub ${AWS::StackName}-SageMakerEndpoint
    EndpointConfigName: !GetAtt SageMakerEndpointConfig.EndpointConfigName
```
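When deploying the stack with a different JumpStart model, the ImageURI and ModelData parameters can be overridden at stack creation time. The following is a sketch using boto3; the stack name, template URL, and artifact locations are placeholders, not real values.

```python
def build_stack_parameters(image_uri: str, model_data: str) -> list:
    """Build CloudFormation parameter overrides for the FM endpoint."""
    return [
        {"ParameterKey": "ImageURI", "ParameterValue": image_uri},
        {"ParameterKey": "ModelData", "ParameterValue": model_data},
    ]


# In practice (all values below are placeholders):
# boto3.client("cloudformation").create_stack(
#     StackName="meeting-notes-generator",
#     TemplateURL="https://<bucket>.s3.amazonaws.com/template.yaml",
#     Parameters=build_stack_parameters(
#         "<account>.dkr.ecr.<region>.amazonaws.com/<jumpstart-inference-image>",
#         "s3://<jumpstart-bucket>/<model-artifact>.tar.gz",
#     ),
#     Capabilities=["CAPABILITY_IAM"],
# )
```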
For detailed steps on deploying the solution, follow the Deployment with CloudFormation section of the GitHub repository.
If you want to use a different instance type or more instances for the endpoint, submit a quota increase request for the desired instance type on the AWS Service Quotas Dashboard.
To use a different FM for the endpoint, replace the ImageURI and ModelData parameters in the CloudFormation template with those of the corresponding FM.
After you deploy the solution using the Lambda layer creation script and the CloudFormation template, you can test the architecture by uploading an audio or video meeting recording in any of the media formats supported by Amazon Transcribe. Complete the following steps:
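For instance, a test recording could be uploaded with boto3 after checking that its extension is one Amazon Transcribe supports. A sketch under stated assumptions: the bucket name is a placeholder, and the format list reflects Transcribe's documented supported formats at the time of writing.

```python
import os

# Media formats supported by Amazon Transcribe (per its documentation)
SUPPORTED_FORMATS = {"amr", "flac", "m4a", "mp3", "mp4", "ogg", "wav", "webm"}


def is_supported_media(path: str) -> bool:
    """Check whether a file extension is a Transcribe-supported media format."""
    ext = os.path.splitext(path)[1].lstrip(".").lower()
    return ext in SUPPORTED_FORMATS


def upload_recording(path: str, bucket: str) -> None:
    """Upload a meeting recording to the recordings bucket to start the pipeline."""
    if not is_supported_media(path):
        raise ValueError(f"Unsupported media format: {path}")
    # Uploading to the recordings bucket triggers the transcription Lambda:
    # boto3.client("s3").upload_file(path, bucket, os.path.basename(path))
```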
Now we can check for a successful transcription.
We can also check the generated summary.
Even though LLMs have improved in the last few years, the models can only take finite inputs; therefore, inserting an entire meeting transcript may exceed the model's limit and cause the invocation to fail. To design around this challenge, we can break the context into manageable chunks by limiting the number of tokens in each invocation context. In this sample solution, the transcript is broken into smaller chunks, each with a maximum number of tokens. Each transcript chunk is then summarized using the Flan-T5-XL model. Finally, the chunk summaries are combined to form the context for the final combined summary, as shown in the following diagram.
[]
The following code from the GenerateMeetingNotes Lambda function uses the Natural Language Toolkit (NLTK) library to tokenize the transcript, then chunks the transcript into sections, each containing up to a certain number of tokens:
```python
# Chunk transcript into chunks
import math

from nltk.tokenize import word_tokenize
from nltk.tokenize.treebank import TreebankWordDetokenizer

transcript = contents['results']['transcripts'][0]['transcript']
transcript_tokens = word_tokenize(transcript)

num_chunks = int(math.ceil(len(transcript_tokens) / CHUNK_LENGTH))
transcript_chunks = []
for i in range(num_chunks):
    if i == num_chunks - 1:
        chunk = TreebankWordDetokenizer().detokenize(
            transcript_tokens[CHUNK_LENGTH * i:])
    else:
        chunk = TreebankWordDetokenizer().detokenize(
            transcript_tokens[CHUNK_LENGTH * i:CHUNK_LENGTH * (i + 1)])
    transcript_chunks.append(chunk)
```
After the transcript is broken up into smaller chunks, the following code invokes the SageMaker real-time inference endpoint to get summaries of each transcript chunk:
```python
# Summarize each chunk
chunk_summaries = []
for i in range(len(transcript_chunks)):
    text_input = '{}\n{}'.format(transcript_chunks[i], instruction)
    payload = {
        "text_inputs": text_input,
        "max_length": 100,
        "num_return_sequences": 1,
        "top_k": 50,
        "top_p": 0.95,
        "do_sample": True
    }
    query_response = query_endpoint_with_json_payload(
        json.dumps(payload).encode('utf-8'))
    generated_texts = parse_response_multiple_texts(query_response)
    chunk_summaries.append(generated_texts[0])
    print(generated_texts[0])
```
Finally, the following code snippet combines the chunk summaries as the context to generate a final summary:
```python
# Create a combined summary
text_input = '{}\n{}'.format(' '.join(chunk_summaries), instruction)
payload = {
    "text_inputs": text_input,
    "max_length": 100,
    "num_return_sequences": 1,
    "top_k": 50,
    "top_p": 0.95,
    "do_sample": True
}
query_response = query_endpoint_with_json_payload(
    json.dumps(payload).encode('utf-8'))
generated_texts = parse_response_multiple_texts(query_response)
results = {
    "summary": generated_texts,
    "chunk_summaries": chunk_summaries
}
```
The full GenerateMeetingNotes Lambda function can be found in the GitHub repository.
To clean up the solution, complete the following steps:
This post demonstrated how to use FMs on JumpStart to quickly build a serverless meeting notes generator architecture with AWS CloudFormation. Combined with AWS AI services like Amazon Transcribe and serverless technologies like Lambda, you can use FMs on JumpStart and Amazon Bedrock to build applications for various generative AI use cases.
For additional posts on ML at AWS, visit the AWS ML Blog.
Eric Kim is a Solutions Architect (SA) at Amazon Web Services. He works with game developers and publishers to build scalable games and supporting services on AWS. He primarily focuses on applications of artificial intelligence and machine learning.