Labeling is an indispensable stage of data preprocessing in supervised learning. Amazon SageMaker Ground Truth is a fully managed data labeling service that makes it easy to build highly accurate training datasets for machine learning. Ground Truth helps improve the quality of labels through annotation consolidation and audit workflows. Ground Truth is easy to use, can reduce your labeling costs by up to 70% using automatic labeling, and provides options to work with labelers inside and outside of your organization.
This post explains how you can use Ground Truth partial labeling data loaded in Amazon Simple Storage Service (Amazon S3) to gamify labeling workflows. The core of the gamification approach is to create a bar chart race showing the progress of the labeling workflow and highlighting the number of labels completed per worker over time. The bar chart race can be sent out periodically (such as daily or weekly). We present options to create and send your bar chart manually or automatically.
This gamification approach to Ground Truth labeling workflows can help you keep your labeling team engaged and complete labeling jobs faster.
We have successfully adopted this solution for a healthcare and life sciences customer. The labeling job owner kept the internal labeling team engaged by sending a bar chart race daily, and the labeling job was completed 20% faster than planned.
A first option for gamifying your Ground Truth labeling workflow via a bar chart race is to create an Amazon SageMaker notebook instance, fetch the partial labeling data, parse it, and create the bar chart race manually. You then save it to Amazon S3 and send it to the workers. The following diagram shows this workflow.
To create your bar chart race manually, complete the following steps:
```shell
cd /home/ec2-user/SageMaker/
source activate python3
pip install bar_chart_race
pip install ffmpeg-python
sudo su -
cd /usr/local/bin
mkdir ffmpeg
cd ffmpeg
wget https://www.johnvansickle.com/ffmpeg/old-releases/ffmpeg-4.2.1-amd64-static.tar.xz
tar xvf ffmpeg-4.2.1-amd64-static.tar.xz
mv ffmpeg-4.2.1-amd64-static/ffmpeg .
ln -s /usr/local/bin/ffmpeg/ffmpeg /usr/bin/ffmpeg
exit
```
```python
import boto3
import json
import pandas as pd
import numpy as np
```
```python
s3 = boto3.client('s3')
bucket_name = 'Example_SageMaker_GT'
prefix = 'annotations/worker-response/iteration-1/'
```
```python
s3_res = boto3.resource('s3')
paginator = boto3.client('s3').get_paginator('list_objects_v2')
pages = paginator.paginate(Bucket=bucket_name, Prefix=prefix)
times = []
subs = []
for page in pages:
    for work in page['Contents']:
        content_object = s3_res.Object(bucket_name, work['Key'])
        file_content = content_object.get()['Body'].read().decode('utf-8')
        json_content = json.loads(file_content)
        times.append(json_content['answers'][0]['submissionTime'])
        subs.append(json_content['answers'][0]['workerMetadata']['identityData']['sub'])

# anonymize the worker subs with placeholder names
sub_map = {s: f'Name {i}' for i, s in enumerate(np.unique(subs))}
```
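The `sub_map` above assigns placeholder names. If your private workforce is backed by an Amazon Cognito user pool, you could instead resolve each `sub` to the worker's real name. The sketch below is illustrative, not part of the original workflow: it assumes a user pool whose users carry a `name` attribute, and the helper names (`build_sub_name_map`, `fetch_sub_name_map`) are our own.

```python
def build_sub_name_map(users):
    """Build a sub -> display-name map from Cognito user records.

    Each record is shaped like the entries returned by list_users: a dict
    with 'Username' and an 'Attributes' list of {'Name', 'Value'} pairs.
    """
    sub_map = {}
    for user in users:
        attrs = {a['Name']: a['Value'] for a in user.get('Attributes', [])}
        if 'sub' in attrs:
            # fall back to the username when no 'name' attribute is set
            sub_map[attrs['sub']] = attrs.get('name', user['Username'])
    return sub_map


def fetch_sub_name_map(user_pool_id):
    """Fetch every user in the pool and map sub -> name."""
    import boto3  # imported here so the pure helper stays testable offline
    client = boto3.client('cognito-idp')
    users = []
    for page in client.get_paginator('list_users').paginate(UserPoolId=user_pool_id):
        users.extend(page['Users'])
    return build_sub_name_map(users)
```

You would then pass the result to `df['subs'].map(...)` in place of the placeholder `sub_map`.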
```python
df = pd.DataFrame({'times': times, 'subs': subs})
df['subs'] = df['subs'].map(sub_map)
df['date'] = pd.to_datetime(df.times).dt.date
df['hours'] = pd.to_datetime(df.times).dt.strftime('%Y-%m-%d %H:30')
df['count'] = 1  # one row per submitted label

counts_per_sub_per_date = df.groupby(['hours', 'subs'])['count'].count().unstack()
counts_per_sub_per_date_cum = counts_per_sub_per_date.fillna(0).cumsum()
```
```python
import bar_chart_race as bcr

bcr.bar_chart_race(
    df=counts_per_sub_per_date_cum,
    filename=None,  # render inline in the notebook; pass a path to save a file
    orientation='h',
    sort='desc',
    # n_bars=len(counts_per_sub.columns),
    fixed_order=False,
    fixed_max=True,
    steps_per_period=5,
    interpolate_period=False,
    label_bars=True,
    bar_size=.95,
    period_label={'x': .99, 'y': .25, 'ha': 'right', 'va': 'center'},
    # period_fmt='%B %d, %Y',
    period_summary_func=lambda v, r: {'x': .99, 'y': .18,
                                      's': f'Total labels: {v.sum():,.0f}',
                                      'ha': 'right', 'size': 8, 'family': 'Courier New'},
    perpendicular_bar_func='median',
    period_length=50,
    figsize=(5, 3),
    dpi=144,
    cmap='dark12',
    title='Who is going to be the top labeller?',
    bar_label_size=7,
    tick_label_size=7,
    shared_fontdict={'family': 'Helvetica', 'color': '.1'},
    scale='linear',
    writer=None,
    fig=None,
    bar_kwargs={'alpha': .7},
    filter_column_colors=False)
```
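Once the animation is rendered to a file (pass a path instead of `filename=None`), you can save it to Amazon S3 and share it with the workers, as the workflow above describes. A minimal sketch, assuming a date-stamped key scheme and helper names of our own invention (`barchart_key`, `upload_and_share` are hypothetical, as is the key layout):

```python
from datetime import date


def barchart_key(prefix='barchart'):
    """Date-stamped S3 key for today's bar chart race (hypothetical scheme)."""
    return f"{prefix}/barchart-{date.today().isoformat()}.mp4"


def upload_and_share(local_path, bucket, expires_in=7 * 24 * 3600):
    """Upload the rendered video and return a presigned URL that workers
    can open without any S3 permissions of their own."""
    import boto3
    s3 = boto3.client('s3')
    key = barchart_key()
    s3.upload_file(local_path, bucket, key)
    return s3.generate_presigned_url(
        'get_object',
        Params={'Bucket': bucket, 'Key': key},
        ExpiresIn=expires_in,
    )
```

The presigned URL can then be pasted into the email or chat message you send to the labeling team.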
Option 2 requires no manual intervention; the bar chart races are sent automatically to the workers at a fixed interval (such as every day or every week). We provide a completely serverless solution, where the computing is done through AWS Lambda. The advantage of this approach is that you don't need to deploy any computing infrastructure (the SageMaker notebook instance in the first option). The Lambda function performs the same fetch, parse, and render steps as the first option.
The following diagram illustrates this architecture.
The following is the code for the Lambda function:
```python
import boto3
import json
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import bar_chart_race as bcr

# point this to the ffmpeg path in your Lambda layer
plt.rcParams['animation.ffmpeg_path'] = '/opt/ffmpeg/bin/ffmpeg'

s3_res = boto3.resource('s3')
bucket_name = 'YourBucketHere'
prefix = 'GTFolder/annotations/worker-response/iteration-1/'


def lambda_handler(event, context):
    paginator = boto3.client('s3').get_paginator('list_objects_v2')
    pages = paginator.paginate(Bucket=bucket_name, Prefix=prefix)
    times = []
    subs = []
    for page in pages:
        for work in page['Contents']:
            content_object = s3_res.Object(bucket_name, work['Key'])
            file_content = content_object.get()['Body'].read().decode('utf-8')
            json_content = json.loads(file_content)
            times.append(json_content['answers'][0]['submissionTime'])
            subs.append(json_content['answers'][0]['workerMetadata']['identityData']['sub'])

    # this is where one would map back to the real names of the labelers,
    # possibly using Cognito for the sub -> name correspondence
    sub_map = {s: f'Name {i}' for i, s in enumerate(np.unique(subs))}

    df = pd.DataFrame({'times': times, 'subs': subs})
    df['subs'] = df['subs'].map(sub_map)
    df['date'] = pd.to_datetime(df.times).dt.date
    df['hours'] = pd.to_datetime(df.times).dt.strftime('%Y-%m-%d %H:30')
    df['count'] = 1
    counts_per_sub_per_date = df.groupby(['hours', 'subs'])['count'].count().unstack()
    counts_per_sub_per_date_cum = counts_per_sub_per_date.fillna(0).cumsum()

    bcr.bar_chart_race(
        df=counts_per_sub_per_date_cum.iloc[:100],
        filename='/tmp/barchart.mp4',
        orientation='h',
        sort='desc',
        # n_bars=len(counts_per_sub.columns),
        fixed_order=False,
        fixed_max=True,
        steps_per_period=5,
        interpolate_period=False,
        label_bars=True,
        bar_size=.95,
        period_label={'x': .99, 'y': .25, 'ha': 'right', 'va': 'center'},
        # period_fmt='%B %d, %Y',
        period_summary_func=lambda v, r: {'x': .99, 'y': .18,
                                          's': f'Total labels: {v.sum():,.0f}',
                                          'ha': 'right', 'size': 8, 'family': 'Courier New'},
        perpendicular_bar_func='median',
        period_length=50,
        figsize=(5, 3),
        dpi=144,
        cmap='dark12',
        title='Who is going to be the top labeller?',
        bar_label_size=7,
        tick_label_size=7,
        shared_fontdict={'family': 'Helvetica', 'color': '.1'},
        scale='linear',
        writer=None,
        fig=None,
        bar_kwargs={'alpha': .7},
        filter_column_colors=False)

    boto3.client('s3').upload_file('/tmp/barchart.mp4', bucket_name, 'barchart/barchart.mp4')
    return {
        'statusCode': 200,
        'body': json.dumps('Hello from Lambda!')
    }
```
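To invoke the function on a fixed schedule, you can wire it to an Amazon EventBridge rule. The CLI sketch below is one way to do this; the function name (`barchart-race-lambda`), rule name, Region, and account ID `123456789012` are placeholders to substitute with your own values.

```shell
# Create a rule that fires once a day (use "rate(7 days)" for weekly)
aws events put-rule \
  --name daily-barchart-race \
  --schedule-expression "rate(1 day)"

# Allow EventBridge to invoke the Lambda function
aws lambda add-permission \
  --function-name barchart-race-lambda \
  --statement-id eventbridge-daily-barchart \
  --action lambda:InvokeFunction \
  --principal events.amazonaws.com \
  --source-arn arn:aws:events:us-east-1:123456789012:rule/daily-barchart-race

# Point the rule at the function
aws events put-targets \
  --rule daily-barchart-race \
  --targets 'Id=1,Arn=arn:aws:lambda:us-east-1:123456789012:function:barchart-race-lambda'
```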
When you finish this exercise, remove your resources: delete the S3 objects and the Lambda function and, if you followed the first option, stop and delete the SageMaker notebook instance.
This post demonstrated how to use Ground Truth partial labeling data loaded in Amazon S3 to gamify labeling workflows by periodically creating a bar chart race. In our experience, engaging workers with a bar chart race sparks a fruitful competition among them, speeds up labeling, and increases engagement and satisfaction.
Get started today! You can learn more about Ground Truth and kick off your own labeling and gamification processes by visiting the SageMaker console.
Daniele Angelosante is a Senior Engagement Manager with AWS Professional Services. He is passionate about AI/ML projects and products. In his free time, he likes coffee, sport, soccer, and baking.
Andrea Di Simone is a Data Scientist in the Professional Services team based in Munich, Germany. He helps customers to develop their AI/ML products and workflows, leveraging AWS tools. He enjoys reading, classical music and hiking.
Othmane Hamzaoui is a Data Scientist working in the AWS Professional Services team. He is passionate about solving customer challenges using Machine Learning, with a focus on bridging the gap between research and business to achieve impactful outcomes. In his spare time, he enjoys running and discovering new coffee shops in the beautiful city of Paris.