Data is at the heart of machine learning (ML). Including relevant data to comprehensively represent your business problem ensures that you effectively capture trends and relationships so that you can derive the insights needed to drive business decisions. With Amazon SageMaker Canvas, you can now import data from over 40 data sources to be used for no-code ML. Canvas expands access to ML by providing business analysts with a visual interface that allows them to generate accurate ML predictions on their own—without requiring any ML experience or having to write a single line of code. Now, you can import data in-app from popular relational data stores such as Amazon Athena as well as third-party software as a service (SaaS) platforms supported by Amazon AppFlow such as Salesforce, SAP OData, and Google Analytics.
The process of gathering high-quality data for ML can be complex and time-consuming, because the proliferation of SaaS applications and data storage services has created a spread of data across a multitude of systems. For example, you may need to conduct a customer churn analysis using customer data from Salesforce, financial data from SAP, and logistics data from Snowflake. To create a dataset across these sources, you need to log into each application individually, select the desired data, and export it locally, where it can then be aggregated using a different tool. This dataset then needs to be imported into a separate application for ML.
With this launch, Canvas empowers you to capitalize on data stored in disparate sources by supporting in-app data import and aggregation from over 40 data sources. This feature is made possible through new native connectors to Athena and to Amazon AppFlow via the AWS Glue Data Catalog. Amazon AppFlow is a managed service that enables you to securely transfer data from third-party SaaS applications to Amazon Simple Storage Service (Amazon S3) and catalog the data with the Data Catalog with just a few clicks. After your data is transferred, you can simply access the data source within Canvas, where you can view table schemas, join tables within or across data sources, write Athena queries, and preview and import your data. After your data is imported, you can use existing Canvas functionalities such as building an ML model, viewing column impact data, or generating predictions. You can automate the data transfer process in Amazon AppFlow to activate on a schedule to ensure that you always have access to the latest data in Canvas.
The steps outlined in this post provide two examples of how to import data into Canvas for no-code ML. In the first example, we demonstrate how to import data through Athena. In the second example, we show how to import data from a third-party SaaS application via Amazon AppFlow.
In this section, we show an example of importing data in Canvas from Athena to conduct a customer segmentation analysis. We create an ML classification model to categorize our customer base into four different classes, with the end goal to use the model to predict which class a new customer will fall into. We follow three major steps: import the data, train a model, and generate predictions. Let’s get started.
To import data from Athena, complete the following steps:
The following screenshot shows an example of the preview table.
In our example, we segment customers based on the marketing channel through which they have engaged our services. This is specified by the column segmentation, where A is print media, B is mobile, C is in-store promotions, and D is television.
Your data is imported into Canvas as a dataset from the specific table in Athena.
After your data is imported, it shows up on the Datasets page. At this stage, you can build a model. To do so, complete the following steps:
On the Build page, you can see statistics about your dataset, such as the percentage of missing values and mean of the data.
Canvas offers two types of models that can generate predictions. Quick build prioritizes speed over accuracy, providing a model in 2–15 minutes. Standard build prioritizes accuracy over speed, providing a model in 2–4 hours.
The following model categorizes customers correctly 94.67% of the time.
On the Predict tab, you can generate both batch predictions and single predictions. Complete the following steps:
For our prediction, we want to understand what segmentation a customer will be if they are 32 years old and a lawyer by profession.
The updated prediction is displayed in the prediction window. In this example, a 32-year old lawyer is classified in segment D.
To import data from third-party SaaS applications into Canvas for no-code ML, you must first transfer data from the application to Amazon S3 via Amazon AppFlow. In this example, we transfer manufacturing data from SAP OData.
To transfer your data, complete the following steps:
Alternatively, you can automate the flow transfer by selecting Run flow on schedule.
When the flow is created, a green ribbon will populate at the top of the page indicating that it is successfully updated.
At this stage, you have successfully transferred your data from SAP OData to Amazon S3.
Now you can import the data from within the Canvas app. To import your data from Canvas, follow the same set of steps as described in the Data import section earlier in this post. For this example, on the Data source drop-down menu on the Data import page, you can see SAP OData listed.
You are now able to use all existing Canvas functionalities, such as cleaning your data, building an ML model, viewing column impact data, and generating predictions.
To clean up the resources provisioned, log out of the Canvas application by choosing Log out in the navigation pane.
With Canvas, you can now import data for no-code ML from 47 data sources through native connectors with Athena and Amazon AppFlow via the AWS Glue Data Catalog. This process enables you to directly access and aggregate data across data sources within Canvas after data is transferred via Amazon AppFlow. You can automate the data transfer to activate on a schedule, which means that you don’t have to go through the process again to refresh your data. With this process, you can create new datasets with your latest data without having to leave the Canvas app. This feature is now available in all AWS Regions where Canvas is available. To get started with importing your data, navigate to the Canvas console and follow the steps outlined in this post. To learn more, refer to Connect to data sources.
Brandon Nair is a Senior Product Manager for Amazon SageMaker Canvas. His professional interest lies in creating scalable machine learning services and applications. Outside of work he can be found exploring national parks, perfecting his golf swing or planning an adventure trip.
Sanjana Kambalapally is a Software Development Manager for AWS Sagemaker Canvas, which aims at democratizing machine learning by building no code ML applications.
Xin Xu is a software development engineer in the Canvas team, where he works on data preparation, among other aspects in no-code machine learning products. In his spare time, he enjoys jogging, reading and watching movies.
Volkan Unsal is a Sr. Frontend Engineer in the Canvas team, where he builds no-code products to make artificial intelligence accessible to humans. In his spare time, he enjoys running, reading, watching e-sports, and martial arts.