Business analysts work with data, and they analyze, explore, and interpret it to achieve effective business outcomes. To address business problems, they often rely on machine learning (ML) practitioners, such as data scientists, to build models from existing data and generate predictions. However, this isn’t always possible, because data scientists are typically tied up with their own tasks and don’t have the bandwidth to help the analysts.
To be independent and achieve your goals as a business analyst, it would be ideal to work with easy-to-use, intuitive, and visual tools that apply ML without requiring you to know the underlying details or write code. Using these tools will help you solve your business problems and achieve the desired outcomes.
With a goal to help you and your organization become more effective, and use ML without writing code, we introduced Amazon SageMaker Canvas. This is a no-code ML solution that helps you build accurate ML models without the need to learn about technical details, such as ML algorithms and evaluation metrics. SageMaker Canvas offers a visual, intuitive interface that lets you import data, train ML models, perform model analysis, and generate ML predictions, all without writing a single line of code.
When using SageMaker Canvas to experiment, you may encounter data quality issues such as missing values or an incorrect problem type. These issues may not be discovered until quite late in the process, after training an ML model. To alleviate this challenge, SageMaker Canvas now supports data validation. This feature proactively checks for issues in your data and provides guidance on resolutions.
In this post, we’ll demonstrate how you can use the data validation capability within SageMaker Canvas prior to model building. As the name suggests, this feature validates your dataset, reports issues, and provides useful pointers to fix them. By using better quality data, you will end up with a better performing ML model.
Data Validation is a new feature in SageMaker Canvas to proactively check for potential data quality issues. After you import the data and select a target column, you’re given a choice to validate your data as shown here:
If you choose to validate your data, Canvas analyzes your data for numerous conditions including:
Details for each validation criterion are provided in the later sections of this post.
If all of the checks pass, then you’ll get the following confirmation: “No issues have been found in your dataset”.
If any issue is found, you’ll get a notification that you can view to understand the problem. This surfaces the data quality issues early, and it lets you address them immediately, before you spend time and resources further along in the process.
You can make your adjustments and keep validating your dataset until all of the issues are addressed.
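SageMaker Canvas runs these checks for you without any code. Purely to build intuition for what a validation pass over a dataset looks like, here is a minimal Python sketch that uses only the standard library. It is not how SageMaker Canvas is implemented. The function names (`is_missing`, `validate_rows`), the list of placeholder tokens, the two target-column checks, and the sample columns are all assumptions made for illustration.

```python
import csv
import io

# Assumption: strings treated as "missing" in this sketch (not a Canvas setting).
MISSING_TOKENS = {"", "na", "n/a", "nan", "null", "none"}


def is_missing(value):
    """Treat None, blanks, and common placeholder strings as missing."""
    return value is None or value.strip().lower() in MISSING_TOKENS


def validate_rows(rows, target):
    """Return a list of human-readable issues; an empty list means no issues."""
    if not rows:
        return ["Dataset has no rows"]
    if target not in rows[0]:
        return [f"Target column '{target}' not found"]

    issues = []

    # Check 1: missing values in the target column.
    missing_target = sum(1 for row in rows if is_missing(row[target]))
    if missing_target:
        issues.append(
            f"Target column '{target}' has {missing_target} missing value(s)"
        )

    # Check 2: a target with fewer than two distinct values cannot be learned.
    labels = {row[target].strip() for row in rows if not is_missing(row[target])}
    if len(labels) < 2:
        issues.append(f"Target column '{target}' has fewer than 2 distinct values")

    # Check 3: missing values in each feature column.
    for column in rows[0]:
        if column == target:
            continue
        missing = sum(1 for row in rows if is_missing(row.get(column)))
        if missing:
            issues.append(f"Feature column '{column}' has {missing} missing value(s)")

    return issues


# Hypothetical sample data: one missing feature value and one missing target value.
SAMPLE_CSV = """age,plan,churned
34,basic,yes
51,,no
29,premium,
42,basic,no
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
issues = validate_rows(rows, "churned")
for issue in issues or ["No issues have been found in your dataset"]:
    print(issue)
```

Running the same function again after each fix mirrors the loop described above: adjust the data, validate, and repeat until the list of issues is empty.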
When you’re building an ML model in SageMaker Canvas, several data quality issues related to the target column may cause your model build to fail. SageMaker Canvas checks for different kinds of problems that may impact your target column.
If you get any of the above warnings for your target column, then use the following steps to mitigate the issues:
Refer to the SageMaker Canvas data transformation documentation to perform the imputation steps mentioned above.
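In SageMaker Canvas, imputation is configured through the visual interface, so no code is required. For readers curious about what imputation does to a column, the following is a minimal Python sketch under stated assumptions: missing entries are represented as `None`, and the helper `impute_column` with its `"mean"` and `"mode"` strategies is a hypothetical illustration, not a SageMaker Canvas API.

```python
from collections import Counter
from statistics import mean


def impute_column(values, strategy):
    """Replace None entries with the column mean (numeric) or mode (categorical)."""
    present = [v for v in values if v is not None]
    if not present:
        raise ValueError("Cannot impute a column with no observed values")

    if strategy == "mean":
        fill = mean(present)
    elif strategy == "mode":
        # most_common(1) returns [(value, count)] for the most frequent value.
        fill = Counter(present).most_common(1)[0][0]
    else:
        raise ValueError(f"Unknown strategy: {strategy}")

    return [fill if v is None else v for v in values]


# Hypothetical columns with one missing entry each.
ages = [34.0, None, 29.0, 42.0]
plans = ["basic", None, "premium", "basic"]

imputed_ages = impute_column(ages, "mean")
imputed_plans = impute_column(plans, "mode")
print(imputed_ages)
print(imputed_plans)
```

Mean imputation suits numeric columns and mode imputation suits categorical ones. Whether to fill a missing value or drop the row instead depends on the column and on how much data is affected.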
Aside from the target column, you may run into data quality issues with other data columns (feature columns) as well. Feature columns are the input data used to make an ML prediction.
To avoid incurring future session charges, log out of SageMaker Canvas.
SageMaker Canvas is a no-code ML solution that allows business analysts to create accurate ML models and generate predictions through a visual, point-and-click interface. We showed you how SageMaker Canvas helps you ensure data quality and mitigate data issues by proactively validating the dataset. By identifying the issues early, SageMaker Canvas helps you build quality ML models and reduce build iterations without expertise in data science and programming. To learn more about this new feature, refer to the SageMaker Canvas documentation.
To get started and learn more about SageMaker Canvas, refer to the following resources:
Hariharan Suresh is a Senior Solutions Architect at AWS. He is passionate about databases, machine learning, and designing innovative solutions. Prior to joining AWS, Hariharan was a product architect, core banking implementation specialist, and developer, and worked with BFSI organizations for over 11 years. Outside of technology, he enjoys paragliding and cycling.
Sainath Miriyala is a Senior Technical Account Manager at AWS working for automotive customers in the US. Sainath is passionate about designing and building large-scale distributed applications using AI/ML. In his spare time, Sainath spends time with family and friends.
James Wu is a Senior AI/ML Specialist Solution Architect at AWS, helping customers design and build AI/ML solutions. James’s work covers a wide range of ML use cases, with a primary interest in computer vision, deep learning, and scaling ML across the enterprise. Prior to joining AWS, James was an architect, developer, and technology leader for over 10 years, including 6 years in engineering and 4 years in the marketing and advertising industries.