Feature engineering is expensive and time-consuming, which may lead you to adopt a feature store for managing features across teams and models. Unfortunately, machine learning (ML) lineage solutions have yet to adapt to this new concept of feature management. To achieve the full benefits of a feature store by enabling feature reuse, you need to be able to answer fundamental questions about features. For example, how were these features built? What models are using these features? What features does my model depend on? What features are built with this data source?
Amazon SageMaker provides two important building blocks to enable answering key feature lineage questions: SageMaker Feature Store and SageMaker ML Lineage Tracking.
In this post, we explain how to extend ML lineage to include ML features and feature processing, which can help data science teams move to proactive management of features. We provide a complete sample notebook showing how to easily add lineage tracking to your workflow. You then use that lineage to answer key questions about how models and features are built and what models and endpoints are consuming them.
Feature lineage plays an important role in helping organizations scale their ML practice beyond the first few successful models to cover needs that emerge when they have multiple data science teams building and deploying hundreds or thousands of models. Consider the following diagram, showing a simplified view of the key artifacts and associations for a small set of models.
Imagine trying to manually track all of this for a large team, multiple teams, or even multiple business units. Lineage tracking and querying helps make this more manageable and helps organizations move to ML at scale. The following are four examples of how feature lineage helps scale the ML process:
The following diagram shows a sample set of ML lifecycle steps, artifacts, and associations that are typically needed for model lineage when using a feature store.
These components include the following:
There is no “one size fits all” approach to an overall model pipeline. This is simply an example, and you can adapt it to cover how your teams operate to meet your specific lineage requirements. The underlying APIs are flexible enough to cover a broad range of approaches.
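For context, the following is a minimal sketch of what those underlying APIs look like when called directly through the SageMaker Python SDK's lineage module. The artifact names, S3 URI, and feature group ARN are placeholder assumptions; in practice, the helper library introduced next creates these artifacts and associations on your behalf.

import sagemaker
from sagemaker.lineage.artifact import Artifact
from sagemaker.lineage.association import Association

sagemaker_session = sagemaker.Session()

# Register an artifact for a raw data source (the S3 URI is a placeholder).
data_artifact = Artifact.create(
    artifact_name='claims-raw-data',
    source_uri='s3://my-bucket/claims/raw/claims.csv',
    artifact_type='DataSet',
    sagemaker_session=sagemaker_session,
)

# Register an artifact for a feature group (the ARN is a placeholder).
fg_artifact = Artifact.create(
    artifact_name='claims-feature-group',
    source_uri='arn:aws:sagemaker:us-east-1:123456789012:feature-group/claims',
    artifact_type='FeatureGroup',
    sagemaker_session=sagemaker_session,
)

# Record that the data source contributed to the feature group.
Association.create(
    source_arn=data_artifact.artifact_arn,
    destination_arn=fg_artifact.artifact_arn,
    association_type='ContributedTo',
    sagemaker_session=sagemaker_session,
)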
Let’s walk through how to instrument your code to easily capture these associations. Our example uses a custom wrapper library we built around SageMaker ML Lineage Tracking. This library is a wrapper around the SageMaker SDK to support ease of lineage tracking across the ML lifecycle. Lineage artifacts include data, code, feature groups, features in a feature group, feature group queries, training jobs, and models.
First, we import the library:
from ml_lineage_helper import *
Next, you ideally want your lineage to also track the code you used to process your data with SageMaker Processing jobs and the code you used to train your model in SageMaker. If this code is version controlled (which we highly recommend!), we can reconstruct what those URL links would be in your chosen Git hosting platform, such as GitHub or GitLab:
import os

processing_code_repo_url = get_repo_link(os.getcwd(), 'processing.py')
training_code_repo_url = get_repo_link(os.getcwd(), 'pytorch-model/train_deploy.py', processing_code=False)
repo_links = [processing_code_repo_url, training_code_repo_url]
Finally, we create the lineage. Many of the inputs are optional, but in this example, we assume the following:
ml_lineage = MLLineageHelper()
lineage = ml_lineage.create_ml_lineage(
    estimator,
    model_name=model_name,
    query=query,
    sagemaker_processing_job_description=preprocessing_job_description,
    feature_group_names=['customers', 'claims'],
    repo_links=repo_links,
)
lineage
The following screenshot shows our results.
The call returns a pandas dataframe representing the lineage graph of artifacts that were created and associated on your behalf. It provides names, associations (such as Produced or ContributedTo), and ARNs that uniquely identify resources.
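Because the result is a regular pandas DataFrame, you can slice it with standard pandas operations to focus on the associations you care about. The column names below ('Association Type', 'Source Name', 'Destination Name') are assumptions for illustration; inspect lineage.columns to see the exact schema the helper returns:

# Illustrative only: the column names are assumptions; check lineage.columns
# for the actual schema returned by the helper library.
print(lineage.columns)

# Narrow the lineage graph to 'Produced' associations (for example, the
# training job that produced the model).
produced = lineage[lineage['Association Type'] == 'Produced']
print(produced[['Source Name', 'Destination Name']])

# Persist the lineage graph for audit or reporting purposes.
lineage.to_csv('model_lineage.csv', index=False)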
Now that the lineage is in place, you can use it to answer key questions about your features and models. Keep in mind that the full benefit of this lineage tracking comes when this practice is adopted across many data scientists working with large numbers of features and models.
Let’s look at some examples of what you can do with the lineage data now that lineage tracking is in place.
As a data scientist, you might be planning to use a specific data source. To avoid reinventing features that are based on the same raw data as existing features, you want to look at all the features that have already been built and are in production using that same data source. A simple call can get you that insight:
from ml_lineage_helper.query_lineage import QueryLineage

query_lineage = QueryLineage()
query_lineage.get_feature_groups_from_data_source(artifact_arn_or_s3_uri)
The following screenshot shows our results.
Or maybe you’re considering using a specific feature group, and you want to know what data sources are associated with it:
query_lineage.get_data_sources_from_feature_group(artifact_or_fg_arn, max_depth=3)
You might also need to audit a model or a set of model predictions. If incorrect or biased predictions occurred in production, your team needs answers about how this happened. Given a model, you can query the lineage to see all the steps in the ML lifecycle that were used to create it:
ml_lineage = MLLineageHelper(sagemaker_model_name_or_model_s3_uri='my-sagemaker-model-name')
ml_lineage.df
As more and more features are made available in a centralized feature store, owners of specific features need to plan for the evolution of feature groups, and eventually even the deprecation of old features. These feature owners need to know which models are using their features so they can assess the impact and identify who they need to work with. You can do this with the following code:
query_lineage.get_models_from_feature_group(artifact_or_fg_arn)
The following screenshot shows our results.
You can also reverse the question and find out which feature groups are associated with a given model:
query_lineage.get_feature_groups_from_model(artifact_arn_or_model_name)
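Because these queries return DataFrames, you can combine them into a simple impact report, for example before deprecating a feature group. The following snippet is an illustrative sketch: feature_group_impact_report is a hypothetical helper we define here, and the 'Name' column is an assumption about the returned schema.

# Illustrative sketch: the 'Name' column is an assumption about the schema
# returned by get_models_from_feature_group; inspect the DataFrame to confirm.
def feature_group_impact_report(feature_group_arn):
    """Print the models that depend on a feature group before deprecating it."""
    models_df = query_lineage.get_models_from_feature_group(feature_group_arn)
    if models_df is None or len(models_df) == 0:
        print('No downstream models found for this feature group.')
        return models_df
    print(f'{len(models_df)} model(s) depend on this feature group:')
    for name in models_df.get('Name', models_df.iloc[:, 0]):
        print(f'  - {name}')
    return models_df

feature_group_impact_report(artifact_or_fg_arn)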
In this post, we discussed the importance of tracking ML lineage, the aspects of the ML lifecycle that you should track and add to the lineage, and how to use SageMaker to provide end-to-end ML lineage. We also covered how to incorporate Feature Store as you move towards reusable features across teams and models, and finally how to use the helper library to accomplish end-to-end ML lineage tracking. To try out the Feature Store end-to-end lifecycle, including a module on lineage, explore the Feature Store Workshop and the notebooks for all the modules on GitHub. You can also extend this approach to cover your unique requirements. Visit the ML Lineage helper library we built, and try out the example notebook.
Bobby Lindsey is a Machine Learning Specialist at Amazon Web Services. He’s been in technology for over a decade, spanning various technologies and multiple roles. He is currently focused on combining his background in software engineering, DevOps, and machine learning to help customers deliver machine learning workflows at scale. In his spare time, he enjoys reading, research, hiking, biking, and trail running.
Mark Roy is a Principal Machine Learning Architect for AWS, helping customers design and build AI/ML solutions. Mark’s work covers a wide range of ML use cases, with a primary interest in computer vision, deep learning, and scaling ML across the enterprise. He has helped companies in many industries, including insurance, financial services, media and entertainment, healthcare, utilities, and manufacturing. Mark holds six AWS certifications, including the ML Specialty Certification. Prior to joining AWS, Mark was an architect, developer, and technology leader for over 25 years, including 19 years in financial services.
Mohan Pasappulatti is a Senior Solutions Architect at AWS, based in San Francisco, USA. Mohan helps high profile disruptive startups and strategic customers architect and deploy distributed applications, including machine learning workloads in production on AWS. He has over 20 years of work experience in several roles like engineering leader, chief architect and principal engineer. In his spare time, Mohan loves to cheer his college football team (LSU Tigers!), play poker, ski, watch the financial markets, play volleyball and spend time outdoors.