[]Starting today, we’re releasing new tools for multimodal financial analysis within Amazon SageMaker JumpStart. SageMaker JumpStart helps you quickly and easily get started with machine learning (ML) and provides a set of solutions for the most common use cases that can be trained and deployed readily with just a few clicks. You can now access a collection of multimodal financial text analysis tools, including example notebooks, text models, and solutions.
[]With these new tools, you can enhance your tabular ML workflows with new insights from financial text documents and potentially save weeks of development time. With the new SageMaker JumpStart Industry SDK, you can easily retrieve common public financial documents, including SEC filings, and further process financial text documents with features such as summarization and scoring of the text for various attributes, such as sentiment, litigiousness, risk, and readability. In addition, you can access pre-trained language models trained on financial text for transfer learning, and use example notebooks for data retrieval, text feature engineering, and multimodal classification and regression models.
[]In this post, we show how to curate a dataset of SEC filings and financial variables, use natural language processing (NLP) for feature engineering on the dataset, and undertake multimodal ML to build a better ratings classifier.
[]The new financial analysis features include an example notebook that demonstrates APIs to retrieve parsed SEC filings, APIs for summarizers, and APIs to score text for various attributes (see SageMaker JumpStart SEC Filings Retrieval w/Summarizer and Scoring). A second notebook (Multi-category ML on SEC Filings Data) demonstrates multicategory classification on SEC filings. A third notebook (ML on a TabText (Multimodal) Dataset) shows how to undertake ML on multimodal financial data using the Paycheck Protection Program (PPP) as an example. Four additional text models (RoBERTa-SEC-Base, RoBERTa-SEC-WIKI-Base, RoBERTa-SEC-Large, and RoBERTa-SEC-WIKI-Large) are provided to generate embeddings for transfer learning using pre-trained financial models that have been trained on Wiki text and 10 years of SEC filings.
[]Finally, a SageMaker JumpStart solution (Corporate Credit Rating Prediction) demonstrates how to use the pipeline of SEC filings (long-form text data) and financial ratios (tabular data) to build corporate credit rating prediction models. This is the model discussed in this post, which is the first in a series of posts that describe these new financial analysis ML tools. In this post, we explain how you can use this solution for credit scoring, which is fully customizable so you can accelerate your ML journey.
[]We’re all familiar with individual credit scoring, especially our own credit scores, from FICO. In this notebook, we revisit the oldest and one of the most widely used models for corporate credit scoring, the Altman Z-score. The Altman model generates a credit score, where higher scores denote higher credit quality and lower scores denote lower quality firms.
[]Altman developed his model in 1968, using just 66 firms’ data to fit an accurate bankruptcy prediction model. It predicted which firms would default within 1 year. Altman fit this model using Linear Discriminant Analysis (LDA), arguably the first instance of the use of an ML algorithm in academic finance. This seminal paper has generated a family of Altman Z-score models that are used all over the globe. The model only requires a few inputs from a company’s financials and therefore may be applied to public and private firms, small and large. It’s in widespread use today. It uses tabular data.
[]In this post, you learn how to use a credit scoring model such as Altman’s Z-score, and enhance the model with financial text from SEC filings. The entire model is presented in the SageMaker JumpStart solution model card titled Corporate Credit Rating Prediction.
[]
[]The preceding model card appears in SageMaker JumpStart. You can access this model card through SageMaker Studio.
[]Navigate to that card and deploy the model by choosing Launch.
[]
[]The following page appears.
[]
[]You can see a model that is deployed for inference and an endpoint. Wait until they’re ready and show the status Complete. Choose Open Notebook to open the first notebook, which is for training and endpoint deployment. You can work through this notebook to learn how to use this solution and then modify it for any other application you may want on your own data. The solution comes with synthetic data and uses a subset of it to exemplify the steps needed to train the model, deploy it to an endpoint, and invoke the endpoint for inference. The notebook also contains code to deploy an endpoint of your own.
[]To open the second notebook, choose Use Endpoint in Notebook. This opens the inference notebook to use the already deployed example endpoint. In the inference notebook, you can see how to prepare the data to invoke the example endpoint to do inference on a batch of examples. The endpoint returns predicted ratings, as shown in the following screenshot, in the last code block of the inference notebook.
[]
[]You can use this solution as a template for a text-enhanced credit rating model. It shows how to take a model based on numeric features (in this case Altman’s famous five variables) combined with SEC filings text so as to achieve a material improvement in the prediction of credit ratings. You’re not restricted to the Altman variables and can add more variables as needed, or change out the variables completely. The main objective in this notebook is to show how to enhance Altman’s Z-score model with text so you can use ML techniques to achieve a best-in-class model.
[]The Altman model is widely used by a range of users and is therefore taught as part of required coursework by the Corporate Finance Institute (CFI). Altman himself offered a 50-year retrospective on the model in parts 1, 2, and 3, discussing its wide use and misuse. To learn more, watch him on video and read this article. For a critique and improvement on the model, see the following article by Seeking Alpha, a well-known investor community. The Z-score Plus model is even available as an app on mobile devices.
[]Therefore, think of this workflow as a well-established starting point for the use of ML for credit scoring.
[]To begin, run the notebooks on the example data included with them to see how simple this solution is to use. You can then modify the notebooks for your own model. The modification includes the following steps:
[]That’s it! The solution is self-contained and works with a few clicks.
[]Important: This solution is for demonstrative purposes only. It is not financial advice and should not be relied on as financial or investment advice. The associated notebooks, including the trained model, use synthetic data, and are not intended for production.
[]You may want to explore the solution further. In the appendix, we offer more detail on credit scoring and some additional simple code to show how to add SEC text to standard tabular features to undertake multimodal ML. All these functionalities are made simple using APIs in SageMaker JumpStart models. We cover the following:
[]We have seen how to enhance tabular ML models for credit scoring with long-form financial text. You can adapt the training notebook and the inference notebook in the JumpStart solution Corporate Credit Rating Prediction with your own data and labels as follows:
[]SEC filings aren’t the only text that you can use. You can use any text that contains information about the label. For example, the text of internal rating analyses may be even better than SEC filings.
[]To get started, you can find the Corporate Credit Rating Prediction solution in SageMaker JumpStart in SageMaker Studio. For more information, see SageMaker JumpStart.
[]Legal Disclaimers: This post is for demonstrative purposes only. It is not financial advice and should not be relied on as financial or investment advice. This post uses data obtained from the SEC EDGAR database. You are responsible for complying with EDGAR’s access terms and conditions.
[]Thanks to several team members for support with this work: Miyoung Choi, Vinay Hanumaiah, Cuong Nguyen, Xavier Ragot, Derrick Zhang, Li Zhang, Yue Zhao, Daniel Zhu
[]In this appendix, we discuss topics related to this solution.
[]The model is based on a well-known bankruptcy prediction approach, from the original paper by Ed Altman (1968). For a brief summary, see Measuring the ‘fiscal-fitness’ of a company: The Altman Z-Score.
[]The original seminal paper by Altman is available at: Altman, Edward. (September 1968). “Financial Ratios, Discriminant Analysis and the Prediction of Corporate Bankruptcy”. Journal of Finance, v23(4): 589–609. [doi:10.1111/j.1540-6261.1968.tb00843.x]
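[]The model uses eight inputs from a company’s financials: current assets, current liabilities, total liabilities, EBIT (earnings before interest and taxes), total assets, net sales, retained earnings, and market value of equity.
[]These eight inputs translate into the following five financial ratios: A = EBIT / total assets; B = net sales / total assets; C = market value of equity / total liabilities; D = working capital (current assets minus current liabilities) / total assets; E = retained earnings / total assets.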
[]These ratios are used to fit binary class data of companies that go bankrupt and those that do not. Altman fitted the model using Linear Discriminant Analysis, possibly the earliest use of ML in finance. The linear discriminant function is as follows:
[]Zscore = 3.3A+0.99B+0.6C+1.2D+1.4E
[]These translate into suggested company credit quality ranges, which may vary by use, such as in the following example:
[]We enhance the Altman five-feature set (A,B,C,D,E stated above) with text from SEC filings to get an improved Z-score model.
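To make the discriminant function concrete, the following minimal sketch evaluates it for a single firm. The coefficients and ratio definitions are the ones used in this post; the firm itself and all of its balance sheet numbers are hypothetical and chosen only for illustration.

```python
def altman_z_score(a, b, c, d, e):
    """Z-score with the coefficients quoted in this post.

    a = EBIT / total assets
    b = net sales / total assets
    c = market value of equity / total liabilities
    d = working capital / total assets
    e = retained earnings / total assets
    """
    return 3.3 * a + 0.99 * b + 0.6 * c + 1.2 * d + 1.4 * e


# Hypothetical firm (illustrative values only, in millions).
total_assets = 1000.0
ebit = 120.0
net_sales = 900.0
mkt_value_equity = 800.0
total_liabs = 500.0
current_assets = 400.0
current_liabs = 250.0
retained_earnings = 200.0

z = altman_z_score(
    a=ebit / total_assets,
    b=net_sales / total_assets,
    c=mkt_value_equity / total_liabs,
    d=(current_assets - current_liabs) / total_assets,
    e=retained_earnings / total_assets,
)
print(f"Z-score: {z:.3f}")
```

Because the function is linear, each ratio contributes independently to the score, which makes it easy to see which part of the balance sheet drives a firm’s result.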
[]We created a synthetic dataset that combined randomly chosen SEC filings with simulated financial data. Briefly, we created the synthetic dataset using the following steps (ticker names have not been included, so as to not cause confusion with real tickers):
[]The final dataset (stored as CCR_data.csv) comprises the MD&A text, industry code, and eight financial variables. The last column contains the rating, namely, the label for classification. The data contains seven categories of labels: AAA, AA, A, BBB, BB, B, CCC. These labels are not reflective of companies’ actual credit ratings since they are based on synthetically generated data. This synthetic dataset is automatically downloaded when you run the training notebook in the JumpStart solution Corporate Credit Rating Prediction, described earlier.
[]The following code is a template for constructing a text-enhanced credit rating model. It shows how to take a model based on numeric features (in this case Altman’s five variables) combined with SEC filings text so as to achieve a material improvement in the prediction of credit ratings. In this example, we observe an 8 percentage point increase in accuracy (from 85% to 93% on our example test data) when text is added.
[]SEC filings are retrieved from the SEC’s Electronic Data Gathering, Analysis, and Retrieval (EDGAR) website, which provides open data access (note the disclaimer in this post). EDGAR is the primary system under the US Securities and Exchange Commission (SEC) for companies and others submitting documents under the Securities Act of 1933, the Securities Exchange Act of 1934, the Trust Indenture Act of 1939, and the Investment Company Act of 1940. EDGAR contains millions of company and individual filings. The system processes about 3,000 filings per day, serves up 3,000 terabytes of data to the public annually, and accommodates 40,000 new filers per year on average. In the following code, we provide a simple, single API call that creates a dataset in a few lines of code, for any period of time and for a large number of tickers.
[]The API contains three parts:
[]This kicks off the processing job running in a SageMaker container and makes sure that even a very large retrieval can run without the notebook connection.
    %%time
    # Import paths follow the JumpStart example notebooks (an assumption);
    # verify them against your installed smjsindustry version.
    import sagemaker
    from smjsindustry.finance.processor import DataLoader
    from smjsindustry.finance.processor_config import EDGARDataSetConfig

    # bucket and sec_processed_folder must be defined beforehand.
    dataset_config = EDGARDataSetConfig(
        tickers_or_ciks=['amzn', ..., 'FB'],       # list of stock tickers or CIKs (elided in the original)
        form_types=['10-K', '10-Q'],               # list of SEC form types
        filing_date_start='2019-01-01',            # starting filing date
        filing_date_end='2020-12-31',              # ending filing date
        email_as_user_agent='test-user@test.com')  # user agent email

    data_loader = DataLoader(
        role=sagemaker.get_execution_role(),    # loading job execution role
        instance_count=1,                       # number of instances; limit varies with instance type
        instance_type='ml.c5.2xlarge',          # instance type
        volume_size_in_gb=30,                   # size in GB of the EBS volume to use
        volume_kms_key=None,                    # KMS key for the processing volume
        output_kms_key=None,                    # KMS key ID for processing job outputs
        max_runtime_in_seconds=None,            # timeout in seconds; default is 24 hours
        sagemaker_session=sagemaker.Session(),  # session object
        tags=None)                              # a list of key-value pairs

    data_loader.load(
        dataset_config,
        's3://{}/{}/{}'.format(bucket, sec_processed_folder, 'output'),  # output S3 prefix (both bucket and folder names are required)
        'dataset_10k_10q.csv',                                           # output file name
        wait=True,
        logs=True)
[]The data is stored in a file named dataset_10k_10q.csv, as shown in the preceding code. The file may be examined as follows:
    import boto3
    import pandas as pd

    # S3_BUCKET_NAME and S3_FOLDER_NAME must point to the output location used above.
    client = boto3.client('s3')
    client.download_file(
        S3_BUCKET_NAME,
        '{}/{}'.format(S3_FOLDER_NAME, 'dataset_10k_10q.csv'),
        'dataset_10k_10q.csv')
    data_frame_10k_10q = pd.read_csv('dataset_10k_10q.csv')
    data_frame_10k_10q.head()
[]
[]The mdna column of text from this DataFrame is then combined with financial data to create a composite dataset, stored in a file titled CCR_data.csv, which is read in next. We refer to the composite of tabular and text data as TabText.
[]We read in this dataset and examine its properties. It has 11 columns: one text column, one categorical column, eight numerical columns, and a label column (Rating). Although the values in this dataset match broad averages in the economy and we trained a model on this data, the data is synthetic, so you should train the model on your own real data.
    %pylab inline
    import pandas as pd
    import os

    df = pd.read_csv('CCR_data.csv')
    print(df.shape)
    df.head()
[]
[]Next, we convert the financial values into Altman’s five ratios, resulting in the final DataFrame we use for multimodal ML:
    # Altman's five ratios from the eight financial variables
    df["A"] = df.EBIT / df.TotalAssets
    df["B"] = df.NetSales / df.TotalAssets
    df["C"] = df.MktValueEquity / df.TotalLiabs
    df["D"] = (df.CurrentAssets - df.CurrentLiabs) / df.TotalAssets
    df["E"] = df.RetainedEarnings / df.TotalAssets
    # Drop the raw inputs now that the ratios are in place
    df = df.drop(["TotalAssets", "CurrentLiabs", "TotalLiabs", "RetainedEarnings",
                  "CurrentAssets", "NetSales", "EBIT", "MktValueEquity"], axis=1)
    df.head()
[]
[]The dataset now has eight columns: one text column (the MD&A section), one categorical column (industry code), five numerical columns (Altman’s ratios A, B, C, D, and E, as described earlier), and a label column (Rating).
[]As a cross-check, we compute the Z-score for each firm and examine the mean score by rating. The scores decline as the rating of firms drops. This confirms that the dataset captures the relationship between Z-scores and ratings. (Of course, we don’t use Z-score as a feature.)
    df_z = df.drop(['MDNA', 'industry_code'], axis=1)
    df_z["Zscore"] = 3.3*df_z.A + 0.99*df_z.B + 0.6*df_z.C + 1.2*df_z.D + 1.4*df_z.E
    df_z = df_z.groupby('Rating').mean().reset_index()
    # groupby sorts ratings alphabetically (A, AA, AAA, B, BB, BBB, CCC);
    # reindex so that sort_index() orders them from AAA down to CCC
    df_z.index = [2, 1, 0, 5, 4, 3, 6]
    df_z[["Rating", "Zscore"]].sort_index()
[]
[]Our dataset is multimodal and contains the following:
[]We use the GluonNLP library based on the MXNet framework. Install the required packages. You can update the following example code with newer releases of mxnet. For newer releases of autogluon, see GluonNLP: NLP made easy.
    %%capture
    !pip install --upgrade pip
    !pip install --upgrade setuptools
    !pip install --upgrade "mxnet_cu110<2.0.0"
    !pip install autogluon==0.2.0
[]First, we mimic the original version of the Altman model with just five financial ratios and industry code—this is just tabular data. We later fit an extended model with text and tabular data.
[]To start, we also choose a binary classification problem, where 1 = {AAA,AA,A,BBB} (namely, investment grade firms) and 0 = {BB,B,CCC} (below investment grade firms). We drop the text (MDNA) column from the dataset. In the solution itself, you will see a multi-category classification task, which we briefly highlight towards the end of the post.
    df_tabular = df.copy()
    df_tabular = df_tabular.drop(['MDNA'], axis=1)
[]Prepare the binary label based on rating:
    trans_func = lambda x: 1 if x in {'AAA', 'AA', 'A', 'BBB'} else 0
    df_tabular['Rating'] = df_tabular['Rating'].transform(trans_func)
[]Implement an 80/20 train/test split on the data:
    from sklearn.model_selection import train_test_split

    train_data, test_data = train_test_split(df_tabular, test_size=0.2, random_state=42)
[]We use the parsimonious framework from AutoGluon. This library accepts DataFrames containing text, tabular, and image (computer vision) data, and fits models automatically using a set of well-known classifiers, such as K-nearest neighbors, Gradient Boosted models, Random Forest models, Boosted models, Extra Trees models, XGBoost, and Neural Net models. These models are then stack-ensembled to get the best weighted model. You can also perform hyperparameter tuning. For full details, see AutoGluon: AutoML for Text, Image, and Tabular Data. Complete the following steps:
[]For full reference, see AutoGluon-Tabular: Robust and Accurate AutoML for Structured Data.
    %%time
    from autogluon.tabular import TabularPredictor

    predictor = TabularPredictor(label="Rating").fit(train_data=train_data)
[]Next, assess metrics to determine the best-performing model on the test data:
    best_model = predictor.get_model_best()
    print("Best model: " + best_model)
    performance = predictor.evaluate(test_data)
    results = predictor.leaderboard(test_data)
    results
[]
[]Note that balanced accuracy is average recall on both classes. MCC is the Matthews Correlation Coefficient.
[]We can also see the leaderboard generated from the preceding code and presented in order of validation score.
[]
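The balanced accuracy and MCC metrics mentioned above can be computed directly from the four cells of a binary confusion matrix. The following minimal sketch does this in plain Python on a small set of made-up labels (not results from this solution), assuming both classes appear in the true labels.

```python
import math


def binary_metrics(y_true, y_pred):
    """Accuracy, balanced accuracy, and MCC for 0/1 labels.

    Assumes both classes are present in y_true.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    recall_pos = tp / (tp + fn)  # recall on class 1
    recall_neg = tn / (tn + fp)  # recall on class 0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / len(pairs),
        "balanced_accuracy": (recall_pos + recall_neg) / 2,
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


# Made-up labels: 1 = investment grade, 0 = below investment grade.
y_true = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 1, 0, 0, 0, 0, 1]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Balanced accuracy and MCC are useful alongside plain accuracy when the two classes are of unequal size, because a classifier that always predicts the majority class scores poorly on both.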
[]We then combine the text and tabular data to get a final model to showcase multimodal ML. The steps remain exactly the same as before. You don’t need to perform vectorization of the text or one-hot encoding of the categorical variable. All this is handled by MXNet/AutoGluon. Even the label is auto-detected, so the class of problem doesn’t need to be specified.
[]Because the text in these sections is very long (thousands of words), we can’t feed it directly to transformer models, which limit the length of their input (typically 512 tokens for BERT-style models, well under 1,000 words). Therefore, AutoGluon uses TF-IDF with n-grams to transform the text into numerical vectors and then applies ML to the combined text and tabular data.
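To illustrate what TF-IDF with n-grams produces, here is a minimal sketch that uses scikit-learn’s TfidfVectorizer as a stand-in. AutoGluon’s internal text featurization may use different settings; the two short documents are invented for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Two invented snippets standing in for long MD&A sections.
docs = [
    "liquidity risk increased",
    "liquidity improved",
]

# ngram_range=(1, 2) keeps single words (unigrams) and word pairs (bigrams).
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(docs)  # sparse matrix: one row per document

print(X.shape)
print(sorted(vectorizer.vocabulary_))
```

Each document becomes a fixed-length numeric vector regardless of how long the text is, which is why this representation suits filings that exceed a transformer’s input limit.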
[]We fit a model with very few lines of code. This time, we don’t drop the text column containing the MD&A:
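    model_path = 'ag_tabtext_model'  # hypothetical output directory; change as needed

    df_tabtext = df.copy()  # copy the full DataFrame, keeping the MDNA text column
    trans_func = lambda x: 1 if x in {'AAA', 'AA', 'A', 'BBB'} else 0
    df_tabtext['Rating'] = df_tabtext['Rating'].transform(trans_func)  # add binary label

    # Re-split so that the train and test sets include the text column
    # (same split parameters as for the tabular-only model)
    train_data, test_data = train_test_split(df_tabtext, test_size=0.2, random_state=42)

    predictor = TabularPredictor(label="Rating", path=model_path).fit(
        train_data=train_data,
        excluded_model_types=['FASTAI'])  # fit model
    print("Best model: " + predictor.get_model_best())  # show best model
    performance = predictor.evaluate(test_data)
    print(predictor.leaderboard(test_data, silent=True))
[]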
[]Accuracy on the test dataset has increased to 93% (on TabText) versus 85% (on the tabular dataset).
[]We also see below the leaderboard generated from the preceding code and presented in order of validation score.
[]
[]SageMaker JumpStart has its own SDK with an API to further enhance the feature set with numerical values that score the text (in column MDNA) in the dataset for its various attributes. To see how to use this API, refer to the JumpStart example notebook SEC Filings Retrieval w/Summarizer and Scoring. This adds columns with additional values based on the percentage of words in the text that match separate word lists for each attribute, or the attribute may be based on an algorithm such as sentiment scoring and readability. You have 11 attributes: negative, certainty, risk, uncertainty, safe, fraud, litigious, positive, polarity, sentiment, and readability.
[]We use the Gunning fog index to calculate the readability score. Sentiment analysis uses VADER. Polarity calculation uses positive and negative word lists. The other NLP scores deliver the similarity (word frequency) with the default word lists (positive, negative, litigious, risk, fraud, safe, certainty, and uncertainty) provided through the smjsindustry library. You can also provide your own word list to calculate the NLP score of your own scoring types.
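As a rough illustration of a word-list score, the following sketch computes the proportion of words in a text that appear in a given list. It is not the smjsindustry implementation, whose tokenization and matching rules may differ; the function name, word list, and sample sentence are all hypothetical.

```python
import re


def word_list_score(text, word_list):
    """Share of tokens in `text` that appear in `word_list` (case-insensitive)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    vocab = {word.lower() for word in word_list}
    hits = sum(1 for token in tokens if token in vocab)
    return hits / len(tokens)


# Hypothetical custom word list and sample sentence.
custom_words = ["emissions", "diversity", "governance"]
sample = "Board governance improved while emissions fell and revenue grew."
score = word_list_score(sample, custom_words)
print(round(score, 4))
```

Here two of the nine words match the list, so the score is 2/9. Scores of this kind depend on document length only through the denominator, which keeps them comparable across short and long filings.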
[]These numerical scores are added as new columns to the text DataFrame. This creates a multimodal DataFrame that is a mixture of tabular data and long-form text, called TabText. When you submit this DataFrame for ML, it’s a good idea to normalize the columns of NLP scores (usually with standardization to zero mean and unit variance, or min-max scaling).
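The two normalization options can be sketched in a few lines of pandas. The column names and values below are invented; in practice you would apply the same operations to the NLP score columns of your own DataFrame.

```python
import pandas as pd

# Invented NLP score columns on very different scales.
scores = pd.DataFrame({
    "risk": [0.02, 0.05, 0.08],
    "readability": [12.0, 18.0, 15.0],
})

# Standardization: zero mean and unit (sample) standard deviation per column.
standardized = (scores - scores.mean()) / scores.std()

# Min-max scaling: each column mapped to the range [0, 1].
min_max = (scores - scores.min()) / (scores.max() - scores.min())

print(standardized)
print(min_max)
```

To avoid leaking information from the test set, one common choice is to compute the means, standard deviations, minimums, and maximums on the training split only and reuse them for the test split.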
[]These scoring metrics are simple and report the proportion of words in a document that occur in a specified word list. The word lists aren’t the traditional financial word lists that are human curated, but are word lists that are generated from word embeddings that are close to the concepts that are being scored. Therefore, they may also contain words that don’t obviously relate to a concept (e.g., risk), but their occurrence implies the presence of discussion related to the concept. You can even bring your own word lists to quantify additional concepts (for example, ESG). This API call is shown in the following code:
    import sagemaker
    from smjsindustry import NLPScoreType, NLPSCORE_NO_WORD_LIST
    from smjsindustry import NLPScorer, NLPScorerConfig

    # ROLE (execution role) and BUCKET (S3 bucket name) must be defined beforehand.

    # Score types that use a word list; an empty list selects the default word list
    score_type_list = list(
        NLPScoreType(score_type, [])
        for score_type in NLPScoreType.DEFAULT_SCORE_TYPES
        if score_type not in NLPSCORE_NO_WORD_LIST
    )
    # Score types that don't use a word list
    score_type_list.extend(
        [NLPScoreType(score_type, None) for score_type in NLPSCORE_NO_WORD_LIST])

    nlp_scorer_config = NLPScorerConfig(score_type_list)

    nlp_score_processor = NLPScorer(
        ROLE,
        1,
        'ml.c5.18xlarge',
        volume_size_in_gb=30,
        volume_kms_key=None,
        output_kms_key=None,
        max_runtime_in_seconds=None,
        sagemaker_session=sagemaker.Session(),
        tags=None)

    nlp_score_processor.calculate(
        nlp_scorer_config,
        "MDNA",
        "CCR_data_input.csv",
        's3://{}/{}'.format(BUCKET, "nlp_score"),
        'ccr_nlp_score_sample.csv'
    )
[]This generates an extended DataFrame.
[]
[]Instead of training for binary classification as we did earlier, we can use the seven rating classes in the dataset for multicategory classification. The details of training the model on this extended DataFrame are provided in the training notebook of the Corporate Credit Rating Prediction solution in SageMaker JumpStart. The final performance on a sample of the data is shown in the confusion matrix.
[]
[]We can observe that the trained model is accurate on the test dataset, even though we trained it on a small subset of the data.
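For readers who want to build a similar confusion matrix for their own predictions, the following sketch uses scikit-learn with the seven rating classes listed in their natural order. The true and predicted labels are toy values, not output from this solution.

```python
from sklearn.metrics import confusion_matrix

# Rating classes from best to worst credit quality.
labels = ["AAA", "AA", "A", "BBB", "BB", "B", "CCC"]

# Toy labels: two of the nine firms are predicted one notch too low.
y_true = ["AAA", "AA", "A", "BBB", "BB", "B", "CCC", "A", "BB"]
y_pred = ["AAA", "AA", "A", "BBB", "BB", "B", "CCC", "BBB", "B"]

# Rows are true ratings, columns are predicted ratings, both in `labels` order.
cm = confusion_matrix(y_true, y_pred, labels=labels)
accuracy = cm.trace() / cm.sum()

print(cm)
print(f"Accuracy: {accuracy:.3f}")
```

Passing `labels` explicitly keeps the rows and columns in rating order rather than alphabetical order, so errors between adjacent ratings show up next to the diagonal.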
[]SageMaker makes it simple to deploy the model to an endpoint. As we discussed, you can then use this for inference, and the technical details (a few lines of code) are also shown in the training and inference notebooks that come with this solution.
[]Dr. Sanjiv Das is an Amazon Scholar and the Terry Professor of Finance and Data Science at Santa Clara University. He holds post-graduate degrees in Finance (M.Phil and Ph.D. from New York University) and Computer Science (M.S. from UC Berkeley), and an MBA from the Indian Institute of Management, Ahmedabad. Prior to being an academic, he worked in the derivatives business in the Asia-Pacific region as a Vice-President at Citibank. He works on multimodal machine learning in the area of financial applications.
[]Dr. John He is a senior software development engineer with Amazon AI, where he focuses on machine learning and distributed computing. He holds a PhD degree from CMU.
[]Shenghua Yue is a Software Development Engineer at Amazon SageMaker. She focuses on building machine learning tools and products for customers.