Corporations and banks base their important decisions on data, whether the goal is to make a profit, reduce costs or avoid losses. They need to make many crucial decisions, such as:
- Investing in a new venture.
- Choosing which consumers to target in a direct marketing campaign.
- Identifying fraudulent credit card transactions in real time.
The answer to each of the above business decisions is either “Yes” or “No”: for example, “Yes” to investing in a new venture, “Yes” to targeting a consumer, or “Yes”, the credit card transaction is fraudulent.
Logistic regression is suited to such questions. It is a probabilistic classification model used to predict the outcome of a binary dependent variable (one that can be YES or NO), say whether a loan will be approved or not. There are usually many independent variables, or features, such as the income or credit score of the loan applicant, which are used as inputs to the logistic model to predict the dependent variable (loan approved or not).
Suppose a bank offers home loans. When an applicant applies for a home loan, the bank collects data about the applicant, such as “average balance maintained in account”, “credit score”, “age”, “marital status” and “age of account”. The bank feeds these data into its logistic model and gets a probability score between 0 and 1. Depending on this probability score, the bank may or may not approve the loan. The probability score measures the chance that the applicant will default (if the bank models the dependent variable as the probability of default).
The higher the probability score, the higher the chance of the applicant becoming a defaulter, so the bank will approve the loan when the probability score is low. Building a good logistic regression model is critical, since lending decisions depend on it.
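The scoring step described above can be sketched in a few lines of Python. This is only an illustration: the intercept, the coefficients, the feature names and scalings, and the 0.2 approval cutoff are all made-up assumptions, not values from any real bank model. A real model would estimate the coefficients from historical loan data.

```python
import math

# Hypothetical intercept and coefficients, for illustration only.
INTERCEPT = -1.0
COEFFICIENTS = {
    "credit_score_above_600": -0.01,  # assumed scaling: points above 600
    "avg_balance_thousands": -0.05,   # assumed scaling: balance in thousands
    "age_of_account_years": -0.10,
}

def default_probability(applicant):
    """Apply the logistic (sigmoid) function to the linear score; result lies in (0, 1)."""
    z = INTERCEPT + sum(COEFFICIENTS[name] * applicant[name] for name in COEFFICIENTS)
    return 1.0 / (1.0 + math.exp(-z))

def approve_loan(applicant, cutoff=0.2):
    """Approve only when the predicted default probability is below the cutoff."""
    return default_probability(applicant) < cutoff

applicant = {
    "credit_score_above_600": 120,
    "avg_balance_thousands": 10,
    "age_of_account_years": 5,
}
score = default_probability(applicant)
print(f"default probability = {score:.3f}, approved = {approve_loan(applicant)}")
```

The cutoff is a business choice: lowering it rejects more applicants and reduces defaults, at the price of turning away more good customers.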
Properties of a Good Logistic Regression Model
- It accurately identifies both the applicants who will default and those who will not.
- It makes few errors. One kind of error is flagging an applicant as a defaulter who would not have defaulted; in this case the loan is not approved and the bank loses an opportunity to make a profit. The other kind is predicting that an applicant will not default, so the loan is approved, but the applicant then defaults.
- Its performance does not deteriorate over time. A good model’s performance on the first two points is consistent across time periods: a model built today should perform about equally well after 6 months or a year. In practice performance generally deteriorates over time, but the rate of deterioration should be as low as possible.
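The two kinds of error in the list above can be counted directly once the actual outcomes are known. The sketch below is a minimal illustration; the function name `error_counts`, the 0.5 cutoff and the eight sample records are assumptions made up for the example.

```python
def error_counts(actual_default, predicted_prob, cutoff=0.5):
    """Count correct and incorrect calls when scores at or above the cutoff are flagged."""
    counts = {"true_pos": 0, "false_pos": 0, "true_neg": 0, "false_neg": 0}
    for actual, prob in zip(actual_default, predicted_prob):
        flagged = prob >= cutoff
        if flagged and actual:
            counts["true_pos"] += 1    # defaulter correctly flagged
        elif flagged and not actual:
            counts["false_pos"] += 1   # good applicant rejected: lost profit
        elif not flagged and actual:
            counts["false_neg"] += 1   # loan approved, applicant defaulted
        else:
            counts["true_neg"] += 1    # good applicant correctly approved
    return counts

# Made-up sample: 1 = defaulted, 0 = repaid.
actual = [1, 0, 0, 1, 0, 1, 0, 0]
scores = [0.9, 0.2, 0.6, 0.4, 0.1, 0.8, 0.3, 0.7]
counts = error_counts(actual, scores)
print(counts)
```

Recomputing these counts for each time period is one simple way to check the third property, that performance stays consistent over time.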
Logistic Regression Examples
- Selecting consumers for direct marketing. The cost of direct marketing by mail and phone has grown, and if a contacted customer does not take up the offer, the company’s effort goes in vain. It is therefore important to choose the right customers from the list of potential customers, so that many of the targeted customers take up the offer and the company saves effort and money, making the campaign a success and reducing the cost of operations. For example, many insurance firms sell a variety of products, so targeting the right consumer is critical. These firms hold consumers’ retail purchase, demographic and income data, and use them to build logistic regression models that identify potential buyers. A few of the variables used to build such models are “average account balance”, “credit card risk code”, “age” and “marital status”.
- Predicting corporate bankruptcy. This is one of the most widely researched topics. Lending and investing are subject to risk, so before lending to or investing in a company one has to know its chances of going bankrupt. Knowing these chances helps the investor make a profitable business decision. Logistic regression predicts the probability of bankruptcy.
- Predicting whether a credit card transaction is fraudulent. Credit card fraud has increased manyfold, and banks use a range of models to identify fraudulent transactions. This is a widely studied area, and logistic regression is one of the models used.
Once a logistic regression model is built, its performance has to be tested over different time periods. This task is critical, since a model might perform accurately when it is built but deteriorate over time, which leads to wrong business decisions. Many tests are available, each measuring the model’s performance on a different metric.
Metrics Used to Test Model Performance
There are many metrics to measure the performance of a model. A few of them are listed below:
Population/Stability Metrics:
1. Metrics of the Dependent Variable
- Average Probability Score of the Dependent Variable: Suppose we have 1000 loan applicants and therefore 1000 probability scores. If the average probability score is similar across different time intervals, it suggests that the mix of loan applicants in those intervals is similar.
- Distribution of Probability Score: The probability score varies between 0 and 1. Plotting the distribution of the probability score for different time periods visualizes the spread of the population across the score bands. One can use this graph to spot any major change in the score distribution.
- Population Stability Index (PSI): It analyses whether the population on which the model was developed and the current population are similar in their characteristics. PSI quantifies the difference by measuring the distributional shift in scores between two samples: the current one and the baseline (on which the model was built). A PSI of 0.1 or less indicates that there is no significant change in the distribution.
2. Metrics of the Independent Variables
- Average Value of Independent Variable: It measures the average value of each independent variable over different time intervals. Plotting it helps to visualize any significant change in the input data.
- Distribution of Value of Independent Variable: It visualizes the spread of the population across the bands of each attribute (variable), with the bands defined by the user, over different time periods.
- Population Stability Index: PSI calculated on an independent variable. Here PSI quantifies the difference by measuring the distributional shift in the variable’s values between two samples: the current one and the baseline (on which the model was built). A PSI of 0.1 or less indicates that there is no significant change in the distribution.
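As a rough sketch of how PSI can be computed, the Python below uses the commonly quoted formula, the sum over bands of (current share - baseline share) * ln(current share / baseline share). The ten equal-width bands on [0, 1], the small floor that avoids taking the log of zero, and the sample data are all assumptions chosen for illustration. For an independent variable the bands would instead be defined on that variable's own range.

```python
import math

def band_shares(scores, n_bands=10):
    """Share of the population in each equal-width band on [0, 1]."""
    counts = [0] * n_bands
    for s in scores:
        index = min(int(s * n_bands), n_bands - 1)  # a score of 1.0 joins the last band
        counts[index] += 1
    return [c / len(scores) for c in counts]

def psi(baseline_scores, current_scores, n_bands=10, floor=1e-4):
    """Population Stability Index between a baseline sample and a current sample."""
    base = band_shares(baseline_scores, n_bands)
    cur = band_shares(current_scores, n_bands)
    total = 0.0
    for b, c in zip(base, cur):
        b, c = max(b, floor), max(c, floor)  # guard against empty bands
        total += (c - b) * math.log(c / b)
    return total

# Made-up data: an evenly spread baseline and a current sample skewed to low scores.
baseline = [(i + 0.5) / 100 for i in range(100)]
current = [s ** 2 for s in baseline]
print(f"PSI, same population:    {psi(baseline, baseline):.3f}")
print(f"PSI, shifted population: {psi(baseline, current):.3f}")
```

An unchanged population gives a PSI of zero, while the skewed sample lands well above the 0.1 level mentioned in the text.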
Accuracy Metrics: These metrics measure the accuracy of the model over different time periods.
- Mean Squared Error (MSE) and Change in MSE: The predicted probabilities are bucketed into 10 deciles. For each decile, two values are calculated: the mean of the predicted probability of the dependent variable (the predicted probability), and the observed rate of the event, say a loan applicant becoming a defaulter (the actual probability). This gives 10 pairs of values, from which the MSE is calculated. The lower the MSE, the better the model. The MSE is calculated for different time periods and compared with the MSE of the base model; the change in MSE should be within an acceptable range.
- Actual vs Predicted Rate: It visualizes the actual percentage of an event, say the percentage of loan defaulters, against the predicted percentage of that event over different time periods. The gap between the two graphs should be as small as possible.
- Actual vs Predicted by Band (Actual Rate by Band): It visualizes the percentage of an event, say loan default, by predicted probability score band over different time periods. Ideally the graphs for different bands should not cross one another.
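The decile-based MSE described above might be computed along these lines. This is a sketch under assumptions: the function name `decile_mse` and the data are invented, and the example uses 2 buckets instead of 10 only to keep the sample small.

```python
def decile_mse(actual, predicted, n_buckets=10):
    """MSE between mean predicted probability and observed event rate per bucket."""
    pairs = sorted(zip(predicted, actual))  # order records by predicted score
    n = len(pairs)
    squared_errors = []
    for k in range(n_buckets):
        bucket = pairs[k * n // n_buckets:(k + 1) * n // n_buckets]
        if not bucket:
            continue  # fewer records than buckets
        mean_predicted = sum(p for p, _ in bucket) / len(bucket)
        actual_rate = sum(a for _, a in bucket) / len(bucket)
        squared_errors.append((mean_predicted - actual_rate) ** 2)
    return sum(squared_errors) / len(squared_errors)

# Made-up data: the same predictions scored against two periods of outcomes.
predicted = [0.1] * 10 + [0.9] * 10
actual_baseline = [1] + [0] * 9 + [1] * 9 + [0]          # 10% and 90% default rates
actual_current = [1] * 5 + [0] * 5 + [1] * 5 + [0] * 5   # 50% default rate in both groups

mse_baseline = decile_mse(actual_baseline, predicted, n_buckets=2)
mse_current = decile_mse(actual_current, predicted, n_buckets=2)
print(f"baseline MSE = {mse_baseline:.4f}, current MSE = {mse_current:.4f}")
print(f"change in MSE = {mse_current - mse_baseline:.4f}")
```

In the baseline period the predicted and actual rates agree, so the MSE is essentially zero; in the current period the model no longer matches the outcomes and the MSE rises.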
Efficiency Metrics: This set of metrics measures the efficiency of the model over different time periods.
- Top 10% Capture: It measures the actual percentage of an event, say loan default, among the top 10% of predicted probability scores. A value of 28% means that 28% of the applicants in the top 10% of predicted probability scores were actual loan defaulters. The higher the value, the better the model. The metric is also plotted over different time periods.
- Change in Top 10% Capture: It visualizes and measures the change in Top 10% Capture in the current time period relative to the Top 10% Capture of the baseline time period (development sample). This should be within an acceptable range.
- KS Trend: The Kolmogorov-Smirnov (KS) test checks whether two distributions are significantly different. In logistic regression we have the predicted probability of an event, say loan default, and the actual occurrence of the event, and the KS score measures how well the predicted scores separate actual events from non-events. The predicted probability scores are divided into 10 deciles, the percentage of actual events and of non-events is calculated for each decile, and the KS score is the largest gap between the cumulative percentages of events and non-events. The score lies between 0 and 1. The higher the value, the better the model.
- Change in KS Trend: It measures the change in KS relative to the KS of the baseline time period (development sample). This should be within an acceptable range.
- Gini Coefficient: It is a statistic that measures the ability of a model or a characteristic to rank-order risk. A Gini value of 0% means that the model cannot distinguish loan defaulters from non-defaulters. A Gini value of 100% means that the characteristic or scorecard distinguishes them perfectly.
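The efficiency metrics above (Top 10% Capture, KS and Gini) can be illustrated with a short Python sketch. Several choices here are assumptions rather than anything stated in the text: KS is computed record by record instead of on 10 deciles, Gini is derived from the standard relation Gini = 2 * AUC - 1 with AUC obtained by comparing every defaulter with every non-defaulter, and the ten sample records are invented.

```python
def top_share_event_rate(actual, predicted, share=0.10):
    """Observed event rate among the highest-scored share of records."""
    ranked = sorted(zip(predicted, actual), reverse=True)
    top_n = max(1, int(len(ranked) * share))
    return sum(a for _, a in ranked[:top_n]) / top_n

def ks_statistic(actual, predicted):
    """Largest gap between cumulative shares of events and non-events."""
    ranked = sorted(zip(predicted, actual), reverse=True)
    total_events = sum(actual)
    total_non_events = len(actual) - total_events
    cum_events = cum_non_events = 0
    ks = 0.0
    for i, (score, a) in enumerate(ranked):
        if a:
            cum_events += 1
        else:
            cum_non_events += 1
        # Compare only once all records sharing this score have been counted.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            gap = abs(cum_events / total_events - cum_non_events / total_non_events)
            ks = max(ks, gap)
    return ks

def gini_coefficient(actual, predicted):
    """Gini = 2 * AUC - 1, with AUC from pairwise event vs non-event comparisons."""
    events = [p for p, a in zip(predicted, actual) if a]
    non_events = [p for p, a in zip(predicted, actual) if not a]
    concordant = 0.0
    for e in events:
        for n in non_events:
            if e > n:
                concordant += 1.0
            elif e == n:
                concordant += 0.5  # ties count as half
    auc = concordant / (len(events) * len(non_events))
    return 2 * auc - 1

# Made-up sample: 1 = defaulted, 0 = repaid, listed from highest to lowest score.
actual = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
predicted = [0.95, 0.85, 0.75, 0.65, 0.55, 0.45, 0.35, 0.25, 0.15, 0.05]
print(f"top 10% event rate = {top_share_event_rate(actual, predicted):.2f}")
print(f"KS = {ks_statistic(actual, predicted):.3f}")
print(f"Gini = {gini_coefficient(actual, predicted):.3f}")
```

The pairwise loop is quadratic in the number of records, which is fine for a small illustration; on a large portfolio a rank-based AUC calculation would be the more practical choice.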
Tracking these metrics through a systematic model monitoring process helps in:
- Enabling a well-defined model governance process that addresses regulatory needs.
- Maintaining a single repository for all models that is auditable and transparent.
- Giving the leadership team visibility through an objective, consistent assessment of model risk.
- Proactively identifying performance issues and drilling down to their key drivers.
- Automating the deployment and tracking of models (at any frequency) to reduce time and effort.
This blog post was written by Surjit Laha, Analytics Consultant at BRIDGEi2i.
The views and opinions expressed in this article are those of the author and do not necessarily reflect the official position or viewpoint of BRIDGEi2i.