Adjusting for Selection Bias in Credit Scoring Models Using Python

Charaf ZGUIOUAR
Apr 3, 2024
A demonstration of how the sample could be a biased representation of the population

When building credit scoring models, one common challenge is selection bias. This bias occurs because financial institutions only observe the performance (e.g., default or no default) of loan applicants they choose to approve. Applicants who might have been rejected are not observed, leading to a biased understanding of what factors contribute to creditworthiness. This article demonstrates how to use rejection inference to adjust for this bias, ensuring our credit scoring model is both fair and accurate.

Understanding Selection Bias

Selection bias can significantly skew the outcomes of predictive models, leading to inaccurate predictions. In the context of credit scoring, this means potentially overlooking creditworthy individuals based on incomplete information. To mitigate this, we employ a technique known as rejection inference, which allows us to infer the missing information and correct for the bias.

First, let’s create a synthetic dataset that simulates loan applicants, including both those who were approved for a loan and those who were not. Our dataset will include:

- loan_amount: The requested loan amount.
- credit_score: The applicant’s credit score.
- income: The applicant’s annual income.
- was_approved: Whether the applicant was approved (1) or not (0).
- defaulted: Whether the applicant defaulted on the loan (1) or not (0), only for those who were approved.

import numpy as np
import pandas as pd

np.random.seed(42)
n_applicants = 1000

# Simulate applicant features
loan_amount = np.random.exponential(scale=10000, size=n_applicants)
credit_score = np.random.randint(300, 850, size=n_applicants)
income = np.random.normal(loc=50000, scale=15000, size=n_applicants)
X = np.vstack((loan_amount, credit_score, income)).T

# Approval depends on the same features plus noise, so the approved
# subsample is not a random draw from the applicant population
was_approved = (X.dot(np.array([0.01, 0.02, 0.0001])) + np.random.normal(0, 1000, n_applicants)) > 0
was_approved = was_approved.astype(int)

# Defaults are only generated (and observed) for approved applicants
X_approved = X[was_approved == 1]
defaulted = (X_approved.dot(np.array([-0.0001, -0.03, -0.00005])) + np.random.normal(0, 100, X_approved.shape[0])) > 0
defaulted = defaulted.astype(int)

data = pd.DataFrame({
    'loan_amount': loan_amount,
    'credit_score': credit_score,
    'income': income,
    'was_approved': was_approved,
    'defaulted': 0
})
data.loc[was_approved == 1, 'defaulted'] = defaulted
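Before modeling, a quick sanity check on the simulated data confirms that approvals and, among approved applicants, defaults both occur often enough to be modeled:

# Share of applicants approved, and default rate within the approved subsample
print(f"Approval rate: {data['was_approved'].mean():.2%}")
print(f"Default rate among approved: {data.loc[data['was_approved'] == 1, 'defaulted'].mean():.2%}")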

Next, we model the loan approval process to understand which factors contribute to an applicant being approved or rejected. This model helps us estimate the probability of approval for all applicants, including those who were not approved.

from sklearn.linear_model import LogisticRegression
# Prepare data for modeling
X = data[['loan_amount', 'credit_score', 'income']]
y_approval = data['was_approved']
# Train the model
approval_model = LogisticRegression(random_state=42)
approval_model.fit(X, y_approval)
# Estimate the probability of approval
data['approval_probability'] = approval_model.predict_proba(X)[:, 1]

Applying Rejection Inference

With the approval probabilities estimated, we adjust our dataset using Inverse Probability Weighting (IPW) to account for the selection bias in the approval process: each approved applicant is weighted by the inverse of their estimated approval probability, so approved applicants who resemble those typically rejected count for more in the outcome model.

data['weight'] = 1 / data['approval_probability']
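One practical caveat with IPW: applicants with a very low estimated approval probability receive very large weights, which can destabilize the downstream model. A common remedy, sketched here with an illustrative 99th-percentile cap rather than any universally correct threshold, is to clip extreme weights:

# Cap extreme weights to limit the variance introduced by inverse probability weighting
weight_cap = data['weight'].quantile(0.99)
data['weight'] = data['weight'].clip(upper=weight_cap)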

We’re now ready to model the outcome (default) using the adjusted dataset. This model will predict the likelihood of defaulting on a loan, taking into consideration the selection bias.

from sklearn.model_selection import train_test_split
# Filter for approved applicants
approved_data = data[data['was_approved'] == 1]
# Split the data
X_train, X_test, y_train, y_test = train_test_split(
    approved_data[['loan_amount', 'credit_score', 'income']],
    approved_data['defaulted'],
    test_size=0.3,
    random_state=42
)
# Train the outcome model
outcome_model = LogisticRegression(random_state=42)
outcome_model.fit(X_train, y_train, sample_weight=approved_data.loc[X_train.index, 'weight'])

Finally, we evaluate the performance of our model to ensure it accurately predicts defaults, adjusting for the selection bias in the process.


from sklearn.metrics import accuracy_score
y_pred = outcome_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f'Model accuracy: {accuracy}')
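Accuracy alone can be misleading when one class dominates, and credit scoring models are usually judged on how well they rank risky versus safe borrowers. A short complement using ROC AUC (and the Gini coefficient commonly reported for scorecards) on the same test set:

from sklearn.metrics import roc_auc_score

# Rank-based evaluation: how well does the model order defaulters vs. non-defaulters?
y_scores = outcome_model.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, y_scores)
gini = 2 * auc - 1
print(f'ROC AUC: {auc:.3f}, Gini: {gini:.3f}')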

Expanding Our Toolkit: The Heckman Two-Step Correction for Rejection Inference

In the previous section, we introduced rejection inference and demonstrated how to apply inverse probability weighting (IPW) to correct for selection bias in credit scoring models. However, IPW is just one of the tools available for this purpose. Another powerful technique for addressing selection bias is the Heckman two-step correction model, a method that has found widespread application across economics and social sciences since its introduction in the 1970s by Nobel laureate James Heckman.

Understanding the Heckman Two-Step Correction

The Heckman correction is specifically designed to correct for the bias that arises in samples selected from a larger population based on certain criteria. This makes it particularly relevant for credit scoring, where we typically observe outcomes (e.g., loan defaults) only for a subset of all applicants (those who are approved).

The first step involves estimating the probability that an observation is included in the sample, known as the selection equation. This is achieved through a probit or logistic regression model, where the dependent variable is a binary indicator of selection (e.g., whether a loan was approved).

The second step adjusts the main model of interest (e.g., predicting whether an approved loan will default) using the results from the first step. This adjustment typically involves incorporating an additional term (the inverse Mills ratio) derived from the selection model. The inclusion of this term corrects for the part of the selection process that is correlated with the outcome, thus mitigating selection bias.
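Concretely, if the probit selection equation has linear index z = Xγ, the inverse Mills ratio for a selected (approved) observation is

λ(z) = φ(z) / Φ(z),

where φ and Φ are the standard normal density and cumulative distribution function. This λ enters the outcome equation as an additional regressor.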

Implementing the Heckman Correction in Python

Let’s extend our example by implementing the Heckman correction in Python with the statsmodels library. statsmodels does not ship a ready-made Heckman estimator, so we carry out the two steps manually: fit a probit selection equation, compute the inverse Mills ratio from it, and add that term to the outcome equation estimated on approved loans. Because our outcome (default) is binary, adding the inverse Mills ratio to a logit is a pragmatic approximation of the textbook correction, which assumes a continuous outcome in the second step:

import statsmodels.api as sm
from scipy.stats import norm

# Step 1: selection equation -- probit for the approval decision
X_selection = sm.add_constant(data[['loan_amount', 'credit_score', 'income']])
y_selection = data['was_approved']
selection_model = sm.Probit(y_selection, X_selection).fit(disp=0)

# Inverse Mills ratio: lambda = phi(X @ gamma) / Phi(X @ gamma)
linear_pred = X_selection.dot(selection_model.params)
data['imr'] = norm.pdf(linear_pred) / norm.cdf(linear_pred)

# Step 2: outcome equation for approved loans, augmented with the IMR
approved = data[data['was_approved'] == 1]
X_outcome = sm.add_constant(approved[['loan_amount', 'credit_score', 'income', 'imr']])
y_outcome = approved['defaulted']
heckman_outcome = sm.Logit(y_outcome, X_outcome).fit(disp=0)
print(heckman_outcome.summary())

This snippet walks through a Heckman-style two-step correction for selection bias in a credit scoring scenario. The summary of the corrected outcome model helps us understand the predictors of default risk once the approval process has been accounted for. One caveat: the standard errors reported in the second step do not account for the fact that the inverse Mills ratio is itself an estimate, so a careful implementation adjusts them (or bootstraps the whole two-step procedure).
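A quick diagnostic, reusing the heckman_outcome results fitted above: the coefficient on the inverse Mills ratio serves as an informal test for selection bias, since a coefficient close to zero suggests the approved subsample behaves roughly like a random sample of applicants.

# Informal selection-bias check: is the IMR coefficient distinguishable from zero?
print(f"IMR coefficient: {heckman_outcome.params['imr']:.4f} "
      f"(p-value: {heckman_outcome.pvalues['imr']:.3f})")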

The Heckman correction offers a theoretically grounded way to account for selection bias, making it a valuable addition to our data science toolkit. However, it requires careful attention to the specification of both the selection and outcome models. Accurately capturing the selection process is crucial, and the correction is most credible when the selection equation contains at least one variable that influences approval but not default (an exclusion restriction); otherwise identification rests entirely on the nonlinearity of the inverse Mills ratio.

Furthermore, while the Heckman model is powerful, it’s not always the simplest solution to implement and interpret. Depending on the complexity of your data and the nature of the selection bias, simpler methods like IPW or even more complex ones, such as machine learning approaches, might be more appropriate.

By understanding and applying techniques like IPW and the Heckman two-step correction, we can build more accurate and fair credit scoring models. These methods allow us to make informed decisions and extend credit opportunities more equitably across applicants, contributing to a more inclusive financial ecosystem.

Remember, the choice of technique should be guided by the specifics of your dataset and the theoretical underpinnings of your analysis. Each method has its strengths and limitations, and the best approach may often involve a combination of techniques to thoroughly address selection bias.
