Loan Approval Prediction with Logistic Regression: A Beginner's Guide

1. Introduction
Every day, banks receive thousands of loan applications. Reviewing each one manually would be time-consuming and prone to human bias. But what if a machine learning model could help predict whether an applicant is likely to be approved or rejected?
In this tutorial, you'll build exactly that—a logistic regression model that learns patterns from historical loan data and makes automatic predictions on new applications. By the end, you'll understand:
How to use scientific methodology to select the best features
How logistic regression works for binary classification
How to preprocess financial data for machine learning
How to evaluate whether your model makes good predictions
How to interpret results in a real-world banking context
What you'll build: A classifier that takes an applicant's debt-to-income ratio, interest rate, and loan amount, and predicts whether their loan will be approved or rejected.
What you'll learn to extend: After completing this tutorial, you'll have a set of challenges for taking your model further: using all available features, addressing class imbalance, comparing different algorithms, and optimizing decision thresholds.
No prior ML experience needed. We'll explain every step, show the code, and display the results so you can see what's happening at each stage.
2. How This Tutorial Works
This is a learn-by-doing tutorial. We'll build a complete loan approval model step-by-step, and throughout the journey, you'll encounter four progressive tasks integrated at key points:
| Task | What You'll Learn |
|---|---|
| Task 1: Explore More Features | Feature distributions, importance, data quality |
| Task 2: Handle Class Imbalance | SMOTE, real-world data challenges |
| Task 3: Compare ML Algorithms | Ensemble methods, complexity trade-offs |
| Task 4: Optimize Threshold | Cost-sensitive optimization, ROI analysis |
How to use this guide:
✅ First read: Complete the tutorial linearly, skip tasks on first pass
✅ Tasks are designed for revisiting as your confidence grows
✅ Each task builds on what you just learned
✅ Can be done in order: Task 1 → Task 2 → Task 3 → Task 4
Ready? Let's start building! 🚀
3. What is Logistic Regression?
The Problem with Linear Regression
Imagine you tried to predict loan approval using simple linear regression (the kind that draws a straight line through data). The problem? Linear regression can output any number—like 2.5 or -10—but loan approval is binary: yes (1) or no (0).
Linear Regression Output: ..., -1, 0.5, 1.8, 2.2, ...
Loan Approval Reality: 0 or 1 (rejected or approved)
This mismatch is why we need logistic regression instead.
The Solution: The Sigmoid Function
Logistic regression uses a special function called the sigmoid function that squashes any number into a probability between 0 and 1:
$$\text{Sigmoid}(z) = \frac{1}{1 + e^{-z}}$$
Think of it as a gate: no matter what the input, the output is always between 0 and 1. This is perfect for probabilities.
If output = 0.8 → 80% chance of approval
If output = 0.3 → 30% chance of approval
If output = 0.5 → 50-50 odds
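To make the squashing behaviour concrete, here is a minimal sketch of the sigmoid using only the Python standard library. The raw scores are hypothetical values chosen for illustration; they do not come from the loan dataset.

```python
import math

def sigmoid(z: float) -> float:
    """Map any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical raw scores (z), chosen only to show the squashing effect
for z in [-10.0, -1.0, 0.0, 1.0, 10.0]:
    print(f"z = {z:6.1f} -> sigmoid(z) = {sigmoid(z):.4f}")
```

Large negative scores land near 0, large positive scores land near 1, and a score of exactly 0 maps to 0.5.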
Why It Matters
In loan approval, you don't just want a yes/no answer—you want confidence levels. Logistic regression gives you both:
A hard prediction (approved or rejected)
A probability showing how confident the model is
This allows banks to set their own acceptance threshold. Maybe they accept anything over 70% confidence, or maybe 40%—it depends on their risk tolerance.
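As a rough illustration of how a threshold turns probabilities into decisions, the sketch below applies the 70% and 40% cut-offs mentioned above to five made-up probabilities. The `decide` helper and the probability values are assumptions for this example only.

```python
# Hypothetical approval probabilities for five applicants
probabilities = [0.82, 0.65, 0.45, 0.38, 0.10]

def decide(probs, threshold):
    """Return 1 (approve) where the probability exceeds the threshold, else 0."""
    return [1 if p > threshold else 0 for p in probs]

strict = decide(probabilities, 0.70)   # cautious bank
lenient = decide(probabilities, 0.40)  # growth-oriented bank

print("strict :", strict)
print("lenient:", lenient)
```

The same probabilities produce one approval under the strict rule and three under the lenient one, which is why the threshold is a business choice rather than a property of the model.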
4. Understanding the Dataset
The Loan Approval Dataset
We're working with a real-world dataset of 45,000 loan applications. Let's explore what we're working with.
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix, roc_auc_score, roc_curve
import matplotlib.pyplot as plt
import seaborn as sns
# Load the dataset
data = pd.read_csv('loan_data.csv')
# Display first 5 rows
print(data.head())
Output:
person_age person_gender person_education person_income person_emp_exp \
0 22.0 female Master 71948.0 0
1 21.0 female High School 12282.0 0
2 25.0 female High School 12438.0 3
3 23.0 female Bachelor 79753.0 0
4 24.0 male Master 66135.0 1
person_home_ownership loan_amnt loan_intent loan_int_rate \
0 RENT 35000.0 PERSONAL 16.02
1 OWN 1000.0 EDUCATION 11.14
2 MORTGAGE 5500.0 MEDICAL 12.87
3 RENT 35000.0 MEDICAL 15.23
4 RENT 35000.0 MEDICAL 14.27
loan_percent_income cb_person_cred_hist_length credit_score \
0 0.49 3.0 561
1 0.08 2.0 504
2 0.44 3.0 635
3 0.44 2.0 675
4 0.53 4.0 586
previous_loan_defaults_on_file loan_status
0 No 1
1 Yes 0
2 No 1
3 No 1
4 No 1
Exploring the Dataset
# Summary statistics for numerical features
print(data.describe())
Output:
person_age person_income person_emp_exp loan_amnt \
count 45000.000000 4.500000e+04 45000.000000 45000.000000
mean 27.764178 8.031905e+04 5.410333 9583.157556
std 6.045108 8.042250e+04 6.063532 6314.886691
min 20.000000 8.000000e+03 0.000000 500.000000
25% 24.000000 4.720400e+04 1.000000 5000.000000
50% 26.000000 6.704800e+04 4.000000 8000.000000
75% 30.000000 9.578925e+04 8.000000 12237.250000
max 144.000000 7.200766e+06 125.000000 35000.000000
loan_int_rate loan_percent_income cb_person_cred_hist_length \
count 45000.000000 45000.000000 45000.000000
mean 11.006606 0.139725 5.867489
std 2.978808 0.087212 3.879702
min 5.420000 0.000000 2.000000
25% 8.590000 0.070000 3.000000
50% 11.010000 0.120000 4.000000
75% 12.990000 0.190000 8.000000
max 20.000000 0.660000 30.000000
credit_score loan_status
count 45000.000000 45000.000000
mean 632.608756 0.222222
std 50.435865 0.415744
min 390.000000 0.000000
25% 601.000000 0.000000
50% 640.000000 0.000000
75% 670.000000 0.000000
max 850.000000 1.000000
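The summary shows a maximum person_age of 144 and a maximum person_emp_exp of 125, which look like data-entry errors rather than real applicants. One possible way to flag such rows is sketched below on a few toy records. The cutoff of 100 years and the rule that experience cannot exceed age are assumptions for illustration, not rules taken from the dataset's documentation.

```python
# Toy rows with the same column names as the dataset; values are illustrative
rows = [
    {"person_age": 22.0, "person_emp_exp": 0},
    {"person_age": 144.0, "person_emp_exp": 125},
    {"person_age": 35.0, "person_emp_exp": 12},
    {"person_age": 30.0, "person_emp_exp": 40},
]

MAX_PLAUSIBLE_AGE = 100  # assumed cutoff, not from the dataset documentation

def is_plausible(row):
    """Age must be within the cutoff, and experience cannot exceed age."""
    age_ok = row["person_age"] <= MAX_PLAUSIBLE_AGE
    exp_ok = row["person_emp_exp"] <= row["person_age"]
    return age_ok and exp_ok

clean = [r for r in rows if is_plausible(r)]
flagged = [r for r in rows if not is_plausible(r)]
print(f"kept {len(clean)} rows, flagged {len(flagged)}")
```

With pandas, the same idea can be written as a boolean mask on the DataFrame, for example `data[data['person_age'] <= 100]`.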
Understanding Class Imbalance
# Check the distribution of loan approvals
print(data['loan_status'].value_counts())
print(data['loan_status'].value_counts(normalize=True))
Output:
0 35000
1 10000
Name: loan_status, dtype: int64
0 0.777778
1 0.222222
Name: loan_status, dtype: float64
Important finding: This dataset is imbalanced:
77.8% of loans are rejected (0)
22.2% of loans are approved (1)
This matters because the model will have an easier time predicting rejections (it sees them 3.5x more often). We'll need to pay attention to this when evaluating performance.
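A quick way to see why imbalance matters for evaluation is to score a "model" that rejects every application. The sketch below uses only the class counts printed above.

```python
# Class counts reported by value_counts() above
rejected, approved = 35000, 10000
total = rejected + approved

# A "model" that rejects everyone is right whenever the true label is 0
baseline_accuracy = rejected / total
# but it never finds a single approval
baseline_recall = 0 / approved

print(f"Always-reject accuracy: {baseline_accuracy:.1%}")
print(f"Always-reject recall:   {baseline_recall:.1%}")
```

Any real model has to beat roughly 78% accuracy to be doing better than this trivial rule, which is one reason accuracy alone is a weak yardstick here.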
Explore More Features
Now that you understand the data, let's explore what other features might be useful:
# Explore ALL numerical features in the dataset
numerical_features = data.select_dtypes(include=[np.number]).columns.tolist()
print("All numerical features:")
for feature in numerical_features:
    print(f" - {feature}")
# Calculate correlation of ALL features with loan_status
correlations = data[numerical_features].corr()['loan_status'].sort_values(ascending=False)
print("\nCorrelation with loan_status:")
print(correlations)
# Visualize distributions for features you're curious about
fig, axes = plt.subplots(2, 4, figsize=(15, 8))
for idx, feature in enumerate(numerical_features[:-1]):  # Exclude target
    ax = axes[idx // 4, idx % 4]
    ax.hist(data[data['loan_status'] == 0][feature], alpha=0.6, label='Rejected', bins=30)
    ax.hist(data[data['loan_status'] == 1][feature], alpha=0.6, label='Approved', bins=30)
    ax.set_title(feature)
    ax.legend()
plt.tight_layout()
plt.show()
Output:
All numerical features:
- person_age
- person_income
- person_emp_exp
- loan_amnt
- loan_int_rate
- loan_percent_income
- cb_person_cred_hist_length
- credit_score
- loan_status
Correlation with loan_status:
loan_status 1.000000
loan_percent_income 0.384880
loan_int_rate 0.332005
loan_amnt 0.107714
credit_score -0.007647
cb_person_cred_hist_length -0.014851
person_emp_exp -0.020481
person_age -0.021476
person_income -0.135808
Name: loan_status, dtype: float64
Based on the visualizations and correlation analysis, here are the key findings:
Strongest predictors (biggest differences between approved and rejected):
loan_percent_income (correlation: 0.385) - Clear separation between approved and rejected. Approved loans cluster at higher DTI ratios, as the positive correlation indicates.
loan_int_rate (correlation: 0.332) - Approved loans tend to have higher interest rates. Banks likely assign higher rates to riskier approved loans.
loan_amnt (correlation: 0.108) - Approved loans show a slight skew toward larger amounts.
Surprising findings:
credit_score (correlation: -0.008) - Almost no linear relationship with approval. Despite being a standard lending metric, it carries virtually no signal in this dataset.
person_income (correlation: -0.136) - Negative correlation. Higher income is associated with slightly lower approval odds. This contradicts intuition and suggests confounding factors.
person_age and person_emp_exp - Weak correlations (≈ -0.02), suggesting age and experience don't strongly predict approval in this data.
Which features are redundant:
credit_score, person_age, and cb_person_cred_hist_length all have near-zero correlations. They add noise rather than signal.
loan_percent_income and loan_int_rate are correlated with each other but both are still predictive, so both have value.
Business interpretation:
The bank's own interest rate assessment (loan_int_rate) is one of the two strongest signals; it encodes the bank's risk judgment.
Debt-to-income ratio matters more than raw income, suggesting the bank evaluates capacity, not just earnings.
Traditional credit metrics (credit_score) don't predict approval in this dataset, revealing a potential bias or different lending philosophy.
Questions to explore:
Why is credit score useless? Is this dataset filtered differently than typical lending data?
Why does higher income correlate with rejection? What other factors might explain this?
Should we drop the weak-correlation features or keep them for ensemble models?
Are there interaction effects between features that aren't captured by individual correlations?
What you'll learn: Feature exploration, correlation analysis, data visualization for feature importance, intuition-building about which features matter, and identifying surprises in real-world data.
5. Feature Selection: A Scientific Approach
This is where many beginners make mistakes. They pick features based on intuition ("credit score should matter!") rather than data-driven evidence. Let's use a rigorous scientific method to select the best features.
Why this matters: Different feature combinations produce vastly different results. We'll test systematically to find which combination gives us the best predictive power.
Step 1: Individual Feature Evaluation
We'll test each numerical feature independently to see which ones are most predictive:
from sklearn.metrics import roc_auc_score, f1_score
# List of all numerical features (excluding target)
numerical_features = ['person_age', 'person_income', 'person_emp_exp', 'loan_amnt',
'loan_int_rate', 'loan_percent_income', 'cb_person_cred_hist_length',
'credit_score']
y = data['loan_status']
results = []
# Test each feature individually
for feature in numerical_features:
    X = data[[feature]]
    # Split data
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    # Scale
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    # Train model
    model = LogisticRegression(max_iter=1000, random_state=42)
    model.fit(X_train_scaled, y_train)
    # Evaluate using ROC-AUC (robust to class imbalance)
    y_prob = model.predict_proba(X_test_scaled)[:, 1]
    auc = roc_auc_score(y_test, y_prob)
    results.append({'Feature': feature, 'AUC': auc})
# Sort by AUC (higher is better)
results_df = pd.DataFrame(results).sort_values('AUC', ascending=False)
print("\nIndividual Feature Performance (Ranked by AUC):")
print("=" * 50)
print(results_df.to_string(index=False))
Output:
Individual Feature Performance (Ranked by AUC):
==================================================
Feature AUC
loan_int_rate 0.7146
loan_percent_income 0.7077
person_income 0.6878
loan_amnt 0.5477
person_emp_exp 0.5211
person_age 0.5182
cb_person_cred_hist_length 0.5181
credit_score 0.5104
Key Findings:
loan_int_rate (AUC: 0.7146) - STRONGEST predictor: The interest rate the bank offers is highly predictive of approval. Banks likely set lower rates for lower-risk applicants.
loan_percent_income (AUC: 0.7077) - 2nd STRONGEST: Debt-to-income ratio directly reflects repayment capacity. This makes intuitive business sense.
person_income (AUC: 0.6878) - Moderate predictive power: Raw income has some signal but isn't as strong as ratio-based features.
credit_score (AUC: 0.5104) - Essentially useless!: This is surprising! Credit score barely outperforms random guessing (AUC 0.5). This shows that "obvious" features aren't always best.
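To build intuition for what an AUC of 0.71 or 0.51 means, it helps to know that ROC-AUC equals the probability that a randomly chosen approved loan gets a higher score than a randomly chosen rejected one. The sketch below computes that quantity directly by comparing every pair. The `pairwise_auc` helper and the toy labels and scores are illustrative assumptions; this is not how scikit-learn computes the metric internally.

```python
def pairwise_auc(labels, scores):
    """Fraction of (positive, negative) pairs where the positive scores higher.
    Ties count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy data: 1 = approved, 0 = rejected
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.7, 0.6, 0.4, 0.3, 0.1]

informative = pairwise_auc(labels, scores)
uninformative = pairwise_auc(labels, [0.5] * len(labels))
print(f"Informative scores: {informative:.4f}")
print(f"Constant scores:    {uninformative:.4f}")
```

A scorer that assigns everyone the same value lands at exactly 0.5, which is why credit_score at 0.51 is described as barely better than guessing.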
Step 2: Testing Feature Combinations
Now that we know which individual features are strong, let's test combinations to find the optimal set:
# Test different feature combinations
feature_combinations = [
(['person_income', 'credit_score', 'loan_amnt'], 'Baseline model'),
(['loan_percent_income', 'loan_int_rate'], 'Two strongest'),
(['loan_percent_income', 'loan_int_rate', 'loan_amnt'], 'Three features (recommended)'),
(['loan_percent_income', 'loan_int_rate', 'loan_amnt', 'credit_score'], 'Four features'),
(['loan_percent_income', 'loan_int_rate', 'person_income'], 'Alternative combination'),
]
combo_results = []
for features, name in feature_combinations:
    X = data[features]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    model = LogisticRegression(max_iter=1000, random_state=42)
    model.fit(X_train_scaled, y_train)
    y_pred = model.predict(X_test_scaled)
    y_prob = model.predict_proba(X_test_scaled)[:, 1]
    auc = roc_auc_score(y_test, y_prob)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    combo_results.append({
        'Combination': name,
        'Features': ' + '.join(features),
        'AUC': auc,
        'Precision': precision,
        'Recall': recall,
        'F1': f1
    })
combo_df = pd.DataFrame(combo_results).sort_values('AUC', ascending=False)
print("\nFeature Combination Performance (Ranked by AUC):")
print("=" * 100)
for idx, row in combo_df.iterrows():
    print(f"\n{row['Combination']}")
    print(f" Features: {row['Features']}")
    print(f" AUC: {row['AUC']:.4f} | Precision: {row['Precision']:.4f} | " +
          f"Recall: {row['Recall']:.4f} | F1: {row['F1']:.4f}")
Output:
Feature Combination Performance (Ranked by AUC):
====================================================================================================
Four features
Features: loan_percent_income + loan_int_rate + loan_amnt + credit_score
AUC: 0.8282 | Precision: 0.6783 | Recall: 0.4070 | F1: 0.5087
Three features (recommended)
Features: loan_percent_income + loan_int_rate + loan_amnt
AUC: 0.8282 | Precision: 0.6755 | Recall: 0.4060 | F1: 0.5071
Alternative combination
Features: loan_percent_income + loan_int_rate + person_income
AUC: 0.8217 | Precision: 0.6746 | Recall: 0.3970 | F1: 0.4998
Two strongest
Features: loan_percent_income + loan_int_rate
AUC: 0.8043 | Precision: 0.6705 | Recall: 0.3493 | F1: 0.4593
Baseline model
Features: person_income + credit_score + loan_amnt
AUC: 0.7459 | Precision: 0.6686 | Recall: 0.1174 | F1: 0.1997
Step 3: Scientific Conclusion
🏆 WINNER (TIE): loan_percent_income + loan_int_rate + loan_amnt
Both the three-feature and four-feature models achieve the same AUC of 0.8282. However, we recommend the three-feature model because it's simpler (Occam's Razor: why add complexity if it doesn't improve performance?). Adding credit_score provides no measurable benefit, which is consistent with its near-zero correlation with loan_status.
Why this combination is optimal:
| Metric | Recommended (3 features) | Baseline | Improvement |
|---|---|---|---|
| AUC | 0.8282 | 0.7459 | +11.0% |
| Recall | 40.60% | 11.74% | +3.5x (catches 3.5x more good loans!) |
| F1 Score | 0.5071 | 0.1997 | +2.5x |
| Simplicity | 3 features | 3 features | Same |
Key insight: A naive approach using income, credit score, and loan amount produces weak results. By using science-based feature selection, we get an 11% better model that catches 3.5 times more approved loans.
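The improvement figures in the table can be reproduced with a few lines of arithmetic. The sketch below uses the metric values reported in the combination results; note that the AUC gain is a relative change, while the recall and F1 figures are ratios.

```python
# Metrics copied from the combination results above
baseline = {"AUC": 0.7459, "Recall": 0.1174, "F1": 0.1997}
recommended = {"AUC": 0.8282, "Recall": 0.4060, "F1": 0.5071}

# Relative gain: (new - old) / old; ratio: new / old
gains = {m: (recommended[m] - baseline[m]) / baseline[m] for m in baseline}
ratios = {m: recommended[m] / baseline[m] for m in baseline}

for metric in baseline:
    print(f"{metric:6s}: relative gain {gains[metric]:+.1%}, ratio {ratios[metric]:.2f}x")
```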
6. Data Preprocessing
Data preprocessing is the foundation of any successful ML model. Let's prepare our data step by step.
Step 1: Select Features and Target
Now that we've scientifically determined the best features, let's prepare them:
# Create feature matrix (X) and target vector (y)
# Based on scientific evaluation: loan_percent_income + loan_int_rate + loan_amnt
X = data[['loan_percent_income', 'loan_int_rate', 'loan_amnt']]
y = data['loan_status']
print(f"Features shape: {X.shape}")
print(f"Target shape: {y.shape}")
print(f"\nFeatures selected:")
print(f" 1. loan_percent_income - Debt-to-income ratio (AUC: 0.7077)")
print(f" 2. loan_int_rate - Interest rate offered (AUC: 0.7146)")
print(f" 3. loan_amnt - Loan amount (AUC: 0.5477)")
Output:
Features shape: (45000, 3)
Target shape: (45000,)
Features selected:
 1. loan_percent_income - Debt-to-income ratio (AUC: 0.7077)
 2. loan_int_rate - Interest rate offered (AUC: 0.7146)
 3. loan_amnt - Loan amount (AUC: 0.5477)
Step 2: Train-Test Split
We need to split the data: 80% for training the model, 20% for testing it fairly.
# Split data into training (80%) and testing (20%)
X_train, X_test, y_train, y_test = train_test_split(
X, y,
test_size=0.2,
random_state=42
)
print(f"Training set size: {X_train.shape[0]} samples")
print(f"Testing set size: {X_test.shape[0]} samples")
Output:
Training set size: 36000 samples
Testing set size: 9000 samples
Why random_state=42? It ensures reproducibility. If you run this code again, you'll get the exact same train-test split.
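The idea behind a fixed seed can be sketched without scikit-learn: shuffle the row indices with a seeded random generator, then cut off the test share. The `split_indices` helper below is a hypothetical, simplified stand-in written for this example; it is not how `train_test_split` is implemented internally.

```python
import random

def split_indices(n_rows, test_size, seed):
    """Shuffle row indices with a seeded generator, then cut off the test share."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = int(n_rows * test_size)
    return indices[n_test:], indices[:n_test]  # (train, test)

train_a, test_a = split_indices(10, 0.2, seed=42)
train_b, test_b = split_indices(10, 0.2, seed=42)
train_c, test_c = split_indices(10, 0.2, seed=7)

print("Same seed gives same test rows:", test_a == test_b)
print("Seed 42 test rows:", test_a)
print("Seed 7 test rows: ", test_c)
```

The number 42 itself has no special meaning; any fixed integer gives a repeatable split.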
Step 3: Feature Scaling (Standardization)
Here's a critical step: our features have very different scales.
Loan percent income: ranges from 0 to 0.66 in this dataset (0% to 66%)
Interest rate: ranges from about 5% to 20%
Loan amount: ranges from 500 to 35,000
Logistic regression performs better (and converges faster) when features are on similar scales. We'll standardize them to have mean = 0 and standard deviation = 1.
# Initialize the scaler
scaler = StandardScaler()
# Fit on training data and transform it
X_train_scaled = scaler.fit_transform(X_train)
# Transform test data using the training statistics
X_test_scaled = scaler.transform(X_test)
print("Scaled training data (first 3 rows):")
print(X_train_scaled[:3])
print(f"\nMean of scaled training data: {X_train_scaled.mean(axis=0)}")
print(f"Std of scaled training data: {X_train_scaled.std(axis=0)}")
Output:
Scaled training data (first 3 rows):
[[ 0.11905358 0.57791043 0.85902478]
[ 0.23392088 0.28692477 0.3833934 ]
[-0.79988484 -0.21142997 2.28591891]]
Mean of scaled training data: [-4.27115133e-16 1.13553611e-15 1.08160394e-16]
Std of scaled training data: [1. 1. 1.]
Perfect! All features now have mean ≈ 0 and standard deviation = 1. (The tiny near-zero means like -4.27e-16 are effectively zero—just floating-point rounding artifacts.)
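Under the hood, standardization is just z = (x - mean) / std, with the mean and standard deviation learned from the training data only and then reused for the test data. A minimal standard-library sketch follows. The loan amounts are made up, and `fit_scaler` and `transform` are hypothetical helper names, not scikit-learn functions.

```python
import statistics

def fit_scaler(values):
    """Learn the mean and population standard deviation from training values."""
    return statistics.fmean(values), statistics.pstdev(values)

def transform(values, mean, std):
    """Apply z = (x - mean) / std to every value."""
    return [(v - mean) / std for v in values]

# Hypothetical loan amounts, split by hand into train and test
train_amounts = [5000.0, 8000.0, 12000.0, 35000.0]
test_amounts = [1000.0, 9500.0]

mean, std = fit_scaler(train_amounts)              # statistics come from train only
train_scaled = transform(train_amounts, mean, std)
test_scaled = transform(test_amounts, mean, std)   # reuse the training statistics

print("Scaled train:", [round(v, 3) for v in train_scaled])
print("Scaled test: ", [round(v, 3) for v in test_scaled])
```

Fitting on the training set alone keeps information about the test set from leaking into the model, which is why the tutorial calls `fit_transform` on the training data and only `transform` on the test data.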
7. Building and Training the Model
Initialize the Model
# Create a logistic regression model
model = LogisticRegression(max_iter=1000)
What is max_iter=1000? It's the maximum number of iterations the algorithm will try to find the best coefficients. The default (100) might not be enough for convergence, so we increase it.
Train the Model
# Train on scaled training data
model.fit(X_train_scaled, y_train)
print("Model training complete!")
Output:
Model training complete!
Understanding Model Coefficients
Once trained, the model has learned a coefficient for each feature. These tell us how much each feature influences the approval probability.
# Display coefficients
feature_names = ['loan_percent_income', 'loan_int_rate', 'loan_amnt']
print("Model Coefficients:")
print("=" * 50)
for name, coef in zip(feature_names, model.coef_[0]):
    direction = "increases" if coef > 0 else "decreases"
    print(f"{name:25s}: {coef:8.4f} ({direction} approval odds)")
print(f"{'Intercept':25s}: {model.intercept_[0]:8.4f}")
Output:
Model Coefficients:
==================================================
loan_percent_income : 1.3423 (increases approval odds)
loan_int_rate : 0.9964 (increases approval odds)
loan_amnt : -0.6607 (decreases approval odds)
Intercept : -1.7328
What does this mean?
The coefficients tell a clear and interpretable story:
loan_percent_income (+1.3423): Positive and strong. Higher debt-to-income ratio increases approval probability. This seems backward at first, but it reflects the data: applicants who were approved had higher DTI ratios than those rejected. This might indicate that the bank approved riskier customers (higher debt relative to income) but compensated by setting higher interest rates for them.
loan_int_rate (+0.9964): Positive coefficient. Higher interest rates go with higher approval probability. The bank charges higher rates to riskier borrowers, and those riskier borrowers are the ones being approved (likely because they accepted the higher rates). This reflects the bank's risk-based pricing strategy.
loan_amnt (-0.6607): Negative coefficient. With the other two features held fixed, higher loan amounts decrease approval probability, even though loan_amnt on its own correlates slightly positively with approval (0.108). The bank appears conservative with large loans, possibly because the risk exposure is greater, so it approves smaller loan requests more readily.
Important insight: These coefficients reveal the bank's actual lending strategy: they're willing to approve riskier applicants (higher DTI) but compensate with higher interest rates and smaller loan amounts. The positive correlation between high rates and approval isn't because high rates cause approval, but because they're two sides of the same risk management strategy.
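Because the features were standardized, each coefficient describes the effect of a one-standard-deviation increase in that feature. Exponentiating a coefficient turns it into an odds ratio, and plugging values into the sigmoid gives a probability. The sketch below does both with the coefficients printed above; the `approval_probability` helper and the example applicants are assumptions for illustration.

```python
import math

# Coefficients and intercept copied from the model output above
coefficients = {
    "loan_percent_income": 1.3423,
    "loan_int_rate": 0.9964,
    "loan_amnt": -0.6607,
}
intercept = -1.7328

# exp(coef) = multiplicative change in the odds per +1 standard deviation
odds_ratios = {name: math.exp(c) for name, c in coefficients.items()}

def approval_probability(standardized_features):
    """Sigmoid of the linear score for one applicant (features already scaled)."""
    z = intercept + sum(coefficients[k] * v for k, v in standardized_features.items())
    return 1.0 / (1.0 + math.exp(-z))

# An applicant exactly at the training mean has all standardized features = 0
average_applicant = {name: 0.0 for name in coefficients}
# Same applicant, but one standard deviation higher on loan_percent_income
higher_ratio = dict(average_applicant, loan_percent_income=1.0)

p_average = approval_probability(average_applicant)
p_higher = approval_probability(higher_ratio)

for name, ratio in odds_ratios.items():
    print(f"{name:20s} odds ratio: {ratio:.2f}")
print(f"P(approve | average applicant): {p_average:.3f}")
print(f"P(approve | +1 SD ratio):       {p_higher:.3f}")
```

An odds ratio above 1 raises the odds of approval and one below 1 lowers them, so loan_amnt's ratio of roughly 0.52 means each extra standard deviation of loan amount about halves the odds.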
Making Predictions
# Generate predictions on test set
y_pred = model.predict(X_test_scaled) # Hard predictions (0 or 1)
y_prob = model.predict_proba(X_test_scaled)[:, 1] # Probabilities
print("First 10 predictions:")
print(y_pred[:10])
print("\nFirst 10 probabilities:")
print(y_prob[:10])
Output:
First 10 predictions:
[0 0 0 0 0 0 0 0 0 0]
First 10 probabilities:
[0.01649626 0.23231836 0.4727754 0.39996233 0.46312602 0.0676381
0.139875 0.38037955 0.12514923 0.47888977]
What's happening?
y_pred: The first 10 predictions are all 0 (rejection). This shows the model is being conservative with this sample of the test set.
y_prob: Probabilities ranging from 0.016 to 0.479. Most are in the 0.1-0.5 range, showing moderate uncertainty. Since none of them exceeds the 0.5 threshold, all ten are classified as rejections.
8. Evaluating Model Performance
This is where most people make mistakes. Accuracy alone is not enough. Let's use multiple metrics to get the full picture.
Confusion Matrix
Let's see exactly what the model gets right and wrong:
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(cm)
tn, fp, fn, tp = cm.ravel()
print(f"\nTrue Negatives (correct rejections): {tn}")
print(f"False Positives (wrongly approved): {fp}")
print(f"False Negatives (wrongly rejected): {fn}")
print(f"True Positives (correct approvals): {tp}")
Output:
Confusion Matrix:
[[6598 392]
[1194 816]]
True Negatives (correct rejections): 6598
False Positives (wrongly approved): 392
False Negatives (wrongly rejected): 1194
True Positives (correct approvals): 816
Visual representation:
Predicted Rejected Predicted Approved
Actually Rejected 6598 ✓ 392 ✗ (Type I Error)
Actually Approved 1194 ✗ (Type II) 816 ✓
Analysis:
True Negatives (6598): Model correctly rejected loans
True Positives (816): Model correctly approved loans
False Positives (392): Model wrongly approved loans (risky)
False Negatives (1194): Model wrongly rejected loans (lost revenue)
From the confusion matrix we can calculate:
Precision: 816 / (816 + 392) = 67.5% (when the model predicts approval, it's right about 68% of the time)
Recall: 816 / (816 + 1194) = 40.6% (the model catches 41% of actual approvals)
Business interpretation: Our model catches 41% of approved loans while maintaining reasonably strong precision (68%). It makes 392 false approvals but misses 1,194 loans that were actually approved, so it errs on the side of caution: risk stays contained at the cost of lost opportunities. Whether that trade-off is right depends on what each kind of mistake costs the bank.
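All of the headline metrics follow from the four cells of the confusion matrix. The sketch below recomputes them from the counts shown above and adds the accuracy of an always-reject rule on the same test set for comparison.

```python
# Counts from the confusion matrix above
tn, fp, fn, tp = 6598, 392, 1194, 816
total = tn + fp + fn + tp

accuracy = (tp + tn) / total
precision = tp / (tp + fp)          # of predicted approvals, how many were right
recall = tp / (tp + fn)             # of actual approvals, how many were found
f1 = 2 * precision * recall / (precision + recall)

# Accuracy of a rule that rejects every application in the test set
always_reject_accuracy = (tn + fp) / total

print(f"Accuracy:  {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1:        {f1:.4f}")
print(f"Always-reject accuracy: {always_reject_accuracy:.4f}")
```

The model's accuracy is only a few points above the always-reject rule, which illustrates why precision and recall tell you more than accuracy on imbalanced data.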
Address Class Imbalance with SMOTE
The dataset is heavily imbalanced (78% rejected, 22% approved). This bias affects our model's recall. Let's fix it using SMOTE (Synthetic Minority Over-sampling Technique):
from imblearn.over_sampling import SMOTE
# Prepare data (using original 3 features)
X = data[['loan_percent_income', 'loan_int_rate', 'loan_amnt']]
y = data['loan_status']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Apply SMOTE to balance training data
smote = SMOTE(random_state=42)
X_train_balanced, y_train_balanced = smote.fit_resample(X_train_scaled, y_train)
print(f"Original training set: {y_train.value_counts().to_dict()}")
print(f"Balanced training set: {pd.Series(y_train_balanced).value_counts().to_dict()}")
# Train on balanced data
model_balanced = LogisticRegression(max_iter=1000)
model_balanced.fit(X_train_balanced, y_train_balanced)
# Evaluate on original test data
y_pred = model_balanced.predict(X_test_scaled)
y_prob = model_balanced.predict_proba(X_test_scaled)[:, 1]
precision_smote = precision_score(y_test, y_pred)
recall_smote = recall_score(y_test, y_pred)
auc_smote = roc_auc_score(y_test, y_prob)
print(f"\nWith SMOTE:")
print(f"Precision: {precision_smote:.4f}")
print(f"Recall: {recall_smote:.4f}")
print(f"AUC: {auc_smote:.4f}")
# Baseline values copied from the three-feature model evaluated earlier
print(f"\nWithout SMOTE (original):")
print(f"Precision: 0.6755")
print(f"Recall: 0.4060")
print(f"AUC: 0.8282")
Output:
Original training set: {0: 28010, 1: 7990}
Balanced training set: {0: 28010, 1: 28010}
With SMOTE:
Precision: 0.4779
Recall: 0.7483
AUC: 0.8281
Without SMOTE (original):
Precision: 0.6755
Recall: 0.4060
AUC: 0.8282
Analysis:
SMOTE dramatically changed the model's behavior:
| Metric | With SMOTE | Without SMOTE | Change |
|---|---|---|---|
| Precision | 47.79% | 67.55% | -19.76 points (more false approvals) |
| Recall | 74.83% | 40.60% | +34.23 points (catches far more actual approvals) |
| AUC | 0.8281 | 0.8282 | -0.0001 (essentially tied) |
What happened:
SMOTE created synthetic approved loans to balance the training data. The resulting model became much more aggressive about approving loans:
Massive recall improvement (74.83% vs 40.60%): the model now catches 75% of actual approvals instead of 41%
Precision trade-off (47.79% vs 67.55%): about 52% of the model's approvals are now false approvals
Same AUC: overall discrimination ability is unchanged, so SMOTE mainly shifted the decision boundary rather than improving how well applicants are ranked
Questions to explore:
Does your bank prefer catching 75% of good loans (with 52% of approvals being wrong) or 41% of good loans (with 32% wrong)?
What's the break-even point based on your costs (default cost vs lost revenue)?
Can you afford approvals that are wrong 52% of the time?
What you'll learn: Handling imbalanced datasets, oversampling techniques, synthetic data generation, precision-recall trade-offs, understanding the cost of different strategies, and practical solutions to real-world ML problems.
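For intuition about what "synthetic" means here, the core of SMOTE is interpolation: pick a minority-class point, pick one of its nearest minority neighbours, and create a new point somewhere on the line between them. The sketch below is a simplified illustration of that idea using the standard library. The `smote_like_sample` function and the toy points are assumptions made for this example, and the imbalanced-learn implementation differs in its details.

```python
import random

def smote_like_sample(minority, k, rng):
    """Create one synthetic point between a random minority point and one of
    its k nearest minority neighbours (squared Euclidean distance)."""
    base = rng.choice(minority)
    others = [p for p in minority if p is not base]
    others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
    neighbour = rng.choice(others[:k])
    gap = rng.random()  # position along the segment, in [0, 1)
    return tuple(b + gap * (n - b) for b, n in zip(base, neighbour))

# Toy minority-class points in a 2-feature space (already scaled)
minority_points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]

rng = random.Random(42)
synthetic = [smote_like_sample(minority_points, k=2, rng=rng) for _ in range(5)]

for point in synthetic:
    print(tuple(round(x, 3) for x in point))
```

Every synthetic point lies between two real minority points, so SMOTE fills in the region the minority class already occupies rather than inventing applicants from nowhere.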
9. Understanding Business Trade-offs
Here's something crucial: there's no "right" answer for accuracy, precision, or recall. It depends on what mistakes cost your business.
False Positives vs. False Negatives
In our model:
False Positive (Type I): We approve a loan that defaults → Bank loses money
False Negative (Type II): We reject a loan that would've been repaid → Bank loses opportunity
Which is Worse?
With our model (without SMOTE):
392 false positives (approved bad loans): direct financial loss
1,194 false negatives (rejected good loans): lost revenue opportunity
If we used SMOTE (balanced training), the reported precision (47.79%) and recall (74.83%) on the same 2,010 actual approvals imply roughly:
~1,640 false positives (about four times as many bad approvals): much larger financial loss
~500 false negatives (fewer than half as many good loans rejected): better customer capture
Our baseline model makes a conservative trade-off: it keeps bad approvals low (392) at the price of missing 1,194 loans that should have been approved. This suits a risk-averse lender, but the right choice depends on your business context.
Whether SMOTE is worth it depends entirely on your cost structure: Are you more willing to risk bad loans or lose good customers?
Adjusting the Threshold
By default, the model approves loans with > 50% probability. But we could change this. Note that y_prob was last assigned in the SMOTE section, so the code below recomputes the probabilities explicitly from model_balanced and evaluates all three thresholds on that same model:
# Probabilities from the SMOTE-trained model
y_prob_balanced = model_balanced.predict_proba(X_test_scaled)[:, 1]
# Compare a lenient, the default, and a strict threshold
threshold_labels = {0.3: 'lenient', 0.5: 'default', 0.7: 'strict'}
print("Threshold Impact:")
print("=" * 60)
for threshold, label in threshold_labels.items():
    y_pred_t = (y_prob_balanced > threshold).astype(int)
    precision_t = precision_score(y_test, y_pred_t)
    recall_t = recall_score(y_test, y_pred_t)
    print(f"Threshold {threshold} ({label}):")
    print(f" Precision: {precision_t:.2%}, Recall: {recall_t:.2%}")
Output:
Threshold Impact:
============================================================
Threshold 0.3 (lenient):
 Precision: 35.54%, Recall: 89.10%
Threshold 0.5 (default):
 Precision: 47.79%, Recall: 74.83%
Threshold 0.7 (strict):
 Precision: 61.46%, Recall: 51.64%
Key insight: Raising the threshold trades recall for precision. Moving from 0.5 to 0.7 lifts precision from 47.79% to 61.46%, while recall falls from 74.83% to 51.64%. Compare this with the model trained without SMOTE at its default threshold (67.55% precision, 40.60% recall): the SMOTE model at 0.7 gives up about 6 points of precision in exchange for about 11 points of recall.
About 64% of approvals are wrong at 0.3, 52% at 0.5, and 39% at 0.7
Recall drops steadily as the threshold rises: 89%, 75%, 52%
Which threshold to choose? It depends on your business goal:
Growth-focused (0.3): Catch 89% of good loans but accept that 64% of approvals are bad. Risky, but maximum customer capture
Balanced (0.5): 75% recall with 48% precision
Risk-averse (0.7): 52% recall with 61% precision. Fewer approvals, but far more of them are correct
This is a business decision, not a technical one. Consider your cost structure (default loss vs. missed revenue) to find your optimal threshold.
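One way to turn a cost structure into a threshold is to assign a cost to each kind of mistake and pick the threshold with the lowest total. The sketch below does this on a small made-up test set. The labels, probabilities, candidate thresholds, and cost figures are all hypothetical, so treat it as a pattern to adapt rather than a recommendation.

```python
# Hypothetical test labels (1 = should be approved) and predicted probabilities
labels = [1, 0, 1, 0, 0, 1, 0, 1, 0, 0]
probs = [0.92, 0.80, 0.75, 0.55, 0.45, 0.40, 0.35, 0.32, 0.20, 0.05]
thresholds = [0.1, 0.3, 0.5, 0.7, 0.9]

def total_cost(threshold, cost_fp, cost_fn):
    """Sum the cost of false approvals and false rejections at one threshold."""
    fp = sum(1 for y, p in zip(labels, probs) if p > threshold and y == 0)
    fn = sum(1 for y, p in zip(labels, probs) if p <= threshold and y == 1)
    return fp * cost_fp + fn * cost_fn

def best_threshold(cost_fp, cost_fn):
    """Return the candidate threshold with the lowest total cost."""
    return min(thresholds, key=lambda t: total_cost(t, cost_fp, cost_fn))

# Assumed cost ratios: a bad approval costs 5x a missed loan, and vice versa
risk_averse_choice = best_threshold(cost_fp=5.0, cost_fn=1.0)
growth_choice = best_threshold(cost_fp=1.0, cost_fn=5.0)

print("Bad approvals cost more ->", risk_averse_choice)
print("Missed loans cost more  ->", growth_choice)
```

When bad approvals are the expensive mistake, the search settles on a high threshold; when missed loans are the expensive mistake, it settles on a low one.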
10. Compare with Other ML Algorithms
Logistic Regression is interpretable but simple. Let's compare it with more sophisticated algorithms:
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
# Prepare data
X = data[['loan_percent_income', 'loan_int_rate', 'loan_amnt']]
y = data['loan_status']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Define models
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(random_state=42),
    'SVM': SVC(probability=True, random_state=42)
}
# Train and evaluate each
results = {}
print("Model Comparison:")
print("=" * 60)
for name, model in models.items():
    model.fit(X_train_scaled, y_train)
    y_pred = model.predict(X_test_scaled)
    y_prob = model.predict_proba(X_test_scaled)[:, 1]
    auc = roc_auc_score(y_test, y_prob)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    results[name] = {'AUC': auc, 'Precision': precision, 'Recall': recall}
    print(f"\n{name}")
    print(f" AUC: {auc:.4f}")
    print(f" Precision: {precision:.4f}")
    print(f" Recall: {recall:.4f}")
print("=" * 60)
Output:
```
Model Comparison:
============================================================

Logistic Regression
 AUC: 0.8282
 Precision: 0.6755
 Recall: 0.4060

Random Forest
 AUC: 0.8711
 Precision: 0.7067
 Recall: 0.6269

Gradient Boosting
 AUC: 0.8707
 Precision: 0.7090
 Recall: 0.5891

SVM
 AUC: 0.8151
 Precision: 0.7147
 Recall: 0.4736
============================================================
```
Analysis:
| Model | AUC | Precision | Recall | Speed | Interpretability |
|---|---|---|---|---|---|
| Logistic Regression | 0.8282 | 67.55% | 40.60% | ⚡ Fast | 🟢 Excellent |
| Random Forest | 🏆 0.8711 | 70.67% | 🏆 62.69% | 🟡 Medium | 🟡 Moderate |
| Gradient Boosting | 0.8707 | 70.90% | 58.91% | 🟡 Medium | 🟡 Moderate |
| SVM | 0.8151 | 🏆 71.47% | 47.36% | 🔴 Slow | 🔴 Poor |
Key findings:
🏆 Clear Winner: Random Forest
Highest AUC (0.8711) — best overall discrimination
62.69% recall: catches about 1.5x as many good loans as logistic regression (62.69% vs. 40.60%)
70.67% precision — only 29% of approvals default (vs. 32% for logistic regression)
Decent training speed and interpretability (feature importance available)
Runner-up: Gradient Boosting
Nearly tied AUC (0.8707)
Second-highest precision (70.90%), just behind SVM's 71.47%
58.91% recall: catches 59% of good loans
Slightly slower than Random Forest
Why Logistic Regression loses:
Lowest recall (40.60%) — misses too many good loans
Lower AUC (0.8282) — overall worse discrimination
BUT: Extreme simplicity, fastest training, fully interpretable coefficients
Why SVM underperforms:
Lowest AUC (0.8151): worst discrimination, even though it has the highest precision (71.47%)
Slow training time
Difficult to interpret
Not recommended for this task
Trade-off Summary:
Choose Random Forest if you want the best performance and can handle some complexity
Choose Gradient Boosting if you want slightly higher precision than Random Forest (70.90% vs. 70.67%) and can accept lower recall
Choose Logistic Regression if interpretability and simplicity matter more than raw performance
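On these metrics, Random Forest is the winner: it catches 62.69% of good loans (vs. 40.60% for logistic regression) while a smaller share of its approvals go bad (29% vs. 32%), and the added complexity is manageable. Because it approves more loans overall, the absolute number of false approvals is still higher, which is the cost to weigh.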
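The comparisons in this section can be recomputed directly from the reported metrics. The sketch below uses only the precision and recall values printed in the comparison output; the dictionary layout and variable names are illustrative choices.

```python
# Precision and recall as reported in the model comparison output
reported = {
    "Logistic Regression": {"precision": 0.6755, "recall": 0.4060},
    "Random Forest": {"precision": 0.7067, "recall": 0.6269},
    "Gradient Boosting": {"precision": 0.7090, "recall": 0.5891},
    "SVM": {"precision": 0.7147, "recall": 0.4736},
}

baseline = reported["Logistic Regression"]
summary = {}
for name, m in reported.items():
    summary[name] = {
        # Share of predicted approvals that were wrong (1 - precision)
        "bad_approval_share": 1 - m["precision"],
        # Recall relative to logistic regression
        "recall_vs_lr": m["recall"] / baseline["recall"],
    }
    print(f"{name:<20} bad approvals: {summary[name]['bad_approval_share']:.1%}  "
          f"recall vs. LR: {summary[name]['recall_vs_lr']:.2f}x")

most_precise = max(reported, key=lambda n: reported[n]["precision"])
print(f"Highest precision: {most_precise}")
```

On these figures, Random Forest catches about 1.5 times as many good loans as logistic regression, and SVM has the highest precision of the four models.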
Questions to explore:
Is 62.69% recall vs. 40.60% worth the added model complexity?
How does prediction time compare across models?
Can you explain Random Forest decisions to loan applicants?
What's your risk tolerance for the extra false approvals?
What you'll learn: Model comparison, ensemble methods vs simple models, complexity vs performance trade-offs, when to use complex models and when simple is better, the ROI of increased accuracy, and real-world implementation considerations.
10. Optimize the Decision Threshold with Cost Analysis
We've been using 0.5 as the approval threshold, but what if we could find the optimal threshold based on your bank's specific costs?
```python
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix

# Prepare predictions.
# Note: `model` is whichever fitted classifier the name currently points to. After the
# comparison loop above, that is the last model trained (SVM), so reassign it first
# (e.g. model = models['Random Forest']) if you want to tune a different one.
y_prob = model.predict_proba(X_test_scaled)[:, 1]

# Define costs (CUSTOMIZE THESE FOR YOUR BUSINESS)
fp_cost = 10000  # Cost of approving a bad loan (money lost to default)
fn_cost = 5000   # Cost of rejecting a good loan (missed profit opportunity)

# Test thresholds from 0.1 to 0.9
threshold_results = []
min_cost = float('inf')
optimal_threshold = 0.5
for threshold in np.arange(0.1, 0.91, 0.05):
    y_pred_thresh = (y_prob > threshold).astype(int)
    # Calculate confusion matrix
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred_thresh).ravel()
    # Calculate costs
    total_cost = (fp * fp_cost) + (fn * fn_cost)
    # Calculate metrics
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0
    threshold_results.append({
        'Threshold': threshold,
        'Cost': total_cost,
        'Precision': precision,
        'Recall': recall
    })
    # Keep the cheapest threshold seen so far
    if total_cost < min_cost:
        min_cost = total_cost
        optimal_threshold = threshold

# Display results
results_df = pd.DataFrame(threshold_results)
print(results_df.to_string(index=False))
print("\n" + "=" * 60)
print(f"OPTIMAL THRESHOLD: {optimal_threshold:.2f}")
print(f"Minimum Total Cost: ${min_cost:,.0f}")
print("=" * 60)
```
Output:
```
 Threshold     Cost  Precision   Recall
      0.10 54930000   0.257588 0.937313
      0.15 12805000   0.590551 0.708955
      0.20 11015000   0.635162 0.645274
      0.25 10165000   0.662452 0.599502
      0.30  9850000   0.674670 0.559204
      0.35  9490000   0.690909 0.529353
      0.40  9240000   0.704965 0.494527
      0.45  9095000   0.715046 0.468159
      0.50  8860000   0.732069 0.441791
      0.55  8780000   0.741395 0.417910
      0.60  8735000   0.749763 0.393532
      0.65  8830000   0.750000 0.364179
      0.70  8780000   0.762770 0.334328
      0.75  8850000   0.769231 0.298507
      0.80  8745000   0.804762 0.252239
      0.85  9120000   0.840336 0.149254
      0.90  9615000   0.886364 0.058209

============================================================
OPTIMAL THRESHOLD: 0.60
Minimum Total Cost: $8,735,000
============================================================
```
Analysis:
The results show a clear trade-off pattern:
Cost Journey (as threshold increases):
0.10 to 0.30: Dramatic cost drop from $54.9M to $9.85M by rejecting obviously bad loans
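0.30 to 0.60: Gradual improvement from $9.85M to $8.735M (the minimum!)
0.60 to 0.90: Cost stays between roughly $8.7M and $8.9M through 0.80 (0.80 is nearly tied with the minimum at $8.745M), then climbs to $9.6M at 0.90 as you reject too many good loans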
Why 0.60 is optimal:
| Aspect | At 0.60 | At Default (0.50) | Difference |
|---|---|---|---|
| Total Cost | $8,735,000 | $8,860,000 | Save $125,000 |
| Precision | 74.98% | 73.21% | Better (fewer bad approvals) |
| Recall | 39.35% | 44.18% | Lower (catch fewer good loans) |
The sweet spot: 0.60 is lower cost than 0.50, even though recall drops slightly. This is because:
You approve fewer loans, reducing false positive cost (bad loan defaults)
The savings exceed the lost opportunity cost from false negatives
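Here is a worked check of that claim. The sketch recovers confusion-matrix counts from the precision and recall columns of the output table, then applies the same cost formula as the threshold code. It assumes the test set contains 2,010 positive cases, a figure inferred from the recall values and not stated in the tutorial. The helper names are illustrative.

```python
FP_COST = 10_000    # cost of approving a bad loan (the tutorial's assumption)
FN_COST = 5_000     # cost of rejecting a good loan (the tutorial's assumption)
N_POSITIVES = 2010  # ASSUMPTION: inferred from the recall column, not given in the text

def counts_from_metrics(precision, recall, n_positives=N_POSITIVES):
    """Recover (tp, fp, fn) from rounded precision/recall values."""
    tp = round(recall * n_positives)
    fn = n_positives - tp
    fp = round(tp / precision) - tp
    return tp, fp, fn

def total_cost(fp, fn):
    """Same cost formula as the threshold sweep: FP and FN counts times unit costs."""
    return fp * FP_COST + fn * FN_COST

tp_50, fp_50, fn_50 = counts_from_metrics(0.732069, 0.441791)  # threshold 0.50
tp_60, fp_60, fn_60 = counts_from_metrics(0.749763, 0.393532)  # threshold 0.60

fp_savings = (fp_50 - fp_60) * FP_COST  # fewer bad approvals at 0.60
fn_extra = (fn_60 - fn_50) * FN_COST    # more good loans rejected at 0.60
print(f"Cost at 0.50: ${total_cost(fp_50, fn_50):,}")
print(f"Cost at 0.60: ${total_cost(fp_60, fn_60):,}")
print(f"FP savings ${fp_savings:,} - extra FN cost ${fn_extra:,} = ${fp_savings - fn_extra:,}")
```

Under that assumption, moving from 0.50 to 0.60 avoids 61 bad approvals ($610,000) and rejects 97 more good loans ($485,000), which nets out to the $125,000 difference in the table.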
Practical Business Decision:
If you adopt threshold 0.60:
✅ Save $125,000 on this test set compared to threshold 0.50
✅ 75% precision—confident in your approvals
⚠️ Catch 39% of good loans (down from 44%)
⚠️ Reject some creditworthy customers
Alternative scenarios:
Aggressive growth (0.30): Cost $9.85M but catch 56% of good loans
Conservative lending (0.80): Cost $8.745M with 80% precision but miss 75% of good loans
Maximum revenue (0.15): Approve far more applicants (71% recall) but cost rises to $12.8M
Questions to explore:
What's your actual cost of a default? (We assumed $10,000)
What's your profit margin per approved loan? (We assumed $5,000 opportunity cost)
Can you negotiate better terms to change these costs?
Is $125,000 in savings worth catching about 5 percentage points fewer good customers (recall falls from 44% to 39%)?
What you'll learn: Threshold optimization, cost-sensitive learning, business impact of ML decisions, ROI analysis, how to connect machine learning to business metrics, sensitivity analysis, real-world decision-making under trade-offs.
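One way to approach the first two questions is a small sensitivity analysis: hold the confusion counts fixed and vary the cost assumptions. The sketch below uses (false positive, false negative) counts for four thresholds, reconstructed from the output table on the assumption that the test set has 2,010 positive cases (inferred from the recall column, not stated in the tutorial). The `best_threshold` helper is hypothetical.

```python
# (false positives, false negatives) per threshold.
# ASSUMPTION: reconstructed from the table's precision/recall with 2,010 positives.
COUNTS = {
    0.30: (542, 886),
    0.50: (325, 1122),
    0.60: (264, 1219),
    0.80: (123, 1503),
}

def best_threshold(fp_cost, fn_cost, counts=COUNTS):
    """Return (threshold, cost) with the lowest total cost for the given unit costs."""
    costs = {t: fp * fp_cost + fn * fn_cost for t, (fp, fn) in counts.items()}
    t_best = min(costs, key=costs.get)
    return t_best, costs[t_best]

# Keep the default loss fixed at $10,000 and vary the missed-profit estimate
sensitivity = {}
for fn_cost in (2_000, 5_000, 10_000):
    sensitivity[fn_cost] = best_threshold(fp_cost=10_000, fn_cost=fn_cost)
    t, cost = sensitivity[fn_cost]
    print(f"fn_cost=${fn_cost:>6,}: best threshold {t:.2f}, total cost ${cost:,}")
```

Among these four thresholds, the cheapest choice moves from 0.80 to 0.60 to 0.30 as the missed-profit estimate rises from $2,000 to $10,000. The 0.60 optimum therefore holds only for the assumed cost ratio.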
11. Key Insights and Lessons Learned
What We Learned
Don't trust intuition—test scientifically. Credit score seemed important but turned out to have no predictive power (AUC: 0.5104).
Ratio-based features beat raw values. Debt-to-income ratio (0.7077 AUC) outperformed raw income (0.6878 AUC).
The bank's own assessment matters. Interest rate (0.7146 AUC) was the strongest individual predictor—banks already encode risk in their rates.
Feature combinations multiply power. Three thoughtfully-selected features together (AUC: 0.8282) beat naive approaches (AUC: 0.7459) by 11%.
Multiple metrics matter. Recall improved 3.5x while accuracy only improved 3.4%—showing why you need more than one metric.
Business context drives decisions. Trading false positives for false negatives is a business choice, not a technical one.
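The point about multiple metrics can be illustrated with a toy confusion matrix. The counts below are hypothetical (1,000 applicants, 220 of them truly approvable) and are chosen only to show how accuracy can look healthy on imbalanced data while recall is poor.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, and recall from confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }

# Hypothetical counts for two models on the same 1,000 applicants (220 positives)
cautious = classification_metrics(tp=40, fp=10, fn=180, tn=770)
balanced = classification_metrics(tp=140, fp=60, fn=80, tn=720)

for name, m in (("cautious", cautious), ("balanced", balanced)):
    print(f"{name:<9} accuracy={m['accuracy']:.1%}  "
          f"precision={m['precision']:.1%}  recall={m['recall']:.1%}")
```

In this made-up example, accuracy moves by only five points (81% to 86%) while recall is 3.5 times higher (about 18% to 64%). Judged on accuracy alone, the two models would look nearly equivalent.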
Model Limitations
This model still has room for improvement:
Limited features: We used 3 numerical features; the dataset has 14 total columns (including categorical ones)
Class imbalance: Could use SMOTE or class weighting for better balance
Threshold not optimized: We're using 0.5; business costs might suggest another value
Simple model: Logistic regression is interpretable but may not capture complex patterns
How to Improve Further
This tutorial covered the essentials, but here are advanced techniques you can explore:
Use categorical features: Encode person_gender, person_education, loan_intent, etc. (currently we used only numerical features)
Feature engineering: Create interaction terms (e.g., income × loan_amount) or polynomial features to capture non-linear relationships
Advanced class imbalance handling: Beyond SMOTE (which you explored in Task 2), try:
class_weight='balanced' in LogisticRegression for automatic cost weighting
Combination of SMOTE + Tomek links
Threshold optimization (which you learned in Task 4)
Hyperparameter tuning: Use GridSearchCV or RandomizedSearchCV to optimize:
Regularization strength (C parameter) in logistic regression
Tree depth and number of trees in Random Forest
Learning rate in Gradient Boosting
Cross-validation: Use k-fold validation (k=5 or k=10) instead of single train-test split for more robust evaluation
Ensemble combinations: Stack multiple models or use voting classifiers to combine Random Forest + Gradient Boosting
Business rule integration: Combine model predictions with business rules (e.g., "always reject if debt-to-income > 0.80")
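As a minimal sketch of the last idea, a hard business rule can be layered on top of the model's probability. The function name, the argument names, and the default 0.60 threshold are assumptions for illustration; the 0.80 debt-to-income cap is the example rule quoted in the list.

```python
def decide_loan(approval_prob, debt_to_income, threshold=0.60, max_dti=0.80):
    """Combine a model probability with a hard business rule.

    Returns a (decision, reason) tuple. The rule is checked first, so it
    overrides the model no matter how confident the prediction is.
    """
    if debt_to_income > max_dti:
        return "reject", f"debt-to-income {debt_to_income:.2f} exceeds cap {max_dti:.2f}"
    if approval_prob > threshold:
        return "approve", f"model probability {approval_prob:.2f} above {threshold:.2f}"
    return "reject", f"model probability {approval_prob:.2f} not above {threshold:.2f}"

# Hypothetical applicants: (model probability, debt-to-income ratio)
applicants = [(0.92, 0.85), (0.72, 0.30), (0.41, 0.25)]
decisions = [decide_loan(prob, dti) for prob, dti in applicants]
for (prob, dti), (decision, reason) in zip(applicants, decisions):
    print(f"prob={prob:.2f} dti={dti:.2f} -> {decision} ({reason})")
```

Returning the reason alongside the decision keeps the rule auditable, which matters when a rejection has to be explained to an applicant.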
12. Common Issues and Solutions
Convergence Warning
ConvergenceWarning: lbfgs failed to converge (status=1)
Solution: Increase max_iter:
```python
model = LogisticRegression(max_iter=10000)
```
Poor Performance After Feature Selection
If your features perform poorly despite scientific selection:
Solution 1: Check for data leakage
Don't include features calculated from target or future information.
For example, don't use 'loan_default' to predict 'loan_status'.
Solution 2: Use domain knowledge to filter
Some features might be too noisy despite good correlation
Combine statistics with business logic
Solution 3: Handle outliers
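```python
# Cap extreme values (e.g., those 125-year careers) at the 1st and 99th percentiles
lower, upper = X_train.quantile(0.01), X_train.quantile(0.99)
X_train = X_train.clip(lower=lower, upper=upper, axis=1)
X_test = X_test.clip(lower=lower, upper=upper, axis=1)  # reuse the training bounds
```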
Precision vs. Recall Trade-off
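Move the threshold in the direction of the metric you need more:
Solution 1: Lower the threshold to raise recall (at the cost of precision)
```python
y_pred_new = (y_prob > 0.3).astype(int)  # More approvals
```
Solution 2: Raise the threshold to raise precision (at the cost of recall)
```python
y_pred_new = (y_prob > 0.7).astype(int)  # Fewer, safer approvals
```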
13. Conclusion: What You've Learned & Next Steps
Key Takeaways
You've now mastered the fundamentals of classification ML:
✓ Scientific methodology beats intuition — Credit score seemed important but was useless (AUC: 0.5104)
✓ Features matter more than models — Good features + logistic regression > bad features + complex ensemble
✓ Data-driven decisions win: we improved from a baseline AUC of 0.7459 to 0.8282 (an 11% relative gain) through feature selection alone
✓ Business context drives choices — Threshold 0.60 beats 0.50 despite lower recall (cost analysis matters!)
✓ Trade-offs are everywhere — Precision vs. recall, recall vs. cost, simplicity vs. performance
The core lesson: Great ML is 80% feature engineering and domain knowledge, 20% model optimization. Our largest improvement (AUC 0.7459 to 0.8282) came from choosing better features; switching to Random Forest added a smaller gain (0.8282 to 0.8711).
Your Next Steps
First, revisit the tasks in Section 2 as you gain confidence:
Task 1 — Feature selection trade-offs (how many features is enough?)
Task 2 — Handling class imbalance (SMOTE vs. cost weighting)
Task 3 — Ensemble methods (when does Random Forest beat logistic regression?)
Task 4 — Cost-sensitive optimization (finding your business's optimal threshold)
Then, explore advanced topics:
Model interpretability: SHAP values, LIME (explain predictions to customers)
Hyperparameter tuning: GridSearchCV, RandomizedSearchCV (squeeze more performance)
Regularization: L1/L2 penalties (prevent overfitting on high-dimensional data)
Cross-validation: k-fold validation (more robust than single train-test split)
Feature engineering: Polynomial features, interactions (capture non-linear patterns)
Production ML: Model serving, monitoring, retraining (deploy safely at scale)
Fairness & bias: Audit your model (ensure it doesn't discriminate by gender, age, etc.)
Most importantly: Keep building. Loan approval is just one domain—these techniques apply to credit scoring, fraud detection, customer churn, medical diagnosis, and countless others.
Good luck building your next classification model! 🚀
Data source: https://www.kaggle.com/datasets/taweilo/loan-approval-classification-data
