1. Introduction

Every day, banks receive thousands of loan applications. Approving each one manually would be time-consuming and prone to human bias. But what if a machine learning model could help predict whether an applicant is likely to be approved or rejected?

In this tutorial, you'll build exactly that—a logistic regression model that learns patterns from historical loan data and makes automatic predictions on new applications. By the end, you'll understand:

How to use scientific methodology to select the best features
How logistic regression works for binary classification
How to preprocess financial data for machine learning
How to evaluate whether your model makes good predictions
How to interpret results in a real-world banking context

What you'll build: A classifier that takes an applicant's debt-to-income ratio, interest rate, and loan amount, and predicts whether their loan will be approved or rejected.

What you'll extend: Four tasks along the way challenge you to take the model further by using all available features, addressing class imbalance, comparing different algorithms, and optimizing decision thresholds.

No prior ML experience needed. We'll explain every step, show the code, and display the results so you can see what's happening at each stage.

2. How This Tutorial Works

This is a learn-by-doing tutorial. We'll build a complete loan approval model step-by-step, and throughout the journey, you'll encounter four progressive tasks integrated at key points:

Task	What You'll Learn
Task 1: Explore More Features	Feature distributions, importance, data quality
Task 2: Handle Class Imbalance	SMOTE, real-world data challenges
Task 3: Compare ML Algorithms	Ensemble methods, complexity trade-offs
Task 4: Optimize Threshold	Cost-sensitive optimization, ROI analysis

How to use this guide:

✅ First read: Complete the tutorial linearly, skip tasks on first pass
✅ Tasks are designed for revisiting as your confidence grows
✅ Each task builds on what you just learned
✅ Can be done in order: Task 1 → Task 2 → Task 3 → Task 4

Ready? Let's start building! 🚀

3. What is Logistic Regression?

The Problem with Linear Regression

Imagine you tried to predict loan approval using simple linear regression (the kind that draws a straight line through data). The problem? Linear regression can output any number—like 2.5 or -10—but loan approval is binary: yes (1) or no (0).

Linear Regression Output: ..., -1, 0.5, 1.8, 2.2, ...
Loan Approval Reality: 0 or 1 (rejected or approved)

This mismatch is why we need logistic regression instead.

The Solution: The Sigmoid Function

Logistic regression uses a special function called the sigmoid function that squashes any number into a probability between 0 and 1:

$$\text{Sigmoid}(z) = \frac{1}{1 + e^{-z}}$$

Think of it as a gate: no matter what the input, the output is always between 0 and 1. This is perfect for probabilities.

If output = 0.8 → 80% chance of approval
If output = 0.3 → 30% chance of approval
If output = 0.5 → 50-50 odds
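
To make the formula concrete, here is a minimal sketch of the sigmoid in plain Python. The function name sigmoid and the sample inputs are illustrative assumptions and are not part of the tutorial's dataset code. In logistic regression, z is the weighted sum of the features plus an intercept.

```python
import math

def sigmoid(z: float) -> float:
    """Map any real number z to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

# Illustrative inputs: strongly negative, zero, and strongly positive scores
for z in (-10, -1, 0, 1, 10):
    print(f"z = {z:>3} -> sigmoid(z) = {sigmoid(z):.4f}")
```

A score of 0 lands exactly on 0.5, and the output approaches 0 or 1 as z moves far from zero in either direction.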

Why It Matters

In loan approval, you don't just want a yes/no answer—you want confidence levels. Logistic regression gives you both:

A hard prediction (approved or rejected)
A probability showing how confident the model is

This allows banks to set their own acceptance threshold. Maybe they accept anything over 70% confidence, or maybe 40%—it depends on their risk tolerance.
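
The effect of the acceptance threshold can be sketched in a few lines. The helper decide and the three probabilities (taken from the sigmoid examples earlier) are illustrative assumptions, not output from the trained model.

```python
# Example probabilities from the sigmoid discussion: 80%, 30%, 50%
probabilities = [0.8, 0.3, 0.5]

def decide(prob: float, threshold: float) -> str:
    """Return a hard decision for one applicant at the given threshold."""
    return "approved" if prob > threshold else "rejected"

strict = [decide(p, 0.7) for p in probabilities]   # cautious bank
lenient = [decide(p, 0.4) for p in probabilities]  # growth-oriented bank

print("Threshold 0.7:", strict)
print("Threshold 0.4:", lenient)
```

The same applicant with a 50% probability is rejected by the cautious bank and approved by the lenient one, even though the model's output is identical.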

4. Understanding the Dataset

The Loan Approval Dataset

We're working with a real-world dataset of 45,000 loan applications. Let's explore what we're working with.

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix, roc_auc_score, roc_curve
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset
data = pd.read_csv('loan_data.csv')

# Display first 5 rows
print(data.head())

Output:

person_age person_gender person_education  person_income  person_emp_exp  \
0        22.0        female           Master        71948.0               0   
1        21.0        female      High School        12282.0               0   
2        25.0        female      High School        12438.0               3   
3        23.0        female         Bachelor        79753.0               0   
4        24.0          male           Master        66135.0               1   

  person_home_ownership  loan_amnt loan_intent  loan_int_rate  \
0                  RENT    35000.0    PERSONAL          16.02   
1                   OWN     1000.0   EDUCATION          11.14   
2              MORTGAGE     5500.0     MEDICAL          12.87   
3                  RENT    35000.0     MEDICAL          15.23   
4                  RENT    35000.0     MEDICAL          14.27   

   loan_percent_income  cb_person_cred_hist_length  credit_score  \
0                 0.49                         3.0           561   
1                 0.08                         2.0           504   
2                 0.44                         3.0           635   
3                 0.44                         2.0           675   
4                 0.53                         4.0           586   

  previous_loan_defaults_on_file  loan_status  
0                             No            1  
1                            Yes            0  
2                             No            1  
3                             No            1  
4                             No            1

Exploring the Dataset

# Summary statistics for numerical features
print(data.describe())

Output:

person_age  person_income  person_emp_exp     loan_amnt  \
count  45000.000000   4.500000e+04    45000.000000  45000.000000   
mean      27.764178   8.031905e+04        5.410333   9583.157556   
std        6.045108   8.042250e+04        6.063532   6314.886691   
min       20.000000   8.000000e+03        0.000000    500.000000   
25%       24.000000   4.720400e+04        1.000000   5000.000000   
50%       26.000000   6.704800e+04        4.000000   8000.000000   
75%       30.000000   9.578925e+04        8.000000  12237.250000   
max      144.000000   7.200766e+06      125.000000  35000.000000   

       loan_int_rate  loan_percent_income  cb_person_cred_hist_length  \
count   45000.000000         45000.000000                45000.000000   
mean       11.006606             0.139725                    5.867489   
std         2.978808             0.087212                    3.879702   
min         5.420000             0.000000                    2.000000   
25%         8.590000             0.070000                    3.000000   
50%        11.010000             0.120000                    4.000000   
75%        12.990000             0.190000                    8.000000   
max        20.000000             0.660000                   30.000000   

       credit_score   loan_status  
count  45000.000000  45000.000000  
mean     632.608756      0.222222  
std       50.435865      0.415744  
min      390.000000      0.000000  
25%      601.000000      0.000000  
50%      640.000000      0.000000  
75%      670.000000      0.000000  
max      850.000000      1.000000

Understanding Class Imbalance

# Check the distribution of loan approvals
print(data['loan_status'].value_counts())
print(data['loan_status'].value_counts(normalize=True))

Output:

0    35000
1    10000
Name: loan_status, dtype: int64
0    0.777778
1    0.222222
Name: loan_status, dtype: float64

Important finding: This dataset is imbalanced:

77.8% of loans are rejected (0)
22.2% of loans are approved (1)

This matters because the model will have an easier time predicting rejections (it sees them 3.5x more often). We'll need to pay attention to this when evaluating performance.
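
One way to see why imbalance matters for evaluation is to score a trivial "model" that rejects every application. The sketch below uses only the class counts printed above; the variable names are illustrative.

```python
# Class counts reported by value_counts() above
rejected, approved = 35_000, 10_000
total = rejected + approved

# A model that rejects every application is right on every rejected loan...
baseline_accuracy = rejected / total
# ...but it never identifies a single approved loan
baseline_recall = 0 / approved

print(f"Always-reject accuracy: {baseline_accuracy:.1%}")
print(f"Always-reject recall:   {baseline_recall:.1%}")
```

Any accuracy figure for a real model should be read against this baseline, since a model can score close to 78% while learning nothing about approvals.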

Explore More Features

Now that you understand the data, let's explore what other features might be useful:

# Explore ALL numerical features in the dataset
numerical_features = data.select_dtypes(include=[np.number]).columns.tolist()
print("All numerical features:")
for feature in numerical_features:
    print(f"  - {feature}")

# Calculate correlation of ALL features with loan_status
correlations = data[numerical_features].corr()['loan_status'].sort_values(ascending=False)
print("\nCorrelation with loan_status:")
print(correlations)

# Visualize distributions for features you're curious about
fig, axes = plt.subplots(2, 4, figsize=(15, 8))
for idx, feature in enumerate(numerical_features[:-1]):  # Exclude target
    ax = axes[idx // 4, idx % 4]
    ax.hist(data[data['loan_status'] == 0][feature], alpha=0.6, label='Rejected', bins=30)
    ax.hist(data[data['loan_status'] == 1][feature], alpha=0.6, label='Approved', bins=30)
    ax.set_title(feature)
    ax.legend()
plt.tight_layout()
plt.show()

Output:

All numerical features:
  - person_age
  - person_income
  - person_emp_exp
  - loan_amnt
  - loan_int_rate
  - loan_percent_income
  - cb_person_cred_hist_length
  - credit_score
  - loan_status

Correlation with loan_status:
loan_status                   1.000000
loan_percent_income           0.384880
loan_int_rate                 0.332005
loan_amnt                     0.107714
credit_score                 -0.007647
cb_person_cred_hist_length   -0.014851
person_emp_exp               -0.020481
person_age                   -0.021476
person_income                -0.135808
Name: loan_status, dtype: float64

Based on the visualizations and correlation analysis, here are the key findings:

Strongest predictors (biggest differences between approved and rejected):

loan_percent_income (correlation: 0.385) - Clear separation between approved and rejected. Because the correlation is positive, approved loans cluster at higher DTI ratios.
loan_int_rate (correlation: 0.332) - Approved loans tend to have higher interest rates. Banks likely assign higher rates to riskier approved loans.
loan_amnt (correlation: 0.108) - Approved loans show a slight skew toward larger amounts.

Surprising findings:

credit_score (correlation: -0.008) - Almost NO relationship with approval! Despite being a standard lending metric, it's virtually useless in this dataset.
person_income (correlation: -0.136) - Negative correlation! Higher income slightly decreases approval odds. This contradicts intuition and suggests confounding factors.
person_age and person_emp_exp - Weak correlations (≈-0.02), suggesting age and experience don't strongly predict approval in this data.

Which features are redundant:

credit_score, person_age, cb_person_cred_hist_length all have near-zero correlations. They add noise rather than signal.
loan_percent_income and loan_int_rate are correlated with each other but both are still predictive, so both have value.

Business interpretation:

The bank's own interest rate assessment (loan_int_rate) is one of the two strongest signals; it encodes the bank's risk judgment.
Debt-to-income ratio (loan_percent_income) has the strongest correlation and matters more than raw income, suggesting the bank evaluates repayment capacity, not just earnings.
Traditional credit metrics (credit_score) show almost no linear relationship with approval in this dataset, which may indicate a bias in the data or a different lending philosophy.

Questions to explore:

Why is credit score useless? Is this dataset filtered differently than typical lending data?
Why does higher income correlate with rejection? What other factors might explain this?
Should we drop the weak-correlation features or keep them for ensemble models?
Are there interaction effects between features that aren't captured by individual correlations?

What you'll learn: Feature exploration, correlation analysis, data visualization for feature importance, intuition-building about which features matter, and identifying surprises in real-world data.

5. Feature Selection: A Scientific Approach

This is where many beginners make mistakes. They pick features based on intuition ("credit score should matter!") rather than data-driven evidence. Let's use a rigorous scientific method to select the best features.

Why this matters: Different feature combinations produce vastly different results. We'll test systematically to find which combination gives us the best predictive power.

Step 1: Individual Feature Evaluation

We'll test each numerical feature independently to see which ones are most predictive:

from sklearn.metrics import roc_auc_score, f1_score

# List of all numerical features (excluding target)
numerical_features = ['person_age', 'person_income', 'person_emp_exp', 'loan_amnt', 
                      'loan_int_rate', 'loan_percent_income', 'cb_person_cred_hist_length', 
                      'credit_score']

y = data['loan_status']
results = []

# Test each feature individually
for feature in numerical_features:
    X = data[[feature]]
    
    # Split data
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    
    # Scale
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    
    # Train model
    model = LogisticRegression(max_iter=1000, random_state=42)
    model.fit(X_train_scaled, y_train)
    
    # Evaluate using ROC-AUC (robust to class imbalance)
    y_prob = model.predict_proba(X_test_scaled)[:, 1]
    auc = roc_auc_score(y_test, y_prob)
    
    results.append({'Feature': feature, 'AUC': auc})

# Sort by AUC (higher is better)
results_df = pd.DataFrame(results).sort_values('AUC', ascending=False)
print("\nIndividual Feature Performance (Ranked by AUC):")
print("=" * 50)
print(results_df.to_string(index=False))

Output:

Individual Feature Performance (Ranked by AUC):
==================================================
             Feature      AUC
         loan_int_rate 0.7146
   loan_percent_income 0.7077
         person_income 0.6878
             loan_amnt 0.5477
        person_emp_exp 0.5211
            person_age 0.5182
cb_person_cred_hist_length 0.5181
          credit_score 0.5104

Key Findings:

loan_int_rate (AUC: 0.7146) - STRONGEST predictor: The interest rate the bank offers is highly predictive of approval. Banks likely set lower rates for lower-risk applicants.
loan_percent_income (AUC: 0.7077) - 2nd STRONGEST: Debt-to-income ratio directly reflects repayment capacity. This makes intuitive business sense.
person_income (AUC: 0.6878) - Moderate predictive power: Raw income has some signal but isn't as strong as ratio-based features.
credit_score (AUC: 0.5104) - Essentially useless on its own: Credit score barely outperforms random guessing (AUC = 0.5). This is surprising, and it shows that "obvious" features aren't always the best ones.
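
To build intuition for these AUC numbers, the sketch below computes AUC directly from its definition: the share of (approved, rejected) pairs in which the approved case receives the higher score. The function pairwise_auc and the toy labels and scores are made up for illustration; this is not how scikit-learn computes roc_auc_score internally, although it should agree with it on the same inputs.

```python
def pairwise_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties count as half a correct ranking.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy labels and scores (made up for illustration)
labels = [0, 0, 1, 0, 1, 1]
informative = [0.1, 0.3, 0.4, 0.5, 0.8, 0.9]  # positives mostly ranked higher
uninformative = [0.5] * 6                      # same score for everyone

print(f"Informative scores:   {pairwise_auc(labels, informative):.4f}")
print(f"Uninformative scores: {pairwise_auc(labels, uninformative):.4f}")
```

A feature that cannot separate the classes at all lands at 0.5, which is why an AUC of 0.51 counts as almost no signal.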

Step 2: Testing Feature Combinations

Now that we know which individual features are strong, let's test combinations to find the optimal set:

# Test different feature combinations
feature_combinations = [
    (['person_income', 'credit_score', 'loan_amnt'], 'Baseline model'),
    (['loan_percent_income', 'loan_int_rate'], 'Two strongest'),
    (['loan_percent_income', 'loan_int_rate', 'loan_amnt'], 'Three features (recommended)'),
    (['loan_percent_income', 'loan_int_rate', 'loan_amnt', 'credit_score'], 'Four features'),
    (['loan_percent_income', 'loan_int_rate', 'person_income'], 'Alternative combination'),
]

combo_results = []

for features, name in feature_combinations:
    X = data[features]
    
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    
    model = LogisticRegression(max_iter=1000, random_state=42)
    model.fit(X_train_scaled, y_train)
    
    y_pred = model.predict(X_test_scaled)
    y_prob = model.predict_proba(X_test_scaled)[:, 1]
    
    auc = roc_auc_score(y_test, y_prob)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    
    combo_results.append({
        'Combination': name,
        'Features': ' + '.join(features),
        'AUC': auc,
        'Precision': precision,
        'Recall': recall,
        'F1': f1
    })

combo_df = pd.DataFrame(combo_results).sort_values('AUC', ascending=False)

print("\nFeature Combination Performance (Ranked by AUC):")
print("=" * 100)
for idx, row in combo_df.iterrows():
    print(f"\n{row['Combination']}")
    print(f"  Features: {row['Features']}")
    print(f"  AUC: {row['AUC']:.4f} | Precision: {row['Precision']:.4f} | " + 
          f"Recall: {row['Recall']:.4f} | F1: {row['F1']:.4f}")

Output:

Feature Combination Performance (Ranked by AUC):
====================================================================================================

Four features
  Features: loan_percent_income + loan_int_rate + loan_amnt + credit_score
  AUC: 0.8282 | Precision: 0.6783 | Recall: 0.4070 | F1: 0.5087

Three features (recommended)
  Features: loan_percent_income + loan_int_rate + loan_amnt
  AUC: 0.8282 | Precision: 0.6755 | Recall: 0.4060 | F1: 0.5071

Alternative combination
  Features: loan_percent_income + loan_int_rate + person_income
  AUC: 0.8217 | Precision: 0.6746 | Recall: 0.3970 | F1: 0.4998

Two strongest
  Features: loan_percent_income + loan_int_rate
  AUC: 0.8043 | Precision: 0.6705 | Recall: 0.3493 | F1: 0.4593

Baseline model
  Features: person_income + credit_score + loan_amnt
  AUC: 0.7459 | Precision: 0.6686 | Recall: 0.1174 | F1: 0.1997

Step 3: Scientific Conclusion

🏆 WINNER (TIE): loan_percent_income + loan_int_rate + loan_amnt

Both the three-feature and four-feature models achieve the same AUC of 0.8282. However, we recommend the three-feature model because it's simpler (Occam's Razor: why add complexity if it doesn't improve performance?). Adding credit_score provides no additional benefit, which is consistent with its near-zero correlation.

Why this combination is optimal:

Metric	Recommended (3 features)	Baseline	Improvement
AUC	0.8282	0.7459	+11.0% (relative)
Recall	40.60%	11.74%	3.5x (catches about 3.5 times as many approved loans)
F1 Score	0.5071	0.1997	2.5x
Simplicity	3 features	3 features	Same

Key insight: A naive approach using income, credit score, and loan amount produces weak results. Data-driven feature selection gives a model whose AUC is 11% higher in relative terms and that catches about 3.5 times as many approved loans.
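
The improvement figures can be reproduced from the printed metrics. This is a small arithmetic check; the dictionary names are illustrative, and the values are copied from the feature-combination output above.

```python
# Metrics copied from the feature-combination output above
recommended = {"AUC": 0.8282, "Recall": 0.4060, "F1": 0.5071}
baseline = {"AUC": 0.7459, "Recall": 0.1174, "F1": 0.1997}

# Relative gain for AUC, simple ratios for recall and F1
auc_gain = (recommended["AUC"] - baseline["AUC"]) / baseline["AUC"]
recall_ratio = recommended["Recall"] / baseline["Recall"]
f1_ratio = recommended["F1"] / baseline["F1"]

print(f"AUC relative gain: {auc_gain:.1%}")
print(f"Recall ratio:      {recall_ratio:.1f}x")
print(f"F1 ratio:          {f1_ratio:.1f}x")
```

Note that the AUC figure is a relative gain; in absolute terms the AUC rises by about 0.08.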

6. Data Preprocessing

Data preprocessing is the foundation of any successful ML model. Let's prepare our data step by step.

Step 1: Select Features and Target

Now that we've scientifically determined the best features, let's prepare them:

# Create feature matrix (X) and target vector (y)
# Based on scientific evaluation: loan_percent_income + loan_int_rate + loan_amnt
X = data[['loan_percent_income', 'loan_int_rate', 'loan_amnt']]
y = data['loan_status']

print(f"Features shape: {X.shape}")
print(f"Target shape: {y.shape}")
print(f"\nFeatures selected:")
print(f"  1. loan_percent_income - Debt-to-income ratio (AUC: 0.7077)")
print(f"  2. loan_int_rate - Interest rate offered (AUC: 0.7146)")
print(f"  3. loan_amnt - Loan amount (AUC: 0.5477)")

Output:

Features shape: (45000, 3)
Target shape: (45000,)

Features selected:
  1. loan_percent_income - Debt-to-income ratio (AUC: 0.7077)
  2. loan_int_rate - Interest rate offered (AUC: 0.7146)
  3. loan_amnt - Loan amount (AUC: 0.5477)

Step 2: Train-Test Split

We need to split the data: 80% for training the model, 20% for testing it fairly.

# Split data into training (80%) and testing (20%)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.2, 
    random_state=42
)

print(f"Training set size: {X_train.shape[0]} samples")
print(f"Testing set size: {X_test.shape[0]} samples")

Output:

Training set size: 36000 samples
Testing set size: 9000 samples

Why random_state=42? It ensures reproducibility. If you run this code again, you'll get the exact same train-test split.
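
The idea behind a fixed seed can be sketched with the standard library alone. The helper split_indices is a hypothetical stand-in written for this illustration; it is not scikit-learn's implementation of train_test_split, and it will not reproduce the same rows that random_state=42 selects there.

```python
import random

def split_indices(n_rows: int, test_fraction: float, seed: int):
    """Shuffle row indices with a seeded generator, then cut off a test slice."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = int(n_rows * test_fraction)
    return indices[n_test:], indices[:n_test]  # (train, test)

train_a, test_a = split_indices(45_000, 0.2, seed=42)
train_b, test_b = split_indices(45_000, 0.2, seed=42)
train_c, test_c = split_indices(45_000, 0.2, seed=7)

print(len(train_a), len(test_a))
print("Same seed gives same split:     ", test_a == test_b)
print("Different seed gives same split:", test_a == test_c)
```

The value 42 itself has no special meaning. Any fixed integer gives a reproducible split.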

Step 3: Feature Scaling (Standardization)

Here's a critical step: our features have very different scales.

Loan percent income: ranges from 0 to 0.66 (0% to 66% of income)
Interest rate: ranges from about 5.4% to 20%
Loan amount: ranges from 500 to 35,000

Logistic regression performs better (and converges faster) when features are on similar scales. We'll standardize them to have mean = 0 and standard deviation = 1.

# Initialize the scaler
scaler = StandardScaler()

# Fit on training data and transform it
X_train_scaled = scaler.fit_transform(X_train)

# Transform test data using the training statistics
X_test_scaled = scaler.transform(X_test)

print("Scaled training data (first 3 rows):")
print(X_train_scaled[:3])
print(f"\nMean of scaled training data: {X_train_scaled.mean(axis=0)}")
print(f"Std of scaled training data: {X_train_scaled.std(axis=0)}")

Output:

Scaled training data (first 3 rows):
[[ 0.11905358  0.57791043  0.85902478]
 [ 0.23392088  0.28692477  0.3833934 ]
 [-0.79988484 -0.21142997  2.28591891]]

Mean of scaled training data: [-4.27115133e-16  1.13553611e-15  1.08160394e-16]
Std of scaled training data: [1. 1. 1.]

Perfect! All features now have mean ≈ 0 and standard deviation = 1. (The tiny near-zero means like -4.27e-16 are effectively zero—just floating-point rounding artifacts.)
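
What StandardScaler does to each column can be written out by hand: subtract the column mean, then divide by the column's standard deviation. The sketch below applies this to five interest rates borrowed from the summary statistics above (min, quartiles, max); the variable names are illustrative. It uses the population standard deviation (dividing by n), which is the convention StandardScaler follows.

```python
import statistics

# Five interest rates taken from the summary statistics above (in percent)
rates = [5.42, 8.59, 11.01, 12.99, 20.0]

mean = statistics.fmean(rates)
std = statistics.pstdev(rates)  # population std (divides by n)

# z = (x - mean) / std for every value
scaled = [(r - mean) / std for r in rates]

print([round(v, 3) for v in scaled])
print(f"mean = {statistics.fmean(scaled):.2e}, std = {statistics.pstdev(scaled):.3f}")
```

In the tutorial code the mean and standard deviation are learned from the training set only and then reused for the test set, so no information from the test data leaks into training.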

7. Building and Training the Model

Initialize the Model

# Create a logistic regression model
model = LogisticRegression(max_iter=1000)

What is max_iter=1000? It's the maximum number of iterations the algorithm will try to find the best coefficients. The default (100) might not be enough for convergence, so we increase it.

Train the Model

# Train on scaled training data
model.fit(X_train_scaled, y_train)

print("Model training complete!")

Output:

Model training complete!

Understanding Model Coefficients

Once trained, the model has learned a coefficient for each feature. These tell us how much each feature influences the approval probability.

# Display coefficients
feature_names = ['loan_percent_income', 'loan_int_rate', 'loan_amnt']
print("Model Coefficients:")
print("=" * 50)
for name, coef in zip(feature_names, model.coef_[0]):
    direction = "increases" if coef > 0 else "decreases"
    print(f"{name:25s}: {coef:8.4f}  ({direction} approval odds)")
print(f"{'Intercept':25s}: {model.intercept_[0]:8.4f}")

Output:

Model Coefficients:
==================================================
loan_percent_income      :   1.3423  (increases approval odds)
loan_int_rate            :   0.9964  (increases approval odds)
loan_amnt                :  -0.6607  (decreases approval odds)
Intercept                :  -1.7328

What does this mean?

The coefficients tell a clear and interpretable story:

loan_percent_income (+1.3423): Positive and strong! Higher debt-to-income ratio increases approval probability. This seems backward at first, but it reveals a real insight: applicants who were approved had higher DTI ratios than those rejected. This might indicate that the bank approved riskier customers (higher debt relative to income) but compensated by setting higher interest rates for them.
loan_int_rate (+0.9964): Positive coefficient. Higher interest rates increase approval. This makes sense: the bank charges higher rates to riskier borrowers—and those riskier borrowers are the ones being approved (likely because they accepted the higher rates). This reflects the bank's risk-based pricing strategy.
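loan_amnt (-0.6607): Negative coefficient. With the other two features held fixed, higher loan amounts decrease approval probability. Note that the raw correlation of loan_amnt with approval was slightly positive (0.108); the sign most likely flips here because loan_percent_income already captures how large the loan is relative to income. The bank appears conservative with large loans, possibly because the risk exposure is greater, so it approves smaller loan requests more readily.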

Important insight: These coefficients reveal the bank's actual lending strategy: they're willing to approve riskier applicants (higher DTI) but compensate with higher interest rates and smaller loan amounts. The positive correlation between high rates and approval isn't because high rates cause approval, but because they're two sides of the same risk management strategy.
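
To see how the coefficients turn into a prediction, the sketch below scores one applicant by hand. It assumes the coefficients and intercept printed above and uses the first scaled training row printed in the scaling step; the dictionary layout is an illustrative choice.

```python
import math

# Coefficients and intercept as printed above
coefs = {"loan_percent_income": 1.3423, "loan_int_rate": 0.9964, "loan_amnt": -0.6607}
intercept = -1.7328

# First scaled training row printed in the scaling step
row = {"loan_percent_income": 0.11905358, "loan_int_rate": 0.57791043, "loan_amnt": 0.85902478}

# Linear score z, then the sigmoid turns it into a probability
z = intercept + sum(coefs[name] * row[name] for name in coefs)
probability = 1.0 / (1.0 + math.exp(-z))
print(f"z = {z:.4f}, P(approved) = {probability:.4f}")

# exp(coef) is the multiplicative change in odds per +1 standard deviation
for name, coef in coefs.items():
    print(f"{name}: odds x {math.exp(coef):.2f}")
```

Because the features are standardized, each coefficient describes the effect of a one-standard-deviation change, which is what makes their magnitudes comparable.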

Making Predictions

# Generate predictions on test set
y_pred = model.predict(X_test_scaled)  # Hard predictions (0 or 1)
y_prob = model.predict_proba(X_test_scaled)[:, 1]  # Probabilities

print("First 10 predictions:")
print(y_pred[:10])
print("\nFirst 10 probabilities:")
print(y_prob[:10])

Output:

First 10 predictions:
[0 0 0 0 0 0 0 0 0 0]

First 10 probabilities:
[0.01649626 0.23231836 0.4727754  0.39996233 0.46312602 0.0676381
 0.139875   0.38037955 0.12514923 0.47888977]

What's happening?

y_pred: The first 10 predictions are all 0 (rejection). The model is conservative on this sample of the test set.
y_prob: Probabilities range from 0.016 to 0.479. None exceeds the 0.5 threshold, which is why all ten are classified as rejections, although three of them (around 0.46 to 0.48) come close.
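
The link between the two arrays can be checked by hand. The sketch below applies a threshold to the ten probabilities printed above; the helper to_labels and the alternative threshold of 0.4 are illustrative assumptions.

```python
# The ten probabilities printed above
y_prob_sample = [0.01649626, 0.23231836, 0.4727754, 0.39996233, 0.46312602,
                 0.0676381, 0.139875, 0.38037955, 0.12514923, 0.47888977]

def to_labels(probs, threshold):
    """1 = approved when the probability exceeds the threshold, else 0."""
    return [1 if p > threshold else 0 for p in probs]

default_labels = to_labels(y_prob_sample, 0.5)
lenient_labels = to_labels(y_prob_sample, 0.4)  # hypothetical lower threshold

print(default_labels)
print(lenient_labels, "->", sum(lenient_labels), "approvals")
```

Several of these applicants sit just below 0.5, so a modest change in threshold would flip their decisions.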

8. Evaluating Model Performance

This is where most people make mistakes. Accuracy alone is not enough. Let's use multiple metrics to get the full picture.

Confusion Matrix

Let's see exactly what the model gets right and wrong:

cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(cm)

tn, fp, fn, tp = cm.ravel()
print(f"\nTrue Negatives (correct rejections):  {tn}")
print(f"False Positives (wrongly approved):   {fp}")
print(f"False Negatives (wrongly rejected):   {fn}")
print(f"True Positives (correct approvals):   {tp}")

Output:

Confusion Matrix:
[[6598  392]
 [1194  816]]

True Negatives (correct rejections):  6598
False Positives (wrongly approved):   392
False Negatives (wrongly rejected):   1194
True Positives (correct approvals):   816

Visual representation:

                     Predicted Rejected   Predicted Approved
Actually Rejected        6598 ✓              392 ✗ (Type I Error)
Actually Approved        1194 ✗ (Type II)    816 ✓

Analysis:

True Negatives (6598): Model correctly rejected loans
True Positives (816): Model correctly approved loans
False Positives (392): Model wrongly approved loans (risky)
False Negatives (1194): Model wrongly rejected loans (lost revenue)

From the confusion matrix we can calculate:

Precision: 816 / (816 + 392) = 67.6% — When the model predicts approval, it's right 68% of the time
Recall: 816 / (816 + 1194) = 40.6% — The model catches 41% of actual approvals
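
The same arithmetic, extended to accuracy and F1, can be run directly on the four counts. This is a plain-Python check using the confusion matrix values above; the always-reject baseline is added for comparison.

```python
# Counts from the confusion matrix above
tn, fp, fn, tp = 6598, 392, 1194, 816
total = tn + fp + fn + tp

accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# Accuracy of a trivial model that rejects every application
always_reject_accuracy = (tn + fp) / total

print(f"Accuracy:  {accuracy:.4f} (always-reject baseline: {always_reject_accuracy:.4f})")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1:        {f1:.4f}")
```

Accuracy looks strong in isolation but is only a few points above the always-reject baseline, which is why precision and recall carry most of the information here.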

Business interpretation: At the default threshold the model is cautious. It catches 41% of approved loans with 68% precision, making relatively few false approvals (392) at the cost of many missed opportunities (1,194). That suits a bank that prioritizes limiting bad approvals; a bank seeking growth would want higher recall.

Address Class Imbalance with SMOTE

The dataset is heavily imbalanced (78% rejected, 22% approved). This bias affects our model's recall. Let's fix it using SMOTE (Synthetic Minority Over-sampling Technique):

from imblearn.over_sampling import SMOTE

# Prepare data (using original 3 features)
X = data[['loan_percent_income', 'loan_int_rate', 'loan_amnt']]
y = data['loan_status']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Apply SMOTE to balance training data
smote = SMOTE(random_state=42)
X_train_balanced, y_train_balanced = smote.fit_resample(X_train_scaled, y_train)

print(f"Original training set: {y_train.value_counts().to_dict()}")
print(f"Balanced training set: {pd.Series(y_train_balanced).value_counts().to_dict()}")

# Train on balanced data
model_balanced = LogisticRegression(max_iter=1000)
model_balanced.fit(X_train_balanced, y_train_balanced)

# Evaluate on original test data
y_pred = model_balanced.predict(X_test_scaled)
y_prob = model_balanced.predict_proba(X_test_scaled)[:, 1]

precision_smote = precision_score(y_test, y_pred)
recall_smote = recall_score(y_test, y_pred)
auc_smote = roc_auc_score(y_test, y_prob)

print(f"\nWith SMOTE:")
print(f"Precision: {precision_smote:.4f}")
print(f"Recall: {recall_smote:.4f}")
print(f"AUC: {auc_smote:.4f}")

print(f"\nWithout SMOTE (original):")
# Hard-coded from the three-feature model evaluated earlier (threshold 0.5)
print(f"Precision: 0.6755")
print(f"Recall: 0.4060")
print(f"AUC: 0.8282")

Output:

Original training set: {0: 28010, 1: 7990}
Balanced training set: {0: 28010, 1: 28010}

With SMOTE:
Precision: 0.4779
Recall: 0.7483
AUC: 0.8281

Without SMOTE (original):
Precision: 0.6755
Recall: 0.4060
AUC: 0.8282

Analysis:

SMOTE dramatically changed the model's behavior:

Metric	With SMOTE	Without SMOTE	Change
Precision	47.79%	67.55%	-19.76 points (more false approvals)
Recall	74.83%	40.60%	+34.23 points (catches far more actual approvals)
AUC	0.8281	0.8282	-0.0001 (essentially tied)

What happened:

SMOTE created synthetic approved loans to balance the training data. The resulting model became much more aggressive about approving loans:

Large recall improvement (74.83% vs 40.60%): The model now catches 75% of actual approvals instead of 41%.
Precision trade-off (47.79% vs 67.55%): About 52% of the loans the model approves are now false approvals, up from about 32%.
Same AUC: The model's ability to rank applicants is unchanged. In effect, SMOTE shifts where the 0.5 cut-off falls without improving the ranking itself.

Questions to explore:

Does your bank prefer catching 75% of good loans (with 52% of approvals being wrong) or 41% of good loans (with 32% wrong)?
What's the break-even point based on your costs (default cost vs lost revenue)?
Can you afford to have 52% of your approvals be false approvals?

What you'll learn: Handling imbalanced datasets, oversampling techniques, synthetic data generation, precision-recall trade-offs, understanding the cost of different strategies, and practical solutions to real-world ML problems.
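
The core step of SMOTE, generating a synthetic sample between two existing minority-class samples, can be sketched in a few lines. This is a simplified illustration under stated assumptions: the function synthesize and the two rows are made up, and the neighbor is supplied by hand, whereas the real algorithm picks it from the k nearest minority-class neighbors.

```python
import random

def synthesize(point, neighbor, rng):
    """Create one synthetic sample on the line segment between two minority samples."""
    u = rng.random()  # random position along the segment, in [0, 1)
    return [p + u * (n - p) for p, n in zip(point, neighbor)]

rng = random.Random(42)

# Two made-up, already-scaled minority-class (approved) rows
approved_a = [0.12, 0.58, 0.86]
approved_b = [0.23, 0.29, 0.38]

synthetic = synthesize(approved_a, approved_b, rng)
print([round(v, 3) for v in synthetic])
```

Every synthetic row lies between two real approved rows, so SMOTE fills in the region the minority class already occupies instead of duplicating rows exactly. It should only ever be applied to the training set, as the tutorial code does.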

9. Understanding Business Trade-offs

Here's something crucial: there's no "right" answer for accuracy, precision, or recall. It depends on what mistakes cost your business.

False Positives vs. False Negatives

In our model:

False Positive (Type I): We approve a loan that defaults → Bank loses money
False Negative (Type II): We reject a loan that would've been repaid → Bank loses opportunity

Which is Worse?

With our model (without SMOTE):

392 false positives (approved bad loans): direct financial loss
1,194 false negatives (rejected good loans): lost revenue opportunity

If we used SMOTE (balanced training), the counts implied by its precision (47.79%) and recall (74.83%) on the same 9,000-loan test set are:

~1,640 false positives (about four times as many bad loans approved): much larger financial loss
~510 false negatives (fewer than half as many good loans rejected): much better customer capture

Our unbalanced model sits at the cautious end of this trade-off: it keeps bad approvals low (392) but turns away 1,194 good loans. The SMOTE model does the opposite. Neither is right in general; the better choice depends on your business context.

Whether SMOTE is worth it depends entirely on your cost structure: Are you more willing to risk bad loans or lose good customers?
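
A rough way to make this question concrete is to attach a cost to each kind of error and compare totals. In the sketch below, the unit costs are purely hypothetical placeholders, and the SMOTE error counts are back-calculated from its reported precision and recall together with the 2,010 actual approvals in the test set (816 + 1,194 from the confusion matrix).

```python
def counts_from_metrics(precision, recall, actual_positives):
    """Recover approximate (false positives, false negatives) from precision and recall."""
    tp = round(recall * actual_positives)
    predicted_positives = round(tp / precision)
    return predicted_positives - tp, actual_positives - tp

def total_cost(fp, fn, cost_fp, cost_fn):
    """Total cost of a model's mistakes under the given unit costs."""
    return fp * cost_fp + fn * cost_fn

plain = (392, 1194)                                 # from the confusion matrix
smote = counts_from_metrics(0.4779, 0.7483, 2010)   # derived from reported metrics

# Two hypothetical cost structures: (cost per false positive, cost per false negative)
scenarios = {"bad loans are expensive": (5_000, 1_000),
             "lost customers are expensive": (1_000, 5_000)}

for label, (cost_fp, cost_fn) in scenarios.items():
    plain_cost = total_cost(*plain, cost_fp, cost_fn)
    smote_cost = total_cost(*smote, cost_fp, cost_fn)
    print(f"{label}: without SMOTE {plain_cost:,} vs with SMOTE {smote_cost:,}")
```

Under the first assumed cost structure the model without SMOTE is cheaper, and under the second the SMOTE model is. The ranking depends entirely on the cost figures you supply.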

Adjusting the Threshold

By default, the model approves loans with > 50% probability. But we could change this:

# Lower threshold to 0.3 (more approvals, more risk)
new_threshold = 0.3
y_pred_lower = (y_prob > new_threshold).astype(int)

precision_lower = precision_score(y_test, y_pred_lower)
recall_lower = recall_score(y_test, y_pred_lower)

# Higher threshold to 0.7 (fewer approvals, less risk)
new_threshold_high = 0.7
y_pred_higher = (y_prob > new_threshold_high).astype(int)

precision_higher = precision_score(y_test, y_pred_higher)
recall_higher = recall_score(y_test, y_pred_higher)

print("Threshold Impact:")
print("=" * 60)
print(f"Threshold 0.3 (lenient):")
print(f"  Precision: {precision_lower:.2%}, Recall: {recall_lower:.2%}")
print(f"\nThreshold 0.5 (default):")
print(f"  Precision: {precision:.2%}, Recall: {recall:.2%}")
print(f"\nThreshold 0.7 (strict):")
print(f"  Precision: {precision_higher:.2%}, Recall: {recall_higher:.2%}")

Output:

Threshold Impact:
============================================================
Threshold 0.3 (lenient):
  Precision: 35.54%, Recall: 89.10%

Threshold 0.5 (default):
  Precision: 67.46%, Recall: 39.70%

Threshold 0.7 (strict):
  Precision: 61.46%, Recall: 51.64%

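Key insight: Lowering the threshold to 0.3 more than doubles recall (from 39.70% to 89.10%) but roughly halves precision (from 67.46% to 35.54%). The 0.7 row deserves a second look: for a fixed set of predicted probabilities, raising the threshold can only shrink the set of approvals, so recall cannot rise above its value at 0.5. A recall of 51.64% at 0.7 therefore suggests that y_prob or y_test changed between runs; re-run the cell from a clean state before relying on that row. Taken at face value, the reported 0.7 figures would mean:

Better recall than default (0.5), at lower precision (61.46% vs. 67.46%)
Far better precision than lenient (0.3)
About 39% of approvals default (better than 0.3's 64%, worse than 0.5's 33%)

Which threshold to choose? It depends on your business goal:

Growth-focused (0.3): Catch 89% of good loans but accept a 64% bad loan rate; risky, but maximum customer capture
Balanced (0.5): 40% recall with 67% precision; rejects many good customers to stay safe
Middle ground (0.7, as reported): 52% recall with 61% precision; verify these numbers before adopting it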

This is a business decision, not a technical one. Consider your cost structure (default loss vs. missed revenue) to find your optimal threshold.
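The mechanics of a threshold sweep can be sketched without the dataset. The example below is dependency-free and uses made-up labels and probabilities (y_true_demo and y_prob_demo are hypothetical, not taken from the loan data); precision_recall_at is an illustrative helper. It shows how recall behaves as the threshold rises when the same probabilities are reused.

```python
def precision_recall_at(y_true, y_prob, threshold):
    """Precision and recall when scores above `threshold` are predicted positive."""
    pairs = list(zip(y_true, y_prob))
    tp = sum(1 for t, p in pairs if p > threshold and t == 1)
    fp = sum(1 for t, p in pairs if p > threshold and t == 0)
    fn = sum(1 for t, p in pairs if p <= threshold and t == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical labels and probabilities, made up for illustration only
y_true_demo = [1, 1, 1, 1, 0, 1, 0, 0, 1, 0]
y_prob_demo = [0.95, 0.85, 0.75, 0.65, 0.60, 0.45, 0.40, 0.35, 0.25, 0.10]

# Evaluate the same three thresholds used in the tutorial
sweep = {t: precision_recall_at(y_true_demo, y_prob_demo, t) for t in (0.3, 0.5, 0.7)}
for t, (p, r) in sweep.items():
    print(f"threshold {t:.1f}: precision {p:.3f}, recall {r:.3f}")
```

On this toy data recall falls at each step while precision rises. Recall can never increase with the threshold; precision usually rises but is not guaranteed to.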

10. Compare with Other ML Algorithms

Logistic Regression is interpretable but simple. Let's compare it with more sophisticated algorithms:

from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC

# Prepare data
X = data[['loan_percent_income', 'loan_int_rate', 'loan_amnt']]
y = data['loan_status']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Define models
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(random_state=42),
    'SVM': SVC(probability=True, random_state=42)
}

# Train and evaluate each
results = {}
print("Model Comparison:")
print("=" * 60)

for name, model in models.items():
    model.fit(X_train_scaled, y_train)
    y_pred = model.predict(X_test_scaled)
    y_prob = model.predict_proba(X_test_scaled)[:, 1]
    
    auc = roc_auc_score(y_test, y_prob)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    
    results[name] = {'AUC': auc, 'Precision': precision, 'Recall': recall}
    
    print(f"\n{name}")
    print(f"  AUC:       {auc:.4f}")
    print(f"  Precision: {precision:.4f}")
    print(f"  Recall:    {recall:.4f}")

print("=" * 60)

Output:

Model Comparison:
============================================================
Logistic Regression
  AUC:       0.8282
  Precision: 0.6755
  Recall:    0.4060

Random Forest
  AUC:       0.8711
  Precision: 0.7067
  Recall:    0.6269

Gradient Boosting
  AUC:       0.8707
  Precision: 0.7090
  Recall:    0.5891

SVM
  AUC:       0.8151
  Precision: 0.7147
  Recall:    0.4736

============================================================

Analysis:

Model	AUC	Precision	Recall	Speed	Interpretability
Logistic Regression	0.8282	67.55%	40.60%	⚡ Fast	🟢 Excellent
Random Forest	🏆 0.8711	70.67%	🏆 62.69%	🟡 Medium	🟡 Moderate
Gradient Boosting	0.8707	70.90%	58.91%	🟡 Medium	🟡 Moderate
SVM	0.8151	🏆 71.47%	47.36%	🔴 Slow	🔴 Poor

Key findings:

🏆 Clear Winner: Random Forest

Highest AUC (0.8711) — best overall discrimination
62.69% recall: catches about 1.5x as many good loans as logistic regression (40.60%)
70.67% precision — only 29% of approvals default (vs. 32% for logistic regression)
Decent training speed and interpretability (feature importance available)

Runner-up: Gradient Boosting

Nearly tied AUC (0.8707)
Second-highest precision (70.90%), just behind SVM (71.47%)
58.91% recall: catches 59% of good loans
Slightly slower than Random Forest

Why Logistic Regression loses:

Lowest recall (40.60%) — misses too many good loans
Lower AUC (0.8282) — overall worse discrimination
BUT: Extreme simplicity, fastest training, fully interpretable coefficients

Why SVM underperforms:

Lowest AUC (0.8151): worst discrimination, despite the highest precision (71.47%)
Slow training time
Difficult to interpret
Not recommended for this task

Trade-off Summary:

Choose Random Forest if you want the best performance and can handle some complexity
Choose Gradient Boosting if you want near-top precision with strong recall and an AUC on par with Random Forest
Choose Logistic Regression if interpretability and simplicity matter more than raw performance

For most banks, Random Forest is the winner: it catches 62.69% of good loans (vs. 40.60%) at slightly higher precision (70.67% vs. 67.55%), and the complexity is manageable. Because it approves more loans overall, the absolute number of false approvals still goes up.
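Precision and recall together determine how many false approvals each model produces. A minimal sketch, normalizing to 1,000 truly good loans (an arbitrary scale chosen for illustration) and using the precision and recall values from the comparison output above; counts_per_1000_good_loans is a hypothetical helper name.

```python
def counts_per_1000_good_loans(precision, recall):
    """Derive TP, FP and FN counts per 1,000 actual positives.

    Uses recall = TP / (TP + FN) and precision = TP / (TP + FP).
    """
    tp = 1000 * recall
    fn = 1000 - tp
    fp = tp * (1 - precision) / precision
    return tp, fp, fn

# Precision and recall copied from the model comparison output
reported = {
    "Logistic Regression": (0.6755, 0.4060),
    "Random Forest": (0.7067, 0.6269),
}

counts = {name: counts_per_1000_good_loans(p, r) for name, (p, r) in reported.items()}
for name, (tp, fp, fn) in counts.items():
    print(f"{name}: {tp:.0f} good loans approved, {fp:.0f} false approvals, {fn:.0f} good loans missed")
```

Per 1,000 good loans, Random Forest approves about 627 of them with roughly 260 false approvals, while logistic regression approves 406 with roughly 195. The rate of bad approvals is lower for Random Forest, but the count is higher because it approves more loans.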

Questions to explore:

Is 62.69% recall vs. 40.60% worth the added model complexity?
How does prediction time compare across models?
Can you explain Random Forest decisions to loan applicants?
What's your risk tolerance for the extra false approvals?

What you'll learn: Model comparison, ensemble methods vs simple models, complexity vs performance trade-offs, when to use complex models and when simple is better, the ROI of increased accuracy, and real-world implementation considerations.

10. Optimize the Decision Threshold with Cost Analysis

We've been using 0.5 as the approval threshold, but what if we could find the optimal threshold based on your bank's specific costs?

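# Prepare predictions
# Caution: `model` is whichever estimator was fitted last. If you ran the
# comparison loop above, that is the SVM, not the logistic regression.
y_prob = model.predict_proba(X_test_scaled)[:, 1]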

# Define costs (CUSTOMIZE THESE FOR YOUR BUSINESS)
fp_cost = 10000  # Cost of approving a bad loan (money lost to default)
fn_cost = 5000   # Cost of rejecting a good loan (missed profit opportunity)

# Test thresholds from 0.1 to 0.9
threshold_results = []
min_cost = float('inf')
optimal_threshold = 0.5

for threshold in np.arange(0.1, 0.91, 0.05):
    y_pred_thresh = (y_prob > threshold).astype(int)
    
    # Calculate confusion matrix
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred_thresh).ravel()
    
    # Calculate costs
    total_cost = (fp * fp_cost) + (fn * fn_cost)
    
    # Calculate metrics
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0
    
    threshold_results.append({
        'Threshold': threshold,
        'Cost': total_cost,
        'Precision': precision,
        'Recall': recall
    })
    
    if total_cost < min_cost:
        min_cost = total_cost
        optimal_threshold = threshold

# Display results
results_df = pd.DataFrame(threshold_results)
print(results_df.to_string(index=False))

print(f"\n" + "=" * 60)
print(f"OPTIMAL THRESHOLD: {optimal_threshold:.2f}")
print(f"Minimum Total Cost: ${min_cost:,.0f}")
print("=" * 60)

Output:

Threshold     Cost  Precision   Recall
      0.10 54930000   0.257588 0.937313
      0.15 12805000   0.590551 0.708955
      0.20 11015000   0.635162 0.645274
      0.25 10165000   0.662452 0.599502
      0.30  9850000   0.674670 0.559204
      0.35  9490000   0.690909 0.529353
      0.40  9240000   0.704965 0.494527
      0.45  9095000   0.715046 0.468159
      0.50  8860000   0.732069 0.441791
      0.55  8780000   0.741395 0.417910
      0.60  8735000   0.749763 0.393532
      0.65  8830000   0.750000 0.364179
      0.70  8780000   0.762770 0.334328
      0.75  8850000   0.769231 0.298507
      0.80  8745000   0.804762 0.252239
      0.85  9120000   0.840336 0.149254
      0.90  9615000   0.886364 0.058209

============================================================
OPTIMAL THRESHOLD: 0.60
Minimum Total Cost: $8,735,000
============================================================

Analysis:

The results show a clear trade-off pattern:

Cost Journey (as threshold increases):

0.10 to 0.30: Dramatic cost drop from $54.9M to $9.85M by rejecting obviously bad loans
0.30 to 0.60: Gradual improvement from $9.85M to $8.735M (the minimum!)
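0.60 to 0.80: Cost hovers between $8.745M and $8.85M, so the curve is nearly flat here
0.80 to 0.90: Cost climbs to $9.6M because too many good loans are rejected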

Why 0.60 is optimal:

Aspect	At 0.60	At Default (0.50)	Difference
Total Cost	$8,735,000	$8,860,000	Save $125,000
Precision	74.98%	73.21%	Better (fewer bad approvals)
Recall	39.35%	44.18%	Lower (catch fewer good loans)

The sweet spot: 0.60 is lower cost than 0.50, even though recall drops slightly. This is because:

You approve fewer loans, reducing false positive cost (bad loan defaults)
The savings exceed the lost opportunity cost from false negatives
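A quick sanity check on the grid search is the standard break-even rule from decision theory: approve when the expected cost of approving, (1 - p) × fp_cost, is lower than the expected cost of rejecting, p × fn_cost. The sketch below assumes the predicted probabilities are well calibrated, which this tutorial has not verified; break_even_threshold is an illustrative helper, not a scikit-learn function.

```python
def break_even_threshold(fp_cost, fn_cost):
    """Probability above which approving is cheaper in expectation than rejecting.

    Approving a loan with P(good) = p risks (1 - p) * fp_cost;
    rejecting it forgoes p * fn_cost. Setting the two equal and solving
    for p gives fp_cost / (fp_cost + fn_cost).
    """
    return fp_cost / (fp_cost + fn_cost)

# Costs assumed in this section
baseline = break_even_threshold(10_000, 5_000)
print(f"Break-even threshold: {baseline:.3f}")

# Simple sensitivity check: vary the assumed cost of a missed good loan
for fn_cost in (2_500, 5_000, 10_000, 20_000):
    t = break_even_threshold(10_000, fn_cost)
    print(f"fn_cost = ${fn_cost:>6,}: approve when p > {t:.3f}")
```

With the assumed $10,000 and $5,000 costs the rule gives about 0.667, which falls inside the nearly flat 0.55 to 0.80 stretch of the cost table. If the empirical optimum sits far from this value, miscalibrated probabilities are one possible explanation.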

Practical Business Decision:

If you adopt threshold 0.60:

✅ Save $125,000 on this test set compared to threshold 0.50
✅ 75% precision—confident in your approvals
⚠️ Catch 39% of good loans (down from 44%)
⚠️ Reject some creditworthy customers

Alternative scenarios:

Aggressive growth (0.30): Cost $9.85M but catch 56% of good loans
Conservative lending (0.80): Cost $8.745M with 80% precision but miss 75% of good loans
Maximum revenue (0.15): Approve far more applicants (71% recall), but cost rises to $12.8M

Questions to explore:

What's your actual cost of a default? (We assumed $10,000)
What's your profit margin per approved loan? (We assumed $5,000 opportunity cost)
Can you negotiate better terms to change these costs?
Is a $125,000 saving worth a recall drop of about 5 percentage points (44.18% to 39.35%)?

What you'll learn: Threshold optimization, cost-sensitive learning, business impact of ML decisions, ROI analysis, how to connect machine learning to business metrics, sensitivity analysis, real-world decision-making under trade-offs.

11. Key Insights and Lessons Learned

What We Learned

Don't trust intuition—test scientifically. Credit score seemed important but turned out to have no predictive power (AUC: 0.5104).
Ratio-based features beat raw values. Debt-to-income ratio (0.7077 AUC) outperformed raw income (0.6878 AUC).
The bank's own assessment matters. Interest rate (0.7146 AUC) was the strongest individual predictor—banks already encode risk in their rates.
Feature combinations multiply power. Three thoughtfully-selected features together (AUC: 0.8282) beat naive approaches (AUC: 0.7459) by 11%.
Multiple metrics matter. Recall improved 3.5x while accuracy only improved 3.4%—showing why you need more than one metric.
Business context drives decisions. Trading false positives for false negatives is a business choice, not a technical one.

Model Limitations

This model still has room for improvement:

Limited features: We used 3 numerical features; the dataset has 14 total columns (including categorical ones)
Class imbalance: Could use SMOTE or class weighting for better balance
Threshold not optimized: We're using 0.5; business costs might suggest another value
Simple model: Logistic regression is interpretable but may not capture complex patterns

How to Improve Further

This tutorial covered the essentials, but here are advanced techniques you can explore:

Use categorical features: Encode person_gender, person_education, loan_intent, etc. (currently we used only numerical features)
Feature engineering: Create interaction terms (e.g., income × loan_amount) or polynomial features to capture non-linear relationships
Advanced class imbalance handling: Beyond SMOTE (which you explored in Task 2), try:
- class_weight='balanced' in LogisticRegression for automatic cost weighting
- Combination of SMOTE + Tomek links
- Threshold optimization (which you learned in Task 4)
Hyperparameter tuning: Use GridSearchCV or RandomizedSearchCV to optimize:
- Regularization strength (C parameter) in logistic regression
- Tree depth and number of trees in Random Forest
- Learning rate in Gradient Boosting
Cross-validation: Use k-fold validation (k=5 or k=10) instead of single train-test split for more robust evaluation
Ensemble combinations: Stack multiple models or use voting classifiers to combine Random Forest + Gradient Boosting
Business rule integration: Combine model predictions with business rules (e.g., "always reject if debt-to-income > 0.80")
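The last item, business rule integration, could look like the sketch below. It assumes the "debt-to-income" rule maps to the loan_percent_income feature used in this tutorial and reuses the 0.60 threshold from the cost analysis; final_decision and its parameter names are hypothetical.

```python
def final_decision(approval_probability, loan_percent_income,
                   threshold=0.60, max_ratio=0.80):
    """Return (decision, reason), letting the hard rule override the model."""
    # Hard business rule is checked first and cannot be overridden by the model
    if loan_percent_income > max_ratio:
        return "reject", "business rule: ratio above cap"
    # Otherwise defer to the model's probability and the chosen threshold
    if approval_probability > threshold:
        return "approve", "model probability above threshold"
    return "reject", "model probability at or below threshold"

print(final_decision(0.92, 0.85))  # confident model, but the rule blocks it
print(final_decision(0.92, 0.30))  # model decides: approve
print(final_decision(0.40, 0.30))  # model decides: reject
```

Returning the reason alongside the decision keeps an audit trail of whether a rejection came from the rule or from the model, which helps when explaining outcomes to applicants.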

12. Common Issues and Solutions

Convergence Warning

ConvergenceWarning: lbfgs failed to converge (status=1)

Solution: Increase max_iter:

model = LogisticRegression(max_iter=10000)

Poor Performance After Feature Selection

If your features perform poorly despite scientific selection:

Solution 1: Check for data leakage

Don't include features calculated from the target or from future information.
For example, don't use 'loan_default' to predict 'loan_status'.

Solution 2: Use domain knowledge to filter

Some features might be too noisy despite good correlation
Combine statistics with business logic

Solution 3: Handle outliers

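# Cap extreme values (e.g., those 125-year careers).
# Learn the bounds from the training set only, then apply them to both splits.
lower, upper = X_train.quantile(0.01), X_train.quantile(0.99)
X_train = X_train.clip(lower=lower, upper=upper, axis=1)
X_test = X_test.clip(lower=lower, upper=upper, axis=1)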

Precision vs. Recall Trade-off

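Lowering the threshold raises recall at the expense of precision; raising it usually does the opposite. Pick the direction that matches your goal:

Solution 1: Lower the threshold if you need higher recall

y_pred_new = (y_prob > 0.3).astype(int)  # More approvals

Solution 2: Raise the threshold if you need higher precision

y_pred_new = (y_prob > 0.7).astype(int)  # Fewer, safer approvals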

13. Conclusion: What You've Learned & Next Steps

Key Takeaways

You've now mastered the fundamentals of classification ML:

✓ Scientific methodology beats intuition — Credit score seemed important but was useless (AUC: 0.5104)
✓ Features matter more than models — Good features + logistic regression > bad features + complex ensemble
✓ Data-driven decisions win: We improved from a baseline AUC of 0.7459 to 0.8282 (11% gain) through feature selection alone
✓ Business context drives choices — Threshold 0.60 beats 0.50 despite lower recall (cost analysis matters!)
✓ Trade-offs are everywhere — Precision vs. recall, recall vs. cost, simplicity vs. performance

The core lesson: Great ML is 80% feature engineering and domain knowledge, 20% model optimization. Our entire improvement came from choosing better features, not tweaking the algorithm.

Your Next Steps

First, revisit the tasks in Section 2 as you gain confidence:

Task 1 — Feature selection trade-offs (how many features is enough?)
Task 2 — Handling class imbalance (SMOTE vs. cost weighting)
Task 3 — Ensemble methods (when does Random Forest beat logistic regression?)
Task 4 — Cost-sensitive optimization (finding your business's optimal threshold)

Then, explore advanced topics:

Model interpretability: SHAP values, LIME (explain predictions to customers)
Hyperparameter tuning: GridSearchCV, RandomizedSearchCV (squeeze more performance)
Regularization: L1/L2 penalties (prevent overfitting on high-dimensional data)
Cross-validation: k-fold validation (more robust than single train-test split)
Feature engineering: Polynomial features, interactions (capture non-linear patterns)
Production ML: Model serving, monitoring, retraining (deploy safely at scale)
Fairness & bias: Audit your model (ensure it doesn't discriminate by gender, age, etc.)

Most importantly: Keep building. Loan approval is just one domain—these techniques apply to credit scoring, fraud detection, customer churn, medical diagnosis, and countless others.

Good luck building your next classification model! 🚀

Data source: https://www.kaggle.com/datasets/taweilo/loan-approval-classification-data

Code: https://github.com/Minhhoang2606/Supervised-machine-learning-Classification/tree/master/Lesson%203%20Logistic%20Regression
