Introduction
In today’s data-driven world, businesses are waking up to a crucial reality: correlation is not causation. Just because two metrics move together doesn’t mean one causes the other. And when decisions involve money, time, and customer experience, guessing is not an option.
Enter causal models.
Causality
Causal models aim to answer a powerful question: "What would have happened if we had done something differently?"
They go beyond simple forecasting. While predictive models tell you what might happen, causal models explain why it happens and what would change under a different action.
In short, causal models help you understand which levers actually drive outcomes, allowing you to make decisions that move the needle.
Where Causal Models Are Used
- Marketing → Campaign effectiveness, spend optimization
- Product → Drivers of feature adoption or churn
- Pricing → Impact of price changes
- Operations → Effects of delivery time, staffing, etc.
Businesses don’t just want to know what’s happening—they want to influence it. Causal models enable confident, action-driven decisions. They’re no longer academic nice-to-haves—they're a competitive edge.
Predictive Models
Every time you get a Netflix recommendation, see a revenue forecast, or receive a fraud alert, a predictive model is likely behind it.
These models use historical data to forecast future outcomes. They learn patterns between features (X) and a target (Y), such as:
- Will this user churn?
- What will we sell next month?
- How many orders will this customer make?
But here’s the key: predictive models don’t explain why something happens. That’s where causality comes in. If you want to know whether a discount caused a behavior change, you need a causal model.
Causality vs. Prediction in Action
Let’s illustrate the difference. Imagine you run an e-commerce business and want to estimate Customer Lifetime Value (CLV) over 365 days.
CAC Predicts CLV
You have data on how much you spent to acquire customers (CAC) and their CLVs:
df_cac.head()
customer_id | cac | clv_365 |
0 | 15.55 | 32.48 |
1 | 52.35 | 104.17 |
2 | 90.33 | 178.90 |
3 | 66.81 | 135.26 |
4 | 61.42 | 123.14 |
Let’s say CAC is highly correlated with CLV:
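You can check this directly (a minimal check, assuming the df_cac frame shown above):

# Pearson correlation between acquisition cost and lifetime value
print(df_cac["cac"].corr(df_cac["clv_365"]))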
You train a Linear Regression model on this and get near-perfect results.
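One note before the code: prep_data_for_regression is a helper used throughout the article but never shown. A minimal sketch of what it might do (drop the ID column, encode gender numerically, split into train and test sets) follows; the exact encoding, any scaling, and the split parameters are assumptions.

from sklearn.model_selection import train_test_split


def prep_data_for_regression(df, target, test_size=0.2, random_state=42):
    """Hypothetical helper: drop the ID column, encode gender, and split train/test."""
    data = df.drop(columns=["customer_id"], errors="ignore").copy()
    if "gender" in data.columns:
        # Assumed encoding: male -> 1, female -> 0 (the original encoding is not shown)
        data["gender"] = (data["gender"] == "male").astype(int)
    X = data.drop(columns=[target])
    y = data[target]
    return train_test_split(X, y, test_size=test_size, random_state=random_state)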
# Imports used throughout the examples
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

X_train, X_test, y_train, y_test = prep_data_for_regression(df_cac, "clv_365")

# Train a linear regression model
model_1 = LinearRegression()
model_1.fit(X_train, y_train)

y_pred = model_1.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error: {mse:.2f}")
print(f"R-squared: {r2:.2f}")

>>> output
Mean Squared Error: 9.94
R-squared: 0.99
Great, right? Not so fast. This is synthetic data where the correlation was intentionally built in. In real life, such a correlation often doesn’t exist. More importantly, a predictive model like this can’t answer: "Does higher CAC cause higher CLV?"
Should you spend more on acquisition just because the model predicts higher CLV with higher CAC? Definitely not.
More Features: Age, Gender, Location, App Usage
You get more customer data: age, gender, urban_loc, and app_user. The data now looks like this:
df_customers.head()
customer_id | age | gender | urban_loc | app_user | clv_365 |
0 | 46 | male | 0 | 0 | 32.48 |
1 | 32 | female | 0 | 0 | 104.17 |
2 | 25 | female | 1 | 1 | 178.90 |
3 | 38 | female | 0 | 1 | 135.26 |
4 | 36 | female | 1 | 0 | 123.14 |
You retrain your model and achieve a great predictive score (R2 = 1.00, MSE = 4.62). Errors are normally distributed.
X_train, X_test, y_train, y_test = prep_data_for_regression(df_customers, "clv_365")

# Train a linear regression model
model_2 = LinearRegression()
model_2.fit(X_train, y_train)

y_pred = model_2.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error: {mse:.2f}")
print(f"R-squared: {r2:.2f}")

>>> output
Mean Squared Error: 4.62
R-squared: 1.00
If prediction is your only goal, you’re done. But if you want to know what drives CLV, you’re not.
Does App Usage Drive CLV?
Let's check the coefficient for app_user. It shows that app users bring $19.6 more. Is that a causal effect, or just a correlation?

# Feature importance from the linear model's coefficients
importance = pd.DataFrame(
    {
        "Feature": model_2.feature_names_in_,
        "Importance": model_2.coef_,
    }
).sort_values(by="Importance", ascending=False)
importance
Feature | Importance |
app_user | 19.6 |
urban_loc | 12.4 |
age | -14.8 |
gender | -23.8 |
Avg CLV: App Users vs. Non-App Users
Now, compare the average CLV of app users vs. non-app users. The difference is $46.5.

app_user_mean = df_customers.query("app_user == 1")["clv_365"].mean()
non_app_user_mean = df_customers.query("app_user == 0")["clv_365"].mean()
difference = app_user_mean - non_app_user_mean
So, what's the true effect: $19.6 or $46.5? Why did the estimated effect of app usage vary so much?
The Problem: Confounding
Let's look at the age distribution: younger users are more likely to use the app and have higher CLV. That's confounding: a third variable (age) influences both the treatment (app_user) and the outcome (CLV).
Specifically, age is a confounding variable:
- Younger users are more likely to use the app
- Younger users also tend to have higher CLV
This means that part of the observed effect of app usage is actually due to age, not the app itself.
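A quick check makes the confounding visible (a minimal sketch, assuming the df_customers frame shown above):

# Average age and CLV by app usage: app users skew younger and have higher CLV
print(df_customers.groupby("app_user")[["age", "clv_365"]].mean())

# App-usage rate and average CLV by age band: both decrease with age
age_bands = pd.cut(df_customers["age"], bins=[17, 25, 35, 50])
print(df_customers.groupby(age_bands)[["app_user", "clv_365"]].mean())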
Causal Models to the Rescue
There are many ways to estimate causal effects. Here we use a simple method: the S-learner, a type of meta-learner.
What is an S-Learner?
An S-learner uses a single standard ML model to estimate the Conditional Average Treatment Effect (CATE): it fits one model f(X, T) with the treatment included as an ordinary feature, then estimates CATE(x) = f(x, T=1) - f(x, T=0). Concretely:
1. Train an ML model on the original data (here, a tree-based estimator):
from sklearn.ensemble import RandomForestRegressor

X_train, X_test, y_train, y_test = prep_data_for_regression(df_customers, "clv_365")

model_3 = RandomForestRegressor(n_estimators=500, random_state=42)
model_3.fit(X_train, y_train)

y_pred = model_3.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

print(f"Root Mean Squared Error: {rmse:.2f}")
print(f"R-squared: {r2:.2f}")

>>> output
Root Mean Squared Error: 2.22
R-squared: 1.00
2. Set app_user = 1 for all users and predict CLV:
X_all = pd.concat([X_train, X_test], axis=0)
y_all = pd.concat([y_train, y_test], axis=0)
model_3.fit(X_all, y_all)

# X_all treated
X_all["app_user"] = 1  # Set all customers to treatment (app user)
y_pred_treated = model_3.predict(X_all)
3. Set app_user = 0 for all users and predict again:
# X_all control
X_all["app_user"] = 0  # Set all customers to control (non-app user)
y_pred_control = model_3.predict(X_all)
4. The difference gives individual treatment effects (CATEs):
# Calculate the treatment effect for each customer
df_customers_analysis = df_customers.copy()

# Align predictions back to the original row order via X_all's index
df_customers_analysis["clv_if_all_app_users"] = pd.Series(y_pred_treated, index=X_all.index).round(2)
df_customers_analysis["clv_if_all_non_app_users"] = pd.Series(y_pred_control, index=X_all.index).round(2)
df_customers_analysis["treatment_effect"] = (
    df_customers_analysis["clv_if_all_app_users"]
    - df_customers_analysis["clv_if_all_non_app_users"]
)
df_customers_analysis.head()
And if we add the new prediction columns to our original dataset, it will look like this:
customer_id | age | gender | urban_loc | app_user | clv_365 | clv_if_all_app_users | clv_if_all_non_app_users | treatment_effect |
0 | 46 | male | 0 | 0 | 32.48 | 137.27 | 96.48 | 40.79 |
1 | 32 | female | 0 | 0 | 104.17 | 176.62 | 135.78 | 40.84 |
2 | 25 | female | 1 | 1 | 178.90 | 179.02 | 138.48 | 40.54 |
3 | 38 | female | 0 | 1 | 135.26 | 66.96 | 26.41 | 40.55 |
4 | 36 | female | 1 | 0 | 123.14 | 154.23 | 113.94 | 40.29 |
On the left chart, you see the distribution of treatment effects centered around $40. On the right, each customer's actual CLV is shown alongside their hypothetical CLV if their app usage status were reversed: red dots represent CLV if they were not app users, and green dots if they were app users.
The average of all CATEs gives the Average Treatment Effect (ATE): the expected CLV uplift if everyone used the app.
Conclusion
Using the S-learner approach, we estimate the average treatment effect of app usage on CLV to be $40. Compare that to:
- Linear regression coefficient: $19.6 (underestimated)
- Raw group difference: $46.5 (overestimated)
Since we generated the synthetic data with a true effect of $40, this confirms the S-learner result is accurate.
This example shows the power of causal models: if you want to understand and influence outcomes, not just predict them, they're essential.
Appendix
Function used to generate the synthetic data:
import numpy as np
import pandas as pd


def generate_customer_data(n):
    np.random.seed(42)

    customer_ids = np.arange(n)
    age = np.random.randint(18, 51, size=n)
    gender = np.random.choice(["male", "female", "female"], size=n)  # females twice as likely
    urban_loc = np.random.choice([0, 1], size=n)

    # Younger customers are more likely to be app users (the confounder)
    min_age, max_age = age.min(), age.max()
    normalized_age = (age - min_age) / (max_age - min_age)
    base_prob = 0.4
    age_factor = 1.5 - normalized_age  # 1.5 for youngest, 0.5 for oldest
    probabilities = base_prob * age_factor
    probabilities = np.clip(probabilities, 0, 1)
    app_user = np.random.binomial(1, probabilities, size=n)

    # True CLV: app usage adds a fixed $40 effect
    base_clv = (
        age_factor * 50
        + (gender == "female") * 50
        + urban_loc * 25
        + app_user * 40
    )
    cac = base_clv * 0.5 + np.random.normal(0, base_clv.mean() * 0.01, size=n)
    noise = np.random.normal(0, base_clv.mean() * 0.02, size=n)  # 2% of mean CLV as noise
    clv_365 = base_clv + noise

    # Create DataFrame
    df = pd.DataFrame(
        {
            "customer_id": customer_ids,
            "age": age,
            "gender": gender,
            "urban_loc": urban_loc,
            "app_user": app_user,
            "cac": np.round(cac, 2),
            "clv_365": np.round(clv_365, 2),
        }
    )
    return df
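How the two frames used in the article were derived from this generator isn't shown; a plausible reconstruction (the sample size and column selection are assumptions) is:

df = generate_customer_data(10_000)              # assumed sample size
df_cac = df[["customer_id", "cac", "clv_365"]]   # frame used in the CAC example
df_customers = df.drop(columns=["cac"])          # frame used in the later examples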
"God is the only true cause. All other causes are instruments through which His will is realized."
– Fakhr al-Din al-Razi