Introduction
In today’s data-driven world, businesses are waking up to a crucial reality: correlation is not causation. Just because two metrics move together doesn’t mean one causes the other. And when decisions involve money, time, and customer experience, guessing is not an option.
Enter causal models.
Causality
Causal models aim to answer a powerful question: "What would have happened if we had done something differently?"
They go beyond simple forecasting. While predictive models tell you what might happen, causal models explain why it happens and what would change under a different action.
In short, causal models help you understand which levers actually drive outcomes, allowing you to make decisions that move the needle.
Where Causal Models Are Used
- Marketing → Campaign effectiveness, spend optimization
- Product → Drivers of feature adoption or churn
- Pricing → Impact of price changes
- Operations → Effects of delivery time, staffing, etc.
Businesses don’t just want to know what’s happening—they want to influence it. Causal models enable confident, action-driven decisions. They’re no longer academic nice-to-haves—they're a competitive edge.
Predictive Models
Every time you get a Netflix recommendation, see a revenue forecast, or receive a fraud alert, a predictive model is likely behind it.
These models use historical data to forecast future outcomes. They learn patterns between features (X) and a target (Y), such as:
- Will this user churn?
- What will we sell next month?
- How many orders will this customer make?
But here’s the key: predictive models don’t explain why something happens. That’s where causality comes in. If you want to know whether a discount caused a behavior change, you need a causal model.
Causality vs. Prediction in Action
Let’s illustrate the difference. Imagine you run an e-commerce business and want to estimate Customer Lifetime Value (CLV) over 365 days.
CAC Predicts CLV
You have data on how much you spent to acquire customers (CAC) and their CLVs:
df_cac.head()
customer_id | cac | clv_365 |
0 | 15.55 | 32.48 |
1 | 52.35 | 104.17 |
2 | 90.33 | 178.90 |
3 | 66.81 | 135.26 |
4 | 61.42 | 123.14 |
Let’s say CAC is highly correlated with CLV:
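You can check this directly (a minimal check, assuming the df_cac frame shown above):

# Pearson correlation between acquisition cost and lifetime value
print(df_cac["cac"].corr(df_cac["clv_365"]))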
You train a Linear Regression model on this and get near-perfect results.
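One note before the code: prep_data_for_regression is a helper used throughout the article but never shown. A minimal sketch of what it might do (drop the ID column, encode gender numerically, split into train and test sets) follows; the exact encoding, any scaling, and the split parameters are assumptions.

from sklearn.model_selection import train_test_split


def prep_data_for_regression(df, target, test_size=0.2, random_state=42):
    """Hypothetical helper: drop the ID column, encode gender, and split train/test."""
    data = df.drop(columns=["customer_id"], errors="ignore").copy()
    if "gender" in data.columns:
        # Assumed encoding: male -> 1, female -> 0 (the original encoding is not shown)
        data["gender"] = (data["gender"] == "male").astype(int)
    X = data.drop(columns=[target])
    y = data[target]
    return train_test_split(X, y, test_size=test_size, random_state=random_state)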
# Imports used throughout the examples
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

X_train, X_test, y_train, y_test = prep_data_for_regression(df_cac, "clv_365")

# Train a linear regression model
model_1 = LinearRegression()
model_1.fit(X_train, y_train)

y_pred = model_1.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error: {mse:.2f}")
print(f"R-squared: {r2:.2f}")

>>> output
Mean Squared Error: 9.94
R-squared: 0.99
Great, right? Not so fast. This is synthetic data where the correlation was intentionally built in. In real life, such a correlation often doesn’t exist. More importantly, a predictive model like this can’t answer: "Does higher CAC cause higher CLV?"
Should you spend more on acquisition just because the model predicts higher CLV with higher CAC? Definitely not.
More Features: Age, Gender, Location, App Usage
You get more customer data: age, gender, urban_loc, and app_user. The data now looks like this:
df_customers.head()
customer_id | age | gender | urban_loc | app_user | clv_365 |
0 | 46 | male | 0 | 0 | 32.48 |
1 | 32 | female | 0 | 0 | 104.17 |
2 | 25 | female | 1 | 1 | 178.90 |
3 | 38 | female | 0 | 1 | 135.26 |
4 | 36 | female | 1 | 0 | 123.14 |
You retrain your model and achieve a great predictive score (R2 = 1.00, MSE = 4.62). Errors are normally distributed.
X_train, X_test, y_train, y_test = prep_data_for_regression(df_customers, "clv_365")

# Train a linear regression model
model_2 = LinearRegression()
model_2.fit(X_train, y_train)

y_pred = model_2.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error: {mse:.2f}")
print(f"R-squared: {r2:.2f}")

>>> output
Mean Squared Error: 4.62
R-squared: 1.00
If prediction is your only goal, you’re done. But if you want to know what drives CLV, you’re not.
Does App Usage Drive CLV?
Let's check the coefficient for app_user. It shows that app users bring $19.6 more. Is that a causal effect, or just a correlation?

# Feature importance from the linear model's coefficients
importance = pd.DataFrame(
    {
        "Feature": model_2.feature_names_in_,
        "Importance": model_2.coef_,
    }
).sort_values(by="Importance", ascending=False)
importance
Feature | Importance |
app_user | 19.6 |
urban_loc | 12.4 |
age | -14.8 |
gender | -23.8 |
Avg CLV: App Users vs. Non-App Users
Now, compare the average CLV of app users vs. non-app users. The difference is $46.5.

app_user_mean = df_customers.query("app_user == 1")["clv_365"].mean()
non_app_user_mean = df_customers.query("app_user == 0")["clv_365"].mean()
difference = app_user_mean - non_app_user_mean
So, what's the true effect: $19.6 or $46.5? Why did the estimated effect of app usage vary so much?
The Problem: Confounding
Let's look at the age distribution: younger users are more likely to use the app and have higher CLV. That's confounding: a third variable (age) influences both the treatment (app_user) and the outcome (CLV).
Specifically, age is a confounding variable:
- Younger users are more likely to use the app
- Younger users also tend to have higher CLV
This means that part of the observed effect of app usage is actually due to age, not the app itself.
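A quick check makes the confounding visible (a minimal sketch, assuming the df_customers frame shown above):

# Average age and CLV by app usage: app users skew younger and have higher CLV
print(df_customers.groupby("app_user")[["age", "clv_365"]].mean())

# App-usage rate and average CLV by age band: both decrease with age
age_bands = pd.cut(df_customers["age"], bins=[17, 25, 35, 50])
print(df_customers.groupby(age_bands)[["app_user", "clv_365"]].mean())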
Causal Models to the Rescue
There are many ways to estimate causal effects. Here we use a simple method: the S-learner, a type of meta-learner.
What is an S-Learner?
An S-learner uses a single standard ML model to estimate the Conditional Average Treatment Effect (CATE): it fits one model f(X, T) with the treatment included as an ordinary feature, then estimates CATE(x) = f(x, T=1) - f(x, T=0). Concretely:
1. Train an ML model on the original data (here, a tree-based estimator):
from sklearn.ensemble import RandomForestRegressor

X_train, X_test, y_train, y_test = prep_data_for_regression(df_customers, "clv_365")

model_3 = RandomForestRegressor(n_estimators=500, random_state=42)
model_3.fit(X_train, y_train)

y_pred = model_3.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

print(f"Root Mean Squared Error: {rmse:.2f}")
print(f"R-squared: {r2:.2f}")

>>> output
Root Mean Squared Error: 2.22
R-squared: 1.00
2. Set app_user = 1 for all users and predict CLV:
X_all = pd.concat([X_train, X_test], axis=0)
y_all = pd.concat([y_train, y_test], axis=0)
model_3.fit(X_all, y_all)

# X_all treated
X_all["app_user"] = 1  # Set all customers to treatment (app user)
y_pred_treated = model_3.predict(X_all)
3. Set app_user = 0 for all users and predict again:
# X_all control
X_all["app_user"] = 0  # Set all customers to control (non-app user)
y_pred_control = model_3.predict(X_all)
4. The difference gives individual treatment effects (CATEs):
# Calculate the treatment effect for each customer
df_customers_analysis = df_customers.copy()

# Align predictions back to the original row order via X_all's index
df_customers_analysis["clv_if_all_app_users"] = pd.Series(y_pred_treated, index=X_all.index).round(2)
df_customers_analysis["clv_if_all_non_app_users"] = pd.Series(y_pred_control, index=X_all.index).round(2)
df_customers_analysis["treatment_effect"] = (
    df_customers_analysis["clv_if_all_app_users"]
    - df_customers_analysis["clv_if_all_non_app_users"]
)
df_customers_analysis.head()
And if we add the new prediction columns to our original dataset, it will look like this:
customer_id | age | gender | urban_loc | app_user | clv_365 | clv_if_all_app_users | clv_if_all_non_app_users | treatment_effect |
0 | 46 | male | 0 | 0 | 32.48 | 137.27 | 96.48 | 40.79 |
1 | 32 | female | 0 | 0 | 104.17 | 176.62 | 135.78 | 40.84 |
2 | 25 | female | 1 | 1 | 178.90 | 179.02 | 138.48 | 40.54 |
3 | 38 | female | 0 | 1 | 135.26 | 66.96 | 26.41 | 40.55 |
4 | 36 | female | 1 | 0 | 123.14 | 154.23 | 113.94 | 40.29 |
On the left chart, you see the distribution of treatment effects centered around $40. On the right, each customer's actual CLV is shown alongside their hypothetical CLV if their app usage status were reversed: red dots represent CLV if they were not app users, and green dots if they were app users.
The average of all CATEs gives the Average Treatment Effect (ATE): the expected CLV uplift if everyone used the app.
Conclusion
Using the S-learner approach, we estimate the average treatment effect of app usage on CLV to be $40. Compare that to:
- Linear regression coefficient: $19.6 (underestimated)
- Raw group difference: $46.5 (overestimated)
Since we generated the synthetic data with a true effect of $40, this confirms the S-learner result is accurate.
This example shows the power of causal models: if you want to understand and influence outcomes, not just predict them, they're essential.
Appendix
Function used to generate the synthetic data:
import numpy as np
import pandas as pd


def generate_customer_data(n):
    np.random.seed(42)

    customer_ids = np.arange(n)
    age = np.random.randint(18, 51, size=n)
    gender = np.random.choice(["male", "female", "female"], size=n)  # females twice as likely
    urban_loc = np.random.choice([0, 1], size=n)

    # Younger customers are more likely to be app users (the confounder)
    min_age, max_age = age.min(), age.max()
    normalized_age = (age - min_age) / (max_age - min_age)
    base_prob = 0.4
    age_factor = 1.5 - normalized_age  # 1.5 for youngest, 0.5 for oldest
    probabilities = base_prob * age_factor
    probabilities = np.clip(probabilities, 0, 1)
    app_user = np.random.binomial(1, probabilities, size=n)

    # True CLV: app usage adds a fixed $40 effect
    base_clv = (
        age_factor * 50
        + (gender == "female") * 50
        + urban_loc * 25
        + app_user * 40
    )
    cac = base_clv * 0.5 + np.random.normal(0, base_clv.mean() * 0.01, size=n)
    noise = np.random.normal(0, base_clv.mean() * 0.02, size=n)  # 2% of mean CLV as noise
    clv_365 = base_clv + noise

    # Create DataFrame
    df = pd.DataFrame(
        {
            "customer_id": customer_ids,
            "age": age,
            "gender": gender,
            "urban_loc": urban_loc,
            "app_user": app_user,
            "cac": np.round(cac, 2),
            "clv_365": np.round(clv_365, 2),
        }
    )
    return df
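How the two frames used in the article were derived from this generator isn't shown; a plausible reconstruction (the sample size and column selection are assumptions) is:

df = generate_customer_data(10_000)              # assumed sample size
df_cac = df[["customer_id", "cac", "clv_365"]]   # frame used in the CAC example
df_customers = df.drop(columns=["cac"])          # frame used in the later examples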
"God is the only true cause. All other causes are instruments through which His will is realized."
– Fakhr al-Din al-Razi