Causal Factor Discovery in S&P 500 Returns
Moves beyond prediction: uses Double Machine Learning to estimate whether short-term momentum actually causes next-day stock returns, rather than merely correlating with them.
Double Machine Learning · Causal Inference · LSTM · Random Forest · yfinance · Python · Jupyter Notebook
[Figure: Causal Factor Discovery results]

The Core Question

Most ML applied to finance asks: "Can we predict tomorrow's return?" This project asks a harder question: "Does short-term momentum cause tomorrow's return, or does it just correlate with it?"

The distinction matters enormously. A spurious correlation can disappear when market conditions change. A causal relationship, if it exists, is structural and should be more robust. We use Double Machine Learning (DML), a method from econometrics, to estimate that causal effect.

Why not just use a predictive model? Standard models (LSTMs, Random Forests) exploit any correlation that helps prediction, including correlation induced by confounders. DML orthogonalizes the treatment variable (momentum) against the observed confounders, isolating its effect on the outcome (next-day return) under the assumption that no unobserved confounders remain.
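To make the orthogonalization idea concrete, here is a minimal sketch on synthetic data (not market data). It uses plain linear residualization rather than the ML models in this project, and every number in it (the true effect of 0.5, the confounder strengths) is an assumption chosen for illustration. A single confounder drives both the treatment and the outcome, so the naive slope overstates the effect, while the slope on residuals lands close to the value we planted.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Synthetic setup (assumed for illustration): one confounder drives both variables.
confounder = rng.normal(size=n)
treatment = 0.8 * confounder + rng.normal(size=n)
true_effect = 0.5
outcome = true_effect * treatment + 1.5 * confounder + rng.normal(size=n)


def slope(x, y):
    """OLS slope of y on x, with both variables centered first."""
    x = x - x.mean()
    y = y - y.mean()
    return float(x @ y / (x @ x))


# Naive regression of outcome on treatment: absorbs the confounder's influence.
naive = slope(treatment, outcome)

# Orthogonalize: strip the confounder's linear contribution from both variables,
# then regress residual on residual.
treatment_res = treatment - slope(confounder, treatment) * confounder
outcome_res = outcome - slope(confounder, outcome) * confounder
orthogonal = slope(treatment_res, outcome_res)

print(f"naive slope:      {naive:.3f}")
print(f"orthogonal slope: {orthogonal:.3f}  (planted effect: {true_effect})")
```

DML follows the same residual-on-residual logic but replaces the two linear fits with flexible ML models, which matters when the confounders act nonlinearly.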

Method: Double Machine Learning

DML (Chernozhukov et al., 2018) estimates causal effects in the presence of high-dimensional confounders using a two-stage approach:

  1. Residualize the treatment: Fit an ML model to predict the treatment (3-day momentum) from all confounders. Compute residuals.
  2. Residualize the outcome: Fit an ML model to predict the outcome (next-day return) from all confounders. Compute residuals.
  3. Causal estimate: Regress outcome residuals on treatment residuals. The coefficient is the Average Treatment Effect (ATE), i.e. the estimated causal effect of momentum on return.

K-fold cross-fitting keeps overfitting in the two ML models from biasing the causal estimate: each observation's residuals come from models trained on the other folds. Conditional ATE (CATE) is also estimated per stock to capture heterogeneous effects.
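The three steps and the cross-fitting described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the notebook's actual code: the function names (`cross_fit_residuals`, `residual_slope`, `dml_effects`) are hypothetical, the use of scikit-learn's `RandomForestRegressor` as the nuisance model and 5 shuffled folds are our choices for the example, the data is synthetic, and the per-group slope is only a simple stand-in for a per-ticker CATE.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold


def cross_fit_residuals(X, target, n_splits=5, seed=0):
    """Out-of-fold residuals: each row is predicted by a model that never saw it."""
    residuals = np.empty(len(target), dtype=float)
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, test_idx in folds.split(X):
        # Nuisance learner is an assumption; any regressor could be used here.
        model = RandomForestRegressor(
            n_estimators=50, min_samples_leaf=20, random_state=seed
        )
        model.fit(X[train_idx], target[train_idx])
        residuals[test_idx] = target[test_idx] - model.predict(X[test_idx])
    return residuals


def residual_slope(t_res, y_res):
    """Step 3: no-intercept regression of outcome residuals on treatment residuals."""
    return float(t_res @ y_res / (t_res @ t_res))


def dml_effects(X, treatment, outcome, groups):
    """Return (overall effect, {group: effect}) from cross-fitted residuals."""
    t_res = cross_fit_residuals(X, treatment)  # step 1
    y_res = cross_fit_residuals(X, outcome)    # step 2
    ate = residual_slope(t_res, y_res)         # step 3
    per_group = {
        str(g): residual_slope(t_res[groups == g], y_res[groups == g])
        for g in np.unique(groups)
    }
    return ate, per_group


# Synthetic demo (assumed values): group "A" has effect 0.3, group "B" has 0.7.
rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 3))
groups = np.where(np.arange(n) % 2 == 0, "A", "B")
true_effect = np.where(groups == "A", 0.3, 0.7)
treatment = np.sin(X[:, 0]) + rng.normal(size=n)
outcome = (
    true_effect * treatment + np.cos(X[:, 1]) + 0.5 * X[:, 0] + rng.normal(size=n)
)

ate, per_group = dml_effects(X, treatment, outcome, groups)
print("overall effect:", round(ate, 3))
print("per-group effects:", {g: round(v, 3) for g, v in per_group.items()})
```

One design caveat: shuffled folds mix past and future observations, which is convenient for a synthetic demo but may leak information in a real return series, where time-ordered or blocked folds would be a more cautious choice.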

Features Used

Data is pulled from yfinance for five tickers: AAPL, MSFT, JPM, AMZN, XOM (2019–present).
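As a rough illustration of how the treatment and outcome columns could be derived from a closing-price series, the sketch below defines 3-day momentum as the simple return over the previous three trading days. That definition, the helper name `add_momentum_and_target`, and the toy prices are all assumptions made for this example; the notebook may define them differently.

```python
import pandas as pd


def add_momentum_and_target(close: pd.Series, window: int = 3) -> pd.DataFrame:
    """Build treatment (momentum) and outcome (next-day return) from closing prices."""
    daily_return = close / close.shift(1) - 1
    frame = pd.DataFrame(
        {
            # Assumed definition: simple return over the last `window` trading days.
            "momentum": close / close.shift(window) - 1,
            # Outcome: the return realized on the following trading day.
            "next_day_return": daily_return.shift(-1),
        }
    )
    # Drop the warm-up rows and the final row, which has no next-day return.
    return frame.dropna()


# Toy prices for illustration only. With real data, `close` would be one ticker's
# closing-price column from a yfinance download starting in 2019.
close = pd.Series(
    [100.0, 101.0, 99.0, 102.0, 103.0, 101.0],
    index=pd.bdate_range("2019-01-01", periods=6),
    name="close",
)
features = add_momentum_and_target(close)
print(features)
```

Shifting the return by one day keeps the outcome strictly after the information used to compute momentum, which avoids look-ahead in the treatment.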

Predictive Benchmarks

In parallel, we benchmark traditional predictive models to see how much of the return variance they capture, and compare them with the DML causal estimate:

Model                                     | Approach           | Metric
Linear Regression (OLS)                   | Predictive         | MAE, MSE, R²
Random Forest                             | Predictive         | MAE, MSE, R²
LSTM (2-layer, early stopping)            | Predictive         | MAE, MSE, R²
LSTM Ablation (base/deep/wide/deep-wide)  | Architecture study | Training/val loss curves
DML (our method)                          | Causal             | ATE, CATE per ticker
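For the two non-neural benchmarks, the metrics in the table could be computed along these lines. This is a sketch on synthetic data with a deliberately weak signal; the `evaluate` helper, the 80/20 chronological split, and the hyperparameters are assumptions for illustration, and the LSTM rows are omitted to keep the example short.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def evaluate(model, X_train, y_train, X_test, y_test):
    """Fit on the training window and report MAE, MSE and R² on the test window."""
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    return {
        "MAE": mean_absolute_error(y_test, pred),
        "MSE": mean_squared_error(y_test, pred),
        "R2": r2_score(y_test, pred),
    }


# Synthetic stand-in for the feature matrix and next-day returns (assumed values).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = 0.1 * X[:, 0] + rng.normal(size=1000)  # weak signal buried in noise

# Chronological split: train on the first 80% of rows, test on the rest.
split = 800
models = {
    "OLS": LinearRegression(),
    "Random Forest": RandomForestRegressor(
        n_estimators=100, min_samples_leaf=10, random_state=0
    ),
}
scores = {
    name: evaluate(model, X[:split], y[:split], X[split:], y[split:])
    for name, model in models.items()
}
for name, metrics in scores.items():
    print(name, {k: round(v, 4) for k, v in metrics.items()})
```

These scores measure predictive fit only. A low R² here says nothing about whether the DML coefficient is nonzero, since the two approaches answer different questions.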

References