Chapter 4: Multi-factor and Beta Models
One regressor is rarely enough. Multiple regression introduces three new ideas: joint vs individual significance, multicollinearity, and adjusted \(R^2\). We apply them to the Fama–French factor model — first three factors, then five — and use it to attribute the return of a portfolio and to construct a market-neutral hedge.
By the end of the chapter you will be able to:
- Set up a multiple regression \(y = X\beta + \varepsilon\) with \(p\) columns of regressors and read the matrix algebra of \(\hat\beta = (X^\top X)^{-1} X^\top y\).
- Read every line of a multi-regressor statsmodels summary: coefficient, standard error, \(t\)-statistic, \(p\)-value, and confidence interval, and understand what each is asking.
- Tell apart individual significance (the \(t\)-test on a single coefficient) from joint significance (the \(F\)-test on a block of coefficients).
- Diagnose multicollinearity with the variance inflation factor, recognise the small individual \(t\), large overall \(F\) fingerprint, and know which statistics remain trustworthy under it.
- Compute adjusted \(R^2\) and use it correctly when comparing nested models with different numbers of regressors.
- Estimate Fama–French three- and five-factor models on a stock, interpret the loadings on SMB, HML, RMW, and CMA, and use Newey–West HAC standard errors when residuals are serially correlated.
- Attribute a portfolio’s return into factor contributions and idiosyncratic alpha, and design the SPY hedge that neutralises market risk.
Chapter Introduction
Where you’ll see this: This chapter is the one that finance interviewers, risk-system documentation, and academic finance papers all assume you know. If somebody tells you “the stock has a market beta of 1.2, a value tilt of \(-0.4\), and a small alpha”, they are quoting numbers out of a multi-factor regression of the kind we build here. Why this matters: the same arithmetic decomposes a portfolio’s return into pieces and tells you how much SPY to short to hedge the market exposure.
Why one regressor is not enough
The previous chapter fit a single line through the cloud of (market, stock) daily returns. The slope of that line was \(\beta\) — the stock’s exposure to the broad market — and the intercept was Jensen’s alpha. The model has the virtue of fitting on a Bloomberg terminal, and the vice of being incomplete. In real return data, the residuals from a CAPM regression are not white noise. They cluster in predictable ways by company size, by valuation, by profitability. A portfolio of small-cap value stocks that has been CAPM-hedged still has its residuals moving together; an investor who shorted SPY against the portfolio’s market beta has not, in fact, fully hedged. Hedging with the market index removes only one of several systematic risk dimensions.
This empirical observation, first documented carefully by Fama and French in 1992–1993 and elaborated into a five-factor specification in 2015, drove a generation of asset pricing research. It is also the reason every institutional risk system in 2026 reports not one beta but a vector of betas — typically four to seven — and computes the hedge against each of them separately. The model that does the bookkeeping is multiple linear regression.
Multiple regression — same idea as a regression line, but instead of one predictor we have several (market, size, value, profitability, investment, …). Each one gets its own coefficient saying “when this factor moves by 1%, \(y\) moves by approximately \(\beta_k\)%, holding the others fixed.”
Fama–French — Eugene Fama and Ken French (1990s–2010s) noticed that beyond the market, size (SMB: small minus big) and value (HML: high book-to-market minus low) systematically explain stock returns. Later they added profitability (RMW) and investment (CMA). These five factors together are the workhorse risk model in academic finance.
Multiple regression is not just a longer formula. It introduces three new statistical phenomena that single-regressor analysis cannot see. First, with more than one regressor we must distinguish the question does any of these regressors matter? (a joint test) from the question does this particular regressor matter, holding the others fixed? (an individual test). The first is answered by an \(F\)-statistic, the second by a \(t\)-statistic, and the two can disagree. Second, regressors are typically correlated with one another — value stocks tend to invest conservatively, so HML and CMA share information. When regressors overlap, ordinary least squares cannot cleanly attribute credit between them, and individual coefficient standard errors inflate. This is multicollinearity, and it has a textbook fingerprint: each individual \(t\)-statistic is small, yet the joint \(F\) overwhelmingly rejects. Third, adding any regressor — even pure noise — mechanically raises \(R^2\). Comparing models of different size by \(R^2\) is therefore a trap. We replace it with adjusted \(R^2\), which docks the goodness-of-fit measure for each additional parameter.
The Fama–French factor model
The cleanest application of multiple regression in finance is the Fama–French factor model. Its specification is
\[ r_i - r_f \;=\; \alpha_i \;+\; \beta_M (r_M - r_f) \;+\; s \cdot \mathrm{SMB} \;+\; h \cdot \mathrm{HML} \;+\; r \cdot \mathrm{RMW} \;+\; c \cdot \mathrm{CMA} \;+\; \varepsilon_i. \]
Each \(X\) variable is itself a portfolio return — a long-short spread between two pre-sorted groups of stocks. SMB is small minus big, HML is high minus low book-to-market, RMW is robust minus weak operating profitability, CMA is conservative minus aggressive investment. The vector \((\beta_M, s, h, r, c)\) is the stock’s exposure to each of those five risk dimensions. The intercept \(\alpha_i\) is the average return left unexplained — “alpha” in the strict sense of the model.
For a working quant, the FF5 regression has three uses. Attribution: it decomposes a stock’s or portfolio’s return into the part driven by each factor and a residual. Hedging: it tells you how much of each factor portfolio to short to neutralise the exposure. Skill evaluation: a positive, statistically significant \(\hat\alpha\) — and only that — is the part of return that cannot be cloned by buying a passive combination of factor portfolios. The vast majority of “alpha” reported by mutual funds turns out, on FF5 examination, to be factor beta in disguise.
What we just learned: A multi-factor regression spits out a vector \((\hat\alpha, \hat\beta_M, \hat s, \hat h, \hat r, \hat c)\). The intercept is “skill”; the rest are “style”. You will use all of them.
Anatomy of the FF5 regression equation
Every loading in the FF5 equation has a name in academic finance and a concrete trading meaning. The diagram below renders the equation and labels each piece.
Each \(\beta_k\) is a partial slope: “by how much does \(r_i\) move when factor \(k\) moves by 1 unit, holding the other four factors fixed?” Performance attribution = reading this vector.
Roadmap
The chapter proceeds in the natural order of the workflow. We first generalise the regression model from one \(X\) to \(p\) regressors and write out the matrix algebra of the OLS estimator. We then read a multi-regressor statsmodels summary and unpack each column. The \(F\)-test gives us joint inference. Multicollinearity and the variance inflation factor explain when individual \(t\)-statistics can be misleading. Adjusted \(R^2\) enters as the correct model-comparison tool. We then turn to the Fama–French three- and five-factor models, fit them on a real stock with Newey–West HAC standard errors, interpret the loadings, and use them to build an attribution and a hedge. The chapter ends with a worked example and a set of exercises.
Model → Estimate → Infer → Diagnose → Apply.
The same five-step pipeline from the previous chapter reappears. What changes is that every step now has to handle a vector of coefficients rather than a single slope, which is where the new statistical machinery comes in.
From one regressor to many
Where you’ll see this / why it matters: Every multi-factor model in finance — CAPM extensions, Fama–French, Carhart momentum, Barra risk models — is technically a multiple linear regression. If you can write down \(\mathbf y = \mathbf X \boldsymbol\beta + \boldsymbol\varepsilon\) and read off what each piece means, you have understood 80% of empirical asset pricing.
The multiple regression model
We are still drawing a line through data, except now the “line” lives in a higher-dimensional space — one slope per predictor. Each slope answers a “what if” question: if this one factor moves by one unit and everything else holds still, how much does \(y\) move?
Generalising the previous chapter’s simple regression from one \(X\) to \(p\) predictors gives the multiple linear regression model:
\[ y_t \;=\; \beta_0 \;+\; \beta_1 x_{1t} \;+\; \beta_2 x_{2t} \;+\; \cdots \;+\; \beta_p x_{pt} \;+\; \varepsilon_t, \qquad t = 1, 2, \ldots, n. \]
The interpretation of each \(\beta_k\) is one of the most important — and most misread — ideas in applied statistics. The coefficient \(\beta_k\) is the partial effect of \(X_k\) on \(Y\), holding all other regressors \(X_1, \ldots, X_{k-1}, X_{k+1}, \ldots, X_p\) constant. It is not the marginal effect of \(X_k\) in isolation. When the regressors are correlated — which in finance they almost always are — the simple regression slope of \(Y\) on \(X_k\) and the multiple regression coefficient on \(X_k\) are different numbers, and they answer different questions.
In a CAPM regression on a small-cap value stock, the slope on the market factor confounds three things: pure market exposure, the stock’s tendency to behave like a small-cap, and its tendency to behave like a value stock. The CAPM \(\hat\beta_M\) mixes these. In the Fama–French three-factor regression, \(\hat\beta_M\) measures only the pure market exposure, because SMB and HML have absorbed the size and value components. The interpretive shift between simple and multiple regression is the shift from total to partial exposure.
What we just learned: A coefficient’s meaning depends on which other regressors are in the model with it. The market beta on a CAPM is not the same number as the market beta on an FF3, even on the same stock and the same days.
The coefficient on \(X_k\) in a multiple regression is the effect of \(X_k\) with the other regressors held fixed. Including or omitting another regressor will, in general, change every other coefficient in the model.
The matrix formulation
For \(n\) observations and \(p\) regressors plus an intercept, the model has \(n\) equations. Writing them out:
\[ \begin{aligned} y_1 &= \beta_0 + \beta_1 x_{11} + \beta_2 x_{21} + \cdots + \beta_p x_{p1} + \varepsilon_1, \\ y_2 &= \beta_0 + \beta_1 x_{12} + \beta_2 x_{22} + \cdots + \beta_p x_{p2} + \varepsilon_2, \\ &\;\;\vdots \\ y_n &= \beta_0 + \beta_1 x_{1n} + \beta_2 x_{2n} + \cdots + \beta_p x_{pn} + \varepsilon_n. \end{aligned} \]
Writing 500 equations by hand is impractical. Matrix notation collapses them into a single line. Define
\[ \mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}, \quad \mathbf{X} = \begin{bmatrix} 1 & x_{11} & x_{21} & \cdots & x_{p1} \\ 1 & x_{12} & x_{22} & \cdots & x_{p2} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{1n} & x_{2n} & \cdots & x_{pn} \end{bmatrix}, \quad \boldsymbol\beta = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{bmatrix}, \quad \boldsymbol\varepsilon = \begin{bmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_n \end{bmatrix}. \]
The model then reads
\[ \mathbf{y} \;=\; \mathbf{X}\boldsymbol\beta \;+\; \boldsymbol\varepsilon. \]
The matrix \(\mathbf X\) is the design matrix. It has \(n\) rows (one per observation) and \(p+1\) columns (one per regressor plus a leading column of ones for the intercept). The design matrix encodes everything the model knows about the inputs. The first column is all ones — that column is what makes the regression include an intercept; remove it and the line is forced through the origin. The remaining \(p\) columns are the regressors, in whatever units they happen to be in. The vector \(\boldsymbol\beta\) has length \(p+1\) and contains the intercept and the \(p\) partial slopes. The error vector \(\boldsymbol\varepsilon\) has length \(n\).
What we just learned: Think of \(\mathbf X\) as a spreadsheet — rows are dates, columns are predictors, plus a column of 1s for the intercept. Everything OLS knows is in that spreadsheet.
Anatomy of the design matrix \(\mathbf{X}\)
Before we run a real Fama–French regression, let’s stare at the design matrix and label every piece. The diagram below shows a 5-row × 6-column \(\mathbf{X}\) for an FF5 regression — one row per date, one column per factor (plus an intercept column).
This is exactly the spreadsheet OLS will see. Every diagnostic in this chapter — VIF, condition number, leverage — is a function of these numbers alone.
The design matrix is the single most important object in applied regression. Almost every diagnostic — leverage, collinearity, condition number, prediction variance — is a function of \(\mathbf X\) alone. Always check \(\mathbf X\) visually before fitting: column means, ranges, and pairwise correlations. Bad data in a column of \(\mathbf X\) is the most common cause of nonsense coefficients.
The OLS estimator
Ordinary least squares chooses \(\boldsymbol\beta\) to minimise the sum of squared residuals,
\[ S(\boldsymbol\beta) \;=\; \sum_{t=1}^n \bigl( y_t - \beta_0 - \beta_1 x_{1t} - \cdots - \beta_p x_{pt} \bigr)^2 \;=\; (\mathbf{y} - \mathbf{X}\boldsymbol\beta)^\top (\mathbf{y} - \mathbf{X}\boldsymbol\beta). \]
Setting the gradient with respect to \(\boldsymbol\beta\) to zero and rearranging gives the normal equations,
\[ \mathbf{X}^\top \mathbf{X} \,\boldsymbol\beta \;=\; \mathbf{X}^\top \mathbf{y}, \]
and so long as \(\mathbf{X}^\top \mathbf{X}\) is invertible — equivalently, the columns of \(\mathbf{X}\) are linearly independent — the unique solution is
\[ \boxed{\;\widehat{\boldsymbol\beta} \;=\; (\mathbf{X}^\top \mathbf{X})^{-1}\,\mathbf{X}^\top \mathbf{y}\;} \]
The fitted values and residuals are \(\widehat{\mathbf y} = \mathbf X \widehat{\boldsymbol\beta}\) and \(\widehat{\boldsymbol\varepsilon} = \mathbf y - \widehat{\mathbf y}\). Under the classical Gauss–Markov assumptions — the same LINE conditions you met in the previous chapter, now applied to a vector of regressors — \(\widehat{\boldsymbol\beta}\) is unbiased and has the smallest variance of any linear unbiased estimator. This is the Gauss–Markov theorem: OLS is BLUE (best linear unbiased estimator).
The variance of \(\widehat{\boldsymbol\beta}\) has the matrix form
\[ \mathrm{Var}\bigl(\widehat{\boldsymbol\beta}\bigr) \;=\; \sigma^2 \,(\mathbf X^\top \mathbf X)^{-1}, \]
where \(\sigma^2\) is the residual variance, estimated by \(\hat\sigma^2 = \mathrm{SSR}/(n - p - 1)\) with SSR being the residual sum of squares. The square roots of the diagonal entries of \(\hat\sigma^2 (\mathbf X^\top \mathbf X)^{-1}\) are the standard errors \(SE(\hat\beta_k)\) that drive every individual \(t\)-statistic in the regression output.
What we just learned: OLS is one matrix formula. The same matrix \((\mathbf X^\top \mathbf X)^{-1}\) delivers the coefficients and the standard errors. Hold on to this — when we discuss multicollinearity later, it is the same matrix going bad.
The closed-form expression \(\widehat{\boldsymbol\beta} = (\mathbf X^\top \mathbf X)^{-1} \mathbf X^\top \mathbf y\) tells you three things at once. First, it says the OLS estimator is a linear function of \(\mathbf y\) — so its sampling distribution inherits the distribution of the noise. Second, it makes clear what can go wrong: if any two columns of \(\mathbf X\) are perfectly correlated, \(\mathbf X^\top \mathbf X\) is singular, the inverse does not exist, and OLS fails. Third, it shows where the standard errors come from: the same \((\mathbf X^\top \mathbf X)^{-1}\) matrix appears in \(\mathrm{Var}(\widehat{\boldsymbol\beta})\). When regressors are nearly collinear, \((\mathbf X^\top \mathbf X)^{-1}\) has large diagonal entries — and that is multicollinearity, viewed algebraically.
Building the design matrix by hand
Let us see this in action on a toy three-day dataset. Suppose we have NVDA daily excess returns \(y\) and two regressors — the market excess return \(X_1\) and the SMB factor \(X_2\):
| Day | \(y\) (NVDA excess) | \(X_1\) (Mkt-RF) | \(X_2\) (SMB) |
|---|---|---|---|
| 1 | \(+0.020\) | \(+0.010\) | \(+0.005\) |
| 2 | \(-0.015\) | \(-0.008\) | \(-0.002\) |
| 3 | \(+0.005\) | \(+0.002\) | \(-0.001\) |
The design matrix \(\mathbf X\) and response vector \(\mathbf y\) are
\[ \mathbf X = \begin{bmatrix} 1 & 0.010 & 0.005 \\ 1 & -0.008 & -0.002 \\ 1 & 0.002 & -0.001 \end{bmatrix}, \qquad \mathbf y = \begin{bmatrix} 0.020 \\ -0.015 \\ 0.005 \end{bmatrix}. \]
In Python:
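```python
import numpy as np

# Toy three-day dataset from the table above
X = np.array([
    [1.0,  0.010,  0.005],   # intercept column, Mkt-RF, SMB
    [1.0, -0.008, -0.002],
    [1.0,  0.002, -0.001],
])
y = np.array([0.020, -0.015, 0.005])

# Normal equations (X'X) beta = X'y, solved without an explicit inverse
XtX = X.T @ X
Xty = X.T @ y
beta_hat = np.linalg.solve(XtX, Xty)
print(beta_hat)   # (beta0_hat, beta1_hat, beta2_hat)
```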
The output shows the three OLS coefficients \((\hat\beta_0, \hat\beta_1, \hat\beta_2)\) for this dataset. With three observations and three parameters the regression is, of course, fitting the data perfectly — the residuals are zero. With even one more observation the system becomes overdetermined and OLS gives the best linear projection of \(\mathbf y\) onto the column space of \(\mathbf X\).
What we just learned: OLS is two matrix multiplications followed by one solve. No fancy optimization — the answer is closed-form.
The formula \(\widehat{\boldsymbol\beta} = (\mathbf X^\top \mathbf X)^{-1} \mathbf X^\top \mathbf y\) is the right mathematical statement, but in code you should never explicitly invert \(\mathbf X^\top \mathbf X\). Use np.linalg.solve(XtX, Xty) or, better still, statsmodels.OLS, which uses a QR decomposition internally. Explicit inversion is numerically unstable when the regressors are nearly collinear — the very situation where the matrix is hardest to invert correctly.
Reading a multi-regressor statsmodels summary
Where you’ll see this / why it matters:
statsmodels’ .summary() output is the lingua franca of empirical finance. Every academic paper, every analyst memo, every risk-system audit shows a version of this table. Learning to read it quickly — and ignore the parts that don’t matter on a first read — is a high-leverage skill.
A first FF3 fit
We now move from toy data to a real (but small, browser-friendly) Fama–French three-factor regression. The setup uses the same NVDA + SPY data shipped with the previous chapter, augmented with the Fama–French factors over 2023–2024.
This DataFrame is the workhorse of the chapter. Each row is one trading day; the first column is NVDA’s excess return, the next five are the Fama–French factor returns. With this in hand the FF3 regression is three lines.
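A minimal sketch of those three lines (the DataFrame name df and its column labels are assumptions about how the chapter’s dataset is stored):

```python
import statsmodels.api as sm

# df: one row per trading day; first column NVDA's excess return, then the factor columns
y = df["NVDA_excess"]
X = sm.add_constant(df[["Mkt-RF", "SMB", "HML"]])   # prepend the intercept column of ones
ff3 = sm.OLS(y, X).fit()
print(ff3.summary())
```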
Pre-read for the regression output below: This is statsmodels’ summary. We will only look at four things on first read: the const row (alpha), each factor row (loadings on Mkt-RF, SMB, HML), the F-statistic block at the top (joint test of all slopes), and adjusted R². Everything else can be ignored on first read.
What we just learned: Fitting an FF3 regression in statsmodels is three lines — add a constant column, call OLS().fit(), print the summary. The hard work is reading the output, not running the code.
The output is a wall of numbers. We will dissect it into blocks — per-coefficient output, overall fit, and the bottom diagnostics — and read each in turn; the joint \(F\)-test, whose headline value sits in the overall-fit block, gets a section of its own.
Block 1: per-coefficient output
The middle block of the summary contains, for each regressor, five columns: the estimated coefficient, its standard error, its \(t\)-statistic, its \(p\)-value, and the lower and upper bounds of its 95% confidence interval. They look like this:
coef std err t P>|t| [0.025 0.975]
const 0.0008 0.001 1.30 0.193 -0.000 0.002
Mkt-RF 1.85 0.07 27.6 0.000 1.71 1.99
SMB -0.42 0.13 -3.2 0.001 -0.68 -0.16
HML -0.71 0.10 -7.0 0.000 -0.91 -0.51
(The exact numbers depend on the dataset; the layout is universal.)
The coefficient \(\hat\beta_k\) is the partial slope of \(Y\) on \(X_k\) holding the other regressors fixed. The standard error is the square root of the corresponding diagonal entry of \(\widehat{\mathrm{Var}}(\widehat{\boldsymbol\beta})\); it measures the sampling variability of \(\hat\beta_k\). The \(t\)-statistic is the ratio \(\hat\beta_k / SE(\hat\beta_k)\), which under \(H_0: \beta_k = 0\) has a \(t\)-distribution with \(n - p - 1\) degrees of freedom. The \(p\)-value is the two-tailed probability that a \(t\)-statistic at least as extreme would arise if the true coefficient were zero. The 95% confidence interval is \(\hat\beta_k \pm t_{0.025,\, n-p-1} \cdot SE(\hat\beta_k)\) — the range of values for \(\beta_k\) that the data fail to reject at the 5% level.
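To see where the interval comes from, here is the CI arithmetic replayed on the Mkt-RF row of the illustrative table (the degrees of freedom are an assumed sample size):

```python
import scipy.stats as st

# 95% CI: beta_hat ± t_crit * SE, with dof = n - p - 1 (500 is an assumed value)
beta_hat, se, dof = 1.85, 0.07, 500
t_crit = st.t.ppf(0.975, dof)
print(beta_hat - t_crit * se, beta_hat + t_crit * se)   # ≈ (1.71, 1.99)
```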
When you read each row, do three passes. First, look at the sign and magnitude of the coefficient: does it make economic sense? An NVDA market beta of 1.85 says NVDA amplifies market moves by 85%, which fits a high-beta megacap chipmaker. A coefficient of \(-0.71\) on HML says NVDA is the opposite of a value stock — it moves with growth, against the high-minus-low book-to-market spread. A coefficient of \(-0.42\) on SMB says NVDA is the opposite of a small-cap — it moves with large-caps. These three signs together identify NVDA as a large-cap growth stock with high market sensitivity, which is what the equity research analysts would say in plain English.
Second, look at the \(t\)-statistic. As a rule of thumb (good enough when \(n - p - 1 > 30\)), a \(|t|\) above 2 corresponds to a \(p\)-value below 5%, and a \(|t|\) above 2.58 corresponds to a \(p\)-value below 1%. All three slopes above clear the 5% bar easily; the intercept does not. NVDA in this sample has no statistically significant Jensen’s alpha once the FF3 factors absorb its systematic exposures.
Third, look at the confidence interval. The CI gives a range of plausible coefficient values, which is more informative than the point estimate alone. A \(\beta_M\) CI of \([1.71, 1.99]\) tells the risk manager that even at the low end of the data’s compatibility, NVDA has more market exposure than the market itself; even at the high end, the exposure is below 2. The CI for the intercept includes zero, which is the same statement as failing to reject the alpha test.
What we just learned: Each row of the coefficient block answers: sign (direction), t-stat (significance), CI (range of plausible values). Three passes, three numbers per row, and you have read the regression.
Block 2: overall fit
The top-right block reports \(R^2\), adjusted \(R^2\), the \(F\)-statistic, Prob (F-statistic) (i.e. the \(F\)-test \(p\)-value), the log-likelihood, AIC, and BIC. We will discuss the \(F\)-test in detail in the next section. For now, focus on \(R^2\) and adjusted \(R^2\) — those are the headline numbers of the regression.
\(R^2\) is the fraction of total variance in \(Y\) that the model explains:
\[ R^2 \;=\; 1 \;-\; \frac{\mathrm{SSR}}{\mathrm{SST}}, \qquad \mathrm{SST} = \sum_{t=1}^n (y_t - \bar y)^2, \quad \mathrm{SSR} = \sum_{t=1}^n \hat\varepsilon_t^2. \]
In our FF3 fit on NVDA, \(R^2 \approx 0.58\) — the three factors together explain about 58% of NVDA’s daily excess-return variance. The remaining 42% is idiosyncratic risk, which a diversified portfolio could in principle remove.
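The decomposition is easy to verify on the fitted model. A sketch, reusing y and the ff3 fit from the code above:

```python
# R^2 from its definition, checked against the statsmodels attribute
ssr = (ff3.resid ** 2).sum()          # residual sum of squares
sst = ((y - y.mean()) ** 2).sum()     # total sum of squares
print(1 - ssr / sst, ff3.rsquared)    # the two numbers agree
```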
Adding any regressor — even a column of random numbers — never decreases \(R^2\): OLS can always set the new coefficient to zero and recover the old fit, so the in-sample residual sum of squares can only fall or stay flat. Comparing models of different size by \(R^2\) alone will therefore systematically favour the bigger model. Adjusted \(R^2\), defined later in this chapter, fixes this.
Block 3: the bottom diagnostics
The summary’s bottom block reports residual diagnostics — Durbin–Watson, Jarque–Bera, skewness, kurtosis, and the condition number of \(\mathbf X\). We will not dwell on each, but two are particularly load-bearing for the rest of the chapter.
The condition number of \(\mathbf X^\top \mathbf X\) measures how badly conditioned the design matrix is — equivalently, how close to collinear the regressors are. A condition number below 30 is typically fine; values above 100 are a red flag; values in the thousands signal severe collinearity and untrustworthy individual coefficients. statsmodels reports the square root of the condition number under the label Cond. No..
The Durbin–Watson statistic ranges between 0 and 4 and tests for first-order autocorrelation in the residuals. Values near 2 indicate no autocorrelation; values much below 2 indicate positive autocorrelation (which is endemic in financial time series and is the reason we will switch to Newey–West HAC standard errors later).
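Both diagnostics are available programmatically; a sketch on the ff3 fit from earlier:

```python
from statsmodels.stats.stattools import durbin_watson

print(durbin_watson(ff3.resid))   # near 2 → little first-order residual autocorrelation
print(ff3.condition_number)       # the value printed as "Cond. No." in the summary
```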
Joint significance: the \(F\)-test
Where you’ll see this / why it matters: If a job interviewer asks “are the Fama–French factors significant?” they are asking for the \(F\)-test result, not five separate \(t\)-statistics. Whenever someone says “this block of regressors jointly matters”, they are quoting an \(F\)-test. It is the standard tool for deciding whether to add a group of new factors to a model.
The \(F\)-test asks one yes/no question: does this group of regressors, all considered together, add real fit to the model? A high \(F\) means “yes, somewhere in this group is real signal”; a low \(F\) means “you could drop this whole group and lose nothing.” Unlike the \(t\)-test (one regressor at a time), the \(F\)-test handles a whole block at once and is immune to the multicollinearity headaches we will meet next section.
From individual to joint inference
Suppose you have just added SMB and HML to the CAPM regression and want to know whether the two new regressors jointly improve the model. The natural null hypothesis is
\[ H_0 : \beta_{\mathrm{SMB}} = 0 \;\text{ and }\; \beta_{\mathrm{HML}} = 0. \]
You could test each individually with its \(t\)-statistic, but that procedure has two well-known problems. First, multiple-testing inflates the false-positive rate — running two 5%-level \(t\)-tests on a true null gives nearly a 10% chance of falsely rejecting at least one. Second, when SMB and HML are correlated with each other and with the market factor, individually neither \(t\)-statistic may clear the threshold even though the two together contribute a great deal of explanatory power. A test that pools the two regressors into a single joint hypothesis avoids both problems.
The instrument is the \(F\)-test for nested models. Let the unrestricted model contain all \(p\) regressors of interest with residual sum of squares \(\mathrm{SSR}_U\). Let the restricted model impose \(q\) linear restrictions on the coefficients (in our example, \(\beta_{\mathrm{SMB}} = \beta_{\mathrm{HML}} = 0\), so \(q = 2\)) and have residual sum of squares \(\mathrm{SSR}_R\). The \(F\)-statistic is
\[ F \;=\; \frac{(\mathrm{SSR}_R - \mathrm{SSR}_U)/q}{\mathrm{SSR}_U/(n - p - 1)} \;\sim\; F_{q,\,n-p-1}. \]
The numerator is the per-restriction improvement in fit when you free up the \(q\) coefficients. The denominator is the per-degree-of-freedom noise variance estimate. Their ratio is large when the freed coefficients actually buy a fit improvement larger than the noise floor would.
What we just learned: The \(F\)-statistic is “improvement in fit per new regressor” divided by “noise per leftover degree of freedom.” A large \(F\) means the new regressors are pulling more than their weight.
Anatomy of the partial \(F\)-statistic
The same formula written with arrows pointing at each piece:
Large numerator (the new block of regressors really does shrink the residuals) divided by small denominator (low noise level) gives a large \(F\) — we reject \(H_0\) and keep the new regressors.
The classical \(F\)-test reported by statsmodels at the top of the summary tests the all-slopes-zero null,
\[ H_0 : \beta_1 = \beta_2 = \cdots = \beta_p = 0, \]
which is the special case where the restricted model is intercept-only. Almost any real regression rejects this null at high significance — it is the easiest possible joint hypothesis to reject — so the headline \(F\)-statistic is rarely a decision-relevant test. What is decision-relevant is the partial \(F\)-test, where you compare two non-trivial nested models, for example CAPM nested inside FF3, or FF3 nested inside FF5.
Worked example: does adding SMB and HML beat CAPM?
We use the NVDA daily data again. Two models, the CAPM (one slope plus intercept) and the FF3 (three slopes plus intercept), are nested: setting \(s = h = 0\) in FF3 recovers CAPM. We compute both residual sums of squares and form the \(F\)-statistic by hand, then check statsmodels’s built-in test.
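A sketch of both computations, reusing y, df, and the ff3 fit from earlier:

```python
import scipy.stats as st

# Restricted model: CAPM (imposes s = h = 0); unrestricted: the ff3 fit
capm = sm.OLS(y, sm.add_constant(df[["Mkt-RF"]])).fit()

q = 2                                                      # number of restrictions
F = ((capm.ssr - ff3.ssr) / q) / (ff3.ssr / ff3.df_resid)  # partial F by hand
p_value = st.f.sf(F, q, ff3.df_resid)

F_sm, p_sm, _ = ff3.compare_f_test(capm)                   # packaged version
print(F, F_sm, p_value, p_sm)                              # F and p match exactly
```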
The two numbers should match exactly — this is the same arithmetic, just packaged. The associated \(p\)-value tells you the probability of observing a fit improvement this large by chance under the null that SMB and HML are jointly useless. In the NVDA 2023–2024 sample, the \(F\)-statistic is large and the \(p\)-value is essentially zero, so we reject the null and conclude that the two new factors jointly add explanatory power.
What we just learned: compare_f_test does the partial-\(F\) arithmetic for you. The by-hand version is included so you can see what’s inside; in real work, use the built-in.
It is entirely possible — and common — for the joint \(F\)-test to reject while neither individual \(t\)-test rejects. This is the multicollinearity signature, which we develop in the next section. The reverse is rarer but also possible: each individual coefficient is significant, yet they cancel in a way that the joint linear restriction is not rejected. Always run both kinds of test when you have correlated regressors.
The partial \(F\)-test for adding a block
The procedure generalises immediately. To test whether any block of \(q\) regressors adds value to a model already containing \(p - q\) other regressors, fit the restricted model (without the block) and the unrestricted model (with the block), form the \(F\)-statistic, and compare to \(F_{q, n - p - 1}\). The two most common partial \(F\)-tests in factor modelling are
- CAPM → FF3 with \(q = 2\): do SMB and HML jointly add value beyond the market?
- FF3 → FF5 with \(q = 2\): do RMW and CMA jointly add value beyond the FF3 set?
The second of these is the test Fama and French (2015) ran to motivate the move from a three- to a five-factor specification. On broad samples of US stocks, the test rejects at high significance — the profitability and investment factors do carry information beyond size and value.
If the second \(F\)-test rejects, you should retain RMW and CMA. If it fails to reject — which can happen on small samples or for stocks whose returns are well captured by Mkt-RF, SMB, and HML alone — keep the simpler FF3 model.
The headline \(F\)-statistic from a regression summary tells you whether all slopes are jointly zero. That is almost never the question of practical interest. The question of practical interest is whether the new block of regressors you have just added is jointly worth keeping. That is the partial \(F\)-test, and it is the right tool for nested model selection.
Multicollinearity and the variance inflation factor
Where you’ll see this / why it matters: VIF is the single check a referee in a finance journal will ask about if your multi-factor regression has weird coefficients (huge magnitudes, flipping signs across samples). If you cannot answer “what was the VIF on each regressor?”, the referee assumes you have not done the diagnostic and asks for revision.
Multicollinearity — fancy word for “two regressors are basically saying the same thing.” OLS sees them as nearly interchangeable, and the model has trouble splitting the credit between them. The Alice-and-Bob sales-rep analogy developed below makes the problem concrete.
Video (StatQuest): why correlated regressors blow up coefficient standard errors but leave predictions intact — explained with the cleanest geometric picture you’ll see.
What multicollinearity does, and what it does not do
Multicollinearity is the condition where two or more regressors in \(\mathbf X\) are highly linearly correlated with each other. In the extreme — perfect multicollinearity — one regressor is an exact linear combination of the others; \(\mathbf X^\top \mathbf X\) is singular, the inverse does not exist, and OLS fails outright. In the more common case of near multicollinearity, \(\mathbf X^\top \mathbf X\) is invertible but ill-conditioned, and the diagonal entries of \((\mathbf X^\top \mathbf X)^{-1}\) are inflated. The consequence shows up directly in the standard errors of \(\hat{\boldsymbol\beta}\), which scale with \(\sqrt{\mathrm{diag}(\mathbf X^\top \mathbf X)^{-1}}\).
The everyday-life analogy: suppose two sales reps Alice and Bob always visit clients together. Total sales are great, but if you try to split the credit between them — how much did Alice contribute, holding Bob fixed? — you have no leverage. Their inputs to the joint output are nearly indistinguishable. OLS faces exactly this problem when two factors move nearly in lockstep.
Multicollinearity inflates the standard errors of individual coefficients and makes them noisy and unstable across samples. It does not:
- Bias the coefficient estimates themselves. \(\hat{\boldsymbol\beta}\) is still unbiased.
- Reduce \(R^2\) or adjusted \(R^2\). The model fit is unaffected.
- Inflate the standard error of fitted values \(\hat y = \mathbf X \hat{\boldsymbol\beta}\). Predictions remain accurate.
- Invalidate the joint \(F\)-test. The overall significance of the regressor block remains testable.
What it does break is the individual \(t\)-test on each affected coefficient, and the inference based on it (CIs, \(p\)-values). When you have severe collinearity, the joint \(F\)-test and out-of-sample prediction remain trustworthy; individual coefficient interpretation is not.
A numerical demonstration
Suppose we regress NVDA returns on HML and CMA, two FF5 factors that are typically correlated at \(\rho \approx 0.7\) in long samples. Over Sample A (one calendar period) and Sample B (a slightly shifted window) we might see:
| Sample | \(\hat\beta_{\mathrm{HML}}\) | \(\hat\beta_{\mathrm{CMA}}\) | \(\hat\beta_{\mathrm{HML}} + \hat\beta_{\mathrm{CMA}}\) | \(R^2\) |
|---|---|---|---|---|
| A | \(-1.12\) | \(+0.38\) | \(-0.74\) | 0.59 |
| B | \(-0.31\) | \(-0.43\) | \(-0.74\) | 0.59 |
The individual coefficients move wildly between the two samples — by an order of magnitude in HML’s case. But their sum is identical, and the model’s \(R^2\) does not budge. The joint effect of the value+investment exposure is well identified; the split of credit between HML and CMA is not. This is the diagnostic fingerprint of collinearity.
The variance inflation factor
VIF (Variance Inflation Factor) — a one-number diagnostic for each regressor. VIF = 1 → that regressor is independent of the others (perfectly fine). VIF > 5–10 → it overlaps heavily with at least one other regressor; the coefficient’s standard error is inflated. Always print the VIF table when you fit a multi-factor model — it takes one line of code.
The standard quantitative diagnostic is the variance inflation factor (VIF). For each regressor \(X_k\) in a multiple regression, regress \(X_k\) on all the other regressors and compute the resulting \(R_k^2\). Then
\[ \mathrm{VIF}_k \;=\; \frac{1}{1 - R_k^2}. \]
\(R_k^2\) measures the extent to which \(X_k\) can be reconstructed from the rest of \(\mathbf X\). If \(R_k^2 = 0\) — \(X_k\) is uncorrelated with the others — then \(\mathrm{VIF}_k = 1\) and there is no inflation. If \(R_k^2 = 0.9\), then \(\mathrm{VIF}_k = 10\), and the standard error of \(\hat\beta_k\) is \(\sqrt{10} \approx 3.16\) times larger than it would be in the orthogonal-design case.
The conventional thresholds are:
| VIF range | Interpretation |
|---|---|
| \(1 - 2\) | No problem; regressors essentially independent for inference. |
| \(2 - 5\) | Mild collinearity; individual SEs inflated by a moderate factor. |
| \(5 - 10\) | Serious; individual coefficients become noticeably unstable across samples. |
| \(> 10\) | Severe; do not trust individual \(t\)-tests. Aggregate, drop, or regularise. |
Anatomy of a VIF report
Here is what a coloured VIF table looks like — the visual you should mentally produce every time someone shows you a multi-factor regression.
Print this table every time you fit a multi-factor model. If everything is green or yellow you can interpret individual coefficients; orange or red and you must aggregate, drop, or regularise.
Computing VIF for each regressor takes one auxiliary regression per regressor. statsmodels has it built in:
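A sketch, with df and its column labels as in the earlier fits:

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Include the constant so each auxiliary regression has an intercept; skip its own VIF
X5 = sm.add_constant(df[["Mkt-RF", "SMB", "HML", "RMW", "CMA"]])
vif = pd.Series(
    [variance_inflation_factor(X5.values, i) for i in range(1, X5.shape[1])],
    index=X5.columns[1:],
)
print(vif)   # one VIF per factor
```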
On the FF5 daily factor data in our sample window, the VIFs tend to sit in the \(1\)–\(2\) range, with HML and CMA running slightly higher because of their well-documented correlation. None typically exceeds 5. The FF5 factors are deliberately constructed to be reasonably (though not perfectly) orthogonal — that is one of the design choices that make them well-suited as a regression basis.
What we just learned: A two-line snippet returns the VIF for every factor. For FF5 on standard daily data, expect 1–2 across the board. If you ever see double-digit VIFs, stop interpreting individual coefficients.
The small-\(t\) large-\(F\) fingerprint
Here is the textbook fingerprint of severe collinearity, the one that should always trigger a VIF check:
- The overall \(F\)-statistic is large and its \(p\)-value is essentially zero — the model as a whole is clearly significant.
- The overall \(R^2\) is high — say, above 0.5.
- Each individual \(t\)-statistic is small (say \(|t| < 2\)); each individual \(p\)-value is above 5%.
- None of the regressors looks “significant” by itself, yet the model collectively explains a lot.
This pattern is impossible under orthogonal regressors and almost diagnostic of collinearity. The interpretation is that the regressor block jointly carries large explanatory content, but OLS cannot allocate credit cleanly across the individual regressors because they overlap. The joint \(F\)-test is correct and trustworthy; the individual \(t\)-tests are correct in the sense of having the right distribution, but they answer a question — “is this regressor useful holding the others fixed?” — that has become statistically hard to answer because the data identify only the combination, not the components.
Four moves you can make in real work, in rough order of preference:
- Report the joint quantity. If only the sum \(\hat\beta_{\mathrm{HML}} + \hat\beta_{\mathrm{CMA}}\) is stable, report it. For most economic interpretations — “this stock’s value+investment exposure” — the sum is what you want anyway.
- Drop the weaker regressor. Keep the older, more theoretically motivated, or more economically meaningful factor.
- Combine into a composite. Build a single index that averages the collinear group, then regress on the composite.
- Regularise with ridge regression. Add a small \(\lambda \lVert \boldsymbol\beta \rVert^2\) penalty. Ridge produces a slightly biased but much more stable split between collinear regressors, at the cost of introducing a tuning parameter.
What you should not do is leave the unstable individual coefficients in the model and interpret each one as a structural parameter. Their standard errors are honest about the noise, but the temptation to overinterpret the point estimate is strong.
Adjusted \(R^2\)
Where you’ll see this / why it matters: Plain \(R^2\) goes up every time you add a regressor — even a column of pure noise. So if you compare two models by \(R^2\) alone, you will always pick the bigger one. Adjusted \(R^2\) fixes that by docking points for each extra parameter. This is the number you quote when comparing nested models.
Plain \(R^2\) is greedy: it never goes down when you add a regressor, so it secretly rewards model bloat. Adjusted \(R^2\) adds a small penalty for each new regressor. If the new regressor earns its keep (improves fit more than the penalty costs), adjusted \(R^2\) rises. If it does not, adjusted \(R^2\) falls. That makes it a fair referee between models of different size.
Why \(R^2\) is a biased model-comparison tool
The fundamental problem with \(R^2\) as a tool for comparing models is that it is monotone in the number of regressors. If you add any new column to \(\mathbf X\) — even a column of pure noise unrelated to \(Y\) — OLS will find some coefficient that, by definition, drives the new residual sum of squares no higher than the old one (the worst case is the new coefficient is set to zero, recovering the previous fit). So \(R^2\) only goes up as you add regressors. Comparing a 3-factor model to a 5-factor model purely by \(R^2\) is therefore biased in favour of the larger model, even when the additional regressors are noise.
The fix is to penalise the number of parameters. Adjusted \(R^2\) does this in the simplest possible way: scale up the residual variance estimate, and the goodness-of-fit measure, to account for the degrees of freedom used. The definition is
\[ \bar R^2 \;=\; 1 \;-\; (1 - R^2) \cdot \frac{n - 1}{n - p - 1}, \]
where \(n\) is the sample size and \(p\) is the number of regressors (excluding the intercept). Note the symmetry with the standard error of regression \(\hat\sigma = \sqrt{\mathrm{SSR}/(n - p - 1)}\): the denominator \(n - p - 1\) is the residual degrees of freedom, the number of observations minus the number of parameters estimated.
Several properties of \(\bar R^2\) are worth knowing:
- \(\bar R^2 \le R^2\) always.
- \(\bar R^2\) can decrease when a regressor is added that explains less than its degree-of-freedom cost.
- \(\bar R^2\) can be negative — this happens when the regression fits worse, after the degrees-of-freedom adjustment, than the simple mean.
- As \(n \to \infty\) with \(p\) fixed, \(\bar R^2 \to R^2\). The penalty is most severe in small samples.
When \(\bar R^2\) goes up vs down
A regressor that adds genuine explanatory power raises \(\bar R^2\). A regressor that adds no useful information lowers it, because the small rise in \(R^2\) fails to offset the heavier degrees-of-freedom penalty \((n-1)/(n-p-1)\). The point at which \(\bar R^2\) neither rises nor falls is, roughly, where the new regressor’s \(t\)-statistic equals 1 in absolute value. So a rule of thumb: a regressor with \(|t| > 1\) tends to raise \(\bar R^2\); one with \(|t| < 1\) tends to lower it. This is a much weaker bar than statistical significance (which sits at \(|t| \approx 2\)).
The implication: \(\bar R^2\) is necessary for honest model comparison, but it is not sufficient by itself for inferring that the extra regressors are doing real work. A regressor can raise \(\bar R^2\) slightly while still failing both individual \(t\)- and joint \(F\)-tests. Use the partial \(F\)-test for the formal significance decision; use \(\bar R^2\) for a quick gut-check.
Worked numerical example
Suppose on a sample of \(n = 252\) trading days we have:
| Model | \(p\) | \(R^2\) | \(\bar R^2\) |
|---|---|---|---|
| CAPM | 1 | 0.545 | \(1 - 0.455 \cdot \dfrac{251}{250} = 0.543\) |
| FF3 | 3 | 0.583 | \(1 - 0.417 \cdot \dfrac{251}{248} = 0.578\) |
| FF5 | 5 | 0.596 | \(1 - 0.404 \cdot \dfrac{251}{246} = 0.588\) |
The penalty per regressor is small in absolute terms — about 0.002 each — because \(n = 252\) is much larger than \(p\). Adjusted \(R^2\) rises at both steps, CAPM to FF3 and FF3 to FF5, so both transitions look genuine on this criterion. If the move from FF3 to FF5 had lowered \(\bar R^2\), we would have flagged RMW and CMA as a poor investment of degrees of freedom and reverted to FF3.
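In code the check is one line; a sketch reusing the ff3 fit from earlier:

```python
# Adjusted R^2 by hand, checked against statsmodels
n, p = int(ff3.nobs), int(ff3.df_model)               # df_model = number of slope regressors
adj = 1 - (1 - ff3.rsquared) * (n - 1) / (n - p - 1)
print(adj, ff3.rsquared_adj)                          # the two agree
```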
What we just learned: statsmodels exposes both rsquared and rsquared_adj as attributes. When comparing nested models, look at the adjusted one.
In the model-selection literature you will also see the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC), both reported in the statsmodels summary. They penalise model complexity more aggressively than \(\bar R^2\) — BIC most aggressively of all. For nested-model factor selection over a moderate sample size, all three criteria typically point the same way; on borderline cases, prefer the partial \(F\)-test and economic reasoning over a single-number criterion. AIC and BIC matter most when you are choosing among non-nested alternatives, or selecting from a long candidate factor list.
The Fama–French 3- and 5-factor models
Where you’ll see this / why it matters: The FF3 and FF5 models are the standard benchmark in academic finance. Any paper claiming “alpha” must show that the alpha survives an FF5 regression. Any hedge fund pitch that does not address factor exposures will be dismissed by a sophisticated allocator within five minutes.
Construction of the factors
Each Fama–French factor is itself the return on a long-short portfolio. The construction follows a uniform recipe: sort the cross-section of stocks on the factor’s underlying characteristic, form value-weighted portfolios from the extremes, and compute the long-minus-short return. Concretely:
| Factor | Long leg | Short leg | What it captures |
|---|---|---|---|
| Mkt-RF | Value-weighted total US equity market | One-month T-bill (\(R_f\)) | Broad equity premium |
| SMB | Small market-cap portfolio | Large market-cap portfolio | Size premium |
| HML | High book-to-market (value) | Low book-to-market (growth) | Value premium |
| RMW | Robust operating profitability | Weak operating profitability | Quality / profitability premium |
| CMA | Conservative asset growth (low capex) | Aggressive asset growth (high capex) | Investment style premium |
So SMB is itself a return: the difference between what a small-cap basket earned today and what a large-cap basket earned today. When SMB is positive, small-caps outperformed; when it is negative, large-caps did. Each of the five factors is constructed using detailed methodological rules (which Ken French publishes openly on his Dartmouth website), and the daily factor series are freely available from his data library.
The FF3 and FF5 regressions
The Fama–French 3-factor model (Fama and French, 1993) extends CAPM with SMB and HML:
\[ r_i - r_f \;=\; \alpha_i \;+\; \beta_M (r_M - r_f) \;+\; s \cdot \mathrm{SMB} \;+\; h \cdot \mathrm{HML} \;+\; \varepsilon_i. \]
The 5-factor extension (Fama and French, 2015) adds RMW and CMA:
\[ r_i - r_f \;=\; \alpha_i \;+\; \beta_M (r_M - r_f) \;+\; s \cdot \mathrm{SMB} \;+\; h \cdot \mathrm{HML} \;+\; r \cdot \mathrm{RMW} \;+\; c \cdot \mathrm{CMA} \;+\; \varepsilon_i. \]
Each Greek-letter coefficient is the stock’s loading on the corresponding factor — its partial sensitivity, in the same units as the factor itself (return per unit factor return). The vector of loadings is the stock’s factor profile.
Interpreting the loadings
The signs of the loadings carry direct economic meaning:
- \(\beta_M > 1\) (high-beta): the stock amplifies market moves. Common for growth/tech, banks, leveraged firms.
- \(\beta_M < 1\) (low-beta): the stock dampens market moves. Common for utilities, consumer staples, mature dividend payers.
- \(s > 0\): the stock behaves like a small-cap — it tends to outperform when small-caps rally. A genuine small company will have \(s > 0\); some mid-caps or microcap-flavoured ETFs do too.
- \(s < 0\): the stock behaves like a large-cap or mega-cap. NVDA, AAPL, MSFT, and the index-dominating megacaps typically have \(s\) in the range \(-0.3\) to \(-0.6\).
- \(h > 0\): value tilt — the stock co-moves with high book-to-market portfolios. Banks, utilities, traditional industrials.
- \(h < 0\): growth tilt — the stock co-moves with low book-to-market growth portfolios. Almost all big tech.
- \(r > 0\): quality / profitability tilt — co-moves with high-margin firms.
- \(c < 0\): aggressive-investor tilt — co-moves with high-investment firms (heavy capex, M&A, R&D).
For NVDA in the FF5 specification you should expect something like \(\beta_M \approx 1.8\), \(s \approx -0.5\), \(h \approx -0.7\), \(r \approx 0.2\), \(c \approx -0.3\). The picture this paints — high beta, large-cap, strong growth, mildly high-quality, aggressively investing — matches the qualitative description an equity analyst would write. The factor model has read the same story off pure return data.
“\(s = -0.48\)” means: on a day when small caps beat large caps by 1% (SMB = +1%), NVDA’s return tends to be 0.48% lower than it otherwise would have been, holding the other factors fixed. It is not a forecast of NVDA’s return — it is a sensitivity. The loading lets you decompose NVDA’s daily return into the part driven by each factor and the residual.
Why the 5-factor extension matters
The Fama and French (1993) three-factor model dominated empirical asset pricing for two decades. It “worked” in the sense that the cross-sectional dispersion of average returns on size-and-value sorted portfolios was largely captured by exposures to (Mkt-RF, SMB, HML), and most of CAPM’s notorious \(\alpha\) anomalies — small-cap premium, value premium — were absorbed into factor loadings.
By the late 2000s, however, two additional patterns had been thoroughly documented and resisted explanation by FF3. The profitability anomaly (Novy-Marx 2013): firms with high gross profitability earn higher returns than low-profitability firms with the same book-to-market and size. The investment anomaly (Cooper, Gulen, and Schill 2008; Titman, Wei, and Xie 2004): firms that invest heavily — high asset growth, high capex — earn lower subsequent returns than conservative-investment firms. Fama and French (2015) added RMW and CMA explicitly to capture these two. Their paper documented that adding both improved cross-sectional fit, especially for portfolios sorted on profitability and investment, which FF3 priced poorly.
A side effect of the new factors: in the presence of RMW and CMA, the HML factor sometimes becomes statistically redundant. Value stocks (high B/M) tend to be unprofitable and to invest conservatively, both of which are absorbed once RMW and CMA enter the regression. Some applied work therefore drops HML from FF5 and runs a four-factor model (Mkt-RF, SMB, RMW, CMA). In real work, the case for retaining HML rests less on its average alpha — which has been close to zero since 2010 — and more on its role as a hedgeable risk dimension. Growth-versus-value rotations are real and large, and you cannot hedge what you have not included.
Newey–West HAC standard errors
Where you’ll see this / why it matters: Every factor-model regression in a published finance paper after about 1990 uses HAC (or similar) standard errors. The classical OLS standard errors quietly underestimate uncertainty when residuals are serially correlated, which they essentially always are in monthly or daily return data. Use HAC by default.
HAC standard errors — robust SEs that survive when residuals are serially correlated (monthly stock returns often are). The coefficient estimates do not change at all; only the standard errors get corrected upward, which makes \(t\)-statistics smaller and confidence intervals wider. The point is honesty: classical SEs lie about how precise your estimates are when the data has time-series structure.
Why classical standard errors fail in finance
The classical OLS standard errors are derived under the assumption that the residuals \(\varepsilon_t\) are homoskedastic (constant variance over time) and independently distributed (no autocorrelation). In daily financial return data both assumptions fail more often than they hold.
Heteroskedasticity is endemic. Volatility clusters: a high-vol day is more likely to be followed by another high-vol day than the unconditional distribution would predict. This is the GARCH effect, which Engle won the 2003 Nobel Prize for documenting. Residual variance is plainly not constant.
Autocorrelation is also present, though often subtler. Even when daily returns themselves are nearly uncorrelated, residuals from a factor regression can pick up serial structure: momentum, post-earnings drift, microstructure effects (bid-ask bounce), or simply the fact that the factor model is misspecified and leaves a slow-moving component in the error. Squared residuals are essentially always autocorrelated (this is GARCH again).
When the residuals are heteroskedastic or autocorrelated, \(\hat{\boldsymbol\beta}\) is still unbiased, but the standard error formula \(\sigma^2 (\mathbf X^\top \mathbf X)^{-1}\) is wrong, often by enough to flip the conclusion of a \(t\)-test. Classical SEs typically understate the true sampling variability when residuals are positively autocorrelated, so \(t\)-statistics are too large and \(p\)-values too small. In honest reporting, you replace them with robust standard errors.
The Newey–West HAC estimator
The most widely used robust standard error in time-series regression is the Newey–West HAC estimator — “HAC” stands for heteroskedasticity- and autocorrelation-consistent. The idea, due to Whitney Newey and Kenneth West (1987), is to estimate the variance of \(\hat{\boldsymbol\beta}\) using
\[ \widehat{\mathrm{Var}}_{\mathrm{NW}}(\widehat{\boldsymbol\beta}) \;=\; (\mathbf X^\top \mathbf X)^{-1} \, \widehat{\mathbf S} \, (\mathbf X^\top \mathbf X)^{-1}, \]
where \(\widehat{\mathbf S}\) is a weighted sum of \(\mathbf X_t \hat\varepsilon_t \hat\varepsilon_{t-j} \mathbf X_{t-j}^\top\) across lags \(j = 0, 1, \ldots, L\) with Bartlett-kernel weights \(w_j = 1 - j/(L+1)\). The choice of lag length \(L\) is the one tuning parameter. A common rule of thumb is \(L = \lfloor 4 (n/100)^{2/9} \rfloor\), which for daily data over a few years gives \(L\) in the range 4–6. In practice, many empirical papers use \(L = 5\) for daily data.
In statsmodels the Newey–West HAC SEs are a single argument to .fit():
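A sketch, reusing y and X from the FF3 fit (the same arguments work for any of the factor models):

```python
# Same regression, two covariance estimators: coefficients identical, SEs differ
fit_classical = sm.OLS(y, X).fit()
fit_hac = sm.OLS(y, X).fit(cov_type="HAC", cov_kwds={"maxlags": 5})

print(fit_classical.bse)   # classical standard errors
print(fit_hac.bse)         # HAC standard errors, typically 10–30% larger on daily data
```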
The point estimates \(\hat{\boldsymbol\beta}\) are identical in the two fits — switching the standard error type does not change OLS itself. What changes is the standard error column, and therefore every downstream \(t\)-statistic, \(p\)-value, and confidence interval.
What we just learned: Adding cov_type="HAC", cov_kwds={"maxlags": 5} to .fit() is the entire syntactic cost of robust inference. Default it on for any time-series factor regression.
When HAC matters in practice
For factor models on daily data the typical pattern is that HAC SEs are 10–30% larger than classical SEs. The result is often a downgrade of one or two coefficients from “significant at 1%” to “significant at 5%”, or from “significant at 5%” to “marginally significant.” Occasionally a coefficient flips from being above the threshold to below it. The cases where HAC matters most are:
- Long-horizon regressions (monthly or quarterly returns), where residual autocorrelation is larger.
- Stocks with strong momentum or earnings-drift exposure.
- Tests on small samples (\(n < 500\)) where every standard-error correction has noticeable bite.
The cost of using HAC is essentially zero — one extra argument to .fit() — so the default in factor-model regressions should always be HAC. Classical SEs were pedagogically useful in the previous chapter (one regressor, smaller and cleaner data) but are not the right tool for a publication-grade factor regression you would submit to a journal or to a risk committee.
When you publish or share a regression result, always state which standard error type you used. Reporting “\(t = 2.4\)” without specifying whether the SE is classical, HC0/HC3 white-heteroskedastic, or HAC is ambiguous — the underlying SE estimate can differ by 30% across the choices, and the \(t\) along with it.
Attribution and market-neutral hedging
Where you’ll see this / why it matters: This is the part where the regression pays for itself. Once you have the loadings, you can break apart any day’s return into “what each factor did” plus a residual (attribution), and you can hedge out factors you do not want to bear (market-neutral). Every long-short equity hedge fund does exactly this, every day.
Market-neutral hedging — you own a stock with market beta \(\hat\beta_p\). To kill the market exposure, short \(\hat\beta_p\) dollars of SPY for every dollar of stock you own. Now if the market drops 2%, your long loses \(\hat\beta_p \cdot 2\%\) from the market move and your short gains exactly \(\hat\beta_p \cdot 2\%\) — they cancel. What is left over after the cancellation is alpha plus other factor exposures plus idiosyncratic noise. That residual is what you are paid to take, or what you wanted to keep.
Decomposing a portfolio’s return
Once we have the factor loadings \((\hat\alpha, \hat\beta_M, \hat s, \hat h, \hat r, \hat c)\) for a portfolio (or a stock), the regression equation
\[ r_p - r_f \;=\; \hat\alpha \;+\; \hat\beta_M (r_M - r_f) \;+\; \hat s \cdot \mathrm{SMB} \;+\; \hat h \cdot \mathrm{HML} \;+\; \hat r \cdot \mathrm{RMW} \;+\; \hat c \cdot \mathrm{CMA} \;+\; \hat\varepsilon_t \]
becomes a return attribution. On any given day, the realised excess return decomposes additively into a contribution from each factor and a residual:
| Component | Daily contribution to \(r_{p,t} - r_{f,t}\) |
|---|---|
| Alpha | \(\hat\alpha\) |
| Market | \(\hat\beta_M \cdot (r_{M,t} - r_{f,t})\) |
| Size | \(\hat s \cdot \mathrm{SMB}_t\) |
| Value | \(\hat h \cdot \mathrm{HML}_t\) |
| Profitability | \(\hat r \cdot \mathrm{RMW}_t\) |
| Investment | \(\hat c \cdot \mathrm{CMA}_t\) |
| Idiosyncratic | \(\hat\varepsilon_t\) |
Summed over time, the alpha component aggregates into the “skill” portion of the strategy’s performance and the factor components aggregate into “factor return” — the part you would have earned by holding the right combination of passive factor portfolios. A long-short equity fund whose entire \(\sum_t (r_{p,t} - r_{f,t})\) is explained by factor contributions has no statistically meaningful alpha; it is replicable with cheap factor ETFs and does not deserve a 2-and-20 fee. A fund whose alpha contribution is positive, statistically significant after HAC correction, and persistent out of sample is doing something the factor model cannot replicate. By the empirical record, that is rare.
The market-neutral hedge
Of all the hedges the factor model enables, the simplest and most widely used is the market-neutral hedge: short the market index in proportion to the portfolio’s market beta. If you hold \(V\) dollars of a portfolio with market beta \(\hat\beta_p\), shorting \(\hat\beta_p \cdot V\) dollars of SPY (or any proxy for the market portfolio) creates a hedged portfolio whose market exposure is, by construction, zero:
\[ \underbrace{r_{p,t} - r_{f,t}}_{\text{long}} \;-\; \hat\beta_p \cdot \underbrace{(r_{M,t} - r_{f,t})}_{\text{short}} \;=\; \hat\alpha + \hat s \cdot \mathrm{SMB}_t + \hat h \cdot \mathrm{HML}_t + \hat r \cdot \mathrm{RMW}_t + \hat c \cdot \mathrm{CMA}_t + \hat\varepsilon_t. \]
The market term has cancelled. The hedged portfolio retains exposure to the other four factors (size, value, profitability, investment) plus idiosyncratic risk, but is insulated against broad market drawdowns. If \(\hat\beta_p = 1.85\) for a $10M NVDA long, the hedge is short $18.5M of SPY. Note that the hedge size exceeds the portfolio size when \(\hat\beta_p > 1\) — this is the leverage embedded in any aggressive stock.
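To see the cancellation numerically, here is a minimal self-contained sketch (synthetic returns, illustrative parameters) showing that subtracting \(\hat\beta_p\) times the market return drives the fitted beta of the hedged series to zero:

```python
import numpy as np
import statsmodels.api as sm

# Synthetic daily excess returns; the parameters are illustrative, not estimates.
rng = np.random.default_rng(42)
mkt = rng.normal(0.0005, 0.010, 750)                     # market excess return
port = 0.0002 + 1.85 * mkt + rng.normal(0, 0.015, 750)   # high-beta long position

beta_p = sm.OLS(port, sm.add_constant(mkt)).fit().params[1]
hedged = port - beta_p * mkt      # long minus beta_p dollars of SPY short per dollar long
beta_hedged = sm.OLS(hedged, sm.add_constant(mkt)).fit().params[1]
print(f"beta before: {beta_p:.3f}, after hedge: {beta_hedged:.2e}")  # ~0 by construction
```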
To go beyond market-neutral and reach factor-neutral, you short the appropriate amounts of factor-mimicking portfolios for each non-trivial loading. With ETF proxies (IWM for SMB, IWD/IWF for HML, etc.) the construction is straightforward, though imperfect: ETFs do not exactly replicate Ken French’s factor portfolios, so a small residual factor exposure remains. Hedge funds running factor-neutral mandates accept this slippage as a cost of doing business.
The hedge does not eliminate all risk — it eliminates market risk. The hedged portfolio still has its idiosyncratic component \(\hat\varepsilon_t\), plus exposure to whatever factors are not yet hedged. The strategic question is which risks you want to keep (because you are paid to bear them) and which you want to shed (because you are not). A stock-picker who claims a positive \(\hat\alpha\) wants to shed market risk and keep idiosyncratic exposure. A factor investor who believes in the value premium wants to shed market risk and keep HML exposure. The factor model is the bookkeeping engine for both kinds of strategy.
A small numerical attribution
To make the attribution concrete, take a single trading day and decompose NVDA’s realised excess return. Suppose the day’s factor returns and NVDA’s FF5 loadings are:
| Quantity | Value |
|---|---|
| Mkt-RF (daily) | \(+1.20\%\) |
| SMB | \(+0.30\%\) |
| HML | \(-0.40\%\) |
| RMW | \(-0.10\%\) |
| CMA | \(-0.20\%\) |
| NVDA loading \(\hat\beta_M\) | \(+1.85\) |
| NVDA loading \(\hat s\) | \(-0.48\) |
| NVDA loading \(\hat h\) | \(-0.71\) |
| NVDA loading \(\hat r\) | \(+0.21\) |
| NVDA loading \(\hat c\) | \(-0.31\) |
| NVDA realised excess return | \(+2.50\%\) |
The factor contributions multiply each loading by the day’s factor return:
\[ \begin{aligned} \text{Market:}\quad & 1.85 \times 1.20\% = +2.22\% \\ \text{Size:}\quad & -0.48 \times 0.30\% = -0.14\% \\ \text{Value:}\quad & -0.71 \times (-0.40\%) = +0.28\% \\ \text{Profitability:}\quad & 0.21 \times (-0.10\%) = -0.02\% \\ \text{Investment:}\quad & -0.31 \times (-0.20\%) = +0.06\% \\[4pt] \text{Factor total:}\quad & = +2.40\% \end{aligned} \]
The factors collectively explain \(+2.40\) percentage points of NVDA’s \(+2.50\%\) excess return on the day. The remaining \(+0.10\%\) is \(\hat\alpha\) plus the day’s residual \(\hat\varepsilon_t\); since a daily alpha is typically a fraction of a basis point, the remainder is essentially idiosyncratic, day-specific noise unrelated to the systematic factors. Across a long time series the residuals average to roughly zero (by OLS construction, exactly zero in-sample if an intercept is included) and the alpha term carries any persistent skill.
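The same arithmetic in a few lines of numpy, simply restating the table above (a sketch, not book-shipped code):

```python
import numpy as np

# Loadings and the day's factor returns from the table above, in percent.
loadings = np.array([1.85, -0.48, -0.71, 0.21, -0.31])     # beta_M, s, h, r, c
factor_rets = np.array([1.20, 0.30, -0.40, -0.10, -0.20])  # Mkt-RF, SMB, HML, RMW, CMA

contrib = loadings * factor_rets     # per-factor contribution: loading x factor return
factor_total = contrib.sum()         # +2.40%
residual = 2.50 - factor_total       # +0.10%, the day's unexplained remainder

for name, c in zip(["Market", "Size", "Value", "Profitability", "Investment"], contrib):
    print(f"{name:14s} {c:+.2f}%")
print(f"{'Factor total':14s} {factor_total:+.2f}%")
print(f"{'Residual':14s} {residual:+.2f}%")
```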
What we just learned: A one-day attribution is just loading × factor return for each factor, summed. The residual is whatever the factors do not explain. This same arithmetic, summed over a year, gives you the performance-attribution report your boss asks for.
The same attribution, aggregated over the sample window, produces the conventional performance attribution table that institutional risk reports use. It is the bookkeeping output of the factor model.
Worked example: FF5 regression of a sample portfolio
Where you’ll see this / why it matters: This is the full pipeline you would run in real work — every step a quant analyst or risk officer does after building a portfolio. Build → estimate → diagnose → attribute → hedge. After this example you should be able to repeat the workflow on any stock or portfolio.
We close the chapter with a complete worked example that exercises every concept in the chapter. The setup: construct a tilted long-only portfolio out of three large-cap names, fit the FF5 model with Newey–West HAC standard errors, read the loadings, run the partial \(F\)-tests against the CAPM and FF3 nested models, check VIFs for collinearity, attribute one day’s return, and design a market-neutral hedge.
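The code below sketches that pipeline end to end. The file name, tickers, weights, and column labels are illustrative placeholders rather than the book’s shipped dataset, but each step maps one-to-one onto the list above:

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# 1. Load daily data: three stock return columns, the five factors, and RF.
#    (File name and tickers are placeholders.)
data = pd.read_csv("ff5_daily.csv", index_col=0, parse_dates=True)

# 2. Build a tilted long-only portfolio and its excess return.
weights = {"AAPL": 0.40, "MSFT": 0.35, "XOM": 0.25}
port_excess = sum(w * data[t] for t, w in weights.items()) - data["RF"]

# 3. Factor blocks for the three nested models.
blocks = {"CAPM": ["Mkt-RF"],
          "FF3":  ["Mkt-RF", "SMB", "HML"],
          "FF5":  ["Mkt-RF", "SMB", "HML", "RMW", "CMA"]}

# 4. Fit each model twice: HAC for inference, classical for the SSR-based F-tests.
hac = {m: sm.OLS(port_excess, sm.add_constant(data[c]))
             .fit(cov_type="HAC", cov_kwds={"maxlags": 5})
       for m, c in blocks.items()}
cls = {m: sm.OLS(port_excess, sm.add_constant(data[c])).fit()
       for m, c in blocks.items()}

# 5. Coefficient table across the nested models.
print(pd.DataFrame({m: f.params for m, f in hac.items()}).round(4))

# 6. Partial F-tests (statsmodels warns if run on robust fits, hence `cls`).
print(cls["FF3"].compare_f_test(cls["CAPM"]))   # returns (F, p-value, df_diff)
print(cls["FF5"].compare_f_test(cls["FF3"]))

# 7. VIFs for the FF5 factor block (skip the constant at column 0).
X5 = sm.add_constant(data[blocks["FF5"]])
for i, col in enumerate(X5.columns[1:], start=1):
    print(col, round(variance_inflation_factor(X5.values, i), 2))

# 8. Goodness-of-fit comparison.
for m in blocks:
    print(f"{m}: R2 = {hac[m].rsquared:.4f}, adj. R2 = {hac[m].rsquared_adj:.4f}")
```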
Read this output in roughly the order it prints. The coefficient table shows how each loading evolves as the model is extended; the market beta should be relatively stable across CAPM, FF3, and FF5 (because the market factor is mostly orthogonal to the others), while small shifts in the other loadings reflect omitted-variable adjustments. The partial \(F\)-tests tell you whether each extension is statistically justified — on most return data, both come out highly significant. The VIFs confirm that the FF5 factor block is reasonably well-conditioned, with no entry above the warning thresholds. The goodness-of-fit comparison closes the loop: \(R^2\) rises monotonically, but adjusted \(R^2\) rises only when each new block is worth its degrees of freedom, which is the honest model-comparison criterion.
What we just learned: A real factor-model analysis is exactly eight steps: load → portfolio → factors → fit three nested models with HAC → coefficient table → partial \(F\)-tests → VIFs → goodness-of-fit comparison. Each step is a few lines of statsmodels.
The final step — the one a portfolio manager actually acts on — is the market-neutral hedge. Given the portfolio’s fitted \(\hat\beta_M\), the hedge is short \(\hat\beta_M \cdot V\) dollars of SPY for every \(V\) dollars of long exposure. The full factor-neutral hedge layers four more shorts on top of that, one for each non-trivial loading on (SMB, HML, RMW, CMA), using factor-mimicking ETF proxies or constructing the long-short legs directly out of the cross-section.
A long-short equity hedge fund typically targets multiple factor neutralities at once: market-neutral, sector-neutral, size-neutral, sometimes style-neutral. The mechanics generalise straightforwardly from this chapter: regress the portfolio on the relevant factor basis, read off the loadings, and short each factor by the loading amount. What the chapter does not cover — and what any production-grade hedging system must — is the time variation of the loadings (rolling-window estimation, dynamic factor models), transaction costs, and the fact that hedge ETFs are imperfect factor proxies. These are the production challenges Module 4 of the course addresses.
Summary
We extended single-regressor inference to multiple regression by writing the model as \(\mathbf y = \mathbf X \boldsymbol\beta + \boldsymbol\varepsilon\) with a \((p+1)\)-column design matrix and the OLS estimator \(\widehat{\boldsymbol\beta} = (\mathbf X^\top \mathbf X)^{-1} \mathbf X^\top \mathbf y\). The interpretive shift from simple to multiple regression is the shift from total to partial effect: each coefficient is the slope of \(Y\) on \(X_k\) holding the other regressors fixed.
We introduced three new pieces of statistical machinery. The \(F\)-test for joint significance lets us test whether a block of \(q\) regressors is collectively useful — partial \(F\)-tests applied to nested factor models (CAPM → FF3, FF3 → FF5) are the standard way to decide whether the extra factors are worth keeping. The variance inflation factor diagnoses multicollinearity, \(\mathrm{VIF}_k = 1/(1 - R_k^2)\), where \(R_k^2\) is the \(R^2\) of regressing \(X_k\) on the other regressors. Multicollinearity inflates individual SEs without biasing the coefficients themselves or distorting overall fit; the joint \(F\)-test, \(R^2\), and predictions remain trustworthy, but individual confidence intervals do not. Adjusted \(R^2 = 1 - (1 - R^2)(n-1)/(n-p-1)\) replaces \(R^2\) for model comparison; unlike \(R^2\) it is not monotone in \(p\) and can fall when a regressor’s degree-of-freedom cost exceeds its fit contribution.
We applied the machinery to the Fama–French three- and five-factor models. Each factor is itself a long-short portfolio return. The loadings \((\beta_M, s, h, r, c)\) are the stock’s partial sensitivities; their signs identify the stock’s style profile (large-cap growth vs small-cap value, etc.). Newey–West HAC standard errors give correct inference under heteroskedasticity and serial correlation, both of which are routine in daily financial data; the cost of switching from classical to HAC is one keyword argument, and HAC should be the default in factor-model regressions. Once the loadings are fitted, the same regression equation becomes a return attribution and a hedging recipe: a market-neutral position shorts \(\hat\beta_p\) units of the market index per unit of long exposure, and a factor-neutral book extends this to each non-trivial factor loading.
Exercises
Load the NVDA + FF5 dataset shipped with this book and fit the FF3 model
\[ r_{\mathrm{NVDA},t} - r_{f,t} \;=\; \alpha + \beta_M (r_{M,t} - r_{f,t}) + s\, \mathrm{SMB}_t + h\, \mathrm{HML}_t + \varepsilon_t \]
with classical OLS standard errors.
- Report the four coefficients and their \(t\)-statistics. Which ones reject \(H_0: \beta_k = 0\) at the 5% level?
- Compute the 95% confidence interval for \(\hat\beta_M\) by hand using \(\hat\beta_M \pm 1.96 \cdot SE(\hat\beta_M)\) and verify it matches statsmodels’s reported CI.
- Interpret the sign and magnitude of each coefficient in plain English (one sentence each).
- What is the residual standard error \(\hat\sigma = \sqrt{\mathrm{SSR}/(n-p-1)}\), expressed in percent per day? Compare it to a typical NVDA daily standard deviation in the same sample.
Using the same NVDA data:
- Compute the partial \(F\)-statistic by hand for the CAPM → FF3 nested-model test. Report \(\mathrm{SSR}_R\), \(\mathrm{SSR}_U\), \(q\), \(n - p - 1\), \(F\), and the associated \(p\)-value (use scipy.stats.f.sf(F, q, n-p-1)).
- Verify your answer using ff3.compare_f_test(capm).
- Now repeat for the FF3 → FF5 test. If \(F\) rejects the null but neither the individual \(t\)-statistic on RMW nor the one on CMA does, explain in two sentences why this can happen.
- Conceptual question: Suppose \(\mathrm{SSR}_R = \mathrm{SSR}_U\) exactly (the new regressors contribute zero in-sample fit). What value of \(F\) do you get? Should that be considered evidence against the new regressors? Explain.
Construct a synthetic dataset with two regressors that are intentionally collinear:
```python
import numpy as np

np.random.seed(0)
n = 500
x1 = np.random.normal(0, 1, n)
x2 = 0.95 * x1 + 0.05 * np.random.normal(0, 1, n)  # near-collinear: corr(x1, x2) ≈ 0.999
beta_true = np.array([1.0, 2.0, -1.0])  # intercept, x1, x2
y = beta_true[0] + beta_true[1] * x1 + beta_true[2] * x2 + 0.3 * np.random.normal(0, 1, n)
```

- Fit the regression of \(y\) on \((x_1, x_2)\). Report \(\hat\beta_1\), \(\hat\beta_2\), and their standard errors.
- Compute \(\mathrm{VIF}_1\) and \(\mathrm{VIF}_2\) by hand (run the auxiliary regression of \(x_1\) on \(x_2\)).
- Now refit on a slightly different random sample (re-seed and regenerate). Compare \(\hat\beta_1, \hat\beta_2\) across the two samples. Are they stable? Is \(\hat\beta_1 + \hat\beta_2\) stable?
- What is \(R^2\) in each sample? Did the multicollinearity hurt the model’s predictive fit?
Take the NVDA FF5 regression. Now add a column of pure noise to the design matrix:
data["junk"] = np.random.normal(0, 1, len(data))- Refit the model with the six regressors {Mkt-RF, SMB, HML, RMW, CMA, junk}. Does \(R^2\) rise or fall? Does adjusted \(R^2\) rise or fall?
- Report the \(t\)-statistic on the junk regressor. What is its \(p\)-value? (Repeat the exercise with a new random junk column 10 times; what fraction of the time does the \(p\)-value come in below 5%? Compare to the theoretical Type I error rate.)
- Now build a useful extra regressor — for example, the lagged NVDA excess return. Add it to the regression. What happens to adjusted \(R^2\)? Does the lagged regressor have a statistically significant coefficient?
- Discussion: Adjusted \(R^2\) rises when an added regressor's \(|t| > 1\) (this is exact when regressors are added one at a time). The conventional significance threshold is \(|t| > 2\). What does this mean for the relationship between “raises adjusted \(R^2\)” and “is statistically significant”?
Fit FF5 on NVDA twice: once with classical SEs and once with Newey–West HAC (5 lags).
- Tabulate the five factor-loading standard errors side by side. Compute the ratio HAC / classical for each. Which factor’s SE changes the most?
- Does any coefficient flip from significant at 5% under classical SEs to insignificant under HAC, or vice versa?
- Diagnostic: Run the Breusch–Godfrey LM test for residual autocorrelation (statsmodels.stats.diagnostic.acorr_breusch_godfrey) on the classical-SE fit. Does it reject the null of no autocorrelation? If so, that is the statistical reason HAC is necessary.
Construct a $1,000,000 long position in NVDA. Using your fitted FF5 loadings with HAC standard errors:
- Compute the dollar hedge sizes for each of the five factor exposures. For Mkt-RF, this is just \(\hat\beta_M \cdot \$1{,}000{,}000\) of SPY short. For SMB, the leg is \(|\hat s| \cdot \$1{,}000{,}000\) of a small-minus-big proxy (e.g., IWM minus IWB notional), shorted if \(\hat s > 0\) and bought if \(\hat s < 0\). Continue for HML, RMW, CMA.
- Report the gross notional exposure (sum of absolute dollar legs) of the fully factor-neutral book. Why is this larger than $1M?
- Attribution: Pick the worst single-day return for NVDA in your sample. Decompose that day’s loss into factor contributions and residual. How much of the loss would the market-neutral hedge have offset? How much would the full factor-neutral hedge have offset (assuming perfect ETF replication of each factor)?
- Discussion: The factor-neutral hedge eliminates all factor exposure but leaves \(\hat\varepsilon_t\), the idiosyncratic component. For NVDA, this component over 2023–2024 was driven heavily by AI-narrative news. Is hedging into a pure idiosyncratic-exposure portfolio always desirable? In one paragraph, argue for or against running NVDA factor-neutral as a single-name exposure.