Chapter 3: Simple Linear Regression and CAPM
Linear regression is the most-used model in business analytics. In this chapter we move from the correlation you met in the previous chapter to a directional model that lets us predict, quantify, and test relationships.
A quick map of the new words you’ll meet (each will be re-explained in plain English where it first appears):
- Regression — finding the line (or surface) that best fits a cloud of points; here we use it to ask “how much of NVDA’s daily wiggle is just the market wiggling?”
- CAPM (Capital Asset Pricing Model) — a textbook recipe saying a stock’s excess return is mostly its market exposure (\(\beta\)) times the market’s excess return, plus a small skill bit (\(\alpha\)) plus noise. We’re not endorsing it as truth; we’re using it to learn regression on a real example.
- Excess return — the part of a return that’s above the risk-free rate. If a stock returned 8% and you could have gotten 2% safely in T-bills, the excess return is 6%.
We will fit a line in statsmodels, read every number in the OLS().fit().summary() table the way a desk analyst reads a Bloomberg screen, decompose variance into explained and unexplained components, and apply the whole machine to the canonical risk model in finance — the Capital Asset Pricing Model (CAPM). By the end you will be able to take any stock’s daily history, regress its excess return on a market proxy, and turn the resulting two numbers — \(\alpha\) and \(\beta\) — into a position-sizing decision.
Why regression
Where you’ll see this. Open any finance YouTube channel and you’ll hear “Nvidia has a beta of 1.8” — that number comes straight out of a regression. The same machinery shows up in marketing dashboards (“how many extra sales per ad-spend dollar?”) and ops planning (“how much warehouse staff per unit of forecast demand?”). Once you can read a regression output, a surprisingly large slice of real-world business analysis stops looking like magic.
Before reading the algebra, watch Josh Starmer talk through the geometry of fitting a line by least squares. Every concept in this chapter — slope, intercept, residuals, \(R^2\) — appears in this 27-minute video, intuition first.
— StatQuest with Josh Starmer
From description to inference to decision
The previous chapter showed you how to describe a single return series — its mean, volatility, drawdown profile, and Sharpe ratio — and how to measure association between two series with a correlation coefficient. Descriptive statistics summarise; correlation co-moves; but neither one predicts. The moment a portfolio manager wants to ask a different kind of question — if SPY drops 2% tomorrow, what will NVDA do? — we have crossed from description into inference. The instrument that carries us across is regression — the process of finding the line (or, in later chapters, the surface) that best fits a cloud of \((X, Y)\) points, so that we can use \(X\) to make an informed guess about \(Y\).
Regression is the workhorse of applied business analytics for one simple reason: it produces an interpretable directional model. Given a predictor \(X\) and an outcome \(Y\), regression returns a slope coefficient that says how much \(Y\) changes per unit change in \(X\), an intercept that anchors the level, and a residual scatter that tells you how much of \(Y\) remains unexplained. Those three numbers — slope, intercept, residual standard deviation — are the entire input to a thousand downstream business decisions: how big a hedge to put on, what price to charge for a new product, how much inventory to order ahead of a holiday, what compensation to offer a marketing channel.
In finance, the directional nature of regression is what separates it from correlation. The Pearson \(\rho\) between NVDA and SPY answers how tightly they co-move. The regression \(\hat\beta\) of NVDA on SPY answers how much NVDA is expected to move when SPY moves by one unit. The first is a unitless scalar bounded between \(-1\) and \(+1\). The second has units — percent NVDA per percent SPY — and translates directly into dollar exposure on a real portfolio. A risk manager who knows the correlation is well-informed; a risk manager who knows the regression coefficient can size a hedge.
Two hundred years of regression
The technique is old. The 24-year-old Carl Friedrich Gauss, working to predict the orbit of the asteroid Ceres in 1801, used the method of least squares — the same computational backbone every regression output in this chapter rests on — and published the method in 1809. Almost eight decades later, Francis Galton studied the heights of parents and children and noticed that tall parents tend to produce children who are tall but not quite as tall; he called this regression to mediocrity, and the name stuck. Karl Pearson formalised the mathematics of simple linear regression in 1896, and Ronald A. Fisher in the 1920s extended the framework to multiple predictors, analysis of variance, and the significance tests you will use throughout this chapter.
In finance, the story has its own Nobel-Prize provenance. William Sharpe shared the 1990 Nobel Memorial Prize in Economic Sciences for the Capital Asset Pricing Model — a two-parameter regression that defined how Wall Street prices risk for three decades. The CAPM’s slope coefficient, beta (\(\beta\)), became the lingua franca of equity analysis: aggressive stocks with \(\beta > 1\) amplify market swings; defensive stocks with \(\beta < 1\) dampen them. Every Bloomberg terminal displays beta. Every equity research report references it. Every risk system on every trading desk starts here.
Correlation, causation, and what regression can (and cannot) say
One warning before we open the toolbox. Regression measures statistical association in a directional, quantifiable form. It does not, by itself, establish a causal link. The classic cautionary tale: ice-cream sales and drowning rates are positively correlated; regressing one on the other will return a tidy positive slope and a small \(p\)-value. Neither causes the other — both respond to a third variable (summer weather). In finance, the same warning applies: a regression of two stock returns on each other does not mean one drives the other. They may both load on a common factor (the market). This is precisely why CAPM regresses each stock on the market — the common factor — rather than on another individual stock.
The most important warning in this chapter. A statistically significant regression coefficient is evidence of association, not causation. To go from association to causation requires either an experiment (which we rarely have in finance) or a careful identification strategy (instrumental variables, natural experiments, regression discontinuity). For pure prediction — given a value of \(X\), what is my best forecast of \(Y\)? — association is enough. For policy or capital-allocation decisions — if I change \(X\), what happens to \(Y\)? — you need causation, and regression alone will not give it to you.
Ordinary least squares
Where you’ll see this. Every time you see a chart with a straight line drawn through a scatter of points — in a McKinsey deck, a Bloomberg terminal, a Khan Academy video — somebody quietly ran Ordinary Least Squares (OLS) in the background. OLS is the default recipe for fitting that line. The next subsection writes down the model; then we explain why the line is chosen by squaring errors and adding them up.
The simple linear regression model
In plain words, we believe each observed \(Y\) (say a stock’s daily return) is roughly a line in \(X\) plus a random bump. The line has a height (intercept) and a tilt (slope); the bump is whatever the line can’t see. The formal version:
\[ Y_i = \alpha + \beta X_i + \varepsilon_i, \qquad i = 1, 2, \ldots, n. \]
What this says. Pick observation \(i\). Multiply its \(X_i\) by the slope \(\beta\), add the intercept \(\alpha\), and you’ve got the part of \(Y_i\) the model can predict; the leftover \(\varepsilon_i\) is the unpredictable bit.
Three quantities define the model. The intercept \(\alpha\) is the value of \(Y\) when \(X = 0\). The slope \(\beta\) is the change in the conditional mean of \(Y\) associated with a one-unit increase in \(X\). The error \(\varepsilon_i\) is the random deviation of the \(i\)-th observation from the line \(\alpha + \beta X_i\). The three classical OLS assumptions for valid inference, traditionally summarised by the acronym LINE, are:
- Linearity. The conditional mean of \(Y\) given \(X\) is exactly \(\alpha + \beta X\).
- Independence. The errors \(\varepsilon_i\) across observations are uncorrelated.
- Normality. Each \(\varepsilon_i \sim \mathcal{N}(0, \sigma^2)\).
- Equal variance (homoscedasticity). \(\mathrm{Var}(\varepsilon_i) = \sigma^2\) does not depend on \(X_i\).
For the point estimates of \(\alpha\) and \(\beta\) to be sensible and unbiased, the linearity assumption alone is enough; for the \(t\)-tests and confidence intervals reported by statsmodels we need all four, or a large enough sample that the Central Limit Theorem rescues us.
Anatomy of the regression equation
Before grinding through the algebra, let’s slow down and name every piece of the equation \(Y_i = \alpha + \beta X_i + \varepsilon_i\). The diagram below renders the formula and points an arrow at each symbol with a one-line description.
The five pieces of the simple-regression equation, each with a one-line plain-English label. Memorise this picture — every coefficient table you read for the rest of your career is just an estimate of \((\alpha, \beta)\) from this template.
Least squares: minimise the sum of squared residuals
Imagine drawing a line through a scatterplot with a ruler. For each point, measure the vertical gap to your line (positive if above, negative if below). If you just added the gaps up, the positives and negatives would cancel and every line through the cloud’s centre would look perfect. So we square each gap first (making them all positive and punishing big misses harder than small ones) and add. The line that makes that total smallest is the OLS line. That’s the whole idea.
Given a sample of \(n\) paired observations \((X_1, Y_1), \ldots, (X_n, Y_n)\) we want to estimate \(\alpha\) and \(\beta\). The OLS principle says: choose the pair \((\hat\alpha, \hat\beta)\) that minimises the total squared vertical distance between the data points and the fitted line. Formally,
\[ (\hat\alpha, \hat\beta) = \arg\min_{a, b}\; \sum_{i=1}^{n}\bigl(Y_i - a - b X_i\bigr)^2. \]
Why squared distances? Three reasons. First, squaring is differentiable, so we can solve the minimisation analytically by setting derivatives to zero. Second, squaring penalises large errors more than small ones, which is the right risk preference for most business problems. Third, under the normality assumption, OLS coincides with maximum-likelihood estimation. Even without normality, the Gauss–Markov theorem guarantees that among linear unbiased estimators OLS has the smallest variance; the proof appears in any econometrics text.
Setting the partial derivatives of the sum of squared residuals to zero yields two normal equations whose solution is the familiar pair of closed-form formulae:
\[ \hat\beta = \frac{\sum_{i=1}^{n} (X_i - \bar X)(Y_i - \bar Y)}{\sum_{i=1}^{n} (X_i - \bar X)^2} \;=\; \frac{\widehat{\mathrm{Cov}}(X, Y)}{\widehat{\mathrm{Var}}(X)}, \qquad \hat\alpha = \bar Y - \hat\beta\, \bar X. \]
The slope is the sample covariance of \(X\) and \(Y\) scaled by the sample variance of \(X\). The intercept is whatever shift makes the fitted line pass through the centroid \((\bar X, \bar Y)\). Two consequences worth committing to memory: the OLS residuals always sum to zero, and the fitted line always passes through the mean of the data.
The matrix form
For one predictor the two formulae above are enough. But the same problem written in matrix form generalises immediately to multiple predictors, which we will use in the next chapter, and it is also how every numerical linear-algebra library actually computes the answer. Stack the \(n\) observations of the outcome into an \(n \times 1\) vector \(\mathbf{y}\), and stack a column of ones (for the intercept) next to the predictor column into an \(n \times 2\) design matrix \(\mathbf{X}\):
\[ \mathbf{y} \;=\; \begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix}, \qquad \mathbf{X} \;=\; \begin{bmatrix} 1 & X_1 \\ 1 & X_2 \\ \vdots & \vdots \\ 1 & X_n \end{bmatrix}, \qquad \boldsymbol{\beta} \;=\; \begin{bmatrix} \alpha \\ \beta \end{bmatrix}. \]
The model is then \(\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}\). Minimising the sum of squared residuals \(\lVert \mathbf{y} - \mathbf{X}\boldsymbol{\beta}\rVert^2\) over \(\boldsymbol{\beta}\) gives the celebrated closed-form OLS estimator:
\[ \boxed{\;\hat{\boldsymbol{\beta}} \;=\; (\mathbf{X}^{\top}\mathbf{X})^{-1}\,\mathbf{X}^{\top}\mathbf{y}.\;} \]
This single equation is the engine of every linear regression in every business analytics course on earth. For simple regression (one predictor plus intercept), expanding the matrix algebra reproduces exactly the scalar formulae for \(\hat\alpha\) and \(\hat\beta\) above. For multiple regression, the same equation handles 2, 3, or 50 predictors with no additional theory — only more columns in \(\mathbf{X}\). Numerically, software does not literally invert \(\mathbf{X}^\top\mathbf{X}\) (that would be both slow and unstable); it uses a QR decomposition or singular-value decomposition. The mathematical content is identical.
Think of \(\mathbf{y}\) and the columns of \(\mathbf{X}\) as vectors in \(\mathbb{R}^n\). The fitted values \(\hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol{\beta}}\) are the orthogonal projection of \(\mathbf{y}\) onto the column space of \(\mathbf{X}\). The residual vector \(\mathbf{y} - \hat{\mathbf{y}}\) is perpendicular to every column of \(\mathbf{X}\). This is why the OLS residuals are uncorrelated with each predictor — and it is the geometric reason behind every algebraic identity in the rest of this chapter.
Fitting OLS by hand
Before we hand the computation to statsmodels, let us verify the formulae with a tiny synthetic example. The code below invents 200 fake (market, stock) return pairs using parameters we picked, then estimates those parameters back from the data — first with the scalar formula, then with the matrix formula. If both methods land near the true numbers, our recipe works.
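A minimal sketch of that experiment in NumPy. The "true" parameter values (0.0035, 2.3) and the noise level are illustrative choices made here, not numbers the book prescribes:

```python
import numpy as np

rng = np.random.default_rng(42)                    # fixed seed for reproducibility

# invent 200 fake (market, stock) excess-return pairs; the "true" parameters are our choice
n = 200
alpha_true, beta_true, sigma_eps = 0.0035, 2.3, 0.02
x = rng.normal(0.0005, 0.01, size=n)               # market excess returns
y = alpha_true + beta_true * x + rng.normal(0, sigma_eps, size=n)   # stock excess returns

# scalar formulas: beta_hat = Cov(X, Y) / Var(X), alpha_hat = Ybar - beta_hat * Xbar
beta_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
alpha_hat = y.mean() - beta_hat * x.mean()

# matrix formula: stack a column of ones, then solve the normal equations X'X b = X'y
X = np.column_stack([np.ones(n), x])
b = np.linalg.solve(X.T @ X, X.T @ y)

print(f"scalar : alpha_hat = {alpha_hat:.4f}, beta_hat = {beta_hat:.4f}")
print(f"matrix : alpha_hat = {b[0]:.4f}, beta_hat = {b[1]:.4f}")
print(f"true   : alpha     = {alpha_true:.4f}, beta     = {beta_true:.4f}")
```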
What we got. Both the scalar and the matrix recipes spit out the same \((\hat\alpha, \hat\beta)\), and both land near (but not exactly on) the true values we picked — the leftover gap is just sampling noise.
Both methods recover the same numbers, and both are close to (but not exactly equal to) the true parameters — the gap is sampling noise. With \(n = 200\) observations the slope is recovered to about two significant figures; with \(n = 2{,}000\) the standard error would shrink by roughly a factor of \(\sqrt{10} \approx 3\). Sample size and predictor spread are the two levers that buy you precision.
The dashed segment is the gap OLS is squaring and adding up. OLS picks the slope-and-intercept that makes the sum of all such squared vertical segments as small as possible — every other line through this cloud produces a larger total.
Fitting in Python with statsmodels
Where you’ll see this. In a finance internship, somebody will hand you a CSV of stock returns and say “what’s the beta of Nvidia?”. You don’t reinvent OLS from scratch — you call statsmodels. This section is the most practically useful three pages of the chapter; the table you’re about to learn to read appears in academic papers, equity-research notes, and risk reports every day.
The statsmodels workflow
Three lines fit a regression in statsmodels. Add a constant column to the predictors (so the model has an intercept), build the OLS object, then call .fit(). The fitted object exposes everything you need — coefficients, standard errors, \(t\)-statistics, \(p\)-values, residuals, fitted values, \(R^2\), and a printable .summary() table that is the standard format of applied regression.
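A sketch of that workflow on simulated data. The seed and parameter values below are illustrative, so the printed numbers will be close to, but not identical to, the condensed table shown later in this section:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# simulate 500 days of (market, stock) excess returns; all parameter choices are illustrative
rng = np.random.default_rng(0)
mkt = pd.Series(rng.normal(0.0005, 0.01, 500), name="Mkt_excess")
nvda = (0.0035 + 2.3 * mkt + rng.normal(0, 0.02, 500)).rename("NVDA_excess")

X = sm.add_constant(mkt)           # 1) add a column of ones so the model has an intercept
model = sm.OLS(nvda, X).fit()      # 2) build the OLS object, 3) run the fit
print(model.summary())             # the standard three-block regression report
```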
What we got. A wall of numbers from .summary(). Don’t panic — most of them are diagnostic, and out of about a dozen numbers we’ll only act on four today (the slope, the intercept, \(R^2\), and the slope’s \(t\)-stat). The next subsection is a label-by-label tour.
Reading the OLS().fit().summary() table
We’re about to see a table with about a dozen numbers. Don’t be scared — most are diagnostic and we’ll only use four of them today: the slope \(\hat\beta\), the intercept \(\hat\alpha\), the slope’s \(t\)-statistic, and \(R^2\). Everything else is sanity-check material.
A typical summary() output is divided into three blocks. The top block contains overall model statistics (\(R^2\), adjusted \(R^2\), \(F\)-statistic, information criteria such as AIC and BIC, sample size, degrees of freedom). The middle block is the coefficient table — one row per predictor (plus a row for the intercept), with columns for the point estimate, standard error, \(t\)-statistic, \(p\)-value, and 95% confidence interval. The bottom block contains diagnostic tests on the residuals (Durbin–Watson, Jarque–Bera, condition number, skew, kurtosis).
A condensed view of the table for our simulated CAPM looks like this:
| | coef | std err | t | P>\|t\| | [0.025, 0.975] |
|---|---|---|---|---|---|
| const (\(\alpha\)) | 0.0036 | 0.0009 | 4.13 | 0.000 | 0.0019, 0.0054 |
| Mkt_excess (\(\beta\)) | 2.3320 | 0.0900 | 25.90 | 0.000 | 2.155, 2.509 |

Model statistics: \(R^2\) = 0.574, adjusted \(R^2\) = 0.573, \(F\)-statistic = 671.0 (Prob \(F\) = 0.000), Durbin–Watson = 2.04.
Three numbers carry the story.
The t-statistic is “how many standard errors a coefficient is from zero.” A rough rule: if \(|t| > 2\) the coefficient is unlikely to be a fluke. The p-value is the same idea on a different scale: the probability of seeing a \(|t|\) this big (or bigger) just by chance, if the true coefficient were actually zero. A \(p\)-value below 0.05 is the conventional “looks real” cutoff. Neither number tells you the effect is large or useful in real work — only that it’s hard to explain away as luck.
The slope \(\hat\beta = 2.33\) says that NVDA moves 2.33 percent for every one-percent move in the market. Its standard error is 0.090, its \(t\)-statistic is \(\hat\beta / SE(\hat\beta) = 25.9\) (very far from zero — clearly not a fluke), and the corresponding two-tailed \(p\)-value is essentially zero. The 95% confidence interval \([2.16, 2.51]\) tells us how precisely the slope is pinned down: in repeated sampling, an interval constructed this way would contain the true \(\beta\) ninety-five percent of the time.
The intercept \(\hat\alpha = 0.0036\) corresponds to an average daily excess return of about 36 basis points beyond what NVDA’s market exposure can explain. Annualised that is \(0.0036 \times 252 \approx 0.91\) — about ninety percent per year of unexplained “alpha.” (This is a simulated number; real stocks rarely show this.) Its \(p\)-value of 0.000 rejects the null \(H_0: \alpha = 0\) at any conventional level.
The \(R^2\) (read: R-squared — the fraction of the up-and-down in \(Y\) that the model can explain) of 0.574 tells us that the market explains about fifty-seven percent of NVDA’s daily return variance in this sample. The remaining forty-three percent is idiosyncratic — NVDA-specific news, GPU shipment data, analyst upgrades, datacentre contracts — the stuff the model doesn’t see. We will return to \(R^2\) in detail in the next section.
- sm.add_constant(X) — adds a column of ones for the intercept.
- sm.OLS(y, X) — builds the model object (does not fit yet).
- .fit() — performs the OLS minimisation and returns a results object whose .summary(), .params, .bse, .tvalues, .pvalues, .fittedvalues, .resid, .rsquared, and .conf_int() you will use over and over.
A static reference for the full summary
For reference, the full statsmodels output for the same regression, captured from a static run, is shown below. Treat this like a road map — you don’t have to use every road, but you should know what’s there. (You do not need to execute this; it is provided as a label-by-label reading guide.)
```text
                            OLS Regression Results
==============================================================================
Dep. Variable:            NVDA_excess   R-squared:                       0.574
Model:                            OLS   Adj. R-squared:                  0.573
Method:                 Least Squares   F-statistic:                     671.0
Date:                Fri, 14 May 2026   Prob (F-statistic):           1.97e-94
No. Observations:                 500   AIC:                            -2473.
Df Residuals:                     498   BIC:                            -2465.
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.0036      0.001      4.131      0.000       0.002       0.005
Mkt_excess     2.3320      0.090     25.903      0.000       2.155       2.509
==============================================================================
Omnibus:                        0.214   Durbin-Watson:                   2.042
Prob(Omnibus):                  0.898   Jarque-Bera (JB):                0.156
Skew:                           0.043   Prob(JB):                        0.925
Kurtosis:                       3.015   Cond. No.                         102.
==============================================================================
```

Read this top-down. Top-left is the dependent variable and the fitting method. Top-right is overall fit (\(R^2\), adjusted \(R^2\), \(F\)-statistic and its \(p\)-value). The middle table gives you the per-coefficient story. The bottom block is the residual diagnostic kit, which we will return to: Durbin–Watson tests for autocorrelation of residuals (we want a value near 2), Jarque–Bera tests for non-normality (we want a large \(p\)-value), Skew and Kurtosis quantify the asymmetry and tail-thickness of the residuals (we want roughly 0 and 3), and Condition Number flags numerical multicollinearity issues (a problem only in multiple regression).
Anatomy of the OLS().fit().summary() table
Here is a labeled mock summary so you know where every important number lives. The cells you actually use most days are highlighted; arrows point to the four numbers that drive 90% of decisions.
Top block = overall fit. Middle block = per-coefficient story. Four numbers tell you the headline: \(\hat\alpha\) (const row), \(\hat\beta\) (slope row), the p-values, and \(R^2\).
Interpreting slope, intercept, residuals
Where you’ll see this. Once a regression spits out numbers, somebody has to translate them into business English: “for every 1% move in the S&P, Nvidia moves 1.88%.” That translation is what gets you taken seriously in a meeting. Get the units wrong here and you’ll under-hedge a $10M position by millions of dollars — a costly mistake.
The slope as elasticity, sensitivity, exposure
For any simple regression \(Y = \alpha + \beta X + \varepsilon\), the slope \(\beta\) is the partial derivative of the predicted outcome with respect to the predictor:
\[ \frac{\partial \mathbb{E}[Y \mid X]}{\partial X} = \beta. \]
Two practical interpretations follow. First, \(\beta\) has units: it is the change in \(Y\) per one-unit change in \(X\). If \(Y\) is a return in decimal form and \(X\) is a return in decimal form, then \(\beta\) is unitless and is interpreted as “percent move in \(Y\) per percent move in \(X\).” Second, \(\beta\) is conditional on the linearity assumption: it is the slope of the population’s best linear approximation to the conditional mean function, even when the true relationship is curved.
In the CAPM context the slope has a specific name — beta, the stock’s market sensitivity — and a specific operational meaning. If you hold \(\$10{,}000{,}000\) of NVDA and NVDA has \(\beta = 1.88\) against SPY, then a 1% move in SPY translates into a 1.88% expected move in NVDA, or about \(\$188{,}000\) of P&L from market exposure alone. To hedge that exposure to zero you would short \(\$10{,}000{,}000 \times 1.88 = \$18{,}800{,}000\) notional of SPY (or QQQ). That is not a metaphor; that is exactly how every long/short equity desk sizes its market hedges when you do this for a job.
The intercept as the “no predictor” baseline
The intercept \(\alpha\) is the value of the fitted line at \(X = 0\). Whether it has a meaningful interpretation depends on whether \(X = 0\) is meaningful in your problem.
For CAPM, \(X = 0\) corresponds to a market that earned exactly the risk-free rate (because we work with excess returns — the part of a return above the risk-free rate; if a stock returned 8% and you could have gotten 2% safely in T-bills, the excess return is 6%). The intercept \(\alpha\) is then the stock’s expected excess return when the market premium is zero — which the theory says should be zero. A non-zero \(\hat\alpha\), statistically significant and persistently positive, would mean the stock earns a return that its market exposure cannot justify. Finance has a name for this: Jensen’s alpha, after Michael Jensen’s 1968 Journal of Finance paper that introduced it as the central performance metric for mutual funds. Mutual-fund managers live and die by alpha; hedge-fund managers are paid for it; and asset-pricing economists spend their careers explaining why most of the apparent alphas in the literature are actually loadings on omitted risk factors.
For a non-CAPM example: if you regress monthly retail sales (in millions of dollars) on monthly marketing spend (in millions of dollars), the intercept is the expected sales when marketing spend is zero. That number is at least interpretable. If you regress salary on years of work experience, the intercept is the expected salary at zero experience — also interpretable. If you regress crop yield on monthly temperature in Celsius, the intercept is the predicted yield at zero degrees, which is a perfectly valid mathematical object but a useless business number. In each case the intercept is the level the line is anchored at; whether it carries economic content is a domain question, not a statistics question.
Residuals as the unexplained part of \(Y\)
In plain words, a residual is “what the line missed for this point.” If the line predicts \(\hat Y_i\) and the truth is \(Y_i\), the residual is the gap \(Y_i - \hat Y_i\). Positive residual: that day’s \(Y\) was higher than the line. Negative residual: lower.
The residual for observation \(i\) is the vertical distance from the data point to the fitted line:
\[ \hat\varepsilon_i = Y_i - \hat Y_i = Y_i - \hat\alpha - \hat\beta X_i. \]
Residuals are not the same as errors. The error \(\varepsilon_i = Y_i - \alpha - \beta X_i\) is the deviation from the true line, with the true parameters; we never observe it. The residual \(\hat\varepsilon_i\) is the deviation from the fitted line, with the estimated parameters; we do observe it, and we use it as our best plug-in estimate of the unobserved \(\varepsilon_i\).
Three properties of the residual vector are worth committing to memory because every downstream calculation rests on them:
- They sum to zero: \(\sum_i \hat\varepsilon_i = 0\). (This is built into the OLS first-order condition for the intercept.)
- They are uncorrelated with the predictor: \(\sum_i X_i \hat\varepsilon_i = 0\). (Built into the OLS first-order condition for the slope.)
- Their sample variance is the unbiased estimator of \(\sigma^2\), up to a small degrees-of-freedom correction:
\[ \hat\sigma^2 \;=\; \frac{1}{n - 2}\sum_{i=1}^{n} \hat\varepsilon_i^2. \]
The division by \(n - 2\) rather than \(n\) corrects for the fact that we used two parameters (\(\hat\alpha, \hat\beta\)) to fit the line. The square root \(\hat\sigma\) is sometimes called the residual standard error or RMSE and is the standard deviation of a typical prediction error.
In the CAPM context, residuals carry direct financial content. The fitted value \(\hat Y_i = \hat\alpha + \hat\beta X_i\) is the systematic-risk component of NVDA’s return — the part driven by the market. The residual \(\hat\varepsilon_i\) is the idiosyncratic component — the firm-specific news flow. Splitting one stock’s return history into these two components is what every multi-factor risk model in production does (Barra, Axioma, MSCI), and the simplest version of that split is the residual of a CAPM regression.
\(R^2\) and the meaning of explained variance
Where you’ll see this. Watch ten finance YouTubers and at least three will misuse “R-squared” — they’ll say things like “an R-squared of 0.5 means the strategy makes money half the time,” which is gibberish. Knowing what \(R^2\) actually says (and doesn’t say) is one of the cheapest ways to look smart in any analytics conversation.
\(R^2\) is “what fraction of the up-and-down in \(Y\) does the model explain?” Suppose \(Y\) has a total wiggle (variance) of 100 units. If the line captures 30 units of that wiggle and leaves 70 units of unexplained scatter, then \(R^2 = 30/100 = 0.3\) — the model explains 30% of the variation; the other 70% is stuff the model doesn’t see. \(R^2 = 0\) means the line is useless (flat); \(R^2 = 1\) means every point sits exactly on the line.
If you only watch one video on \(R^2\) in your life, this is it. Josh derives \(R^2 = \text{SSR}/\text{SST}\) from the picture, without a single matrix.
— StatQuest
SST, SSR, SSE: the variance decomposition
Three quick names you’ll see in every textbook: SST (total wiggle in \(Y\)), SSR (the part the line captures), SSE (the part the line misses). Memorising the acronyms is less important than the picture: total = captured + missed.
The variance of \(Y\) in your sample can be split into a part the regression captures and a part it leaves behind. Define:
\[ \begin{aligned} \text{SST} \;&=\; \sum_{i=1}^{n} (Y_i - \bar Y)^2 \quad &\text{(total sum of squares)} \\ \text{SSR} \;&=\; \sum_{i=1}^{n} (\hat Y_i - \bar Y)^2 \quad &\text{(regression / explained sum of squares)} \\ \text{SSE} \;&=\; \sum_{i=1}^{n} (Y_i - \hat Y_i)^2 \quad &\text{(error / residual sum of squares)} \end{aligned} \]
A clean algebraic identity (provable in one line from the orthogonality of residuals and fitted values) says these three are linked:
\[ \boxed{\;\text{SST} \;=\; \text{SSR} \;+\; \text{SSE}.\;} \]
Each term has a one-line interpretation. SST is the total scatter of the outcome around its mean — the variance of \(Y\) before any predictor is used. SSR is the scatter the fitted line captures — the variance of the predictions \(\hat Y_i\) around the mean. SSE is the scatter the fitted line misses — the variance of the residuals around zero.
\(R^2\) as the fraction of variance explained
The coefficient of determination, denoted \(R^2\) and pronounced “r-squared,” is the fraction of total variance the regression explains:
\[ R^2 \;=\; \frac{\text{SSR}}{\text{SST}} \;=\; 1 - \frac{\text{SSE}}{\text{SST}}. \]
By construction \(R^2 \in [0, 1]\). A model with \(R^2 = 0\) explains none of the variance — the fitted line is flat at \(\bar Y\) and the slope estimate is zero. A model with \(R^2 = 1\) explains all of the variance — every data point lies exactly on the fitted line, with zero residual. Real models live somewhere between.
For simple linear regression (one predictor) there is a beautiful identity:
\[ R^2 \;=\; \rho_{XY}^2, \]
where \(\rho_{XY}\) is the Pearson correlation between \(X\) and \(Y\). A correlation of \(\pm 0.74\) corresponds to \(R^2 \approx 0.55\); a correlation of \(\pm 0.50\) corresponds to \(R^2 = 0.25\); a correlation of \(\pm 0.90\) corresponds to \(R^2 = 0.81\). This identity is the formal link between the previous chapter's correlation and this chapter's regression: for a single predictor, knowing one immediately tells you the other.
In CAPM, the SST/SSR/SSE decomposition has a direct financial reading. SSR is the systematic risk captured by the market factor — exposure that cannot be diversified away because every stock in the market loads on the same broad factor. SSE is the idiosyncratic risk unique to the stock — the portion that a portfolio of twenty to thirty stocks averages out to near zero. \(R^2\) is the share of the stock’s variance that comes from market exposure. For NVDA over 2023–2024 a typical value is \(R^2 \approx 0.55\): about half the daily return variance is systematic, half is firm-specific. The half that is idiosyncratic is exactly the space in which active equity research and alternative-data signals try to add value.
Computing SST, SSR, SSE, \(R^2\) by hand
The next code block computes SST, SSR, SSE and \(R^2\) from scratch and then asks statsmodels for \(R^2\) too — they should match exactly. If they don’t, we’ve made an arithmetic error somewhere.
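A sketch of that check, reusing the fitted model object and the simulated mkt and nvda series from the statsmodels example above:

```python
import numpy as np

# reuse `model`, `nvda`, and `mkt` from the statsmodels fit above
y = nvda.to_numpy()
y_hat = model.fittedvalues.to_numpy()

sst = np.sum((y - y.mean()) ** 2)       # total scatter around the mean
ssr = np.sum((y_hat - y.mean()) ** 2)   # scatter the fitted line captures
sse = np.sum((y - y_hat) ** 2)          # scatter the fitted line misses

print(f"SST = SSR + SSE ?      {np.isclose(sst, ssr + sse)}")
print(f"R-squared by hand    : {ssr / sst:.4f}")
print(f"model.rsquared       : {model.rsquared:.4f}")
print(f"corr(X, Y) squared   : {np.corrcoef(mkt, nvda)[0, 1] ** 2:.4f}")
```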
What we got. SSR + SSE lands on SST, our by-hand \(R^2\) matches model.rsquared, and the squared correlation equals \(R^2\) — three independent ways of computing the same number, all agreeing.
The numerical check should confirm three things: \(\text{SSR} + \text{SSE}\) equals \(\text{SST}\) to machine precision, \(\text{SSR}/\text{SST}\) equals model.rsquared, and (for simple regression) the squared correlation between \(X\) and \(Y\) equals \(R^2\) exactly.
Each stacked bar represents one regression with SST normalised to 1. The blue bottom slab is the fraction of total variance the regression captures (= \(R^2\)); the grey top slab is the leftover residual variance. As noise grows from left to right, \(R^2\) shrinks — the line explains less, the residuals explain more.
A common misreading of \(R^2\)
A high \(R^2\) is not a guarantee of a good model and a low \(R^2\) is not a guarantee of a bad one. Three caveats:
- \(R^2\) depends on the range of \(X\). A regression of NVDA on SPY over a quiet flat-vol year may produce a lower \(R^2\) than the same regression over a high-vol year, even when the true relationship is identical, simply because there is less variance in \(X\) to explain.
- \(R^2\) never decreases when predictors are added. This is why in multiple regression we use adjusted \(R^2\), which penalises additional predictors. We will meet adjusted \(R^2\) in the next chapter.
- \(R^2\) measures in-sample fit, not out-of-sample predictive power. A model can be heavily overfit, with an in-sample \(R^2\) of 0.95, and still produce useless forecasts. We address this with train/test splits and out-of-sample \(R^2\) later in the chapter.
For prediction-focused work, \(R^2\) is best read alongside the residual standard error (RMSE), which has units of \(Y\) and gives a direct sense of typical prediction error. A model with \(R^2 = 0.55\) and RMSE = 1.2% per day tells you both how much variance is explained and how big a typical mistake is.
Inference: standard errors, \(t\)-tests, \(p\)-values, confidence intervals
Where you’ll see this. Anytime someone says “the effect is statistically significant” — in a research paper, a marketing A/B test write-up, an equity analyst’s note — they’re invoking the machinery in this section. If you can’t read a \(t\)-statistic and a confidence interval, you can’t audit those claims, and you’ll have to take their word for it. Don’t take their word for it.
The estimators are random variables
Here is the headline idea before the math. Our estimate \(\hat\beta\) depends on which 500 days we happened to grab. Grab a different 500 days and we’d get a different \(\hat\beta\). So \(\hat\beta\) is itself a random number — and that’s why we report a standard error next to it, to say how much it would jiggle if we re-rolled the sample. Inference is the toolkit for taking that jiggle seriously.
Run the same experiment twice with two different samples from the same population, and you will get two different pairs \((\hat\alpha, \hat\beta)\). The OLS estimator is itself a random variable. Inference is the discipline of quantifying that randomness — how variable are the estimates, how do we test whether they differ significantly from a hypothesised value, and how do we build an interval that quantifies their precision?
Under the LINE assumptions, the OLS estimator of the slope has a sampling distribution that is exactly normal in finite samples:
\[ \hat\beta \;\sim\; \mathcal{N}\!\left(\beta,\; \frac{\sigma^2}{S_{xx}}\right), \qquad S_{xx} \;=\; \sum_{i=1}^{n} (X_i - \bar X)^2. \]
Replace the unknown \(\sigma^2\) with its estimator \(\hat\sigma^2 = \text{SSE}/(n-2)\), and the standardised estimator follows a Student-\(t\) distribution with \(n - 2\) degrees of freedom:
\[ \frac{\hat\beta - \beta}{SE(\hat\beta)} \;\sim\; t_{n-2}, \qquad SE(\hat\beta) \;=\; \frac{\hat\sigma}{\sqrt{S_{xx}}}. \]
The standard error formula has an intuitive reading. The slope is estimated more precisely (smaller \(SE\)) when the residual scatter is small (\(\hat\sigma\) small), when the predictor has a wide spread (\(S_{xx}\) large), and when the sample is large (both effects). To estimate a slope sharply, run the experiment over a wide range of \(X\) and gather lots of data.
The intercept follows a parallel formula:
\[ SE(\hat\alpha) \;=\; \hat\sigma \cdot \sqrt{\frac{1}{n} + \frac{\bar X^2}{S_{xx}}}. \]
Each replication grabs a fresh sample and reports its own \(\hat\beta\). The histogram of those 500 estimates is the sampling distribution of \(\hat\beta\); the red curve is the textbook normal approximation. A standard error is just the width of this bell — it is not abstract; you could measure it with a ruler on the histogram.
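The replication experiment behind that figure can be sketched in a few lines; the parameter values below are illustrative stand-ins for whatever the figure was generated with:

```python
import numpy as np

# re-run the regression 500 times on fresh samples and watch beta_hat scatter
rng = np.random.default_rng(1)
alpha_true, beta_true, sigma_eps, n = 0.0035, 2.3, 0.02, 200

beta_hats = np.empty(500)
for r in range(500):
    x = rng.normal(0.0005, 0.01, n)                               # fresh market sample
    y = alpha_true + beta_true * x + rng.normal(0, sigma_eps, n)  # fresh stock sample
    beta_hats[r] = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)

print(f"mean of beta_hat : {beta_hats.mean():.3f}   (true beta = {beta_true})")
print(f"std of beta_hat  : {beta_hats.std(ddof=1):.3f}   (empirical standard error)")
```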
Hypothesis testing on the slope
Every row in the coefficient table of a statsmodels summary implicitly conducts a two-tailed test of the form \(H_0: \theta = 0\) versus \(H_a: \theta \ne 0\), where \(\theta\) is the coefficient on that row. The mechanics:
- Compute the \(t\)-statistic: \(t = \hat\theta / SE(\hat\theta)\).
- Compute the two-tailed \(p\)-value: \(p = 2 \cdot \mathbb{P}\bigl(\lvert T_{n-2}\rvert \ge \lvert t\rvert\bigr)\).
- Reject \(H_0\) at the 5% level if \(p < 0.05\) (equivalently if \(\lvert t\rvert > t_{0.025, n-2} \approx 1.96\) for large \(n\)).
In CAPM, the slope test asks: does the market explain any portion of the stock’s return at all? A non-significant \(\beta\) would mean the stock’s exposure to broad market moves is statistically indistinguishable from zero — exotic for most equities but normal for, say, a gold-mining stock during certain regimes. The intercept test asks: does the stock earn a free lunch beyond what its market exposure justifies? A significantly positive \(\hat\alpha\) is the empirical signature of “alpha” — abnormal risk-adjusted return.
A \(p\)-value is routinely misread, so keep three facts straight:
- It is not the probability that the null hypothesis is true.
- It is not the probability that your result occurred by chance.
- It is the probability of observing data this extreme (or more extreme) if \(H_0\) were true.
These distinctions are not pedantic — confusing them leads to systematic over-confidence in regression results. In finance, Harvey, Liu, and Zhu (2016, Review of Financial Studies) surveyed 316 claimed equity-premium “factors” and concluded that, given the multiple-testing problem, the \(t\)-statistic threshold for declaring a new factor significant should be raised from 2.0 to at least 3.0. A \(p\)-value is evidence, not proof.
Confidence intervals
A 95% confidence interval for \(\beta\) is constructed so that, in repeated sampling from the same population, ninety-five percent of such intervals would contain the true \(\beta\):
\[ \text{95% CI for } \beta:\quad \hat\beta \;\pm\; t_{0.025, n-2} \cdot SE(\hat\beta). \]
For large samples (say \(n > 100\)), the \(t\)-quantile \(t_{0.025, n-2}\) is very close to the normal quantile \(1.96\). In the simulated CAPM above, \(\hat\beta = 2.33\) with \(SE(\hat\beta) = 0.090\) and \(n = 500\) produces a 95% CI of approximately \([2.16, 2.51]\). The interval does not contain zero, which is the same conclusion as the \(t\)-test: the slope is significantly different from zero.
Two readings of the same interval:
- Narrowness measures precision. A 95% CI of \([1.8, 2.0]\) is much sharper than \([0.5, 3.5]\), even though both have the same point estimate of 1.9.
- Position measures economic content. If the 95% CI for \(\beta\) is \([0.8, 1.2]\), then the data are consistent with any value of \(\beta\) between 0.8 and 1.2 — including 1.0. Statistically, you cannot distinguish the stock from a “market-neutral” exposure.
Testing a non-zero hypothesis: is \(\beta > 1\)?
So far the question “is \(\hat\beta\) different from zero?” is answered for us in the summary. But in finance the more interesting question is often “is the stock more aggressive than the market?” — i.e. is \(\beta > 1\)? The trick: shift the test statistic so it asks about 1 instead of 0.
The default statsmodels \(t\)-statistic always tests against zero. To test whether NVDA is more aggressive than the market — \(H_0: \beta \le 1\) versus \(H_a: \beta > 1\) — you recentre the test statistic against the new null:
\[ t = \frac{\hat\beta - 1}{SE(\hat\beta)}. \]
If \(t\) exceeds the one-tailed critical value \(t_{0.05, n-2} \approx 1.645\), you reject \(H_0\) and conclude NVDA is statistically more aggressive than the market.
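A sketch of the recentred test, assuming the fitted model object from the earlier statsmodels example:

```python
from scipy import stats

# reuse the fitted `model` from the statsmodels example earlier in the chapter
beta_hat = model.params["Mkt_excess"]
se_beta = model.bse["Mkt_excess"]
dof = int(model.df_resid)                    # n - 2 for simple regression

t_shifted = (beta_hat - 1.0) / se_beta       # recentre the statistic against beta = 1
p_one_tailed = stats.t.sf(t_shifted, dof)    # P(T >= t) under H0: beta <= 1

print(f"t-statistic vs beta = 1 : {t_shifted:.2f}")
print(f"one-tailed p-value      : {p_one_tailed:.2e}")
```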
What we got. The shifted \(t\) is large and the corresponding \(p\)-value is tiny, so we reject “\(\beta \le 1\)” — NVDA is statistically more aggressive than the market.
The general principle: the default test is against zero, “the predictor matters.” To test any other threshold, recentre \(\hat\beta\) around the threshold before dividing by the standard error.
Robust standard errors
In one sentence: the default standard errors in statsmodels assume the residuals behave nicely (same variance, no day-to-day correlation). Real return data breaks both assumptions. Robust standard errors are alternative SE formulas that don’t trust those assumptions, so the \(p\)-values they produce are more honest. The point estimates \(\hat\alpha, \hat\beta\) don’t change — only the error bars around them.
The classical OLS standard error formula assumes homoscedastic, uncorrelated errors. In financial return data both assumptions routinely fail. Volatility clusters — calm weeks follow calm weeks, panic days cluster together — so the error variance is heteroscedastic (a fancy word meaning “the scatter is bigger on some days than others”). And consecutive daily returns are often weakly autocorrelated, so the errors are serially correlated (today’s leftover relates to yesterday’s).
When these assumptions fail, the OLS point estimates \(\hat\alpha, \hat\beta\) remain unbiased, but the reported standard errors are wrong, and so are all the \(p\)-values and confidence intervals built on top of them. The fix is robust standard errors — replace the classical formula with one that does not assume homoscedasticity or independence.
The two standard choices in statsmodels are:
- cov_type='HC3': heteroscedasticity-consistent (White) standard errors. Use for cross-sectional regressions.
- cov_type='HAC', cov_kwds={'maxlags': L}: Newey–West heteroscedasticity- and autocorrelation-consistent standard errors with \(L\) lags. Use for time-series regressions like CAPM. A common rule of thumb is \(L = \lfloor 4 (n/100)^{2/9}\rfloor\), used in the sketch below.
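A sketch of fitting the same regression three ways, reusing the simulated nvda, mkt, and X from the earlier statsmodels example:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# reuse `nvda`, `mkt`, and `X = sm.add_constant(mkt)` from the simulated example
maxlags = int(np.floor(4 * (len(nvda) / 100) ** (2 / 9)))   # Newey-West rule of thumb

fit_classical = sm.OLS(nvda, X).fit()                            # default (nonrobust) SEs
fit_hc3 = sm.OLS(nvda, X).fit(cov_type="HC3")                    # White / HC3
fit_hac = sm.OLS(nvda, X).fit(cov_type="HAC",
                              cov_kwds={"maxlags": maxlags})     # Newey-West

comparison = pd.DataFrame({
    "coef": fit_classical.params,
    "SE (classical)": fit_classical.bse,
    "SE (HC3)": fit_hc3.bse,
    "SE (HAC)": fit_hac.bse,
})
print(comparison.round(4))
```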
What we got. Three different SE columns, one set of coefficients. The HAC and HC3 SEs are bigger — that’s the honest signal that volatility clustering and serial correlation were quietly inflating our confidence.
Notice that the coefficients are identical across the three fits. Only the standard errors change. In real work on return data, HC3 and HAC standard errors are routinely 10–40 percent larger than the classical default — meaning the classical default understates uncertainty, which is exactly the wrong direction for risk management.
The CAPM: a canonical simple regression
Where you’ll see this. Walk into a finance internship and somebody will ask “what’s the beta of Nvidia?” — they’re asking for the slope of a CAPM regression. The CAPM appears on Bloomberg screens, in corporate valuation models, and in every equity research note. We are not claiming CAPM is true — many academic papers since the 1990s have poked holes in it — but it is the cleanest finance-flavoured application of simple regression, which makes it the perfect teaching example.
CAPM (Capital Asset Pricing Model) is a textbook recipe: a stock’s excess return is mostly its market exposure (\(\beta\)) times the market’s excess return, plus a small skill bit (\(\alpha\)), plus noise. Think of it like a recipe with one main ingredient (market risk) and one optional spice (manager skill). We’re not endorsing CAPM as a true law of nature — we’re using it as a clean example to learn regression on real data.
The model
The Capital Asset Pricing Model regresses a stock’s excess return on the market’s excess return:
\[ \boxed{\;R_{i,t} - R_{f,t} \;=\; \alpha_i \;+\; \beta_i\bigl(R_{m,t} - R_{f,t}\bigr) \;+\; \varepsilon_{i,t}.\;} \]
Three terms, three meanings.
- \(R_{i,t} - R_{f,t}\) is the stock’s excess return at time \(t\) — its raw return less the risk-free rate. The risk-free rate is the return on a one-month T-bill or some equivalent short government bill. CAPM is a model of premiums, not levels.
- \(R_{m,t} - R_{f,t}\) is the market excess return — the return on a broad market portfolio less the risk-free rate. In real work we approximate the market by an index ETF (SPY for the S&P 500, IWM for small caps, ACWI for global) or by the Fama–French market factor Mkt-RF, which is the value-weighted return of all NYSE, AMEX, and NASDAQ stocks minus the one-month T-bill rate.
- \(\varepsilon_{i,t}\) is the idiosyncratic, firm-specific shock — the part of the return the market cannot explain. By assumption \(\mathbb{E}[\varepsilon_{i,t} \mid R_{m,t}] = 0\), so the errors are uncorrelated with the market factor.
The two coefficients \(\alpha_i\) and \(\beta_i\) are the entire content of the model.
Anatomy of the CAPM equation
The CAPM is a regression equation in finance clothing. Each symbol has a name and an economic interpretation — the diagram below maps the math to the finance.
Read left to right: stock excess return = a tiny skill bit + market sensitivity × market excess return + idiosyncratic noise. Estimating \(\beta_i\) from data is the most common “what’s the beta of this stock?” task in finance.
Beta: market exposure
The slope \(\beta_i\) — the stock’s market beta — measures the sensitivity of the stock’s excess return to the market’s excess return. Three benchmark cases:
- \(\beta = 1\). The stock moves one-for-one with the market. The market portfolio itself has beta 1 by construction.
- \(\beta > 1\). Aggressive stock. Moves amplify market swings. Typical of high-growth technology, semiconductors, biotech. NVDA’s full-sample beta against SPY has been in the 1.5–2.0 range for most of 2015–2025.
- \(\beta < 1\). Defensive stock. Moves dampen market swings. Typical of consumer staples, utilities, healthcare. Procter & Gamble’s beta is usually below 0.7.
- \(\beta < 0\). Counter-cyclical asset. Rare in equities; common in gold-related instruments and certain put-option strategies.
A practical operational reading: if you hold \(\$V\) of a stock with beta \(\beta\), your dollar market exposure is \(\beta \cdot V\). To hedge to a target beta of zero, short \(\beta \cdot V\) notional of the market proxy. To target a beta of, say, 0.3, short \((\beta - 0.3) \cdot V\).
A useful identity: \[ \beta_i \;=\; \frac{\mathrm{Cov}(R_i, R_m)}{\mathrm{Var}(R_m)}. \] This is just the OLS slope formula in disguise. It tells you that beta is not a correlation: it has units (percent per percent) and depends on the variances. Two stocks can have the same correlation with the market but very different betas; two stocks can have the same beta but very different correlations.
Alpha: skill or compensation for missing risk?
The intercept \(\alpha_i\) — Jensen’s alpha — is the stock’s expected excess return when the market premium is zero. The CAPM theory says \(\alpha_i = 0\) for all assets in equilibrium: no security earns a free lunch above its market-risk compensation. Empirically, the null hypothesis \(H_0: \alpha = 0\) is the central test of the CAPM, and the central performance metric of active management.
There are exactly three reasons a stock can show a statistically significant positive \(\hat\alpha\):
- Genuine mispricing / manager skill. The market has systematically underpriced the stock or the manager has identified a true informational edge. This is the interpretation active managers prefer.
- Missing risk factor. The stock loads on a priced risk that CAPM does not include (size, value, momentum, profitability, …). The Fama–French three-factor and five-factor models in the next chapter address exactly this.
- Data mining / multiple testing. If you test 1000 stocks at the 5% level, you expect 50 significant alphas under the null even when none of them are real. Harvey–Liu–Zhu (2016) argue this drives most of the published anomaly literature.
The empirical fact is that on long-horizon, large-cross-section data, most stocks have alphas statistically indistinguishable from zero once you correctly excess-return-adjust and use robust standard errors. This is the strongest empirical evidence in favour of weak-form efficient markets.
Why excess returns?
CAPM specifies excess returns on both sides of the regression — not raw returns. Why does this matter?
Decompose a raw return:
\[ R_{i,t} \;=\; R_{f,t} \;+\; \underbrace{(R_{i,t} - R_{f,t})}_{\text{excess return}}. \]
If you regressed raw \(R_i\) on raw \(R_m\) during a period when \(R_f\) was high and time-varying (say, 2022–2023, when the one-month T-bill yield moved from 0.05% to 5.4%), the regression would be partially fitting the common time trend in the risk-free rate, not the risk premium. The estimated intercept would not equal Jensen’s alpha; it would equal \(\alpha_i + (1 - \beta_i) R_f\), which depends on the level of \(R_f\). In a zero-rate environment this distinction does not matter empirically; in a normal-rate environment it does. Always subtract the risk-free rate from both sides before fitting CAPM.
The two CAPM numbers are exactly the geometry of this picture. The slope of the red line is \(\hat\beta\) — sensitivity to market moves; the height at which the line crosses the y-axis (\(x = 0\)) is \(\hat\alpha\) — the average excess return when the market premium is zero.
Worked example: full CAPM regression of NVDA on SPY
Where you’ll see this. This is the most important worked example in the chapter. Every step — load CSV, compute returns, subtract risk-free rate, fit OLS, read \(\beta\) and \(\alpha\), check residuals, validate out-of-sample — mirrors what an equity-research analyst, a quant risk manager, or a robo-advisor coder actually does on a typical Monday. Read it once for the story, then run each code cell yourself.
The remainder of the chapter walks through a complete CAPM analysis on real daily NVDA and SPY data from 2023–2024, using the same data file the ISOM2600 textbook uses. The file ships alongside this book on the project’s CDN, so no API key or external download is needed — the regression runs entirely in your browser through Pyodide.
Step 1: load prices, compute returns, build excess returns
We’re going to pull two CSV files: a daily-prices file (NVDA and SPY close prices) and a Fama–French factors file (which contains the daily risk-free rate RF). Then we turn prices into daily simple returns and subtract RF to get excess returns — the part of the return above what a one-month T-bill would have paid you that day.
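A sketch of this step. The file names (nvda_spy_prices.csv, ff_factors_daily.csv) and column names (Date, NVDA, SPY, RF) are placeholders, not the book's actual file paths; substitute the files that ship with the chapter:

```python
import pandas as pd

# File and column names below are placeholders; substitute the paths the book ships with.
prices = pd.read_csv("nvda_spy_prices.csv", parse_dates=["Date"], index_col="Date")
ff = pd.read_csv("ff_factors_daily.csv", parse_dates=["Date"], index_col="Date")

# close prices -> simple daily returns in decimal form (0.012 means 1.2%)
rets = prices[["NVDA", "SPY"]].pct_change().dropna()

# the Fama-French file stores RF in percent; divide by 100 to get decimals
rf = ff["RF"] / 100.0

# inner join on dates, then subtract the risk-free rate from both series
df = rets.join(rf, how="inner")
df["NVDA_excess"] = df["NVDA"] - df["RF"]
df["Mkt_excess"] = df["SPY"] - df["RF"]

print(df[["NVDA_excess", "Mkt_excess"]].head())
print(len(df), "daily observations")
```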
What we got. A DataFrame of about 500 daily rows with two excess-return columns ready for regression: NVDA_excess on the left, Mkt_excess on the right.
A few details worth noting. The pct_change() step gives us simple returns in decimal form (0.012 means 1.2%). The Fama–French file stores RF in percent (0.018 means 1.8 basis points), which we convert by dividing by 100. The inner join on dates protects us against a calendar mismatch — Fama–French uses US trading days, our price data uses the same, but the inner join is cheap insurance.
Step 2: fit the CAPM with statsmodels
About to see another .summary() table — same shape as the one earlier in the chapter. The four numbers we care about are: \(R^2\) (top-right of the table), and on the Mkt_excess row the coefficient (that’s \(\hat\beta\)), its \(t\)-statistic, and the row for const (that’s \(\hat\alpha\)).
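A sketch of the fit, assuming the DataFrame df built in Step 1:

```python
import statsmodels.api as sm

# uses the DataFrame `df` built in Step 1
X = sm.add_constant(df["Mkt_excess"])         # intercept column + market excess return
capm = sm.OLS(df["NVDA_excess"], X).fit()     # the CAPM regression
print(capm.summary())

print(f"\nbeta  = {capm.params['Mkt_excess']:.3f}")
print(f"alpha = {capm.params['const']:.5f} per day")
print(f"R^2   = {capm.rsquared:.3f}")
```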
What we got. Roughly \(\hat\beta \approx 1.88\), \(R^2 \approx 0.55\), \(\hat\alpha\) tiny and statistically indistinguishable from zero.
Read the output top-down. The first block tells you \(R^2 \approx 0.55\) — about fifty-five percent of NVDA's daily excess-return variance over 2023–2024 is explained by the market; the other 45% is stuff the model doesn't see (firm-specific news). The coefficient table tells you \(\hat\beta \approx 1.88\) (a 1% market move is associated with a 1.88% NVDA move on average), with a \(t\)-statistic well over 20 and a \(p\)-value essentially zero — not a fluke. The intercept \(\hat\alpha\) comes out small — a fraction of a basis point per day — with a \(p\)-value that does not reach significance. NVDA shows no statistically detectable Jensen's alpha in this two-year sample once we control for its market exposure.
Step 3: read off SST/SSR/SSE and \(R^2\) by hand
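A sketch of the decomposition, reusing df and the fitted capm object from the previous steps:

```python
import numpy as np

# uses `df` and the fitted `capm` from the previous steps
y = df["NVDA_excess"].to_numpy()
y_hat = capm.fittedvalues.to_numpy()

sst = np.sum((y - y.mean()) ** 2)
ssr = np.sum((y_hat - y.mean()) ** 2)
sse = np.sum((y - y_hat) ** 2)
rmse = np.sqrt(sse / (len(y) - 2))            # residual standard error, n - 2 dof

print(f"SSR / SST      : {ssr / sst:.4f}")
print(f"capm.rsquared  : {capm.rsquared:.4f}")
print(f"RMSE per day   : {rmse:.4%}")
```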
What we got. The hand-built \(R^2\) matches capm.rsquared, and the RMSE (about half a percent) tells us the typical day-to-day “miss” of the model.
The RMSE of roughly half a percent per day says: if you used CAPM to predict NVDA’s daily excess return given the SPY excess return, your typical prediction error would be about fifty basis points per day in either direction. On a \(\$10{,}000{,}000\) NVDA position that is roughly \(\$50{,}000\) of daily P&L noise that the market factor cannot explain. That is the idiosyncratic component, and it is what an active manager would try to forecast separately.
Step 4: visualise the regression
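A sketch of the plot in matplotlib, reusing df and capm; the styling choices are illustrative:

```python
import matplotlib.pyplot as plt
import numpy as np

# uses `df` and `capm` from the previous steps
fig, ax = plt.subplots(figsize=(7, 5))
ax.scatter(df["Mkt_excess"], df["NVDA_excess"], s=10, alpha=0.5, label="daily observations")

# fitted CAPM line over the observed range of market excess returns
x_grid = np.linspace(df["Mkt_excess"].min(), df["Mkt_excess"].max(), 100)
y_grid = capm.params["const"] + capm.params["Mkt_excess"] * x_grid
ax.plot(x_grid, y_grid, color="red", lw=2,
        label=f"fitted line, beta = {capm.params['Mkt_excess']:.2f}")

ax.axhline(0, color="grey", lw=0.5)
ax.axvline(0, color="grey", lw=0.5)
ax.set_xlabel("SPY excess return")
ax.set_ylabel("NVDA excess return")
ax.legend()
plt.show()
```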
What we got. A scatterplot of daily \((R_{m,\,t}^{\text{ex}},\, R_{i,\,t}^{\text{ex}})\) pairs with the fitted CAPM line on top — the picture form of \(\hat\beta = 1.88\).
The cloud is elongated diagonally from lower-left to upper-right — positive correlation made visual. The fitted red line passes through the cloud at a slope steeper than 45 degrees: that is the geometric fingerprint of \(\beta > 1\). The scatter around the line is the idiosyncratic risk — the part of NVDA’s daily move that the market cannot reach. A perfect index-fund stock would have all its points lying exactly on a 45-degree line through the origin; NVDA’s points lie around a steeper line with substantial vertical scatter.
Step 5: confidence intervals and a non-zero hypothesis test
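A sketch of both calculations, reusing the fitted capm object from Step 2:

```python
from scipy import stats

# uses the fitted `capm` from Step 2
ci = capm.conf_int()                             # 95% CIs, one row per coefficient
print("95% CI for beta:", ci.loc["Mkt_excess"].round(3).tolist())

beta_hat = capm.params["Mkt_excess"]
se_beta = capm.bse["Mkt_excess"]
dof = int(capm.df_resid)

t_shifted = (beta_hat - 1.0) / se_beta           # H0: beta <= 1  vs  Ha: beta > 1
p_one_tailed = stats.t.sf(t_shifted, dof)
print(f"t vs beta = 1 : {t_shifted:.2f},  one-tailed p = {p_one_tailed:.2e}")
```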
What we got. The 95% CI for \(\beta\) sits well above 1, and the shifted one-tailed \(t\)-test rejects \(H_0: \beta \le 1\) with a tiny \(p\)-value — so on this data we conclude NVDA is statistically more aggressive than the market.
The 95% confidence interval for \(\beta\) typically lands in the neighbourhood of \([1.7, 2.0]\) — well above 1, well above the market-neutral threshold. The shifted-null one-tailed test rejects \(H_0: \beta \le 1\) with a \(p\)-value far below 0.05. NVDA is statistically more aggressive than the market. That single conclusion has direct portfolio implications: a \(\$10\text{M}\) NVDA position carries roughly \(\$18.8\text{M}\) of effective market exposure, and a long-only desk that wanted to control overall market beta would need to either hedge that exposure with \(\$8.8\text{M}\) short of SPY (to bring net exposure to \(\$10\text{M}\)) or to size the NVDA position smaller in the first place.
Step 6: robust standard errors
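A sketch of the HAC refit, reusing df, X, and the classical capm fit from Step 2:

```python
import numpy as np
import statsmodels.api as sm

# uses `df`, `X`, and the classical `capm` fit from Step 2
maxlags = int(np.floor(4 * (len(df) / 100) ** (2 / 9)))          # Newey-West rule of thumb

capm_hac = sm.OLS(df["NVDA_excess"], X).fit(cov_type="HAC",
                                            cov_kwds={"maxlags": maxlags})

print("coefficients  :", capm_hac.params.round(4).to_dict())     # identical to the classical fit
print("classical SEs :", capm.bse.round(4).to_dict())
print("HAC SEs       :", capm_hac.bse.round(4).to_dict())
```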
What we got. Same \(\hat\beta\), slightly larger SE under HAC — exactly what robust SEs are supposed to do when residuals misbehave.
The coefficients are identical (OLS point estimates do not change with the covariance type), but the standard errors and \(t\)-statistics typically shift. In daily return data the HAC standard errors are often somewhat larger than the default, reflecting positive serial correlation and volatility clustering. Reporting both is standard practice in academic finance.
Residual diagnostics
Where you’ll see this. When a peer reviewer (or a sceptical boss) asks “did you check the residuals?”, they want to see these plots. Skipping diagnostics is the single most common mistake in undergraduate regression analyses — and the easiest one for a reader to spot.
A regression’s point estimates may be perfectly fine even when the inference assumptions are violated, but the standard errors and \(p\)-values can be misleading. Residual diagnostics are the post-fit sanity check: do the residuals look like they came from a well-behaved, independent, normal-with-constant-variance distribution? Four diagnostics, four plots.
Diagnostic 1: residuals vs fitted values
A scatter of residuals against fitted values should look like a structureless cloud with constant vertical spread. Two failure modes to watch for:
- Funnel shape — vertical spread of residuals grows or shrinks with the fitted value. This is heteroscedasticity, and the cure is either a variance-stabilising transformation (log, square-root) or robust standard errors (HC3).
- Curved shape — residuals systematically positive in one region and negative in another. This means the linearity assumption is wrong; the relationship between \(X\) and \(Y\) is curved, not linear. The cure is a non-linear transformation or a polynomial term.
Diagnostic 2: histogram and QQ plot of residuals
Plot a histogram of residuals overlaid with a normal density. Bimodal residuals usually indicate an omitted categorical variable. Skewed residuals indicate a missing transformation. Fat-tailed residuals — too many extreme observations — are the norm in financial return data and indicate that classical \(t\)-tests slightly understate tail risk.
A more sensitive check is a QQ plot (quantile-quantile): plot the empirical quantiles of the residuals against the theoretical quantiles of a standard normal. Under perfect normality the points lie on a 45-degree line. Deviations in the tails are common in financial returns (S-shaped QQ plot = fatter tails than normal). Mild departures do not invalidate OLS inference for large samples, thanks to the CLT, but persistent heavy tails are a warning to use robust standard errors and to be cautious about extrapolating beyond the sample range.
Diagnostic 3: Durbin–Watson for serial correlation
In time-series regressions, the most common assumption violation is serial correlation of residuals: consecutive errors are not independent. The Durbin–Watson statistic, reported at the bottom of the statsmodels summary, ranges from 0 to 4 and approximately equals \(2(1 - \hat\rho_1)\), where \(\hat\rho_1\) is the first-order autocorrelation of the residuals:
- Durbin–Watson \(\approx 2\): no serial correlation. (Good.)
- Durbin–Watson \(< 1.5\): positive serial correlation. (Common in time-series regressions; use HAC standard errors.)
- Durbin–Watson \(> 2.5\): negative serial correlation. (Less common; sometimes signals over-differencing.)
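statsmodels also exposes the statistic directly; a quick check on the fitted residuals (model is assumed to be the CAPM results object):

```python
from statsmodels.stats.stattools import durbin_watson

dw = durbin_watson(model.resid)  # approximately 2 when residuals show no first-order autocorrelation
print(f"Durbin-Watson: {dw:.2f}")
```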
Diagnostic 4: Jarque–Bera for normality
The Jarque–Bera test, also reported in the summary, tests the joint null that the residuals are normally distributed (zero skew, kurtosis equal to 3). A small \(p\)-value rejects normality. In daily return data the Jarque–Bera test almost always rejects, because returns are fat-tailed. This is one reason to default to robust standard errors in time-series regressions on returns.
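The same module provides the test statistic; a sketch under the same assumption about model:

```python
from statsmodels.stats.stattools import jarque_bera

# Returns the statistic, its p-value, and the sample skew and kurtosis of the residuals
jb_stat, jb_pvalue, skew, kurt = jarque_bera(model.resid)
print(f"JB = {jb_stat:.1f}, p = {jb_pvalue:.4g}, skew = {skew:.2f}, kurtosis = {kurt:.2f}")
```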
Diagnostics in code
What we got. Three diagnostic plots plus four diagnostic numbers (JB, skew, kurtosis, Durbin–Watson) — enough to tell whether our SEs can be trusted.
For the NVDA/SPY 2023–2024 regression you should expect: residuals-vs-fitted looking roughly structureless with mild heteroscedasticity (a faint funnel during high-vol weeks), a histogram with visible fat tails, a QQ plot with S-shaped tail deviations, Jarque–Bera rejecting normality, and Durbin–Watson close to 2 (daily returns rarely show strong serial correlation, even though their variances cluster). The standard reaction in real work: keep the OLS point estimates, but report HAC-robust standard errors for inference.
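A minimal sketch of the three plots described under Diagnostics 1 and 2, assuming model is the fitted CAPM results object:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
import statsmodels.api as sm

resid = model.resid
fitted = model.fittedvalues

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Panel 1: residuals vs fitted values; look for funnels (heteroscedasticity) or curvature
axes[0].scatter(fitted, resid, s=8, alpha=0.5)
axes[0].axhline(0, color="red", lw=1)
axes[0].set_title("Residuals vs fitted")

# Panel 2: histogram with a normal density overlay; look for skew and fat tails
axes[1].hist(resid, bins=40, density=True, alpha=0.6)
grid = np.linspace(resid.min(), resid.max(), 200)
axes[1].plot(grid, stats.norm.pdf(grid, resid.mean(), resid.std()), color="red")
axes[1].set_title("Residual histogram")

# Panel 3: QQ plot against the normal; S-shaped tails mean fatter-than-normal tails
sm.qqplot(resid, line="s", ax=axes[2])
axes[2].set_title("QQ plot")

plt.tight_layout()
plt.show()
```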
In the classic four-panel diagnostic display (residuals vs fitted, QQ plot, scale-location, residuals vs leverage), each panel exposes a different assumption violation. A funnel in residuals-vs-fitted, curved tails in the QQ plot, an upward drift in the scale-location panel, and a lone high-leverage point in residuals-vs-leverage are exactly the visual cues that tell you to switch from classical to robust standard errors (and to inspect, and possibly remove, the offending high-leverage point).
Train/test thinking
Where you’ll see this. Any time a YouTube finance creator brags about a strategy with a 95% hit rate, your first question should be: “is that in-sample or out-of-sample?” In-sample brilliance is almost always a mirage. The train/test split is the cheapest possible discipline against fooling yourself, and you’ll use it in every ML, marketing-modelling, or strategy back-test situation you ever touch.
Why split a linear model?
A linear regression is among the least flexible models in the analyst’s toolkit. With only two parameters in simple regression, it has very little capacity to overfit a single sample — and yet train/test thinking still matters, for two distinct reasons:
Coefficient stability. A model that fits one window well may not generalise to another window. In finance, the true relationship between a stock and the market is itself time-varying — NVDA’s beta in 2017 (gaming chips) is not its beta in 2024 (AI accelerators). The only way to detect this is to fit on one sample and evaluate on a different one.
Operational discipline. Every real model gets deployed forward in time. The empirical question is not “how well does this model fit the historical data I trained it on?” but “how well will it predict next month’s NVDA returns given current SPY moves?” That is an out-of-sample question, and you can only answer it by leaving some data out of the fitting step.
The simplest split: chronological
For time-series data the right split is chronological: fit on the first \(k\) years, evaluate on the last \(n - k\) years. Random K-fold cross-validation, which works for cross-sectional data, leaks future information into the past and overstates out-of-sample performance for time series.
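A sketch of the split and the out-of-sample \(R^2\), assuming df is indexed by trading date and holds the excess-return columns spy_ex and nvda_ex for 2023–2024 (all names are assumptions):

```python
import statsmodels.api as sm

# Chronological split: fit on 2023, evaluate on 2024; never shuffle a time series
train, test = df.loc["2023"], df.loc["2024"]

X_train = sm.add_constant(train["spy_ex"])
model_train = sm.OLS(train["nvda_ex"], X_train).fit()

# Predict the test window using the *train* coefficients only
X_test = sm.add_constant(test["spy_ex"])
y_hat = model_train.predict(X_test)

# Out-of-sample R^2 = 1 - SSE/SST, both computed on the test window
sse = ((test["nvda_ex"] - y_hat) ** 2).sum()
sst = ((test["nvda_ex"] - test["nvda_ex"].mean()) ** 2).sum()
r2_oos = 1 - sse / sst
print(f"in-sample R2: {model_train.rsquared:.3f}, out-of-sample R2: {r2_oos:.3f}")
```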
What we got. The out-of-sample \(R^2\) is lower than the in-sample \(R^2\). That gap is honest evidence of how much our nice-looking fit will degrade in the real world.
Two outcomes worth flagging.
- The out-of-sample \(R^2\) is almost always lower than the in-sample \(R^2\), even for a model with two parameters. This is honest: in-sample \(R^2\) is computed at the same parameters that minimised in-sample SSE. Apply those parameters to a different sample and you no longer have the minimising solution.
- The coefficient difference between train and a full-sample fit tells you how stable the relationship is. If the train-window \(\hat\beta\) is 1.88 and the test-window \(\hat\beta\) (computed by re-running OLS on the test window) is 1.65, the relationship is drifting and a single-window estimate is misleading.
Rolling and walk-forward estimation
The natural extension of the train/test split is to slide the train window forward through time. This is rolling-window estimation: at each date \(t\), fit OLS on the most recent \(W\) days, use the resulting \(\hat\beta_t\) for the next day’s hedge or signal, then slide forward by one day and refit. We will treat rolling beta, shrinkage estimators, and Kalman-filter dynamic beta in detail in later chapters; for now, the train/test split is the first taste of the same idea.
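As a first taste, a rolling beta can be computed from the covariance/variance identity without refitting OLS at every date; the 60-day window and column names below are assumptions:

```python
# Rolling 60-day beta: Cov(stock, market) / Var(market) over a sliding window
window = 60
rolling_beta = (
    df["nvda_ex"].rolling(window).cov(df["spy_ex"])
    / df["spy_ex"].rolling(window).var()
)
rolling_beta.plot(title="Rolling 60-day CAPM beta")
```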
Common pitfalls
Where you’ll see this. Most of the regression mistakes that show up in undergraduate projects (and, frankly, in many published papers) are not mistakes in the math — they are mistakes in deciding what data to feed the math. The six pitfalls below are the ones a senior analyst spots within five seconds of looking at your work.
The mechanics of OLS are simple; the misuse of OLS is sophisticated. Six pitfalls every analyst should know.
Pitfall 1: omitted variable bias
If a true predictor is missing from your regression, the OLS coefficients on the included predictors absorb the missing predictor’s effect — biased away from their true values. Formally, if the true model is \(Y = \alpha + \beta X + \gamma Z + \varepsilon\) and you fit \(Y = \alpha + \beta X + u\), then
\[ \mathbb{E}[\hat\beta_{\text{OLS}}] \;=\; \beta \;+\; \gamma \cdot \frac{\mathrm{Cov}(X, Z)}{\mathrm{Var}(X)}. \]
The bias is zero only if \(\gamma = 0\) (the missing variable does not matter) or \(\mathrm{Cov}(X, Z) = 0\) (the missing variable is uncorrelated with the included one). The cure is to include the omitted variable — and the multiple-regression machinery of the next chapter is built precisely to handle this.
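The bias formula is easy to verify by simulation; everything below is synthetic, illustrative data, not the chapter's NVDA/SPY dataset:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 10_000

# True model: Y = 1 + 2*X + 3*Z + noise, with X and Z positively correlated
z = rng.normal(size=n)
x = 0.5 * z + rng.normal(size=n)             # Cov(X, Z) = 0.5, Var(X) = 1.25
y = 1 + 2 * x + 3 * z + rng.normal(size=n)

# Omitting Z biases the slope on X by gamma * Cov(X, Z) / Var(X) = 3 * 0.5 / 1.25 = 1.2
short_fit = sm.OLS(y, sm.add_constant(x)).fit()
print(short_fit.params[1])  # roughly 3.2 instead of the true 2.0
```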
Pitfall 2: endogeneity
If \(X\) and \(\varepsilon\) are correlated — perhaps because \(X\) is itself influenced by \(Y\), or because both are driven by an unobserved third variable — then OLS is biased no matter how large the sample. Examples: regressing wage on years of schooling without controlling for ability, regressing trade volume on price without distinguishing supply shocks from demand shocks. In finance, regressing a portfolio’s future returns on its past returns runs into return-autocorrelation issues. The cure is an identification strategy — instrumental variables, natural experiments, randomised controlled trials — not OLS.
Pitfall 3: outliers and high-leverage points
A single extreme observation — a flash crash, a stock split that wasn’t adjusted, a clerical error — can drag the OLS line toward itself. Two cures: (i) inspect the data with scatter plots and residual plots before trusting the fit; (ii) refit with the suspect point removed and compare. For systematic robustness, replace OLS with a quantile regression or a Huber M-estimator. In real work on return data the most common outliers are unadjusted corporate actions (splits, dividends, mergers) — always work with split- and dividend-adjusted prices.
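Two quick sensitivity checks, sketched under the assumption that y, X, and model are the CAPM design matrix, outcome, and fitted results from the worked example:

```python
import statsmodels.api as sm

# (i) Refit with the single worst-fit observation removed and compare the slope
worst = model.resid.abs().idxmax()
model_trimmed = sm.OLS(y.drop(worst), X.drop(worst)).fit()

# (ii) Huber M-estimator: downweights outliers instead of deleting them
model_huber = sm.RLM(y, X, M=sm.robust.norms.HuberT()).fit()

print(model.params, model_trimmed.params, model_huber.params, sep="\n")
```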
Pitfall 4: spurious regression on non-stationary series
Regressing one upward-trending time series on another upward-trending series will produce a high \(R^2\) and a small \(p\)-value even when the two series are unrelated. This is the spurious regression problem of Granger and Newbold (1974). The cure for financial work is to regress returns on returns, not levels on levels. Prices trend; returns are roughly stationary. CAPM is correctly specified in returns, not in prices.
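A short simulation makes the trap concrete; the two random walks below are synthetic and unrelated by construction:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 1_000

# Two independent random walks: non-stationary, price-like series
p1 = rng.normal(size=n).cumsum()
p2 = rng.normal(size=n).cumsum()

levels_fit = sm.OLS(p1, sm.add_constant(p2)).fit()
returns_fit = sm.OLS(np.diff(p1), sm.add_constant(np.diff(p2))).fit()

print(f"levels R2:  {levels_fit.rsquared:.2f}")   # frequently large despite no true relationship
print(f"returns R2: {returns_fit.rsquared:.2f}")  # near zero, as it should be
```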
Pitfall 5: confusing correlation with slope
Correlation is unitless and bounded between \(-1\) and \(+1\). The slope has units and is unbounded. Two stocks can have identical correlation with the market but very different betas; two stocks can have identical betas but very different correlations. Correlation measures co-movement strength; slope measures exposure. Both are useful; neither replaces the other.
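The two quantities are linked by the scale of the variables through the identity \(\hat\beta = r_{XY}\, s_Y / s_X\); a quick check, with column names assumed as before:

```python
# Correlation is unitless; the slope rescales it by the ratio of standard deviations
r = df["spy_ex"].corr(df["nvda_ex"])
beta_from_corr = r * df["nvda_ex"].std() / df["spy_ex"].std()
# beta_from_corr equals Cov(X, Y) / Var(X), the OLS slope from the worked example
```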
Pitfall 6: extrapolation beyond the sample
A regression fitted on SPY excess returns in the range \([-3\%, +3\%]\) has no information about what NVDA does when SPY drops 10%. The linear approximation may be roughly right out-of-sample, but it may also break down badly. Tail risk in particular is a regime that linear models built on calm-period data systematically underestimate. Always note the range of \(X\) in the training sample and treat predictions far outside that range with caution.
Putting it together: a checklist for real work
Where you’ll see this. Treat the checklist below as the menu of questions a senior analyst or thesis advisor will ask you. If you can answer all five for any regression you ever run, you’ll have already done more careful work than most.
A complete simple-regression analysis answers five questions:
- Specification. What is \(Y\)? What is \(X\)? Are both correctly scaled and time-aligned? For finance: are you regressing excess returns on excess returns?
- Estimation. Use sm.add_constant, then sm.OLS(y, X).fit(). Read the coefficient table for \(\hat\alpha\), \(\hat\beta\), and their standard errors. Read the \(R^2\) for goodness of fit.
- Inference. Compute \(t\)-statistics and \(p\)-values. For directional hypotheses, recentre the test statistic. Use HAC standard errors for time-series regressions.
- Diagnostics. Plot residuals against fitted values, histogram with normal overlay, QQ plot. Check Durbin–Watson, Jarque–Bera, skew, kurtosis. Decide whether the assumptions are good enough for the use case.
- Validation. Split the data chronologically. Fit on the train window, evaluate out-of-sample \(R^2\) on the test window. Compare the train and test coefficients to detect drift.
These five steps are the workflow of every applied regression you will run on the job in finance, marketing, operations, and economics. The mathematics in this chapter is the same mathematics that produces the beta on every Bloomberg terminal, the cost-of-equity in every valuation model at every investment bank, and the hedge ratio on every long/short equity desk in the world.
Goldman Sachs, Morgan Stanley, and JPMorgan equity research desks all run exactly this regression — daily or monthly returns of a stock against a market proxy — when initiating coverage of a new name. The resulting \(\hat\beta\) feeds directly into the discounted-cash-flow valuation: a stock with \(\beta = 2\) receives a higher required return (cost of equity) than a utility with \(\beta = 0.4\), and the higher hurdle rate lowers the implied fair value. Every step in this chapter has a dollar-and-cents application at a real trading desk.
Exercises
Exercise 5.1 — OLS by hand
Using the simulated dataset below, compute \(\hat\alpha\) and \(\hat\beta\) by three methods: (a) the scalar formulae \(\hat\beta = \mathrm{Cov}(X,Y)/\mathrm{Var}(X)\) and \(\hat\alpha = \bar Y - \hat\beta \bar X\); (b) the matrix formula \((\mathbf{X}^\top \mathbf{X})^{-1}\mathbf{X}^\top \mathbf{y}\); and (c) statsmodels.OLS().fit().params. Verify all three agree. Then compute SST, SSR, SSE by hand and verify \(R^2 = \mathrm{SSR}/\mathrm{SST}\) matches model.rsquared.
Exercise 5.2 — CAPM with statsmodels
Using the NVDA/SPY/FF5 dataset loaded in the worked example, repeat the CAPM regression with monthly rather than daily returns. Resample the daily price series to month-end, compute monthly excess returns, and fit CAPM. Compare the monthly \(\hat\beta\) and \(R^2\) with the daily values. Are they similar? Why might they differ?
Exercise 5.3 — Hypothesis tests and confidence intervals
For the daily CAPM regression of NVDA on SPY 2023–2024, conduct the following three tests at the 5% level and state your conclusion for each:
- \(H_0: \beta = 0\) vs \(H_a: \beta \ne 0\) (two-tailed). Is NVDA exposed to the market?
- \(H_0: \beta \le 1\) vs \(H_a: \beta > 1\) (one-tailed). Is NVDA more aggressive than the market?
- \(H_0: \alpha = 0\) vs \(H_a: \alpha \ne 0\) (two-tailed). Does NVDA show statistically significant Jensen’s alpha?
For each test, report the relevant \(t\)-statistic, the \(p\)-value, and the 95% confidence interval. Use the HAC (Newey–West, 5 lags) standard errors for the inference.
Exercise 5.4 — Residual diagnostics
Fit CAPM on NVDA/SPY for 2023–2024. Produce the three-panel diagnostic plot (residuals vs fitted, histogram with normal overlay, QQ plot). Compute and report: Durbin–Watson, Jarque–Bera \(p\)-value, residual skew, residual kurtosis. State whether each of the four LINE assumptions appears plausible based on the diagnostics, and recommend an appropriate covariance type for the inference (nonrobust, HC3, or HAC).
Exercise 5.5 — Train/test split and beta stability
Split the NVDA/SPY/FF5 daily data into two halves — the 2023 calendar year as the train window, the 2024 calendar year as the test window. Fit CAPM on the train window only. Report:
- The train-window \(\hat\beta\), \(\hat\alpha\), and \(R^2\).
- The test-window \(\hat\beta\) computed by re-fitting OLS on the test window alone.
- The out-of-sample \(R^2\) obtained by using the train coefficients to predict the test outcomes: \[ R^2_{\text{oos}} = 1 - \frac{\sum_{t \in \text{test}} (y_t - \hat y_t)^2}{\sum_{t \in \text{test}} (y_t - \bar y_{\text{test}})^2}. \]
Comment on whether NVDA’s beta is stable across years and what the gap between in-sample \(R^2\) and out-of-sample \(R^2\) implies for forward use of the fitted model.
Exercise 5.6 — A defensive stock
Repeat the worked CAPM example with a hypothetical defensive stock — say a utilities ETF such as XLU — whose beta is well below 1. (You may use the price file at https://busanalytics-book.pages.dev/data/xlu_daily_2023_2024.csv if available, or simulate one whose data-generating process is \(R_i - R_f = 0 + 0.45 (R_m - R_f) + \varepsilon\) with \(\varepsilon \sim \mathcal{N}(0, 0.008^2)\).) Compare the resulting \(\hat\beta\), \(\hat\alpha\), \(R^2\), and residual standard error with NVDA’s. In a paragraph, explain what the difference in \(\beta\) implies for hedge sizing on a \(\$10\)M position, and what the difference in \(R^2\) implies for diversification benefit.
You now have the full simple-regression toolkit: OLS in scalar and matrix form, \(R^2\) as the fraction of variance explained, \(t\)-tests and confidence intervals with both classical and robust standard errors, residual diagnostics, and chronological train/test splits. The CAPM is the canonical application: regress a stock’s excess return on the market’s, and the two coefficients tell you the stock’s market exposure and its skill-adjusted alpha. In the next chapter we extend to multiple regression and meet the Fama–French three- and five-factor models — the same machinery, with more columns in the design matrix.