Appendix A: A Minimum of Linear Algebra
The regression chapters (Ch 3, Ch 4) and the alpha-model chapter (Ch 5) use matrix and vector notation in a few key places — most importantly the OLS formula \[ \hat{\boldsymbol{\beta}} \;=\; \bigl(\mathbf{X}^{\!\top}\!\mathbf{X}\bigr)^{-1}\mathbf{X}^{\!\top}\!\mathbf{y}. \] If you have never seen this kind of object before, the symbols can feel like alphabet soup. Don’t panic. A matrix is nothing more exotic than a small spreadsheet of numbers, and every operation we will do on it is something Excel already does — we are only giving the operations short symbolic names. This appendix introduces only what is needed to read those chapters and to run the Python code that implements them — nothing more. You can finish it in about an hour, with zero prior exposure to linear algebra. By the end you will be able to:
- Read a vector and a matrix and say what their shape is.
- Compute a matrix–vector and a matrix–matrix product by hand on small examples.
- Translate the OLS formula into NumPy code and check the answer numerically.
- Recognise when a matrix is invertible and when it is not.
- Read the Fama–French regression \(\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}\) as a sentence about stocks and factors, not just notation.
No proofs. No abstract vector spaces. No determinants. Just the four objects (vector, matrix, transpose, inverse) and the four operations (addition, scalar multiply, matrix multiply, solving a linear system) that actually appear in the book.
Why we need linear algebra in business analytics
Why this matters in the rest of the book. Almost every quantitative model you will meet — CAPM, Fama–French, Ridge, even neural networks — boils down to “multiply a spreadsheet of data by a list of weights.” Linear algebra is just the shorthand for that one move, written so compactly that a 5-factor model and a 500-factor model look identical on the page.
In Chapter 3 you fit a line through a scatter plot of stock returns versus market returns. There were two unknowns: the intercept \(\hat\alpha\) and the slope \(\hat\beta\). With only two unknowns you can write down the formulae in scalar form — no matrices needed:
\[ \hat\beta = \frac{\sum_t (x_t - \bar x)(y_t - \bar y)}{\sum_t (x_t - \bar x)^2}, \qquad \hat\alpha = \bar y - \hat\beta \bar x. \]
In Chapter 4 you fit a five-factor regression: an intercept plus loadings on Market, SMB, HML, RMW, CMA. Six unknowns. You could in principle write down six separate scalar formulae, but they would each be a half-page of summations and they would obscure the underlying pattern. Worse, the formula would need to be re-derived from scratch the moment you wanted a sixth factor. Linear algebra is the language that says it all once, for any number of factors:
\[ \hat{\boldsymbol{\beta}} \;=\; (\mathbf{X}^{\!\top}\!\mathbf{X})^{-1}\mathbf{X}^{\!\top}\!\mathbf{y}. \]
This one formula handles the 1-factor CAPM, the 5-factor FF5, and a hypothetical 50-factor model with no change. That is the whole reason linear algebra is the lingua franca of statistics, econometrics, machine learning, and risk modelling.
Strategy for this appendix. Read each section once for intuition, then run the live pyodide cell that follows it. The cells in this appendix implement the same formulae used in Chapters 3–5, so you will be working with the same objects you will see in the main text. Whenever you feel lost in those chapters, come back here and re-read the matching section.
Section 1: Vectors
Why this matters in the rest of the book. Every column in your data — the 252 daily returns of SPY, the monthly Market factor, your portfolio weights — is a vector. When you see bold lowercase like \(\mathbf{y}\) or \(\boldsymbol{\beta}\), just picture a single column in Excel.
Before reading further, watch this Grant Sanderson clip. It builds the geometric intuition for what a vector is — and the rest of the appendix will land much faster once that picture is in your head.
If you want the whole Essence of Linear Algebra series (15 short videos), it lives at https://www.3blue1brown.com/topics/linear-algebra. Chapters 1–4 of the series are the best 45-minute investment in linear-algebra intuition you can make.
What a vector is
A vector is just an ordered list of numbers — a single Excel column, or a Python list. Nothing more mysterious than that. In data analysis the list is usually one observation per row, so a vector for monthly SPY excess returns over 5 months might look like
\[ \mathbf{y} \;=\; \begin{pmatrix} 0.012 \\ -0.003 \\ 0.025 \\ -0.018 \\ 0.041 \end{pmatrix}. \]
The boldface \(\mathbf{y}\) is a vector; the plain \(y_t\) refers to its \(t\)-th entry. We say \(\mathbf{y}\) has length (or dimension) 5, written \(\mathbf{y} \in \mathbb{R}^5\). By convention vectors are written as columns in mathematical notation. When you store one in Python with NumPy it is a 1-D array, and its .shape will tell you the length:
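A minimal sketch of the cell, using the illustrative returns from the display above:

```python
import numpy as np

# the five monthly SPY excess returns shown above
y = np.array([0.012, -0.003, 0.025, -0.018, 0.041])

print(y.shape)   # (5,)   -- a 1-D array with 5 elements
print(len(y))    # 5
print(y[2])      # 0.025  -- NumPy indexes from 0, so y[2] is the maths' y_3
```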
What we just did: built a 5-number vector and asked NumPy for its shape, length, and one specific entry.
On paper a vector is always drawn as a tall column with one number per row, so mathematically \(\mathbf{y}\) has shape \(5 \times 1\). NumPy, however, stores a plain 1-D array whose .shape is (5,) — note the lonely comma, meaning “5 elements, no second dimension.” Most of the time this difference doesn’t matter, but the moment you start mixing matrices and vectors you may see shape mismatches. If you ever need an explicit column vector, write y.reshape(-1, 1) to get shape (5, 1). Rule of thumb: if a matrix multiplication complains about shapes, print .shape of every operand first.
Mathematicians and statisticians usually index from 1: \(y_1, y_2, \ldots, y_n\). Python (and NumPy) indexes from 0: y[0], y[1], ..., y[n-1]. The two communities will collide on this every time you read a paper and then implement it. Read mathematics with the 1-based eye, type code with the 0-based eye, and triple-check loop bounds.
Anatomy of a vector
A column of numbers — drawn explicitly, with arrows pointing at every named piece. Look at this picture once and you’ll never confuse the column-vs-row convention again.
Left: a vector is just a column of \(n\) numbers, with a name in bold. Right: in 2D you can also picture it as an arrow from the origin — that’s where the geometric intuition behind “length” and “angle between vectors” comes from.
Vector addition and scalar multiplication
You can add two vectors of the same length by adding entry-by-entry:
\[ \begin{pmatrix} 1 \\ 2 \\ 3 \end{pmatrix} + \begin{pmatrix} 10 \\ 20 \\ 30 \end{pmatrix} \;=\; \begin{pmatrix} 11 \\ 22 \\ 33 \end{pmatrix}. \]
You can multiply a vector by a single number (a scalar), which scales each entry:
\[ 3 \cdot \begin{pmatrix} 1 \\ 2 \\ 3 \end{pmatrix} \;=\; \begin{pmatrix} 3 \\ 6 \\ 9 \end{pmatrix}. \]
These two operations together are the only ones you need to define a linear combination — and a linear combination is what a regression is. The fitted return \(\hat y_t\) in a multi-factor model is a linear combination of the factor returns at month \(t\):
\[ \hat y_t = \hat\alpha + \hat\beta_{\text{Mkt}} \text{Mkt}_t + \hat\beta_{\text{SMB}} \text{SMB}_t + \hat\beta_{\text{HML}} \text{HML}_t + \cdots . \]
That is just a scalar-times-vector sum.
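A minimal sketch of the cell, reusing the two small vectors from the displays above:

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([10, 20, 30])

print(a + b)            # [11 22 33]     entry-by-entry addition
print(3 * a)            # [3 6 9]        scalar multiplication
print(2 * a + 0.5 * b)  # [ 7. 14. 21.]  a linear combination: a weighted sum of vectors
```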
What we just did: added two vectors entry-by-entry, scaled one by 3, and combined them into a single weighted sum — the building block of every regression prediction.
The dot product
Here comes the smallest, most useful operation in the whole appendix. The dot product takes two same-length lists of numbers and crunches them into one number — “pair them up, multiply each pair, add all the products.” That single number will end up being a prediction, a mean, or a sum of squares depending on what we plug in.
Given two vectors \(\mathbf{a}\) and \(\mathbf{b}\) of the same length \(n\), their dot product (or inner product) is the single number
\[ \mathbf{a} \cdot \mathbf{b} \;=\; \sum_{i=1}^{n} a_i\, b_i \;=\; a_1 b_1 + a_2 b_2 + \cdots + a_n b_n. \]
This is the single most-used operation in all of statistics. The fitted value of a multi-factor regression for one month, written in plain summation form, is
\[ \hat y_t = \hat\alpha + \hat\beta_1 x_{t,1} + \hat\beta_2 x_{t,2} + \cdots + \hat\beta_p x_{t,p} \;=\; \mathbf{x}_t \cdot \hat{\boldsymbol{\beta}}, \]
where we have packed the slopes (and the intercept) into the vector \(\hat{\boldsymbol{\beta}}\) and the factor values for month \(t\) into the vector \(\mathbf{x}_t\) (with a 1 in the first slot to pick up the intercept). The dot product is the regression prediction.
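A minimal sketch of the cell. The factor values and betas here are illustrative numbers, not estimates from the book's data:

```python
import numpy as np

# one month's factor row: [1 (intercept slot), Mkt, SMB]
x_t  = np.array([1.0, 0.015, -0.002])
# fitted coefficients: [alpha, beta_Mkt, beta_SMB]
beta = np.array([0.002, 1.05, -0.25])

print(x_t @ beta)         # 0.01825 = 0.002 + 1.05*0.015 + (-0.25)*(-0.002)
print(np.dot(x_t, beta))  # identical: @ and np.dot agree for vectors
```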
What we just did: turned a row of factor values and a vector of betas into the model’s single-month return prediction with one multiplication.
NumPy offers np.dot(a, b) and the @ operator. Both do the same thing for vectors. @ is shorter and recommended; it was added to Python specifically for matrix multiplication.
Why the dot product matters
Three facts about the dot product will appear over and over in the book.
- Means are dot products. \(\bar y = \tfrac{1}{n} \sum_i y_i = \tfrac{1}{n}\,\mathbf{1} \cdot \mathbf{y}\), where \(\mathbf{1}\) is the all-ones vector.
- Squared lengths are dot products. The sum of squares \(\sum_i y_i^2 = \mathbf{y} \cdot \mathbf{y}\).
- Correlations are dot products. After centring and scaling, the Pearson correlation between \(\mathbf{x}\) and \(\mathbf{y}\) is the dot product of the standardised vectors.
So when you read \(\mathbf{a}^\top \mathbf{b}\) in a paper, mentally translate it as “the sum of entry-by-entry products.” That covers 90% of the cases.
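All three facts are easy to verify numerically. A minimal sketch, with an illustrative second series x for the correlation check:

```python
import numpy as np

y = np.array([0.012, -0.003, 0.025, -0.018, 0.041])
x = np.array([0.010, -0.001, 0.020, -0.015, 0.030])   # illustrative second series
ones = np.ones(len(y))

print(ones @ y / len(y), y.mean())   # a mean is a dot product
print(y @ y, (y ** 2).sum())         # a sum of squares is a dot product

xs = (x - x.mean()) / x.std()        # standardise both series...
ys = (y - y.mean()) / y.std()
print(xs @ ys / len(y), np.corrcoef(x, y)[0, 1])   # ...and correlation is a dot product
```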
Section 2: Matrices
Why this matters in the rest of the book. The design matrix \(\mathbf{X}\) — one row per month, one column per factor — is the single object that all of OLS, Ridge, LASSO, and CAPM operate on. Once you can read \(\mathbf{X}\) as “a small spreadsheet of factor returns,” everything else is just rearranging that spreadsheet.
What a matrix is
A matrix is a rectangle of numbers — a small Excel sheet, or a list of equal-length lists. We write it with bold capitals: \(\mathbf{X}\). Its shape is given as rows × columns, e.g. an \(n \times p\) matrix has \(n\) rows and \(p\) columns (think: \(n\) rows of data, \(p\) columns of variables). In a regression, the design matrix \(\mathbf{X}\) stacks one row per observation and one column per regressor:
\[ \mathbf{X} \;=\; \begin{pmatrix} 1 & x_{1,1} & x_{1,2} & \cdots & x_{1,p} \\ 1 & x_{2,1} & x_{2,2} & \cdots & x_{2,p} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n,1} & x_{n,2} & \cdots & x_{n,p} \end{pmatrix}. \]
The first column is all-ones to encode the intercept. The remaining columns hold the regressors. So if you have \(n = 120\) months and 5 factors (Mkt, SMB, HML, RMW, CMA), \(\mathbf{X}\) is a \(120 \times 6\) matrix (5 factor columns + 1 intercept column). To keep the notation light, from here on \(p\) counts all columns of \(\mathbf{X}\), intercept included, so in this example \(p = 6\).
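A minimal sketch of the cell. The factor returns are randomly generated stand-ins, just to practise the shapes:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6                                          # six months of fake data
factors = rng.normal(0.0, 0.02, size=(n, 3))   # three fake factor columns

X = np.column_stack([np.ones(n), factors])     # glue the intercept column on the left
print(X.shape)   # (6, 4) -- 6 rows (months), 1 intercept + 3 factor columns
print(X[:2])     # the first two rows each start with 1.0
```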
What we just did: built a small design matrix by gluing a column of 1’s (for the intercept) to three columns of fake factor returns.
Anatomy of a matrix
The same picture as for vectors, but in 2D. The diagram below is a 4×3 matrix with every named part labeled.
Memorise the rows-then-columns convention; it carries through every NumPy .shape call and every regression diagnostic.
Matrix addition and scalar multiplication
Same as vectors: entry-by-entry. The two matrices being added must have the same shape (same number of rows and same number of columns — you can’t add a 3×2 sheet to a 4×2 sheet).
The transpose
The transpose is just flipping the spreadsheet on its side: rows become columns, columns become rows. Nothing is added, deleted, or recomputed; the same numbers are re-laid out. We write it \(\mathbf{X}^\top\) (or sometimes \(\mathbf{X}'\)).
\[ \mathbf{X} = \begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{pmatrix} \quad\Longrightarrow\quad \mathbf{X}^\top = \begin{pmatrix} 1 & 4 \\ 2 & 5 \\ 3 & 6 \end{pmatrix}. \]
If \(\mathbf{X}\) is \(n \times p\) then \(\mathbf{X}^\top\) is \(p \times n\). In NumPy: X.T.
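A minimal sketch of the cell, using the 2×3 example above:

```python
import numpy as np

X = np.array([[1, 2, 3],
              [4, 5, 6]])

print(X.shape)     # (2, 3)
print(X.T)         # [[1 4], [2 5], [3 6]]
print(X.T.shape)   # (3, 2) -- same six numbers, rows and columns swapped
```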
What we just did: flipped a 2×3 sheet on its side and got a 3×2 sheet — same numbers, different layout.
The transpose is the single most-frequent piece of bookkeeping you will do in the matrix algebra of regression. The reason is that matrix multiplication requires the inner dimensions to match — and the transpose is how you fix a mismatch.
Section 3: Matrix multiplication
Why this matters in the rest of the book. Whenever you see \(\mathbf{X}\boldsymbol{\beta}\) in this book, that’s matrix multiplication. It’s how a multi-factor model produces 252 predictions in one line: \(\hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol{\beta}}\), no for-loops needed.
Important from the start: matrix multiplication is not element-by-element (it is not the obvious “multiply matching cells”). It is a careful “row of left times column of right” procedure. The video clip we embed below makes this picture obvious — watch it before reading the rule below.
Why matrix multiplication is not commutative becomes obvious once you see each matrix as a transformation of space. This is the right mental picture for everything that follows.
— 3Blue1Brown — Essence of Linear Algebra Ch 4
The rule
To multiply matrices \(\mathbf{A}\) (shape \(m \times k\)) and \(\mathbf{B}\) (shape \(k \times p\)), the inner dimensions must match (both \(k\)). The product \(\mathbf{A}\mathbf{B}\) has shape \(m \times p\), and its \((i,j)\) entry is the dot product of row \(i\) of \(\mathbf{A}\) with column \(j\) of \(\mathbf{B}\):
\[ (\mathbf{A}\mathbf{B})_{ij} \;=\; \sum_{\ell=1}^{k} a_{i,\ell}\, b_{\ell,j}. \]
In words: the \((i,j)\) entry of the product is “row \(i\) of \(\mathbf{A}\) dot column \(j\) of \(\mathbf{B}\).” This is the operation you will perform mentally the most.
The shape rule for \(\mathbf{A}\mathbf{B}\) is \((m \times k) \cdot (k \times p) \to (m \times p)\). Read it as: the two inner numbers (both \(k\) here) must agree, and they disappear; the two outer numbers (\(m\) and \(p\)) survive and give you the shape of the product. So a \(3 \times 2\) times a \(2 \times 5\) gives a \(3 \times 5\) (the two 2’s match and vanish). A \(3 \times 2\) times a \(5 \times 2\) is undefined — the inner numbers (2 and 5) don’t match. When in doubt, print every .shape before you multiply.
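A minimal sketch of the cell, with small illustrative matrices:

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4],
              [5, 6]])       # 3 x 2
B = np.array([[1, 0, 2],
              [0, 1, 3]])    # 2 x 3

C = A @ B                    # inner 2's match and vanish -> C is 3 x 3
print(C.shape)               # (3, 3)
print(C[0, 2])               # 8 = 1*2 + 2*3: row 0 of A dot column 2 of B
```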
What we just did: multiplied a 3×2 by a 2×3 (inner 2’s match) to get a 3×3, then verified one cell with pencil-and-paper arithmetic.
Anatomy of matrix multiplication
The picture below shows the shape rule and the “row-of-A dot column-of-B” recipe in a single diagram. \(A\) is 4×3, \(B\) is 3×2, so \(C = AB\) is 4×2. The dashed boxes highlight row 1 of \(A\) and column 1 of \(B\) — their dot product fills cell \(C_{1,1}\).
The two “inner” dimensions must match and they vanish into the dot product. The two “outer” dimensions survive and give you the shape of the product. Cell \(C_{ij}\) is always “row \(i\) of \(A\) dotted with column \(j\) of \(B\).”
\(\mathbf{A}\mathbf{B}\) and \(\mathbf{B}\mathbf{A}\) are usually not the same matrix, and one of the two products may not even be defined. Matrix multiplication is not commutative. Always check shapes first.
The product that matters: \(\mathbf{X}\boldsymbol{\beta}\)
This is the product you will see more than any other in the book. Read it as: “take your spreadsheet of factor returns \(\mathbf{X}\), weight each column by the matching beta, sum across columns row-by-row, and out pops one predicted return per row.” One line of math, \(n\) predictions.
In a regression, you have the design matrix \(\mathbf{X}\) (\(n \times p\)) and the parameter vector \(\boldsymbol{\beta}\) (\(p \times 1\)). Their product is the vector of fitted values:
\[ \hat{\mathbf{y}} \;=\; \mathbf{X}\boldsymbol{\beta}. \]
Shape check: \(n \times p\) times \(p \times 1\) gives \(n \times 1\). Exactly \(n\) fitted values, one per observation. Each entry of \(\hat{\mathbf{y}}\) is row \(i\) of \(\mathbf{X}\) dotted with \(\boldsymbol{\beta}\) — i.e. the prediction for observation \(i\).
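A minimal sketch of the cell, with an illustrative 3-month design matrix and beta vector:

```python
import numpy as np

X = np.array([[1.0,  0.012, -0.004],
              [1.0, -0.010,  0.003],
              [1.0,  0.015,  0.001]])   # 3 months x (intercept + Mkt + SMB)
beta = np.array([0.002, 0.9, -0.2])     # [alpha, beta_Mkt, beta_SMB], illustrative

y_hat = X @ beta   # (3, 3) @ (3,) -> (3,): one predicted return per month
print(y_hat)       # [ 0.0136 -0.0076  0.0153]
```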
What we just did: turned a 3-month design matrix and a beta vector into three predicted returns — one matrix product, no loops.
\(\mathbf{X}\) is \(n \times p\) (say \(120 \times 6\)); \(\boldsymbol{\beta}\) is \(p \times 1\) (\(6 \times 1\)). The product \(\mathbf{X}\boldsymbol{\beta}\) has inner dimensions \(p\) and \(p\) — they match, so the product exists and has shape \(n \times 1\) (120 predicted returns). Flip the order to \(\boldsymbol{\beta}\mathbf{X}\) and you ask for \((6 \times 1) \cdot (120 \times 6)\) — inner dimensions 1 and 120, mismatch, undefined. This is why “matrix multiplication is not commutative” matters in practice: not only is the answer different, often one direction is illegal. Always write the data matrix on the left and the parameter vector on the right.
Two products you will see in OLS
The OLS formula uses two specific matrix products. Both should be unambiguous once you know the shape arithmetic.
| Expression | Shape arithmetic | Result shape | What it represents |
|---|---|---|---|
| \(\mathbf{X}^{\top}\mathbf{X}\) | \(p \times n\) times \(n \times p\) | \(p \times p\) | Sums of squares & cross-products of the regressors |
| \(\mathbf{X}^{\top}\mathbf{y}\) | \(p \times n\) times \(n \times 1\) | \(p \times 1\) | Sums of cross-products of regressors with \(\mathbf{y}\) |
Notice that \(\mathbf{X}^\top\mathbf{X}\) is always square (\(p \times p\)). That is critical: only square matrices can have an inverse.
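A quick shape check of both products, on a randomly generated 120 × 6 design matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(120), rng.normal(size=(120, 5))])   # 120 x 6
y = rng.normal(size=120)

print((X.T @ X).shape)   # (6, 6) -- p x p, always square
print((X.T @ y).shape)   # (6,)   -- one entry per regressor
```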
Section 4: The identity matrix
Why this matters in the rest of the book. Whenever Chapter 5’s Ridge formula writes “\(+ \lambda \mathbf{I}\)”, that \(\mathbf{I}\) is just a place-holder that lets us add a scalar to a matrix. Without it, the formula would be a type error — like trying to add the number 5 to an Excel sheet.
The identity matrix \(\mathbf{I}_p\) is the “do-nothing” matrix — the matrix-world equivalent of multiplying by 1 in ordinary arithmetic. It is the \(p \times p\) matrix with 1s on the diagonal and 0s elsewhere:
\[ \mathbf{I}_3 \;=\; \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}. \]
It plays the role of the number 1 for matrices: \(\mathbf{I}\mathbf{A} = \mathbf{A}\mathbf{I} = \mathbf{A}\) for any compatible \(\mathbf{A}\). In Chapter 5 you will see \(\mathbf{I}\) appear in the Ridge regression formula \[ \hat{\boldsymbol{\beta}}_{\text{Ridge}} \;=\; (\mathbf{X}^\top\mathbf{X} + \lambda \mathbf{I})^{-1}\mathbf{X}^\top\mathbf{y}, \] where the \(\lambda \mathbf{I}\) term shrinks the coefficients toward zero. The \(\mathbf{I}\) is there because \(\lambda\) is a scalar and you cannot directly add a scalar to a matrix — you have to first lift it to a “scalar-times-identity” matrix of the right shape.
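A minimal sketch of the cell:

```python
import numpy as np

I = np.eye(4)
print(I)                          # 1's on the diagonal, 0's everywhere else

A = np.arange(16.0).reshape(4, 4)
print(np.array_equal(A @ I, A))   # True: multiplying by I really does nothing
```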
What we just did: printed the 4×4 identity — 1’s on the diagonal, 0’s everywhere else.
Section 5: The inverse of a matrix
Why this matters in the rest of the book. Every closed-form regression formula — OLS, Ridge, GLS — contains a matrix inverse. If the inverse exists and is well-behaved, your coefficients are stable; if not, your standard errors blow up. Understanding when the inverse can fail is the core diagnostic for multicollinearity in Chapter 4.
Intuition
The inverse is the “undo” matrix — the matrix equivalent of dividing by a number. For a non-zero number \(a\), the inverse is the number \(a^{-1} = 1/a\) such that \(a \cdot a^{-1} = 1\). For a square matrix \(\mathbf{A}\), the inverse \(\mathbf{A}^{-1}\) is the unique matrix such that \[ \mathbf{A}\mathbf{A}^{-1} \;=\; \mathbf{A}^{-1}\mathbf{A} \;=\; \mathbf{I}. \]
Just as you can’t divide by zero with ordinary numbers, you can’t always invert a matrix. Only square matrices can have an inverse, and even then, the matrix must be full rank (also called invertible or non-singular) — meaning no column is a redundant copy or weighted sum of the others. We will sharpen this definition in the next subsection.
When does the inverse exist?
A square matrix \(\mathbf{A}\) is invertible if and only if its rows (equivalently, its columns) are linearly independent — no row can be written as a combination of the others. In the regression context, the matrix \(\mathbf{X}^\top\mathbf{X}\) is invertible if and only if the columns of the design matrix \(\mathbf{X}\) are linearly independent.
That last condition fails in three common ways:
- Perfect collinearity — one regressor is an exact multiple of another (e.g. one column is the return in % and another is the same return in basis points, which is exactly 100 times the first). The \(\mathbf{X}^\top\mathbf{X}\) matrix becomes singular and \((\mathbf{X}^\top\mathbf{X})^{-1}\) does not exist. NumPy will raise LinAlgError.
- Near-collinearity (multicollinearity) — the columns are not exactly proportional but are highly correlated. \((\mathbf{X}^\top\mathbf{X})^{-1}\) exists but is numerically unstable; the coefficient standard errors balloon. This is the situation in Chapter 4 with HML and CMA, where the VIF exceeded 10. The fix is not “use a different inverse”; it is to interpret the regressors jointly, drop one, or regularise.
- More columns than rows (\(p > n\)) — there are more unknowns than equations. The OLS problem is under-determined. The remedy is regularisation (Ridge, LASSO; Chapter 5).
The whole reason multicollinearity hurts inference but not prediction is that the inverse \((\mathbf{X}^\top\mathbf{X})^{-1}\) multiplies the cross-product \(\mathbf{X}^\top\mathbf{y}\) in the OLS formula. When the inverse is unstable, the direction of \(\hat{\boldsymbol{\beta}}\) becomes uncertain — but the fitted values \(\hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol{\beta}}\) are still well-defined because they only need the projection, not the parameter direction. This is exactly the asymmetry you saw in Ch 4: bad CIs on each \(\hat{\beta}_k\), good adj-\(R^2\) and joint \(F\).
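Here is a minimal synthetic demonstration of that asymmetry (illustrative data, not the Chapter 4 dataset; it uses np.linalg.solve, introduced formally in the next subsection). Two nearly identical regressors give unstable individual slopes, yet their sum and the fitted values stay well-behaved:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=1e-4, size=n)       # x2 is almost a copy of x1
y  = x1 + x2 + rng.normal(scale=0.1, size=n)   # true slopes are (1, 1)

X = np.column_stack([np.ones(n), x1, x2])
beta = np.linalg.solve(X.T @ X, X.T @ y)

print(beta[1], beta[2])      # individual slopes: unstable, can land far from (1, 1)...
print(beta[1] + beta[2])     # ...but their sum is close to 2
print(np.std(y - X @ beta))  # ...and the residual spread is still about 0.1
```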
Don’t compute the inverse — solve the system
Quick framing before the code: although the formula on paper reads “invert this matrix, then multiply,” that is not how anyone actually computes it. We hand the whole system to a single function (np.linalg.solve) that finds \(\hat{\boldsymbol{\beta}}\) in one shot, without ever building the inverse. Faster, and far less likely to blow up numerically.
In practice you almost never compute \((\mathbf{X}^\top\mathbf{X})^{-1}\) explicitly. The OLS formula is \[
\hat{\boldsymbol{\beta}} \;=\; (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y},
\] which can be rewritten as: “find the \(\hat{\boldsymbol{\beta}}\) that satisfies \((\mathbf{X}^\top\mathbf{X})\hat{\boldsymbol{\beta}} = \mathbf{X}^\top\mathbf{y}\).” This is a linear system of \(p\) equations in \(p\) unknowns. The right way to solve it is np.linalg.solve, not np.linalg.inv. The solve route is faster and far more numerically stable.
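A minimal sketch of the cell, on a randomly generated, well-conditioned problem:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 120
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
y = rng.normal(size=n)

XtX, Xty = X.T @ X, X.T @ y
beta_inv   = np.linalg.inv(XtX) @ Xty    # textbook invert-then-multiply
beta_solve = np.linalg.solve(XtX, Xty)   # solve the system directly

print(np.max(np.abs(beta_inv - beta_solve)))   # ~1e-16: identical on a clean problem
```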
What we just did: computed \(\hat{\boldsymbol{\beta}}\) two ways — once via the textbook “invert-then-multiply” recipe, once via solve — and showed they agree to machine precision on a clean problem.
Both routes give the same answer here, but for ill-conditioned \(\mathbf{X}^\top\mathbf{X}\) (multicollinearity) the explicit-inverse route can lose 4–6 digits of precision while solve stays accurate.
When inv fails and when solve succeeds
np.linalg.inv(A) will outright raise LinAlgError when \(\mathbf{A}\) is exactly singular (perfectly collinear columns), and will silently return garbage when \(\mathbf{A}\) is merely near-singular (severe multicollinearity). np.linalg.solve(A, b) is better behaved: it factorises \(\mathbf{A}\) (an LU decomposition) and solves for the specific right-hand side directly, instead of forming the full inverse and then multiplying. It will still raise an error on exact singularity, but it loses far fewer digits of precision on near-singular problems. Rule of thumb: if you wrote np.linalg.inv(M) @ v in your code, replace it with np.linalg.solve(M, v) — same math, fewer surprises.
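A minimal demonstration of the exactly-singular case (the near-singular precision loss is real but harder to show in two lines):

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [2.0, 4.0]])   # row 2 = 2 x row 1, so A is exactly singular

try:
    np.linalg.inv(A)
except np.linalg.LinAlgError as err:
    print("inv raised:", err)

try:
    np.linalg.solve(A, np.array([1.0, 2.0]))
except np.linalg.LinAlgError as err:
    print("solve raised:", err)   # solve also refuses exact singularity
```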
Section 6: Putting it all together — the OLS formula
Why this matters in the rest of the book. This is the formula behind every regression you will run for the rest of the course — and most of the regressions you will ever run in your career.
You now have every piece you need to read the OLS formula. Here comes the most-used formula in the whole book. Don’t try to memorise it. Just remember it is the recipe that takes data \(\mathbf{X}\) and \(\mathbf{y}\) as inputs and returns the best-fit coefficients \(\hat{\boldsymbol{\beta}}\) as output. Everything to the right of the equals sign is built from objects you already know — transposes, matrix products, an inverse.
\[ \boxed{\;\hat{\boldsymbol{\beta}} \;=\; \bigl(\mathbf{X}^\top\mathbf{X}\bigr)^{-1}\mathbf{X}^\top\mathbf{y}\;} \]
Reading it left to right:
- \(\mathbf{y}\) is the \(n\)-vector of observed responses (e.g. monthly excess returns of one stock).
- \(\mathbf{X}\) is the \(n \times p\) design matrix (one row per month, one column per factor + 1 column for the intercept).
- \(\mathbf{X}^\top \mathbf{y}\) is a \(p\)-vector: each entry is the sum across months of one regressor × the response. It captures how each regressor co-moves with \(\mathbf{y}\).
- \(\mathbf{X}^\top \mathbf{X}\) is a \(p \times p\) matrix: its \((j,k)\) entry is the sum across months of regressor \(j\) × regressor \(k\). It captures how the regressors co-move with each other.
- \((\mathbf{X}^\top\mathbf{X})^{-1}\) disentangles those internal correlations.
- The product \((\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}\) is the unique vector of coefficients \(\hat{\boldsymbol{\beta}}\) that minimises the squared-error loss \[ \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2 \;=\; \sum_{i=1}^{n} \bigl(y_i - \mathbf{x}_i \cdot \boldsymbol{\beta}\bigr)^2. \]
Once you have \(\hat{\boldsymbol{\beta}}\), the fitted values are \(\hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol{\beta}}\) and the residuals are \(\hat{\boldsymbol{\varepsilon}} = \mathbf{y} - \hat{\mathbf{y}}\).
Anatomy of the OLS formula
The single most important formula in this book — with every piece named.
One formula, five named parts. Internalise this picture — the rest of the course is variations on it (Ridge, GLS, IV, panel-data fixed-effects).
A full worked example
We will fit a two-factor model (intercept + Market + SMB, three coefficients in all) by hand using nothing but matrix operations, then compare to statsmodels.
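A self-contained sketch on synthetic monthly returns (illustrative, not the book's dataset), with true coefficients baked in so you can check that the estimates land near them:

```python
import numpy as np

# synthetic monthly data -- illustrative, not the book's dataset
rng = np.random.default_rng(42)
n = 60
mkt = rng.normal(0.008, 0.04, n)   # fake market excess returns
smb = rng.normal(0.002, 0.02, n)   # fake size-factor returns
y = 0.001 + 1.10 * mkt - 0.30 * smb + rng.normal(0.0, 0.02, n)

# 1. design matrix: intercept column + the two factors
X = np.column_stack([np.ones(n), mkt, smb])   # 60 x 3

# 2. the two OLS building blocks
XtX = X.T @ X   # 3 x 3
Xty = X.T @ y   # (3,)

# 3. solve (X'X) beta = X'y -- no explicit inverse
beta_hat = np.linalg.solve(XtX, Xty)
print("beta_hat:", beta_hat)   # near the true (0.001, 1.10, -0.30)

# 4. fitted values, residuals, R^2
y_hat = X @ beta_hat
resid = y - y_hat
r2 = 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
print("R^2:", r2)
```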
What we just did: hand-built every step of OLS — design matrix, \(\mathbf{X}^\top\mathbf{X}\), \(\mathbf{X}^\top\mathbf{y}\), solve, fitted values, \(R^2\) — using only the four operations introduced in this appendix.
Sanity check against statsmodels:
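A sketch continuing from the previous cell (it reuses that cell's X and y, and assumes statsmodels is installed):

```python
import statsmodels.api as sm

# X already has its intercept column, so sm.add_constant is not needed
model = sm.OLS(y, X).fit()
print(model.params)     # same three coefficients as beta_hat above
print(model.rsquared)   # same R^2
```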
What we just did: re-ran the same fit through statsmodels to confirm our hand-rolled coefficients match a battle-tested library.
The two routes give the same coefficients — by construction, since statsmodels is doing the same matrix algebra under the hood, just with extra numerical safeguards and the standard-error / inference machinery.
Section 7: A vocabulary cheat-sheet
Why this matters in the rest of the book. Print this table or keep this page open. Every time a chapter throws a new bold symbol at you, glance here first to translate it into a shape and a NumPy call.
| Object | Notation | Shape | Python | Used in |
|---|---|---|---|---|
| scalar | \(a\), \(\lambda\) | — | float, np.float64 | regularisation strength, learning rate |
| vector | \(\mathbf{y}\), \(\boldsymbol{\beta}\) | \(n \times 1\) | np.ndarray(shape=(n,)) | observations, parameters |
| matrix | \(\mathbf{X}\) | \(n \times p\) | np.ndarray(shape=(n,p)) | design matrix, covariance matrix |
| transpose | \(\mathbf{X}^\top\) | swap rows ↔ cols | X.T | rearrangement for shape-fit |
| identity | \(\mathbf{I}_p\) | \(p \times p\) | np.eye(p) | Ridge regularisation |
| inverse | \(\mathbf{A}^{-1}\) | \(p \times p\) | np.linalg.inv(A) | OLS formula; usually solve instead |
| linear solve | \(\mathbf{A}\mathbf{x} = \mathbf{b}\) | — | np.linalg.solve(A, b) | OLS, GLS, Ridge |
| dot product | \(\mathbf{a} \cdot \mathbf{b}\) | scalar | a @ b or np.dot(a, b) | one prediction |
| matrix product | \(\mathbf{A}\mathbf{B}\) | \(m \times p\) | A @ B | vector of predictions, \(\mathbf{X}^\top\mathbf{X}\) |
| outer product | \(\mathbf{a}\mathbf{b}^\top\) | \(m \times p\) | np.outer(a, b) | rank-1 update, covariance updates |
| matrix shape | “\(m\) by \(p\)” | rows × cols | A.shape | first thing to check, every time |
Exercises
For each pair, state the shape of the product (or that it is undefined).
- \(\mathbf{A}\) is \(4 \times 3\), \(\mathbf{B}\) is \(3 \times 5\). Compute \(\mathbf{A}\mathbf{B}\).
- \(\mathbf{A}\) is \(4 \times 3\), \(\mathbf{B}\) is \(3 \times 5\). Compute \(\mathbf{B}\mathbf{A}\).
- \(\mathbf{X}\) is \(120 \times 6\). What is the shape of \(\mathbf{X}^\top\mathbf{X}\)? Of \(\mathbf{X}\mathbf{X}^\top\)?
- \(\mathbf{a}\) is a vector of length 5; \(\mathbf{b}\) is a vector of length 5. What is the shape of \(\mathbf{a}^\top\mathbf{b}\)? Of \(\mathbf{a}\mathbf{b}^\top\)?
Suppose you have only two observations and one regressor (plus intercept):
\[ \mathbf{y} = \begin{pmatrix} 2 \\ 4 \end{pmatrix}, \qquad \mathbf{X} = \begin{pmatrix} 1 & 1 \\ 1 & 2 \end{pmatrix}. \]
Compute \(\mathbf{X}^\top\mathbf{X}\), \(\mathbf{X}^\top\mathbf{y}\), then \(\hat{\boldsymbol{\beta}} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}\) by hand. Verify with NumPy. (Hint: for a \(2\times 2\) matrix \(\begin{pmatrix} a & b \\ c & d \end{pmatrix}\) with determinant \(ad - bc\), the inverse is \(\tfrac{1}{ad-bc}\begin{pmatrix} d & -b \\ -c & a \end{pmatrix}\).)
Construct a \(3 \times 2\) design matrix \(\mathbf{X}\) such that one column is exactly twice the other. Compute \(\mathbf{X}^\top\mathbf{X}\). Confirm that np.linalg.inv(X.T @ X) raises a LinAlgError. Then add a small Ridge penalty \(\lambda \mathbf{I}\) with \(\lambda = 0.01\) and confirm that X.T @ X + 0.01 * np.eye(2) is invertible. This is exactly the trick Ridge uses to stabilise singular designs.
You have estimated \(\hat{\boldsymbol{\beta}} = (0.001, 1.10, -0.30)\) for a model with intercept + Market + SMB. You want predictions for three new months whose factor values are:
| month | Mkt | SMB |
|---|---|---|
| Jan | 0.020 | -0.005 |
| Feb | -0.015 | 0.010 |
| Mar | 0.008 | 0.000 |
(a) Build the \(3 \times 3\) design matrix \(\mathbf{X}_{\text{new}}\) for these three months. (b) Compute the predictions \(\hat{\mathbf{y}}_{\text{new}} = \mathbf{X}_{\text{new}} \hat{\boldsymbol{\beta}}\) first by hand, then with NumPy.
Each of the following expressions appears in this book. For each, name the chapter where you saw it, state the shape of every object, and explain in one sentence what the expression computes.
- \(\hat{\boldsymbol{\beta}} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}\)
- \(\hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol{\beta}}\)
- \(\hat{\boldsymbol{\beta}}_{\text{Ridge}} = (\mathbf{X}^\top\mathbf{X} + \lambda \mathbf{I})^{-1}\mathbf{X}^\top\mathbf{y}\)
- \(\hat{\boldsymbol{\beta}}_{\text{LASSO}} = \arg\min_{\boldsymbol{\beta}} \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 + \lambda\|\boldsymbol{\beta}\|_1\)
What we deliberately skipped
This appendix covers about 5% of an undergraduate linear-algebra course. We did not touch determinants, eigenvalues, eigenvectors, vector spaces, linear independence beyond intuition, orthogonality, SVD, or QR decomposition. Each of those would deepen your understanding of why the formulae behave the way they do — eigenvalues are the right lens to think about Ridge regularisation, for instance, and SVD is the right way to think about PCA — but none is required to read the main chapters of this book or to write production OLS / Ridge / LASSO code.
If you continue further into quantitative finance — portfolio optimisation, risk modelling, factor analysis — the eigenvalue picture becomes essential. The standard short reference is Strang, Introduction to Linear Algebra; for a quant-finance angle, chapters 1–3 of Meucci, Risk and Asset Allocation cover exactly the set of tools you will need. Both are written for readers without prior exposure.