# Appendix B: Solutions to Exercises
Three rules, in order of importance:
- Try the exercise first. Solutions are a verification device, not a substitute for the work. The whole point of an exercise is the moment of confusion — the half-hour spent staring at a NumPy shape mismatch or an unexpected NaN. Reading the solution before you have hit that wall short-circuits the very learning the exercise is designed to produce. Consult the appendix only to verify your own answer or after an honest, time-boxed attempt.
- Run the code. Every cell below is a live `{pyodide-python}` block. Click Run and watch the actual output appear. A printed answer in the text and a number that materialises in your browser are two different teachers; the second teaches more.
- Note the alternatives. Several exercises admit more than one correct approach (matrix vs scalar OLS, expanding vs rolling windows, simple vs log returns). Where this is the case, the solution flags the principal alternative and explains why a working analyst might prefer one over the other.
Chapter and exercise numbering follows the book: the solution to the book's Exercise 3.3 appears here as Exercise 3.3 (chapter 3, third exercise within the chapter) — the chapter number always matches the topic file, so Exercise 3.x lives under ## Chapter 3. Multi-part exercises are split into (a), (b), (c), etc.
## Chapter 1 — Pandas Foundations
Chapter 1 has two exercise blocks. The first set (Exercises 1.1–1.5) accompanies the Series material; the second set (Exercises 1.6–1.11) accompanies the DataFrame material. We work through both in order.
Exercise 1.1: Aligning two price series
You are given two daily-close Series for the same year — us with five observations and hk with four — whose indices overlap on every date except 5 January 2024. (a) Compute us + hk and explain why one row is NaN. (b) Recompute treating missing values as zero on either side. (c) Compute the intersection — a Series indexed by only those dates present in both calendars.
Worked solution. The pandas alignment rule is the foundation of every Series operation: when two Series are combined with an arithmetic operator, pandas first takes the union of their indices, broadcasts both inputs to that union, and produces a result indexed by it. Wherever one side is missing on a given label, the result is NaN. The 5 January value of us has no counterpart in hk (which skips 5 January and resumes on 8 January), so us + hk produces NaN on that date.
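A minimal sketch of the three parts, using illustrative dates and prices rather than the book's data (the `us`/`hk` values below are assumptions):

```python
import pandas as pd

us = pd.Series([100.0, 101.0, 102.0, 103.0, 104.0],
               index=pd.to_datetime(["2024-01-02", "2024-01-03", "2024-01-04",
                                     "2024-01-05", "2024-01-08"]))
hk = pd.Series([50.0, 51.0, 52.0, 53.0],
               index=pd.to_datetime(["2024-01-02", "2024-01-03",
                                     "2024-01-04", "2024-01-08"]))

total = us + hk                     # (a) union of the two indices; 2024-01-05 is NaN
filled = us.add(hk, fill_value=0)   # (b) a missing side contributes zero

idx = us.index.intersection(hk.index)
common = us.loc[idx] + hk.loc[idx]  # (c) only dates present in both calendars
print(total, filled, common, sep="\n\n")
```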
Discussion / common pitfalls. The most common error is to assume + performs row-by-row arithmetic by position, like NumPy. It does not: pandas aligns by label, every time. Position-based addition is what you would get from us.values + hk.values, and that requires the two arrays to have the same length and matching semantics — a far more fragile contract than alignment by date. The fill_value= argument in .add, .sub, .mul, .div is the polite way to specify “use this default when one side is missing.” It is the right tool when one side genuinely should contribute zero (an FX gain on a day a market was closed, for instance); it is the wrong tool when missing means unknown, because zero is a real number and will downward-bias any subsequent average. The intersection idiom is the cleanest way to enforce “only dates present in both” and is what you want for return correlations and pairs-trading entries.
A slightly more idiomatic version of the intersection avoids the explicit intersection call: pd.concat([us, hk], axis=1).dropna().sum(axis=1). The concat aligns and the dropna enforces the intersection. Equivalent result, one fewer line of code.
Exercise 1.2: A simple trading signal
Given a daily-close Series close, write a single expression that returns a Series of trading signals: "Buy" on a bullish crossover (10-day MA crosses above 40-day MA), "Sell" on a bearish crossover (10-day MA crosses below 40-day MA), and "Hold" otherwise.
Worked solution. A crossover is a change of sign of fast - slow. The previous day’s sign was negative and today’s is positive — that is a bullish cross. Today’s is negative and yesterday’s was positive — that is a bearish cross. Anything else is "Hold".
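A sketch of one way to write the expression; the synthetic `close` series is an assumption standing in for the exercise data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
close = pd.Series(100 + rng.standard_normal(300).cumsum(),
                  index=pd.bdate_range("2023-01-02", periods=300))

fast, slow = close.rolling(10).mean(), close.rolling(40).mean()
diff = fast - slow

signal = pd.Series(
    np.where((diff.shift(1) <= 0) & (diff > 0), "Buy",           # bullish cross
             np.where((diff.shift(1) >= 0) & (diff < 0), "Sell", # bearish cross
                      "Hold")),
    index=close.index,
)
print(signal.value_counts())
```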
Discussion / common pitfalls. Three traps appear in almost every student answer.
First, students forget to compare both the current and previous values of diff. If you label every day on which fast > slow as "Buy", you end up with "Buy" printing every day for as long as the regime persists — but a crossover is a single-day event. The role of .shift(1) is to encode “yesterday”.
Second, >= versus >. If diff.shift(1) == 0 exactly, is that a cross or not? The conventional choice is to treat exactly-zero as “not yet a cross”; using <= 0 for the previous value and > 0 for the current value handles the edge cleanly.
Third, the first 40 rows of slow are NaN, which propagate into diff. Comparisons involving NaN evaluate to False, so both np.where conditions fail on those rows and they correctly receive "Hold". If you used pandas’s .where instead — which preserves NaN — you would need to fill them explicitly.
A “Buy” signal in this scheme is not a trade — it is a flag. The actual trade decision incorporates position management (only buy if currently flat), risk limits, transaction costs, and execution timing. In a research notebook the signal is the input; in production the signal is one of dozens of inputs to the order-management system.
Exercise 1.3: Sharpe ratio and drawdown
Generate 1,000 daily returns from \(\mathcal{N}(\mu = 0.0005, \sigma = 0.015)\) with seed 123, build a price series, and compute (a) the annualised Sharpe, (b) the maximum drawdown and its date, (c) the average return on the 10% worst and 10% best days, (d) the ratio of down-day volatility to up-day volatility.
Worked solution.
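A sketch of the four computations. The simulation parameters follow the exercise; the exact realised numbers depend on the generator used with seed 123, so treat the outputs as illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(123)
r = pd.Series(rng.normal(0.0005, 0.015, 1000),
              index=pd.bdate_range("2021-01-04", periods=1000))
wealth = (1 + r).cumprod()

sharpe = r.mean() / r.std() * np.sqrt(252)          # (a) sample Sharpe, annualised

dd = wealth / wealth.cummax() - 1                   # (b) drawdown path
max_dd, trough = dd.min(), dd.idxmin()

worst = r[r <= r.quantile(0.10)].mean()             # (c) 10% worst days
best = r[r >= r.quantile(0.90)].mean()              #     10% best days

vol_ratio = r[r < 0].std() / r[r > 0].std()         # (d) down-day vol / up-day vol
print(round(sharpe, 2), round(max_dd, 3), trough.date(),
      round(worst, 4), round(best, 4), round(vol_ratio, 2))
```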
Discussion / common pitfalls. Three observations.
The Sharpe number for this simulated series will be roughly \(0.0005/0.015 \times \sqrt{252} \approx 0.53\) in expectation, but with only 1,000 daily observations the estimate carries a standard error of roughly \(\sqrt{252/T} \approx 0.5\), so the realised figure can sit well away from 0.53. A student who computes Sharpe and gets exactly \(0.0005/0.015 \times \sqrt{252} = 0.529\) has likely used the population formula on the population parameters — not the right thing. The correct number is the sample Sharpe, computed from the realised returns.
The drawdown date is the date of the trough, not the date of the peak. Beginners often report the peak date and call it the drawdown date — these can be months apart. The trough is what hurts; the peak is the reference. A useful diagnostic is also the time underwater — the number of days between peak and recovery — which can be far more painful than the magnitude itself.
Part (d) tests whether the return distribution is symmetric. For Gaussian draws the ratio should be approximately 1.0. In real equity data this ratio is reliably greater than 1.0 (down-day vol exceeds up-day vol) — a manifestation of negative skew. The fact that this asymmetry vanishes in the simulation is a feature, not a bug: it tells us that any backtest written against Gaussian innovations will systematically understate tail risk relative to what real markets deliver.
Exercise 1.4: Missing-value strategy
You have a Series sold of daily units sold for a single SKU with about 8% of days missing. (a) Justify each of the five imputation strategies — dropna, fillna(0), ffill, fillna(mean), fillna(median) — in one sentence. (b) Compute the mean of sold under each and identify the two extremes.
Worked solution.
The right strategy depends entirely on why a row is missing, not on which imputation is computationally most convenient.
| Strategy | When defensible | When indefensible |
|---|---|---|
| `dropna` | The cause of missingness is random and the lost rows are unlikely to bias remaining statistics. | The cause is non-random (e.g., a system outage on high-traffic days), so dropping biases the mean. |
| `fillna(0)` | “Missing” genuinely means “store was closed and sold nothing.” | Missing means “data was lost”; here zero pulls the mean down artificially. |
| `ffill` | The series is sticky day-to-day and the most recent observation is a good prior — e.g., a price snapshot. | A genuine value of zero is plausible on missing days; forward-filling then over-states. |
| `fillna(mean)` | You want to preserve the mean of the surviving observations and you do not care that the variance shrinks. | You care about variance, distribution, or tails — mean imputation flattens them. |
| `fillna(median)` | The series has a heavy-tailed distribution and the mean is unreliable; the median is more typical of a “normal day”. | The series has structural zeros and the median is far above zero. |
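For part (b), a minimal sketch on an assumed synthetic `sold` series (Poisson daily sales with roughly 8% of days set to missing at random):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
sold = pd.Series(rng.poisson(40, 365).astype(float),
                 index=pd.date_range("2024-01-01", periods=365))
sold[rng.random(365) < 0.08] = np.nan   # knock out ~8% of days

means = pd.Series({
    "dropna":         sold.dropna().mean(),
    "fillna(0)":      sold.fillna(0).mean(),
    "ffill":          sold.ffill().mean(),
    "fillna(mean)":   sold.fillna(sold.mean()).mean(),
    "fillna(median)": sold.fillna(sold.median()).mean(),
})
print(means.round(2).sort_values())   # fillna(0) sits at the bottom of the ranking
```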
Discussion / common pitfalls. The two extremes are almost always fillna(0) (which pulls the mean down by exactly \(0.08 \times \bar y \approx\) 8% of the surviving mean) and dropna / fillna(mean) (which leave the mean of the surviving observations unchanged). Students who reflexively pick fillna(0) because “zero is a natural default” routinely produce silently biased dashboards. The lesson is that missing is information: documenting whether NaN means “zero sales” or “data unknown” is a data-engineering responsibility, and the imputation choice is a statistical one that depends on the answer.
In production, the cleanest pattern is to not impute at all. Track the missingness flag as a separate column, propagate it through the pipeline, and let the downstream consumer decide. This avoids the irreversibility of imputation — once a value is filled in, the information about which observations were synthetic is lost.
Exercise 1.5: Resampling and the weekend effect
Test the classical “Monday returns are lower than Friday returns” claim on 2,000 simulated business-day returns from \(\mathcal{N}(0.0005, 0.015)\).
Worked solution.
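A sketch of the test. The seed below is an assumption; the exercise only fixes the distribution and the sample size.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
r = pd.Series(rng.normal(0.0005, 0.015, 2000),
              index=pd.bdate_range("2017-01-02", periods=2000))

by_day = r.groupby(r.index.day_name()).agg(["mean", "std", "count"])
order = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday"]
print(by_day.loc[order].round(5))

gap_bps = (by_day.loc["Monday", "mean"] - by_day.loc["Friday", "mean"]) * 1e4
print(f"Monday - Friday gap: {gap_bps:.1f} bps")
```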
Discussion / common pitfalls. In a truly i.i.d. simulation there is no day-of-week effect by construction — but the sample mean by weekday is random, and the gap between any two days is a difference of two near-Gaussian estimates with standard error \(\approx \sigma / \sqrt{n/5}\). For \(n = 2000\) and \(\sigma = 1.5\%\) the standard error of each daily mean is \(\sigma / \sqrt{400} \approx 7.5\) bps, so the Monday–Friday gap has standard error \(\approx 11\) bps. A 5–10 bps gap will look “real” to the eye but is well within sampling noise.
This is the textbook intuition for why naive backtests overstate edge: any post-hoc slicing of a return series produces spurious sub-bucket effects, and with enough buckets one of them will always look profitable. The defence is pre-registration (commit to the test before seeing the data) and out-of-sample validation. Day-of-week effects, in particular, have been studied since the 1970s; the consensus is that any historical effect has long since been arbitraged away.
The remaining six Chapter 1 exercises (the DataFrame block) follow.
Exercise 1.6: Load and audit
Load transactions.csv (or synthesise an equivalent), then (a) print shape, dtypes, and column names; (b) compute per-column missing percentage; (c) report the date range.
Worked solution. A reproducible synthetic dataset stands in for the file load.
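A sketch of the audit on an assumed synthetic frame (column names are illustrative; with the real file the only change is the `read_csv` call):

```python
import numpy as np
import pandas as pd

# real data: df = pd.read_csv("transactions.csv", parse_dates=["date"])
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "date": rng.choice(pd.bdate_range("2024-01-02", "2024-12-31"), n),
    "ticker": rng.choice(["AAPL", "MSFT", "NVDA", "GOOG"], n),
    "side": rng.choice(["BUY", "SELL"], n),
    "quantity": rng.integers(1, 500, n),
    "price": rng.uniform(80, 500, n).round(2),
})
df.loc[rng.choice(n, 20, replace=False), "price"] = np.nan

print(df.shape, df.dtypes, list(df.columns), sep="\n")   # (a) the audit triplet
print((df.isna().mean() * 100).round(1))                 # (b) % missing per column
print(df["date"].min(), "to", df["date"].max())          # (c) date range
```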
Discussion / common pitfalls. The single most common load-time bug is forgetting to parse the date column, leaving it as object dtype. Subsequent .loc["2024-01-02":"2024-12-31"] calls then either fail silently or return the wrong slice. The check df.dtypes should be the first line after every read_csv. The audit triplet — shape, dtypes, missingness — takes three lines and prevents most downstream surprises.
Exercise 1.7: Selection and filtering
(a) Select `BUY` trades with quantity above 300, first with bracket-and-mask then with `query`; confirm equality. (b) Find `AAPL` or `MSFT` trades where price is below the median price across all trades. (c) Pull the first 50 rows with `.iloc[]` and the same range with `.loc[]`; explain why counts can differ.
Worked solution.
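A sketch of the three parts on an assumed one-trade-per-business-day frame (which is why the `.iloc`/`.loc` counts coincide here):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "ticker": rng.choice(["AAPL", "MSFT", "NVDA", "GOOG"], n),
    "side": rng.choice(["BUY", "SELL"], n),
    "quantity": rng.integers(1, 500, n),
    "price": rng.uniform(80, 500, n).round(2),
}, index=pd.bdate_range("2023-01-02", periods=n))

# (a) bracket-and-mask vs query
a1 = df[(df["side"] == "BUY") & (df["quantity"] > 300)]
a2 = df.query("side == 'BUY' and quantity > 300")
assert a1.equals(a2)

# (b) AAPL or MSFT trades priced below the overall median
b = df[df["ticker"].isin(["AAPL", "MSFT"]) & (df["price"] < df["price"].median())]

# (c) positional vs label slicing
c_iloc = df.iloc[0:50]                      # exactly 50 rows, end excluded
c_loc = df.loc["2023-01-02":"2023-03-10"]   # closed label interval; count depends on the index
print(len(a1), len(b), len(c_iloc), len(c_loc))
```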
Discussion / common pitfalls. .iloc[] is position-based and excludes the upper endpoint, mirroring Python list slicing. .loc[] is label-based and includes the upper endpoint. So df.iloc[0:50] returns 50 rows but df.loc["2024-01-02":"2024-03-12"] returns however many labelled rows fall in that closed interval — possibly 50, possibly more, possibly fewer, and the count can drift if the underlying index has duplicates or gaps. In the synthetic frame above there is exactly one row per business day, so the counts coincide. In a real transactions dataset with multiple trades per day they will not.
query is convenient for ad-hoc filtering but slower than bracket-and-mask on large frames and refuses to handle column names containing spaces, dots, or hyphens without backticks. For one-off exploration, query is fine. For production code paths, prefer bracket-and-mask: it is faster, more explicit, and immune to surprising parser behaviour.
Exercise 1.8: Missing values
Build a small DataFrame with 20 rows and three columns, inject ~20% NaNs, then (a) drop rows where price specifically is missing, (b) fill missing quantity with the median, (c) forward-fill rating, (d) compare the variance of quantity before and after mean imputation.
Worked solution.
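A sketch of the four parts on an assumed 20-row frame:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "price": rng.uniform(90, 110, 20).round(2),
    "quantity": rng.integers(1, 100, 20).astype(float),
    "rating": rng.choice(["A", "B", "C"], 20),
})
for col in df.columns:                       # inject ~20% NaNs per column
    df.loc[rng.choice(20, 4, replace=False), col] = np.nan

a = df.dropna(subset=["price"])                       # (a) drop rows missing price only
b = df["quantity"].fillna(df["quantity"].median())    # (b) median-fill quantity
c = df["rating"].ffill()                              # (c) forward-fill rating

before = df["quantity"].var()                         # (d) variance shrinkage
after = df["quantity"].fillna(df["quantity"].mean()).var()
print(round(before, 1), round(after, 1), round(after / before, 3))
```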
Discussion / common pitfalls. Mean imputation always shrinks the variance. The imputed values are exactly at the mean — they contribute zero to the squared deviation from the mean — so the sum-of-squared-deviations is unchanged while the denominator \((n-1)\) grows. The new variance is approximately \(\hat\sigma^2_{\text{old}} \cdot \frac{n_{\text{obs}} - 1}{n - 1}\), where \(n_{\text{obs}}\) is the count of non-missing values and \(n\) the total. For 20% missingness on \(n = 20\) this is roughly a 20% reduction in variance — a substantial misstatement of dispersion. Any downstream Sharpe or risk number that depends on the variance estimate inherits the bias.
The right fix, if you must impute, is to add noise: draw imputations from \(\mathcal{N}(\bar y, \hat\sigma)\) rather than from the point \(\bar y\). Multiple imputation generalises this idea and produces uncertainty estimates that account for the imputation step. In practice, however, the cleanest thing is to keep NaNs and let downstream estimators handle them (statsmodels and scikit-learn both have dedicated paths).
Exercise 1.9: Groupby with named aggregation
Produce a per-ticker summary table with n_trades, total_qty, avg_price, price_iqr, and buy_share. Two of the five require custom lambdas. Which ticker has the widest price IQR?
Worked solution.
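A sketch of the named-aggregation call on an assumed trades frame:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 400
df = pd.DataFrame({
    "ticker": rng.choice(["AAPL", "MSFT", "NVDA", "GOOG"], n),
    "side": rng.choice(["BUY", "SELL"], n),
    "quantity": rng.integers(1, 500, n),
    "price": rng.uniform(80, 500, n).round(2),
})

summary = df.groupby("ticker").agg(
    n_trades=("price", "size"),
    total_qty=("quantity", "sum"),
    avg_price=("price", "mean"),
    price_iqr=("price", lambda s: s.quantile(0.75) - s.quantile(0.25)),
    buy_share=("side", lambda s: (s == "BUY").mean()),
)
print(summary.round(2))
print("widest price IQR:", summary["price_iqr"].idxmax())
```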
Discussion / common pitfalls. Named aggregation — output_name = (column, agg_fn) — is the modern, readable pattern. The legacy alternative is df.groupby("ticker").agg({...}) with a nested dict; it works but produces a multi-level column index that needs flattening. Two pitfalls.
First, the "size" aggregator counts rows (including NaNs); "count" counts non-null entries. For n_trades you want "size".
Second, when the same column needs two different aggregations — say avg_price and price_iqr both operate on price — name them differently in the output. Named aggregation enforces this automatically; the dict pattern does not.
For a fully vectorised IQR without a lambda, define a helper: def iqr(s): return s.quantile(0.75) - s.quantile(0.25). Then pass iqr (the function object) into the agg tuple. This is cleaner if the same custom aggregator is reused across many group-bys.
Exercise 1.10: Merge and validate
Build a small reference table with three of the four tickers in your trades plus one extra. (a) Left-merge, count missing sectors. (b) Outer-merge with indicator=True, count reference-only rows. (c) Pass validate="many_to_one" and explain what fails if the reference has duplicates.
Worked solution.
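A sketch of the three parts, with an assumed four-ticker reference table (GOOG deliberately missing, TSLA as the extra row):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
trades = pd.DataFrame({
    "ticker": rng.choice(["AAPL", "MSFT", "NVDA", "GOOG"], 500),
    "quantity": rng.integers(1, 500, 500),
})
ref = pd.DataFrame({
    "ticker": ["AAPL", "MSFT", "NVDA", "TSLA"],
    "sector": ["Tech", "Tech", "Semis", "Autos"],
})

left = trades.merge(ref, on="ticker", how="left", validate="many_to_one")
print("missing sectors:", left["sector"].isna().sum())                  # (a) GOOG trades

outer = trades.merge(ref, on="ticker", how="outer", indicator=True)
print("reference-only rows:", (outer["_merge"] == "right_only").sum())  # (b) TSLA

ref_dup = pd.concat([ref, ref.iloc[[0]]])                               # duplicate AAPL key
try:
    trades.merge(ref_dup, on="ticker", how="left", validate="many_to_one")  # (c)
except pd.errors.MergeError as e:
    print("MergeError:", e)
```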
Discussion / common pitfalls. The validate= argument is the single most underused tool in pandas. Without it, a duplicated key on the “one” side of a many-to-one join silently multiplies rows on the “many” side — a pd.merge of 500 trades against a reference where AAPL appears twice yields 500 + (count of AAPL trades) rows, not 500. The trader who then sums quantity by sector reports double-counted notional. validate="many_to_one" raises immediately rather than allowing this corruption.
The three values to know:
"one_to_one": both keys must be unique on their respective sides."one_to_many": the left key must be unique; the right may repeat."many_to_one": the right key must be unique; the left may repeat — the most common join in finance (trades → reference data).
Exercise 1.11: End-to-end chain
Write a single chained expression producing a wide table with one row per month, one column per ticker, and values equal to the average daily notional traded.
Worked solution.
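One way to write the chain, on an assumed trades frame (column names are illustrative):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
n = 2000
trades = pd.DataFrame({
    "date": rng.choice(pd.bdate_range("2024-01-02", "2024-12-31"), n),
    "ticker": rng.choice(["AAPL", "MSFT", "NVDA", "GOOG"], n),
    "quantity": rng.integers(1, 500, n),
    "price": rng.uniform(80, 500, n).round(2),
})

wide = (
    trades
    .assign(notional=lambda d: d["quantity"] * d["price"],
            month=lambda d: d["date"].dt.to_period("M"))                     # enrich
    .groupby(["month", "ticker"])
    .agg(total_notional=("notional", "sum"),
         n_days=("date", "nunique"))                                         # aggregate
    .assign(avg_daily_notional=lambda d: d["total_notional"] / d["n_days"])  # transform
    ["avg_daily_notional"]
    .unstack("ticker")                                                       # reshape
    .round(0)
    .sort_index()                                                            # polish
)
print(wide.head())
```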
Discussion / common pitfalls. The “average daily notional” denominator is the number of distinct trading days in that ticker-month, not the number of trades. Conflating the two — dividing total notional by trade count — gives the average notional per trade, a different quantity that overweights months with many small trades. The cleanest pattern is to compute numerator and denominator inside a single agg, then divide in a follow-on assign.
The chain reads top-to-bottom as: enrich (add notional and month), aggregate (sum notional, count distinct days), transform (compute the ratio), reshape (unstack ticker), polish (round, sort). Six verbs, one expression, no intermediate variables. This is the production pattern.
## Chapter 2 — Markets as Data Objects
Chapter 2 has two exercise blocks. The first set (Exercises 2.1–2.5) accompanies the early “returns” half of the chapter; the second set (Exercises 2.6–2.10) accompanies the “diagnostics” half — volatility, drawdown, Sharpe, correlation, and vol drag.
Exercise 2.1: Read and verify a price series
Download AAPL daily history (or simulate), then (a) confirm the index is a DatetimeIndex, (b) print the first/last dates and row count, (c) plot Close with a 252-day moving average overlay, (d) inspect the 31 August 2020 4-for-1 split.
Worked solution. Pyodide cannot reach yfinance for network reasons, so we simulate a price series with one split-like artefact. The audit pattern is identical for real data.
Discussion / common pitfalls. The audit triplet — index type, date range, row count — should run before any analysis. Forgetting to verify the index leaves you exposed to silent failures: a CSV that was saved with an integer index instead of a date will slice fine for df.loc[0:10] but blow up for df.loc["2024-01-01":"2024-12-31"]. The split inspection is conceptually important: yfinance’s auto_adjust=True returns prices already adjusted for splits and dividends, which is what you want for return calculations. The raw Close (pre-adjustment) is sometimes useful for matching newspaper headline prices but is wrong for cumulative-return work — adjacent days bridging a split will record a fictitious return.
Exercise 2.2: Simple vs log returns
Compute simpleR = Close.pct_change() and logR = np.log(Close).diff(), then (a) plot histograms on the same figure, (b) report means and stds of each, (c) verify \(\ell_t = \ln(1 + R_t)\) numerically.
Worked solution.
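A sketch of parts (b) and (c) on an assumed simulated close series (the histogram in part (a) is omitted):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)
close = pd.Series(100 * np.exp(rng.normal(0.0004, 0.018, 1000).cumsum()),
                  index=pd.bdate_range("2021-01-04", periods=1000))

simpleR = close.pct_change()
logR = np.log(close).diff()

print(simpleR.agg(["mean", "std"]).round(6))          # (b)
print(logR.agg(["mean", "std"]).round(6))

gap = (np.log1p(simpleR) - logR).abs().max()          # (c) identity check
print("max |ln(1+R) - logR| =", gap)                  # a few times machine epsilon
```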
Discussion / common pitfalls. Three observations.
First, simple and log returns look visually identical at daily frequency. The histograms overlay almost perfectly. The difference shows up in the mean, and it comes from the second moment: \(\mathbb{E}[\ln(1+R)] \approx \mathbb{E}[R] - \tfrac{1}{2}\sigma^2\) by a Taylor expansion. With \(\sigma \approx 1.8\%\) per day, the daily mean gap is \(\tfrac{1}{2} \cdot 0.018^2 \approx 1.6\) basis points — small per day, but it accumulates into a multi-percent gap over years.
Second, the maximum discrepancy in part (c) should be a few times np.finfo(float).eps ≈ \(2 \times 10^{-16}\) — pure floating-point noise. If you observe anything larger, you have a bug.
Third, the standard deviations of simple and log returns are essentially identical at daily frequency. They diverge meaningfully only at monthly or quarterly frequencies where large moves occur. Most practitioner work treats them as interchangeable for volatility estimation while being careful with means.
For multi-period sums (cumulative wealth), log returns are the right object: \(\ln(W_T/W_0) = \sum_t \ell_t\). For weighted averages (portfolio returns), simple returns are the right object: \(R_{\text{port},t} = \sum_i w_i R_{i,t}\). A clean style rule: log returns for time aggregation, simple returns for cross-sectional aggregation. Mixing them is the source of more bugs in this domain than any other single error.
Exercise 2.3: Build an equity curve two ways
Build eq1 = (1 + simpleR).cumprod() and eq2 = np.exp(logR.cumsum()). Plot, verify numerical equivalence, and report terminal wealth and CAGR.
Worked solution.
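A sketch under assumed simulated returns:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(6)
simpleR = pd.Series(rng.normal(0.0005, 0.015, 1260),
                    index=pd.bdate_range("2020-01-02", periods=1260))
logR = np.log1p(simpleR)

eq1 = (1 + simpleR).cumprod()
eq2 = np.exp(logR.cumsum())

print("max discrepancy:", (eq1 - eq2).abs().max())     # floating-point rounding only
terminal = eq1.iloc[-1]
cagr = terminal ** (252 / len(simpleR)) - 1            # geometric, annualised
print(f"terminal wealth {terminal:.3f}, CAGR {cagr:.2%}")
```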
Discussion / common pitfalls. The discrepancy should be on the order of \(10^{-13}\) — a few times the floating-point relative precision multiplied by the path length. The two constructions are algebraically identical; the residual is purely floating-point rounding accumulated over the cumulative operation. If you see a discrepancy of order \(10^{-5}\) or larger, you have likely fed in different return columns or different starting indices.
The CAGR formula is geometric: \(\text{CAGR} = W_T^{1/T} - 1\) where \(T\) is years. Students sometimes use the arithmetic average annualised — that is the arithmetic mean return, which overstates CAGR by approximately \(\sigma^2/2\) (vol drag, see Exercise 2.10).
Exercise 2.4: Daily to monthly
(a) Compute monthly returns two ways — resample price then `pct_change`, vs resample-and-compound the daily return — and confirm they agree. (b) Compare the empirical monthly mean and std to daily statistics scaled by 21 and \(\sqrt{21}\). (c) Bar-chart monthly returns with months of \(|R| > 10\%\) highlighted.
Worked solution.
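A sketch of parts (a) and (b) on assumed simulated daily returns (the bar chart in part (c) is omitted):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(8)
r = pd.Series(rng.normal(0.0005, 0.015, 756),
              index=pd.bdate_range("2022-01-03", periods=756))
price = 100 * (1 + r).cumprod()

m1 = price.resample("BME").last().pct_change().dropna()   # resample price, then pct_change
m2 = (1 + r).resample("BME").prod() - 1                   # compound daily returns by month
print("max gap:", (m1 - m2.loc[m1.index]).abs().max())    # (a) floating-point noise

print("monthly mean:", round(m1.mean(), 5), " vs 21 x daily:", round(21 * r.mean(), 5))
print("monthly std :", round(m1.std(), 5), " vs sqrt(21) x daily:",
      round(np.sqrt(21) * r.std(), 5))                    # (b) i.i.d. scaling check
```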
Discussion / common pitfalls. The two constructions disagree at the level of \(10^{-15}\) when the input dates align exactly. In real data they sometimes disagree more visibly because of calendar mismatches: a resample("BME") may select 31 January as month-end whereas the daily return series ends on the last trading day. Always verify the index alignment before computing the diff.
The \(\sqrt{T}\) scaling test is the i.i.d. yardstick. Under independence and stationarity, both monthly statistics should match the daily-times-scaling versions to within sampling error. Discrepancies indicate either time-series predictability (autocorrelation in returns — generally negative at short horizons, mildly positive at longer ones, both small) or volatility clustering (autocorrelation in squared returns, which is well-documented and substantial). Either failure of i.i.d. means the scaling is approximate, and any annualisation based on it is approximate.
Exercise 2.5: Rolling vol and position sizing
You manage $1M and target daily portfolio std of $10,000 (1% of capital). Using AAPL: (a) compute a 22-day rolling std of daily returns; (b) compute the dollar position that hits the target; (c) plot it; (d) report mean position; (e) bonus: clip at $1M and report cap-bound fraction.
Worked solution.
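A sketch of the calculation (the return series is an assumed stand-in; the plot in part (c) is omitted):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(9)
r = pd.Series(rng.normal(0.0008, 0.02, 756),
              index=pd.bdate_range("2022-01-03", periods=756))

capital, target_dollar_vol = 1_000_000, 10_000

vol22 = r.rolling(22).std()                 # (a) 22-day rolling std
position = target_dollar_vol / vol22        # (b) position * vol = $10,000

capped = position.clip(upper=capital)       # (e) cap at capital
print("mean position ($):", round(position.mean()))          # (d)
print("fraction of days at the cap:", round((capped == capital).mean(), 3))
```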
Discussion / common pitfalls. Vol-targeting is the production form of “risk-based position sizing.” Three subtleties.
First, the 22-day window is a forecast of next-day volatility constructed from the past 22 days. In practice, traders use longer windows (60 or 120 days, often with an exponentially weighted scheme) or GARCH-style filters; the trade-off is signal-to-noise vs. responsiveness to regime change.
Second, the position size is inversely proportional to volatility. In quiet markets the position grows; in volatile markets it shrinks. The cap at capital (\(\$1\)M) binds precisely when the rolling vol falls below the target expressed as a fraction of capital (here 1%), which happens often in low-vol regimes. A naïve strategy that holds the same position size through both quiet and chaotic markets accepts wildly different dollar P&L ranges in the two periods — vol-targeting equalises them.
Third, vol-targeting does not guarantee the realised P&L hits the target; it only adjusts position for the expected P&L. If next-day realised vol diverges from the 22-day estimate, realised P&L can over- or under-shoot. The discipline is to revise the position size daily and accept the slippage.
Exercise 2.6: Annualising volatility correctly
Daily analyst reports \(\mu_d = 0.08\%\), \(\sigma_d = 1.4\%\). Monthly analyst reports \(\mu_m = 1.7\%\), \(\sigma_m = 6.2\%\). Compute annualised mean and vol from each. Do they agree? If not, why?
Worked solution.
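The arithmetic, as a two-line check:

```python
import numpy as np

mu_d, sigma_d = 0.0008, 0.014    # daily analyst
mu_m, sigma_m = 0.017, 0.062     # monthly analyst

print("annualised mean:", round(mu_d * 252, 4), "vs", round(mu_m * 12, 4))
print("annualised vol :", round(sigma_d * np.sqrt(252), 4), "vs", round(sigma_m * np.sqrt(12), 4))
```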
The numbers. Annualised mean: 20.16% (daily) vs 20.40% (monthly) — within sampling agreement. Annualised vol: 22.22% (daily) vs 21.48% (monthly) — a \(\sim 3.5\%\) gap.
Why the small gap? The \(\sqrt{T}\) scaling rule assumes returns are i.i.d. Real returns are not — they exhibit (i) mild negative autocorrelation in levels (a slight mean-reverting tendency in daily returns) and (ii) strong positive autocorrelation in squared returns (volatility clustering). The first effect typically makes monthly variance less than \(T\) times daily variance: daily fluctuations partially cancel. The second effect goes the other way: clusters of large returns create heavier-tailed monthly moves. Empirically the two effects roughly cancel for equity indices, leaving the scaling rule approximately correct. The 3.5% gap is within what you would expect from this combination plus pure sampling error.
Discussion / common pitfalls. The deeper lesson is that the choice of base frequency for volatility estimation is not innocuous. Higher-frequency data (5-minute, 1-minute) brings more observations and tighter estimates, but also brings microstructure noise (bid-ask bounce, asynchronous trading) that inflates measured volatility. Lower-frequency data (monthly, quarterly) gets less noise but fewer degrees of freedom and longer error bars. The realised-volatility literature wrestles with exactly this trade-off; in practice, daily closes are a defensible default for equity work.
Exercise 2.7: Building a drawdown function
Write drawdown_stats(returns) returning a dict with (1) max drawdown, (2) trough date, (3) peak date, (4) recovery date (or None).
Worked solution.
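A sketch of one possible implementation:

```python
import numpy as np
import pandas as pd

def drawdown_stats(returns: pd.Series) -> dict:
    """Max drawdown, trough date, preceding peak date, recovery date (or None)."""
    wealth = (1 + returns).cumprod()
    dd = wealth / wealth.cummax() - 1

    trough = dd.idxmin()
    peak = wealth.loc[:trough].idxmax()            # highest point before the trough
    after = wealth.loc[trough:]
    recovered = after[after >= wealth.loc[peak]]   # first full retrace of the prior peak
    recovery = recovered.index[0] if len(recovered) else None

    return {"max_drawdown": dd.min(), "trough_date": trough,
            "peak_date": peak, "recovery_date": recovery}

rng = np.random.default_rng(11)
r = pd.Series(rng.normal(0.0005, 0.015, 1000),
              index=pd.bdate_range("2021-01-04", periods=1000))
print(drawdown_stats(r))
```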
Discussion / common pitfalls. Three common bugs.
First, students confuse the trough date (where dd is minimum) with the peak date (where wealth was highest before the trough). The drawdown is measured from peak to trough; both dates are needed.
Second, the recovery condition is wealth >= previous peak, not wealth >= wealth at trough. The latter just returns the day after the trough. The former requires fully retracing the loss.
Third, in a sample where the recovery has not yet occurred, the function must return None rather than the last sample date. A drawdown that has not recovered is open, and confusing it with a recovered drawdown understates risk.
Exercise 2.8: Sharpe under a time-varying risk-free rate
Simulate a daily \(r_f\) rising linearly from 0.0001 to 0.0002 over 5 years and recompute Sharpe for a fixed return series under (a) constant average \(r_f\) and (b) the time-varying path.
Worked solution.
Discussion / common pitfalls. When \(r_f\) moves slowly and smoothly, the two Sharpes agree closely — the average \(r_f\) is a good summary. The difference matters when:
- \(r_f\) has substantial variance relative to the return variance (rare for equity but real for fixed-income or low-vol strategies).
- The return series is correlated with the \(r_f\) path. If a strategy systematically performs better in high-\(r_f\) environments, the subtraction matters.
- The sample is long enough that the \(r_f\) regime shifts dominate. A 5-year window spanning 0–5% T-bill regimes (e.g., 2020–2024) qualifies.
The cleanest practitioner discipline is to always subtract the contemporaneous risk-free rate from contemporaneous returns. The cost is one extra line of code; the benefit is freedom from a class of inference errors that recur whenever rates move.
Exercise 2.9: Diversification under regime change
Two assets, \(\sigma_1 = \sigma_2 = 0.18\), equal weights, correlation \(\rho \in \{-0.5, 0, 0.3, 0.6, 0.9, 1.0\}\). Plot \(\sigma_p(\rho)\). By what percentage does \(\sigma_p\) rise when \(\rho\) moves from 0 to 0.6?
Worked solution. Portfolio variance for two equally weighted assets: \[ \sigma_p^2 = \tfrac{1}{4}\sigma_1^2 + \tfrac{1}{4}\sigma_2^2 + \tfrac{1}{2}\rho\sigma_1\sigma_2 = \tfrac{1}{2}\sigma^2(1+\rho). \] So \(\sigma_p = \sigma\sqrt{(1+\rho)/2}\).
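The calculation in code:

```python
import numpy as np

sigma = 0.18
rhos = np.array([-0.5, 0.0, 0.3, 0.6, 0.9, 1.0])
sigma_p = sigma * np.sqrt((1 + rhos) / 2)

for rho, s in zip(rhos, sigma_p):
    print(f"rho = {rho:+.1f}:  sigma_p = {s:.2%}")

rise = np.sqrt(1.6) - 1          # sqrt((1 + 0.6)/2) / sqrt((1 + 0)/2) - 1
print(f"rho 0 -> 0.6: portfolio volatility rises by {rise:.1%}")
```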
The number. Portfolio volatility rises from \(0.18 \cdot \sqrt{0.5} = 12.73\%\) at \(\rho = 0\) to \(0.18 \cdot \sqrt{0.8} = 16.10\%\) at \(\rho = 0.6\) — a 26.5% relative increase.
Discussion / common pitfalls. This is the single most important regime-change number in portfolio theory. A risk manager who calibrates a Value-at-Risk model assuming \(\rho = 0\) and then encounters a flight-to-quality episode where \(\rho \to 0.6\) has underestimated her portfolio vol by 26.5% — every VaR number is now 26.5% too small. In a \(\$1\)B portfolio that is hundreds of millions of dollars of mis-stated risk.
The pattern repeats across crises: October 1987, August 1998, October 2008, March 2020, all featured cross-asset correlations rising sharply just as portfolios needed the diversification most. The takeaway is to stress-test correlation assumptions, not just volatility assumptions. A standard stress test sets all pairwise correlations to 0.8 and recomputes risk; whatever leverage looked safe at \(\rho = 0\) usually does not look safe at \(\rho = 0.8\).
Exercise 2.10: Volatility drag over a long horizon
Portfolio with arithmetic mean 12% and vol \(\sigma\), held 30 years. (a) Compute terminal wealth via CAGR \(\approx \bar r - \sigma^2/2\) for \(\sigma \in \{0.10, 0.20, 0.30, 0.40\}\). (b) Project error if advisor uses arithmetic mean. (c) At what \(\sigma\) does projection overstate by \(2\times\)?
Worked solution.
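The table below can be reproduced with a few lines; the 2× threshold follows in closed form from setting the naive-to-true terminal-wealth ratio equal to 2.

```python
import numpy as np

mean_arith, years = 0.12, 30
naive_terminal = (1 + mean_arith) ** years              # ~$29.96 per $1 invested

for sigma in (0.10, 0.20, 0.30, 0.40):
    cagr = mean_arith - sigma ** 2 / 2                  # vol-drag approximation
    true_terminal = (1 + cagr) ** years
    print(f"sigma {sigma:.0%}: CAGR {cagr:.2%}, terminal ${true_terminal:.2f}, "
          f"overstatement {naive_terminal / true_terminal:.2f}x")

# solve (1+m)^Y / (1+m-s^2/2)^Y = 2 for s
sigma_2x = np.sqrt(2 * (1 + mean_arith) * (1 - 2 ** (-1 / years)))
print(f"2x-overstatement threshold: sigma ~ {sigma_2x:.1%}")
```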
The numbers.
| \(\sigma\) | true CAGR | true terminal | naive terminal | overstate factor |
|---|---|---|---|---|
| 10% | 11.50% | $26.20 | $29.96 | 1.14× |
| 20% | 10.00% | $17.45 | $29.96 | 1.72× |
| 30% | 7.50% | $8.76 | $29.96 | 3.42× |
| 40% | 4.00% | $3.24 | $29.96 | 9.24× |
The \(2\times\) overstatement threshold is approximately \(\sigma \approx 22.7\%\).
Discussion / common pitfalls. This is the single most consequential approximation error in retirement planning. A financial advisor who quotes “12% average historical return” without subtracting \(\sigma^2/2\) promises terminal wealth that exceeds reality by a factor that grows exponentially with the horizon and the variance.
At equity-index volatility (\(\sigma \approx 18\%\)), the 30-year overstatement is a manageable 1.5×. At leveraged-fund volatility (\(\sigma \approx 35\%\)), the overstatement is 5×: the advisor projects a million dollars, the client receives two hundred thousand. At single-name volatility (\(\sigma \approx 50\%\)), the situation is worse still, and many “buy-and-hold a great stock” plans implicitly assume zero variance drag. The mathematics is unforgiving.
The clean rule for retirement projections: use geometric mean (CAGR), not arithmetic mean, and document the volatility assumption explicitly. If clients want a “best case”, give them the arithmetic projection — but label it as such, and show the geometric one alongside.
Vol drag is not a quirk of compounding mathematics — it is a fact about wealth dynamics. A \(-50\%\) loss requires a \(+100\%\) gain to recover, not a \(+50\%\) gain. The asymmetry of multiplicative compounding guarantees that geometric mean lies below arithmetic mean for any non-degenerate distribution. The only question is by how much, and the answer is approximately \(\sigma^2/2\) in continuous time.
## Chapter 3 — Simple Linear Regression and CAPM
Exercise 3.1: OLS by hand
Using the simulated dataset of 250 observations, compute \(\hat\alpha\) and \(\hat\beta\) three ways — scalar formulae, matrix formula, statsmodels.OLS — and verify they agree. Then compute SST, SSR, SSE by hand and verify \(R^2 = \text{SSR}/\text{SST}\).
Worked solution.
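A sketch of the three estimators and the sum-of-squares decomposition, on an assumed simulated dataset of 250 observations:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(12)
n = 250
x = rng.normal(0.0005, 0.01, n)                    # simulated market excess return
y = 0.0002 + 1.3 * x + rng.normal(0, 0.015, n)     # simulated asset excess return

# scalar formulae
beta_hat = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
alpha_hat = y.mean() - beta_hat * x.mean()

# matrix formula (solve, not explicit inversion)
X = np.column_stack([np.ones(n), x])
coef = np.linalg.solve(X.T @ X, X.T @ y)

# statsmodels
res = sm.OLS(y, X).fit()
print(alpha_hat, beta_hat, coef, res.params, sep="\n")

# sum-of-squares decomposition
y_hat = X @ coef
sst = ((y - y.mean()) ** 2).sum()
ssr = ((y_hat - y.mean()) ** 2).sum()
sse = ((y - y_hat) ** 2).sum()
print("SST - (SSR + SSE):", sst - (ssr + sse), "  R^2 =", ssr / sst)
```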
Discussion / common pitfalls. All three approaches should agree to within \(\sim 10^{-14}\) — pure floating-point noise. If your scalar and matrix versions diverge by more than that, you have likely mis-typed a sum or used np.linalg.inv on a near-singular matrix. (np.linalg.solve is preferred to explicit inversion: it is both more numerically stable and faster.)
The \(R^2\) identity \(SST = SSR + SSE\) holds exactly when the regression has an intercept and OLS is used. Without an intercept, the identity fails and the “\(R^2\)” produced by software can be negative or exceed one — a sign of a serious model spec problem. Always include an intercept unless you have an iron-clad theoretical reason not to.
Exercise 3.2: CAPM with statsmodels (monthly vs daily)
Repeat the CAPM regression on the NVDA/SPY/FF5 data with monthly returns, then compare to the daily estimates from the worked example.
Worked solution. Pyodide cannot fetch the CSV from the book’s distribution server; we simulate stylised daily NVDA/SPY/\(r_f\) series that mirror the structure.
Discussion / common pitfalls. Three observations.
First, the slope \(\hat\beta\) should be approximately the same at both frequencies — daily and monthly betas typically agree within sampling noise. They are estimates of the same population parameter (the asset’s loading on the market factor) under the i.i.d. assumption.
Second, \(R^2\) usually rises moving from daily to monthly. Under exact i.i.d. scaling this would not happen: both the systematic and idiosyncratic variances grow by the same factor of 21, leaving \(R^2\) unchanged. In practice, part of the daily idiosyncratic variation (bid–ask bounce, delayed reactions to market-wide news) partially cancels within the month, so monthly idiosyncratic variance grows by less than 21× while the systematic component grows roughly proportionally — net effect, \(R^2\) rises.
Third, \(\hat\alpha\) in monthly units is roughly 21× the daily \(\hat\alpha\). Always state the unit (daily or monthly) when reporting alpha — a 5 bps daily alpha is a roughly 12.6% annualised excess return, but a 5 bps monthly alpha is 60 bps annualised. The same number can be triumphant or trivial depending on the frequency it lives at.
Exercise 3.3: Hypothesis tests and confidence intervals
For NVDA/SPY 2023–2024 daily CAPM with HAC standard errors, test (a) \(\beta = 0\), (b) \(\beta \le 1\), (c) \(\alpha = 0\).
Worked solution.
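A sketch of the three tests on an assumed simulated stand-in for the NVDA/SPY excess returns (statsmodels reports two-sided p-values; for the one-sided test in (b), halve the p-value when \(\hat\beta > 1\)):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(15)
n = 504                                            # ~two years of daily data
mkt = pd.Series(rng.normal(0.0006, 0.011, n), name="mkt")
nvda = pd.Series(0.0004 + 1.7 * mkt.values + rng.normal(0, 0.02, n), name="nvda")

X = sm.add_constant(mkt.to_frame())
res = sm.OLS(nvda, X).fit(cov_type="HAC", cov_kwds={"maxlags": 5})

print(res.t_test("mkt = 0"))     # (a) beta = 0
print(res.t_test("mkt = 1"))     # (b) same t-stat serves H0: beta <= 1
print(res.t_test("const = 0"))   # (c) alpha = 0
```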
Discussion / common pitfalls.
Test (a) almost always rejects for any liquid stock — the t-statistic on \(\hat\beta\) in a CAPM regression of NVDA on SPY runs into double digits over a two-year sample. The economic content is meager: of course NVDA is exposed to the market. The interesting question is the magnitude, not the existence, of the exposure.
Test (b) is the substantive question. NVDA’s beta is famously above 1, and an HAC-robust t-statistic for \(\hat\beta - 1\) will typically be 5–10. We reject \(H_0: \beta \le 1\) at every conventional level. The economic interpretation: a \(\$1\)M long NVDA exposure carries the same dollar systematic-risk as roughly \(\$\hat\beta\)M of SPY — useful for hedge sizing.
Test (c) is the alpha test, which is what asset managers care about. A statistically significant non-zero \(\alpha\) over a two-year window is rare and would normally be greeted with scepticism — the most common explanation is omitted-variable bias (NVDA’s “alpha” is really exposure to an AI factor or a quality factor that CAPM does not include) rather than skill. The wider question — what counts as alpha and what counts as exposure — is the entire content of Chapter 4.
Exercise 3.4: Residual diagnostics
Fit CAPM on NVDA/SPY for 2023–2024, produce the three-panel residual plot, report Durbin–Watson, Jarque–Bera, skew, kurtosis. Recommend a covariance type.
Worked solution.
LINE-assumption verdict.
- Linearity: residual-vs-fitted plot should look like an unstructured cloud. CAPM is reasonably linear in market returns over short horizons; we accept.
- Independence: Durbin–Watson near 2 supports no serial correlation; values below 1.5 or above 2.5 are warning signs. For real NVDA/SPY data the value is often ~1.9 — close enough.
- Normality: Jarque–Bera typically rejects for real equity returns (heavy tails, mild left skew). For simulated Gaussian residuals it does not.
- Equal variance: residual-vs-fitted plot would show fan-shape if heteroscedastic. Real equity data is typically heteroscedastic; clustering in volatility breaks both equal-variance and independence.
Recommendation. For real equity-return regressions, the safe default is HAC (Newey–West) standard errors with 5–10 lags. HAC corrects for both heteroscedasticity and serial correlation simultaneously. HC3 is heteroscedasticity-only; classical (nonrobust) assumes both away. Switching from classical to HAC is one keyword: cov_type="HAC", cov_kwds={"maxlags": 5}. In simulated i.i.d. residuals the three estimates roughly agree; in real data HAC is meaningfully wider.
Exercise 3.5: Train/test split and beta stability
Split NVDA/SPY into 2023 (train) and 2024 (test). Fit CAPM on train. Report (a) train \(\hat\beta\), \(\hat\alpha\), \(R^2\). (b) Re-fit on test alone for the test-only \(\hat\beta\). (c) Compute out-of-sample \(R^2\) using train coefficients.
Worked solution.
Discussion / common pitfalls. Beta stability is the central empirical question for any forward use of a fitted model. A train \(\hat\beta = 1.65\) followed by a test \(\hat\beta = 1.78\) tells the analyst that the loading is approximately but not exactly stable — the difference of \(\sim 0.13\) is within sampling noise for typical equity regressions, but a discrepancy of \(0.5\) or more would suggest a regime change (a new business line, a major acquisition, a shift in the firm’s leverage).
The in-sample vs out-of-sample \(R^2\) gap is normally small for CAPM (a few percentage points). It widens dramatically in factor models with many regressors (Chapter 4) and in alpha models (Chapter 5), where in-sample \(R^2\) can be 0.50 while OOS \(R^2\) is 0.02 — the entire content of the cross-validation discipline is to expose and shrink that gap.
For longer histories, rolling-window beta estimation is the standard production approach. Fit CAPM on each 252-day rolling window; the time series of \(\hat\beta_t\) reveals beta drift. Bloomberg’s “adjusted beta” is \(\hat\beta_{\text{adj}} = 0.67 \hat\beta + 0.33\), a shrinkage estimator that pulls extreme betas toward 1 — empirically known to forecast next-period beta better than the raw OLS estimate.
Exercise 3.6: A defensive stock
Repeat CAPM with a hypothetical defensive stock (e.g., XLU), or simulate with \(\beta = 0.45\). Compare \(\hat\beta\), \(\hat\alpha\), \(R^2\), residual std to NVDA. Discuss hedge sizing on \(\$10\)M and diversification implications.
Worked solution.
Discussion / common pitfalls. Three economic takeaways.
A \(\hat\beta = 1.7\) vs \(\hat\beta = 0.45\) means very different hedge sizes — \(\$17\)M of SPY short for NVDA vs \(\$4.5\)M for XLU — to neutralise the market exposure of a \(\$10\)M position. The defensive stock requires less hedge per dollar of position.
The lower \(R^2\) on the defensive stock (typically 0.50 vs 0.85 for NVDA in real data) implies the market factor explains less of the variance — XLU’s returns are dominated by idiosyncratic and sector-specific movements (interest rates, regulatory news) rather than broad-market direction. From a diversification perspective this is a feature: XLU diversifies a tech-heavy portfolio more effectively than another large-cap tech name, precisely because its \(R^2\) on the market is low.
The smaller residual standard error for XLU (0.008 vs 0.020) means that even with the market exposure neutralised, XLU’s idiosyncratic risk is much smaller than NVDA’s. A hedged long on XLU is a far quieter book than a hedged long on NVDA. The trade-off, of course, is expected return: defensive stocks compensate for low beta with lower expected returns (or, in CAPM-land, lower required returns).
## Chapter 4 — Multi-factor and Beta Models
Exercise 4.1: Reading an FF3 summary
Fit FF3 on NVDA, report coefficients with classical OLS SEs. (a) Which reject \(H_0: \beta_k = 0\) at 5%? (b) Compute the 95% CI for \(\hat\beta_M\) manually. (c) Interpret signs in plain English. (d) Compute residual standard error.
Worked solution.
Discussion / common pitfalls.
Significance. With 500 daily observations and the factor structure above, the market loading and HML loading will be highly significant; SMB sometimes is, sometimes is not, depending on the realised correlation in the sample. Always check t-stats individually before claiming “all three factors load”.
Sign interpretation:
- \(\hat\beta_M > 0\): NVDA rises in market rallies (universal for liquid equities).
- \(\hat s < 0\): NVDA behaves like a large-cap stock, not a small-cap; it underperforms when SMB (small-minus-big) does well.
- \(\hat h < 0\): NVDA behaves like a growth stock, not a value stock; it underperforms when HML (high-minus-low book-to-market) does well.
These signs are entirely typical of mega-cap tech: large, growth, high beta. The sign profile is a style fingerprint, and a single look at \((s, h)\) tells you most of what a stock-screener’s category column would.
Residual std. Roughly 1.8% per day, compared to NVDA’s raw daily std of ~2.5%. The factor model explains about half the variance, leaving the rest as idiosyncratic. For NVDA the idiosyncratic share is large because AI-narrative news shocks dominate cross-factor explanations.
Exercise 4.2: Joint vs individual significance
(a) Compute the partial \(F\)-statistic for CAPM → FF3 by hand. (b) Verify with `ff3.compare_f_test(capm)`. (c) Discuss why \(F\) can reject even if individual \(t\)s do not. (d) What if \(SSR_R = SSR_U\) exactly?
Worked solution.
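A sketch of parts (a) and (b) on assumed simulated factor data:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(13)
n = 500
mkt = rng.normal(0.0004, 0.011, n)
smb = rng.normal(0.0, 0.006, n)
hml = rng.normal(0.0, 0.006, n)
y = 0.0002 + 1.4 * mkt - 0.3 * smb - 0.4 * hml + rng.normal(0, 0.018, n)

capm = sm.OLS(y, sm.add_constant(np.column_stack([mkt]))).fit()
ff3 = sm.OLS(y, sm.add_constant(np.column_stack([mkt, smb, hml]))).fit()

# (a) partial F by hand: q = 2 added regressors
q, dof_u = 2, n - 4
F = ((capm.ssr - ff3.ssr) / q) / (ff3.ssr / dof_u)

# (b) the built-in equivalent
F_sm, p_sm, df_diff = ff3.compare_f_test(capm)
print(round(F, 3), round(F_sm, 3), round(p_sm, 4), df_diff)
```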
(c) Joint F rejects, individual t’s do not — why? When two new regressors are correlated, each individual SE inflates (via VIF) — but the joint \(F\) test is constructed from the change in SSR, which is a fundamentally different quantity. The reduction in residual variance can be substantial in aggregate even if no single regressor independently accounts for much of it. A textbook example: regress \(y\) on \(x_1\) and \(x_2\) where the two are correlated 0.95 and both contribute equally. Each \(\hat\beta_k / SE(\hat\beta_k)\) is small; the joint \(F\)-statistic is large.
(d) \(SSR_R = SSR_U\) exactly. Then \(F = 0\). By design this is the smallest possible \(F\), and we cannot reject the null — i.e., we cannot claim the new regressors contribute fit. This is not the same as the new regressors being useless: they may, for example, be exactly collinear with the existing ones, in which case the design is singular and the regression should not be run. More commonly, \(SSR_R - SSR_U\) is small but positive, and \(F\) falls below the critical value — the orthodox conclusion is that the extra parameters do not pay their degree-of-freedom cost.
Discussion / common pitfalls. The partial \(F\)-test is the gold standard for nested model comparisons (CAPM ⊂ FF3 ⊂ FF5). It is not valid for non-nested comparisons (FF3 vs. a momentum-only model), where AIC, BIC, or out-of-sample \(R^2\) should be used. Students sometimes apply \(F\)-tests to non-nested comparisons and reach contradictory conclusions; the discipline is to confirm nesting before invoking the test.
Exercise 4.3: Multicollinearity hands-on
Build \(x_2 = 0.95 x_1 + \text{small noise}\), fit the regression, report \(\hat\beta_1\), \(\hat\beta_2\), their SEs. Compute VIF by hand. Re-fit on a different sample and compare. Does \(R^2\) change?
Worked solution.
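A sketch of the experiment, with assumed simulation parameters chosen so that \(\rho(x_1, x_2) \approx 0.95\) and the true coefficients sum to one:

```python
import numpy as np
import statsmodels.api as sm

def fit_sample(seed, n=500):
    rng = np.random.default_rng(seed)
    x1 = rng.normal(0, 1, n)
    x2 = 0.95 * x1 + np.sqrt(1 - 0.95 ** 2) * rng.normal(0, 1, n)   # rho ~ 0.95
    y = 0.5 * x1 + 0.5 * x2 + rng.normal(0, 0.3, n)                 # true sum = 1.0
    res = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()
    r2_aux = sm.OLS(x2, sm.add_constant(x1)).fit().rsquared
    return res, 1 / (1 - r2_aux)                                    # VIF by hand

for seed in (0, 1):
    res, vif = fit_sample(seed)
    b1, b2 = res.params[1], res.params[2]
    print(f"sample {seed}: b1 = {b1:+.3f}, b2 = {b2:+.3f}, sum = {b1 + b2:.3f}, "
          f"R^2 = {res.rsquared:.3f}, VIF = {vif:.1f}")
```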
Discussion / common pitfalls. Three observations.
The individual \(\hat\beta_1, \hat\beta_2\) shift dramatically between samples — sometimes by units of standard error, well beyond what their reported SEs would suggest “small variation” would be. This is the hallmark of multicollinearity: the individual coefficients are unstable, even though the joint behaviour (their sum) is rock-stable. The sum \(\hat\beta_1 + \hat\beta_2\) tracks the true sum (1.0) closely across both samples — because \(x_1 \approx x_2\), the model can identify the sum but cannot disentangle the contributions.
\(R^2\) is unaffected by multicollinearity. Both samples deliver \(R^2 \approx 0.95\). The fit and the predictions are excellent; only the interpretation of the individual loadings is compromised.
VIF in the order of 10 (\(\rho \approx 0.95\) gives \(1/(1-0.9025) \approx 10\)) is the textbook diagnostic. The conventional rule of thumb is VIF \(> 10\) means trouble for inference on individual coefficients. The rule does not say “drop the regressor” — that wastes information. The right responses are to (i) report the sum of correlated coefficients rather than individuals, (ii) use Ridge regression which stabilises individual coefficients at the cost of bias, or (iii) construct an orthogonalised version (regress one on the other and use the residual) where the interpretive question demands it.
Exercise 4.4: Adjusted \(R^2\) as a model-selection tool
(a) Add a junk regressor to FF5 NVDA. Does \(R^2\) rise? Does adjusted \(R^2\) rise? (b) Report the \(t\) on junk; repeat 10 times; report the fraction with \(p < 0.05\). (c) Add a useful extra regressor (lagged NVDA). (d) Discuss the gap between “raises adjusted \(R^2\)” and “is statistically significant”.
Worked solution.
Discussion / common pitfalls.
On adjusted \(R^2\) vs significance. Adjusted \(R^2\) rises whenever the new regressor’s \(|t|\) exceeds approximately 1 (algebraically: \(\bar R^2\) increases iff \(F > 1\), and for one new regressor \(F = t^2\)). Conventional significance demands \(|t| > 2\). So the band \(1 < |t| < 2\) is precisely the zone where adjusted \(R^2\) increases even though the regressor would not be called statistically significant. Selecting models by adjusted \(R^2\) alone leads to over-parameterisation; selecting by \(p\)-value alone leads to under-parameterisation. The professional discipline is to use both (\(\bar R^2\) for fit, \(t\) for inference) plus a held-out validation sample.
On the junk regressor Type I rate. Under the null that junk is unrelated to \(y\), the \(p\)-value is uniformly distributed on [0, 1]. So \(\Pr(p < 0.05) = 0.05\) exactly. The simulation above will deliver a fraction close to 5%. This is not an error of the test — it is the test’s intended behaviour. What it means in practice: if you run 20 unrelated regressors through an FF5 model and pick the one with the lowest p, you will find a “significant” effect about \(1 - 0.95^{20} \approx 64\%\) of the time by pure chance. The defence is multiple-testing correction (Bonferroni, BH) and out-of-sample validation.
Exercise 4.5: Newey–West HAC vs classical SEs
Fit FF5 on NVDA twice: once classical, once HAC (5 lags). (a) Compute ratio HAC/classical for each SE. (b) Does any coefficient flip in significance? (c) Run Breusch–Godfrey LM test.
Worked solution.
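A sketch of the comparison on assumed simulated factor data (with i.i.d. residuals the SE ratios stay near one, as the surrounding discussion notes):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.diagnostic import acorr_breusch_godfrey

rng = np.random.default_rng(16)
n = 750
factors = pd.DataFrame(rng.normal(0, 0.01, (n, 5)),
                       columns=["MKT", "SMB", "HML", "RMW", "CMA"])
y = (0.0002
     + factors.to_numpy() @ np.array([1.5, -0.3, -0.4, 0.1, 0.0])
     + rng.normal(0, 0.018, n))

X = sm.add_constant(factors)
classical = sm.OLS(y, X).fit()
hac = sm.OLS(y, X).fit(cov_type="HAC", cov_kwds={"maxlags": 5})

print((hac.bse / classical.bse).round(3))      # (a) SE inflation; coefficients unchanged

lm_stat, lm_pval, _, _ = acorr_breusch_godfrey(classical, nlags=5)
print("Breusch-Godfrey LM p-value:", round(lm_pval, 3))   # (c)
```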
Discussion / common pitfalls. Three observations.
The HAC/classical ratio is rarely 1.0 in real data. Ratios of 1.2–1.5 are typical when there is mild residual autocorrelation; ratios of 2 or higher are warning signs of substantial serial correlation that the classical SEs are missing. The coefficients themselves are unchanged — HAC adjusts only the SEs.
If a coefficient flips from significant to insignificant under HAC, the classical inference was over-confident. The correct conclusion is the HAC one. In practice this most often happens for small coefficients close to the significance boundary; large, robustly-positive coefficients (like NVDA’s market beta) survive any reasonable SE correction.
The Breusch–Godfrey LM test diagnoses whether serial correlation is present. If it rejects, HAC is required. If it does not reject (which often happens on simulated i.i.d. residuals), HAC is unnecessary but does no harm. A reasonable production rule: always use HAC for financial time-series regressions, full stop.
Exercise 4.6: Attribution and a market-neutral hedge
Construct a \(\$1\)M long NVDA. Using FF5 loadings, (a) compute dollar hedges for each factor; (b) report gross notional of the factor-neutral book; (c) on the worst-day, decompose loss into factor contributions and residual.
Worked solution.
Discussion / common pitfalls.
On gross vs net notional. The factor-neutral book has \(\$1\)M net exposure (one long NVDA) but several million dollars gross exposure, summing across the long NVDA and the short hedges. Gross is what matters for transaction costs, margin requirements, and prime-broker financing fees; net is what matters for directional exposure. Funds that report only net are masking real risk-and-cost realities.
On the residual. After hedging out all five factors, what remains is \(\hat\varepsilon\) — the idiosyncratic component. For NVDA in 2023–2024, this is overwhelmingly AI-narrative noise: earnings surprises, GPU-shipment announcements, AI-conference keynotes, competitive responses from AMD and Intel. The factor hedge does nothing to neutralise that exposure.
Should you run NVDA factor-neutral? Argument for: it isolates the alpha that comes from idiosyncratic, name-specific edge — exactly what stock-pickers claim to deliver. Argument against: the residual on NVDA is large and dominated by news that is hard to forecast; running this as an isolated bet trades the (modestly explainable) factor risk for (largely unpredictable) news risk. A balanced view: factor-neutralise positions in pursuit of a systematic alpha (signals across many names), but bear the idiosyncratic risk only where you have a real informational edge on the name.
## Chapter 5 — Alpha Models and Machine Learning
Exercise 5.1: Ridge vs LASSO on collinear features
Generate \(N = 1000\), \(K = 20\) with features 0 and 1 nearly identical (\(\rho > 0.95\)) and equally signal-bearing. Fit Ridge and LASSO at a series of \(\lambda\). Plot coefficient paths. Verify LASSO assigns all weight to one feature; Ridge splits equally. Discuss which behaviour you prefer in alpha modelling.
Worked solution.
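A sketch of the experiment with assumed simulation parameters. Note that scikit-learn's Ridge and Lasso scale their penalties differently, so the two alpha grids below are not directly comparable; the qualitative contrast in how the (b0, b1) pair behaves is the point.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(14)
N, K = 1000, 20
X = rng.normal(0, 1, (N, K))
X[:, 1] = 0.98 * X[:, 0] + np.sqrt(1 - 0.98 ** 2) * rng.normal(0, 1, N)  # rho > 0.95
beta = np.zeros(K)
beta[0] = beta[1] = 0.5                       # features 0 and 1 equally signal-bearing
y = X @ beta + rng.normal(0, 1.0, N)

for a in (0.01, 0.05, 0.1, 0.5):
    ridge = Ridge(alpha=a * N).fit(X, y)      # scaled up so both paths shrink noticeably
    lasso = Lasso(alpha=a, max_iter=10_000).fit(X, y)
    print(f"alpha {a:>5}: ridge (b0, b1) = ({ridge.coef_[0]:+.3f}, {ridge.coef_[1]:+.3f})   "
          f"lasso (b0, b1) = ({lasso.coef_[0]:+.3f}, {lasso.coef_[1]:+.3f})")
```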
Discussion / common pitfalls. Ridge and LASSO impose qualitatively different penalties: Ridge uses \(\|\beta\|_2^2\) (smooth, differentiable, prefers all-small to one-big); LASSO uses \(\|\beta\|_1\) (kinked at zero, prefers sparse solutions). On a pair of nearly-collinear features, the Ridge optimum splits the weight roughly equally (the \(\ell_2\) ball touches the loss surface at a point where both \(\beta_0\) and \(\beta_1\) are nonzero); the LASSO optimum picks a corner of the \(\ell_1\) ball, sending one coefficient to zero.
Which to prefer in alpha modelling? It depends on the use case.
- If features 0 and 1 are literally the same signal (one is reversal-1-day, the other is reversal-2-day) and you want one cleaner predictor, LASSO’s selection is convenient: it picks the best of the pair and discards the redundant copy.
- If features 0 and 1 are related but distinct (one is short-term reversal, the other is short-term momentum, which are negatively correlated), the LASSO arbitrary-pick can be misleading — a small change in the data flips which feature is selected. Ridge gives a more stable interpretation.
- For pure forecast with no need to interpret the loadings, Ridge often slightly outperforms LASSO in cross-validation on financial data (more features, weaker individual signals, denser truth).
- For sparse model construction where parsimony matters (limited execution capacity, interpretability constraints), LASSO’s sparsity is exactly what you want.
The Elastic Net (a convex combination of Ridge and LASSO penalties) splits the difference and often dominates either alone.
Exercise 5.2: Walk-forward without look-ahead
Implement walk-forward HGBR with monthly retraining. Compare IC and Sharpe to annual retraining. Then introduce a look-ahead bug: standardise features using full-sample mean/var before splitting. Recompute IC. How much does the bug inflate performance?
Worked solution.
Discussion / common pitfalls. The leak-free version uses cross-sectional standardisation within each date independently — the operation depends only on the current date’s data, so no information from the future leaks into the training window. The buggy version standardises across all dates jointly, mixing future cross-sections into the training-window features. For this kind of pure feature-standardisation leak the inflation is usually modest (a few percent of IC) because the across-date mean/var is approximately constant. The lesson is structural: any operation that touches the test set must be a function only of past data. The discipline applies to mean/variance, to feature engineering, to label construction, and to model selection.
Monthly vs annual retraining. Monthly retraining is more expensive but adapts faster to regime change. The trade-off has no universal answer: it depends on (i) the speed of regime change in the data-generating process, (ii) the noise floor (more-frequent retrains overfit to local noise), and (iii) the operational cost of model updates. A defensible default for systematic equity is monthly or quarterly retraining with a rolling 36–60 month window.
Exercise 5.3: Information Coefficient by sub-period
Compute cumulative IC of HGBR separately over four contiguous five-year sub-periods. Are they roughly equal, or is one period responsible for most of the cumulative IC?
Worked solution.
Discussion / common pitfalls. The aggregated IC across the full 20-year sample may look respectable (say 0.05) but masks a four-bucket profile in which two sub-periods deliver IC = 0.10 and two sub-periods deliver IC = 0.00. The strategy looks robust on the aggregate; it is in fact concentrated risk.
The implication for deployment is immediate. If the live regime resembles the dormant ones, the model will produce noise-only predictions; the IC = 0.05 aggregate is irrelevant to next month’s P&L. A risk-aware portfolio manager will (i) shrink the position size in the live deployment based on the minimum sub-period IC rather than the mean, (ii) monitor a live IC tracker and de-allocate when the IC falls below a threshold, and (iii) maintain a portfolio of uncorrelated alpha signals so that no single regime change can kill the strategy outright.
Exercise 5.4: Top-K is not the only construction
Modify the lab’s portfolio construction to use decile sorts. Then implement a score-weighted version. Compare Sharpe, turnover, and concentration (Herfindahl) across three constructions.
Worked solution.
Discussion / common pitfalls. Three observations.
Top-K and decile constructions deliver similar Sharpe ratios on most datasets — they trade off slightly different things (Top-K is fixed-cardinality, decile is fixed-fraction; turnover differs depending on cross-section size). Score-weighted constructions usually deliver the highest Sharpe, because they allocate more weight to higher-conviction names, but they also have the highest concentration (HHI), so position-size risk concentrates in a few names.
Turnover, not Sharpe, is what dominates real-money P&L after costs. A 0.6 Sharpe strategy with 0.5 monthly turnover beats a 0.8 Sharpe strategy with 2.0 monthly turnover after typical equity transaction costs (~10 bps round-trip). The professional discipline is to monitor cost-adjusted Sharpe, not gross Sharpe.
Concentration matters for capacity: a strategy with HHI = 0.05 (≈20 effective positions) can scale to \(\$10\)B; a strategy with HHI = 0.20 (≈5 effective positions) saturates at maybe \(\$100\)M. Top-K = 30 produces HHI = 1/30 = 0.033 by construction; the score-weighted version produces higher HHI if the score distribution has heavy tails.
Exercise 5.5: Permutation vs gain importance
Train HGBR on the simulated panel. Compute (a) model.feature_importances_, (b) permutation importance on a held-out month, (c) univariate IC of each feature. Tabulate. Do they agree on top three? Bottom three?
Worked solution.
Discussion / common pitfalls. The three measures often disagree, especially for features that contribute through interactions (like f2 and f3 here). Univariate IC sees f2 alone as marginal because its predictive content surfaces only when interacted with f3. Gain importance sees the split where f2 is used — high once the model has learned the interaction. Permutation importance measures the drop in out-of-sample \(R^2\) when the feature is shuffled — typically the most defensible measure of predictive contribution.
Practical guidance:
- Gain importance is fast, in-sample, and biased toward high-cardinality features.
- Permutation importance is slower, out-of-sample, model-agnostic, and the gold standard.
- Univariate IC is the simplest sanity check; if a feature has zero univariate IC and zero permutation importance, it is almost certainly noise.
Disagreement between gain and permutation usually flags either an interaction effect (the feature is used with other features) or an overfit signature (the in-sample model exploits the feature, but the exploitation does not survive permutation on held-out data).
Exercise 5.6: Drift simulation
Modify the simulated panel so that beginning at month 180 the relationship \(X_1 \cdot X_2 \to R\) flips sign. Rerun walk-forward with annual retraining. When does the model recover? Compare a 36-month rolling window.
Worked solution.
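A minimal sketch of the modified panel plus a sanity check that the interaction really flips sign at month 180; the coefficients and column names are illustrative, and the lab's walk-forward retraining loop is then run unchanged on this panel.

```{pyodide-python}
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)

# Hypothetical 240-month panel: the interaction signal flips sign at month 180.
months, names = 240, 100
rows = []
for t in range(months):
    x1, x2 = rng.normal(size=names), rng.normal(size=names)
    sign = 1.0 if t < 180 else -1.0                    # the regime change
    ret = sign * 0.05 * x1 * x2 + rng.normal(scale=0.05, size=names)
    rows.append(pd.DataFrame({"month": t, "x1": x1, "x2": x2, "fwd_ret": ret}))
panel = pd.concat(rows, ignore_index=True)

# Sanity check: the interaction correlates positively pre-flip, negatively post-flip.
pre, post = panel[panel["month"] < 180], panel[panel["month"] >= 180]
print("corr(x1*x2, ret) pre-flip :",
      round(np.corrcoef(pre["x1"] * pre["x2"], pre["fwd_ret"])[0, 1], 3))
print("corr(x1*x2, ret) post-flip:",
      round(np.corrcoef(post["x1"] * post["x2"], post["fwd_ret"])[0, 1], 3))
```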
Discussion / common pitfalls. Expanding-window training contains pre-flip data that contradicts the post-flip relationship. Even after the flip, the expanding model gives substantial weight to the old (now-wrong) regime, so IC degrades and recovers slowly. The rolling-36 window drops the pre-flip data as soon as 36 months have passed; the model is then trained purely on post-flip data and IC recovers fully.
The trade-off. Rolling windows adapt fast but have noisier coefficient estimates (smaller sample). Expanding windows are statistically efficient under stationarity but slow to adapt under regime change. The right window length is a hyperparameter, not a fact of nature — it depends on the speed of expected regime change in your data. The right meta-decision is to monitor the rolling-window IC and compare it to the expanding-window IC: if rolling consistently dominates, the data is non-stationary and rolling should be used; if expanding consistently dominates, the data is stationary and expanding should be used.
A continuous version of this idea is exponentially weighted training: older observations contribute less weight in the loss function. This generalises both extremes (uniform expanding window ↔︎ decay rate 0; uniform rolling window ↔︎ decay rate \(\infty\)) and is the standard production form in trading.
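A minimal sketch of the decay-weight construction, with `half_life` as a hypothetical tuning parameter; the resulting vector, repeated across each month's cross-section, is what gets passed as `sample_weight` to the fit call.

```{pyodide-python}
import numpy as np

def decay_weights(n_months, half_life):
    """Exponentially decaying weights: the most recent month gets weight 1;
    a month that is half_life months older gets weight 0.5, and so on."""
    ages = np.arange(n_months)[::-1]     # age 0 = most recent month
    return 0.5 ** (ages / half_life)

w = decay_weights(n_months=120, half_life=24)
print("oldest three months:", w[:3].round(3))
print("newest three months:", w[-3:].round(3))
```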
Exercise 5.7: A capstone backtest
Pick a public dataset (S&P 500 constituents via yfinance), engineer features (momentum, short-term reversal, idio-vol, MA crossover), build the walk-forward HGBR pipeline, and report cumulative IC, top-30/bottom-30 Sharpe, and max drawdown. Write a one-page memo defending design choices.
Worked solution. Pyodide cannot reach yfinance, so we provide a synthetic panel of 200 names × 60 months that demonstrates the full pipeline. The structure is identical to the real-data case; only the data source changes.
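A compact sketch of the pipeline on the synthetic stand-in: 200 names, 60 months, four feature columns with one planted momentum-like signal. The feature engineering and reporting are deliberately stripped down relative to the lab, and all variable names are illustrative.

```{pyodide-python}
import numpy as np
import pandas as pd
from scipy.stats import spearmanr
from sklearn.ensemble import HistGradientBoostingRegressor

rng = np.random.default_rng(4)
n_names, n_months = 200, 60
features = ["momentum", "reversal", "idio_vol", "ma_cross"]

# Synthetic panel: one row per (month, name), four features, a weak planted signal.
frames = []
for t in range(n_months):
    X = pd.DataFrame(rng.normal(size=(n_names, 4)), columns=features)
    X["month"] = t
    X["fwd_ret"] = (0.02 * X["momentum"] - 0.01 * X["reversal"]
                    + rng.normal(scale=0.05, size=n_names))
    frames.append(X)
panel = pd.concat(frames, ignore_index=True)

# Walk-forward: retrain annually on the trailing 12 months, predict month by month.
ics, spreads = [], []
model = None
for t in range(12, n_months):
    if (t - 12) % 12 == 0:                                   # annual retraining
        train = panel[(panel["month"] >= t - 12) & (panel["month"] < t)]
        model = HistGradientBoostingRegressor(random_state=0).fit(
            train[features], train["fwd_ret"])
    test = panel[panel["month"] == t]
    pred = pd.Series(model.predict(test[features]), index=test.index)
    ics.append(spearmanr(pred, test["fwd_ret"])[0])           # monthly rank IC
    long_leg = test.loc[pred.nlargest(30).index, "fwd_ret"].mean()
    short_leg = test.loc[pred.nsmallest(30).index, "fwd_ret"].mean()
    spreads.append(long_leg - short_leg)                      # top-30 minus bottom-30

ics, spreads = pd.Series(ics), pd.Series(spreads)
equity = (1 + spreads).cumprod()
print(f"cumulative IC           : {ics.sum():.2f}")
print(f"long-short Sharpe (ann.): {spreads.mean() / spreads.std() * np.sqrt(12):.2f}")
print(f"max drawdown            : {(equity / equity.cummax() - 1).min():.2%}")
```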
One-page memo (template).
Memo: Capstone HGBR Cross-Sectional Equity Alpha
Strategy. Cross-sectional ranking of S&P 500 constituents on next-month return predictions from an HGBR model trained on four canonical features: 12-month skip-1-month momentum, 1-month reversal, 12-month idiosyncratic volatility, and a 3M-vs-12M moving-average crossover. Walk-forward with annual retraining on the last 12 months. Top-30 long / Bottom-30 short, equally weighted, monthly rebalance.
Why these features? Each is a well-documented anomaly with an out-of-sample track record (Jegadeesh and Titman 1993 for momentum; Jegadeesh 1990 for short-term reversal; Ang et al. 2006 for idiosyncratic volatility). Including the crossover adds a second-order timing component without doubling the feature count. Combining four weak-signal features through a non-linear model (HGBR) lets the learner detect interactions (e.g., reversal is more potent for high-vol names) that linear stacking would miss.
Why annual retraining? Monthly retraining is more expensive and slightly more responsive but yields negligible IC improvement on this universe and timescale. Annual retraining strikes a defensible balance between computational cost and regime adaptability.
Why K = 30? With ~500 names in the universe, top-30 is the top 6% — narrow enough to concentrate alpha but wide enough to avoid catastrophic single-name risk. Decile (top 50) would diversify the alpha across more names but dilute it; top-10 would concentrate too aggressively. The choice is also bounded by execution capacity at the target AUM.
Risk constraints. None imposed beyond cardinality. In production this would be expanded with sector neutrality (cross-sectionally demean within each GICS sector), size neutrality (split top-K within each market-cap tercile), and gross exposure limits.
Performance. The synthetic backtest delivers an annualised Sharpe of roughly 0.4–0.6 (the real-data Sharpe would depend on data quality). The number is honest in the sense that it is produced walk-forward with the standard ranking-based construction; it is not yet adjusted for trading costs.
Discussion / common pitfalls. A memo this short cannot be wrong if every claim is honest about its limits. The pitfalls real PMs face are sins of omission, not of commission: silent look-ahead (forgetting to lag a feature by one period), silent survivorship bias (using the current S&P 500 constituent list across history rather than the as-of constituent list), and silently ignored trading costs (no slippage). A defensible memo explicitly addresses each of these: how was look-ahead avoided? Did the universe match the live-trade universe at each date? What is the gross-of-cost Sharpe vs net-of-cost Sharpe?
Chapter 6 (Appendix A) — Linear Algebra
Exercise 6.1: Shape arithmetic
For each pair, state the shape of the product (or that it is undefined).
(a) \(\mathbf{A}\) is \(4 \times 3\), \(\mathbf{B}\) is \(3 \times 5\). Compute \(\mathbf{AB}\).
Inner dimensions agree (\(3 = 3\)). Outer dimensions give the result shape: \(\mathbf{AB}\) is \(4 \times 5\).
(b) Same \(\mathbf{A}\) and \(\mathbf{B}\). Compute \(\mathbf{BA}\).
Inner dimensions are \(5\) (from \(\mathbf{B}\)’s columns) and \(4\) (from \(\mathbf{A}\)’s rows). \(5 \ne 4\), so \(\mathbf{BA}\) is undefined.
(c) \(\mathbf{X}\) is \(120 \times 6\). Shapes of \(\mathbf{X}^\top \mathbf{X}\) and \(\mathbf{XX}^\top\)?
- \(\mathbf{X}^\top\) is \(6 \times 120\). Then \(\mathbf{X}^\top \mathbf{X}\) has inner \(120 = 120\) ✓ and shape \(6 \times 6\).
- \(\mathbf{XX}^\top\) has inner \(6 = 6\) ✓ and shape \(120 \times 120\).
The first is the Gram matrix of regressors; the second is the Gram matrix of observations. In OLS we always work with the first (\(6 \times 6\), easy to invert); the second grows with the number of observations (\(120 \times 120\) here) and becomes prohibitively large to materialise for realistic sample sizes.
(d) \(\mathbf{a}, \mathbf{b}\) both vectors of length 5. Shapes of \(\mathbf{a}^\top \mathbf{b}\) and \(\mathbf{a}\mathbf{b}^\top\)?
- \(\mathbf{a}^\top \mathbf{b}\) is the inner product: \((1 \times 5) \cdot (5 \times 1) = 1 \times 1\), a scalar.
- \(\mathbf{a}\mathbf{b}^\top\) is the outer product: \((5 \times 1) \cdot (1 \times 5) = 5 \times 5\), a matrix.
Discussion / common pitfalls. The single most common error is to write \(\mathbf{BA}\) when you meant \(\mathbf{AB}\) — matrix multiplication is not commutative. The shape-check (\(\text{cols of left} = \text{rows of right}\)) is the first thing to run on any new expression, and it catches roughly 80% of these mistakes in seconds.
The second most common error is to confuse inner and outer products of vectors. NumPy will silently interpret a * b as elementwise multiplication (not inner product), and a[:, None] * b[None, :] as broadcasting (the outer product). Read the dimensions before reaching for the operator.
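A quick cell confirming the shapes in (a) through (d), plus the NumPy operators that produce the inner and outer products:

```{pyodide-python}
import numpy as np

A = np.ones((4, 3))
B = np.ones((3, 5))
X = np.ones((120, 6))
a = np.arange(5.0)
b = np.arange(5.0, 10.0)

print((A @ B).shape)           # (4, 5)
# B @ A raises ValueError: inner dimensions 5 and 4 do not match.
print((X.T @ X).shape)         # (6, 6)     Gram matrix of regressors
print((X @ X.T).shape)         # (120, 120) Gram matrix of observations
print(a @ b)                   # inner product: a scalar
print(np.outer(a, b).shape)    # outer product: (5, 5)
print((a * b).shape)           # elementwise product, NOT the inner product
```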
Exercise 6.2: OLS by hand
\(\mathbf{y} = (2, 4)^\top\), \(\mathbf{X} = \begin{pmatrix} 1 & 1 \\ 1 & 2 \end{pmatrix}\). Compute \(\mathbf{X}^\top \mathbf{X}\), \(\mathbf{X}^\top \mathbf{y}\), then \(\hat{\boldsymbol\beta} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}\) by hand. Verify with NumPy.
Worked solution.
Step 1: \(\mathbf{X}^\top \mathbf{X}\).
\[\mathbf{X}^\top = \begin{pmatrix} 1 & 1 \\ 1 & 2 \end{pmatrix}, \qquad \mathbf{X}^\top \mathbf{X} = \begin{pmatrix} 1 & 1 \\ 1 & 2 \end{pmatrix}\begin{pmatrix} 1 & 1 \\ 1 & 2 \end{pmatrix} = \begin{pmatrix} 1\cdot 1 + 1 \cdot 1 & 1\cdot 1 + 1 \cdot 2 \\ 1 \cdot 1 + 2 \cdot 1 & 1 \cdot 1 + 2 \cdot 2 \end{pmatrix} = \begin{pmatrix} 2 & 3 \\ 3 & 5 \end{pmatrix}.\]
Step 2: \(\mathbf{X}^\top \mathbf{y}\).
\[\mathbf{X}^\top \mathbf{y} = \begin{pmatrix} 1 & 1 \\ 1 & 2 \end{pmatrix}\begin{pmatrix} 2 \\ 4 \end{pmatrix} = \begin{pmatrix} 2 + 4 \\ 2 + 8 \end{pmatrix} = \begin{pmatrix} 6 \\ 10 \end{pmatrix}.\]
Step 3: \((\mathbf{X}^\top \mathbf{X})^{-1}\) via the \(2 \times 2\) formula.
Determinant: \(\det = 2 \cdot 5 - 3 \cdot 3 = 10 - 9 = 1\).
\[(\mathbf{X}^\top \mathbf{X})^{-1} = \frac{1}{1} \begin{pmatrix} 5 & -3 \\ -3 & 2 \end{pmatrix} = \begin{pmatrix} 5 & -3 \\ -3 & 2 \end{pmatrix}.\]
Step 4: \(\hat{\boldsymbol\beta} = (\mathbf{X}^\top \mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}\).
\[\hat{\boldsymbol\beta} = \begin{pmatrix} 5 & -3 \\ -3 & 2 \end{pmatrix}\begin{pmatrix} 6 \\ 10 \end{pmatrix} = \begin{pmatrix} 30 - 30 \\ -18 + 20 \end{pmatrix} = \begin{pmatrix} 0 \\ 2 \end{pmatrix}.\]
So \(\hat\alpha = 0\) and \(\hat\beta = 2\).
NumPy verification:
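The verification cell, using both the explicit inverse and np.linalg.solve (the form recommended below):

```{pyodide-python}
import numpy as np

X = np.array([[1.0, 1.0],
              [1.0, 2.0]])
y = np.array([2.0, 4.0])

XtX = X.T @ X
Xty = X.T @ y
print(XtX)                          # [[2. 3.] [3. 5.]]
print(Xty)                          # [ 6. 10.]
print(np.linalg.inv(XtX) @ Xty)     # [0. 2.]  matches the hand computation
print(np.linalg.solve(XtX, Xty))    # same answer, no explicit inverse
```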
Discussion / common pitfalls. Three checks.
The result \(\hat{\boldsymbol\beta} = (0, 2)\) corresponds to a fitted line \(y = 0 + 2x\). Plugging back: at \(x = 1\), \(\hat y = 2\); at \(x = 2\), \(\hat y = 4\). Both match the original \(y\). With only two data points and two free parameters, OLS exactly interpolates — the residuals are zero. This is a feature of the algebra (the system is square and invertible), not skill of the model.
Always confirm \(\det(\mathbf{X}^\top \mathbf{X}) \ne 0\) before inverting. A zero determinant means singular design — typically two collinear columns — and the system has no unique solution.
For larger problems, np.linalg.solve(XtX, Xty) is preferred to np.linalg.inv(XtX) @ Xty: solve is faster, more numerically stable, and never materialises the explicit inverse (which is rarely needed for itself).
Exercise 6.3: When does the inverse fail?
Build a \(3 \times 2\) design where column 2 is twice column 1. Show that np.linalg.inv(X.T @ X) raises LinAlgError. Then add \(\lambda \mathbf{I}\) and confirm invertibility.
Worked solution.
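One possible construction (any \(3 \times 2\) design with proportional columns works):

```{pyodide-python}
import numpy as np

# 3x2 design whose second column is exactly twice the first: rank 1, not 2.
X = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [3.0, 6.0]])
XtX = X.T @ X
print("det(X'X) =", np.linalg.det(XtX))      # 0 (singular)

try:
    np.linalg.inv(XtX)
except np.linalg.LinAlgError as err:
    print("inv failed:", err)                # Singular matrix

# Ridge repair: add lambda to the diagonal and the matrix becomes invertible.
lam = 0.1
print("eigenvalues after ridge:",
      np.linalg.eigvalsh(XtX + lam * np.eye(2)).round(3))
print(np.linalg.inv(XtX + lam * np.eye(2)))  # no error
```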
Why does this happen? When column 2 is twice column 1, the two columns of \(\mathbf{X}\) are linearly dependent — they span a one-dimensional subspace, not a two-dimensional one. The matrix \(\mathbf{X}^\top \mathbf{X}\) inherits this dependence: its rank is 1, not 2, and its determinant is exactly 0. A singular matrix has no inverse.
Why does Ridge fix it? Adding \(\lambda \mathbf{I}\) adds \(\lambda\) to every diagonal entry. This shifts the eigenvalues up by \(\lambda\). The eigenvalues of \(\mathbf{X}^\top \mathbf{X}\) were \(\{0, \text{trace}\}\); adding \(\lambda \mathbf{I}\) makes them \(\{\lambda, \text{trace} + \lambda\}\). Both are strictly positive, so the matrix is invertible. The cost is bias in \(\hat{\boldsymbol\beta}\): Ridge shrinks toward zero, more strongly for directions where the data has less information (small eigenvalue).
Discussion / common pitfalls. Exact collinearity is rare in real data. Approximate collinearity is everywhere — two features that are 99% correlated produce a determinant of \(\mathbf{X}^\top \mathbf{X}\) that is positive but tiny. The numerical inverse exists, but its entries are enormous, and the resulting coefficient estimates are wildly unstable. Ridge regularisation tames this. The conventional cross-validation choice of \(\lambda\) trades off bias (from shrinkage) against variance (from instability) and finds the sweet spot on held-out data.
Exercise 6.4: Predictions as matrix products
\(\hat{\boldsymbol\beta} = (0.001, 1.10, -0.30)\) for intercept + Mkt + SMB. New months:
| month | Mkt | SMB |
|---|---|---|
| Jan | 0.020 | -0.005 |
| Feb | -0.015 | 0.010 |
| Mar | 0.008 | 0.000 |
(a) Build \(\mathbf{X}_{\text{new}}\). (b) Compute \(\hat{\mathbf{y}}_{\text{new}}\) by hand and with NumPy.
Worked solution.
(a) Design matrix. The first column is a column of 1s (the intercept); the next two columns are the factor values.
\[\mathbf{X}_{\text{new}} = \begin{pmatrix} 1 & 0.020 & -0.005 \\ 1 & -0.015 & 0.010 \\ 1 & 0.008 & 0.000 \end{pmatrix}.\]
(b) Predictions by hand.
\[\hat y_{\text{Jan}} = 0.001 + 1.10 \cdot 0.020 + (-0.30) \cdot (-0.005) = 0.001 + 0.022 + 0.0015 = 0.0245.\]
\[\hat y_{\text{Feb}} = 0.001 + 1.10 \cdot (-0.015) + (-0.30) \cdot 0.010 = 0.001 - 0.0165 - 0.003 = -0.0185.\]
\[\hat y_{\text{Mar}} = 0.001 + 1.10 \cdot 0.008 + (-0.30) \cdot 0.000 = 0.001 + 0.0088 + 0 = 0.0098.\]
So \(\hat{\mathbf{y}}_{\text{new}} = (0.0245, -0.0185, 0.0098)^\top\).
NumPy verification:
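The matrix-product form of the same computation:

```{pyodide-python}
import numpy as np

beta_hat = np.array([0.001, 1.10, -0.30])
X_new = np.array([[1.0,  0.020, -0.005],
                  [1.0, -0.015,  0.010],
                  [1.0,  0.008,  0.000]])

print(X_new @ beta_hat)    # [ 0.0245 -0.0185  0.0098]
```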
Discussion / common pitfalls. Three things to internalise.
The intercept column of 1s is not optional. Forgetting it yields predictions that are zero whenever all features are zero — a strong implicit constraint that is rarely warranted.
The matrix-product form \(\hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol\beta}\) scales to thousands of new observations and millions of features with no change to the code. The per-row hand computation does not. This is the entire reason linear algebra is the language of statistics: one expression covers all scales.
The dot product of each row with \(\hat{\boldsymbol\beta}\) is the prediction. Reading \(\mathbf{X}\hat{\boldsymbol\beta}\) as “each row of \(\mathbf{X}\) dotted into \(\hat{\boldsymbol\beta}\), in parallel” is the right mental model.
Exercise 6.5: Recognising the formula in the wild
Identify the chapter, state shapes, and explain in one sentence what each computes.
(a) \(\hat{\boldsymbol\beta} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}\) — OLS coefficient vector.
- Appears in: Chapter 3 (simple regression / CAPM), Chapter 4 (multi-factor models).
- Shapes: \(\mathbf{X}\) is \(n \times p\); \(\mathbf{y}\) is \(n \times 1\); \(\mathbf{X}^\top \mathbf{X}\) is \(p \times p\); \(\mathbf{X}^\top \mathbf{y}\) is \(p \times 1\); \(\hat{\boldsymbol\beta}\) is \(p \times 1\).
- Meaning: the OLS estimate of the coefficient vector — the unique minimiser of \(\|\mathbf{y} - \mathbf{X}\boldsymbol\beta\|_2^2\) when \(\mathbf{X}^\top\mathbf{X}\) is invertible.
(b) \(\hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol\beta}\) — fitted values / predictions.
- Appears in: Chapter 3 (in-sample fit), Chapter 4 (predictions from FF3/FF5), Chapter 5 (out-of-sample alpha predictions).
- Shapes: \(\mathbf{X}\) is \(n \times p\); \(\hat{\boldsymbol\beta}\) is \(p \times 1\); \(\hat{\mathbf{y}}\) is \(n \times 1\).
- Meaning: the prediction vector — each entry is the dot product of the corresponding row of \(\mathbf{X}\) with the coefficient vector.
(c) \(\hat{\boldsymbol\beta}_{\text{Ridge}} = (\mathbf{X}^\top \mathbf{X} + \lambda \mathbf{I})^{-1} \mathbf{X}^\top \mathbf{y}\) — Ridge coefficient vector.
- Appears in: Chapter 5 (regularised alpha models).
- Shapes: \(\mathbf{X}\) is \(n \times p\); \(\mathbf{I}\) is \(p \times p\); \(\hat{\boldsymbol\beta}_{\text{Ridge}}\) is \(p \times 1\).
- Meaning: the OLS estimate biased toward zero by a penalty \(\lambda\) on the squared coefficient norm — stabilises ill-conditioned designs and reduces variance at the cost of bias.
(d) \(\hat{\boldsymbol\beta}_{\text{LASSO}} = \arg\min_{\boldsymbol\beta} \|\mathbf{y} - \mathbf{X}\boldsymbol\beta\|_2^2 + \lambda \|\boldsymbol\beta\|_1\) — LASSO coefficient vector.
- Appears in: Chapter 5 (sparse alpha models).
- Shapes: same as OLS — \(\hat{\boldsymbol\beta}_{\text{LASSO}}\) is \(p \times 1\).
- Meaning: the regression estimate with an \(\ell_1\) penalty that drives many coefficients exactly to zero, producing a sparse model. No closed form; solved iteratively (coordinate descent in sklearn).
Discussion / common pitfalls. The four formulae are the language of every supervised-learning algorithm in this book. OLS is the closed-form solution under the assumption that \(\mathbf{X}^\top\mathbf{X}\) is invertible. Ridge adds a stabilising term to the inverse and retains closed form. LASSO replaces the \(\ell_2\) penalty with \(\ell_1\), gaining sparsity at the cost of closed-form analytical convenience. Tree ensembles (HGBR, Random Forest) abandon the linear-in-parameters form entirely but reuse the loss function \(\|\mathbf{y} - \hat{\mathbf{y}}\|_2^2\) as the splitting criterion.
The single most important thing to remember: \(\mathbf{X}\boldsymbol\beta\) is the universal expression for “linear combination of features.” Whether \(\boldsymbol\beta\) is fit by OLS, Ridge, LASSO, or any other procedure, the prediction step is always \(\hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol\beta}\). Once you have read that expression three or four times in different chapters, the rest of the machinery becomes a vocabulary of variations on a single theme.
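A small demonstration, on an arbitrary simulated design, that the Ridge closed form matches scikit-learn's Ridge (fit_intercept=False so the two objectives coincide) and that LASSO produces exact zeros; the data and penalty values are illustrative.

```{pyodide-python}
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(5)
n, p = 200, 6
X = rng.normal(size=(n, p))
beta_true = np.array([1.5, -0.8, 0.0, 0.0, 0.5, 0.0])
y = X @ beta_true + rng.normal(scale=0.5, size=n)

lam = 10.0
# Closed-form Ridge: (X'X + lambda I)^{-1} X'y
beta_closed = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
beta_sklearn = Ridge(alpha=lam, fit_intercept=False).fit(X, y).coef_
print("ridge, closed form:", beta_closed.round(3))
print("ridge, sklearn    :", beta_sklearn.round(3))

# LASSO has no closed form; coordinate descent sets some coefficients exactly to zero.
beta_lasso = Lasso(alpha=0.1, fit_intercept=False).fit(X, y).coef_
print("lasso             :", beta_lasso.round(3))
```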
Closing notes
If you have worked all the way through this appendix — every exercise, every code cell, every discussion paragraph — you have rehearsed the entire mechanical vocabulary of the book. The substantive judgement that turns this vocabulary into research, into strategy, and into capital allocation is the work of the next decade. The exercises above are scaffolding for that work, not a destination.
Two final pieces of advice.
First, the gap between understanding a method and deploying a method is wider than any textbook lets on. Every exercise above has been engineered to converge. Real data does not converge. Real data has missing rows, mis-typed dates, survivorship bias, look-ahead bugs, asynchronous timestamps, regime changes, and structural breaks. The exercise of going from a textbook-clean version of a model to a production-grade version of that same model takes years and is the single most underrated source of professional edge in this discipline.
Second, the methods in this book are intermediate. They are the entry tariff to systematic finance, not the destination. The frontier — modern factor models, deep-learning return prediction, reinforcement-learning execution, alternative-data integration — sits on top of the foundations laid here. If you cannot fluently switch between scalar OLS, matrix OLS, regularised regression, and tree ensembles, you cannot meaningfully engage the frontier literature. The exercises above are the price of admission.
Good luck. Run the cells. Break the code. Fix it. Repeat.