
Chapter 1: Pandas Foundations

What you’ll learn

Pandas is the Python library that lets you work with tables of data the same way you would in Excel — but using code, so the steps are reproducible and the tables can be much larger. This chapter walks you through its two core building blocks:

  1. The Series — one column of data, where each row has a label (like a date or a ticker symbol). Think of it as a single Excel column that remembers which row each value belongs to. Most of the bugs beginners hit come from how Series line up by label, so we spend real time there.
  2. The DataFrame — a whole spreadsheet you can manipulate with code. Rows are records, columns are fields, and you can sort, filter, group, or summarise in one line.

By the end you’ll be able to: read a CSV file, line up two columns by date, pick out rows in three different ways, deal with missing values, summarise data by group, glue two tables together, and stitch all of those steps into one clean pipeline.

Why the Series is the workhorse

Where this shows up. Pretty much any data you’ll touch as a Year 2 student — your weekly study hours, daily step counts on your phone, the closing price of Apple stock each day, the temperature in Hong Kong over a month — is a list of numbers, where each number is tied to a label (a week, a day, a date). A pandas Series is just that idea, made into a Python object you can compute on.

In a spreadsheet, a column is the natural unit of analysis. A column of stock closing prices, a column of customer ages, a column of monthly revenues — each is a list of values, where each row is labelled by something (a date, a customer ID, a month). Pandas formalises this idea as the Series: a list of values paired with an explicit index of labels. (The “index” here is just pandas’ name for the column of labels on the left.)

Almost every operation you’ll do in this course — arithmetic on a whole column, descriptive statistics, handling missing values, summarising by group, working with time series, plotting — is first defined on a Series and then carried over to multi-column DataFrames. If you understand Series cleanly, the rest of pandas is just doing the same thing to many columns at once.

Three properties make the Series the right tool for the work you’ll do:

  1. Labels, not just positions. Every value carries a label. Stock prices are labelled by date, revenues by month, returns by ticker (a ticker is the short code for a stock — AAPL for Apple, MSFT for Microsoft). Because the label sticks to the value through every operation, you stop making the classic “off-by-one” mistakes that come from working with raw row numbers.
  2. Automatic alignment. When you combine two Series — say, adding this week’s revenue to last week’s, or subtracting a benchmark return from a stock return — pandas first lines them up by label, then does the arithmetic. If a label is in one Series but not the other, you get NaN (more on that below) instead of pandas silently mixing the wrong rows together.
  3. Everything works on the whole column at once. Arithmetic, comparisons, statistics, and string operations all act on the whole Series in one go. (The technical word for this is vectorised — you don’t write a loop, you just apply the operation to the column.) Behind the scenes pandas calls fast compiled code, so a million-row computation finishes in tens of milliseconds instead of minutes.

Where this shows up in a real job

At a hedge fund or any data-driven company, a single time series — a price history, a daily sentiment score, a user’s daily logins — is stored as a Series whose labels are dates. A cross-sectional snapshot — market values for every stock today, or sign-up counts for every region today — is also a Series, but the labels are tickers or region names instead of dates. A DataFrame is literally a bundle of such Series sharing the same row labels. So if you can do something to one Series, you can do it to every column of a DataFrame just by writing the same code.

Building a Series

Why this matters. Before you can analyse anything, you have to get your data into pandas. Most of the time it comes from a file, but in tutorials and quick experiments you’ll often build a Series by hand from a Python list or dictionary. Knowing the three common ways to do this saves you from copy-pasting code you don’t understand.

The function pd.Series(data, index=..., name=...) builds a Series. It accepts a list, a dictionary, a NumPy array, or a single number. Three patterns cover almost every case you’ll see.

From a list with a default index

Let’s start as simply as possible: a list of five numbers, no labels.
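A minimal sketch (the five price values here are illustrative):

import pandas as pd

price = pd.Series([400, 100, 600, 200, 500], name="Close")
print(price)
# 0    400
# 1    100
# 2    600
# 3    200
# 4    500
# Name: Close, dtype: int64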

What just happened? You built a Series of five prices and gave the column the name "Close" (short for “closing price”). The output prints the row numbers 0..4 on the left (these are the labels, which pandas calls the index) and the values on the right. The trailing Name: Close, dtype: int64 tells you the column’s name and that pandas figured out the values are integers (int64 is just the standard 64-bit integer type). When you don’t pass an index=, pandas makes up a boring 0, 1, 2, ... index for you. The very first thing you should usually do is replace it with something meaningful.

From a list with a custom index

Now let’s give those numbers real meaning. We’ll attach a stock ticker to each one.
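A sketch reusing the same illustrative prices (the discussion below assumes AAPL's price is 100):

price = pd.Series([400, 100, 600, 200, 500],
                  index=["MSFT", "AAPL", "QQQ", "TSM", "NVDA"],
                  name="Close")
price["AAPL"]   # 100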

What just happened? You built a Series where each price is labelled with the company’s ticker (MSFT = Microsoft, AAPL = Apple, QQQ is an exchange-traded fund tracking the Nasdaq-100, TSM = Taiwan Semiconductor, NVDA = Nvidia). Now you can ask for price["AAPL"] and get 100 back directly — you never have to remember that Apple happens to be at position 1. This is the “snapshot across many things” pattern: one number per item, at one moment in time.

From a dict

A Python dict already pairs keys with values, so passing it to pd.Series is the most literal way to build one:
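A sketch with made-up numbers:

price_dict = {"MSFT": 400, "AAPL": 100, "QQQ": 600}
pd.Series(price_dict)   # keys become the index, values become the data

pd.Series(price_dict, index=["AAPL", "GOOG"])
# AAPL    100.0
# GOOG      NaN   <- in index= but not in the dict; MSFT and QQQ are dropped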

The dict keys become the index, the dict values become the data. If you also pass index=[...], pandas reindexes the result: keys present in index but missing from the dict yield NaN; keys in the dict but absent from index are dropped.

From a scalar (broadcast)

Passing a scalar plus an explicit index broadcasts the value to every label — handy for initialising a Series of zeros, ones, or any constant.
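For example, a Series of zero portfolio weights (the labels are illustrative):

weights = pd.Series(0.0, index=["MSFT", "AAPL", "QQQ", "TSM", "NVDA"], name="weight")
# every label gets the value 0.0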

You can also build a Series from a NumPy array, from a column of a DataFrame (df["Close"] returns a Series), or from pd.read_csv(...)["col"]. All of these are common entry points.

Naming matters

Always set name= to something descriptive. It appears as the y-axis label when you call .plot(), as the column header when the Series is inserted into a DataFrame, and as the legend entry when overlaying multiple series. A nameless Series is the equivalent of an Excel column with no header.

Anatomy of a Series

Before going further, it pays to fix the picture of a Series in your head. A Series has exactly four things you should be able to point at: an index (the labels on the left), the values (the actual data), a dtype (what kind of numbers/strings live in there), and an optional name. The diagram below labels each part on a five-row example.

Whenever you read pandas code that does something to a Series, mentally point at one of these four parts and ask “is the code touching the index, the values, the dtype, or just the name?” Almost every beginner bug comes from acting on the wrong one — for instance, trying to do arithmetic on what is actually a label.

Attributes vs. methods

Why this matters. When you type price.something in your code, pandas will either give you a piece of information about the Series (an attribute) or run a calculation (a method). The two look almost identical — the only difference is whether you put parentheses after the name — and mixing them up is one of the most common beginner errors. Five minutes spent on this distinction now saves hours of confusing error messages later.

Every Series exposes two kinds of accessors. Attributes — written without parentheses — describe the object (like “what type are the values?”, “how many rows?”). Methods — written with parentheses — actually compute something (like “what’s the average?”). This rule applies to every pandas object you’ll meet.
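Continuing with the ticker-indexed price Series built earlier:

price.dtype     # int64  -- attribute: what type are the values?
price.shape     # (5,)   -- attribute: how many rows?
price.index     # Index(['MSFT', 'AAPL', 'QQQ', 'TSM', 'NVDA'], dtype='object')
price.mean()    # 360.0  -- method: go compute the average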

What just happened? price.dtype reported int64 because all the values are whole numbers. price.shape returned (5,) — that’s Python’s way of saying “5 rows, and only one dimension” (a Series has just one column of values). price.index gave you back the labels themselves; the labels are also an object you can index into (so price.index[0] is 'MSFT') and call methods on (like .duplicated() to check for repeats, or .day_name() if the labels are dates).

Rule of thumb

No parentheses = a property the object already knows. Parentheses = “go compute something for me.” Confusing .shape (attribute, no parens) with .shape() (a call, with parens) is one of the most common pandas typos — and the error message is actually helpful: 'tuple' object is not callable, which means “you put parentheses after something that wasn’t a function.”

Index alignment — the key Pandas idiom

Why this matters. This is the single most important section in the chapter. Most beginner bugs in pandas come from misunderstanding how two Series combine. Once you get this, the rest of pandas feels much less mysterious. Imagine you have one column of US stock prices and another column of Hong Kong stock prices, with different trading dates (because the two markets close on different holidays). When you try to add them together, what should happen on a date that only one market traded? Pandas has a very specific answer, and that answer is what makes the library worth using.

The single most important behaviour that sets pandas apart from a plain Python list (or a NumPy array) is automatic alignment by label. When you combine two Series with arithmetic, pandas does not add them up position by position (1st of s1 with 1st of s2, 2nd with 2nd, etc.). Instead, it matches them label by label — finding labels that appear in both and computing only on those, filling NaN (which stands for “Not a Number”, pandas’ way of saying “missing”) where a label appears in only one of the two Series.
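A sketch: s1 reuses the five illustrative prices from earlier, and s2 is a second Series whose index only partly overlaps (the GOOG value is made up):

s1 = price                          # MSFT 400, AAPL 100, QQQ 600, TSM 200, NVDA 500
s2 = pd.Series([20, 200, 150, 130],
               index=["AAPL", "QQQ", "TSM", "GOOG"])

s1 + s2
# AAPL    120.0
# GOOG      NaN
# MSFT      NaN
# NVDA      NaN
# QQQ     800.0
# TSM     350.0
# dtype: float64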

What just happened? The result has six rows (every label that appeared in either Series), and three of them are NaN:

  • AAPL (in both): 100 + 20 = 120
  • QQQ (in both): 600 + 200 = 800
  • TSM (in both): 200 + 150 = 350
  • GOOG (only in s2): NaN
  • MSFT, NVDA (only in s1): NaN

This is a feature, not a bug. Two stock-return columns with different trading calendars (US vs. HK), two surveys filled in by different students, two product catalogues with different items — in every case, alignment by label gives you the right answer without you having to manually check which row matches which. The cost is that the result may contain NaN rows that you then have to deal with on purpose (either drop them, or fill them with something).

The diagram tracks what pandas does when you write s1 + s2. The left panel shows the two Series laid out over the union of their indexes; the middle panel highlights the three labels (AAPL, QQQ, TSM) where both Series contribute a value; the right panel shows the result, with grey bars marking the labels that produced NaN because only one of the two Series had a value there.

Duplicate labels are dangerous

If the index of either Series contains duplicates, alignment becomes a Cartesian join on the duplicated label and the result can explode in size. The notebook example below makes the point:
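A sketch of the blow-up (the values are illustrative; note the two indexes are not identical, which forces an alignment join):

dup1 = pd.Series([1, 2, 3], index=["AAPL", "AAPL", "MSFT"])
dup2 = pd.Series([10, 20, 30], index=["AAPL", "AAPL", "GOOG"])

dup1 + dup2
# the result has four AAPL rows (1+10, 1+20, 2+10, 2+20),
# plus NaN rows for GOOG and MSFT, which appear in only one Series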

Because "AAPL" appears twice in each Series, the aligned result contains four AAPL rows — every pair of duplicates is matched. This is almost never what you want. The fix is to deduplicate before combining: s.loc[~s.index.duplicated(keep="last")], or to aggregate up front with s.groupby(s.index).sum().

Common pitfall

Alignment is silent. Two Series with completely disjoint indexes will combine without raising any error — they will simply return an all-NaN result. If you suddenly see a Series of NaNs after an arithmetic operation, the first thing to check is whether the indexes actually match. len(s1.index.intersection(s2.index)) is a useful diagnostic.

Fill values during alignment

If you want a specific default rather than NaN for missing labels, use the .add, .sub, .mul, .div methods with the fill_value= argument:
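Using the same s1 and s2 from the alignment example above:

s1.add(s2, fill_value=0)
# AAPL    120.0   <- in both
# GOOG    130.0   <- 0 + 130
# MSFT    400.0   <- 400 + 0
# NVDA    500.0
# QQQ     800.0
# TSM     350.0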

s1.add(s2, fill_value=0) gives every label its value from whichever Series contains it, treating absence as zero. This is the right way to combine, say, this month’s revenue with last month’s adjustments when you do not want missing entries to wipe out a row.

Vectorised operations and broadcasting

Why this matters. If you’ve ever written a for loop in Python to multiply every element of a list by some number, you know it’s tedious — and on millions of rows it’s painfully slow. Pandas (and NumPy underneath it) let you write price * 0.95 to multiply every element in one stroke. That single idea makes pandas code shorter, faster, and easier to read. Coming up: a “stop-loss” example, which is a price level at which an investor automatically sells a stock to limit losses. Don’t worry if you’ve never traded — the math is just “today’s price minus 5 percent”.

A Series supports the same element-by-element arithmetic as a NumPy array (the underlying numerical library Python uses), but with the bonus of label-based alignment. A plain number (a scalar) gets applied to every element:
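price * 0.95    # a stop-loss level, one per ticker
# MSFT    380.0
# AAPL     95.0
# QQQ     570.0
# TSM     190.0
# NVDA    475.0

price > 200
# MSFT     True
# AAPL    False
# QQQ      True
# TSM     False
# NVDA     True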

What just happened? price * 0.95 multiplied every element by 0.95 (a 5% discount, giving a stop-loss level — the price at which you’d cut your losses) in one shot, keeping the ticker labels attached. price > 200 returned a Series of the same shape but with True/False values — True everywhere the price exceeds 200, False elsewhere. A handy trick: in Python, True counts as 1 and False counts as 0, so summing a Boolean Series gives you the number of True values, and taking its mean gives you the fraction that are True:
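(price > 200).sum()    # 3    -- three of the five tickers exceed 200
(price > 200).mean()   # 0.6  -- 60% of the tickers exceed 200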

The methods .gt, .ge, .lt, .le, .eq, .ne are the named equivalents of >, >=, <, <=, ==, !=. They accept a fill_value= argument when comparing two Series with different indexes, and they read more clearly inside long method chains.

Method chaining

Pandas encourages method chaining because every method returns a new Series (or DataFrame). You can write a multi-step computation as a single expression:

google["Open"].div(google["Volume"]).div(10).plot()

This reads left to right as a pipeline — divide Open by Volume, scale by 1/10, plot the result — with no intermediate variables. For complex chains, wrap the whole expression in parentheses and put each method on its own line.

apply vs. vectorised methods — why vectorisation wins

If you’ve come from Excel or VBA, it’s tempting to write a Python function that handles one row at a time and hand it to .apply(). It works — but it’s dramatically slower than the vectorised equivalent (the one that operates on the whole column at once). The timings below make the point starkly:

def compare(c):
    return c > 200

%timeit price.apply(compare)        # ~100 microseconds for 5 elements
%timeit price.gt(200)               #  ~10 microseconds for 5 elements

On a five-element Series the gap is small in absolute terms, but it’s already a 10x difference. Scale up to a million-row price history (a decade of minute-by-minute data) and apply takes seconds while the vectorised call takes milliseconds — the difference between “instant” and “go make tea”. The rule is simple and absolute: if pandas has a built-in method that does what you want, use it. Reach for .apply() only when the operation genuinely cannot be written as a vectorised expression.

Anti-pattern

The for v in series: ... loop is almost always wrong in pandas code. Beyond the speed penalty, it bypasses the index alignment and missing-value semantics that make pandas valuable in the first place. If you find yourself writing such a loop, stop and look for a vectorised equivalent: .gt, .where, .mask, .clip, .cumsum, .rolling, .shift, .diff, .pct_change, or one of the dozens of other methods listed in dir(pd.Series).

Missing values: NaN and how to handle them

Where this shows up. Real-world data is almost never complete. A weather sensor breaks for a day. A survey respondent skips a question. A stock didn’t trade on a US holiday. A student forgot to log into the LMS one week. In every case, the spreadsheet has a hole where a number should be. Pandas marks these holes with NaN, and you’ll need a deliberate strategy for handling them — because if you ignore them, your averages and totals will silently be wrong.

In pandas, missing numerical data is represented by NaN (which stands for “Not a Number”). It’s a special floating-point value built into the math standard your computer uses. Three rules govern how it behaves:

  1. Anything you do with NaN returns NaN. NaN + 5 and NaN * 0 both give NaN; comparisons such as NaN > 0 give False. The “missingness” spreads through every calculation it touches.
  2. Aggregations (like sum or mean) skip NaN by default. s.mean(), s.std(), s.sum() quietly ignore missing values and compute over the rest. If you want a missing value to “infect” the result instead (sometimes useful as a safety check), pass skipna=False.
  3. NaN is never equal to anything, not even itself. This is a quirk of the floating-point standard: NaN == NaN is False, and so is every other equality or ordering comparison involving NaN. So to check for missing values, use .isna() (or its alias .isnull()), never == np.nan.

The two most common handlers are .dropna() and .fillna().
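A sketch on a three-row Series with one hole:

import numpy as np

s = pd.Series([1.0, np.nan, 3.0])
s.dropna()          # rows 0 and 2 survive; the row with the hole is gone
s.fillna(0)         # 1.0, 0.0, 3.0
s.ffill()           # 1.0, 1.0, 3.0  -- carry the last observation forward
s.fillna(s.mean())  # 1.0, 2.0, 3.0  -- impute with the mean of the observed values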

Each choice encodes a different assumption about why the value is missing:

  • .dropna(): the missing value cannot be reasonably estimated and you would rather have fewer, clean rows.
  • .fillna(0): the missing value represents absence (no trade, no event) and zero is the correct quantitative interpretation.
  • .ffill(): the last observed value is still the best estimate, which is typical for daily stock prices on a non-trading day.
  • .fillna(s.mean()) or .fillna(s.median()): the missing value is genuinely unknown and you want a neutral imputation that does not bias the mean (the median is safer with outliers).
  • .fillna(s.mode().iloc[0]): the Series is categorical and you want the most common category.

Each row is the same Series passed through a different handler. Grey cells are NaN; coloured cells are real numbers. The diagram makes the trade-offs visible: dropna removes information (cells stay blank); fillna(0) invents a value at the low end of the range; ffill/bfill reuse a neighbour and therefore preserve local level; fillna(mean) substitutes a value that cannot deviate from the mean and so will silently compress your variance.

Pitfall: mean imputation of return series

Filling missing returns with the historical mean understates volatility because it inserts a value with zero deviation from the mean. For risk-model inputs, either drop the missing days or use .ffill() on the price level and recompute returns — but never impute returns directly with their mean.

Counting missingness as a data-quality check

When you first load any new dataset, your second line of code (right after .head()) should compute the fraction of missing values per column. For a single Series:

s.isna().mean()

What just happened? That one line says “where is each value missing?”, then “what fraction of them are missing?”. If the result is more than a few percent, the missingness pattern itself becomes a question worth thinking about: is the data missing completely at random (truly random skipping), missing at random conditional on what we observe (e.g., young respondents skip income questions more often), or missing not at random (the fact of missingness is itself a signal — survey respondents who refuse to answer the income question might be the highest earners)? The three cases call for different fixes, and treating them all the same is a common source of biased results in models built on top of the data.

Descriptive statistics

Why this matters. Any time you load new data, the first thing you should do is ask “what does it look like?” — minimum, maximum, average, spread. Otherwise you’re flying blind. Pandas gives you all of these summary numbers without you having to write a single formula yourself. Each call below is one English-readable line of code.

Once your data is in a Series, dozens of summary methods are available with no further setup. The most useful ones are listed below — read the code and the comments next to each line to see what each one returns.
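Continuing with the ticker-indexed price Series:

price.mean()     # 360.0   -- average
price.median()   # 400.0   -- middle value
price.std()      # ~207.4  -- sample standard deviation (see the note below)
price.min()      # 100
price.max()      # 600
price.idxmin()   # 'AAPL'  -- label of the smallest value
price.idxmax()   # 'QQQ'   -- label of the largest value
price.sum()      # 1800
price.count()    # 5       -- number of non-missing values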

.idxmax() and .idxmin() return the index label of the extreme observation — far more useful than the integer position, since you usually want to know “which ticker had the highest price” or “which date had the worst return”, not the numeric offset.

.describe() — the one-line profile

.describe() returns count, mean, std, min, 25%, 50%, 75%, max in a single call. For categorical Series (dtype=object), .describe() returns count, unique, top, freq instead. Either way, it is the standard first probe of any new dataset.

Sample vs. population std

Pandas .std() and .var() default to sample statistics — division by \(n-1\), Bessel’s correction. NumPy’s np.std() defaults to population statistics — division by \(n\). On a 252-day return series the two differ by a factor of \(\sqrt{252/251} \approx 1.002\), which is negligible; on a five-row toy Series the gap can be 10% or more. Pandas’s choice is the right one for almost every business analytics task because your data is almost always a sample of a larger return-generating process.

.agg() — custom and multiple aggregations

.agg() lets you apply one or more aggregation functions in a single call. The functions can be strings naming built-in methods, or your own Python callables:
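price.agg("mean")                         # 360.0 -- same as price.mean()
price.agg(["mean", "std", "min", "max"])  # a Series indexed by function name
price.agg(lambda s: s.max() - s.min())    # 500   -- a custom range function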

The output of .agg([...]) is a Series indexed by function name — convenient for stacking results across many series in a comparison table.

Counting categories and computing frequencies

For a Series of discrete labels (sectors, ratings, signal classes), .value_counts() gives the frequency table in descending order:
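# an illustrative Series of sector labels
sector = pd.Series(["Tech", "Tech", "Energy", "Tech", "Finance"],
                   index=["MSFT", "AAPL", "XOM", "NVDA", "JPM"])

sector.value_counts()
# Tech       3
# Energy     1
# Finance    1

sector.value_counts(normalize=True)
# Tech       0.6
# Energy     0.2
# Finance    0.2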

normalize=True gives proportions instead of counts. .value_counts() is also the right tool for checking the distribution of a categorical index: pd.Series(google.index.month).value_counts() tells you how many trading days fell in each calendar month.

Indexing methods: .loc, .iloc, and boolean masks

Why this matters. “Pick me the rows where…” is something you’ll do constantly: “give me the prices for AAPL only”, “give me the rows from January 2024”, “give me the customers older than 25”. Pandas gives you three different tools for this, and they look almost identical at first glance. They are not interchangeable, and picking the wrong one will produce code that works on today’s data but quietly breaks tomorrow.

There are three idioms for selecting a subset of a Series. They are not interchangeable, and the difference matters in any code you intend to re-run later.

.loc — label-based

.loc[label] looks up rows by their label (the thing you see on the left of the Series). It accepts a single label, a list of labels, or a slice between two labels. Watch out for one quirk: a slice between two labels includes both endpoints — unlike ordinary Python list slicing, which excludes the upper end. We’ll come back to this point because it’s a frequent source of confusion.
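Continuing with the ticker-indexed price Series:

price.loc["AAPL"]              # 100 -- a single label
price.loc[["AAPL", "MSFT"]]    # a list of labels, in the order you ask for them
price.loc["AAPL":"TSM"]        # AAPL, QQQ, TSM -- BOTH endpoints included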

What just happened? You pulled out a single ticker’s price, a list of two tickers, and a contiguous range of three tickers — all by name, not by row number. For Series indexed by dates, label slicing also works with date strings: prices.loc["2024-01-01":"2024-12-31"] returns the entire calendar year 2024.

.iloc — position-based

.iloc[k] looks up rows by their integer position (the row number), ignoring the labels entirely. The i stands for “integer”. Position slices follow ordinary Python list slicing: the upper end is excluded.
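For example:

price.iloc[0]     # 400 -- first row (MSFT)
price.iloc[-1]    # 500 -- last row (NVDA)
price.iloc[1:3]   # AAPL and QQQ -- position 3 is excluded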

When .iloc is dangerous

.iloc answers the question “give me the tenth row.” If your dataset never changes, that’s perfectly fine. But if rows get added, removed, or reordered between runs (which is normal for any data you re-download or refresh), “the tenth row” may correspond to a different date or a different stock every time you run your code. For any analysis you want to be reproducible across data refreshes, prefer .loc with explicit labels — that way you’re asking for “January 5th, 2024” rather than “whatever happens to be in the tenth row today.”

Slicing: .loc vs .iloc side by side

The trickiest part of .loc vs .iloc is that they treat the end of a slice differently. .loc["B":"D"] includes both endpoints; .iloc[1:4] excludes the upper end (just like ordinary Python list slicing). The diagram below highlights the exact same three rows produced by each call, so you can see the rule at a glance.

Both calls happen to return the same three values on this toy Series, but they got there by completely different rules. .loc["B":"D"] walks along the labels and includes every label from the first endpoint up to and including the last; .iloc[1:4] walks along the positions and uses ordinary Python half-open slicing, so position 4 (the E row) is excluded. Whenever you slice, ask “am I in label-land or position-land?” — the answer changes whether the upper end is in or out.

Boolean masks

A boolean mask is a Series of True/False values that’s the same length as the data you want to filter. When you pass the mask back into the brackets, pandas keeps only the rows where the mask is True and discards the rest. (If you’ve seen SQL, this is the equivalent of a WHERE clause.) Filtering with masks is probably the single most common operation you’ll do in pandas.
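For example, "keep only the tickers priced above 200":

price[price > 200]
# MSFT    400
# QQQ     600
# NVDA    500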

The mask itself is just a boolean Series, so you can compute on it directly: (price > 200).sum() counts how many tickers satisfy the condition; (price > 200).mean() gives the proportion.

.where and .mask — keep some, replace the rest

.where(cond) keeps values where cond is True and replaces the rest with NaN (or a value you specify via other=). .mask(cond) is the logical complement — keep where cond is False, replace where True. The two are mirror images.
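price.where(price.ge(200))            # AAPL becomes NaN; the other four values survive
price.mask(price.ge(200))             # the mirror image: only AAPL survives
price.where(price.ge(200), other=0)   # replace the failures with 0 instead of NaN
price.where(price.ge(200)).mean()     # 425.0 -- average over the qualifying subset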

The last line — price.where(price.ge(200)).mean() — is a clean idiom for “average over the subset that satisfies a condition,” without an explicit filter step. The NaNs introduced by .where are skipped by .mean() automatically.

For two-way conditional replacement, np.where is the pandas-friendly tool:
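import numpy as np

pd.Series(np.where(price >= 200, "Expensive", "Cheap"), index=price.index)
# MSFT    Expensive
# AAPL    Cheap
# QQQ     Expensive
# TSM     Expensive
# NVDA    Expensive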

np.where(cond, x, y) returns x where cond is True and y where False. It is faster than .apply(lambda v: "Expensive" if v >= 200 else "Cheap") by a wide margin because the comparison and the selection both happen in compiled code.

Sorting, ranking, and clipping

These three families round out the basic Series toolkit. None of them is conceptually deep but each appears in essentially every real workflow.

Sorting

.sort_values() returns a new Series with the index labels carried along — the label-value pairing is preserved, only the order changes. To break ties stably, pass kind="mergesort". To reset the index to a clean 0..n-1 after sorting (and discard the original labels), append .reset_index(drop=True).
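price.sort_values()                         # ascending: AAPL, TSM, MSFT, NVDA, QQQ
price.sort_values(ascending=False)          # descending
price.sort_values().reset_index(drop=True)  # discard the labels, clean 0..4 index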

Ranking

.rank() assigns each value its position in sorted order — the smallest value gets rank 1, the largest gets rank \(n\). Ties are averaged by default. Ranking is the foundation of every cross-sectional quantitative signal: momentum deciles, value quintiles, factor scores.
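price.rank()
# MSFT    3.0
# AAPL    1.0
# QQQ     5.0
# TSM     2.0
# NVDA    4.0

price.rank(pct=True)   # 0.6, 0.2, 1.0, 0.4, 0.8 -- rank divided by n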

pct=True converts ranks to percentiles in \((0, 1]\) — directly comparable across Series of different lengths.

Clipping

.clip(lower=..., upper=...) truncates values to a band. The most common use is winsorisation — replacing extreme observations with the 5th and 95th percentiles to reduce the influence of outliers on downstream estimators.
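A sketch with two planted outliers:

x = pd.Series([100, 120, 95, 110, 105, 5000, 10])
x.clip(lower=x.quantile(0.05), upper=x.quantile(0.95))
# 5000 is pulled down to the 95th percentile, 10 is pulled up to the 5th;
# the middle five values pass through unchanged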

The two outliers (5000 and 10) are pulled in to the 95th and 5th percentiles respectively. The other five observations are untouched. Winsorising before computing a mean or a regression coefficient is standard practice in cross-sectional asset pricing.

Removing duplicates

.drop_duplicates() removes duplicate values; s[~s.index.duplicated(...)] removes duplicate index labels. The two are different and both come up.
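vals = pd.Series([1, 1, 2, 3, 3])
vals.drop_duplicates()              # 1, 2, 3 -- duplicate VALUES removed
vals.drop_duplicates(keep="last")   # same values, keeping each last occurrence

labelled = pd.Series([10, 20, 30], index=["AAPL", "AAPL", "MSFT"])
labelled[~labelled.index.duplicated(keep="last")]   # duplicate LABELS removed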

keep="first", keep="last", and keep=False (drop all occurrences of any duplicated value) cover the cases you will need.

Time-indexed Series

Where this shows up. Most data you’ll work with as a Year 2 student lives in time: daily stock prices, monthly grades, hourly app usage, your weekly running mileage. As soon as the labels of your Series are actual dates (rather than strings or row numbers), pandas unlocks a whole toolkit of time-aware tricks — date-string slicing (prices.loc["2024-01":"2024-03"]), pulling out the day of the week from each date, shifting values forward or backward in time, computing rolling averages, and re-summarising at different frequencies (daily → weekly → monthly). Mastering these is what turns “I have a list of numbers” into “I can analyse a real time series.”

The Series becomes dramatically more useful once its labels are real dates (technically, a DatetimeIndex — pandas’ container for date labels). All of the following sub-sections rely on that.

Building a DatetimeIndex

If your index is currently a list of date strings, convert it with pd.to_datetime:
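# illustrative prices on three date strings
raw = pd.Series([500, 510, 505],
                index=["2024-01-02", "2024-01-03", "2024-01-04"])
raw.index = pd.to_datetime(raw.index)   # strings -> DatetimeIndex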

pd.to_datetime parses a wide range of formats automatically. For ambiguous cases (e.g., "10/12/2022" — Oct 12 or Dec 10?), pass an explicit format= argument. Once converted, the index supports a rich set of methods:
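raw.index.day_name()     # Index(['Tuesday', 'Wednesday', 'Thursday'], dtype='object')
raw.index.month          # 1, 1, 1
raw.index.year           # 2024, 2024, 2024
raw.index.is_month_end   # False, False, False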

For a Series whose values (not index) are timestamps — e.g., a column extracted from a DataFrame — use the .dt accessor:

dates.dt.day_name()
dates.dt.is_month_end
dates.dt.strftime("%Y-%m-%d")

.shift, .diff, .pct_change — the time-series trio

Why this matters. Most interesting time-series questions are about changes: “how much did the price move yesterday?”, “what’s the growth rate from one month to the next?”, “what’s tomorrow’s value going to be relative to today’s?”. All three boil down to comparing a value to a value at a different time. Pandas gives you exactly three small tools for this, and they chain together to do almost everything.

A stock return is just the percentage change in price from one day to the next: if Apple closed at $200 yesterday and $202 today, its return is \((202-200)/200 = 0.01 = 1\%\). That’s literally what .pct_change() computes — one method call gives you the whole column of daily returns.

.shift(k) moves the values k rows forward (positive k) or backward (negative k), leaving the empty rows at the edge as NaN. From this one little tool, the two most common change calculations fall out immediately.
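Using the three-row date-indexed raw Series from above:

raw.shift(1)       # NaN, 500, 510 -- yesterday's value, aligned to today's label
raw.diff()         # NaN, 10, -5
raw.pct_change()   # NaN, 0.02, -0.0098...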

.diff() is shorthand for s - s.shift(1). .pct_change() is shorthand for (s - s.shift(1)) / s.shift(1). Both leave the first row as NaN because there is no prior observation to subtract.

Negative shifts compute forward-looking quantities — useful for constructing the dependent variable in a forecasting model:

forward_return_20d = (close.shift(-20) - close) / close

This is the percentage return from today to twenty trading days hence. Treating it as the y in a regression on today’s features is the standard setup for a 20-day forecasting model.

.rolling — windowed statistics

Where this shows up. A “10-day moving average” of a stock price is what you see in every trading chart on Bloomberg or Yahoo Finance — instead of plotting today’s noisy price, you plot the average of the last 10 days, which smooths out daily wobbles and lets you see the trend. The same idea works for anything: a 7-day moving average of your daily steps tells you whether you’re getting more or less active. .rolling() is how you compute these in one line.

.rolling(window) sets up a sliding window of a chosen size; chaining an aggregation (.mean(), .std(), .median(), .apply(func)) computes the statistic over each window. So .rolling(10).mean() produces the 10-day moving average; .rolling(20).std() produces the rolling 20-day standard deviation (a common measure of how “volatile” or jumpy a stock has been recently).
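A sketch, assuming close is a date-indexed Series of daily prices (like the simulated one in the worked example later in this chapter):

fast = close.rolling(10).mean()               # 10-day moving average
slow = close.rolling(40).mean()               # 40-day moving average
vol = close.pct_change().rolling(20).std()    # rolling 20-day return volatility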

What just happened? You computed two smoothed price lines: a fast one (10-day average) that reacts quickly to recent moves, and a slow one (40-day average) that responds more sluggishly to the longer-term trend. Their crossings — fast crossing above slow, or below — are the basis of a classic trading rule called the “moving average crossover” (buy when fast crosses above slow, sell when it crosses below). The first window - 1 rows are NaN because the window doesn’t have enough data yet to fill it.

You can chain arbitrary functions through .rolling(...).apply(func). For example, a rolling t-statistic for “is the mean return different from zero?”:

def tstat(c):
    return c.mean() / (c.std() / len(c) ** 0.5)

returns.rolling(63).apply(tstat)   # quarterly t-stat

.cumsum, .cummax, .cummin — running totals

The cumulative methods produce a Series of running statistics. The most important one in finance is .cummax(), the running peak — needed for the drawdown calculation.
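A sketch, assuming equity is a date-indexed Series tracking the value of a strategy:

peak = equity.cummax()               # running all-time high
drawdown = (peak - equity) / peak    # fraction below the peak; 0 at a new high
drawdown.max()                       # maximum drawdown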

What just happened? peak is the running all-time-high of the equity curve up to each date. (peak - equity) / peak measures how far below that peak the strategy is sitting on any given day — a fraction between 0 (you’re at a new high) and 1 (you’ve lost everything). The maximum of that ratio across all days is the maximum drawdown: the worst peak-to-trough fall ever experienced. This is the headline risk number you’d quote for any investment track record.

.resample — change the time frequency

.resample is the time-aware analogue of .groupby. It splits the Series into bins of a fixed frequency ("D" daily, "W" weekly, "M" month-end, "Q" quarter-end, "A" annual) and lets you aggregate each bin.
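A sketch on the same date-indexed equity Series:

equity.resample("W").last()                 # weekly: the last value in each week
equity.resample("M").mean()                 # the average within each calendar month
equity.resample("M").last().pct_change()    # month-over-month percentage change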

The pattern is always the same: .resample(freq).aggregator(). Aggregators include .mean, .sum, .last, .first, .min, .max, .std, .quantile(p), and arbitrary .agg(func) calls.

.transform is the related operation that broadcasts the per-bin aggregate back to the original frequency — useful for de-meaning within month, computing distance from the monthly average, and similar normalisations:

demeaned = equity - equity.resample("M").transform("mean")

Frequency strings

Common frequency codes: "B" business day, "D" calendar day, "W" weekly (Sunday), "M" month-end, "MS" month-start, "Q" quarter-end, "A" year-end. Append a number to multiply: "2M" is bi-monthly, "5B" is every five business days. The full list is in the pandas user guide under “Offset aliases.”

Worked example: from daily prices to a risk profile

Why this matters. This is the section where everything you’ve learned comes together to do something a real analyst would do. Don’t be intimidated by the words “Sharpe ratio” and “drawdown” — they’re just two specific summary numbers people use to describe how good (or scary) an investment’s history looks. We’ll define each one before we compute it.

  • Sharpe ratio: roughly “average return divided by how bumpy the returns were.” A higher number = more reward for the same risk. We “annualise” it (scale it to a yearly basis) so investors can compare strategies across different time horizons.
  • Drawdown: how far the value has fallen from its previous peak. If your investment was once worth $100 and is now $70, you’re in a 30% drawdown. The biggest drawdown ever observed is the headline “worst-case” loss for the strategy.

The example below ties together everything in this chapter. We start from a simulated daily price history (made-up but realistic, so the example runs inside the browser without needing internet), and then we compute daily returns, summary statistics, the annualised Sharpe ratio, the maximum drawdown, and the five worst trading days.

Before we run the code, let’s read what it does in plain English: (1) make a fake stock price history; (2) turn prices into daily percentage returns; (3) summarise those returns; (4) compute the Sharpe ratio (daily, then annualised); (5) compare the volatility of up-days vs down-days; (6) compute the worst drawdown; (7) list the worst five days; (8) draw two charts — equity curve on top, drawdown below.
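A sketch of those eight steps (the seed, dates, and parameter choices are illustrative):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# (1) simulate ~4 years of daily prices: start at $100, compound random returns
rng = np.random.default_rng(42)
dates = pd.date_range("2020-01-01", periods=1000, freq="B")
close = pd.Series(100 * (1 + rng.normal(0.0005, 0.015, 1000)).cumprod(),
                  index=dates, name="close")

# (2) prices -> daily percentage returns
ret = close.pct_change().dropna()

# (3) summarise the return distribution
print(ret.describe())

# (4) Sharpe ratio: daily mean over daily std, annualised with sqrt(252)
sharpe = ret.mean() / ret.std() * np.sqrt(252)

# (5) up-day vs down-day volatility
vol_up = ret.where(ret > 0).std()
vol_down = ret.where(ret < 0).std()

# (6) worst drawdown
peak = close.cummax()
drawdown = (peak - close) / peak
max_dd = drawdown.max()

# (7) the five worst single days
worst5 = ret.sort_values().head(5)

# (8) equity curve on top, drawdown below
fig, (ax1, ax2) = plt.subplots(2, 1, sharex=True)
close.plot(ax=ax1, title="Equity curve")
drawdown.plot(ax=ax2, title="Drawdown")
plt.show()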

The output is the standard one-page risk report for any track record: a return distribution summary, a Sharpe ratio, a comparison of up-day versus down-day volatility, the worst observed drawdown, and a list of the worst single-day losses. Every line is a Series method or a one-line composition of Series methods. There is no explicit loop in the entire calculation.

The general-purpose template

The eight-step structure above — build the series, compute returns, summarise, compute risk-adjusted return, characterise asymmetry, measure drawdown, identify tail events, visualise — is almost universal in real work. Whether you’re evaluating a trading strategy, tracking a startup’s monthly revenue growth, or studying a YouTuber’s view counts over time, the same pipeline gives you the answers a decision-maker actually wants, in under fifty lines of code.

Plotting Series

Pandas wraps Matplotlib so that any Series can be plotted with a single method call. The index is used as the x-axis automatically.
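For example, with the simulated close and ret Series from the worked example above:

close.plot(title="Simulated daily close")      # line chart; dates on the x-axis
ret.plot.hist(bins=50, title="Daily returns")  # distribution of daily returns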

The .plot family includes .plot.line (default), .plot.hist, .plot.box, .plot.kde, .plot.bar, .plot.barh, and .plot.pie. For exploratory work this is faster than direct Matplotlib calls; for publication-quality figures you will typically reach for the explicit plt.plot(...) form so you can control every detail.

A frequency bar chart of a categorical Series is two method calls:

sector.value_counts().plot.bar(title="Public companies by sector")

.value_counts() returns a Series sorted in descending order; .plot.bar() renders it directly.

Common pitfalls — a checklist

A short list of mistakes that account for most of the questions beginners ask about pandas Series. Bookmark this section — you’ll come back to it.

  1. Silent NaN from alignment. Two Series with completely different labels combine into an all-NaN result with no warning. Before doing arithmetic on Series of unknown origin, check s1.index.equals(s2.index) (do they have the same labels?) or print len(s1.index.intersection(s2.index)) (how many labels do they share?).
  2. Using .iloc when .loc is safer. Position-based indexing breaks the moment your dataset is refreshed and a row gets added or removed. Use .loc with explicit labels in any analysis that you want to be reproducible.
  3. Using .apply where a vectorised method exists. Every .apply(lambda ...) call deserves a “is there a vectorised version?” check. Nine times out of ten there is, and it’s 10–100x faster.
  4. Filling missing returns with the mean. This compresses (understates) volatility. Either drop the missing days, or fill on the price level and recompute returns from there.
  5. Sample vs. population standard deviation. Pandas defaults to ddof=1 (sample standard deviation, dividing by \(n-1\)); NumPy defaults to ddof=0 (population, dividing by \(n\)). The two libraries can disagree on the same data. Pandas’ choice is right for samples — which is almost always what your data is.
  6. Right-inclusive .loc slicing. s.loc["2024-01-01":"2024-12-31"] includes December 31, unlike Python’s normal list slicing which would exclude it. Annoying the first time, convenient once you get used to it.
  7. Forgetting to set name=. A nameless Series gives you ugly plot axes and unhelpful column headers when it lands in a DataFrame. Set the name when you build the Series, not later.
  8. Using == to check for NaN. It always returns False. Use .isna() instead.

Exercises

Exercise 1 — Aligning two price series

You’re given two Series of daily closing prices for early January 2024 — one for a US stock, one for a Hong Kong stock. Notice that the two markets don’t trade on exactly the same days.

import pandas as pd
us  = pd.Series([100, 101, 103, 102, 105],
                index=pd.to_datetime(["2024-01-02","2024-01-03","2024-01-04","2024-01-05","2024-01-08"]))
hk  = pd.Series([400, 398, 405, 410],
                index=pd.to_datetime(["2024-01-02","2024-01-03","2024-01-04","2024-01-08"]))

Have a go at the following:

  1. Compute us + hk and explain in one sentence why one of the rows comes out as NaN.
  2. Now redo the addition but treat any missing value as if it were zero (so no NaNs in the result).
  3. Build the intersection version: a Series indexed by only the dates that appear in both calendars.

Tip

Hint. For part 2, look up s1.add(s2, fill_value=0). For part 3, s1.index.intersection(s2.index) gives you the labels that are in both — then you can use .loc[...] to grab just those rows.

Exercise 2 — A simple trading signal

Imagine you have a Series close of daily closing prices for a stock. A classic (very simple) trading rule says: “buy when a fast moving average crosses above a slow moving average; sell when it crosses below; otherwise just hold.” Your task is to turn that English sentence into one line of pandas code that returns a Series of "Buy" / "Sell" / "Hold" labels.

Tip

Hint. Compute fast - slow (the gap between the two averages). A “cross-up” means yesterday the gap was negative and today it’s positive. To compare today’s gap with yesterday’s, use .shift(1) to push the gap one day forward. np.where(condition, x, y) returns x where the condition is true and y where false — you can nest two of these to get a three-way result.

Exercise 3 — Sharpe ratio and drawdown

Generate 1,000 made-up daily returns drawn from a normal distribution with mean \(\mu = 0.0005\) (about 0.05% per day, which annualises to roughly 13% per year) and standard deviation \(\sigma = 0.015\) (a typical daily volatility for a single stock). Use np.random.default_rng(123).normal(0.0005, 0.015, 1000) — fixing the seed 123 means everyone gets the same “random” numbers, which is handy for checking your work. Build a fake price series with 100 * (1 + r).cumprod() (start at $100 and compound the daily returns).

Then have a go at:

  1. Compute the annualised Sharpe ratio.
  2. Find the maximum drawdown and the date when it bottomed out.
  3. Compute the average return on the worst 10% of days, and on the best 10% of days.
  4. What’s the ratio of down-day volatility to up-day volatility? Why might this be useful to know?

Tip

Hint. For part 3, ret.sort_values().head(int(0.1 * len(ret))).mean() averages the bottom 10%. For part 4, .where(ret < 0).std() keeps only the negative returns and computes their standard deviation; do the same with > 0 for up-days, then take the ratio.

Exercise 4 — Missing-value strategy

You have a Series sold of daily units sold for one product (a SKU, “stock-keeping unit”, is just retail jargon for one specific product). About 8% of days have no record at all.

  1. For each of these five ways of handling missing values, write a one-sentence justification for whether it’s defensible for sales data: dropna, fillna(0), ffill, fillna(sold.mean()), fillna(sold.median()).
  2. Compute the mean of sold under each of these five strategies and compare. Which two strategies give the most extreme answers, and why?

Tip

Hint. There’s no single right answer to part 1 — it depends on what “no record” actually means in context. Was the store closed that day (then 0 is correct)? Was data lost from a system glitch (then ffill or the median is more reasonable)? Was the day truly unknown (then dropna is safest)?

Exercise 5 — Resampling and the weekend effect

There’s an old claim in finance called the “weekend effect” — that Monday stock returns are systematically lower than Friday returns (supposedly because bad news comes out over weekends and gets priced in on Monday morning). Let’s test the claim — but on data we’ve simulated from a normal distribution, where by construction no real weekend effect can exist. If you find a “weekend effect” in random data, you’ve learned something important.

Generate 2,000 made-up daily returns from \(\mathcal{N}(0.0005, 0.015)\) (use a fixed seed for reproducibility).

  1. Build the Series with pd.date_range("2010-01-04", periods=2000, freq="B") as the index. (freq="B" means business days only — no weekends.)
  2. Use .day_name() on the index to label each row with its weekday, then compute the mean return for each weekday.
  3. Compare Monday’s mean to Friday’s mean. In truly random data, should there be any real difference? What does this exercise tell you about how easy it is to find a “pattern” in pure noise?

Tip

Hint. close.groupby(close.index.day_name()).mean() is the one-liner you want. Try it again with a different random seed and watch the Monday-Friday gap wobble around — that’s just sampling noise, not a real effect. This is one of the most important lessons in data analysis.

Looking ahead

The DataFrame half of this chapter (coming up next) generalises every idiom you just learned to two-dimensional data. A DataFrame is, formally, a dictionary of Series that all share the same row labels — so almost every Series method has a DataFrame counterpart that acts column by column, and almost every Series operation extends to a row-by-row, column-by-column, or whole-table DataFrame operation. If you’ve understood alignment, vectorisation, label-based selection, and the time-series tools on Series, you already know 80% of what you need for DataFrames.

Why this chapter matters

So far you’ve worked with the Series — one column of labelled data, behaving like a smarter Python list. Real data, though, usually has many columns: customers in rows and their attributes (age, country, sign-up date) in columns; days in rows and stock prices in columns; transactions in rows and fields like price/quantity/side in columns. The pandas object built for that shape is the DataFrame, and once you’re working professionally with data, the DataFrame is what you’ll spend ninety percent of your time touching.

A DataFrame is best pictured as a programmable spreadsheet. It has row labels (called the index — often dates or IDs), a set of named columns, and a rectangle of values. Each column is itself a Series and carries its own data type, so a single DataFrame can hold dates, decimals, integers, True/False values, and strings all at once — something a plain NumPy array (which is one type only) cannot do. Beyond storage convenience, a DataFrame is the natural place to do the eight verbs that come up over and over in real analysis: load, inspect, select, filter, fill, transform, group, aggregate, merge, pivot. Master those and you’ve mastered most of applied data work.

This second half of the chapter assumes you understand the Series mechanics from the first half. The focus now shifts to the patterns — the small set of idioms that you’ll actually use in your projects, internships, and on the job. The goal is fluency, not encyclopaedic coverage: you should be able to read pandas code in someone else’s notebook the way you read English in a textbook.

DataFrame vs Series — when to use which

Why this matters. “Am I currently holding a Series (one column) or a DataFrame (a whole table)?” is a question you’ll forget to ask, and the error messages you’ll get when you mix them up are often confusing. A few minutes spent on the distinction here will save you debugging time later.

The Series and the DataFrame are close relatives. A Series is a single labelled column; a DataFrame is a bundle of Series that all share the same row labels. Switching between them is one of the most common operations in real analysis, and getting confused about which one you’re currently holding is one of the most common sources of bugs.

The mental model is simple. A Series has one set of labels (the row index) and one column of values. A DataFrame has two sets of labels (the row index and the column names) and a rectangle of values in between. The two object types share most of the same methods — .mean(), .sum(), .plot(), .describe() — but the result types differ predictably: a Series method returns a single number, while the same method on a DataFrame returns a Series (one value per column) unless you ask for something different.

import pandas as pd

s = pd.Series([10, 20, 30], index=["a", "b", "c"], name="x")
df = pd.DataFrame({"x": [10, 20, 30], "y": [1, 2, 3]}, index=["a", "b", "c"])

s.mean()       # 20.0           -- scalar
df.mean()      # x 20, y 2.0    -- Series, one entry per column
df["x"].mean() # 20.0           -- back to a scalar

The single-bracket vs double-bracket idiom is how you decide which type you get. df["x"] returns a Series (the single column called x). df[["x"]] returns a DataFrame with one column. The extra brackets are not decoration — they’re the difference between a single column and a one-column table. The distinction matters because downstream code may expect one shape and silently misbehave on the other: a machine-learning library like scikit-learn wants a 2D table (a DataFrame), but .plot() on a Series gives you a quick line chart for free.

A simple rule of thumb

Use a Series when you are reasoning about one variable in isolation — a single price stream, a single customer attribute. Use a DataFrame when you need rows and columns to stay aligned — the whole table of transactions, the whole panel of stock prices. If you find yourself unpacking a DataFrame column by column to do arithmetic, you are working too hard: DataFrames broadcast operations across all columns at once.

Anatomy of a DataFrame

A DataFrame is a Series of Series — a rectangle of values with two sets of labels: row labels along the left (the index) and column labels along the top (the columns). Every cell sits at the intersection of one row label and one column name, and every column carries its own dtype. The diagram labels each part on a small 5-row × 4-column table.

Two things to internalise. First, the index and the columns are labels, not data — pandas treats them as accessors, which is why df["close"] (column lookup) and df.loc["2024-01-03"] (row lookup by label) look so similar but operate on different axes. Second, every column has its own dtype: a single DataFrame can hold floats, ints, dates, booleans, and strings side by side — something a plain NumPy array cannot do.

Loading data

Why this matters. In your projects, the very first cell of every notebook is usually “read the data in from a file.” Get this right and the rest of your analysis is smooth; get it wrong and you’ll be cleaning up weird date strings and string-vs-number mismatches for the next hour. A CSV (comma-separated values) file is just a plain text file where each row of a table is a line, and the cells are separated by commas — the simplest tabular format on the planet. Pandas’ read_csv function turns one of these files into a DataFrame in one call.

The default behaviour of pd.read_csv is usually OK, but most beginners stop there and pay for it later. A few well-chosen arguments at load time save you from having to clean up the data afterwards.

The standard pattern

The four arguments that matter most often are parse_dates, index_col, dtype, and na_values. They control: how dates are recognised, which column becomes the row labels (the index), what data type each column should be, and which text values should be treated as “missing”.

import pandas as pd

df = pd.read_csv(
    "transactions.csv",
    parse_dates=["trade_date"],   # turn this column into datetime64
    index_col="trade_date",       # use it as the row index
    dtype={"ticker": "category",  # save memory on repeated strings
           "quantity": "int32"},
    na_values=["N/A", "NA", "-", "null"],  # extra missing-value markers
)

The parse_dates=["trade_date"] argument is the most important of the four. Without it, the date column arrives as plain text strings; every later operation that treats those values as dates either fails or produces nonsense, and you’ll typically discover the problem three cells down when something like .resample("M") (monthly grouping) raises a confusing error. The fix is to convert the column at load time. The index_col="trade_date" argument then promotes the parsed date column to be the row index, which is what allows things like df.loc["2024-03"] to pull out March 2024 in one line.

The dtype dictionary lets you override pandas’ automatic guessing about column types. Two common wins: declaring repeated-string columns (like "ticker", which only has a few unique values) as "category" can shrink memory usage to a tenth on tables with millions of rows, and declaring integer columns explicitly stops pandas from converting them to floating-point numbers the moment a single missing value appears.

A first-look ritual

After loading, run the five-line ritual before you do anything else. It takes two seconds and answers nearly every question that arises in the first hour of analysis.
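The ritual, assuming df is the transactions table loaded above:

df.head()              # first five rows -- do the columns look right?
df.shape               # (rows, columns)
df.dtypes              # what type did each column get?
df.columns.tolist()    # exact column names, hidden spaces and all
df.describe()          # numeric summary -- the cheapest outlier check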

What just happened? You loaded a synthetic transactions table and immediately ran a five-line “sanity check”. head shows the first few rows so you can see what the columns look like. shape reports (rows, columns) — the simplest way to confirm the file loaded the size you expected. dtypes reveals what type pandas gave each column — if a price column comes back as object (pandas’ word for “text”) instead of float64, that’s a red flag that the file contained currency symbols or stray commas. columns.tolist() shows the exact column names, useful when there are hidden spaces or oddly capitalised names that the printed table hides. describe summarises every numeric column and is the cheapest possible “are there outliers?” check.

Common pitfall: silent dtype inference

A price column with one bad row containing "N/A" (as a string) will be read as object instead of float64, and every arithmetic operation will either fail or produce a string-concatenation surprise. Always check df.dtypes immediately after loading, and use pd.to_numeric(col, errors="coerce") to clean a column that should be numeric.

Selection: columns, .loc, .iloc, and boolean masks

Why this matters. “Show me just the rows where…” or “show me just these columns…” comes up dozens of times in any real analysis. Pandas gives you four different tools for selection, each answering a slightly different question. Memorise the differences early; you’ll use them every single day you work with data.

Single column versus multiple columns

A single string in square brackets returns a Series — a one-dimensional labeled column.

df["price"]            # Series
df.price               # same thing, attribute access (only works if name is a valid identifier)

A list of strings in square brackets returns a DataFrame — a sub-table preserving the two-dimensional structure.

df[["price", "quantity"]]   # DataFrame with two columns

The single-versus-double brackets distinction is the single most common source of confusion for beginners. The fastest way to internalise it: square brackets around a string produce a vector; square brackets around a list produce a table.

.loc — selection by label

.loc[] is the label-based accessor. It takes one or two arguments: row labels alone, or row labels and column labels separated by a comma. When the index is a DatetimeIndex, the row labels are dates and partial-string matching becomes available — df.loc["2024-03"] selects every March 2024 row.

df.loc["2024-02-01"]                    # one row by date label
df.loc["2024-02-01", "price"]           # one cell
df.loc["2024-02-01":"2024-02-15"]       # date range, BOTH endpoints INCLUSIVE
df.loc["2024-02-01":"2024-02-15", ["price", "quantity"]]  # range + column subset
df.loc[df["price"] > 400]               # boolean mask in the row position

Two properties of .loc[] deserve emphasis. First, slices are inclusive on both ends — a slice from 'a' to 'd' returns rows a, b, c, and d. This is the opposite of Python’s normal slice semantics and is deliberate: when you write a date range, you almost always mean to include the end date. Second, .loc[] accepts boolean arrays in the row position, which is why almost all conditional filtering is ultimately a .loc[] call under the hood.

.iloc — selection by position

.iloc[] is the integer-position accessor. It thinks like a list: df.iloc[0] is the first row, df.iloc[-1] is the last row, df.iloc[0:10] is the first ten rows. Slices follow Python’s standard convention — exclusive on the upper end.

df.iloc[0]            # first row
df.iloc[-3:]          # last three rows
df.iloc[0:10, 0:3]    # first 10 rows, first 3 columns
df.iloc[[1, 5, 7]]    # rows at positions 1, 5, 7

.iloc[] is the right tool when the row number itself is meaningful — the first hundred observations, the last batch of a chronological train/test split, or a sampled subset by position. When the index carries semantic meaning (dates, tickers, customer IDs), .loc[] is almost always more readable.

.loc versus .iloc — the slice asymmetry

The single biggest gotcha is the endpoint convention.

The asymmetry trips up everyone at least once. Burn it in: .loc includes the end label; .iloc excludes the end position.

Picture the same five-element Series sliced with the apparently parallel loc["b":"d"] and iloc[1:3]. The label-based slice keeps three rows (b, c, d) because both endpoints are included; the position-based slice keeps two (b, c) because the upper position is excluded, following ordinary Python list semantics.
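
In code, with a throwaway five-element Series:

s = pd.Series([10, 20, 30, 40, 50], index=list("abcde"))

s.loc["b":"d"]   # rows b, c, d -- end label INCLUDED (3 rows)
s.iloc[1:3]      # rows b, c    -- end position EXCLUDED (2 rows)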

Boolean masks

Boolean masking is the engine behind every “show me the rows where X” query in pandas. The expression df["price"] > 400 is evaluated row-by-row and produces a Series of True/False values — a mask. Feed the mask back into the DataFrame indexer, and only the rows where the mask is True survive.

Two rules for compound conditions. First, use & for AND and | for OR, not the Python keywords and/or — pandas needs bitwise operators to work element-wise on arrays. Second, wrap each individual condition in parentheses. The bitwise operators bind more tightly than the comparison operators, so omitting the parentheses either raises a confusing error or, worse, quietly builds the wrong mask.
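
A minimal sketch on the trades-style frame used throughout (column names assumed):

mask = (df["price"] > 400) & (df["side"] == "BUY")    # parentheses around EACH condition
df.loc[mask]                                          # only rows where both are True

df.loc[(df["price"] > 400) | (df["quantity"] > 300)]  # OR: either condition suffices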

SettingWithCopyWarning

When you filter a DataFrame and then assign to the result, pandas may emit SettingWithCopyWarning. The filtered frame is sometimes a view of the original and sometimes an independent copy, and pandas cannot always tell which. The defensive idiom is to call .copy() explicitly when you intend to modify the result:

big_trades = df[df["price"] > 400].copy()
big_trades["flag"] = "large"   # safe -- never touches df

Filtering with query and chained boolean expressions

For long compound conditions, the bracket-and-ampersand syntax becomes unwieldy. The .query() method accepts a string expression in a SQL-like dialect and is often both shorter and faster.

The query form has three quiet advantages. The column names appear bare, so the expression reads like a sentence. The Python keywords and, or, not work as expected — no bitwise gymnastics. And on very large DataFrames, query is often faster, because when the optional numexpr package is installed it evaluates the expression in compiled code without materialising intermediate boolean arrays.

To reference a Python variable inside a query string, prefix it with @:

threshold = 400
df.query("price > @threshold and side == 'BUY'")

The query method does not support every Python expression — chained comparisons like 100 < price < 400 are allowed, but arbitrary function calls are not. When the expression gets too complex, fall back to the bracket form.

When to use query

Use bracket-and-mask for one or two conditions; use query for three or more. The cognitive saving on a long compound filter is substantial, and the performance advantage on million-row tables can be a factor of two or more.

Missing values in DataFrames

Where this shows up. We already met NaN on the Series side. With DataFrames the same idea applies — but now you have multiple columns and need to think about missingness column by column. Real-world data is full of holes: sensors fail, survey respondents skip questions, joins between tables fail to match some rows, vendor data feeds drop fields. Pandas uses one symbol (NaN) for every type of missingness and gives you a small toolkit for detecting, counting, filling, and dropping. How you handle missing values is one of the highest-leverage decisions in any analysis — a careless choice quietly corrupts everything downstream.

Detecting missing values
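
A minimal sketch of the audit described below, on a toy frame with a few holes poked in:

import numpy as np
import pandas as pd

df = pd.DataFrame({"price":  [101.5, np.nan, 103.2, 99.8],
                   "rating": [4.0,   5.0,    np.nan, np.nan]})

df.isna()               # True/False frame, same shape as df
df.isna().sum()         # missing count per column
df.isna().sum().sum()   # grand total across the whole table
df.isna().mean()        # missing fraction per column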

What just happened? isna() returned a DataFrame of True/False values with the same shape as the input — True wherever a value is missing. Summing once (.sum()) collapses each column into a count, so you see “how many missing in each column?”. Summing twice (.sum().sum()) gives one grand total across the whole table. Using .mean() (or dividing by the row count) gives the missing percentage, which is usually more informative than the raw count because it’s automatically scaled.

fillna — choosing a replacement

fillna accepts a scalar, a per-column dictionary, or a Series, and produces a new DataFrame with every NaN replaced.

df.fillna(0)                              # constant
df.fillna({"price": df["price"].mean(),   # column-specific replacements
           "rating": df["rating"].median()})
df["price"].fillna(df["price"].mean())    # one column

Each replacement encodes an assumption. Filling with zero treats absence as “no value,” which is correct when the variable counts events that did not happen. Filling with the mean is the conventional default but compresses variance and distorts correlations — use it sparingly. Filling with the median is more robust to outliers. Filling with a domain-specific constant (a regulatory default, a vendor convention) is sometimes the only defensible choice.

ffill and bfill — propagating values

For time-indexed data, the previous observation is often a sensible stand-in for a missing one: a stock price that did not print on a holiday is approximately the same as the price the day before. ffill (forward-fill) propagates the most recent prior value; bfill (back-fill) propagates the next future value.
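
A minimal sketch on a toy daily price series:

idx = pd.date_range("2024-02-01", periods=5, freq="D")
px = pd.Series([100.0, np.nan, np.nan, 103.0, np.nan], index=idx)

px.ffill()   # carries 100.0 forward across the gap -- safe for prices
px.bfill()   # pulls 103.0 backwards -- risks leaking the future into the past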

ffill is the right tool for series that carry a state forward — the most recent price you observed is the best guess for the price that failed to print today. bfill is occasionally useful in the opposite direction (e.g., back-filling a target column at the start of a panel), but it is dangerous in general because it can leak future information into the past.

dropna — when filling is wrong

When no defensible replacement exists, drop the affected rows.
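
The main variants, assuming the df from above:

df.dropna()                             # drop rows with ANY missing value (how="any")
df.dropna(how="all")                    # drop rows where EVERY value is missing
df.dropna(subset=["price"])             # drop only if "price" itself is missing
df.dropna(thresh=3)                     # keep rows with at least 3 non-missing values
df.dropna(axis=1, thresh=len(df) // 2)  # drop columns that are mostly empty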

The how argument switches between “any” and “all” semantics. The subset argument is the surgical option: drop a row only if a specific column is missing, leaving rows whose missing value is in an irrelevant column untouched. thresh gives you a count-based threshold, and the axis=1 form drops sparse columns rather than incomplete rows.

Fill or drop — a decision rubric

Fill when every row is needed downstream and a defensible imputation exists. Drop when the missing rate is small (a common threshold is below five percent) and the missingness mechanism appears random. The one option that is never defensible is to ignore the problem: pandas will quietly propagate NaN through every arithmetic operation, and the resulting numbers will be wrong in ways that no error message ever flags.

apply, applymap, and vectorised arithmetic

Why this matters. Sooner or later you’ll want to compute a new column based on existing ones — “give me a notional column equal to price * quantity”, or “tell me whether each trade is large or small”. Pandas offers three ways to do this. They look similar in syntax but differ by orders of magnitude in speed. On small toy datasets it doesn’t matter; on a 1-million-row table the difference is “finishes in a second” vs “needs a coffee break”. Pick the right one and your code stays fast.

Vectorised arithmetic — almost always the right answer

When you write df["price"] * df["quantity"], pandas dispatches the multiplication to NumPy, which performs the operation in compiled C across the entire column in one shot. There is no Python-level loop, no function call per row, no overhead beyond the actual arithmetic. This is the fastest possible path, and it should be the first thing you try.

df["notional"] = df["price"] * df["quantity"]
df["log_return"] = np.log(df["price"] / df["price"].shift(1))
df["is_large"] = df["quantity"] > 300

All standard arithmetic operators, comparison operators, and most NumPy functions broadcast over a Series this way. The result is always another Series of the same length, with the original index preserved.

apply — when vectorisation is impossible

When the per-row logic genuinely cannot be expressed as a vectorised operation — for instance, calling a custom Python function that hits an external service, or running an iterative algorithm per row — apply is the escape hatch.

apply on a DataFrame can run either column-by-column (axis=0, the default) or row-by-row (axis=1). The row-by-row form is the slow one — every row triggers a Python-level function call — and should be used only when no vectorised alternative exists. The column-by-column form is faster because it makes one call per column, not one per row, and is the natural shape for multi-column aggregations.
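
A sketch of both shapes — some_external_lookup is a stand-in for per-row logic you genuinely can't vectorise:

df[["price", "quantity"]].apply(lambda col: col.max() - col.min())  # one call per column

df.apply(lambda row: some_external_lookup(row["ticker"]), axis=1)   # one call per ROW -- slow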

applymap — element-wise on every cell

applymap runs a function on every single cell of a DataFrame. In modern pandas (2.1+) it has been renamed to DataFrame.map (the Series .map was always there); the old name still works but emits a deprecation warning.

df.applymap(lambda x: f"${x:,.2f}")   # format every cell as a currency string

This is almost always something you should avoid for numerical work. The element-wise Python call is the slowest path possible. Reserve it for formatting and string transformations where no vectorised alternative exists.

Vectorise first, apply second

A row-wise apply on a million-row DataFrame is typically a hundred to a thousand times slower than the equivalent vectorised expression. If your code uses df.apply(..., axis=1), stop and ask whether the same calculation can be written as a column-arithmetic expression. Nine times out of ten it can.

groupby and aggregation

Where this shows up. “What’s the average grade for each course?” “What’s total revenue per customer?” “What’s the maximum temperature in each month?” “How many trades happened in each ticker?” All of these are the same shape of question: summarise each group of rows by some key. Pandas calls this operation groupby, and it’s probably the second-most-important verb you’ll learn (after selection).

groupby follows a universal three-step pattern: split the data into groups by some key, apply an aggregation within each group, combine the results into a new table. (If you’ve seen SQL, this is the GROUP BY clause.) In pandas the typical syntax is df.groupby(key).agg(...). The mechanics differ slightly between SQL and pandas, but the mental model is the same.

The simplest case — one group, one aggregation

Let’s build a fake trading log and ask: “what’s the total dollar value traded for each ticker?” Read the code line by line first: build a DataFrame of trades, compute notional (= price × quantity, the dollar value of each trade), then group by ticker and sum.
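
Here's a minimal version of that cell (the numbers are made up):

import pandas as pd

trades = pd.DataFrame({
    "ticker":   ["AAPL", "MSFT", "AAPL", "GOOG", "MSFT", "AAPL"],
    "price":    [185.0, 410.0, 186.5, 140.0, 409.0, 184.0],
    "quantity": [100,   50,    200,   80,    120,   150],
})
trades["notional"] = trades["price"] * trades["quantity"]   # dollar value per trade

trades.groupby("ticker")["notional"].sum()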

What just happened? The expression reads naturally in English as a sentence: “group by ticker, then sum the notional column.” The result is a Series, one row per ticker, whose values are the total dollar value traded for that ticker.

Trace what df.groupby("ticker")["notional"].sum() actually does behind the scenes. Split: the rows are partitioned into groups by the key (here, ticker). Apply: each sub-table is reduced to a single number with sum(), independently of the others. Combine: those per-group numbers are stacked into a Series whose index is the group keys. Almost every analytical aggregation in pandas follows the same three-step shape — once you recognise it, groupby stops feeling like magic.

Multiple aggregations on multiple columns

Two idioms cover almost every realistic case. The first passes a list of aggregation functions and applies all of them to each numeric column.

df.groupby("ticker")[["quantity", "notional"]].agg(["sum", "mean", "count"])

The result is a DataFrame with a MultiIndex on the columns — two levels, one for the column name and one for the aggregation. The shape is correct but the column names are awkward; flattening them on the aggregated result — e.g. out.columns = ["_".join(c) for c in out.columns] — is a common follow-up.

The second idiom passes a dictionary that maps column names to one or more aggregations, giving you per-column control.
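
For example, with the trades columns from above:

df.groupby("ticker").agg({"quantity": "sum",
                          "notional": ["sum", "mean"],
                          "price":    "max"})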

Named aggregation — the modern idiom

For cleaner output column names, use the named-aggregation form introduced in pandas 0.25. Each keyword argument names a new output column and specifies the source column and aggregation function.
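
A sketch in that style, again on the trades columns:

df.groupby("ticker").agg(
    n_trades  = ("ticker",   "size"),
    total_qty = ("quantity", "sum"),
    avg_price = ("price",    "mean"),
)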

The output has a flat column index with descriptive names — much friendlier for downstream consumers than the MultiIndex form. Named aggregation is the recommended modern style for any aggregation you intend to ship.

transform — broadcasting an aggregate back to the original shape

agg collapses each group to a single row. transform does the same calculation but broadcasts the result back to the original row count, attaching the group-level number to every row of the group. This is the right tool for within-group standardisation and demeaning.
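
A minimal sketch — attach each ticker's mean price to every row, then demean within the group:

df["ticker_avg"] = df.groupby("ticker")["price"].transform("mean")
df["price_vs_avg"] = df["price"] - df["ticker_avg"]   # within-group demeaning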

transform is the workhorse behind cross-sectional standardisation in equity research, within-store percentile rankings in retail, and within-cohort z-scores in customer analytics. Any time you need a group-level statistic attached to every row, transform is the answer.

agg versus transform versus apply
  • agg returns one row per group.
  • transform returns the same shape as the input; each value is replaced by a group-level computation.
  • apply is the flexible escape hatch — slowest, but works for arbitrary per-group logic that returns a Series or DataFrame.

Try agg and transform first; fall back to apply only when neither shape fits.

Joins, merges, and concatenation

Where this shows up. Almost no real analysis lives in just one table. You’ll have customer info in one CSV, the transactions they made in another, and the product catalogue in a third. Combining tables is one of the most common operations in data work, and pandas borrows the vocabulary directly from SQL: inner, outer, left, right joins. Don’t worry if you’ve never seen SQL — we’ll explain each one in plain English below.

concat — stacking tables along an axis

When two DataFrames have the same columns and you simply want to stack them, pd.concat is the right tool. It also concatenates along the column axis for tables that share an index.
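
A sketch — jan_trades/feb_trades are assumed to share columns, and prices/volumes to share an index:

pd.concat([jan_trades, feb_trades])                     # stack rows
pd.concat([jan_trades, feb_trades], ignore_index=True)  # ...with a clean RangeIndex
pd.concat([prices, volumes], axis=1)                    # side by side, aligned on index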

ignore_index=True is worth knowing: it discards both source indices and produces a clean RangeIndex on the output. Without it, the original row indices are preserved and you may end up with duplicate index values, which causes subtle bugs downstream.

merge — relational joins

merge is the pandas equivalent of SQL’s JOIN. The first two arguments are the left and right DataFrames; on names the join key (a column present in both); how selects the join type.
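
In its simplest form, assuming both tables carry a ticker column (each join type is unpacked in the list below):

trades.merge(reference, on="ticker", how="inner")   # keep only keys present in BOTH
trades.merge(reference, on="ticker", how="left")    # keep every trade, NaN where unmatched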

The four join types do exactly what they do in SQL. In plain English:

  • inner: keep only rows where the key appears in both tables. The safest default when you expect both sides to match.
  • left: keep every row from the left-hand table; for rows where the right-hand table has no match, fill the new columns with NaN. This is the most common form in real work — you want to preserve every transaction, even if some have no matching reference data.
  • right: mirror of left. Less commonly used; usually you can just swap the two tables and use left.
  • outer: keep every row from either table, with NaN wherever one side is missing. Useful as a diagnostic — it shows you which keys are in only one source.

When the join key has different column names in the two tables, use left_on and right_on:

trades.merge(reference, left_on="symbol", right_on="ticker", how="left")

When the key is the index of one or both tables, use left_index=True or right_index=True. This is the natural form for joining a price panel on a date index to a calendar table.

Validate your joins

A surprising number of analytical bugs are silent join errors — a one-to-many key relationship treated as one-to-one inflates the output row count. Pass validate="one_to_one", "one_to_many", "many_to_one", or "many_to_many" to merge to make pandas check the cardinality and raise an error if it does not match. Also pass indicator=True to add a _merge column showing whether each row matched on left, right, or both — invaluable for diagnosing what went wrong.

join — the index-based shortcut

df.join(other) is a convenience method that performs a left join on the index by default. It is shorter than the equivalent merge for the common case of joining two tables that share an index.

prices.join(volumes)               # left join on index
prices.join(volumes, how="outer")  # outer join on index

join is fine for quick exploration; merge is more general and is what you should reach for in any code that other people will read or run again.

Pivot, melt, and stack/unstack

Why this matters. The same data can be arranged in two very different shapes, depending on what you want to do with it. Imagine recording your grades for three courses across four semesters. You could store the data as one row per (semester, course) combination — twelve rows total, in long form. Or you could store it as one row per semester, with one column per course — four rows total, in wide form. Both are the same data; they’re just rearranged. Different operations prefer different shapes — groupby likes long form, plotting and reports like wide form — and pandas gives you four verbs (pivot, melt, stack, unstack) to flip between them.

pivot — long to wide

pivot rearranges a long-form table so that values from one column become new column headers.
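
Assuming a long-form table with date, ticker, and close columns:

wide = long_form.pivot(index="date", columns="ticker", values="close")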

What just happened? You turned a long-form table (one row per ticker-per-day) into a wide-form table (one row per day, one column per ticker). The result has one row per unique date and one column per unique ticker. This is the natural shape for time-series analysis: each stock becomes its own column that you can plot, correlate, or do arithmetic with.

pivot_table is the aggregation-aware cousin of pivot. It accepts an aggfunc argument and handles duplicate (index, column) pairs by aggregating rather than raising an error:

wide = long_form.pivot_table(index="date", columns="ticker",
                              values="close", aggfunc="mean")

Use pivot_table whenever there is any chance of duplicate keys, which is almost always.

melt — wide to long

melt is the inverse operation: it takes a wide table and stacks the columns into a long form.
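
Reversing the pivot above (names assumed as before):

long_form = wide.reset_index().melt(id_vars="date",
                                    var_name="ticker",
                                    value_name="close")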

The long form is the natural shape for groupby operations and for many plotting libraries (Plotly Express, Altair, Seaborn) that expect one row per observation. The wide form is the natural shape for human reading and for time-series math. Real workflows often move between the two repeatedly: load wide, melt for groupby, pivot back to wide for plotting.

stack and unstack — reshaping with the index

stack and unstack perform the same reshaping but use the index instead of named arguments. stack moves a level of the column index down into the row index, turning columns into rows. unstack moves a level of the row index up into the column index, turning rows into columns.
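
Continuing with the wide table from the pivot example:

stacked = wide.stack()          # ticker column level moves down into the row index (long)
wide_again = stacked.unstack()  # ...and back up into the columns (wide)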

These methods are essential when working with the MultiIndex frames that yfinance and similar data sources produce. When you download a panel of multiple tickers, the result has a two-level column index — typically (price_type, ticker) — and stack(level="ticker") or xs("Close", axis=1, level=0) is how you extract the slice you actually want.

Long form is for computers; wide form is for humans

A useful heuristic: every groupby, every merge, and every machine-learning input wants long form (or at least a flat table). Every chart, every report, and every human-readable summary wants wide form. The reshaping verbs are the bridges.

Method chaining — the modern Pandas idiom

Why this matters. When you read code that experienced analysts write, you’ll often see expressions that go on for many lines, with one method per line, no temporary variables, and dots running down the page. This is method chaining, and once you can read it, your own code becomes shorter, cleaner, and easier to debug. The reason it works: almost every pandas method returns a new DataFrame, so you can call the next method directly on the result. A multi-step pipeline collapses into a single expression that reads top-to-bottom like a recipe.

Without chaining — the procedural style

tmp1 = df[df["side"] == "BUY"]
tmp2 = tmp1.copy()
tmp2["notional"] = tmp2["price"] * tmp2["quantity"]
tmp3 = tmp2.groupby("ticker")["notional"].sum()
tmp4 = tmp3.sort_values(ascending=False)
result = tmp4.head(5)

Six lines, four intermediate variables, four opportunities to misspell a name or reuse a stale frame. The reader has to mentally track each tmp through the whole block to understand what the final result represents.

With chaining — the modern style

Before we run the next code block, here’s how to read it line by line: take the trades DataFrame, keep only the BUYs, add a notional column (price × quantity), group by ticker, sum the notional within each ticker, sort the totals high-to-low, keep the top five. Each line is one step; the whole thing is one expression wrapped in parentheses so it can span many lines.
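
A sketch of that chain:

result = (
    df
    .query("side == 'BUY'")
    .assign(notional=lambda d: d["price"] * d["quantity"])
    .groupby("ticker")["notional"]
    .sum()
    .sort_values(ascending=False)
    .head(5)
)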

What just happened? You ran the same six-step transformation as the verbose version above, but as one expression. Read top to bottom, it tells a story: filter to buys, compute notional, group by ticker, sum, sort, take the top five. There are no temporary names to track, no risk of reusing the wrong intermediate frame. The surrounding parentheses are what let the expression span many lines without Python complaining.

The chaining toolkit

Three methods make long chains possible.

.assign(col=...) adds or overwrites a column without mutating the original. The right-hand side can be a constant, a Series, or a lambda that receives the current frame and returns a Series. The lambda form is essential inside a chain because the intermediate frame has no name:

.assign(notional   = lambda d: d["price"] * d["quantity"],
        log_price  = lambda d: np.log(d["price"]))

.pipe(func, *args, **kwargs) applies an arbitrary function to the whole DataFrame inside the chain. Use it when you want to factor a complex transformation into a named function while keeping the chain readable:

def winsorise(df, col, lower=0.01, upper=0.99):
    # Clip a column at its own lower/upper quantiles to tame outliers
    lo, hi = df[col].quantile([lower, upper])
    return df.assign(**{col: df[col].clip(lo, hi)})

result = (df
          .pipe(winsorise, "price", 0.01, 0.99)
          .groupby("ticker")["price"].mean())

.query("...") is the chain-friendly filter. The bracket-and-mask form requires the DataFrame to have a name (df[df["x"] > 0]), which breaks the chain. query takes a string and works on whatever frame the chain is currently holding.

When chaining is the wrong tool

Chains are wonderful when each step has a clear meaning, but they can hide bugs when the chain is too long or when an intermediate result deserves inspection. A practical guideline: chains of three to seven steps are usually clearer than the equivalent procedural code; chains of fifteen steps are usually worse. When debugging, break the chain at the suspicious step, assign to a temporary, inspect it, and re-fuse the chain once you understand the bug.

A worked example: monthly summary from daily transactions

Why this matters. This final worked example is the moment everything in the chapter clicks together. We start with a raw daily transaction log (lots of small individual trades) and end with a clean monthly summary table — the kind of thing you’d actually email to a boss. Every idiom from the chapter shows up: loading, type casting, derived columns, groupby with named aggregation, pivot. If you can follow this end to end, you’ve got real working pandas skills.
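
One way to generate such a log (the ticker names, price range, and seed here are illustrative):

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)                    # fixed seed: reproducible fakes
days = pd.bdate_range("2024-01-01", periods=130)  # 130 trading days

n = 2500
trades = pd.DataFrame({
    "date":     rng.choice(days, size=n),
    "ticker":   rng.choice(["AAPL", "MSFT", "GOOG", "AMZN"], size=n),
    "side":     rng.choice(["BUY", "SELL"], size=n),
    "price":    rng.uniform(100, 500, size=n).round(2),
    "quantity": rng.integers(10, 500, size=n),
}).sort_values("date").reset_index(drop=True)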

What just happened? You generated about 2,500 fake transactions across 130 trading days and four tickers — too many for you to read line by line, but tiny for pandas. The goal now is a single monthly summary table with one row per ticker-month, reporting trade count, signed quantity (positive for BUY, negative for SELL), total dollar value traded, average price, and price volatility (standard deviation of price within the month).

Step one: build the pipeline
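
Here is one way to write the pipeline (the output-column names are choices, not requirements):

monthly = (
    trades
    .assign(
        month      = lambda d: d["date"].dt.to_period("M"),
        signed_qty = lambda d: np.where(d["side"] == "BUY",
                                        d["quantity"], -d["quantity"]),
        notional   = lambda d: d["price"] * d["quantity"],
    )
    .groupby(["month", "ticker"])
    .agg(
        n_trades       = ("ticker",     "size"),
        net_qty        = ("signed_qty", "sum"),
        total_notional = ("notional",   "sum"),
        avg_price      = ("price",      "mean"),
        price_vol      = ("price",      "std"),
    )
    .round(2)
)
print(monthly)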

What just happened? The pipeline reads as five English sentences: (1) add three derived columns — a month period, a signed quantity, and a notional; (2) group by month and ticker; (3) aggregate into five named output columns; (4) round to two decimals; (5) print. Each step is doing exactly what it says — no hidden magic.

Step two: pivot to a human-readable report

The groupby result has a two-level row index (month, ticker), which is fine for storage but awkward for human consumption. A single unstack turns it into a wide table with one row per month and a column per ticker for the chosen metric.
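
For example, pulling out the total-notional metric:

report = monthly["total_notional"].unstack("ticker")
print(report)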

What just happened? The wide form is what you’d paste into a slide deck or email to a manager: one row per month, one column per ticker, totals in dollars. The long form (before the unstack) is what you’d feed into a plotting library or save back to a database. The two are the same data, reshaped — which is exactly the point of this section.

What you’ve just done

Taken together, the cells above are a complete, real-job-style analytics pipeline: load, type, transform, group, aggregate, reshape, report. Real pipelines at a hedge fund or an e-commerce company are longer and have more error-handling, but the structure is the same. If you can read these chains and predict their output, you've got the working vocabulary of an intermediate pandas user — which is a real, employable skill.

Exercises

Exercise 1 — Load and audit

Either read a file called transactions.csv, or just reuse the synthetic generator from the worked example above. Once you have the DataFrame:

  1. Print its shape, dtypes, and column names.
  2. Compute the per-column missing percentage. Which columns (if any) have missing data?
  3. Report the date range the data covers, using the min and max of the date column (or df.index.min() and df.index.max(), if the dates are your index).
Tip

Hint. Chain df.isna().mean().mul(100).round(2) for the missingness percentages — it reads left to right as “is missing?”, “what fraction?”, “multiply by 100”, “round to 2 decimals”.

Exercise 2 — Selection and filtering

Using the trades DataFrame, try the following:

  1. Select only the BUY trades with quantity above 300 — first using the bracket-and-mask form, then using .query(). Confirm the two results are identical with .equals().
  2. Find every trade for AAPL or MSFT where the price was below the median price of all trades.
  3. Pull the first 50 rows and the last 50 rows using .iloc[]. Then try to pull “the same” rows by date range using .loc[]. Why do the two row counts differ slightly?
Tip

Hint. Remember the slice asymmetry: .iloc[] slicing excludes the upper endpoint (like normal Python lists), but .loc[] slicing includes it. This is the source of the row-count difference in part 3.

Exercise 3 — Missing values

Build a small DataFrame with 20 rows and three columns (price, quantity, rating), and randomly poke NaNs into about 20% of the cells. Then:

  1. Drop only the rows where price specifically is missing (use the subset= argument).
  2. Fill missing quantity values with the median of quantity.
  3. Forward-fill missing rating values, treating the rows as time-ordered.
  4. Compare the variance of quantity before and after mean imputation. Which direction does the variance move, and why?
Tip

Hint. For part 4, recall that mean imputation inserts values with zero deviation from the mean, so it always compresses variance. Compute quantity.var() before, then again after quantity.fillna(quantity.mean()).

Exercise 4 — Groupby with named aggregation

Using the transactions DataFrame, produce a per-ticker summary table with these five columns:

  • n_trades: total number of trades.
  • total_qty: sum of quantity.
  • avg_price: mean price.
  • price_iqr: the interquartile range of price (the 75th percentile minus the 25th — a robust measure of spread that’s less sensitive to outliers than the standard deviation).
  • buy_share: fraction of trades that are BUYs.

Two of the five aggregations don’t have a built-in name — you’ll need to write them as lambda functions inside the named-aggregation call. Once you have the table, look at it and answer: which ticker has the widest price IQR?

Tip

Hint. ("price", lambda s: s.quantile(0.75) - s.quantile(0.25)) is a valid argument to agg. For buy_share, you can write ("side", lambda s: (s == "BUY").mean()) — the mean of a True/False Series gives the fraction of Trues.

Exercise 5 — Merge and validate

Build a small reference table with columns ticker (the key to join on) and sector (which industry the company belongs to — Tech, Healthcare, etc.). Include three of the four tickers from your transactions DataFrame, plus one extra ticker that doesn’t appear in the trades. Then:

  1. Run a left merge of trades onto the reference table. How many rows end up with a missing sector?
  2. Run an outer merge with indicator=True. How many rows appear only in the reference table, and what are their tickers?
  3. Pass validate="many_to_one" to confirm the join cardinality (many trades map to one reference row). What error would you get if the reference had duplicate tickers?
Tip

Hint. The indicator=True argument adds a _merge column that tells you whether each row came from the left, right, or both — invaluable for figuring out exactly what your join did.

Exercise 6 — End-to-end chain

This one ties together everything in the chapter. Write a single chained expression that turns the raw transactions DataFrame into a wide table where:

  • each row is a month,
  • each column is a ticker,
  • each value is the average daily notional traded in that ticker-month.

Round to the nearest dollar and sort the rows in chronological order. The whole pipeline should fit in fewer than ten lines, and should use .assign(), .groupby(), named aggregation, and .unstack().

Tip

Hint. Average daily notional = total notional in the month ÷ number of distinct trading days in that month. Compute both the numerator and the denominator inside a single .agg() call (with named aggregations), then divide the two new columns to make a third. After that, .unstack("ticker") reshapes from long to wide.

Where this leaves us

The DataFrame is the workhorse object of applied analytics, and the operations covered in this chapter — load, select, filter, fill, transform, group, aggregate, merge, pivot, chain — are the daily vocabulary you’ll use in every data project, in every internship, in every analytics job. The next chapter builds on this foundation with time-series-specific methods (resampling, rolling windows, lag operators, calendar arithmetic) that turn the generic DataFrame into a purpose-built tool for the kinds of panel data (multiple things observed over multiple dates) that dominate finance, economics, and operations.

If a single habit from this chapter sticks, let it be this: express transformations as a single chained pipeline, and the code you write today will still be readable to your future self — and to whoever inherits the notebook after you.

 

Prof. Xuhu Wan · HKUST ISOM · Modern Business Analytics