Database Indexing
The single most impactful thing to understand for database performance. An index is a precomputed shortcut that lets the database find matching rows without scanning every row in the table.
The library analogy
Imagine a library with a million books stacked in no order.
- No index → to find “all books by Dostoevsky”, you read the title page of every book. One million lookups. This is a table scan.
- With an index → there’s a separate card catalog sorted by author. You flip to “D”, find Dostoevsky’s cards, each tells you exactly which shelf to go to. Maybe a dozen lookups total.
That’s what a database index is: a separate, sorted data structure that points back to the actual rows.
What an index actually is
Most databases use B-tree (balanced tree) indexes by default. A B-tree keeps keys sorted in a tree structure where each node has many children (hundreds is typical).
```
              [M]
            /     \
        [F]         [T]
       /   \       /    \
  [B,D]  [H,J,L] [O,R]  [U,W,Y]
```
Properties:
- O(log n) search, insert, delete — a billion keys needs only ~30 comparisons, and because each node holds hundreds of keys, that's just 3–4 page reads
- Leaves are linked, so range queries (`WHERE age BETWEEN 20 AND 30`) walk the leaves sequentially
- Naturally sorted, so `ORDER BY` can sometimes skip sorting entirely
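The leaf-level behavior can be sketched with a sorted Python list standing in for the leaves. This is a rough model (one flat sorted array rather than a real multi-level B-tree), but the access pattern is the same: one binary-search descent, then a sequential walk.

```python
# Rough model of B-tree leaf behavior: a sorted array gives O(log n) point
# lookups via binary search, and range queries become one search followed
# by a sequential walk -- the same pattern a B-tree's linked leaves use.
import bisect

ages = sorted([37, 22, 29, 41, 25, 30, 19, 28])  # the "indexed" column values

def range_query(keys, lo, hi):
    """Return all keys in [lo, hi]: one binary search + a linear walk."""
    start = bisect.bisect_left(keys, lo)   # O(log n) descent to first match
    out = []
    for k in keys[start:]:                 # walk the "leaves" sequentially
        if k > hi:
            break
        out.append(k)
    return out

print(range_query(ages, 20, 30))  # → [22, 25, 28, 29, 30]
```

The key point: no key outside the range is ever inspected after the initial descent, which is why range scans on indexed columns stay cheap even on huge tables.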
Other index types exist for special workloads:
- Hash — exact match only (no range queries); faster than a B-tree for `=` lookups
- GIN (PostgreSQL) — inverted index for arrays, JSON, full-text search
- GiST — geospatial, range types
- BRIN — block-range, great for huge time-series tables with natural ordering
What gets indexed (and what doesn’t)
By default:
- Primary key → always indexed (automatically)
- Unique constraints → automatically indexed
- Foreign keys → often not automatically indexed (depends on the DB) — a common source of slow queries
You add indexes manually for:
- Columns you filter by in WHERE clauses
- Columns you join on
- Columns you sort by in ORDER BY
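As a concrete illustration of the foreign-key pitfall, here is a sketch in SQLite (whose foreign keys are likewise not auto-indexed); table and index names are invented:

```python
# Sketch (SQLite): foreign-key columns get no automatic index, so filtering
# or joining on one scans the table until you add the index yourself.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY);
    CREATE TABLE orders (id INTEGER PRIMARY KEY,
                         customer_id INTEGER REFERENCES customers(id));
""")

def plan(sql):
    # EXPLAIN QUERY PLAN rows are (id, parent, notused, detail)
    return " ".join(r[3] for r in con.execute("EXPLAIN QUERY PLAN " + sql))

q = "SELECT * FROM orders WHERE customer_id = 7"
before = plan(q)                    # full scan: the FK column has no index
con.execute("CREATE INDEX idx_orders_customer ON orders(customer_id)")
after = plan(q)                     # now an index search on the FK column
print(before)
print(after)
```

Postgres behaves the same way: the `REFERENCES` constraint alone does not create an index on `orders.customer_id`.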
The four levels of query efficiency
Learning to read query plans (EXPLAIN in most databases) is the key skill. From worst to best:
| Plan | What it means | When OK |
|---|---|---|
| Sequential scan | Read every row | Small tables, or when most rows match |
| Index scan | Use index to find rows, then read each row | Selective queries |
| Index-only scan | All needed columns are in the index — no table access | Carefully designed covering indexes |
| Index skip/range | Walk a small portion of the index | Range queries on indexed columns |
A sequential scan over a billion-row table is the classic “my query is slow” root cause.
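Three of the four levels can be observed directly with SQLite's `EXPLAIN QUERY PLAN` (a sketch with invented names; Postgres and MySQL print the same ideas in their own wording):

```python
# Sketch: sequential scan, index scan, and index-only (covering) scan,
# observed via SQLite's EXPLAIN QUERY PLAN.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, age INT, name TEXT)")

def plan(sql):
    return " ".join(r[3] for r in con.execute("EXPLAIN QUERY PLAN " + sql))

seq = plan("SELECT name FROM users WHERE age = 30")    # no index: SCAN
con.execute("CREATE INDEX idx_users_age ON users(age)")
idx = plan("SELECT name FROM users WHERE age = 30")    # index finds rows, then fetches 'name'
only = plan("SELECT age FROM users WHERE age = 30")    # everything needed is in the index
print(seq, "|", idx, "|", only)
```

Note how the same index produces an ordinary index scan or an index-only scan depending purely on which columns the query asks for.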
The cost of indexes
Indexes aren’t free:
- Disk space — a B-tree index on a 1 TB table can easily be 100 GB. Multiply by N indexes.
- Write slowdown — every INSERT / UPDATE / DELETE has to update every index on the table.
- Cache pressure — indexes compete with data for RAM.
Rule of thumb: every index helps some queries and hurts all writes. Only add an index if you can point to the query it accelerates.
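The disk-space cost is easy to measure. A sketch using SQLite's page counter (database size = `page_count` × `page_size`; table and index names invented):

```python
# Sketch: the storage cost of an index, measured with SQLite's page counter.
# The database grows by real pages the moment the index is built.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, ts TEXT, payload TEXT)")
con.executemany("INSERT INTO events VALUES (?, ?, ?)",
                [(i, f"2024-01-{i % 28 + 1:02d}", "x" * 50) for i in range(5000)])

def db_pages():
    return con.execute("PRAGMA page_count").fetchone()[0]

pages_before = db_pages()
con.execute("CREATE INDEX idx_events_ts ON events(ts)")
pages_after = db_pages()
print(pages_before, pages_after)  # the index occupies pages of its own
```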
Composite indexes
An index on (country, city, age) — the order matters.
This index accelerates:
- `WHERE country = 'Georgia'` ✓
- `WHERE country = 'Georgia' AND city = 'Tbilisi'` ✓
- `WHERE country = 'Georgia' AND city = 'Tbilisi' AND age = 30` ✓
It does not accelerate:
- `WHERE city = 'Tbilisi'` ✗ (can’t use the index; wrong leading column)
- `WHERE age = 30` ✗
Leftmost-prefix rule: a composite index on (a, b, c) helps queries that filter on a, or a+b, or a+b+c — not on b alone or c alone.
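A runnable sketch of the leftmost-prefix rule, again in SQLite with invented names:

```python
# Sketch: the leftmost-prefix rule. A composite index on (country, city, age)
# can be seeked only when the leading column is constrained.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE people (country TEXT, city TEXT, age INT)")
con.execute("CREATE INDEX idx_people ON people(country, city, age)")

def plan(sql):
    return " ".join(r[3] for r in con.execute("EXPLAIN QUERY PLAN " + sql))

lead = plan("SELECT * FROM people WHERE country = 'Georgia'")
full = plan("SELECT * FROM people "
            "WHERE country = 'Georgia' AND city = 'Tbilisi' AND age = 30")
skip = plan("SELECT * FROM people WHERE city = 'Tbilisi'")
print(lead)  # SEARCH: the leading column is bound
print(full)  # SEARCH: all three columns bound
print(skip)  # SCAN: no leading column, so no index seek
```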
Covering indexes / “include” columns
An index that contains all the columns a query needs is a covering index → the database answers the query entirely from the index, never touching the table. Dramatic speedup.
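A sketch of the effect in SQLite, which has no `INCLUDE` clause, so the extra columns simply go into the index itself (schema and names invented):

```python
# Sketch: a covering index in SQLite. Without an INCLUDE clause, the extra
# columns are appended to the index key; the plan then never touches the table.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INT,
                                    total REAL, order_date TEXT)""")
con.execute("CREATE INDEX idx_cover ON orders(customer_id, total, order_date)")

q = "SELECT total, order_date FROM orders WHERE customer_id = 42"
detail = " ".join(r[3] for r in con.execute("EXPLAIN QUERY PLAN " + q))
print(detail)  # 'USING COVERING INDEX': the table itself is never read
```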
```sql
CREATE INDEX ON orders (customer_id) INCLUDE (total, order_date);
-- Query: SELECT total, order_date FROM orders WHERE customer_id = 42;
-- Reads just the index. The table is untouched.
```

When indexes don’t help
- Low-selectivity columns — an index on `is_active` (boolean) is usually worse than a table scan, because most rows match anyway
- Functions on indexed columns — `WHERE LOWER(email) = '...'` can’t use an index on `email` unless you create a functional index on `LOWER(email)`
- Leading wildcards — `WHERE name LIKE '%foo%'` can’t use a standard index; needs full-text search
- Implicit type conversion — `WHERE id = '42'` where `id` is numeric can skip the index
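The functional-index case can be sketched in SQLite, which supports expression indexes (table and index names invented):

```python
# Sketch: an index on email cannot serve WHERE LOWER(email) = ...;
# an expression index on LOWER(email) can.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, email TEXT)")
con.execute("CREATE INDEX idx_email ON accounts(email)")

def plan(sql):
    return " ".join(r[3] for r in con.execute("EXPLAIN QUERY PLAN " + sql))

q = "SELECT id FROM accounts WHERE LOWER(email) = 'ada@example.com'"
before = plan(q)   # the plain index on email is no help: still a scan
con.execute("CREATE INDEX idx_email_lower ON accounts(LOWER(email))")
after = plan(q)    # the expression index turns it into an index search
print(before)
print(after)
```

Postgres has the same feature under the name functional (expression) index: `CREATE INDEX ON accounts (LOWER(email))`.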
How to find missing indexes
Three practical techniques:
- Slow query log — enable it, watch for queries that take >100 ms (or whatever your threshold is). Look at the top 10 offenders.
- EXPLAIN / EXPLAIN ANALYZE — run on suspect queries. Find the sequential scans.
- Index advisors — `pg_stat_statements` plus extensions in Postgres, Query Store in SQL Server, Performance Schema in MySQL — all can surface the queries that would benefit most.
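A toy version of the advisor idea: run `EXPLAIN QUERY PLAN` over a known workload and flag queries that fall back to a full scan (SQLite-flavoured sketch with invented names; real advisors work from live statistics instead):

```python
# Sketch: a tiny "index advisor" that flags workload queries whose plan
# starts with a full table scan.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, "
            "customer_id INT, status TEXT)")
con.execute("CREATE INDEX idx_orders_customer ON orders(customer_id)")

workload = [
    "SELECT * FROM orders WHERE customer_id = 7",
    "SELECT * FROM orders WHERE status = 'shipped'",
]

def needs_index(sql):
    details = [r[3] for r in con.execute("EXPLAIN QUERY PLAN " + sql)]
    return any(d.startswith("SCAN") for d in details)

flagged = [q for q in workload if needs_index(q)]
print(flagged)  # only the status query lacks a usable index
```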
How to find useless indexes
- `pg_stat_user_indexes` in Postgres shows how many times each index was actually used. Zero uses over weeks → probably a candidate for removal.
- Duplicate indexes (same columns, same order) exist surprisingly often in legacy systems. Drop them.
- Indexes on columns that are never queried. Drop them.
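Duplicate detection can be automated. A sketch for SQLite that groups indexes by their exact column list (table and index names invented):

```python
# Sketch: find duplicate indexes (same columns, same order) by comparing
# each index's column list from PRAGMA index_info.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE t (a INT, b INT);
    CREATE INDEX idx_one ON t(a, b);
    CREATE INDEX idx_two ON t(a, b);   -- accidental duplicate
    CREATE INDEX idx_three ON t(b);
""")

cols = {}
for (name,) in con.execute("SELECT name FROM sqlite_master "
                           "WHERE type = 'index' AND tbl_name = 't' "
                           "ORDER BY name"):
    # PRAGMA index_info rows are (seqno, cid, column_name)
    key = tuple(r[2] for r in con.execute(f"PRAGMA index_info({name})"))
    cols.setdefault(key, []).append(name)

dupes = {k: v for k, v in cols.items() if len(v) > 1}
print(dupes)  # {('a', 'b'): ['idx_one', 'idx_two']} -> drop one of them
```

The equivalent query against Postgres system catalogs joins `pg_index` with `pg_attribute`, but the grouping idea is identical.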