Databases — Senior Engineer Study Guide
Companion to INTERVIEW_PREP.md §8. This guide is the teaching layer for databases: concepts explained from first principles, SQL/CLI examples, comparison tables, gotchas, and anchor examples (a legacy MQ microservice, a MongoDB productivity app, regulated-environment compliance, schema-enforced writes via Avro).
Scope: Relational (Postgres-first, MySQL/SQL Server as callouts) + NoSQL (MongoDB, DynamoDB deep) + distributed-DB theory + caching + OLAP + search + JPA/Spring Data. Stack weighting: Postgres = canonical SQL engine, MongoDB = canonical document store, DynamoDB = canonical managed KV/wide-column. Other engines get "what differs" callouts.
How to use: Skim the Q-Map to jump to the section answering a specific INTERVIEW_PREP §8 question. For open study, walk §1 → §30 in order. Morning-of-interview: §31 Rapid-Fire. Pre-onsite drill: §32 Practice Exercises.
Table of Contents
- Database Landscape & Taxonomy
- Relational Model & SQL Fundamentals
- Joins Deep Dive
- Advanced SQL
- Schema Design & Normalization
- ACID & Transactions
- Isolation Levels & Anomalies
- Concurrency Control
- Indexing Deep Dive
- Storage Internals
- Query Optimization
- Replication
- Sharding & Partitioning
- Distributed Systems for DBs
- NoSQL Categories
- MongoDB Deep Dive
- DynamoDB Deep Dive
- Postgres Specifics
- MySQL / SQL Server Callouts
- Caching
- OLAP & Data Warehousing
- Search (Elasticsearch / OpenSearch)
- Schema Migrations
- Backups, HA, DR
- Security
- Connection Pooling & Tuning
- JPA / Hibernate Specifics
- Spring Data Specifics
- Observability & Troubleshooting
- Connect to Your Experience
- Rapid-Fire Review
- Practice Exercises
Interview-Question Coverage Matrix
Maps each INTERVIEW_PREP.md §8 question (1–21) to the section(s) in this guide that answer it.
| Q# | Topic | Section |
|---|---|---|
| 1 | ACID — each letter with example | §6 |
| 2 | Isolation levels + anomalies (dirty/non-repeatable/phantom) | §7 |
| 3 | MVCC and Postgres | §8, §10 |
| 4 | Indexes — B-tree vs hash vs GIN; when indexes hurt | §9 |
| 5 | Covering index | §9 |
| 6 | When to denormalize | §5 |
| 7 | EXPLAIN/EXPLAIN ANALYZE — reading plans | §11 |
| 8 | N+1 query problem (Hibernate/JPA) | §11, §27 |
| 9 | Connection pooling (HikariCP) | §26 |
| 10 | Pessimistic vs optimistic locking | §8 |
| 11 | Client disconnects mid-transaction | §6 |
| 12 | MongoDB vs Postgres — when | §15, §16 |
| 13 | MongoDB indexes — compound prefix | §16 |
| 14 | MongoDB transactions — limitations | §16 |
| 15 | DynamoDB — partition key vs sort key, hot partition | §17 |
| 16 | DynamoDB — GSI vs LSI, consistency | §17 |
| 17 | DynamoDB — single-table design | §17 |
| 18 | DynamoDB — one-to-many modeling | §17 |
| 19 | DynamoDB — RCU/WCU, on-demand vs provisioned | §17 |
| 20 | Zero-downtime column rename | §23 |
| 21 | Cache invalidation strategies | §20 |
1. Database Landscape & Taxonomy
Before you pick a database, know the axes you're picking on. Senior interviews test whether you can match a workload to an engine — not whether you can recite features.
Workload axis — OLTP vs OLAP
| | OLTP (transactional) | OLAP (analytical) |
|---|---|---|
| Workload | Many small reads/writes, latency-sensitive | Few large scans/aggregates, throughput-sensitive |
| Rows touched per query | 1s–100s | Millions–billions |
| Storage layout | Row-oriented | Column-oriented |
| Examples | Postgres, MySQL, MongoDB, DynamoDB | Redshift, BigQuery, Snowflake, ClickHouse, DuckDB |
| Indexes | B-tree heavy | Zone maps, bloom filters, no per-row indexes |
| Concurrency | High — ACID transactions | Low — batch or micro-batch |
OLTP = "what happened to order #42?" OLAP = "how did revenue trend across the last 90 days by region?"
Model axis — SQL vs NoSQL (NoSQL is four things, not one)
- Document — self-describing JSON-like records; flexible schema. MongoDB, Couchbase, Firestore.
- Key-value (KV) — opaque value keyed by a simple identifier; fastest possible lookup. Redis, Memcached, DynamoDB (when used trivially), etcd.
- Wide-column — rows keyed by partition + clustering keys, columns vary per row. Cassandra, HBase, DynamoDB (in single-table-design usage), ScyllaDB.
- Graph — nodes + typed edges optimized for traversal. Neo4j, JanusGraph, Amazon Neptune.

Two other non-relational families worth knowing:
- Time-series — writes dominate, natural clock-based partitioning. InfluxDB, TimescaleDB, Prometheus TSDB.
- Vector — nearest-neighbor search over embeddings (HNSW, IVF). pgvector, Pinecone, Weaviate, Milvus.
The decision tree (spoken out loud in an interview)
- Need strong relational integrity + joins + ACID across multiple rows? → SQL (Postgres default).
- Access pattern is simple lookup, and you already know every query shape? → KV (DynamoDB, Redis).
- Data is natural documents with variable shape, and you don't cross-join? → Document (MongoDB).
- Write volume is massive, you only query by key, and consistency is tunable? → Wide-column (Cassandra / DynamoDB SLT).
- Data is relationships-first (social graph, fraud rings)? → Graph.
- Data is time + metric first? → Time-series.
- Data is embeddings (semantic search, RAG)? → Vector.

Gotcha — "polyglot persistence" is the usual answer. Real systems mix engines: Postgres for the transactional core, Redis for cache + hot counters, Elasticsearch for full-text, S3 + a warehouse for analytics. Don't pretend one engine fits everything.
Interview Qs covered
§1 sets the taxonomy used by Qs 12 (Mongo vs Postgres) and 15–19 (DynamoDB).
2. Relational Model & SQL Fundamentals
A relation is a set of tuples with a fixed schema. A table is a relation with a name. The relational model, from Codd (1970), says: "data is rows, schema is fixed, operations are set-based." SQL is the language that implements it — but SQL predates (and violates) a few pure-relational rules, most famously by allowing duplicates (bags, not sets) and NULLs.
Keys
| Key | What it is | Example |
|---|---|---|
| Primary key (PK) | Unique, non-null, one per table — the "identity" of a row | user_id BIGINT |
| Candidate key | Any column(s) that could serve as PK — minimal + unique | email (if unique) |
| Composite key | PK made of multiple columns | (order_id, line_no) |
| Surrogate key | Synthetic, typically an auto-increment or UUID, carries no domain meaning | BIGSERIAL, UUID v7 |
| Natural key | Has real-world meaning | SSN, email, ISBN |
| Foreign key (FK) | Column(s) referencing another table's PK; database-enforced integrity | order.user_id → user.id |
| Unique constraint | Non-PK uniqueness; allows NULL (depending on engine) | UNIQUE(email) |
Surrogate vs natural PK — the pragmatic default is surrogate. Natural keys change (email rebinds, names change spelling, SSNs don't apply internationally). A surrogate PK is stable forever; enforce the natural constraint as a UNIQUE. The rare case for a natural PK: immutable reference data (ISO 3166 country codes).
UUID PK tradeoff: UUIDs are globally unique and shardable, but they are 16 bytes, random-ordered (trashes B-tree locality), and invisible to humans. Prefer UUIDv7 (time-ordered), or BIGSERIAL plus a separate UUID public_id for the wire API.
SQL DDL / DML / DCL / TCL
- DDL (Data Definition) — `CREATE`, `ALTER`, `DROP`, `TRUNCATE`.
- DML (Data Manipulation) — `SELECT`, `INSERT`, `UPDATE`, `DELETE`, `MERGE`/UPSERT.
- DCL (Data Control) — `GRANT`, `REVOKE`.
- TCL (Transaction Control) — `BEGIN`, `COMMIT`, `ROLLBACK`, `SAVEPOINT`.
NULL and three-valued logic
SQL predicates evaluate to TRUE, FALSE, or UNKNOWN. NULL compared with anything (including another NULL) is UNKNOWN. Consequences:

```sql
SELECT * FROM users WHERE deleted_at = NULL;  -- returns 0 rows, always
SELECT * FROM users WHERE deleted_at IS NULL; -- correct
```

- Aggregates (`COUNT`, `SUM`, `AVG`) skip NULLs — except `COUNT(*)`, which counts rows.
- `NOT IN (subquery)` returns zero rows if the subquery contains any NULL. Use `NOT EXISTS` instead.
- Uniqueness: Postgres / MySQL / Oracle treat each NULL as distinct (you can have many NULLs in a UNIQUE column). SQL Server treats them as equal (only one NULL allowed) — different behavior.

Gotcha. `WHERE status <> 'active'` silently excludes rows with `status IS NULL`. Always think about the NULL case.
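Three-valued logic is easy to verify empirically. Below is a minimal, runnable sketch using Python's stdlib `sqlite3` (SQLite stands in for Postgres purely for portability — the NULL semantics shown here are the same); the tiny `users` fixture is invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, deleted_at TEXT);
INSERT INTO users VALUES (1, NULL), (2, '2024-01-01'), (3, NULL);
""")

# '= NULL' is UNKNOWN for every row -> zero rows, always
assert conn.execute("SELECT COUNT(*) FROM users WHERE deleted_at = NULL").fetchone()[0] == 0
# 'IS NULL' is the correct predicate
assert conn.execute("SELECT COUNT(*) FROM users WHERE deleted_at IS NULL").fetchone()[0] == 2
# COUNT(col) skips NULLs; COUNT(*) counts rows
assert conn.execute("SELECT COUNT(deleted_at), COUNT(*) FROM users").fetchone() == (1, 3)
# NOT IN against a subquery containing a NULL matches nothing
assert conn.execute(
    "SELECT COUNT(*) FROM users WHERE id NOT IN (SELECT deleted_at FROM users)"
).fetchone()[0] == 0
```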
Set operations

```sql
SELECT id FROM a
UNION           -- set union, DISTINCT
SELECT id FROM b;
-- UNION ALL    -- bag union, keeps duplicates (cheaper, no sort/dedup)
-- INTERSECT    -- common rows
-- EXCEPT       -- rows in A not in B (MINUS in Oracle)
```

UNION ALL is almost always what you want in app code. UNION adds a sort/dedup pass — expensive.
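The dedup difference is directly observable. A small sketch with Python's stdlib `sqlite3` (the toy tables `a` and `b` are assumptions for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE a (id INTEGER);
CREATE TABLE b (id INTEGER);
INSERT INTO a VALUES (1), (2), (2);
INSERT INTO b VALUES (2), (3);
""")

union     = [r[0] for r in conn.execute("SELECT id FROM a UNION SELECT id FROM b ORDER BY 1")]
union_all = [r[0] for r in conn.execute("SELECT id FROM a UNION ALL SELECT id FROM b")]
intersect = [r[0] for r in conn.execute("SELECT id FROM a INTERSECT SELECT id FROM b")]
minus     = [r[0] for r in conn.execute("SELECT id FROM a EXCEPT SELECT id FROM b")]

assert union == [1, 2, 3]    # set semantics: deduplicated
assert len(union_all) == 5   # bag semantics: all 5 input rows survive
assert intersect == [2]
assert minus == [1]
```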
Constraints
- NOT NULL — rejects NULL.
- CHECK — arbitrary predicate: `CHECK (age >= 0 AND age < 150)`.
- FOREIGN KEY — referential integrity. Actions: `ON DELETE CASCADE`, `SET NULL`, `RESTRICT`, `NO ACTION`.
- UNIQUE — one-column or composite.
- EXCLUDE (Postgres) — generalized uniqueness using any operator (e.g., "no two ranges overlap" via GiST + `&&`).
Interview Qs covered
Foundational — touches §8 Qs 1 (ACID), 6 (denormalize), 11 (tx lifecycle).
3. Joins Deep Dive
A join combines rows from two tables using a predicate. Know the six shapes and know which algorithm the planner uses.
Join shapes
Given users(id, name) and orders(id, user_id, total):
| Shape | Result | SQL |
|---|---|---|
| INNER | Only rows matching predicate in both sides | users u JOIN orders o ON o.user_id = u.id |
| LEFT OUTER | All left rows; unmatched right columns NULL | users u LEFT JOIN orders o ON ... |
| RIGHT OUTER | All right rows; unmatched left columns NULL | (mirror of LEFT β rarely used in practice) |
| FULL OUTER | All rows from both sides; unmatched side is NULL | FULL JOIN |
| CROSS | Cartesian product — every left × every right | users CROSS JOIN products |
| SELF | Table joined with itself; use aliases | employees e JOIN employees m ON e.mgr_id = m.id |
Two more flavors you should name:
- SEMI JOIN — "does a matching row exist?" Written as `EXISTS (subquery)` or `IN (subquery)`. No columns from the right side.
- ANTI JOIN — "no matching row exists." Written as `NOT EXISTS`. Prefer over `NOT IN` (NULL gotcha).
- LATERAL JOIN (Postgres, Oracle) — the right side can reference the left side's current row, row-by-row. Useful for "top-N per group":

```sql
SELECT u.id, recent.id AS order_id
FROM users u
JOIN LATERAL (
  SELECT id FROM orders o WHERE o.user_id = u.id
  ORDER BY o.created_at DESC LIMIT 3
) recent ON TRUE;
```

Join algorithms (what the planner actually does)
| Algorithm | How it works | Best when | Complexity |
|---|---|---|---|
| Nested loop | For each row in outer, scan/index-lookup inner | Small outer + indexed inner; or very small inner | O(N·M) worst; O(N·log M) with index |
| Hash join | Build hash table on smaller side; probe with larger side | Equi-joins, no useful index, enough memory for build side | O(N+M), memory-bound |
| Merge join | Both sides sorted on join key, then merge | Both sides already ordered (index), large result | O(N+M) if pre-sorted; O(N log N + M log M) if not |
```sql
-- Postgres planner hint (how to read it)
EXPLAIN SELECT * FROM orders o JOIN users u ON u.id = o.user_id;
-->  Hash Join  (cost=... rows=... width=...)
       Hash Cond: (o.user_id = u.id)
       ->  Seq Scan on orders
       ->  Hash
             ->  Seq Scan on users
```

Gotcha. Missing FK index → nested loop over orders with a full scan of users per probe → quadratic blow-up at scale. Always index FK columns, especially if the child table is the "many" side.
ON vs WHERE with outer joins
These give different answers:

```sql
-- (A) filter BEFORE the outer join -> users without orders still appear
SELECT u.id FROM users u
LEFT JOIN orders o ON o.user_id = u.id AND o.total > 100;

-- (B) filter AFTER the outer join -> users without orders DROPPED (NULL > 100 is UNKNOWN)
SELECT u.id FROM users u
LEFT JOIN orders o ON o.user_id = u.id
WHERE o.total > 100;
```

Form (B) is effectively an inner join. If you want LEFT semantics plus a post-filter that preserves NULL rows, write `WHERE o.total > 100 OR o.id IS NULL`.
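To convince yourself, the two forms can be run side by side. A runnable sketch using Python's stdlib `sqlite3` (tiny hypothetical dataset; user 2 deliberately has no orders):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users  (id INTEGER PRIMARY KEY);
CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total REAL);
INSERT INTO users VALUES (1), (2);
INSERT INTO orders VALUES (10, 1, 150.0), (11, 1, 50.0);
""")

# (A) filter in ON: outer-join semantics preserved, user 2 still appears
a = conn.execute("""
    SELECT u.id, o.id FROM users u
    LEFT JOIN orders o ON o.user_id = u.id AND o.total > 100
    ORDER BY u.id
""").fetchall()
assert a == [(1, 10), (2, None)]

# (B) filter in WHERE: NULL > 100 is UNKNOWN, user 2 is silently dropped
b = conn.execute("""
    SELECT u.id, o.id FROM users u
    LEFT JOIN orders o ON o.user_id = u.id
    WHERE o.total > 100
""").fetchall()
assert b == [(1, 10)]   # inner-join behavior
```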
Interview Qs covered
Foundational for Q7 (EXPLAIN) and Q8 (N+1 — ORMs generate nested-loop fetches).
4. Advanced SQL
Advanced SQL is the line between "I can write a query" and "I can solve problems in the database tier." Senior interviews expect all of this.
Common Table Expressions (CTE)
A named sub-query used in a later query. Great for readability.

```sql
WITH active_users AS (
  SELECT id FROM users WHERE deleted_at IS NULL
),
recent_orders AS (
  SELECT user_id, SUM(total) AS revenue
  FROM orders
  WHERE created_at > NOW() - INTERVAL '30 days'
  GROUP BY user_id
)
SELECT a.id, COALESCE(r.revenue, 0)
FROM active_users a
LEFT JOIN recent_orders r ON r.user_id = a.id;
```

Postgres gotcha. Until Postgres 12, CTEs were an optimization fence — always materialized. Postgres 12+ inlines non-recursive, non-write, single-reference CTEs. Force the old behavior with `WITH ... AS MATERIALIZED`; force inlining with `AS NOT MATERIALIZED`.
Recursive CTEs

```sql
WITH RECURSIVE descendants AS (
  SELECT id, manager_id, name, 1 AS depth FROM employees WHERE id = 42
  UNION ALL
  SELECT e.id, e.manager_id, e.name, d.depth + 1
  FROM employees e JOIN descendants d ON e.manager_id = d.id
)
SELECT * FROM descendants;
```

Useful for: org charts, category trees, graph traversal, number-series generation. Always include a depth limit or a base case that terminates.
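The same org-chart walk, runnable end to end with Python's stdlib `sqlite3` (the four-employee tree is a made-up fixture); note the explicit depth guard:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employees (id INTEGER PRIMARY KEY, manager_id INTEGER, name TEXT);
INSERT INTO employees VALUES
  (42, NULL, 'root'), (43, 42, 'a'), (44, 42, 'b'), (45, 44, 'c');
""")

rows = conn.execute("""
WITH RECURSIVE descendants AS (
  SELECT id, manager_id, name, 1 AS depth FROM employees WHERE id = 42
  UNION ALL
  SELECT e.id, e.manager_id, e.name, d.depth + 1
  FROM employees e JOIN descendants d ON e.manager_id = d.id
  WHERE d.depth < 10    -- guard against cycles / runaway recursion
)
SELECT id, depth FROM descendants ORDER BY depth, id
""").fetchall()

assert rows == [(42, 1), (43, 2), (44, 2), (45, 3)]
```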
Window functions
A window function computes over a "window" of rows without collapsing them (unlike GROUP BY).

```sql
SELECT
  user_id,
  order_id,
  total,
  ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY created_at DESC) AS rn,
  SUM(total) OVER (PARTITION BY user_id) AS user_lifetime_total,
  LAG(total) OVER (PARTITION BY user_id ORDER BY created_at) AS prev_total,
  total - LAG(total) OVER (PARTITION BY user_id ORDER BY created_at) AS delta
FROM orders;
```

Common windows:
- `ROW_NUMBER` — unique sequential (ties broken arbitrarily by ORDER BY).
- `RANK` — ties share rank, gaps follow: 1, 2, 2, 4.
- `DENSE_RANK` — ties share rank, no gaps: 1, 2, 2, 3.
- `LAG(col, n)` / `LEAD(col, n)` — value from n rows before / after.
- `FIRST_VALUE` / `LAST_VALUE` / `NTH_VALUE` — with framing.
- Aggregate OVER — `SUM`, `AVG`, `COUNT` as running or partition totals.

Framing — `ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW` for a running total vs `RANGE ...` for value-based windows.
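Running totals and LAG are easy to sanity-check. A sketch using Python's stdlib `sqlite3` (window functions need SQLite 3.25+, which Python 3.8+ bundles; the three-row `orders` fixture is invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (user_id INTEGER, total REAL, created_at TEXT);
INSERT INTO orders VALUES (1, 10, 't1'), (1, 30, 't2'), (1, 20, 't3');
""")

rows = conn.execute("""
    SELECT total,
           SUM(total) OVER (PARTITION BY user_id ORDER BY created_at
                            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running,
           LAG(total) OVER (PARTITION BY user_id ORDER BY created_at) AS prev
    FROM orders
    ORDER BY created_at
""").fetchall()

# running total accumulates; LAG looks one row back (NULL -> None on the first row)
assert rows == [(10.0, 10.0, None), (30.0, 40.0, 10.0), (20.0, 60.0, 30.0)]
```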
Classic use: top-N-per-group:

```sql
SELECT * FROM (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY total DESC) rn FROM orders
) t WHERE rn <= 3;
```

GROUPING SETS, ROLLUP, CUBE
Compute multiple GROUP BY results in one pass.

```sql
-- rollup: per (region, product), per region, grand total
SELECT region, product, SUM(sales)
FROM orders
GROUP BY ROLLUP (region, product);
```

- `GROUPING SETS` — arbitrary combos you enumerate.
- `ROLLUP(a,b,c)` — (a,b,c), (a,b), (a), ().
- `CUBE(a,b)` — every subset: (a,b), (a), (b), ().

`GROUPING(col)` returns 1 when a column was rolled up (so you can label "subtotal" rows).
Subqueries — EXISTS vs IN vs JOIN
| Pattern | Best for | Gotcha |
|---|---|---|
| `IN (subquery)` | Small, non-NULL value set | `NOT IN` + any NULL in subquery → 0 rows |
| `EXISTS` (correlated) | Existence check, early-exit on first match | Modern planners treat IN and EXISTS equivalently for most cases |
| `JOIN` + `DISTINCT` | Need columns from both sides | DISTINCT hides duplicates but adds a sort pass |
| `NOT EXISTS` | Anti-join, NULL-safe | Always prefer over `NOT IN` |
MERGE / UPSERT
Standard SQL MERGE (Postgres 15+, all major engines):

```sql
MERGE INTO customers c
USING staging s ON s.id = c.id
WHEN MATCHED THEN UPDATE SET name = s.name, updated_at = NOW()
WHEN NOT MATCHED THEN INSERT (id, name) VALUES (s.id, s.name);
```

Postgres shortcut: `INSERT ... ON CONFLICT (id) DO UPDATE SET ...`. MySQL: `INSERT ... ON DUPLICATE KEY UPDATE ...`.
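The upsert shape is portable enough to demo with Python's stdlib `sqlite3`, whose `ON CONFLICT` clause mirrors the Postgres shortcut (the `customers` fixture is invented; `excluded` refers to the row that failed to insert):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")

upsert = """
    INSERT INTO customers (id, name) VALUES (?, ?)
    ON CONFLICT (id) DO UPDATE SET name = excluded.name
"""
conn.execute(upsert, (1, "alice"))   # no conflict: plain insert
conn.execute(upsert, (1, "alicia"))  # PK conflict: becomes an update

assert conn.execute("SELECT name FROM customers WHERE id = 1").fetchone() == ("alicia",)
assert conn.execute("SELECT COUNT(*) FROM customers").fetchone() == (1,)  # still one row
```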
Interview Qs covered
Window functions + CTEs are standard senior interview ground. Not explicitly in §8 but expected knowledge.
5. Schema Design & Normalization
Normalization is a set of rules to remove redundancy and prevent update/insert/delete anomalies. You don't need to cite Codd — you do need to recognize bad schemas.
Anomalies (the "why" for normalization)
- Update anomaly — same fact stored in many places; miss one on update and the data drifts.
- Insert anomaly — can't represent a fact without unrelated other data (e.g., can't list a new product without an order).
- Delete anomaly — deleting the last order loses product metadata.
Normal forms in one page
1NF — atomic values, no repeating groups.

```
-- Bad (repeating group)
user(id, phone1, phone2, phone3)
-- Good
user(id), phone(user_id, phone)
```

No arrays/lists masquerading as comma-separated strings. (Note: Postgres native arrays and JSON columns arguably break pure 1NF — that's fine pragmatically, as long as you're not querying inside them routinely.)
2NF — 1NF + no partial dependency on a composite key.
Applies when the PK is composite. Every non-key attribute must depend on the whole PK, not part of it.

```
-- Bad: PK = (order_id, product_id); product_name depends only on product_id
order_line(order_id, product_id, product_name, qty)
-- Good
order_line(order_id, product_id, qty)
product(product_id, product_name)
```

3NF — 2NF + no transitive dependency.
Non-key attributes must not depend on other non-key attributes.

```
-- Bad: department_name depends on department_id, not employee_id
employee(id, department_id, department_name)
-- Good
employee(id, department_id)
department(department_id, department_name)
```

BCNF — every determinant is a superkey.
A stricter 3NF. If X → Y, then X must be a superkey. The rare cases where 3NF passes but BCNF fails involve overlapping candidate keys. Interviewers usually mean "3NF" when they say "normalize."

4NF / 5NF — handle multi-valued and join dependencies. Worth naming; real-world bugs are rare. "If you've needed 4NF outside a textbook, you probably already know it."
Denormalization — when and how
You denormalize to trade write cost + storage for read simplicity/speed. Warranted when:
- Read is 100x write — e.g., a dashboard counter.
- Joins dominate latency — copy the 3 fields you need into the child row.
- Analytical warehouse — star schema is denormalized on purpose (§21).
- NoSQL document store — embedding is denormalization by design (§15, §16).
- Materialized cache — `user_stats` rebuilt nightly from `events`.

How to denormalize safely:
- Make the canonical source clear. The denormalized copy is a cache, not a source of truth.
- Use triggers, change-data-capture (CDC), or application-layer fan-out to keep copies fresh.
- Tolerate staleness if the business allows it — or accept the write amplification if it doesn't.
ER modeling cheatsheet
| Relationship | Implementation |
|---|---|
| 1:1 | FK in either table with UNIQUE; or merge into one table |
| 1:N | FK on the "many" side, pointing at the "one" side |
| N:M | Junction table with both FKs, optional extra columns (e.g., role on user_project) |
| Weak entity (depends on parent for identity) | Composite PK that includes parent FK |
| Hierarchy (is-a) | Single-table inheritance (one table, nullable cols) / class-table inheritance (one table per subtype) / concrete-table inheritance (one table per leaf) |
Interview Qs covered
§5 addresses Q6 (denormalize).
6. ACID & Transactions
A transaction is a sequence of operations the database treats as a single logical unit. ACID describes the guarantees.
ACID, letter by letter
- Atomicity — all operations succeed or none do. Mid-transaction crashes leave the DB as if nothing happened. Implementation: write-ahead log (WAL) + redo/undo.
- Consistency — a transaction moves the DB from one valid state (constraints, FKs, triggers) to another. Application-level invariants are NOT part of SQL consistency — that's on you.
- Isolation — concurrent transactions don't interfere; each behaves as if it ran alone (at the chosen isolation level — §7).
- Durability — once committed, the change survives crashes. Implementation: fsync'd WAL; replication for durability across node failures.
The canonical example

```sql
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 'alice';
UPDATE accounts SET balance = balance + 100 WHERE id = 'bob';
COMMIT;
```

- A — both updates or neither.
- C — `CHECK (balance >= 0)` rejects the tx if Alice goes negative.
- I — another reader doesn't see Alice-debited-but-Bob-not-yet.
- D — after COMMIT returns, the transfer survives a power cut.
Savepoints and nested transactions
Real "nested transactions" don't exist in most engines. What you have is savepoints.

```sql
BEGIN;
INSERT INTO orders ...;
SAVEPOINT after_order;
INSERT INTO line_items ...;          -- this might fail
ROLLBACK TO SAVEPOINT after_order;   -- undo line_items but keep order
COMMIT;
```

Spring's `Propagation.NESTED` uses JDBC savepoints under the hood.
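Savepoint semantics in action, using Python's stdlib `sqlite3` in manual-transaction mode (`isolation_level=None` so `BEGIN`/`COMMIT` are explicit); the failing `line_items` insert is simulated by simply rolling back to the savepoint:

```python
import sqlite3

conn = sqlite3.connect(":memory:", isolation_level=None)  # manage the tx ourselves
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY)")
conn.execute("CREATE TABLE line_items (id INTEGER PRIMARY KEY, order_id INTEGER)")

conn.execute("BEGIN")
conn.execute("INSERT INTO orders VALUES (1)")
conn.execute("SAVEPOINT after_order")
conn.execute("INSERT INTO line_items VALUES (1, 1)")
conn.execute("ROLLBACK TO SAVEPOINT after_order")  # undo line_items, keep the order
conn.execute("COMMIT")

assert conn.execute("SELECT COUNT(*) FROM orders").fetchone() == (1,)      # survived
assert conn.execute("SELECT COUNT(*) FROM line_items").fetchone() == (0,)  # rolled back
```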
Distributed transactions — 2PC / XA
Two-phase commit (2PC) — a transaction coordinator asks each resource manager (RM):
- Prepare — "can you commit?" Each RM writes a prepared-but-uncommitted record and votes yes/no.
- Commit / abort — the coordinator tallies; if all voted yes, it sends commit; otherwise abort.

Problems with 2PC:
- Blocking — if the coordinator dies after prepare, the RM is stuck holding locks.
- Latency — one round-trip per phase, plus fsyncs on each side.
- Coupling — every RM must implement the XA protocol.

XA is the industry spec for 2PC (javax.transaction.xa.XAResource). IBM MQ, Oracle, and many JDBC drivers support it. A legacy MQ-to-DB bridge that must not lose messages typically uses XA to span the broker and the database in one transaction.
Saga pattern — when 2PC is too heavy
A saga is a sequence of local transactions where each has a compensating action for rollback. Two flavors:
- Choreography — each service publishes events; others react.
- Orchestration — a central orchestrator drives each step explicitly.

Saga gives you eventual atomicity via compensations — not atomicity in the ACID sense. Choose saga when:
- The steps cross service / DB boundaries.
- You can define a sensible compensation (refund = reverse of charge).
- You can tolerate a brief inconsistent window.
Client disconnects mid-transaction
The server doesn't know the client is gone until:
- The next statement fails with a broken pipe, or
- TCP keepalive / `idle_in_transaction_session_timeout` fires.

Postgres: set `idle_in_transaction_session_timeout = '30s'`. Otherwise an abandoned BEGIN holds locks indefinitely — a classic outage scenario. The broken client holds open transactions that block VACUUM, causing bloat to pile up until someone restarts the connection pool.
Interview Qs covered
§6 addresses Qs 1 (ACID), 11 (client disconnects mid-tx).
7. Isolation Levels & Anomalies
Isolation defines what concurrent transactions can see of each other. Four SQL standard levels, increasing strictness:
| Level | Dirty read | Non-repeatable read | Phantom | Lost update | Write skew |
|---|---|---|---|---|---|
| READ UNCOMMITTED | Possible | Possible | Possible | Possible | Possible |
| READ COMMITTED | Prevented | Possible | Possible | Possible | Possible |
| REPEATABLE READ | Prevented | Prevented | Possible* | Prevented (MVCC) | Possible |
| SERIALIZABLE | Prevented | Prevented | Prevented | Prevented | Prevented |
* Postgres's REPEATABLE READ uses snapshot isolation and prevents phantoms for the committing tx (but allows write skew). Standard SQL lets phantoms happen at RR.
The anomalies, with examples

Dirty read — T1 reads T2's uncommitted write.

```
T2: UPDATE account SET balance = 1000 WHERE id = 'alice'; -- not committed
T1: SELECT balance FROM account WHERE id = 'alice';       -- reads 1000
T2: ROLLBACK; -- T1 saw a value that never "officially" existed
```

Non-repeatable read — T1 reads the same row twice and gets different values because T2 committed between.

```
T1: SELECT balance FROM account WHERE id='alice'; -- 500
T2: UPDATE account SET balance=600 WHERE id='alice'; COMMIT;
T1: SELECT balance FROM account WHERE id='alice'; -- 600 (same tx!)
```

Phantom read — a range query returns different rows because a concurrent insert changed set membership.

```
T1: SELECT COUNT(*) FROM orders WHERE total > 100; -- 5
T2: INSERT INTO orders ... total=200; COMMIT;
T1: SELECT COUNT(*) FROM orders WHERE total > 100; -- 6 (phantom)
```

Lost update — two transactions read the same value, both update, last writer wins; the first update vanishes.

```
T1: x = SELECT value FROM counter WHERE id=1; -- 10
T2: x = SELECT value FROM counter WHERE id=1; -- 10
T1: UPDATE counter SET value = 11 WHERE id=1; -- commit
T2: UPDATE counter SET value = 11 WHERE id=1; -- commit; should be 12!
```

Write skew — two transactions each read data the other modifies, write disjoint rows, and together violate an invariant.

```
-- invariant: at least one doctor on call
T1: SELECT COUNT(*) FROM doctors WHERE on_call = true; -- 2
T2: SELECT COUNT(*) FROM doctors WHERE on_call = true; -- 2
T1: UPDATE doctors SET on_call = false WHERE id = 'alice';
T2: UPDATE doctors SET on_call = false WHERE id = 'bob';
-- both commit; now zero doctors on call
```

Write skew is the anomaly SERIALIZABLE prevents that snapshot isolation doesn't.
Snapshot isolation vs Serializable
Snapshot isolation (SI) — each tx sees a consistent snapshot as of its start. Prevents dirty / non-repeatable / phantom reads via MVCC. Doesn't prevent write skew. Postgres calls this REPEATABLE READ; Oracle calls it SERIALIZABLE (not true serializable!).
Serializable Snapshot Isolation (SSI) — Postgres's true SERIALIZABLE. SI plus runtime detection of dangerous read-write dependency cycles. On detection, one tx aborts with `could not serialize access`. You retry.

Gotcha. When you choose SERIALIZABLE, your application MUST be prepared to catch serialization failures (SQLSTATE `40001`) and retry — in Spring, catch or `@Retryable`-wrap `CannotSerializeTransactionException`, or write an explicit retry loop.
Picking a level
- READ COMMITTED — default in Postgres, SQL Server. Fine for most OLTP.
- REPEATABLE READ — default in MySQL InnoDB. Use when a transaction does multiple reads and must see a consistent view (reports, consistency-sensitive logic).
- SERIALIZABLE — when invariants cross rows and write skew is possible (scheduling, inventory, double-spend). Pay the cost + handle retries.
Interview Qs covered
§7 addresses Q2 (isolation levels + anomalies).
8. Concurrency Control
Two families: pessimistic (lock first, work second) and optimistic (work, then verify at commit). Postgres primarily uses MVCC + optimistic snapshot; locks still exist for writes.
MVCC — how Postgres actually implements isolation
Every row has hidden columns `xmin` (tx that created this version) and `xmax` (tx that deleted/updated it). An UPDATE doesn't overwrite — it inserts a new row version and sets `xmax` on the old one. Each transaction gets a snapshot: a list of which xids were in flight when it started. A row version is visible to me if:
- `xmin` is committed AND (`xmin` < my xid OR `xmin` is in my snapshot's finished set), and
- `xmax` is NULL, uncommitted, or > my xid.

Consequences:
- Readers never block writers, writers never block readers. Big win over lock-based DBs.
- Bloat — dead row versions accumulate. `VACUUM` reclaims them (§10).
- Long-running transactions prevent VACUUM from reclaiming versions newer than their xmin — runaway bloat. Kill long-idle transactions.

MySQL InnoDB, Oracle, and SQL Server (with RCSI / SI) have their own MVCC implementations with the same ideas — old versions stored in undo segments rather than in-heap.
Pessimistic locking
Acquire the lock before reading/writing to exclude others.

```sql
BEGIN;
SELECT * FROM inventory WHERE product_id = 42 FOR UPDATE; -- row lock, write
-- nothing else can UPDATE/DELETE/SELECT FOR UPDATE this row until COMMIT
UPDATE inventory SET qty = qty - 1 WHERE product_id = 42;
COMMIT;
```

Lock modes:
- `FOR UPDATE` — exclusive row lock; readers in SI still see the old snapshot.
- `FOR NO KEY UPDATE` — weaker; allows FK checks from other txs.
- `FOR SHARE` — shared lock; blocks FOR UPDATE but allows other FOR SHARE.
- `FOR UPDATE SKIP LOCKED` — queue-like consumer pattern: grab the next unlocked row, don't wait.
- `FOR UPDATE NOWAIT` — fail immediately if locked.

Table-level: `LOCK TABLE t IN ACCESS EXCLUSIVE MODE;` — a last-resort heavy hammer.
Optimistic locking
No lock acquired. At commit (or update), check that nothing changed since you read. Typically implemented with a version column:

```sql
UPDATE orders
SET status = 'shipped', version = version + 1
WHERE id = 42 AND version = 7; -- expects current version 7
-- affected rows = 0 means someone else updated; reject + retry
```

JPA `@Version` is this pattern: Hibernate appends `AND version = ?` to every UPDATE, checks affected rows, and throws `OptimisticLockException` if 0.
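The affected-rows check is the whole trick. A runnable sketch with Python's stdlib `sqlite3`; the `ship` helper is hypothetical and stands in for what an ORM generates:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT, version INTEGER)")
conn.execute("INSERT INTO orders VALUES (42, 'new', 7)")

def ship(expected_version):
    cur = conn.execute(
        "UPDATE orders SET status = 'shipped', version = version + 1 "
        "WHERE id = 42 AND version = ?", (expected_version,))
    return cur.rowcount  # 0 means a concurrent writer won; caller must re-read + retry

assert ship(7) == 1   # first writer succeeds; version is now 8
assert ship(7) == 0   # stale writer sees 0 affected rows -> conflict detected
assert conn.execute("SELECT status, version FROM orders WHERE id = 42").fetchone() == ("shipped", 8)
```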
Pessimistic vs optimistic — picking
| Use pessimistic when | Use optimistic when |
|---|---|
| High contention on specific rows | Low contention; conflicts are rare |
| Short transactions | Longer "read/think/write" user flows |
| Critical invariants (inventory, accounts) | Best-effort saves (comments, profile updates) |
| You can tolerate waiting | You can tolerate retries |
Deadlock
Four conditions for deadlock (the Coffman conditions): mutual exclusion, hold & wait, no preemption, circular wait. Break any one and you're safe.

```
T1 locks row A, then asks for row B.
T2 locks row B, then asks for row A.
-- deadlock. Postgres detects it and aborts one tx with deadlock_detected.
```

Prevention tactics:
- Always lock rows in a consistent order (e.g., sorted by ID). This alone kills most deadlocks.
- Keep transactions short — release locks fast.
- Use `SELECT FOR UPDATE SKIP LOCKED` for work-queue patterns (no two workers wait on each other).
- Set `lock_timeout` so a stuck locker aborts rather than waiting forever.

Gotcha — FK deadlocks. Inserting into the child + updating the parent in different orders across transactions commonly deadlocks. Always do parent operations first, or order by PK.
Advisory locks (Postgres)
Named application-level locks not tied to any row. Good for single-writer patterns:

```sql
SELECT pg_advisory_lock(hashtext('nightly-job'));
-- do the job
SELECT pg_advisory_unlock(hashtext('nightly-job'));
```

Useful when you want cross-process mutual exclusion without a separate coordinator (ZooKeeper, Redis Redlock).
Interview Qs covered
§8 addresses Qs 3 (MVCC), 10 (pessimistic vs optimistic).
9. Indexing Deep Dive
An index is an auxiliary data structure that answers "where is the row with this value?" faster than scanning the table. Indexes are the single highest-leverage performance tool in databases; they are also the most commonly misused.
B-tree — the default
A B-tree (actually a B+ tree in most engines) is a balanced tree where:
- Internal nodes hold keys and child pointers.
- Leaves hold keys + row pointers (heap) or the row itself (clustered).
- Leaves are linked — range scans walk the leaf list.
- Height stays O(log N); even 100M rows is ~4 levels deep.

B-trees support equality (=) and ordered operations (<, >, BETWEEN, ORDER BY, LIKE 'prefix%').
Clustered vs non-clustered
| Term | Clustered index | Non-clustered (secondary) |
|---|---|---|
| Leaves hold | Full row | Row locator (heap pointer or PK) |
| Count per table | At most one | Many |
| Postgres | Not really — CLUSTER is a one-time rewrite; all indexes are secondary, rows live in a heap | N/A |
| MySQL InnoDB | PK IS clustered — rows live in PK order in the B-tree | Secondary indexes store (key, PK), require a second lookup |
| SQL Server | Optional clustered index; recommended | Can have any number |

Implication for InnoDB: a secondary-index lookup is two traversals (secondary → PK → row). Covering indexes or INCLUDE columns eliminate the second.
Composite (multi-column) indexes and the prefix rule

```sql
CREATE INDEX idx_user ON orders(user_id, status, created_at);
```

Can serve queries whose predicates form a prefix:
- `WHERE user_id = ?` ✓
- `WHERE user_id = ? AND status = ?` ✓
- `WHERE user_id = ? AND status = ? AND created_at > ?` ✓
- `WHERE status = ?` ✗ — no leading `user_id`
- `WHERE user_id = ? AND created_at > ?` — can use the index for the user_id part, but not for created_at (status skipped)

Column order rule: put equality predicates first, then the highest-selectivity range column last. `(user_id =, status =, created_at range)` is correct.
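You can watch the prefix rule in a query planner. A sketch using Python's stdlib `sqlite3` and `EXPLAIN QUERY PLAN` (SQLite's planner output rather than Postgres's, but the prefix behavior matches; the `plan` helper is an assumption for readability):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (user_id INTEGER, status TEXT, created_at TEXT)")
conn.execute("CREATE INDEX idx_user ON orders(user_id, status, created_at)")

def plan(sql):
    # the last column of each EXPLAIN QUERY PLAN row is the human-readable detail
    return " ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

p1 = plan("SELECT user_id FROM orders WHERE user_id = 1 AND status = 'x'")
p2 = plan("SELECT user_id FROM orders WHERE status = 'x'")

assert "SEARCH" in p1 and "idx_user" in p1   # prefix predicate: indexed seek
assert "SCAN" in p2 and "SEARCH" not in p2   # leading column skipped: scan every row
```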
Covering (INCLUDE) index
An index that contains every column the query needs — index-only scan, no heap visit.

```sql
CREATE INDEX idx_orders_user_covering
ON orders(user_id) INCLUDE (total, status);
```

The query `SELECT total, status FROM orders WHERE user_id = 42` runs entirely off the index. Postgres 11+ supports INCLUDE; SQL Server has it too. MySQL doesn't — widen the key instead.
Other index types β
| Type | Structure | Best for | Engines |
|---|---|---|---|
| Hash | Hash table | Equality-only lookups | Postgres (WAL-logged only since v10), MySQL MEMORY |
| GIN (Generalized Inverted) | Inverted list per value | Array membership, JSONB containment, full-text | Postgres |
| GiST (Generalized Search Tree) | Balanced tree framework | Geometry, ranges, trigram fuzzy | Postgres |
| BRIN (Block Range INdex) | Min/max per block range | Naturally clustered data (time-series), tiny index | Postgres |
| Bitmap | Bit per row per value | Low-cardinality columns on OLAP | Oracle, Postgres (materialized during scan) |
| LSM tree | Sorted runs, periodic merge | Write-heavy workloads | RocksDB, Cassandra, DynamoDB backend |
| Full-text / inverted | Token β doc list | Keyword search | Postgres tsvector, Elasticsearch, Mongo text |
LSM tree primer β
A Log-Structured Merge tree buffers writes in an in-memory sorted structure (memtable). When it fills, it flushes as an immutable sorted file (SSTable). Background compaction merges SSTables.
- Writes are blazing fast β append-only; no in-place update.
- Reads can touch many SSTables β bloom filters prune candidates.
- Write amplification β each byte may be rewritten many times through compactions. Tuning knob: leveled vs tiered compaction.
Used by Cassandra, RocksDB, HBase, LevelDB, and inside DynamoDB and TiDB. Contrast with B+tree (in-place update, better read latency, worse write throughput).
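The mechanism above can be sketched in a few lines. A toy model only β it shows memtable flush, newest-run-first reads, and compaction, and omits the WAL, bloom filters, and leveled compaction that real engines add:

```python
import bisect

class LSMTree:
    """Toy LSM tree: writes buffer in a memtable, flush to immutable
    sorted runs (SSTables); reads check memtable, then runs newest-first."""

    def __init__(self, memtable_limit=4):
        self.memtable = {}
        self.sstables = []              # list of sorted runs, newest last
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value      # in-memory write: always fast
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # freeze the memtable as an immutable sorted run
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.sstables):       # newest run wins
            keys = [k for k, _ in run]
            i = bisect.bisect_left(keys, key)
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None

    def compact(self):
        # merge all runs into one; newer values overwrite older
        merged = {}
        for run in self.sstables:                 # oldest -> newest
            merged.update(dict(run))
        merged.update(self.memtable)
        self.sstables = [sorted(merged.items())]
        self.memtable = {}
```

Note how `get` may touch every run before compaction β that is exactly the read amplification bloom filters exist to prune.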
Partial and functional indexes (Postgres) β
```sql
-- partial: index only the rows we care about
CREATE INDEX idx_active_orders ON orders(user_id) WHERE status = 'active';

-- functional: index an expression
CREATE INDEX idx_email_lower ON users(LOWER(email));
-- query must use LOWER(email) = ? to hit it
```

Massive storage + speedup wins when the "interesting" subset is small.
When indexes HURT β
- Write amplification β every INSERT/UPDATE/DELETE maintains every index on affected columns.
- Bloat on MVCC engines β each update creates a new row version, and every index points to all versions until VACUUM.
- Planner misjudgment β an index on a very low-selectivity column (e.g., boolean `is_deleted`) may mislead the planner into an index scan when a seq scan would be faster.
- Space β a table with 15 indexes can easily be 5Γ the heap size.
- Backup/restore time β indexes rebuild after restore.
Rule: start with PK + FK + query-driven indexes. Measure before adding more. Drop unused indexes β Postgres has pg_stat_user_indexes showing scan counts.
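A diagnostic query for that last point β a sketch against the Postgres statistics views, excluding unique indexes (which enforce constraints even when never scanned):

```sql
-- Indexes that have never been scanned: candidates to drop.
SELECT s.schemaname,
       s.relname      AS table_name,
       s.indexrelname AS index_name,
       pg_size_pretty(pg_relation_size(s.indexrelid)) AS size,
       s.idx_scan
FROM pg_stat_user_indexes s
JOIN pg_index i ON i.indexrelid = s.indexrelid
WHERE s.idx_scan = 0
  AND NOT i.indisunique
ORDER BY pg_relation_size(s.indexrelid) DESC;
```

Counters reset on stats reset, so check the observation window before dropping anything.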
Index-only scan vs index scan vs seq scan β
- Seq scan β full table read. Fast for large fractions of rows.
- Index scan β find rows via index, fetch from heap.
- Index-only scan β all needed columns in the index; skip heap. Requires the visibility map to be up to date (run `VACUUM` recently).
Interview Qs covered β
Β§9 addresses Qs 4 (index types), 5 (covering index).
10. Storage Internals β
Interviews that go deep on databases pick at physical layout. You don't need Ph.D. detail, but you should know pages, WAL, and VACUUM.
Pages and tuples β
Most engines organize storage in fixed-size pages (blocks): Postgres 8KB, MySQL 16KB (configurable), SQL Server 8KB. A page holds multiple rows (tuples). The engine does I/O at page granularity β reading one tuple pulls the whole page into the buffer cache.
Postgres page layout:
```
+------------------+-----+-----+----+----+
| page header      | line pointers β     |
+------------------+-----+-----+----+----+
|                  ...                   |
+----+----+----+----+----+----+----+----+
| β tuples (grow backward toward middle) |
+----------------------------------------+
```

Fillfactor β the % of each page used on INSERT. Default 100% for tables. Lowering to 70β90% leaves room for HOT updates (see below) without needing new pages. Common tuning for write-heavy tables.
TOAST (The Oversized-Attribute Storage Technique) β Postgres stores large (> ~2KB) field values in a separate out-of-line table, compressed. Transparent to queries, but large bytea/text columns don't bloat the main heap.
Write-Ahead Log (WAL) β
Every change is written to the WAL before the change is applied to the heap page. Guarantees durability + crash recovery:
- Client sends `COMMIT`.
- Server flushes the WAL record to disk (fsync).
- Server acknowledges commit.
- Dirty heap pages get written later, asynchronously, at checkpoints.
On crash:
- Redo every WAL record newer than the last checkpoint.
- The heap is now consistent with the committed WAL.
Checkpoint β periodically flushes all dirty buffer pages to disk and truncates old WAL. Too frequent = I/O storm. Too infrequent = slow recovery. Postgres default: 5 minutes or when max_wal_size is reached.
Synchronous commit β wait for the WAL fsync at commit. Postgres default: `synchronous_commit = on`. With `synchronous_commit = off` the commit is acknowledged before the WAL flush β faster commits, plus a chance of losing the last few ms of committed transactions on crash (lost commits, not corruption or torn writes).
VACUUM and bloat (Postgres MVCC specifics) β
Recall: UPDATE in Postgres = new row version + mark old one dead. Dead rows must be reclaimed or the heap grows forever. VACUUM does three things:
- Marks dead tuple slots free for future inserts.
- Updates the visibility map β pages where all tuples are visible to all txs; enables index-only scans + skipped VACUUM passes.
- Prevents transaction-ID wraparound β Postgres xids are 32-bit; if unfrozen tuples reach ~2B in age, the cluster shuts down to prevent data loss. `VACUUM FREEZE` rewrites old tuples with a "frozen" marker.
autovacuum runs in background. Signs it's not keeping up: bloat metric climbing, query plans shifting, "too many dead tuples" in logs. Knobs: autovacuum_vacuum_scale_factor, per-table autovacuum_vacuum_threshold.
HOT (Heap-Only Tuple) updates β if a row update doesn't touch any indexed columns, Postgres stores the new version in the same page (if there's room) and avoids touching index entries. Huge perf win β this is why fillfactor < 100 matters on hot tables.
Heap-organized vs index-organized tables β
- Heap-organized (Postgres, SQL Server default) β rows stored in insertion order; every index is secondary.
- Index-organized (InnoDB, Oracle IOT, SQLite rowid) β rows stored inside the PK B-tree. Lookups by PK are free (single traversal); secondary indexes add a step.
InnoDB PK ordering effect: random UUIDv4 PK = random inserts = page splits = fragmentation. Auto-increment or UUIDv7 PKs keep the tree sequential and fast.
Interview Qs covered β
Β§10 reinforces Q3 (MVCC) and Q4 (index types β why they hurt writes).
11. Query Optimization β
The #1 differentiator between juniors and seniors in database interviews is the ability to read an execution plan and reason about it.
EXPLAIN vs EXPLAIN ANALYZE β
- `EXPLAIN` β shows the plan the optimizer will use, with estimates.
- `EXPLAIN ANALYZE` β actually runs the query and shows actual rows + timings per node.
- `EXPLAIN (ANALYZE, BUFFERS)` β adds buffer cache hits/misses; reveals I/O pressure.
Read plans bottom-up, right-to-left. Each node feeds the one above.
```
Sort (cost=123.45..125.67 rows=890 width=42) (actual time=1.2..1.5 rows=812 loops=1)
  Sort Key: created_at DESC
  Sort Method: quicksort  Memory: 52kB
  -> Hash Join (cost=10..100 rows=890) (actual time=0.5..1.1 rows=812 loops=1)
       Hash Cond: (o.user_id = u.id)
       -> Seq Scan on orders o (cost=0..50 rows=1000)
       -> Hash (cost=5..5 rows=100)
            -> Index Scan using idx_users_active on users u
```

Watchouts:
- Rows estimate vs actual wildly off β stale statistics. Run `ANALYZE`.
- Seq Scan on a huge table + small WHERE β missing index, or the index exists but isn't usable (LIKE with leading `%`, type mismatch, non-SARGable predicate).
- Nested Loop with outer rows β« 1 and an inner Seq Scan β quadratic; needs an index on the join column.
- "Rows Removed by Filter: X" β X rows filtered after the index or scan. A wider index / partial index can push the filter down.
Statistics and the cost model β
The planner picks a plan based on estimated cost:
```
cost = seq_page_cost     Γ pages_read
     + random_page_cost  Γ random_reads
     + cpu_tuple_cost    Γ tuples_processed
     + cpu_operator_cost Γ operator_invocations
```

Estimates depend on statistics: per-column histogram, most-common values (MCV), null fraction, n_distinct, correlation. `ANALYZE` refreshes them; autovacuum runs it. If stats are stale, plans are wrong.
Common anti-patterns β
| Anti-pattern | Fix |
|---|---|
| `WHERE LOWER(email) = 'x@y'` with index on `email` | Functional index, or store canonicalized |
| `WHERE created_at::date = '2026-04-17'` | `WHERE created_at >= '2026-04-17' AND created_at < '2026-04-18'` (SARGable) |
| `WHERE id IN (SELECT id FROM huge_tbl WHERE ...)` with no index | Rewrite as `EXISTS` or `JOIN`; ensure indexed |
| `OR` across two columns | Often better as `UNION` of two indexed queries |
| `SELECT *` through a JPA entity | DTO projection; select only needed cols |
| Implicit type cast (`WHERE id = '42'` with BIGINT col) | Fix the type; the planner can't use an index on a cast input |
| `LIMIT N` on unordered | Implicit sort cost; add `ORDER BY` + index to walk forward |
| N+1 (ORM) | Β§27 β JOIN FETCH / batch size / entity graph |
| `SELECT COUNT(*)` on huge table | Estimate via `pg_stat_user_tables.n_live_tup`; or accept the cost |
SARGability β
A predicate is Search ARGument-able if the engine can use an index to narrow rows without evaluating every row. Non-SARGable patterns:
```sql
-- Non-SARGable (function on column)
WHERE YEAR(created_at) = 2026

-- SARGable
WHERE created_at >= '2026-01-01' AND created_at < '2027-01-01'
```

The N+1 problem β
An ORM fetches 1 parent row, then issues N queries β one per child β to fetch relations. Classic hidden performance bug.
```java
List<Order> orders = repo.findAll();   // 1 query
for (Order o : orders) {
    o.getLineItems().size();           // N queries, one per order
}
```

Fixes:
- `JOIN FETCH` in JPQL / `@EntityGraph` (Β§27).
- `@BatchSize(size = 50)` β Hibernate issues one batched IN query per 50 parents.
- DTO projection that does the JOIN in SQL and returns flat rows.
Detecting N+1:
- Enable SQL logging in integration tests.
- Use `datasource-proxy` or `p6spy` to count queries per request.
- Spring Boot: `spring.jpa.properties.hibernate.generate_statistics=true` + check logs.
Rewriting for plan stability β
Sometimes a query works in dev and blows up in prod. Causes:
- Different row counts β different plans.
- Stats drift.
- Parameter sniffing (SQL Server) β first compile uses caller's value, poor for other values.
Fixes: manual ANALYZE after bulk load, pin plan via hints (engine-specific), rewrite with CTE AS MATERIALIZED to force a boundary, or pre-compute via materialized view.
Interview Qs covered β
Β§11 addresses Qs 7 (EXPLAIN), 8 (N+1).
12. Replication β
Replication copies data from one node to others for availability, read scale-out, and geographic distribution.
Physical vs logical replication β
| Physical (binary) | Logical (row/statement) |
|---|---|
| Ships WAL records byte-for-byte | Ships decoded operations (INSERT/UPDATE/DELETE per row) |
| All-or-nothing (whole cluster) | Table-level selection possible |
| Exact byte copy; perfect for HA failover | Cross-version, cross-schema, cross-engine possible |
| Postgres streaming replication, MySQL binlog row format | Postgres logical replication (10+), MySQL row-based binlog, Debezium CDC |
Primary-replica (a.k.a. master-slave β language is shifting) β
One node accepts writes (primary). Replicas apply the primary's WAL/binlog. Reads can go to replicas.
```
        [writes]
           |
      +---------+
      | Primary |--WAL--+--> Replica 1 (read)
      +---------+       +--> Replica 2 (read)
                        +--> Replica 3 (read)
```

- Sync replication β primary waits for at least one replica's ack before COMMIT returns. No data loss on primary failure; higher commit latency.
- Async replication β primary COMMITs locally; replicas catch up later. Fast commits; possible data loss window on failover.
- Semi-sync (MySQL) β primary waits for replica to receive (not apply). Middle ground.
Replication lag β
Asynchronous replicas are eventually consistent with the primary. Lag appears as:
```sql
INSERT INTO orders (id) VALUES (42);  -- primary
SELECT * FROM orders WHERE id = 42;   -- replica; "not found!"
```

Read-your-writes consistency approaches:
- Read from primary after a write β simple, loses the scale benefit briefly.
- Session stickiness β route user's reads to same node for a window after their write.
- Wait-for-LSN β the app tracks the WAL position at commit and reads with "WAIT FOR LSN β₯ X" on replica.
- Synchronous replicas β eliminate lag at cost of write latency.
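The wait-for-LSN approach can be sketched with standard Postgres functions (the LSN literal is a hypothetical example value; Postgres 17 adds a built-in `pg_wal_replay_wait` procedure for this):

```sql
-- On the primary, right after the write, capture the WAL position:
SELECT pg_current_wal_lsn();   -- e.g. '0/16B3740'

-- On the replica, before reading, check it has replayed past that LSN:
SELECT pg_wal_lsn_diff(pg_last_wal_replay_lsn(), '0/16B3740') >= 0;
-- The app polls until this returns true, then runs its read on the replica.
```

This gives read-your-writes per request while keeping most reads on replicas.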
Failover β
When the primary dies:
- Manual failover β operator promotes a replica. Safe but slow.
- Automatic failover (Patroni, RDS Multi-AZ, Galera) β consensus (Raft/etcd) picks a new primary.
Split-brain β network partition leaves two primaries, each accepting writes. Resolved only by:
- Fencing (STONITH β shoot the other node in the head).
- Quorum β a majority vote decides the legitimate primary.
- External coordinator (etcd, Zookeeper) holding the lease.
Multi-master (multi-primary) β
All nodes accept writes; changes propagate and conflicts are resolved (last-writer-wins, CRDT, or custom). Used by:
- Galera / Percona XtraDB Cluster β sync multi-master MySQL with certification.
- CockroachDB, Spanner, YugabyteDB β distributed SQL with Raft per range; logically multi-master.
- DynamoDB global tables β active-active across regions with last-writer-wins.
- Cassandra β eventually consistent multi-master with tunable quorum.
Pitfall: without conflict resolution, simultaneous updates to the same row produce garbage. LWW silently loses one write.
Read replicas are not HA by themselves β
Read replicas survive a primary crash only if you also have a promotion mechanism. Using replicas to handle reads doesn't automatically mean your system tolerates primary failure.
13. Sharding & Partitioning β
Partitioning = splitting a single logical table into multiple physical pieces, usually within one node. Sharding = partitioning across nodes. Goal: push a dataset beyond what one node can hold or serve.
Partitioning strategies β
| Strategy | Rule | Best for | Gotchas |
|---|---|---|---|
| Range | created_at buckets per month | Time-series, historical data (easy drop partition) | Hot partition for "now"; key choice matters |
| Hash | hash(id) mod N | Uniform spread, no obvious ordering | Re-sharding is expensive (most keys move) |
| List | region IN ('us-east', 'us-west') β specific partition | Tenant isolation | Rigid; requires known values |
| Composite | Hash then range | Multi-tenant + time-series | Complex |
| Directory (lookup) | Explicit mapping shard_id β node in metadata | Flexible, supports rebalancing | Extra hop; metadata becomes a SPOF unless HA |
Consistent hashing β
Naive hash sharding (id % N) forces nearly every key to move when N changes. Consistent hashing places keys and nodes on a ring; a key belongs to the next node clockwise. Adding a node only re-homes ~1/N of keys. Enhancements: virtual nodes (each physical node occupies many ring positions, smoothing distribution).
Used by: Cassandra, DynamoDB internals, memcached (with ketama), Redis Cluster (different scheme β hash slots, 16384 fixed), CDNs.
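A minimal ring with virtual nodes, to make the "~1/N of keys move" claim concrete. An illustrative sketch only β real systems add replication, node weights, and better hash mixing:

```python
import bisect
import hashlib

class HashRing:
    """Consistent-hash ring: each node occupies `vnodes` positions;
    a key belongs to the next position clockwise."""

    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes
        self.ring = []                  # sorted list of (hash, node)
        for n in nodes:
            self.add_node(n)

    def _hash(self, key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add_node(self, node: str):
        for i in range(self.vnodes):
            bisect.insort(self.ring, (self._hash(f"{node}#{i}"), node))

    def node_for(self, key: str) -> str:
        h = self._hash(key)
        i = bisect.bisect(self.ring, (h, ""))
        return self.ring[i % len(self.ring)][1]   # wrap around the ring

ring = HashRing(["node-a", "node-b", "node-c"])
keys = [f"user:{i}" for i in range(10_000)]
before = {k: ring.node_for(k) for k in keys}
ring.add_node("node-d")
moved = sum(1 for k in keys if ring.node_for(k) != before[k])
print(f"{moved / len(keys):.0%} of keys moved")   # roughly 1/4, not ~100%
```

Compare with `id % N`: going from 3 to 4 nodes re-homes about 75% of keys; the ring re-homes about 25%.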
Declarative partitioning in Postgres β
```sql
CREATE TABLE measurements (
  id       BIGSERIAL,
  taken_at TIMESTAMPTZ NOT NULL,
  value    NUMERIC
) PARTITION BY RANGE (taken_at);

CREATE TABLE measurements_2026_04
  PARTITION OF measurements
  FOR VALUES FROM ('2026-04-01') TO ('2026-05-01');
```

The planner performs partition pruning β queries with `taken_at` predicates skip non-matching partitions. Dropping a month of data is `DROP TABLE measurements_2026_01` β instant, no VACUUM needed.
Postgres doesn't do cross-node sharding natively (Citus adds it).
Cross-node sharding β the hard parts β
- Cross-shard joins / transactions β expensive or forbidden. Design to avoid.
- Global uniqueness β use UUIDs or a distributed ID generator (Snowflake).
- Rebalancing β moving shards without downtime requires dual-writes + backfills.
- Hot shards β celebrity user, viral tweet. Either pre-shard celebrity accounts or add secondary hashing.
- Secondary indexes β a global secondary index spans shards (scatter/gather) or must be maintained per shard (local).
When do you actually need to shard? β
- Working set > single-node RAM.
- Write throughput > single-node disk/CPU.
- Dataset > single-node disk cost-effectively.
Before sharding: vertical scale, add read replicas, add caches, archive cold data. Sharding changes your app's model β not a decision to take lightly.
14. Distributed Systems for DBs β
CAP β
In a distributed system with a network partition, you can have Consistency (every read sees the latest write) OR Availability (every request gets a non-error response) β not both. Partition tolerance is not optional; the network will partition.
So CAP is really: CP or AP during a partition.
- CP β refuse requests that can't be served consistently. MongoDB (with `w: majority`), HBase, Spanner, CockroachDB, Zookeeper/etcd.
- AP β keep serving, accept potential inconsistency. Cassandra (default), DynamoDB (default), Riak, eventually-consistent systems.
PACELC β CAP isn't enough β
If there's a Partition, choose A or C. Else (normal operation), choose Latency or Consistency.
Recognizes that even without partitions you pay latency for strong consistency. DynamoDB: PA/EL (available + low-latency, eventually consistent by default). Spanner: PC/EC (strict serializable always).
Consistency models (strong β weak) β
| Model | Guarantee |
|---|---|
| Linearizable (strong) | Every op appears to happen instantaneously at some point between its invocation and response. Single-copy illusion. |
| Sequential | All nodes see all ops in the same order; this order matches each client's program order. |
| Causal | If op A happened-before op B, everyone sees A before B. Concurrent ops may appear in different orders to different observers. |
| Read-your-writes | A client always sees its own writes (even if others might not yet). |
| Monotonic reads | Successive reads from a client never go backward in time. |
| Eventual | Given no new writes, eventually all replicas converge. Weakest. |
You typically want session guarantees (read-your-writes + monotonic reads) in an AP system, because eventual consistency alone confuses users.
Quorum (Dynamo-style) β
N replicas; read from R; write to W. R + W > N guarantees overlap β reads see the latest write. Common settings:
| Setup | Properties |
|---|---|
| N=3, W=3, R=1 | Fast reads, slow writes, no fault tolerance on write |
| N=3, W=1, R=1 | Fastest, eventually consistent only |
| N=3, W=2, R=2 | Strongly consistent, tolerates one node down |
| N=5, W=3, R=3 | Strongly consistent, tolerates two |
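The overlap argument can be demonstrated in a few lines. A toy model β assumes versioned replicas and random read/write sets, no hinted handoff or read repair:

```python
import random

def quorum_write(replicas, w, version, value):
    """Write to w replicas; the rest stay stale (as under async repair)."""
    for rep in random.sample(replicas, w):
        rep["version"], rep["value"] = version, value

def quorum_read(replicas, r):
    """Read from r replicas; return the value with the highest version."""
    sample = random.sample(replicas, r)
    return max(sample, key=lambda rep: rep["version"])["value"]

# N=3, W=2, R=2: R + W > N, so every read set overlaps the last write set.
replicas = [{"version": 0, "value": None} for _ in range(3)]
quorum_write(replicas, w=2, version=1, value="v1")
assert quorum_read(replicas, r=2) == "v1"   # pigeonhole: always fresh
```

With `w=1, r=1` the same read could land on the one stale replica β that is the "eventually consistent only" row above.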
Raft / Paxos (at 1,000 feet) β
Consensus protocols let a group of nodes agree on an ordered log despite failures. Used by: etcd, Consul, CockroachDB, Spanner (Paxos), TiKV, Kafka controller (KRaft), Redis Sentinel (custom).
- Leader election β one node wins majority vote.
- Log replication β leader appends entries, replicates to followers; entry committed when majority acks.
- Safety β committed entries never overwritten; only a node with an up-to-date log can become leader.
- Liveness β requires majority reachable; two-of-three tolerates one failure, three-of-five tolerates two.
You won't implement Raft. You will recognize "we use Raft to elect a primary and replicate the WAL." If asked to whiteboard, focus on: leader, term numbers, log matching, commit index.
Vector clocks β
An event tagged with per-node counters [A:3, B:5, C:2]. Lets you tell if two events are ordered (one dominates) or concurrent (neither does). Used in eventually-consistent KV stores (Riak, historically Dynamo) to detect and surface conflicts to the application.
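The dominance check is mechanical β a minimal sketch of comparing two vector clocks (dict of node β counter; labels like `"A"`/`"B"` are illustrative):

```python
def compare(vc_a: dict, vc_b: dict) -> str:
    """Return 'a<b', 'b<a', 'equal', or 'concurrent' (a real conflict)."""
    nodes = set(vc_a) | set(vc_b)
    a_ge = all(vc_a.get(n, 0) >= vc_b.get(n, 0) for n in nodes)
    b_ge = all(vc_b.get(n, 0) >= vc_a.get(n, 0) for n in nodes)
    if a_ge and b_ge:
        return "equal"
    if a_ge:
        return "b<a"        # a dominates: b happened-before a
    if b_ge:
        return "a<b"
    return "concurrent"     # neither dominates: surface to the app

print(compare({"A": 3, "B": 5}, {"A": 3, "B": 6}))  # a<b
print(compare({"A": 4, "B": 5}, {"A": 3, "B": 6}))  # concurrent
```

"Concurrent" is the case a last-writer-wins store silently papers over.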
CRDTs (conflict-free replicated data types) go further β merge values deterministically without application involvement (counters, sets, last-writer-wins registers).
15. NoSQL Categories β
Document stores β
Self-describing JSON/BSON records, often with some query support.
- MongoDB β Β§16. Rich query language, aggregation pipeline, transactions (with limits).
- Couchbase β SQL-like N1QL, integrated cache layer.
- Firestore / Firebase β serverless, realtime, mobile-focused.
When: data is naturally a document (user profiles, product catalog, CMS), shapes vary per record, you rarely cross-join, you want horizontal scale.
Key-value (KV) β
Get/put/delete by key. Opaque value.
- Redis β in-memory, rich data types (lists, sets, sorted sets, streams), pub/sub, Lua. See Β§20.
- Memcached β in-memory, simple KV only, no persistence.
- DynamoDB (when used trivially) β Β§17.
- etcd / Consul β small-value config store with consensus.
- RocksDB / LevelDB β embedded KV engine, used as backend for many DBs.
When: access pattern is pure lookup-by-key, or caching.
Wide-column β
Rows keyed by partition key + clustering columns; sparse columns can differ per row. Physically column-oriented within a row.
- Cassandra β AP, tunable consistency, linear scale. Great for write-heavy time-series.
- HBase β Hadoop ecosystem, strong consistency per row.
- DynamoDB (as single-table design) β Β§17.
- ScyllaDB β Cassandra-compatible, C++ implementation.
When: massive writes, queries always by partition key, you can design your schema around access patterns (no ad-hoc queries).
Graph β
Nodes + edges optimized for traversal and pattern matching.
- Neo4j β Cypher query language, property graph.
- Amazon Neptune β Gremlin, openCypher, and SPARQL (RDF).
- JanusGraph β on top of Cassandra/HBase.
When: relationships are first-class (social graph, recommendation, fraud rings, knowledge graphs). Multi-hop queries that would be N self-joins in SQL run cheaply.
Time-series β
Writes-dominated, data keyed by time, frequently append-only, auto-downsampling.
- InfluxDB β SQL-like InfluxQL / Flux.
- TimescaleDB β Postgres extension with hypertables.
- Prometheus TSDB β metrics, pull-based, not a durable system of record.
When: metrics, IoT, monitoring.
Vector β
Approximate nearest-neighbor search over embeddings β foundational for RAG / semantic search / recommendation.
- pgvector (Postgres extension) β brings vectors into SQL.
- Pinecone, Weaviate, Qdrant, Milvus β dedicated.
- Index: HNSW (Hierarchical Navigable Small World) β graph traversal in embedding space. IVF (Inverted File) β cluster then search cluster.
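A minimal pgvector sketch, assuming the extension is installed; the table, dimension, and literal vector are hypothetical:

```sql
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE docs (
  id        BIGSERIAL PRIMARY KEY,
  embedding vector(3)            -- real embeddings are hundreds of dims
);

CREATE INDEX ON docs USING hnsw (embedding vector_l2_ops);

-- 5 approximate nearest neighbors by L2 distance (<-> operator):
SELECT id FROM docs ORDER BY embedding <-> '[0.1, 0.2, 0.3]' LIMIT 5;
```

The win: vector search joins, filters, and transacts alongside your relational data.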
When NOT to pick NoSQL β
- You need multi-table transactions / cross-entity invariants.
- You need ad-hoc querying without rewriting the schema.
- Your data is relational and modest in size.
- Your team doesn't yet have an access pattern to design around.
Default to Postgres unless there's a specific reason not to. "We used Mongo because it was cool" is a regret story on every senior engineer's resume.
Interview Qs covered β
Β§15 frames Q12 (Mongo vs Postgres).
16. MongoDB Deep Dive β
MongoDB stores BSON documents (binary JSON with extra types like ObjectId, Date, Decimal128) inside collections (β tables) inside databases. No enforced schema by default β but you should use validators ($jsonSchema) in production.
Documents and the embedding decision β
The core data-modeling decision: embed vs reference.
```jsonc
// Embedded (denormalized)
{
  _id: ObjectId("..."),
  user: "alice",
  orders: [
    { id: 1, total: 100, items: [...] },
    { id: 2, total: 50 }
  ]
}

// Referenced (normalized)
// users:  { _id: "alice", ... }
// orders: { _id: 1, user_id: "alice", total: 100 }
```

Embed when:
- Data is accessed together (user's orders appear on the profile).
- Lifetime is tied (order lines die with their order).
- Embedded array is bounded β not unbounded growth (16 MB document limit).
- You don't need to query embedded items independently.
Reference when:
- Array can grow unboundedly (a popular post's comments).
- Entity is referenced from multiple places.
- You need to query it separately with an index.
Replica set + oplog β
A replica set is one primary + N secondaries. The primary writes to its oplog (a capped collection of operations). Secondaries tail the oplog and replay.
- Default: async replication. Secondaries might lag seconds.
- Failover via Raft-like election. Secondary with freshest oplog wins.
- Reads default to primary; `readPreference: secondary[Preferred]` offloads reads at the cost of staleness.
Write and read concerns β
Write concern β how many nodes must ack a write before client hears success.
- `w: 1` β primary only (default). Fast; the write is lost if the primary dies before replicating it.
- `w: "majority"` β majority of voting nodes. Durable across primary failure.
- `j: true` β flushed to the on-disk journal.
Read concern β consistency of reads.
- `local` (default) β latest data on the queried node; might not be majority-committed.
- `majority` β only returns data acked by a majority.
- `linearizable` β even stronger; blocks until single-copy reads can be guaranteed.
Rule: `w: "majority", j: true, readConcern: "majority"` for "this write must never be lost."
Sharding β
```
mongos (router) <-> config servers (cluster metadata)
   |
   +-- Shard 1 (replica set) <- partition range A
   +-- Shard 2 (replica set) <- partition range B
   +-- Shard 3 (replica set) <- partition range C
```

The shard key is critical: all documents with the same shard key live on the same shard.
Picking a shard key (in order of importance):
- High cardinality β enough distinct values to spread.
- Even distribution β avoid hot shards.
- Matches common queries β queries with shard key filter are single-shard; without are scatter/gather (slow).
- Non-monotonic β monotonic keys (timestamps) concentrate writes on one shard. Use hashed sharding for timestamps.
Changing a shard key used to require a dump/restore; MongoDB 5.0+ can reshard online.
Indexes β
- Single-field β most common.
- Compound β prefix rule same as SQL: `{ user_id: 1, status: 1, created_at: 1 }` serves queries prefixed on `user_id`, `(user_id, status)`, etc.
- Multikey β automatic on array fields: `{ tags: 1 }` indexes each array element.
- Text β full-text search, one per collection, language-aware stemming.
- 2dsphere / 2d β geospatial.
- Wildcard β `{ "$**": 1 }` indexes all paths β handy for flexible schemas.
- TTL β `expireAfterSeconds` auto-deletes documents; great for sessions.
- Partial β indexes only docs matching a filter.
- Hashed β for sharding on a monotonic natural key.

> Gotcha β compound prefix rule. Index `{ a: 1, b: 1, c: 1 }` supports queries on `a`, `a+b`, `a+b+c` β but NOT `b`, `c`, or `b+c`. Order matters. Put equality fields first, then range, then sort-only.
Aggregation pipeline β
Multi-stage transformation pipeline. Stages: $match, $project, $lookup (join), $group, $sort, $limit, $unwind, $facet, $bucket, $addFields.
```js
db.orders.aggregate([
  { $match: { status: "shipped", created_at: { $gte: new Date("2026-01-01") } } },
  { $group: { _id: "$user_id", total: { $sum: "$amount" }, count: { $sum: 1 } } },
  { $sort: { total: -1 } },
  { $limit: 10 }
])
```

Optimization rules:
- Put `$match` as early as possible β pushdown.
- Use indexes (only stages before blocking ops like `$group` can use them).
- `$lookup` is an ad-hoc left outer join β slower than embedding. Avoid in hot paths.
Transactions β and their limits β
MongoDB 4.0+ supports multi-document ACID transactions. 4.2+ across shards.
Limits:
- Default 60-second transaction timeout. Long transactions kill secondaries (oplog pressure).
- Performance hit: transactions serialize on the oplog; cross-shard transactions use 2PC internally.
- Limits on data moved per tx.
- Read-your-own-writes inside a tx is always fine, but reads outside the tx with `readConcern: snapshot` require a replica set.
Rule of thumb: design to not need transactions. Embed where you can. Use them for rare cross-document invariants, not for everything.
Change streams β
Tail the oplog via a clean API. Useful for:
- Keeping a search index (Elasticsearch) in sync.
- Emitting events to Kafka (Mongo β Kafka connector is literally this).
- Trigger-like workflows.
```js
db.orders.watch([{ $match: { "fullDocument.status": "shipped" } }])
  .on('change', evt => { /* ... */ });
```

Resume tokens let you pick up where you left off on reconnect.
Interview Qs covered β
Β§16 addresses Qs 12 (Mongo vs Postgres), 13 (compound prefix), 14 (transaction limits).
17. DynamoDB Deep Dive β
DynamoDB is a fully managed key-value / wide-column store on AWS. It trades ad-hoc querying for predictable low-latency at any scale. The design mantra: model access patterns first, schema second.
Tables, items, attributes β
A table holds items (β rows). Each item is a set of attributes (β columns, but each item can have any attributes). Every item has a primary key, either:
- Partition key (PK) only β hash key, simple KV.
- PK + sort key (SK) β composite. All items with the same PK form a partition and are stored together, sorted by SK.
Partitions and hot partitions β
Behind the scenes: DynamoDB hashes PK β assigns to a physical partition (hash space slice). Single partition max ~1000 WCU / 3000 RCU per second (also size-bounded, 10 GB).
Hot partition β if one PK value gets a huge share of traffic (e.g., country = "US"), that single partition saturates even though the table has excess capacity.
Fixes:
- Better PK β higher cardinality (e.g., `user_id` not `country`).
- Write sharding β suffix the key: `country#01..country#99`, random on write, scatter-read.
- Caching β DAX or app-layer.
- Adaptive capacity β DynamoDB auto-boosts a hot partition up to some limit; not a substitute for good modeling.
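The write-sharding fix above is just key arithmetic. A sketch of the two helpers an app would use; the shard count and `country#US` key are illustrative assumptions sized to the needed throughput:

```python
import random

SHARDS = 10  # assumption: 10 suffixes spreads one hot key over 10 partitions

def write_key(hot_pk: str) -> str:
    """Pick a random suffix on write so load spreads across partitions."""
    return f"{hot_pk}#{random.randrange(SHARDS):02d}"

def read_keys(hot_pk: str) -> list[str]:
    """Scatter-read: query every suffix and merge results client-side."""
    return [f"{hot_pk}#{i:02d}" for i in range(SHARDS)]

print(read_keys("country#US")[:2])  # ['country#US#00', 'country#US#01']
```

The trade: writes get 10Γ the partition headroom; reads become 10 queries merged in the app.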
GSI vs LSI β
| | Local Secondary Index (LSI) | Global Secondary Index (GSI) |
|---|---|---|
| PK | Same as base table | Different β any attribute |
| SK | Different | Different |
| Max count | 5 per table | 20 per table (soft limit) |
| Created | Only at table creation | Anytime |
| Consistency | Supports strongly consistent reads | Eventually consistent only |
| Capacity | Shares base table's | Separately provisioned |
| Size | Limited to 10 GB per PK | Unlimited |
Rule: GSIs unless you truly need strong consistency on a secondary access path (rare). GSIs are the flexible, production-preferred choice.
Single-table design β
Counter-intuitive: store multiple entity types (users, orders, order lines) in one table, distinguished by key patterns.
```
PK           SK                 type    attrs...
USER#alice   PROFILE            user    { email, name }
USER#alice   ORDER#2026-04-17   order   { total, status }
USER#alice   ORDER#2026-04-18   order   { total, status }
ORDER#42     LINE#1             line    { qty, price }
ORDER#42     LINE#2             line    { qty, price }
```

Why:
- One round-trip fetches a user AND all their orders: `PK = "USER#alice"`.
- GSIs reshape the same data for different access patterns.
- No cross-entity joins needed β relationships are captured by key design.
The cost: schema complexity. Queries depend on memorized key shapes. Use a naming convention (ENTITY#id) and document it ruthlessly.
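To make the key shapes concrete, here is an in-memory stand-in for the table above β a real app would issue a DynamoDB `Query` with `KeyConditionExpression: PK = :pk AND begins_with(SK, :prefix)`; the items mirror the sketch:

```python
items = [
    {"PK": "USER#alice", "SK": "PROFILE",          "type": "user"},
    {"PK": "USER#alice", "SK": "ORDER#2026-04-17", "type": "order"},
    {"PK": "USER#alice", "SK": "ORDER#2026-04-18", "type": "order"},
    {"PK": "ORDER#42",   "SK": "LINE#1",           "type": "line"},
]

def query(pk: str, sk_prefix: str = ""):
    """One 'partition read': items share a PK, sorted by SK."""
    return sorted(
        (i for i in items if i["PK"] == pk and i["SK"].startswith(sk_prefix)),
        key=lambda i: i["SK"],
    )

print([o["SK"] for o in query("USER#alice", "ORDER#")])
# ['ORDER#2026-04-17', 'ORDER#2026-04-18']
```

Note how the date-stamped SK gives chronological ordering for free β that is the key design doing the work of an `ORDER BY`.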
Modeling relationships β
One-to-many β composite PK/SK where PK = parent, SK = CHILD#id:
```
PK = USER#alice, SK = PROFILE            β the user
PK = USER#alice, SK = ORDER#202604170001 β each order

Query PK = "USER#alice" AND SK begins_with "ORDER#" β all orders.
```

Many-to-many β adjacency list pattern:

```
PK = USER#alice, SK = GROUP#eng  β alice β eng group
PK = GROUP#eng,  SK = USER#alice β GSI1: query by group
```

With a GSI flipping PK β SK you get both directions.
Capacity modes β
Provisioned β set RCU/WCU per table. Cheaper at steady load. Autoscaling adjusts within bounds.
- 1 RCU = 1 strongly consistent read of up to 4 KB per second, OR 2 eventually consistent reads, OR 0.5 transactional reads.
- 1 WCU = 1 write of up to 1 KB per second, OR 0.5 transactional writes.
On-demand β pay per request. ~6-7Γ more expensive per request but no capacity planning. Great for unpredictable or spiky traffic.
Eventual vs strong vs transactional reads β
- Eventual (default) β may lag a write by ~1s. Cheapest.
- Strong β always latest; costs 2Γ eventual; not supported on GSIs.
- Transactional read (`TransactGetItems`) β consistent snapshot across items, 4Γ cost.
Writes β
- `PutItem` / `UpdateItem` / `DeleteItem` β single-item.
- `BatchWriteItem` β up to 25 items per call; no atomicity across items.
- `TransactWriteItems` β up to 100 items, ACID across them, 2Γ cost. Rollback on any failure.
- Conditional writes β `ConditionExpression: "attribute_not_exists(pk)"` enforces uniqueness / optimistic lock.
Streams + DAX + TTL β
- DynamoDB Streams β change feed with 24-hour retention; feed Lambda, replicate to other stores, audit.
- DAX (DynamoDB Accelerator) β in-cluster cache; sub-millisecond reads; eventually consistent vs source.
- TTL β attribute-based automatic deletion. Cheap way to expire sessions, temp data.
PartiQL β
SQL-style query language for DynamoDB β convenient but doesn't change capabilities. Still limited to key-based access patterns. Convenience, not power.
Interview Qs covered β
Β§17 addresses Qs 15β19 (PK/SK, GSI/LSI, single-table design, 1:M modeling, capacity modes).
18. Postgres Specifics β
Postgres is the "reach for by default" relational engine for a reason: it has every feature, and its extension ecosystem covers the rest.
JSONB β structured data as a column β
JSONB stores a parsed, binary JSON representation. Not just a blob.
```sql
CREATE TABLE events (
  id BIGSERIAL PRIMARY KEY,
  payload JSONB
);

INSERT INTO events (payload) VALUES ('{"type":"login","user":"alice"}');

SELECT payload->>'user' FROM events WHERE payload->>'type' = 'login';
```

Operators: `->` (returns JSONB), `->>` (returns text), `#>` / `#>>` (path), `@>` (containment), `?` (key exists), `?&` / `?|`.
Index JSONB with GIN:
```sql
CREATE INDEX idx_events_payload_gin ON events USING GIN (payload);

-- Or the smaller jsonb_path_ops for containment-only:
CREATE INDEX idx_events_payload_gin_path ON events USING GIN (payload jsonb_path_ops);
```

Supports queries like `payload @> '{"type":"login"}'` via the GIN index.
JSON (no B) stores the text as-is β slower for queries, preserves whitespace/ordering. Prefer JSONB.
Arrays β
Native array types: INT[], TEXT[]. Indexable with GIN.
```sql
CREATE TABLE posts (id BIGSERIAL PRIMARY KEY, tags TEXT[]);
CREATE INDEX idx_posts_tags ON posts USING GIN (tags);

SELECT * FROM posts WHERE tags @> ARRAY['postgres'];
```

Use cautiously: if you routinely query inside the array, it's usually better as a real junction table.
Extensions worth knowing β
| Extension | Purpose |
|---|---|
| pg_trgm | Trigram fuzzy search; index with gin_trgm_ops for LIKE '%...%' |
| pgvector | Vector similarity search; vector column + HNSW/IVFFlat index |
| PostGIS | Geospatial β points, polygons, routing |
| pg_cron | Cron-style scheduled SQL in the DB |
| pg_stat_statements | Query stats β top-N slowest, most-called |
| pg_partman | Partition maintenance automation |
| hstore | Key-value column (older; JSONB mostly replaces) |
| uuid-ossp / pgcrypto | UUID gen functions |
| citext | Case-insensitive text type |
LISTEN / NOTIFY β
Tiny, transactional pub/sub inside Postgres (notifications fire on commit).
```sql
-- Listener
LISTEN order_updates;

-- Publisher (in another session)
NOTIFY order_updates, 'order 42 shipped';
```

Great for local cache-invalidation signals. Don't rely on it as a replacement for Kafka β notifications are lost if no listener is connected.
Logical replication β
Publish/subscribe per table. Lets you:
- Replicate a subset across major versions.
- Do zero-downtime upgrades (logical-replicate from old to new, then switch).
- Build CDC pipelines (Debezium is a logical replication consumer).
Roles and Row-Level Security (RLS) β
```sql
CREATE ROLE app_user LOGIN PASSWORD '...';
GRANT SELECT ON orders TO app_user;

ALTER TABLE orders ENABLE ROW LEVEL SECURITY;

CREATE POLICY user_orders ON orders
  FOR SELECT USING (user_id = current_setting('app.current_user_id')::BIGINT);
```

Any query from `app_user` on `orders` is filtered by the policy β enforced by the engine, not application code. Great for multi-tenant apps when you can trust the DB connection's identity.
CTE materialization gotcha β
Pre-Postgres-12 CTEs were always materialized. Post-12 they're often inlined. If you depended on the materialization fence (side effects, avoiding repeated expensive subqueries), add AS MATERIALIZED.
19. MySQL / SQL Server Callouts β
Not a full tour β just what differs that interviewers probe.
MySQL (InnoDB) β
- Clustered PK β rows live inside the PK B-tree. Lookups by PK = one traversal. Secondary indexes store (key, PK), require a second lookup β matters for covering-index choices.
- Gap locks β InnoDB's REPEATABLE READ prevents phantoms via gap locks (locks the "space between" rows on a range query). Can cause surprising deadlocks; switching the session to READ COMMITTED disables most gap locking (the old `innodb_locks_unsafe_for_binlog` knob was deprecated and removed in 8.0).
- `ORDER BY` + `LIMIT` with filesort β check EXPLAIN; ensure an index covers the ordering.
- No transactional DDL β `ALTER TABLE` can't roll back. Use `gh-ost` / `pt-online-schema-change` for online migrations on big tables.
- Storage engines β InnoDB is the only one to use for new work. MyISAM exists; don't.
- Query cache β removed in MySQL 8.0. If an interviewer asks about it, the correct answer includes "it's been removed because invalidation cost > benefit."
SQL Server / T-SQL β
- Clustered index β optional; strongly recommended (usually on PK).
- Temp tables β `#temp` (session) vs `##temp` (global) vs table variables. Table variables have fewer stats β worse plans for anything non-trivial.
- Common table expressions β supported; used for recursion and readability just like Postgres.
- Parameter sniffing β the plan cache uses the first call's parameter value. Bad for skewed distributions. Workarounds: `OPTION (RECOMPILE)`, `OPTIMIZE FOR UNKNOWN`, local variables, `OPTION (OPTIMIZE FOR (@p = ...))`.
- Always On Availability Groups β HA + read-scale; primary + secondaries kept in sync by log-stream replication, with failover.
- Columnstore indexes β clustered columnstore for big analytical tables; non-clustered columnstore alongside row store for HTAP.
- `SELECT INTO` β creates a table from query results β no indexes.
20. Caching β
Caching adds a fast, cheap tier in front of the database to reduce read latency and load. It's one of the most commonly asked interview topics because invalidation is the hard part.
Redis vs Memcached β
| | Redis | Memcached |
|---|---|---|
| Data model | Strings, lists, sets, sorted sets, hashes, streams, bitmaps, HyperLogLog, geospatial | Strings only |
| Persistence | RDB snapshots + AOF | None |
| Replication | Primary-replica + Cluster | None (client-side sharding) |
| Transactions | MULTI/EXEC, Lua scripting | No |
| Memory model | Single-threaded main loop (with IO threads in 6+) | Multi-threaded |
| Use case | Cache + data structures server + pub/sub + queue | Pure cache |
Redis is the default. Memcached is still viable when you want a purely ephemeral cache with zero features.
Caching patterns β
Cache-aside (lazy loading) β app controls the cache:
```
read(k):
  value = cache.get(k)
  if value is null:
    value = db.get(k)
    cache.set(k, value, ttl)
  return value

write(k, v):
  db.put(k, v)
  cache.delete(k)   -- invalidate
```

Pros: cache holds only requested data; app unaffected if cache is down. Cons: cache stampede on miss (N requests all hit DB); stale window after delete before next read repopulates.
Read-through β cache library hits DB on miss (cache-aside abstracted into a library).
Write-through β app writes to cache; cache propagates to DB synchronously. Strong consistency between cache and DB; slower writes.
Write-behind (write-back) β app writes to cache; cache drains to DB async. Fastest writes. Risk: cache failure = lost data. Avoid for anything you can't lose.
Refresh-ahead β cache proactively refreshes entries before TTL expires (predict access). Used for very hot keys.
Invalidation strategies β
- TTL only β set expiry, accept up to that much staleness. Simple, best when "slightly stale is fine."
- Explicit delete on write β cache-aside. Risk: if write happens between read and set, stale data re-cached (race). Solutions: SETNX on write, versioned keys.
- Write-through / write-behind β cache is always fresh (by construction).
- Event-driven (CDC) β DB change stream (Debezium, Postgres logical, Mongo change stream) β invalidate cache key. Best for systems where multiple writers touch the DB.
- Version tag β cache key includes a version: `user:42:v17`. Bumping the version invalidates all.
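The version-tag strategy fits in a few lines: keep a version counter per entity, bake it into the key, and bump it on write. An in-memory sketch (hypothetical helper, not a Redis client):

```java
import java.util.HashMap;
import java.util.Map;

// Version-tagged cache keys: bumping the version makes every previously
// cached key unreachable, so invalidation is O(1) and race-free.
final class VersionedKeys {
    private final Map<String, Integer> versions = new HashMap<>();

    String key(String entity, long id) {
        int v = versions.getOrDefault(entity + ":" + id, 1);
        return entity + ":" + id + ":v" + v;
    }

    void invalidate(String entity, long id) {
        // If no version recorded yet, the next version is 2; otherwise +1.
        versions.merge(entity + ":" + id, 2, (old, unused) -> old + 1);
    }
}
```

Old entries (`user:42:v1`) simply age out of the cache via TTL or eviction.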
Phil Karlton: "There are only two hard things in computer science: cache invalidation and naming things." Believe him.
Eviction policies (when the cache fills) β
| Policy | Evicts | Use case |
|---|---|---|
| LRU (Least Recently Used) | Oldest-accessed | General caching; default |
| LFU (Least Frequently Used) | Least-accessed | Skewed access (celebrity hotness) |
| FIFO | Oldest-written | Rare; naive |
| Random | Random key | When you don't want tracking overhead |
| allkeys-lru / volatile-lru | LRU across all keys / only keys with TTL | Redis configs |
| volatile-ttl | Shortest TTL first | When TTL roughly matches value |
| noeviction | None β errors on OOM | Session stores, queues |
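LRU itself is cheap to sketch on the JVM: `LinkedHashMap` with access ordering plus `removeEldestEntry` yields a working in-process LRU cache:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal LRU cache: accessOrder=true moves an entry to the tail on get(),
// and removeEldestEntry evicts the least-recently-used head on overflow.
class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    LruCache(int capacity) {
        super(16, 0.75f, true); // true = access order, not insertion order
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity;
    }
}
```

Production caches (Caffeine, Redis) use approximations of LRU/LFU for concurrency and memory reasons, but the semantics are the same.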
Stampede / thundering herd β
A popular key expires; 1000 concurrent requests all miss the cache and hammer the DB.
Mitigations:
- Distributed lock β first miss locks, fetches, sets; others wait or serve stale.
- Probabilistic early expiration β each request independently rolls a probability to refresh before actual expiry; smooths the cliff.
- Stale-while-revalidate β serve stale value + kick off background refresh.
- Request coalescing β in-process dedup: concurrent misses for same key reuse one DB call.
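Request coalescing from the list above can be sketched with a map of in-flight futures: concurrent misses for the same key share a single load (the `Coalescer` class and `loader` callback are assumptions, not a library API):

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// In-process request coalescing: the first miss for a key starts the DB
// load; concurrent misses for the same key reuse the in-flight future.
final class Coalescer<K, V> {
    private final ConcurrentHashMap<K, CompletableFuture<V>> inFlight = new ConcurrentHashMap<>();

    CompletableFuture<V> get(K key, Function<K, V> loader) {
        return inFlight.computeIfAbsent(key, k ->
            CompletableFuture.supplyAsync(() -> loader.apply(k))
                             .whenComplete((v, e) -> inFlight.remove(k)));
    }
}
```

After the load completes the entry is removed, so the next miss loads fresh data.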
Cache hit ratio β what's good β
- >95% β great for a hot-data cache.
- 80β95% β normal.
- <50% β cache is fighting working set size; bigger cache or longer TTL or different key design.
What to cache β
- Derived, expensive, rarely-written data β perfect.
- Session state, rate-limit counters, leaderboards β Redis sorted sets are ideal.
- Write-heavy, rarely-read data β don't bother.
- Personalized unique-per-user β cache by user; hit rate depends on user revisit.
Interview Qs covered β
Β§20 addresses Q21 (cache invalidation).
21. OLAP & Data Warehousing β
When queries scan millions of rows with aggregates, OLTP engines suffer. A separate warehouse uses column-oriented storage and a denormalized schema optimized for analytics.
Star vs snowflake schema β
Star schema:
```
               dim_date
                  |
dim_customer βββ fact_sales βββ dim_product
                  |
              dim_store
```

- Fact table β numeric, additive measures (sales, quantity, revenue) + FKs to dimensions.
- Dimension tables β descriptive attributes (customer name, product category, date breakdown). Wide, few rows relative to facts.
Snowflake schema β dimensions are further normalized into sub-dimensions:

```
dim_product β dim_category β dim_department
```

Saves storage; adds joins. Star is usually preferred for warehouse speed.
Slowly Changing Dimensions (SCD) β
When dimension attributes change over time, how do you preserve history?
| Type | Behavior |
|---|---|
| SCD 1 | Overwrite. History lost. Use for corrections. |
| SCD 2 | New row for each change, with effective_from/effective_to + is_current. History preserved. Most common. |
| SCD 3 | New column for "previous" value. Limited to one step back. |
| SCD 4 | Split current vs history tables. |
| SCD 6 | Hybrid (1+2+3). |
Senior interviews: "how would you handle a customer changing address?" β SCD 2 with effective dating.
Columnar storage β
Row-oriented (Postgres, MySQL): rows stored contiguously. Good for "give me row 42."
Column-oriented (Redshift, BigQuery, Snowflake, ClickHouse, Parquet): values of one column stored contiguously.
Benefits:
- Compression β one column = one data type = high compression (delta, dictionary, run-length).
- Projection pushdown β `SELECT avg(salary)` only reads the salary column, not every byte of every row.
- Vectorized execution β process chunks of column values in SIMD-friendly loops.
Trade-off: inserting or updating a row touches every column file. Columnar engines generally favor append-only / bulk-load patterns.
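The compression claim is easy to demonstrate: a single-typed, low-cardinality column collapses under run-length encoding. A toy sketch (real engines combine RLE with dictionary and delta encoding):

```java
import java.util.ArrayList;
import java.util.List;

// Toy run-length encoder: a sorted or low-cardinality column (status,
// country) collapses to a handful of (value, count) pairs.
final class Rle {
    record Run(String value, int count) {}

    static List<Run> encode(List<String> column) {
        List<Run> runs = new ArrayList<>();
        for (String v : column) {
            int last = runs.size() - 1;
            if (last >= 0 && runs.get(last).value().equals(v)) {
                runs.set(last, new Run(v, runs.get(last).count() + 1));
            } else {
                runs.add(new Run(v, 1));
            }
        }
        return runs;
    }
}
```

A million-row `country` column with 50 distinct sorted values becomes at most 50 runs; row stores can't do this because adjacent bytes belong to different columns.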
ETL vs ELT β
ETL (Extract-Transform-Load) β classic: transform in an intermediate tool (Talend, Informatica, Spark), load to warehouse. Transformation logic lives outside the warehouse.
ELT (Extract-Load-Transform) β modern: load raw to warehouse, transform there with SQL (dbt is the dominant tool). Warehouse does the heavy lifting; transformations are version-controlled SQL.
ELT wins when the warehouse is cheap, fast, and elastic (BigQuery, Snowflake, Redshift). ETL still wins for heavy rule-based cleansing or when you can't store raw for policy reasons.
MPP engines β
Massively Parallel Processing β one logical warehouse, many nodes each processing a slice of the data.
- Redshift β AWS, columnar, provisioned + serverless, DC2/RA3 node families.
- BigQuery β Google, serverless, columnar, separation of storage and compute, slots model.
- Snowflake β multi-cloud, virtual warehouses (elastic compute clusters), automatic scaling.
- ClickHouse β open source columnar, super fast for specific queries, self-hosted.
Data lake vs warehouse vs lakehouse β
- Lake β raw files in object storage (S3, GCS) in open formats (Parquet, Avro). Schema-on-read.
- Warehouse β structured, optimized tables. Schema-on-write.
- Lakehouse β ACID tables over a lake via formats like Delta Lake, Apache Iceberg, Apache Hudi. Hybrid.
22. Search (Elasticsearch / OpenSearch) β
Elasticsearch isn't a database in the transactional sense β it's a distributed, indexed document store optimized for relevance-scored search.
Inverted index β the core data structure β
For every term in every document, a list of documents containing it.
"database" β [doc1, doc5, doc7]
"postgres" β [doc1, doc3]
"mongodb" β [doc5, doc7]Search "database postgres" = intersect [doc1, doc5, doc7] β© [doc1, doc3] = [doc1]. Blazingly fast for text match.
Analyzers and tokenizers β
Documents are processed into tokens at index time:
- Tokenizer β split text into tokens (standard, whitespace, keyword, n-gram).
- Token filters β lowercase, remove stopwords ("the", "a"), stem ("running" β "run"), synonyms.
Same analyzer applied to queries. Analyzer mismatch is a classic bug β your query analyzer lowercases but the indexed analyzer doesn't.
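A toy analyzer makes the index/query symmetry concrete (a standard-ish tokenizer plus lowercase and stopword filters; stemming and synonyms omitted):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Toy analyzer: split on non-letters, lowercase, drop stopwords.
// The SAME pipeline must run at index time and at query time,
// otherwise "Database" in a query never matches "database" in the index.
final class Analyzer {
    private static final Set<String> STOPWORDS = Set.of("the", "a", "an");

    static List<String> analyze(String text) {
        return Arrays.stream(text.split("[^A-Za-z]+"))
                     .filter(t -> !t.isEmpty())
                     .map(String::toLowerCase)
                     .filter(t -> !STOPWORDS.contains(t))
                     .collect(Collectors.toList());
    }
}
```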
Relevance scoring β BM25 β
Default scoring in modern Elasticsearch. Extension of TF-IDF:
- Term frequency β more mentions = more relevant (with diminishing returns).
- Inverse document frequency β rare terms weigh more ("postgres" more than "the").
- Document length normalization β shorter doc with same hit = more relevant.
Tunable per field (boost), per query.
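The three behaviors fall out of the BM25 term-score formula directly; a sketch with the usual defaults k1 = 1.2 and b = 0.75 (an illustrative rendition, not Lucene's exact implementation):

```java
// BM25 contribution of one query term to one document's score.
// k1 controls term-frequency saturation; b controls length normalization.
final class Bm25 {
    static final double K1 = 1.2, B = 0.75;

    // tf: occurrences in this doc; df: docs containing the term;
    // n: total docs; docLen / avgDocLen: length-normalization inputs.
    static double termScore(int tf, int df, int n, double docLen, double avgDocLen) {
        double idf = Math.log(1 + (n - df + 0.5) / (df + 0.5));
        double norm = 1 - B + B * (docLen / avgDocLen);
        return idf * (tf * (K1 + 1)) / (tf + K1 * norm);
    }
}
```

Rare terms get a bigger idf, repeated mentions saturate because tf appears in the denominator too, and shorter documents shrink the denominator.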
Shards and replicas β
- Primary shards β data sliced into N shards at index creation. Shard count is effectively fixed β change it via reindex, or the split / shrink APIs (with constraints).
- Replica shards β copies for redundancy + read scale.
- Each shard is a self-contained Lucene index.
Rule of thumb: shard size ~20β50 GB. Too-small shards (< few GB) waste overhead. Too-big (> 50 GB) makes recovery slow.
Refresh / flush / merge β
- Refresh (default 1s) β makes recent writes visible to search. Creates a small in-memory segment. Tunable; bulk loads should set `refresh_interval: -1` and refresh at the end.
- Flush β fsyncs the transaction log, commits Lucene segments to disk.
- Merge β background process combines small segments into larger ones. More reads = bigger segments = faster search.
When search β OLTP DB β
- ES is near-real-time β 1s refresh isn't transactional.
- Writes lag behind a system-of-record DB (via CDC or dual-write).
- No multi-document transactions.
- Don't use ES as a primary data store β use it as a secondary index on top of one.
Mapping explosion β
A "mapping" is ES's schema. Dynamic mapping auto-creates fields from first document seen. If you index unbounded user-supplied field names (custom_attr_<n>), you explode the mapping and kill cluster performance.
Mitigations:
- Disable dynamic mapping (`"dynamic": "strict"`).
- Use `"dynamic": "runtime"` to keep fields searchable without indexing them.
- Flatten user-supplied attributes into a single nested object.
23. Schema Migrations β
A migration is a versioned change to the schema β never hand-typed into a prod console.
Flyway vs Liquibase β
| | Flyway | Liquibase |
|---|---|---|
| Format | SQL files (V1__init.sql, V2__add_index.sql) | XML / YAML / SQL / JSON changelogs |
| Philosophy | Simple, imperative | Declarative, DB-agnostic |
| Rollback | Not built-in (write a down migration manually) | Built-in (but requires care) |
| Java integration | Spring Boot starter | Spring Boot starter |
| Best for | Teams fluent in SQL, single-DB target | Cross-DB portability, complex orchestration |
Either works. Pick one, commit to it, check migrations into source control alongside code.
The expand-contract (parallel-change) pattern β
A zero-downtime schema change happens in three phases, deployed independently.
Expand β add new, don't touch old. Migrate β dual-write; backfill; dual-read. Contract β remove old.
Zero-downtime column rename example β
Goal: rename `users.email_address` β `users.email`
```
Phase 1: EXPAND
  ADD COLUMN email TEXT (nullable).
  Deploy writers: write to BOTH email_address AND email.
  Backfill: UPDATE users SET email = email_address WHERE email IS NULL;  -- in batches

Phase 2: MIGRATE
  Deploy readers: read from email, fall back to email_address if null.
  Verify.
  Make email NOT NULL (only safe once backfill done).

Phase 3: CONTRACT
  Deploy writers + readers: use only email.
  DROP COLUMN email_address.
```

Every step is backward-compatible. You can roll back any deployment.
Other zero-downtime patterns β
| Change | Technique |
|---|---|
| Add column | Safe (with default; in PG 11+ default is metadata-only, no rewrite) |
| Add index | CREATE INDEX CONCURRENTLY (PG) / ALGORITHM=INPLACE, LOCK=NONE (MySQL). Non-blocking. |
| Drop column | Deploy code that doesn't use it β drop. |
| Change type | Add new column β backfill β switch reads β drop old. |
| Add NOT NULL to existing column | Add CHECK (col IS NOT NULL) NOT VALID β VALIDATE CONSTRAINT β ALTER COLUMN SET NOT NULL (PG 12+). |
| Add FK | ADD CONSTRAINT ... NOT VALID; VALIDATE CONSTRAINT (doesn't take strong lock). |
Online DDL tools (big MySQL tables) β
- pt-online-schema-change (Percona) β creates a shadow table, triggers copy rows, swap at end.
- gh-ost (GitHub) β similar, reads binlog instead of triggers β less load.
- pg_repack / pg_squeeze (Postgres) β reclaim bloat / rewrite without long lock.
Backfill strategy β
Backfills on large tables must:
- Batch (e.g., 1000 rows per transaction) β bounded lock duration.
- Throttle β pause on replica lag, CPU saturation.
- Resume from checkpoint β if killed, don't start over.
- Log progress.
A bad backfill (UPDATE huge_table SET new_col = old_col in one statement) = hours-long lock β production outage.
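The good-backfill loop can be sketched independently of any driver: keyset pagination plus a checkpoint. The `fetchBatch` / `updateBatch` / `saveCheckpoint` hooks below are assumptions standing in for real SQL:

```java
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;

// Keyset-paginated backfill: process rows in bounded batches, remember the
// last id handled so a killed run resumes instead of starting over.
final class Backfill {
    static long run(long checkpoint,
                    Function<Long, List<Long>> fetchBatch, // ids > checkpoint, LIMIT n
                    Consumer<List<Long>> updateBatch,      // one short transaction per batch
                    Consumer<Long> saveCheckpoint) {
        List<Long> batch;
        while (!(batch = fetchBatch.apply(checkpoint)).isEmpty()) {
            updateBatch.accept(batch);
            checkpoint = batch.get(batch.size() - 1);
            saveCheckpoint.accept(checkpoint); // resume point if we crash here
            // Real runs also throttle here: pause on replica lag / CPU pressure.
        }
        return checkpoint;
    }
}
```

Each batch holds locks only for its own short transaction, which is exactly what the one-statement `UPDATE huge_table` fails to do.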
Interview Qs covered β
Β§23 addresses Q20 (zero-downtime column rename).
24. Backups, HA, DR β
Backups are the last line of defense against "oops." HA (high availability) handles node failure. DR (disaster recovery) handles region / site failure.
Backup types β
| Type | What it captures | Speed | Storage |
|---|---|---|---|
| Full | Everything | Slow | Large |
| Incremental | Changes since last backup | Fast | Small (but chain dependency) |
| Differential | Changes since last full | Medium | Medium |
| Logical (pg_dump, mongodump) | SQL/BSON statements to recreate | Slow restore | Portable across versions |
| Physical (pg_basebackup, xtrabackup, file-level snapshot) | Binary files + WAL | Fast restore | Tied to major version |
| Snapshot (EBS, LVM, ZFS) | Point-in-time filesystem view | Very fast | Storage layer dependent |
Point-in-Time Recovery (PITR) β
The most valuable backup capability β restore to any timestamp, not just last backup.
```
Nightly full backup + continuous WAL archiving (shipped to S3).
Restore path: restore full β replay WAL up to target timestamp.
```

PITR recovers from both hardware failure AND user error ("we accidentally DELETE'd all orders at 14:32").
RPO vs RTO β
- RPO (Recovery Point Objective) β how much data can you afford to lose? "We can lose up to 5 minutes" = RPO 5m.
- RTO (Recovery Time Objective) β how long can you be down? "We must be back in 30 minutes" = RTO 30m.
These drive architecture:
| RPO / RTO | Architecture |
|---|---|
| Hours / hours | Nightly full + backup shipping to cheap storage |
| Minutes / minutes | Continuous WAL archive + warm standby |
| Seconds / seconds | Sync replica + automated failover |
| Zero / zero | Multi-region sync quorum (Spanner, CockroachDB geo-distributed) |
Lower RPO/RTO = higher cost. Pick based on business need.
HA patterns β
- Cold standby β backup ready; operator restores on failure. Cheap, slow.
- Warm standby β replica kept up-to-date, not serving traffic. Promote on failure. RDS Multi-AZ is this.
- Hot standby β replica serving read traffic already; promoted on failure. Postgres streaming replication.
- Active-active β multi-primary; always serving from multiple nodes (see Β§12).
Multi-AZ vs multi-region β
Multi-AZ (single region): tolerates data-center / availability-zone failure. Low latency within region. AWS RDS Multi-AZ uses sync replication between AZs.
Multi-region: tolerates entire region outage. Options:
- Read replicas in other region β RPO seconds (async lag).
- Aurora Global DB β low-latency cross-region replication with storage-layer magic.
- DynamoDB global tables β active-active multi-region, LWW conflicts.
Chaos engineering + drills β
Backups untested are SchrΓΆdinger's backups. Real practices:
- Restore drills β quarterly restore to a scratch env.
- Failover drills β periodically kill the primary to exercise automation.
- Game days β planned chaos injection.
- Runbook validation β can a non-DBA engineer perform the runbook?
Regulated-environment retention angle β
Regulated compliance regimes typically require audit retention (often 1+ year online, longer in archive). Common controls:
- Append-only audit log table (can't UPDATE/DELETE from app role).
- Daily backups retained per retention policy; deletion requires dual-control.
- Encryption at rest using FIPS 140-2 modules where required.
- Backup medium must meet classification rules β e.g., can't ship regulated-data backups to unvetted storage.
25. Security β
Encryption at rest vs in transit β
At rest β data on disk. Options:
- Filesystem / volume encryption (LUKS, EBS encryption) β transparent to DB.
- Transparent Data Encryption (TDE) β engine encrypts pages (SQL Server, Oracle, MySQL Enterprise, Postgres via extensions).
- Column-level encryption β application encrypts specific columns (SSN, credit card) before insert.
For regulated environments (FIPS, FedRAMP, etc.): use FIPS 140-2 validated cryptographic modules. Key management via AWS KMS, HashiCorp Vault, or dedicated HSM.
In transit β TLS everywhere. Postgres: sslmode=verify-full; MongoDB: tls=true requireCert. Self-signed in dev; real certs in prod. Mutual TLS (mTLS) for service-to-DB auth.
SQL injection β still the #1 killer β
Insecure:
```java
String sql = "SELECT * FROM users WHERE name = '" + name + "'";
stmt.executeQuery(sql);
```

Attacker supplies `name = "'; DROP TABLE users; --"` β game over.
Secure β parameterized query:
```java
PreparedStatement ps = conn.prepareStatement("SELECT * FROM users WHERE name = ?");
ps.setString(1, name);
ps.executeQuery();
```

The driver sends the SQL template and values separately; the server never parses the user data as SQL. This isn't just escaping β it's structural separation.
JPA / Hibernate with :param or positional parameters is safe. Hibernate with string-built JPQL or native SQL is vulnerable β same rules apply.
NoSQL has analogous vulnerabilities:
- Mongo injection β `{ $where: "this.name == '" + userInput + "'" }` β arbitrary JS.
- Use the structured operators instead: `{ name: userInput }`.
Row-Level Security (RLS) β
Filter rows at the engine level based on session context. Postgres example:
```sql
ALTER TABLE patients ENABLE ROW LEVEL SECURITY;

CREATE POLICY patient_privacy ON patients
  USING (assigned_doctor_id = current_setting('app.doctor_id')::BIGINT);
```

Every query from the application role is scoped to the current user's rows β even `SELECT *`. Great for multi-tenant SaaS.
Limits: requires trusted session context setting. If attacker can SET app.doctor_id arbitrarily, RLS is bypassed. Set it in a trusted middleware.
Least-privilege roles β
Don't use the superuser from the app. Create a role per service or per permission tier:
```sql
CREATE ROLE app_read LOGIN;
GRANT SELECT ON orders, users TO app_read;

CREATE ROLE app_write LOGIN;
GRANT SELECT, INSERT, UPDATE ON orders TO app_write;
-- No DROP TABLE, no ALTER.
```

Service connects as `app_write`. An attacker who gets the connection can't drop your tables.
Audit logging β
Every sensitive operation logged to tamper-evident storage.
- Postgres: `pgaudit` extension, or trigger-based audit tables.
- DynamoDB: CloudTrail data events.
- MongoDB: audit log (Enterprise edition).
Common audit log requirements under regulated frameworks:
- Who, what, when, from where.
- Append-only (writable by audit role; readable by reviewer).
- Retention per policy.
- Integrity verification (HMAC or digital signature).
Secret management β
Never commit connection strings with passwords. Options:
- AWS Secrets Manager, HashiCorp Vault β programmatic fetch.
- Kubernetes Secrets + Sealed Secrets / External Secrets β GitOps-safe.
- IAM database authentication (RDS IAM auth) β no password at all, short-lived tokens.
Data exposure risk at the application layer β
- Disable Actuator endpoints that dump DB config (e.g. `/actuator/env`, `/actuator/configprops`).
- Sanitize exceptions β don't leak schema in error responses.
- Log slow queries at INFO/DEBUG, not with bound values unless redacted.
26. Connection Pooling & Tuning β
Opening a TCP+TLS+auth handshake for every query = death. A connection pool keeps a set of warm connections and hands them out on demand.
HikariCP (Spring Boot default) β
Pool-sizing formula (from the HikariCP "About Pool Sizing" wiki):

```
connections = (core_count Γ 2) + effective_spindle_count
```

For SSD + 8 cores: (8 Γ 2) + 1 = 17. In practice most apps tune via load testing; start at 10β20 and adjust.
Key settings:
- `maximumPoolSize` β upper bound.
- `minimumIdle` β baseline kept warm (default = max, i.e., no shrinking).
- `connectionTimeout` β ms to wait for a connection before throwing.
- `idleTimeout` β close idle connections after this (doesn't go below `minimumIdle`).
- `maxLifetime` β hard cap; closes a connection at this age (should be a bit less than the DB-side timeout).
- `leakDetectionThreshold` β warn when a connection is held longer than this many ms.
What happens when the pool is exhausted β
All connections in use β a new request waits up to `connectionTimeout`, then HikariCP throws a `SQLTransientConnectionException` (a subclass of `SQLException`).
Causes:
- Long-running queries holding connections.
- Leaks β forgot to close.
- Pool sized too small.
- Downstream (DB) slow.
Mitigation: statement timeouts (SET statement_timeout = '5s'), leak detection, circuit breakers, auto-scaling pool.
Pool sizing reality β
Too small β requests queue, latency climbs under load. Too big β more connections than DB can usefully serve; connections spend time context-switching. Postgres default max_connections = 100 is not a lot; every connection consumes ~10MB of backend memory.
In microservice architectures: total microservices Γ pool_size can exceed DB's max_connections. Use PgBouncer / ProxySQL as a shared pool layer.
PgBouncer pooling modes β
| Mode | How | Use case |
|---|---|---|
| Session pooling | Client gets a backend for the whole session | Behaves like a direct connection; no functional change |
| Transaction pooling | Backend handed over at transaction boundaries | Highest density; app mustn't use session-state features (prepared statements with names, SET, advisory locks) |
| Statement pooling | Backend handed over per statement | Rarely used; no multi-statement transactions |
Transaction pooling is the magic: 10,000 clients can share 20 backends if they each hold a transaction for < 50ms on average.
Prepared statement caching β
Most drivers cache prepared statements per connection. Pool interactions:
- Session pooling: cache works fine.
- Transaction pooling: named prepared statements break because a new backend might not have seen the prepare. Disable server-side prepared statements (`prepareThreshold=0` on the Postgres JDBC driver).
Statement and lock timeouts β
```sql
SET statement_timeout = '5s';   -- aborts any query running > 5s
SET lock_timeout = '2s';        -- fails fast if a lock can't be acquired
SET idle_in_transaction_session_timeout = '30s';
```

Set these per app connection β prevents runaway queries from tying up resources.
Interview Qs covered β
Β§26 addresses Q9 (connection pooling).
27. JPA / Hibernate Specifics β
You'll get DB-round questions phrased in ORM terms. Know the entity lifecycle, the caches, and the N+1 remedies.
Entity lifecycle β
| State | Meaning |
|---|---|
| Transient | new User() β not tracked; no ID; no persistence context. |
| Managed | In the persistence context; changes tracked; flushed at commit. Result of persist() or find(). |
| Detached | Was managed, persistence context closed. Changes not tracked. Use merge() to reattach. |
| Removed | entityManager.remove(entity) called; will be DELETE'd at flush. |
Flush is what actually emits SQL. Flush modes:

- `AUTO` (default) β before queries and at commit.
- `COMMIT` β only at commit.
- `MANUAL` (Hibernate-specific) β only when you call `flush()`.
First-level vs second-level cache β
First-level (L1) β per EntityManager / persistence context. Automatic. Guarantees: within one transaction, `find(User.class, 42)` twice returns the same object. Cleared on close.
Second-level (L2) β optional, per SessionFactory (process-wide). Configurable per entity. Providers: Ehcache, Caffeine, Hazelcast, Infinispan, Redis (via Redisson).
Cache strategies:

- `READ_ONLY` β immutable data (country codes).
- `READ_WRITE` β with soft locks; slower writes.
- `NONSTRICT_READ_WRITE` β eventual consistency.
- `TRANSACTIONAL` β XA-ish.
Query cache β caches result sets from JPQL queries. Rarely worth it β invalidation kills you.
L2 cache caveats:
- Another process modifying the DB directly won't invalidate it.
- Cache across cluster needs distributed invalidation (Infinispan).
- Can leak stale data if not tuned.
Fetch types β LAZY vs EAGER β
```java
@OneToMany(fetch = FetchType.LAZY)   // default for collections
private List<Order> orders;

@ManyToOne(fetch = FetchType.EAGER)  // default for singular relations!
private User user;
```

EAGER means every time you fetch a child, JPA fetches the parent. Easy to accidentally turn N+1 into NΓM. Default everything to LAZY; fetch eagerly at query time when you need it.
N+1 problem β the JPA flavor β
```java
List<Order> orders = em.createQuery("FROM Order", Order.class).getResultList(); // 1 query
for (Order o : orders) {
    o.getLineItems().size(); // LAZY trigger β 1 query PER order
}
```

Fixes (pick per situation):
```java
// 1. JOIN FETCH in JPQL β eager just for this query
em.createQuery("SELECT o FROM Order o JOIN FETCH o.lineItems", Order.class)

// 2. Entity graph β declare the graph at query time
EntityGraph<?> g = em.createEntityGraph(Order.class);
g.addAttributeNodes("lineItems");
em.createQuery("FROM Order").setHint("jakarta.persistence.fetchgraph", g)

// 3. @BatchSize on the association β Hibernate IN-query batches
@OneToMany @BatchSize(size = 50) private List<LineItem> lineItems;

// 4. DTO projection β bypass entities entirely, flat result rows
em.createQuery("SELECT new OrderSummary(o.id, COUNT(l)) FROM Order o LEFT JOIN o.lineItems l GROUP BY o.id", OrderSummary.class)
```

Detecting N+1:
- `hibernate.generate_statistics=true` β `statistics.getQueryExecutionCount()` in tests.
- datasource-proxy / p6spy to count emitted statements.
- A Hibernate `StatementInspector` registered via `hibernate.session_factory.statement_inspector`.
@Version β optimistic locking β
```java
@Entity class Order {
    @Id Long id;
    @Version Long version;
    String status;
}
```

Every UPDATE issued by Hibernate appends `AND version = ?`. If affected rows = 0, it throws `OptimisticLockException`. Classic use: long-running user edits (open form at 9:00, save at 9:05; someone else saved at 9:03 β you get a conflict instead of silently overwriting).
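Under the hood this is a compare-and-set on the version column. An in-memory sketch of the `UPDATE ... WHERE id = ? AND version = ?` semantics (illustrative, not Hibernate code):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Simulates the versioned UPDATE Hibernate emits: the write succeeds only
// if the row still carries the version the transaction originally read.
final class VersionedTable {
    record Row(String status, long version) {}
    private final Map<Long, Row> rows = new ConcurrentHashMap<>();

    void insert(long id, String status) { rows.put(id, new Row(status, 0)); }
    Row read(long id) { return rows.get(id); }

    // UPDATE orders SET status = ?, version = version + 1
    //   WHERE id = ? AND version = ?   -- 0 rows affected β conflict
    boolean update(long id, String newStatus, long expectedVersion) {
        Row cur = rows.get(id);
        if (cur == null || cur.version() != expectedVersion) return false;
        rows.put(id, new Row(newStatus, expectedVersion + 1));
        return true;
    }
}
```

A `false` return is where Hibernate would raise `OptimisticLockException`.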
DTO projections vs entities β
Rule of thumb: entities for writes, DTO projections for reads. Reasons:
- Entities are fully-populated graphs β over-fetching by default.
- Entities carry lazy proxies β LazyInitializationException outside a session.
- Entities with an open persistence context auto-flush dirty state β danger in read-only code.
- DTOs select only the columns you need β smaller query, better plan.
```java
// Interface projection
interface OrderSummary { Long getId(); Integer getLineCount(); }

List<OrderSummary> find(Long userId);
```

Native queries β
```java
@Query(value = "SELECT * FROM orders WHERE jsonb_col @> CAST(:filter AS jsonb)", nativeQuery = true)
List<Order> search(@Param("filter") String filter);
```

Use for DB-specific features (JSONB operators, window functions, hints). Downside: less portable, no compile-time checking.
Spring @Transactional and JPA β
- Spring manages the `EntityManager` + transaction via proxy (same caveats as any Spring AOP – §86 in your Spring study).
- Default propagation `REQUIRED` – joins an existing transaction or starts a new one.
- Default isolation = DB default (read committed for Postgres).
- Default rollback = unchecked exceptions only. Set `rollbackFor = Exception.class` for checked.
- Self-invocation bypasses the proxy – `this.someMethod()` won't start a new transaction.
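The self-invocation caveat is plain Java, not Spring magic. A minimal JDK dynamic-proxy sketch – the names are illustrative, and the handler stands in for Spring's transaction advice:

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;
import java.util.ArrayList;
import java.util.List;

interface Service { void outer(); void inner(); }

class ServiceImpl implements Service {
    public void outer() { inner(); }  // self-invocation: a plain Java call, no proxy in between
    public void inner() {}
}

class ProxyDemo {
    static List<String> intercepted = new ArrayList<>();

    static Service proxied(Service target) {
        InvocationHandler h = (proxy, method, args) -> {
            intercepted.add(method.getName());   // stands in for "begin/commit transaction"
            return method.invoke(target, args);
        };
        return (Service) Proxy.newProxyInstance(
                Service.class.getClassLoader(), new Class<?>[]{Service.class}, h);
    }
}
```

Calling `outer()` through the proxy records one interception; the internal call to `inner()` never crosses the proxy, exactly like `this.someMethod()` skipping `@Transactional` advice.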
Interview Qs covered β
§27 addresses Q8 (N+1 – Hibernate-specific fixes).
28. Spring Data Specifics β
JpaRepository method-name parsing β
```java
interface OrderRepo extends JpaRepository<Order, Long> {
    List<Order> findByUserIdAndStatus(Long userId, Status status);
    Optional<Order> findTopByUserIdOrderByCreatedAtDesc(Long userId);
    Stream<Order> streamByStatus(Status status); // returns a resource-backed stream
    List<Order> findDistinctByUserIn(List<Long> ids);
    int countByStatus(Status status);
}
```

Spring Data parses method names into queries (`findBy`, `readBy`, `queryBy`, `existsBy`, `countBy`, `deleteBy`). Keywords: `And`, `Or`, `Between`, `LessThan`, `Like`, `IgnoreCase`, `OrderBy`, `In`, `NotIn`, `True`, `False`, `IsNull`, `IsNotNull`.
Prefer `@Query` for anything non-trivial – the method-name grammar becomes hard to read beyond three or four conjuncts.
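To see the idea behind derived queries, here is a toy sketch of the parsing – emphatically not Spring Data's real parser (which handles `Or`, nested properties, operators, and much more):

```java
import java.util.Arrays;
import java.util.List;

// Toy illustration: strip the verb prefix, split the criteria on "And",
// and render a WHERE clause with positional parameters. Property names
// containing "And" would break this – one reason the real grammar is subtle.
class MethodNameParser {
    static String toWhereClause(String methodName) {
        String criteria = methodName.replaceFirst(
                "^(find|read|query|count|exists|delete)(Distinct)?(Top\\d*)?By", "");
        List<String> parts = Arrays.asList(criteria.split("And"));
        StringBuilder sb = new StringBuilder("WHERE ");
        for (int i = 0; i < parts.size(); i++) {
            if (i > 0) sb.append(" AND ");
            String p = parts.get(i);
            // lowerCamel the property name, bind to a positional parameter
            sb.append(Character.toLowerCase(p.charAt(0))).append(p.substring(1))
              .append(" = ?").append(i + 1);
        }
        return sb.toString();
    }
}
```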
@Query annotations β
```java
// JPQL (entity-oriented, portable)
@Query("SELECT o FROM Order o WHERE o.user.id = :uid AND o.total > :min")
List<Order> findBigOrders(@Param("uid") Long uid, @Param("min") BigDecimal min);

// Native SQL
@Query(value = "SELECT * FROM orders WHERE jsonb_col @> CAST(?1 AS jsonb)", nativeQuery = true)
List<Order> searchJson(String filter);

// Modifying (non-SELECT)
@Modifying
@Query("UPDATE Order o SET o.status = :s WHERE o.id = :id")
int updateStatus(@Param("id") Long id, @Param("s") Status s);
```

`@Modifying` queries don't touch the persistence context – managed entities already in memory won't see the change. Call `em.clear()` afterwards (or use `@Modifying(clearAutomatically = true)`) if the caller then reads the same entities.
Transaction propagation levels in Spring Data β
```java
@Transactional(propagation = Propagation.REQUIRED)      // default
@Transactional(propagation = Propagation.REQUIRES_NEW)  // suspend caller, new tx
@Transactional(propagation = Propagation.NESTED)        // savepoint within caller
@Transactional(propagation = Propagation.SUPPORTS)      // join if exists, no tx otherwise
@Transactional(propagation = Propagation.MANDATORY)     // must be in a tx; else error
@Transactional(propagation = Propagation.NEVER)         // must NOT be in a tx
@Transactional(propagation = Propagation.NOT_SUPPORTED) // suspend any caller tx
```

Interview staple – "give a use case for each":

- `REQUIRES_NEW` – audit-log insert that must survive even if the caller rolls back.
- `NESTED` – optional step where failure rolls back just the sub-work.
- `MANDATORY` – utility method asserting a caller already opened a tx.
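The REQUIRES_NEW behavior can be sketched with a toy buffer-per-transaction model. `TxDemo` is hypothetical – not Spring's transaction manager – but it shows why the audit row survives:

```java
import java.util.ArrayList;
import java.util.List;

// Toy model: each "transaction" buffers its writes; commit copies the buffer
// into `committed`, a thrown exception discards it (rollback).
class TxDemo {
    static List<String> committed = new ArrayList<>();

    interface Work { void run(List<String> buffer); }

    // Runs work in its own independent transaction (REQUIRES_NEW-style).
    static void inNewTx(Work w) {
        List<String> buffer = new ArrayList<>();
        try {
            w.run(buffer);
            committed.addAll(buffer);   // commit
        } catch (RuntimeException e) {
            // rollback: buffer discarded, nothing reaches `committed`
            throw e;
        }
    }
}
```

Nesting one `inNewTx` inside another gives the classic audit-log story: the inner "transaction" commits on its own before the outer one blows up.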
Repository projection types β
- Interface-based – lightweight; Spring generates a proxy returning only the selected fields.
- Class-based (DTO) – `new DtoClass(...)` in JPQL.
- Dynamic – `<T> List<T> findByStatus(Status s, Class<T> type);` – the caller picks the projection.
Spring Data MongoDB β
- `MongoTemplate` – imperative, hand-built queries.
- `MongoRepository<T, ID>` – derived queries like JPA.
- `@Query` on Mongo repos takes a JSON filter string: `@Query("{ 'status': ?0 }")`.
- `@Aggregation` – declare an aggregation pipeline.
- Transactions – require a replica set; `@Transactional` works but mind the limits (§16).
Spring Data DynamoDB (via AWS SDK Enhanced Client) β
Spring Data DynamoDB (community) exists but is unofficial; the AWS-blessed path is the Enhanced Client with `DynamoDbTable<Entity>`.
```java
DynamoDbTable<Order> table = enhancedClient.table("orders", TableSchema.fromBean(Order.class));
table.putItem(order);
Order o = table.getItem(Key.builder().partitionValue("alice").sortValue("O#42").build());

QueryConditional q = QueryConditional.keyEqualTo(k -> k.partitionValue("alice"));
PageIterable<Order> pages = table.query(r -> r.queryConditional(q).limit(20));
```

Annotate POJOs with `@DynamoDbBean`, `@DynamoDbPartitionKey`, `@DynamoDbSortKey`, `@DynamoDbSecondaryPartitionKey(indexNames = "GSI1")`.
29. Observability & Troubleshooting β
When a database is slow or failing, you debug with metrics and logs – not by squinting at code.
Postgres observability β
Key catalog / stats views:
| View | What it tells you |
|---|---|
| `pg_stat_activity` | Active connections, current query, state, wait events – "what's running right now?" |
| `pg_stat_user_tables` | Per-table seq/index scans, live/dead tuples |
| `pg_stat_user_indexes` | Per-index scan counts – find unused indexes |
| `pg_stat_statements` (extension) | Aggregated query stats: total time, mean, calls, rows |
| `pg_locks` | Current locks held and waited on |
| `pg_stat_bgwriter` | Checkpoint stats; buffer writes |
| `pg_stat_replication` | Per-replica state and lag |
Example: find slowest queries:
```sql
SELECT query, calls, mean_exec_time, total_exec_time
FROM pg_stat_statements
ORDER BY total_exec_time DESC LIMIT 10;
```

Slow query log – `log_min_duration_statement = 1000` logs every query > 1s. Essential in prod.
MongoDB observability β
- Database profiler – `db.setProfilingLevel(1, { slowms: 100 })` logs operations > 100 ms into `system.profile`.
- `db.currentOp()` – active operations; similar to `pg_stat_activity`.
- `db.serverStatus()` – connection counts, network, opcounters, WiredTiger cache.
- `mongotop` – per-collection read/write times.
DynamoDB observability β
- CloudWatch metrics – `ConsumedReadCapacityUnits`, `ConsumedWriteCapacityUnits`, `ThrottledRequests`, `SuccessfulRequestLatency`.
- Contributor Insights – top partition keys by traffic (hot-key detector).
- CloudTrail – per-operation audit.
Three-pillars recap for DBs β
- Logs – slow query log, connection errors, deadlock reports.
- Metrics – connections, QPS, mean/p99 latency, cache hit rate, replication lag, disk I/O.
- Traces – OpenTelemetry DB spans (`db.statement`, `db.system`, `db.name`) linking app spans to SQL.
Alerting SLIs β
Good DB SLIs to track (with alert thresholds):

- Connection saturation – `pool_in_use / pool_max`.
- Replication lag – alert at > 30 s.
- Deadlock rate – > 0 is a smell.
- Error rate – failing statements/sec.
- p99 latency – correlates with user experience.
- Disk free – alert at 20% remaining.
- Bloat ratio (Postgres) – > 30% dead tuples means VACUUM isn't keeping up.
Avoid alerting on resource utilization alone (CPU 80% isn't a problem if latency is fine). Alert on symptoms that affect users.
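The threshold checks above can be sketched as code. Threshold values mirror the list; the class and method names are illustrative, not a real monitoring API:

```java
// Symptom-first alerting sketch: alert on saturation, lag, and disk headroom,
// not on raw CPU. Thresholds match the SLI list above.
class DbAlerts {
    static boolean connectionSaturated(int inUse, int poolMax) {
        return (double) inUse / poolMax > 0.9;  // > 90% of the pool in use
    }
    static boolean replicationLagging(double lagSeconds) {
        return lagSeconds > 30;                 // the "> 30 s" rule of thumb
    }
    static boolean diskLow(double freeFraction) {
        return freeFraction < 0.20;             // alert at 20% free
    }
}
```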
30. Connect to Your Experience β
This section is for tying study to stories. Every answer is stronger when anchored in something you've actually built.
Anchor example: legacy MQ microservice β messaging meets DB β
- Processed 10k+ transactions/day from IBM MQ, routed to downstream services.
- DB decision: each transaction needs idempotent storage (exactly-once into downstream systems). Natural fit: insert into a claim/outbox table keyed on MQ message ID, with a UNIQUE constraint for dedup.
- Transactions: XA or the outbox pattern (§6) – MQ ack + DB insert must be atomic. Outbox is usually the right call over XA because XA is brittle.
- Isolation: REPEATABLE READ or SERIALIZABLE for the consumer's claim-then-process loop; otherwise double-processing on a pod crash.
Anchor example: cross-team schema standardization β databases of events β
- Avro schemas are the write-time schema check for Kafka – analogous to `CHECK` constraints + column types in a DB.
- Schema evolution rules (backward / forward / full) map to §7–8 from data-format: add optional = backward-compatible, remove required = breaking.
- The schema registry is itself a small metadata store – Confluent's persists to a Kafka topic; others (e.g., Apicurio) can back onto Postgres.
Anchor example: JAXB migration β XML / XSD as schema β
- XSD is to XML what DDL is to tables. Strict validation on parse = fail-fast.
- Thymeleaf was string-templating; JAXB is schema-driven serialization. Type-safe, validated, versioned.
Anchor example: regulated-environment compliance β where DB design meets the auditor β
- Encryption at rest with FIPS 140-2 validated modules – drives TDE / KMS choices.
- Audit retention – §24 append-only audit table pattern.
- Row-level security for multi-tenant sharing – §25.
- Backup medium classification β backup of regulated data can't flow to unvetted storage without authorization.
Anchor example: productivity app (Spring Boot + MongoDB) β
- MongoDB fits flexible habit/rule/notification documents.
- `@DataMongoTest` with Testcontainers – integration tests against real Mongo, not a mock.
- Good example for "when MongoDB makes sense": schema varies per user, documents are the natural unit, no cross-entity invariants demanding SQL transactions.
Anchor example: intern mentorship β 30% PR defect reduction β
- If databases come up: TDD focus + code review catches N+1, missing indexes, unsafe string concatenation (SQL injection), unscoped transactions.
- Concrete checklist item: "every new query has an EXPLAIN plan in the PR description for large tables."
31. Rapid-Fire Review β
One-liner cheat answers for the morning-of. In TOC order.
- OLTP vs OLAP – small tx-heavy vs large scan-heavy; row vs column storage.
- Surrogate key default – prefer a synthetic PK (BIGSERIAL or UUIDv7); enforce natural uniqueness as UNIQUE.
- NULL gotcha – `= NULL` never matches; use `IS NULL`. `NOT IN` is broken by any NULL in the subquery.
- Inner vs outer joins – inner drops unmatched rows; outer keeps them with NULLs on the missing side.
- Join algorithms – nested loop (small outer + indexed inner), hash (equi-join, memory-bound), merge (pre-sorted, large).
- CTE – named sub-query; Postgres 12+ inlines non-recursive single-reference CTEs unless `AS MATERIALIZED`.
- Window functions – compute per row over a partition without collapsing; `ROW_NUMBER`, `RANK`, `LAG`, aggregates OVER.
- 3NF – no partial deps (2NF), no transitive deps (3NF). BCNF stricter (every determinant is a superkey).
- Denormalize – when reads dominate, joins hurt latency, or staleness is acceptable.
- ACID – Atomicity (all or nothing), Consistency (constraints hold), Isolation (concurrent tx appear serial), Durability (committed survives crash).
- Isolation levels – RU → RC → RR → Serializable; each prevents more anomalies (dirty → non-repeatable → phantom → write skew).
- MVCC – writers don't block readers; old versions live until VACUUM (Postgres) or undo purge.
- Pessimistic lock – `SELECT FOR UPDATE`; best for contended hot rows. Optimistic – `@Version`; retry on conflict; best for rare conflicts.
- Deadlock – always lock rows in a consistent order.
- B-tree vs hash vs GIN – B-tree for range + equality, hash for equality only, GIN for arrays / JSONB / full-text.
- Composite index prefix rule – index `(a,b,c)` supports `a`, `a+b`, `a+b+c` – not `b` alone.
- Covering index – query answered from the index alone; no heap visit. `INCLUDE (col)` in Postgres 11+.
- LSM vs B-tree – LSM write-optimized with background compaction; B-tree read-optimized with in-place update.
- WAL – write-ahead log; every change logged before the heap update → atomicity + durability.
- VACUUM – reclaims dead tuples, prevents xid wraparound, enables index-only scans.
- EXPLAIN ANALYZE – actual rows/timings per node; read bottom-up. Watch for Seq Scan on big tables + row estimates off by 10x (stale stats).
- N+1 fix – JOIN FETCH / @EntityGraph / @BatchSize / DTO projection.
- Sync vs async replication – sync = no loss + slower commit; async = fast + lag window.
- CAP – during a partition, pick C or A. PACELC adds: even without a partition, pick Latency or Consistency.
- Quorum – `R + W > N` for strong consistency.
- Sharding – consistent hashing to avoid mass re-homing; avoid hot shards via a write-sharding suffix.
- MongoDB shard key – high-cardinality, even distribution, matches queries, non-monotonic.
- Mongo transactions – supported 4.0+; limits (timeouts, oplog pressure); prefer embedding instead.
- DynamoDB PK/SK – items with the same PK live together, sorted by SK. Single-table design = multiple entity types in one table.
- GSI vs LSI – GSI any PK, created anytime, eventually consistent; LSI same PK, created at table creation, can be strongly consistent.
- Cache-aside – app reads/writes the cache; miss → DB → populate. Standard pattern.
- Invalidation strategies – TTL, delete-on-write, write-through, event-driven (CDC), version tags.
- Thundering herd – distributed lock on miss, probabilistic early expiry, or stale-while-revalidate.
- Star schema – fact + dimensions; denormalized for OLAP speed.
- SCD 2 – track history with effective_from/to + is_current on the dimension.
- ELT > ETL in modern warehouses – Snowflake/BigQuery/Redshift make in-DB transform cheap; dbt is the idiom.
- Inverted index – term → doc list; the basis of Elasticsearch.
- Expand-contract migration – add new + dual-write + backfill + switch + remove old. Every step backward-compatible.
- Online index – Postgres `CREATE INDEX CONCURRENTLY`; MySQL `ALGORITHM=INPLACE, LOCK=NONE`.
- PITR – full backup + continuous WAL archive = restore to any timestamp.
- RPO / RTO – how much data you can lose / how long you can be down. Drives architecture.
- Parameterized queries – separate SQL from values; the only safe defense against injection.
- RLS – filter rows at the engine level by session context; multi-tenant superpower.
- HikariCP pool sizing – start at ~(cores × 2) + 1; tune by load test.
- Statement timeout – cap query runtime at the connection level.
- PgBouncer transaction pooling – high-density connection sharing; forbids session-state features.
- Hibernate L1 cache – per persistence context (session), automatic, same-object guarantee. L2 – optional, process-wide, careful with clusters.
- LAZY default – always use LAZY on associations; fetch eagerly with JOIN FETCH when needed.
- `@Transactional` caveats – self-invocation bypasses the proxy; only unchecked exceptions roll back by default; `rollbackFor = Exception.class` to change.
- `@Version` – optimistic locking; `AND version = ?` added to the UPDATE; `OptimisticLockException` on 0 rows affected.
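The quorum one-liner lends itself to a worked check. A minimal sketch (hypothetical helper, pigeonhole argument in the comment):

```java
// With N replicas, read sets of size R and write sets of size W:
// any read set must overlap any write set iff R + W > N, so every
// quorum read sees the latest quorum-committed write.
class Quorum {
    static boolean strongReads(int n, int r, int w) {
        return r + w > n; // |R-set| + |W-set| > |replicas| forces an intersection
    }
}
```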
32. Practice Exercises β
Work through these a few days before the interview. Answer out loud; draw diagrams.
Exercise 1 β write a window-function query β
Given orders(id, user_id, total, created_at), return each user's top 3 orders by total, with rank. Now do it without window functions (correlated subquery).
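To sanity-check your SQL answer, here is a hedged in-memory equivalent of `ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY total DESC) <= 3` – the `Order` record and `Top3` class are hypothetical scaffolding, not part of the exercise:

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Group by user (the PARTITION BY), sort each group by total descending
// (the ORDER BY), keep the first three rows (rank <= 3).
class Top3 {
    record Order(long id, long userId, double total) {}

    static Map<Long, List<Order>> top3PerUser(List<Order> orders) {
        return orders.stream()
                .collect(Collectors.groupingBy(Order::userId,
                        Collectors.collectingAndThen(Collectors.toList(),
                                l -> l.stream()
                                      .sorted(Comparator.comparingDouble(Order::total).reversed())
                                      .limit(3)
                                      .toList())));
    }
}
```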
Exercise 2 β design a schema β
A read-write book tracking app. Users can:
- Mark books read / in-progress / want-to-read.
- Rate books 1β5 and write reviews.
- Follow other users; see their recent activity.
Requirements: 3NF for the core, then identify one denormalization you'd make for the activity feed and explain why.
Exercise 3 β read an EXPLAIN plan β
```
Nested Loop  (cost=0..50000 rows=1 width=40) (actual time=0.5..9800 rows=1 loops=1)
  ->  Seq Scan on orders o  (cost=0..15000 rows=10000 width=16) (actual time=0.05..150 rows=100000 loops=1)
        Filter: (status = 'active')
        Rows Removed by Filter: 900000
  ->  Index Scan using idx_users_id on users u  (cost=0.5..3.5 rows=1 width=24) (actual time=0.09..0.09 rows=1 loops=100000)
        Index Cond: (id = o.user_id)
```

Questions:
- What's wrong?
- What would you change?
- What stats or indexes are missing?
(Hint: stats off by 10x, 900k rows filtered on status, 100k inner loops. Partial index on WHERE status = 'active', or a compound index.)
Exercise 4 β DynamoDB access pattern β
Design a single-table schema for a project-management app:
- Workspace → many Projects → many Tasks.
- Query: list all workspaces for a user.
- Query: list all projects for a workspace.
- Query: list all tasks in a project, by due date.
- Query: list all tasks assigned to a user across workspaces.
- Query: full-text search on task title.
What are PK/SK patterns? Which queries need a GSI? Which need a different tool (e.g., Elasticsearch)?
Exercise 5 β spot the N+1 β
```java
@GetMapping("/orders")
List<OrderDTO> recent() {
    return orderRepo.findTop100ByOrderByCreatedAtDesc().stream()
            .map(o -> new OrderDTO(
                    o.getId(),
                    o.getUser().getEmail(),
                    o.getLineItems().size(),
                    o.getLineItems().stream().mapToDouble(LineItem::getPrice).sum()))
            .toList();
}
```

How many queries does this emit in the worst case? How would you fix it?
Exercise 6 β pick the isolation level β
Use case 1: online shop; place order, decrement stock. Use case 2: analytics dashboard; multi-table aggregate that must be self-consistent. Use case 3: scheduler; assign a job to a worker from a pool, no double-assignment.
Match each to the minimum isolation level (or alternative: advisory lock, SELECT FOR UPDATE SKIP LOCKED, etc.) and defend.
Exercise 7 β zero-downtime rename β
Rename events.user (TEXT) to events.user_id (BIGINT, FK to users). There are 500M rows. The app has 20 microservices touching it. Write the migration plan with deploy stages.
Exercise 8 β cache strategy β
Product page renders 5 database queries. Avg latency 250ms, p99 800ms. Users refresh often. Write-to-read ratio is 1:1000.
Design a caching strategy. What's the cache key? TTL? Invalidation mechanism? What if two requests miss at the same time?
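One possible shape for the answer: cache-aside with per-key single-flight, so simultaneous misses collapse into one load. This is a single-JVM sketch (`computeIfAbsent` is atomic per key); across instances you'd need a distributed lock or stale-while-revalidate. The class name is illustrative:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Cache-aside: read the cache, on miss load from the "DB" and populate.
// computeIfAbsent locks per key, so two concurrent misses trigger one load.
class SingleFlightCache<K, V> {
    private final Map<K, V> cache = new ConcurrentHashMap<>();
    final AtomicInteger loads = new AtomicInteger();   // exposed for the demo only

    V get(K key, Function<K, V> loader) {
        return cache.computeIfAbsent(key, k -> {
            loads.incrementAndGet();                   // the "DB query", at most once per key
            return loader.apply(k);
        });
    }

    void invalidate(K key) { cache.remove(key); }      // delete-on-write hook
}
```

With a 1:1000 write-to-read ratio, delete-on-write invalidation plus a short TTL as a safety net is a reasonable starting point.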
Exercise 9 β replica lag diagnosis β
Users report: "I place an order but it doesn't appear in my history until I refresh 5 seconds later."
The app uses a read-replica for SELECTs. Describe the failure mode, options to fix, and tradeoffs.
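One of the fix options can be sketched: pin a user's reads to the primary for a grace window after their own write ("read-your-writes" stickiness). Names and the time-as-parameter style are illustrative:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Route a user's reads to the primary while their latest write may still be
// missing on the lagging replica; everyone else keeps reading the replica.
class ReadRouter {
    private final Map<String, Long> lastWriteMillis = new ConcurrentHashMap<>();
    private final long graceMillis;

    ReadRouter(long graceMillis) { this.graceMillis = graceMillis; }

    void recordWrite(String userId, long nowMillis) {
        lastWriteMillis.put(userId, nowMillis);
    }

    String routeRead(String userId, long nowMillis) {
        Long last = lastWriteMillis.get(userId);
        return (last != null && nowMillis - last < graceMillis) ? "primary" : "replica";
    }
}
```

Tradeoff: extra read load lands on the primary for the grace window, and the window must exceed worst-case replica lag to actually hide it.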
Exercise 10 β CAP in the wild β
You're designing a global leaderboard for a game. Writes: score updates from 50+ regions. Reads: top-100 display every second to millions.
What's your consistency model? What engine? What's the RPO? How do you handle a region partition?
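One defensible direction leans AP with convergent merges: each region keeps its own best-score map, and merging takes the max per player, which is commutative and idempotent, so regions can sync in any order (and replay) after a partition heals. This is an illustrative sketch, not any specific engine's API:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Per-region best-score maps merged with max(): a grow-only, conflict-free
// merge, so the global leaderboard converges regardless of sync order.
class Leaderboard {
    static Map<String, Integer> merge(List<Map<String, Integer>> regions) {
        Map<String, Integer> merged = new HashMap<>();
        for (Map<String, Integer> region : regions) {
            region.forEach((player, score) -> merged.merge(player, score, Math::max));
        }
        return merged;
    }
}
```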
This guide is a living document. Update it as you learn. The best preparation is working real examples out loud – don't just re-read.