Databases — Senior Engineer Study Guide
Companion to INTERVIEW_PREP.md §8. This guide is the teaching layer for databases: concepts explained from first principles, SQL/CLI examples, comparison tables, gotchas, and anchor examples (a legacy MQ microservice, a MongoDB productivity app, regulated-environment compliance, schema-enforced writes via Avro).
Scope: Relational (Postgres-first, MySQL/SQL Server as callouts) + NoSQL (MongoDB, DynamoDB deep) + distributed-DB theory + caching + OLAP + search + JPA/Spring Data. Stack weighting: Postgres = canonical SQL engine, MongoDB = canonical document store, DynamoDB = canonical managed KV/wide-column. Other engines get "what differs" callouts.
How to use: Skim the Q-Map to jump to the section answering a specific INTERVIEW_PREP §8 question. For open study, walk §1 → §30 in order. Morning-of-interview: §31 Rapid-Fire. Pre-onsite drill: §32 Practice Exercises.
Table of Contents
- Database Landscape & Taxonomy
- Relational Model & SQL Fundamentals
- Joins Deep Dive
- Advanced SQL
- Schema Design & Normalization
- ACID & Transactions
- Isolation Levels & Anomalies
- Concurrency Control
- Indexing Deep Dive
- Storage Internals
- Query Optimization
- Replication
- Sharding & Partitioning
- Distributed Systems for DBs
- NoSQL Categories
- MongoDB Deep Dive
- DynamoDB Deep Dive
- Postgres Specifics
- MySQL / SQL Server Callouts
- Caching
- OLAP & Data Warehousing
- Search (Elasticsearch / OpenSearch)
- Schema Migrations
- Backups, HA, DR
- Security
- Connection Pooling & Tuning
- JPA / Hibernate Specifics
- Spring Data Specifics
- Observability & Troubleshooting
- Connect to Your Experience
- Rapid-Fire Review
- Practice Exercises
Interview-Question Coverage Matrix
Maps each INTERVIEW_PREP.md §8 question (1–21) to the section(s) in this guide that answer it.
| Q# | Topic | Section |
|---|---|---|
| 1 | ACID — each letter with example | §6 |
| 2 | Isolation levels + anomalies (dirty/non-repeatable/phantom) | §7 |
| 3 | MVCC and Postgres | §8, §10 |
| 4 | Indexes — B-tree vs hash vs GIN; when indexes hurt | §9 |
| 5 | Covering index | §9 |
| 6 | When to denormalize | §5 |
| 7 | EXPLAIN/EXPLAIN ANALYZE — reading plans | §11 |
| 8 | N+1 query problem (Hibernate/JPA) | §11, §27 |
| 9 | Connection pooling (HikariCP) | §26 |
| 10 | Pessimistic vs optimistic locking | §8 |
| 11 | Client disconnects mid-transaction | §6 |
| 12 | MongoDB vs Postgres — when | §15, §16 |
| 13 | MongoDB indexes — compound prefix | §16 |
| 14 | MongoDB transactions — limitations | §16 |
| 15 | DynamoDB — partition key vs sort key, hot partition | §17 |
| 16 | DynamoDB — GSI vs LSI, consistency | §17 |
| 17 | DynamoDB — single-table design | §17 |
| 18 | DynamoDB — one-to-many modeling | §17 |
| 19 | DynamoDB — RCU/WCU, on-demand vs provisioned | §17 |
| 20 | Zero-downtime column rename | §23 |
| 21 | Cache invalidation strategies | §20 |
1. Database Landscape & Taxonomy
Before you pick a database, know the axes you're picking on. Senior interviews test whether you can match a workload to an engine — not whether you can recite features.
Workload axis — OLTP vs OLAP
| | OLTP (transactional) | OLAP (analytical) |
|---|---|---|
| Workload | Many small reads/writes, latency-sensitive | Few large scans/aggregates, throughput-sensitive |
| Rows touched per query | 1s–100s | Millions–billions |
| Storage layout | Row-oriented | Column-oriented |
| Examples | Postgres, MySQL, MongoDB, DynamoDB | Redshift, BigQuery, Snowflake, ClickHouse, DuckDB |
| Indexes | B-tree heavy | Zone maps, bloom filters, no per-row indexes |
| Concurrency | High — ACID transactions | Low — batch or micro-batch |
OLTP = "what happened to order #42?" OLAP = "how did revenue trend across the last 90 days by region?"
Model axis — SQL vs NoSQL (NoSQL is four things, not one)
- Document — self-describing JSON-like records; flexible schema. MongoDB, Couchbase, Firestore.
- Key-value (KV) — opaque value keyed by a simple identifier; fastest possible lookup. Redis, Memcached, DynamoDB (when used trivially), etcd.
- Wide-column — rows keyed by partition + clustering keys, columns vary per row. Cassandra, HBase, DynamoDB (in single-table-design usage), ScyllaDB.
- Graph — nodes + typed edges optimized for traversal. Neo4j, JanusGraph, Amazon Neptune.

Two other non-relational families worth knowing:
- Time-series — writes dominate, natural clock-based partitioning. InfluxDB, TimescaleDB, Prometheus TSDB.
- Vector — nearest-neighbor search over embeddings (HNSW, IVF). pgvector, Pinecone, Weaviate, Milvus.
The decision tree (spoken out loud in an interview)
- Need strong relational integrity + joins + ACID across multiple rows? → SQL (Postgres default).
- Access pattern is simple lookup, and you already know every query shape? → KV (DynamoDB, Redis).
- Data is natural documents with variable shape, and you don't cross-join? → Document (MongoDB).
- Write volume is massive, you only query by key, and consistency is tunable? → Wide-column (Cassandra / DynamoDB SLT).
- Data is relationships-first (social graph, fraud rings)? → Graph.
- Data is time + metric first? → Time-series.
- Data is embeddings (semantic search, RAG)? → Vector.

Gotcha — "polyglot persistence" is the usual answer. Real systems mix engines: Postgres for the transactional core, Redis for cache + hot counters, Elasticsearch for full-text, S3 + a warehouse for analytics. Don't pretend one engine fits everything.
Interview Qs covered
§1 sets the taxonomy used by Qs 12 (Mongo vs Postgres) and 15–19 (DynamoDB).
2. Relational Model & SQL Fundamentals
A relation is a set of tuples with a fixed schema. A table is a relation with a name. The relational model, from Codd (1970), says: "data is rows, schema is fixed, operations are set-based." SQL is the language that implements it — but SQL predates (and violates) a few pure-relational rules, most famously by allowing duplicates (bags, not sets) and NULLs.
Keys
| Key | What it is | Example |
|---|---|---|
| Primary key (PK) | Unique, non-null, one per table — the "identity" of a row | user_id BIGINT |
| Candidate key | Any column(s) that could serve as PK — minimal + unique | email (if unique) |
| Composite key | PK made of multiple columns | (order_id, line_no) |
| Surrogate key | Synthetic, typically an auto-increment or UUID, carries no domain meaning | BIGSERIAL, UUID v7 |
| Natural key | Has real-world meaning | SSN, email, ISBN |
| Foreign key (FK) | Column(s) referencing another table's PK; database-enforced integrity | order.user_id → user.id |
| Unique constraint | Non-PK uniqueness; allows NULL (depending on engine) | UNIQUE(email) |
Surrogate vs natural PK — the pragmatic default is surrogate. Natural keys change (email rebinds, names change spelling, SSNs don't apply internationally). A surrogate PK is stable forever; enforce the natural constraint as a UNIQUE. The rare case for a natural PK: immutable reference data (ISO 3166 country codes).
UUID PK tradeoff: UUIDs are globally unique and shardable, but they are 16 bytes, random-ordered (trashes B-tree locality), and invisible to humans. Prefer UUIDv7 (time-ordered), or BIGSERIAL plus a separate UUID public_id for the wire API.
SQL DDL / DML / DCL / TCL
- DDL (Data Definition) — `CREATE`, `ALTER`, `DROP`, `TRUNCATE`.
- DML (Data Manipulation) — `SELECT`, `INSERT`, `UPDATE`, `DELETE`, `MERGE`/UPSERT.
- DCL (Data Control) — `GRANT`, `REVOKE`.
- TCL (Transaction Control) — `BEGIN`, `COMMIT`, `ROLLBACK`, `SAVEPOINT`.
NULL and three-valued logic
SQL predicates evaluate to TRUE, FALSE, or UNKNOWN. NULL compared with anything (including another NULL) is UNKNOWN. Consequences:

```sql
SELECT * FROM users WHERE deleted_at = NULL;  -- returns 0 rows, always
SELECT * FROM users WHERE deleted_at IS NULL; -- correct
```

- Aggregates (`COUNT`, `SUM`, `AVG`) skip NULLs — except `COUNT(*)`, which counts rows.
- `NOT IN (subquery)` returns zero rows if the subquery contains any NULL. Use `NOT EXISTS` instead.
- Uniqueness: Postgres / MySQL / Oracle treat each NULL as distinct (you can have many NULLs in a UNIQUE column). SQL Server treats them as equal (only one NULL allowed) — different behavior.

Gotcha. `WHERE status <> 'active'` silently excludes rows with `status IS NULL`. Always think about the NULL case.
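Three-valued logic is easy to verify empirically. Below is a minimal, runnable sketch using Python's stdlib `sqlite3` (SQLite stands in for Postgres purely for portability — the NULL semantics shown here are the same); the tiny `users` fixture is invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, deleted_at TEXT);
INSERT INTO users VALUES (1, NULL), (2, '2024-01-01'), (3, NULL);
""")

# '= NULL' is UNKNOWN for every row -> zero rows, always
assert conn.execute("SELECT COUNT(*) FROM users WHERE deleted_at = NULL").fetchone()[0] == 0
# 'IS NULL' is the correct predicate
assert conn.execute("SELECT COUNT(*) FROM users WHERE deleted_at IS NULL").fetchone()[0] == 2
# COUNT(col) skips NULLs; COUNT(*) counts rows
assert conn.execute("SELECT COUNT(deleted_at), COUNT(*) FROM users").fetchone() == (1, 3)
# NOT IN against a subquery containing a NULL matches nothing
assert conn.execute(
    "SELECT COUNT(*) FROM users WHERE id NOT IN (SELECT deleted_at FROM users)"
).fetchone()[0] == 0
```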
Set operations

```sql
SELECT id FROM a
UNION           -- set union, DISTINCT
SELECT id FROM b;
-- UNION ALL    -- bag union, keeps duplicates (cheaper, no sort/dedup)
-- INTERSECT    -- common rows
-- EXCEPT       -- rows in A not in B (MINUS in Oracle)
```

UNION ALL is almost always what you want in app code. UNION adds a sort/dedup pass — expensive.
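The dedup difference is directly observable. A small sketch with Python's stdlib `sqlite3` (the toy tables `a` and `b` are assumptions for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE a (id INTEGER);
CREATE TABLE b (id INTEGER);
INSERT INTO a VALUES (1), (2), (2);
INSERT INTO b VALUES (2), (3);
""")

union     = [r[0] for r in conn.execute("SELECT id FROM a UNION SELECT id FROM b ORDER BY 1")]
union_all = [r[0] for r in conn.execute("SELECT id FROM a UNION ALL SELECT id FROM b")]
intersect = [r[0] for r in conn.execute("SELECT id FROM a INTERSECT SELECT id FROM b")]
minus     = [r[0] for r in conn.execute("SELECT id FROM a EXCEPT SELECT id FROM b")]

assert union == [1, 2, 3]    # set semantics: deduplicated
assert len(union_all) == 5   # bag semantics: all 5 input rows survive
assert intersect == [2]
assert minus == [1]
```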
Constraints
- NOT NULL — rejects NULL.
- CHECK — arbitrary predicate: `CHECK (age >= 0 AND age < 150)`.
- FOREIGN KEY — referential integrity. Actions: `ON DELETE CASCADE`, `SET NULL`, `RESTRICT`, `NO ACTION`.
- UNIQUE — one-column or composite.
- EXCLUDE (Postgres) — generalized uniqueness using any operator (e.g., "no two ranges overlap" via GiST + `&&`).
Interview Qs covered
Foundational — touches §8 Qs 1 (ACID), 6 (denormalize), 11 (tx lifecycle).
3. Joins Deep Dive
A join combines rows from two tables using a predicate. Know the six shapes and know which algorithm the planner uses.
Join shapes
Given users(id, name) and orders(id, user_id, total):
| Shape | Result | SQL |
|---|---|---|
| INNER | Only rows matching predicate in both sides | users u JOIN orders o ON o.user_id = u.id |
| LEFT OUTER | All left rows; unmatched right columns NULL | users u LEFT JOIN orders o ON ... |
| RIGHT OUTER | All right rows; unmatched left columns NULL | (mirror of LEFT β rarely used in practice) |
| FULL OUTER | All rows from both sides; unmatched side is NULL | FULL JOIN |
| CROSS | Cartesian product — every left × every right | users CROSS JOIN products |
| SELF | Table joined with itself; use aliases | employees e JOIN employees m ON e.mgr_id = m.id |
Two more flavors you should name:
- SEMI JOIN — "does a matching row exist?" Written as `EXISTS (subquery)` or `IN (subquery)`. No columns from the right side.
- ANTI JOIN — "no matching row exists." Written as `NOT EXISTS`. Prefer over `NOT IN` (NULL gotcha).
- LATERAL JOIN (Postgres, Oracle) — the right side can reference the left side's current row, row-by-row. Useful for "top-N per group":

```sql
SELECT u.id, recent.id AS order_id
FROM users u
JOIN LATERAL (
  SELECT id FROM orders o WHERE o.user_id = u.id
  ORDER BY o.created_at DESC LIMIT 3
) recent ON TRUE;
```

Join algorithms (what the planner actually does)
| Algorithm | How it works | Best when | Complexity |
|---|---|---|---|
| Nested loop | For each row in outer, scan/index-lookup inner | Small outer + indexed inner; or very small inner | O(N·M) worst; O(N·log M) with index |
| Hash join | Build hash table on smaller side; probe with larger side | Equi-joins, no useful index, enough memory for build side | O(N+M), memory-bound |
| Merge join | Both sides sorted on join key, then merge | Both sides already ordered (index), large result | O(N+M) if pre-sorted; O(N log N + M log M) if not |
```sql
-- Postgres planner hint (how to read it)
EXPLAIN SELECT * FROM orders o JOIN users u ON u.id = o.user_id;
-->  Hash Join  (cost=... rows=... width=...)
       Hash Cond: (o.user_id = u.id)
       ->  Seq Scan on orders
       ->  Hash
             ->  Seq Scan on users
```

Gotcha. Missing FK index → nested loop over orders with a full scan of users per probe → quadratic blow-up at scale. Always index FK columns, especially if the child table is the "many" side.
ON vs WHERE with outer joins
These give different answers:

```sql
-- (A) filter BEFORE the outer join -> users without orders still appear
SELECT u.id FROM users u
LEFT JOIN orders o ON o.user_id = u.id AND o.total > 100;

-- (B) filter AFTER the outer join -> users without orders DROPPED (NULL > 100 is UNKNOWN)
SELECT u.id FROM users u
LEFT JOIN orders o ON o.user_id = u.id
WHERE o.total > 100;
```

Form (B) is effectively an inner join. If you want LEFT semantics plus a post-filter that preserves NULL rows, write `WHERE o.total > 100 OR o.id IS NULL`.
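To convince yourself, the two forms can be run side by side. A runnable sketch using Python's stdlib `sqlite3` (tiny hypothetical dataset; user 2 deliberately has no orders):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users  (id INTEGER PRIMARY KEY);
CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total REAL);
INSERT INTO users VALUES (1), (2);
INSERT INTO orders VALUES (10, 1, 150.0), (11, 1, 50.0);
""")

# (A) filter in ON: outer-join semantics preserved, user 2 still appears
a = conn.execute("""
    SELECT u.id, o.id FROM users u
    LEFT JOIN orders o ON o.user_id = u.id AND o.total > 100
    ORDER BY u.id
""").fetchall()
assert a == [(1, 10), (2, None)]

# (B) filter in WHERE: NULL > 100 is UNKNOWN, user 2 is silently dropped
b = conn.execute("""
    SELECT u.id, o.id FROM users u
    LEFT JOIN orders o ON o.user_id = u.id
    WHERE o.total > 100
""").fetchall()
assert b == [(1, 10)]   # inner-join behavior
```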
Interview Qs covered
Foundational for Q7 (EXPLAIN) and Q8 (N+1 — ORMs generate nested-loop fetches).
4. Advanced SQL
Advanced SQL is the line between "I can write a query" and "I can solve problems in the database tier." Senior interviews expect all of this.
Common Table Expressions (CTE)
A named sub-query used in a later query. Great for readability.

```sql
WITH active_users AS (
  SELECT id FROM users WHERE deleted_at IS NULL
),
recent_orders AS (
  SELECT user_id, SUM(total) AS revenue
  FROM orders
  WHERE created_at > NOW() - INTERVAL '30 days'
  GROUP BY user_id
)
SELECT a.id, COALESCE(r.revenue, 0)
FROM active_users a
LEFT JOIN recent_orders r ON r.user_id = a.id;
```

Postgres gotcha. Until Postgres 12, CTEs were an optimization fence — always materialized. Postgres 12+ inlines non-recursive, non-write, single-reference CTEs. Force the old behavior with `WITH ... AS MATERIALIZED`; force inlining with `AS NOT MATERIALIZED`.
Recursive CTEs

```sql
WITH RECURSIVE descendants AS (
  SELECT id, manager_id, name, 1 AS depth FROM employees WHERE id = 42
  UNION ALL
  SELECT e.id, e.manager_id, e.name, d.depth + 1
  FROM employees e JOIN descendants d ON e.manager_id = d.id
)
SELECT * FROM descendants;
```

Useful for: org charts, category trees, graph traversal, number-series generation. Always include a depth limit or a base case that terminates.
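The same org-chart walk, runnable end to end with Python's stdlib `sqlite3` (the four-employee tree is a made-up fixture); note the explicit depth guard:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employees (id INTEGER PRIMARY KEY, manager_id INTEGER, name TEXT);
INSERT INTO employees VALUES
  (42, NULL, 'root'), (43, 42, 'a'), (44, 42, 'b'), (45, 44, 'c');
""")

rows = conn.execute("""
WITH RECURSIVE descendants AS (
  SELECT id, manager_id, name, 1 AS depth FROM employees WHERE id = 42
  UNION ALL
  SELECT e.id, e.manager_id, e.name, d.depth + 1
  FROM employees e JOIN descendants d ON e.manager_id = d.id
  WHERE d.depth < 10    -- guard against cycles / runaway recursion
)
SELECT id, depth FROM descendants ORDER BY depth, id
""").fetchall()

assert rows == [(42, 1), (43, 2), (44, 2), (45, 3)]
```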
Window functions
A window function computes over a "window" of rows without collapsing them (unlike GROUP BY).

```sql
SELECT
  user_id,
  order_id,
  total,
  ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY created_at DESC) AS rn,
  SUM(total) OVER (PARTITION BY user_id) AS user_lifetime_total,
  LAG(total) OVER (PARTITION BY user_id ORDER BY created_at) AS prev_total,
  total - LAG(total) OVER (PARTITION BY user_id ORDER BY created_at) AS delta
FROM orders;
```

Common windows:
- `ROW_NUMBER` — unique sequential (ties broken arbitrarily by ORDER BY).
- `RANK` — ties share rank, gaps follow: 1, 2, 2, 4.
- `DENSE_RANK` — ties share rank, no gaps: 1, 2, 2, 3.
- `LAG(col, n)` / `LEAD(col, n)` — value from n rows before / after.
- `FIRST_VALUE` / `LAST_VALUE` / `NTH_VALUE` — with framing.
- Aggregate OVER — `SUM`, `AVG`, `COUNT` as running or partition totals.

Framing — `ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW` for a running total vs `RANGE ...` for value-based windows.
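Running totals and LAG are easy to sanity-check. A sketch using Python's stdlib `sqlite3` (window functions need SQLite 3.25+, which Python 3.8+ bundles; the three-row `orders` fixture is invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (user_id INTEGER, total REAL, created_at TEXT);
INSERT INTO orders VALUES (1, 10, 't1'), (1, 30, 't2'), (1, 20, 't3');
""")

rows = conn.execute("""
    SELECT total,
           SUM(total) OVER (PARTITION BY user_id ORDER BY created_at
                            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running,
           LAG(total) OVER (PARTITION BY user_id ORDER BY created_at) AS prev
    FROM orders
    ORDER BY created_at
""").fetchall()

# running total accumulates; LAG looks one row back (NULL -> None on the first row)
assert rows == [(10.0, 10.0, None), (30.0, 40.0, 10.0), (20.0, 60.0, 30.0)]
```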
Classic use: top-N-per-group:

```sql
SELECT * FROM (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY total DESC) rn FROM orders
) t WHERE rn <= 3;
```

GROUPING SETS, ROLLUP, CUBE
Compute multiple GROUP BY results in one pass.

```sql
-- rollup: per (region, product), per region, grand total
SELECT region, product, SUM(sales)
FROM orders
GROUP BY ROLLUP (region, product);
```

- `GROUPING SETS` — arbitrary combos you enumerate.
- `ROLLUP(a,b,c)` — (a,b,c), (a,b), (a), ().
- `CUBE(a,b)` — every subset: (a,b), (a), (b), ().

`GROUPING(col)` returns 1 when a column was rolled up (so you can label "subtotal" rows).
Subqueries — EXISTS vs IN vs JOIN
| Pattern | Best for | Gotcha |
|---|---|---|
| `IN (subquery)` | Small, non-NULL value set | `NOT IN` + any NULL in subquery → 0 rows |
| `EXISTS` (correlated) | Existence check, early-exit on first match | Modern planners treat IN and EXISTS equivalently for most cases |
| `JOIN` + `DISTINCT` | Need columns from both sides | DISTINCT hides duplicates but adds a sort pass |
| `NOT EXISTS` | Anti-join, NULL-safe | Always prefer over `NOT IN` |
MERGE / UPSERT
Standard SQL MERGE (Postgres 15+, all major engines):

```sql
MERGE INTO customers c
USING staging s ON s.id = c.id
WHEN MATCHED THEN UPDATE SET name = s.name, updated_at = NOW()
WHEN NOT MATCHED THEN INSERT (id, name) VALUES (s.id, s.name);
```

Postgres shortcut: `INSERT ... ON CONFLICT (id) DO UPDATE SET ...`. MySQL: `INSERT ... ON DUPLICATE KEY UPDATE ...`.
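The upsert shape is portable enough to demo with Python's stdlib `sqlite3`, whose `ON CONFLICT` clause mirrors the Postgres shortcut (the `customers` fixture is invented; `excluded` refers to the row that failed to insert):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")

upsert = """
    INSERT INTO customers (id, name) VALUES (?, ?)
    ON CONFLICT (id) DO UPDATE SET name = excluded.name
"""
conn.execute(upsert, (1, "alice"))   # no conflict: plain insert
conn.execute(upsert, (1, "alicia"))  # PK conflict: becomes an update

assert conn.execute("SELECT name FROM customers WHERE id = 1").fetchone() == ("alicia",)
assert conn.execute("SELECT COUNT(*) FROM customers").fetchone() == (1,)  # still one row
```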
Interview Qs covered
Window functions + CTEs are standard senior interview ground. Not explicitly in §8 but expected knowledge.
5. Schema Design & Normalization
Normalization is a set of rules to remove redundancy and prevent update/insert/delete anomalies. You don't need to cite Codd — you do need to recognize bad schemas.
Anomalies (the "why" for normalization)
- Update anomaly — same fact stored in many places; miss one on update and the data drifts.
- Insert anomaly — can't represent a fact without unrelated other data (e.g., can't list a new product without an order).
- Delete anomaly — deleting the last order loses product metadata.
Normal forms in one page
1NF — atomic values, no repeating groups.

```
-- Bad (repeating group)
user(id, phone1, phone2, phone3)
-- Good
user(id), phone(user_id, phone)
```

No arrays/lists masquerading as comma-separated strings. (Note: Postgres native arrays and JSON columns arguably break pure 1NF — that's fine pragmatically, as long as you're not querying inside them routinely.)
2NF — 1NF + no partial dependency on a composite key.
Applies when the PK is composite. Every non-key attribute must depend on the whole PK, not part of it.

```
-- Bad: PK = (order_id, product_id); product_name depends only on product_id
order_line(order_id, product_id, product_name, qty)
-- Good
order_line(order_id, product_id, qty)
product(product_id, product_name)
```

3NF — 2NF + no transitive dependency.
Non-key attributes must not depend on other non-key attributes.

```
-- Bad: department_name depends on department_id, not employee_id
employee(id, department_id, department_name)
-- Good
employee(id, department_id)
department(department_id, department_name)
```

BCNF — every determinant is a superkey.
A stricter 3NF. If X → Y, then X must be a superkey. The rare cases where 3NF passes but BCNF fails involve overlapping candidate keys. Interviewers usually mean "3NF" when they say "normalize."

4NF / 5NF — handle multi-valued and join dependencies. Worth naming; real-world bugs are rare. "If you've needed 4NF outside a textbook, you probably already know it."
Denormalization — when and how
You denormalize to trade write cost + storage for read simplicity/speed. Warranted when:
- Read is 100x write — e.g., a dashboard counter.
- Joins dominate latency — copy the 3 fields you need into the child row.
- Analytical warehouse — star schema is denormalized on purpose (§21).
- NoSQL document store — embedding is denormalization by design (§15, §16).
- Materialized cache — `user_stats` rebuilt nightly from `events`.

How to denormalize safely:
- Make the canonical source clear. The denormalized copy is a cache, not a source of truth.
- Use triggers, change-data-capture (CDC), or application-layer fan-out to keep copies fresh.
- Tolerate staleness if the business allows it — or accept the write amplification if it doesn't.
ER modeling cheatsheet
| Relationship | Implementation |
|---|---|
| 1:1 | FK in either table with UNIQUE; or merge into one table |
| 1:N | FK on the "many" side, pointing at the "one" side |
| N:M | Junction table with both FKs, optional extra columns (e.g., role on user_project) |
| Weak entity (depends on parent for identity) | Composite PK that includes parent FK |
| Hierarchy (is-a) | Single-table inheritance (one table, nullable cols) / class-table inheritance (one table per subtype) / concrete-table inheritance (one table per leaf) |
Interview Qs covered
§5 addresses Q6 (denormalize).
6. ACID & Transactions
A transaction is a sequence of operations the database treats as a single logical unit. ACID describes the guarantees.
ACID, letter by letter
- Atomicity — all operations succeed or none do. Mid-transaction crashes leave the DB as if nothing happened. Implementation: write-ahead log (WAL) + redo/undo.
- Consistency — a transaction moves the DB from one valid state (constraints, FKs, triggers) to another. Application-level invariants are NOT part of SQL consistency — that's on you.
- Isolation — concurrent transactions don't interfere; each behaves as if it ran alone (at the chosen isolation level — §7).
- Durability — once committed, the change survives crashes. Implementation: fsync'd WAL; replication for durability across node failures.
The canonical example

```sql
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 'alice';
UPDATE accounts SET balance = balance + 100 WHERE id = 'bob';
COMMIT;
```

- A — both updates or neither.
- C — `CHECK (balance >= 0)` rejects the tx if Alice goes negative.
- I — another reader doesn't see Alice-debited-but-Bob-not-yet.
- D — after COMMIT returns, the transfer survives a power cut.
Savepoints and nested transactions
Real "nested transactions" don't exist in most engines. What you have is savepoints.

```sql
BEGIN;
INSERT INTO orders ...;
SAVEPOINT after_order;
INSERT INTO line_items ...;          -- this might fail
ROLLBACK TO SAVEPOINT after_order;   -- undo line_items but keep order
COMMIT;
```

Spring's `Propagation.NESTED` uses JDBC savepoints under the hood.
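Savepoint semantics in action, using Python's stdlib `sqlite3` in manual-transaction mode (`isolation_level=None` so `BEGIN`/`COMMIT` are explicit); the failing `line_items` insert is simulated by simply rolling back to the savepoint:

```python
import sqlite3

conn = sqlite3.connect(":memory:", isolation_level=None)  # manage the tx ourselves
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY)")
conn.execute("CREATE TABLE line_items (id INTEGER PRIMARY KEY, order_id INTEGER)")

conn.execute("BEGIN")
conn.execute("INSERT INTO orders VALUES (1)")
conn.execute("SAVEPOINT after_order")
conn.execute("INSERT INTO line_items VALUES (1, 1)")
conn.execute("ROLLBACK TO SAVEPOINT after_order")  # undo line_items, keep the order
conn.execute("COMMIT")

assert conn.execute("SELECT COUNT(*) FROM orders").fetchone() == (1,)      # survived
assert conn.execute("SELECT COUNT(*) FROM line_items").fetchone() == (0,)  # rolled back
```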
Distributed transactions — 2PC / XA
Two-phase commit (2PC) — a transaction coordinator asks each resource manager (RM):
- Prepare — "can you commit?" Each RM writes a prepared-but-uncommitted record and votes yes/no.
- Commit / abort — the coordinator tallies; if all voted yes, it sends commit; otherwise abort.

Problems with 2PC:
- Blocking — if the coordinator dies after prepare, the RM is stuck holding locks.
- Latency — one round-trip per phase, plus fsyncs on each side.
- Coupling — every RM must implement the XA protocol.

XA is the industry spec for 2PC (javax.transaction.xa.XAResource). IBM MQ, Oracle, and many JDBC drivers support it. A legacy MQ-to-DB bridge that must not lose messages typically uses XA to span the broker and the database in one transaction.
Saga pattern — when 2PC is too heavy
A saga is a sequence of local transactions where each has a compensating action for rollback. Two flavors:
- Choreography — each service publishes events; others react.
- Orchestration — a central orchestrator drives each step explicitly.

Saga gives you eventual atomicity via compensations — not atomicity in the ACID sense. Choose saga when:
- The steps cross service / DB boundaries.
- You can define a sensible compensation (refund = reverse of charge).
- You can tolerate a brief inconsistent window.
Client disconnects mid-transaction
The server doesn't know the client is gone until:
- The next statement fails with a broken pipe, or
- TCP keepalive / `idle_in_transaction_session_timeout` fires.

Postgres: set `idle_in_transaction_session_timeout = '30s'`. Otherwise an abandoned BEGIN holds locks indefinitely — a classic outage scenario. The broken client holds open transactions that block VACUUM, causing bloat to pile up until someone restarts the connection pool.
Interview Qs covered
§6 addresses Qs 1 (ACID), 11 (client disconnects mid-tx).
7. Isolation Levels & Anomalies
Isolation defines what concurrent transactions can see of each other. Four SQL standard levels, increasing strictness:
| Level | Dirty read | Non-repeatable read | Phantom | Lost update | Write skew |
|---|---|---|---|---|---|
| READ UNCOMMITTED | Possible | Possible | Possible | Possible | Possible |
| READ COMMITTED | Prevented | Possible | Possible | Possible | Possible |
| REPEATABLE READ | Prevented | Prevented | Possible* | Prevented (MVCC) | Possible |
| SERIALIZABLE | Prevented | Prevented | Prevented | Prevented | Prevented |
* Postgres's REPEATABLE READ uses snapshot isolation and prevents phantoms for the committing tx (but allows write skew). Standard SQL lets phantoms happen at RR.
The anomalies, with examples

Dirty read — T1 reads T2's uncommitted write.

```
T2: UPDATE account SET balance = 1000 WHERE id = 'alice'; -- not committed
T1: SELECT balance FROM account WHERE id = 'alice';       -- reads 1000
T2: ROLLBACK; -- T1 saw a value that never "officially" existed
```

Non-repeatable read — T1 reads the same row twice and gets different values because T2 committed between.

```
T1: SELECT balance FROM account WHERE id='alice'; -- 500
T2: UPDATE account SET balance=600 WHERE id='alice'; COMMIT;
T1: SELECT balance FROM account WHERE id='alice'; -- 600 (same tx!)
```

Phantom read — a range query returns different rows because a concurrent insert changed set membership.

```
T1: SELECT COUNT(*) FROM orders WHERE total > 100; -- 5
T2: INSERT INTO orders ... total=200; COMMIT;
T1: SELECT COUNT(*) FROM orders WHERE total > 100; -- 6 (phantom)
```

Lost update — two transactions read the same value, both update, last writer wins; the first update vanishes.

```
T1: x = SELECT value FROM counter WHERE id=1; -- 10
T2: x = SELECT value FROM counter WHERE id=1; -- 10
T1: UPDATE counter SET value = 11 WHERE id=1; -- commit
T2: UPDATE counter SET value = 11 WHERE id=1; -- commit; should be 12!
```

Write skew — two transactions each read data the other modifies, write disjoint rows, and together violate an invariant.

```
-- invariant: at least one doctor on call
T1: SELECT COUNT(*) FROM doctors WHERE on_call = true; -- 2
T2: SELECT COUNT(*) FROM doctors WHERE on_call = true; -- 2
T1: UPDATE doctors SET on_call = false WHERE id = 'alice';
T2: UPDATE doctors SET on_call = false WHERE id = 'bob';
-- both commit; now zero doctors on call
```

Write skew is the anomaly SERIALIZABLE prevents that snapshot isolation doesn't.
Snapshot isolation vs Serializable
Snapshot isolation (SI) — each tx sees a consistent snapshot as of its start. Prevents dirty / non-repeatable / phantom reads via MVCC. Doesn't prevent write skew. Postgres calls this REPEATABLE READ; Oracle calls it SERIALIZABLE (not true serializable!).
Serializable Snapshot Isolation (SSI) — Postgres's true SERIALIZABLE. SI plus runtime detection of dangerous read-write dependency cycles. On detection, one tx aborts with `could not serialize access`. You retry.

Gotcha. When you choose SERIALIZABLE, your application MUST be prepared to catch serialization failures (SQLSTATE `40001`) and retry — in Spring, catch or `@Retryable`-wrap `CannotSerializeTransactionException`, or write an explicit retry loop.
Picking a level
- READ COMMITTED — default in Postgres, SQL Server. Fine for most OLTP.
- REPEATABLE READ — default in MySQL InnoDB. Use when a transaction does multiple reads and must see a consistent view (reports, consistency-sensitive logic).
- SERIALIZABLE — when invariants cross rows and write skew is possible (scheduling, inventory, double-spend). Pay the cost + handle retries.
Interview Qs covered
§7 addresses Q2 (isolation levels + anomalies).
8. Concurrency Control
Two families: pessimistic (lock first, work second) and optimistic (work, then verify at commit). Postgres primarily uses MVCC + optimistic snapshot; locks still exist for writes.
MVCC — how Postgres actually implements isolation
Every row has hidden columns `xmin` (tx that created this version) and `xmax` (tx that deleted/updated it). An UPDATE doesn't overwrite — it inserts a new row version and sets `xmax` on the old one. Each transaction gets a snapshot: a list of which xids were in flight when it started. A row version is visible to me if:
- `xmin` is committed AND (`xmin` < my xid OR `xmin` is in my snapshot's finished set), and
- `xmax` is NULL, uncommitted, or > my xid.

Consequences:
- Readers never block writers, writers never block readers. Big win over lock-based DBs.
- Bloat — dead row versions accumulate. `VACUUM` reclaims them (§10).
- Long-running transactions prevent VACUUM from reclaiming versions newer than their xmin — runaway bloat. Kill long-idle transactions.

MySQL InnoDB, Oracle, and SQL Server (with RCSI / SI) have their own MVCC implementations with the same ideas — old versions stored in undo segments rather than in-heap.
Pessimistic locking
Acquire the lock before reading/writing to exclude others.

```sql
BEGIN;
SELECT * FROM inventory WHERE product_id = 42 FOR UPDATE; -- row lock, write
-- nothing else can UPDATE/DELETE/SELECT FOR UPDATE this row until COMMIT
UPDATE inventory SET qty = qty - 1 WHERE product_id = 42;
COMMIT;
```

Lock modes:
- `FOR UPDATE` — exclusive row lock; readers in SI still see the old snapshot.
- `FOR NO KEY UPDATE` — weaker; allows FK checks from other txs.
- `FOR SHARE` — shared lock; blocks FOR UPDATE but allows other FOR SHARE.
- `FOR UPDATE SKIP LOCKED` — queue-like consumer pattern: grab the next unlocked row, don't wait.
- `FOR UPDATE NOWAIT` — fail immediately if locked.

Table-level: `LOCK TABLE t IN ACCESS EXCLUSIVE MODE;` — a last-resort heavy hammer.
Optimistic locking
No lock acquired. At commit (or update), check that nothing changed since you read. Typically implemented with a version column:

```sql
UPDATE orders
SET status = 'shipped', version = version + 1
WHERE id = 42 AND version = 7; -- expects current version 7
-- affected rows = 0 means someone else updated; reject + retry
```

JPA `@Version` is this pattern: Hibernate appends `AND version = ?` to every UPDATE, checks affected rows, and throws `OptimisticLockException` if 0.
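The affected-rows check is the whole trick. A runnable sketch with Python's stdlib `sqlite3`; the `ship` helper is hypothetical and stands in for what an ORM generates:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT, version INTEGER)")
conn.execute("INSERT INTO orders VALUES (42, 'new', 7)")

def ship(expected_version):
    cur = conn.execute(
        "UPDATE orders SET status = 'shipped', version = version + 1 "
        "WHERE id = 42 AND version = ?", (expected_version,))
    return cur.rowcount  # 0 means a concurrent writer won; caller must re-read + retry

assert ship(7) == 1   # first writer succeeds; version is now 8
assert ship(7) == 0   # stale writer sees 0 affected rows -> conflict detected
assert conn.execute("SELECT status, version FROM orders WHERE id = 42").fetchone() == ("shipped", 8)
```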
Pessimistic vs optimistic — picking
| Use pessimistic when | Use optimistic when |
|---|---|
| High contention on specific rows | Low contention; conflicts are rare |
| Short transactions | Longer "read/think/write" user flows |
| Critical invariants (inventory, accounts) | Best-effort saves (comments, profile updates) |
| You can tolerate waiting | You can tolerate retries |
Deadlock
Four conditions for deadlock (the Coffman conditions): mutual exclusion, hold & wait, no preemption, circular wait. Break any one and you're safe.

```
T1 locks row A, then asks for row B.
T2 locks row B, then asks for row A.
-- deadlock. Postgres detects it and aborts one tx with deadlock_detected.
```

Prevention tactics:
- Always lock rows in a consistent order (e.g., sorted by ID). This alone kills most deadlocks.
- Keep transactions short — release locks fast.
- Use `SELECT FOR UPDATE SKIP LOCKED` for work-queue patterns (no two workers wait on each other).
- Set `lock_timeout` so a stuck locker aborts rather than waiting forever.

Gotcha — FK deadlocks. Inserting into the child + updating the parent in different orders across transactions commonly deadlocks. Always do parent operations first, or order by PK.
Advisory locks (Postgres)
Named application-level locks not tied to any row. Good for single-writer patterns:

```sql
SELECT pg_advisory_lock(hashtext('nightly-job'));
-- do the job
SELECT pg_advisory_unlock(hashtext('nightly-job'));
```

Useful when you want cross-process mutual exclusion without a separate coordinator (ZooKeeper, Redis Redlock).
Interview Qs covered
§8 addresses Qs 3 (MVCC), 10 (pessimistic vs optimistic).
9. Indexing Deep Dive
An index is an auxiliary data structure that answers "where is the row with this value?" faster than scanning the table. Indexes are the single highest-leverage performance tool in databases; they are also the most commonly misused.
B-tree — the default
A B-tree (actually a B+ tree in most engines) is a balanced tree where:
- Internal nodes hold keys and child pointers.
- Leaves hold keys + row pointers (heap) or the row itself (clustered).
- Leaves are linked — range scans walk the leaf list.
- Height stays O(log N); even 100M rows is ~4 levels deep.

B-trees support equality (=) and ordered operations (<, >, BETWEEN, ORDER BY, LIKE 'prefix%').
Clustered vs non-clustered
| Term | Clustered index | Non-clustered (secondary) |
|---|---|---|
| Leaves hold | Full row | Row locator (heap pointer or PK) |
| Count per table | At most one | Many |
| Postgres | Not really — CLUSTER is a one-time rewrite; all indexes are secondary, rows live in a heap | N/A |
| MySQL InnoDB | PK IS clustered — rows live in PK order in the B-tree | Secondary indexes store (key, PK), require a second lookup |
| SQL Server | Optional clustered index; recommended | Can have any number |

Implication for InnoDB: a secondary-index lookup is two traversals (secondary → PK → row). Covering indexes or INCLUDE columns eliminate the second.
Composite (multi-column) indexes and the prefix rule

```sql
CREATE INDEX idx_user ON orders(user_id, status, created_at);
```

Can serve queries whose predicates form a prefix:
- `WHERE user_id = ?` ✓
- `WHERE user_id = ? AND status = ?` ✓
- `WHERE user_id = ? AND status = ? AND created_at > ?` ✓
- `WHERE status = ?` ✗ — no leading `user_id`
- `WHERE user_id = ? AND created_at > ?` — can use the index for the user_id part, but not for created_at (status skipped)

Column order rule: put equality predicates first, then the highest-selectivity range column last. `(user_id =, status =, created_at range)` is correct.
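You can watch the prefix rule in a query planner. A sketch using Python's stdlib `sqlite3` and `EXPLAIN QUERY PLAN` (SQLite's planner output rather than Postgres's, but the prefix behavior matches; the `plan` helper is an assumption for readability):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (user_id INTEGER, status TEXT, created_at TEXT)")
conn.execute("CREATE INDEX idx_user ON orders(user_id, status, created_at)")

def plan(sql):
    # the last column of each EXPLAIN QUERY PLAN row is the human-readable detail
    return " ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

p1 = plan("SELECT user_id FROM orders WHERE user_id = 1 AND status = 'x'")
p2 = plan("SELECT user_id FROM orders WHERE status = 'x'")

assert "SEARCH" in p1 and "idx_user" in p1   # prefix predicate: indexed seek
assert "SCAN" in p2 and "SEARCH" not in p2   # leading column skipped: scan every row
```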
Covering (INCLUDE) index
An index that contains every column the query needs — index-only scan, no heap visit.

```sql
CREATE INDEX idx_orders_user_covering
ON orders(user_id) INCLUDE (total, status);
```

The query `SELECT total, status FROM orders WHERE user_id = 42` runs entirely off the index. Postgres 11+ supports INCLUDE; SQL Server has it too. MySQL doesn't — widen the key instead.
Other index types β
| Type | Structure | Best for | Engines |
|---|---|---|---|
| Hash | Hash table | Equality-only lookups | Postgres (WAL-logged only since v10), MySQL MEMORY |
| GIN (Generalized Inverted) | Inverted list per value | Array membership, JSONB containment, full-text | Postgres |
| GiST (Generalized Search Tree) | Balanced tree framework | Geometry, ranges, trigram fuzzy | Postgres |
| BRIN (Block Range INdex) | Min/max per block range | Naturally clustered data (time-series), tiny index | Postgres |
| Bitmap | Bit per row per value | Low-cardinality columns on OLAP | Oracle, Postgres (materialized during scan) |
| LSM tree | Sorted runs, periodic merge | Write-heavy workloads | RocksDB, Cassandra, DynamoDB backend |
| Full-text / inverted | Token β doc list | Keyword search | Postgres tsvector, Elasticsearch, Mongo text |
LSM tree primer β
A Log-Structured Merge tree buffers writes in an in-memory sorted structure (memtable). When it fills, it flushes as an immutable sorted file (SSTable). Background compaction merges SSTables.
- Writes are blazing fast β append-only; no in-place update.
- Reads can touch many SSTables β bloom filters prune candidates.
- Write amplification β each byte may be rewritten many times through compactions. Tuning knob: leveled vs tiered compaction.
Used by Cassandra, RocksDB, HBase, LevelDB, and inside DynamoDB and TiDB. Contrast with B+tree (in-place update, better read latency, worse write throughput).
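The mechanism above can be sketched in a few lines. A toy model only β it shows memtable flush, newest-run-first reads, and compaction, and omits the WAL, bloom filters, and leveled compaction that real engines add:

```python
import bisect

class LSMTree:
    """Toy LSM tree: writes buffer in a memtable, flush to immutable
    sorted runs (SSTables); reads check memtable, then runs newest-first."""

    def __init__(self, memtable_limit=4):
        self.memtable = {}
        self.sstables = []              # list of sorted runs, newest last
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value      # in-memory write: always fast
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # freeze the memtable as an immutable sorted run
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.sstables):       # newest run wins
            keys = [k for k, _ in run]
            i = bisect.bisect_left(keys, key)
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None

    def compact(self):
        # merge all runs into one; newer values overwrite older
        merged = {}
        for run in self.sstables:                 # oldest -> newest
            merged.update(dict(run))
        merged.update(self.memtable)
        self.sstables = [sorted(merged.items())]
        self.memtable = {}
```

Note how `get` may touch every run before compaction β that is exactly the read amplification bloom filters exist to prune.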
Partial and functional indexes (Postgres) β
```sql
-- partial: index only the rows we care about
CREATE INDEX idx_active_orders ON orders(user_id) WHERE status = 'active';

-- functional: index an expression
CREATE INDEX idx_email_lower ON users(LOWER(email));
-- query must use LOWER(email) = ? to hit it
```

Massive storage + speedup wins when the "interesting" subset is small.
When indexes HURT β
- Write amplification β every INSERT/UPDATE/DELETE maintains every index on affected columns.
- Bloat on MVCC engines β each update creates a new row version, and every index points to all versions until VACUUM.
- Planner misjudgment β an index on a very low-selectivity column (e.g., boolean `is_deleted`) may mislead the planner into an index scan when a seq scan would be faster.
- Space β a table with 15 indexes can easily be 5Γ the heap size.
- Backup/restore time β indexes rebuild after restore.
Rule: start with PK + FK + query-driven indexes. Measure before adding more. Drop unused indexes β Postgres has pg_stat_user_indexes showing scan counts.
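A diagnostic query for that last point β a sketch against the Postgres statistics views, excluding unique indexes (which enforce constraints even when never scanned):

```sql
-- Indexes that have never been scanned: candidates to drop.
SELECT s.schemaname,
       s.relname      AS table_name,
       s.indexrelname AS index_name,
       pg_size_pretty(pg_relation_size(s.indexrelid)) AS size,
       s.idx_scan
FROM pg_stat_user_indexes s
JOIN pg_index i ON i.indexrelid = s.indexrelid
WHERE s.idx_scan = 0
  AND NOT i.indisunique
ORDER BY pg_relation_size(s.indexrelid) DESC;
```

Counters reset on stats reset, so check the observation window before dropping anything.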
Index-only scan vs index scan vs seq scan β
- Seq scan β full table read. Fast for large fractions of rows.
- Index scan β find rows via index, fetch from heap.
- Index-only scan β all needed columns in the index; skip heap. Requires the visibility map to be up to date (run `VACUUM` recently).
Interview Qs covered β
Β§9 addresses Qs 4 (index types), 5 (covering index).
10. Storage Internals β
Interviews that go deep on databases pick at physical layout. You don't need Ph.D. detail, but you should know pages, WAL, and VACUUM.
Pages and tuples β
Most engines organize storage in fixed-size pages (blocks): Postgres 8KB, MySQL 16KB (configurable), SQL Server 8KB. A page holds multiple rows (tuples). The engine does I/O at page granularity β reading one tuple pulls the whole page into the buffer cache.
Postgres page layout:
```
+------------------+-----+-----+----+----+
| page header      | line pointers β     |
+------------------+-----+-----+----+----+
|                  ...                   |
+----+----+----+----+----+----+----+----+
| β tuples (grow backward toward middle) |
+----------------------------------------+
```

Fillfactor β the % of each page used on INSERT. Default 100% for tables. Lowering to 70β90% leaves room for HOT updates (see below) without needing new pages. Common tuning for write-heavy tables.
TOAST (The Oversized-Attribute Storage Technique) β Postgres stores large (> ~2KB) field values in a separate out-of-line table, compressed. Transparent to queries, but large bytea/text columns don't bloat the main heap.
Write-Ahead Log (WAL) β
Every change is written to the WAL before the change is applied to the heap page. Guarantees durability + crash recovery:
- Client sends `COMMIT`.
- Server flushes the WAL record to disk (fsync).
- Server acknowledges commit.
- Dirty heap pages get written later, asynchronously, at checkpoints.
On crash:
- Redo every WAL record newer than the last checkpoint.
- The heap is now consistent with the committed WAL.
Checkpoint β periodically flushes all dirty buffer pages to disk and truncates old WAL. Too frequent = I/O storm. Too infrequent = slow recovery. Postgres default: 5 minutes or when max_wal_size is reached.
Synchronous commit β wait for the WAL fsync at commit. Postgres default: `synchronous_commit = on`. With `synchronous_commit = off` the commit is acknowledged before the WAL flush β faster commits, plus a chance of losing the last few ms of committed transactions on crash (lost commits, not corruption or torn writes).
VACUUM and bloat (Postgres MVCC specifics) β
Recall: UPDATE in Postgres = new row version + mark old one dead. Dead rows must be reclaimed or the heap grows forever. VACUUM does three things:
- Marks dead tuple slots free for future inserts.
- Updates the visibility map β pages where all tuples are visible to all txs; enables index-only scans + skipped VACUUM passes.
- Prevents transaction-ID wraparound β Postgres xids are 32-bit; if unfrozen tuples reach ~2B in age, the cluster shuts down to prevent data loss. `VACUUM FREEZE` rewrites old tuples with a "frozen" marker.
autovacuum runs in background. Signs it's not keeping up: bloat metric climbing, query plans shifting, "too many dead tuples" in logs. Knobs: autovacuum_vacuum_scale_factor, per-table autovacuum_vacuum_threshold.
HOT (Heap-Only Tuple) updates β if a row update doesn't touch any indexed columns, Postgres stores the new version in the same page (if there's room) and avoids touching index entries. Huge perf win β this is why fillfactor < 100 matters on hot tables.
Heap-organized vs index-organized tables β
- Heap-organized (Postgres, SQL Server default) β rows stored in insertion order; every index is secondary.
- Index-organized (InnoDB, Oracle IOT, SQLite rowid) β rows stored inside the PK B-tree. Lookups by PK are free (single traversal); secondary indexes add a step.
InnoDB PK ordering effect: random UUIDv4 PK = random inserts = page splits = fragmentation. Auto-increment or UUIDv7 PKs keep the tree sequential and fast.
Interview Qs covered β
Β§10 reinforces Q3 (MVCC) and Q4 (index types β why they hurt writes).
11. Query Optimization β
The #1 differentiator between juniors and seniors in database interviews is the ability to read an execution plan and reason about it.
EXPLAIN vs EXPLAIN ANALYZE β
- `EXPLAIN` β shows the plan the optimizer will use, with estimates.
- `EXPLAIN ANALYZE` β actually runs the query and shows actual rows + timings per node.
- `EXPLAIN (ANALYZE, BUFFERS)` β adds buffer cache hits/misses; reveals I/O pressure.
Read plans bottom-up, right-to-left. Each node feeds the one above.
```
Sort (cost=123.45..125.67 rows=890 width=42) (actual time=1.2..1.5 rows=812 loops=1)
  Sort Key: created_at DESC
  Sort Method: quicksort  Memory: 52kB
  -> Hash Join (cost=10..100 rows=890) (actual time=0.5..1.1 rows=812 loops=1)
       Hash Cond: (o.user_id = u.id)
       -> Seq Scan on orders o (cost=0..50 rows=1000)
       -> Hash (cost=5..5 rows=100)
            -> Index Scan using idx_users_active on users u
```

Watchouts:
- Rows estimate vs actual wildly off β stale statistics. Run `ANALYZE`.
- Seq Scan on a huge table + small WHERE β missing index, or the index exists but isn't usable (LIKE with leading `%`, type mismatch, non-SARGable predicate).
- Nested Loop with outer rows β« 1 and an inner Seq Scan β quadratic; needs an index on the join column.
- "Rows Removed by Filter: X" β X rows filtered after the index or scan. A wider index / partial index can push the filter down.
Statistics and the cost model β
The planner picks a plan based on estimated cost:
```
cost = seq_page_cost     Γ pages_read
     + random_page_cost  Γ random_reads
     + cpu_tuple_cost    Γ tuples_processed
     + cpu_operator_cost Γ operator_invocations
```

Estimates depend on statistics: per-column histogram, most-common values (MCV), null fraction, n_distinct, correlation. `ANALYZE` refreshes them; autovacuum runs it. If stats are stale, plans are wrong.
Common anti-patterns β
| Anti-pattern | Fix |
|---|---|
| `WHERE LOWER(email) = 'x@y'` with index on `email` | Functional index, or store canonicalized |
| `WHERE created_at::date = '2026-04-17'` | `WHERE created_at >= '2026-04-17' AND created_at < '2026-04-18'` (SARGable) |
| `WHERE id IN (SELECT id FROM huge_tbl WHERE ...)` with no index | Rewrite as `EXISTS` or `JOIN`; ensure indexed |
| `OR` across two columns | Often better as `UNION` of two indexed queries |
| `SELECT *` through a JPA entity | DTO projection; select only needed cols |
| Implicit type cast (`WHERE id = '42'` with BIGINT col) | Fix the type; the planner can't use an index on a cast input |
| `LIMIT N` on unordered | Implicit sort cost; add `ORDER BY` + index to walk forward |
| N+1 (ORM) | Β§27 β JOIN FETCH / batch size / entity graph |
| `SELECT COUNT(*)` on huge table | Estimate via `pg_stat_user_tables.n_live_tup`; or accept the cost |
SARGability β
A predicate is Search ARGument-able if the engine can use an index to narrow rows without evaluating every row. Non-SARGable patterns:
```sql
-- Non-SARGable (function on column)
WHERE YEAR(created_at) = 2026

-- SARGable
WHERE created_at >= '2026-01-01' AND created_at < '2027-01-01'
```

The N+1 problem β
An ORM fetches 1 parent row, then issues N queries β one per child β to fetch relations. Classic hidden performance bug.
```java
List<Order> orders = repo.findAll();   // 1 query
for (Order o : orders) {
    o.getLineItems().size();           // N queries, one per order
}
```

Fixes:
- `JOIN FETCH` in JPQL / `@EntityGraph` (Β§27).
- `@BatchSize(size = 50)` β Hibernate issues one batched IN query per 50 parents.
- DTO projection that does the JOIN in SQL and returns flat rows.
Detecting N+1:
- Enable SQL logging in integration tests.
- Use `datasource-proxy` or `p6spy` to count queries per request.
- Spring Boot: `spring.jpa.properties.hibernate.generate_statistics=true` + check logs.
Rewriting for plan stability β
Sometimes a query works in dev and blows up in prod. Causes:
- Different row counts β different plans.
- Stats drift.
- Parameter sniffing (SQL Server) β first compile uses caller's value, poor for other values.
Fixes: manual ANALYZE after bulk load, pin plan via hints (engine-specific), rewrite with CTE AS MATERIALIZED to force a boundary, or pre-compute via materialized view.
Interview Qs covered β
Β§11 addresses Qs 7 (EXPLAIN), 8 (N+1).
12. Replication β
Replication copies data from one node to others for availability, read scale-out, and geographic distribution.
Physical vs logical replication β
| Physical (binary) | Logical (row/statement) |
|---|---|
| Ships WAL records byte-for-byte | Ships decoded operations (INSERT/UPDATE/DELETE per row) |
| All-or-nothing (whole cluster) | Table-level selection possible |
| Exact byte copy; perfect for HA failover | Cross-version, cross-schema, cross-engine possible |
| Postgres streaming replication, MySQL binlog row format | Postgres logical replication (10+), MySQL row-based binlog, Debezium CDC |
Primary-replica (a.k.a. master-slave β language is shifting) β
One node accepts writes (primary). Replicas apply the primary's WAL/binlog. Reads can go to replicas.
```
        [writes]
           |
      +---------+
      | Primary |--WAL--+--> Replica 1 (read)
      +---------+       +--> Replica 2 (read)
                        +--> Replica 3 (read)
```

- Sync replication β primary waits for at least one replica's ack before COMMIT returns. No data loss on primary failure; higher commit latency.
- Async replication β primary COMMITs locally; replicas catch up later. Fast commits; possible data loss window on failover.
- Semi-sync (MySQL) β primary waits for replica to receive (not apply). Middle ground.
Replication lag β
Asynchronous replicas are eventually consistent with the primary. Lag appears as:
```sql
INSERT INTO orders (id) VALUES (42);  -- primary
SELECT * FROM orders WHERE id = 42;   -- replica; "not found!"
```

Read-your-writes consistency approaches:
- Read from primary after a write β simple, loses the scale benefit briefly.
- Session stickiness β route user's reads to same node for a window after their write.
- Wait-for-LSN β the app tracks the WAL position at commit and reads with "WAIT FOR LSN β₯ X" on replica.
- Synchronous replicas β eliminate lag at cost of write latency.
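The wait-for-LSN approach can be sketched with standard Postgres functions (the LSN literal is a hypothetical example value; Postgres 17 adds a built-in `pg_wal_replay_wait` procedure for this):

```sql
-- On the primary, right after the write, capture the WAL position:
SELECT pg_current_wal_lsn();   -- e.g. '0/16B3740'

-- On the replica, before reading, check it has replayed past that LSN:
SELECT pg_wal_lsn_diff(pg_last_wal_replay_lsn(), '0/16B3740') >= 0;
-- The app polls until this returns true, then runs its read on the replica.
```

This gives read-your-writes per request while keeping most reads on replicas.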
Failover β
When the primary dies:
- Manual failover β operator promotes a replica. Safe but slow.
- Automatic failover (Patroni, RDS Multi-AZ, Galera) β consensus (Raft/etcd) picks a new primary.
Split-brain β network partition leaves two primaries, each accepting writes. Resolved only by:
- Fencing (STONITH β shoot the other node in the head).
- Quorum β a majority vote decides the legitimate primary.
- External coordinator (etcd, Zookeeper) holding the lease.
Multi-master (multi-primary) β
All nodes accept writes; changes propagate and conflicts are resolved (last-writer-wins, CRDT, or custom). Used by:
- Galera / Percona XtraDB Cluster β sync multi-master MySQL with certification.
- CockroachDB, Spanner, YugabyteDB β distributed SQL with Raft per range; logically multi-master.
- DynamoDB global tables β active-active across regions with last-writer-wins.
- Cassandra β eventually consistent multi-master with tunable quorum.
Pitfall: without conflict resolution, simultaneous updates to the same row produce garbage. LWW silently loses one write.
Read replicas are not HA by themselves β
Read replicas survive a primary crash only if you also have a promotion mechanism. Using replicas to handle reads doesn't automatically mean your system tolerates primary failure.
13. Sharding & Partitioning β
Partitioning = splitting a single logical table into multiple physical pieces, usually within one node. Sharding = partitioning across nodes. Goal: push a dataset beyond what one node can hold or serve.
Partitioning strategies β
| Strategy | Rule | Best for | Gotchas |
|---|---|---|---|
| Range | created_at buckets per month | Time-series, historical data (easy drop partition) | Hot partition for "now"; key choice matters |
| Hash | hash(id) mod N | Uniform spread, no obvious ordering | Re-sharding is expensive (most keys move) |
| List | region IN ('us-east', 'us-west') β specific partition | Tenant isolation | Rigid; requires known values |
| Composite | Hash then range | Multi-tenant + time-series | Complex |
| Directory (lookup) | Explicit mapping shard_id β node in metadata | Flexible, supports rebalancing | Extra hop; metadata becomes a SPOF unless HA |
Consistent hashing β
Naive hash sharding (id % N) forces nearly every key to move when N changes. Consistent hashing places keys and nodes on a ring; a key belongs to the next node clockwise. Adding a node only re-homes ~1/N of keys. Enhancements: virtual nodes (each physical node occupies many ring positions, smoothing distribution).
Used by: Cassandra, DynamoDB internals, memcached (with ketama), Redis Cluster (different scheme β hash slots, 16384 fixed), CDNs.
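A minimal ring with virtual nodes, to make the "~1/N of keys move" claim concrete. An illustrative sketch only β real systems add replication, node weights, and better hash mixing:

```python
import bisect
import hashlib

class HashRing:
    """Consistent-hash ring: each node occupies `vnodes` positions;
    a key belongs to the next position clockwise."""

    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes
        self.ring = []                  # sorted list of (hash, node)
        for n in nodes:
            self.add_node(n)

    def _hash(self, key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add_node(self, node: str):
        for i in range(self.vnodes):
            bisect.insort(self.ring, (self._hash(f"{node}#{i}"), node))

    def node_for(self, key: str) -> str:
        h = self._hash(key)
        i = bisect.bisect(self.ring, (h, ""))
        return self.ring[i % len(self.ring)][1]   # wrap around the ring

ring = HashRing(["node-a", "node-b", "node-c"])
keys = [f"user:{i}" for i in range(10_000)]
before = {k: ring.node_for(k) for k in keys}
ring.add_node("node-d")
moved = sum(1 for k in keys if ring.node_for(k) != before[k])
print(f"{moved / len(keys):.0%} of keys moved")   # roughly 1/4, not ~100%
```

Compare with `id % N`: going from 3 to 4 nodes re-homes about 75% of keys; the ring re-homes about 25%.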
Declarative partitioning in Postgres β
```sql
CREATE TABLE measurements (
  id       BIGSERIAL,
  taken_at TIMESTAMPTZ NOT NULL,
  value    NUMERIC
) PARTITION BY RANGE (taken_at);

CREATE TABLE measurements_2026_04
  PARTITION OF measurements
  FOR VALUES FROM ('2026-04-01') TO ('2026-05-01');
```

The planner performs partition pruning β queries with `taken_at` predicates skip non-matching partitions. Dropping a month of data is `DROP TABLE measurements_2026_01` β instant, no VACUUM needed.
Postgres doesn't do cross-node sharding natively (Citus adds it).
Cross-node sharding β the hard parts β
- Cross-shard joins / transactions β expensive or forbidden. Design to avoid.
- Global uniqueness β use UUIDs or a distributed ID generator (Snowflake).
- Rebalancing β moving shards without downtime requires dual-writes + backfills.
- Hot shards β celebrity user, viral tweet. Either pre-shard celebrity accounts or add secondary hashing.
- Secondary indexes β a global secondary index spans shards (scatter/gather) or must be maintained per shard (local).
When do you actually need to shard? β
- Working set > single-node RAM.
- Write throughput > single-node disk/CPU.
- Dataset > single-node disk cost-effectively.
Before sharding: vertical scale, add read replicas, add caches, archive cold data. Sharding changes your app's model β not a decision to take lightly.
14. Distributed Systems for DBs β
CAP β
In a distributed system with a network partition, you can have Consistency (every read sees the latest write) OR Availability (every request gets a non-error response) β not both. Partition tolerance is not optional; the network will partition.
So CAP is really: CP or AP during a partition.
- CP β refuse requests that can't be served consistently. MongoDB (with `w: majority`), HBase, Spanner, CockroachDB, Zookeeper/etcd.
- AP β keep serving, accept potential inconsistency. Cassandra (default), DynamoDB (default), Riak, eventually-consistent systems.
PACELC β CAP isn't enough β
If there's a Partition, choose A or C. Else (normal operation), choose Latency or Consistency.
Recognizes that even without partitions you pay latency for strong consistency. DynamoDB: PA/EL (available + low-latency, eventually consistent by default). Spanner: PC/EC (strict serializable always).
Consistency models (strong β weak) β
| Model | Guarantee |
|---|---|
| Linearizable (strong) | Every op appears to happen instantaneously at some point between its invocation and response. Single-copy illusion. |
| Sequential | All nodes see all ops in the same order; this order matches each client's program order. |
| Causal | If op A happened-before op B, everyone sees A before B. Concurrent ops may appear in different orders to different observers. |
| Read-your-writes | A client always sees its own writes (even if others might not yet). |
| Monotonic reads | Successive reads from a client never go backward in time. |
| Eventual | Given no new writes, eventually all replicas converge. Weakest. |
You typically want session guarantees (read-your-writes + monotonic reads) in an AP system, because eventual consistency alone confuses users.
Quorum (Dynamo-style) β
N replicas; read from R; write to W. R + W > N guarantees overlap β reads see the latest write. Common settings:
| Setup | Properties |
|---|---|
| N=3, W=3, R=1 | Fast reads, slow writes, no fault tolerance on write |
| N=3, W=1, R=1 | Fastest, eventually consistent only |
| N=3, W=2, R=2 | Strongly consistent, tolerates one node down |
| N=5, W=3, R=3 | Strongly consistent, tolerates two |
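The overlap argument can be demonstrated in a few lines. A toy model β assumes versioned replicas and random read/write sets, no hinted handoff or read repair:

```python
import random

def quorum_write(replicas, w, version, value):
    """Write to w replicas; the rest stay stale (as under async repair)."""
    for rep in random.sample(replicas, w):
        rep["version"], rep["value"] = version, value

def quorum_read(replicas, r):
    """Read from r replicas; return the value with the highest version."""
    sample = random.sample(replicas, r)
    return max(sample, key=lambda rep: rep["version"])["value"]

# N=3, W=2, R=2: R + W > N, so every read set overlaps the last write set.
replicas = [{"version": 0, "value": None} for _ in range(3)]
quorum_write(replicas, w=2, version=1, value="v1")
assert quorum_read(replicas, r=2) == "v1"   # pigeonhole: always fresh
```

With `w=1, r=1` the same read could land on the one stale replica β that is the "eventually consistent only" row above.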
Raft / Paxos (at 1,000 feet) β
Consensus protocols let a group of nodes agree on an ordered log despite failures. Used by: etcd, Consul, CockroachDB, Spanner (Paxos), TiKV, Kafka controller (KRaft), Redis Sentinel (custom).
- Leader election β one node wins majority vote.
- Log replication β leader appends entries, replicates to followers; entry committed when majority acks.
- Safety β committed entries never overwritten; only a node with an up-to-date log can become leader.
- Liveness β requires majority reachable; two-of-three tolerates one failure, three-of-five tolerates two.
You won't implement Raft. You will recognize "we use Raft to elect a primary and replicate the WAL." If asked to whiteboard, focus on: leader, term numbers, log matching, commit index.
Vector clocks β
An event tagged with per-node counters [A:3, B:5, C:2]. Lets you tell if two events are ordered (one dominates) or concurrent (neither does). Used in eventually-consistent KV stores (Riak, historically Dynamo) to detect and surface conflicts to the application.
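The dominance check is mechanical β a minimal sketch of comparing two vector clocks (dict of node β counter; labels like `"A"`/`"B"` are illustrative):

```python
def compare(vc_a: dict, vc_b: dict) -> str:
    """Return 'a<b', 'b<a', 'equal', or 'concurrent' (a real conflict)."""
    nodes = set(vc_a) | set(vc_b)
    a_ge = all(vc_a.get(n, 0) >= vc_b.get(n, 0) for n in nodes)
    b_ge = all(vc_b.get(n, 0) >= vc_a.get(n, 0) for n in nodes)
    if a_ge and b_ge:
        return "equal"
    if a_ge:
        return "b<a"        # a dominates: b happened-before a
    if b_ge:
        return "a<b"
    return "concurrent"     # neither dominates: surface to the app

print(compare({"A": 3, "B": 5}, {"A": 3, "B": 6}))  # a<b
print(compare({"A": 4, "B": 5}, {"A": 3, "B": 6}))  # concurrent
```

"Concurrent" is the case a last-writer-wins store silently papers over.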
CRDTs (conflict-free replicated data types) go further β merge values deterministically without application involvement (counters, sets, last-writer-wins registers).
15. NoSQL Categories β
Document stores β
Self-describing JSON/BSON records, often with some query support.
- MongoDB β Β§16. Rich query language, aggregation pipeline, transactions (with limits).
- Couchbase β SQL-like N1QL, integrated cache layer.
- Firestore / Firebase β serverless, realtime, mobile-focused.
When: data is naturally a document (user profiles, product catalog, CMS), shapes vary per record, you rarely cross-join, you want horizontal scale.
Key-value (KV) β
Get/put/delete by key. Opaque value.
- Redis β in-memory, rich data types (lists, sets, sorted sets, streams), pub/sub, Lua. See Β§20.
- Memcached β in-memory, simple KV only, no persistence.
- DynamoDB (when used trivially) β Β§17.
- etcd / Consul β small-value config store with consensus.
- RocksDB / LevelDB β embedded KV engine, used as backend for many DBs.
When: access pattern is pure lookup-by-key, or caching.
Wide-column β
Rows keyed by partition key + clustering columns; sparse columns can differ per row. Physically column-oriented within a row.
- Cassandra β AP, tunable consistency, linear scale. Great for write-heavy time-series.
- HBase β Hadoop ecosystem, strong consistency per row.
- DynamoDB (as single-table design) β Β§17.
- ScyllaDB β Cassandra-compatible, C++ implementation.
When: massive writes, queries always by partition key, you can design your schema around access patterns (no ad-hoc queries).
Graph β
Nodes + edges optimized for traversal and pattern matching.
- Neo4j β Cypher query language, property graph.
- Amazon Neptune β Gremlin, openCypher, and SPARQL (RDF).
- JanusGraph β on top of Cassandra/HBase.
When: relationships are first-class (social graph, recommendation, fraud rings, knowledge graphs). Multi-hop queries that would be N self-joins in SQL run cheaply.
Time-series β
Writes-dominated, data keyed by time, frequently append-only, auto-downsampling.
- InfluxDB β SQL-like InfluxQL / Flux.
- TimescaleDB β Postgres extension with hypertables.
- Prometheus TSDB β metrics, pull-based, not a durable system of record.
When: metrics, IoT, monitoring.
Vector β
Approximate nearest-neighbor search over embeddings β foundational for RAG / semantic search / recommendation.
- pgvector (Postgres extension) β brings vectors into SQL.
- Pinecone, Weaviate, Qdrant, Milvus β dedicated.
- Index: HNSW (Hierarchical Navigable Small World) β graph traversal in embedding space. IVF (Inverted File) β cluster then search cluster.
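A minimal pgvector sketch, assuming the extension is installed; the table, dimension, and literal vector are hypothetical:

```sql
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE docs (
  id        BIGSERIAL PRIMARY KEY,
  embedding vector(3)            -- real embeddings are hundreds of dims
);

CREATE INDEX ON docs USING hnsw (embedding vector_l2_ops);

-- 5 approximate nearest neighbors by L2 distance (<-> operator):
SELECT id FROM docs ORDER BY embedding <-> '[0.1, 0.2, 0.3]' LIMIT 5;
```

The win: vector search joins, filters, and transacts alongside your relational data.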
When NOT to pick NoSQL β
- You need multi-table transactions / cross-entity invariants.
- You need ad-hoc querying without rewriting the schema.
- Your data is relational and modest in size.
- Your team doesn't yet have an access pattern to design around.
Default to Postgres unless there's a specific reason not to. "We used Mongo because it was cool" is a regret story on every senior engineer's resume.
Interview Qs covered β
Β§15 frames Q12 (Mongo vs Postgres).
16. MongoDB Deep Dive β
MongoDB stores BSON documents (binary JSON with extra types like ObjectId, Date, Decimal128) inside collections (β tables) inside databases. No enforced schema by default β but you should use validators ($jsonSchema) in production.
Documents and the embedding decision β
The core data-modeling decision: embed vs reference.
```jsonc
// Embedded (denormalized)
{
  _id: ObjectId("..."),
  user: "alice",
  orders: [
    { id: 1, total: 100, items: [...] },
    { id: 2, total: 50 }
  ]
}

// Referenced (normalized)
// users:  { _id: "alice", ... }
// orders: { _id: 1, user_id: "alice", total: 100 }
```

Embed when:
- Data is accessed together (user's orders appear on the profile).
- Lifetime is tied (order lines die with their order).
- Embedded array is bounded β not unbounded growth (16 MB document limit).
- You don't need to query embedded items independently.
Reference when:
- Array can grow unboundedly (a popular post's comments).
- Entity is referenced from multiple places.
- You need to query it separately with an index.
Replica set + oplog β
A replica set is one primary + N secondaries. The primary writes to its oplog (a capped collection of operations). Secondaries tail the oplog and replay.
- Default: async replication. Secondaries might lag seconds.
- Failover via Raft-like election. Secondary with freshest oplog wins.
- Reads default to primary; `readPreference: secondary[Preferred]` offloads reads at the cost of staleness.
Write and read concerns β
Write concern β how many nodes must ack a write before client hears success.
- `w: 1` β primary only (default). Fast; the write is lost if the primary dies before replicating it.
- `w: "majority"` β majority of voting nodes. Durable across primary failure.
- `j: true` β flushed to the on-disk journal.
Read concern β consistency of reads.
- `local` (default) β latest data on the queried node; might not be majority-committed.
- `majority` β only returns data acked by a majority.
- `linearizable` β even stronger; blocks until single-copy reads can be guaranteed.
Rule: `w: "majority", j: true, readConcern: "majority"` for "this write must never be lost."
Sharding β
```
mongos (router) <-> config servers (cluster metadata)
   |
   +-- Shard 1 (replica set) <- partition range A
   +-- Shard 2 (replica set) <- partition range B
   +-- Shard 3 (replica set) <- partition range C
```

The shard key is critical: all documents with the same shard key live on the same shard.
Picking a shard key (in order of importance):
- High cardinality β enough distinct values to spread.
- Even distribution β avoid hot shards.
- Matches common queries β queries with shard key filter are single-shard; without are scatter/gather (slow).
- Non-monotonic β monotonic keys (timestamps) concentrate writes on one shard. Use hashed sharding for timestamps.
Changing a shard key used to require a dump/restore; MongoDB 5.0+ can reshard online.
Indexes β
- Single-field β most common.
- Compound β prefix rule same as SQL: `{ user_id: 1, status: 1, created_at: 1 }` serves queries prefixed on `user_id`, `(user_id, status)`, etc.
- Multikey β automatic on array fields: `{ tags: 1 }` indexes each array element.
- Text β full-text search, one per collection, language-aware stemming.
- 2dsphere / 2d β geospatial.
- Wildcard β `{ "$**": 1 }` indexes all paths β handy for flexible schemas.
- TTL β `expireAfterSeconds` auto-deletes documents; great for sessions.
- Partial β indexes only docs matching a filter.
- Hashed β for sharding on a monotonic natural key.

> Gotcha β compound prefix rule. Index `{ a: 1, b: 1, c: 1 }` supports queries on `a`, `a+b`, `a+b+c` β but NOT `b`, `c`, or `b+c`. Order matters. Put equality fields first, then range, then sort-only.
Aggregation pipeline β
Multi-stage transformation pipeline. Stages: $match, $project, $lookup (join), $group, $sort, $limit, $unwind, $facet, $bucket, $addFields.
```js
db.orders.aggregate([
  { $match: { status: "shipped", created_at: { $gte: new Date("2026-01-01") } } },
  { $group: { _id: "$user_id", total: { $sum: "$amount" }, count: { $sum: 1 } } },
  { $sort: { total: -1 } },
  { $limit: 10 }
])
```

Optimization rules:
- Put `$match` as early as possible β pushdown.
- Use indexes (only stages before blocking ops like `$group` can use them).
- `$lookup` is an ad-hoc left outer join β slower than embedding. Avoid in hot paths.
Transactions β and their limits β
MongoDB 4.0+ supports multi-document ACID transactions. 4.2+ across shards.
Limits:
- Default 60-second transaction timeout. Long transactions kill secondaries (oplog pressure).
- Performance hit: transactions serialize on the oplog; cross-shard transactions use 2PC internally.
- Limits on data moved per tx.
- Read-your-own-writes inside a tx is always fine, but reads outside the tx with `readConcern: snapshot` require a replica set.
Rule of thumb: design to not need transactions. Embed where you can. Use them for rare cross-document invariants, not for everything.
Change streams β
Tail the oplog via a clean API. Useful for:
- Keeping a search index (Elasticsearch) in sync.
- Emitting events to Kafka (Mongo β Kafka connector is literally this).
- Trigger-like workflows.
```js
db.orders.watch([{ $match: { "fullDocument.status": "shipped" } }])
  .on('change', evt => { /* ... */ });
```

Resume tokens let you pick up where you left off on reconnect.
Interview Qs covered β
Β§16 addresses Qs 12 (Mongo vs Postgres), 13 (compound prefix), 14 (transaction limits).
17. DynamoDB Deep Dive β
DynamoDB is a fully managed key-value / wide-column store on AWS. It trades ad-hoc querying for predictable low-latency at any scale. The design mantra: model access patterns first, schema second.
Tables, items, attributes β
A table holds items (β rows). Each item is a set of attributes (β columns, but each item can have any attributes). Every item has a primary key, either:
- Partition key (PK) only β hash key, simple KV.
- PK + sort key (SK) β composite. All items with the same PK form a partition and are stored together, sorted by SK.
Partitions and hot partitions β
Behind the scenes: DynamoDB hashes PK β assigns to a physical partition (hash space slice). Single partition max ~1000 WCU / 3000 RCU per second (also size-bounded, 10 GB).
Hot partition β if one PK value gets a huge share of traffic (e.g., country = "US"), that single partition saturates even though the table has excess capacity.
Fixes:
- Better PK β higher cardinality (e.g., `user_id` not `country`).
- Write sharding β suffix the key: `country#01..country#99`, random on write, scatter-read.
- Caching β DAX or app-layer.
- Adaptive capacity β DynamoDB auto-boosts a hot partition up to some limit; not a substitute for good modeling.
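The write-sharding fix above is just key arithmetic. A sketch of the two helpers an app would use; the shard count and `country#US` key are illustrative assumptions sized to the needed throughput:

```python
import random

SHARDS = 10  # assumption: 10 suffixes spreads one hot key over 10 partitions

def write_key(hot_pk: str) -> str:
    """Pick a random suffix on write so load spreads across partitions."""
    return f"{hot_pk}#{random.randrange(SHARDS):02d}"

def read_keys(hot_pk: str) -> list[str]:
    """Scatter-read: query every suffix and merge results client-side."""
    return [f"{hot_pk}#{i:02d}" for i in range(SHARDS)]

print(read_keys("country#US")[:2])  # ['country#US#00', 'country#US#01']
```

The trade: writes get 10Γ the partition headroom; reads become 10 queries merged in the app.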
GSI vs LSI β
| | Local Secondary Index (LSI) | Global Secondary Index (GSI) |
|---|---|---|
| PK | Same as base table | Different β any attribute |
| SK | Different | Different |
| Max count | 5 per table | 20 per table (soft limit) |
| Created | Only at table creation | Anytime |
| Consistency | Supports strongly consistent reads | Eventually consistent only |
| Capacity | Shares base table's | Separately provisioned |
| Size | Limited to 10 GB per PK | Unlimited |
Rule: GSIs unless you truly need strong consistency on a secondary access path (rare). GSIs are the flexible, production-preferred choice.
Single-table design β
Counter-intuitive: store multiple entity types (users, orders, order lines) in one table, distinguished by key patterns.
```
PK           SK                 type    attrs...
USER#alice   PROFILE            user    { email, name }
USER#alice   ORDER#2026-04-17   order   { total, status }
USER#alice   ORDER#2026-04-18   order   { total, status }
ORDER#42     LINE#1             line    { qty, price }
ORDER#42     LINE#2             line    { qty, price }
```

Why:
- One round-trip fetches a user AND all their orders: `PK = "USER#alice"`.
- GSIs reshape the same data for different access patterns.
- No cross-entity joins needed β relationships are captured by key design.
The cost: schema complexity. Queries depend on memorized key shapes. Use a naming convention (ENTITY#id) and document it ruthlessly.
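To make the key shapes concrete, here is an in-memory stand-in for the table above β a real app would issue a DynamoDB `Query` with `KeyConditionExpression: PK = :pk AND begins_with(SK, :prefix)`; the items mirror the sketch:

```python
items = [
    {"PK": "USER#alice", "SK": "PROFILE",          "type": "user"},
    {"PK": "USER#alice", "SK": "ORDER#2026-04-17", "type": "order"},
    {"PK": "USER#alice", "SK": "ORDER#2026-04-18", "type": "order"},
    {"PK": "ORDER#42",   "SK": "LINE#1",           "type": "line"},
]

def query(pk: str, sk_prefix: str = ""):
    """One 'partition read': items share a PK, sorted by SK."""
    return sorted(
        (i for i in items if i["PK"] == pk and i["SK"].startswith(sk_prefix)),
        key=lambda i: i["SK"],
    )

print([o["SK"] for o in query("USER#alice", "ORDER#")])
# ['ORDER#2026-04-17', 'ORDER#2026-04-18']
```

Note how the date-stamped SK gives chronological ordering for free β that is the key design doing the work of an `ORDER BY`.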
Modeling relationships β
One-to-many β composite PK/SK where PK = parent, SK = CHILD#id:
```
PK = USER#alice, SK = PROFILE            β the user
PK = USER#alice, SK = ORDER#202604170001 β each order

Query PK = "USER#alice" AND SK begins_with "ORDER#" β all orders.
```

Many-to-many β adjacency list pattern:

```
PK = USER#alice, SK = GROUP#eng  β alice β eng group
PK = GROUP#eng,  SK = USER#alice β GSI1: query by group
```

With a GSI flipping PK β SK you get both directions.
Capacity modes β
Provisioned β set RCU/WCU per table. Cheaper at steady load. Autoscaling adjusts within bounds.
- 1 RCU = 1 strongly consistent read of up to 4 KB per second, OR 2 eventually consistent reads, OR 0.5 transactional reads.
- 1 WCU = 1 write of up to 1 KB per second, OR 0.5 transactional writes.
On-demand β pay per request. ~6-7Γ more expensive per request but no capacity planning. Great for unpredictable or spiky traffic.
Eventual vs strong vs transactional reads β
- Eventual (default) β may lag a write by ~1s. Cheapest.
- Strong β always latest; costs 2Γ eventual; not supported on GSIs.
- Transactional read (`TransactGetItems`) β consistent snapshot across items, 4Γ cost.
Writes β
- `PutItem` / `UpdateItem` / `DeleteItem` β single-item.
- `BatchWriteItem` β up to 25 items per call; no atomicity across items.
- `TransactWriteItems` β up to 100 items, ACID across them, 2Γ cost. Rollback on any failure.
- Conditional writes β `ConditionExpression: "attribute_not_exists(pk)"` enforces uniqueness / optimistic lock.
Streams + DAX + TTL β
- DynamoDB Streams β change feed with 24-hour retention; feed Lambda, replicate to other stores, audit.
- DAX (DynamoDB Accelerator) β in-cluster cache; sub-millisecond reads; eventually consistent vs source.
- TTL β attribute-based automatic deletion. Cheap way to expire sessions, temp data.
PartiQL β
SQL-style query language for DynamoDB β convenient but doesn't change capabilities. Still limited to key-based access patterns. Convenience, not power.
Interview Qs covered β
Β§17 addresses Qs 15β19 (PK/SK, GSI/LSI, single-table design, 1:M modeling, capacity modes).
18. Postgres Specifics β
Postgres is the "reach for by default" relational engine for a reason: it has every feature, and its extension ecosystem covers the rest.
JSONB β structured data as a column β
JSONB stores a parsed, binary JSON representation. Not just a blob.
```sql
CREATE TABLE events (
  id BIGSERIAL PRIMARY KEY,
  payload JSONB
);

INSERT INTO events (payload) VALUES ('{"type":"login","user":"alice"}');

SELECT payload->>'user' FROM events WHERE payload->>'type' = 'login';
```

Operators: `->` (returns JSONB), `->>` (returns text), `#>` / `#>>` (path), `@>` (containment), `?` (key exists), `?&` / `?|`.
Index JSONB with GIN:
```sql
CREATE INDEX idx_events_payload_gin ON events USING GIN (payload);

-- Or the smaller jsonb_path_ops for containment-only:
CREATE INDEX idx_events_payload_gin_path ON events USING GIN (payload jsonb_path_ops);
```

Supports queries like `payload @> '{"type":"login"}'` via the GIN index.
JSON (no B) stores the text as-is β slower for queries, preserves whitespace/ordering. Prefer JSONB.
Arrays β
Native array types: INT[], TEXT[]. Indexable with GIN.
```sql
CREATE TABLE posts (id BIGSERIAL PRIMARY KEY, tags TEXT[]);
CREATE INDEX idx_posts_tags ON posts USING GIN (tags);

SELECT * FROM posts WHERE tags @> ARRAY['postgres'];
```

Use cautiously: if you routinely query inside the array, it's usually better as a real junction table.
Extensions worth knowing β
| Extension | Purpose |
|---|---|
| pg_trgm | Trigram fuzzy search; index with gin_trgm_ops for LIKE '%...%' |
| pgvector | Vector similarity search; vector column + HNSW/IVFFlat index |
| PostGIS | Geospatial β points, polygons, routing |
| pg_cron | Cron-style scheduled SQL in the DB |
| pg_stat_statements | Query stats β top-N slowest, most-called |
| pg_partman | Partition maintenance automation |
| hstore | Key-value column (older; JSONB mostly replaces) |
| uuid-ossp / pgcrypto | UUID gen functions |
| citext | Case-insensitive text type |
LISTEN / NOTIFY β
Tiny, transactional pub/sub inside Postgres (notifications fire on commit).
```sql
-- Listener
LISTEN order_updates;

-- Publisher (in another session)
NOTIFY order_updates, 'order 42 shipped';
```

Great for local cache-invalidation signals. Don't rely on it as a replacement for Kafka β notifications are lost if no listener is connected.
Logical replication β
Publish/subscribe per table. Lets you:
- Replicate a subset across major versions.
- Do zero-downtime upgrades (logical-replicate from old to new, then switch).
- Build CDC pipelines (Debezium is a logical replication consumer).
Roles and Row-Level Security (RLS) β
```sql
CREATE ROLE app_user LOGIN PASSWORD '...';
GRANT SELECT ON orders TO app_user;

ALTER TABLE orders ENABLE ROW LEVEL SECURITY;

CREATE POLICY user_orders ON orders
  FOR SELECT USING (user_id = current_setting('app.current_user_id')::BIGINT);
```

Any query from `app_user` on `orders` is filtered by the policy β enforced by the engine, not application code. Great for multi-tenant apps when you can trust the DB connection's identity.
CTE materialization gotcha β
Pre-Postgres-12 CTEs were always materialized. Post-12 they're often inlined. If you depended on the materialization fence (side effects, avoiding repeated expensive subqueries), add AS MATERIALIZED.
19. MySQL / SQL Server Callouts β
Not a full tour β just what differs that interviewers probe.
MySQL (InnoDB) β
- Clustered PK β rows live inside the PK B-tree. Lookups by PK = one traversal. Secondary indexes store (key, PK), require a second lookup β matters for covering-index choices.
- Gap locks β InnoDB's REPEATABLE READ prevents phantoms via gap locks (locks the "space between" rows on a range query). Can cause surprising deadlocks; switching the session to READ COMMITTED disables most gap locking (the old `innodb_locks_unsafe_for_binlog` knob was deprecated and removed in 8.0).
- `ORDER BY` + `LIMIT` with filesort β check EXPLAIN; ensure an index covers the ordering.
- No transactional DDL β `ALTER TABLE` can't roll back. Use `gh-ost` / `pt-online-schema-change` for online migrations on big tables.
- Storage engines β InnoDB is the only one to use for new work. MyISAM exists; don't.
- Query cache β removed in MySQL 8.0. If an interviewer asks about it, the correct answer includes "it's been removed because invalidation cost > benefit."
SQL Server / T-SQL β
- Clustered index β optional; strongly recommended (usually on PK).
- Temp tables β `#temp` (session) vs `##temp` (global) vs table variables. Table variables have fewer stats β worse plans for anything non-trivial.
- Common table expressions β supported; used for recursion and readability just like Postgres.
- Parameter sniffing β the plan cache uses the first call's parameter value. Bad for skewed distributions. Workarounds: `OPTION (RECOMPILE)`, `OPTIMIZE FOR UNKNOWN`, local variables, `OPTION (OPTIMIZE FOR (@p = ...))`.
- Always On Availability Groups β HA + read-scale; primary + secondaries kept in sync by log-stream replication, with failover.
- Columnstore indexes β clustered columnstore for big analytical tables; non-clustered columnstore alongside row store for HTAP.
- `SELECT INTO` β creates a table from query results β no indexes.
20. Caching β
Caching adds a fast, cheap tier in front of the database to reduce read latency and load. It's one of the most commonly asked interview topics because invalidation is the hard part.
Redis vs Memcached β
| | Redis | Memcached |
|---|---|---|
| Data model | Strings, lists, sets, sorted sets, hashes, streams, bitmaps, HyperLogLog, geospatial | Strings only |
| Persistence | RDB snapshots + AOF | None |
| Replication | Primary-replica + Cluster | None (client-side sharding) |
| Transactions | MULTI/EXEC, Lua scripting | No |
| Memory model | Single-threaded main loop (with IO threads in 6+) | Multi-threaded |
| Use case | Cache + data structures server + pub/sub + queue | Pure cache |
Redis is the default. Memcached is still viable when you want a purely ephemeral cache with zero features.
Caching patterns β
Cache-aside (lazy loading) β app controls the cache:
```
read(k):
  value = cache.get(k)
  if value is null:
    value = db.get(k)
    cache.set(k, value, ttl)
  return value

write(k, v):
  db.put(k, v)
  cache.delete(k)   -- invalidate
```

Pros: cache holds only requested data; app unaffected if cache is down. Cons: cache stampede on miss (N requests all hit DB); stale window after delete before next read repopulates.
Read-through β cache library hits DB on miss (cache-aside abstracted into a library).
Write-through β app writes to cache; cache propagates to DB synchronously. Strong consistency between cache and DB; slower writes.
Write-behind (write-back) β app writes to cache; cache drains to DB async. Fastest writes. Risk: cache failure = lost data. Avoid for anything you can't lose.
Refresh-ahead β cache proactively refreshes entries before TTL expires (predict access). Used for very hot keys.
Invalidation strategies β
- TTL only β set expiry, accept up to that much staleness. Simple, best when "slightly stale is fine."
- Explicit delete on write β cache-aside. Risk: if write happens between read and set, stale data re-cached (race). Solutions: SETNX on write, versioned keys.
- Write-through / write-behind β cache is always fresh (by construction).
- Event-driven (CDC) β DB change stream (Debezium, Postgres logical, Mongo change stream) β invalidate cache key. Best for systems where multiple writers touch the DB.
- Version tag β cache key includes a version: `user:42:v17`. Bumping the version invalidates all.
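The version-tag strategy fits in a few lines: keep a version counter per entity, bake it into the key, and bump it on write. An in-memory sketch (hypothetical helper, not a Redis client):

```java
import java.util.HashMap;
import java.util.Map;

// Version-tagged cache keys: bumping the version makes every previously
// cached key unreachable, so invalidation is O(1) and race-free.
final class VersionedKeys {
    private final Map<String, Integer> versions = new HashMap<>();

    String key(String entity, long id) {
        int v = versions.getOrDefault(entity + ":" + id, 1);
        return entity + ":" + id + ":v" + v;
    }

    void invalidate(String entity, long id) {
        // If no version recorded yet, the next version is 2; otherwise +1.
        versions.merge(entity + ":" + id, 2, (old, unused) -> old + 1);
    }
}
```

Old entries (`user:42:v1`) simply age out of the cache via TTL or eviction.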
Phil Karlton: "There are only two hard things in computer science: cache invalidation and naming things." Believe him.
Eviction policies (when the cache fills) β
| Policy | Evicts | Use case |
|---|---|---|
| LRU (Least Recently Used) | Oldest-accessed | General caching; default |
| LFU (Least Frequently Used) | Least-accessed | Skewed access (celebrity hotness) |
| FIFO | Oldest-written | Rare; naive |
| Random | Random key | When you don't want tracking overhead |
| allkeys-lru / volatile-lru | LRU across all keys / only keys with TTL | Redis configs |
| volatile-ttl | Shortest TTL first | When TTL roughly matches value |
| noeviction | None β errors on OOM | Session stores, queues |
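LRU itself is cheap to sketch on the JVM: `LinkedHashMap` with access ordering plus `removeEldestEntry` yields a working in-process LRU cache:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal LRU cache: accessOrder=true moves an entry to the tail on get(),
// and removeEldestEntry evicts the least-recently-used head on overflow.
class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    LruCache(int capacity) {
        super(16, 0.75f, true); // true = access order, not insertion order
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity;
    }
}
```

Production caches (Caffeine, Redis) use approximations of LRU/LFU for concurrency and memory reasons, but the semantics are the same.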
Stampede / thundering herd β
A popular key expires; 1000 concurrent requests all miss the cache and hammer the DB.
Mitigations:
- Distributed lock β first miss locks, fetches, sets; others wait or serve stale.
- Probabilistic early expiration β each request independently rolls a probability to refresh before actual expiry; smooths the cliff.
- Stale-while-revalidate β serve stale value + kick off background refresh.
- Request coalescing β in-process dedup: concurrent misses for same key reuse one DB call.
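Request coalescing from the list above can be sketched with a map of in-flight futures: concurrent misses for the same key share a single load (the `Coalescer` class and `loader` callback are assumptions, not a library API):

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// In-process request coalescing: the first miss for a key starts the DB
// load; concurrent misses for the same key reuse the in-flight future.
final class Coalescer<K, V> {
    private final ConcurrentHashMap<K, CompletableFuture<V>> inFlight = new ConcurrentHashMap<>();

    CompletableFuture<V> get(K key, Function<K, V> loader) {
        return inFlight.computeIfAbsent(key, k ->
            CompletableFuture.supplyAsync(() -> loader.apply(k))
                             .whenComplete((v, e) -> inFlight.remove(k)));
    }
}
```

After the load completes the entry is removed, so the next miss loads fresh data.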
Cache hit ratio β what's good β
- >95% β great for a hot-data cache.
- 80β95% β normal.
- <50% β cache is fighting working set size; bigger cache or longer TTL or different key design.
What to cache β
- Derived, expensive, rarely-written data β perfect.
- Session state, rate-limit counters, leaderboards β Redis sorted sets are ideal.
- Write-heavy, rarely-read data β don't bother.
- Personalized unique-per-user β cache by user; hit rate depends on user revisit.
Interview Qs covered β
Β§20 addresses Q21 (cache invalidation).
21. OLAP & Data Warehousing β
When queries scan millions of rows with aggregates, OLTP engines suffer. A separate warehouse uses column-oriented storage and a denormalized schema optimized for analytics.
Star vs snowflake schema β
Star schema:
```
               dim_date
                  |
dim_customer βββ fact_sales βββ dim_product
                  |
              dim_store
```

- Fact table β numeric, additive measures (sales, quantity, revenue) + FKs to dimensions.
- Dimension tables β descriptive attributes (customer name, product category, date breakdown). Wide, few rows relative to facts.
Snowflake schema β dimensions are further normalized into sub-dimensions:

```
dim_product β dim_category β dim_department
```

Saves storage; adds joins. Star is usually preferred for warehouse speed.
Slowly Changing Dimensions (SCD) β
When dimension attributes change over time, how do you preserve history?
| Type | Behavior |
|---|---|
| SCD 1 | Overwrite. History lost. Use for corrections. |
| SCD 2 | New row for each change, with effective_from/effective_to + is_current. History preserved. Most common. |
| SCD 3 | New column for "previous" value. Limited to one step back. |
| SCD 4 | Split current vs history tables. |
| SCD 6 | Hybrid (1+2+3). |
Senior interviews: "how would you handle a customer changing address?" β SCD 2 with effective dating.
Columnar storage β
Row-oriented (Postgres, MySQL): rows stored contiguously. Good for "give me row 42."
Column-oriented (Redshift, BigQuery, Snowflake, ClickHouse, Parquet): values of one column stored contiguously.
Benefits:
- Compression β one column = one data type = high compression (delta, dictionary, run-length).
- Projection pushdown β `SELECT avg(salary)` only reads the salary column, not every byte of every row.
- Vectorized execution β process chunks of column values in SIMD-friendly loops.
Trade-off: inserting or updating a row touches every column file. Columnar engines generally favor append-only / bulk-load patterns.
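The compression claim is easy to demonstrate: a single-typed, low-cardinality column collapses under run-length encoding. A toy sketch (real engines combine RLE with dictionary and delta encoding):

```java
import java.util.ArrayList;
import java.util.List;

// Toy run-length encoder: a sorted or low-cardinality column (status,
// country) collapses to a handful of (value, count) pairs.
final class Rle {
    record Run(String value, int count) {}

    static List<Run> encode(List<String> column) {
        List<Run> runs = new ArrayList<>();
        for (String v : column) {
            int last = runs.size() - 1;
            if (last >= 0 && runs.get(last).value().equals(v)) {
                runs.set(last, new Run(v, runs.get(last).count() + 1));
            } else {
                runs.add(new Run(v, 1));
            }
        }
        return runs;
    }
}
```

A million-row `country` column with 50 distinct sorted values becomes at most 50 runs; row stores can't do this because adjacent bytes belong to different columns.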
ETL vs ELT β
ETL (Extract-Transform-Load) β classic: transform in an intermediate tool (Talend, Informatica, Spark), load to warehouse. Transformation logic lives outside the warehouse.
ELT (Extract-Load-Transform) β modern: load raw to warehouse, transform there with SQL (dbt is the dominant tool). Warehouse does the heavy lifting; transformations are version-controlled SQL.
ELT wins when the warehouse is cheap, fast, and elastic (BigQuery, Snowflake, Redshift). ETL still wins for heavy rule-based cleansing or when you can't store raw for policy reasons.
MPP engines β
Massively Parallel Processing β one logical warehouse, many nodes each processing a slice of the data.
- Redshift β AWS, columnar, provisioned + serverless, DC2/RA3 node families.
- BigQuery β Google, serverless, columnar, separation of storage and compute, slots model.
- Snowflake β multi-cloud, virtual warehouses (elastic compute clusters), automatic scaling.
- ClickHouse β open source columnar, super fast for specific queries, self-hosted.
Data lake vs warehouse vs lakehouse β
- Lake β raw files in object storage (S3, GCS) in open formats (Parquet, Avro). Schema-on-read.
- Warehouse β structured, optimized tables. Schema-on-write.
- Lakehouse β ACID tables over a lake via formats like Delta Lake, Apache Iceberg, Apache Hudi. Hybrid.
22. Search (Elasticsearch / OpenSearch) β
Elasticsearch isn't a database in the transactional sense β it's a distributed, indexed document store optimized for relevance-scored search.
Inverted index β the core data structure β
For every term in every document, a list of documents containing it.
"database" β [doc1, doc5, doc7]
"postgres" β [doc1, doc3]
"mongodb" β [doc5, doc7]Search "database postgres" = intersect [doc1, doc5, doc7] β© [doc1, doc3] = [doc1]. Blazingly fast for text match.
Analyzers and tokenizers β
Documents are processed into tokens at index time:
- Tokenizer β split text into tokens (standard, whitespace, keyword, n-gram).
- Token filters β lowercase, remove stopwords ("the", "a"), stem ("running" β "run"), synonyms.
Same analyzer applied to queries. Analyzer mismatch is a classic bug β your query analyzer lowercases but the indexed analyzer doesn't.
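A toy analyzer makes the index/query symmetry concrete (a standard-ish tokenizer plus lowercase and stopword filters; stemming and synonyms omitted):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Toy analyzer: split on non-letters, lowercase, drop stopwords.
// The SAME pipeline must run at index time and at query time,
// otherwise "Database" in a query never matches "database" in the index.
final class Analyzer {
    private static final Set<String> STOPWORDS = Set.of("the", "a", "an");

    static List<String> analyze(String text) {
        return Arrays.stream(text.split("[^A-Za-z]+"))
                     .filter(t -> !t.isEmpty())
                     .map(String::toLowerCase)
                     .filter(t -> !STOPWORDS.contains(t))
                     .collect(Collectors.toList());
    }
}
```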
Relevance scoring β BM25 β
Default scoring in modern Elasticsearch. Extension of TF-IDF:
- Term frequency β more mentions = more relevant (with diminishing returns).
- Inverse document frequency β rare terms weigh more ("postgres" more than "the").
- Document length normalization β shorter doc with same hit = more relevant.
Tunable per field (boost), per query.
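The three behaviors fall out of the BM25 term-score formula directly; a sketch with the usual defaults k1 = 1.2 and b = 0.75 (an illustrative rendition, not Lucene's exact implementation):

```java
// BM25 contribution of one query term to one document's score.
// k1 controls term-frequency saturation; b controls length normalization.
final class Bm25 {
    static final double K1 = 1.2, B = 0.75;

    // tf: occurrences in this doc; df: docs containing the term;
    // n: total docs; docLen / avgDocLen: length-normalization inputs.
    static double termScore(int tf, int df, int n, double docLen, double avgDocLen) {
        double idf = Math.log(1 + (n - df + 0.5) / (df + 0.5));
        double norm = 1 - B + B * (docLen / avgDocLen);
        return idf * (tf * (K1 + 1)) / (tf + K1 * norm);
    }
}
```

Rare terms get a bigger idf, repeated mentions saturate because tf appears in the denominator too, and shorter documents shrink the denominator.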
Shards and replicas β
- Primary shards β data sliced into N shards at index creation. Shard count is effectively fixed β change it via reindex, or the split / shrink APIs (with constraints).
- Replica shards β copies for redundancy + read scale.
- Each shard is a self-contained Lucene index.
Rule of thumb: shard size ~20β50 GB. Too-small shards (< few GB) waste overhead. Too-big (> 50 GB) makes recovery slow.
Refresh / flush / merge β
- Refresh (default 1s) β makes recent writes visible to search. Creates a small in-memory segment. Tunable; bulk loads should set `refresh_interval: -1` and refresh at the end.
- Flush β fsyncs the transaction log, commits Lucene segments to disk.
- Merge β background process combines small segments into larger ones. More reads = bigger segments = faster search.
When search β OLTP DB β
- ES is near-real-time β 1s refresh isn't transactional.
- Writes lag behind a system-of-record DB (via CDC or dual-write).
- No multi-document transactions.
- Don't use ES as a primary data store β use it as a secondary index on top of one.
Mapping explosion β
A "mapping" is ES's schema. Dynamic mapping auto-creates fields from first document seen. If you index unbounded user-supplied field names (custom_attr_<n>), you explode the mapping and kill cluster performance.
Mitigations:
- Disable dynamic mapping (`"dynamic": "strict"`).
- Use `"dynamic": "runtime"` to keep fields searchable without indexing them.
- Flatten user-supplied attributes into a single nested object.
23. Schema Migrations β
A migration is a versioned change to the schema β never hand-typed into a prod console.
Flyway vs Liquibase β
| | Flyway | Liquibase |
|---|---|---|
| Format | SQL files (V1__init.sql, V2__add_index.sql) | XML / YAML / SQL / JSON changelogs |
| Philosophy | Simple, imperative | Declarative, DB-agnostic |
| Rollback | Not built-in (write a down migration manually) | Built-in (but requires care) |
| Java integration | Spring Boot starter | Spring Boot starter |
| Best for | Teams fluent in SQL, single-DB target | Cross-DB portability, complex orchestration |
Either works. Pick one, commit to it, check migrations into source control alongside code.
The expand-contract (parallel-change) pattern β
A zero-downtime schema change happens in three phases, deployed independently.
Expand β add new, don't touch old. Migrate β dual-write; backfill; dual-read. Contract β remove old.
Zero-downtime column rename example β
Goal: rename `users.email_address` β `users.email`
```
Phase 1: EXPAND
  ADD COLUMN email TEXT (nullable).
  Deploy writers: write to BOTH email_address AND email.
  Backfill: UPDATE users SET email = email_address WHERE email IS NULL;  -- in batches

Phase 2: MIGRATE
  Deploy readers: read from email, fall back to email_address if null.
  Verify.
  Make email NOT NULL (only safe once backfill done).

Phase 3: CONTRACT
  Deploy writers + readers: use only email.
  DROP COLUMN email_address.
```

Every step is backward-compatible. You can roll back any deployment.
Other zero-downtime patterns β
| Change | Technique |
|---|---|
| Add column | Safe (with default; in PG 11+ default is metadata-only, no rewrite) |
| Add index | CREATE INDEX CONCURRENTLY (PG) / ALGORITHM=INPLACE, LOCK=NONE (MySQL). Non-blocking. |
| Drop column | Deploy code that doesn't use it β drop. |
| Change type | Add new column β backfill β switch reads β drop old. |
| Add NOT NULL to existing column | Add CHECK (col IS NOT NULL) NOT VALID β VALIDATE CONSTRAINT β ALTER COLUMN SET NOT NULL (PG 12+). |
| Add FK | ADD CONSTRAINT ... NOT VALID; VALIDATE CONSTRAINT (doesn't take strong lock). |
Online DDL tools (big MySQL tables) β
- pt-online-schema-change (Percona) β creates a shadow table, triggers copy rows, swap at end.
- gh-ost (GitHub) β similar, reads binlog instead of triggers β less load.
- pg_repack / pg_squeeze (Postgres) β reclaim bloat / rewrite without long lock.
Backfill strategy β
Backfills on large tables must:
- Batch (e.g., 1000 rows per transaction) β bounded lock duration.
- Throttle β pause on replica lag, CPU saturation.
- Resume from checkpoint β if killed, don't start over.
- Log progress.
A bad backfill (UPDATE huge_table SET new_col = old_col in one statement) = hours-long lock β production outage.
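The good-backfill loop can be sketched independently of any driver: keyset pagination plus a checkpoint. The `fetchBatch` / `updateBatch` / `saveCheckpoint` hooks below are assumptions standing in for real SQL:

```java
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;

// Keyset-paginated backfill: process rows in bounded batches, remember the
// last id handled so a killed run resumes instead of starting over.
final class Backfill {
    static long run(long checkpoint,
                    Function<Long, List<Long>> fetchBatch, // ids > checkpoint, LIMIT n
                    Consumer<List<Long>> updateBatch,      // one short transaction per batch
                    Consumer<Long> saveCheckpoint) {
        List<Long> batch;
        while (!(batch = fetchBatch.apply(checkpoint)).isEmpty()) {
            updateBatch.accept(batch);
            checkpoint = batch.get(batch.size() - 1);
            saveCheckpoint.accept(checkpoint); // resume point if we crash here
            // Real runs also throttle here: pause on replica lag / CPU pressure.
        }
        return checkpoint;
    }
}
```

Each batch holds locks only for its own short transaction, which is exactly what the one-statement `UPDATE huge_table` fails to do.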
Interview Qs covered β
Β§23 addresses Q20 (zero-downtime column rename).
24. Backups, HA, DR β
Backups are the last line of defense against "oops." HA (high availability) handles node failure. DR (disaster recovery) handles region / site failure.
Backup types β
| Type | What it captures | Speed | Storage |
|---|---|---|---|
| Full | Everything | Slow | Large |
| Incremental | Changes since last backup | Fast | Small (but chain dependency) |
| Differential | Changes since last full | Medium | Medium |
| Logical (pg_dump, mongodump) | SQL/BSON statements to recreate | Slow restore | Portable across versions |
| Physical (pg_basebackup, xtrabackup, file-level snapshot) | Binary files + WAL | Fast restore | Tied to major version |
| Snapshot (EBS, LVM, ZFS) | Point-in-time filesystem view | Very fast | Storage layer dependent |
Point-in-Time Recovery (PITR) β
The most valuable backup capability β restore to any timestamp, not just last backup.
```
Nightly full backup + continuous WAL archiving (shipped to S3).
Restore path: restore full β replay WAL up to target timestamp.
```

PITR recovers from both hardware failure AND user error ("we accidentally DELETE'd all orders at 14:32").
RPO vs RTO β
- RPO (Recovery Point Objective) β how much data can you afford to lose? "We can lose up to 5 minutes" = RPO 5m.
- RTO (Recovery Time Objective) β how long can you be down? "We must be back in 30 minutes" = RTO 30m.
These drive architecture:
| RPO / RTO | Architecture |
|---|---|
| Hours / hours | Nightly full + backup shipping to cheap storage |
| Minutes / minutes | Continuous WAL archive + warm standby |
| Seconds / seconds | Sync replica + automated failover |
| Zero / zero | Multi-region sync quorum (Spanner, CockroachDB geo-distributed) |
Lower RPO/RTO = higher cost. Pick based on business need.
HA patterns β
- Cold standby β backup ready; operator restores on failure. Cheap, slow.
- Warm standby β replica kept up-to-date, not serving traffic. Promote on failure. RDS Multi-AZ is this.
- Hot standby β replica serving read traffic already; promoted on failure. Postgres streaming replication.
- Active-active β multi-primary; always serving from multiple nodes (see Β§12).
Multi-AZ vs multi-region β
Multi-AZ (single region): tolerates data-center / availability-zone failure. Low latency within region. AWS RDS Multi-AZ uses sync replication between AZs.
Multi-region: tolerates entire region outage. Options:
- Read replicas in other region β RPO seconds (async lag).
- Aurora Global DB β low-latency cross-region replication with storage-layer magic.
- DynamoDB global tables β active-active multi-region, LWW conflicts.
Chaos engineering + drills β
Backups untested are SchrΓΆdinger's backups. Real practices:
- Restore drills β quarterly restore to a scratch env.
- Failover drills β periodically kill the primary to exercise automation.
- Game days β planned chaos injection.
- Runbook validation β can a non-DBA engineer perform the runbook?
Regulated-environment retention angle β
Regulated compliance regimes typically require audit retention (often 1+ year online, longer in archive). Common controls:
- Append-only audit log table (can't UPDATE/DELETE from app role).
- Daily backups retained per retention policy; deletion requires dual-control.
- Encryption at rest using FIPS 140-2 modules where required.
- Backup medium must meet classification rules β e.g., can't ship regulated-data backups to unvetted storage.
25. Security β
Encryption at rest vs in transit β
At rest β data on disk. Options:
- Filesystem / volume encryption (LUKS, EBS encryption) β transparent to DB.
- Transparent Data Encryption (TDE) β engine encrypts pages (SQL Server, Oracle, MySQL Enterprise, Postgres via extensions).
- Column-level encryption β application encrypts specific columns (SSN, credit card) before insert.
For regulated environments (FIPS, FedRAMP, etc.): use FIPS 140-2 validated cryptographic modules. Key management via AWS KMS, HashiCorp Vault, or dedicated HSM.
In transit β TLS everywhere. Postgres: sslmode=verify-full; MongoDB: tls=true requireCert. Self-signed in dev; real certs in prod. Mutual TLS (mTLS) for service-to-DB auth.
SQL injection β still the #1 killer β
Insecure:
```java
String sql = "SELECT * FROM users WHERE name = '" + name + "'";
stmt.executeQuery(sql);
```

Attacker supplies `name = "'; DROP TABLE users; --"` β game over.
Secure β parameterized query:
```java
PreparedStatement ps = conn.prepareStatement("SELECT * FROM users WHERE name = ?");
ps.setString(1, name);
ps.executeQuery();
```

The driver sends the SQL template and values separately; the server never parses the user data as SQL. This isn't just escaping β it's structural separation.
JPA / Hibernate with :param or positional parameters is safe. Hibernate with string-built JPQL or native SQL is vulnerable β same rules apply.
NoSQL has analogous vulnerabilities:
- Mongo injection β `{ $where: "this.name == '" + userInput + "'" }` β arbitrary JS.
- Use the structured operators instead: `{ name: userInput }`.
Row-Level Security (RLS) β
Filter rows at the engine level based on session context. Postgres example:
```sql
ALTER TABLE patients ENABLE ROW LEVEL SECURITY;

CREATE POLICY patient_privacy ON patients
  USING (assigned_doctor_id = current_setting('app.doctor_id')::BIGINT);
```

Every query from the application role is scoped to the current user's rows β even `SELECT *`. Great for multi-tenant SaaS.
Limits: requires trusted session context setting. If attacker can SET app.doctor_id arbitrarily, RLS is bypassed. Set it in a trusted middleware.
Least-privilege roles β
Don't use the superuser from the app. Create a role per service or per permission tier:
```sql
CREATE ROLE app_read LOGIN;
GRANT SELECT ON orders, users TO app_read;

CREATE ROLE app_write LOGIN;
GRANT SELECT, INSERT, UPDATE ON orders TO app_write;
-- No DROP TABLE, no ALTER.
```

Service connects as `app_write`. An attacker who gets the connection can't drop your tables.
Audit logging β
Every sensitive operation logged to tamper-evident storage.
- Postgres: `pgaudit` extension, or trigger-based audit tables.
- DynamoDB: CloudTrail data events.
- MongoDB: audit log (Enterprise edition).
Common audit log requirements under regulated frameworks:
- Who, what, when, from where.
- Append-only (writable by audit role; readable by reviewer).
- Retention per policy.
- Integrity verification (HMAC or digital signature).
Secret management β
Never commit connection strings with passwords. Options:
- AWS Secrets Manager, HashiCorp Vault β programmatic fetch.
- Kubernetes Secrets + Sealed Secrets / External Secrets β GitOps-safe.
- IAM database authentication (RDS IAM auth) β no password at all, short-lived tokens.
Data exposure risk at the application layer β
- Disable Actuator endpoints that dump DB config (e.g. `/actuator/env`, `/actuator/configprops`).
- Sanitize exceptions β don't leak schema in error responses.
- Log slow queries at INFO/DEBUG, not with bound values unless redacted.
26. Connection Pooling & Tuning β
Opening a TCP+TLS+auth handshake for every query = death. A connection pool keeps a set of warm connections and hands them out on demand.
HikariCP (Spring Boot default) β
Pool-sizing formula (from the HikariCP "About Pool Sizing" wiki):

```
connections = (core_count Γ 2) + effective_spindle_count
```

For SSD + 8 cores: (8 Γ 2) + 1 = 17. In practice most apps tune via load testing; start at 10β20 and adjust.
Key settings:
- `maximumPoolSize` β upper bound.
- `minimumIdle` β baseline kept warm (default = max, i.e., no shrinking).
- `connectionTimeout` β ms to wait for a connection before throwing.
- `idleTimeout` β close idle connections after this (doesn't go below `minimumIdle`).
- `maxLifetime` β hard cap; closes a connection at this age (should be a bit less than the DB-side timeout).
- `leakDetectionThreshold` β warn when a connection is held longer than this many ms.
What happens when the pool is exhausted β
All connections in use β a new request waits up to `connectionTimeout`, then HikariCP throws a `SQLTransientConnectionException` (a subclass of `SQLException`).
Causes:
- Long-running queries holding connections.
- Leaks β forgot to close.
- Pool sized too small.
- Downstream (DB) slow.
Mitigation: statement timeouts (SET statement_timeout = '5s'), leak detection, circuit breakers, auto-scaling pool.
Pool sizing reality β
Too small β requests queue, latency climbs under load. Too big β more connections than DB can usefully serve; connections spend time context-switching. Postgres default max_connections = 100 is not a lot; every connection consumes ~10MB of backend memory.
In microservice architectures: total microservices Γ pool_size can exceed DB's max_connections. Use PgBouncer / ProxySQL as a shared pool layer.
PgBouncer pooling modes β
| Mode | How | Use case |
|---|---|---|
| Session pooling | Client gets a backend for the whole session | Behaves like a direct connection; no functional change |
| Transaction pooling | Backend handed over at transaction boundaries | Highest density; app mustn't use session-state features (prepared statements with names, SET, advisory locks) |
| Statement pooling | Backend handed over per statement | Rarely used; no multi-statement transactions |
Transaction pooling is the magic: 10,000 clients can share 20 backends if they each hold a transaction for < 50ms on average.
Prepared statement caching β
Most drivers cache prepared statements per connection. Pool interactions:
- Session pooling: cache works fine.
- Transaction pooling: named prepared statements break because a new backend might not have seen the prepare. Disable server-side prepared statements (`prepareThreshold=0` on the Postgres JDBC driver).
Statement and lock timeouts β
```sql
SET statement_timeout = '5s';   -- aborts any query running > 5s
SET lock_timeout = '2s';        -- fails fast if a lock can't be acquired
SET idle_in_transaction_session_timeout = '30s';
```

Set these per app connection β prevents runaway queries from tying up resources.
Interview Qs covered β
Β§26 addresses Q9 (connection pooling).
27. JPA / Hibernate Specifics β
You'll get DB-round questions phrased in ORM terms. Know the entity lifecycle, the caches, and the N+1 remedies.
Entity lifecycle β
| State | Meaning |
|---|---|
| Transient | new User() β not tracked; no ID; no persistence context. |
| Managed | In the persistence context; changes tracked; flushed at commit. Result of persist() or find(). |
| Detached | Was managed, persistence context closed. Changes not tracked. Use merge() to reattach. |
| Removed | entityManager.remove(entity) called; will be DELETE'd at flush. |
Flush is what actually emits SQL. Flush modes:

- `AUTO` (default) β before queries and at commit.
- `COMMIT` β only at commit.
- `MANUAL` (Hibernate-specific) β only when you call `flush()`.
First-level vs second-level cache β
First-level (L1) β per EntityManager / persistence context. Automatic. Guarantees: within one transaction, `find(User.class, 42)` twice returns the same object. Cleared on close.
Second-level (L2) β optional, per SessionFactory (process-wide). Configurable per entity. Providers: Ehcache, Caffeine, Hazelcast, Infinispan, Redis (via Redisson).
Cache strategies:

- `READ_ONLY` β immutable data (country codes).
- `READ_WRITE` β with soft locks; slower writes.
- `NONSTRICT_READ_WRITE` β eventual consistency.
- `TRANSACTIONAL` β XA-ish.
Query cache β caches result sets from JPQL queries. Rarely worth it β invalidation kills you.
L2 cache caveats:
- Another process modifying the DB directly won't invalidate it.
- Cache across cluster needs distributed invalidation (Infinispan).
- Can leak stale data if not tuned.
Fetch types β LAZY vs EAGER β
```java
@OneToMany(fetch = FetchType.LAZY)   // default for collections
private List<Order> orders;

@ManyToOne(fetch = FetchType.EAGER)  // default for singular relations!
private User user;
```

EAGER means every time you fetch a child, JPA fetches the parent. Easy to accidentally turn N+1 into NΓM. Default everything to LAZY; fetch eagerly at query time when you need it.
N+1 problem β the JPA flavor β
```java
List<Order> orders = em.createQuery("FROM Order", Order.class).getResultList(); // 1 query
for (Order o : orders) {
    o.getLineItems().size(); // LAZY trigger β 1 query PER order
}
```

Fixes (pick per situation):
```java
// 1. JOIN FETCH in JPQL β eager just for this query
em.createQuery("SELECT o FROM Order o JOIN FETCH o.lineItems", Order.class)

// 2. Entity graph β declare the graph at query time
EntityGraph<?> g = em.createEntityGraph(Order.class);
g.addAttributeNodes("lineItems");
em.createQuery("FROM Order").setHint("jakarta.persistence.fetchgraph", g)

// 3. @BatchSize on the association β Hibernate IN-query batches
@OneToMany @BatchSize(size = 50) private List<LineItem> lineItems;

// 4. DTO projection β bypass entities entirely, flat result rows
em.createQuery("SELECT new OrderSummary(o.id, COUNT(l)) FROM Order o LEFT JOIN o.lineItems l GROUP BY o.id", OrderSummary.class)
```

Detecting N+1:
- `hibernate.generate_statistics=true` β `statistics.getQueryExecutionCount()` in tests.
- datasource-proxy / p6spy to count emitted statements.
- A Hibernate `StatementInspector` registered via `hibernate.session_factory.statement_inspector`.
@Version β optimistic locking β
```java
@Entity class Order {
    @Id Long id;
    @Version Long version;
    String status;
}
```

Every UPDATE issued by Hibernate appends `AND version = ?`. If affected rows = 0, it throws `OptimisticLockException`. Classic use: long-running user edits (open form at 9:00, save at 9:05; someone else saved at 9:03 β you get a conflict instead of silently overwriting).
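Under the hood this is a compare-and-set on the version column. An in-memory sketch of the `UPDATE ... WHERE id = ? AND version = ?` semantics (illustrative, not Hibernate code):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Simulates the versioned UPDATE Hibernate emits: the write succeeds only
// if the row still carries the version the transaction originally read.
final class VersionedTable {
    record Row(String status, long version) {}
    private final Map<Long, Row> rows = new ConcurrentHashMap<>();

    void insert(long id, String status) { rows.put(id, new Row(status, 0)); }
    Row read(long id) { return rows.get(id); }

    // UPDATE orders SET status = ?, version = version + 1
    //   WHERE id = ? AND version = ?   -- 0 rows affected β conflict
    boolean update(long id, String newStatus, long expectedVersion) {
        Row cur = rows.get(id);
        if (cur == null || cur.version() != expectedVersion) return false;
        rows.put(id, new Row(newStatus, expectedVersion + 1));
        return true;
    }
}
```

A `false` return is where Hibernate would raise `OptimisticLockException`.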
DTO projections vs entities β
Rule of thumb: entities for writes, DTO projections for reads. Reasons:
- Entities are fully-populated graphs β over-fetching by default.
- Entities carry lazy proxies β LazyInitializationException outside a session.
- Entities with an open persistence context auto-flush dirty state β danger in read-only code.
- DTOs select only the columns you need β smaller query, better plan.
```java
// Interface projection
interface OrderSummary { Long getId(); Integer getLineCount(); }

List<OrderSummary> find(Long userId);
```

Native queries β
```java
@Query(value = "SELECT * FROM orders WHERE jsonb_col @> CAST(:filter AS jsonb)", nativeQuery = true)
List<Order> search(@Param("filter") String filter);
```

Use for DB-specific features (JSONB operators, window functions, hints). Downside: less portable, no compile-time checking.
Spring @Transactional and JPA β
- Spring manages the `EntityManager` + transaction via proxy (same caveats as any Spring AOP – §86 in your Spring study).
- Default propagation `REQUIRED` – joins an existing transaction or starts a new one.
- Default isolation = DB default (read committed for Postgres).
- Default rollback = unchecked exceptions only. Set `rollbackFor = Exception.class` for checked.
- Self-invocation bypasses the proxy – `this.someMethod()` won't start a new transaction.
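The self-invocation caveat is plain Java, not Spring magic. A minimal JDK dynamic-proxy sketch – the names are illustrative, and the handler stands in for Spring's transaction advice:

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;
import java.util.ArrayList;
import java.util.List;

interface Service { void outer(); void inner(); }

class ServiceImpl implements Service {
    public void outer() { inner(); }  // self-invocation: a plain Java call, no proxy in between
    public void inner() {}
}

class ProxyDemo {
    static List<String> intercepted = new ArrayList<>();

    static Service proxied(Service target) {
        InvocationHandler h = (proxy, method, args) -> {
            intercepted.add(method.getName());   // stands in for "begin/commit transaction"
            return method.invoke(target, args);
        };
        return (Service) Proxy.newProxyInstance(
                Service.class.getClassLoader(), new Class<?>[]{Service.class}, h);
    }
}
```

Calling `outer()` through the proxy records one interception; the internal call to `inner()` never crosses the proxy, exactly like `this.someMethod()` skipping `@Transactional` advice.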
Interview Qs covered β
§27 addresses Q8 (N+1 – Hibernate-specific fixes).
28. Spring Data Specifics β
JpaRepository method-name parsing β
```java
interface OrderRepo extends JpaRepository<Order, Long> {
    List<Order> findByUserIdAndStatus(Long userId, Status status);
    Optional<Order> findTopByUserIdOrderByCreatedAtDesc(Long userId);
    Stream<Order> streamByStatus(Status status); // returns a resource-backed stream
    List<Order> findDistinctByUserIn(List<Long> ids);
    int countByStatus(Status status);
}
```

Spring Data parses method names into queries (`findBy`, `readBy`, `queryBy`, `existsBy`, `countBy`, `deleteBy`). Keywords: `And`, `Or`, `Between`, `LessThan`, `Like`, `IgnoreCase`, `OrderBy`, `In`, `NotIn`, `True`, `False`, `IsNull`, `IsNotNull`.
Prefer `@Query` for anything non-trivial – the method-name grammar becomes hard to read beyond three or four conjuncts.
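To see the idea behind derived queries, here is a toy sketch of the parsing – emphatically not Spring Data's real parser (which handles `Or`, nested properties, operators, and much more):

```java
import java.util.Arrays;
import java.util.List;

// Toy illustration: strip the verb prefix, split the criteria on "And",
// and render a WHERE clause with positional parameters. Property names
// containing "And" would break this – one reason the real grammar is subtle.
class MethodNameParser {
    static String toWhereClause(String methodName) {
        String criteria = methodName.replaceFirst(
                "^(find|read|query|count|exists|delete)(Distinct)?(Top\\d*)?By", "");
        List<String> parts = Arrays.asList(criteria.split("And"));
        StringBuilder sb = new StringBuilder("WHERE ");
        for (int i = 0; i < parts.size(); i++) {
            if (i > 0) sb.append(" AND ");
            String p = parts.get(i);
            // lowerCamel the property name, bind to a positional parameter
            sb.append(Character.toLowerCase(p.charAt(0))).append(p.substring(1))
              .append(" = ?").append(i + 1);
        }
        return sb.toString();
    }
}
```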
@Query annotations β
```java
// JPQL (entity-oriented, portable)
@Query("SELECT o FROM Order o WHERE o.user.id = :uid AND o.total > :min")
List<Order> findBigOrders(@Param("uid") Long uid, @Param("min") BigDecimal min);

// Native SQL
@Query(value = "SELECT * FROM orders WHERE jsonb_col @> CAST(?1 AS jsonb)", nativeQuery = true)
List<Order> searchJson(String filter);

// Modifying (non-SELECT)
@Modifying
@Query("UPDATE Order o SET o.status = :s WHERE o.id = :id")
int updateStatus(@Param("id") Long id, @Param("s") Status s);
```

`@Modifying` queries don't touch the persistence context – managed entities already in memory won't see the change. Call `em.clear()` afterwards (or use `@Modifying(clearAutomatically = true)`) if the caller then reads the same entities.
Transaction propagation levels in Spring Data β
```java
@Transactional(propagation = Propagation.REQUIRED)      // default
@Transactional(propagation = Propagation.REQUIRES_NEW)  // suspend caller, new tx
@Transactional(propagation = Propagation.NESTED)        // savepoint within caller
@Transactional(propagation = Propagation.SUPPORTS)      // join if exists, no tx otherwise
@Transactional(propagation = Propagation.MANDATORY)     // must be in a tx; else error
@Transactional(propagation = Propagation.NEVER)         // must NOT be in a tx
@Transactional(propagation = Propagation.NOT_SUPPORTED) // suspend any caller tx
```

Interview staple – "give a use case for each":

- `REQUIRES_NEW` – audit-log insert that must survive even if the caller rolls back.
- `NESTED` – optional step where failure rolls back just the sub-work.
- `MANDATORY` – utility method asserting a caller already opened a tx.
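The REQUIRES_NEW behavior can be sketched with a toy buffer-per-transaction model. `TxDemo` is hypothetical – not Spring's transaction manager – but it shows why the audit row survives:

```java
import java.util.ArrayList;
import java.util.List;

// Toy model: each "transaction" buffers its writes; commit copies the buffer
// into `committed`, a thrown exception discards it (rollback).
class TxDemo {
    static List<String> committed = new ArrayList<>();

    interface Work { void run(List<String> buffer); }

    // Runs work in its own independent transaction (REQUIRES_NEW-style).
    static void inNewTx(Work w) {
        List<String> buffer = new ArrayList<>();
        try {
            w.run(buffer);
            committed.addAll(buffer);   // commit
        } catch (RuntimeException e) {
            // rollback: buffer discarded, nothing reaches `committed`
            throw e;
        }
    }
}
```

Nesting one `inNewTx` inside another gives the classic audit-log story: the inner "transaction" commits on its own before the outer one blows up.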
Repository projection types β
- Interface-based – lightweight; Spring generates a proxy returning only the selected fields.
- Class-based (DTO) – `new DtoClass(...)` in JPQL.
- Dynamic – `<T> List<T> findByStatus(Status s, Class<T> type);` – the caller picks the projection.
Spring Data MongoDB β
- `MongoTemplate` – imperative, hand-built queries.
- `MongoRepository<T, ID>` – derived queries like JPA.
- `@Query` on Mongo repos takes a JSON filter string: `@Query("{ 'status': ?0 }")`.
- `@Aggregation` – declare an aggregation pipeline.
- Transactions – require a replica set; `@Transactional` works but mind the limits (§16).
Spring Data DynamoDB (via AWS SDK Enhanced Client) β
Spring Data DynamoDB (community) exists but is unofficial; the AWS-blessed path is the Enhanced Client with `DynamoDbTable<Entity>`.
```java
DynamoDbTable<Order> table = enhancedClient.table("orders", TableSchema.fromBean(Order.class));
table.putItem(order);
Order o = table.getItem(Key.builder().partitionValue("alice").sortValue("O#42").build());

QueryConditional q = QueryConditional.keyEqualTo(k -> k.partitionValue("alice"));
PageIterable<Order> pages = table.query(r -> r.queryConditional(q).limit(20));
```

Annotate POJOs with `@DynamoDbBean`, `@DynamoDbPartitionKey`, `@DynamoDbSortKey`, `@DynamoDbSecondaryPartitionKey(indexNames = "GSI1")`.
29. Observability & Troubleshooting β
When a database is slow or failing, you debug with metrics and logs – not by squinting at code.
Postgres observability β
Key catalog / stats views:
| View | What it tells you |
|---|---|
| `pg_stat_activity` | Active connections, current query, state, wait events – "what's running right now?" |
| `pg_stat_user_tables` | Per-table seq/index scans, live/dead tuples |
| `pg_stat_user_indexes` | Per-index scan counts – find unused indexes |
| `pg_stat_statements` (extension) | Aggregated query stats: total time, mean, calls, rows |
| `pg_locks` | Current locks held and waited on |
| `pg_stat_bgwriter` | Checkpoint stats; buffer writes |
| `pg_stat_replication` | Per-replica state and lag |
Example: find slowest queries:
```sql
SELECT query, calls, mean_exec_time, total_exec_time
FROM pg_stat_statements
ORDER BY total_exec_time DESC LIMIT 10;
```

Slow query log – `log_min_duration_statement = 1000` logs every query > 1s. Essential in prod.
MongoDB observability β
- Database profiler – `db.setProfilingLevel(1, { slowms: 100 })` logs operations > 100 ms into `system.profile`.
- `db.currentOp()` – active operations; similar to `pg_stat_activity`.
- `db.serverStatus()` – connection counts, network, opcounters, WiredTiger cache.
- `mongotop` – per-collection read/write times.
DynamoDB observability β
- CloudWatch metrics – `ConsumedReadCapacityUnits`, `ConsumedWriteCapacityUnits`, `ThrottledRequests`, `SuccessfulRequestLatency`.
- Contributor Insights – top partition keys by traffic (hot-key detector).
- CloudTrail – per-operation audit.
Three-pillars recap for DBs β
- Logs – slow query log, connection errors, deadlock reports.
- Metrics – connections, QPS, mean/p99 latency, cache hit rate, replication lag, disk I/O.
- Traces – OpenTelemetry DB spans (`db.statement`, `db.system`, `db.name`) linking app spans to SQL.
Alerting SLIs β
Good DB SLIs to track (with alert thresholds):

- Connection saturation – `pool_in_use / pool_max`.
- Replication lag – alert at > 30 s.
- Deadlock rate – > 0 is a smell.
- Error rate – failing statements/sec.
- p99 latency – correlates with user experience.
- Disk free – alert at 20% remaining.
- Bloat ratio (Postgres) – > 30% dead tuples means VACUUM isn't keeping up.
Avoid alerting on resource utilization alone (CPU 80% isn't a problem if latency is fine). Alert on symptoms that affect users.
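The threshold checks above can be sketched as code. Threshold values mirror the list; the class and method names are illustrative, not a real monitoring API:

```java
// Symptom-first alerting sketch: alert on saturation, lag, and disk headroom,
// not on raw CPU. Thresholds match the SLI list above.
class DbAlerts {
    static boolean connectionSaturated(int inUse, int poolMax) {
        return (double) inUse / poolMax > 0.9;  // > 90% of the pool in use
    }
    static boolean replicationLagging(double lagSeconds) {
        return lagSeconds > 30;                 // the "> 30 s" rule of thumb
    }
    static boolean diskLow(double freeFraction) {
        return freeFraction < 0.20;             // alert at 20% free
    }
}
```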
30. Connect to Your Experience β
This section is for tying study to stories. Every answer is stronger when anchored in something you've actually built.
Anchor example: legacy MQ microservice β messaging meets DB β
- Processed 10k+ transactions/day from IBM MQ, routed to downstream services.
- DB decision: each transaction needs idempotent storage (exactly-once into downstream systems). Natural fit: insert into a claim/outbox table keyed on MQ message ID, with a UNIQUE constraint for dedup.
- Transactions: XA or the outbox pattern (§6) – MQ ack + DB insert must be atomic. Outbox is usually the right call over XA because XA is brittle.
- Isolation: REPEATABLE READ or SERIALIZABLE for the consumer's claim-then-process loop; otherwise double-processing on a pod crash.
Anchor example: cross-team schema standardization β databases of events β
- Avro schemas are the write-time schema check for Kafka – analogous to `CHECK` constraints + column types in a DB.
- Schema evolution rules (backward / forward / full) map to §7–8 from data-format: add optional = backward-compatible, remove required = breaking.
- The schema registry is itself a small metadata store – Confluent's persists to a Kafka topic; others (e.g., Apicurio) can back onto Postgres.
Anchor example: JAXB migration β XML / XSD as schema β
- XSD is to XML what DDL is to tables. Strict validation on parse = fail-fast.
- Thymeleaf was string-templating; JAXB is schema-driven serialization. Type-safe, validated, versioned.
Anchor example: regulated-environment compliance β where DB design meets the auditor β
- Encryption at rest with FIPS 140-2 validated modules – drives TDE / KMS choices.
- Audit retention – §24 append-only audit table pattern.
- Row-level security for multi-tenant sharing – §25.
- Backup medium classification β backup of regulated data can't flow to unvetted storage without authorization.
Anchor example: productivity app (Spring Boot + MongoDB) β
- MongoDB fits flexible habit/rule/notification documents.
- `@DataMongoTest` with Testcontainers – integration tests against real Mongo, not a mock.
- Good example for "when MongoDB makes sense": schema varies per user, documents are the natural unit, no cross-entity invariants demanding SQL transactions.
Anchor example: intern mentorship β 30% PR defect reduction β
- If databases come up: TDD focus + code review catches N+1, missing indexes, unsafe string concatenation (SQL injection), unscoped transactions.
- Concrete checklist item: "every new query has an EXPLAIN plan in the PR description for large tables."
31. Rapid-Fire Review β
One-liner cheat answers for the morning-of. In TOC order.
- OLTP vs OLAP – small tx-heavy vs large scan-heavy; row vs column storage.
- Surrogate key default – prefer a synthetic PK (BIGSERIAL or UUIDv7); enforce natural uniqueness as UNIQUE.
- NULL gotcha – `= NULL` never matches; use `IS NULL`. `NOT IN` is broken by any NULL in the subquery.
- Inner vs outer joins – inner drops unmatched rows; outer keeps them with NULLs on the missing side.
- Join algorithms – nested loop (small outer + indexed inner), hash (equi-join, memory-bound), merge (pre-sorted, large).
- CTE – named sub-query; Postgres 12+ inlines non-recursive single-reference CTEs unless `AS MATERIALIZED`.
- Window functions – compute per row over a partition without collapsing; `ROW_NUMBER`, `RANK`, `LAG`, aggregates OVER.
- 3NF – no partial deps (2NF), no transitive deps (3NF). BCNF stricter (every determinant is a superkey).
- Denormalize – when reads dominate, joins hurt latency, or staleness is acceptable.
- ACID – Atomicity (all or nothing), Consistency (constraints hold), Isolation (concurrent tx appear serial), Durability (committed survives crash).
- Isolation levels – RU → RC → RR → Serializable; each prevents more anomalies (dirty → non-repeatable → phantom → write skew).
- MVCC – writers don't block readers; old versions live until VACUUM (Postgres) or undo purge.
- Pessimistic lock – `SELECT FOR UPDATE`; best for contended hot rows. Optimistic – `@Version`; retry on conflict; best for rare conflicts.
- Deadlock – always lock rows in a consistent order.
- B-tree vs hash vs GIN – B-tree for range + equality, hash for equality only, GIN for arrays / JSONB / full-text.
- Composite index prefix rule – index `(a,b,c)` supports `a`, `a+b`, `a+b+c` – not `b` alone.
- Covering index – query answered from the index alone; no heap visit. `INCLUDE (col)` in Postgres 11+.
- LSM vs B-tree – LSM write-optimized with background compaction; B-tree read-optimized with in-place update.
- WAL – write-ahead log; every change logged before the heap update → atomicity + durability.
- VACUUM – reclaims dead tuples, prevents xid wraparound, enables index-only scans.
- EXPLAIN ANALYZE – actual rows/timings per node; read bottom-up. Watch for Seq Scan on big tables + row estimates off by 10x (stale stats).
- N+1 fix – JOIN FETCH / @EntityGraph / @BatchSize / DTO projection.
- Sync vs async replication – sync = no loss + slower commit; async = fast + lag window.
- CAP – during a partition, pick C or A. PACELC adds: even without a partition, pick Latency or Consistency.
- Quorum – `R + W > N` for strong consistency.
- Sharding – consistent hashing to avoid mass re-homing; avoid hot shards via a write-sharding suffix.
- MongoDB shard key – high-cardinality, even distribution, matches queries, non-monotonic.
- Mongo transactions – supported 4.0+; limits (timeouts, oplog pressure); prefer embedding instead.
- DynamoDB PK/SK – items with the same PK live together, sorted by SK. Single-table design = multiple entity types in one table.
- GSI vs LSI – GSI any PK, created anytime, eventually consistent; LSI same PK, created at table creation, can be strongly consistent.
- Cache-aside – app reads/writes the cache; miss → DB → populate. Standard pattern.
- Invalidation strategies – TTL, delete-on-write, write-through, event-driven (CDC), version tags.
- Thundering herd – distributed lock on miss, probabilistic early expiry, or stale-while-revalidate.
- Star schema – fact + dimensions; denormalized for OLAP speed.
- SCD 2 – track history with effective_from/to + is_current on the dimension.
- ELT > ETL in modern warehouses – Snowflake/BigQuery/Redshift make in-DB transform cheap; dbt is the idiom.
- Inverted index – term → doc list; the basis of Elasticsearch.
- Expand-contract migration – add new + dual-write + backfill + switch + remove old. Every step backward-compatible.
- Online index – Postgres `CREATE INDEX CONCURRENTLY`; MySQL `ALGORITHM=INPLACE, LOCK=NONE`.
- PITR – full backup + continuous WAL archive = restore to any timestamp.
- RPO / RTO – how much data you can lose / how long you can be down. Drives architecture.
- Parameterized queries – separate SQL from values; the only safe defense against injection.
- RLS – filter rows at the engine level by session context; multi-tenant superpower.
- HikariCP pool sizing – start at ~(cores × 2) + 1; tune by load test.
- Statement timeout – cap query runtime at the connection level.
- PgBouncer transaction pooling – high-density connection sharing; forbids session-state features.
- Hibernate L1 cache – per persistence context (session), automatic, same-object guarantee. L2 – optional, process-wide, careful with clusters.
- LAZY default – always use LAZY on associations; fetch eagerly with JOIN FETCH when needed.
- `@Transactional` caveats – self-invocation bypasses the proxy; only unchecked exceptions roll back by default; `rollbackFor = Exception.class` to change.
- `@Version` – optimistic locking; `AND version = ?` added to the UPDATE; `OptimisticLockException` on 0 rows affected.
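The quorum one-liner lends itself to a worked check. A minimal sketch (hypothetical helper, pigeonhole argument in the comment):

```java
// With N replicas, read sets of size R and write sets of size W:
// any read set must overlap any write set iff R + W > N, so every
// quorum read sees the latest quorum-committed write.
class Quorum {
    static boolean strongReads(int n, int r, int w) {
        return r + w > n; // |R-set| + |W-set| > |replicas| forces an intersection
    }
}
```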
32. Practice Exercises β
Work through these a few days before the interview. Answer out loud; draw diagrams.
Exercise 1 β write a window-function query β
Given orders(id, user_id, total, created_at), return each user's top 3 orders by total, with rank. Now do it without window functions (correlated subquery).
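To sanity-check your SQL answer, here is a hedged in-memory equivalent of `ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY total DESC) <= 3` – the `Order` record and `Top3` class are hypothetical scaffolding, not part of the exercise:

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Group by user (the PARTITION BY), sort each group by total descending
// (the ORDER BY), keep the first three rows (rank <= 3).
class Top3 {
    record Order(long id, long userId, double total) {}

    static Map<Long, List<Order>> top3PerUser(List<Order> orders) {
        return orders.stream()
                .collect(Collectors.groupingBy(Order::userId,
                        Collectors.collectingAndThen(Collectors.toList(),
                                l -> l.stream()
                                      .sorted(Comparator.comparingDouble(Order::total).reversed())
                                      .limit(3)
                                      .toList())));
    }
}
```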
Exercise 2 β design a schema β
A read-write book tracking app. Users can:
- Mark books read / in-progress / want-to-read.
- Rate books 1β5 and write reviews.
- Follow other users; see their recent activity.
Requirements: 3NF for the core, then identify one denormalization you'd make for the activity feed and explain why.
Exercise 3 β read an EXPLAIN plan β
```
Nested Loop  (cost=0..50000 rows=1 width=40) (actual time=0.5..9800 rows=1 loops=1)
  ->  Seq Scan on orders o  (cost=0..15000 rows=10000 width=16) (actual time=0.05..150 rows=100000 loops=1)
        Filter: (status = 'active')
        Rows Removed by Filter: 900000
  ->  Index Scan using idx_users_id on users u  (cost=0.5..3.5 rows=1 width=24) (actual time=0.09..0.09 rows=1 loops=100000)
        Index Cond: (id = o.user_id)
```

Questions:
- What's wrong?
- What would you change?
- What stats or indexes are missing?
(Hint: stats off by 10x, 900k rows filtered on status, 100k inner loops. Partial index on WHERE status = 'active', or a compound index.)
Exercise 4 β DynamoDB access pattern β
Design a single-table schema for a project-management app:
- Workspace → many Projects → many Tasks.
- Query: list all workspaces for a user.
- Query: list all projects for a workspace.
- Query: list all tasks in a project, by due date.
- Query: list all tasks assigned to a user across workspaces.
- Query: full-text search on task title.
What are PK/SK patterns? Which queries need a GSI? Which need a different tool (e.g., Elasticsearch)?
Exercise 5 β spot the N+1 β
```java
@GetMapping("/orders")
List<OrderDTO> recent() {
    return orderRepo.findTop100ByOrderByCreatedAtDesc().stream()
            .map(o -> new OrderDTO(
                    o.getId(),
                    o.getUser().getEmail(),
                    o.getLineItems().size(),
                    o.getLineItems().stream().mapToDouble(LineItem::getPrice).sum()))
            .toList();
}
```

How many queries does this emit in the worst case? How would you fix it?
Exercise 6 β pick the isolation level β
Use case 1: online shop; place order, decrement stock. Use case 2: analytics dashboard; multi-table aggregate that must be self-consistent. Use case 3: scheduler; assign a job to a worker from a pool, no double-assignment.
Match each to the minimum isolation level (or alternative: advisory lock, SELECT FOR UPDATE SKIP LOCKED, etc.) and defend.
Exercise 7 β zero-downtime rename β
Rename events.user (TEXT) to events.user_id (BIGINT, FK to users). There are 500M rows. The app has 20 microservices touching it. Write the migration plan with deploy stages.
Exercise 8 β cache strategy β
Product page renders 5 database queries. Avg latency 250ms, p99 800ms. Users refresh often. Write-to-read ratio is 1:1000.
Design a caching strategy. What's the cache key? TTL? Invalidation mechanism? What if two requests miss at the same time?
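One possible shape for the answer: cache-aside with per-key single-flight, so simultaneous misses collapse into one load. This is a single-JVM sketch (`computeIfAbsent` is atomic per key); across instances you'd need a distributed lock or stale-while-revalidate. The class name is illustrative:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Cache-aside: read the cache, on miss load from the "DB" and populate.
// computeIfAbsent locks per key, so two concurrent misses trigger one load.
class SingleFlightCache<K, V> {
    private final Map<K, V> cache = new ConcurrentHashMap<>();
    final AtomicInteger loads = new AtomicInteger();   // exposed for the demo only

    V get(K key, Function<K, V> loader) {
        return cache.computeIfAbsent(key, k -> {
            loads.incrementAndGet();                   // the "DB query", at most once per key
            return loader.apply(k);
        });
    }

    void invalidate(K key) { cache.remove(key); }      // delete-on-write hook
}
```

With a 1:1000 write-to-read ratio, delete-on-write invalidation plus a short TTL as a safety net is a reasonable starting point.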
Exercise 9 β replica lag diagnosis β
Users report: "I place an order but it doesn't appear in my history until I refresh 5 seconds later."
The app uses a read-replica for SELECTs. Describe the failure mode, options to fix, and tradeoffs.
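One of the fix options can be sketched: pin a user's reads to the primary for a grace window after their own write ("read-your-writes" stickiness). Names and the time-as-parameter style are illustrative:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Route a user's reads to the primary while their latest write may still be
// missing on the lagging replica; everyone else keeps reading the replica.
class ReadRouter {
    private final Map<String, Long> lastWriteMillis = new ConcurrentHashMap<>();
    private final long graceMillis;

    ReadRouter(long graceMillis) { this.graceMillis = graceMillis; }

    void recordWrite(String userId, long nowMillis) {
        lastWriteMillis.put(userId, nowMillis);
    }

    String routeRead(String userId, long nowMillis) {
        Long last = lastWriteMillis.get(userId);
        return (last != null && nowMillis - last < graceMillis) ? "primary" : "replica";
    }
}
```

Tradeoff: extra read load lands on the primary for the grace window, and the window must exceed worst-case replica lag to actually hide it.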
Exercise 10 β CAP in the wild β
You're designing a global leaderboard for a game. Writes: score updates from 50+ regions. Reads: top-100 display every second to millions.
What's your consistency model? What engine? What's the RPO? How do you handle a region partition?
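One defensible direction leans AP with convergent merges: each region keeps its own best-score map, and merging takes the max per player, which is commutative and idempotent, so regions can sync in any order (and replay) after a partition heals. This is an illustrative sketch, not any specific engine's API:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Per-region best-score maps merged with max(): a grow-only, conflict-free
// merge, so the global leaderboard converges regardless of sync order.
class Leaderboard {
    static Map<String, Integer> merge(List<Map<String, Integer>> regions) {
        Map<String, Integer> merged = new HashMap<>();
        for (Map<String, Integer> region : regions) {
            region.forEach((player, score) -> merged.merge(player, score, Math::max));
        }
        return merged;
    }
}
```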
This guide is a living document. Update it as you learn. The best preparation is working real examples out loud – don't just re-read.