Distributed Systems, Messaging & Microservices – Study Guide
Companion to INTERVIEW_PREP.md (sections 4 and 5). Audience: mid-senior Java/Spring backend engineers with experience in Kafka / IBM MQ / ActiveMQ, ArgoCD, and Kubernetes. Purpose: Go from "I know the bullet points" to "I can whiteboard this in 45 minutes."
Every section follows the same pattern:
- Concepts – the why and how, in plain English, with diagrams where they help.
- Code – Java/Spring snippets where they illustrate the point.
- Q&A – interview-style questions with 1–3-paragraph model answers. Every answer ties back to something you've actually built where possible.
Study rule of thumb: for every topic in this guide, try to anchor your answer to a concrete example from your own work – a legacy MQ microservice at scale (e.g., 10k+ tx/day), cross-team schema standardization, a Thymeleaf → JAXB migration, a GitOps-driven deploy rollout, a mentorship program. Specifics beat generalities in every interview that matters.
TABLE OF CONTENTS
- How to use this guide
- Distributed Systems Fundamentals
- Replication, Partitioning & Consensus
- Messaging Fundamentals (broker-agnostic)
- Kafka Deep Dive
- IBM MQ / ActiveMQ / RabbitMQ / JMS
- Microservices Architecture & Patterns
- Resilience & Failure Modes
- Observability in Distributed Systems
- System Design Drills
- Exercise Appendix
- Anchor Example Cheat Sheet
Part 0 – How to use this guide
Scope
This guide merges and expands sections 4 (Messaging) and 5 (Microservices & Distributed Systems) of INTERVIEW_PREP.md. It is not a replacement – INTERVIEW_PREP.md is still your top-level question bank. Treat this document as the deep-dive companion for your two strongest and most-probed topic areas.
Out of scope here (future guides):
- Pure system-design walkthroughs at more than "mid-depth" (→ SYSTEM_DESIGN.md).
- Core Java & Spring internals (→ JAVA_SPRING.md).
- Observability tooling mechanics beyond propagation (→ OBSERVABILITY.md).
- Security / regulated-environment specifics (→ SECURITY.md).
- Kubernetes/ArgoCD/Helm deep dives (→ DEVOPS_K8S.md).
How to study
- First pass (read-through): Read each Part's Concepts section. Don't try to memorize – build the mental model.
- Second pass (drill): Go through the Q&A. Answer out loud before reading the model answer. Mark weak spots.
- Third pass (apply): Do the Part 9 system-design drills. Draw them on a whiteboard or in Excalidraw without looking.
- Day-of: Skim the flash-card appendix and the hero-story cheat sheet (Part 11).
How Q&A answers are structured
Each model answer has three parts, either explicitly or implicitly:
- The short answer (the one sentence they need to hear).
- The "why" or mechanism (what they'll probe on).
- Your personal hook (tie to a concrete project anchor – makes you memorable).
Anchor examples – quick map
| Anchor | Use for prompts about… |
|---|---|
| Legacy MQ microservice (10k+ tx/day) | Kafka/MQ ingest, ordering, idempotency, DLQ, throughput, reliability |
| Cross-team schema standardization | Schema evolution, cross-team technical leadership, contract design |
| Thymeleaf → JAXB migration (schema-bound XML) | Type-safe data formats, schema-first dev, inter-system data exchange |
| IBM MQ → AWS ActiveMQ bridge | Cross-system messaging, dedup, failover, cutover strategy |
| ArgoCD 99% deploy success | GitOps, reliability engineering, delivery metrics |
| Intern mentorship program | Mentorship, code review, sprint velocity |
Full detail map in Part 11.
Part 1 – Distributed Systems Fundamentals
1.1 What makes a system "distributed"
A system is distributed when correctness depends on multiple machines cooperating over an unreliable network. That definition has three sharp edges:
- No shared clock. Two machines cannot agree on "now" to better than a few milliseconds even with NTP. Any algorithm that depends on exact ordering must use logical or hybrid clocks, not wall-clock time.
- Partial failure. A single machine either works or doesn't – a distributed system has machines that are up, machines that are down, and machines you can't tell about because the network dropped your packets. "I didn't get a reply" ≠ "the remote side didn't process the request."
- The network is adversarial. Packets are lost, reordered, duplicated, delayed arbitrarily, or silently dropped by middleboxes. Any property you care about (ordering, delivery, consistency) has to be enforced end-to-end – the network won't do it for you.
Everything else in this guide – replication, consensus, idempotency, retries, DLQs, saga compensation – exists because of these three facts.
1.2 The 8 Fallacies of Distributed Computing
Peter Deutsch's list (1994, extended later). Every single one has bitten real systems. If you can name all 8 and give an example for each, you're already ahead of 80% of candidates.
| # | Fallacy | Example |
|---|---|---|
| 1 | The network is reliable | TCP still loses data on RST mid-stream; retries are your problem |
| 2 | Latency is zero | A cross-region round-trip is ~80 ms – too slow for synchronous calls in a hot path |
| 3 | Bandwidth is infinite | Large payloads across a service mesh at peak load exhaust NICs before CPUs |
| 4 | The network is secure | Always assume MITM; use mTLS inside the mesh, not just at the edge |
| 5 | Topology doesn't change | Pods move, IPs change, Kubernetes rolling updates shuffle everything |
| 6 | There is one administrator | Multi-team, multi-cluster, multi-cloud – no single throat to choke |
| 7 | Transport cost is zero | Serializing 100 KB of JSON 10× per request adds real CPU time and GC pressure |
| 8 | The network is homogeneous | On-prem MQ + AWS ActiveMQ + SaaS vendor APIs – each has different timeout/retry semantics |
1.3 CAP theorem – stated precisely
Formal statement: in the presence of a network partition, a distributed data store must choose between consistency (every read sees the most recent write, system-wide) and availability (every request gets a non-error response) – it cannot guarantee both.
What CAP is not:
- It's not "pick two." Partitions are not optional – the network will partition. You really choose CP or AP during partitions.
- It's not about latency. A CP system that delays writes during a partition is still "consistent."
- It's not a static property. A system can be CP for writes and AP for reads, or CP in one region and AP across regions.
Real examples:
- CP during partition: ZooKeeper, etcd, HBase. Refuses writes (or the partitioned minority refuses them) rather than let the two sides diverge.
- AP during partition: Cassandra, DynamoDB (by default), Riak. Accepts writes on both sides of the split; reconciles later.
- Kafka: usually framed as CP – `acks=all` + `min.insync.replicas=2` on a 3-replica partition will refuse writes if 2 of 3 brokers are unreachable.
1.4 PACELC – the honest extension
CAP only describes behavior during a partition. PACELC (Abadi, 2012) says: if Partition → Availability vs Consistency; Else (normal ops) → Latency vs Consistency.
Even with no partitions, strong consistency costs latency. Synchronous replication to 3 replicas across 3 AZs adds ~5 ms to every write. Eventual consistency lets you reply from the local replica at ~1 ms. PACELC makes the always-on tradeoff explicit.
| System | PACELC class | Meaning |
|---|---|---|
| MongoDB (default) | PA/EC | AP during partition; prefers consistency normally |
| Cassandra | PA/EL | AP during partition; prefers low latency normally (tunable) |
| DynamoDB | PA/EL | Same (eventually-consistent reads are default) |
| HBase / BigTable | PC/EC | Favors consistency in both regimes |
1.5 Consistency models – from strongest to weakest
This is a ladder. Stronger = easier to reason about, harder to scale. Weaker = easier to scale, harder to reason about.
Linearizable
 └─ Sequential
      └─ Causal
           ├─ Read-your-writes
           ├─ Monotonic reads
           ├─ Monotonic writes
           └─ Writes-follow-reads
                └─ Eventual
- Linearizable (a.k.a. strong / atomic): every operation appears to take effect instantaneously at some point between invocation and response, in a single global total order consistent with real time. Implemented via consensus (Raft, Paxos). Examples: etcd, ZooKeeper, Spanner (via TrueTime).
- Sequential: global total order, but not necessarily tied to real time. Operation X may appear to happen before Y even if Y started first in wall-clock time, as long as all nodes agree.
- Causal: if A causally precedes B (A was read before B was written), every node sees A before B. Concurrent operations may be seen in any order.
- Read-your-writes: after you write X, your next read returns X (or later). Common for user-facing systems – nothing's worse than updating your profile and not seeing it.
- Monotonic reads: you never see the system go backwards in time.
- Monotonic writes: your writes apply in the order you submitted them.
- Eventual: if writes stop, all replicas eventually converge. Zero guarantees about when or what you read in the meantime.
When each is enough:
- Financial ledger with real money → linearizable.
- User-profile edits, shopping cart → read-your-writes + monotonic reads.
- Social-media feed, like counts → eventual.
1.6 Time & ordering
Since we can't use wall-clock time, we use logical clocks.
Lamport clocks – each process has a counter. Increment on a local event; on send, attach the counter; on receive, set local = max(local, received) + 1. Gives a total order that respects causality, but can't distinguish concurrent from causal events.
Vector clocks – each process maintains a vector V[i] of counters, one per process. On a local event, bump your own slot. On send, attach the vector. On receive, take the pairwise max, then bump your slot. Two events A and B are concurrent iff V(A) < V(B) is false AND V(B) < V(A) is false. Used by Dynamo/Riak to detect conflicting writes.
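The two schemes above can be sketched in plain Java. This is a minimal illustration; the class and method names are ours, not from any library:

```java
// Lamport clock: total order respecting causality, but cannot detect concurrency.
final class LamportClock {
    private long counter = 0;

    // Local event or send: bump the counter; the result timestamps the event.
    synchronized long tick() { return ++counter; }

    // Receive: merge the sender's timestamp, then move strictly past it.
    synchronized long receive(long remote) {
        counter = Math.max(counter, remote) + 1;
        return counter;
    }
}

// Vector clock: can additionally tell concurrent events apart from causal ones.
final class VectorClock {
    final long[] v;      // one slot per process
    final int self;      // this process's slot

    VectorClock(int processes, int self) { this.v = new long[processes]; this.self = self; }

    void tick() { v[self]++; }                       // local event or send

    void receive(long[] remote) {                    // pairwise max, then bump own slot
        for (int i = 0; i < v.length; i++) v[i] = Math.max(v[i], remote[i]);
        v[self]++;
    }

    // A happened-before B iff A <= B pointwise with at least one strict inequality.
    static boolean happensBefore(long[] a, long[] b) {
        boolean strict = false;
        for (int i = 0; i < a.length; i++) {
            if (a[i] > b[i]) return false;
            if (a[i] < b[i]) strict = true;
        }
        return strict;
    }

    // Concurrent iff neither dominates: exactly the check a Lamport clock can't do.
    static boolean concurrent(long[] a, long[] b) {
        return !happensBefore(a, b) && !happensBefore(b, a);
    }
}
```

Note the asymmetry: a Lamport timestamp comparison can never prove two events were concurrent, while `VectorClock.concurrent` can.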
Hybrid Logical Clocks (HLC) – combine physical time (for human readability and rough ordering) with a logical counter (for tie-breaking within the same physical tick). Used by CockroachDB and MongoDB's cluster-wide clock.
Google TrueTime (Spanner) – uses GPS + atomic clocks in every datacenter to bound clock uncertainty to a few milliseconds. Spanner waits out the uncertainty window before committing, giving externally consistent (linearizable) transactions worldwide. Expensive, but it makes a lot of distributed-systems problems evaporate.
1.7 Failure detection
Heartbeats: process A pings process B every N seconds; if B doesn't respond for K misses, A declares B dead. Simple, but binary – tuning K is a fight between false positives (marking healthy nodes dead) and detection latency.
Phi-accrual detector (Cassandra, Akka): instead of binary dead/alive, outputs a suspicion level phi that rises over time since last heartbeat. Higher phi = higher confidence node is down. Each caller picks its own threshold (phi=8 for quick-but-noisy, phi=12 for patient).
SWIM (Scalable Weakly-consistent Infection-style Membership): each node periodically pings a random peer. If no response, it asks K other nodes to ping on its behalf before marking suspicious. Membership changes gossip through the cluster. Used by Consul, HashiCorp Serf, Hazelcast.
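A minimal sketch of the K-miss heartbeat detector described above, with the clock injected as a parameter so the logic is deterministic (all names are hypothetical):

```java
final class HeartbeatDetector {
    private final long intervalMs;      // expected heartbeat period
    private final int missesAllowed;    // K: consecutive misses before declaring death
    private long lastHeartbeatMs;

    HeartbeatDetector(long intervalMs, int missesAllowed, long nowMs) {
        this.intervalMs = intervalMs;
        this.missesAllowed = missesAllowed;
        this.lastHeartbeatMs = nowMs;
    }

    void onHeartbeat(long nowMs) { lastHeartbeatMs = nowMs; }

    // Binary verdict: tuning K trades false positives against detection latency.
    boolean suspectedDead(long nowMs) {
        return nowMs - lastHeartbeatMs > intervalMs * missesAllowed;
    }
}
```

Phi-accrual detectors replace the boolean `suspectedDead` with a continuously rising suspicion score, which is why each caller can pick its own threshold.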
1.8 FLP impossibility
Fischer-Lynch-Paterson (1985): in a fully asynchronous system with even one faulty process, there is no deterministic consensus algorithm that always terminates. Intuitively: you can't tell "slow" from "crashed," and a sufficiently adversarial scheduler can delay any protocol forever.
How real systems dodge it:
- Timeouts – accept that "slow" gets treated like "crashed" and reconfigure (elect a new leader, etc.). Raft does this.
- Randomization – randomized leader election (Ben-Or, Raft's randomized election timeouts) makes adversarial scheduling unlikely.
- Partial synchrony assumptions – assume the network is eventually well-behaved. Paxos stays correct even under asynchrony; it just might not terminate during a network storm.
The practical takeaway: consensus algorithms in production aren't "wrong," they've made explicit assumptions (bounded message delay, majority of nodes alive) that trade strict theoretical guarantees for forward progress.
1.9 Part 1 Q&A
Q1. Explain CAP with a concrete example.
CAP says a distributed store must sacrifice either consistency or availability when the network partitions. Concrete: imagine a Kafka cluster with three brokers spread across AZs, a topic with 3 replicas, `acks=all`, `min.insync.replicas=2`. If one AZ's network partitions, two brokers are still in the ISR – producers keep writing, consumers keep reading, consistency is preserved (CP). If two brokers become unreachable, only one ISR member remains, so Kafka refuses the write – it chose consistency over availability. Cassandra with `CL=ONE` would keep accepting writes on the minority side and reconcile later via read repair – AP.
The key nuance interviewers look for: CAP isn't "pick two" – partitions happen, so you really choose CP or AP during a partition. A legacy MQ microservice handling non-idempotent downstream effects is typically tuned to be CP: better to reject a message than double-process one when retries aren't safe.
Q2. What's the practical difference between linearizability and serializability?
Linearizability is about single-object operations appearing to happen instantaneously in real-time order. Serializability is about multi-object transactions appearing to happen in some serial order – not necessarily the real-time order. A system can be serializable but not linearizable (e.g., a database that serializes transactions in an order that ignores real time), or linearizable but not serializable (a single-register atomic store gives no multi-object transactions).
Spanner is famously both – "external consistency." Most SQL databases with SERIALIZABLE isolation aren't linearizable across the cluster; they just serialize the transactions they see.
Q3. When is eventual consistency acceptable and how do you explain it to a PM?
Acceptable whenever the business tolerates seeing stale data briefly. Likes, view counts, search index updates, recommendation scores β nobody notices if the number is off for a few seconds. Not acceptable when staleness drives wrong behavior: balances, inventory, permission checks, dedup, fraud rules.
To a PM I'd frame it as: "The change will show up on every screen within a few seconds – but briefly, two users looking at the same screen might see slightly different numbers. For a social feed that's invisible. For a bank balance, it's unacceptable." Then pick the model that matches the business risk.
Q4. Two processes both think they're the leader. What went wrong and how do you recover?
Classic split-brain. A network partition let process A keep thinking it was leader while B (on the other side of the partition) won a new election. When the partition heals, they both try to commit. Recovery requires a fencing token – a monotonically increasing number assigned at election time. Any resource (DB, lock server) rejects operations from a lower fencing token than it has seen, so the stale leader's writes get rejected even though it doesn't know it's stale. This is how ZooKeeper's epochs, Raft's terms, and Kafka's leader epochs work.
Q5. A colleague wants to use System.currentTimeMillis() to resolve write conflicts between two nodes. Push back.
Wall-clock time can jump backwards (NTP corrections, leap seconds, VM pauses) and drift by seconds between nodes even with NTP. "Last write wins by timestamp" can silently drop writes when clocks disagree. Options: (1) use a logical clock like a Lamport counter or vector clock to detect concurrent writes, then resolve via domain logic; (2) use an HLC that pins the physical time but tie-breaks via a counter; (3) fall back to consensus (Raft) – slower but correct. The "just use timestamps" approach works until the first clock-skew incident, and then it corrupts data silently.
Q6. What are the 8 fallacies β name at least 5.
Network is reliable; latency is zero; bandwidth is infinite; network is secure; topology doesn't change; one administrator; transport cost is zero; network is homogeneous. Every multi-service bug I've debugged has its roots in one of these – and "one administrator" especially bites when you're bridging on-prem IBM MQ to AWS ActiveMQ across two ops teams with different change-management rules.
Q7. Explain FLP impossibility in two sentences.
In a fully asynchronous network where even one process might crash, no deterministic consensus protocol can guarantee termination – because you can't distinguish "slow" from "dead." Real systems (Raft, Paxos) dodge it by adding timeouts and assuming eventual network sanity, trading strict theory for forward progress.
Q8. What is PACELC and why does it matter more than CAP day-to-day?
PACELC: "If Partition, then A vs C; Else, Latency vs C." CAP only applies during partitions, which are rare. PACELC acknowledges that even with a healthy network, strong consistency costs latency – synchronous cross-AZ replication is always slower than local-only reads. Every real-world design negotiates this tradeoff on every request. MongoDB's `readConcern: majority` vs `local` is PACELC in one config flag.
Q9. What's the difference between crash failure and Byzantine failure, and which does Kafka handle?
Crash failure: the process stops responding. Byzantine failure: the process behaves arbitrarily (sends wrong data, lies, corrupts responses). Crash-tolerant systems assume survivors are honest; Byzantine-tolerant systems assume survivors may lie. Kafka (and most enterprise systems) handle crash failures only – far cheaper. Byzantine tolerance (PBFT, HotStuff) is used where you can't trust the nodes themselves: blockchain, multi-organization consensus, adversarial environments. In hardened-perimeter environments your own nodes are trusted, so crash fault-tolerance is enough; the perimeter is the defensive layer instead.
Q10. What is a quorum and why is it always a majority?
A quorum is the minimum number of nodes that must participate in an operation for it to be valid. Setting the quorum to a majority (`⌊n/2⌋ + 1`) guarantees any two quorums share at least one node – which is what prevents split-brain. If you allowed a quorum of `n/2`, two non-overlapping quorums could independently elect leaders or commit conflicting writes. The majority rule is what makes Raft, Paxos, and every consensus protocol safe.
Part 2 – Replication, Partitioning & Consensus
2.1 Replication strategies
Three fundamental models:
Single-leader (primary-replica). One node accepts writes; replicates to followers synchronously or asynchronously. Reads can go to the leader (strong) or any follower (stale). Simple, strong-ish consistency, but leader is a bottleneck and SPOF.
- Examples: Postgres streaming replication, MySQL binlog, MongoDB replica sets, Kafka (per partition), Redis.
Multi-leader. Multiple nodes accept writes; they replicate to each other. Lets geographically distributed writers commit locally. Creates write conflicts that must be resolved – by timestamp, CRDT, or manual resolution.
- Examples: Cassandra (sort of – see "leaderless" below), multi-master MySQL, CouchDB, some active-active Postgres setups.
Leaderless. No designated leader. Writes go to N replicas; reads query R replicas; correctness needs R + W > N (see quorum math). Conflict resolution via vector clocks or last-write-wins.
- Examples: DynamoDB (internally), Cassandra, Riak.
Sync vs async:
- Sync replication: the write acks only after followers confirm. Slower, no data loss on leader crash.
- Async replication: the write acks as soon as the leader commits locally. Faster, but can lose the tail of writes on a leader crash.
- Semi-sync: wait for at least 1 replica but not all. MySQL's semi-sync plugin works this way. Kafka's `acks=all` + `min.insync.replicas=2` is a form of semi-sync (commit after the ISR acks).
2.2 Quorum math
For N replicas, W writes acked, R reads queried:
- `R + W > N` → guaranteed to read the latest write (overlapping quorums).
- `W > N/2` → avoids split-brain on writes.
- `R = 1, W = N` → fast reads, slow writes.
- `R = N, W = 1` → fast writes, slow reads.
- `R = W = ⌈(N+1)/2⌉` → balanced, common default.
Example: N=3, W=2, R=2 → R+W=4 > 3 → consistent. Tolerates 1 node failure for both writes and reads.
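These inequalities can be captured in a tiny helper; the names are illustrative, not a library API:

```java
final class QuorumMath {
    // R + W > N: every read quorum overlaps every write quorum, so reads see the latest write.
    static boolean readsSeeLatestWrite(int n, int w, int r) { return r + w > n; }

    // W > N/2: any two write quorums overlap, preventing split-brain on writes.
    static boolean writeQuorumsOverlap(int n, int w) { return 2 * w > n; }
}
```

For N=3: W=2, R=2 passes both checks, while W=1, R=1 passes neither, which is exactly the configuration that can serve stale reads and diverging writes.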
Sloppy quorum & hinted handoff (Dynamo): if the "home" replicas for a key are down, temporarily write to healthy replacement nodes and flag the data with a hint. When the home replicas recover, the replacement forwards the data home ("hinted handoff"). Preserves availability at the cost of brief stale reads.
2.3 Anti-entropy: read repair & Merkle trees
Even with quorums, replicas can drift due to dropped messages or failed nodes. Anti-entropy fixes this.
Read repair: on a read, compare responses from all queried replicas. If they differ, return the most-recent, then asynchronously write the fresh value to the stale replicas. Catches drift on hot keys.
Active anti-entropy (Merkle trees): each node builds a Merkle tree (a tree of hashes) over its data. Neighboring replicas periodically compare root hashes – if they're equal, all data matches; if different, they recursively compare subtrees, transferring only the divergent ranges. Makes full repair cheap even for TB-scale datasets. Cassandra, DynamoDB, and Riak all use this.
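A toy version of the compare-then-descend idea, assuming a power-of-two number of leaf buckets and SHA-256. This is an illustration of the technique, not any database's actual repair code:

```java
import java.security.MessageDigest;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class MerkleSketch {
    static byte[] sha256(byte[] in) {
        try { return MessageDigest.getInstance("SHA-256").digest(in); }
        catch (Exception e) { throw new IllegalStateException(e); }
    }

    // levels.get(0) = leaf hashes; the last level is the single root hash.
    static List<byte[][]> buildTree(byte[][] leafData) {
        byte[][] level = new byte[leafData.length][];
        for (int i = 0; i < leafData.length; i++) level[i] = sha256(leafData[i]);
        List<byte[][]> levels = new ArrayList<>();
        levels.add(level);
        while (level.length > 1) {
            byte[][] parent = new byte[level.length / 2][];
            for (int i = 0; i < parent.length; i++) {
                byte[] concat = new byte[64];
                System.arraycopy(level[2 * i], 0, concat, 0, 32);
                System.arraycopy(level[2 * i + 1], 0, concat, 32, 32);
                parent[i] = sha256(concat);
            }
            levels.add(parent);
            level = parent;
        }
        return levels;
    }

    // Descend only into subtrees whose hashes differ; collect divergent leaf indexes.
    static void diff(List<byte[][]> a, List<byte[][]> b, int lvl, int idx, List<Integer> out) {
        if (Arrays.equals(a.get(lvl)[idx], b.get(lvl)[idx])) return;  // identical subtree: skip
        if (lvl == 0) { out.add(idx); return; }
        diff(a, b, lvl - 1, 2 * idx, out);
        diff(a, b, lvl - 1, 2 * idx + 1, out);
    }
}
```

The payoff is in `diff`: matching subtree hashes prune entire ranges, so only the divergent leaves are ever compared or transferred.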
2.4 Leader election (Raft vs Paxos vs ZAB)
All three solve the same problem – get a group of nodes to agree on who's in charge and what the log of operations is. You don't need to implement any of them, but you should be able to speak to them.
Raft (Ongaro & Ousterhout, 2014) – designed specifically for understandability. Strong leader (all writes go through the leader), heartbeat-based election, log replication with a commit index. Used by etcd (Kubernetes), Consul, CockroachDB, Kafka's KRaft.
- Election: follower with highest term + most-up-to-date log wins.
- Safety: a candidate must have seen all committed entries to be elected.
- Fault tolerance: `2f + 1` nodes tolerate `f` crash failures.
Paxos (Lamport, 1998) – the original. More flexible than Raft (any node can propose) but notoriously hard to understand and implement. Variants: Multi-Paxos, Fast Paxos, EPaxos. Used by Google Chubby and Spanner.
ZAB (ZooKeeper Atomic Broadcast) – ZooKeeper's homegrown protocol. Similar to Raft (leader-based, log replication) but predates it. Being phased out as Kafka, Hadoop, etc. move to Raft variants.
Split-brain prevention (fencing): every election increments a term/epoch. Any operation tagged with an old term is rejected by the log or the external resource (DB, lock). See §1.9 Q4.
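The fencing rule is small enough to sketch directly. Here a hypothetical store rejects any operation whose token is older than the highest it has seen:

```java
final class FencedStore {
    private long highestToken = -1;
    private String value;

    // Accept a write only if its fencing token is not older than the newest one seen.
    synchronized boolean write(long fencingToken, String v) {
        if (fencingToken < highestToken) return false;  // stale leader: reject the write
        highestToken = fencingToken;
        value = v;
        return true;
    }

    synchronized String read() { return value; }
}
```

The stale leader never needs to learn it was deposed; the resource itself enforces the epoch ordering.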
2.5 Consensus in the wild
| System | Consensus | Where |
|---|---|---|
| etcd | Raft | Kubernetes control plane, service discovery |
| ZooKeeper | ZAB | Kafka (pre-3.3), Hadoop, Solr |
| KRaft | Raft | Kafka 3.3+ (replaces ZK dependency) |
| Consul | Raft | Service mesh (Consul Connect), KV store |
| CockroachDB | Raft (per range) | Distributed SQL |
| Spanner | Paxos (per tablet) + TrueTime | Google's global DB |
Why Kafka moved off ZK (KRaft): running two clusters (Kafka brokers + a ZK ensemble) doubles the operational surface area. KRaft co-locates the controller quorum inside the Kafka brokers – fewer moving parts, faster metadata propagation (seconds → milliseconds), no more "what's ZK doing?" during incidents. GA in 3.3, default in 4.0, required in 4.x (ZK removed).
2.6 Partitioning / sharding
Splitting data across nodes horizontally. Three strategies:
Range partitioning. Keys sorted; each node owns a contiguous range. Great for range scans (WHERE time BETWEEN ...). Bad when the range is skewed (newest data always hotter).
- Example: HBase, BigTable, CockroachDB.
Hash partitioning. Hash the key; modulo N = partition. Uniform distribution. Terrible for range queries (adjacent keys are on random shards).
- Example: DynamoDB partition key, Cassandra default, Kafka's default partitioner.
Consistent hashing. Hash keys and nodes onto a ring. Each key goes to the next node clockwise. Adding/removing a node remaps only 1/N of the keys, not all of them. Tune with virtual nodes (each physical node holds many ring positions) to smooth out load when a node dies.
- Example: Cassandra's token ring, DynamoDB internals, memcached client-side sharding.
Rendezvous (HRW) hashing. For each key, compute a score with every node; pick the highest. Equivalent effect to consistent hashing but no ring to maintain. Used by CRUSH (Ceph), some CDN edge routing.
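A consistent-hash ring with virtual nodes fits in a page of Java using a sorted map. This is a sketch – the hash function choice and all names are ours:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.SortedMap;
import java.util.TreeMap;

final class HashRing {
    private final TreeMap<Long, String> ring = new TreeMap<>();
    private final int vnodes;   // ring positions per physical node, to smooth load

    HashRing(int vnodes) { this.vnodes = vnodes; }

    // First 8 bytes of MD5 as a signed long: good enough spread for a sketch.
    static long hash(String s) {
        try {
            byte[] d = MessageDigest.getInstance("MD5").digest(s.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) h = (h << 8) | (d[i] & 0xFF);
            return h;
        } catch (Exception e) { throw new IllegalStateException(e); }
    }

    void addNode(String node) {
        for (int i = 0; i < vnodes; i++) ring.put(hash(node + "#" + i), node);
    }

    void removeNode(String node) {
        for (int i = 0; i < vnodes; i++) ring.remove(hash(node + "#" + i));
    }

    // Owner = first ring position at or after the key's hash, wrapping at the end.
    String nodeFor(String key) {
        SortedMap<Long, String> tail = ring.tailMap(hash(key));
        return tail.isEmpty() ? ring.firstEntry().getValue() : tail.get(tail.firstKey());
    }
}
```

The property worth testing is the one the text claims: removing a node moves only the keys that node owned; every other key keeps its owner.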
Hot partitions – causes & mitigations:
- Cause: skewed key distribution (celebrity user, monotonic timestamp key, all events from one tenant).
- Mitigations: add a random or tenant-id prefix to the key; salt the key; split hot keys into multiple virtual keys; use bucketing (e.g., `tenant:hour` instead of `tenant`).
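The salting mitigation above can be sketched as follows (hypothetical helper, not a library API): writes scatter one hot key across k sub-keys, and reads pay for it by fanning out to all of them.

```java
import java.util.concurrent.ThreadLocalRandom;

final class SaltedKey {
    // Writes: pick one of k salted sub-keys at random, spreading one hot key over k partitions.
    static String writeKey(String hotKey, int salts) {
        return ThreadLocalRandom.current().nextInt(salts) + ":" + hotKey;
    }

    // Reads: the cost of salting is that every salted sub-key must be queried and merged.
    static String[] readKeys(String hotKey, int salts) {
        String[] keys = new String[salts];
        for (int i = 0; i < salts; i++) keys[i] = i + ":" + hotKey;
        return keys;
    }
}
```

Pick the salt count to match the skew: k=8 turns one partition's load into eight partitions' load, at the price of an 8-way read fan-out.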
2.7 Rebalancing
Moving partitions between nodes when membership changes. Always expensive – data must be copied or re-indexed.
Kafka consumer rebalancing (classic "eager" protocol):
- All consumers in the group stop fetching.
- Coordinator revokes all partition assignments.
- Coordinator computes a new assignment.
- All consumers resume.
This "stop-the-world" pause hurts tail latency. Cooperative (incremental) rebalancing (KIP-429, Kafka 2.4+): only the partitions that actually need to move are revoked; everyone else keeps consuming. Much better for large groups.
Static membership (KIP-345): consumers declare a `group.instance.id`. A restart within `session.timeout.ms` doesn't trigger a rebalance at all – the same instance resumes its old partitions. Essential in Kubernetes, where pod restarts are routine.
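In Java client terms, the two KIPs reduce to two consumer properties. A sketch: the broker address, group id, and pod name are placeholders, and the assignor class name is the one shipped with kafka-clients:

```java
import java.util.Properties;

final class RebalanceFriendlyConsumerConfig {
    static Properties props(String podName) {
        Properties p = new Properties();
        p.put("bootstrap.servers", "kafka:9092");        // placeholder address
        p.put("group.id", "ingest-service");             // illustrative group
        // KIP-429: cooperative incremental rebalancing - only partitions that
        // actually move are revoked; the rest of the group keeps consuming.
        p.put("partition.assignment.strategy",
              "org.apache.kafka.clients.consumer.CooperativeStickyAssignor");
        // KIP-345: static membership - a restart that completes within
        // session.timeout.ms resumes the old assignment with no rebalance.
        p.put("group.instance.id", podName);
        p.put("session.timeout.ms", "45000");
        return p;
    }
}
```

In Kubernetes a StatefulSet ordinal (e.g. `ingest-0`) makes a natural `group.instance.id`, since it survives pod restarts.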
2.8 Part 2 Q&A
Q1. When would you pick leaderless replication over single-leader?
Leaderless wins when you need (a) very high write availability across geographically distributed writers, (b) no single point of failure for writes, or (c) workloads where last-write-wins or CRDT merge is acceptable. Dynamo/Cassandra were explicitly designed for Amazon's shopping cart: a write must succeed even if most of the cluster is unreachable, and occasional merge conflicts are cheaper than outages.
Single-leader wins everywhere else – simpler mental model, easier transactions, cleaner ordering. Default to single-leader (Postgres, MongoDB, and Kafka partitions are all single-leader per partition) and only go leaderless when you have a specific availability requirement.
Q2. Explain consistent hashing in under 60 seconds.
Map both keys and nodes onto a ring (hash modulo a large number). Each key belongs to the next node clockwise. When a node joins or leaves, only the keys between the new/old boundary move – about `1/N` of the total – instead of every key being reshuffled like a plain `hash % N`. Pair it with virtual nodes (100–256 per physical node) to spread load evenly. Cassandra's token ring, DynamoDB's internals, and every memcached sharding library work this way.
Q3. What is a fencing token and what problem does it solve?
A monotonically increasing number issued at leader election. Every operation the leader emits is tagged with its current token. External resources (a DB, a lock server, a storage system) remember the highest token they've seen and reject any operation from a lower token. This prevents a stale leader – one that still thinks it's in charge because it hasn't noticed its own partition – from committing conflicting writes after a new leader has taken over. ZooKeeper calls it the epoch, Raft calls it the term, Kafka calls it the leader epoch. Without fencing, split-brain corrupts data silently.
Q4. Why does Raft need 2f+1 nodes to tolerate f failures?
Raft's safety relies on any two quorums overlapping in at least one node. A quorum is `⌊n/2⌋ + 1`. With `n = 2f+1`, each quorum has `f+1` nodes; any two such quorums must share at least one member (pigeonhole). That shared member preserves the log across leadership changes. With `n = 2f`, you could have two disjoint quorums of size `f` – a recipe for split-brain.
Practically: a 3-node cluster tolerates 1 failure; 5-node tolerates 2; 7-node tolerates 3. Beyond 5–7 nodes the latency overhead of replication outweighs the fault-tolerance benefit – most production etcd clusters are 3 or 5 nodes.
Q5. How do Kafka rebalances work and why are they painful?
When a consumer joins/leaves a group, or partitions are added, the group coordinator triggers a rebalance. In the eager protocol (pre-2.4), all consumers stop, relinquish their partitions, the coordinator reassigns, and everyone resumes. "Stop-the-world" on every scale-up or pod restart – for a 50-consumer group that's seconds of pause per rebalance, visible as lag spikes and tail-latency jumps.
Cooperative (incremental) rebalancing (KIP-429) only revokes the partitions that truly need to move, so most consumers keep chugging. Static membership (KIP-345) skips the rebalance entirely if a consumer restarts within `session.timeout.ms`. Both are essential in Kubernetes, where pod restarts are routine. In one legacy MQ ingest, switching to cooperative rebalancing + static membership cut rebalance-related lag spikes from 30 s to under 2 s.
Q6. A partition key is creating a hot spot. What are your options?
Diagnose first – is it a celebrity key (one tenant at 100× the others) or a structural issue (monotonic timestamp key, all events from one region)? Options:
- Salt the key: prepend a random prefix `0..k` to spread load, but now reads must query all salts.
- Composite key: `tenant:hour` instead of `tenant` – spreads one tenant across 24 shards.
- Split the hot key: if one entity generates most of the traffic, split it in the model (e.g., `product:123:eu`, `product:123:us`).
- Add a cache in front so reads don't touch the shard.
- Scale horizontally – sometimes the fix is more shards, if the partitioner handles it.
The first instinct, "just add more partitions", is usually wrong – if the key is fundamentally skewed, more partitions just spread the problem thinner.
Q7. Explain read-repair and Merkle-tree anti-entropy.
Read repair: on a multi-replica read, compare responses and write back the freshest value to any stale replicas. Fixes drift as a byproduct of reads. Only fixes hot keys.
Merkle-tree anti-entropy: each node builds a tree of hashes over its data (say, 2^20 leaf hashes). Neighboring replicas exchange root hashes – if they're equal, all data matches. If not, they recursively compare subtree hashes to zoom in on the divergent ranges, then transfer only those. Makes full-cluster repair cheap even at terabytes. Cassandra runs this as a periodic `nodetool repair`.
Q8. Why did Kafka move off ZooKeeper?
Two reasons: operational complexity and scale. Running Kafka meant also running a ZK ensemble with different config, different monitoring, and different failure modes, and during incidents engineers had to reason across two clusters. Scale: all broker metadata lived in ZK znodes, so controller failover took 30+ seconds on a big cluster because the new controller had to re-read everything. KRaft (KIP-500) embeds a Raft quorum inside the brokers – metadata becomes its own Kafka-like log, controller failover is sub-second, and you operate one cluster, not two. GA in 3.3, default in 4.0, ZK removed in 4.x.
Q9. What's the difference between sync, async, and semi-sync replication?
Sync: the leader doesn't ack the client until every replica has the write. Zero data loss on leader crash, highest latency. Rare in practice – one slow replica makes everything slow.
Async: the leader acks immediately and replicates in the background. Low latency, but it can lose the tail on a leader crash (whatever hadn't replicated yet). Postgres default, MongoDB default, Kafka with `acks=1`.
Semi-sync: wait for at least one (or a quorum) of the replicas before acking. Bounded data loss, tolerable latency. Kafka's `acks=all` + `min.insync.replicas=2` on a 3-replica partition is semi-sync – one replica can lag without blocking writes.
Q10. Why isn't "just replicate everything synchronously" the default?
Tail latency. If you wait for all N replicas, every write is as slow as the slowest replica – and in a large cluster, some replica is always GC-ing, taking a snapshot, or sitting on a slow disk. A quorum-based approach (`acks=all` with `min.insync.replicas` < total replicas) gives durability without tying latency to the worst performer. Also: full sync means a single slow or crashed replica blocks writes entirely, turning a partial failure into a total one – not a tradeoff you want in production.
Part 3 – Messaging Fundamentals (broker-agnostic)
Before diving into Kafka specifics (Part 4) or the MQ family (Part 5), grok the cross-cutting concepts. Every messaging interview question ultimately reduces to these.
3.1 Three conceptual models: queues, topics, logs
Queue (point-to-point): message is consumed by exactly one consumer. Once consumed, it's gone. Work distribution model. JMS queue, SQS, IBM MQ queue, RabbitMQ queue.
Topic (publish-subscribe, traditional): each consumer receives its own copy. Multiple independent subscribers. JMS topic, SNS, RabbitMQ fanout.
Log (distributed commit log): messages persist on an append-only log; consumers track their own offset; any number of consumers can read the same message independently; messages are retained for a configurable time or size even after consumption. Kafka, Pulsar, Kinesis.
The log model includes queue and topic semantics as special cases (consumer groups for queue-like load balancing, independent consumers for pub/sub). That's why Kafka ate the world: one data model covers three use cases.
3.2 Delivery semantics
Three possibilities; pick the weakest that works.
At-most-once. Messages may be lost but never delivered twice. Fire-and-forget. Metrics, best-effort events.
At-least-once. Messages are never lost but may be delivered multiple times (duplicates on retry). Default for almost every reliable broker. Your consumer must be idempotent.
Exactly-once. No loss, no duplicates. End-to-end, this is legitimately hard; most "exactly-once" systems are at-least-once + consumer-side dedup (= "effectively once"). Kafka's EOS gets you exactly-once within Kafka (read from topic A, process, write to topic B atomically). It does not cover external side effects like "call Stripe" β those still need an idempotency key.
The producer side:
- At-most-once: `acks=0` (fire and forget).
- At-least-once: `acks=all` + retries, but a non-idempotent producer may duplicate on retry.
- Exactly-once (to Kafka): `enable.idempotence=true` (producer sequence numbers) + transactions.
The consumer side:
- At-most-once: commit offset before processing. If process crashes, message is lost.
- At-least-once: commit offset after processing. If process crashes after processing but before commit, message is re-delivered.
- Exactly-once: read-process-write pattern inside a Kafka transaction (so the offset commit and the output write are atomic).
3.3 Ordering guarantees
Global ordering (every message seen in the same order by every consumer) is very expensive in a distributed system β it requires a total order, which requires either a single writer or consensus. Almost no high-throughput broker offers it.
Per-partition (or per-queue) ordering. Messages on the same partition are delivered in the order they were produced. Kafka's and Kinesis's bread and butter. Ordered within partition, unordered across.
Per-key ordering. A producer hashes a key (e.g., userId) to a partition β all messages for the same user are ordered, across users they're parallel. This is the pragmatic sweet spot. Used everywhere.
Cost of ordering: strong ordering prevents parallelism. If you need global order, you have throughput of one consumer. If per-key is enough, throughput scales with the number of distinct keys.
3.4 Durability: acks, replication, fsync
Three layers.
- Client→broker ack. The producer waits for confirmation the broker received it. `acks=0` skips this; `acks=1` waits for the leader only; `acks=all` waits for the full ISR.
- Broker replication. The broker synchronously or asynchronously replicates to other brokers. Kafka uses log replication; IBM MQ uses queue-manager replication or shared queues.
- fsync to disk. Even after replication, the data may sit in page cache. An fsync flushes it to the physical disk. Kafka deliberately does not fsync on every write; it relies on replication across independent OSes. If you want fsync-per-write, configure `flush.messages=1`, but throughput drops 10–100×.
The durability contract you sell to the business is the intersection of these three. "We persist messages" is meaningless without specifying which layers.
3.5 Idempotency & dedup
At-least-once delivery means your consumer must handle duplicates. Two approaches.
Naturally idempotent operations: set a value, upsert a record. Applying twice = applying once. Design for this first.
Explicit dedup store: keep a table/cache of message IDs you've processed; reject duplicates. TTL-based cleanup. Redis set, DynamoDB with TTL, Postgres unique index.
Idempotency keys (the producer mints a UUID per logical request): the consumer writes (idempotency_key, result) to the dedup store in the same transaction as the business write. Second attempt sees the key, returns the stored result without re-processing.
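A minimal in-memory sketch of the idempotency-key pattern (the `DedupStore` name and API are invented for illustration; a real deployment backs this with Redis/DynamoDB/Postgres, TTL cleanup, and writes the key in the same transaction as the business write, as described above):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Sketch only: in production the (key, result) pair lives in a durable store
// and is written atomically with the business write, with TTL-based cleanup.
class DedupStore {
    private final Map<String, String> processed = new ConcurrentHashMap<>();

    /** Runs the work once per idempotency key; replays return the stored result. */
    String processOnce(String idempotencyKey, Supplier<String> work) {
        return processed.computeIfAbsent(idempotencyKey, k -> work.get());
    }
}
```

A second delivery with the same key returns the stored result without re-running the work, which is exactly the contract the consumer needs under at-least-once delivery.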
On the producer side: an idempotent producer (Kafka's enable.idempotence=true) uses a producer ID + sequence number per partition. The broker rejects duplicate sequence numbers, so producer retries don't duplicate.
3.6 Poison pills & dead-letter queues
A poison pill is a message that cannot be processed: malformed, violates a constraint, corrupts state on retry. Left alone, it blocks the partition forever; the consumer keeps crashing, committing nothing, and re-reading the same message (a "poison loop").
Dead-letter queue (DLQ) / dead-letter topic (DLT) pattern:
- On failure, retry N times (with backoff).
- If still failing, publish to a DLQ with headers: original topic, offset, partition, exception, timestamp.
- Commit the offset past the poison message so the consumer moves on.
- Alert on non-empty DLQ.
- Manually inspect, fix the code or the data, then replay from the DLQ.

Topology: Spring Kafka's `RetryTopicConfigurer` automates the whole chain: main topic → retry-0 (1s delay) → retry-1 (5s delay) → retry-2 (30s) → DLT. See §4.15.

Why naïve infinite-retry is evil: no progress on the partition, CPU burn, and eventually you scale out trying to "catch up" on lag that isn't real lag. Every production consumer needs bounded retry + a DLQ.
3.7 Backpressure
Backpressure is the ability of a slow consumer to slow down fast producers, or, failing that, the system's graceful handling of the gap between them.

Push (broker pushes to consumer): the broker may overwhelm the consumer. Requires explicit flow control (credit-based in AMQP, prefetch count in RabbitMQ, Reactive Streams `request(n)` signals).

Pull (consumer pulls from broker): the consumer controls its own rate. Kafka's model. Slow consumer → higher lag → alert. The producer is unaware; producers write to the log at their own pace. The log itself is the buffer.

Consumer lag as backpressure signal: in a pull system, lag (messages produced minus messages consumed) is the backpressure metric. If lag grows without bound, either scale out consumers, throttle producers, or accept that the work is structurally too slow.

Bounded queues (in-memory, between pipeline stages): block the upstream when full, fast-fail when full, or drop when full; each has its use cases. With an `ArrayBlockingQueue(capacity)`: `put()` blocks, `offer()` drops immediately, `offer(timeout)` fails after a bounded wait.
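The three "queue is full" policies map directly onto `java.util.concurrent.ArrayBlockingQueue`. A small sketch (the `BoundedStage` wrapper is invented for illustration):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.TimeUnit;

// A bounded hand-off between pipeline stages: the capacity is the
// backpressure budget, and each submit method picks a full-queue policy.
class BoundedStage {
    final ArrayBlockingQueue<String> q;

    BoundedStage(int capacity) { q = new ArrayBlockingQueue<>(capacity); }

    void blockingSubmit(String msg) throws InterruptedException {
        q.put(msg);                                      // blocks the upstream when full
    }

    boolean dropSubmit(String msg) {
        return q.offer(msg);                             // returns false immediately when full
    }

    boolean timedSubmit(String msg, long ms) throws InterruptedException {
        return q.offer(msg, ms, TimeUnit.MILLISECONDS);  // bounded wait, then fail fast
    }
}
```

Choosing between the three is a product decision, not a technical one: blocking propagates backpressure, dropping sheds load, and the timed variant bounds the latency you're willing to pay before shedding.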
3.8 Transactional messaging & the dual-write problem
The canonical distributed-systems bug: you want to update a database and publish a message, atomically. Naive code:
```java
db.save(order);         // ✓ committed
kafka.send(orderEvent); // ✗ process dies here
```

Now you have an order with no event. Or:

```java
kafka.send(orderEvent); // ✓ sent
db.save(order);         // ✗ dies here
```

Now an event for a non-existent order.
Options:
XA / 2PC. Wrap both in a global transaction via JTA + XA-compliant resource managers. Works, but XA is slow, complex, and not supported by Kafka (Kafka transactions are scoped to Kafka, not distributed). Great for the IBM MQ + DB world, where XA was born. Overkill almost everywhere else.
Transactional outbox (see §6.11). Write the event to an `outbox` table in the same DB transaction as the business write. A separate poller or CDC process (Debezium) reads the outbox and publishes to Kafka. The publish is at-least-once; the consumer dedups. Standard answer for 90% of interviews.

Listen-to-yourself / CDC-driven. Instead of a separate outbox, treat the DB's change log (WAL, oplog) as the event source via Debezium. The event is literally derived from the committed DB write, so there is no dual write at all. Requires a CDC-capable DB and willingness to expose the DB schema as an interface.
Eventual consistency with compensation. Publish first, then persist. If persist fails, publish a compensation event. Painful; rarely the right answer.
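To make the outbox flow concrete, here is a toy in-memory walk-through (all names invented; lists stand in for the outbox table and the broker):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

// Toy model of the transactional outbox. In reality step 1 is one DB
// transaction (business row + outbox row committed together), and step 2 is
// a separate poller process or Debezium CDC. Publishing is at-least-once,
// so consumers must dedup.
class OutboxDemo {
    record OutboxRow(String eventId, String payload, boolean sent) {}

    final List<OutboxRow> outbox = new ArrayList<>(); // stands in for the outbox table
    final List<String> broker = new ArrayList<>();    // stands in for Kafka

    // Step 1: business write + outbox insert in the same local transaction.
    void saveOrder(String orderEventJson) {
        // ...db.save(order) would commit here, atomically with:
        outbox.add(new OutboxRow(UUID.randomUUID().toString(), orderEventJson, false));
    }

    // Step 2: the poller publishes unsent rows, then marks them sent.
    void pollAndPublish() {
        for (int i = 0; i < outbox.size(); i++) {
            OutboxRow row = outbox.get(i);
            if (!row.sent()) {
                broker.add(row.payload());            // kafka.send(...)
                outbox.set(i, new OutboxRow(row.eventId(), row.payload(), true));
            }
        }
    }
}
```

Note the failure mode the pattern tolerates: if the process dies after `broker.add` but before marking the row sent, the next poll re-publishes the row. That duplicate is the at-least-once contract the consumer's dedup absorbs.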
3.9 Part 3 Q&A
Q1. Explain at-least-once vs exactly-once delivery.
At-least-once: messages are never lost but may be delivered more than once. Default for any broker that retries. The consumer must be idempotent.
Exactly-once: every message is delivered and processed exactly one time. End-to-end this is very hard; almost every "exactly-once" system is actually at-least-once + dedup ("effectively once"). Kafka's EOS gives exactly-once within Kafka (read→process→write + commit-offset, atomically, via a transaction). It doesn't cover external side effects like "call Stripe"; those still need an idempotency key on the external call.
On the legacy MQ ingest, at-least-once + idempotent handlers was the deliberate choice. Every message had a dedup key stored in Mongo; retries were free. Much simpler than XA and way cheaper than EOS.
Q2. What's the cost of global ordering and when do you actually need it?
Global ordering requires a total order across all messages, which requires either a single writer (serial bottleneck, caps throughput at one partition's worth) or distributed consensus (Raft/Paxos) on every message (latency explodes). Almost no high-throughput broker offers it.
You actually need it rarely β financial ledgers, certain audit logs, event sourcing with cross-aggregate invariants. Most "we need ordering" reduces to per-entity ordering, which is solved cheaply by partitioning on a key (userId, orderId). In the legacy MQ integration, ordering was needed per-case, not globally, so the partition key was the case ID β per-key ordering, parallel across cases.
Q3. What's a poison pill and how do you handle one?
A message that can't be processed: malformed payload, schema mismatch, or a violation of a business invariant. If a consumer retries it forever, it blocks the partition: no progress, CPU burn, and eventually an incident because lag alarms fire.
Standard handling: retry a bounded number of times with backoff, then send to a DLQ with full context (original topic, offset, exception, payload if policy allows). Commit past the poison message so the partition moves on. Alarm on a non-empty DLQ. Manually inspect, fix the code or the data, then replay from the DLQ. Spring Kafka's `RetryTopicConfigurer` or `DeadLetterPublishingRecoverer` builds this chain out of the box.
Q4. Why can naïve retry cause retry storms?
Naïve: on error, retry immediately. If the failure is a downstream service that's overloaded, every retry adds to the overload, making recovery impossible. Solution: exponential backoff with jitter. Backoff so you're not thundering; jitter so all your clients don't retry at the same tick. Retry budgets cap the total retry rate across a service, so you can't DDoS yourself. Circuit breakers open the path entirely when failure rates are sustained. In Resilience4j: `IntervalFunction.ofExponentialRandomBackoff`.
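The backoff-plus-jitter arithmetic in isolation, as a hand-rolled sketch of the idea behind `IntervalFunction.ofExponentialRandomBackoff` (the `Backoff` class is invented for illustration):

```java
import java.util.concurrent.ThreadLocalRandom;

// Exponential backoff with "full jitter": the delay is drawn uniformly from
// [0, min(cap, base * 2^attempt)], so retries from many clients spread out
// instead of arriving in synchronized waves.
class Backoff {
    static long delayMillis(int attempt, long baseMs, long capMs) {
        long ceiling = Math.min(capMs, baseMs << Math.min(attempt, 20)); // clamp to avoid overflow
        return ThreadLocalRandom.current().nextLong(ceiling + 1);        // jitter: [0, ceiling]
    }
}
```

The cap matters as much as the exponent: without it, a long outage pushes delays into the hours, and recovery after the outage is just as staggered as the failure was.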
Q5. Explain the dual-write problem and the outbox pattern.
Dual write: you need to persist to a DB and publish an event atomically. Doing them as two separate operations means the process can crash between them, leaving DB and broker inconsistent. XA/2PC solves it but is expensive and not universally supported (Kafka isn't an XA resource).
Outbox pattern: add an `outbox` table. In the same DB transaction as the business write, insert the event into the outbox. A separate poller reads unpublished outbox rows, publishes to Kafka, and marks them sent. The publish is at-least-once, so the consumer dedups. Alternative: Debezium CDC reads the DB's WAL and publishes outbox inserts directly. Either way: one atomic local commit, no 2PC, no XA.
Q6. Why does Kafka do pull-based consumption instead of push?
Pull lets the consumer control its own rate. Slow consumer → lag grows → clear backpressure signal. Fast consumer → pulls more batches. The broker doesn't have to track per-consumer state or buffer. Also: the consumer can size batches based on its own capacity, not the broker's guess. And replay/rewind is trivial: just reset the offset.
Push-based (RabbitMQ, legacy MQ) needs explicit flow control (prefetch count, AMQP credit) to avoid overwhelming consumers, and is harder to reason about when scaling.
Q7. What's the difference between at-least-once on the producer side and on the consumer side?
Producer at-least-once: the broker gets the message at least once even if the producer retries after a network glitch. Duplicates on the broker side are possible unless the producer is idempotent (`enable.idempotence=true` in Kafka adds a producer ID + sequence numbers so the broker dedups retries).

Consumer at-least-once: the message is processed at least once even if the consumer crashes mid-process. Achieved by committing the offset after processing: if the consumer crashes after processing but before committing, the next consumer re-reads and re-processes. Duplicates on the consumer side are handled by making the processing idempotent or using a dedup store.
Q8. When would you prefer XA/2PC over outbox?
Almost never, in greenfield work. XA's big niche is heterogeneous transactional guarantees across legacy resource managers: IBM MQ + DB2 + CICS, where all three speak XA and you need strict atomic commit. Mainframe-adjacent enterprise work still uses it. But it's a ticket to operational pain: 2PC blocks on coordinator failure, XA recovery is slow, and Kafka and most modern brokers aren't XA resources. For any system where outbox is viable, pick outbox.
Part 4 – Kafka Deep Dive

Kafka is the topic they will probe hardest on, especially given your resume. The goal of this section is that you can whiteboard a Kafka cluster, debug a lag spike, and choose the right config for a delivery contract, not just define terms.
4.1 Architecture primer

```
┌─────────────────── Kafka Cluster ────────────────────┐
│                                                      │
│  ┌──────────┐    ┌──────────┐    ┌──────────┐        │
│  │ Broker 1 │    │ Broker 2 │    │ Broker 3 │        │
│  │          │    │          │    │          │        │
│  │ Topic A  │    │ Topic A  │    │ Topic A  │        │
│  │  P0 (L)  │    │  P0 (F)  │    │  P0 (F)  │        │
│  │  P1 (F)  │    │  P1 (L)  │    │  P1 (F)  │        │
│  │  P2 (F)  │    │  P2 (F)  │    │  P2 (L)  │        │
│  └──────────┘    └──────────┘    └──────────┘        │
│                                                      │
│  ┌────────────────────────────────────────────┐      │
│  │  Controller Quorum (KRaft) – embedded      │      │
│  └────────────────────────────────────────────┘      │
└──────────────────────────────────────────────────────┘
        ▲                                ▲
        │ produce                        │ fetch
        │                                │
┌───────────────┐                ┌───────────────┐
│   Producer    │                │   Consumer    │
│               │                │     Group     │
└───────────────┘                └───────────────┘
```

Core nouns:
- Broker – a Kafka server process. A cluster is a set of brokers.
- Topic – a named stream of records. A topic is partitioned for parallelism.
- Partition – an ordered, append-only log. The unit of parallelism and ordering. Each partition has one leader and N-1 followers (for a replication factor of N).
- Segment – a partition's log on disk is split into segment files (default 1 GB). Old segments are eligible for retention-based deletion.
- Offset – a monotonic 64-bit integer identifying a record's position in a partition. Consumers track offsets; offsets are partition-scoped (partition 0's offset 42 is unrelated to partition 1's offset 42).
- Replica – a copy of a partition on a different broker. One is designated leader; the rest are followers.
- ISR (In-Sync Replicas) – the set of replicas fully caught up to the leader. Only ISR members are eligible for leader election.
- Consumer group – a set of consumers sharing a `group.id`. Partitions are divided among the group's consumers; each partition is assigned to exactly one consumer in the group. For N partitions, at most N consumers are active in parallel (extras are idle).
- Coordinator – the broker that manages a consumer group's membership and offsets. Distinct from the Controller, which manages cluster-wide metadata (leader elections, topic creation, etc.).
4.2 Producer internals

Produced records flow: user code → serializer → partitioner → accumulator (batching) → sender thread → network → broker. A few configs you must know cold:
```java
Properties p = new Properties();
p.put("bootstrap.servers", "kafka-1:9092,kafka-2:9092,kafka-3:9092");
p.put("key.serializer", StringSerializer.class);
p.put("value.serializer", KafkaAvroSerializer.class);

// Durability contract
p.put("acks", "all");                                // wait for ISR acks
p.put("enable.idempotence", true);                   // producer ID + seq nums, no dup on retry
p.put("max.in.flight.requests.per.connection", 5);   // 5 is safe with idempotence
p.put("retries", Integer.MAX_VALUE);                 // rely on delivery.timeout.ms instead
p.put("delivery.timeout.ms", 120_000);               // bound total time per record

// Throughput tuning
p.put("linger.ms", 20);                              // wait up to 20ms to accumulate a batch
p.put("batch.size", 64 * 1024);                      // up to 64KB per batch per partition
p.put("compression.type", "zstd");                   // best ratio/CPU tradeoff for most workloads

// Transactions (for EOS)
p.put("transactional.id", "order-svc-tx-1");
```

Key gotchas:

- `acks=1` is not the safe default most people think it is. If the leader crashes after acking but before any follower replicated, the message is lost. Use `acks=all` + `min.insync.replicas=2`.
- `retries` + a non-idempotent producer = duplicates. Always pair `retries > 0` with `enable.idempotence=true`.
- `max.in.flight.requests.per.connection` used to need to be 1 for ordering with retries; with an idempotent producer, up to 5 is safe (the producer tracks sequence numbers per partition).
- `linger.ms` is a latency–throughput knob. At low volume, a tiny `linger.ms` keeps tail latency down; at high volume, 10–50 ms dramatically reduces broker load.
4.3 Replication, ISR, leader epoch, high watermark

Replication: each partition has `replication.factor` replicas across brokers. One is leader; the others are followers. All writes go to the leader; followers pull from the leader ("pull-based replication", for the same reason as pull-based consumption).

ISR (In-Sync Replica set): a follower is "in-sync" if it is within `replica.lag.time.max.ms` (default 30s) of the leader. A lagging follower drops out of the ISR; when it catches up, it rejoins. Only ISR members can be elected leader.

High Watermark (HW): the offset up to which all ISR members have replicated. Consumers only see messages up to the HW; anything beyond it is unreplicated and could be lost on a leader crash. This is why `acks=all` is safe: the leader only acks after the HW advances past your message.

Leader Epoch (KIP-101): every leader election bumps an epoch number. On leader failover, followers reconcile their logs against the new leader using the epoch, not offsets. Before KIP-101, a zombie leader could serve stale data; epochs prevent this. Always enabled in modern Kafka.
Durability contract you can tell the business:

```
replication.factor  = 3
min.insync.replicas = 2
acks                = all
enable.idempotence  = true
```

→ Tolerates the loss of 1 broker with no data loss and no duplicates. Lose 2 brokers and producers get `NotEnoughReplicas` (the system chooses consistency over availability, the correct default).
4.4 Controller & KRaft

Pre-3.3 (ZooKeeper era): one broker is elected controller by ZK. The controller manages cluster metadata (leader assignments, topic config) via ZK znodes. Controller failover = re-read all ZK metadata, which took 30+ seconds on a large cluster.
KRaft (Kafka 3.3 GA, 4.0 default, 4.x-only): a subset of brokers run a Raft quorum ("controller nodes"). Metadata is itself a Kafka-like log stored on those nodes. Controller failover is sub-second (Raft leader election, no ZK). Operationally: one cluster, not two.
Transition is opt-in through 3.x (you pick ZK or KRaft at cluster creation). 4.x removes ZK support entirely. If a cluster in your interview is on 4.x, it's KRaft by definition.
4.5 Consumer groups & rebalancing

A consumer group coordinates partition ownership. Rules:

- Each partition is assigned to exactly one consumer in the group.
- A consumer can own multiple partitions.
- If there are more consumers than partitions, the extras sit idle.
- Scaling out = add consumers (up to `#partitions`) or add partitions.

Lifecycle:

- Consumer joins group: sends `JoinGroup` to the group coordinator.
- Coordinator picks a leader (one of the consumers).
- Leader computes assignments based on the assignment strategy (`range`, `roundrobin`, `sticky`, `cooperative-sticky`).
- Coordinator broadcasts assignments via `SyncGroup`.
- Consumers fetch from their assigned partitions.
- Heartbeats every `heartbeat.interval.ms` (default 3s).
- If `session.timeout.ms` (default 45s) passes with no heartbeat, the coordinator removes the consumer and triggers a rebalance.

Rebalancing protocols:

- Eager (classic): stop-the-world. All assignments revoked, recomputed, re-assigned. Pauses the whole group for seconds.
- Cooperative (incremental) (KIP-429, Kafka 2.4+): only the partitions that actually need to move are revoked; everyone else keeps consuming. Use `partition.assignment.strategy=org.apache.kafka.clients.consumer.CooperativeStickyAssignor`. Use this in Kubernetes.
- Static membership (KIP-345): set `group.instance.id`. A restart within `session.timeout.ms` is not a rebalance; the old instance re-claims its partitions. Essential for pod restarts.
4.6 Offset management

Kafka stores committed offsets in a special internal topic `__consumer_offsets`, log-compacted, keyed by (group, topic, partition).

Commit modes:

- Auto-commit (`enable.auto.commit=true`, every `auto.commit.interval.ms`): easy but dangerous. Offsets can commit before processing completes, causing message loss on crash. Don't use in production.
- Sync commit (`commitSync()`): blocks until the broker acks. Safe, slow. Good for batch boundaries.
- Async commit (`commitAsync()`): fire-and-forget. Fast. Combine with a final `commitSync()` on shutdown.
- Manual ack in Spring (`AckMode.MANUAL` or `MANUAL_IMMEDIATE`): the listener explicitly acks after processing.

The fundamental rule: commit the offset after processing, not before. Commit-before-process = at-most-once (data loss). Commit-after-process = at-least-once (dedup required).
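The rule can be seen in a toy simulation (names invented; a list stands in for the partition and an int for the committed offset). Committing after processing means a crash between the two causes a re-delivery, never a loss:

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of commit-AFTER-process (at-least-once). A crash after processing
// a record but before committing its offset re-delivers that record on restart.
class AtLeastOnceDemo {
    int committedOffset = 0;
    final List<String> processed = new ArrayList<>();

    /** Consume from the committed offset; "crash" (return) after N total records without committing. */
    void run(List<String> partition, int crashAfterProcessing) {
        for (int offset = committedOffset; offset < partition.size(); offset++) {
            processed.add(partition.get(offset));          // 1. process
            if (processed.size() == crashAfterProcessing)  // simulated crash: offset not committed
                return;
            committedOffset = offset + 1;                  // 2. commit AFTER processing
        }
    }
}
```

Running it over ["a", "b", "c"] with a crash after the second record re-processes "b" on restart: the duplicate is the price of never losing a record, which is exactly why the consumer must dedup.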
4.7 Exactly-Once Semantics end-to-end

Kafka EOS gives you exactly-once for Kafka topics. The canonical pattern is read-process-write:

```
consume from topic A  ───┐
                         │
produce to topic B    ───┼──  all inside one Kafka transaction
                         │
commit consumer offset ──┘
            │
            ▼
      atomic commit
```

If the process crashes mid-transaction, Kafka aborts it: neither the output write nor the offset commit is visible to downstream consumers running `isolation.level=read_committed`. Retry from the last committed offset; no duplicates.
```java
producer.initTransactions();
while (running) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
    producer.beginTransaction();
    try {
        for (var r : records) {
            ProducerRecord<String, String> out = processRecord(r);
            producer.send(out);
        }
        Map<TopicPartition, OffsetAndMetadata> offsets = currentOffsets(records);
        producer.sendOffsetsToTransaction(offsets, consumer.groupMetadata());
        producer.commitTransaction();
    } catch (Exception e) {
        producer.abortTransaction();
        // after an abort, seek the consumer back to the last committed offsets
        // before the next poll, or the aborted batch is skipped
    }
}
```

Critical limit: Kafka EOS does not cover external side effects. If `processRecord` calls Stripe, the send to Kafka is atomic but the Stripe call isn't. You still need an idempotency key on the external call.
4.8 Schema Registry & Avro

Avro over JSON for Kafka because: (1) binary format, 5–10× smaller; (2) schema enforcement, so producers can't send garbage; (3) schema evolution is first-class.

Subject naming strategies (which Avro schema belongs to which topic):

- TopicNameStrategy (default): subject = `<topic>-value`. One schema per topic. Simple; forces one record type per topic.
- RecordNameStrategy: subject = fully-qualified record name. Lets you multiplex different record types on one topic (union).
- TopicRecordNameStrategy: subject = `<topic>-<record>`. Same as RecordName but scoped to a topic, so you can evolve the same record type differently per topic.
For your 6-team standardization work, the answer to "which did you pick?" depends on whether teams needed one record type per topic (TopicName, simpler) or heterogeneous records per topic (TopicRecord).
Compatibility modes (enforced by the registry when a new schema is registered):
- BACKWARD (default): new schema can read old data. Consumers upgrade first, producers can keep the old schema for a while. Typical choice.
- FORWARD: old schema can read new data. Producers upgrade first. Rare.
- FULL: both BACKWARD and FORWARD. Safest; most restrictive.
- NONE: no checks. Only for intentional breaking changes with coordination.
Safe changes (BACKWARD-compatible) in Avro:
- Add a field with a default.
- Remove an optional field (one with a default).
- Add an alias.
Breaking changes:
- Add a field without a default.
- Remove a required field.
- Rename a field (use an alias instead).
- Change the type (no widening rules in Avro).
- Change an enum's symbols.
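As a hypothetical example, evolving an `OrderEvent` record by adding a `currency` field with a default is BACKWARD-compatible (new readers fill in "USD" when decoding old data); registering the same field without the `default` would be rejected by the registry under BACKWARD:

```json
{
  "type": "record",
  "name": "OrderEvent",
  "fields": [
    { "name": "orderId",  "type": "string" },
    { "name": "currency", "type": "string", "default": "USD" }
  ]
}
```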
Interview anchor: cross-team schema standardization work typically comes from exactly these pitfalls β teams adding fields without defaults, changing enum symbols, or renaming fields without aliases. Setting up a Schema Registry with BACKWARD compatibility and a PR-based review process surfaces breaking changes before they ship. Six previously-uncoordinated teams finally sharing a contract is a common outcome.
4.9 Retention, compaction, tiered storage

Time/size retention (`delete` cleanup policy, the default): delete segments older than `retention.ms` (default 7 days) or once total size exceeds `retention.bytes`. Consumers that fall behind the retention window get `OffsetOutOfRange`.
Log compaction (cleanup.policy=compact): instead of deleting by age, keep the most recent value per key. Older values for the same key are compacted away. Used for "current state" topics: customer profiles, config, GDPR tombstones. Tombstone = null value, signals deletion.
Tiered storage (KIP-405, GA in 4.0): hot segments on local disk; cold segments offloaded to object storage (S3, GCS). Effectively unlimited retention. Reads from tiered storage are slower but transparent to consumers. Lets you retain audit / replay data for months without blowing up broker disk. Major win for compliance-heavy workloads.
4.10 Kafka Connect overview
Connect is a framework for integrating Kafka with other systems without writing consumer code.
- Source connector: reads from an external system (DB, file, SaaS API), writes to a Kafka topic. E.g., Debezium reads the Postgres WAL → Kafka.
- Sink connector: reads from a Kafka topic, writes to an external system. E.g., JDBC sink writes to Postgres, S3 sink writes Avro to S3.
- SMT (Single Message Transform): per-record transformation in the connect pipeline (route, mask, flatten).
- Converter: format at the Kafka boundary (Avro, JSON, Protobuf).
When to use Connect vs a custom consumer: Connect for data-integration patterns (change data capture, lake loads); custom consumer for business logic.
4.11 Kafka Streams overview
Kafka Streams is a library (not a service) for stream processing.
- KStream: record-by-record stream. Every event is independent.
- KTable: changelog-backed "current-state" table. Last value per key wins (mirrors log compaction).
- GlobalKTable: a KTable replicated in full to every instance. Small lookup tables only.
- Stateful processing: joins, windows, aggregations; state is stored in local RocksDB, backed by a Kafka changelog topic for recovery.
- Exactly-once in Streams: `processing.guarantee=exactly_once_v2`; the whole read-process-write-commit is one Kafka transaction.
When to use Streams vs Flink: Streams for in-service stateful transformation; Flink when you need a stream-processing cluster with heavy windowing/CEP.
4.12 Operational concerns

Partition sizing:

- Too few → can't parallelize consumers. The ceiling is #partitions = max consumer parallelism.
- Too many → metadata bloat, longer controller failovers, more open file handles. Rule of thumb: no more than a few thousand partitions per broker.
- Adjustable only upward: you can add partitions but not remove them, and adding breaks key-based ordering for keys that re-hash to a new partition. Plan your partition count carefully.

Consumer lag monitoring: the single most important Kafka metric.

- Burrow (LinkedIn): classifies consumers as OK/WARN/ERR based on lag trajectory, not absolute lag. Don't page on "lag > 1000"; page on "lag is growing faster than the consumption rate."
- Kafka Exporter + Prometheus: exposes per-group, per-partition lag metrics.
- Consumer-side: `KafkaConsumer.metrics()` exposes `records-lag-max`.

Under-replicated partitions (URP): alarm on `UnderReplicatedPartitions > 0`. It means some follower is out of the ISR: could be a failed broker, a slow disk, or a network issue.

Rack awareness (`broker.rack`): lets the cluster distribute replicas across racks/AZs. With `replica.selector.class=RackAwareReplicaSelector`, consumers can prefer a same-rack replica to avoid cross-AZ traffic (cost + latency).

Quotas: cap producer/consumer bandwidth or request rate per client ID or user. Keeps noisy neighbors from starving quiet ones.
4.13 Security

Encryption in transit: TLS between client and broker. `security.protocol=SSL` or `SASL_SSL`.

Authentication (SASL):

- `SASL/PLAIN`: username/password. Simple. Don't use over unencrypted transport.
- `SASL/SCRAM-SHA-256/512`: salted challenge-response. Preferred over PLAIN.
- `SASL/GSSAPI` (Kerberos): enterprise default.
- `SASL/OAUTHBEARER`: OAuth2 bearer tokens, usually via OIDC.
- mTLS: client certs. No SASL needed. Preferred in zero-trust meshes.

Authorization (ACLs): Kafka has topic/group/cluster-level ACLs: who can produce to topic X, consume from group Y. Managed via `kafka-acls.sh`; Confluent RBAC adds a role-based layer on top.

Encryption at rest: Kafka doesn't encrypt disk; rely on filesystem encryption (LUKS, EBS encryption). Some deployments add field-level envelope encryption at the producer before serialization (for PII and other regulated data).
4.14 Anti-patterns

- Kafka as a request/reply bus. Kafka is one-way and async. Don't use it for "I need a response in 50 ms." Use gRPC or REST.
- Too many partitions. More is not better: each partition costs memory, a file handle, and replication overhead. Beyond the low thousands per broker, you pay without benefit.
- Unbounded topic growth. Default retention is 7 days; if you bump retention to "forever" without tiered storage, you'll run out of disk and take down your cluster.
- Un-keyed writes when ordering matters. Round-robin partitioning of key-less records means "userId 42's events" can land on any partition in any order. Always key by the ordering dimension.
- One giant topic for everything. Hurts compaction semantics, per-topic tuning, and access control. Split by domain / contract.
- Treating offsets as durable IDs. Offsets are partition-scoped and reset if a topic is deleted and re-created. Use your own unique IDs in the payload for cross-system references.
4.15 Spring for Kafka snippets

```yaml
# Producer/consumer config (application.yml)
spring:
  kafka:
    producer:
      bootstrap-servers: ${KAFKA_BROKERS}
      acks: all
      retries: 2147483647
      properties:
        enable.idempotence: true
        max.in.flight.requests.per.connection: 5
        linger.ms: 20
        compression.type: zstd
      key-serializer: org.apache.kafka.common.serialization.StringSerializer
      value-serializer: io.confluent.kafka.serializers.KafkaAvroSerializer
    consumer:
      group-id: order-svc
      enable-auto-commit: false
      isolation-level: read_committed
      properties:
        partition.assignment.strategy: org.apache.kafka.clients.consumer.CooperativeStickyAssignor
        group.instance.id: ${HOSTNAME}   # static membership
    listener:
      ack-mode: MANUAL_IMMEDIATE
```

```java
// Consumer with manual ack + DLT
@Component
public class OrderEventsListener {

    private final OrderService orders;
    private final DedupStore dedup;

    OrderEventsListener(OrderService orders, DedupStore dedup) {
        this.orders = orders;
        this.dedup = dedup;
    }

    @KafkaListener(topics = "orders.v1", containerFactory = "kafkaListenerContainerFactory")
    @RetryableTopic(
        attempts = "4",
        backoff = @Backoff(delay = 1000, multiplier = 2.0, random = true),
        dltStrategy = DltStrategy.FAIL_ON_ERROR,
        autoCreateTopics = "true")
    public void onOrder(ConsumerRecord<String, OrderEvent> record, Acknowledgment ack) {
        if (dedup.seen(record.key())) {   // duplicate delivery: ack and skip
            ack.acknowledge();
            return;
        }
        orders.apply(record.value());
        dedup.mark(record.key());
        ack.acknowledge();
    }

    @DltHandler
    public void onDlt(ConsumerRecord<String, OrderEvent> r, Exception e) {
        // alert, persist for manual inspection, etc.
    }
}
```
}java
```java
// Kafka-to-Kafka EOS (read-process-write)
@Bean
public KafkaTemplate<String, String> tx(DefaultKafkaProducerFactory<String, String> pf) {
    pf.setTransactionIdPrefix("order-svc-tx-"); // setter lives on the Default implementation
    return new KafkaTemplate<>(pf);
}

@KafkaListener(topics = "orders.in")
@Transactional("kafkaTransactionManager")
public void onOrder(ConsumerRecord<String, OrderIn> r) {
    OrderOut out = process(r.value());
    kafka.send("orders.out", r.key(), out); // kafka = the transactional KafkaTemplate above
    // offset commit + send are one Kafka transaction
}
```

4.16 Part 4 Q&A β
Q1. Explain Kafka's architecture in 90 seconds.
A Kafka cluster is a set of brokers. Each topic is split into partitions; each partition is an ordered, append-only log replicated across N brokers (one leader, N-1 followers). Producers write to the leader; followers replicate asynchronously but only records replicated to the ISR are acked. Consumers in a group split the partitions among themselves β each partition is owned by exactly one consumer. Ordering is guaranteed within a partition, not across. Offsets track consumer progress and are committed to `__consumer_offsets`. Metadata (which broker leads which partition, which topics exist) is managed by a controller β in KRaft mode (Kafka 4.x) the controller is a Raft quorum embedded in brokers, replacing the old ZooKeeper dependency.
Q2. How does Kafka guarantee ordering, and what breaks it?
Ordering is guaranteed only within a partition. Messages with the same key hash to the same partition, so per-key ordering is preserved. Across partitions, no guarantee.
Things that break it: (1) un-keyed writes get round-robined across partitions β sibling events land anywhere. (2) Partition count changes β adding partitions remaps the hash function, so a key may move, interleaving old-partition and new-partition messages at consumer time. (3) With `retries > 0` on a non-idempotent producer and `max.in.flight > 1`, a retried batch could commit after a later batch β out of order. Idempotent producer (producer ID + seq num) fixes the retry case.
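The remap in case (2) is easy to see with a toy partitioner. A minimal sketch, using Java's `String.hashCode` as a stand-in for Kafka's actual murmur2 hash (class and method names here are illustrative, not a Kafka API):

```java
public class KeyPartitioning {
    // Same shape as Kafka's default partitioner: positive-hash mod partition count.
    // NOTE: Kafka hashes the serialized key bytes with murmur2; String.hashCode
    // is only a stand-in to illustrate the mechanism.
    public static int partitionFor(String key, int numPartitions) {
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        String key = "userId-42";
        int before = partitionFor(key, 6);   // topic originally has 6 partitions
        int after  = partitionFor(key, 12);  // after adding partitions
        // The same key can land on a different partition once the count changes,
        // which is exactly how adding partitions breaks per-key ordering.
        System.out.println(before + " -> " + after);
    }
}
```

Same key and same partition count always give the same partition; change the count and the mapping may move.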
Q3. Walk me through acks=0 vs acks=1 vs acks=all.
acks=0: fire and forget. Producer doesn't wait for a response. Lowest latency, no durability β leader crash loses the record. Use for best-effort metrics.
acks=1: wait for the leader to write to its local log. Survives a follower failure but not a leader failure β if the leader crashes after acking and before replicating, the record is lost. Default in older clients, but not actually safe.
acks=all: wait for the full ISR to replicate. Combined with `min.insync.replicas=2` on a 3-replica topic, survives 1 broker loss with no data loss. If ISR drops below `min.insync.replicas`, the producer gets `NotEnoughReplicasException` β fails closed. This is the safe default for any message you care about.
Q4. What is the ISR and how does a replica fall out of it?
In-Sync Replica set: replicas that are caught up to the leader within `replica.lag.time.max.ms` (default 30s). Only ISR members are eligible for leader election β ensures a new leader has all committed data.

A follower falls out when it can't pull the leader's log within the timeout: slow disk, GC pause, network issue, rebalance traffic. It can rejoin once it catches up. You alarm on `UnderReplicatedPartitions > 0` because a shrunken ISR reduces fault tolerance.
Q5. Explain the leader epoch.
Every leader election increments a 32-bit epoch number. All records are tagged with the epoch of the leader at the time they were produced. On failover, followers compare their log's latest epoch with the new leader's β any divergent suffix (written by a previous leader that the new leader didn't replicate) is truncated. Before leader epochs (KIP-101), logs could diverge after failover, causing silent data loss or duplication. It's the fencing-token mechanism specific to Kafka partition replication.
Q6. How does exactly-once semantics work end-to-end in Kafka, and what doesn't it cover?
EOS = idempotent producer + Kafka transactions + `isolation.level=read_committed` on consumers. Within Kafka, the read-process-write pattern commits the offset, the output write, and all produces atomically β a crash before commit aborts everything, and downstream consumers never see the output. So consuming topic A, processing, and producing to topic B is genuinely exactly-once.

What it doesn't cover: external side effects. If your processing calls Stripe or writes to Postgres, that call is not part of the Kafka transaction. You still need idempotency keys or outbox patterns for external effects. EOS is "exactly-once within Kafka," not "exactly-once everywhere."
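A minimal sketch of that idempotency-key guard for external effects, with an in-memory set standing in for a durable dedup store such as Redis or DynamoDB with TTL (all names here are illustrative, not a real library API):

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Runs an external side effect at most once per business key.
// Production caveats: the "seen" set must be durable and shared, and a real
// implementation must handle the crash window between marking and executing.
public class ExternalEffectGuard {
    private final Set<String> seen = ConcurrentHashMap.newKeySet();

    /** Returns true if the side effect ran, false if the key was a duplicate. */
    public boolean runOnce(String idempotencyKey, Runnable sideEffect) {
        if (!seen.add(idempotencyKey)) {
            return false; // duplicate delivery: skip the external call
        }
        sideEffect.run(); // e.g., call the payment provider, write the DB row
        return true;
    }
}
```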
Q7. Consumer lag is growing. Walk me through diagnosis.
First, is lag growing or is it a spike? Spike = transient (GC, rebalance, broker failover), growing = structural.
Check the consumer side: (1) Is CPU pegged? (2) Are downstream calls slow (DB, external API)? (3) How many partitions is each consumer assigned β are some consumers idle because `#partitions < #consumers`? (4) Are there frequent rebalances (check `rebalance-rate`)?

Check the producer side: (5) Did throughput spike? (6) Did batch size change? (7) Is the partition count a bottleneck (one hot key β one hot partition)?
Check the broker side: (8) URPs > 0? (9) Disk I/O saturated? (10) Network saturated?
In a typical legacy MQ integration, lag growth is almost always (a) a downstream service latency spike, or (b) a rebalance storm from pod churn β the latter fixed with static membership + cooperative rebalancing.
Q8. What's the difference between log compaction and retention?
Retention (`cleanup.policy=delete`): delete segments older than `retention.ms` or exceeding `retention.bytes`. Time-based aging.

Compaction (`cleanup.policy=compact`): keep only the most recent record per key. Tombstone (null value) marks a key for deletion. Used for "current state" topics β customer profiles, GDPR tombstones, configuration. A compacted topic acts like a change log; any consumer can bootstrap a full key-value view by reading from offset 0.

You can use both simultaneously (`compact,delete`) β compact for most keys, delete tombstones after a grace period.
Q9. Why is enable.auto.commit=true dangerous?
Auto-commit commits offsets on a timer, regardless of whether processing completed. If the consumer pulls 100 records, auto-commit fires after 50 have been processed, and then the consumer crashes β the other 50 are marked consumed but never actually handled. Silent data loss. Also: a long processing batch can exceed the commit interval, causing the same records to be re-processed. Rule: disable auto-commit, commit explicitly after processing, use Spring's `MANUAL_IMMEDIATE` ack mode.
Q10. Explain cooperative rebalancing and when it matters.
Eager rebalancing (classic): on any group change, every consumer revokes all partitions, waits for the coordinator to compute a new assignment, then resumes. "Stop-the-world" β 50-consumer group, one scale-up, everyone pauses for seconds.
Cooperative (KIP-429, `CooperativeStickyAssignor`): only the partitions that actually need to move are revoked. Consumers whose assignments don't change keep consuming. Dramatically lower tail-latency impact, especially in Kubernetes where pods restart often. Essential for any group > 10 consumers or any environment with frequent membership churn. Pair with static membership (`group.instance.id`) to skip rebalance entirely on short-lived restarts.
Q11. When would you NOT use Kafka?
- Small-volume pub/sub with simple fan-out and no need for retention β RabbitMQ or SNS/SQS is simpler.
- Request/reply with sub-100ms latency β use REST or gRPC; Kafka adds hops.
- Strong cross-system transactions β Kafka isn't an XA resource. Use outbox + Kafka, or MQ + XA for legacy.
- Very bursty, low-latency workflows where the log model's polling is overkill β RabbitMQ push + prefetch.
- When the team doesn't have the ops capacity β Kafka is real infrastructure.
Q12. How would you design a consumer that processes 10k+ messages/day reliably?
(This is literally the legacy-MQ-microservice prompt.) Design:
- Idempotent handler, keyed by a business ID in the message envelope. Dedup store (Mongo, Redis) with TTL ~72h.
- Manual commit, after processing, `MANUAL_IMMEDIATE`.
- Bounded retry with DLQ: Spring Kafka `@RetryableTopic` with 3 attempts, exponential backoff + jitter, then DLT.
- Observability: correlation ID propagated in Kafka headers, structured logs with `trace_id`/`span_id`, Micrometer metrics on consumer lag + handler latency + DLT count.
- Graceful shutdown: drain in-flight records, commit, close producer. Spring's listener containers stop gracefully on context shutdown (SIGTERM).
- Backpressure: slow handler β lag grows β alarm β scale out consumers (up to `#partitions`) or scale out downstream dependency.
- Capacity sizing: 10k/day is ~0.12/s avg β trivial. Size for peak (say 10x avg) and bursts. One partition per logical ordering dimension; more for parallelism.
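The sizing arithmetic is worth being able to do on the whiteboard; a two-line check (illustrative helper, not a real tool):

```java
public class CapacitySizing {
    /** Average message rate from a daily volume. */
    public static double avgPerSecond(long perDay) {
        return perDay / 86_400.0; // 86,400 seconds in a day
    }

    public static void main(String[] args) {
        double avg = avgPerSecond(10_000); // ~0.12 msg/s average
        double peak = avg * 10;            // size for ~10x-average bursts
        System.out.printf("avg=%.3f/s peak=%.3f/s%n", avg, peak);
    }
}
```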
Q13. What are the subject naming strategies in Schema Registry?
TopicNameStrategy (default): subject = `<topic>-value`. One schema per topic. Simplest. Good when a topic carries a single record type.

RecordNameStrategy: subject = fully-qualified record name. Lets one topic carry multiple record types (Avro unions). The record itself determines compatibility, not the topic.

TopicRecordNameStrategy: subject = `<topic>-<record>`. Scoped per topic β lets you evolve the same record type differently on different topics. Best of both.

For the 6-team Avro work, we went TopicName for most topics (one record per topic, cleanest contract) and TopicRecord on the small number of topics that legitimately multiplexed record types.
Q14. What's the difference between BACKWARD and FULL compatibility?
BACKWARD: new schema can read old data. Consumers deploy first with the new schema; producers can keep using the old schema. Typical rollout sequence. Safe for adding optional fields, removing optional fields.
FORWARD: old schema can read new data. Producers deploy first. Rare.
FULL = BACKWARD and FORWARD. Safest and most restrictive. You can roll consumers and producers in any order β messages in flight survive. Pay the constraint cost (stricter about what evolutions are allowed) to get deployment flexibility.
FULL_TRANSITIVE / BACKWARD_TRANSITIVE check against all previous versions, not just the immediately preceding one. Use when consumers might be on an older version than just N-1.
Q15. How do you monitor and alert on consumer lag?
- Metric sources: JMX on the consumer (`records-lag-max`), Kafka Exporter β Prometheus, or Burrow for trajectory-based evaluation.
- Don't alert on absolute lag (e.g., "lag > 1000"): a healthy consumer processing 10k/s has temporary lag in the thousands. Alert on lag growing faster than consumption rate, or on lag persisting above a threshold for N minutes.
- Pair with DLT count β a consumer that's "caught up" because it's DLTing every message isn't healthy.
- Per-partition when possible β hot-partition skew hides behind group-level averages.
- Burrow's three signals: OK (lag moving), WARN (lag growing), ERR (stopped consuming). Page on ERR, ticket on sustained WARN.
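The "trend, not absolute" rule above can be sketched as a pure function (illustrative only; in practice you'd express this as a PromQL rule or let Burrow evaluate the trajectory):

```java
// Alert when lag is structurally growing, not merely elevated.
// Assumption: lagSamples are evenly spaced measurements over the alert window.
public class LagAlert {
    /** True when lag grew across the entire window AND the latest value is
     *  above the floor: a structural problem rather than a transient spike. */
    public static boolean shouldAlert(long[] lagSamples, long floor) {
        if (lagSamples.length < 2) return false;
        for (int i = 1; i < lagSamples.length; i++) {
            if (lagSamples[i] <= lagSamples[i - 1]) return false; // flat or recovering
        }
        return lagSamples[lagSamples.length - 1] > floor;
    }
}
```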
Q16. What's the difference between Kafka Connect and Kafka Streams?
Connect: framework for moving data between Kafka and other systems without writing code. Source connectors ingest (Debezium CDC, JDBC source, file source). Sink connectors deliver (JDBC sink, S3 sink, Elasticsearch sink). SMTs transform in-flight. Runs as its own cluster of workers. Good answer to "how do I get Postgres β Kafka" without writing a producer.
Streams: library for transforming data within your service β stateful joins, windowed aggregations, KTable/KStream joins. Embedded in your app (no separate cluster). Good answer to "I want to compute a rolling 5-minute average per key" without introducing Flink.
They're complementary: Connect brings data in and ships it out; Streams processes it in flight.
Q17. What is tiered storage and why is it a big deal?
KIP-405 (GA in Kafka 4.0): hot segments stay on local disk; cold segments offload to object storage (S3, GCS, Azure Blob). Retention can be months or years without blowing up broker disk. Reads from tiered storage are slower but transparent to consumers (driven by the offset range requested).
Why it matters: audit/compliance retention, replay for debugging, rebuilding downstream state stores weeks later. Previously you had to either size brokers for worst-case retention (expensive) or ship logs to a separate data lake (extra pipeline). Tiered storage collapses those into one choice.
Q18. How does Kafka compare to Pulsar?
Both are distributed commit logs with pub/sub and durable storage. Differences:
- Storage/compute separation: Pulsar uses BookKeeper for storage and brokers for serving β you can scale them independently. Kafka fuses them (though tiered storage partially closes the gap).
- Partition model: Pulsar decouples partitions from brokers via ledgers. Moving a partition is cheap. Kafka partitions are tied to brokers; moves are expensive.
- Multi-tenancy: Pulsar has first-class tenant/namespace hierarchy. Kafka has topics + ACLs.
- Ecosystem: Kafka has vastly more clients, connectors, and tooling. Pulsar has caught up in tooling but not community.
For most teams, Kafka's ecosystem wins by default. Pulsar shines when you need strong multi-tenancy or want storage-compute separation without tiered storage.
Q19. What happens if min.insync.replicas is set higher than replication.factor?
The topic is un-writable. With `replication.factor=3` and `min.insync.replicas=4`, no producer write will ever be acked because the ISR can never hit 4. Producers get `NotEnoughReplicasException` immediately. It's a misconfiguration (usually a typo), but a production-grade check in your Helm chart should validate `replication.factor >= min.insync.replicas`.

Similarly, `min.insync.replicas == replication.factor` means any single replica going out of ISR stops writes. The usual safe config on 3 replicas is `min.insync.replicas=2` β tolerates 1 loss, fails closed at 2.
Q20. Design a DLT topology for a payment processor (where losing a message is unacceptable but endless retries of a poison are also unacceptable).
Three-tier topology:
- Main topic `payments.v1` β normal processing.
- Retry topics `payments.v1.retry.0` (1s delay), `payments.v1.retry.1` (10s), `payments.v1.retry.2` (60s), `payments.v1.retry.3` (600s). Each attempt that fails moves to the next retry topic. Spring's `@RetryableTopic` configures this chain declaratively.
- DLT `payments.v1.dlt` β terminal. Alarm on any non-zero message count. Message headers carry original topic, offset, exception class, stack trace.
- Replay: a separate "DLT replay" consumer reads the DLT on manual trigger, writes back to the main topic. Include a `replay-attempt` header so infinite loops are prevented. For payments specifically: never auto-replay; operator signs off on each replay batch. DLT retention high (30d+) so you can replay even after a long outage.
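A minimal sketch of the replay-attempt guard, assuming the header value is carried as an integer (the header name matches the text above; the class and method names are illustrative):

```java
// Loop protection for DLT replay: refuse to re-publish a record whose
// replay-attempt header has reached the cap, and stamp the incremented
// value on anything that is re-published.
public class ReplayGuard {
    public static final String REPLAY_HEADER = "replay-attempt";

    /** A missing header means the record has never been replayed. */
    public static boolean mayReplay(Integer replayAttempt, int maxReplays) {
        int attempts = (replayAttempt == null) ? 0 : replayAttempt;
        return attempts < maxReplays;
    }

    /** Value to stamp on the re-published record's replay-attempt header. */
    public static int nextAttempt(Integer replayAttempt) {
        return ((replayAttempt == null) ? 0 : replayAttempt) + 1;
    }
}
```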
Part 5 β IBM MQ / ActiveMQ / RabbitMQ / JMS β
Kafka ate the streaming world, but the MQ family is very much alive β especially in regulated, financial, and mainframe-adjacent work. Expect interviewers to probe your comfort bridging these worlds.
5.1 JMS (Java Message Service) β
JMS is the Java specification for messaging, not a product. It defines the API; vendors (IBM MQ, ActiveMQ, Artemis, OpenMQ) implement it.
Two domains:
- Point-to-point (Queue): one producer, one consumer per message, load-balanced across consumers on the queue.
- Publish/Subscribe (Topic): every subscriber gets every message. Subscriptions can be durable (survives subscriber disconnect) or non-durable.
Acknowledgment modes:
- `AUTO_ACKNOWLEDGE` β session acks as each `receive()` returns. If the consumer crashes between receiving and processing, the broker thinks it was delivered. Similar risk profile to Kafka's auto-commit.
- `CLIENT_ACKNOWLEDGE` β consumer calls `message.acknowledge()` manually. Acks all prior unacked messages on the session, not just this one.
- `DUPS_OK_ACKNOWLEDGE` β lazy ack, may duplicate on crash.
- `SESSION_TRANSACTED` β use `session.commit()` / `session.rollback()`. Atomic for the session; composes with XA if used in a JTA tx.
JMS 1.1 vs 2.0: 2.0 added a simplified API (JMSContext), injection via CDI, async send(), and delivery delay.
Spring integration:
```java
@Configuration
@EnableJms
class JmsConfig {

    @Bean
    ConnectionFactory mqConnectionFactory(
            @Value("${mq.user}") String username,          // property names illustrative
            @Value("${mq.password}") String password) throws JMSException {
        MQConnectionFactory f = new MQConnectionFactory();
        f.setHostName("mq.example.com");
        f.setPort(1414);
        f.setQueueManager("QM1");
        f.setChannel("APP.SVRCONN");
        f.setTransportType(WMQConstants.WMQ_CM_CLIENT);
        UserCredentialsConnectionFactoryAdapter adapter = new UserCredentialsConnectionFactoryAdapter();
        adapter.setTargetConnectionFactory(f);
        adapter.setUsername(username);
        adapter.setPassword(password);
        return adapter;
    }

    @Bean
    JmsListenerContainerFactory<?> jmsListenerContainerFactory(ConnectionFactory cf) {
        DefaultJmsListenerContainerFactory f = new DefaultJmsListenerContainerFactory();
        f.setConnectionFactory(cf);
        f.setSessionTransacted(true); // tx per message
        f.setConcurrency("3-10");     // dynamic consumers
        f.setErrorHandler(t -> log.error("JMS listener error", t));
        return f;
    }
}

@Component
class CaseListener {

    @JmsListener(destination = "CASE.INBOUND.Q")
    public void onCase(TextMessage m) throws JMSException {
        var body = m.getText();
        // handler; exception β rollback β redelivery (eventually β BOQ)
    }
}
```

5.2 IBM MQ essentials β
The IBM MQ model has more moving parts than Kafka β the terms matter.
- Queue Manager (QM): an instance of IBM MQ running on a host. A QM owns a namespace of queues. Apps connect to a QM.
- Queue: a FIFO(-ish) store of messages.
- Local queue: lives on the QM.
- Remote queue: a pointer to a queue on another QM; transmission happens via a transmission queue + channel.
- Transmission queue (XMITQ): where outbound messages sit awaiting channel transmission.
- Dead Letter Queue (DLQ): messages the QM can't deliver (e.g., destination queue doesn't exist).
- Backout Queue (BOQ): messages that a consumer has failed to process `BOTHRESH` times. Named in the queue's `BOQNAME` attribute.
- Reply-to queue: request/reply pattern; the producer sets `ReplyToQueue`, the consumer sends the response there.
- Channel: the network connection between QMs or between a client and a QM.
- SVRCONN: server-connection channel, for client apps.
- SDR/RCVR: sender/receiver pair between two QMs.
- CLUSSDR/CLUSRCVR: cluster channels.
- MCA (Message Channel Agent): the process that moves messages over a channel.
- Listener: accepts incoming channel connections on a TCP port (default 1414).
- CCDT (Client Channel Definition Table): a binary file given to client apps describing how to connect β hostnames, channels, auth. Lets ops change connection routing without app redeploys.
Persistent vs non-persistent messages:
- Persistent (MQPER_PERSISTENT): written to the QM's log β survives QM restart. Slower.
- Non-persistent (MQPER_NOT_PERSISTENT): in memory; lost on QM restart. Faster.
- The message property is set by the producer; queues have a default.
Backout queue and BOTHRESH:
```
APP.CASE.Q:
  BOTHRESH(3)             # redeliver up to 3 times
  BOQNAME('APP.CASE.BOQ') # after that, move to this queue
```

When a consumer rolls back (session.rollback() or crash), the QM increments the message's BackoutCount. When it hits `BOTHRESH`, the message is moved to `BOQNAME`. Pattern: consumer checks BackoutCount on receive, and if >= threshold-1, routes manually (with richer error info) to a custom BOQ. This is the "DLQ" pattern for MQ.
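The manual-backout check above reduces to a pure function (illustrative sketch; the real listener would read BackoutCount from the MQMD, or equivalently `JMSXDeliveryCount - 1` in JMS terms, before processing):

```java
// Decide whether to route a message to the custom BOQ instead of rolling
// back again. Routing one attempt early (at BOTHRESH - 1 backouts) means the
// app controls the move and can attach rich error context, rather than
// letting the QM do a bare move at BOTHRESH.
public class BackoutPolicy {
    /** backoutCount: failed attempts so far; bothresh: the queue's BOTHRESH. */
    public static boolean shouldRouteToBoq(int backoutCount, int bothresh) {
        return backoutCount >= bothresh - 1;
    }
}
```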
Clustering (queue-sharing clusters):
- A cluster is a group of QMs with shared repositories. Apps see cluster queues as if they were local; messages are load-balanced across instances.
- Full repository QMs hold the full cluster metadata; partial repository QMs hold only what they need.
- Benefits: load distribution, high availability (any QM can take a message). Trade-off: complexity, ordering caveats.
Uniform clusters (newer, 9.2+): simplifies cluster management, auto-balances client connections across QMs, handles app-level failover transparently. Goal: make a cluster feel like a single logical QM.
5.3 IBM MQ advanced β
XA & the transaction manager: IBM MQ is a first-class XA resource manager. In a JTA (Java Transaction API) scope, you can commit a DB update and an MQ put atomically. Classic pattern for mainframe interop (MQ + DB2 + CICS).
```java
@Transactional // JTA, not local Spring tx
public void processCase(Case c) {
    db.save(c);                                    // DB2 via JTA
    jmsTemplate.convertAndSend("DOWNSTREAM.Q", c); // MQ via JTA
    // Both commit or both roll back
}
```

The coordinator is usually an app server (WebSphere, WildFly) or a standalone TM (Narayana, Atomikos). 2PC overhead + coordinator availability = real cost. Don't reach for XA unless you genuinely need heterogeneous transactional guarantees.
MQ Appliance vs software: MQ Appliance is hardware-form-factor MQ, pre-tuned. Used in regulated environments where procurement wants "an appliance." Same protocol as software MQ.
PCF (Programmable Command Format): MQ's admin API. You can script QM creation, queue definitions, channel setup in Java via PCFMessageAgent. Useful for Infrastructure-as-Code with MQ.
MFT (Managed File Transfer): MQ ships an MFT add-on for file transfer with audit and retry semantics. Relevant in regulated environments where "securely move this 10 GB file between organizations" is a daily problem.
When is XA the right answer?
- Mainframe / legacy-heavy environments where DB2, CICS, MQ all speak XA and atomic cross-commit is a hard requirement (compliance, financial posting).
- Systems where the outbox pattern is infeasible β e.g., when you can't modify the source database schema.
- Short-lived transactions (XA gets punishing for long transactions β locks are held until commit).
For everything else: outbox β Kafka is cheaper, simpler, and doesn't couple your team's velocity to XA recovery pain.
5.4 ActiveMQ Classic vs Artemis β
Apache ActiveMQ has two active codebases:
ActiveMQ Classic (5.x): the original. Uses KahaDB (a B-tree-based journal) for persistence. Supports JMS 1.1, OpenWire, STOMP, AMQP 1.0, MQTT. Well-understood and widely deployed, but no longer where new features land.
ActiveMQ Artemis (originally HornetQ, donated by Red Hat): newer journaling engine, higher throughput, better clustering. JMS 2.0, AMQP 1.0, MQTT, OpenWire (for Classic compatibility). Recommended for new work. This is also what AWS MQ offers as its "ActiveMQ" broker.
Failover and clustering (both):
- Master/slave: one master handles traffic; slaves are standby, take over on master failure. Shared storage (SAN, NFS, replicated journal).
- Network of brokers (Classic): brokers form a mesh; messages route between them based on consumer demand. Good for geographical spread, harder to reason about ordering.
- Artemis clustering: peer brokers with load-balanced connections; more seamless than Classic's network of brokers.
Failover transport (client side):

```
failover:(tcp://mq1:61616,tcp://mq2:61616)?randomize=false&maxReconnectAttempts=10
```

The client transparently reconnects to the next broker on connection loss. Pair with a clustered broker setup for high availability.
Why new work goes Artemis: better throughput, better clustering, AMQP-native (OpenWire is still supported for Classic clients), Red Hat backs it. Classic is in maintenance mode.
5.5 RabbitMQ (for comparison) β
Not on your resume but likely to come up as "compare to what you've used."
Model: AMQP 0-9-1. Producers publish to exchanges; exchanges route to queues via bindings; consumers read from queues.
Exchange types:
- direct: routing key equality match.
- topic: routing key pattern match (wildcards `*` and `#`).
- fanout: broadcast to every bound queue.
- headers: match on message headers instead of routing key.
Key features:
- Prefetch count (QoS): how many unacked messages a consumer can hold β push-model backpressure.
- Dead-letter exchange (DLX): queue property β on nack or TTL expiry, message routes to a DLX. RabbitMQ's DLQ pattern.
- Quorum queues (RabbitMQ 3.8+): Raft-based replicated queues. Replaces the older mirrored queues (deprecated, removed in 4.0). Better failover semantics, first-class HA.
- Streams (3.9+): append-log data structure, similar to Kafka's log. For replay scenarios; not a Kafka replacement at scale.
Compared to Kafka: RabbitMQ shines for flexible routing, small messages at moderate scale, and traditional queue semantics. Kafka shines for high-throughput log-structured streaming and retention-based replay. Same company (Pivotal β VMware β Broadcom) owned both for a while; they solve different problems.
5.6 Head-to-head decision matrix β
| Feature | Kafka | IBM MQ | ActiveMQ Artemis | RabbitMQ |
|---|---|---|---|---|
| Model | Distributed log | Queue/topic broker | Queue/topic broker | AMQP exchange/queue broker |
| Ordering | Per-partition | Per-queue (FIFO) | Per-queue | Per-queue |
| Throughput | Very high (100k+ msg/s per broker; scales with partitions) | Moderate (10k/s range) | High | Moderate |
| Retention | Time/size-based, days to months | Until consumed (or TTL) | Until consumed | Until consumed |
| Replay | Native (offset rewind) | No β once consumed, gone | No | No (Streams is the exception) |
| Transactions (XA) | No (Kafka transactions only) | Yes, first-class | Yes | Yes (via plugin) |
| Delivery semantics | At-least-once; EOS within Kafka | At-least-once; exactly-once with XA | At-least-once | At-least-once |
| Cross-broker ordering | No (per-partition only) | Available (sequential delivery) | Available | Per-queue only |
| Protocols | Kafka wire protocol | MQI, JMS, MQTT, AMQP 1.0 | JMS, AMQP 1.0, OpenWire, STOMP, MQTT | AMQP 0-9-1, AMQP 1.0, MQTT, STOMP |
| Cloud-native fit | Strong (Confluent Cloud, MSK, self-hosted on K8s) | OK (MQ on Cloud Pak, limited K8s) | Strong (AWS MQ, self-hosted) | Strong |
| Ops complexity | High | Very high (enterprise setup) | Moderate | Low to moderate |
| Best for | Event streaming, log-retention, high throughput | Enterprise transactional messaging, mainframe interop, XA | Classic Java messaging, AWS-hosted | Flexible routing, moderate scale, traditional queueing |
"When would you choose X over Y?" β model answers:
- Kafka over IBM MQ: high-volume event streaming, log-retention for replay, horizontal scale. When the use case is "fire and forget events" at scale with many consumers, Kafka's log model is strictly better.
- IBM MQ over Kafka: transactional guarantees across heterogeneous systems (XA), mainframe integration, regulated environments where the org has decades of MQ operational experience, or where per-message ack semantics β the broker owns the queue and deletes on ack β are critical (no replay = no risk of accidental reprocessing).
- ActiveMQ over IBM MQ: cloud-native deployments, smaller scale, teams that don't need XA or mainframe interop. AWS MQ makes it turnkey.
- RabbitMQ over everything: moderate throughput with complex routing (topic exchanges), RPC-style pub/sub, quick time-to-value. Easier ops than Kafka at smaller scale.
5.7 Bridge / migration patterns (the on-prem-MQ β cloud-broker pattern) β
Moving between brokers β particularly bridging on-prem IBM MQ to AWS ActiveMQ β has canonical patterns.
Pattern: Bridging IBM MQ β Kafka
Options, in increasing order of sophistication:
- Custom JMS-to-Kafka adapter: consume from MQ via JMS, produce to Kafka. Hand-rolled. Flexibility at the cost of maintaining your own code.
- Kafka Connect with the IBM MQ source connector: off-the-shelf from Confluent. Config-driven. Handles offsets and ordering semantics per queue.
- IBM MQ Kafka Connect sink connector: bridge in the other direction, Kafka β MQ.
- IBM MQ Streaming Queues: MQ 9.3+ feature that lets a queue publish to a stream consumer without consuming the message from the queue. Useful for dual-path: legacy consumer from the MQ queue, modern consumer from the Kafka mirror.
Pattern: IBM MQ (on-prem) β AWS ActiveMQ (cloud)
- Network isolation: AWS Direct Connect or site-to-site VPN. Broker traffic on a private subnet. No broker exposed to the public internet.
- Bridge process: a small Spring Boot app running in AWS (ECS/EKS) with two JMS connections β one to on-prem MQ, one to AWS MQ. Transactional move: consume from one, produce to the other, ack on success.
- Dedup: include a business-ID in each message. On the target side, a dedup store catches replays (e.g., after a bridge crash). Redis with TTL = 2Γ max bridge downtime.
- Ordering: if per-queue ordering is required and you have multiple bridge instances, shard the queues across instances by queue-name hash. Alternative: leader-election via a distributed lock (Redis RLock, ZooKeeper lock) β only one bridge instance active at a time, others standby.
- Failover: bridge should handle broker disconnects with exponential backoff reconnect. Set `recoveryInterval` on the Spring listener container.
- Cutover strategy: run both brokers in parallel during migration β dual-write from producers, dual-read on consumers, confirm traffic matches, cut producer writes to the new broker, drain the old.
Anchor example: a legacy on-prem MQ β AWS MQ bridge built as a Spring Boot service with two `DefaultMessageListenerContainer` beans, session-transacted. Dedup key was the upstream message ID; store was DynamoDB with TTL. Cutover was dual-read for 30 days before the on-prem path was retired. Zero message loss observed; two duplicates handled by dedup during a bridge restart at the midpoint.
5.8 Part 5 Q&A β
Q1. What's the difference between JMS and IBM MQ?
JMS is a Java specification (an API, `javax.jms` / `jakarta.jms`). IBM MQ is a product that implements the JMS spec (among other protocols). Any JMS-compliant code can run against IBM MQ, ActiveMQ, Artemis, OpenMQ, or any other JMS provider β that's the portability benefit. The provider-specific knobs (IBM MQ's CCDT, backout queue, clusters) are not part of the JMS spec β they're product features exposed via properties or provider-specific APIs.
Q2. Queue vs topic in JMS.
Queue: point-to-point, each message consumed by exactly one consumer out of potentially many on the queue. Load distribution. Native for work-queue patterns (case processing, job dispatch).
Topic: publish-subscribe. Every subscriber gets every message. Subscriptions can be durable (broker retains messages for the subscriber even if disconnected) or non-durable. Native for event-broadcast patterns.
Kafka's log model collapses both β a topic with one consumer group acts like a JMS queue; a topic with many independent consumer groups acts like a JMS topic with durable subs.
Q3. What happens when a JMS consumer throws an exception in a session-transacted listener?
The session rolls back. The message is redelivered. The broker increments the redelivery count (`JMSXDeliveryCount`). On IBM MQ, after `BOTHRESH` redeliveries, the message is moved to the queue's Backout Queue (`BOQNAME`). On ActiveMQ, after the redelivery policy exhausts, it goes to `ActiveMQ.DLQ` (or a per-destination DLQ if configured). In application code, you typically check `JMSXDeliveryCount` on receive and, if above a threshold, route the message to a custom BOQ with rich error context before acking β so the operator has enough info to triage.
Q4. How does IBM MQ differ from Kafka architecturally?
IBM MQ is a broker-centric, per-message ack, transactional system. Messages live in queues owned by a queue manager; when a consumer acks, the message is deleted. No replay, no retention beyond TTL. Ordering is per-queue.
Kafka is a distributed log. Messages persist on an append-only partitioned log for a configurable retention. Consumers track their own offsets; any consumer can replay from any offset within retention. Ordering is per-partition, enabling massive parallelism.
Consequence: MQ is excellent when you want "one-and-done transactional messages" with cross-system XA. Kafka is excellent when you want streaming with replay and horizontal scale.
Q5. How would you guarantee no message loss in a bridge between IBM MQ and AWS ActiveMQ?
Session-transacted on both sides. The bridge consumes from the source in a transaction, produces to the target, and commits both once the produce is acked. If either side fails, both roll back and the source message is redelivered - at-least-once delivery by construction. To avoid duplicates after bridge crashes, include a business ID in every message and dedup on the target side (Redis/DynamoDB with TTL). A typical bridge uses DynamoDB with a 72-hour TTL keyed on the upstream message ID - cheap, reliable, and survives cross-region failovers cleanly.
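The target-side dedup described above can be sketched with an in-memory TTL store. This is an illustrative toy - in the answer's setup the store is DynamoDB with a native TTL so dedup survives bridge restarts; the class and method names here are made up:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of target-side dedup for an MQ -> ActiveMQ bridge:
// record the upstream message ID with a TTL; a redelivered duplicate is skipped.
public class TtlDedupStore {
    private final Map<String, Instant> seen = new ConcurrentHashMap<>();
    private final Duration ttl;

    public TtlDedupStore(Duration ttl) { this.ttl = ttl; }

    /** Returns true only the first time this message ID is seen within the TTL window. */
    public boolean markIfFirst(String messageId, Instant now) {
        // Expire entries older than the TTL before checking.
        seen.entrySet().removeIf(e -> e.getValue().plus(ttl).isBefore(now));
        // putIfAbsent is atomic: exactly one caller wins for a given ID.
        return seen.putIfAbsent(messageId, now) == null;
    }
}
```

The consumer calls `markIfFirst` inside its transaction and silently acks duplicates instead of reprocessing them.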
Q6. When is XA / 2PC the right answer?
When you need atomic cross-resource commit across heterogeneous systems that all support XA - the classic case is MQ + DB2 + CICS in a mainframe-adjacent environment, or MQ + Oracle in a transactional messaging flow. The cost is real: the coordinator is a SPOF, in-doubt transactions need manual resolution after a coordinator crash, and locks are held until commit.
For most modern microservice work, outbox + async event propagation (see Part 6.11) replaces XA with no loss of correctness and less operational risk. Reach for XA only when the business requires it and the legacy stack supports it.
Q7. What's the difference between Network of Brokers (ActiveMQ Classic) and Artemis clustering?
Network of Brokers (Classic): brokers form a mesh; each broker can forward messages to another broker based on consumer demand. Good for geographic distribution (a consumer in region B can read a message produced in region A by the mesh pulling it across). Downside: ordering is not preserved across brokers, message flow is hard to reason about, and routing logic is complex.
Artemis clustering: peer brokers with a simpler "shared load" model - producers can connect to any broker, consumers can connect to any broker, and the cluster arranges for each message to be consumed exactly once across the cluster. Cleaner model, better tooling, and the direction new work should go.
Q8. How do you handle a poison message in IBM MQ?
Set `BOTHRESH` (backout threshold) on the queue to N - typically 3-5. Set `BOQNAME` to a dedicated backout queue. The consumer uses session-transacted mode; on exception, the session rolls back, the queue manager increments `BackoutCount`, and after `BOTHRESH` rollbacks the message is moved to the backout queue. The consumer code can also inspect `JMSXDeliveryCount` on receive and, if near the threshold, explicitly route the message to a custom BOQ with enriched headers (exception class, stack trace, producer correlation ID) before acking - this gives the operator far better triage info than the default move.
Q9. Kafka vs RabbitMQ - one sentence each.
Kafka: distributed log for high-throughput event streaming with replay. RabbitMQ: flexible-routing AMQP broker for traditional queue/pub-sub at moderate scale. (If they want more: pick Kafka for log-retention and scale, Rabbit for complex routing with topic exchanges, small-to-moderate throughput, and low ops overhead.)
Q10. You need to migrate 10 producers and 6 consumers from IBM MQ to Kafka. Walk me through.
- Parallel run. Set up the Kafka cluster. Dual-publish from producers (send to MQ and Kafka). Consumers still read from MQ.
- Consumer migration (one at a time). Pick a low-risk consumer; build a Kafka consumer alongside the MQ one. Run both against the same logical input for a validation window β compare outputs, fix discrepancies.
- Switch consumer source. When validated, cut the consumer to Kafka-only.
- Repeat for all 6 consumers.
- Stop dual-publishing once all consumers are off MQ.
- Retire MQ queues after a retention window.
Key decisions: (a) Message format - if MQ was text/XML, is it worth moving to Avro during the migration? Often yes. (b) Ordering - Kafka's per-partition ordering needs a deliberate key choice; MQ's per-queue ordering maps naturally to one Kafka partition, but then you lose parallelism. Most of the time per-entity keying is the right call. (c) Dedup - during dual-publish, consumers must dedup across MQ and Kafka sources.
Part 6 - Microservices Architecture & Patterns
6.1 Monolith vs microservices (and modular monoliths)
The "microservices vs monolith" debate is misleading - the real spectrum is monolith → modular monolith → microservices → distributed monolith. The last one is the worst of all worlds.
When a monolith is right:
- Small team (<10 engineers).
- Single transactional domain where cross-module transactions are normal.
- No independent scaling needs (compute demand is roughly uniform across the app).
- No independent deploy needs (everyone ships at the same cadence).
- Early-stage product: you don't yet know the domain boundaries well enough to draw service lines.
When microservices earn their keep:
- Multiple teams with distinct roadmaps and cadences.
- Clear domain boundaries that map to services (DDD bounded contexts).
- Parts of the system have very different scaling profiles or tech requirements.
- Regulatory isolation (e.g., regulated data must be in a different deployment from non-regulated data).
Modular monolith is often the right answer in the middle: one deployable, but strict module boundaries with well-defined interfaces. Lets you split into services later when you learn the right boundaries. Popularized by Shopify and DHH.
Distributed monolith warning signs:
- Services have to be deployed together.
- A release requires coordination across teams.
- Services share a database.
- One service's outage reliably takes down others.
- You have microservices without microservice benefits (independent deploy, independent scale, team autonomy).
6.2 DDD essentials for interviews
Full DDD is a book. For interviews, focus on these:
- Bounded context: an explicit boundary within which a domain model is consistent and its terms have precise meaning. "Order" in the fulfillment context is different from "Order" in the billing context. Service boundaries = bounded contexts, not tables.
- Aggregate: a cluster of domain objects treated as a unit for data changes. An aggregate has a root (the only externally referenced entity) and an invariant (a rule that must hold across the cluster). Transactional writes happen at the aggregate level.
- Ubiquitous language: the team and the code agree on the same terms for the same concepts. Avoids translation errors between product and engineering.
- Context map: a diagram of bounded contexts and the relationships between them (partnership, customer/supplier, conformist, anti-corruption layer).
- Anti-corruption layer (ACL): a translation layer between your domain and an external domain you don't control, so external changes don't leak into your model.
Example: a JAXB migration against an external schema-bound XML exchange format is essentially an ACL - the external schema is a foreign domain, and the JAXB layer translates between that XML schema and the internal domain model.
6.3 Service-to-service communication
Synchronous (REST, gRPC):
- Low-latency request/response, strong typing (gRPC + protobuf), good when the caller needs an immediate answer.
- Tight coupling: caller blocks on callee; callee outage cascades.
- Default for "commands" where the caller needs a result.
Asynchronous (events via broker):
- Loose coupling: producer doesn't know (or care) who consumes.
- Resilient to callee outage (messages buffer in the broker).
- Higher end-to-end latency; requires eventual-consistency thinking.
- Default for "facts" ("this thing happened") - fan-out to many interested consumers.
Hybrid (the modern default): commands go via REST/gRPC when the caller needs the result; domain events (facts about state changes) are published to a broker for everyone else to react to. You use both in the same system.
```
Client --REST--> OrderService (command: create order)
                     |
                     +-- DB commit
                     +-- emit OrderCreated event --> Kafka --+--> ShippingService
                                                             +--> BillingService
                                                             +--> AnalyticsService
```
6.4 Idempotency & request dedup
Every mutating endpoint in a microservices world should be idempotent. Two mechanisms:
Idempotency-Key header (Stripe-style): client sends `Idempotency-Key: <uuid>` with any mutating call. Server stores (key, result) in a dedup table with TTL (24-48 h typical). On retry with the same key, the server returns the stored result without re-executing.
```java
@PostMapping("/orders")
public ResponseEntity<Order> createOrder(
        @RequestHeader("Idempotency-Key") UUID key,
        @RequestBody CreateOrderRequest req) {
    Optional<Order> existing = idempotencyStore.find(key);
    if (existing.isPresent()) {
        return ResponseEntity.ok(existing.get());
    }
    Order created = orderService.create(req);
    idempotencyStore.save(key, created);
    return ResponseEntity.status(CREATED).body(created);
}
```

Business-key dedup (message processing): use a natural unique ID from the payload (order ID, case ID) as the dedup key on the consumer side. Don't introduce a new key when the payload already has one.
Gotcha: concurrent requests with the same idempotency key. Either use a DB unique index on the key (second INSERT fails, retry the SELECT) or a distributed lock (Redis RLock) to serialize.
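The race can also be illustrated without a database: a minimal first-writer-wins sketch where an atomic map operation plays the role of the DB unique-index insert (class and names are illustrative, not a production store - a real one needs a TTL and durable storage):

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Sketch: serialize concurrent requests that share an idempotency key.
// computeIfAbsent plays the role of the DB unique-index INSERT: exactly one
// caller wins the race and executes; everyone else gets the stored result.
public class IdempotentExecutor {
    private final Map<UUID, Object> results = new ConcurrentHashMap<>();

    @SuppressWarnings("unchecked")
    public <T> T execute(UUID key, Supplier<T> operation) {
        // Runs the operation at most once per key, even under concurrent calls.
        return (T) results.computeIfAbsent(key, k -> operation.get());
    }
}
```

A second call with the same key returns the first call's result without re-executing the operation - exactly the contract the HTTP endpoint above promises.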
6.5 API gateway
A gateway is the single entry point to the microservice world from external clients.
Responsibilities:
- Routing (map `/orders/*` to `order-service`).
- Authentication/authorization at the edge.
- Rate limiting, quotas.
- Request/response transformation.
- TLS termination.
- Logging, tracing, metrics at the edge.
Pitfalls:
- SPOF: if the gateway is down, nothing is reachable. Run it HA with health checks.
- Feature creep: every team wants to add "one little bit of logic" to the gateway. Say no. The gateway should be domain-ignorant; business logic goes in services.
- Latency hop: every request now has an extra network hop. Budget for it.
- Coupling via the gateway: if all services share one gateway config and deploys are global, you've recreated a monolith at the edge.
BFF (Backend-for-Frontend): instead of one gateway for all clients, have one gateway per client type (web BFF, mobile BFF, partner API BFF). Each can aggregate services in the shape that client needs, without polluting a shared gateway with client-specific logic.
Popular gateways: Spring Cloud Gateway, Kong, Envoy-based (Ambassador, Contour), AWS API Gateway, Kubernetes Gateway API.
6.6 Service discovery
How does Service A find Service B's address?
Client-side discovery: client queries a registry (Eureka, Consul) and picks an instance. Client handles load balancing. Pros: no extra hop. Cons: every client language needs the registry client.
Server-side discovery: client calls a well-known DNS name or load balancer; the infra routes to an instance. Kubernetes Services + CoreDNS is this. Pros: language-agnostic. Cons: extra hop, LB can be a bottleneck.
Service mesh: a sidecar (Envoy) on every pod handles discovery and load balancing per request. Next layer up - see 6.7.
Kubernetes specifics: every K8s Service gets a DNS name (my-svc.my-ns.svc.cluster.local). Clients just resolve DNS; kube-proxy iptables/IPVS routes to a pod. Server-side discovery with no coordination overhead.
6.7 Service mesh
A service mesh adds an infrastructure layer for service-to-service communication, decoupled from application code.
Components:
- Data plane: sidecar proxies (usually Envoy) injected into every pod. All inter-service traffic goes through them.
- Control plane: manages the sidecars - config, certs, traffic rules (Istio's Istiod, Linkerd's control plane).
What a mesh gives you without app code changes:
- mTLS everywhere: every pod gets a cert; sidecars terminate TLS. Zero-trust networking inside the cluster.
- Traffic management: canary, traffic shifting, mirroring, retries, timeouts - configured declaratively.
- Observability: metrics, traces, logs from the sidecars - no app instrumentation needed (at a basic level).
- Policy: authorization (service A may call service B on `/foo` but not `/bar`), rate limits.
Sidecar vs ambient: Istio's new "ambient mode" (GA 2024) moves mTLS into a per-node ztunnel, and L7 features into optional "waypoint" proxies. No per-pod sidecar. Lower overhead, more operational moving parts.
Trade-offs: service meshes add complexity, consume CPU/memory per pod, and introduce a new team/skillset. Worth it at 20+ services; overkill at 5.
6.8 API versioning
How do you evolve an API without breaking clients?
Strategies:
- URL versioning (`/v1/orders`, `/v2/orders`): most visible; easy routing; clients must pick. Default for public APIs.
- Header versioning (`Accept: application/vnd.orders.v2+json`): clean URLs; harder to test in a browser; less discoverable.
- Query-param versioning (`?version=2`): easy to test; least favored - caches and intermediaries sometimes strip query params.
- Content negotiation via media type: formal, well-defined by HTTP, not widely loved.
Deprecation playbook:
- Ship `v2` alongside `v1`.
- Add a `Deprecation` header (IETF draft) and a `Sunset` header on `v1` responses.
- Measure `v1` traffic per client; reach out to each client.
- Monitor `v1` usage until it drops to zero (or below a threshold).
- Remove `v1` after the announced sunset date.
Consumer-driven contracts (Pact): consumers publish their expectations (what requests they send, what they expect back); producers run those expectations as tests. Prevents breaking changes from shipping β the producer's build fails if any consumer's contract breaks. Powerful in a multi-team microservice environment.
6.9 Distributed data patterns
Database per service: each service owns its database; no other service reads it directly. Interactions are via the service's API. Enforces encapsulation.
- Pro: independent schema evolution, independent scaling, clear ownership.
- Con: no joins across services - duplicated data, async consistency, saga-style distributed updates.
Shared database (the anti-pattern): multiple services write to the same DB. Coupling via schema; a migration breaks everyone; no one owns the data model. Recognize this on Day 1 and fight to migrate out.
CQRS (Command Query Responsibility Segregation): separate write and read models. Writes go to one model (normalized, optimized for consistency); reads go to another (denormalized, optimized for query patterns). Sync via events.
- Pro: fit-for-purpose read models (a "product listing view," a "billing view," a "shipping view" - each denormalized for its query).
- Con: eventual consistency between write and read; more moving parts.
- Use when query patterns diverge enough that a single model hurts both.
Event sourcing: state is derived from a sequence of events, not a mutable row. The event log is the source of truth; any view is a projection of the log.
- Pro: perfect audit (replay history), temporal queries ("what was the state last Tuesday?"), fits naturally with CQRS.
- Con: dramatically higher conceptual complexity; migrations are event-schema migrations; snapshots needed for long aggregates; rare genuine use cases.
- Honest take: don't reach for event sourcing unless you have a specific need (regulated audit trail with full reconstructibility, complex temporal queries). It's often cited as "the microservices pattern" but is the wrong default 95% of the time.
6.10 The Saga pattern
Distributed transactions without 2PC. A saga is a sequence of local transactions, each in its own service; if a step fails, compensating transactions undo the prior steps.
Orchestration (central coordinator):
```
OrderSaga:
  1. CreateOrder      --> OrderService
  2. ReserveInventory --> InventoryService
  3. ChargeCard       --> PaymentService
  4. ScheduleShipment --> ShippingService

On step 3 failure:
  - ReleaseInventory --> InventoryService
  - CancelOrder      --> OrderService
```

The orchestrator (e.g., Temporal, Camunda, AWS Step Functions, or a hand-rolled state machine) knows the flow. Pro: visible, testable, debuggable. Con: coupling to the orchestrator.
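A hand-rolled orchestrator of the kind mentioned above can be sketched as a list of (action, compensation) pairs: run actions in order and, on failure, run the compensations of the completed steps in reverse. All names here are illustrative, and real orchestrators (Temporal, Camunda) add persistence, retries, and timers on top:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Sketch of saga orchestration: each step is a local action plus a compensator.
public class SagaOrchestrator {
    record Step(String name, Runnable action, Runnable compensation) {}

    private final List<Step> steps = new ArrayList<>();

    public SagaOrchestrator step(String name, Runnable action, Runnable compensation) {
        steps.add(new Step(name, action, compensation));
        return this;
    }

    /** Runs steps in order; on failure, compensates completed steps in reverse. Returns true on success. */
    public boolean run() {
        Deque<Step> completed = new ArrayDeque<>();
        for (Step s : steps) {
            try {
                s.action().run();
                completed.push(s);
            } catch (RuntimeException failure) {
                // Undo everything that already committed, newest first.
                while (!completed.isEmpty()) completed.pop().compensation().run();
                return false;
            }
        }
        return true;
    }
}
```

With the OrderSaga above, a `ChargeCard` failure triggers `ReleaseInventory` and then `CancelOrder`, matching the compensation order in the diagram.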
Choreography (event-driven, no central coordinator):
```
OrderCreated --> InventoryService reserves --emits--> InventoryReserved
                                                           |
                                                           v
                                 PaymentService charges --emits--> PaymentFailed
                                                           |
                                                           v
                                           InventoryService releases
```

Each service reacts to events. Pro: loose coupling, no single point of failure. Con: the flow is implicit - you can't see it on one screen.
Which to choose:
- Orchestration when the flow is complex, has many branches, or needs visibility (customer service needs to see where a saga is stuck).
- Choreography when the flow is simple and each step is a natural domain event.
- In practice, large systems often use both β high-level orchestration for long-running business flows, choreography for local cross-service side effects.
6.11 Outbox & Inbox patterns
The dual-write problem (repeat from §3.8): you need to update a DB and publish a message atomically. Two operations can't be atomic without a coordinator.
Outbox pattern: in the same DB transaction as the business write, insert the event into an outbox table. A separate publisher reads the outbox and publishes to the broker.
```sql
BEGIN;
INSERT INTO orders (...);
INSERT INTO outbox (id, aggregate_id, type, payload, created_at) VALUES (...);
COMMIT;
```

```java
// Publisher (polling variant, runs in a background thread)
@Scheduled(fixedDelay = 500)
@Transactional
public void drainOutbox() {
    List<OutboxEntry> batch = outboxRepo.fetchPending(100);
    for (var e : batch) {
        kafka.send(e.topic(), e.aggregateId(), e.payload());
        outboxRepo.markPublished(e.id());
    }
}
```

Or: CDC-driven - Debezium reads the DB's WAL/oplog and publishes outbox-table inserts directly to Kafka. No polling, no custom publisher code, but it requires CDC infrastructure.
Inbox pattern (the consumer side): on receiving an event, write (event_id, processed_at) into an inbox table in the same tx as the business work. A retry of the same event finds the row and skips processing.
```java
@Transactional
public void onEvent(OrderEvent e) {
    if (inboxRepo.exists(e.id())) return;   // dedup
    inboxRepo.save(new InboxEntry(e.id())); // mark processed
    orderService.handle(e);                 // business work
    // all three commit atomically
}
```

Outbox + Inbox = reliable, effectively-once cross-service messaging without XA. This is the standard answer to "how do you handle transactional messaging in microservices?"
6.12 Schema & contract evolution
Avro/Protobuf/JSON Schema are all valid choices; the question is whether your toolchain supports evolution natively.
Expand-contract (the canonical zero-downtime schema change):
- Expand: add the new field / new schema version as optional. Producers may or may not populate it.
- Deploy consumers that read both old and new shapes.
- Deploy producers that write only the new shape.
- Contract: consumers stop reading the old shape; drop the old field.
Works at the DB column level, the message schema level, and the REST API level. The discipline is: never break in one step.
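The consumer-side step (read both shapes) can be sketched like this - a hypothetical reader that prefers the new field and falls back to the legacy one; the field names `customer_name` and `name` are made up for illustration:

```java
import java.util.Map;

// Sketch: during expand-contract, the consumer tolerates both message shapes.
// "customer_name" is the hypothetical new field, "name" the legacy one.
public class DualShapeReader {
    public static String customerName(Map<String, Object> payload) {
        Object v = payload.get("customer_name"); // new shape, preferred
        if (v == null) v = payload.get("name");  // old shape, fallback
        if (v == null) throw new IllegalArgumentException("no customer name in either shape");
        return v.toString();
    }
}
```

Once producers emit only the new shape and the old messages have aged out of retention, the fallback branch is dead code and can be removed - that's the "contract" step.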
Consumer-driven contracts (Pact): on every producer build, run the test suite generated from every consumer's expectations. Breaks prevent merge. Essential in a multi-team setup β one team's refactor can't break another's consumers.
6.13 Distributed tracing & correlation
The concepts (full coverage in §8 and the future OBSERVABILITY.md):
W3C Trace Context defines two headers:
- `traceparent: 00-<trace-id>-<span-id>-<flags>` - the current trace and parent span.
- `tracestate` - vendor-specific baggage.
Baggage: the W3C `baggage` header - propagates arbitrary key-value pairs across services for a trace (user ID, tenant, feature flag). Separate from tracing but correlated.
MDC injection: at request entry, extract trace_id/span_id and put them in SLF4J's MDC (Mapped Diagnostic Context). Every log line emits them automatically. You can then pivot from a trace to logs and back.
```java
// Spring filter auto-added by Micrometer/OTel starters
@GetMapping("/orders/{id}")
public Order get(@PathVariable String id) {
    log.info("Fetching order {}", id);
    // logs will carry trace_id + span_id in MDC
    return orderService.get(id);
}
```

Propagation across brokers:
- Kafka: headers carry `traceparent` and `baggage`. Spring Kafka + Micrometer Tracing injects/extracts them automatically.
- JMS: message properties carry them. Spring JMS + Micrometer Tracing does the same.
- HTTP: standard headers.
Sampling: at full sample rate, tracing is expensive. Options:
- Head-based: decide at the trace root whether to sample (e.g., 1%). Simple but may miss rare errors.
- Tail-based: collector buffers whole traces, samples after seeing the whole picture (always keep errors, always keep slow traces). More accurate, more infrastructure.
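Head-based sampling is usually implemented as a deterministic function of the trace ID, so every service in the call chain makes the same keep/drop decision without coordination. A sketch of the idea (not any particular SDK's algorithm - real tracers hash the trace ID's raw bytes, not `hashCode`):

```java
// Sketch of head-based sampling: map the trace ID into a bucket in [0, 10000)
// and compare against the sample rate. Deterministic, so all services agree.
public class HeadSampler {
    public static boolean sample(String traceId, double rate) {
        // floorMod keeps the bucket non-negative even for negative hash codes.
        int bucket = Math.floorMod(traceId.hashCode(), 10_000);
        return bucket < rate * 10_000;
    }
}
```

The determinism is the point: if the root sampled the trace, every downstream span is kept too, so you never get half a trace.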
6.14 Part 6 Q&A
Q1. When is a monolith the right choice?
Early-stage product (domain unclear), small team (<10), no independent-scale/deploy needs. A modular monolith β one deployable, strict module boundaries β is the right default until clear service boundaries and team autonomy requirements emerge. Premature microservices turn every PR into a cross-team coordination problem and every deploy into a distributed systems failure drill. "Monolith first" is Fowler's advice and it's usually right.
Q2. What's a distributed monolith and how do you spot one?
A distributed monolith is a microservices architecture that has the costs of distribution (network hops, eventual consistency, operational overhead) without the benefits (independent deploy, scale, team autonomy). Signs: services must deploy together; one service's DB is read directly by another; a release requires coordinating across teams; one outage cascades through the system; shared libraries are lock-stepped across services. Fix by rehabilitating one boundary at a time - usually by introducing async events between tightly coupled services and deprecating the sync chain.
Q3. Synchronous vs asynchronous β how do you choose?
Use sync (REST/gRPC) when the caller needs the result to proceed - commands, queries. Use async (events through a broker) for "this happened" facts that other services react to. In practice, a well-shaped service uses both: sync for the command that creates state, then async events for everyone else. The mistake is using sync for fan-out (the caller waits on 5 downstream services) - that's when you get cascading latency and outages. Fan-out belongs in the broker.
Q4. Explain the circuit breaker pattern.
A circuit breaker wraps a potentially failing call and fails fast when the downstream is known bad, saving CPU, threads, and cascading latency. Three states:
- Closed: calls pass through; failures tracked in a sliding window.
- Open: calls fail fast (return error or fallback) without hitting the downstream. After a cool-down, transition to half-open.
- Half-open: let a limited number of probes through; if they succeed, close; if they fail, open again.
Resilience4j implements this with `CircuitBreakerConfig` (`failureRateThreshold`, `slidingWindow`, `waitDurationInOpenState`). Tune the threshold to your domain - a 50% failure rate is a common starting point, tuned down for critical paths.
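The three states can also be sketched as a toy state machine - this is not Resilience4j's implementation (no sliding window, just a consecutive-failure count), and all names are illustrative:

```java
import java.time.Duration;
import java.time.Instant;

// Toy circuit breaker: CLOSED -> OPEN after N consecutive failures,
// OPEN -> HALF_OPEN after a cool-down, HALF_OPEN -> CLOSED on a successful probe.
public class ToyCircuitBreaker {
    public enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;
    private final Duration coolDown;
    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private Instant openedAt;

    public ToyCircuitBreaker(int failureThreshold, Duration coolDown) {
        this.failureThreshold = failureThreshold;
        this.coolDown = coolDown;
    }

    /** Call before each downstream attempt; false means fail fast (breaker open). */
    public synchronized boolean allowCall(Instant now) {
        if (state == State.OPEN && now.isAfter(openedAt.plus(coolDown))) {
            state = State.HALF_OPEN; // cool-down elapsed: let a probe through
        }
        return state != State.OPEN;
    }

    public synchronized void onSuccess() {
        consecutiveFailures = 0;
        state = State.CLOSED;
    }

    public synchronized void onFailure(Instant now) {
        consecutiveFailures++;
        if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
            state = State.OPEN; // trip (or re-trip after a failed probe)
            openedAt = now;
        }
    }

    public synchronized State state() { return state; }
}
```

Passing `Instant` in explicitly (instead of calling `Instant.now()` inside) keeps the state machine deterministic and trivially testable.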
Q5. Why is naive retry dangerous?
Naive retry (loop immediately on failure) amplifies load on an already-struggling downstream. Every caller retrying at once = retry storm = downstream stays down. Mitigations: exponential backoff (each retry waits longer), jitter (randomize within a window so clients don't sync up), retry budgets (cap total retry rate across the service - if >10% of traffic is retries, stop), and idempotency (retry must be safe). In Resilience4j: `RetryConfig` with `intervalFunction(IntervalFunction.ofExponentialRandomBackoff(500, 2.0))`.
Q6. Orchestration vs choreography in a saga β when each?
Orchestration when the flow is complex, branchy, has visibility/auditability needs, or when the business flow is a recognizable "process" that should be modeled explicitly (Temporal, Camunda, Step Functions). The orchestrator owns the state machine; services are commands.
Choreography when the flow is simple and each step is a natural domain event. Each service emits "I did my bit" and listens for upstream steps. Looser coupling, no single coordinator. But flow is implicit β debugging requires tracing across services.
Rule of thumb: orchestrate flows customers care about ("where's my order?"), choreograph internal ripples. In one case-processing system, the main flow was orchestrated (because it had branching, manual steps, and SLA visibility requirements) while secondary effects (analytics, notifications) were choreographed.
Q7. Explain the outbox pattern and when you'd use it.
In a microservice that writes to a DB and also needs to publish an event, the two operations can't be atomic without a distributed coordinator (XA). The outbox pattern: in the same DB transaction as the business write, insert the event into an `outbox` table. A separate publisher - a poller, or Debezium CDC on the outbox table - reads unpublished rows and publishes to the broker, marking them sent on success.
The publish is at-least-once; consumers must dedup (usually via the inbox pattern on the receiver side: check a processed-events table). You get effectively-exactly-once delivery across services, no XA, no 2PC. It's the standard answer to "transactional messaging in microservices." The only cost: a few extra rows per transaction and a small publisher process.
Q8. How do you make a POST endpoint idempotent?
Require an `Idempotency-Key: <uuid>` header from the client. The server stores `(key, response)` in a dedup table with a 24-48 h TTL. On the first request, execute, store the result, and return it. On a retry with the same key, return the stored result without re-executing.
Gotcha: concurrent requests with the same key. Either (a) `INSERT ... ON CONFLICT DO NOTHING` into the dedup table before processing (the second caller sees the row and waits/polls for the result), or (b) take a distributed lock on the key. Stripe's public API is the canonical reference.
Q9. What does a service mesh give you, and when is it worth the cost?
Data plane (Envoy sidecar per pod) + control plane (Istio/Linkerd). You get mTLS everywhere, traffic shifting for canaries, retries/timeouts/circuit-breakers at the network layer, and observability (metrics/traces/logs) without app code changes.
Worth it at 20+ services, or when mTLS / zero-trust is a compliance requirement, or when the org needs consistent traffic policies across polyglot services. Not worth it at 5 services - the CPU/memory overhead per pod and the new operational skillset outweigh the wins. Ambient mode (Istio 1.24+) reduces the cost but adds its own complexity.
Q10. How would you handle a breaking schema change across 6 services reading from one Kafka topic?
Expand-contract over multiple deploys, with Schema Registry enforcing compatibility:
- Add the new field/shape backward-compatibly (optional in Avro, defaulted). Register new schema.
- Deploy all 6 consumers with code that handles both old and new shapes.
- Deploy the producer with the new schema as the canonical shape.
- After a safe window (say, one retention cycle), remove old-shape code from consumers.
- If you really need to remove the old field, make one more deploy dropping it. Optionally set registry compat to NONE for one release and back to BACKWARD.
This is exactly the cross-team schema standardization story - teams got bitten by skipping step 1 (producers shipped without a default, consumers died on parse). Adding a Schema Registry with BACKWARD compat mode plus a PR-review gate fixes it at the point of change rather than downstream.
Q11. How do you prevent cascading failure when a downstream service is slow?
Stack of defenses:
- Timeouts (always, never missing). Set both connect and read timeouts shorter than any caller's timeout.
- Circuit breaker: fail fast after the downstream's failure rate spikes.
- Bulkhead: isolate resources for the slow dependency so it can't starve everything else (e.g., a dedicated thread pool or semaphore per downstream).
- Fallback: return cached data, a default response, or a degraded feature instead of propagating the failure.
- Retry budget: cap total retries so retries don't amplify load.
- Backpressure: if using reactive streams or queues, signal upstream to slow down.
The Netflix Hystrix lesson: without bulkheads, one slow dep can exhaust your entire thread pool and take the whole service down. Don't rely on a single defense.
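The semaphore-style bulkhead from the list above can be sketched with a plain `java.util.concurrent.Semaphore` (names are illustrative; Resilience4j's `Bulkhead` is the production equivalent):

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Sketch: at most `maxConcurrent` in-flight calls to one downstream.
// A slow dependency then saturates only its own permits, not the whole service.
public class Bulkhead {
    private final Semaphore permits;

    public Bulkhead(int maxConcurrent) {
        this.permits = new Semaphore(maxConcurrent);
    }

    /** Runs the call if a permit is free; otherwise fails fast into the fallback. */
    public <T> T call(Supplier<T> downstream, Supplier<T> fallback) {
        if (!permits.tryAcquire()) {
            return fallback.get(); // bulkhead full: degrade instead of queueing threads
        }
        try {
            return downstream.get();
        } finally {
            permits.release();
        }
    }
}
```

One bulkhead instance per downstream dependency is the point - the slow service exhausts its own permits while calls to healthy dependencies keep flowing.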
Q12. How do you correlate logs across services for a single user request?
Generate a trace ID at the edge (API gateway or entry service). Propagate it via the W3C `traceparent` header on HTTP calls and via Kafka/JMS message headers. Inject `trace_id` and `span_id` into SLF4J's MDC at request boundaries; structured logs emit them on every line.
Tooling: Micrometer Tracing or the OpenTelemetry Java agent handles propagation automatically. Elasticsearch/Grafana Loki indexes `trace_id`. Jaeger/Tempo stores traces. Cross-link logs and traces via the trace ID.
Q13. What's the difference between client-side and server-side service discovery?
Client-side: the client queries a registry (Eureka, Consul) for available instances and picks one. Client does the load balancing. Benefit: no extra hop, client can do smart load balancing. Drawback: every client language needs a registry client.
Server-side: the client calls a well-known name or load balancer; the infra routes to an instance. Kubernetes Services + CoreDNS is the modern default. Language-agnostic, the platform handles it. Extra hop, but kube-proxy makes that nearly free via iptables/IPVS.
In practice Kubernetes + service mesh is what you see in 2026. Explicit registries (Eureka) are legacy or specific to Spring Cloud.
Q14. What is CQRS and when is it the right pattern?
CQRS: separate the write model and the read model. Writes go through a command handler that updates a normalized, invariant-enforcing model. Reads query denormalized projections optimized for specific query patterns. The two models are sync'd via events (usually).
Right pattern when: (a) query patterns are very different from write patterns, (b) read scalability needs to outpace write, (c) different read views need wildly different data shapes. Wrong pattern when: you're adding complexity for no win β a single model serves both perfectly well. CQRS often gets mistakenly bundled with event sourcing; they're orthogonal. You can have CQRS without ES and ES without CQRS.
Q15. How do you version a REST API?
URL versioning (`/v2/orders`) is the default - most discoverable, easiest to route, most common. Header versioning is cleaner but harder to test ad hoc. Deprecation strategy: ship v2 alongside v1, add `Deprecation` and `Sunset` headers to v1 responses, measure per-client v1 traffic, announce a sunset date, track adoption, remove v1 on the date. If the clients are in your org, use consumer-driven contracts (Pact) so producers can't ship a breaking change to v1 by accident.
Part 7 - Resilience & Failure Modes
The single biggest distinction between a "junior built this" and a "senior built this" service is how it behaves when something downstream breaks. Resilience patterns are compositional - use all of them together, not just one.
7.1 Failure taxonomy
The standard hierarchy of failure modes in distributed systems (Flynn's taxonomy is about processor architectures, not failures):
- Crash failure: process stops. Cleanest to detect and recover from.
- Omission failure: process runs but drops some messages (sends none, or receives none). Crash is the limiting case: every message after some point is omitted.
- Timing failure: process responds, but too late to be useful. The real-world default - timeouts exist to detect exactly this.
- Response failure: process responds, but with wrong data.
- Byzantine failure: process behaves arbitrarily - it could lie, corrupt data, or collude. Worst case.
Enterprise systems defend against 1-4 (crash-tolerant). Byzantine tolerance is the domain of blockchain and adversarial multi-org systems. In hardened-perimeter environments, insiders are assumed benign and Byzantine is out of scope.
7.2 Timeouts: always, never missing
Rule: every I/O call must have a timeout. No exceptions.
Hierarchy of timeouts:
- Connect timeout: how long to wait for a TCP connection to establish.
- Socket/read timeout: how long to wait for bytes once connected.
- Request timeout: total wall-clock time for an end-to-end request, including retries.
Deadline propagation: the caller's total budget should be passed along so callees know how much time they have. gRPC supports deadlines natively; REST requires an X-Request-Deadline header or similar convention. Without propagation, the callee may still be working on a request the caller already gave up on.
Timeout budget math (a calling chain A β B β C):
- A's total timeout: 500 ms.
- B's timeout on its call to C: should be less than what A has left. If A's 500 ms minus B's own processing time leaves 400 ms, B's timeout to C should be ~350 ms to leave a retry window.
- Always: child timeout < parent remaining time.
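The budget arithmetic above can be wrapped in a tiny helper β a sketch only, assuming a hypothetical X-Request-Deadline-style absolute deadline is passed between hops (class and method names are illustrative):

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical deadline-propagation helper: the caller passes an absolute
// deadline; each hop derives its child timeout from what is left, minus a
// reserved margin (own processing time + a retry window).
public class DeadlineBudget {
    private final Instant deadline;

    public DeadlineBudget(Duration totalBudget) {
        this.deadline = Instant.now().plus(totalBudget);
    }

    /** Time remaining before the caller gives up. */
    public Duration remaining() {
        Duration left = Duration.between(Instant.now(), deadline);
        return left.isNegative() ? Duration.ZERO : left;
    }

    /** Child timeout = remaining budget minus the reserved margin. */
    public Duration childTimeout(Duration reservedMargin) {
        Duration left = remaining().minus(reservedMargin);
        if (left.isNegative() || left.isZero()) {
            throw new IllegalStateException("Deadline exceeded - fail fast instead of calling downstream");
        }
        return left;
    }

    public static void main(String[] args) {
        DeadlineBudget budget = new DeadlineBudget(Duration.ofMillis(500)); // A's total budget
        // B reserves ~150 ms (own processing + one retry window) before calling C.
        Duration toC = budget.childTimeout(Duration.ofMillis(150));
        System.out.println("Timeout for call to C: ~" + toC.toMillis() + " ms");
    }
}
```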
Common mistakes:
- Default client timeouts (some are infinite β HttpURLConnection in older Java). Verify in your code.
- Connect timeout set, read timeout not β connection establishes, then hangs forever.
- Single long timeout masking chronic slowness.
7.3 Retries β with backoff, jitter, budget β
Retries turn transient failures into success. Bad retries turn transient failures into outages.
Always:
- Exponential backoff (delay = base * 2^attempt).
- Jitter (randomize within a window). Prevents all clients retrying at the exact same tick β "thundering herd."
- Bounded attempts (3β5 typical).
- Retry only idempotent or idempotency-keyed operations.
- Retry budget: cap total retry rate (e.g., no more than 10% of traffic is retries). If retries exceed budget, stop retrying β the downstream is broken and retries won't help.
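The backoff + jitter arithmetic, in plain JDK terms β a full-jitter sketch (delay drawn uniformly from 0 up to the capped exponential); library retry policies compute something equivalent internally:

```java
import java.util.Random;

// Full-jitter exponential backoff: delay = random(0, min(cap, base * 2^attempt)).
// The randomization desynchronizes clients and prevents thundering-herd retries.
public class Backoff {
    private static final Random RNG = new Random();

    static long delayMillis(long baseMillis, long capMillis, int attempt) {
        long exp = baseMillis * (1L << Math.min(attempt, 30)); // clamp shift to avoid overflow
        long ceiling = Math.min(capMillis, exp);               // cap the exponential growth
        return (long) (RNG.nextDouble() * ceiling);            // jitter: uniform in [0, ceiling)
    }

    public static void main(String[] args) {
        for (int attempt = 0; attempt < 4; attempt++) {
            System.out.printf("attempt %d: ceiling %d ms, chose %d ms%n",
                    attempt,
                    Math.min(10_000, 500L << attempt),
                    delayMillis(500, 10_000, attempt));
        }
    }
}
```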
Resilience4j:
java
RetryConfig config = RetryConfig.custom()
.maxAttempts(4)
.intervalFunction(IntervalFunction.ofExponentialRandomBackoff(500, 2.0, 0.5))
.retryOnException(e -> e instanceof TransientException)
.build();
Retry retry = Retry.of("billingService", config);
Supplier<BillingResult> decorated = Retry.decorateSupplier(retry, () -> billing.charge(req));
Retry storms β how they happen and how to prevent:
Downstream slow β clients retry β more load β slower β more clients retry β (loop).
Prevention: jitter (desynchronize), backoff (don't retry immediately), circuit breaker (stop retrying when clearly broken), retry budget (cap system-wide).
7.4 Circuit breaker β
Wraps a call. If failure rate exceeds a threshold, opens the circuit β subsequent calls fail fast without hitting the downstream. After a cool-down, transitions to half-open; a limited number of test calls determine whether to close or re-open.
State machine:
              fail rate > threshold
  CLOSED ------------------------------> OPEN
    ^                                      |
    |                                      |  after wait-duration
    |  probe success                       v
    |  (below threshold)              HALF_OPEN
    +--------------------------------------+
                                           |
                                           |  probe failure
                                           +------> OPEN

Resilience4j:
java
CircuitBreakerConfig cfg = CircuitBreakerConfig.custom()
.failureRateThreshold(50) // 50% fail β open
.slidingWindowType(SlidingWindowType.COUNT_BASED)
.slidingWindowSize(20) // last 20 calls
.minimumNumberOfCalls(10) // need at least 10 calls before computing the rate
.waitDurationInOpenState(Duration.ofSeconds(30))
.permittedNumberOfCallsInHalfOpenState(5)
.build();
CircuitBreaker cb = CircuitBreaker.of("paymentGateway", cfg);
Where to apply:
- Per external dependency (payment, email, third-party API).
- Usually not per database (the DB is rarely the failing part at this layer β pool exhaustion is the real symptom).
- Per-endpoint if different endpoints on the same dep have different failure profiles.
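The state machine can be made concrete with a hand-rolled breaker β a teaching sketch only (count-based, consecutive failures), not a substitute for Resilience4j:

```java
// Minimal circuit breaker illustrating the CLOSED -> OPEN -> HALF_OPEN transitions.
public class MiniBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private long openedAt = 0;
    private final int failureThreshold;   // consecutive failures before opening
    private final long openMillis;        // cool-down before allowing a probe

    MiniBreaker(int failureThreshold, long openMillis) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
    }

    /** Gate every call through this; false = fail fast without hitting the downstream. */
    synchronized boolean allowRequest() {
        if (state == State.OPEN && System.currentTimeMillis() - openedAt >= openMillis) {
            state = State.HALF_OPEN;      // cool-down elapsed: let a probe through
        }
        return state != State.OPEN;
    }

    synchronized void onSuccess() {
        consecutiveFailures = 0;
        state = State.CLOSED;             // normal success, or a successful probe
    }

    synchronized void onFailure() {
        consecutiveFailures++;
        if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
            state = State.OPEN;           // probe failed, or threshold crossed
            openedAt = System.currentTimeMillis();
            consecutiveFailures = 0;
        }
    }

    synchronized State state() { return state; }

    public static void main(String[] args) throws InterruptedException {
        MiniBreaker cb = new MiniBreaker(3, 100);
        for (int i = 0; i < 3; i++) { cb.allowRequest(); cb.onFailure(); }
        System.out.println(cb.state());   // OPEN
        Thread.sleep(150);
        cb.allowRequest();                // cool-down elapsed: HALF_OPEN probe allowed
        cb.onSuccess();
        System.out.println(cb.state());   // CLOSED
    }
}
```

Resilience4j's real breaker adds sliding windows, slow-call detection, and metrics on top of exactly this skeleton.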
7.5 Bulkhead β
Isolate resources so one failure domain can't starve another. Named after ship compartments β one flooded compartment doesn't sink the whole ship.
Two variants:
- Thread-pool bulkhead: separate thread pool per downstream. One slow dep fills its own pool; other deps still have threads.
- Semaphore bulkhead: a semaphore per downstream, bounding concurrent calls. Lighter than thread pools; doesn't isolate blocking time (the blocked call still holds the caller thread).
When each:
- Thread-pool: external calls with unbounded blocking (legacy HTTP clients, JDBC).
- Semaphore: fast in-process calls, or async clients (Netty, Reactor).
Tradeoff: bulkheads reduce utilization (you have more pools, each with spare capacity). Worth it β the whole point is to have spare capacity reserved so critical paths survive when non-critical ones saturate.
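A semaphore bulkhead is small enough to sketch with plain JDK primitives (illustrative only; Resilience4j ships production versions of both variants):

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Semaphore bulkhead: at most N concurrent calls to one downstream;
// excess callers fail fast instead of piling up on the shared thread pool.
public class SemaphoreBulkhead {
    private final Semaphore permits;

    SemaphoreBulkhead(int maxConcurrent) {
        this.permits = new Semaphore(maxConcurrent);
    }

    <T> T execute(Supplier<T> call) {
        if (!permits.tryAcquire()) {          // no waiting: shed the excess call
            throw new IllegalStateException("Bulkhead full - shed this call");
        }
        try {
            return call.get();
        } finally {
            permits.release();                // always return the permit
        }
    }

    public static void main(String[] args) {
        SemaphoreBulkhead bh = new SemaphoreBulkhead(2);
        System.out.println(bh.execute(() -> "ok"));   // within the limit
    }
}
```

Note what this does not do: if call.get() blocks for 30 s, the calling thread blocks with it β which is exactly why blocking I/O wants the thread-pool variant instead.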
java
ThreadPoolBulkheadConfig cfg = ThreadPoolBulkheadConfig.custom()
.maxThreadPoolSize(10)
.coreThreadPoolSize(5)
.queueCapacity(20)
.build();
7.6 Rate limiting & throttling β
Algorithms (know them all by name):
- Fixed window: count requests per N-second window. Simple; bursty at window boundaries.
- Sliding window log: keep timestamps of recent requests; count those within the window. Smooth; memory-heavy.
- Sliding window counter: approximates the sliding window using the current and previous fixed window. Good compromise.
- Token bucket: bucket fills at rate R, capacity C; each request consumes a token. Allows bursts up to C, sustained at R.
- Leaky bucket: requests enter a queue drained at rate R; overflow drops. Smooth output, no bursts.
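Token bucket is the one most worth being able to write on a whiteboard β a minimal single-node sketch:

```java
// Token bucket: refills at rate R per second up to capacity C; each request
// spends one token. Allows bursts up to C, sustained throughput R.
public class TokenBucket {
    private final double capacity;
    private final double refillPerNano;
    private double tokens;
    private long last;

    TokenBucket(double capacity, double refillPerSecond) {
        this.capacity = capacity;
        this.refillPerNano = refillPerSecond / 1_000_000_000.0;
        this.tokens = capacity;               // start full: permits an initial burst
        this.last = System.nanoTime();
    }

    synchronized boolean tryAcquire() {
        long now = System.nanoTime();
        // Lazy refill: add tokens for the elapsed time, capped at capacity.
        tokens = Math.min(capacity, tokens + (now - last) * refillPerNano);
        last = now;
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;
        }
        return false;                          // out of tokens: reject (or queue)
    }

    public static void main(String[] args) {
        TokenBucket limiter = new TokenBucket(5, 10);  // burst of 5, 10 req/s sustained
        int granted = 0;
        for (int i = 0; i < 8; i++) if (limiter.tryAcquire()) granted++;
        System.out.println("granted " + granted + " of 8");  // typically 5: the burst capacity
    }
}
```

For the distributed version discussed below, the same accounting moves into a shared store (e.g., a Redis script), but the arithmetic is identical.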
Where to enforce:
- API gateway: coarse-grained per-client, per-IP. Protects the whole service surface.
- Sidecar (service mesh): per-service, per-route, per-identity. Consistent policy without app code.
- In-process: fine-grained per-endpoint, per-user. When decisions need app context (tier, feature flag).
Distributed rate limiting: single-node counters work in a single pod; across N pods you need a shared store. Redis with INCR + EXPIRE is the classic; Redis scripts (Lua) for atomicity. Trade-off: single-node is fast but loose; distributed is exact but adds latency.
7.7 Load shedding & graceful degradation β
When the system is overloaded, intentionally reject some requests so the rest succeed. Refusing 10% of requests is infinitely better than all requests timing out.
Strategies:
- Priority-based shedding: tag requests by priority (paid user vs free, internal vs external). Shed the low-priority traffic first.
- Admission control: at the gateway, reject requests when downstream queue depth exceeds threshold. Fail fast with 503 + Retry-After.
- Static fallbacks: if the recommendation service is down, return a static top-10 list. Degraded, not broken.
- Cached last-known-good: keep a local cache of recent responses; when downstream is down, serve from cache with a "stale" header.
In code:
java
@GetMapping("/recommendations")
@CircuitBreaker(name = "recs", fallbackMethod = "defaultRecs")
public List<Product> recs(@RequestHeader("X-User") String user) { ... }
public List<Product> defaultRecs(String user, Throwable t) {
return topProductsCache.get(); // graceful degradation
}
7.8 Cascading failure β the Netflix Hystrix lesson β
Textbook cascading failure:
- Service B's latency rises from 50 ms to 2 s.
- Service A calls B synchronously with 2 s timeout; A's threads get stuck waiting.
- A's thread pool fills up; new requests to A queue.
- A can't serve its own requests β A's callers see timeouts.
- A's callers start retrying β more load on A β A's situation worsens.
- Whole graph goes down.
Prevention stack:
- Short timeouts: A should timeout on B at well below A's own SLA.
- Circuit breaker on B: after 20 failures, A stops calling B and serves a fallback.
- Bulkhead on B: A's calls to B use a dedicated pool of 10 threads; other calls unaffected.
- Retry budget: A caps total retries; retry storms limited.
- Fallback: A serves degraded response when B is circuit-broken.
Runbook for an active cascade:
- Identify the originating dep (usually via error-rate correlations).
- Circuit-break upstream of the failing dep (even manually β flip a feature flag).
- Shed load at the gateway to the affected service.
- Let the queues drain; observe recovery.
- Post-mortem: why didn't the breakers trip automatically? Usually, the breaker threshold was wrong or the fallback wasn't configured.
7.9 Backpressure propagation β
Backpressure = slow downstream pushes back on fast upstream. Shows up in three forms:
- Pull model (Kafka): consumer lag grows β operators scale up consumers or scale down producers. The log absorbs the backlog (bounded only by the topic's retention).
- Reactive streams (Reactor, RxJava): request(n) signals from consumer to producer β the producer only emits what the consumer can handle. Built into Project Reactor's Flux.
- Bounded queues: fixed-capacity queue between stages. When full, upstream blocks (put()), drops (offer()), or times out (offer(timeout)). Explicit backpressure at stage boundaries.
The anti-pattern: unbounded queues. Memory grows until OOM. Every LinkedBlockingQueue created without a capacity is a time bomb. Executors.newFixedThreadPool() is backed by an unbounded queue β always use new ThreadPoolExecutor(...) with a bounded queue and a RejectedExecutionHandler.
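The bounded-pool construction, spelled out (the pool sizes and queue capacity are illustrative):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Bounded alternative to Executors.newFixedThreadPool: the queue has a capacity,
// and CallerRunsPolicy turns overflow into backpressure - the submitting thread
// runs the task itself, which naturally slows the producer down.
public class BoundedPool {
    public static void main(String[] args) throws InterruptedException {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                2, 4,                              // core and max threads
                30, TimeUnit.SECONDS,              // idle keep-alive for non-core threads
                new ArrayBlockingQueue<>(100),     // bounded: no silent memory growth
                new ThreadPoolExecutor.CallerRunsPolicy());  // overflow = backpressure

        for (int i = 0; i < 10; i++) {
            final int id = i;
            pool.execute(() ->
                System.out.println("task " + id + " on " + Thread.currentThread().getName()));
        }
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
    }
}
```

AbortPolicy (the default handler) is the fast-fail alternative: the submitter gets a RejectedExecutionException it must handle, instead of being slowed down.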
7.10 Chaos engineering (the light version) β
You don't need a chaos team to benefit from the discipline. Start small:
- Game days: once a quarter, intentionally kill a service in staging. See what breaks. Fix it.
- Failure injection in tests: wrap external-dep test doubles with random failures (latency, timeouts, errors). Catches brittle error handling at test time.
- Pod-kill schedules: PodDisruptionBudget + a scheduled job that kills random pods. Makes sure the system truly survives pod loss (not just "probably survives").
- Load testing with latency injection: not just "100 req/s"; add "100 req/s + 500 ms artificial latency on payment service." Reveals breaker tuning issues.
Tools: Chaos Mesh, LitmusChaos, Gremlin, Chaos Monkey (original Netflix). At small scale, a shell script + kubectl delete pod is enough.
7.11 Spring + Resilience4j snippets β
Annotation-driven (easy to adopt, but less flexible than programmatic):
java
@Service
public class BillingClient {
@CircuitBreaker(name = "billing", fallbackMethod = "chargeFallback")
@Retry(name = "billing")
@Bulkhead(name = "billing", type = Type.THREADPOOL)
@TimeLimiter(name = "billing")
public CompletableFuture<ChargeResult> charge(ChargeRequest req) {
return CompletableFuture.supplyAsync(() -> httpClient.charge(req));
}
public CompletableFuture<ChargeResult> chargeFallback(ChargeRequest req, Throwable t) {
log.warn("Billing unavailable, using fallback", t);
return CompletableFuture.completedFuture(ChargeResult.queued(req.id()));
}
}
yaml
# application.yml
resilience4j:
circuitbreaker:
instances:
billing:
failure-rate-threshold: 50
sliding-window-size: 20
minimum-number-of-calls: 10
wait-duration-in-open-state: 30s
permitted-number-of-calls-in-half-open-state: 5
retry:
instances:
billing:
max-attempts: 3
wait-duration: 500ms
enable-exponential-backoff: true
exponential-backoff-multiplier: 2
bulkhead:
instances:
billing:
max-concurrent-calls: 20
timelimiter:
instances:
billing:
timeout-duration: 2s
Order of annotations matters. Resilience4j applies the aspects from outside in: Retry β CircuitBreaker β RateLimiter β TimeLimiter β Bulkhead. The default is usually right, but you can override it via the per-module aspect-order properties or with programmatic composition.
7.12 Part 7 Q&A β
Q1. Walk me through the stack of resilience patterns you'd use for a call to a third-party payment gateway.
Layers, outside-in:
- Timeout (TimeLimiter): 2 s cap on total call time.
- Retry with exponential backoff + jitter: 3 attempts, 500ms/1s/2s + ±50% jitter.
- Circuit breaker: 50% failure rate over last 20 calls β open for 30 s.
- Bulkhead: thread pool of 20, queue of 50. Keeps payment calls from starving other work.
- Fallback: on breaker open, queue the charge to a retry topic for async processing; return QUEUED to the client.
- Idempotency key on every call to the gateway β retries are safe.
- Rate limit at the gateway layer to avoid hitting the provider's rate limit.
All composed via Resilience4j. The fallback + async retry pattern is especially important for payments β users would rather see "processing" than "failed" if the provider is having a blip.
Q2. Why is the default Executors.newFixedThreadPool dangerous in production?
It uses an unbounded LinkedBlockingQueue under the hood. If tasks arrive faster than they're processed, the queue grows without bound until OOM. Always construct explicitly: new ThreadPoolExecutor(coreSize, maxSize, keepAlive, unit, new ArrayBlockingQueue<>(capacity), new ThreadPoolExecutor.CallerRunsPolicy()). A bounded queue + explicit rejection policy makes the failure visible (backpressure or fast-fail) instead of silent (memory leak β OOM).
Q3. Circuit breaker is open but the downstream is actually healthy now. What's wrong?
Likely one of:
- Cool-down too long: waitDurationInOpenState too high β it stays open long after recovery.
- Half-open probe misconfigured: permittedNumberOfCallsInHalfOpenState too low, or slowCallRateThreshold trips even on successful-but-slow calls.
- Fallback always succeeding: the breaker counts successes, but if your fallback is what's "succeeding" while calls aren't actually reaching the downstream, the breaker can't close.
- Wrong granularity: breaker is on a dependency that has multiple endpoints β one slow endpoint tripped the whole breaker. Split into per-endpoint breakers.
Debug: check Resilience4j metrics (resilience4j_circuitbreaker_state, failure_rate), look at recent calls in the sliding window, and verify that probes actually hit the real downstream.
Q4. Explain the bulkhead pattern and when to use thread-pool vs semaphore.
Bulkhead isolates resources so one failing dep can't exhaust the whole service's capacity. Two flavors:
- Thread-pool bulkhead: each dep has its own thread pool. Slow/stuck calls consume only that pool's threads. Use for blocking I/O β JDBC, sync HTTP clients, JMS calls.
- Semaphore bulkhead: each dep has a semaphore capping concurrent calls. Lighter; doesn't free the calling thread during a block. Use for async I/O (Netty, Reactor) where the calling thread doesn't block on the call itself.
The tradeoff is cost vs isolation. Thread pools give true isolation but consume more memory and add context-switching. Semaphores are lightweight but don't protect against long blocks. Most real services have a mix.
Q5. A downstream service has degraded from 50 ms to 500 ms. Your service's latency p99 has quadrupled. Why, and what's the fix?
Without bulkhead/timeout tuning, your service's request threads are sitting waiting on the slow downstream. Those threads aren't available to serve other requests, so new requests queue, the queue depth grows, and p99 latency reflects the queue wait.
Fixes:
- Shorter timeout on the downstream call so stuck threads free up faster.
- Bulkhead around the slow dep so its slowness doesn't consume all your threads.
- Fallback so you can short-circuit a slow downstream and serve a degraded response.
- Circuit breaker trips if slowness persists.
- Longer term: audit what's making the downstream slow (tracing + their metrics) and work with that team.
Q6. How do you prevent a retry storm?
Four things in combination:
- Exponential backoff between retries β don't hammer immediately.
- Jitter β randomize retry timing so synchronized clients don't pulse together.
- Retry budget β cap the fraction of total traffic that's retries (Envoy / service mesh can enforce this at the fleet level).
- Circuit breaker β stop retrying entirely when the downstream is clearly down; retries will fail anyway and add load.
Miss any one and a retry storm is possible. A common cautionary tale: on a legacy downstream, a thundering-herd retry after a 5-minute outage prolonged the outage to 45 min. Adding jitter + budget lets the next incident recover in minutes.
Q7. Token bucket vs leaky bucket β practical difference.
Token bucket: bucket fills at rate R, capacity C. Each request consumes a token; no token = reject/queue. Allows bursts up to C (the accumulated tokens), sustained at R. Good fit for bursty human traffic and API quotas.
Leaky bucket: requests enter a queue drained at rate R. Fixed smooth output; overflow drops. No bursts β even temporary overproduction gets smoothed or rejected. Good fit when downstream has strict throughput limits.
For most "rate limit this API" problems, token bucket is the right answer. Leaky bucket shines when the downstream really, truly cannot handle bursts.
Q8. What's load shedding and when do you use it?
Intentional rejection of a fraction of traffic when the system is overloaded, so the remainder succeeds. Better to 503 10% than to time out 100%. Implementations:
- At the gateway: 503 + Retry-After when a queue depth metric crosses a threshold.
- Priority-based: reject low-priority requests first (free users before paying ones, internal synthetic traffic before real).
- Feature-level degradation: turn off non-critical features (recommendations, personalization) under load; keep core flows alive.
Netflix famously uses "prioritized load shedding" β they can drop 30% of traffic and hide it from users because the dropped traffic was optional paths.
Q9. Walk through a real cascade you've seen or would design against.
(Anchor: a legacy downstream-dependency cascade, or hypothetical.) Classic cascade: payments API slowed from 100 ms to 2 s. The order-service called it synchronously. Thread pool filled up. Incoming orders started timing out. Clients retried, adding load. Upstream API gateway's circuit breaker wasn't configured, so the bad traffic kept flowing. Eventually the whole order path was down.
Design against it: (1) synchronous call replaced with "queue the charge" pattern β order acknowledges, async worker does the charge, user sees the outcome via polling or push. (2) Circuit breaker at the call site. (3) Bulkhead for payment threads. (4) Load-shed at the gateway on queue depth. (5) Retry budget of 10%. (6) Alarm on breaker-open events to drive investigation before cascade.
Q10. How do you test resilience patterns?
- Unit tests with fault-injection test doubles (simulate timeout, 500, slow response). Assert fallback invoked, breaker transitions recorded in Resilience4j metrics.
- Integration tests with Testcontainers β kill the real downstream (MongoDB, Kafka broker) mid-test, assert the service survives.
- Chaos experiments in staging β kill pods, drop network, inject latency (Chaos Mesh / Toxiproxy). Game-day style, scheduled.
- Load tests with failure injection β simulate 50% error rate on one dep; verify the service's other deps still work.
- Metrics-based monitoring in prod β track breaker state changes, retry rates, fallback rates over time.
The goal is not "100% test coverage of failure paths" but "we know what happens when X breaks, because we've seen it break before."
Part 8 β Observability in Distributed Systems β
This is the thin slice β deep coverage belongs in the future OBSERVABILITY.md guide. Here, we cover only what's inseparable from distributed-systems reasoning: the three pillars, trace-context propagation across brokers, and consumer-lag SLIs.
8.1 Three pillars: logs, metrics, traces β
Each pillar answers a different question:
- Logs: what happened at this moment? Discrete events. Searchable, filterable. Cost scales with volume.
- Metrics: how is the system doing over time? Aggregated numeric time series. Cheap at scale; cardinality is the limit.
- Traces: how did this one request flow through the system? Per-request, per-span. End-to-end latency attribution.
You need all three. Logs tell you what broke, metrics tell you at what rate, traces tell you where in the call chain.
8.2 Correlation ID & trace context propagation β
W3C Trace Context (the de facto standard, and OpenTelemetry's default propagation format):
- traceparent: 00-<trace-id>-<span-id>-<flags> β current trace + parent span + flags.
- tracestate β vendor-specific.
- baggage β propagated key-value context (user ID, tenant, feature flag).
Across brokers:
- Kafka: trace context lives in record headers. Spring Kafka + Micrometer Tracing injects/extracts automatically.
- JMS / IBM MQ: trace context in JMS properties or message headers. Spring's MessagePostProcessor injects them at send; the listener extracts on receive.
- HTTP: standard headers; handled by every mainstream client (RestTemplate, WebClient, OkHttp, gRPC).
Logβtrace cross-reference:
java
// Micrometer Tracing puts trace_id and span_id into MDC
log.info("Processing order {}", orderId);
// β {"msg":"Processing order 123","trace_id":"...","span_id":"...","order_id":"123"}
Now a dashboard can: click trace β filter logs by trace_id β see every log line for that request across every service.
8.3 Consumer-lag SLIs and error budgets β
For consumer-based services, the most critical SLI is often consumer lag, not request latency.
Defining the SLI:
- "99% of messages are consumed within 5 seconds of production."
- Measured as consume_time - produce_time, not raw lag.
- Raw lag is a signal, not an SLI β a burst of incoming traffic legitimately increases lag.
Related SLIs for consumer services:
- Processing latency p99.
- DLT rate (messages sent to dead-letter per hour).
- Handler error rate.
- Saturation of consumer threads or downstream dep.
RED for request services: Rate, Errors, Duration. USE for resources: Utilization, Saturation, Errors.
For a consumer service, mix: Rate consumed, Error (DLT + handler), Duration (end-to-end age at consume) plus saturation of the downstream dep.
8.4 Part 8 Q&A β
Q1. What's the difference between a log, a metric, and a trace?
Log: a discrete event β something happened at this time. Captures context. Searchable. High per-line cost.
Metric: an aggregated number over time β how much is happening, how fast, how many errors. Cheap at scale. Limited by cardinality (tags/labels).
Trace: a single request's path through multiple services, with per-span timings. Shows causality and latency attribution. Essential for debugging "slow somewhere in the call chain" bugs.
You need all three β logs for what, metrics for how much, traces for where.
Q2. How do you propagate trace context across a Kafka boundary?
W3C traceparent in Kafka record headers. Spring Kafka + Micrometer Tracing (or the OpenTelemetry Kafka instrumentation) injects the header on send() and extracts it on consume, automatically attaching the consume span as a child of the produce span. In application code, you do nothing beyond enabling the starter. For diagnostics: dump headers of a suspicious record with kafka-console-consumer --property print.headers=true.
Q3. You have 20 microservices and an incident. Where do you start?
- Check the high-level SLIs (error rate, latency) on a dashboard β find which services are burning error budget.
- Pick a failing trace β click through, find the span that's red or slow.
- Pivot to the service's logs using the trace_id. Read context around the failing span.
- Check that service's metrics β is it the code, downstream, resource saturation?
- If it's downstream: repeat for the downstream service.
Without trace-context propagation, step 2 is impossible β you're grepping 20 services' logs for a user ID hoping to guess the chain. With it, you follow the trace.
Q4. What SLIs would you set for a Kafka consumer service?
- Consume latency: now - message.timestamp at consume. Target p99 < 5 s (or whatever the business SLA is). This is the user-facing latency for async consumers.
- DLT rate: messages/hour sent to dead-letter. Alarm on any non-zero sustained rate.
- Handler p99: time spent in the business logic per message.
- Consumer lag: informational; not an SLI because a burst is normal.
- Rebalance rate: alarm on frequent rebalances (indicates pod churn or config issue).
A typical legacy-MQ consumer setup alarms on "consume latency p99 > 10s for 5 min" plus "DLT count > 0." Those two signals together catch every meaningful incident without noise.
Q5. How do you avoid alert fatigue in a 20-service system?
Alert on symptoms, not causes. Symptoms = user-visible problems (error rate, latency, queue age). Causes (high CPU, high memory, pod restarts) are diagnostics, not alerts. If CPU is 95% but error rate and latency are fine, don't page β you don't have a user-visible problem.
Bind alerts to SLOs with error-budget policies. If the service is within its error budget, don't page for noise. If the service is burning budget fast, page. This framing cuts pages by 70%+ in typical implementations.
Part 9 β System Design Drills β
Eight prompts, each a full walkthrough in the format interviewers expect:
Clarify β Requirements (functional + non-functional) β API β Data model β High-level architecture β Deep-dive on 1β2 components β Failure modes β Scale & rollback
Practice tip: write the drill in 30 minutes on a whiteboard without looking. Then come back and compare.
9.1 Drill 1 β Ingest 10k+ transactions/day from IBM MQ to downstream microservices β
(The legacy MQ microservice pattern, in full. This should be the best-rehearsed drill in your kit.)
Clarify:
- What are "transactions"? (Case events with attachments, avg 2β50 KB.)
- How many downstream services consume? (4β6 depending on case type.)
- SLA β max time from MQ put to all downstream deliveries? (5 min p99.)
- Ordering requirement? (Per-case-ID ordering.)
- Duplication tolerance? (Zero β downstream effects are not safely idempotent β dedup at ingest.)
- Security? (Regulated β encrypted in transit, audit logged, PII redaction in logs.)
Non-functional requirements:
- Availability: 99.9% (43 min/month downtime budget).
- Zero message loss from MQ to downstream.
- Replay capability β rebuild downstream state from message history if a service corrupts.
Data model (message envelope):
{
"messageId": "uuid-v4", // dedup key
"caseId": "CASE-12345", // partition key
"eventType": "CASE_CREATED" | "EVIDENCE_UPDATED" | ...,
"producedAt": "2026-04-17T10:15:00Z",
"payload": { ... }, // type-specific body
"correlationId": "..." // trace context
}
High-level architecture:
 +----------+   on-prem   +------------+     AWS     +---------------+
 |  IBM MQ  |------------>|   Bridge   |------------>|     Kafka     |
 |  (QM1)   |   SVRCONN   |  (Spring)  |   produce   |  (MSK, 3 AZ)  |
 +----------+             | 1..N repl  |             +---------------+
                          +------------+                     |
                                |                            |  consume
                                v                            v
                          +----------+          +------------------------+
                          |  Dedup   |          |   Consumer Services    |
                          | (Dynamo  |          | (4-6, consumer groups) |
                          |   TTL)   |          +------------------------+
                          +----------+                       |
                                                             v
                                                     +---------------+
                                                     |      DLT      |
                                                     |  + Replay UI  |
                                                     +---------------+
Deep dives:
Bridge. Spring Boot service on EKS. Two DefaultMessageListenerContainer beans per QM β one for primary, one for secondary clusters. Session-transacted JMS; consumes from on-prem MQ, produces to Kafka. On a successful Kafka ack, JMS session commits. On exception, JMS rollback + MQ redelivers. Dedup: before producing to Kafka, check DynamoDB for messageId; if absent, PUT with TTL 72h and produce. The dedup insert is idempotent (PutItem with conditional expression attribute_not_exists(messageId)).
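The dedup step reduces to a conditional insert. Sketched here with an in-memory map standing in for the DynamoDB conditional PutItem with attribute_not_exists(messageId) β class and method names are illustrative:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// In-memory stand-in for the DynamoDB dedup table: putIfAbsent plays the role
// of a conditional PutItem, returning "first delivery" exactly once per id.
// (The real table also carries a TTL attribute so entries expire after 72h.)
public class DedupGate {
    private final ConcurrentMap<String, Boolean> seen = new ConcurrentHashMap<>();

    /** Returns true exactly once per messageId: only the first caller should produce to Kafka. */
    boolean firstDelivery(String messageId) {
        return seen.putIfAbsent(messageId, Boolean.TRUE) == null;
    }

    public static void main(String[] args) {
        DedupGate gate = new DedupGate();
        System.out.println(gate.firstDelivery("msg-1"));  // true: produce to Kafka
        System.out.println(gate.firstDelivery("msg-1"));  // false: MQ redelivery, skip
    }
}
```

The point of the conditional write: check-then-put as two calls would race under concurrent bridge pods; a single atomic conditional insert cannot.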
Consumer services. Each subscribes to the relevant Kafka topic as a distinct consumer group. Keyed by caseId β per-case ordering preserved. @RetryableTopic with 4 attempts (1s, 10s, 60s, 600s) then DLT. Inbox pattern in each consumer: event_id in a processed-events table, same tx as the business write.
Failure modes & mitigations:
- On-prem MQ down: bridge consumers enter reconnect loop with backoff; Kafka is unaffected for existing messages; producer lag alarms after 5 min.
- AWS Kafka down: bridge can't produce, JMS rolls back, MQ holds messages (queues grow β alarm on queue depth). When Kafka recovers, drain resumes naturally.
- Bridge pod crash: another pod takes over; at-least-once semantics mean the in-flight message is redelivered on the next bridge; dedup prevents duplicate publishing.
- Consumer service crashes: Kafka consumer group rebalance; partition reassigned; new owner picks up from last committed offset; inbox dedup handles redelivery.
- Poison message: 4 retries, then DLT; alarm; ops triages via a small "DLT replay" UI that lets them fix data or code and re-publish to the main topic.
Scaling to 10x (100k+/day):
- Horizontal: add Kafka partitions (requires caseId re-bucketing planning), scale bridge pods, scale consumer pods.
- Vertical: batch MQ gets, batch Kafka produces, tune linger.ms.
- Dedup store: DynamoDB handles this natively β scale RCU/WCU.
Rollback plan:
- Bridge deploy: canary 1 pod, watch lag metrics, progressive rollout.
- Schema change: Schema Registry BACKWARD compat, expand-contract across deploys.
- Full rollback: revert via ArgoCD to prior git SHA; bridge is stateless, no data surgery.
9.2 Drill 2 β IBM MQ (on-prem) β AWS ActiveMQ (cloud) bridge β
Closely related to Drill 1 but broker-to-broker; worth practicing separately.
Clarify:
- Direction β uni or bi? (Assume bi.)
- Volume? (10k/day.)
- Transactional guarantee required? (At-least-once both directions, dedup.)
- Network β dedicated or internet? (Direct Connect preferred.)
- Failover β RTO/RPO? (RTO < 5 min, RPO = 0.)
Architecture:
 +-------------+      Direct       +--------------+
 |   IBM MQ    |<---- Connect ---->|    AWS MQ    |
 |  (on-prem)  |                   |  (Artemis)   |
 +-------------+                   +--------------+
        ^                                 ^
        |  JMS SVRCONN                    |  JMS failover transport
        |                                 |
        +----------------+----------------+
                         |
                   +-----------+
                   |  Bridge   |   Active/passive pair (leader election via Redis RLock)
                   |  (Spring  |   Session-transacted JMS both sides
                   |   Boot)   |   Dedup via DynamoDB, TTL 72h
                   +-----------+
Key design decisions:
- Active/passive bridge with leader election β only one instance consuming from each queue at a time, to preserve per-queue ordering. Redis SETNX + TTL renewal; the loser tries to acquire every 30 s.
- Failover transport on the AWS MQ client: failover:(ssl://mq1:61617,ssl://mq2:61617)?randomize=false.
- Heartbeat queue β bridge produces a heartbeat every 10 s to both brokers as the end-to-end liveness check.
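The leader-election lease can be sketched against an in-memory stand-in for Redis SET key value NX PX (names illustrative; the real bridge uses Redis so the lease survives pod restarts and is visible to both instances):

```java
// Lease-based leader election: whoever holds an unexpired lease is the active
// bridge; the passive instance retries on an interval (e.g., every 30 s).
public class LeaderLease {
    private String owner;        // current lease holder (null = vacant)
    private long expiresAt;      // epoch millis when the lease lapses

    /** Try to take or renew the lease; true = this instance is the active bridge. */
    synchronized boolean tryAcquire(String instanceId, long ttlMillis) {
        long now = System.currentTimeMillis();
        boolean vacantOrExpired = owner == null || expiresAt <= now;
        if (vacantOrExpired || owner.equals(instanceId)) {
            owner = instanceId;              // SETNX-style acquire, or renewal by the holder
            expiresAt = now + ttlMillis;
            return true;
        }
        return false;                        // another instance holds a live lease
    }

    public static void main(String[] args) {
        LeaderLease election = new LeaderLease();
        System.out.println(election.tryAcquire("bridge-a", 30_000));  // true: a leads
        System.out.println(election.tryAcquire("bridge-b", 30_000));  // false: a's lease is live
    }
}
```

The TTL is the crash-safety valve: if the active bridge dies without releasing, the lease simply expires and the passive instance takes over on its next attempt.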
Cutover strategy (the real-world painful part):
- Phase 1: bridge running, not consuming. Validate connectivity + heartbeat queue.
- Phase 2: bridge active, one direction only (say, on-prem β cloud). Cloud consumers dual-read from on-prem queues (via VPN proxy) and from cloud queues; compare.
- Phase 3: bridge bi-directional. Cloud producers start writing to cloud queues.
- Phase 4: on-prem consumers read via bridge (cloud β on-prem direction).
- Phase 5: deprecate on-prem write path; on-prem becomes read-only.
- Phase 6: decommission on-prem after the required retention period.
Failure handling:
- Bridge partition (on-prem side): on-prem writes queue up locally; when healed, bridge catches up. Zero loss.
- Bridge crash during a transaction: JMS rolls back on both sides (session-transacted). Message redelivered next time.
- DNS/network failover: Direct Connect multi-link + BGP convergence. Bridge sees short disconnect, reconnects, resumes.
9.3 Drill 3 β Centralized logging pipeline for 20+ microservices β
Clarify:
- Log volume? (50 GB/day production, 10 GB/day staging.)
- Retention? (30d hot, 1y cold.)
- Who consumes? (SREs for incidents, security for audit, product for analytics.)
- Search latency target? (p99 < 5 s on 7 days of data.)
- Cost constraint? (Yes β ES is expensive at scale.)
Architecture (OpenTelemetry + Elasticsearch):
 +----------+  stdout/JSON  +-------------+   OTLP   +------------+
 | Service  |-------------->| OTel Agent  |--------->|    OTel    |
 | (Spring) |               | (sidecar/   |          | Collector  |
 +----------+               |  node)      |          | (daemonset)|
                            +-------------+          +------------+
                                                           |
                                                           |  batching, sampling, transform
                                                           v
                                                   +----------------+
                                                   | Elasticsearch  |
                                                   | (3-node,       |
                                                   |  hot/warm tier)|
                                                   +----------------+
                                                           |
                                                           v
                                                   +----------------+
                                                   |    Kibana /    |
                                                   |    Grafana     |
                                                   +----------------+
                                                           |
                                                           v
                                                cold (S3, >30d) via ILM
Deep dive β sampling/routing at the Collector:
- 100% of error logs.
- 10% of info logs (head sampling).
- Drop debug logs in prod.
- Route security-relevant logs (auth, authz, sensitive-data access) to a separate, append-only index with stricter retention.
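The sampling rules above, expressed as a pure decision function β a sketch for clarity; in practice this policy lives in OTel Collector processor config, not application code:

```java
// Head-sampling policy: keep all errors/warnings, sample 10% of info,
// drop debug in prod, and route security-relevant logs to a stricter pipeline.
public class LogSampler {
    enum Decision { KEEP, KEEP_SECURITY_INDEX, DROP }

    /** roll is a uniform random number in [0, 1), drawn once per log record. */
    static Decision decide(String level, boolean securityRelevant, double roll) {
        if (securityRelevant) return Decision.KEEP_SECURITY_INDEX;   // append-only index
        switch (level) {
            case "ERROR":
            case "WARN":
                return Decision.KEEP;                                // 100% of errors/warnings
            case "INFO":
                return roll < 0.10 ? Decision.KEEP : Decision.DROP;  // 10% head sample
            default:
                return Decision.DROP;                                // debug dropped in prod
        }
    }

    public static void main(String[] args) {
        System.out.println(decide("ERROR", false, 0.5));   // KEEP
        System.out.println(decide("DEBUG", false, 0.0));   // DROP
        System.out.println(decide("INFO", true, 0.99));    // KEEP_SECURITY_INDEX
    }
}
```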
Failure modes:
- Collector down: OTel Agent buffers locally with disk backing (bounded queue; drop-oldest policy).
- ES down: Collector retries with exponential backoff; buffers to disk; after threshold, drops with a counter metric so SRE sees backlog.
- Mapping explosion: ES mappings capped at 1000 fields per index; offending fields get quarantined. Usually caused by dynamic JSON fields (customer-provided data) β strictly validate and strip at the Collector.
Cost control:
- Hot/warm/cold ES tiers via ILM (Index Lifecycle Management): hot (7d, NVMe), warm (30d, HDD), cold (1y, S3).
- Downsampling: after 7 days, keep aggregates (error counts/hour), drop raw logs.
- Separate indexes per tenant/team β easier to budget and drop.
9.4 Drill 4 β GitOps deploys across dev/staging/prod with ArgoCD β
(Anchor: your 99% deploy success rate story.)
Clarify:
- How many services? (10+.)
- How many clusters? (3 β dev, staging, prod; prod multi-region.)
- Who deploys? (Dev teams for dev/staging; platform gates prod with approval.)
- Config drift tolerance? (None β Git is source of truth.)
Architecture:
```
+------------+   watches    +------------+
|  Git repo  |<-------------|   ArgoCD   |
| (k8s       |              |  (in mgmt  |
|  manifests |              |   cluster) |
|  + Helm)   |              +-----+------+
+------------+                    |
      ^                           | sync + reconcile
      | PRs                       v
+------------+      +-------------------------------+
| Developers |      |       Workload clusters       |
+------------+      |  [ dev ] [ staging ] [ prod ] |
                    +-------------------------------+
```
Repo structure:
```
infra-gitops/
  apps/
    order-svc/
      base/            # common kustomize or Helm values
      overlays/
        dev/
        staging/
        prod-us/
        prod-eu/
    ...
  applicationsets/
    services.yaml      # ApplicationSet: one per service Γ environment
```
ApplicationSet generator:
- A `cluster` generator automatically creates an Application for each service Γ cluster combination.
- Progressive rollouts via ArgoCD Image Updater or a Git-PR-based promotion pipeline (dev β staging β prod).
Secret management:
- Sealed Secrets or External Secrets Operator (pulls from AWS Secrets Manager / Vault).
- Never commit plaintext secrets β even in a private repo.
Deploy flow:
- Developer opens PR with a manifest change.
- CI runs lint, test, Helm template, policy checks (OPA/Kyverno).
- PR merged β ArgoCD detects git diff β syncs dev cluster.
- Health checks pass β PR auto-promoted to staging overlay (via bot).
- Human approval β PR to prod overlay β sync prod.
- Canary rollout (Argo Rollouts): 10% β 50% β 100% with auto-rollback on SLO burn.
99% success rate β what it actually means:
- Count "deploy events" and "deploy events with rollback or failure." Burn budget when the ratio dips below 99%.
- Key enablers: comprehensive pre-merge CI, SLO-based auto-rollback (Argo Rollouts + Prometheus), canary by default, blameless postmortems on each incident, per-service health-check tuning (liveness, readiness, startup probes correctly configured β huge driver).
Failure modes:
- ArgoCD down: existing workloads unaffected; new syncs queued. Running ArgoCD HA with 3 replicas + Redis HA mitigates.
- Git outage: same β existing apps unaffected. Cache PRs locally if truly blocking.
- Bad manifest: canary catches it; auto-rollback; Slack alert.
9.5 Drill 5 β Schema registry and evolution workflow for a multi-team Kafka environment β
(Anchor: cross-team schema standardization across 6 teams.)
Clarify:
- How many teams producing/consuming on Kafka? (6.)
- Current state? (Each team defining their own schemas ad hoc; breaking changes in prod every few sprints.)
- Target state? (A schema lives in a central registry; compatibility is enforced at the registry; teams publish schemas via PR; breaking changes require coordinated migration.)
- Migration constraint? (Can't rewrite all producer code at once.)
Architecture:
```
+----------------+   register    +--------------------+
|  Schemas repo  |------PR------>|  Schema Registry   |
|  (Avro IDL or  |               |  (Confluent)       |
|   .avsc, one   |               |  Compatibility:    |
|   dir per      |               |    BACKWARD        |
|   topic)       |               |  Subject naming:   |
+----------------+               |    TopicName       |
        ^                        +---------+----------+
        | CI/CD                            ^
        |                                  | query schema by ID
+-------+--------+                         |
|   Producers    |-------------------------+
|   Consumers    |
+-------+--------+
        |  publish/consume
        v
+-------------+
|    Kafka    |  (records carry schema ID)
+-------------+
```
Governance process:
- Schema changes are proposed as a PR in the schemas repo.
- CI runs a compatibility check against the currently registered schema (e.g. via the Schema Registry maven plugin or its REST compatibility endpoint).
- PR review requires sign-off from the producing team and any subscribing team (via a `CODEOWNERS` pattern).
- On merge, CI registers the new schema via the REST API; registration is rejected if incompatible.
Rollout rules:
- BACKWARD compat by default β consumers deploy first, producers second.
- Breaking change? Requires explicit schema evolution plan: new topic, dual-write period, consumer migration, deprecation timeline.
Training / enablement:
- One-page cheat sheet: what's safe (add field with default), what's not (rename, change type).
- Brown-bag session per team walking through a real PR.
- Office hours for schema questions.
Metrics:
- Breaking-change incidents per quarter (goal: 0 post-standardization).
- Time from schema PR open to merge (keep it low so it doesn't become a drag).
Result (tie to your actual numbers):
- Pre-standardization: ~1 breaking incident/month, avg 2-day rollback.
- Post-standardization: zero breaking incidents in the following 6 months. Dev velocity slightly improved because teams stopped being on-call for downstream breaks.
9.6 Drill 6 β Design a rate limiter β
Clarify:
- What are we limiting? (API calls per user, or per API key, or per IP.)
- Scale? (100k req/s across 50 pods.)
- Fairness? (Per-user, not per-pod β can't have pod A get all of user X's quota.)
- Granularity? (Per-second, per-minute, per-day quotas.)
- Where? (At the gateway β doesn't require app changes.)
Algorithm choice:
- Token bucket β allows bursts up to bucket capacity, sustained at refill rate. Best fit for API quotas with human-driven bursts.
Architecture (distributed token bucket in Redis):
```
+------------------------------+
|         API Gateway          |
|  (Envoy / Spring Cloud       |
|   Gateway)                   |
+--------------+---------------+
               | per-request check
               v
       +---------------+
       |     Redis     |   atomic Lua script:
       |   (cluster)   |   - compute tokens now = stored + refill
       |               |   - if enough: decrement, return allow
       |               |   - else: return deny
       +---------------+
```
Lua script (atomic per-user rate limit):
```lua
-- KEYS[1] = user-bucket key
-- ARGV[1] = capacity
-- ARGV[2] = refill_rate_per_sec
-- ARGV[3] = now_ms
-- ARGV[4] = requested tokens (usually 1)
local key = KEYS[1]
local capacity = tonumber(ARGV[1])
local rate = tonumber(ARGV[2])
local now_ms = tonumber(ARGV[3])
local requested = tonumber(ARGV[4])

local bucket = redis.call('HMGET', key, 'tokens', 'last_refill_ms')
local tokens = tonumber(bucket[1]) or capacity
local last_refill = tonumber(bucket[2]) or now_ms

-- lazy refill: add tokens for the elapsed time, capped at capacity
local delta = (now_ms - last_refill) / 1000.0 * rate
tokens = math.min(capacity, tokens + delta)

if tokens < requested then
  redis.call('HMSET', key, 'tokens', tokens, 'last_refill_ms', now_ms)
  return 0 -- deny
end

tokens = tokens - requested
redis.call('HMSET', key, 'tokens', tokens, 'last_refill_ms', now_ms)
redis.call('EXPIRE', key, 3600)
return 1 -- allow
```
Response to the client on deny: HTTP 429 Too Many Requests with headers `X-RateLimit-Remaining: 0`, `Retry-After: 2`, `X-RateLimit-Reset: <epoch>`.
Failure modes:
- Redis down: gateway must decide β fail-open (allow all, degrade to "soft" quota) or fail-closed (deny all, protect downstream). For most APIs, fail-open with an alarm is right β don't 429 everyone because Redis is blinking.
- Redis slow: circuit-breaker around the rate-limit check; fall back to a local-pod approximation (stricter per-pod but not per-user).
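The "local-pod approximation" fallback can be sketched as an in-JVM token bucket that mirrors the Lua math above (a sketch with invented names, not a production limiter β in practice the clock would come from `System.currentTimeMillis()` and the bucket map would be bounded):

```java
public class LocalTokenBucket {
    private final double capacity;
    private final double refillPerSec;
    private double tokens;
    private long lastRefillMs;

    public LocalTokenBucket(double capacity, double refillPerSec, long nowMs) {
        this.capacity = capacity;
        this.refillPerSec = refillPerSec;
        this.tokens = capacity;       // start full, like the Lua script
        this.lastRefillMs = nowMs;
    }

    // Same math as the Lua script: refill lazily, then try to spend.
    public synchronized boolean tryAcquire(int requested, long nowMs) {
        double delta = (nowMs - lastRefillMs) / 1000.0 * refillPerSec;
        tokens = Math.min(capacity, tokens + delta);
        lastRefillMs = nowMs;
        if (tokens < requested) return false;  // deny
        tokens -= requested;
        return true;                           // allow
    }
}
```

Because this bucket is per-pod, the effective limit becomes roughly pods Γ per-pod quota β stricter per pod but not per user, which is exactly the "approximate" tradeoff the failure-mode bullet describes.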
Scaling to 100k req/s:
- Redis cluster shards by user-id hash. Each pod keeps a local connection pool.
- For extreme scale, per-region rate limiting: user's requests go to their nearest region; each region enforces its own quota (approximate but acceptable).
9.7 Drill 7 β Notification fan-out (email + SMS + push) β
Clarify:
- How many notifications/day? (1M.)
- Latency target? (Transactional: <30s. Marketing: next-day OK.)
- User preferences (channel opt-in, quiet hours)? (Yes, enforce.)
- Delivery receipts? (Yes β track delivered/opened/clicked.)
Architecture:
```
+--------------+
|   Upstream   |  (order service, auth service, etc.)
|   services   |
+------+-------+
       | publish "NotificationRequested"
       v
+---------------+
|     Kafka     |  topic: notifications.requested
| (partitioned  |
|  by userId)   |
+------+--------+
       |
       v
+----------------------------------+
|    Notification Orchestrator     |
|  - look up user prefs            |
|  - resolve template              |
|  - check quiet hours             |
|  - fan out per channel           |
+------+----------+----------+-----+
       |          |          |
       v          v          v
  +--------+  +--------+  +--------+
  | Email  |  |  SMS   |  |  Push  |
  | Worker |  | Worker |  | Worker |
  +---+----+  +---+----+  +---+----+
      |           |           |
      v           v           v
     SES       Twilio     FCM/APNs
      |           |           |
      v           v           v
  (delivery receipts -> Kafka topic -> analytics)
```
Design details:
- Orchestrator writes a per-channel request to three separate topics (`notifications.email`, `notifications.sms`, `notifications.push`). Workers are independent β an SMS outage doesn't stop email.
- Idempotency key on every outbound call (`notificationId:channel`).
- Retries at each worker with exponential backoff + jitter; DLT after 5 attempts.
- Quiet-hours enforcement: if outside the user's window and non-urgent, schedule delivery via a delay topic.
- Delivery receipts: provider callbacks (SES bounce/complaint, Twilio status, FCM delivered) land on a `notifications.receipts` topic for audit and retry decisions.
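The worker-side retry and idempotency bullets can be sketched in plain Java (illustrative names throughout; a real worker would wire this into its Kafka listener and publish to the DLT on exhaustion):

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ThreadLocalRandom;

public class RetryingSender {
    // Stable across retries, so the provider can dedup.
    public static String idempotencyKey(String notificationId, String channel) {
        return notificationId + ":" + channel;
    }

    // Full-jitter exponential backoff: uniform in [0, min(cap, base * 2^attempt)).
    public static long backoffMs(long baseMs, int attempt, long capMs) {
        long ceiling = Math.min(capMs, baseMs * (1L << attempt));
        return ThreadLocalRandom.current().nextLong(ceiling);
    }

    // Retry up to maxAttempts; when exhausted, the caller routes the
    // message to the DLT instead of losing it.
    public static <T> T callWithRetry(Callable<T> call, int maxAttempts) {
        RuntimeException last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                return call.call();
            } catch (Exception e) {
                last = new RuntimeException("attempt " + attempt + " failed", e);
                try {
                    Thread.sleep(backoffMs(5, attempt, 50));  // short delays for the sketch
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw last;
                }
            }
        }
        throw last;
    }
}
```

Full jitter (random in the whole window, not a fixed delay) is what prevents a provider blip from turning into synchronized retry waves across workers.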
Failure modes:
- Provider outage (Twilio down): SMS worker DLT fills; page ops; switch to backup provider via config flag. Long-held requests get delayed delivery.
- Invalid email: provider bounces; update user prefs to suppress; remove from future sends.
- Rate limiting by provider: worker-level token bucket; slow down without losing messages (just higher lag).
9.8 Drill 8 β Audit log for a regulated environment β
Clarify:
- What's logged? (Every access to regulated data, every change to it, every login, every permissions change.)
- Retention? (Typical regulated minimum: 1 year, often longer.)
- Tamper-evident? (Yes β legal evidence standard.)
- Query pattern? (By user, by subject, by time range, by event type.)
- Volume? (10M events/day.)
Architecture:
```
+--------------+  emit audit event    +-----------------+
|   Services   |--------------------->|      Kafka      |
| (regulated)  |  signed, structured  |    audit.v1     |
+--------------+                      |  compacted=no   |
                                      |  retention=1y   |
                                      |  tiered storage |
                                      +--------+--------+
                                               |
            +----------------------------------+----------------------------------+
            v                                  v                                  v
    +----------------+              +------------------+              +------------------+
    |   Write to     |              |   Index into     |              |   Stream to      |
    |   immutable    |              |  Elasticsearch   |              |   SIEM           |
    |   S3 (WORM)    |              |  (query layer)   |              |  (real-time      |
    |  Object Lock   |              |                  |              |   alerting)      |
    +----------------+              +------------------+              +------------------+
```
Tamper-evidence:
- Each event signed with an HMAC using a key from HSM/KMS. Event body + signature + previous event hash.
- Forms a Merkle chain β altering one event invalidates all subsequent hashes.
- Daily Merkle root anchored to an offline system (printed, stored in a safe, or notarized on a blockchain).
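The per-event HMAC chain can be sketched with the JDK's `javax.crypto` (a sketch: key handling via HSM/KMS is elided, and the class name is invented):

```java
import java.nio.charset.StandardCharsets;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

public class AuditChain {
    private final Mac mac;
    private String prevTag = "0".repeat(64);  // genesis tag

    public AuditChain(byte[] key) {
        try {
            mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(key, "HmacSHA256"));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Append one event; the returned tag is stored alongside it.
    // tag_n = HMAC(key, tag_{n-1} || event_n) -- altering any event
    // invalidates every subsequent tag in the chain.
    public synchronized String append(String eventJson) {
        byte[] tag = mac.doFinal((prevTag + eventJson).getBytes(StandardCharsets.UTF_8));
        prevTag = toHex(tag);
        return prevTag;
    }

    private static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) sb.append(String.format("%02x", b));
        return sb.toString();
    }
}
```

Verification is the same computation replayed from the genesis tag: if any stored tag disagrees with the recomputed one, everything from that event forward is suspect.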
Object Lock on S3: compliance mode with a retention period. Even a root AWS account can't delete or modify the log within retention.
Query layer:
- Elasticsearch for day-to-day queries (by user, time range).
- For deeper history or evidentiary queries, replay from Kafka tiered storage or S3.
Failure modes:
- Service can't reach Kafka: buffer locally in an append-only file (bounded size, rotate to S3 after N MB). Fail-closed for new operations if the buffer can't drain within N seconds β unlogged regulated-data access is worse than a user-facing error.
- Kafka down: short β use local buffer; long β promote the buffer path to primary, page hard.
Compliance checklist:
- Encryption in transit (TLS everywhere) and at rest (KMS keys, S3 encryption, ES index encryption).
- Audit log records who ran every query (meta-audit).
- Access to the audit log requires separate privileges from the underlying regulated-data access (separation of duties).
- Retention enforced technically (S3 Object Lock), not procedurally.
9.9 Design drill meta-tips β
- Always ask clarifying questions first. A 5-minute clarification at the start prevents a 30-minute derail.
- Explicitly separate functional and non-functional requirements. Non-functional drives architecture (scale, latency, consistency, availability).
- Draw the architecture before going deep. Boxes and arrows. Identify the 2β3 most interesting components and go deep on those.
- Name every box. "A queue" is weaker than "Kafka topic `orders.v1`, 12 partitions, 3 replicas, `min.insync.replicas=2`."
- Always cover failure modes. "What if X dies?" for every X. They'll ask anyway; beat them to it.
- Always cover scale. "If traffic 10Γ, where's the first bottleneck?" Answer it proactively.
- Always cover rollback. "How do you back out?" The best-designed systems are reversible at every step.
- Don't over-engineer. If the prompt is 10k/day, don't design for a billion.
Part 10 β Exercise Appendix β
10.1 Flash-card Q&A bank (rapid-fire) β
- At-most-once, at-least-once, exactly-once β pick for metrics collection. At-most-once. Metrics can survive drops; dedup cost exceeds loss cost.
- Which Kafka config makes the producer idempotent? `enable.idempotence=true`.
- How many brokers can you lose with `replication.factor=3`, `min.insync.replicas=2`, `acks=all`? One β with zero data loss. Two β producers get `NotEnoughReplicas`.
- Name the four quorum math relationships. `R+W>N` = strong consistency, `W>N/2` = no split-brain writes, `R=1` = fast reads, `R=N, W=1` = fast writes.
- Which consistency model does "read your own writes" belong to? A session-level guarantee, weaker than linearizable, stronger than eventual.
- Token bucket: allows bursts yes/no? Yes, up to bucket capacity.
- Sliding-window counter: exact or approximate? Approximate β uses current + previous window, weighted.
- What does `acks=1` lose protection from? Leader crash before any follower replicates the record.
- ISR members are eligible for what? Leader election (only in-sync replicas).
- What's the role of the Kafka controller? Cluster metadata β leader elections, topic creation, partition reassignments.
- What's a Kafka consumer group coordinator? Different from the controller β it manages group membership and offset commits for a specific group.
- In JMS, what's the difference between `CLIENT_ACKNOWLEDGE` and `SESSION_TRANSACTED`? `CLIENT_ACKNOWLEDGE` acks all prior messages on the session when called. `SESSION_TRANSACTED` groups sends and receives into a commit/rollback unit.
- What is `BOTHRESH` in IBM MQ? Backout threshold β after N rollbacks, the message moves to `BOQNAME`.
- Name 3 SASL mechanisms for Kafka. SCRAM, GSSAPI (Kerberos), OAUTHBEARER (also PLAIN, but don't use it unencrypted).
- Outbox pattern β what's in the outbox? Events to publish, written in the same transaction as the business row. A separate poller/CDC publishes them and marks them sent.
- Inbox pattern β solves what problem? Consumer-side dedup of at-least-once redelivery, atomically with the business write.
- Orchestration vs choreography β which has a central coordinator? Orchestration.
- Raft tolerates how many failures with N nodes? floor((N-1)/2) β a majority is needed for progress.
- Why is KRaft replacing ZooKeeper in Kafka? Operational simplicity (one cluster), faster controller failover, no external dependency.
- What's a leader epoch? A monotonic number incremented on each leader election; it tags records so followers can reconcile logs correctly.
- Consistent hashing β what does adding a node remap? Only ~1/N of keys (the arc the new node claims on the ring).
- Why virtual nodes in consistent hashing? Smoother load distribution when nodes join/leave; reduces skew from random ring placement.
- What is a Merkle tree used for in distributed systems? Anti-entropy β efficient reconciliation of differences between replicas by comparing tree hashes.
- Failure mode of an unbounded `LinkedBlockingQueue` in a thread pool? OOM under sustained load.
- Circuit breaker states? Closed, Open, Half-Open.
- Name two rate-limit algorithms and their burst behavior. Token bucket (allows bursts), leaky bucket (smooths to a steady rate).
- Difference between CAP and PACELC in one sentence. CAP = partition-time tradeoff (A vs C); PACELC adds the normal-time tradeoff (latency vs C).
- XA vs outbox β which blocks on coordinator failure? XA (2PC blocks mid-prepare if the coordinator dies). The outbox is a local commit only.
- Schema Registry BACKWARD compatibility means what? The new schema can read data written with the old schema. Consumers deploy first.
- What does `isolation.level=read_committed` on a Kafka consumer do? Only reads records from committed transactions; skips records from aborted transactions.
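The two consistent-hashing flash cards above ("~1/N of keys remap" and "why virtual nodes") can be made concrete with a small ring on a `TreeMap` (a sketch β the hash function here is deliberately cheap; real rings use murmur3/xxhash):

```java
import java.util.SortedMap;
import java.util.TreeMap;

public class HashRing {
    private final TreeMap<Integer, String> ring = new TreeMap<>();
    private final int vnodes;

    public HashRing(int vnodes) { this.vnodes = vnodes; }

    // Cheap integer mixing for illustration only.
    private static int hash(String s) {
        return s.hashCode() * 0x9E3779B9;
    }

    public void addNode(String node) {
        // Each physical node claims `vnodes` positions on the ring,
        // smoothing load and reducing placement skew.
        for (int i = 0; i < vnodes; i++) {
            ring.put(hash(node + "#" + i), node);
        }
    }

    public String nodeFor(String key) {
        // First vnode clockwise from the key; wrap to the ring start.
        SortedMap<Integer, String> tail = ring.tailMap(hash(key));
        return tail.isEmpty() ? ring.firstEntry().getValue()
                              : tail.get(tail.firstKey());
    }
}
```

Adding a node can only redirect keys *to* the new node (it claims arcs; nothing else moves), which is exactly why only ~1/N of keys remap.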
10.2 "Find the bug" scenarios β
Scenario 1. A service uses Kafka with `acks=1`, `enable.idempotence=false`, `retries=5`. Under load, duplicate events occasionally appear downstream. Why?
With `retries > 0` and idempotence off, a retry after a lost ack re-sends a batch the broker already wrote β a duplicate. (With `max.in.flight.requests.per.connection > 1`, retried batches can also commit out of order.) Fix: `enable.idempotence=true` β the producer ID + per-partition sequence numbers let the broker dedup retried batches.
Scenario 2. A Spring @KafkaListener handler does:
```java
@KafkaListener(topics = "orders")
@Transactional
public void handle(Order o) {
    orderRepo.save(o);
    kafka.send("events", o.id().toString(), new OrderCreated(o));
}
```
What's broken?
Dual-write problem. The `@Transactional` boundary covers the DB, not Kafka. If the process dies between the `orderRepo.save()` commit and the `kafka.send`, the order exists with no event. Fix: use the outbox pattern β write to an `outbox` table in the same transaction as `orderRepo.save`; a separate publisher reads the outbox and sends to Kafka.
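The outbox fix can be sketched end to end; here the in-memory "table" stands in for a real outbox table and the `BiConsumer` for a Kafka producer, so every name is illustrative (a sketch of the pattern, not Spring APIs):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiConsumer;

public class OutboxPoller {
    public static final class OutboxRow {
        final String key, payload;
        boolean sent;
        OutboxRow(String key, String payload) { this.key = key; this.payload = payload; }
    }

    private final List<OutboxRow> table = new ArrayList<>();  // stands in for the outbox table

    // Called inside the same transaction as orderRepo.save(o),
    // so the order row and its event commit or roll back together.
    public void enqueue(String key, String payload) {
        table.add(new OutboxRow(key, payload));
    }

    // Scheduled poller: publish unsent rows, mark sent only after the
    // broker acks. A crash between publish and mark => redelivery, so
    // consumers must dedup (at-least-once, not exactly-once).
    public int poll(BiConsumer<String, String> publisher) {
        int published = 0;
        for (OutboxRow row : table) {
            if (row.sent) continue;
            publisher.accept(row.key, row.payload);  // kafka.send(...).get() in real code
            row.sent = true;
            published++;
        }
        return published;
    }
}
```

Marking `sent` only after the publish succeeds is the crux: the failure window now produces duplicates (recoverable) instead of lost events (not).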
Scenario 3. A consumer uses `enable.auto.commit=true` with `auto.commit.interval.ms=5000`. Processing occasionally takes 10s. Crashes are reported during slow processing. After restart, some messages appear to have been "silently dropped." Why?
Auto-commit fires every 5s regardless of handler progress. If a batch of records is still being processed at the 5s mark, the offset commits for records that haven't been handled yet. Crash after that β those records are considered "processed" and skipped. Silent data loss. Fix: `enable.auto.commit=false`, commit manually after processing via Spring's `MANUAL_IMMEDIATE` ack mode.
Scenario 4. A payment service calls Stripe synchronously via `@Retry(maxAttempts = 3)`. Users occasionally see double charges. What's missing?
Idempotency key on the Stripe call. Without one, each retry is a new charge. The retry logic is correct; the problem is that the downstream call isn't idempotent on its own. Fix: pass a stable `Idempotency-Key` header (say, derived from the `orderId` plus a nonce generated once per logical charge β the same value across every retry) on every call in the retry chain. Stripe will dedup.
Scenario 5. A Kafka consumer group with 10 partitions runs 15 consumer pods. Lag is fine, but CPU utilization is uneven β some pods sit idle. Why?
You have 10 partitions and 15 consumers; 5 consumers are idle (max parallelism in a consumer group is the partition count). Fix: reduce the consumer count to 10 (one partition per consumer), or increase partitions. Note: increasing partitions mid-production breaks key ordering for any key that rehashes to a new partition β plan carefully.
Scenario 6. A service's p99 latency spikes every 45 seconds, then recovers. Checked GC β not it. Checked downstream β fine. What else might cause periodic spikes?
Check Kafka consumer group rebalancing β the `session.timeout.ms` default is 45s, so if heartbeats are missed (e.g., slow GC, long handler, network blip), the coordinator kicks the consumer and rebalances. Eager rebalancing pauses the whole group. Fix: switch to `CooperativeStickyAssignor`, set a static `group.instance.id`, and investigate why heartbeats are missing.
Scenario 7. A service has a circuit breaker on a downstream dep. Metrics show the breaker trips correctly, but once open, it never closes again. Why?
The `minimumNumberOfCalls` is higher than `permittedNumberOfCallsInHalfOpenState`. The breaker transitions to half-open, lets a few calls through, they succeed, but because the count is below `minimumNumberOfCalls` the breaker doesn't have enough data to decide and stays stuck. Fix: `permittedNumberOfCallsInHalfOpenState` ≥ `minimumNumberOfCalls`, or reduce `minimumNumberOfCalls`.
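To make the half-open handshake concrete, here is a minimal breaker state machine (a sketch, not the Resilience4j implementation β its parameter names and sliding-window logic are richer than this):

```java
public class Breaker {
    public enum State { CLOSED, OPEN, HALF_OPEN }

    private State state = State.CLOSED;
    private int failures = 0;
    private final int threshold;   // consecutive failures that open the breaker
    private final long waitMs;     // time in OPEN before probing
    private long openedAt;

    public Breaker(int threshold, long waitMs) {
        this.threshold = threshold;
        this.waitMs = waitMs;
    }

    // Gate: callers check before invoking the downstream dep.
    public boolean allow(long nowMs) {
        if (state == State.OPEN && nowMs - openedAt >= waitMs) {
            state = State.HALF_OPEN;  // let a probe through
        }
        return state != State.OPEN;
    }

    // Callers report each outcome; the probe's result decides HALF_OPEN's fate.
    public void record(boolean success, long nowMs) {
        if (success) {
            if (state == State.HALF_OPEN) state = State.CLOSED;  // probe succeeded
            failures = 0;
        } else {
            failures++;
            if (state == State.HALF_OPEN || failures >= threshold) {
                state = State.OPEN;
                openedAt = nowMs;
            }
        }
    }

    public State state() { return state; }
}
```

The bug in the scenario maps onto this sketch directly: if the "close" decision demands more samples than half-open ever admits, the probe outcome can never flip the state.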
Scenario 8. Your team adds a required field to an Avro schema registered with BACKWARD compatibility. Registry rejects. Why, and what's the right change?
Adding a required field (no default) is NOT backward compatible β old consumers reading new messages would fail to parse because they'd look for the missing field. Fix: add the field with a default value. Or, if the field must be "required," make it nullable with default null in the schema, enforce required in app code post-parse during a transition period, then later register a breaking schema on a new topic or after coordinated migration.
10.3 Self-test checklist β
Rate yourself on each (0 = can't, 1 = can stumble through, 2 = can teach it).
Fundamentals:
- [ ] State CAP precisely and give a 30-second example ______
- [ ] Explain PACELC and why it matters more day-to-day ______
- [ ] Name and give an example for 5 of the 8 fallacies ______
- [ ] Draw the consistency ladder from linearizable to eventual ______
- [ ] Explain vector clocks vs Lamport clocks in 1 minute ______
- [ ] Explain FLP impossibility in 2 sentences ______
Replication & consensus:
- [ ] Explain Raft leader election in 90 seconds ______
- [ ] State the quorum math (R+W>N etc.) and what each relationship guarantees ______
- [ ] Explain fencing tokens and what attack they prevent ______
- [ ] Compare consistent hashing to rendezvous hashing ______
Messaging:
- [ ] Draw the Kafka architecture (brokers, partitions, replication, controller) ______
- [ ] Explain ISR, leader epoch, high watermark ______
- [ ] Configure producer for at-least-once with no duplicates ______
- [ ] Configure consumer for at-least-once with dedup ______
- [ ] Explain Kafka EOS and what it doesn't cover ______
- [ ] Draw a DLT topology with retry chains ______
- [ ] Explain outbox pattern and sketch the poller code ______
- [ ] Compare Kafka to IBM MQ in 60 seconds ______
- [ ] Compare Kafka to RabbitMQ in 60 seconds ______
Microservices & resilience:
- [ ] Draw a saga with orchestration, then with choreography ______
- [ ] Draw a service mesh architecture (data plane + control plane) ______
- [ ] Explain circuit breaker states and when each transition happens ______
- [ ] Explain bulkhead vs semaphore vs thread-pool bulkhead ______
- [ ] Walk through a cascading failure and how to prevent it ______
- [ ] Configure rate limiting at 3 different layers ______
- [ ] Propagate a trace ID across HTTP, Kafka, and JMS ______
Design drills:
- [ ] Drill 1 (MQ β downstream) in 30 min on a whiteboard ______
- [ ] Drill 4 (GitOps deploys) in 30 min ______
- [ ] Drill 6 (rate limiter) in 30 min ______
- [ ] Drill 8 (audit log) in 30 min ______
10.4 Recommended reading & resources β
Books:
- Designing Data-Intensive Applications β Martin Kleppmann. The canonical book. If you read one thing from this list, read this. Covers data replication, partitioning, consensus, stream processing, batch processing β everything in Parts 1β3 of this guide in far more depth.
- Kafka: The Definitive Guide (2nd ed.) β Gwen Shapira et al. Canonical Kafka reference.
- Release It! (2nd ed.) β Michael Nygard. Resilience patterns, circuit breakers, bulkheads, war stories. Still the best book on designing for failure.
- Building Microservices (2nd ed.) β Sam Newman. Microservices architecture, decomposition, testing, deployment.
- Microservices Patterns β Chris Richardson. Saga, outbox, CQRS β deep on the patterns in Part 6.
Papers:
- Lamport, Time, Clocks, and the Ordering of Events in a Distributed System (1978).
- Lamport, The Part-Time Parliament (Paxos original, 1998).
- Ongaro & Ousterhout, In Search of an Understandable Consensus Algorithm (Raft, 2014).
- DeCandia et al., Dynamo: Amazon's Highly Available Key-value Store (2007).
- Kreps, The Log: What every software engineer should know about real-time data's unifying abstraction (blog post, but essential).
Blogs and docs:
- Confluent blog β Kafka deep dives, EOS walkthroughs, tuning guides.
- Martin Fowler β "Microservices," "Enterprise Integration Patterns," "Monolith First," "Circuit Breaker."
- AWS Architecture Blog β real-world scale case studies.
- Red Hat Developer blog β Kafka, service mesh, microservices tutorials with running code.
- High Scalability (highscalability.com) β real system designs at scale.
Courses / videos:
- MIT 6.824 Distributed Systems (lectures on YouTube).
- "Jepsen" reports by Kyle Kingsbury β reading one or two of these teaches more about distributed DBs than most courses.
Part 11 β Anchor Example Cheat Sheet β
One-page reference β which kinds of projects map to which interview prompts. Bring a mental version of this into every interview.
Anchor 1: Legacy MQ Microservice (10k+ tx/day) β
Use for prompts about:
- "Design a message ingest system"
- "How would you handle 10k+ messages reliably?"
- "Kafka consumer design"
- "MQ-to-Kafka bridge"
- "Idempotency in messaging"
- "DLQ and retry strategy"
- "Consumer lag monitoring"
- "Schema evolution across teams"
- "Tell me about a system you own"
Key metrics to drop in:
- 10k+ transactions/day
- 99.9% availability
- Zero message loss during cutover
- Dedup via DynamoDB with 72h TTL
- Cooperative rebalancing + static membership β rebalance time 30s β 2s
Anchor 2: Cross-Team Schema Standardization across 6 Teams β
Use for prompts about:
- "Schema evolution"
- "Backward/forward compatibility"
- "Cross-team technical initiatives"
- "Influencing without authority"
- "Preventing production breakages"
- "Schema Registry"
- "Tell me about a time you identified a problem no one else saw"
Key metrics/details:
- 6 teams, previously uncoordinated
- BACKWARD compat by default via Schema Registry
- PR-review gate with codeowners on subscribing teams
- Went from ~1 breaking incident/month β 0 in next 6 months
- Dev velocity improved because fewer firefights
Anchor 3: Thymeleaf β JAXB Migration (schema-bound XML) β
Use for prompts about:
- "Migrating from X to Y"
- "XML/schema-first design"
- "Inter-system data exchange"
- "Type safety in messaging"
- "XXE prevention"
- "Anti-corruption layer"
- "Difficult technical migration"
Key details:
- Thymeleaf = templating engine (error-prone for XML)
- JAXB = XSD-first, compile-time type safety
- Inter-system XSDs as the source of truth for the data exchange contract
- Required XXE hardening in the JAXB parser
- Made schema violations impossible to ship; moved validation left
Anchor 4: ArgoCD 99% Deploy Success Rate β
Use for prompts about:
- "GitOps"
- "CI/CD pipeline"
- "Deployment reliability"
- "Multi-environment deploys"
- "Reducing MTTR"
- "SLI/SLO implementation"
- "Platform engineering"
Key details:
- 10+ services across dev/staging/prod
- ApplicationSet-driven (one config generates all envs)
- Canary rollouts with auto-rollback on SLO burn
- Sealed Secrets for secret management
- Liveness/readiness/startup probes were the biggest reliability lever
- Tracked deploy success as an SLI with a burn-rate alert
Anchor 5: Intern Mentorship Program (sprint completion 60% β 85%) β
Use for prompts about:
- "Mentorship"
- "Team leadership without authority"
- "Sprint velocity / predictability"
- "PR review culture"
- "Code review philosophy"
- "Onboarding"
- "Balancing mentoring with your own delivery"
- "Developing junior engineers"
Key details:
- ~20-person intern cohort
- Sprint completion: 60% β 85%
- PR defect rate: -30%
- Structured standups, PR review cadence (24h SLA), TDD-focused curriculum
- 200+ PRs reviewed with consistent checklist
Anchor 6: IBM MQ β AWS ActiveMQ Bridge β
Use for prompts about:
- "Cross-environment messaging"
- "Cloud migration"
- "At-least-once delivery with dedup"
- "Cutover without downtime"
- "Network isolation / Direct Connect"
- "Active/passive failover"
Key details:
- Spring Boot bridge, session-transacted JMS on both sides
- DynamoDB dedup with TTL 72h
- Active/passive with Redis leader election
- Cutover via dual-read for 30 days
- Zero message loss; 2 dedup-caught duplicates during bridge restart
Bridging these to typical interview moments β
| Interviewer says... | Lead with... |
|---|---|
| "Walk me through a project you're proud of" | Legacy MQ microservice (then schema standardization if follow-up on cross-team) |
| "Tell me about something technically challenging" | Thymeleaf β JAXB (schema-first, XXE, type safety) |
| "Time you influenced without authority" | Cross-team schema standardization (6 teams, no formal power) |
| "Time you mentored/led a team" | Intern Mentorship Program (60% β 85%) |
| "Time you shipped under tight deadlines" | Kafka migration for release deadline (if applicable) or ArgoCD rollout |
| "Time you broke production / learned from failure" | Pre-standardization schema incident (be specific, own the RCA, show the change you drove) |
| "Design a system that ingests MQ messages" | Legacy MQ microservice, directly β this is the literal prompt |
| "Design a cross-env message bridge" | IBM MQ β AWS ActiveMQ, directly |
| "Design a GitOps deploy system" | ArgoCD setup, directly |
| "Why should we hire you?" | Years of Java/Spring + messaging experience, shipped systems at scale, led cross-team initiatives, mentor |
Final reminder β
Every question in this guide has a model answer, but the real signal is:
- You can draw it. If you can't whiteboard the architecture, you don't understand it yet.
- You can debug it. Given a symptom, can you walk through the diagnosis tree?
- You can anchor it. Every answer should tie back to something you've built, broken, or fixed.
Good luck. You're prepared.