
DevOps — All-Inclusive Study & Interview Guide

๐Ÿ“ Quiz ยท ๐Ÿƒ Flashcards

Companion to: INTERVIEW_PREP.md (Section 9). Audience: senior software engineers targeting roles that include DevOps ownership (Java/Spring backends, GitOps, K8s, ArgoCD, Helm, EKS/OpenShift). How to use this doc:

  • Learning pass — read top-to-bottom; each section teaches the concept before testing it.
  • Interview pass — jump straight to the Interview Q&A subsections; the numbered answers are phrased the way you'd actually say them on the spot.
  • Day-of — scan the Appendices (commands cheat sheet, glossary, checklist).

Every section follows the same seven-part shape: Why this matters → Core concepts → Commands/config you should know cold → Gotchas & war stories → Anchor example (where applicable) → Interview Q&A → Further reading. If a heading is missing, it was intentionally omitted because it didn't add value for that topic.


TABLE OF CONTENTS

Part I — Foundations

  1. DevOps Culture, Lifecycle & Metrics
  2. Linux & Systems Administration
  3. Networking Fundamentals
  4. Shell Scripting & Automation Languages
  5. Version Control & Branching Strategies

Part II — Build & Delivery

  1. CI/CD Pipelines (Deep Dive)
  2. Infrastructure as Code (Terraform, Ansible, Pulumi)
  3. Configuration Management

Part III — Containers & Orchestration

  1. Containers & Docker (Deep)
  2. Kubernetes Architecture (Deep)
  3. Kubernetes Workloads & Objects
  4. Kubernetes Networking
  5. Kubernetes Storage
  6. Kubernetes Config, Secrets & RBAC
  7. Kubernetes Scaling & Scheduling
  8. Kubernetes Observability, Probes & Troubleshooting
  9. Kubernetes Extensibility — CRDs, Operators, Webhooks
  10. OpenShift vs Vanilla Kubernetes
  11. Helm & Package Management

Part IV — GitOps & Progressive Delivery

  1. GitOps Principles
  2. ArgoCD
  3. Flux CD
  4. Progressive Delivery

Part V — Cloud

  1. AWS for DevOps
  2. Azure & GCP (Cross-Cloud Literacy)
  3. FinOps & Cost Optimization

Part VI — Observability & SRE

  1. Observability Foundations
  2. OpenTelemetry
  3. Metrics & Prometheus
  4. Visualization & APM
  5. Logging Stack
  6. Tracing
  7. SRE Practices
  8. Alerting Philosophy

Part VII — Security / DevSecOps

  1. DevSecOps Principles
  2. Pipeline & Code Security
  3. Supply Chain Security
  4. Secrets Management
  5. Container & Kubernetes Security
  6. Identity, Access & Zero Trust
  7. Compliance & Regulated Environments
  8. Incident Response for Security Events

Part VIII — Advanced & Emerging

  1. Service Mesh
  2. Platform Engineering & Developer Experience
  3. GitOps-Native Infrastructure Management
  4. Disaster Recovery & Backups
  5. Performance, Load Testing & Capacity Planning
  6. Chaos Engineering
  7. AIOps & AI-Assisted DevOps
  8. GreenOps / Sustainable DevOps

Part IX — Scenario-Based & System Design

  1. Troubleshooting Scenarios
  2. DevOps System Design Prompts
  3. Behavioral — DevOps-Flavored

Appendices


Part I — Foundations

1. DevOps Culture, Lifecycle & Metrics

Why this matters

Before any tooling question, interviewers want to hear you articulate what DevOps actually is — because half the candidates will say "it's CI/CD" and stop there. DevOps is a cultural + engineering practice; the tools are an implementation detail. Senior interviewers probe this to separate people who operate tools from people who reason about delivery. Expect at least one of: "What is DevOps in your own words?", "What DORA metrics would you track?", or "How does SRE differ from DevOps?"

Core concepts

Definition. DevOps is the combination of cultural philosophies, practices, and tools that shortens the feedback loop between writing code and running it in production — with the goal of delivering value to users faster, more reliably, and more safely. It deliberately blurs the historical split between Dev (ship features) and Ops (keep things up), because those competing incentives are the root cause of most delivery dysfunction.

CALMS framework — the cultural pillars:

  • Culture — shared ownership, blameless postmortems, trust
  • Automation — humans don't do what a script can do
  • Lean — small batches, short cycles, reduce waste
  • Measurement — you can't improve what you don't measure (DORA)
  • Sharing — knowledge, tools, and on-call across the team

DevOps lifecycle (the infinity loop — 8 phases):

Plan → Code → Build → Test → Release → Deploy → Operate → Monitor → (back to Plan)

Each phase is continuous; the whole loop runs over and over for the life of a service. Monitoring feeds back into planning.

DORA metrics (from the DORA / Accelerate research, now part of Google DORA's annual State of DevOps reports):

  1. Deployment Frequency — how often you ship to prod. Elite teams: on-demand (multiple per day). Low performers: <1/month.
  2. Lead Time for Changes — time from commit to running in prod. Elite: <1 hour. Low: >6 months.
  3. Change Failure Rate — % of deploys that cause a failure needing remediation. Elite: 0–15%. Low: 46–60%.
  4. Mean Time to Restore (MTTR) — time to recover from a prod failure. Elite: <1 hour. Low: >6 months.

Two metrics on speed (frequency + lead time), two on stability (failure rate + MTTR). DORA's finding: high-performing teams don't trade speed for stability — they get both.
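The split above is easy to demo on real data. A minimal sketch, assuming a hypothetical deploy log in `date,outcome` CSV form (the file format and function name are illustrative, not from any standard tool):

```shell
# Hypothetical deploys.csv: one line per deploy, "date,outcome" where outcome is ok|failed.
# Deployment Frequency is the line count over the period; Change Failure Rate is failed/total.
dora_summary() {
  awk -F, '{ total++; if ($2 == "failed") failed++ }
           END { printf "deploys=%d cfr=%.1f%%\n", total, 100 * failed / total }'
}

printf '2024-01-02,ok\n2024-01-03,failed\n2024-01-04,ok\n2024-01-05,ok\n' | dora_summary
# deploys=4 cfr=25.0%
```

In practice you'd feed this from your CD system's deployment history instead of a hand-written CSV.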

SPACE framework — newer (2021, Microsoft Research) — complements DORA with developer experience:

  • Satisfaction & well-being
  • Performance (outcomes)
  • Activity (actions count, but cautiously)
  • Communication & collaboration
  • Efficiency & flow

Use SPACE when DORA alone feels too narrow (e.g., you're measuring platform health, not delivery).

DevOps vs SRE vs Platform Engineering vs DevSecOps — don't confuse these in an interview:

| Discipline | Origin | Focus | One-liner |
| --- | --- | --- | --- |
| DevOps | Patrick Debois, 2009 | Culture + practice uniting dev & ops | "Break down the wall." |
| SRE | Google, 2003 (public ~2016) | Apply SW engineering to ops; reliability as a product feature | "Class SRE implements DevOps." |
| Platform Engineering | ~2020s | Build internal platforms/IDPs that DevOps teams consume | "Product-ize your tooling." |
| DevSecOps | ~2012 | Shift-left security into the DevOps loop | "Security is everyone's job, continuously." |

SRE is DevOps with a specific error-budget-driven implementation; Platform Engineering is what you do when DevOps self-service doesn't scale past a handful of teams; DevSecOps is an emphasis, not a separate team (usually).

GreenOps (sustainability) and FinOps (cloud cost) are sibling disciplines sitting on the same loop — both use observability to optimize a different axis (carbon / dollars).

Gotchas & war stories

  • "We do DevOps" ≠ "We have a Jenkins server." Tools without culture change create the same silos with nicer scripts.
  • DORA metrics game themselves — if you reward deploy frequency, teams will cut bigger releases into trivial deploys. Pair with Change Failure Rate or the gain is hollow.
  • MTTR can hide a mean problem — one 48-hour outage swamps fifty 5-minute ones. Look at p50/p90/p99 of recovery time, not just the mean.
  • "DevOps engineer" as a title is controversial — purists argue it's a practice, not a role. In interviews, don't die on that hill: use whatever title the company uses and show you understand the underlying model.

Anchor example

A legacy MQ microservice hitting a 99% deployment success rate translates to a DORA Change Failure Rate of ~1%, which is elite-tier by the DORA scale. Quantify this in behaviorals: "In DORA terms, our change failure rate was ~1%, well inside elite range, largely because we had automated integration tests hitting a real IBM MQ instance and every deploy went through ArgoCD's self-heal."

Interview Q&A

  1. What is DevOps in your own words? — A set of cultural and engineering practices that shrinks the loop between writing code and operating it, so we can deliver value faster and more reliably. The tools are less important than the shared ownership between the people who write features and the people who keep them running.

  2. Name the four DORA metrics and which axis each measures. — Deployment Frequency and Lead Time for Changes measure throughput; Change Failure Rate and MTTR measure stability. The DORA research showed elite teams win on both axes — there's no inherent trade-off.

  3. How would you improve a team's deployment frequency from weekly to daily? — Start by making deploys boring: trunk-based dev, small PRs, feature flags, automated canaries, and tight monitoring so rollback is a one-click operation. The blocker is almost never technical — it's trust in the pipeline.

  4. What's the difference between DevOps and SRE? — SRE is a concrete implementation of DevOps invented at Google, centered on error budgets. DevOps says "share ownership"; SRE adds "here's how: SLOs, error budgets, capped toil, and engineers who write software to operate software."

  5. What is CALMS? — Culture, Automation, Lean, Measurement, Sharing — the cultural pillars of DevOps. Useful checklist when diagnosing why a "DevOps transformation" stalled: usually one of the five is missing.

  6. A team deploys once a month. Their change failure rate is 5%. Is that elite? — No — deploy frequency is "medium/low" in DORA terms. A low change failure rate in isolation just means they've built a brittle approval process around infrequent releases. True elite teams have a low CFR while deploying daily.

  7. What are Platform Engineering and DevSecOps, and when does a company need them? — Platform Engineering builds an Internal Developer Platform so product teams don't each reinvent CI/CD, observability, and secrets handling; usually kicks in at 10+ dev teams. DevSecOps integrates security gates into every CI/CD step (SAST, SCA, signing, SBOM) instead of bolting security on at the end.

  8. How do you measure DevOps success beyond DORA? — SPACE framework for dev experience (satisfaction, flow, collaboration), plus business outcomes (time-to-market, NPS, revenue per deploy). DORA answers "are we delivering well?"; SPACE answers "are the humans doing the delivering healthy?"

Further reading


2. Linux & Systems Administration

Why this matters

Kubernetes runs on Linux. Your CI runners are Linux. Your Docker images are Linux. When a container misbehaves at 2 a.m., you're SSHing into a node and running journalctl, not opening a GUI. An interviewer may skip straight to "a process is eating memory, walk me through your debug steps" — and the candidate who says top and stops is miles behind the one who narrates top → ps → pmap → /proc/<pid>/status.

Core concepts

Filesystem hierarchy (read: man hier):

  • /etc — config
  • /var — variable data (logs in /var/log, spool, cache)
  • /usr — user binaries (/usr/bin, /usr/local/bin)
  • /opt — optional (vendor) software
  • /tmp — ephemeral (wiped on reboot on most distros)
  • /proc — virtual FS: per-process + kernel state
  • /sys — virtual FS: device + driver state
  • /dev — device nodes
  • /home — user homes

Permissions. Every file has owner, group, and others, each with read/write/execute (rwx). chmod 755 file = owner rwx, group r-x, others r-x. Special bits:

  • setuid (4) — run as file owner (e.g., /usr/bin/passwd is setuid root).
  • setgid (2) — run as file group (or on a dir: new files inherit dir group).
  • sticky (1) — on a dir, only the file owner can delete their own files (e.g., /tmp).
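The special bits are easy to poke at in a sandbox. A sketch, assuming GNU stat (Linux); the temp dir and file names are arbitrary:

```shell
d=$(mktemp -d)
touch "$d/f"

chmod 4755 "$d/f"          # setuid + rwxr-xr-x: owner execute shows as 's'
stat -c '%A %a' "$d/f"     # -rwsr-xr-x 4755

chmod 1777 "$d"            # world-writable dir with the sticky bit, like /tmp
stat -c '%A' "$d"          # drwxrwxrwt; the trailing 't' is the sticky bit

rm -rf "$d"
```

The setuid bit only changes behavior on executables (and is ignored on shell scripts by the kernel); here it just demonstrates the mode display.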

Users & groups. /etc/passwd (users), /etc/group (groups), /etc/shadow (hashed passwords). UID 0 = root. System accounts typically UID <1000.

Processes & signals.

  • Every process has PID, PPID, UID, state (R/S/D/Z/T). D = uninterruptible sleep (usually waiting on I/O — scary if sustained).
  • Signals: SIGTERM (15, polite kill), SIGKILL (9, can't be trapped), SIGHUP (1, "reload"), SIGINT (2, Ctrl-C), SIGSTOP (19) / SIGCONT (18).
  • Zombies (Z): child exited but parent hasn't wait()ed. Clean up the parent, not the child.
  • Orphans: parent died, adopted by init (PID 1).
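The SIGTERM/SIGKILL contrast from the list above is scriptable: a process that traps SIGTERM gets to run cleanup before exiting, which SIGKILL would never allow. A minimal sketch:

```shell
# The child installs a TERM handler, then signals itself; the trap runs
# instead of the process dying immediately. SIGKILL cannot be intercepted this way.
bash -c 'trap "echo cleaned up; exit 0" TERM; kill -TERM $$; sleep 5'
# prints: cleaned up
```

Note the `sleep 5` is never reached: the trap fires as soon as the kill builtin returns, echoes, and exits 0.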

systemd — the init system on nearly every modern distro. A unit is a managed resource: services, sockets, timers, mounts, targets.

  • systemctl start|stop|restart|status|enable|disable <name>.service
  • journalctl -u <service> — read logs for a unit.
  • systemctl list-units --failed — what's broken.
  • Unit files live in /etc/systemd/system/ (admin) or /lib/systemd/system/ (distro).
  • Timers are systemd's replacement for cron (more observable, can depend on other units).
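As a sketch of the timer pattern: a oneshot service plus a timer with the same stem that triggers it. The unit and script names here are hypothetical, shown as two files in one listing:

```ini
# /etc/systemd/system/backup.service  (hypothetical name)
[Unit]
Description=Nightly backup

[Service]
Type=oneshot
ExecStart=/usr/local/bin/backup.sh

# /etc/systemd/system/backup.timer  (same stem activates the service above)
[Timer]
OnCalendar=daily
Persistent=true

[Install]
WantedBy=timers.target
```

Enable with `systemctl enable --now backup.timer` and inspect schedules with `systemctl list-timers`; Persistent=true runs a missed activation at next boot.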

Package managers.

  • Debian/Ubuntu: apt/apt-get/dpkg. /var/cache/apt/archives, /var/lib/apt/lists.
  • RHEL/Fedora/Amazon Linux: dnf (modern) / yum (legacy). Packages are .rpm.
  • Alpine: apk. Tiny (musl libc instead of glibc — watch for compatibility gotchas in containers).
  • Arch: pacman.

Performance & troubleshooting toolbelt (Brendan Gregg's "USE" map if you want the canonical reference):

  • top / htop — interactive process view.
  • ps auxf — snapshot, with tree.
  • vmstat 1 — CPU, memory, swap, I/O, every second.
  • iostat -xz 1 — disk I/O per-device.
  • sar — historical metrics (requires sysstat installed).
  • pidstat 1 — per-process CPU over time.
  • free -h — memory.
  • df -h / du -sh * — disk usage.
  • lsof -p <pid> — open files/sockets for a PID.
  • lsof -i :8080 — who's listening on 8080.
  • strace -p <pid> — syscall trace (expensive — watch for perf impact).
  • ltrace — library call trace.
  • tcpdump -i eth0 port 80 — packet capture.
  • ss -tlnp — listening TCP sockets with owning process (replaces netstat).
  • dmesg -T | tail — kernel ring buffer (OOM kills show up here).
  • journalctl -f — follow all systemd logs.

Cron. crontab -e edits the user crontab. Format: m h dom mon dow cmd. Special: @reboot, @daily, @hourly. For anything observable/production, prefer systemd timers — they integrate with journalctl, can depend on other units, and won't double-fire if the previous run is still going.

SELinux / AppArmor. Mandatory access control layers on top of traditional Unix perms. RHEL/Fedora default to SELinux; Ubuntu to AppArmor. When a container can't read a bind-mounted dir even though chmod 777 is set → 90% of the time it's SELinux denying it. Check with getenforce / setenforce 0 (temporarily permissive), ausearch -m AVC for denials.

Commands you should know cold

```bash
# Find the top 5 memory-hungry processes
ps aux --sort=-rss | head -n 6

# What process is using port 8080?
ss -tlnp | grep :8080
# or
lsof -i :8080

# Which files is PID 1234 holding open?
lsof -p 1234

# Live trace of every syscall PID 1234 makes
sudo strace -f -p 1234

# See disk I/O per device, once per second
iostat -xz 1

# Did the kernel OOM-kill something recently?
sudo dmesg -T | grep -i 'killed process'

# Tail logs for a failing service
journalctl -u myservice.service -f

# Find files larger than 100M under /var
sudo find /var -type f -size +100M -exec ls -lh {} \;

# Who's logged in and what are they doing?
w

# Ask a daemon to reload its config without a restart -- most daemons treat SIGHUP as "reload"
sudo kill -HUP <pid>
```

Gotchas & war stories

  • rm -rf / with a space — rm -rf / tmp/foo wipes your root FS. Many ops engineers have a set -e; set -u habit because of this. GNU rm on modern distros has --preserve-root by default, but don't rely on it.
  • Inodes can fill up before disk does — df -i when df -h says "plenty of space" but writes fail.
  • Swap thrashing looks like CPU starvation — always check vmstat columns si/so (swap in/out) before blaming CPU.
  • kill -9 doesn't flush buffers — DB processes especially should get SIGTERM first.
  • chmod -R 777 / — a famously irreversible mistake. Don't use 777 as a "fix."
  • TIME_WAIT sockets piling up — common in high-QPS proxies; not a leak, just TCP being TCP. Tune net.ipv4.tcp_tw_reuse.
  • Mount points hide directories — if /data is a mount, anything you put in /data before the mount is hidden while the mount is active.

Interview Q&A

  1. Walk me through how you'd debug a process using 100% CPU. — top to identify the PID. ps -L -o pid,tid,%cpu -p <pid> to find the hot thread. strace -p <tid> for 10–30 seconds to see what syscalls it's stuck in; if it's user-space CPU, perf top -p <pid> or a language-specific profiler. Check /proc/<pid>/status for threads, signals, and masks.

  2. What's the difference between SIGTERM and SIGKILL? — SIGTERM (15) asks the process to shut down gracefully — it can trap it and clean up. SIGKILL (9) is delivered by the kernel and cannot be caught or ignored; the process dies immediately with no chance to flush buffers or release resources. Default to SIGTERM; escalate to SIGKILL only after a grace period.

  3. What does strace -p <pid> do, and when would you use it? — Traces every system call the process makes. Useful when you're sure something is happening but the app logs are silent — you can see it's stuck on a futex() (lock contention), a read() against a socket, or repeatedly failing open() on a missing file. It adds overhead; don't run it against a hot path in prod for long.

  4. What's the difference between cron and systemd timers? — Cron is older, fires via crond, logs minimally. Systemd timers are systemd units, so they integrate with journalctl, can declare dependencies (Requires=), have OnCalendar= + Persistent= for "catch up on missed runs after reboot," and won't fire again while the service they trigger is still running, so runs don't overlap. Use timers for anything you want to observe.

  5. You chmod 777 a directory and a service still can't read it — why? — Most likely SELinux or AppArmor. Check getenforce and ausearch -m AVC -ts recent. Other candidates: mount options (ro, nosuid, noexec), directory traversal perms up the tree (execute on every parent dir), or extended attributes (lsattr). Container-specific: the volume's UID/GID inside the container might not match the host.

  6. A pod on a Kubernetes node is OOMKilled. How do you confirm from the node? — dmesg -T | grep -i 'killed process' will show the kernel OOM killer's selection, with PID and RSS at time of kill. Cross-reference the PID to the container via crictl inspect or docker ps. At the K8s layer, kubectl describe pod will show Reason: OOMKilled on the container status.

  7. Difference between /proc and /sys? — Both are virtual kernel filesystems. /proc is primarily per-process info (/proc/<pid>/...) plus kernel stats (/proc/meminfo, /proc/cpuinfo, /proc/net/*). /sys is a structured view of kernel devices and subsystems (cgroups live there on cgroup v1/v2). If you're writing a "how much memory does this process use" tool, you're reading /proc; if you're adjusting a cgroup limit, you're writing to /sys.
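The /proc side of that answer is plain text parsing. A sketch (the helper name is illustrative; the input is canned here so the pipeline is reproducible, but on a live Linux box you'd feed it /proc/<pid>/status):

```shell
# Extract VmRSS (resident set size, in kB) from a /proc/<pid>/status stream.
rss_kb() { awk '/^VmRSS:/ { print $2 }'; }

# Live usage on Linux:   rss_kb < /proc/self/status
printf 'Name:\tbash\nVmRSS:\t    3456 kB\nThreads:\t1\n' | rss_kb
# 3456
```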

Further reading


3. Networking Fundamentals

Why this matters

Every DevOps outage eventually becomes a networking problem. Ingress misconfiguration, DNS TTL, a security group change, certificate expiry, mTLS mismatch, CIDR collision — the higher you go, the more of your debugging time is spent in L4–L7. Expect "a user can't reach service X, walk me through your debugging" as a near-certainty.

Core concepts

OSI model (what matters in practice):

  • L3 (Network) — IP. Routing. ICMP.
  • L4 (Transport) — TCP, UDP. Ports.
  • L7 (Application) — HTTP, gRPC, DNS, etc.

L5/L6 (Session/Presentation) are mostly academic for DevOps; L1/L2 matter for on-prem but rarely in cloud-native work.

TCP vs UDP.

  • TCP — reliable, ordered, connection-oriented, 3-way handshake (SYN → SYN/ACK → ACK), flow + congestion control. Used by HTTP/1.1, HTTP/2, SSH, SMTP.
  • UDP — connectionless, unreliable, no ordering guarantees, low overhead. Used by DNS (most queries), NTP, QUIC (the transport under HTTP/3), real-time video.

TCP connection lifecycle:

```
SYN → SYN-ACK → ACK                 (open)
... data ...
FIN → ACK → FIN → ACK               (close, 4 segments because each side closes independently)
```

States to recognize: LISTEN, ESTABLISHED, TIME_WAIT (the socket hangs around ~2×MSL to absorb stray packets — usually harmless unless accumulating to 1000s under high QPS), CLOSE_WAIT (remote closed, local didn't — usually a bug: you forgot to close() the socket).
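To see where your sockets sit in that state machine, summarize ss output with awk. A sketch (the sample input is canned so the pipeline is reproducible; on a live box you'd pipe in `ss -tan` instead):

```shell
# Count TCP connections per state from `ss -tan`-style output (header row + one line per socket).
state_summary() {
  awk 'NR > 1 { count[$1]++ } END { for (s in count) print s, count[s] }' | sort
}

printf 'State Recv-Q Send-Q Local Peer\nESTAB 0 0 a b\nTIME-WAIT 0 0 a b\nESTAB 0 0 a b\nCLOSE-WAIT 0 0 a b\n' \
  | state_summary
# CLOSE-WAIT 1
# ESTAB 2
# TIME-WAIT 1
```

A spike in CLOSE-WAIT here is the "you forgot to close() the socket" bug made visible.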

TLS handshake (TLS 1.3, simplified).

  1. Client hello — supported ciphers, random, SNI (hostname).
  2. Server hello — chosen cipher, server cert, server key-share.
  3. Client verifies cert against trust store; both derive session keys from shared ephemeral key material.
  4. Encrypted traffic begins.

TLS 1.3 requires forward secrecy and drops a lot of legacy (RSA key exchange, SHA-1, MD5, static DH). If something "doesn't work with modern TLS," it's usually because it's still on TLS 1.0/1.1.

HTTP versions:

  • 1.1 — text, one request per connection at a time (or pipelined, flaky in practice). Head-of-line blocking.
  • 2 — binary, multiplexed streams over one TCP connection, header compression (HPACK), server push (rarely used in practice).
  • 3 — HTTP/2 semantics over QUIC (UDP). No TCP HoL blocking; works around lossy networks. 0-RTT reconnects.

DNS.

  • Records: A (IPv4), AAAA (IPv6), CNAME (alias), MX (mail exchange), TXT (arbitrary; SPF, DKIM, domain-verification tokens), SRV (service discovery — host+port), NS (name server), SOA (start of authority), PTR (reverse lookup).
  • Resolution flow: stub resolver (your app / OS) → recursive resolver (often local, e.g., 1.1.1.1) → root → TLD (.com) → authoritative → answer.
  • TTL governs caching. Low TTL = fast migrations but higher query load + cost.
  • Common failure modes: stale cache, split-horizon DNS (internal view ≠ external), search domain surprise (foo.svc becomes foo.svc.ns.svc.cluster.local), IPv4/IPv6 dual-stack weirdness, nscd/systemd-resolved caching inconsistencies.

Load balancing.

  • L4 (TCP/UDP) — e.g., AWS NLB, HAProxy in mode tcp. Fast, no HTTP knowledge. Good for non-HTTP protocols.
  • L7 (HTTP) — e.g., AWS ALB, nginx, HAProxy in mode http, Envoy. Can inspect headers, path, host; do retries, rewrites, cookie stickiness, WAF. Adds latency vs L4.
  • Algorithms: round-robin, least connections, IP hash, consistent hash (critical when you have a stateful backend cache).
  • Health checks: active (LB polls backend) vs passive (LB evicts on failure observation).

Reverse vs forward proxy.

  • Reverse proxy — sits in front of servers (nginx, Envoy). Clients talk to it, it routes to backends.
  • Forward proxy — sits in front of clients (outbound traffic). Corporate egress, caching.

NAT / CIDR / subnetting.

  • NAT rewrites source/destination IPs so a private network reaches the internet via a public IP. Cloud analog: NAT gateway.
  • CIDR (/24, /16): /24 = 256 addresses (254 usable). /16 = 65,536. /20 = 4,096. Cloud subnets reserve ~5 addresses per subnet for infra.
  • Plan non-overlapping CIDRs across VPCs you'll ever peer — you cannot re-IP without pain.
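The subnet arithmetic above is worth being able to reproduce on the spot. A sketch of the usable-host calculation (the function name is illustrative; the cloud-reserved addresses come on top of the network/broadcast pair):

```shell
# Usable IPv4 hosts in a prefix: 2^(32 - prefix) minus the network and broadcast addresses.
hosts() { echo $(( (1 << (32 - $1)) - 2 )); }

hosts 24   # 254
hosts 20   # 4094
hosts 16   # 65534
# On AWS, 5 addresses per subnet are reserved in total, so a /24 leaves 251.
```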

Firewalls.

  • iptables (classic) — tables (filter, nat, mangle), chains (INPUT, OUTPUT, FORWARD, plus custom), rules.
  • nftables — modern replacement.
  • AWS security groups — stateful allow-lists at the ENI level.
  • NACLs — stateless at the subnet level. (Security groups alone are enough 95% of the time.)

VPN. IPSec (site-to-site, IKEv2), WireGuard (modern, simple, fast), OpenVPN (older but ubiquitous). Cloud-specific: AWS Site-to-Site VPN, Client VPN, Direct Connect (not a VPN — dedicated circuit).

Commands you should know cold

```bash
# Who's listening on port 8080 and which process
ss -tlnp | grep :8080

# Active TCP connections by state
ss -s
ss -tan state established

# DNS lookup (modern)
dig +short example.com A
dig @1.1.1.1 example.com        # ask a specific resolver
dig example.com ANY             # (often restricted now)
nslookup example.com

# Does the port respond at all?
nc -vz example.com 443
# or
curl -v telnet://example.com:443

# What path does my traffic take?
traceroute example.com
mtr example.com                  # interactive traceroute + loss per hop

# Inspect TLS cert
openssl s_client -connect example.com:443 -servername example.com </dev/null \
  | openssl x509 -noout -text

# Check cert expiry quickly
echo | openssl s_client -connect example.com:443 2>/dev/null \
  | openssl x509 -noout -dates

# Packet capture
sudo tcpdump -i eth0 -n -s0 -w /tmp/cap.pcap port 443
# Replay/view
tcpdump -r /tmp/cap.pcap -n

# Force curl to resolve a host to a specific IP (test new backend before DNS switch)
curl --resolve example.com:443:10.0.0.5 https://example.com/
```

Gotchas & war stories

  • DNS caching burns hours — TTL 300s means up to 5 min of stale answers. TTL 86400s means a day. Plan migrations with TTL pre-lowered 24h in advance.
  • localhost vs 127.0.0.1 — sounds dumb but has tripped many: IPv6 localhost resolves to ::1. Apps bound to 0.0.0.0:8080 (IPv4) won't answer [::1]:8080. Check with ss -tlnp.
  • MTU mismatches — VPNs, some cloud inter-region links, and Kubernetes overlay networks cap MTU at 1400–1450 instead of 1500. Large packets get silently dropped. Symptom: "small requests work, big ones hang." Fix: lower MTU on the path, or set TCP MSS clamping.
  • TIME_WAIT exhaustion — only ~28k ephemeral ports are available from a single source IP to a given backend IP:port, so a reverse proxy opening short-lived connections to the same backend can exhaust them. Fix with net.ipv4.tcp_tw_reuse=1 + a longer ephemeral range.
  • Security group changes aren't retroactive for existing connections — you allow-listed a new IP; the ones you removed are still connected. Bounce the connection pool or restart.
  • Split-horizon DNS — devs see api.company.com resolve internally; prod sees external. An app embedding the resolved IP in config will drift.
  • IPv6 default — modern OSes prefer AAAA. One misconfigured AAAA record can make an entire service appear broken to half your users.
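The TIME_WAIT exhaustion gotcha has concrete math behind it. With Linux defaults (ephemeral range 32768–60999, TIME_WAIT lasting 60 seconds), the sustained new-connection ceiling to one backend IP:port works out as:

```shell
# ~28k ephemeral ports recycled every 60 seconds caps the sustained connection rate.
ports=$(( 60999 - 32768 + 1 ))   # default net.ipv4.ip_local_port_range
echo "$ports"                     # 28232
echo $(( ports / 60 ))            # ~470 new connections/second before exhaustion
```

This is why connection pooling (reusing a few long-lived connections) beats sysctl tuning as the first fix.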

Interview Q&A

  1. What happens when I type https://example.com in my browser? (the classic) — OS resolves DNS (stub → recursive → authoritative → A record). Browser opens TCP to port 443. TLS handshake (client hello with SNI, server cert, key exchange, derived session keys). HTTP/2 (or 3) request. Response travels back. Close or keep-alive. Each of those is a failure domain: slow DNS, TCP drop, TLS cert expiry, HTTP error.

  2. What's the difference between L4 and L7 load balancing? — L4 operates on TCP/UDP — picks a backend based on connection-level info (IP, port), forwards packets blindly. L7 understands the application protocol — for HTTP it can route by host/path, add headers, terminate TLS, retry idempotent requests. L4 is faster and protocol-agnostic; L7 is smarter and costs more CPU.

  3. What's in a TLS handshake? — ClientHello (ciphers, SNI, key share), ServerHello (chosen cipher, cert, server key share), client validates cert chain, both derive session keys from ephemeral key material (ECDHE), encrypted application data flows. TLS 1.3 does it in one round trip (0-RTT for resumed sessions).

  4. Explain DNS resolution end-to-end. — App calls getaddrinfo. OS checks /etc/hosts, then nsswitch. If DNS, stub resolver asks a recursive resolver. Recursive asks root (.) for .com NS, asks .com for example.com NS, asks that authoritative server for the A record. Answer propagates back with TTL. Subsequent queries hit caches at every layer until TTL expires.

  5. You've deployed a new version. Users in one region can't reach it. Where do you look? — Start at the edge: does the public ALB/ingress health check pass? Is DNS resolving to the right IP from that region (split-horizon or stale cache)? Check the security group / network policy chain. Curl from a jump host in the target VPC. Look at L7 LB access logs for 5xx patterns. Likely culprits in order: DNS TTL, SG change, health-check failure, certificate mismatch on SNI.

  6. What's the TIME_WAIT state and when is it a problem? — After closing a TCP connection, the side that sent the final ACK holds the socket in TIME_WAIT for ~2×MSL (120s traditionally) to absorb stray packets and prevent old-connection data from corrupting a new connection on the same 4-tuple. It's a problem only when you've got thousands of outbound short-lived connections from the same source IP+port range — ephemeral port exhaustion. Fix: connection pooling, tcp_tw_reuse, widen the ephemeral port range.

  7. Difference between a security group and a NACL in AWS. — SG is stateful (return traffic is auto-allowed), acts at the ENI, supports allow-only rules. NACL is stateless (must allow both directions explicitly), acts at the subnet boundary, supports both allow and deny rules. In practice: always use SGs; reach for NACLs only when you need a broad subnet-level block (like quarantining a subnet).

  8. Why does a 1500-byte ping work but a 1600-byte ping fail? — MTU. The path has a link with MTU <1500 (VPN, tunnel, or cloud overlay). Without PMTUD working correctly (ICMP "fragmentation needed" blocked by a firewall), packets larger than the smallest MTU get silently dropped. Diagnose with ping -M do -s 1472 <host> and bisect the size.

Further reading


4. Shell Scripting & Automation Languages

Why this matters

Every DevOps environment has hundreds of small scripts gluing tools together. The interview won't be "write a sort in bash," but it will be "write a script that checks if a service is healthy and restarts it" or "walk me through a bash one-liner I pasted on the screen." Being fluent in bash + one mainstream scripting language (Python) is table stakes. Being fluent in YAML/HCL/JSON is non-optional — every tool in the stack is configured in them.

Core concepts

Bash essentials.

Safety prelude — put at the top of every non-trivial script:

```bash
#!/usr/bin/env bash
set -euo pipefail     # -e: exit on error, -u: error on unset var, -o pipefail: fail on any pipe stage
IFS=$'\n\t'           # safer word-splitting (not space)
trap 'echo "failed at line $LINENO"; exit 1' ERR
```

Variables: FOO=bar (no spaces around =). Use: $FOO or ${FOO}. Quote to avoid word splitting: "$FOO". Arrays: arr=(a b c); echo "${arr[@]}".
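Quoting behavior is easy to verify interactively. A sketch (counts assume the default IFS, not the stricter safety-prelude IFS):

```shell
FOO="foo bar"
printf '%s\n' $FOO   | wc -l   # 2 (unquoted: split into two words)
printf '%s\n' "$FOO" | wc -l   # 1 (quoted: passed as one argument)

arr=(a "b c" d)
echo "${#arr[@]}"              # 3 (quotes kept "b c" as a single element)
```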

Conditionals:

```bash
if [[ -f /etc/passwd ]]; then echo "exists"; fi
if [[ "$FOO" == "bar" ]]; then ...; fi
[[ "$FOO" =~ ^[0-9]+$ ]] && echo "numeric"
```

[[ ]] is bash-only and safer than [ ] (POSIX); use it unless you need strict POSIX.

Loops:

```bash
for f in *.log; do gzip "$f"; done
while read -r line; do echo "$line"; done < file.txt
```

Functions:

```bash
retry() {
  local tries=$1; shift
  for ((i=1; i<=tries; i++)); do
    "$@" && return 0
    sleep $((2 ** i))   # exponential backoff
  done
  return 1
}
retry 3 curl -sf https://example.com
```

Subshells vs command substitution:

```bash
RESULT=$(date +%s)            # modern
RESULT=`date +%s`             # old style, avoid
(cd /tmp && ls)               # subshell — cwd change doesn't leak
```

Pipes + pipefail: without -o pipefail, a failing earlier command in a pipe is hidden by the last command's success.

Trap for cleanup:

```bash
tmpdir=$(mktemp -d)
trap 'rm -rf "$tmpdir"' EXIT
```

Redirection:

  • > truncate, >> append.
  • 2> stderr, &> both.
  • 2>&1 redirect stderr to wherever stdout is pointing now.
  • <<< here-string, <<EOF ... EOF heredoc.
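Redirection order is the classic trip-up: 2>&1 duplicates stderr onto wherever stdout points at that moment, so it must come after the file redirection to capture both streams. A sketch:

```shell
f=$(mktemp)

sh -c 'echo out; echo err >&2' >"$f" 2>&1   # stdout to file first, then stderr follows it
wc -l < "$f"                                 # 2 (both lines captured)

leak=$(sh -c 'echo out; echo err >&2' 2>&1 >"$f")
echo "$leak"   # err (2>&1 ran first, copying stderr to the old stdout, not the file)

rm -f "$f"
```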

Python for ops. Use Python when:

  • Logic branches more than 2 levels deep.
  • You need structured data (JSON/YAML parsing, API calls).
  • You need testable code.
```python
#!/usr/bin/env python3
import json, subprocess, sys
r = subprocess.run(["kubectl", "get", "pods", "-o", "json"], capture_output=True, check=True, text=True)
pods = json.loads(r.stdout)["items"]
unhealthy = [p for p in pods if any(c.get("restartCount", 0) > 5 for c in p["status"].get("containerStatuses", []))]
for p in unhealthy:
    print(p["metadata"]["name"])
sys.exit(0 if not unhealthy else 1)
```

YAML/JSON/HCL/TOML.

  • YAML โ€” indentation-sensitive (spaces only, never tabs). Anchors (&) + aliases (*) for reuse. Hidden gotcha: Norway bug (NO parses as false). Use "..." quotes on any value that's not clearly a string.
  • JSON โ€” no comments. Strict. All keys are strings.
  • HCL (HashiCorp) โ€” Terraform, Packer, Nomad. Typed; has expressions and for-loops; blocks look like resource "aws_s3_bucket" "this" { ... }.
  • TOML โ€” Rust-world (Cargo), also some Python (pyproject.toml). Strict sections and types.

Essential sharp tools:

  • jq โ€” JSON on the command line: kubectl get pods -o json | jq '.items[] | .metadata.name'.
  • yq โ€” YAML processor (two flavors: Python-based by kislyuk or Go-based by mikefarah; Go version is more common today). yq '.spec.replicas' deploy.yaml.
  • envsubst โ€” substitute $VAR in templates.
  • xargs โ€” turn stdin into arguments: find . -name '*.log' -print0 | xargs -0 -P 4 gzip.
  • parallel — GNU parallel, a more powerful xargs.
  • Makefile โ€” still the clearest way to document a project's build/test/run commands. Tabs are mandatory inside recipes.

Commands you should know cold โ€‹

bash
# Exit codes tell you everything
cmd
echo $?        # 0 = success, non-zero = failure

# Chain commands conditionally
build && deploy || notify-failure

# Parallel execution with xargs
cat urls.txt | xargs -P 8 -I{} curl -sf {}

# Portable temp file with auto-cleanup
tmp=$(mktemp); trap "rm -f '$tmp'" EXIT

# Count pod restarts by name
kubectl get pods -o json \
  | jq -r '.items[] | [.metadata.name, (.status.containerStatuses[0].restartCount // 0)] | @tsv'

# Replace a value in a YAML file in place
yq -i '.spec.replicas = 5' deploy.yaml

# Read a .env file safely
set -a; source .env; set +a
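Why the set -a dance works: set -a marks every variable defined afterwards for export, so plain KEY=value lines in the sourced file become real environment variables visible to child processes; set +a turns that back off. A sketch with a throwaway file:

```shell
cat > .env.demo <<'EOF'
API_URL=https://example.com
RETRIES=3
EOF
set -a; source ./.env.demo; set +a
child_sees=$(bash -c 'echo "$API_URL"')   # exported, so a child shell sees it
echo "$child_sees"
rm -f .env.demo
```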

Gotchas & war stories โ€‹

  • Unquoted variables โ€” rm $FILE with FILE="foo bar" tries to remove two files. Always rm "$FILE".
  • cd failing silently in a script — cd /nonexistent; rm -rf * deletes your current directory's contents, because without a guard the rm still runs after the failed cd. Use set -e or chain with && (cd /nonexistent && rm -rf * never reaches the rm).
  • sudo doesn't inherit shell functions or aliases, and by default it sanitizes most of the environment too (env_reset). If you sudo my_bash_function, it won't exist.
  • find -print0 | xargs -0 is safer than find ... | xargs โ€” filenames can contain spaces, quotes, newlines.
  • eval is dangerous โ€” if user input flows in, you have shell injection. Alternatives: arrays, printf -v.
  • YAML "Norway bug" โ€” country: NO parses to false (because no was a bool in YAML 1.1). Always quote strings that look like bools or numbers.
  • Windows line endings (CRLF) in a bash script โ€” #!/bin/bash^M: bad interpreter. Fix with dos2unix or sed -i 's/\r$//' script.sh. Biggest WSL/Windows cross-over footgun.
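The CRLF fix in action — a sketch; sed -i here is the GNU form (BSD/macOS sed wants sed -i ''):

```shell
printf '#!/bin/bash\r\necho hello\r\n' > crlf.sh   # simulate a CRLF-saved script
sed -i 's/\r$//' crlf.sh                           # strip the trailing CR from each line
out=$(bash crlf.sh)                                # runs cleanly now
echo "$out"
rm -f crlf.sh
```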

Interview Q&A โ€‹

  1. What does set -euo pipefail do? โ€” -e exits on any command's non-zero exit. -u errors on unset variable references. -o pipefail makes a pipeline fail if any stage fails (not just the last). Together they turn bash from "keeps going on error" into "fail fast." Should be on almost every script.

  2. How do you safely make a temp file in a bash script? โ€” tmp=$(mktemp) creates a random path in $TMPDIR. Register cleanup immediately: trap "rm -f '$tmp'" EXIT. For a directory: tmpdir=$(mktemp -d). The trap ensures cleanup even on set -e failure or signal.

  3. [ ] vs [[ ]]? โ€” [ ] is POSIX and external-command-ish; word-splits unquoted variables and breaks on empty strings. [[ ]] is a bash builtin โ€” safer quoting, regex with =~, lexical </>. Use [[ ]] unless the script explicitly targets /bin/sh/dash.

  4. What does jq '.items[] | select(.spec.replicas > 1) | .metadata.name' do? โ€” Iterates the items array, filters to entries where .spec.replicas > 1, emits each matching entry's metadata.name. Classic shape for "pull a list of things meeting a condition from kubectl -o json."

  5. Bash or Python โ€” how do you decide? โ€” Bash is fine for sequential shell commands with minimal logic: wrap a few kubectl/curl/jq calls, exit codes, trap cleanup. Switch to Python when you need structured data manipulation, unit tests, retries with state, or concurrency beyond xargs -P. The cutover is around 100 lines, or the moment you write a function with 3+ arguments.

  6. Show me a retry-with-backoff function in bash. โ€”

bash
retry() {
  local tries=$1; shift
  for ((i=1; i<=tries; i++)); do
    "$@" && return 0
    sleep $((2**i))
  done
  return 1
}

Usage: retry 5 curl -sf https://api/health.

  7. Why is YAML fragile? — Indentation-sensitive (tabs vs spaces), type coercion surprises (yes/no/NO), multi-document files, anchor/alias complexity. Mitigations: lint with yamllint, validate against a schema (every K8s YAML has a schema), prefer tools that generate YAML (Helm, Kustomize) over hand-editing.

Further reading โ€‹

  • Bash Pitfalls โ€” 60-item list of exactly the bugs you're about to write.
  • ShellCheck โ€” automatic linter; run it in CI.
  • jq manual โ€” short, readable, will pay back 10ร—.

5. Version Control & Branching Strategies โ€‹

Why this matters โ€‹

Git is the substrate DevOps runs on: source code, IaC, Helm charts, ArgoCD manifests, pipeline definitions. A DevOps engineer is the go-to when "the merge went sideways" or "we need to pick between trunk-based and GitFlow for the platform team." Expect branching-strategy questions ("explain trunk-based dev and why you'd pick it over GitFlow") and a Git internals question or two ("what's a rebase actually doing?").

Core concepts โ€‹

Git's data model.

  • Blob โ€” a file's contents (deduplicated by SHA-1/SHA-256).
  • Tree โ€” a directory listing (pointers to blobs and sub-trees).
  • Commit โ€” a tree pointer + parent commit(s) + author/message.
  • Tag โ€” a named pointer to a commit (often signed for releases).
  • Ref โ€” a named pointer (branches and tags are both refs).

A branch is just a pointer to a commit. Moving HEAD is cheap. This is why rebases and amends feel lightweight.
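Because a branch is only a pointer, you can watch git create one as a single file — a sketch in a throwaway repo (the default branch name depends on your git config):

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q
git -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "first"
branch=$(git symbolic-ref --short HEAD)      # main or master
cat ".git/refs/heads/$branch"                # the branch: one commit SHA in a file
git branch topic                             # "creating a branch" writes one more ref file
git rev-parse topic "$branch"                # both point at the same commit
```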

Merge vs rebase.

  • Merge โ€” creates a merge commit with 2+ parents. Preserves history verbatim. Default for integrating completed feature branches into main.
  • Rebase โ€” re-applies your commits onto another base. Produces a linear history but rewrites SHAs. Default for keeping a feature branch up-to-date with main before the final merge.
  • Fast-forward merge โ€” no merge commit; branch pointer just advances. Happens automatically when the target hasn't diverged.
  • Squash merge โ€” all commits collapsed into one on integration. Clean mainline history at the cost of losing individual commit granularity.

The golden rule of rebasing: never rebase a public branch. If anyone else has pulled it, rebasing rewrites history they've based work on. Rebase your own topic branches before they're merged; never rebase main.

Branching strategies:

  • GitFlow (Vincent Driessen, 2010) โ€” main (prod), develop (integration), feature/*, release/*, hotfix/*. Long-lived branches, heavyweight. Useful if you ship quarterly with explicit release windows. Mostly obsolete for cloud-native shops.
  • GitHub Flow โ€” main + short-lived feature branches. Merge via PR. Deploy after every merge. Simple, modern, works well with CI/CD.
  • Trunk-Based Development โ€” all devs commit to main (or merge small PRs rapidly). Branches live <1 day. Feature flags hide in-progress work. Preferred by DORA's elite-performer data; matches high deploy frequency.
  • Release Flow (Microsoft) โ€” trunk-based with release branches only at ship time; hotfixes cherry-pick from main.

PR/MR workflow:

  1. Cut a branch from main.
  2. Commit often locally; force-push cleanups allowed on your own branch.
  3. Open PR; CI runs build/test/scan.
  4. Code review (โ‰ฅ1 approver).
  5. Rebase or squash-merge into main.
  6. CI kicks off deploy.

Monorepo vs polyrepo.

  • Monorepo (Google, Meta, Uber, many microservice shops) โ€” all code in one repo. Pros: atomic cross-service refactors, shared tooling, easy dep updates. Cons: tooling scale (Bazel, Nx, Turborepo), large clones, CI complexity (change-only builds).
  • Polyrepo โ€” one repo per service/lib. Pros: independent deploy/release, clear boundaries, smaller blast radius. Cons: dependency version drift, cross-cutting changes are multi-PR coordination nightmares.
  • Hybrid โ€” one repo per "domain"; infra + shared libs in a separate monorepo.

Conventional Commits. type(scope): subject โ€” e.g., feat(auth): add OIDC flow, fix(api): null check on user id, chore(deps): bump spring to 3.3.2. Feeds automated changelogs (semantic-release) and semver bumps: feat โ†’ minor, fix โ†’ patch, BREAKING CHANGE: footer โ†’ major.

Semver โ€” MAJOR.MINOR.PATCH. Breaking โ†’ major, backward-compatible feature โ†’ minor, bug fix โ†’ patch. Pre-release suffixes: 1.2.0-alpha.1, 1.2.0-rc.2. Build metadata: 1.2.0+build.42.
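The bump rules fit in a tiny bash helper — bump is a hypothetical name for illustration, not a standard tool:

```shell
bump() {                               # usage: bump <version> <major|minor|patch>
  local major minor patch
  IFS=. read -r major minor patch <<<"$1"
  case $2 in
    major) echo "$((major + 1)).0.0" ;;   # breaking change resets minor+patch
    minor) echo "$major.$((minor + 1)).0" ;;
    patch) echo "$major.$minor.$((patch + 1))" ;;
  esac
}
bump 1.2.3 patch    # 1.2.4
bump 1.2.3 minor    # 1.3.0
bump 1.2.3 major    # 2.0.0
```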

Git hooks. Client-side (.git/hooks/) or server-side. Common uses:

  • pre-commit โ€” run linters/formatters (Husky, lefthook, pre-commit framework).
  • commit-msg โ€” enforce Conventional Commits (commitlint).
  • pre-push โ€” run fast tests.
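A commit-msg check can be as small as one regex — a simplified sketch in the spirit of commitlint (a real hook would read the message from the file git passes as $1, and the type list here is illustrative):

```shell
re='^(feat|fix|chore|docs|refactor|test|ci)(\([a-z0-9-]+\))?!?: .+'
check() { [[ $1 =~ $re ]] && echo accepted || echo rejected; }
check "feat(auth): add OIDC flow"     # accepted
check "updated some stuff"            # rejected
```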

Signing. git commit -S signs with GPG or SSH. GitHub/GitLab verify and mark commits "Verified." Enforce in protected-branch rules for supply-chain hygiene.

Commands you should know cold โ€‹

bash
# See what's about to be committed
git diff --staged

# Undo last commit, keep changes staged
git reset --soft HEAD~1

# Undo last commit, keep changes unstaged
git reset HEAD~1

# Undo last commit, lose changes (DANGEROUS)
git reset --hard HEAD~1

# Oops, pushed the wrong thing โ€” revert (safer than force-push)
git revert <sha>

# Interactively rebase last 5 commits (squash, reorder, reword)
git rebase -i HEAD~5

# Pull with rebase (avoid ugly merge commits on each pull)
git pull --rebase

# Find which commit introduced a bug between two known commits
git bisect start
git bisect bad HEAD
git bisect good v1.2.0
# ... test, then `git bisect good` or `git bisect bad` until found
git bisect reset

# Who last touched this line?
git blame -L 42,50 path/to/file

# Find a string across all of history
git log -S 'mysterious_function' --all --source

# Apply someone's change from another branch
git cherry-pick <sha>

# Stash uncommitted work while you switch tasks
git stash push -m "wip: auth refactor"
git stash pop

# Recover a branch you 'deleted'
git reflog
git checkout -b my-branch <sha-from-reflog>

# Force-push safely (refuses if remote moved)
git push --force-with-lease

Gotchas & war stories โ€‹

  • git push --force overwriting teammates' work โ€” always --force-with-lease (refuses if remote moved since your last fetch).
  • Committing secrets โ€” they live in history forever even if you "delete" the file. Use git-filter-repo or BFG to purge, then rotate the secret (treat it as leaked regardless). Pre-commit secret scanners (gitleaks, trufflehog) catch most attempts.
  • Line-ending wars (CRLF vs LF) โ€” .gitattributes with * text=auto eol=lf prevents Windows devs from checkerboarding the repo.
  • Rebasing shared branches โ€” wrecks teammates' clones. Rule: if others have pulled it, merge. If only you, rebase.
  • git pull without --rebase โ€” creates noisy merge commits on every sync. Set pull.rebase = true globally.
  • LFS blobs in regular history โ€” cloning a 10GB repo over a coffee-shop connection. Use LFS intentionally.
  • git clean -fdx on the wrong repo โ€” nukes gitignored files, including .envs with unsaved local state.

Interview Q&A โ€‹

  1. What's the difference between merge and rebase? โ€” Merge integrates two branches with a new merge commit (2+ parents); preserves exact history. Rebase re-applies your commits onto a new base, producing a linear history but rewriting SHAs. Rule of thumb: rebase local topic branches to stay current; merge to integrate completed work.

  2. When should you NEVER rebase? โ€” When the branch is shared/public and someone else has based work on it. Rewriting history changes every downstream commit's parent; teammates pulling will have hellish conflicts and might push the old history back. Rebase is for your own in-progress topic branches only.

  3. What's trunk-based development, and why do DORA elite performers use it? โ€” All devs integrate small changes into main multiple times a day; branches live <24 hours; in-progress features hide behind feature flags. It minimizes merge pain and integration debt, and keeps the mainline continuously deployable. DORA research shows trunk-based plus short-lived branches strongly correlates with high deploy frequency and low change-failure rate.

  4. GitFlow vs GitHub Flow โ€” when each? โ€” GitFlow fits slow-release, regulated, versioned-product shops (quarterly releases, multiple supported versions). GitHub Flow (main + short-lived feature branches) fits continuous-deploy SaaS. 2025-era cloud-native shops almost universally use GitHub Flow or trunk-based; GitFlow has become a red flag of overbuilt process.

  5. How do you find a bug that was introduced "sometime in the last month"? โ€” git bisect. Mark a known-bad commit (e.g., HEAD) and a known-good commit (e.g., a tag from a month ago). Git binary-searches; you run your repro on each intermediate commit, mark good/bad, and in ~log2(range) steps you've narrowed it to one commit.

  6. You accidentally pushed a secret. What now? โ€” Rotate the secret immediately (treat it as leaked even if the push was quick). Then purge it from history with git filter-repo or BFG; force-push to overwrite the remote history. Warn teammates โ€” everyone who cloned needs to re-clone. Add a pre-commit secret scanner so it can't happen again.

  7. Why use Conventional Commits? โ€” Machine-parseable commit messages let you auto-generate changelogs and choose the right semver bump (feat โ†’ minor, fix โ†’ patch, BREAKING CHANGE โ†’ major). Pairs with semantic-release or release-please to fully automate versioning and release notes.

  8. Signed commits โ€” why? โ€” Proves the commit came from who it claims, not someone who ran git config user.name "ceo@example.com". Supply-chain hardening: protected-branch rules can require verified signatures, which combined with branch protection and CODEOWNERS makes it much harder for a compromised dev laptop to slip malicious commits into main.

Further reading โ€‹


Part II โ€” Build & Delivery โ€‹

6. CI/CD Pipelines (Deep Dive) โ€‹

Why this matters โ€‹

If DevOps has one signature artifact, it's the pipeline. It's the automated contract that says "code on main is always deployable" and "no change reaches prod without passing the gates." Interviewers will ask you to design a pipeline from scratch, diagnose a slow one, or defend a CI/CD tool choice. They're also probing for pipeline security โ€” artifact signing, OIDC to cloud, SBOM.

Core concepts โ€‹

CI vs CD vs CD โ€” the three-letter distinction:

  • Continuous Integration โ€” every change is merged to main frequently (ideally daily), and main is always building + passing tests. This is a developer practice, not just a tool.
  • Continuous Delivery โ€” every change that passes CI is deployable to prod at any time. A human chooses when to ship.
  • Continuous Deployment โ€” every change that passes CI is automatically deployed to prod. No human in the loop beyond merge.

You can do CI without CD; you can't do CD without CI. Many shops say "CD" but mean "Continuous Delivery" โ€” ask which they mean.

Canonical pipeline stages:

1. Checkout            - git clone the commit
2. Cache restore       - restore deps, build artifacts
3. Build               - compile, bundle
4. Lint                - static analysis
5. Unit test           - fast, isolated
6. Integration test    - slower, real deps (Testcontainers etc.)
7. Security scan       - SAST (SonarQube/Semgrep), SCA (Dependabot/Snyk), secret scan
8. Package             - jar, docker image, helm chart
9. Sign                - Cosign/Notation on image + SBOM
10. Publish            - push to Artifactory/ECR/GHCR
11. Deploy (nonprod)   - Argo sync or direct apply
12. Integration tests  - against deployed env (smoke, contract, e2e)
13. Promote            - manual or automatic gate โ†’ prod
14. Deploy (prod)      - progressive delivery (canary/blue-green)
15. Post-deploy        - verify SLOs, roll back on regression

Not every project needs all of these, but every senior pipeline can articulate which they include and why.

Build caching & parallelism. CI time compounds with team size — a 20-minute pipeline × 50 devs × 10 PRs/day = 10,000 minutes, roughly 7 days of wall-clock waiting per day. Tactics:

  • Layer caches per stage (deps, compiled artifacts).
  • Docker BuildKit cache mounts (RUN --mount=type=cache).
  • Matrix / parallel test shards (split 10k tests across 10 runners).
  • Change-only builds (Bazel, Nx, Turborepo) โ€” skip what didn't change.
  • Self-hosted runners for long warm caches.
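The cache-key discipline behind layer caching boils down to "hash the lockfile, not the branch name", so the cache invalidates exactly when dependencies change. A sketch with an illustrative lockfile:

```shell
tmp=$(mktemp -d)
printf '{ "lodash": "4.17.21" }\n' > "$tmp/package-lock.json"
cache_key="deps-$(sha256sum "$tmp/package-lock.json" | cut -c1-12)"
echo "$cache_key"          # deps-<12 hex chars>; changes only when the lockfile does
rm -rf "$tmp"
```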

Artifact repositories โ€” don't ship from your CI workspace. Publish to a durable artifact store:

  • JFrog Artifactory / Sonatype Nexus โ€” polyglot binary repositories (Maven, npm, Docker, Helm, generic).
  • GitHub Container Registry (GHCR), AWS ECR, Azure ACR, Google GAR โ€” cloud-native container registries.
  • Harbor โ€” self-hosted registry with scanning + signing built in.

Every artifact must be immutable and versioned. latest tags are anti-patterns in prod.

Pipeline-as-code. Define the pipeline in a file that lives with the code (.github/workflows/*.yml, .gitlab-ci.yml, Jenkinsfile, .circleci/config.yml). Benefits: version-controlled, PR-reviewable, branches can test pipeline changes. Click-ops pipelines (old Jenkins UI) are a major anti-pattern.

Runners โ€” ephemeral vs self-hosted, cloud vs self-hosted.

  • Cloud-hosted, ephemeral (GitHub-hosted, GitLab.com shared) โ€” zero maintenance, public IP, no state between runs. Default for most.
  • Self-hosted, ephemeral โ€” you run the runner VM, tear it down after each job. Better for private network access + regulated/air-gapped.
  • Self-hosted, persistent โ€” long-lived runner holding caches. Fastest; but opens state/security concerns (poisoned cache attacks).
  • ARC (Actions Runner Controller) for GitHub Actions on K8s โ€” auto-scales ephemeral runners as K8s pods.

OIDC to cloud (replace static creds). Modern CI can mint a short-lived OIDC token that the cloud trusts (via a trust policy on an IAM role) and exchange it for temporary AWS/GCP/Azure credentials — no long-lived access keys stored as secrets. GitHub Actions → AWS via aws-actions/configure-aws-credentials with role-to-assume + id-token: write permission. This is the current best practice; be ready to explain it.

Deployment strategies (dig deeper in Section 23 Progressive Delivery):

  • Rolling โ€” K8s default; incrementally replace pods.
  • Blue/Green โ€” two full environments; cutover via LB.
  • Canary โ€” small % on new version; analyze metrics; ramp.
  • Shadow โ€” mirror traffic to new version; don't use its responses.

Testing pyramid in CI โ€” fast cheap tests at the bottom, slow expensive tests at the top:

       /\         E2E (few, slow)
      /  \        Integration (some, medium)
     /    \       Component (many, fast)
    /______\      Unit (huge, millisecond)

Tool-by-tool (what to know):

  • Jenkins โ€” declarative Jenkinsfile (pipeline DSL), scripted pipelines (Groovy), agents (nodes where work runs), shared libraries, Blue Ocean UI. Heavyweight; runs on-prem well; plugin ecosystem is enormous but brittle.
  • GitHub Actions โ€” workflows in .github/workflows/*.yml. Concepts: triggers (on:), jobs, steps, uses: for actions, with: for inputs. Composite actions = reusable step blocks. Reusable workflows = callable full workflows. Matrix strategy. Secrets + variables at repo/env/org scope. Environments with approvals + protection rules. Concurrency groups to cancel superseded runs. OIDC id-token: write for cloud auth.
  • GitLab CI โ€” .gitlab-ci.yml with stages: and jobs:. DAG via needs:. Child pipelines. Includes for reuse. Built-in container registry, packages, security scanning (Ultimate tier).
  • CircleCI โ€” .circleci/config.yml, orbs (reusable config), workflows, contexts.
  • Azure DevOps Pipelines โ€” azure-pipelines.yml, stages/jobs/steps, templates.
  • Tekton โ€” Kubernetes-native pipelines (CRDs: Pipeline, Task, PipelineRun). Each step is a container. Powerful but low-level.
  • Drone โ€” container-native, simple YAML.
  • Buildkite โ€” hybrid (SaaS control plane, your runners).

Commands / config you should know cold โ€‹

Minimal GitHub Actions workflow with OIDC to AWS, K8s deploy, and caching:

yaml
# .github/workflows/ci.yml
name: CI
on:
  push: { branches: [main] }
  pull_request: { branches: [main] }

concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-java@v4
        with: { distribution: temurin, java-version: '21', cache: maven }
      - run: mvn -B verify

  build-and-push:
    needs: test
    runs-on: ubuntu-latest
    permissions:
      id-token: write           # for OIDC
      contents: read
    steps:
      - uses: actions/checkout@v4
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123:role/gh-ci
          aws-region: us-east-1
      - uses: docker/setup-buildx-action@v3
      - run: |
          docker buildx build \
            --push \
            --cache-from type=gha \
            --cache-to type=gha,mode=max \
            -t 123.dkr.ecr.us-east-1.amazonaws.com/app:${{ github.sha }} \
            .
      - name: Sign image
        run: cosign sign --yes 123.dkr.ecr.us-east-1.amazonaws.com/app:${{ github.sha }}

Minimal Jenkinsfile (declarative):

groovy
pipeline {
  agent { kubernetes { yaml '''<pod spec>''' } }
  options { timestamps(); buildDiscarder(logRotator(numToKeepStr: '20')) }
  triggers { pollSCM('H/5 * * * *') }
  stages {
    stage('Test') { steps { sh 'mvn -B verify' } }
    stage('Build image') {
      steps { sh "docker build -t app:${env.GIT_COMMIT} ." }
    }
    stage('Deploy')  { when { branch 'main' } ; steps { sh './deploy.sh' } }
  }
  post { failure { slackSend channel: '#ops', message: "Build failed: ${env.BUILD_URL}" } }
}

Gotchas & war stories โ€‹

  • Flaky tests poison the pipeline โ€” once people start re-running on red, they stop trusting all failures. Quarantine flaky tests in a separate tier and fix or delete them on a schedule.
  • Caching the wrong key โ€” cache package-lock.json hash, not branch name. Otherwise you get stale deps silently.
  • Long-running Jenkins masters โ€” over time their state rots (plugin conflicts, old workspaces, lost credentials). Treat as cattle: rebuild from declarative config (JCasC) quarterly.
  • Runner poisoning on self-hosted persistent runners — a PR from a fork can run arbitrary code with access to your caches. Require manual approval for workflow runs from forked PRs; prefer ephemeral runners.
  • Static cloud creds in CI secrets โ€” the #1 supply-chain hazard. Move to OIDC trust. If creds leak, rotate + investigate.
  • latest tag in deploy manifests โ€” every cluster restart could pull a different image. Tag immutably (commit SHA or semver) always.
  • No timeout on pipeline steps โ€” a hung test suite will burn runner minutes forever.
  • Secrets in build logs โ€” echo $TOKEN or verbose http loggers. GitHub Actions masks registered secrets in logs; don't disable it.

Anchor example โ€‹

A typical small-team .github/workflows/ci.yml is a clean reference: backend build+test, frontend lint+test+build, Docker image build on PR/push to main; a separate owasp-nightly.yml handles the slow daily scan. Pattern to cite: split fast PR gate (must-pass) from slow nightly scans (run async, file issues on findings).

A 99% deploy success rate is a pipeline metric โ€” CFR ~1%. Tie this back to specific practices: automated integration tests (Testcontainers, real DB), mandatory ArgoCD self-heal, rollback playbooks.

Interview Q&A โ€‹

  1. Walk me through a CI/CD pipeline you've designed. โ€” Start with the trigger (push or PR), then checkout โ†’ build โ†’ test (unit + integration with Testcontainers) โ†’ static analysis โ†’ security scan (SAST + SCA) โ†’ Docker build with BuildKit + cache โ†’ sign with Cosign โ†’ push to ECR โ†’ bump a Helm values file in a GitOps repo โ†’ ArgoCD picks it up and syncs to a dev namespace โ†’ smoke tests โ†’ manual promote gate for prod. Key principles: pipeline-as-code, OIDC to cloud (no static creds), immutable artifacts, one pipeline per service.

  2. CI vs CD vs Continuous Deployment? โ€” CI is the integration practice: devs merge small changes to main frequently and main is always green. Continuous Delivery means the artifact that passes CI is deployable at any time โ€” a human chooses when. Continuous Deployment drops the human: every green main is automatically promoted to prod.

  3. How do you speed up a 45-minute CI pipeline? โ€” First, profile โ€” look at stage durations. Usual wins: parallelize test shards (GitHub Actions matrix, TestNG/JUnit parallel classes), aggressive layer caching (Gradle/Maven/npm cache, Docker BuildKit cache), change-only builds (Turborepo/Nx/Bazel), self-hosted runners with warm caches for big workspaces, kill flaky tests (10 retries ร— 3 flaky tests eats 5 minutes), move slow scans (OWASP DC full scan) to nightly not per-PR.

  4. How do you authenticate GitHub Actions to AWS without storing keys? โ€” OIDC federation. Trust the GitHub OIDC provider in an IAM role's trust policy; constrain by repo/branch/environment claims. Workflow sets permissions: id-token: write, then aws-actions/configure-aws-credentials exchanges the OIDC token for short-lived STS creds. No long-lived access keys anywhere.

  5. What's the difference between a reusable workflow and a composite action in GitHub Actions? โ€” Composite action = reusable sequence of steps, lives as an action, called with uses:. Reusable workflow = a full workflow callable by another workflow via uses: org/repo/.github/workflows/x.yml@ref. Composite actions are for small step-blocks; reusable workflows are for full multi-job pipelines shared across repos.

  6. Jenkins vs GitHub Actions vs GitLab CI โ€” trade-offs? โ€” Jenkins: most flexible, self-hosted, plugin-rich, high ops cost; good when you need complex custom logic or regulated air-gapped. GitHub Actions: lowest setup cost, best ecosystem of actions, tightly tied to GitHub, OIDC cloud auth, free for public repos. GitLab CI: strong if you're already on GitLab, integrated DevOps (registry, scanning, environments) out of the box, powerful DAG with needs:. For a greenfield GitHub shop, Actions wins on ergonomics.

  7. How do you handle secrets in a pipeline? โ€” (1) Never commit them. (2) Use the CI's secret store (GitHub Actions secrets / GitLab CI variables with masking). (3) Prefer OIDC federation to cloud providers so no cloud creds exist as secrets. (4) For non-cloud secrets (API keys), store in a secrets manager (Vault, AWS Secrets Manager) and fetch at runtime. (5) Mask in logs; scan in PR for accidental inclusion (gitleaks).

  8. How do you ensure pipeline reproducibility? โ€” Pin action/image versions to SHA, not tags. Pin dependencies via lockfiles (package-lock.json, Pipfile.lock, go.sum, poetry.lock). Use immutable build artifacts tagged with the commit SHA. Store build metadata (provenance). For serious reproducibility, SLSA Level 3+ with a hosted build service and isolated build environments.

  9. What is a pipeline's "blast radius" and how do you constrain it? โ€” How much damage a bad pipeline run can do. Constrain via: scoped cloud IAM roles (least privilege per pipeline/env), no cross-env write access, approval gates before prod, pipeline-level network policies (CI can't talk to prod DB), rate limits on deploys. Audit via CloudTrail/GitHub audit log.

  10. Your main branch is red. What do you do? โ€” Treat it as an all-hands situation. Don't merge over it. Revert the offending commit if the fix isn't immediate (git revert <sha>). Ping the author. Open a TPM/ops channel with the RCA. If it's repeatedly flaky, escalate to "stop the line" โ€” no merges until green. Long-term: add a canary/quarantine suite so flaky tests stop gating main.

Further reading โ€‹


7. Infrastructure as Code (Terraform, Ansible, Pulumi) โ€‹

Why this matters โ€‹

Clicking in a cloud console doesn't scale, can't be reviewed, and drifts silently. IaC turns infrastructure into code: reviewable, versioned, replayable. Terraform is dominant for provisioning; Ansible dominant for config management; Pulumi/CDK gaining for "real-language" infra. Expect at least one deep Terraform question (state, modules, or a drift scenario) and a Terraform-vs-Ansible-vs-CloudFormation comparison.

Core concepts โ€‹

Declarative vs imperative.

  • Declarative (Terraform, CloudFormation, K8s YAML) โ€” "this is the desired end state." Engine diffs current vs desired and computes the plan.
  • Imperative (Ansible playbooks, bash) โ€” "run these steps in order." You describe the path, not the destination.

Declarative wins for provisioning (reproducibility, drift detection). Imperative wins for sequences that are genuinely step-based (runtime installers, ordered migrations). Modern Ansible is "mostly declarative" โ€” modules are idempotent โ€” but the playbook orchestration is ordered.

Mutable vs immutable infrastructure.

  • Mutable โ€” SSH in, apt upgrade, patch in place. History of drift.
  • Immutable โ€” bake a new AMI/image, replace old instances, never patch in place. Preferred for cloud-native. Containers are the ultimate immutable delivery vehicle; images aren't upgraded in place, they're replaced.

Terraform deep dive โ€‹

Core objects:

  • Provider โ€” plugin that knows how to talk to a specific API (AWS, Azure, Kubernetes, GitHub, Datadog, etc.).
  • Resource โ€” managed cloud object (resource "aws_s3_bucket" "logs" { ... }).
  • Data source โ€” read-only lookup (data "aws_vpc" "main" { ... }).
  • Module โ€” reusable collection of resources with inputs (variables) and outputs.
  • Variable โ€” input: variable "env" { type = string }.
  • Output โ€” exported value: output "bucket_arn" { value = aws_s3_bucket.logs.arn }.
  • Local โ€” computed internal value: locals { tags = { env = var.env } }.
  • Backend โ€” where state is stored (see below).
  • Workspace โ€” a named instance of a config (effectively a separate state file).

State. Terraform tracks what it created in a JSON state file (terraform.tfstate). State is the source of truth for "what exists." Critical properties:

  • State contains secrets. Never commit it. Store in a remote backend (S3 + KMS, Terraform Cloud, GCS, Azure Storage).
  • Concurrent apply corrupts state. Use state locking โ€” S3+DynamoDB table; Terraform Cloud; GCS has native locking.
  • State can be manipulated: terraform state list, terraform state rm, terraform state mv, terraform state pull, terraform state push. These are surgical tools โ€” don't use them casually.

Typical S3 backend:

hcl
terraform {
  backend "s3" {
    bucket         = "my-tf-state"
    key            = "prod/network/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "tf-lock"
    encrypt        = true
  }
}

Modules. A module is just a directory of .tf files consumed by another config:

hcl
module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "~> 5.0"
  name    = "prod"
  cidr    = "10.0.0.0/16"
  azs     = ["us-east-1a", "us-east-1b", "us-east-1c"]
  # ...
}

Always pin versions for registry modules. For internal modules, use Git refs (?ref=v1.2.0).

Iteration:

  • count = 3 โ€” creates 3 of a resource; indexed res[0], res[1], res[2].
  • for_each = { dev = "...", prod = "..." } โ€” creates one per map entry, keyed by map key.

Rule: prefer for_each over count. Inserting into the middle of a count list causes all later resources to re-index (and often be destroyed and recreated). for_each is stable.
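A minimal sketch of the stable-keys behavior (resource and bucket names illustrative): removing one map entry later destroys only that keyed instance, with no re-indexing of its siblings.

```hcl
variable "buckets" {
  type    = map(string)
  default = { dev = "acme-logs-dev", prod = "acme-logs-prod" }
}

resource "aws_s3_bucket" "this" {
  for_each = var.buckets          # instances keyed as this["dev"], this["prod"]
  bucket   = each.value
  tags     = { env = each.key }
}
```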

Drift. When someone changes resources outside Terraform (clicks in the console), state and reality diverge. terraform plan (which refreshes by default) compares desired config against state and real infrastructure; terraform plan -refresh-only (or the legacy terraform refresh) syncs state from reality without proposing changes. Best practice: treat drift as a plan failure and force remediation — either revert the manual change, or codify it in config (using terraform import if the resource was created by hand).

Import:

```bash
terraform import aws_s3_bucket.logs my-existing-bucket
```

Imports the resource into state. You still have to write the matching resource block. Terraform 1.5+ supports import blocks in config for reproducible imports.
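A sketch of the declarative form, reusing the bucket above; terraform plan -generate-config-out can even draft the matching resource block for you:

```hcl
import {
  to = aws_s3_bucket.logs
  id = "my-existing-bucket"
}

resource "aws_s3_bucket" "logs" {
  bucket = "my-existing-bucket"
}
```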

Key commands:

```bash
terraform init              # download providers/modules, configure backend
terraform fmt -recursive    # canonicalize formatting
terraform validate          # syntax + internal references
terraform plan -out=tfplan  # show proposed changes; save to a file
terraform apply tfplan      # execute the saved plan
terraform destroy           # tear down everything this config manages

terraform state list
terraform state show <addr>
terraform state rm <addr>   # "un-manage" a resource without destroying it
terraform taint <addr>      # (legacy) force recreation on next apply
terraform apply -replace=<addr>   # modern equivalent of taint
terraform import <addr> <id>
```

Provisioners — ways to run scripts from within Terraform (local-exec, remote-exec, file). Generally avoid: they violate declarativeness and can't be re-applied cleanly. Prefer user_data (cloud-init), configuration management (Ansible after provisioning), or a separate K8s/Nomad workload.

Terraform Cloud / Enterprise — hosted backend + collaboration layer. Runs apply remotely, enforces policy, drift detection, VCS integration. Free tier for small teams.

Atlantis — open-source PR automation: post atlantis plan in a PR, it runs plan and comments the diff; atlantis apply applies after merge. Used by teams that want Terraform Cloud features without the SaaS.

Policy as Code.

  • Sentinel (Terraform Cloud/Enterprise) — policy language; reject plans that violate rules.
  • OPA / Rego — general-purpose; conftest runs Rego against plan JSON.
  • Checkov, tfsec, Terrascan — rule-based scanners for common misconfigs (public S3, unencrypted EBS).

Terraform vs Ansible vs CloudFormation vs Pulumi vs CDK vs Crossplane

| Tool | Paradigm | Strength | Weakness |
| --- | --- | --- | --- |
| Terraform | Declarative HCL | Cloud-agnostic, huge provider ecosystem, state-based drift detection | State management overhead, HCL is bespoke |
| Ansible | Imperative YAML | Agentless config management via SSH, great for OS-level tasks | Not great at cloud resource provisioning (despite modules) |
| CloudFormation | Declarative JSON/YAML | AWS-native, no external state to manage | AWS only, verbose, slow, drift visibility weaker |
| Pulumi | Imperative code (TS/Py/Go/Java/.NET) | Real programming language, testable | Newer ecosystem, commercial backend free-tier limits |
| AWS CDK | Imperative code (TS/Py/Java/etc.) → CloudFormation | Real language; AWS-native | AWS-only, compiles to CFN so inherits its limits |
| Crossplane | Declarative K8s CRDs | Manage cloud from inside K8s, GitOps-native | K8s-first mindset required, operator model |

Ansible essentials

  • Playbook — YAML file orchestrating tasks over hosts.
  • Inventory — host list (INI or YAML; static or dynamic via plugins).
  • Role — reusable task/handler/template bundle (directory layout: tasks/, handlers/, templates/, vars/, defaults/, files/, meta/).
  • Module — built-in unit of work (apt, copy, template, systemd, k8s). Most modules are idempotent.
  • Handlers — tasks that run only when notified (restart a service after a config change).
  • Jinja2 templates — {{ var }}, {% if %}, filters.
  • Ansible Vault — encrypt secrets in the repo: ansible-vault encrypt secrets.yml.

Example playbook:

```yaml
- hosts: web
  become: yes
  vars:
    nginx_port: 8080
  tasks:
    - name: Install nginx
      apt: { name: nginx, state: present, update_cache: yes }
    - name: Deploy config
      template: { src: nginx.conf.j2, dest: /etc/nginx/nginx.conf }
      notify: restart nginx
  handlers:
    - name: restart nginx
      systemd: { name: nginx, state: restarted, enabled: yes }
```

Commands you should know cold

```bash
# Terraform — day-to-day loop
terraform init -upgrade
terraform fmt -recursive && terraform validate
terraform plan -out=plan.out
terraform apply plan.out

# Refactor state (rename a resource in config without destroying)
terraform state mv aws_s3_bucket.old aws_s3_bucket.new

# Force-recreate a single resource
terraform apply -replace='aws_instance.web[0]'

# Import an existing resource
terraform import 'module.vpc.aws_vpc.this' vpc-0abcd1234

# Ansible
ansible-inventory --graph
ansible-playbook -i inventory site.yml --check --diff
ansible all -m ping
ansible web -a "uptime" -b           # ad-hoc command, become root
```

Gotchas & war stories

  • Committing terraform.tfstate — state contains plaintext secrets (RDS passwords, API tokens). Remote backend + encryption, always.
  • Concurrent apply — without locking, two engineers apply at once and corrupt state. Always use a locking backend.
  • count = length(var.list) when list is computed — plan can't figure out count before apply; errors with "value must be known at plan time." Fix: for_each = toset(var.list) or refactor.
  • Destroying by accident — terraform apply with a surprise delete. Habit: read the plan diff carefully; forbid -auto-approve in prod; enable the prevent_destroy lifecycle on critical resources.
  • Provider version drift — old provider caches can re-create resources after a version bump. Pin required_providers with version constraints.
  • Secrets in variables — Terraform logs variable values on errors. Mark sensitive: variable "db_password" { sensitive = true }.
  • for_each over sensitive maps — keys become part of the resource address and show up in plan output, so they can leak secrets; Terraform errors outright if the for_each value itself is sensitive.
  • Ansible's "it worked on the 2nd run" — a non-idempotent task. Every shell/command module use is suspect; prefer built-in modules with state:.
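Two of the fixes above in HCL (version constraints are illustrative):

```hcl
terraform {
  required_version = ">= 1.6"
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"   # pin to a major series
    }
  }
}

variable "db_password" {
  type      = string
  sensitive = true   # redacted in plan output and logs
}
```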

Anchor example

If Helm and ArgoCD are central but Terraform is adjacent, lean on: "We used Helm for K8s, ArgoCD for GitOps, and touched Terraform for shared infra (VPC, IAM, EKS cluster). The pattern to advocate is module-per-concern (network, cluster, DNS), remote state in S3+DynamoDB, plan-on-PR via Atlantis or TF Cloud, and no manual console changes to managed resources."

Interview Q&A

  1. Explain Terraform state. — JSON file that maps resources in your config to real-world IDs and tracks metadata. Required because Terraform can't otherwise know whether a resource exists, and matters because apply is desired (config) − state = plan. Store remotely with encryption (S3+KMS+DynamoDB lock, or TF Cloud), never in git, and treat it as secret.

  2. Module vs resource vs data source. — Resource declares a managed cloud object Terraform will create/update/destroy. Data source reads external state read-only. Module is a reusable bundle of resources/data/variables/outputs, called with module "x" { source = "...", <inputs> }. Modules are the unit of reuse; resources are the unit of management.

  3. count vs for_each — which do you reach for? — for_each with a map or set, almost always. Inserting at the start of a list passed to count re-indexes everything and causes mass destruction. for_each keys are stable, so changes are localized. Only use count for "create N copies of an identical thing," with a known number.

  4. Someone changed a resource manually in the AWS console. What happens on next plan? — Plan shows a drift: "resource changed outside Terraform." You choose: revert the manual change (let Terraform re-apply desired state), or pull the change into config and run -refresh-only / re-apply. Teams with drift policies configure CI to fail on drift (terraform plan -detailed-exitcode).

  5. How do you refactor without destroying resources? — terraform state mv renames resources in state; import adopts existing resources; removed blocks (TF 1.7+) drop from management without destroying. For modules, move the source and run state mv module.old module.new. Always test on a fresh plan before running apply.

  6. Terraform vs Ansible — how do you pick? — Terraform for provisioning (cloud resources, networks, managed DBs). Ansible for configuration management (package install, file templates, service restarts on hosts you control). They overlap only a little. Modern K8s-heavy shops lean more on Terraform + K8s operators/Helm, with Ansible reserved for VM-based legacy.

  7. Why not commit state to Git? — State contains secrets (plaintext passwords, API tokens), is write-sensitive (concurrent writes corrupt it), and can be large. Remote backends solve all of that: encryption at rest, locking, versioning, team access control.

  8. What is drift and how do you detect/prevent it? — Drift = real infrastructure diverged from Terraform state (someone clicked). Detect: run terraform plan periodically (CI cron, TF Cloud drift detection); fail if diff is non-empty. Prevent: IAM policies denying humans write access to Terraform-managed resources, only letting the CI role change them.

  9. How do you test Terraform code? — Static: terraform validate, tflint, checkov/tfsec for security. Unit: terratest (Go) or Terraform test (TF 1.6+) for in-language tests. Integration: spin up in a sandbox account, assert with provider APIs. Policy: OPA/Sentinel/Checkov in CI gates PRs.

  10. What's a Terraform provisioner and why avoid it? — Runs a script inline during apply (local-exec on the Terraform host, remote-exec on the new resource). Problems: non-idempotent, can't re-run, failure handling is weak, mixes layers. Better: user_data/cloud-init for instance bootstrap, Ansible for post-provision config, K8s for workload-level concerns.

Further reading


8. Configuration Management

Why this matters

Even in a cloud-native world, you've got EC2/VM images to bake, golden AMIs to keep patched, Windows fleets to configure, network appliances to push config to, and bare-metal nodes in regulated environments. Knowing the push-vs-pull and agent-vs-agentless trade-offs is a standard senior interview question.

Core concepts

Push vs pull.

  • Push (Ansible) — controller initiates, SSHes into targets, pushes changes.
  • Pull (Puppet, Chef, Salt) — agent on each node polls a central server, pulls its assigned catalog.

Push is simpler to operate (no agents); pull scales to huge fleets (nodes don't need to be reachable from the controller). Modern ops leans Ansible-heavy: for most fleets, push's operational simplicity outweighs its scaling pain, and the trade-off only flips somewhere past ~1,000 hosts — by which point you probably want Terraform + immutable images anyway.

Tools:

  • Ansible — YAML, agentless, SSH-based (WinRM for Windows), idempotent modules. Owned by Red Hat. Most popular today.
  • Chef — Ruby DSL, agent-based (chef-client polls a chef-server), cookbook/recipe model. Slower-moving; still strong in some financial/enterprise shops.
  • Puppet — custom DSL, agent-based, declarative resource model, manifest/module hierarchy. Strong in traditional enterprise / Windows.
  • SaltStack — Python, supports push and pull, ZeroMQ-based "salt minions" for fast fan-out. Niche but powerful.
  • CFEngine — original (1993), still used in high-security/autonomous environments.

Golden images (Packer). Bake a machine image (AMI, Azure image, VMware template, qcow2) once, with all your packages and configs, then deploy immutable copies. Packer templates orchestrate "spin up VM → run provisioners (shell/Ansible/Chef) → snapshot → destroy." Output: a signed, versioned image.
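A minimal Packer HCL sketch of that flow, assuming the amazon plugin and an Ansible provisioner; the AMI ID, region, and playbook name are placeholders:

```hcl
packer {
  required_plugins {
    amazon = {
      source  = "github.com/hashicorp/amazon"
      version = ">= 1.2"
    }
  }
}

source "amazon-ebs" "app" {
  region        = "us-east-1"
  source_ami    = "ami-0abcd1234"   # base image to build on
  instance_type = "t3.small"
  ssh_username  = "ubuntu"
  ami_name      = "app-${formatdate("YYYYMMDD-hhmm", timestamp())}"
}

build {
  sources = ["source.amazon-ebs.app"]

  provisioner "ansible" {
    playbook_file = "bake.yml"      # installs packages, hardens the OS
  }
}
```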

cloud-init. Early-boot config on modern Linux cloud images. Reads user-data (provided by the cloud), runs scripts/packages/users setup on first boot. Works well for small tweaks; for anything complex, bake into the golden image with Packer.
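A minimal #cloud-config sketch (hostname, package, user, and key are all placeholders):

```yaml
#cloud-config
hostname: web-01
packages:
  - chrony
users:
  - name: deploy
    groups: [sudo]
    ssh_authorized_keys:
      - ssh-ed25519 AAAA... deploy@laptop   # placeholder public key
runcmd:
  - systemctl enable --now chrony
```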

Drift handling. When a config management system re-applies on a schedule (agent pull every 30 min), drift is self-correcting. In push-based Ansible, you need a cron/Tower/AWX run. In an immutable-image world, drift isn't a concern — you replace the instance.

Commands you should know cold

```bash
# Ansible dry run (check mode)
ansible-playbook -i inv site.yml --check --diff

# Limit to a host or group
ansible-playbook -i inv site.yml --limit web[0:2]

# Ad-hoc shell across a group
ansible web -i inv -m shell -a "uptime" -b

# Build an AMI with Packer
packer build -var "region=us-east-1" aws-app.pkr.hcl

# cloud-init: get instance's user-data
cat /var/lib/cloud/instance/user-data.txt
# Re-run cloud-init (rare; usually for debugging)
cloud-init clean && cloud-init init
```

Gotchas & war stories

  • Ansible's shell/command module use — nearly always a sign of missing idempotency. Prefer built-in modules with explicit state: present|absent.
  • Ordering in Ansible — tasks run top-to-bottom per play; if task 3 depends on task 5's side effect, either re-order or use handlers.
  • Chef/Puppet agent drift — a node that's offline for a week during a policy change comes back wrong; watch reports for "stale node."
  • Mixing Terraform and cloud-init for the same concern — the race between "Terraform says instance exists" and "cloud-init still running" bites. Wait for instance-ok + health check.
  • Packer + Ansible duplicating work — if Ansible is part of the Packer build and runs after provisioning, you apply the same change twice. Pick one.
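When shell/command really is unavoidable (first bullet above), guard it so re-runs are no-ops; a hedged sketch where the script path and marker file are hypothetical, and the script is assumed to write the marker on success:

```yaml
- name: Initialize the app database (runs at most once)
  command: /opt/app/bin/initdb.sh
  args:
    creates: /opt/app/.db-initialized   # task is skipped if this file exists
```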

Interview Q&A

  1. Push vs pull config management — trade-offs? — Push (Ansible) is simpler to operate: no agents, just SSH + sudo. Breaks down above a few thousand hosts (controller bottleneck, concurrent SSH). Pull (Puppet/Chef) scales to huge fleets because nodes pull their own catalog on a schedule, but you pay the agent + server ops cost. Most modern shops pick Ansible unless they have specific scale or reach constraints.

  2. Why build a golden image instead of configuring at boot? — Speed (boot-to-ready in seconds, not minutes while packages install), determinism (same bits every time), reproducibility (rebuild the image from source, not from the state of the internet on a given day), and easier rollback (tag the old image). cloud-init still applies instance-specific tweaks on top (hostname, SSH keys).

  3. How does Ansible achieve idempotency? — Each module checks current state before acting: apt checks if the package is installed at the desired version; copy checks file hash before overwriting; user checks if user exists with matching attributes. Modules report changed only if they actually did something. Custom shell/command tasks break this; use creates:/removes: or a built-in module.

  4. Ansible Vault vs a real secret manager? — Vault encrypts secrets in the playbook repo with a symmetric key — fine for moderate-sensitivity secrets + bootstrap credentials. A dedicated secret manager (Vault, AWS Secrets Manager, Azure Key Vault) is better for high-sensitivity + rotation + audit. Pattern: Ansible Vault for the bootstrap credential that lets a fresh node fetch from the real secrets manager.

  5. When do you still reach for config management in a containerized world? — Baking AMIs/images (with Packer + Ansible), bootstrapping K8s nodes, managing bare-metal/VM fleets (OpenStack, VMware), Windows and network devices that aren't container-friendly, and one-off runbook automation (patch this fleet, rotate these keys). Container workloads themselves belong in K8s + Helm, not Ansible.

Further reading


Part III — Containers & Orchestration

9. Containers & Docker (Deep)

Why this matters

Containers are the delivery format for modern software. Every DevOps engineer must know what's happening under the hood, because the vast majority of container failures come down to the Linux primitives (namespaces, cgroups, volumes, networking) leaking through. Interviewers probe "container vs VM" to separate pop-science answers ("containers are faster") from real understanding ("containers share the host kernel via namespace isolation").

Core concepts

What is a container, actually? A process (or group of processes) isolated from the rest of the host using Linux kernel primitives:

  • Namespaces — isolate views: pid (own PID 1), net (own network stack), mnt (own mounts), uts (own hostname), ipc (own SysV IPC), user (own UID map), cgroup, time (newer).
  • cgroups (control groups) — limit and account resources (CPU, memory, IO, PIDs).
  • chroot/overlayfs — isolated root filesystem, usually built from layered images.
  • Capabilities — fine-grained subdivisions of root (e.g., CAP_NET_BIND_SERVICE to bind privileged ports without full root).
  • seccomp — syscall filters.
  • AppArmor / SELinux — MAC layer.

A container is not a VM: no guest kernel, no hypervisor. Sharing the host kernel is the source of both the performance win and the security trade-off (kernel vulns cross the boundary).

Container vs VM:

| | Container | VM |
| --- | --- | --- |
| Kernel | Shared with host | Own guest kernel |
| Boot time | milliseconds | tens of seconds |
| Overhead | ~MB | hundreds of MB |
| Isolation | Namespace-level (weaker) | Hypervisor-level (stronger) |
| Use case | Stateless apps, dev parity | Multi-tenant, heterogeneous OSes, higher security |

When interviewers ask "container or VM" the right answer is "depends on isolation requirements." Regulated / untrusted-tenant workloads may still use VMs (or Kata/Firecracker — VM-isolated containers).

OCI (Open Container Initiative). The standard that split "Docker" into interoperable specs:

  • OCI Image Spec — what an image looks like (manifest, config, layers).
  • OCI Runtime Spec — how to run a container (runc is the reference).
  • OCI Distribution Spec — how registries serve images.

That's why podman, containerd, CRI-O can all run the same images you build with Docker.

Dockerfile instruction cheat sheet:

```dockerfile
FROM eclipse-temurin:21-jre-alpine AS runtime   # Base image (named stage)
ARG BUILD_VERSION=dev                           # Build-time arg (no runtime access)
ENV JAVA_OPTS="-XX:+UseZGC"                     # Runtime env var (baked in)
WORKDIR /app                                     # cd; creates if missing
COPY --from=build /out/app.jar ./app.jar         # Copy from another stage
ADD https://... /tmp/...                         # (Prefer COPY; ADD has auto-extract + URL fetch)
RUN apk add --no-cache curl                      # Run at build time; creates a new layer
USER 1000:1000                                   # Drop privileges
EXPOSE 8080                                      # Documentation only; doesn't actually open the port
HEALTHCHECK CMD curl -fsS localhost:8080/health || exit 1
ENTRYPOINT ["java", "-jar", "/app.jar"]          # Exec form (no shell)
CMD []                                           # Default args to ENTRYPOINT
```

ENTRYPOINT vs CMD:

  • ENTRYPOINT = the command, CMD = default args to that command.
  • docker run image someargs replaces CMD but not ENTRYPOINT.
  • Exec form (["a","b"]) runs directly; shell form (a b) wraps in /bin/sh -c (spawns a shell, eats signals).
  • Use exec form. Shell-form ENTRYPOINT java -jar app.jar becomes PID 1 = sh, not Java; SIGTERM goes to sh, Java doesn't shut down cleanly.
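A sketch of the interplay (the Spring profile flag is just an illustrative argument):

```dockerfile
ENTRYPOINT ["java", "-jar", "/app.jar"]
CMD ["--spring.profiles.active=prod"]
# docker run app
#   -> java -jar /app.jar --spring.profiles.active=prod
# docker run app --spring.profiles.active=dev
#   -> CMD replaced, ENTRYPOINT kept
```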

Multi-stage builds. The single most impactful Dockerfile technique:

```dockerfile
FROM maven:3.9-eclipse-temurin-21 AS build
WORKDIR /src
COPY pom.xml .
RUN mvn -B dependency:go-offline              # cache deps layer
COPY src ./src
RUN mvn -B -o package -DskipTests
# Stage 2: only the runtime
FROM eclipse-temurin:21-jre-alpine
COPY --from=build /src/target/app.jar /app.jar
USER 1000:1000
ENTRYPOINT ["java","-jar","/app.jar"]
```

Final image contains only the JRE + jar — no Maven, no source, no .m2. 200 MB vs 1+ GB.

Layer caching. Each Dockerfile instruction creates a layer; Docker reuses layers when inputs are identical. The cache is invalidated at the first changed instruction. Order from least-changing to most-changing:

FROM ...
RUN apt install ... (base packages — rarely changes)
COPY package.json ./
RUN npm ci                          (deps — changes when package.json does)
COPY . .                            (source — changes every commit)

The magic of COPY package.json ./ && RUN npm ci before COPY . . is that most builds don't re-run npm ci.

BuildKit. The modern builder (default since Docker 23, always in docker buildx):

  • Parallel stages.
  • Cache import/export (--cache-from, --cache-to with registry/GitHub Actions/local).
  • RUN --mount=type=cache,target=/root/.m2 — persistent build caches without writing them into the image.
  • RUN --mount=type=secret,id=npmrc — pass secrets into the build without baking them into layers.
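A hedged sketch combining both mount types (the mvnrc secret id is hypothetical and would be supplied with docker buildx build --secret id=mvnrc,src=settings.xml):

```dockerfile
# syntax=docker/dockerfile:1
FROM maven:3.9-eclipse-temurin-21 AS build
WORKDIR /src
COPY . .
# ~/.m2 persists between builds via the cache mount but never lands in a layer;
# the settings file is visible only during this RUN (default path /run/secrets/<id>)
RUN --mount=type=cache,target=/root/.m2 \
    --mount=type=secret,id=mvnrc \
    mvn -B -s /run/secrets/mvnrc package -DskipTests
```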

Base images.

  • Distroless (Google, gcr.io/distroless/*) — no shell, no package manager, just your app + minimal libc. Tiny attack surface. Can't docker exec sh into it, which is the point.
  • Chainguard / Wolfi — distroless + signed + SBOM-included.
  • Alpine — 5 MB base, uses musl libc. Trips up anything that links glibc (compiled Go is fine; Python/Java need the glibc-backed image or suffer).
  • Ubuntu / Debian slim — familiar, larger but glibc-based. Safe default when you don't know the constraints.
  • scratch — literally empty. Useful for statically compiled Go binaries.

Image scanning. docker scan, trivy, grype, snyk container. Run on every build; fail on high/critical. Scan base images weekly (new CVEs drop against stable bases).

Image signing & SBOM.

  • Cosign (Sigstore) — sign images + provenance + SBOM with keyless OIDC-based signing or key-based.
  • Notation (Notary v2) — alternative signing standard.
  • SBOM — Software Bill of Materials (SPDX or CycloneDX format). syft generates; grype can scan one.

Security hardening in Dockerfiles:

  • USER to a non-root UID early. Never USER root in the final stage.
  • runAsNonRoot: true in K8s enforces this at runtime.
  • Drop all capabilities by default, add back only what you need.
  • Read-only root filesystem; mount writable paths explicitly (/tmp via emptyDir).
  • Pin base image by digest (@sha256:...), not just tag — tags are mutable.
  • Don't pass secrets via ARG or ENV (they end up in the image).
  • Don't install ssh/sudo/shell debug tools.
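Most of these bullets map onto the container securityContext at runtime; a minimal Pod sketch (the image reference is a placeholder, digest elided):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  containers:
    - name: app
      image: ghcr.io/org/app@sha256:...   # pin by digest
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
        readOnlyRootFilesystem: true
        allowPrivilegeEscalation: false
        capabilities:
          drop: ["ALL"]
      volumeMounts:
        - name: tmp
          mountPath: /tmp                 # only writable path
  volumes:
    - name: tmp
      emptyDir: {}
```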

Alternative runtimes:

  • containerd — the runtime K8s (kubelet) actually uses; Docker is just a UI on top.
  • CRI-O — minimal runtime built explicitly for K8s CRI. Used by OpenShift.
  • runc — the OCI reference runtime (starts the container process).
  • gVisor — userspace kernel (Google) — stronger isolation at some CPU cost.
  • Kata Containers — each container in a lightweight VM — VM-level isolation, container UX.
  • Podman — Docker-compatible CLI; rootless by default; no daemon.
  • Firecracker — AWS's microVM, used under Lambda and ECS Fargate.

Commands you should know cold

```bash
# Build (classic and modern)
docker build -t app:1.0 .
docker buildx build --push --platform linux/amd64,linux/arm64 -t ghcr.io/org/app:1.0 .

# Run; map port 8080, drop root, read-only FS, limit memory
docker run --rm -it \
  -p 8080:8080 \
  --user 1000:1000 \
  --read-only --tmpfs /tmp \
  --memory 512m --cpus 1 \
  app:1.0

# Inspect layers
docker history app:1.0
dive app:1.0                        # 3rd-party; best tool for this

# Find what's eating the image
docker image ls --digests
docker images --filter 'dangling=true'

# Live debug into a running container
docker exec -it <name> sh
docker logs -f <name>
docker stats <name>                  # CPU/mem/io per container
docker inspect <name>

# Scan
trivy image app:1.0
grype app:1.0
syft app:1.0 -o cyclonedx-json > sbom.json

# Sign (keyless OIDC)
cosign sign --yes app:1.0
cosign verify app:1.0 --certificate-identity=... --certificate-oidc-issuer=...

# Compose (local only)
docker compose up -d
docker compose logs -f api
```

Gotchas & war stories

  • PID 1 doesn't reap zombies or forward signals — if your app spawns children, use tini or dumb-init as the entrypoint, or compile/run so your app is a proper init. Symptom: kill the container and it takes 10s to exit (Docker SIGKILLs after grace).
  • EXPOSE doesn't publish — it's a documentation hint. You still need -p host:container on docker run or ports: in compose.
  • COPY . . copies your .git/ + node_modules/ — use .dockerignore or your image doubles in size.
  • Alpine + Python/Java — musl libc breaks manylinux wheels and some JARs with native libs. Switch to -slim or -bookworm-slim.
  • Mutable :latest tag — every cluster restart might pull a different image. Always tag immutably (commit SHA).
  • Secrets in ARGs/ENVs — they're in docker history forever. Use BuildKit --mount=type=secret or inject at runtime.
  • Running as root — USER 1000:1000 (or a named user via RUN adduser). Many base images still default to root.
  • Host mount eats host files — docker run -v /:/host with a careless rm -rf /host/... wipes the host. Run untrusted containers rootless.
  • Docker Desktop licensing — commercial in many orgs. Use Podman/Rancher Desktop/Colima on work laptops if unsure.

Anchor example

A typical full-stack Dockerfile is a multi-stage build (Node 22 → JDK 25 + Maven → JRE 25 runtime) — a concrete, defensible example worth being able to whiteboard. A related real-world story: a lockfile generated on the host with npm 11 can be incompatible with the container's npm 10, so npm ci fails inside the Dockerfile and burns an afternoon. Fix: regenerate the lockfile inside a disposable container that matches the prod image.

Interview Q&A

  1. Container vs VM — what's the real difference? — A container is a process group sharing the host kernel, isolated by Linux namespaces and cgroups. A VM runs its own kernel on a hypervisor. Containers boot in ms and add negligible overhead; VMs boot in seconds and allocate hundreds of MB for the guest OS. The trade-off is isolation: VM boundaries are hardware-virtualized and much stronger than namespace boundaries.

  2. Why multi-stage Dockerfiles? — So the final image contains only the runtime, not the build toolchain. A Maven stage brings gigabytes of JDK + Maven + .m2; only the compiled jar + a JRE need to ship. Smaller images deploy faster, have a smaller attack surface, and fewer CVEs to scan.

  3. ENTRYPOINT vs CMD? — ENTRYPOINT is the command; CMD is the default arguments to it. At docker run image args, args replaces CMD but leaves ENTRYPOINT intact. Use exec form (["java", "-jar", "app.jar"]) to avoid the shell wrapping PID 1 and eating signals.

  4. How does Dockerfile layer caching work, and how do you exploit it? — Each instruction creates a layer; Docker hashes inputs and reuses unchanged layers. Cache invalidates at the first changed instruction, so order by change frequency: OS packages first, language dep-manifest copy + install next (so pom.xml/package.json change doesn't invalidate deps for source-only commits), source last. BuildKit adds --mount=type=cache for deps that shouldn't bake into the image at all.

  5. Distroless vs Alpine vs Ubuntu — pick one and defend it. — Distroless for prod: smallest attack surface (no shell, no package manager), nothing to exploit if compromised. Alpine for size-sensitive Go/Rust binaries (static linking sidesteps musl); problematic for Python/Java with native deps. Ubuntu-slim for developer ergonomics and glibc compatibility when image size isn't critical.

  6. How do you keep secrets out of a Docker image? — Never via ARG/ENV — they persist in docker history. Use BuildKit --mount=type=secret for build-time secrets (they're not layered). At runtime, mount as files/env from K8s Secrets, Vault agent injector, or a secrets manager. Treat the image as if it will be pulled by anyone who gets registry read.

  7. Why sign container images? — To verify provenance at deploy time — that the image was built by your pipeline, not by an attacker who pushed a lookalike tag. Cosign signs with Sigstore-backed OIDC identities; an admission controller (Kyverno / policy-controller) enforces "only signed images" in the cluster.

  8. What happens when you docker run something? — CLI sends the request to the daemon. Daemon resolves image (pulls layers if missing). Daemon asks containerd; containerd (via its shim) invokes runc with an OCI spec: new namespaces, cgroups, root filesystem via overlay, capability/seccomp policy, then execve the entrypoint. Network plugin sets up veth pair + bridge. You see the process.

  9. Container PID 1 — what's special about it? — PID 1 traditionally reaps orphans (adopted children) and handles signals — but language runtimes usually don't. If your app spawns subprocesses (any shell in ENTRYPOINT, or test runners), you'll leak zombies and mangle signals. Use tini / dumb-init as PID 1, or run your app with proper init handling.

  10. Reduce a 1 GB image to <100 MB — walk me through it. — Multi-stage build (drop the toolchain). Switch to distroless or Alpine. .dockerignore to exclude .git/, node_modules/, test fixtures. RUN combined package install + cleanup (apt-get install && rm -rf /var/lib/apt/lists/*). Pin deps to runtime-only (npm ci --omit=dev, pip install --no-compile). Use dive to find the biggest layers and attack them.

Further reading


10. Kubernetes Architecture (Deep)

Why this matters

Knowing kubectl apply is the floor. Senior DevOps interviews go up one level: explain what happens end-to-end when you apply a manifest, where etcd sits, how the scheduler picks a node, why you care about admission controllers. You'll also be asked diagnostic questions ("etcd is slow, what breaks?" — answer: everything, because every write goes through it).

Core concepts

Control plane (runs on master nodes; often 3 for HA):

┌──────────────────────────────────────────────────────────┐
│                   CONTROL PLANE                          │
│                                                          │
│  ┌──────────┐  ┌──────────┐  ┌──────────────┐  ┌──────┐  │
│  │kube-api- │◀▶│   etcd   │  │   scheduler  │  │cloud-│  │
│  │  server  │  │ (Raft)   │  │              │  │ ctrl │  │
│  └────▲─────┘  └──────────┘  └──────────────┘  │ mgr  │  │
│       │                                        └──────┘  │
│       │        ┌────────────────┐                        │
│       │        │ controller-mgr │                        │
│       │        │ (node, rs, dep,│                        │
│       │        │  svc, endpts…) │                        │
│       │        └────────────────┘                        │
└───────┼──────────────────────────────────────────────────┘
        │
    (HTTPS + auth)
        │
┌───────▼─────────────────────────────────────┐
│                 DATA PLANE                  │
│                                             │
│   Node 1          Node 2          Node 3    │
│   ┌───────┐       ┌───────┐       ┌───────┐ │
│   │kubelet│       │kubelet│       │kubelet│ │
│   │kube-  │       │kube-  │       │kube-  │ │
│   │proxy  │       │proxy  │       │proxy  │ │
│   │ctr-rt │       │ctr-rt │       │ctr-rt │ │
│   │ Pods  │       │ Pods  │       │ Pods  │ │
│   └───────┘       └───────┘       └───────┘ │
└─────────────────────────────────────────────┘

Control-plane components:

  • kube-apiserver โ€” the only component that talks to etcd. Everything else (kubelet, controller-manager, scheduler, kubectl) talks to the API server. It handles authn/authz/admission, then persists to etcd.
  • etcd — distributed key-value store (Raft consensus). All cluster state (pods, configmaps, secrets) lives in etcd. Backups = etcd snapshots. Size grows with object count; perf is sensitive to disk latency (<10 ms fsync ideal).
  • kube-scheduler โ€” watches for Pending pods (no nodeName), filters nodes by feasibility (resources, affinity, taints), scores them, binds the winner.
  • kube-controller-manager โ€” runs the built-in controllers: Node, ReplicaSet, Deployment, Endpoints, Service Account Token, Namespace, PV/PVC binder, HPA, etc. Each runs a reconcile loop: "observe desired, observe actual, converge."
  • cloud-controller-manager โ€” cloud-specific controllers (node health from EC2, LoadBalancer creation from ALB, route setup, volume attach).

Data-plane components (per node):

  • kubelet โ€” the node agent. Polls the API server for pods assigned to this node; talks to the container runtime (via CRI) to start/stop containers; reports status; runs probes; handles volumes; applies resource cgroups.
  • kube-proxy โ€” implements Services on each node. Watches Services + Endpoints, programs iptables/IPVS/nftables rules so ClusterIP โ†’ random healthy Pod IP. Being replaced by eBPF-based alternatives (Cilium) on modern clusters.
  • container runtime — containerd or CRI-O (dockershim was removed in Kubernetes 1.24, so Docker Engine no longer works as a direct runtime without an adapter like cri-dockerd).
  • CNI plugin โ€” the actual pod networking (Calico, Cilium, Flannel, AWS VPC CNI, etc.).

The apply flow, end-to-end. You run kubectl apply -f deploy.yaml:

  1. kubectl โ€” compares your manifest to the last-applied annotation (client-side merge), sends the final spec to the API server via HTTPS.
  2. kube-apiserver (authn) โ€” validates TLS client cert, OIDC token, or service account token. Fails fast on bad creds.
  3. kube-apiserver (authz) โ€” RBAC check: does this user have create/update on deployments in this namespace?
  4. Admission chain:
    • Mutating webhooks/controllers โ€” inject defaults, sidecars (Istio), imagePullSecrets, serviceaccount tokens.
    • Validating webhooks/controllers โ€” Pod Security Admission, policy engines (Kyverno, Gatekeeper), schema validation. Reject if disallowed.
  5. etcd write โ€” API server writes object to etcd with resource version bumped.
  6. Deployment controller โ€” watches Deployments, notices a change, creates/updates a ReplicaSet.
  7. ReplicaSet controller โ€” notices desired replicas โ‰  actual, creates Pod objects.
  8. Scheduler — notices Pods with no nodeName, runs filtering + scoring, writes the binding (nodeName) back through the API server.
  9. kubelet โ€” on the assigned node, notices its new Pod; calls CRI to pull image, create container, mount volumes, set network; updates Pod status.
  10. kube-proxy / CNI โ€” if the Pod backs a Service, Endpoints update; proxy rules on every node reflect the new IP.
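For reference, a minimal deploy.yaml that would travel this path — names and image are illustrative:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
  namespace: prod
spec:
  replicas: 3
  selector:
    matchLabels: { app: api }
  template:
    metadata:
      labels: { app: api }
    spec:
      containers:
        - name: api
          image: registry.example.com/api:1.4.2   # illustrative
          ports:
            - containerPort: 8080
```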

Leader election. Control-plane components scale by running multiple replicas; only one is "active" at a time. They compete for a Lease object in the API (or legacy ConfigMap/Endpoints). Lose the lease (heartbeat failure, clock skew) → another replica takes over. etcd replicates separately via Raft — a 3-node etcd cluster needs quorum (2 of 3) to keep accepting writes.

Resource versioning. Every object has resourceVersion. Watches use it for "give me updates since version X." Optimistic concurrency: an update with a stale resourceVersion is rejected โ€” you must refetch.

API priority & fairness (APF). API server's internal request prioritization โ€” exempts system traffic, throttles low-priority clients. Tune it if a noisy controller is starving the rest.

Admission controllers.

  • Built-in โ€” NamespaceLifecycle, LimitRanger, ServiceAccount, DefaultStorageClass, PodSecurity, ResourceQuota, etc.
  • Custom via webhooks โ€” mutating (MutatingAdmissionWebhook) or validating (ValidatingAdmissionWebhook). Kyverno, Gatekeeper (OPA), Falco Sidekick all plug in this way.
  • Pod Security Admission (PSA) โ€” replacing PSP (removed in 1.25). Enforces privileged / baseline / restricted per namespace.
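PSA profiles are applied per namespace via labels — a sketch (namespace name illustrative):

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: prod
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/warn: restricted   # warn/audit modes can trail enforce during migration
```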

Commands you should know cold โ€‹

bash
# Context & cluster info
kubectl config current-context
kubectl cluster-info
kubectl get --raw='/readyz?verbose'             # is API healthy?

# What can I do?
kubectl auth can-i create pods -n prod
kubectl auth can-i '*' '*' --as=system:serviceaccount:prod:ci

# See what's actually running on the control plane
kubectl get pods -n kube-system
kubectl get nodes -o wide
kubectl top nodes ; kubectl top pods -A     # (metrics-server required)

# Inspect a manifest server-side before apply
kubectl apply -f deploy.yaml --dry-run=server --server-side
kubectl diff -f deploy.yaml

# Watch apply through to readiness
kubectl rollout status deploy/api -n prod
kubectl rollout history deploy/api -n prod
kubectl rollout undo deploy/api -n prod

# Find who changed what (API audit log)
kubectl -n kube-system logs kube-apiserver-<node> | grep audit
# Better: use the audit log via whatever aggregator (Loki, ELK)

Gotchas & war stories โ€‹

  • etcd is the performance ceiling โ€” slow disk = slow cluster. Run etcd on dedicated SSDs or io2 volumes; watch etcd_disk_wal_fsync_duration_seconds (target p99 <10 ms).
  • Too many CRDs bloat API server memory โ€” each CRD + instances are cached. Clean up unused CRDs.
  • Admission webhook downtime = cluster downtime โ€” a failing webhook with failurePolicy: Fail blocks all create/update. Set failurePolicy: Ignore for non-critical webhooks; exempt kube-system namespaces.
  • kubectl apply without --server-side โ€” classic client-side merge loses fields when two controllers manage the same object. Prefer server-side apply (SSA) โ€” it tracks field ownership.
  • LoadBalancer Service pending forever โ€” cloud-controller-manager not running, or no cloud provider flag, or AWS CNI IAM permissions missing.
  • Control-plane upgrade downtime โ€” upgrade nodes one at a time; watch etcd quorum. For EKS/AKS/GKE, the cloud handles control plane; you still upgrade nodes (see Section 15).
  • kubelet killed by OOM on the node โ€” if system reserved resources aren't set, kubelet can starve. Configure --kube-reserved and --system-reserved.
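The admission-webhook gotcha above translates into config like this — a sketch with hypothetical webhook and service names:

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: policy-checks                 # hypothetical
webhooks:
  - name: validate.policy.example.com
    failurePolicy: Ignore             # non-critical: don't freeze the cluster if the webhook is down
    timeoutSeconds: 3                 # bound worst-case API latency
    namespaceSelector:
      matchExpressions:
        - key: kubernetes.io/metadata.name
          operator: NotIn
          values: ["kube-system"]     # never gate control-plane namespaces
    clientConfig:
      service:
        name: policy-webhook          # hypothetical
        namespace: policy
        path: /validate
    rules:
      - apiGroups: ["apps"]
        apiVersions: ["v1"]
        operations: ["CREATE", "UPDATE"]
        resources: ["deployments"]
    sideEffects: None
    admissionReviewVersions: ["v1"]
```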

Interview Q&A โ€‹

  1. Walk me through what happens when I kubectl apply a Deployment. โ€” kubectl sends the spec to kube-apiserver. apiserver runs authn (cert/token), authz (RBAC), admission (mutating then validating webhooks, PSA, quota). Writes to etcd. Deployment controller notices, creates/updates a ReplicaSet; ReplicaSet controller creates Pod objects; scheduler assigns each to a node; kubelet on that node pulls the image, starts the container via containerd, reports status back. kube-proxy/CNI update Service endpoints once pods are ready.

  2. What does etcd store, and why does it matter? — All cluster state: every object (pods, services, configmaps, secrets, CRDs, RBAC) is a key in etcd. Everything else is reconstructible. Consequences: back up etcd (snapshots), run it on low-latency storage, keep the total database under ~8 GB (the practical ceiling; the default quota is 2 GB), and protect access — anyone with etcd access has full cluster read.

  3. What does the scheduler actually do? — Watches for Pending pods (no nodeName). Runs two phases: filter nodes (do they have enough CPU/memory? Do they match nodeSelector, affinity, taints?), then score remaining candidates (prefer spread, prefer affinity, prefer image locality). Binds the winner by writing nodeName back through the API server. Kubelets on nodes only run pods assigned to them.

  4. Controller manager โ€” what are "controllers" and what pattern do they implement? โ€” Reconcile loops. Each controller watches specific API objects, compares observed state to desired state, takes actions to converge them. Deployment controller, ReplicaSet controller, Node controller, Endpoints controller, etc. It's the same pattern as operators / CRDs โ€” you're writing "observe + act" logic.

  5. What's an admission controller, and why are webhooks dangerous? โ€” Admission controllers run after authn/authz, before etcd persistence. They can mutate (default values, inject sidecars) or validate (reject bad manifests). Webhooks let third-party code (Kyverno, Gatekeeper, Istio) plug in. They're dangerous because a failing webhook with failurePolicy: Fail blocks every create/update in the scope; cluster effectively freezes. Mitigate by scoping namespaces, timeouts, Ignore for non-critical, and HA webhook deployments.

  6. How does kube-proxy implement a Service? โ€” Watches the Service and its Endpoints/EndpointSlices via API. Programs the node's netfilter (iptables or IPVS) so packets to ClusterIP:port DNAT to a random healthy Endpoint IP. For LoadBalancer Services, cloud-controller-manager provisions the external LB and wires its backends to NodePort. eBPF datapaths (Cilium) replace kube-proxy entirely.

  7. Why 3 control-plane nodes? — Quorum for etcd: you need a strict majority, ⌊n/2⌋+1, to elect a leader and accept writes. 1 node = no HA. 2 nodes = no fault tolerance (loss of either breaks quorum). 3 nodes = survive 1 failure. 5 = survive 2, but with diminishing returns on write latency. Enterprises often stop at 3.

  8. How do you back up a cluster? โ€” etcd snapshot: etcdctl snapshot save backup.db, store off-cluster. On managed K8s (EKS/AKS/GKE) the cloud handles it. App-level: Velero backs up namespaces + PVs; essential for DR. Don't rely only on GitOps manifests โ€” CRD resource data (cert-manager certs, PVCs, in-flight custom resources) is only in etcd.

  9. PSA vs PSP? โ€” PodSecurityPolicy (PSP) was deprecated in 1.21 and removed in 1.25. Pod Security Admission (PSA) replaces it with three preset profiles (privileged, baseline, restricted) applied at namespace level via labels. PSA is enforcement-only; if you need mutation (auto-downgrade privileged to baseline), pair with Kyverno/Gatekeeper.

  10. The API server is slow โ€” what do you check? โ€” etcd latency first (etcd_disk_wal_fsync_duration_seconds, etcd_request_duration_seconds). Then apiserver-side: CPU saturation, admission webhook latency (a slow webhook stalls every call it's registered for), watch fan-out from a huge number of clients, APF throttling. kubectl get --raw /metrics gives you the apiserver's Prometheus metrics.

Further reading โ€‹


11. Kubernetes Workloads & Objects โ€‹

Why this matters โ€‹

"Pod vs Deployment vs StatefulSet" is the bread-and-butter K8s question. Beyond the label, interviewers want you to explain why StatefulSet exists (stable network id, stable storage, ordered startup) and when to reach for each โ€” plus the scheduling knobs (affinity, topology spread, taints/tolerations, PDBs) that keep production actually survivable.

Core concepts โ€‹

Pod. The smallest deployable unit โ€” one or more containers sharing network (one IP, one port space), IPC, and volumes. Almost always one main container per pod. Sidecars (logging, service mesh proxy, config reloader) share the lifecycle.

Pod lifecycle phases: Pending โ†’ Running โ†’ Succeeded / Failed. Plus Unknown. Container state sub-phases: Waiting (pulling, CrashLoopBackOff) / Running / Terminated.

Init containers โ€” run to completion before app containers start. Use for: wait-for-dependency, schema migration, secret fetch. Sequential โ€” each must succeed.

Sidecar containers (1.28+) — restartPolicy: Always on an init container marks it as a sidecar: starts before app, runs alongside, terminates after. Replaces the old "just add another container" hack with proper startup ordering and lifecycle semantics.

Ephemeral containers (kubectl debug) โ€” injected into a running pod for troubleshooting; not part of the spec.

Pod termination:

  • terminationGracePeriodSeconds (default 30s). On delete: preStop hook โ†’ SIGTERM โ†’ wait โ†’ SIGKILL.
  • Your app must handle SIGTERM and drain (stop accepting new work, finish in-flight, close resources).
  • Endpoint removal is automatic on delete, but propagates to every node's kube-proxy asynchronously — traffic can still arrive after SIGTERM. Bridge the gap by failing the readiness probe on shutdown and/or a short preStop sleep.
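The termination contract as a pod-spec fragment — the sleep length and image are illustrative:

```yaml
spec:
  terminationGracePeriodSeconds: 45     # > preStop sleep + app drain time
  containers:
    - name: api
      image: registry.example.com/api:1.4.2   # illustrative
      lifecycle:
        preStop:
          exec:
            command: ["sh", "-c", "sleep 10"]   # let endpoint removal propagate before SIGTERM matters
```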

ReplicaSet. Maintains N identical Pods. Rarely created directly; a Deployment owns one (or several during rollout).

Deployment. Declarative update of a stateless workload. Owns the ReplicaSet, which owns Pods. Rollout strategies:

  • RollingUpdate (default) โ€” with maxSurge and maxUnavailable.
  • Recreate โ€” kill all, then start new. Brief downtime; acceptable for stateful/singleton apps.

History: Deployments track revisions (default last 10). kubectl rollout undo deploy/api rolls back.
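Both rollout knobs live under spec.strategy; a conservative zero-downtime setting might look like this (values illustrative):

```yaml
spec:
  revisionHistoryLimit: 5        # old ReplicaSets kept for rollout undo
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1            # at most one extra pod above desired
      maxUnavailable: 0      # never dip below desired capacity
```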

StatefulSet. Stateful workloads where identity matters:

  • Stable network ID โ€” pod names are <sts>-0, <sts>-1, ...; each has a stable DNS name (<sts>-0.<headless-svc>.<ns>.svc.cluster.local).
  • Stable storage โ€” volumeClaimTemplates: auto-generates a PVC per pod. Each pod reattaches to its PVC across restarts.
  • Ordered startup/teardown โ€” -0 comes up fully before -1. Scale-down reverses. Use for: databases, Kafka/ZooKeeper, anything that has "leader/follower" or "shard N" identity.
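A StatefulSet sketch tying the three guarantees together — name, image, and sizes are illustrative:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: kafka
spec:
  serviceName: kafka-headless          # headless Service supplying per-pod DNS
  replicas: 3
  selector:
    matchLabels: { app: kafka }
  template:
    metadata:
      labels: { app: kafka }
    spec:
      containers:
        - name: kafka
          image: example.com/kafka:3.7   # illustrative
          volumeMounts:
            - name: data
              mountPath: /var/lib/kafka
  volumeClaimTemplates:                # one PVC per pod: data-kafka-0, data-kafka-1, ...
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: gp3
        resources:
          requests:
            storage: 100Gi
```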

DaemonSet. One pod per node (or per node matching a selector). Use for: log shippers (Fluent Bit), node exporters, CNI agents, storage drivers. Scales with the fleet automatically.

Job. Run to completion. parallelism + completions for batch workloads. Retries on failure (backoffLimit).

CronJob. Job on a cron schedule. Catches:

  • Skew โ€” scheduler can miss by minutes under load.
  • concurrencyPolicy โ€” Allow / Forbid / Replace.
  • startingDeadlineSeconds โ€” drop missed runs older than this.
  • Hung jobs accumulate โ€” set activeDeadlineSeconds and ttlSecondsAfterFinished.
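A CronJob with all four guards set — schedule and image are illustrative:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: report
spec:
  schedule: "*/15 * * * *"
  concurrencyPolicy: Forbid          # never overlap runs
  startingDeadlineSeconds: 300       # drop runs missed by more than 5 min
  jobTemplate:
    spec:
      backoffLimit: 2
      activeDeadlineSeconds: 600     # kill hung jobs after 10 min
      ttlSecondsAfterFinished: 3600  # GC finished Jobs after an hour
      template:
        spec:
          restartPolicy: Never       # new pod per retry attempt
          containers:
            - name: report
              image: example.com/report-runner:latest   # illustrative
```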

Pod Disruption Budget (PDB). Caps how many pods can be voluntarily disrupted at once: minAvailable: 2 or maxUnavailable: 1. Enforced through the eviction API (kubectl drain, cluster-autoscaler, managed node upgrades) — voluntary disruptions only. Does not protect against node crashes.
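A minimal PDB (labels illustrative):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  minAvailable: 2                 # evictions that would drop below 2 ready pods are refused
  selector:
    matchLabels: { app: api }
```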

Topology Spread Constraints. "Don't put all replicas in one zone/node/hostname." Replaces the old pod-anti-affinity hack:

yaml
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector: { matchLabels: { app: api } }

Affinity / anti-affinity.

  • nodeAffinity โ€” "only schedule on nodes with label X" (required or preferred).
  • podAffinity โ€” "co-locate with pods matching X."
  • podAntiAffinity โ€” "don't co-locate" (e.g., keep replicas across different nodes for HA).
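As a pod-spec fragment — required node affinity plus preferred anti-affinity (labels and instance type illustrative):

```yaml
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:   # hard requirement
      nodeSelectorTerms:
        - matchExpressions:
            - key: node.kubernetes.io/instance-type
              operator: In
              values: ["m5.xlarge"]
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:  # soft: best-effort spread
      - weight: 100
        podAffinityTerm:
          topologyKey: kubernetes.io/hostname         # avoid same node
          labelSelector:
            matchLabels: { app: api }
```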

Taints & tolerations. A taint on a node repels pods that don't tolerate it. Used for: dedicated node pools (GPU, spot, Karpenter-managed), draining for maintenance (kubectl drain cordons the node, which adds a NoSchedule taint), critical workload isolation.
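A GPU-pool sketch — the node carries the taint, the workload carries the toleration (key/value illustrative):

```yaml
# Admin taints the node: kubectl taint nodes gpu-node-1 gpu=true:NoSchedule
# Only pods carrying this toleration may schedule there:
tolerations:
  - key: "gpu"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"
```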

Commands you should know cold โ€‹

bash
# Everything in a namespace
kubectl get all -n prod

# Apply and watch
kubectl apply -f deploy.yaml
kubectl rollout status deploy/api -n prod
kubectl rollout history deploy/api -n prod
kubectl rollout undo deploy/api --to-revision=3

# Scale
kubectl scale deploy/api --replicas=5

# Dump YAML of a live object (minus server-side cruft)
kubectl get deploy api -o yaml | kubectl neat        # if 'kubectl-neat' plugin installed

# Drain a node gracefully
kubectl drain node-1 --ignore-daemonsets --delete-emptydir-data

# Attach a debug container to a running pod
kubectl debug -it pod/api-abc --image=busybox --target=api

# Expose a pod locally
kubectl port-forward deploy/api 8080:8080

# Evict check: will PDB allow it?
kubectl get pdb

Gotchas & war stories โ€‹

  • App doesn't handle SIGTERM โ€” rolling update causes 500 spikes because in-flight requests get cut. Fix: add a preStop sleep (e.g., 10s) + readiness-probe flip off + graceful shutdown in app code.
  • maxSurge: 0 with maxUnavailable: 0 — the API rejects the combination outright, and maxUnavailable: 0 alone stalls rollouts whenever the cluster lacks room for surge pods. Classic misconfig for low replica counts.
  • StatefulSet name collision โ€” deleting the STS keeps PVCs around; a new STS with the same name will re-bind and may surprise you. Delete PVCs explicitly if you want a clean slate.
  • Init containers running forever โ€” failing exit codes infinite-loop. Guard with activeDeadlineSeconds or fail-fast logic.
  • Job with restartPolicy: OnFailure โ€” retries the same pod; Never creates a new pod per attempt. Pick intentionally.
  • CronJobs silently missing runs โ€” concurrencyPolicy: Forbid + a long-running job means subsequent runs are skipped. Always set startingDeadlineSeconds.
  • PodDisruptionBudget blocking upgrades โ€” minAvailable: 100% means kubectl drain hangs forever. Use realistic values.
  • Pod anti-affinity with few nodes โ€” anti-affinity + 3 replicas + 2 nodes = 1 pod perpetually Pending. Use topology spread constraints instead; they degrade gracefully.

Interview Q&A โ€‹

  1. Pod vs Deployment vs StatefulSet vs DaemonSet โ€” when each? โ€” Pod: rarely directly; it's the unit other controllers manage. Deployment: stateless replicas where identity is fungible (web servers, APIs). StatefulSet: stateful workloads where identity matters (databases, Kafka, ZooKeeper) โ€” stable network name + stable PVC + ordered rollout. DaemonSet: one per node workloads (log shipper, node exporter, CNI agent).

  2. Why not just run every pod in a Deployment? โ€” Deployments give you fungible replicas โ€” any pod can replace any other. Databases and consensus systems (Kafka, etcd) care which pod is which: shard-0 must always hold shard 0's data; leaders and followers differ. StatefulSet's stable DNS, ordered rollout, and per-pod PVC templates give you those guarantees.

  3. What is a PodDisruptionBudget and what does it protect against? — Caps how many pods in a set can be voluntarily disrupted at once — anything that goes through the eviction API (drain, node upgrades, cluster-autoscaler consolidation). Does not protect against node crashes or OOM kills (involuntary), and HPA scale-down bypasses it too, since that's a plain delete rather than an eviction. Set minAvailable: 2 on critical Deployments so upgrades can't take the service below quorum.

  4. What happens during a rolling update? โ€” Deployment controller creates a new ReplicaSet. Following maxSurge (extra pods allowed above desired) and maxUnavailable (pods allowed below desired), it scales the new RS up and the old RS down in lockstep. Each new pod waits for readiness before the old one is terminated. On failure, rollout pauses; rollout undo swaps back to the old RS.

  5. Init container vs sidecar container? โ€” Init containers run sequentially to completion before app containers start โ€” used for migrations, dep checks, secret fetches. Sidecars (1.28+ first-class via initContainers.restartPolicy: Always) run concurrently with the app and share its lifecycle โ€” used for service mesh proxies, log forwarders, config reloaders.

  6. Handling SIGTERM gracefully โ€” what's the contract? โ€” When K8s sends SIGTERM, readiness probe should start failing (so Service stops routing new traffic), app drains in-flight work (finish requests, close DB connections, commit offsets), then exits within terminationGracePeriodSeconds. If app ignores SIGTERM, K8s sends SIGKILL after the grace period โ€” in-flight requests die. A common fix: a preStop hook that sleep 10 to give Endpoint propagation time.

  7. Topology spread vs pod anti-affinity — which do you pick? — Topology spread. Required anti-affinity is all-or-nothing; topology spread gives you maxSkew for graceful degradation when you don't have enough nodes/zones. Required anti-affinity leaves pods Pending when constraints can't be met; DoNotSchedule topology spread does too, but ScheduleAnyway softens it for small clusters.

  8. How do taints and tolerations actually work? โ€” A node's taint (NoSchedule/PreferNoSchedule/NoExecute) repels pods that don't tolerate it. Pods with matching tolerations: are allowed through. Used for: GPU nodes (only GPU workloads tolerate), spot nodes (only fault-tolerant workloads), temporarily cordoning a node (kubectl drain adds NoSchedule).

  9. CronJob skipping runs โ€” why and how to fix? โ€” concurrencyPolicy: Forbid + a slow job that exceeds the schedule interval = next run skipped. Or: controller clock skew / API latency pushed the run past startingDeadlineSeconds. Fix: tune startingDeadlineSeconds, add activeDeadlineSeconds to prevent hangs, use concurrencyPolicy: Replace if "start a new one and kill the old" is acceptable, add monitoring on last successful run.

  10. How do you safely delete a StatefulSet without losing data? โ€” kubectl delete sts <name> --cascade=orphan keeps the pods (and PVCs) while removing the controller. Re-apply the STS later and it re-adopts them. Plain kubectl delete sts deletes pods but leaves PVCs (unless persistentVolumeClaimRetentionPolicy is set). Always snapshot before destructive operations on stateful data.

Further reading โ€‹


12. Kubernetes Networking โ€‹

Why this matters โ€‹

Kubernetes networking is where most outages happen โ€” and most interviewers reach for it because it sorts "I deploy" from "I operate." Expect the whole ladder: Service types, Ingress vs Gateway, CNI plugins, NetworkPolicies, and DNS inside the cluster.

Core concepts โ€‹

The four networking requirements Kubernetes demands:

  1. Every Pod gets its own IP.
  2. Pods can talk to each other without NAT (across nodes).
  3. Nodes can talk to Pods without NAT.
  4. A Pod sees itself at the same IP others see it at.

Any CNI plugin that satisfies these is legal. Most ship with an overlay (VXLAN) or use cloud VPC routing directly.

Services โ€” stable virtual IPs in front of a set of Pods (selected by label).

  • ClusterIP (default) โ€” internal virtual IP, routable only inside the cluster.
  • NodePort โ€” ClusterIP + every node exposes a port in 30000โ€“32767. Useful for minimal ingress without a cloud LB.
  • LoadBalancer โ€” NodePort + cloud-controller-manager provisions an external LB (ALB/NLB on AWS, etc.).
  • ExternalName โ€” DNS CNAME to an external service; no proxying.
  • Headless (clusterIP: None) โ€” DNS returns all Pod IPs; used by StatefulSets for stable per-pod DNS.
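A headless Service as a StatefulSet would use it (names and port illustrative):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: kafka-headless
spec:
  clusterIP: None          # headless: DNS returns Pod IPs directly, no virtual IP
  selector:
    app: kafka
  ports:
    - port: 9092
```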

Endpoints / EndpointSlices. The API layer behind Services. A Service's selector is matched to Pods; their IPs go into an EndpointSlice (scalable replacement for Endpoints). kube-proxy watches EndpointSlices and programs the node's netfilter to DNAT Service IP โ†’ Pod IP.

kube-proxy modes.

  • iptables (default) โ€” rules per Service/Endpoint; O(n) lookup per packet; works well up to ~1000 Services.
  • IPVS โ€” kernel load balancer; O(1), scales to tens of thousands.
  • nftables (beta-ish) โ€” modern replacement for iptables.
  • eBPF datapath (Cilium) โ€” bypasses kube-proxy entirely, much faster at scale, better observability (Hubble).

Ingress vs Gateway API.

  • Ingress โ€” resource that describes L7 HTTP routing; an Ingress controller (nginx, Traefik, AWS ALB Controller, HAProxy, Envoy) watches and programs itself. Limited: annotations-heavy, one controller per class is awkward.
  • Gateway API โ€” the replacement, now GA. Three kinds: GatewayClass (who runs it), Gateway (entry point, port, TLS), HTTPRoute/TCPRoute/GRPCRoute (path/header routing). Cleaner separation, better multi-team (different teams own Routes without editing the central Gateway).

CNI plugins.

  • AWS VPC CNI โ€” gives each Pod a VPC IP directly (security groups work, no overlay, ENI IP limits per instance type).
  • Calico โ€” pure L3 routing or overlay; NetworkPolicy + BGP; very popular.
  • Cilium โ€” eBPF-based; fastest; replaces kube-proxy; built-in observability (Hubble); ClusterMesh for multi-cluster.
  • Flannel โ€” simple VXLAN overlay; no policy.
  • Weave โ€” older; VXLAN + mesh encryption.

NetworkPolicies. Default in K8s: every Pod can talk to every other Pod. NetworkPolicies switch to allow-list model for pods they select โ€” ingress + egress rules by pod selector / namespace selector / CIDR block / port. Requires a policy-enforcing CNI (Calico, Cilium; AWS VPC CNI needs the plugin with policy enforcement enabled).
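The allow-list model in practice — default-deny ingress for a namespace, then an explicit allow (labels illustrative):

```yaml
# 1. Deny all ingress to every pod in the namespace
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
spec:
  podSelector: {}            # empty selector = all pods
  policyTypes:
    - Ingress
---
# 2. Explicitly allow the frontend to reach the API on its port
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-api
spec:
  podSelector:
    matchLabels: { app: api }
  ingress:
    - from:
        - podSelector:
            matchLabels: { app: frontend }
      ports:
        - port: 8080
```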

DNS in Kubernetes.

  • CoreDNS runs as a Deployment in kube-system.
  • Service DNS: <svc>.<ns>.svc.cluster.local โ†’ ClusterIP.
  • Headless Service DNS: <pod-name>.<svc>.<ns>.svc.cluster.local โ†’ Pod IP.
  • External names via ExternalDNS controller + Route53/CloudDNS.
  • ndots:5 in /etc/resolv.conf plus search domains means every external hostname is first tried against each search suffix — several wasted lookups per resolution (doubled for A + AAAA), a known perf issue. Mitigation: use FQDNs (trailing .) or lower ndots.

NodeLocal DNSCache. DaemonSet that caches DNS on each node โ€” reduces CoreDNS load and tail latency. Worth deploying on any cluster with >100 pods.
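The ndots mitigation can also be set per pod without touching images — a pod-spec fragment:

```yaml
spec:
  dnsConfig:
    options:
      - name: ndots
        value: "1"     # only names with a dot skip search-domain expansion fan-out
```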

Commands you should know cold โ€‹

bash
# Service + Endpoints
kubectl get svc -n prod
kubectl get endpointslices -n prod
kubectl describe svc api -n prod

# Port-forward for local testing
kubectl port-forward svc/api 8080:80 -n prod

# Exec into a DNS debug pod
kubectl run -it --rm dns-test --image=nicolaka/netshoot --restart=Never -- bash
# Inside:
# dig @10.96.0.10 api.prod.svc.cluster.local
# curl -v http://api.prod.svc.cluster.local/health
# ss -tulpn

# NetworkPolicy quick test
kubectl run tmp --rm -it --image=busybox --restart=Never -- wget -qO- http://api.prod:80/

# Gateway API
kubectl get gatewayclasses
kubectl get gateways -A
kubectl get httproutes -A

Gotchas & war stories โ€‹

  • ClusterIP not reachable from outside the cluster โ€” that's the design. Use NodePort/LoadBalancer/Ingress.
  • Service with no endpoints โ€” selector doesn't match any Pod labels. kubectl get endpointslices returns empty. Typo check.
  • Pods in a namespace can't reach each other โ€” NetworkPolicy deny-all with no matching allow.
  • externalTrafficPolicy: Local โ€” preserves client source IP but drops traffic to nodes without a local Pod. Pair with matching topology spread.
  • DNS tail latency โ€” ndots:5 + stale cache. Add NodeLocal DNSCache; use FQDNs in app configs.
  • AWS VPC CNI IP exhaustion โ€” pod density is capped by ENI IPs per instance type. Use prefix delegation or IPv6 to relieve.
  • LoadBalancer provisioning stuck โ€” IAM permissions for the load balancer controller, or subnets not tagged properly.

Interview Q&A โ€‹

  1. What are the Service types and when do you use each? โ€” ClusterIP for internal-only services (default for microservice-to-microservice). NodePort for minimal external exposure in dev, or when you front with your own LB. LoadBalancer to auto-provision a cloud LB. ExternalName for a DNS alias to an external service (not a proxy). Headless (clusterIP: None) for direct-to-pod DNS, used by StatefulSets.

  2. Ingress vs Gateway API โ€” which would you pick today? โ€” Gateway API for anything new. Ingress has been the de facto standard but its per-controller annotation sprawl is painful. Gateway API is now GA, properly separates "who runs the gateway" from "what traffic rules" with Gateway and HTTPRoute, supports multiple teams co-owning a gateway, and handles TCP/gRPC cleanly.

  3. Explain NetworkPolicies. โ€” By default every Pod can talk to every other Pod. A NetworkPolicy selects Pods via label and declares allowed ingress/egress. As soon as any NetworkPolicy selects a Pod, traffic not explicitly allowed is denied for that Pod. Enforcement depends on the CNI supporting NetworkPolicy (Calico and Cilium do; Flannel doesn't).

  4. What's a CNI plugin and how do I choose one? โ€” CNI (Container Network Interface) plugins are responsible for Pod networking: allocate IPs, set up veth pairs, program routes, optionally enforce policy. Key choices: AWS VPC CNI if on EKS and you want real VPC IPs; Cilium if you want eBPF performance + deep observability + policy; Calico for pure L3 + BGP + policy; Flannel for simple overlays without policy.

  5. What happens DNS-wise when my pod resolves api.prod.svc.cluster.local? โ€” /etc/resolv.conf points to CoreDNS ClusterIP (usually 10.96.0.10) with search paths (prod.svc.cluster.local, svc.cluster.local, etc.) and ndots:5. Stub resolver tries each search suffix first for short names. CoreDNS looks up the Service โ†’ EndpointSlice, returns the ClusterIP. Pod's kernel DNAT routes traffic to a Pod IP via kube-proxy's iptables/IPVS.

  6. How does kube-proxy implement Services? — Watches Services and EndpointSlices. Programs the node's netfilter (iptables by default; IPVS for scale; eBPF replaces it entirely with Cilium) with DNAT rules so that a packet to ClusterIP:port is rewritten to a random healthy Pod IP:port. The backend is picked per connection — conntrack pins subsequent packets to the same pod; stickier affinity is available via sessionAffinity: ClientIP.

  7. externalTrafficPolicy: Cluster vs Local โ€” trade-offs? โ€” Cluster (default) balances across all Pods regardless of node, NATting the source IP (client IP is lost). Local preserves client IP and avoids an extra hop, but if a node has no local Pod, traffic hitting that NodePort drops. Pair Local with topology spread or cluster autoscaler guarantees.

  8. Pod can't reach another pod in the same namespace โ€” walk through debugging. โ€” (1) Confirm the target is running and has an IP (kubectl get pod -o wide). (2) Confirm the Service has endpoints (kubectl get endpointslices). (3) Exec into the source pod; nslookup the service; curl -v the pod IP directly to isolate DNS vs network vs app. (4) Check NetworkPolicies in both pods' namespaces. (5) If cross-node, check CNI health; overlay MTU mismatches cause silent drops. (6) Check for mesh sidecar issues (Istio mTLS mismatch).

  9. What's NodeLocal DNSCache and why run it? โ€” A DaemonSet that runs a tiny DNS cache on each node; pods are configured (via a kubelet flag or CoreDNS override) to ask it instead of CoreDNS. Drops lookup tail latency drastically (no ClusterIP round-trip, no ndots fan-out pain) and offloads CoreDNS. Near-universal on production clusters >100 pods.

Further reading โ€‹


13. Kubernetes Storage โ€‹

Why this matters โ€‹

Stateless workloads are a subset of real workloads. Databases, caches, search, upload services โ€” all need persistent storage that survives pod restarts and maybe node failures. Kubernetes abstractions are elegant but the fault modes (PVC stuck Pending, node-affinity collision, zone mismatch) are their own genre of outage.

Core concepts โ€‹

Volume types (scoped to a single Pod's lifecycle):

  • emptyDir โ€” ephemeral, dies with pod. Use for scratch, tmp, caches.
  • hostPath โ€” mount from the node. Avoid in multi-tenant; kills pod portability.
  • configMap / secret โ€” mount a ConfigMap or Secret as files.
  • projected โ€” combine multiple sources into one mount (SA token + CA cert + downward API).
  • persistentVolumeClaim โ€” the adult way.

PersistentVolume (PV) & PersistentVolumeClaim (PVC).

  • PV โ€” cluster resource representing a piece of storage (EBS volume, EFS mount, a block device). Either statically provisioned (admin creates manually) or dynamically provisioned (created on demand).
  • PVC โ€” a request by a namespace for storage: "give me 20 GB, ReadWriteOnce, gp3." The controller binds a matching PV (or provisions one via StorageClass).
  • StorageClass โ€” recipe: "for this class, ask the EBS CSI driver to dynamically provision a gp3 volume."

Binding modes:

  • Immediate โ€” provision as soon as PVC is created.
  • WaitForFirstConsumer (preferred on multi-AZ) โ€” wait until a Pod using the PVC is scheduled, then provision in the Pod's AZ. Avoids the classic "PV in us-east-1a, Pod scheduled to us-east-1b" mismatch.

Access modes:

  • ReadWriteOnce (RWO) โ€” one node can mount RW. Most block volumes (EBS, GCE PD).
  • ReadOnlyMany (ROX) โ€” many nodes, read-only.
  • ReadWriteMany (RWX) โ€” many nodes, RW. File-based: EFS, Azure Files, CephFS, NFS.
  • ReadWriteOncePod (RWOP) โ€” single pod, single node. Newer; stronger guarantee than RWO.

CSI (Container Storage Interface). Standard plugin interface for storage drivers. Providers run a controller + node daemon; they implement provision/attach/mount/snapshot. Every modern storage (EBS, EFS, GCE PD, Azure Disk, Ceph, Longhorn, Portworx) uses CSI.

VolumeSnapshot. K8s object for point-in-time snapshot of a PVC, backed by the CSI driver's snapshot capability. Enables backup workflows (Velero, Kasten K10 use this).

Reclaim policies (on the PV):

  • Retain โ€” keep the PV (and underlying storage) after PVC delete. Admin cleans up.
  • Delete โ€” destroy underlying storage on PVC delete. Common for dynamically-provisioned dev volumes.

Persistent Volume lifecycle:

Provisioned โ†’ Bound (to PVC) โ†’ In Use (by Pod) โ†’ Released (PVC deleted) โ†’ Reclaimed (per policy)

StatefulSet volume claim template. Each Pod in an STS gets its own PVC named <volumeTemplateName>-<sts>-<ordinal>, auto-created on scale-up, not deleted on scale-down (you must delete PVCs explicitly). persistentVolumeClaimRetentionPolicy (beta since 1.27) lets you configure this.
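A minimal StatefulSet sketch showing the claim template and the retention policy (all names and the image are illustrative):

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata: { name: kafka, namespace: prod }
spec:
  serviceName: kafka
  replicas: 3
  persistentVolumeClaimRetentionPolicy:
    whenDeleted: Delete   # clean up PVCs when the STS itself is deleted
    whenScaled: Retain    # keep data across scale-down/up (the safer default)
  selector: { matchLabels: { app: kafka } }
  template:
    metadata: { labels: { app: kafka } }
    spec:
      containers:
        - name: kafka
          image: apache/kafka:3.7.0   # illustrative
          volumeMounts:
            - { name: data, mountPath: /var/lib/kafka }
  volumeClaimTemplates:
    - metadata: { name: data }
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: gp3
        resources: { requests: { storage: 20Gi } }
```

This produces PVCs data-kafka-0, data-kafka-1, data-kafka-2, matching the naming pattern above.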

Running databases in K8s โ€” pros/cons:

  • Pros: unified ops, GitOps-managed, consistent secrets/monitoring stack.
  • Cons: hardware awareness is harder (NUMA, huge pages, disk tuning); operators mature but not trivial (CloudNativePG, Strimzi, Percona, Crunchy); rebalancing on node failure is delicate; cloud managed DBs (RDS, Cloud SQL) often still win for reliability.
  • Rule of thumb: small/secondary DBs in K8s (read-replica analytics, dev sandboxes), primary prod DBs in managed cloud services unless you're deep enough to own the operator.

Commands you should know cold โ€‹

bash
# See PVs, PVCs, StorageClasses
kubectl get sc
kubectl get pv
kubectl get pvc -A

# Inspect why a PVC is Pending
kubectl describe pvc mydata -n prod
# Usual suspects: no matching StorageClass, no capacity, zone mismatch, CSI driver not healthy

# Expand a PVC (if StorageClass allows it)
kubectl patch pvc mydata -n prod -p '{"spec":{"resources":{"requests":{"storage":"50Gi"}}}}'

# Take a snapshot
kubectl apply -f - <<'YAML'
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata: { name: mydata-snap-20260417, namespace: prod }
spec:
  volumeSnapshotClassName: csi-ebs
  source: { persistentVolumeClaimName: mydata }
YAML

# Orphan cleanup after STS delete
kubectl get pvc -n prod -l app=kafka
kubectl delete pvc -n prod -l app=kafka

Gotchas & war stories โ€‹

  • PVC stuck in Pending โ€” usually StorageClass issue or AZ mismatch. Prefer WaitForFirstConsumer on multi-AZ clusters.
  • Can't scale StatefulSet down then up โ€” old PVCs still exist, new pods attach them with stale data. Decide whether that's desired; if not, delete PVCs explicitly.
  • EBS volume stuck Attaching โ€” CSI controller node failure, or IAM missing ec2:AttachVolume. Check the CSI controller pod logs.
  • File permissions on mounted volumes โ€” pod UID doesn't match mount ownership. Use fsGroup in Pod spec to chown on mount (expensive on large filesystems โ€” consider fsGroupChangePolicy: OnRootMismatch).
  • EFS RWX + many pods = contention โ€” file locking across NFS is painful. Don't use EFS as a write-heavy DB mount.
  • Not backing up PVCs โ€” GitOps doesn't cover state. Velero + VolumeSnapshot or a CSI-aware backup tool (Kasten K10) is mandatory for stateful workloads.

Interview Q&A โ€‹

  1. PV vs PVC vs StorageClass โ€” explain to a junior. โ€” PV is the physical storage resource; PVC is a namespace-scoped claim on some storage; StorageClass is a template for dynamically provisioning a PV to fulfill a PVC. A Pod references a PVC; the PVC binds to a PV (either pre-existing or auto-created via StorageClass). Think: StorageClass = "menu," PVC = "order," PV = "delivered dish."

  2. Why WaitForFirstConsumer? โ€” In multi-AZ clusters, Immediate provisions the PV in whichever AZ the provisioner picks โ€” which may not match where the Pod is eventually scheduled, leaving the Pod unschedulable. WaitForFirstConsumer delays provisioning until a Pod using the PVC is scheduled, then provisions in that AZ. Near-default on managed K8s.

  3. Access modes โ€” RWO vs RWX vs RWOP? โ€” RWO: one node mounts read-write (most block volumes, EBS, GCE PD). RWX: many nodes mount read-write (file-based: EFS, Azure Files, NFS). RWOP: a single pod on a single node โ€” strongest guarantee; newer. Pick based on the backing driver's capability and whether multiple pods need concurrent write access (RWX is rarely a good idea for databases).

  4. How do you take a backup of a PVC? โ€” For snapshotable CSI drivers: create a VolumeSnapshot pointing at the PVC. The driver creates a cloud-native snapshot (EBS snapshot, etc.). Velero or Kasten K10 orchestrates this across namespaces and ships metadata off-cluster. Without snapshot support, mount the PVC in a sidecar pod and rsync contents to object storage.

  5. Should I run my production database in K8s? โ€” Usually no, unless you have the operator expertise. Managed cloud DBs (RDS, Cloud SQL, Aurora) offer better backups, failover, point-in-time restore, and support. In-cluster DBs make sense for dev/test, analytics read-replicas, or teams committed to full-stack K8s ownership with mature operators (CloudNativePG, Strimzi for Kafka).

  6. StatefulSet scale-down โ€” what happens to PVCs? โ€” Pods are deleted in reverse order. By default PVCs are retained โ€” scaling back up re-attaches them. persistentVolumeClaimRetentionPolicy (1.27+) lets you set whenScaled: Delete to auto-delete on scale-down. Retention is the safer default.

  7. How do you expand a volume? โ€” Set allowVolumeExpansion: true on the StorageClass. kubectl edit pvc (or patch) to increase spec.resources.requests.storage. CSI driver resizes the backing volume. For most filesystems you also need to online-resize (CSI does this automatically if the driver supports it). You can never shrink โ€” expansion is one-way.

Further reading โ€‹


14. Kubernetes Config, Secrets & RBAC โ€‹

Why this matters โ€‹

Config + secrets + identity is where most security incidents land. Interviewers will ask "ConfigMap vs Secret?" to see if you know Secrets are base64, not encrypted by default. They'll ask about RBAC because least-privilege is table-stakes for regulated environments, and about IRSA (on EKS) because short-lived cloud creds are the modern correct answer.

Core concepts โ€‹

ConfigMap โ€” non-sensitive key/value config. Mount as env vars, files, or command-line args. Plain text in etcd.

Secret โ€” like ConfigMap, but base64-encoded. Not encrypted at rest unless you enable encryption-at-rest on etcd. RBAC treats Secrets separately so you can grant access to ConfigMaps without granting access to Secrets.

Encryption at rest (etcd). Enable EncryptionConfiguration on the apiserver; secrets are encrypted with KMS/AES before being written to etcd. Essential for compliance. Managed K8s (EKS/AKS/GKE) offer this as a checkbox backed by the cloud KMS.

Alternatives for real secret management:

  • Sealed Secrets (Bitnami) โ€” encrypt a manifest with a public key that the controller decrypts in-cluster. Manifest is safe to commit. Per-cluster keys; rotation story is weak.
  • External Secrets Operator โ€” fetches from Vault/AWS Secrets Manager/GCP Secret Manager/Azure Key Vault and syncs into K8s Secret objects. CRDs: SecretStore, ExternalSecret. Dominant pattern in production.
  • SOPS โ€” encrypt YAML/JSON in place with age / KMS / PGP keys. Pair with GitOps (ArgoCD plugin, Helm SOPS plugin).
  • Vault Agent Injector โ€” sidecar injects secrets as files into the pod at start; renews automatically.
  • CSI Secrets Store โ€” mount secrets from Vault/AWS/Azure/GCP as files via CSI. No K8s Secret object created (if you want), reducing blast radius.

Immutable ConfigMaps / Secrets. immutable: true prevents accidental changes and improves apiserver watch scalability. Best for "versioned config per release" patterns.

Environment vs file mount:

  • Env vars โ€” simple, convenient, but can leak to crash dumps, /proc/<pid>/environ, and are not hot-reloadable.
  • File mounts โ€” can be hot-reloaded by apps watching the file; don't show up in env. Prefer for secrets when possible.

ServiceAccount. Every Pod has a ServiceAccount (the default one if unspecified). The SA is how Pod code authenticates to the K8s API and (with IRSA/Workload Identity) to cloud APIs.

RBAC:

  • Role โ€” permissions scoped to a namespace (rules on verbs ร— resources ร— resourceNames).
  • ClusterRole โ€” cluster-scoped equivalent; can also be used at namespace scope via RoleBinding.
  • RoleBinding โ€” grant a Role to a user/group/SA in a namespace.
  • ClusterRoleBinding โ€” cluster-wide grant.

Rule shape:

yaml
rules:
  - apiGroups: ["apps"]
    resources: ["deployments", "replicasets"]
    verbs: ["get", "list", "watch", "update", "patch"]
    resourceNames: ["api"]        # optional; pin to specific objects

Verbs: get list watch create update patch delete deletecollection. Plus impersonate, bind, escalate (meta-permissions on RBAC itself).

Common RBAC patterns:

  • One ServiceAccount per workload; bind least privilege.
  • Read-only SA for monitoring/debug tooling.
  • Cluster-admin only for platform team breakglass (with audit alerting).

Cloud-identity patterns:

  • IRSA (IAM Roles for Service Accounts, EKS) โ€” trust policy on an IAM role allows a specific SA in a specific namespace; pod gets short-lived STS creds via a projected SA token the SDK exchanges for AWS creds.
  • EKS Pod Identity (newer, simpler) โ€” similar outcome, less JWT plumbing.
  • GKE Workload Identity โ€” analog for GCP.
  • AKS Pod Identity / Workload Identity โ€” analog for Azure.

kubeconfig. Merged view of clusters, users, and contexts (~/.kube/config). Contexts tie "this cluster + this user + this default namespace." kubectx / kubens / kubie / k9s are common quality-of-life tools.

Commands you should know cold โ€‹

bash
# Create ConfigMap / Secret (imperative for quick use; YAML for GitOps)
kubectl create configmap app-config --from-file=./config.yaml -n prod
kubectl create secret generic api-token --from-literal=TOKEN=xyz -n prod

# Decode a Secret value
kubectl get secret api-token -n prod -o jsonpath='{.data.TOKEN}' | base64 -d

# Who am I / what can I do?
kubectl auth whoami                          # (1.27+)
kubectl auth can-i create pods -n prod
kubectl auth can-i '*' '*' --as=system:serviceaccount:prod:api
# List all permissions of an SA
kubectl auth can-i --list --as=system:serviceaccount:prod:api -n prod

# Show effective RBAC for a subject
kubectl describe rolebinding -n prod
kubectl describe clusterrolebinding | grep -A4 api

# Create a minimal Role + RoleBinding for a service account
kubectl create role reader --verb=get,list,watch --resource=pods,services -n prod
kubectl create rolebinding reader-binding --role=reader --serviceaccount=prod:api -n prod

# IRSA-style annotation (EKS)
kubectl annotate sa api -n prod \
  eks.amazonaws.com/role-arn=arn:aws:iam::123:role/api-role

# Rotate a Secret with zero pod restart (if app watches file)
kubectl create secret generic api-token --from-literal=TOKEN=new --dry-run=client -o yaml \
  | kubectl apply -f - -n prod

Gotchas & war stories โ€‹

  • Secrets are base64, not encrypted. In most default clusters, any kubectl get secret -o yaml with RBAC read returns the value. Enable etcd encryption-at-rest + lock down secrets access.
  • Binding ClusterRole with a RoleBinding scopes it. A powerful ClusterRole can be scoped to one namespace via RoleBinding โ€” useful for reuse without broad grants.
  • system:serviceaccount:<ns>:default โ€” tons of workloads accidentally use this SA with whatever the admin attached to it. Create explicit SAs per workload.
  • Wildcards in RBAC โ€” resources: ["*"], verbs: ["*"] on a ClusterRole is a supply-chain landmine. Audit all of them.
  • ConfigMap size limit โ€” 1 MiB per object. Big static assets don't belong here (use a PVC or bake into the image).
  • Mutable env-from-ConfigMap โ€” if you change the ConfigMap, pods don't see env-var changes until restart. Use the checksum/configmap annotation pattern (Helm) or a reloader controller.
  • Exposing the SA token to containers โ€” automountServiceAccountToken: false on Pods that don't need to call the K8s API. Defense in depth.

Anchor example โ€‹

A typical Spring Boot deployment uses env-based config (SPRING_PROFILES_ACTIVE, Mongo URIs). The production-ready pattern to advocate in an interview: pull non-secret config from a ConfigMap mounted as env; pull secrets from External Secrets Operator โ†’ Vault/AWS Secrets Manager, mounted as files; set automountServiceAccountToken: false on workloads that don't call the K8s API; use IRSA for the app's AWS SDK calls (S3, KMS) โ€” no static access keys anywhere.

Interview Q&A โ€‹

  1. ConfigMap vs Secret? โ€” Both are key/value stores. ConfigMap is plain text, intended for non-sensitive config. Secret is base64 (not encryption) and has separate RBAC by convention. Both can be mounted as files or env vars. For real secret safety, pair Secret with etcd encryption at rest or bypass K8s Secret entirely via External Secrets / CSI Secrets Store.

  2. Are Kubernetes Secrets secure? โ€” By default, no โ€” they're base64 in etcd, readable by anyone with get secrets RBAC. For actual security: enable encryption at rest (KMS-backed), restrict RBAC tightly, rotate regularly, prefer External Secrets Operator or Vault Agent Injector to keep the source of truth outside K8s, and consider CSI Secrets Store to skip the K8s Secret object altogether.

  3. How do you give a pod AWS credentials the modern way? โ€” IRSA on EKS: annotate the ServiceAccount with an IAM role ARN, the SA token is a projected JWT, the AWS SDK exchanges it via STS for short-lived creds. Newer: EKS Pod Identity โ€” similar guarantees, less plumbing. Never: bake static access keys into the image or Secret.

  4. Role vs ClusterRole? โ€” Role is namespace-scoped; ClusterRole is cluster-scoped (and can also be used cross-namespace via RoleBinding). ClusterRoles are for cluster-wide resources (Nodes, CRDs, PersistentVolumes) or reusable permission sets. Typical pattern: define ClusterRoles centrally, bind via RoleBindings in each namespace.

  5. How do you implement least privilege in K8s? โ€” (1) One ServiceAccount per workload, never the default. (2) Narrow Role with specific verbs/resources/resourceNames, bound with a RoleBinding. (3) automountServiceAccountToken: false where the app doesn't need K8s API. (4) Audit with kubectl auth can-i --list. (5) For cloud creds, IRSA/Workload Identity; for secrets, External Secrets so RBAC on K8s Secrets isn't the gate. (6) PSA restricted on workload namespaces.

  6. RBAC audit โ€” find all ServiceAccounts with cluster-admin. โ€” kubectl get clusterrolebindings -o json | jq '.items[] | select(.roleRef.name=="cluster-admin") | {binding:.metadata.name, subjects:.subjects}'. Then audit each subject. Typical findings: old CI service accounts, cluster-admin bound to group you forgot, operators that over-requested.

  7. How do you rotate a secret without downtime? โ€” Depends on the app. Preferred: app watches file (Secret mounted as volume), you update the Secret, kubelet refreshes the file, app re-reads on next cycle โ€” no restart. If app only reads env at start, you must roll the deployment (kubectl rollout restart deploy/api) after updating the Secret. Best practice: long-term move to short-lived creds (IRSA, Vault dynamic secrets) so rotation is continuous.

Further reading โ€‹


15. Kubernetes Scaling & Scheduling โ€‹

Why this matters โ€‹

Autoscaling is how you match capacity to load and spend to need. A senior interview question in this area goes past "what is HPA" straight to "HPA is only scaling on CPU and your p99 latency is climbing โ€” what do you do?" (custom metrics) or "your cluster autoscaler took 8 minutes to add a node during a spike โ€” why and how do you fix it?" (Karpenter / pre-warmed capacity).

Core concepts โ€‹

Resource requests vs limits.

  • request โ€” what the scheduler reserves on a node for this container. Guaranteed floor.
  • limit โ€” hard ceiling. Exceeding CPU = throttled (not killed). Exceeding memory = OOMKilled.

QoS classes (inferred from requests/limits):

  • Guaranteed โ€” requests == limits on every container. Last to be evicted.
  • Burstable โ€” some requests set, limits differ (or unset). Evicted before Guaranteed.
  • BestEffort โ€” nothing set. First to go under pressure.

Eviction signals. kubelet evicts pods when the node is under pressure: memory.available, nodefs.available, imagefs.available, pid.available. Pods are ranked by QoS + priority + usage.

Horizontal Pod Autoscaler (HPA).

  • Default: scale on CPU % of request.
  • v2 supports custom & external metrics: memory, application metrics via Prometheus adapter, queue depth via KEDA bridge.
  • Key knobs: minReplicas, maxReplicas, metrics:, behavior: (scale-up/down stabilization windows, rate limits). Default scale-down stabilization is 5 minutes (prevents flapping).

Vertical Pod Autoscaler (VPA).

  • Recommends (or auto-applies) better requests/limits based on observed usage.
  • Three modes: Off (recommendations only), Auto / Recreate (apply by recreating pods), Initial (only on pod creation).
  • Do NOT combine VPA's Auto with HPA on the same metric (they fight). Usually: HPA on CPU/custom, VPA in "recommender-only" mode to tune requests.

Cluster Autoscaler (CA).

  • Watches for Pending pods that don't fit; scales up node groups (EC2 ASG, AKS VMSS, GKE node pools). Scales down nodes that have been under-utilized for 10 min.
  • Limitations: reacts to Pending pods only (not pre-emptive), slow to scale (30sโ€“5min depending on cloud), respects PDBs on evictions.

Karpenter (AWS-native, open-source, CNCF sandbox).

  • Replacement for Cluster Autoscaler on EKS.
  • Groupless โ€” provisions nodes dynamically based on Pod requirements, bin-packing across instance types.
  • Much faster than CA (30s node-ready is typical) and better cost-fit (spots the right instance size).
  • Also handles consolidation โ€” periodically repacks to cheaper layouts.

KEDA โ€” event-driven autoscaling (Kubernetes Event-Driven Autoscaling).

  • Scales from 0 to N based on external event sources: Kafka lag, SQS depth, CloudWatch metric, Postgres query count, cron.
  • Layers on top of HPA (creates an HPA behind the scenes).
  • Canonical for worker pods consuming a queue.

Priority and Preemption.

  • PriorityClass objects define integer priorities.
  • Pod spec references one. Pods with higher priority can preempt (evict) lower-priority pods to make room.
  • System-critical pods (system-cluster-critical, system-node-critical) are reserved; don't overuse custom high priorities or scheduler becomes flapping.

Topology-aware scheduling. TopologyManager, topologySpreadConstraints, NUMA-aware scheduling for GPU / HPC workloads.

Commands you should know cold โ€‹

bash
# Current top consumers
kubectl top pods -A --sort-by=cpu | head
kubectl top nodes

# Define an HPA on CPU
kubectl autoscale deploy/api --min=3 --max=20 --cpu-percent=60 -n prod

# Inspect HPA decisions
kubectl describe hpa/api -n prod        # shows current/target, last scale event, metrics fetched

# Diagnose Pending pods (why didn't it schedule?)
kubectl describe pod mypod -n prod | sed -n '/Events/,$p'
# Or, one-liner across namespace
kubectl get events -n prod --sort-by=.lastTimestamp

# Is CA/Karpenter seeing the pressure?
kubectl logs -n kube-system deploy/cluster-autoscaler --tail=200
kubectl get nodes -L node.kubernetes.io/instance-type

# KEDA โ€” scale by Kafka lag
kubectl apply -f - <<'YAML'
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata: { name: kafka-consumer, namespace: prod }
spec:
  scaleTargetRef: { name: consumer }
  minReplicaCount: 0
  maxReplicaCount: 30
  triggers:
    - type: kafka
      metadata:
        bootstrapServers: kafka:9092
        consumerGroup: my-group
        topic: orders
        lagThreshold: "100"
YAML

Gotchas & war stories โ€‹

  • No requests set โ€” scheduler packs pods assuming they use nothing โ†’ nodes OOM in minutes. Always set requests.
  • Limits much higher than requests โ€” Burstable QoS; under pressure you're first out. Worth setting limits == requests for Guaranteed QoS on critical workloads.
  • CPU limits cause throttling under bursty load โ€” Java/Node apps that use lots of threads get silently stalled at the cgroup boundary. Many prod teams omit CPU limits entirely and rely on requests + HPA.
  • HPA scaled down during deploy โ€” new pods' load is low in the warm-up window; HPA concludes capacity should shrink; later everything goes hot and HPA lags. Tune behavior.scaleDown.stabilizationWindowSeconds longer or disable scale-down during deploys.
  • Cluster autoscaler slow โ€” 2โ€“5 minutes to create a node, pull images, schedule. For spiky workloads, use Karpenter or keep a headroom buffer (low-priority pause pods that can be preempted).
  • Over-provisioned nodes for "safety" โ€” costs pile up fast. Use VPA recommender to right-size requests, Karpenter for consolidation.
  • Priority class escalation attack โ€” a malicious operator creates a PriorityClass: 99999999 and preempts prod. Protect via admission policy (Kyverno, Gatekeeper) โ€” only platform team can create PriorityClass objects.

Interview Q&A โ€‹

  1. Walk through how HPA decides to scale. โ€” HPA controller queries metrics-server (or the external/custom metrics API) for each pod's current usage. It computes the ratio of current to target usage and multiplies the current replica count by it, rounding up: desiredReplicas = ceil(currentReplicas * currentMetricValue / targetMetricValue) โ€” equivalently, ceil(sum(currentValue) / target). Applies min/max bounds and the scaleUp/scaleDown behavior (stabilization windows, percent/pod rate limits).
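The arithmetic is easy to sanity-check by hand. A sketch with hypothetical numbers (1000m of total CPU across ready pods, against a 150m per-pod target, i.e. 60% of a 250m request):

```shell
# Hypothetical inputs: total current CPU across ready pods = 1000m,
# per-pod target = 150m (60% utilization of a 250m request).
sum_m=1000
target_m=150
# desiredReplicas = ceil(sum(currentValue) / target)
desired=$(awk -v s="$sum_m" -v t="$target_m" \
  'BEGIN { v = s / t; d = int(v); if (d < v) d++; print d }')
echo "desired replicas: $desired"   # prints: desired replicas: 7
```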

  2. HPA is scaling on CPU but the bottleneck is queue depth โ€” what do you do? โ€” Swap to a custom or external metric. Two common paths: Prometheus adapter to expose kafka_consumergroup_lag as a custom metric consumed by HPA; or use KEDA with a Kafka scaler that creates the HPA for you and scales from zero.

  3. VPA vs HPA โ€” can I use both? โ€” Yes, but never on the same metric. Common combo: HPA scales replicas based on CPU / custom metric; VPA in Off or Initial mode right-sizes per-pod resource requests. If VPA runs in Auto and HPA scales on CPU %, they will fight (VPA shrinks request โ†’ CPU % rises โ†’ HPA adds replicas โ†’ VPA shrinks further).

  4. Cluster Autoscaler vs Karpenter? โ€” CA scales predefined node groups (ASG / VMSS) to fit Pending pods โ€” simple, stable, slower, wastes capacity when the group's instance type mismatches pod needs. Karpenter is groupless: it looks at pending pods' requirements and provisions instances directly, bin-packs across types, consolidates periodically. Karpenter is the modern choice on EKS.

  5. Pod is Pending โ€” walk debug steps. โ€” kubectl describe pod โ†’ check Events. Common causes: insufficient resources on any node (scheduler can't find fit โ€” scale up or reduce requests), taint without matching toleration, nodeSelector/affinity with no matching node, PVC binding AZ mismatch, quota exceeded. If Cluster Autoscaler should scale up, check its logs for why it refused.

  6. What's QoS and why does it matter? โ€” Three classes based on requests/limits: Guaranteed (requests == limits), Burstable (some set, mismatch), BestEffort (nothing set). Under node pressure, kubelet evicts BestEffort first, Burstable next, Guaranteed last. Running critical workloads as Guaranteed gives them the strongest eviction protection.

  7. Should I set CPU limits? โ€” Often no for latency-sensitive apps. CPU limits enforce via CFS quota โ€” exceeding causes throttling, which can produce tail-latency spikes even though you have headroom on the node. Many teams set requests (for scheduling) and leave CPU limits unset, relying on the kernel's fair scheduling. Memory limits remain essential to prevent node OOM cascades.

  8. How do you scale from zero? โ€” HPA doesn't support 0 โ†’ 1 out of the box (minReplicas must be at least 1 unless the alpha HPAScaleToZero feature gate is enabled). Use KEDA โ€” it adds a ScaledObject that keeps replicas at 0 when no events arrive, then scales up on the first event.

Further reading โ€‹


16. Kubernetes Observability, Probes & Troubleshooting โ€‹

Why this matters โ€‹

This is the "debug a live cluster" section โ€” the one that sorts theory from operating experience. Interviewers love to walk you through a scenario: "pod is CrashLoopBackOff" or "deploys are rolling out but nothing serves traffic." If you can narrate tools and decisions in order, you pass.

Core concepts โ€‹

Probes. Three kinds; each answers a different question.

  • Liveness โ€” "is this container alive?" On failure: kubelet restarts the container. Use cautiously โ€” a misconfigured liveness probe will thrash healthy apps.
  • Readiness โ€” "is this container ready to serve?" On failure: Endpoints remove it, Service stops routing. Container is not restarted. Use freely.
  • Startup โ€” "has this container finished starting?" Suspends liveness/readiness until startup succeeds. Use for slow-booting apps (JVMs, DB migrations).

Probe types: HTTP GET, TCP open, gRPC health, exec command. Timeouts, periods, thresholds (failureThreshold), initial delay.

Golden rules of probe design:

  • Readiness can be strict (check DB connection, external deps); Liveness should be minimal (is the process still responsive?).
  • Liveness on an endpoint that depends on external state = restart loops the moment the dep blips.
  • Startup probe for anything with initialDelaySeconds > 30s in the liveness probe.

Status fields to read first:

  • Pending โ€” not scheduled (no fit, image pull pending, PVC pending).
  • Running but Ready: false โ€” pod alive, readiness probe failing.
  • CrashLoopBackOff โ€” container keeps exiting; kubelet backs off exponentially.
  • ImagePullBackOff โ€” can't pull the image (wrong name/tag, auth, network).
  • Error / OOMKilled / CrashLoopBackOff (137) โ€” different reasons; read lastState.terminated.

Events โ€” short-lived, attached to objects. kubectl get events --sort-by=.lastTimestamp.

Logs โ€” kubectl logs pod -c container (current), --previous for the last crashed instance, -f to follow. For multi-replica, use stern or kubetail to stream across pods.

kubectl debug โ€” attaches an ephemeral container to a running pod, sharing PIDs/network/volumes. Huge for distroless pods where you can't exec sh:

bash
kubectl debug -it pod/api-abc --image=nicolaka/netshoot --target=api

Commands you should know cold โ€‹

bash
# One-shot inspection
kubectl get pod api-abc -n prod -o wide
kubectl describe pod api-abc -n prod
kubectl logs api-abc -n prod --all-containers --tail=200 --previous

# Across all pods
kubectl get pods -A --field-selector=status.phase!=Running

# Event timeline
kubectl get events -n prod --sort-by=.lastTimestamp

# Resource pressure
kubectl top nodes
kubectl top pods -n prod --sort-by=memory | head

# Debug a distroless pod with a netshoot ephemeral container
kubectl debug -it pod/api-abc -n prod --image=nicolaka/netshoot --target=api

# Shell into a broken pod by creating a fresh copy without the broken ENTRYPOINT
kubectl debug pod/api-abc -n prod --copy-to=debug --set-image=api=busybox --share-processes

# Tail logs across all pods of a deployment
stern deploy/api -n prod
# Or: kubectl logs -f -l app=api -n prod --all-containers --max-log-requests=50

# Dump everything for a post-mortem
kubectl get events,pods,rs,deploy -n prod -o yaml > postmortem.yaml

# What's in kube-system's logs (API server, scheduler) โ€” often only via audit log or cloud console on managed K8s

Gotchas & war stories โ€‹

  • Liveness probe that depends on DB โ€” one DB hiccup restarts every pod. Split: readiness checks DB, liveness only checks process.
  • initialDelaySeconds too short โ€” pod gets killed before it finishes booting. Use a Startup probe with longer failureThreshold.
  • Pod Ready: true but nothing serves โ€” readiness flips green before app is actually ready. Usually: probe checks a trivial endpoint; the main endpoint takes more warm-up. Tighten the readiness endpoint.
  • CrashLoopBackOff with exit code 0 โ€” app exiting cleanly because config is wrong (missing env var). --previous logs are your friend.
  • kubectl logs returns empty โ€” app logs to file instead of stdout. Rewrite or sidecar-ship.
  • OOMKilled at 80% of limit โ€” probably a different process in the pod exceeded the shared cgroup, or kernel page cache accounting surprise. dmesg on the node tells the truth.
  • kubectl debug with target=<container> โ€” shares namespaces. Great for inspection; but ephemeral containers can't mount new volumes (by design).

Scenario: CrashLoopBackOff โ€” full walk-through โ€‹

  1. kubectl describe pod/api-abc -n prod. Read status reason, restart count, Events.
  2. kubectl logs api-abc --previous -c api. Usually the exit reason is here (panic, failed config parse, missing env).
  3. If logs are empty, check the exit code in describe โ€” 137 = SIGKILL (OOM or liveness kill), 139 = SIGSEGV, 1 = generic app error, 2 = often a usage/argument error.
  4. If OOMKilled (137 with reason), check memory limit vs usage โ€” kubectl top pod historically, or Prometheus container_memory_working_set_bytes.
  5. If ImagePull-related, kubectl get events shows image name + error; check imagePullSecret, registry auth, image existence (docker pull from your workstation for the same tag).
  6. For subtle bugs, kubectl debug --copy-to debug --set-image api=busybox โ€” start a sidecar-only pod to poke at volumes/config before running the real image.

Interview Q&A โ€‹

  1. Liveness vs readiness vs startup โ€” what happens on failure of each? โ€” Liveness failure โ†’ container restarted by kubelet. Readiness failure โ†’ pod removed from Service endpoints; no restart. Startup failure โ†’ once its failureThreshold is exhausted, the container is killed and restarted per restartPolicy. Rule of thumb: readiness can be strict; liveness should be minimal (process alive); startup when boot is slow and liveness would otherwise trigger.

  2. A pod is CrashLoopBackOff โ€” walk me through debugging. โ€” kubectl describe pod for the exit reason and restart count. kubectl logs --previous for the last crashed instance's output. If exit code 137 = OOM or SIGKILL; check limits vs usage. If image pull issue, check events. For mysteries, kubectl debug --copy-to with a shell-friendly image to introspect config and volumes before running the real binary.

  3. kubectl describe pod โ€” what fields do you read first? โ€” Status (Phase, Ready, Reason), container state (especially LastState.Terminated reason + exit code), Events at the bottom (most recent failures). Then resources: requests/limits. If scheduling, NodeSelector/Tolerations/Affinity. If networking, IP + Service Account.

  4. Pod is Ready but users get 5xx โ€” where do you look? โ€” Readiness probe passing โ‰  app correct. Check app logs for errors. Check the Service's Endpoints โ€” sometimes label selector doesn't match. Check Ingress/Gateway routing rules. Check NetworkPolicy / mesh mTLS mismatch. curl the pod IP directly from another pod to isolate app vs networking vs ingress.

  5. ImagePullBackOff โ€” common causes? โ€” Wrong image name or tag; wrong registry; missing imagePullSecret for private registries; registry rate limits (Docker Hub's 100/6hr limits on anonymous pulls); network egress blocked; on EKS, ECR permission missing on node IAM or IRSA; image manifest mismatch (ARM vs AMD64 for a mixed node fleet).

  6. How would you debug a distroless pod with no shell? โ€” kubectl debug pod/x --image=busybox --target=<container>. The ephemeral container shares the target's namespaces, so you can poke at its /proc, network, and shared volumes with normal tools. Distroless is great for security; ephemeral containers are how you still debug.

  7. How do you tail logs across 20 replicas? โ€” stern deploy/api -n prod (or kubetail) โ€” streams logs from all matching pods with color-coded sources. kubectl logs -f -l app=api --max-log-requests=50 is the built-in version but limited.

  8. Pod disappears — no crash, just gone. What happened? — Evicted. kubectl get events -A --sort-by=.lastTimestamp usually shows "Evicted: node pressure" or "preemption by higher-priority pod." Check node conditions (kubectl describe node) — disk pressure, memory pressure. Note that node-pressure eviction bypasses PDBs; adjust the priority class and resource requests.

Further reading โ€‹


17. Kubernetes Extensibility โ€” CRDs, Operators, Webhooks โ€‹

Why this matters โ€‹

Every serious K8s user eventually hits "K8s doesn't have a native concept for X." The answer: extend K8s itself. CRDs + operators are the standard; admission webhooks for cross-cutting policy. An interviewer who asks "when would you build an operator?" is probing for "I know when not to."

Core concepts โ€‹

Custom Resource Definition (CRD). A schema defining a new API kind. Once installed, kubectl get mycustomkind works. CRDs contain just the schema + validation; they need a controller to actually do anything.

Custom Resource (CR). An instance of a CRD. Stored in etcd like any other object.

Operator pattern. CRD + controller loop. The controller watches CRs, observes the real world, and reconciles:

while true:
  desired = observe(CR)
  actual = observe(world)
  diff = compute(desired, actual)
  act(diff)

This is the same pattern as the built-in controllers (Deployment, ReplicaSet) โ€” you're writing your own.
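The same loop as a runnable sketch — plain Python dicts stand in for the CR spec and the live cluster state, and all names here are illustrative, not a real client library:

```python
# Level-triggered reconcile, mirroring the pseudocode above.

def reconcile(desired: dict, actual: dict) -> dict:
    """Compute the actions needed to converge actual onto desired."""
    actions = {}
    for key, want in desired.items():
        if actual.get(key) != want:
            actions[key] = want   # create or update
    for key in actual:
        if key not in desired:
            actions[key] = None   # prune: exists in the world, not in the CR
    return actions

def apply_actions(world: dict, actions: dict) -> dict:
    """Apply the computed diff to the (simulated) world."""
    new_world = dict(world)
    for key, val in actions.items():
        if val is None:
            new_world.pop(key, None)
        else:
            new_world[key] = val
    return new_world

desired = {"replicas": 3, "image": "api:v2"}
actual = {"replicas": 2, "image": "api:v1", "stale-resource": True}
world = apply_actions(actual, reconcile(desired, actual))
assert world == desired   # converged; next tick's diff is empty
```

Real controllers run this level-triggered loop on every watch event plus a periodic resync, with exponential backoff on errors.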

Operator SDKs.

  • Kubebuilder โ€” Go-based, most common, aligns with controller-runtime.
  • Operator SDK (Red Hat) โ€” wraps Kubebuilder + Ansible + Helm operator modes.
  • Metacontroller โ€” write controllers in any language via webhooks.
  • shell-operator / python-operator (Flant) โ€” simpler runtime.

When to build an operator:

  • Stateful software with complex lifecycle (Kafka, Postgres, Redis, Vault, Elasticsearch).
  • Complex workflow orchestrations (backup/restore, version upgrades, cluster topology changes).
  • Domain concepts that naturally fit the reconcile loop.

When NOT to build an operator:

  • Static config โ€” use a ConfigMap.
  • "Just run 3 replicas" โ€” use a Deployment.
  • Helm-chart-shaped โ€” Helm is enough; don't overbuild.

Admission webhooks.

  • MutatingAdmissionWebhook โ€” rewrite the incoming request (inject sidecars, default values, labels).
  • ValidatingAdmissionWebhook โ€” accept or reject (policy checks).
  • Run as Deployments + Service; register via MutatingWebhookConfiguration / ValidatingWebhookConfiguration.
  • Webhooks intercept every matching request; slow webhook = slow cluster.
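A minimal registration sketch — the webhook name, Service, and namespace selector are assumptions, but the fields shown are the ones that matter operationally:

yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: policy-check                     # hypothetical
webhooks:
  - name: policy.example.com
    admissionReviewVersions: [v1]
    sideEffects: None
    failurePolicy: Ignore                # don't block the cluster if the webhook dies
    timeoutSeconds: 3                    # keep short — every matching request waits on this
    clientConfig:
      service: { name: policy-svc, namespace: policy-system, path: /validate }
      caBundle: <base64 CA bundle>       # CA that signed the webhook's serving cert
    rules:
      - apiGroups: ["apps"]
        apiVersions: ["v1"]
        operations: [CREATE, UPDATE]
        resources: [deployments]
    namespaceSelector:                   # exempt system namespaces
      matchExpressions:
        - key: kubernetes.io/metadata.name
          operator: NotIn
          values: [kube-system]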

Policy engines (built on webhooks):

  • Kyverno โ€” K8s-native policies written as YAML; learn nothing new.
  • OPA Gatekeeper โ€” Rego language, more powerful for complex policies; steeper learning curve.
  • Kubewarden โ€” WASM-based policy modules.

Validating Admission Policy (1.30+ GA). Built-in policy via CEL expressions โ€” no external webhook needed for simple rules. Starting to replace webhooks for "reject if" cases.
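A sketch of a CEL policy equivalent to "every Deployment needs an app label" (policy and binding names are illustrative):

yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
  name: require-app-label          # hypothetical
spec:
  failurePolicy: Fail
  matchConstraints:
    resourceRules:
      - apiGroups: ["apps"]
        apiVersions: ["v1"]
        operations: [CREATE, UPDATE]
        resources: [deployments]
  validations:
    - expression: "has(object.metadata.labels) && 'app' in object.metadata.labels"
      message: "label 'app' is required"
---
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicyBinding
metadata:
  name: require-app-label
spec:
  policyName: require-app-label
  validationActions: [Deny]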

Finalizers. Strings on an object's metadata.finalizers. Until all finalizers are removed, delete just sets deletionTimestamp and waits. Used by controllers to clean up external resources before letting the K8s object go. Get stuck on a finalizer? kubectl patch ... -p '{"metadata":{"finalizers":null}}' โ€” but understand why first.
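What a stuck object looks like (metadata fragment; the finalizer string belongs to a hypothetical controller):

yaml
metadata:
  name: my-db                                 # hypothetical CR
  deletionTimestamp: "2026-01-01T00:00:00Z"   # delete was requested...
  finalizers:
    - example.com/cleanup-external-db         # ...but held until this is removed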

Commands / config you should know cold โ€‹

bash
# List CRDs in the cluster
kubectl get crds
kubectl api-resources --api-group=argoproj.io

# Inspect a CR
kubectl get applications.argoproj.io -A
kubectl get application my-app -n argocd -o yaml

# Webhook configurations
kubectl get mutatingwebhookconfigurations
kubectl get validatingwebhookconfigurations

# Stuck finalizer? (last resort)
kubectl patch <kind> <name> -n <ns> --type merge -p '{"metadata":{"finalizers":[]}}'

A minimal Kyverno policy:

yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata: { name: require-labels }
spec:
  validationFailureAction: Enforce
  rules:
    - name: check-labels
      match: { any: [{ resources: { kinds: [Pod] } }] }
      validate:
        message: "label 'app' is required"
        pattern:
          metadata:
            labels:
              app: "?*"

Gotchas & war stories โ€‹

  • CRD deleted with CRs still present โ€” CRs orphaned; readers break. Always check kubectl get <crd-plural> -A before deleting.
  • Webhook outage = cluster outage โ€” failurePolicy: Fail + a crashed webhook blocks all create/update in scope. Use Ignore for non-critical; exempt kube-system.
  • Webhook latency stacks โ€” 10 webhooks at 200ms each = 2s per API call. Watch apiserver_admission_webhook_admission_duration_seconds.
  • Operator reconcile storms โ€” a bug that requeues on every tick can DOS its own controller. Always back off on error, use exponential retries.
  • CRD schema drift โ€” changing required fields breaks existing CRs. Add fields as optional with defaults; use CRD versioning + conversion webhooks for breaking changes.
  • Finalizer never removed โ€” controller down, finalizer stuck forever. Patch it manually once โ€” but find the owning controller first.

Interview Q&A โ€‹

  1. What's a CRD and a Custom Resource? โ€” CRD is the schema registering a new API kind in the cluster (like saying "K8s, from now on PostgresCluster is a thing"). A CR is an instance of that kind. CRDs are just storage; they do nothing until a controller watches them.

  2. When would you build an operator? โ€” For stateful software with lifecycle operations that don't fit a vanilla Deployment: Kafka clusters (leader election, broker rebalancing, topic management), Postgres (failover, backups, schema migrations), Vault (unseal dance), complex app workflows. Not for "just run a binary" โ€” Helm is enough for that.

  3. Kyverno vs OPA Gatekeeper? โ€” Kyverno uses YAML-native policy; easier onboarding, K8s-native patterns. OPA Gatekeeper uses Rego, more expressive, better for cross-cutting multi-input policies (also policy outside K8s). Kyverno for 80% of use cases; Gatekeeper when you need Rego's power. Many shops run Kyverno today because the CNCF-graduated maturity + YAML ergonomics win.

  4. What is an admission webhook and why is it dangerous? โ€” A webhook called synchronously during API admission to mutate or validate objects. Dangerous because every matching API call waits on it; if the webhook is slow or down with failurePolicy: Fail, cluster operations block. Mitigate with short timeouts, Ignore policy for non-critical, scope by namespace, HA deployments, exempt kube-system.

  5. Validating Admission Policy (CEL) vs webhook? โ€” Policy is in-apiserver CEL expressions โ€” no external service, faster, simpler ops. Limitation: can't do external lookups. Use for "reject if label missing" kind of rules. Stick with webhooks (Kyverno/Gatekeeper) for anything needing external state or complex logic.

  6. Finalizer stuck โ€” what now? โ€” Understand why: the owning controller is supposed to remove it after cleaning up an external resource. Check if the controller pod is healthy and reconciling. If the controller is gone or hopelessly broken and the external resource is already cleaned up, patch the finalizer off directly with kubectl patch --type merge -p '{"metadata":{"finalizers":null}}'. Never patch without understanding what the finalizer was protecting.

  7. Operator SDK frameworks โ€” which would you reach for? โ€” Kubebuilder + controller-runtime for any serious Go operator โ€” it's what most CNCF projects use. Helm operator (in Operator SDK) if the domain is "install this Helm chart differently per env." Ansible operator if you're porting Ansible roles. Metacontroller + your favorite language if Go isn't an option.

Further reading โ€‹


18. OpenShift vs Vanilla Kubernetes โ€‹

Why this matters โ€‹

Red Hat OpenShift is widely adopted in federal, financial, and regulated enterprises. Interviewers probe OpenShift experience both for its own sake and as a proxy for "worked in a hardened, opinionated environment." Expect "what's different about OpenShift?" and "what are SCCs?"

Core concepts โ€‹

OpenShift = upstream Kubernetes + opinionated additions from Red Hat:

  • Routes โ€” pre-dated Ingress; L7 HTTP/HTTPS exposure with sticky sessions, passthrough/edge/re-encrypt TLS. OpenShift now supports Ingress too.
  • SCCs (Security Context Constraints) โ€” predecessor to Pod Security Standards. Granular control over runAsUser, capabilities, hostPath, privileged, volumes.
  • ImageStreams โ€” virtual refs to images, with tag tracking; decouples deploy from registry tags.
  • BuildConfigs โ€” in-cluster builds (S2I, Dockerfile, custom) triggered by git webhooks or image changes.
  • DeploymentConfigs (legacy) โ€” OpenShift's pre-Deployment workload; should migrate to Deployments for new work.
  • Projects โ€” namespaces + default network policy + resource quotas + default SCC.
  • Integrated OAuth โ€” built-in identity provider; no separate dex/oauth2-proxy needed.
  • Operator Hub โ€” curated operator catalog (community and Red Hat-certified).
  • oc CLI โ€” kubectl superset with extra verbs for OpenShift-specific kinds (oc new-app, oc expose, oc whoami, oc login).

SCCs vs PSA. SCCs are older and more granular; they are independent of Pod Security Admission (both act at admission time, and their rules can conflict). OpenShift 4.11+ runs both simultaneously — SCCs enforce the OpenShift-specific model, PSA enforces the upstream Pod Security Standards.

Default SCCs:

  • restricted / restricted-v2 — no privileged, no hostPath, dropped capabilities, runs as an arbitrary UID assigned from the namespace's range (this is the OpenShift default).
  • anyuid — honors the image's USER (e.g., USER 1001 in the Dockerfile); permits any UID, including root.
  • nonroot / nonroot-v2 โ€” any non-zero UID.
  • hostaccess, hostnetwork, hostmount-anyuid, privileged โ€” escalating permissions.

The OpenShift "random UID" quirk. restricted SCC ignores the image's USER and runs as a random UID from the namespace's range. This breaks many off-the-shelf images (they chown to a hardcoded UID). Fixes: build images with /app group-writeable (chgrp -R 0 /app && chmod -R g=u /app), or grant anyuid SCC (reviewed!), or use nonroot-v2.
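The group-0 pattern as a Dockerfile sketch (base image and paths are assumptions; any base works):

dockerfile
# Base image shown as an example — the pattern applies to any base.
FROM registry.access.redhat.com/ubi9/openjdk-17
WORKDIR /app
COPY target/app.jar /app/app.jar
# Group 0 owns everything and group perms mirror owner perms, so whatever
# arbitrary UID OpenShift assigns (always in group 0) can do what the owner can.
RUN chgrp -R 0 /app && chmod -R g=u /app
USER 1001
CMD ["java", "-jar", "/app/app.jar"]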

Routes vs Ingress.

|             | Route                         | Ingress                   |
| ----------- | ----------------------------- | ------------------------- |
| Scope       | OpenShift-specific            | Kubernetes-native         |
| Controllers | HAProxy (default), custom     | nginx, Traefik, ALB, many |
| TLS         | Edge, passthrough, re-encrypt | Edge via controller       |
| Multi-team  | Weaker isolation              | Stronger via Gateway API  |

For new apps, many OpenShift shops are moving to standard Ingress / Gateway API for portability.

Commands you should know cold โ€‹

bash
oc login https://api.cluster.example.com --token=$TOKEN
oc whoami ; oc whoami -t ; oc whoami --show-console
oc project myproject                        # like kubectl ns

# Namespace-like admin
oc new-project myapp --display-name='My App' --description='...'

# Spin up an app from source or an image
oc new-app nodejs:18~https://github.com/me/my-node-app.git
oc new-app my-registry/my-image:tag

# Routes
oc expose svc/api --hostname=api.example.com
oc create route edge api --service=api --hostname=api.example.com \
  --cert=tls.crt --key=tls.key

# SCC debugging: who granted what?
oc get scc
oc describe scc restricted-v2
oc adm policy who-can use scc restricted-v2
oc adm policy add-scc-to-user anyuid -z my-sa
oc adm policy remove-scc-from-user anyuid -z my-sa

# ImageStreams
oc get is
oc import-image my-image --from=registry/my-image:latest --confirm

# Troubleshoot with oc debug (starts an interactive debug copy of the pod with a shell, skipping the real entrypoint)
oc debug deploy/api --as-root

Gotchas & war stories โ€‹

  • USER in Dockerfile ignored โ€” OpenShift runs as a random UID. Build images that work with any UID in group 0 (chgrp 0 / chmod g+rwX).
  • Granting anyuid widely โ€” breaks the SCC model. Review each grant; prefer fixing the image.
  • Routes don't support arbitrary TCP โ€” L7 HTTP/HTTPS only. For TCP services (databases, SMTP), use NodePort / LoadBalancer / MetalLB.
  • oc vs kubectl divergence โ€” kubectl works for standard K8s, but some OpenShift objects require oc. Mixing confuses newcomers; stick with oc on OpenShift shops.
  • DeploymentConfig legacy โ€” old OpenShift workloads use DC. For new work, use Deployment โ€” DC is deprecated. But don't migrate existing DCs without testing (triggers, image change automation differ).

Anchor example โ€‹

If you've worked on an OpenShift platform, highlight the specifics: "We're on OpenShift with SCC restricted-v2 as the default โ€” we built our Spring Boot images with chgrp 0 on the app directory and USER 1001 so they run cleanly under any UID in group 0. Routes for ingress. ArgoCD on top for GitOps across 6+ services." Tie back to concrete deployments (ArgoCD 99% success rate) and the hardening decisions you've touched.

Interview Q&A โ€‹

  1. What's OpenShift and how does it differ from upstream K8s? โ€” Red Hat's opinionated, supported distribution of K8s. Adds: Routes (L7 ingress pre-Ingress), SCCs (stricter admission predating PSA), ImageStreams (indirection over registry tags), BuildConfigs (in-cluster builds), Projects (namespaces with defaults), integrated OAuth. It's still K8s under the hood; every standard API works.

  2. What are SCCs and why do they matter? โ€” Security Context Constraints โ€” admission rules tighter than vanilla K8s. They gate privileged features (hostPath, hostNetwork, capabilities, runAsUser). Default restricted-v2 assigns a random UID, drops capabilities, forbids hostPath โ€” upstream images built to run as UID 1001 fail until you build for arbitrary UIDs or grant a less-restricted SCC.

  3. Routes vs Ingress vs Gateway API โ€” pick on OpenShift? โ€” Routes are still the native and fastest path if you're OpenShift-exclusive. Ingress works and is more portable. Gateway API is the cleanest long-term architecture (separate Gateway ops from Route definition across teams). For new platforms, lean toward Gateway API or vanilla Ingress for cluster portability; use Routes when you need passthrough TLS or the native edge/re-encrypt modes.

  4. Why does my Docker image "work on minikube but not on OpenShift"? — Almost always SCC restricted-v2 running as a random UID. An image built with USER 1001 + files owned by 1001 can't read its own files under an arbitrary UID. Fix: build with group 0 file ownership + group permissions (RUN chgrp -R 0 /app && chmod -R g=u /app), or grant anyuid (not ideal), or start from a base designed for arbitrary UIDs (Red Hat's UBI images follow this convention).

  5. ImageStream โ€” why? โ€” Decouples deploy manifests from registry tags. You track a moving tag (:latest, :v1) via an ImageStream; when a new image lands in the external registry, the ImageStream's tag event triggers a DeploymentConfig or BuildConfig. Lost relevance a bit with modern GitOps (you usually pin SHA in the manifest anyway); still handy for CI triggers.

  6. How would you migrate from OpenShift to vanilla EKS? — Replace Routes with Ingress or Gateway API. Rewrite SCC-dependent manifests to satisfy Pod Security Admission restricted. Drop ImageStreams; pin images by digest. Convert DeploymentConfigs to Deployments. Replace oc new-app / BuildConfigs with a CI pipeline. Replace OpenShift's integrated OAuth with an external OIDC identity provider for user auth (and IRSA for pod-to-AWS access). Non-trivial — budget real time.

Further reading โ€‹


19. Helm & Package Management โ€‹

Why this matters โ€‹

Helm is the de-facto K8s package manager. Installing, upgrading, and composing Helm charts is a daily activity; writing and publishing a chart is a weekly one. Interviewers will ask you the mechanics (what is Chart.yaml, what are hooks) and the architectural decisions (umbrella chart vs library chart, Helm vs Kustomize).

Core concepts โ€‹

Helm terminology.

  • Chart โ€” a directory with Chart.yaml, values.yaml, templates/, etc. Packaged as a .tgz.
  • Release โ€” a specific install of a chart into a cluster (named).
  • Repository โ€” a place charts are published (HTTP or OCI).
  • Values โ€” user-supplied inputs merged with defaults.

Chart structure:

mychart/
โ”œโ”€โ”€ Chart.yaml              # name, version, appVersion, deps
โ”œโ”€โ”€ values.yaml             # defaults
โ”œโ”€โ”€ values.schema.json      # (optional) JSON schema to validate values
โ”œโ”€โ”€ templates/
โ”‚   โ”œโ”€โ”€ _helpers.tpl        # named templates + functions
โ”‚   โ”œโ”€โ”€ deployment.yaml
โ”‚   โ”œโ”€โ”€ service.yaml
โ”‚   โ”œโ”€โ”€ ingress.yaml
โ”‚   โ”œโ”€โ”€ NOTES.txt           # printed after install
โ”‚   โ””โ”€โ”€ tests/              # helm test hooks
โ”œโ”€โ”€ charts/                 # subcharts (deps resolved here)
โ””โ”€โ”€ crds/                   # CRDs installed once, outside templating

Chart.yaml:

yaml
apiVersion: v2
name: api
description: The API service
type: application
version: 1.4.2        # chart version (semver)
appVersion: "2.8.0"   # the app's version (arbitrary)
dependencies:
  - name: postgresql
    version: ~14.0
    repository: oci://registry-1.docker.io/bitnamicharts
    condition: postgresql.enabled

Values hierarchy (lowest to highest precedence):

  1. Chart defaults (values.yaml).
  2. Dependency overrides (parent chart overrides subcharts).
  3. -f file.yaml (user-supplied values file).
  4. --set key=val / --set-string / --set-file / --set-json.

Most teams maintain per-env values files: values-dev.yaml, values-staging.yaml, values-prod.yaml.

Templating. Go templates + Sprig function library. Common idioms:

gotemplate
{{- define "mychart.fullname" -}}
{{- printf "%s-%s" .Release.Name .Chart.Name | trunc 63 | trimSuffix "-" -}}
{{- end }}

{{- if .Values.ingress.enabled }}
# ... Ingress manifest
{{- end }}

{{- range .Values.extraEnv }}
- name: {{ .name }}
  value: {{ .value | quote }}
{{- end }}

# Pod-restart-on-configmap-change checksum trick
annotations:
  checksum/config: {{ include (print $.Template.BasePath "/configmap.yaml") . | sha256sum }}

Hooks. Templates with helm.sh/hook: ... annotations run at specific lifecycle phases:

  • pre-install, post-install, pre-upgrade, post-upgrade, pre-delete, post-delete, pre-rollback, post-rollback, test.
  • Common uses: schema migrations (pre-upgrade Job), cert generation (pre-install), smoke tests (test).
  • helm.sh/hook-weight orders hooks of the same phase.
  • helm.sh/hook-delete-policy: hook-succeeded cleans up after success.
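A sketch of the classic pre-upgrade migration Job (image and command are placeholders):

yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: "{{ .Release.Name }}-db-migrate"
  annotations:
    "helm.sh/hook": pre-upgrade
    "helm.sh/hook-weight": "0"
    "helm.sh/hook-delete-policy": hook-succeeded
spec:
  backoffLimit: 1
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: migrate
          image: "my-registry/api-migrations:{{ .Values.image.tag }}"  # placeholder
          command: ["./migrate", "up"]                                 # placeholder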

helm test โ€” runs pods annotated with helm.sh/hook: test. Simple smoke test harness post-deploy.

Library charts. type: library in Chart.yaml. Contains only named templates, never installs its own resources. Used for cross-cutting shared templates (standard Deployment boilerplate, labels).

Umbrella charts. A chart whose sole purpose is pulling in other charts (dependencies: list). Deploy a whole app via one helm install.

Dependencies.

bash
helm dep update       # pulls deps into charts/
helm dep build        # just populate from Chart.lock

OCI chart registries. Helm 3 can push charts to OCI-compliant registries (ECR, GHCR, Artifactory, Harbor). helm push mychart-1.4.2.tgz oci://registry/charts. Modern alternative to HTTP repos.

chart-testing (ct). CI tool that lints + installs charts changed in a PR.

Helm 3 vs Helm 2. Helm 2 had Tiller (server-side component with cluster-admin) โ€” a security landmine. Helm 3 is client-only; RBAC is your kubeconfig's. Always Helm 3.

Helm vs Kustomize.

  • Helm โ€” full templating; package+version charts; hooks; upgrade/rollback.
  • Kustomize โ€” patch/overlay YAML without templating; built into kubectl. Simpler for "base + env overlays" patterns.
  • Kustomize inside Helm โ€” possible but awkward.
  • Reality โ€” most large shops use both: Helm for third-party packages (Prometheus, cert-manager, Istio) and internal shared charts; Kustomize for simple app-specific patches.

Commands you should know cold โ€‹

bash
# Discover
helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo update
helm search repo postgres
helm search hub postgres                 # Artifact Hub

# Install / upgrade
helm install api ./mychart -n prod --create-namespace \
  -f values-prod.yaml --set image.tag=abc123

helm upgrade --install api ./mychart -n prod \
  -f values-prod.yaml --set image.tag=abc123 \
  --atomic --timeout 10m

# Render templates without applying (debug)
helm template ./mychart -f values-prod.yaml | less
helm install --dry-run --debug api ./mychart

# Inspect
helm list -A
helm history api -n prod
helm status api -n prod
helm get values api -n prod
helm get manifest api -n prod        # rendered YAML of current release

# Rollback
helm rollback api 3 -n prod --cleanup-on-fail

# Uninstall
helm uninstall api -n prod --keep-history        # can still rollback

# Deps
helm dep update

# OCI
helm package ./mychart
helm push mychart-1.4.2.tgz oci://ghcr.io/my-org/charts
helm install api oci://ghcr.io/my-org/charts/mychart --version 1.4.2 -n prod

# chart-testing in CI
ct lint --chart-dirs charts
ct install --chart-dirs charts --helm-extra-args "--timeout 600s"

Gotchas & war stories โ€‹

  • Whitespace in templates โ€” Go templates are whitespace-sensitive; forgetting {{- (dash) leaves blank lines that break strict YAML parsers. Always render with helm template and kubectl apply --dry-run=server in CI.
  • helm upgrade that partially fails โ€” leaves the release in failed state; next upgrade can skip steps. Use --atomic (roll back on failure) + --cleanup-on-fail.
  • CRDs installed by Helm — the crds/ folder is NOT templated and is only applied on first install. CRD changes shipped in a chart upgrade are silently ignored. Manage CRDs out-of-band (kubectl apply) or in a dedicated CRD-only chart installed first.
  • ConfigMap updates don't roll pods — use the checksum/config annotation pattern, or a reloader controller (e.g., Stakater Reloader) that watches ConfigMaps/Secrets and restarts the consuming workloads.
  • Subchart values aliasing โ€” you must use the subchart's name as the top-level key in values.yaml (postgresql: { auth: { username: app } }) and respect the subchart's schema.
  • Shipping secrets in values โ€” values committed to Git. Use --set-file secrets.tls=./tls.crt, SOPS+Helm secrets plugin, or External Secrets rather than hardcoded values.
  • Helm + ArgoCD value rendering โ€” ArgoCD renders Helm client-side by default; hooks and test pods may behave differently than helm install. Tune via ArgoCD's Helm settings.

Anchor example โ€‹

A practical framing: "We standardized on a library chart for all Spring Boot services โ€” one place to own probe shape, prometheus annotations, pod security context, and label conventions. Per-service charts consume it, so adding a new service is 30 lines of YAML instead of 300." If asked about Helm vs Kustomize: "Helm for third-party (Prometheus, cert-manager, ArgoCD) and our internal shared chart; Kustomize would work but we already have the Helm toolchain and the templating power helps with our probe/config patterns."

Interview Q&A โ€‹

  1. Walk me through a Helm chart's structure. โ€” Chart.yaml with name, version, appVersion, deps. values.yaml with configurable defaults. templates/ folder with Go-templated K8s manifests, plus _helpers.tpl for named template functions. charts/ for subchart copies. crds/ for CRDs installed once. Optional values.schema.json for input validation, NOTES.txt for install output.

  2. How does value precedence work? โ€” Defaults in chart's values.yaml < parent-chart overrides < user -f file.yaml (later files override earlier) < --set on the CLI. This lets you layer a base values file, per-env file, and one-off overrides.

  3. What's a Helm hook and give a use case? โ€” A template annotated with helm.sh/hook: <phase> that runs at a lifecycle boundary. Classic: a pre-upgrade Job that runs DB migrations before the new pods roll out. Also: pre-install Job creating a certificate before Deployments need it; test hooks for post-deploy smoke tests triggered by helm test.

  4. Library chart vs umbrella chart? โ€” Library chart is type: library and contains only reusable templates โ€” you can't install it directly. Use for "standard Deployment shape" across many services. Umbrella chart is a normal chart whose purpose is bundling other charts via dependencies: โ€” useful for "deploy my whole app (API + DB + cache + ingress) with one release."

  5. Helm vs Kustomize โ€” how do you decide? โ€” Helm when you want full templating, packaging, versioning, and hooks โ€” the default for third-party software and internal reusable components. Kustomize when you just need a base manifest + env-specific patches without templating complexity โ€” lighter footprint, built into kubectl. Many teams use both: Helm for charts, Kustomize for small per-env tweaks on top.

  6. Why not template with raw kubectl and envsubst? โ€” You lose packaging (no versioned tarball), dependency management, upgrade/rollback semantics, and hooks. Helm's Go-template + Sprig is more powerful than envsubst, and the ecosystem (chart registries, lint, test) is massive. envsubst for trivial cases is fine; serious packaging needs Helm or Kustomize.

  7. How do you upgrade a chart with a CRD change? โ€” Helm's crds/ directory is installed once and never modified by upgrades. You must upgrade CRDs out-of-band: kubectl apply -f crds/ manually, or via a separate CRD-only chart, or via an operator. Modern pattern: split CRDs into their own chart and install them first.

  8. How do you test a chart? โ€” helm lint for syntax. helm template + kubectl apply --dry-run=server --server-side for render validation. chart-testing (ct) in CI for install-and-test on a kind cluster. helm test post-install for in-cluster smoke. Unit tests with helm-unittest plugin for template assertions.

Further reading โ€‹


Part IV โ€” GitOps & Progressive Delivery โ€‹

20. GitOps Principles โ€‹

Why this matters โ€‹

GitOps is the dominant deployment paradigm in cloud-native shops. Every serious DevOps interview has at least one GitOps question because it directly shapes how you answer every deployment-related question that follows. "What's GitOps?" tests vocabulary; "why is pull-based better than push-based?" tests reasoning.

Core concepts โ€‹

GitOps as defined by OpenGitOps (CNCF):

  1. Declarative โ€” the system's desired state is described declaratively (manifests, not scripts).
  2. Versioned & immutable โ€” desired state lives in Git; every change is a commit, every revert is a revert.
  3. Pulled automatically โ€” software agents pull the desired state from Git (not CI pushing into the cluster).
  4. Continuously reconciled โ€” agents observe live state, detect drift, and converge.

Push-based (classic CI/CD):

CI runner  โ”€โ”€โ”€โ”€โ”€[kubectl apply]โ”€โ”€โ”€โ”€โ”€โ–ถ Cluster

CI holds cluster creds, applies directly. Simple, but: CI needs write access to cluster (big RBAC target), no continuous reconciliation (drift goes unseen), rollback requires re-running CI.

Pull-based (GitOps):

Cluster agent โ—€โ”€โ”€โ”€โ”€[git pull]โ”€โ”€โ”€โ”€ Git repo โ—€โ”€โ”€โ”€โ”€ CI (commit-only)

CI commits rendered manifests or Helm values to a Git repo. In-cluster agent (ArgoCD, Flux) reconciles the cluster toward Git. Benefits:

  • No inbound cluster access needed — agents pull outbound from Git; CI never holds credentials into the cluster.
  • Continuous reconciliation โ€” drift detected and corrected automatically.
  • Git is audit log โ€” every change is a commit, signed + reviewed.
  • Rollback = revert โ€” git revert <sha> and the agent reconciles.
  • Onboarding a new cluster = point an agent at the repo.

Repo topology patterns.

  • Single repo (app + manifests) โ€” simple; harder for cross-service rollouts.
  • App repo + config repo โ€” CI builds image in app repo, commits image: tag bump to config repo; ArgoCD watches config repo. Dominant pattern.
  • Repo per environment vs branches per environment vs directories per environment โ€” trade-offs: branches are rarely a good fit for environments; directories or separate repos work better with ArgoCD ApplicationSets.

Rendered manifests vs source. Either:

  • Store Helm charts + values in Git; ArgoCD renders at sync. Pro: simple. Con: hard to review what will actually apply.
  • Render with CI (helm template > rendered/); store the rendered YAML in a separate repo. Pro: exact review. Con: more moving parts.
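A hedged sketch of the CI rendering step for the second option (repo layout, chart name, and env list are assumptions):

bash
# CI: render each environment to plain YAML in a rendered/ tree,
# then commit to the config repo that ArgoCD/Flux watches.
set -euo pipefail
for env in dev staging prod; do
  mkdir -p "rendered/${env}"
  helm template api ./charts/api -f "./charts/api/values-${env}.yaml" \
    > "rendered/${env}/api.yaml"
done
git add rendered/
git commit -m "render: api manifests"
git push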

Sync policies.

  • Manual โ€” human triggers sync.
  • Automated โ€” agent reconciles on every change.
  • Self-heal โ€” revert manual cluster changes.
  • Prune โ€” delete resources removed from Git (crucial; without it Git becomes append-only).

Progressive delivery on top of GitOps. Argo Rollouts / Flagger replace the Deployment rollout with canary/blue-green analysis โ€” live metrics from Prometheus decide whether to progress or roll back.

ApplicationSet / Kustomize overlays / Helm umbrella. Patterns for "one source of truth, many deployments" across clusters or tenants.

Gotchas & war stories โ€‹

  • Commits that can't be reconciled โ€” invalid YAML, missing CRDs, resources referenced but not present. Lint + render + kubectl diff --server-side in CI before commit.
  • Drift loops โ€” a mutating webhook adds a label; ArgoCD sees it as drift; reverts; webhook re-adds. Teach ArgoCD to ignore specific fields (ignoreDifferences).
  • Manual "kubectl apply" bypasses GitOps โ€” self-heal will undo it, which can surprise operators mid-incident. Define breakglass procedures (temporarily disable self-heal, document, re-sync).
  • Secrets in Git โ€” plaintext is a no-go. SOPS + age, Sealed Secrets, or External Secrets Operator. Never commit unencrypted sensitive values.
  • Big-bang cluster reconfig โ€” a config repo change that rolls 50 services simultaneously. Use ApplicationSets with sync waves and PR-based change review.
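For the drift-loop case, an ignoreDifferences fragment on an ArgoCD Application (the webhook-owned label here is hypothetical):

yaml
# Fragment of an ArgoCD Application spec
spec:
  ignoreDifferences:
    - group: apps
      kind: Deployment
      jsonPointers:
        - /metadata/labels/injected-by-webhook   # hypothetical webhook-owned label
  syncPolicy:
    automated: { selfHeal: true }
    syncOptions:
      - RespectIgnoreDifferences=true            # honor the list during sync, not just diff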

Interview Q&A โ€‹

  1. What is GitOps? โ€” A declarative, pull-based deployment model where Git is the single source of truth for cluster state. Agents running in each cluster continuously reconcile live state to match Git. Four principles: declarative, versioned/immutable, pulled automatically, continuously reconciled.

  2. Pull vs push-based deployments โ€” why pull? โ€” Pull keeps credentials and logic inside the cluster (no inbound from CI), detects drift continuously (self-healing), treats Git as the audit log (every change is a signed commit), and makes rollback a git revert. Push-based requires CI to hold cluster-admin creds, doesn't correct drift, and rollback is a new pipeline run.

  3. How do you handle secrets in GitOps? โ€” Never commit plaintext. Options: Sealed Secrets (encrypt once for that cluster's controller key), SOPS + age / KMS (encrypt in Git, decrypt at sync via plugin), External Secrets Operator (fetch from Vault/AWS Secrets Manager at runtime โ€” usually the winner), CSI Secrets Store (mount as files without K8s Secret objects).

  4. How do you do rollback in GitOps? โ€” git revert <bad-commit> in the config repo. ArgoCD/Flux notices the new HEAD and reconciles the cluster back. For emergency: use the controller's manual rollback (argocd app rollback) to point at an older ref; then create the git revert so state and Git match.

  5. Monorepo vs separate config repo? โ€” Separate config repo is the dominant pattern. Your app repo builds and pushes an image; CI opens a PR in the config repo updating the image tag. The config repo has reviewers that focus on deployment changes; access can be scoped tighter than source; and cross-service changes become explicit coordinated PRs.

  6. What's drift and how does GitOps handle it? โ€” Drift = cluster state diverged from Git (someone kubectl-edited, a webhook mutated, a controller changed a value). GitOps agents detect drift on every reconcile and, if selfHeal is on, revert it. Legitimate "platform-owned" fields can be excluded via ignoreDifferences.

  7. When is GitOps a bad fit? โ€” Stateful data migrations (Git doesn't hold DB schema state well โ€” needs a migration tool). Extremely dynamic configs (per-request; lots of autoscaling knobs). One-off debug operations (breakglass). Ephemeral dev environments where commit churn defeats the purpose. Even then, hybrid: core deploys GitOps, dev envs imperative.
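
The External Secrets Operator option from Q3 can be sketched as a minimal manifest. This is a hedged example, not a canonical one: the store name, namespace, and Secrets Manager path are illustrative, and it assumes ESO is installed with a `ClusterSecretStore` already configured.

```yaml
# Hypothetical sketch: fetch a DB password from AWS Secrets Manager
# into a regular Kubernetes Secret at runtime. Names are illustrative.
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: api-db
  namespace: prod
spec:
  refreshInterval: 1h            # re-fetch periodically; rotation flows through
  secretStoreRef:
    name: aws-secrets            # a ClusterSecretStore you define separately
    kind: ClusterSecretStore
  target:
    name: api-db                 # the K8s Secret that gets created/updated
  data:
    - secretKey: password
      remoteRef:
        key: prod/api/db         # path in Secrets Manager
        property: password
```

Only this pointer manifest lives in Git; the secret value never does.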


21. ArgoCD โ€‹

Why this matters โ€‹

ArgoCD is the most-deployed GitOps controller. Interview depth here tracks closely with day-to-day operational responsibility (e.g., "we hit a 99% deploy success rate"). Expect "walk through an ArgoCD Application manifest," "what's an ApplicationSet," and "how would you structure ArgoCD for 10+ microservices."

Core concepts โ€‹

Architecture.

  • argocd-server โ€” API + Web UI, auth, RBAC, SSO.
  • argocd-repo-server โ€” clones Git, renders manifests (Helm, Kustomize, Jsonnet, plain YAML).
  • argocd-application-controller โ€” the reconcile loop; compares rendered manifests to live cluster state.
  • argocd-applicationset-controller โ€” generates Applications from generators (cluster, list, git, matrix, merge, pull-request).
  • argocd-notifications-controller โ€” Slack/Teams/email notifications on sync status.
  • argocd-dex-server โ€” OIDC broker (or use external OIDC directly).
  • redis โ€” cache for repo/cluster state.

Application โ€” a single reconciled unit, pointing at a Git path + cluster + namespace.

yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: api
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/org/config
    targetRevision: main
    path: apps/api/envs/prod
    # or Helm:
    # helm:
    #   valueFiles: [values-prod.yaml]
  destination:
    server: https://kubernetes.default.svc
    namespace: prod
  syncPolicy:
    automated: { prune: true, selfHeal: true }
    syncOptions: [CreateNamespace=true, ApplyOutOfSyncOnly=true]
    retry:
      limit: 5
      backoff: { duration: 5s, factor: 2, maxDuration: 3m }
  ignoreDifferences:
    - group: apps
      kind: Deployment
      jsonPointers: ["/spec/replicas"]     # HPA owns replicas

Sync policies.

  • automated โ€” sync on any Git change.
  • prune โ€” delete K8s objects removed from Git.
  • selfHeal โ€” revert manual changes in the cluster.
  • syncOptions โ€” CreateNamespace=true, Replace=true, ServerSideApply=true, ApplyOutOfSyncOnly=true.

Sync waves & hooks. Order resources inside an Application via argocd.argoproj.io/sync-wave annotation. Example: CRDs wave 0, controllers wave 1, CRs wave 2. Pre/post-sync hooks (argocd.argoproj.io/hook: PreSync|Sync|PostSync|SyncFail) run Jobs around sync.
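
A PreSync migration hook from the paragraph above might look like this sketch (image and command are placeholders):

```yaml
# Hedged sketch of a PreSync migration Job. Jobs are immutable once run,
# so the delete policy recreates it on every sync.
apiVersion: batch/v1
kind: Job
metadata:
  name: db-migrate
  annotations:
    argocd.argoproj.io/hook: PreSync
    argocd.argoproj.io/hook-delete-policy: BeforeHookCreation
spec:
  backoffLimit: 1
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: migrate
          image: org/api-migrations:1.4.2   # placeholder image
          command: ["./migrate", "up"]
```

The sync waits for the hook Job to complete before applying the rest of the Application.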

Health checks. Built-in health states: Healthy, Progressing, Degraded, Missing, Unknown. Custom Lua health checks for CRDs.
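
A custom Lua health check is registered in the `argocd-cm` ConfigMap. This sketch assumes cert-manager's Certificate CRD and follows its `status.conditions` shape; adapt the key and fields to your CRD:

```yaml
# Sketch: teach ArgoCD how to judge health of a CRD via Lua.
# Key format is resource.customizations.health.<group>_<Kind>.
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
data:
  resource.customizations.health.cert-manager.io_Certificate: |
    hs = {}
    hs.status = "Progressing"
    hs.message = "Waiting for certificate"
    if obj.status ~= nil and obj.status.conditions ~= nil then
      for _, c in ipairs(obj.status.conditions) do
        if c.type == "Ready" and c.status == "True" then
          hs.status = "Healthy"
          hs.message = c.message
        end
      end
    end
    return hs
```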

App-of-apps โ€” one parent Application whose path contains child Application manifests. Scales poorly with many clusters โ€” prefer ApplicationSet.

ApplicationSet generators.

  • List โ€” iterate over a static list.
  • Cluster โ€” iterate over clusters registered in ArgoCD.
  • Git โ€” iterate over directories or files matching a glob in a repo.
  • Matrix โ€” Cartesian product of two generators (e.g., cluster ร— git directory).
  • Merge โ€” combine generators by key.
  • Pull request โ€” spin up an Application per open PR (preview envs).
  • SCM provider โ€” iterate over repos in a GitHub/GitLab org.

Canonical pattern: one Git repo with apps/<svc>/envs/<env>/, ApplicationSet matrix: clusters ร— environments.

Projects (AppProjects). RBAC + allow-lists: source repos, destination clusters+namespaces, permitted resource kinds, orphan policy. Multi-tenant hardening.
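
A hedged AppProject sketch for one tenant (repo, project, and namespace names are illustrative):

```yaml
# Multi-tenant guardrails: this team can only deploy from its own config
# repo, into its own namespaces, with no cluster-scoped objects.
apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: team-payments
  namespace: argocd
spec:
  sourceRepos:
    - https://github.com/org/payments-config
  destinations:
    - server: https://kubernetes.default.svc
      namespace: payments-*
  clusterResourceWhitelist: []        # empty = no cluster-scoped resources allowed
  namespaceResourceBlacklist:
    - group: ""
      kind: ResourceQuota             # platform-owned; tenants can't set their own
```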

Auto-sync with PR-based change review. Config repo protected-branch rules + required reviewers; only merged commits roll out. No direct cluster access for devs.

ArgoCD Image Updater. Separate controller that watches registries for new image tags matching a policy (semver, regex, digest) and commits updates back to the config repo โ€” closes the "CI pushes image, something has to commit the tag bump" loop automatically.
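
Image Updater is driven by annotations on the Application; a hedged sketch (the `api` alias and ECR path are placeholders):

```yaml
# Fragment of an Application's metadata enabling Image Updater:
# watch the registry for semver tags and commit bumps back to Git.
metadata:
  annotations:
    argocd-image-updater.argoproj.io/image-list: api=123456789012.dkr.ecr.us-east-1.amazonaws.com/api
    argocd-image-updater.argoproj.io/api.update-strategy: semver
    argocd-image-updater.argoproj.io/write-back-method: git
```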

ArgoCD Notifications. Rules-based: "notify Slack on sync failure," "notify team on OutOfSync > 15min."

RBAC. argocd-rbac-cm ConfigMap maps SSO groups to roles; roles have permissions on projects/applications/clusters.
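
A minimal `argocd-rbac-cm` sketch, assuming a hypothetical SSO group and project name:

```yaml
# Map an SSO group to a role that may only sync apps in one project.
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-rbac-cm
  namespace: argocd
data:
  policy.default: role:readonly
  policy.csv: |
    p, role:payments-deployer, applications, sync, team-payments/*, allow
    g, org:payments-team, role:payments-deployer
```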

Commands you should know cold โ€‹

bash
# Login (OIDC/SSO)
argocd login argocd.example.com --sso

# Apps
argocd app list
argocd app get api
argocd app diff api
argocd app sync api --prune
argocd app history api
argocd app rollback api <revision>
argocd app delete api --cascade

# Live-check a specific resource
argocd app resources api

# Projects & RBAC
argocd proj list
argocd proj role list my-proj

# Trigger a hard refresh (re-fetch Git + re-render)
argocd app get api --hard-refresh

# Force-sync a specific resource
argocd app sync api --resource apps:Deployment:prod/api

ArgoCD sample ApplicationSet โ€” clusters ร— services โ€‹

yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata: { name: microservices, namespace: argocd }
spec:
  generators:
    - matrix:
        generators:
          - clusters: { selector: { matchLabels: { env: prod } } }
          - git:
              repoURL: https://github.com/org/config
              revision: main
              directories:
                - path: apps/*
  template:
    metadata: { name: '{{path.basename}}-{{name}}' }
    spec:
      project: default
      source:
        repoURL: https://github.com/org/config
        targetRevision: main
        path: '{{path}}/envs/prod'
      destination:
        server: '{{server}}'
        namespace: '{{path.basename}}'
      syncPolicy: { automated: { prune: true, selfHeal: true } }

Gotchas & war stories โ€‹

  • Health check false positives โ€” for CRDs with non-standard status fields, ArgoCD reports Unknown; write a custom health script or the app never shows Healthy.
  • Replicas drift with HPA โ€” HPA changes spec.replicas; ArgoCD sees drift and reverts; HPA re-scales; loop. Use ignoreDifferences for /spec/replicas on Deployments with HPA.
  • Helm chart rendering differences โ€” ArgoCD's Helm render vs helm install differ on hooks and tests. Test sync in a staging cluster before touching prod.
  • Sync wave ordering with Jobs โ€” a pre-sync migration Job that's annotated wave 0 won't hold up wave 1 Deployments unless using hooks (not just waves). Read the doc on waves vs hooks.
  • Auto-prune wipes something important โ€” if someone accidentally removes a resource from Git with prune on, it's deleted. Protect with argocd.argoproj.io/sync-options: Prune=false on critical resources, or use finalizers.
  • ApplicationSet explosion โ€” a matrix generator with 10 clusters ร— 50 apps = 500 Applications churning every reconcile. Size the application controller shards (ARGOCD_APPLICATION_CONTROLLER_REPLICAS).

Anchor example โ€‹

A 99% deploy success rate maps directly to an ArgoCD-driven change failure rate of ~1% โ€” elite-tier in DORA. In an interview, make this concrete: "We ran ArgoCD with automated: { prune: true, selfHeal: true } across ~6 services; Image Updater closed the loop from ECR push to Git commit; and our 1% failure rate came from a combination of self-heal (manual drift was reverted before it became a 'mystery outage'), strong pre-merge lint of manifests, and rollback = git revert."

Interview Q&A โ€‹

  1. Walk through an ArgoCD Application manifest. โ€” It declares a source (Git repo + revision + path, plus Helm or Kustomize config), a destination (cluster + namespace), and a sync policy (automated, prune, selfHeal, retry). ArgoCD's application controller reconciles by rendering the source, diffing against the cluster, and applying โ€” continuously.

  2. App-of-apps vs ApplicationSet โ€” when each? โ€” App-of-apps (a parent Application that points at a path full of child Application manifests) is fine for small scale. ApplicationSet with generators scales better: one object produces N Applications by iterating a matrix of clusters ร— directories, clusters registered by label, or PR preview envs. Use ApplicationSet for anything beyond ~10 Applications.

  3. How do you keep ArgoCD from fighting HPA? โ€” HPA writes spec.replicas; ArgoCD sees drift from Git. Add an ignoreDifferences entry for apps/Deployment on /spec/replicas. Same pattern for any controller that mutates managed fields.

  4. Sync waves, hooks, and why use them. โ€” Waves order resources inside an Application (CRDs before CRs, Deployments before HPAs). Hooks run Jobs around sync: PreSync for migrations, PostSync for smoke tests, SyncFail for notification. Without them, ArgoCD applies everything in parallel and hopes for the best.

  5. How do you roll back a bad deploy? โ€” Most common: git revert <sha> in the config repo; ArgoCD reconciles. Emergency: argocd app rollback api <revision> to an older Git hash, then push a matching revert so state and Git reconverge.

  6. How do you scale ArgoCD across 20+ clusters? โ€” Register each cluster as a destination; use ApplicationSet with a cluster generator. Shard the application-controller (ARGOCD_APPLICATION_CONTROLLER_REPLICAS + cluster-sharding). Scale repo-server for cache hits. Consider argocd-server --rootpath hosting behind a single ingress. Tune timeout.reconciliation (default 3m) for large estates.

  7. How did you achieve 99% deploy success rate? โ€” "Automated self-heal + prune meant drift was continuously reverted, catching config errors immediately. Image Updater automatically bumped tags from ECR. CI ran argocd app diff in PR to preview what the merge would change. Rollback was git revert. Operators never kubectl-edited in prod โ€” the self-heal discipline enforced it. The 1% failures were almost always broken user-code โ€” the deploy substrate itself rarely failed."

  8. Multi-tenancy with ArgoCD โ€” how? โ€” AppProjects scope each team: allowed source repos, destination clusters/namespaces, allowed resource kinds (e.g., no ClusterRoleBindings). RBAC maps SSO groups to project roles. ApplicationSet templates enforce naming conventions. Users only see their projects' apps in the UI.


22. Flux CD โ€‹

Why this matters โ€‹

Flux is ArgoCD's main GitOps-controller peer โ€” both CNCF graduates. Many shops pick one or the other based on philosophy. Being able to articulate the differences is expected.

Core concepts โ€‹

Architecture. Flux is a set of controllers, each with a small focused job:

  • source-controller โ€” fetches manifests (Git, Helm repo, OCI, Bucket).
  • kustomize-controller โ€” applies Kustomize overlays.
  • helm-controller โ€” installs/upgrades Helm releases.
  • notification-controller โ€” events + alerts.
  • image-reflector-controller & image-automation-controller โ€” image tag automation (Flux's version of Image Updater).

CRDs.

  • GitRepository / HelmRepository / OCIRepository / Bucket โ€” sources.
  • Kustomization โ€” apply a Kustomize overlay from a source.
  • HelmRelease โ€” install/upgrade a chart from a source with values.
  • ImagePolicy / ImageUpdateAutomation โ€” watch registries, commit tag bumps.
  • Alert, Provider, Receiver โ€” notifications.

Flux vs ArgoCD trade-offs.

|                  | Flux                                              | ArgoCD                              |
|------------------|---------------------------------------------------|-------------------------------------|
| UI               | Minimal (Weave GitOps / Flamingo / Headlamp plugins) | First-class web UI               |
| Multi-cluster    | Each cluster runs its own Flux + sources          | One ArgoCD manages many clusters    |
| RBAC             | K8s RBAC + CRD ownership                          | App + Project + SSO mapping         |
| Image automation | Native (controllers ship with Flux)               | Via Image Updater (separate install)|
| Philosophy       | Smaller, modular, CLI-first                       | Larger, batteries-included          |

Flux is the purest fit for a "one agent per cluster, itself managed via GitOps" philosophy; ArgoCD fits "central control plane with visual ops."

Commands you should know cold โ€‹

bash
# Bootstrap flux into a cluster (creates manifests in your Git repo)
flux bootstrap github --owner=org --repository=fleet-infra --branch=main \
  --path=clusters/prod --personal

# Sync & inspect
flux get sources git -A
flux get kustomizations -A
flux get helmreleases -A
flux reconcile kustomization apps --with-source
flux suspend kustomization apps
flux resume kustomization apps

Interview Q&A โ€‹

  1. Flux vs ArgoCD โ€” how do you pick? โ€” Flux if you want a smaller, more modular, CLI-first agent per cluster with first-class image automation baked in โ€” good fit for fleets. ArgoCD if you want a central UI, rich project-based RBAC, and your team wants a single pane of glass for many clusters. Both are CNCF-graduated; either is a fine choice.

  2. How does Flux do image automation? โ€” ImagePolicy watches a registry for tags matching a policy (semver, filter). ImageUpdateAutomation commits back to the config repo when a matching new image appears. Then source-controller notices the commit, kustomize/helm controller applies โ€” closes the CI-to-deploy loop without CI needing cluster creds.

  3. What's a Kustomization in Flux? โ€” A CRD pointing at a path in a source (GitRepository) to apply as a Kustomize overlay. Multiple Kustomizations can target the same source at different paths, with dependencies between them (dependsOn:). It's Flux's unit of deploy.
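
The image-automation flow from Q2 can be sketched with three CRDs. This is a hedged example: the registry URL, semver range, and paths are illustrative, and the API versions follow the Flux 2.x betas.

```yaml
# Watch a registry, select tags by policy, commit bumps back to Git.
apiVersion: image.toolkit.fluxcd.io/v1beta2
kind: ImageRepository
metadata: { name: api, namespace: flux-system }
spec:
  image: 123456789012.dkr.ecr.us-east-1.amazonaws.com/api   # placeholder
  interval: 1m
---
apiVersion: image.toolkit.fluxcd.io/v1beta2
kind: ImagePolicy
metadata: { name: api, namespace: flux-system }
spec:
  imageRepositoryRef: { name: api }
  policy:
    semver: { range: ">=1.0.0" }
---
apiVersion: image.toolkit.fluxcd.io/v1beta2
kind: ImageUpdateAutomation
metadata: { name: api, namespace: flux-system }
spec:
  interval: 5m
  sourceRef: { kind: GitRepository, name: config }
  git:
    commit:
      author: { name: fluxbot, email: flux@example.com }
  update: { path: ./apps, strategy: Setters }   # rewrites tags marked with setters
```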


23. Progressive Delivery โ€‹

Why this matters โ€‹

Rolling deploys surface bugs only once pods are already taking real traffic — too late. Canary / blue-green with automated analysis makes "did this deploy degrade p99?" a gate, not a post-mortem. Expect questions on Argo Rollouts, Flagger, and feature flags.

Core concepts โ€‹

Strategies:

  • Rolling (K8s default via Deployment) โ€” incrementally replace pods.
  • Blue/Green โ€” run full new env ("green"), cut over at the LB. Simple rollback (flip back). Costs 2ร— capacity during cutover.
  • Canary โ€” serve a small % of traffic to new version. Analyze metrics. Ramp if healthy. Best of both worlds.
  • Shadow / Mirror โ€” copy traffic to new version, discard its responses. Zero user impact, finds bugs.
  • A/B testing โ€” route by user cohort (header, cookie). Different from canary: based on attributes, not percent.

Feature flags. Decouple deploy from release. Ship code hidden behind a flag; flip it on gradually. Tools: LaunchDarkly, Split, Unleash, GrowthBook, OpenFeature (CNCF standard). Let you canary users, not pods.

Argo Rollouts โ€” replaces K8s Deployment with Rollout CRD. Defines canary/blue-green strategy with steps, pauses, analysis templates. AnalysisTemplate queries Prometheus (or other) and fails if metric violates threshold.

yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata: { name: api }
spec:
  replicas: 10
  strategy:
    canary:
      steps:
        - setWeight: 10
        - pause: { duration: 5m }
        - analysis:
            templates: [{ templateName: success-rate }]
        - setWeight: 25
        - pause: { duration: 5m }
        - setWeight: 50
        - pause: { duration: 5m }
      canaryService: api-canary
      stableService: api
      trafficRouting:
        istio:
          virtualService: { name: api }
  selector: { matchLabels: { app: api } }
  template:
    # standard pod spec
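
A hedged sketch of the success-rate AnalysisTemplate referenced in the steps above (the Prometheus address, metric names, and thresholds are illustrative):

```yaml
# Gate canary progression on the fraction of non-5xx responses.
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata: { name: success-rate }
spec:
  metrics:
    - name: success-rate
      interval: 1m
      failureLimit: 3                      # tolerate transient blips
      successCondition: result[0] >= 0.99
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            sum(rate(http_requests_total{app="api",code!~"5.."}[5m]))
            /
            sum(rate(http_requests_total{app="api"}[5m]))
```

On failure, the Rollout aborts and traffic returns to the stable ReplicaSet.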

Flagger (Flux-adjacent) — same goal as Argo Rollouts, but it operates on a standard Deployment: the operator generates the canary primitives for you and integrates with several service meshes and ingress controllers.

Analysis metrics. The metric that gates progression:

  • Error rate (5xx %)
  • p95/p99 latency
  • Business metric (checkout rate)
  • Custom SLI ratio

Pair fail-fast thresholds with tolerance for noise: don't abort on the first bad sample (it may be a transient blip); require N consecutive failures before rolling back.

Automated rollback. On analysis failure, Rollout aborts: flips traffic back to stable, scales canary to zero, marks itself Degraded. No human needed at 3 AM.

Interview Q&A โ€‹

  1. Rolling vs blue/green vs canary โ€” pick one and defend. โ€” Canary with metric-based analysis is the modern default: 5% traffic โ†’ measure โ†’ 25% โ†’ measure โ†’ 100%. You catch regressions with minimal user impact and automatic rollback. Blue/green is right when tests must run on full prod traffic before cutover or when rollback must be instantaneous. Rolling is for low-stakes workloads where the risk is mild.

  2. What does Argo Rollouts give you over a plain Deployment? โ€” Explicit canary/blue-green strategy (setWeight steps, pauses, analysis). Metric-driven pass/fail via AnalysisTemplate (Prometheus, Datadog, New Relic). Integration with service meshes (Istio, Linkerd) or ingress (ALB, NGINX, SMI) to route by weight. Automatic rollback on analysis failure.

  3. Feature flag vs canary โ€” same thing? โ€” No. Canary routes a percentage of requests to a new version. Feature flags toggle a code path within a single deployed version. Flags decouple deploy from release: ship code dark, flip on when ready. Both are tools for safe rollout; many shops use both together.

  4. How do you design the analysis metric for a canary? โ€” Pick SLIs that represent user impact: p95/p99 latency, error rate, saturation. Compare canary to stable (not absolute thresholds), so organic spikes don't trip. Require N consecutive failures before abort to ignore transient blips. Don't mix business metrics in until you trust the technical gates.

  5. Shadow traffic โ€” when is it useful? โ€” When testing a rewrite or major refactor where response semantics matter. Mirror production traffic to the new service; discard its responses; compare outputs offline or observe failures/latency. Zero user impact but catches subtle bugs that unit tests miss. Costly in dollars (double-compute) and subtle (side effects to third parties โ€” writes must be no-oped).


Part V โ€” Cloud โ€‹

24. AWS for DevOps โ€‹

Why this matters โ€‹

AWS is the dominant cloud in enterprise, and any engineer listing it on a resume will likely face "design an AWS deployment" or "walk through IAM / VPC / EKS" questions. The point isn't memorizing every service โ€” it's knowing the 20-service core in depth, cost implications, and the identity model.

Core concepts โ€‹

Compute tier.

  • EC2 โ€” raw VMs. Still the foundation for long-lived managed services.
  • ECS (on EC2 or Fargate) โ€” AWS-native container orchestration. Simpler than K8s; AWS-tied.
  • EKS โ€” managed Kubernetes control plane. Node groups (managed, self-managed) or Fargate profiles (serverless pods).
  • Fargate โ€” run a container without managing nodes.
  • Lambda โ€” functions-as-a-service. 15-min max. Cold starts a real concern.
  • App Runner โ€” simpler container running (think: "Heroku on AWS").
  • Batch โ€” managed batch compute over EC2/Fargate.

Pick: Lambda for event-driven short tasks (<15m, low RPS), Fargate/ECS for containerized apps without K8s overhead, EKS for container platforms at scale, EC2 when you need kernel-level control (databases, HPC, GPU clusters).

IAM โ€” the model.

  • Identity: User (humans), Role (assumed by services/users/workloads, preferred over users for ~everything), Group (collection of users).
  • Policy: JSON document attached to identity or resource. Statement = Effect (Allow/Deny) + Action (s3:GetObject) + Resource (ARN) + Condition.
  • Role: trust policy (who can assume it) + permissions policy (what the assumed role can do).
  • STS: token vending service. sts:AssumeRole returns short-lived creds.
  • IRSA (IAM Roles for Service Accounts): EKS pods assume roles via projected SA tokens โ†’ STS.
  • EKS Pod Identity (newer) โ€” simpler alternative to IRSA.
  • Permission boundaries: ceiling on effective permissions (defensive depth against privilege creep).
  • SCPs (Service Control Policies): Org-level guardrails on what accounts can do.

Principle: prefer roles over users; short-lived creds over static keys; permission boundaries + SCPs for defense in depth.
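
The Statement anatomy above, as a hedged example permissions policy (bucket name and condition are illustrative):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AppObjectAccess",
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject"],
      "Resource": "arn:aws:s3:::my-app-bucket/*",
      "Condition": {
        "StringEquals": { "aws:RequestedRegion": "us-east-1" }
      }
    }
  ]
}
```

Attach this to a role; the role's separate trust policy controls who may assume it.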

S3.

  • Storage classes: Standard, Standard-IA, One Zone-IA, Glacier Instant / Flexible / Deep Archive, Intelligent-Tiering.
  • Lifecycle policies: auto-transition to colder classes, expire old objects.
  • Versioning: every overwrite is a new version.
  • Bucket policies vs ACLs: prefer bucket policies (JSON IAM-like) + Block Public Access.
  • Pre-signed URLs: temporarily allow PUT/GET from outside AWS without making the bucket public.
  • Server-side encryption: SSE-S3 (AES-256, AWS keys), SSE-KMS (your KMS key, auditable), SSE-C (you provide the key).
  • Notifications: to Lambda, SQS, SNS, EventBridge on object events.

VPC.

  • Subnets: public (route table with IGW), private (route via NAT GW), isolated (no internet at all).
  • NAT Gateway: allows outbound internet from private subnets. Expensive (AZ-redundant setup adds up).
  • Security Groups (SGs): stateful, ENI-level allow-lists.
  • NACLs: stateless, subnet-level allow+deny. Rarely needed.
  • VPC Endpoints: bypass NAT for AWS services (S3, DynamoDB = Gateway endpoints; others = Interface endpoints / PrivateLink). Huge cost saver and security win.
  • Transit Gateway / VPC Peering: inter-VPC connectivity.
  • VPC Flow Logs: packet metadata to S3/CloudWatch.

Route 53 โ€” DNS + health checks + routing policies (latency-based, weighted, failover, geolocation, multivalue).

CloudFront โ€” global CDN; terminates TLS; origin can be S3, ALB, or arbitrary HTTP. WAF + Lambda@Edge integration.

Load balancers.

  • ALB (L7) โ€” HTTP/HTTPS, path/host routing, WebSocket, gRPC. Target groups.
  • NLB (L4) โ€” TCP/UDP, static IPs, huge throughput.
  • GWLB โ€” insert security appliances into traffic path.
  • Classic โ€” legacy.

Managed data tier.

  • RDS โ€” managed SQL (Postgres, MySQL, MariaDB, Oracle, SQL Server).
  • Aurora โ€” AWS-native MySQL/Postgres-compat, scale out read replicas, serverless v2.
  • DynamoDB โ€” managed NoSQL K/V + document. Partition/sort key design is everything.
  • DocumentDB โ€” MongoDB-compat (not full).
  • ElastiCache โ€” Redis/Memcached.
  • OpenSearch โ€” managed ES.

Messaging.

  • SQS โ€” standard (at-least-once, unordered) or FIFO (ordered, dedup). Visibility timeout, DLQ.
  • SNS โ€” pub/sub fan-out.
  • EventBridge โ€” event bus, rules, targets (schema registry too).
  • Kinesis Data Streams โ€” Kafka-like, ordered, replayable.
  • MSK โ€” managed Kafka.
  • Amazon MQ โ€” managed ActiveMQ / RabbitMQ (relevant for MQ migration scenarios).

Secrets & config.

  • Systems Manager Parameter Store โ€” free-ish, tier-limited, String / SecureString. Good for config + small secrets.
  • Secrets Manager โ€” paid, with rotation + multi-region replication. Good for DB creds + API keys.
  • KMS โ€” key management. Customer-managed keys (CMK) for encryption at rest.

Observability.

  • CloudWatch Logs / Metrics / Alarms โ€” native monitoring.
  • CloudWatch Container Insights โ€” EKS/ECS metrics.
  • CloudTrail โ€” API audit log (every AWS API call). Essential for compliance + forensics.
  • Config โ€” resource configuration history + compliance rules.
  • X-Ray โ€” distributed tracing (being deprecated in favor of OpenTelemetry on ADOT).

Governance.

  • Organizations + Control Tower โ€” multi-account baseline.
  • IAM Identity Center (formerly SSO) โ€” SSO + permission sets to accounts.
  • CloudFormation / CDK โ€” IaC native.
  • Systems Manager (Session Manager, Run Command, Patch Manager) โ€” fleet mgmt.

EKS add-ons. VPC CNI (networking), CoreDNS (DNS), kube-proxy, EBS CSI driver, EFS CSI driver. Managed add-ons keep them current automatically.

Cost patterns to know โ€‹

  • Data transfer costs โ€” inter-AZ and egress; surprise on PoC-to-prod migrations.
  • NAT Gateway $/GB โ€” can eat 30% of a cloud bill. Mitigate with VPC endpoints, VPC-wide NAT sharing, or moving egress-heavy workloads to public subnets with SGs.
  • Savings Plans vs Reserved Instances vs Spot โ€” SP more flexible, RIs cheaper if you can commit; Spot 60โ€“90% off for fault-tolerant workloads.
  • S3 class math โ€” IA is cheaper per GB but has retrieval fees; lifecycle policies matter.
  • Unused EBS โ€” orphaned volumes after EC2 termination. aws ec2 describe-volumes --filters Name=status,Values=available.
  • Idle RDS, NAT GW, ELBs โ€” scheduled shutdowns in nonprod, Instance Scheduler pattern.

Commands you should know cold โ€‹

bash
# Identity
aws sts get-caller-identity
aws sts assume-role --role-arn arn:... --role-session-name s

# EKS cluster access
aws eks update-kubeconfig --name mycluster --region us-east-1

# IAM policy simulation
aws iam simulate-principal-policy --policy-source-arn arn:...:user/me \
  --action-names s3:PutObject --resource-arns arn:aws:s3:::mybucket/*

# S3 basics
aws s3 cp ./file.tgz s3://mybucket/path/
aws s3 presign s3://mybucket/path/file.tgz --expires-in 600
aws s3api put-bucket-policy --bucket mybucket --policy file://policy.json

# VPC quick view
aws ec2 describe-vpcs --filters Name=isDefault,Values=false
aws ec2 describe-security-groups --group-ids sg-xxx

# CloudWatch Logs Insights query
aws logs start-query --log-group-name /aws/eks/mycluster/cluster \
  --start-time $(date -d '1 hour ago' +%s) --end-time $(date +%s) \
  --query-string 'fields @timestamp,@message | filter @message like /error/'

# Cost Explorer budget check
aws ce get-cost-and-usage --time-period Start=2026-04-01,End=2026-04-17 \
  --granularity DAILY --metrics UnblendedCost --group-by Type=DIMENSION,Key=SERVICE

Anchor example โ€‹

Kafka/IBM MQ experience ties directly to Amazon MQ and MSK. In interviews where AWS comes up with messaging, lean in: "I'd pick MSK for a Kafka workload with high throughput and replay needs, and Amazon MQ for lift-and-shift of ActiveMQ/IBM MQ workloads where the app binds to JMS contracts. The IBM-MQ-to-Amazon-MQ bridge is a real migration pattern I can talk through — network isolation, persistent messages, XA/transaction semantics, dedup on the bridge."

Interview Q&A โ€‹

  1. EC2 vs ECS vs EKS vs Lambda โ€” when each? โ€” Lambda for event-driven short tasks, scale-to-zero, no infra. ECS/Fargate for containers without K8s overhead and tight AWS integration. EKS for container platforms at scale, portability to other clouds, CNCF tooling. EC2 when you need kernel-level access (databases, special networking, GPU HPC).

  2. IAM role vs user โ€” when to use each? โ€” Always prefer roles. Users mean long-lived access keys (a supply-chain liability). Roles are assumed with short-lived STS credentials, scoped by trust policy (who can assume) + permissions policy. Humans: IAM Identity Center issues roles via SSO. Workloads: IRSA/Pod Identity/EC2 instance profile.

  3. IRSA โ€” how does it work? โ€” The SA token projected into the Pod is a signed JWT. The AWS SDK on the pod calls sts:AssumeRoleWithWebIdentity passing the JWT. STS validates against the OIDC provider for that EKS cluster's identity. The IAM role's trust policy restricts the SA + namespace. Result: short-lived creds, no secrets to rotate.

  4. Security Group vs NACL? โ€” SG is stateful (return traffic auto-allowed), attaches to ENIs, allow-only. NACL is stateless (both directions explicit), attaches to subnets, allow+deny. Use SGs by default; NACLs only when you need a broad deny at the subnet level (e.g., quarantining IP ranges).

  5. Reduce my AWS cloud bill โ€” where do you look? โ€” Top savings: NAT GW (use VPC endpoints for S3/DynamoDB โ€” Gateway endpoints are free; Interface endpoints for others beat NAT $/GB at scale). Right-size EC2/RDS. Move fault-tolerant workloads to Spot. Savings Plans / RIs on steady-state compute. Lifecycle S3 to IA/Glacier. Shut down nonprod off-hours. Delete orphaned EBS, unattached EIPs.

  6. Design a disaster-recovery strategy for my EKS workload on AWS. โ€” Pick RTO/RPO. Multi-AZ within a region for most workloads (EKS managed control plane is already multi-AZ). Multi-region for strict DR: pilot light (minimal capacity warm in secondary region) or warm standby (scaled-down replica running). Replicate state: RDS cross-region read replica, S3 CRR, secrets multi-region. DNS failover with Route53 health checks. Test DR quarterly.

  7. S3 pre-signed URL โ€” problem it solves. โ€” Lets external users PUT/GET objects directly without making the bucket public and without proxying through your app. You sign a URL server-side with your IAM creds; URL is time-bounded; user PUTs directly to S3. Classic: upload a large file from browser โ†’ client gets a pre-signed URL โ†’ uploads directly, skipping your app.

  8. SQS vs SNS vs EventBridge โ€” how do you pick? โ€” SQS: work queue; one consumer processes each message. SNS: pub/sub fan-out; many subscribers get each message. EventBridge: schema-aware event bus with rules, targets, cross-account, SaaS integrations. Rule of thumb: SQS for work queues, SNS for simple fan-out, EventBridge for event-driven architectures with filtering and many targets.

  9. Running Spring Boot on EKS โ€” what's the production-ready setup? โ€” VPC with private subnets; EKS with managed node groups + Karpenter for bursts; IRSA for AWS API access; External Secrets pulling from Secrets Manager; ALB Ingress Controller for L7; ECR with Cosign-signed images; ArgoCD for GitOps; CloudWatch Container Insights + OpenTelemetry Collector โ†’ ADOT โ†’ X-Ray/Prometheus; WAF on ALB; VPC Flow Logs to S3. IAM: fine-grained IRSA per service, permission boundaries on platform roles.

  10. On-prem Spring Boot to AWS migration โ€” walk me through it. โ€” (1) Containerize: Dockerfile, local Docker Compose test, then ECS Fargate PoC. (2) Replicate state: RDS for Postgres, ElastiCache for Redis, S3 for file storage, MSK/Amazon MQ for messaging. (3) Network: VPN or Direct Connect for hybrid; VPC endpoints to reduce egress. (4) IaC: Terraform for network + IAM + data stores. (5) CI/CD: GitHub Actions OIDC โ†’ ECR โ†’ ArgoCD โ†’ EKS (or ECS in simpler version). (6) Cutover: blue/green via Route53 weighted records. Observability + rollback plan before any prod cutover.
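
The IRSA trust policy from Q3 can be sketched as follows (the account ID and OIDC provider ID are placeholders; the Condition pins which ServiceAccount may assume the role):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::123456789012:oidc-provider/oidc.eks.us-east-1.amazonaws.com/id/EXAMPLE"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "oidc.eks.us-east-1.amazonaws.com/id/EXAMPLE:sub": "system:serviceaccount:prod:api",
          "oidc.eks.us-east-1.amazonaws.com/id/EXAMPLE:aud": "sts.amazonaws.com"
        }
      }
    }
  ]
}
```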


25. Azure & GCP (Cross-Cloud Literacy) โ€‹

Why this matters โ€‹

Even at AWS-first shops, you'll be asked "what are the Azure/GCP equivalents?" to test breadth. Pure Azure/GCP shops will dive deeper; make sure you know the core mappings and a couple of unique-to-that-cloud features.

Core concepts โ€‹

Service mapping cheatsheet:

|                     | AWS                      | Azure                                 | GCP                                  |
|---------------------|--------------------------|---------------------------------------|--------------------------------------|
| K8s                 | EKS                      | AKS                                   | GKE                                  |
| Managed VMs         | EC2                      | VMs                                   | Compute Engine                       |
| Serverless          | Lambda                   | Functions                             | Cloud Functions / Cloud Run          |
| Containers (no K8s) | ECS/Fargate              | Container Apps                        | Cloud Run                            |
| Object storage      | S3                       | Blob Storage                          | Cloud Storage (GCS)                  |
| Block storage       | EBS                      | Managed Disks                         | Persistent Disks                     |
| Managed SQL         | RDS/Aurora               | Azure SQL / DB for Postgres           | Cloud SQL                            |
| NoSQL               | DynamoDB                 | Cosmos DB                             | Firestore / Bigtable                 |
| Queue               | SQS                      | Service Bus / Storage Queue           | Pub/Sub (one product covers both)    |
| Pub/Sub             | SNS / EventBridge        | Event Grid / Event Hubs               | Pub/Sub                              |
| Managed Kafka       | MSK                      | Event Hubs (Kafka API)                | Managed Kafka / Confluent Cloud      |
| CDN                 | CloudFront               | Front Door / CDN                      | Cloud CDN                            |
| DNS                 | Route 53                 | Azure DNS / Traffic Manager           | Cloud DNS                            |
| Load balancer       | ALB/NLB                  | Load Balancer / Application Gateway   | Cloud Load Balancing                 |
| Secrets             | Secrets Manager          | Key Vault                             | Secret Manager                       |
| Identity            | IAM                      | Entra ID (Azure AD) / RBAC            | IAM                                  |
| Workload identity   | IRSA / EKS Pod Identity  | Workload Identity (AKS)               | Workload Identity (GKE)              |
| Observability       | CloudWatch / X-Ray       | Monitor / Log Analytics / App Insights| Cloud Monitoring / Logging / Trace   |
| IaC native          | CloudFormation / CDK     | Bicep / ARM                           | Deployment Manager / Config Connector|
| Cost mgmt           | Cost Explorer            | Cost Management                       | Billing                              |

Azure peculiarities:

  • Entra ID (formerly AAD) is the identity layer for everything, not just Azure โ€” it underpins Office 365, Teams.
  • Resource Groups โ€” a required logical grouping for every resource.
  • Subscriptions + Management Groups โ€” the billing + governance tree.
  • Bicep โ€” Azure's modern IaC DSL that compiles to ARM.
  • AKS + Application Gateway Ingress Controller (AGIC).

GCP peculiarities:

  • Projects are the billing + isolation unit (no "account/subscription" equivalent).
  • Organizations + Folders + Projects tree.
  • IAM bindings are at resource level, not attached to identities.
  • Workload Identity on GKE is the cleanest cross-cloud equivalent.
  • Cloud Run is a strong alternative to ECS/Fargate + Lambda in one product.

Interview Q&A โ€‹

  1. AWS EKS equivalents on Azure and GCP? โ€” Azure: AKS. GCP: GKE (probably the most polished managed K8s; GKE Autopilot is "EKS Fargate" equivalent โ€” serverless K8s). Workload identity maps: IRSA (AWS) / Workload Identity (AKS and GKE).

  2. Multi-cloud โ€” is it worth it? โ€” Rarely for apps; often for specific services (Snowflake on AWS + Azure, or BigQuery-specific workloads on GCP). "Portable workloads" is a myth unless you invested in true abstractions (K8s + Crossplane + cloud-agnostic data stores), which adds complexity and slows you down. Most "multi-cloud" in practice is "primary cloud + a bit of another."

  3. Cross-cloud K8s differences I should know? โ€” Ingress / LoadBalancer provisioning differs (ALB on EKS, Application Gateway on AKS, Cloud Load Balancer on GKE โ€” handled by the CCM). CNI: AWS VPC CNI allocates VPC IPs (ENI-limited); AKS default CNI is Azure CNI or kubenet; GKE uses its own. Workload identity flows are the same idea but different plumbing per cloud.

  4. Cloud-native services you'd pick GCP for? โ€” BigQuery (unrivaled warehouse), Cloud Run (developer experience beats ECS + Fargate), Spanner (globally consistent relational), GKE Autopilot (lowest-ops K8s). Many analytics-first shops go GCP for BigQuery alone.

Further reading โ€‹


26. FinOps & Cost Optimization โ€‹

Why this matters โ€‹

DevOps owns deploy, scale, and a huge chunk of the cloud bill. FinOps makes cost a first-class metric alongside reliability and throughput. Expect "how do you reduce a cloud bill?" and "walk through your tagging strategy."

Core concepts โ€‹

FinOps principles (from the FinOps Foundation):

  1. Teams collaborate (Finance + Eng + Product).
  2. Everyone takes ownership for their cloud usage.
  3. A centralized team drives FinOps (but doesn't own spend).
  4. Reports are accessible and timely.
  5. Decisions are driven by business value.
  6. Take advantage of the variable cost model of the cloud.

The three phases (a repeating cycle): Inform (show cost, tag, allocate) → Optimize (rightsizing, commitments, delete orphans) → Operate (automate, enforce, cultural alignment).

Cost drivers in AWS (roughly):

  1. EC2 (including EKS node groups).
  2. Data transfer (inter-AZ, egress, NAT GW).
  3. RDS / Aurora.
  4. S3 + EBS.
  5. Managed services (ALB, NAT, etc.).
  6. "Other" (CloudWatch, X-Ray, KMS calls).

Commitment pricing.

  • Reserved Instances (RIs) โ€” 1-year/3-year; up to 72% off; instance-family specific. Mostly superseded by SPs.
  • Savings Plans (SPs) โ€” 1-year/3-year; Compute SP (flexible across families + regions + ECS/EKS/Lambda), EC2 Instance SP (family-pinned, deeper discount). Apply automatically.
  • Spot โ€” 60โ€“90% off; interruptible. Good for fault-tolerant batch, build workers, stateless workloads with HPA + graceful termination.

Rightsizing.

  • CloudWatch metrics + AWS Compute Optimizer give recommendations.
  • VPA recommender for K8s workloads.
  • Heuristics: p95 CPU < 20% โ†’ downsize; p95 memory < 40% โ†’ shrink.
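
The thresholds above can be encoded as a trivial decision helper; a Python sketch (the cutoffs are the rules of thumb just listed, not AWS guidance):

```python
def rightsizing_hint(p95_cpu_pct: float, p95_mem_pct: float) -> str:
    """Apply the heuristics above: sustained low p95 utilization on a
    dimension suggests shrinking it. Thresholds are rules of thumb."""
    if p95_cpu_pct < 20 and p95_mem_pct < 40:
        return "downsize"        # both dimensions underused: smaller instance
    if p95_cpu_pct < 20:
        return "fewer-vcpus"     # CPU underused only
    if p95_mem_pct < 40:
        return "less-memory"     # memory underused only
    return "keep"
```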

Tagging strategy. Every resource tagged with: environment, team, service, cost-center, owner. Cost Explorer filters by tag; cross-charge internally.

Idle detection.

  • Unattached EBS (aws ec2 describe-volumes --filters Name=status,Values=available).
  • Unused EIPs (not associated).
  • Load balancers with zero targets.
  • RDS instances with no connections for 7 days.
  • Tools: AWS Trusted Advisor, Cost Anomaly Detection, third-party (CloudHealth, Cloudability, Vantage).
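
The describe-volumes check above returns JSON you can post-process into a deletion-review list; a minimal sketch (field names follow the `aws ec2 describe-volumes` output shape):

```python
def idle_volumes(describe_volumes_json: dict) -> list[dict]:
    """Pick out volumes in 'available' state (i.e., attached to nothing)
    and keep the fields a reviewer needs before deleting them."""
    return [
        {"id": v["VolumeId"], "size_gib": v["Size"], "created": v["CreateTime"]}
        for v in describe_volumes_json.get("Volumes", [])
        if v.get("State") == "available"   # attached volumes report "in-use"
    ]

# Shape mirrors the CLI output; values are made up for illustration.
sample = {"Volumes": [
    {"VolumeId": "vol-1", "Size": 100, "CreateTime": "2025-01-01", "State": "available"},
    {"VolumeId": "vol-2", "Size": 50,  "CreateTime": "2025-02-01", "State": "in-use"},
]}
```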

Kubernetes cost visibility.

  • Kubecost โ€” breakdown by namespace / workload / label.
  • OpenCost (CNCF) โ€” open-source core of Kubecost.

Data transfer traps.

  • Inter-AZ: $0.01/GB each way.
  • Inter-region: $0.02/GB and up.
  • Internet egress: tiered, $0.05โ€“$0.09/GB.
  • Fix: topology-aware routing, VPC endpoints for AWS service traffic, CloudFront for public egress.
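
A back-of-envelope helper makes these traps concrete; the $/GB rates below are illustrative list prices (they vary by region and tier — substitute your own):

```python
# Illustrative $/GB rates from the list above (vary by region and tier).
RATES = {
    "inter_az": 0.01 * 2,       # $0.01/GB each way
    "inter_region": 0.02,
    "internet_egress": 0.09,
    "nat_processing": 0.045,
}

def transfer_cost(gb_by_path: dict) -> float:
    """Sum monthly data-transfer charges given GB moved per path type."""
    return round(sum(RATES[path] * gb for path, gb in gb_by_path.items()), 2)
```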

GreenOps overlap. Reducing compute for cost reduces carbon. Both benefit from rightsizing + consolidation + efficient regions.

Interview Q&A โ€‹

  1. Our AWS bill jumped 30% last month — how do you investigate? — Cost Explorer filtered by day + service + linked account to localize the spike. Compare to previous billing cycles. Look for new tags (a new team deployed). Check Cost Anomaly Detection alerts. Top suspects: a new NAT GW routing heavy traffic, unbounded log retention, a runaway HPA, an RDS instance-class misclick (e.g., db.m5.large → db.m5.4xlarge). Drill into the specific service + tag that jumped.

  2. How do you make cost a first-class metric for the team? โ€” Tag everything with team/service/env. Use Kubecost/OpenCost to show per-workload cost. Chargeback or showback to teams in a monthly report. Set budget alerts per team. Include "deploy cost delta" as a PR check for infra changes. Cultural: cost is owned by engineers, not finance alone.

  3. Spot instances for EKS worker nodes — what are the gotchas? — Interrupts happen (2-minute warning via instance metadata or an EventBridge event). Need: diverse instance types (Karpenter can manage this automatically — bin-pack with 10+ types), fault-tolerant workloads (HPA + PDB + terminationGracePeriodSeconds + readiness probe drop on SIGTERM), and stateful workloads on on-demand. Typical split: stateless APIs + async workers on Spot, databases + stateful on on-demand.

  4. Why is NAT Gateway so expensive? โ€” $0.045/hr per NAT GW + $0.045/GB processed. Multi-AZ NAT for HA triples the hourly cost. The $/GB eats up if you have heavy egress to public services. Fix: VPC endpoints for AWS services (Gateway endpoints for S3/DynamoDB are free; Interface endpoints are $0.01/hr but drop NAT processing), VPC endpoint policies, or occasionally running specific egress-heavy workloads in public subnets with SGs.

  5. Reserved Instances vs Savings Plans โ€” pick? โ€” Compute Savings Plans for most: they apply across EC2, Fargate, Lambda; flex across families and regions. EC2 Instance SPs or RIs when you have very steady state on a specific family. Avoid 3-year commits unless confident the workload will persist. RIs are mostly a legacy of convenience; SPs cover ~all use cases now.

Further reading โ€‹


Part VI โ€” Observability & SRE โ€‹

27. Observability Foundations โ€‹

Why this matters โ€‹

"Monitoring" is what you do when you know what can break. "Observability" is what you do when new failures surprise you โ€” it's the ability to ask any question about your system without deploying new code. Interviewers probe this to distinguish candidates who've actually been on-call from candidates who've only read about it.

Core concepts โ€‹

The three pillars: logs, metrics, traces.

| | What it is | Best for | Watch out for |
| --- | --- | --- | --- |
| Logs | Discrete timestamped events | "What exactly happened at T?" | Volume, cost, cardinality |
| Metrics | Numeric time series with labels | "How's the system trending?" | Label cardinality explosion |
| Traces | Request's path through services | "Where did this slow request spend time?" | Sampling, storage |

Newer view: events as a single data model โ€” one wide structured event carries log fields, metric dimensions, and trace spans (Honeycomb's pitch). Still, mainstream tooling is separate pipelines per pillar.

Cardinality. Number of unique label combinations. http_requests_total{method, status, path} โ€” if path includes user IDs, cardinality explodes, Prometheus chokes, bill explodes. Rule: never label with unbounded user data; buckets (e.g., path=/api/users/:id), not raw values.
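
One common guard is normalizing paths before they ever become label values; a hypothetical sketch (the regexes are illustrative, not a standard):

```python
import re

# Order matters: match the more specific pattern (UUIDs) before plain digits.
_ID_PATTERNS = [
    (re.compile(r"/[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"), "/:uuid"),
    (re.compile(r"/\d+"), "/:id"),
]

def bucket_path(path: str) -> str:
    """Collapse unbounded path segments (numeric IDs, UUIDs) into fixed
    placeholders so a 'path' metric label stays low-cardinality."""
    for pattern, repl in _ID_PATTERNS:
        path = pattern.sub(repl, path)
    return path
```

Apply this at the instrumentation middleware, before the request counter is incremented.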

Sampling.

  • Head sampling โ€” decide at span creation (random %). Cheap, misses rare errors.
  • Tail sampling โ€” buffer the whole trace, decide after (keep errors + slow traces). Expensive, catches the interesting ones.
  • Deterministic โ€” hash trace ID % rate. Consistent across services.

Structured logging. JSON lines, consistent field names, no unstructured prose. Every log line has: timestamp, level, service, trace_id, span_id, message, plus context fields. Search and aggregate become queries, not regex.
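
A minimal sketch of what one such line looks like when emitted (field names follow the convention described above; exact names vary by shop):

```python
import json
import datetime

def log_line(level: str, message: str, trace_id: str, span_id: str,
             service: str = "api", **context) -> str:
    """One structured event per line: a fixed envelope plus arbitrary
    context fields, serialized as a single JSON object."""
    record = {
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "level": level,
        "service": service,
        "trace_id": trace_id,
        "span_id": span_id,
        "message": message,
        **context,
    }
    return json.dumps(record)
```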

Correlation IDs. The request's fingerprint across services. Generate at edge (ingress), propagate via header (traceparent W3C standard). Every log line from every service for that request carries the same ID. Essential for microservices debugging.

Trace context propagation (W3C Trace Context).

```
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
             ^^ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^^ ^^
             |            trace-id                   parent-id     |
          version      (16 bytes, hex)            (8 bytes, hex) flags
```

Plus tracestate for vendor-specific additions. Works across any vendor that supports W3C.
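
Parsing the header is mechanical — split on dashes and validate field widths; a sketch of the W3C format shown above:

```python
def parse_traceparent(header: str) -> dict:
    """Split a W3C traceparent header into its four fields:
    version (2 hex) - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex)."""
    version, trace_id, parent_id, flags = header.split("-")
    assert len(trace_id) == 32 and len(parent_id) == 16
    sampled = int(flags, 16) & 0x01 == 1   # low bit of flags = sampled
    return {"version": version, "trace_id": trace_id,
            "parent_id": parent_id, "sampled": sampled}
```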

MDC (Mapped Diagnostic Context). Java/SLF4J concept โ€” per-thread map of key/values that log appenders include automatically. Put trace_id in MDC at request start; every log line downstream carries it.

Pillars vs SLOs. Observability data feeds SLOs (see ยง33). "Your metrics show 99.92% success" vs SLO of 99.9% means you're burning error budget.

Commands you should know cold โ€‹

```bash
# Query Loki (logs)
logcli query '{app="api", level="error"} |= "timeout"' --since=1h

# Cardinality check (Prometheus): number of distinct metric names
curl -s http://prom:9090/api/v1/label/__name__/values | jq '.data|length'
```

```promql
# Error rate last 5m
sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
/ sum(rate(http_requests_total[5m])) by (service)

# p99 latency
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))
```

Gotchas & war stories โ€‹

  • Labeling user IDs as metric labels โ€” kills Prometheus in hours.
  • Logging request bodies verbatim โ€” PII + volume blow-up + slow log pipeline.
  • Dropping trace context through async boundaries โ€” Kafka consumers forget to propagate traceparent. Fix with OTel's auto-instrumentation for the broker client.
  • No sampling โ€” tracing every request costs more than the compute. Tail-sample 100% errors + slow, 1โ€“5% of happy path.
  • Logs and metrics telling different stories โ€” metric says 0.1% error; log count says 5% errors. Usually: metric ignores a status code, or a middleware returns 200 after logging an error.

Interview Q&A โ€‹

  1. Logs, metrics, traces โ€” when each? โ€” Metrics for rates, counts, histograms: the dashboard story. Logs for discrete events: "what exactly happened at 04:37:22." Traces for request-path time-distribution across services: "where is this slow request spending its time?" In practice: alert on metrics, correlate with trace for a slow request, drill into logs for the specific line.

  2. What is observability and how does it differ from monitoring? โ€” Monitoring answers known questions ("is CPU over 80%?"). Observability lets you answer new questions you haven't asked yet, by exposing rich, structured, high-cardinality data. Practically: observability is monitoring + tracing + structured logs + high-cardinality labels.

  3. Cardinality โ€” why do I care? โ€” Each unique label-value combination is a separate time series in Prometheus; series cost memory + storage. Labeling with user IDs or URL paths multiplies cardinality and can OOM Prometheus. Bound cardinality: aggregate, bucket, or move high-cardinality data to traces/logs where storage is append-only.

  4. How do you correlate a log line with a trace? โ€” Inject trace_id and span_id into MDC (Java) or thread-local / AsyncLocalStorage (JS) at request entry. Log format includes those fields. Click in your logs UI โ†’ filter by trace_id โ†’ jump to the trace UI. OpenTelemetry auto-instrumentation handles the propagation.

  5. Head vs tail sampling โ€” trade-offs? โ€” Head: cheap (decide before span creation), randomness means rare errors may never be captured. Tail: expensive (buffer entire trace in a collector, decide after seeing it), catches errors and slow outliers. Tail sampling is the modern default for production services; requires a collector like the OpenTelemetry Collector.

Further reading โ€‹


28. OpenTelemetry โ€‹

Why this matters โ€‹

OpenTelemetry (OTel) is the CNCF-graduated standard for generating and transmitting telemetry. It's displacing vendor SDKs (Datadog's, New Relic's, Jaeger's) as the default instrumentation layer. If your resume calls it out, expect deep questions.

Core concepts โ€‹

What OTel standardizes:

  • API โ€” what application code calls to create spans, metrics, logs.
  • SDK โ€” reference implementation of the API per language.
  • Semantic conventions โ€” standard attribute names (http.method, http.status_code, db.system, messaging.kafka.destination).
  • OTLP protocol โ€” binary gRPC (or HTTP/protobuf) wire format for shipping telemetry.

OTel does NOT prescribe a backend. Ship OTLP to any compliant backend: Jaeger, Tempo, Zipkin, Grafana, Datadog, New Relic, Honeycomb, Dynatrace.

Auto-instrumentation. Most languages have agents that auto-instrument common libraries without code changes:

  • Java: -javaagent:opentelemetry-javaagent.jar — instruments servlets, JDBC, Kafka, HTTP clients, Spring, and dozens of other libraries. Zero code changes.
  • .NET / Python / Go / Node: auto-instrumentation packages.
  • K8s-native: OpenTelemetry Operator injects the agent as an init container.

Manual instrumentation. When you need custom spans or attributes:

```java
Span span = tracer.spanBuilder("computeDiscount").startSpan();
try (Scope s = span.makeCurrent()) {
  span.setAttribute("user.tier", tier);
  // ...
} finally {
  span.end();
}
```

Context propagation. SDK automatically reads/writes traceparent on HTTP and Kafka clients. Cross-language (Java โ†’ Go service โ†’ Python worker) works transparently.

OpenTelemetry Collector. The data plane โ€” a standalone binary/pod that receives OTLP, processes, and exports:

  • Receivers โ€” OTLP, Prometheus scraping, Jaeger, Zipkin, Fluentd, host metrics, K8s events.
  • Processors โ€” batching, attributes (add/remove/rename), filtering, tail sampling, memory limiter.
  • Exporters โ€” OTLP, Prometheus remote write, Loki, Elasticsearch, Datadog, New Relic.
  • Pipelines โ€” wire receivers โ†’ processors โ†’ exporters, one pipeline per signal (traces/metrics/logs).

Typical deployment: agent mode (DaemonSet, per-node) + gateway mode (Deployment, cluster-wide). Agents forward to gateway; gateway does heavy processing (tail sampling) and exports.

Semantic conventions. Standard attribute names across languages + tools. A consistent http.method label means your dashboards don't need to handle 17 dialects of the same field. Read the OTel Semantic Conventions for the canonical names.

Logs, metrics, traces converge. OTel now covers all three signals. The traces and metrics SDKs are stable in most languages; the logs signal is the newest, designed as a bridge from existing logging frameworks (adoption lags — most teams keep their current log pipeline and fold in traces/metrics first).

Commands / config you should know cold โ€‹

OpenTelemetry Collector minimal config:

```yaml
receivers:
  otlp:
    protocols:
      grpc: { endpoint: 0.0.0.0:4317 }
      http: { endpoint: 0.0.0.0:4318 }
  prometheus:
    config:
      scrape_configs:
        - job_name: otel-collector
          static_configs: [{ targets: [localhost:8888] }]

processors:
  batch: { timeout: 5s }
  memory_limiter: { limit_mib: 400, check_interval: 1s }
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow
        type: latency
        latency: { threshold_ms: 500 }
      - name: prob
        type: probabilistic
        probabilistic: { sampling_percentage: 5 }

exporters:
  otlp/tempo:
    endpoint: tempo.observability.svc:4317
    tls: { insecure: true }
  prometheusremotewrite:
    endpoint: http://mimir.observability.svc/api/v1/push

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, tail_sampling, batch]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp, prometheus]
      processors: [memory_limiter, batch]
      exporters: [prometheusremotewrite]
```

Java auto-instrumentation on a Spring Boot pod:

```yaml
env:
  - name: OTEL_EXPORTER_OTLP_ENDPOINT
    value: http://otel-collector.observability.svc:4318
  - name: OTEL_SERVICE_NAME
    value: api
  - name: OTEL_TRACES_SAMPLER
    value: parentbased_traceidratio
  - name: OTEL_TRACES_SAMPLER_ARG
    value: "0.1"
  - name: OTEL_RESOURCE_ATTRIBUTES
    value: deployment.environment=prod,service.version=2.8.0
  - name: JAVA_TOOL_OPTIONS
    value: "-javaagent:/otel/opentelemetry-javaagent.jar"
```

Interview Q&A โ€‹

  1. What does OpenTelemetry actually standardize? โ€” The API developers call, the SDK that implements it per language, the semantic conventions for attribute names, and the OTLP wire protocol for shipping telemetry. It does NOT pick a backend โ€” any compliant backend (Jaeger, Tempo, Datadog) works. The win: one instrumentation, many backends; no lock-in.

  2. Why move off vendor SDKs (Datadog, New Relic agent)? โ€” Vendor lock-in. With OTel, the same instrumentation works across vendors; you can A/B backends or migrate without code changes. Semantic conventions mean dashboards and alerts port. Vendors still differentiate on analysis/UX, not ingestion.

  3. OpenTelemetry Collector โ€” why run one? โ€” Keep the app's SDK light (just OTLP out). Do enrichment, sampling, batching, and export shaping at the collector. Handles multiple backends. Tail sampling needs a collector (app can't see the whole trace). Cluster-wide gateway collector is standard; per-node DaemonSet agent collects local kube-state-metrics, node exporter, and forwards to gateway.

  4. How do you propagate trace context across Kafka? โ€” OTel's Kafka instrumentation writes traceparent into message headers on produce and reads it on consume. Then the consumer continues the trace. Works for Java (Kafka client auto-instr), Go (otelkafka), Python (opentelemetry-instrumentation-kafka). Without instrumentation, traces stop at the producer โ€” the consumer starts a new trace.

  5. Tail sampling at the collector โ€” how does it work? โ€” Collector buffers spans per trace_id for decision_wait (e.g., 10s). When a trace is "complete" (no new spans for N seconds), it evaluates policies: keep errors, keep slow, keep probabilistic sample. Drops the rest. Memory-bound; requires collector sizing. Huge cost savings at high QPS.

  6. What's in service.version and deployment.environment? โ€” Semantic-convention resource attributes that every span/metric from this service carries. Lets you filter "errors in prod for version 2.8.0" across signals without wiring custom tags. Set via OTEL_RESOURCE_ATTRIBUTES env var or the SDK config. Essential for multi-env telemetry.

Further reading โ€‹


29. Metrics & Prometheus โ€‹

Why this matters โ€‹

Prometheus is the de-facto metrics backend in cloud-native. Its data model, PromQL, and operational patterns (cardinality, federation, long-term storage) show up in nearly every DevOps interview. If you haven't written PromQL, you're behind.

Core concepts โ€‹

Pull model. Prometheus scrapes /metrics HTTP endpoints on a schedule (default 15s). Services expose text-format metrics; clients add labels. Targets discovered via static config, file SD, Kubernetes SD, Consul, EC2 SD, etc.

Metric types.

  • Counter โ€” monotonically increasing (except on reset). http_requests_total. Use rate() or increase() โ€” never the raw value.
  • Gauge โ€” goes up and down. memory_used_bytes, queue_depth.
  • Histogram โ€” _count, _sum, and _bucket{le="..."} series. Use histogram_quantile() to compute percentiles from buckets.
  • Summary โ€” client-side quantiles (_quantile). Cheaper to query, can't aggregate across instances. Generally prefer histograms.

Labels. Arbitrary key-value pairs per metric. The cross-product with metric name is a time series. Bounded labels (method, status, service) are good; unbounded (user_id, full URL) are fatal.

PromQL.

```promql
# Simple instant queries
process_resident_memory_bytes
up{job="api"}

# Rate of increase for a counter (per second, over last 5m)
rate(http_requests_total[5m])

# Error rate ratio
sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
/ sum(rate(http_requests_total[5m])) by (service)

# p99 latency from histogram
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))

# Top 10 busiest endpoints
topk(10, sum(rate(http_requests_total[5m])) by (endpoint))

# Over-time: max queue depth in the last 1h
max_over_time(queue_depth[1h])

# Saturation of a pool
(active_connections / max_connections)

# Availability SLI
(1 - sum(rate(http_requests_total{status=~"5.."}[30d]))
     / sum(rate(http_requests_total[30d]))) * 100
```

Recording rules. Pre-aggregate expensive queries so dashboards stay fast. Defined in a rules file loaded via rule_files in prometheus.yml:

```yaml
groups:
  - name: api-aggregations
    interval: 30s
    rules:
      - record: api:http_request_error_rate:5m
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
            / sum(rate(http_requests_total[5m])) by (service)
```

Alerting rules. Fire alerts based on PromQL:

```yaml
groups:
  - name: api-alerts
    rules:
      - alert: HighErrorRate
        expr: api:http_request_error_rate:5m > 0.02
        for: 10m
        labels: { severity: page, team: api }
        annotations:
          summary: "API error rate > 2% for 10m on {{ $labels.service }}"
          runbook: "https://runbooks/api-errors"
```

Alertmanager. Receives alerts; routes, silences, groups, dedupes, sends to Slack/PagerDuty/webhook. Config is a tree of routes โ†’ receivers with matchers.
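
A sketch of such a routing tree (receiver names and matchers are hypothetical; group_by is what turns 50 pod alerts into one notification):

```yaml
route:
  receiver: slack-default            # fallback receiver
  group_by: [alertname, service]     # group related firings into one notification
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - matchers: ['severity="page"']
      receiver: pagerduty-oncall
    - matchers: ['team="api"']
      receiver: slack-api
receivers:
  - name: slack-default
  - name: pagerduty-oncall
  - name: slack-api
```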

Federation. One Prometheus scrapes a subset of another's metrics. Useful for: global aggregation across clusters, team-owned Prometheus with platform-owned aggregator.

Remote write. Prometheus ships samples to long-term storage (Thanos, Cortex, Mimir, Victoria Metrics). Native Prometheus storage caps at ~15 days practically; long-term + global view needs these.

Service discovery. Kubernetes SD auto-discovers pods/services via annotations:

```yaml
# Pod annotation
prometheus.io/scrape: "true"
prometheus.io/port: "8080"
prometheus.io/path: "/actuator/prometheus"
```

Prometheus Operator / kube-prometheus-stack. The Helm chart that deploys Prometheus + Alertmanager + Grafana + node-exporter + kube-state-metrics + default rules. The standard.

ServiceMonitor / PodMonitor โ€” CRDs that tell Prometheus Operator how to scrape a service or pod. Replaces raw prometheus.yml scrape configs.
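
A minimal ServiceMonitor sketch (names and labels are hypothetical; spec.selector must match the Service's labels, and the ServiceMonitor's own labels must match the Prometheus CR's serviceMonitorSelector):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: api
  labels:
    release: kube-prometheus-stack   # must match Prometheus's serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app: api                       # matches the Service's labels, not the pods'
  endpoints:
    - port: http                     # named port on the Service
      path: /actuator/prometheus
      interval: 30s
```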

Gotchas & war stories โ€‹

  • Cardinality explosion โ€” ID in a label. Prometheus OOMs. Drop or relabel in scrape config. Audit with topk(10, count by (__name__)({__name__=~".+"})).
  • Recording rules too frequent โ€” eat CPU; interval: 30s is usually fine, 15s rarely needed.
  • rate() on gauge โ€” meaningless. rate() is for counters.
  • Short [5m] windows on sparse series โ€” noisy. Use [15m] or [1h] for rare events.
  • Alertmanager missing silences โ€” run HA Alertmanager (gossip cluster). A single AM is a SPOF.
  • for: is measured in consecutive rule evaluations, not wall-clock coverage — for: 5m with evaluation_interval: 1m needs 5 consecutive true evaluations, and a single false (or stale) evaluation resets the timer. Plan alert windows around this.

Interview Q&A โ€‹

  1. Counter vs gauge vs histogram? โ€” Counter: monotonic, compute rates (rate()). Gauge: up/down (current value). Histogram: pre-bucketed distribution, compute percentiles server-side (histogram_quantile()). Summary: client-computed quantiles โ€” prefer histograms for aggregatability.

  2. Walk me through computing p99 latency. โ€” App exposes http_request_duration_seconds_bucket{le="..."} โ€” a histogram. Query: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)). Aggregate buckets across instances (sum by le), rate over 5m, then compute the 99th percentile interpolated from buckets.

  3. How do you solve a cardinality explosion? โ€” Find the offender with count by (__name__)({__name__=~".+"}). Drop or relabel the high-cardinality label in scrape config (metric_relabel_configs: drop). Long-term: educate app teams to bucket IDs, avoid full URL paths, use trace attributes for high-cardinality investigation instead.

  4. Prometheus for long-term storage โ€” how? โ€” Native Prometheus isn't great past 15โ€“30 days. Use remote write to Thanos, Cortex, Mimir, or Victoria Metrics โ€” horizontally scalable, object-storage-backed TSDBs that speak the Prometheus API. Grafana queries the long-term store; Prometheus becomes the short-window scraper.

  5. How do you scale Prometheus in a K8s cluster? โ€” Shard by namespace or by target label (hashmod in scrape config). Kube-prometheus-stack runs HA pairs (two Prometheuses scraping the same targets; Alertmanager gossip dedupes). Remote write to Mimir/Thanos for global query. Federation is legacy; remote write is preferred.

  6. Recording rules โ€” when do I need one? โ€” When the same complex query is run many times โ€” a dashboard with 12 panels all computing error rate, or an alert that evaluates the same expression every rule interval. Pre-compute in a recording rule, query the short name. Cuts dashboard load time and CPU on Prometheus.

  7. How do you avoid alert fatigue? โ€” Alert on symptoms, not causes (high error rate, not high CPU). Alert on SLO burn, not raw metric thresholds. Use for: to avoid flapping. Group alerts in Alertmanager so 50 pod failures = 1 page, not 50. Document a runbook link in every alert. Review alerts quarterly โ€” if nobody's fixed it in a month, it's either not real or already tolerated.

Further reading โ€‹


30. Visualization & APM โ€‹

Why this matters โ€‹

Dashboards are where observability meets humans. Grafana is the default for self-hosted; Datadog/New Relic/Dynatrace own the commercial APM space. Interviewers expect you to articulate when each is the right fit and to know what makes a good dashboard.

Core concepts โ€‹

Grafana.

  • Data-source-agnostic visualization: Prometheus, Loki, Tempo, Elasticsearch, InfluxDB, Postgres, CloudWatch, Datadog, etc.
  • Dashboards as JSON; store in Git.
  • Variables (drop-downs for cluster, service, instance) make one dashboard reusable across envs.
  • Alerting in Grafana โ€” unified alerting supports Prometheus-style rules against any data source. Good for cross-pillar alerts (combine metric + log).
  • Annotations โ€” pin deploy events on dashboards.
  • Grafana Cloud โ€” managed offering including Mimir (metrics), Loki (logs), Tempo (traces), Pyroscope (profiles).

Dashboard design principles (distilled from Brendan Gregg + Google SRE):

  1. One workload per dashboard โ€” don't make a dashboard that tries to show everything.
  2. Golden signals first โ€” rate, errors, duration (+ saturation) at the top.
  3. Dependencies drill-down โ€” DB latency, cache hit rate, downstream calls.
  4. Deploy markers โ€” annotate deploys; most incidents start "right after we deployed X."
  5. Links to runbooks โ€” from the dashboard to the Confluence/Git runbook for the on-call.

Commercial APM tools.

  • Datadog โ€” broadest product; strongest on-call experience; expensive per host + per metric.
  • New Relic โ€” similar; recently moved to consumption pricing.
  • Dynatrace โ€” AI-first; strong auto-discovery; enterprise-focused.
  • Honeycomb โ€” events-first, high-cardinality; beloved by teams that have outgrown traditional dashboards.
  • Splunk (Observability Cloud, formerly SignalFx) โ€” strong logs, adding traces/metrics.

When to use commercial: fast-moving teams without platform bandwidth, or need for advanced AIOps. When to self-host: cost control, data sovereignty, regulatory constraints.

RUM (Real User Monitoring) vs APM (Application Performance Monitoring).

  • RUM: browser/mobile instrumentation โ€” what users actually experience (Core Web Vitals, page load, JS errors).
  • APM: server-side โ€” traces, service maps, DB queries.
  • Full picture needs both.

Interview Q&A โ€‹

  1. What makes a good dashboard? โ€” One workload per dashboard. Golden signals at the top (request rate, error rate, latency p95/p99, saturation). Drill-down panels (DB, cache, downstream services). Deploy annotations. Runbook links. Variables to switch service/env. Avoid "wall of graphs" โ€” you can't read 40 panels at 3 AM.

  2. Grafana vs Datadog โ€” when each? โ€” Grafana (+ Prometheus + Loki + Tempo) when you want self-hosted, customizable, open ecosystem โ€” strong option for cost-sensitive teams with platform engineers. Datadog when speed-to-value matters more than bill; one product for infra + APM + logs + synthetics. Hybrid is common: Grafana Cloud for the ecosystem UX without operating Prometheus yourself.

  3. How do you correlate a slow user request across frontend and backend? โ€” RUM in the browser captures page/network timing and the backend trace_id (injected via header or returned in a meta tag). Browser reports to the RUM backend; backend trace is in Tempo/Jaeger. Click the RUM session โ†’ navigate to the trace โ†’ see the full request path server-side. OTel's JS SDK for RUM closes the loop.

  4. Alerting from dashboards โ€” why/not? โ€” Grafana supports alerting on any panel/query. Good: cross-pillar alerts (errors in metrics + specific log pattern), one tool for viz + alerts. Bad: drift risk (dashboard edits can silently break alerts). Many teams keep alerting rules in Prometheus/Alertmanager (code-reviewed) and Grafana for viz only.

Further reading โ€‹


31. Logging Stack โ€‹

Why this matters โ€‹

Logs are the last line of defense when you need to know exactly what happened. The stack is a choice between ELK (featureful, expensive at scale), Loki (cheap, label-only indexing), and managed services. Sizing and retention are everything โ€” logs grow linearly with traffic and exponentially with developer enthusiasm.

Core concepts โ€‹

ELK / Elastic Stack.

  • Elasticsearch โ€” inverted-index search engine.
  • Logstash โ€” data pipeline with filters (grok, geoip, etc.).
  • Kibana โ€” UI.
  • Beats (Filebeat, Metricbeat) โ€” lightweight shippers.

Elasticsearch internals to know:

  • Inverted index โ€” term โ†’ list of document IDs. What makes full-text search fast.
  • Shard โ€” a Lucene index; a partition of data. Docs are hashed to shards.
  • Replica โ€” a copy of a shard on another node (HA + read scaling).
  • Index Lifecycle Management (ILM) โ€” hot โ†’ warm โ†’ cold โ†’ delete phases with rollover.
  • Mapping โ€” schema per field; dynamic mappings can cause "mapping explosion" (thousands of fields; heap pressure).

OpenSearch โ€” AWS-maintained fork of Elasticsearch since the license change. API-compatible.

Loki (Grafana). "Prometheus for logs."

  • Indexes labels only, not content — cheap to store; content search is a brute-force scan over the log chunks selected by labels and time range.
  • Works great when you structure your logs with labels (service, level, env) and use LogQL to filter by label โ†’ full-text within.
  • Object-storage backend (S3, GCS) โ€” low cost.
  • Labels have the same cardinality discipline as Prometheus โ€” no user IDs as labels.

Fluentd vs Fluent Bit vs Vector.

  • Fluentd โ€” Ruby + C; robust; many plugins; heavier.
  • Fluent Bit โ€” C only; tiny footprint (~MBs); fast; now default in most K8s log shippers.
  • Vector (Datadog, open-source) โ€” Rust; observability-data-pipeline; fast; growing.
  • Logstash โ€” Java + JRuby; heavier; ELK-native.

In K8s, run as DaemonSet: tail /var/log/containers/*.log, parse JSON, enrich with pod metadata, ship to backend.

LogQL (Loki's query lang).

```logql
# All logs from API service with error level
{app="api", level="error"}

# Filter content
{app="api"} |= "timeout"

# Parse JSON and filter
{app="api"} | json | duration > 1000

# Metric from logs (count errors per service)
sum by (service) (rate({level="error"}[5m]))
```

Structured logging. JSON logs from apps โ€” timestamp, level, message, plus context fields and trace_id. Makes LogQL / Kibana / CloudWatch Insights queries trivial.

Retention & cost.

  • Hot days (fast queries): 7โ€“30 days.
  • Warm (slower): 30โ€“180 days.
  • Cold / archive (S3/Glacier): 1โ€“7 years (compliance).
  • Logs grow linearly with traffic; sampling low-signal logs (access logs summarized to metrics) is the main lever.
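
Retention math dominates the bill; a toy calculator for the tiering above (the per-GB-month prices are placeholders — substitute your backend's real rates):

```python
def monthly_storage_cost(gb_per_day: float, tiers: list[tuple[int, float]]) -> float:
    """tiers: (days_in_tier, price_per_gb_month) pairs. At steady state the
    data resident in a tier is gb_per_day * days_in_tier; cost is that
    volume at the tier's monthly rate, summed across tiers."""
    return round(sum(gb_per_day * days * price for days, price in tiers), 2)

# 200 GB/day: hot 7d on fast storage, warm 23d, cold 150d in object storage
tiers = [(7, 0.10), (23, 0.05), (150, 0.023)]
```

Doubling retention in the cold tier is usually cheap; doubling the hot window is not — which is why ILM rollover policies matter.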

Interview Q&A โ€‹

  1. ELK vs Loki โ€” pick one. โ€” ELK for rich full-text search and existing ecosystem investment; scales but cost grows fast with volume (heap, replicas, hot storage). Loki when you can structure logs with good labels and object-storage cost is a priority. Many shops end up with Loki for cloud-native apps + ELK for legacy with structured search needs.

  2. How does Elasticsearch store and retrieve logs? โ€” Docs indexed to shards; each shard is a Lucene index with an inverted index mapping terms โ†’ doc IDs. Queries consult the index, fetch matching doc IDs, score, return. Performance depends on heap for caches, fast disks (SSD), and careful mapping to avoid field explosion.

  3. Fluent Bit vs Fluentd — which to deploy? — Fluent Bit for DaemonSet log shipping in K8s — tiny memory, C-based, fast. Fluentd for a central aggregator or when you need a plugin that Fluent Bit doesn't have. Vector is the modern alternative to both. In 2026, Fluent Bit is the default K8s log shipper.

  4. How do you control log volume and cost? โ€” Move high-volume low-signal logs to metrics (aggregate counts, not raw lines). Sample access logs. Set aggressive ILM: hot 7d โ†’ warm 30d โ†’ cold 180d โ†’ delete. Limit retention at the source โ€” don't let a team log debug into prod. Audit by service; log a daily top-N to find the "chatty" culprit.

  5. Why structure logs as JSON? โ€” Field-level search and aggregation without regex. Pipeline parsers can enrich (add pod metadata). Queries become SQL-like (level=error AND service=api). Unstructured strings fight you at every scale boundary.

Further reading โ€‹


32. Tracing โ€‹

Why this matters โ€‹

Distributed tracing explains "this specific request was slow" in a microservices world. Interviewers expect you to know span vs trace, sampling, and how traces connect to logs and metrics.

Core concepts โ€‹

Vocabulary.

  • Trace โ€” one request's journey; a tree of spans.
  • Span โ€” a unit of work (HTTP call, DB query, function). Has start/end, name, attributes, events, links, status.
  • Parent/child โ€” caller is parent; callee is child.
  • Trace context โ€” trace_id + span_id carried across boundaries via headers.
  • Baggage โ€” key-value propagated with context across spans (e.g., tenant id).
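The trace-context mechanics above can be sketched by parsing a W3C `traceparent` header — the standard way trace_id + span_id cross service boundaries. Field widths follow the Trace Context spec; the sample IDs are the spec's own example values:

```python
# traceparent layout: version(2 hex) - trace_id(32 hex) - span_id(16 hex) - flags(2 hex)
def parse_traceparent(header: str) -> dict:
    version, trace_id, span_id, flags = header.split("-")
    assert len(trace_id) == 32 and len(span_id) == 16
    return {
        "version": version,
        "trace_id": trace_id,
        "span_id": span_id,
        "sampled": int(flags, 16) & 0x01 == 1,  # bit 0 = sampled flag
    }

ctx = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
# ctx["sampled"] is True — a parent-based sampler downstream should honor it
```

This is also why parent-based sampling works: the sampled flag rides along in the header, so every service in the call chain makes a consistent keep/drop decision.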

Tracing backends.

  • Jaeger (CNCF graduated) โ€” Go; pluggable storage (Elasticsearch, Cassandra, Kafka). Legacy backend many teams run.
  • Tempo (Grafana) โ€” object-storage-backed, trace-id-only indexing. Cheap. Pairs perfectly with Loki + Prometheus.
  • Zipkin โ€” original distributed tracer; legacy but still around.
  • AWS X-Ray โ€” native AWS; being deprecated in favor of ADOT + OTel to vendor of choice.

Sampling strategies revisited.

  • Always on โ€” capture every trace. Only viable for low QPS or small orgs.
  • Head probabilistic โ€” random % at span start.
  • Parent-based โ€” respect upstream's sampling decision.
  • Tail sampling (in collector) โ€” keep errors, keep slow, probabilistic sample the rest.
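Tail sampling is usually configured in the OpenTelemetry Collector. A sketch using the contrib `tail_sampling` processor (policy names and thresholds are illustrative — tune to your traffic):

```yaml
processors:
  tail_sampling:
    decision_wait: 10s            # buffer spans until the trace is complete
    policies:
      - name: keep-errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: keep-slow
        type: latency
        latency: { threshold_ms: 1000 }
      - name: sample-the-rest
        type: probabilistic
        probabilistic: { sampling_percentage: 5 }
```

Policies are OR-ed: a trace survives if any policy keeps it — 100% of errors and slow requests, ~5% of the happy path.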

Trace-log correlation. Inject trace_id into logs; in your logging backend click "view trace" to jump to the tracing UI. Loki + Tempo make this seamless ("Derived fields" in Grafana).

Span naming conventions. Span name = operation: GET /api/users/:id, sql.query, kafka.consume orders. Avoid variable content in names (use attributes instead โ€” http.target holds the path).

Interview Q&A โ€‹

  1. Span vs trace vs baggage? โ€” Trace is the whole tree for one request. Span is a node โ€” one operation. Baggage is key/value data attached to the context and propagated across spans (e.g., tenant_id, user_type) so downstream services see the upstream context.

  2. Why not always-on tracing? โ€” Cost โ€” storage and ingest. At high QPS, 10K spans/sec ร— 1KB ร— 30 days is a lot. Sample at the collector: keep 100% of errors and slow requests; probabilistic 1โ€“5% of happy path.

  3. How do you see a slow request end-to-end? โ€” Click the trace in your UI. Top bar shows total duration; each child span shows duration and attributes. Look for the span that ate most of the total โ€” DB query, downstream service, GC pause. Span attributes (db.statement, http.url) give context. Pivot to logs for that trace_id for the "exactly what happened" story.

  4. Jaeger vs Tempo? โ€” Jaeger is the original CNCF tracer; needs storage (Elasticsearch/Cassandra) that's expensive. Tempo is object-storage-backed (S3/GCS), indexes by trace_id only, designed to pair with Grafana + Loki + Prometheus. For new deployments, Tempo's cost + ecosystem fit wins for most.

Further reading โ€‹


33. SRE Practices โ€‹

Why this matters โ€‹

SRE is DevOps with teeth โ€” specific mechanisms (SLOs, error budgets, blameless postmortems) that make reliability quantifiable. Even non-SRE DevOps roles will ask about SLOs and postmortems because those practices scale past tooling.

Core concepts โ€‹

SLI / SLO / SLA.

  • SLI (Indicator) โ€” a metric: e.g., (successful_requests / total_requests) ร— 100 over 30d.
  • SLO (Objective) โ€” a target: "99.9% of requests succeed over 30d."
  • SLA (Agreement) โ€” contractual; consequences if not met (credits, $). SLOs are your internal targets that keep you above SLA.

Pick SLIs that reflect user experience: availability (success rate), latency (p95/p99), correctness (data freshness), throughput.
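An availability SLI in practice is a PromQL ratio. A sketch, assuming a conventional `http_requests_total` counter with a `code` label (your metric and label names may differ):

```promql
# Fraction of non-5xx requests over the SLO window
sum(rate(http_requests_total{code!~"5.."}[30d]))
/
sum(rate(http_requests_total[30d]))
```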

The "Nines" cheat:

| SLO | Downtime/year | Downtime/30d | Downtime/day |
| --- | --- | --- | --- |
| 99% | 3.65 days | 7.2 hours | 14.4 min |
| 99.9% | 8.77 hours | 43.8 min | 86 sec |
| 99.95% | 4.38 hours | 21.6 min | 43 sec |
| 99.99% | 52.6 min | 4.38 min | 8.6 sec |
| 99.999% | 5.26 min | 26.3 sec | <1 sec |
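The table above is just arithmetic — allowed downtime is (1 − SLO) × window. A quick sketch to recompute any cell (window lengths assume a 365.25-day year and a 30.44-day average month, matching the table's figures):

```python
def downtime(slo_pct: float, window_hours: float) -> float:
    """Allowed downtime in hours for a given SLO over a window."""
    return (1 - slo_pct / 100) * window_hours

YEAR, MONTH, DAY = 365.25 * 24, 30.44 * 24, 24.0

print(f"99.9%  -> {downtime(99.9, YEAR):.2f} h/year")        # ~8.77 h
print(f"99.9%  -> {downtime(99.9, MONTH) * 60:.1f} min/30d")  # ~43.8 min
print(f"99.99% -> {downtime(99.99, YEAR) * 60:.1f} min/year") # ~52.6 min
```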

Error budget. 1 - SLO = your allowed unreliability. 99.9% SLO = 0.1% error budget. If you burn it before the window ends, feature development slows until reliability recovers. This is the lever SRE uses to balance velocity vs stability โ€” a team burning error budget should prioritize reliability work over features until the budget is back in surplus.

Burn rate alerts. Alert on fast burn vs slow burn of the error budget:

  • Fast burn: "at this rate, we'll burn the 30d budget in 1h โ€” page now."
  • Slow burn: "at this rate, we'll burn it in 3d โ€” ticket, not page."

Toil.

  • Google SRE definition: manual, repetitive, automatable, tactical, no enduring value, O(n) with service size.
  • Cap toil at 50% of an SRE's time; the other half is engineering to reduce future toil.
  • Track via: "number of tickets handled," "manual deploys," "on-call paging frequency."

On-call.

  • Rotation โ€” sustainable (7 days max, preferably shorter).
  • Primary + secondary.
  • Handoff โ€” explicit, written "what happened this shift."
  • Runbooks โ€” linked from every alert.
  • Compensation โ€” either comp time or pay differentials; respect the toll.

Incident management.

  • Severity levels (typical SEV1โ€“4). SEV1 = customer-visible outage; SEV4 = minor cosmetic.
  • Incident Commander (IC) โ€” runs the incident; doesn't debug.
  • Roles: IC, Scribe (timeline), Communications (stakeholder updates), Technical Lead.
  • Status page for customer-facing incidents.
  • Timeline discipline โ€” scribe captures events + decisions in real time.

Blameless postmortem.

  • Focus on systems, not people.
  • 5 Whys or Contributing Factors.
  • Action items with owners + dates.
  • Publish broadly โ€” learn across teams.
  • "Hindsight bias" is the enemy. Ask "given what was known at the time, would any reasonable engineer have made the same call?"

MTTA / MTTR / MTTD / MTTF / MTBF.

  • MTTD โ€” Mean Time to Detect.
  • MTTA โ€” Mean Time to Acknowledge.
  • MTTR — Mean Time to Restore (the DORA metric); sometimes expanded as Mean Time to Repair — clarify which your team means.
  • MTTF โ€” Mean Time to Failure.
  • MTBF โ€” Mean Time Between Failures.

Capacity planning. Project growth (usage, data size, RPS); stress-test to find the breaking point; keep 2× headroom over expected. Revisit the model every quarter.

Chaos Engineering. Proactive failure injection โ€” drop a pod, kill a DB node, add latency, fail a dependency. Verifies assumptions before incidents prove them wrong. Tools: Chaos Mesh, Litmus, Gremlin, Pumba. Run as game days, not surprise raids.

Interview Q&A โ€‹

  1. Explain SLI, SLO, SLA. โ€” SLI is the measurement โ€” e.g., success rate. SLO is the target on that measurement โ€” e.g., 99.9% success over 30 days. SLA is the contractual promise to customers with consequences for missing โ€” e.g., service credits if monthly SLA is missed. SLOs sit above SLAs so you have room before contractual breach.

  2. What's an error budget and how do you use it? โ€” The budget is 1 โˆ’ SLO. If SLO is 99.9%, you can be down 0.1% of the time without being "out of SLO." When you've burned the budget, the policy is: stop shipping risky features, focus on reliability, until the budget recovers. It makes the velocity-vs-stability trade a data-driven decision, not a political one.

  3. How do you pick SLIs? โ€” Start from user experience. For a web API: availability (success rate), latency p95 or p99, error rate. For a batch system: completion within SLA window, data freshness. Don't pick metrics just because they're easy to collect (CPU) โ€” those are debugging aids, not user-facing SLIs.

  4. Walk me through how you'd run a blameless postmortem. โ€” Within 48h after incident resolution. Attendees: IC, responders, relevant leads โ€” not execs shaming people. Agenda: timeline (what happened minute-by-minute), contributing factors (what made it possible; what delayed recovery), action items (owners + due dates), what went well. Published to the org. Tracked by SRE manager to ensure action items close.

  5. What is toil and why cap it? โ€” Work that's manual, repetitive, automatable, O(n) with growth, no enduring value. On-call toil compounds: if you don't cap it, engineers spend all their time firefighting, no time to fix root causes, team burns out. Google rule: 50% maximum; the rest is engineering to reduce future toil.

  6. Symptom-based alerting vs cause-based. โ€” Alert on symptoms (elevated user-facing error rate) โ€” those are customer-impacting. Don't alert on every cause (CPU high, disk low) unless they directly cause a symptom you can't otherwise detect. Symptom-based keeps pages relevant; cause-based buries the real signal.

  7. Chaos engineering โ€” how do you start without breaking prod? โ€” Start in staging with scoped experiments โ€” kill one pod, verify HA. Move to a small prod blast radius (one AZ, one tenant) under observation. Always have: abort criteria, clear observers, a steady-state hypothesis, and a rollback plan. Game days are a safer entry point than surprise injections.

  8. An alert fires โ€” your on-call flow. โ€” Acknowledge within SLA (5 min). Open the runbook linked from the alert. Check dashboard (is the symptom still present?). Identify blast radius. If unclear or getting worse, declare an incident, page secondary, start a war room. Coordinate a mitigation (rollback, scale up, reroute). Write a status update. After: postmortem.

Further reading โ€‹


34. Alerting Philosophy โ€‹

Why this matters โ€‹

Bad alerting is how good teams burn out. A team with 500 alerts/week is a team that's slowly giving up on alerting. Interviewers probe the frameworks (USE, RED, Golden Signals) to see if you've thought about this beyond "set a threshold on CPU."

Core concepts โ€‹

USE method (Brendan Gregg) โ€” for resources. On every resource (CPU, memory, disk, network), check:

  • Utilization โ€” % busy.
  • Saturation โ€” queue length / waiters.
  • Errors โ€” rate of errors.

Great for infra โ€” find the bottleneck on a box.

RED method (Tom Wilkie) โ€” for services. On every request-driven service:

  • Rate โ€” requests per second.
  • Errors โ€” failed per second.
  • Duration โ€” latency distribution.

Great for microservices โ€” the dashboard shape for every API.

Four Golden Signals (Google SRE).

  • Latency
  • Traffic
  • Errors
  • Saturation

Basically RED + saturation; for anything serving requests.

Symptom-based alerting. Alert on user-visible effects, not causes. "Error rate exceeds SLO burn rate" โ€” page. "CPU 90%" โ€” ticket only (might not be an issue).

Burn rate alerts (SRE Workbook Chapter 5):

  • Fast burn โ€” 2% of 30d budget in 1h โ†’ page (outage in progress).
  • Slow burn โ€” 10% in 6h โ†’ ticket (degraded, investigate).
  • Multiple windows โ€” combine long + short windows to avoid false-positive flapping.
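A multiwindow fast-burn alert can be sketched as a Prometheus rule. The 14.4× factor and the 1h + 5m window pairing follow the SRE Workbook's worked example for a 99.9% SLO; the recording-rule metric names and runbook URL are assumptions for illustration:

```yaml
groups:
  - name: slo-burn
    rules:
      - alert: FastBurn
        # 14.4x burn rate exhausts a 30d budget in ~2 days; the short window
        # confirms the burn is still happening (avoids paging on a past blip).
        expr: >
          job:slo_errors_per_request:ratio_rate1h > (14.4 * 0.001)
          and
          job:slo_errors_per_request:ratio_rate5m > (14.4 * 0.001)
        labels: { severity: page }
        annotations:
          runbook: https://runbooks.example.com/slo-fast-burn
```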

Alert routing.

  • Page for things that need action within minutes.
  • Ticket for things that need action within days.
  • Digest (email/Slack summary) for things that need awareness.
  • The only thing that should page is a customer-impacting symptom or imminent one.

Silences & maintenance windows. Planned change โ†’ silence related alerts. Undo after the change. Unsilenced maintenance is the #1 cause of "oh, we knew about that."

Runbook links. Every alert has a URL in its annotations. Runbook says: what this means, how to verify, what to do first, when to escalate. Alerts without runbooks are bugs.

Review cadence. Quarterly review of all alerts. If an alert hasn't fired in 6 months, it's probably useless (or the threshold needs tightening). If an alert fires weekly and nobody acts on it, it's noise โ€” delete or fix.

Interview Q&A โ€‹

  1. When would you use USE vs RED? โ€” USE for resources (a machine, a disk, a pool) โ€” finds bottlenecks. RED for services (a microservice, an endpoint) โ€” shows health of the request path. Most teams have both: USE dashboards for nodes/DBs, RED dashboards per service, Golden Signals at the top of each service dashboard.

  2. How do you reduce alert fatigue? โ€” Symptom-based alerting only pages; cause-based becomes ticketing. SLO burn-rate alerts replace raw threshold alerts (avoids "CPU 81% at 3 AM"). Every alert has a runbook link or it doesn't ship. Group related alerts in Alertmanager. Quarterly audits โ€” if it hasn't fired in 6 months, delete; if it fires weekly and no one reads, fix.

  3. What's a burn-rate alert and why better than threshold? โ€” It measures how fast you're consuming error budget over a window. "2% of monthly budget burned in 1 hour" โ†’ that's outage pace, page immediately. A raw threshold alert ("error rate > 1%") doesn't distinguish a blip from a disaster. Burn-rate gives you time-aware severity.

  4. An alert keeps firing but nothing's wrong โ€” what do you do? โ€” Don't just snooze it. Investigate: is the threshold wrong (tune)? Is the metric wrong (migrate)? Is there a real condition we've learned to ignore (maybe the alert is valid but the fix is expected background work โ€” convert to ticket)? Every flapping alert is a trust erosion on all alerts.

  5. How do you design a new service's alerting? โ€” Start from SLOs (availability, latency). Define burn-rate alerts (fast + slow). Add dependency health (DB latency, cache hit rate) as tickets/non-page. Add saturation alerts (queue depth, pool exhaustion) as tickets. One runbook per page alert. Test alerts fire correctly before going to prod.

Further reading โ€‹


Part VII โ€” Security / DevSecOps โ€‹

35. DevSecOps Principles โ€‹

Why this matters โ€‹

Security bolted on at the end is always wrong. DevSecOps pushes security into every stage of the loop โ€” commit, build, deploy, run. It's an interview priority in any security-conscious environment. Expect "what is shift-left?" and "walk through security in your pipeline."

Core concepts โ€‹

Shift-left โ€” move security findings as early as possible:

  • IDE plugins catch issues while coding (SonarLint, Snyk).
  • Pre-commit hooks block secrets.
  • PR checks run SAST + SCA.
  • Build gates on critical CVEs.
  • Runtime policies enforce what made it through.

Cost of a bug fix rises ~10ร— per phase moved right (dev โ†’ test โ†’ prod). Shifting left is both cheaper and faster.

"Shift smart." 2025-era evolution: not everything should run on every commit. Lightweight gates in PR; heavy scans (OWASP DC full scan, DAST) nightly. Avoid drowning developers in noise.

Zero Trust.

  • Never trust the network.
  • Verify every request, every time, at every layer.
  • Assume breach.

Practical for DevOps: mTLS between services, workload identity (IRSA/Workload Identity), NetworkPolicies, signed images + admission controllers, continuous auth (OIDC token refresh), audit logging everywhere.
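The default-deny NetworkPolicy mentioned above is short enough to memorize — an empty podSelector matches every pod in the namespace, and with both policyTypes listed but no rules, nothing is allowed until explicit allow policies are layered on top:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: prod          # apply per namespace
spec:
  podSelector: {}          # selects all pods in the namespace
  policyTypes: [Ingress, Egress]
```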

Threat modeling.

  • STRIDE โ€” Spoofing, Tampering, Repudiation, Information disclosure, Denial of service, Elevation of privilege.
  • PASTA โ€” Process for Attack Simulation and Threat Analysis.
  • Attack trees โ€” hierarchical decomposition of how an adversary would attack.

Do it for new services, major features, and on a cadence (yearly) for existing systems.

OWASP Top 10 (2021 edition — still the list most interviews reference):

  1. Broken Access Control.
  2. Cryptographic Failures.
  3. Injection (SQL, command, etc.).
  4. Insecure Design.
  5. Security Misconfiguration.
  6. Vulnerable and Outdated Components.
  7. Identification and Authentication Failures.
  8. Software and Data Integrity Failures.
  9. Security Logging and Monitoring Failures.
  10. Server-Side Request Forgery (SSRF).

Know all 10; map each to a code-level mitigation (Spring Security's method-level auth for #1, parameterized queries for #3, etc.).

Defense in depth. Every security control will fail. Layer them so no single failure is catastrophic:

  • Network (SG, NetworkPolicy, WAF)
  • Auth (OIDC, mTLS)
  • Authorization (RBAC, method security)
  • Input validation
  • Output encoding (XSS)
  • Secret management
  • Logging + monitoring

Interview Q&A โ€‹

  1. What is DevSecOps? โ€” Security as an integrated practice in DevOps โ€” not a gate at the end. Automated security checks at every CI/CD stage (SAST on commit, SCA on PR, image scan on build, policy check on deploy, runtime policies in cluster). Culture: shared ownership between dev + ops + security; no "security team throws it back."

  2. Walk me through shift-left for a Spring Boot service. โ€” IDE: SonarLint + GitLeaks pre-commit catches common bugs + secrets. PR: SAST (Semgrep/SonarQube), SCA (OWASP DC), unit tests, lint. Build: container scan (Trivy), Dockerfile lint, Cosign signing, SBOM generation. Deploy: Kyverno admission ensures signed image + non-root + restricted SCC. Runtime: Falco for anomaly detection. Every finding is blockable in PR if critical.

  3. What's Zero Trust and how does it apply to K8s? โ€” Don't trust the network; verify every request. In K8s: mTLS between services (service mesh like Istio), NetworkPolicies for default-deny, RBAC + ServiceAccounts (no shared identities), short-lived tokens (OIDC/IRSA, no static access keys), admission controllers that reject untrusted images, audit log every API call.

  4. How do you threat-model a new microservice? โ€” STRIDE walkthrough per component: can someone Spoof the identity (auth)? Tamper with data (integrity)? Repudiate action (audit logs)? See data they shouldn't (authz, encryption)? Deny service (rate limit, circuit breaker)? Escalate privilege (RBAC, least privilege)? Produce a list of mitigations; prioritize by impact ร— likelihood.

  5. OWASP Top 10 โ€” name as many as you can, with a Spring mitigation. โ€” (1) Broken Access Control โ€” Spring Security @PreAuthorize method security. (2) Crypto failures โ€” use BouncyCastle / JCA, enforce TLS 1.2+. (3) Injection โ€” JPA parameterized queries, never concat SQL. (4) Insecure design โ€” threat model. (5) Security misconfig โ€” Spring Boot actuator endpoints locked down. (6) Vulnerable components โ€” OWASP DC + Dependabot. (7) AuthN โ€” OAuth2/OIDC, strong session handling. (8) Integrity โ€” signed artifacts, SBOM, Cosign verification. (9) Logging โ€” structured, enriched, shipped to SIEM. (10) SSRF โ€” validate + allow-list outbound URLs.

Further reading โ€‹


36. Pipeline & Code Security โ€‹

Why this matters โ€‹

Your CI/CD pipeline is your supply chain. If a PR's SAST is broken or image scanning is advisory-only, attackers ship their payloads through your pipeline. Interviewers want specific tools + gates + triage workflow.

Core concepts โ€‹

SAST (Static Application Security Testing). Scans source code for bug patterns (SQL injection, path traversal, insecure crypto).

  • SonarQube / SonarCloud โ€” broad language coverage, quality + security in one.
  • Semgrep โ€” pattern-based, fast, easy to write custom rules.
  • CodeQL (GitHub) โ€” query-based, powerful, runs in Actions; free for public repos.
  • Checkmarx, Veracode, Fortify โ€” enterprise, deep scanning, heavier.
  • SpotBugs / Find Security Bugs โ€” Java-specific, fast.

SAST fires false positives โ€” triage discipline matters. Track baseline; fail only on NEW issues or severity increases.

DAST (Dynamic). Runs against a running app โ€” fuzzes inputs, finds runtime vulns.

  • OWASP ZAP โ€” open-source; baseline scans in pipeline; full scan nightly.
  • Burp Suite โ€” commercial; manual + scripted.
  • StackHawk โ€” CI-friendly DAST.

IAST โ€” combines SAST + DAST via runtime instrumentation. Contrast Security, Seeker. Rare in cloud-native; heavy agent.

SCA (Software Composition Analysis). Scans dependencies for known CVEs.

  • OWASP Dependency-Check โ€” free, cross-language; many projects run it.
  • Snyk โ€” commercial, rich UI, fix suggestions.
  • Dependabot (GitHub) โ€” automated PR to upgrade.
  • Renovate โ€” more configurable alternative to Dependabot.
  • Trivy โ€” also does SCA for language manifests alongside container scanning.

Secret scanning.

  • gitleaks โ€” pre-commit + CI.
  • trufflehog โ€” entropy-based + verification (try the secret; if it auths, definitely real).
  • GitHub Advanced Security secret scanning โ€” built-in for Enterprise.
  • AWS GuardDuty โ€” detects leaked AWS creds in use.

Pipeline policy gates.

  • Fail PR if NEW critical/high CVEs in SAST/SCA/image scan.
  • Fail PR if secrets detected.
  • Fail PR if license scan finds a forbidden license.
  • Signed + verified images required for prod (Kyverno/policy-controller at admission).
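The PR gates above translate into a short CI job. A sketch for GitHub Actions — the action names are the commonly published ones (gitleaks, Trivy); verify versions and required inputs against the marketplace before use:

```yaml
name: security-gates
on: pull_request
jobs:
  scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with: { fetch-depth: 0 }           # gitleaks needs full git history
      - uses: gitleaks/gitleaks-action@v2  # fail the PR on detected secrets
      - uses: aquasecurity/trivy-action@master
        with:
          scan-type: fs                    # scan the repo's dependency manifests
          severity: CRITICAL,HIGH
          exit-code: "1"                   # non-zero exit fails the job
```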

Triage discipline.

  • Ignore / accept with justification + expiry.
  • Baseline existing findings; only block on new.
  • Weekly security standup reviews backlog.
  • Auto-remediation where possible (Renovate + Dependabot).

Anchor example โ€‹

A typical setup: OWASP Dependency-Check on the backend plus npm audit on the frontend in a nightly GitHub Actions workflow, with supply-chain-sensitive packages pinned to exact versions. Concrete example to cite: "We pin packages with a history of supply-chain compromise to exact versions — no caret ranges — after the axios@1.14.1 incident where a malicious transitive dep got pulled in via an auto-resolved version. Nightly OWASP DC on backend + npm audit on frontend; findings file issues; critical ones block the PR merge."

Interview Q&A โ€‹

  1. SAST vs DAST vs IAST vs SCA โ€” explain each. โ€” SAST: scans source code for patterns (SQL injection, unsafe deserialization). DAST: fuzzes running app inputs looking for runtime vulns. IAST: runtime-instrumented hybrid; detect both patterns and behavior. SCA: scans third-party dependencies for known CVEs. You need all four (or SAST + SCA + DAST minimum).

  2. How do you prevent false-positive fatigue in SAST? โ€” Baseline: only block on NEW findings from the current PR. Tune: suppress known-false-positive rules with comments. Triage weekly: legit โ†’ fix; false โ†’ suppress permanently with justification. Measure: % of flagged findings that lead to a code change; if <10%, tools or rules need tuning.

  3. Dependabot vs Renovate? โ€” Dependabot: built into GitHub, zero config for basics, limited configurability. Renovate: more flexible (group updates, schedule windows, separate PR strategies per dep type), works on GitHub/GitLab/self-hosted. Most teams end up on Renovate once they need custom grouping.

  4. How do you handle a critical CVE in a transitive dependency you don't directly control? โ€” Pin the transitive in the dependency tree (Maven dependencyManagement, npm overrides, Gradle resolutionStrategy). Verify the fix works in integration tests. Long-term: push upstream maintainers for a fix; if abandoned, consider a fork.

  5. Secret scanning โ€” how do you handle a leaked secret? โ€” Treat it as compromised โ€” rotate immediately regardless of push duration. Purge from git history with git filter-repo or BFG; force-push after coordinating with team. Audit access logs for use of the leaked credential. Add pre-commit scanner to prevent repeat. Document the incident.

Further reading โ€‹


37. Supply Chain Security โ€‹

Why this matters โ€‹

Supply-chain attacks have overtaken "classic" code vulns as the most-reported root cause. SolarWinds, xz, axios โ€” the pattern repeats. Interviewers want you to know SBOM, SLSA, Cosign, and have a story about a supply-chain incident (the axios@1.14.1 event is a recent one most engineers can speak to).

Core concepts โ€‹

SBOM (Software Bill of Materials). Machine-readable inventory of every dependency in a build. Two main formats:

  • SPDX โ€” Linux Foundation; broad tool support.
  • CycloneDX โ€” OWASP; slimmer; security-focused.

Tools:

  • Syft โ€” generate SBOM from images/filesystems.
  • Trivy โ€” generates SBOM + scans against it.
  • OWASP Dependency-Track โ€” SBOM inventory + vuln tracking.

SBOM is now a US federal requirement (Executive Order 14028) for software sold to government.

Provenance. Attested record of how an artifact was built — by what, from which source. The SLSA framework defines levels (the five-step ladder below is the original v0.1 scheme; SLSA v1.0's Build track tops out at level 3):

  • SLSA 0 — no provenance.
  • SLSA 1 — documented, scripted build; provenance exists.
  • SLSA 2 — hosted build service; signed provenance.
  • SLSA 3 — isolated, unprivileged build; provenance resistant to forgery.
  • SLSA 4 (v0.1 only) — two-party review, hermetic build, reproducible.

Most real-world shops aim for SLSA 2–3.

Image signing.

  • Cosign (Sigstore) โ€” sign images + provenance + SBOM. Keyless signing via OIDC (GitHub, GitLab, Google) โ€” no key management.
  • Notation (CNCF) โ€” alternative standard.

Verify at admission with Kyverno or Sigstore's policy-controller:

```yaml
apiVersion: policy.sigstore.dev/v1beta1
kind: ClusterImagePolicy
metadata:
  name: require-cosign
spec:
  images:
    - glob: "ghcr.io/myorg/*"
  authorities:
    - keyless:
        url: https://fulcio.sigstore.dev
        identities:
          # subject is an exact match; use subjectRegExp for patterns
          - issuer: https://token.actions.githubusercontent.com
            subjectRegExp: "https://github.com/myorg/.+/\\.github/workflows/.+"
```

Sigstore. The ecosystem behind Cosign: Fulcio (CA for keyless signing), Rekor (transparency log of signatures), Cosign (signing tool). Trust model: short-lived cert tied to an OIDC identity; Rekor stores a tamper-evident log.

Dependency pinning.

  • Exact versions for anything with a history of supply-chain compromise (HTTP clients, auth libs, cryptocurrency-adjacent packages).
  • Caret ranges (^1.2.3) are a supply-chain risk โ€” auto-resolve can pull in a compromised patch.
  • Lockfiles โ€” package-lock.json, Pipfile.lock, go.sum, Cargo.lock โ€” pin to exact resolved versions. Commit them.
  • Digest pinning for container images: image@sha256:... instead of image:tag — a tag is mutable, a digest is not.
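Pinning in practice, sketched for npm. The `overrides` field (npm 8.3+) forces a transitive dependency to a specific version; the package names and versions here follow this doc's axios anecdote and are illustrative, not real advisories:

```json
{
  "dependencies": {
    "axios": "1.14.0"
  },
  "overrides": {
    "plain-crypto-js": "4.2.0"
  }
}
```

Maven's equivalent is `dependencyManagement`; Gradle's is `resolutionStrategy.force` — same idea: the root project, not the dependency graph, decides the resolved version.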

Trusted registries. Pull from private registries you control (ECR, GAR, GHCR, Harbor). Proxy external registries through your private (Artifactory) to cache + scan + sign.

Reproducible builds. Same source + same toolchain = bit-identical artifact. Enables independent verification. Hard to achieve in practice (timestamps, non-deterministic ordering). SLSA L4 requires it.

Famous supply-chain incidents to know:

  • SolarWinds (2020) โ€” build system compromise; malicious code signed with legit cert; persisted in updates.
  • Log4Shell (2021) — JNDI lookup in log4j; virtually every Java shop affected.
  • Codecov bash uploader (2021) โ€” malicious script exfiltrated credentials from CI.
  • ua-parser-js (2021) โ€” typo-squat / compromised maintainer.
  • xz backdoor (2024) โ€” multi-year social engineering of an OSS maintainer; SSH backdoor in compressed library.
  • axios@1.14.1 (2026) โ€” malicious transitive (plain-crypto-js); dropped a Python RAT.

Anchor example โ€‹

Narrating the axios incident in an interview: "In March we caught axios@1.14.1 pulling in a poisoned transitive, plain-crypto-js@4.2.1, which deployed a Python RAT. Our caret-range ^1.14.0 auto-resolved to the bad version. Remediation: pinned axios to exact 1.14.0, regenerated the lockfile, purged npm cache, scanned for persistent compromise, rotated credentials as precaution. Lesson to burn in: caret ranges on high-value deps (HTTP clients, auth libs) are a supply-chain risk โ€” pin exact, upgrade deliberately with review."

Interview Q&A โ€‹

  1. What is an SBOM and why does it matter? โ€” Software Bill of Materials โ€” a structured inventory of every dependency in an artifact (name, version, hash, license, relationships). Matters because when a new CVE lands (Log4Shell-level), you can query "do we ship this?" in minutes instead of days of manual audit. Increasingly a federal procurement requirement.

  2. Walk through signing and verifying an image with Cosign. โ€” Build image, push to registry. cosign sign --yes <image> โ€” uses keyless OIDC (GitHub Actions identity is common), fetches a short-lived cert from Fulcio, signs, pushes signature to the registry, records in Rekor's transparency log. At admission, Sigstore policy-controller or Kyverno verifies the signature came from a trusted identity (e.g., "workflow in my org's main branch") before allowing the pod.

  3. SLSA levels โ€” explain 0 through 3. โ€” 0: no provenance. 1: there's a documented build process. 2: builds are tamper-resistant and produce signed provenance. 3: isolated hermetic builds; source is version-controlled with verified provenance. Each level adds progressively stronger trust in the artifact's origin story.

  4. Caret vs exact version pinning โ€” trade-off? โ€” Caret (^1.2.3) auto-consumes minor/patch updates โ€” security fixes flow without a PR, but so does a compromised patch. Exact (1.2.3) requires a PR to upgrade โ€” human review catches anomalies, but security fixes lag unless you run Renovate/Dependabot. Pragmatic: exact pins + automated update PRs (Renovate) + scanning on every PR.

  5. How would you detect a SolarWinds-style attack? โ€” Hard โ€” malicious code signed with a legit cert, delivered via update channel. Defenses: SLSA-3 builds (isolated, reproducible โ€” the attacker can't silently slip a payload into a reproducible build). Diff analysis on update content (unexpected functions added). Runtime anomaly detection (Falco) flags unusual syscalls. SBOM + provenance give you a forensic trail after the fact; prevention needs defense in depth.

  6. Tell me about a supply-chain incident you've dealt with. โ€” "In March 2026, axios@1.14.1 was compromised with a malicious transitive (plain-crypto-js) that dropped a Python RAT on install. Our ^1.14.0 range pulled it in automatically. We rotated credentials, purged npm cache, pinned axios to 1.14.0 exact, regenerated the lockfile, and added a 'high-value dep exact-pin' rule. Now any HTTP client, auth lib, or crypto package is exact-pinned; we accept the ongoing upgrade PR churn in exchange for not being a supply-chain casualty."

Further reading โ€‹


38. Secrets Management โ€‹

Why this matters โ€‹

Every system has secrets. How you store, distribute, and rotate them is the difference between a minor rotation exercise and a "leaked master key" nightmare. DevOps interviews almost always include secrets questions.

Core concepts โ€‹

Secret stores (cloud + OSS).

  • HashiCorp Vault โ€” the Swiss Army knife. KV secrets, dynamic secrets (generate DB creds on demand), PKI (issue short-lived certs), Transit (encryption as a service), SSH signing, AWS/Azure/GCP dynamic secrets. Enterprise features: namespaces, HSM, replication.
  • AWS Secrets Manager โ€” AWS-native, integrated rotation, multi-region replication, pricey per secret.
  • AWS SSM Parameter Store โ€” cheaper; SecureString type; good for config + small secrets; no built-in rotation.
  • Azure Key Vault / GCP Secret Manager โ€” cloud equivalents.
  • K8s Sealed Secrets โ€” encrypt manifests with cluster-specific pub key; controller decrypts.
  • K8s External Secrets Operator โ€” pulls from Vault/AWS/Azure/GCP at runtime; materializes K8s Secret objects.
  • CSI Secrets Store โ€” mounts secrets as files from Vault/AWS/Azure/GCP without creating K8s Secret objects.
  • SOPS โ€” encrypt YAML/JSON in Git with age/GPG/KMS keys. Pair with GitOps (ArgoCD plugin, Helm SOPS).

Dynamic secrets. Instead of storing a long-lived password, Vault generates a short-lived one on demand: "give me a Postgres user," Vault creates it, returns it with a 1h TTL; on expiry Vault deletes the user. Eliminates the "rotate a shared password" workflow.
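A minimal sketch of wiring this up with Vault's database secrets engine. The connection URL, bootstrap credentials, role name, and grants below are illustrative placeholders, not a production config:

```bash
# Enable the database secrets engine and register a Postgres connection
vault secrets enable database

vault write database/config/postgres \
  plugin_name=postgresql-database-plugin \
  connection_url="postgresql://{{username}}:{{password}}@db.internal:5432/app" \
  allowed_roles=readonly \
  username=vault-admin password='bootstrap-only'

# Define a role: the SQL Vault runs to mint each ephemeral user
vault write database/roles/readonly \
  db_name=postgres \
  creation_statements="CREATE ROLE \"{{name}}\" WITH LOGIN PASSWORD '{{password}}' VALID UNTIL '{{expiration}}'; GRANT SELECT ON ALL TABLES IN SCHEMA public TO \"{{name}}\";" \
  default_ttl=1h max_ttl=24h

# Each read mints a fresh user; Vault drops it when the lease expires
vault read database/creds/readonly
```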

Rotation patterns.

  • Scheduled โ€” every N days.
  • On-demand โ€” via ops or API call.
  • Event-driven โ€” on employee departure, on compromise.
  • Dynamic โ€” creds are short-lived by design; rotation is automatic.

Workload identity > static creds. IRSA, Workload Identity, Pod Identity โ€” workloads present a signed identity token to the cloud, cloud returns short-lived creds. No secret to rotate. This is the cleanest pattern for cloud access.
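Concretely, IRSA on EKS comes down to one annotation on the workload's ServiceAccount; the account ID and role name below are placeholders:

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: api
  namespace: prod
  annotations:
    # IRSA: the cluster's OIDC provider federates this SA to the IAM role;
    # the AWS SDK in the pod picks up short-lived creds automatically.
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/api-s3-reader
```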

Secret injection patterns in K8s.

  1. Vault Agent Injector โ€” mutating webhook; injects a sidecar that fetches from Vault and writes to a shared volume; app reads files.
  2. External Secrets Operator โ€” pulls from backend, writes a K8s Secret; app consumes like normal. Most common.
  3. CSI Secrets Store โ€” mounts secrets as files via CSI; no K8s Secret object needed.
  4. init container โ€” not recommended; single fetch, doesn't refresh.

What should NOT be in secret stores. Non-secrets (config, feature flags, URLs) โ€” use ConfigMaps / Parameter Store string params / env. Mixing balloons the secret store's access surface.

Envelope encryption. Data encrypted with a DEK (Data Encryption Key); DEK encrypted with a KEK (Key Encryption Key) in KMS. Rotate KEK without re-encrypting data. All major clouds + Vault support this pattern.
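A toy sketch of the pattern in Python. A SHA-256 XOR keystream stands in for AES-GCM (illustrative only, never real encryption); the point is that KEK rotation only rewraps the DEK and never touches the bulk ciphertext:

```python
import hashlib
import secrets

def toy_cipher(key: bytes, data: bytes) -> bytes:
    """XOR with a SHA-256-derived keystream. A stand-in for AES-GCM --
    illustrative only, never use this for real encryption."""
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        stream.extend(hashlib.sha256(key + counter.to_bytes(8, "big")).digest())
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))

# Encrypt: a random DEK encrypts the data; the KEK (held in KMS) wraps the DEK.
kek_v1 = secrets.token_bytes(32)
dek = secrets.token_bytes(32)
ciphertext = toy_cipher(dek, b"customer record")
wrapped_dek = toy_cipher(kek_v1, dek)

# Rotate the KEK: unwrap with v1, rewrap with v2 -- bulk ciphertext untouched.
kek_v2 = secrets.token_bytes(32)
wrapped_dek_v2 = toy_cipher(kek_v2, toy_cipher(kek_v1, wrapped_dek))

# Decrypt path after rotation still recovers the plaintext.
recovered_dek = toy_cipher(kek_v2, wrapped_dek_v2)
assert toy_cipher(recovered_dek, ciphertext) == b"customer record"
```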

Commands you should know cold โ€‹

bash
# Vault basics
vault login -method=oidc
vault kv put secret/api/db password=s3cr3t host=...
vault kv get secret/api/db

# AWS Secrets Manager
aws secretsmanager get-secret-value --secret-id prod/api/db | jq -r .SecretString

# SSM Parameter Store
aws ssm put-parameter --name /app/prod/db_url --type SecureString --value "$URL"
aws ssm get-parameter --name /app/prod/db_url --with-decryption

# External Secrets - typical manifest
cat <<'YAML' | kubectl apply -f -
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata: { name: api-db, namespace: prod }
spec:
  refreshInterval: 1h
  secretStoreRef: { name: aws-secrets, kind: ClusterSecretStore }
  target: { name: api-db }
  data:
    - secretKey: password
      remoteRef: { key: prod/api/db, property: password }
YAML

Gotchas & war stories โ€‹

  • Committed .env file in a public repo โ€” rotate immediately; assume compromised even if "only for an hour."
  • Environment variables leak secrets into crash dumps, /proc/<pid>/environ, inherited child processes, and debug endpoints — prefer file-mounted.
  • Long-lived cloud access keys in CI โ€” OIDC federation instead.
  • Rotating a secret without rolling the app โ€” if the app caches the old value in memory, rotation has no effect until restart.
  • Vault bootstrapping problem — who holds the unseal keys and the initial root token? Usually: Shamir-split unseal keys go to separate custodians, a one-time root token creates the admin policies and is then revoked, and admin actions are audited.
  • External Secrets refresh interval โ€” too short hammers the backend's API limits; too long delays rotation. 1h is a reasonable default.

Interview Q&A โ€‹

  1. How do you get a database password into a Kubernetes pod securely? โ€” Store the secret in a backend (Vault / AWS Secrets Manager). Use External Secrets Operator to materialize it as a K8s Secret (or CSI Secrets Store to skip the K8s Secret object). Pod mounts the K8s Secret as a file (not env var). App reloads on file change. Rotate the backend secret โ€” the K8s Secret updates on next sync; app picks up the new value.

  2. Static creds vs workload identity โ€” trade-off? โ€” Static creds are simple (one env var, a password) but require rotation, leak easily (logs, dumps, commits), and are usually long-lived. Workload identity (IRSA, Workload Identity, Pod Identity) trades complexity (setup IAM role trust + SA annotations) for zero static secrets to rotate โ€” the cloud SDK handles short-lived credentials automatically. Always prefer workload identity for cloud access.

  3. What are dynamic secrets and when use them? โ€” Vault generates a unique ephemeral credential on demand (DB user, SSH cert, AWS STS token) that expires automatically. Use for: DB access (no shared prod password), SSH to servers (cert-based, per-session), cloud access via Vault. Trade-off: requires Vault integration in the app or a sidecar.

  4. Why not just put secrets in Kubernetes Secrets? โ€” By default, K8s Secrets are base64 in etcd (not encrypted). Anyone with get secrets RBAC in the namespace can read them. Mitigations: enable etcd encryption-at-rest (KMS-backed), limit RBAC, audit access, rotate regularly. Or bypass with CSI Secrets Store (mount from Vault without creating a K8s Secret). External Secrets Operator is the middle ground โ€” still creates K8s Secret, but the source of truth is outside.
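On self-managed clusters, etcd encryption-at-rest is an API-server EncryptionConfiguration (referenced via the --encryption-provider-config flag); on managed clusters (EKS etc.) you'd enable the provider's KMS integration instead. A sketch, with a placeholder key:

```yaml
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
  - resources: [secrets]
    providers:
      # First provider is used for writes; aescbc here, KMS in real setups.
      - aescbc:
          keys:
            - name: key1
              secret: <base64-encoded-32-byte-key>   # placeholder
      # identity fallback keeps pre-existing plaintext secrets readable.
      - identity: {}
```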

  5. Rotate a leaked secret โ€” what's your playbook? โ€” Declare an incident. Rotate the secret immediately in the backend. Force downstream consumers to pick up new value (rolling restart if they cache, hot reload if they watch a file). Audit logs for use of the old secret (when was it last used? By whom?). Determine blast radius (what does the secret grant?). Revoke any sessions/tokens it minted. Postmortem: how did it leak, how do we prevent repeat.

  6. SOPS vs Sealed Secrets vs External Secrets — pick one for GitOps. — External Secrets is the dominant pattern: source of truth stays in Vault/AWS Secrets Manager, the operator syncs values. Sealed Secrets encrypts the secret into the manifest — fine for small teams, but key rotation and cluster migration are painful. SOPS is great when you want to keep encrypted values in Git with access control via KMS/age keys; compatible with ArgoCD via a plugin. For most production GitOps: External Secrets.

Further reading โ€‹


39. Container & Kubernetes Security โ€‹

Why this matters โ€‹

Containers and K8s have a large attack surface — image supply chain, privilege escalation, network lateral movement, runtime anomalies. "How do you harden a K8s cluster?" is a near-certain question for anyone touching orchestration.

Core concepts โ€‹

Image scanning. Part of CI (see §36) + periodic scans of running images. Tools: Trivy, Grype, Snyk, Harbor's built-in.

Image policies at admission.

  • Require signed images (Cosign + policy-controller or Kyverno).
  • Reject known-vulnerable images (Trivy admission webhook).
  • Reject :latest tags.
  • Require images from trusted registries.
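One way to enforce the :latest rule at admission is a Kyverno ClusterPolicy; a sketch modeled on Kyverno's published disallow-latest-tag policy:

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: disallow-latest-tag
spec:
  validationFailureAction: Enforce   # reject, don't just audit
  rules:
    - name: require-pinned-tag
      match:
        any:
          - resources:
              kinds: [Pod]
      validate:
        message: "Images must use a pinned tag, not :latest."
        pattern:
          spec:
            containers:
              - image: "!*:latest"
```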

Runtime container security.

  • runAsNonRoot: true โ€” refuse to run as UID 0.
  • runAsUser: 1000 โ€” specific non-root UID.
  • readOnlyRootFilesystem: true โ€” forces writable paths to be explicit mounts.
  • allowPrivilegeEscalation: false โ€” blocks setuid binaries from gaining more privs.
  • capabilities: { drop: [ALL], add: [NET_BIND_SERVICE] } โ€” drop everything; add only what's needed.
  • seccompProfile: { type: RuntimeDefault } โ€” block uncommon syscalls.
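Combined into a (hypothetical) hardened pod spec, these fields look like:

```yaml
apiVersion: v1
kind: Pod
metadata: { name: hardened, namespace: prod }
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
    seccompProfile: { type: RuntimeDefault }
  containers:
    - name: app
      image: registry.internal/app:1.2.3   # illustrative image
      securityContext:
        readOnlyRootFilesystem: true
        allowPrivilegeEscalation: false
        capabilities:
          drop: [ALL]
          add: [NET_BIND_SERVICE]
      volumeMounts:
        - { name: tmp, mountPath: /tmp }   # explicit writable path
  volumes:
    - { name: tmp, emptyDir: {} }
```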

Pod Security Standards (PSA). Apply labels to namespaces:

yaml
# Highest enforcement
pod-security.kubernetes.io/enforce: restricted
pod-security.kubernetes.io/audit: restricted
pod-security.kubernetes.io/warn: restricted

Three profiles: privileged (anything goes), baseline (minimally restrictive; blocks known privilege escalations like privileged containers and hostNetwork), restricted (locked down: must run as non-root, no hostPath, capabilities dropped, seccomp enforced).

NetworkPolicies. Default-deny namespace model:

yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata: { name: deny-all, namespace: prod }
spec:
  podSelector: {}
  policyTypes: [Ingress, Egress]

Then allow specific flows. Requires a CNI that enforces NetworkPolicy (Calico, Cilium; AWS VPC CNI needs the NP agent add-on).
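A follow-up allow policy on top of the default deny, assuming illustrative app=frontend / app=api labels and port:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata: { name: allow-frontend-to-api, namespace: prod }
spec:
  podSelector: { matchLabels: { app: api } }
  policyTypes: [Ingress]
  ingress:
    - from:
        - podSelector: { matchLabels: { app: frontend } }
      ports:
        - protocol: TCP
          port: 8080
```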

Service mesh for mTLS. Istio / Linkerd adds sidecar proxies that enforce mutual TLS between pods โ€” encryption + identity-based authorization at L7. Can enforce policy ("service A may call service B's /read but not /write").

Runtime security (anomaly detection).

  • Falco (CNCF) โ€” eBPF/kernel-syscall-based detection of anomalous behavior (shell in container, crypto mining, sensitive file read). Alerts to Slack/PagerDuty.
  • Tetragon (Cilium/Isovalent) โ€” eBPF-native, newer.
  • Sysdig Secure โ€” commercial; richer UI + response workflows.

Admission policy engines (revisit §17):

  • Kyverno / OPA Gatekeeper enforce: no privileged pods, required labels, approved images, etc.

CIS Kubernetes Benchmark. Canonical hardening checklist. Scan with kube-bench. Cover: API server flags, etcd access, kubelet config, RBAC minimums.

Cluster surface reduction.

  • Disable anonymous auth on kubelet.
  • Restrict kube-system access via RBAC audit.
  • Private API server endpoint (no public internet).
  • Short-lived node bootstrap tokens.
  • Audit logging enabled + shipped to SIEM.

Interview Q&A โ€‹

  1. How do you harden a Kubernetes cluster? โ€” PSA restricted on workload namespaces. NetworkPolicies (default deny + explicit allows). Admission policies: signed images only, forbid privileged, required resource limits. RBAC least-privilege; audit ClusterRoleBindings. Enable etcd encryption at rest. Private API endpoint. mTLS via service mesh. Audit log to SIEM. Run kube-bench on a cadence; fix findings. Runtime detection (Falco) for anomalies.

  2. What does runAsNonRoot: true actually do? โ€” Tells the kubelet to refuse to start the container if the image's effective UID is 0. Belt-and-suspenders with runAsUser: 1000 (specific UID). Defense against images that were accidentally built as root and slipped past review.

  3. Explain NetworkPolicies with a concrete example. โ€” Namespace starts default-open. A NetworkPolicy that selects pods app=api with ingress: - from: [podSelector: {app=frontend}] means only pods labeled app=frontend can reach those API pods on specified ports. Once any NP selects a pod, traffic not allowed is denied for that pod. Requires a CNI that enforces (Calico, Cilium).

  4. Falco โ€” what's it catching? โ€” Syscall anomalies mapped to rules: unexpected shell in container, sensitive file read (/etc/shadow), network connection to a suspicious IP, crypto-miner patterns, privilege escalation. eBPF-based so low overhead. Alerts to your incident pipeline. Good at "known-bad behavior"; not a replacement for preventive controls.

  5. mTLS in Istio โ€” how does it work? โ€” Each pod gets a sidecar (Envoy) injected by the mesh. Istiod (control plane) issues each sidecar a short-lived cert tied to the pod's SPIFFE ID (derived from SA). Two pods in the mesh automatically establish mutual TLS using those certs; AuthorizationPolicies layer RBAC on top. App code is unchanged โ€” encryption and identity are transparent sidecar concerns.

  6. CIS benchmark findings — how do you prioritize? — Severity-based: "privileged container allowed" and "etcd unencrypted" are blocker-level; "anonymous-auth enabled on kubelet" is high; config warnings are informational. Fix in tiers: blocker → high → medium → cosmetic. Track remediation in a compliance dashboard.

Further reading โ€‹


40. Identity, Access & Zero Trust โ€‹

Why this matters โ€‹

Modern auth is OAuth2 + OIDC + mTLS + workload identity. Interviewers probe this to distinguish candidates who've wired real SSO vs candidates who've only read about it.

Core concepts โ€‹

OAuth2 flows.

  • Authorization Code + PKCE โ€” for interactive apps (browsers, mobile). PKCE protects against code-interception on public clients.
  • Client Credentials โ€” service-to-service; client_id + client_secret (or cert).
  • Device Code โ€” for devices without a browser (CLI tools, TVs).
  • Implicit โ€” deprecated, don't use.
  • Resource Owner Password Credentials (ROPC) โ€” deprecated.
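The PKCE mechanics (RFC 7636, S256 method) fit in a few lines; a sketch of what the client generates and what the token endpoint checks at code exchange:

```python
import base64
import hashlib
import secrets

def make_pkce_pair() -> tuple[str, str]:
    # RFC 7636: verifier is a high-entropy random string;
    # challenge is base64url(SHA-256(verifier)) without padding.
    verifier = base64.urlsafe_b64encode(secrets.token_bytes(32)).rstrip(b"=").decode()
    digest = hashlib.sha256(verifier.encode("ascii")).digest()
    challenge = base64.urlsafe_b64encode(digest).rstrip(b"=").decode()
    return verifier, challenge

def server_check(code_verifier: str, stored_challenge: str) -> bool:
    # What the token endpoint does: rehash the presented verifier and
    # compare against the challenge stored at the /authorize step.
    digest = hashlib.sha256(code_verifier.encode("ascii")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode() == stored_challenge

verifier, challenge = make_pkce_pair()
assert server_check(verifier, challenge)            # legit client succeeds
assert not server_check("stolen-code-no-verifier", challenge)  # interceptor fails
```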

OIDC (OpenID Connect). Identity layer on top of OAuth2. Adds an ID Token (JWT) with user identity claims. Providers: Auth0, Okta, Keycloak, Azure Entra, Google, AWS Cognito.

JWT pitfalls.

  • alg: none โ€” some libs used to accept unsigned JWTs (verify the signing algorithm).
  • Key confusion (HS256 vs RS256) โ€” attacker signs with the public key as HMAC secret.
  • Long expirations โ€” use short (5โ€“15 min) access tokens + refresh tokens.
  • No revocation โ€” if you need revocation, maintain a denylist or use reference tokens + introspection.
  • Storing JWTs in localStorage โ€” XSS-exposed. Prefer httpOnly cookies.
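A toy illustration of the alg: none pitfall and its fix. The JWT handling is hand-rolled for demonstration only; in practice use a vetted library with the algorithm pinned:

```python
import base64
import hashlib
import hmac
import json

def b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def b64url_dec(s: str) -> bytes:
    return base64.urlsafe_b64decode(s + "=" * (-len(s) % 4))

secret = b"server-hmac-key"  # illustrative key
header = b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
payload = b64url(json.dumps({"sub": "alice"}).encode())
sig = b64url(hmac.new(secret, f"{header}.{payload}".encode(), hashlib.sha256).digest())
token = f"{header}.{payload}.{sig}"

# Forgery: attacker declares alg=none and strips the signature.
forged = f'{b64url(json.dumps({"alg": "none"}).encode())}.{payload}.'

def verify_naive(tok):
    # BAD: trusts the algorithm named in the attacker-controlled header.
    h, p, s = tok.split(".")
    if json.loads(b64url_dec(h))["alg"] == "none":
        return json.loads(b64url_dec(p))  # accepts the forgery
    expected = b64url(hmac.new(secret, f"{h}.{p}".encode(), hashlib.sha256).digest())
    return json.loads(b64url_dec(p)) if hmac.compare_digest(s, expected) else None

def verify_pinned(tok):
    # GOOD: the server pins HS256; the header's alg claim is ignored.
    h, p, s = tok.split(".")
    expected = b64url(hmac.new(secret, f"{h}.{p}".encode(), hashlib.sha256).digest())
    return json.loads(b64url_dec(p)) if hmac.compare_digest(s, expected) else None

assert verify_naive(forged) is not None   # vulnerable verifier accepts it
assert verify_pinned(forged) is None      # pinned verifier rejects it
assert verify_pinned(token) == {"sub": "alice"}
```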

mTLS. Mutual TLS โ€” both client and server present certs. Client cert proves client identity. In service mesh, automatic. Outside: manage via PKI (Vault PKI engine, internal CA).

SAML. Older XML-based SSO. Still prevalent in enterprise. More cumbersome than OIDC but often required for integration with legacy IdPs.

SPIFFE / SPIRE. Standard for workload identity across clusters/clouds. spiffe://<trust-domain>/<workload> identifier. SPIRE is the reference implementation. Foundation for many service-mesh mTLS implementations.

SSO for tools. Wire everything to one IdP (Okta/Entra/Google): ArgoCD, Grafana, Vault, kubectl (OIDC), cloud consoles (AWS IAM Identity Center, Azure AD app registrations). One source of identity; offboarding is a single click.

Short-lived tokens. Access tokens 5โ€“15 min; refresh tokens hours/days; session tokens bound to device fingerprint. Shorter windows = smaller blast radius on theft.

Interview Q&A โ€‹

  1. OAuth2 Authorization Code + PKCE โ€” walk through it. โ€” Client redirects browser to auth server with code_challenge (hash of random verifier). User logs in; auth server redirects back to client with a code. Client POSTs code + code_verifier to token endpoint; server verifies hash matches, issues access + ID + refresh tokens. PKCE prevents an attacker intercepting the code from exchanging it (they lack the verifier).

  2. OAuth2 vs OIDC? โ€” OAuth2 is for authorization โ€” "can this app act on behalf of this user for these scopes." OIDC builds on OAuth2 and adds authentication via an ID Token (JWT) containing user identity claims. OIDC tells you WHO the user is; OAuth2 tells you WHAT they authorized.

  3. JWT alg: none attack โ€” what is it and how prevent? โ€” A library trusts the alg field in the JWT header. An attacker sets alg: none, removes the signature, and submits a forged token. Server "validates" using the declared algorithm โ€” which is none โ€” and accepts it. Prevention: pin the expected algorithm server-side; don't ask the JWT what its own algorithm is.

  4. How do services authenticate to each other in a mesh? โ€” Automatic mTLS via sidecar proxies. Each pod gets a certificate issued by the mesh control plane, bound to the pod's ServiceAccount. The sidecar terminates TLS on inbound, initiates mTLS on outbound, and proves identity on both ends. Mesh authorization policies layer on top (service A may call service B /read, not /write).

  5. SSO to kubectl — how? — Configure the API server with OIDC issuer URL + client ID. Users authenticate via OIDC (browser flow), get a JWT ID token, and kubeconfig uses the oidc auth provider to exchange refresh → access token. API server validates against the issuer. RBAC is based on JWT claims (email, groups). Tools like kubelogin / kubectl-oidc-login handle the browser dance.
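A kubeconfig user entry using the kubelogin plugin; the issuer URL and client ID below are placeholders:

```yaml
users:
  - name: oidc-user
    user:
      exec:
        apiVersion: client.authentication.k8s.io/v1beta1
        command: kubectl
        args:
          - oidc-login
          - get-token
          - --oidc-issuer-url=https://idp.example.com   # placeholder IdP
          - --oidc-client-id=kubernetes                 # placeholder client
```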

  6. Principle of least privilege โ€” apply it in K8s. โ€” One ServiceAccount per workload; never default. Role/RoleBinding with specific verbs on specific resources (no wildcards). automountServiceAccountToken: false unless the pod calls the K8s API. Namespace-scoped where possible. Periodic audits: kubectl auth can-i --list. For cloud access, IRSA/Workload Identity with scoped IAM policies.

Further reading โ€‹


41. Compliance & Regulated Environments โ€‹

Why this matters โ€‹

Federal / regulated environments run under CJIS, FedRAMP, FISMA, NIST 800-53. Interviewers in that space probe specifics: "what does CJIS require in your CI pipeline?" and expect you to know the framework names, how they relate, and what they concretely demand of a build + deploy flow.

Core concepts โ€‹

Frameworks to know:

  • NIST 800-53 โ€” baseline security controls for US federal systems. Low/Moderate/High baselines.
  • NIST 800-171 โ€” requirements for contractors handling Controlled Unclassified Information (CUI).
  • FedRAMP โ€” authorization program for cloud services used by US federal agencies. Low/Moderate/High. AWS, Azure, GCP all have FedRAMP offerings.
  • FISMA โ€” the law; FedRAMP is one implementation.
  • CJIS โ€” FBI-administered; covers law-enforcement data. Audit logging, encryption in transit + at rest, MFA, access controls, personnel screening.
  • HIPAA โ€” healthcare. PHI protection.
  • PCI-DSS โ€” payment card data.
  • SOC 2 โ€” commercial compliance audit (Type 1: point in time; Type 2: ongoing).
  • ISO 27001 โ€” international infosec management standard.
  • CIS Benchmarks โ€” configuration baselines; not a legal framework but widely adopted.
  • STIG (DISA) โ€” DoD hardening standards.

FIPS 140-2 / 140-3. Cryptographic module validation. Federal systems must use FIPS-validated crypto libraries. The implication: you can't just pull in any JDK / OpenSSL build; use vendor FIPS-validated builds.

Air-gapped clusters. No public internet. Implications: mirror all registries (images, Helm charts, OS packages) internally; bundle CI with the cluster; patching requires explicit staging + transfer.
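A sketch of the mirroring step, assuming skopeo and Helm's OCI support; the internal registry name, image, and chart versions are illustrative. Run on a connected staging host, then transfer per the enclave's process:

```bash
# Mirror a container image (all architectures) into the internal registry
skopeo copy --all \
  docker://docker.io/library/nginx:1.27 \
  docker://registry.internal:5000/mirror/library/nginx:1.27

# Mirror a Helm chart: pull from the upstream OCI repo, push internally
helm pull oci://registry-1.docker.io/bitnamicharts/redis --version 20.0.0
helm push redis-20.0.0.tgz oci://registry.internal:5000/charts
```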

STIG hardening. Prescriptive configs for OS, Kubernetes, databases. Tools: OpenSCAP / oscap, AWS Inspector, vendor STIG viewers. Often scored against DISA's STIG viewer checklists.

CJIS specifics:

  • Audit logs of every user action; tamper-evident storage.
  • MFA for all access.
  • Encryption in transit (TLS 1.2+) and at rest.
  • Background checks for personnel with access.
  • Physical security of servers/media.
  • Immediate incident reporting.

Compliance in CI/CD:

  • SBOM + provenance mandatory (EO 14028).
  • Audit trail of every deploy (GitOps + signed commits).
  • Scanning results archived per release.
  • Separation of duties (developer doesn't push to prod directly; CI does, approved PR).
  • Evidence generation automated (screenshots, configs, logs for auditors).

Drift + configuration management. Auditors ask "is the system still in compliance?" โ€” continuous scanning (Inspector, Config, Checkov, kube-bench) answers automatically.

Interview Q&A โ€‹

  1. What's the difference between FISMA, FedRAMP, and NIST 800-53? โ€” NIST 800-53 is the control catalog โ€” a big list of security controls for federal systems, with Low/Moderate/High baselines. FISMA is the law that says federal systems must implement appropriate controls. FedRAMP is the specific program authorizing cloud services for federal use by applying 800-53 with additional cloud-specific requirements. 800-53 is what; FISMA is law; FedRAMP is the implementation path for cloud.

  2. Working in a CJIS environment โ€” what's different? โ€” Everything's logged with tamper-evident audit storage; MFA for every access; encryption in transit + at rest with FIPS-validated crypto; personnel background-checked; physical security of hosts; strict break-glass procedures; incident reporting is hours, not days. Developer velocity is slower โ€” every change has an audit trail; you design for auditability upfront.

  3. FIPS 140-2 โ€” what does it mean for my build? โ€” You must use crypto modules that have been FIPS-validated. On JVM: use an approved FIPS JVM distribution (OpenJDK with BouncyCastle FIPS, Red Hat's FIPS-certified OpenJDK). On Node/Python: vendor FIPS-enabled OpenSSL. In container images: use a FIPS-mode base (UBI FIPS). Your pipeline tests must include FIPS-mode runs.

  4. Air-gapped K8s deployment โ€” what's different from regular? โ€” No public internet. You mirror every registry (Docker Hub, Quay, Helm repos, OS packages, Maven Central) into an internal registry. CI runners run inside the enclave. Updates require staging + transfer to the air-gapped side (often via one-way diode + hash verification). Troubleshooting is harder โ€” no Googling from the jumphost.

  5. How do you balance security and developer velocity in a regulated environment? โ€” Automate the security gates so they don't add human-days per deploy (SAST/SCA in PR, not after). Platform team builds "paved road" templates that are compliant by default (golden image, golden chart, pre-scanned deps). Frequent audit-ready deploy is faster than slow manually-reviewed deploy. Invest in evidence generation โ€” auto-export scan results per release so audits don't become engineer time sinks.

  6. Tell me about an audit you went through. โ€” "We had a CJIS re-certification. Biggest prep was proving every prod access was auditable: CloudTrail + app-level audit โ†’ SIEM, with 7-year retention on an immutable bucket. Evidence for encryption-at-rest: KMS key policy + ConfigService audit of encrypted volumes. MFA: Okta enforcement logs. No findings on the infra side; one finding on an unused legacy bastion we hadn't retired โ€” fixed and re-verified within the audit window."

Further reading โ€‹


42. Incident Response for Security Events โ€‹

Why this matters โ€‹

Security incidents need a different playbook than ops incidents. Evidence preservation, legal, PR, and communication all enter the picture. Interviewers may ask about specific past CVEs (Log4Shell, SolarWinds, xz, axios) to see if you track the field.

Core concepts โ€‹

NIST incident response lifecycle.

  1. Preparation โ€” tools, playbooks, contacts, training.
  2. Detection & Analysis โ€” alerts, triage, scope.
  3. Containment โ€” stop the bleeding; isolate.
  4. Eradication โ€” remove the attacker, patch the vuln.
  5. Recovery โ€” restore to good state; verify.
  6. Post-incident โ€” lessons learned, improve preparation.

Containment strategies.

  • Short-term โ€” isolate the compromised host (network ACL), disable the compromised credential, block the C2 IP at edge.
  • Long-term โ€” patch, rotate, rebuild.

Evidence preservation. Before rebuilding, capture: memory dump (volatility), disk snapshot, logs. If it goes legal, volatile evidence disappears fast.

Blast radius triage. What did this attacker have access to? Which services used that credential? What data could have been exfiltrated? Err on the side of over-scoping the response.

Communication.

  • Internal: security team, eng leadership, legal, PR, affected teams.
  • External (if applicable): regulators (GDPR 72h, HIPAA 60d, state laws), customers, public disclosure.
  • Status page for customer-visible incidents.

Famous CVEs to know:

  • Log4Shell (CVE-2021-44228) — JNDI lookup in log messages; trivial RCE. Mitigation: upgrade to log4j 2.17+; setting log4j2.formatMsgNoLookups=true (2.10+) or removing the JndiLookup class were stopgaps, and the flag was later shown incomplete (CVE-2021-45046).
  • Spring4Shell (CVE-2022-22965) โ€” Spring Framework RCE via data binding; mitigated by upgrade.
  • OpenSSL Heartbleed (CVE-2014-0160) โ€” read memory past buffer boundaries; leaked keys.
  • Shellshock (CVE-2014-6271) โ€” Bash env-var RCE.
  • ProxyLogon (CVE-2021-26855) โ€” Exchange server RCE.
  • xz backdoor (CVE-2024-3094) โ€” SSH auth bypass; multi-year social engineering.
  • axios 1.14.1 (2026) โ€” malicious transitive dropping Python RAT.

Tabletop exercises. Simulate an incident quarterly: "your CI runner just got popped." Walk through detection → containment → recovery → post. Uncovers gaps in runbooks and contacts before a real event.

Interview Q&A โ€‹

  1. Walk through how you responded to Log4Shell (or a similar high-profile CVE). โ€” (1) Detection via scanners + advisories; we saw the CVE at [T0]. (2) Rapid scoping: all Java services using log4j; ran SCA across all repos within 6h. (3) Short-term mitigation: set log4j2.formatMsgNoLookups=true via env var where possible; firewalled known exploit IPs at edge. (4) Long-term: upgraded log4j in every affected service; rebuilt and deployed. (5) Verification: confirmed no active exploitation in logs (LDAP egress); hunted for JNDI patterns. (6) Post: added a "surprise CVE" tabletop; improved SCA in pipeline.

  2. How do you preserve evidence when containing an incident? โ€” Before terminating an instance or rebuilding: snapshot the disk, capture memory (if feasible), archive logs off the host. Quarantine the original (isolate via SG, don't delete). If law enforcement or legal may get involved, maintain chain of custody โ€” who handled it, when, with what tool. Don't run investigative tools that modify state.

  3. A compromised ServiceAccount token was found in public GitHub โ€” what's your response? โ€” Rotate immediately (force token rotation for the SA). Audit kube audit and cloud audit logs (CloudTrail / equivalent) for any use of that token โ€” timestamps, source IPs, actions. Contain: revoke the SA's permissions temporarily if you can't trust blast radius. Rebuild anything that depended on it. Incident report. Post: add secret scanning pre-commit + pre-push; alert on GitHub secret-scanning findings.

  4. xz backdoor (CVE-2024-3094) โ€” what made it special? โ€” Multi-year social engineering: the attacker built maintainer trust over years, then introduced an SSH auth backdoor in a widely-used compression library that was indirectly used by sshd in some distros. Shows: supply-chain attacks can bypass all technical controls if the attacker becomes a trusted contributor. Defense: more reviewers per commit on critical libs, signed commits, funding under-resourced maintainers, SBOM + provenance for forensics.

  5. How often do you run tabletop exercises? โ€” Quarterly minimum for security-sensitive orgs. Rotate scenarios: compromised CI runner, leaked AWS credential, CVE in a core dep, insider threat, supply-chain dep compromise. Include non-eng: legal, PR, CEO for major scenarios. Debrief: what would've gone wrong if real; what to fix in runbooks.

Further reading โ€‹


Part VIII โ€” Advanced & Emerging โ€‹

43. Service Mesh โ€‹

Why this matters โ€‹

Service meshes add capability that's hard to replicate in app code โ€” mTLS everywhere, traffic shifting, resilience policies, deep observability. They also add complexity and latency. Interviewers want to know you've thought about whether you actually need one.

Core concepts โ€‹

Why a mesh.

  • mTLS automatic between services.
  • Retries / circuit breakers / timeouts declarative in config, not library code.
  • Traffic shifting (canary, blue/green, mirror) at the L7 layer.
  • Authorization at L7 (service A may call service B's /read).
  • Observability โ€” every hop generates a span + metrics automatically.

The options.

  • Istio โ€” the heavyweight; control plane is istiod, data plane is Envoy sidecars. Rich feature set; steeper learning curve; operational overhead. The default for large enterprise cloud-native.
  • Linkerd — the lightweight option; data plane is a Rust micro-proxy (linkerd2-proxy). Lighter, simpler, fast. Smaller feature set; fewer knobs. Often beloved by SREs for operational clarity.
  • Consul Connect โ€” HashiCorp; integrates with Consul's broader service discovery + secrets.
  • AWS App Mesh — managed, Envoy-based. Deprecated (AWS has announced end of support); AWS points teams toward ECS Service Connect or self-managed Istio on EKS.
  • Cilium Service Mesh โ€” no sidecars; uses eBPF at the node. Emerging alternative.

Sidecar vs sidecar-less (ambient mesh). Classic mesh injects a proxy per pod โ€” each pod has an Envoy sidecar. Ambient (Istio Ambient, Cilium eBPF) runs the L4 proxy per node (shared) and L7 proxy per namespace (optional). Trade-offs: less resource overhead, simpler pod spec; less tenant isolation, newer.

Key Istio objects.

  • VirtualService โ€” L7 routing rules (path, header, weight).
  • DestinationRule โ€” policy per service (subset selection, connection pool, retries).
  • Gateway โ€” ingress/egress configuration for the mesh.
  • AuthorizationPolicy โ€” L7 RBAC between services.
  • PeerAuthentication โ€” mTLS policy (STRICT, PERMISSIVE, DISABLED).
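A sketch of a 90/10 canary split using VirtualService + DestinationRule; the service name "api" and version labels are illustrative:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata: { name: api, namespace: prod }
spec:
  hosts: [api]
  http:
    - route:
        - destination: { host: api, subset: v1 }
          weight: 90
        - destination: { host: api, subset: v2 }
          weight: 10
---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata: { name: api, namespace: prod }
spec:
  host: api
  subsets:   # subsets map to pod version labels
    - name: v1
      labels: { version: v1 }
    - name: v2
      labels: { version: v2 }
```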

Latency overhead. ~1โ€“5ms per hop per sidecar. Chained 10 services = 10โ€“50ms added. Factor into SLO budget.

When NOT to use a mesh.

  • Monolith or very few services.
  • Heavy L4 workloads (high-QPS databases) โ€” sidecar overhead dominates.
  • Team without ops bandwidth for mesh day-2.
  • When a simpler alternative (ingress controller + mTLS via cert-manager) meets your needs.

Interview Q&A โ€‹

  1. What problems does a service mesh solve? โ€” Cross-cutting L7 concerns: mTLS, retries, circuit breakers, traffic shifting, L7 authorization, observability. Removes them from app code into a platform concern. If you have many microservices, encrypted-by-default communication, and need traffic shifting for canary, a mesh pays back its complexity.

  2. Istio vs Linkerd โ€” pick one. โ€” Istio for rich features (Gateway API, advanced traffic shifting, ambient mode, external service integration, large community). Linkerd for operational simplicity (Rust-based, fewer knobs, fast install, minimal CPU overhead). Many teams start with Linkerd and stay; enterprises often land on Istio for the feature depth.

  3. mTLS in a mesh โ€” how does it work? โ€” Control plane (istiod) is a CA; issues each pod's sidecar a short-lived cert tied to its SPIFFE ID (derived from SA). Sidecar terminates TLS on inbound, initiates mTLS on outbound. App code is unmodified. Mesh rotates certs automatically (~24h typical TTL).

  4. Sidecar mesh vs ambient mesh? — Sidecar: one proxy per pod — strong tenant isolation, proven model, high resource overhead (an Envoy per pod). Ambient: node-level L4 proxy + per-namespace L7 proxy — much less overhead, simpler pod spec, newer so fewer production proof points. Ambient is the future for most use cases; sidecar still wins for strict tenant isolation.

  5. Service mesh latency cost โ€” how do you factor it? โ€” Each sidecar adds 1โ€“5ms per hop. A 10-service chain adds 10โ€“50ms. Measure with load tests before committing; budget into SLOs. For sub-10ms SLOs or high-QPS critical paths, consider ambient mode or bypassing the mesh for specific services.

Further reading โ€‹


44. Platform Engineering & Developer Experience โ€‹

Why this matters โ€‹

In 2026, Platform Engineering has become the dominant evolution of DevOps for orgs past ~10 engineering teams. Gartner predicts 80% of large engineering orgs will have a platform team by 2026; Backstage holds ~89% of the IDP market share. Interviewers โ€” especially at Fortune 500 or scale-up stage โ€” ask about IDPs, golden paths, and self-service.

Core concepts โ€‹

Platform Engineering. Build an internal product (the Platform / IDP) that product teams consume. The platform team treats its developers as users, with product-management discipline: roadmap, UX, feedback loops, NPS. The team shapes follow Team Topologies: Platform Team (provides services), Stream-Aligned Team (ships product features), Enabling Team (helps teams adopt new capabilities, temporarily), Complicated-Subsystem Team (specialized expertise).

Internal Developer Platform (IDP). The internal product โ€” a self-service layer over infra, CI/CD, observability, security:

  • "Create a new microservice in one click" (scaffold repo, CI, deploy pipeline, observability, secrets).
  • "Deploy to dev/staging/prod" (GitOps + approval gates).
  • "See logs, metrics, traces, on-call, runbooks for my service" (single pane).
  • "Request a new DB / cache / S3 bucket" (self-service provisioning).

Backstage (Spotify, CNCF). The de-facto open-source IDP framework.

  • Software Catalog โ€” registry of services, libraries, teams, APIs, resources, with owners and docs.
  • TechDocs โ€” docs-as-code rendered from Markdown in each repo; browsable in Backstage.
  • Scaffolder templates โ€” "create new service" wizards that provision repo + CI + initial code.
  • Plugin ecosystem โ€” hundreds of plugins: Kubernetes, ArgoCD, GitHub, PagerDuty, Sonar, Datadog, etc.
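
The catalog is fed by a descriptor file committed to each repo. A minimal sketch (the component, owner, and repo names are illustrative; the field names follow Backstage's Component descriptor format):

```yaml
# catalog-info.yaml, committed at the root of the service repo
apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: payments-api                 # illustrative service name
  description: Payment processing service
  annotations:
    github.com/project-slug: acme/payments-api   # hypothetical org/repo
    backstage.io/techdocs-ref: dir:.             # TechDocs rendered from this repo
spec:
  type: service
  lifecycle: production
  owner: team-payments               # illustrative owning team
  system: checkout
```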

Alternatives.

  • Port โ€” SaaS IDP; faster setup than Backstage; less flexible.
  • Roadie โ€” managed Backstage.
  • Humanitec โ€” platform orchestrator, "app-centric" model.
  • Kratix โ€” Kubernetes-native; define "promises" (offerings); teams claim them.
  • Self-built โ€” mega-corp pattern; starts as a wiki + ticket form and evolves.

Golden paths / Paved roads. Opinionated defaults for common patterns: "the way we build a Spring Boot service is: use this Helm library chart, this CI template, this observability stack, this secrets pattern." Devs can deviate but opt-in to extra work.

DORA + DX measurement. Platform teams measure success with DORA metrics (delivery), SPACE (dev experience), plus:

  • Time-to-first-PR for new devs.
  • Time-to-first-production-deploy for new services.
  • NPS of the platform.
  • % of devs using the platform's golden path (vs. rolling their own).

Spotify reported a 55% drop in time-to-tenth-PR after deploying Backstage. Watch for the green-DORA-plus-red-NPS combination: you're delivering fast but the platform feels bad to use — fix the UX.

AI integration. 2026 pattern: AI coding tools (Claude Code, Copilot) + IDPs with guardrails. Platform teams define which repo templates, which approval policies, which AI models are approved per data-sensitivity class. Google's DORA 2026 survey: ~90% of devs use AI coding assistants daily.

Interview Q&A โ€‹

  1. What is Platform Engineering and how does it differ from DevOps? โ€” DevOps says "every team owns their own ops." Platform Engineering says "at scale, a specialized team builds a self-service platform that other teams consume โ€” treat your developers as customers." PE is a way to scale DevOps past the handful-of-teams point where "every team builds its own pipeline" stops working.

  2. What's an Internal Developer Platform? โ€” An internal product that abstracts CI/CD, deploy, observability, security into self-service workflows. Devs click "create new microservice" and get a repo, pipeline, deployment, observability, and a runbook โ€” all paved-road-compliant. Backstage is the dominant framework; Port and Humanitec are SaaS alternatives.

  3. Backstage โ€” what does it actually give you? โ€” Software Catalog (who owns what, what depends on what), TechDocs (docs rendered from each repo's Markdown), Scaffolder (service-creation wizards), and a plugin ecosystem for every tool in your stack (K8s views, ArgoCD sync status, on-call rotation, CI status, vulnerability dashboard). It's the portal that unifies a fragmented tooling surface.

  4. How do you measure platform success? โ€” DORA metrics first: deployment frequency + lead time + change failure rate + MTTR across all teams. DX metrics: time-to-first-deploy for new services, time-to-first-PR for new devs, platform NPS. % of services using golden path (vs rolling their own). Watch for the "green DORA, red NPS" trap โ€” fast delivery but unhappy devs means the platform feels bad.

  5. Platform team anti-patterns? โ€” (1) Building a platform nobody uses โ€” no user research, no product management. (2) Mandating adoption without a compelling value prop โ€” devs route around you. (3) "We'll build it, they'll come" โ€” need early design partners and close feedback loops. (4) Scaling before product-market fit โ€” 10 engineers on a platform 3 teams use is waste.

  6. How do you integrate AI tools into a platform? โ€” Approve specific AI models per data-sensitivity class (public / internal / restricted). Provide AI-aware paved-road templates (Copilot / Claude Code preconfigured). Gate AI-generated PRs through additional review or tests. Track AI usage + outcomes via DORA โ€” does AI-heavy code have a different failure rate? The IDP is the natural place to enforce guardrails since it's the consumption surface.

Further reading โ€‹


45. GitOps-Native Infrastructure Management โ€‹

Why this matters โ€‹

"Use Terraform for cloud, K8s for workloads" was the pattern for years — but splitting IaC from app deployment means two reconciliation engines, two drift states, two audit trails. Crossplane, Pulumi Operator, and Terraform Controller let you manage cloud resources as K8s resources — one GitOps story for everything.

Core concepts โ€‹

Crossplane. Brings cloud resources into K8s via CRDs. Install a provider (AWS, Azure, GCP) and Crossplane exposes that cloud's resources as K8s kinds: applying a kind: Bucket manifest against the AWS provider creates an S3 bucket.

Core objects:

  • Provider โ€” installed package for a specific cloud.
  • Managed Resource (MR) โ€” a CR that maps 1:1 to a cloud resource (e.g., Bucket, Database).
  • Composition โ€” define your own higher-level resource by composing MRs.
  • Claim (XR claim) โ€” user-facing resource that references a composition; abstracts cloud specifics.

Example: platform team defines a PostgreSQLInstance claim; dev creates one; Crossplane provisions RDS (AWS), Cloud SQL (GCP), or Azure Database depending on the selected composition. Multi-cloud abstraction โ€” done right.
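
A sketch of what that claim might look like, assuming the platform team has published an XRD defining `PostgreSQLInstance` (the API group, parameter names, and resource names here are hypothetical; `compositionSelector` and `writeConnectionSecretToRef` are standard Crossplane claim fields):

```yaml
apiVersion: database.example.org/v1alpha1   # API group defined by the platform team's XRD
kind: PostgreSQLInstance
metadata:
  name: orders-db
  namespace: team-orders
spec:
  parameters:
    storageGB: 20
    version: "15"
  compositionSelector:
    matchLabels:
      provider: aws                  # routes the claim to the RDS-backed Composition
  writeConnectionSecretToRef:
    name: orders-db-conn             # Crossplane writes host/user/password here
```

The dev never touches an RDS parameter; changing the Composition the selector matches retargets the same claim at Cloud SQL or Azure Database.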

Terraform Controller (Weaveworks / Flux). Run Terraform inside K8s as a reconciliation engine. CRs wrap Terraform modules; controller runs plan/apply.

Pulumi K8s Operator. Similar โ€” Pulumi programs run reconciled from K8s CRs.

ACK (AWS Controllers for Kubernetes). AWS-specific โ€” each AWS service is its own operator (s3-controller, rds-controller, etc.). No unified model like Crossplane, but officially maintained by AWS.

Config Connector (GCP). GCP equivalent โ€” GCP resources as K8s objects.

Azure Service Operator (ASO). Same for Azure.

Trade-offs vs Terraform.

  • Pro โ€” one reconciliation engine; GitOps covers infra; K8s RBAC for provisioning; drift auto-corrected; claims give strong abstractions to devs.
  • Con โ€” Crossplane still maturing; Terraform ecosystem is vastly larger; some cloud features lag; breaks the "infra team writes Terraform, app team uses K8s" separation that works for many orgs.

Interview Q&A โ€‹

  1. What is Crossplane? โ€” A Kubernetes control plane for cloud (and non-cloud) resources. It turns "make me an RDS instance" into "create this K8s custom resource." Reconcile loops in the Crossplane provider keep the cloud resource matching the K8s spec. You get one GitOps story, one RBAC model, one audit trail for infra + workloads.

  2. Crossplane vs Terraform โ€” when each? โ€” Terraform when you have existing Terraform expertise, complex modules, or need the enormous provider ecosystem. Crossplane when you're already all-in on K8s, want GitOps for everything, or need strong multi-cloud abstractions (one DatabaseInstance claim that routes to RDS / Cloud SQL / Azure). Many orgs run Terraform for foundational (VPC, IAM, K8s cluster itself) + Crossplane for app-team-managed (bucket per service, DB per service).

  3. What's a Crossplane Composition? โ€” A platform-team-defined template that combines multiple Managed Resources into a higher-level concept. Example: a Composition named RDS+SecurityGroup+Secret โ€” dev creates one claim and gets all three. Compositions are how platform engineers build opinionated offerings on top of raw cloud resources.

  4. ACK (AWS Controllers for Kubernetes) vs Crossplane on AWS — pick? — ACK if you're AWS-only and want the official AWS-maintained operators; its CRDs map thinly onto the raw AWS APIs with little added abstraction. Crossplane if you want the Composition pattern for platform-team abstractions, or need multi-cloud consistency. The choice often reduces to: "do you want AWS's opinion or your own?"

Further reading โ€‹


46. Disaster Recovery & Backups โ€‹

Why this matters โ€‹

"What's your RTO/RPO?" is a near-certain interview question for any production-scale role. DevOps owns backup + restore for clusters and state. Tested DR is the difference between "we have backups" and "we survive an incident."

Core concepts โ€‹

RTO vs RPO.

  • RTO (Recovery Time Objective) โ€” how long until service is back up. E.g., 1h.
  • RPO (Recovery Point Objective) โ€” how much data loss is acceptable. E.g., 15 min.

The pair defines your DR posture. 0/0 = active-active multi-region + synchronous replication (expensive). 4h/24h = nightly backup restored to a standby region (cheap but data-lossy).

3-2-1 backup rule. 3 copies of data, on 2 different media, with 1 off-site. For cloud: primary + snapshot in same region + replicated to second region.

DR strategies (AWS model):

  • Backup & Restore โ€” snapshots + cold restore. Cheapest; hours-days RTO.
  • Pilot Light โ€” minimal capacity warm in DR region (DB replica); scale up on failover. Minutes-hours RTO.
  • Warm Standby โ€” scaled-down live env in DR region; scale up on failover. Minutes RTO.
  • Active-Active / Hot โ€” full capacity both regions; traffic split. Seconds RTO, highest cost.

K8s backup tools.

  • Velero โ€” open-source; backs up cluster objects + PV snapshots to object storage. Restore to same or different cluster. Handles namespace migration, cluster upgrade rollback.
  • Kasten K10 โ€” commercial; more UX, multi-cluster, policy-based. Strong in regulated environments.
  • CloudNativePG / operator-specific โ€” database-level backup to object storage with PITR.

State backup mechanics.

  • etcd snapshots โ€” cluster state. On managed K8s (EKS/AKS/GKE), cloud handles it.
  • PV snapshots โ€” data. Use CSI VolumeSnapshots; orchestrate with Velero or Kasten.
  • Database native backups โ€” pg_dump/logical for small; PITR via WAL shipping for large.
  • Application backups โ€” often forgotten: search indices, cache warmup files, seeded data.

DNS failover. Route53 health checks + failover routing policy โ†’ primary region, fall back to DR region on health-check failure. Or: weighted routing for active-active. TTLs matter (60s typical for failover).

Testing DR. The backup you haven't restored is Schrödinger's backup — both good and corrupt simultaneously. Schedule quarterly DR drills: restore a backup to a staging cluster, verify integrity, document the time taken and the gaps found. Run production-style game days annually.

Chaos engineering overlap (see ยง48). DR is "planned chaos."

Commands you should know cold โ€‹

```bash
# Velero — install
velero install --provider aws --plugins velero/velero-plugin-for-aws:latest \
  --bucket my-velero-bucket --backup-location-config region=us-east-1 \
  --snapshot-location-config region=us-east-1 \
  --secret-file ./credentials-velero

# Backup a namespace
velero backup create prod-$(date +%F) --include-namespaces prod

# Backup with schedule
velero schedule create daily-prod --schedule="0 2 * * *" \
  --include-namespaces prod --ttl 720h0m0s

# Restore to a new cluster
velero restore create --from-backup prod-2026-04-16 \
  --namespace-mappings prod:prod-restore

# etcd snapshot (self-managed K8s)
ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-$(date +%F).db \
  --endpoints=https://127.0.0.1:2379 --cacert=... --cert=... --key=...
```

Interview Q&A โ€‹

  1. Design a DR strategy for an EKS workload. โ€” Start with RTO/RPO โ€” ask the business. For 1h RTO / 15m RPO: Warm Standby in a second region. Multi-AZ EKS in primary. RDS Multi-AZ + cross-region read replica. S3 cross-region replication. Secrets Manager multi-region replication. ArgoCD registered to both clusters; app manifests in Git apply to both. Route53 health-check failover with 60s TTL. Quarterly failover drills. Document runbook with step-by-step failover.

  2. 3-2-1 backup rule โ€” how in AWS? โ€” 3 copies: production data + daily RDS snapshot + replicated snapshot in another region. 2 media: live DB + S3 for snapshots. 1 off-site: cross-region replication with MFA-delete on the bucket. Immutable for compliance (Object Lock / Glacier Vault Lock).

  3. Velero โ€” how does it back up and restore? โ€” Controller queries the K8s API for objects matching an include list. Snapshots PVs via the CSI VolumeSnapshot driver. Uploads object manifests + PV snapshots to object storage (S3/GCS/Azure Blob). Restore reverses: pulls from object store, recreates K8s objects, restores PVs. Handles namespace remapping, hooks for DB-level consistency.

  4. How do you test backups? โ€” Quarterly DR drill: pick a random backup, restore to an isolated namespace or staging cluster, validate integrity (row counts, checksums, app health). Time the restore. Document gaps (missing RBAC, missing secrets, wrong region). Fix in the backup spec. Annual end-to-end failover drill.

  5. What's forgotten in most DR plans? โ€” (1) Secrets: are they in the DR region? (Many teams forget Secrets Manager replication.) (2) DNS TTL: if it's 1h, your failover is slow. (3) IAM / IRSA: does the DR cluster have the right trust policies? (4) Backups of the backup: is the bucket also replicated? (5) Runbook gaps โ€” the person who wrote it has left.

Further reading โ€‹


47. Performance, Load Testing & Capacity Planning โ€‹

Why this matters โ€‹

"How do you know it handles peak load?" is a fair question. DevOps owns load-testing infra, baselines, capacity plans. Interviewers ask to distinguish engineers who ship blindly from those who validate before scaling.

Core concepts โ€‹

Test types.

  • Load test โ€” sustained expected-traffic load; verify SLOs hold.
  • Stress test โ€” ramp until failure; find the breaking point.
  • Soak test โ€” sustained load for hours/days; find memory leaks, connection leaks, slow-GC issues.
  • Spike test — sudden burst (10× traffic); verify autoscaling + graceful degradation.
  • Breakpoint test โ€” incremental ramp to find exact saturation.

Tools.

  • k6 (Grafana) โ€” JS-scripted; CLI or cloud; good dashboards via Grafana Cloud.
  • Locust โ€” Python; great for custom protocols; distributed load easy.
  • JMeter — the GUI-heavy Java classic; also runs headless in CI.
  • Gatling โ€” Scala DSL; strong reports.
  • Artillery โ€” YAML + JS; lightweight.
  • Vegeta โ€” CLI-oriented; constant rate.
  • fortio โ€” Istio's load tool, useful for latency histograms.

k6 example:

```javascript
import http from 'k6/http';
import { check, sleep } from 'k6';

export let options = {
  stages: [
    { duration: '2m', target: 100 },   // ramp up
    { duration: '5m', target: 100 },   // steady
    { duration: '2m', target: 500 },   // spike
    { duration: '5m', target: 500 },
    { duration: '2m', target: 0 },     // ramp down
  ],
  thresholds: {
    http_req_failed: ['rate<0.01'],          // <1% errors
    http_req_duration: ['p(95)<500'],        // p95 <500ms
  },
};

export default function () {
  let r = http.get('https://api.example.com/users/me');
  check(r, { 'status 200': (r) => r.status === 200 });
  sleep(1);
}
```

Capacity planning process.

  1. Baseline current usage (RPS, CPU, memory, p99) per service.
  2. Project growth (3/6/12 months) โ€” from PM or historical.
  3. Load test at projected load; measure breaking point.
  4. Plan headroom — typically 2× the projected peak.
  5. Budget compute; plan for ramp.
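
The arithmetic behind steps 2 to 4 is simple enough to sketch. All numbers below are illustrative, and `rps_per_replica` has to come from your own load tests (step 3), not from this snippet:

```python
import math

def capacity_plan(current_rps, monthly_growth, months, rps_per_replica, headroom=2.0):
    """Project peak load and the replica count needed, with headroom baked in."""
    projected_rps = current_rps * (1 + monthly_growth) ** months   # step 2: growth
    target_rps = projected_rps * headroom                          # step 4: 2x headroom
    replicas = math.ceil(target_rps / rps_per_replica)             # from load-test data
    return {"projected_rps": round(projected_rps),
            "target_rps": round(target_rps),
            "replicas": replicas}

# 500 RPS today, 10% monthly growth, 12-month horizon,
# each replica sustains 120 RPS at SLO (measured in step 3).
print(capacity_plan(500, 0.10, 12, 120))
```

Rerun quarterly with fresh growth numbers, since projections drift.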

Synthetic monitoring. Scheduled probes of prod from multiple regions (Pingdom, Datadog Synthetics, New Relic Synthetics, CloudWatch Synthetics). Catches availability and latency regressions before users do.

Key metrics to watch during tests.

  • Request rate + error rate + latency percentiles.
  • CPU, memory, IOPS, network on each tier.
  • Downstream: DB connections, cache hit rate.
  • Autoscaling: did HPA/CA respond in time?
  • Graceful degradation: did circuit breakers fire at the right threshold?

Interview Q&A โ€‹

  1. Walk me through how you'd load-test a new API. — Define SLOs first — target RPS, p95 latency, error rate. Pick a tool (k6 for CLI + scripting). Write tests: steady-state at target RPS, ramp to 2× peak, spike test. Run from outside the cluster (real network path). Set thresholds in the test (fail if p95 >500ms or errors >1%). Observe the full stack during the run: app metrics, infra saturation, autoscaler response. Iterate: fix the first bottleneck, rerun, repeat.

  2. Load vs stress vs soak โ€” when each? โ€” Load: verify SLOs at expected peak. Stress: find where the system breaks (so you know your safety margin). Soak: run load for days to catch leaks. Always do load + soak before going to prod; stress for capacity planning decisions.

  3. What's the difference between average latency and p99? โ€” Average hides tail latency โ€” one 5-second request averaged with 999 fast ones looks fine at the mean. p99 captures the worst 1% โ€” the users most likely to churn. Always report p50/p95/p99 distribution; never just average. Alert on p95 or p99, not average.

  4. Autoscaling didn't keep up with a spike โ€” what do you do? โ€” Investigate: how long did Cluster Autoscaler take to add nodes (EKS typical: 2โ€“5 min)? Switch to Karpenter for faster provisioning (30s node-ready). Pre-warm: HPA minReplicas higher, or KEDA scaling-to-zero with a low-priority keep-alive. Pause pods: low-priority resource reservations that get preempted on real load โ€” gives a buffer of warm capacity. Finally: rate-limit at the gateway so the system degrades gracefully instead of collapsing.

  5. How do you capacity-plan for next year? โ€” Baseline current: RPS, resource usage, p99. Get growth projections from PM (users, transactions, data volume). Load test at projected peak (with 2ร— headroom). Compute cost at that scale; budget accordingly. Revisit quarterly โ€” projections drift.

Further reading โ€‹


48. Chaos Engineering โ€‹

Why this matters โ€‹

"Hope is not a strategy." Chaos engineering is proactive failure injection — verifying resilience instead of discovering its absence in production. It has grown from Netflix's Chaos Monkey into a widespread practice.

Core concepts โ€‹

Principles of Chaos (from principlesofchaos.org):

  1. Define "steady state" as a measurable output.
  2. Hypothesize that steady state will hold in both the control and experimental groups.
  3. Introduce real-world variables (server crashes, disk failures, network latency, dep failures).
  4. Run in prod (carefully) โ€” that's where the surprises are.
  5. Automate experiments to run continuously.

The experiment shape.

Steady-state hypothesis (SLO holds)
        ↓
Inject fault (kill a pod)
        ↓
Observe (does SLO still hold?)
        ↓
Learn → fix weakness → repeat
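
As a concrete instance of this shape, a Chaos Mesh experiment might look like the following (namespace, labels, and duration are illustrative):

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: checkout-pod-failure
  namespace: chaos-testing
spec:
  action: pod-failure      # the injected fault
  mode: one                # blast radius: exactly one matching pod
  duration: "30s"          # time-boxed; the fault auto-reverts
  selector:
    namespaces:
      - prod
    labelSelectors:
      app: checkout        # illustrative target label
```

Steady state (the SLO dashboard) is checked before, during, and after; `mode` is the blast-radius dial, moving from `one` toward `fixed-percent` as confidence grows.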

Tools.

  • Chaos Monkey (Netflix) โ€” OG; kills EC2 instances randomly.
  • Chaos Mesh (CNCF) โ€” K8s-native; rich fault types (pod kill, network partition, IO delay, JVM chaos).
  • Litmus (CNCF) โ€” K8s-native; chart-based experiments.
  • Gremlin โ€” commercial; extensive UI and safety controls.
  • Pumba โ€” chaos for Docker containers.
  • Toxiproxy โ€” network-level chaos.
  • AWS Fault Injection Simulator (FIS) โ€” AWS-native; inject faults into EC2, ECS, RDS.

Scoping experiments.

  • Blast radius โ€” start small (one pod), ramp up (10%, 25%, 50%).
  • Abort criteria โ€” "if error rate >X, stop immediately."
  • Time-boxed โ€” don't leave a chaos experiment running overnight.
  • Announce to on-call โ€” no surprise raids.

Game days. Scheduled, team-wide chaos exercises. "Today at 2 PM, we'll kill the primary RDS. Everyone observe and practice the runbook." Great for training on-call, finding runbook gaps, building confidence.

Pre-conditions. Observability must be solid; rollback must be fast; SLOs must be measurable. Don't start chaos on a system you can't observe.

Interview Q&A โ€‹

  1. What is chaos engineering and why? โ€” Proactive failure injection to verify resilience. Instead of waiting for prod to surprise you with a "that wasn't supposed to happen" moment, you deliberately inject faults under controlled conditions โ€” see what breaks, fix it, repeat. Turns hidden assumptions into tested ones.

  2. How do you start chaos engineering in a team that's never done it? โ€” Game days first โ€” scheduled, announced, in staging. Pick a common fault (kill a pod, fail a dependency). Observe, note runbook gaps. Move to prod with small blast radius (1 pod of 100), clear abort criteria, on-call awareness. Automate small experiments into CI. Don't start until observability + rollback are solid.

  3. Chaos Mesh vs Litmus โ€” pick? โ€” Both are CNCF; both K8s-native. Chaos Mesh has a slicker UI and more fault types (JVM-specific, IO-specific). Litmus has a richer chart/experiment marketplace and stronger integration with GitOps (experiment manifests commit to Git). Try both in a sandbox; pick based on UX fit.

  4. Blast radius — how do you contain it? — Scope by namespace or label. Start with a single pod. Increase step-by-step (25%, 50%). Always define abort criteria — "if error rate exceeds X" → the experiment stops automatically. Run in off-peak hours first. Make sure on-call is aware, and keep a kill switch ready.

  5. What's a "steady state" in chaos engineering? โ€” A measurable baseline: "99.9% success rate on /checkout, p95 latency <500ms." Before injecting chaos, confirm steady state. Inject the fault. Verify steady state still holds (or breaks as expected). Violations reveal weak points.

Further reading โ€‹


49. AIOps & AI-Assisted DevOps โ€‹

Why this matters โ€‹

2026 reality: ~90% of devs use AI coding assistants daily (per Google DORA). AI in observability (anomaly detection, auto-remediation suggestions) is becoming mainstream. Interviewers will ask how you use AI tooling responsibly and what guardrails you put on it.

Core concepts โ€‹

AIOps (AI for IT Ops) โ€” primarily:

  • Anomaly detection โ€” unsupervised models flag weird metric/log patterns that threshold alerts miss.
  • Log clustering โ€” group similar log patterns; find the emerging one.
  • Predictive autoscaling โ€” forecast load; scale ahead of time.
  • Root cause analysis โ€” correlate changes + metrics + traces to suggest "deploy X likely caused incident Y."
  • Runbook generation โ€” LLM drafts runbooks from incident history; human reviews.

Vendors: Datadog AI, New Relic AI, Dynatrace Davis, Moogsoft, BigPanda.
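
The core of metric anomaly detection is small enough to sketch: a rolling z-score on a synthetic latency series. Real AIOps products layer on seasonality and multivariate correlation, but this is the basic idea:

```python
import statistics

def zscore_anomalies(series, window=20, threshold=3.0):
    """Flag points more than `threshold` std-devs from the rolling mean
    of the preceding `window` points."""
    anomalies = []
    for i in range(window, len(series)):
        past = series[i - window:i]
        mu = statistics.mean(past)
        sigma = statistics.stdev(past)
        if sigma > 0 and abs(series[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies

# Steady 100-104ms latency with noise, and one 300ms spike at t=30.
series = [100 + (i % 5) for i in range(40)]
series[30] = 300
print(zscore_anomalies(series))
```

A static 500 ms threshold would never fire here; the z-score flags the point because it is wildly abnormal for this series — exactly the class of pattern threshold alerts miss.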

AI coding tools.

  • Claude Code, Cursor, GitHub Copilot โ€” editor-integrated; write code, refactor, review.
  • Agentic tools โ€” run in your repo, execute multi-step tasks (test-fix loops).

Guardrails.

  • Which models are approved per data class (public code โ†’ any model; internal โ†’ in-org hosted; restricted โ†’ denylist external).
  • Auto-generated code review โ€” tighter review for AI-heavy PRs; track "AI ownership" for incident attribution.
  • Prompt injection defense โ€” AI processing untrusted input (user-submitted text, logs) can be hijacked; sandbox + output filter.
  • SBOM impact โ€” AI may suggest deps that haven't passed your policies. SCA gate catches this.
  • License compliance โ€” AI-generated code may inherit training data licenses; track via tooling like Copilot's license filter.

IDPs as guardrail vehicle. Platform teams wire AI guardrails into the IDP: approved models via the chat interface, auto-documenting AI usage in PR metadata, blocking AI on highly-sensitive repos.

Interview Q&A โ€‹

  1. How do you use AI coding assistants responsibly? โ€” Treat AI-generated code like junior-engineer code: review it, test it, don't ship blindly. Gate by data sensitivity โ€” public repos can use any model; internal repos use an in-org-hosted model; restricted repos may disable AI entirely. Track usage in PR metadata. Measure: does AI-heavy code have different failure rates?

  2. Prompt injection in an ops tool โ€” what's the risk? โ€” If your AIOps runbook generator reads incident logs, an attacker could plant a log line that tricks the LLM into exfiltrating data or suggesting destructive actions. Mitigations: treat all LLM input as untrusted; sandbox execution; filter outputs; require human approval before automated actions.

  3. AIOps anomaly detection vs threshold alerts โ€” trade-off? โ€” Threshold alerts are predictable but miss emerging patterns ("this metric is fine relative to threshold but weird for this day of week at this time"). ML anomaly detection catches those but generates more false positives and is harder to explain ("why did it alert?"). Use together: thresholds for known failure modes, ML for "this looks unusual" awareness.

  4. Where would you NOT use AI in DevOps? โ€” Anywhere that needs deterministic repeatability (builds, compliance evidence). Decision points that require auditability (deny/allow in admission policy โ€” explain WHY). Anything operating on secrets/credentials. Rule of thumb: AI drafts; humans approve. Never let AI take irreversible actions without a human gate.

Further reading โ€‹


50. GreenOps / Sustainable DevOps โ€‹

Why this matters โ€‹

Carbon is becoming a first-class metric alongside cost. Regulatory pressure (EU CSRD) and customer pressure push it up the priority list. Interviewers at climate-conscious companies ask about it.

Core concepts โ€‹

Why GreenOps overlaps DevOps. Same levers reduce compute, cost, and carbon:

  • Rightsizing (CPU/memory).
  • Autoscaling (turn off when not needed).
  • Efficient regions (some data centers run greener grids).
  • Consolidation (bin-pack with Karpenter).
  • Batch to off-peak (grid is greener when solar/wind is abundant).

Measuring carbon.

  • Kepler (CNCF sandbox) โ€” Kubernetes-native; eBPF-based; estimates per-pod carbon from energy usage + grid intensity.
  • Cloud Carbon Footprint (ThoughtWorks) โ€” estimates carbon from cloud billing.
  • Cloud native carbon APIs โ€” AWS Customer Carbon Footprint, Azure Emissions Impact, Google Carbon Footprint dashboards.

Carbon-aware scheduling. Run batch workloads when the grid is cleanest (low carbon intensity). Tools: WattTime, Electricity Maps APIs return intensity per region per hour. Scheduler decides "run this job in us-west-2 between 10am-2pm when solar peak."
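
A minimal sketch of that scheduling decision, assuming you have already fetched an hourly intensity forecast (the numbers below are synthetic; WattTime and Electricity Maps return real ones):

```python
def greenest_window(intensity_by_hour, job_hours):
    """Start hour of the contiguous window with the lowest average
    grid carbon intensity (gCO2/kWh)."""
    best_start, best_avg = 0, float("inf")
    for start in range(len(intensity_by_hour) - job_hours + 1):
        avg = sum(intensity_by_hour[start:start + job_hours]) / job_hours
        if avg < best_avg:
            best_start, best_avg = start, avg
    return best_start, best_avg

# Synthetic 24h curve: dirty overnight, cleanest around the solar peak.
intensity = [400] * 8 + [300, 250, 180, 150, 140, 150, 200, 280] + [380] * 8
start, avg = greenest_window(intensity, 4)
print(f"run the 4h batch job starting at hour {start} (avg {avg:.0f} gCO2/kWh)")
```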

Region selection. us-west-2 (Oregon) tends to run on a greener grid than us-east-1 (Virginia). In Europe, the Nordics beat Central Europe. Factor this into architecture decisions for new workloads.

Interview Q&A โ€‹

  1. GreenOps โ€” what does it mean in practice? โ€” Measure and optimize the carbon footprint of your infrastructure. In practice it's the same work as cost optimization + region selection + carbon-aware scheduling. Rightsize, autoscale, shift batch to low-carbon hours, pick green regions for new workloads.

  2. Kepler โ€” what does it measure? โ€” Per-pod energy consumption, estimated via eBPF-collected hardware counters + models. Translates to carbon using grid intensity data. Surfaces in Prometheus โ€” your container's carbon cost alongside its CPU cost.

  3. How do you factor carbon into architecture decisions? โ€” For new workloads, default to greener regions unless latency/compliance forces another choice. Batch jobs scheduled to off-peak / carbon-aware windows. Lean toward serverless or autoscaling for variable load (no carbon when idle). Efficiency wins both $ and carbon.

Further reading โ€‹


Part IX โ€” Scenario-Based & System Design โ€‹

51. Troubleshooting Scenarios โ€‹

Each scenario below is a walk-through: narrate the debugging path, not just the answer.

51.1 Pod in CrashLoopBackOff โ€‹

  1. kubectl describe pod โ€” read Status, Reason, restart count, LastState.Terminated exit code.
  2. Exit code 137 = OOMKilled or SIGKILL โ†’ check memory limits vs usage.
  3. Exit code 139 = SEGV โ†’ native crash; look at /var/log/containers/*-previous*.log.
  4. Exit code 1/2 = generic app error โ†’ kubectl logs --previous for the stack.
  5. If image pull error: kubectl get events; check image name/tag, imagePullSecret, ECR/registry auth.
  6. If config error: confirm env vars, secret/configmap mounts, mounted file permissions.
  7. If probe failures: tighten probe path/delay/timeouts.
  8. Mystery: kubectl debug <pod> -it --copy-to=debug --set-image=api=busybox to poke at config/volumes without the real binary.

51.2 Node NotReady โ€‹

  1. kubectl get nodes -o wide โ€” which node, for how long.
  2. kubectl describe node โ€” Conditions (MemoryPressure, DiskPressure, PIDPressure, NetworkUnavailable).
  3. SSH to node (or SSM Session Manager on AWS). systemctl status kubelet โ†’ if dead, journalctl -u kubelet --since "10 min ago".
  4. Check disk: df -h (common cause: /var/lib/docker full).
  5. Check runtime: crictl ps, crictl images (is the runtime healthy?).
  6. Check network: CNI plugin pod running? kubectl get pods -n kube-system -o wide | grep <node>.
  7. Cordon + drain + replace if recovery is slow: kubectl cordon <node>; kubectl drain <node> --ignore-daemonsets --delete-emptydir-data.

51.3 ArgoCD app OutOfSync (keeps drifting back) โ€‹

  1. argocd app diff <app> โ€” what's the drift?
  2. Usually: a controller (HPA, mesh webhook) mutates a field that Git declares.
  3. Add ignoreDifferences on the specific field.
  4. If drift is legit (someone manually edited prod), sync to reconcile + educate the team.
  5. If sync repeatedly fails: check health checks for CRDs (custom Lua health may be needed).

51.4 Deploy rollback under pressure โ€‹

  1. Identify: is the new deploy actually causing the incident? Correlate deploy annotation with the symptom spike on the dashboard.
  2. Rollback: argocd app rollback <app> <revision> OR kubectl rollout undo deploy/<x> OR git revert <bad-sha> (pick the fastest for your stack; generally git revert since it matches GitOps).
  3. Verify: watch the dashboard. Symptoms should clear within 1 deploy cycle.
  4. Communicate: post to the status channel.
  5. Post-incident: blameless RCA.

51.5 Kafka consumer lag spike โ€‹

  1. Confirm lag: kafka-consumer-groups --bootstrap-server ... --describe --group <g>.
  2. Is consumer pod running? kubectl get pods -l app=consumer. Restarts?
  3. Is consumer throughput dropping? Check consumer app metrics (records_consumed_rate).
  4. Producer flood? Check producer rate metric.
  5. Rebalance in progress? coordinator events in consumer logs.
  6. Scale consumers: increase replicas (remember: partitions cap effective parallelism; more consumers than partitions = idle).
  7. If partitions are the cap: add partitions (one-way; can't reduce).
  8. Poison message? Check whether the consumer fails repeatedly at the same offset; route bad records to a dead-letter topic (DLT).

51.6 Certificate expiry โ€‹

  1. Immediate: identify all TLS endpoints; which failed.
  2. Check cert-manager: kubectl get certificate -A | grep -v True; kubectl describe certificate ....
  3. If Let's Encrypt: rate limit, DNS validation issue, or ACME challenge failure?
  4. Manual renew while debugging automation: cmctl renew <cert>.
  5. Long-term: Prometheus alert on certs expiring <30d; cert-manager auto-renew (default 2/3 of validity).
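
The long-term alert from step 5 can be expressed as a Prometheus rule on cert-manager's exported expiry metric (threshold and severity are illustrative):

```yaml
groups:
  - name: cert-expiry
    rules:
      - alert: CertificateExpiringSoon
        expr: (certmanager_certificate_expiration_timestamp_seconds - time()) < 30 * 24 * 3600
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Certificate {{ $labels.name }} expires in under 30 days"
```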

51.7 DNS resolution failure in cluster โ€‹

  1. kubectl run dns-test --rm -it --image=nicolaka/netshoot -- dig kubernetes.default โ€” basic resolution works?
  2. If no: CoreDNS pod(s) healthy? kubectl get pods -n kube-system -l k8s-app=kube-dns.
  3. CoreDNS logs: kubectl logs -n kube-system deploy/coredns.
  4. NetworkPolicy blocking egress to CoreDNS? Check kube-system:k8s-app=kube-dns reachable from your namespace.
  5. NodeLocal DNSCache crashed? kubectl get ds -n kube-system | grep node-local.
  6. Upstream DNS issue (external names failing)? Check CoreDNS Corefile upstream settings.

51.8 Traffic 10× in 5 minutes

  1. Confirm: dashboard shows request rate 10× normal.
  2. Check autoscaling: HPA fired? Cluster Autoscaler scaling nodes?
  3. If HPA + Karpenter are working but still saturated: gateway rate-limit to protect upstream (favor some users over total collapse).
  4. Identify: legitimate spike (launch, news, seasonality) or abuse (scraper, DDoS)?
  5. If abuse: WAF rate-limit at edge; IP-based throttling.
  6. Inform product; if organic, postmortem adds capacity planning item.

52. DevOps System Design Prompts โ€‹

Framework for any design prompt: clarify (who, scale, SLOs) โ†’ API / data model โ†’ high-level architecture โ†’ deep-dive on 1โ€“2 components โ†’ scaling + failure modes โ†’ rollback plan.

52.1 GitOps for 10+ microservices across dev/staging/prod (ArgoCD + Helm + External Secrets) โ€‹

Clarify: tenants/teams? Per-team or platform-managed ArgoCD? Secrets source (Vault? AWS SM?)? Compliance needs?

Architecture:

  • Two repos: app-<svc> (source + CI) and config (Helm values, ArgoCD Applications).
  • CI on push to app-<svc>: build → test → scan → image push to ECR → Cosign sign → commit tag bump to config repo.
  • config repo: apps/<svc>/envs/{dev,staging,prod}/values.yaml + ArgoCD ApplicationSet matrix of clusters × app dirs.
  • ArgoCD on each cluster, connected to SSO; Applications with automated: { prune: true, selfHeal: true }.
  • External Secrets Operator synced to Vault; per-service ExternalSecret.
  • Cosign + Kyverno admission: refuse unsigned images.
  • Observability: each app exposes /actuator/prometheus; kube-prometheus-stack scrapes.

Scaling: ApplicationSet with cluster generator; shard the application controller as clusters grow. Failure modes: ArgoCD down → last-known state remains in cluster; CI can't commit to config repo → alert; bad commit in config → git revert. Rollback plan: git revert <sha> or argocd app rollback.
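The "commit tag bump to config repo" step is just CI rewriting one line in the env's values file and pushing a commit. A sketch of that rewrite (the `image.tag` key layout is a hypothetical, illustrative convention):

```python
import re

def bump_image_tag(values_yaml: str, new_tag: str) -> str:
    """Rewrite the first 'tag:' line in a Helm values file -- the entire
    content change of a CI promote commit (hypothetical key layout)."""
    return re.sub(r"(?m)^(\s*tag:\s*).*$",
                  lambda m: m.group(1) + new_tag,
                  values_yaml, count=1)
```

Because the promote is a plain Git commit, rollback is symmetric: `git revert` the bump and ArgoCD converges back.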

52.2 Centralized observability for 20+ microservices

Architecture:

  • OTel Java agent on every service → OTLP to OTel Collector (DaemonSet) → gateway Collector (Deployment) with tail sampling → Tempo (traces), Mimir/Thanos (metrics), Loki (logs).
  • kube-prometheus-stack on each cluster; remote-write to Mimir.
  • Fluent Bit DaemonSet → Loki.
  • Grafana as single-pane UI with derived fields linking trace_id across signals.
  • Alertmanager → PagerDuty + Slack.

Deep-dive (tracing): tail sampling policies (errors 100%, slow >500 ms 100%, probabilistic 5%). Scaling: Mimir is horizontally scalable (object-store backed); Loki indexes only labels and is also horizontally scalable. Cost: retention tiers (hot 7d, warm 30d, cold 1y in S3).
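The tail-sampling policy reduces to a per-trace decision function. A toy sketch (in practice this is configuration on the OTel Collector's tail-sampling processor, not application code):

```python
import random

def keep_trace(has_error, duration_ms, rng=random):
    """Keep 100% of error traces, 100% of slow traces (>500 ms),
    and ~5% of everything else."""
    if has_error or duration_ms > 500:
        return True
    return rng.random() < 0.05
```

The key property of tail (vs head) sampling: the decision is made after the whole trace is assembled, so "was it an error?" and "was it slow?" are knowable.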

52.3 Zero-downtime DB migration (e.g., Postgres column rename across 6 services)

Plan (expand-contract):

  1. Expand: deploy migration adding new column; dual-write old + new.
  2. Backfill: async job backfills new column from old.
  3. Dual-read: services read new; fall back to old.
  4. Migrate consumers: flip reads to new column only.
  5. Contract: drop old column after grace period.

Each step is a separate deploy; all deploys are backwards-compatible. Test every step in staging with prod-like data volume.
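The five steps can be simulated in a few lines; the point is that every intermediate state is safe for both old and new service versions. A toy model (illustrative only; the column names are hypothetical):

```python
class UserTable:
    """Toy expand-contract model: 'full_name' is the old column,
    'display_name' the new one (hypothetical names)."""
    def __init__(self):
        self.rows = {}  # user_id -> column dict

    def write(self, user_id, name):
        # Expand phase: dual-write old + new on every write path.
        self.rows[user_id] = {"full_name": name, "display_name": name}

    def backfill(self):
        # Async backfill: populate the new column where only old exists.
        for row in self.rows.values():
            row.setdefault("display_name", row["full_name"])

    def read(self, user_id):
        # Dual-read phase: prefer the new column, fall back to old.
        row = self.rows[user_id]
        return row.get("display_name") or row["full_name"]
```

Only after backfill completes and all readers are on `display_name` does the contract step (dropping `full_name`) become safe.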

52.4 Multi-region K8s platform

Strategy: warm standby.

  • Primary region: full capacity, active traffic.
  • DR region: scaled-down EKS cluster, DB read replica.
  • S3 cross-region replication for state.
  • Secrets Manager multi-region.
  • Route53 failover records (TTL 60s).
  • ArgoCD connected to both; manifests deploy to both via ApplicationSet cluster generator.
  • Quarterly DR drills.

RTO ~15 min; RPO ~1 min (async replication).

52.5 Secrets rotation system

Design:

  • Source of truth: Vault + DB dynamic secrets.
  • K8s integration: External Secrets Operator syncs with 1h refresh.
  • App pattern: mount as file; watch for changes; reload without restart.
  • Rotation event: Vault rotates on schedule; the new value flows from ESO to the K8s Secret to the mounted file in the pod; the app reloads.
  • Static creds eliminated: IRSA for AWS, Workload Identity for GCP/Azure.
  • Breakglass: Vault break-glass procedure with audit alarm on use.
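The "mount as file; reload without restart" pattern boils down to re-reading the mounted path when it changes. A minimal sketch (mtime polling; real apps often use inotify or a framework's config-refresh hook instead):

```python
import os

class MountedSecret:
    """Re-read a mounted secret file only when its mtime changes,
    so rotated values are picked up without a pod restart."""
    def __init__(self, path):
        self.path = path
        self._mtime = None
        self._value = None

    def get(self):
        mtime = os.stat(self.path).st_mtime_ns
        if mtime != self._mtime:        # rotated (or first read)
            with open(self.path) as f:
                self._value = f.read().strip()
            self._mtime = mtime
        return self._value
```

One caveat worth mentioning in an interview: kubelet syncs Secret volumes on a delay, so the file in the pod lags the K8s Secret by up to the sync period.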

52.6 Self-service IDP (Backstage-based)

Components:

  • Backstage as the portal.
  • Scaffolder templates: "new Spring Boot microservice" creates repo, CI workflow, Helm chart, ArgoCD Application, observability dashboards, on-call rotation, runbook stub.
  • Software Catalog: every service registered with owner team, deps, SLOs.
  • Plugin integrations: ArgoCD (sync status), PagerDuty (on-call), Sonar (quality), Datadog (APM).
  • Golden paths enforced via admission (Kyverno): reject pods not from approved templates.
  • DORA dashboard per team.

52.7 IBM MQ ↔ AWS ActiveMQ bridge (classic legacy MQ bridge prompt)

Architecture:

  • Spring Boot bridge service in hybrid network (VPN / Direct Connect).
  • IBM MQ JMS inbound listener → filter/transform → Amazon MQ JMS producer.
  • Persistent delivery on both sides; XA transaction across (or idempotency + at-least-once).
  • Dedup store (DynamoDB) with TTL to catch duplicates.
  • DLT for poison messages on each side.
  • Observability: span per message; Prometheus lag + queue-depth metrics.
  • High availability: multi-AZ; primary/standby bridge with shared dedup.

Anchor example: classic MQ bridge prompt โ€” be ready to whiteboard a legacy MQ microservice scenario end to end.
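At-least-once delivery plus the dedup store is what gives effective exactly-once processing. A sketch of the dedup check (in-memory stand-in for the DynamoDB table; in the real design, TTL expiry happens server-side):

```python
import time

class DedupStore:
    """Record message IDs with an expiry; a redelivered ID inside the
    TTL window is a duplicate and must be skipped."""
    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.seen = {}              # message_id -> expiry timestamp

    def first_delivery(self, message_id, now=None):
        now = time.time() if now is None else now
        # Expire old entries (DynamoDB TTL does this for you).
        self.seen = {m: exp for m, exp in self.seen.items() if exp > now}
        if message_id in self.seen:
            return False            # duplicate -> ack and drop
        self.seen[message_id] = now + self.ttl
        return True                 # safe to forward to Amazon MQ
```

The bridge only produces to the far side when `first_delivery` returns True; redeliveries after a crash are acked without reprocessing.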


53. Behavioral — DevOps-Flavored

Use STAR. Reuse resume stories with DevOps framing.

  1. Tell me about a time you broke production. — Be honest; focus on detection, containment, RCA, prevention. What monitoring did you add after?

  2. A deploy is failing in prod at 3 AM — walk me through. — Acknowledge the alert within SLA, open the runbook, check dashboards for blast radius, decide rollback vs roll-forward, coordinate with on-call, write a status update, post-mortem the next day.

  3. You took over a team's deployment pipeline that was brittle — what did you fix first? — Specifics: flaky tests, slow feedback, no caching, static creds, no signed images. Prioritize: highest pain per time invested.

  4. Tell me about a time you pushed back on a "just deploy it" request. — Production readiness: no SLO, no observability, no runbook, no rollback plan. Propose the alternative path; respect the urgency while protecting reliability.

  5. You mentored 20+ interns — how did you teach DevOps practices? — Structured program: week-1 pipeline walkthrough, daily standups, PR reviews with TDD focus, pair programming on ops tasks, rotation through on-call shadowing.

  6. Describe a time you improved sprint completion from 60% to 85%. — Process changes (right-sized stories, definition of done, pipeline reliability, reduced interrupt load). Quantified outcomes. What didn't work (e.g., tried estimation poker; stopped).

  7. Tell me about a technically challenging DevOps problem you solved. — Candidates: 99% deploy rate, MQ bridge, cross-team schema standardization (arguably data-format work, but with a DevOps angle), OpenShift image compatibility.

  8. How do you balance velocity and reliability? — Error budgets. When the budget is healthy, ship. When it's burned, pause features for reliability work. Don't treat it as a tradeoff — elite teams deliver both.

  9. What's your philosophy on on-call? — Respect the load; cap toil at 50%; every page has a runbook; blameless postmortems; track trends in paging frequency; rotate sustainably.

  10. Describe your PR review philosophy. — TDD focus (does it test behavior?), security (injection, secrets, permissions), operational (observability, graceful degradation), simplicity (smaller diffs land faster). A 30% defect reduction was the measurable outcome.

  11. Tell me about influencing a technical decision without authority. — Rolled out Testcontainers across the team by writing a reference PR, showing metrics (fewer false-green CI runs), and offering office hours. Authority is earned.

  12. How do you stay current? — Specifics: CNCF releases / KubeCon talks, Brendan Gregg's blog, OWASP + Sigstore threat intel, the DORA annual report, the OpenTelemetry blog. Weekly 30-min reading slot protected on the calendar.


Appendices

Appendix A — Commands Cheat Sheet

The "I need this at 3 AM or in a live interview" quick reference.

kubectl — workload

```bash
kubectl get pods -A -o wide
kubectl get pods -n prod --field-selector=status.phase!=Running
kubectl describe pod <name> -n prod
kubectl logs <name> -n prod --previous --all-containers --tail=200
kubectl logs -f -l app=api -n prod --max-log-requests=50
kubectl exec -it <pod> -n prod -- sh
kubectl debug -it <pod> -n prod --image=nicolaka/netshoot --target=<container>
kubectl rollout status deploy/api -n prod
kubectl rollout history deploy/api -n prod
kubectl rollout undo deploy/api --to-revision=3 -n prod
kubectl rollout restart deploy/api -n prod
kubectl scale deploy/api --replicas=5 -n prod
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data
kubectl cordon <node> ; kubectl uncordon <node>
kubectl port-forward svc/api 8080:80 -n prod
kubectl get events -n prod --sort-by=.lastTimestamp
kubectl top pods -A --sort-by=cpu | head
kubectl top nodes
kubectl diff -f deploy.yaml
kubectl apply -f deploy.yaml --dry-run=server --server-side
kubectl auth can-i --list --as=system:serviceaccount:prod:api -n prod
```

kubectl — config / RBAC / secrets

```bash
kubectl config current-context
kubectl config use-context dev
kubectl config view --minify
kubectl create secret generic x --from-literal=K=V -n prod --dry-run=client -o yaml | kubectl apply -f -
kubectl get secret x -n prod -o jsonpath='{.data.K}' | base64 -d
kubectl describe rolebinding -n prod
kubectl describe clusterrolebinding
```

Helm

```bash
helm list -A
helm install api ./chart -n prod --create-namespace -f values-prod.yaml --atomic --timeout 10m
helm upgrade --install api ./chart -n prod -f values-prod.yaml
helm template ./chart -f values-prod.yaml | less
helm history api -n prod
helm rollback api 3 -n prod
helm get values api -n prod
helm get manifest api -n prod
helm dep update ./chart
helm push chart.tgz oci://ghcr.io/org/charts
```

ArgoCD

```bash
argocd login argocd.example.com --sso
argocd app list
argocd app get api
argocd app diff api
argocd app sync api --prune
argocd app rollback api <rev>
argocd app get api --hard-refresh
argocd proj list
```

Docker

```bash
docker build -t app:1.0 .
docker buildx build --push --platform linux/amd64,linux/arm64 -t ghcr.io/org/app:1.0 .
docker history app:1.0
docker run --rm -it --user 1000:1000 --read-only --tmpfs /tmp --memory 512m app:1.0
docker exec -it <name> sh
docker logs -f <name>
docker stats <name>
docker inspect <name>
docker image prune -a
trivy image app:1.0
cosign sign --yes app:1.0
cosign verify app:1.0 --certificate-identity=... --certificate-oidc-issuer=...
syft app:1.0 -o cyclonedx-json > sbom.json
```

Git (daily)

```bash
git diff --staged
git reset --soft HEAD~1
git revert <sha>
git rebase -i HEAD~5
git pull --rebase
git bisect start ; git bisect bad HEAD ; git bisect good v1.2.0
git blame -L 42,50 path/to/file
git log -S 'mysterious_function' --all --source
git cherry-pick <sha>
git push --force-with-lease
git reflog
```

Terraform

```bash
terraform init -upgrade
terraform fmt -recursive
terraform validate
terraform plan -out=tfplan
terraform apply tfplan
terraform state list
terraform state show <addr>
terraform state mv <from> <to>
terraform state rm <addr>
terraform apply -replace='aws_instance.web[0]'
terraform import <addr> <id>
terraform destroy
```

AWS CLI

```bash
aws sts get-caller-identity
aws eks update-kubeconfig --name mycluster --region us-east-1
aws s3 cp ./file s3://bucket/path/
aws s3 presign s3://bucket/path/file --expires-in 600
aws logs start-query --log-group-name ... --query-string '...'
aws ec2 describe-volumes --filters Name=status,Values=available
aws secretsmanager get-secret-value --secret-id prod/db | jq -r .SecretString
```

Linux / Networking

```bash
ss -tlnp                                   # listening sockets + owning process
ss -s                                      # connection state summary
lsof -i :8080                              # who owns port
lsof -p <pid>                              # files open by pid
ps auxf --sort=-rss | head                 # top memory consumers
dmesg -T | grep -i 'killed process'        # OOM kills
journalctl -u kubelet -f                   # follow systemd log
vmstat 1                                   # cpu/mem/io per second
iostat -xz 1                               # disk per device
strace -f -p <pid>                         # syscall trace
tcpdump -i eth0 -n port 443                # packet capture
dig example.com +short                     # DNS
openssl s_client -connect ex.com:443 -servername ex.com </dev/null | openssl x509 -noout -dates
curl --resolve ex.com:443:10.0.0.5 https://ex.com/
mtr example.com                            # traceroute + loss
```

Prometheus / PromQL

```promql
# error rate
sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
  / sum(rate(http_requests_total[5m])) by (service)

# p99 latency
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))

# top endpoints by request rate
topk(10, sum(rate(http_requests_total[5m])) by (endpoint))

# cardinality audit
count by (__name__)({__name__=~".+"})
```
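If asked what rate() actually does: roughly, the counter delta divided by the window (the real function also handles counter resets and extrapolates to the window edges). The error-rate expression above, on two counter samples, under that simplification:

```python
def error_rate(err0, err1, total0, total1, window_s):
    """(per-second 5xx rate) / (per-second total rate) over one window,
    from counter values at the window's start and end.
    Simplified: ignores counter resets and rate()'s extrapolation."""
    return ((err1 - err0) / window_s) / ((total1 - total0) / window_s)
```

E.g. 5 new 5xx and 100 new total requests over a 5-minute window is a 5% error rate; the window length cancels out of the ratio.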

Appendix B — Glossary

Quick definitions for the acronym jungle.

| Term | Expansion |
| --- | --- |
| ACK | AWS Controllers for Kubernetes |
| ADOT | AWS Distro for OpenTelemetry |
| AKS | Azure Kubernetes Service |
| ALB | Application Load Balancer (AWS L7) |
| APF | API Priority and Fairness |
| APM | Application Performance Monitoring |
| AppArmor | Linux MAC security module |
| AWS | Amazon Web Services |
| AZ | Availability Zone |
| BGP | Border Gateway Protocol |
| CA | Cluster Autoscaler (also Certificate Authority) |
| CALMS | Culture, Automation, Lean, Measurement, Sharing |
| CDK | (AWS) Cloud Development Kit |
| CFN | CloudFormation |
| CI/CD | Continuous Integration / Delivery (or Deployment) |
| CIS | Center for Internet Security |
| CJIS | Criminal Justice Information Services |
| CMK | Customer Managed Key (AWS KMS) |
| CNCF | Cloud Native Computing Foundation |
| CNI | Container Network Interface |
| CR / CRD | Custom Resource / Custom Resource Definition |
| CRI | Container Runtime Interface |
| CRR | Cross-Region Replication (S3) |
| CSI | Container Storage Interface |
| CSP | Cloud Service Provider |
| CUI | Controlled Unclassified Information |
| CVE | Common Vulnerabilities and Exposures |
| DAST | Dynamic Application Security Testing |
| DLT | Dead-Letter Topic |
| DORA | DevOps Research & Assessment |
| DR | Disaster Recovery |
| EBS | Elastic Block Store |
| ECR | Elastic Container Registry |
| ECS | Elastic Container Service |
| EKS | Elastic Kubernetes Service |
| ELB | Elastic Load Balancer |
| ENI | Elastic Network Interface |
| FIPS | Federal Information Processing Standards |
| FIS | (AWS) Fault Injection Simulator |
| FISMA | Federal Information Security Management Act |
| GitOps | Git as source of truth; pull-based deploy |
| GKE | Google Kubernetes Engine |
| HPA / VPA | Horizontal / Vertical Pod Autoscaler |
| HSM | Hardware Security Module |
| IAM | Identity and Access Management |
| IAST | Interactive Application Security Testing |
| IDP | Internal Developer Platform (also: Identity Provider) |
| ILM | Index Lifecycle Management (Elasticsearch) |
| IRSA | IAM Roles for Service Accounts |
| JWT | JSON Web Token |
| K8s | Kubernetes |
| KEDA | Kubernetes Event-Driven Autoscaling |
| KEK / DEK | Key Encryption Key / Data Encryption Key |
| KMS | Key Management Service |
| L4 / L7 | OSI Layer 4 (transport) / Layer 7 (application) |
| LDAP | Lightweight Directory Access Protocol |
| MDC | Mapped Diagnostic Context |
| MFA | Multi-Factor Authentication |
| MTTA / MTTD / MTTR / MTTF / MTBF | Mean Time to Acknowledge / Detect / Restore / Failure / Between Failures |
| MVCC | Multi-Version Concurrency Control |
| NACL | Network Access Control List |
| NAT | Network Address Translation |
| NIST | National Institute of Standards and Technology |
| NLB | Network Load Balancer (AWS L4) |
| NP | NetworkPolicy |
| OCI | Open Container Initiative (also: Oracle Cloud Infrastructure) |
| OIDC | OpenID Connect |
| OPA | Open Policy Agent |
| OTel | OpenTelemetry |
| OTLP | OpenTelemetry Protocol |
| PDB | Pod Disruption Budget |
| PITR | Point-In-Time Recovery |
| PKI | Public Key Infrastructure |
| PSA / PSP | Pod Security Admission / Policy |
| PV / PVC | PersistentVolume / PersistentVolumeClaim |
| QoS | Quality of Service |
| RBAC | Role-Based Access Control |
| RCA | Root Cause Analysis |
| RED | Rate, Errors, Duration (SRE dashboard framework) |
| RI / SP | Reserved Instance / Savings Plan |
| RPO / RTO | Recovery Point / Time Objective |
| RUM | Real User Monitoring |
| RWO / ROX / RWX / RWOP | ReadWriteOnce / ReadOnlyMany / ReadWriteMany / ReadWriteOncePod |
| SAST | Static Application Security Testing |
| SBOM | Software Bill of Materials |
| SCA | Software Composition Analysis |
| SCC | Security Context Constraint (OpenShift) |
| SCP | Service Control Policy (AWS Organizations) |
| SELinux | Security-Enhanced Linux (MAC module) |
| SIEM | Security Information and Event Management |
| SLA / SLI / SLO | Service Level Agreement / Indicator / Objective |
| SLSA | Supply-chain Levels for Software Artifacts |
| SPACE | Satisfaction, Performance, Activity, Collaboration, Efficiency |
| SPIFFE / SPIRE | Secure Production Identity Framework for Everyone / Runtime Environment |
| SRE | Site Reliability Engineering |
| SSA | Server-Side Apply (kubectl) |
| SSH | Secure Shell |
| SSM | Systems Manager (AWS) |
| SSO | Single Sign-On |
| STIG | Security Technical Implementation Guide (DISA) |
| STS | Security Token Service (AWS) |
| TF | Terraform |
| TLS | Transport Layer Security |
| TSDB | Time Series Database |
| TTL | Time To Live |
| TTY | Teletypewriter (terminal) |
| USE | Utilization, Saturation, Errors (SRE infra framework) |
| VPC | Virtual Private Cloud |
| WAF | Web Application Firewall |
| WASM | WebAssembly |
| XSS / CSRF / SSRF | Cross-Site Scripting / Cross-Site Request Forgery / Server-Side Request Forgery |
| ZGC | Z Garbage Collector (JVM) |

Appendix C — Further Reading

Books (in order of priority for a DevOps-focused interview)

  1. Google SRE Book (free online) — foundation.
  2. The Google SRE Workbook (free online) — implementation playbook; read after the SRE Book.
  3. Kubernetes Up & Running (Burns, Beda, Hightower) — K8s canonical.
  4. Kubernetes Patterns (Ibryam & Huß) — patterns for designing K8s-native apps.
  5. Container Security (Liz Rice) — what's actually under the namespace/cgroup hood.
  6. Terraform Up & Running (Brikman) — IaC foundation.
  7. Observability Engineering (Majors, Fong-Jones, Miranda) — Honeycomb team's pitch; shifts how you think about data.
  8. Team Topologies (Skelton & Pais) — org design for platform engineering.
  9. Accelerate (Forsgren, Humble, Kim) — the DORA research in book form.
  10. The Phoenix Project (Kim, Behr, Spafford) — the "why" of DevOps in novel form. Quick read.

Sites / Blogs

Podcasts

  • Kubernetes Podcast from Google — weekly, project + ecosystem news.
  • Software Engineering Daily — wide; good DevOps/K8s interviews.
  • The Cloud Pod — AWS-focused news.
  • Screaming in the Cloud (Corey Quinn) — AWS cost + culture; sharp.

Newsletters

  • DevOps'ish — weekly curated DevOps news.
  • Last Week in AWS (Corey Quinn) — AWS coverage + commentary.
  • CNCF Newsletter — project releases, KubeCon coverage.

Appendix D — Day-Of Interview Checklist

Parallels INTERVIEW_PREP.md's checklist with DevOps-specific additions.

The night before

  • [ ] Re-read Sections 1, 6, 10, 21, 27, 33 (culture, CI/CD, K8s arch, ArgoCD, observability foundations, SRE) — the most "default-asked" sections.
  • [ ] Know the company's stack (cloud, orchestration, observability) — their job post + engineering blog.
  • [ ] Refresh 2–3 numbers from your resume (99% deploy rate, 10k+ tx/day, 20+ interns). You will be asked to defend them.
  • [ ] Have 3 "hero stories" tight for DevOps-flavored behavioral: MQ microservice + ArgoCD rollouts, OpenShift image hardening, intern leadership program.

60 seconds before

  • [ ] Water + pen + second monitor (if remote) for scribbling diagrams.
  • [ ] Excalidraw or Miro open for whiteboard prompts.
  • [ ] Breathe.

During — framework for any question

  • Clarify first. "Before I answer — what's the scale / SLO / compliance constraint?"
  • Outline the shape. Don't rush into details; state 3–5 bullets of how you'd approach, then dive.
  • Anchor to concrete experience. Every tool mention should land on "I've done this: at work, we used X for Y, and the trade-off was Z."
  • Acknowledge trade-offs. Every architectural choice has a cost; name it.
  • Be candid about gaps. "I haven't operated GCP at scale; my answer is based on the AWS analog + reading." Interviewers trust calibrated candidates.

System design specifically

  1. Clarify (users, scale, SLOs, compliance).
  2. API / data model (2–3 min).
  3. High-level architecture (sketch 5–7 boxes).
  4. Deep-dive 1–2 components (the interviewer will pick).
  5. Failure modes — "where's the SPOF? what happens on region loss?"
  6. Scale to 10x — "what breaks first?"
  7. Rollback + observability plan.

Questions YOU ask at the end (mix 3–5)

  • What's the deployment pipeline end-to-end?
  • What observability tools does the team use? Are engineers empowered to instrument their own code?
  • What does on-call look like? Rotation, pager frequency, runbook culture?
  • Biggest DevOps challenge the team is facing right now?
  • How does the team balance feature work, tech debt, and platform work?
  • What's the testing philosophy? Integration vs unit ratio?
  • How are technical decisions made — RFCs, staff engineers, consensus?
  • What's the most interesting problem the team solved recently?
  • Anything about my background that gives you hesitation I can address?

After

  • [ ] Thank-you email within 24h — reference one thing from the conversation.
  • [ ] Write down the questions you got (and the ones you fumbled) in a running doc. Study gaps for the next round / next interview.

Good luck. You've done the prep. Trust it.


End of DEVOPS.md. Cross-reference with INTERVIEW_PREP.md for adjacent sections (Core Java, Spring, Distributed Systems, Observability) — many questions span multiple files. When in doubt, anchor to a concrete story from your work.

Last updated: