
AWS β€” Senior Engineer Study Guide ​

πŸ“ Quiz Β· πŸƒ Flashcards

Companion to INTERVIEW_PREP.md Β§10. This guide is the teaching layer: concepts explained from first principles, CLI/console snippets, gotchas, and direct ties to representative engineering work (a legacy MQ microservice, cross-team schema standardization, JAXB migration, ArgoCD rollouts, mentoring a ~20-person intern cohort).

Scope: Every AWS service and pattern a senior backend/DevOps engineer is expected to reason about on an interview whiteboard. AI/ML services (Bedrock, SageMaker) are intentionally out-of-scope for this pass.

How to use: Skim the Coverage Matrix to jump to the section answering a specific INTERVIEW_PREP Β§10 question. For open study, walk Β§1 β†’ Β§24 in order. Morning-of-interview: Β§22 Rapid-Fire.


Table of Contents ​

  1. AWS Foundations
  2. Identity & Access Management (IAM)
  3. Compute β€” EC2, Lambda, Containers, Batch
  4. Networking β€” VPC, Subnets, SG/NACL, Peering, PrivateLink
  5. DNS & Traffic Management β€” Route 53, ELB, Global Accelerator
  6. Content Delivery β€” CloudFront
  7. Storage β€” S3, EBS, EFS, FSx
  8. Databases β€” RDS, Aurora, DynamoDB, ElastiCache
  9. Messaging & Streaming β€” SQS, SNS, EventBridge, MQ, MSK, Kinesis
  10. Serverless & Integration β€” Lambda, API Gateway, Step Functions
  11. Containers on AWS β€” ECR, ECS, EKS
  12. Observability β€” CloudWatch, X-Ray, CloudTrail, Config
  13. Security Services β€” KMS, Secrets, WAF, GuardDuty
  14. CI/CD on AWS β€” CodePipeline, CodeBuild, CodeDeploy
  15. Infrastructure as Code β€” CloudFormation, CDK, SAM, Terraform
  16. Cost Management
  17. Well-Architected Framework
  18. High Availability & Disaster Recovery
  19. Migration & Modernization
  20. Federal / Compliance Context
  21. Connect to Your Experience
  22. Rapid-Fire Review
  23. Practice Exercises
  24. Further Reading

Interview-Question Coverage Matrix ​

Maps each INTERVIEW_PREP.md Β§10 question (1–10) to the section(s) in this guide that answer it. Cross-references into other INTERVIEW_PREP sections are noted where the AWS realization is the concrete form of a broader concept.

| Q# | Topic | Section |
|----|-------|---------|
| 1 | EC2 vs ECS vs EKS vs Lambda — when pick each | §3, §10, §11 |
| 2 | IAM role vs user vs policy; IRSA on EKS | §2, §11 |
| 3 | S3 — storage classes, lifecycle, versioning, policies vs ACLs | §7 |
| 4 | S3 pre-signed URL | §7 |
| 5 | SQS vs SNS vs EventBridge — when pick each | §9 |
| 6 | SQS — Standard vs FIFO | §9 |
| 7 | VPC — public/private subnet, NAT, SG vs NACL | §4 |
| 8 | CloudFront with S3 / with ALB | §6 |
| 9 | Cost monitoring | §16 |
| 10 | Migrate on-prem Spring Boot to AWS | §3, §11, §18, §19 |

Cross-refs: Β§4 Messaging β†’ Β§9 AWS Messaging. Β§5 Microservices β†’ Β§17 Well-Architected + Β§18 HA/DR. Β§9 DevOps β†’ Β§11 EKS + Β§14 CI/CD + Β§15 IaC. Β§11 Observability β†’ Β§12 CloudWatch/X-Ray. Β§12 Security β†’ Β§2 IAM + Β§13 Security services.


1. AWS Foundations ​

Global infrastructure ​

AWS is organized as a hierarchy. You need the mental model cold β€” every design question hinges on it.

  • Region β€” an independent geographic area (e.g., us-east-1, us-gov-west-1). Two regions never share infrastructure. Services, pricing, and even API endpoints are region-scoped. Data does not leave a region unless you explicitly replicate it (S3 CRR, DynamoDB Global Tables, Aurora Global Database).
  • Availability Zone (AZ) β€” one or more physically isolated data centers within a region, connected by low-latency links. A region has β‰₯ 3 AZs (typically 3–6). AZs are the unit of HA: deploying across AZs survives a data-center failure.
  • Edge Location / PoP β€” CloudFront / Global Accelerator / Route 53 endpoints. Hundreds of them, distinct from regions.
  • Local Zone / Wavelength / Outposts β€” extensions of a region into metros, 5G networks, or on-prem racks. Niche but sometimes asked.

Gotcha: AZ names (us-east-1a, us-east-1b, …) are account-scoped aliases. Your us-east-1a and another account's us-east-1a may be different physical AZs. Use the AZ ID (use1-az1) when coordinating shared resources across accounts.

Shared Responsibility Model ​

A frequent but easy interview topic. AWS ⇄ customer responsibilities differ by service model:

| Layer | AWS | Customer |
|-------|-----|----------|
| Physical / hypervisor | ✔ | |
| Network substrate / DC | ✔ | |
| Managed-service OS (RDS, Lambda) | ✔ | |
| Guest OS on EC2 | | ✔ |
| App code, data, IAM config | | ✔ |
| Encryption key control | shared | shared |

Shorthand: AWS owns "security OF the cloud"; you own "security IN the cloud." For IAM and network config, you are 100% responsible β€” misconfigurations are the #1 source of public breaches.

Pricing model (enough to pass smell tests) ​

AWS bills mostly per-second or per-request. Four dimensions dominate:

  1. Compute time β€” EC2 by instance-hour (1-second granularity for most modern families); Lambda by GB-second.
  2. Storage β€” S3 by GB-month + per-request + per-GB egress.
  3. Data transfer β€” egress is the hidden cost: out of AWS, out of region, and often cross-AZ. Ingress is almost always free.
  4. Managed-service fees — per-endpoint-hour (NAT Gateway, ALB, VPC interface endpoints).

Purchasing models for steady workloads: Savings Plans (flexible, 1/3-yr commit, covers EC2/Lambda/Fargate) > Reserved Instances (legacy, tied to an instance family) > Spot (up to 90% discount, interruptible) > On-Demand (full price).
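To make the GB-second dimension concrete, here is a back-of-envelope cost model. The per-GB-second and per-request rates below are assumed placeholders for illustration, not AWS's current price list.

```java
// Back-of-envelope Lambda cost model. PRICE_* constants are ASSUMED
// placeholder rates, not AWS's published pricing.
public class LambdaCost {
    static final double PRICE_PER_GB_SECOND = 0.0000166667; // assumed rate
    static final double PRICE_PER_REQUEST   = 0.0000002;    // assumed rate

    // memoryMb and avgMillis describe one invocation's configuration/duration.
    static double monthlyCost(long invocations, int memoryMb, int avgMillis) {
        double gbSeconds = invocations * (memoryMb / 1024.0) * (avgMillis / 1000.0);
        return gbSeconds * PRICE_PER_GB_SECOND + invocations * PRICE_PER_REQUEST;
    }

    public static void main(String[] args) {
        // e.g. 10M invocations/month at 512 MB, 120 ms average
        System.out.printf("%.2f%n", monthlyCost(10_000_000L, 512, 120));
    }
}
```

The takeaway: memory and duration multiply, so halving either halves the compute bill — which is why right-sizing memory (CPU scales with it) is the first Lambda cost lever.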

CLI / SDK / API ​

Every AWS resource is ultimately a JSON API call. The CLI, SDKs, CloudFormation, Terraform, and the console are all just HTTPS clients.

bash
aws sts get-caller-identity           # who am I?
aws ec2 describe-instances --region us-east-1
aws s3 cp ./file.txt s3://my-bucket/   # high-level S3 wrapper
aws s3api put-object ...                # low-level 1:1 with API

Credential precedence (CLI v2): env vars (AWS_ACCESS_KEY_ID) β†’ CLI --profile β†’ ~/.aws/credentials β†’ ~/.aws/config β†’ container/role credentials (IMDSv2, ECS task role, EKS IRSA). Memorize this β€” debugging "wrong credentials" problems starts here.

Gotcha: IMDSv2 (session-token-based) is the default on new AMIs. An IMDSv1-only SDK call from an old Ruby or Python runtime will fail with 401s. Fix: upgrade the SDK or explicitly enable IMDSv1 on the instance (not recommended).


2. Identity & Access Management (IAM) ​

IAM is the single most asked AWS topic. Expect a whiteboard policy question.

Principals, policies, resources ​

  • Principal — anything that can make an AWS API call: an IAM user, an IAM role (assumed by any user, service, or workload its trust policy allows), an AWS service, or a federated identity (SAML/OIDC).
  • Identity policy β€” attached to a principal (user/role/group). "What can this identity do?"
  • Resource policy β€” attached to a resource (S3 bucket, KMS key, SQS queue, Lambda function, …). "Who can do what to this resource?" Resource policies can grant access cross-account without the other side having to add you to their IAM.
  • Permissions boundary β€” a ceiling on what a role's identity policies can grant. Used when delegating "create-role" rights to developers without letting them privilege-escalate.
  • SCP (Service Control Policy) β€” org-level guardrail in AWS Organizations. Intersects with identity + resource policies; can only deny/allow at the top, never grants.
  • Session policy β€” passed inline to AssumeRole; narrows the assumed session further.

Policy evaluation logic (memorize this) ​

The evaluation is not left-to-right. It is a set intersection:

  1. Default deny. No policy β†’ no access.
  2. Any explicit Deny wins. An SCP deny, a resource-policy deny, an identity-policy deny β€” any of them stops the call.
  3. Otherwise the union of Allows from identity + resource + session policies, intersected with permissions boundaries and SCPs, decides access.
  4. If nothing allows, default deny.

Classic gotcha: same-account, an explicit Allow in either the identity policy or the resource policy is enough — they union. Cross-account, both ends must allow: the caller's identity policy in their account and the resource policy on the target. KMS is stricter still: the key policy must grant access even within the same account (S3, Lambda, SNS, SQS are the permissive ones).
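A minimal sketch of the evaluation order above, with action strings simplified — real IAM also matches resources, wildcards, and condition keys:

```java
import java.util.Set;

// Toy model of IAM policy evaluation (same-account): explicit deny wins,
// then the union of Allows must survive the permissions-boundary and SCP
// intersection. Actions are plain strings here; real IAM is far richer.
public class PolicyEval {
    static boolean allowed(String action,
                           Set<String> explicitDenies,
                           Set<String> identityAllows,
                           Set<String> resourceAllows,
                           Set<String> boundary,   // null = no boundary set
                           Set<String> scp) {      // null = no SCP applies
        if (explicitDenies.contains(action)) return false;   // 1. any Deny wins
        boolean anyAllow = identityAllows.contains(action)
                        || resourceAllows.contains(action);  // 2. union of Allows
        if (!anyAllow) return false;                         // 3. default deny
        if (boundary != null && !boundary.contains(action)) return false; // intersect
        if (scp != null && !scp.contains(action)) return false;           // intersect
        return true;
    }
}
```

Note how a permissions boundary never grants anything — it can only shrink what the Allows already cover.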

Users vs roles ​

  • User β€” long-lived credentials (password, access key pair). Bind to a human (or, historically, a service). Avoid access keys for workloads; they're a leak magnet.
  • Role β€” no long-lived credentials. Assumed by a principal via STS, issuing temporary credentials (15 min – 12 h). Use for: EC2 instance profiles, Lambda execution roles, cross-account access, federated login, IRSA on EKS.

Interview gold: "We eliminated all IAM user access keys for services β€” every workload assumes a role. Rotation and leak blast-radius dropped to hours." Tie this to your ArgoCD/EKS work.

STS and AssumeRole ​

sts:AssumeRole exchanges a caller's identity for temporary credentials for a role:

bash
aws sts assume-role \
  --role-arn arn:aws:iam::111122223333:role/ProdReader \
  --role-session-name readonly-session
# returns AccessKeyId, SecretAccessKey, SessionToken, Expiration

Flavors: AssumeRole (IAM β†’ role), AssumeRoleWithSAML (SAML federation), AssumeRoleWithWebIdentity (OIDC β€” this is what IRSA uses under the hood).

IRSA β€” IAM Roles for Service Accounts (EKS) ​

A pod in an EKS cluster needs AWS credentials (say, to read S3). The modern answer is IRSA:

  1. EKS cluster has an OIDC provider registered with IAM.
  2. An IAM role trusts that OIDC provider + a specific ServiceAccount name in a specific namespace.
  3. Pods using that ServiceAccount receive a projected JWT (/var/run/secrets/eks.amazonaws.com/serviceaccount/token).
  4. The AWS SDK detects the token via AWS_ROLE_ARN + AWS_WEB_IDENTITY_TOKEN_FILE env vars and calls AssumeRoleWithWebIdentity automatically β€” no static keys.

Alternatives:

  • Pod Identity (2023+) β€” newer, no cluster-wide OIDC, simpler setup. Prefer for new clusters.
  • kiam / kube2iam β€” legacy, avoid.
  • Node IAM role β€” too broad; every pod on the node shares it. Anti-pattern.
yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: payments
  namespace: prod
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::111122223333:role/payments-app

Gotcha: the IAM role's trust policy must name the exact namespace + service account, e.g. "system:serviceaccount:prod:payments". Typos fail silently β€” the pod just gets permission-denied.

Multi-factor, password policy, conditions ​

  • MFA β€” require on root, every IAM user, and for sensitive API calls (aws:MultiFactorAuthPresent).
  • Condition keys β€” the expressive escape hatch. aws:SourceIp, aws:PrincipalOrgID, aws:RequestTag/*, s3:prefix. Use them aggressively β€” least privilege is about conditions, not just actions.
json
{
  "Effect": "Allow",
  "Action": "s3:GetObject",
  "Resource": "arn:aws:s3:::photos/*",
  "Condition": {
    "StringEquals": { "aws:PrincipalTag/dept": "marketing" },
    "IpAddress": { "aws:SourceIp": "203.0.113.0/24" }
  }
}

Access Analyzer, credential report, Access Advisor ​

  • IAM Access Analyzer β€” flags resource policies that grant cross-account or public access you may not have intended. Run it; review findings weekly.
  • Credential report β€” CSV of every user + access-key age + MFA status. Script it into compliance.
  • Access Advisor β€” per-role "service last accessed" view. Use to right-size over-permissive roles.

Compliance callout (inline) ​

For regulated workloads (PCI, HIPAA, SOC 2, etc.), AWS expects:

  • FIPS 140-2/140-3 validated endpoints (e.g., kms-fips.us-gov-west-1.amazonaws.com).
  • MFA on all human principals that can touch sensitive data.
  • CloudTrail + S3 Object Lock + KMS for an immutable audit trail.
  • AWS GovCloud (US) for bulk unclassified-sensitive workloads with FedRAMP High requirements.

3. Compute β€” EC2, Lambda, Containers, Batch ​

INTERVIEW_PREP Β§10 Q1 β€” "EC2 vs ECS vs EKS vs Lambda β€” when pick each?" ultimately resolves into operational responsibility, latency/cost profile, and team maturity.

Decision tree (memorize this) ​

  • Truly custom OS, long-running, GPU-specific, or licensed software? β†’ EC2
  • Event-driven, sub-15-minute, unpredictable traffic, zero-ops preferred? β†’ Lambda
  • Containerized app, no Kubernetes expertise, AWS-native stack? β†’ ECS (on Fargate)
  • Containerized app, Kubernetes already standard, multi-cloud portability? β†’ EKS
  • Batch / periodic / HPC? β†’ AWS Batch (wraps ECS/EKS/EC2 spot fleets)
  • Monolith lift-and-shift with zero refactor? β†’ Elastic Beanstalk or App Runner (for a single container app + HTTPS + scale, without the ECS ceremony)

EC2 essentials ​

  • AMI β€” machine image (OS + preinstalled packages). Snapshot a running instance to bake a custom AMI.
  • Instance family letters β€” memorize: M (general), C (compute), R (memory), T (burstable), I/D (storage), G/P (GPU), A (ARM/Graviton), Mac (yes, real).
  • Generations — the digit is the family generation, the suffix the processor: c7i (7th gen, Intel), c7g (7th gen, Graviton ARM — ~20% cheaper, ~40% better perf/W for most Java apps).
  • Placement groups β€” cluster (low-latency, same-AZ), spread (max 7 per AZ, fault-isolated), partition (big distributed systems like Kafka/Cassandra).
  • Purchase options β€” On-Demand / Reserved / Savings Plans / Spot / Dedicated Hosts (licensing, compliance).

Spot gotcha: Spot interruption is a 2-min warning, delivered via IMDS. Apps must drain cleanly. For stateless consumers this is easy; for stateful (Kafka brokers) it's a nightmare β€” use On-Demand/RIs for stateful tiers.
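A sketch of a consumer loop that honors the two-minute warning. The interruption check is injected as a BooleanSupplier so the pattern is testable; in production you would poll IMDSv2's spot/instance-action metadata instead.

```java
import java.util.function.BooleanSupplier;

// Drain pattern for Spot: stop taking new work as soon as the interruption
// notice appears, then flush state within the ~2-minute window. The
// 'interrupted' supplier stands in for an IMDSv2 metadata poll.
public class SpotDrainLoop {
    static int process(Iterable<String> work, BooleanSupplier interrupted) {
        int done = 0;
        for (String item : work) {
            if (interrupted.getAsBoolean()) break; // notice seen: drain, don't start new work
            done++;                                // handle item, then ack/commit it
        }
        // flush offsets / close connections here — the 2-minute budget covers this
        return done;
    }
}
```

For stateless SQS/Kafka consumers this is the whole story; stateful tiers can't drain this cheaply, which is why they stay on On-Demand/RIs.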

Lambda β€” the serverless compute unit ​

A function + a config triggered by an event source. Key knobs:

| Knob | Limits / defaults |
|------|-------------------|
| Memory | 128 MB – 10,240 MB (CPU scales linearly with memory) |
| Timeout | Max 15 min |
| Package size | 50 MB zipped / 250 MB unzipped / 10 GB for container-image Lambdas |
| /tmp | 512 MB – 10,240 MB ephemeral |
| Concurrency | 1,000 per region by default; raise via quota |
| Reserved concurrency | Reserve a slice + cap the function's max concurrency |
| Provisioned concurrency | Pre-warmed execution envs for cold-start-sensitive paths |

Cold starts β€” first invocation after idle requires runtime init + VPC ENI attach (historically the big one) + your new SpringApplication(). Mitigations: SnapStart (Java β€” 10Γ— faster init), provisioned concurrency, keep VPC-attachment to a minimum, use ARM/Graviton.

Java on Lambda specifically: Spring Boot starts slow; spring-cloud-function + SnapStart + GraalVM native-image are your levers. For true sub-100ms cold starts, Quarkus + native is the common answer.
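The core mitigation pattern, independent of framework: do expensive wiring at class-init time so warm invocations reuse it. The Handler interface below is a stand-in for the real Lambda RequestHandler so this sketch compiles without AWS dependencies.

```java
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

// Cold-start pattern sketch: pay for heavy init once per execution
// environment (at class load), not once per invocation. This class-init
// work is exactly what SnapStart snapshots and restores.
public class WarmHandler {
    // Stand-in for com.amazonaws.services.lambda.runtime.RequestHandler.
    interface Handler { String handle(Map<String, String> event); }

    static final AtomicInteger INITS = new AtomicInteger();

    // Simulated heavy init (DB pool, Spring context, SDK clients).
    static final String CONFIG = expensiveInit();

    static String expensiveInit() {
        INITS.incrementAndGet(); // counts how many times init actually ran
        return "pool+clients ready";
    }

    static final Handler HANDLER =
        event -> CONFIG + ": " + event.getOrDefault("name", "?");

    public static void main(String[] args) {
        HANDLER.handle(Map.of("name", "a")); // cold: init already done at class load
        HANDLER.handle(Map.of("name", "b")); // warm: reuses CONFIG, no re-init
    }
}
```

Anything created in the handler method body instead is rebuilt on every invocation — the most common self-inflicted Java Lambda latency bug.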

Interview answer pattern: "Lambda is great for glue code, event-driven fan-out, and spiky workloads. I wouldn't put a 100ms-SLA synchronous Spring Boot service on it β€” cold starts and the 15-min cap aren't worth fighting. I'd use EKS or App Runner for that."

ECS β€” AWS-native containers ​

Primitives:

  • Task definition β€” the "Dockerfile + deployment config" (image URI, CPU/mem, env, secrets, logging).
  • Task β€” a running instance of a task definition.
  • Service β€” keeps N tasks running, integrates with an ALB/NLB target group.
  • Cluster β€” logical grouping; has a capacity provider (EC2 Auto Scaling Group, or Fargate).

Fargate = serverless containers. You hand ECS/EKS a task definition; AWS runs it on infrastructure you never see. No EC2 to patch. You pay a premium per vCPU-hour for that.

EKS β€” managed Kubernetes ​

EKS runs the Kubernetes control plane. You run data-plane nodes (self-managed ASG, managed node groups, or Fargate profiles) + install add-ons (aws-load-balancer-controller, cluster-autoscaler or Karpenter, external-dns, cert-manager).

Day-2 pieces worth knowing:

  • IRSA / Pod Identity β€” Β§2.
  • Karpenter β€” newer, faster, bin-packing-aware autoscaler (replaces Cluster Autoscaler); provisions nodes in seconds directly.
  • VPC CNI β€” pods get real ENIs on the VPC. ENI limit per instance caps pod density.
  • EKS Anywhere / EKS Distro β€” run the same Kubernetes on-prem.

Batch / Beanstalk / App Runner ​

  • Batch β€” submit jobs (compute needs + container image); Batch provisions EC2/Fargate/Spot, runs the job, tears down. Good for nightly ETL.
  • Elastic Beanstalk β€” opinionated PaaS: upload a WAR/JAR, get an ALB + ASG + CloudWatch. Mostly legacy; EKS/ECS/App Runner have displaced it.
  • App Runner β€” container + HTTPS + autoscaling in minutes, with private-VPC egress and custom domains. Think "Fargate without the YAML." Great migration target for a lone Spring Boot app.

INTERVIEW_PREP Β§10 Q7. Be ready to whiteboard a public-private-database 3-subnet layout in under 3 minutes.

VPC β€” the virtual network boundary ​

A VPC is a private IP space in a region (e.g., 10.0.0.0/16). Resources in the VPC can talk to each other; anything else requires explicit plumbing.

Subnet β€” an AZ-scoped slice of the VPC CIDR (e.g., 10.0.1.0/24 in us-east-1a). A subnet is:

  • Public if its route table has 0.0.0.0/0 β†’ igw-... (Internet Gateway).
  • Private if it has no IGW default route β€” instances cannot accept unsolicited inbound internet traffic.

Typical 3-tier layout per AZ:

us-east-1a:  10.0.1.0/24  public   (ALB, NAT GW)
             10.0.11.0/24 private  (app tier: EKS nodes, EC2)
             10.0.21.0/24 data     (RDS, ElastiCache β€” no internet)
us-east-1b:  10.0.2.0/24  public
             10.0.12.0/24 private
             10.0.22.0/24 data
us-east-1c:  ...

Three AZs for HA; some services (Aurora Serverless v2, OpenSearch) require at least three subnets.

Route tables, Internet Gateway, NAT Gateway ​

  • Internet Gateway (IGW) β€” one per VPC, symmetric (allows egress and ingress for public IPs / Elastic IPs). Free.
  • NAT Gateway β€” per-AZ, egress-only. Lets private-subnet instances reach the internet (e.g., apt install, S3 over public endpoint) without being reachable inbound. Priced per hour + per GB β€” $$ if you're sloppy.
  • Egress-only IGW β€” IPv6 equivalent of NAT; no charge beyond data.

Gotcha (bitten often in prod): NAT Gateways are per-AZ. If you put one NAT in us-east-1a and route all three private subnets' 0.0.0.0/0 through it, an AZ-a outage takes down egress for the entire app β€” and cross-AZ NAT traffic costs more than same-AZ. Deploy one NAT per AZ.

Security Groups (SG) vs Network ACLs (NACL) ​

Two separate firewalls. Both apply.

| | Security Group | NACL |
|---|----------------|------|
| Scope | ENI (instance-level) | Subnet-level |
| Stateful? | Yes — return traffic auto-allowed | No — rules for in AND out |
| Rules | Allow-only (no deny) | Allow + Deny |
| Evaluation | All rules union | First-match by rule number |
| Default | Deny all inbound / Allow all outbound | Allow all in + out |
| Reference | Can reference other SG IDs | IP CIDRs only |

Interview-ready distinction: "SGs are stateful and identity-aware β€” I can say 'allow port 5432 from the app-tier SG.' NACLs are stateless, subnet-wide, and used mostly for coarse blocks (block a known-bad IP range at the subnet boundary)."
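A toy model of the NACL side of that distinction — stateless, first-match by ascending rule number, implicit final deny. CIDR matching is reduced to a string-prefix check for brevity:

```java
import java.util.Comparator;
import java.util.List;

// Toy NACL evaluator: the lowest-numbered matching rule wins, and anything
// unmatched hits the implicit final deny (the "*" rule). Contrast with an
// SG, which is a pure union of Allows with return traffic auto-allowed.
public class NaclEval {
    record Rule(int number, String cidrPrefix, int port, boolean allow) {}

    static boolean evaluate(List<Rule> rules, String ip, int port) {
        return rules.stream()
                .sorted(Comparator.comparingInt(Rule::number))
                .filter(r -> ip.startsWith(r.cidrPrefix) && r.port() == port)
                .findFirst()          // first match by rule number wins
                .map(Rule::allow)
                .orElse(false);       // implicit deny
    }
}
```

Because evaluation stops at the first match, a low-numbered Deny (say, rule 50 blocking a bad CIDR) overrides a broader Allow at rule 100 — the coarse-block use case from the table.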

VPC Peering vs Transit Gateway vs VPC Lattice ​

  • VPC Peering β€” 1:1 link between two VPCs. No transitive routing (A↔B and B↔C does not give A↔C). CIDRs must not overlap. Cheap.
  • Transit Gateway (TGW) β€” cloud router: N-to-N hub for VPCs + on-prem VPN/DX attachments. Route tables per attachment. Pays per attachment + GB.
  • VPC Lattice (GA 2023, expanded 2025) β€” application-layer service mesh across VPCs/accounts. Takes care of service discovery, IAM auth, and observability without running a sidecar proxy. Good answer for "how would you expose a Spring Boot service across 10 teams' VPCs without managing Envoy?"
  • PrivateLink + VPC Endpoints β€” expose a single service (yours or AWS's) into another VPC via ENIs in that VPC. No peering, no overlap concerns, no transitive exposure. Two flavors:
    • Interface endpoint β€” ENI, DNS-based, for most AWS APIs + your own services.
    • Gateway endpoint β€” route-table entry, only for S3 and DynamoDB. Free. Always use a gateway endpoint for S3 β€” it removes S3 traffic from NAT (huge cost win).

VPN, Direct Connect, Client VPN ​

  • Site-to-Site VPN β€” IPsec from your DC/branch to a VPG or TGW. Minutes to set up, up to ~1.25 Gbps per tunnel.
  • Direct Connect (DX) β€” dedicated fiber from your DC to an AWS DX location (1/10/100 Gbps). Weeks to provision, lower latency, predictable.
  • Client VPN β€” OpenVPN-based per-user access into a VPC. Use for admin access; prefer SSM Session Manager for day-to-day instance shell access.

DNS inside the VPC ​

  • AmazonProvidedDNS at .2 of the VPC CIDR resolves public names + internal ec2.internal / Route 53 private hosted zones.
  • Route 53 Resolver endpoints β€” bridge DNS across VPC ↔ on-prem (inbound and outbound resolvers).

5. DNS & Traffic Management β€” Route 53, ELB, Global Accelerator ​

Route 53 ​

AWS's managed DNS. Two flavors:

  • Public hosted zone β€” authoritative DNS for your domain (example.com).
  • Private hosted zone β€” split-horizon DNS scoped to one or more VPCs.

Routing policies β€” the part interviewers probe:

| Policy | Behavior | Use case |
|--------|----------|----------|
| Simple | One record → one answer | Default |
| Weighted | Split traffic by weight | Canary releases |
| Latency | Serve from the region with lowest RTT to the resolver | Multi-region apps |
| Failover | Primary + secondary with health check | Active/passive DR |
| Geolocation | Answer by the user's country | Compliance, localized content |
| Geoproximity | Bias by geographic distance + bias value (via Traffic Flow) | Gradual region weighting |
| Multivalue answer | Up to 8 healthy answers returned | Poor-man's LB for internal services |

Health checks β€” endpoint / calculated / CloudWatch-alarm-based. Drive failover records and Multivalue.

Gotcha: Route 53 serves answers with the TTL you set, but upstream resolvers (browsers, ISPs, corporate DNS) may cache longer than that — so failover is eventual, not instant. Set short TTLs (60 s for failover records) and architect for clients that keep resolving stale answers for a while.

Elastic Load Balancing β€” four flavors ​

| Type | Layer | Use |
|------|-------|-----|
| ALB — Application LB | L7 (HTTP/HTTPS/gRPC/WebSocket) | Most web workloads; path/host-based routing; native integration with Cognito/OIDC; WAF |
| NLB — Network LB | L4 (TCP/UDP/TLS) | Static IPs, ultra-high PPS, preserve client IP, non-HTTP protocols |
| GWLB — Gateway LB | L3 (IP packets) | Inline appliance chains (firewalls, IDS) — you will rarely need this |
| CLB — Classic LB | L4+L7 (legacy) | Don't use for new work |

ALB features worth knowing: target groups (instance / IP / Lambda), sticky sessions (cookie-based), redirect rules, authentication actions (Cognito/OIDC), rate-based WAF rules, gRPC support (with HTTP/2).

Global Accelerator ​

Static anycast IPs fronted by AWS's backbone, routing to regional endpoints (ALB/NLB/EC2). Benefits: faster first-byte (AWS backbone vs. best-effort internet), instant regional failover, DDoS protection (Shield Standard bundled).

When to choose GA over CloudFront: you have non-HTTP traffic, need static IPs for enterprise firewalls, or need sub-second regional failover. CloudFront still wins for cacheable HTTP.


6. Content Delivery β€” CloudFront ​

INTERVIEW_PREP Β§10 Q8. CloudFront sits at edge PoPs (600+ worldwide) and caches content close to users.

Origins, behaviors, distributions ​

A distribution is the logical CDN endpoint (dXXXXX.cloudfront.net, or your custom domain). It points at one or more origins:

  • S3 origin β€” static assets, video, firmware. Use Origin Access Control (OAC) (replaces legacy OAI) to lock the bucket so only CloudFront can read it.
  • ALB / EC2 / custom origin β€” dynamic backends. CloudFront terminates TLS and proxies.
  • Lambda function URL / API Gateway / MediaPackage β€” other AWS-native origins.

Cache behaviors β€” per-path rules. You can have one distribution that:

  • Caches /static/* from S3 with 1-year TTL and no cookies forwarded.
  • Sends /api/* to an ALB, with TTL=0 and all headers forwarded (acts as a reverse proxy with DDoS protection).

Cache keys, policies, invalidation ​

Old model: "forward these headers/cookies/query strings" β€” error-prone. Modern model uses:

  • Cache policy β€” what goes into the cache key (headers, query strings, cookies, compressed variants).
  • Origin request policy β€” what gets forwarded upstream (can exceed the cache key).
  • Response headers policy β€” CORS / Strict-Transport-Security / Permissions-Policy added at the edge.

Invalidation β€” aws cloudfront create-invalidation --paths "/*" β€” 1000 paths/month free, then $$ per path. Prefer versioned filenames (app.abc123.js) over invalidations.

Signed URLs / signed cookies ​

Restrict private content to authenticated users:

  • Signed URL β€” single-resource link with expiry. Good for one-click downloads.
  • Signed cookies β€” scope an expiry/IP to an entire path; good for session-gated streaming.

Separately, S3 pre-signed URLs (Β§7) sign directly to S3 β€” no CloudFront required. Decide based on whether you need CDN performance on top of access control.

Edge compute β€” Lambda@Edge vs CloudFront Functions ​

| Feature | Lambda@Edge | CloudFront Functions |
|---------|-------------|----------------------|
| Runtime | Node.js / Python | JavaScript (V8 isolate) |
| Max duration | 5 s (viewer req) / 30 s (origin) | 1 ms |
| Max memory | 128–10,240 MB | 2 MB |
| Use | URL rewrites, auth, A/B | JWT verify at edge, URL rewrites, header munging |

Rule of thumb: start with CloudFront Functions (far cheaper/faster); fall back to Lambda@Edge only when you need the SDK or larger code.

Interview answer β€” CloudFront + S3 vs CloudFront + ALB ​

  • S3 origin: static site or static assets. Use OAC to block direct-bucket access. Put your ACM cert on CloudFront (it must live in us-east-1 for CF). Set long TTLs + versioned filenames.
  • ALB origin: dynamic API or SPA. TTL=0 for /api/*, cache /static/* with the SPA build hash. Shield Standard bundled; add WAF ACL for SQLi/XSS/rate limits.

7. Storage β€” S3, EBS, EFS, FSx ​

INTERVIEW_PREP Β§10 Q3, Q4. S3 is the most-asked AWS service after IAM. Know it cold.

S3 β€” object storage ​

  • Bucket — globally unique name, region-bound after creation. 100 per account by default (a soft limit — raise it via Service Quotas).
  • Object β€” up to 5 TB. Uploaded directly (≀ 5 GB) or via multipart upload (> 5 GB, or anytime you want parallel upload + resumability). Multipart upload initiates β†’ parts uploaded in parallel β†’ complete-multipart-upload combines them. Abort multipart uploads in lifecycle policy or orphaned parts accrue cost indefinitely.
  • Key β€” the object name. "Folders" are a console fiction; keys are flat strings. Listing is lexicographic.

Storage classes (know when to use each) ​

| Class | Retrieval | Cost profile | Use |
|-------|-----------|--------------|-----|
| S3 Standard | ms | $$ | Hot data, primary storage |
| S3 Intelligent-Tiering | ms (auto-tiered) | $$ + monitoring fee | Unknown/variable access — let S3 tier it |
| S3 Standard-IA | ms | $ storage, $$ per retrieval | Infrequently accessed, still needs ms access |
| S3 One Zone-IA | ms | cheaper IA | Re-creatable data (thumbnails, derivatives) |
| S3 Glacier Instant Retrieval | ms | cheap storage, ¢/GB retrieval | Archive with occasional fast reads |
| S3 Glacier Flexible Retrieval | minutes–hours | cheaper | Compliance archive, backups |
| S3 Glacier Deep Archive | 12 h | cheapest | 7–10-yr retention |

Each class has a minimum billing duration (30/90/180 days for IA/Glacier variants). Deleting early = full-duration charge. Relevant when designing lifecycle transitions.
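The billing rule for early deletion reduces to a max():

```java
// Minimum-duration billing: IA/Glacier classes charge for at least their
// minimum (30/90/180 days), regardless of when the object is deleted.
public class EarlyDelete {
    static long billedDays(long actualDays, long minimumDays) {
        return Math.max(actualDays, minimumDays);
    }

    public static void main(String[] args) {
        // object transitioned to Standard-IA (30-day minimum), deleted after 10 days
        System.out.println(billedDays(10, 30)); // still billed for the full 30
    }
}
```

This is why lifecycle transitions should only move objects that will actually sit in the cheaper class past its minimum.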

Lifecycle policies ​

Declarative rules to transition objects between classes and expire them. Great for log retention:

json
{
  "Rules": [{
    "Filter": { "Prefix": "logs/" },
    "Status": "Enabled",
    "Transitions": [
      { "Days": 30,  "StorageClass": "STANDARD_IA" },
      { "Days": 90,  "StorageClass": "GLACIER" }
    ],
    "Expiration": { "Days": 365 }
  }]
}

Versioning, replication, Object Lock ​

  • Versioning β€” every PUT creates a new version; DELETE creates a "delete marker" above the latest version. Required for Cross-Region Replication and Object Lock.
  • Replication — CRR (cross-region) or SRR (same-region; both support cross-account destinations). Asynchronous. For synchronous "multi-region write" you're in DynamoDB/Aurora territory, not S3.
  • Object Lock β€” WORM (write-once-read-many). Governance mode (privileged users can bypass) vs Compliance mode (no one can delete, not even root). SEC 17a-4 / regulated-compliance target. Turn on at bucket creation (cannot enable later without support).

Access control β€” the modern answer ​

Use bucket policies + block-public-access (on by default since 2023). Avoid ACLs entirely (BucketOwnerEnforced setting disables them). For cross-account: bucket policy + IAM on the caller; for same-account: IAM alone is enough.

json
{
  "Statement": [{
    "Sid": "AllowCloudFrontOAC",
    "Effect": "Allow",
    "Principal": { "Service": "cloudfront.amazonaws.com" },
    "Action": "s3:GetObject",
    "Resource": "arn:aws:s3:::assets-prod/*",
    "Condition": {
      "StringEquals": {
        "AWS:SourceArn": "arn:aws:cloudfront::111:distribution/E1234"
      }
    }
  }]
}

S3 pre-signed URLs ​

A URL with a time-limited signature granting GetObject or PutObject on a specific key. Server (with IAM credentials) generates the URL; clients upload/download directly to S3 β€” bypasses your app as a data pipe.

java
import java.time.Duration;
import software.amazon.awssdk.services.s3.presigner.S3Presigner;
import software.amazon.awssdk.services.s3.presigner.model.PresignedPutObjectRequest;

S3Presigner presigner = S3Presigner.create();
PresignedPutObjectRequest req = presigner.presignPutObject(p ->
    p.signatureDuration(Duration.ofMinutes(10))      // URL valid for 10 minutes
     .putObjectRequest(o -> o.bucket("uploads").key("user42/avatar.png")));
String uploadUrl = req.url().toString();
presigner.close();
// client then does:  curl -X PUT --upload-file avatar.png "$uploadUrl"

Use cases: file upload from browsers without streaming through the app; email-able download links; mobile clients that shouldn't embed AWS credentials.

Gotcha: the signer must have s3:PutObject itself β€” you can't pre-sign privileges you don't have. And the URL is only as secure as the channel you send it over; treat it as a bearer token.

Consistency, performance, partitioning ​

S3 is strongly read-after-write consistent (since Dec 2020) for all operations β€” PUT-then-GET returns the new object, overwrite-then-GET returns the new, LIST reflects the latest state.

Per-prefix scaling: 3,500 PUT/COPY/POST/DELETE and 5,500 GET/HEAD per second per prefix. Spread write-heavy keys across multiple prefixes (e.g., hash-prefix pattern) if you exceed that.
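A sketch of the hash-prefix pattern; the two-hex-char shard layout and the prefix count are arbitrary illustration choices, not an S3 requirement.

```java
// Hash-prefix sketch for write-heavy S3 keys: derive a short, stable shard
// prefix from the key so load spreads across N prefixes, each with its own
// 3,500-write / 5,500-read per-second budget.
public class S3KeySharder {
    static String shardedKey(String key, int prefixes) {
        int shard = Math.floorMod(key.hashCode(), prefixes); // stable per key
        return String.format("%02x/%s", shard, key);
        // e.g. "0a/logs/2026/04/17/x.gz" — listing a logical "folder" now
        // requires enumerating all shard prefixes.
    }
}
```

The trade-off is that lexicographic listing breaks across shards — fine for write-heavy event data, wrong for keys you browse by prefix.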

EBS β€” block storage for EC2 ​

Per-AZ (not regional). Types:

  • gp3 (default) β€” SSD, baseline IOPS + throughput, can exceed with provisioned bursts. Cheapest modern SSD.
  • io2 / io2 Block Express β€” provisioned IOPS up to 256k; for Oracle/SQL-Server-scale DBs.
  • st1 β€” throughput HDD (big-data sequential).
  • sc1 β€” cold HDD.

Snapshots are incremental, stored in S3, and can be copied cross-region/account. Encrypt by default (account-level setting).

EFS β€” POSIX-compliant NFS ​

Mountable from multiple EC2/ECS/Lambda clients across AZs. Lifecycle policies move cold files to IA. Use when a legacy app expects a shared FS (/shared/uploads) and you don't want to refactor it to S3. Otherwise, prefer S3.

FSx β€” managed third-party FS ​

  • FSx for Windows File Server β€” SMB, AD-integrated. Windows workloads.
  • FSx for Lustre β€” HPC, tied to S3 buckets.
  • FSx for NetApp ONTAP / OpenZFS β€” lift-and-shift for existing storage tiers.

Storage Gateway ​

On-prem appliance (VM) that exposes S3 / EBS / FSx over NFS/SMB/iSCSI. For hybrid backups/archives. Niche.


8. Databases β€” RDS, Aurora, DynamoDB, ElastiCache ​

RDS β€” managed relational databases ​

Engines: Postgres, MySQL, MariaDB, Oracle, SQL Server, and Aurora (AWS-native rewrite of MySQL/Postgres). AWS handles patching, backups, minor-version upgrades, failover.

Multi-AZ vs read replicas β€” the frequent confusion:

| | Multi-AZ (standby) | Read replica |
|---|--------------------|--------------|
| Purpose | HA / failover | Read scale-out |
| Sync | Synchronous replication | Asynchronous |
| Endpoint | Same endpoint (DNS flip on failover) | Separate endpoint |
| Reads | Not served from standby (except new Multi-AZ DB Cluster for Postgres/MySQL) | Yes |
| Failover time | ~60–120 s | Manual promotion |
| Cost | 2× | 1× per replica |

You can combine them: Multi-AZ primary + N read replicas.

Backups β€” automated daily snapshot + transaction-log continuous backup = Point-in-Time Restore (up to 35 days). On top of that: manual snapshots, cross-region/cross-account copies.

Gotcha: restoring a snapshot always creates a new RDS instance with a new endpoint. No in-place restore. Plan your DNS layer.

Aurora β€” AWS-native cloud database ​

A fork of MySQL/Postgres with a distributed, log-structured storage layer that keeps six copies of the data across three AZs. It replaces "Multi-AZ + read replica" with a single cluster:

  • Writer endpoint β€” goes to the single writer (only one at a time in non-Global).
  • Reader endpoint β€” load-balances across up to 15 reader replicas (all read from the same shared storage β€” lag is milliseconds).
  • Failover β€” promote a reader to writer in ~30s.
  • Serverless v2 β€” scales ACUs up/down in seconds. Dev/staging or spiky-prod. Still requires a VPC.
  • Global Database β€” cross-region reader cluster, sub-second replication, fast failover with RPO ~1s.

DynamoDB β€” NoSQL key-value / document ​

Fully managed, single-digit ms at any scale. Pricing model: provisioned (RCU/WCU reserved) or on-demand (pay-per-request).

Core concepts:

  • Partition key (PK) β€” required. Hashes into the partition.
  • Sort key (SK) β€” optional. Items with same PK share a partition; SK defines order within.
  • Item β€” a row; up to 400 KB.
  • GSI (Global Secondary Index) — alternate PK/SK; its own partition scheme; eventually consistent only (no strongly consistent reads).
  • LSI (Local Secondary Index) β€” same PK, alternate SK; created at table time only; strongly consistent.
  • Streams β€” per-table CDC for changes (old/new images); feed Lambda or Kinesis.
  • TTL β€” per-item expiry timestamp; items deleted asynchronously.
  • Transactions β€” TransactWriteItems / TransactGetItems; ACID across up to 100 items.
  • DAX β€” in-memory accelerator; microsecond reads; cache-aside for DDB.

Single-table design — pack multiple entity types into one table using overloaded PK/SK patterns (USER#42 / ORDER#2026-04-17). DynamoDB has no joins, so co-locating related items under one PK lets you fetch them in a single query; the cost is cognitive overhead. Prefer it for mature apps with well-understood access patterns, not MVPs. For typical Spring Boot microservices, one table per bounded context is usually the pragmatic call.

Hot partition pitfall: each DynamoDB partition has a fixed throughput ceiling (roughly 3,000 RCU / 1,000 WCU). A skewed PK (country=US) overwhelms one partition while others idle. Fix: pick a high-cardinality PK, or suffix a bucket (US#0..US#99) and scatter-gather on read.
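The bucket-suffix fix can be sketched in plain Java — the shard count and `US#n` key format here are illustrative, not any AWS API:

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Write-sharding sketch: spread one hot logical key across N physical PKs,
// then scatter-gather every shard on read and merge the results.
class ShardedKey {
    static final int SHARDS = 10; // tune to the write rate you need to absorb

    // On write: pick a shard deterministically from a per-item value
    // (order ID, user ID) so retries land on the same PK.
    static String writeKey(String logicalKey, String itemId) {
        int shard = Math.floorMod(itemId.hashCode(), SHARDS);
        return logicalKey + "#" + shard;
    }

    // On read: query all shard keys and merge (scatter-gather).
    static List<String> readKeys(String logicalKey) {
        return IntStream.range(0, SHARDS)
            .mapToObj(i -> logicalKey + "#" + i)
            .collect(Collectors.toList());
    }
}
```

The deterministic shard choice matters: hashing a random UUID per write would balance load too, but then a retried write could land on a different shard and duplicate the item.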

ElastiCache β€” Redis / Memcached / Valkey ​

Managed in-memory cache. Two engines:

| | Redis | Memcached |
| --- | --- | --- |
| Data types | strings, hashes, lists, sets, zsets, streams, pub/sub | strings only |
| Persistence | RDB + AOF optional | None |
| HA | Multi-AZ replication + automatic failover | None |
| Cluster mode | Yes (sharded) | Yes (client-side sharding) |
| Use | Session cache, leaderboards, rate limiters, pub/sub | Raw cache, multi-tenant |

Prefer Redis (or its Valkey fork, which AWS is pushing after the Redis license change) for nearly everything modern. Memcached is rarely the right choice.

Cache invalidation patterns:

  • Cache-aside (lazy loading) β€” app checks cache, falls back to DB, populates cache. Simple; stale reads on cache miss race.
  • Write-through β€” every DB write populates the cache. Read-simple; doubled write cost.
  • Write-behind β€” queue writes; eventual consistency, risk of data loss on cache failure.
  • TTL + invalidation on writes β€” most common pragmatic combo.
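Cache-aside plus invalidate-on-write in a dozen lines, with plain maps standing in for Redis and the database (class and method names are illustrative):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Cache-aside: check the cache, fall back to the source of truth, populate.
// Writes invalidate so the next read repopulates fresh data; a TTL on the
// real cache entry would catch any stragglers this misses.
class CacheAside {
    final Map<String, String> cache = new ConcurrentHashMap<>(); // stands in for Redis
    final Map<String, String> db = new ConcurrentHashMap<>();    // stands in for the DB

    String read(String key) {
        return cache.computeIfAbsent(key, db::get); // miss -> load from DB -> cache it
    }

    void write(String key, String value) {
        db.put(key, value);
        cache.remove(key); // invalidate-on-write
    }
}
```

The stale-read race the bullet mentions lives between `db::get` and the cache insert: a concurrent write can commit in that window, leaving the cache holding the older value until invalidation or TTL.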

Redshift / DocumentDB / others (brief) ​

  • Redshift β€” columnar MPP warehouse. Petabyte analytics. Materialized views, RA3 instances, Redshift Spectrum (query S3 in place), Serverless. Competes with Snowflake/BigQuery.
  • DocumentDB β€” MongoDB-API-compatible (3.6/4.0/5.0 wire protocols). Managed Mongo-ish. Not a literal MongoDB fork β€” missing some aggregation pipeline features. Mostly chosen when Atlas isn't allowed by procurement/compliance.
  • Neptune β€” graph (Gremlin + SPARQL + openCypher).
  • Timestream β€” time-series.
  • Keyspaces β€” Cassandra-compatible.
  • MemoryDB β€” durable Redis for primary-data workloads.

9. Messaging & Streaming β€” SQS, SNS, EventBridge, MQ, MSK, Kinesis ​

INTERVIEW_PREP Β§10 Q5 and Q6. Your strongest territory. The AWS side is the concrete implementation of INTERVIEW_PREP Β§4 (Messaging).

SQS β€” fully managed queues ​

  • Standard queues β€” unlimited throughput, at-least-once, best-effort ordering. Consumers MUST be idempotent.
  • FIFO queues β€” strict ordering, exactly-once-processing (dedup window 5 min), 300 msgs/sec (3,000 with batching), up to 70k/s with high-throughput mode. Suffix .fifo. Costs more.

Core mechanics:

  • Visibility timeout β€” when a consumer receives a message, it's hidden for N seconds (default 30). Extend with ChangeMessageVisibility if processing is slow. Failure to delete β†’ message reappears.
  • Dead-letter queue (DLQ) β€” route messages that fail N times (maxReceiveCount) to a DLQ for out-of-band inspection. Always pair prod queues with a DLQ.
  • Long polling β€” WaitTimeSeconds=20 on ReceiveMessage reduces empty-receive overhead to near zero.
  • Batching β€” send/receive/delete up to 10 at a time. 10Γ— cost reduction per message.
  • Delay queue β€” initial invisibility up to 15 min.
  • Message size β€” 256 KB (extend to 2 GB with the SQS Extended Client Library, which stores payload in S3).
```java
import java.time.Duration;

import io.awspring.cloud.sqs.config.SqsMessageListenerContainerFactory;
import software.amazon.awssdk.services.sqs.SqsAsyncClient;

// Spring Cloud AWS 3.x listener container tuned for long polling + batching
@Bean
SqsMessageListenerContainerFactory<Object> defaultSqsListenerContainerFactory(SqsAsyncClient client) {
    return SqsMessageListenerContainerFactory.builder()
        .configure(options -> options
            .pollTimeout(Duration.ofSeconds(20))        // long poll
            .maxMessagesPerPoll(10)                     // batch receive
            .messageVisibility(Duration.ofSeconds(60))  // visibility timeout
        )
        .sqsAsyncClient(client)
        .build();
}
```

Gotcha (Spring): @SqsListener acks by returning normally. If processing is fast and you return before your DB commit flushes, you can ack before the commit is durable β†’ at-least-once becomes at-most-once on DB failure. Use @Transactional around the handler body and return only after commit.
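Since standard queues are at-least-once, the handler needs an idempotency guard regardless. A minimal sketch — the in-memory set is a stand-in; production would use a DB unique constraint or a DynamoDB conditional write so the check survives restarts:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Idempotent consumer: record processed message IDs, skip redeliveries.
class IdempotentHandler {
    final Set<String> processed = ConcurrentHashMap.newKeySet(); // stand-in dedup store
    final AtomicInteger sideEffects = new AtomicInteger();       // counts real work done

    void handle(String messageId, String body) {
        if (!processed.add(messageId)) {
            return; // duplicate delivery: already handled, do nothing
        }
        sideEffects.incrementAndGet(); // the real work (DB write, downstream call)
    }
}
```

Keying on the SQS messageId dedups redeliveries of the same message; keying on a business ID (order number) additionally dedups producer-side double-sends.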

SNS β€” pub/sub topics ​

Topics fan out to N subscribers:

  • Subscriber types: SQS queue, Lambda, HTTP(S) endpoint, email, SMS, mobile push, Kinesis Data Firehose.
  • Filter policies β€” JSON matcher on message attributes; subscribers get only messages they care about (saves DLQ plumbing on consumers).
  • FIFO topics β€” pair with FIFO queues for ordered fan-out.

Classic pattern: one SNS topic → N SQS queues (one per consumer service). You get fan-out while each consumer keeps its own retry policy and DLQ. "Topic-to-queue fan-out."

EventBridge β€” event bus with routing & schemas ​

Successor to CloudWatch Events. Think of it as:

  • Default bus (AWS service events) + custom buses (your own events) + partner buses (Zendesk/Shopify/etc).
  • Rules β€” pattern-match on JSON structure; target up to 5 per rule (Lambda, SQS, Step Functions, API Destinations HTTPS, …).
  • Schema registry β€” discovers schemas from events on the bus; can generate typed POJOs/TS bindings.
  • Pipes β€” source (SQS/Kinesis/DDB Streams/…) β†’ optional filter + transform β†’ target. Replaces a lot of glue Lambdas.

Decision matrix β€” SQS vs SNS vs EventBridge ​

| Need | Use |
| --- | --- |
| One producer, one consumer, durable queue | SQS |
| One producer, many consumers, minimal routing | SNS → SQS fan-out |
| Many producers, routed by content, schema-first | EventBridge |
| Windowed streaming / replay / ordered per-key | Kinesis / MSK |
| Pub/sub with JMS semantics from legacy apps | Amazon MQ |

Rule of thumb: EventBridge for event-driven architectures across teams; SNS+SQS for simple fan-out to a known set of consumers; SQS for plain job queues.

Amazon MQ β€” managed ActiveMQ & RabbitMQ ​

The answer to an on-prem IBM MQ migration. Amazon MQ runs ActiveMQ Classic, ActiveMQ Artemis, or RabbitMQ brokers managed by AWS β€” so your existing JMS / AMQP / STOMP / MQTT / OpenWire clients keep working. Single-instance or active-standby HA pairs.

Interview story: "We bridged on-prem IBM MQ to AWS ActiveMQ via Amazon MQ. Maintained JMS semantics so the Spring Boot listener container didn't change. For durability across the bridge, we used XA transactions client-side and a retry-with-backoff on the bridge; for poison messages, a backout queue mirrored to a DLT in Amazon MQ."

When to prefer MSK/Kafka instead: high-throughput streaming (10k+/s), log-based replay, schema registry, event-sourcing patterns. When to prefer Amazon MQ: you already speak JMS, you need per-message ACKs and XA, you don't want to rewrite consumers.
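The bridge's retry-with-backoff is exponential delay with full jitter; a sketch under illustrative constants (100 ms base, 30 s cap — not anything the broker mandates):

```java
import java.util.Random;

// Full-jitter exponential backoff: delay_n = rand(0, min(cap, base * 2^n)).
// Jitter de-synchronizes retrying clients so they don't stampede the broker.
class Backoff {
    static final long BASE_MS = 100;    // first-retry ceiling
    static final long CAP_MS = 30_000;  // never wait longer than this

    static long delayMillis(int attempt, Random rng) {
        long exp = Math.min(CAP_MS, BASE_MS << Math.min(attempt, 20)); // clamp shift to avoid overflow
        return (long) (rng.nextDouble() * exp);
    }
}
```

Pair this with a bounded attempt count: once attempts exceed the limit, stop retrying and move the message to the backout/DLQ path rather than looping forever.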

Amazon MSK β€” managed Kafka ​

Managed Kafka brokers + Zookeeper/KRaft. Variants:

  • MSK Provisioned β€” you pick broker counts + instance types; you operate partitions/topics.
  • MSK Serverless β€” AWS scales brokers; you pay per throughput + storage. Great default for unknown load.

Companion: MSK Connect (managed Kafka Connect) and Glue Schema Registry (Avro/Protobuf/JSON registry — pair with cross-team schema standardization work). MSK IAM auth is an alternative to SASL/SCRAM and mTLS: IAM policies authorize producer/consumer actions, with no shared secrets to rotate.

Kinesis β€” streaming data platform ​

Three related services, easily confused:

  • Kinesis Data Streams — Kafka-like: partitioned shards, retention 1–365 days, multiple consumers. Use the Kinesis Client Library (KCL) to coordinate consumers and checkpoint progress.
  • Kinesis Data Firehose β€” fully managed delivery to S3/Redshift/OpenSearch/HTTP. No consumer code. Buffers by time + size.
  • Kinesis Data Analytics / Managed Service for Apache Flink β€” SQL or Flink jobs over a stream. Competes with MSK + Flink.

Kinesis vs MSK: Kinesis is more plug-and-play, capped per shard at 1,000 records/s or 1 MB/s (write) and 2 MB/s (read, shared across consumers unless you pay for enhanced fan-out). MSK is a real Kafka cluster (thousands of partitions) with the full Kafka ecosystem. Pick Kinesis when simplicity matters; pick MSK when you want Kafka semantics (Connect, Streams, schema registry, consumer-group rebalance).


10. Serverless & Integration β€” Lambda, API Gateway, Step Functions ​

Lambda advanced ​

Already covered fundamentals in Β§3. Extras:

  • Layers — shared code/deps across functions; 5 per function max. Use for a current AWS SDK (the runtime bundles an older one), the Lambda Insights extension, or your own common libs.
  • Extensions β€” sidecar processes inside the Lambda sandbox (Datadog, New Relic, Parameter-store-cache). Run outside the function's handler, can pre-fetch.
  • SnapStart (Java, Python, .NET) β€” snapshot the initialized JVM once, restore for each invocation. ~10Γ— faster cold start. Enabled per function + published version.
  • Powertools for AWS Lambda β€” opinionated libraries (tracing, metrics, structured logging, idempotency, batch processing). Have one in every runtime (Java, Python, TypeScript, .NET).
  • Destinations β€” on-success/on-failure targets for async invocations (SNS/SQS/Lambda/EventBridge). Replaces explicit DLQ configs.
  • Event Source Mappings β€” the pollers for SQS/Kinesis/MSK/DynamoDB Streams. Configure batch size, batching window, max concurrency, filter criteria.

Gotcha: With an SQS event source, Lambda batches but one poison message in a batch fails the whole batch unless you set ReportBatchItemFailures and return per-item failures from your handler. Turn this on β€” otherwise one bad record can stall the queue and blow through visibility timeouts.
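The partial-batch-failure shape the handler must return, sketched with simplified stand-in types (the real signatures come from the aws-lambda-java-events SQSEvent/SQSBatchResponse classes):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// With ReportBatchItemFailures enabled, the handler returns the IDs of only
// the failed messages; those return to the queue, the rest are deleted.
class BatchHandler {
    record Msg(String messageId, String body) {} // stand-in for SQSEvent.SQSMessage

    Map<String, List<Map<String, String>>> handle(List<Msg> batch) {
        List<Map<String, String>> failures = new ArrayList<>();
        for (Msg m : batch) {
            try {
                process(m.body());
            } catch (Exception e) {
                // record the poison message instead of failing the whole batch
                failures.add(Map.of("itemIdentifier", m.messageId()));
            }
        }
        return Map.of("batchItemFailures", failures); // SQSBatchResponse shape
    }

    void process(String body) { // hypothetical business logic
        if (body.contains("poison")) throw new IllegalStateException("bad record");
    }
}
```

Without this, one thrown exception makes Lambda return the entire batch to the queue, reprocessing the nine good messages alongside the one bad one.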

API Gateway ​

Three flavors:

  • HTTP API (v2) β€” newer, cheaper (~70% cheaper than REST), faster, JWT authorizers built-in. Most new APIs.
  • REST API (v1) β€” feature-rich: API keys + usage plans, request/response transformations with VTL, WAF integration, Resource Policies. Use when you need those specific features.
  • WebSocket API β€” stateful, used for chat, live updates, multiplayer.

Integration types β€” Lambda (proxy or non-proxy), HTTP backend, AWS service (e.g., S3 direct upload, SQS SendMessage), VPC Link (private ALB/NLB in your VPC), mock.

Authorizers: IAM (SigV4), Cognito User Pool, Lambda authorizer (custom auth), JWT authorizer (HTTP API only).

Throttling β€” per-API and per-key rate + burst limits. Use this to protect downstream backends from spike attacks.

Step Functions β€” orchestration ​

State machine runtime for multi-step workflows. Two flavors:

  • Standard β€” long-running (up to 1 year), exactly-once, 2000 state transitions/sec, visual history. Orchestrate Saga-style transactions across microservices.
  • Express β€” sub-5-minute, at-least-once, high-volume, cheaper per transition. Replace chains of Lambda invocations.

Amazon States Language (ASL) β€” JSON. Built-in integrations (SDK integrations cover ~200 AWS services). Error handling with Retry + Catch blocks. Parallel states for fan-out/fan-in. Map states for iterating over a list.
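A minimal ASL sketch of Retry + Catch around one task state — state names are illustrative, and the Lambda invocation parameters are omitted for brevity:

```json
{
  "StartAt": "ChargeCard",
  "States": {
    "ChargeCard": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Retry": [{
        "ErrorEquals": ["Lambda.TooManyRequestsException"],
        "IntervalSeconds": 2,
        "MaxAttempts": 3,
        "BackoffRate": 2.0
      }],
      "Catch": [{
        "ErrorEquals": ["States.ALL"],
        "Next": "RefundSaga"
      }],
      "End": true
    },
    "RefundSaga": { "Type": "Pass", "End": true }
  }
}
```

This is the Saga skeleton: retries absorb transient faults, and the Catch routes any terminal failure to a compensating branch instead of dropping the workflow.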

When to use: anywhere you'd otherwise write a Lambda that invokes another Lambda that invokes an SQS message that… β€” that's a Step Function. Saga orchestration across microservices is a direct fit.

AppSync (brief) ​

Managed GraphQL. Resolvers backed by Lambda / DynamoDB / Aurora / OpenSearch / HTTP. Subscriptions over WebSocket. Niche but valuable when the front end is GraphQL-native.


11. Containers on AWS β€” ECR, ECS, EKS ​

ECR β€” Elastic Container Registry ​

Private Docker registry per AWS account + region. Features:

  • Image scanning (basic or enhanced via Inspector) β€” CVE scan on push.
  • Lifecycle policies β€” delete untagged images older than N days.
  • Cross-region replication for DR.
  • Pull-through cache β€” mirror Docker Hub / Quay / GHCR behind ECR to avoid rate limits.

Auth: aws ecr get-login-password β†’ docker login. In EKS, the ecr-credential-provider handles it automatically.

ECS β€” covered Β§3. Plus: ​

  • Capacity providers β€” pick EC2 ASG (control cost) vs Fargate (zero-ops) vs Fargate Spot (cheaper interruptible).
  • Service Connect β€” AWS-native service discovery + mTLS; Envoy sidecars deployed automatically. Alternative to Cloud Map + App Mesh.
  • Task role vs execution role: task role = what your app can do (read S3); execution role = what ECS does on your behalf (pull image from ECR, push logs to CloudWatch). Don't conflate them.

EKS deep dive ​

Networking: pods get VPC IPs via the AWS VPC CNI. The per-instance ENI limit bounds pod density; check it when sizing nodes. You can run the CNI in IPv6 mode for huge clusters.

Add-ons to install day 0:

  • aws-load-balancer-controller β€” turns Ingress/Service type=LoadBalancer into ALB/NLB.
  • cluster-autoscaler (or Karpenter) β€” scales nodes to match pending pods.
  • external-dns β€” syncs Ingress hostnames to Route 53.
  • cert-manager β€” ACM or Let's Encrypt for TLS.
  • metrics-server β€” drives HPA.
  • kube-state-metrics + Prometheus/Grafana or CloudWatch Container Insights.

Authentication/authorization:

  • kubectl β†’ STS AssumeRoleWithWebIdentity β†’ aws-auth ConfigMap maps IAM entities to Kubernetes groups β†’ Kubernetes RBAC authorizes.
  • 2023+ EKS Access Entries API replaces the aws-auth ConfigMap editing dance.
  • IRSA (Β§2) for pods.

Node options: self-managed node groups, managed node groups, Fargate profiles (serverless pods β€” no nodes to patch, but can't run DaemonSets/hostPath).

Interview story: "We ran ArgoCD on EKS, using IRSA to grant it eks:DescribeCluster + sts:AssumeRole for target clusters. App-of-apps + ApplicationSets across 3 environments. Karpenter replaced Cluster Autoscaler and cut node-scale-up from ~3 min to ~30 s."


12. Observability β€” CloudWatch, X-Ray, CloudTrail, Config ​

CloudWatch ​

Three related but distinct concepts:

  • Metrics β€” numeric time-series. Namespaces (AWS/EC2, AWS/Lambda, or custom), dimensions, 1-min or 1-sec resolution. Embedded Metric Format (EMF) lets you publish metrics by writing a specific JSON line in Lambda/container logs β€” no extra API calls, cheap.
  • Logs β€” log groups β†’ log streams. Retention per group (1 day – forever). Logs Insights is a query language (close to SQL) for exploring them:
    fields @timestamp, @message
    | filter @message like /ERROR/
    | stats count() by bin(5m)
  • Alarms β€” thresholds over metrics with actions (SNS, Auto Scaling, Systems Manager).

Dashboards β€” mixed metric/log-query widgets; sharable cross-account. JSON-as-code, drop them in Git with CloudFormation/Terraform.

Container Insights / Lambda Insights / RUM / Synthetics β€” add-on products for deeper telemetry.

Gotcha (cost): CloudWatch Logs ingestion is ~$0.50/GB. Noisy DEBUG logs or an unbounded exception loop can ring up four-figure bills fast. Set alarms on IncomingBytes per log group, or filter-before-ingest on the agent.

X-Ray β€” distributed tracing ​

Trace IDs propagate via X-Amzn-Trace-Id (or W3C traceparent when configured). SDKs instrument AWS SDK calls, HTTP clients, SQL drivers. Sampling is default 1 req/s + 5%; tune via sampling rules.

X-Ray now integrates with OpenTelemetry: the AWS Distro for OpenTelemetry (ADOT) exports OTLP to X-Ray, CloudWatch, or both, and also to your existing Jaeger/Grafana/Datadog. This is the right path for most observability work β€” "we standardized on OTEL, used the ADOT collector, exported to X-Ray + Elasticsearch."

CloudTrail β€” the audit log of AWS itself ​

Every API call in your account, for every service, gets a CloudTrail event. Two types:

  • Management events (default, free tier on first trail) β€” iam:CreateRole, ec2:TerminateInstances, etc.
  • Data events (opt-in, priced) β€” per-object-level on S3, per-invoke on Lambda, per-item on DynamoDB.

Route to S3 (with Object Lock for tamper-evidence) + CloudWatch Logs + EventBridge. For CJIS/FedRAMP, this + immutable bucket policies is your audit backbone.

AWS Config β€” resource-state + compliance ​

Continuously records AWS resource configurations and evaluates them against rules ("S3 bucket must not allow public read," "security group must not expose port 22 to 0.0.0.0/0"). Managed rules cover most CIS benchmarks; custom rules are Lambda functions. Pairs with Security Hub to aggregate findings.

Observability decision tree (interview-ready) ​

  • Who did what when? β†’ CloudTrail
  • What's happening right now in my service? β†’ CloudWatch Metrics + Alarms
  • Why is the latency spiking? β†’ X-Ray / OTEL traces
  • Show me the raw events. β†’ CloudWatch Logs Insights or OpenSearch
  • Is the infra still compliant? β†’ AWS Config + Security Hub

13. Security Services β€” KMS, Secrets, WAF, GuardDuty ​

KMS β€” encryption at rest ​

  • KMS key (historically "CMK," customer master key) — the logical key. Two flavors:

    • AWS-managed (aws/s3, aws/rds) — free, rotated by AWS annually, key policy not editable.
    • Customer-managed — you control the key policy, aliases, rotation, grants. Priced per key + API call.
  • Envelope encryption — KMS never encrypts your data directly (the Encrypt API caps at 4 KB). Instead:

    1. KMS generates a data key (encrypted with the CMK).
    2. Your service encrypts data with the data key locally.
    3. Stores encrypted data + encrypted data key together.
    4. Decrypt = KMS decrypts the data key, your service decrypts the data.

    S3, EBS, RDS, DynamoDB all use this under the hood.

  • Grants β€” time-bound, fine-grained delegations of KMS permissions without editing the key policy. Handy for ephemeral workloads (EMR jobs, Lambda).

  • Key policies vs IAM β€” KMS requires the key policy to allow access; IAM alone is insufficient. Memorize this.

FIPS endpoints β€” kms-fips.us-east-1.amazonaws.com. Required for CJIS / FedRAMP High.
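The four envelope steps, simulated locally with JDK crypto — here the "CMK" is just another AES key held in memory, whereas real KMS never releases the master key material:

```java
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;
import javax.crypto.spec.SecretKeySpec;
import java.security.SecureRandom;

// Envelope encryption, simulated: wrap a data key with a master key,
// encrypt the payload with the data key, store both ciphertexts together.
class Envelope {
    record Sealed(byte[] wrappedDataKey, byte[] keyIv, byte[] iv, byte[] ciphertext) {}

    static SecretKey newKey() {
        try {
            KeyGenerator kg = KeyGenerator.getInstance("AES");
            kg.init(256);
            return kg.generateKey();
        } catch (Exception e) { throw new IllegalStateException(e); }
    }

    static Sealed seal(SecretKey masterKey, byte[] plaintext) {
        SecretKey dataKey = newKey();                                 // 1. generate a data key
        byte[] iv = rand(12), keyIv = rand(12);
        byte[] ct = gcm(Cipher.ENCRYPT_MODE, dataKey, iv, plaintext); // 2. encrypt data locally
        byte[] wrapped = gcm(Cipher.ENCRYPT_MODE, masterKey, keyIv, dataKey.getEncoded());
        return new Sealed(wrapped, keyIv, iv, ct);                    // 3. store both together
    }

    static byte[] open(SecretKey masterKey, Sealed s) {
        byte[] keyBytes = gcm(Cipher.DECRYPT_MODE, masterKey, s.keyIv(), s.wrappedDataKey()); // 4. unwrap
        return gcm(Cipher.DECRYPT_MODE, new SecretKeySpec(keyBytes, "AES"), s.iv(), s.ciphertext());
    }

    static byte[] gcm(int mode, SecretKey key, byte[] iv, byte[] in) {
        try {
            Cipher c = Cipher.getInstance("AES/GCM/NoPadding");
            c.init(mode, key, new GCMParameterSpec(128, iv));
            return c.doFinal(in);
        } catch (Exception e) { throw new IllegalStateException(e); }
    }

    static byte[] rand(int n) {
        byte[] b = new byte[n];
        new SecureRandom().nextBytes(b);
        return b;
    }
}
```

The payoff of the scheme: bulk data never crosses the wire to KMS, and rotating or revoking the master key instantly gates every wrapped data key.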

Secrets Manager vs Parameter Store ​

| | Secrets Manager | SSM Parameter Store |
| --- | --- | --- |
| Per-secret cost | $0.40/mo + API calls | Free (Standard) / $0.05/mo (Advanced, > 4 KB) |
| Rotation | Built-in for RDS/DocumentDB/Redshift + Lambda hook | Custom via Lambda |
| Versioning | Yes | Yes |
| Resource policies | Yes | Advanced only |
| Max value size | 64 KB | 4 KB (Standard) / 8 KB (Advanced) |

Rule: Use Parameter Store for config + free-tier secrets; Secrets Manager when you need automated rotation (DB creds primarily).

Spring Boot integration: spring-cloud-aws-starter-secrets-manager / ...starter-parameter-store autoload values as Spring properties. Cache to avoid billing every startup.

ACM β€” certificates ​

Free public TLS certs, auto-rotated, DNS-validated (1-click with Route 53). Private CA for internal PKI (priced higher). Attach to ALB/NLB/CloudFront/API Gateway β€” not to EC2 directly.

WAF β€” web application firewall ​

Attached to ALB / CloudFront / API Gateway / AppSync. Rules:

  • Managed rule groups (AWS, Marketplace vendors) β€” covers OWASP Top 10.
  • Rate-based rules β€” 2000 req / 5 min per IP is a common floor.
  • Custom rules β€” SQL keyword match, path regex, geo block, header match.

Bot Control, CAPTCHA, and IP reputation lists are add-on paid features.

Shield β€” DDoS protection ​

  • Shield Standard β€” free, always-on, L3/L4 protection on all AWS edge. You already have it.
  • Shield Advanced β€” $3,000/mo + data, gives you: 24Γ—7 DRT access, cost protection during attacks, application-layer attack detection, health-based DDoS mitigation.

GuardDuty / Macie / Inspector / Security Hub / IAM Access Analyzer ​

  • GuardDuty β€” ML-driven threat detection on CloudTrail + VPC Flow Logs + DNS logs + EKS audit logs + S3 data events. Catches compromised instances, crypto-mining, anomalous API calls.
  • Macie β€” scans S3 buckets for sensitive data (PII, credentials). For CJIS/HIPAA data classification.
  • Inspector β€” vuln scanner for EC2 / ECR / Lambda. Continuous CVE scanning.
  • Security Hub β€” pane-of-glass aggregator of findings from Config / GuardDuty / Macie / Inspector / 3rd parties. CIS + PCI + NIST packs.
  • IAM Access Analyzer β€” external-access findings (Β§2) + unused-access findings + policy generation from CloudTrail.

14. CI/CD on AWS β€” CodePipeline, CodeBuild, CodeDeploy ​

The Code* quartet ​

  • CodeCommit β€” managed Git. AWS announced closed to new customers in 2024; on life support. Use GitHub/GitLab.
  • CodeBuild β€” managed build runner (runs buildspec.yml). On-demand containers. Cheap alternative to self-hosted GitHub runners, with IAM-native AWS access.
  • CodeDeploy β€” deploys to EC2/ECS/Lambda/on-prem. Supports blue/green, canary, linear shifts. Integrates with CloudWatch alarms for auto-rollback.
  • CodePipeline β€” orchestrator: pulls source, invokes CodeBuild, runs approvals, triggers CodeDeploy.

Realistic CI/CD topology ​

"We used GitHub Actions for CI (build, test, scan), pushed images to ECR, and ArgoCD pulled from a Helm-charts repo into EKS. CodeBuild wasn't the primary runner, but we used it for IAM-native AWS calls (OWASP Dependency Check publishing, CloudFormation deploys, etc.). CodeDeploy for EC2 was a legacy pathway we kept for the monolith during migration."

Deployment strategies on AWS ​

  • Rolling (default ECS/EKS) — replace N% at a time. Simple; leaves a mixed-version window during rollout and rollback.
  • Blue/Green β€” two full environments; traffic shift is an ALB listener rule change or Route 53 flip. ECS and Lambda support it natively via CodeDeploy.
  • Canary β€” shift 5% for 5 min, then 100%. Tie to CloudWatch alarms for automatic rollback.
  • Argo Rollouts on EKS β€” Kubernetes-native canary + analysis with Prometheus queries.

15. Infrastructure as Code β€” CloudFormation, CDK, SAM, Terraform ​

CloudFormation β€” the native IaC ​

JSON/YAML templates describing resources. Key terms:

  • Stack β€” a deployed instance of a template.
  • Change set β€” preview of what a stack update will do. Always create one before execute in prod.
  • Drift detection β€” compares actual resources to stack template; flags manual console changes.
  • Nested stacks β€” compose templates hierarchically.
  • StackSets β€” deploy the same template to many accounts/regions from the org's management account.
  • Custom resources β€” Lambda-backed hooks for things CFN doesn't model natively.

Slow and verbose but deeply integrated. AWS-native only.

CDK β€” code-first IaC ​

Write TypeScript/Python/Java/Go/.NET; synthesizes to CloudFormation. Construct levels:

  • L1 (CfnBucket) β€” 1:1 with CFN resources. Verbose.
  • L2 (Bucket) β€” opinionated helpers. You mostly live here.
  • L3 (patterns) β€” higher-level, e.g., ApplicationLoadBalancedFargateService gives you ALB + ECS service + target group + security groups + DNS + TLS in 10 lines.
```typescript
import * as ecs from 'aws-cdk-lib/aws-ecs';
import * as ecs_patterns from 'aws-cdk-lib/aws-ecs-patterns';

const cluster = new ecs.Cluster(this, 'Cluster', { vpc });
new ecs_patterns.ApplicationLoadBalancedFargateService(this, 'Svc', {
  cluster, memoryLimitMiB: 1024, cpu: 512,
  desiredCount: 2,
  taskImageOptions: { image: ecs.ContainerImage.fromEcrRepository(repo) },
  publicLoadBalancer: true,
});
```

Great for application teams where the developer and the IaC owner are the same person.

SAM β€” serverless-focused CDK-lite ​

Extension to CloudFormation with serverless shortcuts (AWS::Serverless::Function, AWS::Serverless::Api). SAM CLI adds sam local invoke, sam local start-api for local Lambda dev. Good for small serverless projects where CDK is overkill.

Terraform β€” the multi-cloud default ​

HashiCorp's IaC. HCL DSL; declarative; dependency graph; state file holds known state. Pros: multi-cloud, rich community modules, often the operator-team preference. Cons: state backend (S3 + DynamoDB locks) setup, non-AWS-native, drift handling is less elegant than CFN's.

Decision: app developers in AWS-only shops β†’ CDK. Platform/SRE teams spanning AWS+GCP+GitHub+Datadog β†’ Terraform. Don't mix; pick per-account and stick.

Gotcha: never edit a resource by hand once it's managed by IaC. The next apply will either revert your change (good) or fail because of drift (bad). If you must do emergency surgery, codify the change before the next scheduled run.


16. Cost Management ​

INTERVIEW_PREP Β§10 Q9.

Tools, from reactive to preventive ​

  • Cost Explorer β€” visualize spend by service/tag/account/dimension, up to 13 months back. First stop for "why is our bill up."
  • AWS Budgets β€” alert (or take action) when spend crosses a threshold. Budget actions can auto-apply restrictive SCPs on cost runaway.
  • Cost & Usage Report (CUR) β€” daily CSV/Parquet into S3, the ground truth. Query with Athena or pipe to Snowflake/Redshift for FinOps reporting.
  • Compute Optimizer β€” ML recommendations to right-size EC2/EBS/Lambda/ASG based on CloudWatch metrics.
  • Trusted Advisor β€” checks for unused resources, idle load balancers, low-utilization EC2. Full checks require Business/Enterprise support.
  • Savings Plans β€” 1- or 3-yr commit on hourly compute spend. Covers EC2, Fargate, Lambda. 40–70% discount. Choose Compute Savings Plans (flexible) over EC2 Instance Savings Plans (family-locked) in most cases.
  • Reserved Instances β€” older model; tied to instance family + OS. Use for RDS / ElastiCache / OpenSearch (they don't have Savings Plans yet).
  • Spot Instances β€” up to 90% discount, 2-min interruption warning.
  • Database Savings Plans (re:Invent 2025) β€” new; covers RDS/Aurora compute.

Tagging strategy ​

Enforce a small mandatory tag set from day one: Environment, Owner, CostCenter, Project, DataClassification. Use AWS Organizations tag policies + Service Catalog or CFN guards to enforce. Cost Explorer splits by these tags; without them, you can't answer "what does Team X cost."

Common cost traps ​

  • Idle NAT Gateways β€” $0.045/hr per AZ + $0.045/GB. Replace with S3/DynamoDB gateway endpoints where possible; dual-NAT only in production, not dev.
  • Unattached EBS volumes / unused EIPs β€” tiny per-resource, adds up. Trusted Advisor flags both.
  • CloudWatch Logs β€” Β§12 gotcha.
  • Cross-AZ data transfer β€” $0.01/GB each way. Kafka brokers in different AZs replicating can add up.
  • NAT traffic to S3 β€” huge, obvious fix: gateway endpoint.
  • Over- (and under-) provisioned Lambdas — cost is memory × duration, so doubling memory from 256→512 MB can more than halve runtime (CPU scales with memory) and come out cheaper. Benchmark before cutting.
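The memory/runtime trade-off is GB-second arithmetic; a sketch with an illustrative per-GB-second price (roughly us-east-1 x86 on-demand, excluding the per-request fee):

```java
// Lambda bills memory * duration (GB-seconds). More memory also buys more
// CPU, so a faster run at higher memory can be net cheaper per invocation.
class LambdaCost {
    static final double PRICE_PER_GB_SECOND = 0.0000166667; // illustrative rate

    static double costMicroDollars(int memoryMb, double durationMs) {
        double gbSeconds = (memoryMb / 1024.0) * (durationMs / 1000.0);
        return gbSeconds * PRICE_PER_GB_SECOND * 1_000_000; // micro-dollars/invoke
    }
}
```

For example, 256 MB at 800 ms is 0.2 GB-s, while 512 MB at 350 ms is 0.175 GB-s: double the memory, lower the bill. Tools like AWS Lambda Power Tuning automate this search.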

17. Well-Architected Framework ​

Six pillars. Know all six; interviewers use them as a rubric.

  1. Operational Excellence β€” IaC, automated deploys, chaos drills, runbooks, post-incident reviews. "We used ArgoCD for GitOps, CloudWatch + Opsgenie for on-call, and ran a quarterly GameDay."
  2. Security β€” least privilege, defense in depth, traceability, encrypted at rest + in transit, automated response.
  3. Reliability β€” fault isolation (Multi-AZ minimum), automated recovery, capacity planning. RPO/RTO quantified.
  4. Performance Efficiency β€” right-size instances, use managed services, cache, measure before optimizing.
  5. Cost Optimization β€” Β§16.
  6. Sustainability β€” ARM/Graviton, right-sizing, region choice (renewables), storage tier lifecycle.

For each pillar AWS publishes a Lens (Financial Services, FedRAMP, HPC, AI/ML, SaaS, Serverless). Worth skimming the Serverless Lens before an interview.

Interview pattern: pick a pillar from your story. "A legacy MQ bridge put Reliability first β€” Multi-AZ consumers, idempotent handlers, DLQs wired to a Slack channel, RPO near zero with MQ persistence + mirrored DB writes, RTO under 5 min with automated failover. Operational excellence came second β€” same GitHub Actions pipeline every team used."


18. High Availability & Disaster Recovery ​

HA vs DR β€” different problems ​

  • HA β€” survive the loss of a component (an instance, a task, an AZ) within a region. Standard design: Multi-AZ everything, stateless apps behind an ALB, DBs with Multi-AZ standby.
  • DR β€” survive the loss of an entire region. Costs more, touched less often, drilled rarely. Measured in RPO (data loss tolerance) and RTO (time-to-recover).

Four DR strategies (AWS official, ascending cost/complexity) ​

| Strategy | RPO | RTO | Cost | How |
| --- | --- | --- | --- | --- |
| Backup & restore | hours | hours–day | $ | Snapshots + S3 cross-region copy. Rebuild infra on demand. |
| Pilot light | minutes | tens of min | $$ | Minimal footprint running in DR region (DB replica, baked AMIs). Scale up on failover. |
| Warm standby | seconds | minutes | $$$ | Scaled-down but functional copy. Increase capacity, flip DNS. |
| Multi-site active/active | near-zero | seconds | $$$$ | Full parallel stack; Route 53 latency routing / Global Accelerator; conflict resolution on shared data. |

Building blocks ​

  • RDS / Aurora β€” automated cross-region snapshot copies; Aurora Global Database for active-secondary.
  • DynamoDB Global Tables β€” multi-region multi-writer with last-writer-wins resolution.
  • S3 β€” CRR for buckets + enable versioning.
  • Route 53 β€” failover routing + health checks β†’ DNS flip.
  • CloudFormation StackSets β€” deploy the same infra into a second region on demand.

Common interview drill ​

"Walk me through a full region failover of a stateful Spring Boot / MongoDB / Kafka stack."

  1. Pre-work: infra-as-code for everything; data replicated (DocumentDB CRR / Mongo Atlas cross-region / MSK replicator); secrets replicated (Secrets Manager CRR).
  2. Detection: Route 53 health checks + CloudWatch cross-region alarms.
  3. Cutover: promote secondary data plane (if needed), flip Route 53 failover records, increase ASG desired counts in DR region, enable Kafka MirrorMaker reversal.
  4. Verification: smoke tests, check lag, dashboards.
  5. Rollback: reverse the flip; reconcile data.

19. Migration & Modernization ​

INTERVIEW_PREP Β§10 Q10. Must be fluent here.

The 7 Rs ​

AWS's canonical migration strategies, in rough order of effort:

  1. Retire β€” decommission apps that aren't used. Easiest "migration" is the one you don't do.
  2. Retain β€” keep on-prem for now (compliance, cost, license).
  3. Relocate β€” VMware Cloud on AWS / AWS Outposts β€” no OS changes.
  4. Rehost (lift-and-shift) β€” AWS Application Migration Service (MGN) to EC2. Minimal code changes. Fastest; lowest immediate benefit.
  5. Replatform (lift-tinker-shift) β€” e.g., EC2 β†’ Elastic Beanstalk or App Runner; self-managed DB β†’ RDS/Aurora. Medium effort, modest AWS-native gains.
  6. Repurchase β€” replace with a SaaS (self-hosted ticketing β†’ Zendesk). Exit operational burden.
  7. Refactor β€” rewrite for cloud-native (split monolith β†’ microservices, SQL β†’ DynamoDB where fitting). Highest effort + highest long-term payoff.

Decision tree for a Spring Boot app (interview answer)

  1. Stateless app, containerized already? → App Runner (for one app) or ECS Fargate (if you want VPC, private subnets, service mesh). Quick win.
  2. Large team standardized on Kubernetes? → EKS. More ceremony, but ArgoCD / Karpenter / standard OSS tooling all work.
  3. Legacy packaging (WAR, fat JAR with native deps)? → start on EC2 with Systems Manager for patching; plan to containerize.
  4. Event-driven glue? → Lambda with SAM/CDK.
  5. Data tier — Postgres → RDS Postgres (or Aurora Postgres for scale). Mongo → DocumentDB (if Atlas isn't allowed) or MongoDB Atlas on AWS.
  6. Messaging — IBM MQ → Amazon MQ. Kafka → MSK (or MSK Serverless). Kinesis only if you start fresh and don't need the Kafka ecosystem.
  7. Frontend — React build → S3 + CloudFront.
  8. CI/CD — GitHub Actions → ECR → ArgoCD (on EKS) or CodePipeline + CodeDeploy for ECS.
  9. Observability — OpenTelemetry → ADOT → CloudWatch + X-Ray (or keep your Elasticsearch and export there).
  10. Secrets — Secrets Manager for rotating DB creds; Parameter Store for config.
  11. Compliance — if CJIS/FedRAMP: GovCloud + FIPS endpoints + Object Lock on CloudTrail.

DMS — Database Migration Service

Replicates between heterogeneous engines (Oracle → Postgres, SQL Server → Aurora, Mongo → DocumentDB). Full load + CDC (continuous replication). Pair with the Schema Conversion Tool (SCT) to translate schema + stored procs. Validate constantly — enable DMS data validation and reconcile row counts yourself; DMS can silently drop or truncate rows under certain null/constraint mismatches and limited-LOB settings.
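The "validate constantly" advice is easy to operationalize. A toy reconciliation sketch — in practice the two maps would come from querying the source and target databases for primary key plus a row hash; here they are simulated, and the function name is mine:

```python
# Toy post-migration reconciliation check (full load + CDC window settled).
# Inputs are {primary_key: row_hash} maps pulled from source and target.

def reconcile(source_rows: dict, target_rows: dict) -> dict:
    """Compare PK -> row-hash maps and report drift between source and target."""
    missing = sorted(set(source_rows) - set(target_rows))        # dropped rows
    extra = sorted(set(target_rows) - set(source_rows))          # phantom rows
    mismatched = sorted(
        pk for pk in source_rows.keys() & target_rows.keys()
        if source_rows[pk] != target_rows[pk]                    # silent drift
    )
    return {"missing": missing, "extra": extra, "mismatched": mismatched}

source = {"1": "a1", "2": "b2", "3": "c3"}
target = {"1": "a1", "3": "c9"}        # row 2 silently dropped, row 3 drifted
report = reconcile(source, target)
print(report)   # {'missing': ['2'], 'extra': [], 'mismatched': ['3']}
```

Run a check like this per table after cutover and alarm on any non-empty field.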


20. Federal / Compliance Context

Short but critical if working in regulated environments. Inline references only — a dedicated deep-dive is out of scope for this guide.

  • AWS GovCloud (US) — a separate partition (us-gov-west-1, us-gov-east-1) for ITAR, DoD IL2/4/5. Operated by US persons on US soil. ARNs use arn:aws-us-gov:…. You need a separate account; an AWS Organization spans either GovCloud or commercial, not both.
  • FedRAMP — Moderate authorized in commercial US regions; High authorized in GovCloud. Service availability differs — check the FedRAMP Marketplace before promising a design works.
  • CJIS — the FBI CJIS Security Addendum is available via AWS Artifact. KMS + CloudTrail + Object Lock + MFA are the technical backbone. State-specific CSO approval is still required.
  • DoD IL (Impact Level) 2/4/5/6 — a DoD-specific authorization; IL4/5 require GovCloud (US) DoD-approved regions, and IL6 (Secret) runs in the separate air-gapped AWS Secret Region.
  • FIPS 140-2/3 endpoints — *-fips.*.amazonaws.com; FIPS-validated TLS termination, required by FedRAMP/CJIS for data in transit.
  • Artifact — self-service portal for compliance reports (SOC 1/2/3, PCI, HIPAA BAA, FedRAMP, CJIS).
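The FIPS endpoint bullet is mechanical enough to script. A minimal sketch, assuming the common `<service>-fips.<region>.amazonaws.com` shape — a handful of services deviate, so verify any generated hostname against the official AWS FIPS endpoint list before use:

```python
# Toy helper that builds the common FIPS endpoint shape.
# Assumption: the service follows "<service>-fips.<region>.amazonaws.com";
# not every service does, so treat the output as a starting point.

def fips_endpoint(service: str, region: str) -> str:
    return f"https://{service}-fips.{region}.amazonaws.com"

print(fips_endpoint("kms", "us-gov-west-1"))
# https://kms-fips.us-gov-west-1.amazonaws.com
```

In a compliance-scoped environment you would pass this as `--endpoint-url` on the CLI (or the equivalent SDK endpoint override) and enforce it with config, not convention.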

Interview note: Regulated-environment patterns are worth being fluent in β€” air-gapped review processes, MFA-as-floor-not-ceiling, and treating the security organization as a stakeholder rather than a blocker.


21. Connect to Your Experience

Stories to anchor abstract topics. Drop these specifics into behavioral / technical answers.

Anchor example: legacy MQ microservice (10k+ tx/day)

  • Messaging translation: IBM MQ (on-prem) ↔ Amazon MQ (ActiveMQ). The bridge maintains JMS semantics; Spring @JmsListener + JmsTemplate work unchanged.
  • Durability: XA or idempotent-consumer + outbox on the AWS side; DLT for poison messages.
  • Observability: OTEL → ADOT → CloudWatch Metrics + X-Ray. Put trace_id into MDC so a Logs Insights query joins across services.
  • Scaling: raise listener concurrency; ECS Service Auto Scaling on the ApproximateNumberOfMessagesVisible SQS metric if staging through SQS.
  • Pairs with §9 Messaging and §12 Observability.
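The idempotent-consumer half of the durability bullet can be sketched in a few lines. This is a language-agnostic toy (class and method names are mine); in the real service the dedup record and the business change commit in the same transaction — that is the outbox pattern's whole point:

```python
# Minimal idempotent-consumer sketch: each message ID is applied at most once.
# In production, "processed" is a durable store (DB table / DynamoDB item)
# written atomically with the business side effect, not an in-memory set.

class IdempotentConsumer:
    def __init__(self):
        self.processed = set()     # stand-in for the durable dedup store
        self.side_effects = []     # stand-in for the business state

    def handle(self, message_id: str, body: str) -> bool:
        if message_id in self.processed:   # redelivery after crash/timeout
            return False                   # ack without re-applying
        self.side_effects.append(body)     # business logic...
        self.processed.add(message_id)     # ...and dedup record, one transaction
        return True

c = IdempotentConsumer()
assert c.handle("m-1", "debit $10") is True
assert c.handle("m-1", "debit $10") is False   # duplicate delivery is a no-op
assert c.side_effects == ["debit $10"]
```

At-least-once delivery plus this check gives effectively-once processing, which is usually what "zero message loss, no double-charge" requirements actually mean.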

Anchor example: cross-team schema standardization

  • AWS realization: Glue Schema Registry (native) or MSK + Confluent Schema Registry (if the org standardized on Confluent).
  • Subject-naming strategies enforced via IAM policy on the registry.
  • CI gate via CodeBuild or GitHub Actions against the registry's compatibility API before merge.
  • Pairs with §9 MSK.

Anchor example: JAXB migration from Thymeleaf

  • XML parsing security — still needs hardening (disable DTDs and external entities) when running on Lambda / ECS; AWS doesn't sandbox this for you.
  • If the XML is sourced from S3: use KMS-encrypted buckets, plus versioning + Object Lock for any schema you can't afford to have mutated.

Anchor example: ArgoCD rollouts (99% deploy success rate)

  • EKS + IRSA — ArgoCD's argocd-application-controller assumes a role per destination cluster; no static kubeconfigs with long-lived tokens.
  • Karpenter over Cluster Autoscaler — a ~30s cold-start delta that changed deploy-time math.
  • Blue/green with Argo Rollouts — the analysis template queries Prometheus or CloudWatch Metric Streams; auto-rollback ties back to §14's CodeDeploy pattern, but in Kubernetes.
  • Pairs with §11 EKS and §14 CI/CD.

Anchor example: mentoring a ~20-person intern cohort

  • Topics juniors get wrong on AWS: SG vs NACL direction, NAT cost traps, the SG-references-SG-by-ID idiom, "the bucket policy alone is enough cross-account" (it isn't — cross-account needs an allow on both sides), pre-signed URL confusion ("it's a capability, not authN").
  • Build a small catalog of drills (a VPC from scratch, a properly locked-down S3 bucket with OAC, an IRSA-enabled pod) — these expose the confusion fast.

22. Rapid-Fire Review

INTERVIEW_PREP §10 (10 questions, one-liner each)

  1. EC2 vs ECS vs EKS vs Lambda — EC2 = raw VMs (OS control). ECS = AWS-native containers (simple, Fargate = zero-ops). EKS = Kubernetes (portable, richer ecosystem). Lambda = event-driven, ≤15 min, zero-ops. Choose by ops preference + workload shape.
  2. IAM role vs user vs policy; IRSA — User = long-lived creds for humans; Role = temp creds via STS, no secrets; Policy = JSON grant. IRSA maps an EKS ServiceAccount to an IAM role via OIDC; pods get credentials from a projected JWT. No static keys.
  3. S3 classes + lifecycle + versioning + policies vs ACLs — Standard/IA/Glacier tiers with lifecycle transitions + expiration. Versioning enables CRR + Object Lock. Use bucket policies + Block Public Access; disable ACLs (BucketOwnerEnforced).
  4. S3 pre-signed URL — the server signs a time-limited URL; the client uploads/downloads to S3 directly (no data through your app). The signer must hold the permission; treat the URL as a bearer token.
  5. SQS vs SNS vs EventBridge — SQS = queue (one consumer). SNS = pub/sub fan-out (multi-consumer). EventBridge = schema-first event bus with content routing. SNS→SQS fan-out for simple broadcast; EventBridge for cross-team events.
  6. SQS Standard vs FIFO — Standard: unlimited throughput, at-least-once, best-effort order. FIFO: strict order per group, exactly-once processing, 300/3000 msg/s per queue (up to ~70k in high-throughput mode), .fifo suffix, costs more.
  7. VPC — public/private subnet, NAT, SG vs NACL — Public = has 0.0.0.0/0 → IGW. NAT Gateway (per AZ) gives private subnets egress. SG = stateful, ENI-level, allow-only, references other SGs. NACL = stateless, subnet-level, allow+deny, evaluated by rule number.
  8. CloudFront + S3 / + ALB — S3 origin = static assets; lock with OAC, long TTL + versioned filenames. ALB origin = dynamic API; TTL=0 for /api, cache static paths. The ACM cert must be in us-east-1 for CloudFront.
  9. Cost monitoring — Cost Explorer for exploration, Budgets for alerts + actions, CUR for ground truth, Compute Optimizer for right-sizing, mandatory tags from day one.
  10. On-prem Spring Boot → AWS — Rehost (MGN) fast path; replatform to App Runner/ECS/EKS for real wins. DB → RDS/Aurora; MQ → Amazon MQ; Kafka → MSK; React → S3+CloudFront; OTEL → ADOT → CloudWatch/X-Ray.

Extra rapid-fire (25 senior-level extras)

  1. Region vs AZ vs Edge — Region = geographic, independent. AZ = isolated DC cluster inside a region. Edge = CloudFront/Accelerator/R53 PoP; hundreds of them.
  2. Shared Responsibility — AWS owns security "OF the cloud"; the customer owns security "IN the cloud" (IAM, data, guest OS, app).
  3. Policy evaluation — default deny → explicit deny wins → union of allows, intersected with SCPs/boundaries.
  4. Assume-role flow — caller → STS AssumeRole → temp creds (15 min–12 h); IRSA uses AssumeRoleWithWebIdentity under the hood.
  5. VPC gateway endpoint — free, for S3 + DynamoDB, avoids NAT for traffic to those services. Always use.
  6. Route 53 routing — simple/weighted/latency/failover/geo/geoprox/multivalue. Weighted = canary; latency = multi-region perf; failover = active/passive DR.
  7. ALB vs NLB — ALB = L7, path/host routing, OIDC, WAF, gRPC. NLB = L4, static IPs, high packets-per-second, preserves client IP.
  8. Aurora vs RDS — Aurora = shared distributed storage, 6 copies across 3 AZs, readers lag by ms, up to 15 readers, up to ~5× RDS throughput at similar cost; RDS = plain engines on EBS.
  9. DynamoDB partitioning — the PK hashes into partitions, each with fixed throughput; a skewed PK = hot partition. Fix with high-cardinality keys or write sharding (suffix buckets).
  10. DynamoDB GSI vs LSI — GSI = alternate PK/SK, eventually consistent, added any time. LSI = same PK, alternate SK, strongly consistent, at table creation only.
  11. Redis cache patterns — cache-aside (most common), write-through, write-behind. Invalidate on write + TTL floor.
  12. SNS filter policy — subscribers get only matching messages; lets you skip per-consumer filter logic and DLQ noise.
  13. EventBridge Pipes — source → filter → transform → target, replacing glue Lambdas.
  14. Amazon MQ vs MSK — MQ = managed ActiveMQ/RabbitMQ, JMS/AMQP, XA, per-message ACK. MSK = managed Kafka, log-based replay, partitions, Connect ecosystem.
  15. Lambda cold start mitigations — SnapStart (Java), provisioned concurrency, ARM/Graviton, keep VPC-attach minimal, lean deps.
  16. Step Functions Standard vs Express — Standard = long-running (1 yr), exactly-once, 2k TPS; Express = short (≤5 min), at-least-once, very high TPS.
  17. EKS IRSA — cluster OIDC provider + role trust policy naming system:serviceaccount:<ns>:<sa> + projected JWT → AssumeRoleWithWebIdentity. No static keys.
  18. Karpenter vs Cluster Autoscaler — Karpenter provisions EC2 directly from pending pods in ~30s; bin-packs across instance types; the superior default now.
  19. KMS envelope encryption — KMS encrypts at most 4 KB directly; for real payloads it generates data keys your service uses locally. Grants for ephemeral delegation; a key policy is required (IAM alone isn't enough).
  20. Secrets Manager vs Parameter Store — SM: built-in rotation, priced per secret; SSM: free/cheap, rotation via your own Lambda. Use SM for DB creds; SSM for config.
  21. CloudTrail management vs data events — Mgmt = API calls, on by default. Data = per-object S3 / per-invoke Lambda / per-item DDB, opt-in + priced.
  22. X-Ray vs OTEL — X-Ray is native; OTEL is vendor-neutral; ADOT exports OTLP to X-Ray + others. Go OTEL by default.
  23. Savings Plans vs RIs vs Spot — SP = flexible commit covering EC2/Fargate/Lambda; RIs = legacy, family-locked; Spot = interruptible, up to 90% off.
  24. WAF rules — managed groups (AWS/vendor), rate-based (per IP / 5 min), custom matches. Attach to ALB/CF/API GW.
  25. 4 DR strategies — backup-restore / pilot light / warm standby / multi-site active-active. RPO + RTO decide.
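Item 9's hot-partition behavior is worth seeing concretely. A toy simulation (all names mine): items hash by partition key, so a skewed key concentrates every write on one partition regardless of total provisioned capacity, while suffix sharding spreads the same traffic:

```python
# Toy illustration of DynamoDB-style hot partitions. Real DynamoDB hashing
# and partition counts differ; the skew mechanics are the same.
import hashlib
from collections import Counter

def partition(pk: str, n_partitions: int = 8) -> int:
    """Deterministically map a partition key to one of n partitions."""
    return int(hashlib.md5(pk.encode()).hexdigest(), 16) % n_partitions

# 1000 writes to one customer: every write lands on a single partition.
hot = Counter(partition("customer-42") for _ in range(1000))

# Same 1000 writes with a 10-way suffix shard spread across partitions.
sharded = Counter(partition(f"customer-42#{i % 10}") for i in range(1000))

print(max(hot.values()))      # 1000 -- the hot partition takes everything
print(max(sharded.values()))  # much smaller -- load is spread out
```

The cost of sharding is on the read side: fetching one customer now means querying all suffixes and merging, which is why you only shard keys that are actually hot.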

23. Practice Exercises

Tier 1 — Design from memory

Set a timer; whiteboard without AWS docs.

A. Three-tier web app VPC

  • 10.0.0.0/16 VPC spanning 3 AZs.
  • Per AZ: public subnet (ALB, NAT GW), private app subnet (ECS/EKS), private data subnet (RDS).
  • Route tables: public → IGW; private-app → NAT GW (per AZ!); private-data → nowhere except VPC endpoints.
  • SGs: alb-sg accepts 443 from world; app-sg accepts 8080 from alb-sg; db-sg accepts 5432 from app-sg.
  • VPC endpoints: S3 + DynamoDB gateway; KMS + Secrets Manager interface.

B. Static site + API on AWS

  • S3 bucket (private, OAC) for SPA; CloudFront distribution; ACM cert in us-east-1.
  • Same distribution has a second behavior /api/* pointing at an ALB + ECS Fargate service.
  • Route 53 ALIAS to CloudFront.
  • Sketch it in under 5 min.

C. Event-driven order processing

  • API Gateway → Lambda → EventBridge bus.
  • Rules routing to (a) Payments service (ECS), (b) Inventory service (DDB via direct put), (c) Analytics (Kinesis Firehose → S3).
  • DLQs on every async path.

Tier 2 — Predict the outcome

Cover each, state what will happen, then check.

  1. Bucket policy on an Account-A bucket grants s3:GetObject to an Account-B user. That user has no IAM policy for S3. Can they GET? No — cross-account access needs an allow on both sides: the bucket policy in Account A and an identity policy in Account B.
  2. Same setup but inside Account-A (a user in Account-A with no IAM S3 policy)? Yes — same-account access needs an allow from either the identity policy or the resource policy (absent an explicit deny). This pair is the classic trap; don't reverse it.
  3. Three private subnets in three AZs. One NAT Gateway in AZ-a. AZ-a goes down. What breaks? Egress for AZ-b and AZ-c apps (their route tables point to a dead ENI in AZ-a).
  4. Security group inbound rule: 22 from 0.0.0.0/0. NACL outbound rule: deny 1024-65535. Can you SSH in? No — the NACL is stateless; SSH reply traffic leaves on an ephemeral port and is denied outbound.
  5. Lambda in a VPC, no VPC endpoint for S3. Increases cold start by ~1s. Why? ENI attach (mitigated in the modern Lambda VPC architecture, but it still adds measurable time).
  6. DynamoDB table, provisioned 1000 WCU, PK = customer_id. One customer does 100k writes in 1 minute. What happens? Hot partition → throttling on that customer even though total provisioned capacity isn't exhausted.
  7. S3 bucket with versioning + lifecycle expiration = 30 days. Will it ever free the storage? Only if you also add NoncurrentVersionExpiration; otherwise the old versions pile up forever.
  8. SQS message received but the handler crashes before calling DeleteMessage. What happens? After the visibility timeout it reappears; after maxReceiveCount receives it moves to the DLQ (if configured).
  9. FIFO SQS, MessageGroupId = "orders-42". You send 500 msgs/sec for that group. Throughput? Throttled — the default cap is 300 msg/s per queue (3000 with batching); high-throughput mode raises the queue cap, but ordering still serializes each message group.
  10. CloudFront distribution with TTL=86400. You push new JS file with same name. When do users see it? Up to 24 h, unless you invalidate /* (pay per path) or (better) version the filename.
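The redrive behavior in question 8 can be modeled in a few lines. A toy simulation (names and structure mine), collapsing ReceiveMessage / crash / visibility-timeout expiry into one loop iteration:

```python
# Toy model of SQS redrive: a message received but never deleted reappears
# after its visibility timeout; once receive_count reaches maxReceiveCount,
# the redrive policy moves it to the DLQ.

MAX_RECEIVE_COUNT = 3

def simulate_crashing_consumer(attempts: int):
    """Each attempt: receive the message, crash before DeleteMessage."""
    receive_count = 0
    dlq = []
    for _ in range(attempts):
        if dlq:                          # message already redriven; queue empty
            break
        receive_count += 1               # ReceiveMessage increments the count
        # handler crashes here: no DeleteMessage, visibility timeout expires
        if receive_count >= MAX_RECEIVE_COUNT:
            dlq.append("msg")            # redrive policy kicks in
    return receive_count, dlq

count, dlq = simulate_crashing_consumer(attempts=10)
print(count, dlq)   # 3 ['msg']
```

The practical takeaways match the real service: set the visibility timeout longer than your worst-case processing time, and always configure a DLQ so poison messages stop cycling.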

Tier 3 — Mini system designs

Spend 20 minutes each, diagram on paper, speak aloud.

A. IBM MQ → AWS bridge with zero message loss

  • On-prem MQ → JMS bridge (containerized Spring Boot on-prem or on EC2) → Amazon MQ (ActiveMQ).
  • Bridge uses XA between IBM MQ and a staging queue, or idempotency keys + outbox on the AWS side.
  • DLT on the AWS side; alarm via CloudWatch → SNS → Opsgenie.
  • Runbook for resumable bridge restarts.
  • Cutover plan: run both brokers in parallel, shift consumers, drain IBM MQ, decommission.

B. URL shortener with 10k/s redirect traffic

  • DDB (PK = short code) behind Lambda behind CloudFront. Cache TTL on successful redirect = 1 h.
  • Writes: API Gateway → Lambda → DDB with condition-check (prevent collisions).
  • Analytics: DDB Stream → Kinesis Firehose → S3 + Athena.
  • Global: DDB Global Tables + CloudFront (already global).
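The condition-check write path in design B is a one-function idea. A toy sketch (all names mine) emulating DynamoDB's `attribute_not_exists(pk)` conditional put with an in-memory dict, retrying with a salted code on collision:

```python
# Sketch of the shortener write path: conditional put + retry on collision.
# "table" stands in for the DynamoDB table; CodeTaken stands in for
# ConditionalCheckFailedException.
import hashlib

class CodeTaken(Exception):
    pass

table = {}   # short code -> long URL

def conditional_put(code: str, url: str) -> None:
    """Emulates PutItem with ConditionExpression: attribute_not_exists(pk)."""
    if code in table:
        raise CodeTaken(code)
    table[code] = url

def shorten(url: str, attempt: int = 0) -> str:
    code = hashlib.sha256(f"{url}:{attempt}".encode()).hexdigest()[:7]
    try:
        conditional_put(code, url)
        return code
    except CodeTaken:
        return shorten(url, attempt + 1)   # salt and retry on collision

a = shorten("https://example.com/a")
b = shorten("https://example.com/a")   # same URL again: collide, get new code
print(a != b, table[a])   # True https://example.com/a
```

The conditional put is what makes the write race-safe under concurrency: two writers can generate the same code, but only one put succeeds and the loser retries with a new salt.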

C. Centralized observability for 20 microservices on EKS

  • ADOT collector as DaemonSet; OTLP ingest.
  • Export metrics to CloudWatch Metric Streams; traces to X-Ray + optionally a self-hosted Grafana Tempo/Jaeger; logs to CloudWatch Logs (or fluent-bit → OpenSearch if you want free-text power).
  • Logs Insights dashboards per team; Grafana on top of CloudWatch + X-Ray for cross-signal views.
  • Alerting via CloudWatch → SNS → Opsgenie; SLO dashboards per service.

Spaced-review checklist

Mark ✅ / 🔁 / ❌ per section. Repeat ❌ daily, 🔁 every other day, ✅ weekly.

  • [ ] §1 Foundations
  • [ ] §2 IAM (drill policy eval + IRSA)
  • [ ] §3 Compute decision tree
  • [ ] §4 VPC + SG vs NACL
  • [ ] §5 Route 53 + ELB
  • [ ] §6 CloudFront
  • [ ] §7 S3 (classes, pre-signed URLs, policies)
  • [ ] §8 RDS / Aurora / DynamoDB
  • [ ] §9 SQS / SNS / EventBridge / MQ / MSK / Kinesis
  • [ ] §10 Lambda + API GW + Step Functions
  • [ ] §11 ECS / EKS / IRSA
  • [ ] §12 CloudWatch + X-Ray + CloudTrail + Config
  • [ ] §13 KMS + Secrets + WAF + GuardDuty
  • [ ] §14 CI/CD on AWS
  • [ ] §15 CloudFormation / CDK / Terraform
  • [ ] §16 Cost Management
  • [ ] §17 Well-Architected 6 pillars
  • [ ] §18 HA / DR (4 strategies)
  • [ ] §19 Migration 7 Rs
  • [ ] §20 Federal / Compliance
  • [ ] §21 Experience tie-ins
  • [ ] §22 Rapid-fire (10 + 25 extras)
  • [ ] §23 Practice exercises (3 designs)

24. Further Reading

  • AWS Well-Architected Framework whitepaper + all six pillar whitepapers + Serverless Lens + FedRAMP Lens — the official rubric interviewers use.
  • The AWS Builders' Library — engineering write-ups on how AWS builds AWS. Start with "Timeouts, retries, and backoff with jitter," "Reliability, constant work, and a good cup of coffee," "Avoiding overload in distributed systems."
  • Effective AWS by Haijun Yu (2024/2025) — senior-engineer deep-dive on patterns.
  • AWS Certified Solutions Architect – Professional exam guide (just the guide — not the cert) — a cheat sheet for breadth coverage.
  • re:Invent 2025 keynotes — VPC Lattice v2, S3 Vectors, Database Savings Plans, Graviton5, new Well-Architected AI Lens. Worth skimming for "latest news" answers.
  • Official service deep-dive talks (re:Invent 300/400 level) — watch the DynamoDB, Aurora, Lambda, and EKS ones end to end before a cloud-heavy interview.
  • aws-samples GitHub org — reference architectures for nearly every pattern.
  • Werner Vogels's blog (All Things Distributed) — perspective on AWS architectural choices.

Next in the series: CONCURRENCY.md (deep dive into executors, locks, CompletableFuture, ForkJoin, virtual threads internals, structured concurrency), then SPRING_BOOT.md, then MESSAGING.md (Kafka / IBM MQ / ActiveMQ deep dive pairing with this guide's §9).

Last updated: