Our payment service had 3 cascading failures in 6 months — all triggered by a downstream timeout. Should we add a circuit breaker, switch to async processing, or both?
Implement a circuit breaker using Resilience4j (Java)
Decision
Implement a circuit breaker using Resilience4j/Polly/equivalent — a library, not a new service. Configure: a 50% failure rate threshold over a 20-request sliding window, a 30-second open duration, 3 half-open probe requests, and a 5-second downstream call timeout (replacing the likely 30s+ default that causes thread pool exhaustion). When the circuit opens, return HTTP 503 with a Retry-After: 30 header. Add in-process retries with exponential backoff (2s, 4s, 8s, max 3 attempts) using an existing task queue or scheduled executor — no new infrastructure. Critical failure mode: intermittent failures at a ~40% error rate never trip the circuit. Mitigate this by adding an 80% slow-call rate threshold at 5 seconds alongside the failure rate threshold. The economics are clear: a 30-second false trip costs ~$375 in rejected transactions versus $180K per cascading-failure outage. One part-time senior engineer can deliver this in 5-8 working days. This is a library-level change, not an architecture change.
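Below is a minimal configuration sketch, assuming the Java path named in the title (Resilience4j); Polly or another stack equivalent would mirror the same settings. The class name PaymentGatewayResilience and the breaker name "paymentGateway" are illustrative, and every value simply restates the thresholds above.

```java
// Sketch only: assumes Resilience4j on the JVM; the breaker name "paymentGateway"
// and this class name are illustrative, not taken from the existing codebase.
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import io.github.resilience4j.circuitbreaker.CircuitBreakerRegistry;
import io.github.resilience4j.timelimiter.TimeLimiter;
import io.github.resilience4j.timelimiter.TimeLimiterConfig;

import java.time.Duration;

public final class PaymentGatewayResilience {

    // 50% failure rate over a 20-call count-based window, plus an 80% slow-call
    // rate at 5s so the intermittent-failure mode noted above still trips the breaker.
    static final CircuitBreakerConfig BREAKER_CONFIG = CircuitBreakerConfig.custom()
            .failureRateThreshold(50)
            .slowCallRateThreshold(80)
            .slowCallDurationThreshold(Duration.ofSeconds(5))
            .slidingWindowType(CircuitBreakerConfig.SlidingWindowType.COUNT_BASED)
            .slidingWindowSize(20)
            .waitDurationInOpenState(Duration.ofSeconds(30))
            .permittedNumberOfCallsInHalfOpenState(3)
            .build();

    // Hard 5-second ceiling on the downstream call, replacing the long default timeout
    // blamed for thread pool exhaustion; alternatively enforce 5s on the HTTP client itself.
    static final TimeLimiterConfig TIMEOUT_CONFIG = TimeLimiterConfig.custom()
            .timeoutDuration(Duration.ofSeconds(5))
            .build();

    public static final CircuitBreaker BREAKER =
            CircuitBreakerRegistry.of(BREAKER_CONFIG).circuitBreaker("paymentGateway");

    public static final TimeLimiter TIMEOUT =
            TimeLimiter.of("paymentGateway", TIMEOUT_CONFIG);

    private PaymentGatewayResilience() { }
}
```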
Evidence boundary
Observed from your filing
- Our payment service had 3 cascading failures in 6 months — all triggered by a downstream timeout. Should we add a circuit breaker, switch to async processing, or both?
Assumptions used for analysis
- The downstream payment gateway timeout is currently set to 30s+ and thread pool exhaustion is the cascading failure mechanism
- The team has access to a circuit breaker library (Resilience4j, Polly, or equivalent) compatible with their stack at zero additional cost
- The payment service processes enough requests that a 20-request sliding window provides meaningful signal (not so low-volume that the window covers hours of traffic)
- The $180K per outage estimate is roughly accurate, making the $375 false-trip cost an acceptable trade-off
- One part-time senior engineer is available for 5-8 working days of implementation
- Current traffic scale was not addressed in the filing; moderate scale is assumed by default
Inferred candidate specifics
- Write the circuit breaker configuration class using Resilience4j (or stack equivalent) with these exact parameters: 50% failure rate threshold, 20-request sliding window, 80% slow-call rate at 5 seconds, 30-second open duration, 3 half-open probes, and wire it around the downstream payment gateway client with a 5-second call timeout replacing the current default (a wiring sketch follows this list).
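As a companion sketch under the same assumptions, the wiring below reuses PaymentGatewayResilience.BREAKER from the Decision sketch and maps an open circuit to HTTP 503 with a Retry-After: 30 header. GatewayCallWrapper and the Outcome record are hypothetical placeholders for the service's real request and response types; in practice the 503 mapping would live in whatever HTTP layer fronts the payment endpoint.

```java
// Sketch only: Outcome and GatewayCallWrapper are hypothetical stand-ins.
import io.github.resilience4j.circuitbreaker.CallNotPermittedException;

import java.util.function.Supplier;

public final class GatewayCallWrapper {

    /** Minimal stand-in for an HTTP result: status code, body, optional Retry-After. */
    public record Outcome<T>(int status, T body, String retryAfterSeconds) { }

    // Every downstream gateway call goes through the shared breaker; when the circuit
    // is open the call is rejected immediately instead of tying up a request thread.
    public static <T> Outcome<T> callGateway(Supplier<T> downstreamCall) {
        try {
            T body = PaymentGatewayResilience.BREAKER.executeSupplier(downstreamCall);
            return new Outcome<>(200, body, null);
        } catch (CallNotPermittedException circuitOpen) {
            // Circuit open: fail fast with 503 and tell clients when to retry.
            return new Outcome<>(503, null, "30");
        }
    }

    private GatewayCallWrapper() { }
}
```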
Council notes
- b003 had the highest confidence (0.94), survived 3 rounds of adversarial debate including splits and strengthening, named specific libraries and exact configuration parameters, quantified failure-mode costs ($375 false trip vs. $180K cascade), provided an implementation timeline (5-8 days), and identified two specific failure modes with mitigations. No other surviving branch approached this level of specificity.
Next actions
- Write a circuit breaker wrapper around the downstream payment gateway client using Resilience4j/Polly with the specified thresholds (50% failure rate, 20-request window, 5s call timeout, 30s open duration, 3 half-open probes)
- Add an in-process retry mechanism with exponential backoff (2s, 4s, 8s, max 3 attempts) for failed payments using an existing ScheduledExecutorService or equivalent (see the sketch after this list)
- Run load test simulating downstream timeout scenarios to verify circuit trips correctly and half-open recovery works before production deployment
- Set up alerts on circuit breaker state transitions (closed→open, open→half-open, half-open→closed) and track the false-trip rate over the first 30 days (a listener sketch follows this list)
- Pull incident reports from the 3 cascading failures to verify the actual downstream timeout value, confirm thread pool exhaustion as root cause, and measure real cost per outage for threshold calibration
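The retry and alerting actions above could look roughly like the following, using only the JDK's ScheduledExecutorService plus Resilience4j's event publisher. PaymentRecovery, alert, and submitPayment are hypothetical hooks standing in for the team's existing alerting and payment-submission code.

```java
// Sketch only: SCHEDULER, alert and submitPayment are placeholders; production code
// would bound retries per payment and persist unrecovered payments for reconciliation.
import io.github.resilience4j.circuitbreaker.CircuitBreaker;

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

public final class PaymentRecovery {

    private static final ScheduledExecutorService SCHEDULER =
            Executors.newSingleThreadScheduledExecutor();

    // Fire an alert on every breaker state change (closed->open, open->half-open, ...).
    public static void registerStateAlerts(CircuitBreaker breaker, Consumer<String> alert) {
        breaker.getEventPublisher().onStateTransition(event ->
                alert.accept("payment breaker: " + event.getStateTransition()));
    }

    // In-process retry with exponential backoff: 2s, 4s, 8s, then stop (max 3 attempts).
    public static void retryWithBackoff(Runnable submitPayment, int attempt) {
        if (attempt > 3) {
            return; // exhausted; leave the payment for manual follow-up
        }
        long delaySeconds = 1L << attempt; // attempt 1 -> 2s, 2 -> 4s, 3 -> 8s
        SCHEDULER.schedule(() -> {
            try {
                submitPayment.run();
            } catch (RuntimeException stillFailing) {
                retryWithBackoff(submitPayment, attempt + 1);
            }
        }, delaySeconds, TimeUnit.SECONDS);
    }

    private PaymentRecovery() { }
}
```

Hooking the alerts up would be a single call such as PaymentRecovery.registerStateAlerts(PaymentGatewayResilience.BREAKER, msg -> pagerClient.notify(msg)), where pagerClient is whatever alerting client the team already runs.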
Unknowns blocking a firmer verdict
- The actual current downstream timeout value is assumed to be 30s+ based on typical payment gateway defaults — the real value should be verified before configuring the 5-second replacement
- The $180K per outage figure and 4-hour outage duration are from the winning branch but are not verified against incident data — the actual cost per outage should be measured
- Whether the downstream provider's failure pattern is truly random or correlated (e.g., end-of-month settlement spikes) affects whether a fixed sliding window is the right detection mechanism
- The killed branch b005's async payment pipeline may be the correct long-term architecture if the circuit breaker alone doesn't reduce failure frequency — this should be revisited after 3 months of circuit breaker operation
- The killed branch b004 raised a valid point that timeouts may signal upstream overload rather than downstream failure — root cause analysis of the 3 incidents should confirm the actual failure mechanism