should we replace Redis with Valkey now that Redis changed its license, or stay on Redis for a 200-node deployment handling 2M ops/sec?

accepted_conditional · Pro · 629s · $0.77

This verdict assumes default values for 50% of the constraints.

The constraints that were not provided, and the defaults used in their place, are listed under "Assumptions used for analysis" below.

5 branches explored · 2 survived · 3 rounds · integrity 75%
82% confidence

Execute a phased canary migration from Redis to Valkey 7.2.x using a dual-write proxy pattern over 4 months


Decision

Concrete components, topology, and thresholds named below are candidate mitigations or example implementations inferred by the Council. They were not confirmed in your filing or established as part of your current environment.

Execute a phased canary migration from Redis to Valkey 7.2.x over 4 months using a dual-write proxy pattern (Envoy with its redis_proxy filter, or Twemproxy).

  • Phase 1: Stand up a 20-node Valkey canary (10% of the fleet) receiving shadow writes while Redis serves all reads.
  • Phase 2: Shift reads for the session-cache workload to the Valkey canary, validating p99 ≤2ms and a cache hit ratio ≥85%.
  • Phase 3: Expand to 100 Valkey nodes at 50% of traffic.
  • Phase 4: Complete the full 200-node cutover, keeping Redis warm for a 2-week rollback window.

Abort if Valkey p99 exceeds 3ms, cluster gossip exceeds 100 Mbps, pub/sub latency exceeds 5ms, or more than 2 nodes fail in any 7-day canary window.

Key failure mode: pub/sub at 200 nodes broadcasts to all cluster members, so if real-time events exceed 100K messages/sec, internal cluster bandwidth saturates. Mitigation: isolate pub/sub onto a dedicated 16-node cluster. Second failure mode: cluster rebalancing storms across the 16,384 hash slots during node topology changes.

Budget: $50K total. This avoids the $400K-$600K/year Redis Enterprise licensing cost and the security risk of staying on Redis 7.2 (the last BSD-3-Clause version) as patches shift to 7.4+.
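As a concrete illustration of the dual-write pattern named above, the sketch below mirrors a fraction of write commands to the Valkey canary using Envoy's redis_proxy request-mirror policy. The hostnames, cluster names, and the 10% runtime fraction are hypothetical placeholders inferred for this example, not values from your filing.

```yaml
# Hedged sketch: Envoy redis_proxy mirroring writes to the Valkey canary.
# Addresses, cluster names, and the 10% fraction are placeholders.
static_resources:
  listeners:
  - name: redis_ingress
    address:
      socket_address: { address: 0.0.0.0, port_value: 6379 }
    filter_chains:
    - filters:
      - name: envoy.filters.network.redis_proxy
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.redis_proxy.v3.RedisProxy
          stat_prefix: redis_migration
          settings:
            op_timeout: 0.2s
          prefix_routes:
            catch_all_route:
              cluster: redis_primary          # production serves all reads and writes
              request_mirror_policy:
              - cluster: valkey_canary        # shadow traffic only; responses discarded
                exclude_read_commands: true   # mirror writes, not reads (Phase 1)
                runtime_fraction:             # start at 10% of write traffic
                  default_value: { numerator: 10, denominator: HUNDRED }
                  runtime_key: valkey.mirror_fraction
  clusters:
  - name: redis_primary
    type: STRICT_DNS
    connect_timeout: 1s
    load_assignment:
      cluster_name: redis_primary
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address: { address: redis.internal, port_value: 6379 }
  - name: valkey_canary
    type: STRICT_DNS
    connect_timeout: 1s
    load_assignment:
      cluster_name: valkey_canary
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address: { address: valkey-canary.internal, port_value: 6379 }
```

Envoy treats mirrored traffic as fire-and-forget: responses from the mirror cluster are discarded, so a slow canary cannot add latency to the production path, but it also means divergence must be caught by the comparison dashboards rather than by the proxy itself.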

Next actions

Deploy 20-node Valkey 7.2.6 canary cluster with identical configuration to current Redis nodes (maxmemory policy, persistence settings, cluster-enabled yes); a config fragment is sketched after this list
infra · immediate
Configure Envoy proxy with redis_proxy filter for dual-write, routing 10% of write traffic to Valkey canary while Redis continues serving 100% of reads
infra · immediate
Build Prometheus/Grafana dashboards tracking the four abort thresholds: p99 >3ms, gossip >100 Mbps, pub/sub >5ms, >2 node failures/7 days, plus cache hit ratio ≥85% (example alert rules are sketched after this list)
infra · immediate
Run canary for 2 weeks under shadow write load, comparing Valkey p99/p999 latency distributions against Redis baseline at equivalent traffic volume
backend · immediate
Benchmark Valkey pub/sub message throughput on the 20-node canary to validate the 100K messages/sec threshold before Phase 2 read shifting (a throughput probe is sketched after this list)
backend · before_launch
At end of Phase 1 (Month 2), evaluate canary metrics against abort thresholds and decide whether to proceed to Phase 2 read shifting or abort migration
infra · before_launch
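For the first action, a minimal valkey.conf fragment covering the knobs it names. Every value shown is illustrative; the real settings must be copied from the production redis.conf so the canary matches the fleet exactly.

```
# Hedged sketch: valkey.conf fragment for the canary nodes.
# Values are placeholders -- copy the real ones from production redis.conf.
cluster-enabled yes
cluster-node-timeout 15000
maxmemory 24gb
maxmemory-policy allkeys-lru   # must match the Redis fleet exactly
appendonly yes                 # keep persistence identical to Redis
appendfsync everysec
save ""                        # disable RDB snapshots if production Redis does
```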
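For the dashboard action, a sketch of the four abort thresholds expressed as Prometheus alert rules. The metric names are placeholders that assume an exporter exposing latency histograms and cluster-bus counters; substitute whatever your actual exporter publishes.

```yaml
# Hedged sketch: abort thresholds as Prometheus alert rules.
# Metric names (valkey_*) are placeholders for your exporter's actual series.
groups:
- name: valkey_canary_abort_thresholds
  rules:
  - alert: ValkeyP99TooHigh
    expr: >
      histogram_quantile(0.99,
        sum(rate(valkey_command_latency_seconds_bucket{cluster="canary"}[5m])) by (le))
      > 0.003
    for: 15m
    labels: { action: abort }
    annotations: { summary: "Canary p99 latency above the 3ms abort threshold" }
  - alert: ValkeyGossipBandwidthHigh
    expr: sum(rate(valkey_cluster_bus_bytes_total{cluster="canary"}[5m])) * 8 > 100e6
    for: 15m
    labels: { action: abort }
    annotations: { summary: "Cluster gossip bandwidth above 100 Mbps" }
  - alert: ValkeyPubSubLatencyHigh
    expr: >
      histogram_quantile(0.99,
        sum(rate(valkey_pubsub_delivery_seconds_bucket{cluster="canary"}[5m])) by (le))
      > 0.005
    for: 15m
    labels: { action: abort }
    annotations: { summary: "Pub/sub delivery latency above 5ms" }
  - alert: ValkeyNodeFailuresExcessive
    expr: increase(valkey_node_failures_total{cluster="canary"}[7d]) > 2
    labels: { action: abort }
    annotations: { summary: "More than 2 node failures in a 7-day window" }
```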
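For the pub/sub benchmark action, a rough single-channel throughput probe. It assumes redis-py (Valkey speaks the same RESP protocol, so the client works unchanged) and a hypothetical valkey-canary.internal endpoint.

```python
"""Rough pub/sub throughput probe for the Valkey canary.

Assumptions (not from the filing): redis-py is installed and the canary is
reachable at valkey-canary.internal:6379. Pub/sub is fire-and-forget, so any
dropped messages show up as a gap between published and delivered counts.
"""
import threading
import time

import redis

HOST, PORT, CHANNEL = "valkey-canary.internal", 6379, "bench"
N_MESSAGES = 100_000
PAYLOAD = b"x" * 200  # approximate production message size -- adjust to match

received = 0

def consume() -> None:
    """Count delivered messages until the expected total arrives."""
    global received
    ps = redis.Redis(host=HOST, port=PORT).pubsub()
    ps.subscribe(CHANNEL)
    for msg in ps.listen():
        if msg["type"] == "message":
            received += 1
            if received >= N_MESSAGES:
                break

t = threading.Thread(target=consume, daemon=True)
t.start()
time.sleep(1)  # let the subscription settle before publishing

pub = redis.Redis(host=HOST, port=PORT)
start = time.monotonic()
for _ in range(N_MESSAGES):
    pub.publish(CHANNEL, PAYLOAD)
t.join(timeout=60)
elapsed = time.monotonic() - start

print(f"delivered {received}/{N_MESSAGES} messages "
      f"at {received / elapsed:,.0f} msg/s over {elapsed:.1f}s")
```

A single unpipelined publisher is usually the bottleneck in this setup, so treat the result as a per-connection floor and run several publishers in parallel across channels to approach the 100K messages/sec threshold under test.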
This verdict stops being true when
Valkey canary fails abort thresholds during Phase 1 (p99 >3ms sustained, gossip >100 Mbps, or >2 node failures) and root cause is a fundamental Valkey architectural limitation rather than configuration → Negotiate Redis Enterprise commercial license despite the $400K-$600K/year cost, or evaluate DragonflyDB if BSL 1.1 is legally acceptable for the organization
Redis Ltd reverses the SSPL/RSAL license change or creates a permissive-use exemption for self-hosted non-competing deployments → Stay on Redis, upgrade to latest version, cancel migration
Pub/sub workload exceeds 100K messages/sec and cannot be isolated onto a dedicated cluster due to application coupling, causing cluster bandwidth saturation at scale → Migrate session/rate-limiting to Valkey but move pub/sub workload to a dedicated message broker (Kafka, NATS) rather than running it on Valkey cluster

Council notes

Socrates
RECOMMENDATION: Treat the Redis license change as a legal/contractual issue, not a technical one. Before committing t...
Vulcan
Explore the technical and operational feasibility of migrating the 200-node Redis deployment to Valkey, focusing on m...
Daedalus
RECOMMENDATION: Execute a phased canary migration from Redis to Valkey 7.2.x using a dual-write proxy pattern over 4 months...
Loki
A dual-write proxy pattern on 200 nodes at 2M ops/sec introduces inevitable consistency risks from out-of-order delivery...

Evidence boundary

Observed from your filing

  • should we replace Redis with Valkey now that Redis changed its license, or stay on Redis for a 200-node deployment handling 2M ops/sec?

Assumptions used for analysis

  • Valkey 7.2.x is API-compatible with the Redis commands and data structures currently used across the 200-node deployment, with no custom Redis modules or RESP3-specific features that Valkey hasn't forked (a quick command-surface probe is sketched after this list)
  • The existing deployment runs Redis 7.2 or earlier (the last BSD-3-Clause version) and has not yet upgraded to Redis 7.4+ under the new SSPL/RSAL license
  • Cloud infrastructure can provision 20 additional nodes for canary without exceeding quota or budget approval timelines
  • The 2M ops/sec workload is distributed across session cache, rate limiting, and pub/sub — not a single monolithic use case that cannot be decomposed for phased migration
  • Envoy with redis_proxy filter can handle the dual-write throughput at the required proxy layer without becoming a bottleneck itself
  • team size defaulted: standard team (5-10 engineers) (not_addressed)
  • existing stack defaulted: greenfield assumed (not_addressed)
  • connection pooler defaulted: not specified (not_addressed)
  • current state defaulted: not specified (not_addressed)
  • rollback plan defaulted: not specified (not_addressed)
  • data volume defaulted: not specified (not_addressed)
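One cheap way to pressure-test the API-compatibility assumption in the first bullet above: diff the command surfaces the two servers advertise. The sketch uses redis-py's generic command interface and COMMAND LIST (available on Redis 7.0+ and Valkey 7.2); hostnames are placeholders.

```python
"""Diff the command sets advertised by the current Redis fleet and the
Valkey canary. Hostnames are placeholders, not values from the filing."""
import redis

def command_set(host: str, port: int = 6379) -> set[str]:
    # COMMAND LIST returns every command name the server supports (7.0+).
    client = redis.Redis(host=host, port=port)
    return {name.decode() for name in client.execute_command("COMMAND", "LIST")}

redis_cmds = command_set("redis.internal")
valkey_cmds = command_set("valkey-canary.internal")

missing = sorted(redis_cmds - valkey_cmds)
print(f"commands on Redis but not on Valkey: {missing or 'none'}")
```

A matching command list only rules out missing commands; modules and RESP3 push behavior, which the same assumption covers, still need an application-level test.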

Inferred candidate specifics

These details were introduced by the Council during analysis. They were not supplied in your filing.

  • The full phased migration plan quoted verbatim in the Decision section above (proxy technology, four-phase timeline, abort thresholds, named failure modes with mitigations, and the $50K budget)
  • Deploy a 20-node Valkey 7.2.6 canary cluster in the same availability zone as the existing Redis deployment, configure Envoy with redis_proxy filter for dual-write from 10% of the production write path, and instrument Prometheus/Grafana dashboards tracking p99 latency, gossip bandwidth, pub/sub delivery latency, and node failure rate against the four abort thresholds.
  • b003 had the highest confidence (0.90) among surviving branches, survived 3 rounds of adversarial challenge including a direct attack on dual-write feasibility (b004, killed), and provided the most concrete architecture: named proxy technology (Envoy redis_proxy), specific phase timeline, quantified abort thresholds, named failure modes with mitigations, and a budget breakdown. b002 (0.70) was a strictly weaker version of the same recommendation without the specificity.
  • The next actions listed above (canary cluster deployment and configuration, Envoy dual-write routing, the abort-threshold dashboards, the 2-week shadow-write comparison run, and the pub/sub throughput benchmark)

Unknowns blocking a firmer verdict

  • Valkey 7.2.x cluster behavior at exactly 200 nodes is not widely benchmarked in public literature — the gossip bandwidth and rebalancing storm thresholds are engineering estimates, not production-validated numbers at this specific scale
  • b003's budget of $50K is a rough estimate — actual costs depend heavily on cloud provider, instance types, and whether reserved/spot pricing is available for the canary phase
  • The pub/sub 100K messages/sec threshold for bandwidth saturation is model-derived, not benchmarked against Valkey's specific cluster broadcast implementation
  • Redis 7.2 security patch timeline is uncertain — Redis Ltd may continue critical CVE patches longer than expected, or may not
  • b004 (killed) raised a valid concern about dual-write consistency during network partitions that b003 addresses only via abort thresholds, not via a formal consistency protocol

Operational signals to watch

reversal — Valkey canary fails abort thresholds during Phase 1 (p99 >3ms sustained, gossip >100 Mbps, or >2 node failures) and root cause is a fundamental Valkey architectural limitation rather than configuration
reversal — Redis Ltd reverses the SSPL/RSAL license change or creates a permissive-use exemption for self-hosted non-competing deployments
reversal — Pub/sub workload exceeds 100K messages/sec and cannot be isolated onto a dedicated cluster due to application coupling, causing cluster bandwidth saturation at scale

Branch battle map

[Battle map visualization: rounds R1-R3 with a censor reopen · branches b001-b005]
Battle timeline (3 rounds)
Round 1 — Initial positions · 2 branches
Branch b001 (Socrates) eliminated — This branch has fundamental structural problems that make...
Round 2 — Adversarial probes · 3 branches
Loki proposed branch b004
Branch b004 (Loki) eliminated — Branch b004 posits that dual-write introduces insurmountable...
Socrates proposed branch b005
Loki A dual-write proxy pattern on 200 nodes at 2M ops/sec introduces inevitable consistency…
Socrates RECOMMENDATION: Treat the Redis license change as a legal/contractual issue, not…
Round 3 — Final convergence · 2 branches
Branch b005 (Socrates) eliminated — This branch has a fatal structural flaw: it treats the Redis...