Should we replace Redis with Valkey now that Redis has changed its license, or stay on Redis for a 200-node deployment handling 2M ops/sec?
This verdict rests on default values for 50% of the relevant constraints. The following constraints were not provided, so defaults were used:
- team_size: standard team (5-10 engineers) (not_addressed)
- existing_stack: greenfield assumed (not_addressed)
- connection_pooler: not specified (not_addressed)
- current_state: not specified (not_addressed)
- rollback_plan: not specified (not_addressed)
- data_volume: not specified (not_addressed)
Execute a phased canary migration from Redis to Valkey 7.2.x using a dual-write proxy pattern over 4 months
Decision
Execute a phased canary migration from Redis to Valkey 7.2.x over 4 months using a dual-write proxy pattern (Envoy with the redis_proxy filter, or Twemproxy).
- Phase 1: Stand up a 20-node Valkey canary (10% of the fleet) receiving shadow writes while Redis serves all reads.
- Phase 2: Shift reads for the session-cache workload to the Valkey canary, validating p99 ≤2ms and cache hit ratio ≥85%.
- Phase 3: Expand to 100 Valkey nodes at 50% of traffic.
- Phase 4: Full 200-node cutover, with Redis kept warm for a 2-week rollback window.
Abort if Valkey p99 exceeds 3ms, cluster gossip exceeds 100 Mbps, pub/sub latency exceeds 5ms, or more than 2 nodes fail in any 7-day canary window. Key failure mode: pub/sub at 200 nodes broadcasts to all cluster members, so if real-time events exceed 100K messages/sec, internal bandwidth saturates; the mitigation is to isolate pub/sub onto a dedicated 16-node cluster. Second failure mode: cluster rebalancing storms across the 16,384 hash slots during node topology changes. Budget: $50K total. This avoids the $400K-$600K/year Redis Enterprise licensing cost and the security risk of staying on Redis 7.2 (the last BSD-3-Clause version) as patches shift to 7.4+.
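The shadow-write stage maps directly onto Envoy's redis_proxy request mirror policy. A minimal sketch of the filter config, assuming upstream clusters named redis_primary and valkey_canary (both names, the timeout, and the runtime key are placeholders for this deployment):

```yaml
filters:
- name: envoy.filters.network.redis_proxy
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.filters.network.redis_proxy.v3.RedisProxy
    stat_prefix: redis_migration
    settings:
      op_timeout: 0.2s
    prefix_routes:
      catch_all_route:
        cluster: redis_primary            # Redis stays authoritative for all traffic
        request_mirror_policy:
        - cluster: valkey_canary          # shadow target: responses are discarded
          exclude_read_commands: true     # mirror writes only; reads never fan out
          runtime_fraction:               # Phase 1 starts at 10% of write traffic
            default_value:
              numerator: 10
              denominator: HUNDRED
            runtime_key: redis_proxy.mirror_fraction
```

The runtime_fraction gate allows ramping the mirrored share (10% → 50% → 100%) without redeploying the proxy, which matches the phase boundaries above.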
Evidence boundary
Observed from your filing
- should we replace Redis with Valkey now that Redis changed its license, or stay on Redis for a 200-node deployment handling 2M ops/sec?
Assumptions used for analysis
- Valkey 7.2.x is API-compatible with the Redis commands and data structures currently used across the 200-node deployment, with no custom Redis modules or RESP3-specific features in use that the Valkey fork does not carry over
- The existing deployment runs Redis 7.2 or earlier (the last BSD-3-Clause line) and has not yet upgraded to Redis 7.4+ under the new RSALv2/SSPLv1 dual license
- Cloud infrastructure can provision 20 additional nodes for canary without exceeding quota or budget approval timelines
- The 2M ops/sec workload is distributed across session cache, rate limiting, and pub/sub — not a single monolithic use case that cannot be decomposed for phased migration
- Envoy with redis_proxy filter can handle the dual-write throughput at the required proxy layer without becoming a bottleneck itself
- The six defaulted constraints listed above were carried into the analysis as-is: team size (standard team, 5-10 engineers), existing stack (greenfield assumed), and connection pooler, current state, rollback plan, and data volume (all unspecified)
Inferred candidate specifics
- Deploy a 20-node Valkey 7.2.6 canary cluster in the same availability zone as the existing Redis deployment, configure Envoy with redis_proxy filter for dual-write from 10% of the production write path, and instrument Prometheus/Grafana dashboards tracking p99 latency, gossip bandwidth, pub/sub delivery latency, and node failure rate against the four abort thresholds.
- Council rationale: b003 had the highest confidence (0.90) among surviving branches, survived 3 rounds of adversarial challenge, including a direct attack on dual-write feasibility (b004, killed), and provided the most concrete architecture: a named proxy technology (Envoy redis_proxy), a specific phase timeline, quantified abort thresholds, named failure modes with mitigations, and a budget breakdown. b002 (0.70) was a strictly weaker version of the same recommendation without the specificity.
- Deploy 20-node Valkey 7.2.6 canary cluster with identical configuration to current Redis nodes (maxmemory policy, persistence settings, cluster-enabled yes)
- Configure Envoy proxy with redis_proxy filter for dual-write, routing 10% of write traffic to Valkey canary while Redis continues serving 100% of reads
- Build Prometheus/Grafana dashboards tracking the four abort thresholds: p99 >3ms, gossip >100 Mbps, pub/sub >5ms, >2 node failures/7 days, plus cache hit ratio ≥85%
- Run canary for 2 weeks under shadow write load, comparing Valkey p99/p999 latency distributions against Redis baseline at equivalent traffic volume
- Benchmark Valkey pub/sub message throughput on the 20-node canary to validate the 100K messages/sec threshold before Phase 2 read shifting
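The four abort thresholds translate naturally into Prometheus alerting rules, so the go/no-go decision is mechanical rather than judgment-based. A sketch, with hypothetical metric names (valkey_command_duration_seconds_bucket, valkey_cluster_bus_bytes_total, valkey_pubsub_delivery_seconds_bucket, valkey_node_failures_total are placeholders; substitute whatever your exporter actually emits):

```yaml
groups:
- name: valkey-canary-abort
  rules:
  - alert: ValkeyP99High                  # abort threshold 1: p99 > 3ms
    expr: histogram_quantile(0.99, sum(rate(valkey_command_duration_seconds_bucket[5m])) by (le)) > 0.003
    for: 10m
  - alert: GossipBandwidthHigh            # abort threshold 2: cluster bus > 100 Mbps
    expr: sum(rate(valkey_cluster_bus_bytes_total[5m])) * 8 > 100e6
    for: 10m
  - alert: PubSubLatencyHigh              # abort threshold 3: pub/sub delivery > 5ms
    expr: histogram_quantile(0.99, sum(rate(valkey_pubsub_delivery_seconds_bucket[5m])) by (le)) > 0.005
    for: 10m
  - alert: CanaryNodeFailures             # abort threshold 4: > 2 failures per 7 days
    expr: sum(increase(valkey_node_failures_total[7d])) > 2
```

Wiring these into the canary's Alertmanager makes each abort condition page the migration owner instead of relying on dashboard review.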
Unknowns blocking a firmer verdict
- Valkey 7.2.x cluster behavior at exactly 200 nodes is not widely benchmarked in public literature — the gossip bandwidth and rebalancing storm thresholds are engineering estimates, not production-validated numbers at this specific scale
- b003's budget of $50K is a rough estimate — actual costs depend heavily on cloud provider, instance types, and whether reserved/spot pricing is available for the canary phase
- The pub/sub 100K messages/sec threshold for bandwidth saturation is model-derived, not benchmarked against Valkey's specific cluster broadcast implementation
- The Redis 7.2 security patch timeline is uncertain: Redis Ltd may keep shipping critical CVE patches for 7.2 longer than expected, or may stop sooner
- b004 (killed) raised a valid concern about dual-write consistency during network partitions that b003 addresses only via abort thresholds, not via a formal consistency protocol
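Although the 100K messages/sec saturation threshold above is model-derived rather than benchmarked, the fanout arithmetic behind it can be sanity-checked. A sketch assuming 512-byte payloads, 64 bytes of cluster-bus framing overhead, and worst-case broadcast of every message to all other nodes (all three numbers are assumptions, not measurements):

```python
def pubsub_bus_gbps(msg_per_sec: float, payload_bytes: int, nodes: int,
                    overhead_bytes: int = 64) -> float:
    """Aggregate cluster-bus traffic in Gbit/s when every published message
    is relayed once to each of the other (nodes - 1) cluster members."""
    wire_bytes = payload_bytes + overhead_bytes   # payload + framing on the wire
    fanout = nodes - 1                            # broadcast to every other node
    return msg_per_sec * wire_bytes * fanout * 8 / 1e9

# 100K msg/s of 512-byte events across 200 nodes
print(round(pubsub_bus_gbps(100_000, 512, 200), 1))  # → 91.7 Gbit/s
```

At ~92 Gbit/s of aggregate bus traffic, the saturation claim is plausible for typical 10-25 Gbit/s instance NICs, which is consistent with the dedicated 16-node pub/sub cluster mitigation (fanout of 15 instead of 199).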