Should we migrate from DynamoDB to PostgreSQL with Citus for a multi-tenant SaaS with 2,000 tenants and 50ms p99 latency?

accepted_conditional · Pro · 646s · $0.86

This verdict relies on default assumptions for 40% of the relevant constraints.

The constraints that were not provided in the filing, and the default values used in their place, are listed under Evidence boundary below.

7 branches explored · 3 survived · 3 rounds · integrity 75%
72% confidence
Risk: unknown

Migrate to PostgreSQL with Citus on Azure Managed Hyperscale (Cosmos DB for PostgreSQL) with 1 coordinator (8 vCores)

Decision

Concrete components, topology, and thresholds named below are candidate mitigations or example implementations inferred by the Council. They were not confirmed in your filing or established as part of your current environment.

Migrate to PostgreSQL with Citus on Azure Managed Hyperscale (Cosmos DB for PostgreSQL): 1 coordinator (8 vCores, 32GB RAM) + 4 worker nodes (4 vCores, 32GB RAM each). Use tenant_id as the distribution column with co-location. Estimated cost: ~$4,200/month vs. $28K+/month for DynamoDB.

Single-tenant queries (90%+ of the workload) route to a single shard at 5-15ms p99; cross-tenant JOINs hit 20-45ms p99, meeting the 50ms target with ~10% headroom.

For the migration itself, use pgloader for the bulk load and AWS DMS for CDC during cutover, with a 2-week dual-write period (DynamoDB retained as a read fallback via application-level routing).

Critical failure mode: hot tenant skew. If the top 3 tenants represent >40% of data/queries, isolate them onto dedicated worker nodes using Citus tenant isolation (shard_count=1 per large tenant). If skew exceeds 60% on any single worker, p99 will breach 50ms under concurrent load.

Self-managed Citus on AWS is rejected as a hidden budget killer: dual-running DynamoDB ($28K/month) plus self-managed Citus ($8K/month) plus engineering time blows the budget by month 4.

Next actions

Deploy proof-of-concept Azure Hyperscale Citus cluster (1 coordinator + 1 worker), load top 3 tenants by data volume, replay 24h production query logs, measure p99 latency
backend · immediate
Measure current DynamoDB hot partition distribution: identify top 3 tenants by query volume and data size, calculate skew percentage to determine if tenant isolation will be needed on Citus
data · immediate
Set up pgloader bulk migration pipeline and AWS DMS CDC replication from DynamoDB to Citus staging environment
infra · before_launch
If existing infrastructure is AWS-only, evaluate whether cross-cloud latency to Azure is acceptable or whether self-managed Citus on AWS with extended budget timeline is preferable
infra · immediate
Set up p99 latency alerting at 45ms threshold (5ms buffer) on the Citus coordinator and per-worker node query latency dashboards
infra · before_launch
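The skew-measurement action above reduces to one ratio over per-tenant query counts. A minimal sketch, with hypothetical tenant IDs and volumes, checked against the Council's >40% isolation threshold:

```python
def top_n_skew(query_counts: dict[str, int], n: int = 3) -> float:
    """Fraction of total query volume attributable to the top-n tenants."""
    total = sum(query_counts.values())
    if total == 0:
        return 0.0
    top = sorted(query_counts.values(), reverse=True)[:n]
    return sum(top) / total

# Hypothetical per-tenant daily query counts from DynamoDB metrics
counts = {"t1": 500_000, "t2": 300_000, "t3": 200_000,
          "t4": 50_000, "t5": 50_000}
skew = top_n_skew(counts)       # 1_000_000 / 1_100_000 ≈ 0.909
needs_isolation = skew > 0.40   # the >40% tenant-isolation threshold
print(f"top-3 skew: {skew:.0%}, isolation needed: {needs_isolation}")
```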
This verdict stops being true when
DynamoDB costs are primarily driven by implementation issues (poor partition key design, over-provisioned capacity) and a 30%+ cost reduction is achievable through optimization alone → Optimize existing DynamoDB setup: redesign partition keys, implement auto-scaling, add DAX caching layer, defer migration
Proof-of-concept shows coordinator bottleneck at 2,000 tenants causes p99 > 50ms under production-equivalent concurrent load → Evaluate self-managed Citus on AWS with multiple coordinators, or consider CockroachDB/TiDB as distributed SQL alternatives without single-coordinator constraint
Existing infrastructure is entirely AWS-native and cross-cloud latency to Azure adds >10ms to p99, eating the safety margin → Deploy self-managed Citus on AWS EC2/EKS with increased budget allocation for DBA operational overhead
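The third flip condition is simple budget arithmetic: the cross-tenant projection runs as high as 45ms p99 against a 50ms target, so added cross-cloud round-trip can consume the margin. A sketch of the check (baseline figures are illustrative):

```python
SLO_MS = 50.0  # the stated p99 latency target

def verdict_flips(base_p99_ms: float, cross_cloud_rtt_ms: float) -> bool:
    """True when added cross-cloud latency pushes projected p99 past the SLO."""
    return base_p99_ms + cross_cloud_rtt_ms > SLO_MS

# With a hypothetical 41ms baseline, a >10ms cross-cloud hop breaches the
# SLO, while a 5ms hop still leaves 4ms of headroom.
print(verdict_flips(41.0, 10.1), verdict_flips(41.0, 5.0))
```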

Council notes

Vulcan
Propose a hybrid architecture: retain DynamoDB for read-heavy, non-relational workloads while introducing PostgreSQL ...
Socrates
Before considering migration, conduct a comprehensive database implementation audit of the current DynamoDB setup. Ma...
Daedalus
RECOMMENDATION: Migrate to PostgreSQL with Citus on Azure Managed Hyperscale (Citus), NOT self-managed Citus on EC2/R...
Loki
Azure Cosmos DB for PostgreSQL (Hyperscale Citus) recommendation ignores real-world coordinator bottlenecks: with 2,0...

Evidence boundary

Observed from your filing

  • Should we migrate from DynamoDB to PostgreSQL with Citus for a multi-tenant SaaS with 2,000 tenants and 50ms p99 latency?

Assumptions used for analysis

  • DynamoDB cost (~$28K/month) is the primary driver for migration, not a misidentified implementation issue
  • The existing SaaS can tolerate a cross-cloud database dependency on Azure if other services remain on AWS
  • 90%+ of queries are single-tenant scoped (tenant_id filtered), making shard-local routing the dominant access pattern
  • The engineering team has sufficient PostgreSQL operational expertise to manage the migration and ongoing operations even with managed Citus
  • The 2-week dual-write cutover window is achievable given schema complexity and data volume across 2,000 tenants
  • current scale defaulted: moderate scale assumed (not_addressed)
  • existing stack defaulted: greenfield assumed (not_addressed)
  • connection pooler defaulted: not specified (not_addressed)
  • data volume defaulted: not specified (not_addressed)
  • traffic shape defaulted: not specified (not_addressed)
  • current bottleneck defaulted: not specified (not_addressed)

Inferred candidate specifics

These details were introduced by the Council during analysis. They were not supplied in your filing.

  • Deploy a 2-node Azure Cosmos DB for PostgreSQL (Hyperscale Citus) proof-of-concept cluster with 1 coordinator + 1 worker node, load 3 representative tenants (including the largest by data volume), distribute on tenant_id, replay 24 hours of production query logs via pgbench, and measure p99 latency against the 50ms target before committing to full migration.
  • b003 (0.86 confidence) narrowly exceeded b002 (0.85) and was selected because it names specific node configurations, cost projections, migration tooling (pgloader, AWS DMS), concrete failure modes with quantified thresholds (>40% skew, >60% worker saturation), and an actionable architecture. b002 provides a sound decision framework but lacks architectural specificity — it says 'if analysis confirms, then migrate' without detailing what the migration looks like. b003 survived 3 rounds of adversarial strengthening and provides the most execution-ready path.

Unknowns blocking a firmer verdict

  • Coordinator bottleneck at 2,000 tenants: killed branch b005 cited case studies from Framer and Heap showing coordinator hotspotting spiking p99 to 150ms+. This was auto-pruned as unsupported but the concern is architecturally valid and untested in this specific workload profile.
  • Cross-cloud migration complexity: if existing services are on AWS, moving the database to Azure introduces cross-cloud latency and data transfer costs not accounted for in the $4,200/month estimate.
  • The $4,200/month Azure cost and $28K/month DynamoDB cost are model-generated projections without cited production benchmarks for this specific workload volume.
  • No evidence that the current DynamoDB bottleneck has been formally diagnosed — b002/b007's concern that the problem may be implementation rather than technology remains valid.
  • Actual query patterns and data volume per tenant not specified — latency projections assume typical multi-tenant SaaS workloads.

Operational signals to watch

reversal — DynamoDB costs are primarily driven by implementation issues (poor partition key design, over-provisioned capacity) and a 30%+ cost reduction is achievable through optimization alone
reversal — Proof-of-concept shows coordinator bottleneck at 2,000 tenants causes p99 > 50ms under production-equivalent concurrent load
reversal — Existing infrastructure is entirely AWS-native and cross-cloud latency to Azure adds >10ms to p99, eating the safety margin
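Both the 45ms alert threshold from the next actions and the proof-of-concept p99 measurement come down to a nearest-rank percentile over replayed query latencies. A minimal sketch, with illustrative samples (metric collection and alert wiring are deployment-specific):

```python
import math

P99_ALERT_MS = 45.0  # 5ms buffer under the 50ms p99 SLO

def p99(samples_ms: list[float]) -> float:
    """Nearest-rank 99th percentile of a batch of latency samples."""
    if not samples_ms:
        raise ValueError("no samples")
    ordered = sorted(samples_ms)
    rank = math.ceil(0.99 * len(ordered))  # 1-based nearest rank
    return ordered[rank - 1]

def should_alert(samples_ms: list[float]) -> bool:
    return p99(samples_ms) > P99_ALERT_MS

# 97 fast queries plus a slow tail: p99 = 51.0ms, which trips the alert
replayed = [12.0] * 97 + [48.0, 51.0, 130.0]
print(p99(replayed), should_alert(replayed))
```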

Branch battle map

Rounds R1–R3, with a censor reopen, across branches b001–b007.
Battle timeline (3 rounds)
Round 1 — Initial positions · 3 branches
Branch b001 (Vulcan) eliminated — Branch b001 proposes a hybrid DynamoDB + PostgreSQL/Citus...
Socrates proposed branch b004
Socrates Reframe the problem: Instead of asking whether to migrate from DynamoDB to Postg…
Round 2 — Adversarial probes · 3 branches
Loki proposed branch b005
Branch b005 (Loki) eliminated — auto-pruned: unsupported low-confidence branch
Socrates proposed branch b006
Branch b006 (Socrates) eliminated — auto-pruned: unsupported low-confidence branch
Loki Azure Cosmos DB for PostgreSQL (Hyperscale Citus) recommendation ignores real-wo…
Socrates Instead of a simple yes/no migration decision, we should evaluate whether a hybr…
Round 3 — Final convergence · 3 branches
Branch b004 (Socrates) eliminated — Branch b004 proposes a 'polyglot persistence strategy' wi...
Socrates proposed branch b007
Socrates Before considering migration, conduct a comprehensive database implementation au…