Should we migrate from DynamoDB to PostgreSQL with Citus for a multi-tenant SaaS with 2,000 tenants and 50ms p99 latency?

accepted_conditional · Pro · 646s · $0.86

This verdict relies on default assumptions for 40% of the relevant constraints.

The constraints that were not provided in the filing, and the default values used in their place, are listed under Evidence boundary below.

7 branches explored · 3 survived · 3 rounds · integrity 75%
72% confidence
Risk: unknown

Migrate to PostgreSQL with Citus on Azure Managed Hyperscale (Cosmos DB for PostgreSQL) with 1 coordinator (8 vCores)

Decision

Concrete components, topology, and thresholds named below are candidate mitigations or example implementations inferred by the Council. They were not confirmed in your filing or established as part of your current environment.

Migrate to PostgreSQL with Citus on Azure Managed Hyperscale (Cosmos DB for PostgreSQL): 1 coordinator (8 vCores, 32GB RAM) + 4 worker nodes (4 vCores, 32GB RAM each). Use tenant_id as the distribution column with co-location. Estimated cost: ~$4,200/month vs. $28K+/month for DynamoDB.

Single-tenant queries (90%+ of the workload) route to a single shard at 5-15ms p99; cross-tenant JOINs hit 20-45ms p99, meeting the 50ms target with ~10% headroom.

For the migration itself, use pgloader for the bulk load and AWS DMS for CDC during cutover, with a 2-week dual-write period (DynamoDB retained as a read fallback via application-level routing).

Critical failure mode: hot tenant skew. If the top 3 tenants represent >40% of data/queries, isolate them onto dedicated worker nodes using Citus tenant isolation (shard_count=1 per large tenant). If skew exceeds 60% on any single worker, p99 will breach 50ms under concurrent load.

Self-managed Citus on AWS is rejected as a hidden budget killer: dual-running DynamoDB ($28K/month) plus self-managed Citus ($8K/month) plus engineering time blows the budget by month 4.

Next actions

Deploy proof-of-concept Azure Hyperscale Citus cluster (1 coordinator + 1 worker), load top 3 tenants by data volume, replay 24h production query logs, measure p99 latency
backend · immediate
Measure current DynamoDB hot partition distribution: identify top 3 tenants by query volume and data size, calculate skew percentage to determine if tenant isolation will be needed on Citus
data · immediate
Set up pgloader bulk migration pipeline and AWS DMS CDC replication from DynamoDB to Citus staging environment
infra · before_launch
If existing infrastructure is AWS-only, evaluate whether cross-cloud latency to Azure is acceptable or whether self-managed Citus on AWS with extended budget timeline is preferable
infra · immediate
Set up p99 latency alerting at 45ms threshold (5ms buffer) on the Citus coordinator and per-worker node query latency dashboards
infra · before_launch
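The skew-measurement action above reduces to one ratio over per-tenant query counts. A minimal sketch, with hypothetical tenant IDs and volumes, checked against the Council's >40% isolation threshold:

```python
def top_n_skew(query_counts: dict[str, int], n: int = 3) -> float:
    """Fraction of total query volume attributable to the top-n tenants."""
    total = sum(query_counts.values())
    if total == 0:
        return 0.0
    top = sorted(query_counts.values(), reverse=True)[:n]
    return sum(top) / total

# Hypothetical per-tenant daily query counts from DynamoDB metrics
counts = {"t1": 500_000, "t2": 300_000, "t3": 200_000,
          "t4": 50_000, "t5": 50_000}
skew = top_n_skew(counts)       # 1_000_000 / 1_100_000 ≈ 0.909
needs_isolation = skew > 0.40   # the >40% tenant-isolation threshold
print(f"top-3 skew: {skew:.0%}, isolation needed: {needs_isolation}")
```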
This verdict stops being true when
DynamoDB costs are primarily driven by implementation issues (poor partition key design, over-provisioned capacity) and a 30%+ cost reduction is achievable through optimization alone → Optimize existing DynamoDB setup: redesign partition keys, implement auto-scaling, add DAX caching layer, defer migration
Proof-of-concept shows coordinator bottleneck at 2,000 tenants causes p99 > 50ms under production-equivalent concurrent load → Evaluate self-managed Citus on AWS with multiple coordinators, or consider CockroachDB/TiDB as distributed SQL alternatives without single-coordinator constraint
Existing infrastructure is entirely AWS-native and cross-cloud latency to Azure adds >10ms to p99, eating the safety margin → Deploy self-managed Citus on AWS EC2/EKS with increased budget allocation for DBA operational overhead
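The third flip condition is simple budget arithmetic: the cross-tenant projection runs as high as 45ms p99 against a 50ms target, so added cross-cloud round-trip can consume the margin. A sketch of the check (baseline figures are illustrative):

```python
SLO_MS = 50.0  # the stated p99 latency target

def verdict_flips(base_p99_ms: float, cross_cloud_rtt_ms: float) -> bool:
    """True when added cross-cloud latency pushes projected p99 past the SLO."""
    return base_p99_ms + cross_cloud_rtt_ms > SLO_MS

# With a hypothetical 41ms baseline, a >10ms cross-cloud hop breaches the
# SLO, while a 5ms hop still leaves 4ms of headroom.
print(verdict_flips(41.0, 10.1), verdict_flips(41.0, 5.0))
```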

Council notes

Vulcan
Propose a hybrid architecture: retain DynamoDB for read-heavy, non-relational workloads while introducing PostgreSQL ...
Socrates
Before considering migration, conduct a comprehensive database implementation audit of the current DynamoDB setup. Ma...
Daedalus
RECOMMENDATION: Migrate to PostgreSQL with Citus on Azure Managed Hyperscale (Citus), NOT self-managed Citus on EC2/R...
Loki
Azure Cosmos DB for PostgreSQL (Hyperscale Citus) recommendation ignores real-world coordinator bottlenecks: with 2,0...

Evidence boundary

Observed from your filing

  • Should we migrate from DynamoDB to PostgreSQL with Citus for a multi-tenant SaaS with 2,000 tenants and 50ms p99 latency?

Assumptions used for analysis

  • DynamoDB cost (~$28K/month) is the primary driver for migration, not a misidentified implementation issue
  • The existing SaaS can tolerate a cross-cloud database dependency on Azure if other services remain on AWS
  • 90%+ of queries are single-tenant scoped (tenant_id filtered), making shard-local routing the dominant access pattern
  • The engineering team has sufficient PostgreSQL operational expertise to manage the migration and ongoing operations even with managed Citus
  • The 2-week dual-write cutover window is achievable given schema complexity and data volume across 2,000 tenants
  • current scale defaulted: moderate scale assumed (not_addressed)
  • existing stack defaulted: greenfield assumed (not_addressed)
  • connection pooler defaulted: not specified (not_addressed)
  • data volume defaulted: not specified (not_addressed)
  • traffic shape defaulted: not specified (not_addressed)
  • current bottleneck defaulted: not specified (not_addressed)

Inferred candidate specifics

These details were introduced by the Council during analysis. They were not supplied in your filing.

  • Deploy a 2-node Azure Cosmos DB for PostgreSQL (Hyperscale Citus) proof-of-concept cluster with 1 coordinator + 1 worker node, load 3 representative tenants (including the largest by data volume), distribute on tenant_id, replay 24 hours of production query logs via pgbench, and measure p99 latency against the 50ms target before committing to full migration.
  • b003 (0.86 confidence) narrowly exceeded b002 (0.85) and was selected because it names specific node configurations, cost projections, migration tooling (pgloader, AWS DMS), concrete failure modes with quantified thresholds (>40% skew, >60% worker saturation), and an actionable architecture. b002 provides a sound decision framework but lacks architectural specificity — it says 'if analysis confirms, then migrate' without detailing what the migration looks like. b003 survived 3 rounds of adversarial strengthening and provides the most execution-ready path.

Unknowns blocking a firmer verdict

  • Coordinator bottleneck at 2,000 tenants: killed branch b005 cited case studies from Framer and Heap showing coordinator hotspotting spiking p99 to 150ms+. This was auto-pruned as unsupported but the concern is architecturally valid and untested in this specific workload profile.
  • Cross-cloud migration complexity: if existing services are on AWS, moving the database to Azure introduces cross-cloud latency and data transfer costs not accounted for in the $4,200/month estimate.
  • The $4,200/month Azure cost and $28K/month DynamoDB cost are model-generated projections without cited production benchmarks for this specific workload volume.
  • No evidence that the current DynamoDB bottleneck has been formally diagnosed — b002/b007's concern that the problem may be implementation rather than technology remains valid.
  • Actual query patterns and data volume per tenant not specified — latency projections assume typical multi-tenant SaaS workloads.

Operational signals to watch

reversal — DynamoDB costs are primarily driven by implementation issues (poor partition key design, over-provisioned capacity) and a 30%+ cost reduction is achievable through optimization alone
reversal — Proof-of-concept shows coordinator bottleneck at 2,000 tenants causes p99 > 50ms under production-equivalent concurrent load
reversal — Existing infrastructure is entirely AWS-native and cross-cloud latency to Azure adds >10ms to p99, eating the safety margin
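Both the 45ms alert threshold from the next actions and the proof-of-concept p99 measurement come down to a nearest-rank percentile over replayed query latencies. A minimal sketch, with illustrative samples (metric collection and alert wiring are deployment-specific):

```python
import math

P99_ALERT_MS = 45.0  # 5ms buffer under the 50ms p99 SLO

def p99(samples_ms: list[float]) -> float:
    """Nearest-rank 99th percentile of a batch of latency samples."""
    if not samples_ms:
        raise ValueError("no samples")
    ordered = sorted(samples_ms)
    rank = math.ceil(0.99 * len(ordered))  # 1-based nearest rank
    return ordered[rank - 1]

def should_alert(samples_ms: list[float]) -> bool:
    return p99(samples_ms) > P99_ALERT_MS

# 97 fast queries plus a slow tail: p99 = 51.0ms, which trips the alert
replayed = [12.0] * 97 + [48.0, 51.0, 130.0]
print(p99(replayed), should_alert(replayed))
```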

Branch battle map

Rounds R1–R3, with a censor reopen, across branches b001–b007.
Battle timeline (3 rounds)
Round 1 — Initial positions · 3 branches
Branch b001 (Vulcan) eliminated — Branch b001 proposes a hybrid DynamoDB + PostgreSQL/Citus...
Socrates proposed branch b004
Socrates Reframe the problem: Instead of asking whether to migrate from DynamoDB to Postg…
Round 2 — Adversarial probes · 3 branches
Loki proposed branch b005
Branch b005 (Loki) eliminated — auto-pruned: unsupported low-confidence branch
Socrates proposed branch b006
Branch b006 (Socrates) eliminated — auto-pruned: unsupported low-confidence branch
Loki Azure Cosmos DB for PostgreSQL (Hyperscale Citus) recommendation ignores real-wo…
Socrates Instead of a simple yes/no migration decision, we should evaluate whether a hybr…
Round 3 — Final convergence · 3 branches
Branch b004 (Socrates) eliminated — Branch b004 proposes a 'polyglot persistence strategy' wi...
Socrates proposed branch b007
Socrates Before considering migration, conduct a comprehensive database implementation au…