Should we migrate from DynamoDB to PostgreSQL with Citus for a multi-tenant SaaS with 2,000 tenants and 50ms p99 latency?
This verdict rests on assumed defaults for 40% of the relevant constraints. The following constraints were not provided, so default values were used:
- current_scale: moderate scale assumed (not_addressed)
- existing_stack: greenfield assumed (not_addressed)
- connection_pooler: not specified (not_addressed)
- data_volume: not specified (not_addressed)
- traffic_shape: not specified (not_addressed)
- current_bottleneck: not specified (not_addressed)
Verdict: migrate to PostgreSQL with Citus on Azure Cosmos DB for PostgreSQL (formerly Hyperscale (Citus)).
Decision
Migrate to PostgreSQL with Citus on Azure Cosmos DB for PostgreSQL (formerly Hyperscale (Citus)): 1 coordinator (8 vCores, 32GB RAM) plus 4 worker nodes (4 vCores, 32GB RAM each). Use tenant_id as the distribution column with co-location. Estimated cost is ~$4,200/month versus $28K+/month on DynamoDB.

Single-tenant queries (90%+ of the workload) route to a single shard at 5-15ms p99; cross-tenant JOINs land at 20-45ms p99, meeting the 50ms target with ~10% headroom. Bulk-migrate via DynamoDB's export to S3 (transformed and loaded with pgloader or COPY) and capture changes during cutover with DynamoDB Streams, with a 2-week dual-write period in which DynamoDB serves as a read fallback via application-level routing. (AWS DMS is not an option for this CDC leg: it supports DynamoDB only as a target, not a source.)

Critical failure mode: hot tenant skew. If the top 3 tenants represent >40% of data/queries, isolate them onto dedicated worker nodes using Citus tenant isolation (shard_count=1 per large tenant). If skew exceeds 60% on any single worker, p99 will breach 50ms under concurrent load.

Self-managed Citus on AWS is rejected as a hidden budget killer: dual-running DynamoDB ($28K/month) plus self-managed Citus ($8K/month) plus engineering time blows the budget by month 4.
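The dual-write window hinges on the application-level routing mentioned above. A minimal sketch of that routing, with in-memory stand-ins for the real DynamoDB and Citus clients (all class and method names here are hypothetical, not an existing API):

```python
# Hypothetical sketch of dual-write cutover routing: writes go to both
# stores; reads prefer Citus and fall back to DynamoDB for rows the
# bulk migration / CDC has not caught up on yet.

class KeyValueStore:
    """Stand-in for a real client (boto3 DynamoDB table / psycopg pool)."""
    def __init__(self):
        self._rows = {}

    def put(self, tenant_id, key, value):
        self._rows[(tenant_id, key)] = value

    def get(self, tenant_id, key):
        return self._rows.get((tenant_id, key))

class DualWriteRouter:
    def __init__(self, citus, dynamo):
        self.citus = citus      # new primary (Citus)
        self.dynamo = dynamo    # legacy fallback (DynamoDB)

    def write(self, tenant_id, key, value):
        # Dual-write: keep both stores in sync during the 2-week window.
        self.dynamo.put(tenant_id, key, value)
        self.citus.put(tenant_id, key, value)

    def read(self, tenant_id, key):
        # Prefer Citus; fall back to DynamoDB for not-yet-migrated rows.
        value = self.citus.get(tenant_id, key)
        if value is None:
            value = self.dynamo.get(tenant_id, key)
        return value

citus, dynamo = KeyValueStore(), KeyValueStore()
router = DualWriteRouter(citus, dynamo)
dynamo.put("tenant-42", "profile", "legacy-row")   # pre-migration data
router.write("tenant-42", "settings", "new-row")   # dual-written data
print(router.read("tenant-42", "profile"))   # served via the fallback
print(router.read("tenant-42", "settings"))  # served from Citus
```

Keeping the fallback read path behind a feature flag makes the eventual DynamoDB decommission a one-line change rather than a second migration.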
Council notes
- b003 (0.86 confidence) narrowly exceeded b002 (0.85) and was selected because it names specific node configurations, cost projections, migration tooling, concrete failure modes with quantified thresholds (>40% skew, >60% worker saturation), and an actionable architecture. b002 offers a sound decision framework but lacks architectural specificity: it says "if analysis confirms, then migrate" without detailing what the migration looks like. b003 survived 3 rounds of adversarial strengthening and provides the most execution-ready path.
Evidence boundary
Observed from your filing
- Should we migrate from DynamoDB to PostgreSQL with Citus for a multi-tenant SaaS with 2,000 tenants and 50ms p99 latency?
Assumptions used for analysis
- DynamoDB cost (~$28K/month) is the primary driver for migration, not a misidentified implementation issue
- The existing SaaS can tolerate a cross-cloud database dependency on Azure if other services remain on AWS
- 90%+ of queries are single-tenant scoped (tenant_id filtered), making shard-local routing the dominant access pattern
- The engineering team has sufficient PostgreSQL operational expertise to manage the migration and ongoing operations, even with managed Citus
- The 2-week dual-write cutover window is achievable given the schema complexity and data volume across 2,000 tenants
- Six constraints were defaulted as listed at the top: current_scale, existing_stack, connection_pooler, data_volume, traffic_shape, current_bottleneck
Inferred candidate specifics
- The selected candidate architecture, cost model, latency projections, migration plan, and failure-mode thresholds are those detailed under Decision above
Next actions
- Deploy a 2-node Azure Cosmos DB for PostgreSQL proof-of-concept cluster (1 coordinator + 1 worker), load 3 representative tenants (including the largest by data volume), distribute on tenant_id, replay 24 hours of production query logs via pgbench, and measure p99 latency against the 50ms target before committing to full migration
- Measure current DynamoDB hot partition distribution: identify top 3 tenants by query volume and data size, calculate skew percentage to determine if tenant isolation will be needed on Citus
- Set up the bulk-migration pipeline (DynamoDB export to S3, transform, load via pgloader or COPY) and change capture via DynamoDB Streams into the Citus staging environment; note that neither pgloader nor AWS DMS supports DynamoDB as a source
- If existing infrastructure is AWS-only, evaluate whether cross-cloud latency to Azure is acceptable or whether self-managed Citus on AWS with extended budget timeline is preferable
- Set up p99 latency alerting at 45ms threshold (5ms buffer) on the Citus coordinator and per-worker node query latency dashboards
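The skew measurement and the >40%/>60% thresholds from the Decision are mechanical once per-tenant volumes are known. A sketch, assuming per-tenant query counts and a tenant-to-worker placement map are available (the data and names are illustrative):

```python
# Hypothetical skew check for the thresholds above: top-3 tenant share
# >40% -> isolate those tenants; any single worker >60% -> p99 at risk.

def top3_share(tenant_queries):
    """Fraction of total queries served by the 3 busiest tenants."""
    total = sum(tenant_queries.values())
    top3 = sum(sorted(tenant_queries.values(), reverse=True)[:3])
    return top3 / total

def worker_shares(tenant_queries, placement):
    """Per-worker share of total queries, given a tenant -> worker map."""
    total = sum(tenant_queries.values())
    shares = {}
    for tenant, count in tenant_queries.items():
        worker = placement[tenant]
        shares[worker] = shares.get(worker, 0) + count / total
    return shares

# Illustrative numbers, not measurements from the filing.
queries = {"t1": 500, "t2": 300, "t3": 150, "t4": 30, "t5": 20}
placement = {"t1": "w1", "t2": "w1", "t3": "w2", "t4": "w2", "t5": "w2"}

if top3_share(queries) > 0.40:
    print("isolate top tenants onto dedicated workers")
for worker, share in sorted(worker_shares(queries, placement).items()):
    if share > 0.60:
        print(f"{worker} exceeds 60%: p99 breach likely under load")
```

The same arithmetic applies to today's DynamoDB hot-partition data, so the skew decision can be made before the PoC cluster exists.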
Unknowns blocking a firmer verdict
- Coordinator bottleneck at 2,000 tenants: the pruned branch b005 cited case studies from Framer and Heap showing coordinator hotspotting spiking p99 to 150ms+. The branch was auto-pruned as unsupported, but the concern is architecturally valid and untested for this specific workload profile.
- Cross-cloud migration complexity: if existing services are on AWS, moving the database to Azure introduces cross-cloud latency and data transfer costs not accounted for in the $4,200/month estimate.
- The $4,200/month Azure cost and $28K/month DynamoDB cost are model-generated projections without cited production benchmarks for this specific workload volume.
- No evidence that the current DynamoDB bottleneck has been formally diagnosed — b002/b007's concern that the problem may be implementation rather than technology remains valid.
- Actual query patterns and data volume per tenant not specified — latency projections assume typical multi-tenant SaaS workloads.
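Most of these unknowns resolve only through measurement. A minimal sketch of the p99 check that both the PoC replay and the 45ms alert reduce to, using the nearest-rank percentile (the sample latencies below are illustrative, not projections):

```python
# Hypothetical p99 check for replayed PoC latencies: alert at 45ms
# (5ms buffer), hard target 50ms. Uses the nearest-rank percentile.

def p99(latencies_ms):
    ranked = sorted(latencies_ms)
    # Nearest-rank: ceil(0.99 * n), converted to a 0-based index.
    index = max(0, -(-99 * len(ranked) // 100) - 1)
    return ranked[index]

# Illustrative sample: mostly shard-local reads, a few cross-tenant JOINs.
samples = [8, 9, 11, 12, 14, 15, 22, 30, 41, 47]
observed = p99(samples)

TARGET_MS, ALERT_MS = 50, 45
print(f"p99 = {observed}ms")
if observed >= TARGET_MS:
    print("SLO breach")
elif observed >= ALERT_MS:
    print("alert: within 5ms of the 50ms target")
```

In production the same threshold comparison would run against coordinator and per-worker latency metrics rather than an in-memory list, but the pass/alert/breach logic is unchanged.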