Should we move our CI/CD from GitHub Actions to self-hosted runners for a 50-developer team spending $8K/month on Actions minutes with 400 builds per day?
Deploy a hybrid CI/CD model: migrate heavy workflows to self-hosted runners on Kubernetes while retaining GitHub-hosted runners for lightweight and low-frequency tasks.
Decision
Implement a hybrid CI/CD model: migrate heavy workflows (compilation, integration tests, Docker image builds) to self-hosted runners orchestrated via Kubernetes (ARC on EKS/GKE), while retaining GitHub-hosted runners for lightweight and low-frequency tasks. Target a 40-60% cost reduction, from $8K to $4-5K/month. Self-hosted infrastructure must reliably handle at least 200 of the 400 daily builds.

Critical failure mode: runner image drift. GitHub-hosted runners update base images weekly with ~200 pre-installed tools; self-hosted runners diverge within 2-3 weeks, breaking builds that worked on hosted runners. This is the primary reason self-hosted migrations get reverted. Mitigate with automated weekly image rebuilds that track GitHub's runner-image releases.

Second failure mode: spot interruptions, affecting 5-10% of instances. Use mixed instance types and maintain 3+ on-demand baseline runners that never scale to zero. Set termination grace periods long enough for in-flight builds to complete.

Critical threshold: DevOps staffing. This infrastructure requires ~0.5 FTE of dedicated DevOps capacity. If your team lacks this, the TCO advantage collapses: a $75K+ annual staffing cost against ~$36-48K in annual savings makes the migration marginal at best. Proceed only if existing DevOps capacity can absorb the load.
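The ARC-on-Kubernetes piece of the decision above can be sketched as a Helm values file for the `gha-runner-scale-set` chart. The org URL, secret name, and sizing numbers below are illustrative assumptions, not measured values; tune them against your own build profile:

```yaml
# Hypothetical values.yaml for actions-runner-controller's
# gha-runner-scale-set Helm chart. URL, secret name, and runner
# counts are placeholders -- validate maxRunners with profiling.
githubConfigUrl: "https://github.com/your-org"   # assumption: org-level runner scope
githubConfigSecret: gha-runner-app-secret        # assumption: GitHub App credentials
minRunners: 3       # on-demand baseline that never scales to zero (spot mitigation)
maxRunners: 16      # assumed peak-concurrency ceiling
containerMode:
  type: "dind"      # Docker-in-Docker for image builds; "kubernetes" mode also works
template:
  spec:
    containers:
      - name: runner
        image: ghcr.io/actions/actions-runner:latest
        command: ["/home/runner/run.sh"]
```

Keeping `minRunners` above zero is what implements the "3+ on-demand baseline" mitigation: scale-to-zero plus spot capacity is exactly the configuration that produces the 5-10% interruption pain described above.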
Evidence boundary
Observed from your filing
- Should we move our CI/CD from GitHub Actions to self-hosted runners for a 50-developer team spending $8K/month on Actions minutes with 400 builds per day?
Assumptions used for analysis
- The $8K/month spend is primarily driven by a subset of heavy workflows that can be isolated and migrated independently
- The team has or can allocate ~0.5 FTE of DevOps/platform engineering capacity for runner infrastructure maintenance
- Build workflows can be cleanly categorized into 'heavy' (suitable for self-hosted) and 'light' (retain on GitHub-hosted) without significant cross-dependencies
- The team operates in a cloud environment (AWS/GCP/Azure) where Kubernetes infrastructure can be provisioned, and has existing cloud accounts and networking in place
- Security and compliance requirements do not prohibit running CI/CD workloads on self-managed infrastructure
Inferred candidate specifics
- The hybrid architecture detailed under Decision: heavy workflows (compilation, integration tests, Docker image builds) on ARC-managed self-hosted runners, lightweight tasks retained on GitHub-hosted runners, targeting $4-5K/month
Council notes
- Branch b004 had the highest stated confidence (0.95) but is structurally a [reframe]: it recommends conducting a review rather than providing an actionable implementation path. It names no specific technology, no concrete threshold, and no architectural pattern, failing the specificity gate for an implementation winner. Its valid insight (TCO analysis, build-volume optimization) is captured in unresolved_uncertainty and next_actions. Branch b002 (0.85) provides a concrete hybrid architecture with specific cost targets ($4-5K/month), capacity thresholds (200+ daily builds on self-hosted), and an implementation pattern (cloud-hosted Kubernetes runners plus GitHub Actions retention); it also survived Round 3 strengthening. b001 (0.75) was weakened by b005's prosecution of spot-instance reliability risks.
Next actions
- Run a 2-week build profiling analysis: instrument all 400 daily builds to measure per-workflow GitHub Actions minutes, categorize by type (compile/test/lint/deploy), and identify the top 10 costliest workflows
- Measure existing team Kubernetes expertise — survey DevOps/platform engineers on ARC familiarity and estimate available FTE capacity for runner infrastructure maintenance
- Deploy a proof-of-concept ARC cluster with 3 on-demand nodes, migrate the single costliest workflow, and measure cost/reliability over 2 weeks before broader rollout
- Set up automated weekly runner image rebuilds that track GitHub's runner-images repository releases to prevent image drift
- Create a CI/CD cost and reliability dashboard tracking: monthly spend, build success rate, queue wait times, and spot interruption frequency — with alerts if build failure rate exceeds 5% or queue time exceeds 5 minutes
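The profiling and alerting steps above can be sketched in Python. The workflow names and per-build minutes below are made-up placeholders; the $0.008/minute figure is GitHub's list price for Linux 2-core hosted runners, which you should verify against your own plan:

```python
from collections import defaultdict

LINUX_2CORE_PER_MIN = 0.008  # GitHub list price for Linux 2-core runners; verify

def profile(builds):
    """Aggregate per-workflow minutes and cost from build records.

    Each record: {"workflow": str, "minutes": float, "category": str}.
    Returns (workflow, minutes, cost) tuples sorted costliest-first.
    """
    totals = defaultdict(float)
    for b in builds:
        totals[b["workflow"]] += b["minutes"]
    costed = [(wf, mins, mins * LINUX_2CORE_PER_MIN) for wf, mins in totals.items()]
    return sorted(costed, key=lambda t: t[2], reverse=True)

def alerts(success_rate, queue_minutes_p50):
    """Dashboard thresholds from the plan: failure rate > 5% or queue > 5 min."""
    fired = []
    if 1.0 - success_rate > 0.05:
        fired.append("build-failure-rate")
    if queue_minutes_p50 > 5.0:
        fired.append("queue-wait-time")
    return fired

# Hypothetical sample: one day's records for two workflows.
sample = [
    {"workflow": "integration-tests", "minutes": 40.0, "category": "test"},
    {"workflow": "integration-tests", "minutes": 38.0, "category": "test"},
    {"workflow": "lint", "minutes": 2.0, "category": "lint"},
]
top = profile(sample)
print(top[0][0])          # -> integration-tests (costliest migration candidate)
print(alerts(0.93, 3.0))  # 7% failure rate -> ['build-failure-rate']
```

In practice the build records would come from the Actions usage report or the API rather than a hand-built list; the point is that the categorization and the two alert thresholds are a few dozen lines, not a platform project.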
Unknowns blocking a firmer verdict
- Actual build profile distribution is unknown — the 200/200 split between heavy and light workflows is assumed, not measured. If 350+ builds are heavy, the hybrid approach saves less because more infrastructure is needed
- Team's existing DevOps capacity and Kubernetes expertise are unspecified — if no current Kubernetes competency exists, ramp-up time and staffing costs could eliminate the cost advantage for 6-12 months
- b004's core point remains unaddressed: whether 400 builds/day is optimal or includes redundant/wasteful builds. Build caching and workflow optimization alone might reduce spend by 20-30% with zero infrastructure changes
- Killed branch b003 had the most specific architecture (ARC, exact instance types, capacity math showing 16 concurrent runners needed at peak) but was eliminated for underestimating DevOps staffing costs — its technical specifics may still be the right implementation details
- Security and compliance implications of self-hosted runners (secrets management, network isolation, audit logging) are unaddressed by any surviving branch
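The staffing-versus-savings tension running through these unknowns can be made concrete with a small sensitivity calculation. Every input below is an assumption quoted in this report (the $8K/month spend, 40-60% hybrid reduction, 20-30% caching-only reduction, ~$75K/year for 0.5 FTE), not a measurement:

```python
# TCO sensitivity sketch using only the figures quoted in this report.
SPEND = 8_000    # $/month on Actions minutes (from the filing)
STAFF = 75_000   # assumed annual cost of ~0.5 FTE dedicated DevOps

def net_annual(reduction, added_staffing):
    """Annual savings at a given spend-reduction rate, net of new staffing."""
    return SPEND * reduction * 12 - added_staffing

# Hybrid migration, if the 0.5 FTE must be hired:
hybrid_low  = net_annual(0.40, STAFF)   # ~ -36,600: loses money
hybrid_high = net_annual(0.60, STAFF)   # ~ -17,400: still negative

# Hybrid migration, if existing DevOps capacity absorbs the load:
absorbed_low  = net_annual(0.40, 0)     # ~ 38,400
absorbed_high = net_annual(0.60, 0)     # ~ 57,600

# Caching/workflow optimization alone (b004's point), no new infra:
cache_low  = net_annual(0.20, 0)        # ~ 19,200
cache_high = net_annual(0.30, 0)        # ~ 28,800
```

Read literally, the arithmetic supports both the Decision's staffing threshold and b004's unaddressed point: if the half-FTE is incremental headcount, even the optimistic hybrid case nets negative, while caching alone captures roughly half to three-quarters of the absorbed-capacity hybrid savings with near-zero infrastructure cost.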