The day your side project gets real users is also the day you start thinking seriously about infrastructure. Not because you want to — because you have to. Someone tweets about it, traffic spikes, the database chokes. Or you get a Slack message at 1am saying the app is down. Or you open your AWS bill and it's three times what you expected with no obvious explanation.
This is the startup infrastructure stack guide for founders who are at that inflection point: you have something real, it's getting traction, and the "just push to Heroku" setup is showing its cracks. You don't need an enterprise playbook. You need to know exactly what to use, when to add complexity, and what to skip until you actually need it.
This is that guide. Let's go stage by stage.
Stage 0: The Side Project (0–100 Users) — Don't Optimize. Just Ship.
At this stage, your only job is to find out whether anyone cares about what you're building. Infrastructure complexity is your enemy. Every hour you spend on Terraform is an hour you're not talking to users.
The right stack:
- Compute: Vercel, Railway, or Render. These platforms handle deploys, scaling, and SSL out of the box. Push to GitHub, it's live. That's the full ops workflow.
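- Database: Supabase (managed Postgres) or PlanetScale (managed MySQL). Zero DBA required. Supabase has a generous free tier; PlanetScale retired its free plan in 2024, so check current pricing before you commit.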
- DNS + DDoS protection: Cloudflare. Free tier, five minutes to set up, basic DDoS mitigation without thinking about it.
- Source control + CI/CD: GitHub. Use Actions for anything you need to automate.
Total infra overhead: near zero. At this stage, that is exactly what you want.
The mistake most technical founders make at this stage is over-engineering. They read a Hacker News thread, get nervous about scaling, and spend two weeks setting up Kubernetes for a product with 12 users. Don't. The cost of premature infrastructure investment isn't just time — it's the momentum and focus you need to validate the idea.
If the product breaks under load at 50 users, that's a good problem — it means 50 people wanted to use it. Scale then.
Stage 1: Early Traction (100–1,000 Users) — Now You Need Visibility
Somewhere between 100 and 1,000 users, new problems appear. They're subtle at first:
- "Why is it slow sometimes?"
- "Did that deploy break something, or was it already broken?"
- "Which part of this is costing money?"
You can't answer any of these questions without observability. This is the single most important thing you add at Stage 1.
Add these now:
- Basic logging: Axiom, Logtail, or AWS CloudWatch if you're already on AWS. Pick one, send your app logs there, and learn to query them before you need them at 2am.
- Uptime monitoring: Better Uptime or UptimeRobot. You want to hear about downtime before your users tell you on Twitter.
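Logs are much easier to query at 2am if every line is structured. Here is a minimal sketch using only Python's standard library, assuming your logging provider ingests JSON lines from stdout or a file. The `JsonFormatter` class, the `build_logger` helper, and the `context` field name are illustrative choices, not part of any provider's API.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Fields passed via extra={"context": {...}} are set on the record.
        context = getattr(record, "context", None)
        if context:
            payload.update(context)
        return json.dumps(payload)


def build_logger(name: str, stream) -> logging.Logger:
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


# In production you would pass sys.stdout; a buffer keeps the demo self-contained.
buffer = io.StringIO()
log = build_logger("checkout", buffer)
log.info("payment failed", extra={"context": {"user_id": 42, "latency_ms": 930}})
entry = json.loads(buffer.getvalue())
```

With fields like `user_id` and `latency_ms` on every line, "why is it slow sometimes?" becomes a filter instead of a grep.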
Consider moving to your own cloud account. At some point, the platforms that made Stage 0 effortless (Vercel, Railway) become either expensive or limiting. Moving your backend to a real cloud account — AWS, GCP, or a cost-efficient provider like Hetzner or DigitalOcean — gives you more control over cost and architecture. This isn't urgent, but it's worth planning for.
This is where Clanker Cloud enters the picture. Connect your cloud accounts and GitHub, and you can ask questions about your infrastructure in plain English: "What's running in this account?", "Why did costs go up last week?", "What changed in the last deploy?" You get answers without context-switching into five different dashboards. For a solo founder on-call for everything, that's the difference between catching a problem early and waking up to a support inbox full of complaints. See how it works in the demo.
Stage 2: Growing Product (1,000–10,000 Users) — Formalize Without Bureaucracy
At this stage, you're no longer moving fast and breaking things — you're moving fast and trying very hard not to break things. The product matters now. Users have workflows that depend on it. Downtime has real business consequences.
The goal at Stage 2 is formalization without the overhead of a full platform engineering team.
What to put in place:
Separate staging and production environments. A surprising number of teams at this stage still test against production. Get a staging environment that mirrors prod closely enough that deploys aren't surprises.
Secrets management. Stop putting secrets in .env files scattered across providers. Use AWS Secrets Manager, Doppler, or HashiCorp Vault. Takes a day to set up and pays off the first time a key needs rotation.
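On the application side, one pattern that pairs well with any of these tools is failing fast at startup when a secret is missing. The sketch below assumes your secrets manager injects values as environment variables at runtime; `load_secrets`, `MissingSecretError`, and the variable names are hypothetical.

```python
import os


class MissingSecretError(RuntimeError):
    """Raised at startup when a required secret is not configured."""


def load_secrets(required, env=None):
    """Return the required secrets, failing fast if any are missing."""
    env = os.environ if env is None else env
    missing = [name for name in required if not env.get(name)]
    if missing:
        # Report names only; never put secret values in error messages.
        raise MissingSecretError("missing secrets: " + ", ".join(sorted(missing)))
    return {name: env[name] for name in required}


# Example with an explicit mapping instead of the real environment.
secrets = load_secrets(
    ["DATABASE_URL"],
    env={"DATABASE_URL": "postgres://example"},
)
```

A crash at boot with a clear list of missing names is far cheaper than a half-working deploy that fails on the first payment request.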
Infrastructure as Code (IaC). Start using Terraform or Pulumi. You don't need to boil the ocean — getting your compute, database, and networking into IaC means you can recreate your environment and avoid configuration drift. Pulumi is the better choice if your team prefers TypeScript or Python over HCL.
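If "configuration drift" sounds abstract, here is the idea in miniature: compare what you declared with what is actually running. Real tools do this against live cloud APIs (for example via `terraform plan` or `pulumi preview`); the `find_drift` function and the setting names below are a simplified, hypothetical illustration.

```python
def find_drift(declared: dict, actual: dict) -> dict:
    """Return {setting: (declared_value, actual_value)} for every mismatch."""
    drift = {}
    for key in sorted(set(declared) | set(actual)):
        want = declared.get(key)
        have = actual.get(key)
        if want != have:
            drift[key] = (want, have)
    return drift


# Someone resized the instance by hand in the console.
declared = {"instance_type": "t3.small", "min_replicas": 2, "public": False}
actual = {"instance_type": "t3.medium", "min_replicas": 2, "public": False}
drift = find_drift(declared, actual)
```

Without a declared state in version control, there is nothing to compare against, which is why drift goes unnoticed until a rebuild produces a different environment.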
Kubernetes: only if you actually need it. If you're running a single service, you do not need Kubernetes. If you're running multiple services that communicate with each other, and you have someone on the team who understands it (or is willing to learn it), then DOKS (DigitalOcean Kubernetes), GKE, or EKS is worth considering. But be honest with yourself: most early-stage startups add Kubernetes too early and pay for it in ops complexity for years.
Cost tracking. Your cloud bill matters for runway. Use AWS Cost Explorer, GCP Billing, or Infracost to know which service is consuming what. Set billing alerts. Know your baseline.
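"Know your baseline" can be as simple as comparing this month's per-service spend with last month's. A rough sketch, assuming you have already exported spend per service from your billing tool into plain dictionaries; the `cost_alerts` function, the 25% threshold, and the numbers are made up for illustration.

```python
def cost_alerts(baseline: dict, current: dict, threshold: float = 0.25) -> list:
    """List (service, previous, spend) where spend grew more than `threshold`."""
    alerts = []
    for service, spend in sorted(current.items()):
        previous = baseline.get(service, 0.0)
        if previous == 0.0:
            # A service with no baseline is new spend; always surface it.
            if spend > 0:
                alerts.append((service, previous, spend))
            continue
        growth = (spend - previous) / previous
        if growth > threshold:
            alerts.append((service, previous, spend))
    return alerts


baseline = {"compute": 120.0, "database": 80.0, "egress": 10.0}
current = {"compute": 130.0, "database": 85.0, "egress": 45.0, "nat_gateway": 32.0}
alerts = cost_alerts(baseline, current)
```

Here compute and database grew by less than 25% and stay quiet, while the egress jump and the brand-new NAT gateway line item are flagged.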
Security scanning. Misconfigurations are the most common cause of security incidents at early-stage companies: an S3 bucket left public, a security group too permissive, an API endpoint with no authentication. Clanker Cloud's autonomous security agents scan your running infrastructure for misconfigs, exposed endpoints, and anomalies, without requiring a dedicated security team. The early-stage infra guide covers this in more detail.
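To make "a security group too permissive" concrete, here is a toy check that flags ingress rules open to the whole internet on anything other than web ports. This is not how Clanker Cloud or any particular scanner works; the rule shape (`port`, `cidr`) and the `risky_rules` helper are simplified assumptions.

```python
OPEN_CIDRS = {"0.0.0.0/0", "::/0"}
PUBLIC_PORTS = {80, 443}  # assumption: only web ports should face the internet


def risky_rules(rules: list) -> list:
    """Return ingress rules open to the whole internet on non-web ports."""
    flagged = []
    for rule in rules:
        if rule["cidr"] in OPEN_CIDRS and rule["port"] not in PUBLIC_PORTS:
            flagged.append(rule)
    return flagged


rules = [
    {"port": 443, "cidr": "0.0.0.0/0"},    # public HTTPS: expected
    {"port": 22, "cidr": "0.0.0.0/0"},     # SSH open to everyone
    {"port": 5432, "cidr": "10.0.0.0/16"},  # Postgres, private network only
    {"port": 5432, "cidr": "::/0"},        # Postgres open over IPv6
]
flagged = risky_rules(rules)
```

The IPv6 case is the one people miss: a rule locked down for `0.0.0.0/0` can still be wide open on `::/0`.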
Stage 3: Scale (10,000+ Users / Series A) — Hire or Delegate
At this stage, infrastructure is a product concern, not just an ops concern. Latency, availability, and security directly affect your ability to retain users and close enterprise deals.
This is when you hire your first DevOps or Platform engineer. Someone whose job it is to think about reliability, cost optimization, and the deployment pipeline. That hire usually happens somewhere between Series A and hitting meaningful revenue — often later than it should.
Until then, you operate like you have a team when you don't. That means:
- Proper CI/CD pipelines with automated testing gates, not "push to main and hope"
- SLOs — define what good looks like for uptime and latency before you sign enterprise SLAs
- Multi-region deployment if latency to distributed users matters to your product
- Incident response runbooks, written down
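An SLO turns into something you can act on once you convert it to an error budget: the amount of downtime the target allows in a given window. A minimal sketch, assuming an availability SLO measured over a rolling 30-day window; the function names are illustrative.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of allowed downtime in the window for an availability SLO."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    total_minutes = window_days * 24 * 60
    return total_minutes * (1.0 - slo_target)


def budget_remaining(slo_target: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Error budget left after the downtime already spent in the window."""
    return error_budget_minutes(slo_target, window_days) - downtime_minutes


three_nines = error_budget_minutes(0.999)   # 99.9% over 30 days
four_nines = error_budget_minutes(0.9999)   # 99.99% over 30 days
```

A 99.9% target allows roughly 43 minutes of downtime per 30 days; 99.99% allows a little over 4. Run that math before an enterprise customer asks you to put a number in a contract.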
Clanker Cloud's AI workspace and reviewed-change workflow are designed for this gap: the period between "just the founders" and "we have a dedicated platform team." When something breaks at 2am, you need to move fast and understand blast radius before you act. The AI DevOps workflow and human-approved execution model mean you investigate thoroughly and act only when you know what you're doing. Full documentation at docs.clankercloud.ai.
The Full Recommended Stack by Stage
| Stage | Compute | Database | DNS/CDN | CI/CD | Observability | Infra Ops |
|---|---|---|---|---|---|---|
| 0 — Side project (0–100) | Vercel / Railway / Render | Supabase / PlanetScale | Cloudflare | GitHub Actions | None yet | None yet |
| 1 — Early traction (100–1K) | Same or own cloud (Hetzner/DO/AWS) | Same (managed) | Cloudflare | GitHub Actions | Axiom / Logtail + Better Uptime | Clanker Cloud |
| 2 — Growing product (1K–10K) | AWS / GCP / Hetzner | RDS / Cloud SQL / Managed | Cloudflare | GitHub Actions + deploy gates | Full logging + APM | Clanker Cloud + Terraform/Pulumi |
| 3 — Scale (10K+ / Series A) | AWS / GCP multi-region | Managed + replicas | Cloudflare / Fastly | Full CI/CD pipeline | Datadog / Grafana | Platform engineer + Clanker Cloud |
The Infrastructure Decisions That Actually Matter for Founders
A few opinionated takes, because this is where most guides go vague:
AWS vs. GCP vs. Hetzner/DigitalOcean: Don't pick based on prestige. AWS has the most services and the largest ecosystem, which matters once you have specific requirements (ML training, complex networking, compliance). GCP is strong if you're building ML-native products. Hetzner and DigitalOcean are dramatically cheaper for compute-heavy workloads and perfectly capable for most apps up to significant scale. If you're pre-Series A and cost-sensitive, Hetzner in Europe or DigitalOcean in the US will save you real money.
Kubernetes: only when you have multiple services AND someone who knows it. The most common Kubernetes mistake is adding it before it's necessary. It's a powerful system for managing containerized services at scale, but it brings real operational overhead. One service? Use a managed container service (Fargate, Cloud Run, Railway). Three or more services with complex networking? Kubernetes starts making sense.
Serverless vs. containers: Serverless (Lambda, Cloud Functions, Vercel Edge) is excellent for event-driven, sporadic workloads — webhooks, background jobs, API routes with variable traffic. Containers are better for persistent services: anything that needs warm state, long-running connections, or predictable latency. Most production apps end up using both.
Managed databases always beat self-managed at early stages. Self-managing Postgres or MySQL on a VM looks cheaper until you factor in backups, failover, patching, and the ops time when something goes wrong. Pay for RDS, Cloud SQL, Supabase, or PlanetScale. The savings from running it yourself aren't worth the risk.
The one thing most founders get wrong: no visibility until something breaks. You find out your app is down from a user. You find out costs spiked when the bill arrives. You find out a security misconfiguration after it's already a problem. Visibility is not a luxury — it's how you operate a production system responsibly. Add it at Stage 1, not Stage 3.
How Clanker Cloud Fits the Founder Journey
Clanker Cloud is designed specifically for the gap between "I have a side project" and "I have a platform team." Here's where it fits at each stage:
Stage 1: The first tool that tells you what's actually running in your cloud account. Connect your AWS or GCP credentials (they stay local — Clanker Cloud is a local-first desktop app, not a hosted SaaS), and ask questions like "what EC2 instances are running?", "what's changed in the last 24 hours?", "is anything publicly exposed that shouldn't be?" You get answers in seconds, not dashboards.
Stage 2: The investigation layer when something breaks at 2am and you're the only one on call. Rather than jumping between CloudWatch, your cloud console, GitHub, and a Slack thread, you have one surface that understands your infrastructure topology and can trace what changed, what's affected, and what the likely cause is.
Stage 3: The bridge that lets you operate like you have a DevOps team before you can afford one. Reviewed change plans, autonomous security agents, and full integration with your existing tools mean your small team maintains production-level discipline without the headcount.
BYOK model: Use Gemma 4 locally to keep infrastructure data completely private, or connect Claude Code or Codex for full agentic workflows. No token markup, no data leaving your machine unless you explicitly choose it. For founders handling sensitive user data or working in regulated industries, this matters. Explore agentic workflows in the docs.
One-minute setup. Install the desktop app, connect your existing credentials, and you're running. No new IAM roles to wrestle with, no hosted layer to configure.
Conclusion
The startup infrastructure stack isn't one thing — it's a series of decisions you make as your product grows, each one calibrated to your current stage and constraints. The founders who get it right aren't the ones who build the most sophisticated infrastructure early. They're the ones who add exactly the right amount of structure at the right time, stay out of over-engineering traps, and build in visibility before they desperately need it.
The path is actually pretty clean: ship on platforms until you have users, add observability early, formalize as you grow, hire when the complexity justifies it. Every extra layer of infrastructure you add before you need it is time and focus taken away from the product.
If you're somewhere in Stage 1 or Stage 2 right now, the most valuable thing you can do today is get visibility into what you're running. Try Clanker Cloud free — connect your cloud accounts, ask a question about your infrastructure in plain English, and see what comes back. It takes a minute to set up and costs nothing to start.
FAQ
What infrastructure does a startup need?
Early-stage startups need the minimum viable infrastructure stack: a managed compute platform (Vercel, Railway, or Render), a managed database (Supabase or PlanetScale), DNS and DDoS protection (Cloudflare), and source control with basic CI/CD (GitHub). Observability (logging and uptime monitoring) becomes essential once you have real users. Everything else — Kubernetes, IaC, secrets management, multi-region — should be added as specific needs arise, not in anticipation of needs that may never materialize.
When should a startup move off Heroku?
The right time to move off Heroku (or similar PaaS platforms like Railway or Render) is when either cost or capability becomes a constraint. Cost is usually the first signal: managed PaaS platforms charge a premium for compute that starts to matter at meaningful scale. Capability becomes a constraint when you need fine-grained networking control, custom runtime environments, or specific cloud services that aren't available on the platform. For most startups, this happens somewhere between 1,000 and 10,000 active users, though it depends heavily on the app's compute profile.
Do startups need Kubernetes?
Most early-stage startups do not need Kubernetes. It's a powerful system for orchestrating containerized workloads across multiple services, but it adds real operational complexity. If you're running one or two services, a managed container service (AWS Fargate, Google Cloud Run, Railway) gives you most of the benefits with a fraction of the overhead. Kubernetes becomes worth considering when you have three or more services with complex interdependencies, someone on the team who understands it well, and a scale that justifies the operational investment — typically at the growth stage or beyond.
How do founders manage cloud infrastructure?
Most solo founders and small teams manage cloud infrastructure reactively, which means they find out about problems from users instead of from their own monitoring. The better approach is to add basic observability (logging, uptime monitoring) early, use managed services to minimize ops burden, and use a tool like Clanker Cloud to query and understand your infrastructure in plain English from a single surface. At the growth stage, introducing Infrastructure as Code (Terraform or Pulumi) and automated security scanning reduces the risk of configuration drift and misconfigurations. The goal is to operate with as much visibility as a full platform team, even when you're a team of one.
