Running Kubernetes at Scale: What Actually Works in 2025
Operators, GitOps, service mesh and security patterns that survive real production traffic — not the ones that look good in a conference slide.
Kubernetes is no longer the interesting part of your stack. It's the assumed substrate. The interesting part is what you build on top, and more importantly, what you don't build. After shipping K8s platforms for financial services, retail, and SaaS clients, we've seen a few patterns consistently separate teams running 50 clusters calmly from teams firefighting 5.
Operators: write fewer, reuse more
The Operator pattern is powerful, and that's exactly why it's overused. Every team wants their own CRD. Three years later, you have 40 bespoke controllers nobody maintains.
Our rule of thumb:
- Use an existing operator when the domain is well-covered: `cloudnative-pg` for Postgres, `strimzi` for Kafka, `cert-manager` for TLS, `external-secrets` for secret sync (see the sketch after this list).
- Write an operator only when you encode genuine business logic that no Helm chart can express — typically a multi-step reconciliation with external systems.
- Prefer Kustomize + Helm for everything else. A CRD is a commitment; a values file is not.
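For a sense of what reuse buys you, a replicated Postgres with `cloudnative-pg` is one short manifest instead of a bespoke controller. A minimal sketch (name, namespace, and sizing are illustrative):

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: payments-db        # hypothetical cluster name
  namespace: payments
spec:
  instances: 3             # one primary, two replicas; failover handled by the operator
  storage:
    size: 20Gi
```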
If you do write one, use Kubebuilder or Operator SDK, set a conservative `RequeueAfter`, and make the reconcile loop idempotent. A badly written operator with a tight requeue is how you DDoS your own API server at 3 a.m.
GitOps: ArgoCD vs Flux, pragmatically
Both work. The choice matters less than the discipline around it.
| Criteria | ArgoCD | Flux |
|---|---|---|
| UI / visibility | Strong, built-in | Minimal (Weave GitOps or third-party) |
| Multi-tenancy | AppProjects, mature RBAC | Kustomize-based, lighter |
| Progressive delivery | Argo Rollouts (same ecosystem) | Flagger |
| Footprint | Heavier | Lighter, more modular |
| Team preference | Dev-facing orgs | Platform/SRE-heavy orgs |
What actually breaks GitOps in production isn't the tool — it's:
- Drift you tolerate. If `kubectl edit` works in prod, you don't have GitOps.
- One giant repo. Split by blast radius: a platform repo (cluster-scoped), team repos (namespaces), and an `apps-of-apps` pattern to glue them.
- No environment promotion model. Use Kustomize overlays or Helm value files per env, and promote with a PR, not a merge to `main` (a minimal overlay sketch follows this list).
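What an environment overlay looks like in practice: a per-env `kustomization.yaml` that pins the image tag, so promotion is a one-line PR diff. A sketch, assuming a conventional `base`/`overlays` layout (file and image names are hypothetical):

```yaml
# overlays/prod/kustomization.yaml (hypothetical layout)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: payments
resources:
  - ../../base                # manifests shared by all environments
patches:
  - path: replicas-prod.yaml  # prod-only tweaks, hypothetical patch file
images:
  - name: acme/payments-api   # hypothetical image
    newTag: "1.42.0"          # bumping this tag via PR is the promotion
```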
A minimal ArgoCD Application with automated sync and self-heal:
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: payments-api
  namespace: argocd
spec:
  project: payments
  source:
    repoURL: git@github.com:acme/payments-deploy.git
    targetRevision: main
    path: overlays/prod
  destination:
    server: https://kubernetes.default.svc
    namespace: payments
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    retry:
      limit: 5
      backoff:
        duration: 10s
        factor: 2
        maxDuration: 3m
```
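The `apps-of-apps` glue mentioned above is itself just another Application whose source path contains child Application manifests. A sketch, assuming a hypothetical `platform-deploy` repo with an `apps/` directory:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: root                  # hypothetical root app
  namespace: argocd
spec:
  project: platform
  source:
    repoURL: git@github.com:acme/platform-deploy.git  # hypothetical glue repo
    targetRevision: main
    path: apps                # directory of child Application manifests
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd         # children land in the argocd namespace
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```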
Service mesh: only if you need one
In 2025, Cilium with its native service mesh (eBPF, no sidecars) and Istio Ambient Mode have made the sidecar tax mostly optional. That's good, because a mesh's cost-to-value ratio is often misjudged.
Reasonable triggers for adopting a mesh:
- You need mTLS everywhere and cert-manager at pod level is getting painful.
- You're doing canary/blue-green across dozens of services.
- You need L7 authorization policies (JWT claims, headers) your apps don't implement consistently.
If none of that is true, a CNI with NetworkPolicy support plus a decent ingress (Envoy-based: Contour, Gateway API with Envoy Gateway) is enough. Skip the mesh.
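For reference, the no-mesh path is already expressive for plain routing. A minimal Gateway API `HTTPRoute` attached to an Envoy Gateway-managed Gateway might look like this (gateway name, namespace, and hostname are assumptions):

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: payments-api
  namespace: payments
spec:
  parentRefs:
    - name: public-gateway    # hypothetical shared Gateway, managed by Envoy Gateway
      namespace: infra        # hypothetical infra namespace
  hostnames:
    - payments.example.com    # illustrative hostname
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /api
      backendRefs:
        - name: payments-api  # backing Service, hypothetical
          port: 8080
```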
Security: the boring list that prevents incidents
Most K8s compromises we've investigated came down to the same handful of gaps. A concrete checklist:
- [ ] Pod Security Admission set to `restricted` on all workload namespaces.
- [ ] NetworkPolicies default-deny ingress and egress, then allowlist (see the sketch after this list).
- [ ] No `cluster-admin` bindings outside of platform service accounts. Audit with `rbac-lookup`.
- [ ] Image signing with Cosign, verified at admission via Kyverno or Sigstore policy-controller.
- [ ] Runtime detection with Falco or Tetragon — eBPF-based, catches the exec-into-prod you missed.
- [ ] SBOMs in CI (Syft) and CVE scanning (Trivy, Grype) gating image promotion.
- [ ] Rotate kubeconfig credentials; prefer OIDC (Dex, your IdP) over static tokens.
- [ ] etcd encryption at rest and backups tested by actually restoring, quarterly.
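The first two items cost one manifest per namespace. A sketch combining the PSA label with a default-deny policy (the namespace is illustrative):

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: payments             # illustrative namespace
  labels:
    pod-security.kubernetes.io/enforce: restricted
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny
  namespace: payments
spec:
  podSelector: {}            # every pod in the namespace
  policyTypes:               # no ingress or egress until explicitly allowlisted
    - Ingress
    - Egress
```

One caveat: default-deny egress also blocks DNS, so the very next policy you write should allowlist kube-dns.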
None of this is novel. That's the point — the exotic stuff isn't what gets you.
Scaling past the obvious limits
At ~500 nodes or ~10k pods per cluster, defaults start hurting. What we tune:
- Karpenter over Cluster Autoscaler for AWS workloads — faster, bin-packs better, handles spot diversification natively.
- KEDA for event-driven autoscaling (Kafka lag, SQS depth, Prometheus queries). HPA on CPU is rarely the right signal; see the sketch after this list.
- API server protection: `--max-requests-inflight`, Priority and Fairness config, and a hard look at any controller requeuing every 30 seconds.
- etcd: dedicated nodes, NVMe, and watch compaction intervals. Above 8 GB of etcd data, split the cluster instead of scaling up.
- Multi-cluster over mega-cluster: past a threshold, a second cluster is cheaper than the coordination cost of one huge one. Use Cluster API to keep provisioning reproducible.
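As an example of replacing the CPU signal, a KEDA `ScaledObject` driven by Kafka consumer lag might look like this (workload, topic, and thresholds are assumptions):

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: payments-consumer    # hypothetical workload
  namespace: payments
spec:
  scaleTargetRef:
    name: payments-consumer  # Deployment to scale
  minReplicaCount: 2
  maxReplicaCount: 40
  triggers:
    - type: kafka
      metadata:
        bootstrapServers: kafka-bootstrap.kafka:9092  # hypothetical Strimzi bootstrap
        consumerGroup: payments
        topic: payments.events
        lagThreshold: "500"  # average lag per partition that triggers scale-out
```

Pair this with Karpenter so the new replicas have nodes to land on.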
Key takeaways
- Treat operators as a last resort, not a default. Helm + Kustomize covers 80% of cases.
- ArgoCD or Flux — pick one, but enforce zero drift and split repos by blast radius.
- Adopt a service mesh for a concrete reason (mTLS, L7 authz, progressive delivery), not by reflex.
- Security is a checklist discipline: PSA, NetworkPolicies, signed images, runtime detection.
- Beyond a few hundred nodes, prefer more clusters over bigger ones, and replace CPU-based HPA with event-driven signals via KEDA.