Kubernetes & Cloud Native · May 7, 2026

Running Kubernetes at Scale: Beyond the Basics

Operators, GitOps, service mesh and zero-trust security: what actually matters when your K8s footprint crosses 50 clusters.

Most teams stop learning Kubernetes once kubectl apply works. That's fine until you hit the second cluster, the fifth team, or the fiftieth namespace. Past that point, the platform you thought you had becomes a distributed systems problem with its own SRE budget. Here's what we see working in production with clients running anything from 5 to 500+ clusters.

Operators: stop writing Helm charts for stateful workloads

Helm is fine for stateless apps. For databases, message brokers, or anything with a non-trivial lifecycle (backup, failover, version upgrades), an Operator pays for itself within months.

The pattern is mature in 2025: CloudNativePG for Postgres, Strimzi for Kafka, Crossplane for cloud resources, and the Operator SDK if you need to build your own. The win isn't "Kubernetes-native deployment"; it's encoding your senior DBA's runbook as a controller that runs at 3 AM without paging anyone.

A realistic rule of thumb: if your runbook for a service is longer than two pages, you have an Operator-shaped problem.

apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: orders-db
spec:
  instances: 3
  postgresql:
    parameters:
      max_connections: "200"
  bootstrap:
    initdb:
      database: orders
  backup:
    barmanObjectStore:
      destinationPath: s3://backups/orders-db
      s3Credentials:
        inheritFromIAMRole: true
    retentionPolicy: "30d"

Three replicas, automated backups to S3, and failover handled by the controller. Replacing this with a Helm chart and a CronJob is a regression.

GitOps: ArgoCD vs Flux, picking deliberately

GitOps is no longer a debate, but the tool choice still matters. Quick comparison from what we deploy in the field:

| Criterion | ArgoCD | Flux |
|---|---|---|
| UI / UX | Strong web UI, good for dev teams | CLI-first, minimal UI |
| Multi-tenancy | AppProjects, RBAC mature | Tenant model via Kustomize/Flux CRDs |
| Multi-cluster | Hub-and-spoke or ApplicationSets | Cluster API + Flux per cluster |
| Progressive delivery | Argo Rollouts (same family) | Flagger |
| Footprint | Heavier control plane | Lighter, more modular |

ArgoCD wins when you have many product teams who want visibility. Flux wins when platform engineers own delivery end-to-end and prefer composable, smaller controllers. Both are CNCF graduated; both work. What kills GitOps adoption isn't the tool, it's putting cluster-scoped resources and app manifests in the same repo with no ownership boundaries.

A pattern that scales: one platform repo (cluster add-ons, RBAC, network policies), N team-* repos (their apps), and ApplicationSets generating Argo Applications from a directory convention. New team onboarding becomes a pull request, not a ticket.
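That directory convention can be wired up with a git directory generator. A minimal sketch (the repo URL, `teams/*` layout, and namespace-per-team mapping are assumptions, not prescriptions):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: team-apps
  namespace: argocd
spec:
  generators:
    # One Application per directory under teams/ — adding a directory
    # in a PR is what onboards a new team.
    - git:
        repoURL: https://git.example.com/platform/team-apps.git  # hypothetical repo
        revision: main
        directories:
          - path: teams/*
  template:
    metadata:
      name: '{{path.basename}}'
    spec:
      project: '{{path.basename}}'   # assumes one AppProject per team
      source:
        repoURL: https://git.example.com/platform/team-apps.git
        targetRevision: main
        path: '{{path}}'
      destination:
        server: https://kubernetes.default.svc
        namespace: '{{path.basename}}'
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
```

The AppProject boundary is what keeps a team's Application from writing cluster-scoped resources; the generator only automates the plumbing.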

Service mesh: only if you can name the problem

Istio, Linkerd, Cilium Service Mesh, Kuma. The honest question: what are you solving?

  • mTLS everywhere → Linkerd or Cilium (lowest operational cost)
  • Fine-grained traffic policy, multi-cluster, complex L7 routing → Istio with ambient mode (GA since 1.24; no more per-pod sidecars)
  • eBPF-native, replacing kube-proxy → Cilium

Istio's ambient mode is the most significant shift of the last 18 months: ztunnel handles L4/mTLS at the node level, waypoint proxies handle L7 only when you ask. The sidecar tax (memory, restart coupling, init containers) is largely gone. If you evaluated Istio in 2022 and rejected it, it's worth a second look.

That said: a service mesh you don't need is a 2 AM incident waiting to happen. Network policies (Calico, Cilium) cover 80% of east-west security needs without a mesh.
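The NetworkPolicy baseline is cheap to state explicitly. A default-deny for a namespace looks like this (the `orders` namespace is a placeholder; allow rules for DNS and specific peers come on top):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: orders   # hypothetical namespace
spec:
  # Empty podSelector matches every pod in the namespace;
  # listing both policyTypes with no rules denies all traffic.
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
```

Apply one of these per namespace first, then add targeted allow policies; the order matters because pods not selected by any policy default to allow-all.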

Security: the non-negotiable baseline

In 2025, "we run Kubernetes securely" means at minimum:

  • [ ] Pod Security Admission set to restricted on all non-system namespaces
  • [ ] No long-lived ServiceAccount tokens; use projected tokens or workload identity (IRSA on EKS, Workload Identity on GKE)
  • [ ] Image signing with Sigstore/Cosign, verified by a policy controller (Kyverno or Gatekeeper)
  • [ ] SBOM generated at build, scanned continuously (Trivy, Grype)
  • [ ] Network policies default-deny per namespace
  • [ ] Secrets encrypted at rest with a KMS-backed provider, or moved to External Secrets Operator + Vault
  • [ ] Audit logs shipped off-cluster, with alerts on exec, portforward, and impersonate
  • [ ] Regular CIS benchmark scans (kube-bench) wired into CI
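The first checklist item is a one-manifest change per namespace. A sketch of Pod Security Admission labels on a namespace (namespace name is illustrative):

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: orders   # hypothetical namespace
  labels:
    # Reject pods that violate the restricted profile
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/enforce-version: latest
    # Also surface warnings in kubectl output during rollout
    pod-security.kubernetes.io/warn: restricted
```

Rolling this out cluster-wide usually means starting with `warn`/`audit` on existing namespaces and flipping to `enforce` once the violations are fixed.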

Kyverno policies as code, stored in the same GitOps repo as cluster config, give you an auditable trail. The combination of Cosign + Kyverno verification is now the de facto supply chain control for Kubernetes; SLSA level 3 is achievable without exotic tooling.
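As one sketch of that Cosign + Kyverno combination, a ClusterPolicy that admits only keyless-signed images (the registry pattern, OIDC issuer, and subject are assumptions you would replace with your own CI identity):

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: verify-signed-images
spec:
  validationFailureAction: Enforce   # block unsigned images, don't just audit
  webhookTimeoutSeconds: 30
  rules:
    - name: require-cosign-signature
      match:
        any:
          - resources:
              kinds:
                - Pod
      verifyImages:
        - imageReferences:
            - "registry.example.com/*"   # hypothetical registry
          attestors:
            - entries:
                # Keyless verification against the CI workload identity
                - keyless:
                    issuer: "https://token.actions.githubusercontent.com"
                    subject: "https://github.com/example-org/*"   # hypothetical org
```

Because the policy lives in the platform GitOps repo, the audit trail for "what was allowed to run, and why" is the Git history itself.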

Scaling: where the money leaks

At scale, three things dominate the bill and the pager:

  1. Autoscaling correctness. HPA on CPU is rarely the right signal. Use KEDA for event-driven workloads (queue depth, Kafka lag, Prometheus metrics). Combine with Karpenter (or Cluster Autoscaler with multiple node groups) for node-level elasticity. Karpenter's consolidation feature alone typically cuts EC2 spend 20–30% on bursty workloads.
  2. Control plane limits. etcd starts hurting around 8000 pods or 1.5M objects per cluster. Before that, federate: more clusters, smaller blast radius. Tools like Cluster API and vCluster make this manageable. Anti-pattern: one 2000-node cluster shared by 40 teams.
  3. Observability cardinality. Prometheus with default scrape configs explodes past 10M active series. Use Mimir, Thanos, or VictoriaMetrics, drop unused labels at ingestion, and treat metrics as a budget per team.
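To make the "KEDA on Kafka lag" point concrete, a minimal ScaledObject sketch (deployment name, broker address, topic, and thresholds are all placeholders):

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: orders-consumer
  namespace: orders            # hypothetical namespace
spec:
  scaleTargetRef:
    name: orders-consumer      # hypothetical Deployment
  minReplicaCount: 2
  maxReplicaCount: 50
  triggers:
    - type: kafka
      metadata:
        bootstrapServers: kafka.kafka.svc:9092   # hypothetical broker
        consumerGroup: orders-consumer
        topic: orders
        # Scale out when per-replica lag exceeds this value
        lagThreshold: "500"
```

KEDA drives the HPA from consumer lag instead of CPU; Karpenter then handles the node side when the replica count outgrows current capacity.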

Key takeaways

  • Use Operators for stateful workloads; Helm is not a lifecycle manager.
  • Pick ArgoCD or Flux based on team topology, not Twitter consensus, and separate platform repos from app repos.
  • Adopt a service mesh only when you can write down the problem in one sentence; revisit Istio ambient if you dismissed it pre-2024.
  • Treat Cosign + Kyverno + PSA restricted + default-deny NetworkPolicies as the non-negotiable security floor.
  • At scale, federate clusters before tuning etcd, and pair KEDA with Karpenter to control both pod and node elasticity.