Zero-Downtime Kubernetes Upgrades on GCP (Including PostgreSQL)

How We Upgrade GKE (Including PostgreSQL) Without Maintenance Windows

Upgrading Kubernetes in production is one of those tasks that looks easy on paper and terrifying in reality—especially when databases are running inside the cluster.

In this post, we share how our DevOps team performs near zero-downtime Kubernetes upgrades on Google Cloud Platform (GCP) using GKE, even with PostgreSQL running as a StatefulSet. This is not theory—this is a repeatable production runbook.


What “Zero Downtime” Means (Realistically)

Let’s be precise.

  • No scheduled maintenance window
  • No user-visible outage
  • Applications may experience brief connection retries, but traffic recovers automatically
  • Control plane, nodes, and workloads upgrade safely

This is the standard most modern SRE teams aim for—and it’s achievable with the right design.


Platform Context

Our setup:

  • Google Kubernetes Engine (GKE Standard)
  • Regional cluster
  • Multiple node pools
  • PostgreSQL running in Kubernetes
  • Kubegres operator (1 primary, 1 replica)
  • Applications connect via Service, not Pod IPs

The Core Strategy: Blue/Green Node Pools

Instead of upgrading nodes in place, we treat node upgrades like an application rollout.

Conceptually:

  • Blue node pool → current production (Kubernetes 1.34)
  • Green node pool → new nodes (Kubernetes 1.35)

Workloads are migrated deliberately, not evicted blindly.


Architecture Diagram

Flow overview

Users
  |
  v
GKE Load Balancer / Service
  |
  v
PostgreSQL Primary (Service)
  |
  +--> Replica (Blue)  ---> moved first
  |
  +--> Primary (Blue)  ---> moved last

Blue Node Pool  ----->  Green Node Pool
(K8s 1.34)             (K8s 1.35)

Step 0: Preconditions (Non-Negotiable)

1. PodDisruptionBudget

We protect the database from accidental mass eviction.

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: postgres-pdb
  namespace: db
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: postgres
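
With two database pods and minAvailable: 1, the budget permits one voluntary disruption while both pods are healthy and none while one of them is restarting. A PDB gates evictions such as node drains; a direct kubectl delete pod is not checked against it, so it can be worth reading the budget by hand before the manual moves in Steps 3 and 5. A minimal sketch, assuming the namespace and PDB name from the manifest above (the function name is our own):

```shell
# Succeeds only if the PDB currently permits at least one voluntary disruption.
# Reads .status.disruptionsAllowed, which Kubernetes maintains on every PDB.
pdb_allows_disruption() {
  namespace="$1"
  pdb="$2"
  allowed="$(kubectl -n "$namespace" get pdb "$pdb" \
    -o jsonpath='{.status.disruptionsAllowed}')"
  # Treat an empty answer as 0 so a missing status never reads as "safe".
  [ "${allowed:-0}" -ge 1 ]
}

# Usage (not executed here):
#   pdb_allows_disruption db postgres-pdb || echo "wait: a database pod is still recovering"
```

If the check fails, one database pod is already unavailable and moving the other would take the cluster down.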

2. Applications must retry database connections

Failover takes seconds. Clients must retry.
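
The retry requirement can be sketched as exponential backoff around whatever opens the connection. Real applications would normally get this from their database driver or connection pool; the shell version below only illustrates the shape, and the function name, attempt count, and starting delay are assumptions rather than values from our services.

```shell
# Run a command until it succeeds, doubling the wait between attempts.
# $1 is the maximum number of attempts; the rest is the command to run.
retry_with_backoff() {
  max_attempts="$1"
  shift
  attempt=1
  delay=1
  while :; do
    "$@" && return 0
    if [ "$attempt" -ge "$max_attempts" ]; then
      return 1   # give up: the command failed on every attempt
    fi
    sleep "$delay"
    delay=$((delay * 2))
    attempt=$((attempt + 1))
  done
}

# Usage (not executed here; DATABASE_URL is a hypothetical variable):
#   retry_with_backoff 5 psql "$DATABASE_URL" -c 'select 1'
```

Five attempts starting at one second cover roughly 15 seconds of waiting, which is in the range of the failover window described in Step 4.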


Step 1: Create the Green Node Pool

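We create a new node pool running the target Kubernetes version. GKE does not allow a node pool to run a newer version than the control plane, so the control plane upgrade shown in Step 6 has to be done before this command will succeed. In a regional cluster, --num-nodes is the node count per zone.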

gcloud container node-pools create green-135 \
  --cluster prod-cluster \
  --region europe-west1 \
  --cluster-version 1.35.x-gke.y \
  --machine-type e2-standard-4 \
  --num-nodes 2

Wait until nodes are ready:

kubectl get nodes -l cloud.google.com/gke-nodepool=green-135
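
Rather than polling the list by eye, kubectl wait can block until every node in the pool reports Ready. A small sketch with a hypothetical wrapper name and a default timeout we picked for illustration:

```shell
# Block until all nodes in the given GKE node pool report the Ready condition.
# $1 is the pool name, $2 an optional timeout (defaults to 10m).
wait_for_pool_ready() {
  pool="$1"
  timeout="${2:-10m}"
  kubectl wait --for=condition=Ready nodes \
    -l "cloud.google.com/gke-nodepool=$pool" \
    --timeout="$timeout"
}

# Usage (not executed here):
#   wait_for_pool_ready green-135 15m
```

kubectl wait exits with an error if the selector matches no nodes at all, so run it only after the new nodes have registered with the cluster.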

Step 2: Force PostgreSQL Pods onto the Green Pool

Kubegres supports scheduling configuration. We apply node affinity so any restarted DB pod lands only on green nodes.

kubectl -n db patch kubegres my-postgres --type merge -p '{
  "spec": {
    "scheduler": {
      "affinity": {
        "nodeAffinity": {
          "requiredDuringSchedulingIgnoredDuringExecution": {
            "nodeSelectorTerms": [{
              "matchExpressions": [{
                "key": "cloud.google.com/gke-nodepool",
                "operator": "In",
                "values": ["green-135"]
              }]
            }]
          }
        }
      }
    }
  }
}'

This is the safety lock that makes everything predictable.


Step 3: Move the Replica First (No Impact)

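We identify the replica. pg_is_in_recovery() returns t on a replica and f on the primary; the pod names used here are placeholders for whatever kubectl -n db get pods reports: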

kubectl -n db exec postgres-1 -- \
  psql -U postgres -tAc "select pg_is_in_recovery();"
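
To avoid misreading a bare t or f during an upgrade, the same query can be wrapped so it prints the role by name. This is a convenience sketch; the function name is ours, and it assumes the postgres superuser can connect locally inside the pod, as in the command above.

```shell
# Print "replica" or "primary" for a PostgreSQL pod, based on pg_is_in_recovery().
# Returns non-zero if the answer is neither t nor f (for example, psql failed).
pg_role() {
  namespace="$1"
  pod="$2"
  flag="$(kubectl -n "$namespace" exec "$pod" -- \
    psql -U postgres -tAc 'select pg_is_in_recovery();')"
  case "$flag" in
    t) echo replica ;;
    f) echo primary ;;
    *) echo unknown; return 1 ;;
  esac
}

# Usage (not executed here):
#   pg_role db postgres-0
#   pg_role db postgres-1
```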

Delete the replica pod:

kubectl -n db delete pod postgres-1

What happens:

  • Pod restarts on green node
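  • Disk reattaches (a zonal persistent disk can only attach to a node in its own zone, so the green pool needs a node there)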
  • Replica resyncs
  • Primary remains untouched

No client impact.
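
Before moving on to promotion, it is worth confirming that the relocated replica is streaming from the primary again. One way to check, sketched under the assumption that the primary is reachable with the same psql invocation used above (the function name is hypothetical):

```shell
# Succeeds if the primary reports at least one replica in the "streaming" state.
# pg_stat_replication is a standard PostgreSQL view, populated on the primary.
replica_is_streaming() {
  namespace="$1"
  primary_pod="$2"
  streaming="$(kubectl -n "$namespace" exec "$primary_pod" -- \
    psql -U postgres -tAc \
    "select count(*) from pg_stat_replication where state = 'streaming';")"
  [ "${streaming:-0}" -ge 1 ]
}

# Usage (not executed here):
#   replica_is_streaming db postgres-0 || echo "replica has not caught up yet"
```

A count of zero means the replica is still starting or catching up, and promoting it at that point risks losing recent writes.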


Step 4: Promote the Replica (Controlled Failover)

Kubegres supports manual promotion.

kubectl -n db patch kubegres my-postgres --type merge -p '{
  "spec": {
    "failover": {
      "promotePod": "postgres-1"
    }
  }
}'

What happens:

  • Replica becomes primary
  • Primary Service switches automatically
  • Clients reconnect (brief retry window)

This is the only moment where connections may reset—usually a few seconds.


Step 5: Move the Old Primary

Now the old primary is just a replica.

kubectl -n db delete pod postgres-0

It restarts on green, attaches its disk, and joins replication.

At this point:

  • Primary → green
  • Replica → green
  • Blue pool no longer hosts database pods

Step 6: Upgrade the Rest of the Cluster

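Now that the database is safe, we upgrade everything else. One ordering caveat: GKE requires the control plane to be at or above the version of every node pool, so in practice the control plane command below has to run before the green pool is created in Step 1.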

Upgrade control plane

gcloud container clusters upgrade prod-cluster \
  --region europe-west1 \
  --master \
  --cluster-version 1.35.x-gke.y

Upgrade remaining node pools (with surge)

gcloud container node-pools update default-pool \
  --cluster prod-cluster \
  --region europe-west1 \
  --max-surge-upgrade 1 \
  --max-unavailable-upgrade 0

Then run the upgrade for that pool:

gcloud container clusters upgrade prod-cluster \
  --region europe-west1 \
  --node-pool default-pool \
  --cluster-version 1.35.x-gke.y
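
After the upgrades finish, a quick check for stragglers helps confirm nothing was skipped. The sketch below lists nodes whose kubelet version does not start with a given prefix; the function name and the prefix format are our own choices for illustration.

```shell
# Print the name and kubelet version of every node NOT on the expected version.
# $1 is a version prefix such as "v1.35." (kubelet versions start with "v").
nodes_not_on_version() {
  prefix="$1"
  kubectl get nodes --no-headers \
    -o custom-columns=NAME:.metadata.name,VERSION:.status.nodeInfo.kubeletVersion |
    awk -v p="$prefix" 'index($2, p) != 1 { print $1, $2 }'
}

# Usage (not executed here); empty output means every node is on 1.35:
#   nodes_not_on_version v1.35.
# The control plane version can be read separately with:
#   gcloud container clusters describe prod-cluster --region europe-west1 \
#     --format='value(currentMasterVersion)'
```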

Step 7: Decommission the Blue Pool

Once everything runs on green:

gcloud container node-pools delete blue-134 \
  --cluster prod-cluster \
  --region europe-west1

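Until this step, rolling workloads back is straightforward: the blue nodes still exist, so reverting the affinity patch and restarting the pods returns them to blue.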
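
Before running the delete above, any stateless workloads still on blue need to be moved off. A common way to do that is to cordon every blue node first, so evicted pods cannot land on another blue node, and then drain them one at a time. The sketch below assumes that approach; the function name and the 10-minute drain timeout are illustrative, not values from our runbook.

```shell
# Cordon all nodes in a pool, then drain them one by one.
# Draining uses the eviction API, so PodDisruptionBudgets are honored.
drain_pool() {
  pool="$1"
  nodes="$(kubectl get nodes -l "cloud.google.com/gke-nodepool=$pool" -o name)"
  if [ -z "$nodes" ]; then
    echo "no nodes found in pool $pool" >&2
    return 1
  fi
  # Pass 1: mark every node unschedulable before evicting anything.
  for node in $nodes; do
    kubectl cordon "${node#node/}" || return 1
  done
  # Pass 2: evict pods node by node; DaemonSet pods are left in place.
  for node in $nodes; do
    kubectl drain "${node#node/}" \
      --ignore-daemonsets --delete-emptydir-data --timeout=10m || return 1
  done
}

# Usage (not executed here):
#   drain_pool blue-134
```

Cordoning is reversible with kubectl uncordon, which keeps the rollback path open until the pool is actually deleted.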


Why This Works

  • No forced evictions
  • The database moves deliberately: replica first, primary last
  • Failover is controlled, not accidental
  • Services abstract pod identity
  • Workload rollback stays possible until the blue pool is deleted

This pattern scales cleanly from stateless services to critical stateful systems.


Common Failure Modes We Avoided

Mistake | Result
Upgrading nodes in place | DB restart + outage
No PDB | Simultaneous eviction
Pod IP connections | Broken clients
No retries | User-visible downtime
Single Postgres pod | Unavoidable outage

When We Would Not Do This

If your requirements include:

  • Absolute zero connection resets
  • No failover logic in apps
  • Minimal operational burden

Then Cloud SQL or AlloyDB is the better choice.

Running databases in Kubernetes gives flexibility—but demands discipline.


Final Thoughts

Zero-downtime Kubernetes upgrades are not about a magic flag.

They require:

  • Architectural intent
  • Blue/green infrastructure
  • Explicit control of scheduling
  • Applications designed for failure

On GCP, GKE gives you excellent primitives—but DevOps engineering turns them into reliability.


This post is licensed under CC BY 4.0 by the author.