Kubernetes Node Upgrade Procedure: Best Practices and Failover Mechanisms
When managing a Kubernetes cluster, node upgrades are an inevitable part of the maintenance cycle. Whether you’re applying security patches, upgrading the Kubernetes version, or implementing system updates, a well-planned node upgrade procedure ensures minimal disruption to your workloads. This article outlines a comprehensive approach to node upgrades and explores the various hooks and mechanisms available for implementing effective failover strategies.
Understanding Node Upgrades
A node upgrade typically involves:
- Cordoning the node to prevent new pod scheduling
- Draining the node to gracefully evict existing pods
- Performing the actual upgrade (OS patches, Kubernetes components, etc.)
- Uncordoning the node to return it to service
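The four steps above can be strung together in a small shell wrapper. The sketch below is one possible shape rather than a definitive procedure: `upgrade_node` and `run_node_upgrade` are hypothetical names, and `run_node_upgrade` is only a placeholder for whatever your environment uses to patch the host.

```shell
#!/bin/sh
# Hypothetical placeholder: replace with the real upgrade step for your hosts
# (for example, an SSH command that runs the package manager).
run_node_upgrade() {
    echo "upgrade step for $1 goes here"
}

# Cordon, drain, upgrade, and uncordon one node, stopping at the first failure.
upgrade_node() {
    node="$1"
    kubectl cordon "$node" || return 1
    kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data || return 1
    run_node_upgrade "$node" || return 1
    kubectl uncordon "$node"
}

# Usage: upgrade_node <node-name>
```

If any step fails, the function returns early and the node stays cordoned, so a half-upgraded node does not receive new pods until someone inspects it.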
Best Practices for Node Upgrades
Pre-Upgrade Planning
- Inventory Assessment: Document all workloads running on the node
- Impact Analysis: Identify critical applications that might be affected
- Backup Strategy: Ensure all critical data is backed up
- Communication Plan: Notify stakeholders about the maintenance window
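For the inventory step, a field selector on the pod's node name gives a quick list of what is running on a node. This is a minimal sketch; the function name `pods_on_node` is an assumption made for illustration.

```shell
#!/bin/sh
# List every pod scheduled on the given node, across all namespaces.
pods_on_node() {
    kubectl get pods --all-namespaces --field-selector "spec.nodeName=$1" -o wide
}

# Usage: pods_on_node <node-name>
```

Saving this output before the maintenance window also gives you a baseline to compare against once the node is back in service.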
Execution Strategy
1. Node Preparation
# Mark the node as unschedulable
kubectl cordon <node-name>
# Check node status
kubectl get nodes
2. Controlled Pod Eviction
# Drain the node with graceful termination
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
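A drain can block indefinitely when a Pod Disruption Budget never allows an eviction, so it is often worth bounding it with the `--timeout` flag of `kubectl drain`. The wrapper below is a sketch under that assumption; `drain_node` is a hypothetical name and the 300s default is an arbitrary choice.

```shell
#!/bin/sh
# Drain a node, giving up after a time limit (default 300s, an arbitrary value).
drain_node() {
    node="$1"
    limit="${2:-300s}"
    if ! kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data --timeout="$limit"; then
        echo "drain of $node failed or timed out; node remains cordoned" >&2
        return 1
    fi
}

# Usage: drain_node <node-name> [timeout]
```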
3. Upgrade Process
- Apply OS updates
- Upgrade kubelet, kube-proxy, and container runtime
- Apply any node-specific configurations
4. Verification
# Verify node status after upgrade
kubectl get nodes -o wide
# Check component versions
kubectl get nodes <node-name> -o jsonpath='{.status.nodeInfo.kubeletVersion}'
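The two checks above can be combined into a single gate that must pass before the node is uncordoned. This is a sketch only: `node_ready_at_version` is a hypothetical helper, and the 300s wait is an assumed value.

```shell
#!/bin/sh
# Succeed only if the node reports Ready and its kubelet is at the expected version.
node_ready_at_version() {
    node="$1"
    expected="$2"
    kubectl wait --for=condition=Ready "node/$node" --timeout=300s || return 1
    actual="$(kubectl get node "$node" -o jsonpath='{.status.nodeInfo.kubeletVersion}')"
    [ "$actual" = "$expected" ]
}

# Usage: node_ready_at_version <node-name> <expected-kubelet-version>
```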
5. Return to Service
# Mark the node as schedulable again
kubectl uncordon <node-name>
Failover Mechanisms and Hooks
Kubernetes provides several mechanisms to ensure application resilience during node upgrades:
1. Pod Disruption Budgets (PDBs)
PDBs allow you to limit the number of pods that can be down simultaneously during voluntary disruptions like node drains.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: app-pdb
spec:
  minAvailable: 2 # or maxUnavailable: 1
  selector:
    matchLabels:
      app: my-application
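With `minAvailable: 2` and three healthy replicas, the budget allows one eviction at a time (3 - 2 = 1); with only two healthy replicas it allows none, and a drain will wait. Before draining, it can help to list budgets that currently allow zero disruptions. The sketch below reads the `status.disruptionsAllowed` field of each PDB; the function name `blocking_pdbs` is an assumption.

```shell
#!/bin/sh
# Print namespace/name of every PodDisruptionBudget that currently allows no evictions.
blocking_pdbs() {
    kubectl get pdb --all-namespaces \
        -o jsonpath='{range .items[*]}{.metadata.namespace}{"/"}{.metadata.name}{" "}{.status.disruptionsAllowed}{"\n"}{end}' |
        awk '$2 == 0 { print $1 }'
}

# Usage: blocking_pdbs
```

An empty result means no budget is fully exhausted at the moment; it does not guarantee the drain will finish, since evictions change the numbers as they proceed.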
2. PriorityClasses
PriorityClasses influence which pods the scheduler preempts and which pods the kubelet evicts first when a node runs short of resources. They do not change the order in which kubectl drain evicts pods; drain-time protection comes from Pod Disruption Budgets.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority-service
value: 1000000
globalDefault: false
description: "High priority pods that should be evicted last"
3. Pod Lifecycle Hooks
These hooks enable applications to gracefully handle termination:
- PreStop Hook: Executed immediately before a container is terminated
lifecycle:
  preStop:
    exec:
      command: ["/bin/sh", "-c", "/pre-stop-hook.sh"]
- PostStart Hook: Executed immediately after a container is created
lifecycle:
  postStart:
    exec:
      command: ["/bin/sh", "-c", "/post-start-hook.sh"]
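The preStop example above calls `/pre-stop-hook.sh` without showing its contents. One common pattern, sketched here purely as an assumption about what such a script might do, is to pause briefly so that Services stop routing to the pod and then run an application-specific shutdown command. The environment variables `PRE_STOP_DELAY_SECONDS` and `PRE_STOP_COMMAND` are hypothetical names invented for this sketch.

```shell
#!/bin/sh
# Hypothetical body for /pre-stop-hook.sh.
pre_stop() {
    # Assumed default of 5 seconds; tune to how quickly endpoints update in your cluster.
    delay="${PRE_STOP_DELAY_SECONDS:-5}"
    sleep "$delay"
    # Optional application-specific shutdown step (for example, flushing buffers).
    if [ -n "${PRE_STOP_COMMAND:-}" ]; then
        sh -c "$PRE_STOP_COMMAND"
    fi
}

# A real hook script would end by calling: pre_stop
```

Time spent in the preStop hook counts against the pod's termination grace period, so the delay plus the shutdown command must fit within it.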
4. Termination Grace Period
Defines how long Kubernetes waits for a pod to shut down gracefully before forcibly terminating it.
terminationGracePeriodSeconds: 60
5. Readiness Probes
Ensure traffic is only sent to pods that are ready to handle requests.
readinessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 5
6. StatefulSet Ordered Updates
For stateful applications, StatefulSets provide ordered and graceful deployment updates:
spec:
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      partition: 0
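The `partition` value controls how far a rolling update proceeds: only pods with an ordinal greater than or equal to the partition are updated, and `partition: 0` updates all of them. Raising the partition before a change and lowering it step by step gives a staged rollout. The helper below is a sketch that patches the field with `kubectl patch`; `set_partition` is a hypothetical name.

```shell
#!/bin/sh
# Set the rolling-update partition of a StatefulSet in the current namespace.
set_partition() {
    kubectl patch statefulset "$1" \
        -p "{\"spec\":{\"updateStrategy\":{\"type\":\"RollingUpdate\",\"rollingUpdate\":{\"partition\":$2}}}}"
}

# Usage: set_partition <statefulset-name> <partition>
```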
7. Custom Controllers and Operators
Application-specific operators (for databases, for example) often implement custom failover logic:
- Database Operators: Kubegres, PostgreSQL Operator, MySQL Operator
- Service Mesh Controllers: Istio, Linkerd for traffic management
- Custom Resource Definitions (CRDs): Extending Kubernetes API for application-specific failover behaviors
8. Anti-Affinity Rules
Ensure pods are distributed across different nodes to minimize impact:
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
        - key: app
          operator: In
          values:
          - database
      topologyKey: "kubernetes.io/hostname"
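To confirm that a rule like the one above is having the intended effect, you can check whether any node hosts more than one matching pod. This is a minimal sketch that looks only at the current namespace; `nodes_with_multiple_replicas` is a hypothetical name.

```shell
#!/bin/sh
# Print each node that runs more than one pod matching the given label selector.
nodes_with_multiple_replicas() {
    kubectl get pods -l "$1" \
        -o jsonpath='{range .items[*]}{.spec.nodeName}{"\n"}{end}' |
        sort | uniq -d
}

# Usage: nodes_with_multiple_replicas app=database
```

No output means every matching pod sits on a different node, so draining any single node takes down at most one replica.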
Real-World Example: Database Failover During Node Upgrade
For database applications like PostgreSQL managed by Kubegres, a comprehensive failover strategy might combine:
- High Priority Class: Protecting database pods from preemption and making them the last to be evicted under node resource pressure
- Pod Disruption Budget: Maintaining minimum database replicas
- PreStop Hooks: Triggering clean database shutdown
- StatefulSet Ordered Updates: Controlling the order of pod restarts
- Anti-Affinity Rules: Distributing replicas across nodes
- Custom Operator Logic: Handling leader election and promotion
Conclusion
A successful node upgrade procedure requires careful planning and leveraging of Kubernetes’ built-in mechanisms for failover and high availability. By implementing the appropriate hooks and strategies, you can maintain application availability even during infrastructure maintenance operations.
The key to minimal disruption lies in understanding your workloads’ specific requirements and applying the right combination of priority settings, lifecycle hooks, and resource definitions. With proper preparation, even critical stateful applications can remain highly available during node upgrades.