The Mystery of the Premature 502: Untangling GCP’s Two Health Check Systems
The Scene of the Crime: A Flawless Deployment Ends in a 502
You’ve just deployed a new version of your application to Google Kubernetes Engine (GKE). The CI/CD pipeline glows green, `kubectl get pods` shows all pods in a `Running` state, and your deployment is, by all accounts, successful. You navigate to your application’s URL, hit refresh, and are greeted by the dreaded 502 Bad Gateway error.
After a few frantic refreshes, the error vanishes and your application loads perfectly. What just happened? You have readiness probes configured. They should have prevented traffic from reaching the pods before they were ready. Yet, somehow, traffic arrived too early.
This is the story of a common and confusing problem in GKE, born from a simple fact: there is not one health check system at play, but two. Understanding the division between them is the key to solving the mystery of the premature 502.
The Core Concept: Two Worlds, Two Health Checks
The fundamental misunderstanding arises because the Kubernetes health checks (liveness/readiness probes) and the Google Cloud Load Balancer’s health check operate in different domains and serve different purposes.
- Kubernetes Probes (Internal World): These live inside your GKE cluster. They determine a pod’s health from the perspective of Kubernetes itself. Is the pod running correctly? Is it ready to accept internal traffic from other services?
- Cloud Load Balancer Health Check (External World): This lives outside your cluster, within the Google Cloud network. It determines a backend’s health from the perspective of the global load balancer. Is this group of pods (represented by a Network Endpoint Group or NEG) ready to receive external user traffic?
At a high level, the two checks run in parallel and independently of each other. When they are not perfectly aligned, the Load Balancer can declare the backend “healthy” and send traffic before Kubernetes considers all the individual pods truly ready to serve, leading to the 502.
A Deep Dive into the Components
Let’s trace the path of a request from the user to your pod to understand each component’s role.
1. Kubernetes Liveness & Readiness Probes
These are the health checks you define in your pod’s specification.
- Readiness Probe: Tells the Kubernetes Service when a pod is ready to be added to the pool of service endpoints. If it fails, the Service removes the pod’s IP address, stopping internal cluster traffic from reaching it.
- Liveness Probe: Tells the Kubelet if a pod needs to be restarted. If it fails, the pod is killed and a new one is created.
Key takeaway: Probes manage a pod’s lifecycle and internal network availability within the Kubernetes cluster.
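For reference, here is a minimal Deployment sketch with both probes configured. The name, image, `/healthz` path, and port 8080 are illustrative assumptions rather than details from any specific setup; the point is that the readiness probe encodes when the application is genuinely able to serve.

```yaml
# deployment.yaml -- illustrative sketch; name, image, path, and port are assumptions
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: my-app
        image: gcr.io/my-project/my-app:1.0.0
        ports:
        - containerPort: 8080
        readinessProbe:            # gates membership in the Service's endpoint list
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 10
        livenessProbe:             # failing this causes the kubelet to restart the container
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 15
          periodSeconds: 20
```

Keep the `/healthz` endpoint in mind; the solution section below points the Load Balancer’s health check at the same path.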
2. Kubernetes Service (`ClusterIP`)
When you create a Service of type `ClusterIP`, you create a stable, internal IP address that other pods can use to access the pods matched by the Service’s selector. The Service uses the results of the readiness probe to maintain its list of healthy endpoints.
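As a minimal sketch (the name, selector, and ports are assumptions), an internal-only Service looks like the following; the solution section later shows a fuller, annotated version of the Service that actually sits behind the Ingress.

```yaml
# service-clusterip.yaml -- minimal sketch; name, selector, and ports are assumptions
apiVersion: v1
kind: Service
metadata:
  name: my-app-internal
spec:
  type: ClusterIP
  selector:
    app: my-app          # pods with this label (and a passing readiness probe) become endpoints
  ports:
  - port: 8080
    targetPort: 8080
```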
3. Kubernetes Ingress
The Ingress object is your request for external access. You define rules for how traffic from a specific host or path should be routed to a Kubernetes Service. In GKE, the GKE Ingress controller watches for these objects and automatically provisions a powerful Google Cloud External HTTPS Load Balancer.
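A minimal Ingress sketch for the GKE Ingress controller might look like this. The host, Service name, and port are assumptions carried over from the earlier sketches, and the `kubernetes.io/ingress.class: "gce"` annotation requests the external HTTP(S) Load Balancer.

```yaml
# ingress.yaml -- illustrative sketch; host, service name, and port are assumptions
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-app-ingress
  annotations:
    kubernetes.io/ingress.class: "gce"   # external HTTP(S) Load Balancer
spec:
  rules:
  - host: api.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: my-app-service
            port:
              number: 8080
```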
4. The Google Cloud Load Balancer
This is not a single entity. The GKE Ingress controller creates a collection of GCP resources:
- Forwarding Rule: The public IP address that receives user traffic.
- Target Proxy: Terminates the user’s HTTPS session.
- URL Map: Routes the request to the correct backend based on the host and path (e.g., `api.example.com/users`).
- Backend Service: Manages a collection of backends and, crucially, is where the Load Balancer’s Health Check is configured.
5. The Load Balancer Health Check
This is the heart of the problem. When GKE provisions the load balancer, it automatically creates a GCP Health Check and attaches it to the Backend Service. This health check runs from Google’s global infrastructure and pings your pods directly on their IP addresses (via a Network Endpoint Group - NEG).
If this health check passes, the Load Balancer considers the backend healthy and will send user traffic to it. It does not know or care about your Kubernetes readiness probe’s status.
Connecting the Dots: The Race Condition and the 502
Now we can see the full picture and the race condition that causes the 502 error.
- Deployment: You deploy your application. New pods start up.
- Pod Startup: Your application inside the pod takes time to initialize (e.g., connect to a database, load caches). During this time, its readiness probe is failing.
- Two Checks Begin: Both the K8s readiness probe and the GCP LB health check start pinging the new pods.
- The Race: The default GCP health check created by the Ingress controller is often very permissive, typically a simple HTTP request to / on the serving port (or even just a TCP connection check). It can pass very quickly, sometimes seconds before your application is truly initialized and its more sophisticated readiness probe passes.
- The Result:
  - The GCP LB health check passes, and the Load Balancer declares the backend “HEALTHY”.
  - The Load Balancer immediately starts sending user traffic to the new pod’s IP.
  - The pod receives the traffic, but its application isn’t ready; it can’t handle the request and returns an error.
  - The Load Balancer’s proxy sees this failed response and returns a 502 Bad Gateway to the user.
- The Recovery: A few seconds later, the Kubernetes readiness probe finally passes. The application is now fully initialized. New requests that arrive are handled correctly. The 502s stop.
Troubleshooting Tips
When you see a 502, don’t just look at your pods. Check the state of the Load Balancer.
Find your Backend Service: Get the name of the backend service associated with your Ingress.
```bash
kubectl describe ingress <your-ingress-name>
# Look for an annotation like "ingress.kubernetes.io/backends"
# It will contain the name of the Backend Service.
```
Check the Load Balancer’s view of backend health: Use `gcloud` to ask the Load Balancer directly whether it thinks its backends are healthy.

```bash
gcloud compute backend-services get-health <backend-service-name> --global
```
This command shows the health status of each endpoint (pod) from the Load Balancer’s perspective. If it reports `HEALTHY` here while you’re getting 502s, you’ve confirmed the race condition.

Describe the GCP Health Check: See exactly what the Load Balancer is checking.
```bash
# First, get the health check name from the backend service
gcloud compute backend-services describe <backend-service-name> --global | grep healthChecks

# Then, describe the health check itself
gcloud compute health-checks describe <health-check-name> --global
```
This reveals the port, path, interval, and thresholds the LB is using. You will often find it is a generic and overly optimistic check.
The Solution: Aligning the Two Worlds with BackendConfig
The official GKE solution is to use a `BackendConfig` Custom Resource Definition (CRD). This allows you to customize the GCP-specific settings for your backend, including the health check.

By creating a `BackendConfig`, you can force the Load Balancer to use health check parameters that are identical to, or closely aligned with, your Kubernetes readiness probe.
Here’s how to do it:
Create a `BackendConfig` YAML file:

```yaml
# backend-config.yaml
apiVersion: cloud.google.com/v1
kind: BackendConfig
metadata:
  name: my-app-backend-config
spec:
  healthCheck:
    checkIntervalSec: 10
    timeoutSec: 5
    healthyThreshold: 1
    unhealthyThreshold: 3
    type: HTTP
    requestPath: /healthz   # <-- Use the same path as your readiness probe!
    port: 8080              # <-- Use the same port as your readiness probe!
```
Associate the `BackendConfig` with your Kubernetes Service: You do this by adding an annotation to your Service manifest. The keys in the annotation refer to Service ports, either by number or by name.

```yaml
# service.yaml
apiVersion: v1
kind: Service
metadata:
  name: my-app-service
  annotations:
    cloud.google.com/backend-config: '{"ports": {"8080":"my-app-backend-config"}}'
spec:
  ports:
  - name: http        # <-- the annotation key "8080" refers to this port by its number
    port: 8080
    protocol: TCP
    targetPort: 8080
  selector:
    app: my-app
  type: NodePort      # Ingress requires a NodePort or LoadBalancer type Service
```
Apply both files:

```bash
kubectl apply -f backend-config.yaml
kubectl apply -f service.yaml
```
Now, when the GKE Ingress controller creates the Load Balancer, it will see the `BackendConfig` and apply your custom health check parameters. Because the Load Balancer is now checking the exact same `/healthz` endpoint as your readiness probe, it cannot become healthy until the application itself reports that it is ready. The race condition is eliminated.
Conclusion
The mysterious 502 on GKE is rarely a sign that your application is broken. Instead, it’s a symptom of a sophisticated system with distinct layers of health checking. By understanding that the external Cloud Load Balancer and the internal Kubernetes Service operate independently, you can diagnose the issue. And by using a `BackendConfig`, you can elegantly bridge the gap between these two worlds, ensuring traffic only flows when your application is truly ready, leading to more stable and reliable deployments.