HTTP 502 Server Error: Troubleshooting and Solutions
The HTTP 502 Bad Gateway response code indicates that a server acting as a gateway or proxy received an invalid response from the upstream server. In our setup, the gateway is the Google Cloud Global Load Balancer, so that is where the investigation starts.
Issue Identification
Upon reviewing the logs, it was discovered that there were no available backends for the load balancer to forward the request to. The error details from the log are as follows:
{
  "resource": {
    "type": "http_load_balancer",
    "labels": {
      "forwarding_rule_name": "global-load-balancer",
      "url_map_name": "global"
    }
  },
  "httpRequest": {
    "requestUrl": "https://example.com/"
  },
  "jsonPayload": {
    "statusDetails": "failed_to_pick_backend"
  }
}
The statusDetails value failed_to_pick_backend shows that the load balancer could not select a backend to route the request to, implying that none of the backends were available at the time.
Environment Overview
The backend service is provided by a GKE (Google Kubernetes Engine) cluster, so the unavailability means the load balancer could not find a single ready instance to serve the request.
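For context, the rough shape of the setup is assumed to be a Service selecting the application pods and an Ingress that provisions the global HTTP(S) load balancer on GKE. The manifest below is only an illustrative sketch: the names and ports are assumptions, and the app: store label is borrowed from the anti-affinity example later in this article rather than taken from the real configuration.

# Hypothetical sketch of the assumed setup: a Service in front of the pods
# and an Ingress that provisions the external HTTP(S) load balancer on GKE.
apiVersion: v1
kind: Service
metadata:
  name: store
spec:
  type: NodePort
  selector:
    app: store
  ports:
    - port: 80
      targetPort: 8080
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: store-ingress
spec:
  defaultBackend:
    service:
      name: store
      port:
        number: 80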
Readiness Probe Failure
In Kubernetes, the readinessProbe checks the health of a pod and ensures that no requests are sent to pods that are not ready. The logs suggest that, at that moment, all of our pods were failing the readiness probe.
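For reference, a readiness probe is declared on the serving container roughly like the sketch below; the /healthz path and port 8080 are assumptions carried over from the probe examples later in this article.

# Sketch of a basic HTTP readiness probe on the serving container
# (path and port are assumed, matching the probes shown below).
readinessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5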
Potential Causes
- Pods died but were not restarted: this can be mitigated with a livenessProbe.
- All probes failed simultaneously: this can happen if all pods are scheduled on the same node and that node terminates.
- Pods were too busy to answer the readiness probe: the pods may have been overwhelmed by traffic, failing the probe not because the service was broken but because they were too busy to respond in time.
Solutions
Liveness Probe
To address the first issue, we add a livenessProbe. When the liveness probe fails repeatedly, the kubelet restarts the container, so a dead pod is recycled rather than left in place:
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 3
  periodSeconds: 3
Anti-Affinity
To prevent all pods from being scheduled on the same node, we use the podAntiAffinity feature:
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
              - key: app
                operator: In
                values:
                  - store
          topologyKey: kubernetes.io/hostname
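One caveat with requiredDuringSchedulingIgnoredDuringExecution: it is a hard constraint, so on a cluster with fewer nodes than replicas some pods will stay Pending. If that trade-off is not acceptable, a softer variant that merely prefers spreading can be used instead; the sketch below reuses the same app: store label.

# Sketch of the softer alternative: prefer, rather than require, spreading
# the "store" pods across nodes, so scheduling still succeeds on small clusters.
spec:
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchExpressions:
                - key: app
                  operator: In
                  values:
                    - store
            topologyKey: kubernetes.io/hostname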
Readiness Probe Optimization
For the third issue, we keep the readiness probe lightweight and give it some slack, with an explicit timeout and a failure threshold of 3, so a briefly busy pod is not marked unready immediately:
readinessProbe:
  exec:
    command:
      - curl
      - -s
      - --fail
      - http://localhost:8080/healthz
  timeoutSeconds: 3
  periodSeconds: 5
  successThreshold: 1
  failureThreshold: 3
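Note that an exec probe spawns a curl process inside the container on every check and requires curl to be present in the image. If the goal is to keep the probe cheap, an httpGet probe, where the kubelet performs the request itself, is a reasonable alternative; the sketch below keeps the same timing parameters and assumes the same /healthz endpoint.

# Sketch of an equivalent httpGet variant: the kubelet issues the request
# directly, so no curl binary or extra process is needed in the container.
readinessProbe:
  httpGet:
    path: /healthz
    port: 8080
  timeoutSeconds: 3
  periodSeconds: 5
  successThreshold: 1
  failureThreshold: 3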
Network Layer Issues
After deploying the above changes, occasional ReadTimeout, connection reset, and connection refused problems still occur at the network layer. Further investigation is required to determine the root cause, which will be discussed in a subsequent article.