HTTP 502 Server Error: Troubleshooting and Solutions

The HTTP 502 Bad Gateway response code indicates that a server, acting as a gateway or proxy, received an invalid response from the upstream server. In our case, this pointed to a problem with our Google Cloud global load balancer.

Issue Identification

Reviewing the logs showed that the load balancer had no available backends to forward the request to. The relevant error entry is as follows:

{
  "resource": {
    "type": "http_load_balancer",
    "labels": {
      "forwarding_rule_name": "global-load-balancer",
      "url_map_name": "global"
    }
  },
  "httpRequest": {
    "requestUrl": "https://example.com/"
  },
  "jsonPayload": {
    "statusDetails": "failed_to_pick_backend"
  }
}

The statusDetails value failed_to_pick_backend confirms that the load balancer failed to select a backend, implying that none of the backends were available at the time.
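
To surface these entries in Cloud Logging, a filter along the following lines can be used; the resource type and field name match the entry above:

resource.type="http_load_balancer"
jsonPayload.statusDetails="failed_to_pick_backend"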

Environment Overview

We use a GKE (Google Kubernetes Engine) cluster to provide the backend service, so the unavailability means the load balancer could not find a ready instance to serve the request.

Readiness Probe Failure

In Kubernetes, the readinessProbe determines whether a pod is ready to receive traffic; pods that fail it are removed from the pool of available endpoints. The logs suggest that, at that moment, all of our pods had failed the readiness probe.
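
As a minimal sketch, a readiness probe is declared on the container spec. The /healthz path and port 8080 below are illustrative placeholders, consistent with the probes shown later in this post:

readinessProbe:
  httpGet:
    path: /healthz  # illustrative health endpoint
    port: 8080
  periodSeconds: 5  # how often the kubelet checks readiness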

Potential Causes

  1. Pods Died but Not Restarted: This can be mitigated with a livenessProbe.
  2. All Probes Failed Simultaneously: This could happen if all pods are on the same node and it terminates.
  3. Pods Were Too Busy for the Readiness Probe: The pods might have been overwhelmed by traffic and failed the readiness probe not because the service was broken, but because they were too busy to respond in time.

Solutions

Liveness Probe

To address the first cause, we implement a livenessProbe so that the kubelet restarts containers that have died:

livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 3  # wait 3s after container start before the first check
  periodSeconds: 3        # probe every 3 seconds
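
Since failureThreshold defaults to 3, a container that fails three consecutive checks (roughly nine seconds with this configuration) is killed and restarted by the kubelet.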

Anti-Affinity

To prevent all pods from being on the same node, we use the podAntiAffinity feature:

spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app            # match other pods labeled app=store
            operator: In
            values:
            - store
        topologyKey: kubernetes.io/hostname  # at most one matching pod per node
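
Note that requiredDuringSchedulingIgnoredDuringExecution is a hard constraint: if the cluster has fewer schedulable nodes than replicas, the surplus pods will stay Pending. preferredDuringSchedulingIgnoredDuringExecution is the softer alternative when that trade-off is unacceptable.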

Readiness Probe Optimization

For the third issue, we keep the readiness probe lightweight and tune its timing so that a temporarily busy pod is not marked unready too aggressively:

readinessProbe:
  exec:
    command:
    - curl
    - -s
    - --fail              # non-2xx responses make curl exit non-zero, failing the probe
    - http://localhost:8080/healthz
  timeoutSeconds: 3       # give a busy pod up to 3 seconds to answer
  periodSeconds: 5        # probe every 5 seconds
  successThreshold: 1     # one success marks the pod ready again
  failureThreshold: 3     # tolerate up to 3 consecutive failures before marking unready
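
One caveat: an exec probe forks a curl process inside the container on every check, which adds its own overhead on an already busy pod. If the health endpoint speaks plain HTTP, an httpGet probe handled directly by the kubelet, like the liveness probe above, is cheaper.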

Network Layer Issues

After deploying the above changes, occasional read timeout, connection reset, and connection refused errors still occurred at the network layer. Further investigation is required to determine the root cause, which will be covered in a subsequent article.

This post is licensed under CC BY 4.0 by the author.