This guide provides a step-by-step process to diagnose and resolve high memory issues causing NextGen Gateway pods to crash in a Kubernetes environment. It includes commands to check the pod status, identify memory-related issues, and implement solutions to stabilize the pod.
Verify Memory Usage When a Pod Crashes Due to a Memory Issue
To verify memory usage in Kubernetes pods, make sure that the metrics server is enabled in the Kubernetes cluster. The kubectl top command retrieves a snapshot of the resource utilization of pods or nodes in your Kubernetes cluster.
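Before relying on kubectl top, it can help to confirm that the metrics API is actually being served. The following is a minimal sketch, assuming the metrics server is registered under the standard APIService name v1beta1.metrics.k8s.io; the function name metrics_api_available is hypothetical.

```shell
# Succeeds (exit status 0) when the metrics API reports Available=True.
# Assumption: the metrics server is registered as the APIService
# v1beta1.metrics.k8s.io, which is the usual name but may differ per cluster.
metrics_api_available() {
  status=$(kubectl get apiservice v1beta1.metrics.k8s.io \
    -o jsonpath='{.status.conditions[?(@.type=="Available")].status}' 2>/dev/null)
  [ "$status" = "True" ]
}

# Example:
#   metrics_api_available && echo "metrics API is available"
```

If the check fails, kubectl top is unlikely to return data until the metrics server is installed and healthy.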
Use the following command to check pod memory usage.
$ kubectl top pods
NAME           CPU(cores)   MEMORY(bytes)
nextgen-gw-0   48m          1375Mi
Use the following command to check node memory usage.
$ kubectl top nodes
NAME              CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
nextgen-gateway   189m         9%     3969Mi          49%
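When many pods are running, filtering the kubectl top output makes heavy consumers easier to spot. The sketch below is one possible way to do that; the function name pods_over_mem and the 1024 Mi threshold are hypothetical, and it assumes the MEMORY column uses the Mi suffix as in the sample output above.

```shell
# Print pods whose memory usage is at or above a threshold given in Mi.
# Reads `kubectl top pods --no-headers` output on stdin (NAME CPU MEMORY).
# Assumption: the MEMORY column ends in "Mi", as in the sample output.
pods_over_mem() {
  awk -v limit="$1" '{
    mem = $3
    sub(/Mi$/, "", mem)            # strip the unit so the value compares numerically
    if (mem + 0 >= limit) print $1, $3
  }'
}

# Example (the 1024 Mi threshold is a hypothetical value):
#   kubectl top pods --no-headers | pods_over_mem 1024
```

With the sample pod above, a threshold of 1024 would list nextgen-gw-0 at 1375Mi.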
NextGen Gateway Pod Crashed Due to High Memory Usage
The NextGen Gateway pod in a Kubernetes cluster crashes due to high memory usage.
Possible Causes
When a container in a pod exceeds its memory limit, its process is automatically killed to protect the node’s stability, and Kubernetes reports the termination reason as “OOMKilled” (Out of Memory Killed). This is particularly critical for the NextGen Gateway, as it may affect the stability and monitoring capabilities of the OpsRamp platform.
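An out-of-memory kill is delivered as SIGKILL (signal 9), and a process ended by a signal is conventionally reported with exit code 128 plus the signal number. The short sketch below reproduces that convention locally without Kubernetes; it is only an illustration of where the value 137 comes from, not a reproduction of an actual OOM event.

```shell
# A child shell sends itself SIGKILL (signal 9), the signal an OOM kill uses.
code=0
sh -c 'kill -9 $$' || code=$?

# The parent shell reports 128 + 9 for a process ended by signal 9.
echo "exit code: $code"
```

Running this prints "exit code: 137", the same value a container reports after being OOMKilled.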
Troubleshooting Steps
Follow these steps to diagnose and fix memory issues for the NextGen Gateway pod:
- Check the status of the Kubernetes objects to determine whether the pods are running.
- If a pod is restarting or crashing, check whether the restarts are caused by memory issues.
- Look for memory-related termination reasons in the pod's description and events (for example, the output of kubectl describe pod <pod-name>).
Sample output:
  vprobe:
    Container ID:   containerd://40c8585cf88dc7d0dd4e43560dc631ef559b0c92e6d5d429719a384aaea77777
    Image:          us-central1-docker.pkg.dev/opsramp-registry/gateway-cluster-images/vprobe:17.0.0
    Image ID:       us-central1-docker.pkg.dev/opsramp-registry/gateway-cluster-images/vprobe@sha256:8de1a98c3c14307fa4882c7e7422a1a4e4d507d2bbc454b53f905062b665e9d2
    Port:           <none>
    Host Port:      <none>
    State:          Running
      Started:      Mon, 29 Jan 2024 12:01:30 +0530
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Mon, 29 Jan 2024 12:00:42 +0530
      Finished:     Mon, 29 Jan 2024 12:01:29 +0530
    Ready:          True
    Restart Count:  1
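- Confirm the memory issue from the exit code.
  - Exit code 137 means the container was killed with SIGKILL (128 + 9). Together with the reason OOMKilled, it confirms that the pod is crashing due to a memory issue.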
- Fix the memory issue:
  - Decrease the load on the NextGen Gateway by limiting the number of metrics.
  - Adjust the memory limits for the NextGen Gateway accordingly.
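The steps above can be sketched as a few shell helpers. This is a minimal sketch under stated assumptions, not the product's documented procedure: the function names are hypothetical, the StatefulSet name nextgen-gw is inferred from the pod name nextgen-gw-0, and the 2Gi limit is an illustrative value only. The container name vprobe comes from the sample output.

```shell
# Print the name, last termination reason, and exit code of each container in a pod.
last_termination() {
  kubectl get pod "$1" -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.lastState.terminated.reason}{"\t"}{.lastState.terminated.exitCode}{"\n"}{end}'
}

# Succeeds when a reason/exit-code pair indicates an out-of-memory kill.
is_oom_kill() {
  [ "$1" = "OOMKilled" ] && [ "$2" -eq 137 ]
}

# Set the memory limit of one container in a StatefulSet.
# Assumption: the pod nextgen-gw-0 is managed by a StatefulSet named nextgen-gw.
raise_memory_limit() {
  kubectl set resources statefulset "$1" -c "$2" --limits="memory=$3"
}

# Example (2Gi is a hypothetical value; size it from the observed usage):
#   last_termination nextgen-gw-0
#   is_oom_kill OOMKilled 137 && raise_memory_limit nextgen-gw vprobe 2Gi
```

If the gateway is deployed through a chart or installer, a limit changed this way may be overwritten on the next upgrade, so the same value may also need to be set in that deployment's configuration.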