Problem Note 69251: You experience node failure in Kubernetes 1.22 due to pods being in a NotReady state
Overview
When you use Kubernetes 1.22.x, deleted or NotReady pods might not be cleaned up properly, which can cause node instability.
Example Scenario
When the issue occurs, you might see GRPC errors similar to the following in the events log when trying to start a pod:
Events:
  Type     Reason             Age                     From     Message
  ----     ------             ----                    ----     -------
  Warning  ContainerGCFailed  172m                    kubelet  rpc error: code = ResourceExhausted desc = grpc: trying to send message larger than max (16779817 vs. 16777216)
  Warning  ContainerGCFailed  171m                    kubelet  rpc error: code = ResourceExhausted desc = grpc: trying to send message larger than max (16785197 vs. 16777216)
  Normal   NodeNotReady       170m                    kubelet  Node node-name-01.example.com status is now: NodeNotReady
  Warning  ContainerGCFailed  2m11s (x169 over 170m)  kubelet  rpc error: code = ResourceExhausted desc = grpc: trying to send message larger than max (16787148 vs. 16777216)
Causes
The issue appears to be caused by the new implementation of podSyncStatus not setting a status for pod termination. As a result, the GarbageCollect cycle does not remove the pod's sandbox or log directories. After a large number of pods reach this state, the full list of pod names no longer fits in a single GRPC message, and the call fails. In the example above, the attempted messages (for example, 16779817 bytes) slightly exceed the 16777216-byte (16 MiB) maximum message size shown in the error.
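A rough way to check whether a node is accumulating stale sandboxes before it hits this limit is to count the pod sandboxes that are still tracked in the NotReady state. The following is a sketch only, and it assumes that crictl is installed and configured to talk to containerd on the node:

# Count pod sandboxes in the NotReady state; a steadily growing number
# suggests that garbage collection is not keeping up.
crictl pods --state NotReady -q | wc -l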
Prevention
If you are using Kubernetes 1.22.x but have not experienced the problem, you can still clean up your containerd sandboxes by using the Kubernetes debug tool, crictl.
The following command completely removes pods that are in a NotReady state:
crictl pods --state NotReady -o json | jq -r '.items[].id' | xargs -I% crictl rmp %
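Before removing anything, you might want to review what the command will delete. Assuming crictl is configured against the node's containerd socket, you can list the affected pod sandboxes first:

# List the NotReady pod sandboxes that the cleanup command above would remove.
crictl pods --state NotReady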
Fixes
If you are encountering errors similar to those in the Example Scenario above, complete these steps (a consolidated script sketch follows the list):
- Stop the containerd service:
sudo systemctl stop containerd.service
- Find your containerd library root:
grep -e "^root" /etc/containerd/config.toml | awk -F'"' '{print $2}'
- Delete all sandboxes:
rm -rf <containerd root>/io.containerd.grpc.v1.cri/sandboxes/*
- Delete the metadata database:
rm -rf <containerd root>/io.containerd.metadata.v1.bolt/meta.db
- Reboot the node. This action restarts the containerd service and rebuilds the metadata database, which should now be clear of any pods that were previously in the NotReady state.
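The steps above can also be combined into a single script. The following is a sketch only, based on the commands in this list; it assumes that the containerd root is set in /etc/containerd/config.toml and that you can run the commands as root. Review it before you run it, because it deletes containerd state and reboots the node.

#!/bin/sh
# Sketch of the fix steps above: stop containerd, remove sandbox and metadata
# state, and reboot so that containerd rebuilds its metadata database.

# Stop the containerd service so its state files are not in use.
sudo systemctl stop containerd.service

# Find the containerd library root (the 'root' setting in config.toml).
CONTAINERD_ROOT=$(grep -e "^root" /etc/containerd/config.toml | awk -F'"' '{print $2}')

# Guard against an empty value so that rm -rf cannot run against "/".
[ -n "${CONTAINERD_ROOT}" ] || { echo "containerd root not found" >&2; exit 1; }

# Delete all sandboxes and the metadata database.
sudo rm -rf "${CONTAINERD_ROOT}/io.containerd.grpc.v1.cri/sandboxes/"*
sudo rm -rf "${CONTAINERD_ROOT}/io.containerd.metadata.v1.bolt/meta.db"

# Reboot the node; containerd restarts and rebuilds the metadata database.
sudo reboot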
Operating System and Release Information
| Product Family | Product | System | Reported Release | Fixed Release* |
| SAS System | SAS Viya | Linux for x64 | Viya | |
* For software releases that are not yet generally available, the Fixed Release is the software release in which the problem is planned to be fixed.
Pods in the NotReady state might not be cleaned up properly with Kubernetes 1.22. Too many NotReady pods might result in a GRPC buffer overflow, which causes node instability.
| Type: | Problem Note |
| Priority: | medium |
| Date Modified: | 2022-06-16 13:48:30 |
| Date Created: | 2022-06-01 11:24:13 |