Problem Note 69251: You experience node failure in Kubernetes 1.22 due to pods being in a NotReady state
Overview
When you use Kubernetes 1.22.x, deleted or NotReady pods might not be cleaned up properly, which can cause node instability.
Example Scenario
When the issue occurs, you might see GRPC errors similar to the following in the events log when trying to start a pod:
Events:
  Type     Reason             Age                     From     Message
  ----     ------             ----                    ----     -------
  Warning  ContainerGCFailed  172m                    kubelet  rpc error: code = ResourceExhausted desc = grpc: trying to send message larger than max (16779817 vs. 16777216)
  Warning  ContainerGCFailed  171m                    kubelet  rpc error: code = ResourceExhausted desc = grpc: trying to send message larger than max (16785197 vs. 16777216)
  Normal   NodeNotReady       170m                    kubelet  Node node-name-01.example.com status is now: NodeNotReady
  Warning  ContainerGCFailed  2m11s (x169 over 170m)  kubelet  rpc error: code = ResourceExhausted desc = grpc: trying to send message larger than max (16787148 vs. 16777216)
Causes
The issue appears to be caused by the new implementation of podSyncStatus not setting a status for pod termination. As a result, the GarbageCollect cycle does not remove the pod's sandbox or log directories. After a large number of pods reach this state, the full list of pod names no longer fits in a single GRPC message, and the call fails. In the example above, the attempted messages (for example, 16779817 bytes) slightly exceed the 16777216-byte (16 MiB) maximum message size shown in the error.
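A rough way to check whether a node is accumulating stale sandboxes before it hits this limit is to count the pod sandboxes that are still tracked in the NotReady state. The following is a sketch only, and it assumes that crictl is installed and configured to talk to containerd on the node:

# Count pod sandboxes in the NotReady state; a steadily growing number
# suggests that garbage collection is not keeping up.
crictl pods --state NotReady -q | wc -l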
Prevention
If you are using Kubernetes 1.22.x but have not experienced the problem, you can still clean up your containerd sandboxes by using the Kubernetes debug tool, crictl.
The following command completely removes pods that are in a NotReady state:
crictl pods --state NotReady -o json | jq -r '.items[].id' | xargs -I% crictl rmp %
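Before removing anything, you might want to review what the command will delete. Assuming crictl is configured against the node's containerd socket, you can list the affected pod sandboxes first:

# List the NotReady pod sandboxes that the cleanup command above would remove.
crictl pods --state NotReady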
Fixes
If you are encountering errors similar to those in the Example Scenario above, complete these steps (a consolidated script sketch follows the list):
- Stop the containerd service:
sudo systemctl stop containerd.service
- Find your containerd library root:
grep -e "^root" /etc/containerd/config.toml | awk -F'"' '{print $2}'
- Delete all sandboxes:
rm -rf <containerd root>/io.containerd.grpc.v1.cri/sandboxes/*
- Delete the metadata database:
rm -rf <containerd root>/io.containerd.metadata.v1.bolt/meta.db
- Reboot the node. This action restarts the containerd service and rebuilds the metadata database, which should now be clear of any pods that were previously in the NotReady state.
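The steps above can also be combined into a single script. The following is a sketch only, based on the commands in this list; it assumes that the containerd root is set in /etc/containerd/config.toml and that you can run the commands as root. Review it before you run it, because it deletes containerd state and reboots the node.

#!/bin/sh
# Sketch of the fix steps above: stop containerd, remove sandbox and metadata
# state, and reboot so that containerd rebuilds its metadata database.

# Stop the containerd service so its state files are not in use.
sudo systemctl stop containerd.service

# Find the containerd library root (the 'root' setting in config.toml).
CONTAINERD_ROOT=$(grep -e "^root" /etc/containerd/config.toml | awk -F'"' '{print $2}')

# Guard against an empty value so that rm -rf cannot run against "/".
[ -n "${CONTAINERD_ROOT}" ] || { echo "containerd root not found" >&2; exit 1; }

# Delete all sandboxes and the metadata database.
sudo rm -rf "${CONTAINERD_ROOT}/io.containerd.grpc.v1.cri/sandboxes/"*
sudo rm -rf "${CONTAINERD_ROOT}/io.containerd.metadata.v1.bolt/meta.db"

# Reboot the node; containerd restarts and rebuilds the metadata database.
sudo reboot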
Operating System and Release Information
| Product Family | Product | System | Reported Release | Fixed Release* |
| SAS System | SAS Viya | Linux for x64 | Viya | |
* For software releases that are not yet generally available, the Fixed Release is the software release in which the problem is planned to be fixed.
Pods in the NotReady state might not be cleaned up properly with Kubernetes 1.22. Too many NotReady pods might result in a GRPC buffer overflow, which causes node instability.
| Type: | Problem Note |
| Priority: | medium |
| Date Modified: | 2022-06-16 13:48:30 |
| Date Created: | 2022-06-01 11:24:13 |