Identifying Kernel Memory Usage Culprits

If you suspect a kernel memory leak, for example because slabtop --sort c shows high slab memory usage, you can use the following steps to identify the culprit. The steps boot the nodes with slub_debug=U, which enables allocation and free owner tracking for slab caches. (Thanks to ServerFault.)
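If slabtop is not available on the node, a rough equivalent of the "largest caches first" view can be derived from /proc/slabinfo. The following is a minimal sketch, not what slabtop does internally: it approximates each cache's size as num_objs * objsize, and top_slab_caches is a hypothetical helper name introduced here for illustration.

```shell
# Rank slab caches by approximate size in KiB (num_objs * objsize).
# Reads /proc/slabinfo-formatted text on stdin.
# $1 = how many caches to print (defaults to 5).
# top_slab_caches is a hypothetical helper, not part of any tool.
top_slab_caches() {
  awk '
    /^#/ || /^slabinfo/ { next }   # skip the two header lines
    NF >= 4 { printf "%d %s\n", ($3 * $4) / 1024, $1 }
  ' | sort -rn | head -n "${1:-5}"
}
```

On a node, this could be run as root with: cat /proc/slabinfo | top_slab_caches 10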

  1. Log in to OpenShift.
$ oc login
  2. Check that a MachineConfig named 99-master-kargs-slub does not already exist.
$ oc get mc 99-master-kargs-slub
  3. Create the MachineConfig that sets the slub_debug=U kernel argument. Note that it is assigned to the master role.
$ cat << EOF > 99-master-kargs-slub.yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: master
  name: 99-master-kargs-slub
spec:
  kernelArguments:
  - slub_debug=U
EOF
  4. Apply the kernel arguments MachineConfig.
$ oc apply -f 99-master-kargs-slub.yaml
machineconfig.machineconfiguration.openshift.io/99-master-kargs-slub created
  5. Wait until the master nodes are updated.
$ oc wait mcp/master --for condition=updated --timeout=25m
machineconfigpool.machineconfiguration.openshift.io/master condition met
  6. Once the nodes are back up, list the master nodes and confirm their status.
$ oc get nodes -l machineconfiguration.openshift.io/role=master
NAME                                                    STATUS   ROLES    AGE   VERSION
lon06-master-0.xip.io   Ready    master   30d   v1.23.5+3afdacb
lon06-master-1.xip.io   Ready    master   30d   v1.23.5+3afdacb
lon06-master-2.xip.io   Ready    master   30d   v1.23.5+3afdacb
  7. Connect to a master node and switch to the root user.
$ ssh core@lon06-master-0.xip.io
$ sudo su -
  8. Check the kmalloc-32 allocations, sorted by the allocation count in the first column.
$ cat /sys/kernel/slab/kmalloc-32/alloc_calls | sort -n | tail -n 5
   4334 iomap_page_create+0x80/0x190 age=0/654342/2594020 pid=1-39569 cpus=0-7
   5655 selinux_sk_alloc_security+0x5c/0xd0 age=916/1870136/2594937 pid=0-39217 cpus=0-7
  41908 __kernfs_new_node+0x70/0x2d0 age=406911/2326294/2594938 pid=0-38398 cpus=0-7
9969728 memcg_update_all_list_lrus+0x1bc/0x550 age=2564414/2567167/2594607 pid=1 cpus=0-7
19861376 __list_lru_init+0x2b8/0x480 age=406870/2007921/2594449 pid=1-38406 cpus=0-7

This shows that memcg_update_all_list_lrus, together with __list_lru_init, accounts for a very large number of kmalloc-32 allocations. The issue is fixed in a patch to the Linux kernel.
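To make the proportions in the alloc_calls output easier to read, the counts can be turned into percentages. The following is a minimal sketch, assuming the first field of each line is the allocation count and the second is the call site, as in the output above. alloc_call_share is a hypothetical helper name, and the percentages are relative to the lines passed in, not to the whole cache, unless the full file is piped through.

```shell
# Summarize alloc_calls-style lines read from stdin.
# Prints "<share>% <count> <call site>", largest count first.
# alloc_call_share is a hypothetical helper, not part of any tool.
alloc_call_share() {
  awk '
    NF >= 2 { count[$2] += $1; total += $1 }
    END {
      for (site in count)
        printf "%.1f%% %d %s\n", 100 * count[site] / total, count[site], site
    }
  ' | sort -k2,2nr
}
```

As root on the node, this could be run as: alloc_call_share < /sys/kernel/slab/kmalloc-32/alloc_calls | head -n 5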

References

  1. ServerFault: Debugging kmalloc-64 slab allocations / memory leak, https://serverfault.com/questions/1020241/debugging-kmalloc-64-slab-allocations-memory-leak
  2. Kmalloc Internals: Exploring Linux Kernel Memory Allocation, http://www.jikos.cz/jikos/Kmalloc_Internals.html
  3. https://stackoverflow.com/questions/20079767/what-is-different-functions-malloc-and-kmalloc
  4. How I investigated memory leaks in Go using pprof on a large codebase
  5. Using Go 1.10 new trace features to debug an integration test
  6. Kernel Memory Leak Detector
  7. go-slab – slab allocator in go
  8. Red Hat Customer Support Portal: Interpreting /proc/meminfo and free output for Red Hat Enterprise Linux
  9. Red Hat Customer Support Portal: Determine how much memory is being used on the system
  10. Red Hat Customer Support Portal: Determine how much memory and what kind of objects the kernel is allocating
