DRAFT: This is not a complete article. I haven’t yet fully tested and vetted the steps below; I will come back and update it.
Kubernetes offers a powerful enhancement to CPU resource management: the ability to distribute exclusive CPUs across NUMA nodes using the distribute-cpus-across-numa CPUManager policy option. This option, introduced by KEP-2902 and available without extra feature gates on Kubernetes v1.30, enables better performance and resource utilization on multi-NUMA systems by spreading workloads instead of concentrating them on a single node.
Non-Uniform Memory Access (NUMA) is a memory design used in modern multi-socket systems where each CPU socket has its own local memory. Accessing local memory is faster than accessing memory attached to another CPU. Therefore, NUMA-aware scheduling is crucial for performance-sensitive workloads.
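On a Linux host you can see the NUMA layout directly from sysfs (the node and CPU counts will of course vary by machine; numactl --hardware and lscpu show the same information, if installed):

```shell
# List the NUMA nodes the kernel exposes.
ls /sys/devices/system/node/ | grep '^node'

# Show which CPUs belong to each node, e.g. "0-3" and "4-7".
cat /sys/devices/system/node/node*/cpulist
```

A single-socket machine typically shows only node0; the distribution behavior discussed below only matters when there are two or more nodes.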
Traditionally, the CPUManager’s static policy allocates exclusive CPUs in a “packed” fashion, filling a single NUMA node first to keep memory accesses local and reduce latency. However, this can lead to resource contention and underutilization on systems with multiple NUMA nodes, particularly for:
- High-throughput applications like databases or analytics engines
- Multi-threaded workloads that benefit from parallelism
- NUMA-aware applications that manage memory locality explicitly
The new distribute-cpus-across-numa option spreads CPU allocations evenly across NUMA nodes, improving parallelism and overall system throughput for these workloads.
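The difference between the two strategies can be sketched with a toy model (plain bash, nothing Kubernetes-specific; the node layout and allocation logic here are simplified assumptions, not the CPUManager’s actual algorithm):

```shell
#!/usr/bin/env bash
# Toy illustration: satisfying a 4-CPU request on a machine
# with two NUMA nodes of 4 CPUs each.
node0=(0 1 2 3)   # CPU IDs on NUMA node 0
node1=(4 5 6 7)   # CPU IDs on NUMA node 1
request=4

# Packed: fill from node 0 first.
packed=("${node0[@]:0:request}")

# Distributed: take CPUs from each node in turn.
distributed=()
for ((i = 0; i < request / 2; i++)); do
  distributed+=("${node0[$i]}" "${node1[$i]}")
done

echo "packed:      ${packed[*]}"       # all four CPUs on node 0
echo "distributed: ${distributed[*]}"  # two CPUs per node
```

The packed allocation leaves node 1 idle, while the distributed allocation gives each node half of the work.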
Here is a step-by-step guide to enabling and using this option on an OpenShift cluster running Kubernetes v1.30+ (the commands use the oc CLI):
- Label the MachineConfigPool that contains the nodes on which the CPUManager should be enabled:
oc label machineconfigpool worker custom-kubelet=cpumanager-enabled
- Create a custom KubeletConfig that enables the static cpuManagerPolicy together with the distribute-cpus-across-numa policy option:
cat << EOF | oc apply -f -
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: cpumanager-enabled
spec:
  machineConfigPoolSelector:
    matchLabels:
      custom-kubelet: cpumanager-enabled
  kubeletConfig:
    cpuManagerPolicy: static
    cpuManagerPolicyOptions:
      distribute-cpus-across-numa: "true"
    cpuManagerReconcilePeriod: 5s
EOF
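For reference, on a cluster without OpenShift’s Machine Config Operator, the equivalent settings would go directly into the kubelet’s own configuration file (a sketch; the file path and the reserved-CPU choice below are assumptions to adapt to your environment):

```yaml
# Fragment of the kubelet's KubeletConfiguration
# (commonly /var/lib/kubelet/config.yaml).
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: static
cpuManagerPolicyOptions:
  distribute-cpus-across-numa: "true"
# The static policy requires reserving at least one CPU for system daemons.
reservedSystemCPUs: "0"
```

Note that changing cpuManagerPolicy on a node that previously ran with a different policy requires draining the node, deleting /var/lib/kubelet/cpu_manager_state, and restarting the kubelet.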
- Wait for the MachineConfigPool to finish rolling out the new configuration (oc get mcp) and for the kubelet on the affected nodes to restart.
- Create a Pod in the Guaranteed QoS class by specifying equal requests and limits for both CPU and memory (exclusive CPUs are only assigned to Guaranteed pods that request whole CPUs):
apiVersion: v1
kind: Pod
metadata:
  name: numa-aware-pod
spec:
  containers:
  - name: workload
    image: your-image
    resources:
      requests:
        cpu: "4"
        memory: "1Gi"
      limits:
        cpu: "4"
        memory: "1Gi"
The kubelet will now distribute the 4 exclusive CPUs across the available NUMA nodes instead of packing them onto one.
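To verify the pinning, every Linux process exposes its allowed CPU list in /proc, so you can check the assignment from inside the container (the snippet runs in any Linux shell; in the pod you would run it via oc exec numa-aware-pod -- grep Cpus_allowed_list /proc/1/status):

```shell
# Prints the CPUs this process may run on,
# e.g. "Cpus_allowed_list:  0-1,4-5" for a distributed 4-CPU allocation.
grep Cpus_allowed_list /proc/self/status
```

With the packed behavior you would instead expect a contiguous range from a single node, such as 0-3.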
To visualize the difference between the two behaviors, here is a conceptual illustration of a 4-CPU request on a machine with two NUMA nodes of 4 CPUs each:
Packed (default static policy):
NUMA Node 0: [CPU0*, CPU1*, CPU2*, CPU3*] ← all 4 CPUs assigned here
NUMA Node 1: [CPU4,  CPU5,  CPU6,  CPU7 ] ← idle
Distributed (distribute-cpus-across-numa):
NUMA Node 0: [CPU0*, CPU1*, CPU2,  CPU3 ]
NUMA Node 1: [CPU4*, CPU5*, CPU6,  CPU7 ] ← balanced: 2 CPUs per node (* = assigned)
This balance reduces memory-bandwidth contention on any single node and lets each group of threads work against its own local memory.
This enhancement gives Kubernetes administrators more control over CPU topology, enabling better performance tuning for complex workloads. It’s a great step forward in making Kubernetes more NUMA-aware and suitable for high-performance computing environments.