I recently had to work with the Kubernetes Topology Manager and OpenShift. Here is a braindump on Topology Manager:
If the TopologyManager feature gate is enabled, then any active HintProviders are registered with the Topology Manager.
If the CPU Manager and its feature gate are enabled, then the CPU Manager can be used to help workloads that are sensitive to CPU throttling, context switches, and cache misses, that require hyperthreads on the same physical CPU core, that need low latency, or that benefit from shared processor resources. The manager has two policies: none, which registers a NOP hint provider, and static, which locks the container to an exclusive set of CPUs.
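Under the static policy, a container only receives exclusive CPUs when its pod is in the Guaranteed QoS class and requests whole CPUs; a minimal sketch (the name and image are placeholders):
file: cpu-pinned-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: cpu-pinned-example
spec:
  containers:
  - name: worker
    image: registry.example.com/worker:latest   # placeholder image
    resources:
      requests:
        cpu: "2"        # whole CPUs with requests == limits -> Guaranteed QoS
        memory: 1Gi
      limits:
        cpu: "2"        # the static CPU Manager pins this container to two exclusive CPUs
        memory: 1Gi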
If the Memory Manager and its feature gate are enabled, then the Memory Manager can operate independently of the CPU Manager, e.g., to allocate HugePages or guaranteed memory.
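A sketch of enabling it, assuming a single NUMA node and that the reserved size lines up with the node's systemReserved, kubeReserved, and hard-eviction memory settings:
file: memorymanager-fragment.yaml
# fragment of a KubeletConfiguration (or the kubeletConfig stanza of an OpenShift KubeletConfig)
memoryManagerPolicy: Static
reservedMemory:
- numaNode: 0
  limits:
    memory: 1100Mi   # assumption: equals systemReserved + kubeReserved + the memory eviction threshold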
If Device Plugins are enabled, then the Device Manager can allocate devices alongside NUMA node resources (e.g., SR-IOV NICs). This may be used independently of the typical CPU/memory management for GPUs and other machine devices.
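Devices reach a container as extended resource requests; the resource name below is a hypothetical example of what a device plugin (or an SR-IOV policy) might advertise, and the pod name and image are placeholders:
file: device-consumer-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: device-consumer
spec:
  containers:
  - name: app
    image: registry.example.com/app:latest   # placeholder image
    resources:
      requests:
        openshift.io/sriovnic: "1"   # hypothetical resource name advertised by the device plugin
        cpu: "2"
        memory: 1Gi
      limits:
        openshift.io/sriovnic: "1"
        cpu: "2"
        memory: 1Gi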
Generally, these are all used together to generate a bitmask of NUMA node affinities, which the Topology Manager uses to admit the pod under a best-effort, restricted, or single-numa-node policy. The policy determines how strictly the hint providers' NUMA affinities must align before the pod is admitted.
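Where the policy lands in configuration: on plain Kubernetes it is a kubelet setting, and on OpenShift it goes into the kubeletConfig stanza of a KubeletConfig like the one later in this post. A sketch of the relevant fragment:
file: topologymanager-fragment.yaml
# fragment of a KubeletConfiguration / OpenShift kubeletConfig stanza
cpuManagerPolicy: static                  # exclusive CPUs are what make the NUMA hints meaningful
topologyManagerPolicy: single-numa-node   # or best-effort / restricted
topologyManagerScope: container           # pod scope is also available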
An important limitation is that the maximum number of NUMA nodes is hard-coded to 8. When there are more than eight NUMA nodes, the Topology Manager errors out rather than attempting an assignment. The reason is state explosion: the number of possible NUMA affinity bitmasks grows exponentially with the node count.
- Check the worker node's CPU layout: if NUMA node(s) returns 1, it's a single NUMA node; if it returns 2 or more, it's a multi-NUMA-node system.
sh-4.4# lscpu | grep 'NUMA node(s)'
NUMA node(s): 1
The kubernetes/enhancements repo contains great detail on the flows and weaknesses of the TopologyManager.
To enable the Topology Manager, one uses Feature Gates:
- Kubernetes: TopologyManager
- Kubernetes: CPUManager
- Kubernetes: MemoryManager
- Kubernetes: DevicePlugins
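On upstream Kubernetes these gates can be set in the kubelet's configuration file; the following is a minimal sketch, keeping in mind that on newer releases most of these gates are on by default or have graduated, so setting them explicitly may be unnecessary:
file: kubelet-config-fragment.yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  TopologyManager: true
  CPUManager: true
  MemoryManager: true
  DevicePlugins: true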
OpenShift, meanwhile, prefers the FeatureSet LatencySensitive:
- Via FeatureGate
$ oc patch featuregate cluster -p '{"spec": {"featureSet": "LatencySensitive"}}' --type merge
This turns on the basic TopologyManager feature gate in /etc/kubernetes/kubelet.conf:
"featureGates": {
"APIPriorityAndFairness": true,
"CSIMigrationAzureFile": false,
"CSIMigrationvSphere": false,
"DownwardAPIHugePages": true,
"RotateKubeletServerCertificate": true,
"TopologyManager": true
},
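A quick way to verify what a node actually rendered (a sketch; substitute a real node name) is to read the file through a debug pod:
$ oc debug node/<node-name> -- chroot /host cat /etc/kubernetes/kubelet.conf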
- Create a custom KubeletConfig; this allows targeted TopologyManager feature enablement.
file: cpumanager-kubeletconfig.yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: cpumanager-enabled
spec:
  machineConfigPoolSelector:
    matchLabels:
      custom-kubelet: cpumanager-enabled
  kubeletConfig:
    cpuManagerPolicy: static
    cpuManagerReconcilePeriod: 5s
$ oc create -f cpumanager-kubeletconfig.yaml
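Note that the machineConfigPoolSelector above only matches pools carrying that label, so the target MachineConfigPool needs to be labeled as well (assuming the worker pool here):
$ oc label machineconfigpool worker custom-kubelet=cpumanager-enabled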
Net: these managers can be used independently of each other, but they should be turned on at the same time to maximize the benefits.
There are some examples and test cases out there for Kubernetes and OpenShift:
- Red Hat Systems Engineering Team test cases for the Performance Addon Operator (now the Cluster Node Tuning Operator). These are the clearest tests and apply directly to the Topology Manager.
- Kube Test Cases
- Topology Manager
- CPU Manager
- Device Plugin: if you already do SR-IOV testing, this should be implicitly covered.
- Memory Manager
- Test Cases Matrix from Kubernetes PR #83481
One of the best examples is k8stopologyawareschedwg/sample-device-plugin.
Tools to know about
- GitHub: numalign (amd64): you can download it from the releases. In the fork prb112/numalign, I added ppc64le to the build.
- numactl and numastat are superbly helpful for seeing the topology spread on a node (link to a handy PDF on NUMA). I've been starting up a Fedora container with numactl and numastat installed.
Final note: I had written down that Fedora is a great combination with taskset and numactl if you copy in the binaries. I think I used Fedora 35/36 as a container (link). A minimal way to spin one up is sketched below.
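This is a sketch of that container, assuming podman is available; on Fedora, numastat ships in the numactl package, and the tag is whichever Fedora release you prefer:
$ podman run -it --rm --privileged --pid=host fedora:36 bash
# dnf install -y numactl
# numactl --hardware
# numastat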
Yes, HugePages too: I built a HugePages-hungry container (Hugepages). I also looked at hugepages_tests.go and the test plan.
When it came down to it, I used my hungry container with the example below.
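For reference, the consuming pod looks roughly like this sketch, assuming 2Mi pages are pre-allocated on the node; the name and image stand in for my HugePages-hungry container:
file: hugepages-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: hugepages-example
spec:
  containers:
  - name: hungry
    image: registry.example.com/hugepages-hungry:latest   # placeholder for the HugePages-hungry image
    volumeMounts:
    - mountPath: /hugepages
      name: hugepage
    resources:
      requests:
        hugepages-2Mi: 512Mi
        memory: 1Gi
        cpu: "1"
      limits:
        hugepages-2Mi: 512Mi
        memory: 1Gi
        cpu: "1"
  volumes:
  - name: hugepage
    emptyDir:
      medium: HugePages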
I hope this helps others as they start to work with Topology Manager.
References
Red Hat
- Red Hat Topology Aware Scheduling in Kubernetes Part 1: The High Level Business Case
- Red Hat Topology Awareness in Kubernetes Part 2: Don’t we already have a Topology Manager?
OpenShift
- OpenShift 4.11: Using the Topology Manager
- OpenShift 4.11: Using device plug-ins to access external resources with pods
- OpenShift 4.11: Using Device Manager to make devices available to nodes
- OpenShift 4.11: About Single Root I/O Virtualization (SR-IOV) hardware networks
- OpenShift 4.11: Adding a pod to an SR-IOV additional network
- OpenShift 4.11: Using CPU Manager
Kubernetes
- Kubernetes: Topology Manager Blog
- Feature Highlight: CPU Manager
- Feature: Utilizing the NUMA-aware Memory Manager
Kubernetes Enhancement
- KEP-693: Node Topology Manager e2e tests: Link
- KEP-2625: CPU Manager e2e tests: Link
- KEP-1769: Memory Manager. Source: Link. PR: Link