In OpenShift, the kube-scheduler binds a unit of work (a Pod) to a Node. The scheduler reads the work from a scheduling queue, retrieves the current state of the cluster, scores the work based on the scheduling rules (from the policy) and the cluster's state, and binds the Pod to the best-scoring Node.
These Pods are scheduled based on an instantaneous read of the policy and the environment, a best-estimate placement of the Pod on a Node at that point in time. Because clusters are constantly changing shape and context, that estimate goes stale, and there is a need to deschedule and schedule the Pod anew.
There are four actors in the Descheduler:
- User configures the KubeDescheduler resource
- Operator creates the Descheduler Deployment
- Descheduler runs on a set interval, re-evaluates the scheduled Pods against the Nodes and the Descheduler Policy, and requests an eviction for any Pod that should be removed.
- Pod is removed (unbound).

Thankfully, OpenShift has a Descheduler Operator that facilitates unbinding a Pod from a Node based on a cluster-wide configuration, the KubeDescheduler CustomResource. A single cluster has at most one KubeDescheduler, which must be named cluster (the name is fixed), and it configures one or more Descheduler Profiles.
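If the operator is already installed, you can inspect the cluster-wide resource directly (this assumes the default openshift-kube-descheduler-operator namespace):
oc get kubedescheduler cluster -n openshift-kube-descheduler-operator -o yaml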
Descheduler Profiles are predefined (available in the operator's profiles folder):

| DeschedulerProfile | Description |
| --- | --- |
| AffinityAndTaints | Evicts pods that violate node taints and affinity rules. |
| TopologyAndDuplicates | Spreads pods evenly among nodes based on topology constraints and evicts duplicate replicas on the same node. Cannot be used with SoftTopologyAndDuplicates. |
| SoftTopologyAndDuplicates | Spreads pods as above, but also considers pods with soft topology constraints. Cannot be used with TopologyAndDuplicates. |
| LifecycleAndUtilization | Balances pods based on node resource usage and evicts long-running pods. Cannot be used with DevPreviewLongLifecycle. |
| EvictPodsWithLocalStorage | Enables pods with local storage to be evicted by all other profiles. |
| EvictPodsWithPVC | Prevents pods with PVCs from being evicted by all other profiles. |
| DevPreviewLongLifecycle | Lifecycle management for long-running pods. Cannot be used with LifecycleAndUtilization. |
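Compatible profiles can be combined in the KubeDescheduler spec. For example (an illustrative snippet, not part of this article's setup), pairing AffinityAndTaints with EvictPodsWithLocalStorage makes pods using local storage eligible for the evictions AffinityAndTaints triggers:
profiles:
- AffinityAndTaints
- EvictPodsWithLocalStorage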
One or more DeschedulerProfiles must be specified, and there cannot be any duplicate entries. There are two possible mode values, Automatic and Predictive. In Predictive mode the descheduler only simulates evictions, so you have to check the descheduler Pod's logs to see which Pods would have been evicted.
The Descheduler Operator excludes the openshift-*, kube-system, and hypershift namespaces.
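The operator renders the selected profiles and these exclusions into a descheduler policy delivered to the descheduler Pod via a ConfigMap. To see exactly what was generated (the ConfigMap name can vary by release, so the name below is a placeholder), list the ConfigMaps in the operator namespace and dump the one holding the policy:
oc get configmaps -n openshift-kube-descheduler-operator
oc get configmap <policy-configmap-name> -n openshift-kube-descheduler-operator -o yaml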
Steps
1. Log in to your OpenShift cluster
oc login --token=sha256~1111-g --server=https://api..sslip.io:6443
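To confirm the session before proceeding, print the current user and the API endpoint:
oc whoami
oc whoami --show-server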
2. Create a Pod that indicates it is available for eviction using the annotation descheduler.alpha.kubernetes.io/evict: "true"; update the spec as needed for your environment (the example image below is for ppc64le nodes).
cat << EOF > pod.yaml
kind: Pod
apiVersion: v1
metadata:
  annotations:
    descheduler.alpha.kubernetes.io/evict: "true"
  name: demopod1
  labels:
    foo: bar
spec:
  containers:
  - name: pause
    image: docker.io/ibmcom/pause-ppc64le:3.1
EOF
oc apply -f pod.yaml
pod/demopod1 created
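To verify the annotation is in place, query it with jsonpath (the backslashes escape the dots in the annotation key):
oc get pod demopod1 -o jsonpath='{.metadata.annotations.descheduler\.alpha\.kubernetes\.io/evict}'
true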
3. Create the KubeDescheduler CR with a descheduling interval of 60 seconds and a Pod lifetime of 1m.
cat << EOF > kd.yaml
apiVersion: operator.openshift.io/v1
kind: KubeDescheduler
metadata:
  name: cluster
  namespace: openshift-kube-descheduler-operator
spec:
  logLevel: Normal
  mode: Predictive
  operatorLogLevel: Normal
  deschedulingIntervalSeconds: 60
  profileCustomizations:
    podLifetime: 1m0s
  observedConfig:
    servingInfo:
      cipherSuites:
      - TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256
      - TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
      - TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384
      - TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
      - TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256
      - TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256
      minTLSVersion: VersionTLS12
  profiles:
  - LifecycleAndUtilization
  managementState: Managed
EOF
oc apply -f kd.yaml
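Verify that the CR was accepted and that the operator rolled out the descheduler (the Deployment name here is inferred from the Pod names shown in the next step):
oc get kubedescheduler cluster -n openshift-kube-descheduler-operator
oc get deployment descheduler -n openshift-kube-descheduler-operator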
4. Get the Pods in the openshift-kube-descheduler-operator namespace
oc get pods -n openshift-kube-descheduler-operator
NAME READY STATUS RESTARTS AGE
descheduler-f479c5669-5ffxl 1/1 Running 0 2m7s
descheduler-operator-85fc6666cb-5dfr7 1/1 Running 0 27h
5. Check the logs for the descheduler Pod
oc -n openshift-kube-descheduler-operator logs descheduler-f479c5669-5ffxl
I0506 19:59:10.298440 1 pod_lifetime.go:110] "Evicted pod because it exceeded its lifetime" pod="minio-operator/console-7bc65f7dd9-q57lr" maxPodLifeTime=60
I0506 19:59:10.298500 1 evictions.go:158] "Evicted pod in dry run mode" pod="default/demopod1" reason="PodLifeTime"
I0506 19:59:10.298532 1 pod_lifetime.go:110] "Evicted pod because it exceeded its lifetime" pod="default/demopod1" maxPodLifeTime=60
I0506 19:59:10.298598 1 toomanyrestarts.go:90] "Processing node" node="master-0.rdr-rhop-.sslip.io"
I0506 19:59:10.299118 1 toomanyrestarts.go:90] "Processing node" node="master-1.rdr-rhop.sslip.io"
I0506 19:59:10.299575 1 toomanyrestarts.go:90] "Processing node" node="master-2.rdr-rhop.sslip.io"
I0506 19:59:10.300385 1 toomanyrestarts.go:90] "Processing node" node="worker-0.rdr-rhop.sslip.io"
I0506 19:59:10.300701 1 toomanyrestarts.go:90] "Processing node" node="worker-1.rdr-rhop.sslip.io"
I0506 19:59:10.301097 1 descheduler.go:287] "Number of evicted pods" totalEvicted=5
This article shows a simple use case for the Descheduler: running in Predictive mode, it performed a dry run and reported that it would evict five pods.
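Since mode: Predictive only simulates evictions, switching the same CR to Automatic makes the descheduler actually evict the Pods; one way is a merge patch (editing and re-applying kd.yaml works just as well):
oc patch kubedescheduler cluster -n openshift-kube-descheduler-operator --type merge -p '{"spec":{"mode":"Automatic"}}'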