In OpenShift, the kube-scheduler binds a unit of work (a Pod) to a Node. The scheduler reads the work from a scheduling queue, retrieves the current state of the cluster, scores the work based on the scheduling rules (from the policy) and the cluster's state, and binds the Pod to the best-scoring Node.
These Pods are scheduled based on an instantaneous read of the policy and the environment, a best-estimate placement of the Pod on a Node at that point in time. Because clusters are constantly changing shape and context, that estimate goes stale, and there is a need to deschedule and schedule the Pod anew.
There are four actors in the Descheduler:
- User configures the KubeDescheduler resource
- Operator creates the Descheduler Deployment
- Descheduler runs on a set interval, re-evaluates the scheduled Pods against the Nodes and the Descheduler Policy, and requests an eviction for any Pod that should be removed.
- Pod is removed (unbound).

Thankfully, OpenShift has a Descheduler Operator that facilitates unbinding a Pod from a Node based on a cluster-wide configuration, the KubeDescheduler CustomResource. A single cluster has at most one KubeDescheduler, which must be named cluster (the name is fixed), and it configures one or more Descheduler Profiles.
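If the operator is already installed, you can inspect the cluster-wide resource directly (this assumes the default openshift-kube-descheduler-operator namespace):
oc get kubedescheduler cluster -n openshift-kube-descheduler-operator -o yaml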
Descheduler Profiles are predefined (available in the operator's profiles folder):

| DeschedulerProfile | Description |
| --- | --- |
| AffinityAndTaints | Evicts pods that violate node taints and affinity rules. |
| TopologyAndDuplicates | Spreads pods evenly among nodes based on topology constraints and evicts duplicate replicas on the same node. Cannot be used with SoftTopologyAndDuplicates. |
| SoftTopologyAndDuplicates | Spreads pods as above, but also considers pods with soft topology constraints. Cannot be used with TopologyAndDuplicates. |
| LifecycleAndUtilization | Balances pods based on node resource usage and evicts long-running pods. Cannot be used with DevPreviewLongLifecycle. |
| EvictPodsWithLocalStorage | Enables pods with local storage to be evicted by all other profiles. |
| EvictPodsWithPVC | Prevents pods with PVCs from being evicted by all other profiles. |
| DevPreviewLongLifecycle | Lifecycle management for long-running pods. Cannot be used with LifecycleAndUtilization. |
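Compatible profiles can be combined in the KubeDescheduler spec. For example (an illustrative snippet, not part of this article's setup), pairing AffinityAndTaints with EvictPodsWithLocalStorage makes pods using local storage eligible for the evictions AffinityAndTaints triggers:
profiles:
- AffinityAndTaints
- EvictPodsWithLocalStorage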
One or more DeschedulerProfiles must be specified, and there cannot be any duplicate entries. There are two possible mode values, Automatic and Predictive. In Predictive mode the descheduler only simulates evictions, so you have to check the descheduler Pod's logs to see which Pods would have been evicted.
The Descheduler Operator excludes the openshift-*, kube-system, and hypershift namespaces.
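The operator renders the selected profiles and these exclusions into a descheduler policy delivered to the descheduler Pod via a ConfigMap. To see exactly what was generated (the ConfigMap name can vary by release, so the name below is a placeholder), list the ConfigMaps in the operator namespace and dump the one holding the policy:
oc get configmaps -n openshift-kube-descheduler-operator
oc get configmap <policy-configmap-name> -n openshift-kube-descheduler-operator -o yaml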
Steps
1. Log in to your OpenShift cluster
oc login --token=sha256~1111-g --server=https://api..sslip.io:6443
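To confirm the session before proceeding, print the current user and the API endpoint:
oc whoami
oc whoami --show-server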
2. Create a Pod that indicates it is available for eviction using the annotation descheduler.alpha.kubernetes.io/evict: "true"; update the spec as needed for your environment (the example image below is for ppc64le nodes).
cat << EOF > pod.yaml
kind: Pod
apiVersion: v1
metadata:
  annotations:
    descheduler.alpha.kubernetes.io/evict: "true"
  name: demopod1
  labels:
    foo: bar
spec:
  containers:
  - name: pause
    image: docker.io/ibmcom/pause-ppc64le:3.1
EOF
oc apply -f pod.yaml
pod/demopod1 created
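To verify the annotation is in place, query it with jsonpath (the backslashes escape the dots in the annotation key):
oc get pod demopod1 -o jsonpath='{.metadata.annotations.descheduler\.alpha\.kubernetes\.io/evict}'
true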
3. Create the KubeDescheduler CR with a descheduling interval of 60 seconds and a Pod lifetime of 1m.
cat << EOF > kd.yaml
apiVersion: operator.openshift.io/v1
kind: KubeDescheduler
metadata:
  name: cluster
  namespace: openshift-kube-descheduler-operator
spec:
  logLevel: Normal
  mode: Predictive
  operatorLogLevel: Normal
  deschedulingIntervalSeconds: 60
  profileCustomizations:
    podLifetime: 1m0s
  observedConfig:
    servingInfo:
      cipherSuites:
      - TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256
      - TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
      - TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384
      - TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
      - TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256
      - TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256
      minTLSVersion: VersionTLS12
  profiles:
  - LifecycleAndUtilization
  managementState: Managed
EOF
oc apply -f kd.yaml
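Verify that the CR was accepted and that the operator rolled out the descheduler (the Deployment name here is inferred from the Pod names shown in the next step):
oc get kubedescheduler cluster -n openshift-kube-descheduler-operator
oc get deployment descheduler -n openshift-kube-descheduler-operator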
4. Get the Pods in the openshift-kube-descheduler-operator namespace
oc get pods -n openshift-kube-descheduler-operator
NAME READY STATUS RESTARTS AGE
descheduler-f479c5669-5ffxl 1/1 Running 0 2m7s
descheduler-operator-85fc6666cb-5dfr7 1/1 Running 0 27h
5. Check the logs for the descheduler Pod
oc -n openshift-kube-descheduler-operator logs descheduler-f479c5669-5ffxl
I0506 19:59:10.298440 1 pod_lifetime.go:110] "Evicted pod because it exceeded its lifetime" pod="minio-operator/console-7bc65f7dd9-q57lr" maxPodLifeTime=60
I0506 19:59:10.298500 1 evictions.go:158] "Evicted pod in dry run mode" pod="default/demopod1" reason="PodLifeTime"
I0506 19:59:10.298532 1 pod_lifetime.go:110] "Evicted pod because it exceeded its lifetime" pod="default/demopod1" maxPodLifeTime=60
I0506 19:59:10.298598 1 toomanyrestarts.go:90] "Processing node" node="master-0.rdr-rhop-.sslip.io"
I0506 19:59:10.299118 1 toomanyrestarts.go:90] "Processing node" node="master-1.rdr-rhop.sslip.io"
I0506 19:59:10.299575 1 toomanyrestarts.go:90] "Processing node" node="master-2.rdr-rhop.sslip.io"
I0506 19:59:10.300385 1 toomanyrestarts.go:90] "Processing node" node="worker-0.rdr-rhop.sslip.io"
I0506 19:59:10.300701 1 toomanyrestarts.go:90] "Processing node" node="worker-1.rdr-rhop.sslip.io"
I0506 19:59:10.301097 1 descheduler.go:287] "Number of evicted pods" totalEvicted=5
This article shows a simple use case for the Descheduler: running in Predictive mode, it performed a dry run and reported that it would evict five pods.
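Since mode: Predictive only simulates evictions, switching the same CR to Automatic makes the descheduler actually evict the Pods; one way is a merge patch (editing and re-applying kd.yaml works just as well):
oc patch kubedescheduler cluster -n openshift-kube-descheduler-operator --type merge -p '{"spec":{"mode":"Automatic"}}'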