My MachineConfigPool is … Stuck

My teammate was investigating an SSHD config change and hit a stuck MachineConfigPool. Here are some steps we followed to get it unstuck.

Steps

  1. Verify that the MachineConfigPool is stuck updating
❯ oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-0de63bfa1c0db0777031adddb3286fbc   False     True       True       3              0                   0                     3                      9d
worker   rendered-worker-38e4049eaf0b7fca848408378092e607   True      False      False      3              3                   3                     0                      9d
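With more than a couple of pools, it can help to filter the table down to the degraded ones. This is a minimal sketch that assumes the column layout shown above (DEGRADED in column 5, MACHINECOUNT in column 6, DEGRADEDMACHINECOUNT in column 9); the function name `degraded_pools` is made up for illustration.

```shell
# Print "<pool> <degraded>/<total>" for every pool whose DEGRADED column is True.
# Reads `oc get mcp` table output on stdin. The column positions are an
# assumption taken from the output above and may differ between versions.
degraded_pools() {
  awk '$1 != "NAME" && $5 == "True" { print $1, $9 "/" $6 }'
}

# Usage (needs a logged-in oc session):
#   oc get mcp | degraded_pools
```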
  1. Find the machine-config pods running on one of the nodes in the stuck pool (for instance, master-0)
❯ oc get pods -n openshift-machine-config-operator --field-selector spec.nodeName=master-0
NAME                          READY   STATUS    RESTARTS   AGE
machine-config-daemon-t8x8j   2/2     Running   2          35h
machine-config-server-kfx8n   1/1     Running   1          35h
  1. Check the machine-config-daemon logs for that node and note the name of the rendered-master config it is validating against
❯ oc logs pod/machine-config-daemon-t8x8j -n openshift-machine-config-operator
...
E0124 07:19:26.746977  780508 on_disk_validation.go:208] content mismatch for file "/etc/ssh/sshd_config" (-want +got):
  bytes.Join({
-       "\n#\t",
+       "#       ",
        "$OpenBSD: sshd_config,v 1.103 2018/04/09 20:41:22 tj Exp $\n\n# Th",
        "is is the sshd server system-wide configuration file.  See\n# ssh",
        ... // 1437 identical bytes
        "keys and .ssh/authorized_keys2\n# but this is overridden so insta",
        "llations will only check .ssh/authorized_keys\nAuthorizedKeysFile",
-       `       `,
+       "      ",
        ".ssh/authorized_keys\n\n#AuthorizedPrincipalsFile none\n\n#Authorize",
        "dKeysCommand none\n#AuthorizedKeysCommandUser nobody\n\n# For this ",
        ... // 2258 identical bytes
        "E LC_MEASUREMENT\nAcceptEnv LC_IDENTIFICATION LC_ALL LANGUAGE\nAcc",
...
+       "\n",
  }, "")
E0124 07:19:26.747042  780508 writer.go:200] Marking Degraded due to: unexpected on-disk state validating against rendered-master-0de63bfa1c0db0777031adddb3286fbc: content mismatch for file "/etc/ssh/sshd_config"
I0124 07:19:28.973484  780508 daemon.go:1248] Current+desired config: rendered-master-0de63bfa1c0db0777031adddb3286fbc
...
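The daemon log is long, so a filter that pulls out just the files failing validation can save some scrolling. This sketch matches only the two message shapes that appear in this post ("content mismatch for file" and "mode mismatch for file:"); `mismatched_files` is a hypothetical helper name.

```shell
# Read machine-config-daemon log (or `oc describe mcp`) text on stdin and print
# one "<kind> <path>" line per distinct mismatch, e.g. "content /etc/ssh/sshd_config".
# Only the message formats seen in this post are handled.
mismatched_files() {
  sed -nE 's/.*(content|mode) mismatch for file:? \\?"([^"\\]+).*/\1 \2/p' | sort -u
}

# Usage, with <pod> being the machine-config-daemon pod on the affected node:
#   oc logs pod/<pod> -n openshift-machine-config-operator | mismatched_files
```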
  1. This looks like a whitespace problem. Save the rendered config, then inspect the whitespace of the URL-decoded file contents, for example in vim with :set list.
❯ oc get mc rendered-master-0de63bfa1c0db0777031adddb3286fbc -o yaml > out.yaml

You may have to fix the whitespace.
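The file contents in the rendered config are stored as a percent-encoded data URL, so tabs and newlines are hard to see until it is decoded. Here is a rough sketch of a decoder using only sed and awk. It assumes the plain `data:,` form that appears in this post (not a base64 data URL) and ASCII content; `decode_data_url` is a made-up helper name.

```shell
# Decode a percent-encoded "data:," URL read from stdin.
# decode_data_url is a hypothetical helper; it assumes the plain
# percent-encoded form shown in this post, not a base64 data URL.
decode_data_url() {
  sed 's/^data:,//' | awk '
    BEGIN {
      # Lookup table from two uppercase hex digits to the byte they encode.
      for (i = 1; i < 256; i++) hex[sprintf("%02X", i)] = sprintf("%c", i)
    }
    {
      s = $0; out = ""
      # Replace each %XX escape, left to right, with its decoded character.
      while (match(s, /%[0-9A-Fa-f][0-9A-Fa-f]/)) {
        out = out substr(s, 1, RSTART - 1) hex[toupper(substr(s, RSTART + 1, 2))]
        s = substr(s, RSTART + 3)
      }
      printf "%s", out s
    }'
}

# Usage: make tabs and line ends visible, much like :set list in vim.
#   printf '%s\n' 'data:,%0A%23%09$OpenBSD' | decode_data_url | sed -n l
```

Piping the decoded text through `sed -n l` prints tabs as `\t` and marks each line end with `$`, which makes a tab-versus-spaces difference like the one in the log easy to spot.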

  1. If fixing the whitespace doesn’t resolve it, check the reasons the pool reports for the failure.
❯ oc describe mcp master

Message:
    Node master-0 is reporting: 
        "unexpected on-disk state validating against rendered-master-0de63bfa1c0db0777031adddb3286fbc: 
        mode mismatch for file: \"/etc/ssh/sshd_config\"; 
        expected: -rw-------/384/0600; received: -rw-r--r--/420/0644", 
        Node master-1 is reporting: "unexpected on-disk state validating 
        against rendered-master-0de63bfa1c0db0777031adddb3286fbc: content 
        mismatch for file \"/etc/ssh/sshd_config\"", Node master-2 is reporting:
        "unexpected on-disk state validating against 
        rendered-master-0de63bfa1c0db0777031adddb3286fbc: content mismatch for file 
        \"/etc/ssh/sshd_config\""

In this case, the files on the nodes had been edited by hand while preparing the ideal sshd_config, so the on-disk state no longer matched the rendered config and a forced update was needed.
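The mode mismatch reported for master-0 lists each mode three ways: symbolic, decimal, and octal (-rw-------/384/0600). As far as I know, Ignition JSON takes the file mode as a decimal integer, so converting between the two forms helps when comparing a MachineConfig with `ls -l` or `stat` output. A small sketch; both helper names are made up.

```shell
# Decimal mode (as in 384 or 420 above) to four-digit octal permission bits.
dec_to_octal() {
  printf '%04o\n' "$1"
}

# Octal permission bits (600, 0644, ...) to the decimal value.
# The leading 0 makes printf read the argument as an octal constant.
octal_to_dec() {
  printf '%d\n' "0$1"
}

# Usage:
#   dec_to_octal 384    # the expected mode from the message
#   octal_to_dec 0644   # the mode actually found on disk
```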

  1. Force the machine-config-daemon to rewrite the managed files by creating this file on the affected node itself (not on your workstation).
❯ touch /run/machine-config-daemon-force
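If you don't have a shell on the node, one common route is a debug pod. The sketch below wraps that in a function; `force_mcd_refresh` is a hypothetical name, and it assumes your user is allowed to run `oc debug` against nodes.

```shell
# Create the force file on a node through a debug pod.
# force_mcd_refresh is a hypothetical helper name; `oc debug node/...` with
# `chroot /host` is one common way to run a command on the node's filesystem.
force_mcd_refresh() {
  node="$1"
  oc debug "node/${node}" -- chroot /host touch /run/machine-config-daemon-force
}

# Usage:
#   force_mcd_refresh master-0
```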
  1. You should see the node's state change after it reboots. Here master-0 reports Done and the degraded count drops from 3 to 2.
Events:
  Type    Reason            Age    From                                    Message
  ----    ------            ----   ----                                    -------
  Normal  AnnotationChange  5m19s  machineconfigcontroller-nodecontroller  Node master-0 now has machineconfiguration.openshift.io/state=Done

  degradedMachineCount: 2
  machineCount: 3
  observedGeneration: 500
  readyMachineCount: 0
  unavailableMachineCount: 2
  updatedMachineCount: 0

If you need to select a file from the rendered config:

❯ oc get mc rendered-master-0de63bfa1c0db0777031adddb3286fbc -o yaml | yq -r '.spec.config.storage.files[] | select(.path == "/etc/ssh/sshd_config").contents.source'
data:,%0A%23%09$OpenBSD:%20sshd_config%2Cv%201.103
...
