Kafka, Zookeeper… and Kerberos

My team runs a Kafka service for data ingestion, we ran across a rare timeout when our main Key Distribution Center (KDC) went down. When the zookeeper service restarted, zookeeper worked flawlessly. I checked the services with the zookeeper four-letter commands. However, the kafka-broker/zookeeper startup and authentication failed and the brokers went down.

We checked each system with the following:

echo ruok | nc localhost 2181
iamok

echo stat | nc localhost 2181
Zookeeper version: 3.4.6-1569965, built on XXXXXX
Clients:
/10.10.10.10:3888[1](queued=0,recved=95261,sent=95261)

Latency min/avg/max: 0/0/316
Received: 1
Sent: 1
Connections: 1
Outstanding: 0
Zxid: 0x2100000000
Mode: follower
Node count: 200

We checked zookeeper with Kerberos/JAAS using the shell.

export JVMFLAGS="-Djava.security.auth.login.config=/usr/iop/current/kafka-broker/config/kafka_jaas.conf"

zookeeper-client -server `hostname --long`:2181 ls /

You’ll see a failover to the secondary server after 90 seconds and the final list. This indicates that the server finally fails over the secondary KDC.

[root@kafka-1 ~]# export JVMFLAGS="-Djava.security.auth.login.config=/usr/iop/current/kafka-broker/config/kafka_jaas.conf"
[root@kafka-1 ~]# time zookeeper-client -server `hostname --long`:2181 ls /
Connecting to kafka-1:2181
2019-03-12 17:31:44,765 - INFO [main:Environment@100] - Client environment:zookeeper.version=3.4.6-IBM_4--1, built on 06/17/2016 01:58 GMT
2019-03-12 17:31:44,767 - INFO [main:Environment@100] - Client environment:host.name=kafka-1
2019-03-12 17:31:44,767 - INFO [main:Environment@100] - Client environment:java.version=1.8.0_77
2019-03-12 17:31:44,769 - INFO [main:Environment@100] - Client environment:java.vendor=Oracle Corporation
2019-03-12 17:31:44,769 - INFO [main:Environment@100] - Client environment:java.home=/usr/jdk64/java-1.8.0-openjdk/jre
2019-03-12 17:31:44,769 - INFO [main:Environment@100] - Client environment:java.class.path=/usr/iop/4.2.0.0/zookeeper/bin/../build/classes:/usr/iop/4.2.0.0/zookeeper/bin/../build/lib/*.jar:/usr/iop/4.2.0.0/zookeeper/bin/../lib/slf4j-log4j12-1.6.1.jar:/usr/iop/4.2.0.0/zookeeper/bin/../lib/slf4j-api-1.6.1.jar:/usr/iop/4.2.0.0/zookeeper/bin/../lib/netty-3.7.0.Final.jar:/usr/iop/4.2.0.0/zookeeper/bin/../lib/log4j-1.2.17.jar:/usr/iop/4.2.0.0/zookeeper/bin/../lib/jline-0.9.94.jar:/usr/iop/4.2.0.0/zookeeper/bin/../zookeeper-3.4.6_IBM_4.jar:/usr/iop/4.2.0.0/zookeeper/bin/../src/java/lib/*.jar:/usr/iop/4.2.0.0/zookeeper/conf::/usr/iop/4.2.0.0/zookeeper/conf:/usr/iop/4.2.0.0/zookeeper/zookeeper-3.4.6_IBM_4.jar:/usr/iop/4.2.0.0/zookeeper/zookeeper.jar:/usr/iop/4.2.0.0/zookeeper/lib/jline-0.9.94.jar:/usr/iop/4.2.0.0/zookeeper/lib/log4j-1.2.17.jar:/usr/iop/4.2.0.0/zookeeper/lib/netty-3.7.0.Final.jar:/usr/iop/4.2.0.0/zookeeper/lib/slf4j-api-1.6.1.jar:/usr/iop/4.2.0.0/zookeeper/lib/slf4j-log4j12-1.6.1.jar:/usr/share/zookeeper/*
2019-03-12 17:31:44,769 - INFO [main:Environment@100] - Client environment:java.library.path=/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib
2019-03-12 17:31:44,769 - INFO [main:Environment@100] - Client environment:java.io.tmpdir=/tmp
2019-03-12 17:31:44,770 - INFO [main:Environment@100] - Client environment:java.compiler=<NA>
2019-03-12 17:31:44,770 - INFO [main:Environment@100] - Client environment:os.name=Linux
2019-03-12 17:31:44,770 - INFO [main:Environment@100] - Client environment:os.arch=amd64
2019-03-12 17:31:44,770 - INFO [main:Environment@100] - Client environment:os.version=3.10.0-514.21.1.el7.x86_64
2019-03-12 17:31:44,770 - INFO [main:Environment@100] - Client environment:user.name=root
2019-03-12 17:31:44,770 - INFO [main:Environment@100] - Client environment:user.home=/root
2019-03-12 17:31:44,770 - INFO [main:Environment@100] - Client environment:user.dir=/root
2019-03-12 17:31:44,771 - INFO [main:ZooKeeper@438] - Initiating client connection, connectString=kafka-1:2181 sessionTimeout=30000 watcher=org.apache.zookeeper.ZooKeeperMain$MyWatcher@68de145
2019-03-12 17:31:44,875 - INFO [main-SendThread(kafka-1:2181):Login@327] - successfully logged in.
2019-03-12 17:31:44,878 - INFO [Thread-0:Login$1@156] - TGT refresh thread started.
2019-03-12 17:31:44,882 - INFO [main-SendThread(kafka-1:2181):ZooKeeperSaslClient$1@285] - Client will use GSSAPI as SASL mechanism.
2019-03-12 17:31:44,892 - INFO [Thread-0:Login@335] - TGT valid starting at: Tue Mar 12 17:31:44 UTC 2019
2019-03-12 17:31:44,892 - INFO [Thread-0:Login@336] - TGT expires: Wed Mar 13 17:31:44 UTC 2019
2019-03-12 17:31:44,892 - INFO [Thread-0:Login$1@210] - TGT refresh sleeping until: Wed Mar 13 12:45:17 UTC 2019
2019-03-12 17:31:44,894 - INFO [main-SendThread(kafka-1:2181):ClientCnxn$SendThread@975] - Opening socket connection to server kafka-1/192.168.1.1:2181. Will attempt to SASL-authenticate using Login Context section 'Client'
2019-03-12 17:31:44,952 - INFO [main-SendThread(kafka-1:2181):ClientCnxn$SendThread@852] - Socket connection established to kafka-1/192.168.1.1:2181, initiating session
2019-03-12 17:31:44,966 - INFO [main-SendThread(kafka-1:2181):ClientCnxn$SendThread@1235] - Session establishment complete on server kafka-1/192.168.1.1:2181, sessionid = 0x16972ce2b3f002b, negotiated timeout = 30000

WATCHER::

WatchedEvent state:SyncConnected type:None path:null

WATCHER::

WatchedEvent state:SaslAuthenticated type:None path:null
[controller_epoch, controller, brokers, zookeeper, kafka-acl, kafka-acl-changes, admin, isr_change_notification, consumers, config]

real 0m1.328s
user 0m0.573s
sys 0m0.102s

We removed the down KDC in our /etc/krb5.conf file. (We eventually added it back when the server was restarted). When I executed the same command as above, we were able to return the system to operation and reduce the time to get a ticket and authorize our services on startup.

019-03-12 17:33:09,364 - INFO [main:ZooKeeper@438] - Initiating client connection, connectString=kafka-1:2181 sessionTimeout=30000 watcher=org.apache.zookeeper.ZooKeeperMain$MyWatcher@68de145
2019-03-12 17:33:09,460 - INFO [main-SendThread(kafka-1:2181):Login@327] - successfully logged in.
2019-03-12 17:33:09,462 - INFO [Thread-0:Login$1@156] - TGT refresh thread started.
2019-03-12 17:33:09,465 - INFO [main-SendThread(kafka-1:2181):ZooKeeperSaslClient$1@285] - Client will use GSSAPI as SASL mechanism.
2019-03-12 17:33:09,477 - INFO [Thread-0:Login@335] - TGT valid starting at: Tue Mar 12 17:33:09 UTC 2019
2019-03-12 17:33:09,478 - INFO [Thread-0:Login@336] - TGT expires: Wed Mar 13 17:33:09 UTC 2019
2019-03-12 17:33:09,478 - INFO [Thread-0:Login$1@210] - TGT refresh sleeping until: Wed Mar 13 13:27:51 UTC 2019
2019-03-12 17:33:09,479 - INFO [main-SendThread(kafka-1:2181):ClientCnxn$SendThread@975] - Opening socket connection to server kafka-1/192.168.1.1:2181. Will attempt to SASL-authenticate using Login Context section 'Client'
2019-03-12 17:33:09,536 - INFO [main-SendThread(kafka-1:2181):ClientCnxn$SendThread@852] - Socket connection established to kafka-1/192.168.1.1:2181, initiating session
2019-03-12 17:33:09,554 - INFO [main-SendThread(kafka-1:2181):ClientCnxn$SendThread@1235] - Session establishment complete on server kafka-1/192.168.1.1:2181, sessionid = 0x16972ce2b3f002c, negotiated timeout = 30000

real 0m0.718s
user 0m0.573s
sys 0m0.102s

I hope this helps you get your service back up and working.

Leave a comment

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.