HBase Tips

Recently, I’ve had to do some advanced HBase Shell scripting to check data consistency.

Tip # 1 – date with nanoseconds

You can easily establish min/max timestamps from BASH and feed them into your script.

  $(date +%s%6N) 

gets you six digits of sub-second precision: three for milliseconds and three for microseconds.
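
For example, a rough sketch of how I feed the window into the consistency check (the load step and the arguments passed to check.rb are made up for illustration):

#!/bin/bash
# capture the window, in epoch microseconds, around the work being checked
MIN_TS=$(date +%s%6N)
# ... run the job that writes the data you want to verify ...
MAX_TS=$(date +%s%6N)

# hand the window to the JRuby consistency check
/usr/iop/current/hbase-client/bin/hbase org.jruby.Main /tmp/check.rb "${MIN_TS}" "${MAX_TS}"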

Tip # 2 – Include Java

Use the ‘include Java’ line at the top of your JRuby script to get access to all of the JARs that HBase has on its classpath.

Tip # 3 – Forget about Bytes….

Convert raw byte values (like row keys) to Base64 to make them human readable. You don't want to break lines, and passing 8 (the DONT_BREAK_LINES option) keeps the output from wrapping.

import org.apache.hadoop.hbase.util.Base64
import org.apache.hadoop.hbase.util.Bytes

# r is the Result for the current row of your scan
content = Bytes.toString(r.getValue(Bytes.toBytes("m"), Bytes.toBytes("d")))
# 8 = DONT_BREAK_LINES, so the encoded row key stays on one line
x = Base64.encodeBytes(r.getRow(), 8)
puts "#{x}"

Tip # 4 – GSON is available

Use GSON to parse JSON efficiently across a scan.

import com.google.gson.JsonParser

# r is the Result for the current row of your scan
parser = JsonParser.new
jsonBody = Bytes.toString(r.getValue(Bytes.toBytes("d"), Bytes.toBytes("b")))

json = parser.parse(jsonBody)
object = json.getAsJsonObject()
metaObj = object.get('mx').getAsJsonObject()
objVer = metaObj.get('vid').getAsString()
objId = object.get('id').getAsString()

Tip # 5 – Use it as a script

Run your JRuby file directly against the HBase classpath and time it:

time /usr/iop/current/hbase-client/bin/hbase org.jruby.Main /tmp/check.rb > check.log

HBase Metadata – How to Scan

I learned this tip from a colleague. You can easily scan the hbase:meta table to find the timestamp when an HBase table was created. Details on the metadata can be found on the O’Reilly website.

sudo su - hbase
/usr/bin/kinit -kt /etc/security/keytabs/<KEYTAB FILE> $(/usr/bin/klist -kt /etc/security/keytabs/hbase.headless.keytab | tail -n 1 | awk '{print $NF}')
cat << EOF | /usr/iop/current/hbase-client/bin/hbase shell
scan 'hbase:meta'
EOF

Running a Long-Running Thread in the HBase Shell

I had to write a quick script to process data into HBase (it's quick and dirty), and I knew there was a good chance I'd get disconnected. Here is a small tip in case that happens to you: you can use nohup along with the hbase shell. I hope it helps you.

#constants
LOG_FILE_OUT=/var/log/logged-action.log
LOG_FILE_ERR=/var/log/logged-action.err

# Create a ruby file
cat << EOF > test.rb
include Java
print("Starting the Export")
import java.lang.Thread
Thread.sleep(10000)
print("\ndone waiting")
STDOUT.flush
EOF

# run the script non-interactively (-n) in the background so it survives a disconnect
nohup /usr/iop/current/hbase-client/bin/hbase shell -n \
test.rb "${@}" > ${LOG_FILE_OUT} 2> ${LOG_FILE_ERR} &

KMS Ranger API – Tips and cURLs

I use the Hadoop KMS with Ranger in one environment. Some sample REST API calls are below, along with two tips.

  • versionName is used in multiple queries.
  • When not using Kerberos, set ?user.name=hdfs on the URL.

curl -k -X POST -H "Content-type:application/json" -d '{"name": "<key>", "iv": "<iv>", "material": "<encrypted-material>"}' "http://<kms-node>:16000/kms/v1/keyversion/<key-version>/_eek?eek_op=decrypt&user.name=hdfs" -v
curl -k -X GET -H "Content-type:application/json" "http://<kms-node>:16000/kms/v1/key/<key>/_currentversion?user.name=hdfs" -v

#Response:

{
  "name" : "<key>",
  "versionName" : "<key>@0",
  "material" : "RANDOM_MATERIAL"
}

curl -k -X GET -H "Content-type:application/json" http://<kms-node>:16000/kms/v1/key/<key>/_metadata?user.name=hdfs -v

#Response:

{
  "name" : "<key>",
  "cipher" : "AES/CTR/NoPadding",
  "length" : 128,
  "description" : null,
  "attributes" : {
    "key.acl.name" : "<key>"
  },
  "created" : 1234567812,
  "versions" : 1
}
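
To list the key names the KMS knows about (the Get Key Names call from the references below), the same pattern works:

curl -k -X GET -H "Content-type:application/json" "http://<kms-node>:16000/kms/v1/keys/names?user.name=hdfs" -v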

References

  • https://hadoop.apache.org/docs/current/hadoop-kms/index.html#KMS_HTTP_REST_API
  • https://hadoop.apache.org/docs/current/hadoop-kms/index.html#Get_Key_Names
  • https://stackoverflow.com/questions/37601763/authentication-issue-with-kms-hadoop

Kerberos and Java

I have worked on a Kerberos smoke test for my team. I learned a few tips in the process.

useTicketCache is preferred so that, if the Java process dies and restarts while the KDC is down, it can pick its credentials back up from the ticket cache.
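
For context, useTicketCache is an option on the JDK's Krb5LoginModule. A JAAS entry for a smoke test might look roughly like this (the entry name, keytab path, and principal are placeholders):

SmokeTest {
  com.sun.security.auth.module.Krb5LoginModule required
  useKeyTab=true
  keyTab="/etc/security/keytabs/smoketest.keytab"
  principal="smoketest@EXAMPLE.COM"
  useTicketCache=true
  doNotPrompt=true;
};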

HBase canary testing runs on a Kerberos-enabled cluster using hbase canary.
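
A quick way to exercise that is to grab a ticket and run the canary (the principal is a placeholder; match it to your keytab):

/usr/bin/kinit -kt /etc/security/keytabs/hbase.headless.keytab hbase@EXAMPLE.COM
/usr/iop/current/hbase-client/bin/hbase canary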

If you are port forwarding over SSH, you'll want to switch Kerberos to TCP using a small trick in your krb5.conf file. Thanks to IBM's site, it's an easy fix.
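
The trick is to cap the UDP preference so Kerberos falls back to TCP, in the [libdefaults] section of /etc/krb5.conf:

[libdefaults]
  udp_preference_limit = 1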

For a working example, the Kerberos Java site describes in detail how to build a Kerberos client.

Solution: Ambari API to Restart IOP Deployment

I had to restart something like 120 tasks in Ambari for an IOP deployment, so I used the Ambari API rather than clicking through the UI.
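
The heavy lifting is the Ambari requests API. A sketch of the kind of call involved, with the cluster, service, component, and host names as placeholders:

curl -u admin:admin -H "X-Requested-By: ambari" -X POST \
  -d '{"RequestInfo": {"command": "RESTART", "context": "Restart via REST API"},
       "Requests/resource_filters": [{"service_name": "HBASE",
         "component_name": "HBASE_REGIONSERVER", "hosts": "<host-fqdn>"}]}' \
  "http://<ambari-server>:8080/api/v1/clusters/<cluster>/requests"

Loop the same call over your list of hosts and components to work through the backlog of restarts.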

Ambari All Sorts of Messed Up

I run Ambari and the Ambari agents, which control our HDFS/HBase and general Hadoop/Apache ecosystem machines. Our bare metal machines hung, and we could not get anything restarted.

In the logs, we had:

{
  'msg': 'Unable to read structured output from /var/lib/ambari-agent/data/structured-out-status.json'
}

We found a fix online.

Steps

  1. Remove /var/lib/ambari-agent/data/structured-out-status.json
  2. Restart the ambari agent (commands below).
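
On each affected agent node, that boils down to:

rm /var/lib/ambari-agent/data/structured-out-status.json
ambari-agent restart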

Our Ambari setup now works.

Solution: Logging using PQS 4.7 and Higher

1 – Drop a log4j.properties file into /tmp and chmod 755 the file. In practice, I'd probably put it in /opt/phoenix/config or /etc/hbase/phoenix.
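
A minimal log4j.properties sketch that matches the TRACE output and the /tmp/phoenix-query.log file used below (adjust the logger names and layout to taste):

cat << EOF > /tmp/log4j.properties
# send everything to a file, with Avatica at TRACE so requests/responses land in the log
log4j.rootLogger=INFO, QUERYLOG
log4j.logger.org.apache.calcite.avatica=TRACE
log4j.appender.QUERYLOG=org.apache.log4j.FileAppender
log4j.appender.QUERYLOG.File=/tmp/phoenix-query.log
log4j.appender.QUERYLOG.layout=org.apache.log4j.PatternLayout
log4j.appender.QUERYLOG.layout.ConversionPattern=%d [%t] %-5p %c - %m%n
EOF
chmod 755 /tmp/log4j.properties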

2 – Edit queryserver.py (/opt/phoenix/bin/queryserver.py). Around line 128 (hopefully not changed much from 4.7 to 4.8), add " -Dlog4j.configuration=file:///tmp/log4j.properties" to the java command line.

3 – Restart phoenix

/opt/phoenix/bin/queryserver.py stop
/opt/phoenix/bin/queryserver.py start

4 – Scan the file: grep -i XYZ.dim /tmp/phoenix-query.log

2017-01-05 13:40:39,227 [qtp-1461017990-33 - /] TRACE org.apache.calcite.avatica.remote.ProtobufTranslationImpl - Serializing response 'results { connection_id: "0deb2c47-53e5-4846-b22b-ba3faa0bc37a" statement_id: 1 own_statement: true signature { columns { searchable: true display_size: 32 label: "ID" column_name: "ID" precision: 32 table_name: "XYZ.DIM" read_only: true column_class_name: "java.lang.String" type { id: 12 name: "VARCHAR" rep: STRING } } sql: "select ID from XYZ.DIM WHERE VLD_TO_TS IS NULL LIMIT 1" cursor_factory { style: LIST } } first_frame { done: true rows { value { scalar_value { type: STRING string_value: "00025a56f1084f0584a50f7cf9dc4bfc" } } } } update_count: 18446744073709551615 metadata { server_address: "demo.net:80" } } metadata { server_address: "demo.net:80" }'

You can then correlate on connection_id and analyze the log file for data.

Solution: Logging using PQS lower than 4.6

To log PQS traffic, you can use NGINX as a proxy that records the HTTP request body and the timing.

On your CentOS/RHEL machine, switch to root privileges: sudo -s

Add the nginx.repo file so yum can find the package.
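
A sketch of the repo file, using the stock nginx.org package repository for CentOS:

cat << 'EOF' > /etc/yum.repos.d/nginx.repo
[nginx]
name=nginx repo
baseurl=http://nginx.org/packages/centos/$releasever/$basearch/
gpgcheck=0
enabled=1
EOF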

Install the nginx package so you can set up a proxy: yum install nginx.x86_64 -y

Create the nginx configuration so you have a log_format that outputs the body ($request_body) and the time ($request_time).
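
For example, something along these lines in the http block of /etc/nginx/nginx.conf (the format name and the extra fields are my own choices):

log_format bodylog '$remote_addr [$time_local] "$request" '
                   'status=$status request_time=$request_time body="$request_body"';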

Create the default server configuration, and set the proxy_pass to point to your PQS/Phoenix Query Server.
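
And something like this in /etc/nginx/conf.d/default.conf, assuming PQS is listening on its default port 8765 and the proxy takes an arbitrary free port:

server {
    listen 8764;
    location / {
        access_log /var/log/nginx/request-body.log bodylog;
        proxy_pass http://<pqs-node>:8765;
    }
}

Point your Phoenix thin client at the proxy port instead of PQS so the traffic flows through nginx.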

Reload nginx – nginx -s reload

Tail the log file to see the requests as they come through.

[root@data-com-4 pbastide]# tail -f /var/log/nginx/request-body.log
