Recently, I’ve had to do some advanced HBase Shell scripting to check data consistency.
Tip # 1 – date with microseconds
You can easily establish min/max times from Bash and feed them into your script:
$(date +%s%6N)
This gets you the epoch seconds plus 6 digits of sub-second precision – 3 for milliseconds and 3 for microseconds. (%N is GNU date's nanosecond field; %6N truncates it, so the precision is microseconds, not nanoseconds.)
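For example, a minimal sketch of timing a unit of work from Bash this way (assumes GNU date, which supports %N):

```shell
# Capture start/end timestamps as epoch microseconds (GNU date).
start=$(date +%s%6N)
sleep 0.01                      # stand-in for the real work
end=$(date +%s%6N)
echo "elapsed_us=$((end - start))"
```

You can then pass $start and $end to your JRuby script as arguments.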
Tip # 2 – Include Java
Add the ‘include Java’ line to your script to get access to all JARs on the HBase shell's classpath via JRuby's Java integration.
Tip # 3 – Forget about Bytes….
You should convert to Base64 to make it human readable (and you don’t want to break lines – the flag value 8 keeps it from wrapping).
import org.apache.hadoop.hbase.util.Base64
# r is a Result from the scan; read column family "m", qualifier "d"
content = Bytes.toString(r.getValue(Bytes.toBytes("m"), Bytes.toBytes("d")))
# flag 8 is Base64.DONT_BREAK_LINES, so the encoded row key stays on one line
x = Base64.encodeBytes(r.getRow(), 8)
puts "#{x}"
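As an aside of my own (not from the HBase docs), the same no-wrapping idea applies when eyeballing binary row keys from Bash: GNU base64 -w 0 disables line wrapping, analogous to flag 8 above.

```shell
# Encode raw bytes (octal escapes for non-printables) without line wrapping.
printf 'row-key-\001\002' | base64 -w 0
```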
Tip # 4 – GSON is available
Use GSON to parse JSON efficiently across a scan.
import com.google.gson.JsonParser
parser = JsonParser.new
# r is a Result from the scan; read column family "d", qualifier "b"
jsonBody = Bytes.toString(r.getValue(Bytes.toBytes("d"), Bytes.toBytes("b")))
json = parser.parse(jsonBody)
object = json.getAsJsonObject()
# get() returns a JsonElement; convert nested objects before reading fields
metaObj = object.get('mx').getAsJsonObject()
objVer = metaObj.get('vid').getAsString()
objId = object.get('id').getAsString()
Tip # 5 – Use it as a script
time /usr/iop/current/hbase-client/bin/hbase org.jruby.Main /tmp/check.rb > check.log
HBase Metadata – How to Scan
I learned this tip from a colleague. You can easily scan hbase:meta to find the timestamp an HBase table was created. Details on the metadata can be found on the O’Reilly website.
sudo su - hbase
/usr/bin/kinit -kt /etc/security/keytabs/hbase.headless.keytab $(/usr/bin/klist -kt /etc/security/keytabs/hbase.headless.keytab | tail -n 1 | awk '{print $NF}')
cat << EOF | /usr/iop/current/hbase-client/bin/hbase shell
scan 'hbase:meta'
EOF
Running a Long Running Thread in the HBase shell
I had to write a quick-and-dirty script to process data into HBase, and I knew there was a likelihood of disconnection. Here is a small tip in case you get disconnected: you can use nohup along with the HBase shell. I hope it helps you.
#constants
LOG_FILE_OUT=/var/log/logged-action.log
LOG_FILE_ERR=/var/log/logged-action.err
# Create a ruby file
cat << EOF > test.rb
include Java
print("Starting the Export")
import java.lang.Thread
Thread.sleep(10000)
print("\ndone waiting")
STDOUT.flush
EOF
nohup /usr/iop/current/hbase-client/bin/hbase shell -n \
test.rb "${@}" > ${LOG_FILE_OUT} 2> ${LOG_FILE_ERR} &
KMS Ranger API – Tips and cURLs
I use Hadoop KMS with Ranger in one environment. Some sample REST API calls are below, along with two tips:
- versionName is used in multiple queries, so capture it from one call and reuse it.
- When not using Kerberos, set ?user.name=hdfs on the URL.
curl -k -X POST -H "Content-type:application/json" -d '{"name":"<key>","iv":"<iv>","material":"<encrypted-material>"}' "http://<kms-node>:16000/kms/v1/keyversion/<key-version>/_eek?eek_op=decrypt&user.name=hdfs" -v
curl -k -X GET -H "Content-type:application/json" http://<kms-node>:16000/kms/v1/key/<key>/_currentversion?user.name=hdfs -v
#Response:
{
  "name" : "<key>",
  "versionName" : "<key>@0",
  "material" : "RANDOM_MATERIAL"
}
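Since versionName is reused in later calls, you can pull it out of the response with sed. A minimal sketch against a sample payload (the key name here is hypothetical):

```shell
# Sample _currentversion response (hypothetical key "mykey").
response='{ "name" : "mykey", "versionName" : "mykey@0", "material" : "RANDOM" }'
# Extract the versionName value for reuse in follow-up KMS calls.
versionName=$(printf '%s' "$response" | sed -n 's/.*"versionName" *: *"\([^"]*\)".*/\1/p')
echo "$versionName"
```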
curl -k -X GET -H "Content-type:application/json" http://<kms-node>:16000/kms/v1/key/<key>/_metadata?user.name=hdfs -v
#Response:
{
  "name" : "<key>",
  "cipher" : "AES/CTR/NoPadding",
  "length" : 128,
  "description" : null,
  "attributes" : {
    "key.acl.name" : "<key>"
  },
  "created" : 1234567812,
  "versions" : 1
}
References
- https://hadoop.apache.org/docs/current/hadoop-kms/index.html#KMS_HTTP_REST_API
- https://hadoop.apache.org/docs/current/hadoop-kms/index.html#Get_Key_Names
- https://stackoverflow.com/questions/37601763/authentication-issue-with-kms-hadoop
Kerberos and Java
I have worked on a Kerberos smoke test for my team. I learned a few tips in the process.
The useTicketCache option is preferred in case the Java process dies while the KDC is down: on restart it can still pick up credentials from the OS ticket cache.
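A minimal JAAS entry sketch showing the option – the entry name, principal, and keytab path are assumptions, so adjust them for your cluster:

```
Client {
    com.sun.security.auth.module.Krb5LoginModule required
    useTicketCache=true
    useKeyTab=true
    keyTab="/etc/security/keytabs/smoketest.keytab"
    principal="smoketest@EXAMPLE.COM"
    doNotPrompt=true;
};
```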
HBase canary testing runs on a Kerberos-enabled cluster via hbase canary.
If you are port forwarding over SSH, you’ll want to switch Kerberos to TCP (SSH only forwards TCP) using a tweak in your krb5.conf file. Thanks to IBM’s site, it’s an easy fix.
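The tweak is the udp_preference_limit setting in [libdefaults]; setting it to 1 makes Kerberos prefer TCP:

```
[libdefaults]
    udp_preference_limit = 1
```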
For a working example, the Kerberos Java site describes in detail how to build a Kerberos client.
Solution: Ambari API to Restart IOP Deployment
I had to restart something like 120 tasks in Ambari for an IOP deployment. I used the Ambari API.
Ambari All Sorts of Messed Up
I run Ambari and the Ambari agents that control our HDFS/HBase and general Hadoop/Apache ecosystem machines. Our bare-metal machines hung, and we could not get anything restarted.
In the logs, we had:
{
'msg':
'Unable to read structured output
from /var/lib/ambari-agent/data/structured-out-status.json'
}
We found a fix online.
Steps
- Remove /var/lib/ambari-agent/data/structured-out-status.json
- Restart ambari agent.
Our Ambari setup now works.
Solution: Logging using PQS 4.7 and Higher
1 – Drop a log4j.properties file into /tmp and chmod 755 it. In practice, I’d probably put it in /opt/phoenix/config or /etc/hbase/phoenix.
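The original properties file isn’t shown here, so the following is a sketch of a log4j.properties that sends TRACE output from Avatica (the PQS wire layer) to /tmp/phoenix-query.log – appender names and sizes are assumptions:

```
log4j.rootLogger=INFO, file
log4j.appender.file=org.apache.log4j.RollingFileAppender
log4j.appender.file.File=/tmp/phoenix-query.log
log4j.appender.file.MaxFileSize=100MB
log4j.appender.file.MaxBackupIndex=5
log4j.appender.file.layout=org.apache.log4j.PatternLayout
log4j.appender.file.layout.ConversionPattern=%d{ISO8601} [%t] %-5p %c - %m%n
# TRACE on Avatica serializes every request/response to the log
log4j.logger.org.apache.calcite.avatica=TRACE
```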
2 – Edit queryserver.py (/opt/phoenix/bin/queryserver.py). Around line 128 (hopefully not changed much from 4.7 to 4.8), add " -Dlog4j.configuration=file:///tmp/log4j.properties" to the Java options.
3 – Restart phoenix
/opt/phoenix/bin/queryserver.py stop
/opt/phoenix/bin/queryserver.py start
4 – scan the file – grep -i XYZ.dim /tmp/phoenix-query.log
2017-01-05 13:40:39,227 [qtp-1461017990-33 - /] TRACE org.apache.calcite.avatica.remote.ProtobufTranslationImpl - Serializing response 'results { connection_id: "0deb2c47-53e5-4846-b22b-ba3faa0bc37a" statement_id: 1 own_statement: true signature { columns { searchable: true display_size: 32 label: "ID" column_name: "ID" precision: 32 table_name: "XYZ.DIM" read_only: true column_class_name: "java.lang.String" type { id: 12 name: "VARCHAR" rep: STRING } } sql: "select ID from XYZ.DIM WHERE VLD_TO_TS IS NULL LIMIT 1" cursor_factory { style: LIST } } first_frame { done: true rows { value { scalar_value { type: STRING string_value: "00025a56f1084f0584a50f7cf9dc4bfc" } } } } update_count: 18446744073709551615 metadata { server_address: "demo.net:80" } } metadata { server_address: "demo.net:80" }'
You can then correlate on connection_id and analyze the log file for data.
Solution: Logging using PQS lower than 4.6
To log PQS traffic on these versions, you can use NGINX as a proxy that records the HTTP request body and the timing.
On your Centos/RHEL machine, switch to root privileges
sudo -s
Add nginx.repo
Install the NGINX packages so you can set up a proxy
yum install nginx.x86_64 -y
Create the NGINX configuration with a log_format that outputs the body ($request_body) and the time ($request_time)
Create the default server configuration, and set proxy_pass to point at your PQS/Phoenix Query Server
Reload nginx – nginx -s reload
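Put together, the steps above could look like this nginx.conf fragment – the listen port and PQS address are assumptions (PQS defaults to 8765):

```
http {
    # $request_body is only populated when the body is read, e.g. by proxy_pass
    log_format pqs '$remote_addr [$time_local] "$request" '
                   '$status $request_time "$request_body"';

    server {
        listen 8766;
        access_log /var/log/nginx/request-body.log pqs;

        location / {
            proxy_pass http://<pqs-node>:8765;
        }
    }
}
```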
Tail the log file to see the data called.
[root@data-com-4 pbastide]# tail -f /var/log/nginx/request-body.log