Spark and Data Tips for November 2018

Full Hadoop / HBase data platform for testing Spark

I found the following Docker image very handy for testing Hadoop:
https://hub.docker.com/r/bigdatauniversity/spark2/
docker pull bigdatauniversity/spark2
docker run -it --name bdu_spark2 -P -p 4040:4040 -p 4041:4041 -p 8080:8080 -p 8081:8081 bigdatauniversity/spark2:latest /etc/bootstrap.sh -bash
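
Once the container is up, it can be handy to confirm that the published ports actually answer. The following is a minimal Python sketch and not part of the image: the helper names port_open and check_ports are hypothetical, and it assumes the container is reachable on localhost with the four ports mapped by the -p flags above.

```python
import socket

# Ports taken from the -p flags of the docker run command; what listens on
# each one depends on what is running inside the container.
PORTS = (4040, 4041, 8080, 8081)


def port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_ports(host="localhost", ports=PORTS):
    """Map each port to whether it currently accepts connections."""
    return {port: port_open(host, port) for port in ports}


if __name__ == "__main__":
    for port, is_open in check_ports().items():
        print(port, "open" if is_open else "closed")
```

A port shows as closed until something listens on it. Port 4040, for example, is Spark's default application UI port and is only bound while a Spark application is running.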

Spark Notebooks

I found these sites useful: http://spark-notebook.io/, https://github.com/spark-notebook/spark-notebook and https://github.com/IBM?language=jupyter+notebook

Version Mismatch

If you hit this error:

Exception: Python in worker has different version 2.6 than that in driver 2.7, PySpark cannot run with different minor versions

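It's a quick fix: the workers and the driver must run the same Python minor version.

Update spark-env.sh so that PYSPARK_PYTHON points at the interpreter you want to use on every node: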
export PYSPARK_PYTHON=/usr/local/bin/python3

Thanks to http://community.cloudera.com/t5/Advanced-Analytics-Apache-Spark/Python-in-worker-has-different-version-than-that-in-driver/td-p/60620
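
To confirm the fix took effect, one option is to compare the driver's Python version with what the executors report. This is a minimal sketch, assuming PySpark is installed and an existing SparkContext named sc. The helpers minor_version, pin_worker_python and worker_versions are hypothetical names, not part of PySpark. Setting PYSPARK_PYTHON from the driver script, before the SparkContext is created, is an alternative to editing spark-env.sh and assumes the same interpreter path exists on the workers.

```python
import os
import sys


def minor_version(version_info=sys.version_info):
    """Return 'major.minor', the granularity the PySpark error message compares."""
    return "%d.%d" % (version_info[0], version_info[1])


def pin_worker_python(env=os.environ, interpreter=sys.executable):
    """Point PYSPARK_PYTHON at the driver's interpreter unless it is already set.

    Call this before creating the SparkContext.
    """
    env.setdefault("PYSPARK_PYTHON", interpreter)
    return env["PYSPARK_PYTHON"]


def worker_versions(sc, partitions=2):
    """Collect the distinct 'major.minor' versions reported by the executors."""
    def report(_):
        import sys  # imported on the worker, so it reflects the worker's Python
        return "%d.%d" % sys.version_info[:2]

    rdd = sc.parallelize(range(partitions), partitions)
    return sorted(set(rdd.map(report).collect()))
```

With a running context, worker_versions(sc) should return a single entry equal to minor_version(). If the versions still differ, the job fails with the exception quoted above instead of returning a list.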

Kernels

A handy list of kernels: https://github.com/jupyter/jupyter/wiki/Jupyter-kernels

I used the Python 3 kernel.

pip3 install spylon-kernel
python3 -m spylon_kernel install

I also tried Apache Toree:
https://github.com/apache/incubator-toree

pip3 install toree
jupyter toree install
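
After installing a kernel, it is worth checking that Jupyter has actually registered it. The sketch below parses the output of jupyter kernelspec list --json. The function names parse_kernelspecs and installed_kernels are hypothetical, and it assumes the jupyter command is on the PATH.

```python
import json
import subprocess


def parse_kernelspecs(raw_json):
    """Return {kernel name: display name} from `jupyter kernelspec list --json` output."""
    specs = json.loads(raw_json).get("kernelspecs", {})
    # Fall back to the kernel name when a spec has no display_name.
    return {
        name: info.get("spec", {}).get("display_name", name)
        for name, info in sorted(specs.items())
    }


def installed_kernels():
    """Run the jupyter CLI and parse its JSON listing (requires jupyter on PATH)."""
    out = subprocess.run(
        ["jupyter", "kernelspec", "list", "--json"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return parse_kernelspecs(out)


if __name__ == "__main__":
    try:
        for name, display in installed_kernels().items():
            print(name, "->", display)
    except (OSError, subprocess.CalledProcessError) as exc:
        print("could not list kernels:", exc)
```

If a freshly installed kernel is missing from the listing, the install step most likely ran against a different Python environment than the one Jupyter uses.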
