Full Hadoop / HBase data platform for testing Spark
I found the following Docker image very handy for testing Hadoop.
docker pull bigdatauniversity/spark2
docker run -it --name bdu_spark2 -P -p 4040:4040 -p 4041:4041 -p 8080:8080 -p 8081:8081 bigdatauniversity/spark2:latest /etc/bootstrap.sh -bash
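If you exit the container and want to get back into it later, standard Docker commands work; bdu_spark2 is the container name from the run above:

```shell
# Restart a stopped container and open a shell in it
docker start bdu_spark2
docker exec -it bdu_spark2 bash

# Inside the container, sanity-check the Spark install
# (assuming spark-shell is on the PATH in this image)
spark-shell --version
```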
I found these sites useful: http://spark-notebook.io/, https://github.com/spark-notebook/spark-notebook and https://github.com/IBM?language=jupyter+notebook
If you hit this error:
Exception: Python in worker has different version 2.6 than that in driver 2.7, PySpark cannot run with different minor versions
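For context, here is a simplified sketch of the guard PySpark performs: the driver sends its major.minor version string to each worker, and the worker compares it against its own interpreter version (this is an illustration, not the real PySpark code):

```python
import sys

def versions_compatible(driver_version, worker_version):
    """Return True when the major.minor components agree, e.g. '2.7' vs '2.7.10'."""
    return driver_version.split(".")[:2] == worker_version.split(".")[:2]

# The mismatch from the error above: driver 2.7 vs worker 2.6
print(versions_compatible("2.7", "2.6"))

# A patch-level difference is fine; only major.minor must match
print(versions_compatible("2.7", "2.7.10"))

# The version string for the interpreter running this script
print("%d.%d" % sys.version_info[:2])
```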
It’s a quick fix. Update spark-env.sh so that PYSPARK_PYTHON points to the Python interpreter the workers should use (the binary itself, not just its bin/ directory), with a major.minor version matching the driver’s:
export PYSPARK_PYTHON=/usr/local/bin/
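A minimal spark-env.sh fragment; the interpreter path below is just an example, so point it at wherever the matching Python lives on your nodes:

```shell
# $SPARK_HOME/conf/spark-env.sh
# Example paths only; use the same major.minor Python on the driver
# and on every worker node.
export PYSPARK_PYTHON=/usr/local/bin/python2.7
export PYSPARK_DRIVER_PYTHON=/usr/local/bin/python2.7
```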
Thanks to http://community.cloudera.com/t5/Advanced-Analytics-Apache-Spark/Python-in-worker-has-different-version-than-that-in-driver/td-p/60620
A handy list of kernels: https://github.com/jupyter/jupyter/wiki/Jupyter-kernels
I used the Python 3 kernel.
pip3 install spylon-kernel
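After the pip install, spylon-kernel still needs to be registered with Jupyter; as I recall the command is the following (check the spylon-kernel README if it has changed):

```shell
# Register the Scala (spylon) kernel with Jupyter
python3 -m spylon_kernel install
```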
I also played with Toree:
pip3 install toree
jupyter toree install
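Either way, you can confirm which kernels Jupyter has registered:

```shell
# List installed kernels and where their kernelspecs live
jupyter kernelspec list
```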