Install Spark on the Hadoop cluster using the following steps:
1) Download the latest version of Spark:
$ wget http://mirrors.ocf.berkeley.edu/apache/spark/spark-1.6.1/spark-1.6.1-bin-hadoop2.6.tgz
2) Extract the archive and rename the extracted directory; the remaining steps assume it lives under /opt/Apache:
$ tar xzf spark-1.6.1-bin-hadoop2.6.tgz
$ mv spark-1.6.1-bin-hadoop2.6 spark-1.6.1
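If the download and extraction were done elsewhere, the directory can be moved into place, for example (assuming the account has write access to /opt/Apache):
$ mkdir -p /opt/Apache
$ mv spark-1.6.1 /opt/Apache/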
3) Spark ships with template configuration files (suffixed ".template") in the conf directory.
To get started, create working copies of the slaves, spark-defaults.conf and spark-env.sh files:
$ cp slaves.template slaves
$ cp spark-defaults.conf.template spark-defaults.conf
$ cp spark-env.sh.template spark-env.sh
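The copies above are made from within Spark's conf directory, e.g. (using the install path from step 2):
$ cd /opt/Apache/spark-1.6.1/conf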
4) slaves
This file holds the hostnames/IP addresses of the Spark worker nodes. Add the two nodes we have in the cluster:
192.168.1.20
192.168.1.18
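sbin/start-all.sh (step 8) connects to each host listed here over SSH, so passwordless SSH from the master to the workers should already be set up. A quick check, assuming SSH keys are in place:
$ ssh 192.168.1.20 hostname
$ ssh 192.168.1.18 hostname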
5) spark-defaults.conf
Defines the Spark master URL and other common options, as shown below:
spark.master spark://192.168.1.16:7077
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.eventLog.enabled true
spark.history.fs.logDirectory file:/opt/Apache/hadoop-2.6.1/logs/spark-events
spark.eventLog.dir file:/opt/Apache/hadoop-2.6.1/logs/spark-events-log
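Event logging fails at application startup if spark.eventLog.dir does not already exist, so it is safest to create both directories up front (paths taken from the settings above):
$ mkdir -p /opt/Apache/hadoop-2.6.1/logs/spark-events
$ mkdir -p /opt/Apache/hadoop-2.6.1/logs/spark-events-log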
6) spark-env.sh
This script defines runtime environment options for Spark. Options can also be passed to spark-submit per job; values set here act as cluster-wide defaults.
HADOOP_CONF_DIR=/opt/Apache/hadoop-2.6.1/etc/hadoop
SPARK_MASTER_IP=192.168.1.16
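For example, a job submission that overrides some of these defaults on the command line might look like the following (SparkPi ships with Spark; the exact jar filename under lib/ depends on the build, hence the wildcard):
$ bin/spark-submit \
    --master spark://192.168.1.16:7077 \
    --executor-memory 1g \
    --class org.apache.spark.examples.SparkPi \
    lib/spark-examples-*.jar 100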
7) We can reuse the Spark configuration changes made on the master node without repeating the steps above on each worker.
rsync the spark-1.6.1 directory to both worker nodes:
$ rsync -avxP /opt/Apache/spark-1.6.1 192.168.1.18:/opt/Apache
$ rsync -avxP /opt/Apache/spark-1.6.1 192.168.1.20:/opt/Apache
8) Start the Spark master and worker processes (run from /opt/Apache/spark-1.6.1 on the master node):
$ sbin/start-all.sh
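To confirm the daemons are up, jps should show a Master process on this node and a Worker on each of 192.168.1.20 and 192.168.1.18; the master web UI is served at http://192.168.1.16:8080 by default:
$ jps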
9) Start history server
$ sbin/start-history-server.sh
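The history server reads completed-application logs from spark.history.fs.logDirectory and serves its web UI on port 18080 by default, e.g. http://192.168.1.16:18080.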