Recently we needed a complete Spark cluster, so the deployment process is recorded here. Spark officially supports three cluster deployment modes: Standalone, Mesos, and YARN. Standalone is the most convenient to set up; this article focuses on deployment with YARN.
Software Environment:
Ubuntu 14.04.1 LTS (GNU/Linux 3.13.0-32-generic x86_64)
Hadoop: 2.6.0
Spark: 1.3.0
0. Before you start
This example is run throughout as a non-root user, so some commands carry sudo; if you are running as root, simply omit it. It is recommended to download and install the software under your home directory, for example ~/workspace; this is convenient and avoids unnecessary permission problems.
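For example, the working directory can be created like this (the path ~/workspace is simply the convention used in this article):
mkdir -p ~/workspace   # create the working directory
cd ~/workspace         # all downloads are extracted here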
1. Prepare the environment
Modify the host name
We will build a cluster with one master and two slaves. First modify the host name with vi /etc/hostname: change it to master on the master, to slave1 on one slave, and likewise to slave2 on the other.
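For instance, on the first slave (a minimal sketch; applying the name immediately with the hostname command is optional and assumes standard Ubuntu tooling):
sudo vi /etc/hostname   # change the file contents to slave1
sudo hostname slave1    # apply the new name without rebooting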
Configuring hosts
Modify the hosts file on each host
sudo vi /etc/hosts
10.1.1.107 master
10.1.1.108 slave1
10.1.1.109 slave2
After configuring, ping the host names to check that the entries take effect
ping slave1
ping slave2
Passwordless SSH login
Install the OpenSSH server
sudo apt-get install openssh-server
Generate private and public keys on all machines
ssh-keygen -t rsa   # press Enter all the way
The machines need to be able to access each other, so send each machine's id_rsa.pub to the master node; the public keys can be transferred with scp.
scp ~/.ssh/id_rsa.pub spark@master:~/.ssh/id_rsa.pub.slave1   # run on slave1; use id_rsa.pub.slave2 on slave2
On the master, append all public keys to the authentication file authorized_keys
cat ~/.ssh/id_rsa.pub* >> ~/.ssh/authorized_keys
Distribute the authorized_keys file to every slave
scp ~/.ssh/authorized_keys spark@slave1:~/.ssh/   # repeat for slave2
Verify passwordless SSH between the machines
ssh master
ssh slave1
ssh slave2
If the login test fails, you may need to fix the permissions of the authorized_keys file (the permissions matter: if they are too open, SSH treats the setup as insecure and key authentication will not work)
chmod 600 ~/.ssh/authorized_keys
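As a quick sanity check, the following loop can be run on the master once the keys are in place (host names as configured in /etc/hosts above); each command should print the remote host name without asking for a password:
for h in master slave1 slave2; do ssh $h hostname; done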
Install Java
Download the latest version of Java from the official website; Spark officially only requires Java 6 or later. I downloaded jdk-7u75-linux-x64.gz.
Extract it directly in the ~/workspace directory
tar -zxvf jdk-7u75-linux-x64.gz
Edit the environment variables with sudo vi /etc/profile and add the following (remember to replace the paths with your own):
export WORK_SPACE=/home/spark/workspace/
export JAVA_HOME=$WORK_SPACE/jdk1.7.0_75
export JRE_HOME=$JAVA_HOME/jre
export PATH=$JAVA_HOME/bin:$JAVA_HOME/jre/bin:$PATH
export CLASSPATH=$CLASSPATH:.:$JAVA_HOME/lib:$JAVA_HOME/jre/lib
Then make the environment variables take effect and verify that Java was installed successfully
$ source /etc/profile   # make the environment variables take effect
$ java -version         # if the following version information is printed, the installation succeeded
java version "1.7.0_75"
Java(TM) SE Runtime Environment (build 1.7.0_75-b13)
Java HotSpot(TM) 64-Bit Server VM (build 24.75-b04, mixed mode)
Install Scala
Spark officially requires Scala 2.10.x, so be careful not to pick the wrong version; I used 2.10.4 from the official download page (download speed from the official site is generally slow).
Again, extract it in ~/workspace
tar -zxvf scala-2.10.4.tgz
Edit /etc/profile again with sudo vi /etc/profile and add the following:
export SCALA_HOME=$WORK_SPACE/scala-2.10.4
export PATH=$PATH:$SCALA_HOME/bin
Make the environment variables take effect in the same way and verify that Scala was installed successfully
$ source /etc/profile   # make the environment variables take effect
$ scala -version        # if the following version information is printed, the installation succeeded
Scala code runner version 2.10.4 -- Copyright 2002-2013, LAMP/EPFL
Install and configure Hadoop YARN
Download and extract
Download Hadoop 2.6.0 from the official website (the author used a university mirror, which is usually faster).
Again, extract it in ~/workspace
tar -zxvf hadoop-2.6.0.tar.gz
Configuring Hadoop
Enter the Hadoop configuration directory with cd ~/workspace/hadoop-2.6.0/etc/hadoop. The following seven files need to be configured: hadoop-env.sh, yarn-env.sh, slaves, core-site.xml, hdfs-site.xml, mapred-site.xml, yarn-site.xml
Configure JAVA_HOME in hadoop-env.sh
# The java implementation to use.
export JAVA_HOME=/home/spark/workspace/jdk1.7.0_75
Configure JAVA_HOME in yarn-env.sh
# Some Java parameters
export JAVA_HOME=/home/spark/workspace/jdk1.7.0_75
Put the IPs or host names of the slave nodes in the slaves file
slave1
slave2
Modify core-site.xml
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://master:9000/</value>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>file:/home/spark/workspace/hadoop-2.6.0/tmp</value>
    </property>
</configuration>
Modify hdfs-site.xml
<configuration>
    <property>
        <name>dfs.namenode.secondary.http-address</name>
        <value>master:9001</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:/home/spark/workspace/hadoop-2.6.0/dfs/name</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:/home/spark/workspace/hadoop-2.6.0/dfs/data</value>
    </property>
    <property>
        <name>dfs.replication</name>
        <value>3</value>
    </property>
</configuration>
Modify mapred-site.xml (if the file does not exist, copy it from mapred-site.xml.template first)
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>
Modify yarn-site.xml
<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
        <value>org.apache.hadoop.mapred.ShuffleHandler</value>
    </property>
    <property>
        <name>yarn.resourcemanager.address</name>
        <value>master:8032</value>
    </property>
    <property>
        <name>yarn.resourcemanager.scheduler.address</name>
        <value>master:8030</value>
    </property>
    <property>
        <name>yarn.resourcemanager.resource-tracker.address</name>
        <value>master:8035</value>
    </property>
    <property>
        <name>yarn.resourcemanager.admin.address</name>
        <value>master:8033</value>
    </property>
    <property>
        <name>yarn.resourcemanager.webapp.address</name>
        <value>master:8088</value>
    </property>
</configuration>
Distribute the configured hadoop-2.6.0 folder to all the slaves
scp -r ~/workspace/hadoop-2.6.0 spark@slave1:~/workspace/
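With more slaves, a small loop saves repetition; this is just a sketch assuming the same spark user and ~/workspace path exist on every node:
for h in slave1 slave2; do scp -r ~/workspace/hadoop-2.6.0 spark@$h:~/workspace/; done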
Start Hadoop
Run the following on the master to start Hadoop.
cd ~/workspace/hadoop-2.6.0   # enter the hadoop directory
bin/hadoop namenode -format   # format the namenode
sbin/start-dfs.sh             # start dfs
sbin/start-yarn.sh            # start yarn
Verify that Hadoop was installed successfully
Use the jps command on each node to check that the expected processes have started. On the master there should be the following processes:
$ jps   # run on master
3407 SecondaryNameNode
3218 NameNode
3552 ResourceManager
3910 Jps
On each slave there should be the following processes:
$ jps   # run on slaves
2072 NodeManager
2213 Jps
1962 DataNode
Alternatively, open http://master:8088 in a browser; the Hadoop management interface should appear and the slave1 and slave2 nodes should be visible.
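As an extra check, a small HDFS smoke test can be run from the hadoop directory; the paths below are only examples:
bin/hdfs dfs -mkdir -p /user/spark                        # create a directory on HDFS
bin/hdfs dfs -put etc/hadoop/core-site.xml /user/spark/   # upload a test file
bin/hdfs dfs -ls /user/spark                              # the uploaded file should be listed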
Install Spark
Download and extract
Download the latest version of Spark from the official download page; I downloaded spark-1.3.0-bin-hadoop2.4.tgz.
Extract it in the ~/workspace directory
tar -zxvf spark-1.3.0-bin-hadoop2.4.tgz
mv spark-1.3.0-bin-hadoop2.4 spark-1.3.0   # the original folder name is too long, so rename it
Configuring Spark
cd ~/workspace/spark-1.3.0/conf           # enter the spark configuration directory
cp spark-env.sh.template spark-env.sh     # copy from the configuration template
vi spark-env.sh                           # add the configuration below
Add the following at the end of spark-env.sh (this is my configuration; adjust it as needed):
export SCALA_HOME=/home/spark/workspace/scala-2.10.4
export JAVA_HOME=/home/spark/workspace/jdk1.7.0_75
export HADOOP_HOME=/home/spark/workspace/hadoop-2.6.0
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
SPARK_MASTER_IP=master
SPARK_LOCAL_DIRS=/home/spark/workspace/spark-1.3.0
SPARK_DRIVER_MEMORY=1G
NOTE: when setting the number of CPUs and the memory size for the Workers, take the actual hardware of the machines into account; if the configuration exceeds what a Worker node actually has, the Worker process will fail to start (see the example below).
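For example, the Worker resources can be capped explicitly with the standard SPARK_WORKER_CORES and SPARK_WORKER_MEMORY variables in spark-env.sh; the values here are only illustrative and must fit your hardware:
export SPARK_WORKER_CORES=2    # cores each Worker may use
export SPARK_WORKER_MEMORY=2g  # total memory each Worker may give to executors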
Edit the slaves file with vi slaves (if it does not exist, copy it from slaves.template first) and fill in the slave host names:
slave1
slave2
Distribute the configured spark-1.3.0 folder to all the slaves
scp -r ~/workspace/spark-1.3.0 spark@slave1:~/workspace/   # repeat for slave2
Start Spark
sbin / start-all.sh
Verify that Spark was installed successfully
Check with jps; on the master there should be the following processes:
$ jps
7949 Jps
7328 SecondaryNameNode
7805 Master
7137 NameNode
7475 ResourceManager
On each slave there should be the following processes:
$ jps
3132 DataNode
3759 Worker
3858 Jps
3231 NodeManager
Open the Spark web administration page at http://master:8080
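As a further smoke test, a one-line job can be piped into spark-shell against the standalone master (the expression itself is arbitrary; this assumes the cluster started above is reachable at master:7077):
echo 'sc.parallelize(1 to 1000).count()' | ./bin/spark-shell --master spark://master:7077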
Run the sample
# Run in local mode with two threads
./bin/run-example SparkPi 10 --master local[2]
# Spark Standalone cluster mode
./bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master spark://master:7077 \
lib/spark-examples-1.3.0-hadoop2.4.0.jar \
100
# Spark on YARN in yarn-cluster mode (--master can also be yarn-client)
./bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master yarn-cluster \
lib/spark-examples*.jar \
10
Note that Spark on YARN supports two run modes, yarn-cluster and yarn-client. Broadly speaking, yarn-cluster is suited to production environments, while yarn-client is suited to interaction and debugging, where you want to see the application's output quickly; an example in yarn-client mode is sketched below.
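For example, the same SparkPi job can be submitted in yarn-client mode as follows (the jar wildcard must match the examples jar shipped in lib/):
./bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master yarn-client \
lib/spark-examples*.jar \
10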