1. Introduction to cluster deployment
1.1 Hadoop overview
Hadoop is an open source distributed computing platform from the Apache Software Foundation. With the Hadoop Distributed File System (HDFS) and MapReduce (an open source implementation of Google's MapReduce) at its core, Hadoop provides users with a distributed infrastructure whose low-level details are transparent.
In terms of roles, the machines in a Hadoop cluster fall into two categories: Master and Slave. An HDFS cluster consists of one NameNode and several DataNodes. The NameNode acts as the master server: it manages the file system namespace and client access to files. The DataNodes manage the data stored in the cluster. The MapReduce framework consists of a single JobTracker running on the master node and one TaskTracker running on each slave node. The master node schedules all the tasks that make up a job and distributes them across the slave nodes; it monitors their execution and re-runs tasks that have failed. The slave nodes only execute the tasks assigned by the master node. When a job is submitted, the JobTracker receives the job and its configuration information, distributes the configuration to the slave nodes, schedules the tasks, and monitors the TaskTrackers' execution.
As the description above shows, HDFS and MapReduce together form the core of the Hadoop distributed architecture. HDFS implements a distributed file system across the cluster, and MapReduce implements distributed computing and task processing on the cluster. HDFS provides file operations and storage support for MapReduce tasks, and MapReduce builds on HDFS to distribute tasks, track them, execute them and collect the results. Working together, the two carry out the main work of a Hadoop distributed cluster.
1.2 Environment description
My environment is set up in virtual machines. The Hadoop cluster comprises four nodes: one Master and three Slaves, connected over a LAN, and every node can ping the others. The node IP addresses are assigned as follows:
(Table: virtual machine systems and node IP addresses)
The Master machine is configured with the NameNode and JobTracker roles and is responsible for managing the distributed data and for decomposing and dispatching tasks. The three Slave machines are configured with the DataNode and TaskTracker roles and are responsible for distributed data storage and task execution. In practice there should also be a standby Master, so that a backup can be brought up immediately if the Master server goes down. Once I have gained some experience I will add a standby Master (standby machines can be added by modifying the configuration files).
Note: Hadoop requires the deployment directory structure to be the same on all machines (at startup, the master node starts the daemons on the other nodes from the same directory as its own), and all machines must have an account with the same user name. Following the various documents I consulted, every machine gets a hadoop user, and this account is used for passwordless authentication. For convenience, I created a new hadoop user on each machine.
1.3 Environment configuration
The Hadoop cluster is configured as described in Section 1.2. The following describes how to change the machine name and configure the hosts file for ease of use.
Note: My virtual machines connect to the network through NAT and their IP addresses are assigned automatically, so I use the automatically assigned addresses rather than setting specific ones.
(1) Change the current machine name
Suppose the host name of the machine is not the one we want.
1) Change the machine name in Ubuntu
Change the value in the file /etc/hostname to the desired name. After the change, use the hostname command to check that the current host name was set successfully.
So that the host name resolves correctly, it is also best to update the corresponding host name in the /etc/hosts file.
2) Change the machine name in Fedora
In the file "/etc/sysconfig/network", change the value after "HOSTNAME" to the name we have chosen.
Command: vi /etc/sysconfig/network, then edit as follows:
So that the host name resolves correctly, it is also best to update the corresponding host name in the /etc/hosts file.
(2) Configure the hosts file (required)
The "/etc/hosts" file maps host names to IP addresses, one [IP HostName] entry per host on the LAN. When a user makes a network connection, the system looks in this file first to find the IP address that corresponds to a host name.
To test whether two machines can communicate, we usually use "ping <IP of the machine>". If "ping <host name of the machine>" reports that the name cannot be found (which is why it is best to update this file whenever the host name changes), the fix is to edit "/etc/hosts" and write the IP address and host name of every host on the LAN into it.
For example, suppose one machine is "Master.Hadoop: 192.168.1.141" and the other is "Slave1.Hadoop: 192.168.1.142", and we test the connection with the "ping" command.
Pinging the IP address directly succeeds, but pinging the host name fails with the message "unknown host". Looking at the "/etc/hosts" file on "Master.Hadoop" shows that it has no "192.168.1.142 Slave1.Hadoop" entry, so this machine cannot resolve the host name "Slave1.Hadoop".
When configuring a Hadoop cluster, add the IP addresses and host names of all machines in the cluster to "/etc/hosts", so that the Master and all Slaves can communicate not only by IP but also by host name. The "/etc/hosts" file on every machine should therefore contain the following:
Command: vi /etc/hosts
Now ping the host name "Slave1.Hadoop" again to see whether the test succeeds.
The host name can now be pinged, which shows that the entries we just added resolve names within the LAN. What remains is to apply the same configuration on the remaining Slave machines and then test again.
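As a rough illustration of the hosts file step, the sketch below adds an "IP HostName" entry to a hosts-format file only when the name is not already listed, so it can be run repeatedly on every machine. The function name add_host_entry is my own invention, and the demo writes to a scratch file rather than the real /etc/hosts; only the two addresses given in this text are used.

```shell
#!/bin/sh
# add_host_entry FILE IP NAME (hypothetical helper, not part of any tool)
# Appends "IP NAME" to FILE unless NAME already appears as a host name
# on a non-comment line.
add_host_entry() (
    file=$1 ip=$2 name=$3
    if ! awk -v n="$name" '
            !/^[[:space:]]*#/ { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
            END { exit !found }' "$file" 2>/dev/null
    then
        printf '%s %s\n' "$ip" "$name" >> "$file"
    fi
)

# Demo on a scratch file instead of the real /etc/hosts.
hosts_file=$(mktemp)
add_host_entry "$hosts_file" 192.168.1.141 Master.Hadoop
add_host_entry "$hosts_file" 192.168.1.142 Slave1.Hadoop
add_host_entry "$hosts_file" 192.168.1.142 Slave1.Hadoop   # repeated call adds nothing
cat "$hosts_file"
```

To apply it for real, run it as root with /etc/hosts as the first argument, once per machine in the cluster.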
1.4 Required software
(1) JDK software
Download: http://www.Oracle.com/technetwork/java/javase/index.html
JDK version: jdk-7u25-linux-i586.tar.gz
(2) Hadoop software
Download: http://hadoop.apache.org/common/releases.html
Hadoop version: hadoop-1.1.2.tar.gz
2. Passwordless SSH authentication configuration
Hadoop needs to manage its remote daemons while running. After Hadoop starts, the NameNode starts and stops the daemons on each DataNode through SSH (Secure Shell). Commands executed between nodes must therefore not prompt for a password, so we configure SSH to use passwordless public key authentication. The NameNode can then log in to each DataNode over SSH without a password and start the DataNode process; by the same principle, a DataNode can also log in to the NameNode over SSH without a password.
Note: If SSH is not installed on your Linux system, install it first.
Install ssh on Ubuntu: sudo apt-get install openssh-server
Install ssh on Fedora: yum install openssh-server
2.1 SSH basic principles and usage
1) SSH fundamentals
SSH is secure because it uses public key encryption. A password login works as follows:
(1) The remote host receives the user's login request and sends its own public key to the user.
(2) The user encrypts the login password with this public key and sends it back.
(3) The remote host decrypts the password with its own private key; if the password is correct, it allows the user to log in.
2) SSH basic usage
Suppose the user name is java and the remote host is named linux. The login command is:
$ ssh java@linux
The default SSH port is 22, so the login request is sent to port 22 of the remote host. The -p option changes the port; for example, to use port 88:
$ ssh -p 88 java@linux
Note: If the error "ssh: Could not resolve hostname linux: Name or service not known" appears, the host linux is not known to this machine's name service. Add the host name and its corresponding IP address to /etc/hosts to fix this.
2.2 Configure passwordless login from the Master to all Slaves
1) Principle of passwordless SSH
The Master (NameNode | JobTracker) acts as the client. To connect to a Slave server (DataNode | TaskTracker) with passwordless public key authentication, generate a key pair (a public key and a private key) on the Master, then copy the public key to every Slave. When the Master connects to a Slave over SSH, the Slave generates a random number, encrypts it with the Master's public key, and sends it to the Master. The Master decrypts it with its private key and sends the decrypted number back to the Slave. Once the Slave confirms that the number is correct, it allows the Master to connect. This is the public key authentication process; the user never has to enter a password manually.
2) Set up passwordless login on the Master machine
a. On the Master node, use the ssh-keygen command to generate a key pair with no passphrase.
Execute the following command on the Master node:
ssh-keygen -t rsa -P ''
When asked where to save the key, press Enter to accept the default path. The generated key pair, id_rsa (private key) and id_rsa.pub (public key), is stored by default in the "/home/username/.ssh" directory.
Check that "/home/username/" contains a ".ssh" folder and that ".ssh" contains the two newly generated key files.
b. Then, on the Master node, append id_rsa.pub to the authorized keys file.
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
Check the permissions of authorized_keys. If they are not correct, set them with the following command:
chmod 600 ~/.ssh/authorized_keys
c. Log in as root and edit the following settings in the SSH configuration file "/etc/ssh/sshd_config".
Check that the leading "#" comment marker has been removed from the following lines:
RSAAuthentication yes # Enable RSA authentication
PubkeyAuthentication yes # Enable public and private key pair authentication
AuthorizedKeysFile %h/.ssh/authorized_keys # key file path
After changing the settings, remember to restart the SSH service so that they take effect.
Log out of root and verify as the ordinary user that login succeeds.
The output shows that passwordless login is now set up locally. The next step is to copy the public key to all the Slave machines.
Note: During testing the error "Agent admitted failure to sign using the key" may occur. The solution is to run: ssh-add ~/.ssh/id_rsa
d. Use the ssh-copy-id command to transfer the public key to the remote host (Slave1.Hadoop is used as the example here).
e. Test whether passwordless login to the other machine succeeds.
After these five steps, we have passwordless SSH from "Master.Hadoop" to "Slave1.Hadoop". Repeat the steps above for the remaining two Slave servers (Slave2.Hadoop and Slave3.Hadoop). This completes "Configure passwordless login from the Master to all Slaves".
Next, configure passwordless login from every Slave to the Master. The principle is the same as for Master to Slave: append each Slave's public key to "authorized_keys" in the Master's ".ssh" folder, and remember to append (>>) rather than overwrite.
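The append step can be sketched as a small helper that adds a public key to authorized_keys only if it is not already there and then sets the permissions (700 on the directory, 600 on the file). The name authorize_key is hypothetical, the key in the demo is a placeholder string rather than a real key, and the demo works in a scratch directory instead of a real ~/.ssh.

```shell
#!/bin/sh
# authorize_key PUBKEY_FILE SSH_DIR (hypothetical helper)
# Appends the key in PUBKEY_FILE to SSH_DIR/authorized_keys unless an
# identical line is already present, then tightens the permissions.
authorize_key() (
    pubkey_file=$1 ssh_dir=$2
    auth_file=$ssh_dir/authorized_keys
    mkdir -p "$ssh_dir"
    chmod 700 "$ssh_dir"
    touch "$auth_file"
    # -F fixed string, -x whole line, -q quiet, -f read the pattern from the file
    if ! grep -qxFf "$pubkey_file" "$auth_file"; then
        cat "$pubkey_file" >> "$auth_file"
    fi
    chmod 600 "$auth_file"
)

# Demo in a scratch directory with a placeholder key.
demo_dir=$(mktemp -d)
printf 'ssh-rsa AAAAplaceholderkey hadoop@Master.Hadoop\n' > "$demo_dir/id_rsa.pub"
authorize_key "$demo_dir/id_rsa.pub" "$demo_dir/.ssh"
authorize_key "$demo_dir/id_rsa.pub" "$demo_dir/.ssh"   # second run adds nothing
cat "$demo_dir/.ssh/authorized_keys"
```

Checking for an existing line before appending keeps authorized_keys from accumulating duplicates when the setup is repeated on several machines.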
NOTE: Some problems may occur along the way:
(1) The ssh connection fails with "ssh: connect to host port 22: Connection refused".
This may be because the ssh service is not installed on the remote machine, or it is installed but not running. The following test was run against the Slave3.Hadoop host:
To have the service start automatically at boot: # systemctl enable sshd.service
(2) The ssh-copy-id command is not found ("ssh-copy-id: Command not found"). The installed ssh version may be too old, which can happen on RedHat systems, for example. The solution is to copy the local public key to the remote server manually with the following command:
cat ~/.ssh/id_rsa.pub | ssh hadoop@Master.Hadoop 'cat >> ~/.ssh/authorized_keys'
This command is equivalent to the following two commands:
1> On the local machine: scp ~/.ssh/id_rsa.pub hadoop@Master.Hadoop:~/
2> On the remote machine: cat ~/id_rsa.pub >> ~/.ssh/authorized_keys
3. Java environment installation
The JDK must be installed on all machines. Install it on the Master server first; the other servers can then repeat the same steps. Installing the JDK and configuring the environment variables must be done as "root".
3.1 Install JDK
Log in to "Master.Hadoop" as root and create a "java" folder under "/usr". Copy "jdk-7u25-linux-i586.tar.gz" into "/usr/java" and extract it there. "/usr/java" now contains a new folder named "jdk1.7.0_25", which means the JDK is installed. Delete the "jdk-7u25-linux-i586.tar.gz" file and move on to the next step, "Configure environment variables".
3.2 Configure environment variables
(1) Edit the "/etc/profile" file
Edit "/etc/profile" and append the Java settings "JAVA_HOME", "CLASSPATH" and "PATH" at the end, in either of the following two forms:
# Set java environment
export JAVA_HOME=/usr/java/jdk1.7.0_25/
export JRE_HOME=/usr/java/jdk1.7.0_25/jre
export CLASSPATH=.:$CLASSPATH:$JAVA_HOME/lib:$JRE_HOME/lib
export PATH=$PATH:$JAVA_HOME/bin:$JRE_HOME/bin
or
# Set java environment
export JAVA_HOME=/usr/java/jdk1.7.0_25/
export CLASSPATH=.:$CLASSPATH:$JAVA_HOME/lib:$JAVA_HOME/jre/lib
export PATH=$PATH:$JAVA_HOME/bin:$JAVA_HOME/jre/bin
The two forms are equivalent; we use the first.
(2) Make the configuration take effect
Save and exit, then execute the following command to make the configuration take effect immediately.
source /etc/profile   or   . /etc/profile
3.3 Verify the installation
Once the configuration has taken effect, use the following command (java -version) to check whether the installation succeeded.
The output confirms that the JDK has been installed successfully.
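The check can also be scripted, which is handy when the same installation has to be repeated on every Slave. This is only a sketch: check_java_home is a name I made up, and it merely tests that the directory named by JAVA_HOME contains an executable bin/java before asking it for its version.

```shell
#!/bin/sh
# check_java_home DIR (hypothetical helper)
# Succeeds if DIR is non-empty and contains an executable bin/java.
check_java_home() {
    [ -n "$1" ] && [ -x "$1/bin/java" ]
}

if check_java_home "${JAVA_HOME:-}"; then
    echo "JAVA_HOME looks valid: $JAVA_HOME"
    "$JAVA_HOME/bin/java" -version
else
    echo "JAVA_HOME is unset or does not contain bin/java" >&2
fi
```

If the second message appears after editing /etc/profile, the profile has probably not been re-read in the current shell yet.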
3.4 Install on the remaining machines
Next, as the ordinary user hadoop, copy the "/usr/java/" folder to the other Slaves using the scp command in the format below. Then, on each remaining Slave server, configure the environment variables and test the installation by following the steps above. Here Slave1.Master is used as the example:
scp -r /usr/java seed@Slave1.Master:/usr/
Note: On some machines the system libraries are too old to install a newer JDK, for example some Redhat9 machines. You cannot simply install an older JDK on just that machine, because every machine in the cluster must run the same JDK version (tested). There are two ways to resolve this: give up that machine and use one that can run the chosen JDK version, or choose an older JDK version and reinstall it on all machines.
4. Hadoop cluster installation
Hadoop must be installed on all machines. Install it on the Master server first; the other servers can then repeat the same steps. Installing and configuring Hadoop must be done as "root".
4.1 Install Hadoop
First, log in to the "Master.Hadoop" machine as root and copy the downloaded "hadoop-1.1.2.tar.gz" to the /usr directory. Then change to "/usr", extract "hadoop-1.1.2.tar.gz" with the commands below, rename the extracted folder to "hadoop", assign ownership of the folder to the ordinary user hadoop, and delete the "hadoop-1.1.2.tar.gz" installation package.
cd /usr
tar -xzvf hadoop-1.1.2.tar.gz
mv hadoop-1.1.2 hadoop
chown -R hadoop:hadoop hadoop # give ownership of the "hadoop" folder to the ordinary user hadoop
rm -rf hadoop-1.1.2.tar.gz
Finally, create a tmp folder under "/usr/hadoop" and add the Hadoop installation path to "/etc/profile": append the following statements to the end of the file and make them take effect (. /etc/profile):
# Set hadoop path
export HADOOP_HOME=/usr/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
4.2 Configure Hadoop
(1) Configure hadoop-env.sh
The "hadoop-env.sh" file is located in the "/usr/hadoop/conf" directory.
Modify the following line in the file:
export JAVA_HOME=/usr/java/jdk1.7.0_25
Hadoop's configuration files are in the conf directory. In earlier versions the main configuration files were hadoop-default.xml and hadoop-site.xml. As Hadoop developed rapidly and the code base grew sharply, development was split into three parts (core, hdfs and map/reduce), and the configuration was likewise split into three files: core-site.xml, hdfs-site.xml and mapred-site.xml. core-site.xml and hdfs-site.xml configure things from the HDFS point of view; core-site.xml and mapred-site.xml configure things from the MapReduce point of view.
(2) Configure core-site.xml
Edit Hadoop's core configuration file core-site.xml. It sets the address and port of the HDFS master (that is, the NameNode).
<configuration>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/usr/hadoop/tmp</value>
        <description>A base for other temporary directories.</description>
    </property>
    <!-- File system properties -->
    <property>
        <name>fs.default.name</name>
        <value>hdfs://192.168.1.141:9000</value>
    </property>
</configuration>
(Note: create the tmp folder under the /usr/hadoop directory first.)
NOTE: If hadoop.tmp.dir is not configured, the system uses the default temporary directory /tmp/hadoop-hadoop. That directory is deleted on every reboot, after which you must re-run the format step or errors will occur.
(3) Configure hdfs-site.xml
Edit the HDFS configuration in Hadoop. The default number of replicas is 3.
(Note: replication is the number of copies of the data. The default is 3; an error is reported if there are fewer than 3 Slaves.)
(4) Configure mapred-site.xml
Edit Hadoop's MapReduce configuration file. It sets the address and port of the JobTracker.
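The contents of these two files can be sketched as a small generator script. It assumes the Hadoop 1.x property names dfs.replication and mapred.job.tracker; the function name write_site_configs is hypothetical, and the JobTracker port 9001 in the demo is my assumption (a commonly used value), not something stated in this text. The demo writes into a scratch directory rather than /usr/hadoop/conf.

```shell
#!/bin/sh
# write_site_configs CONF_DIR REPLICATION JOBTRACKER (hypothetical helper)
# Writes minimal hdfs-site.xml and mapred-site.xml files into CONF_DIR.
# JOBTRACKER is given as host:port.
write_site_configs() (
    conf_dir=$1 replication=$2 jobtracker=$3
    mkdir -p "$conf_dir"
    cat > "$conf_dir/hdfs-site.xml" <<EOF
<?xml version="1.0"?>
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>$replication</value>
    </property>
</configuration>
EOF
    cat > "$conf_dir/mapred-site.xml" <<EOF
<?xml version="1.0"?>
<configuration>
    <property>
        <name>mapred.job.tracker</name>
        <value>$jobtracker</value>
    </property>
</configuration>
EOF
)

# Demo: 3 replicas, JobTracker on the Master (port 9001 is an assumption).
demo_conf=$(mktemp -d)
write_site_configs "$demo_conf" 3 192.168.1.141:9001
cat "$demo_conf/hdfs-site.xml" "$demo_conf/mapred-site.xml"
```

Generating the files from one script makes it easier to keep the values identical on every node, since the whole conf directory is copied to the Slaves later anyway.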
(5) Configure the masters file
There are two options:
(1) The first
Change localhost to Master.Hadoop
(2) The second
Remove "localhost" and add the Master machine's IP: 192.168.1.141
To be safe, use the second option. If you forget to configure "/etc/hosts", name resolution on the LAN fails and unexpected errors follow; with the IP configured, the host can be found as long as the network is up.
(6) Configure the slaves file (needed on the Master host only)
There are two options:
(1) The first
Remove "localhost" and add the host names of the Slaves, one per line.
For example, add entries of the following form:
(2) The second
Remove "localhost" and add the IP addresses of all Slave machines in the cluster, also one per line.
For example, add entries of the following form:
For the same reason as with the "masters" file, choose the second option.
Hadoop is now fully configured on the Master machine; what remains is to configure Hadoop on the Slave machines.
The easiest way is to copy the configured "/usr/hadoop" folder on the Master to the "/usr" directory of every Slave (the slaves file is not actually needed on the Slave machines, but copying it does no harm). Use the following command format. (Note: at this point you can run the command as either an ordinary user or root.)
scp -r /usr/hadoop root@<server IP>:/usr/
For example, to copy the Hadoop folder from "Master.Hadoop" to "Slave1.Hadoop":
scp -r /usr/hadoop root@Slave1.Hadoop:/usr/
Whether you run the copy as root or as an ordinary user, the destination must be given in the "root@<machine IP>" form. Although the hadoop user owns "/usr/hadoop" on the Master, the hadoop user on Slave1 has no write permission on "/usr" and so cannot create the folder there. Because passwordless SSH was set up only for the ordinary user, "scp" to root prompts for the root password of the "Slave1.Hadoop" server.
Check the "/usr" directory on the "Slave1.Hadoop" server to confirm that the "hadoop" folder exists and was copied successfully. The result is as follows:
The hadoop folder has indeed been copied, but it is owned by root, so we now give the hadoop user on "Slave1.Hadoop" ownership of "/usr/hadoop".
Log in to "Slave1.Hadoop" as root and execute the following command in "/usr".
chown -R hadoop:hadoop hadoop # user name:user group, then the folder
Then edit the "/etc/profile" file on "Slave1.Hadoop", append the following statements to the end, and make them take effect (source /etc/profile):
# Set hadoop environment
export HADOOP_HOME=/usr/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
If you are unsure how to set this up, see the "/etc/profile" configuration on "Master.Hadoop" described earlier. Hadoop is now configured on one Slave machine. What remains is to deploy Hadoop on the other Slave machines in the same way.
4.3 Startup and verification
(1) Format the HDFS file system
Perform this operation on "Master.Hadoop" as the ordinary user hadoop. (Note: format only once; later startups do not need formatting, just start-all.sh.)
hadoop namenode -format
The output shows that the format succeeded. The one blemish is a warning. According to what I found online, the warning does not affect Hadoop's operation, and there is a fix; see "Frequently Asked Questions FAQ" below for details.
(2) Start Hadoop
Before starting, turn off the firewall on every machine in the cluster; otherwise the DataNodes start and then shut down automatically. Use the following command to start.
start-all.sh
The startup log shows the order: the namenode is started first, then datanode1, datanode2, ..., then the secondarynamenode; after that the jobtracker, then tasktracker1, tasktracker2, ....
After Hadoop starts successfully, a dfs folder is created in the tmp folder on the Master, and both dfs and mapred folders are created in the tmp folder on each Slave.
(3) Verify Hadoop
(1) Verification method one: the "jps" command
On the Master, use jps, a small tool that comes with Java, to view the processes.
On Slave2, use jps to view the processes.
If "DataNode" or "TaskTracker" is missing on a Slave machine, look at the log first. If the cause is inconsistent "namespaceID" values, see FAQ 5.2; if it is a "No route to host" problem, see FAQ 5.3.
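The jps check can be automated with a small filter that reads jps output ("<pid> <name>" per line) and prints every expected daemon that is not running. This is a sketch under my own assumptions: missing_daemons is a made-up name, the process IDs in the demo are invented sample input, and the expected lists follow the role assignment in Section 1.2 (NameNode and JobTracker on the Master, DataNode and TaskTracker on the Slaves), plus the SecondaryNameNode that the startup log mentions.

```shell
#!/bin/sh
# missing_daemons NAME... (hypothetical helper)
# Reads jps-style output on stdin and prints each expected NAME that
# does not appear in the second column.
missing_daemons() {
    awk -v expected="$*" '
        { running[$2] = 1 }
        END {
            n = split(expected, names, " ")
            for (i = 1; i <= n; i++)
                if (!(names[i] in running)) print names[i]
        }'
}

# Demo with invented sample input instead of a live jps call.
printf '%s\n' '2001 NameNode' '2002 JobTracker' '2003 Jps' |
    missing_daemons NameNode SecondaryNameNode JobTracker
# prints: SecondaryNameNode
```

On a live node the input would come from jps itself, for example "jps | missing_daemons DataNode TaskTracker" on a Slave; empty output means every expected daemon is present.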
(2) Verification method two: "hadoop dfsadmin -report"
Use this command to view the status of the Hadoop cluster.
4.4 View the cluster in a web browser
(1) Access "http://192.168.1.141:50030" (JobTracker web page)
(2) Access "http://192.168.1.141:50070" (NameNode web page, served by the Master)
5. Frequently Asked Questions (FAQ)
5.1 About "Warning: $HADOOP_HOME is deprecated."
After installing Hadoop, every hadoop command prints this warning:
Warning: $HADOOP_HOME is deprecated.
Inspecting the hadoop-1.1.2/bin/hadoop script and the "hadoop-config.sh" script shows that they check the HADOOP_HOME environment variable. In fact, HADOOP_HOME does not need to be set.
Solution one: Edit "/etc/profile" and remove the HADOOP_HOME setting. Run a hadoop command again (for example hadoop fs) and the warning is gone.
Solution two: Edit "/etc/profile" and add the following environment variable, after which the warning disappears:
export HADOOP_HOME_WARN_SUPPRESS=1
5.2 Solving the "no datanode to stop" problem
When stopping Hadoop, the following message appears:
no datanode to stop
Cause: every namenode format creates a new namespaceID, while tmp/dfs/data still holds the ID from the previous format. Formatting clears the data under the namenode but not the data under the datanodes, so the datanodes fail at startup. There are two solutions:
The first solution is as follows:
1) Delete "/usr/hadoop/tmp"
rm -rf /usr/hadoop/tmp
2) Create the "/usr/hadoop/tmp" folder
mkdir /usr/hadoop/tmp
3) Delete the files in "/tmp" whose names begin with "hadoop"
rm -rf /tmp/hadoop*
4) Reformat hadoop
hadoop namenode -format
5) Start hadoop
The drawback of the first solution is that all the data on the original cluster is lost. If the Hadoop cluster has been running for some time, the second solution is recommended.
The second solution is as follows:
1) Change the namespaceID on each Slave to match the Master's namespaceID.
or
2) Change the Master's namespaceID to match the Slaves' namespaceID.
The "namespaceID" is stored in the "/usr/hadoop/tmp/dfs/name/current/VERSION" file. The leading part of the path may differ depending on your setup, but the trailing part generally does not change.
For example, view the "VERSION" file on the "Master":
I suggest the second solution: it is convenient and also avoids accidental deletion.
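Before editing any VERSION file by hand, it helps to confirm that the IDs really differ. The sketch below reads the namespaceID= line from two VERSION files and compares them. The function names are hypothetical and the IDs in the demo are invented; on a real cluster the files would be dfs/name/current/VERSION on the Master and, as far as I know for Hadoop 1.x, dfs/data/current/VERSION on a Slave, both under the configured tmp directory.

```shell
#!/bin/sh
# get_namespace_id VERSION_FILE (hypothetical helper)
# Prints the value of the namespaceID= line.
get_namespace_id() {
    sed -n 's/^namespaceID=//p' "$1"
}

# same_namespace_id FILE_A FILE_B (hypothetical helper)
# Succeeds only when both files carry the same non-empty namespaceID.
same_namespace_id() (
    a=$(get_namespace_id "$1")
    b=$(get_namespace_id "$2")
    [ -n "$a" ] && [ "$a" = "$b" ]
)

# Demo with two invented VERSION files.
demo=$(mktemp -d)
printf 'namespaceID=1111\nstorageType=NAME_NODE\n' > "$demo/name_VERSION"
printf 'namespaceID=2222\nstorageType=DATA_NODE\n' > "$demo/data_VERSION"
if same_namespace_id "$demo/name_VERSION" "$demo/data_VERSION"; then
    echo "namespaceID matches"
else
    echo "namespaceID mismatch: $(get_namespace_id "$demo/name_VERSION") vs $(get_namespace_id "$demo/data_VERSION")"
fi
```

To compare against a Slave, copy its VERSION file to the Master first (for example with scp) and pass both paths to same_namespace_id.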
5.3 DataNode on a Slave server starts and then shuts down automatically
The log shows the following error.
ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: java.io.IOException: Call to ... failed on local exception: java.net.NoRouteToHostException: No route to host
Solution: turn off the firewall.
5.4 Errors when uploading files from the local file system to HDFS
The following errors appear:
INFO hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink
INFO hdfs.DFSClient: Abandoning block blk_-1300529705803292651_37023
WARN hdfs.DFSClient: DataStreamer Exception: java.io.IOException: Unable to create new block.
The solution is:
1) Turn off the firewall
2) Disable selinux
Edit the "/etc/selinux/config" file and set "SELINUX=disabled"
5.5 Errors caused by safe mode
The following error appears:
org.apache.hadoop.dfs.SafeModeException: Can not delete ..., Name node is in safe mode
When the distributed file system starts, it spends some time in safe mode. While it is in safe mode, the contents of the file system can be neither modified nor deleted until safe mode ends. Safe mode exists so that, at startup, the system can check the validity of the data blocks on each DataNode and copy or delete blocks as its policy requires. Safe mode can also be entered at run time with a command. In practice, modifying or deleting files right after startup produces this error saying that safe mode does not allow modification; usually you only need to wait a while.
Solution: leave safe mode
hadoop dfsadmin -safemode leave
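Since waiting is usually enough, an alternative to forcing the NameNode out of safe mode is to poll until it leaves on its own. The sketch below assumes that "hadoop dfsadmin -safemode get" prints a line containing "Safe mode is OFF" once safe mode has ended, which is my understanding of Hadoop 1.x output; the function names, retry count and delay are made up for illustration. The polling only runs if a hadoop command is available.

```shell
#!/bin/sh
# safemode_is_off (hypothetical helper)
# Reads "hadoop dfsadmin -safemode get" output on stdin; succeeds if it
# reports that safe mode is off (assumed wording: "Safe mode is OFF").
safemode_is_off() {
    grep -q 'Safe mode is OFF'
}

# wait_for_safemode_off TRIES DELAY (hypothetical helper)
# Polls up to TRIES times, sleeping DELAY seconds between attempts.
wait_for_safemode_off() (
    tries=$1 delay=$2
    while [ "$tries" -gt 0 ]; do
        if hadoop dfsadmin -safemode get | safemode_is_off; then
            exit 0
        fi
        tries=$((tries - 1))
        sleep "$delay"
    done
    exit 1
)

# Only poll when Hadoop is actually installed on this machine.
if command -v hadoop >/dev/null 2>&1; then
    wait_for_safemode_off 30 10 || echo "NameNode is still in safe mode" >&2
fi
```

Hadoop 1.x also offers "hadoop dfsadmin -safemode wait", which blocks until safe mode ends; the loop above simply adds an upper bound on the waiting time.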
5.6 Solving "Exceeded MAX_FAILED_UNIQUE_FETCHES"
The error is as follows:
Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out
The program needs to open many files for analysis. The system default limit is usually 1024 (visible with ulimit -a), which is enough for normal use but too low for this program.
Solution: modify two files.
1) "/etc/security/limits.conf"
* soft nofile 102400
* hard nofile 409600
2) "/etc/pam.d/login"
session required /lib/security/pam_limits.so
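After logging in again, it is worth checking that the new limit actually applies to the shell that starts Hadoop. The sketch below compares the soft open-file limit reported by "ulimit -n" with a target value; nofile_at_least is a hypothetical name, and 102400 is simply the soft value used in limits.conf above.

```shell
#!/bin/sh
# nofile_at_least N (hypothetical helper)
# Succeeds if the soft open-file limit of this shell is "unlimited"
# or at least N.
nofile_at_least() (
    limit=$(ulimit -n)
    [ "$limit" = "unlimited" ] || [ "$limit" -ge "$1" ]
)

if nofile_at_least 102400; then
    echo "open-file limit is sufficient: $(ulimit -n)"
else
    echo "open-file limit is only $(ulimit -n); log in again or recheck limits.conf"
fi
```

The limit is per login session, so the check has to be run as the hadoop user in a fresh session, not in the root shell that edited the files.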
A correction to the answer above:
This error occurs when, in the shuffle stage that precedes reduce, fetching the output of completed maps fails more times than the limit, which defaults to 5. It can have many causes, such as an abnormal network connection, connection timeouts, poor bandwidth, or blocked ports. It usually does not appear when the network within the cluster is in good condition.
5.7 Solving "Too many fetch-failures"
This problem occurs mainly when connectivity between the nodes is incomplete.
The solution is:
1) Check "/etc/hosts"
The local machine's IP must map to its server name.
The file must contain the IP and server name of every server.
2) Check ".ssh/authorized_keys"
It must contain the public keys of all servers (including the local one).
5.8 Processing is especially slow
Map finishes quickly, but reduce is very slow and "reduce = 0%" appears repeatedly.
The solution is as follows:
Apply the solution from 5.7, then set "export HADOOP_HEAPSIZE=4000" in "conf/hadoop-env.sh".
5.9 Solving the Hadoop OutOfMemoryError problem
This exception clearly means the JVM does not have enough memory.
The solution is to change the JVM memory size on all datanodes.
java -Xms1024m -Xmx4096m
As a rule, the JVM's maximum memory should be about half of the total memory. Our machines have 8 GB, so we set 4096m; this value may still not be optimal.