  Hadoop 2.0 Detailed Configuration Tutorial
  Add Date : 2018-11-21      
  PS: Parts of this article draw on material from the Internet, written up after hands-on practice. If you have any questions, feel free to contact me.

You can also try Ambari to set up a Hadoop environment.

It rapidly deploys Hadoop, HBase, Hive, and related components, and provides Ganglia and Nagios monitoring. It is strongly recommended.
Hadoop 2.0 cluster configuration detailed tutorial


Hadoop 2.0 Introduction

Apache Hadoop is an open source project whose main goal is to build reliable, scalable, distributed systems. Hadoop is a collection of sub-projects, including:
1. Hadoop Common: provides infrastructure for the other projects
2. HDFS: the distributed file system
3. MapReduce: a software framework for distributed processing of large data sets on compute clusters; a simplified framework for distributed programming.
4. Other projects include Avro (a serialization system), Cassandra (a database project), and so on.

With the Hadoop Distributed File System (HDFS) and MapReduce (an open source implementation of Google's MapReduce) at its core, Hadoop gives users a distributed infrastructure whose low-level details are transparent.
In terms of roles, the machines in a Hadoop cluster fall into two categories: Master and Slave. An HDFS cluster consists of one NameNode and several DataNodes. The NameNode is the primary server: it manages the file system namespace and client access to files. The DataNodes manage the stored data. The MapReduce framework consists of a single JobTracker running on the master node and one TaskTracker running on each slave node. The master node schedules all the tasks that make up a job, and these tasks are distributed across the slave nodes. The master monitors their execution and re-runs failed tasks; the slave nodes only execute the tasks the master assigns to them. When a job is submitted, the JobTracker receives the job and its configuration, distributes the configuration to the slave nodes, schedules the tasks, and monitors the TaskTrackers' execution.
As the introduction above shows, HDFS and MapReduce together form the core of Hadoop's distributed architecture. HDFS implements a distributed file system on the cluster, and MapReduce implements distributed computing and task processing on the cluster. HDFS provides file operations and storage support for MapReduce task processing; on top of HDFS, MapReduce handles task distribution, tracking, and execution, and collects the results. Together, the two carry out the main work of a Hadoop cluster.
Why should you use version 2.0 (from Dong's blog)

This version provides some important new features, including:
• HDFS HA; currently only manual failover is supported.
The Hadoop HA branch has been merged into this version and is supported. Its main features include:
(1) The NN configuration files have changed, making configuration simpler
(2) The NameNode is split into two roles, active NN and standby NN. The active NN serves external reads and writes; if it fails, service switches to the standby NN.
(3) Client-side redirection is supported: while the active NN is being switched to the standby NN, all client operations are redirected to the standby NN seamlessly and transparently, and the client itself does not notice the switch.
(4) DataNodes send block reports to the active NN and the standby NN at the same time.
Specific design document reference: https://issues.apache.org/jira/browse/HDFS-1623
Hadoop HA currently supports only manual failover, which is still useful in some situations, such as upgrading the NameNodes: first switch service to the standby NN, upgrade the former active NN, then switch back to the upgraded NN once its upgrade is complete, and finally upgrade the standby NN.

• YARN, the next-generation MapReduce: a unified resource management and scheduling platform that can manage a variety of computing frameworks, including MapReduce, Spark, and MPI.
Although YARN is a complete rewrite, its ideas derive from MapReduce, and it overcomes many of MapReduce's deficiencies in scalability, fault tolerance, and other areas.
• HDFS Federation, which allows multiple NameNodes in one HDFS, each responsible for part of the directory tree, while the DataNodes remain unchanged. This narrows the impact of a failure and provides isolation.
Traditional HDFS has a master/slave structure in which the master (i.e. the NameNode) must hold the metadata for the entire file system, and every file operation has to go through the NameNode, so the NameNode becomes the major bottleneck limiting scalability. To solve this problem, HDFS Federation allows multiple NameNodes in one HDFS, each responsible for part of the directory tree, while the DataNodes remain unchanged; that is, "from centralized dictatorship to local self-government". This narrows the scope of a failure and provides isolation.
• Benchmarks
This version adds a set of performance benchmarks for HDFS and YARN. The HDFS tests include:
(1) dfsio: benchmarks HDFS I/O read and write performance
(2) slive: benchmarks the performance of NameNode internal operations
(3) scan: benchmarks the I/O performance of MapReduce jobs accessing HDFS
(4) shuffle: benchmarks shuffle-stage performance
(5) compression: benchmarks compression performance for the intermediate and final results of MapReduce jobs
(6) gridmix-V3: benchmarks cluster throughput
The YARN tests include:
(1) ApplicationMaster scalability benchmark
Mainly tests the performance of scheduling tasks/containers. Compared with version 1.0, it is about 2 times faster.
(2) ApplicationMaster recovery benchmark
Tests job recovery after YARN restarts. A brief explanation of ApplicationMaster recovery: while a job runs, the ApplicationMaster continually saves the job's status to disk, such as which tasks have completed and which have not. If the cluster goes down or restarts, the status of each job can be recovered afterwards, and only the unfinished tasks need to be re-run.
(3) ResourceManager scalability benchmark
Tests RM scalability by continually adding nodes to the Hadoop cluster.
(4) Small-jobs benchmark
Specifically tests the throughput of batches of small jobs
• Wire compatibility for both HDFS and YARN, provided through protobufs
Hadoop RPC uses Hadoop's own serialization framework to serialize and deserialize objects, but this has a problem: poor extensibility, which makes it hard to add new data types while preserving version compatibility. For this reason, Hadoop 2.0 separates the data-type module from RPC into an independent, pluggable module, which lets users choose among serialization/deserialization frameworks such as Thrift, Avro, and Protocol Buffers. Protocol Buffers is the default.
In addition to these five features, two other very important features are under development:
• HDFS snapshots
Users can take a snapshot of HDFS at any time, so that when HDFS fails, data can be restored to its state at a given point in time.
• HDFS HA automatic failover
The "HDFS HA" feature introduced first above currently supports only manual failover; that is, an administrator runs a command to switch from the active NN to the standby NN. Automatic failover will be supported in the future: a monitoring module will detect when the active NN fails and switch to the standby NN automatically, which can greatly reduce the workload of Hadoop cluster operations staff.



Machine preparation

There are four physical machines in total, on which a Hadoop cluster of four nodes is configured: one Master and three Slaves, connected over a LAN. The nodes can ping each other.
IP assignment: hadoop1 hadoop2 hadoop3 hadoop4

The operating system is CentOS 5.6 64-bit
The Master machine is configured with the NameNode and JobTracker roles and is responsible for overall management of the distributed data and for breaking down and scheduling tasks. The three Slave machines are configured with the DataNode and TaskTracker roles and are responsible for distributed data storage and task execution. Ideally there should also be a second Master machine as a standby, so that a spare can take over if the Master server goes down. A standby Master will be added later, once more experience has been accumulated.
Create the account

Log in to every machine as root and create the hadoop user on each one
useradd hadoop
passwd hadoop

This creates a hadoop directory under /home/; the directory path is /home/hadoop

Create the relevant directories

Define the directories needed for code and data storage

Define the paths for storing code and tools
mkdir -p /home/hadoop/source
mkdir -p /home/hadoop/tools

Define the data storage path: the hadoop folder under the root directory. The partition that holds the data node directory needs enough free space for the stored data
mkdir -p /hadoop/hdfs
mkdir -p /hadoop/tmp
mkdir -p /hadoop/log
Make the directories writable
chmod -R 777 /hadoop

Define the Java installation path
mkdir -p /usr/java
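
Because the same directories are needed on every node, the commands above can be collected into one function. This is only a sketch: the function name make_tutorial_dirs and its optional prefix argument are illustrative assumptions, added so that the layout can be tried out under a scratch directory before running it as root against the real paths.

```shell
#!/bin/sh
# make_tutorial_dirs [PREFIX]
# Creates the directory layout used in this tutorial. With no argument the
# paths are created under /, which normally requires root. PREFIX is a
# hypothetical convenience for dry runs in a scratch directory.
make_tutorial_dirs() {
    prefix="${1:-}"
    for d in /home/hadoop/source /home/hadoop/tools \
             /hadoop/hdfs /hadoop/tmp /hadoop/log /usr/java; do
        mkdir -p "$prefix$d" || return 1
    done
    # Same permissive mode as in the tutorial; tighten it if the cluster
    # is shared with other users.
    chmod -R 777 "$prefix/hadoop"
}
```

Calling make_tutorial_dirs with no argument as root on each node reproduces the steps above; calling it with a temporary directory lets you inspect the result first.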


Install the JDK

Download the JDK installation file for 64-bit Linux: jdk-6u32-linux-x64.bin
1. Copy the downloaded jdk-6u32-linux-x64.bin to /usr/java over SSH
scp -r ./jdk-6u32-linux-x64.bin root@hadoop1:/usr/java
2. Enter the JDK installation directory with cd /usr/java and run chmod +x jdk-6u32-linux-x64.bin
3. Run ./jdk-6u32-linux-x64.bin
4. Configure the environment variables: run cd /etc, then vi profile, and add the following lines at the end
export JAVA_HOME=/usr/java/jdk1.6.0_32
export CLASSPATH=.:$JAVA_HOME/lib/tools.jar:$JAVA_HOME/lib/dt.jar
export PATH=$JAVA_HOME/bin:$PATH
5. Run chmod +x profile to make it executable
6. Source the profile so that the configuration takes effect immediately
source /etc/profile
7. Run java -version to check whether the installation succeeded

This step must be performed on all machines

[root@hadoop1 bin]# java -version
java version "1.6.0_32"
Java(TM) SE Runtime Environment (build 1.6.0_32-b05)
Java HotSpot(TM) 64-Bit Server VM (build 20.7-b02, mixed mode)
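
Since step 4 has to be repeated on every machine, the three export lines can be appended by a small helper instead of being typed into vi each time. The sketch below assumes the JDK path used in this tutorial as its default; the function name write_java_env and its arguments are made up for illustration and are not part of the JDK.

```shell
#!/bin/sh
# write_java_env TARGET_FILE [JDK_DIR]
# Appends the JDK environment variables to TARGET_FILE (for example
# /etc/profile). JDK_DIR defaults to the install path from this tutorial.
write_java_env() {
    target="$1"
    jdk_dir="${2:-/usr/java/jdk1.6.0_32}"
    # \$ keeps the variable references literal in the written file, so
    # they are expanded when the profile is sourced, not now.
    cat >> "$target" <<EOF
export JAVA_HOME=$jdk_dir
export CLASSPATH=.:\$JAVA_HOME/lib/tools.jar:\$JAVA_HOME/lib/dt.jar
export PATH=\$JAVA_HOME/bin:\$PATH
EOF
}
```

As root, write_java_env /etc/profile followed by source /etc/profile corresponds to steps 4 and 6. Running the helper twice appends the lines twice, so call it once per machine.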


Modify the host name

Modify the host name; the configuration is the same on all nodes
1. Connect to the master node and edit the network file: run vim /etc/sysconfig/network and set HOSTNAME=hadoop1
2. Edit the hosts file: run cd /etc, then vi hosts, and add one line per node at the end of the file (IP address followed by host name) for hadoop1 hadoop2 hadoop3 hadoop4

3. Run hostname hadoop1
4. Run exit and reconnect; the new host name should now be shown

Then modify the host name on the other nodes in the same way. The hosts file does not need to be edited by hand on each node; it can be copied over later with scp
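
The hosts-file entries from step 2 can be generated rather than typed. The following is a minimal sketch: hosts_entries is a hypothetical helper name, and the 192.0.2.x addresses are placeholders from the documentation address range, standing in for the real LAN addresses of the four machines.

```shell
#!/bin/sh
# hosts_entries IP NAME [IP NAME ...]
# Prints one /etc/hosts line (IP, tab, host name) per IP/NAME pair.
hosts_entries() {
    while [ "$#" -ge 2 ]; do
        printf '%s\t%s\n' "$1" "$2"
        shift 2
    done
}

# Placeholder addresses only; replace them with the nodes' real IPs.
hosts_entries \
    192.0.2.11 hadoop1 \
    192.0.2.12 hadoop2 \
    192.0.2.13 hadoop3 \
    192.0.2.14 hadoop4
```

Redirecting the output with >> /etc/hosts as root on hadoop1, then copying that file to the other nodes with scp, keeps all four hosts files identical.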

Configuring passwordless SSH

1. How passwordless SSH works:
First, a key pair consisting of a public key and a private key is generated on hadoop1, and the public key is copied to all slaves (hadoop2-hadoop4).
Then, when the master connects to a slave over SSH, the slave generates a random number, encrypts it with the master's public key, and sends it to the master.
Finally, the master decrypts it with its private key and returns the decrypted number to the slave. Once the slave confirms that the number is correct, it allows the master to connect without entering a password

2. Concrete steps (perform them both as root and as the hadoop user)
1. Run ssh-keygen -t rsa and press Enter at every prompt to generate a key pair with no passphrase; then run cd .ssh and ll to see the generated files
2. Append id_rsa.pub to the authorized keys: run cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
3. Fix the permissions: run chmod 600 ~/.ssh/authorized_keys
4. Check with cat /etc/ssh/sshd_config that the file contains the following

RSAAuthentication yes
PubkeyAuthentication yes
AuthorizedKeysFile .ssh/authorized_keys
If you have to change the file, restart the SSH server afterwards for the change to take effect: service sshd restart

5. Copy the public key to every slave machine: scp ~/.ssh/id_rsa.pub <slave-host>:~/ (where <slave-host> stands for each slave in turn), answer yes at the prompt, then enter the slave machine's password
6. Create the .ssh folder on the slave machine: mkdir ~/.ssh, then run chmod 700 ~/.ssh (not needed if the folder already exists)
7. Append the key to the authorized_keys file: run cat ~/id_rsa.pub >> ~/.ssh/authorized_keys, then chmod 600 ~/.ssh/authorized_keys
8. Repeat step 4
9. Verify: on the master machine run ssh followed by the slave's host name, for example ssh hadoop3. If the host name in the prompt changes from hadoop1 to hadoop3, it succeeded. Finally, delete the copied id_rsa.pub file: rm -r id_rsa.pub
Configure hadoop1, hadoop2, hadoop3, and hadoop4 following the steps above, so that every machine can log in to every other machine without a password
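
Steps 6 and 7 are repeated on every machine for every key, so it helps if they can be re-run safely. The sketch below wraps them in a function that skips a key that is already authorized. The name authorize_key and the optional directory argument are assumptions made for this example; they are not OpenSSH commands.

```shell
#!/bin/sh
# authorize_key PUBKEY_FILE [SSH_DIR]
# Appends the public key in PUBKEY_FILE to SSH_DIR/authorized_keys
# (default: ~/.ssh) unless the same line is already there, and applies
# the permissions from steps 3 and 6.
authorize_key() {
    pubkey_file="$1"
    ssh_dir="${2:-$HOME/.ssh}"
    mkdir -p "$ssh_dir" || return 1
    chmod 700 "$ssh_dir" || return 1
    touch "$ssh_dir/authorized_keys"
    # -x: whole-line match, -F: fixed string, -f: read the pattern from file
    if ! grep -qxF -f "$pubkey_file" "$ssh_dir/authorized_keys"; then
        cat "$pubkey_file" >> "$ssh_dir/authorized_keys"
    fi
    chmod 600 "$ssh_dir/authorized_keys"
}
```

On a slave, authorize_key ~/id_rsa.pub after the scp in step 5 has the same effect as steps 6 and 7, and running it a second time leaves authorized_keys unchanged.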

Source download

Hadoop version
The latest version is hadoop-2.0.0-alpha; the installation package is hadoop-2.0.0-alpha.tar.gz
Official download address: http://www.apache.org/dyn/closer.cgi/hadoop/common/
Download it into the /home/hadoop/source directory
wget http://ftp.riken.jp/net/apache/hadoop/common/hadoop-2.0.0-alpha/hadoop-2.0.0-alpha.tar.gz
Extract it
tar zxvf hadoop-2.0.0-alpha.tar.gz

Create a soft link
cd /home/hadoop
ln -s /home/hadoop/source/hadoop-2.0.0-alpha/ ./hadoop
Source configuration changes

/etc/profile

Configure the environment variables: vim /etc/profile
Add:
export HADOOP_DEV_HOME=/home/hadoop/hadoop
export PATH=$PATH:$HADOOP_DEV_HOME/bin
export PATH=$PATH:$HADOOP_DEV_HOME/sbin
export HADOOP_CONF_DIR=${HADOOP_DEV_HOME}/etc/hadoop
export HDFS_CONF_DIR=${HADOOP_DEV_HOME}/etc/hadoop
export YARN_CONF_DIR=${HADOOP_DEV_HOME}/etc/hadoop
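
Shell variable names are case-sensitive, so a misspelled export (for instance Hadoop_DEV_HOME instead of HADOOP_DEV_HOME) silently leaves every dependent path pointing at /etc/hadoop. A quick check after sourcing the profile can catch that. This is a sketch only; check_hadoop_env is a hypothetical helper, not a Hadoop command.

```shell
#!/bin/sh
# check_hadoop_env
# Prints "unset: NAME" for each variable from the profile that is empty
# or missing, and returns non-zero if any of them is.
check_hadoop_env() {
    status=0
    for name in HADOOP_DEV_HOME HADOOP_CONF_DIR HDFS_CONF_DIR YARN_CONF_DIR; do
        # Indirect lookup of the variable called $name (POSIX sh has no ${!name}).
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "unset: $name"
            status=1
        fi
    done
    return $status
}
```

Run source /etc/profile and then check_hadoop_env on each node; no output and a zero exit status mean that all four variables are set.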

Create and configure hadoop-env.sh

vim /home/hadoop/hadoop/etc/hadoop/hadoop-env.sh
At the end, add export JAVA_HOME=/usr/java/jdk1.6.0_32


Add the properties inside the configuration node