I have been doing big-data back-end development for over a year, and as the Hadoop community keeps evolving I keep trying new things. This article is about Ambari, a new Apache project designed to make it quick and easy to configure and deploy the components of the Hadoop ecosystem, and to provide maintenance and monitoring for them.
As a newcomer, let me describe my own learning experience. When I first started, the easiest thing was of course to Google Hadoop, download the packages, and install a stand-alone Hadoop in my own virtual machine (CentOS 6.3) for testing. I wrote a few test classes, did some CRUD-style tests, and ran Map/Reduce tests. At that point I did not understand Hadoop very well, so I kept reading articles about the overall architecture. All I actually did myself was modify a few configuration files under conf so that Hadoop would run normally, which took several rounds of changes.
After that stage I started using HBase, another product in the Hadoop ecosystem. Again I modified the configuration, started the services with start-all.sh and start-hbase.sh, then modified my program and tested it. Along with HBase I learned a little about ZooKeeper, Hive and the like. After that phase I began to study Hadoop 2.0, and by then I had some understanding of the Hadoop ecosystem as a whole, although the technologies I touched in development at my company were limited to these.
But as someone who likes to explore, I wanted to know more. How does it perform? How exactly does it work? Looking at the slide decks from large companies (Taobao and others), they run dozens, hundreds, or even thousands of nodes. How do they manage them, and what is the performance like? Looking at the curves in those performance-test slides, could I understand them in detail and tune my own projects the same way? I seem to have found the answer, and it is Ambari, a Hadoop-related project developed by Hortonworks. See the official site for details.
Understanding the Hadoop ecosystem
These days we often see keywords such as HDFS, MapReduce, HBase, Hive, ZooKeeper, Pig, Sqoop, Oozie, Ganglia, Nagios, CDH3, CDH4, Flume, Scribe, Fluentd, HttpFS and so on, and there are really many more. The Hadoop ecosystem is now quite prosperous, but who is driving that prosperity? Those who have read the history of Hadoop may know that it first started at Yahoo, but it is now mainly maintained by two companies, Hortonworks and Cloudera, and most of the committers belong to these two companies. So there are now two major versions on the market: the CDH series and the community edition. I first used the community edition, later switched to CDH3, and have now switched back to the community edition because of Ambari. Of course, it does not matter much which one you use; as long as your skills are up to it, you can modify any of them to run properly. I won't say more about that. Enough rambling; let's start on installing Ambari.
First, get to know Ambari. Project address: http://incubator.apache.org/ambari/
Installation documentation: http://incubator.apache.org/ambari/1.2.2/installing-hadoop-using-ambari/content/index.html
Someone at Hortonworks wrote an article describing how to install it. My translation of it is here: http://www.linuxidc.com/Linux/2014-05/101530.htm
When installing, be sure to read the installation documentation carefully and configure the package sources that match the system version you are using. The installation takes a relatively long time, so follow each step of the documentation carefully. Below I describe the problems I ran into.
The following is my own installation process.
My test environment consists of nine worn-out HP machines, cloud100 through cloud108, with cloud108 as the management node.
Paths used by the Ambari installation:
Installation directories on each machine:
/usr/lib/hadoop
/usr/lib/hbase
/usr/lib/zookeeper
/usr/lib/hcatalog
/usr/lib/hive
Log paths. When you need to look at error messages, you can find them in these log directories:
/var/log/hadoop
/var/log/hbase
Configuration file paths:
/etc/hadoop
/etc/hbase
/etc/hive
HDFS storage path:
/hadoop/hdfs
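When a service fails to start, the log directories listed above are the first place to look. The following is only a small sketch: it assumes the log files are plain text and that error lines contain the word ERROR (the usual log4j level name), and find_error_logs is a hypothetical helper name, not part of Ambari or Hadoop.

```shell
#!/bin/sh
# Hypothetical helper: list the files under the given log directories
# that contain the word ERROR. Directories that do not exist are skipped.
find_error_logs() {
    for dir in "$@"; do
        if [ -d "$dir" ]; then
            # grep -r searches recursively, -l prints only the file names
            grep -rl 'ERROR' "$dir" || true
        fi
    done
}
```

On a cluster node it could be called as find_error_logs /var/log/hadoop /var/log/hbase, and the files it prints can then be opened to read the actual messages.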
Points to note during installation:
1. Before installing, set up password-free SSH login to every machine (see http://www.linuxidc.com/Linux/2014-05/101532.htm). Once this is done, the management node can log in to each cluster node this way.
2. If Hadoop-related services were installed on your machines before, and in particular if the HBASE_HOME environment variable was configured for HBase, you need to unset it, because this environment variable will interfere. I had previously put these paths into /etc/profile, which affected HBase, because the paths Ambari installs to may differ from those of your earlier installation.
3. On the service selection page, NameNode and SNameNode need to be placed on the same node. I had separated them earlier when trying to set up HA, but the SNameNode would not come up, which caused the startup to fail. I will need to spend more time on HA later.
4. If the JobTracker is not placed on the same node as the NameNode, it will not start.
5. The number of DataNodes cannot be less than the block replication factor; basically you need >= 3.
6. At the Confirm Hosts step, pay attention to the information in the warnings and deal with every related warning, because some warnings will cause installation errors.
7. Remember the new users created during installation; you will need to use these users later.
8. I deployed Hive and the HBase Master on the same node, although you can of course separate them. Once everything is set, start the installation.
9. If the installation fails, here is how to reinstall.
First, delete the directories that have already been installed:
sh file_cp.sh cmd "rm -rf /usr/lib/hadoop && rm -rf /usr/lib/hbase && rm -rf /usr/lib/zookeeper"
sh file_cp.sh cmd "rm -rf /etc/hadoop && rm -rf /etc/hbase && rm -rf /hadoop && rm -rf /var/log/hadoop"
sh file_cp.sh cmd "rm -rf /etc/ganglia && rm -rf /etc/hcatalog && rm -rf /etc/hive && rm -rf /etc/nagios && rm -rf /etc/sqoop && rm -rf /var/log/hbase && rm -rf /var/log/nagios && rm -rf /var/log/hive && rm -rf /var/log/zookeeper && rm -rf /var/run/hadoop && rm -rf /var/run/hbase && rm -rf /var/run/zookeeper"
Then remove the related packages that were installed through yum:
sh file_cp.sh cmd "yum -y remove ambari-log4j hadoop hadoop-lzo hbase hive libconfuse nagios sqoop zookeeper"
The file_cp.sh used above is a shell script I wrote myself to make it easy to run commands across multiple machines.
Finally, reset Ambari Server (for example with ambari-server reset) before installing again.
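The script itself is not reproduced in this article. As a rough sketch only, a helper of this kind might look like the following. It assumes the host names cloud100 to cloud108 from the test environment above and working password-free SSH; the function names and the DRY_RUN switch are hypothetical and are not taken from the original script.

```shell
#!/bin/sh
# Hypothetical stand-in for file_cp.sh; the real script is not shown.
# Usage: sh file_cp.sh cmd "<command>"
# Set DRY_RUN=1 to print the ssh commands instead of running them.

# Print the host names cloud100 .. cloud108, one per line.
list_hosts() {
    i=100
    while [ "$i" -le 108 ]; do
        echo "cloud$i"
        i=$((i + 1))
    done
}

# Run (or, with DRY_RUN=1, only print) one command on every host over ssh.
run_on_all() {
    for host in $(list_hosts); do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "ssh $host \"$1\""
        else
            echo "== $host =="
            ssh "$host" "$1"
        fi
    done
}

# Only the "cmd" sub-command is sketched here.
if [ "${1:-}" = "cmd" ] && [ -n "${2:-}" ]; then
    run_on_all "$2"
fi
```

Because the cleanup commands in this step are destructive, running with DRY_RUN=1 first is a cheap way to check exactly what would be sent to each node.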
10. Pay attention to time synchronization; clocks that are out of sync between machines can cause problems with the RegionServers.
11. iptables needs to be turned off. Because the machines may be restarted at some point, stopping it with service is not enough; you also need to disable it with chkconfig.
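Several of the notes above (1, 2, 10 and 11) can be turned into a short pre-install checklist. The sketch below makes a number of assumptions that are not in the original text: the nodes are CentOS 6 style systems with the SysV service and chkconfig tools, ntpd is installed, login is as root, and an SSH key already exists on the management node (created with ssh-keygen). The function names are hypothetical. The second function only prints commands so they can be reviewed before anything is run.

```shell
#!/bin/sh
# Pre-install checklist sketch (assumptions: root login, ntpd installed,
# SysV "service" and "chkconfig" available on every node).

# Note 2: warn if a leftover HBASE_HOME is still exported in this shell.
check_hbase_home() {
    if [ -n "${HBASE_HOME:-}" ]; then
        echo "HBASE_HOME is set to $HBASE_HOME; unset it and remove it from /etc/profile"
        return 1
    fi
    echo "HBASE_HOME is not set"
}

# Print the commands that would prepare one node; nothing is executed here.
node_prep_commands() {
    host="$1"
    # Note 1: copy the management node's public key for password-free login
    echo "ssh-copy-id root@$host"
    # Note 10: keep the clock in sync now and after a reboot
    echo "ssh root@$host 'service ntpd start && chkconfig ntpd on'"
    # Note 11: stop iptables now and keep it off after a reboot
    echo "ssh root@$host 'service iptables stop && chkconfig iptables off'"
}
```

For example, node_prep_commands cloud100 prints three command lines for that host, which can then be run by hand once they look right.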
Once the installation has finished, log in at the following address to check the state of the services:
http://<management node IP>:8080, for example mine is http://192.168.1.108:8080/. To log in, enter the account and password that you set when installing Ambari Server.
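Before opening a browser, it can be handy to check from the command line that the web server is answering on port 8080. This is a small sketch that assumes curl is installed; ambari_url and ambari_status are hypothetical helper names.

```shell
#!/bin/sh
# Build the Ambari web UI address for a management node (port 8080, as above).
ambari_url() {
    echo "http://$1:8080/"
}

# Print the HTTP status code the server returns for that address.
# -s: silent, -o /dev/null: discard the body, -w: print only the status code.
ambari_status() {
    curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$(ambari_url "$1")"
}
```

With the address used in this article the call would be ambari_status 192.168.1.108.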
See Ganglia monitoring
See Nagios monitoring
After the installation has finished and everything here looks normal, do you still need to test it yourself? The installer has already run its smoke tests, so things are basically fine, but it is still worth trying a few operations yourself.
Verify Map/Reduce
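One common way to check Map/Reduce by hand is to run the pi estimator from the examples jar that ships with Hadoop. The sketch below is not taken from the original article, and the jar location in particular is an assumption; look under /usr/lib/hadoop on your own installation for the actual file name. The job should be run as one of the users created during installation, and the function names are hypothetical.

```shell
#!/bin/sh
# Assumed location of the examples jar; adjust to match your installation.
EXAMPLES_JAR="${EXAMPLES_JAR:-/usr/lib/hadoop/hadoop-examples.jar}"

# Print the command for the bundled "pi" example.
# $1 = number of map tasks, $2 = samples per map task
pi_job_command() {
    echo "hadoop jar $EXAMPLES_JAR pi $1 $2"
}

# Actually submit the job; only runs when called explicitly.
run_pi_job() {
    hadoop jar "$EXAMPLES_JAR" pi "$1" "$2"
}
```

For a quick check, run_pi_job 10 100 submits a small job, and its progress can then be followed on the JobTracker.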
Summary
At this point the configuration of Hadoop, HBase, Hive and the related components is complete, and the next step is to do some stress tests and tests of other aspects. Ambari uses the rpm builds of Hadoop packaged by Hortonworks, so there may be some differences from other versions, but for a development environment this has little impact for now. I have not used it in production yet, so I cannot say how stable it is. I will list the bugs I run into as my project development goes on.
Overall, Ambari is well worth using. It saves a lot of unnecessary configuration time, and compared with a stand-alone environment, a clustered environment is much closer to production for performance testing and tuning. The Ganglia and Nagios monitoring is also configured and deployed for you, so you can view data about the cluster. In general I recommend it. Bugs are inevitable in anything new, but it will keep improving as it is used. If I have time later, I would like to extend Ambari Server, for example by adding monitoring options for common high-performance components such as redis and nginx. That is all for now. Welcome, Ambari.
Some problems I have run into with Ambari recently:
1. After turning on the append option in the custom configuration, append still does not work.