  A detailed introduction to the Hadoop ecosystem
     
  Add Date : 2017-08-31      
         
         
         
  1, Hadoop Ecosystem Overview

Hadoop is a software framework for the distributed processing of large amounts of data. It is reliable, efficient, and scalable.

The core of Hadoop is HDFS and MapReduce; Hadoop 2.0 also includes YARN.

2, HDFS (Hadoop Distributed File System)

HDFS originates from Google's GFS paper, published in October 2003; HDFS is an open-source clone of GFS.

HDFS is the foundation of data storage and management in the Hadoop system. It is a highly fault-tolerant system designed to run on low-cost commodity hardware, able to detect and respond to hardware failures. HDFS simplifies the file consistency model and provides high-throughput, streaming access to application data, making it well suited to applications with large data sets.

HDFS has the following main components:

(1) Client: splits files into blocks; accesses HDFS; interacts with the NameNode to obtain file location information; interacts with DataNodes to read and write data.

(2) NameNode: the master node; in Hadoop 1.x there is only one. It manages the HDFS namespace and the block mapping information, configures the replication strategy, and handles client requests. For large clusters, Hadoop 1.x has two major drawbacks: 1) the NameNode's memory becomes a bottleneck, which limits NameNode scalability; 2) the NameNode is a single point of failure.

Hadoop 2.x addresses both of these defects. For defect 1) it introduces HDFS Federation, in which multiple NameNodes serve multiple namespaces; this scales the NameNode horizontally and relieves the memory pressure on any single NameNode.

For defect 2), Hadoop 2.x introduces NameNode HA with two NameNodes in hot standby: one in the standby state and one in the active state.

(3) DataNode: a slave node; it stores the actual data and reports storage information to the NameNode.

(4) Secondary NameNode: assists the NameNode and shares part of its workload; it periodically merges the fsimage and edits files and pushes the result back to the NameNode; in an emergency it can help recover the NameNode, but it is not a hot standby for the NameNode.

As long as the disk is intact, the NameNode can be recovered from the Secondary NameNode's checkpoint.
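To make the client's role concrete, here is a minimal sketch (not part of the original article) that writes and then reads a file through the HDFS Java API; the path /tmp/demo.txt and the fs.defaultFS address are assumptions for illustration only.

    import java.nio.charset.StandardCharsets;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class HdfsClientDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Assumed NameNode address; in practice this comes from core-site.xml.
            conf.set("fs.defaultFS", "hdfs://namenode:8020");

            try (FileSystem fs = FileSystem.get(conf)) {
                Path file = new Path("/tmp/demo.txt");

                // The client asks the NameNode where to place blocks, then writes to DataNodes.
                try (FSDataOutputStream out = fs.create(file, true)) {
                    out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
                }

                // Reading streams the blocks back from the DataNodes.
                try (FSDataInputStream in = fs.open(file)) {
                    IOUtils.copyBytes(in, System.out, 4096, false);
                }
            }
        }
    }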

3, Mapreduce (distributed computing framework)

A MapReduce paper from google, published in December 2004, Hadoop MapReduce is a Google MapReduce clone. MapReduce is a computational model used to calculate large amounts of data. Map on the data set where the independent elements of the specified operation, generate key-value pairs in the form of intermediate results. Reduce treats all "values" of the same "key" in the intermediate result to get the final result. MapReduce such a functional division, is ideal for a large number of computers in a distributed parallel environment for data processing.

MapReduce computing framework to the development of the MapReduce now has two versions of the API, the main components for MR1 has the following components:

(1) JobTracker: the master node; there is only one. Its main responsibilities are resource allocation, job scheduling, and supervision: it manages all jobs, monitors jobs and tasks, handles errors, and splits each job into a series of tasks that it assigns to TaskTrackers.

(2) TaskTracker: a slave node; it runs Map Tasks and Reduce Tasks and interacts with the JobTracker to report task status.

(3) Map Task: parses each data record and passes it to the user-written map() function; the output is written to the local disk.

(4) Reduce Task: reads the Map Task outputs remotely, sorts and groups the data, and passes it to the user-written reduce() function (a minimal sketch of both user functions follows this list).
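To make the roles of the user-written map() and reduce() functions concrete, here is a minimal word-count sketch against the MR2 (org.apache.hadoop.mapreduce) API. It is illustrative only; the class names TokenMapper and SumReducer are assumptions, not from the original article.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Map: emit (word, 1) for every word in the input line.
    class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce: sum all counts that arrive grouped under the same word.
    class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }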

Between the two phases there is a shuffle process that is essential for understanding the MapReduce computing framework: it covers everything that happens from the output of the map function to the input of the reduce function, and it can be divided into a map side and a reduce side.

Map side:

1) The input data is split into input splits; the split size depends on the original file size and the file block size. Each split corresponds to one map task.

2) While a map task runs, its results are buffered in memory. When memory usage reaches a certain threshold (the threshold is configurable), the intermediate results are written to the local disk as a temporary file; this is called a spill.

3) During the spill, the output is partitioned according to the number of reduce tasks, each partition corresponding to one reduce task; this is the partitioning step. The data within each partition is also sorted as it is written. A combiner can optionally be applied during the spill; its result must be consistent with what the reduce function would produce, so its applicability is limited and it should be used with care (the driver sketch after this list shows how the combiner and partitioner are configured).

4) Each map task must end with a single output file to serve as reduce input, so the multiple spill files on disk are merged into one temporary file that is internally partitioned.
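To illustrate how the partition count and the optional combiner from step 3) are configured, here is a sketch of a job driver; the class and path names are assumptions. Setting the number of reduce tasks fixes the number of partitions, and the reducer is reused as the combiner here, which is valid only because summing is associative and commutative.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCountDriver.class);

            job.setMapperClass(TokenMapper.class);
            job.setCombinerClass(SumReducer.class);         // combiner runs during the spill
            job.setReducerClass(SumReducer.class);
            job.setPartitionerClass(HashPartitioner.class); // the default; shown for clarity

            job.setNumReduceTasks(2);                       // number of partitions = number of reducers
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            FileInputFormat.addInputPath(job, new Path("/input"));    // assumed path
            FileOutputFormat.setOutputPath(job, new Path("/output")); // assumed path
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }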

Reduce side:

1) First, to make the data local, the map outputs on remote nodes are copied to the local node.

2) Merge: the map outputs copied from different nodes are merged.

3) Copying and merging continue until a single input file is formed. The final results of the reduce computation are stored in HDFS.

MR2 is the new generation of the MapReduce API. It runs mainly on the YARN resource management framework.

4, YARN (resource management framework)

This framework is the result of optimizing the JobTracker/TaskTracker model used before Hadoop 2.x: the JobTracker's resource allocation is separated from its job scheduling and supervision. The framework consists mainly of the ResourceManager, the ApplicationMaster, and the NodeManager. It works as follows: the ResourceManager is responsible for resource allocation for all applications, while an ApplicationMaster is responsible for scheduling the tasks of a single job, i.e., each job has its own ApplicationMaster. The NodeManager receives commands from the ResourceManager and the ApplicationMaster and carries out the actual allocation of resources.

When the ResourceManager receives a job submission from a client, it allocates a Container (note that the ResourceManager allocates resources in units of Containers). The first allocated Container is used to start the ApplicationMaster, which is responsible for scheduling the job. Once started, the ApplicationMaster communicates directly with the NodeManagers.

In YARN, resource management is handled jointly by the ResourceManager and the NodeManagers: the scheduler inside the ResourceManager is responsible for allocating resources, while the NodeManagers are responsible for supplying and isolating them. After the ResourceManager assigns resources on a NodeManager to a task (this is called resource scheduling), the NodeManager provides those resources to the task as required and even ensures that they are exclusive to it, providing the basis for the task to run; this is known as resource isolation.

Multiple computing frameworks can run on the YARN platform, for example MapReduce, Tez, Storm, and Spark.
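To illustrate the submission flow described above (the client asks the ResourceManager for a Container in which the ApplicationMaster is launched), here is a sketch using the YarnClient API; the application name, queue, and ApplicationMaster launch command are assumptions.

    import java.util.Collections;
    import org.apache.hadoop.yarn.api.records.ApplicationId;
    import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
    import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
    import org.apache.hadoop.yarn.api.records.Resource;
    import org.apache.hadoop.yarn.client.api.YarnClient;
    import org.apache.hadoop.yarn.client.api.YarnClientApplication;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    public class YarnSubmitDemo {
        public static void main(String[] args) throws Exception {
            YarnClient yarnClient = YarnClient.createYarnClient();
            yarnClient.init(new YarnConfiguration());
            yarnClient.start();

            // Ask the ResourceManager for a new application.
            YarnClientApplication app = yarnClient.createApplication();
            ApplicationSubmissionContext ctx = app.getApplicationSubmissionContext();
            ctx.setApplicationName("demo-app");   // assumed name
            ctx.setQueue("default");              // assumed queue

            // The command launched in the first Container starts the ApplicationMaster.
            ContainerLaunchContext amContainer = ContainerLaunchContext.newInstance(
                    Collections.emptyMap(),                     // local resources
                    Collections.emptyMap(),                     // environment
                    Collections.singletonList("my-app-master"), // assumed AM launch command
                    null, null, null);
            ctx.setAMContainerSpec(amContainer);
            ctx.setResource(Resource.newInstance(1024, 1));     // 1 GB, 1 vcore for the AM

            ApplicationId appId = yarnClient.submitApplication(ctx);
            System.out.println("Submitted application " + appId);
        }
    }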

5, Sqoop (data synchronization tool)

Sqoop is short for SQL-to-Hadoop and is used mainly to transfer data between traditional databases and Hadoop. Its imports and exports are essentially MapReduce programs that take full advantage of MapReduce's parallelism and fault tolerance; parallel import and export are implemented using only the Map tasks. Sqoop has evolved into two lines: the Sqoop 1.x.x series and the Sqoop 1.99.x series. The Sqoop 1 series is operated mainly through the command line.

Sqoop 1 import principle: Sqoop obtains metadata (schema, table, fields, field types) from the traditional database; the import is a map-only MapReduce job in which many map tasks each read a slice of the data, so the copy is completed in parallel.

Sqoop 1 export principle: Sqoop obtains the schema and metadata of the export table and matches them against the Hadoop-side fields; multiple map-only jobs run at the same time to export the data from HDFS to the relational database.

Sqoop 1.99.x is the product line of Sqoop 2. It is not yet fully functional and is still in a testing phase, so it is generally not used in production.

One potential issue with Sqoop: if a map task fails during an import or export, the ApplicationMaster reschedules another attempt of the failed task, so the data written by the failed attempt and the data written by the rescheduled attempt may end up duplicated.

6, Mahout (data mining algorithm library)

Mahout started in 2008 as a subproject of Apache Lucene; it made great progress in a very short time and is now a top-level Apache project. Implementing machine learning algorithms directly with traditional MapReduce programming usually requires a lot of development effort and long cycles, so Mahout's main goal is to provide scalable implementations of classic machine learning algorithms that help developers create intelligent applications more easily and quickly.

Mahout now includes widely used data mining methods such as clustering, classification, recommendation engines (collaborative filtering), and frequent itemset mining. Beyond the algorithms, Mahout also includes data import/export tools and supporting infrastructure for integration with other storage systems such as relational databases, MongoDB, or Cassandra.

Each Mahout component is built into a corresponding jar package. At this point one question needs to be answered: how exactly is Mahout used?

In fact, Mahout is simply a library of machine learning algorithms, containing, for example, recommender systems (both user-based and item-based), clustering, and classification algorithms. Some of these algorithms are implemented on MapReduce or Spark and can run on the Hadoop platform; in actual development, you only need to include the corresponding jar packages.

7, Hbase (distributed column-store database)

HBase originates from Google's Bigtable paper, published in November 2006. Traditional relational databases are row-oriented; HBase, an open-source clone of Google Bigtable, is a scalable, highly reliable, high-performance, distributed, column-oriented, dynamic-schema database for structured data. Unlike a traditional relational database, HBase uses the BigTable data model: an enhanced sparse sorted map (key/value), where the key is composed of a row key, a column key, and a timestamp. HBase provides random, real-time read and write access to large-scale data, and data stored in HBase can be processed with MapReduce, combining data storage with parallel computing.
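To illustrate the row key / column key / timestamp model, here is a minimal sketch using the HBase Java client; the table name "demo", column family "cf", and qualifier "greeting" are assumptions, and the table is assumed to already exist.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("demo"))) {

                // Write one cell: (row key, column family:qualifier, timestamp) -> value.
                Put put = new Put(Bytes.toBytes("row1"));
                put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("greeting"), Bytes.toBytes("hello"));
                table.put(put);

                // Random, real-time read of the same cell.
                Result result = table.get(new Get(Bytes.toBytes("row1")));
                byte[] value = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("greeting"));
                System.out.println(Bytes.toString(value));
            }
        }
    }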

Hbase table features

1) Large: a table can have billions of rows and millions of columns;

2) Schema-free: each row has a sortable primary key (row key) and an arbitrary number of columns; columns can be added dynamically as needed, and different rows in the same table can have different columns;

3) Column-oriented: storage and access control are organized by column family, and column families are retrieved independently;

4) Sparse: null cells take up no storage space, so tables can be designed to be very sparse;

5) Multi-version data: each cell can hold multiple versions of the data; by default the version number is assigned automatically and is the timestamp of the cell at insertion time;

6) Single data type: all data in HBase is stored as uninterpreted strings of bytes; there are no data types.

Hbase physical model

Each column family is stored in separate files on HDFS, and null values are not saved.

The key and the version number are kept in each column family;

HBase maintains a multi-level index for each value. Its physical storage is organized as follows:

1) All rows in a table are sorted in lexicographic order of the row key;

2) A table is divided into multiple Regions along the row direction;

3) Regions are split by size: each table starts with a single Region, and as data is added the Region grows; when it reaches a threshold it is split into two new Regions, so the number of Regions keeps increasing;

4) The Region is the smallest unit of distributed storage and load balancing in HBase; different Regions are distributed to different RegionServers;

5) Although the Region is the smallest unit of distributed storage, it is not the smallest unit of physical storage. A Region is made up of one or more Stores, each of which stores one column family; each Store consists of one MemStore and zero or more StoreFiles; StoreFiles are stored in the HFile format; the MemStore lives in memory, while StoreFiles are stored on HDFS.

8, Zookeeper (distributed coordination service)

ZooKeeper originates from Google's Chubby paper, published in November 2006; ZooKeeper is an open-source clone of Chubby. It addresses data management problems in distributed environments: unified naming, state synchronization, cluster management, and configuration synchronization.

ZooKeeper's work consists of two main steps: 1) electing a leader, and 2) synchronizing data. This component is required to implement HA (high availability) for the NameNode.
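As an illustration of how leader election is typically built on ZooKeeper (the common ephemeral-sequential-znode recipe, not an example from the original article), here is a minimal sketch; the connection string and the /election path are assumptions, and the /election parent znode is assumed to already exist.

    import java.util.Collections;
    import java.util.List;
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class LeaderElectionDemo {
        public static void main(String[] args) throws Exception {
            // Assumed connection string; the watcher callback is left empty for brevity.
            ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 15000, event -> { });

            // Each candidate creates an ephemeral sequential znode under /election.
            String myNode = zk.create("/election/candidate-", new byte[0],
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);

            // The candidate whose znode has the smallest sequence number is the leader.
            List<String> children = zk.getChildren("/election", false);
            Collections.sort(children);
            boolean isLeader = myNode.endsWith(children.get(0));
            System.out.println(isLeader ? "I am the leader" : "Standing by");
        }
    }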

9, Pig (data flow system based on Hadoop)

Open-sourced by Yahoo!, Pig's design motivation is to provide an ad-hoc (the calculation happens at query time) data analysis tool based on MapReduce.

Pig defines a data flow language, Pig Latin, and converts Pig Latin scripts into MapReduce tasks on Hadoop. It is usually used for offline analysis.

10, Hive (data warehouse based on Hadoop)

Open-sourced by Facebook, Hive was originally built to handle statistics over massive amounts of structured log data.

Hive defines a SQL-like query language (HQL) and translates HQL queries into MapReduce tasks on Hadoop. It is usually used for offline analysis.
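As an illustration of submitting HQL that Hive compiles into MapReduce jobs, here is a sketch using the HiveServer2 JDBC driver; the host name, credentials, and the web_logs table are assumptions.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveQueryDemo {
        public static void main(String[] args) throws Exception {
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            // HiveServer2 JDBC endpoint; host and database are assumed.
            String url = "jdbc:hive2://hiveserver:10000/default";
            try (Connection conn = DriverManager.getConnection(url, "hive", "");
                 Statement stmt = conn.createStatement()) {

                // This aggregate query is compiled by Hive into MapReduce jobs.
                ResultSet rs = stmt.executeQuery(
                        "SELECT level, COUNT(*) FROM web_logs GROUP BY level"); // assumed table
                while (rs.next()) {
                    System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
                }
            }
        }
    }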

11, Flume (log collection tool)

Flume is Cloudera's open-source log collection system; it is distributed, highly reliable, highly fault-tolerant, and easy to customize and extend.

Flume abstracts the path that data travels, from generation through transmission and processing to its final destination, into a data stream. At the source, Flume supports custom data senders, so data arriving over different protocols can be collected. Flume also provides simple processing of the data in the stream, such as filtering and format conversion, and it can write logs to a variety of (customizable) data targets. In general, Flume is a scalable log collection system for massive amounts of log data that is well suited to complex environments.
     
         
         
         