  A 2-minute read: the similarities and differences between the big data frameworks Hadoop and Spark
  Add Date : 2018-11-21      
  When it comes to big data, most of us are no strangers to the names Hadoop and Apache Spark. But we tend to understand them only superficially and have not thought about them in depth. Let us take a look at the similarities and differences between them.

They solve problems at different levels

First, Hadoop and Apache Spark are both big data frameworks, but they exist for different purposes. Hadoop is essentially a distributed data infrastructure: it distributes huge data sets across multiple nodes in a cluster of commodity computers for storage, which means you do not need to purchase and maintain expensive server hardware.

At the same time, Hadoop indexes and tracks this data, which makes big data processing and analysis far more efficient than before. Spark, by contrast, is a tool dedicated to processing big data that is held in distributed storage; it does not store distributed data itself.

The two can be combined or used separately

In addition to the HDFS distributed storage that everyone knows, Hadoop also provides a data processing component called MapReduce. So we can set Spark aside entirely and use Hadoop's own MapReduce to process the data.

Conversely, Spark does not necessarily depend on Hadoop to survive. But as mentioned above, it does not provide a file management system of its own, so it must be integrated with a distributed file system to work. We can choose Hadoop's HDFS, or we can choose another cloud-based data platform. By default, though, Spark is still used on top of Hadoop; after all, the two are widely considered the best combination.

The following is the most concise and clear explanation of MapReduce, as excerpted from the Internet by Zhuhai Fenduo:

We want to count all the books in the library. You count shelf No. 1, and I count shelf No. 2. This is the "Map". The more people we have, the faster the books get counted.

Now we get together and add up everyone's counts. This is the "Reduce".
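
The library analogy can be sketched in a few lines of plain Python. This is only an illustration of the map-then-reduce idea, not Hadoop's MapReduce API; the shelf contents and the helper names `count_shelf` and `add_counts` are made up for the example.

```python
from functools import reduce

# Hypothetical library: each shelf is a list of book titles.
shelves = {
    "shelf-1": ["Dune", "Emma", "Ulysses"],
    "shelf-2": ["Hamlet", "Walden"],
    "shelf-3": ["Beloved", "Ivanhoe", "Persuasion", "Dracula"],
}


def count_shelf(books):
    """Map step: one person counts the books on a single shelf."""
    return len(books)


def add_counts(left, right):
    """Reduce step: combine two partial counts into one."""
    return left + right


# Map: every shelf is counted independently of the others.
# On a real cluster these counts would run in parallel on different nodes.
partial_counts = [count_shelf(books) for books in shelves.values()]

# Reduce: the partial counts are added together into a single total.
total_books = reduce(add_counts, partial_counts, 0)

print("per-shelf counts:", partial_counts)
print("total books:", total_books)
```

Because each shelf is counted without looking at any other shelf, adding more counters speeds up the map step, which is the point of the analogy.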

Spark far outpaces MapReduce in data processing speed

Because it processes data in a different way, Spark is much faster than MapReduce. MapReduce processes data step by step: "Read data from the cluster, perform an operation, write the results to the cluster, read the updated data from the cluster, perform the next operation, write the next results to the cluster, etc.," explained Kirk Borne, a data scientist at Booz Allen Hamilton.

Spark, in contrast, completes all of the data analysis in memory in near "real time": "Read data from the cluster, perform all of the required analytic operations, write the results back to the cluster, done," Borne said. Spark can be nearly 10 times faster than MapReduce for batch processing and nearly 100 times faster for in-memory analytics.
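
The two workflows Borne describes can be compared with a toy model. This is a sketch under stated assumptions: a plain Python list stands in for cluster storage, the functions `run_step_by_step` and `run_in_memory` are hypothetical, and the counters only show how many storage round trips each style makes. It is not a benchmark and says nothing about the actual speedup figures.

```python
def run_step_by_step(data, steps):
    """MapReduce-style: every step reads from and writes back to storage."""
    storage = list(data)  # stand-in for data held on the cluster
    reads = writes = 0
    for step in steps:
        working = list(storage)          # read the current data from storage
        reads += 1
        working = [step(x) for x in working]
        storage = working                # write this step's results back
        writes += 1
    return storage, reads, writes


def run_in_memory(data, steps):
    """Spark-style: read once, run every step in memory, write once."""
    working = list(data)                 # single read from storage
    reads = 1
    for step in steps:
        working = [step(x) for x in working]  # intermediate results stay in memory
    writes = 1                           # single write of the final result
    return working, reads, writes


# Three made-up processing steps applied to made-up input data.
steps = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]
data = [1, 2, 3]

mr_result, mr_reads, mr_writes = run_step_by_step(data, steps)
mem_result, mem_reads, mem_writes = run_in_memory(data, steps)

print("step by step:", mr_result, "reads:", mr_reads, "writes:", mr_writes)
print("in memory:   ", mem_result, "reads:", mem_reads, "writes:", mem_writes)
```

Both functions produce the same result; the difference is that the step-by-step version touches storage once per step in each direction, while the in-memory version does so once in total.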

If the data to be processed and the reporting requirements are mostly static, and you have the patience to wait for batch processing to finish, MapReduce's way of processing is perfectly acceptable.

But if you need to analyze streaming data, such as data collected from sensors on a factory floor, or your application requires multiple operations on the data, then you probably should use Spark.

Most machine learning algorithms require multiple operations on the data. In addition, common application scenarios for Spark include real-time marketing campaigns, online product recommendations, cybersecurity analytics, and machine log monitoring.

Disaster Recovery

The two take different approaches to disaster recovery, but both are very good. Because Hadoop writes data to disk after every operation, it is naturally resilient to system faults.

Spark stores its data objects in what are called Resilient Distributed Datasets (RDDs), which are distributed across the data cluster. "These data objects can be stored in memory or on disk, so RDDs can also provide full disaster recovery," Borne noted.
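
One way to picture this resilience is a dataset that remembers how each partition was derived, so that a partition lost from memory can be rebuilt from durable input. The class below is a toy stand-in written for this article, not the Spark API: the name `ToyResilientDataset`, its methods, and the sample data are all hypothetical, and it assumes the source partitions themselves survive the failure.

```python
class ToyResilientDataset:
    """Toy stand-in for an RDD; illustrative only, not Spark's implementation."""

    def __init__(self, source_partitions, transform):
        self._source = source_partitions  # durable input, e.g. blocks on disk
        self._transform = transform       # recorded recipe for deriving the data
        self._cache = {}                  # derived partitions currently in memory

    def compute_all(self):
        """Derive and cache every partition."""
        for index in range(len(self._source)):
            self.get_partition(index)

    def lose_partition(self, index):
        """Simulate a node failure that drops one cached partition."""
        self._cache.pop(index, None)

    def get_partition(self, index):
        """Return a partition, recomputing it from the source if it is missing."""
        if index not in self._cache:
            self._cache[index] = [self._transform(x) for x in self._source[index]]
        return self._cache[index]

    def cached_indices(self):
        """List which partitions are currently held in memory."""
        return sorted(self._cache)


dataset = ToyResilientDataset([[1, 2], [3, 4], [5, 6]], lambda x: x * 10)
dataset.compute_all()

dataset.lose_partition(1)                 # partition 1 disappears from memory
cached_after_failure = dataset.cached_indices()

recovered = dataset.get_partition(1)      # rebuilt from the durable source
cached_after_recovery = dataset.cached_indices()

print("cached after failure:", cached_after_failure)
print("recovered partition:", recovered)
print("cached after recovery:", cached_after_recovery)
```

Only the lost partition is recomputed; the partitions that stayed in memory are reused as they are.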
  CopyRight 2002-2022 newfreesoft.com, All Rights Reserved.