When it comes to big data, Hadoop and Apache Spark are two names that need no introduction. But we tend to take them at face value without thinking about them in depth, so let's take a closer look at the similarities and differences between them.
They solve problems at different levels
First, Hadoop and Apache Spark are both big data frameworks, but they exist for different purposes. Hadoop is essentially a distributed data infrastructure: it distributes huge data sets across multiple nodes in a cluster of commodity machines, which means you do not need to purchase and maintain expensive server hardware.
At the same time, Hadoop indexes and tracks that data, raising the efficiency of big data processing and analysis to unprecedented heights. Spark, on the other hand, is a tool dedicated to processing distributed big data; it does not store distributed data itself.
They can be combined or used separately
Besides the HDFS distributed storage that everyone associates with Hadoop, Hadoop also provides a data processing component called MapReduce. So we can set Spark aside entirely and use Hadoop's own MapReduce to process our data.
Conversely, Spark does not have to depend on Hadoop to survive. But as noted above, it does not provide a file management system, so it must be integrated with a distributed file system or other storage platform to work. You can choose Hadoop's HDFS, or another cloud-based data platform. By default, though, Spark still runs on top of Hadoop; after all, the two are widely considered the best combination.
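To make the combination concrete, below is a minimal PySpark sketch of a Spark job reading its input from HDFS. The app name, namenode host and port, and file path are illustrative placeholders, not values from the article.

```python
# A minimal sketch of Spark reading its input from HDFS (placeholders:
# the "hdfs-example" app name, namenode host/port, and file path).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdfs-example").getOrCreate()

# Spark supplies the processing; Hadoop's HDFS supplies the storage.
lines = spark.read.text("hdfs://namenode:9000/data/books.txt")
print(lines.count())  # number of lines in the file

spark.stop()
```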
The following is what may be the most concise and clear explanation of MapReduce in the world, excerpted from the Internet:
We want to count all the books in the library. You count shelf No. 1, I count shelf No. 2. That is "Map". The more people we have, the faster the counting goes.
Now we get together and add up everyone's counts. That is "Reduce".
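Here is the same analogy as a toy sketch in plain Python. The shelf sizes are made up, and the Map step runs sequentially here just to show the shape of the computation; on a real cluster each shelf would be counted in parallel.

```python
# A toy, plain-Python rendering of the library analogy. Shelf sizes are
# made up; the Map step would run in parallel on a real cluster.
shelves = [
    ["book"] * 120,  # shelf No. 1
    ["book"] * 95,   # shelf No. 2
    ["book"] * 210,  # shelf No. 3
]

# Map: each counter tallies one shelf independently.
per_shelf_counts = [len(shelf) for shelf in shelves]

# Reduce: combine the partial counts into the library total.
total = sum(per_shelf_counts)
print(total)  # 425
```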
Spark crushes MapReduce in data processing speed
Because it processes data in a different way, Spark is much faster than MapReduce. MapReduce works through data step by step: "Data is read from the cluster, one operation is performed, the result is written back to the cluster, the updated data is read from the cluster, the next operation is performed, its result is written back to the cluster, and so on," as Booz Allen Hamilton data scientist Kirk Borne explains.
Spark, in contrast, completes the entire data analysis in memory, in close to "real time": "It reads the data from the cluster, performs all of the required analysis, writes the result back to the cluster, and is done," Borne said. Spark's batch processing is nearly 10 times faster than MapReduce, and its in-memory data analysis is nearly 100 times faster.
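Here is a minimal PySpark sketch of what that in-memory reuse looks like in practice; the input path and the column name "value" are illustrative assumptions.

```python
# A minimal sketch of in-memory reuse in PySpark. The input path and the
# column name "value" are illustrative assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-example").getOrCreate()

df = spark.read.parquet("hdfs://namenode:9000/data/events.parquet")
df.cache()  # keep the data in memory for the passes below

# Both actions reuse the cached data; a MapReduce-style pipeline would
# write intermediate results to disk between comparable steps.
print(df.count())
print(df.filter(df["value"] > 100).count())

spark.stop()
```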
If the data to be processed and the reporting requirements are mostly static, and you have the patience to wait for batch jobs to complete, then MapReduce is perfectly adequate.
But if you need to analyze streaming data, such as sensor readings flowing back from a factory floor, or your application requires multiple passes over the data, then you should probably use Spark.
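As a sketch of what stream processing looks like, here is a minimal Spark Structured Streaming job; the socket source on localhost:9999, standing in for a live feed of sensor readings, is an illustrative assumption.

```python
# A minimal Spark Structured Streaming sketch. The socket source on
# localhost:9999 stands in for a live feed of sensor readings.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stream-example").getOrCreate()

readings = (spark.readStream
            .format("socket")
            .option("host", "localhost")
            .option("port", 9999)
            .load())

# Continuously maintain a running count of readings seen so far.
counts = readings.groupBy().count()

query = (counts.writeStream
         .outputMode("complete")  # re-emit the full aggregate each update
         .format("console")
         .start())
query.awaitTermination()
```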
Most machine learning algorithms require multiple passes over the data. In addition, Spark is commonly used in scenarios such as real-time marketing campaigns, online product recommendations, network security analytics, and machine log monitoring.
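To illustrate the multiple-pass point, here is a minimal sketch of an iterative model fit with Spark's MLlib; the tiny inline training set and the choice of logistic regression are illustrative assumptions, not examples from the article.

```python
# A minimal sketch of an iterative model fit with Spark MLlib. The tiny
# inline training set is made up for illustration.
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("ml-example").getOrCreate()

train = spark.createDataFrame(
    [(1.0, Vectors.dense([0.0, 1.1])),
     (0.0, Vectors.dense([2.0, 1.0])),
     (1.0, Vectors.dense([0.1, 1.2]))],
    ["label", "features"])

# maxIter=10 means up to ten optimization passes over the same data;
# keeping that data in memory between passes is where Spark shines.
lr = LogisticRegression(maxIter=10)
model = lr.fit(train)
print(model.coefficients)

spark.stop()
```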
Their disaster recovery approaches differ, but both are very good
Because Hadoop writes data to disk after every round of processing, it is inherently resilient to system errors.
Spark stores its data objects across the cluster in what are called Resilient Distributed Datasets (RDDs). "These data objects can be placed in memory or on disk, so RDDs can also provide complete disaster recovery capabilities," Borne noted.
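Below is a minimal sketch of the memory-or-disk placement Borne describes, persisting an RDD to both; the data itself is just a made-up range of numbers.

```python
# A minimal sketch of the memory-or-disk placement Borne describes.
# The data here is just a made-up range of numbers.
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-example").getOrCreate()

rdd = spark.sparkContext.parallelize(range(1_000_000))
rdd.persist(StorageLevel.MEMORY_AND_DISK)  # spill to disk if memory is tight

# If a node is lost, Spark recomputes the missing partitions from the
# RDD's lineage instead of depending on a disk copy of every step.
print(rdd.sum())

spark.stop()
```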