Over the past decade, Yahoo has invested heavily in building and expanding its Apache Hadoop clusters. Yahoo currently runs 19 Hadoop clusters, comprising more than 40,000 servers and more than 600PB of storage. The company has developed large-scale machine learning algorithms on these clusters, making Hadoop its preferred platform for large-scale machine learning. Recently, Cyprien Noel, Jun Shi and Andy Feng of the Yahoo Big ML team described how they brought deep learning to these clusters.
Deep learning (DL) is needed by many Yahoo products. For example, Flickr's scene detection, object recognition and computational aesthetics features all depend on deep learning. To let more products benefit from machine learning, the team recently brought DL capabilities natively to the Hadoop clusters. Deep learning on Hadoop has the following main advantages:
Deep learning runs directly on the Hadoop cluster, which avoids moving data between the Hadoop clusters and a separate deep learning cluster;
Like Hadoop data processing and Spark machine learning pipelines, deep learning can be defined as a step in an Apache Oozie workflow;
YARN works well with deep learning: multiple deep learning experiments can run concurrently on a single cluster. Compared with conventional approaches, this makes deep learning extremely efficient.
DL on Hadoop is a new approach to deep learning. To implement it, Yahoo's main work covered the following two areas:
Enhanced Hadoop clusters: they added GPU nodes to the Hadoop cluster. Each node has four Nvidia Tesla K80 cards, and each card has two GK210 GPUs. These nodes have 10 times the processing power of traditional commodity CPU nodes. Each GPU node has two separate network interfaces, Ethernet and InfiniBand. The former serves as the external communication interface; the latter is 10 times faster, connects the GPU nodes within the cluster, and supports direct access to GPU memory via RDMA. With YARN's recently added node label feature, a job can specify whether its containers run on CPU nodes or GPU nodes.
Create Caffe-on-Spark: this is a distributed solution they built on the open source libraries Apache Spark and Caffe. With it, a few simple commands submit a deep learning job to the cluster of GPU nodes, specifying the number of Spark executor processes to launch, the number of GPUs assigned to each executor, the HDFS location of the training data, and the path where the model should be stored. Users specify the Caffe solver and the deep network topology with standard Caffe configuration files. Spark on YARN launches the specified number of executors; each executor is assigned a partition of the HDFS training data and launches multiple Caffe-based training threads.
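The executor-per-partition arrangement described above can be illustrated with a toy data-parallel training loop. This is only a sketch under stated assumptions, not Caffe-on-Spark code: the article does not say how executors combine their results, so synchronous gradient averaging is assumed here, and the function names (`partition`, `local_gradient`, `train`) and the one-parameter model are hypothetical.

```python
import random


def partition(data, num_executors):
    """Split the training data round-robin, one partition per executor."""
    return [data[i::num_executors] for i in range(num_executors)]


def local_gradient(w, part):
    """Mean squared-error gradient for the toy model y = w * x on one partition."""
    return sum(2.0 * (w * x - y) * x for x, y in part) / len(part)


def train(data, num_executors, steps=200, lr=0.1):
    """Each simulated executor computes a gradient on its own partition.

    Assumption: the gradients are averaged after every step (synchronous
    updates). The article does not describe the real synchronization scheme.
    """
    parts = partition(data, num_executors)
    w = 0.0
    for _ in range(steps):
        grads = [local_gradient(w, p) for p in parts]
        w -= lr * sum(grads) / len(grads)
    return w


random.seed(0)
xs = [random.uniform(-1.0, 1.0) for _ in range(400)]
data = [(x, 3.0 * x) for x in xs]  # synthetic data with true weight 3.0
w = train(data, num_executors=4)
print(round(w, 3))
```

In this sketch every simulated executor sees only its own partition, yet the averaged update moves the shared weight toward the value that fits the whole dataset, which is the basic idea behind splitting training data across executors.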
With that in place, they ran two benchmarks of the approach. Tests on the ImageNet 2012 dataset showed that, compared with a single GPU, four GPUs needed only 35% of the time to reach 50% accuracy. Tests with GoogLeNet showed that eight GPUs reached 60% top-5 accuracy 6.8 times as fast as a single GPU.
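To put the two results on a common scale, the figures can be converted into speedup and per-GPU scaling efficiency. The short calculation below is a sketch using only the numbers quoted above; the helper names are hypothetical, and "efficiency" here simply means speedup divided by GPU count.

```python
def speedup_from_time_fraction(fraction):
    """Speedup implied by needing only `fraction` of the single-GPU time."""
    return 1.0 / fraction


def scaling_efficiency(speedup, num_gpus):
    """Fraction of ideal linear scaling achieved with `num_gpus` GPUs."""
    return speedup / num_gpus


# ImageNet 2012: four GPUs needed 35% of the single-GPU time.
four_gpu_speedup = speedup_from_time_fraction(0.35)
four_gpu_eff = scaling_efficiency(four_gpu_speedup, 4)

# GoogLeNet: eight GPUs were 6.8 times as fast as one GPU.
eight_gpu_eff = scaling_efficiency(6.8, 8)

print(f"{four_gpu_speedup:.2f}x, {four_gpu_eff:.0%}, {eight_gpu_eff:.0%}")
# prints "2.86x, 71%, 85%"
```

By this measure, the four-GPU ImageNet run achieved roughly 71% of ideal linear scaling and the eight-GPU GoogLeNet run roughly 85%.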
This shows that their approach is effective. To make distributed deep learning on Hadoop clusters more efficient, they plan to continue investing in Hadoop, Spark and Caffe.
Yahoo has released part of the code on GitHub, where interested readers can learn more.