  How Yahoo Implements Large-Scale Distributed Deep Learning on Hadoop Clusters
  Add Date : 2017-08-31      
  Over the past decade, Yahoo has invested a great deal of effort in building and scaling its Apache Hadoop clusters. Yahoo currently runs 19 Hadoop clusters, comprising more than 40,000 servers and more than 600 PB of storage. They have developed large-scale machine learning algorithms on these clusters, making Hadoop Yahoo's preferred platform for large-scale machine learning. Recently, Cyprien Noel, Jun Shi and Andy Feng of the Yahoo Big ML team wrote about how Yahoo implements large-scale distributed deep learning on its Hadoop clusters.

Deep learning (DL) is needed by many of Yahoo's products. For example, Flickr's scene detection, object recognition and computational aesthetics features all depend on deep learning. To let more products benefit from machine learning, they recently brought DL capability natively to their Hadoop clusters. Deep learning on Hadoop has the following main advantages:

Deep learning runs directly on the Hadoop cluster, which avoids moving data between the Hadoop cluster and a separate deep learning cluster;
Like Hadoop data processing and Spark machine learning pipelines, deep learning can be defined as a step in an Apache Oozie workflow;
YARN works well with deep learning: multiple deep learning experiments can run concurrently on a single cluster. This makes deep learning far more efficient than with conventional approaches.
DL on Hadoop is a new approach to deep learning. To implement it, Yahoo's work focused mainly on the following two areas:

Enhanced Hadoop clusters: they added GPU nodes to the Hadoop clusters. Each node has four Nvidia Tesla K80 cards, and each card has two GK210 GPUs. These nodes have 10 times the processing power of the traditional commodity CPU nodes. Each GPU node has two separate network interfaces, Ethernet and Infiniband. The former serves as the external communication interface; the latter is 10 times faster, connects the GPU nodes within the cluster, and supports direct access to GPU memory over RDMA. With YARN's recently introduced node label feature, a job can specify whether its containers run on CPU nodes or on GPU nodes.
Create Caffe-on-Spark: this is a distributed solution they built on the open source libraries Apache Spark and Caffe. With it, a few simple commands submit a deep learning job to the GPU nodes of the cluster, specifying the number of Spark executor processes to launch, the number of GPUs assigned to each executor, the HDFS location of the training data, and the path where the model is stored. Users specify the Caffe solver and the deep network topology with standard Caffe configuration files. Spark on YARN launches the specified number of executors; each executor is assigned a partition of the HDFS-based training data and starts multiple Caffe-based training threads.
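The executor-per-partition arrangement described above is a form of data-parallel training. The following is a minimal sketch of that general idea in plain Python, not Caffe-on-Spark code: a toy one-parameter model stands in for the Caffe network, an in-memory list stands in for the HDFS data set, and `partition`, `local_gradient` and `train` are hypothetical names chosen for illustration. The article does not describe how the real system exchanges parameters between GPUs; this sketch simply assumes the per-partition gradients are averaged at each step.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (input x, target y)


def partition(data: List[Sample], num_executors: int) -> List[List[Sample]]:
    """Split the data set round-robin, one partition per (simulated) executor."""
    return [data[i::num_executors] for i in range(num_executors)]


def local_gradient(w: float, part: List[Sample]) -> float:
    """Gradient of the mean squared error of the model y = w * x on one partition."""
    return sum(2.0 * (w * x - y) * x for x, y in part) / len(part)


def train(data: List[Sample], num_executors: int,
          steps: int = 200, lr: float = 0.01) -> float:
    """Synchronous data-parallel gradient descent (assumed scheme, see lead-in)."""
    parts = partition(data, num_executors)
    w = 0.0
    for _ in range(steps):
        # In a real cluster each executor would compute this in parallel.
        grads = [local_gradient(w, p) for p in parts]
        # Assumption: combine the workers' results by simple averaging.
        w -= lr * sum(grads) / len(grads)
    return w


# Toy data generated from y = 3 * x, so training should recover w close to 3.
data = [(float(x), 3.0 * x) for x in range(1, 9)]
w = train(data, num_executors=4)
print(f"learned w = {w:.4f}")
```

Because every simulated executor holds an equally sized partition, averaging the local gradients gives the same update as a single worker computing the gradient over the whole data set, which is why the number of executors does not change the result here.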
That done, they benchmarked the approach on two data sets. Tests on the ImageNet 2012 data set show that, compared with a single GPU, four GPUs need only 35% of the time to reach 50% accuracy. Tests with GoogLeNet show that eight GPUs reach 60% top-5 accuracy 6.8 times faster than a single GPU.
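To put these two figures on a common footing, the short calculation below converts them into speedup and parallel efficiency (speedup divided by the number of GPUs). It is only a sketch: the function names are hypothetical, and it assumes that "6.8 times" refers to the reduction in time to reach the target accuracy. The two tests use different workloads, so the resulting numbers are not directly comparable with each other.

```python
def speedup_from_time_fraction(fraction: float) -> float:
    """Speedup implied by needing only `fraction` of the single-GPU time."""
    return 1.0 / fraction


def parallel_efficiency(speedup: float, num_gpus: int) -> float:
    """Fraction of ideal linear scaling that was achieved."""
    return speedup / num_gpus


# ImageNet 2012 test: four GPUs need 35% of the single-GPU time.
s4 = speedup_from_time_fraction(0.35)
e4 = parallel_efficiency(s4, 4)

# GoogLeNet test: eight GPUs are reported as 6.8 times faster (assumed meaning).
s8 = 6.8
e8 = parallel_efficiency(s8, 8)

print(f"4 GPUs: {s4:.2f}x speedup, {e4:.0%} efficiency")  # 4 GPUs: 2.86x speedup, 71% efficiency
print(f"8 GPUs: {s8:.2f}x speedup, {e8:.0%} efficiency")  # 8 GPUs: 6.80x speedup, 85% efficiency
```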

This shows that their approach is effective. To make distributed deep learning on Hadoop clusters more efficient, they plan to continue investing in Hadoop, Spark and Caffe.

Yahoo has released part of the code on GitHub; interested readers can learn more there.
  CopyRight 2002-2022 newfreesoft.com, All Rights Reserved.