Home IT Linux Windows Database Network Programming Server Mobile  
  Home \ Linux \ Github inventory objects Algorithm     - Customize the 404 error page Nginx (Server)

- CentOS 6.5 install VNC-Server (Linux)

- socket busy poll of Linux kernel 3.11 to avoid sleep switch (Linux)

- Linux ls command (Linux)

- Graphics of Java Tools (Programming)

- Installation under Linux Mint system guidelines for Gtk (Linux)

- Java developers question (Programming)

- Oracle 11g maintenance partitions - Adding Partitions (Database)

- Wireless LAN security solutions (Linux)

- Java static code analysis tool Infer (Programming)

- Puppet 3.5 Source package Installation and Configuration (Server)

- GDB remote connections RX Probe online debug program (Programming)

- MongoDB start under Linux (Database)

- Ubuntu 14.04 and derivative version of the user on how to install cURL 7.37.1 (Linux)

- About phpwind 5.01-5.3 0day analysis of the article (Linux)

- Python uses multi-process pool (Programming)

- Linux NFS service fixed ports and firewall configuration (Linux)

- Linux System Administrator Network Security Experience (Linux)

- Common Linux System Troubleshooting (Linux)

- Linux NFS FTP use (Server)

  Github inventory objects Algorithm
  Add Date : 2018-11-21      
  $ Git clone https://github.com/torvalds/linux
Cloning into 'linux' ...
remote: Counting objects: 4350078, done.
remote: Compressing objects: 100% (4677/4677), done.
Receiving objects: 4% (191786/4350078), 78.19 MiB | 8.70 MiB / s

Section suggested that the remote code libraries, a total of 4,350,078 objects need to clone.

This is called "inventory objects" (counting objects), Github need calculated in real time, the total number of objects need to clone.

This process is very slow, according to the disclosure Github, like the Linux kernel such a huge library inventory time 8 minutes! That is, after issuing git clone command will do so on eight minutes, and then begin actual data transmission. This is of course intolerable. Github team has been trying to solve this problem.

Later, they finally found a new algorithm, now counted once as long as 3 milliseconds!

To understand this algorithm, you must first know what is Git object. Simply put, the object is a file, there are three most important objects.

Snapshot Object (Commit)
Directory object (Directory)
File object (File)
Each commit code, generates a commit object, which has a corresponding "Directory Object" in the name of the current. "Directory Object" to save the subdirectories and files in the root directory of the information contained in the code. Each subdirectory is another "directory objects," each file is "file object", which is the specific contents of the file.

Therefore, the "inventory objects" is the inventory of the various commit, directories, files, and so on. git clone git fetch operations and inventory objects required because of the need to know which object files to download in the end.

Inventory object's original algorithm is as follows.

It lists all the local branches of the latest in a commit
Lists all the remote branch the latest in a commit
Compare the two, as long as there are different, it means that the branch changed
Each commit changes occur, which are counted subdirectories and files for the specific changes
Dating back to the parent node of the current commit, repeat the fourth step, until local and remote history of consistency so far
The sum of all the objects need to be changed
The above description of the process, "Inventory Object" is a document traversal algorithm, the object will change to inventory was 11, which means that a large number of file reads. For large code base, this process is very slow.

New Algorithm Github team think, is to create a Bitmap index, that is, each commit to generate a binary value.

Open the local Github repository .git / objects / pack / directory, you will see an index file and a data file, they are Bitmap. Simply put, these two files indexed all the current object code libraries, and then use a binary value representing these objects. How many objects there are that many bits of the binary value. Its first n bits to represent the data file inside the n-th object.

Each commit will have a corresponding binary value, all objects contained in the current snapshot representation. These objects corresponding bit is 1, the other bits are 0.

This benefit is not read commit objects, just read the binary value, you will know which node contains the current commit. Even better, as long as the two binary values ​​do a XOR operation, we will know which bits (ie, which objects) changes have taken place. Moreover, because the new objects are always added to the existing binary bits, so long as those who read the extra bit, you know the current commit more than in the last commit what object.

As a result, "Inventory Object" becomes a binary value comparison operation, so fast. Further introduction, see the official documentation "Bitmap interpretation", "Bitmap format."

Currently, Github production environment has been deployed with this algorithm, users no longer have to count the objects, while waiting over. Moreover, Github team also merged it into Git, which means that, from now on all Git implementations can use the Bitmap function, so the future will definitely have more fun usage appears.
- Piostat - Monitoring and Statistics Linux process (Linux)
- WebLogic administrator account and reset the password (Database)
- Use SecureCRT to transfer files between local and remote hosts (Linux)
- Use of the storage-level replication technology will quickly clone a ASM database to the target environment (Database)
- RPM package management under Linux (Linux)
- How Vim playing a mature IDE (Linux)
- Linux Getting Started tutorial: Borrow Windows fonts in Ubuntu 14.10 (Linux)
- Firewall types and instructions (Linux)
- Between the two to achieve the main MySQL database synchronization from (Database)
- TWiki LDAP error appears the problem is solved (Linux)
- IPTABLES configuration steps under Linux (Linux)
- Vagrant build LNMP environment (Server)
- After installing minimize RHEL / CentOS 7 need to do some things (Linux)
- Java Annotation Comments (Programming)
- Ubuntu arm-none-eabi-gcc compiler treated with STM32F10x (Linux)
- DupeGuru- find and remove duplicate files (Linux)
- rpm package specify the installation path (Linux)
- FileZilla FTP && TLS connection settings of (Linux)
- Build ASP.NET 5 development environment in Ubuntu (Server)
- Linux compiler installation Redis (Database)
  CopyRight 2002-2016 newfreesoft.com, All Rights Reserved.