$ git clone https://github.com/torvalds/linux
Cloning into 'linux' ...
remote: Counting objects: 4350078, done.
remote: Compressing objects: 100% (4677/4677), done.
Receiving objects: 4% (191786/4350078), 78.19 MiB | 8.70 MiB/s
The output above shows that the remote repository contains a total of 4,350,078 objects to be cloned.
This step is called "counting objects": GitHub has to calculate, in real time, the total number of objects that need to be cloned.
This process used to be very slow. According to GitHub, counting the objects of a repository as huge as the Linux kernel took 8 minutes! In other words, after you issued the git clone command, 8 minutes passed before the actual data transfer even began. This was of course intolerable, and the GitHub team had long been trying to solve the problem.
Eventually they found a new algorithm: the count now takes only 3 milliseconds!
To understand this algorithm, you must first know what a Git object is. Simply put, an object is a file, and three kinds of objects matter most.
Snapshot object (commit)
Directory object (tree)
File object (blob)
Each time you commit code, Git generates a commit object, which points to a directory object representing the project's root directory at that moment. The directory object records the subdirectories and files the root contains: each subdirectory is another directory object, and each file is a file object holding that file's actual contents.
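To make the structure above concrete, here is a minimal Python sketch (not Git's actual implementation; all class and function names are hypothetical) of the three object types, plus a walk that enumerates every object reachable from one commit's snapshot:

```python
class Blob:                      # "file object": holds a file's contents
    def __init__(self, content):
        self.content = content

class Tree:                      # "directory object": maps names to entries
    def __init__(self, entries):
        self.entries = entries   # dict: name -> Tree or Blob

class Commit:                    # "snapshot object": points at a root tree
    def __init__(self, root, parent=None):
        self.root = root
        self.parent = parent

def reachable_objects(commit):
    """List every object in a commit's snapshot: the commit, its trees, its blobs."""
    objects = [commit]
    stack = [commit.root]
    while stack:
        obj = stack.pop()
        objects.append(obj)
        if isinstance(obj, Tree):            # recurse into subdirectories
            stack.extend(obj.entries.values())
    return objects

# A root directory holding one file and one subdirectory with one file:
root = Tree({"README": Blob("hello"), "src": Tree({"main.c": Blob("int main(){}")})})
head = Commit(root)
print(len(reachable_objects(head)))  # 5: 1 commit + 2 trees + 2 blobs
```

Counting the snapshot means touching every tree and blob it contains, which is exactly why the traversal described next gets expensive.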
"Counting objects", therefore, means enumerating all of these commits, directories, and files. Both git clone and git fetch need to count objects, because they need to know exactly which objects must be downloaded.
The original object-counting algorithm works as follows.
1. List the latest commit of every local branch.
2. List the latest commit of every remote branch.
3. Compare the two; wherever they differ, that branch has changed.
4. For each changed commit, enumerate exactly which subdirectories and files changed.
5. Walk back to the parent of the current commit and repeat step 4, until the local and remote histories agree.
6. Sum up all the objects that need to be transferred.
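The steps above can be sketched in a few lines of Python. This is a simplified model, not Git's real code: histories are parent-linked dicts, and each commit carries the set of object ids in its snapshot (all names here are hypothetical):

```python
def count_objects(local_tips, remote_tips, snapshots, parents):
    """Walk back from each changed remote branch tip, summing objects
    the local side does not yet have, until the histories agree."""
    have = set()
    for tip in local_tips.values():          # everything local already has
        c = tip
        while c is not None:
            have |= snapshots[c]
            c = parents.get(c)
    need = set()
    for branch, tip in remote_tips.items():
        if local_tips.get(branch) == tip:    # branch unchanged: skip it
            continue
        c = tip
        while c is not None and snapshots[c] - have - need:
            need |= snapshots[c] - have      # objects new in this commit
            c = parents.get(c)               # back to the parent, repeat
    return len(need)

# Remote "main" has one commit (C2) beyond the local tip (C1):
snapshots = {"C1": {"c1", "t1", "b1"}, "C2": {"c2", "t1", "t2", "b1", "b2"}}
parents = {"C2": "C1", "C1": None}
print(count_objects({"main": "C1"}, {"main": "C2"}, snapshots, parents))  # 3
```

Every set lookup here stands in for reading an object file from disk, which is where the real algorithm spends its 8 minutes on a repository the size of the Linux kernel.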
As the description above shows, "counting objects" is essentially a file-traversal algorithm: the changed objects are counted one by one, which means a huge number of file reads. For a large repository, this process is very slow.
The GitHub team's new algorithm is to build a bitmap index, that is, to generate a binary value for each commit.
Open the .git/objects/pack/ directory of a local repository and you will see an index file and a data file; when bitmaps have been generated, a bitmap file sits alongside them. Simply put, these files index all the objects in the repository and represent them with a binary value: the value has as many bits as there are objects, and its n-th bit corresponds to the n-th object in the data file.
Each commit has a corresponding binary value that represents all the objects contained in that snapshot: the bits for those objects are 1, and all other bits are 0.
The benefit is that Git no longer needs to read the commit objects at all; reading the binary value alone tells it which objects the current commit contains. Even better, XOR-ing two binary values immediately reveals which bits (that is, which objects) differ. Moreover, because new objects are always appended as new bits, reading just the extra bits tells you which objects the current commit has beyond the previous one.
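The XOR trick is easy to demonstrate. Below is a sketch using Python integers as arbitrary-width bitmaps, where bit n stands for the n-th object in the pack (the object indices are made up for illustration):

```python
def bitmap(object_indices):
    """Build a bitmap with bit i set for each object the snapshot contains."""
    value = 0
    for i in object_indices:
        value |= 1 << i
    return value

def changed(a, b):
    """Indices of objects present in one commit but not the other."""
    diff = a ^ b                             # XOR: differing bits only
    return [i for i in range(diff.bit_length()) if diff >> i & 1]

old = bitmap([0, 1, 2])        # commit A contains objects 0, 1, 2
new = bitmap([0, 1, 2, 3, 4])  # commit B adds objects 3 and 4
print(changed(old, new))       # [3, 4] — found without reading any object file
```

One machine-word XOR compares 64 objects at a time, which is why the count drops from minutes of file traversal to milliseconds of bit arithmetic.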
As a result, "counting objects" becomes a comparison of binary values, which is extremely fast. For further details, see the official documentation on the bitmap design and the bitmap format.
GitHub has already deployed this algorithm in production, so users no longer have to wait while objects are counted. Moreover, the GitHub team has contributed it to Git itself, which means that from now on every Git implementation can use bitmaps, and we will surely see more interesting uses of them in the future.