Git – distributed source control
August 29, 2007
I watched Linus Torvalds rant about Git and distributed source control management (SCM). First of all, Torvalds is a hard-core jerk. He is stunningly obnoxious and egotistical, the very embodiment of the poorly socialized nerd stereotype. As for the talk, he takes a lot of credit for some fairly well-known ideas (much as he does with the Linux OS).
He admits he got the concept of distributed version control from BitKeeper, and that concept is the only interesting thing about Git. He then pretends that applying SHA-1 hashes to check consistency and equivalence is an earth-shattering invention. If Torvalds is the first to apply it to SCM, that says more about the poor implementation of other SCMs than about his otherworldly genius. He then rambles about Git’s insane performance. I don’t know how Git is implemented, but I can guess at two techniques that would increase performance. The first is to use a modified rsync algorithm to quickly compute and transmit the changes between two repositories. The second is to merge the hash computation and compression into one pass over a file, since the bottleneck for large repositories will be disk access. And it certainly helps that he has very deep knowledge of the kernel.
I was confused when he kept saying that merging was extremely fast in Git, but I assume he means merging without conflicts. Big freakin’ deal! Conflicts are where things get horribly difficult, because you have to manually pick which change set to apply to the repository. Since modifications to the different subsystems rarely touch the same files, he probably doesn’t have to deal with conflicts much.
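To make the hashing idea and my second performance guess concrete, here is a minimal Python sketch. It is only an illustration of hashing and compressing in a single pass over the data, not a description of how Git actually stores anything. The function names `hash_and_compress` and `same_content` and the chunk size are my own assumptions.

```python
import hashlib
import io
import zlib


def hash_and_compress(stream, chunk_size=64 * 1024):
    """Read a binary stream once; return (sha1_hex, compressed_bytes).

    Each chunk is fed to both the hash and the compressor, so the
    data is only read from disk a single time.
    """
    digest = hashlib.sha1()
    compressor = zlib.compressobj()
    compressed = bytearray()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        digest.update(chunk)
        compressed += compressor.compress(chunk)
    compressed += compressor.flush()
    return digest.hexdigest(), bytes(compressed)


def same_content(stream_a, stream_b):
    """Treat two streams as equivalent when their SHA-1 digests match."""
    return hash_and_compress(stream_a)[0] == hash_and_compress(stream_b)[0]


if __name__ == "__main__":
    sha1_hex, blob = hash_and_compress(io.BytesIO(b"some file contents\n"))
    print(sha1_hex, len(blob))
```

The same digest doubles as a consistency check: if stored data no longer hashes to the value recorded for it, something has been corrupted.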
Despite his radioactive endorsement, the concept of distributed version control is actually quite nice. Rather than have a central repository managed by Torvalds, imagine a tree rooted at his holiness. The leaves of the tree are where most of the edits take place. The nodes in the middle correspond to project managers for the various subsystems in Linux. If you want to modify the SCSI code, you edit and test your private codebase and then push the changes to the guy who manages that small chunk. The changes will bubble up the tree, getting tested repeatedly in combination with larger changesets, until they arrive at the root repository. It’s actually a very nice model for a large development team.
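The tree of repositories can be modeled in a few lines of Python. This is a toy sketch of the workflow as I understand it, not anything Git does internally: the `Repo` class, its methods, and the owner names are all hypothetical, and the testing that should happen at every level is left out.

```python
class Repo:
    """A repository in the tree; parent is None for the root."""

    def __init__(self, owner, parent=None):
        self.owner = owner
        self.parent = parent
        self.changes = []

    def commit(self, change):
        """Record a change in this private repository."""
        self.changes.append(change)

    def push_to_parent(self):
        """Give the parent every change it lacks; return what was pushed."""
        if self.parent is None:
            return []
        new = [c for c in self.changes if c not in self.parent.changes]
        self.parent.changes.extend(new)
        return new


def bubble_up(repo):
    """Push changes one level at a time until they reach the root.

    Returns the list of owners the changes passed through.
    """
    path = [repo.owner]
    while repo.parent is not None:
        repo.push_to_parent()
        repo = repo.parent
        path.append(repo.owner)
    return path


if __name__ == "__main__":
    root = Repo("root")
    scsi = Repo("scsi maintainer", parent=root)
    dev = Repo("developer", parent=scsi)
    dev.commit("fix scsi timeout")
    print(bubble_up(dev))   # ['developer', 'scsi maintainer', 'root']
    print(root.changes)     # ['fix scsi timeout']
```

In a real setup each intermediate owner would test the incoming changes before passing them on, so `bubble_up` would stop at the first level where something breaks.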
Some in the audience didn’t see how this model could be applied to Google. Here’s an example. The codebase for Gmail is big, but broken into distinct subprojects. Every developer gets his own repository to edit and run a (small, testable) version of Gmail. A group of devs working on spellcheck can update their repositories, then push the changes up to the UI manager, who is merging changes from a variety of small groups. He pushes his changes up the chain of “managers” until they land in the Gmail architect’s codebase. It works pretty much the same way companies do it now with centralized systems: make changes to a branch, then merge branches together until you reach the main branch. But distributed SCM more closely matches this process, and even encourages it. Just layer it on top of the existing org chart: architect at the root, devs and testers at the leaves, project managers in between.
Subversion should add namespaces to its branch tags. I should be able to create a branch called “experimental” within my namespace that doesn’t conflict with anyone else’s. When someone wants to merge with me, they just fetch my tag within my namespace. In fact, the quickest solution is to support URIs as tag names. Problem solved. Now you get “distributed” development within a centralized repository. It can’t be that easy, can it?
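Here is a rough Python sketch of what I mean by namespaced branch names. It is purely hypothetical and is not Subversion's API: the `BranchRegistry` class, its methods, and the revision numbers are made up for illustration. A URI-style tag name would work the same way, with the namespace simply being the prefix of the URI.

```python
class BranchRegistry:
    """Maps (namespace, branch name) pairs to a revision in one central repository."""

    def __init__(self):
        self._branches = {}

    def create(self, namespace, name, revision):
        """Create a branch; names only need to be unique within a namespace."""
        key = (namespace, name)
        if key in self._branches:
            raise ValueError(f"branch {name!r} already exists in namespace {namespace!r}")
        self._branches[key] = revision

    def fetch(self, namespace, name):
        """Look up someone's branch by their namespace and the branch name."""
        return self._branches[(namespace, name)]


if __name__ == "__main__":
    registry = BranchRegistry()
    registry.create("alice", "experimental", revision=101)
    registry.create("bob", "experimental", revision=205)  # no clash with alice
    print(registry.fetch("alice", "experimental"))
```

Two people can both own a branch called “experimental”, and anyone who wants to merge with one of them asks for that person's namespace explicitly.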