Gittin’ the job done: the choice of a version control system

One of the good things about starting a software project from scratch is that you can think about the best way to organise your source code and documentation without being constrained by an existing system, possibly flawed or old-fashioned (often both), that has been in place for a long time and that your managers and colleagues refuse to modify. This applies to things such as the way you set up the files and directories, the coding guidelines and, in particular, the choice of version control system (VCS). In the companies I’ve worked for before, I never had to think about these issues: somebody had already made a decision, and it was simply a matter of adapting to whatever the company had chosen. In this post I’m going to discuss the choice of a version control system. If you’re new to version control software, also referred to as ‘revision control software’, or ‘source code management’ (SCM) when used for source code, I recommend you read the Wikipedia articles Revision control and Comparison of revision control software.

Working out which VCS to use is a bit of a daunting task because there are a lot of competing systems. The browser wars may be the ones that get the most headlines, but the fights have been fiercer, and there have been many more casualties, on the VCS battleground. After reading a bit about the various systems, I’ve finally decided to use Git, which is quickly becoming the most successful of the new kids on the block. While researching this, I’ve found that the VCS landscape has changed dramatically over the last few years, and three new systems, Git, Mercurial and Bazaar, have completely taken over from older systems like CVS and Subversion. The success of these newer systems stems from a fundamental change in paradigm. The older systems follow a centralised approach, where all the versioning history is stored on a server and the clients communicate with that server in order to commit changes or check the history. The newer systems adhere to a distributed approach, where there is no privileged server and every working copy of a repository is a fully-fledged repository in its own right. The initial gut feeling is that the distributed approach sounds like overkill, but it has actually turned out to be much more powerful.

Git was originally conceived by Linus Torvalds for use in the Linux kernel after BitKeeper, the system the kernel developers had been using, stopped being free. There is a video on YouTube of a talk Torvalds gave at the Google headquarters, in which he explains the advantages of Git.

I have to say that I find Linus Torvalds so arrogant in this video that it even discouraged me from using Git. He basically says that people who like CVS should be in a mental institution, and that Subversion was a pointless project. Some of the comments that accompany the YouTube video point out that the apparent cockiness is just his very geeky sense of humour. I could agree with that if he had qualified his quips with some sort of acknowledgement that the people who wrote CVS and Subversion were actually doing a great job in their time, and that it is only the experience gained from using such systems that has shed some light on how the original concepts could be improved. Instead, he keeps on ranting about how awful those projects were. I’m not sure I’d understand that sense of humour if I were one of the original developers of CVS or Subversion. Another thing that put me off using Git was the name, which I find rude. I feel as if I were insulting somebody every time I have to type a command that goes something like ‘git do whatever’…

Anyway, in the end I decided to put aside those prejudices and accept that Git is probably the best system to use at this time, and so it is the one we use at Retibus Software. This decision came in two stages. First, there is ample evidence that the distributed systems are better than the centralised ones, which excludes Subversion from the list of candidates. Secondly, the distributed systems are all very good, so picking one is a more difficult task. What made me go for Git was fundamentally the fact that it seems to be the most successful one so far, and its user base keeps on growing at an amazing rate. This means that there are more and more resources on the web about it. If you have a problem running a Git command, a Google search will turn up lots of relevant results. In the remainder of this post, I shall discuss these two stages of the decision process in more detail.

1. The first decision: centralised v. distributed systems

There are two kinds of argument that make the case for the distributed systems. The first is that the distributed paradigm itself is better than the centralised one. The second kind isn’t strictly about the paradigm these systems adhere to, but about the quality of the implementation. Probably because the distributed systems are recent developments, there are a number of features, like branching and merging, that are handled much better in the new systems than in the ones that have been around for a long time. In this sense, we can say that the likes of Git, Mercurial and Bazaar represent a more advanced evolutionary stage in the concept of version control software.

1.1 Centralised systems

If we wanted to restrict our choice to the software that follows the traditional centralised model, there are a number of programs that have been common during the last decade. The first VCS software I used was Microsoft Visual SourceSafe, which was well integrated with Visual Studio. Other than that, it hasn’t been very popular, and the last stable version was released in 2005. It seems that Microsoft don’t use it either. In the article Visual SourceSafe: Microsoft’s Source Destruction System, Alan de Smet sums up the main flaws that SourceSafe had. A much better program that adheres to the centralised paradigm, and which I’ve also used in the past, is Perforce. In my experience, Perforce is good software, but the problem with it is that it is a commercial product, which requires you to pay for a licence (with a possible exception if you’re developing open source software). It may have made sense to pay for good version control software five years ago, but not in 2011, when there are lots of free alternatives.

Among the systems that follow a centralised approach, the main free alternatives are CVS and Subversion. CVS is older and it has been largely superseded by Subversion, which was, in fact, promoted as a better CVS at its inception. The only possible advantages of CVS are the way it differentiates between tags and branches and, apparently, better Eclipse integration. But Subversion wins hands down on almost every count (see this interesting Stackoverflow discussion for a comparison). So, if, for any reason, we wanted to adopt a centralised system, the one to choose would be Subversion, and we can discard the rest as either old-fashioned (CVS) or proprietary (Perforce) or both (Visual SourceSafe).

But it turns out that the new distributed systems are much better.

1.2 A new approach: peer-to-peer v. client-server

The traditional centralised systems like Subversion follow a client-server model where there is a central repository which stores the history of the project, with its successive changes and branches. This central repository is typically stored on a computer that can be accessed by all the developers through the network (either a LAN or the Internet). The clients simply store a ‘working copy’ of one version of the files in the particular branch they are working on. This basically means that if a developer wants to switch branches or revert to an older state of the code, the VCS software must update the files by downloading the requested versions from the server. Similarly, if a developer wants to make a diff between a file as it is today and as it was at some point in the past, the client needs to communicate with the server. In these systems, the central repository must be stored on a computer that is backed up regularly. If the central repository were lost, it would be impossible to retrieve the project’s history from the developers’ computers.

Now, in a peer-to-peer or distributed approach, there is no distinction between a repository and a working copy. The history of changes is stored locally with the copy that the developer is working on. This means that common operations such as committing and reverting changes, switching branches or diffing files don’t incur any sort of network latency. In fact, you can even be disconnected from the network and still commit, revert and navigate the history and branches of the project. It is only when you pull changes from somebody else (a ‘peer’) or when you push your changes on to them that sending data across a network is necessary. This means that distributed systems keep the use of the network to a minimum, whereas the centralised systems depend on it heavily, since pretty much any operation requires access to the central repository. Besides, regular backups are no longer so crucial, since the complete project history is stored on as many computers as there are developers (using different machines) in the project.
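
To make the idea more concrete, here is a minimal Python sketch of the distributed model. It is a toy, not how Git, Mercurial or Bazaar are actually implemented: the `Repo` class and its `commit` and `pull` methods are names I have made up for illustration, and the history is a plain list of messages rather than a graph of hashed commits.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Repo:
    """Toy repository: every copy carries its own complete history."""

    name: str
    history: List[str] = field(default_factory=list)

    def commit(self, message: str) -> None:
        # Committing only touches local state; no network is involved.
        self.history.append(message)

    def pull(self, peer: "Repo") -> int:
        # Copy over the commits we do not have yet and report how many
        # were transferred. This is the only step that needs a peer.
        missing = [c for c in peer.history if c not in self.history]
        self.history.extend(missing)
        return len(missing)


laptop = Repo("laptop")
laptop.commit("Draft chapter 1")    # works while offline
laptop.commit("Revise chapter 1")   # still offline

desktop = Repo("desktop")
transferred = desktop.pull(laptop)  # the only "network" operation
```

After the pull, `desktop` holds the same full history as `laptop`, so either machine could be lost without losing the project's history.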

The main question that this approach poses to those, like myself, who are used to the centralised approach to version control, is whether storing a full repository on the client-turned-peer computers is overkill. Won’t the information become huge and unmanageable? Surprisingly, this is rarely a problem. The distributed systems do a great job of storing changes as differences, so that even in very big projects, keeping all that information locally is perfectly possible. And the advantage of not having to wait for the data to flow through the network outweighs the larger hard-drive footprint.

It is important to note that the fact that the distributed systems follow a paradigm where there is no canonical repository doesn’t mean that you cannot designate such a central repository. In fact, one of the advantages of the distributed systems is that you can reproduce a client-server workflow with them in a straightforward way. You can set up a project so that it is shared by four machines: ‘dev1’, ‘dev2’, ‘dev3’ and ‘central’. Even though the software sees the four repositories as peers, your in-house rules could state that only stable changes should be pushed to ‘central’ and, at the same time, forbid ‘dev1’, ‘dev2’ and ‘dev3’ from pulling and pushing among themselves. The great thing about the distributed approach is that it also allows many other workflows. On the other hand, simulating peer-to-peer workflows with a centralised system like Subversion would require careful use of branches and merging, and is considerably more complicated.
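
The in-house rule for the four-machine setup can be written down as a tiny Python sketch. This is only an illustration of the convention: the `transfer_allowed` function is hypothetical, and nothing like it is built into the version control software itself, which would happily let any peer talk to any other.

```python
CENTRAL = "central"
DEVELOPERS = frozenset({"dev1", "dev2", "dev3"})


def transfer_allowed(source: str, target: str) -> bool:
    """Return True if the in-house rules allow moving changes from source to target."""
    # Developers may publish their changes to the designated central repository...
    dev_to_central = source in DEVELOPERS and target == CENTRAL
    # ...and may fetch changes from it, but never exchange them directly.
    central_to_dev = source == CENTRAL and target in DEVELOPERS
    return dev_to_central or central_to_dev
```

With these rules, `transfer_allowed("dev1", "central")` holds but `transfer_allowed("dev1", "dev2")` does not. Relaxing the second condition is all it would take to allow a different workflow, which is the flexibility the distributed model gives you.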

For some interesting examples of alternative workflow approaches using distributed systems, you can check Scott Chacon’s chapter on this in his excellent Pro Git book. And this Bazaar document also provides several workflow examples.

The loss of the distinction between a repository and a working copy in the distributed systems has an interesting advantage for solo projects. Imagine you’re writing a novel on your battered old laptop and want to keep it under version control so that you can track the history of revisions (of course you should back your work up regularly, but you don’t need version control for that). With Subversion you would have to set up two separate directories on your system: one to act as the repository, and another to act as the working copy. But with a distributed system you simply have to initialise a repository in the directory where you keep the files of your novel. You don’t need an additional location for that.

1.3 Additional advantages of the distributed systems

As I mentioned above, the distributed systems that have become popular also seem to score better on other counts. This may be due to the flexibility of the distributed architecture or simply a matter of better programming. In particular, I feel that the three following points are important:

  1. Good and efficient branching and merging. The distributed systems are much more reliable when it comes to merging changes. This contrasts with the difficulties in CVS and Subversion, where even changes to different parts of one file may appear as conflicts and require more manual resolution than should be necessary. The fact that the distributed systems handle branches better encourages developers to take advantage of them, whereas users of the centralised systems tend to fear and avoid branches.
  2. The way changes are tracked is not based on files having a sequence of versions, but rather on content changes. Thus, if we add a line to a function in a foo.c file, a centralised system like Subversion would see foo.c as having a version N and then a version (N + 1), and any diffs or merges would be based on comparing the two foo.c files. In contrast, the distributed systems don’t see two versions of a file, but a content change that consists in the addition of the new line. This paradigm is much more robust since it allows changes in the code to be tracked better. I haven’t stopped to investigate this myself, but some of the articles in the references below mention that this is particularly true of Git, which ditches the concept of files altogether, and can identify changes to a piece of code even if the code has been moved across file boundaries.
  3. The version control information is stored at the root directory that defines the repository, and doesn’t propagate into the subdirectories. In Subversion, if a directory is under version control, all its subdirectories get hidden .svn subdirectories with the version control information for each directory. The problem with this is that copying and pasting a subdirectory under version control drags all the Subversion information into the copy. If we don’t want that, Subversion provides an ‘Export’ command that copies the subdirectory tree without the hidden version control subdirectories. In the distributed systems, the hidden subdirectory only exists at the root of the repository, so it is easier to copy a branch of the subdirectory tree without worrying about the version control information. This may seem like a minor point, but I find it very useful since I often make temporary copies of files or directories that are under version control in order to share them across the network or attach them to emails. This behaviour of the distributed systems is related to the previous point. Because content changes need to be tracked irrespective of file and directory boundaries, the version control information cannot be split on a per-directory basis in the distributed systems. Subversion does split it that way, which is the reason why it can never track a change that moves a piece of code from a file in one directory to another file in a different directory.
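
The second point, about tracking content rather than files, can be illustrated with a short Python sketch of how Git identifies file contents. As far as I understand it, Git names each ‘blob’ by a SHA-1 hash of a small header followed by the bytes of the content, and the file name plays no part in that hash. The `git_blob_id` helper and the example paths below are my own invention, and the sketch only covers whole files moved unchanged; whatever heuristics Git uses to follow smaller pieces of code between files are not shown here.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Compute the identifier Git assigns to a blob with this content."""
    # Git hashes a header ("blob <size>" plus a NUL byte) followed by the content.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


snippet = b"int add(int a, int b) { return a + b; }\n"

# The same content stored under two different (hypothetical) paths.
tree_before = {"src/foo.c": snippet}
tree_after = {"lib/bar.c": snippet}

ids_before = {git_blob_id(data) for data in tree_before.values()}
ids_after = {git_blob_id(data) for data in tree_after.values()}

# The path changed but the content identifier did not.
moved_unchanged = ids_before == ids_after
```

Because the identifier depends only on the content, a file that moves from one directory to another without changes is still recognisably the same object, which is something a per-directory bookkeeping scheme cannot express.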

In the references section I have listed all the articles I have found interesting. Some of them, like Subversion Re-education by Joel Spolsky and Why You Should Switch from Subversion to Git by Scott Chacon, offer further, and surely better, explanations of the features that make the distributed systems better.

2. The second decision: which distributed system?

If we accept the evidence that the distributed systems are more powerful, we then have to decide which one of the distributed systems to use. The Wikipedia article Comparison of revision control software lists the many products available. If we discard those that are no longer maintained like GNU arch, those that are commercial, like BitKeeper (why pay or worry about licensing issues with so many free alternatives?) and those that have a small user base like Monotone and Darcs, we are left with the three candidates I mentioned at the beginning of this post: Git, Mercurial and Bazaar.

Choosing one of them is a difficult decision because they’re all excellent products. One feels like Buridan’s ass, unable to make a choice because all of them seem equally good. In the end, I have decided to use Git because it seems to outperform the others in a few important areas, like the ability to track changes across file boundaries that I’ve mentioned before. Besides, its growth in popularity means that there are more and more resources for it, both in terms of documentation and in terms of utilities that make it easier to interact with it. In particular, the tools that integrate Git with Windows Explorer and with the Mac Finder are improving very quickly.

Initially I was a bit worried about the claims that Git, coming from the Linux world, has poor Windows support, since I do most of my development work on Windows 7, but I’ve found that this is no longer the case. In fact, I’ve been using TortoiseGit for several months and I’m quite happy with it. The only big issue I’ve found with Git’s Windows integration is that it doesn’t support Unicode characters beyond the local code page in file names, and even non-ASCII characters within that code page can give it trouble. So, on my Spanish Windows system, which uses Windows-1252 (a close relative of ISO-8859-1) as its ‘ANSI’ encoding, I can put a file like cañón.txt under version control and cross my fingers that it’ll work, but with a file with a name like 中文-عربية-español.txt, perfectly acceptable as far as Windows is concerned, things will go horribly wrong. I reported this issue more than nine months ago, but unfortunately it hasn’t been fixed yet. Anyway, I don’t usually use accented letters, let alone Chinese or Arabic characters, in my file names, so I can live with this issue. Still, I find it a sign of sloppiness in the code. I have a very strong opinion that good software should provide complete Unicode support and never stumble on encoding issues.
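
The difference between the two file names can be checked with a few lines of Python. This sketch only tests whether a name can be represented in a given code page; it says nothing about what Git actually does internally. The `fits_code_page` function is a hypothetical helper, and the default of `cp1252` is my assumption about the ‘ANSI’ code page of a Western European Windows installation.

```python
def fits_code_page(file_name: str, code_page: str = "cp1252") -> bool:
    """Return True if every character of file_name exists in the code page."""
    try:
        file_name.encode(code_page)
    except UnicodeEncodeError:
        # At least one character has no representation in this code page.
        return False
    return True


names = ("cañón.txt", "中文-عربية-español.txt")
results = {name: fits_code_page(name) for name in names}
```

The accented Spanish name fits in the code page, whereas the name with Chinese and Arabic characters does not, so any program that squeezes file names through the ‘ANSI’ encoding has no way of representing it.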

In the end, I have based my decision to use Git on the general impression I gained from all the information I found on the web. The tests I ran with Mercurial and Bazaar were quite limited in scope. I thought running very comprehensive tests with lots of use cases wasn’t worth all the trouble and time it would have required. After all, if it eventually turns out that another system is better, switching should be easy. Precisely as a result of the stiff competition between the main VCS products, all of them provide import and export capabilities that make them compatible with the other common systems. This means that if we choose a system which eventually dies out and is superseded by another one, there will certainly be tools that can convert a repository from one format to the other. The fact that Git is praised by lots of people as the best system, and the way its growth has exploded in such a short span of time, should guarantee that any competing product will be able to import a repository in the Git format.

3. References

These are the articles and forum discussions I found interesting while I was researching this topic.

  1. Distributed Version Control Systems: A Not-So-Quick Guide Through, an excellent article by Sebastien Auvray.
  2. Distributed Version Control is here to stay, baby, a very interesting article by Joel Spolsky.
  3. Why you should switch from Subversion to Git, a very interesting article by Scott Chacon.
  4. GitSvnComparison, an article on Gitwiki.
  5. Subversion Vision and Roadmap Proposal, by Subversion developer C. Michael Pilato.
  6. Popularity of Git/Mercurial/Bazaar vs. which to recommend, a Stackoverflow discussion.
  7. Svn vs Git, a Stackoverflow discussion.
  8. Is it easier to manage code with GIT or Bazaar?, a Stackoverflow discussion.
  9. What are the Git limits?, a Stackoverflow discussion.
  10. Popularity of Git/Mercurial/Bazaar vs. which to recommend, a Stackoverflow discussion.
  11. Git vs. Subversion, a discussion with some interesting comments.
  12. Linus Torvalds on GIT and SCM, an interesting blog post that includes the YouTube video of Linus Torvalds’ talk at Google.
  13. Git vs SVN – Which is Better?, an article that compares both systems.
  14. Git vs SVN for bosses, an article about how to explain to managers that Git is better because it makes creating branches much easier.
  15. Why Switch to Bazaar?
  16. Why revision control? Why Mercurial?
  17. Bazaar vs. Mercurial : An unscientific comparison, a comparison of Mercurial and Bazaar.
  18. Why Git is Better than X, an article by Scott Chacon in which he defends Git’s superiority over the other similar systems.
  19. Why Git ain’t better than X, a reply to the above article by Bazaar user Matt Giuca.
  20. Going away from bzr toward git, an article by David Cournapeau.
  21. Git – SVN Crash Course, a comparison of SVN and Git commands.
  22. Subversion Re-education, the first chapter in the Mercurial tutorial by Joel Spolsky, which offers a very good introduction to distributed systems for people used to Subversion.
  23. TortoiseGit: The coolest Interface to (Git) Version Control, a Git Windows GUI client based on TortoiseSVN.
  24. Using Git With OS X: 6 Tools to Get You Up and Running, an article by Andrew Bednarz about using Git on Mac OS X.

5 Responses to Gittin’ the job done: the choice of a version control system

  1. Matt Giuca says:

    Great article. A nice analysis of centralised vs distributed and a bit of a discussion about different distributed systems. I thought I’d reply, as the author of reference #19 (Why Git Ain’t Better Than X), with my obviously-biased-in-favour-of-Bazaar opinion (take it with as much salt as you feel necessary).

    As a Bazaar user (who has tried Git extensively over the course of years of arguments with Git fans, and still don’t see the appeal), I am frustrated that everybody’s reason to use Git is “everybody else uses Git.” This is the same reason to use Windows — it has some merit of course (more tools, more people familiar with the system), but popularity is not a great reason to choose a system — especially given your first paragraph that says the great thing about starting from scratch is you can make properly-informed decisions and not just go with the status quo.

    As Bazaar is written in Python, it naturally runs fine on Windows and (I believe) fully supports Unicode.

    Git’s “ability to track changes across file boundaries” is both a blessing and a curse. I have experimented extensively with Git’s automatic rename detection, and it is highly irregular. Git automatically detects renamed/moved files, which is nice, but it doesn’t let you manually specify it (the data structure in fact doesn’t record it at all; it is inferred by each operation). This means if Git fails to detect a rename, there is no way to correct it, you just have to live with it thinking that you created a new file. Furthermore, I have found many cases where some tools detect a rename while others don’t (even with the “detect renames” flag). In Bazaar on the other hand, it will not consider a file to be renamed unless you explicitly state it. This takes some getting used to, but it means you can guarantee 100% of the moves are tracked correctly.

    The biggest gripe I have with Git is that, unlike all other revision control systems (even Subversion), it is hostile towards other revision control systems (in terms of compatibility). The reason is that Git commit objects cannot store arbitrary metadata (key-value pairs) — they can only store a commit message and author ID. The reason this is critical is that if I want to commit in Bazaar and push to Git, I cannot do it without losing information (such as the Bazaar revision ID and any other metadata). This information loss means that pulling from Git back to Bazaar produces commits incompatible with the original Bazaar commits. The way around this is to use “dpush”, which deletes all the local Bazaar metadata and synchronises it with Git, but that breaks any other Bazaar branches. This doesn’t happen with any other VCS. Bazaar can push and pull just fine from Subversion, for example, without any information loss. In fact, all the other VCSs could theoretically talk to one another, but not Git. Git is a one-way information hole, thanks to its overly-simple data model. This is why I’m frustrated that the world is moving to Git. If I am using Bazaar on the server, then you can pretty trivially use Git as a client, and push and pull the Bazaar repository. But if you use Git on the server, I can’t (properly) use Bazaar as a client, because of the data loss.

    Anyway, thanks for the detailed article. I just wanted to highlight a few issues I have.
    Matt

    • Ángel José Riesgo says:

      Thanks a lot for your comments, Matt. I agree that popularity is not the best reason to choose a system. In fact, the final part of my post is probably quite weak in terms of arguments. I basically found that it was so hard to find any flaws in how Mercurial, Bazaar and Git were treating the admittedly simple cases I tried that it was quite difficult to declare a winner. In this situation, using popularity as a tiebreaker was not unreasonable, especially since I’m mostly working on my own and couldn’t afford to spend too much time thinking about this. Maybe with a little more experience in real-life situations I will find that Git is not as good as some people say it is, and that Bazaar (or Mercurial or Monotone or Darcs) is better. Herd behaviour made people choose VHS over Betamax in the 80’s, and there are lots of similar cases in the software industry. The popularity of Git could be similar. Time will tell.

      In the meantime, I think it’s good that there are people who are very vocal about the merits of the other systems. Monotone developer Thomas Keller recently published a blog post where he claims that Monotone is better at merging than Git. I think these debates are an essential contribution to the evolution of software. This Darwinian competition between systems and ideas will ensure that we will have much better revision control systems in the future.

  2. Adam says:

    Hello, Ángel!

    I have to say this is the one best article about centralized VCS vs DVCS I have read. Comprehensive, honest and, incredibly, full of respect and readable. Congratulations.

