May 6, 2014 - How to transform a project from a huge subversion repository to git

Disclaimer: I wrote this post a while ago and I am not entirely sure how accurate it was when I stopped proofreading it. So handle with care and always keep a backup. The information provided here might be outdated or even wrong. I hope it still helps someone get rid of an overly large SVN repository.

Multi-project Subversion repositories sound convenient at first glance: everything in one place and only one system to manage. But if you ever consider switching to git, they are a bad idea. Converting a Subversion repository into a git repository is usually quite easy, but only for repositories of a reasonable size, and multi-project repositories tend to grow quite big. The usual conversion via git svn clone <svn-url> slows down extremely on huge repositories. I once had to convert some projects from an SVN repository that held, in the end, 141 more or less active projects and around 470000 commits…

A short overview of what is required

  • Time. It doesn’t take weeks anymore, but it still requires some time (in the end I got it working in around 4 hours, but only after I had created a local repository).
  • Fast storage. The whole process is extremely IO-intensive. An SSD works fine; if there is enough RAM available, a RAM disk of around 4 to 8 GB is even better.
  • svnrdump. It has been part of Subversion since version 1.7.
  • I used svn-all-fast-export for the actual conversion (it’s available in the Ubuntu repository). svn-fast-export should work too, but I had issues with it and stopped investigating.
  • BFG Repo-Cleaner to get rid of large files and files containing sensitive data.
  • Subversion (especially svnadmin) and git, of course.

Step 1: Create a local copy

If you already have filesystem access to the repository, you can skip this step. It creates a local copy of a remote repository, so that we don’t have to perform every operation over the network.

This example downloads 10000-commit chunks in parallel up to commit 400000. Remember to update the values so they match your repository. Downloading in chunks also works around the limitations of HTTP a little bit.

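# Replace <svn-url> with the URL of your remote repository
svn_url='<svn-url>'

# First one without "--incremental"
svnrdump dump "$svn_url" \
    --revision 0:10000 | gzip -9 > MyProject.00.svn.gz &
for i in {01..39}; do
    # 10# forces base 10, otherwise 08 and 09 are rejected as invalid octal numbers
    svnrdump dump "$svn_url" \
        --incremental \
        --revision $((10#$i))0001:$((10#$i + 1))0000 | gzip -9 > MyProject.$i.svn.gz &
done
wait # until all parallel downloads have finished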

# Create local repository and import the chunks
mkdir MyProject.svn
sudo mount -t tmpfs none MyProject.svn # Skip this, if you don't want to use a ramdisk,
                                       # but you have to deal with the consequences yourself
svnadmin create MyProject.svn
for i in {00..39}; do
    gunzip < MyProject.$i.svn.gz | svnadmin load --quiet --force-uuid MyProject.svn
done

svnadmin dump --quiet MyProject.svn | gzip -9 > MyProject.svn/MyProject.svn.gz

mv MyProject.svn/MyProject.svn.gz /path/to/backup/

Importing the full backup into a ramdisk SVN repository is quite fast (it took me around 5 minutes), so it’s fine to just keep the backup and rebuild the repository anew after a restart.
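The chunk boundaries from Step 1 can also be computed instead of being assembled from zero-padded loop counters. This is only a sketch: chunk_ranges and its two parameters (last revision, chunk size) are hypothetical names, not part of any SVN tool. Because it counts with plain integers, the shell never gets to read a value such as 08 as an octal number.

```shell
# Print one "START:END" revision range per chunk, as expected by --revision.
# chunk_ranges is a hypothetical helper; adjust both values to your repository.
chunk_ranges() (
    last_rev=$1
    chunk_size=$2
    start=0
    end=$chunk_size
    while [ "$start" -le "$last_rev" ]; do
        # Clamp the final chunk to the last existing revision
        if [ "$end" -gt "$last_rev" ]; then end=$last_rev; fi
        echo "${start}:${end}"
        start=$((end + 1))
        end=$((end + chunk_size))
    done
)

chunk_ranges 35000 10000
```

Each printed line could then be passed to one svnrdump call, with --incremental for every range except the first.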

Step 2: Prepare export

SVN tracks committers only by a username, whereas git uses an email address with an additional, optional and arbitrary (display) name. Create a script, paste the following code into it and make it executable. Note that you must change the path to your local SVN repository.

#!/usr/bin/env bash
authors=$(svn log -q file:///absolute/path/to/MyProject.svn | grep -e '^r' | awk 'BEGIN { FS = "|" } ; { print $2 }' | sort | uniq)
for author in ${authors}; do
  echo "${author} = ${author} <${author}>";
done

# Then run the script and write its output to authors.txt
./<your-script> > authors.txt

Fix the content. svn-all-fast-export (and as far as I know all the other tools too) expects the format svn-user-name = Committer Name <email-address>. It’s really easy. If you don’t know each and every name, it doesn’t matter.
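If all committers share one mail domain, most of the fixing can be done mechanically. The following is a minimal sketch under that assumption; to_identity_map and the domain example.org are made up for illustration, and real display names still have to be filled in by hand.

```shell
# Read SVN user names (one per line) and print identity-map lines.
# to_identity_map and the default domain are hypothetical, not part of any tool.
to_identity_map() (
    domain=${1:-example.com}
    # "read" with the default IFS also trims the padding left by "svn log"
    while read -r user; do
        if [ -n "$user" ]; then
            echo "${user} = ${user} <${user}@${domain}>"
        fi
    done
)

printf ' alice \nbob\n' | to_identity_map example.org
```

The user names would normally come from the svn log pipeline of the script above rather than from printf.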

Step 3: Export

The ugliest part is creating the rules file. svn-all-fast-export expects a rules file that describes how to map paths in SVN to git branches and tags. For further options, see the samples on Gitorious.

create repository MyProject
end repository

match /MyProject/trunk/
  repository MyProject
  branch master
end match

match /MyProject/branches/([^/]+)/
  repository MyProject
  branch \1
end match

match /MyProject/tags/([^/]+)/
  repository MyProject
  branch tag/\1
end match

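# Alternative tag mapping. Rules are tried from top to bottom, so this rule is
# never reached while the one above is present. Keep only one of the two.
match /MyProject/tags/([^/]+)/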
  repository MyProject
  branch refs/tags/\1
end match

match /
  # ignore everything we don't know (remove/comment this to find missing mappings)
end match

As you can see, every mapping is prefixed with “MyProject”, because the tool isn’t able to strip the project path itself.
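Since the project name repeats in every rule, the rules file can be generated when several projects have to be extracted. This is a sketch only: make_rules is a hypothetical helper, it assumes the standard trunk/branches/tags layout, and it picks the refs/tags variant for tags.

```shell
# Print a rules file for one project of a multi-project repository.
# make_rules is a hypothetical helper; the layout below is an assumption.
make_rules() (
    project=$1
    # In an unquoted here-document "\\1" is written out as "\1"
    cat <<EOF
create repository ${project}
end repository

match /${project}/trunk/
  repository ${project}
  branch master
end match

match /${project}/branches/([^/]+)/
  repository ${project}
  branch \\1
end match

match /${project}/tags/([^/]+)/
  repository ${project}
  branch refs/tags/\\1
end match

match /
end match
EOF
)

make_rules MyProject
```

Redirecting the output to a file such as my-rules.rules gives the file that is passed to --rules.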

svn-all-fast-export --identity-map=authors.txt --rules=my-rules.rules /path/to/local/svn

If everything worked fine, you should now have the bare git repository in a subfolder named MyProject. Theoretically you are finished now.

Step 4: Cleanup

First we simply drop all branches that are already merged. You don’t need them anymore, and you can re-create a branch when you need it again.

# Skip the current branch, which is marked with "*" in the output
git branch --merged | grep -v '^\*' | xargs git branch -d

[BFG here]

git reflog expire --expire=now --all
git gc --aggressive --prune=now
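The BFG step itself is only marked as a placeholder above. A minimal sketch of what it could look like, assuming bfg.jar lies in the current directory and that blobs above a size limit should go: the function name bfg_cleanup, the jar location and the 50M default are assumptions, while --strip-blobs-bigger-than is an existing BFG option. By default the function only prints the commands, because rewriting history is destructive.

```shell
# Print the cleanup commands for a bare repository; set RUN=1 to execute them.
# bfg_cleanup, the bfg.jar location and the 50M default are assumptions.
bfg_cleanup() (
    repo=$1
    max_size=${2:-50M}
    run=echo                      # dry run: only show what would be executed
    if [ "${RUN:-0}" = 1 ]; then run=; fi
    $run java -jar bfg.jar --strip-blobs-bigger-than "$max_size" "$repo" &&
    $run git --git-dir="$repo" reflog expire --expire=now --all &&
    $run git --git-dir="$repo" gc --aggressive --prune=now
)

bfg_cleanup MyProject 100M
```

Check the printed commands first, and keep the backup from Step 1 around before running them for real.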

One thing that is a little bit annoying is that, for whatever reason, svn-all-fast-export exports the svn:ignore property into a .svnignore file. With some git “black magic” you can rename the file in the whole repository. I recommend doing it as the last step: it is by far the most time-consuming step when working with the git repository, but at least it is faster once the repository is smaller.

git filter-branch --index-filter 'git ls-files -s \
    | sed "s-\(\t\"*\)\.svnignore-\1.gitignore-" \
    | GIT_INDEX_FILE=$GIT_INDEX_FILE.new git update-index --index-info && mv "$GIT_INDEX_FILE.new" "$GIT_INDEX_FILE"' HEAD
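To see what the sed expression in the filter does, it helps to run it on a single line in the format printed by git ls-files -s (mode, blob hash, stage, a tab, then the path). This is just an illustration: rename_svnignore is a hypothetical wrapper, the hash is that of an empty blob, and the tab is passed literally so the sketch does not depend on sed understanding \t.

```shell
# Apply the same substitution as the index filter to lines read from stdin.
# rename_svnignore is a hypothetical wrapper around the sed expression.
rename_svnignore() (
    tab=$(printf '\t')
    # Only a path that starts with .svnignore right after the tab is rewritten
    sed "s-\\(${tab}\"*\\)\\.svnignore-\\1.gitignore-"
)

printf '100644 e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 0\t.svnignore\n' | rename_svnignore
printf '100644 e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 0\tREADME\n' | rename_svnignore
```

The first line comes back with .gitignore as its path, the second one is left untouched.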

However, this isn’t completely sufficient. svn-all-fast-export doesn’t convert svn:ignore properties in subfolders. A good start to fix this manually (at least in the master branch) is to run svn propget -R svn:ignore and compare the output with the .gitignore. It is also a good idea to review whether it really covers everything that should be ignored. I wouldn’t spend too much time on this, because it only affects development and only “living” branches.

Jan 11, 2014 - Strings are constants too

In our development team, we have a (more or less strict) rule: If it’s a constant value, make a constant out of it. In many cases this makes sense, or at least increases clarity or readability.

class Constant {
    const HOUR = 3600;
    const DEFAULT_TIMEOUT = 120;
}

(Aside: We haven’t left the “classes for everything”-paradigm yet)

As you can see, both values have at least a small semantic value and either increase readability (HOUR) or may change over time. But sometimes during code reviews I see comments like this (simplified example):

// Constant string: Make a constant out of it

Let’s assume I make a (class) constant out of it. Might it change some time in the future? Almost certainly not. Can it make anything clearer? If you know the manual (hopefully you do), probably not. Does it increase readability? In this case it can even make things worse (maybe not that obvious at first glance):

use Foo\Bar\Constant;

So why is this worse? It is longer, which isn’t bad on its own, just unnecessary. It has a reference to an (otherwise) unrelated class, which is also acceptable. And it decreases clarity, because its name doesn’t tell you whether or not it uses leading zeros, or whether it is in 12-hour or 24-hour format. The solution would be something like

class Constant {
    const HOUR_12H_WITH_LEADING_ZEROS = 'h';
    const HOUR_24H_WITH_LEADING_ZEROS = 'H';
}

Now remember the other date-related formatting characters. Or think of combining them… Doesn’t sound fun anymore, does it? What about “weekday”? Does that tell you whether it’s a numeric value, the name, or the shortened name? That sounds like it will end up in a huge bunch of constants with unnecessarily long names for something you can read in the official, publicly available manual.

Whenever I read “That could be a constant” I usually think “Well, a constant value is a constant too”. Sometimes there is simply no good reason to replace a constant value with a constant. Using 3600 as a “timestamp”-ish parameter should be clear to every developer. If you are concerned that somebody might misunderstand it, try 60*60 instead. If you have a constant DEFAULT_TIMEOUT, you may use it only once, because other connections may need different defaults. Is it worth creating this separation between the value and its use? If you want to change it, you’ll probably not look at the constant first, but at the call where it is used, so the constant is always an indirection. And when you want a different timeout for this single connection, you’ll remove the reference to the constant anyway and maybe even add a new constant DEFAULT_TIMEOUT_XY_CONNECTION.

Think before you scream for constants. Not every constant is helpful. Some are at best superfluous. But having all these indirections (because it won’t end with a single superfluous constant) can get really distracting.

Jan 4, 2014 - New Domain

I moved the whole blog to a new domain. All other non-country-dependent TLDs were already taken by a pickle manufacturer.