Disclaimer: I wrote this post a while ago and I am not entirely sure how accurate it was by the time I stopped proof-reading it. So handle it with care and always keep a backup. The information provided here might be outdated, or even wrong. I hope it still helps someone get rid of overly large SVN repositories.
Multi-project Subversion repositories sound convenient at first glance: everything
in one place and only one system to manage. If you ever consider switching to git,
though, they turn out to be a bad idea. Usually, transforming a Subversion repository
into a git repository is quite easy, but that only holds for repositories of a
reasonable size. Multi-project repositories tend to grow quite big, and the usual
transformation via git svn clone <svn-url> slows down extremely on huge repositories.
I once had to convert some projects from an SVN repository that ended up with
141 more or less active projects and around 470000 commits…
A short overview of what is required
- Time. It doesn’t take weeks anymore, but it still requires some time (in the end I got it working in around 4 hours, counted from the point where I already had a local repository).
- Fast storage. The whole process is extremely IO-intensive. An SSD works fine; even better, if there is enough RAM available, is a RAM disk of around 4 to 8 GB.
- svnrdump. This has been part of Subversion since 1.7.
- I used svn-all-fast-export for the actual conversion (it’s available in the Ubuntu repository). svn-fast-export should work too, but I had issues with it and stopped investigating.
- BFG Repo-Cleaner to get rid of large files and files containing sensitive data.
- Of course Subversion (especially svnadmin) and git.
Step 1: Create a local copy
If you already have filesystem access to the repository, you can skip this step. It creates a local copy of a remote repository, so that we don’t have to perform every operation over the network.
This example downloads chunks of 10000 commits each, up to commit 400000. Since the chunks are independent of each other, they can also be fetched in parallel. Remember to update the values so they match your repository. Downloading in chunks works around the limitations of HTTP to some degree.
# First one without "--incremental"
svnrdump dump \
    --revision 0:10000 \
    http://example.com/path/to/svn/MyProject | gzip -9 > MyProject.00.svn.gz
# Remaining chunks; 10# forces decimal, otherwise bash rejects 08 and 09 as invalid octal
for i in {01..39}; do
    svnrdump dump \
        --incremental \
        --revision $((10#$i))0001:$((10#$i + 1))0000 \
        http://example.com/path/to/svn/MyProject | gzip -9 > MyProject.$i.svn.gz
done
# Create local repository and import the chunks
mkdir MyProject.svn
sudo mount -t tmpfs none MyProject.svn # Skip this, if you don't want to use a ramdisk,
# but you have to deal with the consequences yourself
svnadmin create MyProject.svn
for i in {00..39}; do
gunzip < MyProject.$i.svn.gz | svnadmin load --quiet --force-uuid MyProject.svn;
done;
svnadmin dump --quiet MyProject.svn | gzip -9 > MyProject.svn/MyProject.svn.gz
mv MyProject.svn/MyProject.svn.gz /path/to/backup/
Importing the full backup into a ramdisk SVN repository is quite fast (it took me around 5 minutes), so it’s fine to just keep the backup and rebuild the repo anew after a restart.
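Before starting a download that runs for hours, it can help to print the chunk boundaries once and check them against your repository. The following is only a sketch: the function name chunk_range and the variable CHUNK_SIZE are made up for this illustration, and the numbers mirror the example above (10000 commits per chunk, chunks 00 to 39). Bash reads numbers with a leading zero as octal in arithmetic, so the index is forced to base 10.

```shell
#!/usr/bin/env bash
# CHUNK_SIZE mirrors the 10000-commit chunks of the example (assumption: adjust to taste).
CHUNK_SIZE=10000

# Print the "first:last" revision range for a zero-padded chunk index such as "08".
chunk_range() {
    local i=$((10#$1))   # 10# forces decimal, so "08" is not parsed as octal
    local first=$(( i == 0 ? 0 : i * CHUNK_SIZE + 1 ))
    local last=$(( (i + 1) * CHUNK_SIZE ))
    echo "${first}:${last}"
}

# List every dump file together with the revision range it should contain.
for i in {00..39}; do
    echo "MyProject.$i.svn.gz  $(chunk_range "$i")"
done
```

The last line printed should end at the highest revision you want to fetch; if it doesn't, adjust the upper bound of the loop.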
Step 2: Prepare export
SVN tracks committers only by a username, whereas git uses an email address with an additional, optional and arbitrary (display) name. Create a script named extract-authors.sh, paste the following code into it and make it executable. Note that you must change the path to your local SVN repository.
#!/usr/bin/env bash
# Collect the unique committer names from the SVN log
authors=$(svn log -q file:///absolute/path/to/MyProject.svn | grep -e '^r' | awk 'BEGIN { FS = "|" } ; { print $2 }' | sort | uniq)
# Print one placeholder mapping per author
for author in ${authors}; do
    echo "${author} = ${author} <${author}@example.com>";
done
./extract-authors.sh > authors.txt
Fix the content. svn-all-fast-export (and as far as I know all the other tools too) expects
the format svn-user-name = committer name <email-address>. It’s really easy. If you don’t know
each and every name, it doesn’t matter.
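After editing authors.txt by hand, a quick format check can catch typos before the long export starts. This is a minimal sketch, not part of any of the tools: the function name check_authors and the regular expression are my own assumptions about what a well-formed line looks like (user name, " = ", display name, email address in angle brackets).

```shell
#!/usr/bin/env bash
# Print every line (with its line number) that does not look like
#   svn-user-name = Display Name <user@host>
# and return success only if no such line exists.
check_authors() {
    ! grep -vnE '^[^=]+ = [^<>]+ <[^<>@ ]+@[^<>@ ]+>$' "$1"
}
```

Calling check_authors authors.txt prints nothing when every line matches; otherwise it lists the offending lines and returns a non-zero status.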
Step 3: Export
The ugliest part is creating the rules file. svn-all-fast-export
expects a rules file that describes how to map paths in SVN to git
branches and tags. For further options, see the Gitorious samples.
create repository MyProject
end repository
match /MyProject/trunk/
repository MyProject
branch master
end match
match /MyProject/branches/([^/]+)/
repository MyProject
branch \1
end match
match /MyProject/tags/([^/]+)/
repository MyProject
branch tag/\1
end match
match /MyProject/tags/([^/]+)/
repository MyProject
branch refs/tags/\1
end match
match /
# ignore everything we don't know (remove/comment this to find missing mappings)
end match
As you can see every mapping is prefixed with “MyProject”, because this tool isn’t able to strip the project path itself.
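With many projects in one repository, writing these blocks by hand gets tedious, so the rules can be generated instead. The sketch below follows the layout of the rules file above; the function names project_rules and all_rules as well as the project names in the usage note are hypothetical. I assumed the refs/tags variant for tags, so swap in the other one if you prefer it.

```shell
#!/usr/bin/env bash
# Emit the rules for a single project, following the layout of the example above.
project_rules() {
    local project=$1
    cat <<EOF
create repository ${project}
end repository
match /${project}/trunk/
repository ${project}
branch master
end match
match /${project}/branches/([^/]+)/
repository ${project}
branch \1
end match
match /${project}/tags/([^/]+)/
repository ${project}
branch refs/tags/\1
end match

EOF
}

# Emit the rules for all given projects, followed by one catch-all rule.
all_rules() {
    local project
    for project in "$@"; do
        project_rules "$project"
    done
    # The catch-all goes last and only once, as in the example above.
    printf 'match /\nend match\n'
}
```

A call such as all_rules ProjectA ProjectB > my-rules.rules would then write one file covering both projects.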
svn-all-fast-export --identity-map=authors.txt --rules=my-rules.rules /path/to/local/svn
If everything worked fine, you should now have the bare git repository in a subfolder named
MyProject. Theoretically, you are finished now.
Step 4: Cleanup
First we simply drop all branches that are already merged. You don’t need them anymore, and you can re-create a branch when you need it again.
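# Delete merged branches; the current branch (marked with "*") is skipped
git branch --merged | grep -v '^\*' | xargs -r git branch -d
# Run BFG Repo-Cleaner here to remove large files and sensitive data
git reflog expire --expire=now --all
git gc --aggressive --prune=now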
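For the BFG step in the cleanup above, an invocation could look roughly like the following. Treat it as a sketch under assumptions: the location of the jar (BFG_JAR), the repository folder (REPO) and the 10M size limit are placeholders I picked for illustration, and the helper names run and cleanup_repo are made up. With DRY_RUN=1 the commands are only printed, which is handy for checking them before they rewrite history.

```shell
#!/usr/bin/env bash
# Assumed locations; override them via the environment.
BFG_JAR=${BFG_JAR:-bfg.jar}
REPO=${REPO:-MyProject}

# Run a command, or only print it when DRY_RUN is set.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Strip large blobs with BFG, then expire the reflog and repack,
# so the removed objects actually disappear from disk.
cleanup_repo() {
    run java -jar "$BFG_JAR" --strip-blobs-bigger-than 10M "$REPO"
    run git -C "$REPO" reflog expire --expire=now --all
    run git -C "$REPO" gc --aggressive --prune=now
}
```

Run DRY_RUN=1 cleanup_repo first to see the three commands, then cleanup_repo to execute them. BFG has further options, for example for deleting files by name or replacing text, which would go into the first command.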
One thing that is a little bit annoying is that, for whatever reason, svn-all-fast-export
exports the svn:ignore property into a .svnignore file. With some git “black magic” you can
rename the file in the whole repository. I recommend doing this as the last step, because it
is by far the most time-consuming step when working with the git repository, but it is at
least faster once the repository is already smaller.
git filter-branch --index-filter 'git ls-files -s \
| sed "s-\(\t\"*\).svnignore-\1.gitignore-" \
| GIT_INDEX_FILE=$GIT_INDEX_FILE.new git update-index --index-info && mv "$GIT_INDEX_FILE.new" "$GIT_INDEX_FILE"' HEAD
However, this isn’t completely sufficient. svn-all-fast-export doesn’t convert
svn:ignore properties in subfolders. A good start to fix this manually (at least in the
master branch) is to run svn propget -R svn:ignore and compare the result with the
.gitignore. It is also a good idea to review whether it really covers everything that
should get ignored. I wouldn’t spend too much time on this, because it only affects
development and it only affects “living” branches.
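To see what the sed expression inside the filter-branch command actually rewrites, it can be tried on a few index lines by hand. The sketch below wraps the same expression in a function (rename_svnignore is a made-up name) and feeds it two invented entries in the format of git ls-files -s: mode, blob id, stage, a tab, then the path. It assumes GNU sed, which understands \t as a tab.

```shell
#!/usr/bin/env bash
# Apply the rename expression from the filter-branch command to
# "git ls-files -s"-style lines read from stdin (GNU sed assumed for \t).
rename_svnignore() {
    sed "s-\(\t\"*\).svnignore-\1.gitignore-"
}

# Two made-up index entries: one in the repository root, one in a subfolder.
printf '100644 1111111 0\t.svnignore\n100644 2222222 0\tsrc/.svnignore\n' | rename_svnignore
```

The expression anchors on the tab in front of the path, so only the entry in the repository root is renamed to .gitignore, while src/.svnignore passes through unchanged. If your export does contain such files in subfolders, the expression would need to be adapted.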