Migrating from SVN to Git while Splitting Repository

As the final stage of a recent project we transferred our large Subversion repository into multiple Git repositories. We wanted to retain as much of the Subversion history as possible. This turned out to be considerably more complex than expected! See the end for links to pages that various other people have written about this process.

We initially considered splitting up the repository at the same time as doing the migration. However, it turned out to be very difficult to preserve the full history this way. Doing the process in two steps – migration then splitting – was a better approach. Another reason for doing two steps was that not all of the original repository was going to be split off. A large amount of content that was legacy would be left behind as an archive.

So the plan was that we would do an initial migration to Git. We would retain this as a standalone read-only archive of everything (about 1.6GB). Active ongoing content would then be split (copied) into separate Git repositories (which would only be about 100MB in total).

Subversion to Git Migration

The normal migration approach for Subversion to Git seems to be to use git svn. This tool, however, appears to be more of a live synchronisation tool that lets people use both Git and Subversion at the same time. For this reason, and others, we did not find it ideal for a one-off migration. We eventually discovered an excellent tool called reposurgeon, which is designed precisely for one-off migration. This tool is hugely configurable, fast, and copes with a lot of the spurious baggage that comes with an old Subversion repository (which in our case had started out life as an even older CVS repository). Another advantage of reposurgeon is that it automatically creates a Makefile to define migration options and manage each migration step. This encapsulates the whole process, which was important as we needed a number of migration attempts before everything looked like it had worked correctly.

The first problem we hit was that our Subversion repository had not been structured into trunk and branches when it was created. Only later (when a formal branch was actually needed) had it been reorganised into this standard structure. The reposurgeon tool assumes the standard structure, so history prior to the restructuring and creation of trunk was effectively lost: it was buried in pseudo branches that reposurgeon creates to represent the previous root structure, which complicates history extraction when subsequently splitting up the repository. Since we had only ever had one formal branch (albeit a big one) we opted to migrate using a flat structure (by using --nobranch as a read option in the reposurgeon Makefile). This meant that the repository was treated as an ordinary directory tree from the root – trunk and branches had no special meaning.

The second problem was that, by default, Subversion revision numbers were not preserved. These numbers were important to us because they had been used frequently: in comments in our issue tracking system (as a reference to the commit where a fix had been made); in formal release notes; and within commit messages (for example, simple cross-references like “this change is also associated with the fix in revision r12345”). To resolve this we added the --legacy write option in the reposurgeon Makefile so that every commit message was automatically annotated with the original Subversion revision number.

After this the migration went pretty smoothly and we had a clean Git repository with the full Subversion repository history.

There were a couple of minor leftover problems with our new Git repository. One of the directories in our Subversion repository had contained content from a third party which was actually just a Git repository. At the end of the migration git status reported lots of error: Invalid path 'trunk/tools/.../.git/...' messages. Since we didn’t care about retaining the actual embedded .git directory content we just did git reset --hard to remove these errors and leave a clean repository.

The other issue was that some files in the Subversion repository had changed over time from being normal files with content to being symbolic links and then back to being normal files again. In the migrated Git repository some of these files (but not all) were broken: the file was still a symbolic link, pointing to a file whose name was the concatenation of what should have been the file content! We did not pick up on this error until later, when a clone of the split off repository containing these files failed to check out because the resulting file names were too long. We remain unclear what caused this. Presumably something odd happened at the time of the particular Subversion commit that reverted the symbolic link back to a file, and reposurgeon was consequently unable to deal with it automatically. We fixed this in the end by carefully rewriting the offending commit in the history of the split off Git repository.
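A check along the following lines could help to spot such entries before a clone fails. It is a minimal sketch rather than anything from our original process: it relies on Git recording symbolic links with mode 120000 in its index, and the helper name list_symlinks and the sample input are hypothetical.

```shell
# Hypothetical helper: print the paths of symbolic-link entries.
# Reads `git ls-files -s` output on stdin: "<mode> <object> <stage><TAB><path>".
list_symlinks() {
  awk -F '\t' '{ split($1, meta, " "); if (meta[1] == "120000") print $2 }'
}

# Typical use inside a repository (not run here):
#   git ls-files -s | list_symlinks
# Made-up input standing in for real index output:
printf '100644 aaaa 0\tREADME\n120000 bbbb 0\ttools/link\n' | list_symlinks
```

Each path reported can then be inspected by hand; git cat-file -p with the entry's object id prints the link target, which makes an absurdly long target easy to recognise.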

Splitting the Git Repository

There seem to be two approaches to this: git subtree, or git filter-branch with the --subdirectory-filter option. Both scan the commit history and keep just those commits containing the specified path (removing the rest). Both have limitations if you want the full history. With subtree it is only possible to scan one branch, so history recorded in any other branch will be lost. This was a significant issue for us until we chose to migrate the Subversion repository as a flat directory structure. However, the principal failing of both approaches is that the specified path must have remained the same throughout the repository history. If, at any point in the history, the specified path has changed name or parentage, or content has arrived in that path from elsewhere in the repository, then all of that earlier history will be lost. Since a significant amount of restructuring at both file and directory level had taken place in our repository over time, this limitation was a serious obstacle to preserving full history.
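The path limitation can be reproduced in a throwaway repository. The sketch below is illustrative only: the file names, commit messages and identity settings are made up, and it assumes a Git installation that still ships filter-branch (FILTER_BRANCH_SQUELCH_WARNING merely silences the deprecation notice in newer versions).

```shell
# Throwaway demonstration; every path and name below is made up for illustration.
set -e
export FILTER_BRANCH_SQUELCH_WARNING=1
demo=$(mktemp -d)
cd "$demo"
git init -q .
git config user.name "Demo User"
git config user.email "demo@example.invalid"

# Commit 1: the file starts life outside the directory we will split off.
mkdir old
echo "first version" > old/a.txt
git add old/a.txt
git commit -q -m "add a.txt under old/"

# Commit 2: the file is moved into trunk/root1.
mkdir -p trunk/root1
git mv old/a.txt trunk/root1/a.txt
git commit -q -m "move a.txt into trunk/root1"

before=$(git rev-list --count HEAD)

# Keep only the history under trunk/root1. The subtree equivalent would be
#   git subtree split --prefix=trunk/root1 -b some-branch
git filter-branch --subdirectory-filter trunk/root1 -- --all >/dev/null 2>&1

after=$(git rev-list --count HEAD)
echo "commits before: $before, after: $after"
```

The commit made while the file still lived under old/ does not survive the filter, which is exactly the loss described above.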

One option would be to just accept the loss of history. This was not entirely unreasonable as we were retaining the original complete migration into Git as a permanent archive. However, we are not able to make that archive public because of some of the legacy content, so a lot of relevant history for content in active development would have been unavailable to anyone else. Nor would it have been convenient, even for a local developer, not to have the full history of a file self-contained within the split off repository – so you couldn’t just do git log --follow FILE, for example. Instead you would have to clone 1.6GB locally just to access the tiny fraction of it that holds one file’s full history.

In the end we managed to find a way to do what we wanted (with one caveat) by combining several techniques and by taking advantage of the full file rename history available in our old Subversion repository.

Our approach was to first identify all the current files in a directory that was being split off. For each of these files the Subversion repository was then accessed to extract a full file name and path history. Then every commit in the Git repository was filtered so that it contained only these files (when they existed in that commit). The result became the split repository, where every file contained its full history irrespective of whether it changed name or path in its history. The precise steps we took for each directory being split off are outlined below.

First we created a file containing a list of all the files (including all previous names and locations) in the directory to be split off, trunk/root1 in this example case. We built this from a checked out working copy of the Subversion repository.

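cd /path/to/repo/svn/working/copy
REPOURL=$(svn info | awk '/^URL:/{print $2}')
# For each current file, look up its URL at every revision in its log, then strip the repository prefix
find trunk/root1 -type f | while IFS= read -r f; do
  svn log -q "$f" </dev/null | grep -o -E '^r[0-9]+' | tr -d r | xargs -I {} svn info -r {} "$f" | grep '^URL:'
done | sed -e "s?^URL: $REPOURL/??" | sort -u > /tmp/root1.files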
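Before filtering, it can be worth checking how many of the listed paths lie outside the directory being split, since those are the earlier names and locations that the simpler approaches would lose. This is a minimal sketch; the function name count_outside and the sample paths are hypothetical.

```shell
# Hypothetical helper: count listed paths that are NOT under the given prefix,
# i.e. earlier names or locations of files that now live in the split directory.
count_outside() {
  list=$1
  prefix=$2
  awk -v p="$prefix/" 'index($0, p) != 1 { n++ } END { print n + 0 }' "$list"
}

# Made-up sample standing in for /tmp/root1.files
sample=$(mktemp)
printf '%s\n' trunk/root1/a.c trunk/root1/b.c src/old/a.c root1/b.c > "$sample"
count_outside "$sample" trunk/root1
rm -f "$sample"
```

Against the real list this would be run as count_outside /tmp/root1.files trunk/root1. A result of zero suggests that nothing in the directory was ever renamed or moved in from elsewhere.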

Next we created a clone of our migrated Git repository so that everything we did was isolated and the original could be used again to split off another repository.

cd /tmp
git clone --no-hardlinks /path/to/repo.git clone1
cd clone1

Then we ran the filter. We used --index-filter as it is a lot faster than --tree-filter and has an identical result in this particular case. For each commit this filter first removes all the files and then selectively re-instates only those files which existed in that commit and which are also listed as being part of the split directory (eventually, even if not at this particular point). The filter also removes any resulting empty commits and re-writes tags.

git filter-branch --prune-empty --index-filter 'git rm --cached -qr -- . && cat /tmp/root1.files | xargs -n 1 git reset -q $GIT_COMMIT --' --tag-name-filter cat -- --all
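A quick sanity check after a filter like this is to confirm that nothing outside the list survived in the latest commit. The helper below is a hypothetical sketch (the name unexpected_paths and the sample data are made up); it only compares path names and knows nothing about Git itself.

```shell
# Hypothetical helper: print any path read from stdin that is not in the list file.
# -F fixed strings, -x whole-line match, -v invert the match, -f read patterns from a file.
unexpected_paths() {
  grep -v -x -F -f "$1" || true   # grep exits non-zero when nothing is printed
}

# Made-up data standing in for /tmp/root1.files and for `git ls-files` output
list=$(mktemp)
printf '%s\n' trunk/root1/a.c trunk/root1/b.c > "$list"
printf '%s\n' trunk/root1/a.c trunk/other/x.c | unexpected_paths "$list"
rm -f "$list"
```

In the filtered clone the equivalent would be git ls-files | unexpected_paths /tmp/root1.files, which should print nothing if the filter behaved as described.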

Now we had a modified repository just containing the current files in the root1 directory (actually trunk/root1) but with their full history. Next we tidied up by cloning (a simple way of leaving behind all the objects that have now been orphaned by the filter) and did some shuffling so that the content of trunk/root1 became the content at the top level of the new repository.

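cd /tmp
mkdir root1
cd root1
# Pull the filtered history into a fresh repository, leaving orphaned objects behind
git init
git remote add origin ../clone1
git pull origin master
git remote rm origin
# Move trunk/root1 up to ./root1, then remove any untracked leftovers (such as the emptied trunk directory)
git mv trunk/root1 .
git clean -d -x -n
git clean -fd
# Lift the content of root1 to the top level (note that the * glob does not match dotfiles)
git mv root1/* .
git clean -d -x -n
git clean -fd
# Review the result, then commit the move
git status -s
git commit
git status
git log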
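To confirm that a file's history really does reach back past a rename, git log can be compared with and without --follow. The sketch below does this in a throwaway repository; the file names, commit messages and identity settings are made up for illustration.

```shell
# Throwaway demonstration; names and content are made up for illustration.
set -e
demo=$(mktemp -d)
cd "$demo"
git init -q .
git config user.name "Demo User"
git config user.email "demo@example.invalid"

printf 'line one\nline two\nline three\n' > old-name.txt
git add old-name.txt
git commit -q -m "add file under its old name"

git mv old-name.txt new-name.txt
git commit -q -m "rename file"

# Count the commits reported for the current name, without and with rename following.
plain=$(git log --oneline -- new-name.txt | wc -l | tr -d ' ')
followed=$(git log --follow --oneline -- new-name.txt | wc -l | tr -d ' ')
echo "without --follow: $plain commit(s); with --follow: $followed commit(s)"
```

In a split repository the same comparison on a file known to have been renamed shows whether the commits from before the rename were carried across.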

After this we simply cloned again into a bare repository and pushed this onto our Git server. Then we moved onto the next directory to be split off.

There is one major caveat with this approach. Only current files (those in the most recent commit) and their history are included. Any files that used to exist in the split directory path but have at some point been deleted will not be included. This may or may not matter. For us it did not, as one of the reasons for splitting into separate repositories was to lose the 1.5GB of legacy files that were no longer of any relevance, so we did not want them retained. We also still had the fallback of referring to the original migrated archive, with all content, as a last resort.

Another problem with this approach was performance. Our repository had over 12,000 commits. The filter has to process each commit one by one. This was alright when the number of files (and their path and name history) for a split directory was small, but when it was in the thousands or more (as some were in our case) the time to process each commit became significant (from fractions of a second to tens of seconds).
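Part of the cost comes from xargs -n 1 starting one process per listed path for every commit. The sketch below only counts process launches, with echo standing in for git reset; the figure of 3,000 paths is a hypothetical example, and whether a single git reset would accept a whole batch when some of the paths are absent from a commit is an assumption that would need checking before relying on it.

```shell
# `echo` stands in for `git reset`, so this only counts how often xargs starts a command.
paths=$(mktemp)
awk 'BEGIN { for (i = 1; i <= 3000; i++) print "trunk/root1/file" i }' > "$paths"

# One launch per path, as with `xargs -n 1`.
one_per_path=$(xargs -n 1 echo < "$paths" | wc -l | tr -d ' ')
# As many paths per launch as the command line allows.
batched=$(xargs echo < "$paths" | wc -l | tr -d ' ')

echo "launches per commit with -n 1: $one_per_path"
echo "launches per commit batched:   $batched"
echo "launches over 12000 commits with -n 1: $((one_per_path * 12000))"
rm -f "$paths"
```

With thousands of paths and over 12,000 commits the number of launches runs into the tens of millions, which is consistent with per-commit times growing from fractions of a second to tens of seconds.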

There is probably a better way to do this, perhaps by using this approach to obtain the unique file path changes and then using the filter to graft the specific commit history at each change point.