Monday, August 13, 2012

SVN Repository Migration

I work for a product-based IT company. They started using svn in 2004 and have maintained a single repository for everything: for projects, for docs, for private experiments, for dogs and for cats! The repository has grown and grown and grown like a well-fed tree, reaching around 275 GB in 2012.

The backup process was taking around 6 hours, and a senior manager was threatening to get rid of all the history and export the content to a new repository.

Now, as software engineers, we know that's not a very good thing to do. How many times do we put on our overalls, take our shovels and dig deep into the mystic history of svn in search of skeletons and criminals? No, we NEED that history!

I, the poor little new joinee who never had anything to do with any kind of repository maintenance, was given the mammoth task of finding a solution, and finding it quick. Break that big evil repo into pieces I was told.

What kind of hammers can I use for this? As a first step I downloaded the svn book for 1.7, a document freely available on the internet and compiled by the svn crew. To cut a long story short, here are the options I considered, in a nutshell.


Investigated Options


Copy & Delete 
Why not copy the entire repository and delete the unwanted projects along with their history? That should be simple enough. There is a little problem though: SVN doesn't provide an option to delete part of a repository along with its history. You simply can't get rid of your past in svn. If you really, really want to do it, svn makes you work real hard for it. Well... isn't there a command called svn delete? Yes, but it only marks the files as deleted in a new revision. Everything is retained, and the repo size actually increases. Conclusion: cannot be used.

svn export
This command exports a clean directory tree with no history and no metadata (those pesky little .svn files that like to hide). Not a good option from a developer's point of view, since the whole point of this exercise is to save the history. But it can of course be used for parts that do not require history.

svnsync
This is the Subversion remote repository mirroring tool. It lets you mirror a repository and keep the mirror up to date by syncing it with the original from time to time. svnsync does the magic by replaying the revisions of one repository into another. And the good news is that, yes, it can be used to break a big repo into several smaller repos by creating mirrors of subtrees. Tests have shown that svnrdump, which does things in a very similar way, performs better. You may also get some nasty errors due to property validation (take a look at the E125005 roadblock below for details), and there is no option to skip validation. Fixing the errors manually is possible, but not feasible considering the magnitude of the task.

svnadmin dump/load
These commands dump the contents of the file system and load them into a new repository. But svnadmin dump cannot dump a subtree: it's the entire thing or nothing. It can't help us alone, but we can use it together with something else called svndumpfilter.

svnrdump dump/load
This is a shiny new feature available in SVN 1.7. It proudly announces that it can be used for Remote Repository Data Migration. In simple terms, that means you can create a dump remotely even if you're not an admin and even if you're not logged in to the svn server. It's the cousin of svnadmin dump, and a more flexible one. You can dump a subtree of the repository with it, something svnadmin dump cannot do. The same property validation errors occur when loading as with svnsync. But the good news is that svnadmin load can skip validation, and skipping validation does no harm.

svndumpfilter
Ladies and gentlemen, let me introduce you to the hero of the day! (Or should I say the hero who saved my day)... svndumpfilter! It is a utility for removing history from a Subversion dump file by either excluding or including paths beginning with one or more named prefixes. As per the specs, it can operate on any dump file, filter it and give you a dump that has only the desired content.
There's a small glitch though. We know two ways of creating svn dumps: svnadmin dump and svnrdump. As of svn 1.7, svnadmin dump creates a version 2 dump file by default; you can ask for version 3 if you want (with --deltas). As the new kid on the block, svnrdump creates only version 3 dumps. svndumpfilter, being old-fashioned, doesn't like to deal with the new type of dump (it's a known bug). So we must give it a version 2 dump, created by svnadmin dump. svndumpfilter can migrate multiple projects together. It's not possible to use include and exclude in the same run, but you can always do them one after another: for example, include some stuff to get a dump, and then filter the resulting dump to exclude things from it.


Feasible Solutions

After analyzing all of the above, the following two were identified as feasible solutions. Multiple projects can be migrated to the same repository using either option.
  • Option 1: svnrdump dump & svnadmin load. The destination repository needs to be in SVN 1.7 (remember, svnrdump is new). Since it's a new feature, there can be bugs (after all, we're all developers).
  • Option 2: svnadmin dump & svndumpfilter. This has been time tested and offers some neat options that are handy.


Revision Numbers

This is something a lot of people wanted to keep. Bugzilla matched bugs with revision numbers. Release documents contained so many references to revision numbers. X was born in revision 2563, x fell in love with y in revision 5845, they got married in revision 6541 and had their first kid in revision 5425, and the story continued for generations. Now we can't blame anyone for being sentimental about revision numbers, can we?

svndumpfilter gives some nice options regarding revisions. You can drop empty revisions and renumber the remaining ones, or you can keep the original ones as they are. So we thought it was easy. It just happened that I had to create some directories in the new repository before I loaded my dump into it. Repository creation was revision 0 and creating these directories was revision 1. And when it started loading, all my revision numbers were getting incremented by one!

<<< Started new transaction, based on original revision 1
------- Committed new rev 2 (loaded from original rev 1) >>>

<<< Started new transaction, based on original revision 2
------- Committed new rev 3 (loaded from original rev 2) >>>

Well, my supervisor and I figured that could be forgiven; it's just one. We could just tell people to look at the x-1 revision. And then suddenly a light bulb flashed! Maybe, just maybe, if we dump from the second revision and load it again? And yes, to our delight, it worked.

svnadmin dump /svn/tempRepo -r 2:HEAD | svnadmin load --bypass-prop-validation /svn/destinationRepo

This works because, without --incremental, the first dumped revision carries the whole tree as it stood at that revision. The directories created in revision 1 come along with it, and the original revision 1 lands on revision 1 again.

Option 1: svnrdump dump & svnadmin load
  • Revision numbers are identical in source and destination
Option 2: svnadmin dump & svndumpfilter
  • Can keep original revision numbers
  • If filtering causes any revision to be empty, can remove these revisions from the dump
  • Can renumber revisions that remain after filtering

We decided to go with option 2 for the sake of revision numbers.
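
If you want to be sure no revision number has shifted, the load output itself can tell you. Here is a minimal sketch, assuming you saved the output of svnadmin load to a file; the function name renumbered_revs and the file name load.log are made up for illustration. As far as I can tell, svnadmin prints the "Committed new rev N (loaded from original rev M)" form only when the two numbers differ, so any line it finds is a shifted revision.

```shell
#!/bin/sh
# Hypothetical helper (not part of Subversion): list revisions whose number
# changed during a load, based on a saved copy of the svnadmin load output.
# Assumption: the output was captured beforehand, e.g.
#   svnadmin load /svn/destinationRepo < A.dump > load.log
renumbered_revs() {
    # $1: file holding the saved svnadmin load output
    # Prints one "original -> new" pair per renumbered revision.
    sed -n 's/.*Committed new rev \([0-9][0-9]*\) (loaded from original rev \([0-9][0-9]*\)).*/\2 -> \1/p' "$1"
}
```

Running renumbered_revs load.log and getting no output at all is the happy case.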


Dependency Resolution

Creating a dump sounds like a very easy thing to do: just run the command and wait, right? Yes, that can be the case if the project teams at your company are enemies who have sworn never to touch each other's code. But in reality, people think svn is such a cool tool and use it to copy and move stuff from completely random places to other completely random places. What they don't know is that svn watches their every move and records it. Say you copy something like /projectA/x/y/fancyCode.java to somewhere in your projectB: you have created a dependency, and if you want to filter project B you have to include that copy source path too.


Handling Binaries

Svn stores stuff using a delta-based algorithm. As the svn book puts it neatly:

"To keep the repository small, Subversion uses deltification (or delta-based storage) within the repository itself. Deltification involves encoding the representation of a chunk of data as a collection of differences against some other chunk of data. If the two pieces of data are very similar, this deltification results in storage savings for the deltified chunk: rather than taking up space equal to the size of the original data, it takes up only enough space to say, “I look just like this other piece of data over here, except for the following couple of changes.” The result is that most of the repository data that tends to be bulky (namely, the contents of versioned files) is stored at a much smaller size than the original full-text representation of that data."

But things get ugly with binaries. All our jars, docs and pdfs fall into this category. They cannot be diffed in any useful way. So when your tech writer lady corrects a typo and checks in a huge document, svn ends up saving what amounts to another copy of the document as a new revision.

In our repository there was a folder to which people checked in jars. Excluding this folder reduced the repository size by more than 50%. We did things the smart way and didn't try to add it to svn again. It was decided to maintain it in a normal directory. Who checked in what, and when, didn't matter for these jars, which are third-party tools. The IT folks promised to enforce the necessary permissions so that these won't be deleted by anyone.

There was also a folder with documents owned by the tech writers. This part of the repository was twice as big as the space taken by the company's biggest project. We were keeping the old repo as read-only anyway, and our dear tech writers agreed to kiss the history goodbye for the docs folder. Docs were given a fresh start in life, as they were simply checked in as new content to the new repos.

Loading multiple projects to the same repository

Specify the parent directory with --parent-dir; otherwise the content is loaded to the root. The parent directory must already exist in the destination repository.
svnadmin load --bypass-prop-validation /svn/destinationRepo --parent-dir A < A.dump
svnadmin load --bypass-prop-validation /svn/destinationRepo --parent-dir B < B.dump



Roadblocks You May Encounter


E125005

svnadmin load may fail with the following error.

svnadmin: E125005: Invalid property value found in dumpstream; consider repairing the source or using --bypass-prop-validation while loading.
svnadmin: E125005: Cannot accept 'svn:log' property because it is not encoded in UTF-8

svnsync reports the same problem as follows.

Committed revision 35670.
Copied properties for revision 35670.
svnsync: At least one property change failed; repository is unchanged
svnsync: Error setting property 'log': Could not execute PROPPATCH.

This error occurs because non-UTF-8 encodings are not supported in svn log messages. As the svn book explains:

"Newer versions of Subversion have grown more strict regarding the format of the values of Subversion's own built-in properties. Of course, properties created with older versions of Subversion wouldn't have benefited from that strictness, and as such might be improperly formatted. Dump streams carry property values as-is, so using Subversion 1.7 to load dump streams created from repositories with ill-formatted property values will, by default, trigger a validation error. There are several workarounds for this problem. First, you can manually repair the problematic property values in the source repository and recreate the dump stream. Or, you can manually tweak the dump stream itself to fix those property values. Finally, if you'd rather not deal with the problem right now, use the --bypass-prop-validation option with svnadmin load."

One solution is to manually update the logs that are not encoded in UTF-8.
svn propget svn:log --revprop -r 35670 http://svn.abc.com/project/A | iconv --to-code UTF8//IGNORE -o /tmp/iconv.out
svn propset svn:log --revprop -r 35670 -F /tmp/iconv.out http://svn.abc.com/project/A
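
Before rewriting a log message it helps to know whether it is really broken. The following is a small sketch, assuming the message has already been saved to a file by whatever means you prefer; the function name is_valid_utf8 and the file name are made up for illustration. It relies only on iconv failing when asked to convert invalid input from UTF-8 to UTF-8.

```shell
#!/bin/sh
# Hypothetical helper (not part of Subversion): succeed when a file is valid
# UTF-8, fail otherwise. iconv exits with a non-zero status as soon as it
# meets a byte sequence that is illegal in the source encoding.
is_valid_utf8() {
    # $1: file to check, e.g. a saved log message
    iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}
```

For example, is_valid_utf8 /tmp/log.txt || echo "needs fixing" flags only the messages that actually need the iconv treatment.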

For my scenario, the best solution was to bypass the property validation in svnadmin load. What I needed to do was migrate the content as it is to the new repositories. Fixing somebody else's dirty work was not in my specs!
svnadmin load --bypass-prop-validation /svn/destinationRepo < sourceDump.dump


E140001
When I tried to use svndumpfilter on a dump created by svnrdump, the following error was encountered.

svnrdump dump http://svn.abc.com/project/ | svndumpfilter include /A > A.dump
svndumpfilter: E140001: Unsupported dumpfile version: 3

svnadmin dump, which creates a dump from a local repository, uses 'format 2' by default. svnrdump, which creates a dump from a remote repository, creates 'format 3' dump files only. svndumpfilter supports only 'format 2' and not 'format 3'.
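
If you are handed a dump file and don't know where it came from, you can check its format before feeding it to svndumpfilter. A dump file starts with a line of the form "SVN-fs-dump-format-version: N". The sketch below reads that line; the function name dump_format_version is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper (not part of Subversion): print the format version of
# a dump file by reading the header on its first line.
dump_format_version() {
    # $1: path to a dump file
    head -n 1 "$1" | sed -n 's/^SVN-fs-dump-format-version: \([0-9][0-9]*\).*/\1/p'
}
```

If dump_format_version A.dump prints 3, the dump needs to be recreated with svnadmin dump before filtering.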

Invalid copy source path
svndumpfilter include /projects/A < fullRepo.dump > A.dump
svndumpfilter: Invalid copy source path '/projects/B/xyz'

Say you want to move your project A to a different repository. You proudly say that your project is independent and can survive alone. But then, when you put the filter to work to create your dump, you get this error. One of your smart developers had seen that project B had just what he wanted and decided to steal their code. He had done a svn copy from /projects/B/xyz to projects/A, and now svn says that unless you give it that path too, it will never create the dump. Svn can be one tough kid. What you can do is simply give it what it asks for. I call it resolving dependencies. And for a big project there can be quite a number of dependencies.

svndumpfilter include /projects/A /projects/B/xyz < fullRepo.dump > A.dump

Sometimes you may be forced to include content that must not be in that repository. In that case you can do a svn delete and remove it from HEAD once the loading is complete.
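
Finding the dependencies one error at a time gets old quickly on a big dump. A rough way to list them up front is to scan the dump for Node-copyfrom-path headers that point outside the paths you plan to include. The sketch below assumes an unfiltered version 2 dump; the function name copy_sources_outside is made up for illustration. Treat the result as a list of candidates, since a line inside a versioned file could start with the same text by coincidence.

```shell
#!/bin/sh
# Hypothetical helper (not part of Subversion): list copy sources recorded in
# a dump file that are not covered by the given include prefixes.
copy_sources_outside() {
    # $1: dump file; remaining arguments: include prefixes such as /projects/A
    dump=$1
    shift
    # -a makes grep treat the dump as text even though it holds binary data
    grep -a '^Node-copyfrom-path: ' "$dump" |
        sed -e 's/^Node-copyfrom-path: //' -e 's|^/||' |
        sort -u |
        while IFS= read -r src; do
            covered=no
            for prefix in "$@"; do
                p=${prefix#/}
                # Covered when the source is the prefix itself or lies below it
                case $src in
                    "$p"|"$p"/*) covered=yes ;;
                esac
            done
            [ "$covered" = yes ] || printf '/%s\n' "$src"
        done
}
```

For example, copy_sources_outside fullRepo.dump /projects/A prints the paths that would have to be added to the include list, or nothing if project A really is independent.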

E160013
Say you create a dump like this

svndumpfilter include /projects/A /projects/B/xyz /projects/C/abc/efg < fullRepo.dump > A.dump 

And try to load it
svnadmin load --bypass-prop-validation /svn/destinationRepo < A.dump 

And it gives an error:

svnadmin: E160013: File not found: transaction '5104-3xs', path 'project/B'

This occurs when SVN is unable to find certain paths in the destination. Manually creating the paths fixes the issue. What you need to create are the intermediary directories above the included paths, as the svn mkdir call in the following does.

svndumpfilter include /projects/A /projects/B/xyz /projects/C/abc/efg < fullRepo.dump > A.dump
svnadmin create /svn/destinationRepo
svn mkdir -m "create folders" \
file:///svn/destinationRepo/projects \
file:///svn/destinationRepo/projects/B \
file:///svn/destinationRepo/projects/C \
file:///svn/destinationRepo/projects/C/abc
svnadmin load --bypass-prop-validation /svn/destinationRepo < A.dump

The directories can be created in one shot using the --parents option.

svn mkdir -m "create folders" --parents \
file:///svn/destinationRepo/projects/B \
file:///svn/destinationRepo/projects/C/abc
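
With a long include list it is easy to miss a directory. The following sketch works out which directories to pass to svn mkdir --parents from the include paths themselves; the function name parent_dirs is made up for illustration. It takes the parent of every include path and keeps only the deepest ones, on the assumption that --parents will create everything above them.

```shell
#!/bin/sh
# Hypothetical helper (not part of Subversion): print the directories that
# must exist before loading a dump filtered with the given include paths.
parent_dirs() {
    for path in "$@"; do
        parent=$(dirname "$path")
        # A top-level include has "/" as its parent, which always exists
        [ "$parent" = "/" ] || printf '%s\n' "$parent"
    done | LC_ALL=C sort -u | awk '
        { dirs[NR] = $0 }
        END {
            for (i = 1; i <= NR; i++) {
                keep = 1
                # Drop a directory when another one lies beneath it,
                # since svn mkdir --parents creates ancestors anyway
                for (j = 1; j <= NR; j++)
                    if (i != j && index(dirs[j], dirs[i] "/") == 1) keep = 0
                if (keep) print dirs[i]
            }
        }'
}
```

For the include list used above, parent_dirs /projects/A /projects/B/xyz /projects/C/abc/efg prints /projects/B and /projects/C/abc, which are the two paths to append to the repository URL in the svn mkdir call.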

This error can also occur due to missing dependencies that didn't cause issues during filtering. If you get this error for a file, take the svn log for that file and see what happened in the particular revision where the error occurs. For example, I got this error due to a folder rename and had to include the earlier path name and create a fresh dump.

After the migration

Once the migration is complete, compare the svn logs of source and destination. Getting the logs in XML format is a good idea if the source and destination run two different svn versions, since the log format may differ between versions. A merge tool can also be used to compare the two logs in XML format. Viewsvn can also be used to verify the contents after migration.

Once the migration is done, users' working copies have to be pointed to the new repositories. What we did was simply ask the users to get fresh checkouts. The following can be useful for those who don't like to do that.


1. svn relocate: Relocate the working copy to point to a different repository root URL. This “rewrites” the working copy's administrative metadata to refer to the new repository location. But it wants to compare the UUID of the repository against what is stored in the working copy. If the UUIDs don't match, the working copy relocation is disallowed. We have two ways of keeping the UUID of the source. Please note that both make the two repos share the same UUID.
   a. svnadmin load has the option --force-uuid. By default, when loading data into a repository that already contains revisions, svnadmin will ignore the UUID from the dump stream. This option will cause the repository's UUID to be set to the UUID from the stream.
   b. svnadmin setuuid: Reset the repository UUID for the repository located at REPOS_PATH. If NEW_UUID is provided, use that as the new repository UUID; otherwise, generate a brand-new UUID for the repository.

2. svn upgrade: Upgrade the metadata storage format for a working copy. Users will need this when they upgrade to svn 1.7. As new versions of Subversion are released, the format used for the working copy metadata changes to accommodate new features or fix bugs. Older versions of Subversion would automatically upgrade working copies to the new format the first time the working copy was used by the new version of the software. Beginning with Subversion 1.7, working copy upgrades must be explicitly performed at the user's request. svn upgrade is the subcommand used to trigger that upgrade process. If you attempt to use Subversion 1.7 on a working copy created with an older version of Subversion, you will see an error.



Summarized Commands

At the source
DUMPDIR=/dumps
SOURCE_REPO=/svn/sourceRepo
svnadmin dump $SOURCE_REPO | svndumpfilter include \
/component \
/docs \
/project/A \
/project/B \
/project/C/applications/journal \
> $DUMPDIR/A1.dump
svndumpfilter exclude /docs/userguides/custom/ < $DUMPDIR/A1.dump > $DUMPDIR/A.dump


At the destination
DUMPDIR=/dumps
REPO_LOCATION=/svn
rm -rf $REPO_LOCATION/tempRepo
svnadmin create $REPO_LOCATION/tempRepo
svn mkdir -m "create initial folders" --parents \
file://$REPO_LOCATION/tempRepo/project/C/applications
svnadmin load --bypass-prop-validation $REPO_LOCATION/tempRepo < $DUMPDIR/A.dump
rm -rf $REPO_LOCATION/destinationRepo
svnadmin create $REPO_LOCATION/destinationRepo
svnadmin dump $REPO_LOCATION/tempRepo -r 2:HEAD | svnadmin load --bypass-prop-validation $REPO_LOCATION/destinationRepo

svn delete -m "delete unwanted content" file://$REPO_LOCATION/destinationRepo/project/C

5 comments:

  1. Dear K Girl,

    Great post. I've been doing a lot of migrations recently. Like you had errors from svnrdump which works if I am lucky.

    Once I had to resort to a hex editor to fix one dumpfile. (You cannot have ^M in a property.)

    Your post is great. KUTGW.

    Bill

  2. Finally somebody has commented on a post :) :). Thanks Bill and I'm very happy it was useful to you

  3. Hi there,

    your post is a very good summary for migrating (or creating backup) svn-repositories. It would have been great, had I found it earlier ... this would have saved a lot of time and work. I'll keep the post in my bookmarks.

    Very well done! Cheers!

  4. Hello, I have almost the same requirement. I need to pull out few projects out of a big Single Repository with size almost 150GB. The only variation I want is to keep the original revisions intact. The svndumpfilter works correct with --drop-empty-revs switch and preserve my revisions.
    However "svnadmin load" Renumbers revisions starting from 1. That is the actual hurdle for my activity.
    Can anyone help.

  5. Reading this helped me choosing the right option after I spent hours with frustrating try&error, thx!
