In the nearly twenty years that CVS has been in use, hundreds of thousands (perhaps even millions) of CVS repositories have been created around the globe. Some of those have already been converted to Subversion (SVN), while others are candidates for conversion. This posting explores how best to approach such a conversion.
When considering converting your CVS repository to another Source Code Repository Management (SCRM) scheme, you must answer these questions:
What problem (or problems) is the conversion attempting to solve?
If the conversion is not meant to address an actual problem or issue, why are you considering it? If there is simply a desire to move to a more modern SCRM system, it is well advised to first prepare a cost-benefit analysis.
Start by listing the perceived benefits of converting from CVS, describing them using simple phrases like “Enable directory versioning”. Also list the costs of the conversion, including everything from the time consumed by the conversion itself to the time and training developers will need to use the new SCRM. Be prepared to abandon the conversion if the benefits don’t clearly outweigh the costs.
Could the problem(s) be addressed or resolved without a conversion?
There may be less disruptive solutions to your issue. Be sure to research and consider those, because converting your repository is effectively taking a ride down a one-way street. Once your repository has been converted to SVN and there is new activity in it, there is no practical method for going back to CVS other than simply creating a new CVS repository with no version history. As I said above, research and weigh your options.
Are there system dependencies on the CVS repositories?
If yes, what is the scope of the dependencies? If all of your build and release management automation is tied to interacting with a CVS repository or its ,v files, you must scope the level of effort (LOE) required to port the automation to Subversion.
If the LOE is low, then proceed. If the LOE is high, you must now determine whether the problem you’re trying to resolve with the conversion is serious enough, as measured by how costly it is for your organization to deal with it, to warrant a large, costly conversion of existing systems.
Are these commercial systems? If so, are there equivalent SVN-compatible versions available? If they are internally developed systems, are the original engineers who created them still at your company? If not, are the systems documented well enough for someone else to port the dependencies to SVN?
What is the size and distribution of your development team?
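This will determine the scale of your development process retrofit. Your engineers will need to be able to securely reach the SVN repository from their IDE of choice, from the same geographic locations where they currently use CVS. SVN is commonly served over HTTP or HTTPS through Apache, in which case the pserver, sserver, and ssh access methods you used with CVS no longer apply. SVN can also be served over its own svn:// protocol or tunneled over SSH with svn+ssh://, so decide which transport fits your network before you convert.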
CVS to SVN - the cvs2svn.py script
The primary authors of Subversion have created a conversion utility called cvs2svn. This is a Python script that attempts to convert a well-formed CVS repository into a corresponding Subversion repository. The script has many command-line options that control the conversion type (full, no history, etc.) and the targets.
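To make the invocation concrete, here is a minimal sketch of assembling a cvs2svn command line from Python. The helper name build_cvs2svn_command and the paths are hypothetical. The flags shown (--svnrepos, --dry-run, --trunk-only) are ones I believe cvs2svn provides, but check the output of cvs2svn --help for your installed version before relying on them.

```python
import shlex


def build_cvs2svn_command(cvs_repo, svn_repo, dry_run=False, trunk_only=False):
    """Build the argument list for a cvs2svn run (hypothetical helper)."""
    command = ["cvs2svn", "--svnrepos", svn_repo]
    if dry_run:
        # Ask cvs2svn to go through the motions without writing a repository.
        command.append("--dry-run")
    if trunk_only:
        # Convert only the trunk, skipping branches and tags.
        command.append("--trunk-only")
    # The CVS repository (or module) path goes last.
    command.append(cvs_repo)
    return command


if __name__ == "__main__":
    # Placeholder paths; substitute your own.
    cmd = build_cvs2svn_command(
        "/path/to/cvsroot/module", "/path/to/new-svn-repo", dry_run=True
    )
    print(shlex.join(cmd))
    # To actually run the conversion: subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids quoting problems when repository paths contain spaces.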
My experience with cvs2svn is that it is useful for non-complex repositories that are relatively modern. Repositories that are either complex (for example, those that contain deep branch structures) or “old” (in use for many years) are likely to contain anomalous files or version histories that cvs2svn cannot parse.
The script has two characteristics that impede its usefulness on very large, very old CVS repositories:
1. Poor error flagging and reporting
2. Inability to stop and restart a conversion
This means that when errors (usually obscure or inaccurate, see #1) are reported and the conversion stops, you have to start over (see #2). This is a potential show-stopper for large repository conversions, where you can be multiple hours into a conversion before stopping errors are encountered.
Specific Error Patterns and Causes
I have noticed three specific CVS repository anomalies that cause cvs2svn to fail and hard-stop. If you wish to use cvs2svn, you must deal with these issues either before using the script, or as the errors are encountered.
1. Malformed ,v file
CVS has been around for a long time. Like any tool that has (or had) a long, active development cycle, it had bugs that were fixed over time. Some of these early bugs had the potential to write malformed ,v files into the repository.
Even though CVS client operations would simply ignore these files, they remained in the repository. Now, when cvs2svn encounters these files outside a CVS client operation, the internal flawed structure is exposed and generates a fatal error.
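Before committing to a multi-hour run, it may be worth scanning for obviously damaged files yourself. The following is a minimal sketch and not cvs2svn's own parser. The function name rcs_problems and its three checks are my assumptions about what a usable ,v file minimally contains: a head revision, a delta header for that revision, and a desc section. It is a heuristic, so treat a clean result as "no obvious damage" rather than proof of validity.

```python
import re

# The admin section of a ,v file normally begins with "head <revision>;".
HEAD_RE = re.compile(rb"\Ahead\s+(\d+(?:\.\d+)+)\s*;")


def rcs_problems(data: bytes) -> list:
    """Return human-readable problems found in the raw bytes of a ,v file.

    An empty list means none of the (assumed) minimal checks failed.
    """
    match = HEAD_RE.match(data)
    if match is None:
        # Without a head revision the remaining checks cannot be applied.
        return ["no 'head <revision>;' at start of file"]

    problems = []
    head = match.group(1)

    # The description section is introduced by "desc" alone on a line.
    if re.search(rb"^desc\s*$", data, re.MULTILINE) is None:
        problems.append("no 'desc' section")

    # The head revision should have a delta header: the revision number
    # on its own line, followed by a line starting with "date".
    delta_header = rb"^" + re.escape(head) + rb"\s*\ndate\s"
    if re.search(delta_header, data, re.MULTILINE) is None:
        problems.append("no delta header for head revision")

    return problems
```

To scan a whole repository, walk the tree with os.walk, read each file ending in ,v in binary mode, and report any file for which rcs_problems returns a non-empty list.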
2. Multiple identical tags on disparate revisions of the same file
For the same reasons as noted above, old bugs in CVS made it possible to apply quasi-identical tags to disparate revisions of the same file. For example, an early bug in CVS involved poor transformation of line-ending characters.
A common occurrence was the generation of ^M characters in Windows-based editors. In normal use, Windows developers would edit files and commit them to the CVS repository.
The earliest users of CVS implemented their repositories on Unix systems. Similarly, the earliest engineers using CVS were developing source code on those same Unix systems. When client-server CVS was introduced, developers could work with remote CVS repositories, and the line-ending issue emerged soon after.
As CVS usage grew, Windows developers began using remote CVS repositories. Often, connections to these remote CVS repositories were made using Samba or, later, pserver.
This widespread adoption led to a wave of line-ending issues: Windows carriage returns went into the repository as ^M characters and were actually saved in the ,v files. While this was annoying to developers, it was ultimately largely resolved by the CVS client’s improved handling of line endings on the way into and out of the repository.
However, this line-ending issue had at least one serious side effect on the repository. Due to CVS’s buggy early handling of Windows line endings, it was possible to apply tags of this form
TEST: 1.9
…
TEST^M: 1.0.2.14
to one or more files in the repository. While later clients fixed this bug, the malformed tag had been written to the repository and would be there forever. CVS would simply ignore the malformed tag because it was never requested by any subsequent client operation.
cvs2svn, however, parses each ,v file and appears to ignore, or drop, the ^M at the end of the tag name, effectively creating two same-named tags on disparate revisions.
This causes a fatal error in cvs2svn.
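A pre-scan for this anomaly can be sketched as follows. It assumes, as I hypothesized above, that the collision arises when surrounding whitespace (including the carriage return) is stripped from tag names. The function name colliding_tags is hypothetical, and the parsing is a simplified reading of the symbols section in the ,v header rather than a full RCS parser.

```python
import re
from collections import defaultdict

# The symbols section runs from the keyword to the first semicolon.
SYMBOLS_RE = re.compile(rb"^symbols(.*?);", re.MULTILINE | re.DOTALL)
# Each entry is "<name>:<revision>"; the name may carry a stray carriage return.
PAIR_RE = re.compile(rb"([^:;]+?):\s*([0-9.]+)")


def colliding_tags(data: bytes) -> dict:
    """Map each whitespace-stripped tag name to its raw entries when more
    than one raw tag collapses to that same name."""
    match = SYMBOLS_RE.search(data)
    if match is None:
        return {}

    groups = defaultdict(list)
    for pair in PAIR_RE.finditer(match.group(1)):
        # Drop the separator whitespace before the name, but keep any
        # trailing bytes (such as b"\r") so the damaged tag stays visible.
        raw_name = pair.group(1).lstrip()
        revision = pair.group(2).decode("ascii")
        groups[raw_name.strip()].append((raw_name, revision))

    return {name: entries for name, entries in groups.items() if len(entries) > 1}
```

Any file for which this returns a non-empty result is a candidate for hand repair (removing the damaged tag from the ,v file, on a copy of the repository) before the conversion is started.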
3. Same named files in both the parent directory and the Attic
Update: There is a cvs2svn option to deal with this:
--retain-conflicting-attic-files
EDIT: Apparently that’s a documentation bug, because ‘retain-conflicting-attic-files’ is not a recognized option in the current version of cvs2svn.
Once again, this issue is likely due to some old bug in CVS. Under normal usage, CVS would store a ,v file in either the parent of a directory node or in its Attic.
When a file is added to CVS, its position relative to the trunk (the HEAD) is determined. If the file is being added on the trunk, the corresponding ,v file is stored in the repository inside its directory node. For example, when
/foo/bar.java
is added to the trunk, the ,v file is stored at
$CVSROOT/foo/bar.java,v
Similarly, when bar.java is added to a branch, its ,v file is stored at
$CVSROOT/foo/Attic/bar.java,v
And in that ,v file the HEAD revision is listed as dead. This indicates that the file exists only on the branch. When CVS client operations attempt to retrieve the HEAD, bar.java will not be included.
This is normal CVS behavior, but because of old bugs, CVS has in the past allowed same-named files to exist both at the parent node and in the Attic. Though the bug that allowed this to occur has been fixed, the operation that caused it did in fact write the extra file to the repository, and that file is still there (sometimes years later).
When cvs2svn encounters this, it throws a fatal error.
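This condition is easy to detect ahead of time by walking the repository on disk. The sketch below is one way to do it. The function name attic_conflicts is hypothetical, and it assumes the standard layout described above, where the Attic is a subdirectory of the directory node.

```python
import os


def attic_conflicts(cvsroot):
    """Return the paths of ,v files that also exist in the sibling Attic."""
    conflicts = []
    for dirpath, _dirnames, filenames in os.walk(cvsroot):
        # Only Attic directories are of interest.
        if os.path.basename(dirpath) != "Attic":
            continue
        parent = os.path.dirname(dirpath)
        for name in filenames:
            # A conflict is the same ,v file name in the parent directory.
            if name.endswith(",v") and os.path.isfile(os.path.join(parent, name)):
                conflicts.append(os.path.join(parent, name))
    return sorted(conflicts)
```

Which of the two copies to keep is a judgment call that depends on the file's history, so I would only report the conflicts here and resolve them by hand on a copy of the repository.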
I’ll come to edit this post with final results of a very large CVS repository conversion I’m doing; for now, suffice it to say that cvs2svn.py is useful, but not 100% reliable for very large, very old CVS repositories.

5 Comments
I am currently the main developer/maintainer of cvs2svn. I read your article with interest, and would like to discuss some of the points that you made.
I do not know of any problems in cvs2svn that are caused by deep repositories or repositories with lots of branches. If you have found some, please let us know.
You claim that cvs2svn cannot be restarted. This is partly correct. The individual passes of cvs2svn are self-contained and can be re-run. Thus if there is a problem in pass4, you don’t have to restart the conversion at pass1. This can be a big benefit, particularly when dealing with tag/branch names that were used inconsistently.
However, you are correct that pass1, in which the data are parsed out of CVS, cannot be restarted if there is an error.
We try to give reasonable error messages (and sometimes workarounds) for the kinds of repository corruption that have been reported to us frequently. If you have found other common failure modes, please report them to our mailing list and we will look at them.
Regarding symbol names with appended carriage returns:
I’ve never seen this problem; thanks for pointing it out. If you would send an email to the users mailing list including a snippet of a CVS repository that shows this problem, I’d be happy to look at it. If you have any suggestions for how you think cvs2svn should work around this problem, please let us know.
Regarding the --retain-conflicting-attic-files option:
This option has been added to the trunk version of cvs2svn (and works, as far as I know), and that is why it is included in the online documentation. But the latest official release, 1.5.1, does not yet include this feature. I understand that this discrepancy can lead to confusion.
Thanks for the comments, Michael.
I am very interested in being able to restart Pass1 - is this feasible? For very large repositories (e.g. 15+ GB), a fatal error deep into the conversion is a huge time-waster if the process must be completely restarted.
WRT issues converting large & complex repositories, my statement is based on anecdotal evidence & experience converting 15 different CVS repositories with cvs2svn.py. Of those, 13 have converted properly and error-free, but were smallish and non-complex. The other two have exhibited both the failures I have noted in my posting above, as well as what I am referring to as …trouble with deep branch structures.
That particular issue presented itself on a very large repository in a module that had numerous branch and release tags, both across the entire module and on single files. The error that was presented was
This was duplicated twice (two subsequent executions of the conversion) against a few files whose only noticeable difference from other ,v files was the depth of the symbol list. Once I moved those files out of the CVS repository, the conversion moved past them. This makes me conclude that this issue is related to the depth of the symbol list and is reproducible.
This error case and the one involving the ^M at the end of a symbol name are what have prompted me to characterize the error flagging & reporting as poor with respect to use on large, complex repositories. It took multiple executions of very large conversions to finally determine the root cause of these issues; had the error reporting been more meaningful, the issues with the ,v files could have been resolved after the first report.
Thanks for the update about the --retain-conflicting-attic-files option; I will grab the version from SVN next time and try it.
Your comments got me thinking about a resumable pass1. It wouldn’t be terribly hard to implement. (No promises, though.)
Regarding “depth” problems: I think a far more likely explanation is that a “deep” repository is likely to be an old one that has accumulated repository corruption over the years. But consider submitting your unprocessable *,v files to the user mailing list if you think they are not corrupt.
Granted, it would be nice if we could provide a more specific error message in the case of a corrupt *,v file. I don’t think that the parser that we use provides more information, but I’ll double-check.
For further discussion, the users@cvs2svn mailing list would be more appropriate so that other members of the cvs2svn community can participate.
I was doing a cvs2svn conversion and came across this error:
Error summary:
ERROR: ‘/Development/Inc/Attic/.keepme,v’ is not a valid ,v file
Exited due to fatal error(s).
I am attaching the file here
head ;
access ;
symbols ;
locks ; strict;
comment @# @;
desc
@@
Please let me know what could be wrong in this file. Is the file corrupted? I would like to point out that the corresponding file is not very deep in the repository. Please help.
Hey Anjana,
If that’s the entire contents of the .keepme,v file that is generating the error, then it is indeed corrupt.
Immediately following the lines
desc
@@
should be the contents of at least one revision of the file; it is entirely legal to have only one revision. For example, the entire contents of an unmodified file named verifymsg,v, which is present in the CVSROOT directory of most repositories, look like this:
head 1.1;
access ;
symbols ;
locks ; strict;
comment @# @;
1.1
date 2001.10.11.14.14.32; author somebody; state Exp;
branches;
next ;
desc
@@
1.1
log
@initial checkin@
text
@# The "verifymsg" file is used to allow verification of logging
# information. It works best when a template (as specified in the
# rcsinfo file) is provided for the logging procedure. Given a
# template with locations for, a bug-id number, a list of people who
# reviewed the code before it can be checked in, and an external
# process to catalog the differences that were code reviewed, the
# following test can be applied to the code:
#
# Making sure that the entered bug-id number is correct.
# Validating that the code that was reviewed is indeed the code being
# checked in (using the bug-id number or a seperate review
# number to identify this particular code set.).
#
# If any of the above test failed, then the commit would be aborted.
#
# Actions such as mailing a copy of the report to each reviewer are
# better handled by an entry in the loginfo file.
#
# One thing that should be noted is the the ALL keyword is not
# supported. There can be only one entry that matches a given
# repository.
@
Note how this differs from your file.