After my successful extraction of the kate-wml-syntax portion of Wesnoth-UMC-Dev, the next goal obviously had to be just a little more complicated than that. Between Invasion from the Unknown and After the Storm, AtS has a relatively simple VCS and meta-history spanning only one main branch and one short-lived branch, as well as a number of tags based only on the main branch.
In reality, a tool called reposurgeon ought to be the best choice for performing this task correctly and elegantly, but I have my reasons to avoid it. Nevertheless, if you need to convert your own SVN repositories to Git, it is probably what you should be checking out instead of these blog posts. I’m here to do things my own way and learn while at it; the git svn portion of this procedure is very boring and nothing out of the ordinary, but git filter-branch is a handy power tool that comes bundled with Git and which may prove useful to me for non-conversion tasks as well, and in fact, it already has done so. It’s worth noting that, being a power tool, it is not something you want to put a clueless simpleton in charge of applying to your repository; then again, if its documentation is anything to go by, reposurgeon is not much better in that regard. All this is probably for the best.
(I’m also taking this opportunity to express my absolute befuddlement at Git’s liberal git push -f implementation, presumably a consequence of Git being originally designed for the Linux kernel’s mail-based workflow. While git push -f can be useful under very specific circumstances, I have heard and seen people use it as a Git panacea without realizing the consequences.)
I already described the git svn step in broad terms in Part II of this series, so I am not going to go into details here. The fun part continues to be the git filter-branch (repository rewrite) operation, although this time we also need to deal with SVN tag branches.
As I had to perform the After the Storm repository rewrite multiple times on a remote server before I realized that my desktop box’s superior disk I/O, tmpfs configuration, and freedom from the console lag of a slow SSH link would be a considerable advantage, I wrote a few scripts to help me throughout the whole procedure. I made several highly risky and specific concessions aimed solely at dealing with my particular use case, though, so I can’t possibly publish those scripts without either feeling ashamed of my subpar shell scripting skills or inadvertently allowing one of the aforementioned specimens to wreak havoc on somebody’s data, perhaps even their own! Either outcome would be regrettable.
Thus, I am providing a pseudocode outline of the rewrite procedure instead:
1. For every tag branch BT:
   - Let its tip commit be HT.
   - Create a corresponding annotated tag T pointing to HT, with T’s authorship information matching HT’s (author and committer identification and timestamp).
     Due to how SVN tagging and branching works from git svn’s point of view, HT is an empty commit object that introduces no tree changes. This is an important thing to keep in mind for the next step.
     - Because the original HT commit message is highly SVN-specific and not particularly useful (Tagging add-on 'After the Storm', release <VERSION> from trunk, using r<REV>), I opted for an automatic minimalistic tag message for T of the form After the Storm version <VERSION>, reminiscent of the AtS: version <VERSION> commits preceding the tag branching point.
   - Delete BT. HT is now only referenced in the repository by T.
2. For every commit C:
   - Rewrite the git-svn-id metadata trail in the commit message so it reads e.g. [Wesnoth-UMC-Dev SVN r12345] and strip extra empty lines between paragraphs.
   - Delete the /music directory recursively if it appears in the commit’s tree.
   - Rewrite any tag objects pointing to C while keeping any other attributes intact. git filter-branch does not do this by default for some reason, so you may end up with tags pointing to dangling objects otherwise.
   - If C is empty (no tree changes) after the previous steps, delete it.
     From point 1, this results in every HT commit being erased, and T being rewritten to point to HT’s immediate ancestor, which is in an ideal SVN repository part of the origin branch and not the short tag branch.
3. Run git fsck to check that the repository isn’t broken at the end.
Part 1 was done through a “simple” sequence that extracts the needed information from every tag branch. At the risk of making myself eligible for the questionable classification from the start of this post, the code I used for this was:
    tagproject="After the Storm"

    git for-each-ref --format='%(refname)' refs/heads/tags | cut -d / -f 4 | (
        while read ref; do
            set -- `git log -n 1 --format="format:%ct %at" refs/heads/tags/$ref`
            export GIT_COMMITTER_DATE="$1 +0000"
            export GIT_AUTHOR_DATE="$2 +0000"

            set -- `git log -n 1 --format="format:%ae %an" refs/heads/tags/$ref`
            export GIT_AUTHOR_EMAIL="$1"; shift
            export GIT_AUTHOR_NAME="$*"

            set -- `git log -n 1 --format="format:%ce %cn" refs/heads/tags/$ref`
            export GIT_COMMITTER_EMAIL="$1"; shift
            export GIT_COMMITTER_NAME="$*"

            echo " * $ref (REF: `git rev-parse refs/heads/tags/$ref` CTS: $GIT_COMMITTER_DATE ATS: $GIT_AUTHOR_DATE CA: $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> CC: $GIT_AUTHOR_NAME <$GIT_AUTHOR_EMAIL>)"

            git tag -a "$ref" -m "$tagproject version $ref" "refs/heads/tags/$ref";

            unset GIT_COMMITTER_DATE; unset GIT_COMMITTER_EMAIL; unset GIT_COMMITTER_NAME
            unset GIT_AUTHOR_DATE; unset GIT_AUTHOR_EMAIL; unset GIT_AUTHOR_NAME

            git branch -D "tags/$ref" > /dev/null
        done
    )
The most glaring issues with this code are that it invokes git log multiple times for each commit (which could be solved if I were writing this pipeline for general usage, which I am not), and that it does not take into account author/committer timezones; but it does not do so because git svn does not either! Subversion normalizes dates to UTC, so git svn cannot possibly know the original timezones used. Ideally, one would use git filter-branch to rewrite these commits in part 2 (or even before, in fact) to convert their timestamps to local ones with a timezone offset attached, but determining the offsets to use involves knowing the authors/committers’ locations around the globe and any local daylight saving time regulations that may come into play depending on the local date and time. For a repository that has more than one committer or author, this is just pointless pedantry bordering on pathological perfectionism. I have a whole history with the latter and After the Storm which I would rather not repeat with something as mundane and short-lived as a one-time Subversion-to-Git conversion. Seriously.
Avoiding the kind of perfectionism that ruins lives is also the reason why I opted to leave commit messages intact save for the metadata trail instead of taking the opportunity to fix commits with long/nonexistent summary lines, sloppy grammar/spelling/punctuation, commit messages consisting solely of a bullet list, and Kri’tan.
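As an aside on the repeated git log calls mentioned earlier: if I were to tidy that up, it would probably look something like the sketch below. It is not what I actually ran. The load_identity helper and the choice of the ASCII unit separator (0x1f) as a field delimiter are my own assumptions for illustration; the idea is simply to ask git log for all six fields at once (via the %x1f placeholder) and split the record in the shell.

```shell
#!/bin/sh
# Sketch: split one git log record into the GIT_* identity variables.
# The unit separator (0x1f) is assumed not to occur in names or emails.
US=$(printf '\037')

# Hypothetical helper, not part of the original scripts.
# $1: one record of the form CT<US>AT<US>AE<US>AN<US>CE<US>CN
load_identity() {
    IFS="$US" read -r ct at ae an ce cn <<EOF
$1
EOF
    # git svn only knows UTC timestamps, hence the fixed +0000 offset.
    export GIT_COMMITTER_DATE="$ct +0000"
    export GIT_AUTHOR_DATE="$at +0000"
    export GIT_AUTHOR_EMAIL="$ae" GIT_AUTHOR_NAME="$an"
    export GIT_COMMITTER_EMAIL="$ce" GIT_COMMITTER_NAME="$cn"
}

# Intended usage inside the loop over tag branches:
#   record=$(git log -n 1 \
#       --format='%ct%x1f%at%x1f%ae%x1f%an%x1f%ce%x1f%cn' \
#       "refs/heads/tags/$ref")
#   load_identity "$record"
```

Using a non-whitespace separator also means that names containing spaces survive intact without the set/shift juggling.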
Rewriting commits and updating tags
Compared to the tag preparation part, the actual git filter-branch invocation is rather simple:
    # Temporary dir on /tmp (a tmpfs mount) to trade disk access for RAM
    gittmp=`mktemp -uqd --tmpdir gittmp.XXXXXXXXXX`

    # Strip double blank lines and convert git-svn-id metadata trail
    # to a more readable format (e.g. [Wesnoth-UMC-Dev SVN r99999]).
    msgfilter="cat -s | sed -re 's/^git-svn-id.*@([0-9]+) .*\$/[Wesnoth-UMC-Dev SVN r\\1]/m'"

    # Remove external music/ dir
    indexfilter="git rm --cached --ignore-unmatch -q -r ./music/"

    # Needed to rewrite tags to point to the rewritten commits
    tagfilter="cat"

    git filter-branch -d "$gittmp" --prune-empty --msg-filter "$msgfilter" --index-filter "$indexfilter" --tag-name-filter "$tagfilter" -- --all
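To see what the message filter does in isolation, here is a small sketch that runs the same pipeline on a made-up commit message outside of git filter-branch. The rewrite_svn_trail name, the repository URL and the UUID are placeholders of mine. I also swapped sed -r for the more portable -E and left out the m flag, which should not matter here since sed already works line by line.

```shell
#!/bin/sh
# Sketch of the message filter on its own: squeeze runs of blank lines
# and turn the git-svn-id trail into a short revision tag.
rewrite_svn_trail() {
    cat -s | sed -E 's/^git-svn-id.*@([0-9]+) .*$/[Wesnoth-UMC-Dev SVN r\1]/'
}

# Sample message; the URL and UUID below are made up for illustration.
printf '%s\n' \
    'AtS E1S4: mismatched tense fix' \
    '' \
    '' \
    'git-svn-id: https://svn.example.org/repo/trunk/After_the_Storm@19373 00000000-0000-0000-0000-000000000000' \
    | rewrite_svn_trail
```

The two blank lines collapse into one and the last line becomes [Wesnoth-UMC-Dev SVN r19373], which is the shape the rewritten commits end up with.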
To be honest, disk I/O on my desktop machine is good enough (with a 7200 rpm HDD on a SATA III link) that asking git filter-branch to use a directory on a tmpfs mount doesn’t make a noticeable difference, but your mileage may definitely vary. AtS’ repository is not too large, nor does it contain a particularly large number of files. Moving the operation from a remote VPS to my desktop, on the other hand, did make the overall process about two times faster.
The --tag-name-filter "cat" part was actually recommended by both git filter-branch and its documentation; otherwise, tags are left untouched and stop making sense, since they point to commits that have been rewritten and thus no longer have the original SHA1 hash or ancestry. The rest of the invocation is just pedestrian business involving git rm and standard Unix tools; it merits no explanation whatsoever.
Well, the recursive /music directory removal actually does require some justification. This Wesnoth add-on originally included a couple of large music track files (in Ogg Vorbis format) within it, thus they ended up as part of its SVN tree. Over time, however, this approach proved suboptimal both for me (uploading new versions of the add-on from a 3.5G mobile broadband link) and players, who had to download over 14 MiB or so every time instead of 7 MiB, the approximate base size of the add-on without music. Thus, at some point I deleted /music from the SVN tree in favor of a separate rarely-updated add-on, but its history remained there. Since converting to Git involves starting from a clean slate in a way, I decided I might as well delete a portion of history that’s no longer relevant or used and may take up a significant portion of the packed repository data... or it may not; it’s not the point, either way.
The end result is now available on GitHub, and it’ll become official once AtS version 0.9.9 is released.
It is worth mentioning that past a certain point, power tools like git filter-branch become an unfathomable corrupting force; if you spend too much time tweaking an invocation sequence and examining the results, this force seeps into your very soul and makes you do absolutely ridiculous things. It is no exaggeration when I say that such an outcome was but narrowly averted for my conversion of the AtS repository.
Anyway, the last SVN commit and the first native commit to the repository now look like this:
    commit bbf8df680322ce585debb2b3068b446b87616371
    Author: Ignacio R. Morelle <shadowm@xxxxxxxx>
    Date:   Mon Nov 25 06:27:34 2013 -0300

        Add a Markdown README.md containing general info, replaces BUGS

    commit 040ca5e81ef2a8851e331cfc5a4f59ae03d6bca2
    Author: Ignacio R. Morelle <shadowm@xxxxxxxx>
    Date:   Sat Nov 23 02:18:29 2013 +0000

        AtS E1S4: mismatched tense fix

        [Wesnoth-UMC-Dev SVN r19373]
I opted for not stripping the AtS: prefixes in past commit summaries (mostly those since 2011) used to identify Wesnoth-UMC-Dev projects. The reason is that I don’t really want to spend more time figuring out what words after the colon would then need to be excluded from capitalization aside from wesnoth-optipng. Laziness.
The whole of /music was excised from the repository history, resulting in an overall size decrease from 25 MiB to 17 MiB. Given the number of PNG file additions and recompressions (“wesnoth-optipng pass” commits) taking place in the past, as well as the existence of some shorter Ogg Vorbis files under /sounds, I do not think it’s feasible to shrink it any further without dropping additional relevant history. Again, the /music removal was purely for practical purposes.
I guess you could say that balancing out my perfectionism with my unrelenting pragmatism and laziness was key to accomplishing this task within a finite amount of time.
Next target: Invasion from the Unknown. Since it contains oddities like changing tag branches and an unusual layout for its first commit (revision 1), it’s probably not going to be as easy as AtS.