Thursday, March 21, 2013

Measuring Project Activities (2)

Continuing from an earlier article, let's see how you can compute some interesting stats on your own projects.

How much change did a release have?

As I said earlier, you can measure the extent of change to your codebase in two ways: a quicker but less precise way, and a more involved but more accurate way.

A quicker way is to ask git diff --numstat to count the deleted and added lines between the release tags, and add them up yourself. If you care about whole-file renames, you can add the -M option to the git diff command:

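addremove2 () {
  git diff --numstat "$@" | {
    total=0 &&
    while read -r add remove path
    do
      # binary files are shown with "-" instead of line counts; skip them
      case "$add" in -) continue ;; esac
      total=$(( $total + $add + $remove ))
    done &&
    echo "$total"
  }
}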
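
To see what the loop is adding up without needing a repository at hand, here is a minimal sketch that runs the same summation over canned input. The sum_numstat name and the sample paths are made up for illustration; the input mimics the tab-separated "added, removed, path" layout of git diff --numstat, including the "-" placeholders it prints for binary files.

```shell
# Hypothetical helper: sum added + removed counts from
# "git diff --numstat" style input read from standard input.
sum_numstat () {
  total=0
  while read -r add remove path
  do
    # Binary files carry "-" in both count columns; skip them.
    case "$add" in
    -) continue ;;
    esac
    total=$(( total + add + remove ))
  done
  echo "$total"
}

# Canned input standing in for real "git diff --numstat" output.
printf '3\t1\tfetch-pack.c\n-\t-\tlogo.png\n10\t0\tREADME\n' | sum_numstat
```

With the three sample entries, the binary file is skipped and the remaining counts add up to 14.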

And with that helper, the main function we introduced in the earlier article can do this to compute the modified2 number for the entire release cycle and per day:

handle () {
  old="$1^0" new="$2^0"
  ...
  modified2=$(addremove2 "$old" "$new")
  mod2perday=$( echo 2k "$modified2" "$days" / p | dc )
}
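
The per-day figure is a plain division, so if dc is not available where you run this, something along the following lines should serve as a stand-in. The per_day name is hypothetical and the zero-day guard is an added assumption, not part of the original helper. Note also that awk's printf rounds the result, while dc at scale 2 truncates it, so the last digit can differ.

```shell
# Hypothetical helper: changed lines per day, to two decimal places.
# $1 = total changed lines, $2 = number of days in the release cycle.
per_day () {
  awk -v total="$1" -v days="$2" 'BEGIN {
    # Guard against a cycle of zero days (an assumption of this sketch).
    if (days <= 0) { print "n/a"; exit }
    printf "%.2f\n", total / days
  }'
}

per_day 1500 30
```

For 1500 changed lines over 30 days this prints 50.00.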

How much real change did a release have?

Counting the number of added and removed lines using git diff --numstat is straightforward, but it tends to over-count changes. For example, when adding a new caller to existing code, you may have to move that existing code up in the same file (especially if it is a file-local static function) to make the callee come before the caller, or move it to a different, more "library-ish" file while changing its visibility from static to extern. Both kinds of change unfortunately appear as a bulk deletion of an existing block of lines and a bulk addition of the same contents elsewhere in the codebase.

In order to count the true amount of work that went into the new release, you would want to exclude such changes from your statistics.

This is where git blame can help. In its most basic form, it traces each and every line of a file in the given commit back to its origin, i.e. the commit it came from. By default, it notices when the whole file gets renamed (e.g. the file hello.c you are running the command on in the current release may have been called goodbye.c in an earlier release) and employs no other fancy tricks, but you can tell it to notice code movement within a file (e.g. moving the callee up in the file) with the -M option, or code movement across files (e.g. moving a static function from the file its existing caller lives in to a different "library-ish" file, to make it visible to a new caller in another file as well) with the -C option. You can also tell it to ignore whitespace changes with the -w option, as you can with git diff. For example:

  git blame -M -C -w -s v1.8.0..v1.8.1 -- fetch-pack.c

will show you which commit each and every line in the fetch-pack.c file came from; its output may begin like this:

745f7a8c fetch-pack.c           1) #include "cache.h"
^8c7a786 builtin/fetch-pack.c   2) #include "refs.h"
^8c7a786 builtin/fetch-pack.c   3) #include "pkt-line.h"


The first line is blamed to commit 745f7a8c, while the other lines are attributed to commit 8c7a786 (the leading caret ^ means it is attributed to a commit at the lower boundary of the range), which is the v1.8.0 release. Note that these old lines used to live in a different file builtin/fetch-pack.c in the older release, and would have been counted as additions if you used the approach based on git diff --numstat -M to count them, because there was no file renaming involved between these two releases.

Also notice that these lines may have been untouched since a commit that may be a lot older than v1.8.0, but we told the command to stop at v1.8.0 from the command line, so these are all attributed to that range boundary.

If you count the number of lines in the whole output from the above command, that will show the number of lines in the fetch-pack.c file at the v1.8.1 release. If you count the lines that do not begin with a caret, that counts the lines added in the new release.

added_to_file () {
  old="$1" new="$2" path="$3"
  git blame -M -C -w -s "$old".."$new" -- "$path" |
  grep -v '^^' |
  wc -l
}
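
To check the filtering step in isolation, the sketch below feeds the three sample lines shown earlier to the same "drop the boundary lines" filter. The count_new_lines name is made up for this illustration. It uses grep -c instead of piping to wc -l, which avoids the leading blanks some wc implementations print, and it spells the pattern as '^\^' so that the literal caret is explicit.

```shell
# Hypothetical helper: count "git blame -s" output lines that are NOT
# attributed to the range boundary (i.e. that do not start with "^").
count_new_lines () {
  grep -c -v '^\^'
}

# The sample output shown earlier stands in for a real blame run.
count_new_lines <<'EOF'
745f7a8c fetch-pack.c           1) #include "cache.h"
^8c7a786 builtin/fetch-pack.c   2) #include "refs.h"
^8c7a786 builtin/fetch-pack.c   3) #include "pkt-line.h"
EOF
```

Only the first sample line lacks the leading caret, so this prints 1.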

This may be sufficient as a starting point, but we are not at all interested in checking each and every commit between the two releases (e.g. the commit 745f7a8c in the above example is not the v1.8.1 release, and the only thing we care about is that the line is new in the new release; we do not care where in the development cycle leading to the release it was added), so digging through them is a waste of computational cycles.

Fortunately, you can tell git blame to pretend as if the commit tagged as v1.8.1 release were a direct and sole child of the commit tagged as v1.8.0 release with the -S option. First, you prepare a graft file to describe the parent-child relationship.

added_to_file () {
  old="$1" new="$2" path="$3"
  graft=/tmp/blame.$$.graft
  # "<<-" strips leading tabs only, so keep the here-document unindented
  cat >"$graft" <<EOF
$new $old
$old
EOF
  git blame -M -C -w -S "$graft" -s "$old".."$new" -- "$path" |
  ...
}


The graft file lists each commit object and its parent. The above snippet says that the $new commit has a single parent, which is $old, and $old commit does not have any parent. This lets us lie to git blame that our history consists of only two commits, and one is a direct child of the other.
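
The two-line format described above can be exercised on its own. The following is a sketch under a few assumptions: the make_graft name is hypothetical, mktemp is used in place of a fixed /tmp name, and the object names in the demonstration are placeholders rather than real commits. As far as I can tell, the file given to -S needs full object names rather than tag names, which is why callers would pass the output of git rev-parse.

```shell
# Hypothetical helper: write a two-commit graft file and print its path.
# $1 = full object name of the pretend parent
# $2 = full object name of the pretend child
make_graft () {
  graft=$(mktemp "${TMPDIR:-/tmp}/blame.graft.XXXXXX") || return 1
  # Each line reads "<commit> <parent>..."; a commit listed with no
  # parent is treated as a root.
  printf '%s %s\n%s\n' "$2" "$1" "$1" >"$graft"
  echo "$graft"
}

# Demonstration with placeholder object names (not real commits).
old=1111111111111111111111111111111111111111
new=2222222222222222222222222222222222222222
graft=$(make_graft "$old" "$new") && cat "$graft" && rm -f "$graft"
```

Swapping the two arguments gives the reversed history in which the old release pretends to be the child of the new one.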

With this, we can tell how much new material was introduced to the given path in the new release, but what about the material removed from the old release? We can compute it in a similar way, with a twist. You take a path in the old release, and pretend as if the old release were the direct child of the new release. We then compute what we would have added had we started from release v1.8.1 and development had led to the contents of v1.8.0, like this:

removed_from_file () {
  old="$1" new="$2" path="$3"
  graft=/tmp/blame.$$.graft
  # "<<-" strips leading tabs only, so keep the here-document unindented
  cat >"$graft" <<EOF
$old $new
$new
EOF
  git blame -M -C -w -S "$graft" -s "$new".."$old" -- "$path" |
  grep -v '^^' |
  wc -l
}

By tying these two helper functions together with a list of paths that existed in the two releases, you can compute the amount of real change made to reach the new release. This article is getting a bit too long, so I'll leave the details to another installment, but in outline, we will use the added_to_file helper to construct an added_to_commit function like this:

added_to_commit () {
  old=$(git rev-parse "$1^0")
  new=$(git rev-parse "$2^0")
  list_paths_in_commit "$new" |
  while read path
  do
    added_to_file "$old" "$new" "$path"
  done | {
    total=0
    while read count
    do
      total=$(( $total + $count ))
    done
    echo $total
  }
}
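
The list_paths_in_commit helper used above is not defined in this article. One possible definition, offered purely as an assumption and not necessarily what a later installment uses, is to ask git ls-tree for every path recorded in the commit's tree:

```shell
# Hypothetical definition (not from the article): list every file path
# recorded in the tree of the given commit, one per line.
# Caveat: paths with unusual characters are quoted by git in this output;
# a more careful version would use "-z" and a NUL-aware reader.
list_paths_in_commit () {
  git ls-tree -r --name-only "$1"
}

# Usage, from inside a repository:
#   list_paths_in_commit v1.8.1 | wc -l
```

Because added_to_file is asked about paths in the new release only, files that existed solely in the old release would have to be picked up on the removed_from_file side.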

1 comment:

David said...

That's a really interesting post, thanks for sharing!

I was comparing the use of a graft file in "git blame" against not using one, and it seems that using the graft is slower (although maybe more precise?), or maybe I'm doing something wrong:

function added_to_commit_nograft {
  old=`git rev-parse "$1^0"`
  new=`git rev-parse "$2^0"`
  git diff --name-only --diff-filter=AM "$old" "$new" | while read path; do
    git blame -M -C -w -s "$old".."$new" -- "$path" | grep -v '^^' | wc -l
  done | awk '{sum+=$1} END {print sum}'
}
function added_to_commit_graft {
  old=`git rev-parse "$1^0"`
  new=`git rev-parse "$2^0"`
  git diff --name-only --diff-filter=AM "$old" "$new" | while read path; do
    echo "$new $old" > graft
    echo $old >> graft
    git blame -M -C -w -S graft -s "$old".."$new" -- "$path" | grep -v '^^' | wc -l
  done | awk '{sum+=$1} END {print sum}'
}

$ time added_to_commit_nograft v1.8.0 v1.8.1
15030

real 0m14.307s
user 0m12.449s
sys 0m1.412s

$ time added_to_commit_graft v1.8.0 v1.8.1
14624

real 0m34.741s
user 0m31.462s
sys 0m2.888s