Find the most used first word (and every word) in your git commit messages

Via Twitter, I found this interesting gist (https://gist.github.com/2140115) which contains a one-line bash command for “Find[ing] the most used verbs in your git commit messages” in a git repo.

$ git log --pretty=format:'%s' | cut -d " " -f 1 | sort | uniq -c | sort -nr
      2 Added
      1 Improved

In the following, I have extracted the necessary information from the help/man pages to understand how this is achieved. After each command is explained, I also show the intermediate output of the pipeline up to that point.

git log # "Show commit logs"(git log --help)
git log --pretty=format:'%s' # print each commit log as a line containing solely its subject

$ git log --pretty=format:'%s'
Improved README
Added CHANGELOG
Added README

cut # "print selected parts of lines from each FILE to standard output."(cut --help)
cut -d, --delimiter=DELIM # "use DELIM instead of TAB for field delimiter"(cut --help)
cut -f, --fields=LIST # "output only these fields"(cut --help)
cut -d " " -f 1 # split line by " " and output only the first field

$ git log --pretty=format:'%s' | cut -d " " -f 1
Improved
Added
Added

sort # "Write sorted concatenation of all FILE(s) to standard output."(sort --help)

$ git log --pretty=format:'%s' | cut -d " " -f 1 | sort
Added
Added
Improved

uniq # "Discard all but one of successive identical lines from INPUT (or standard input), writing to OUTPUT (or standard output)"(uniq --help)
uniq -c, --count # "prefix lines by the number of occurrences"(uniq --help)

$ git log --pretty=format:'%s' | cut -d " " -f 1 | sort | uniq -c
      2 Added
      1 Improved

sort # "Write sorted concatenation of all FILE(s) to standard output."(sort --help)
sort -r # "reverse the result of comparisons"(sort --help)
sort -n # "compare according to string numerical value, imply -b"(sort --help)
sort -b # "ignore leading blanks in sort fields or keys"(sort --help)

$ git log --pretty=format:'%s' | cut -d " " -f 1 | sort | uniq -c | sort -nr
      2 Added
      1 Improved
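
If you run this often, the pipeline can be wrapped in a small shell function. The following is only a sketch: the name top_first_words and the default of 10 lines are my own assumptions and not part of the gist. The function reads the commit subjects from standard input, so it can be fed from git log.

```shell
# top_first_words (hypothetical helper): read commit subjects on stdin and
# print the N most frequent first words, most frequent first (N defaults to 10).
top_first_words() {
  cut -d " " -f 1 |        # keep only the first space-separated field
    sort |                 # group identical words next to each other
    uniq -c |              # prefix each distinct word with its count
    sort -nr |             # highest count first
    head -n "${1:-10}"     # keep only the top N lines
}
```

Usage: git log --pretty=format:'%s' | top_first_words 5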

If you want all the words (and not only the first words of each commit), the cut command does not suffice. In that case, sed, a “stream editor for filtering and transforming text”(sed man page) can be leveraged. It is shown in the following.

$ git log --pretty=format:'%s'
Improved README
Added CHANGELOG
Added README

sed # "stream editor for filtering and transforming text"(sed man page)
sed s/regexp/replacement/ # "Attempt to match regexp against the pattern space. If successful, replace that portion matched with replacement."(sed --help)
\s # whitespace character
\n # new line
sed 's/\s/\n/g' # replace all whitespace characters by new lines

$ git log --pretty=format:'%s' | sed 's/\s/\n/g'
Improved
README
Added
CHANGELOG
Added
README
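
One caveat: \s and \n in the replacement are extensions of GNU sed, and because every single whitespace character becomes a new line, two spaces in a row produce an empty line. A possible alternative, sketched here with tr, squeezes runs of whitespace into one new line. The function name split_words is made up for this illustration, and I assume a tr that pads the second set with its last character, as GNU and BSD tr do.

```shell
# split_words (hypothetical helper): put every whitespace-separated word on
# its own line. With -s, runs of whitespace are squeezed into a single
# newline, so no empty lines appear between words.
split_words() {
  tr -s '[:space:]' '\n'
}
```

Usage: git log --pretty=format:'%s' | split_words | sort | uniq -c | sort -nr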

$ git log --pretty=format:'%s' | sed 's/\s/\n/g' | sort
Added
Added
CHANGELOG
Improved
README
README

$ git log --pretty=format:'%s' | sed 's/\s/\n/g' | sort | uniq -c
      2 Added
      1 CHANGELOG
      1 Improved
      2 README

$ git log --pretty=format:'%s' | sed 's/\s/\n/g' | sort | uniq -c | sort -nr
      2 README
      2 Added
      1 Improved
      1 CHANGELOG

But beware: in a large repository the output can be longer than the scrollback buffer of your terminal. This can be solved by piping the result to less.

$ git log --pretty=format:'%s' | sed 's/\s/\n/g' | sort | uniq -c | sort -nr | less
      2 README
      2 Added
      1 Improved
      1 CHANGELOG

The pipeline can still be improved, e.g., by converting all upper case letters to lower case ones so that "Added" and "added" are counted as the same word. This is possible using tr '[:upper:]' '[:lower:]' which uses the tr tool that is able to “translate, squeeze, and/or delete characters from standard input, writing to standard output”(tr --help). Here it translates every upper case letter to a lower case one and copies all other characters unchanged.
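
Put together, the case-insensitive word count could look like the following sketch. The function name word_frequencies is my own choice for illustration, and the sed expression assumes GNU sed, as in the commands above.

```shell
# word_frequencies (hypothetical helper): read commit subjects on stdin and
# print every word with its count, most frequent first, ignoring case.
word_frequencies() {
  sed 's/\s/\n/g' |                 # one word per line (GNU sed syntax)
    tr '[:upper:]' '[:lower:]' |    # fold upper case to lower case
    sort | uniq -c | sort -nr       # count and order by frequency
}
```

Usage: git log --pretty=format:'%s' | word_frequencies | less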

Analyzing Your Git Repository

Sometimes it is interesting to see who has committed how often in a git repository, how many lines of code each person contributed, etc.

There are several statistical tools available, like gitstats and gitstat, or online services like Ohloh. However, I didn’t want to install additional software, so I looked at the available git commands.

You can use git shortlog to get a list of authors and their commits. The --summary flag reduces the output to the number of commits per author, while the --numbered flag sorts the authors by the number of commits in descending order.

git shortlog --summary --numbered
git shortlog -sn # short form

However, it counts the merge commits, too. These commits do not create value and are unnecessary, as one could use git rebase instead to get a cleaner git history. It is possible to exclude the merge commits by adding the --no-merges option. This option is not listed in the man page of git shortlog. It works because the git shortlog command is based on the git log command, which understands the --no-merges option as stated in its man page.

git shortlog --summary --numbered --no-merges
git shortlog -sn --no-merges # short form (there is no one-letter flag for --no-merges)
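
To get a feeling for how much the merge commits affect the numbers, you can count them separately. The following sketch uses git rev-list with its --count, --merges and --no-merges options. The function name merge_share is hypothetical, and the function has to be run inside a git repository.

```shell
# merge_share (hypothetical helper): print how many commits reachable from a
# revision (default: HEAD) are merge commits and how many are not.
merge_share() {
  echo "merges:     $(git rev-list --count --merges "${1:-HEAD}")"
  echo "non-merges: $(git rev-list --count --no-merges "${1:-HEAD}")"
}
```

Usage: merge_share (or merge_share master for a specific branch)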

A problem can occur if developers with the same name have different email addresses within your git commit history. Using the command above, they are grouped by name, so you cannot tell these persons and their individual commits apart. To avoid this, add the --email option, which ensures that commits of developers with the same name but different email addresses are not aggregated.

git shortlog --summary --numbered --email
git shortlog -sne # short form

Another problem can occur if a developer uses different names and/or email addresses within your git commit history. This can only be solved by adding a mapping file stating which developer has which names and email addresses. The file has to be named .mailmap and located at the top level of the repository. Each line defines one mapping from a commit name and/or commit email address to a proper name and/or proper email address. If a developer uses several different commit names and/or email addresses, you may need several mappings for this developer.

For example, the developer Max Mustermann uses the following name/email pairs for his commits:

Max Mustermann <max.mustermann@mail.com>
Max <max.mustermann@mail.com>
Max Mustermann <max@mail.com>
Max <max@mail.com>

The aim is to identify this user by Max Mustermann <max.mustermann@mail.com> only. Therefore, the .mailmap file has to look as follows:

# same name but different mail address
Max Mustermann <max.mustermann@mail.com> <max@mail.com>

# same mail address but different names (matched by the commit email, so no commit name is given)
Max Mustermann <max.mustermann@mail.com>

# different name and different email address
Max Mustermann <max.mustermann@mail.com> Max <max@mail.com>

The example shows all possible combinations (same email, same name, different email and name) and can be used as a guiding example for building your very own .mailmap file. For more details on the structure of such a file, please refer to the man page.
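
To check whether a mapping does what you expect, git check-mailmap can resolve a single identity. The following sketch builds a throwaway repository in a temporary directory, writes one mapping line from the example above into its .mailmap and asks git for the resolved identity. The variable names demo_dir and resolved are made up for this illustration. In a real repository, the single check-mailmap call is enough.

```shell
# Create a throwaway repository so the check does not touch a real project.
demo_dir=$(mktemp -d)
git init --quiet "$demo_dir"

# One mapping from the example: commits with <max@mail.com> belong to
# Max Mustermann <max.mustermann@mail.com>.
cat > "$demo_dir/.mailmap" <<'EOF'
Max Mustermann <max.mustermann@mail.com> <max@mail.com>
EOF

# Ask git how it resolves one of the commit identities.
resolved=$(git -C "$demo_dir" check-mailmap "Max <max@mail.com>")
echo "$resolved"

# Remove the throwaway repository again.
rm -rf "$demo_dir"
```

If the mapping is picked up, the printed identity should be the proper name and email address rather than the commit identity passed in.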

This approach only displays the number of commits per developer. It does not take the size of the changes (lines added/lines deleted) into account. I will investigate and implement this in another blog post.