Extracting Diffs from Git with Python

One of the first steps in analyzing a Git repository is extracting its changes over time, essentially the Git log. This seems like it should be a very simple thing to do, since visualizations on GitHub and elsewhere show file changes through history on a commit-by-commit basis. Moreover, the GitPython library gives you direct, scriptable access to Git repositories. Unfortunately, things aren’t as simple as that, so I present a snippet for extracting change information from a repository.

First things first: dependencies. To use this code you must install GitPython:

$ pip install gitpython

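What I’m looking for in this example is the change to every file in every commit over time. This doesn’t necessarily mean the change in the blobs themselves, but rather metadata about the change that occurred: for example, the path of the file, the commit hash, the author, the timestamp, the type of change, the size of the file, and the number of lines inserted and deleted.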

This fairly straightforward analysis will allow us to build a graph model of how users and files interact inside a particular project. So here’s the snippet:
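A minimal sketch of such an extraction might look like the following. It assumes GitPython's `Repo.iter_commits`, `Commit.diff`, `Commit.stats.files`, and the `NULL_TREE` sentinel for root commits. The names `versions`, `change_record`, and `DATE_TIME_FORMAT` are hypothetical, as is the default revision of `HEAD`. Renamed files may not match an entry in the stats mapping, in which case this sketch simply reports zero line counts for them.

```python
DATE_TIME_FORMAT = "%Y-%m-%dT%H:%M:%S%z"


def change_record(commit, diff, stats):
    """Build one metadata dictionary for a single file changed in a commit."""
    # A deleted file has no blob on the "b" (new) side, so fall back to "a".
    blob = diff.b_blob if diff.b_blob is not None else diff.a_blob
    return {
        "object": diff.b_path or diff.a_path,
        "commit": commit.hexsha,
        "author": commit.author.email,
        "timestamp": commit.authored_datetime.strftime(DATE_TIME_FORMAT),
        "type": diff.change_type,  # e.g. "A", "M", "D", "R"
        "size": blob.size if blob is not None else 0,
        "insertions": stats.get("insertions", 0),
        "deletions": stats.get("deletions", 0),
        "lines": stats.get("lines", 0),
    }


def versions(path, rev="HEAD"):
    """Yield one change record per file for every commit reachable from rev."""
    # GitPython is imported here so change_record() stays usable without it.
    from git import NULL_TREE, Repo

    repo = Repo(path)  # raises an error if path is not a Git repository
    for commit in repo.iter_commits(rev):
        if commit.parents:
            # Diff from the first parent to this commit (parent -> commit),
            # so that a newly created file is reported as "A" rather than "D".
            diffs = commit.parents[0].diff(commit)
        else:
            # Root commit: there is no parent, so compare to the empty tree.
            diffs = commit.diff(NULL_TREE)

        # Per-file line counts for this commit, keyed by file path.
        file_stats = commit.stats.files
        for diff in diffs:
            stats = file_stats.get(diff.b_path or diff.a_path, {})
            yield change_record(commit, diff, stats)
```

Calling `versions(".")` inside a working copy and iterating over the result would walk the history from newest to oldest. Merge commits are only compared against their first parent here, which is a simplification.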

The result from this snippet is a generator that yields dictionaries that look something like:

{
  "deletions": 0,
  "insertions": 18,
  "author": "benjamin@bengfort.com",
  "timestamp": "2016-02-23T12:36:59-0500",
  "object": "cloudscope/tests/test_utils/__init__.py",
  "lines": 18,
  "commit": "00c5dd71d86f94dce5fd31b254a1c690c5ec1a53",
  "type": "A",
  "size": 509
}

This can be used to create a history of file changes, or to create a graph of files that are commonly changed together.
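As a rough illustration of the second idea, the sketch below counts how often each pair of files shows up in the same commit. It assumes only that each record is a dictionary with the `commit` and `object` keys shown in the example above; the function name `co_changes` is hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def co_changes(records):
    """Count how often each pair of files is changed in the same commit."""
    # Group the changed file paths by the commit they belong to.
    files_by_commit = defaultdict(set)
    for record in records:
        files_by_commit[record["commit"]].add(record["object"])

    pairs = Counter()
    for files in files_by_commit.values():
        # Sort so each unordered pair is always counted under the same key.
        pairs.update(combinations(sorted(files), 2))
    return pairs
```

The resulting counter maps `(file_a, file_b)` tuples to counts, which can serve as weighted edges in a co-change graph; `pairs.most_common(10)` would list the ten most tightly coupled pairs.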