Your Code as a Crime Scene

Analyze a Large Codebase

When you start with a new project, how do you know which parts need extra attention? That kind of expertise takes time to build. You need to read a lot of code, talk to more experienced developers, and start small with your own changes. There’s no way around this.

At the same time, it’s important that you get a quick overview of where potential problems may be hiding. Those problems will influence how you approach design. If you have to add a feature in the middle of the worst spot, you want to know about it so that you can plan countermeasures, such as writing extra tests and setting aside time to refactor the code. You may decide to come up with a different design altogether.

A hotspot analysis gives you an overview of the good as well as the fragile areas of the codebase. The best part is that you get all that information faster than a CSI agent can hack together a Visual Basic GUI to track an IP address in real time.

As an example of a large-scale system, let’s investigate Hibernate^[13], a popular open-source Java library for object-relational mapping. We’re using Hibernate because it’s well known, has a rich history, and is under active development. If you’ve worked with a database in the Java ecosystem, chances are you’ve come across Hibernate.

Clone the Hibernate Repository

To get started, let’s clone Hibernate’s Git repository to your computer:

	prompt> git clone https://github.com/hibernate/hibernate-orm.git
	Cloning into 'hibernate-orm'...
	...
	Receiving objects: 100% (210129/210129), 127.83 MiB | 1.99 MiB/s, done.
	Resolving deltas: 100% (118283/118283), done.
	Checking connectivity... done.

Because Hibernate is under active development, we know things may have changed since I wrote this book. So let’s roll back the code, as we learned in chapter Turn Back Time, so that we all start with the same code:

	prompt> git checkout `git rev-list -n 1 --before="2013-09-05" master`
	Note: checking out '46c962e9b04a883e03137962b0bdb71fdcfa0c4e'.
	...
	HEAD is now at 46c962e... HHH-8468 cleanup and simplification

Now the Hibernate code on your computer looks as it did back in September of 2013. Let’s generate a log, as we did in Automated Mining with Code Maat.

Generate a Version-Control Log

We are going to limit our analysis to code changes made in the last year and a half. Here’s how you specify that:

	prompt> git log --pretty=format:'[%h] %an %ad %s' --date=short \
	--numstat --before=2013-09-05 --after=2012-01-01 > hib_evo.log

This generates a detailed hib_evo.log we can use with Code Maat. Let’s explore the generated data:

	prompt> maat -l hib_evo.log -c git -a summary
	statistic,value
	number-of-commits,1346
	number-of-entities,10193
	number-of-entities-changed,18258
	number-of-authors,89

As you can see, there’s been plenty of development activity over the last year and a half. Remember how we said earlier, in Analyze a Large Codebase, that finding hotspots makes it easier to get started with a new project? This is a good example: you’re starting out with Hibernate and are faced with 400,000 lines of unfamiliar code. Talking to the 89 different developers who’ve contributed to the project over the past year and a half is impractical (particularly since some of them may have left the project).
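To make the summary a little less of a black box, here is a minimal Python sketch that derives the same four statistics from a log in the format we just generated. It is not Code Maat’s implementation. The function name `summarize`, the sample log, and the reading of number-of-entities-changed as the total count of file-change rows are all assumptions made for illustration.

```python
import re

# Commit header as produced by --pretty=format:'[%h] %an %ad %s' --date=short
HEADER = re.compile(r"^\[([0-9a-f]+)\] (.+?) (\d{4}-\d{2}-\d{2})(?: |$)")
# --numstat row: added<TAB>deleted<TAB>path ("-" marks binary files)
NUMSTAT = re.compile(r"^(\d+|-)\t(\d+|-)\t(.+)$")


def summarize(log_text):
    """Return summary statistics for a git log in the format above."""
    commits, authors, entities = set(), set(), set()
    changes = 0  # assumption: one "entity changed" per numstat row
    for line in log_text.splitlines():
        header = HEADER.match(line)
        if header:
            commits.add(header.group(1))
            authors.add(header.group(2))
            continue
        stat = NUMSTAT.match(line)
        if stat:
            entities.add(stat.group(3))
            changes += 1
    return {
        "number-of-commits": len(commits),
        "number-of-entities": len(entities),
        "number-of-entities-changed": changes,
        "number-of-authors": len(authors),
    }


# Hypothetical two-commit log, invented for this example.
SAMPLE = (
    "[a1b2c3d] Jane Doe 2013-09-04 Fix mapping bug\n"
    "10\t2\tsrc/Mapper.java\n"
    "3\t0\tsrc/Session.java\n"
    "\n"
    "[e4f5a6b] John Roe 2013-09-03 Add test\n"
    "7\t1\tsrc/Mapper.java\n"
    "-\t-\tdocs/logo.png\n"
)

if __name__ == "__main__":
    for name, value in summarize(SAMPLE).items():
        print(f"{name},{value}")
```

To try it on real data, read hib_evo.log into a string and pass it to `summarize`. Expect small differences from Code Maat’s numbers if its parsing rules differ from the assumptions above.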

Follow along, and you’ll see how a hotspot analysis can guide you through unfamiliar code territory.

Choose a Timespan for Your Analyses

First of all, it’s important to limit the data you are analyzing to a shorter time period than the project’s total lifetime. If you include too much historic data in the analysis, you skew the results and obscure important recent trends. You also risk flagging hotspots that no longer exist.

One approach is to include time in your analysis by weighting individual measures by their relative age. If you choose that route, the challenge is how to set up the algorithm. We go with an alternative approach in this book, which is to limit the period of time we look at. It’s a more general approach, but it requires you to be familiar with the development history.
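For comparison, here is one way the age-weighted alternative could look. This is a sketch under stated assumptions, not a method used in this book: it applies exponential decay, so a change counts for half as much every `half_life_days`. The function name, the 180-day half-life, and the sample data are hypothetical choices that illustrate exactly the tuning problem mentioned above.

```python
from collections import defaultdict
from datetime import date


def weighted_change_frequency(changes, today, half_life_days=180.0):
    """Sum an age-based weight per file instead of counting raw changes.

    changes: iterable of (path, date_of_change) pairs.
    A change made today weighs 1.0; one made half_life_days ago weighs 0.5.
    """
    scores = defaultdict(float)
    for path, changed_on in changes:
        # Clamp at zero so a date after `today` never weighs more than 1.0.
        age_days = max(0, (today - changed_on).days)
        scores[path] += 0.5 ** (age_days / half_life_days)
    return dict(scores)


# Hypothetical change records, invented for this example.
CHANGES = [
    ("A.java", date(2013, 9, 5)),   # changed today
    ("A.java", date(2013, 3, 9)),   # 180 days ago
    ("B.java", date(2012, 9, 10)),  # 360 days ago
]

if __name__ == "__main__":
    print(weighted_change_frequency(CHANGES, today=date(2013, 9, 5)))
```

Notice that the result depends heavily on the half-life you pick. A fixed analysis window avoids that parameter at the cost of treating every change inside the window as equally important.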

To select an appropriate analysis period, you have to know how you work on the project. You have to know the methodology you’re using and the length of your release cycles. The period also depends on the questions you want answered. On my projects I choose the following timeframes:

Between releases: Compare hotspots between releases to evaluate your long-term improvements.
Over iterations: If you work iteratively, measure between each iteration. This lets you spot code that starts to grow into hotspots early.
Around significant events: Define the analysis period around significant events, such as reorganizations of code or personnel. When you make large redesigns or change the way you work, it will be reflected in the code. With this analysis method, you have a way to investigate both impact and outcome.

Start with a Long Period
	As you start with your first analysis, go with a longer period, such as one or two years of historic data. That lets you explore the system and spot long-term trends. On projects with high development activity, select a shorter initial period, perhaps as little as one month.
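Whichever timeframe you pick, the mechanics are the same: only the --after and --before dates in the log command change. The sketch below builds that command for an arbitrary window. It reuses the git flags shown earlier in this chapter; the helper names `git_log_command` and `window_ending` are hypothetical, introduced only for this example.

```python
from datetime import date, timedelta


def window_ending(end, days):
    """Return an (after, before) pair covering the `days` before `end`."""
    return end - timedelta(days=days), end


def git_log_command(after, before):
    """Build the git log invocation for one analysis period.

    Returned as an argument list (suitable for subprocess.run), so the
    format string needs no shell quoting.
    """
    return [
        "git", "log",
        "--pretty=format:[%h] %an %ad %s",
        "--date=short",
        "--numstat",
        f"--before={before.isoformat()}",
        f"--after={after.isoformat()}",
    ]


if __name__ == "__main__":
    # The long period used for Hibernate in this chapter.
    print(" ".join(git_log_command(date(2012, 1, 1), date(2013, 9, 5))))
    # A short initial period of one month for a very active project.
    after, before = window_ending(date(2013, 9, 5), days=30)
    print(" ".join(git_log_command(after, before)))
```

To compare releases or iterations, call `git_log_command` once per period with the relevant dates and run the same analysis on each resulting log.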
