Is it a bug or a feature?

Hello, I’m Jakub, a software engineer at Datawrapper.

Recently, I read a few books about software development that I found in our library here in the Berlin office. One statement in Facts and Fallacies of Software Engineering[1] caught my attention:

Errors tend to cluster.

This fact can be useful: when you find a problem in one module, it pays to look at that module more closely, because it probably contains more problems.

By problems, or bugs, I mean improvements that need to be made to existing code to make it match the current requirements. Maybe the app didn’t support a specific web browser version, or it was slower than desirable, or it relied on third-party code that has become outdated. Many bugs aren’t visible to users and don’t affect them. What they have in common is that fixing them means changing existing code without changing the product requirements. This distinguishes them from features, which extend the requirements.

So to see if errors indeed cluster, and to find modules that are likely to contain more errors, I decided to analyze the code of Datawrapper by looking at the history of changes to our project and counting which modules received the most improvements. A “module” in a software project is usually a single source code file, but to get a higher-level view, I will be looking at whole folders of files in this article.

There were, however, two caveats:

  • Large modules (defined by the number of code lines) generally need more improvements, solely because they’re large.
  • Modules in which we add a lot of new features would probably also see a larger number of fixes.

Therefore, a meaningful visualization of bug fixes in a software project should take into account the size of a module and the number of features added to it. Then the focus should be on the outliers — modules that we improved more often than their size and number of features would suggest. Bear in mind that a larger number of fixes doesn’t necessarily mean a bad piece of code: it can also mean that the module just received fewer new features in comparison, because it’s considered complete.
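
One way to sketch this outlier idea in Python is shown below. It is only an illustration, not the method behind the chart: the ModuleStats fields, the find_outliers helper, and the threshold factor are all hypothetical names and choices made up for this example. It flags a module when its improvements are high relative to its size and also make up an unusually large share of its changes.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class ModuleStats:
    # Hypothetical record for one module (a folder of source files).
    name: str
    size: int               # current number of lines in the module
    feature_lines: int      # lines changed for new features
    improvement_lines: int  # lines changed for improvements (fixes)


def find_outliers(modules, factor=2.0):
    """Return names of modules improved more than size and features suggest.

    Assumed heuristic: a module is an outlier when its improvement density
    (improvement lines per line of code) is more than `factor` times the
    median density, and its share of improvements among all changed lines
    is above the median share.
    """
    rated = []
    for m in modules:
        changed = m.feature_lines + m.improvement_lines
        if m.size == 0 or changed == 0:
            continue  # nothing to compare for empty or untouched modules
        density = m.improvement_lines / m.size
        share = m.improvement_lines / changed
        rated.append((m.name, density, share))
    if not rated:
        return []
    typical_density = median(d for _, d, _ in rated)
    typical_share = median(s for _, _, s in rated)
    return [
        name
        for name, d, s in rated
        if d > factor * typical_density and s > typical_share
    ]
```

A flagged module is a candidate for a closer look, not a verdict: as noted above, a high share of improvements can simply mean the module is considered complete and gets few new features.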

Now, without further ado, here is the result of my analysis of Datawrapper code:

What are we looking at?

  • The circles represent different modules of Datawrapper.
  • The size of each circle represents the size of each module, measured by the current number of lines in all the files that the module contains.
  • The horizontal axis shows the number of lines of code that we changed in a given module as part of improvements since September 2021.
  • The vertical axis shows the number of lines of code that we changed in a given module as part of new features since September 2021.

From my perspective as a Datawrapper developer, the chart matches what I remember we’ve been working on. My goal was to discover code that deserves extra attention and may contain problems we haven’t noticed yet, so I might need to continue the research and zoom in on particular modules or even individual lines of code. Beyond that, looking at the improvements we made tells us a lot about which parts of the project we prioritized in the past few years.

The data were collected by analyzing the log of changes to our project (commits in our version control system) since September 2021. Each entry in the log is a set of changes to one or more files. It also comes with a message that the developer added to describe what the changes are about. Since we follow a fairly precise format for these messages, I could tell if the changes created a new feature or improved an existing one.
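
The counting step described above could look roughly like the following Python sketch. It rests on several assumptions that the article does not confirm: that the commit messages use Conventional Commits-style prefixes such as "feat:" and "fix:", that "fix", "perf", and "refactor" count as improvements, and that a module is the top-level folder of a file path. The function names are hypothetical. The input is the text printed by `git log --since=2021-09-01 --numstat --format='@@%s'`, where the "@@" marker is an arbitrary choice to make subject lines easy to spot.

```python
import re
from collections import defaultdict

# Assumption: Conventional Commits-style prefixes; the real format may differ.
FEATURE_RE = re.compile(r"^feat(\(.+?\))?!?:")
IMPROVEMENT_RE = re.compile(r"^(fix|perf|refactor)(\(.+?\))?!?:")


def classify(subject):
    """Map a commit subject to 'feature', 'improvement', or None."""
    if FEATURE_RE.match(subject):
        return "feature"
    if IMPROVEMENT_RE.match(subject):
        return "improvement"
    return None


def module_of(path, depth=1):
    """Treat the first `depth` folders of a path as the module name."""
    parts = path.split("/")
    if len(parts) <= depth:
        return "(root)"  # file sits above the chosen folder depth
    return "/".join(parts[:depth])


def count_changed_lines(log_text):
    """Sum added plus deleted lines per module and per kind of change."""
    totals = defaultdict(lambda: {"feature": 0, "improvement": 0})
    kind = None
    for line in log_text.splitlines():
        if line.startswith("@@"):
            kind = classify(line[2:])  # a new commit starts here
            continue
        fields = line.split("\t")
        if kind is None or len(fields) != 3:
            continue  # blank line, or a commit that is neither kind
        added, deleted, path = fields
        if added == "-" or deleted == "-":
            continue  # git prints "-" for binary files
        totals[module_of(path)][kind] += int(added) + int(deleted)
    return dict(totals)
```

Renamed files show up in the numstat output with a "=>" in the path, which this sketch does not handle specially, so a real analysis would need to decide which module such changes belong to.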

That's it from me for today. I hope you enjoyed this software-themed chart.

  1. Robert L. Glass, Facts and Fallacies of Software Engineering, 2007