rm(list = ls())
seed <- 1
set.seed(seed)

Introduction

This notebook uses the JIRA issue data collected by the R Notebook download_jira_data.Rmd to calculate bug counts per file.

Project Configuration File

As usual, the first step is to load the project configuration file.

tool <- yaml::read_yaml("../tools.yml")
conf <- yaml::read_yaml("../conf/geronimo.yml")
perceval_path <- tool[["perceval"]]
git_repo_path <- conf[["version_control"]][["log"]]

# Issue ID Regex on Commit Messages
issue_id_regex <- conf[["commit_message_id_regex"]][["issue_id"]]
# Path to JIRA issues (obtained using the `download_jira_data.Rmd` notebook)
jira_issues_path <- conf[["issue_tracker"]][["jira"]][["issues"]]

# Filters
file_extensions <- conf[["filter"]][["keep_filepaths_ending_with"]]
substring_filepath <- conf[["filter"]][["remove_filepaths_containing"]]

To establish bug counts, we must map each file to its associated issues. Often (but not always), an open source project adopts a commit message convention for labeling issue ids. For example, Kaiaulu uses i #<issue number>. JIRA issue ids have the format PROJECT-<issue number> (for example, GERONIMO-782). We can rely on this convention to find the issues that the git log commits are trying to fix. You can create your own regular expression by manually inspecting some of the commit messages (for example, compare the regular expression configured for the geronimo project against its commit messages).

Since commits are in turn associated with files, we can establish a map from file to issue. Additionally, projects that use JIRA commonly record in the issue type field whether an issue is a feature or a bug.

By connecting both git log and JIRA information, therefore, we can count the number of bugs per file.
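As a minimal sketch of the convention described above, the base R snippet below applies a JIRA-style pattern to a few made-up commit messages. The pattern `GERONIMO-[0-9]+` and the messages are assumptions for illustration only; the expression actually used in this notebook is whatever `issue_id_regex` holds in the project configuration file.

```r
# Made-up commit messages; only the first two follow the convention.
commit_messages <- c(
  "GERONIMO-2665 Fail deployment on invalid module path",
  "Fix classloader leak (GERONIMO-782)",
  "Update copyright headers"
)

# Assumed pattern: project key, a hyphen, then the issue number.
example_regex <- "GERONIMO-[0-9]+"

# Which messages mention an issue id at all?
has_issue_id <- grepl(example_regex, commit_messages)

# Extract the first issue id per message; keep NA where there is none.
issue_ids <- rep(NA_character_, length(commit_messages))
issue_ids[has_issue_id] <- regmatches(commit_messages,
                                      regexpr(example_regex, commit_messages))

print(data.frame(commit_message = commit_messages, issue_id = issue_ids))
```

Running a check like this against a sample of real commit messages is a quick way to see whether a candidate expression is too strict or too loose before using it on the whole log.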

Parse Git Log

After specifying the necessary configuration file, we first parse the project git log:

project_git <- parse_gitlog(perceval_path,git_repo_path)

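We then filter the files, keeping only those with the configured file extensions and removing file paths that contain the configured substrings (for example, test files):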

project_git <- project_git  %>%
  filter_by_file_extension(file_extensions,"file_pathname")  %>% 
  filter_by_filepath_substring(substring_filepath,"file_pathname")
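The effect of the two filters can be sketched in base R as follows. This is not Kaiaulu's implementation: the paths and filter values are made up, and case-sensitive, literal matching is an assumption of the sketch.

```r
# Made-up file paths standing in for the file_pathname column.
file_pathname <- c(
  "modules/kernel/src/java/org/example/Kernel.java",
  "modules/kernel/src/test/org/example/KernelTest.java",
  "README.txt"
)

# Hypothetical filter values; the real ones come from the configuration file.
keep_extensions <- c("java")
remove_substrings <- c("test")

# TRUE for paths ending with any of the extensions.
keep <- Reduce(`|`, lapply(keep_extensions,
                           function(ext) endsWith(file_pathname, ext)))

# TRUE for paths containing any of the substrings (fixed = TRUE: no regex).
drop <- Reduce(`|`, lapply(remove_substrings,
                           function(s) grepl(s, file_pathname, fixed = TRUE)))

filtered <- file_pathname[keep & !drop]
print(filtered)
```

Only the first path survives: the second contains the substring `test`, and the third does not end with a kept extension.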

We standardize the timestamps to UTC and then take a slice of the data for visualization:

project_git$author_datetimetz <- as.POSIXct(project_git$author_datetimetz,
                                              format = "%a %b %d %H:%M:%S %Y %z", tz = "UTC")
project_git$committer_datetimetz <- as.POSIXct(project_git$committer_datetimetz,
                                              format = "%a %b %d %H:%M:%S %Y %z", tz = "UTC")

project_git_slice <- project_git[
  author_datetimetz >= as.POSIXct("2012-01-01", format = "%Y-%m-%d", tz = "UTC") &
  author_datetimetz < as.POSIXct("2012-12-31", format = "%Y-%m-%d", tz = "UTC")]
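To illustrate what the conversion does, the sketch below parses a single made-up timestamp in the layout assumed by the format string above and then applies the same interval test as the slice. The timestamp is hypothetical, and pinning the time locale is a precaution of this sketch, since `%a` and `%b` are locale-dependent.

```r
# %a and %b depend on the locale, so pin the time locale for this example.
invisible(Sys.setlocale("LC_TIME", "C"))

# A made-up timestamp: weekday, month, day, time, year, UTC offset.
raw <- "Mon Jan 02 15:04:05 2012 -0500"
parsed <- as.POSIXct(raw, format = "%a %b %d %H:%M:%S %Y %z", tz = "UTC")

# The -0500 offset is applied, so the UTC clock time is five hours later.
parsed_utc <- format(parsed, "%Y-%m-%d %H:%M:%S", tz = "UTC")
print(parsed_utc)

# Same interval test as the slice: lower bound included, upper bound excluded.
lower <- as.POSIXct("2012-01-01", format = "%Y-%m-%d", tz = "UTC")
upper <- as.POSIXct("2012-12-31", format = "%Y-%m-%d", tz = "UTC")
in_slice <- parsed >= lower & parsed < upper
print(in_slice)
```

Note that the upper bound is midnight at the start of 2012-12-31, so commits authored during that day fall outside the slice.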

Identifying Issue IDs in Commit Messages

We can use a built-in Kaiaulu function to search for a regular expression (regex) of the issue id. First we use the regex to calculate how many commits contain issue ids. Ideally, you should consider projects with a high enough coverage, or the results may not be representative.

The total number of commits with issue ids in the chosen git slice is:

commit_message_id_coverage(project_git_slice,issue_id_regex)
## [1] 54

Proportion of commit messages containing issue ids relative to all commits in the slice:

normalized_coverage <- commit_message_id_coverage(project_git_slice,issue_id_regex)/length(unique(project_git_slice$commit_hash))
normalized_coverage
## [1] 0.84375
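The two reported values are consistent with a slice of 64 unique commits, since 54 / 64 = 0.84375. The sketch below shows one way such a coverage figure can be computed in base R on made-up data. Whether `commit_message_id_coverage()` counts in exactly this way is an assumption, and the column name `commit_message` and the pattern are hypothetical.

```r
# Made-up slice with one row per (commit, file), so commit hashes repeat.
slice <- data.frame(
  commit_hash = c("a1", "a1", "b2", "c3", "d4"),
  commit_message = c("GERONIMO-781 Fix test", "GERONIMO-781 Fix test",
                     "Cleanup", "GERONIMO-780 Default listener",
                     "GERONIMO-779 Fix load"),
  stringsAsFactors = FALSE
)
example_regex <- "GERONIMO-[0-9]+"  # assumed pattern

# Collapse to one row per commit before counting.
commits <- unique(slice[, c("commit_hash", "commit_message")])

# Commits whose message contains an issue id, and their share of all commits.
n_with_id <- sum(grepl(example_regex, commits$commit_message))
coverage <- n_with_id / nrow(commits)
print(c(n_with_id = n_with_id, coverage = coverage))
```

Deduplicating by commit first matters because the git log table repeats each commit once per changed file; counting rows directly would weight commits by the number of files they touch.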

Issue - File Network

To get a better idea of the mapping, we can also visualize the issue-file network we just established using the regex as follows:

project_commit_message_id_edgelist <- transform_commit_message_id_to_network(project_git_slice,commit_message_id_regex = issue_id_regex)

project_commit_message_id_network <- igraph::graph_from_data_frame(d=project_commit_message_id_edgelist[["edgelist"]], 
                      directed = TRUE, 
                      vertices = project_commit_message_id_edgelist[["nodes"]])
visIgraph(project_commit_message_id_network,randomSeed = 1)

As the network shows, visual inspection can identify interesting outliers in the mapping. Remember that you can zoom in to read the individual issue ids and file paths.
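For readers unfamiliar with the inputs of `igraph::graph_from_data_frame()`, the sketch below builds a tiny issue-file graph from made-up data. The column names `from`, `to`, `name`, and `type` are assumptions of this sketch and are not necessarily the ones returned by `transform_commit_message_id_to_network()`.

```r
# Made-up issue-to-file pairs: the first two columns are the edge endpoints.
edgelist <- data.frame(
  from = c("GERONIMO-781", "GERONIMO-781", "GERONIMO-780"),
  to   = c("A.java", "B.java", "C.java"),
  stringsAsFactors = FALSE
)

# Node table: the first column must list every name used in the edge list.
nodes <- data.frame(
  name = unique(c(edgelist$from, edgelist$to)),
  stringsAsFactors = FALSE
)
nodes$type <- ifelse(nodes$name %in% edgelist$from, "issue", "file")
print(nodes)

# Build the graph only if igraph is installed.
if (requireNamespace("igraph", quietly = TRUE)) {
  g <- igraph::graph_from_data_frame(d = edgelist, directed = TRUE,
                                     vertices = nodes)
  # Out-degree of an issue node = number of files linked to that issue.
  print(igraph::degree(g, mode = "out"))
}
```

An issue with an unusually high out-degree is one kind of outlier worth inspecting, for example a single commit that touched many files under one issue id.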

Calculating Bug Count per File

Now that we have explored the file-to-issue mapping, let’s focus on calculating the total bug count per file for the entire git log.

First, we can extract only the detected issue id from the commit message, and add it in a separate column:

project_git <- parse_commit_message_id(project_git, issue_id_regex)

Now that we know the issue id referenced by each commit, we can add the issue_type information obtained directly from JIRA. This tells us which files were involved in a bug, and hence gives the bug count per file. We also consider only issues that are already closed, because an open bug may later be found to be invalid.

The JIRA table contains many fields (see download_jira_data.Rmd for all columns). Here we show a sample of the fields relevant to the bug count:

jira_issues <- parse_jira(jira_issues_path)[["issues"]]
jira_issues <- jira_issues[issue_status == "Closed" & issue_type == "Bug"]

kable(head(jira_issues[,.(issue_key,issue_type,issue_status,issue_summary)]))
| issue_key | issue_type | issue_status | issue_summary |
|:---|:---|:---|:---|
| GERONIMO-2665 | Bug | Closed | CLONE -Invalid module path or references in plan should result in failed deployment or warning |
| GERONIMO-782 | Bug | Closed | ejb ws deployment system does not use gbean builder references |
| GERONIMO-781 | Bug | Closed | TomcatModuleBuilderTest busted |
| GERONIMO-780 | Bug | Closed | mdb builder needs jms message listener as default |
| GERONIMO-779 | Bug | Closed | SchemaInfoBuilder class cannot be loaded due to xmlbeans problems |
| GERONIMO-777 | Bug | Closed | Deployment files not removed on Windows |

We can then perform a left join using the extracted key of the git log against the issue key from the issue data:

project_git <- merge(project_git,jira_issues,all.x=TRUE,by.x="commit_message_id",
                     by.y="issue_key")
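The behaviour of the left join is easiest to see on a small example. The sketch below uses base R `merge()` on made-up data frames; the `merge()` method for data.table objects accepts the same `all.x`, `by.x`, and `by.y` arguments. The issue id GERONIMO-999 is hypothetical and stands for an id with no matching closed bug.

```r
# Made-up git log rows (one per commit and file).
git_rows <- data.frame(
  commit_message_id = c("GERONIMO-781", "GERONIMO-999", NA),
  file_pathname = c("A.java", "B.java", "A.java"),
  stringsAsFactors = FALSE
)

# Made-up table of closed bugs.
bugs <- data.frame(
  issue_key = "GERONIMO-781",
  issue_type = "Bug",
  stringsAsFactors = FALSE
)

# all.x = TRUE keeps every git row; rows without a match get NA
# in the columns that come from the bug table.
joined <- merge(git_rows, bugs, all.x = TRUE,
                by.x = "commit_message_id", by.y = "issue_key")
print(joined)
```

All three git rows survive the join, but only the row for GERONIMO-781 carries an issue_type; the other two have NA there.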

For each file, we now know the commits that were associated with fixing bugs. We then group the number of bug-related issues by file path. Our result should show the number of bugs per file path. The following outputs the top 20 files with the most bugs.

file_bug_count <- project_git[,.(bug_count=.N),by = "file_pathname"]
kable(head(file_bug_count[order(-bug_count)],20))
| file_pathname | bug_count |
|:---|---:|
| modules/jetty-builder/src/java/org/apache/geronimo/jetty/deployment/JettyModuleBuilder.java | 132 |
| modules/kernel/src/java/org/apache/geronimo/kernel/config/Configuration.java | 110 |
| modules/client-builder/src/java/org/apache/geronimo/client/builder/AppClientModuleBuilder.java | 104 |
| plugins/openejb/geronimo-openejb-builder/src/main/java/org/apache/geronimo/openejb/deployment/EjbModuleBuilder.java | 91 |
| modules/deployment/src/java/org/apache/geronimo/deployment/DeploymentContext.java | 90 |
| modules/deployment/src/java/org/apache/geronimo/deployment/Deployer.java | 89 |
| modules/kernel/src/java/org/apache/geronimo/kernel/Kernel.java | 87 |
| modules/tomcat-builder/src/java/org/apache/geronimo/tomcat/deployment/TomcatModuleBuilder.java | 84 |
| modules/j2ee-builder/src/java/org/apache/geronimo/j2ee/deployment/EARConfigBuilder.java | 81 |
| modules/jetty/src/java/org/apache/geronimo/jetty/JettyWebAppContext.java | 76 |
| modules/connector-builder/src/java/org/apache/geronimo/connector/deployment/ConnectorModuleBuilder.java | 76 |
| modules/service-builder/src/java/org/apache/geronimo/deployment/service/ServiceConfigBuilder.java | 67 |
| modules/system/src/java/org/apache/geronimo/system/main/Daemon.java | 62 |
| modules/naming-builder/src/java/org/apache/geronimo/naming/deployment/ENCConfigBuilder.java | 62 |
| modules/axis-builder/src/java/org/apache/geronimo/axis/builder/AxisBuilder.java | 62 |
| plugins/tomcat/geronimo-tomcat7-builder/src/main/java/org/apache/geronimo/tomcat/deployment/TomcatModuleBuilder.java | 62 |
| modules/geronimo-openejb-builder/src/main/java/org/apache/geronimo/openejb/deployment/EjbModuleBuilder.java | 61 |
| modules/tomcat/src/java/org/apache/geronimo/tomcat/TomcatWebAppContext.java | 59 |
| plugins/j2ee/geronimo-j2ee-builder/src/main/java/org/apache/geronimo/j2ee/deployment/EARConfigBuilder.java | 59 |
| modules/geronimo-axis2/src/main/java/org/apache/geronimo/axis2/Axis2WebServiceContainer.java | 58 |
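Because the merge is a left join, a per-file row count and a per-file count of matched bugs can differ. The base R sketch below contrasts the two on made-up data. It is an illustration under stated assumptions rather than a restatement of the computation above: the rows are invented, and GERONIMO-999 is a hypothetical id with no matching closed bug.

```r
# Made-up result of a left join: issue_type is NA where no closed bug matched.
joined <- data.frame(
  file_pathname = c("A.java", "A.java", "A.java", "B.java", "B.java"),
  commit_message_id = c("GERONIMO-781", "GERONIMO-781", NA,
                        "GERONIMO-780", "GERONIMO-999"),
  issue_type = c("Bug", "Bug", NA, "Bug", NA),
  stringsAsFactors = FALSE
)

# Count every row per file, matched or not.
rows_per_file <- table(joined$file_pathname)

# Keep matched rows only, then count distinct bug ids per file.
matched <- joined[!is.na(joined$issue_type), ]
bugs_per_file <- tapply(matched$commit_message_id, matched$file_pathname,
                        function(ids) length(unique(ids)))

print(rows_per_file)
print(bugs_per_file)
```

In this toy data A.java has three rows but only one distinct bug, because two commits reference the same issue and one row has no matching bug. When interpreting a bug count, it is worth checking which of these two quantities a given grouping actually produces.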