This notebook explains how to compute social smells for a given open source project. Before we begin, it is important to understand what data is required to compute social smells, what Kaiaulu will do for you, and what it will not.
Social smell metrics require both collaboration and communication data. Collaboration data is extracted from version control systems, while communication data can be obtained from whatever the project of interest uses for developers to communicate.
Obtaining collaboration data is relatively painless: You need only clone the version control system locally. Currently Kaiaulu only supports analysis in Git, but in the future it may support other version control systems. Obtaining communication data, however, requires more effort on your part depending on the project you choose. This is because there is a large variety of communication mediums and archive types open source projects use.
Broadly, open source projects use mailing lists, issue trackers (comments), or both for communication.
Examples of mailing list archive types are GNU Mailing List Manager, Google Groups, Mail Archive, Apache’s MOD Mbox, Free Lists, Discourse, etc. In March 2021, even GitHub launched its own built-in communication medium, Discussions.
Examples of issue tracker types include JIRA, GitHub’s built-in Issue Tracker, GitLab’s Issue Tracker, Monorail (used in Google’s Chromium), Trac, etc. You may also be interested in including discussion that occurs in GitHub’s pull requests or GitLab’s merge requests. Issue tracker communication is currently not supported in Kaiaulu, but we plan to support GitHub Issue Tracker, GitHub Pull Requests, and JIRA in the future.
It is of course not viable for Kaiaulu to implement interfaces to every single archive type out there. Therefore, to calculate social smells, we expect you to obtain a .mbox representation of the mailing list of interest. This may be available from the open source project directly (e.g. mod_mbox for Apache projects; for mod_mbox and pipermail, see the R/download.R functions), or via a crawler someone already made. For example, gg_scraper outputs an mbox file from Google Groups mailing lists (although it can only obtain partial information from the e-mails, as Google Groups truncates them, which may limit some steps of the analysis, such as the identity matching discussed below). In the future, we plan to also support pipermail, the archive type for GNU Mailing List Manager.
The bottom line is that the effort required to obtain the mailing list data will vary depending on the open source project of interest, as open source projects may even transition through different archive types over time. Once you have both a git log and an .mbox file available on your computer, you are ready to proceed with the social smells analysis of this notebook.
Please ensure the following R packages are installed on your computer. The dependency on igraph is not strictly required, but it will be used in this notebook.
The parameters necessary for analysis are kept in a project configuration file to ensure reproducibility. In this notebook, we use Apache’s APR open source project. Refer to the conf/ folder in Kaiaulu’s git repository for APR and other project configuration files. It is in this project configuration file that we specify where Kaiaulu can find the git log and .mbox file for APR. We also specify filters of interest here: for example, whether Kaiaulu should ignore test files, or anything that is not source code.
We also provide the path for tools.yml. Kaiaulu does not implement all available functionality from scratch, nor does it expect every dependency to be installed. Instead, each API function that relies on an external dependency expects the filepath to that dependency’s binary as a parameter. tools.yml is a convenience file that stores all the binary paths, so they can be set once during setup and reused across analyses. You can find an example tools.yml in the root directory of Kaiaulu’s GitHub repository.
tools_path <- "../tools.yml"
conf_path <- "../conf/apr.yml"

tool <- yaml::read_yaml(tools_path)
conf <- yaml::read_yaml(conf_path)

perceval_path <- tool[["perceval"]]
git_repo_path <- conf[["data_path"]][["git"]]
start_commit <- conf[["interval"]][["window"]][["start_commit"]]
end_commit <- conf[["interval"]][["window"]][["end_commit"]]
window_size <- conf[["interval"]][["window"]][["size_days"]]
mbox_path <- conf[["data_path"]][["mbox"]]
scc_path <- tool[["scc"]]

# Filters
file_extensions <- conf[["filter"]][["keep_filepaths_ending_with"]]
substring_filepath <- conf[["filter"]][["remove_filepaths_containing"]]
The remainder of this notebook does not require modifications. If you encounter an error in any code block below, chances are that one or more of the parameters above have been specified incorrectly, or that the project of choice has led to an outlier case. Please open an issue if you encounter an error or, if you are unsure, post in the Discussions section of Kaiaulu’s GitHub repository. E-mailing bugs is discouraged, as they are hard to track that way.
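Since most errors further down trace back to a misconfigured path, it can help to check the configured paths before running anything else. The following is a minimal sketch in base R, not part of Kaiaulu: `find_missing_paths()` is a hypothetical helper, and the toy paths stand in for values such as `mbox_path` or `git_repo_path` read from the configuration files.

```r
# Hypothetical helper (not part of Kaiaulu): return the names of the
# configured paths that do not exist on disk.
find_missing_paths <- function(paths) {
  exists_flag <- file.exists(unlist(paths))
  names(paths)[!exists_flag]
}

# Toy example: one path that exists and one that does not.
existing_file <- tempfile(fileext = ".mbox")
file.create(existing_file)

paths <- list(
  mbox_path = existing_file,
  git_repo_path = file.path(tempdir(), "no_such_repo", ".git")
)

missing_paths <- find_missing_paths(paths)
print(missing_paths)
```

In the notebook, the same check could be applied to a named list built from the variables defined in the configuration block.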
As stated in the introduction, we need both git log and mailing list to compute social smells. Therefore, the first step is to parse the raw data.
To get started, we use the parse_gitlog function to extract a table from the git log. You can inspect the project_git variable to see what information is available from the git log.
project_git <- parse_gitlog(perceval_path,git_repo_path)
project_git <- project_git %>%
  filter_by_file_extension(file_extensions,"file_pathname") %>%
  filter_by_filepath_substring(substring_filepath,"file_pathname")
project_git <- project_git[order(author_datetimetz)]
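To make the effect of the two filters concrete, here is a minimal sketch in base R of the behaviour suggested by the configuration keys `keep_filepaths_ending_with` and `remove_filepaths_containing`. The two functions are hypothetical stand-ins written for illustration; they are not Kaiaulu's `filter_by_file_extension()` and `filter_by_filepath_substring()`, whose exact matching rules may differ.

```r
# Hypothetical stand-in: keep only paths that end with one of the extensions.
keep_paths_ending_with <- function(paths, extensions) {
  keep <- vapply(paths, function(p) any(endsWith(p, extensions)), logical(1))
  paths[keep]
}

# Hypothetical stand-in: drop paths that contain any of the substrings.
remove_paths_containing <- function(paths, substrings) {
  drop <- vapply(paths, function(p) {
    any(vapply(substrings, function(s) grepl(s, p, fixed = TRUE), logical(1)))
  }, logical(1))
  paths[!drop]
}

# Toy file list standing in for the file_pathname column.
files <- c("src/apr_pools.c", "include/apr.h", "test/testpools.c", "README.md")

kept <- remove_paths_containing(
  keep_paths_ending_with(files, c(".c", ".h")),
  c("test")
)
print(kept)
```

Applying the extension filter first and the substring filter second mirrors the order used in the notebook's pipeline.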
Next, we parse the .mbox data using the parse_mbox function. As with parse_gitlog, the returned object is a table, which we can inspect directly in R to see what information is available.
project_mbox <- parse_mbox(perceval_path,mbox_path)
project_mbox <- project_mbox[order(reply_datetimetz)]
We also have to parse and normalize the timezones across the two data sources. Since one of the social metrics in the quality framework is the count of different timezones, we extract the timezone information into separate columns before normalizing the timestamps to UTC.
# Parse Timezones
project_git$author_tz <- sapply(stringi::stri_split(project_git$author_datetimetz,
                                                    regex=" "),"[[",6)
project_git$committer_tz <- sapply(stringi::stri_split(project_git$committer_datetimetz,
                                                       regex=" "),"[[",6)
# The reply timestamps live in the mbox table, not in the git log table
project_mbox$reply_tz <- sapply(stringi::stri_split(project_mbox$reply_datetimetz,
                                                    regex=" "),"[[",6)
project_git$author_datetimetz <- as.POSIXct(project_git$author_datetimetz,
                                            format = "%a %b %d %H:%M:%S %Y %z", tz = "UTC")
project_git$committer_datetimetz <- as.POSIXct(project_git$committer_datetimetz,
                                               format = "%a %b %d %H:%M:%S %Y %z", tz = "UTC")
project_mbox$reply_datetimetz <- as.POSIXct(project_mbox$reply_datetimetz,
                                            format = "%a, %d %b %Y %H:%M:%S %z", tz = "UTC")

#project_mbox <- project_mbox[reply_datetimetz >= start_date & reply_datetimetz <= end_date]
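The order of the two steps matters: once a timestamp is converted to UTC, its original offset is no longer recoverable from the value. The sketch below shows this on two made-up timestamps in base R (using `strsplit` instead of `stringi`); the sample strings are assumptions chosen to match the two format strings used above.

```r
# Made-up timestamps in the git log and mbox formats used in this notebook.
git_stamp  <- "Tue Nov 26 09:28:22 2013 +0100"
mbox_stamp <- "Tue, 26 Nov 2013 09:28:22 -0500"

# Both formats have six space-separated fields; the UTC offset is the sixth.
git_tz  <- strsplit(git_stamp,  " ", fixed = TRUE)[[1]][6]
mbox_tz <- strsplit(mbox_stamp, " ", fixed = TRUE)[[1]][6]

# Convert to UTC. Day and month names are parsed according to the current
# locale, so this assumes an English (or C) locale.
git_utc  <- as.POSIXct(git_stamp,  format = "%a %b %d %H:%M:%S %Y %z", tz = "UTC")
mbox_utc <- as.POSIXct(mbox_stamp, format = "%a, %d %b %Y %H:%M:%S %z", tz = "UTC")

print(c(git_tz, mbox_tz))
print(git_utc)
print(mbox_utc)
```

Although both strings show the same wall-clock time, they denote different instants, which is why the comparison across git log and mailing list is done on the normalized values.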
Having parsed both git log and mbox, we are ready to start computing the social smells. Social smells are computed on a “time window” granularity. For example, we may ask “between January 2020 and April 2020, how many organizational silos are identified in APR?”. This means we will inspect both the git log and mailing list for the associated time period, perform the necessary transformations in the data, and compute the number of organizational silos.
So we begin by specifying how large our time window should be, in days:
window_size <- 90 # 90 days
Kaiaulu will then split the git log history and the mailing list into non-overlapping time windows of 90 days (roughly 3 months) and, for each window, identify the number of organizational silo, missing link, and radio silence social smells.
We “slice” the git log and mailing list tables parsed earlier into window_size chunks, and iterate over each “slice” in a for loop.
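The slicing itself can be illustrated on toy data. The sketch below, in base R, builds 90-day window boundaries with `seq()` and counts how many events fall in each half-open window; the dates and the `events` vector are made up for illustration.

```r
start_date  <- as.POSIXct("2020-01-01 00:00:00", tz = "UTC")
end_date    <- as.POSIXct("2020-12-31 00:00:00", tz = "UTC")
window_size <- 90

# Window boundaries. A trailing window shorter than window_size gets no
# closing boundary, so events after the last boundary are not analyzed.
boundaries <- seq(from = start_date, to = end_date, by = paste(window_size, "day"))

# Toy event timestamps standing in for commits or e-mail replies.
events <- as.POSIXct(c("2020-01-15", "2020-02-20", "2020-04-02", "2020-12-30"),
                     tz = "UTC")

events_per_window <- integer(length(boundaries) - 1)
for (j in 2:length(boundaries)) {
  i <- j - 1
  # Half-open interval [start, end): an event on a boundary belongs to the
  # window that starts there.
  in_window <- events >= boundaries[i] & events < boundaries[j]
  events_per_window[i] <- sum(in_window)
}

print(boundaries)
print(events_per_window)
```

The last toy event falls after the final boundary and is therefore not counted in any window, which matches the note in the notebook's code that the incomplete final window is discarded.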
Within a slice we do the following:
There is a large variety of customization in the above 5 steps. We will discuss them briefly here, as they directly impact the quality of your results.
It is very common for authors to have multiple names and e-mails within and across git logs, mailing lists, issue trackers, etc. There is no perfect way to identify all identities of an individual, only heuristics. Kaiaulu has a large number of unit tests, each capturing a different example of how people use their e-mails. However, this is not magic. We have encountered open source projects whose tooling would entirely compromise the analysis if it were done blindly. For example, a common identity matching heuristic considers two accounts to be the same identity if either the first and last name or the e-mail match exactly. In one open source project we analyzed, a tool masked all users with commit access as “admin@project.org”. Under this heuristic, every committer would have been merged into a single identity, compromising the entire dataset.
Using the code below, you can manually inspect in project_git and project_mbox the identity assigned to the various users. I strongly encourage you to do so. It is also possible to configure the identity_match() function to consider only names, instead of names and e-mails, to avert the example above. If you find an edge case where the identity is incorrectly assigned, please open an issue so we can add the edge case. You may also manually correct the identity numbers before executing the remaining code blocks, to improve the accuracy of the results.
# Identity matching
project_log <- list(project_git=project_git,project_mbox=project_mbox)
project_log <- identity_match(project_log,
                              name_column = c("author_name_email","reply_from"),
                              assign_exact_identity,
                              use_name_only=TRUE,
                              label = "raw_name")
project_git <- project_log[["project_git"]]
project_mbox <- project_log[["project_mbox"]]
Remember: Social smells rely heavily on patterns of collaboration and communication. If the identities are poorly assigned, the social smells will not correctly reflect the project status (since, in essence, several people considered to be communicating with one another are the same individual!).
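The masked e-mail problem described above can be reproduced with a toy version of the exact-match heuristic. The function below is a hypothetical sketch in base R, not Kaiaulu's `identity_match()` or `assign_exact_identity`: it gives two accounts the same identity when their names match exactly or, unless `use_name_only` is set, when their e-mails match exactly. All names and addresses other than "admin@project.org" are made up.

```r
# Hypothetical sketch of an exact-match identity heuristic.
assign_identity_sketch <- function(person_names, emails, use_name_only = FALSE) {
  n <- length(person_names)
  id <- seq_len(n)
  for (i in seq_len(n)) {
    for (j in seq_len(i - 1)) {
      same_name  <- person_names[i] == person_names[j]
      same_email <- !use_name_only && emails[i] == emails[j]
      if (same_name || same_email) {
        # Merge the whole group of account i into the group of account j.
        id[id == id[i]] <- id[j]
      }
    }
  }
  # Renumber the groups as 1, 2, 3, ... in order of first appearance.
  match(id, unique(id))
}

# One person with two e-mails, plus a second person.
ids_regular <- assign_identity_sketch(
  c("Jane Doe", "Jane Doe", "John Roe"),
  c("jane@example.org", "jdoe@example.com", "john@example.org")
)

# Three different people whose e-mail was masked by project tooling.
masked_names  <- c("Jane Doe", "John Roe", "Ann Poe")
masked_emails <- rep("admin@project.org", 3)
ids_masked    <- assign_identity_sketch(masked_names, masked_emails)
ids_name_only <- assign_identity_sketch(masked_names, masked_emails,
                                        use_name_only = TRUE)

print(ids_regular)
print(ids_masked)
print(ids_name_only)
```

With the masked address, name-or-e-mail matching collapses three people into one identity, while name-only matching keeps them apart. Name-only matching has its own failure mode (two different people with the same name), which is one more reason to inspect the assigned identities manually.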
As mentioned in the introduction, there are multiple types of mailing list archives out there, and it may be more sensible to use an issue tracker instead of a mailing list, or a combination of both, depending on the project. Besides data types, there are also different types of transformations that can be done when we transform the data into networks. This notebook implements the bipartite transformation (see parse_gitlog_network()). It is also possible to use a temporal transformation (see parse_gitlog_temporal_network()). The choice of transformation impacts the direction and overall type of network that will be generated, so it is important that you understand how this impacts your research conclusions. A similar transformation could be applied to mailing lists, but it is not yet implemented. Because we use the bipartite transformation in the code block below, we also perform a bipartite projection. These are well-known operations in graph theory, which also impact the interpretability of the results.
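As a reminder of what a bipartite projection does, the sketch below projects a toy author-file network onto its authors using plain matrix algebra in base R. The authors and files are made up, and this is an illustration of the operation rather than the code path the notebook uses, which relies on igraph.

```r
# Toy author-file incidence matrix: 1 means the author changed the file.
incidence <- matrix(
  c(1, 1, 0,
    1, 1, 1,
    0, 0, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("alice", "bob", "carol"), c("a.c", "b.c", "c.c"))
)

# Author-author projection: entry [x, y] counts the files both x and y changed.
author_projection <- incidence %*% t(incidence)

# The diagonal only counts each author's own files, so it is dropped.
diag(author_projection) <- 0

print(author_projection)
```

The resulting weights (number of shared files) play the same role as the edge weights kept when `multiplicity = TRUE` is passed to `igraph::bipartite_projection()`. Note that the projection discards which files were shared, which is part of why it affects the interpretability of the results.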
Another choice you can make is whether the analysis should be done on files or on entities (e.g. functions). See parse_gitlog_entity_network() and parse_gitlog_entity_temporal_network() for entities. You may choose functions, classes, and other more specific types of entities depending on the language of interest (e.g. typedef structs for C).
A third choice we make here is whether the collaboration being analyzed is done by authors or committers. Normally an open source project has both. In the code block below, we analyze authors. If you are interested in committers, or potentially their interaction, see the available parameters of parse_gitlog().
For some social smells, such as radio silence and prima donna, community detection must be applied to the constructed networks. Do consider the implications of the algorithm chosen below for your results.
# Define all timestamp in number of days since the very first commit of the repo
# Note here the start_date and end_date are in respect to the git log.

# Transform commit hashes into datetime so window_size can be used
start_date <- get_date_from_commit_hash(project_git,start_commit)
end_date <- get_date_from_commit_hash(project_git,end_commit)
datetimes <- project_git$author_datetimetz
mbox_datetimes <- project_mbox$reply_datetimetz

# Format time window for posixT
window_size_f <- stringi::stri_c(window_size," day")

# Note if end_date is not (and will likely not be) a multiple of window_size,
# then the ending incomplete window is discarded so the metrics are not calculated
# in a smaller interval
time_window <- seq.POSIXt(from=start_date,to=end_date,by=window_size_f)

# Create a list where each element is the social smells calculated for a given commit hash
smells <- list()
size_time_window <- length(time_window)
for(j in 2:size_time_window){

  # Initialize
  commit_interval <- NA
  start_day <- NA
  end_day <- NA
  org_silo <- NA
  missing_links <- NA
  radio_silence <- NA
  primma_donna <- NA
  st_congruence <- NA
  communicability <- NA
  num_tz <- NA
  code_only_devs <- NA
  code_files <- NA
  ml_only_devs <- NA
  ml_threads <- NA
  code_ml_both_devs <- NA

  i <- j - 1

  # If the time window is of size 1, then there has been less than "window_size_f"
  # days from the start date.
  if(length(time_window) == 1){
    # Below 3 month size
    start_day <- start_date
    end_day <- end_date
  }else{
    start_day <- time_window[i]
    end_day <- time_window[j]
  }

  # Obtain all commits from the gitlog which are within a particular window_size
  project_git_slice <- project_git[(author_datetimetz >= start_day) &
                                     (author_datetimetz < end_day)]

  # Obtain all email posts from the mbox which are within a particular window_size
  project_mbox_slice <- project_mbox[(reply_datetimetz >= start_day) &
                                       (reply_datetimetz < end_day)]

  # Check if slices contain data
  gitlog_exist <- (nrow(project_git_slice) != 0)
  ml_exist <- (nrow(project_mbox_slice) != 0)

  # Create Networks
  if(gitlog_exist){
    i_commit_hash <- first(project_git_slice)$commit_hash
    j_commit_hash <- last(project_git_slice)$commit_hash

    # Parse networks edgelist from extracted data
    network_git_slice <- transform_gitlog_to_bipartite_network(project_git_slice,
                                                               mode="author-file")

    # Load networks in igraph
    igraph_network_git_slice <- igraph::graph_from_data_frame(d=network_git_slice[["edgelist"]],
                                                              directed = TRUE,
                                                              vertices = network_git_slice[["nodes"]])

    # Community Smells functions are defined base of the projection networks of
    # dev-thread => dev-dev, and dev-file => dev-dev.
    # This creates both dev-dev via graph projections
    git_network_authors <- igraph::bipartite_projection(igraph_network_git_slice,
                                                        multiplicity = TRUE,
                                                        which = TRUE) # FALSE is the thread projection

    code.clusters <- walktrap.community(git_network_authors,
                                        weights=E(git_network_authors)$weight)
  }

  if(ml_exist){
    network_mbox_slice <- transform_reply_to_bipartite_network(project_mbox_slice)

    igraph_network_mbox_slice <- igraph::graph_from_data_frame(d=network_mbox_slice[["edgelist"]],
                                                               directed = TRUE,
                                                               vertices = network_mbox_slice[["nodes"]])

    mbox_network_authors <- igraph::bipartite_projection(igraph_network_mbox_slice,
                                                         multiplicity = TRUE,
                                                         which = TRUE) # FALSE is the thread projection

    # Community Detection
    mail.clusters <- walktrap.community(mbox_network_authors,
                                        weights=E(mbox_network_authors)$weight)
  }

  # Metrics #

  if(gitlog_exist){
    commit_interval <- stri_c(i_commit_hash,"-",j_commit_hash)
    # Social Network Metrics
    code_only_devs <- length(unique(project_git_slice$identity_id))
    code_files <- length(unique(project_git_slice$file_pathname))
  }

  if(ml_exist){
    # Smell
    radio_silence <- length(community_smell_radio_silence(mbox_network_authors,
                                                          mail.clusters))
    # Social Technical Metrics
    ml_only_devs <- length(unique(project_mbox_slice$identity_id))
    ml_threads <- length(unique(project_mbox_slice$reply_subject))
  }

  if (ml_exist & gitlog_exist){
    # Smells
    org_silo <- length(community_smell_organizational_silo(mbox_network_authors,
                                                           git_network_authors))
    missing_links <- length(community_smell_missing_links(mbox_network_authors,
                                                          git_network_authors))
    primma_donna <- length(community_smell_primadonnas(mbox_network_authors,
                                                       mail.clusters,
                                                       git_network_authors))
    # Social Technical Metrics
    st_congruence <- community_metric_sociotechnical_congruence(mbox_network_authors,
                                                                git_network_authors)
    communicability <- community_metric_mean_communicability(mbox_network_authors,
                                                             git_network_authors)
    num_tz <- length(unique(c(project_git_slice$author_tz,
                              project_git_slice$committer_tz,
                              project_mbox_slice$reply_tz)))
    code_ml_both_devs <- length(intersect(unique(project_git_slice$identity_id),
                                          unique(project_mbox_slice$identity_id)))
  }

  # Aggregate Metrics
  smells[[commit_interval]] <- data.table(commit_interval,
                                          start_datetime = start_day,
                                          end_datetime = end_day,
                                          org_silo,
                                          missing_links,
                                          radio_silence,
                                          primma_donna,
                                          st_congruence,
                                          communicability,
                                          num_tz,
                                          code_only_devs,
                                          code_files,
                                          ml_only_devs,
                                          ml_threads,
                                          code_ml_both_devs)
}
smells_interval <- rbindlist(smells)
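To give some intuition for how the collaboration and communication networks are compared, the sketch below works through a simplified reading of two of the measures on toy data in base R. It assumes that a missing link is a pair of developers who collaborate without communicating, and that socio-technical congruence is the share of collaborating pairs who also communicate. These are assumptions made for illustration; Kaiaulu's `community_smell_missing_links()` and `community_metric_sociotechnical_congruence()` may define and compute them differently.

```r
# Encode an undirected developer pair as "a|b" with the two names sorted,
# so that (bob, alice) and (alice, bob) map to the same key.
pair_key <- function(from, to) {
  ifelse(from < to, paste(from, to, sep = "|"), paste(to, from, sep = "|"))
}

# Toy developer-developer edges from the two projected networks.
collaboration <- pair_key(c("alice", "bob", "alice"), c("bob", "carol", "carol"))
communication <- pair_key(c("bob", "dave"), c("alice", "alice"))

# Assumed reading: collaboration pairs with no matching communication pair.
missing_links_sketch <- setdiff(collaboration, communication)

# Assumed reading: share of collaboration pairs that also communicate.
congruence_sketch <- length(intersect(collaboration, communication)) /
  length(collaboration)

print(missing_links_sketch)
print(congruence_sketch)
```

Under this reading, both measures depend entirely on developers being matched across the two networks, which is why identity matching has such a large effect on the results.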
The remainder of this notebook does not compute any social smells. It provides some popular metrics commonly reported in software engineering literature, which may be useful to you when interpreting the social smells. Although these metrics are not inherently defined at the “time window” level, we compute them per window so that they can be placed in the same table as the social smells.
churn <- list()
for(j in 2:length(time_window)){
  i <- j - 1

  # If the time window is of size 1, then there has been less than "window_size_f"
  # days from the start date.
  if(length(time_window) == 1){
    # Below 3 month size
    start_day <- start_date
    end_day <- end_date
  }else{
    start_day <- time_window[i]
    end_day <- time_window[j]
  }

  # Obtain all commits from the gitlog which are within a particular window_size
  project_git_slice <- project_git[(author_datetimetz >= start_day) &
                                     (author_datetimetz < end_day)]

  gitlog_exist <- (nrow(project_git_slice) != 0)
  if(gitlog_exist){
    # The start and end commit
    i_commit_hash <- first(project_git_slice)$commit_hash
    j_commit_hash <- last(project_git_slice)$commit_hash

    # The start and end datetime
    #start_datetime <- first(project_git_slice)$author_datetimetz
    #end_datetime <- last(project_git_slice)$author_datetimetz

    commit_interval <- stri_c(i_commit_hash,"-",j_commit_hash)
    churn[[commit_interval]] <- data.table(commit_interval,
                                           # start_datetime,
                                           # end_datetime,
                                           churn=metric_churn_per_commit_interval(project_git_slice),
                                           n_commits = length(unique(project_git_slice$commit_hash)))
  }
}
churn_interval <- rbindlist(churn)
time_window <- seq.POSIXt(from=start_date,to=end_date,by=window_size_f)
#time_window <- seq(from=start_daydiff,to=end_daydiff,by=window_size)
line_metrics <- list()
for(j in 2:length(time_window)){
  i <- j - 1

  # If the time window is of size 1, then there has been less than "window_size_f"
  # days from the start date.
  if(length(time_window) == 1){
    # Below 3 month size
    start_day <- start_date
    end_day <- end_date
  }else{
    start_day <- time_window[i]
    end_day <- time_window[j]
  }

  # Obtain all commits from the gitlog which are within a particular window_size
  project_git_slice <- project_git[(author_datetimetz >= start_day) &
                                     (author_datetimetz < end_day)]

  gitlog_exist <- (nrow(project_git_slice) != 0)
  if(gitlog_exist){
    i_commit_hash <- first(project_git_slice)$commit_hash
    # Use the ending hash of that window_size to calculate the flaws
    j_commit_hash <- last(project_git_slice)$commit_hash

    # Checkout to commit of interest
    git_checkout(j_commit_hash, git_repo_path)

    # Run line metrics against the checkedout commit
    commit_interval <- stri_c(i_commit_hash,"-",j_commit_hash)
    line_metrics[[commit_interval]] <- parse_line_metrics(scc_path,git_repo_path)
    line_metrics[[commit_interval]]$commit_interval <- commit_interval
    line_metrics[[commit_interval]]$git_checkout <- j_commit_hash
    line_metrics[[commit_interval]] <- line_metrics[[commit_interval]][,.(commit_interval,
                                                                          git_checkout,
                                                                          Location,
                                                                          Lines,
                                                                          Code,
                                                                          Comments,
                                                                          Blanks,
                                                                          Complexity)]

    # Filter Files
    line_metrics[[commit_interval]] <- line_metrics[[commit_interval]] %>%
      filter_by_file_extension(file_extensions,"Location") %>%
      filter_by_filepath_substring(substring_filepath,"Location")
  }
}
# Reset Repo to HEAD
git_checkout("trunk",git_repo_path)
## [1] "Your branch is up to date with 'origin/trunk'."
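The merge that follows uses a table named `line_metrics_interval`, which is not defined in the code shown here. Judging by the result table, which has one row per commit interval with totals for Lines, Code, Comments, Blanks and Complexity, it is presumably the per-file `line_metrics` tables combined and summed per interval. The sketch below shows one way to do that in base R on toy data frames; the values are made up, and summing is an assumption rather than the notebook's confirmed aggregation.

```r
# Toy stand-in for the per-window, per-file tables stored in line_metrics.
line_metrics <- list(
  "aaa-bbb" = data.frame(commit_interval = "aaa-bbb", git_checkout = "bbb",
                         Location = c("a.c", "b.c"),
                         Lines = c(100, 50), Code = c(70, 30),
                         Comments = c(20, 10), Blanks = c(10, 10),
                         Complexity = c(5, 3)),
  "ccc-ddd" = data.frame(commit_interval = "ccc-ddd", git_checkout = "ddd",
                         Location = c("a.c"),
                         Lines = c(120), Code = c(80),
                         Comments = c(25), Blanks = c(15),
                         Complexity = c(6))
)

# Stack the per-window tables, then sum the per-file metrics per interval
# (assumed aggregation).
all_files <- do.call(rbind, line_metrics)
line_metrics_interval <- aggregate(
  cbind(Lines, Code, Comments, Blanks, Complexity) ~ commit_interval + git_checkout,
  data = all_files,
  FUN = sum
)

print(line_metrics_interval)
```

In the notebook, where the list elements are data.table objects, the same idea could be expressed with `rbindlist()` followed by a grouped sum.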
dt <- merge(smells_interval,churn_interval,by="commit_interval")
dt <- merge(dt,line_metrics_interval,by="commit_interval")
kable(dt)
commit_interval | start_datetime | end_datetime | org_silo | missing_links | radio_silence | primma_donna | st_congruence | communicability | num_tz | code_only_devs | code_files | ml_only_devs | ml_threads | code_ml_both_devs | churn | n_commits | git_checkout | Lines | Code | Comments | Blanks | Complexity |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
071f1167b3c7a5fc6e04ae77e9217618a89a16b5-3fb69d4d8e9fba586a360084d3b63dfa508e8f67 | 2013-11-26 09:28:22 | 2014-02-24 09:28:22 | 1 | 1 | 10 | 0 | 0.5000000 | 0.7857143 | 2 | 7 | 31 | 22 | 66 | 3 | 696 | 23 | 3fb69d4d8e9fba586a360084d3b63dfa508e8f67 | 110776 | 71910 | 25238 | 13628 | 13765 |
140dfcaccaf9b4e9c3b608d1811dd38b5f349c50-01d209fb32b938780157e92444b9f3f7a6110f6e | 2014-08-23 09:28:22 | 2014-11-21 09:28:22 | 0 | 0 | 8 | 0 | 1.0000000 | 1.0000000 | 2 | 3 | 3 | 26 | 45 | 3 | 23 | 3 | 01d209fb32b938780157e92444b9f3f7a6110f6e | 111298 | 72236 | 25371 | 13691 | 13858 |
2166cc3bcc084ea1be783f5088f44c9d89152861-754fa5bf9673699703b84c58dddabb30cb987ac1 | 2015-08-18 09:28:22 | 2015-11-16 09:28:22 | 0 | 0 | 18 | 0 | 1.0000000 | 1.0000000 | 2 | 4 | 7 | 33 | 47 | 3 | 178 | 4 | 754fa5bf9673699703b84c58dddabb30cb987ac1 | 112442 | 73154 | 25499 | 13789 | 14072 |
259cf1d392eb5e8ec0dddd94e85e7b8744bf3acb-72d7d0922949f47d58574c3d638d9d8c30a08e3f | 2016-02-14 09:28:22 | 2016-05-14 09:28:22 | 0 | 0 | 10 | 0 | 1.0000000 | 1.0000000 | 2 | 4 | 22 | 15 | 29 | 3 | 1077 | 17 | 72d7d0922949f47d58574c3d638d9d8c30a08e3f | 113689 | 73941 | 25842 | 13906 | 14264 |
2f24e818be9b96363d4c9ba1ccf516ce1834d196-5b8da7ceb8f269d945c24c4dc24f128e81d8cdb5 | 2015-02-19 09:28:22 | 2015-05-20 09:28:22 | 1 | 1 | 12 | 0 | 0.6666667 | 0.9047619 | 2 | 7 | 44 | 21 | 55 | 3 | 2521 | 41 | 5b8da7ceb8f269d945c24c4dc24f128e81d8cdb5 | 112340 | 73094 | 25464 | 13782 | 14063 |
353757332ad63f43ebc8907a5586b9053ad08566-3c818c6d7351f0130282d212a69035642f5fecad | 2014-11-21 09:28:22 | 2015-02-19 09:28:22 | 0 | 0 | 15 | 0 | 1.0000000 | 1.0000000 | 2 | 3 | 3 | 15 | 29 | 0 | 6 | 3 | 3c818c6d7351f0130282d212a69035642f5fecad | 111302 | 72240 | 25371 | 13691 | 13858 |
3755fef1269aa89dc7a449b5cb1c6f9d514ff38f-a5b7d0c1c059bf2031891b1baafde76f49018724 | 2013-05-30 09:28:22 | 2013-08-28 09:28:22 | 0 | 0 | 9 | 0 | 1.0000000 | 1.0000000 | 2 | 4 | 23 | 26 | 55 | 3 | 1726 | 9 | a5b7d0c1c059bf2031891b1baafde76f49018724 | 109518 | 71074 | 24910 | 13534 | 13612 |
50b94fbab9e432cff6b1fa5a5a8c7e4c4aaf472f-fd9c033af5c5c987bf3369caa55cb1118fb5c6a4 | 2014-05-25 09:28:22 | 2014-08-23 09:28:22 | 0 | 1 | 12 | 0 | 0.6666667 | 0.8888889 | 2 | 6 | 8 | 22 | 58 | 4 | 428 | 23 | fd9c033af5c5c987bf3369caa55cb1118fb5c6a4 | 111167 | 72150 | 25341 | 13676 | 13851 |
71b15391252d93b5277aa11270e34969e8967f79-8c59195da6c09739e5c6af6757e75dd5734b59f6 | 2014-02-24 09:28:22 | 2014-05-25 09:28:22 | 0 | 0 | 11 | 0 | 1.0000000 | 1.0000000 | 2 | 5 | 19 | 25 | 58 | 4 | 511 | 15 | 8c59195da6c09739e5c6af6757e75dd5734b59f6 | 111112 | 72118 | 25328 | 13666 | 13837 |
738e0f14c569d300bc65fc1fae91fa8ab67e90f2-2088b6d4bbee14ad2f579b87a45500024f6c4bfc | 2015-05-20 09:28:22 | 2015-08-18 09:28:22 | 0 | 0 | 8 | 0 | 1.0000000 | 1.0000000 | 2 | 4 | 5 | 18 | 34 | 1 | 42 | 5 | 2088b6d4bbee14ad2f579b87a45500024f6c4bfc | 112439 | 73154 | 25496 | 13789 | 14071 |
c990a3175e240d407638c7917497741a238727ee-2ad4de130dcce64e3bf81031c60c71c9e600cc50 | 2015-11-16 09:28:22 | 2016-02-14 09:28:22 | 1 | 1 | 11 | 0 | 0.0000000 | 0.3333333 | 2 | 3 | 5 | 30 | 54 | 2 | 797 | 6 | 2ad4de130dcce64e3bf81031c60c71c9e600cc50 | 113180 | 73560 | 25766 | 13854 | 14175 |
ef4b61b5ff72efb6cd8a4f4215bc67d18a3dd923-7c141cd06f48f447b101c39afc16cef81916c03a | 2013-08-28 09:28:22 | 2013-11-26 09:28:22 | 0 | 0 | 12 | 0 | 1.0000000 | 1.0000000 | 2 | 6 | 32 | 35 | 96 | 5 | 3079 | 28 | 7c141cd06f48f447b101c39afc16cef81916c03a | 110252 | 71671 | 24985 | 13596 | 13724 |
fwrite(dt,"~/Desktop/results_kaiaulu_tomcat.csv")