1 Introduction
2 Background
2.1 Code Repository Analysis
• Version Control Systems (VCS). A version control system (VCS) is a tool that organizes the source code of software systems. VCSs are used to store and build all the different versions of the source code (Ball et al., 1997). In general, a VCS manages the development of an evolving object (Zolkifli et al., 2018), recording every change made by software developers. While building software, developers change portions of the source code, the artifacts, and the structure of the software. As the software grows large and complex, this process becomes difficult to organize and document. A VCS therefore allows developers to manage and control the development, maintenance and evolution of a software system (Costa and Murta, 2013).
• Software repositories. Systems that store project data, e.g. issue control systems and version control systems, are known as software repositories (Falessi and Reichel, 2015). Software repositories are virtual spaces where development teams generate collaborative artifacts from the activities of a development process (Arora and Garg, 2018; Güemes-Peña et al., 2018; Ozbas-Caglayan and Dogru, 2013). They contain large amounts of historical software data that can include valuable information on the source code, defects, and other issues such as new features (De Farias et al., 2016). Moreover, many types of data can be extracted from repositories and studied, and changes can be made according to need (Siddiqui and Ahmad, 2018). Driven by open source, the number of these repositories and their uses has increased rapidly over the last decade (Amann et al., 2015; Costa and Murta, 2013; Wijesiriwardana and Wimalaratne, 2018). Such repositories are used to discover useful knowledge about the development, maintenance and evolution of software (Chaturvedi et al., 2013; Farias et al., 2015). It is therefore important to distinguish the kinds of software repositories. Hassan (2008) describes several examples, including historical repositories, run-time repositories and code repositories. Our research focuses mainly on code repositories.
• Code repositories. Code repositories are maintained by collecting source code from a large number of heterogeneous projects (Siddiqui and Ahmad, 2018). Code repositories such as SourceForge, GitHub, GitLab, Bitbucket and Google Code contain a lot of information (Güemes-Peña et al., 2018). These platforms offer services that go beyond simple hosting and version control of the software (Joy et al., 2018). As a result, source code repositories have attracted considerable interest from researchers (Lee et al., 2013).
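As a rough illustration of how a VCS records every change, the commit history of a Git repository can be read programmatically. The sketch below is not taken from any of the cited studies; it assumes the `git` command-line tool is installed, and the helper names `parse_log` and `read_commits`, as well as the tab-separated layout, are choices made here for illustration.

```python
import subprocess
from typing import Dict, List

# Tab-separated fields: commit hash, author name, strict ISO 8601 date, subject.
LOG_FORMAT = "%H%x09%an%x09%aI%x09%s"


def parse_log(text: str) -> List[Dict[str, str]]:
    """Turn the output of `git log --pretty=format:<LOG_FORMAT>` into records."""
    commits = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # Split at most three times so tabs inside the subject are preserved.
        sha, author, date, subject = line.split("\t", 3)
        commits.append(
            {"sha": sha, "author": author, "date": date, "subject": subject}
        )
    return commits


def read_commits(repo_path: str) -> List[Dict[str, str]]:
    """Run `git log` in repo_path (assumes git is installed) and parse it."""
    result = subprocess.run(
        ["git", "-C", repo_path, "log", f"--pretty=format:{LOG_FORMAT}"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_log(result.stdout)
```

Calling `read_commits(".")` inside a working copy would return one record per commit, newest first, which is the kind of commit data that many repository analyses take as input.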
2.2 Related Work
Table 1
Paper | Type of study | Objective | Extracted info | Purpose |
De Farias et al. (2016), Siddiqui and Ahmad (2018), Costa and Murta (2013) | SMS | Understand the defects, analyse the contribution and behaviour of developers, and understand the evolution of software | Software repositories, MSR | Software evolution |
Zolkifli et al. (2018), Kagdi et al. (2007), Demeyer et al. (2013) | Systematic literature review, survey of the literature, text mining | Understand approaches to VCS, taxonomy of software repositories, analysis of obsolete or emerging technologies and current research methods | MSR, VCS, Conference MSR | Software development process, understanding the evolution of software, the change and evolution of software |
Amann et al. (2015), Güemes-Peña et al. (2018) | Systematic literature review | Identification of impact change, maintainability, software quality, developer effort and bug prediction | Conference MSR | Data mining, machine learning, software process |
Borges and Tulio Valente (2018), Cosentino et al. (2017), Kalliamvakou et al. (2016) | SMS | Coding and project communities, characteristics of software repositories | GitHub | Analysis of software repositories |
3 Research Methodology
3.1 Definition Phase
3.1.1 Research Questions
Table 2
Research questions | Motivation |
RQ1: What kind of information is taken as input for the analysis of code repositories? | To know what kind of information in code repositories is analysed, and its respective characteristics or approaches. |
RQ2: What techniques or methods are used for analysing code repositories? | To determine the main techniques and methods used to obtain information from code repositories. |
RQ3: What information is extracted (directly) or derived (indirectly) as a result of the analysis of code repositories? | To analyse what information is extracted or derived through the analysis of code repositories. |
RQ4: What kind of research has proliferated in this field? | To establish the type of research that is most frequent in this area, e.g. solution proposal, applied research, research evaluation, etc., in order to know the maturity of the area and identify gaps. |
RQ5: Are both academia and industry interested in this field? | To analyse the degree of interest of industry in this field through its participation in research work. |
3.1.2 Search Process
Table 3
Main terms | AND expression division | Alternative terms |
Code repository | Conceptual synonyms | Software repository, version control systems |
Code repository | Technological synonyms | Git, SVN |
Analysis | Synonyms | Mining, inspection, exploring |
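One way to read Table 3 is that the alternative terms for a main term are joined with OR and the two main-term groups are then joined with AND. The exact search string is not reproduced in this section, so the following is only a sketch under that assumption; the function name `build_search_string` and the quoting style are illustrative and would need adapting to the syntax of each digital library.

```python
from typing import Dict, List

# Main terms and alternative terms as listed in Table 3.
TERMS: Dict[str, List[str]] = {
    "code repository": [
        "software repository",
        "version control systems",
        "git",
        "svn",
    ],
    "analysis": ["mining", "inspection", "exploring"],
}


def build_search_string(terms: Dict[str, List[str]]) -> str:
    """OR the synonyms within each group, then AND the groups together."""
    groups = []
    for main_term, alternatives in terms.items():
        quoted = [f'"{term}"' for term in [main_term, *alternatives]]
        groups.append("(" + " OR ".join(quoted) + ")")
    return " AND ".join(groups)


if __name__ == "__main__":
    print(build_search_string(TERMS))
```

Keeping the terms in a dictionary makes it straightforward to add or drop a synonym and regenerate the string for each database.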
3.1.3 Selection of Primary Studies Procedure
Table 4
Id | Criteria |
IC1 | Peer-reviewed paper; this rules out, for example, proceeding chapters, book chapters, keynote abstracts, calls for papers and irrelevant publications. |
IC2 | The study employs some kind of techniques or methods to extract information through the analysis of code repositories. |
IC3 | The study provides some idea or type of application that might be applied for the analysis of code repositories. |
IC4 | The paper was published between 1 January 2012 and 31 August 2019. |
EC1 | The paper is a duplicate. |
EC2 | The paper is not written in English. |
EC3 | The paper is a preliminary investigation that is extended or dealt with in more depth in a more recent paper by the same authors that has already been included. |
EC4 | The focus of the article is not within the computer science area. |
Step 1. Remove duplicate papers.
Step 2. Apply the exclusion/inclusion criteria to the studies obtained with the search string, judging from the title, keywords and abstract whether each article contains information related to our research topic. We included studies that met at least one of the inclusion criteria (see Table 4). In case of doubt, we included the document for further analysis in Step 3.
Step 3. To filter more exhaustively and decide which studies should be excluded or selected, we read each study in full and applied the exclusion/inclusion criteria again. The first author was responsible for selecting the studies; selection issues were resolved by agreement among all authors after analysing the full text. This yielded the primary studies used in our analysis to answer the research questions.
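Steps 1 and 2 can be sketched as a simple filter over candidate records. This is only an illustration and not the tooling used in the study: the `Candidate` fields, the duplicate check on a normalised title, and the decision to treat the IC4 date window as a hard filter while IC2/IC3 are reviewer judgements are all assumptions made here.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Set

WINDOW_START = date(2012, 1, 1)   # IC4 lower bound
WINDOW_END = date(2019, 8, 31)    # IC4 upper bound


@dataclass
class Candidate:
    title: str
    language: str                 # hypothetical field, e.g. "en"
    published: date
    # Criteria a reviewer judged as met from title/keywords/abstract.
    met: Set[str] = field(default_factory=set)
    doubtful: bool = False        # reviewer unsure; defer to full-text reading


def screen(candidates: List[Candidate]) -> List[Candidate]:
    """Return the candidates that pass Steps 1-2 and go on to Step 3."""
    seen_titles = set()
    selected = []
    for paper in candidates:
        key = " ".join(paper.title.lower().split())
        if key in seen_titles:                      # Step 1 / EC1: duplicate
            continue
        seen_titles.add(key)
        if paper.language != "en":                  # EC2: non-English
            continue
        if not WINDOW_START <= paper.published <= WINDOW_END:   # IC4
            continue
        # Step 2: keep if at least one criterion is met, or in case of doubt.
        if paper.met & {"IC2", "IC3"} or paper.doubtful:
            selected.append(paper)
    return selected
```

Step 3 (full-text reading and agreement among the authors) is a manual activity and is deliberately left out of the sketch.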
3.1.4 Quality Assessment
Table 5
Nr. | Assessment questions | Criteria |
AQ1 | Does the study have a systematic method for obtaining baseline information for code repository analysis? | Defined methods |
AQ2 | Does the study present a result of code repository information analysis? | Data analysis |
AQ3 | Does the study present an artifact (technique, tool, or method) for processing information from code repositories? | Study presentation results |
AQ4 | Does the research show a solution to the problems of software quality, development and evolution? | Study focus |
AQ5 | Does the research provide an artifact (technique, tool or method) that can be applied in industrial environments? | Application |
AQ6 | Do other authors cite the selected study? | Utility |
AQ7 | Is the journal or conference that publishes the study important or relevant? | Relevance |
3.1.5 Procedure for Data Extraction and Taxonomy
Table 6
RQs | Categories |
RQ1 | Project Features Info; Defects; Comments; Branches; Source Code; Informal Information; Committers; Commit Data; Logs; Graphs/News Feed; Issue; Pulls/Pull Request; Level of Interest; Repository Info |
RQ2 | Ad Hoc Algorithms; Data Mining; Automatic; Artificial Intelligence/Machine Learning; Qualitative Analyses; Heuristic Techniques; Empirical/Experimental; Statistical Analyses; Prediction; Reverse Engineering; Testing-Based Techniques |
RQ3 | Automatic Processing; Branching Analysis; Changes Analysis; Commits/Committers Classification; Cloning Detection; Code Review; Commit Analysis; Defect/Issues Analysis; Developer Behaviour; Design Modelling; Maintainability Information; Metrics/Quality; Source Code Improvements; Testing Data |
RQ4 | Evaluation Research; Proposal of Solution; Validation Research; Philosophical Papers; Opinion Papers; Personal Experience Papers |
RQ5 | Industry; Academia; Freelance |
3.1.6 Summary Methods
• We generated a taxonomy of the selected studies according to each research question (see Table 6).
• We summarized the total number of articles per country and per year (see Fig. 3).
• We prepared a matrix with one row per primary study, containing information on the research questions, the proposed taxonomy and the quality assessment.
• To summarize the results of the SMS, we generated a bubble chart in which the research questions intersect, showing the number of selected primary studies at each intersection.
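The bubble sizes for such a chart can be derived from the study matrix by counting how many studies fall at each intersection of two research questions. The sketch below assumes a matrix keyed by study identifier; the identifiers `S1` to `S3` and their category assignments are hypothetical and do not correspond to real primary studies.

```python
from collections import Counter
from itertools import product
from typing import Dict, List

# Hypothetical rows of the study matrix: study id -> categories per RQ.
MATRIX: Dict[str, Dict[str, List[str]]] = {
    "S1": {"RQ1": ["Commit Data"], "RQ3": ["Developer Behaviour"]},
    "S2": {"RQ1": ["Commit Data", "Source Code"], "RQ3": ["Changes Analysis"]},
    "S3": {"RQ1": ["Source Code"], "RQ3": ["Changes Analysis"]},
}


def bubble_sizes(matrix: Dict[str, Dict[str, List[str]]],
                 x_rq: str, y_rq: str) -> Counter:
    """Count studies at each (x category, y category) intersection."""
    sizes: Counter = Counter()
    for categories in matrix.values():
        # A study with several categories contributes to several bubbles.
        for pair in product(categories.get(x_rq, []), categories.get(y_rq, [])):
            sizes[pair] += 1
    return sizes
```

For example, `bubble_sizes(MATRIX, "RQ1", "RQ3")` gives one count per (input, output) pair, which can then be passed to any plotting library as the bubble radius.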
3.2 Execution Phase
4 Results of the Systematic Mapping Study
4.1 RQ1. What Kind of Information is Taken as Input for the Analysis of Code Repositories?
Table 7
Input | Papers | # studies | % |
Commit data | 62, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 185, 189, 203, 223, 225, 226, 227, 229, 230, 233, 235 | 115 | 34 |
Source code | 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 167, 176, 203, 224, 226, 228, 229, 230, 235, 236 | 90 | 26 |
Repository info | 61, 179, 180, 181, 182, 190, 191, 192, 193, 194, 195, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 232, 234 | 23 | 7 |
Issue | 84, 85, 166, 167, 173, 174, 175, 178, 195, 196, 197, 198, 199, 200, 201, 202, 225, 229, 234 | 19 | 6 |
Comments | 63, 64, 65, 66, 87, 88, 89, 90, 91, 157, 158, 159, 185, 198, 199, 200 | 16 | 5 |
Branches | 1, 2, 3, 147, 148, 149, 165, 177, 185, 186, 187, 188, 189 | 13 | 4 |
Defects | 4, 5, 6, 80, 81, 82, 150, 151, 152, 153, 154, 155, 233 | 13 | 4 |
Pulls/pull request | 146, 172, 173, 174, 175, 205, 206, 207, 208, 209, 210, 223, 225 | 13 | 4 |
Committers | 83, 102, 153, 154, 155, 160, 161, 162, 163, 164, 165, 233 | 12 | 4 |
Informal information | 86, 156, 168, 169, 170, 171, 203, 204 | 8 | 2 |
Level of interest | 177, 188, 189, 211, 212, 223 | 6 | 2 |
Project features info | 61, 62, 86, 100, 101, 189 | 6 | 2 |
Logs | 6, 144, 145, 183, 184 | 5 | 1 |
Graphs/news feed | 201, 202 | 2 | 1 |
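The "%" column of Table 7 appears to be each category's share of all category assignments rather than of all primary studies, since one study can use several kinds of input: the "# studies" column sums to 341, and 115/341 rounds to 34 %. The snippet below is a sketch that recomputes the column under that assumption; rounding to whole percentages is also an assumption.

```python
# "# studies" column of Table 7, keyed by input category.
COUNTS = {
    "Commit data": 115,
    "Source code": 90,
    "Repository info": 23,
    "Issue": 19,
    "Comments": 16,
    "Branches": 13,
    "Defects": 13,
    "Pulls/pull request": 13,
    "Committers": 12,
    "Informal information": 8,
    "Level of interest": 6,
    "Project features info": 6,
    "Logs": 5,
    "Graphs/news feed": 2,
}


def shares(counts):
    """Percentage of all category assignments that falls in each category."""
    total = sum(counts.values())
    return {name: round(100 * n / total) for name, n in counts.items()}


if __name__ == "__main__":
    for name, pct in shares(COUNTS).items():
        print(f"{name}: {pct}%")
```

Because the shares are rounded independently, they need not add up to exactly 100.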
4.2 RQ2. What Techniques or Methods are Used for Analysing Source Code Repositories?
Table 8
Methods-techniques | Papers | # studies | % |
Empirical/experimental | 2, 5, 8, 16, 17, 18, 19, 22, 23, 24, 25, 26, 27, 28, 30, 31, 32, 34, 35, 36, 42, 43, 44, 45, 46, 48, 52, 56, 57, 59, 60, 62, 63, 64, 65, 66, 68, 74, 77, 83, 88, 91, 97, 100, 101, 102, 107, 111, 113, 114, 118, 120, 121, 128, 138, 139, 143, 147, 148, 151, 154, 161, 163, 165, 169, 177, 179, 184, 185, 186, 189, 192, 193, 194, 196, 197, 202, 206, 209, 211, 212, 213, 215, 216, 220, 222, 223, 225, 226, 227, 229, 233, 235 | 93 | 39 |
Automatic | 14, 51, 54, 55, 67, 71, 75, 76, 78, 81, 82, 85, 90, 96, 99, 105, 109, 116, 131, 153, 159, 168, 171, 174, 176, 181, 208, 219, 230 | 29 | 12 |
Artificial intelligence/machine learning | 1, 3, 4, 33, 49, 50, 53, 69, 70, 80, 87, 106, 108, 122, 126, 127, 130, 142, 150, 157, 158, 166, 175, 178, 182, 203, 205, 207, 214 | 29 | 12 |
Statistical analyses | 6, 9, 12, 21, 37, 38, 39, 58, 86, 89, 94, 98, 112, 136, 167, 199, 201, 204, 210, 217, 228, 232, 234, 236 | 24 | 10 |
Ad hoc algorithms | 20, 29, 41, 47, 61, 63, 72, 92, 103, 104, 115, 129, 137, 140, 146, 149, 162, 166, 173, 183, 187, 188, 195 | 23 | 10 |
Data mining | 7, 15, 53, 95, 110, 132, 141, 145, 152, 157, 164, 200, 217, 221, 231 | 15 | 6 |
Qualitative analyses | 13, 125, 127, 144, 158, 160, 191, 218, 224 | 9 | 4 |
Prediction | 10, 53, 134, 157, 190, 198, 203, 214 | 8 | 3 |
Reverse engineering | 11, 73, 84, 133, 155, 180 | 6 | 3 |
Heuristic techniques | 93, 119, 123, 124, 156, 172 | 6 | 3 |
Testing-based techniques | 40, 79, 117, 135 | 4 | 2 |
Table 9
Artificial intelligence/machine learning | Papers | # studies | % |
Random Forest Classifier | 49, 50, 157, 175 | 4 | 14 |
Natural Language Processing (NLP) | 126, 130, 178 | 3 | 10 |
Bayesian classifier | 3, 205 | 2 | 7 |
Search-based genetic algorithm | 33, 207 | 2 | 7 |
Latent Dirichlet Allocation (LDA) | 87, 106 | 2 | 7 |
Naive Bayes-based approach | 122, 166 | 2 | 7 |
Artificial Intelligence | 4, 70 | 2 | 7 |
Statistical learner | 203, 214 | 2 | 7 |
Sentiment analysis tools | 69 | 1 | 3 |
Deep model structure (convolutional neural network) | 158 | 1 | 3 |
Rule-based technique | 1 | 1 | 3 |
Semantics-based methodology | 150 | 1 | 3 |
SGDClassifier | 142 | 1 | 3 |
Machine learning techniques | 127 | 1 | 3 |
Hoeffding tree classification method | 80 | 1 | 3 |
Dynamic topic models | 108 | 1 | 3 |
Naive Bayes classifier | 182 | 1 | 3 |
Gradient boosting machine | 157 | 1 | 3 |
4.3 RQ3. What Information is Extracted (Directly) or Derived (Indirectly) as a Result of the Analysis of Source Code Repositories?
Table 10
Output | Papers | # studies | % |
Developer Behaviour | 1, 5, 8, 11, 13, 17, 21, 26, 30, 36, 37, 38, 39, 40, 43, 48, 52, 55, 56, 60, 61, 64, 65, 66, 69, 71, 77, 89, 92, 99, 101, 116, 120, 121, 127, 130, 147, 151, 160, 161, 163, 169, 175, 177, 181, 186, 189, 190, 191, 192, 196, 200, 204, 208, 211, 215, 216, 220, 221, 222, 230, 231, 232, 233, 234 | 65 | 26 |
Changes Analysis | 3, 4, 31, 32, 33, 42, 45, 51, 59, 63, 67, 70, 102, 104, 107, 108, 114, 118, 124, 136, 138, 155, 162, 171, 174, 179, 180, 184, 187, 194, 195, 217, 223, 229, 235 | 35 | 14 |
Metrics/Quality | 23, 29, 58, 74, 78, 79, 80, 84, 90, 91, 105, 115, 128, 131, 145, 153, 164, 166, 168, 170, 185, 198, 199, 228 | 24 | 10 |
Defect/Issue Analysis | 10, 12, 15, 16, 27, 34, 41, 44, 84, 97, 113, 119, 134, 150, 167, 197, 198, 203, 205, 207, 224 | 21 | 9 |
Source Code Improvements | 2, 9, 18, 19, 25, 49, 54, 63, 67, 72, 73, 95, 102, 110, 125, 133, 141, 164, 165, 202, 212 | 21 | 9 |
Commits/Committers Classification | 6, 20, 46, 50, 87, 96, 98, 111, 126, 142, 146, 157, 176, 178, 182, 209, 210 | 17 | 7 |
Cloning Detection | 22, 35, 81, 82, 85, 94, 112, 144, 149, 159, 188, 227, 236 | 13 | 5 |
Maintainability Information | 7, 28, 57, 62, 68, 76, 83, 106, 143, 168, 226 | 11 | 4 |
Design Modelling | 86, 88, 100, 103, 129, 148, 193, 206, 213, 218 | 10 | 4 |
Commit Analysis | 14, 24, 47, 53, 122, 123, 137, 214, 219 | 9 | 4 |
Automatic Processing | 109, 117, 126, 158, 172, 173, 183 | 7 | 3 |
Code Review | 132, 139, 166, 201, 225 | 5 | 2 |
Branching Analysis | 24, 75, 140, 152, 156 | 5 | 2 |
Testing Data | 93, 135, 154 | 3 | 1 |
4.4 RQ4. What Kind of Research Has Proliferated in this Field?
4.5 RQ5. Are Both Academia and Industry Interested in this Field?
4.6 Quality Assessment
Table 11
Study | # Citations | Year | AQ5 |
2 | 143 | 2017 | 5 |
186 | 72 | 2014 | 5 |
25 | 69 | 2014 | 5 |
9 | 66 | 2013 | 5 |
61 | 62 | 2012 | 5 |
187 | 46 | 2014 | 5 |
130 | 43 | 2012 | 5 |
165 | 42 | 2012 | 5 |
89 | 42 | 2014 | 5 |
159 | 40 | 2012 | 5 |
5 Discussion
5.1 Principal Findings
• F1. The research field of code repository analysis is still maturing. Researchers have worked intensively in this discipline over the last 10 years, producing several different proposals supported by some evidence. However, most of these proposals have not been extensively applied in industry. Most of the papers have been published in high-impact conferences and journals, and the proposals cover several objectives. Many of them are innovative and build on previous research.
• F2. The authors consider several methods and techniques for analysing information obtained from code repositories. The collected studies indicate that techniques and methods from other research areas can be applied to the analysis of information extracted from code repositories, and some recurrent patterns emerge. For example, the most frequent techniques are empirical or experimental analyses, which appear across various inputs and outputs. Another insight is the extensive use of artificial intelligence and, more specifically, machine learning to analyse the extracted information, with good results. Finally, some tools and techniques combine automatic processes with other sources to achieve the research goal.
• F3. The selected proposals contribute to the understanding of software quality and evolution. Although several methods, techniques and tools have been found for analysing information extracted from code repositories, their application in industry is minimal and growing slowly, as shown by the studies that demonstrate viability through empirical results. Although some studies are joint efforts between industry and academia, few such initiatives are found in digital libraries.
• F4. In most studies, the output obtained from analysing code repository information is used to investigate the developer, for example by classifying developers or by finding sentiment patterns (e.g. through sentiment analysis) that can be inferred from their coding activity. In summary, this analysis lets developers understand the quality and evolution of the software beyond counting and measuring lines of code, and it focuses on the human factor as a fundamental part of software development.
• F5. Finally, another important output of the analysis of code repository information is the analysis of changes. These studies assess the impact of changes to the software code in order to find error patterns and predict possible failures before they occur. Such changes can greatly influence the quality and maintenance of the software and can have serious repercussions on the costs and the developers of the software project.
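A common starting point for the kind of change analysis mentioned in F5 is to measure how much each file changes over time. The sketch below is not drawn from any of the primary studies; it assumes input in the format printed by `git log --numstat --pretty=format:` (added lines, deleted lines and path, separated by tabs), and the function name `churn_per_file` is illustrative.

```python
from collections import Counter


def churn_per_file(numstat_text: str) -> Counter:
    """Sum added plus deleted lines per path from `git log --numstat` output."""
    churn: Counter = Counter()
    for line in numstat_text.splitlines():
        parts = line.split("\t")
        if len(parts) != 3:
            continue  # blank separator lines or anything unexpected
        added, deleted, path = parts
        if added == "-" or deleted == "-":
            continue  # Git prints "-" for binary files, which have no line counts
        churn[path] += int(added) + int(deleted)
    return churn
```

Files with high churn are only candidates for closer inspection; relating churn to error patterns or failures would require additional data such as issue reports.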