Journal of Software Engineering Research and Development, 2021, 9:1, doi: 10.5753/jserd.2021.548
This work is licensed under a Creative Commons Attribution 4.0 International License.

Mining Experts from Source Code Analysis: An Empirical Evaluation

Johnatan Oliveira [ Federal University of Minas Gerais (UFMG) | johnatan.si@dcc.ufmg.br ]
Markos Viggiato [ University of Alberta | viggiato@ualberta.ca ]
Denis Pinheiro [ Federal University of Minas Gerais (UFMG) | denisppinheiro@gmail.com ]
Eduardo Figueiredo [ Federal University of Minas Gerais (UFMG) | figueiredo@dcc.ufmg.br ]

Abstract
Modern software development increasingly depends on third-party libraries to boost productivity and quality. This kind of development is complex and requires specialists with knowledge of several technologies, such as today's libraries. Such complexity makes it extremely challenging to deliver quality software under time pressure. For this purpose, it is necessary to identify and hire qualified developers to build a good team, both in open-source and proprietary systems. For these reasons, enterprise and open-source projects try to build teams composed of developers highly skilled in specific libraries. However, their identification may not be trivial. Despite this fact, we still lack procedures to assess developers' skills in widely popular libraries. In this paper, we first argue that source code activities can reveal software developers' hard skills, such as library expertise. We then evaluate a mining-based strategy to reduce the search space when identifying library experts. To achieve our goal, we selected the 9 most popular Java libraries and 6 libraries for microservices (i.e., 15 libraries in total). We assessed the skills of more than 1.5 million developers in these libraries by analyzing their commits in more than 17 K Java projects on GitHub. We evaluated the results through two surveys with 158 developers in total. First, with 137 expert candidates in the popular Java libraries, we observed a precision of 63% for the evaluated strategy. Second, with 21 expert candidates in microservices libraries, we observed a precision of at least 71%. These low precision values suggest room for further improvements in the evaluated strategy.

Keywords: Library Experts, Software Skills, Expert Identification, Mining Software Repositories.

1 Introduction

Software development has become increasingly complex, both in open-source and proprietary systems (Damasiotis et al., 2017). Such complexity makes it extremely challenging to deliver quality software on time and may hinder developers' participation in worldwide source code repositories, such as GitHub (Viggiato et al., 2019). Whether to contribute to open-source projects or to hire developers (in the case of a company), identifying the developers with the right skills for a good team is a hard task (Garcia et al., 2007; McCuller, 2012). Besides, in many cases, project managers must build teams of developers skilled in relevant libraries. Moreover, decisions made during the hiring process are a well-known decisive factor for the success of a software project (Tsui et al., 2016). Providing a more reliable way of identifying developers' skills can help project managers make the right decision when hiring or when attracting the right developers to an open-source project. The task of finding experts in specific technologies is especially complex, despite the existence of business-oriented social networks, such as LinkedIn, where developers write about their attributes and qualifications.
This type of platform is commonly used for the online recruitment of professionals. However, the reliability and accuracy of the information provided in such media are not guaranteed (Brown and Vaughn, 2011). For instance, some individuals may overvalue their skills or omit relevant ones in a self-authored curriculum.

The most commonly used strategies to find experts have limitations (Tsui et al., 2016; Constantinou and Kapitsaki, 2016). For instance, the analysis of a curriculum from LinkedIn or in paper format can omit desirable skills. Besides, developers may have difficulty expressing their qualifications (Tsui et al., 2016). Sometimes the developer has a specific ability but considers it irrelevant. In other situations, the developer cites many skills but does not have expertise in the technologies mentioned (Constantinou and Kapitsaki, 2016). Even large companies may rely on curriculum analysis, and this kind of source may contain inaccurate or outdated information. Besides, even talent recruiters may incorrectly identify a developer's skills or identify skills that are not the organization's focus. Hiring low-skilled software developers can lead to additional costs, effort, and resources for training them, or to spending more time and resources hiring others (Constantinou and Kapitsaki, 2016; Sommerville, 2015). However, these costs can be reduced if companies identify more precisely the best developers for a job opening.

Many software developers use social coding platforms, such as GitHub and BitBucket, to showcase their work, hoping that this may help them be hired for a better job. Developers use these social coding platforms to demonstrate their skills and create an online profile of their projects (Constantinou and Kapitsaki, 2016). Some contributors even use these platforms' social aspects to infer project popularity trends and promote themselves more efficiently through specific projects and collaborations in other open-source projects. In some cases, profiles derived from accounts on social platforms, such as GitHub, are considered even more reliable than a LinkedIn curriculum concerning the technical qualifications of a job candidate (Constantinou and Kapitsaki, 2016). Therefore, data exploitation from coding platforms is a promising way for potential employers to identify and assess several candidates in real situations (Capiluppi et al., 2013).

GitHub has been widely used in several works mainly because it provides several user-based summary statistics, such as the number of contributions in the last year, the number of forked projects, and the number of followers. For instance, previous works have used this platform to identify appropriate developers for cross-project bugs (Ma et al., 2017), to identify reuse opportunities (Oliveira et al., 2016), and to analyze collaborations between projects (Dabbish et al., 2012). Different approaches have been used to investigate the skills of developers on GitHub (Saxena and Pedanekar, 2017; Mockus and Herbsleb, 2002; Greene and Fischer, 2016). For instance, prior work conducted interviews with GitHub members to understand the hiring process (Marlow and Dabbish, 2013).
We did not compare the results with other ap­ proaches because our strategy is very different from the oth­ ers. Therefore, our strategy complements related work by au­ tomatically reducing the search space to support library ex­ perts’ identification. This paper is an extension of our previ­ ous work (Oliveira et al., 2019) that proposed and evaluated a strategy to identify library experts from source code, named JExpert. Our main goal is to reduce the search space to iden­ tify library experts. We list the following new contributions to this submission compared to the original paper. 1. We present and analyze data of all identified expert can­ didates by means of new boxplot charts. 2. We include a novel classification and discussion of ex­ perts in four categories. 3. We include additional analysis of the library experts by proposing a novel heuristic to rank the top experts of each library. 4. We perform a new identification of experts in microser­ vices libraries. 5. We conducted an additional survey to calculate the strat­ egy precision on identifying experts in microservices. 6. We include additional discussion about the negative re­ sults of the evaluated metrics. In this paper, we evaluate the feasibility of identifying soft­ ware developers’ hard skills; that is, library expertise from source code analysis. We rely on GitHub data to support the identification of the skills of developers based on their contributions. From each type of developer contribution, we aim to identify essential developers skills and evaluate the applicability and precision of the strategy. In the applicabil­ ity evaluation, we performed a mining study with the top­9 most popular Java libraries from GitHub, aiming to identify library experts in these libraries. In total, we analyzed more than 16 thousand projects and 1.5 million developers. In the precision evaluation, we designed and sent a survey to more than 1 thousand developers identified for these libraries. We received answers from 158 developers. As a result, we ob­ serve that it is possible to reduce the search space to identify experts from source code. We also note that the strategy pro­ vides meaningful information to recruiters, such as the his­ tory of written lines of code (LOC) for each library. These details about the developers can improve the selection of can­ didates. Our key contributions are threefold: • we empirically evaluate the applicability and precision of identifying library experts from source code analysis. In addition, we propose a tool to support the strategy; • we identify 1,045 experts in top­9 Java libraries with a precision of about 63%; • we identify 136 experts from microservices libraries with a precision of about 71%. Low precision values indicate space for future research in this subject. The remainder of this paper is organized as fol­ lows. In Section 2, we describe our analysis by detailing the strategy to identify library experts, dataset, and our research questions. Section 3 presents the results of the applicability evaluation to identify library experts. Section 4 shows the re­ sults to survey with top­9 library experts. Section 5 shows the results concerning a survey with library experts in microser­ vices. Section 6 shows details about a tool developed to sup­ port the strategy. Section 7 presents and discusses threats to validity. Related work is discussed in Section 8. Finally, Sec­ tion 9 discusses the concluding remarks and future work. 
2 Study Settings This section describes the protocol to evaluate the identifica­ tion of library experts through an empirical study. Section 2.1 presents the aims of our study and the research questions we address. Section 2.2 shows the steps performed to evaluate the expert candidates. Section 2.3 describes the used dataset. 2.1 Goal and Research Questions This study’s primary goal is to evaluate the applicability and precision of a strategy to reduce the search space to identify li­ brary experts from source code analysis using software repos­ itories. We are interested in whether the strategy can signifi­ cantly reduce the search space to identify experts in a specific library. We are also concerned with assessing the relevance of the results provided by the strategy. For this purpose, we select the 10 most popular and standard Java libraries among GitHub developers. We also selected 6 popular libraries for microservices. One library was later excluded (Section 2.3). Therefore, we evaluate the strategy with the 9 most popu­ lar Java libraries and 6 libraries of microservices. To achieve this goal, we use the Goal­Question­Metric method to select measurements of source code. The GQM method proposes a top­down approach to defining measurement; goals lead to questions that are then answered with metrics (Basili et al., 1994). Table 1 shows the GQM with the research questions and metrics investigated in this study. As mentioned, the goal of this paper is to reduce the search space to identify library ex­ perts from source code. Therefore, from this goal, we check if it is feasible to analyze the source code to identify library experts. Through RQ1, we are interested in investigating the efficiency of the number of commits (metric) to indicate the level of activity of a developer in a specific library. In other words, we aim to analyze the number of commits involving Presenting the new SBC journal template Oliveira et al. 2020 a specific library performed by a developer to compute their activity level in the library. With RQ2, we aim at assessing the knowledge extension based on the number of imports to a specific library. From all imports made by a developer at the source code, we in­ vestigate the number related to the particular library. Finally, the last research question (RQ3) analyzes the knowledge in­ tensity of the developers from the number of LOC related to the library (metric). In this last question, we aim to evalu­ ate the amount of LOC implemented by a developer using a specific library. For this purpose, we evaluate the relation of total LOC and LOC related to a particular library. Table 1. The Metrics Analysis as GQM method Questions Metrics RQ1– How to evaluate the level of activity of a developer in a library? Number of commits RQ2– How to evaluate the knowl­ edge extension of a developer in a library? Number of imports RQ3– How to evaluate the knowl­ edge intensity of a developer in a library? Lines of Code 2.2 Evaluation Steps This section describes the steps to evaluate the identification of library experts from source code. To answer the research questions presented in Section 2.1, we designed a mixed­ method study composed of four steps: 1) Library Selection, 2) Dataset Collection, 3) Expert Identification, and 4) Sur­ vey Application. Figure 1 presents the steps of our research, which are discussed next. For Library Selection (Section 2.3), we selected the top­10 most popular libraries in the Java pro­ gramming language to identify library experts. 
We also se­ lected 6 libraries for microservices to favor external validity. In the Dataset Collection step (Section 2.3), we clone the projects that contain these libraries from GitHub. For Iden­ tification of Library Experts (Section 3.1), we compute the skills of developers based on three metrics: Number of Com­ mits, Number of Imports, and Lines of Code. These metrics are presented in Section 3.1. Finally, we performed two sur­ vey studies. These surveys were conducted to evaluate the precision of the strategy according to the responses of devel­ opers. Section 4.1 and 5.1 present details about the surveys. Figure 1. Study Steps 2.3 Dataset To create our dataset, we select the 10 most popular and com­ mon Java libraries among GitHub developers: Hibernate, Se­ lenium, Hadoop, Spark, Struts, GWT, Vaadin, Primefaces, Apache Wicket, and JavaServer Faces. This selection was made based on a survey provided by Stack Overflow1 in 2018 with answers of over 100,000 developers around the world. Table 2 summarizes the definitions of each library (top­10). All definitions of the libraries were retrieved from Stack Overflow and their Web pages. We selected Java be­ cause it is one of the most popular programming languages2 and there are many Java projects available on GitHub. Microservices have become most popular in the last years, together with the spread of DevOps practices (Pahl, 2015). We can see a significant increase in the use of microservices architectural style since 2014 (Klock et al., 2017), which can be verified in the service­oriented software industry where the usage of microservices has been far superior when com­ pared to other software architecture models (Alshuqayran et al., 2016). Furthermore, a microservice usually runs on its own process and communicates using standardized inter­ faces. In practice, microservices are widely used by large Web companies, such as Netflix and Amazon (Alshuqayran et al., 2016). For these reasons, we aim to identify experts of microservices in 6 libraries: Apache Karaf, Apache Spark, JavaEE, Netflix, Spring Boot, and Swagger. Table 3 also summarizes the definitions of each library, but now con­ cerning microservices. These definitions were retrieved from Stack Overflow and their Web pages. Figure 2 illustrates the criteria for defining our dataset. To achieve more realistic results for software development, we apply the following exclusion criteria. (1) We excluded sys­ tems with less than 1 KLOC because we considered them toy examples or early­stage software projects. (2) We removed projects with no commit in the last 3 years because the devel­ opers may forget their code (Krüger et al., 2018). Finally, in the last exclusion criteria, (3) we removed projects which did not contain imports related to the selected libraries. Besides, we excluded all official projects of these libraries because we assume all library project developers are experts in the corresponding library. In popular Java libraries, we also re­ moved libraries with less than 100 projects (e.g., JavaServer Faces). We need a representative number of projects to eval­ uate our strategy. We analyze only files with extension .java. The same process was made to projects of libraries of mi­ croservices. Therefore, we end up analyzing 15 libraries in this study. Figure 2. Steps for Collecting Software Projects from GitHub Table 4 shows the number of remained projects after each step in our filtering process. 
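Before turning to those per-library numbers, the sketch below illustrates one way the exclusion criteria above could be applied to an already-mined list of projects. This is only an illustrative sketch under assumptions: the Project record, its fields, and the reference date are hypothetical names introduced for the example, not part of the original tooling.

```java
import java.time.LocalDate;
import java.util.List;

public class ProjectFilter {

    // Hypothetical representation of a mined GitHub project.
    record Project(String name, int kloc, LocalDate lastCommit,
                   boolean importsTargetLibrary, boolean isOfficialLibraryRepo) {}

    // Applies the exclusion criteria described in Section 2.3:
    // (1) at least 1 KLOC, (2) at least one commit in the last 3 years,
    // (3) at least one import of a selected library; official library
    // repositories are also discarded.
    static List<Project> applyExclusionCriteria(List<Project> mined, LocalDate today) {
        return mined.stream()
                .filter(p -> p.kloc() >= 1)                                  // criterion (1)
                .filter(p -> !p.lastCommit().isBefore(today.minusYears(3)))  // criterion (2)
                .filter(Project::importsTargetLibrary)                       // criterion (3)
                .filter(p -> !p.isOfficialLibraryRepo())
                .toList();
    }

    public static void main(String[] args) {
        LocalDate today = LocalDate.of(2019, 1, 1); // assumed reference date
        List<Project> mined = List.of(
                new Project("toy-example", 0, today.minusMonths(2), true, false),
                new Project("active-hadoop-app", 35, today.minusMonths(6), true, false),
                new Project("abandoned-app", 80, today.minusYears(5), true, false));
        System.out.println(applyExclusionCriteria(mined, today)); // keeps only active-hadoop-app
    }
}
```

In this toy input, only the project with enough code and recent activity survives the filter, which is the intended effect of the criteria above.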
The first part of this table shows the results for top­10 Java libraries, and the second part 1https://insights.stackoverflow.com/survey/2018#most­popular­ technologies 2https://spectrum.ieee.org/static/interactive­the­top­programming­ languages­2018 Presenting the new SBC journal template Oliveira et al. 2020 Table 2. Library Descriptions Library Description Hibernate Hibernate is a library of object­relational mapping to object­oriented. Selenium A test suite specifically for automating Web. Hadoop A library that facilitates the use of the network from many computers to solve problems involv­ ing massive amounts of data (Tong et al., 2016; Ye, 2017). Spark A general­purpose distributed computing engine used for processing and analyzing a large amount of data Struts It helps in developing Web­based applications. GWT It allows Web developers to develop and maintain complex JavaScript front­end applications in Java. Vaadin It includes a set of Web components, a Java Web library, and a set of tools and application starters. It also allows the implementation of HTML5 web user interfaces using the Java. PrimeFaces A library for JavaServer Faces featuring over 100 components. Apache Wicket A library for creating reusable components and of­ fers an object­oriented methodology to Web devel­ opment while requiring only Java and HTML. JavaServer Faces A Java view library running on the server machine which allows you to write template text in client­ side languages (like HTML, CSS, JavaScript, etc.). shows the results for microservices libraries. The column #Projects presents the number of projects initially selected. Next, the column Filtered shows the number of projects re­ moved through the filtering step. Finally, the column Re­ mained presents the number of projects analyzed for each library. 3 Applicability Evaluation In this section, we describe how we evaluated the strategy in terms of its applicability focusing on the top­9 Java libraries. Section 3.1 presents the steps to identify library experts, for example, metrics and data about classes. Section 3.2 shows an overview of our data. Section 3.3 presents the top­10 ex­ perts in each library selected in this study. 3.1 Identification of Library Experts To evaluate the strategy in terms of its applicability, we perform three steps in this study. These three steps are described as follows. Step 1: Extract data from source code – In this step, we obtain data from the classes created by developers from a Git repository. All data, such as added or removed LOC, written imports, commits, date, email, and developers’ names, are stored locally. Table 3. Library Descriptions Library Description JavaEE The JavaEE platform is built on top of the Java SE plat­ form. The Java EE platform provides an API and run­ time environment for developing microservices and run­ ning large­scale, multi­tiered, scalable, reliable, and se­ cure network applications. Spring Boot Pivotal solution for implementing cloudbased microser­ vices using the well known Spring Framework. Netflix Netflix OSS is a set of frameworks and libraries that Netflix wrote to implement microservices in distributed­ systems. Swagger Swagger is used to creating documentation for each mi­ croservice. Karaf Apache project referenced to support microservice im­ plementations. Spark A lightweight web framework that has been used to im­ plement simple and expressive microservices. Table 4. 
Projects Selected for Analysis

Library | #Projects | Filtered | Remained
Hibernate | 31,134 | 26,020 | 5,114
Selenium | 19,062 | 17,648 | 1,414
Hadoop | 11,715 | 10,778 | 937
Spark | 9,144 | 7,650 | 1,494
Struts | 4,741 | 4,127 | 614
GWT | 4,086 | 2,635 | 1,451
Vaadin | 3,240 | 2,625 | 615
PrimeFaces | 1,881 | 1,401 | 480
Apache Wicket | 1,095 | 896 | 199
JavaServer Faces | 120 | 120 | -
Total | 86,218 | 73,900 | 12,318
Microservices
Apache Karaf | 264 | 155 | 109
Apache Spark | 243 | 120 | 123
JavaEE | 321 | 190 | 131
Netflix | 653 | 240 | 413
SpringBoot | 393 | 246 | 147
Swagger | 357 | 239 | 118
Total | 2,231 | 1,190 | 1,041

Step 2: Search for imports – From the previous step, we search for specific "imports" related to the chosen library. The idea is to explore all files that import the target library. This step is performed as follows. First, the strategy gets the files with all commits, for example, commits to LOC in general, comments, and mainly the file header. Second, it analyzes the header of the Java files, which contains the package name, all imports required by the class, and the class names. We then detect the imports through a regular expression of the form import + "target library", for example, "import org.apache.spark". In this example, the target library is Spark.

Figure 3 shows an example of a file with data from committers. As we can observe in Figure 3, there are three attributes in this file: (1) the commit hash, (2) the name of the developer, and (3) the committed source code. At the beginning of the file, there is the package name and many imports. In this part, our strategy uses a regular expression to detect whether a line contains the library under investigation. If the line contains the target library, we record the commit hash, the number of imports of the specific library, and the total number of imports unrelated to the target library.

Figure 3. File Example with Commits of Three Developers

Step 3: Calculate skills – In this last step, we compute the skills of each developer. We rely on three metrics to identify library experts. Each metric is calculated with respect to the number of commits to a specific library; that is, the metrics are computed whenever a commit using the library is identified. In the following, we explain the three proposed metrics.

Number of Commits. This metric captures the activity of each developer through the number of commits using a particular library. Through this metric, we believe it is possible to measure how much a specific developer works with the library in a project.

Number of Imports. This metric captures the extension of knowledge in the library. For this metric, we count all imports of the library written by a developer. Repeated imports are included: if a developer wrote two identical imports, we count 2 imports of the target library. Figure 3 shows an example of repeated imports: there are four imports of Apache Hadoop in this figure, so we compute 4 imports for this library. Likewise, if a developer wrote the same import 3 times, as in the example below, we count 3 imports.

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.LongWritable;

Lines of Code. To compute this metric, we developed a heuristic to count the amount of LOC related to a specific library. First, we obtain the ratio of changed LOC to the number of all imports in the file. Then, we multiply this ratio by the number of imports related to the library.
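As an aside, before giving the exact formula of this heuristic, the sketch below illustrates one way the import matching described in Step 2 could be implemented. It is only a simplified illustration under assumptions: the class and method names are hypothetical, and the real extraction may handle imports (e.g., wildcard or static imports) differently.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ImportMatcher {

    // Builds a pattern equivalent to import + "target library",
    // e.g., lines starting with: import org.apache.spark...
    static Pattern importPattern(String libraryPackagePrefix) {
        return Pattern.compile(
                "^\\s*import\\s+" + Pattern.quote(libraryPackagePrefix) + "[\\w.]*\\s*;",
                Pattern.MULTILINE);
    }

    // Counts imports of the target library in the content of one Java file.
    // Repeated imports are counted as many times as they appear,
    // matching the Number of Imports metric described above.
    static int countLibraryImports(String javaFileContent, String libraryPackagePrefix) {
        Matcher m = importPattern(libraryPackagePrefix).matcher(javaFileContent);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String file = """
                package example;
                import org.apache.hadoop.io.LongWritable;
                import org.apache.hadoop.io.IntWritable;
                import java.util.List;
                public class WordCount {}
                """;
        System.out.println(countLibraryImports(file, "org.apache.hadoop")); // prints 2
    }
}
```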
Our heuristic considers three attributes: the number of library imports, the number of imports in general, and the number of LOC altered by a commit related to the library. The heuristic is then computed as follows:

Library LOC = (# of LOC Altered by Commit / # of All Imports) × # of Library Imports

From Figure 3, it is possible to compute an example of this metric. A developer made a commit with hash code 75b70c and an import of "import org.apache.hadoop.io.IntWritable;" (line 2). Therefore, applying the formula above, we consider 10.67 LOC related to the Hadoop library.

3.2 Overview of Dataset

From the dataset projects, we computed all commits involving the libraries evaluated in this study and identified 1.5 million different developers who made commits. Figure 4 shows the number of developers for the top-9 popular Java libraries. The library with the most committing developers was Selenium, with 811,884 developers. In contrast, Apache Wicket was the library with the fewest developers: 5,440. It is important to note that these developers made at least one commit to the respective library. However, we cannot consider all of them experts, since a single use of a library may not indicate high expertise.

Figure 4. Number of Developers by Library

Figures 5, 6, and 7 show an overview of the metrics computed for our dataset of popular Java libraries. Figure 5 presents the results for the Number of Commits per library. Figure 6 presents an overview of the metric Number of Imports per library. Finally, Figure 7 shows the results of the metric Lines of Code per library. In general, LOC (Figure 7) was the metric that presented the most variation in our dataset. For instance, GWT has developers who wrote more than 130 KLOC. Similarly, for Hibernate, it is possible to see an outlier developer who wrote more than 500 KLOC. In contrast, some developers wrote less than 10 lines of code, for example, for the PrimeFaces library.

Figure 5. Number of Commits per Library
Figure 6. Number of Imports per Library
Figure 7. Number of LOC per Library

3.3 Top Library Experts Selection

In this section, we present the Applicability Evaluation results to verify the feasibility of library expert identification, focusing on the top-9 popular Java libraries. We analyzed 16,703 software systems mined from GitHub and 9 libraries: Hibernate, Selenium, Hadoop, Spark, Struts, GWT, Vaadin, Primefaces, and Apache Wicket. Besides, we analyzed data from more than 1.5 million developers who have contributed to the projects in our dataset.

Table 5 presents the top library experts. To obtain these results, we aimed to select the top-10 developers, but, in some cases, this was not possible. For instance, we obtained only the top-3 developers for the Spark library. Besides, we consider a developer a library expert candidate only if this developer obtains high values in at least two metrics, for example, LOC & # of Commits or # of Imports & LOC. These developers are identified according to their contribution. For this, we calculate the 90th percentile of each metric and then filter out the developers with any metric below this threshold. This type of classification is common in other studies (Joblin et al., 2017; Ferreira et al., 2019). Finally, we sort developers by LOC (# of Library LOC). The filtering threshold was applied to remove potential false positives (i.e., developers with a high # of Library LOC but a low # of Commits). A minimal sketch of this percentile-based filtering is given below.
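The following sketch assumes the three per-developer metric values have already been aggregated. The DeveloperMetrics record, the nearest-rank percentile rule, and the sample values are assumptions made for the example; the sketch follows the reading in which no metric may fall below the 90th percentile, and it is not the actual JExpert code.

```java
import java.util.Comparator;
import java.util.List;
import java.util.function.ToDoubleFunction;

public class ExpertCandidateFilter {

    // Hypothetical per-developer aggregate of the three metrics.
    record DeveloperMetrics(String name, double commits, double libraryImports, double libraryLoc) {}

    // Value at the given percentile (e.g., 0.90) using a simple nearest-rank rule.
    static double percentile(List<DeveloperMetrics> devs,
                             ToDoubleFunction<DeveloperMetrics> metric, double p) {
        double[] sorted = devs.stream().mapToDouble(metric).sorted().toArray();
        int index = (int) Math.ceil(p * sorted.length) - 1;
        return sorted[Math.max(0, index)];
    }

    // Keeps only developers whose three metrics are all at or above the
    // 90th percentile, then sorts them by library LOC (descending).
    static List<DeveloperMetrics> topCandidates(List<DeveloperMetrics> devs) {
        double commitCut = percentile(devs, DeveloperMetrics::commits, 0.90);
        double importCut = percentile(devs, DeveloperMetrics::libraryImports, 0.90);
        double locCut = percentile(devs, DeveloperMetrics::libraryLoc, 0.90);
        return devs.stream()
                .filter(d -> d.commits() >= commitCut
                        && d.libraryImports() >= importCut
                        && d.libraryLoc() >= locCut)
                .sorted(Comparator.comparingDouble(DeveloperMetrics::libraryLoc).reversed())
                .toList();
    }

    public static void main(String[] args) {
        // Illustrative values only.
        List<DeveloperMetrics> devs = List.of(
                new DeveloperMetrics("alice", 500, 1_600, 24_000),
                new DeveloperMetrics("bob", 40, 300, 2_000),
                new DeveloperMetrics("carol", 120, 900, 9_000));
        topCandidates(devs).forEach(d -> System.out.println(d.name())); // prints "alice"
    }
}
```

With such a small toy sample, the 90th percentile equals the maximum of each metric, which is why only one developer passes; on the real per-library distributions the cut-offs are far less extreme.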
In some cases, it resulted in less than 10 experts for some libraries, such as PrimeFaces (8), Spark (3), Struts (6), and Wicket (5). In Table 5, each developer is identified by the start name of the library, followed by a sequence number (e.g., HAD (1) means the first developer expert of Hadoop). The column # of Library Imports refers to the metric of Number of Imports written by the developer. It counts the number of imports re­ lated to the specific library evaluated in this study. The col­ umn # of All Imports shows the number of imports wrote by the developer in general. When a developer wrote an import to a specific library evaluated in this study, they also wrote imports to other libraries that have not been evaluated. Hence, this metric counts all imports in relevant commits made by the developer. The column # of Commits shows the results for the Num­ ber of Commits metric. This metric indicates the number of commits made by a specific developer. The column # of LOC Presenting the new SBC journal template Oliveira et al. 2020 Table 5. Top Library Experts ID # of Library Imports # of All Imports # of Commits # of LOC Altered by Commit # of Library LOC GWT(1) 1,693 6,836 49 637,724 157,938 GWT(2) 5,108 5,951 386 87,303 74,935 GWT(3) 4,019 5,451 452 75,700 55,813 GWT(4) 1,677 1,880 31 56,535 50,430 GWT(5) 2,497 3,714 74 54,865 36,886 GWT(6) 1,564 6,226 66 135,574 34,056 GWT(7) 2,657 6,167 71 71,767 30,920 GWT(8) 1,732 1,956 141 33,272 29,461 GWT(9) 2,249 2,558 105 31,124 27,364 GWT(10) 1,432 3,791 56 71,264 26,919 HAD(1) 15,739 32,391 172 488,882 237,550 HAD(2) 2,083 3,378 14 46,215 28,497 HAD(3) 1,024 27,277 31 476,220 17,877 HAD(4) 1,303 2,628 146 31,440 15,588 HAD(5) 932 1,518 93 16,086 9,876 HAD(6) 625 1,329 52 16,788 7,895 HAD(7) 569 1,843 55 19,899 6,143 HAD(8) 242 599 18 13,051 5,272 HAD(9) 493 617 18 6,110 4,882 HAD(10) 322 973 12 11,842 3,918 HIB(1) 3,401 5,211 155 78,781 51,417 HIB(2) 1,719 2,923 169 25,963 15,268 HIB(3) 180 432 25 24,552 10,230 HIB(4) 552 1,182 15 13,612 6,356 HIB(5) 552 791 44 7,939 5,540 HIB(6) 535 756 51 5,684 4,022 HIB(7) 509 1,281 10 9,250 3,675 HIB(8) 458 898 50 7,060 3,600 HIB(9) 202 395 17 6,880 3,518 HIB(10) 233 387 15 4,617 2,779 PRI(1) 239 16,194 6 245,319 3,620 PRI(2) 177 1,286 6 11,232 1,545 PRI(3) 72 282 15 3,500 893 PRI(4) 37 144 12 2,014 517 PRI(5) 38 545 10 6,374 444 PRI(6) 28 168 6 2,538 423 PRI(7) 28 142 6 1,904 375 PRI(8) 27 102 10 1,374 363 SEL(1) 614 820 61 8,757 6,557 SEL(2) 1,178 1,763 116 9,606 6,418 SEL(3) 707 3,166 27 27,808 6,209 SEL(4) 287 1,436 49 28,355 5,667 SEL(5) 780 1,141 93 7,245 4,952 SEL(6) 491 2,229 73 22,302 4,912 SEL(7) 242 486 18 9,513 4,736 SEL(8) 324 1,027 27 14,084 4,443 SEL(9) 394 1,095 16 12,096 4,352 SEL(10) 178 417 16 9,685 4,134 SPA(1) 757 2,208 36 22,903 7,852 SPA(2) 280 1,253 29 17,940 4,008 SPA(3) 446 834 38 7,344 3,927 STR(1) 670 3,286 3 64,468 13,144 STR(2) 531 2,432 2 24,448 5,337 STR(3) 616 2,771 2 23,419 5,206 STR(4) 175 793 9 14,753 3,255 STR(5) 133 1,076 9 21,477 2,654 STR(6) 278 818 6 7,357 2,500 VAA(1) 3,541 5,960 100 95,786 56,909 VAA(2) 561 761 46 21,537 15,876 VAA(3) 1,265 2,102 203 21,973 13,223 VAA(4) 684 4,208 74 59,710 9,705 VAA(5) 816 1,178 102 12,557 8,698 VAA(6) 510 656 31 8,913 6,929 VAA(7) 451 628 28 9,169 6,584 VAA(8) 740 1,432 30 11,746 6,069 VAA(9) 358 375 28 6,223 5,940 VAA(10) 334 495 59 8,695 5,866 WIC(1) 1,428 1,727 191 16,991 14,049 WIC(2) 1,017 1,212 55 14,255 11,961 WIC(3) 494 543 56 10,104 9,192 WIC(4) 403 451 49 9,549 8,532 WIC (5) 476 651 34 8,439 6,170 Presenting the new 
Altered by Commit presents the LOC changed by a developer when s/he made a commit related to the library (i.e., identified by a specific library import). Finally, the last column, # of Library LOC, shows the results for the metric LOC written by the developer in relation to the library, based on our heuristic.

In this paper, developers can be classified as Hard/Soft Committers and Hard/Soft Coders, depending on their metric values. We consider a developer a Hard Committer when his/her values are equal to or above the 75th percentile; that is, we use the 3rd quartile as a parameter. Hard Committers are developers who made many commits (# of Commits) related to the libraries that are the subject of this study. For example, let us suppose that developer Maike made 10k commits related to library Y and developer Anna made 1k commits related to library Y. In this context, developer Maike is a Hard Committer in relation to developer Anna. Similarly, Hard Coders are developers who wrote many lines of code related to the library (# of Library LOC). For instance, let us suppose that developer Mary wrote 8 KLOC in commits to library Y and developer John wrote 1 KLOC in commits to the same library. In this case, developer Mary is considered a Hard Coder in relation to developer John. Moreover, a developer can be both a Hard Committer and a Hard Coder if s/he has a high number of commits and of LOC related to the library. On the other hand, we classify a developer as Soft using the same strategy used to classify Hard developers, but with values below the 25th percentile, i.e., the 1st quartile, as the parameter. We discuss this classification below.

Hard Committers and Hard Coders. According to our metrics, developer GWT (3) is a Hard Committer and Hard Coder (see Table 5). This developer made more than 450 commits and wrote more than 55 KLOC for this library. Other developers are also Hard Committers and Hard Coders. For instance, developer HAD (1) made 172 commits and wrote more than 237 KLOC. These are some examples of Hard Committers and Hard Coders according to the calculated metrics.

Hard Committers and Soft Coders. We now present the results for Hard Committers and Soft Coders. Developers HAD (1) and HAD (4) in Table 5 can be considered Hard Committers because they made 172 and 146 commits, respectively; the difference between them is only 26 commits. However, developer HAD (4) is considered a Soft Coder in relation to developer HAD (1), because HAD (1) wrote more than 235 KLOC while HAD (4) wrote about 15 KLOC. Developer HAD (4) wrote only 6% of the LOC of developer HAD (1). Therefore, HAD (4) is a Hard Committer and Soft Coder.

Soft Committers and Hard Coders. Concerning Soft Committers and Hard Coders, we can observe that developers PRI (1), PRI (2), SEL (1), and STR (1) in Table 5 are Soft Committers because they made only a few commits. Developer STR (1), for instance, made only 3 commits, but s/he wrote more than 13 KLOC. Therefore, this developer is considered a Soft Committer and Hard Coder.

Soft Committers and Soft Coders. As the name suggests, this category includes the developers who made fewer commits and wrote fewer lines of code compared to their peers. For instance, developers HIB (9), HIB (10), SEL (9), and SEL (10) are considered Soft Committers because they made less than 20 commits to the cited libraries.
Besides, these developers wrote less than 5 KLOC. Therefore, according to our met­ rics, these developers are considered Soft Committers and Soft Coders. 4 Survey with Top Libraries Experts This section describes the survey applied to GitHub develop­ ers to evaluate the strategy with respect to the top­9 popular Java libraries. Section 4.1 presents the details regarding the survey developed. Section 4.2 presents a summary of some relevant findings. Section 4.3 presents the results to RQ1 re­ garding the Number of Commits metric. Section 4.4 presents the results to RQ2 about the Number of Imports metric. Sec­ tion 4.5 presents the results to RQ3 regarding the LOC met­ ric. 4.1 Survey Design According to Easterbrook et al. (2008), survey studies are used to identify the characteristics of a population and are usually associated with the application of questionnaires. Be­ sides, surveys are meant to collect data to describe and com­ pare or explain knowledge (Pfleeger and Kitchenham, 2001). We selected the library experts with the best values in the evaluated metrics to validate them through a survey. We de­ signed and applied a survey with the top developers identi­ fied by our strategy. We selected developers with the top­ 20% highest values in at least two (out of three) metrics. We created a questionnaire on Google Forms3 with two parts: the first one was composed of 5 questions about the background of the expert candidates; the second part also had 5 questions about the knowledge of the expert candi­ dates regarding the evaluated libraries. Table 6 contains the tag meaning a specific library, for instance, Hadoop. Also, this table shows the possible answers to the survey questions. Table 6. Survey Questions on the Use of the Libraries ID Questions SQ1 How do you assess your knowledge in ? ( ) 1 ( ) 2 ( ) 3 ( ) 4 ( ) 5 SQ2 How many projects have you worked with ? ( ) 1 to 5 ( ) 6 to 10 ( ) 11 to 20 ( ) More than 20 projects SQ3 How many packages of have you used? ( ) A few ( ) A lot SQ4 How often do your commits include ? ( ) A few ( ) A lot SQ5 How much of your code is related to ? ( ) Few of my code is related to ( ) My code is partially related to ( ) Most of my code contains To obtain the email used by the developer to perform the 3https://www.google.com/forms/ Presenting the new SBC journal template Oliveira et al. 2020 commits in the source code, we used the Git­Blame4 tool. The emails were collected to send the survey. We sent an email to developers asking them to assess their knowledge of each library. For instance, the developers were invited to rank their knowledge (Table 6, SQ1) using a scale from 1 (one) to 5 (five), where (1) means no knowledge about the library; and (5) means extensive knowledge about the library. Ques­ tions are not mandatory because they may require knowledge of the exceptional features of the library. Therefore, partici­ pants are not forced to provide an answer when they do not re­ member a specific library element, such as the time of devel­ opment using the library and the approximate frequency of commits that contain the library. The survey remained open for 15 days in January 2019. In summary, we present the precision evaluation results based on a survey with expert candidates in each of the top­9 popular Java libraries. The goal of this evaluation is to verify the precision of the library expert identification. We empir­ ically selected 1,045 developers among the top­20% values in at least 2 metrics. The questionnaire was sent in January 2019. 
After 15 days, we obtained 137 responses resulting in a response rate of about 15%. We asked the 137 develop­ ers about their software development experience in general (background) and the use of the specific libraries investigated in this paper. 4.2 Overview In this section, we present an overview of some relevant find­ ings of the popular Java libraries. Table 7 presents an overview of the experts’ candidates contacted to answer our first survey. This table has the fol­ lowing structure. The first column (Library) indicates the name of the analyzed library. The second column (Emails sent) shows the number of emails collected and sent to ex­ pert candidates. The third column (Invalid email) presents the number of invalid emails returned by the server. The fourth column (Remaining emails) indicates the number of valid emails. The fifth column shows the number of answers we obtained for each library. Finally, in the last column, we show the response rate of each library. Table 7. Top 20% from Library Experts Selected to Answer the Survey Library Emails sent Invalid email Remaining email # Answers % GWT 160 18 142 31 22% Hadoop 181 33 148 11 7% Hibernate 155 10 145 16 11% Spark 138 19 119 11 9% Struts 42 2 40 9 23% Vaadin 107 18 89 15 17% PrimeFaces 30 1 29 9 31% Wicket 23 2 21 8 38% Selenium 209 31 178 27 15% TOTAL 1,045 134 911 137 15% Concerning the participants’ background and replication package, we create a Web page with more details (Oliveira et al., 2020). It is worth mentioning that half of the respon­ dents graduated in Computer Science, and 7% holds a Ph. D. 4https://git­scm.com/docs/git­blame degree. Concerning time dedicated to software development, 47% has more than 10 years of experience, and only 2% have less than 1 year of experience. Therefore, we can conclude that, in general, the participants are not novices. Our study also shows that a significant amount of expert candidates makes commits. When writing code related to a specific library, they perform many imports of particular li­ braries and writes lines of code about the library. We support this affirmation through metrics that evaluate the amount of LOC written by a developer when they performed a commit. Table 8 shows the results of the knowledge that surveyed de­ velopers claim to have in each library. If we analyze the data about the precision of the strategy from the sum of levels 3, 4, and 5 of the Likert­type scale, we obtain on average 88.49% of precision about the knowledge of the developers, i.e., identification is correct in more than 88% of the cases. On the other hand, although a score of three may represent acceptable knowledge, if we followed more conservative cri­ teria, only classifying as library experts the developers that informed a higher (≥ 4) knowledge on the libraries obtain average, 63.31% of precision. This way, we conclude that less than 2/3 of the identified expert candidates identified by the strategy contain high knowledge about the evaluated li­ braries. About 63% of the library experts who answered the survey have high knowledge about the evaluated libraries. Table 8. Level of Knowledge in Each Library Library Likert scale Total 3­4­5 4­51 2 3 4 5 GWT 1 1 4 9 16 31 94% 81% Hadoop 0 1 3 4 3 11 91% 64% Hibernate 1 3 6 3 3 16 75% 38% Spark 0 1 4 2 4 11 91% 55% Struts 2 2 1 4 0 9 56% 44% Vaadin 0 2 5 3 5 15 87% 53% PrimeFaces 0 0 4 4 1 9 100% 56% Wicket 1 0 2 4 1 8 88% 63% Selenium 0 1 4 13 9 27 96% 81% 4.3 Level of Activity In this section, we answer the first research question. 
RQ1– How to evaluate the level of activity of a developer in a library?

To answer this research question, we asked the library experts the following question: "How often are your commits related to the library?" Figure 8 shows the results of this question in the first row of each chart for each library.

For most libraries, the majority of the participants answered that they made "few" commits using the evaluated libraries. If we consider the results obtained for this label, it is possible to see that, of the 137 experts, 54% made "few" commits. For instance, for the Hibernate library, 87% of the developers said they made few commits related to this library. Another library that deserves special attention is Struts: 88% of its developers responded that they made few commits. Regarding the label "a lot", only 39% of the experts polled said they performed many commits. GWT was the library with the highest rate of answers for this label (62%). Therefore, the results indicate that the metric Number of Commits needs to be combined with other metrics to achieve conclusive results about the skills of developers; it may even be necessary to develop other metrics to capture the level of activity.

Answer to RQ1. A large proportion of library experts make "few" commits using the library. Therefore, we conclude that the number of commits alone cannot identify library experts.

4.4 Knowledge Extension

In this section, we answer the second research question.

RQ2– How to evaluate the knowledge extension of a developer in a library?

Regarding the number of imports as an indicator of a library expert, we asked the developers the following question: "How often do you include an import of the library in your commits?". Figure 8 shows the results of this question in the second row of each chart for each library. We analyze the number of imports performed by developers. The main reason for this analysis is to evaluate the feasibility of inferring the skills of developers from the imports they write. In general, the labels "a few" and "a lot" are tied or show little difference between them. For example, Hibernate, Spark, and PrimeFaces are practically tied; these libraries did not show significant differences, with only 1 absolute point of difference in some cases. In only three cases, the label "a lot" remained significantly higher: GWT (83%), Vaadin (67%), and Selenium (78%).

Of the 137 experts, 68% said that they made "a lot" of imports. However, the numbers informed by the experts indicate that this metric requires a combination with other metrics to achieve better results, because 32% of the experts said they made few imports of the evaluated libraries. Therefore, from the survey results, neither the metric Number of Imports nor the metric Number of Commits is able to identify library experts when applied alone.

Answer to RQ2. The metric Number of Imports is not able to identify library experts when used alone.

4.5 Knowledge Intensity

In order to evaluate the metric Lines of Code, we present the third research question as follows.

RQ3– How to evaluate the knowledge intensity of a developer in a library?

In this research question, we analyze the developers' skills based on the number of LOC related to the library. We evaluate the number of LOC implemented by a developer for a specific library. For this purpose, we asked the library experts in the survey the following question.
"How much of your code is related to the library when you perform a commit?". Figure 8 shows the results of this question in the third row of each chart for each library. The libraries GWT, Wicket, Selenium, and Hadoop, for instance, obtained 74%, 71%, 70%, and 64%, respectively, for the label "a lot". We noted, however, that the label "a few" also remained at a high level in some cases, for instance, for the libraries Struts (88%) and Spark (55%). In fact, the library Hibernate remained tied between the labels "a few" and "a lot". In general, of the 137 experts, 39% said they write "a few" LOC and 61% write "a lot" of LOC with respect to the libraries. Therefore, it is possible to infer that the metric Lines of Code alone also does not provide reliable indications about developer skills, although this metric achieved better precision than the metric Number of Commits.

Answer to RQ3. According to our analysis, the metric Lines of Code alone cannot reliably provide indications about developers' skills. In general, our metrics are not feasible for identifying library experts. However, our strategy is able to reduce the search space of library experts. Therefore, a company or an open-source project can select a developer from the group identified by our strategy.

Figure 8. Results of the survey questions for each library: (a) GWT, (b) Hadoop, (c) Hibernate, (d) PrimeFaces, (e) Selenium, (f) Spark, (g) Struts, (h) Vaadin, (i) Wicket

5 Survey with Microservices Experts

In order to favor the generalization of our findings, we conducted a second survey with developers of microservices libraries. For this, we selected libraries in this domain.

5.1 Survey Design

We selected the library expert candidates for this survey in a similar way to the survey presented in Section 4.1. We created a questionnaire on Google Forms (https://www.google.com/forms/) in order to evaluate the knowledge of the developers about microservices libraries. The first question requests the GitHub login of the developer. This login is necessary to map the developer's answers to our data. We then ask developers about their knowledge of all six microservices libraries investigated in this survey. We request the developers to rank their knowledge of these libraries on four levels: No knowledge, Low knowledge, Medium knowledge, and Extensive knowledge. Each level of knowledge has a meaning. No knowledge: this library was never used in any project I am involved in. Low knowledge: I never used this library, but it has been used in projects I am involved in. Medium knowledge: I used this library in some projects before, but I do not master all its API. Extensive knowledge: I used this library many times, and I know a lot of its API. Table 9 shows the template of the survey.

Table 9. Level of Knowledge in Microservices Libraries
Library | No knowledge | Low knowledge | Medium knowledge | Extensive knowledge
Apache Karaf | ( ) | ( ) | ( ) | ( )
Apache Spark | ( ) | ( ) | ( ) | ( )
JavaEE | ( ) | ( ) | ( ) | ( )
Netflix | ( ) | ( ) | ( ) | ( )
Spring Boot | ( ) | ( ) | ( ) | ( )
Swagger | ( ) | ( ) | ( ) | ( )

We selected the library experts with the best values in the evaluated metrics to validate them through the survey. We designed and applied the survey with the top developers identified by our strategy. That is, we selected developers with values in the top 20% for at least two (out of three) metrics. Therefore, we selected 136 candidate library experts in microservices. A minimal sketch of this selection is given below.
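This "top 20% in at least two out of three metrics" rule mirrors the percentile filter sketched in Section 3.3. The sketch below is again illustrative only: the record and method names, the nearest-rank rule, and the sample values are assumptions rather than the actual JExpert implementation.

```java
import java.util.List;
import java.util.function.ToDoubleFunction;

public class SurveyCandidateSelector {

    // Hypothetical per-developer aggregate of the three metrics.
    record DeveloperMetrics(String name, double commits, double libraryImports, double libraryLoc) {}

    // Cut-off value leaving roughly the top 20% (i.e., the 80th percentile, nearest-rank).
    static double top20Cut(List<DeveloperMetrics> devs, ToDoubleFunction<DeveloperMetrics> metric) {
        double[] sorted = devs.stream().mapToDouble(metric).sorted().toArray();
        return sorted[Math.max(0, (int) Math.ceil(0.80 * sorted.length) - 1)];
    }

    // A developer is invited to the survey when at least two of the three
    // metrics fall in the top 20% of the corresponding distribution.
    static List<DeveloperMetrics> surveyCandidates(List<DeveloperMetrics> devs) {
        double c = top20Cut(devs, DeveloperMetrics::commits);
        double i = top20Cut(devs, DeveloperMetrics::libraryImports);
        double l = top20Cut(devs, DeveloperMetrics::libraryLoc);
        return devs.stream()
                .filter(d -> ((d.commits() >= c ? 1 : 0)
                            + (d.libraryImports() >= i ? 1 : 0)
                            + (d.libraryLoc() >= l ? 1 : 0)) >= 2)
                .toList();
    }

    public static void main(String[] args) {
        // Illustrative values only.
        List<DeveloperMetrics> devs = List.of(
                new DeveloperMetrics("dana", 90, 700, 8_000),
                new DeveloperMetrics("eli", 10, 60, 500),
                new DeveloperMetrics("fay", 80, 900, 1_200));
        surveyCandidates(devs).forEach(d -> System.out.println(d.name())); // prints "dana"
    }
}
```

Applying such a rule per library to the per-developer metric values would yield candidate groups like the 136 reported here.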
Figure 9 presents an overview of the number of developers by the library. The library with more candidate experts identified was Netflix with 64, and the library with fewer candidate library experts identified was Karaf with only 1. Figure 9. Number of Developers by Library in Microservices Table 10 presents an overview of the candidate experts contacted to answer our survey. We sent 136 emails, but 38 was returned with invalid email. Therefore, we sent the sur­ vey to 98 valid emails. The library, with more amount of respondents, was Netflix with 7 candidate library experts. On the other hand, the library with fewer participants was Apache Karaf with 0. Table 10. Top 20% from Library Experts of Microservice Library Emails Send Invalid Email Remaing Email #Answer % Apache Karaf 1 0 1 0 0% Apache Spark 6 1 5 4 80% JavaEE 9 1 8 1 13% Netflix 64 18 46 7 15% SpringBoot 37 11 26 6 23% Swagger 19 7 12 3 25% Total 136 38 98 21 21% 5.2 Results In this section, we present the results about a survey per­ formed with library experts from microservices. Initially, we perform a pilot survey with 5 developers from Netflix ran­ domly selected among the candidate experts identified. We received 3 answers for this library. From the pilot, we did not identify any problem with our survey. Then we apply the final survey for all top­20% developers with high values in at least two metrics. Note that the results of the pilot survey are part of our final results. Table 11 shows the summary results from our second sur­ vey. The first column shows the name of the library. The sec­ ond column shows the number of developers without knowl­ edge in the library target. The third column indicates the number of developers with low knowledge in the library tar­ get. The fourth column shows the number of developers with medium knowledge in the library target. The fifth column shows the number of developers with extensive knowledge in the library target. Finally, the two last columns show the precision for medium and extensive knowledge. The library Apache Karaf is not presented in Table 11 because we did not obtain any response for this library. Table 11. Summary Results Library No Low Medium Extensive Medium (precision) Extensive (precision) Apache Spark 2 1 1 25% JavaEE 1 100% Netflix 1 1 2 3 29% 43% SpringBoot 1 5 17% 83% Swagger 1 2 67% Total 4 2 4 11 19% 52% Table 12 presents the overview of the survey applied with developers from microservices. This table has 8 columns. The column Developer represents the name of the developer. We omitted the name of developers to avoid his/her expo­ sure. Next, six columns in the sequence represent the libraries investigates from the survey. Finally, we have the target li­ brary target name. This column represents the library that our strategy classified the developer as library experts. In this table, we have 4 scales for developers to rank her/his knowledge. The NK represents ‘‘no knowledge’’, LK rep­ resents ‘‘low knowledge’’, MK indicates ‘‘medium knowl­ edge’’, and EK represents ‘‘extensive knowledge’’. Table 12 shows, for instance, developer­D1 was identified by our strat­ egy as medium knowledge or extensive knowledge from li­ brary Spark. However, developer D1 answered that he/she has low knowledge in this library. D1 is an interesting case because this developer reports low knowledge in the library they had been recommended, but medium Knowledge and extensive knowledge for all others. 
Netflix is also an inter­ esting case since only 3 (out of 8) reported extensive knowl­ edge in the library, while 5 reported extensive knowledge in JavaEE and 5 in Spring Boot. On the other hand, developer­ D17 was identified by our strategy as medium knowledge or extensive knowledge from library Spark, and this developer marked extensive knowledge. From 21 developers that answered the survey, we observe that the strategy obtained a precision of 52% on average for extensive knowledge. Table 11 in the last column shows, for example, to library SpringBoot a precision of 83%. On the other hand, for library Netflix, our strategy obtained a pre­ cision of 43%. If we consider the survey results to corre­ late with the results of the strategy concerning only the de­ velopers that answered with extensive knowledge, we ob­ tained 52% precision. However, if we consider developers who answered the survey with medium knowledge or exten­ sive knowledge to correlate with strategy results, we obtain 71% precision. 6 Tool Support We developed a prototype tool, named JExpert, to support the identification of library experts (strategy). We developed JExpert in Java programming language. JExpert currently works with Java projects, but the tool can be easily adapted Presenting the new SBC journal template Oliveira et al. 2020 Table 12. Survey Results: Microservices (Overview) Developer Apache Karaf Apache Spark JavaEE Netflix Spring Boot Swagger Library D1 MK LK MK EK EK EK Spark D2 NK NK NK NK LK NK D3 NK MK EK MK MK MK D4 NK NK MK NK MK MK D5 NK NK EK LK EK MK JavaEE D6 LK NK EK EK EK EK Netflix D7 NK NK EK EK EK MK D8 MK LK EK MK EK MK D9 LK LK LK LK EK NK D10 LK NK LK NK LK MK D11 NK NK LK EK LK LK D12 NK NK EK MK MK MK D13 LK LK EK MK EK EK SpringBoot D14 LK LK MK LK EK LK D15 LK LK MK LK MK EK D16 NK MK MK MK EK EK D17 NK MK MK EK EK EK D18 LK LK EK MK EK EK D19 LK NK EK LK EK NK SwaggerD20 LK LK EK EK EK EK D21 LK EK NK LK LK EK NK No knowledge LK Low knowledge MK Medium knowledge EK Extensive knowledge to identify library experts in other programming languages. JExpert is a standalone tool and runs in Windows, Linux, and MAC. JExpert is available in our website (Oliveira et al., 2020). JExpert uses static analysis to avoid Abstract Syntax Tree (AST). Therefore, it reduces the response time when analyzing large systems with hundreds of source elements, such as LOC, imports, packages, and classes. Our goal is to support recruiters with a flexible, light­weighted means to identify library experts from source code. Figure 10 presents the simplified architecture design of JExpert. In the first moment, there are two modules: Projects and Library Name. These two modules are the input of JEx­ pert. In other words, JExpert receives as inputs two items, (i) projects in Java that contain the target libraries, i.e., sys­ tems from a local directory informed by the user, and (ii) the names (keywords) of the libraries that a developer wants to investigate. Module Activity Extractor is responsible for ex­ tracting the code elements necessary for the computation of activities made by a developer. Besides, this module removes the old projects, i.e., projects with commits made more than three years ago, projects with less than 1 KLOC, and projects without target library. Figure 10. JExpert Architecture Overview From the next step, the module Developer Data Analyzer computes all data about each developer. 
This module is re­ sponsible for separating the number of commits to libraries and changes made from source code in general, for instance, the number of lines of code written. This module also com­ putes the number of imports made by developers and verifies if an import is related to the target library. The Metric Collector module computes the three metrics, as mentioned in Section 3.1. Finally, the List of Experts is generated as output with the sorted list of expert candidates from our metrics. Such a list prioritizes the library experts based on a heuristic score, i.e., higher scores come first; cur­ rently, the tool returns a ”.csv” file for each library. 7 Threats to Validity We based our study on related work to support the evalu­ ation of a strategy to identify library experts. Regarding the assessment, we conducted a careful empirical study to assess the efficiency of the strategy from software systems hosted by GitHub. The strategy evaluated can analyze source code from platforms that follow the Git architecture. However, some threats to validity may affect our research findings. The main threats and respective treatments are discussed below based on the proposed categories of Wohlin et al. (Wohlin et al., 2012). Construct Validity. This validity is related to whether mea­ surements in the study reflect real­world situations (Wohlin et al., 2012). Before running the strategy, we conducted careful filtering of software systems from GitHub reposito­ ries. However, some threats may affect the correct filtering of systems, such as human factors that wrongly lead to a valid system’s discard to be evaluated. Considering that the exclusion criteria to system selection were applied in a manual process, we may have discarded interesting systems that we identified as non­Java, for instance. Internal Validity. The validity is related to uncontrolled aspects that may affect the strategy results (Wohlin et al., 2012). The strategy may be affected by some threats. To treat this possible problem, we selected a sample of 5 software systems that contain the library Hadoop from our dataset, with a diversified number of LOC. Then, we manually identified the number of commits from the GitHub Presenting the new SBC journal template Oliveira et al. 2020 repository, the number of imports, and the number of LOC codified to the specific library. We compared our manual results with the results provided by the tool and observed a loss of 5% in metrics terms computed through the automated process. We believe that this error rate does not invalidate our main conclusions. In addition, our strategy has the goal to reduce the search space to identify library experts, that is, we do not recommend a specific developer. External Validity. This validity is related to the possibility of generalizing our results (Wohlin et al., 2012). We evalu­ ated the strategy with a set of 16,703 software projects from GitHub. Considering that these systems may not include all existing libraries, our findings may not be generalized. Fur­ thermore, we evaluated the strategy with an online survey with only 158 developers that implemented projects with the investigated libraries. We analyzed the data with only 15 Java libraries. However, we chose the top libraries from the survey reported by StackOverflow in 2018, with over 100,000 re­ sponses from developers around the world. We also analyzed microservices libraries. This way, we believe these libraries can represent a reasonable option to evaluate the strategy. 
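Returning briefly to the JExpert pipeline of Section 6, the skeleton below summarizes how its four modules could be wired together. The interfaces, records, and method signatures are hypothetical and only loosely follow the module names in Figure 10; they are a sketch of the described architecture, not the actual JExpert code.

```java
import java.util.List;
import java.util.Map;

// Hypothetical skeleton of the pipeline described in Section 6 (Figure 10):
// Activity Extractor -> Developer Data Analyzer -> Metric Collector -> List of Experts.
public class JExpertPipelineSketch {

    record Commit(String hash, String authorEmail, String fileContent, int changedLoc) {}
    record DeveloperMetrics(String email, long commits, long libraryImports, double libraryLoc) {}

    interface ActivityExtractor {
        // Reads the local projects and returns the relevant commits per project,
        // already filtered by the exclusion criteria of Section 2.3.
        Map<String, List<Commit>> extract(List<String> projectDirs, String libraryPackagePrefix);
    }

    interface DeveloperDataAnalyzer {
        // Groups commits by developer and separates library-related activity
        // (imports, changed LOC) from general activity.
        Map<String, List<Commit>> groupByDeveloper(Map<String, List<Commit>> commitsPerProject);
    }

    interface MetricCollector {
        // Computes the three metrics of Section 3.1 for each developer.
        List<DeveloperMetrics> compute(Map<String, List<Commit>> commitsPerDeveloper,
                                       String libraryPackagePrefix);
    }

    // Output step: sorts candidates by a heuristic score (here, library LOC);
    // the real tool is reported to write one .csv file per library.
    static List<DeveloperMetrics> listOfExperts(List<DeveloperMetrics> metrics) {
        return metrics.stream()
                .sorted((a, b) -> Double.compare(b.libraryLoc(), a.libraryLoc()))
                .toList();
    }
}
```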
8 Related Work

The use of data from GitHub to understand how software developers work and collaborate has become recurrent in software engineering studies (Greene and Fischer, 2016; Singer et al., 2013; Ortu et al., 2015; Destefanis et al., 2016; Ma et al., 2009; Begel et al., 2010; Moraes et al., 2010). Some studies seek to understand the behavior of developers in their interactions with peers (Ortu et al., 2015). For example, a few studies (Ortu et al., 2015, 2016) tried to identify which developers exhibit peaceful behavior and which exhibit aggressive behavior, and whether these developers coexist productively in software development projects (Ortu et al., 2016). Similar studies also investigated whether there is a relationship between bug resolution time and developer behavior (Ortu et al., 2015). Other studies investigated developers' manners (Destefanis et al., 2016) and the emotional behavior of software developers (Ortu et al., 2016).

Schuler and Zimmermann (2008) investigated developer expertise based on commit activities, which manifest themselves whenever developers use functionality. They present preliminary results for the Eclipse project, showing that it is possible to create expertise profiles that indicate which APIs a developer may be an expert in, based on their use of those APIs. Wu et al. (2011) proposed DREX, an approach to bug assignment using k-nearest-neighbor search and social network analysis. The approach works as follows: 1) find textually similar bug reports, 2) extract the developers involved in their resolution, and 3) rank the developers' expertise by analyzing their participation in resolving similar bugs. An evaluation on bug reports from the Firefox OSS project shows that the social network analysis of DREX outperforms a purely textual approach, with a prediction accuracy of about 15%.

In closely related work, Greene and Fischer (2016) developed an approach to extract technical information about GitHub developers. Their work does not differentiate developers by their level of knowledge of technical skills, even though a recruiter typically has several candidates for the same job position. Besides, their work only shows the profile of users on GitHub and does not extract other characteristics of their knowledge and skills. Another limitation is that they neither provide actual data about the developers' knowledge production nor present a survey to evaluate the results. Singer et al. (2013) investigated the use of profile aggregators by developers and recruiters to evaluate developer skills. However, these aggregators only gather skills for individual developers, and it is not clear how they support the identification of relevant developers in a large dataset.

We believe that the strategy evaluated in our study is complementary to the described related work, providing a different approach that focuses on reducing the search space to identify possible experts. For instance, by combining our results with CVExplorer (Greene and Fischer, 2016), it is possible to select skills in a given programming language and then analyze the metrics presented in our paper. To the best of our knowledge, there is no similar large-scale study that evaluates a strategy to identify library experts. Hence, we cannot compare the evaluated strategy with other studies.
9 Conclusion

In this paper, we evaluated a strategy to reduce the search space for identifying library experts in software systems through source code analysis. We also presented a prototype tool that implements the strategy. The evaluated strategy is composed of three metrics: Number of Commits, Number of Imports, and Lines of Code. We assessed the strategy along two dimensions: applicability and precision. First, the Applicability Evaluation analyzed the feasibility of identifying library expert candidates in large datasets. Second, the Precision Evaluation compared the results provided by the strategy with developers' perceptions collected in a survey. In total, we analyzed 16,703 software systems mined from GitHub, 15 libraries, and a survey with 158 developers. Our findings point out that the strategy was able to identify library experts in different libraries from the set of input software systems with a precision of 71% on average.

There are many possible extensions to this work. For instance, we did not consider all available data in our analysis, such as the number of forks, the number of the developer's projects that have received stars, the number of followers, the number of methods, source code quality, and contributions to project discussions. Besides, we did not consider the number of lines of code added and removed between versions. Future work can also extend our research to evaluate the strategy with other programming languages and libraries.

References

Alshuqayran, N., Ali, N., and Evans, R. (2016). A systematic mapping study in microservice architecture. In 9th International Conference on Service-Oriented Computing and Applications (SOCA), pages 44–51.

Basili, V., Caldiera, G., and Rombach, H. D. (1994). The Goal Question Metric Approach. Online Technical Report.

Begel, A., Khoo, Y. P., and Zimmermann, T. (2010). Codebook: discovering and exploiting relationships in software repositories. In 32nd International Conference on Software Engineering (ICSE), pages 125–134.

Brown, V. R. and Vaughn, E. D. (2011). The writing on the (Facebook) wall: The use of social networking sites in hiring decisions. Journal of Business and Psychology, 26(2):219.

Capiluppi, A., Serebrenik, A., and Singer, L. (2013). Assessing technical candidates on the social web. IEEE Software, 30(1):45–51.

Constantinou, E. and Kapitsaki, G. M. (2016). Identifying developers' expertise in social coding platforms. In 42nd Euromicro Conference on Software Engineering and Advanced Applications (SEAA), pages 63–67.

Dabbish, L., Stuart, C., Tsay, J., and Herbsleb, J. (2012). Social coding in GitHub: Transparency and collaboration in an open software repository. In 12th Conference on Computer Supported Cooperative Work (CSCW), pages 1277–1286.

Damasiotis, V., Fitsilis, P., Considine, P., and O'Kane, J. (2017). Analysis of software project complexity factors. In Proc. of the 2017 International Conference on Management Engineering, Software Engineering and Service Sciences, pages 54–58.

Destefanis, G., Ortu, M., Counsell, S., Swift, S., Marchesi, M., and Tonelli, R. (2016). Software development: do good manners matter? PeerJ Computer Science, 2(2):1–10.

Easterbrook, S., Singer, J., Storey, M.-A., and Damian, D. (2008). Selecting empirical methods for software engineering research. In Guide to Advanced Empirical Software Engineering, pages 285–311.

Ferreira, M., Mombach, T., Valente, M. T., and Ferreira, K. (2019). Algorithms for estimating truck factors: A comparative study. Software Quality Journal, 1(27):1–37.
Garcia, V. C., Lucrédio, D., Alvaro, A., Almeida, E. S. D., de Mattos Fortes, R. P., and de Lemos Meira, S. R. (2007). Towards a maturity model for a reuse incremental adoption. In 7th Brazilian Symposium on Software Components, Architectures, and Reuse (SBCARS), pages 61–74.

Greene, G. J. and Fischer, B. (2016). CVExplorer: Identifying candidate developers by mining and exploring their open source contributions. In 31st International Conference on Automated Software Engineering (ASE), pages 804–809.

Joblin, M., Apel, S., Hunsen, C., and Mauerer, W. (2017). Classifying developers into core and peripheral: An empirical study on count and network metrics. In 39th International Conference on Software Engineering (ICSE), pages 164–174.

Klock, S., van der Werf, J. M. E. M., Guelen, J. P., and Jansen, S. (2017). Workload-based clustering of coherent feature sets in microservice architectures. In 2017 IEEE International Conference on Software Architecture (ICSA), pages 11–20.

Krüger, J., Wiemann, J., Fenske, W., Saake, G., and Leich, T. (2018). Do you remember this source code? In 40th International Conference on Software Engineering (ICSE), pages 764–775.

Ma, D., Schuler, D., Zimmermann, T., and Sillito, J. (2009). Expert recommendation with usage expertise. In International Conference on Software Maintenance (ICSM), pages 535–538.

Ma, W., Chen, L., Zhang, X., Zhou, Y., and Xu, B. (2017). How do developers fix cross-project correlated bugs? A case study on the GitHub scientific Python ecosystem. In 39th International Conference on Software Engineering (ICSE), pages 1–12.

Marlow, J. and Dabbish, L. (2013). Activity traces and signals in software developer recruitment and hiring. In 16th Conference on Computer Supported Cooperative Work (CSCW), pages 145–156.

McCuller, P. (2012). How to Recruit and Hire Great Software Engineers: Building a Crack Development Team. Apress.

Mockus, A. and Herbsleb, J. D. (2002). Expertise Browser: a quantitative approach to identifying expertise. In 24th International Conference on Software Engineering (ICSE), pages 503–512.

Moraes, A., Silva, E., da Trindade, C., Barbosa, Y., and Meira, S. (2010). Recommending experts using communication history. In 2nd International Workshop on Recommendation Systems for Software Engineering, pages 41–45.

Oliveira, J., Fernandes, E., Souza, M., and Figueiredo, E. (2016). A method based on naming similarity to identify reuse opportunities. In 7th Brazilian Symposium on Information Systems: Information Systems in the Cloud Computing Era - Volume 1, pages 41:305–41:312.

Oliveira, J., Pinheiro, D., and Figueiredo, E. (2020). Web site of the paper. https://johnatan-si.github.io/JSERD2020/.

Oliveira, J., Viggiato, M., and Figueiredo, E. (2019). How well do you know this library? Mining experts from source code analysis. In 18th Brazilian Symposium on Software Quality (SBQS), pages 49–58.

Ortu, M., Adams, B., Destefanis, G., Tourani, P., Marchesi, M., and Tonelli, R. (2015). Are bullies more productive? Empirical study of affectiveness vs. issue fixing time. In 12th Working Conference on Mining Software Repositories (MSR), pages 303–313.

Ortu, M., Destefanis, G., Counsell, S., Swift, S., Tonelli, R., and Marchesi, M. (2016). Arsonists or firefighters? Affectiveness in agile software development. In 18th International Conference on Agile Software Development (XP), pages 144–155.
Pahl, C. (2015). Containerization and the PaaS cloud. IEEE Cloud Computing, 2(3):24–31.

Pfleeger, S. L. and Kitchenham, B. A. (2001). Principles of survey research: Part 1: Turning lemons into lemonade. SIGSOFT Software Engineering Notes, 26(6):16–18.

Saxena, R. and Pedanekar, N. (2017). I know what you coded last summer: Mining candidate expertise from GitHub repositories. In Companion of the 17th Conference on Computer Supported Cooperative Work and Social Computing (CSCW), pages 299–302.

Schuler, D. and Zimmermann, T. (2008). Mining usage expertise from version archives. In Proceedings of the 2008 International Working Conference on Mining Software Repositories (MSR), pages 121–124.

Singer, L., Filho, F. F., Cleary, B., Treude, C., Storey, M.-A., and Schneider, K. (2013). Mutual assessment in the social programmer ecosystem: an empirical investigation of developer profile aggregators. In 13th Conference on Computer Supported Cooperative Work (CSCW), pages 103–116.

Sommerville, I. (2015). Software Engineering. Pearson.

Tong, J., Ying, L., Hongyan, T., and Zhonghai, W. (2016). Can we use programmer's knowledge? Fixing parameter configuration errors in Hadoop through analyzing Q&A sites. In 5th IEEE International Congress on Big Data (BigData Congress), pages 478–484.

Tsui, F., Karam, O., and Bernal, B. (2016). Essentials of Software Engineering. Jones & Bartlett Learning.

Viggiato, M., Oliveira, J., Figueiredo, E., Jamshidi, P., and Kästner, C. (2019). Understanding similarities and differences in software development practices across domains. In 14th International Conference on Global Software Engineering (ICGSE), pages 74–84.

Wohlin, C., Runeson, P., Höst, M., Ohlsson, M. C., Regnell, B., and Wesslén, A. (2012). Experimentation in Software Engineering. Springer Publishing Company, Incorporated.

Wu, W., Zhang, W., Yang, Y., and Wang, Q. (2011). DREX: Developer recommendation with k-nearest-neighbor search and expertise ranking. In 18th Asia-Pacific Software Engineering Conference, pages 389–396.

Ye, C. (2017). Research on the key technology of big data service in university library. In 13th International Conference on Natural Computation, Fuzzy Systems and Knowledge Discovery (ICNC-FSKD), pages 2573–2578.