Journal of Software Engineering Research and Development, 2023, 11:10, doi: 10.5753/jserd.2023.3082
This work is licensed under a Creative Commons Attribution 4.0 International license.

Identifying and Addressing Problems in the Estimation Process: A Case Study Applying Action Research

Ana M. Debiasi Duarte [Universidade do Oeste de Santa Catarina | ana.duarte@unoesc.edu.br]
Ieda Margarete Oro [Universidade do Oeste de Santa Catarina | ieda.oro@unoesc.edu.br]
Karine Vidor [Universidade do Oeste de Santa Catarina | karine.vidor@unoesc.edu.br]
Denio Duarte [Universidade Federal da Fronteira Sul | duarte@uffs.edu.br]

Abstract

The literature shows that a large share of software projects exceeds its estimated effort and duration, even though the discipline of software project management keeps evolving. Through its best practices, software engineering tries to reduce flaws in software development, and several techniques and resources have been proposed to mitigate this problem. This paper proposes an approach based on action research to improve the estimation process in software development by identifying its problems. A case study is carried out to show the effectiveness of our approach. The results show an improvement of 50% in accuracy over the baseline estimation process.

Keywords: software estimation, process improvement, agile methodologies, action research

1 Introduction

Agile software development (ASD) is usually adopted as an alternative to more traditional approaches, e.g., waterfall or evolutionary development. The key elements of the latter are extensive planning, rigorous reuse, and codified processes; ASD, in contrast, is based on iterative and incremental development models (Larman and Basili, 2003; Hohl et al., 2018). Although ASD intends to make software development easier than traditional approaches, it still suffers from the effort and size estimation problem. Effort estimation can be defined as the process by which effort is assessed and an estimate is produced of the number of resources required to complete project activities and deliver a product or service that meets the given functional and non-functional requirements of a customer (Trendowicz and Jeffery, 2014). Several methods (metrics) have been proposed to estimate effort, e.g., planning poker, expert judgment, and Wideband Delphi. However, the accuracy of software effort estimation models for ASD remains inconsistent (Pillai et al., 2017).

The Standish Group CHAOS Report (2018) showed that many software companies struggle to develop their products within strict schedule and budget constraints. In 2018, companies either finished their projects behind schedule and over budget (48%-65%) or failed to complete them (48%-56%). These findings show that most projects overran their planned effort and schedule compared to the estimations. It is well known that cost underestimation brings inefficiencies to a project (Nhung et al., 2019). Gupta et al. (2019) present the most common factors whose absence or mishandling causes flaws in software projects: (i) top management's commitment and involvement/support; (ii) allocation of scarce resources; (iii) communication among the various stakeholders; (iv) team configuration and structure; and (v) social cohesion in the team, together with the complexity of the project and the organizational culture. In this paper, we focus on software development effort estimation.
We intend to offer an approach that minimizes the error of one of the main software project problems: effort estimation. The action research method (McKay and Marshall, 2001) allows us to involve researchers and developers in finding an approach to solve the target problem. Based on the steps performed in action research (see Figure 1), we propose an approach to improve effort estimation, demonstrated through a case study. An ASD team from a software development company and the researchers participate in all phases of our approach to arrive at a suitable process for estimating effort. Using historical data, we identify the problems that degrade the effort estimation process and build our approach around them. The results show that our proposal improves the accuracy of the effort estimation process by a factor of 1.5. We believe these promising results can help companies using ASD to minimize flaws in software projects.

The rest of this paper is organized as follows: Section 2 briefly presents software development effort estimation and management, and Section 3 presents works related to ours. Next, we introduce our methods. Section 5 presents our approach and its application as a case study. Finally, Section 6 concludes this paper.

2 Background

Software development effort estimation plays a crucial role in software development projects. Building reliable software processes for executing software projects that deliver on time, respect the budget, and remain cost-effective is challenging (Sommerville, 2015). Developers have struggled with software development effort estimation since the 1960s (Gautam and Singh, 2018). Effort estimation, as defined in Section 1, is crucial for finishing a project on time and within budget (Trendowicz and Jeffery, 2014). Accurate effort estimations can contribute to the success of software development projects, while incorrect estimations can negatively affect product development, leading to monetary losses (Altaleb and Gravell, 2018).

Software project estimation involves estimating the effort, size, staffing, schedule (time), and cost involved in creating a unit of the software product (Jorgensen and Shepperd, 2006; Pillai et al., 2017). Productivity relates the size of the software produced to the amount of work spent developing it (Fenton and Bieman, 2014). It can be measured in several ways, but function point analysis (FPA) is the most common. FPA can be applied before writing the program, based on the system requirements, so it is possible to estimate the effort and schedule of development activities.
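As a worked illustration of this productivity measure, reading the ratio as size over effort in line with Fenton and Bieman's definition (the figures below are illustrative, not from the case study):

```latex
% Productivity as output (size) over input (effort); illustrative figures.
\text{productivity} = \frac{\text{size}}{\text{effort}}
                    = \frac{120~\text{function points}}{300~\text{person-hours}}
                    = 0.4~\text{FP/person-hour}
```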
Many variables can impact a software development team's productivity; one of them is time management. Inadequate time management usually stems from a lack of daily planning, unmanaged commitments, and accepting more tasks than feasible, among other causes (Sá et al., 2017). However, some techniques help to manage time better. One of those is the Pomodoro technique, created by Cirillo (2022), which aims to track the time spent on activities and eliminate internal and external distractions.

Planning and supervising the project are needed to check the development team's productivity and software quality; those tasks are essential in the software development process. According to the Project Management Body of Knowledge (PMBOK) (PMI, 2021), a project is a temporary effort to progressively create a product, service, or unique result. Managing a project means applying knowledge, abilities, and tools to support the scheduled requirements. According to Pressman (2014), successful project management begins with an accurate estimation of the development effort; however, estimation is still imprecise, contributing to failed software projects. Usually, estimation is made by applying techniques to historical project bases. However, Maxwell (2001) claims that, more than simply registering productivity data, analysis is needed to improve the estimation process and to understand its influence on projects and their productivity contexts. According to Kirmani and Wahid (2015), efficiency, product delivery on time, and the desired quality level are features that influence the software development process. Therefore, collecting data through measurements taken during project execution, usually based on qualitative and quantitative information, is crucial.

Software projects are complicated in any context and are especially prone to failure (Bannerman, 2008). There is no fail-proof project, but it is possible to be ready for unforeseen problems. Agile methods, including Scrum, were created precisely to deal with project uncertainties, in contrast to traditional methods that try to plan everything before development starts. Scrum is a lightweight framework that helps people, teams, and organizations generate value through adaptive solutions for complex problems. In each iteration, the team analyzes the requirements, technology, and abilities, then splits the work to create and deliver the best software it can, adapting daily as complexities and surprises arise (Schwaber and Sutherland, 2020). Scrum employs an iterative, incremental approach to optimize predictability, and it engages groups of people who collectively have all the skills and expertise to do the work, sharing or acquiring such skills as needed.

3 Related Work

Action research (AR) has been applied in several case studies, from software development to healthcare (Elg et al., 2020; Cordeiro and Soares, 2018). Its basic principle is that the researchers change their role from external observers to participants in solving concrete problems (Bradbury-Huang, 2010). Regarding software engineering (SE), there are several proposals applying AR to SE problems. In 2006, Dingsøyr et al. (2006) used an action research study to apply the Scrum software development process in a small cross-organizational development project. More recently, Hoda et al. (2014) combined action research as the overall research framework, elements of user-centered design (for evaluation by end-users) and participatory design as the design frameworks, and Scrum as the software development framework. Marinho et al. (2015) presented the development of an uncertainty management guide designed through action research; the proposed guide was applied in a software development company aiming to reduce the uncertainties in software projects.
Conversely, Choraś et al. (2020) proposed a set of metrics that measure the agile software development process in small and medium-sized companies; they were built as part of an action research collaboration involving a team of researchers. Action research has also been applied to developing students' competencies during the learning and teaching process in software engineering, using thinking-based learning (Flores and de Alencar, 2020). The cited works show that the action research method is widely used as a support tool to help industry improve its processes. In this work, we intend to contribute to industry by proposing the use of action research to address problems in the software development estimation process.

4 Methods

This paper applies qualitative research using a case study approach to evaluate our proposal (Gil et al., 2002; Godoy, 1995). According to Creswell (2010), in qualitative studies the researcher uses a particular language to describe what they expect to understand, discover, or develop as a theory, mainly through findings or theories; besides, they must cover a minimum amount of literature, enough to discuss the issue.

The research was developed using the action research method. Thiollent (2011) defines action research as a theoretical and methodological approach responsible for an essential contribution to the methodology of investigating social phenomena, known as a research line directed at collective actions. The method is based on joining research and action in a process in which the implicated actors and researchers seek to interactively understand the reality in which they find themselves and to identify common issues by searching for and experimenting with solutions in real situations. The knowledge produced by the research is treated as a composite construction (Peruzzo, 2016).

In our case study, researchers and software engineers work together in all project phases. The collaboration aims to solve a given problem in a software project. We use an action research (AR) process adapted from McKay and Marshall (2001), pictorially shown in Figure 1. Note that the AR process is composed of 8 steps. In the following, we present and discuss every step in the context of our proposal.

Figure 1. Steps performed in action research.

5 Case Study: Planning, Execution, and Results

Our primary goal in this work is to apply action research in the context of software effort estimation in an ASD approach. To accomplish that, we carried out a case study around our proposal. In this section, we present the target company and development team and the implementation of our proposal. Figure 1 guides how AR is applied to the effort estimation problem.

5.1 Characterization of the Company and Team

The case study was conducted in a software development company that provided previous software effort estimations for analysis. As described in Gil (2008), a case study is an analysis of situations that occur in real life; it is applied to obtain detailed knowledge and draw conclusions. The available estimations comprise 31 sprints. The historical data comprises 100 different functionalities, 302 stories, and 568 programming tasks. The target company uses a Scrum-like process for its software development, so the team is familiar with Scrum and its good practices. Besides, points are used to estimate the sprint size.
For every sprint task, the size is calculated, and the sum of all task sizes gives the sprint size in points. The points and the corresponding task complexities are calculated from the development effort applied previously, i.e., the company's historical data. The participants in this case study (i.e., the development team) are 7 software engineers.

To build our estimation methodology, we first studied the company's current process. This allowed us to propose a new estimation-support method grounded in AR. The project office defines all the product phases, followed by the estimation phase. In the estimation phase, demands are presented to the development team, and the team estimates the size of the demands for every sprint. Then the project development phase starts: the project team prioritizes demands inside the sprint, and the development team starts working. After 15 days, all deliverables for a given sprint are produced.

Step 1: Problem identification

The first step of the AR process was applied using a bibliography survey. We searched papers in seven different academic databases with the following search strings: "agile project management" AND "risks analysis" AND "software engineering" AND "software estimation" AND "agile management" AND "agile methods" AND "scrum" AND "software metrics". The search retrieved 1,006 papers. To reduce the number of working papers, we applied the following selection criteria: (i) papers written in English, (ii) abstracts showing that effort estimation and Scrum are used in the approach, and (iii) the reputation of the publication vehicle (using h-index and number of citations as a guide). Using those criteria, we selected 23 papers. The team and the researchers read and discussed the papers to make everyone involved aware of the literature on software estimation in ASD. During the discussions, we identified several classical software development problems, such as imprecise schedules, unplanned costs, and delays that might influence the negotiation with the customer.

Based on the discussions, the team built a sheet containing variables about the 31 sprints used in our case study. To make the development process adequate, we identified the variables that help understand the company's historical productivity database. Table 1 presents the sheet built from the collected data, where (i) Sprint represents the sprint number (identifier), (ii) Story stores the number of stories, (iii) Task represents the number of tasks, (iv) Avail. time (h) is the available time (in hours) to accomplish the sprint, (v) Est. points shows the number of estimated points (effort), and (vi) Del. points represents the number of delivered points.

Table 1. Historical data from 31 finished sprints.

| Sprint | Story | Task | Avail. time (h) | Est. points | Del. points |
|---|---|---|---|---|---|
| #1 | 15 | 31 | 274 | 6 | |
| #2 | 9 | 16 | 183 | 72 | |
| #3 | 9 | 17 | 204 | 109 | |
| #4 | 12 | 30 | 134 | 93 | |
| #5 | 4 | 13 | 134 | 59 | |
| #6 | 4 | 18 | 204 | 40 | |
| #7 | 9 | 21 | 204 | 136 | |
| #8 | 2 | 4 | 204 | 33 | |
| #9 | 9 | 20 | 204 | 131 | 118 |
| #10 | 6 | 17 | 183 | 78 | 99 |
| #11 | 4 | 6 | 183 | 28 | 129 |
| #12 | 3 | 4 | 134 | 24 | |
| #13 | 2 | 2 | 183 | 48 | 48 |
| #14 | 9 | 12 | 204 | 143 | 81 |
| #15 | 2 | 11 | 183 | 15 | 62 |
| #16 | 3 | 9 | 183 | 31 | 65 |
| #17 | 9 | 18 | 274 | 77 | 130 |
| #18 | 3 | 16 | 274 | 70 | 67 |
| #19 | 3 | 10 | 218 | 44 | 58 |
| #20 | 6 | 20 | 274 | 49 | 138 |
| #21 | 10 | 27 | 204 | 133 | 190 |
| #22 | 6 | 10 | 204 | 95 | 81 |
| #23 | 13 | 7 | 204 | 130 | 62 |
| #24 | 12 | 19 | 204 | 130 | 65 |
| #25 | 28 | 37 | 183 | 127 | 122 |
| #26 | 36 | 39 | 274 | 122 | 166 |
| #27 | 13 | 28 | 183 | 110 | 74 |
| #28 | 13 | 31 | 274 | 131 | 79 |
| #29 | 11 | 11 | 309 | 12 | 145 |
| #30 | 10 | 20 | 344 | 92 | 98 |
| #31 | 27 | 44 | 274 | 128 | 131 |
| Total | 302 | 568 | | | |

Table 1 covers the historical data since the company started using Scrum. Note that the estimation effort is not very accurate and that, in the beginning, the team did not even register the delivered points (first eight sprints). We decided to use the last eight sprints to measure the variance between planned and executed time; this choice was based on the team's maturity in planning the sprint. Table 2 shows that the mean variation between planned and executed time is around 35%. Note that the standard deviation is also high: the mean plus or minus one standard deviation spans roughly 15% to 55%. For example, in sprint #30 the variation was 41%, whereas in sprint #29 it was 6%. Those numbers show that the team's estimations could have been more accurate.

Table 2. Time variation: planned versus executed.

| Sprint | Planned time (hours) | Executed time (hours) | Difference | Variation |
|---|---|---|---|---|
| #24 | 141 | 192 | +51 | 36% |
| #25 | 260 | 304 | +44 | 17% |
| #26 | 328 | 459 | +131 | 40% |
| #27 | 147 | 263 | +116 | 79% |
| #28 | 161 | 205 | +44 | 27% |
| #29 | 84 | 89 | +5 | 6% |
| #30 | 153 | 215 | +62 | 41% |
| #31 | 274 | 366 | +92 | 34% |
| Variation mean | | | | 35% (± 20.01%) |

To calculate the variation, we used Equation 1. This equation is also used in Bilgaiyan et al.
(2017) and De Souza (2013):

$$ Var = \frac{ET - PT}{PT} \qquad (1) $$

where Var is the estimation variation, ET is the executed time, and PT is the planned time.

The main problems identified from the data analysis of the effort and size estimates were:

- lack of precision in the effort and size estimations because of register failures or lack of information in the historical bases;
- new demands (e.g., corrections or crucial new requirements) that are not formally specified and, sometimes, lack further details.
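As a concrete reading of Equation 1, the tiny helper below (a hypothetical illustration, not part of the company's tooling) reproduces the figures in Table 2:

```java
// Relative variation between executed and planned time (Equation 1).
// For sprint #24: variation(192, 141) ≈ 0.36, i.e., the 36% shown in Table 2.
static double variation(double executedHours, double plannedHours) {
    return (executedHours - plannedHours) / plannedHours;
}
```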
Step 2: Recognizing facts about the problem

To collect data about problems in the estimation process, we applied a survey containing 48 objective questions divided into four categories: general, specifications and estimation, sprint, and effort estimation (see Table 3). We used the Likert scale (Likert, 1932) to evaluate the estimation problems. The respondents could choose between the alternatives "always", "usually", "sometimes", "rarely", and "never". Answering the survey took an average of 30 minutes. Even though the respondents remained anonymous, we noticed that some were uncomfortable criticizing the company's estimation process. We tried to mitigate this by asking them to answer the survey in separate rooms and to use the same type of pen. Even though some criticisms might have been omitted, we believe the results coherently reflect the reality of the development team.

Table 3 shows the proposed categories along with their subcategories (where applicable). We encoded each category to make it easier to present our solution to the problems.

Table 3. Problems influencing estimations.

| Category | Code | Description |
|---|---|---|
| General | RSC01 | Performing non-sprint tasks |
| Specification and estimation | RSC02 | Size estimation process not performed |
| Sprint | RSC03 | Burndown chart not used |
| Sprint | RSC04 | Lack of commitment in the delivery of tasks |
| Effort estimation | RSC05 | Effort estimation process not performed |
| Effort estimation | RSC06 | Estimates of urgent tasks not performed |
| Effort estimation | RSC07 | Lack of technical knowledge |

In the following, we briefly discuss each proposed category. In the category general, the results show there are situations in which the workers stop planned tasks to perform unplanned tasks that were not expected when the schedule was created (code RSC01 in Table 3). Code RSC02 (category specifications and estimation) means that there are situations where size estimates are not carried out. This compromises delivery in several ways, such as imprecise schedules and unplanned costs. The category sprint captures that participants did not use the burndown chart as a stimulus to reach the sprint goals (see RSC03 in Table 3). Not using the chart contributes to the development team not following the sprint performance, meaning there is no way to know whether it is on schedule. Another identified problem is that only rarely, or sometimes, are tasks delivered ready to use, without any faults, pointing to the fact that some tasks will need to go through unscheduled corrections (see RSC04 in Table 3). As the burndown chart is not regularly maintained in the daily meetings, the team does not commit to the day-to-day delivery of goals.

The category effort estimation comprises three further problems. RSC05 states that the effort estimation process does not occur regularly, which means there are situations where estimates are not made. Another problem is that some urgent tasks are added during the sprint without effort estimation; this may cause delays in task execution and problems in the estimation measurements (see RSC06 in Table 3). Finally, RSC07 states that developers need to be aware of all the pre-existing code in the application, yet that code is rarely consulted during the estimation process. This lack of orientation causes uncertainty in the estimation, meaning that the code will likely have to be updated during development.

From this analysis, Table 3 gathers the problems that must be examined and discussed to propose a method that minimizes their effects as much as possible. The study and discussion of the relevant literature, in addition to the survey results, let us conclude the following: (i) the team does not have much experience in measuring effort, (ii) there is not much historical data about productivity, and (iii) the team is not very confident regarding effort measurement. The team usually estimates backlog stories; however, a new story inserted into a running sprint is rarely estimated. Based on the two previous steps, we planned how to solve or minimize the problems faced by the team. This is the third step of action research.

Step 3: Activity planning

In Step 2, we identified the estimation process problems. The analysis confirmed problems in the estimation process, and this served as a process improvement opportunity. We analyzed the current process used by the company and proposed an approach for improving it according to the problems identified in Table 3.

RSC01: The proposed solution was to implement a kanban (Stellman and Greene, 2014; Dos Santos et al., 2018), so that unplanned tasks may be executed without interfering with the progress of the current sprint. This process may also be used to attend to urgent corrections; the kanban should run alongside the sprint.

RSC02 and RSC05: We proposed changes in the way of estimating. The estimation order was inverted in the proposed model: before the sprint meeting starts, size estimation is done and the backlog must be prioritized. Then, the effort estimation process, which happens during sprint planning, can begin.

RSC03: The daily meetings should update the burndown plot.
Developers must answer three questions: "What have you done today?", "What will you do tomorrow?", and "Which problems have you faced?". These three questions were inspired by the Scrum Guide 2017 (Schwaber and Sutherland, 2017), the version currently used by the company.

RSC04: There should be regular updates to the plot in each daily meeting, and the team should justify internally (among the developers) the daily results with respect to the goal.

RSC06: This problem will be minimized by the kanban proposed for RSC01. In this process, at least one developer will be ready to rapidly handle unplanned incoming tasks.

RSC07: To address this problem, the company must provide specialized training to the developers on the subjects in which they face more problems.

After the discussion of the proposed approach, its implementation must be defined. First, the project office plans the definition phase, from the requirements to the implementation. The planning feeds a project management tool to better control the outputs; in this case study, Redmine (www.redmine.org) was the chosen tool. Later, the planning and product project phase starts: the project team estimates the size of the demands and then prioritizes the backlog (RSC07). In case of a correction or urgency, the demand is sent to the kanban (RSC01 and RSC06). If not, it goes to the sprint, and a meeting with the developers is called. In this phase, the demands, requirements, and related interfaces are presented to the development team. The team debates what was presented and estimates the effort of those demands (RSC05). The project development phase starts once the sprint opening is done. The project team selects the prioritized stories to develop in the sprint. The team starts working, and the stories are finished by the end of 15 days and presented at the sprint meeting.

Step 4: Implementation

This step was the implementation of what was planned in the sprint. The team compares the planned estimation to the actual size. Two sprints were used as pilots: sprints #36 and #37.

RSC01 and RSC06: Kanban was implemented to reduce these identified factors. A developer is now ready to solve any unforeseeable issue that may occur during the sprint and to make corrections. In the kanban, the task is developed, tested, and integrated directly into the main branch.

RSC02 and RSC05: Before the sprint opening meeting, the planning team decides which tasks will be in the sprint and estimates their sizes. Then, the team may analyze whether it is necessary to add or remove tasks to fit the upcoming sprint. Lastly, the development team estimates the effort, and the opening meeting is held.

RSC03 and RSC04: In the current model, the developers hold the daily meeting at the end of the afternoon and answer the three proposed questions: "What have you done today?", "What will you do tomorrow?", and "Which drawbacks have you faced?". Besides the meeting, the development team fills in the burndown plot, making it possible to analyze the plot and explain the daily results in relation to the goal.

RSC07: The impact of this factor was reduced by offering the development team opportunities to improve their technical knowledge. According to Singh et al. (2019), the people involved in a working process should be trained to guarantee their tasks are executed in the best way possible to fit the company's goals. To this end, and to reduce the impact of the identified problems, the company provided online training to the employees, besides intensifying knowledge-sharing practices.
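Since the burndown plot (RSC03 and RSC04) is central to the daily routine above, a minimal sketch of the data series behind such a plot may help; the names and structure here are hypothetical, not the team's actual tooling:

```java
// Remaining story points per sprint day: the series plotted in a burndown
// chart. Day 0 holds the full estimated sprint size; each subsequent entry
// subtracts the points delivered on that day.
static int[] burndownSeries(int sprintPoints, int[] pointsDeliveredPerDay) {
    int[] remaining = new int[pointsDeliveredPerDay.length + 1];
    remaining[0] = sprintPoints;
    for (int day = 0; day < pointsDeliveredPerDay.length; day++) {
        remaining[day + 1] = remaining[day] - pointsDeliveredPerDay[day];
    }
    return remaining;
}
```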
Step 5: Monitoring

This step consisted of the active participation of the researchers in implementing the process change measurements, by offering suggestions and helping validate the action results. The critical point of this step was to check the project's evolution and ensure that the schedule was adequate to reach the initial goals.

Step 6: Assessment of the results

During the case study, meetings were held to evaluate the results and discuss problems. These meetings raised issues about interruptions affecting the team's efficiency. The team could not control these interruptions. The interruptions were treated as a new problem so that an improvement could be implemented, and the action plan was then improved as described in the next step.

Step 7: Improving the action plans

After implementing the technical improvements and evaluating their effects, there were still many interruptions in the development environment. An interruption can be internal (by a team member) or external (by someone outside the sprint). Those interruptions reduce productivity. See the new problems in Table 4.

Table 4. New problems that may influence the estimates.

| Category | Code | Description |
|---|---|---|
| General | RSC08 | External interruptions |
| General | RSC09 | Internal interruptions |

We suggested the Pomodoro technique to solve the issue. During the pomodoro time, nobody may interrupt a colleague, except for very urgent issues. An online timer (www.tomatotimers.com) is used to control each pomodoro, and a sign was created, visible to everyone, to inform coworkers that a developer is in a pomodoro. One side says "pomodoro", and the other says "clear". To use the technique, the worker picks a task and counts 25 minutes as one pomodoro. Then, for each pomodoro, the working time must be logged in Redmine. Each worker must turn the pomodoro sign according to their status and pause for 5 minutes at most. After every four pomodoros, a longer pause (around 15 minutes) can be taken.
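The cycle just described (25-minute work slots, pauses of at most 5 minutes, and a longer pause of around 15 minutes after every fourth pomodoro) can be summarized in a small sketch; the class below is purely illustrative, not a tool the team used:

```java
// Minimal sketch of the Pomodoro cycle adopted in Step 7: 25 minutes of
// uninterrupted work, then a short pause; after every fourth pomodoro, a
// longer pause is taken instead.
final class PomodoroCycle {
    static final int WORK_MINUTES = 25;
    static final int SHORT_PAUSE_MINUTES = 5;
    static final int LONG_PAUSE_MINUTES = 15;

    private int completed = 0;

    // Returns the pause length that follows the pomodoro just finished.
    int nextPauseMinutes() {
        completed++;
        return (completed % 4 == 0) ? LONG_PAUSE_MINUTES : SHORT_PAUSE_MINUTES;
    }
}
```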
Step 8: Action-research cycle conclusion

Our approach was tested in eight new sprints to assess its performance in addressing the problems of task effort estimation. In total, nine problems had to be treated to improve the estimation process in the company. The improvement brought by our approach is shown in Table 5. The average variation between planned and executed time is 14.5% (standard deviation equals 6.8%). Compared to Table 2 (35% ± 20.01%), the accuracy improvement is approximately 1.5 times. The results indicate that the action research method, which involves cooperation between the researchers and the study participants, is helpful in reducing software effort estimation errors.

Table 5. Time variation: planned versus executed after the improvements.

| Sprint | Planned time (hours) | Executed time (hours) | Difference | Variation |
|---|---|---|---|---|
| #36 | 134 | 119 | -15 | 11% ↓ |
| #37 | 80 | 102 | +22 | 28% ↑ |
| #38 | 106 | 115 | +9 | 8% ↑ |
| #39 | 69 | 83 | +13 | 19% ↑ |
| #40 | 94 | 110 | +16 | 17% ↑ |
| #41 | 81 | 87 | +6 | 7% ↑ |
| #42 | 205 | 187 | -18 | 9% ↓ |
| #43 | 167 | 195 | +28 | 17% ↑ |
| Variation mean | | | | 14.5% (± 6.8%) |

6 Conclusion

This paper presented a case study investigating how action research can help developers address problems in the estimation process. We first studied the target company's estimation process and analyzed the historical data; then, we surveyed the development team to find the reasons for the effort estimation errors. Using the action research method, involving the researchers and developers, we proposed an approach to help the development team better estimate task effort. We accomplished our goal by identifying the problems and implementing changes in the current software estimation process. After implementing the suggested procedures, the results indicated that we reached the main goal: addressing the problems in the estimation process. By comparing the estimation time with and without our method, we improved estimation accuracy 1.5 times compared to the historical data. The action research method guided the whole process of our proposal and proved very effective in our case study. There are some threats to the validity of our approach; however, using 31 sprints as historical data and eight sprints to compare the results can satisfactorily validate our results. Recommendations for future work are: (i) increase the number of case studies to compare the results; (ii) apply methods that use statistics to treat historical productivity data in short- and long-term estimates; and (iii) evaluate known estimation problems in the software development process by analyzing the techniques for solving them.

Acknowledgments

The authors thank FAPESC for the financial support for the paper proofreading. Project approved with grant term n. 2021TR001877.

References

Altaleb, A. and Gravell, A. (2018). Effort estimation across mobile app platforms using agile processes: a systematic literature review. Journal of Software, 13(4):242.
Bannerman, P. L. (2008). Risk and risk management in software projects: a reassessment. Journal of Systems and Software, 81(12):2118-2133.
Bilgaiyan, S., Sagnika, S., Mishra, S., and Das, M. (2017). A systematic review on software cost estimation in agile software development. Journal of Engineering Science & Technology Review, 10(4).
Bradbury-Huang, H. (2010). What is good action research? Why the resurgent interest? Action Research, 8(1):93-109.
Choraś, M., Springer, T., Kozik, R., López, L., Martínez-Fernández, S., Ram, P., Rodriguez, P., and Franch, X. (2020). Measuring and improving agile processes in a small-size software development company. IEEE Access, 8:78452-78466.
Cirillo, F. (2022). Pomodoro Technique. Online; accessed 10-Dec-2022.
Cordeiro, L. and Soares, C. B. (2018). Action research in the healthcare field: a scoping review. JBI Evidence Synthesis, 16(4):1003-1047.
Creswell, J. W. (2010). Projeto de pesquisa: métodos qualitativo, quantitativo e misto. Penso Editora.
De Souza, L. L. C. (2013). Suporte ao processo de monitoramento e controle de projetos de software: uma abordagem inteligente com base na teoria do valor agregado. Master's dissertation, Universidade Estadual do Ceará.
Dingsøyr, T., Hanssen, G. K., Dybå, T., Anker, G., and Nygaard, J. O. (2006). Developing software with Scrum in a small cross-organizational project. In European Conference on Software Process Improvement, pages 5-15. Springer.
Dos Santos, P. S. M., Beltrão, A. C., de Souza, B. P., and Travassos, G. H. (2018). On the benefits and challenges of using kanban in software engineering: a structured synthesis study. Journal of Software Engineering Research and Development, 6(1):1-29.
Elg, M., Gremyr, I., Halldorsson, Á., and Wallo, A. (2020). Service action research: review and guidelines. Journal of Services Marketing.
Fenton, N. and Bieman, J. (2014). Software Metrics: A Rigorous and Practical Approach. CRC Press, USA, 3rd edition.
Flores, A. P. M. and de Alencar, F. M. R. (2020). Competencies development based on thinking-based learning in software engineering: an action-research. In Proceedings of the 34th Brazilian Symposium on Software Engineering, pages 680-689.
Gautam, S. S. and Singh, V. (2018). The state-of-the-art in software development effort estimation. Journal of Software: Evolution and Process, 30(12):e1983.
Gil, A. C. (2008). Métodos e técnicas de pesquisa social. Editora Atlas, 6th edition.
Gil, A. C. et al. (2002). Como elaborar projetos de pesquisa, volume 4. Atlas, São Paulo.
Godoy, A. S. (1995). Pesquisa qualitativa: tipos fundamentais. Revista de Administração de Empresas, pages 20-29.
Gupta, S. K., Gunasekaran, A., Antony, J., Gupta, S., Bag, S., and Roubaud, D. (2019). Systematic literature review of project failures: current trends and scope for future research. Computers & Industrial Engineering, 127:274-285.
Hoda, R., Henderson, A., Lee, S., Beh, B., and Greenwood, J. (2014). Aligning technological and pedagogical considerations: harnessing touch-technology to enhance opportunities for collaborative gameplay and reciprocal teaching in NZ early education. International Journal of Child-Computer Interaction, 2(1):48-59.
Hohl, P., Klünder, J., van Bennekum, A., Lockard, R., Gifford, J., Münch, J., Stupperich, M., and Schneider, K. (2018). Back to the future: origins and directions of the "Agile Manifesto", views of the originators. Journal of Software Engineering Research and Development, 6(1):1-27.
Jorgensen, M. and Shepperd, M. (2006). A systematic review of software development cost estimation studies. IEEE Transactions on Software Engineering, 33(1):33-53.
Kirmani, M. M. and Wahid, A. (2015). Use case point method of software effort estimation: a review. International Journal of Computer Applications, 116(15):43-47.
Larman, C. and Basili, V. R. (2003). Iterative and incremental developments: a brief history. Computer, 36(6):47-56.
Likert, R. (1932). A technique for the measurement of attitudes. Archives of Psychology, no. 140.
Marinho, M., Lima, T., Sampaio, S., and Moura, H. (2015). Uncertainty management in software projects: an action research. In Experimental Software Engineering Track, XVIII CIbSE Ibero-American Conference on Software Engineering. CIbSE.
Maxwell, K. D. (2001). Collecting data for comparability: benchmarking software development productivity. IEEE Software, 18(5):22-25.
McKay, J. and Marshall, P. (2001). The dual imperatives of action research. Information Technology & People.
Nhung, H. L. T. K., Hoc, H. T., and Hai, V. V. (2019). A review of use case-based development effort estimation methods in the system development context. In Proceedings of the Computational Methods in Systems and Software. Springer.
Peruzzo, C. (2016). Epistemologia e método da pesquisa-ação: uma aproximação aos movimentos sociais e à comunicação. Anais do XXV Encontro Anual da Compós, pages 1-22.
Pillai, S. P., Madhukumar, S., and Radharamanan, T. (2017). Consolidating evidence based studies in software cost/effort estimation: a tertiary study. In TENCON 2017, IEEE Region 10 Conference, pages 833-838.
PMI (2021). A Guide to the Project Management Body of Knowledge (PMBOK Guide). Project Management Institute, USA, 7th edition.
Pressman, R. (2014). Software Engineering: A Practitioner's Approach. McGraw-Hill, USA, 8th edition.
Schwaber, K. and Sutherland, J. (2017). The Scrum Guide. The Definitive Guide to Scrum: The Rules of the Game. ScrumGuides.
Schwaber, K. and Sutherland, J. (2020). The Scrum Guide. The Definitive Guide to Scrum: The Rules of the Game.
Singh, S. K., Gupta, S., Busso, D., and Kamboj, S. (2019). Top management knowledge value, knowledge sharing practices, open innovation and organizational performance. Journal of Business Research.
Sommerville, I. (2015). Software Engineering. Pearson Education Limited, 10th edition.
Stellman, A. and Greene, J. (2014). Learning Agile: Understanding Scrum, XP, Lean, and Kanban. O'Reilly Media.
Sá, M., Silva, A., Oliveira, G., and Silveira, J. (2017). O método Getting Things Done (GTD) e as ferramentas de gerenciamento de tempo e produtividade. Navus, Revista de Gestão e Tecnologia, 8(1):72-87.
The Standish Group CHAOS Report (2018). Decision latency theory: it's all about the interval. Technical report, The Standish Group International.
Thiollent, M. (2011). Metodologia da pesquisa-ação. Cortez, São Paulo, 18th edition.
Trendowicz, A. and Jeffery, R. (2014). Software Project Effort Estimation: Foundations and Best Practice Guidelines for Success. Springer.

Journal of Software Engineering Research and Development, 2019, 6:1, doi: 10.5753/jserd.2019.17
This work is licensed under a Creative Commons Attribution 4.0 International license.

Improving Energy Efficiency Through Automatic Refactoring

Luis Cruz [INESC-ID, University of Porto | luiscruz@fe.up.pt]
Rui Abreu [INESC-ID, IST, University of Lisbon | rui@computer.org]

Abstract

The ever-growing popularity of mobile phones has brought additional challenges to the software development lifecycle. Mobile applications ought to provide the same set of features as conventional software, with limited resources: limited processing capabilities, storage, screen, and, not less important, power source. Although energy efficiency is a valuable requirement, developers often lack knowledge of best practices. In this paper, we propose a tool to improve the energy efficiency of Android applications using automatic refactoring: Leafactor. The tool features five energy code smells that tend to go unnoticed. To evaluate the effectiveness of our approach, we ran an experiment over a dataset of 140 free and open source apps. As a result, we detected and fixed code smells in 45 Android apps, of which 40% have successfully merged our changes into the official repository.

Keywords: automatic refactoring, mobile computing, energy efficiency, software engineering

1 Introduction

In the past decade, the advent of mobile devices has brought new challenges and paradigms to existing computing models. One of the major challenges is the fact that mobile phones have limited battery life. As a consequence, users need to charge their devices frequently to prevent their inoperability. Hence, energy efficiency is an important non-functional requirement in mobile software, with a valuable impact on usability.
A study in 2013 reported that 18% of apps have user feedback related to energy consumption (Wilke et al., 2013). Other studies have found that most developers lack knowledge about best practices for energy efficiency in mobile applications (apps) (Pang et al., 2015; Sahin et al., 2014). Hence, it is important to provide developers with actionable documentation and toolsets that help deliver energy-efficient apps.

Previously, we identified five code smells with a significant impact on the energy consumption of Android apps (Cruz and Abreu, 2017); we refer to them as energy-related smells. We used a hardware-based approach to assess the energy efficiency improvement of fixing eight performance-based code smells described in the official Android documentation. The impact on energy efficiency was evaluated by manually refactoring the codebases of five open-source Android applications. The energy consumption was measured for every pair of versions, before and after the refactoring. The measurements were performed by mimicking real use-case scenarios while collecting power data with ODROID, a single-board computer that runs Android, is used for mobile application and IoT development, and features power sensors for energy measurements. Of those eight refactorings, five were found to yield a significant improvement in the energy consumption of mobile apps. However, certifying that code complies with these optimizations is time-consuming and prone to errors. Thus, in this paper, we study how automatic refactoring can help developers write code that follows energy best practices.

There are state-of-the-art tools that provide automatic refactoring for Android and Java apps, for instance, AutoRefactor (http://autorefactor.org), WalkMod (http://walkmod.com), Facebook pfff (https://github.com/facebookarchive/pfff), and Kadabra (http://specs.fe.up.pt/tools/kadabra/). Although these tools help developers create better code, they do not feature energy-related refactorings for Android. Thus, we implement five energy optimizations in an automatic refactoring tool, Leafactor, which is publicly available under an open source license. In addition, the toolset has the potential to serve as an educational tool, helping developers understand which practices can improve energy efficiency. On top of that, we analyze how Android developers are addressing energy-related smells and how an automatic refactoring tool can help ship more energy-efficient mobile software. We have used the results of our tool to contribute to real Android app projects, validating the value of adopting an automatic refactoring tool in the development stack of mobile apps. In a dataset of 140 free and open source software (FOSS) Android apps, we found that a considerable part (32%) is released with energy inefficiencies. We fixed 222 energy-related smells in 45 apps, of which 18 have successfully merged our changes into the official branch. The results show that automatic refactoring tools can be very helpful in improving the energy footprint of apps.

This paper is an extension of our previous work, in which we introduced the automatic refactoring tool Leafactor for the first time (Cruz et al., 2017; Cruz and Abreu, 2018). We provide a self-contained report of our work on improving the energy efficiency of mobile apps via automatic refactorings, adding details of the architecture of the toolset and the available set of refactorings. Moreover, we give a more comprehensive description of the dataset used in the empirical study, including complexity metrics.
Combined, our work makes the following contributions:

- An automated refactoring tool, Leafactor, to improve the energy efficiency of Android applications.
- An empirical study of the prevalence of five energy-related code smells in FOSS Android applications.
- The submission of 59 pull requests to the official code bases of 45 FOSS Android applications, comprehending 222 energy efficiency refactorings.

The remainder of this paper is organized as follows: Section 2 details energy refactorings and their impact on energy consumption; in Section 3, we present the automatic refactoring toolset that was implemented; Section 4 describes the experimental methodology used to validate our tool, followed by Sections 5 and 6 with results and discussion; in Section 7 we present the related work in this field; and finally Section 8 summarizes our findings and discusses future work.

2 Energy Refactorings

We use static code analysis and automatic refactoring to apply Android-specific optimizations for energy efficiency. In this section, we describe refactorings which are known to improve the energy consumption of Android apps. Each of them has an indication of the energy efficiency improvement, as assessed in previous work (Cruz and Abreu, 2017), and the fix priority given in the official documentation of Lint, a tool provided with the Android SDK which detects problems related to the structural quality of code (https://developer.android.com/studio/write/lint). The priority reflects the impact of the refactoring in terms of performance and is given on a scale of 1 to 10, with 10 being the most effective; it is not necessarily correlated with energy performance. In addition, we provide examples where the refactorings are applied. All refactorings are in Java, with the exception of ObsoleteLayoutParam, which is in XML, the markup language used in Android to define the user interface (UI).

2.1 ViewHolder: Add ViewHolder to scrolling lists

Energy efficiency improvement: 4.5%. Lint priority: 5/10.

This refactoring is used to make scrolling in list views smoother, with no lags. In a list view, the system has to draw each item separately. To make this process more efficient, data from the previously drawn item should be reused. This technique decreases the number of calls to the method findViewById(), which is known to be very inefficient (Linares-Vásquez et al., 2014). The following code snippet provides an example of how to apply ViewHolder; the numbered comments are explained below.

```java
// ...
@Override
public View getView(final int position, View convertView, ViewGroup parent) {
    convertView = LayoutInflater.from(getContext()).inflate(  // (1)
        R.layout.subforsublist, parent, false
    );
    final TextView t = ((TextView) convertView.findViewById(R.id.name));  // (2)
    // ...
```

Optimized version:
```java
// ...
private static class ViewHolderItem {  // (3)
    private TextView t;
}

@Override
public View getView(final int position, View convertView, ViewGroup parent) {
    ViewHolderItem viewHolderItem;
    if (convertView == null) {  // (4)
        convertView = LayoutInflater.from(getContext()).inflate(
            R.layout.subforsublist, parent, false
        );
        viewHolderItem = new ViewHolderItem();
        viewHolderItem.t = ((TextView) convertView.findViewById(R.id.name));
        convertView.setTag(viewHolderItem);
    } else {
        viewHolderItem = (ViewHolderItem) convertView.getTag();
    }
    final TextView t = viewHolderItem.t;  // (5)
    // ...
```

(1) In every iteration of the method getView, a new LayoutInflater object is instantiated, overwriting the method's parameter convertView. (2) Each item in the list has a view to display text, a TextView object. This view is fetched in every iteration, using the method findViewById(). (3) A new class is created to cache data common to all list items. It is used to store the TextView object and prevent it from being fetched in every iteration. (4) This block runs only for the first item of the list; subsequent iterations receive the convertView from the parameters. (5) It is no longer necessary to call findViewById() to retrieve the TextView object.

One might argue that the version of the code after refactoring is considerably less intuitive. This is in fact true, and it might be a reason why developers skip such optimizations. However, regardless of whether this optimization should be handled by the system, it is the recommended approach, as stated in the official Android documentation (https://developer.android.com/guide/topics/ui/layout/recyclerview). See more on this discussion in Section 6.

2.2 DrawAllocation: Remove allocations within drawing code

Energy efficiency improvement: 1.5%. Lint priority: 9/10.

Draw operations are very performance-sensitive. It is a bad practice to allocate objects during such operations, since this can create noticeable lags. The recommended fix is to allocate objects upfront and reuse them for each drawing operation, as shown in the following example:

```java
public class DrawAllocationSampleTwo extends Button {
    public DrawAllocationSampleTwo(Context context) {
        super(context);
    }

    @Override
    protected void onDraw(android.graphics.Canvas canvas) {
        super.onDraw(canvas);
        Integer i = new Integer(5);  // (1)
        // ...
        return;
    }
}
```

Optimized version:

```java
public class DrawAllocationSampleTwo extends Button {
    public DrawAllocationSampleTwo(Context context) {
        super(context);
    }

    Integer i = new Integer(5);  // (2)

    @Override
    protected void onDraw(android.graphics.Canvas canvas) {
        super.onDraw(canvas);
        // ...
        return;
    }
}
```

(1) A new instance of Integer is created in every execution of onDraw. (2) The allocation of the Integer instance is removed from the drawing operation and is now executed only once during the app's execution.

2.3 WakeLock: Fix incorrect wakelock usage

Energy efficiency improvement: 1.5%. Lint priority: 9/10.

Wakelocks are mechanisms to control the power state of a mobile device. They can be used to prevent the screen or the CPU from entering a sleep state. If an application fails to release a wakelock, or uses it without it being strictly necessary, it can drain the device's battery. The following example shows an activity that uses a wake lock:
the following example shows an activity that uses a wake lock: extends activity { private wakelock wl; @override protected void oncreate(bundle savedinstancestate) { super.oncreate(savedinstancestate); powermanager pm = (powermanager) this. getsystemservice( context.power_service ); wl = pm.newwakelock( powermanager.screen_dim_wake_lock | powermanager. on_after_release, "wakelocksample" ); wl.acquire();¶ } } ¶ using the method acquire() the app asks the device to stay on. until further instruction, the device will be deprived of sleep. since no instruction is stopping this behavior, the device will not be able to enter a sleep mode. although in exceptional cases this might be intentional, it should be fixed to prevent battery drain. the recommended fix is to override the method onpause() in the activity: //... @override protected void onpause(){ super.onpause(); if (wl != null && !wl.isheld()) { wl.release(); } } //... with this solution, the lock is released before the app switches to background. 2.4 recycle: fix missing recycle() calls  0.7%. lint priority: ■■■■■■■□□□ 7/10. there are collections such as typedarray that are implemented using singleton resources. hence, they should be released so that calls to different typedarray objects can efficiently use these same resources. the same applies to other classes (e.g., database cursors, motion events, etc.). the following snippet shows an object of typedarray that is not being recycled after use: public void wrong1(attributeset attrs, int defstyle) { final typedarray a = getcontext(). obtainstyledattributes( attrs, new int[] { 0 }, defstyle, 0 ); string example = a.getstring(0); } solution: public void wrong1(attributeset attrs, int defstyle) { final typedarray a = getcontext(). obtainstyledattributes( attrs, new int[] { 0 }, defstyle, 0 ); string example = a.getstring(0); if (a != null) { a.recycle();¶ } } ¶ calling the method recycle() when the object is no longer needed, fixes the issue. the call is encapsulated in a conditional block for safety reasons. besides typedarray instances, this refactoring is also applied to instances of following classes: cursor, velocitytracker, motionevent, parcel, and contentproviderclient. 2.5 obsoletelayoutparam (olp): remove obsolete layout parameters  0.7%. lint priority: ■■■■■■□□□□ 6/10. during development, ui views might be refactored several times. in this process, some parameters might be left unchanged even when they have no effect in the view. this is a code smell that needs to be fixed since it causes useless attribute processing at runtime. the refactoring is applied by removing the obsolete parameters from the ui specification. as an example, consider the following code snippet (xml): /* deleteme */ ¶ ¶ the property android:layout_alignparentbottom is used for views inside a relativelayout to align the bottom edge of a view (i.e., the textview, in this example) with the bottom edge of the relativelayout. on contrary, linearlayout is not compatible with this property, having no effect in this example. it is safe to remove the property cruz et al. 2019 table 1. layout-related parameters that only have a visual effect when defined inside specific layouts. 
Table 1. Layout-related parameters that only have a visual effect when defined inside specific layouts.

| Layout parameter | Allowed parent layout |
|---|---|
| layout_x | AbsoluteLayout |
| layout_y | AbsoluteLayout |
| layout_weight | LinearLayout, ActionMenuView, ListRowHoverCardView, ListRowView, NumberPicker, RadioGroup, SearchView, TabWidget, TableLayout, TableRow, TextInputLayout, ZoomControls |
| layout_column | GridLayout, TableLayout, TableRow |
| layout_columnSpan | GridLayout |
| layout_row | GridLayout |
| layout_rowSpan | GridLayout |
| layout_alignLeft | RelativeLayout |
| layout_alignStart | RelativeLayout |
| layout_alignRight | RelativeLayout |
| layout_alignEnd | RelativeLayout |
| layout_alignTop | RelativeLayout |
| layout_alignBottom | RelativeLayout |
| layout_alignParentTop | RelativeLayout |
| layout_alignParentBottom | RelativeLayout |
| layout_alignParentLeft | RelativeLayout |
| layout_alignParentStart | RelativeLayout |
| layout_alignParentRight | RelativeLayout |
| layout_alignParentEnd | RelativeLayout |
| layout_alignWithParentIfMissing | RelativeLayout |
| layout_alignBaseline | RelativeLayout |
| layout_centerInParent | RelativeLayout |
| layout_centerVertical | RelativeLayout |
| layout_centerHorizontal | RelativeLayout |
| layout_toRightOf | RelativeLayout |
| layout_toEndOf | RelativeLayout |
| layout_toLeftOf | RelativeLayout |
| layout_toStartOf | RelativeLayout |
| layout_below | RelativeLayout |
| layout_above | RelativeLayout |

3 Automatic Refactoring Tool

In the scope of our study, we developed a tool to statically analyze and transform code, implementing Android-specific energy efficiency refactorings: Leafactor. The toolset receives a single file, a package, or a whole Android project as input and looks for eligible files, i.e., Java or XML source files. It automatically analyzes those files and generates a new compilable and optimized version. The architecture of Leafactor is depicted in Figure 1: there are two separate engines, one to handle Java files and another to handle XML files.

Figure 1. Architecture diagram of the automatic refactoring toolset: a Java refactoring engine and an XML refactoring engine process an Android project, driven through a CLI, a plugin, or a UI.

The refactoring engine for Java is implemented as part of the open-source project AutoRefactor, an Eclipse plugin to automatically refactor Java code bases.

3.1 AutoRefactor

AutoRefactor is an Eclipse plugin that delivers automatic refactoring of Java codebases. It was created as a complement to existing static analyzers such as SonarQube, FindBugs, Checkstyle, and PMD: although these provide insightful warnings to developers, they do little to help developers fix all the issues lying in legacy codebases. AutoRefactor provides a comprehensive set of 103 common code cleanups to help deliver "smaller, more maintainable and more expressive code bases", as described on the official website (http://autorefactor.org). The list goes from simple rules, such as enforcing the use of the method isEmpty() to check whether a collection is empty, instead of checking its size (rule IsEmptyRatherThanSize), to more complex ones, such as SetRatherThanList, which chooses a more adequate collection type for specific use cases. In addition, AutoRefactor also supports cleanups of code comments, such as removing auto-generated or empty Javadocs from the codebase (the rule named Comments). The Eclipse Marketplace (https://marketplace.eclipse.org), an interface for browsing and installing plugins for the Eclipse Java IDE, reported 4459 successful installs of AutoRefactor.

A common use case is presented in the screenshot of Figure 2. Developers can apply refactorings to single files, packages, or entire projects.

Figure 2. Developers can apply refactorings by selecting the "Automatic Refactoring" option or via a keyboard shortcut.
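For a flavor of these cleanups, here is a minimal before/after sketch of the IsEmptyRatherThanSize rule, hand-written for illustration rather than taken from AutoRefactor's actual output:

```java
import java.util.List;

class IsEmptyRatherThanSizeExample {
    // Before the cleanup: emptiness checked by comparing the size to zero.
    static boolean hasNoItemsBefore(List<String> items) {
        return items.size() == 0;
    }

    // After the cleanup: same behavior, clearer intent.
    static boolean hasNoItemsAfter(List<String> items) {
        return items.isEmpty();
    }
}
```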
eclipse marketplace9 reported 4459 successful installs of autorefactor. a common use case is presented in the screenshot of figure 2: developers can apply refactorings in single files, packages, or entire projects.

figure 2. developers can apply refactorings by selecting the "automatic refactoring" option or by using a keyboard shortcut.

under the hood, autorefactor integrates a handy and concise api to manipulate java abstract syntax trees (asts). we contributed to the project by implementing the java refactorings mentioned in section 2.

3.2 xml refactorings

since xml refactorings are not supported by autorefactor, a separate refactoring engine was developed and integrated into leafactor. the engine features a command line interface that can be integrated with continuous integration environments. optionally, the tool can be set to simply flag warnings, without performing any refactoring transformation. as detailed in the previous section, only a single xml refactoring is offered — obsoletelayoutparam.

8as described on the official website: http://autorefactor.org (visited on august 17, 2019).
9eclipse marketplace is an interface for browsing and installing plugins for the java ide eclipse: https://marketplace.eclipse.org (visited on august 17, 2019).

figure 3. experiment's procedure for a single app: 1. collect metadata from f-droid; 2. fork repository; 3. select optimization; 4. create branch; 5. apply leafactor; 6. validate changes; 7. commit & push changes; 8. submit pr.

4 empirical evaluation

we designed an experiment with the following goals:

• study the benefits of using an automatic refactoring tool within the android development community.
• study how foss android apps are adopting energy efficiency optimizations.
• improve the energy efficiency of foss android apps.

we adopted the procedure explained in figure 3. starting with step 1, we collected data from the f-droid app store10 — a catalog for free and open-source software (foss) applications for the android platform. for each mobile application, we collected the git repository location, which was used in step 2 to fork the repository and prepare it for a potential contribution to the project's official code repository. following, in step 3 we selected one refactoring to be applied and consequently initiated a process that was repeated for all refactorings (steps 4–8): the project was analyzed and, if any transformation was applied, a new pull request (pr) was submitted to be considered by the project's integrator. since we wanted to engage the community and get feedback about the refactorings, we manually created each pr with a personalized message, including a brief explanation of the committed code changes.

we analyzed 140 free and open-source android apps collected from f-droid11. apps were selected by publish date (i.e., priority was given to newly released apps), considering exclusively java projects (e.g., kotlin projects were filtered out) with a github repository. we selected only one git service for the sake of simplicity. apps in the dataset are spread over 17 different categories, as depicted in figure 4. table 2 presents descriptive statistics for the source code and repository of the mobile applications in the dataset: number of lines of code (loc), mccabe's cyclomatic complexity (cc), mean weighted methods per class12 (wmc), lack of cohesion of methods13 (lcom) (etzkorn et al., 1998), number of java files, number of xml files, number of github forks, github stars, and contributors. these metrics were collected using the static analysis tool designite14 and the github api v315.

10f-droid repository is available at https://f-droid.org (visited on august 17, 2019).
11data was collected on nov 27, 2016, and it is available at https://doi.org/10.6084/m9.figshare.7637402.
12weighted methods per class (wmc) is the sum of the complexity of the methods in a class.
13lack of cohesion of methods (lcom) is a software code metric that measures the correlation between class members and methods. values fall between 0, indicating perfect cohesion, and 1, indicating a complete lack of cohesion.
14designite's website: http://www.designite-tools.com (visited on august 17, 2019).
15github api v3's website: https://developer.github.com/v3/ (visited on august 17, 2019).
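to make the metric in footnote 13 concrete, the sketch below computes one common normalised lcom variant with values in [0, 1]; the exact formulation implemented by designite may differ, and the input encoding is our own simplification:

    public class LcomSketch {
        // methodsAccessing[i] = number of methods that read or write attribute i.
        // returns 0 for perfect cohesion and 1 for a complete lack of cohesion.
        static double lcom(int numMethods, int[] methodsAccessing) {
            if (numMethods == 0 || methodsAccessing.length == 0) {
                return 0.0; // degenerate classes are treated as cohesive here
            }
            double sum = 0;
            for (int m : methodsAccessing) {
                sum += m;
            }
            double cohesion = sum / ((double) numMethods * methodsAccessing.length);
            return 1.0 - cohesion;
        }

        public static void main(String[] args) {
            // 4 methods, 2 attributes, each attribute accessed by a single method
            System.out.println(lcom(4, new int[] {1, 1})); // prints 0.75 (low cohesion)
        }
    }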
figure 4. number of apps per category in the dataset.

the dataset comprises very diverse mobile applications. it ranges from very simple apps, such as storage-usb16, with 13 loc and a complexity cc of 2, to large apps, such as slide17, with almost 400k loc and a complexity cc of 14631, or osmand18, with over 300k loc and a complexity cc of 77889. the largest project in terms of java files is tinytraveltracker (1878), while newsblue is the largest in terms of xml files (2109). most apps in the dataset have reasonable cohesion, with lcom below 0.34 for 75% of the apps; apps with low/moderate cohesion were also analyzed, having lcom values up to 0.67. in total, we analyzed 2.8m lines of java code (loc) in 6.79gb of android projects in 4.5 hours — 15103 xml files and 15308 java files.

5 results

our experiment yielded a total of 222 refactorings, which were submitted to the original repositories as prs. multiple refactorings of the same type were grouped in a single pr to avoid creating too many prs for a single app. this resulted in 59 prs spread across 45 apps. this is a demanding process, since each project has different contributing guidelines. nevertheless, by the time of writing, 18 apps had successfully merged our contributions for deployment. an example of the prs submitted to the projects is illustrated in figure 5. leafactor performed the refactoring viewholder in the app slide19, and the developers successfully merged our pr. the full thread can be found in the github project ccrama/slide under reference #234620.

16storage-usb basically launches storage settings directly from the apps drawer. github repository: https://github.com/enricocid/storage-usb (visited on august 17, 2019).
17slide is a browser for the social news forum reddit. github repository: https://github.com/ccrama/slide (visited on august 17, 2019).
18osmand is a navigation app. github repository: https://github.com/osmandapp/osmand (visited on august 17, 2019).
19slide's website: http://trikita.co/slide/ (visited on august 17, 2019).
20pr of the viewholder refactoring in the app slide: https://github.com/ccrama/slide/pull/2346 (visited on august 17, 2019).
table 2. descriptive statistics of projects in the dataset.

            loc      cc     wmc    lcom   java files  xml files  github forks  github stars  contributors
    mean    20350    3532   17.41  0.29   103         102        65            179           15
    min     13       2      1.00   0.00   0           4          0             0             1
    25%     1444     271    11.14  0.23   13          23         3.75          7.75          2
    median  4641     946    15.20  0.27   38          48         9             24            3
    75%     14795    3007   21.50  0.34   106         97         39            111           10
    max     388853   77889  82.82  0.67   1678        2109       1483          4488          323
    total   2869394  –      –      –      15308       15103      9547          26484         2162

table 3. summary of refactoring results.

    refactoring             viewholder  drawallocation  wakelock  recycle  olp*  total
    total refactorings      7           0               1         58       156   222
    total projects          5           0               1         23       30    45
    percentage of projects  4%          0%              1%        16%      21%   32%
    incidence per project   1.4×        –               1.0×      2.5×     5.2×  4.8×
    *olp — obsoletelayoutparam

figure 5. an example of a pull request submitted to the android project slide.

table 3 presents the results for each refactoring. it shows the total number of applied refactorings, the total number of projects that were affected, the percentage of affected projects, and the average number of refactorings per affected project (i.e., total refactorings divided by affected projects). in addition, the table presents the combined results for the occurrence of any type of refactoring (total). obsoletelayoutparam was the most frequent refactoring: it was applied 156 times in a total of 30 projects out of the 140 in our dataset (21%). on average, each affected project had 5 occurrences of this refactoring. recycle comes next, occurring in 23 projects (16%) with 58 refactorings. drawallocation and wakelock showed only marginal impact. in addition, figure 6 presents a bar plot summarizing the number of projects affected for each of the studied refactorings.

the mobile application with the highest incidence of refactoring types was the android application for the cloud platform nextcloud21: leafactor refactored two occurrences of recycle, two of viewholder, and 6 of obsoletelayoutparam. in terms of the total number of refactorings, qr scanner22 was the app with the highest number of occurrences, with 35 occurrences of obsoletelayoutparam.

21nextcloud's website: https://nextcloud.com (visited on august 17, 2019).
22qr scanner's entry on google play: https://play.google.com/store/apps/details?id=com.secuso.privacyfriendlycodescanner (visited on august 17, 2019).

figure 6. number of apps affected per refactoring.

for reproducibility and clarity of results, all the data collected in this study is publicly available23. in addition, all the prs are public and can be accessed through the official repositories of the apps.

6 discussion

results show that an automatic refactoring tool can help developers ship more energy-efficient apps. a considerable part of the apps in this study (32%) had at least one energy inefficiency. since these inefficiencies are only visible after long periods of app activity, they can easily go unnoticed. from the feedback developers provided in the prs, we noticed that developers are open to recommendations from an automated tool. only in a few exceptional cases did developers express being unhappy with our contributions. reasons varied between seeing our pr as a critique of their programming skills or simply not wanting to make changes in the components of the app that were affected by the refactoring. nevertheless, most developers were curious about the refactorings, and they recognized being unaware of their impact on energy efficiency.
this is consistent with previous work (pang et al., 2015; sahin et al., 2014).

23spreadsheet with all experimental results: https://doi.org/10.6084/m9.figshare.7637402.

a positive outcome of our experimentation was that we were able to improve energy efficiency in the official releases of 18 android apps. in a few cases, code smells were found in code that does not affect the energy consumption of the app itself (e.g., test code). in those cases, our prs were not merged24. nevertheless, we recommend consistently complying with these optimizations in all types of code, since new developers often use tests to help understand how to contribute to a project.

24example of a pr of refactorings on test code: https://github.com/hidroh/materialistic/pull/828 (visited on august 17, 2019).

leafactor, akin to autorefactor, applies the refactorings without prompting developers for confirmation. this is a common approach for simple refactorings. nevertheless, in the case of energy code smells, a single refactoring may entail changing several lines of code that the developer may not be able to interpret. during our experiments, this issue was mitigated because we submitted a pr with a brief explanation of the code smell and the applied refactoring. it would be interesting to consider alternative approaches in which developers are informed or prompted while having their code refactored.

the code smell related to obsoletelayoutparam was found in a considerable fraction of projects (21%). this relates to the fact that app views are often created in an iterative process with several rounds of trial and error. since some parameters have no effect under specific contexts, useless ui specification statements can go unnoticed by developers.

recycle is frequent too, being observed in 16% of projects. this smell is found in android api objects that are present in most projects (e.g., database cursors). although a clean fix is to use the java try-with-resources statement25, it requires version 19 or later of the android sdk (i.e., android 4.4 kitkat onwards). hence, developers resort to a more verbose approach for backward compatibility, which requires explicitly closing resources and is thus prone to mistakes.

25documentation about the java try-with-resources statement: https://docs.oracle.com/javase/tutorial/essential/exceptions/tryresourceclose.html (visited on august 17, 2019).
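for reference, a minimal sketch of the try-with-resources alternative for a database cursor (api level 19+); the table name and query arguments are illustrative, not taken from the studied apps:

    import android.database.Cursor;
    import android.database.sqlite.SQLiteDatabase;

    public class CursorExample {
        static void readAllRows(SQLiteDatabase db) {
            // the cursor is closed automatically when the block exits,
            // even if an exception is thrown while reading rows
            try (Cursor cursor = db.query("users", null, null, null, null, null, null)) {
                while (cursor.moveToNext()) {
                    // process the current row
                }
            }
        }
    }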
our drawallocation checker did not yield any results. it was expected that developers were already aware of drawallocation. still, we were able to manually spot allocations happening inside drawing routines. nevertheless, those allocations use dynamic values to initialize the object, and in our implementation we scope only allocations that will not change between iterations. covering those missed cases would require updating the allocated object in every iteration: while spotting these cases is relatively easy, refactoring them would require better knowledge of the class being instantiated. similarly, wakelocks are very complex mechanisms, and fixing all their misuses still needs further work.

in the case of viewholder, although it only impacted 4% of the projects, we believe this has to do with the fact that 1) some developers already know this refactoring due to its performance impact, and 2) many projects do not implement dynamic list views. viewholder is the most complex refactoring we have in terms of lines of code (loc) — a simple case can require changes in roughly 35 loc. although the changes are easily understandable by developers, writing code that complies with viewholder is not intuitive.
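for reference, a compact sketch of what viewholder-compliant adapter code looks like; the layout and view ids are hypothetical, and real cases in our dataset are longer:

    import android.view.LayoutInflater;
    import android.view.View;
    import android.view.ViewGroup;
    import android.widget.BaseAdapter;
    import android.widget.TextView;

    public abstract class RowAdapter extends BaseAdapter {
        static class ViewHolder {
            TextView title; // child lookup cached once per inflated row view
        }

        @Override
        public View getView(int position, View convertView, ViewGroup parent) {
            ViewHolder holder;
            if (convertView == null) {
                // first use: inflate the row and remember its children in a holder
                convertView = LayoutInflater.from(parent.getContext())
                        .inflate(R.layout.row, parent, false); // R.layout.row is hypothetical
                holder = new ViewHolder();
                holder.title = (TextView) convertView.findViewById(R.id.title); // hypothetical id
                convertView.setTag(holder);
            } else {
                // recycled row: reuse cached lookups instead of calling findViewById again
                holder = (ViewHolder) convertView.getTag();
            }
            holder.title.setText(getItem(position).toString());
            return convertView;
        }
    }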
gains in energy efficiency may vary depending on the application and the use cases in which the refactorings occur. measuring the effective impact on energy consumption is not trivial, as it requires a complicated setup. previous work has found these refactorings to improve energy efficiency by up to 5% in real use case scenarios (cruz and abreu, 2017). nonetheless, these refactorings are recommended by the official android documentation26 as best practices for performance.

26viewholder is documented here: https://developer.android.com/training/improving-layouts/smooth-scrolling (visited on august 17, 2019).

a visible side effect of the refactorings featured by leafactor is the questionable maintainability of the code introduced. although the refactorings are implemented based on the official android documentation, the resulting code is considerably longer and less intuitive for refactorings such as viewholder and recycle. this is a threat to the adoption of energy-efficient practices in android applications. mobile frameworks should feature coding mechanisms aimed at improving energy efficiency without hindering code maintainability.

7 related work

the energy efficiency of mobile apps is being addressed with many different approaches. some works opt for simplifying the process of measuring the energy consumption of mobile apps (zhang et al., 2010; pathak et al., 2012, 2011; hao et al., 2013; di nucci et al., 2017; couto et al., 2014). alternatively, other works study the energy footprint of software design choices and code patterns to prevent developers from creating code with poor energy efficiency (li et al., 2014; li and halfond, 2014, 2015; linares-vásquez et al., 2017; malavolta et al., 2017; pereira et al., 2017).

automatic detection of code smells for android has been studied before. fixing code smells in android has shown gains of up to 5% in energy efficiency (cruz and abreu, 2017). in that work, the code was manually refactored in six real apps, and energy consumption was measured using a hardware-based power monitor. our work extends this research by providing automatic refactoring for the resulting energy code smells.

the frequency of code smells in android apps was studied in previous work (hecht et al., 2015). code smells were automatically detected in 15 apps using the tool paprika, which was developed to perform static analysis on the bytecode of apps. although paprika provides developers with valuable feedback on how to fix their code, they need to manually apply the refactorings. our study differs by focusing on energy-related code smells and by applying automatic refactoring to resolve potential issues.

previous work has also studied the importance of providing a catalog of bad smells that negatively influence the quality of android applications (reimann et al., 2014; reimann and aßmann, 2013). although the authors motivate the importance of using automatic refactoring, their approach lacks an extensive implementation of their catalog. related work has implemented 15 code smells from this catalog proposed by reimann and aßmann (2013) in an automatic refactoring tool, adoctor (palomba et al., 2017). in our work, we use this approach to improve the energy efficiency of android applications.

another work has focused exclusively on design patterns to improve the energy efficiency of ios and android mobile applications (cruz and abreu, 2019). however, no efforts were made regarding the automatic refactoring of the cataloged energy patterns. in our work, we implement automatic refactoring for five energy patterns. in addition, we validate our refactorings by applying leafactor to a large dataset of real android apps. moreover, we assess how automatic refactoring tools for energy can positively impact the android foss community.

other works have detected energy-related code smells by analyzing source code as tgraphs (gottschalk et al., 2012; ebert et al., 2008). eight different code smell detectors were implemented and validated with a navigation app. fixing the code with automatic refactoring was discussed but not implemented. besides, although the studied code smells are likely to have an impact on energy consumption, no evidence was presented.

previous work has used the event flow graph of the app to optimize resource usage (e.g., gps, bluetooth) (banerjee and roychoudhury, 2016). results show significant gains in energy efficiency. nevertheless, although this process provides details on how to fix the code, it is not fully automated yet.

other works have studied and applied automatic refactorings in android applications (sahin et al., 2014, 2016). however, those refactorings were not mobile-specific. besides refactoring source code, other works have focused on studying the impact of ui design decisions on energy consumption (linares-vásquez et al., 2017). agolli et al. have proposed a methodology that suggests changes in the ui colors of apps: the new ui colors, despite being different, are almost imperceptible to users and lead to savings in the energy consumption of mobile phones' displays (agolli et al., 2017). in our work, we strictly focus on changes that do not alter the appearance of the app.

8 conclusion

our work presents the automatic refactoring tool leafactor, built to improve the energy efficiency of android application codebases. in an empirical study with 140 foss android apps, we show the potential of using automatic refactoring tools to improve the energy efficiency of mobile applications. we have fixed 222 energy-related code smells, improving the energy footprint of 45 android applications. results show that automatic refactoring can help developers improve the energy efficiency of a considerable number of foss android applications.
as future work, we plan to study and support more energy efficiency refactorings. in particular, some of the energy patterns studied in previous work (cruz and abreu, 2019; reimann et al., 2014; reimann and aßmann, 2013) could help increase the usefulness of leafactor. besides, it would be interesting to explore the detection of energy-related smells using dynamic analysis. moreover, it would be interesting to integrate automatic refactoring into a continuous integration context. the integration would require two distinct steps: one for the detection and another for the code refactoring, which would only be applied upon a granting action by a developer. one could also use this idea for an educational purpose: a detailed explanation of the code transformation, along with its impact on energy efficiency, could be provided whenever a developer pushes new changes to the repository.

acknowledgements

this work is financed by the erdf – european regional development fund – through the operational program for competitiveness and internationalization – compete 2020 program – and by national funds through the portuguese funding agency, fct – fundação para a ciência e a tecnologia, within project poci-01-0145-feder-016718. luis cruz is sponsored by an fct scholarship, grant number pd/bd/52237/2013.

references

agolli, t., pollock, l., and clause, j. (2017). investigating decreasing energy usage in mobile apps via indistinguishable color changes. in proceedings of the 4th international conference on mobile software engineering and systems, pages 30–34. ieee press.
banerjee, a. and roychoudhury, a. (2016). automated refactoring of android apps to enhance energy-efficiency. in proceedings of the international workshop on mobile software engineering and systems, pages 139–150. acm.
couto, m., carção, t., cunha, j., fernandes, j. p., and saraiva, j. (2014). detecting anomalous energy consumption in android applications. in brazilian symposium on programming languages, pages 77–91. springer.
cruz, l. and abreu, r. (2017). performance-based guidelines for energy efficient mobile applications. in proceedings of the 4th international conference on mobile software engineering and systems, pages 46–57. ieee press.
cruz, l. and abreu, r. (2018). using automatic refactoring to improve energy efficiency of android apps. in cibse xxi ibero-american conference on software engineering.
cruz, l. and abreu, r. (2019). catalog of energy patterns for mobile applications. empirical software engineering.
cruz, l., abreu, r., and rouvignac, j.-n. (2017). leafactor: improving energy efficiency of android apps via automatic refactoring. in proceedings of the 4th international conference on mobile software engineering and systems, mobilesoft '17, pages 205–206. ieee press.
di nucci, d., palomba, f., prota, a., panichella, a., zaidman, a., and de lucia, a. (2017). petra: a software-based tool for estimating the energy profile of android applications. in proceedings of the 39th international conference on software engineering companion, pages 3–6. ieee press.
ebert, j., riediger, v., and winter, a. (2008). graph technology in reverse engineering – the tgraph approach. in proc. 10th workshop software reengineering. gi lecture notes in informatics. citeseer.
etzkorn, l., davis, c., and li, w. (1998). a practical look at the lack of cohesion in methods metric. in journal of object-oriented programming. citeseer.
gottschalk, m., josefiok, m., jelschen, j., and winter, a. (2012). removing energy code smells with reengineering services. gi-jahrestagung, 208:441–455.
hao, s., li, d., halfond, w. g., and govindan, r. (2013). estimating mobile application energy consumption using program analysis. in software engineering (icse), 2013 35th international conference on, pages 92–101. ieee.
hecht, g., rouvoy, r., moha, n., and duchien, l. (2015). detecting antipatterns in android apps. in proceedings of the second acm international conference on mobile software engineering and systems, pages 148–149. ieee press.
li, d. and halfond, w. g. (2014). an investigation into energy-saving programming practices for android smartphone app development. in proceedings of the 3rd international workshop on green and sustainable software, pages 46–53. acm.
li, d. and halfond, w. g. (2015). optimizing energy of http requests in android applications. in proceedings of the 3rd international workshop on software development lifecycle for mobile, pages 25–28. acm.
li, d., hao, s., gui, j., and halfond, w. g. (2014). an empirical study of the energy consumption of android applications. in software maintenance and evolution (icsme), 2014 ieee international conference on, pages 121–130. ieee.
linares-vásquez, m., bavota, g., bernal-cárdenas, c., oliveto, r., di penta, m., and poshyvanyk, d. (2014). mining energy-greedy api usage patterns in android apps: an empirical study. in proceedings of the 11th working conference on mining software repositories, pages 2–11. acm.
linares-vásquez, m., bernal-cárdenas, c., bavota, g., oliveto, r., di penta, m., and poshyvanyk, d. (2017). gemma: multi-objective optimization of energy consumption of guis in android apps. in proceedings of the 39th international conference on software engineering companion, pages 11–14. ieee press.
malavolta, i., procaccianti, g., noorland, p., and vukmirović, p. (2017). assessing the impact of service workers on the energy efficiency of progressive web apps. in proceedings of the 4th international conference on mobile software engineering and systems, pages 35–45. ieee press.
palomba, f., di nucci, d., panichella, a., zaidman, a., and de lucia, a. (2017). lightweight detection of android-specific code smells: the adoctor project. in 2017 ieee 24th international conference on software analysis, evolution and reengineering (saner), pages 487–491. ieee.
pang, c., hindle, a., adams, b., and hassan, a. e. (2015). what do programmers know about the energy consumption of software? peerj preprints, 3:e886v1.
pathak, a., hu, y. c., and zhang, m. (2012). where is the energy spent inside my app?: fine grained energy accounting on smartphones with eprof. in proceedings of the 7th acm european conference on computer systems, pages 29–42. acm.
pathak, a., hu, y. c., zhang, m., bahl, p., and wang, y.-m. (2011). fine-grained power modeling for smartphones using system call tracing. in proceedings of the sixth conference on computer systems, pages 153–168. acm.
pereira, r., carção, t., couto, m., cunha, j., fernandes, j. p., and saraiva, j. (2017). helping programmers improve the energy efficiency of source code. in proceedings of the 39th international conference on software engineering companion, pages 238–240. ieee press.
reimann, j. and aßmann, u. (2013). quality-aware refactoring for early detection and resolution of energy deficiencies. in proceedings of the 2013 ieee/acm 6th international conference on utility and cloud computing, pages 321–326. ieee computer society.
reimann, j., brylski, m., and aßmann, u. (2014). a tool-supported quality smell catalogue for android developers. in proc. of the conference modellierung 2014 in the workshop modellbasierte und modellgetriebene softwaremodernisierung – mmsm, volume 2014.
sahin, c., pollock, l., and clause, j. (2014). how do code refactorings affect energy usage? in proceedings of the 8th acm/ieee international symposium on empirical software engineering and measurement, page 36. acm.
sahin, c., pollock, l., and clause, j. (2016). from benchmarks to real apps: exploring the energy impacts of performance-directed changes. journal of systems and software, 117:307–316.
wilke, c., richly, s., götz, s., piechnick, c., and aßmann, u. (2013). energy consumption and efficiency in mobile applications: a user feedback study. in green computing and communications (greencom), 2013 ieee and internet of things (ithings/cpscom), ieee international conference on and ieee cyber, physical and social computing, pages 134–141. ieee.
zhang, l., tiwana, b., qian, z., wang, z., dick, r. p., mao, z. m., and yang, l. (2010). accurate online power estimation and automatic battery behavior based power model generation for smartphones. in proceedings of the eighth ieee/acm/ifip international conference on hardware/software codesign and system synthesis, pages 105–114. acm.

journal of software engineering research and development, 2022, 10:7, doi: 10.5753/jserd.2021.1978. this work is licensed under a creative commons attribution 4.0 international license.

using evidence from systematic studies to guide a phd research in requirements engineering: an experience report

taciana novo kudo [ universidade federal de goiás | taciana@ufg.br ]
renato f. bulcão-neto [ universidade federal de goiás | rbulcao@ufg.br ]
auri marcelo rizzo vincenzi [ universidade federal de são carlos | auri@ufscar.br ]
érica ferreira de souza [ universidade tecnológica federal do paraná | ericasouza@utfpr.edu.br ]
katia romero felizardo [ universidade tecnológica federal do paraná | katiascannavino@utfpr.edu.br ]

abstract

conducting systematic studies during a postgraduate program, such as systematic review, systematic mapping, and tertiary review, can benefit the project's success. they provide an overview of the literature considering currently available research findings, establish baselines for other research activities, and support decisions made throughout the research project. however, there is a shortage of research reporting experiences of systematic studies supporting academic projects. this paper's main contribution is reporting our experience on how the evidence found in tertiary and secondary studies positively influenced a phd project's decisions. initially, a tertiary study was conducted, followed by a systematic mapping. the evidence returned by the tertiary study led to the definition of the phd research proposal in the requirements engineering field. moreover, the systematic mapping contributed to the definition of the phd research problem. from this experience in undertaking systematic studies to support a phd project, the paper also presents lessons learned and recommendations to guide phd students' decisions.
keywords: evidence-based software engineering, graduate education, tertiary study, secondary study

1 introduction

a systematic study1 aims to identify, select, evaluate, interpret, and summarize available studies considered relevant to a topic or phenomenon of interest. individual studies that contribute to a systematic study (systematic literature reviews – slr – or systematic mappings – sm) are primary, while the systematic study itself is considered secondary. historically, systematic studies, especially slrs, have been employed in the medical area and are recognised as critical components to support evidence-based medicine (clarke and chalmers, 2018). inspired by the success in the medical field, evidence-based software engineering (ebse) was first proposed to advance and improve the discipline of software engineering (se) (kitchenham et al., 2015). currently, a larger community is formed around ebse, composed of researchers who have conducted systematic studies in se.

1throughout this work, the term "systematic study" encompasses systematic literature review (slr), systematic mapping (sm), as a more open form of slr, and tertiary studies, as slrs of slrs. details on functional similarities and differences between slr and sm are found elsewhere (napoleão et al., 2017).

informal literature reviews are relevant for research initiatives, especially when they follow good practices. however, they lack scientific rigour and are subject to investigation bias. reviews based on a rigorous process ensure auditable, reproducible, and unbiased results for all stakeholders. one of the reasons systematic studies have been preferred over informal reviews in se is their advantages, including the reduction of bias in the results and the possibility of identifying and combining the main differences between data from the various studies selected in the review (egger et al., 1997). another advantage is identifying gaps in current research, which may suggest new research themes and provide a suitable way to position these themes in the context of existing research. other benefits include (kitchenham and brereton, 2013):

• a well-planned systematic study avoids bias in the analysis of primary studies;
• a systematic study allows researchers to answer research questions that cannot be answered based on a single primary study;
• a systematic study can help researchers to test theoretical hypotheses that otherwise could not be tested based on primary studies; and
• results of a systematic study can be used to understand the efficacy and the efficiency of a method or a technology; alternatively, they can point out the strengths and weaknesses of methods and technologies under certain circumstances.

in that context, felizardo et al. (2020) affirm that systematic studies are valuable to graduate students. regarding the main benefits of conducting systematic studies during a phd research project, the most significant ones are providing an overview of the literature, finding research opportunities, learning from studies, and providing baselines to assist new research efforts. in particular, sms can significantly benefit researchers in establishing baselines for further research activities, such as choosing a dissertation topic for a phd degree considering research trends that cannot be tracked over time (research gaps) (souza et al., 2015). another advantage includes using the reviews' findings to support decisions made in the research project.
one expects phd students to produce a compelling literature review. this review is a critical doctoral component, since it allows students to thoroughly understand the topic they will work on and to be familiar with the results obtained by other researchers. therefore, secondary studies are a proper methodology for writing a compelling literature review. moreover, during the review conduction, students are trained in searching for and selecting relevant literature, assessing the quality of the selected literature, and summarising/presenting the achievements. these are skills that every phd candidate must acquire during his/her doctorate.

there are numerous motivations for conducting a secondary study, such as those reported in felizardo et al. (2020):

• systematic studies' results may identify suitable areas for future research – i.e., the original topic of investigation and the research questions to be answered during a phd project – aiming at the advance of the state of the art in the research topic;
• those studies can replace traditional narrative literature reviews, providing the currently available research findings;
• results of primary studies selected in a systematic study can be used as a baseline for comparison with ongoing, recent research results;
• the findings of systematic studies guide phd research efforts, e.g., researchers could consider the systematic studies' findings for choosing appropriate research methods; and
• the systematic study may be published, externalising the acquired knowledge and contributing to the ebse field.

because of these advantages, several se researchers advocate for phd students using systematic studies (clear, 2015; pejcinovic, 2015; kuhrmann, 2017; kaijanaho, 2017). for example, souza et al. (2015) describe a successful application of secondary studies to guide the decisions of a doctorate. this article reflects upon our experience using systematic studies in developing a phd project. therefore, this study aims to present how systematic studies' findings impact an academic project. specifically, the main goals of this research are to:

• present a successful case in which systematic studies had great importance in the conduction of a phd project;
• exemplify how the best available evidence provided by systematic studies can ground a project's decisions;
• reinforce the importance of systematic studies in conducting a research project;
• report our experiences conducting secondary and tertiary studies as part of a phd research project (kudo, 2021); and
• inspire graduate students with our lessons learned and recommendations for undertaking systematic studies in their research projects.

in summary, one tertiary review and one secondary study were conducted to support a phd project's decisions in the requirements engineering (re) domain. our main conclusion is that systematic studies have many advantages and, therefore, graduate students should consider doing at least one review during the doctorate.

the remainder of the paper is organized as follows. section 2 introduces the software requirements patterns theme.
section 3 presents a phd research project, showing how systematic studies' results guided its conduction. sections 4 and 5 discuss the lessons learned and the threats to this work's validity, respectively. section 6 addresses the related work, focusing on using systematic studies to guide a phd research. finally, section 7 presents our concluding remarks.

2 software requirement pattern

incorrect, omitted, misinterpreted, or conflicting requirements usually result from poorly executed re activities (franch, 2015). as a result, software projects in such a scenario often struggle with software that does not meet quality requirements, cost and time overruns, and unsatisfied users. requirements reuse is a practical approach to mitigate those issues (irshad et al., 2018): the core idea is reusing the knowledge acquired in previous projects to make re activities more prescriptive and systematic. a widely discussed reuse approach is the software requirement pattern (srp) abstraction, which aggregates behaviours and services observed in multiple similar applications (withall, 2007). usually, an srp guides requirements elicitation and specification through well-defined templates that can be reused in later specifications (costal et al., 2019). for instance, one can create an srp representing a user authentication feature, commonly found in several applications, and make appropriate adaptations, if necessary.

an srp's anatomy defines its structure and content, not the requirements that might result from it. however, to be helpful as a guide for writing software requirements, the srp needs to consider situations likely to be encountered in the type of requirement built upon it. thus, an srp is more substantial than a requirement, and its specification is quite a demanding task (withall, 2007). there are srp proposals for multiple sorts of systems, such as embedded (konrad and cheng, 2002), cloud computing (beckers et al., 2014), and call-for-tender (costal et al., 2019) systems. these studies demonstrate that srp can promote greater efficiency in requirements elicitation, quality and consistency improvements in the requirements specification, gains in the development team's productivity, and better requirements management support.

3 from systematic studies to a phd research project

this section's goal is three-fold: first, it introduces the research method types that helped ground the doctoral project; second, it describes two systematic studies performed from planning to results analysis; and third, it demonstrates how these studies' results contributed to the definition of the phd research proposal (kudo, 2021).

figure 1. an overview of our experience with phd research decisions based on evidence provided by systematic studies.

figure 1 illustrates how the best available evidence provided by the systematic studies — a tertiary review (kudo et al., 2020a) and systematic mappings (kudo et al., 2019a,b) — guided decisions during the phd project reported in this paper. each step in figure 1 is described next.

3.1 research method

despite the differences between the methods, systematic studies (slr and sm) are conducted using a process composed of three main phases (kitchenham et al., 2015): planning, conduction, and reporting. during the first phase, the review objectives and a protocol are defined. the protocol formalises the criteria and procedures for selecting, extracting, and summarising the data, from the research questions' definition, through the search strategy, to the final report.
the protocol aims to reduce likely bias and ensure that researchers can reproduce the review by adopting the same criteria and procedures. according to the protocol, primary studies are retrieved, selected, and evaluated during the conduction phase. then, in the reporting phase, the studies that meet the review purpose are summarised, together with data extraction and synthesis, which can be descriptive, complemented with a quantitative summary obtained through statistical calculation.

sms and tertiary studies are other types of reviews that complement slrs. an sm is a more open form of slr, providing an overview of a research area to assess the quantity of evidence existing on a topic of interest (petersen et al., 2015). a tertiary study is a review that focuses only on secondary studies (slr/sm). the conduction of a tertiary study is proper in domains where some high-quality slrs or sms exist. the process used to conduct a tertiary study is the same as for slrs (kitchenham et al., 2015).

as depicted in figure 1, we conducted a pilot search for slrs/sms on the srp topic performed by third parties. we then conducted a tertiary review on the state of the art and practice in srp (kudo et al., 2020a), as we found some high-quality secondary studies on the same topic. in the tertiary review, we mapped the main topics covered and the research gaps on srp (the tertiary study's main contribution in figure 1). to lessen these gaps, we elaborated a seven-item research agenda with lines of investigation (details in the next section) to approximate academics' and professionals' interests regarding improving requirements quality through srp. remarkably, we noticed that the secondary studies reported srp only in the re phase (item 1 in figure 1). as software requirements influence the remaining phases of the development process, we identified a potential research gap concerning the benefits of using srp in other development phases besides re. this finding motivated us to conduct an sm to identify primary studies reporting the use of srp in software design, construction, testing, and maintenance. the sm results pointed out eight primary studies on srp applied to design, one to construction, one to testing, and none to maintenance. these results revealed a research problem to investigate: the lack of evidence on the srp benefits for other development phases (the sm's main contribution in figure 1). as re activities significantly impact other development phases, such as testing, we contributed a novel approach to aligning re and testing in which reuse through srp and software test patterns (stp) are core elements (phd research proposal in figure 1). an stp is an abstraction of generic testing solutions to recurrent behaviours from different scenarios. unfortunately, recent literature reports that most companies still face adverse effects (cost, rework, and delay) from a weak alignment between requirements and testing (bjarnason and borg, 2017; ebert and ray, 2021). further details about how the findings of the tertiary review and the sm drove our efforts throughout the doctoral research are presented next.

3.2 tertiary study on requirement patterns

having recognised the importance of systematic studies for powerfully grounding a phd research proposal, a question arose: are there already systematic literature studies on srp? to answer this question, a tertiary study was performed, as described next. the tertiary review employed the methodology used in classic tertiary studies in se (kitchenham et al., 2010).
besides, it took advantage of the start tool (fabbri et al., 2016) to support the whole study protocol, from planning to reporting. the tertiary review protocol included three general research questions (rq) defined in the planning phase:

rq1 – what is the state of the art in requirement patterns?
rq2 – what are the most searched topics on requirement patterns?
rq3 – what are the current gaps in requirement patterns research?

activities performed in this tertiary review include automatic search, elimination of duplicates, selection of secondary studies on srp, snowballing (wohlin, 2014), quality assessment (zhou et al., 2015), and data extraction and synthesis. the following is the final search string used in the automatic search activity:

("requirement pattern" or "requirement template") and (survey or "systematic review" or "systematic literature review" or "systematic mapping" or "systematic literature mapping")

this process identified 40 secondary studies, organised as follows: acm dl (4), engineering village (13), ieee xplore (2), science direct (11), and scopus (10). after excluding duplicate papers and applying selection criteria, four secondary studies remained. concerning the snowballing technique, we examined the bibliographic references of each of these four papers to identify further relevant studies; however, we found no additional relevant paper. next, we assessed the quality of each secondary study using four criteria: description level of inclusion and exclusion criteria, search coverage, primary studies quality evaluation, and description level of primary studies. as no paper was removed after data extraction, four secondary studies on srp (irshad et al., 2018; palomares et al., 2017; da silva and benitti, 2011; justo et al., 2018) contributed to formulating answers to the review's research questions. the conclusions made are as follows:

• the number and publication dates of secondary studies on srp (representing 44 non-duplicate primary studies) confirm that srp is not a stagnant research topic, with contributions throughout the decade – (rq1).
• the most searched topics regarding srp are representation format, availability, scope, and purpose – (rq2).
• research gaps found include professionals' unfamiliarity with srp, few validations in industry, the need for metrics and tools to enable the effective use of srp in industry, and the lack of secondary studies on how srp benefits the software life cycle – (rq3).

the analysis of those four secondary studies resulted in a research agenda to cover the gaps found between the states of art and practice in srp. a research agenda is a formal plan of action that summarises specific activities to guide the phd conduction and the time to execute them. as depicted in figure 1, the tertiary review's main contribution is a research agenda composed of the following items:

1. the demonstration of the benefits of srp in other phases of the software development process in industry software projects — none of the secondary studies analysed explicitly identified this gap;
2. traceability mechanisms between requirements represented as patterns and artefacts produced in other development phases — this is another research topic not reported in any of the secondary studies analysed, and it is complementary to item 1;
3. the joint use of srp and existing, well-established methodologies in the software industry, such as agile approaches;
4. the development of tools that effectively support professionals' practices in the use of srp;
5. the dissemination of current and future catalogues of srp in a systematised manner;
6. the definition of objective metrics to help professionals measure the impact of the use of srp as described in items 1 to 3;
7. the collection of evidence of the effective use of srp, particularly in the re process of industry software projects.

3.3 sm on requirement patterns and the software life cycle

according to brereton et al. (2009), summarising the results of primary studies through secondary studies is a valuable research mechanism for providing knowledge of a given topic and supporting the identification of topics for future research. therefore, influenced by items 1 and 2 of the tertiary review's research agenda (see figure 1), an sm was planned and conducted (kudo et al., 2019a,b) to investigate srp usage in other phases of the software development life cycle (sdlc) and the traceability between srp and the specifications produced in these phases. based on this goal, the sm included three research questions:

rq1 – at what sdlc phases are srp used: design, construction, testing, and/or maintenance?
rq2 – is there evidence of srp usage in practice at those sdlc phases?
rq3 – are there reported benefits of using requirement patterns at those phases? if so, what metrics are used to measure these benefits?

a trade-off analysis between coverage and relevance of the results of a pilot search preceded the definition of the final search string, presented next:

("requirement pattern" or "requirement patterns" or "requirements pattern" or "requirements patterns") and ("software development" or "development process" or "life cycle" or design or construction or coding or implementation or test or integration or maintenance)

activities performed in this sm include automatic search, elimination of duplicates, the application of selection criteria, snowballing, quality assessment, and data extraction and synthesis. target studies in this sm are, thus, primary studies on srp not employed in re.
• there is only one primary study that demonstrates, through metrics and experimentation, that srp integrated with software design artefacts implies significant development time savings; the corresponding metrics are drr (degree of requirement realisation) and dpr (degree of pattern realisation) – (rq3). figure 2. mapping of the types of requirements and validation on srps for softwaredesign,construction, testing,andmaintenance(kudoetal.,2019b). then, we drew two conclusions from the analysis of these data: 1. there is an open field for research on srp adoption in other sdlc stages (only 10); in contrast, most research efforts still focused on re (76 primary studies). 2. there was little empirical evidence of the benefits of srp beyond re as we found only one case study and one experiment report. those sm results contributed to the phd research problem definition, as shown in figure 1: the lack of research on srp in other sdlc stages. 3.4 project’s decisions based on the best available evidence this section recaps the evidence found in our secondary studies that guided a phd research in re. moreover, it associates each primary study developed by that graduate student with the pieces of evidence resulting from the tertiary review and the systematic mapping. additional information on each thesis product is available in kudo (2021) and kudo et al. (2019c, 2020b,c, 2022). items 1 and 2 of the research agenda (kudo et al., 2020a) inspired the conduction of the sm as none of the target studies described the benefits of srps in other sdlc phases, nor how to trace such support upon the development process. the evidence found in the sm (kudo et al., 2019a,b), in turn, led to the definition of the phd research problem: the lack of research on srp beyond re, motivated by the potential benefits that srp can bring to the sdlc (e.g., better quality specifications, reduced development time, and improved team productivity). moreover, the tertiary review’s research agenda items combined with the strong influence of re activities on software testing contributed to the definition of the phd research proposal (see figure 1): the alignment of re and testing phases through patterns, i.e., srp and stp. except for item 7, every research agenda item guided this phd work experience, as illustrated in figure 1. following item 2, the phd proposal endeavoured a novel srp approach, called software pattern metamodel (sopamm) (kudo et al., 2019c, 2020b). sopamm is a metamodel that represents, relates, and classifies software patterns in general and srp and stp in particular. influenced by item 3, sopamm borrows concepts and practices from the behaviour-driven development (bdd) agile methodology (chelimsky et al., 2010). in sopamm, functional requirement patterns (frp) are described as user stories associated with behaviours and test data using the gerkhin language. frp’s behaviours, in turn, are linked to acceptance test patterns (atp) through test cases. inspired by item 4, the terminal model editor (tmed) tool was developed to help with the elaboration of sopammbased pattern catalogues. a catalogue is a means of systematically gathering patterns, usually addressing the most common problems for a particular application domain. what differentiates tmed from related tools (palomares et al., 2011; barcelos and penteado, 2017) is that it handles other software patterns instead of srp only. with the support of the tmed tool, four pattern catalogues with srp and stp aligned (the research agenda’s item 5) were developed. 
one supports the certification of electronic health record systems (kudo et al., 2019c, 2020b; martins et al., 2021), another represents behaviour-driven requirements of internet of things (iot) systems, and two catalogues describe common functionalities and behaviours for user authentication and registration. finally, as the quality of the sopamm metamodel may impact the quality of pattern catalogues, which may in turn influence the quality of software specifications, the metamodel quality requirements and evaluation (mquare) framework was devised (kudo et al., 2020c). mquare comprises metamodel quality requirements and metrics, a metamodel quality model, and an integrated evaluation process. using mquare, sopamm's levels of compliance, conceptual suitability, usability, maintainability, and portability were recently evaluated in a controlled experiment (kudo et al., 2022). thus, mquare is the first effort toward addressing the research agenda's item 6, providing objective metrics to evaluate metamodel characteristics that may affect the quality of the software artefacts relying upon it. finally, the research agenda's item 7 is future work of the reported phd thesis: it demands empirical work in collaboration with the software industry, a later effort of our research group.

4 discussion

this section presents our lessons learned from undertaking systematic studies in an academic context. we believe these lessons can help phd candidates perform systematic studies in their research.

1. choose the correct systematic study type – phd students can conduct three types of reviews: slr, sm, or tertiary review. in particular, in the example of this paper, the phd student conducted two different systematic literature studies, one tertiary and one secondary (an sm). the choice of systematic study must consider, for example, the amount of evidence available. an sm may be more appropriate than an slr in domains with very little evidence related to a research topic, or where the topic is vast. on the other hand, in domains where several slrs already exist, it may be possible to conduct a tertiary review (an slr of slrs) to answer broader research questions. sms may also be helpful to phd students who are required to draw an overview of the existing evidence concerning a research topic. despite that, it is essential to consider that mapping study results may be more limited than slr results. an slr would be inappropriate if the research question is too vague or broad, but also if the question is too narrow: the first case would yield hundreds of studies, and the second would yield too few studies to be helpful. the conduction of a tertiary review is potentially less resource-intensive than conducting a new slr; however, it depends on sufficient quality slrs being available. in our experience, the quality of the existing systematic studies on srp geared us towards a tertiary review on that topic. moreover, the tertiary review's results were determinant for the conduction of an sm.
2. use systematic studies conducted by third parties, when appropriate – phd students should consider three critical points when reusing systematic studies conducted by third parties:
• whether the published slr in se meets the phd's goal, i.e., whether the slr research questions are related to the subject the student wants to investigate;
• whether the published slr uses valid methods and was well conducted, so as to ensure its credibility; and
• whether the slr is up to date, to avoid building on an outdated picture of the state of the art.

in this context, mendes et al. (2020) recommend a decision framework for judging whether slrs in se need to be updated. we have recently noticed an increasing number of slrs published in the se area. however, we occasionally see independent research teams conducting slrs on the same topic, leading to duplicated reviews and potentially wasted effort. therefore, before undertaking a systematic study, phd students should ensure that a review is necessary, i.e., they should identify and review any existing study related to their research focus. in addition, when phd students decide to conduct a systematic investigation, they must be aware that the findings may be helpful to future students. moreover, conducting an slr whose benefits are not restricted to one specific research project can pay off in several ways: it avoids duplicate work by other students, increases confidence in findings, and catalyses new collaborations among students and other researchers. in our experience, we conducted a novel sm on the srp topic because the existing secondary studies focused on srp solely applied to the re phase. research collaborations have arisen from the findings reported in this phd experience (martins et al., 2021; kudo et al., 2022).

3. the need for a previous pilot search step – a pilot search is a reasonable first step before conducting systematic studies on the same or a closely related target topic. a pilot search may reveal high-quality systematic studies on the topic of interest, motivating the conduction of a tertiary review (as we did) rather than a new secondary study.

4. experience reduces effort – establishing the first review protocol was a complex task and consumed considerable time and effort. however, it was essential to assure the quality of the tertiary review. the knowledge acquired from the first review made elaborating the sm protocol easier, since procedures and forms were reused and adapted. moreover, we could assess the quality level of candidate studies more quickly by comparing them to studies previously read. access to information was also faster, as we already knew how it was organised (i.e., the typical paper structure in the srp context). finally, an experienced researcher familiar with the review subject must be part of the review team. in our experience, she supported the definition of keywords and synonyms for the search string's main terms and the synthesis of results, among other important decisions.

5. attention to open research issues in secondary studies – when conducting a tertiary review, identify the open research issues described in each secondary study. under the assessment of an experienced researcher, e.g., the phd advisor, these open issues may turn into candidate research gaps. in this phd experience, we found three open issues in the analysis of the secondary studies: the lack of
professionals' knowledge about srp, the low number of evaluation research on srp, and the need for tools for the effective use of srp in industry. these were essential to derive the seven lines of action of the tertiary review's research agenda.

6. sm results may identify suitable areas for future research – sm results are usually synthesised in a bubble chart, as depicted in figure 2. when synthesising the findings of an sm, a phd student should choose and group three relevant pieces of information, assign the most important one according to the study's objective to the ordinate axis, and distribute the remaining two to the positive and negative abscissa axes. in our sm, sdlc phases, software requirement types, and research validation types are the ordinate axis and the negative and positive abscissa axes, respectively. then, the phd student should count the number of primary studies addressing two information axes simultaneously — for instance, crossing information from the ordinate and negative abscissa axes. the smaller the number of primary studies crossing two axes, the smaller the bubble. in our experience, we identified suitable areas for future research in figure 2: little research on srp in construction (one study), testing (one), and maintenance (none), the predominance of studies on non-functional requirement patterns (8 of 10), and the need for more mature research on srp in the sdlc (1 of 10).
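to make the recipe concrete, the following matplotlib sketch of ours plots such a bubble chart; the counts are illustrative only and do not reproduce figure 2's exact data:

```python
import matplotlib.pyplot as plt

# (x, y, count): y indexes the sdlc phase on the ordinate; x < 0 indexes
# requirement types and x > 0 validation types on the two half-axes.
bubbles = [(-2, 0, 4), (-1, 0, 3), (-1, 1, 1),  # requirement-type crossings
           (1, 0, 1), (2, 0, 1)]                # validation-type crossings

fig, ax = plt.subplots()
for x, y, n in bubbles:
    ax.scatter(x, y, s=400 * n, alpha=0.4)  # bubble area tracks the count
    ax.annotate(str(n), (x, y), ha='center', va='center')
ax.axvline(0, color='grey', lw=0.8)         # splits the two half-planes
ax.set_yticks([0, 1]); ax.set_yticklabels(['design', 'testing'])
ax.set_xticks([-2, -1, 1, 2])
ax.set_xticklabels(['security', 'adaptability', 'case study', 'experiment'])
plt.show()
```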
5 threats to validity

finding all relevant papers on a particular theme is challenging. for this reason, both systematic studies included a previous pilot search under the supervision of an srp expert, and a standard vocabulary for se helped the search string definition process. furthermore, both search strategies comprise automatic search – in at least five relevant sources for se – and the snowballing technique. we also assessed the quality of the target papers to reduce a likely bias in the analysis and synthesis steps. the tertiary review protocol includes widely accepted quality criteria (centre for reviews and dissemination, 2002; cruzes and dybå, 2011), and the sm protocol describes nine criteria regarding general (jamshidi et al., 2013) and specific aspects of primary studies on srp. thus, both the quality criteria and the scores for each study analysed allowed us to better weigh the value of individual studies when synthesising results, strengthening the evidence's reliability. besides, three researchers conceived the protocols of both systematic studies:
• researcher a has expertise in re and conducted the identification, selection, quality assessment, extraction, and synthesis of relevant secondary and primary studies;
• researcher b is an expert in se and, to mitigate the possibility of biases throughout the process, verified the results of all phases; and
• researcher c is the team leader, with vast experience in se, consulted in the case of divergences not solved between researchers a and b.

finally, we summarise our recommendations for common threats that phd students can face during the planning and conduction of a systematic study. these general recommendations include:
• a previous pilot search for systematic studies on the graduate's topic of interest;
• the aid of both an expert and a standard glossary in the search string definition process;
• a hybrid search strategy to expand the search coverage; and
• the quality assessment of target studies and a well-coordinated team, both to mitigate research bias.

6 related work

felizardo et al. (2020) highlight the relevance of using secondary studies as a research methodology for conducting se research projects. this study aimed to explore se researchers' perceptions, mainly those of msc and phd students and their supervisors, about the value of secondary studies and how these perceptions impact decisions on conducting their research. the authors applied two empirical research methods. first, they performed an sm to identify primary studies that used secondary ones as a research methodology for conducting se research projects. second, the authors surveyed se researchers to determine their perception of the value of performing secondary studies to support their research projects. in summary, felizardo et al. (2020) showed the main benefits of using secondary studies as a research methodology: identifying relevant research, finding reasons to explain why a research project should be approved, and supporting decisions made. the study reflected upon the value of secondary studies in developing academic projects. in agreement with other authors (dybå et al., 2005; kitchenham et al., 2011; zhang and babar, 2011), felizardo et al. (2020) highlight that a systematic secondary study is a valuable research mechanism for providing knowledge of a given topic and identifying gaps for future research. however, what is not yet clear is how this knowledge helps to conduct msc/phd research projects.

one of the categories investigated in the sm by felizardo et al. (2020) was the application of secondary studies. this classification summarises how the findings of such analyses can guide efforts in research projects. to the best of the authors' knowledge, only souza et al. (2015)'s work fits this category. souza et al. (2015) show how the findings of an sm drove their research efforts in conducting a project on knowledge management (km) in software testing. among the sm results, the following stand out: (i) the central problem in software organisations related to software testing is the low knowledge reuse rate and barriers to knowledge transfer; and (ii) reuse of test cases is the perspective that has received the most attention. from the sm results, the authors decided to conduct two slrs, developed a testing ontology, and performed a survey to define a scenario for applying km in software testing. the survey aimed to identify the testing activities in which km is more valuable or appropriate for reuse. from the survey results, the most suitable scenario in the software testing domain was established for applying km. finally, considering the survey results and the ontology, a km system was developed to manage testing knowledge repositories, supporting, e.g., test case reuse.

comparatively, our work followed a similar approach: the results of the secondary studies served as a basis for follow-on research activities. before undertaking a secondary study, both we and souza et al. (2015) performed a tertiary review looking for secondary studies investigating the same topic. likewise, based on the results of the tertiary review, an sm was planned and conducted in both studies, directing the project's decisions or defining other empirical approaches later used.

7 conclusion

especially for phd research projects, originality is mandatory. moreover, since students must investigate the state of the art in depth, it is essential to do it correctly. this work reports our experience conducting phd research guided by systematic studies.
we also highlight our lessons learned and recommendations that other researchers can use to guide their doctoral process. we explained the criteria phd candidates should consider to choose the correct systematic study type and to reuse systematic studies conducted by third parties. we also showed that a previous pilot search is desirable before conducting a secondary study on any topic. in addition, the experience acquired in performing systematic studies reduces the effort in similar works. moreover, a deep analysis of the open research issues found in secondary studies may be valuable to delimit gaps that can guide other investigations on the same topic, e.g., including a new secondary study with a more profound view of that theme. we also explained how the results of a systematic mapping help identify future research. finally, we provided phd students with recommendations to mitigate common threats they can face during a systematic study. we believe phd candidates in any area of study can adapt or reuse the lessons and recommendations outlined in this research experience.

acknowledgements

this study was financed in part by the coordenação de aperfeiçoamento de pessoal de nível superior – brasil (capes) – finance code 001.

references

barcelos, l. and penteado, r. (2017). elaboration of software requirements documents by means of patterns instantiation. j softw eng res dev, 5(3):1–23.
beckers, k., côté, i., and goeke, l. (2014). a catalog of security requirements patterns for the domain of cloud computing systems. in proceedings of the acm symposium on applied computing, pages 337–342.
bjarnason, e. and borg, m. (2017). aligning requirements and testing: working together toward the same goal. ieee software, 34(1):20–23.
brereton, p. o., turner, m., and kaur, r. (2009). pair programming as a teaching tool: a student review of empirical studies. in 22nd conference on software engineering education and training (csee&t '09).
centre for reviews and dissemination (2002). the database of abstracts of reviews of effects (dare). effectiveness matters, 6(2):1–4.
chelimsky, d., astels, d., helmkamp, b., north, d., dennis, z., and hellesoy, a. (2010). the rspec book: behaviour driven development with rspec, cucumber, and friends. pragmatic bookshelf, raleigh, nc, 1st edition.
clarke, m. and chalmers, i. (2018). reflections on the history of systematic reviews. bmj evidence-based medicine, 23:121–122.
clear, t. (2015). 'follow the moon' development: writing a systematic literature review on global software engineering education. in 15th koli calling conference on computing education research, koli calling '15, pages 1–4. acm.
costal, d., franch, x., lópez, l., palomares, c., and quer, c. (2019). on the use of requirement patterns to analyse request for proposal documents. in laender, a. h. f., pernici, b., lim, e., and de oliveira, j. p. m., editors, conceptual modeling – 38th international conference, er 2019, salvador, brazil, november 4-7, 2019, proceedings, volume 11788 of lecture notes in computer science, pages 549–557. springer.
cruzes, d. and dybå, t. (2011). research synthesis in software engineering: a tertiary study. information & software technology, 53(5):440–455.
da silva, r. c. and benitti, f. b. v. (2011). writing standards requirements: a systematic literature mapping. in proceedings of the 14th workshop on requirements engineering, pages 259–272, rio de janeiro, rj, brazil.
dybå, t., kitchenham, b. a., and jørgensen, m. (2005). evidence-based software engineering for practitioners. ieee software, 22(1):58–65.
ebert, c. and ray, r. (2021). test-driven requirements engineering. ieee software, 38(01):16–24.
egger, m., smith, g., and phillips, a. (1997). meta-analysis: principles and procedures. bmj, 315:1533–1537.
fabbri, s. c. p. f., silva, c., hernandes, e. m., octaviano, f., di thommazo, a., and belgamo, a. (2016). improvements in the start tool to better support the systematic review process. in 20th international conference on evaluation and assessment in software engineering (ease '16), pages 21:1–21:5.
felizardo, k. r., de souza, e. f., napoleão, b. m., vijaykumar, n. l., and baldassarre, m. t. (2020). secondary studies in the academic context: a systematic mapping and survey. journal of systems and software, 170:110734.
franch, x. (2015). software requirements patterns: a state of the art and the practice. in proceedings of the 37th international conference on software engineering – volume 2, icse '15, pages 943–944, piscataway, nj, usa. ieee press.
irshad, m., petersen, k., and poulding, s. (2018). a systematic literature review of software requirements reuse approaches. inf. softw. technol., 93(c):223–245.
jamshidi, p., ghafari, m., ahmad, a., and pahl, c. (2013). a framework for classifying and comparing architecture-centric software evolution research. in 2013 17th european conference on software maintenance and reengineering, pages 305–314.
justo, j. l. b., benitti, f. b. v., and leal, a. c. (2018). software patterns and requirements engineering activities in real-world settings: a systematic mapping study. computer standards & interfaces, 58:23–42.
kaijanaho, a.-j. (2017). teaching master's degree students to read research literature: experience in a programming languages course 2002–2017. in 17th koli calling int. conference on computing education research (koli calling '17), pages 143–147, new york, ny, usa. acm.
kitchenham, b. and brereton, o. (2013). a systematic review of systematic review process research in software engineering. information and software technology, 55(12):2049–2075.
kitchenham, b., budgen, d., and brereton, o. (2011). using mapping studies as the basis for further research – a participant-observer case study. information and software technology, 53(6):638–651.
kitchenham, b., budgen, d., and brereton, p. (2015). evidence-based software engineering and systematic reviews. chapman & hall/crc innovations in software engineering and software development series. chapman & hall/crc.
kitchenham, b. a., pretorius, r., budgen, d., brereton, p. o., turner, m., niazi, m., and linkman, s. g. (2010). systematic literature reviews in software engineering – a tertiary study. information & software technology, 52(8):792–805.
konrad, s. and cheng, b. h. (2002). requirements patterns for embedded systems. in proceedings ieee joint international conference on requirements engineering, pages 127–136, essen, germany. ieee.
kudo, t. n. (2021). a metamodel for the alignment of requirement patterns and test patterns and a metamodel evaluation framework. phd thesis, federal university of são carlos, são carlos-sp, brazil. (in portuguese).
kudo, t. n., bulcão-neto, r. f., macedo, a. a., and vincenzi, a. m. r. (2019a). padrão de requisitos no ciclo de vida de software: um mapeamento sistemático [requirement patterns in the software life cycle: a systematic mapping]. in proceedings of the xxii iberoamerican conference on software engineering (cibse '19), pages 420–433. (in portuguese).
kudo, t. n., bulcão-neto, r. f., macedo, a. a., and vincenzi, a. m. r. (2019b). a revisited systematic literature mapping on the support of requirement patterns for the software development life cycle. journal of software engineering research and development, 7:9:1–9:11.
kudo, t. n., bulcão-neto, r. f., and vincenzi, a. m. r. (2019c). a conceptual metamodel to bridging requirement patterns to test patterns. in proceedings of the xxxiii brazilian symposium on software engineering, pages 155–160, new york, ny, usa. acm.
kudo, t. n., bulcão-neto, r. f., and vincenzi, a. m. r. (2020a). requirement patterns: a tertiary study and a research agenda. iet software, 14(1):18–26.
kudo, t. n., bulcão-neto, r. f., and vincenzi, a. m. r. (2020b). uma ferramenta para construção de catálogos de padrões de requisitos com comportamento [a tool for building catalogues of requirement patterns with behaviour]. in anais do wer20 – workshop em engenharia de requisitos, são josé dos campos, sp, brasil, august 24-28, 2020. editora puc-rio. (in portuguese).
kudo, t. n., bulcão-neto, r. f., and vincenzi, a. m. r. (2020c). toward a metamodel quality evaluation framework: requirements, model, measures, and process. in proceedings of the xxxiv brazilian symposium on software engineering, sbes 2020, pages 102–107.
kudo, t. n., bulcão-neto, r. f., graciano neto, v. v., and vincenzi, a. m. r. (2022). aligning requirements and testing through metamodeling and patterns: design and evaluation. requirements engineering journal, pages 1–25. (to be published).
kuhrmann, m. (2017). teaching empirical software engineering using expert teams. in seuh, pages 20–31.
martins, m. c., kudo, t. n., and bulcão-neto, r. f. (2021). padrões de requisitos para sistemas de registro eletrônico de saúde [requirement patterns for electronic health record systems]. in anais do wer21 – workshop em engenharia de requisitos, brasília, df, brasil, august 23-27, 2021. editora puc-rio. (in portuguese).
mendes, e., wohlin, c., felizardo, k. r., and kalinowski, m. (2020). when to update systematic literature reviews in software engineering. journal of systems and software, 167:110607.
napoleão, b., felizardo, k. r., souza, e. f., and vijaykumar, n. l. (2017). practical similarities and differences between systematic literature reviews and systematic mappings: a tertiary study. in 29th international conference on software engineering and knowledge engineering (seke '17), pages 1–10.
palomares, c., quer, c., and franch, x. (2011). pabre-man: management of a requirement patterns catalogue. in 2011 ieee 19th international requirements engineering conference, pages 341–342.
palomares, c., quer, c., and franch, x. (2017). requirements reuse and requirement patterns: a state of the practice survey. empirical software engineering, 22(6):2719–2762.
pejcinovic, b. (2015). development and uses of iterative systematic literature reviews in electrical engineering education. electrical and computer engineering faculty publications and presentations, 327(1):1–10.
petersen, k., vakkalanka, s., and kuzniarz, l. (2015). guidelines for conducting systematic mapping studies in software engineering: an update. information and software technology, 64:1–18.
souza, e. f., falbo, r. a., and vijaykumar, n. l. (2015). using the findings of a mapping study to conduct a research project: a case in knowledge management in software testing. in 41st euromicro conference on software engineering and advanced applications (seaa '15), pages 208–215.
withall, s. (2007). software requirement patterns. best practices. microsoft press, redmond, washington.
wohlin, c. (2014). guidelines for snowballing in systematic literature studies and a replication in software engineering. in 18th international conference on evaluation and assessment in software engineering, ease '14, london, england, united kingdom, may 13-14, 2014, pages 38:1–38:10.
zhang, h. and babar, m. (2011). an empirical investigation of systematic reviews in software engineering. in 5th international symposium on empirical software engineering and measurement (esem '11), pages 1–10.
zhou, y., zhang, h., huang, x., yang, s., babar, m. a., and tang, h. (2015). quality assessment of systematic reviews in software engineering: a tertiary study. in proceedings of the 19th international conference on evaluation and assessment in software engineering, ease 2015, nanjing, china, april 27-29, 2015, pages 14:1–14:14.

journal of software engineering research and development, 2019, 7:9, doi: 10.5753/jserd.2019.458  this work is licensed under a creative commons attribution 4.0 international license.

a revisited systematic literature mapping on the support of requirement patterns for the software development life cycle

taciana n. kudo [ dc-ufscar, são carlos-sp, brazil | taciana@dc.ufscar.br ]
renato f. bulcão-neto [ inf-ufg, goiânia-go, brazil | rbulcao@ufg.br ]
alessandra a. macedo [ ffclrp-usp, ribeirão preto-sp, brazil | ale.alaniz@usp.br ]
auri m. r. vincenzi [ dc-ufscar, são carlos-sp, brazil | auri@dc.ufscar.br ]

abstract

in the past few years, the literature has shown that the practice of reuse through requirement patterns is an effective alternative to address specification quality issues, with the additional benefit of time savings. due to the interactions between requirements engineering and other phases of the software development life cycle (sdlc), these benefits may extend to the entire development process. this paper describes a revisited systematic literature mapping (slm) that identifies and analyzes research demonstrating those benefits from the use of requirement patterns for software design, construction, testing, and maintenance. in this extended version, the slm protocol includes automatic search over two additional search sources, the application of the snowballing technique, and the quality assessment of the relevant ten-study group for data analysis and synthesis. in comparison to previous work, results still show a small number of studies on requirement patterns in the sdlc (excluding requirements engineering). results indicate that there is still an open field for research that demonstrates, through empirical evaluation and usage in practice, the pertinence of requirement patterns to software design, construction, testing, and maintenance.

keywords: requirement pattern, software development life cycle, systematic literature mapping

1 introduction

requirements engineering is a critical development phase in which software functionalities and constraints must be well identified and understood. however, a high percentage of software projects do not meet deadlines and budget due to incomplete, misinterpreted, conflicting, or omitted requirements (tockey, 2015; palomares et al., 2017).
to deal with this issue of quality of requirements specifications, software requirement patterns (srp) have been given special attention in recent years (palomares et al., 2017; irshad et al., 2018). an srp is an abstraction that groups both behaviors and services of applications with similar characteristics. it works as a template for new requirements specifications, and it can also be replicated in future requirements documentation (withall, 2007). for instance, to write a user authentication functional requirement, one can use an srp for this purpose and make appropriate adaptations to the requirement, if necessary. several proposals for srps are found in the literature, such as for embedded (konrad and cheng, 2002), content management (palomares et al., 2013), and cloud computing systems (beckers et al., 2014). among the benefits obtained with the adoption of srps are: (i) greater efficiency in requirements elicitation, since requirements are not identified from scratch; (ii) quality and consistency improvement in the requirements specification document; and (iii) improved requirements management (withall, 2007).

because of the inherent interaction between requirements engineering and other phases of the software development life cycle (sdlc), it is assumed that the benefits of using srps can reach other development activities. although there are secondary studies on software engineering (kitchenham and brereton, 2013), requirements engineering (curcio et al., 2018), and requirement patterns (barros-justo et al., 2018), there is no evidence of secondary studies that analyze the use of srps in other sdlc phases. in short, existing secondary studies are restricted to analyzing the adoption of srps exclusively in the requirements engineering phase.

in recent work, we performed a systematic literature mapping (slm) that identifies and analyses primary studies that evidence the usage of srps in the software design, construction, testing, and maintenance¹ phases (kudo et al., 2019a). the underlying protocol included an automatic search over four sources of information and the definition and application of inclusion and exclusion criteria over the 117 non-duplicate studies found. only nine primary studies were considered relevant, given the research aim (kudo et al., 2019a). results indicated that most of the relevant studies apply srps in software design, but none in software maintenance. moreover, only one study was featured as validation research, while the remaining studies were solution proposals. thus, we concluded that the benefits of srp usage in practice at other sdlc phases are still at an early stage.

in this paper, we revisit the slm described in kudo et al. (2019a) and improve the identification and selection methods of primary studies. besides the inclusion of two additional sources of information in the automatic search process, we also perform the snowballing technique (wohlin, 2014), which identifies relevant studies by scanning the list of bibliographic references or citations of a paper. the inclusion of two sources of studies resulted in 32 extra, non-duplicate papers, from which one novel relevant study arose. considering the 9 relevant primary studies found in our previous work, we obtained a ten-primary-study group in this research. to check whether other essential studies related to this research exist, we also analyzed the list of bibliographic references as well as the citing papers of each of these 10 studies.

¹ we adopt the terminology of the software engineering body of knowledge (swebok) for the sdlc phases (bourque and fairley, 2014).
figure 1. phases and activities of this slm, adapted from (fabbri et al., 2013; wohlin, 2014).

the snowballing technique resulted in 202 non-duplicate papers, from which none was assessed as relevant after the re-application of the inclusion and exclusion criteria. we read the full text of the 10 studies to extract the answers to the slm research questions and, in parallel, to assess the quality of each relevant paper. finally, we synthesized in a bubble graph a map of the remarkable characteristics of this ten-study group. in comparison with our previous work, results continue to point out a lack of research on srps for software design, construction, testing, and maintenance.

the organization of this paper is as follows. section 2 details the protocol of this slm. section 3 reports the data extraction and quality assessment activities regarding the relevant studies. the answers to the research questions in this study and the research gaps are summarized in section 4. finally, section 5 describes the validation threats of this slm, whereas section 6 presents our final remarks.

2 the systematic mapping protocol

in general, a systematic study process can be divided into three distinct phases (fabbri et al., 2013): planning, conduction, and publishing of results. first, a protocol is planned in such a way that one can reproduce it later. this systematic mapping protocol includes the definition of the main goal, research questions, search strategy, search string, sources of studies, and inclusion and exclusion criteria. in the conduction phase, studies gathered from search engines and bibliographic databases are identified and selected using the inclusion and exclusion criteria previously defined. a set of useful information is extracted from these selected studies, which, in turn, can still be excluded from the slm. snowballing is performed over the included papers by first checking their reference lists. the selection of studies from this backward analysis is also based on a prior reading of each paper's title and abstract. the same process is carried out with the citation lists of the papers examined in the data extraction step. forward and backward analyses finish when no new study is included. following the slm goal, the remaining studies constitute the set of relevant papers from which answers to the research questions of the protocol are analyzed and synthesized. a quality assessment activity is also conducted to assist data synthesis from these relevant papers, as suggested by kitchenham et al. (2010). in the publishing phase, the entire protocol and the results of each previous stage are documented as scientific papers or technical reports. the slm presented in this paper is an extension of kudo et al. (2019a)'s work and follows those three phases, as depicted in figure 1.

2.1 research questions and keywords

the main goal of this slm is to identify studies that explore the benefits of requirement patterns for every sdlc phase, except for the requirements engineering process.
based on this goal, the set of research questions (rq) this slm should answer, and the respective justifications, are presented next:

rq1. at what sdlc phases are requirement patterns used: design, construction, testing, and/or maintenance? this question is essential to find out whether there is research on requirement patterns covering other sdlc phases beyond requirements engineering.

rq2. is there evidence of requirement patterns usage in practice at those sdlc phases? this question is relevant to discover empirical evidence on requirement patterns usage at other sdlc phases, i.e., not only solution proposals.

rq3. are there reported benefits of using requirement patterns at those phases? if so, what metrics are used to measure these benefits? this question is useful to find out whether the benefits of requirement patterns (e.g., development time savings, better quality specifications, etc.) have been exploited at other sdlc phases. if so, we want to know how these benefits have been measured.

to support the definition of standardized terms in software engineering, the search terms are borrowed from sevocab (software and systems engineering vocabulary), an iso/ieee initiative to standardize the terms used in software engineering (iso/iec/ieee, 2017). the following is the set of keywords used for the definition of the search string: requirement pattern, development process, software development, life cycle, design, construction, coding, implementation, test, integration, and maintenance.

a search strategy should find relevant studies to answer the research questions. next, we present the search strategy performed in this slm, which includes automatic search and the snowballing technique.

2.2 automatic search

after evaluating the trade-off between coverage and relevance of the search results in a pilot search, we opted for the following combination of keywords² as search string:

("requirement pattern" or "requirement patterns" or "requirements pattern" or "requirements patterns") and (("software development" or "development process") or ("life cycle" or design or construction or coding or implementation or test or integration or maintenance))

besides acm dl³, engineering village, ieee xplore, and scopus, we also performed searches at the sciencedirect and the web of science websites. as in the original slm, we based the searches on study metadata, at least over the abstracts, because of their richer content. table 1 details the number of studies returned per source, both in the original search⁴ (kudo et al., 2019a) and in this revisited version⁵. therefore, 85 additional studies were identified (including duplicate papers) after the inclusion of two new bibliographic databases (sciencedirect and web of science) and the update of the search results over the four original sources of studies.

table 1. number of studies returned per source.
source                original   extension   difference
acm dl                      24          26            2
engineering village        100         106            6
ieee xplore                 23          25            2
scopus                      71          76            5
sciencedirect                –           9            9
web of science               –          61           61
total                      218         303           85

² plural variations of the term "requirement pattern" are necessary due to the capabilities of the search engines of each source of studies.
³ we chose the acm guide to computing literature because it is the most comprehensive bibliographic database on computing, including the full-text collection of all acm publications.
⁴ search carried out from april 24 to may 5, 2018.
⁵ additional search performed on june 3 and 4, 2019.
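as an aside, the boolean structure of the search string above can be approximated in code. the sketch below is ours, not part of the slm protocol, and merely screens a study's metadata; each search engine applies its own query syntax:

```python
import re

# left operand: the four singular/plural variations of the main term.
PATTERN = re.compile(r'requirements? patterns?')
# right operand: any of the context keywords borrowed from sevocab.
CONTEXT = re.compile(r'software development|development process|life cycle|'
                     r'design|construction|coding|implementation|test|'
                     r'integration|maintenance')

def matches_search_string(metadata: str) -> bool:
    """true when the title/abstract text satisfies both operands."""
    text = metadata.lower()
    return bool(PATTERN.search(text)) and bool(CONTEXT.search(text))

# e.g. matches_search_string("a catalog of security requirements patterns "
#                            "for cloud computing systems design") -> True
```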
2.3 selection of primary studies

this section describes the selection method of relevant studies to answer the research questions of this slm. the same original selection criteria were applied to the 303 papers returned by the automatic search process. the exclusion criteria (ec) are:

ec1: it is not a primary study.
ec2: it is not a paper (e.g., a preface or summary of journals or conference proceedings).
ec3: the research is not about srp.
ec4: the research addresses srp in requirements engineering only.
ec5: the full study text is not in english.
ec6: the full study text is not accessible.
ec7: it is a preliminary or short version of another study.

a paper is removed from this slm whenever it meets at least one of the exclusion criteria presented; otherwise, the study is categorized based on the following inclusion criteria (ic). a schematic version of this screening step is sketched at the end of this section.

ic1: it addresses srp in software design.
ic2: it addresses srp in software construction.
ic3: it addresses srp in software testing.
ic4: it addresses srp in software maintenance.

figure 2 depicts the entire selection process with the respective numbers of primary studies chosen and removed in each activity of the conduction phase. after the automatic search process, 155 duplicate papers were identified and removed (from the 303-study group) with the support of the start tool (fabbri et al., 2016). next, we proceeded with the reading of the title, abstract, and keywords of each of the 148 remaining studies, upon which we applied the exclusion and inclusion criteria. as a result, we selected 41 possibly relevant studies; "possibly" because this selection relies on the reading and interpretation of the papers' metadata only. in the data extraction activity, we read the full text of these 41 studies, from which we excluded 31 papers by the ec4 criterion, i.e., their research focus is on srp in the requirements engineering phase. we describe the process of data extraction of the 10 remaining studies in section 3. these studies are identified throughout this paper as s1 to s10, as follows:

s1: adaptive requirement-driven architecture for integrated healthcare systems (yang et al., 2010)
s2: analysing security requirements patterns based on problems decomposition and composition (wen et al., 2011)
s3: an architectural framework of the integrated transportation information service system (chang and gan, 2009)
s4: application of ontologies in identifying requirements patterns in use cases (couto et al., 2014)
s5: effective security impact analysis with patterns for software enhancement (okubo et al., 2011)
s6: from requirement to design patterns for ubiquitous computing applications (knote et al., 2016)
s7: modeling design patterns with description logics: a case study (asnar et al., 2011)
s8: mutation patterns for temporal requirements of reactive systems (trakhtenbrot, 2017)
s9: sacs: a pattern language for safe adaptive control software (hauge and stølen, 2011)
s10: re-engineering legacy web applications into rias by aligning modernization requirements, patterns and ria features (conejero et al., 2013)

table 2. total number of studies removed per exclusion criterion throughout the conduction phase.
activity            ec1   ec2   ec3   ec4   ec5   ec6   ec7   total
automatic search      7    13    40    47     0     0     0     107
data extraction       0     1    10    15     0     2     3      31
snowballing           2     0   186    14     0     0     0     202
total                 9    14   236    76     0     2     3     340
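the screening step announced above can be summarised as a small routine. in this sketch of ours, the ec1–ec7 and ic1–ic4 judgements are abstracted as predicate functions standing in for the reviewers' decisions:

```python
def screen(study, exclusion_criteria, inclusion_criteria):
    """drop a study on the first exclusion criterion it meets; otherwise
    categorise it by the inclusion criteria (sdlc phases addressed)."""
    for name, ec in exclusion_criteria.items():
        if ec(study):                      # any ec hit removes the study
            return ('excluded', name)
    phases = [name for name, ic in inclusion_criteria.items() if ic(study)]
    return ('included', phases)            # e.g. ('included', ['ic1', 'ic3'])
```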
figure 2. a detailed view of the conduction phase: automatic search, duplicate study exclusion, study selection, data extraction, snowballing, and data synthesis.

2.4 snowballing

besides automatic search, our search strategy includes snowballing as an attempt to obtain other relevant studies, using the papers s1 to s10 as input. regarding backward snowballing, we collected the reference list of each paper from the scopus database, resulting in 216 documents whose metadata (title, abstract, and keywords) we stored in the start tool. after the removal of 49 duplicate studies, we read the metadata of the 167 remaining documents to decide on the exclusion or tentative inclusion of each paper for further analysis. as no new paper was found in the first round of backward snowballing, we finished this analysis early. in sequence, we searched the citation list of s1 to s10 on the scopus website, resulting in 44 papers also registered in the start tool. similarly, no new paper was retrieved in the first round of this forward snowballing step, after the removal of 9 duplicate studies and the reading of the metadata of the 35 remaining documents. both snowballing procedures concluded the selection of relevant studies for this slm. figure 2 depicts the total number of studies identified (260), excluded (58), and selected (0) in the overall snowballing process. as a result, the data extraction and synthesis activities include only the studies s1 to s10 previously presented.

finally, table 2 summarizes the removal of studies in the conduction phase. most of the papers removed in the automatic search (87 of 107) are due to the ec3 and ec4 criteria, i.e., they do not address srp, or they do it in the requirements engineering phase only, respectively. studies were excluded at a similar rate (25 of 31) in the data extraction activity. these exclusion rates of around 80% are expected because of the trade-off between coverage and relevance of the search string. in contrast, most of the studies removed during both snowballing procedures (186 of 202) fall under the ec3 criterion. two related reasons explain this 92% exclusion rate: first, in general, the reference list of a paper is far more extensive than the number of studies citing that paper; second, the papers in a reference list often address other research topics. besides, only 7% of the studies referenced by or citing the relevant ones represent research on srps (14 of 202). even so, none of these explores srps at sdlc stages other than requirements engineering.
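the procedure has a fixpoint character: the reference and citation lists of included papers are scanned until a round adds nothing new (wohlin, 2014). the loop below is our illustration, with is_relevant standing in for the metadata screening against the selection criteria:

```python
def snowball(seeds, references, citations, is_relevant):
    """backward/forward snowballing until no new study is included.
    `references` and `citations` map a paper id to the papers it cites
    and to the papers citing it, respectively."""
    included, frontier = set(seeds), set(seeds)
    while frontier:
        candidates = set()
        for paper in frontier:
            candidates |= set(references.get(paper, []))  # backward step
            candidates |= set(citations.get(paper, []))   # forward step
        frontier = {p for p in candidates - included if is_relevant(p)}
        included |= frontier  # a round that adds nothing ends the loop
    return included
```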
3 data extraction

this section describes the data extraction process based on the full-text reading of the 10 relevant studies (s1 to s10) of this slm. besides presenting a comparative analysis of the contribution types of each paper, we also extract:
1. the quality score of each primary study;
2. the type of research carried out;
3. the type of requirement addressed by srp;
4. the sdlc phase supported by srp; and
5. the contribution type.

3.1 quality assessment

quality assessment may be useful for an slm to assure that sufficient information is available to be extracted. however, we concur with petersen et al. (2015) that quality assessment should not pose high requirements on the primary studies, because the main objective of an slm is to give a broad overview of a research topic. rather than using quality criteria for the exclusion of papers, our quality assessment approach assists data analysis and synthesis, e.g., to investigate whether different quality scores are associated with varying outcomes of the primary studies (kitchenham et al., 2010; petersen et al., 2015).

multiple checklists are available in the literature to help the process of assessing the quality of primary studies. here, we evaluated the quality of primary studies through nine quality criteria, of which six are general factors (g1 to g6), as described in jamshidi et al. (2013), and three are particular factors (p1 to p3) that we defined based on the subject of this slm. the full description of every general and particular quality criterion follows, including the respective predefined responses and scores (in parentheses). observe that g2 is the only criterion whose score ranges from 0 to 1, indicating a lower weight in the quality score of each study.

g1 – problem definition of the study.
(2): there is an explicit problem description.
(1): there is a general problem description.
(0): there is no problem description.

g2 – environment in which the study is carried out.
(1): there is an explicit description of the environment in which the research is performed (e.g., lab setting, as part of a project, in collaboration with industry, etc.).
(0.5): there are some general words about the environment in which the research is performed.
(0): there is no description of the environment.

g3 – research design of the study.
(2): there is an explicit description of the plan (different steps, timing, etc.) used to perform the research, or of the way the research is organized.
(1): there are some general words about the research plan or the way the research is organized.
(0): there is no description of the research design.

g4 – contributions of the study.
(2): there is an explicit list of the contributions/results.
(1): there are some general words about the study results.
(0): there is no description of the study results.

g5 – insights derived from the study.
(2): there is an explicit list of insights/lessons learned from the study.
(1): there are general words about insights/lessons learned from the study.
(0): there is no description of the insights derived from the study.

g6 – limitations of the study.
(2): there is an explicit list of the limitations of the study.
(1): there are general words about the limitations of the study.
(0): there is no description of the limitations of the study.

p1 – the srp structure.
(2): there is an explicit description of the srp structure.
(1): there is some general information about the srp structure.
(0): there is no description of the srp structure.

p2 – the integrated use of srps with the sdlc phases.
(2): there is an explicit description of which sdlc phase benefits from srp usage.
(1): there are some general words about which sdlc phase benefits from srp usage.
(0): there is no description of which sdlc phase benefits from srp usage.

p3 – empirical investigation of srp usage in the sdlc phases.
(2): there is an explicit description of an empirical investigation.
(1): there is some general information about the empirical investigation.
(0): there is no description of an empirical investigation.

the relevance of the particular quality criteria (p1 to p3) is as follows. as stated by franch et al. (2010), the reuse of an srp heavily depends on a detailed description of its structure (p1). the p2 criterion is important to identify the adherence of each study to research question rq1, i.e., the sdlc phase supported by srps. finally, the p3 criterion allows distinguishing studies with empirical evidence.

once the general and particular quality criteria are presented, the following is the final quality score (qs) formula, which provides a numerical quantification as a means of ranking the relevant primary studies:

qs = \frac{\sum_{g=1}^{6} g_g}{11} + \frac{\sum_{p=1}^{3} p_p}{6} \times 3 \qquad (1)

where the normalised sums of g1 to g6 and of p1 to p3 may reach maximum scores of 1 and 3, respectively. that is, the particular quality criteria carry a 75% weight in the final quality score because of their higher importance in comparison with the general items.
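as a cross-check of equation (1), the sketch below reproduces the qs values reported in table 3 below. it assumes, as the table suggests, that the partial scores sgc and spc are rounded to one decimal place before being combined:

```python
def quality_score(g, p):
    """final quality score (equation 1): the sum of g1..g6 (maximum 11)
    is normalised to [0, 1]; the sum of p1..p3 (maximum 6) is normalised
    and weighted by 3, so qs ranges over [0, 4]."""
    sgc = round(sum(g) / 11, 1)  # partial scores as reported in table 3
    spc = round(sum(p) / 6, 1)
    return round(sgc + spc * 3, 1)

# study s10: g1..g6 = 2, 1, 2, 2, 2, 1 and p1..p3 = 1, 2, 2
assert quality_score([2, 1, 2, 2, 2, 1], [1, 2, 2]) == 3.3
# study s3: g1..g6 = 2, 0, 2, 1, 0, 0 and p1..p3 = 2, 1, 0
assert quality_score([2, 0, 2, 1, 0, 0], [2, 1, 0]) == 2.0
```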
table 3 presents the full quality assessment of the ten primary studies, in descending order of the final quality score (qs, at the rightmost column). the values assigned to the general and particular quality criteria of every primary study are also available in table 3, as well as the partial total scores for the general and particular quality criteria (sgc and spc, respectively). observe that the p3 criterion contributes to a subclassification of the ten-study group: research with no empirical investigation (s1 to s6, s8, and s9) got a quality score of less than 3.0, while the studies whose quality score is higher than 3.0 (s7 and s10) have more empirical evidence and make their lessons learned explicit (g5 criterion). however, s7 and s10 obtained a lower grade for the p1 criterion because they do not describe the structure of their srp proposals. finally, the lower quality scores are mainly due to the grades of the p2 and p3 criteria; consider the case of studies s3 and s4, which both have no empirical evidence and only partially describe how to employ srps in an sdlc phase.

table 3. a detailed view of the quality assessment results.
study   g1    g2   g3   g4   g5   g6   sgc   p1   p2   p3   spc   qs
s10      2     1    2    2    2    1   0.9    1    2    2   0.8   3.3
s7       2     1    2    2    2    0   0.8    1    2    2   0.8   3.2
s6       2     1    2    2    1    0   0.7    2    2    0   0.7   2.8
s2       1     1    2    2    0    2   0.7    2    2    0   0.7   2.8
s9       2     1    2    2    1    0   0.7    2    2    0   0.7   2.8
s5       2   0.5    2    2    0    0   0.6    2    2    0   0.7   2.7
s8       2     0    2    2    0    0   0.5    2    2    0   0.7   2.6
s1       2     0    2    1    0    0   0.5    2    2    0   0.7   2.6
s4       1   0.5    2    2    0    1   0.6    2    1    0   0.5   2.1
s3       2     0    2    1    0    0   0.5    2    1    0   0.5   2.0

3.2 research type

we classified the ten-study group using petersen et al. (2015)'s criteria, in which a set of conditions determines the type of research developed. for instance, opinion research solely reports the author's point of view about a subject; in this case, there is no usage in practice, empirical evaluation, author's experience report, or proposal of a conceptual framework or a novel solution. table 4 shows that, according to petersen et al. (2015)'s taxonomy, most of the studies (8 of 10) are solution proposals because there is no empirical evaluation: three studies are validated by a proof of concept only, whereas the five remaining do not validate their proposals at all. furthermore, only two of the ten studies are validation research: s7 presents a case study, and s10 describes an experiment under controlled conditions.

table 4. types of research and validation of relevant studies.
type of research      type of validation
solution proposal     proof of concept: s2, s5, s9
                      no validation: s1, s3, s4, s6, s8
validation research   case study: s7
                      experiment: s10

3.3 type of software requirement

next, we analyzed the particular type of software requirement covered by srp, as presented in table 5.
four of the relevant studies define srp for the adaptability requirement, and another four papers for the security one. the srp proposals described in the remaining two studies do not address a specific type of software requirement.

table 5. type of requirement covered by an srp.
type of requirement   studies
adaptability          s1, s3, s6, s8
security              s2, s5, s7, s9
general purpose       s4, s10

3.4 a comparative analysis

next, we describe a detailed comparative analysis of the contributions proposed in s1 to s10, from which we perceived some similarities and identified the sdlc phase supported by their srp solutions.

studies s1 and s3 propose a similar conceptual architecture for systems developed from srps, as illustrated in figure 3. the dashed lines a, b, c, and d show the similarities between the architectures proposed in s1 (left-hand side) and s3 (right-hand side). the requirements layer (a) identifies, analyzes, and models requirements as user requirement patterns (urp). the service layer (b) interacts with the requirements layer and provides services to satisfy the urp. the security and information sharing mechanism (c) establishes a process of reliable information exchange between systems of the same domain. the knowledge base (d) combines standards, norms, and ontologies of the system domain. the motivation for both research efforts is the need to share information between systems of the same area: medical systems (in s1) and transport systems (in s3). regarding s1 and s3 again, these studies make use of srp to support the software design phase. in both studies, a urp in the requirements layer leads to the efficient selection of services in the service layer. a urp is a crucial element not only because it represents user requirements but also because it guides the operation of the entire system.

we also observed commonalities in how s2 and s5 represent security requirements as an srp, as depicted in figure 4. both studies specify security requirement patterns with similar structure and security concepts (context, assets, and threats), as well as protection measures as design patterns. illustrated as dashed lines in figure 4, the steps outlined in s2 (left-hand side) — the identification of stakeholders and objectives, essential information assets, and threat sources using standards — match the following items of the security requirement pattern in s5 (right-hand side), respectively: the pattern definition format (context, problem, solution, and structure), asset, and threat. finally, the step "adding protection measures in the system design" in s2 matches the countermeasure concept described as security design patterns in s5. from this analysis, we concluded that s2 and s5 also make use of srp to benefit the software design phase because they define security requirement patterns and relate them to design-pattern-based protection measures.

figure 3. a comparative analysis of the srp-based conceptual architectures discussed in s1 (left-hand side) and s3 (right-hand side).
figure 4. a comparative analysis of the srp-based security approaches discussed in s2 (left-hand side) and s5 (right-hand side).

as a result of the analysis of s8 and s9, we identified that both studies present proposals of representation formats for requirement patterns.
in s8, each requirement written in natural language binds to a formula written in a linear temporal language, in which mutations mitigate the likely issues in this association. each type of requirement pattern attaches its potential failures and the respective appropriate variations. the formulas associated with mutants have multiple purposes, such as test generation, the assessment of test set adequacy, or the automatic construction of monitors that verify the system's behavior at run-time. thus, the mutations included in the transformation of the requirement patterns contribute to the software testing phase. in the case of s9, a composite pattern integrates three types of software patterns (i.e., requirement, design, and security). based on problem frames theory, this composite pattern uses parameters extracted from an inner requirement pattern, from which a set of functions corresponds both to solutions in a design pattern and to contextual elements in a security pattern. thus, this application of srp is at software design.

the studies s4 and s7 model requirement patterns using ontologies based on formal description logic. as ontology-based srps allow the automatic generation of source code in s4, this srp contribution is to the software construction phase. in study s7, the authors implement a mechanism that automatically binds an ontology-based security requirement pattern to a corresponding design pattern solution. thus, the srp's main contribution in s7 is to the design phase of the sdlc.

in the context of ubiquitous computing (ubicomp) applications, s6 aims to map dependencies between design patterns and requirement patterns. this software pattern integration approach bridges the gaps of the early software development phases, where recurring requirements demand similar design solutions, as in the case of the adaptability requirement for ubicomp applications. consequently, the main contribution of s6 is to the software design phase.

regarding the study s10, it presents a systematic process to modernize legacy web applications into rich internet applications (ria). the core of that process is a set of traceability matrices that relate modernization requirements, ria features, and patterns. a final traceability matrix suggests the most suitable ria patterns for each new requirement based on the values of two different metrics: the degree of requirement realization (drr) and the degree of pattern realization (dpr). once selected, the ria patterns are woven into the legacy models so that those pattern-based ria functionalities are incorporated into the system. the reusability of ria patterns is very clear because the pattern traceability matrix is built once and used in any modernization process, which, in turn, takes less design time. thus, in this approach, srps cover the gap between requirements elicitation and architectural design along the ria development process.

table 6. data extraction from the 10 relevant studies.
type of contribution                              sdlc phase     type of requirement   studies (qs)
conceptual architectures for srp-based systems    design         adaptability          s1 (2.6), s3 (2.0)
representation formats for srp                    design         security              s2 (2.8), s5 (2.7), s9 (2.8)
                                                  testing        adaptability          s8 (2.6)
processes for discovery and use of srp            design         security              s7 (3.2)
                                                  design         general purpose       s10 (3.3)
                                                  construction   general purpose       s4 (2.1)
catalog of srp                                    design         adaptability          s6 (2.8)
3.5 summary

table 6 summarizes the analysis of the ten-study group by the types of contributions identified: conceptual architectures for srp-based systems, processes for discovery and use, representation formats, and catalogs of srp. the final quality score (qs) of each study is at the rightmost column.

4 data synthesis

this section presents a synthesis of the data extracted from the relevant studies to answer the research questions.

4.1 about research question 1

to answer the research question "at what sdlc phases are requirement patterns used: design, construction, testing, and/or maintenance?", eight studies use srps at the design phase, one at construction, one at software testing, and none at software maintenance. among the eight studies that address srps at software design (s1 to s3, s5 to s7, s9, and s10), there are no repeating authors, nor do the studies converge on one or more research groups. two hypotheses can explain the high concentration of studies related to the design phase: the fact that it comes right after requirements engineering, and the increasing usage of design patterns in software development. even though studies s3 and s4 do not clearly state the sdlc phase supported by srps, we consider that their srp proposals bring benefits to the software design and construction phases, respectively (see the p2 criterion).

the significant difference between the number of relevant studies (10) and the number of papers excluded (77) arises because the latter investigate srps exclusively for requirements engineering. this imbalance makes it clear that there is still an open field for research on the benefits of srps for the other sdlc phases, such as testing (one study) and maintenance (none). further evidence is the lack of research on the use of srps along the entire sdlc, from requirements engineering to software maintenance. an example of a challenging study could be the evaluation of the improvements to the sdlc resulting from the adoption of srps, beyond the well-known benefits of time savings and better quality specifications.

4.2 about research question 2

regarding the research question "is there evidence of requirement patterns usage in practice at those sdlc phases?", no study reports evidence of srp usage in the software industry. eight of the ten relevant studies are solution proposals without empirical validation, and only two papers (s7 and s10) are validation research, with the highest quality scores according to our quality assessment. this analysis suggests that future work should be more focused on the use of srps along the sdlc in the software industry.

4.3 about research question 3

to answer the research question "are there reported benefits of using requirement patterns at those phases? if so, what metrics are used to measure these benefits?", s10 is the only study that defines srp-related metrics. we believe that this lack of concern with metrics is because most articles are solution proposals, thus without use in practice. in s10, the metrics drr (degree of requirement realization) and dpr (degree of pattern realization) select candidate ria patterns in the process of re-engineering legacy web applications. a value of 1 in drr indicates that a pattern fully supports all the ria features demanded by the requirement, whereas a value of 0 means that the requirement and the pattern do not share any feature. similarly, a value of 1 in dpr denotes that the requirement demands all the ria features supported by the pattern, whereas a value close to 0 implies that the requirement needs an insignificant share of the ria features supported by the pattern.
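the exact formulas are given in s10 (conejero et al., 2013) and are not reproduced in this slm; the sketch below is only a plausible reading of the semantics above, treating drr and dpr as overlap ratios between the ria features a requirement demands and those a pattern supports. the function names and set encoding are our assumptions:

```python
def drr(demanded: set, supported: set) -> float:
    """degree of requirement realization: share of the ria features
    demanded by the requirement that the pattern supports (assumed)."""
    return len(demanded & supported) / len(demanded) if demanded else 0.0

def dpr(demanded: set, supported: set) -> float:
    """degree of pattern realization: share of the pattern's ria
    features that the requirement actually demands (assumed)."""
    return len(demanded & supported) / len(supported) if supported else 0.0

# drr == 1.0: the pattern covers every demanded feature;
# drr == 0.0: requirement and pattern share no feature.
```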
the experiment results in s10 show that, in the worst case, more than half of the patterns would have been automatically suggested by the authors’ method. furthermore, the synchronization patterns indicated by the approach and those used by developers are the same in all systems tested in the experiment. both results allow concluding that srp usage in s10 implies significant development time savings.

figure 5. mapping of the types of requirements and validation on srps for software design, construction, testing, and maintenance.

4.4 discussion

figure 5 illustrates a bubble graph that synthesizes the information we extracted and analyzed from each relevant paper. observe that four studies (s2, s5, s7, and s9) propose security requirement patterns with contributions to the software design phase. we conclude that this is because security is a recurrent requirement of many software systems, besides being supported by well-established international standards (iso/iec, 2018). however, the studies mentioned above still require more significant validation with empirical assessments and use in the software industry. four studies (s1, s3, s6, and s8) explore srps for the adaptability non-functional requirement: one in software testing (s8) and the others in software design. besides, none of these studies presents any validation of the proposal. s4, in turn, investigates srps for general-purpose requirements used in software construction, but also with no validation. still regarding figure 5, as important as mapping the research endeavors is the analysis of the existing gaps:

1. there is a general lack of investigation on the adoption of srps at other sdlc stages (10), while many research endeavors still focus on requirements engineering (77);
2. adaptability and security are the most addressed non-functional requirements specified as srps at the software design and testing phases, as the left-hand side of the bubble graph shows. however, other types of non-functional requirements can be specified as srps at different sdlc phases, e.g., usability aspects with automated support for code and test case generation;
3. the application of research results on srps in the software industry is still limited (right-hand side of the figure): except for the studies s7 and s10, the remaining are at the proof-of-concept level.

5 threats to validity

finding all relevant research on a topic and selecting evidence of quality are significant problems in systematic studies. three procedures were carried out throughout the planning and conduction phases to reduce the potential threats to the validity of this slm. first, we performed an automatic search strategy that combines six relevant sources of studies with search string terms based on the sevocab standard vocabulary. besides, searching the gray literature (e.g., dissertations, theses, and technical reports) is not part of the protocol because we assume that good quality research is mostly published in journals or conferences. secondly, we were aware that searches could be extended to two additional relevant sources of research, i.e., sciencedirect and web of science.
surprisingly, even after introducing those two new sources, the number of relevant studies resulting from the automatic search increased only from 9 to 10 (the study s10, retrieved from web of science). as a means of retrieving a higher number of papers, we extended the search strategy again by performing the snowballing technique over those ten relevant studies. in spite of this, this hybrid search strategy included no new research. thirdly, we assessed the quality of the primary studies as a means of reducing a likely bias in the analysis and synthesis steps of this slm. the quality criteria we defined and the scores we calculated for each relevant study allowed us to better weight the importance of individual studies when results were synthesized. for instance, the value of empirical evidence and the reporting of lessons learned convey a higher maturity level to the study s10 in comparison to s3, which partly explains the difference between their respective quality scores. finally, to mitigate possible biases in this research, three researchers participated in the planning and conduction phases of this slm as follows. a: with 14 years of experience in requirements engineering, she performed the protocol planning, the study selection, and the data extraction and synthesis. b: with 13 years of experience in software engineering, he also performed the protocol planning; still, his contribution was mostly the verification of the results of the selection, extraction, and synthesis activities. c: the team leader accumulates more than 20 years of experience in software engineering; he helped with the synthesis and writing of the results. should divergences arise, a, b, and c solved the conflicts together.

6 final remarks

in the past few years, the literature has demonstrated the positive impacts of software requirement patterns on requirements specification quality, team productivity, and elicitation and specification costs, among others (barros-justo et al., 2018; irshad et al., 2018). this paper presents a revisited version of a recent work (kudo et al., 2019a) that investigates whether those benefits from srp usage have also been studied for software design, construction, testing, and maintenance. here, we expand the scope of the search strategy with two additional and pertinent sources of studies and the application of the snowballing technique. besides, we carry out a quality assessment activity supporting the data extraction of relevant studies. by adding other databases to the search strategy, we obtained only one new relevant paper (s10) in comparison with our previous slm. however, the study s10 got the highest quality score, it is the only one that defines srp-related metrics, and it is also classified as validation research. concerning the overall snowballing procedure, in spite of scanning both the reference and the citation lists of the relevant studies as a means of finding further research, none of the 260 papers found suited our purposes. this strengthens our claim that the effective use of srps in software design, construction, testing, and maintenance constitutes a gap for future research. we also conclude that the studies’ quality scores corroborate the maturity of each research effort described.
the highest quality score studies (s7 and s10) present more empirical evidence and lessons learned than the remaining investigations on srps in the software design phase (studies s1 to s3, s5, s6, and s9). in general, we are confident that our results are valuable not only for new secondary studies on this same subject but also for future primary research. to promote further research on srps in the whole software development process, we continue to suggest that the academic community approach the software industry to match the latter's expectations effectively. researchers should also establish more metrics that corroborate the advantages of srp usage, such as reduced design time, automatic source code generation, standardized testing, and improvement in the quality of specifications in general (kudo et al., 2019c). finally, we also conclude that the concrete results of srp usage in practice can be better experienced through two more lines of action: srp-based innovative development tools and the enhancement of current development methodologies to integrate srps along the sdlc. our current efforts include the reuse of agile concepts and practices of behaviour-driven development (bdd) for the description of srps whose behavior is described as test patterns (kudo et al., 2019b). as future work, we plan the inclusion of the term “analysis pattern” (and its variants) in the search string of this systematic mapping to augment the group of relevant studies. the main reason is that analysis patterns and requirement patterns are complementary approaches (pantoquilho et al., 2003) in such a way that the former can be transformed into the latter to migrate to the implementation details level.

acknowledgements

this study was financed in part by the coordenação de aperfeiçoamento de pessoal de nível superior brasil (capes) finance code 001. renato bulcão-neto is grateful for the scholarship granted by capes/fapeg (88887.305511/2018-00), in the context of the postdoctoral internship held at the dept. of computation and mathematics of ffclrp-usp. alessandra macedo is grateful for the financial support of fapesp (16/13206-4) and cnpq (302031/2016-2 and 442533/2016-0). the authors would also like to thank all the anonymous referees for their valuable comments and suggestions on this paper.

references

asnar, y., paja, e., and mylopoulos, j. (2011). modeling design patterns with description logics: a case study. in lecture notes in computer science (including subseries lecture notes in artificial intelligence and lecture notes in bioinformatics), volume 6741 lncs, pages 169–183, london, united kingdom.
barros-justo, j. l., benitti, f. b. v., and leal, a. c. (2018). software patterns and requirements engineering activities in real-world settings: a systematic mapping study. comp. standards & interfaces, 58:23–42.
beckers, k., côté, i., and goeke, l. (2014). a catalog of security requirements patterns for the domain of cloud computing systems. in proceedings of the acm symposium on applied computing, pages 337–342.
bourque, p. and fairley, r. e., editors (2014). swebok: guide to the software engineering body of knowledge. ieee computer society, los alamitos, ca, version 3.0 edition.
chang, f. and gan, r. (2009). an architectural framework of the integrated transportation information service system. in 2009 ieee international conference on grey systems and intelligent services, gsis 2009, pages 1342–1346, nanjing, china.
conejero, j. m., rodríguez-echeverría, r., sánchez-figueroa, f., linaje, m., preciado, j. c., and clemente, p. j. (2013). re-engineering legacy web applications into rias by aligning modernization requirements, patterns and ria features. journal of systems and software, 86(12):2981–2994.
couto, r., ribeiro, a. n., and campos, j. c. (2014). application of ontologies in identifying requirements patterns in use cases. in electronic proceedings in theoretical computer science, eptcs, volume 147, pages 62–76, grenoble, france.
curcio, k., navarro, t., malucelli, a., and reinehr, s. (2018). requirements engineering: a systematic mapping study in agile software development. journal of systems and software, 139:32–50.
fabbri, s., silva, c., hernandes, e. m., octaviano, f., thommazo, a. d., and belgamo, a. (2016). improvements in the start tool to better support the systematic review process. in proceedings of the 20th international conference on evaluation and assessment in software engineering, ease 2016, limerick, ireland, june 01-03, 2016, pages 21:1–21:5.
fabbri, s. c. p. f., felizardo, k. r., ferrari, f. c., hernandes, e. c. m., octaviano, f. r., nakagawa, e. y., and maldonado, j. c. (2013). externalising tacit knowledge of the systematic review process. iet software, 7(6):298–307.
franch, x., palomares, c., quer, c., renault, s., and de lazzer, f. (2010). a metamodel for software requirement patterns. in wieringa, r. and persson, a., editors, requirements engineering: foundation for software quality, pages 85–90, berlin, heidelberg. springer berlin heidelberg.
hauge, a. a. and stølen, k. (2011). sacs: a pattern language for safe adaptive control software. in proceedings of the 18th conference on pattern languages of programs, plop ’11, pages 7:1–7:22, new york, ny, usa. acm.
irshad, m., petersen, k., and poulding, s. (2018). a systematic literature review of software requirements reuse approaches. inf. softw. technol., 93(c):223–245.
iso/iec (2018). iso/iec 27000:2018 information technology – security techniques – information security management systems – overview and vocabulary.
iso/iec/ieee (2017). iso/iec/ieee 24765:2017 systems and software engineering – vocabulary.
jamshidi, p., ghafari, m., ahmad, a., and pahl, c. (2013). a framework for classifying and comparing architecture-centric software evolution research. in 2013 17th european conference on software maintenance and reengineering, pages 305–314.
kitchenham, b. a. and brereton, p. (2013). a systematic review of systematic review process research in software engineering. information & software technology, 55(12):2049–2075.
kitchenham, b. a., budgen, d., and brereton, o. p. (2010). the value of mapping studies: a participant-observer case study. in 14th international conference on evaluation and assessment in software engineering, ease 2010, keele university, uk, 12-13 april 2010.
knote, r., baraki, h., söllner, m., geihs, k., and leimeister, j. m. (2016). from requirement to design patterns for ubiquitous computing applications. in proceedings of the 21st european conference on pattern languages of programs.
konrad, s. and cheng, b. h. c. (2002). requirements patterns for embedded systems. in proceedings ieee joint international conference on requirements engineering, pages 127–136.
kudo, t. n., bulcão-neto, r. f., macedo, a. a., and vincenzi, a. m. r. (2019a). padrão de requisitos no ciclo de vida de software: um mapeamento sistemático. in proceedings of the xxii iberoamerican conference on software engineering, cibse 2019, la habana, cuba, april 22-26, 2019, pages 420–433.
kudo, t. n., bulcão-neto, r. f., and vincenzi, a. m. r. (2019b). a conceptual metamodel to bridging requirement patterns to test patterns. in proceedings of the xxxiii brazilian symposium on software engineering, sbes 2019, salvador, brazil, september 23-27, 2019, pages 155–160.
kudo, t. n., bulcão-neto, r. f., and vincenzi, a. m. r. (2019c). requirement patterns: a tertiary study and a research agenda. iet software, pages 1–9. https://doi.org/10.1049/iet-sen.2019.0016.
okubo, t., kaiya, h., and yoshioka, n. (2011). effective security impact analysis with patterns for software enhancement. in 2011 sixth international conference on availability, reliability and security, pages 527–534.
palomares, c., quer, c., and franch, x. (2017). requirements reuse and requirement patterns: a state of the practice survey. empirical software engineering, 22(6):2719–2762.
palomares, c., quer, c., franch, x., renault, s., and guerlain, c. (2013). a catalogue of functional software requirement patterns for the domain of content management systems. in proceedings of the 28th annual acm symposium on applied computing, sac ’13, coimbra, portugal, march 18-22, 2013, pages 1260–1265.
pantoquilho, m., raminhos, r., and araújo, j. (2003). analysis patterns specifications: filling the gaps. in viking plop, bergen, norway.
petersen, k., vakkalanka, s., and kuzniarz, l. (2015). guidelines for conducting systematic mapping studies in software engineering: an update. information and software technology, 64:1–18.
tockey, s. (2015). insanity, hiring, and the software industry. computer, 48(11):96–101.
trakhtenbrot, m. (2017). mutation patterns for temporal requirements of reactive systems. in proceedings 10th ieee international conference on software testing, verification and validation workshops, icstw 2017, pages 116–121.
wen, y., zhao, h., and liu, l. (2011). analysing security requirements patterns based on problems decomposition and composition. in 2011 1st international workshop on requirements patterns, repa’11, pages 11–20, trento, italy.
withall, s. (2007). software requirement patterns. best practices. microsoft press, redmond, washington.
wohlin, c. (2014). guidelines for snowballing in systematic literature studies and a replication in software engineering. in 18th international conference on evaluation and assessment in software engineering, ease ’14, london, england, united kingdom, may 13-14, 2014, pages 38:1–38:10.
yang, h., liu, k., and li, w. (2010). adaptive requirement-driven architecture for integrated healthcare systems. journal of computers, 5(2).

journal of software engineering research and development, 2023, 11:7, doi: 10.5753/jserd.2023.2657 this work is licensed under a creative commons attribution 4.0 international license.
education, innovation and software production: the contributions of the reflective practice in a software studio

aline andrade [ pontifícia universidade católica do paraná | alinesf.andrade@gmail.com ]
alessandro maciel schmidt [ pontifícia universidade católica do paraná | alessandromacielschmidt@hotmail.com ]
tania mara dors [ pontifícia universidade católica do paraná | taniadors@ppgia.pucpr.br ]
regina albuquerque [ pontifícia universidade católica do paraná | regina.fabia@pucpr.br ]
fabio binder [ pontifícia universidade católica do paraná | fabio.binder@pucpr.br ]
dilmeire vosgerau [ pontifícia universidade católica do paraná | dilmeire.vosgerau@pucpr.br ]
andreia malucelli [ pontifícia universidade católica do paraná | malu@ppgia.pucpr.br ]
sheila reinehr [ pontifícia universidade católica do paraná | sheila.reinehr@pucpr.br ]

abstract

the growth of the mobile phone market has been generating great demand for professionals qualified in application (app) development. the required profile includes technical skills, also known as hard skills, and behavioral or soft skills. training these professionals at the speed, quantity, and quality demanded by the market poses a significant challenge for educational institutions. apple and pucpr have established a partnership to build a software studio to develop such talents using the challenge based learning (cbl) method and associated practices, whose effects need to be studied. this research aims to analyze the contributions of reflective practice in a software studio to teaching the main professional competencies regarding app development, including hard and soft skills. the research method was the case study, based on semi-structured interviews with 28 participants in three cycles. the collected data were analyzed with open and axial coding from grounded theory and the atlas.ti tool. the results demonstrate that reflective practice, applied in a software studio environment that uses cbl, was able to help students map new ideas and acquire valuable hard and soft skills. the study pointed out that reflective practice is an effective instrument for developing the skills required by the app market, which demands innovation and quality at high speed.

keywords: reflective practice, software studio, challenge based learning, software quality education, app development

1 introduction

the demand for technological products has been growing in recent years, which requires better training of computing professionals, especially for developing applications for mobile devices. among the most requested competencies are technical knowledge (hard skills) and behavioral skills (soft skills), such as teamwork, collaboration, and communication. these abilities are crucial since information technology (it) professionals tend to be more introspective. although students perceive the development of soft skills as relevant, studies show that there is not the same degree of concern about acquiring these skills as there is about acquiring the more technical ones (lima and porto 2019). the apple developer academy, or simply academy, is a technological innovation project run in partnership between university environments and apple through a course that offers a complete education to students, allowing them to learn how to code, test, and publish applications based on their ideas. the academy is a software studio (bull et al. 2013) that uses active and collaborative learning methods and tools to contribute to the students' practical learning and skills development.
its staff consists of instructors who are programmers and designers, available at the studio daily. for the two-year extension course, 50 students were selected who had either already graduated or would graduate within six months. there are designers, developers, and devigners, the latter being students able to work with both designer and developer skills (dors et al. 2020). the academy uses the challenge based learning (cbl) method to support the mobile application development process, which is based on challenges proposed to students. this is a definition established by apple as a contractual item. one of the practices associated with cbl is reflective practice. reflective practice is a feature of the software studio supported by formal and informal feedback from teachers to students, which can be mentoring or critiquing to improve the outcome (bull et al. 2013). in this learning environment, students are exposed to social interactions, group work, oral presentations, and discussions of their work with peers (kuhn et al. 2002). the course instructors use an approach that follows the guidelines established by the partnership and those related to the studio concepts. thus, students receive theoretical content through workshops. throughout the development of the challenges, instructors use coaching and mentoring reflectively with the students, according to the software studio concepts. the instructors encourage the students to reflect and find solutions independently, i.e., the instructors do not give the answers but the tools and conditions for the students to develop them. the presence and importance of reflective practice are recognized in the software engineering educational literature, according to dors et al. (2020) and bull and whittle (2014). it is a form of reflection-based learning, as the name implies, that ranges from constant questioning, teamwork, peer review, and collaborative learning to group problem-solving. the concept was initially proposed by donald schön when observing architecture studios. he suggested thinking about professional practice, relating theory to practice, and coined terms such as reflection-in-action, reflection-on-action, and conversation with the material (schön 1983). reflective practice in architecture has proven effective in assisting soft skills development, improving performance, and helping students acquire an artistic talent essential for professional competencies (schön 1983; hazzan 2002). the contributions of such an approach to computer science education are described by bull and whittle (2014) regarding the technical and attitudinal skills of the software engineer. the latter include improved decision-making skills, teamwork, communication, planning, and time management. with so many positive results for teaching mobile application development with the software studio education approach, an interest arose in deepening the understanding of the results provided by the reflective practice, extending the study of dors et al. (2020) to the analysis of a more extensive set of students and projects.
dors et al. (2020) analyzed data from the academy's 2017-2018 student class, conducting a face-to-face ethnographic observation study. the present study analyzed data from the 2019-2020 and 2021-2022 classes, obtained through semi-structured interviews and constituting a longitudinal study. new findings were identified, complementing dors et al. (2020), as can be seen later in this article.

2 background

the different methodological approaches used by teachers throughout history have intrinsically aimed at the same scenario: to enable the learner to act autonomously in diverse professional situations. however, traditional methods that focus only on lectures are not enough to develop the competencies required by today's society. active methodologies, which place the student at the center of the learning process, have been widely used in universities worldwide and, in recent years, also in brazil. they seem to offer better results for the development of the required competencies. regardless of the methodology used, ferraz and belhot (2010) argue that it is not enough to focus on the content to be covered to conclude the teaching-learning process efficiently. it is necessary to plan and structure the activities to be developed, the resources available, the methodologies adopted, and the evaluation tools used.

2.1 collaborative learning

collaborative learning is one way to overcome the challenges faced by traditional teaching methodologies. according to barkley et al. (2014), a collaborative approach meets the following criteria: (i) the activity design must be intentional and carefully undertaken by the faculty member and not just limited to assigning some group activity; (ii) all group members must effectively engage in the activity and contribute equally to the outcome; and (iii) meaningful learning, related to the learning objectives of the discipline, must occur. briefly, collaborative learning is about "two or more students working together and sharing the workload equally as they progress toward the intended learning outcomes" (barkley et al. 2014). the process of collaborative learning refers partly to metacognition, which means getting the students to reflect on their own learning process. besides choosing the technique, implementing collaborative approaches implies properly defining how to organize the groups, encourage collaboration, and conduct the assessment.

2.2 hard and soft skills

it is prevalent for undergraduate courses to focus on developing the technical skills the future professional needs to work in his/her field. these are also known as hard skills. however, they are insufficient to build an excellent professional who can cover current market demands. behavioral skills, also called soft skills, are equally or even more relevant in this journey. according to agante (2015), soft skills are non-technical competencies such as communication, empathy creation, trust within groups, and resilience in a work environment. competency is "a set of capabilities (knowledge, skills, attitudes, and values) mobilized for a delivery, which adds value to both the individual and the organization" (fernandes 2013). a survey conducted in july 2021 by the american company careerbuilder with 2,138 managers and human resources professionals pointed out that 77% of the interviewees believe that soft skills are essential for the job.
carter, ferzli, and wiebe (2007) state that although communication skills are vital for an effective professional, these skills usually fall short of employers' expectations of recent technology graduates. several universities already recognize the need for computer science students to acquire these skills and incorporate teaching methods that favor their development. studies show the importance of communication in technology because students learn what it means to think like computer scientists and be professionals in the field (burge et al. 2012).

2.3 software studio

one of the approaches to developing these competencies is the software studio, which comes from the historical tradition of the école des beaux-arts and the bauhaus and its atelier model (dors et al. 2020). according to tomayco (1991), the software studio emphasizes developing reflective skills and sensibilities, reflective practice being the essence of the atelier concept. collaborative learning in the studio helps students to develop their skills through practice. furthermore, the dynamic interconnection of elements in a studio, such as people, software tools, development methodologies, processes, techniques, and products, provides a network in which software development knowledge and skills are created (prior et al. 2014). reflection generally occurs in cycles of experience followed by consistent reflection to learn from that experience, during which the developer can explore comparisons, weigh alternatives and diverse perspectives, and generate inferences, especially in new and/or complex situations (dybå et al. 2014). according to schön (1983), reflection-in-action occurs during problem-solving, with doing and thinking as complementary activities. reflection-on-action is about thinking of a different approach to an already executed process. finally, conversation with the material refers to a conversation with the product that has been developed. reflection-in-action is the reflective form of knowing-in-action, that is, reflecting during the problem-solving process; in reflection-in-action, doing and thinking are complementary. knowing-in-action is the knowledge built into and revealed by our performance of everyday action routines (schön 1983). sometimes it is labeled as intuition, instinct, or motor skills. in such cases, one continually controls and modifies one's behavior in response to changing conditions (schön 1987). "this capacity to do the right thing ... exhibiting the more that we know in what we do by the way in which we do it, is what we mean by knowing-in-action. and this capacity to respond to surprise through improvisation on the spot is what we mean by reflection-in-action" (schön 1987). carbone and sheard (2002) reported the reactions of first-year students to being exposed to a new learning environment that consisted of a differentiated physical space, a new teaching approach, it facilities, and a new assessment method. this space was a workshop whose approach was established in 2000 in the school of management and information systems, bachelor of management and information systems (bims), at monash university (australia). the studio-based teaching and learning approach adopted was based on the bauhaus school of design model.
the bauhaus school introduced a radical change from the traditional art education model, completely reshaping the teaching and learning spaces of that time. the atelier aims to allow students to develop strategies to cooperate and collaborate. the authors concluded that, in general, most first-year students enjoyed learning in the studio environment. an unexpected finding was the evidence of students developing metacognitive skills. danielewicz-betz and tatsuki (2014) analyzed reflective practice concerning the outcomes of a software workshop in undergraduate and graduate software courses. the analysis focused on the interaction between students and clients to determine how and to what degree students were transformed through collaborative project-based learning. during the final self-reflection, students reported improving their project management, communication, presentation, writing, business, and software development skills. the reflective practice was analyzed with a focus on collaborative learning and students' relationships with clients. prior et al. (2019) described a study based on open-ended interviews and ethnographic observations in studio sessions to understand how this experience impacted students' employability. students observed that the studio experience helped enhance their technical and non-technical employability skills. in addition, from interviews with mentors and academics, the study corroborated the students' views. the authors concluded that the skills relevant for employability include collaboration and communication, project management, mutual support to solve technical problems with help from industry mentors and academics, social aspects of the work, reflection skills, and technical skills. according to marques et al. (2018), adopting reflective practice (reflexive weekly monitoring, rwm) is a way to improve learning for computer science students. the authors followed nine semesters of a project discipline and concluded that the approach effectively improved student coordination, effectiveness, sense of belonging, and satisfaction.

3 research method

the research was conducted in a case study format based on data collection through semi-structured interviews (yin 2017). this method constitutes a research strategy that aims to understand the dynamics in a contemporary context over which the researcher has no control. it is appropriate to answer "how" and "why" questions. the study's main objective was to understand the contributions of reflective practice to technical and non-technical skills development in a software studio. we followed the steps defined by yin (2017): (i) definitions and planning; (ii) preparation, data collection, and analysis; (iii) cross-analysis and conclusions. the research planning involved case selection and preparation of the research protocol. the informed consent form (icf) and the non-disclosure agreement (nda) were prepared and signed by all researchers involved in the project. the project went through analysis by the research ethics committee, receiving its approval (number 4.209.411) on august 12th, 2020. the underlying question for this study was: how is reflective practice performed in software development studios? the following complementary questions arise from this general question: how is reflective practice carried out in software studio environments? how can reflective practice contribute to the learning of computer science students?
the present study was characterized by a prospective design in a qualitative research format. the first and second data collection rounds occurred between january and may 2021 and referred to the 2019-2020 class. the third collection cycle occurred from december 2021 until january 2022 and referred to the 2021-2022 class. the sample was determined through convenience sampling. according to the inclusion and exclusion criteria, individuals considered eligible to participate in the study answered a semi-structured interview. the unit of analysis of this project is the apple developer academy (called academy), a software studio constituted in the scope of the partnership between apple and pucpr. candidates go through a selection process that identifies the most appropriate profiles. these students then undergo a two-year training period, which exposes them to several challenges. included in the research were academy students over 18 who agreed to collaborate with the study. as selection criteria, we adopted that the students should have participated in active learning using reflective practice and attended the class workshop immediately before. the project was divided into three cycles. in the first cycle, ten students participated in the interview. this stage was considered a pilot project and aimed to understand the benefits of reflective practice from the students' point of view. the first author initially developed the semi-structured script, which the other authors later revised. the script contained ten open questions to understand the academy students' perspective on reflective practice, ranging from usefulness and learning to future applications outside the academic environment. at the end of this stage, a preliminary data analysis was performed to adjust the second collection cycle. for the second cycle, the questions were adjusted, allowing a deeper exploration of items that emerged from the first collection. the second cycle comprised eight interviews that took place remotely and synchronously, each lasting 30 to 60 minutes, as had already happened in the first cycle. no adverse effects were perceived because the interviews took place online. the third and last cycle was conducted with twenty students from the 2021-2022 class. the interviews were undertaken in a synchronous remote way in two phases. the interviews for all cycles were recorded in audio format and later transcribed with the interviewees' permission. the information was mapped and analyzed with the support of the atlas.ti tool. the analysis of the results used the open and axial coding of grounded theory (strauss and corbin 2007). open coding is a microanalysis of the transcribed interviews, performed line by line, identifying concepts and memo records (researcher's notes) about the meaning of the codes and categories. axial coding, on the other hand, refers to grouping codes with the same properties in the form of networks. the results present the behavioral and technical skills acquired from applying reflective practice. the study participants had no direct benefit from the project. however, the research contributed to the planning and development of future actions aimed at improving the skills of technology students in the researched environment. an indirect benefit for the academy students was their reflection on reflective practice itself, prompted by the researcher's inquiries.
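to illustrate the open and axial coding steps described above, the sketch below groups open codes into axial networks in plain python rather than in atlas.ti. the quotations are abridged from the interviews reported in the next section, while the data layout, category sets, and variable names are illustrative assumptions.

```python
# a minimal sketch of open and axial coding, assuming a plain in-memory
# representation instead of the atlas.ti tool; quotations are abridged
# from the reported interviews, and the data layout is an assumption.
from collections import defaultdict

# open coding: line-by-line assignment of concept codes to quotations
open_codes = [
    ("s2", "more patience to understand my process", "self-knowledge"),
    ("s8", "learn how to manage time", "time management"),
    ("s2", "learn how to communicate verbally better", "communication"),
]

# axial coding: codes sharing the same properties are grouped as networks
axial_categories = {
    "soft skills": {"communication", "time management", "teamwork"},
    "personal development": {"self-knowledge", "self-confidence"},
}

networks = defaultdict(list)
for student, quotation, code in open_codes:
    for category, codes in axial_categories.items():
        if code in codes:
            networks[category].append((code, student, quotation))

for category, entries in networks.items():
    print(category)
    for code, student, quotation in entries:
        print(f'  {code}: "{quotation}" ({student})')
```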
4 results

this section details the process of analyzing and triangulating the collected data, and it is organized into subsections according to the collection cycles.

4.1 first cycle

ten students between 18 and 24 years old were interviewed in the first cycle. of these, five were male and five were female. the interviews followed the script shown in table 1.

table 1. script for the semi-structured interview, cycle 1.
have you done activity reflections before joining the apple developer academy?
how was your first reaction when you discovered that you would need to reflect on your challenges? why?
considering programming, design, and business, what have you learned technically using reflective practice?
how do you think reflective practice contributed to developing your behavioral skills?
how was the critique carried out? can you explain to me what it was like to receive and give a review of a challenge?
after doing some reflection, did you avoid any mistakes in the development of the challenge and, consequently, see new attitudes for the following activities? if yes, explain how.
how was your last reflection compared to the first one?
in the future, do you intend to continue using reflective practice in new projects? can you explain how you intend to use it?
finally, could you tell me the most significant benefits of reflective practice?

interviewees answered about the process of reflective practice based on a semi-structured script with more comprehensive questions that would allow for more in-depth results. most students stated that they had conducted reflections before participating in the academy. when asked about their reactions to finding out that reflection would be mandatory, most students used positive words such as cool and productive. only three participants used negative expressions, such as confusing, complicated, or boring. from the open coding analysis, we obtained the axial network that reflects soft skills learning, as presented in figure 1. this category comprises the following subcategories: leadership, communication, and teamwork. teamwork was very evident during this study. students could develop technical and behavioral skills during team interactions, such as collaborating with colleagues, managing time, and improving communication. examples of these statements can be seen later in this text. regarding teamwork, we found evidence of learning about conflict management. this emerged from statements concerning arguments and stress between colleagues in the same team. through reflection, these students were able to manage these disagreements better.

figure 1. soft skills (behavioral competencies) cycle 1.

figure 2 presents the results of the technical knowledge obtained through reflective practice. the most cited terms were technical learning, writing learning, and project management. it was also possible to see that many students could identify the depth of their knowledge about programming after doing the reflections.

figure 2. hard skills (technical competences) cycle 1.

the technical competence considered most relevant by the interviewees was project management. usually, students have no previous practical experience with project management except for graduation work.
the studio provides them with this practical experience throughout the course. they perceived this theme as planning skills, activities standardization, error analysis, problem identification, and time management. therefore, the results concerning personal development obtained from the reflective practice were: improvement in the students' technical performance, self-knowledge, and learning evolution in hard and soft skills, which led to a continuous improvement process for the students. students realized that through reflection, self-knowledge could be developed, which brings increased self-confidence. the evolution of learning, whether in the development of soft or hard skills, is also perceived as a cause of personal growth that makes one learn to deal better with feelings. during the interviews, many students stated that they reflected in search of improvements in their technical and behavioral results. with this, they started to analyze themselves more critically, finding mistakes and successes, to deal with the following activities differently and obtain future learning and improved performance. self-knowledge was highly mentioned among the interviewees. through the practice of reflection, students could know themselves, understand their preferences, and learn what made them more confident in performing the academy activities. table 2 presents some interview quotations regarding the soft skills found in this first cycle. an "s" followed by a number replaces the student's identity (e.g., s1, s2).

table 2. soft skills quotations, cycle 1.
self-knowledge: "[...] self-knowledge for sure, more patience to understand my process, understand my time, decrease anxiety and stress […]" (s2); "[...] i believe the greatest benefit that reflective practice has brought to me is the self-knowledge [...]" (s6)
planning: "[...] it is assuredly very favorable to prepare yourself for the next challenge better [...]" (s8)
self-confidence: "[...] fell me more confident regards the skills that i possessed." (s1)
time management: "[...] learn how to manage time [...]" (s8)
communication: "[...] and another thing was a personal relationship. i believe you think about what interests you, what you talk about, the way you talk. i learned better this relationship with other people […]" (s8); "[...] learn how to communicate verbally better [...]" (s2)
learning evolution: "[...] i think a very cool thing was seeing the evolution over time [...]" (s1); "[...] not making the same mistake twice [...]" (s2)
personal development improvement: "[…] creating a concept, doing the more artistic part, doing the development part, doing the presentation later, so you interact with various aspects. and i think through reflective practice. i was able to understand better how my development process was […]" (s2)
technical development improvement: "[…] think, learn to communicate better verbally, […], not making the same mistake twice, improve what you do, like something that worked […]" (s5)
critical thinking: "[...] to make that change in behavior which is in line with the critical analysis [...]" (s3)

the codings were performed by one of the authors and reviewed by the others. these skills can be grouped in the category that donald schön (1983) defines as reflection-on-action, the reflection after the action is performed.

4.2 second cycle

in the second cycle, eight more students were interviewed. the basis for these interviews was the revised script shown in table 3.

table 3. semi-structured interview script, cycle 2.
have you ever done any reflection before joining the apple developer academy?
what was your first reaction when discovering that you had to reflect on your challenges? why was that?
in the day-to-day life of the academy, did you notice yourself reflecting during the execution of your activities? how was it?
what were the moments when you needed to record the reflections?
what was the difference between reflecting on the short-term and the long-term challenge?
what are the differences between doing reflections with guidelines for their execution and those with a free format? what are the benefits of both types of reflections?
what are the differences between learning when working as a team and individually?
how were the reflections during the planning of the challenges? what about development? and in the delivery of the final product?
there were sessions with the instructor solely to reflect on the progress during the development of challenges. how were these moments? what was it like to receive feedback from the instructor?
what skills have you developed to perform presentations in public sessions?
what kind of learning or skill was developed or required to perform the division of tasks in teams?
what kind of challenge reflections, learnings, or experiences were helpful in another challenge?
what benefits have you brought to your professional life using reflective practice?
in addition to encouraging individual reflection, the academy has specific moments for students to reflect critically on other students'/teams' work. what was it like to make and receive criticism about the cbl and design in the review sessions? how have the reviews contributed to your personal development or the development of the challenge/final product?
were you able to avoid any mistakes in any challenge after reflecting and consequently seeing new attitudes for the following activities? if so, explain how.
how was your last reflection compared to the first one?
do you continue to use reflective practice in your projects? how did that become part of your day-to-day life?
finally, could you tell me the most significant benefits the reflective practice has brought you?

with the changes implemented, nine additional results were found and added to the networks obtained in cycle 1. the analysis allowed us to divide them into three technical skills, three behavioral skills, and three classified as personal benefits of reflective practice. the ability to speak english was the only property of the technical skills that differed from cycle 1. the other changes were related to project management. the following were cited as relevant points: clarity in the execution of project activities and division of tasks, as shown in figure 3. as can be seen, the network was refined due to a better understanding of the scenario.

figure 3. hard skills (technical competences) cycle 2.

figure 4 presents the soft or behavioral skills mapped in cycle 2. it is possible to observe an outstanding factor that refers to knowing how to listen to colleagues, since listening to different opinions is fundamental for good communication. all interviewees mentioned this skill. in addition, this ability is in great demand in the professional market and is essential to personal development through feedback.
the students identified that improving the organization of presentations results in behavioral skills development, especially those that facilitate communication. this skill was developed because they had to make several presentations of their projects throughout the course.

figure 4. soft skills (behavioral competencies) cycle 2.

the reflective practice significantly impacted the students' personal evolution and behavior improvement. the active methodology also helped the students to evolve their learning through reading their classmates' reflections and through personal development. as can be seen, it was possible to evolve the networks to contemplate the findings made through the analysis of the cycle 2 interviews and find the hard and soft skills developed from reflective practice. table 4 presents the quotations extracted from the interviewees' speeches and represents the most frequently mentioned skills in the interviews in cycle 2. knowing how to listen to one's colleagues was highly cited among the interviewees in this second cycle. through reflection, students could better understand their colleagues and the importance of listening to them.

table 4. hard skills quotations, cycle 2.
active listening: "[...] it is a lot about listening to other people you know, and understanding their thinking in a good way..." (s13); "[...] you learn to listen to what people want..." (s15); "[...] what i learned most from it was to listen to others before doing" (s17)
presentation organization: "i learned to organize for the presentations right, and go-getting the hang of getting better [...]" (s17)
behavior improvement: "i learned personally because we are dealing with people. so, i learned how to express myself or the form of my posture […] with people [...]" (s11)
technical performance improvement: "i was able to improve my results, so, so for me, it's one of the main things, to improve results effectively, quickly […]" (s14)
learning by reading colleagues' reflections: "[...] for me, learning from the mistake of others is also valid. i read the reflections of people who seemed interested in learning from them, so sometimes someone wrote and seemed dissatisfied, or someone who seemed very satisfied. i liked to read these reflections [...]" (s12)
ideas sharing: "[...] i knew then to present my thoughts [...]" (s12); "[...] i was able to expose better what i was thinking." (s15); "[...] i think the skill that i learned the most for the presentation, like this, was losing the shame, you know. lose the fear of presenting or speaking my ideas." (s13)
english speaking: "[...] i developed a little bit in how to present in english, keep learning, and lose the fear right." (s18)

another interesting issue that emerged when analyzing the interviews was the observation that it is also relevant to learn from the mistakes and successes of colleagues, which was possible by reading the reflections they wrote. the ability to organize also appears in the reflections about improving oral presentations. the students learned self-organization to show their work to colleagues and realized this throughout the reflection process. it is interesting to note that, in the students' perception, by reflecting on and analyzing the situations in which they were involved with their colleagues, they learned to deal with people and adopt a more appropriate posture towards the team.
4.3 third cycle

the third cycle refers to ten students participating in the 2021-2022 academy class. this cycle aimed to identify the contributions of each reflective practice concept, such as reflection-in-action, reflection-on-action, conversation with the material, and knowing-in-action. the analysis was expanded to a more significant number of quotations, with a subsequent refinement of the second cycle networks to meet the interrelationship of codes with the third cycle, so new codes emerged from the interpretation. in this collection cycle, ten students answered questions from a semi-structured script about their studio showcase experience from the perspective of reflective practice contributions (see table 5). the showcase is a studio session where students present their projects to other studio students, and the best project receives an award. the students were also interviewed in this cycle to investigate the conversation with the material.

table 5. semi-structured interview script, cycle 3.
please comment on your experience with reflections on challenges.
have you ever been introduced to reflection as a teaching methodology throughout your education?
what aspects of your formal education would have been different if you had used reflective practice?
how do you believe that reflections with colleagues help create more interesting projects? has this ever happened to you?
what techniques would you use in case of imminent team conflict?
how did the reflections affect the relationship between the team members?
what technical skills did you develop?
what interpersonal skills do you think you developed throughout the studio activities?
how will the materials produced, such as projects, codes, slides, presentations, and so on, influence the development of future materials?
what is your process of revisiting the materials produced at the academy like? have you changed the materials produced due to reflections between challenges?
how have the team reflections helped to develop your creativity?
what are the main lessons or learnings from the studio?

table 6 shows some of the quotations extracted from the interviews in cycle 3.

table 6. quotations, cycle 3.
decision making: "[…] i believe would have... take more assertive decisions […] i had wasted less time because i would have had to stop to really focus on my ability to think about what's going on [...]" (s19); "[…] i think it brought me these insights into what i should do from now on, […]" (s20)
conflict management: "[…] but basically, it was through conversations that we solved these problems [...] the third person, who was the one who brought the conflict, admitted that they could have brought the situation in another way" (s24); "then through the reflections, i realized that in most cases, this was not a good alternative, and i started opting for these conversations 100% of the times [...]" (s20)
reflection on mistakes and successes: "[…] i think they kind of force you to look at everything you've done, look at all you did, and analyze what you did right or wrong. so, analyzing these practices, you can think on what keep doing, or what behaviors should i stop doing […]" (s22)
active listening: "[…] learning to give and receive feedback, to stop and hear feedback, was something i already was working on before, but the academy gave me interesting ways to develop this [...]" (s19)
collaboration: "[...] i think collaboration is a major one, work as a team, [...]" (s21)
communication: "[...] i learned a lot about how to communicate myself, [...]" (s22)
self-confidence: "[...] also works for me to have confidence and believe in my own potential, [...]" (s21)
motivation: "[…] it influenced me a lot to keep myself motivated […]" (s19)
upcoming artifacts production: "[...] i think it influences a lot; i think that everything is a reference [...]" (s25)
professional impacts: "this helps a lot in the development of the personal portfolio." (s22); "[...] when you need to reference these projects in some professional opportunity." (s26)
academic impacts: "[...] like a portfolio, this is where you expose your projects, but more than that, you delve into a retrospective of how the project was carried out (why and how the project was developing). and this helps you not only document all the process [...]" (s29); "[...] something for the future, something i would apply to a future project or team." (s24); "the biggest influence is on future projects, where i can use what was written and learned during the projects i participated in." (s20)
self-knowledge: "[...] to see better this way, how we are doing in a specific subject, to dedicate ourselves more or better understand what we like, and what we do not want to work on." (s22)
synthesis skill: "[...] the skills i developed in the construction of this activity is self-knowledge and synthesis. [...]" (s29)
priority management: "i started to apply this model of reflection […] not only in my work but also in my study things. then i realized that i had a much clearer vision of what i had to do or the alternatives i would prioritize in the steps forward." (s20)

from the analysis of the interviews, it was possible to observe that reflection-in-action promotes behavioral improvement and the student's continuous development. throughout the challenge development, students had to manage situations of divergence of ideas, conflicts of leadership, and the teams' expectations, improving their conflict management skills. students said this usually occurs at the beginning of the challenge, while doing the project design. conflict resolution comes through conversations and sometimes through a voting strategy. the students stated that, based on reflection-in-action, they realized they could make more assertive decisions regarding the project and their behaviors. it also develops leadership, which contributes to engagement and teamwork. students pointed out reflection-on-action as a powerful tool to better understand their motivations, interests, and capabilities, contributing to their self-knowledge development. students wrote a self-reflection at the end of each challenge to stimulate their reflection, promoting reflection-on-action. reflection-on-action promotes self-knowledge, as well as self-reflection and creativity. these self-reflections promote work process comprehension, professional development monitoring (the ability to keep track of what one is learning), reflections on mistakes and successes (which helped the students not to repeat the same mistakes and to emulate behaviors that had positive results in the past), and creativity. the reflective practice supported by studio sessions encourages multiple views on a given theme, which promotes creativity. interaction with creative colleagues also contributes to developing creativity.
sharing ideas throughout the studio sessions stimulates teams to be open to different ideas. in addition, creativity is responsible for the creation of innovative projects. figure 5 shows these reflection-on-action findings.

figure 5. reflection-on-action cycle 3.

the third collection cycle's purpose was to explore the conversation with the material and knowing-in-action, the reflective practice concepts not covered in cycle 2. from knowing-in-action, we could observe hard and soft skill development, as shown in figure 6 and figure 7. concerning hard skills, as shown in figure 6, knowing-in-action promotes the development of project management, project development, and presentation organization skills. regarding project development, we noticed the development of design skills; some students had never designed a mobile app before the academy course. in addition, they started to learn about code reuse-oriented development, software architecture, database development, app prototyping with tools such as miro, figma, or the adobe package, algorithmic logic and programming, object-oriented programming, and programming in swift/swiftui.

figure 6. knowing-in-action – hard skills cycle 3.

figure 7 shows the knowing-in-action findings regarding soft skills: students actively listen to colleagues' ideas, communicate, collaborate, and exercise self-criticism.

figure 7. knowing-in-action – soft skills cycle 3.

in this cycle, students reported soft skills that had not appeared in the previous cycle: responsibility and autonomy, storytelling, and communicating feedback. there was an evolution in their commitment to the project regarding deadlines and the execution of assigned tasks, improving their responsibility and autonomy skills. another soft skill developed by the students was storytelling: students developed the art of telling stories while preparing app designs and project presentations, since they had to create an appealing backstory for the app and even engage colleagues in their presentations. the academy's activities are collaborative by nature, and the other new skills the students presented relate to this specific characteristic. through the development of collaborative tasks, students showed considerable improvement in communicating feedback to their colleagues, solving issues, and maintaining a respectful work environment.

as shown in figure 8, the showcase stimulates studio students to exercise the conversation with the materials. as a result, the showcase experience positively influenced upcoming artifacts production and promoted priority management, synthesis skill, motivation, self-confidence, learning by shared experience, and self-knowledge, which positively influence professional and academic contexts.

figure 8. conversation with the material cycle 3.

since the students were supposed to create brief presentations about complex projects, they had to develop priority management and an understanding of the presentations' purpose and structure, not only prioritizing the most relevant items but also using synthesis skills to tell a story as effectively as possible. enrolling in such activities helped students to boost their motivation and self-confidence. sharing experiences in the showcase helps students to learn from colleagues' experiences, which positively influences professional and academic contexts.
written self-reflection supports reflective practice in the studio and can be carried out in two ways: in a free format or with guiding questions. students perceive the development of these skills through the exercise of their metacognition. metacognition is "being aware of and able to monitor the development of one's own learning and the application of that learning to their practice" (parsons and stephenson, 2005). the difficulties in applying the reflective practice found in the analysis are related to the students' lack of experience and to physical fatigue. the latter is because students tend to work hard to complete the project on time; consequently, they get tired after the delivery, and writing the reflections becomes difficult.

5 discussion

this study aimed to identify the contributions of reflective practice in a software studio, analyzing its benefits for mobile application development and for the acquisition of professional skills demanded by the market. the cbl active methodology is reflection-based learning that builds on the students' relationship with their experiences. it was identified that students reflect to find new results and knowledge, so the practice aims to improve the students' abilities for the following activities. the results showed that reflective practice positively affects software development and the acquisition of professional skills. the study highlighted that collaborative learning helps students develop their own skills through practice and that groups interested in other teams' work acquire new knowledge and skills. in addition, among the main contributions of reflective practice is the development of skills such as teamwork, collaboration, communication, time management, planning, problem identification, decision-making, and self-knowledge. these contributions are crucial for a computer science practitioner to succeed.

5.1 related work

compared to other literature studies, it was possible to notice that conflict management and time management are essential skills in executing a project. these points were also cited in the work of dors et al. (2020) and were confirmed by the results of the present study. the findings of this research confirm and extend the results obtained by dors et al. (2020). the authors' main results were that reflective practice promotes the emergence of new ideas and contributes to the practice and development of skills, such as collaboration, oral and written communication, commitment, interpersonal relationships, adaptability, flexibility, and teamwork. it also develops problem-solving, decision-making, planning, project management, time management, scope management, outsourcing development management, and new technical skills. in addition, the reflective practice emphasizes hands-on learning, supports the development of technical skills, and provides an authentic environment that connects academic disciplines with real-world experiences, where students can practice and learn by doing, preparing them for the real world. the interviewees in this study highlighted the importance of behavioral competencies for their development. in this study, students emphasized the relevance of self-knowledge, knowing how to listen, awakening behavioral improvement, confidence, and communication as benefits of reflective practice.
the second cycle highlighted communication as a fundamental part of personal development and teamwork (carter, ferzli, and wiebe, 2007). in the present study, it was also observed that students recognize that public speaking and listening skills are equally relevant to personal and professional life. the analysis of the results of this study showed that students perceived that decision-making improves with each new reflection made, since pondering one's virtues and weaknesses is critical to improving the timing and quality of personal deliberations. this is consistent with what was obtained by dors et al. (2020).

this research made it possible to observe the students' acceptance of and motivation to use reflective practice. these findings differ from the study by prior et al. (2016), who presented the results of action research conducted at the university of technology sydney with three software development studios. the main challenges in terms of motivation that the authors encountered were i) time pressures that made it difficult to record the journals; and ii) difficulties in making the journal entries. some students only did the reflections when reminded, and others refused. the authors also identified the following patterns of students: i) refusers (do not write reflections), ii) recounters (have difficulties in doing reflection), and iii) instinctive reflectors (able to reflect naturally). one intervention that proved to be effective was the 10-minute reflection session, which consisted of having students describe how their written reflections were going, report any problems, and answer a stimulus question during the session. in the case of the study reported in this article, this difficulty was not observed because the students felt motivated by the reflective practice.

in this research, it was possible to observe improvement in technical performance; that is, reflective practice positively impacted student performance. this finding is consistent with the research conducted by nylén and isomöttönen (2017), in which the authors studied how students approach the reflective practice task. two categories were identified in students' recording of critical incidents: progress and expansion. progress refers to progress reporting, divided into "what i am doing," status reports, and daily-type subcategories, which grow in sophistication, respectively. expansion indicates students' reports on learning items and reflections on those items; this category is divided into keywords (how-to, knowledge about generic, personal, and theoretical language). the authors concluded that journal recording induces reflection on learning and positively affects students' awareness of their professional knowledge. on the other hand, their students found it challenging to identify learning. this did not occur in the present study, as students could clearly identify their growth through the application of reflective practice.

5.2 threats to validity

according to yin (2017), research developed using the case study method can be evaluated under four criteria: (i) construct validity; (ii) internal validity; (iii) external validity; and (iv) reliability. a threat to the construct validity of this study was the use of narratives to identify the acquisition of professional competencies through reflective practice. it was recognized that the narrative approach is compatible with the need to assess the complexity of organizations.
however, it was necessary to rely on the interviewees' memories to understand the benefits of this practice both for professional competencies and for application and software development in general. another limitation of the narrative approach is that people often rationalize the facts while telling the story: the construction of meaning from the facts causes individuals to interpret past events, try to find explanations for what happened, and perhaps confuse what occurred. to lessen this threat, students were asked during the interviews to recount specific moments to confirm their interpretation of the facts concerning their reflection. similarly, the validation performed with academy students also helped validate the perception of their interpretations of the sequence of occurrences, understanding the benefits and uses of reflective practice. regarding external validity, since this study is about a specific environment in specific circumstances, the ability to generalize is limited. it is possible that similar results could be obtained when studying teaching and innovation environments that are conducted in a similar way, for example, using the software studio concept and reflective practice.

5.3 implications to software engineering education, industry, and academia

from the point of view of software engineering education, we were able to observe that the cbl method, especially reflective practice, is an effective means of developing technical and non-technical skills. providing students with challenges that provoke them to go further, search for answers, create, and reflect can lead to valuable knowledge. this can inspire educators worldwide to rethink their educational practices in the classroom. for industry, our study reveals that cbl and reflective practice lead to the development of highly demanded soft skills such as communication, conflict resolution, autonomy, and responsibility, among others. these methods can be used to develop such skills in the academic environment or in professional education. from the academic perspective, we presented a model that describes our findings, expressed in the form of the relationship networks shown in the previous sections.

6 conclusion

technology changes continuously, so computer professionals must increasingly deal with new methods, tools, platforms, user expectations, and software markets. thus, more advanced education is needed to prepare these professionals for the coming decades and new demands. in this respect, reflective practice has proven to be an effective method to help develop hard and soft skills, improving performance and assisting students in acquiring talents essential for professional competencies. the analyses performed in this study refer to the students who participated in the software studio in 2019-2020 and 2021-2022. part of the activities of these students occurred face-to-face, and part happened remotely; this may have caused some effects unknown to the researchers. a further study with the 2021-2022 cohort is already being initiated. it may reveal whether, for students who entered in a fully remote manner, the results will differ from those obtained in the present study.

acknowledgments

we thank the apple developer academy for allowing us to conduct the study, the students who participated in the interviews for making their time available, and cnpq for providing the grant for this study.
references

agante, l. (2015). a importância das soft skills na vida profissional. dinheiro vivo. available at: https://www.dinheirovivo.pt/gestao-rh/a-importancia-das-soft-skills-na-vida-profissional-12665712.html. accessed on july 4th, 2021.
barkley, e. f.; major, c. h.; cross, k. p. (2014). collaborative learning techniques – a handbook for college faculty. 2nd ed. san francisco: jossey-bass – a wiley brand, 417 p.
burge, j. e.; gannod, g. c.; anderson, p. v.; rosine, k.; vouk, m. a.; carter, m. (2012). characterizing communication instruction in computer science and engineering programs: methods and applications. in: frontiers in education conference proceedings, pp. 1-6. doi: 10.1109/fie.2012.6462496.
bull, c.; whittle, j. (2014). supporting reflective practice in software engineering education through a studio-based approach. ieee software, v. 31, n. 4, pp. 44-50.
bull, c. n.; whittle, j.; cruickshank, l. (2013). studios in software engineering education: towards an evaluable model. in: international conference on software engineering (icse '13), pp. 1063-1072.
carbone, a.; sheard, j. (2002). a studio-based teaching and learning model in it. in: proceedings of the 7th annual conference on innovation and technology in computer science education (iticse '02), v. 34, n. 4, pp. 213-217.
carter, m.; ferzli, m.; wiebe, e. n. (2007). writing to learn by learning to write in the disciplines. journal of business and technical communication, v. 21, n. 3, pp. 278-302.
danielewicz-betz, a.; kawaguchi, t. (2014). gaining hands-on experience via collaborative learning: interactive computer science courses. in: 2014 international conference on interactive collaborative learning (icl), ieee, december, pp. 403-409.
dors, t. m.; van amstel, f. m. c.; binder, f.; reinehr, s.; malucelli, a. (2020). reflective practice in software development studios: findings from an ethnographic study. in: 2020 ieee 32nd conference on software engineering education and training (csee&t).
dybå, t.; maiden, n.; glass, r. (2014). the reflective software engineer: reflective practice. ieee software, v. 31, n. 4, pp. 32-36.
fernandes, b. h. r. (2013). gestão estratégica de pessoas com foco em competência. rio de janeiro: elsevier.
ferraz, a. p.; belhot, r. v. (2010). taxonomia de bloom: revisão teórica e apresentação das adequações do instrumento para definição de objetivos instrucionais. gestão & produção, v. 17, n. 2, pp. 421-431.
hazzan, o. (2002). the reflective practitioner perspective in software engineering education. journal of systems and software, v. 63, n. 3, pp. 161-171.
lima, t.; porto, j. b. (2019). análise de soft skills na visão de profissionais da engenharia de software. in: workshop sobre aspectos sociais, humanos e econômicos de software (washes), 4., 2019, belém. anais [...]. porto alegre: sociedade brasileira de computação, pp. 31-40. doi: 10.5753/washes.2019.6407.
marques, m.; ochoa, s. f.; bastarrica, m. c.; gutierrez, f. (2018). enhancing the student learning experience in software engineering project courses. ieee transactions on education, v. 61, n. 1, pp. 63-73.
nylén, a.; isomöttönen, v. (2017). exploring the critical incident technique to encourage reflection during project-based learning. in: proceedings of koli calling 2017, koli, finland, november 16-19, 10 pages.
parsons, m.; stephenson, m. (2005). developing reflective practice in student teachers: collaboration and critical partnerships. teachers and teaching: theory and practice, v. 11, n. 1, pp. 95-116.
prior, j.; connor, a.; leaney, j. (2014). things coming together: learning experiences in a software studio. in: proceedings of the 2014 conference on innovation & technology in computer science education, pp. 129-134.
prior, j.; suman, l.; leaney, j. (2019). what is the effect of a software studio experience on a student's employability? in: proceedings of the 21st australasian computing education conference (ace '19), acm, sydney, nsw, australia, pp. 28-36.
prior, j.; ferguson, s.; leaney, j. (2016). reflection is hard: teaching and learning reflective practice in a software studio. in: acsw '16: proceedings of the australasian computer science week multiconference, february 2016. doi: 10.1145/2843043.2843346.
schön, d. a. (1983). the reflective practitioner: how professionals think in action. new york: basic books, 374 p.
schön, d. a. (1987). teaching artistry through reflection-in-action. in: educating the reflective practitioner: toward a new design for teaching and learning in the professions. 1st ed. san francisco, ca, us: jossey-bass.
strauss, a.; corbin, j. (2007). basics of qualitative research: techniques and procedures for developing grounded theory. 3rd ed. london: sage publications.
tomayko, j. e. (1991). teaching software development in a studio environment. in: proceedings of the twenty-second sigcse technical symposium on computer science education (sigcse '91), v. 23, n. 1, pp. 300-302.
kuhn, s.; hazzan, o.; tomayko, j. e.; corson, b. (2002). the software studio in software engineering education. in: 15th conference on software engineering education and training (csee&t 2002), proceedings, kentucky, usa, pp. 236-238.
yin, r. (2017). case study research: design and methods (applied social research methods). 6th ed. los angeles: sage publications.

journal of software engineering research and development, 2020, 8:1, doi: 10.5753/jserd.2019.457  this work is licensed under a creative commons attribution 4.0 international license.

supporting a hybrid composition of microservices. the eucaliptool platform

pedro valderas [ pros research center – universitat politècnica de valència, spain | pvalderas@pros.upv.es ]
victoria torres [ pros research center – universitat politècnica de valència, spain | vtorres@pros.upv.es ]
vicente pelechano [ pros research center – universitat politècnica de valència, spain | pele@pros.upv.es ]

abstract

to provide complex and elaborate functionalities, microservices may cooperate with each other either by following a centralized (orchestration) or a decentralized (choreography) approach. it seems that the decentralized nature of microservices makes the choreography approach more appropriate to achieve such cooperation, where lighter solutions based on events and message queues are used.
however, orchestration through the usage of a process model facilitates the analysis of the composition when it is modified. to combine the benefits of these two approaches, this paper presents a hybrid solution based on the choreography of business process pieces that are obtained from a previously defined description of the complete microservice composition. to support this solution, the eucaliptool platform is presented.

keywords: microservice, composition, choreography, orchestration

1 introduction

companies such as amazon, airbnb, twitter, netflix, apple, uber, and many others have shifted towards a microservices architecture intending to be more agile in doing their business. the technology and functionality independence acquired when applying this architecture allows companies to replace, scale, and upgrade their applications easily and very fast (newman, 2015; bucchiarone et al., 2018; shadija et al., 2017). however, to provide their customers with valuable services, developer teams are forced to build microservice compositions due to the small granularity level in which these operate (dragoni et al., 2017). many organizations define such compositions programmatically, in an ad-hoc way. the major problem when creating compositions in this way is that their complexity grows, making their visualization, understanding, and maintenance more difficult. this complexity has forced many companies to build their own solutions to compose microservices. among these solutions, we find zeebe (the evolution of the camunda project to orchestrate microservices), netflix conductor, ing baker, and uber cadence. apart from zeebe, these solutions have been developed by non-software companies to deal with the growing number of microservices each company handles to develop its business. in general, to achieve microservice compositions we can find two major approaches: choreography and orchestration.

as a motivating example, let us consider a process designed to place orders in a webshop, which is supported by four microservices: customers, payment, inventory, and shipment. the sequence of steps to process an order is the following:
1. a customer places an order in the webshop.
2. the customers microservice checks customer data and logs the request.
3. if the customer is accepted, the payment microservice starts to collect the money. if required, payment details can be requested from the customer. in any case, the customer must be informed.
4. as soon as the payment is performed, the inventory microservice starts to fetch the ordered items. if some problem occurs, the customer is informed and the order is canceled.
5. finally, once the items are fetched correctly, the shipping microservice creates a shipment order and assigns a driver.

when following the choreography approach (dragoni et al., 2017; butzin et al., 2016), the logic of the composition is distributed through the microservices, which communicate with each other through an event bus (usually supported by a message queue). thus, once the client places an order in the webshop (see fig. 1), an "order created" event is issued in the queue. the customers microservice, which is listening to this event, reacts by performing its assigned tasks, and a "customer accepted" event is triggered when the customer data is ok. then, the payment microservice, which is listening to this event, performs its tasks and generates the event that makes the next microservice in the composition perform the next tasks, and so on.
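in code, a participant in such a choreography boils down to an event listener plus an event publisher. the following is a minimal sketch of the customers microservice's side, assuming a rabbitmq broker accessed through spring amqp; the queue names, payload format, and helper methods are our own illustration, not part of any concrete webshop implementation.

import org.springframework.amqp.rabbit.annotation.RabbitListener;
import org.springframework.amqp.rabbit.core.RabbitTemplate;
import org.springframework.stereotype.Component;

@Component
public class CustomerEventHandler {

    private final RabbitTemplate rabbitTemplate;

    public CustomerEventHandler(RabbitTemplate rabbitTemplate) {
        this.rabbitTemplate = rabbitTemplate;
    }

    // react to the "order created" event issued by the webshop
    @RabbitListener(queues = "order.created")
    public void onOrderCreated(String orderJson) {
        logRequest(orderJson);
        if (checkCustomerData(orderJson)) {
            // trigger the event that the payment microservice is listening to
            rabbitTemplate.convertAndSend("customer.accepted", orderJson);
        }
    }

    private boolean checkCustomerData(String orderJson) { return true; } // placeholder
    private void logRequest(String orderJson) { /* placeholder */ }
}

note that the flow of the composition is nowhere explicit: it only emerges from the chain of event names that each microservice listens to and publishes, which is precisely the maintenance problem discussed next.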
let us now suppose that our company wants to provide special treatment to its vip customers, so that they can proceed with the payment at the end of the process. to keep these microservices low-coupled, this small change would imply the introduction of several changes in different microservices: the customers microservice should generate a different event depending on the type of customer to allow the participation of either the payment microservice (regular customers) or the inventory microservice (vip customers); in the same way, the shipment microservice should generate a different event to proceed with either the payment or the delivery of the order; and the payment microservice should also be modified to allow delivering the order in the case of vip customers. note how a single change requires the modification of several microservices. the major problem with this approach is that there is no clear picture of how microservices participate in the process, since the composition is hard-coded and distributed along multiple microservices. therefore, when engineering decisions need to be taken, it is difficult to analyze the composition's flow.

figure 1. microservice collaboration through choreography.

on the other hand, when building compositions with the orchestration approach (singhal et al., 2019; hamidehkhan, 2019), the logic of the microservice composition is centralized in an orchestrator microservice. one of the possible solutions for this approach is to define compositions as bpmn models and endow the orchestrator microservice with a bpmn engine that is in charge of executing them. the bpmn representation of the motivating example presented above is shown in fig. 2.

figure 2. bpmn representation of the motivating example.

in this case (see fig. 3), a client asks the orchestrator microservice to process an order, and this microservice executes the bpmn model that describes the microservice composition that manages customer orders. according to the logic of this composition, the first step the orchestrator takes is asking the customers microservice to check the customer data; it then waits for a response from this microservice. once the response from the customers microservice is received, the orchestrator microservice asks the payment microservice to collect the money and waits for a response, and so on. with this approach, the logic of the microservice composition is centralized in the orchestrator microservice. if we want to change the composition to support vip customers, we just need to update the bpmn model accordingly. however, all microservices depend on the orchestrator, reducing the degree of decoupling among them. also, there are some misconceptions within the microservice community that can make the adoption of this solution difficult: (1) many times, the task of process modeling is considered an overhead for a software project; and (2) bpm tools are considered to be heavyweight and to take weeks to set up.

figure 3. orchestration to support microservice collaboration.
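in a bpmn-based orchestrator, each service task of the model is typically bound to a piece of glue code that calls the corresponding microservice. as a rough sketch, with the activiti engine such a binding can be expressed as a java delegate like the one below; the variable names, the payment url, and the use of a synchronous rest call are assumptions made for illustration.

import org.activiti.engine.delegate.DelegateExecution;
import org.activiti.engine.delegate.JavaDelegate;
import org.springframework.web.client.RestTemplate;

// executed by the orchestrator's bpmn engine when the "collect money" service task runs
public class CollectMoneyDelegate implements JavaDelegate {

    private final RestTemplate rest = new RestTemplate();

    @Override
    public void execute(DelegateExecution execution) {
        String orderId = (String) execution.getVariable("orderId");
        // synchronous call to the payment microservice; the orchestrator waits for the reply
        String result = rest.postForObject("http://payment/payments", orderId, String.class);
        execution.setVariable("paymentResult", result);
    }
}

the delegate would be referenced from the corresponding service task in the bpmn model, so the flow itself stays in one place while the calls fan out to the microservices.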
in this paper, we face the challenge of defining a hybrid solution to compose microservices that combines the benefits of both approaches. this solution is based on the following:
1. developers describe the complete microservice composition by means of a centralized model. this allows having the big picture of the composition, which facilitates the subsequent maintenance and analysis tasks.
2. the centralized model of the composition is split into different pieces whose execution responsibility is delegated to the different participating microservices. each microservice is in charge of executing its piece and informing the other microservices about its execution. to do so, an event-based orchestration is proposed, which provides a degree of decoupling among microservices higher than the one provided by orchestration solutions.

to support this solution, we present the eucaliptool platform, which includes the following:
1. an authoring tool to define microservice compositions through a domain specific modeling language (dsml) that facilitates the modeling activity. this tool has been developed to alleviate the misconceptions of using a process model for composing microservices: developers can design the whole composition using constructors that are easier to use than business modeling elements. this tool also supports the transformation of descriptions based on our dsml into executable bpmn specifications, and the splitting of those specifications into pieces.
2. a microservice architecture that facilitates both the deployment of each bpmn piece into the corresponding microservice and the distributed execution of the microservice compositions through an event-based choreography. it also supports the maintenance and evolution of the microservice composition.

the remainder of the paper is structured as follows. section 2 outlines the hybrid solution proposed in this work to achieve microservice compositions. section 3 presents the architecture designed to support this solution. section 4 presents the authoring tool proposed to model microservice compositions. section 5 explains how a microservice composition is transformed into bpmn and split into pieces to be deployed in the proposed microservice architecture. section 6 analyzes how the evolution of microservice compositions is supported. section 7 introduces the related work. finally, section 8 concludes the paper and provides insights into directions for future work.

2 a hybrid approach to compose microservices

in this section, we present a hybrid approach to achieve microservice compositions. the stages proposed in this approach are the following:
1. developers define a centralized description of the complete microservice composition.
2. the centralized description is split into bpmn pieces, and these pieces are distributed among the microservices.
3. the microservice composition is executed through an event-based choreography of bpmn pieces.

to illustrate the proposed approach, we make use of the motivating example. first, developers start by defining a microservice composition in a centralized model. in the case of the motivating example, developers should create a composition like the one shown in fig. 2. note that this microservice composition is defined with bpmn; however, we propose a dsml to facilitate this modeling activity, which is presented in section 4. once developers have described the complete microservice composition, the bpmn model is split into pieces whose execution responsibility is delegated to the different participating microservices. as fig. 4 shows, the bpmn model of the motivating example is split into four pieces that must be executed by the different microservices.
figure 4. microservice orchestration split into different fragments.

an event-based choreography of bpmn pieces is proposed to support the execution of a microservice composition. in this sense, each microservice is in charge of executing its piece and informing the others about it. following the motivating example, once the client places an order in the webshop (see fig. 5), an "order process" event is issued in the message broker. the customers microservice, which is listening to this event, reacts by executing its associated bpmn piece, and the "piece1_completed" event is triggered if the customer data is ok. then, the payment microservice, which is listening to this event, performs its bpmn piece and generates the event that makes the next microservice in the composition execute the next piece, and so on.

note that current business process management (bpm) tools provide little support to create a business process model and split it into pieces that can be deployed into different microservices. there is also little help to implement the communication mechanisms that are required to coordinate the execution of the different pieces to complete a process. in addition, note that we propose to have two versions of the composition: on the one hand, the model of the whole microservice composition; on the other hand, a split version that is distributed along the microservices. thus, when the microservice composition needs to be evolved due to changes in requirements, both versions must be updated, which implies additional effort for developers. therefore, if we want developers to adopt our proposal, we need to provide them with tools that facilitate the modeling tasks and provide a high degree of automation to deploy composition pieces and configure the execution environment. to achieve this, we present the eucaliptool platform. the next section introduces the supporting microservice architecture.

figure 5. event-based orchestration of bpmn pieces.

3 supporting microservice architecture

in a microservice architecture, applications are structured as a collection of loosely coupled services, which implement the business capabilities of a system. apart from those business microservices, it is usual to find in this type of architecture other microservices that are focused on supporting infrastructure issues. examples of this type of microservices are the service registry, which supports service discovery by containing the network locations of microservice instances; an api gateway, which provides addressability capabilities; an authentication server, which is in charge of controlling the access to the microservices; and a configuration server, which manages microservice configuration on the cloud. in addition, the use of tools to monitor microservices' status and log their executions is also common, as well as the deployment of a message queue to manage asynchronous communication among microservices. finally, microservices are usually complemented with a client-side load balancer and some library that implements the circuit breaker pattern to support fault tolerance. microservice architectures have already been used to build business process modeling and analysis tools (alpers et al., 2015). in this work, we extend the typical microservice architecture with three main elements (see red-colored blocks in fig. 6):

1. eucaliptool composer.
it is a microservice endowed with an authoring tool to facilitate the creation of microservice compositions. this microservice is also in charge of transforming the compositions created through the authoring tool into a bpmn executable specification, splitting it into bpmn pieces, and sending them to the eucaliptool server. in addition, this microservice stores the whole description of the microservice composition created with the authoring tool.

2. eucaliptool server. it is a microservice that plays the role of gateway between the business microservices and the eucaliptool composer. it is responsible for the following tasks:
a. receiving the split bpmn processes sent by the eucaliptool composer, registering them into a process repository, and distributing the pieces among the different microservices.
b. launching the execution of each process by triggering the first bpmn piece and delegating the responsibility of continuing the process to the corresponding microservice. to achieve this, a message queue is used.
c. providing the eucaliptool composer with the list of available microservices and their operations. to achieve this, microservices must be registered into this server using the eucaliptool client.

3. eucaliptool client. it is a client library that endows each microservice with (1) a lightweight activiti (https://www.activiti.org/) bpmn engine and (2) a microservice composition authoring tool. the bpmn engine is included to support the execution of bpmn pieces. the authoring tool is included to support the evolution of these pieces by the developers of each microservice. this library is also in charge of automatically registering the microservice's operations into the eucaliptool server.

figure 6. eucaliptool elements in the microservice architecture.

to satisfy the responsibilities associated with each architectural element, the elements must interact with each other. this interaction is done through the http protocol: each architectural element is in charge of publishing the required http end-points. for instance, the eucaliptool client library is in charge of publishing an http end-point that allows the eucaliptool server to send the bpmn pieces to each microservice. in the same way, the eucaliptool server must publish an http end-point that allows the eucaliptool client library to register the operations of each microservice.

3.1 supporting technology

one of the most important supporters of the microservice architecture is netflix. this video streaming company has developed its software infrastructure by using microservices and has published all its supporting tools as open source (https://netflix.github.io/). one of the main characteristics of these tools is their ease of use. these tools are based on the spring boot framework (https://projects.spring.io/spring-boot/) and are distributed as java libraries. they propose the use of simple annotations and configuration files to develop and deploy the different components of the architecture. for instance, to build a service registry to support microservice discovery, it is enough to create a spring boot java class and annotate it with @enableeurekaserver. then, you just need to define some parameters in a configuration file and the "magic" is done: you have a functional service registry. we want to follow the same strategy to facilitate the use of the eucaliptool infrastructure in a real microservice architecture.
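for illustration, the service registry setup mentioned above looks roughly as follows; this is the standard spring cloud netflix usage, with the class name and the omitted configuration file being our own choices.

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.cloud.netflix.eureka.server.EnableEurekaServer;

// a functional eureka service registry: the annotation pulls in everything else
@SpringBootApplication
@EnableEurekaServer
public class ServiceRegistryApplication {
    public static void main(String[] args) {
        SpringApplication.run(ServiceRegistryApplication.class, args);
    }
}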
thus, we have created three java packages that encapsulate the functionality of the three proposed architectural elements, and they are complemented with the following three annotations:
- @eucaliptoolcomposer
- @eucaliptoolserver
- @eucaliptoolclient

to create these microservices, developers just need to create a spring boot java class, use these annotations and, in some cases, define some configuration parameters. for instance, to create a eucaliptool server microservice, developers just need to import the corresponding java libraries and create a java class as follows:

@EucaliptoolServer
public class Server {
    public static void main(String[] args) {
        SpringApplication.run(Server.class, args);
    }
}

the springapplication class is a spring utility that creates a java application with an embedded tomcat. when the above code is executed, we intercept the run method and search for our annotations by using reflection capabilities. when @eucaliptoolserver is found, we deploy the functionality of this component into the embedded tomcat. we also create an http controller that publishes the end-points required to interact with the rest of the architectural elements. the configuration required for this component consists of the end-points of the components it needs to interact with, in particular the api gateway, the service registry, and the message queue. this configuration is done through a yml file. by using and configuring the other two annotations, we achieve the following:
- @eucaliptoolcomposer. it creates a spring application with the eucaliptool composer deployed into the embedded tomcat. this annotation needs a config file that indicates the end-points of the eucaliptool server that (1) provide the list of microservices and their operations, and (2) allow sending a split composition. it also creates an http controller that publishes the end-points required to interact with the eucaliptool server.
- @eucaliptoolclient. it transforms a microservice into a eucaliptool client. to do so, it includes a lightweight version of the activiti engine to execute bpmn pieces. it also includes a web graphical editor deployed into the embedded tomcat. in addition, it creates an http controller that publishes end-points both to receive bpmn pieces and to subscribe the microservice to choreography events. this annotation needs a config file that indicates the end-points of the eucaliptool server, in order to register the microservice's operations and to send bpmn pieces when they are modified.
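mirroring the server example above, a business microservice would be turned into a eucaliptool client in the same way; this is our own sketch, since the paper does not show this class, and the package of the annotation is not given.

import org.springframework.boot.SpringApplication;

// @EucaliptoolClient comes from the eucaliptool client library
// (its package name is not stated in the paper)
@EucaliptoolClient
public class PaymentMicroservice {
    public static void main(String[] args) {
        SpringApplication.run(PaymentMicroservice.class, args);
    }
}

its yml configuration would then point to the eucaliptool server end-points used to register the microservice's operations and to send back modified bpmn pieces.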
4 specifying microservice compositions

the eucaliptool composer includes a web-based authoring tool that proposes a domain specific modeling language (dsml) to facilitate the modeling of microservice compositions. it is based on previous work that focuses on helping end-users to compose services by using a visual interface on a mobile device (valderas et al., 2017). next, we present the abstract syntax of the dsml (i.e., the conceptual elements) and the concrete syntax (i.e., the graphical components that define the web interface).

4.1 dsml abstract syntax

the abstract syntax of the dsml supported by the web graphical editor is based on the change patterns (weber et al., 2008) developed within the context of the process of process modeling. change patterns are high-level abstractions aimed at achieving flexible and easy adaptations of a business process. these abstractions are defined in terms of high-level change operations (e.g., the creation of a parallel branch) which are based on the execution of a set of change primitives (e.g., add/delete activity). as opposed to change primitives, change pattern implementations typically guarantee model correctness after each transformation (casati, 1998) by associating pre/post conditions with high-level change operations. usually, process modeling environments supporting the correctness-by-construction principle (e.g., dadam et al., 2009) just provide process modelers with those change patterns that transform a sound process model into another sound one. for this purpose, structural restrictions on process models (e.g., block structuredness) are imposed. in addition, correct usage of change patterns allows speeding up the creation of the composition. some change patterns are (weber et al., 2008): insert process fragment, embed process fragment in loop, embed process fragment in conditional branch, etc.

inspired by the concept of fragment introduced by change patterns, the abstract syntax of the dsml proposed to compose microservices is shown in fig. 7.

figure 7. domain specific language designed for eucaliptool.

a microservice composition is made up of compositionelements of two types: operations (of a microservice) and fragments. each operation has some inputs and one output. inputs are classified into three types depending on the source from which their value is obtained: this source can be the output of another operation, the value can be obtained at runtime, or it can be defined at design time. in the next subsection, this issue is explained with some examples. regarding fragments, there are four types: parallel, which has two or more branches of elements that must be executed in parallel; conditional, which has one or more branches of elements that must be executed when a condition is satisfied; loop, which has a branch of elements that must be executed while a condition is satisfied; and witherror, which has two branches of elements, a major one that is executed by default and a compensation one that is executed if some error occurs with some of the major branch's operations. the previouselement relationship between compositionelements allows establishing the sequence order between operations and fragments. to better understand the concepts of this metamodel, fig. 8 illustrates them in a process that is composed of a sequence of four operations followed by a parallel fragment. in turn, this latter parallel fragment is made up of a conditional fragment and two operations that are executed in parallel to it.

figure 8. dsl concepts applied in an example.
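to make the metamodel concrete, the following is a compact java rendering of the abstract syntax of fig. 7; the names follow the metamodel, but the class structure itself is our own illustration, not eucaliptool code (which manages compositions as json, as section 5 explains).

import java.util.List;

// a composition element is either a microservice operation or a fragment
interface CompositionElement { }

// where an input value comes from: another operation's output, runtime, or design time
enum InputSource { OPERATION_OUTPUT, RUNTIME, DESIGN_TIME }

record Input(String name, InputSource source) { }

record Operation(String microservice, String name,
                 List<Input> inputs, String output) implements CompositionElement { }

enum FragmentType { PARALLEL, CONDITIONAL, LOOP, WITH_ERROR }

// each branch is an ordered sequence of elements; the list order plays the role
// of the previouselement relationship of the metamodel
record Fragment(FragmentType type,
                List<List<CompositionElement>> branches) implements CompositionElement { }

record MicroserviceComposition(String name, List<CompositionElement> elements) { }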
4.2 dsml concrete syntax

to create a composition of microservices, we have defined a web interface based on the "adding element" metaphor, where microservice developers just need to add a set of operations or fragments to a composition. to exemplify this interface, fig. 9 shows some of the screens needed to define the payment piece marked in green in fig. 4. fig. 9a shows the composition after adding the checkcustomer and logrequest operations of the customers microservice. to add more elements, designers just need to click on the "+" symbol. the types of elements that can be added to a composition are single operations and fragments (note that there are two tabs in fig. 9b). fig. 9b shows a list of fragments that are ready to be used in the current composition; in this case, the designer is selecting a with error fragment. as a result, a fragment of this type is included after the existing operations (see fig. 9c). here, the designer should specify two things: the major branch of operations to perform and the compensation branch of operations in case the major branch fails. in this case, the designer selects the paymentprocess operation offered by the payment microservice to be included in the major branch (see fig. 9d). this is offered as a single operation from the available catalog. this list shows the microservice operations that the eucaliptool server sends to the eucaliptool composer; these operations are automatically registered into the eucaliptool server by the eucaliptool client library that is installed in each microservice. the selection of this single operation results in the screen shown in fig. 9e. at this point, the designer still has to specify what to do when the major branch fails. this can be specified by selecting the tab labeled with the warning icon and proceeding similarly to the definition of the major branch; in this case, the designer selects the operation changepaymentdetails. with this action, the second element of the composition is already completed (see fig. 9f). from here, the designer should continue by selecting the most appropriate operations or fragments until the composition is completely defined.

once the microservice composition's flow is described, developers must define the inputs that some microservice operations require to be properly executed. to facilitate this, we provide a graphical component (see fig. 10) that allows: (1) linking an input with any compatible previous output; (2) indicating that the input value should be obtained at runtime; or (3) defining an input value at design time. for instance, let us consider that the operation cancelorder, which must be executed by the inventory microservice in case of error, needs two inputs: the customer id, which is a string, and the order number, which is an integer. let us also consider that all previous microservice operations generate a string value as output. fig. 10b shows the options that are available for the customer input: in this case, it can be associated with any previous operation, since their data types are compatible, and it can also be defined as an input to be obtained at runtime or as an input associated with a predefined value (defined at design time in this screen). fig. 10 also shows the options available for the order input: in this case, none of the previous operations is compatible, so they are not available to be associated with this input. if a developer selects the predefined value option for a microservice's input, an input component is shown to allow the developer to introduce the value associated with the microservice's input at design time. the option of defining an input to be obtained at runtime implies that the value must be obtained when executing the microservice composition, from a data source different from the operations included in the composition. currently, we consider that this data source is the client that launches the microservice composition (see fig. 5). thus, any time a microservice needs to execute an operation that has some input to be obtained at runtime, the corresponding bpmn piece generates an event to ask the client for this data.
in further work, we want to consider other data sources, such as the results of other microservice compositions or physical devices in the context of the internet of things.

figure 9. example of the dsml concrete syntax to create microservice compositions.

figure 10. configuration of microservice operations' inputs.

5 supporting the execution of split bpmn processes

once a microservice composition is defined with the eucaliptool composer, three main stages are followed to distribute the responsibility of the process execution:
(1) generation. the composition is transformed into a set of bpmn pieces.
(2) distribution. bpmn pieces are sent to the eucaliptool server, which registers the process and deploys the pieces into the corresponding microservices.
(3) choreography. each microservice participates in the composition through an event-based orchestration.

5.1 generation of bpmn pieces

the eucaliptool composer analyzes each process defined with the dsml and creates groups of actions according to the microservices that support them. each of these groups will be transformed into a bpmn piece. for instance, let us consider the composition presented in the motivating example (cf. fig. 11). in this case, the first two operations must be executed by the customers microservice and, therefore, they constitute the first piece. the second piece is defined by the third and fourth elements of the composition (a with error block and a single operation), which must both be executed by the payment microservice. the third piece is defined from the operations that the inventory microservice must execute, i.e., fetching the items and the compensation actions in case of error. finally, the fourth piece is made up of the last two operations, which the shipment microservice must perform.

figure 11. identification of bpmn pieces.

for each bpmn piece, the eucaliptool composer generates a specification with the bpmn tasks to be performed as well as additional tasks to trigger the events that must manage the orchestration. for instance, let us consider the operations that the inventory microservice must perform (the third bpmn piece). this microservice must fetch the items of the order and, in case of error, inform the user and cancel the order. fig. 12 shows the definition built with the eucaliptool composer and the generated bpmn process model. as we can see, two additional bpmn tasks are in charge of 1) triggering an ok event in case there is no error, and 2) triggering a fail event if some problem occurs. these tasks are preconfigured to publish the event in a message queue.

figure 12. generated piece of bpmn.

the eucaliptool composer internally manages each composition in json format. to transform json descriptions into bpmn (which is based on xml), it uses java parsers for json and xml. the json description is parsed into a structure of java objects that is maintained in memory; next, this structure is analyzed to generate a bpmn specification by using the xml parser. in particular, we generate bpmn specifications that will be executed in the activiti engine, i.e., the engine included in each microservice by the eucaliptool client library.
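the activiti api that such an embedded engine exposes for this purpose is small; a minimal sketch of deploying and launching a received piece could look as follows, where the resource name and process key are hypothetical and the standalone in-memory engine stands in for the one embedded by the eucaliptool client.

import org.activiti.engine.ProcessEngine;
import org.activiti.engine.ProcessEngineConfiguration;

public class PieceRunner {
    public static void main(String[] args) {
        // build a standalone engine; the eucaliptool client embeds a similar one
        ProcessEngine engine = ProcessEngineConfiguration
                .createStandaloneInMemProcessEngineConfiguration()
                .buildProcessEngine();

        // deploy the bpmn piece generated by the eucaliptool composer
        engine.getRepositoryService().createDeployment()
                .addClasspathResource("inventory-piece.bpmn20.xml")
                .deploy();

        // start the piece; its final tasks publish the ok/fail events to the queue
        engine.getRuntimeService().startProcessInstanceByKey("inventoryPiece");
    }
}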
5.2 distribution of bpmn pieces

once the set of bpmn pieces has been generated, the eucaliptool composer sends them to the eucaliptool server. to do so, the latter publishes an http end-point that accepts this data through post connections. when the eucaliptool server receives a split composition, it performs the following actions (see fig. 13):
(1) it registers the composition into its repository and creates an http end-point to launch it.
(2) it deploys each bpmn piece into the corresponding microservice.
(3) it defines an event to launch the first bpmn piece and configures the first microservice to listen to it.
(4) for each event generated by a bpmn piece, it configures the microservice that must execute the next piece to listen to this event.

note that the eucaliptool server must interact with the microservices to deploy each bpmn piece as well as to configure each microservice to listen to specific events. this can be done using the set of http end-points that each microservice makes available when including the eucaliptool client.

figure 13. actions done by the eucaliptool server.

5.3 orchestration of bpmn pieces

the orchestration of the bpmn pieces deployed in microservices is done as follows (see fig. 14):
(1) a client accesses the end-point published by the eucaliptool server.
(2) the eucaliptool server launches the start event for this process.
(3) the microservice that is listening to this event executes the first bpmn piece. this execution finishes by triggering an event that indicates that the execution of the first bpmn piece is completed.
(4) the microservice that is listening to the event that indicates the execution of the first bpmn piece launches its own bpmn piece (the second one) and, when it is executed, generates another event that indicates that the execution of the second bpmn piece is completed.
(5) the microservice that is waiting for the event that indicates the execution of the second bpmn piece does the same as the previous one: it launches its corresponding bpmn piece and generates an event that indicates its execution.
(6) and so on, until the process is completed.

figure 14. event-based orchestration of a split bpmn process.
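inside a eucaliptool client, steps (3)-(5) amount to gluing the message queue to the embedded bpmn engine. the following sketch shows what this glue could look like for the payment microservice, assuming a rabbitmq queue per choreography event and hypothetical queue and process key names.

import org.activiti.engine.RuntimeService;
import org.springframework.amqp.rabbit.annotation.RabbitListener;
import org.springframework.stereotype.Component;

@Component
public class PieceChoreographyHandler {

    private final RuntimeService runtimeService;

    public PieceChoreographyHandler(RuntimeService runtimeService) {
        this.runtimeService = runtimeService;
    }

    // react to the completion event of the previous piece in the composition
    @RabbitListener(queues = "piece1_completed")
    public void onPreviousPieceCompleted(String payload) {
        // execute the local bpmn piece; its last task publishes "piece2_completed"
        runtimeService.startProcessInstanceByKey("paymentPiece");
    }
}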
6 supporting the evolution of microservice compositions

following the proposed hybrid approach, we have two descriptions of a microservice composition. on the one hand, we have the whole picture of the composition, which is stored by the eucaliptool composer; this centralized description helps developers to analyze the whole composition to take engineering decisions. on the other hand, we have the split version of the composition, which is distributed through the different microservices; this split description provides a high degree of decoupling among microservices when the composition is executed through an event-based choreography. one of the most important challenges to be faced within this context is the evolution of the microservice composition and the synchronization of both descriptions. our main goal is to propose a solution that provides developers with a high degree of flexibility to perform changes, so that these can be done either at the centralized composition, i.e., at the whole composition, or at the microservice level, i.e., at the pieces deployed in each microservice. to achieve this, as introduced in section 3, the following mechanisms are provided by the proposed three architectural elements:
- the eucaliptool client library includes a web editor like the one shown in section 4, where developers can independently evolve their composition pieces.
- the eucaliptool server publishes an http end-point to receive modified composition pieces from microservices and send them to the eucaliptool composer.
- the eucaliptool composer publishes an http end-point to receive modified composition pieces from the eucaliptool server and update the whole version of the composition.

thus, the evolution of a microservice composition can be done in two ways:
1. developers update the whole description of the composition from the eucaliptool composer microservice (see fig. 15a). in this case:
1.1 the eucaliptool composer microservice generates the corresponding bpmn pieces and sends those pieces that have changed to the eucaliptool server.
1.2 the eucaliptool server microservice distributes the pieces among the corresponding business microservices.
1.3 the microservices that receive a new version of a piece replace the old version with the new one.
2. developers change a composition piece from a business microservice (see fig. 15b). in this case:
2.1 the microservice sends the new version of the piece to the eucaliptool server.
2.2 the eucaliptool server sends the received piece to the eucaliptool composer.
2.3 the eucaliptool composer updates the whole description with the changes introduced by the modified piece.

figure 15. evolution of a microservice composition.

to update the whole composition when an updated bpmn piece is received, the eucaliptool composer applies the transformation inverse to the one used to generate the bpmn pieces and obtains a json representation of the piece. this json representation is based on the dsml presented above, and the eucaliptool composer just needs to replace the elements of the whole description that correspond to the updated piece. note that updating the whole description of the microservice composition is easy, since pieces are composed of operations and fragments that are added to a container; there are no connections with previous or further elements that need to be managed, as can happen with a bpmn model. to better understand this aspect, fig. 16 illustrates how the composition of the motivating example is updated with a new piece 2.

figure 16. example of composition update by replacing a piece.
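as a rough sketch, the end-point with which the eucaliptool server receives a modified piece (flow 2.1 above) could be a plain spring controller like the following; the path, payload format, and forwarding step are assumptions made for illustration.

import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class PieceUpdateController {

    @PostMapping("/compositions/{compositionId}/pieces/{pieceId}")
    public ResponseEntity<Void> receiveUpdatedPiece(@PathVariable String compositionId,
                                                    @PathVariable String pieceId,
                                                    @RequestBody String bpmnXml) {
        // register the new version and forward it to the eucaliptool composer,
        // which updates the whole description of the composition (flows 2.2/2.3)
        return ResponseEntity.ok().build();
    }
}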
7 evaluation
this section presents the experiment that we conducted to show the efficiency of our proposal in the development and evolution of microservice compositions. this experiment aimed to compare the efficiency obtained by a development based on eucaliptool with that obtained by an ad-hoc implementation of an event-based choreography. this ad-hoc implementation was done using the technology provided by spring and netflix. to support the exchange of messages among microservices, a rabbitmq message broker was used in both cases. to conduct the experiment, we followed the guidelines presented by kitchenham et al. (1995) and wohlin et al. (2012). according to these guidelines, we divided the experiment into four main phases: scoping, planning, operation, and analysis and interpretation.

7.1 scope
the scope of an experiment is set by defining its goal. to do so, we used the template proposed by basili et al. (1988). the goal of our experiment is characterized as follows:
analyze: our approach based on eucaliptool
for the purpose of: evaluating the impact of our approach compared to ad-hoc development
with respect to: efficiency
from the point of view of: microservice developers
in the context of: researchers in software engineering composing microservices

7.2 experimental design
in the planning activity, we must formalize the hypotheses, determine the dependent and independent variables, describe the context of the experiment and the instrumentation used, and consider the threats to validity we can expect.
hypotheses. the hypotheses defined for the experiment were the following:
• null hypothesis 1, h10: the efficiency of the eucaliptool approach for developing and evolving microservice compositions is the same as that of an ad-hoc development.
• alternative hypothesis 1, h11: the efficiency of the eucaliptool approach for developing and evolving microservice compositions is greater than that of an ad-hoc development.
identification of variables. we identified two types of variables:
• dependent variables: variables that correspond to the outcomes of the experiment. in this work, the efficiency in composing microservices was the target of the study, which was measured in terms of the following software quality factors: development time and evolution time.
• independent variables: variables that affect the dependent variables. the development method was identified as a factor that affects the dependent variable. this variable had two alternatives: (1) the eucaliptool approach and (2) an ad-hoc implementation.
context. the context of the experiment was the following:
• experimental subjects: ten subjects participated in the experiment, all of them researchers in software engineering. their ages ranged between 28 and 45 years. the subjects had an extensive background in java programming and modeling tools; however, they had no experience in the use of eucaliptool. only 3 of them had experience in using the spring framework and message queues, and 4 of them had previously worked with bpmn.
• objects of study: the experiment was conducted using a case study similar to the motivating example used throughout the paper, i.e., the microservice composition to manage a purchase order in a webshop (see section 1).
instrumentation. the instruments used to carry out the experiment were:
• a demographic questionnaire: a set of questions to assess the users' level of experience in java/spring programming, modeling tools, and bpmn.
• a work description: the description of the work that the subjects should carry out in the experiment using eucaliptool and the ad-hoc solution. this work description explained two activities: (1) the development of the microservice composition to support purchase orders, and (2) the modification of this composition to support new requirements.
• a form: a form was defined to capture the start and completion times of the proposed work. for each task proposed in the experiment, participants had to annotate the start and completion times using the computer clock. if interruptions occurred while performing the work, subjects wrote down the times every time they stopped and resumed the activity; the total time was then derived from these start and completion times. finally, additional space was left after the completion time for comments by the subjects about the performed activity.
threats to validity. our experiment was threatened by the random heterogeneity of subjects. this threat appears when some users within a group have more experience than others. it was minimized with a demographic questionnaire that allowed us to evaluate the knowledge and experience of each participant beforehand. this questionnaire revealed that all the users had experience in java programming and modeling techniques. some of them had experience in the technologies used to implement choreographies, while others did not; this could affect the evaluation of the development with an ad-hoc solution, since this type of development requires those technologies. some participants had experience in bpmn, which could affect the evaluation of the development based on eucaliptool, since it is based on some abstractions of bpmn. to minimize this threat, all subjects participated in training sessions about both the choreography implementation technologies and eucaliptool. in addition, to minimize the effect of the order in which the subjects applied the approaches, the order was assigned randomly to each subject. however, in order to have a balanced design, the same number of subjects was assigned to start with each approach. to do so, the ten participants were randomly divided into two groups, and each group was initially assigned to a development type. then, each group switched development types and performed the same tasks again. in this way, we minimized the threat of learning from previous experience.
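as an illustration of this balanced random assignment, the following sketch (with hypothetical subject identifiers) shuffles the ten subjects and splits them into the two starting groups:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class Counterbalance {
    public static void main(String[] args) {
        // ten subjects, shuffled and split in half so that the same number
        // of subjects starts with each development approach
        List<String> subjects = new ArrayList<>();
        for (int i = 1; i <= 10; i++) subjects.add("S" + i);
        Collections.shuffle(subjects);
        List<String> groupA = subjects.subList(0, 5);  // starts with the ad-hoc solution
        List<String> groupB = subjects.subList(5, 10); // starts with eucaliptool
        System.out.println("group a: " + groupA);
        System.out.println("group b: " + groupB);
        // on the following day, the groups swap approaches and repeat the tasks
    }
}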
finally, our experiment was threatened by the reliability-of-measures threat: objective measures, which can be repeated with the same outcome, are more reliable than subjective measures. in this experiment, the precision of the measures may have been affected since the activity completion time was measured manually by the users using the computer clock. to reduce this threat, we observed subjects while they were performing the different tasks to guarantee their exclusive dedication to the activities and to supervise the times they wrote down.

7.3 execution
we followed a within-subjects design where all subjects were exposed to every treatment/approach (eucaliptool solution and ad-hoc solution). the main advantage of this design is that it allows statistical inference to be made with fewer subjects, making the evaluation more streamlined and less resource-heavy (wohlin et al., 2012). to perform the experiment, we arranged a three-day workshop with two sessions per day (see table 1).

table 1. sessions of the experiment.
day 1, session 1 (4h): all participants: training in choreography implementation.
day 1, session 2 (4h): all participants: training in eucaliptool.
day 2, session 1 (5h): group a: development of a microservice composition with an ad-hoc solution; group b: development of a microservice composition with eucaliptool.
day 2, session 2 (3h): group a: evolution of a microservice composition with an ad-hoc solution; group b: evolution of a microservice composition with eucaliptool.
day 3, session 1 (5h): group a: development of a microservice composition with eucaliptool; group b: development of a microservice composition with an ad-hoc solution.
day 3, session 2 (3h): group a: evolution of a microservice composition with eucaliptool; group b: evolution of a microservice composition with an ad-hoc solution.

during the first day, we held two 4-hour sessions in which participants filled in a demographic questionnaire to capture their background and were trained in choreography technologies and eucaliptool. in particular:
• regarding choreography technologies, we provided the subjects with the tutorials and tools needed to learn the basics of the spring and netflix technologies required to develop the case study. we also gave an introduction to message queues and rabbitmq. the subjects also participated in the implementation of some guided examples to gain experience with the technologies.
• regarding eucaliptool, we provided the subjects with a tutorial explaining the web authoring tool included in the eucaliptool composer. the subjects also worked through some examples to gain experience with the dsml of this tool. we also explained the proposed architecture and how the proposed eucaliptool architectural elements interact with each other and need to be configured.
during the second and third days, participants were randomly divided into two groups, a and b, and two sessions of five and three hours, respectively, were held each day. we ran the same experiment on both days: on one day, group a used an ad-hoc solution to develop and evolve a microservice composition while group b used eucaliptool; on the other day, the groups exchanged development methods. the tasks designed for the experiment started with a short presentation in which general information and instructions were given. afterward, the work description and the form were given to the subjects, and they started to develop and evolve the microservice composition following the development method (eucaliptool or ad-hoc) indicated for their group. the microservice composition that participants had to develop was described textually. after performing this work, participants filled in the form to capture the development times. once the subjects had developed the composition, they started to modify it in order to evaluate the evolution. for these activities, they also filled in the form to capture the time taken to evolve the composition. to properly support this work, we had previously developed the microservice architecture required for the case study using netflix's technology. the eucaliptool composer and the eucaliptool server microservices were also created, and every business microservice was defined as a eucaliptool client. in more detail, the activities carried out with each development approach were the following:
• ad-hoc development: from the case study description, participants started the implementation of the microservice composition for the management of purchase orders.
in general, they identified the operations that each microservice should perform and defined for each of them both a starting event and an ending event. once this was clear, they updated each microservice with the classes required to connect to rabbitmq and listen to the starting event in order to launch the operations corresponding to each microservice. to execute these operations, they implemented classes that call the corresponding methods; these classes were also in charge of launching the ending event. once they had modified each microservice and achieved the compilation of the code, they spent some time testing the composition and detecting code errors. finally, we provided a set of requirement changes for the composition to evaluate the evolution. in particular, we asked them to support vip customers as introduced in section 1. in this activity, the participants changed the code of the involved microservices to support the new requirements. then, the participants tested the new composition and corrected the errors.
• eucaliptool-based development: following this approach, the participants first designed the microservice composition with the eucaliptool composer according to the case study description. then, they asked the eucaliptool composer to deploy the composition. afterward, they spent some time testing the composition and detecting errors in the composition design. finally, we asked participants to support the same new requirements as in the previous activity. in this case, the participants changed the composition in the eucaliptool composer and deployed it again. then, the participants tested the new composition and corrected the errors.
7.4 analysis of results
in this subsection, we analyze and compare the usefulness of both approaches based on the time spent on the development and evolution of a microservice composition. the results were studied based on the comparison of time means and the standard deviation. table 2 presents the descriptive statistics for each of the studied quality factors.

table 2. descriptive statistics for each quality factor (10 subjects per cell).
development time: ad-hoc, mean 4.38 hours, std. dev. 0.52 hours; eucaliptool, mean 1.15 hours, std. dev. 0.44 hours.
evolution time: ad-hoc, mean 1.55 hours, std. dev. 0.69 hours; eucaliptool, mean 0.29 hours, std. dev. 0.05 hours.

next, we provide further analysis of the results for each measured software quality factor:
• development time. the development time following the ad-hoc approach differed according to the subject's implementation experience, ranging from 3.25 hours (the most experienced subject) to 5 hours. following the eucaliptool approach, the development activity ranged from 1.25 hours to 2.10 hours. the difference between the two approaches was large since developing the microservice composition in an ad-hoc way was more complex and difficult for the participants: they had to implement all the composition logic manually as well as all the code required to connect with rabbitmq to participate in the event-based choreography. the eucaliptool approach allowed participants to focus on the requirements instead of solving technological problems. note that, following this approach, none of the participants had to implement anything to manage the invocation of operations or the events required to participate in the choreography. regarding the standard deviation, it was low for both development approaches (see table 2), indicating that development times tended to be close within each development approach.
• evolution time. concerning the ad-hoc development, this activity took subjects from 1.10 to 2.3 hours since they had to identify the microservices that must be updated and modify the corresponding code. changing the eucaliptool description of the microservice composition took less than 30 minutes for all the subjects (a very low standard deviation was obtained). this is because evolving the microservice composition to fit the new requirements was as easy as modifying the whole description with the web authoring tool. in this case, participants again focused only on requirements: they did not need to identify microservices and hardcode changes.
adding the two activities, subjects took, on average, 1.44 hours to develop and evolve the case study with the eucaliptool approach, whereas with the ad-hoc implementation they took 5.93 hours. therefore, the process for automating and evolving microservice compositions is more efficient using the eucaliptool approach than using an ad-hoc solution. in order to verify whether we could accept the null hypothesis, we performed a paired t-test (statistical analyses using spss: http://www.ats.ucla.edu/stat/spss/whatstat/whatstat.htm#1sampt) using ibm spss statistics v20 at a confidence level of 95% (α = 0.05). this test is a statistical procedure used to make a paired comparison of two sample means, i.e., to see if the means of the two samples differ from one another. for our study, this test examines the difference in mean times for every subject with the different approaches to test whether the means of the ad-hoc development and the eucaliptool approach are equal. when the critical level (the significance) is higher than 0.05, we accept the null hypothesis because the means are not statistically significantly different. for our experiment, the significance of the paired t-test for the total time means is 0.000 (calculated using ibm spss statistics), which means that we can reject the null hypothesis h10 (the efficiency of the eucaliptool approach for developing and evolving microservice compositions is the same as that of an ad-hoc development). based on this test, we have strong evidence that the kind of development influences efficiency. specifically, the efficiency using the eucaliptool approach is significantly better than using an ad-hoc solution, i.e., the mean values for all the measures are lower when using the eucaliptool approach; thus, the alternative hypothesis h11 is fulfilled: the efficiency of the eucaliptool approach for developing and evolving microservice compositions is greater than that of an ad-hoc development.
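for reference, the paired t statistic underlying this test can be written as follows (a standard formulation, not spss-specific), where d_i is the per-subject difference between the total times under the two approaches:

\[
t = \frac{\bar{d}}{s_d / \sqrt{n}}, \qquad
\bar{d} = \frac{1}{n}\sum_{i=1}^{n} d_i, \qquad
s_d = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (d_i - \bar{d})^2},
\]

with n = 10 subjects and n - 1 = 9 degrees of freedom; h10 is rejected when the p-value associated with t falls below α = 0.05.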
7.5 conclusions
the experiment presented above evaluated our approach to developing and evolving microservice compositions against ad-hoc solutions based on choreographies. we validated that our approach is more efficient than ad-hoc solutions and confirmed the expected benefits suggested in the introduction. on the one hand, having the big picture of the composition facilitated its analysis to support its evolution when requirements changed. on the other hand, the visual editor of eucaliptool, as well as the supporting infrastructure to manage event-based communication, significantly facilitated the definition and execution of choreographed microservice compositions. note that we evaluated ad-hoc solutions based on choreographies since the decentralized nature of microservices seems to make choreographies more appropriate for defining microservice compositions (dragoni et al., 2017; butzin et al., 2016). a similar experiment focusing on orchestration will be considered as further work.

8 related work
rajasekar et al. (2012) presented the integrated rule-oriented data system (irods) to orchestrate microservices within data-intensive distributed systems. a microservice choreography is defined as a set of textual event-condition-action (eca) rules. each rule defines the data management actions that a microservice must execute. these actions generate events within the system that trigger the rules associated with other microservices. the authors also proposed the use of recovery microservices to maintain transactional properties. the main drawback of this work is that the logic of the process is distributed along the different rules that each microservice implements, making maintenance and evolution difficult to perform.
yahia et al. (2016) introduce medley, an event-driven lightweight platform for microservice orchestration. they propose a textual domain-specific language (dsl) for describing orchestrations using high-level constructs and domain-specific semantics. these descriptions are compiled into low-level code that runs on top of an event-driven, process-based, lightweight platform. the main drawback of this approach is that developers need to explicitly manage service orchestration issues at the modeling level. our solution allows developers to focus only on modeling business requirements. also, a choreography solution is proposed to obtain a higher level of independence among microservices.
kouchaksaraei et al. (2018) present pishahang, a framework for jointly managing and orchestrating cloud-based microservices. this framework introduces tools to easily integrate sonata (dräxler et al., 2017), an orchestration framework, with terraform (2019), a multi-cloud tool. however, tools for modeling business processes and supporting them within a decoupled microservice infrastructure are not provided.
indrasiri & siriwardena (2018) introduce ballerina, an emerging technology built as a programming language that aims to make it easy to write programs that integrate and orchestrate microservices. however, although they propose an environment to design microservice integrations with sequence diagrams, most of the communication issues among microservices need to be managed at the programming level. our solution automatically generates the implementation artifacts required to support microservice communication from business process models.
petrasch (2017) presents an approach based on uml to design microservices and the communication among them. however, complex business processes involving multiple microservices cannot be modeled.
guidi et al. (2017) argue for specific programming languages aimed at microservice composition. the authors claim that these languages should include concepts such as communication, interfaces, and dependencies. they instantiate their proposal in terms of the jolie (2019) programming language. similar work is presented by safina et al. (2016), which extends the jolie programming language to support data-driven workflows.
this means that the flow of microservice compositions is controlled at the time of message passing according to the nature of the message structure and type. our work differs from these two approaches in that we provide a solution based on business process modeling instead of programming languages to create ad-hoc solutions.
finally, it is worth noting that this paper presents an extended version of the work proposed in (valderas et al., 2019). in the current work, we introduce the evolution of microservice compositions from both a top-down perspective (i.e., from the eucaliptool composer to the microservices) and a bottom-up strategy (i.e., from the microservices to the eucaliptool composer). we have improved the dsml by defining how inputs and outputs of microservices can be linked. we also present the development infrastructure implemented to support developers in the composition of microservices using our approach. in addition, our approach has been evaluated through a complete experiment that compares it with ad-hoc solutions to compose microservices.

9 conclusion and further work
in this work, we have presented a hybrid solution that combines the choreography and orchestration approaches to deal with microservice compositions through the use of eucaliptool. the main reason to follow such a hybrid solution is that we want to take advantage of the strengths of each approach: we want to maintain the flexibility and decoupled nature offered by choreographies but also keep the global vision and management of the composition offered by an orchestration approach. for this purpose, the eucaliptool platform has been presented and integrated into a typical microservice architecture to provide: 1) tool support for the specification of microservice compositions, 2) mechanisms to automate the distributed deployment of microservice compositions and their execution through an event-based choreography, and 3) support for the evolution of compositions following a top-down strategy (i.e., from the global vision of the composition) or a bottom-up strategy (i.e., from a piece of a specific business microservice).
in addition to the evaluation based on the motivating example, it would be very interesting to also evaluate the performance of the designed architecture in a real scenario. furthermore, since our objective is to improve how compositions are made, as future work we plan to enrich eucaliptool with goal-oriented capabilities. this way, instead of specifying compositions, users would just need to state their goals. then, based on them, eucaliptool would propose an initial composition intended to satisfy the stated goals.

acknowledgments
this work has been developed with the financial support of the spanish state research agency under the project tin2017-84094-r and co-financed with erdf.

references
alpers, s., becker, c., oberweis, a., schuster, t. (2015). microservice based tool support for business process modeling. edoc workshops: 71-78.
basili, v. r., rombach, h. d. (1988). the tame project: towards improvement-oriented software environments. ieee trans. softw. eng. 14(6), 758-773.
bucchiarone, a., dragoni, n., dustdar, s., larsen, s. t., and mazzara, m. (2018). from monolithic to microservices: an experience report from the banking domain. ieee software, vol. 35, no. 3, pp. 50-55.
butzin, b., golatowski, f., & timmermann, d. (2016). microservices approach for the internet of things. in 2016 ieee 21st international conference on emerging technologies and factory automation (etfa) (pp. 1-6). ieee.
casati, f. (1998). models, semantics, and formal methods for the design of workflows and their exceptions. phd thesis, milano.
dadam, p., reichert, m. (2009). the adept project: a decade of research and development for robust and flexible process support. computer science - research and development 23: 81-97.
dragoni, n., giallorenzo, s., lluch-lafuente, a., mazzara, m., montesi, f., mustafin, r., safina, l. (2017). microservices: yesterday, today, and tomorrow. present and ulterior software engineering: 195-216.
dräxler, s., karl, h., peuster, m., kouchaksaraei, h. r., bredel, m., lessmann, j., ... & xilouris, g. (2017). sonata: service programming and orchestration for virtualized software networks. in 2017 ieee international conference on communications workshops (icc workshops) (pp. 973-978). ieee.
guidi, c., lanese, i., mazzara, m., & montesi, f. (2017). microservices: a language-based approach. in present and ulterior software engineering (pp. 217-225). springer, cham.
hamidehkhan, p. (2019). analysis and evaluation of composition languages and orchestration engines for microservices (master's thesis).
indrasiri, k., & siriwardena, p. (2018). integrating microservices. in microservices for the enterprise (pp. 167-217). apress, berkeley, ca.
jolie. (2019). a service oriented language. url: https://www.jolielang.org/ last accessed: november 2019.
kitchenham, b., pickard, l., and pfleeger, s. l. (1995). case studies for method and tool evaluation. ieee software, vol. 12, no. 4, pp. 52-62.
newman, s. (2015). building microservices. usa: o'reilly media inc., february 2015.
petrasch, r. (2017). model-based engineering for microservice architectures using enterprise integration patterns for inter-service communication. in 2017 14th international joint conference on computer science and software engineering (jcsse) (pp. 14). ieee.
rajasekar, a., wan, m., moore, r., & schroeder, w. (2012). microservices: a service-oriented paradigm for data-intensive distributed computing. in challenges and solutions for large-scale information management (pp. 74-93). igi global.
safina, l., mazzara, m., montesi, f., & rivera, v. (2016). data-driven workflows for microservices: genericity in jolie. in 2016 ieee 30th international conference on advanced information networking and applications (aina) (pp. 430-437). ieee.
shadija, d., rezai, m., hill, r. (2017). towards an understanding of microservices. icac 2017: 1-6.
singhal, n., sakthivel, u., & raj, p. (2019). selection mechanism of micro-services orchestration vs. choreography. international journal of web & semantic technology (ijwest), 10(1), 25.
terraform. (2019). url: https://www.terraform.io/ last accessed: november 2019.
valderas, p., torres, t., mansanet, m., pelechano, v. (2017). a mobile-based solution for supporting end-users in the composition of services. multimedia tools appl. 76(15): 16315-16345.
valderas, p., torres, v., and pelechano, v. (2019). hybrid composition of microservices with eucaliptool. proceedings of the xxii iberoamerican conference on software engineering, cibse 2019, la habana, cuba, april 22-26, 2019: 2-15.
weber, b., reichert, m., rinderle, s. (2008). change patterns and change support features - enhancing flexibility in process-aware information systems. data and knowledge engineering 66: 438-466.
wohlin, c., runeson, p., höst, m., ohlsson, m. c., regnell, b., and wesslén, a. (2012). experimentation in software engineering. springer.
yahia, e. b. h., réveillère, l., bromberg, y. d., chevalier, r., & cadot, a. (2016). medley: an event-driven lightweight platform for service composition. in international conference on web engineering (pp. 3-20). springer, cham.

journal of software engineering research and development, 2021, 9:11, doi: 10.5753/jserd.2021.1892
this work is licensed under a creative commons attribution 4.0 international license.

a requirements engineering technology for the iot software systems
danyllo valente da silva [federal university of rio de janeiro | dvsilva@cos.ufrj.br]
bruno pedraça de souza [federal university of rio de janeiro | bpsouza@cos.ufrj.br]
taisa guidini gonçalves [federal university of rio de janeiro | taisa@cos.ufrj.br]
guilherme horta travassos [federal university of rio de janeiro | ght@cos.ufrj.br]

abstract
contemporary software systems (css) – such as the internet of things (iot) based software systems – incorporate new concerns and characteristics inherent to the network, software, hardware, context awareness, interoperability, and others, compared to conventional software systems. in this sense, requirements engineering (re) plays a fundamental role in ensuring these software systems' correct development regarding the business and end-user needs. several software technologies supporting re are available in the literature. however, many do not cover all css specificities, notably those based on iot. this paper presents retiot (requirements engineering technology for the internet of things-based software systems). it aims to provide methodological, technical, and tooling support to produce iot software system requirements documents. in addition, it comprises an iot scenario description technique, a checklist to verify iot scenarios, construction processes, and templates for iot software systems. a feasibility study was carried out in iot system projects to observe its templates and identify improvement opportunities. the results indicate the feasibility of the retiot templates when used to capture iot characteristics. however, further experimental studies represent research opportunities to strengthen confidence in its elements (construction process, techniques, and templates) and to capture end-users' perception.
keywords: software engineering, requirements engineering, internet of things, iot software systems, software systems specification, software system requirements document, software technology

1 introduction
contemporary software systems, such as those inherent to the internet of things (iot) paradigm, are complex compared to conventional software systems. this complexity comes from the inclusion of new concerns and characteristics related to network, software, hardware, context awareness, interface, interoperability, and others (motta et al., 2019a) (nguyen-duc et al., 2019). iot-based software systems seek to promote the interlacement of technologies and devices that, through a network, can capture and exchange data, make decisions, and act. with these actions, they unite the real and virtual worlds through objects and tags. however, building iot software systems is not a trivial activity due to their specific technological characteristics. it requires adapted and/or innovative software technologies to create and guarantee the quality of the built product (motta et al., 2019a). the quality of contemporary software systems' development depends on software technologies that respond to these systems' new concerns and characteristics.
as with any other product built on engineering principles, a key activity in developing iot software systems is constructing the requirements document. defects present in the requirements document can cause increased time, cost, and effort for the project; dissatisfied customers and end-users; low reliability of the software system; a high number of failures; among others (vegendla et al. 2018) (arif et al. 2009). requirements engineering (re) is responsible for the life cycle of the requirements document and ensures its proper construction (vegendla et al. 2018) (pandey et al., 2010). the re phases and activities may differ according to the application domain, people involved, processes, and organizational culture. however, we can observe some recurring re phases and activities, such as conception/design, elicitation, negotiation, analysis, specification, verification, validation, and management. the technical literature presents several software technologies to support re for software systems. however, not all of them cover the different re phases and, mainly, the specificities of iot software systems. in this work, the term "software technology" refers to the methodological, technical, and tooling support offered by the works to support the construction of requirements documents for iot software systems.
considering the need for appropriate software technologies to develop iot software systems and understanding the importance of the requirements document for the stability, adequacy, and quality of a project, this work proposes retiot (requirements engineering technology for the internet of things software systems). retiot consists of a requirements specification technique based on iot scenario descriptions, scenariot (silva 2019); an iot scenario inspection technique, scenariotcheck (souza 2020); a construction process; and templates to support the process activities and build the requirements document. the scenariot and scenariotcheck techniques were previously evaluated through experimental studies, which indicated their feasibility (souza et al. 2019a) and usefulness (souza et al. 2019b). moreover, they have been used in iot software system projects developed by the experimental software engineering (ese) group in the context of delfos – the observatory of engineering contemporary software systems – coppe/ufrj. based on the experiences with these projects, the construction process and templates of retiot evolved.
this article extends a previous publication on retiot (silva et al. 2020b). the first version (silva et al. 2019) encompasses many re activities. it focuses on the definition of the project scope, the iot system, and the iot system requirements. the second version (silva et al. 2020b) focuses on eight re phases: conception, iot elicitation, iot analysis, iot specification, iot verification, negotiation, validation, and management. the templates of this version are evaluated through a feasibility study (section 4). the third version (silva et al. 2020a) involves the different re phases through an engineering cycle divided into eight phases: iot ideation and conception, iot elicitation, iot analysis, iot specification, iot verification, negotiation, iot evaluation, and management. its templates were evaluated through a proof of concept.
the fourth and current version of the technology includes improvements in the construction process and templates. optional activities and tasks were analyzed and incorporated into the construction process. the focus and perspective of the process were changed to the iot product and project, including product ideation and evaluation concepts. also, the engineering cycle was compacted to simplify the construction process, which now includes four phases instead of the eight of the third version. the templates were evolved in the current version to address the gaps and improvements identified during the proof of concept (silva et al. 2020a).
for the sake of completeness and applicability, this paper presents the current (fourth) version of retiot, including a feasibility study comparing three retiot templates with regular ones used to build requirements documents for conventional software systems. the results indicate that the retiot templates allow capturing the information needed for iot software systems and that they are mature enough to be evaluated in the construction of such requirements documents. furthermore, it is also possible to observe that the technology covers the main re phases and activities concerning iot-based projects.
beyond this introduction, this article presents six other sections. section 2 describes the technological basis of retiot. next, section 3 introduces and details retiot. section 4 presents the feasibility study. section 5 presents some related works found in the literature. section 6 discusses some research opportunities. finally, section 7 presents future work and concludes the article.

2 the technological basis of the retiot
this section presents the technological basis used to build retiot to support re in iot software systems. such a requirements technology is inserted in the context of a systems engineering approach, which concerns the major development stages of iot software systems (motta et al. 2020). its technological basis is composed of two empirically evaluated techniques: scenariot and scenariotcheck.

2.1 scenariot
conventional software scenarios can be used in any software system and development stage. they can cover different purposes, such as eliciting requirements, specifying requirements, validating requirements, and testing (glinz 2000) (behrens 2002) (alexander and maiden 2004). a scenario is a sequence of events describing the system behavior and its environment (burg and van de riet 1996) or an ordered set of interactions between partners, usually systems and external actors (glinz 2000). when applied to requirements engineering, it represents requirements through stories describing the system from the users' perspective (glinz 2000) (alexander and maiden 2004). scenarios offer many advantages: i) they are based on the users' point of view; ii) they make partial specifications possible; iii) they are easy to understand; iv) they enable short feedback loops; and v) they provide a basis for testing the system (glinz 2000). thus, scenarios constitute a good basis for communication with clients and laypeople (non-technical) because they can be easily understood and do not require prior technical knowledge. therefore, everyone involved, at different levels and functions, can express opinions and identify problems (glinz 2000) (behrens 2002) (alexander and maiden 2004).
the scenariot (silva 2019) is a specification technique that adapts conventional scenarios to support iot software systems' specifications.
it considers the characteristics (adaptability, connectivity, privacy, intelligence, interoperability, mobility, among others) and behaviors (identification, sensing, and actuation) specific to these software systems (motta et al., 2019b). the combination of characteristics and behaviors led to the creation of nine iot interaction arrangements (iias). iot interaction arrangements represent frequent interaction flows between things and other non-iot elements, such as conventional software systems and end-users. each iia has a catalog containing all the relevant information captured and used in the scenario's description. the cardinality between arrangements and scenarios is a many-to-many relationship (m:n): many arrangements (isolated or combined) relate to one or more iot scenarios, and an iot scenario can be linked to one or more arrangements.
the iias, together with their catalogs, guide software engineers in capturing essential information about the system: i) identification of the "things" and system components; ii) the types of data that will be collected and displayed; iii) the actions that will be performed in the environment; iv) aspects related to decision making in a particular system context; v) the actors (end-users, software systems, things, among others) who will access the data; among others. figure 1 shows the "iia-1: display of iot data" arrangement and its catalog.
figure 1. "iia-1: display of iot data" arrangement (silva 2019).

2.2 scenariotcheck
the scenariotcheck is a checklist-based software inspection technique specialized in verifying iot software system scenarios (souza 2020). this technique aims to assist inspectors in detecting defects in iot scenario descriptions, guaranteeing their quality. it was created to work together with the scenariot technique, since the latter produces the input to scenariotcheck. the scenariotcheck checklist consists of two parts. the first part (general questions) aims to identify defects related to project information and the systemic solution, such as i) problem domain; ii) interaction and identification among actors, system, hardware, and devices; iii) alternative and exception flows; among others. table 1 shows the first part of the questionnaire.

table 1. questions on the scenariot specification technique (general questions).
01. has the overall application domain been established? (health, leisure, traffic)
02. is the specific purpose of the system correctly described? (data visualization, decision making, and/or actuation only)
03. is the type of data collected specified? (temperature, humidity, pollution, and so on)
04. is it possible to identify who or what collects the data? (sensors, qr code readers, and so on)
05. is it possible to identify who or what manages the data collected? (administrator, decision-maker, users, and so on)
06. is it possible to identify who or what accesses the data collected? (things, software systems, or users)
07. is the user interface device that displays the data described? (dashboard, smartphone, tablet, and so on)
08. is it possible to identify who is viewing the data? (things, software systems, users, and so on)
09. is it possible to identify the source from which the data is provided? (chairs, tables, automobiles, houses, buildings, and so on)
10. are the roles involved in the system described? (things, software systems, users, and so on)
11. is there a description of each actor in the specified scenarios?
12. is it possible to identify the source of data provision?
13. has each action within the scenario been described clearly, containing no extraneous information?
14. is there a sequence of ambiguous actions in the scenarios?
15. are the actors described in the scenarios consistent with the actors described in the arrangements? (things, software systems, users)
16. are the scenarios related to the arrangements consistently?
17. do the scenarios seek to be accurate by presenting a title and flows? (presenting the purpose and actions of the system directly and explicitly)
18. are adverbs avoided so as not to generate more than one possible interpretation of the scenarios? (probably, possibly, supposedly)
19. are the condition terms (such as "if", "go to", "while") used correctly?
20. when words like "things" and "data" are used in the scenario, do they have the same meaning in other parts of the same scenario?
21. is it possible to identify "things" described with one function in the arrangements representing another function in the described scenarios?
22. are the alternative and/or exception flows described?
23. does the scenario specification identify the matching arrangement id? (iia-1, iia-2, ..., iia-9)

the second part (specific questions) considers the non-functional properties (iot facets) of iot software systems discussed in (motta et al. 2019a). table 2 presents the questions of the second part of the scenariotcheck checklist.

table 2. questions on iot facets (specific questions).
24. is it possible to identify the specific context in which the system is embedded? (smart room, smart greenhouse, autonomous vehicle, healthcare, and so on)
25. are the limitations of the environment described? (e.g., lack of connectivity structure, lack of hardware structure, inadequate infrastructure)
26. are the technologies associated with system objects described? (smartphones, smartwatches, wearables)
27. are the events that the system handles identified? (e.g., turning an object on/off, sending data)
28. what kind of communication technology does the system use in the scenarios? (bluetooth, intranet, internet, ...)
29. does the proposed communication technology meet the geographic/physical specifications of the system? (large, medium, or small scale)
30. is it possible to identify how the system will react to changes in the environment?
31. are the interactions between the system and the environment represented in the scenarios?
32. is it possible to identify the interactions between actors?

after specifying the iot scenarios, inspectors can apply the scenariotcheck technique to verify the scenario descriptions. the identified non-conformities are described in the inspection report. finally, after the discrimination meeting (defect identification), the iot scenario specification document is corrected. the application process of the two techniques is shown in figure 2.
figure 2. application process of the scenariot and scenariotcheck techniques (souza et al. 2019a).
the scenariotcheck complements scenariot by providing a template for iot scenario specifications. this template resembles a use-case description document with some additional fields: i) identification of the iot software system elements; ii) problem domain description; iii) role description of each actor involved in the scenario; and iv) descriptions of the interactions between the actors (end-users, things, software systems, others) and the iot software system.
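to make the template fields concrete, a minimal sketch of how such a scenario description could be represented as data; the type and field names simply mirror the template fields listed above and are hypothetical, not retiot's internal representation:

import java.util.List;

// illustrative data model for an iot scenario description: system elements,
// problem domain, actor roles, flows, and the linked iot interaction arrangements
public record IotScenario(
        String id,
        String title,
        String problemDomain,
        List<String> systemElements,      // "things" and other iot components
        List<ActorRole> actors,           // role description of each actor
        List<String> linkedArrangements,  // e.g., "iia-1"
        List<String> mainFlow,            // ordered interactions
        List<String> alternativeFlows) {

    // kind: end-user, thing, or software system
    public record ActorRole(String name, String kind, String role) {}
}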
3 the retiot
the retiot (requirements engineering technology for the internet of things based software systems) comprises the techniques described in section 2, a construction process, and templates to build the requirements document following re principles. the requirements document's construction process is based on the main re phases (pressman and maxim 2014) (sommerville 2015): conception/design, elicitation, analysis, specification, negotiation, verification, validation, and management. however, retiot adapts and includes new activities to meet the specificities of iot software systems.

3.1 construction process
the current version of the technology encompasses product ideation and evaluation concepts, such as low- and high-level prototypes and the creation of mvps (minimum viable products) for the desired product. in addition, the construction process incorporates aspects and characteristics inherent to iot software systems found in the literature review. it also involves the different re phases (pressman 2014) (sommerville 2015) through an engineering cycle divided into four phases: iot ideation, conception, and elicitation; iot analysis and specification; iot negotiation and evaluation; and management. figure 3 presents an overview of the construction process engineering cycle with two dimensions: main and transversal (performed in parallel). the main dimension corresponds to the activities and tasks required to build the iot requirements document.
figure 3. construction process overview (phases).
the transversal dimension (see figure 4) offers three management activities and tasks focused on artifact and process management. the activities and tasks do not have a specific, predetermined time to be performed; everything depends on the need identified by the user through the main process flow. in the management phase, the technology proposes version control of the artifacts and traceability between requirements, iot scenarios, iot interaction arrangements, and iot use cases. besides, retiot offers change management so that modifications in requirements can be reflected in the generated artifacts. the iot requirements document construction is performed iteratively and incrementally: the engineering cycle is executed three times, and each execution is called a stage.
figure 4. management phase overview.
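the traceability and change management described above can be pictured as a simple link index; a minimal sketch, with hypothetical artifact identifiers, of how a requirement change could be propagated to the artifacts that trace to it:

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class TraceabilityIndex {
    // requirement id -> artifacts (scenarios, arrangements, use cases) that trace to it
    private final Map<String, Set<String>> links = new HashMap<>();

    public void link(String requirementId, String artifactId) {
        links.computeIfAbsent(requirementId, k -> new HashSet<>()).add(artifactId);
    }

    // when a requirement changes, these artifacts must be revisited
    public Set<String> impactOf(String requirementId) {
        return links.getOrDefault(requirementId, Set.of());
    }

    public static void main(String[] args) {
        TraceabilityIndex idx = new TraceabilityIndex();
        idx.link("FR-01", "scenario-03");
        idx.link("FR-01", "iia-1");
        idx.link("FR-01", "uc-07");
        System.out.println("changing FR-01 impacts: " + idx.impactOf("FR-01"));
    }
}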
3.2 construction process stages
each stage performs the common phases (see figure 3 and figure 4) of the engineering cycle, generating intermediate artifacts. the result of stage 3 is the iot requirements document. however, each stage has specific objectives, activities, and tasks. the construction process can be executed for one idea or a set of requirements (see figure 5). in the first case, the result of a process execution is an intermediate version of the iot requirements document. also, the construction process can be adapted for use in different contexts; for example, this proposal can be applied with any development methodology. in projects that use an agile methodology, iot use cases may not be applicable and may demand additional cost and effort. in these contexts, iot use cases are not mandatory, and the activities and tasks that support building them can be skipped. skipping them, however, can have both positive impacts (decreased time and effort) and negative ones (absence of important information); therefore, the user of the process needs to evaluate these impacts.
besides, the current retiot version integrates ten templates: eight of them are defined/adapted from templates currently used in projects of the ese group/pesc/coppe and other templates used by the software engineering group/pesc/coppe, and two of them were defined by the scenariotcheck technique (souza 2020). figure 6 shows an overview (idef0 diagram) of the three stages with their inputs, outputs, templates (presented in the next paragraphs), and controls (management procedures and feasibility strategy). the management phase performs the management procedures. the feasibility strategy represents the milestone of each stage.

3.2.1 stage 1
the first stage is to understand the problem. it aims to understand the problem or opportunity, analyze the stakeholders and their needs, elicit the business needs, and carry out the project feasibility analysis. it is composed of 12 activities and 27 tasks distributed throughout the engineering cycle. figure 7 presents an overview of the activities performed in the first stage. this stage offers three templates: iot canvas, iot project feasibility analysis, and requirements checklist. its milestone is the feasibility analysis, performed by four activities (analyze market demand, analyze economic feasibility, analyze impact and risks, and analyze technical feasibility).
figure 5. construction process overview (stages).
figure 6. idef0 diagram of the three stages.
figure 7. first stage overview of the construction process.

3.2.2 stage 2
the second stage is to describe the solution. it aims to transform business needs, stakeholders' needs, and general requirements into detailed, classified, and organized requirements. iot scenarios, arrangements, and components are used for the specification and verified during this stage. subsequently, the requirements are negotiated and evaluated, attesting that a common understanding of the system has been reached. this stage is composed of 12 activities and 39 tasks distributed throughout the engineering cycle. the scenariot technique (silva 2019) supports the requirements identification and the system behavior description. this technique is executed during the following activities: define iot scenarios and specify iot scenarios. figure 8 presents an overview of the activities performed in the second stage. this stage defines three templates: iot project detail, iot solution proposal, and change analysis report. the scenariotcheck technique (souza 2020) contributes two templates (verification checklist and inspection record) used in this stage during the verify iot scenarios activity. its milestone is the low-fidelity prototype, produced by the activity "define low-fidelity prototype." this stage presents optional activities since the construction process can be used with any development methodology.
figure 8. second stage overview of the construction process.

3.2.3 stage 3
the third stage is to detail the solution. it transforms iot requirements and scenarios into iot use cases. the iot use case diagram, the list of iot use cases, and their descriptions are generated during this stage. subsequently, the generated artifacts are checked and evaluated, attesting that a common understanding of the system has been achieved. this stage is composed of ten activities and 24 tasks distributed throughout the engineering cycle.
figure 9 presents an overview of the activities performed in the third stage. two templates are defined for it: iot use case description and iot diagram and use cases checklist. also, the change analysis report can be used in this stage. this stage's milestone is the high-level prototype, produced by the activity "define an evolved prototype." this stage presents optional activities since the construction process can be used with any development methodology.
figure 9. third stage overview of the construction process.

4 evaluating the feasibility of the retiot templates
the retiot aims to support software engineers during the re activities. the main techniques presented in section 2 and used to compose the software technology have already been empirically evaluated and used in iot software system projects (souza et al. 2019a) (souza et al. 2019b). however, the inclusion of new facilities to support re with retiot requires an initial observation before using them in projects and conducting further experimental studies. thus, this section presents a feasibility study of the retiot templates.

4.1 templates
in this feasibility study, we considered the structure of two artifact templates – requirements list (rl) and iot use-cases description (iotucd1) – defined for conventional software systems but used in iot software system projects. we compared their structure with the structure of the retiot templates – project scope (ps), solution proposal (sp), and iot use-cases description (iotucd2). the full versions of all templates are available at http://bit.ly/393sghx.

4.1.1 the retiot templates
this section presents three retiot templates (silva et al. 2020b) regarding the activities of elicitation (eli) – the "project scope (ps)" template (see figure 10); and analysis (ana) and specification (spe) – the "solution proposal (sp)" template (see figure 11) and the "iot use-cases description (iotucd)" template (see figure 12). the conception/design (con), negotiation (neg), and validation (val) activities are minimally covered by the "project scope" template. the "solution proposal" and "iot use-cases description" templates support the management (man) activities, maintaining traceability between requirements and analysis models. besides, the techniques described in section 2 support the elicitation (eli), specification (spe), and verification (ver) activities. the following items present the global description of the templates (project scope, solution proposal, and iot use-cases description) as defined in retiot.
• project scope template. this template supports the documentation of the project's initial activities, the problem to be solved, those involved in the project, the user profiles, the user needs, and the business needs. in addition, it includes the identification and description of system requirements (functional, non-functional, restrictions, others) and business rules. also, the validation of the requirements document is made through an explicit agreement (signature or e-mail copy). finally, the template provides two fields (status and priority) to support the negotiation of functional and non-functional requirements. figure 10 presents an extract of this template.
figure 10. extract from the "project scope" template.
the proposed template is used in the activities of conception/design (con), elicitation (eli), negotiation (neg), and validation (val).
• solution proposal template. this template supports the solution description. it identifies and describes, using the scenariot technique, the iot scenarios, the iot components, and the iot interaction arrangements (iias) of the system. also, it provides the details of the iias chosen for each iot scenario via the corresponding catalogs. thus, the traceability between requirements, iot scenarios, iot interaction arrangements, and their respective catalogs is maintained. figure 11 presents an extract of this template. it should be used in the elicitation (eli), analysis (ana), specification (spe), and management (man) activities to identify, describe, and refine the system's behavior while maintaining requirements traceability.
figure 11. extract from the "solution proposal" template.
• iot use-cases description template. this template includes the description of the iot use cases. use cases are identified and described, providing a view of the system's behavior. in addition, the use case diagram is inserted in this template. traceability between requirements, iot scenarios, iot interaction arrangements, and iot use cases is maintained. figure 12 presents an extract of this template. it should be used in the analysis (ana), specification (spe), verification (ver), and management (man) activities. the scenariotcheck technique is applied during the verification activities to identify inconsistencies in the description of the iot scenarios and their components and in the choice of iias.
figure 12. extract from the "iot use-cases description" template.

4.1.2 projects and teams
the rl and iotucd1 artifacts were built using conventional templates in three iot-based software projects:
• project a supports the collection of environmental markers (e.g., temperature, humidity, particulates, co2 level, and toxic gases).
• project b monitors a high-performance computing environment (data center) to collect different information, such as temperature, humidity, energy consumption, and energy supply quality.
• project c collects temperature, humidity, wind speed, and wind direction in different city regions.
all three projects represent real demands. a stakeholder (totally external to the course and the research group) worked with the developers, including on the requirements acceptance. undergraduate students produced the rl and iotucd1 artifacts during a software engineering course at ufrj. the course had the participation of 21 fourth-year students of information and computer engineering. the subjects were organized into three development teams with seven participants each. the teams were balanced, with participants having equivalent levels of knowledge and skills regarding software and hardware. training on different software engineering topics and mentoring throughout the project were available. there was no intervention by the mentors in the artifacts' content. all ethical issues and consent forms were taken care of and made available. some of the course's topics included requirements engineering, iot scenarios, a verification technique for iot systems, and uml (unified modeling language) diagrams, among others. in addition, the scenariot and scenariotcheck techniques were presented to the participants, although they were not required to use them. the teams were free to organize their projects.
the requirements document represented one of the design milestones. a minimum viable product (mvp) represents one of the concrete results delivered at the end of the course.

4.2 execution

the researchers (the paper's authors) analyzed the requirements documents (rl and iotucd1 artifacts) after the three teams constructed them. the information found in the generated artifacts was compared with the information requested in the structure of the project scope (ps), solution proposal (sp), and iot use-cases description (iotucd2) templates. a working checklist, presented in the next subsection, was used to compare the templates. three researchers carried out the comparison – two master's students and one ph.d. who work in the software engineering and iot domains. after that, a fourth researcher (ph.d. and expert in the software engineering and iot domains) reviewed the analysis of the results.

4.3 results and discussion

table 3 presents the checklist used to compare the template structures (conventional and retiot) and the analysis result. it indicates that:
• the rl template does not address the project/system objective and problem domain. however, knowing the problem domain is essential for building an iot software system (motta et al. 2019) (nguyen-duc et al., 2019).
• the rl template presents a partial description of the stakeholders. however, it does not include descriptions of the profiles of the different users, which are important for the system development and the user interface design.
• the rl template does not address the description of business/stakeholder needs. the identification of business/stakeholder needs represents the initial stage of the project. in this step, we seek to understand the client's real need, which will be transformed into system requirements in the future.
• retiot allows identifying the requirements that will guide the iot solution from the beginning (project scope template), unlike the rl template, which does not identify the iot requirements.

table 3. mapping checklist of the template structure. columns: conventional templates (rl, iotucd1) and retiot templates (ps, sp, iotucd2). legend: p = partially collected; t = totally collected; n = does not collect information; gray (blank) = not applicable for the template.

project/system information | rl | iotucd1 | ps | sp | iotucd2
project name/project responsible: t t t t t
version control: t t t t t
explicit agreement: t t
project/system objective: n t
problem domain: n t
project scope: t t
glossary: t t
stakeholders description: p t
business and stakeholders' needs description: n t
functional requirements: p t
non-functional requirements: t t
requirements negotiation (prioritization): t t
business rules: n t t t
project analyses: n p
iot scenarios: p t
iot components description: n t t
iot interaction arrangements: p t t
iot use-cases diagram: n t
iot use-cases description: n t
requirements traceability: p p t
references (other project documents): t n

• the iotucd1 template treats iot scenarios and iot arrangements partially but does not address the iot components' description. in contrast, the retiot treats this information entirely in the solution proposal (sp) and iot use-cases description (iotucd2) templates.
• conventional templates do not treat the iot use-cases diagram and iot use-cases description, but retiot fully treats them.
• the requirements traceability is partially treated by the iotucd1 template, partially treated by retiot in the sp template, and fully treated in the iotucd2 template.
• the rl template presents a field for references (other documents), which retiot does not address.

the different convergence and divergence points between the conventional systems templates (rl and iotucd1) and the retiot templates (ps, sp, iotucd2) offer indications that the retiot can be more robust because it deals with iot information from the beginning of the project. according to the results, retiot presents good potential for supporting iot software systems' specifications (silva et al. 2020a) (silva et al. 2020b) because its templates capture specific iot information, differently from conventional ones.

4.4 templates' evolution

this study allowed us to evolve the existing templates by reorganizing sections and inserting new sections and fields. first, we identified the lack of information and the redundancies in the templates. then some fields were added, moved, or removed. besides that, we started to think about mvps and prototypes applied to iot projects, which caused changes in some templates, and we added the "iot canvas" template to the technology.

in the second version, we started to think about requirements negotiation, reuse, and traceability. consequently, we identified the need to insert new fields to better attend to these points. also, the templates of the second version did not cover project feasibility analysis and requirements verification, because the technology did not cover these points. to fill these gaps, we propose the "project feasibility analysis," "requirements checklist," and "iot diagram and use cases checklist" templates. at last, we identified the need to register and track requirements changes. the second version presented activities and tasks to support this, but no template was defined. in that sense, one more template, named "change analysis report," was defined to support these activities.

in addition, the project scope template was renamed to iot project detail, and the solution proposal template to iot solution proposal. in the iot project detail template, we included a new field, "project description." the "glossary" and "stakeholders" sections have been changed to include fields that support the capture of specific information. the "potential stakeholders" section has been renamed to "stakeholders" and includes two new fields to capture each stakeholder's interest and influence in the system. the "project scope" section has been removed from the template. the "canvas iot" section was added, allowing the insertion of an image or photo of the iot canvas built in stage 1. new fields ("reused requirement?" and "related requirement id") have been added to the "system requirements" section to enable requirements traceability (functional and non-functional). in addition, for functional requirements, two fields ("cost" and "effort") were added to make negotiation feasible. in its previous version, requirements were classified into iot requirements and non-iot requirements. in this new version, this classification has been removed, and the "iot characteristic" field has been included. therefore, when describing a non-iot requirement, this field should not be filled. for an iot requirement, instead, the iot characteristic must be described as identification, sensing, performance, connectivity, or processing. in addition, the "dependency between requirements" field has also been added to the "functional requirements" section.
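to make the evolved "functional requirements" section concrete, the entry below assembles the fields just described into a single requirement; the identifier and values are illustrative, not taken from the official template:

fr-03: notify the user when the co2 level exceeds a configured threshold
  status: approved | priority: high | cost: medium | effort: medium
  reused requirement?: no | related requirement id: fr-01
  iot characteristic: sensing, connectivity
  dependency between requirements: fr-02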
the non-functional requirement "scalability" was added to the new template. the requirements "portability and compatibility" and "security and privacy" have been adapted. the section "annex non-functional requirements" has been added to support the identification of non-functional requirements. the section "scope not covered by the project" has been added, and the section "project analysis" has been removed. in the "business rules" section, the "related needs id" field has been added to allow business rules' traceability.

in the iot solution proposal template, the fields "actors," "actions," and "interaction arrangements" were added to the "iot scenarios" section. the section "iot system components" was removed because it was redundant with the arrangement catalogs' information. the "related functional requirements," "precedencies," and "dependencies" fields have been added to the "iot scenarios description" section to enable the traceability of iot scenarios. the field "collected data and actions performed" was divided into two fields: "collected data" and "actions performed." the "interaction sequence" field was changed because it resembles a use-case structure (main, alternative, and exception flows). finally, the "environment" and "connectivity" fields have been removed.

in the iot use-cases description template, the "business rules" field was moved from the "interaction sequence" section to a separate section. in addition, the section "customer or customer representative agreement" has been added to this template.

five new templates were also defined to support the construction process activities that had not yet been contemplated. the new templates (see section 3) correspond to iot canvas, iot project feasibility analysis, requirements checklist, change analysis report, and iot diagram and use cases checklist. table 4 shows each change and the rationale for it. however, to ensure the technology's validity, further experimental evaluation is necessary to verify whether the retiot construction process with the templates is useful, complete, correct, and intuitive.

4.5 threats to validity

internal validity concerns the study itself, even though experimental studies have evaluated part of the retiot technology. the results indicate that the retiot templates can capture more relevant information than conventional templates regarding project artifacts. an external validity issue concerns the participants (undergraduate students) who were invited to participate in the study. we cannot claim that the information provided is complete from the project's point of view, nor that the participants understood all the topics taught during the course. to mitigate this threat, the projects treated in the study represented real problems. besides, each team had contact with a stakeholder of each addressed problem. regarding construct validity, there was no control over the creation of the artifacts produced during the course and used in the study. however, to mitigate this threat, the projects were equivalent in size and complexity and used similar iot technologies. also, it can be highlighted that the teams received equivalent training and mentoring in re. finally, the conclusion validity concerns the study interpretation and the sample size. we had a small and inhomogeneous sample; therefore, it was impossible to apply statistical tests to carry out a deeper analysis of the results obtained. also, the study conclusion is limited to the researchers' interpretation. these items limit the generalization of the study results.
to mitigate this threat, we aim to perform future experimental studies to collect feedback on the retiot.

5 literature analysis

5.1 re phases

this section presents related works found in the technical literature, which address technologies for the different re phases mentioned above. table 5 presents a comparison of seventeen (17) technologies found in the technical literature. we can observe that the conception, negotiation, verification, validation, and management phases need more attention regarding iot concepts and characteristics. figure 13 synthesizes the information presented in table 5, showing the number of technologies per re phase. again, we can highlight that a high number permeate the elicitation (nine), analysis (ten), and specification (eight) phases. in contrast, a small number is concentrated in the conception/design (four), negotiation (one), verification (five), validation (three), and management (three) phases.

figure 13. technologies x re phases.

regarding the conception phase (con), the gsem-iot (zambonelli 2017) (laplante et al. 2018) and ignite (giray et al. 2018) technologies carry out the analysis of the stakeholders involved in the system. in addition, the feasibility analysis is partially addressed by iot methodology (giray et al., 2018). also, the ignite and core (hamdi et al. 2019) technologies provide business analysis mechanisms.

table 4. templates' evolution. columns: template name | previous element | new element | change description | rationale.

iot project detail:
• (new) "project description" – included new field – allows getting a simple and brief description of the project.
• "glossary" and "stakeholders" – included fields to support the capture of specific information – enables capturing specific information about the terms used in the project and about the stakeholders.
• "potential stakeholders" → "stakeholders" – changed the section name and included two new fields to capture interest and influence – simplifies this section and captures more data about the stakeholders.
• "project scope" – removed from the template – avoids redundancy; the described requirements can identify the project scope.
• (new) "canvas iot" – included new field – allows the insertion of an image or photo of the iot canvas built in stage 1.
• (new) "reused requirement?" and "related requirement id" – new fields added to the "system requirements" section – enable requirements' reuse and traceability.
• (new) "cost" and "effort" – included new fields – make negotiation feasible.
• "iot requirements" and "non-iot requirements" → "iot characteristic" (only for iot requirements) – the classification has been removed, and the "iot characteristic" field has been included – simplifies this section and enables capturing identification, sensing, performance, connectivity, and processing characteristics of requirements.
• (new) "dependency between requirements" – field added to the "functional requirements" section – enables requirements' traceability.
• non-functional requirements "scalability", "portability and compatibility", and "security and privacy" – the section has been improved – the field descriptions of this section have been adapted and improved to better attend to iot systems.
• (new) "annex non-functional requirements" – included new section – supports the identification of non-functional requirements.
• (new) "scope not covered by the project" – included new section – enables capturing and describing it.
• "project analysis" – the section has been moved to another template – a new template (iot project feasibility analysis) has been created.
as a result, the information presented in this section has been adapted and moved to a specialist template.
• (new) "related needs id" – a new field added to the "business rules" section – allows business rules' traceability.

iot solution proposal:
• (new) "actors," "actions," and "interaction arrangements" – new fields added to the "iot scenarios" section – enable traceability between iot scenarios' information.
• "iot system components" – this section has been removed – avoids redundancy with the arrangement catalogs' information.
• (new) "related functional requirements," "precedencies," and "dependencies" – new fields added to the "iot scenarios description" section – enable traceability between iot scenarios.
• "collected data and actions performed" → "collected data" and "actions performed" – the field was divided into two fields – simplifies this section and separates specific information.
• "interaction sequence" – removed alternative and exception flows – simplifies this section because it resembles a use-case structure (main, alternative, and exception flows).
• "environment" and "connectivity" – these sections have been removed – simplifies the template; we believe this information is not relevant at this point, so it must be collected during the projects' design phase.

iot use-cases description:
• "business rules" – the field was moved from the "interaction sequence" section to a separate section – simplifies this section.
• (new) "customer or customer representative agreement" – this section has been added – enables getting an explicit agreement about the iot diagram and use cases.

new templates:
• "iot canvas" – this template has been added – supports the projects' description and idea validation in an easy and fast way.
• "iot project feasibility analysis" – this template has been added – supports projects' decision-making about feasibility in market demand, cost, impact, risks, and technology.
• "requirements checklist" – this template has been added – enables verifying whether requirements are correct, understandable, and consistent.
• "change analysis report" – this template has been added – enables managing requirements' changes through the project life cycle.
• "iot diagram and use cases checklist" – this template has been added – enables verifying whether the iot diagram and use cases are correct, understandable, and consistent.

table 5. technologies x re phases. columns: con | eli | neg | ana | spe | ver | val | man (an "x" marks the phases addressed by each technology).
(aziz et al. 2016): x x x
(mahalank et al. 2016): x
(takeda and hatakeyama 2016): x x
(touzani and ponsard 2016): x
iot-rml (costa et al. 2017): x x x
(yamakami 2017): x
gsem-iot (zambonelli 2017): x x
(carvalho et al. 2018): x
(curumsing et al. 2019): x x x x
iot system development methods – ignite (giray et al. 2018): x x x x x x
iot system development methods – iot methodology (giray et al. 2018): x x x
(laplante et al. 2018): x x x x
iotreq (reggio 2018): x x x
core (hamdi et al. 2019): x x
scenariot (silva 2019): x x x
scenariotcheck (souza 2020): x
trustapis (ferraris and fernandez-gago 2020): x x

several technologies address the elicitation phase (eli): ignite (giray et al. 2018), iot methodology (giray et al. 2018), (laplante et al. 2018), iotreq (reggio 2018), (curumsing et al. 2019), core (hamdi et al. 2019), scenariot (silva 2019), and trustapis (ferraris and fernandez-gago 2020), which offer resources for collecting requirements.
in addition, gsem-iot (zambonelli 2017), iotreq, and iot methodology propose mechanisms to transform users' needs into requirements. for the negotiation phase (neg), ignite (giray et al., 2018) addresses the impact and risk analysis but does not provide further details on conducting this activity. in the analysis phase (ana), the (takeda and hatakeyama 2016) and (touzani and ponsard 2016) technologies, ignite (giray et al. 2018), (laplante et al. 2018), iotreq (reggio 2018), (curumsing et al. 2019), and core (hamdi et al. 2019) use uml diagrams to develop the analysis models. the scenariot technology (silva 2019) comprises scenario analysis based on iot interaction arrangements. the works of (aziz et al. 2016) and iot-rml (costa et al., 2017) address artifacts and models' reuse.

the specification phase (spe) is addressed by several technologies: (takeda and hatakeyama 2016), iot-rml (costa et al. 2017), iotreq (reggio 2018), and trustapis (ferraris and fernandez-gago 2020) use formal models for specifying requirements. the technologies proposed by (aziz et al. 2016), (mahalank et al. 2016), and ignite (giray et al. 2018) provide templates for specifying requirements. scenariot (silva 2019) proposes the scenario specification using iot interaction arrangements.

in the verification phase (ver), we found that (carvalho et al. 2018) and scenariotcheck (souza 2020) propose mechanisms to verify requirements. the technologies proposed by (yamakami 2017), iot-rml (costa et al. 2017), (carvalho et al., 2018), and (curumsing et al. 2019) offer mechanisms for checking conflicts between requirements. the validation phase (val) is addressed by ignite (giray et al., 2018), iot methodology (giray et al., 2018), and (laplante et al., 2018), which propose a prototyping technique to ensure that the product meets users' needs. for the management phase (man), (aziz et al. 2016), (curumsing et al. 2019), and trustapis (ferraris and fernandez-gago 2020) offer mechanisms to enable traceability. in addition, trustapis also provides a mechanism for requirements change management.

5.2 techniques and methods

a quasi-systematic literature review (lim et al. 2018) identified 12 relevant publications and 37 elicitation techniques normally applied in iot systems development. the most frequently used techniques are interviews and prototypes, where the latter can also be used to validate requirements. we can also highlight other techniques and methods applied during the elicitation phase: scenarios, use cases, and frameworks. this work also presents a brief contribution regarding conflict resolution among the stakeholders. the authors emphasize using interview and prototyping techniques to encourage discussions and find alternative ways to identify conflicts.

in this way, we analyzed the 17 technologies to identify which techniques/methods are used, and in which re phases, in iot systems development. figure 14 shows our findings, where we can observe 14 items; the most used are: process (thirteen), use cases (eight), and models (seven).

figure 14. technologies x techniques/methods.

table 6 shows in which re phases the techniques/methods found are applied. the elicitation (28), analysis (30), and specification (22) phases offer a greater number of techniques/methods. it is important to highlight that some technologies offer more than one technique for one or more re phases.
retiot permeates the eight re phases previously described, offering methodological and technical support through its construction process, techniques, and templates. analyzing the current version of retiot (see section 3), we can say that it proposes and integrates the following techniques/methods: prototyping, the iot canvas, iot scenarios based on the iot scenario specification technique scenariot (silva 2019), use-case diagrams and descriptions, templates, the iot scenario inspection technique scenariotcheck (souza 2020), and a construction process.

6 research opportunities

analyzing the technologies found in the technical literature, we can observe that only one technology discusses the negotiation phase. it represents a research opportunity. few technologies offer project management, validation, test case elaboration, and decision-making related to the system's design and architecture. these topics can be explored in future research. we can also observe that not all technologies cover all re activities, and they present gaps regarding the different activities necessary to build an iot system requirements document. among these gaps, we can observe the lack of: i) methodological support for the design and ideation of iot products (nguyen-duc et al. 2019); ii) stakeholder identification and description and business needs (silva et al. 2020b); iii) iot system characteristics and behaviors (motta et al. 2019a), as well as the requirements refinement; iv) high-level (new iot interaction arrangements) and low-level (iot use-case diagram) analysis models; v) project feasibility analysis (silva et al. 2020); vi) prototypes, as suggested by (nguyen-duc et al. 2019) (lim et al. 2018); and vii) explicit agreements with the client (silva et al. 2020). these technologies also do not fully meet the iot software systems' specificities and characteristics: i) the components and actors' description (curumsing et al. 2019) (aziz et al. 2016); ii) the description of the behaviors of the different levels of each object (curumsing et al. 2019) (reggio 2018); iii) the identification of the systemic characteristics (sensing, identification, performance, processing, and connectivity); and iv) the detailed specification of each feature.

7 conclusion and future works

this paper presented the retiot. it provides a construction process, techniques (iot scenario specification and verification techniques), and tools (templates) to support the requirements engineering of iot software systems. besides, this work seeks to accomplish an initial observation of this technology, focusing on analyzing and evaluating only the templates. a feasibility study was performed to compare three templates defined in the second version of retiot with conventional software systems templates (not specific to iot software systems). their comparison provided indications that the artifacts generated by retiot may be more complete regarding the capture of iot information.

table 6. techniques/methods x re phases. columns: con | eli | neg | ana | spe | ver | val | man (values count the technologies applying each technique/method in each phase).
interview: 2
prototyping: 3
canvas: 1 1
scenarios: 1 3
use cases: 2 2 7 1
class diagram: 1
activity diagram: 1
state diagram: 1
sysml language: 1 1 1
formal models: 2 4 2 1 1
templates: 2 3 5 1 2
goal model: 2 4 1 1
framework: 2 3 1 2 1 2
catalogs: 1 1 1
process: 3 11 1 9 7 3 3 3
total: 9 28 2 30 22 7 9 7

an experimental study was planned to analyze the process and templates of retiot. however, it was not possible to conduct this study due to the covid-19 pandemic.
some of the future works planned for the retiot are: i) the (re)design and execution of experimental studies to evaluate the technology in more robust iot software system projects (in both academic and industrial contexts); a comparative study of the retiot with traditional technologies will be carried out to verify the efficiency and effectiveness of the retiot in terms of capturing relevant system and project information. such a study should also evaluate the retiot's usefulness and suitability according to the users' perception; ii) integrating retiot with a testing technique to support software engineers with the specification of context-aware test cases – cats# (context-aware test suite design) (doreste and travassos 2020); and iii) developing tooling support integrating the construction process, the iot scenario specification and verification techniques, and the templates. the tool will facilitate the traceability among iot requirements, iot interaction arrangements, iot scenarios, and iot use-cases.

acknowledgments

the authors would like to thank the national council for scientific and technological development (cnpq). taisa gonçalves received a postdoctoral scholarship (154004/2018-9). prof. travassos is a cnpq researcher (304234/2018-4).

references

alexander i, maiden n (2004) scenarios, stories, and use cases: the modern basis for system development. computing and control engineering 15:24–29. https://doi.org/10.1049/cce:20040505
arif s, khan q, gahyyur sak (2009) requirements engineering processes, tools/technologies, & methodologies. international journal of reviews in computing 2:41–56
aziz mw, sheikh aa, felemban ea (2016) requirement engineering technique for smart spaces. in: international conference on internet of things and cloud computing. acm press, cambridge, united kingdom, p 54:1-54:7
behrens h (2002) requirements analysis using statecharts and generated scenarios. in: doctoral symposium at ieee joint conference on requirements engineering
burg jfm, van de riet rp (1996) a natural language and scenario based approach to requirements engineering. in: proceedings of workshop in natuerlichsprachlicher entwurf von informationssystemen
carvalho rm, andrade rmc, oliveira km (2018) towards a catalog of conflicts for hci quality characteristics in ubicomp and iot applications: process and first results. in: 12th international conference on research challenges in information science (rcis). ieee, nantes, pp 1–6
costa b, pires pf, delicato fc (2017) specifying functional requirements and qos parameters for iot systems. in: 15th intl conf on dependable, autonomic and secure computing, 15th intl conf on pervasive intelligence and computing, 3rd intl conf on big data intelligence and computing and cyber science and technology congress. ieee, orlando, fl, pp 407–414
curumsing mk, fernando n, abdelrazek m, et al (2019) emotion-oriented requirements engineering: a case study in developing a smart home system for the elderly. journal of systems and software 147:215–229. https://doi.org/10.1016/j.jss.2018.06.077
doreste ac de s, travassos gh (2020) towards supporting the specification of context-aware software system test cases. in: xxiii ibero-american conference on software engineering. springer, curitiba, brazil (online), p s10 p1:8 pages
ferraris d, fernandez-gago c (2020) trustapis: a trust requirements elicitation method for iot. international journal of information security 19:111–127.
https://doi.org/10.1007/s10207-019-00438-x
giray g, tekinerdogan b, tüzün e (2018) iot system development methods. in: hassan q, khan ar, madani sa (eds) internet of things. crc press/taylor & francis, new york, pp 141–159
glinz m (2000) improving the quality of requirements with scenarios. in: proceedings of the second world congress on software quality. pp 55–60
hamdi ms, ghannem a, loucopoulos p, et al (2019) intelligent parking management by means of capability oriented requirements engineering. in: wotawa f, friedrich g, pill i, et al. (eds) advances and trends in artificial intelligence from theory to practice iea/aie 2019. springer international publishing, cham, pp 158–172
laplante nl, laplante pa, voas jm (2018) stakeholder identification and use case representation for internet-of-things applications in healthcare. ieee systems journal 12:1589–1597. https://doi.org/10.1109/jsyst.2016.2558449
lim t-y, chua f-f, tajuddin bb (2018) elicitation techniques for internet of things applications requirements: a systematic review. in: vii international conference on network, communication and computing. acm press, taipei city, taiwan, pp 182–188
mahalank sn, malagund kb, banakar rm (2016) non functional requirement analysis in iot based smart traffic management system. in: international conference on computing communication control and automation. ieee, pune, india, pp 1–6
motta rc, oliveira km, travassos gh (2019a) on challenges in engineering iot software systems. journal of software engineering research and development 7:5:1-5:20. https://doi.org/10.5753/jserd.2019.15
motta rc, oliveira km, travassos gh (2020) towards a roadmap for the internet of things software systems engineering. in: proceedings of the 12th international conference on management of digital ecosystems. acm, virtual event, united arab emirates, pp 111–114
motta rc, silva vm, travassos gh (2019b) towards a more in-depth understanding of the iot paradigm and its challenges. journal of software engineering research and development 7:3:1-3:16. https://doi.org/10.5753/jserd.2019.14
nguyen-duc a, khalid k, shahid bajwa s, lønnestad t (2019) minimum viable products for internet of things applications: common pitfalls and practices. future internet 11:paper 50. https://doi.org/10.3390/fi11020050
pandey d, suman u, ramani ak (2010) an effective requirement engineering process model for software development and requirements management. in: international conference on advances in recent technologies in communication and computing. ieee, kottayam, india, pp 287–291
pressman rs, maxim b (2014) software engineering: a practitioner's approach, 8th edition. mcgraw-hill education, new york, ny
reggio g (2018) a uml-based proposal for iot system requirements specification. in: 10th international workshop on modelling in software engineering. acm press, gothenburg, sweden, pp 9–16
silva dv da, goncalves tg, rocha arc da (2019) a requirements engineering process for iot systems. in: xviii brazilian symposium on software quality. acm press, fortaleza, brazil, pp 204–209
silva dv da, gonçalves tg, travassos gh (2020a) a technology to support the building of requirements documents for iot software systems. in: xix brazilian symposium on software quality. acm press, são luís, brazil (online), article no 4, pages 1-10
silva dv da, souza bp de, gonçalves tg, travassos gh (2020b) uma tecnologia para apoiar a engenharia de requisitos de sistemas de software iot [a technology to support the requirements engineering of iot software systems].
in: xxiii ibero-american conference on software engineering. curitiba, brazil (online), p s09 p3:14 pages
silva vm (2019) scenariot: support for scenario specification of internet of things-based software systems. master's dissertation, federal university of rio de janeiro
sommerville i (2015) software engineering, 10th edition. pearson, harlow
souza bp (2020) scenariotcheck: uma técnica de leitura baseada em checklist para verificação de cenários iot [a checklist-based reading technique for the verification of iot scenarios]. master's dissertation, federal university of rio de janeiro
souza bp de, motta rc, costa d, travassos gh (2019a) an iot-based scenario description inspection technique. in: xviii brazilian symposium on software quality. acm press, fortaleza, brazil, pp 20–29
souza bp de, motta rc, travassos gh (2019b) the first version of scenariotcheck: a checklist for iot based scenarios. in: xxxiii brazilian symposium on software engineering. acm press, salvador, brazil, pp 219–223
takeda a, hatakeyama y (2016) conversion method for user experience design information and software requirement specification. in: marcus a (ed) design, user experience, and usability: design thinking and methods duxu 2016. springer, cham, pp 356–364
touzani m, ponsard c (2016) towards modelling and analysis of spatial and temporal requirements. in: 24th international requirements engineering conference. ieee, beijing, china, pp 389–394
vegendla a, duc an, gao s, sindre g (2018) a systematic mapping study on requirements engineering in software ecosystems. journal of information technology research 11:4:1-4:21. https://doi.org/10.4018/jitr.2018010104
yamakami t (2017) horizontal requirement engineering in integration of multiple iot use cases of city platform as a service. in: 2017 ieee international conference on computer and information technology (cit). ieee, helsinki, finland, pp 292–296
zambonelli f (2017) key abstractions for iot-oriented software engineering. ieee software 34:38–45. https://doi.org/10.1109/ms.2017.3

journal of software engineering research and development, 2022, 10:8, doi: 10.5753/jserd.2022.2133 – this work is licensed under a creative commons attribution 4.0 international license.

accessibility mutation testing of android applications

henrique neves da silva [federal university of paraná | henriqueneves@ufpr.br]
silvia regina vergilio [federal university of paraná | silvia@inf.ufpr.br]
andré takeshi endo [federal university of são carlos | andreendo@ufscar.br]

abstract

smart devices and their apps are present in many everyday activities and play an important role for people with some disabilities. however, making apps more accessible is still a challenge for developers. automated accessibility testing tools can help in this task but present some limitations. they produce reports on accessibility faults, which usually cover only a subset of the app because they are dependent on the test set available.
in order to help in the improvement and/or assessment of the generated test suites, as well as to contribute to increasing the performance of accessibility testing tools, this work introduces a mutation testing approach. the approach includes a set of mutation operators derived from faults corresponding to the negation of the wcag standard's principles and success criteria. it also includes a process to analyse the mutants regarding the original app. evaluation results with 7 open-source apps show the approach is applicable in practice and contributes to significantly improving the number of faults revealed by the test suites accompanying the apps.

keywords: mobile apps, mutation testing, accessibility

1 introduction

in the last decade, we have observed a growing number of smartphones, and studies show this number is expected to increase even more in the next years (cisco, 2017). smart devices and their apps have become a key component in people's daily lives. this is no different for people with some disabilities. for instance, people with some visual impairment have relied on smartphones as a vital means to foster independence in carrying out various tasks, such as understanding text document structure, communicating through social media apps, identifying products on supermarket shelves, and moving between obstacles (acosta-vargas et al., 2020). the world health organization (who) estimated that more than one billion people, which is around 15% of the world's population, are affected by some form of disability (hartley, 2011). thus, it is fundamental to engineer software so that all the advantages of technology are accessible to every individual.

mobile accessibility refers to making websites and apps more accessible to people with disabilities when they are using smartphones and other mobile devices (w3c, 2019). progress has been made with accessibility because of mandates from government regulations (e.g., the u.s. section 508 of the rehabilitation act), standards (such as the british broadcast corporation standards, the brazilian accessibility model, and the web content accessibility guidelines), widespread industrial awareness, technological advances, and accessibility-related lawsuits (yan and ramachandran, 2019). however, developers still have the challenge of providing more accessible software on mobile devices. according to ballantyne et al. (2018), much of the research on software accessibility is dedicated to the web and its sites (grechanik et al., 2009; wille et al., 2016; abuaddous et al., 2016), even though there is a recurring effort on the accessibility of mobile apps (vendome et al., 2019). moreover, studies point to the lack of adequate tools, guides, and policies to design, evaluate, and test the accessibility of mobile apps (acosta-vargas et al., 2020).

automated accessibility testing tools are usually based on existing guidelines. one of the most popular standards is the wcag (w3c's web content accessibility guidelines) (kirkpatrick et al., 2018) guide. the wcag guide covers recommendations for people with blindness and low vision, deafness and hearing loss, limited movement, cognitive limitations, and speech and learning disabilities. wcag encompasses several guidelines, each one related to different success criteria, grouped into four accessibility principles. some tools produce, given a set of executed test cases, a report of accessibility violations for the app. examples of these tools are accessibility google scanner (google, 2020), espresso (google, 2018), a11y ally (toff, 2018), and mate (eler et al., 2018).
they can perform static or dynamic analysis (silva et al., 2018). a limited number of violations can be checked by static tools, while dynamic analysis tends to be more costly. another limitation is that the accessibility faults checked by tools are limited by the test cases used. they cover only a subset of the app due to weak test scripts or limited input test data generation algorithms (silva et al., 2018). tools generally used for test data generation, such as monkey (moher et al., 2009), sapienz (mao et al., 2016), stoat (su et al., 2017), and ape (gu et al., 2019), are focused on functional behavior, code coverage, or crashes. in this sense, this work hypothesizes that a mutation approach specific to accessibility testing can help in the improvement and/or assessment of generated test suites and contribute to increasing the performance of accessibility testing tools.

the idea behind mutation testing is to derive versions of the program under test p, called mutants. each mutant describes a possible fault and is produced by a mutation operator (jia and harman, 2011). the objective is to generate test cases capable of distinguishing p from its mutants, that is, test cases that, when executed with each mutant m, produce an output different from the output of p. if p's result is correct, it is free from the fault described by m. if the output is different, m is said to be killed. at the end, a measure called mutation score is calculated, related to the number of mutants killed. this measure can be used to design test cases, or to evaluate the quality of an existing test suite and consider whether a program has been tested enough.

mutation testing has been proved to be effective in different domains and contexts (jia and harman, 2011). more recently, it has been used in the test of non-functional properties such as performance regarding execution time (lisper et al., 2017) and energy consumption (jabbarvand and malek, 2017). there are some initiatives exploring mutation testing of android apps (wei, 2015; deng et al., 2015; jabbarvand and malek, 2017; luna and el ariss, 2018; escobar-velásquez et al., 2019), but these works are not focused on accessibility testing.

given the context and motivation described above, this paper introduces a mutation approach for the accessibility testing of android apps. the underlying fault model is related to non-compliance with wcag principles and success criteria. we propose a set of 6 operators that remove some selected code elements, the most commonly used in apps, whose absence may imply accessibility violations. we also define a mutant analysis process that uses tools' accessibility reports to distinguish killed mutants. the process is implemented using the reports produced by espresso (google, 2018), and evaluated with 7 open-source apps. the results show our approach is applicable in practice and contributes to improving the quality of the test suites accompanying the selected apps. we observe a significant improvement regarding the number of faults revealed by using the mutant-adequate test suites. in this way, the present work introduces a mutation approach that encompasses a set of mutation operators and a mutation process implemented by a tool.
the approach (i) can be used as a criterion for test data generation and/or assessment, helping developers measure the quality of their test suites or generate tests from an accessibility perspective; (ii) can be explored to evaluate the accessibility tools available in the market and in academia; and (iii) contributes to the emergent area of mutation testing for non-functional properties and represents a first step towards accessibility mutation testing, serving as a basis to direct future research and encourage the academic community to create tools that further explore this field of research.

the remainder of this paper is organized as follows. section 2 gives an overview of related work. section 3 introduces our mutation testing approach. section 4 details the evaluation and its main results. section 5 discusses the threats to validity, and section 6 concludes the paper.

2 related work

related work can be classified into two main categories: mutation testing of apps (section 2.1) and accessibility testing (section 2.2).

2.1 mutation testing of android apps

in the literature, there are some mutation approaches for android apps. deng et al. (2015) define 4 classes of mutation operators specific to the android context. the proposed workflow differs from the traditional mutation testing process. once the mutants are generated, it is necessary to install each mutant m on the android emulator. the test cases are implemented through the frameworks robotium (reda, 2019) or junit (gamma and beck, 2019). while deng's approach requires the app source code, wei (2015) proposes mudroid, a tool that requires only the apk file of the app.

linares-vásquez et al. (2017) define a list of 38 mutation operators, implemented by the tool mdroid+ (moran et al., 2018). first, a static analysis of the java code using abstract syntax trees (ast) is performed to find a potential fault profile (pfp) that describes a source code location that can be changed by an operator. pfps are used to apply the transformation corresponding to each operator in the java code or xml file. mdroid+ creates a clone of the android project and applies a single mutation to a pfp specified in the cloned project, resulting in a mutant. finally, a report is generated associating the name of the created clone with the applied operator. the tool does not offer a way to compile and execute the mutants, nor does it calculate the mutation score. in a follow-up study, escobar-velásquez et al. (2019) introduce mutapk, which requires as input the apk of the android app and implements the same operators of mdroid+ (linares-vásquez et al., 2017; moran et al., 2018). the corresponding implementation considers the smali representation. like mdroid+, mutapk does not include a mutant analysis strategy. both allow the creation of customized mutation operators.

some works have explored aspects of a specific nature within the android platform. the edroid tool (luna and el ariss, 2018) implements 10 mutation operators oriented to vary configuration files and gui elements. the analysis of the mutants is done manually: if the mutant's ui components are distinguished from the original, the mutant is classified as dead. µdroid is a mutation tool to identify energy-related problems (jabbarvand and malek, 2017). the tool implements a total of 50 mutation operators corresponding to 28 classes defined as energy consumption anti-patterns. µdroid has a fully automated mutation testing process. while the test is performed in the original app, energy consumption is monitored.
when the test is executed on the mutant, the energy consumption of the original app is compared to that of the mutant. if the consumption profiles are different enough, the mutant is considered dead.

most tools could be extended to have integrated support for the mutation testing process, mainly automatic mutant execution and analysis. most of them generate mutants but do not offer automatic support for the analysis of the mutant output, which is mainly conducted manually. in addition, there are some initiatives exploring mutation testing of apps for non-functional properties, such as energy consumption, but they do not address accessibility faults. based on elicited results about mutation testing of mobile apps (silva et al., 2021), and as far as we are aware, there is no mutation approach for mobile accessibility testing and evaluation.

2.2 accessibility evaluation of android apps

there are few studies on the accessibility assessment of mobile apps. this small number of studies is due to the lack of adequate tools, guides, and policies to evaluate apps (acosta-vargas et al., 2020; eler et al., 2018). such guides are generally used as oracles to check whether the app meets accessibility requirements during an accessibility evaluation, which can be conducted manually or by automated tools. below, we present some works that analyse those guides and report the main accessibility problems, as well as automated tools that take them into consideration.

ballantyne et al. (2018) compile a super-set of guides and normalize them to eliminate redundancy. the result lists 11 categories of testable accessibility elements: text, audio, video, gui elements, user control, flexibility and efficiency, recognition instead of recalling, gestures, system visibility, error prevention, and tangible interaction. damaceno et al. (2018) perform a similar mapping that identifies 68 problems associated with different aspects of the interaction of people with visual impairments on mobile devices. these problems are mapped into 7 groups, including buttons, data entry, gesture-based interaction, screen size, user feedback, and voice command. the group with more problems is related to the interaction made of formal gestures. vendome et al. (2019) elaborate a taxonomy of accessibility problems by mining 13,817 android apps from github. the authors observe that 36.96% of the projects did not have elements with descriptive label attributes, and only 2.08% imported at least one accessibility api. the main categories listed in the fault model are: support for visual limitation, support for motor limitation, hearing limitation, and other aspects of accessibility.

alshayban et al. (2020) present the results of a large-scale study to understand accessibility from three complementary perspectives: apps, developers, and users. first, they analyze the prevalence of accessibility violations in over 1,000 android apps. then they investigate the developers' sentiments through a survey. in the end, they investigate user ratings and app popularity. their analysis revealed that inaccessibility rates for apps developed by big companies are relatively similar to inaccessibility rates for other apps. the works of acosta-vargas et al. (2019, 2020) evaluate the use of wcag 2.1 and the accessibility google scanner, a tool that suggests accessibility improvements for android apps. the authors conclude that the wcag guide helps achieve digital inclusion on mobile platforms.
however, the accessibility problems must be fixed before the application goes into production, and they recommend the use of wcag throughout the development cycle. the most recent version, wcag 2.1, includes suggestions for web access via mobile devices (kirkpatrick et al., 2018). wcag principles are grouped into 4 categories: (i) perceivable, that is, "the information must be presentable to users in ways they can perceive"; (ii) operable, "user interface components and navigation must be operable"; (iii) understandable, "information and the operation of user interface must be understandable"; and (iv) robust, "content must be robust enough that it can be interpreted by a wide variety of user agents, including assistive technologies". these principles are the core tenets of accessibility. to follow the accessibility principles, we must achieve the success criteria defined within their respective guideline and principle. automated tools commonly use the wcag success criteria as testable statements to check for guideline violations.

they can perform static or dynamic analysis (silva et al., 2018). static analysis can quickly analyze all assets of an app (google, 2018), but it cannot find violations that can only be detected during runtime (e.g., low color contrast). in contrast, dynamic analysis tends to be time-consuming. in this sense, eler et al. (2018) define a set of accessibility criteria and implement mate (mobile accessibility testing), a tool that automatically explores and verifies the accessibility of mobile apps. developers can also manually assess accessibility properties using the google scanner (google, 2020). it allows testing apps and getting suggestions on how to improve accessibility (to help those who have limited vision, speech, or movement). first, the app is activated; then, it displays the main handling instructions. finally, with the mobile app running, google scanner highlights the gui element on the screen and the accessibility property it has not fulfilled. the a11y ally app (toff, 2018) checks the accessibility of the running app. from its integration via the command line, a11y ally generates a json file at the end of its execution. this file contains the list of gui elements and the accessibility criteria that have been violated. the framework espresso (google, 2018) allows the recording of automated tests that assess the accessibility of the mobile app. the accessibility of a gui element, or simply widget, is checked if the test action triggers/interacts with the widget in question.

the tools for accessibility testing and evaluation present some limitations. the most noticeable one is that the kind and number of accessibility violations determined by the tools are dependent on the test set used to execute the app and produce the reports. in this sense, the use of mutants describing potential accessibility faults can guide the test data generation and help in the improvement or assessment of an existing test set regarding this non-functional property.

3 a mutation approach for accessibility testing

this section introduces our approach and describes its main elements, which are usually required for any mutation approach: (i) the underlying fault model, related to accessibility faults; (ii) the mutation operators; (iii) the mutation testing process, adopted to analyze the mutants; and (iv) automation aspects, essential to allow the use of the approach in practice.
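since element (iii) culminates in a mutation score, it is worth recalling the standard definition from the mutation testing literature (e.g., jia and harman, 2011); the formula below is the usual one and is not stated explicitly in this paper:

\[ ms(p, t) = \frac{\#killed\ mutants}{\#generated\ mutants - \#equivalent\ mutants} \]

where an equivalent mutant is one whose observable behavior no test can distinguish from p; a score of 1 means the test set t distinguishes every non-equivalent mutant.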
3.1 fault model

in this stage, we searched the literature for different accessibility guides that establish good practices and for experiments that used them (see section 2.1). in general, a guide summarizes the main recommendations for making the content presented by a mobile app more accessible. as a result of our search, we observed that the wcag guide was adopted as a reference to build mobile accessibility guides such as emag (brazilian government, 2007), the list of accessibility guidelines for mobile applications (ballantyne et al., 2018), the bbc mobile accessibility guideline (bbc, 2017), and the sidi accessibility guideline (sejasidier, 2015). in this way, the wcag guide was chosen for the following reasons: i) as mentioned before, it encompasses success criteria written as testable statements; ii) it is constantly updated, and a new version of the guide maintains compliance with its previous one; and iii) it has been considered by many authors as the most popular guide (acosta-vargas et al., 2019, 2020).

once the success criteria are known, we can start building a fault model by negating these criteria. an unsatisfied criterion may imply one or more accessibility faults, as exemplified in table 1.

table 1. negating wcag success criteria.
principle | success criterion | success criterion denial
perceivable | content description | absence of content descriptions for non-text elements
operable | recommended touch area size | not recommended touch area size
understandable | labels or instructions | absence of labels or instructions
robust | status messages | absence of status messages

as observed in table 1, the denial of the criterion "labels or instructions" causes one or more faults related to the absence of a label. within android mobile development, different code elements characterize the use of a label for a gui element. these code elements can be either xml attributes or java methods. for instance, one way to satisfy the success criterion "labels or instructions" is setting the xml attributes :hint and :labelfor, or using the java methods sethint and setlabelfor, as sketched below. such elements are the key to the generation of mutants, in order to capture the faults of our model. in this way, more than one mutation operator can be derived from the negation of a criterion, such as "labels or instructions". each mutation operator, in its turn, can be applied to more than one element in the code, generating distinct mutants.

to select the code elements and propose the mutation operators of our approach, we refer to the work of silva et al. (2020). this work maps the wcag principles and success criteria to code elements of the native android api, and analyzes the prevalence of the mapped elements in 111 open-source mobile apps. the study identifies code elements that impact accessibility and shows that apps which adopt different types of code elements tend to have a smaller density of accessibility faults. this means that code elements associated with wcag are related to accessibility faults and justify mutation operators based on these code elements.
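the following minimal java sketch (view ids and strings are hypothetical, introduced only for illustration) shows how these two code elements satisfy the "labels or instructions" criterion when set programmatically:

// inside an activity's onCreate(), after setContentView(...):
TextView usernameLabel = findViewById(R.id.username_label); // hypothetical id
EditText usernameInput = findViewById(R.id.username_input); // hypothetical id

// associates the label with the field, so screen readers announce it
usernameLabel.setLabelFor(R.id.username_input);

// temporary label read by screen readers while the field is empty
usernameInput.setHint("enter your e-mail");

deleting either call – which is exactly what the corresponding mutation operators do – leaves the field without a spoken description.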
3.2 mutation operators

the main objective in defining the accessibility mutation operators is to make sure that the test suite created by the tester exploits all, or at least most, of the app's gui elements, as well as checks the correct use of the code elements related to the accessibility success criteria. in this way, the operators can be used to guide the generation of test cases or to assess the quality of existing ones. to this end, and following the work of silva et al. (2020), we selected a set e of code elements, the most adopted in the apps, to propose an initial set of operators. these operators are defined considering aspects of android apps' accessibility and can be extended in the future by adding other code elements and success criteria. the selected code elements are presented in table 2; they correspond to the most used ones in the apps for each principle (silva et al., 2020). the table also shows the corresponding mutation operator.

table 2. selected code elements and corresponding wcag principles and success criteria.
principle | success criteria | xml attribute | java method | mutation operator
perceivable | resize text | :textsize | settextsize | missing textsize
perceivable | identify input purpose | :inputtype | setinputtype | missing inputtype
operable | keyboard; focus order | :nextfocusdownid | setnextfocusdownid | missing nextfocusdownid
understandable | label or instructions | :labelfor | setlabelfor | missing labelfor
understandable | label or instructions | :hint | sethint | missing hint
robust | status messages | :importantforaccessibility | setimportantforaccessibility | missing importantforaccessibility

the labelfor element is a label that accompanies the view object. it can be defined via the xml file or the java language. in general, it provides description and exploration labels for some screen elements. the hint element is a temporary label assigned to editable fields only. it is necessary for talkback, or any other screen reader, to correctly report what information the app needs. we can set or change a textview's font size with the element textsize; the recommended dimension type for text is "sp" for scaled pixels (e.g., 15sp). the element inputtype specifies the input type for each text field in order for the system to display the appropriate soft input method (e.g., an on-screen keyboard). the app, by default, looks for the closest element to receive the next focus, but the next element is not always the most logical one. in these cases, we need to give the app custom navigation; we can define the next view to focus on using the code element nextfocusdownid. the element importantforaccessibility describes whether or not a view is important for accessibility. if the value is set to "yes", the view fires accessibility events and is reported to accessibility services (e.g., talkback) that query the screen.

the idea of the operators is to remove the corresponding code element e ∈ e when present. we opted for statement deletion operators, as previous studies gave evidence that such operators produce fewer yet effective mutants (delamaro et al., 2014). for each code element removed, we have a unique generated mutant. table 3 presents the operators; in the code context examples (xml attribute and java method snippets) accompanying each operator, the code to be removed is preceded by "–".

table 3. mutation operator description.
mts – missing textsize
mit – missing inputtype
mnfd – missing nextfocusdownid
mlf – missing labelfor
mh – missing hint
mia – missing importantforaccessibility

it is important to emphasize that if a mutation operator cannot be applied to the app source code, this may indicate that the project/developer team gives low priority to accessibility. now, imagine that the developer has taken care to define the accessibility code elements in the app. even if they are defined, it is very important to ensure that the test set includes tests that perform actions and interact with the corresponding gui elements and check that they are defined properly.

3.3 mutation process

the testing process for the application of the proposed operators is depicted in figure 1. it encompasses three steps. the first one is the mutant generation using the accessibility mutation operators defined; this step produces a set of mutant apps m. in the second step, the original app and the mutants in m are executed with a test set t, which can be designed with the tester's preferred strategy.
3.3 mutation process

the testing process for the application of the proposed operators is depicted in figure 1 and encompasses three steps. the first one is the mutant generation using the defined accessibility mutation operators; this step produces a set of mutant apps m. in the second step, the original app and the mutants in m are executed with a test set t, which can be designed with the tester's preferred strategy. however, for the mutant analysis, our process requires that t be implemented and executed using an accessibility checker tool, such as the ones reported in section 2.2. the third step, mutant analysis, calculates the mutation score by comparing the accessibility reports produced by an accessibility checker for the original and mutant apps. if the accessibility logs differ, that is, if different accessibility faults are encountered, the mutant can be considered dead.

the accessibility report generated by espresso contains some temporal information that may cause non-deterministic output. to correct this, we post-process the output so that only the essential information is taken into account, namely the code element id and its reported accessibility issue. therefore, if the original app's accessibility log is the same as that of the mutant app, resulting in a live mutant, the test suite probably needs to be revised and improved. if the score is not satisfactory, the tester can add new test cases or modify existing ones in t so that more mutants are killed.

figure 1. testing process of the proposed approach: (1) mutant generation from the android app, producing the set m of mutant apps; (2) execution of the test set t on the original app and on m, producing accessibility logs of the visited screens; (3) analysis of the mutants and computation of the mutation score, after which the tester may decide to improve the score by modifying t.

3.4 implementation

to evaluate and use our approach, we implemented a prototype tool named accessibilitymdroid. it receives as input the source code of the android app under test. accessibilitymdroid implements the proposed operators by extending mdroid+ (moran et al., 2018), which is used for mutant generation (step 1). to build and execute the tests, as well as to produce the accessibility log (step 2), the espresso framework is used. we chose tests implemented with espresso because it is the default framework for gui testing in android studio and includes embedded accessibility checking. as t is executed, the accessibilitycheck class allows us to check for accessibility faults. at the end of the run, espresso generates a log of the accessibility problems, which is used in step 3. the tool compares the logs automatically, and a list of killed mutants is produced.

to illustrate our approach, we use a sample app built with android studio. a piece of code for this app is presented in figure 2. with the application of the operator mh (missing hint), which removes the hint code element from a gui element, line 22 (in red) disappears in the mutant m.

figure 2. a mutant generated by operator mh
@Test
public void loginTest() {
    var appCompatEditText = onView(allOf(
        withId(R.id.username),
        childAtPosition(allOf(withId(R.id.container),
            childAtPosition(withId(android.R.id.content), 0)), 1),
        isDisplayed()));

    appCompatEditText.perform(replaceText("email"), closeSoftKeyboard());

    var appCompatEditText2 = onView(allOf(
        withId(R.id.password),
        childAtPosition(allOf(withId(R.id.container),
            childAtPosition(withId(android.R.id.content), 0)), 2),
        isDisplayed()));

    appCompatEditText2.perform(replaceText("123456"), closeSoftKeyboard());

    var appCompatEditText3 = onView(allOf(
        withId(R.id.password), withText("123456"),
        childAtPosition(allOf(withId(R.id.container),
            childAtPosition(withId(android.R.id.content), 0)), 2),
        isDisplayed()));

    appCompatEditText3.perform(pressImeActionButton());
}

figure 3. test case using espresso

1 AppCompatEditText{id=2131230902,res-name=nickname}: view falls below the minimum recommended size for touch targets. minimum touch target size is 48x48dp. actual size is 331.4x45.0dp (screen density is 2.6).
2 AppCompatEditText{id=2131230902,res-name=nickname}: view falls below the minimum recommended size for touch targets. minimum touch target size is 48x48dp. actual size is 331.4x45.0dp (screen density is 2.6).
3 AppCompatEditText{id=2131230917,res-name=password}: view falls below the minimum recommended size for touch targets. minimum touch target size is 48x48dp. actual size is 331.4x45.0dp (screen density is 2.6).
4 AppCompatEditText{id=2131230917,res-name=password}: view falls below the minimum recommended size for touch targets. minimum touch target size is 48x48dp. actual size is 331.4x45.0dp (screen density is 2.6).
5 AppCompatEditText{id=2131230917,res-name=password}: view falls below the minimum recommended size for touch targets. minimum touch target size is 48x48dp. actual size is 331.4x45.0dp (screen density is 2.6).

figure 4. accessibility log for the original app

@Test
public void loginTest() {
+   onView(withId(R.id.nickname)).perform(typeText("nick"),
+       closeSoftKeyboard());
    var appCompatEditText = ...
}

figure 5. changed test

suppose that for this app, a test, as depicted in figure 3, is available. when t is executed with espresso on the mutant m (step 2), a log is generated. this log is compared to the log generated by executing t on the original app (step 3). the difference between the two accessibility logs makes it possible to determine the mutant's death. in this case, t was not enough to reveal the difference between the original app and the mutant: as both produce the same log, shown in figure 4, the mutant is still alive. the tester now tries to improve t and realizes that the existing tests do not interact with one of the app's input fields. after changes in t (illustrated in figure 5), step 2 is executed again and the log for m is now the one in figure 6; it differs from the original one in the first line.

1 + AppCompatEditText{id=2131230902,res-name=nickname}: view is missing speakable text needed for a screen reader
2 AppCompatEditText{id=2131230902,res-name=nickname}: view falls below the minimum recommended size for touch targets. minimum touch target size is 48x48dp. actual size is 331.4x45.0dp (screen density is 2.6).
3 AppCompatEditText{id=2131230902,res-name=nickname}: view falls below the minimum recommended size for touch targets. minimum touch target size is 48x48dp. actual size is 331.4x45.0dp (screen density is 2.6).
4 AppCompatEditText{id=2131230917,res-name=password}: view falls below the minimum recommended size for touch targets. minimum touch target size is 48x48dp. actual size is 331.4x45.0dp (screen density is 2.6).
5 AppCompatEditText{id=2131230917,res-name=password}: view falls below the minimum recommended size for touch targets. minimum touch target size is 48x48dp. actual size is 331.4x45.0dp (screen density is 2.6).
6 AppCompatEditText{id=2131230917,res-name=password}: view falls below the minimum recommended size for touch targets. minimum touch target size is 48x48dp. actual size is 331.4x45.0dp (screen density is 2.6).

figure 6. accessibility log for the mutant m
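the comparison in step 3 reduces each log line to the essential pair (code element id, reported issue) before diffing, as described in section 3.3. the sketch below shows one way to implement that normalization and the kill decision; the parsing details are our assumptions, not accessibilitymdroid's actual code:

import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// sketch of the step-3 mutant analysis: normalize the espresso
// accessibility logs and declare the mutant dead when they differ
public class MutantAnalyzer {

    // captures the view id and the issue text from a line such as
    // "AppCompatEditText{id=2131230902,res-name=nickname}: view is missing ..."
    private static final Pattern ENTRY =
        Pattern.compile("id=(\\d+).*?\\}:\\s*(.+)");

    // keeps only (element id, issue) pairs, discarding volatile details
    static Set<String> normalize(List<String> logLines) {
        Set<String> entries = new HashSet<>();
        for (String line : logLines) {
            Matcher m = ENTRY.matcher(line);
            if (m.find()) {
                entries.add(m.group(1) + "::" + m.group(2).trim());
            }
        }
        return entries;
    }

    // a mutant is considered killed when the normalized logs differ
    static boolean isKilled(List<String> originalLog, List<String> mutantLog) {
        return !normalize(originalLog).equals(normalize(mutantLog));
    }
}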
by employing a similar procedure to kill accessibility mutants, t achieves a higher mutation score, covers more gui elements, and potentially reveals other accessibility faults.

4 evaluation

the main goal of the proposed operators is to serve as a guide for the evaluation and improvement of test suites regarding accessibility faults. to evaluate these aspects properly, as well as our implementation using espresso, we formulated three research questions, as follows.

rq1: how applicable are the accessibility mutation operators? this question aims to investigate whether the proposed operators and process are applicable in practice. to answer it, we evaluate the approach's application cost by analysing the number of mutants generated by each operator, as well as the number of required test cases.

rq2: how adequate are existing test suites with respect to accessibility mutation testing? this question evaluates the use of the proposed operators as an evaluation criterion. they are used for quality assessment of the test suites accompanying the selected open source apps with respect to accessibility. to this end, we analyse the ability of existing tests to kill the mutants generated by our approach.

rq3: how much do the mutation operators contribute to revealing new accessibility faults? this question looks at the effectiveness of mutant-adequate test suites in revealing accessibility violations.

4.1 study setup

we sampled open source apps from f-droid (https://www.f-droid.org), last updated in 2019/2020 and containing espresso test suites. we refer to the test suite accompanying each project as t. we removed apps that failed to build and those whose tests were not compatible with the accessibility checking feature. the replication package is available at https://osf.io/vfs2d/. the seven apps are: alarmclock, an alarm clock for android smartphones and tablets that brings a pure alarm experience; anymemo, a spaced repetition flashcard learning software; authorizer, a password manager for android; equate, a unit-converting calculator; kolabnotes, a note-taking app; piwigo, a photo gallery app for the web; and pleestracker, a sleep tracker.

for each app, we used accessibilitymdroid to generate the mutants, run t, produce the accessibility logs for each mutant, and compare them with the original log. in this way, we obtained the set of mutants killed by t. after this, we manually inspected the alive mutants and realized that, many times, some of the test cases in t exercised the mutated code but produced no difference in the log due to espresso limitations (e.g., the limited set of accessibility criteria that are detected and printed in the accessibility log). in such cases, we marked the corresponding mutant as covered.
other mutants were marked as "unreachable", since their mutations are related to widgets that are not reachable in the app (e.g., dead code). we then counted the number of mutants generated, killed, covered, and unreachable by t. next, we extended t so that all mutants were killed or at least covered; we refer to this extended test suite as xt. the inclusion of a test case was conducted as follows: (i) pick an alive mutant (neither covered nor killed by t); (ii) manually record a test that exercises the mutation using the espresso test recorder in android studio and, if needed, refactor the test code to make it repeatable (the code generated by the espresso test recorder may be too specific and fail in re-runs); (iii) analyze whether the mutant is killed by the new test; if not, mark it as covered. the mutant information was then collected again for xt.

as cost indicators, we collected the number of tests of a test suite, tc(t), and its size, given by the number of lines of test code, loc(t). as for effectiveness, we counted, per test suite, the number of accessibility faults reported by the espresso accessibility check.

table 4 shows information on the seven selected apps. authorizer is the app with the greatest loc (28,286), while anymemo has the most activities (30, column #act.). alarmclock has the smallest loc (1,349), and equate has only 2 activities. the table also shows the number of test cases (#tc) and loc for the original test suite t and the extended one xt. notice that alarmclock has 41 tests and 1,068 lines of test code (loc(t)). kolabnotes has only one test, and anymemo has the smallest loc(t) (76). concerning xt, alarmclock and authorizer require the most tests (both 43) and the most loc(xt) (1,341 and 1,700, respectively). pleestracker has the smallest number of test cases (5) and loc(xt) (345). however, authorizer required the most additional test cases (32), while piwigo required only one.

table 4. selected apps
app | loc | #act. | #tc(t) | loc(t) | #tc(xt) | loc(xt)
alarmclock | 1,349 | 5 | 41 | 1,068 | 43 | 1,341
anymemo | 19,751 | 30 | 3 | 76 | 13 | 932
authorizer | 28,286 | 7 | 11 | 652 | 43 | 1,700
equate | 5,826 | 2 | 6 | 511 | 9 | 709
kolabnotes | 11,025 | 9 | 1 | 494 | 6 | 884
piwigo | 4,744 | 7 | 8 | 408 | 9 | 579
pleestracker | 1,868 | 5 | 2 | 89 | 5 | 345
∗ each app's name is a clickable link to the github project: https://github.com/yuriykulikov/alarmclock, https://github.com/helloworld1/anymemo, https://github.com/tejado/authorizer, https://github.com/evanrespaut/equate, https://github.com/konradrenner/kolabnotes-android, https://github.com/piwigo/piwigo-android, https://github.com/vmiklos/plees-tracker

4.2 analysis of results

table 5 summarizes the main results of the evaluation and is used in this section to answer our rqs. the table shows the number of mutants generated (columns g), killed by some test (columns k), covered but alive (columns c), and unreachable (columns u). notice that results are shown for 4 out of the 6 operators described in table 3: operators mlf and mnfd did not generate any mutant for the selected apps. for each app, two rows are presented, one with the results obtained by t and the other by xt. the last four columns list the totals over all operators, while the last rows bring the totals over all apps.

table 5. summary of the results per operator (raw values per app and test suite)
alarmclock: t → 12 9 1 1 14 9; xt → 12 1 1 1 13
anymemo: t → 64 14 11 22 86 14 11; xt → 1 52 18 1 70
authorizer: t → 18 1 27 3 18 3 9 2 72 9; xt → 18 27 6 12 9 6 66
equate: t → 3 2 2 1 1 7 1 1; xt → 3 2 1 1 5
kolabnotes: t → 23 8 13 3 12 48 11; xt → 23 13 8 4 8 40
piwigo: t → 1 3 3 1 1 5 1 3; xt → 1 3 1 1 4
pleestracker: t → 24 8 24 8; xt → 24 24
total: t → 145 40 11 68 9 34 2 3 1 9 2 256 2 54 12; xt → 1 133 64 17 16 9 18 222
number of mutants generated, killed, covered but alive, and unreachable by the original test suite t and the extended one xt. the mutation operators are: missing textsize; missing inputtype; missing hint; and missing importantforaccessibility.

for instance, for the app anymemo the operator mts generated 64 mutants, 11 of them unreachable. the test set t was not capable of killing any mutant but covered 14. the set xt covered 52, that is, 38 additional mutants could be covered. considering all operators, only one mutant could be killed by xt, and 70 mutants were covered out of 84 generated mutants. for this app, four mutants change a screen that is reached only when integrated with a third-party app; as exercising these mutants would require tools beyond espresso, we were not able to cover them.
however, they cannot be classified as unreachable. because of this, for this app the sum of killed, covered-but-alive, and unreachable mutants does not equal the number of generated mutants, as it does for all the other apps.

rq1 – approach applicability. to answer rq1, we evaluate the number of mutants generated by each operator. we observe in table 5 that operator mts generated the most mutants (145 in total), followed by mit (68), mh (34), and mia (9). mts generated mutants for all apps, mit for 6, and mh for 5 apps. operator mia generated mutants only for authorizer. in total, 256 mutants were generated, with anymemo having the most mutants (86) and piwigo 5. this means that the selected apps contain more code elements associated with the principle perceivable (operators mts and mit), which may indicate that: (i) developers worry about content descriptions for non-text elements more than about the principle robust (operator mia generated mutants for only one app) or operable (operator mnfd did not generate any mutant); (ii) user experience (ux) and user interface (ui) documents include a more significant amount of code elements of the perceivable principle in their guidelines.

operators mit and mia generated mutants that were not killed; only one mutant of mts was killed, and 17 out of the 34 mutants generated by mh were killed. the process using espresso was thus capable of distinguishing the great majority of the mutants generated by removing the code element :hint. analysing the alive mutants, we identified 222 as covered and 12 as unreachable. unreachable mutants were generated mainly for anymemo and are related to implementation smells like dead code. for a deeper analysis, table 6 contains the number of mutants generated by each operator divided by the kloc of each app.

table 6. efforts to build xt (columns mts, mit, mh, mia, total: number of mutants per app kloc)
app | mts | mit | mh | mia | total | a-tc | a-loc
alarmclock | 8.9 | 0.7 | 0.7 | 0.0 | 10.37 | 2 | 273
anymemo | 3.2 | 1.1 | 0.0 | 0.0 | 4.35 | 10 | 856
authorizer | 0.6 | 0.9 | 0.6 | 0.3 | 2.58 | 32 | 1,048
equate | 0.5 | 0.3 | 0.3 | 0.0 | 1.20 | 3 | 198
kolabnotes | 2.0 | 1.1 | 1.0 | 0.0 | 4.35 | 5 | 390
piwigo | 0.2 | 0.6 | 0.2 | 0.0 | 1.05 | 1 | 171
pleestracker | 12.8 | 0.0 | 0.0 | 0.0 | 12.8 | 3 | 256
average | 4.0 | 0.67 | 0.4 | 0.043 | 5.42 | 8 | 456
a-tc stands for the number of test cases added to t to obtain xt; a-loc stands for the number of loc added to t to obtain xt.
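as a consistency check on how these densities are computed (our reading, since the text does not spell out the formula): for alarmclock, the mts operator produced 12 mutants over 1,349 loc, i.e., 12 / 1.349 kloc ≈ 8.9 mutants per kloc, matching the first cell of table 6; the same computation over all 14 of its mutants gives 14 / 1.349 ≈ 10.4, in line with the reported total of 10.37.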
the last two columns of table 6 present information regarding the effort required to add new test cases so that an accessibility-mutant-adequate test suite is obtained; the last row brings average values. we can see that the operators generate a mean of 5.42 mutants per kloc and, in the worst case, 12.8 for pleestracker. notice that more mutants are generated for the largest apps in terms of loc and number of activities: anymemo, authorizer, and kolabnotes. given that the proposed operators only remove code elements, the number of mutants tends to be equal to the number of existing elements associated with the accessibility wcag success criteria. due to this characteristic, it is unlikely that the operators generate equivalent mutants. this is an advantage, because the identification of such mutants is usually costly. moreover, we found neither stillborn nor trivial mutants: the first are mutants that do not compile, and the second are mutants that crash at initialization.

we also measured the effort of adding new test cases, considering the values in table 4. as table 6 shows, authorizer demanded the most effort: it required 32 additional tests (with 1,048 a-loc), followed by anymemo, which required 10 additional tests (856 a-loc), and kolabnotes, with 5 tests (390 a-loc). these apps are the largest in terms of size.

response to rq1: the number of mutants is related to the size of the app, mainly to the number of gui elements and of code elements associated with the accessibility success criteria. operators mts and mit, related to the principle perceivable, produce more mutants, while no mutant is generated by operator mnfd, related to the operable principle. moreover, we did not observe any stillborn, trivial, or equivalent mutants.

implications: the operators are deletion-style and depend on the use of accessibility-related code elements; the number of generated mutants grows proportionally to the number of accessibility code elements used in the app. operators mts and mit generated more mutants, which may indicate that code elements related to the principle perceivable are the most used in the selected apps. our set of operators represents a first proposal, and we intend to improve it with other kinds of operators — for instance, operators that add or modify code elements — as well as with other code elements and success criteria. the proposed operators do not generate equivalent mutants due to their design, and we did not observe any stillborn or trivial mutants. this is important because such mutants imply extra cost and are very common in android mutation testing (linares-vásquez et al., 2017). we also observed espresso's limited ability to detect accessibility faults and, as a consequence, a reduced number of mutants were killed; because of this, other accessibility testing tools should be used in future versions of accessibilitymdroid. we further intend to implement mechanisms to automatically determine covered mutants. the analysis of dead mutants is a drawback of most mutation testing approaches for android apps: the great majority do not offer an automatic way to perform this task, and some do not even provide a way to consider a mutant killed.

rq2 – adequacy of existing test suites. rq2 evaluates the adequacy of the test suites with respect to the proposed operators.
the answer can shed some light on the quality of the test cases regarding accessibility faults, and on whether developers worry about testing such a non-functional property. to answer this question, table 7 brings the percentage of mutants killed and covered by t per app; unreachable mutants were not considered. on average, the original test suites were capable of killing only 5.23% of the mutants. the killed percentage reaches 20% for piwigo, the app with the fewest mutants, but it is equal to zero for five apps. the percentages of covered mutants are better: 30.24% on average. the best percentages were achieved by alarmclock (64.3%) and piwigo (60%); the other five apps achieved percentages lower than 35%.

table 7. adequacy results of original test suites
app | killed | covered
alarmclock | 0.0% | 64.3%
anymemo | 0.0% | 18.67%
authorizer | 0.0% | 12.5%
equate | 16.67% | 0%
kolabnotes | 0.0% | 22.91%
piwigo | 20% | 60%
pleestracker | 0.0% | 33.33%
average | 5.23% | 30.24%

response to rq2: the existing test suites of the studied apps killed or covered only a small fraction of the accessibility-related mutants; in other words, they had a low mutation score.

implications: in general, there are opportunities to improve the quality of gui tests in mobile apps. while code coverage and mutation testing have better support at the unit test level, more tool support is required at the gui level. as the accessibility mutants demand better test coverage at the gui level, the results presented here helped to expose those weaknesses.

rq3 – accessibility faults. by answering rq2, we observed that the existing tests obtained a small coverage of accessibility mutants, and that new tests are required to obtain adequate test suites. however, it is important to know whether such additional tests and effort improve the test quality in terms of accessibility faults revealed. rq3 aims to answer this question. table 8 shows the number of accessibility faults reported by espresso when the original (t) and extended (xt) test sets are used; the last column shows the percentage of improvement. for t, alarmclock has the most accessibility faults (126), while pleestracker has only 2; on average, we have 45.28 accessibility faults per app. concerning the mutant-adequate test suite xt, piwigo has the most faults (447); pleestracker presented the best percentage of improvement (3,650%), while the smallest improvement was obtained for alarmclock. on average, xt revealed 186.4 accessibility faults. the improvements varied from 3.2% to 3,650%.

table 8. accessibility faults detected by t and xt
app | #faults(t) | #faults(xt) | improv.
alarmclock | 126 | 130 | 3.2%
anymemo | 24 | 355 | 1,479%
authorizer | 65 | 201 | 209.2%
equate | 19 | 27 | 42.1%
kolabnotes | 43 | 70 | 62.8%
piwigo | 38 | 447 | 1,076.3%
pleestracker | 2 | 75 | 3,650%
average | 45.28 | 186.4 | 931.8%

response to rq3: mutant-adequate test suites contribute to meaningful improvements in the number of accessibility faults detected. on average, the extended test suites revealed around 932% more accessibility faults than the original test suites.

implications: the results gave evidence that the use of the mutation operators contributed to an increase in the number of revealed accessibility faults. we anticipate that the quality of the test suite is improved too, beyond the accessibility point of view.
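for reference (our reading of the improv. column, which the text does not define explicitly), the values are consistent with the relative increase (#faults(xt) − #faults(t)) / #faults(t) × 100: e.g., for alarmclock, (130 − 126) / 126 × 100 ≈ 3.2%, and for pleestracker, (75 − 2) / 2 × 100 = 3,650%.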
5 threats to validity

there are some threats to the validity of our study.

sample selection. it is not easy to guarantee the representativeness of the apps. in addition, the adopted sample contains only native android apps with espresso test suites. to mitigate this threat, we selected the apps from f-droid, a diverse set of open-source apps with recent updates; f-droid has been used in other studies (mao et al., 2016; zeng et al., 2016; gu et al., 2019).

limited oracle. the mutant analysis strategy is tied to the espresso tool. however, the proposed approach is also compatible with other tools that monitor the running app and produce accessibility logs, like mate (eler et al., 2018) and a11y ally (toff, 2018); we plan to integrate them in the future.

manual determination of covered elements. this task was performed manually and is subject to errors. to minimize this threat, the analysis was carefully conducted and double-checked.

flaws in the implementation. there may be implementation errors in any of the tools or routines used in our study, like the mdroid+ extension, the android emulator management, and espresso.

number of mutation operators. the set of accessibility mutation operators proposed represents only a fraction of all accessibility violations that can occur in a mobile app. we created this initial deletion set to validate the proposed tool, and this set of deletion mutation operators was tested and validated as effective in practice.

6 concluding remarks

this paper presented an approach for accessibility mutation testing of android apps. first, we defined a set of six accessibility mutation operators for android apps. then, for a given android app, we generated the mutants and, based on the original test suite, checked which mutants were killed or at least covered. following our approach, we extended the original test suite to cover more mutants. the empirical results show that the original test suites cover only a small part of the accessibility-related mutants; besides, mutant-adequate test suites contribute to meaningful improvements in the number of accessibility faults detected.

as future work, we plan to extend the tool support to handle apk files and commercial (closed source) apps. the mutation operators may also be described more generically so that the approach can be extended to other mobile development languages and frameworks (e.g., swift, react native, kotlin). another direction is to experiment with different oracles (e.g., mate (eler et al., 2018)) besides the espresso accessibility check used in this study. finally, different accessibility mutation operators can be defined, now focused on including and changing code elements.

acknowledgment

this work is partially supported by cnpq (andre t. endo, grant nr. 420363/2018-1, and silvia regina vergilio, grant nr. 305968/2018-1).

references

abuaddous, h. y., jali, m. z., and basir, n. (2016). web accessibility challenges. international journal of advanced computer science and applications (ijacsa).

acosta-vargas, p., salvador-ullauri, l., jadán-guerrero, j., guevara, c., sanchez-gordon, s., calle-jimenez, t., lara-alvarez, p., medina, a., and nunes, i. l. (2020). accessibility assessment in mobile applications for android. in nunes, i. l., editor, advances in human factors and systems interaction, pages 279–288, cham. springer international publishing.

acosta-vargas, p., salvador-ullauri, l., perez medina, j. l., zalakeviciute, r., and perdomo, w. (2019). heuristic method of evaluating accessibility of mobile in selected applications for air quality monitoring. in international conference on applied human factors and ergonomics, pages 485–495. springer.
alshayban, a., ahmed, i., and malek, s. (2020). accessibility issues in android apps: state of affairs, sentiments, and ways forward. in proceedings of the acm/ieee 42nd international conference on software engineering, icse '20, pages 1323–1334, new york, ny, usa. association for computing machinery.

ballantyne, m., jha, a., jacobsen, a., hawker, j. s., and elglaly, y. n. (2018). study of accessibility guidelines of mobile applications. in proceedings of the 17th international conference on mobile and ubiquitous multimedia, pages 305–315. acm.

bbc (2017). the bbc standards and guidelines for mobile accessibility. https://www.bbc.co.uk/accessibility/forproducts/guides/mobile.

brazilian government (2007). accessibility model in electronic government. https://www.gov.br/governodigital/pt-br/acessibilidade-digital/modelo-de-acessibilidade.

cisco (2017). cisco visual networking index: global mobile data traffic forecast update, 2017–2022 white paper. https://www.cisco.com/c/en/us/solutions/collateral/service-provider/visual-networking-index-vni/white-paper-c11-738429.html.

damaceno, r. j. p., braga, j. c., and mena-chalco, j. p. (2018). mobile device accessibility for the visually impaired: problems mapping and recommendations. universal access in the information society, 17(2):421–435.

delamaro, m. e., offutt, j., and ammann, p. (2014). designing deletion mutation operators. in 2014 ieee seventh international conference on software testing, verification and validation, pages 11–20.

deng, l., mirzaei, n., ammann, p., and offutt, j. (2015). towards mutation analysis of android apps. in proceedings of the eighth international conference on software testing, verification and validation workshops, icstw, pages 1–10. ieee.

eler, m. m., rojas, j. m., ge, y., and fraser, g. (2018). automated accessibility testing of mobile apps. in 2018 ieee 11th international conference on software testing, verification and validation (icst), pages 116–126.

escobar-velásquez, c., osorio-riaño, m., and linares-vásquez, m. (2019). mutapk: source-codeless mutant generation for android apps. in 2019 ieee/acm international conference on automated software engineering (ase).

gamma, e. and beck, k. (2019). the new major version of the programmer-friendly testing framework for java. https://junit.org.

google (2018). espresso. https://developer.android.com/training/testing/espresso.

google (2018). improve your code with lint checks. https://developer.android.com/studio/write/lint.

google (2020). accessibility scanner. https://play.google.com/store/apps/details?id=com.google.android.apps.accessibility.auditor&hl=en_u.

grechanik, m., xie, q., and fu, c. (2009). creating gui testing tools using accessibility technologies. in 2009 international conference on software testing, verification, and validation workshops, pages 243–250.

gu, t., sun, c., ma, x., cao, c., xu, c., yao, y., zhang, q., lu, j., and su, z. (2019). practical gui testing of android applications via model abstraction and refinement. in proceedings of the 41st international conference on software engineering, icse '19, pages 269–280. ieee press.
hartley, s. d. (2011). world report on disability (who). technical report, who and world bank.

jabbarvand, r. and malek, s. (2017). µdroid: an energy-aware mutation testing framework for android. in proceedings of the 11th joint meeting on foundations of software engineering, esec/fse, pages 208–219. acm.

jia, y. and harman, m. (2011). an analysis and survey of the development of mutation testing. ieee transactions on software engineering, 37(5):649–678.

kirkpatrick, a., connor, j. o., campbell, a., and cooper, m. (2018). web content accessibility guidelines (wcag) 2.1. https://www.w3.org/tr/wcag21/.

linares-vásquez, m., bavota, g., tufano, m., moran, k., di penta, m., vendome, c., bernal-cárdenas, c., and poshyvanyk, d. (2017). enabling mutation testing for android apps. in proceedings of the 2017 11th joint meeting on foundations of software engineering, esec/fse, pages 233–244, new york, ny, usa. acm.

lisper, b., lindstrom, b., potena, p., saadatmand, m., and bohlin, m. (2017). targeted mutation: efficient mutation analysis for testing non-functional properties. in proceedings of the 10th ieee international conference on software testing, verification and validation workshops (icstw), pages 65–68.

luna, e. and el ariss, o. (2018). edroid: a mutation tool for android apps. in proceedings of the 6th international conference in software engineering research and innovation, conisoft, pages 99–108. ieee.

mao, k., harman, m., and jia, y. (2016). sapienz: multi-objective automated testing for android applications. in proceedings of the 25th international symposium on software testing and analysis, issta 2016, pages 94–105, new york, ny, usa. association for computing machinery.

moher, d., liberati, a., tetzlaff, j., and altman, d. g. (2009). preferred reporting items for systematic reviews and meta-analyses: the prisma statement. bmj, 339.

moran, k., tufano, m., bernal-cárdenas, c., linares-vásquez, m., bavota, g., vendome, c., di penta, m., and poshyvanyk, d. (2018). mdroid+: a mutation testing framework for android.
in proceedings of the 40th international conference on software engineering: companion proceedings, pages 33–36. acm.

reda, r. (2019). robotiumtech: android ui testing. https://github.com/robotiumtech/robotium.

sejasidier (2015). guide to the development of accessible mobile applications. http://www.sidi.org.br/guiadeacessibilidade/index.html.

silva, c., eler, m. m., and fraser, g. (2018). a survey on the tool support for the automatic evaluation of mobile accessibility. in proceedings of the 8th international conference on software development and technologies for enhancing accessibility and fighting info-exclusion, dsai 2018, pages 286–293. acm.

silva, h. n., endo, a. t., eler, m. m., vergilio, s. r., and durelli, v. h. r. (2020). on the relation between code elements and accessibility issues in android apps. in proceedings of the v brazilian symposium on systematic and automated software testing, sast.

silva, h. n., prado lima, j. a., endo, a. t., and vergilio, s. r. (2021). a mapping study on mutation testing for mobile applications. software testing, verification and reliability.

su, t., meng, g., chen, y., wu, k., yang, w., yao, y., pu, g., liu, y., and su, z. (2017). guided, stochastic model-based gui testing of android apps. in proceedings of the 11th joint meeting on foundations of software engineering, esec/fse, paderborn, germany, september 4–8, pages 245–256.

toff, d. (2018). a11y ally. https://github.com/quittle/a11y-ally.

vendome, c., solano, d., liñán, s., and linares-vásquez, m. (2019). can everyone use my app? an empirical study on accessibility in android apps. in 2019 ieee international conference on software maintenance and evolution (icsme), pages 41–52.

w3c (2019). w3c accessibility standards overview. https://www.w3.org/wai/standards-guidelines/.

wei, y. (2015). mudroid: mutation testing for android apps. technical report, ucl, uk. undergraduate final year individual project.

wille, k., dumke, r. r., and wille, c. (2016). measuring the accessability based on web content accessibility guidelines. in 2016 joint conference of the international workshop on software measurement and the international conference on software process and product measurement (iwsm-mensura), pages 164–169.

yan, s. and ramachandran, p. g. (2019). the current status of accessibility in mobile apps. acm transactions on accessible computing, 12.

zeng, x., li, d., zheng, w., xia, f., deng, y., lam, w., yang, w., and xie, t. (2016). automated test input generation for android: are we really there yet in an industrial case? in proceedings of the 2016 24th acm sigsoft international symposium on foundations of software engineering, fse 2016, pages 987–992.

journal of software engineering research and development, 2021, 9:8, doi: 10.5753/jserd.2021.1893  this work is licensed under a creative commons attribution 4.0 international license.
on the test smells detection: an empirical study on the jnose test accuracy tássio virgínio  [ federal institute of tocantins | tassio.virginio@ifto.edu.br ] luana martins  [ federal university of bahia | martins.luana@ufba.br ] railana santana  [ federal university of bahia | railana.santana@ufba.br ] adriana cruz  [ federal university of lavras | adriana.cruz@estudante.ufla.br ] larissa rocha [ federal university of bahia / state univ. of feira de santana | larissa@ecomp.uefs.br ] heitor costa  [ federal university of lavras | heitor@ufla.br ] ivan machado  [ federal university of bahia | ivan.machado@ufba.br ]

abstract several strategies have supported test quality measurement and analysis. for example, code coverage, a widely used one, enables verifying whether the test cases cover as many source code branches as possible. another set of affordable strategies to evaluate test code quality exists, such as test smells analysis. test smells are poor design choices in test code implementation, and their occurrence might reduce the test suite quality. practical and large-scale test smells identification depends on automated tool support; otherwise, test smells analysis could become a cost-ineffective strategy. in an earlier study, we proposed the jnose test, automated tool support to detect test smells and analyze test suite quality from the test smells perspective. this study extends the previous one in two directions: i) we implemented the jnose-core, an api encompassing the test smells detection rules; through an extensible architecture, the tool is now capable of accommodating new detection rules or programming languages; and ii) we performed an empirical study to evaluate the jnose test effectiveness and compare it against the state-of-the-art tool, the tsdetect. results showed that the jnose-core precision score ranges from 91% to 100%, and the recall score from 89% to 100%. it also presented a slight improvement in the test smells detection rules compared to the tsdetect for the test smells detection at the class level. keywords: tests quality, test evolution, test smells, evidence-based software engineering

1 introduction

ensuring end-user satisfaction, detecting software defects before go-live, and increasing software or product quality are among the most commonly reported software testing objectives, as stated in the annual report of a global consulting firm (capgemini, 2018). recently published reports estimate the impact of poor software quality on the united states economy at over $2 trillion for the year 2020, referencing publicly available source material (cisq, 2021). such data illustrates the need for employing software testing techniques in software development processes, as they could anticipate bug identification and fixing, thus reducing their likely effects still during implementation (or even when existing functionalities are under evolution) (palomba et al., 2018; spadini et al., 2018; grano et al., 2019).

in a well-defined software engineering process, test code should co-evolve together with production code, as high-quality test code is essential to ease the maintenance and evolution of production and test code (yusifoğlu et al., 2015; guerra calle et al., 2019). however, maintaining such quality might be time-consuming and cost-ineffective (yusifoğlu et al., 2015; guerra calle et al., 2019). several approaches have been proposed in the literature to assess the quality of test suites.
for example, code coverage measurement has been widely used to check the quality of automated tests. it measures the test suite quality based on how much a test covers structural elements, such as functions, instructions, branches, and lines of code (gopinath et al., 2014). nonetheless, even with high code coverage, the test code might encompass poor design choices in its implementation, the so-called test smells. the presence of smells in test code may reduce the quality of test suites and, consequently, the production code quality (deursen et al., 2001). additionally, poorly written tests can be challenging to comprehend and onerous for testers to maintain and use for fault detection (bavota et al., 2015; grano et al., 2019).

the software testing literature has introduced a set of tools focused on validating the quality of test suites, mainly through metrics analysis. for example, codecover (available at https://codecover.org) is an open-source java tool for code coverage executed via a graphical user interface (with the eclipse ide) and the command line; tsdetect (available at https://testsmells.github.io) is a command-line tool for test smells detection. other tools use code coverage results to predict test smells, such as teredetect (negar and garousi, 2010) and tecrevis (koochakzadeh and garousi, 2010). generally, these tools have many different data outputs, which might make it hard for testers to establish a relationship between code coverage and internal test code quality. moreover, several types of test smells have not been investigated in conjunction with code coverage yet, but could also provide opportunities to improve test code quality.

in previous studies (virginio et al., 2019, 2020), we introduced the jnose test, a tool to analyze the quality of test suites from the test smells perspective. the jnose test provides an automated test strategy focused on (i) identifying possible test design flaws, (ii) analyzing the evolution of software project quality, and (iii) reducing the effort of performing quality assurance of a test suite. the jnose test integrates a conceptual framework which encompasses strategies for test smells prevention, identification, refactoring, and visualization to improve test code quality; the raide (santana et al., 2020; available at https://raideplugin.github.io) and tsvizzevolution (available at https://github.com/arieslab/tsvizzevolution) tools are part of this framework.

in this study, we propose the jnose-core, an api (application programming interface) to detect test smells in test code. it provides a flexible architecture to support the insertion of new test smells detection rules. the jnose test implements the interface methods the jnose-core provides and organizes the data flow in a web-based user interface.
in this new version, our tool: i) detects test smells at different code granularities (line, method, block, and class); ii) detects test smells more accurately according to the literature definitions; and iii) presents the outputs in a more user-friendly interface. additionally, we extended our previous work by validating the test smells detection rules implemented in the jnose test tool. we conducted an empirical evaluation with two objectives: (i) verify the jnose test accuracy compared with the tsdetect in terms of precision and recall at the class level, and (ii) verify the jnose test accuracy compared with manual analysis in terms of precision and recall at a fine-grained level. the results show that, at the test class level, the jnose test obtained slightly better results than the tsdetect for specific types of test smells, such as assertion roulette, lazy test, and eager test. when analyzing test smells at a fine-grained level, our tool showed higher accuracy in detecting the test smells' location.

the remainder of this paper is structured as follows. section 2 introduces the test smells concept and types. section 3 presents an overview of the jnose-core api. section 4 presents the jnose test, a web application for test smells detection. section 5 describes the empirical study to evaluate the jnose test accuracy. section 6 presents the results. section 7 discusses related work. section 8 presents the threats to the validity of our study. finally, section 9 draws concluding remarks.

2 background

test code development is not a trivial task (palomba et al., 2018; virginio et al., 2019). in real-world practice, developers are likely to introduce anti-patterns during test development (bavota et al., 2012; junior et al., 2020). those anti-patterns may negatively impact test code quality and maintenance and reduce its capability of detecting software faults (bell et al., 2018; spadini et al., 2020). several studies have investigated different types of test smells. initially, deursen et al. (2001) defined a catalog of 11 test smells and refactorings to remove them from the test code. next, several authors extended this catalog and analyzed the test smells' effects on production and test code (meszaros et al., 2003; bavota et al., 2012; greiler et al., 2013; bavota et al., 2015; bell et al., 2018; virginio et al., 2019; spadini et al., 2020). as a result of the researchers' efforts to identify anti-patterns, garousi and küçük (2018) listed more than 190 test smells in a literature review. in this study, we selected twenty-one types of test smells currently discussed in the literature (peruma et al., 2019):

• assertion roulette (ar). it occurs when a test method contains non-documented assertions. if an assertion fails, it can be difficult to identify which one failed;
• conditional test logic (ctl). it occurs when a test method contains conditional expressions or loop structures. conditions within the test method may alter its behavior, which leads the test to fail;
• constructor initialization (ci). it occurs when a test class contains a constructor declaration;
• default test (dt). it occurs when a test class is created by default;
• dependent test (dept). it occurs when the test being executed depends on other tests' success;
• duplicate assert (da). it occurs when a test method tests the same condition multiple times within the same test method;
• eager test (et). it occurs when a test method checks more than one method of the production class;
• empty test (ept). it occurs when a test method does not contain executable statements;
• exception catching throwing (ect). it occurs when a test method is explicitly dependent on the production method throwing an exception;
• general fixture (gf). it occurs when the test methods only access part of the test case fixture (setup method);
• ignored test (igt). it occurs when a test method is suppressed from running;
• lazy test (lt). it occurs when several test methods check the same production method;
• magic number test (mnt). it occurs when assert statements contain numeric literals;
• mystery guest (mg). it occurs when a test method utilizes external resources (e.g., a file containing test data), and thus is not self-contained;
• print statement (ps). it occurs when unit tests contain print statements;
• redundant assertion (ra). it occurs when the test method contains an assertion statement that is always true or always false;
• resource optimism (ro). it occurs when a test method makes optimistic assumptions about the existence and state of external resources;
• sensitive equality (se). it occurs in test methods that contain an equality check using a tostring() method; the test may fail when the tostring() method is changed;
• sleepy test (st). it occurs when the execution of a test method is paused for a certain period (e.g., to simulate an external event) and then continues;
• unknown test (ut). it occurs when a test method does not encompass an assertion statement;
• verbose test (vt). it occurs when the tests use too much code to do what they are supposed to do; in other words, the test code is not clean and simple.

3 jnose core

in our previous work (virginio et al., 2020), we introduced the first version of the jnose test, a web application for the detection and coverage calculation of test smells. we reused and also expanded the test smells detection rules from the tsdetect (peruma et al., 2020). therefore, the jnose test provides: (i) a graphical interface to facilitate the interaction between user and tool, (ii) the amount and location of the detected test smells, and (iii) support for test smells analysis through several project versions.

when improving the detection rules from the tsdetect, we faced some challenges regarding the coupling and dependency between the test framework and the test code. the test frameworks, specifically the junit framework (a java library for testing source code, which has advanced to the de facto standard in unit testing; available at https://junit.org/), require different implementations depending on the version used. for example, junit 4 uses the tag @ignore to disable a test class or test method, while junit 5 uses the tag @disabled. regarding the assertions, junit 4 accepts an optional error message parameter as the first argument, while junit 5 places it as the last argument in the method signature. therefore, to facilitate the expansion of the detection rules and their reuse by other tools, we implemented the jnose-core api (available at https://github.com/arieslab/jnose-core). it is beneficial for the conceptual framework we are working on to evaluate test code quality: the detection module is the framework base, and the test smells it detects are the same ones that should be removed by the refactoring module (the raide tool) and presented to the user by the visualization module (tsvizzevolution).
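the framework differences mentioned above are easy to see side by side; the snippet below is an illustrative example of our own (not taken from the paper), contrasting the two junit versions a detection rule has to accommodate:

// junit 4: the @ignore tag disables a test, and the optional failure
// message is the first assertion argument
import org.junit.Ignore;
import org.junit.Test;
import static org.junit.Assert.assertEquals;

public class CalculatorJUnit4Test {
    @Ignore("pending fix")
    @Test
    public void sumTest() {
        assertEquals("2 + 2 should be 4", 4, 2 + 2);
    }
}

// junit 5 equivalent: the @disabled tag replaces @ignore, and the
// failure message moves to the last assertion argument:
//
// import org.junit.jupiter.api.Disabled;
// import org.junit.jupiter.api.Test;
// import static org.junit.jupiter.api.Assertions.assertEquals;
//
// public class CalculatorJUnit5Test {
//     @Disabled("pending fix")
//     @Test
//     void sumTest() {
//         assertEquals(4, 2 + 2, "2 + 2 should be 4");
//     }
// }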
3.1 architecture

we designed the jnose-core as a maven project to simplify and standardize the build process (maven is a software project management and comprehension tool that can manage a project's build, reporting, and documentation from a central piece of information; available at https://maven.apache.org/). additionally, we provide a compiled version of the jnose-core that can be imported by other projects built with maven; the only requirement is to import the library in the project's pom.xml, as listing 1 shows. as a result, the jnose-core provides the methods needed to instantiate the test smells detection.

listing 1: pom.xml configuration to use jnose-core

<dependency>
    <groupId>br.ufba.jnose</groupId>
    <artifactId>jnose-core</artifactId>
    <version>0.7-SNAPSHOT</version>
</dependency>

the jnose-core is licensed under the gnu general public license, and its architecture comprises four packages, as follows (figure 1 shows the jnose-core api internal architecture):

• core. it implements the jnosecore, a facade class that receives an instance of the config interface; the config interface contains the method signatures for the test smells detection;
• detector. it implements a structure to detect the smelly elements and contains classes to support static analysis of the test code through an ast (abstract syntax tree) generated by javaparser (available at https://javaparser.org/);
• smell. it implements the detection rules for junit 4 and improves the detection rules from the tsdetect (section 2) to identify test smells at different granularity levels. a class is implemented for each type of test smell, using javaparser to collect additional information on the location and number of test smells;
• dto (data transfer object). it implements the classes responsible for transferring data among the packages.
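to give a feel for how a client project might drive the api after importing the dependency in listing 1, the sketch below instantiates the facade described above; since the paper only names the jnosecore class and the config interface, the config implementation, method names, and result type used here are hypothetical:

import java.util.List;

// hypothetical client of the jnose-core facade; only the JNoseCore class
// and the Config interface names come from the paper, the rest is assumed
public class SmellReport {
    public static void main(String[] args) {
        // assumed config implementation selecting the smells of interest
        Config config = new JUnit4Config(List.of("AssertionRoulette", "EagerTest"));

        // facade class receiving an instance of the config interface
        JNoseCore jnose = new JNoseCore(config);

        // assumed detection entry point: analyzes a project directory and
        // returns one entry per detected smell with its location
        List<String> smells = jnose.detect("/path/to/project");
        smells.forEach(System.out::println);
    }
}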
3.2 detection rules

we revisited the test smells definitions in the literature to identify how we should improve the detection rules from the tsdetect. table 1 shows the granularity levels that we defined to detect the exact location of a test smell in the test code: (i) line, for test smells that occur in a specific line; (ii) block, for test smells that occur at the level of a statement block, e.g., try/catch and conditional statements; (iii) method, for test smells that occur at the method level; and (iv) class, for test smells that occur at the test class level.

table 1. test smells detection rules
name | detection rule | granularity
assertion roulette | a line with assertion statements without the explanation/message parameter | line
constructor initialization | a method that is a constructor declaration | method
conditional test logic | a code block with conditional statements | block
duplicate assert | a line with an assertion whose parameters equal those of another assertion inside the same test method | line
default test | a method called exampleunittest() or exampleinstrumentedtest() | method
dependent test | a method that depends on the previous execution of another test method | method
empty test | a method that does not contain a single executable statement | method
eager test | a line that contains a call to another production method | line
exception catching throwing | a block that contains either a throw statement or a catch clause | block
general fixture | a line with a field instantiated within the setup() method that is not utilized by all test methods | line
ignored test | a method that contains the @ignore annotation | method
lazy test | a line of a method that calls the same production method as another test method | line
mystery guest | a method that accesses object instances of file and database classes | method
magic number test | a line with an assertion method that contains a numeric literal as an argument | line
print statement | a line that invokes the print(), println(), printf(), or write() method of the system class | line
redundant assertion | a line containing an assertion statement in which the expected and actual parameters are the same | line
resource optimism | a method that uses an external resource without checking the state of the object | method
sensitive equality | a method that contains an assertion that invokes the tostring() method of an object | method
sleepy test | a line that invokes the thread.sleep() method | line
unknown test | a method that uses the @test annotation but does not contain an assertion statement | method
verbose test | a method with more than 30 lines, counting non-executable statements and annotations | method

additionally, we made improvements in the test smells detection rules. we next detail the main modifications we performed (an illustrative smelly test exercising some of these rules is shown after the list):

• nested structures. we improved the rules for detecting the ctl, ect, and mnt test smells to consider nested structures. when the tool reports a nested conditional structure as one test smell, it might be hard to identify at first glance which part of the test code needs refactoring: if the nested conditional is too long, the user may refactor only parts of it and, when rerunning the tool, will see that the problem is still there, making the refactoring process longer. therefore, the tool presents one test smell for each structure;
• empty or non-assertive. the ut and ept test smells have similar definitions: the ut test smell identifies methods without assertions, and the ept test smell identifies methods with no executable statements. test methods without a body neither contain executable statements nor assertions. therefore, we added another rule to separate both definitions: the ut test smell identifies methods that contain a body but no assertions;
• general fixture. the gf test smell occurs when test methods use only part of the setup method, which concerns the cohesion among the test class's methods. therefore, we improved the detection rules to check whether all the test class methods use the setup fixtures. this allows the user to identify the test method to which a fixture should be moved;
• missing structures. each version of the test framework requires the static analysis of different code structures; the assert structures used in junit 3 are different from those in junit 4, which in turn differ from junit 5.
additionally, we made improvements in the test smells detection rules. we next detail the main modifications we performed:

• nested structures. we improved the rules for detecting the ctl, ect, and mnt test smells to consider nested structures. when the tool reports a nested conditional structure as one test smell, it might be hard to identify at first glance which part of the test code needs refactoring. if the nested conditional is too long, the user may refactor only parts of it; when rerunning the tool, the user will see that the problem is still there, making the refactoring process longer. therefore, the tool presents one test smell for each structure;
• empty or non-assertive. the ut and ept test smells present similar definitions. the ut test smell identifies methods without assertions, and the ept test smell identifies methods without executable statements. test methods without a body contain neither executable statements nor assertions. therefore, we added another rule to separate both definitions: the ut test smell identifies only methods that contain a body and no assertions;
• general fixture. the gf test smell occurs when test methods use only part of the setup method, which reflects low cohesion among the test class's methods. therefore, we improved the detection rules to show whether all the test class methods use the setup fixtures. this allows the user to identify the test method to which a fixture should be moved;
• missing structures. each version of the test framework requires the static analysis of different code structures. the assert structures used in junit 3 are different from those in junit 4, which in turn differ from junit 5. therefore, to improve the detection rules for junit 4, we added the code structures that were missing to detect the ctl, ar, da, and ect test smells;
• methods overload. similar to the preceding item, there are differences among the junit versions regarding overloaded methods. when analyzing test cases written with junit 3, we were not concerned about overloaded methods. however, to support the current detection rules for junit 4, we needed to improve the ar and da test smells to handle the overloaded methods.

4 jnose test

the jnose test (available at https://jnosetest.github.io) enables test code quality analysis through test smells detection and code coverage over several software project versions. therefore, it is possible to compare whether a project's test quality has improved or declined throughout its life cycle. the jnose test operation involves three key processes (figure 2): (i) data input, which receives the settings for the tool execution, i.e., the list of types of test smells, the analysis mode (by testclass, by testsmell, by testfile, and evolution), and the project to be analyzed; (ii) project analysis, which calls the jnose-core, an api that performs the project analysis according to the selected analysis mode; and (iii) data output, which shows the execution status and the analysis results.

figure 2. schematic overview of the jnose test tool and its main features

4.1 processes description

java development kit (jdk) 11 and maven 3 (or later) are necessary to install the jnose test. upon installation, the user can use jetty (embedded in maven) to build and run the jnose test.

after starting the tool, the user must configure the data input (figure 2). first, the user should import the projects to be analyzed (figure 3a step 1). the jnose test clones the repository directly from github and allows the user to manage it (figure 3a step 2). second, the user selects the analysis mode, i.e., by testclass, by testsmells, by testfile, or evolution (figure 3a step 3). each analysis mode provides a menu where the user chooses the repositories to be analyzed. by default, the tool detects twenty-one types of test smells, but the user can configure this feature as well (figure 3a step 4).

after completing the project import and defining the detection settings, the tool starts the project analysis (figure 2). for each analysis mode, the jnose test tool presents an interface with (i) a list of cloned projects (figure 3b step 1), (ii) a menu with settings specific to the analysis mode (figure 3b step 2), and (iii) a menu with the data output options (figure 3b step 3). the project analysis considers the analysis mode selected by the user, described below.

(1) by testclass. in the data input process, the user can enable the coverage metrics calculation and select the projects to be analyzed. then, to analyze the project by test class, the project analysis calls the jnose-core and optionally executes the code coverage module. finally, the data output process generates a view that contains a table with the number of test smells by test class. that table presents a row for each test class, and each column represents the type of parameter collected: project name, test class and production class location, twenty-one columns for the types of test smells, the number of test class lines, the number of test methods, and five columns with coverage data.
that table can be downloaded as a .csv file. additionally, the user can view a chart, or download it as a .png file, with the amount of each test smell in the project.

(2) by testsmells. the project analysis process only calls the jnose-core to analyze the project by test smell. during the data input process, the user needs to select the projects to explore. unlike the previous analysis, by testsmells provides the exact location of each test smell. lastly, the data output offers a view with the data analysis results, which can also be downloaded as a .csv file. each row of the table represents a test smell, and it has five columns to show the type of parameter collected: the project name, the test class location, the production class location, the test smell name, and the test smell location.

(3) by testfile. the project analysis process only calls the jnose-core to analyze the project by test file. during the data input process, the user should select a test class and, optionally, its respective production class. although the production class selection is an optional feature, the eager test and lazy test test smells are not detected without it. then, the data output provides a view containing a row for each detected test smell and its location.

(4) evolution. the project analysis process executes the git mining module and the jnose-core to analyze the project by version. during the data input, the user should select the projects to explore and the search to be applied (by commits or by tags). this analysis provides the test smells detected for each project version, in addition to data about the author who committed each test smell. the data output process provides a view containing the data analysis results by test smells, downloadable as a .csv file. the table rows represent the test classes by commit. the columns encompass the following parameters: project name, test class and production class location, number of test smells, commit identification, authorship, date, and message. additionally, the user can view a chart, and download it as a .png file, with the amount of test smells in each project version or the number of test smells committed by each author. the tool also automatically calculates the authorship of a test smell by guilt, i.e., it blames the tester who last modified the method and did not fix the smell.

each analysis mode allows a different data visualization. therefore, the data output generates tables or charts depending on the analysis mode. tables are generated for all analysis modes (figure 3c). charts are generated for by testclass and evolution. by testclass charts present the total amount of test smells inserted in a project, and evolution charts present the amount of test smells by project version or by author.

figure 3. jnose test process execution: (a) data input: cloning projects from github; (b) project analysis: configuring the by testclass analysis mode; (c) data output: an excerpt of the table with the by testclass results
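the by testclass row described above can be pictured as the plain data holder below; all field names are ours, chosen only to mirror the column description, and are not taken from the jnose test source code.

// models one row of the by testclass output table; illustrative names only
public class TestClassRow {
    String projectName;
    String testClassLocation;
    String productionClassLocation;
    int[] smellCounts = new int[21];     // one column per type of test smell
    int testClassLines;
    int testMethods;
    double[] coverage = new double[5];   // the five coverage columns (e.g., ic, bc, lc, cc, mc)
}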
4.2 tool architecture

the jnose test is implemented as a java project and comprises five packages, as figure 4 (packages of the jnose test) shows: (i) base, responsible for instantiating the jnose-core interface implementation and calculating the coverage metrics; (ii) page, responsible for presenting the web pages and their content; (iii) dtolocal, responsible for encompassing the classes used as dto; (iv) entity, responsible for the persistence of the domain objects in the database; and (v) business, responsible for applying the business rules to present the results.

the base package implements the project analysis (figure 3a), which is split into three other packages, as follows:

• coverage. it applies the rules necessary to calculate coverage. it runs the jacoco library (available at https://www.eclemma.org/jacoco/) to calculate code coverage in the java language. it performs dynamic analysis of the production code branches (bc), instructions (ic), lines (lc), complexity (cc), and methods (mc) to determine which ones are either missed or covered by the tests (virginio et al., 2019);
• git mining. it applies the business rules for github mining. it uses the github api for java library (available at https://github-api.kohsuke.org/) to clone the projects from github and extract information about the project's tags, commits, and authors;
• jnose-core. it performs test code static analysis through an ast generated by javaparser (available at https://javaparser.org/). it then extracts information about the code structure to apply the rules for the test smells detection, and it collects additional information about the location and number of test smells. the detection rules were improved from the tsdetect tool (section 2) to identify test smells at different granularity levels (table 1).

the jnose test interface was implemented in the page package based on apache wicket (available at https://wicket.apache.org/), a framework for web application development in java. we also used html5 and css3 to develop the web pages. this package implements the data input (figure 2).

the business package implements utility classes responsible for generating the results. it is possible to generate a different type of report for each analysis mode. this package implements the data output (figure 2).

in the dto package, we have the classes used to transfer data among the project layers. that package implements the communication among data input, project analysis, and data output (figure 2). additionally, a local database stores the data generated by those processes, with the persistence rules implemented in the entity package.

the jnose test execution uses parallel processes, i.e., the tool creates threads for each uploaded project, for each test class, and so on. with parallel processing, the jnose test can analyze a massive set of projects in a short time (virginio et al., 2019).
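the per-class parallelism could be organized as in the sketch below; this is our reading of the description, not the tool's actual code, and analyzetestclass stands in for the jnose-core invocation.

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelProjectAnalysis {
    // submits one detection task per test class, mirroring the
    // "one thread per project, per test class" behavior described above
    static void analyzeProject(List<String> testClasses) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());
        for (String testClass : testClasses) {
            pool.submit(() -> analyzeTestClass(testClass)); // hypothetical jnose-core call
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }

    static void analyzeTestClass(String testClass) {
        // placeholder for the static analysis of a single test class
    }
}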
4.3 running example

in previous work, we carried out an experimental study to verify the correlation between the coverage metrics and test smells. we selected eleven software projects to perform that study, in which we collected twenty-one test smells and five coverage metrics using the jnose test. this section presents an example considering the different types of analysis modes supported by the jnose test. we used the commons-io project (available at https://github.com/apache/commons-io), release 2.7-rc1, a library of utilities to assist i/o development. we next discuss each supported mode.

4.3.1 by testclass analysis

we ran the jnose test by testclass to analyze which types of test smells achieve the highest diffusion over the commons-io project. therefore, we took the following steps: (i) select all types of test smells; (ii) select the project path; and (iii) enable code coverage. the tool returned 58 test classes. we checked the number of classes where each test smell was present to understand the test smell type diffusion. for example, the ect test smell was present in 23 classes, followed by the ar test smell in 17 test classes and the et test smell in 16 test classes. each type of test smell can occur many times in a test class. those three types of test smell presented the highest occurrence in the project, counting 316, 175, and 157 instances, respectively.

table 2 shows the five test classes with the highest numbers of ect, ar, and et test smells. for example, the test class proxycollectionwritertest contains the highest number of those test smells. additionally, most test classes achieved good code coverage when considering the ic, lc, and mc coverage metrics (>70%). therefore, even with high coverage, the test code might present low quality.

table 2. classes with high diffusion of test smells by testclass

| testfilename | ... | loc | met | ut | igt | ro | ... | st | lt | da | et | ar | ctl | ci | dt | ept | ect | gf | mg | ps | dpt | ic | bc | lc | cc | mc |
| proxycollectionwritertest | ... | 448 | 23 | 1 | 0 | 0 | ... | 0 | 61 | 1 | 23 | 21 | 1 | 0 | 0 | 0 | 23 | 0 | 0 | 0 | 0 | 72 | 0 | 76 | 100 | 100 |
| trewritertest | ... | 448 | 23 | 1 | 0 | 0 | ... | 0 | 30 | 1 | 2 | 21 | 1 | 0 | 0 | 0 | 23 | 0 | 0 | 0 | 0 | 100 | 0 | 100 | 100 | 100 |
| proxywritertest | ... | 275 | 21 | 3 | 0 | 0 | ... | 0 | 23 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 21 | 0 | 0 | 0 | 0 | 83 | 0 | 87 | 93 | 93 |
| boundedreadertest | ... | 246 | 22 | 1 | 1 | 1 | ... | 0 | 48 | 1 | 8 | 3 | 2 | 0 | 0 | 0 | 16 | 0 | 1 | 0 | 0 | 100 | 100 | 100 | 100 | 100 |
| endianutilstest | ... | 316 | 22 | 1 | 0 | 0 | ... | 0 | 46 | 8 | 20 | 15 | 1 | 0 | 0 | 0 | 14 | 0 | 0 | 0 | 0 | 100 | 100 | 100 | 100 | 100 |

4.3.2 by testsmell

once we found that the ect, ar, and et test smells had the highest diffusion in the commons-io project test classes, we may improve the test code quality by fixing those problems. then, we executed the jnose test by testsmell by taking the following steps: (i) select the ect, ar, and et test smells; and (ii) select the project. table 3 shows an excerpt of the results filtered by the proxycollectionwritertest test class.

table 3. test smells location in proxycollectionwritertest

| testfilename | ... | testsmell | methodlocationname | lines |
| proxycollectionwritertest | ... | ar | testarrayioexceptiononappendchar1 | 50,51 |
| proxycollectionwritertest | ... | ar | testarrayioexceptiononappendchar2 | 66,67 |
| proxycollectionwritertest | ... | ar | testarrayioexceptiononappendcharse | 82,83 |
| proxycollectionwritertest | ... | et | testarrayioexceptiononappendchar1 | 50,51 |
| proxycollectionwritertest | ... | et | testarrayioexceptiononappendchar2 | 66,67 |
| proxycollectionwritertest | ... | et | testarrayioexceptiononappendcharse | 82,83 |
| proxycollectionwritertest | ... | ect | testarrayioexceptiononappendchar1 | 45-52 |
| proxycollectionwritertest | ... | ect | testarrayioexceptiononappendchar2 | 61-69 |
| proxycollectionwritertest | ... | ect | testarrayioexceptiononappendcharse | 77-84 |

4.3.3 by testfile

in the previous example (by testsmells), we filtered the results to present only the ones related to the proxycollectionwritertest test class. in the by testfile analysis, that class can be analyzed individually. therefore, we executed the jnose test by taking the following steps: (i) select the ect, ar, and et test smells; and (ii) select the proxycollectionwritertest and proxycollectionwriter files. the results are the same as those of the filter presented in table 3.

listing 2 shows the proxycollectionwritertest test class with the testarrayioexceptiononappendchar1() test method (lines 39-53). we observed that the assertequals() method is called twice within the test method (lines 50-51). each call checks a different condition, but there is no explanation message for either of them. thus, if the test method fails, there is no clue to identify which assertion caused the failure. that issue refers to the ar test smell. moreover, those assertions are also related to the ect test smell because they may fail when a specific exception occurs. furthermore, a test method is supposed to check just one production class method; otherwise, the code has one et test smell (proxycollectionwriter() on line 43 and append() on line 46).

37 public class ProxyCollectionWriterTest {
38
39     @Test
40     public void testArrayIOExceptionOnAppendChar1() throws IOException {
41         final Writer badW = new BrokenWriter();
42         final StringWriter goodW = mock(StringWriter.class);
43         final ProxyCollectionWriter tw = new ProxyCollectionWriter(badW, goodW, null);
44         final char data = 'a';
45         try {
46             tw.append(data);
47             fail("Expected " + IOException.class.getName());
48         } catch (final IOExceptionList e) {
49             verify(goodW).append(data);
50             assertEquals(1, e.getCauseList().size());
51             assertEquals(0, e.getCause(0, IOIndexedException.class).getIndex());
52         }
53     }
listing 2: proxycollectionwritertest test class
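as an aside, the ar smell in listing 2 could be removed by giving each assertion an explanation message, as in the rewrite of lines 50-51 below; the messages are ours, not a fix taken from the commons-io project.

// lines 50-51 of listing 2 rewritten with explanation messages, so that a
// failure immediately identifies the offending assertion
assertEquals("cause list should contain exactly one exception",
        1, e.getCauseList().size());
assertEquals("the broken writer should be reported at index 0",
        0, e.getCause(0, IOIndexedException.class).getIndex());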
4.3.4 evolution analysis

the evolution analysis might help us identify whether the commons-io project has improved over time. we took the following steps to perform this analysis: (i) select all test smells, (ii) select the analysis by commit, and (iii) select the project path. the project has 2,337 commits, 52 releases, and 56 contributors from its beginning until release 2.7-rc1. we filtered the results for the five test classes with the most ect, et, and ar test smells (table 4).

table 4. classes with high diffusion of test smells evolution

| testfilename | ... | testsmells | commitid | commitname | commitdate |
| proxycollectionwritertest | ... | 153 | b739ce7c | adam retter | 03:39:47 2020 |
| proxycollectionwritertest | ... | 153 | bcb36041 | david georg | 00:09:03 2018 |
| trewritertest | ... | 101 | b739ce7c | adam retter | 03:39:47 2020 |
| trewritertest | ... | 101 | bcb36041 | david georg | 00:09:03 2018 |
| proxywritertest | ... | 59 | b739ce7c | adam retter | 03:39:47 2020 |
| proxywritertest | ... | 59 | bcb36041 | david georg | 00:09:03 2018 |
| boundedreadertest | ... | 92 | b739ce7c | adam retter | 03:39:47 2020 |
| boundedreadertest | ... | 96 | 51f13c84 | kristian rose | 15:36:15 2016 |
| boundedreadertest | ... | 83 | 9a9b8385 | gary d. greg | 01:17:05 2014 |
| endianutilstest | ... | 118 | b739ce7c | adam retter | 03:39:47 2020 |
| endianutilstest | ... | 117 | 8940848g | gary d. greg | 18:47:06 2018 |

figure 5. evolution of the commons-io project and classes with high diffusion of test smells

figure 5 shows the evolution of those classes and of the project. the proxycollectionwritertest, trewritertest, and proxywritertest test classes are stable, as no test smell was either inserted or fixed. however, the boundedreadertest test class received novel test smells during 2014-2016, which were fixed during 2016-2020. we observed that the number of test smells in the project increased over time, which might indicate that the people involved in the project's test suite development have not worked to get rid of test smells yet. in addition, authorship is calculated by fault, so the authors in this example might not have inserted all the detected test smells.

5 empirical evaluation

this empirical evaluation aims to investigate the jnose test accuracy in detecting test smells. we designed the empirical study in four steps, as figure 6 shows: (i) dataset selection, in which we defined the test classes to analyze; (ii) oracle definition, in which we manually detected the test smells instances; (iii) data collection, in which we applied the jnose test and tsdetect to collect the test smells instances; and (iv) data analysis, in which we analyzed the collected data to investigate our objectives.

figure 6. steps to conduct the experiment

5.1 dataset selection

for this analysis, we used the dataset made available by peruma et al. (2020), which contains 65 test classes extracted from github projects. as we initially reused the jnose test detection rules from the tsdetect, we decided to use the same dataset they used, to perform a fair comparison between both tools and assess the jnose test effectiveness. to build the dataset, peruma et al. (2020) selected android apps that were neither duplicated nor forked. upon identifying the smells in a test file, they randomly selected 65 test classes from the selected projects and followed the definitions to detect the test smells. although the tsdetect implements detection rules for twenty-one types of test smells, only nineteen were validated: it did not detect the dt and dpt test smells. the same limitation applies to our study.

since we did not have access to the results of the manual detection performed by peruma et al. (2020), we created a new oracle using the same test and production classes for this study. even if we had access to the peruma et al. (2020) manual detection results, we would still have to detect the test smells at a fine-grained level to validate the jnose test. the reason is that the jnose test detects the test smells' exact location, rather than just their presence (like the tsdetect).

5.2 oracle definition

to manually detect the test smells instances, we followed a not fully crossed design to assign coders to the subjects, i.e., different subjects are analyzed by different subsets of coders (hallgren, 2012). the subjects are the 65 test classes, and four authors of this study served as coders. the coders are experts in test smells, with at least three years of experience. additionally, their java programming experience ranged from 4 to 15 years, including unit test development. we organized the coders into two groups of two coders each, where one group analyzed 32 test classes and the other group analyzed 33 test classes. two coders individually analyzed each test class. they collected data regarding the test smells' type and location, following the definitions from table 1. as a result, each coder generated a document with all the test smells detected. subsequently, the coders compiled the individual records into one document after discussing the divergences. the review process of the manually detected test smells was time- and effort-consuming (~60 minutes). the final oracle version supports the detection of eighteen types of test smells: in addition to the non-existence of the dpt and dt test smells in the dataset, previously reported by peruma et al. (2020), we did not detect any igt test smell instances.
the analysis process of the test classes and the discussion about the classification divergences took about 60 hours.

5.3 data collection

data collection consisted of detecting test smells in the 65 test classes in two different ways: detection with the tsdetect and detection with the jnose test.

detection with tsdetect. we downloaded the tsdetect version 2.0 to collect the data. it executes three modules: (i) the test file detector, to detect the test classes; (ii) the test file mapping, to link the test classes to production classes; and (iii) the tsdetect, to detect the test smells. all modules were executed sequentially by command line in the terminal. as a result, the tsdetect generates a file that contains a boolean value for each type of test smell detected in each test class. therefore, the result provided by the tsdetect has a class-level granularity. the detection process took about 7 minutes, considering the tool execution time and the participants' familiarity with the operating system terminal to run the necessary commands.

detection with jnose test. we used the jnose test version 2.1 to detect the test smells. after running the tool, the output file encompassed each test smell detected for each test class. the test smells detection granularity followed table 1. the automated detection with the jnose test took about 1 minute, due to the unified process to detect the test classes, production classes, and test smells. a friendly graphical interface makes this process easier.

5.4 data analysis

we used the oracle to calculate the jnose test and tsdetect accuracy against the manual analysis. the tools present distinct granularity levels to detect test smells. the tsdetect indicates whether a test class contains a test smell instance, i.e., it returns a boolean value for each test smell in a class. the jnose test detects all instances of a test smell with their exact location (line, block, method, or class). therefore, we carried out what follows:

1. we compared the jnose test and tsdetect accuracy considering the class level. we treated the jnose test output to show boolean values at the class level to compare with the tsdetect (a sketch of this reduction follows the list). as the jnose test detection rules were reused from the tsdetect, our goal is to determine the extent to which we improved those detection rules. in this comparison, the accuracy is given at the class level in terms of precision and recall.

2. we compared the jnose test and manual analysis accuracy considering a fine-grained level. for example, the ar test smell is detected at the line level of granularity; therefore, we collected data at the line level both manually and automatically. our goal is to show the jnose test accuracy in indicating the test smells' location. therefore, we provide the accuracy value at a fine-grained level in terms of precision and recall.
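for step 1 above, the fine-grained jnose test output has to be collapsed into one boolean per test class and smell type before it can be compared with the tsdetect output; a minimal sketch of that reduction, with hypothetical types, is shown below.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ClassLevelReduction {
    static class Detection {       // hypothetical fine-grained record produced by the jnose test
        String testClass;
        String smellType;
        int line;                  // exact location, unused at the class level
    }

    // true for a test class if it contains at least one instance of the given
    // smell type, which is exactly the boolean view the tsdetect reports
    static Map<String, Boolean> toClassLevel(String smellType, List<Detection> detections) {
        Map<String, Boolean> byClass = new HashMap<>();
        for (Detection d : detections) {
            if (d.smellType.equals(smellType)) {
                byClass.put(d.testClass, true);
            }
        }
        return byClass;
    }
}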
6 results

this section reports the results of our empirical study. the data, for replication purposes, are available online (virgínio et al., 2021).

6.1 comparison between jnose and tsdetect

table 5 reports the accuracy, precision, recall, and f1-score obtained when detecting test smells with the jnose test and the tsdetect. this comparison was made at the test class level.

table 5. jnose test and tsdetect comparison class-level

| test smell | accuracy (%) jnose | accuracy (%) tsdetect | precision (%) jnose | precision (%) tsdetect | recall (%) jnose | recall (%) tsdetect | f1-score (%) jnose | f1-score (%) tsdetect |
| ar | 100 | 75.38 | 100 | 90 | 100 | 75 | 100 | 78 |
| ci | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
| ctl | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
| da | 98.46 | 96.92 | 99 | 98 | 98 | 97 | 99 | 97 |
| ect | 100 | 46.15 | 100 | 92 | 100 | 46 | 100 | 55 |
| et | 95.38 | 86.15 | 95 | 87 | 95 | 86 | 95 | 86 |
| ept | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
| gf | 98.46 | 98.46 | 99 | 99 | 98 | 98 | 99 | 99 |
| lt | 100 | 93.85 | 100 | 94 | 100 | 94 | 100 | 94 |
| mg | 90.77 | 90.77 | 92 | 92 | 91 | 91 | 89 | 89 |
| mnt | 95.38 | 90.77 | 96 | 92 | 95 | 91 | 95 | 90 |
| ps | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
| ra | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
| ro | 89.23 | 89.23 | 91 | 91 | 89 | 89 | 88 | 88 |
| se | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
| st | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
| ut | 100 | 93.85 | 100 | 94 | 100 | 94 | 100 | 94 |

the results obtained with the tsdetect diverge from those reported by peruma et al. (2020). that study yielded precision values from 85.71% to 100% and recall values from 95% to 100%, and could detect nineteen types of test smells. when using our oracle, the tsdetect achieved a precision from 87.71% to 100% and a recall from 46% to 100% for eighteen types of test smells. as we mentioned earlier, we did not detect any igt test smell instance with either of the tools. those divergences highlight the challenges of building an oracle, due to the different interpretations that a coder may have of the test smells definitions.

regarding the results obtained with the jnose test, the precision ranged from 91% to 100%, and the recall from 89% to 100%, for the detection of eighteen types of test smells. as we reused the tsdetect detection rules, we measured the improvements we achieved. considering the f1-score metric, the jnose test presented an accuracy improvement of 45% for the ect test smell, followed by 22% for the ar test smell, 11% for the vt test smell, 9% for the et test smell, 6% for the lt and ut test smells, 5% for the mnt test smell, and 2% for the da test smell. the other test smells detection rules did not present any relevant improvement at the test class level.

next, we show the reason for the divergence between the results obtained by the tools for the ect test smell detection. the jnose test considers three compliant solutions to handle exceptions (listing 3): (i) the use of the @test annotation with the expected parameter (lines 1-4), (ii) the use of the assertthrows statement (lines 6-9), or (iii) throwing the exception in the method signature (lines 11-14). as a non-compliant solution, it considers the try/catch structure within the method body (lines 16-23). the tsdetect considers both the try/catch structure and the throws clause in the method signature as non-compliant solutions (lines 11-23).

regarding the ar test smell, we identified that the tsdetect does not consider the junit overloaded methods when analyzing an assert statement. for example, the assertequals method (listing 4) asserts that (i) two objects are equal (lines 1-9) or (ii) two objects are equal within a positive delta (lines 11-19). the optional value is a string that describes the assertion. the tool simplifies the number of parameters expected by the assert statement: it detects as a test smell only methods with two parameters (lines 1-4). the problem occurs because the tool always classifies the assertequals as a non-test smell when the assert has three parameters; however, it is necessary to verify the fourth parameter to decide whether or not it is a test smell. we improved the jnose test in this direction.
additionally, there was a conflict in the definitions of the ept and ut test smells. the ept test smell is a test method without executable statements (an empty method). the ut test smell is a test method with executable statements but no assertions. the tsdetect considers methods without a body as both ept and ut. therefore, we implemented the rules necessary to differentiate those test smells.

we also performed some minor fixes to detect other types of test smells. for example, for the vt test smell, the tsdetect considers a class with more than 123 lines as one verbose test. as the jnose test detects the test smells at a fine-grained level, we defined that a test method with more than 30 lines is verbose. therefore, we found more instances because of our definition.

1 @Test(expected = Exception.class)
2 public void tag_usage() {
3     // some code
4 }
5
6 @Test
7 void throws_statement_usage() {
8     assertThrows("exception message", Exception.class, parameter);
9 }
10
11 @Test
12 public void throws_signature_usage() throws Exception {
13     // some code
14 }
15
16 @Test
17 public void try_catch_usage() {
18     try {
19         // some code
20     } catch (MyException e) {
21         Assert.fail(e.getMessage());
22     }
23 }
listing 3: (non)compliant solutions for ect considered by jnose test

1 @Test
2 public void two_parameters() {
3     assertEquals(float expected, float actual)
4 }
5
6 @Test
7 public void three_parameters_with_message() {
8     assertEquals(String message, float expected, float actual)
9 }
10
11 @Test
12 public void four_parameters() {
13     assertEquals(String message, float expected, float actual, float delta)
14 }
15
16 @Test
17 public void three_parameters_no_message() {
18     assertEquals(float expected, float actual, float delta)
19 }
listing 4: solutions for ar considered by jnose test

6.2 jnose and manual analysis comparison

table 6 reports the accuracy, precision, recall, and f1-score obtained when comparing the test smells detected by the jnose test against the manual analysis. this comparison considered the granularity level defined for each test smell.

table 6. jnose test and manual analysis comparison fine granularity level

| test smell | accuracy (%) | precision (%) | recall (%) | f1-score (%) |
| ar | 100 | 100 | 100 | 100 |
| ci | 100 | 100 | 100 | 100 |
| ctl | 100 | 100 | 100 | 100 |
| da | 94.12 | 100 | 94 | 97 |
| ect | 100 | 100 | 100 | 100 |
| et | 89.13 | 100 | 89 | 94 |
| ept | 100 | 100 | 100 | 100 |
| gf | 90 | 100 | 90 | 95 |
| lt | 96.55 | 100 | 97 | 98 |
| mg | 50 | 100 | 50 | 67 |
| mnt | 94.74 | 100 | 95 | 97 |
| ps | 100 | 100 | 100 | 100 |
| ra | 100 | 100 | 100 | 100 |
| ro | 47.06 | 84 | 47 | 60 |
| se | 100 | 100 | 100 | 100 |
| st | 100 | 100 | 100 | 100 |
| ut | 100 | 100 | 100 | 100 |
| vt | 100 | 100 | 100 | 100 |

at the fine-grained level, the jnose test precision score ranges from 84% to 100%, and the recall ranges from 47% to 100%. at the class level, the detection difficulties related to specific cases are not evident, because the tool returns a single boolean value for each test smell in the whole test class. however, when we performed a more detailed test smells detection, we noticed some test code-specific characteristics that the tool does not detect.

the most divergent results between the class- and fine-granularity levels are the mg and ro test smells. at the class level, those test smells have an accuracy of 90.77% and 89.23%, respectively. however, at the fine-grained level, those test smells present an accuracy of 50% and 47.06%, respectively. both test smells deal with external resources. a test method that makes optimistic assumptions about the existence of external resources has the ro test smell (listing 5, lines 10-21). a test method that uses external resources has the mg test smell (listing 5, lines 2-5).
as the jnose test performs test code static analysis, we only considered the direct calls to external resources (listing 5, lines 1-15). however, if a test method calls a production class from any part of the project and that class calls external resources, the test class uses external resources indirectly (listing 5, lines 17-21). in this scenario, the mg and ro test smells need additional work to determine the indirect calls.

we also identified a specific characteristic that can produce false positive instances of the da test smell. that false positive occurs when a test method uses an assertion structure implemented by a json library that is similar to the assertion structure implemented by junit. this is because junit has the assertthat(string reason, t actual, m matcher) method, while the jsonassert library implements assertthat(string).contains(string). when performing the static analysis, all the statements that start with assert were considered junit assertions. therefore, we may improve the tool by detecting the libraries imported in the test class; however, the tool might still miss test smells instances when a test class uses yet another assert library.

other types of test smells required minor fixes. the lt and et test smells miss some instances due to default constructors. we considered that, in the same way that different test methods should not call the same production class method, a class should not be instantiated several times in different test methods; if many test methods need to instantiate the same object, that instantiation should be moved to a setup method. therefore, we need to improve the jnose test to detect calls to default constructors.

1 @Test
2 public void external_file() {
3     File file = openFile("config.xml");
4     if (file.exists()) {
5         XmlPullParser config = XmlParserFactory.fromFile(file);
6         // some code
7     }
8 }
9
10 @Test
11 public void external_file_without_checking() {
12     File file = openFile("config.xml");
13     XmlPullParser config = XmlParserFactory.fromFile(file);
14     // some code
15 }
16
17 @Test
18 public void external_resource_indirectly() {
19     XmlReader reader = new XmlReader("xml/config.xml");
20     // some code
21 }
listing 5: mystery guest and resource optimism
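the da false positive discussed above can be pictured with the two statements below: both start with the token assert, but only the first is a junit assertion. the second call uses an assertj-style fluent api for illustration, since the paper names a json assertion library without showing its code.

import static org.hamcrest.CoreMatchers.equalTo;
import static org.assertj.core.api.Assertions.assertThat;

public class DuplicateAssertFalsePositive {
    void example(String actual, String expected, String json) {
        // junit assertion: assertThat(String reason, T actual, Matcher matcher)
        org.junit.Assert.assertThat("payload should match", actual, equalTo(expected));

        // fluent assertion from another library; a purely textual scan for
        // statements starting with "assert" wrongly counts this as a junit
        // assertion, which can trigger the da false positive
        assertThat(json).contains("status");
    }
}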
7 related work

in large-sized test suites, software engineers barely perform manual detection of test smells: this practice is rather time-consuming and infeasible in many scenarios. therefore, the research community has proposed automated tool support for detecting test smells. the test smell detector (tsd) detects nine types of test smells (bavota et al., 2015). the tsd detection rules overestimate the presence of test smells in the code to ensure a high recall (87%), and the tool returns a list of candidate affected classes. similarly, the tsdetect, the state-of-the-art tool to detect test smells, identifies twenty-one types of test smells (section 2). it indicates whether a particular test smell appears in a test class, with a precision score ranging from 85% to 100% and a recall score from 90% to 100% (peruma et al., 2020).

other tools correlate test smells with structural and coverage metrics. the intellij plug-in coined vitrum (visualization of test-related metrics) is an extension of the tsdetect: it collects a set of seven types of test smells and structural metrics (pecorelli et al., 2020). teredetect (negar and garousi, 2010) and tecrevis (koochakzadeh and garousi, 2010) use code coverage analysis, supported by codecover, to detect test smells related to code duplication.

our tool uses a rule-based test smells detection instead of a metric- or coverage-based detection. it extends the tsdetect tool in several respects. for example, our tool provides the number of test smells identified in a test class and the location of each test smell, with the method line and name. moreover, it supports the test suite analysis through several project versions, by mining git to provide information about when and by whom the test smells were introduced. additionally, our tool supports other tools for test smells refactoring (raide) (santana et al., 2020) and visualization (tsvizzevolution). the raide is an eclipse ide plugin to detect and refactor the ar and da test smells. the tsvizzevolution is a test smells visualization tool that aims to help the user understand problems in the test code by using three visualization techniques (graph view, treemap view, and timeline view). it represents the twenty-one types of test smells detected by the jnose test.

8 threats to validity

internal validity. in the manual analysis to construct the oracle, there may have been divergences among the researchers' analyses. we mitigated this threat by resolving disagreements collectively. after collecting data with the jnose test and tsdetect tools, we also checked whether any test smells detected by the tools had not been considered in the manual analysis.

external validity. our study results may not generalize to other suites of test classes or other types of test smells. to mitigate this threat, we used the same dataset used in the study that validated the tsdetect tool (peruma et al., 2020).

conclusion validity. although the jnose test detects twenty-one types of test smells, this study only validated eighteen of them, because the dataset used did not contain the dpt, dt, and igt test smells. on the other hand, we used the same dataset used to evaluate the tsdetect (peruma et al., 2020).

construct validity. although only four coders built the oracle, they were experts with more than three years of experience with test smells, and they were aware of the test code of the test smells detection tools.

9 conclusion

this paper presented the jnose test and its api, the jnose-core. the api supports the detection of twenty-one types of test smells and provides a flexible architecture to support the insertion of new test smells detection rules. the jnose test tool is a web application to detect test smells and calculate coverage for java projects.

to validate the detection rules implemented by the jnose-core, we conducted an empirical study to compare our tool's accuracy with the state-of-the-art tool and with a manual analysis. to perform the comparison, we built an oracle for test smells detection, containing sixty-five test classes analyzed by specialists in the subject. the comparison between the jnose test and the tsdetect was made at the class level. the results showed that the jnose test presented higher accuracy than the tsdetect in terms of precision and recall. as we reused the detection rules from the tsdetect to implement the jnose test, the results indicated that we successfully improved them. additionally, the jnose test also detects test smells at a fine-grained level.
as the tsdetect does not support this feature, we could only compare the fine-grained level detection against the manual analysis. the results showed a high accuracy in determining the exact line location, but the tool still needs further improvements.

there are many opportunities for further investigation. for example, it would be interesting to validate our tool's efficiency in a real-world environment through a user study; such a study could also consider significant usability concerns. there is also room for introducing new features in the jnose test, in terms of both detection and refactoring and, as necessary, in terms of how it behaves in practice considering quality attributes.

acknowledgements

this research was partially funded by ines 2.0; cnpq grants 465614/2014-0 and 408356/2018-9 and fapesb grants jcb0060/2016 and bol0188/2020.

references

bavota, g., qusef, a., oliveto, r., de lucia, a., and binkley, d. (2015). are test smells really harmful? an empirical study. empirical software engineering, 20(4):1052–1094.

bavota, g., qusef, a., oliveto, r., lucia, a., and binkley, d. (2012). an empirical analysis of the distribution of unit test smells and their impact on software maintenance. in 28th ieee international conference on software maintenance (icsm).

bell, j., legunsen, o., hilton, m., eloussi, l., yung, t., and marinov, d. (2018). deflaker: automatically detecting flaky tests. in ieee/acm 40th international conference on software engineering (icse), pages 433–444.

capgemini (2018). world quality report 2018-19. https://www.capgemini.com/service/world-quality-report-2018-19/. accessed: march 1st, 2021.

cisq (2021). the cost of poor software quality in the us: a 2020 report. https://www.it-cisq.org/pdf/cpsq-2020-report.pdf. accessed: march 1st, 2021.

deursen, a., moonen, l. m., bergh, a., and kok, g. (2001). refactoring test code. cwi (centre for mathematics and computer science), amsterdam, the netherlands.

garousi, v. and küçük, b. (2018). smells in software test code: a survey of knowledge in industry and academia. journal of systems and software, 138:52–81.

gopinath, r., jensen, c., and groce, a. (2014). code coverage for suite evaluation by developers. in proceedings of the 36th international conference on software engineering (icse), new york, ny, usa. acm.

grano, g., palomba, f., di nucci, d., de lucia, a., and gall, h. c. (2019). scented since the beginning: on the diffuseness of test smells in automatically generated test code. journal of systems and software, 156:312–327.

greiler, m., van deursen, a., and storey, m. (2013). automated detection of test fixture strategies and smells. in ieee sixth international conference on software testing, verification and validation, pages 322–331.

guerra calle, d., delplanque, j., and ducasse, s. (2019). exposing test analysis results with drtests. in international workshop on smalltalk technologies, pages 1–5, cologne, germany. hal.

hallgren, k. a. (2012). computing inter-rater reliability for observational data: an overview and tutorial. tutorials in quantitative methods for psychology, 8(1):23.

junior, n. s., rocha, l., martins, l. a., and machado, i. (2020). a survey on test practitioners' awareness of test smells. in proceedings of the xxiii iberoamerican conference on software engineering, cibse 2020, pages 462–475. curran associates.
koochakzadeh, n. and garousi, v. (2010). tecrevis: a tool for test coverage and test redundancy visualization. in bottaci, l. and fraser, g., editors, testing – practice and research techniques, pages 129–136, berlin, heidelberg. springer berlin heidelberg.

meszaros, g., smith, s. m., and andrea, j. (2003). the test automation manifesto. in maurer, f. and wells, d., editors, extreme programming and agile methods – xp/agile universe 2003, berlin, heidelberg. springer berlin heidelberg.

negar, k. and garousi, v. (2010). a tester-assisted methodology for test redundancy detection. advances in software engineering, 2010.

palomba, f., zaidman, a., and lucia, a. d. (2018). automatic test smell detection using information retrieval techniques. in ieee international conference on software maintenance and evolution (icsme), pages 311–322, madrid, spain. ieee.

pecorelli, f., di lillo, g., palomba, f., and de lucia, a. (2020). vitrum: a plug-in for the visualization of test-related metrics. in proceedings of the international conference on advanced visual interfaces, new york, ny, usa. acm.

peruma, a., almalki, k., newman, c. d., mkaouer, m. w., ouni, a., and palomba, f. (2019). on the distribution of test smells in open source android applications: an exploratory study. in proceedings of the 29th annual international conference on computer science and software engineering (cascon), riverton, nj, usa. ibm.

peruma, a., almalki, k., newman, c. d., mkaouer, m. w., ouni, a., and palomba, f. (2020). tsdetect: an open source test smells detection tool. acm, new york, ny, usa.

santana, r., martins, l., rocha, l., virginio, t., cruz, a., costa, h., and machado, i. (2020). raide: a tool for assertion roulette and duplicate assert identification and refactoring. in proceedings of the 34th brazilian symposium on software engineering (sbes). acm.

spadini, d., palomba, f., zaidman, a., bruntink, m., and bacchelli, a. (2018). on the relation of test smells to software code quality. in international conference on software maintenance and evolution (icsme), pages 1–12. ieee.

spadini, d., schvarcbacher, m., oprescu, a.-m., bruntink, m., and bacchelli, a. (2020). investigating severity thresholds for test smells. in proceedings of the 17th international conference on mining software repositories (msr). acm.

virginio, t., martins, l., soares, l. r., railana, s., costa, h., and machado, i. (2020). an empirical study of automatically-generated tests from the perspective of test smells. in proceedings of the xxxiv brazilian symposium on software engineering (sbes), new york, ny, usa. acm.

virginio, t., santana, r., martins, l. a., soares, l. r., costa, h., and machado, i. (2019). on the influence of test smells on test coverage. in proceedings of the xxxiii brazilian symposium on software engineering (sbes), pages 467–471, new york, ny, usa. acm.

virgínio, t., martins, l., santana, r., cruz, a., rocha, l., costa, h., and machado, i. (2021). on the test smells detection: an empirical study on the jnose test accuracy [dataset]. available at: https://doi.org/10.5281/zenodo.4570751.

yusifoğlu, v. g., amannejad, y., and can, a. b. (2015). software test-code engineering: a systematic mapping. information and software technology, 58:123–147.
journal of software engineering research and development, 2020, 8:2, doi: 10.5753/jserd.2019.459. this work is licensed under a creative commons attribution 4.0 international license.

requirements engineering base process for a quality model in cuba

yoandy lazo alvarado [ centro nacional de calidad de software (calisoft) | yoandy.lazo@calisoft.cu ]
leanet tamayo oro [ centro nacional de calidad de software (calisoft) | leanet.tamayo@calisoft.cu ]
odannis enamorado pérez [ centro nacional de calidad de software (calisoft) | odannis.enamorado@calisoft.cu ]
karine ramos [ alloy digital product development & marketing technology | karine.rb19@gmail.com ]

abstract

a high percentage of software projects worldwide fail or are canceled due to incorrect requirements engineering. incorporating good practices into the requirements engineering process provides an appropriate mechanism to understand and analyze what stakeholders want and need. this process also makes it possible to evaluate and negotiate a reasonable solution, and to specify, validate, and manage the requirements as they become a functional system. the objective of this research is to elaborate a requirements engineering process for the quality model for software development that contributes to raising the percentage of successful projects in cuban software development organizations, regarding the fulfillment of the agreed requirements. to reach the desired goal, a bibliographic review was made of the requirements engineering discipline, and interviews and surveys were applied to the roles related to this activity in cuban software development organizations. the solution was evaluated by experts in a focus group and put into practice, as a pilot, in three organizations. as a result, a base requirements engineering process was obtained that contains specific requirements divided across the three maturity levels of the model, and a graphic and textual description of the process. the satisfaction of the end user was measured through the iadov technique, obtaining a group satisfaction index equal to 1, meaning maximum user satisfaction with the process.

keywords: requirement, requirements engineering, software, process

1. introduction

a significant percentage of software development projects worldwide are canceled or fail, according to studies carried out. between the years 2011 and 2015, the canceled ones accounted for 39%, 46%, 40%, 47%, and 45%, respectively, and the unsuccessful ones represented 22%, 17%, 19%, 17%, and 19%, respectively (rosato, 2018; the standish group international, 2015). the behavior of projects in 2018 was similar to previous years, since canceled projects reached 36% and 20% were reported as unsuccessful (the standish group international, 2018). an investigation carried out by lehtinen et al. suggests that the causes of failures in projects occur in several processes, including management, sales and requirements, and implementation.
it also states that the failures are related to the project environment, people, methods, and tasks (lehtinen, mäntylä, vanhanen, itkonen, & lassenius, 2014; mcleod & macdonell, 2011). an analysis of the standish group publication of 2014, on a study of more than 2,000 projects in 1,000 companies, allowed us to know that, although a project may be considered successful regarding compliance with delivery deadlines, budget, and agreed requirements, the percentages of utilization of the functionalities that compose the delivered systems are: 7% always, 13% often, 16% sometimes, 19% rarely, and 45% never (the standish group international, 2014). according to del toro, this happens because: 1) the client did not request the functionality, but it was built due to a misinterpretation of a requirement, or because the developers considered that it could be useful or interesting; or 2) the client requested it, but i) later realized that he did not describe it correctly and does not want it anymore; ii) described it correctly, but when he saw it implemented he realized he had asked for something wrong; or iii) described it correctly, but now wants something different (del toro, 2018).

the authors of this research agree with del toro on the importance of maintaining adequate feedback with stakeholders during the software development life cycle to reduce the effects of the volatility of the requirements. the requirements engineering (re) process is the indispensable bridge that connects the stakeholder needs with the design and development of the product, and it guarantees that feedback. it provides an appropriate mechanism to understand and analyze the stakeholders' needs, evaluate the feasibility, negotiate a reasonable solution, specify the solution without ambiguities, validate the specification, and manage the requirements as they become a functional system (pressman, 2010).

a diagnosis performed in 2014 by the software quality national center (calisoft) on a sample of 43.75% of cuban software development organizations allowed these organizations to be characterized through the application of interviews and surveys to the roles involved in the re process (calisoft, 2014). among the evaluated aspects is the fulfillment of the activities that compose the re process proposed by pressman (pressman, 2010). an analysis of the obtained results allowed the authors of this research to know that 7.14% of the organizations do not identify the stakeholder requirements, 14.26% do not specify them, 21.43% do not validate them, 28.57% do not implement changes control, and 57.14% do not maintain traceability with the requirements. another result obtained describes the behavior of the projects completed in the period 2011-2014, identifying that 16.42% of the projects did not complete all the requirements agreed with the client; 11.94% delivered out of time and did not complete all the agreed requirements; and 1.49% did not complete all the requirements and were over budget (calisoft, 2014; pérez & aveleira, 2016).

another diagnosis was performed in 2017 on a sample of 28.13% of cuban software development organizations. it allowed us to know that 48% of the completed projects finished successfully, 20% were canceled, and 32% failed. it also allowed identifying that the percentage of implementation of the re process reached 48% (calisoft, 2017).
for all the above, it can be stated that the process in question is not mature, and the need is recognized to establish activities that provide more feedback with the client. the software development organizations in cuba have used the nc-iso 9001 and cmmi to reach maturity levels in their development processes (y. a. lazo, 2016). at the same time, calisoft researchers work on the development of the quality model for the development of computer applications (mcdai), to provide the industry with a model based on international best practices. mcdai takes into account national characteristics and is based on the following principles: be easy to understand, be easy to apply, and serve as a basis for evaluations in other internationally recognized models (pérez, 2014). that being said, the objective of this research is to develop a requirements engineering process for the mcdai that contributes to raising the percentage of successful projects in cuban software development organizations, regarding compliance with the agreed requirements.

2. theoretical framework

2.1. requirements engineering in software development

as part of the construction of the theoretical framework of the research, a bibliographic review was carried out (goguen, 1994; ieee, 2014; iso, iec, & ieee, 2017; oficina nacional de normalización, 2015b; sommerville, 2011; team, 2010). this allowed the term requirement to be conceptualized as: a need or expectation established, generally implicit or mandatory, expressing a condition or capacity demanded by the stakeholders or the organization, which a process, product, or product component must comply with or have, in order to solve a problem or achieve an objective and to satisfy a contract, standard, specification, or other formally imposed document.

the broad spectrum of tasks and techniques that lead to understanding requirements is called re. from the software process perspective, re is one of the important software engineering actions; it begins during the communication activity and continues into the modeling activity. pressman argues that, as part of re, seven different tasks are performed: conception, inquiry, elaboration, negotiation, specification, validation, and management (pressman, 2010). sommerville, in turn, identifies that the main activities of re are the acquisition, analysis, and validation of requirements. he also explains the importance of the requirements administration to plan the re process activities and control requirements changes (sommerville, 2011). the guide to the software engineering body of knowledge (swebok) contains the software requirements knowledge area (ka), which is concerned with the elicitation, analysis, specification, and validation of software requirements, as well as the management of requirements during the whole life cycle of the software product (ieee, 2014).

2.1.1. requirements engineering according to nc-iso 9001

nc-iso 9001:2015, quality management systems — requirements, uses the process approach, which incorporates the plan-do-check-act cycle and risk-based thinking. the organizations that use it do so with a strategic vision to improve their overall performance. this standard can be used in any organization, including those that develop software.
in this standard, the re process is not explicitly delimited, but it states that "the organization must plan, implement and control the processes necessary to meet the requirements for the provision of products and services, and implement the determined actions"; the aforementioned gives the organization the possibility to implement a re process for software development. the standard also raises several requirements related to the re process: "8.2.2 determining the requirements for products and services", "8.2.3 review of the requirements for products and services", and "8.2.4 changes to requirements for products and services" (oficina nacional de normalización, 2015a). organizations that develop software can use iso/iec/ieee 90003:2018 as a guide for the application of iso 9001; iso/iec/ieee 90003 explains in detail how to comply with the requirements mentioned above (iso, iec, & ieee, 2018).

2.1.2. re according to iso/iec/ieee 12207 and iso/iec/ieee 15288

iso/iec/ieee 15288:2015 and iso/iec/ieee 12207:2017 contain the life cycle processes for systems and software, respectively. both international standards contain 30 processes, including two related to re. during the revision of these standards, it was possible to identify that the activities they propose for re are similar.

1. the purpose of the stakeholder needs and requirements definition process is to define the stakeholder requirements for a system that can provide the capabilities needed by users and other stakeholders in a defined environment (iso, iec, & ieee, 2015; iso et al., 2017). a project, to declare full compliance with this process, shall implement the following activities: a) prepare for stakeholder needs and requirements definition; b) define stakeholder needs; c) develop the operational concept and other life cycle concepts; d) transform stakeholder needs into stakeholder requirements; e) analyze stakeholder requirements; and f) manage the stakeholder needs and requirements definition.

2. the purpose of the system/software requirements definition process is to transform the stakeholder, user-oriented view of desired capabilities into a technical view of a solution that meets the operational needs of the user (iso et al., 2015, 2017). a project, to declare full compliance with this process, shall implement the following activities: a) prepare for system/software requirements definition; b) define system/software requirements; c) analyze system/software requirements; and d) manage system/software requirements.

2.1.3. re according to the capability maturity model integration (cmmi)

cmmi is a process improvement maturity model for the development of products and services. it includes the best practices that deal with development and maintenance activities covering the product's life cycle, from conception to delivery and maintenance. it was created by the software engineering institute at carnegie mellon university and is the result of the integration of several models (cmmi institute, 2015). cmmi has 22 process areas. a process area is a set of related practices that, when implemented collectively, satisfy a set of goals considered important to improve that process area (cmmi institute, 2015). the process areas are composed of specific goals (sg) and specific practices (sp) that guide, in a more detailed way, how to achieve the goals. two of the model's areas work on the topic of software requirements: requirements development (rd) and requirements management (reqm).
the purpose of the rd process area is to elicit, analyze, and establish customer, product, and product component requirements (team, 2010). this process area includes three sgs:

- sg 1 develop customer requirements: sp 1.1 elicit needs; sp 1.2 transform stakeholder needs into customer requirements.
- sg 2 develop product requirements: sp 2.1 establish product and product component requirements; sp 2.2 allocate product component requirements; sp 2.3 identify interface requirements.
- sg 3 analyze and validate requirements: sp 3.1 establish operational concepts and scenarios; sp 3.2 establish a definition of required functionality and quality attributes; sp 3.3 analyze requirements; sp 3.4 analyze requirements to achieve balance; sp 3.5 validate requirements.

the purpose of the reqm process area is to manage the requirements of the project's products and product components and to ensure alignment between those requirements and the project's plans and work products (team, 2010). this process area is composed of a single sg:

- sg 1 manage requirements: sp 1.1 understand requirements; sp 1.2 obtain commitment to requirements; sp 1.3 manage requirements changes; sp 1.4 maintain bidirectional traceability of requirements; sp 1.5 ensure alignment between project work and requirements.

2.1.4. re according to brazilian software process improvement (mps.br)

mps.br was created by the association for the promotion of the excellence of brazilian software (softex). it has three components: reference model (mr-mps), evaluation method (ma-mps), and business model (mn-mps), and it is composed of 19 process areas. it targets companies of different sizes and characteristics, with special attention to micro, small, and medium enterprises. the model has two areas that address re: requirements development (dre) and requirements management (gre) (montoni, rocha, & weber, 2009). the purpose of the dre process area is to define customer, product, and product component requirements. the expected results of the dre process are (softex, 2009a):

- dre1 the client's needs, expectations, and restrictions, both of the product and of its interfaces, are identified.
- dre2 a defined set of customer requirements is specified based on the needs, expectations, and restrictions identified.
- dre3 a set of functional and non-functional requirements of the product and product components, describing the solution to the problem to be solved, is defined and maintained based on the client requirements.
- dre4 the functional and non-functional requirements of each product component are refined, elaborated, and allocated.
- dre5 the internal and external interfaces of the product and of each product component are defined.
- dre6 operating concepts and scenarios are developed.
- dre7 the requirements are analyzed, using defined criteria, to balance stakeholder needs with the existing restrictions.
- dre8 the requirements are validated.

the purpose of the gre process area is to manage the requirements of the project's products and product components and to identify inconsistencies between the requirements and the project's plans and work products. the expected results of the gre process are (softex, 2009b):

- gre1 the requirements are understood, evaluated, and accepted together with the requirements providers, using objective criteria.
- gre2 the technical team's commitment to the approved requirements is obtained.
- gre3 bidirectional traceability between requirements and work products is established and maintained.
- gre4 revisions in plans and work products of the project are carried out to identify and correct inconsistencies with the requirements.
- gre5 requirements changes during the project are managed.

2.1.5. re according to moprosoft and competisoft

the software industry processes model (moprosoft) emerged as part of the software industry development program of the ministry of economy of mexico, to help small and medium-sized mexican software development companies reach international levels of process capacity (oktaba, 2015). this model was the basis for the preparation of iso/iec 29110 – lifecycle profiles for very small entities, and of the model process improvement to promote the competitiveness of the ibero-american small and medium software industry (competisoft) (competisoft, 2006). moprosoft and competisoft are divided into three categories: senior management, management, and operation. the operation category contains the software development and maintenance process, which allows requirements engineering activities to be carried out systematically, through a set of activities whose purpose is to obtain the documentation of the requirement and system specifications, so that the customer and the project share a common understanding. some of these activities, for the moprosoft model, are (competisoft, 2006; oktaba, 2005):

a2.2. document or modify requirement specifications: identify and consult information sources (customers, users, previous systems, documents, etc.) to obtain new requirements; analyze the identified requirements to limit their scope and feasibility, considering the restrictions of the customer's or project's business environment; prepare or modify the user interface prototype; generate or update the requirement specifications.
a2.3. verify the requirements specification.
a2.4. correct the defects found in the requirement specification, based on the verification report, and obtain approval of the corrections.
a2.5. validate the requirements specification.
a2.6. correct the defects found in the requirements specification, based on the validation report, and obtain approval of the corrections.
a3.2. document or modify the analysis and design: generate or modify the traceability record.

competisoft incorporates other activities that complement moprosoft; e.g., a2.2 includes the task of identifying and establishing the security requirements of the information standard to obtain the required level of security. in general, both models describe similar activities for requirements engineering.

2.1.6. good practices extracted from the models and standards

re has a fundamental role in software development projects because it is the process that allows communication with stakeholders to obtain the requirements of the product under development. some models and standards group re activities into requirements development and requirements management. after analyzing the bibliography studied, it can be affirmed that cmmi and mps.br treat re similarly; the same happens with moprosoft and competisoft, as well as with the iso/iec/ieee 12207 and 15288 standards. table 1 identifies the good practices of re. requirements elicitation, specification, analysis, and validation stand out in requirements development. in the case of requirements management, the most common practices are achieving understanding, controlling changes, and maintaining bidirectional traceability.
table 1. good practices in the re process (own preparation). for each practice, the sources that cover it are listed (pressman; sommerville; nc-iso 9001; iso/iec/ieee 12207 and 15288; cmmi-dev and mps.br; moprosoft and competisoft; swebok).

requirements development
1. requirements elicitation. pressman: 5.1 (inquiry), 5.3; sommerville: c4s5; nc-iso 9001: 4.2, 5.1.2; iso 12207/15288: 6.4.2.3 (b [1, 2, 3, 4]), 6.4.2.3 (d [1, 2, 3]); cmmi/mps.br: rd (sg 1 [sp 1.1, sp 1.2]), dre 1; moprosoft/competisoft: 9.2 (ope.2 a2.2), ds(a2.2); swebok: chapter (3).
2. requirements specification. pressman: 5.1 (specification); sommerville: c4s2; iso 12207/15288: 6.4.3.3 (b [3, 4, 5]); cmmi/mps.br: rd (sg 2 [sp 2.1, sp 2.2]), dre 2, dre 4; moprosoft/competisoft: 9.2 (ope.2 a2.2, a3.2), ds(a2.2, a3.2); swebok: chapter (5).
3. elaboration of product requirements. sommerville: c4s5; iso 12207/15288: 6.4.3.3 (b [1, 2, 3]); cmmi/mps.br: rd (sg 2 [sp 2.1, sp 2.2]), dre 3; moprosoft/competisoft: ds(a3.2); swebok: chapter (4.2).
4. identify interface requirements. pressman: 5.1 (elaboration), 5.4; nc-iso 9001: 7; iso 12207/15288: 6.4.3.3 (b [5]); cmmi/mps.br: rd (sg 2 [sp 2.3]), dre 5; moprosoft/competisoft: 9.2 (ope.2 a2.2, a3.2), ds(a2.2, a3.2).
5. establish operational concepts and associated scenarios. pressman: 5.1 (elaboration), 5.5; sommerville: c4s5; nc-iso 9001: 6; iso 12207/15288: 6.4.2.3 (c [1, 2]); cmmi/mps.br: rd (sg 3 [sp 3.1]), dre 6; moprosoft/competisoft: ds(a3.2); swebok: chapter (4.2).
6. establish a definition of required functionality and quality attributes. sommerville: c4s1; iso 12207/15288: 6.4.3.3 (b [5]); cmmi/mps.br: rd (sg 3 [sp 3.2]); moprosoft/competisoft: ds(a2.2); swebok: chapter (7.3).
7. analysis and negotiation. pressman: 5.1 (negotiation), 5.6; sommerville: c4s5, c12s5; nc-iso 9001: 8.2.1, 8.2.3; iso 12207/15288: 6.4.2.3 (e [1, 2, 3, 4]), 6.4.3.3 (c [1, 2, 3, 4]); cmmi/mps.br: rd (sg 3 [sp 3.3, sp 3.4]), dre 7; moprosoft/competisoft: 9.2 (ope.2 a3.2), ds(a2.2, a3.2); swebok: chapter (4).
8. validation of requirements. pressman: 5.1 (validation), 5.7; sommerville: c4s6; cmmi/mps.br: rd (sg 3 [sp 3.5]), dre 8; moprosoft/competisoft: 9.2 (ope.2 a2.5, a2.6), ds(a2.5, a2.6); swebok: chapter (6).

requirements management
9. identify requirements source. pressman: 5.1 (management); sommerville: c4s5; nc-iso 9001: 8.1; iso 12207/15288: 6.4.2.3 (a [1, 2]), 6.4.3.3 (a [1, 2]); cmmi/mps.br: reqm (sg 1 [sp 1.1]); moprosoft/competisoft: 9.2 (ope.2 a2.2), ds(a2.2); swebok: chapter (3.1).
10. understand requirements. nc-iso 9001: 8.2.3; cmmi/mps.br: reqm (sg 1 [sp 1.1]), gre 1.
11. obtain commitment to requirements. nc-iso 9001: 8.2.1; iso 12207/15288: 6.4.2.3 (f [1]), 6.4.3.3 (d [1]); cmmi/mps.br: reqm (sg 1 [sp 1.2]), gre 2.
12. manage requirements changes. sommerville: c4s7; nc-iso 9001: 8.2.4; iso 12207/15288: 6.4.2.3 (f [3]), 6.4.3.3 (d [3]); cmmi/mps.br: reqm (sg 1 [sp 1.3]), gre 5; swebok: chapter (7.2).
13. maintain traceability of requirements. iso 12207/15288: 6.4.2.3 (f [2]), 6.4.3.3 (d [2]); cmmi/mps.br: reqm (sg 1 [sp 1.4]), gre 3; moprosoft/competisoft: ds(a2.2, a3.2, a4.3, a4.6); swebok: chapter (7.4).
14. ensure alignment between project work and requirements. nc-iso 9001: 8.2.4; cmmi/mps.br: reqm (sg 1 [sp 1.5]), gre 4; moprosoft/competisoft: 9.2 (ope.2 a1.1), ds(a1.1, a2.2, a2.3).

the tendency to include the reuse management approach, understood as a process of creating software systems from existing software by applying domain engineering, was also identified in these reference models. this approach has provided many organizations with competitive advantages in the market in terms of product quality, development time, and production costs, among others (bastarrica, 2011; manso martínez & garcía peñalvo, 2013; northrop et al., 2007; salazar, 2017). during the application of domain engineering, requirements are also developed and managed. a fundamental element of domain engineering is the application of the domain analysis technique, which allows capturing the critical information about the entities, data, and processes that characterize a particular business area, and then developing and specifying the requirements (brun, 2007). the main result of the application of this technique is the domain model, which describes, at a high level of abstraction, the common elements and variants of the family, for a correct management of the variability of the resulting products (montoni et al., 2009).
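to make the notion of a domain model with common and variant elements concrete, the following sketch shows one possible in-memory representation in python. the feature names and the variability labels are illustrative assumptions made for this example; they are not taken from any of the cited models.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class DomainFeature:
    """a capability of the application family described by the domain model."""
    name: str
    variability: str  # "common", "optional" or "alternative" (illustrative labels)
    rationale: str = ""

@dataclass
class DomainModel:
    """high-level description of a domain: its boundary and its features."""
    domain: str
    boundary: str
    features: List[DomainFeature] = field(default_factory=list)

    def variants(self) -> List[DomainFeature]:
        # everything that is not common must be resolved for each product
        return [f for f in self.features if f.variability != "common"]

# hypothetical example: a billing application family
billing = DomainModel(
    domain="billing",
    boundary="invoicing and payment tracking; excludes accounting ledgers",
    features=[
        DomainFeature("issue invoice", "common"),
        DomainFeature("electronic signature", "optional"),
        DomainFeature("tax regime", "alternative", "one regime is chosen per country"),
    ],
)
print([f.name for f in billing.variants()])  # ['electronic signature', 'tax regime']
```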
most of the studied models and standards are designed for large software development organizations, since they require long implementation periods and great assimilation effort. they also carry a high cost of certification and consulting, so it is difficult for cuban organizations, which have limited resources, to adopt them. for this reason, countries characterized by a majority presence of small and medium enterprises (smes), such as mexico and brazil, have adapted the internationally recognized models to their needs: moprosoft in the case of mexico and mps.br for the brazilian development companies. however, these two projects are tailored to the context and characteristics of those countries. moreover, most of the available models do not detail a strategy that gives organizations an agile process to guide improvement and facilitate the work of process engineers.

2.2. model for the development of computer applications

the need of processes to adapt to the market and to clients has led quality-oriented management models to focus their attention on processes as the most powerful lever for acting on results in an effective and sustained way over time (concepción, 2010; zaratiegui, 1999). pérez states that the mcdai has a process approach and considers it an accepted proposal for the software development industry in cuba (pérez, 2014). the mcdai is composed of: 1) a general guide that describes the model and its components; 2) an implementation guide that contains the general requirements that must be met by the twelve base processes that compose the model, as well as the definition of each base process; and 3) an evaluation guide that describes the process and the evaluation method used to determine the organization's maturity level and the capacity of its processes related to the model (see figure 1) (pérez, 2014).

figure 1. mcdai components.

the implementation guide groups the processes into the following categories: 1) organizational management gathers the base processes that have a direct influence on the organization and that are executed at a high level or under management's responsibility; 2) project management gathers the base processes related to the organization of project work; 3) engineering gathers the technical base processes necessary for software development; 4) support gathers the base processes that support software development (see figure 2).

figure 2. mcdai categories.

each base process contains a purpose, specific requirements, and a suggested process model that meets those requirements. the specific requirements are defined at basic, intermediate, and advanced levels (see figure 3).

figure 3. base process structure.

the specific requirements are divided into three parts: title, description, and recommended evidence. the recommended evidence consists of examples of what the work products could be. the specific requirements of each base process, together with mcdai's generic requirements, are used as a reference standard by evaluators to determine the capacity of the organization's processes. the organization's maturity level (basic, intermediate, or advanced) is determined by taking into account the capacity of all its processes.
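since the text above does not spell out the exact aggregation rule, the short python sketch below assumes, purely for illustration, that the organization's maturity level is the lowest capacity level reached by any of its base processes.

```python
from enum import IntEnum

class Level(IntEnum):
    BASIC = 1
    INTERMEDIATE = 2
    ADVANCED = 3

def organization_maturity(process_capacity: dict) -> Level:
    """illustrative rule: the organization is only as mature as its weakest process."""
    return min(process_capacity.values())

# hypothetical capacities for three of the twelve mcdai base processes
capacities = {"re": Level.ADVANCED, "cm": Level.INTERMEDIATE, "qa": Level.INTERMEDIATE}
print(organization_maturity(capacities).name)  # INTERMEDIATE
```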
organizations that decide to adopt the mcdai shall implement the requirements corresponding to the maturity level and/or capacity desired. a suggested process model, with a graphic and a textual representation, is also shown as part of each base process; this model exemplifies how to implement the generic and specific requirements.

2.2.1. mcdai's generic requirements

table 2 shows mcdai's generic requirements (gr) necessary to reach the desired capacity. each base process has to implement these requirements, including the re base process that this investigation presents.

table 2. mcdai's generic requirements.

basic level
- gr 1 define the process to follow.
- gr 2 define roles and responsibilities.
- gr 3 plan process execution.
- gr 4 provide resources.
- gr 5 monitor process execution.
- gr 6 identify and preserve the configuration items.
- gr 7 evaluate the execution of the established process.
- gr 8 analyze the process status with the management.

intermediate level
- gr 9 institutionalize the process.
- gr 10 manage indicators.
- gr 11 train staff.
- gr 12 manage the knowledge generated by the process.
- gr 13 identify and treat risks.

advanced level
- gr 14 perform process improvement.

2.3. process representation

to model the re base process, it is necessary to analyze graphic and textual representation techniques: flow diagrams, notation lanes, idef, etvx, business process modeling notation (bpmn), and textual description (losavio, guzmán, & matteo, 2011; manene, 2013; medina, 2012; murcia-oeste–arrixaca, 2013; silega, 2014; suárez, 2013). this analysis allowed the authors of this investigation to determine that the combination of bpmn and textual description is the best option because: it provides a graphic notation that describes the logic of the steps of a business process; it coordinates the sequence of processes and the messages that flow between the participants of the different activities; it allows processes to be modeled in a unified and standardized way, which facilitates understanding across the organization; and it explains the activities and covers the information about the needs of the process, when it begins, the people involved, the duration, how the activities are carried out, when it ends, and the different scenarios that may arise (y. a. lazo, 2016).

3. requirements engineering base process

this research proposes the re base process. it is part of the mcdai, and it is therefore aligned with its structure.

3.1. purpose and specific requirements

the purpose of the re base process is to identify the stakeholder requirements for a software product, so that it can provide the capabilities they need in a defined environment, and to transform the stakeholders' view into a technical vision that meets the operational needs of the users. to fulfill this purpose, and based on the good practices of re identified as part of the construction of the theoretical framework, specific requirements divided across the three mcdai maturity levels (basic, intermediate, and advanced) were proposed (see table 3). the division of the requirements into maturity levels was made to facilitate the adoption of the model through stepwise process improvement with small changes.

table 3. specific requirements of the re base process.

basic level
- re 1 define the relevant stakeholder requirements. (1 and 9)
- re 2 analyze and specify the requirements. (2, 6 and 7)
- re 2.2 prioritize requirements. (7)
- re 3 achieve understanding and commitment to technical requirements. (10 and 11)
- re 4 validate technical requirements. (8)
- cm 4 control changes. (12)

intermediate level
- re 5 model the technical requirements. (3 and 5)

advanced level
- re 2.1 approve technical requirements.
- re 5.1 model requirements based on reuse.
- re 6 establish bidirectional traceability. (13)
- qa 6 perform inconsistency reviews. (14)

note: in table 3, the requirements statements are distributed by level; the numbers in parentheses relate each requirement to the good practices identified in table 1.

re 1 define the relevant stakeholder requirements. the appropriate sources and suppliers shall be identified to obtain the relevant stakeholder requirements. the requirements shall be defined based on the needs and expectations of the suppliers and on an analysis of the identified sources. recommended evidence: providers list and requirements list.

re 2 analyze and specify the requirements. the stakeholder requirements shall be analyzed taking into account whether they are necessary and sufficient to meet the objectives of the product; from this analysis, new derived and/or implicit requirements can be defined. the functional and non-functional requirements shall be formally specified, with sufficient technical detail. the viability of the technical requirements shall be reviewed. recommended evidence: requirements specification.

re 2.1 approve technical requirements. benchmarking shall be carried out in the corresponding application domain to identify functionalities of similar products. the identified functionalities shall be matched against the technical requirements, and additional requirements that the product could contain to increase customer satisfaction shall be defined. recommended evidence: requirements specification.

re 2.2 prioritize requirements. the requirements to be implemented shall be prioritized according to stakeholder needs, market conditions, and/or business objectives. recommended evidence: prioritization of requirements.

re 3 achieve understanding and commitment to technical requirements. a common understanding of the requirements shall be achieved between the suppliers and the project team. conflicts arising between the requirements shall be resolved. the project team's commitment to implementing the current, approved requirements shall be obtained, as well as its commitment to making the necessary changes to plans, activities, and related work if the requirements evolve. recommended evidence: tasks in the management tool (assigned and accepted), meeting notes.

re 4 validate technical requirements. the technical requirements shall be validated to ensure that the resulting product meets the stakeholders' needs and expectations and works as intended in the end user's environment. recommended evidence: requirements specification.

re 5 model the technical requirements. the technical requirements shall be modeled to obtain a better understanding of the product to be developed. the requirements shall be grouped according to defined criteria. note: the requirements can be modeled using different paradigms, such as structured analysis or object-oriented analysis, among others. in the first case, models are created to represent the flow and content of the information (data and control); the product is divided into functional and behavioral partitions, and the essence of what is to be built is described.
examples include data flow diagrams (dfds), state transition diagrams (dte), and the data dictionary. in the second case, the objective is to model the concepts (objects) of the product's domain, their relationships, and their behaviors; that model is continuously refined until it is detailed enough for its implementation in the form of executable code. examples include use case models and operation scenarios, class models, sequence and activity diagrams, and state diagrams. recommended evidence: requirements realization document.

re 5.1 model requirements based on reuse. one or more domain models shall be defined and maintained that describe the boundaries of each domain with reuse potential and specify its characteristics, capabilities, common elements, and variants (optional or mandatory). the domain models shall be incorporated into a repository of reusable assets once they are formally evaluated and approved. recommended evidence: domain model.

re 6 establish bidirectional traceability. bidirectional traceability shall be determined between the project's objectives, the stakeholder requirements, the technical requirements, the derived work products, and the tasks that will fulfill them. traceability shall be updated throughout the project as appropriate. recommended evidence: traceability tool with the built-in elements.

to obtain the desired capacity level of the re base process, in addition to fulfilling the specific requirements described above, the following shall be met:

- for the basic level, the specific requirement cm 4 control changes, from the configuration management base process, shall be fulfilled to manage requested changes to the requirements.
- for the advanced level, the specific requirement qa 6 perform inconsistency reviews, from the quality assurance base process, shall be fulfilled to ensure alignment between project work and requirements.

the specific requirements described above were constructed in three stages. first, the authors prepared a proposal taking into account their experience and the good practices identified in table 1. second, the proposal was presented to 22 researchers who were working on the definition of the mcdai, to divide the specific requirements across the three maturity levels (basic, intermediate, and advanced) of the model and to identify the relationships with the rest of the mcdai's base processes. the third stage was executed after updating the proposal with the feedback obtained: seven experts were identified, with an average of 7 years of experience in the re discipline, all of them computer science engineers with the scientific category of master. the specific requirements, and the proposal of the level at which each might be grouped, were presented to the experts to obtain their assessment. the experts' feedback allowed the specific requirements, and the levels that group them, to be updated. finally, the last version of the re base process, shown in the next section, was obtained.

3.2. process and activities

as part of the solution, the graphic description (see figure 4) and the textual description of the re base process are proposed as an example of how to put the specific requirements into practice.

figure 4. graphic description of the re process.

below is the textual description of the re process.

1. characterize and select the requirement sources.
the analyst and the client, taking into account the stakeholders identified in ppmc (project planning, monitoring, and control) (suárez et al., 2016), obtain the requirement sources and characterize them. for the advanced level, when domain engineering is applied, development is directed to an application family; therefore, the requirement sources shift from specific clients to market and business studies. when application engineering is applied, the sources are given by the domain assets and the specific client. the analyst, the project manager, and the client select the requirement sources taking into account their characterization, as well as the provider(s) who represent the client's interests, where applicable, and who take responsibility for supplying the requirements. the “requirement sources list” is obtained as a result of the execution of this activity.

2. obtain the stakeholder requirements. the analyst uses the “requirement sources list”, the “offer”, and/or the “technical project” prepared when the project was conceived to analyze the requirements needed to comply with the project's goals. he also identifies the providers' needs and expectations, characterizes the organization's operating environments, and prepares a comprehensive list of them. this list is continuously updated, by monitoring any changes that may occur and based on the suppliers' suggestions. if the results are not satisfactory, the analyst re-identifies or improves the requirements with the help of other techniques, such as prototypes, focus groups, business use cases, and business process models, among others. for the advanced level, when domain engineering is applied, the requirements are obtained through market and business studies and the analysis of past projects; in these cases the project usually does not have a specific client, since a generic product is being developed, and for this reason the analysis to resolve conflicts between requirements is made with functional experts. when application engineering is applied, the requirements are obtained by analyzing the existing domain assets with the specific client, taking into account common and variant elements, optional or mandatory, in order to adopt them or, if necessary, design new requirements. the “requirement and restrictions list” is obtained as a result of the execution of this activity.

3. match requirements. from the advanced level, the analyst, taking into account the “requirement and restrictions list”, performs benchmarking in the corresponding application domain to identify similar products. he also matches the stakeholders' requirements with the functionalities of similar products, to identify additional requirements and to verify that the identified functional needs correspond to this type of product. the “benchmarking” and the “requirement and restrictions list” (with new requirements, where applicable) are obtained as a result of the execution of this activity.

4. achieve an understanding of the stakeholder requirements. the analyst, taking into account the “requirement and restrictions list”, identifies the conflicts between the requirements and, with the help of functional experts, makes proposals on how to eliminate them. the analyst and the stakeholders meet to reach a consensus on the resolution of the identified conflicts, taking into account the proposals made. the project team and the requirements providers achieve a common understanding of the “requirement and restrictions list”.
the “requirement and restrictions list” (updated) is obtained as a result of the execution of this activity.

5. prioritize the requirements. the analyst, taking into account the “requirement and restrictions list”, identifies the appropriate method for prioritizing the stakeholder requirements (e.g., hierarchical analysis, cumulative voting, numerical assignment, value-based prioritization, or cost- and risk-based prioritization, among others). he then prioritizes the requirements using the selected method. the “requirement and restrictions list” with the prioritized requirements is obtained as a result of the execution of this activity.

6. analyze and specify the stakeholder requirements. the analyst, taking into account the prioritized “requirement and restrictions list”, groups the functional and non-functional requirements that correspond to the iteration. he analyzes whether requirements from previous projects can be reused. he identifies whether the requirements are necessary and sufficient to develop a product that satisfies the stakeholders and, if required, identifies new derived requirements (functional requirements). he refines the functional requirements in terms of their description and functionality details. he analyzes the “requirement and restrictions list” taking into account the software product quality model defined in nc-iso/iec 25010, to identify implicit requirements (non-functional requirements). he refines the non-functional requirements by assigning allowable values to the quality attributes that the product should have. he reviews the viability of the functional and non-functional requirements to determine whether they are complete, feasible, and verifiable. from the intermediate level, he also specifies the internal and external interface requirements of the system. the “requirements specification” is obtained as a result of the execution of this activity; hereafter, these requirements are treated as technical requirements.

7. achieve an understanding of the technical requirements. the analyst and the stakeholders, taking into account the “requirements specification”, meet to arrive at a common understanding of the described technical requirements. the “requirements specification” (updated) is obtained as a result of the execution of this activity.

8. validate technical requirements. the project manager, the analyst, and the client, taking into account the “requirements specification”, validate the technical requirements using the prototype technique, in which candidate system interfaces and input and output elements are shown to the end user. if the clients make any remarks or observations, the analyst updates the “requirements specification”. the “requirements specification” (signed) is obtained as a result of the execution of this activity.

9. model the technical requirements. from the intermediate level, the analyst defines the conceptual model, establishing the relationships between the entities of the system or subsystem and their fundamental attributes, the future persistent classes, and the candidates for the data model. he also models the requirements by making a technical description (use case model, operation scenarios, class model, and user stories, where applicable). the project manager and the architect distribute the requirements among the modules or subsystems of the project.
from the advanced level, the domain model corresponding to the application family is taken into account to define the analysis model of the product to be developed for a specific client. the “analysis model” is obtained as a result of the execution of this activity.

10. model technical requirements based on reuse. from the advanced level, the analyst defines the domain model, which describes its boundaries with other domains and specifies the characteristics, capacities, common elements, and variants, optional or mandatory. he defines the conceptual model, establishing the relationships between the entities that are part of the domain and their fundamental attributes, the future persistent classes, and the candidates for the standard data model of the application family. he also models the requirements by making a technical description of the application family (use case model, operation scenarios, class model, and user stories, where applicable). the project manager and the architect distribute the requirements among the project modules or subsystems. the “domain model” is obtained as a result of the execution of this activity.

11. qa-perform evaluation. the evaluation team verifies that the “domain model” is technically correct, guided by the qa-perform evaluation sub-process (y. a. lazo, 2016). the “evaluation file” is obtained as a result of the execution of this activity.

12. create/update the traceability system. at the advanced level, the analyst, following the traceability guide, inserts into the selected tool, as they are developed, the project objectives, the stakeholder requirements, the technical requirements, the work products, and the tasks that will fulfill the agreed requirements. he establishes the corresponding bidirectional relationships between the elements inserted in the tool, and he updates the tool if changes to the requirements or work products arise. the traceability tool, with its established relationships, is obtained as a result of the execution of this activity.

13. cm-control the changes. the change control committee analyzes change requests on the requirements as established by the cm-control the changes sub-process (garcía, 2017). the “change request” (accepted) is obtained as a result of the execution of this activity.

14. make changes to requirements. the analyst, taking into account the accepted “change request”, makes the corresponding changes to the requirements and the related work products at the basic and intermediate levels; at the advanced level, the changes are made using the traceability tool. the “requirements specification” and the related artifacts (updated) are obtained as a result of the execution of this activity.

15. qa-perform evaluation. at the advanced level, the evaluation team executes the inconsistency review between the requirements and the associated work products, taking into account the tool and the traceability guide, as established by the qa-perform evaluation sub-process (y. a. lazo, 2016). the “evaluation file” is obtained as a result of the execution of this activity.
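activity 12 establishes bidirectional traceability between project objectives, stakeholder requirements, technical requirements, work products, and tasks. the python sketch below illustrates one possible in-memory representation of such two-way links; the element names and the dictionary-based scheme are assumptions made for this example and do not describe the traceability tool mentioned in the process.

```python
from collections import defaultdict

class TraceabilityGraph:
    """stores trace links in both directions so impact can be followed either way."""
    def __init__(self):
        self.forward = defaultdict(set)   # e.g., requirement -> work products
        self.backward = defaultdict(set)  # e.g., work product -> requirements

    def link(self, source: str, target: str) -> None:
        self.forward[source].add(target)
        self.backward[target].add(source)

    def traces_from(self, element: str) -> set:
        return self.forward[element]

    def traces_to(self, element: str) -> set:
        return self.backward[element]

# hypothetical elements of a project
g = TraceabilityGraph()
g.link("objective: reduce billing errors", "stk-req-01")
g.link("stk-req-01", "tech-req-07")
g.link("tech-req-07", "use case: issue invoice")
g.link("tech-req-07", "task-112")

# if tech-req-07 changes, these work products and tasks are affected:
print(g.traces_from("tech-req-07"))
# and it can be traced back to the stakeholder requirement that motivated it:
print(g.traces_to("tech-req-07"))
```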
3.3. re base process relationship with the mcdai

the re base process has a close relationship with the other base processes that compose the mcdai. this relationship provides input elements for other base processes (see figure 5); for example, the requirements specification and the domain model are input elements of the tsd base process and are taken into account for product design and implementation. in this relationship, it can also be seen that the results of other base processes are used in the re base process; for example, change requests on requirements are accepted or rejected by the cm base process, among others.

figure 5. relationship of the re base process with other mcdai processes.

this relationship also ensures compliance with the model's generic requirements. as shown in figure 6, through the opm base process the re process is defined, its associated roles and responsibilities are established, the resources necessary to execute the process are provided, and the re process is institutionalized throughout the organization. the ppmc base process plans and monitors the execution of the re process, and manages the training of project personnel, internal to the project, needed to execute the process. the cm base process identifies and preserves the configuration elements generated in the re process. the qa base process evaluates whether the defined re process is being executed in the organization and keeps management informed of the status of that process. the mi base process defines the indicators necessary to measure the re base process and makes improvements to it. the km base process manages the project team's training on the re process that could not be satisfied within the project, as well as the knowledge generated by the process. finally, the rm base process identifies and treats the risks associated with the re base process.

figure 6. re base process compliance with the generic requirements.

3.4. measure the re base process

to measure the influence of the re base process on the success of software development projects, the requirement compliance index (rci) indicator is proposed. it aims to evaluate compliance with the requirements agreed with clients. a requirement is considered unfulfilled when it has not been developed, or when the results obtained after its implementation are not those agreed upon. for this, the following base measures are identified: arq (agreed requirements quantity) and riq (requirements implemented quantity). the measurement function rci = riq / arq is used to calculate the rci. projects are considered successful if rci > 0.95. this indicator was selected by 7 experts with an average of 7 years of experience working in the discipline of requirements engineering.

4. validation

4.1. analysis of the proposal by a focal group

the authors of the present investigation consider that focal groups constitute a valuable and widely used technique for obtaining information. for this reason, they decided to use it to find out whether the proposed solution uses the correct terminology and is technically viable. for its formation, the criteria issued by aigneren and méndez were taken into account (aigneren, 2009; méndez, 2007), who state that the size of the group should range between 4 and 12 participants, that all participants must have the possibility of expressing their views, and that the group must be homogeneous, in order to ensure a diversity of ideas. to comply with the above, 12 specialists were summoned, each with more than 5 years of experience in the roles of analyst and architect. those selected represented the organizations calisoft, desoft, xetid, etecsa, transoft, eicma, aicros, and segurmática (a. y. lazo, tamayo, enamorado, pérez, & sánchez osorio, 2018).
the final result was a re process enriched with the experiences of each participant, and the unanimous criterion that it is an accepted proposal that meets the needs of the software development organizations in cuba.

4.2. analysis of the implementation of the process in pilot projects

a pre-experiment was applied in pilot projects to evaluate whether, when the re base process is introduced in software development projects, project success exceeds 48% with respect to the dimension of the requirement compliance index variable. sampieri suggests that a pre-experiment can be carried out through a case study with a single measurement, or through a pre-test/post-test design with a single group (hernández, fernández, & baptista, 1991). when analyzing the two options, it was found that in the first one there is no manipulation of the independent variables and no previous reference to the situation before the stimulus is applied; in the second one, there is an initial reference point showing the group's level in the dependent variables before the stimulus, which allows a follow-up. taking this into account, the researchers selected the second variant, knowing that pre-experimental designs are not suitable for establishing relationships between independent and dependent variables; they nevertheless consider it important because it can yield results that, when compared with those of other methods, help to reach conclusions. to implement the pre-experiment, six projects from three different organizations (datys, aicros, and transoft) were selected in the period from january 2018 to july 2018. the projects were developing web applications, each with a team of six people with mastery of the technologies used and more than 5 years of experience in the business; the projects averaged 110 requirements, functional and non-functional. at the end of the pre-experiment, five of the six pilot projects were successful according to the rci indicator, which represents 83.33%. according to table 4, project 2 was the only one that did not reach rci > 0.95.

table 4. rci of the pilot projects.

project | riq | arq | rci
p1 | 120 | 120 | 1.00
p2 | 108 | 120 | 0.90
p3 | 99 | 100 | 0.99
p4 | 110 | 110 | 1.00
p5 | 110 | 110 | 1.00
p6 | 97 | 100 | 0.97

a comparative analysis between the diagnosis made in 2017 and the result obtained shows that in the first case only 48% of the projects were completed successfully, while in the second case the indicator improved to 83.33%. however, an exhaustive analysis of the project that did not comply with the indicator showed, in the review of adherence to the re process, that it implemented only 50% of the activities, an aspect that could have influenced the results obtained. among the re process activities not executed in some project iterations were the analysis, negotiation, and validation of the requirements, due to the distance between the client and the project team. the absence of these activities meant that there was no understanding between the parties about the requirements in the early stages, and that the client was dissatisfied with seven of the agreed requirements because they did not work as expected, while another five had problems related to usability. these results allowed the authors of the research to appreciate an improvement in the success of the projects, with respect to the dimension of the rci variable, after introducing the proposed process.
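as a minimal illustration of the rci computation defined in section 3.4, the python sketch below reproduces the figures of table 4 and applies the rci > 0.95 success threshold; it merely restates the published data.

```python
# riq: requirements implemented quantity, arq: agreed requirements quantity (table 4)
projects = {
    "p1": (120, 120), "p2": (108, 120), "p3": (99, 100),
    "p4": (110, 110), "p5": (110, 110), "p6": (97, 100),
}

def rci(riq: int, arq: int) -> float:
    """requirement compliance index: share of the agreed requirements implemented."""
    return riq / arq

for name, (riq_value, arq_value) in projects.items():
    index = rci(riq_value, arq_value)
    status = "successful" if index > 0.95 else "not successful"
    print(f"{name}: rci = {index:.2f} ({status})")

# five of the six pilot projects exceed the 0.95 threshold (83.33%)
```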
4.3. satisfaction of end users

the v.a. iadov technique was created by n.v. kuzmina in 1970 for the study of satisfaction with pedagogical careers. subsequently, it has been used in several investigations to evaluate satisfaction in different contexts. iadov consists of five questions: three closed and two open. in this research, the technique is used to assess user satisfaction with the re process in the pilot projects. for this, a survey was applied to six analysts and six architects. the criteria measured in the survey are based on the relationships established between the three closed questions, related through the iadov logical table (see table 5).

table 5. iadov logical table for the re base process (modified by the authors of this research). each row corresponds to an answer to question 3, "is the requirements engineering base process to your liking?", and lists the scale position for the nine combinations of answers to question 1, "do you consider the requirements engineering base process complex and difficult to understand?", and question 2, "if you were to carry out another project, would you use the proposed requirements engineering process?", ordered as (q1, q2): (no, yes), (no, i don't know), (no, no), (i don't know, yes), (i don't know, i don't know), (i don't know, no), (yes, yes), (yes, i don't know), (yes, no).

clearly pleased: 1, 2, 6, 2, 2, 6, 6, 6, 6
more pleased than unpleased: 2, 2, 3, 2, 3, 3, 6, 3, 6
not defined: 3, 3, 3, 3, 3, 3, 3, 3, 3
more unpleased than pleased: 6, 3, 6, 3, 4, 4, 3, 4, 4
clearly unpleased: 6, 6, 6, 6, 4, 4, 6, 4, 5
contradictory: 2, 3, 6, 3, 3, 3, 6, 3, 4

the number resulting from the interrelation of the three questions indicates the position of each respondent on the satisfaction scale. respondents were placed on the following satisfaction scale, to which a value is assigned to determine the group satisfaction index:

1. clearly pleased: +1
2. more pleased than unpleased: +0.5
3. not defined: 0
4. more unpleased than pleased: -0.5
5. clearly unpleased: -1
6. contradictory: 0

the group satisfaction index (gsi) is calculated with the following formula, where a to e are the numbers of respondents in scale categories 1 to 5 and n is the total number of respondents:

gsi = [a(+1) + b(+0.5) + c(0) + d(-0.5) + e(-1)] / n = [12(+1) + 0(+0.5) + 0(0) + 0(-0.5) + 0(-1)] / 12 = 1

the group index yields values between +1 and -1 and is classified as follows:

- satisfaction: values between 0.5 and 1
- contradiction: values between -0.49 and 0.49
- dissatisfaction: values between -1 and -0.5

the result gsi = 1 means maximum satisfaction with the proposed re base process. this result was corroborated by the answers to the open questions 4 and 5, where respondents expressed that they would not change anything in the base process because it fits their needs.

5. conclusion

the good practices for re were grouped into requirements development and requirements management. the main requirements development practices are identifying stakeholder needs and specifying, analyzing, and negotiating requirements. the main requirements management practices are controlling changes and maintaining traceability. the graphic and textual description of the re base process is a guide for adopting the mcdai's requirements, divided across the three maturity levels to facilitate their adoption. incorporating feedback activities with clients in the re process is a factor that influences the success of the project, because it allows the necessary changes to the requirements to be identified at the appropriate time, so that the product responds to the client's needs and expectations. the validation of the proposal helped verify user satisfaction with the proposed process and showed that executing the process can contribute to project success. it is recommended to measure the impact of the process on the volatility of the requirements, to contribute to the fulfillment of project planning.

references

aigneren, m. (2009). la técnica de recolección de información mediante grupos focales. la sociología en sus escenarios.
bastarrica, c. (2011). productividad en la industria tic. bits.
brun, r. e. (2007). técnicas de análisis de dominio: organización del conocimiento para la construcción de sistemas software. paper presented at la interdisciplinariedad y la transdisciplinariedad en la organización del conocimiento científico: actas del viii congreso isko-españa, león, 18, 19 y 20 de abril de 2007.
calisoft, c. n. d. c. d. s. (2014). cs-03-d (14-001) libro de diagnóstico.
calisoft, c. n. d. c. d. s. (2017). cs-03-d (17-001) libro de diagnóstico.
cmmi institute. (2015). retrieved 02/11/2015, from https://sas.cmmiinstitute.com/pars/pars_detail.aspx?a=25323
competisoft, p. (2006). competisoft - mejora de procesos para fomentar la competitividad de la pequeña y mediana industria del software de iberoamérica. versión 0.2, diciembre.
del toro, a. a. (2018). una mirada desde el desarrollo ágil a los requisitos de software. experiencias en datys villa clara. paper presented at taller 2, ingeniería de requisitos.
garcía, y. g. (2017). proceso base gestión de la configuración para un modelo de calidad en cuba. universidad de las ciencias informáticas.
goguen, j. a. (1994). requirements engineering as the reconciliation of social and technical issues. san diego: academic press professional.
hernández, s. r., fernández, c. c., & baptista, l. p. (1991). metodología de la investigación.
ieee. (2014). swebok: guide to the software engineering body of knowledge (version 3).
international, t. s. g. (2018). chaos report.
iso, iec, & ieee. (2015). iso/iec/ieee 15288, systems and software engineering — system life cycle processes.
iso, iec, & ieee. (2017). iso/iec/ieee 12207, systems and software engineering — software life cycle processes.
iso, iec, & ieee. (2018). iso/iec/ieee 90003, software engineering — guidelines for the application of iso 9001:2015 to computer software.
lazo, a. y., tamayo, o. l., enamorado, p. o., pérez, m. d., & sánchez osorio, y. (2018). apuntes sobre el modelo de la calidad para el desarrollo de aplicaciones informáticas (mcdai). paper presented at the xvii convención y feria internacional informática 2018, la habana. http://www.informaticahabana.cu/es/node/3703
lazo, y. a. (2016). proceso base de aseguramiento de la calidad para el desarrollo de software en cuba. universidad de las ciencias informáticas.
lehtinen, t. o., mäntylä, m. v., vanhanen, j., itkonen, j., & lassenius, c. (2014). perceived causes of software project failures – an analysis of their relationships. information and software technology, 56(6), 623-643.
losavio, f., guzmán, j. c., & matteo, a. (2011). correspondencia semántica entre los lenguajes bpmn y grl. enl@ce, 8(1).
manene, l. m. (2013). los diagramas de flujo: su definición, objetivo, ventajas, elaboración, fases, reglas y ejemplos de aplicaciones.
manso martínez, m., & garcía peñalvo, f. j. (2013). medición en la reutilización orientada a objetos.
mcleod, l., & macdonell, s. g. (2011). factors that affect software systems development project outcomes: a survey of research. acm computing surveys (csur), 43(4), 24.
medina, y. t. (2012). modelado de procesos con idef en la metodología rup. serie científica - universidad de las ciencias informáticas, 5(2).
méndez, a. l. d. (2007). la entrevista y los grupos focales.
montoni, m. a., rocha, a. r., & weber, k. c. (2009). mps.br: a successful program for software process improvement in brazil. software process: improvement and practice, 14(5), 289-300.
murcia-oeste–arrixaca, á. i. (2013). manual para el diseño de procesos.
northrop, l., clements, p., bachmann, f., bergey, j., chastek, g., cohen, s., . . . little, r. (2007). a framework for software product line practice, version 5.0. software engineering institute. http://www.sei.cmu.edu/productlines/index.html
oficina nacional de normalización. (2015a). nc-iso 9001 sistema de gestión de la calidad — requisitos.
oficina nacional de normalización. (2015b). nc-iso 9000 sistema de gestión de la calidad — fundamentos y vocabulario.
oktaba, h. (2005). modelo de procesos para la industria de software - moprosoft - versión 1.3, agosto de 2005: nmx-059/01-nyce-2005.
oktaba, h. (2015). historia de una norma. moprosoft y sus primeros pasos. retrieved 2015, from http://sg.com.mx/content/view/390
pérez, d. m. (2014). guía general para un modelo cubano de desarrollo de aplicaciones informáticas. universidad de las ciencias informáticas. retrieved from https://repositorio.uci.cu/jspui/handle/ident/8725
pérez, d. m., & aveleira, d. q. (2016). evolución del modelo de la calidad para el desarrollo de aplicaciones informáticas. paper presented at the xvi convención y feria internacional informática 2016, la habana. http://www.informaticahabana.cu/es/node/664
pressman, r. s. (2010). ingeniería de software: un enfoque práctico (séptima edición). méxico.
rosato, m. (2018). go small for project success. pm world journal, vii(v).
salazar, l. l. (2017). desarrollo del proceso solución técnica para los proyectos de desarrollo de la universidad de las ciencias informáticas. universidad de las ciencias informáticas (uci).
silega, m. n. (2014). método para la transformación automatizada de modelos de procesos de negocio a modelos de componentes para sistemas de gestión empresarial. universidad de las ciencias informáticas (uci).
softex. (2009a). mps.br mejora de proceso del software brasileño. guía de implementación – parte 4: fundamentos para implementación del nivel d del mr-mps.
softex. (2009b). mps.br mejora de proceso del software brasileño. guía de implementación – parte 1: fundamentos para implementación del nivel g del mr-mps.
sommerville, i. (2011). ingeniería de software (novena edición). méxico.
suárez, b. a. (2013). marco de procesos para las entidades de servicios de tecnología de la información de la universidad de las ciencias informáticas. universidad de las ciencias informáticas (uci).
suárez, b. a., sánchez, o. y., muñoz, r. m., ruenes, c. s. b., gómez, b. c., gutierrez, f. l. m., & calunga, á. a. (2016). modelo de calidad para el desarrollo de aplicaciones informáticas: categoría de gestión de proyecto. paper presented at the xvi convención y feria internacional informática 2016, la habana.
team, c. p. (2010). cmmi® for development, version 1.3: improving processes for developing better products and services. no. cmu/sei-2010-tr-033, software engineering institute.
the standish group international. (2014). the standish group report.
the standish group international. (2015). chaos report 2015.

journal of software engineering research and development, 2021, 9:7, doi: 10.5753/jserd.2021.1049. this work is licensed under a creative commons attribution 4.0 international license.
representation of software design using templates: impact on software quality and effort
silvana moreno [ universidad de la república, uruguay | smoreno@fing.edu.uy ]
vanessa casella [ universidad de la república, uruguay | vcasella@fing.edu.uy ]
martín solari [ universidad ort uruguay | martin.solari@ort.edu.uy ]
diego vallespir [ universidad de la república, uruguay | dvallesp@fing.edu.uy ]

abstract

as a practice, software design seeks to contribute to developing quality software. during this software development stage, the requirements are translated into a representation of the software (also known as the design), whose quality can be evaluated and improved. for undergraduate students, design is difficult to understand and to carry out; in fact, building a good design seems to require a certain level of cognitive development that few students achieve. the aim of this study is to know the effort dedicated to detailed software design and the effect on software quality when graduating students use templates to represent their design. we conducted a controlled experiment in which students developed eight projects following a defined process and recording data from its execution in a software tool. we found that the use of design templates did not improve the quality of the code, measured as the defect density in the unit test phase. likewise, the use of templates did not reduce the number of code smells in the analyzed code. regarding effort, students who used templates dedicated greater development effort to designing than to coding, while students who did not use templates dedicated four times less effort to designing than to coding.

keywords: detailed design, software quality, graduating students

1 introduction

software design is one of the most important components to ensure the success of a software system (hu, 2013). between the requirements analysis phase and the software building phase, software design has two main activities: architectural design and detailed design. during architectural design, high-level components are structured and identified; during detailed design, every component is specified in detail (bourque and fairley, 2014). this work focuses specifically on detailed design. design is a difficult discipline for undergraduate students to understand, and success (i.e., building a good design) seems to require a certain level of cognitive development that few students achieve (carrington and kim, 2003; hu, 2013; linder et al., 2006). students' ability to build a good design is related to their abstraction, understanding, reasoning, and data-processing abilities (kramer, 2007; leung and bolloju, 2005; siau and tan, 2005). building quality software is increasingly relevant: we highly depend on software in our daily lives, and its quality has a great impact. a quality software design allows us to build quality software that has fewer defects and is more maintainable. industry practitioners are aware of the importance of software design quality, and they use clean code practices, reviews, and tools, among others, to contribute in this regard (brown et al., 1998; fowler, 2018; stevenson and wood, 2018). knowing how undergraduate students design is of interest to several authors (chen et al., 2005; eckerdal et al., 2006a,b; loftus et al., 2011; tenenberg, 2005). most of their studies found that students do not manage to produce a good software design.
some of the problems detected are a lack of consistency between design artifacts and code, incomplete designs, and a lack of understanding of what kind of information to include when designing software (eckerdal et al., 2006a,b; loftus et al., 2011). in this work, we study the software design practice of graduating students. we conducted an experiment within the context of some courses over three consecutive years to know the effort dedicated to software design and the effect that the representation of design using specific templates has on software quality. we use the term graduating for our students because they are in the fourth year of their degree at the school of engineering of universidad de la república, in uruguay. the curriculum of the school of engineering is a five-year degree, similar to the ieee/acm proposal for the computer science undergraduate curriculum (joint task force on computing curricula - acm and ieee computer society, 2013). the students have already passed courses where detailed software design is taught: design principles, artifacts and design diagrams, uml, design patterns, etc. this work is an extension of the article published at the iberoamerican conference on software engineering (cibse) 2020: "the representation of detailed design using templates and their effects on software quality". our article was selected for publication in a special issue of the journal of software engineering research and development (jserd). below, we detail the extension of our work with respect to the cibse article. the work presented at cibse 2020 aims to know the effect on software quality when graduating students use templates to represent the detailed design. in that work we presented an empirical study where students developed 8 projects following a defined process and recording data from the execution in a tool. we found that the use of design templates did not improve the quality of the code, measured as the defect density in the unit test phase; neither did the use of templates manage to reduce the number of code smells present in the analyzed code. the extension carried out in this work consists, on the one hand, of expanding and deepening aspects that, for space reasons, are not in the cibse article; on the other hand, we add a new research question and its analysis, which makes it possible to know the effort implied by the use of design templates. specifically, a new section explaining the experimental design in depth was added. the analysis of external quality was expanded and deepened: descriptive statistics were added and analyzed, and tables were added with the data of the average defect density in ut for the students. in addition, a statistical analysis that checks the homogeneity of the studied groups (trd, notrd) was added within the between-group analysis. threats to validity were expanded, grouping them by type (construct, internal, external, conclusion), and the discussion and conclusions sections were expanded. a research question was added that seeks to know the effort that students dedicate to design, and how that effort varies with the use of templates.
to answer this question, the relationship between the effort dedicated to the design phase and the effort dedicated to the coding phase was studied. descriptive and statistical analyses are presented as part of the analysis of results, and the results obtained are discussed and related to those previously obtained in the discussion section.

the document is structured as follows: section 2 presents related work; section 3 presents the research methodology; section 4 presents the results, and section 5 discusses them; threats to validity are examined in section 6, and section 7 presents the conclusions and future work.

2 related work

software design is an important activity to ensure the quality of a software system (hu, 2013; taylor, 2011). it involves identifying and abstractly describing the software system and its relationships. good designs help develop robust, maintainable software with few defects (pierce et al., 1991; sommerville, 2016). detailed software design is a creative activity that can be done in different ways: implicitly, in the developer's mind before coding; as a sketch on paper; or through diagrams, using formal or informal languages and tools (chemuturi, 2018).

software quality is the degree to which a software product meets stakeholders' needs, both explicit and implicit. quality models represent quality in terms of a set of elements of the model and their relationships (nistala et al., 2019). these models define internal and external software quality attributes. the internal attributes are those that do not depend on the execution of the software (static), while the external ones are those that apply to its execution.

in recent years, the use of clean code practices and tools has contributed to improved design quality (stevenson and wood, 2018). code smells, anti-patterns, and design flaws can be used to measure the quality of a software design (martin, 2002; gibbon, 1997; brown et al., 1998; fowler, 2018). sonarqube (campbell and papapetrou, 2013) and findbugs (ayewah et al., 2008) are some of the tools used to measure the quality of code by detecting bad smells.

current industry practice requires practitioners with the skills needed to understand and build good software designs. however, students have difficulties designing. building good designs requires a certain level of cognitive development that few students achieve (carrington and kim, 2003; hu, 2013; linder et al., 2006). this cognitive development is related to the ability to recognize design patterns, architectural design styles, and related data and actions that can be extracted into appropriate design abstractions (hu, 2013). in fact, for students, learning to design is more difficult than learning to code. this difficulty occurs because, for most programming languages, students get compiler feedback and run-time errors; this does not happen with design (karasneh et al., 2015).

object-oriented design (ood) is one of the most widely used design approaches in industry and one of the subjects normally taught at universities (flores and medinilla, 2017). by using oo modeling diagrams and languages, static and dynamic models of software systems can be created. several empirical studies analyze the understanding and benefits of using uml diagrams (budgen et al., 2011; fernández-sáez et al., 2013; arisholm et al., 2006; gravino et al., 2015; torchiano et al., 2017).
in some studies, students failed to obtain design benefits from uml diagrams (gravino et al., 2015; torchiano et al., 2017). gravino et al. found that students who use uml diagrams to design do not achieve significant improvements in source code comprehension tasks compared to students who do not use them. also, students who use diagrams spend twice as much time on the same source code comprehension task as students who do not use them. when analyzing the experience factor, they found that the most experienced students do achieve an improvement in the understanding of the source code (gravino et al., 2015; soh et al., 2012).

among industry professionals, the use of uml continues to meet a certain degree of resistance (stevenson and wood, 2018). a survey conducted on 50 software professionals indicates that although software quality is an important aspect, the use of uml is selective (informal, only for a while, then discarded) and of low frequency (petre, 2013).

the use of the model-driven development (mdd) methodology to design software has shown improvements in software quality. panach et al. conducted an experiment and found that students using mdd achieve better quality products (measured through test cases) than students using a traditional software development method (panach et al., 2021).

undergraduate students' design skills are reported by previous studies that examine artifacts produced by students to learn how they design software (chen et al., 2005; eckerdal et al., 2006a,b; loftus et al., 2011; tenenberg, 2005). these studies use the same requirements specification, for which students must produce a design, and follow different approaches: designs produced individually, designs made in groups, and designs produced at different levels of training. in general, all the works mentioned agree that graduating students are not capable of designing a software system. lack of consistency between design artifacts and code, incomplete designs, and lack of understanding of what kind of information to include when designing software are some of the major difficulties reported (eckerdal et al., 2006a,b; loftus et al., 2011).

we believe, just as loftus et al. (loftus et al., 2011), that students do not know precisely what to do when they have to design software. besides, several authors analyzed the artifacts produced and agree that students do not know how to design (chen et al., 2005; eckerdal et al., 2006a,b; loftus et al., 2011; tenenberg, 2005). this motivated the work presented in this paper, in which we provide students with design templates as a support tool for design representation. unlike gravino and torchiano, who analyzed the benefits of using diagrams for code comprehension (gravino et al., 2015; torchiano et al., 2017), our approach tries to analyze the effort dedicated to designing and coding, and the impact of the use of templates on software quality. we studied quality from two perspectives: defects in the code and code smells. we also analyzed effort as the time in minutes that students dedicate to the design and code phases.

the focus of our research is ood at the class level, including source code organization, the identification of and relationships between classes, and the interaction of users with the system. as kitchenham pointed out (kitchenham and pfleeger, 1996), this corresponds to the "product view", an examination of the inside of a software product.
we used an approach focused on objects because a large part of current software is developed using that technology (group, 2015).

3 research methodology

we studied the effect of design on software quality when graduating students represent their design using a specific set of templates, and the effort they dedicate to the design activity. we conducted three experiments within the context of three consecutive undergraduate courses, from 2015 to 2017.

3.1 course context

the course principles and foundations of personal software process (pf-psp) has the same format every year and lasts 9 weeks. in the first week (week 1), a base process is taught, and the dynamics of the practical work to be done throughout the remaining eight weeks are explained. students participate in the course on a voluntary basis.

the base process is a defined and disciplined process that intends to support the software development tasks and to collect product and process metrics. the process has different phases, scripts that guide the work in each phase, and logs that are used to collect data (see figure 1). the base process is divided into the following phases: plan, design, code, compile, unit test (ut), and postmortem. to follow the process, students are provided with a set of scripts. a script is a one-page guide that establishes the inputs, outputs, and activities to be carried out in each phase. scripts help students guide the development activities without demanding how they must be carried out. in each phase of the process, students must log the time dedicated to the phase, as well as data on the defects they remove (injection phase, removal phase, type of defect, and time spent correcting it). in the postmortem phase, students log the size in lines of code (loc) of the program built.

figure 1. base process

the practical work consists of each student developing 8 small projects following the base process and recording the process data in a tool. students carry out the projects individually and consecutively: project 2 does not begin until project 1 has been completed, and so on with the remaining projects. from week 2 to week 9, one project is assigned per week. at the beginning of each week, a teacher sends the student the requirements of each project. each student's submission must contain the code that solves the problem, the test cases executed, and the export of the data registered in the tool. once the student submits the solution, the teacher reviews the work and sends corrections back to the student if necessary. students carry out the projects at home and have an assigned teacher, who is responsible for assigning the projects, correcting them, and answering questions.

before starting project 1, each student must choose the programming language to use throughout the course. our interest is to collect data on the execution of the development process with a programming language familiar to the student. projects are small in size and of low and similar difficulty, so the design phase refers to detailed design (i.e., identifying classes, attributes, operations, program scenarios, state diagrams, and pseudocode).

the nature of project 2 is different from that of the other projects. in project 2, students have to build size-measuring software, while in the remaining projects they must produce mathematical solutions (standard deviation, simpson's rule, correlation parameters).
previous studies show that process measures and product measures in project 2 present greater difficulty than in the rest of the projects (i.e., project 2 is an outlier), and it is usually discarded in statistical analyses (grazioli et al., 2014b; moreno and vallespir, 2018). therefore, we excluded the data of this project from the analyses presented in this article. however, it is relevant to mention that project 2 is an integral part of our course, since students use it from projects 3 to 8 to count the lines of code they produce in each project. percentiles 5 and 95 of the data collected for all the students throughout the 8 projects are 26 loc and 242 loc, respectively.

each replication of the experiment corresponds to a different run of the course. students who participated in one course do not participate again in a later course. the participating teachers were the same throughout the three courses (2015-2017).

3.2 goals and research questions

the aims of the experiment are to know the effect on software quality when students represent their designs using templates, and to study the effort they dedicate to the design activity. templates are documents with a predefined structure in which students have to represent their designs. the templates we used allow describing the detailed design of a project. we used four templates; a brief description of each of them is presented below:

• operational template: specifies the interaction between the program and the users. the content may look similar to a use-case description.
• functional template: the behavior of the program's invocations and returns is specified in this template. variables, functions, classes, and methods are described. figure 2 presents an example of the use of this template for project 6.
• logical template: in this template, the pseudocode of each method that appears in the functional template is registered.
• state template: it can be used to define the transitions and conditions of the program's internal states. the content is similar to state machine diagrams.

the selected templates come from the personal software process (psp) framework (humphrey, 1995). the psp considers a design to be complete when it defines all four dimensions (internal-static, internal-dynamic, external-static, external-dynamic); each dimension corresponds to one of the four templates (operational, functional, logical, state). completing the four templates allows describing a design entirely and precisely (humphrey, 1995). several studies have shown an improvement in developer performance after the introduction of templates (hayes and over, 1997; prechelt and unger, 2001; gopichand et al., 2010).

in the context of the experiment, we proposed the following research questions and the corresponding research hypotheses:

rq1: is there an improvement in the quality of the products when students represent the design using templates?
rq2: what is the relation between the effort dedicated to designing and the effort dedicated to coding? are there any variations in effort when students use templates?

to answer rq1, we analyzed the external and internal quality of the software developed in each project.
to study the external quality, we considered the following research hypotheses:

h1.0: representing software design using design templates does not change the software defect density in unit testing.
h1.1: representing software design using design templates changes the software defect density in unit testing.

to study the internal quality, we descriptively analyzed certain code smells introduced by students when producing software (fowler, 2018). we are interested in knowing whether the use of templates to represent software design prevents students from incurring in some types of code smells.

to answer rq2, we studied the time spent on the design and code phases. we analyzed the following research hypotheses:

h2.0: the time spent on designing equals the time spent on coding.
h2.1: the time spent on designing does not equal the time spent on coding.

3.3 experimental design

our design is a repeated-measures design with one factor (the base process) and two levels: with templates to represent the software design and without templates to represent the software design. the response variables considered in this experiment are internal and external software quality, and the effort dedicated by the students to the design and code phases.

our experimental design implies that students develop 8 projects. the base process introduces practices in the first 2 projects that allow guiding the work and measuring the process. therefore, during the first or second project (depending on the subject), students are already following the process adequately.

people show high variability among themselves when applying software development techniques or processes (humphrey, 2005). when high variability among people exists in an experiment with human subjects, a within-subjects design is preferable to a between-subjects experiment (senn, 2002). moreover, in repeated-measures experiments, subjects serve as their own control (jones and kenward, 2014). this reinforces the choice of our design, in which each student carries out several projects.

the effect of students' learning throughout these 8 exercises could be a problem in our experimental design. however, this was previously studied from different approaches, and the results indicate that repetition of programming did not contribute to performance improvements (grazioli et al., 2014b; grazioli and nichols, 2012; grazioli et al., 2014a).

as already mentioned, to evaluate the external quality, we considered the defect density in the unit test phase of the base process. that is to say, the number of defects detected in that phase is counted and divided by the loc of the project. to evaluate the internal quality, we analyzed the code smells in which students incur. knowing the number of code smells present in the product's source code gives us an idea of future maintenance costs (fowler, 2018). the effort in design and code is measured as the time in minutes that the student dedicates to the phase in question.

the experimental design is presented in figure 3. all students apply the base process in projects 1 to 4, in which submitting the design representation to the teachers is not required. when students finished project 4, they were divided randomly into two groups: the control group and the experimental group. the control group, called "without templates to represent the design" (notrd), continues to apply the base process throughout projects 5 to 8.
the experimental group, called "with templates to represent the design" (trd), starts to apply the templates from project 5 to project 8.

figure 2. functional template

the trd group attends a theoretical class where the four design templates are presented and explained (and examples are shown). the submission of the design representation is mandatory for this group (except for the state template, which is optional). when a student submits a project, the assigned teacher checks the completeness of the templates and their consistency with the code. in this way, the risk of students designing one solution and then coding another is reduced. however, whether the design is complete and verifiable is not controlled.

our experimental design allows us to study the behavior of the groups before and after the use of the templates. on the one hand, we propose to analyze the trd (representing design using templates) and notrd (representing design without templates) groups during projects 1 to 4 to confirm that they are homogeneous groups; that is, that the quality of the software developed is similar in both groups in projects 1-4 (when students do not use templates in either group). on the other hand, we are interested in knowing whether students who use templates develop better-quality software. we propose studying the trd and notrd groups during projects 5 to 8 to know whether representing the design using templates has some effect on software quality.

3.4 operation

the experiment was replicated in the course for three years: 2015, 2016, and 2017. the number of students that took part in the experiment was 25, 17, and 19, respectively. out of the 61 students participating in the experiment, 29 are part of the trd group and 32 of the notrd group. this unbalance between the groups is due to the unbalance generated when students were assigned to the trd and notrd groups in each of the three replications.

4 analysis and results

to answer rq1, "is there any improvement in the quality of the products when students represent the design using templates?", we analyzed quality from the internal and external points of view.

4.1 external quality

we measured the external quality as the defect density in ut, that is, the number of defects in ut per kloc. to analyze the external quality, we defined the following research hypotheses:

h1.0: representing software design using design templates does not change the software defect density in ut.
h1.1: representing software design using design templates changes the software defect density in ut.

we analyzed the external quality in two ways: intra-group and between groups. between groups refers to knowing whether there is a significant difference in quality between the trd group and the notrd group. intra-group refers to studying the quality of the software in the trd group before and after the use of templates.

between groups

the analysis between groups consists, on the one hand, of analyzing the trd and notrd groups during projects 1, 3, and 4, and, on the other hand, of analyzing the trd and notrd groups during projects 5 to 8. due to the difficulty of project 2 compared with the rest of the projects, we decided not to include this project's data in the analysis. during projects 1, 3, and 4, both groups apply the base process, so comparing the software quality of both groups during those projects allows confirming that they are homogeneous groups, thus establishing the experimental frame.
for this analysis, we defined the following hypotheses:

h1.0: median (def. density in ut of notrd) = median (def. density in ut of trd)
h1.1: median (def. density in ut of notrd) ≠ median (def. density in ut of trd)

figure 3. experimental design

each sample corresponds to the average defect density in ut of a student considering projects 1, 3, and 4:

\[ 1000 \cdot \frac{\sum_{n \in \{1,3,4\}} \#defects^{ut}_{n}}{\sum_{n \in \{1,3,4\}} \#loc_{n}} \tag{1} \]

where n ranges over projects 1, 3, and 4.

during the analysis, we detected that the data from one student of the trd group was not accurate; that is, the process followed had not been accurately recorded. data from that student was therefore eliminated from the analysis, leaving 28 students in the trd group.

the descriptive statistics of the trd and notrd groups considering projects 1, 3, and 4 are presented in table 1. the values of the mean and interquartile range indicate that there does not seem to be great variability between the groups.

table 1. mean and interquartile range in projects 1, 3 and 4

group    mean    interquartile range
trd      30.22   25.54
notrd    32.88   28.9

to confirm this, we applied the mann-whitney test for independent samples, since the samples correspond to different students. the result indicates a p-value = 0.3467, with which we cannot reject the null hypothesis (significance = 0.05). this result does not allow us to affirm that there is a difference in quality between the trd and notrd groups; we can assert that both groups have a similar, homogeneous behavior. this gives us more confidence to study the software quality between the trd and notrd groups after the use of templates, eliminating the possibility that a result is due to the behavior of the groups rather than to using or not using templates.

studying the trd and notrd groups during projects 5 to 8 aims to know whether representing the design using templates has some effect on software quality. for the analysis between groups during projects 5 to 8, we defined the following hypotheses:

h1.0: median (def. density in ut of notrd) = median (def. density in ut of trd)
h1.1: median (def. density in ut of notrd) ≠ median (def. density in ut of trd)

table 2 presents the average defect density in ut for the 28 students of the trd group and the 32 students of the notrd group in projects 5 to 8. the values of the mean and of the interquartile range shown in table 3 indicate low variability between the groups; that is to say, the use of templates by the trd group does not produce a significant difference in defect density compared to the notrd group not using templates.

to study the behavior of both groups we used hypothesis tests. the samples are independent because they correspond to different students; thus, the mann-whitney test is applied. results indicate a p-value = 0.165; therefore, the null hypothesis cannot be rejected. thus, we cannot affirm that the students who use the templates manage to develop software with lower ut defect density than students who do not use templates.

intra-group

as already mentioned, the intra-group analysis refers to knowing whether students of the trd group improve the software quality after the use of templates to prepare the design. to know this, the defect density in ut of the trd group is analyzed in projects 1 to 4 (without project 2) and in projects 5 to 8. studying the behavior of the same group allows knowing whether there is a change in software quality after the use of templates. we define the following research hypotheses:
h1.0: median (def. density in ut of trd134) = median (def. density in ut of trd58)
h1.1: median (def. density in ut of trd134) ≠ median (def. density in ut of trd58)

where trd134 denotes the students of the trd group during projects 1, 3, and 4, and trd58 denotes the same students during projects 5 to 8.

table 4 presents the defect density in ut for the students of the trd group in projects 1, 3, and 4, and for the same students in projects 5 to 8. the descriptive statistics presented in table 5 indicate some variability in defect density. even though the means are similar, it seems that using templates (from project 5 on) to represent the design achieves products with fewer defects.

to statistically study the data, we applied the wilcoxon signed-rank test for paired samples (because for this analysis the data come from the same students). results indicate a value of v = 138 and a p-value = 0.1438. since the p-value is higher than 0.05 (significance level), it is not possible to reject the null hypothesis. this indicates that we cannot affirm that students improve the quality of their software by using design templates.

table 2. average defect density in ut for the students of the trd group and the notrd group in projects 5 to 8

student   trd     notrd
1         8.83    27.98
2         23.16   24.86
3         33.78   23.59
4         40.76   14.35
5         83.33   21.37
6         16.10   12.19
7         5.74    22.79
8         13.02   43.33
9         28.07   27.02
10        12.5    36.46
11        9.49    38.98
12        19.70   16.80
13        11.70   37.65
14        36.85   18.93
15        20.53   18.25
16        22.93   22.98
17        11.80   47.12
18        37.45   30.21
19        26.05   35.03
20        5.03    27.84
21        23.35   12.22
22        17.36   24.57
23        10.08   15.65
24        42.75   41.17
25        33.43   44.89
26        28.63   20.35
27        44.02   38.80
28        23.88   51.54
29        -       7.85
30        -       27.89
31        -       24.24
32        -       25.49

table 3. mean and interquartile range in projects 5 to 8

group    mean    interquartile range
trd      24.65   21.2
notrd    27.57   16.9

4.2 internal quality

to evaluate the internal quality, we carried out an analysis of the code smells introduced by students when developing the course projects. the aim of this analysis is to investigate whether the use of design templates prevents students from incurring in certain code smells. the analysis presented is preliminary and exploratory, seeking to obtain initial results that allow us to generate new research hypotheses.

the code smell types depend on the programming language. as students can choose the language in which they develop their projects, this analysis has to take into account the different languages used. with the aim of doing an initial analysis that added value to our research, the students who developed their projects in java, c#, c, c++, and ruby were selected, excluding those who developed in php and python. we excluded php and python because they do not have many code smells in common with the other languages; had we included them, the number of code smells to analyze would have been reduced too much. this left a total of 45 students for the analysis: 19 from 2015, 14 from 2016, and 12 from 2017.
of those 45 students, 21 belong to the trd group (9 in 2015, 6 in 2016, and 6 in 2017) and 24 to the notrd group (10 in 2015, 8 in 2016, and 6 in 2017).

to detect the code smells, the sonarqube tool (http://www.sonarqube.org) was used, since it is a free-software tool that supports a variety of programming languages, is constantly updated by the community, and has extensive documentation, among other reasons. we selected 16 code smell types for the analysis. these are common to the programming languages we chose and are detectable by sonarqube. the code smell types are: 1) "if ... else if" statements must end with an "else" clause; 2) "switch"/"case" statements must not be nested; 3) "switch"/"case" statements must not have too many "case"/"when" clauses; 4) the cognitive complexity of functions or methods must not be too high; 5) collapsible "if" statements must be merged; 6) the "if", "for", "while", "switch", and "try" control-flow statements must not be nested too deeply; 7) expressions must not be too complex; 8) files must not have too many lines of code; 9) functions or methods must not have too many lines of code; 10) functions or methods must not have too many parameters; 11) lines of code must not be too long; 12) functions or methods must not be empty; 13) statements must be on separate lines; 14) two branches of one conditional structure must not have the exact same implementation; 15) unused parameters of a function or method must be eliminated; 16) unused local variables must be eliminated. a more detailed description of each one is not provided for article-length reasons.

table 6 shows the percentage of students that incurred in at least one code smell, segmented by project (from 1 to 8) and by group (notrd and trd). code smells 3, 8, and 12 are not present in any of the projects analyzed.

when comparing the notrd and trd groups in the table, as of project 5 (after using templates) great variability arises, both when considered per project and when considered per code smell. for code smells 4, 7, 10, and 13, one group is better for certain projects and the other group is better for certain other projects. for code smells 1, 2, 5, 6, 9, and 14, the difference between groups is very small. to sum up, no changes after using templates are observed for any of these code smells.

for code smell 11, a much lower percentage is observed in projects 5 and 7, and a lower percentage in project 8, on the part of the group using templates; in project 6, both groups have almost identical behavior. from the point of view of the templates, it may be the pseudocode template that is helping the students decrease the introduction of this code smell.

code smells 15 and 16 show a similar behavior. in both cases, the trd group almost never incurs in them, while the notrd group does, sometimes in a high percentage. number 15 refers to parameters not used in methods, and 16 to local variables not used. clearly, these types of code smells can be avoided with good software design (a minimal illustration is sketched below). from the point of view of the use of templates, it may be that the development of pseudocode (logical template) and the functional template are preventing the students of the trd group from incurring in these code smells. in any case, it is necessary to manually analyze the templates submitted by the students and to interview them to know whether this is happening for the reasons described; this has not been done yet.
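to make code smells 15 and 16 concrete, the sketch below is our own hypothetical example, not an actual student submission: it shows the dead parameter and dead local variable that a sonarqube-style analyzer would flag, and how deciding the signature at design time (e.g., in the functional template) removes both smells. it mirrors the standard-deviation projects of section 3.1; python is used only for brevity, even though python submissions were excluded from the smell analysis itself.

```python
import math

def std_dev_smelly(values, precision):
    # smell 15: the parameter 'precision' is declared but never used
    count = len(values)  # smell 16: local variable assigned but never read
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))

def std_dev_clean(values):
    # thinking the interface through before coding leaves no dead
    # parameters or leftover locals for the analyzer to flag
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))

if __name__ == "__main__":
    data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
    print(std_dev_smelly(data, 2))  # 2.0, but with two smells
    print(std_dev_clean(data))      # 2.0, smell-free
```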
however, when analyzing the table considering only the data of the trd group throughout the 8 projects, we do not see that the use of templates improves the internal quality. it is worth noting that this group normally did not incur in code smells 15 and 16 (or did so in a very low percentage). observing projects 1 to 4 and 5 to 8 separately, we do not see any difference between them; that is, the behavior of this group before using templates and while using them does not change for these code smells. so, the difference presented in the previous analysis between the trd and notrd groups does not seem to be a response to the use of templates. something similar happens with code smell 11: results do not show a decrease of this code smell when using templates.

it can be observed that in project 8 the percentage of occurrence of code smells 4, 9, and 10 increases significantly for both groups. this increase makes us think that project 8 is more complex for the students. these three code smells indicate that the code developed is too complex and long to comprehend; that is, the use of templates did not help the students elaborate a less complex and more understandable design.

putting both analyses together, we conclude that the use of templates does not improve the internal quality. more precisely, the use of templates does not seem to have an effect on the code smells in which the students incur when designing software.

table 4. defect density in ut for the students of the trd group in projects 1, 3 and 4, and in projects 5 to 8

student   defect density 1, 3 and 4   defect density 5 to 8
1         2.22                        8.83
2         7.22                        23.16
3         35.33                       33.78
4         14.24                       40.76
5         95.74                       83.33
6         17.85                       16.10
7         10.14                       5.74
8         21.18                       13.02
9         15.54                       28.07
10        39.80                       12.5
11        13.79                       9.49
12        18.31                       19.70
13        10.23                       11.70
14        60.60                       36.85
15        32.60                       20.53
16        25.83                       22.93
17        51.09                       11.80
18        48.78                       37.45
19        39.63                       26.05
20        15.56                       5.03
21        30.70                       23.35
22        25.77                       17.36
23        9.72                        10.08
24        32.71                       42.75
25        10.05                       33.43
26        42.70                       28.63
27        16.87                       44.02
28        102.04                      23.88

table 5. mean and interquartile range for the trd group

projects     mean    interquartile range
1, 3 and 4   30.22   25.5
5 to 8       24.65   21.2

4.3 effort dedicated to designing and coding

to answer rq2, "what is the relation between the effort dedicated to designing and the effort dedicated to coding? are there any variations in effort when students use templates?", we analyzed the following hypothesis test:

h2.0: median (tcod) <= median (tdld)
h2.1: median (tcod) > median (tdld)

as part of the base process, each student registered the time spent in the design phase (tdld) and the time spent in the code phase (tcod) for each project. to know the effort dedicated to designing and coding by the group that uses the templates and the group that does not, we analyzed both groups independently during projects 5 to 8; that is, on the one hand we carried out the analysis of the trd group during projects 5 to 8, and on the other hand that of the notrd group during the same projects. for each student, we calculated the time spent in design and the time spent in code for projects 5 to 8. the calculation for each pair of data is the following:

\[ \left( \sum_{n=5}^{8} tdld_{n},\ \sum_{n=5}^{8} tcod_{n} \right) \tag{2} \]

where tdld_n is the time spent in the design phase for project n, and tcod_n is the time spent in the code phase for project n, with n varying from 5 to 8.
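as an illustration of the aggregation in equation (2), the sketch below computes the per-student (tdld, tcod) pairs from a flat time log. the record layout ('student', 'project', 'phase', 'minutes') and the tiny log are assumptions for illustration only, not the actual export format of the course tool.

```python
from collections import defaultdict

def effort_pairs(time_log, projects=range(5, 9)):
    """return {student: (tdld, tcod)} summed over the given projects,
    as in equation (2)."""
    pairs = defaultdict(lambda: [0, 0])
    for student, project, phase, minutes in time_log:
        if project in projects:
            if phase == "design":
                pairs[student][0] += minutes  # accumulate tdld
            elif phase == "code":
                pairs[student][1] += minutes  # accumulate tcod
    return {s: tuple(t) for s, t in pairs.items()}

# toy usage with fabricated log entries
log = [("s1", 5, "design", 40), ("s1", 5, "code", 55),
       ("s1", 6, "design", 35), ("s2", 5, "code", 70)]
print(effort_pairs(log))  # {'s1': (75, 55), 's2': (0, 70)}
```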
table 7 presents the 28 data pairs (tdld, tcod) for the trd group and the 32 data pairs for the notrd group. table 8 presents the mean and the interquartile range for the trd group and the notrd group. the mean value of the trd group shows that the use of templates demands more design time compared with the group that did not use templates. furthermore, the design time in the case of trd exceeds the time spent on coding. regarding tcod's mean, even though it is similar in the trd and notrd groups, a decrease is observed in the trd group. although the decrease is not significant, the use of templates might have helped students code in less time.

to determine the statistical test that best fits the problem, the distribution of the data was studied first. applying the kolmogorov-smirnov test to the trd group gives a significance value of 0.00478, indicating that the values do not fit a normal distribution. applying the kolmogorov-smirnov test to the notrd group returns 7.713e-12 as a significance value, so these values do not fit a normal distribution either.

as the data of neither group follow a normal distribution, wilcoxon's test for paired samples is used. the samples of each group are paired, since the sampled pairs (tdld, tcod) correspond to the same student. we executed the test for the trd group and for the notrd group independently.

for the notrd group, we proposed to find the value of x such that tcod = x*tdld. we analyzed the following hypothesis test:

h2.0: median (tcod of notrd) <= median (x*tdld of notrd)
h2.1: median (tcod of notrd) > median (x*tdld of notrd)

when executing the test for the notrd group with x=1, the null hypothesis is rejected (p-value = 4.169e-07 at a significance level of 0.05), confirming that the coding time is greater than the designing time. to know the relationship between these times (tcod = x*tdld), we applied the test again, multiplying tdld by an integer x and increasing x until the null hypothesis could no longer be rejected. table 9 presents the results of the wilcoxon test. the results indicate that for x=1, x=2, and x=3 the null hypothesis is rejected, so the coding time is greater than 3 times the design time. for x=4, the null hypothesis cannot be rejected (p-value = 0.541). in other words, students who did not use templates generally spent at least 3 times more time on coding than on designing.

in the case of the trd group, the mean value shows that students tend to dedicate more time to design than to code. therefore, we carried out the analysis in an inverse way, calculating x such that x*tcod = tdld. we analyzed the following hypothesis test:

h2.0: median (x*tcod of trd) >= median (tdld of trd)
h2.1: median (x*tcod of trd) < median (tdld of trd)

when executing the wilcoxon test for the trd group with x=1, the null hypothesis is rejected (p-value = 0.0007155), confirming that the coding time is less than the designing time. to know how many times more time students spent on designing, we applied the test again, multiplying tcod by an integer x until the null hypothesis could no longer be rejected. table 10 presents the results of the wilcoxon test applied to the trd group. the results indicate that for x=2 the null hypothesis cannot be rejected (p-value = 0.998). so, students who use templates spend more time designing than coding, but not double.
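the multiplier search just described can be made concrete with a short sketch. it assumes scipy's one-sided paired wilcoxon test (alternative="greater"); the sample pairs are the first eight notrd rows of table 7, used only for illustration, so the p-values will not reproduce those of table 9, which were computed over all 32 pairs.

```python
import numpy as np
from scipy.stats import wilcoxon

def effort_ratio(tdld, tcod, alpha=0.05, max_x=10):
    """smallest integer x for which we can no longer reject
    h0: median(tcod) <= median(x * tdld)."""
    tdld = np.asarray(tdld, dtype=float)
    tcod = np.asarray(tcod, dtype=float)
    for x in range(1, max_x + 1):
        # one-sided paired test of tcod > x * tdld
        _, p = wilcoxon(tcod, x * tdld, alternative="greater")
        if p >= alpha:
            return x, p  # cannot reject: tcod is not > x * tdld
    return None

# first eight notrd (tdld, tcod) pairs from table 7, for illustration
tdld = [60, 44, 51, 63, 16, 53, 100, 67]
tcod = [172, 369, 446, 350, 245, 302, 427, 289]
print(effort_ratio(tdld, tcod))
```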
taken together, these results indicate that the group that used templates dedicated a greater effort to design than the group that did not use templates. to confirm that the relationship between designing time and coding time obtained for the trd group is due to the use of templates and not to another factor dependent on the group, we studied the relationship (tcod, tdld) during projects 1, 3, and 4 (without using templates). table 11 presents the mean and the interquartile range of the pairs (tdld, tcod) for the trd group in projects 1, 3, and 4. the values of the descriptive statistics of the trd group in projects 1, 3, and 4 are similar to those of the notrd group. in other words, during projects in which students design without using templates, the time spent on design is significantly less than the time spent on coding.

table 6. percentage of students who incur in at least one code smell, by code smell type and student group

code smell   group    p1    p2    p3    p4    p5    p6    p7    p8
1            notrd    4%    29%   0%    4%    13%   13%   4%    13%
             trd      19%   19%   10%   0%    5%    5%    5%    5%
2            notrd    0%    0%    0%    0%    0%    0%    0%    0%
             trd      0%    0%    0%    0%    0%    0%    0%    5%
4            notrd    8%    58%   0%    13%   30%   46%   29%   50%
             trd      24%   43%   5%    10%   10%   43%   24%   95%
5            notrd    4%    21%   0%    0%    0%    0%    0%    0%
             trd      0%    24%   10%   0%    0%    5%    0%    5%
6            notrd    13%   63%   8%    29%   30%   38%   13%   42%
             trd      38%   67%   29%   29%   33%   52%   57%   62%
7            notrd    0%    25%   0%    0%    0%    4%    8%    0%
             trd      0%    19%   0%    0%    0%    5%    0%    5%
9            notrd    0%    4%    8%    17%   10%   21%   21%   67%
             trd      0%    10%   19%   14%   10%   29%   38%   71%
10           notrd    0%    0%    0%    0%    0%    0%    8%    54%
             trd      0%    0%    5%    0%    0%    0%    19%   38%
11           notrd    4%    46%   42%   8%    40%   4%    46%   75%
             trd      0%    29%   29%   0%    14%   5%    24%   62%
13           notrd    0%    0%    0%    0%    10%   0%    0%    4%
             trd      5%    0%    5%    0%    0%    0%    5%    19%
14           notrd    0%    8%    0%    0%    10%   0%    0%    0%
             trd      0%    0%    0%    0%    0%    0%    0%    0%
15           notrd    0%    0%    8%    4%    20%   0%    13%   17%
             trd      0%    0%    0%    0%    5%    0%    0%    0%
16           notrd    8%    13%   8%    8%    40%   8%    17%   29%
             trd      5%    5%    10%   10%   0%    0%    10%   10%

table 7. data pairs (tdld, tcod) for the trd group and the notrd group

trd tdld   trd tcod   notrd tdld   notrd tcod
178        263        60           172
748        217        44           369
940        621        51           446
522        249        63           350
178        61         16           245
204        221        53           302
163        371        100          427
295        212        67           289
665        265        64           243
175        272        23           464
626        329        31           350
407        169        65           460
757        407        23           248
238        228        18           184
392        269        132          347
288        249        163          225
212        210        140          197
278        150        116          354
573        274        69           205
518        199        33           229
336        398        193          226
453        108        58           329
401        222        103          206
330        360        83           168
515        493        43           241
327        242        92           187
160        169        21           481
296        213        107          304
-          -          35           236
-          -          205          468
-          -          64           224
-          -          168          194

table 8. mean and interquartile range for the notrd and trd groups

group         mean    interquartile range
trd tdld      399.1   287.5
trd tcod      265.7   26.7
notrd tdld    19.5    15.7
notrd tcod    292.8   132

table 9. wilcoxon test (p-values) for the notrd group in projects 5 to 8

x=1         x=2         x=3       x=4
4.169e-07   4.088e-05   0.03861   0.541

table 10. wilcoxon test (p-values) for the trd group in projects 5 to 8

x=1         x=2
0.0007155   0.998

table 11. mean and interquartile range of the pairs (tdld, tcod) for the trd group in projects 1, 3 and 4

phase   mean   interquartile range
tdld    43     41.5
tcod    242    118

table 12 presents the results of executing wilcoxon's test to analyze the relation tcod = x*tdld for the trd group in projects 1, 3, and 4.

table 12. wilcoxon test (p-values) for the trd group in projects 1, 3 and 4

x=1         x=2         x=3         x=4       x=5
3.725e-09   3.725e-08   0.0002701   0.01245   0.09678

the results indicate that for x=5 the null hypothesis cannot be rejected (p-value = 0.09678). students of the trd group in projects 1, 3, and 4 generally spent at least 4 times more time on coding than on designing.
this result shows that there is an increase in the time dedicated to design after the students of the trd group begin to use the design templates.

5 discussion

in the context of our experiment, we found that design representation using templates produced an increase in the time spent designing (which we expected). however, it did not help to develop better-quality software products, neither from an internal point of view nor from an external one. results show that the use of templates improved neither the number of defects in the developed code (measured as defect density in ut) nor the internal quality (measured as the number of code smells in the code). these results are related to those reported by gravino (gravino et al., 2015), where the use of uml diagrams did not achieve any improvement in the comprehension of the source code compared with not using them.

in addition, the analysis of the relation between the effort dedicated to coding and the effort dedicated to designing showed that the use of templates produced an increase in design time. students who did not use the templates tended to spend 3 times more on code than on design, while students who used templates spent more time designing than coding. moreover, students in both groups spent similar time on coding, and before using templates the students in the trd group behaved similarly to the notrd group. we can conclude, then, that using templates to represent design increases the effort dedicated to design but does not have a significant positive effect on quality or on reducing coding time.

this can be due to several factors that we must analyze in the future. it could be, among other reasons, that students are not used to these templates and so did not get the expected benefit; it could be that they just filled in the templates without, at that moment, caring to think through or develop a quality system; it could be that students do not know how to design (as found in other studies); or, as mentioned by chaiyo (chaiyo and ramingwong, 2013), it could be that the templates are difficult for students to use.

we believe that students do not have the habit of designing and thinking of a solution before coding. although we think that the use of templates should be helpful, we believe that the students filled them in to achieve the goal without thinking of a design solution; rather, we believe that the usual student practice is code-and-fix. even though more analysis is needed, we agree with several authors that graduating students have difficulties designing and do not seem to understand what type of information to include in a software design (eckerdal et al., 2006a,b; loftus et al., 2011).

6 threats to validity

most empirical studies are threatened by the way the research is conducted (wohlin et al., 2012). this section describes the threats to validity we have detected.

internal validity threats: investigating with students involves several threats. on the one hand, the fact that the context of the experiment is a course implies that the students do not develop naturally. we tried to minimize this threat with a non-graded course, that is, the student either passed or failed. besides, we stressed the importance of monitoring and registering the process just as it was, and we emphasized that students' assessments would not depend on results, defects found, or efforts made. on the other hand, there is a threat that students share information or solutions to projects.
in this sense, the assigned teachers reviewed the submissions and compared them between students to ensure there were no duplicate submissions. in addition, students carry out their projects at home, which limits control by teachers. to reduce this threat, we introduced supervision, corrections, and feedback between the student and the assigned teacher. besides, for the analysis we aggregated the data of the three courses, knowing that the different courses can influence the data collected, this being a hierarchical model. we tried to reduce this threat through the use of a defined and disciplined process that the students followed, and by keeping the same material and the same teachers throughout the three courses.

external validity threats: experimenting with students of a course has the advantage that they are available and willing to participate in experiments, and the disadvantage that their characteristics cannot be generalized. in our experiment, students took part in the pf-psp course voluntarily and did not know that they were part of an experiment until they finished the course. this reduces to a minimum the bias they might have from feeling part of a research study. conversely, the results obtained in this experiment cannot be generalized to the design practice of students in other contexts.

construct validity threats: this kind of threat is related to the way the response variables were measured. in our experiment, we measured effort as the time in minutes that the student spends on each phase, and quality as the number of defects in ut and the number of code smells in which students incur. to ensure correct data recording, we used a data recording tool and a framework that allows a disciplined and measurable process to be followed.

conclusion validity threats: the number of students in the research constitutes a threat to the statistical conclusions. 61 students participated during the three replications. this forces the statistical analysis to be carried out using non-parametric tests, whose statistical power is lower than that of parametric tests. as a measure against this threat, we complemented the non-parametric tests with descriptive statistics.

7 conclusions

this work is one step further towards understanding the software design practice. the results of our experiment show that graduating students do not improve software quality when using templates for design representation; moreover, using templates produces a significant increase in the time spent on the design phase without reducing coding time.

we analyzed software quality from the internal and external points of view, and also the effort dedicated to design. on the one hand, we statistically showed that using templates for design representation does not improve the external software quality, measured as the defect density in unit testing. from the internal quality perspective, the use of templates does not have a significant positive effect on the code smells in which students incur when designing software.

regarding effort, students who used templates dedicated a greater effort to designing than to coding (though not double). meanwhile, students who did not use templates dedicated four times less effort to designing than to coding.

our results are related to those reported by gravino and torchiano (gravino et al., 2015; torchiano et al., 2017), where the use of uml diagrams to design did not yield significant improvements in source code comprehension tasks.
also, regarding effort, students who use diagrams spend twice as much time on the same source code comprehension task as students who do not use them. gravino analyzes the experience factor and finds that the most experienced students achieve an improvement in the understanding of the source code (gravino et al., 2015). although we did not analyze the experience factor of the graduating students, it could be an analysis to perform in the future.

our research focuses on graduating students, most of them working in the uruguayan software industry as junior engineers. these engineers usually perform programming tasks, which include low-level design. the results obtained in our experiment cannot be generalized to all junior developers, and even less to senior developers.

our results raise new questions about the practice of software design: what do students usually design? what kind of information do they include when designing? is it possible for them to produce their designs mentally, without representing them? do they know the effect of a good design on software quality?

continuing with this line of research, in 2018 we executed an experiment that sought to know how students usually design. students performed the same 8 projects during this experiment and delivered the design representation made in a natural way (without templates). although we have not yet finalized the data analysis, a preliminary analysis shows that students do not deliver complete designs; in general, they use informal/natural language and, in a few cases, incomplete class diagrams. studying the students' habitual behavior when designing software should help identify potential problems in design practices and find better ways of teaching the skills for developing quality software. in 2019 and 2020, no experiments could be performed, but in 2021 we are replicating the 2018 experiment to obtain more data. as future work, we will finish the above-mentioned analysis to identify potential problems in design practices and find better ways of teaching the skills for developing quality software. we also plan to analyze the designs produced with the templates to know what students design, and to conduct interviews with students to know their experience using the templates. on the other hand, we find it interesting to experiment with a simple mdd tool to know its effect on software quality.

references

arisholm, e., briand, l. c., hove, s. e., and labiche, y. (2006). the impact of uml documentation on software maintenance: an experimental evaluation. ieee transactions on software engineering, 32(6).
ayewah, n., pugh, w., hovemeyer, d., morgenthaler, j. d., and penix, j. (2008). using static analysis to find bugs. ieee software, 25(5).
bourque, p. and fairley, r. e. (2014). guide to the software engineering body of knowledge - swebok v3.0. ieee computer society.
brown, w. h., malveau, r. c., mccormick, h. w., and mowbray, t. j. (1998). antipatterns: refactoring software, architectures, and projects in crisis. john wiley & sons, inc.
budgen, d., burn, a. j., brereton, o. p., kitchenham, b. a., and pretorius, r. (2011). empirical evidence about the uml: a systematic literature review. software: practice and experience, 41(4):363–392.
campbell, g. a. and papapetrou, p. p. (2013). sonarqube in action. manning publications co.
carrington, d. and kim, s. k. (2003). teaching software design with open source software.
in 33rd annual frontiers in education.
chaiyo, y. and ramingwong, s. (2013). the development of a design tool for personal software process (psp). in 10th international conference on electrical engineering/electronics, computer, telecommunications and information technology, pages 1–4.
chemuturi, m. (2018). software design: a comprehensive guide to software development projects. crc press/taylor & francis group.
chen, t.-y., cooper, s., mccartney, r., and schwartzman, l. (2005). the (relative) importance of software design criteria. sigcse bull., 37(3):34–38.
eckerdal, a., mccartney, r., moström, j. e., ratcliffe, m., and zander, c. (2006a). can graduating students design software systems? in sigcse bull., pages 403–407. acm, association for computing machinery.
eckerdal, a., mccartney, r., moström, j. e., ratcliffe, m., and zander, c. (2006b). categorizing student software designs: methods, results, and implications. computer science education, 16(3):197–209.
fernández-sáez, a., genero, m., and chaudron, m. (2013). empirical studies concerning the maintenance of uml diagrams and their use in the maintenance of code: a systematic mapping study. information and software technology, 55:1119–1142.
flores, p. and medinilla, n. (2017). conceptions of the students around object-oriented design: a case study. in xii jornadas iberoamericanas de ingeniería de software e ingeniería del conocimiento.
fowler, m. (2018). refactoring: improving the design of existing code. addison-wesley professional.
gibbon, c. a. (1997). heuristics for object-oriented design. phd thesis, university of nottingham.
gopichand, m., swetha, v., and ananda rao, a. (2010). software defect detection and process improvement using personal software process data. in international conference on communication control and computing technologies, pages 794–799.
gravino, c., scanniello, g., and tortora, g. (2015). source-code comprehension tasks supported by uml design models: results from a controlled experiment and a differentiated replication. journal of visual languages & computing, 28:23–38.
grazioli, f. and nichols, w. (2012). a cross course analysis of product quality improvement with psp. in team software process symposium 2012, pages 76–89.
grazioli, f., nichols, w., and vallespir, d. (2014a). an analysis of student performance during the introduction of the psp: an empirical cross-course comparison. in team software process symposium 2013, pages 11–21.
grazioli, f., vallespir, d., pérez, l., and moreno, s. (2014b). the impact of the psp on software quality: eliminating the learning effect threat through a controlled experiment. adv. soft. eng., 2014.
group, s. (2015). the chaos report. the standish group.
hayes, w. and over, j. (1997). the personal software process (psp): an empirical study of the impact of psp on individual engineers. technical report cmu/sei-97-tr-001, software engineering institute, carnegie mellon university, pittsburgh, pa.
hu, c. (2013). the nature of software design and its teaching: an exposition. acm inroads, 4(2).
humphrey, w. (2005). psp: a self-improvement process for software engineers. addison-wesley professional.
humphrey, w. s. (1995). a discipline for software engineering. addison-wesley longman publishing co., inc.
joint task force on computing curricula - acm and ieee computer society (2013). computer science curricula 2013: curriculum guidelines for undergraduate degree programs in computer science.
association for computing machinery, new york, ny, usa.
jones, b. and kenward, m. g. (2014). design and analysis of cross-over trials. chapman and hall/crc, 3rd edition.
karasneh, b., jolak, r., and chaudron, m. r. v. (2015). using examples for teaching software design: an experiment using a repository of uml class diagrams. in 2015 asia-pacific software engineering conference.
kitchenham, b. and pfleeger, s. l. (1996). software quality: the elusive target. ieee software, 13(1):12–21.
kramer, j. (2007). is abstraction the key to computing? commun. acm, 50(4):36–42.
leung, f. and bolloju, n. (2005). analyzing the quality of domain models developed by novice systems analysts. in 38th hawaii international conference on system sciences.
linder, s. p., abbott, d., and fromberger, m. j. (2006). an instructional scaffolding approach to teaching software design. journal of computing sciences in colleges, 21.
loftus, c., thomas, l., and zander, c. (2011). can graduating students design: revisited. in proceedings of the 42nd acm technical symposium on computer science education. acm.
martin, r. c. (2002). agile software development: principles, patterns, and practices. prentice hall.
moreno, s. and vallespir, d. (2018). ¿los estudiantes de pregrado son capaces de diseñar software? estudio de la relación entre el tiempo de codificación y el tiempo de diseño en el desarrollo de software. in conferencia iberoamericana de ingeniería de software 2018.
nistala, p., nori, k. v., and reddy, r. (2019). software quality models: a systematic mapping study. in 2019 ieee/acm international conference on software and system processes, pages 125–134.
panach, j. i., dieste, o., marín, b., españa, s., vegas, s., pastor, o., and juristo, n. (2021). evaluating model-driven development claims with respect to quality: a family of experiments. ieee transactions on software engineering, 47(1):130–145.
petre, m. (2013). uml in practice. international conference on software engineering, 35.
pierce, k., deneen, l., and shute, g. (1991). teaching software design in the freshman year. in software engineering education. springer berlin heidelberg.
prechelt, l. and unger, b. (2001). an experiment measuring the effects of personal software process (psp) training. ieee transactions on software engineering, 27(5):465–472.
senn, s. (2002). cross-over trials in clinical research. john wiley & sons, ltd, 2nd edition.
siau, k. and tan, x. (2005). improving the quality of conceptual modeling using cognitive mapping techniques. data & knowledge engineering, 55(3), quality in conceptual modeling.
soh, z., sharafi, z., van den plas, b., cepeda porras, g., guéhéneuc, y.-g., and antoniol, g. (2012). professional status and expertise for uml class diagram comprehension: an empirical study. in ieee international conference on program comprehension.
sommerville, i. (2016). software engineering. pearson.
stevenson, j. and wood, m. (2018). recognising object-oriented software design quality: a practitioner-based questionnaire survey. software quality journal, 26.
taylor, r. n. (2011). conference welcome message. in proc. 33rd international conference on software engineering. association for computing machinery.
tenenberg, j. (2005). students designing software: a multi-national, multi-institutional study. informatics in education, 4.
torchiano, m., scanniello, g., ricca, f., reggio, g., and leotta, m. (2017). do uml object diagrams affect design comprehensibility? results from a family of four controlled experiments.
journal of software engineering research and development, 2021, 9:6, doi: 10.5753/jserd.2021.1094  this work is licensed under a creative commons attribution 4.0 international license.

everest: an automatic model-based testing tool for asynchronous reactive systems*

adilson luiz bonifacio [ universidade estadual de londrina | bonifacio@uel.br ]
camila sonoda gomes [ universidade estadual de londrina | camilasonoda@uel.br ]

abstract
reactive systems are characterized by their interaction with the environment, where the exchange of input and output stimuli usually occurs asynchronously. systems of this nature, in general, require a rigorous testing activity in the development process. model-based testing has therefore been successfully applied to asynchronous reactive systems using the input/output labeled transition system (iolts) as the base formalism. in this work, we present a reactive testing tool to check conformance, generate test suites, and run test cases using iolts models. our tool can check whether the behavior of an implementation under test (iut) complies with the behavior of its respective specification. we have implemented two conformance relations in our tool: the classical ioco relation and a conformance relation based on regular languages. the tool also provides test suite generation in a black-box testing setting for finding faults in iuts according to a specific domain. in addition, we describe some case studies to probe the tool's functionalities and give a comparative analysis. finally, we offer practical experiments to evaluate the performance of our tool in several scenarios.
keywords: model-based testing, conformance checking, test generation, reactive systems, automatic tool

1 introduction

several real-world systems are ruled by reactive behaviors that constantly interact with the environment by receiving input stimuli and producing outputs in response. systems of this nature, in general, are also critical, thus requiring precise and automatic support in the development process. model-based testing methods and their respective tools have been largely applied in the testing activity when developing systems. the input/output labeled transition system (iolts) (tretmans, 2008) has been commonly employed as the formalism for modeling and testing asynchronous reactive systems. an iolts can then specify desired behaviors of an implementation candidate, and the testing task can be applied to find faults in it.

one important issue of model-based testing is conformance checking, where we verify whether a given implementation under test (iut) complies with its corresponding specification according to a certain fault model. here we treat the classical notion of input/output conformance testing (ioco) (tretmans, 2008) and the more recent conformance relation based on regular languages (bonifacio and moura, 2019) to define fault models.
test generation is also an important task of model-based testing, especially when generating test cases for reactive systems in a black-box setting. in this work, we present an automatic tool named everest¹ (gomes and bonifacio, 2019) that can check conformance between a given iut and its respective specification. our tool can also generate test suites based on specifications modeled by ioltss and enable black-box testing over iuts.

we show that everest has a wider range of applications when compared to other testing tools since it implements not only the classical ioco relation but also the more recent language-based conformance checking. we also describe real-world scenarios to relate both approaches, where the language-based conformance method was able to find faults that cannot be detected using the ioco relation. further, experiments are performed to evaluate our tool when generating and running test suites in a black-box scenario, and also to compare everest against a well-known tool from the literature (belinfante, 2010a) w.r.t. the conformance checking task.

the remainder of this paper is organized as follows. we comment on related works in section 2. section 3 describes the conformance checking approaches and the test suite generation method. in section 4 we discuss important aspects comparing everest to another tool from the literature and also present a real-world case study. practical experiments on conformance checking and test suite generation are given in section 5. section 6 offers some concluding remarks and future directions.

* supported by capes. ¹ conformance verification on testing reactive systems.

2 related works

reactive systems have been properly specified by iolts models to describe their syntax and semantics. hence model-based testing techniques and practical tools have been applied in testing activities to support system development using ioltss. several works have therefore studied aspects of iolts-based testing such as test generation, conformance relations, and their checking methods. here we survey some works that are more closely related to our testing tool and its features.

the ioco relation has been proposed by tretmans (2008) for iolts models, where iuts are treated as black boxes, i.e., the tester, seen as an artificial environment that drives the test activity, has no access to the internal structure of the iuts. however, some restrictions must be guaranteed over the specification, iut, and tester models, such as input-completeness and output-determinism. further, the algorithms therein are more theoretical and may lead to infinite test suites, making it harder to devise solutions for practical applications.

an ioco-based testing theory has also been proposed by de vries (2001) to obtain e-complete test suites. this approach focuses on specific test purposes that share particular properties related to certain testing goals. only observable behaviors based on specific criteria are considered when testing black-box iuts, which means that the test purposes somewhat limit the fault coverage spectrum, e.g., producing inconclusive verdicts. the test generation method also produces large, even infinite, test suites, thus requiring a test selection criterion to avoid this problem.
simão and petrenko (2014) have described an approach to generate finite ioco-complete test suites for a class of iolts models. however, their approach imposes a number of restrictions on the models: test purposes must be single-input and output-complete, and specifications and iuts must be input-complete, progressive, and initially-connected. so the class of iolts models that can be tested is very restricted according to their fault model.

roehm et al. (2016) have introduced a conformance test based on safety properties. despite being a weaker relation than trace-inclusion conformance, it allows for tuning a trade-off between accuracy and computational complexity when checking conformance. so their approach searches for counter-examples instead of verifying the whole system. however, this approach and the previous ones have a more theoretical leaning, and we are not aware of practical tools implementing their algorithms.

a more recent work has been proposed by bonifacio and moura (2019), where few restrictions are considered and finite sets of test purposes can be generated in practical situations. in some rare cases their algorithm may lead to exponential-sized testers, but the approach allows for a wider class of iolts models, and a low-degree polynomial time algorithm is devised for efficiently testing ioco-conformance in practical applications.

in this work, we have implemented this more general and recent approach (bonifacio and moura, 2019) with the language-based and ioco conformance relations, as well as test suite generation for white-box and black-box scenarios. in the literature, we have found jtorx (belinfante, 2014, 2010b), a closely related testing tool that implements the ioco relation and the uioco variation for underspecified models. tgv (mark utting, 2007; calamé, 2005; jard and jéron, 2005) is also a testing tool designed for checking ioco-conformance, similarly to testor (marsso et al., 2018), an on-the-fly test case generation tool. however, the test generation methods of tgv and testor are only sound, i.e., they are not exhaustive, so we cannot get complete test suites. further, the soundness property of the generated test suites is only guaranteed by both tools for specific test purposes.

although several other tools have been proposed for model-based testing, many of them move away from the scope of our work. for instance, some tools and approaches implement variations of the ioco theory, e.g. the rioco and sioco relations, such as stg (symbolic test generator) (clarke et al., 2002), torxakis (mostowski et al., 2009), and uppaal-tron (larsen et al., 2005), which deal with symbolic and timed models.

3 a model-based testing method

asynchronous reactive systems are commonly specified by iolts models, a variation of labeled transition systems (ltss) (tretmans, 1993) with a partitioning of input and output labels.

definition 1 an input/output labeled transition system (iolts) is a tuple s = (s, s0, li, lu, t) where:
• s is a finite set of states;
• s0 ∈ s is the initial state;
• li is a set of input labels;
• lu is a set of output labels;
• l = li ∪ lu and li ∩ lu = ∅;
• t ⊆ s × (l ∪ {τ}) × s is a finite set of transitions, where the internal action τ ∉ l; and
• (s, s0, l, t) is the underlying lts associated with s.

we indicate a transition by (s, l, r) ∈ t, where s ∈ s is the source state, r ∈ s is the target state, and l ∈ (l ∪ {τ}) is the label.
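to make definition 1 concrete, the sketch below encodes an iolts as a small python structure. this is our own illustration, not everest's api: every name here (IOLTS, TAU, DELTA) is hypothetical, and the assertions merely enforce the side conditions of the definition.

# a minimal python sketch of definition 1; names are ours, not everest's api.
from dataclasses import dataclass, field

TAU = "tau"      # internal action tau, not part of l
DELTA = "delta"  # quiescence label delta, attached to quiescent states later

@dataclass
class IOLTS:
    states: set
    initial: str
    inputs: set        # li
    outputs: set       # lu
    transitions: set = field(default_factory=set)  # triples (source, label, target)

    def __post_init__(self):
        # side conditions of definition 1
        assert self.inputs.isdisjoint(self.outputs), "li and lu must be disjoint"
        assert self.initial in self.states, "s0 must belong to s"
        for (s, l, r) in self.transitions:
            assert s in self.states and r in self.states
            assert l in self.inputs | self.outputs | {TAU}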
a transition (s, τ, r) ∈ t indicates an internal action, which means that an external observer cannot see the movement from s to r in the model. an iolts may also have quiescent states. a state s is quiescent if no output x ∈ lu nor an internal action τ is defined on it (tretmans, 2008). when a state s is quiescent, a transition (s, δ, s) is added to t, where δ ∉ lτ; note that l ∪ {τ} is denoted by lτ to ease the notation. we also note that in a real black-box testing scenario, where an iut sends messages to the tester and receives back responses, quiescence indicates that the iut can no longer respond to the tester, has timed out, or is slow (bonifacio and moura, 2019).

in what follows we define the semantics of iolts/lts models, but first we introduce the notion of paths.

definition 2 let s = (s, s0, l, t) be an lts and p, q ∈ s. let σ = l1 · · · ln be a word in lτ∗. we say that σ is a path from p to q in s if there are states ri ∈ s and labels li ∈ lτ, 1 ≤ i ≤ n, such that (ri−1, li, ri) ∈ t, with r0 = p and rn = q. we say that α is an observable path from p to q in s if α is obtained from a path σ from p to q by removing all internal actions τ.

a path can also be denoted by s −σ→ s′, where the behavior σ ∈ lτ∗ starts in the state s ∈ s and reaches the state s′ ∈ s. an observable path σ from s to s′ is denoted by s =σ⇒ s′. we can also write s −σ→ or s =σ⇒ when the target state is not important. all paths starting at a state s are called the paths of s. now we give the semantics of iolts/lts models.

definition 3 let s = (s, s0, l, t) be an lts and s ∈ s:
1. the set of all paths from s is denoted by tr(s) = {σ | s −σ→}, and the set of all observable paths from s is denoted by otr(s) = {σ | s =σ⇒}.
2. the semantics of s is given by tr(s0), also written tr(s), and the observable semantics of s is denoted by otr(s0), also written otr(s).

the semantics of an iolts is defined by the semantics of its underlying lts.

3.1 checking conformance on reactive systems

given an iolts specification, a conformance checking task can determine whether an iut complies with the corresponding specification according to a specific fault model. the classical ioco relation (tretmans, 2008) establishes a notion of conformance where input stimuli are applied to both the iut and the specification model to observe whether the outputs produced by the iut are also defined in the specification (bonifacio and moura, 2019).

definition 4 let s = (s, s0, li, lu, t) be a specification and let i = (q, q0, li, lu, r) be an iut. we say that i ioco s if, and only if, out(q0 after σ) ⊆ out(s0 after σ) for all σ ∈ otr(s), where s after σ = {q | s =σ⇒ q} for every state s. otherwise, i ioco s does not hold.

a more recent conformance relation (bonifacio and moura, 2019) has also been proposed using regular languages. given an iut i, a specification s, and regular languages d and f, we say that i complies with s according to (d, f), i.e., i confd,f s if, and only if, no undesirable behavior of f observed in i is specified in s, and every desirable behavior of d observed in i is also specified in s.

definition 5 given an alphabet l = li ∪ lu and languages d, f ⊆ l∗ over l, let s and i be iolts models over l. we have that i confd,f s if, and only if, (i) if σ ∈ otr(i) ∩ f, then σ ∉ otr(s); and (ii) if σ ∈ otr(i) ∩ d, then σ ∈ otr(s).
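definitions 2 to 5 become easier to experiment with once observable traces are computable. the sketch below, which builds on the hypothetical IOLTS class above, adds the quiescence self-loops and enumerates otr(s) up to a bounded length; since otr(s) is infinite in general, the length bound is our simplification for illustration.

from collections import deque

def add_quiescence(m):
    # a state is quiescent when neither an output nor tau is enabled on it;
    # we then add a delta self-loop (tretmans, 2008)
    for s in set(m.states):
        enabled = {l for (p, l, r) in m.transitions if p == s}
        if not enabled & (m.outputs | {TAU}):
            m.transitions.add((s, DELTA, s))

def observable_traces(m, max_len):
    # bounded enumeration of otr(s): tau-moves are erased from paths
    seen = {(m.initial, ())}
    frontier = deque(seen)
    traces = set()
    while frontier:
        state, trace = frontier.popleft()
        traces.add(trace)
        for (p, l, r) in m.transitions:
            if p != state:
                continue
            nxt = trace if l == TAU else trace + (l,)
            if len(nxt) <= max_len and (r, nxt) not in seen:
                seen.add((r, nxt))
                frontier.append((r, nxt))
    return traces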
the language-based notion, with its wider fault coverage, is established by proposition 1, where desirable and undesirable behaviors are specified by regular languages.

proposition 1 (bonifacio and moura, 2019). let s and i be iolts models over an alphabet l = li ∪ lu, and let d, f ⊆ l∗ be regular languages over l. then i confd,f s if, and only if, otr(i) ∩ [(d ∩ co-otr(s)) ∪ (f ∩ otr(s))] = ∅, where co-otr(s) = l∗ − otr(s) is the complement of otr(s).

both notions of conformance can be related by the following lemma, where the language-based conformance relation of definition 5 restrains the classical ioco relation of definition 4.

lemma 1 (bonifacio and moura, 2019). let i = (q, q0, li, lu, r) be an iut and let s = (s, s0, li, lu, t) be a specification. then i ioco s if, and only if, i confd,f s when d = otr(s)lu and f = ∅.

bonifacio and moura (2019) have proposed the language-based conformance checking using the theory of automata (sipser, 2006). lts/iolts models are transformed into finite state automata (fsas), where the semantics of an fsa is given by the language it accepts. so r ⊆ l∗ is regular if there exists an fsa m such that l(m) = r, where l is an alphabet. therefore we can effectively construct automata ad and af for the regular languages d and f, such that d = l(ad) and f = l(af). now we define test cases and test suites in terms of regular languages.

definition 6 let l be a set of symbols. a test suite t over l is a language t ⊆ l∗, where each σ ∈ t is a test case.

since a test suite is a regular language, there is always an fsa a that accepts it, where the final states are fault states. thus the set of undesirable behaviors, the so-called fault model of s (bonifacio and moura, 2019), is defined by the fault states.

therefore we can obtain a complete test suite for an iolts specification s and a pair of languages (d, f) using proposition 1. that is, we can detect the absence of desirable behaviors specified by d and the presence of undesirable behaviors specified by f using the test suite t = (d ∩ co-otr(s)) ∪ (f ∩ otr(s)). an iut i is then declared in compliance with a specification s if no test case of the test suite t is also a behavior of i (bonifacio and moura, 2019).

the testing process first obtains an automaton a1 induced by the iolts specification s. since l(a1) = otr(s), we can effectively construct an fsa a2 such that l(a2) = l(af) ∩ l(a1) = f ∩ otr(s). also, consider the fsa b1 obtained from a1 by reversing its set of final states, that is, a state s is a final state in b1 if, and only if, s is not a final state in a1. clearly, l(b1) = co-otr(s). we can now get an fsa b2 such that l(b2) = l(ad) ∩ l(b1) = d ∩ co-otr(s). since a2 and b2 are fsas, we can construct an fsa c such that l(c) = l(a2) ∪ l(b2) = t. we conclude that when d and f are regular languages and s is a deterministic specification, a complete fsa t can be constructed such that l(t) = t.
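the construction above works on fsas; to see the set algebra of proposition 1 in action without automata machinery, one can intersect bounded trace sets directly. the sketch below is our finite approximation, with a bounded universe of words standing in for l∗; all function names are hypothetical.

from itertools import product

def words_upto(alphabet, max_len):
    # all words over l with length <= max_len (a finite stand-in for l*)
    ws = {()}
    for n in range(1, max_len + 1):
        ws |= set(product(sorted(alphabet), repeat=n))
    return ws

def test_suite(spec_traces, d, f, alphabet, max_len):
    # t = (d intersect complement(otr(s))) union (f intersect otr(s)),
    # restricted to bounded-length words (proposition 1)
    universe = words_upto(alphabet, max_len)
    return (d & (universe - spec_traces)) | (f & spec_traces)

def conf_check(iut_traces, suite):
    # i conf_{d,f} s holds iff no test case of t is a behavior of the iut
    return iut_traces.isdisjoint(suite)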
next we state an algorithm with polynomial time complexity based on the language-based conformance relation.

proposition 2 (bonifacio and moura, 2019). let s and i be a deterministic specification and an implementation iolts over l with ns and ni states, respectively, and let |l| = nl. let ad and af be deterministic fsas over l with nd and nf states, respectively, such that l(ad) = d and l(af) = f. then we can effectively construct a complete fsa t with (ns + 1)² nd nf states, such that l(t) is a complete test suite for s and (d, f). moreover, there is an algorithm, with polynomial time complexity θ(ns² ni nd nf nl), that effectively checks whether i confd,f s holds.

now, using lemma 1, we establish a relationship between the ioco and language-based relations in theorem 1.

theorem 1 (bonifacio and moura, 2019). let s and i be deterministic ioltss over l = li ∪ lu with ns and ni states, respectively, and let |l| = nl. then we can effectively construct an algorithm with polynomial time complexity θ(ns ni nl) that checks whether i ioco s holds.

3.2 complete test suite generation

in this work, we also provide test suite generation in a black-box testing setting using the notion of test purposes (tretmans, 2008). a test purpose (tp) is formally defined by an iolts with two special states {pass, fail} and, in practice, it represents an external tester that interacts with an iut. thus a fault model is composed of tps derived from a given specification. to ease the notation, from now on we denote by io(li, lu) the class of all ioltss over l = li ∪ lu.

definition 7 let li and lu be the input and output alphabets, respectively, with l = li ∪ lu. a test purpose (tp) over l is an iolts t ∈ io(lu, li) such that for all σ ∈ l∗ neither fail =σ⇒ pass nor pass =σ⇒ fail holds. a fault model over l is a finite set of tps over l.

the test case generation proposed by tretmans (2008), based on the ioco relation, imposes some restrictions on the formal models. all tps must be acyclic, with a finite run, and input-enabled, since the tester cannot predict the output produced by a black-box iut; therefore all output actions that can be produced by the iut must be enabled in the respective tp. moreover, tps must be output-deterministic, i.e., each state can send only one output symbol to the iut, in order to avoid arbitrary and non-deterministic choices. in the pass and fail states only self-loop transitions are allowed, since verdicts are obtained in these states.

definition 8 let s ∈ io(li, lu). we say that s is output-deterministic if |out(s)| = 1 for all s ∈ s, and that s is input-enabled if inp(s) = li for all s ∈ s, where out(s) and inp(s) give the outputs and inputs, respectively, defined at state s.

hence all restrictions imposed by tretmans (2008) are satisfied when a tp is input-enabled, output-deterministic, and acyclic except for the pass and fail states. however, a bound on the number of states to be considered in the iuts must be imposed to keep the tps acyclic. the test suite completeness property is then guaranteed when, given a specification s, every iut i in the considered class passes the fault model if, and only if, i ioco s. therefore we define a class of implementations, establishing an upper bound on the number of states of the iuts, to guarantee the ioco completeness property when generating test suites.
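the two properties of definition 8 are straightforward to check on the iolts sketch introduced earlier; the predicates below are our illustration of those side conditions, not tool code.

def out_labels(m, s):
    # outputs defined at state s, i.e., out(s) in definition 8
    return {l for (p, l, r) in m.transitions if p == s and l in m.outputs}

def in_labels(m, s):
    # inputs defined at state s, i.e., inp(s) in definition 8
    return {l for (p, l, r) in m.transitions if p == s and l in m.inputs}

def is_output_deterministic(m):
    # every state offers exactly one output symbol
    return all(len(out_labels(m, s)) == 1 for s in m.states)

def is_input_enabled(m):
    # every state accepts every input of li
    return all(in_labels(m, s) == m.inputs for s in m.states)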
now we are in a position to construct a complete test suite using the notion of tps. first we generate a multigraph structure as proposed by bonifacio and moura (2019). given an iut i and a specification s, let m be the bound on the number of states to be considered in the iuts and let n be the number of states of s. the multigraph then has mn + 1 levels and, at each level, if a transition of s gives rise to a cycle, we create a transition onto the states of the next level of the multigraph instead. a fail state is also added, and new transitions from every state of the multigraph to fail are defined, labeled by every l ∈ lu that is not defined at that state.

having an acyclic multigraph at hand, we can extract tps using a simple breadth-first search from the initial state to fail. we guarantee the input-enabledness property by adding a pass state to the tp and, for every output of lu and every state, adding a transition to pass where that output is not defined. self-loops labeled by each l ∈ lu are also added to the pass and fail states. the output-determinism property is likewise guaranteed by adding a transition to pass from every state where no input of li is defined. note that we always refer to an input symbol of lu or an output symbol of li from the perspective of the iut, as commonly done in the literature (tretmans, 2008; bonifacio and moura, 2019).

the test run is then defined by the synchronous product of a tp t and an iut i, denoted i × t. the tp interacts with the iut, producing outputs that are sent to i as inputs; likewise, the iut receives actions from the tp and produces outputs that are sent to t as inputs. so the output alphabet of t corresponds to li, the input alphabet of the iut, and the input alphabet of t corresponds to lu, the output alphabet of the iut.

definition 9 let i = (si, q0, li, lu, ti) ∈ io(li, lu) be an implementation and t = (st, t0, lu, li, tt) ∈ io(lu, li) be a tp. we say that i passes t if for every σ ∈ (li ∪ lu)∗ and every state q ∈ si we do not have (t0, q0) =σ⇒ (fail, q) in t × i. a path can be denoted by q0 =σ⇒ q, where the behavior σ starts in the state q0 and reaches the state q. let m be a fault model; we say that i passes m if i passes all tps in m. then, given an iolts s and a set imp ⊆ io(li, lu)[m], we say that m is m-ioco-complete for s concerning imp if for all iuts i ∈ imp we have i ioco s if, and only if, i passes m.

the verdicts are obtained when tps reach the special states: the fail verdict reveals a faulty behavior, whereas the pass verdict denotes a desirable behavior. further details can be found in bonifacio and moura (2019) and gomes and bonifacio (2019). finally, the next proposition determines a fault model composed of tps obtained from a multigraph which, in turn, is constructed from the corresponding specification.

proposition 3 let s ∈ io(li, lu) be a deterministic iolts and m ≥ 1. then there is a fault model m that is m-ioco-complete for s relative to io(li, lu)[m], the ioltss with at most m states, whose tps are deterministic, output-deterministic, input-enabled, and acyclic except for the self-loops on the pass and fail states.
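the sketch below illustrates our reading of the level-unfolding construction on the iolts structure introduced earlier. the paper does not spell out how a cycle-closing transition is recognized, so we approximate it with a fixed total order on states; that choice, like every name here, is an assumption rather than the tool's actual algorithm.

def build_multigraph(spec, m):
    # level-unfolded acyclic multigraph (section 3.2): mn + 1 levels; a spec
    # transition that would close a cycle is redirected to the next level down.
    n = len(spec.states)
    levels = m * n + 1
    order = {s: i for i, s in enumerate(sorted(spec.states))}  # our simplification
    edges, fail = set(), "fail"
    for k in range(levels):
        for (s, l, r) in spec.transitions:
            if order[r] > order[s]:
                edges.add(((s, k), l, (r, k)))       # forward edge: same level
            elif k + 1 < levels:
                edges.add(((s, k), l, (r, k + 1)))   # cycle-closing edge: next level
        for s in spec.states:
            # every output of lu not defined at s leads to the fail state
            defined = {l for (p, l, _) in spec.transitions if p == s and l in spec.outputs}
            for o in spec.outputs - defined:
                edges.add(((s, k), o, fail))
    return edges, fail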
4 a testing tool for reactive systems

everest (gomes and bonifacio, 2019) has been developed to check conformance, generate test suites, and run tests over reactive systems specified by lts/iolts models. we have organized the tool's architecture in four modules: configuration; ioco conformance; language-based conformance; and test generation & run. the configuration module allows us to settle the testing scenario, and the conformance checking modules yield the testing verdicts. when an iut does not conform to the specification, our tool yields the verdict along with the paths induced by the test cases that detected the corresponding faults. the test generation & run module enables the multigraph and test purpose generation, and also allows for running test suites over the iuts.

in this section, we look over the conformance checking and test suite generation processes. first we present some general examples to compare the conformance checking processes of everest and jtorx. next we show how our test suite generation method using the language-based conformance stands out from the classical approach. finally we describe a real-world case study of an automatic teller machine (atm) to explore some real scenarios, and then give a comparative analysis of the practical tools.

4.1 conformance checking process

we apply some examples to explore characteristics of both the everest and jtorx tools when checking conformance. let s be the specification depicted in figure 1a and let r and q be the iuts depicted in figures 1b and 1c, respectively, with li = {a, b} and lu = {x}.

[figure 1. iolts models: (a) specification s; (b) iut r; (c) iut q]

in the first checking run we verified whether the iut r conforms to the specification s. our tool yielded a verdict of non-conformance and generated the test suite t1 = {b, aa, ba, aaa, ab, ax, abb, axb}. all test cases were induced by different paths that reach a fault and were extracted using a transition cover strategy over the specification. jtorx also yielded the same verdict for this first run, as expected, but it generated the test suite t2 = {b, ax, ab}. we can see that t2 ⊆ t1, i.e., jtorx generated only one test case per fault, in contrast to everest, which produced several test cases using transition coverage. hence everest provides a wider coverage, which can be more useful in a fault mitigation process.

in a second scenario, we checked the iut q against the specification s. this time no fault was detected by either tool using the classical ioco relation. however, everest could find a fault using the language-based conformance relation, where the set of desirable behaviors was specified by the regular language d = (a|b)∗ax and no undesirable behavior was defined, so f = ∅. the set d denotes behaviors induced by paths finishing with an input action a followed by an output x produced in response. a verdict of non-conformance was obtained by our tool, revealing a fault detected by the test suite t = {ababax, abaabax}. we remark that jtorx, using the classical ioco relation, was not able to detect this fault. so everest is more general in this sense and can be applied to a wider range of scenarios when compared to jtorx.

[figure 2. a direct acyclic multigraph d for specification s]
4.2 everest test suite generation

we have seen that conformance checking runs an iut against a given specification to yield test verdicts. when faults are detected, an associated test suite is generated with the test cases that reveal such faults. in addition, everest can also generate complete test suites relative to a given specification.

to illustrate the test suite generation process of everest, again assume s as the specification depicted in figure 1a. in the first step, a direct acyclic multigraph is constructed from the specification, as described in section 3. figure 2 partially depicts the multigraph, with four states at each level, since the specification s has four states (n = 4). every transition in the multigraph must go either to the next level, or from left to right at the same level, to secure the acyclic property. in this case we considered iuts with at most four states, i.e., the same number of states as in the specification (m = n = 4), so the multigraph has mn + 1 = 17 levels. figure 2 shows the first two and the last two levels of the multigraph; the fail state is replicated in order not to clutter the figure.

with the multigraph at hand, we can apply a breadth-first search algorithm to extract paths from the initial node s0,0 up to the fail state. we can take, for instance, the sequence α1 = aabbx. we see that α1 induces the path s0,0 → s1,0 → s3,0 → s0,1 → s3,1 → fail in the multigraph. from proposition 3 we can then obtain a deterministic, acyclic, input-enabled, and output-deterministic test purpose t1 over α1, as depicted in figure 3a.

[figure 3. tps from the multigraph of figure 2: (a) tp t1 induced by aabbx; (b) tp t2 induced by aaax]

note that the input-enabledness property is guaranteed by adding a pass state and transitions to pass from the states where no output is defined. the construction is completed by adding self-loops to the pass and fail states labeled by all output actions. regarding the output-determinism property, for every state where no input action is defined, we also create a new transition from this state to the pass state labeled by any input action.

for the sake of exemplification we take α2 = aaax as a distinct sequence. in the same way we obtain the induced path over the multigraph and construct the corresponding deterministic, acyclic, input-enabled, and output-deterministic test purpose t2, as depicted in figure 3b. everest has automatically constructed the remaining test purposes as well, based on the paths induced by the set {α1, α2, x, aδ, bx, δx, aax, bbx, axδ, abδ, δbx, bδx, aabx, bbbx, aaδx} of fifteen sequences.

from the tps of figure 3, everest could generate the test suite t = {α1, α2, b, δ, bδ, bx, ab, aδ, aaδ, aaa, aab, aabδ, aaba, aaaa, aaab, aaaxδ, aaaxx, aabba, aabbb, aabbxδ, aabbxx}. we then applied the test suite t to the iut r, and a fault could be detected. by simple inspection we see that all test cases that lead r from state q0 back to the same state q0 can detect this fault: the output x is produced at state q3 of r, whereas x is not defined at state s3 of s. so everest exhibits a verdict of non-conformance, which means that r does not pass the test suite, declaring that r ioco s does not hold.
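the path extraction step above can be pictured as a plain breadth-first search over the multigraph edges produced by the earlier sketch; the function below collects label sequences that reach fail, each of which would seed one test purpose. the limit parameter is our addition to keep the enumeration finite.

from collections import deque

def fault_sequences(edges, start, fail="fail", limit=20):
    # bfs from the initial multigraph node to fail (section 4.2);
    # terminates because the multigraph is acyclic and limit bounds the output
    found, frontier = [], deque([(start, ())])
    while frontier and len(found) < limit:
        node, word = frontier.popleft()
        if node == fail:
            found.append(word)
            continue
        for (p, l, r) in edges:
            if p == node:
                frontier.append((r, word + (l,)))
    return found

for the multigraph sketched earlier, one would call fault_sequences(edges, ("s0", 0)); under our simplified cycle criterion the extracted sequences may differ from the paper's α1 and α2.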
4.3 a real-world case study

now we present a real-world system to be put under test using the automatic tools. we specify functionalities of an automatic teller machine (atm) (mark utting, 2007; naik and tripathy, 2018) using an iolts model with the input stimuli li = {ic, pin, acc, tra, sta, wd, amo} and the output responses lu = {cpi, bpi, mon, rec, ins, sho}. the intended meanings of the input actions are: ic denotes the action when the user inserts his/her card into the atm; pin indicates the pin code has been provided by the user; tra requires a transfer amount; acc indicates that a target account has been provided; sta requires an account statement; wd indicates that the user has requested a withdrawal; and amo denotes the balance amount. the meanings of the output alphabet are: cpi says the pin code is correct; bpi says the provided pin is wrong; mon indicates the money has been released; rec indicates the receipt has been provided to the user; ins denotes an insufficient balance on the account; and sho indicates the statement has been shown to the user.

we model the withdrawal operation by the iolts a of figure 4. note that if the requested amount (amo) is greater than the available amount (ins), then the withdrawal cannot be performed and the process reaches state s3, where a new withdrawal operation can be requested again.

[figure 4. atm specification a]

some additional functionalities are specified by the iolts b of figure 5. in this case we consider not only the withdrawal (wd) operation but also the transfer (tra) and statement (sta) operations.

[figure 5. atm specification b]

assume the iolts z depicted in figure 6 as an iut that implements the withdrawal (wd) and transfer (tra) operations. we observe that if the requested amount (amo) in a withdrawal is greater than the available amount, then the iut reaches the state s7, where the user can choose a new amount again.

[figure 6. iut z]

as a first testing scenario we check whether the iut z conforms to the specification a. we run jtorx and everest over these models to obtain conformance verdicts using the ioco relation. both tools returned the same verdict: z complies with a. in a second round, we ran everest using the language-based conformance relation and, this time, a fault could be detected in the iut z. the set of desirable behaviors was given by d = {ic pin cpi wd amo ins amo}, i.e., a sequence of actions where the account balance is not enough for the requested withdrawal and the user must provide a new value. everest generated the test case ic → pin → cpi → wd → amo → ins → amo because the behavior specified in d is not an observable behavior of the specification model, but the iut z implements it.
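to make the scenario reproducible with the sketches above, the snippet below encodes the withdrawal specification a as we read it from figure 4 and the surrounding prose. only the figure caption survives in this extraction, so the exact source and target of each transition (e.g., where bpi and mon lead) is our assumption.

# atm specification a (figure 4) on the hypothetical IOLTS class;
# the transition wiring is reconstructed from the prose, not from the figure.
atm_a = IOLTS(
    states={"s0", "s1", "s2", "s3", "s4", "s5"},
    initial="s0",
    inputs={"ic", "pin", "wd", "amo"},
    outputs={"cpi", "bpi", "mon", "ins"},
    transitions={
        ("s0", "ic", "s1"),    # user inserts the card
        ("s1", "pin", "s2"),   # pin code provided
        ("s2", "cpi", "s3"),   # pin accepted
        ("s2", "bpi", "s1"),   # wrong pin, ask again (assumption)
        ("s3", "wd", "s4"),    # withdrawal requested
        ("s4", "amo", "s5"),   # amount provided
        ("s5", "mon", "s0"),   # money released (assumption: back to start)
        ("s5", "ins", "s3"),   # insufficient balance: new withdrawal possible
    },
)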
in a second scenario we want to verify the reliability of the verdicts obtained by jtorx using the ioco relation. recall that originally underspecified models (see section 4.4) require some labor: self-loop transitions must be added so that all states become completely specified, handing over an input-enabled model. notice that the iut z is underspecified, so jtorx must change the model to guarantee the input-enabledness property on the iut. after changing the model, we check whether z ioco-conforms to the specification b using jtorx. it is easy to see that the original behavior of the iut has been modified and, in this case, a fault is then detected by the test case ic → pin → cpi → sta.

we also applied this second scenario to everest using the ioco relation. in the opposite direction to jtorx, no fault was detected by everest, since the faulty behavior ic → pin → cpi → sta is not specified in the iut z. we see that the detection of this fault by jtorx is, in fact, a false positive, due to an extra behavior added when changing the iut z to make it input-enabled. we also remark that everest can detect this same fault when checking ioco conformance over the same modified model.

note that ic is the only action defined at state s0 of the iut z. when jtorx turns z into an input-enabled model, all input actions become enabled at all states, which contradicts the real functionality. for instance, the action amo, i.e., the amount value to be withdrawn, becomes enabled at state s0. however, if a transfer operation (tra) is chosen instead of a withdrawal (wd), the amount value to be withdrawn should not be enabled at this moment. hence any change performed over the former model modifies the original behavior of the iut, leading to an inaccurate conformance checking verdict relative to the real functionality of the atm.

in the last scenario we take the iolts y depicted in figure 7 as a new iut.

[figure 7. iut y]

the iut y differs from the specification a depicted in figure 4 only by the transitions (s4, ?amo, s5), in the specification, and (s4, !mon, s5), in the iut. so the iut allows a withdrawal operation with no checking of the balance (amo) before releasing the money (mon). by contrast, in the specification model the balance is checked before the money is released, when the account balance is positive. the fault model was bounded at six states for the class of iuts, and everest generated eighty tps based on the corresponding specification a. the generated test suite was then submitted to the iut y, and a fault verdict could be obtained by the path ic → pin → cpi → wd using our tool. for the sake of completeness we also applied this last scenario to jtorx, and a fault was likewise detected, by the test case ?ic, δ, ?pin, !bpi, ?pin, !cpi, ?wd, !mon.

4.4 a comparative analysis

here we list some main aspects and compare everest and jtorx. we have seen that both tools provide mechanisms to generate test suites, run test cases, and check ioco conformance. everest also provides the more general conformance checking based on regular languages. further, our tool allows complete test generation not only for the ioco relation but also for this more general conformance relation, with a wider range of possibilities to specify desirable and undesirable behaviors.

jtorx's test generation employs an exhaustive strategy, leading to the state space explosion problem and making the process infeasible in practice. in the opposite direction, everest is more flexible and allows for complete test suite generation by setting the maximum number of states of the iuts to be taken into account in the fault model.
jtorx also implements a random approach that chooses transitions to induce paths over the specification when generating test suites. everest, however, only applies a random approach over the language-based conformance relation, when desirable and/or undesirable behaviors are not provided by the tester; in this case, the test run is reduced to the problem of checking isomorphism between the iut and the specification model.

we also note that both tools implement an online testing approach when iuts are provided together with the specification, but only everest provides an offline test generation process using the notion of multigraphs and test purposes.

regarding the conformance checking process, jtorx defines an online strategy where test cases are applied to the iut right after they are generated. everest follows an offline process where the whole test suite is generated first and then all test cases are applied to the iut. however, everest also has an online alternative process to check conformance, where each test case obtained from the fault model is applied to the iut right after it is generated. table 1 summarizes these aspects.

table 1. methods and features
                                             jtorx   everest
conformance checking   ioco theory             √       √
                       language-based          x       √
generation             test suite generation   √       √
                       test strategy (online/offline)  √       √
                       test purpose            √       √
                       random approach         √       √

we also probe some properties over the specification and iut models, the test verdicts, and the testing strategies; see table 2.

table 2. properties and tools
                                             jtorx   everest
properties             underspecified models   √(a)    √
                       non-input-enabledness   x       √
                       quiescence              √       √
verdicts               test run                √       √
                       conformance             √       √
test mode              white/black-box testing √       √
(a) but the internal structure of the models must be changed.

some restrictions are naturally imposed over the models when checking ioco conformance. underspecified models, for instance, are not allowed on the iut side, and their internal structure must be changed to guarantee input-enabledness. the language-based conformance relation does not require any restriction, that is, the more general method can deal with underspecified iut and specification models. therefore everest can handle underspecified models with no change to the models when checking conformance and also when generating test suites using the language-based relation. jtorx, on the other hand, must completely explore the model's structure to add new transitions that guarantee the input-enabledness property.

we see that both tools can deal with quiescent models, where self-loops with δ actions are added at the quiescent states, and both give verdicts of conformance and run test cases in a similar way.

5 practical evaluation

in this section, we present the results of practical experiments run to evaluate the tools' performance. first, we provide experiments to compare the conformance checking methods of everest and jtorx in subsection 5.1: given an iut and a specification, both tools can check whether the iut conforms to the specification under the ioco relation. second, subsection 5.2 assays the additional feature of everest of generating and running test suites: given a specification model, we can generate test suites for a certain class of iuts and then apply them to the iuts. the experiments are classified into different groups according to the parameters under evaluation.
each group of experiments represents a different scenario, where the specifications and the iut models are changed to capture different situations of conformance checking, test suite generation, or test runs, e.g., the models must have a certain number of states and transitions. all experiments were performed using randomly generated models, both for specifications and iuts, satisfying all required properties, if any, to avoid bias in the results. in some groups, we have taken submachines of the specification models as the basis to generate iuts with a certain percentage of modification. we have organized all experiments by research questions (rqs) to get the desired analyses over the different groups of scenarios. our experiments were performed on an intel core i5 1.8 ghz cpu with 8 gb of ram, on windows 10.

5.1 conformance checking of the everest and jtorx tools

here we report on experiments comparing everest and jtorx when checking conformance between an iut and a given specification using their respective implementations of the ioco relation. a single conformance checking run is defined by a pair of models, an iut and a specification, where the result can be positive or negative. a verdict is said to be positive (ioco-conformance) when the iut complies with the specification, and negative (non-ioco-conformance) when the iut does not comply with the specification according to the ioco relation.

we evaluate several parameters related to the specifications and iuts, such as the number of states and the number of input/output actions of the models. in addition, we consider experiments that derive verdicts of conformance and of non-conformance in separate scenarios. we remark that only input-enabled and deterministic models were generated in our experiments, to comply with the restrictions imposed by jtorx.

each group of experiments on checking conformance is defined between one specification and ten iut models. therefore each run is settled by a pair of models, one iut against the corresponding specification, and a group of experiments with ten specifications and ten iuts per specification amounts to one hundred runs.

experiments with verdicts of ioco-conformance were run over iut models obtained as submachines of their respective specifications, while iut models with verdicts of non-ioco-conformance were randomly constructed by changing transitions of their corresponding specifications with a certain percentage of modification. regarding iut models with more states than the corresponding specifications, new states and new transitions were randomly added to the models. to illustrate, let s be a specification model with 20 states and 120 transitions. to get positive verdicts, we randomly take iut models as submachines of s, choosing subsets of states of s and their respective transitions. when we want to guarantee negative verdicts for iuts with 4% of modification w.r.t. their respective specification, we randomly choose 5 transitions to be modified, i.e., for each of these 5 transitions we change the source state, the target state, or the action symbol.
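the mutation procedure just described can be sketched as follows; this is our re-implementation of the experimental setup on the earlier IOLTS sketch, not the authors' script. note that round(0.04 * 120) = 5, matching the example of 5 modified transitions.

import random

def mutate_iut(spec, rate, seed=None):
    # perturb a percentage of the spec's transitions to build a non-conforming iut:
    # each chosen transition has its source, target, or label changed at random
    rng = random.Random(seed)
    trans = list(spec.transitions)
    k = max(1, round(rate * len(trans)))  # e.g., 4% of 120 transitions -> 5
    for idx in rng.sample(range(len(trans)), k):
        s, l, r = trans[idx]
        choice = rng.choice(["src", "lbl", "tgt"])
        if choice == "src":
            s = rng.choice(sorted(spec.states))
        elif choice == "tgt":
            r = rng.choice(sorted(spec.states))
        else:
            l = rng.choice(sorted(spec.inputs | spec.outputs))
        trans[idx] = (s, l, r)
    return IOLTS(set(spec.states), spec.initial, set(spec.inputs),
                 set(spec.outputs), set(trans))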
we decided to generate iuts using the specification models as the basis, instead of completely randomized iut models, because in practical situations developers can make mistakes but, in general, they minimally implement the specified model; that is, real-world iuts usually do not differ much from the corresponding specification.

hence, in the first group of experiments the sizes of the alphabets are interchanged while we increase the number of states of the models; in the second group we vary the number of states (and, consequently, transitions) of the specification and iut models in a stress testing. all processing times in the graphics are the mean values of the processing times of all experiments in the group.

5.1.1 reversing the size of input/output alphabets

in this first scenario we investigate the impact on conformance checking runs when the sizes of the input and output alphabets are inversely proportional. we state the rq as follows: "how does the size of the input alphabet (output alphabet) impact the processing time of checking conformance?". to answer this question we ran experiments where the sizes of the input and output alphabets are reversed on the models. first, we take a group of iolts models with 2 symbols in the input alphabet and 10 symbols in the output alphabet. in a second group of models we reverse the sizes of the alphabets, taking input alphabets with 10 symbols and output alphabets with only 2 symbols. we vary the number of states over 15, 25, and 35 on the iuts and fix the specifications at 10 states, both for verdicts of conformance and of non-conformance.

from the results, we note only a small variation in processing time when running experiments with verdicts of conformance, whether the input alphabet is larger than the output alphabet or vice versa; see figure 8. our tool is only 2.56% faster when running models with 2 inputs and 10 outputs (figure 8a) than when checking models with the alphabet sizes reversed, 10 inputs and 2 outputs (figure 8b). similarly, jtorx is only 3.51% faster when reversing the sizes of the alphabets. however, everest is faster than jtorx in both scenarios, both when the input alphabet is larger than the output alphabet and vice versa.

[figure 8. reversing i/o alphabets with ioco verdicts: (a) 2 inputs, 10 outputs; (b) 10 inputs, 2 outputs]

in contrast, regarding verdicts of non-conformance, we see an expressive impact on the verification time for the same scenarios with reversed alphabet sizes. in this case, both tools took less processing time for models with 2 inputs and 10 outputs. everest is 12.73% to 42.86% faster, according to the number of states of the iuts, for models with 2 inputs and 10 outputs (figure 9a) than for models with 10 inputs and 2 outputs (figure 9b). jtorx is around 200% to 352% faster, depending on the iut size, in the same comparison. we observe that the impact on the processing time is very expressive for jtorx on verdicts of non-conformance when the alphabet sizes are reversed.

[figure 9. reversing i/o alphabets and non-ioco verdicts: (a) 2 inputs, 10 outputs; (b) 10 inputs, 2 outputs]
we remark that in practical applications we usually need more input actions than output actions to specify real-world systems; that is, input alphabets with a large number of actions can weigh down the performance of the jtorx tool. further, notice that everest outperforms jtorx in all the scenarios depicted in these figures.

5.1.2 varying the number of states

we also performed experiments varying the number of states (and transitions) to evaluate the tools' scalability. in this case the rq is: "how does the number of states of the specifications and iuts impact the processing time of checking conformance?".

we answer this question by running three groups of experiments: (i) specifications with 10 states and iuts ranging from 20 to 200 states; (ii) specifications with 50 states and iuts ranging from 60 to 200 states; and (iii) specifications with 100 states and iuts ranging from 110 to 200 states. in all groups the number of states of the iut models was increased by 10 from one setting to the next.

in the experiments with verdicts of conformance, specifications with 10 states, and iuts with up to 120 states, everest attains a better performance compared to jtorx; jtorx is just slightly better when the iut models have more than 120 states (figure 10a). when running experiments with verdicts of conformance, specifications with 50 and 100 states, and groups of iuts with up to 200 states, everest always outperformed jtorx (figures 10b and 10c).

[figure 10. varying the number of states and ioco verdicts: (a) specification with 10 states; (b) specification with 50 states; (c) specification with 100 states]

now we turn to the experiments with verdicts of non-conformance, specifications with 10 and 50 states, and iuts with up to 120 states. we see from figures 11a and 11b that everest always outperformed jtorx for any group of iuts. jtorx gets a better performance only for iuts with more than 200 states and specifications with 100 states; see figure 11c.

[figure 11. varying the number of states and non-ioco verdicts: (a) specification with 10 states; (b) specification with 50 states; (c) specification with 100 states]
5.2 everest test suite generation

now we evaluate our tool for test suite generation by running experiments using the more recent approach (bonifacio and moura, 2019), where multigraphs are first constructed and test purposes are then generated. we vary the number of states of the specification models and also the bound on the number of states of the iuts. we construct distinguishing iuts with a certain percentage of modification over the transitions of their respective specifications in order to assess different scenarios.

we remark that the experiments on generating test suites were performed solely with everest, for two main reasons: (i) jtorx implements an online strategy where an iut is always required to run the test generation mechanism; and (ii) jtorx's test generation process finishes at the very first detected fault, so it cannot generate complete test suites.

in the first group of experiments we vary the number of states of the specification models together with the bound on the number of states of the iuts; in the second group we generate test purposes over the multigraphs obtained in the first group; and in the third group we run the test suites extracted from the test purposes of the second group over iuts generated by modifying the corresponding specification models by a certain percentage.

5.2.1 multigraph generation step

we define the rq for the multigraph generation step as follows: "what is the impact on the processing time when generating multigraphs?". to answer this question we vary the number of states of the specification models and also the bound m on the maximum number of states of the iut models. we consider specifications with 5 to 35 states and construct the corresponding multigraphs to get fault models for iuts with 5 to 55 states. the alphabets were fixed at 5 inputs and 5 outputs, and we increase the number of states by 10 for each group of iuts. transitions were randomly generated to ensure unbiased results.

next we briefly describe the scenarios of the multigraph generation step: (i) specifications with 5 states and m from 5 to 55; (ii) specifications with 15 states and m from 15 to 55; (iii) specifications with 25 states and m from 25 to 55; and (iv) specifications with 35 states and m from 35 to 55.

figure 12 shows that the processing time for generating multigraphs grows, in general, as the number of states of the specification and iut models grows. we first notice that the median values lie in the middle of the boxes, which means that as the size of the models grows we also observe a well-behaved growth of the processing time. we see in figure 12a that the multigraph construction for specifications with 5 states and m = 35 takes 0.038 seconds, whereas the construction for m = 55 takes 0.047 seconds, so the processing time rose by 23.68%. similarly, the construction process for specifications with 15 states takes 0.186 seconds with m = 35 and 0.253 seconds with m = 55; in this case the processing time rose by 36.02%. taking specifications with 25 states and m = 35, the multigraph construction takes 0.428 seconds, and for m = 55 it takes 0.676 seconds, as we can see in figure 12b; the processing time rose by 57.94%.
in the last group, specifications with 35 states and m = 35 result in a time consumption of 0.994 seconds, against 1.867 seconds with m = 55; here the processing time rose by 87.82%. notice that the multigraph generation with m = 35 is 46 times faster for specification models with 5 states than for specifications with 35 states. likewise, the construction with m = 55 is about 26 times faster for specification models with 5 states than for specifications with 35 states.

[figure 12. multigraph generation: (a) specifications with 5 and 15 states; (b) specifications with 25 and 35 states]

therefore we can conclude that the performance of the multigraph generation decreases as the number of states of the specifications and the bound m increase. the most important observation, however, is that the processing time is not meaningfully affected, i.e., it does not substantially increase as the number of states rises.

5.2.2 test purpose generation process

now we turn to the tp generation step, based on the multigraphs generated in the previous experiments. the associated rq, in this case, is: "how is the tp generation impacted w.r.t. processing time when we take multigraphs generated by varying the number of states of the corresponding specifications and also vary the number of states of the iut models?".

here we take the multigraphs associated with specifications with 5 to 35 states and vary m from 5 to 55, fixing the number of tps to be generated at 1000. figure 13 shows that the test generation process takes much more time compared to the multigraph generation step.

[figure 13. tp generation: (a) specifications with 5 and 15 states; (b) specifications with 25 and 35 states]

from figure 13a we see that the processing time is quite uniform for specifications with 5 states, no matter how we vary m. when the number of states grows, figure 13b shows that the processing time of the tp generation grows fast as m increases. the processing time for specifications with 35 states and m = 35 is 66.82 seconds, whereas with m = 55 it is 91.67 seconds, so the rate rose by 37.19%.
5.2.3 running test suites
in the last group of experiments we evaluate the processing time of running test suites. here the rq is given as follows: "what is the impact on the processing time when running test suites over iuts with 1%, 2%, and 4% of modification w.r.t. the specifications which were used to generate the corresponding multigraphs?".
to answer this question we have taken test suites from tps that were generated for specifications with 15 and 25 states. we fixed m = n, that is, the number of states to be considered on iut models is the same as the number of states in the specification models.
figure 14 shows the processing time according to the modification rate over the iuts.
figure 14. test run: processing time (in seconds) over iuts with 1%, 2%, and 4% of modification, for m = 15 and m = 25.
the time consumption of the test run over iuts with 1% of modification is 83.03 seconds with m = 15 and 89.78 seconds with m = 25. regarding iuts with 2% of modification, the process takes 86.19 seconds with m = 15 and 80.83 seconds with m = 25. finally, for iuts with 4% of modification, the test run takes 87.54 seconds with m = 15 and 77.83 seconds with m = 25. we see that the test run over iuts with m = 15 and 1% of modification is 5.15% faster than the test run over iuts with 4% of modification. if we consider m = 25, then the test run over iuts with 4% of modification is 13.31% faster than over iuts with 1% of modification.

5.3 threats to validity
we list some aspects that may arise as threats to the validity of the experiments. first, we report considerable difficulty in obtaining the jtorx tool. several libraries were missing, and we did not have full access to the source code. we had access only to a binary, on which we could make some small amendments to adapt it and run it from the command line. had we been able to compile and configure both tools ourselves, they could have been set up under exactly the same conditions, and the time consumption could have been measured more easily and precisely when running the experiments.
the computational resources on which the experiments were run may also be a threat. we ran all experiments on a general-purpose machine, so the results might be biased in some way. we remark, however, that both tools ran all experiments under the same conditions.
another threat is related to the random generation of the models. although we randomly generated all models in order to avoid biases in the process, we had to guarantee some properties in specific classes of experiments. for instance, in some groups of experiments we had to construct iuts that were in conformance with their corresponding specification, while in other groups we had to guarantee a certain rate of modification over the iuts to obtain verdicts of non-conformance. so the results might have somehow been biased by these extra checking tasks.
we also list as a threat those properties that must be guaranteed over the models following restrictions imposed by jtorx. the size of the alphabets and the number of states and transitions of the specification and iut models are modified from the original models to secure such properties, so we cannot make any claim about the similarity between these modified models and the original ones w.r.t. their behaviors.
6 conclusion
conformance checking and test suite generation are important activities to improve the reliability of reactive systems under development. in this work we have presented an automatic testing tool for checking conformance and generating test suites for iolts models.
we have implemented the classical ioco relation and the more general approach based on regular languages. the latter, and consequently the everest tool, imposes few, if any, restrictions over the models and allows a wider range of fault models described by regular languages when checking conformance. several works have dealt with ioco theory and its variations; however, we are not aware of any other tool that implements a different notion of conformance, such as the language-based conformance. furthermore, our tool implements a complete black-box test suite generation using the notion of test purposes for certain classes of fault models.
we described some case studies to probe both tools and their functionalities in practice. from a comparative analysis, we could observe that everest provides a wider range of testing scenarios, since it was able to detect faults, using the language-based approach, that were not detected by jtorx using the ioco theory. the effectiveness of our test suite generation method was also evaluated in black-box scenarios. we also offered practical experiments of conformance checking to compare the performance of everest against jtorx. everest outperforms jtorx in most scenarios, except for those where the structure of the iut models is quite different from the corresponding specifications. hence we remark that, although everest implements a more general conformance relation, the time consumption was not impacted on checking runs. we also observed from the results that everest has a more stable behavior w.r.t. the processing time, even for iut models with quite different numbers of states. we also performed experiments of test suite generation and test runs using the everest tool. our tool was able to handle specifications and implementation candidates with a reasonable number of states, as seen in the experiments.
the main contribution of this work is a practical tool that can check conformance based on different relations and can generate test suites in a black-box setting. moreover, we have presented case studies, a comparative analysis, and practical experiments to evaluate and compare our tool. an extension of the current version of everest is underway, with a new module to allow conformance checking, test suite generation, and test runs in batch mode, i.e., it will be able to automatically test several iut models at once. as future directions, we intend to improve our strategies and algorithms to generate test suites and run test cases more efficiently.

references
belinfante, a. (2010a). jtorx: a tool for on-line model-driven test derivation and execution. in esparza, j. and majumdar, r., editors, tools and algorithms for the construction and analysis of systems, 16th international conference, tacas 2010, lecture notes in computer science, pages 266–270. springer.
belinfante, a. (2010b). jtorx: a tool for on-line model-driven test derivation and execution. in esparza, j. and majumdar, r., editors, tools and algorithms for the construction and analysis of systems, pages 266–270, berlin, heidelberg. springer berlin heidelberg.
belinfante, a. (2014). jtorx: exploring model-based testing. centre for telematics and information technology (ctit), netherlands. ipa dissertation series no. 2014-09.
bonifacio, a. l. and moura, a. v. (2019). complete test suites for input/output systems. corr, abs/1902.10278. accessed on: 2019-06.
calamé, j. (2005). specification-based test generation with tgv. software engineering notes.
clarke, d., jéron, t., rusu, v., and zinovieva, e. (2002). stg: a symbolic test generation tool. in katoen, j.-p. and stevens, p., editors, tools and algorithms for the construction and analysis of systems, pages 470–475, berlin, heidelberg. springer berlin heidelberg.
de vries, r. (2001). towards formal test purposes. in tretmans, g. and brinksma, h., editors, formal approaches to testing of software 2001 (fates'01), volume ns-01-4 of brics notes series, pages 61–76, aarhus, denmark.
gomes, c. s. and bonifacio, a. l. (2019). automatically checking conformance on asynchronous reactive systems. in the fourteenth international conference on software engineering advances, pages 17–23.
jard, c. and jéron, t. (2005). tgv: theory, principles and algorithms. international journal on software tools for technology transfer, 7(4):297–315. accessed on: 2019-08.
larsen, k. g., mikucionis, m., nielsen, b., and skou, a. (2005). testing real-time embedded software using uppaal-tron: an industrial case study. in proceedings of the 5th acm international conference on embedded software, emsoft '05, pages 299–306. acm.
utting, m. and legeard, b. (2007). practical model-based testing: a tools approach. elsevier, 1st edition.
marsso, l., mateescu, r., and serwe, w. (2018). testor: a modular tool for on-the-fly conformance test case generation. in beyer, d. and huisman, m., editors, tools and algorithms for the construction and analysis of systems, pages 211–228, cham. springer international publishing.
mostowski, w., poll, e., schmaltz, j., tretmans, j., and wichers schreur, r. (2009). model-based testing of electronic passports. in alpuente, m., cook, b., and joubert, c., editors, formal methods for industrial critical systems, pages 207–209, berlin, heidelberg. springer berlin heidelberg.
naik, k. and tripathy, p. (2018). software testing and quality assurance: theory and practice. wiley publishing, 2nd edition.
roehm, h., oehlerking, j., woehrle, m., and althoff, m. (2016). reachset conformance testing of hybrid automata. in proceedings of the 19th international conference on hybrid systems: computation and control, hscc '16, pages 277–286, new york, ny, usa. acm.
simão, a. d. s. and petrenko, a. (2014). generating complete and finite test suite for ioco: is it possible? in proceedings of the ninth workshop on model-based testing, mbt 2014, grenoble, france, 6 april 2014, pages 56–70. accessed on: 2019-07.
sipser, m. (2006). introduction to the theory of computation. course technology, second edition.
tretmans, j. (1993). a formal approach to conformance testing. in rafiq, o., editor, protocol test systems, vi, proceedings of the ifip tc6/wg6.1 sixth international workshop on protocol test systems, pau, france, 28-30 september, 1993, volume c-19 of ifip transactions, pages 257–276. north-holland.
tretmans, j. (2008). model based testing with labelled transition systems. in hierons, r. m., bowen, j. p., and harman, m., editors, formal methods and testing, an outcome of the fortest network, revised selected papers, volume 4949 of lecture notes in computer science, pages 1–38. springer.
journal of software engineering research and development, 2021, 9:14, doi: 10.5753/jserd.2021.1911
this work is licensed under a creative commons attribution 4.0 international license.

attributes that may raise the occurrence of merge conflicts
josé william menezes [ universidade federal do acre | jose.william@sou.ufac.br ]
bruno trindade [ universidade federal do acre | bruno.trindade@sou.ufac.br ]
joão felipe pimentel [ universidade federal fluminense | jpimentel@ic.uff.br ]
alexandre plastino [ universidade federal fluminense | plastino@ic.uff.br ]
leonardo murta [ universidade federal fluminense | leomurta@ic.uff.br ]
catarina costa [ universidade federal do acre | catarina.costa@ufac.br ]

abstract
collaborative software development typically involves the use of branches. the changes made in different branches are usually merged, and direct and indirect conflicts may arise. some studies are concerned with investigating ways to deal with merge conflicts and measuring the effort that this activity may require. however, the investigation of factors that may reduce the occurrence of conflicts needs more and deeper attention. this paper aims at identifying and analyzing attributes of past merges with and without conflicts to understand what may induce direct conflicts. we analyzed 182,273 merge scenarios from 80 projects written in eight different programming languages to find characteristics that increase the chances of a merge having a conflict. we found that attributes such as the number of changed files, the number of commits, the number of changed lines, and the number of committers have the strongest influence on the occurrence of merge conflicts. moreover, attributes in the branch that is being integrated seem to be more influential than the same attributes in the receiving branch. additionally, we discovered positive correlations between the occurrence of conflicts and both the duration of the branch and the intersection of developers in both branches. finally, we observed that php, javascript, and java are more prone to conflicts.
keywords: version control, merge conflicts, conflict prediction

1 introduction
software development normally involves collaboration among members of the project team. this collaborative development is supported by a version control system (vcs). often, when there is a need to develop new features or fix bugs, developers choose to create a branch, which is a separate development line. this separate development line helps teams to focus on their tasks, without prematurely worrying about how it affects other parts of the software (bird et al., 2011). however, the use of branches can cause problems, as changes made in different branches are usually merged, and direct and indirect conflicts may arise (brindescu et al., 2020b; costa et al., 2016; sarma et al., 2011; brun et al., 2011). according to bird et al.
(2011), the effort involved in the merge process depends on how much work went on in the branches.
some studies investigate ways to deal with merge conflicts by proactively detecting changes that can lead to conflicts (brun et al., 2011; sarma et al., 2011), identifying merge characteristics (accioly et al., 2018; ghiotto et al., 2018; vale et al., 2020), investigating the characteristics of difficult merge conflicts (brindescu et al., 2020b), and examining the decisions usually made to resolve conflicts (accioly et al., 2018; ghiotto et al., 2018). however, only recently have studies started to investigate factors that may induce the occurrence of conflicts. dias et al. (2020) verify how seven factors related to modularity, size, and timing of developers' contributions affect conflict occurrence. leßenich et al. (2018) analyze the predictive power of seven indicators, such as the number, size, and scattering degree of commits in each branch, to forecast the number of merge conflicts. in the same direction, owhadi-kareshk et al. (2019) investigate the predictive power of nine lightweight git feature sets, such as the number of changed files in both branches, the number of commits and developers, and the duration of the development of the branch. finally, vale et al. (2020) investigate the role of communication activity and the number of modified lines, chunks, files, developers, commits, and days that a merge scenario lasts in the increase or reduction of merge conflicts.
similar to dias et al. (2020), leßenich et al. (2018), owhadi-kareshk et al. (2019), and vale et al. (2020), we assume that by analyzing attributes of past merges, it is possible to identify characteristics that may increase the chances of having a merge conflict. however, in addition to the attributes investigated by those authors (e.g., isolation, number of changed files, changed lines, commits, commit density, and developers), we analyzed some other attributes, such as the programming language, the frequency of one or more developers committing in both branches, and the existence of self-conflicts, i.e., conflicts among changes committed by the same developer (zimmermann, 2007). as mentioned by brindescu et al. (2020a), the changes in conflict are generally authored by two different developers, but merge conflicts can also happen between the edits of the same developer in two different branches. besides, in terms of the number of analyzed merges, our corpus is representative (it is only smaller than the corpus of owhadi-kareshk et al. (2019)), and our analysis of the number of commits, commit density, committers, and changed lines and files is performed per branch, not using averages. finally, it is important to mention that some metrics with similar names in the related work are calculated differently. thus, our work aims at providing a more in-depth analysis of how a set of merge attributes can influence the occurrence of conflicts. to do so, we mined association rules from 182,273 merge scenarios extracted from 80 software projects hosted on github, written in eight different programming languages.
the following eight research questions guided the analysis:
• rq1. how is the isolation of a branch related to the occurrence of merge conflicts? our intuition is that the longer the isolation time of the branches, the greater the likelihood of having conflicts.
• rq2. how is the number of commits related to the occurrence of merge conflicts? our intuition is that the greater the number of contributions in terms of commits in the branches, the greater the likelihood of having conflicts.
• rq3. how is the number of developers that performed commits related to the occurrence of merge conflicts? our intuition is that the greater the number of contributors in the branches, the greater the likelihood of having conflicts.
• rq4. how is the number of changed files related to the occurrence of merge conflicts? our intuition is that the greater the number of contributions in terms of changed files in the branches, the greater the likelihood of having conflicts.
• rq5. how is the number of changed lines related to the occurrence of merge conflicts? our intuition is that the greater the number of contributions in terms of changed lines in the branches, the greater the likelihood of having conflicts.
• rq6. how is the programming language related to the occurrence of merge conflicts? we had no intuition about the programming language, but we would like to know if any language is more prone to conflicts.
• rq7. how is the intersection of developers in both branches related to the occurrence of merge conflicts? our intuition is that the greater the number of contributors working in both branches, the lower the chances of having conflicts, because these developers are aware of the parallel changes.
• rq8. how prevalent is the occurrence of merge self-conflicts? we had no intuition about the proportion of self-conflicts, but we would like to know if they are common in projects.
the answers to these questions can provide insights on how software project teams' work may affect the occurrence or avoidance of merge conflicts. we found that the investigated attributes have a positive correlation with merges with conflicts. notably, in the integrated branch, the number of changed files, the number of commits, the number of changed lines, and the number of committers have the strongest influence on the occurrence of conflicts among all attributes we analyzed. surprisingly, having some developers committing in both branches also increases the chance of conflicts, but having no common developers or having exactly the same developers committing in both branches decreases the chance of conflicts. we also verified that three programming languages (php, javascript, and java) are more prone to conflicts.
this paper is an extended version of a conference paper (menezes et al., 2020) in which we answered six research questions, focused on the impact of the attributes time, commits, committers, changed files, intersection, and self-conflicts on the occurrence of merge conflicts. this work complements our previous work by adding two new research questions and three new attributes: the number of changed lines, the commit density, and the programming language. additionally, we detail the investigation of developer intersection and replace the old "some intersection" category by percentages of intersection in the association rules.
we also deepen the analysis of self-conflicts, obtaining the number of self-conflicts per chunk instead of per file. after all, a file can have several pieces of conflicting code that are from the same or from different developers; hence, the analysis became more precise. in addition, we mine rules to verify the relation of the attributes to the occurrence of self-conflicts.
besides this introduction, this paper is organized in 7 sections. in section 2, we present the research steps followed. in section 3, we present the results of our statistical analysis, the association rules, and the discussion about self-conflicts. in section 4, we present the answers to our research questions. in section 5, we discuss threats to validity. in section 6, we discuss the related work. finally, in section 7 we present the conclusion.

2 materials and methods
to answer the research questions presented in the introduction, we performed an exploratory study. the following steps, detailed below, compose our exploratory study: merge attributes definition, projects and merges selection, merges and attributes extraction, and data mining.

2.1 merge attributes definition
the attributes were mainly derived from our research questions and are defined in table 1. we divided the attributes into project attributes, merge attributes, and branch attributes. the project attributes are the predominant programming language, number of merges, number of analyzed merges (non-fast-forward), merges with conflicts, merges without conflicts, and self-conflicts. the merge attributes are collected per merge scenario, using the information present in both branches: the merge conflict occurrence (yes or no), timing metrics, and information about changes and developers in both branches. the branch attributes are collected and presented per branch (b1 and b2); when referring to the identification of branches in merges, i.e., the distinction between branch 1 (b1) and branch 2 (b2), we borrow the reasoning of chacon and hamano (2009): "the first parent is the branch you were on when you merged, and the second is the commit on the branch that you merged in". we do not adopt any aggregation of values.

table 1. attributes
project attributes:
a) programming language: predominant programming language.
b) total of merges: number of merges in total.
c) analyzed merges: number of three-way merges, not considering fast-forward merges.
d) merges with conflicts: number of analyzed merges with conflicts.
e) merges without conflicts: number of analyzed merges without conflicts.
f) merges with self-conflict: number of analyzed merges with the same developer authoring both sides of at least one conflicting chunk.
merge attributes:
g) merge conflict occurrence: binary attribute (yes or no) indicating if the merge has conflicts.
h) branching-duration: the effective duration of development in the branches, from the first branch commit (min(b1,b2)) to the last branch commit (max(b1,b2)), in days.
i) total-duration: the total duration of isolation, from the common ancestor (base commit) to the merge commit, in days.
j) committers in both branches: percentage of developers in both branches.
k) conflicting chunk: number of conflicting chunks.
l) conflicting chunk by the same developers: number of conflicting chunks authored by the same developer on both sides.
branch attributes (b1 and b2):
m) commit density: number of commits over the branching-duration.
n) loc-churn: number of lines changed (added + deleted) in each branch.
o) changed files: number of changed files in each branch.
p) commits: number of commits in each branch.
q) committers (commit authors): number of developers that authored commits in each branch.
the branch attributes are the number of commits, number of committers (commit authors), number of changed files, and loc-churn (number of lines changed: added + deleted). the attributes are collected for merges with conflicts and for merges without conflicts. they allow us to compare different characteristics between merges with and without conflicts. some of these attributes are also mentioned in related work, such as the timing metrics, merge conflict occurrence, number of merge conflicts, number of commits, commit density, committers, and changed lines and files (dias et al., 2020; leßenich et al., 2018; vale et al., 2020).

2.2 projects and merges selection
first, we decided to select projects developed in different and popular programming languages. thus, we identified the top-8 programming languages present in the following surveys: the github top active languages survey 2019 (https://githut.info/), the stack overflow developer survey results 2019 (https://insights.stackoverflow.com/survey/2019#technology), and the tiobe index 2019 (https://www.tiobe.com/tiobe-index/). the top-8 languages present in all three surveys were: javascript, python, java, php, c#, c++, c, and ruby.
we selected the projects using the github api. we used the following criteria: (1) popular projects (projects with more than 1,000 stars), (2) software projects, (3) number of merges greater than 100, (4) projects with a wiki or some documentation, and (5) a balanced amount of merges per project. after applying the first criterion, we initially selected 461 projects. after applying criteria 2 to 4, 279 projects remained. to obtain a balanced corpus in terms of the number of analyzed merges per project, we selected ten projects per programming language, where the number of analyzed merges of each project was less than 5% of the total number of analyzed merges in the dataset (table 2). for example, the graal project is the project with the highest number of analyzed merges (7,064); however, it represents just 3.9% of our final dataset (182,273).

table 2. general information of our dataset.
programming language | total merges | analyzed merges | conflicting merges
c | 31,013 | 21,948 | 981
c# | 31,468 | 21,148 | 2,003
c++ | 32,463 | 24,155 | 2,290
java | 32,989 | 24,109 | 2,519
javascript | 31,542 | 21,803 | 2,613
php | 31,208 | 22,371 | 3,376
python | 32,591 | 22,585 | 1,923
ruby | 37,001 | 24,154 | 2,114
total | 260,275 | 182,273 | 17,819

it is important to mention that although the total number of merges was initially 260,275 (table 2), we removed 78,002 merges from the analysis: 74,293 fast-forward merges (i.e., merges with no changes in one branch, in which git would be able to just move the pointer forward, but due to the option --no-ff a merge commit was created (chacon and hamano, 2009)), 37 merges with negative total-duration (i.e., merges in which the date of the common ancestor is more recent than the date of the merge, probably due to some clock misconfiguration in the developer's computer), and 3,672 merges with only merge commits (merges in which all commits from both branches are merge commits).
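a minimal sketch of how this kind of merge classification can be reproduced with standard git plumbing commands (git rev-list, git rev-parse, and git merge-base); the helper names and repository path are illustrative, and this is not the authors' java implementation:

import subprocess

def git(repo, *args):
    # run a git command inside `repo` and return its stdout
    return subprocess.run(["git", "-C", repo, *args],
                          capture_output=True, text=True, check=True).stdout

def three_way_merges(repo):
    # keep only merges where neither parent is an ancestor of the other,
    # i.e., both branches contain work (a genuine three-way merge);
    # --no-ff merges of an unchanged branch are filtered out here
    selected = []
    for m in git(repo, "rev-list", "--merges", "HEAD").split():
        p1 = git(repo, "rev-parse", f"{m}^1").strip()
        p2 = git(repo, "rev-parse", f"{m}^2").strip()
        base = git(repo, "merge-base", p1, p2).strip()
        if base not in (p1, p2):
            selected.append((m, p1, p2, base))
    return selected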
2.3 merges and attributes extraction
we have implemented a tool to extract the attributes. the tool and the dataset are publicly available on github (https://github.com/catarinacosta/mactool/). the tool was developed in java to parse the log provided by git, retrieving all merge commits. then, it identifies the parent commits that were merged and navigates until the common ancestor, just before the forking of the history. our tool also checks whether the merge resulted in conflicts.
figure 1 shows a merge example composed of a merge commit (c57), the two parent commits that were merged (c55 and c56), and the common ancestor (c50).
figure 1. merge example.
from these commits, it is possible to identify the commits within each branch. these commits are located between the common ancestor and each of the merged parent commits (including the parent commits). the "feature" branch in the example has three commits (c51, c54, and c56), and the "master" branch also has three commits (c52, c53, and c55). by identifying all commits from the branches of each merge, our tool was able to collect all the attributes listed in section 2.1. in our example, to calculate the branching-duration, we check the date of the first branch commit (min(b1,b2)), "08 aug 2020", and the date of the last branch commit (max(b1,b2)), "22 aug 2020"; so, the branching-duration was 14 days. in the verification of the committers in both branches attribute, the tool would identify that ana made changes to both branches. in the verification of the conflicting chunk by the same developers, ana could also have been the author of a self-conflict. in the committers attribute verification, the "feature" branch has two committers (lisa and ana), and the "master" branch also has two committers (ana and tom). three files were changed in the "feature" branch (a, b, and c), and two files (a and b) were changed in the "master" branch.
with the attributes of the 182,273 merge cases, we could conduct statistical analysis to understand the difference between the distributions of merges with and without conflicts. additionally, we plotted graphs representing the probability of having a conflict in a merge (y axis) given that an attribute is higher than a value (x axis). we calculated this probability according to the bayes theorem: p(conflict | attribute > value) = p(conflict ∩ attribute > value) / p(attribute > value).
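this conditional probability can be computed directly from the collected dataset; a minimal sketch, assuming each merge is a record with numeric attributes and a conflict flag (the field names are illustrative):

def p_conflict_given_at_least(merges, attribute, threshold):
    # empirical p(conflict | attribute >= threshold), i.e.,
    # p(conflict and attribute >= threshold) / p(attribute >= threshold)
    over = [m for m in merges if m[attribute] >= threshold]
    if not over:
        return None
    return sum(1 for m in over if m["conflict"]) / len(over)

merges = [{"changed_files_b2": 2, "conflict": False},
          {"changed_files_b2": 40, "conflict": True},
          {"changed_files_b2": 25, "conflict": False}]
print(p_conflict_given_at_least(merges, "changed_files_b2", 19))  # 0.5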
2.4 data mining
in this step, we adopted a data mining technique called association rule extraction. in summary, an association rule r is a pair (x, y) of two disjoint entity sets, x and y. in the notation x → y, x is called the antecedent and y the consequent (han et al., 2012). the rules aim at finding associations or correlations, but, as said by zimmermann et al. (2004), rules do not tell an absolute truth. they have a probabilistic interpretation based on the amount of evidence, determined by two metrics (agrawal et al., 1994): (a) support, the joint probability of having both antecedent and consequent, and (b) confidence, the conditional probability of having the consequent when the antecedent is present. another measure of interest used is the (c) lift, which indicates how much the occurrence of y increases given the occurrence of x. han et al. (2012) explain that lift(x → y) = confidence(x → y) / support(y), where lift = 1 indicates that the antecedent (x) does not interfere with the occurrence of the consequent (y), lift > 1 indicates that the occurrence of x increases the chances of the occurrence of y, and lift < 1 indicates that the occurrence of x decreases the chances of the occurrence of y.
we adopted the knowledge discovery in databases (kdd) process (fayyad et al., 1996) to extract the association rules from our dataset: (a) data selection, (b) preprocessing, (c) transformation and data enrichment, (d) association rule extraction, and (e) results interpretation and evaluation. after we selected and collected the projects and the attributes using our tool (step a), we removed instances (merge cases) with inconsistent values (step b), for example, merge cases with negative total-duration. these two initial steps were described in sections 2.1 to 2.3. the discretization (step c) was performed through the supervised algorithm proposed by fayyad and irani (1992), available in the weka tool (https://www.cs.waikato.ac.nz/ml/weka/). this algorithm transforms numerical attributes into categorical ones, aiming at reducing the entropy of the original class distribution by finding ranges that maximize their class-related purity. in this study, the class attribute indicates the merge conflict occurrence. for the association rule extraction (step d), we used r (https://cran.r-project.org/bin/windows/rtools/) with the apriori algorithm (agrawal et al., 1994) and the rattle tool (https://rattle.togaware.com/). in this study, our focus was on finding rules with the occurrence of conflict in the consequent (conflict=yes). however, since conflicts are present in only approximately 10% of our dataset, we lowered the support and confidence measures of interest considerably, to 0.01%. finally, we looked at all the extracted association rules that would help us answer the research questions (step e). in this step, we performed the analysis of the results.
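the three measures of interest can be stated compactly over a set of transactions; the following minimal sketch is an illustrative re-implementation (the study itself used r with apriori and the rattle tool, not this code):

def measures(transactions, antecedent, consequent):
    # support, confidence, and lift of the rule antecedent -> consequent,
    # where each transaction is a set of items and <= tests set inclusion
    n = len(transactions)
    both = sum(1 for t in transactions if antecedent <= t and consequent <= t)
    ante = sum(1 for t in transactions if antecedent <= t)
    cons = sum(1 for t in transactions if consequent <= t)
    support = both / n
    confidence = both / ante
    lift = confidence / (cons / n)
    return support, confidence, lift

ts = [{"files_b2>30", "conflict=yes"}, {"files_b2>30", "conflict=no"},
      {"files_b2=1", "conflict=no"}, {"files_b2=1", "conflict=no"}]
print(measures(ts, {"files_b2>30"}, {"conflict=yes"}))  # (0.25, 0.5, 2.0)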
3 results
this section answers the research questions posed in section 1 according to the research process described in section 2. section 3.1 presents a statistical analysis of the merge attributes. section 3.2 analyzes the extracted association rules. section 3.3 presents the number of self-conflicts.

3.1 statistical analysis
in this section, we analyze the distribution of each merge attribute for merges with and without conflicts to understand which attributes act more as an indicator of conflict. hence, we divided the dataset of 182,273 merges into two subsets: one with 164,454 merges without conflicts and the other with 17,819 merges with conflicts. table 3 presents the comparison of the distributions.
for comparing the distribution of the attributes, we first analyzed their normality using the anderson-darling test (anderson and darling, 1954). we chose this test due to the size of the distributions. we observed non-normality in all distributions at 95% confidence. then, we applied the mann-whitney test (mann and whitney, 1947) for each pair of subsets, and we found statistically significant differences for all the distributions, except the number of committers in b1 (p-value = 0.323).
after calculating the mean of the statistically different distributions, we observed that merges with conflicts have higher values than merges without conflicts for all the attributes. given these results and the non-normality of the distributions, we used cliff's delta (macbeth et al., 2011) to calculate the effect size of these differences. we found four attributes with a large effect size (the ones related to b2) and five with a small effect size (the ones related to time and most of the ones associated with b1).
for clarification, let us analyze the distributions of changed files in b2 from table 3 as an example. we started the analysis by applying the anderson-darling test to the distribution of changed files in b2 for merges without conflicts and obtained a p-value < 10^-15, rejecting the null hypothesis at 95% confidence (i.e., we found that these data are not from a population with a normal distribution). then, we applied the same test to the distribution related to merges with conflicts, and we also observed a p-value < 10^-15. since the distributions are neither normal nor paired, we compared them with a non-parametric test for unpaired data: mann-whitney. once again, we observed a p-value < 10^-15, indicating that the distributions are statistically different from each other. note in table 3 that both the average and the boxplot of the number of changed files in b2 for merges with conflicts (wc) are higher than the ones for merges without conflicts (wo). finally, we used cliff's delta to calculate the effect size of the difference between these distributions, and we obtained a magnitude of -0.57, which is classified as large (romano et al., 2006).
after analyzing the distributions and observing a significant statistical difference in most of them, we applied the bayes theorem to calculate the probability p(conflict | attribute ≥ x), and we varied x over values within the range of the boxplots presented in table 3 (i.e., between max(q1 - 1.5 × iqr, minimum) and min(q3 + 1.5 × iqr, maximum)). figure 2 presents the distribution of probabilities for each numeric distribution. as expected, all probabilities start at around 10%, which represents the percentage of merge conflicts, but they grow at different rates. figure 2 highlights the probabilities at the medians of the distributions with and without merge conflicts and the probability at the last value of the interval. note in figure 2(e) that p(conflict | committers b1 ≥ x) is 9% for x = 2 (the median for both merges with and without conflicts); it indicates that the probability had a small decrease in comparison to the starting point (x = 1). continuing our example using the number of changed files in b2, note that p(conflict | changed files b2 ≥ x) in figure 2(h) starts at 9.8% when x = 0. then, when x reaches the median number of changed files in b2 for merges without conflicts (x = 2), the probability is 13.5%. when x reaches the median number of changed files in b2 for merges with conflicts (x = 19), the probability is 29.5%. finally, at the end of the interval (x = q3 + 1.5 × iqr = 205), the probability is 47.9%. as expected, in figure 2, attributes with a large effect size (changed files in b2, commits in b2, changed lines in b2, and committers in b2) grow faster than attributes with a smaller effect size (changed lines in b1, branching-duration, changed files in b1, total-duration, and commits in b1).
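the statistical pipeline of this subsection (normality check, unpaired non-parametric comparison, and effect size) can be reproduced with standard libraries; a minimal sketch, assuming two numeric samples wo and wc, with cliff's delta computed from its definition (the sample values below are made up):

from scipy.stats import anderson, mannwhitneyu

def cliffs_delta(xs, ys):
    # cliff's delta by definition: p(x > y) - p(x < y), over all pairs
    gt = sum(1 for x in xs for y in ys if x > y)
    lt = sum(1 for x in xs for y in ys if x < y)
    return (gt - lt) / (len(xs) * len(ys))

wo = [2, 3, 2, 5, 4, 1, 2, 3]          # e.g., changed files in b2, no conflict
wc = [19, 25, 8, 40, 12, 30, 22, 17]   # e.g., changed files in b2, conflict

print(anderson(wo).statistic)       # anderson-darling normality statistic
print(mannwhitneyu(wo, wc).pvalue)  # mann-whitney test for unpaired data
print(cliffs_delta(wo, wc))         # negative when wc tends to be higher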
3.2 association rules
we used data mining to enrich our analyses with association rules. table 4 presents the extracted association rules, in which the antecedent is the range value of each attribute, obtained by the discretization process, and the consequent is the presence of conflicts. it also presents the three measures of interest used: support (sup.), confidence (conf.), and lift. in general, smaller attribute values make the chance of merge conflicts decrease (lift < 1), while higher values make the chance of merge conflicts increase (lift > 1). for instance, for just one changed file in b2, the probability of merge conflicts decreases by 81% (lift = 0.19). however, for more than 30 changed files, the probability of merge conflicts increases by 243% (lift = 3.43).
through all these analyses, we observed that attributes related to b2 (i.e., the branch being integrated into b1) influence the probability of merge conflicts more than the other attributes, with the number of changed files, number of commits, number of changed lines, and number of committers in b2 being the attributes that influence the most, in this order. we observed that the attributes branching-duration and total-duration have a similar impact on the probability of merge conflicts and could be used interchangeably in most situations.
we also verified the eight selected programming languages regarding their influence on the occurrence of conflicts. three languages (php, javascript, and java) have shown a positive conflict dependency (lift > 1), which increases the chances of conflicts occurring (table 5). we observe that, when using php, the probability of conflict occurrence increases by 53% (lift = 1.53). on the other hand, when programming in c, the probability of having conflicts decreases by 54% (lift = 0.46).
finally, we evaluated the intersection of developers, i.e., the number of developers working in both branches.

table 3. comparison of merge distributions with (wc) and without (wo) conflicts; boxplot thumbnails omitted
attribute | anderson-darling wo (p-value) | anderson-darling wc (p-value) | mann-whitney (p-value) | average wo | average wc | cliff's delta | meaning
branching-duration | < 10^-15 | < 10^-15 | < 10^-15 | 6.26 | 18.53 | -0.30 | small
total-duration | < 10^-15 | < 10^-15 | < 10^-15 | 7.27 | 19.71 | -0.27 | small
commits b1 | < 10^-15 | < 10^-15 | < 10^-15 | 73.51 | 252.91 | -0.24 | small
commits b2 | < 10^-15 | < 10^-15 | < 10^-15 | 9.76 | 81.07 | -0.53 | large
committers b1 | < 10^-15 | < 10^-15 | 0.32 | 8.52 | 21.31 | n/a | not significant
committers b2 | < 10^-15 | < 10^-15 | < 10^-15 | 2.01 | 9.40 | -0.48 | large
changed files b1 | < 10^-15 | < 10^-15 | < 10^-15 | 100.80 | 425.44 | -0.29 | small
changed files b2 | < 10^-15 | < 10^-15 | < 10^-15 | 21.43 | 166.30 | -0.57 | large
loc-churn b1 | < 10^-15 | < 10^-15 | < 10^-15 | 6666.95 | 32934.05 | -0.33 | small
loc-churn b2 | < 10^-15 | < 10^-15 | < 10^-15 | 1623.70 | 13734.43 | -0.51 | large
density b1 | < 10^-15 | < 10^-15 | < 10^-15 | 545.98 | 1074.17 | 0.07 | negligible
density b2 | < 10^-15 | < 10^-15 | < 10^-15 | 35.21 | 51.53 | -0.11 | negligible

some studies have already mentioned that developers may work in both branches (costa et al., 2014, 2016; zimmermann, 2007). according to zimmermann (2007), many developers work at different places (e.g., home and office) or on different branches and, at some point, they need to synchronize their changes.
costa et al. (2014) analyzed the number of merges in repositories according to three scenarios: the presence of the same developers in both branches, disjoint sets of developers, or some intersection of the developers. they found a significant number of merges with developers working in both branches. we also performed this analysis in our dataset, but we compared the numbers of merges with and without conflicts. figure 3 shows the number of merge cases with no intersection, with some intersection, and with all developers in common, for merges with and without conflicts. since the number of merges with conflicts is much smaller than the number of merges without conflicts, we normalized both groups according to the total number of merges. then, we mined association rules to find the increase or decrease in the probability of merge conflicts. table 6 presents the results, which indicate that having some intersection (67% to 99%) increases the chance of conflicts by 265% (lift = 3.65), and having no intersection reduces the chances of conflict by 41% (lift = 0.59).
after extracting rules with only one attribute in the antecedent, and considering the multidimensional characteristics of an association rule (lu et al., 2000), we decided to analyze the combination of rules and understand whether the combination of factors increases some measures of interest in the occurrence of conflict. the algorithm that brought the best results in the selection of attributes was infogainattributeeval (weka.attributeselection.infogainattributeeval). six attributes (branching-duration, committers in b2, intersection, commits in b2, and changed files and lines in b2) with the best classification were selected, and the ten combinations of attributes with the rules with the best measures of interest are presented in table 7.
figure 2. probability of conflicts given that the attribute is greater than the value on the x axis. green stars represent the probability at the median of the distributions without conflicts, red triangles the probability at the median of the distributions with conflicts, and blue squares the probability at the maximum value. panels: (a) branching-duration, (b) total-duration, (c) commits b1, (d) commits b2, (e) committers b1, (f) committers b2, (g) changed-files b1, (h) changed-files b2, (i) loc-churn b1, (j) loc-churn b2, (k) density b1, (l) density b2.
table 4. measures of interest for the rules {attribute = range value} → {conflict=yes}
attribute | range value | sup. (%) | conf. (%) | lift
branching-duration (in days) | < 1 | 2.85 | 5.84 | 0.60
branching-duration (in days) | 1 – 7 | 3.84 | 10.87 | 1.11
branching-duration (in days) | 8 – 15 | 1.13 | 16.31 | 1.67
branching-duration (in days) | 16 – 30 | 0.77 | 17.93 | 1.84
branching-duration (in days) | > 30 | 1.17 | 25.45 | 2.61
total-duration (in days) | < 1 | 2.22 | 6.01 | 0.62
total-duration (in days) | 1 – 7 | 4.19 | 9.43 | 0.97
total-duration (in days) | 8 – 15 | 1.27 | 15.06 | 1.54
total-duration (in days) | 16 – 30 | 0.83 | 16.85 | 1.73
total-duration (in days) | > 30 | 1.24 | 24.20 | 2.48
commits in b1 | 1 | 1.27 | 5.99 | 0.61
commits in b1 | 2 – 5 | 1.98 | 7.60 | 0.78
commits in b1 | 6 – 20 | 2.38 | 17.82 | 1.82
commits in b1 | > 20 | 4.37 | 15.01 | 1.54
commits in b2 | 1 | 1.60 | 3.39 | 0.35
commits in b2 | 2 – 5 | 2.33 | 7.76 | 0.79
commits in b2 | 6 – 20 | 2.38 | 17.82 | 1.82
commits in b2 | > 20 | 3.46 | 36.92 | 3.78
committers in b1 | 1 – 3 | 6.14 | 9.71 | 1.00
committers in b1 | 4 – 10 | 1.51 | 6.88 | 0.70
committers in b1 | 11 – 30 | 1.08 | 10.91 | 1.12
committers in b1 | > 30 | 1.04 | 21.00 | 2.15
committers in b2 | 1 – 3 | 5.84 | 6.57 | 0.67
committers in b2 | 4 – 10 | 2.19 | 29.51 | 3.02
committers in b2 | 11 – 30 | 1.15 | 43.00 | 4.40
committers in b2 | > 30 | 0.59 | 59.12 | 6.05
changed files in b1 | 1 file | 0.42 | 3.18 | 0.33
changed files in b1 | 2 – 5 | 1.55 | 6.85 | 0.70
changed files in b1 | 6 – 30 | 2.94 | 9.34 | 0.96
changed files in b1 | > 30 | 4.85 | 14.90 | 1.53
changed files in b2 | 1 file | 0.60 | 1.89 | 0.19
changed files in b2 | 2 – 5 | 1.99 | 5.99 | 0.61
changed files in b2 | 6 – 30 | 3.07 | 13.55 | 1.39
changed files in b2 | > 30 | 4.10 | 33.47 | 3.43
loc-churn in b1 | 0 – 10 | 0.41 | 3.03 | 0.31
loc-churn in b1 | 11 – 100 | 1.44 | 9.25 | 0.64
loc-churn in b1 | 101 – 1000 | 2.94 | 9.32 | 0.95
loc-churn in b1 | 1001 – 10000 | 2.83 | 12.81 | 1.31
loc-churn in b1 | > 10000 | 2.15 | 22.27 | 2.28
loc-churn in b2 | 0 – 10 | 0.82 | 3.04 | 0.31
loc-churn in b2 | 11 – 100 | 1.95 | 5.52 | 0.57
loc-churn in b2 | 101 – 1000 | 3.04 | 12.13 | 1.24
loc-churn in b2 | 1001 – 10000 | 2.59 | 26.81 | 2.74
loc-churn in b2 | > 10000 | 1.36 | 46.25 | 4.73
density b1 | 0 – 5 | 4.33 | 10.82 | 1.11
density b1 | > 5 – 20 | 2.07 | 7.44 | 0.76
density b1 | > 20 – 40 | 0.84 | 8.72 | 0.89
density b1 | > 40 | 2.83 | 11.30 | 1.16
density b2 | 0 – 5 | 4.95 | 8.39 | 0.86
density b2 | > 5 – 20 | 2.69 | 13.73 | 1.40
density b2 | > 20 – 40 | 0.82 | 11.44 | 1.16
density b2 | > 40 | 1.32 | 9.26 | 0.95

considering the first rule in table 7, when the branching-duration is 16–30 days, the number of committers in b2 is 4–10, the intersection of developers is 26%–50%, and the number of commits in b2 is greater than 20, the probability of conflict occurrence increases by 850% (lift = 9.50). please note that the confidence and lift of this rule are greater when compared to the individual rules of each attribute.

table 5. measures of interest for the rules {language} → {conflict=yes}
language | sup. (%) | conf. (%) | lift
c | 0.55 | 4.47 | 0.46
c# | 1.11 | 9.62 | 0.97
c++ | 1.27 | 9.49 | 0.96
java | 1.42 | 10.77 | 1.09
javascript | 1.45 | 12.18 | 1.23
php | 1.87 | 15.18 | 1.53
python | 1.04 | 8.43 | 0.85
ruby | 1.18 | 8.88 | 0.90

table 6. measures of interest for the rules related to the intersection of developers {intersection} → {conflict=yes}
% intersection | sup. (%) | conf. (%) | lift
0% | 3.39 | 5.78 | 0.59
1% – 33% | 4.91 | 17.83 | 1.83
34% – 66% | 0.74 | 11.94 | 1.22
67% – 99% | 0.13 | 35.68 | 3.65
100% | 0.60 | 8.19 | 0.84

figure 3. intersection of developers in branches (merges normalized by group): without conflicts, 61% (100,046) with no intersection, 32% (52,143) with some intersection, and 7% (12,265) with all developers in common; with conflicts, 34% (6,036) with no intersection, 60% (10,687) with some intersection, and 6% (1,096) with all developers in common.

3.3 self-conflicts
we observed a significant amount of developer intersection in figure 3. so, we investigated conflicting chunks and commits that have been made by the same developer. we noticed something interesting: in some cases, a developer made parallel changes that resulted in a merge conflict. zimmermann (2007) named this phenomenon self-conflicts. we identified self-conflict cases in all 80 investigated projects. figure 4 summarizes the comparison between self-conflicts and conflicts inserted by different committers in each of the 80 projects, grouped by programming language, for merges with conflicts. in this analysis, we divided the number of conflicting chunks authored by the same developer by the total number of conflicting chunks.
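one possible way to count self-conflicts per chunk, assuming the author sets of each side of a conflicting chunk have already been recovered (e.g., via git blame on each parent's version of the conflicting lines); the helper names are hypothetical, and this sketches only the stricter single-author reading:

def is_self_conflict(authors_side1, authors_side2):
    # a chunk counts as a self-conflict when the same single developer
    # authored both conflicting sides; intersection-based variants exist
    return len(authors_side1) == 1 and authors_side1 == authors_side2

def self_conflict_ratio(chunks):
    # fraction of conflicting chunks authored by the same developer,
    # as plotted per project in figure 4; chunks = [(authors1, authors2)]
    selfs = sum(1 for a1, a2 in chunks if is_self_conflict(a1, a2))
    return selfs / len(chunks)

chunks = [({"ana"}, {"ana"}), ({"lisa"}, {"tom"}), ({"ana"}, {"ana", "tom"})]
print(self_conflict_ratio(chunks))  # 0.333...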
we also decided to mine association rules about the attributes investigated in this study and their effect on the occurrence of self-conflicts. we looked at attributes such as time, the number of commits, committers, changed lines and files, intersection, commit density, and the programming language, but only the existence of developer intersection showed a strong influence on the occurrence of self-conflicts. a self-conflict logically only exists when a developer works in both branches. however, it is important to note that there is a tendency for the chances of self-conflict to increase as the percentage of intersection increases (with a slight exception in the range of 67%–99%, which reduces the chances by 1% compared to the range of 34%–66%), as shown in table 8.

table 7. measures of interest for the combined rules
antecedent | sup. (%) | conf. (%) | lift
branching-duration = 16 – 30 ∧ committers in b2 = 4 – 10 ∧ % intersection = 26% – 50% ∧ commits b2 > 20 | 0.01 | 92.86 | 9.50
branching-duration > 30 ∧ committers in b2 = 4 – 10 ∧ % intersection = 1% – 25% ∧ changed files in b2 > 30 ∧ loc-churn = 101 – 1000 | 0.02 | 90.91 | 9.30
branching-duration > 30 ∧ % intersection = 1% – 25% ∧ commits b2 = 6 – 20 ∧ changed files in b2 > 30 ∧ loc-churn = 101 – 1000 | 0.01 | 89.47 | 9.16
branching-duration > 30 ∧ committers in b2 = 4 – 10 ∧ % intersection = 1% – 25% ∧ commits b2 = 6 – 20 ∧ changed files in b2 > 30 | 0.02 | 89.29 | 9.14
branching-duration = 16 – 30 ∧ committers in b2 = 4 – 10 ∧ % intersection = 26% – 50% ∧ changed files in b2 > 30 | 0.01 | 87.50 | 8.96
branching-duration > 30 ∧ committers in b2 = 4 – 10 ∧ commits b2 = 6 – 20 ∧ changed files in b2 > 30 | 0.02 | 87.10 | 8.91
branching-duration > 30 ∧ committers in b2 = 4 – 10 ∧ % intersection = 1% – 25% ∧ commits b2 = 6 – 20 ∧ loc-churn = 101 – 1000 | 0.03 | 86.27 | 8.83
branching-duration = 16 – 30 ∧ committers in b2 = 4 – 10 ∧ % intersection = 26% – 50% | 0.01 | 85.00 | 8.70
branching-duration > 30 ∧ committers in b2 = 11 – 30 ∧ % intersection = 1% – 25% ∧ changed files in b2 > 30 ∧ loc-churn = 101 – 1000 | 0.01 | 82.35 | 8.43
branching-duration > 30 ∧ commits b2 = 6 – 20 ∧ changed files in b2 > 30 ∧ loc-churn = 101 – 1000 | 0.01 | 81.82 | 8.37

figure 4. conflicting chunks in projects grouped by programming language (percentage of self-conflicts vs. percentage of chunks by different committers).

table 8. measures of interest for the rules related to the intersection of developers {intersection} → {self-conflict=yes}
% intersection | sup. (%) | conf. (%) | lift
0% | 2.40 | 6.98 | 0.19
1% – 33% | 23.06 | 45.93 | 1.27
34% – 66% | 4.67 | 59.33 | 1.65
67% – 99% | 0.70 | 59.18 | 1.64
100% | 5.22 | 81.48 | 2.26

4 discussions
in this section, we answer the research questions presented in section 1 based on the results described in sections 3.1, 3.2, and 3.3. in general, the results for b2 (i.e., the branch that is integrated into b1 during the merge) demonstrated a greater impact on the occurrence of conflicts, mainly for the number of changed files, commits, changed lines, and committers. as the identification of b1 and b2 is based on the merge direction, it depends on the strategy adopted by the software project.
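throughout this section, lift values are read as percentage changes in the chance of the consequent; a minimal sketch of that translation:

def lift_to_percent_change(lift):
    # lift = 2.61 reads as +161% chance, lift = 0.60 as -40% chance
    return (lift - 1.0) * 100.0

print(round(lift_to_percent_change(2.61)))  # 161
print(round(lift_to_percent_change(0.60)))  # -40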
4.1 how is the isolation of a branch related to the occurrence of merge conflicts? (rq1)
the isolation of the branches is mentioned by some studies (bird et al., 2011; costa et al., 2014; dias et al., 2020; leßenich et al., 2018) as a factor that may contribute to the occurrence of conflicts. in our study, we measured the isolation of branches using two attributes related to time: the branching-duration and the total-duration. we calculated these attributes, in days, for each merge case (with and without conflicts). in section 3.1, we observed that both attributes have very similar distributions, and they both present some impact on the occurrence of merge conflicts (effect sizes of -0.3 and -0.27 for branching-duration and total-duration, respectively).
after mining association rules, we noted that the probability of conflicts occurring decreases when the duration is very short (less than a day): 40% less for branching-duration (lift = 0.60) and 38% less for total-duration (lift = 0.62). when the duration is medium (8–15 days), the chances of having a conflict increase by 67% (lift = 1.67) for branching-duration and by 54% (lift = 1.54) for total-duration. so, the results indicate a positive dependence between the duration increase and the chances of having a conflict. the lift for very long durations (more than 30 days) suggests that the chances of having a conflict increase by 161% (lift = 2.61) for branching-duration and by 148% (lift = 2.48) for total-duration.
answer to rq1: the branching-duration and total-duration have a small impact on the occurrence of merge conflicts (effect sizes of -0.3 and -0.27, respectively). despite the small impact, the association rules indicate that the occurrence of conflicts increases when time increases (lift close to 1 for durations of 1–7 days and lift > 2.4 for durations longer than 30 days).

4.2 how is the number of commits related to the occurrence of merge conflicts? (rq2)
to answer this question, we checked the amount of work done in terms of commits in each branch. in section 3.1, we observed that both the number of commits in b1 and the number of commits in b2 have a positive impact on the occurrence of merge conflicts. however, the impact of commits in b2 is larger (effect size of -0.53) than the impact of commits in b1 (effect size of -0.24), indicating that the number of commits in b2 (i.e., the branch that is being integrated into b1) is a better predictor of conflicts than the number of commits in b1.
we analyzed, in figure 2 and table 4, how much more frequent conflicts become as the number of commits in both branches increases. we can see that contributions with few commits in b1 and b2 have a negative dependency on the occurrence of conflicts. when the branch has only one commit, the occurrence of conflict decreases by 39% (lift = 0.61) for b1 and by 65% (lift = 0.35) for b2. having few commits (2–5) shows a decrease of 22% (lift = 0.78) for b1 and 21% (lift = 0.79) for b2. the lifts of 1.54 for b1 and 3.78 for b2 when there are more than 20 commits indicate that the chances of having a conflict increase by 54% for b1 and by 278% for b2. by looking at the probability of having a conflict given the number of commits in figures 2(c) and 2(d), it is possible to see that this probability grows faster with the number of commits in b2, reaching around 40% for 30 commits, while reaching just around 16% for 30 commits in b1.
we also verified the commit density, i.e., the number of commits in b1 and b2 in relation to the branching-duration. we noticed a significant difference between the impact of the number of commits and the impact of the commit density, i.e., the number of commits divided by the branching-duration.
the impact of commit density in b1 (0.07) and b2 (-0.11) is negligible. when looking at the density association rules, we observed that, unlike for the other attributes, there is no pattern of evolution as the value of the attribute increases, that is, as the number of commits in b1 or b2 divided by the branching-duration grows. when the commit density in b1 and b2 is between 0 and 5, the chance of having a conflict increases by 11% for b1 (lift = 1.11) and decreases by 14% for b2 (lift = 0.86). when the commit density is greater than 40, the chance of having a conflict increases by 16% for b1 (lift = 1.16) and decreases by 5% for b2 (lift = 0.95).
answer to rq2: the number of commits has a small impact for b1 (effect size of -0.24) and a large impact for b2 (effect size of -0.53) on the occurrence of merge conflicts. the association rules indicate that the chances of conflict increase when the number of commits increases (according to the ranges of commits, lifts in b1 range from 0.61 to 1.54 and lifts in b2 range from 0.35 to 3.78).

4.3 how is the number of developers that performed commits related to the occurrence of merge conflicts? (rq3)
for this question, we checked the number of committers in each branch. we observed that the number of committers in b1 (i.e., the branch that receives the integration) does not seem to have a statistically significant impact on the probability of merge conflicts. on the other hand, we observed that the number of committers in b2 has a large impact on the occurrence of merge conflicts (effect size of -0.48). these differences can also be observed in figures 2(e) and 2(f), which present the probabilities of conflicts. while the probability barely grows with the number of committers in b1 (from 10% for one committer to 11% for six committers), it grows considerably with the number of committers in b2 (from 10% for one committer to 40% for six committers). hence, the number of committers in the branch that is being integrated (b2) seems to be a good indicator of the possibility of merge conflicts. comparing the distributions of committers in b2 for merges with and without conflicts in table 3, we noted that while merges without conflicts usually have a single committer in b2, conflicting merges seem to have more committers.
the association rules in table 4 also indicate that when the number of committers is large, the chances of conflicts are higher. first, having few committers (1–3) in b1 does not imply more or fewer conflicts (lift = 1.00). however, there is a negative dependency when considering b2; in this case, the occurrence of conflict decreases by 33% (lift = 0.67). for a very large number of committers (i.e., more than 30), we observed an increase in the chances of having a conflict of 115% for b1 (lift = 2.15) and 505% for b2 (lift = 6.05).
answer to rq3: the number of committers has no impact for b1 (p-value of 0.32) and a large impact for b2 (effect size of -0.48) on the occurrence of merge conflicts. the association rules indicate that the chances of conflict increase when the number of committers increases, especially for b2 (lift goes from 0.67 for 1–3 committers to 6.05 for more than 30 committers).

4.4 how is the number of changed files related to the occurrence of merge conflicts? (rq4)
for this question, we checked the amount of work done in terms of changed files in each branch.
4.4 how is the number of changed files related to the occurrence of merge conflicts? (rq4) for this question, we checked the amount of work done in terms of changed files in each branch. the results are similar to the ones related to the number of commits in b1 and b2, with changes in b2 influencing the probability of merge conflicts (effect size of -0.57) more than changes in b1 (effect size of -0.29). figure 2(g) and figure 2(h) present the distributions and the probabilities of conflicts according to the number of changed files in b1 and b2, respectively. the probability of a merge conflict after changes in 40 or more files in b1 is around 16%. on the other hand, the probability after changes in the same number of files in b2 is approximately 36%. for the number of changed files, as expected, the association rules also confirmed that fewer changed files are less likely to cause conflicts. as shown in table 4, a single changed file indicates lower chances of conflicts: 67% less for b1 (lift = 0.33) and 81% less for b2 (lift = 0.19). however, for many changed files (i.e., more than 30), we observed an increase of 53% for b1 (lift = 1.53) and 243% for b2 (lift = 3.43). answer to rq4: the number of changed files has a small impact for b1 (effect size of -0.29) and a large impact for b2 (effect size of -0.57) on the occurrence of merge conflicts. the association rules indicate that the chances of conflict increase when the number of changed files increases (> 30 files in b1 has lift 1.53; > 6 files in b2 has lift 1.39; > 30 files in b2 has lift 3.43). 4.5 how is the number of changed lines related to the occurrence of merge conflicts? (rq5) for this question, we checked the loc-churn, the total number of lines of code added and removed in each branch (gousios and zaidman, 2014; nagappan and ball, 2005; da silva et al., 2020). we verified that changed lines in b2 influence the probability of merge conflicts (effect size of -0.51) more than changed lines in b1 (effect size of -0.33). this result is similar to the ones related to the number of changed files and commits. we also verified that association rules involving changed lines of code have a negative conflict dependency for values below 100 changed lines. rules with values of 0-10 changed lines have their chances of conflict reduced by 69% (lift = 0.31) for both b1 and b2. for changes involving 11-100 lines, the chances are reduced by 36% (lift = 0.64) for b1 and 43% (lift = 0.57) for b2. for modifications involving many changed lines, the chances of a conflict occurring increase: for more than ten thousand lines of code, the chances increase by 128% (lift = 2.28) for b1 and 373% (lift = 4.73) for b2. answer to rq5: the number of changed lines has a small impact for b1 (effect size of -0.33) and a large impact for b2 (effect size of -0.51) on the occurrence of merge conflicts. the association rules indicate that the chances of conflict increase when the number of changed lines increases (lift goes from 0.31 for 0-10 loc to 2.28 for >10000 loc in b1, and from 0.31 for 0-10 loc to 4.73 for >10000 loc in b2).
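for reference, per-branch attributes such as the number of commits, changed files, and loc-churn can be extracted with standard git commands. the sketch below is a simplified, hypothetical illustration (the repository path and revisions are placeholders), not our exact extraction tooling.

# a minimal sketch of per-branch attribute extraction with git;
# repo, ancestor, and tip are hypothetical placeholders
import subprocess

def git(repo, *args):
    return subprocess.run(["git", "-C", repo, *args],
                          capture_output=True, text=True, check=True).stdout

def branch_attributes(repo, ancestor, tip):
    commits = int(git(repo, "rev-list", "--count", f"{ancestor}..{tip}"))
    # --numstat prints "added<TAB>removed<TAB>file" for each changed file
    rows = [line.split("\t") for line in
            git(repo, "diff", "--numstat", ancestor, tip).splitlines()]
    changed_files = len(rows)
    loc_churn = sum(int(a) + int(r) for a, r, _ in rows
                    if a != "-" and r != "-")  # "-" marks binary files
    return commits, changed_files, loc_churn

# for a merge commit with parents p1 (b1) and p2 (b2), the common ancestor
# comes from: git(repo, "merge-base", "<p1-sha>", "<p2-sha>").strip()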
4.6 how is the programming language related to the occurrence of merge conflicts? (rq6) for this question, we observed the eight programming languages adopted in the selected projects. as shown in table 5, for five programming languages (c, c#, c++, python, and ruby), the chances of conflict occurrence decrease. the language c reduces the chances of conflicts by 54% (lift = 0.46). python and ruby also decrease the chances of conflicts, by 15% (lift = 0.85) and 10% (lift = 0.90), respectively. on the other hand, three programming languages (php, javascript, and java) selected for this study present a positive dependency on conflict occurrence. we observed that php increases the chances of conflict by 53% (lift = 1.53) and javascript by 23% (lift = 1.23). for projects written in java, there is an increase of 9% (lift = 1.09) in the chances of a merge conflict. answer to rq6: the association rules indicate that the chances of conflict increase when the project is written in php (53%), javascript (23%), or java (9%). 4.7 how is the intersection of developers in both branches related to the occurrence of merge conflicts? (rq7) for this question, we checked the frequency of the committers in both branches and divided the merges into three groups: merges with no intersection, merges with some intersection, and merges with all developers in common. contrary to our expectations, as presented in figure 3, the intersection of developers does not decrease the chance of merge conflicts. when we mined association rules related to the intersection of developers, we divided the merges into five groups: 0% (merges with no intersection), 1%-33%, 34%-66%, 67%-99%, and 100% (merges with all developers in common) (table 6). we observed that having some intersection (67%-99%) increases the chance of conflict by 265% (lift = 3.65), while having no intersection decreases the probability of conflict by 41% (lift = 0.59). however, when the merge has all developers in common, the chance of conflicts also decreases, by 16% (lift = 0.84). so, having all the developers or no developers in common seems to be better than having just a subset of developers in common. answer to rq7: the association rules indicate that having some intersection increases the chances of conflict (67%-99% by 265%, 1%-33% by 83%, and 34%-66% by 22%). 4.8 how prevalent is the occurrence of merge self-conflicts? (rq8) conflicts caused between commits of the same developer seem more common than we anticipated. note that the percentage of self-conflicts in figure 4 ranges from 5.46% (of 3,152 conflicting chunks) in the yii2 project to 66.23% (of 835 conflicting chunks) in the vert.x project. note also that ten projects had more than 50% of self-conflicts. when considering projects with more than 40% of self-conflict cases, 22 projects are listed. we then decided to analyze a merge case (commit 456424) from the elasticsearch project and observed two examples of self-conflicts: one in a source-code file and one in a debug file. regarding the source-code file, in b1, the developer created an instance of a searchresponse object with a parameter (commit 3a6429), and in b2, the developer performed validation and also created an instance of a searchresponse object, but without parameters (commit d82faf). regarding the debug file, the developer added several lines in both branches (commits 3a6429 and d82faf), possibly during execution in a test environment. when we mined association rules related to the occurrence of self-conflicts, we verified that when the merge involves all the developers in common, the chances of a self-conflict occurring increase by 126% (lift = 2.26), as shown in table 8. we analyzed other attributes, but none showed a strong influence (greater than 27%), except for the intersection of developers. answer to rq8: we identified self-conflicts in all 80 projects. the percentage of self-conflicts ranges from 5.46% (of 3,152 conflicting chunks) in the yii2 project to 66.23% (of 835 conflicting chunks) in the vert.x project.
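the committer-based attributes discussed in rq3, rq7, and rq8 rely on identifying who committed in each branch. below is a minimal, hypothetical sketch of how committer sets and their intersection could be computed; the choice of the union as the denominator is an assumption made for illustration, and the id normalization mirrors the strategy described in the threats to validity section.

# a minimal sketch of the committer intersection between the two branches;
# repo and revisions are hypothetical placeholders
import subprocess

def committers(repo, ancestor, tip):
    out = subprocess.run(
        ["git", "-C", repo, "log", "--format=%an <%ae>", f"{ancestor}..{tip}"],
        capture_output=True, text=True, check=True).stdout
    # normalize ids to reduce alias noise: uppercase and strip spaces
    return {line.upper().replace(" ", "") for line in out.splitlines() if line}

def intersection_ratio(repo, ancestor, p1, p2):
    d1 = committers(repo, ancestor, p1)  # committers in b1
    d2 = committers(repo, ancestor, p2)  # committers in b2
    union = d1 | d2
    return len(d1 & d2) / len(union) if union else 0.0  # 0.0 = none, 1.0 = all in common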
5 threats to validity as in any study, ours also has limitations. our approach uses the committers' git ids (names and/or email addresses) to identify developers who committed in both branches. developers may use multiple aliases, eventually generating inconsistencies (i.e., false negatives) in the results. to reduce this threat, we adopted the strategy of turning all letters to uppercase and removing all existing spaces. we may have missed some cases in which the aliases are lexically different, but in that case, the number of committers in both branches and the number of self-conflicts would be even higher. we believe that a branch's isolation time is relative: someone can create a branch and not commit to it for a while, or someone can perform the branch's last commit and not merge for a while. therefore, the measurement of the duration of a branch has limitations. we used two time metrics to mitigate this threat: one considering just the commits performed within the branches (branching-duration) and one considering the merge commit (total-duration). we are investigating only three-way merge scenarios integrating two branches, so we found and excluded 74,293 fast-forward merges. different merge strategies may not have been considered, for example, git rebase, as it flattens the rich information of parallel development into a linear history. we also excluded 37 merge cases in which the time metrics were negative. since the timestamp of each commit is generated on the developer's computer, if the computer's clock is wrong, the timestamp is recorded incorrectly. in a merge case (merge commit 1da7521) from the elasticsearch project, for example, while the merge was committed on 2/8/2017, the common ancestor (commit 5ee82e4) of the parents' commits (commits 1ba5f8f and e761b76) was committed on 4/20/2018. finally, we excluded 3,672 merges with only merge commits. for example, in a merge commit (197f57c) of the osu project, we found just one commit in each branch, and these commits are also merge commits (commits 660afb4 and 436e155). 6 related work vale et al. (2020) investigated the role of communication activity in the increase or reduction of merge conflicts. they analyzed the history of 30 popular open-source projects involving 19,000 merge scenarios. the authors mined and linked contributions from git and communication from github data. they used bivariate and multivariate analyses to evaluate the correlations. in the bivariate analysis, they found a weak positive correlation between github communication activity and the number of merge conflicts. in the multivariate analysis, they discovered that github communication activity does not correlate with the occurrence of merge conflicts. thus, they investigated whether this depends on the merge scenarios' characteristics, such as the number of modified lines, chunks, files, developers, commits, and days a merge scenario lasts. these variables are calculated per merge scenario (both branches); for example, the authors considered the sum of the number of developers in both branches. they found no relation between the communication measures and the number of merge conflicts when considering these factors.
they concluded that: (1) longer merge scenarios with more developers involve more github communication, but not necessarily more merge conflicts; and (2) the size of the changes of merge scenarios (in terms of the numbers of files, chunks, and lines of code involved) is not sufficient to predict the occurrence of merge conflicts. leßenich et al. (2018) surveyed 41 developers and extracted a set of seven indicators (the number of commits, commit density, number of files changed by both branches, larger changes, fragmentation of changes, scattered changes across classes or methods, and the granularity of changes above or within class declarations) for predicting the number of conflicts in merge scenarios. they also checked additional indicators mentioned in the survey, i.e., whether the more developers contribute to a merge scenario, the more likely conflicts are to happen, and whether branches that are developed over a long time without a merge are more likely to lead to merge conflicts. after determining the respective value for each branch, they computed the geometric mean of these values. to evaluate the indicators, the authors performed an empirical study on 163 open-source java projects, involving 21,488 merge scenarios. they found that none of the indicators can predict the number of merge conflicts, as suggested by the developer survey. hence, they assumed that these indicators are not useful for predicting the number of merge conflicts. owhadi-kareshk et al. (2019) also investigated whether conflict prediction is feasible. they verified nine indicators (the number of changed files in both branches, number of changed lines, number of commits and developers, commit density, keywords in the commit messages, modifications, and the duration of the development of the branch) for predicting whether a merge scenario is safe or conflicting. they adopted norm-1 as the combination operator to combine the indicators extracted for each branch into a single value. to evaluate the predictor, they performed an empirical study on 744 github repositories in seven programming languages, involving 267,657 merge scenarios. similar to the related work above, they did not find a correlation between the chosen indicators and conflicts; however, using the same indicators with a random forest classifier, they were able to detect safe merge scenarios (without conflicts) with high precision (0.97 to 0.98). dias et al. (2020) also conducted a study to better understand how conflict occurrence is affected by technical and organizational factors. they investigated seven factors related to the modularity, size, and timing of developers' contributions. they computed the geometric mean of the branch values for each factor. the authors analyzed 125 projects, involving 73,504 merge scenarios in github repositories of ruby (100) and python (25) mvc projects. they found that merge conflict occurrence significantly increases when the contributions to be merged are not modular, in the sense that they involve files from the same mvc slice (related model, view, and controller files). as previously discussed, vale et al. (2020) and owhadi-kareshk et al. (2019) tried to predict the occurrence of merge conflicts. complementarily, leßenich et al. (2018) tried to predict the number of merge conflicts. vale et al. (2020) and leßenich et al. (2018) did not find a strong correlation between the analyzed attributes and the occurrence and number of conflicts. owhadi-kareshk et al.
(2019) also found no correlation between the indicators and conflicts, but they were able to design a classifier for merge conflicts. our study investigated attributes similar to the ones evaluated by vale et al. (2020) and owhadi-kareshk et al. (2019) (time metrics, number of commits, committers, changed lines, and changed files) and by leßenich et al. (2018) (number of commits, commit density, and files changed in both branches); however, in our results, the investigated attributes do seem to have a positive correlation with merges with conflicts. similar to our results, dias et al. (2020) found that more developers, commits, changed files, and contributions developed over long periods are more likely associated with merge conflicts, although none of the attributes they evaluated showed predictive power concerning the number of merge conflicts. they also investigated attributes similar to ours, such as timing metrics, number of commits, committers, changed lines, and files. although we did not check whether the contributions were modular or not, we added some attributes, such as the frequency of one or more committers in both branches and the verification of conflicting chunks and commits made by the same developer. the extraction of association rules also showed us a tendency toward merge conflicts when there is a longer duration and more commits, committers, and changed files. it is worth mentioning that the attributes evaluated by the previous studies might not be computed in the same way, despite the similarity of the attributes' names. for example, the number of commits is present in all the related work. leßenich et al. (2018) reported the number of commits between the common ancestor and the merge as the geometric mean of both branches. vale et al. (2020) report this number as the sum of the commits performed in the two branches. owhadi-kareshk et al. (2019) used norm-1 (also a sum of absolute values) as the combination operator for the number of commits between the ancestor and the last commit in a branch. dias et al. (2020) also used the geometric mean of the number of commits in each contribution. in our work, we decided to keep the information per branch, using no aggregate measure, as illustrated below.
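the difference between these aggregation choices can be seen on a toy example; the commit counts below are hypothetical, and the mapping of operators to studies follows the descriptions above.

# a minimal sketch contrasting the aggregation operators used in related work
import math

commits_b1, commits_b2 = 3, 24
geometric_mean = math.sqrt(commits_b1 * commits_b2)  # leßenich et al.; dias et al.
norm_1 = abs(commits_b1) + abs(commits_b2)           # owhadi-kareshk et al. (sum of absolute values)
per_branch = (commits_b1, commits_b2)                # this study: values kept separate
print(geometric_mean, norm_1, per_branch)            # 8.48..., 27, (3, 24)

note how both aggregates hide the asymmetry between b1 and b2, which our results indicate is important.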
7 conclusion in this work, we analyzed 182,273 merge scenarios from 80 projects written in eight programming languages to understand which attributes impact the occurrence of merge conflicts. while all attributes seem to have a positive influence on the probability of merge conflicts, some appear to have a more significant impact than others. the attributes that presented a stronger relation to the occurrence of merge conflicts are changed files, commits, changed lines, and committers in branch b2 (i.e., the branch that is integrated into b1 during the merge). these attributes in branch b1 have a smaller impact (changed lines, changed files, and commits) or even no statistically significant difference (committers) on the occurrence of conflicts. both the branching-duration and the total-duration seem to have an impact comparable to the impact of the attributes in b1. despite some attributes presenting a smaller impact on merge conflicts when we compare the whole distributions, the association rules indicate that higher values of these attributes increase the chances of conflicts by over 53%. in addition to these attributes, we analyzed the impact of the adopted programming language and of the intersection of developers between branches on the occurrence of conflicts. among the eight programming languages verified, php, javascript, and java have a positive conflict dependency, with php increasing the chances of conflicts by 53%. regarding the intersection of developers, we noticed that merges with one or more committers acting in both branches do not seem to reduce the chances of merge conflicts. instead, having some intersection of developers increases the chance of conflicts (1%-33% by 83%, 34%-66% by 22%, and 67%-99% by 265%). however, having all the developers or no developers in common reduces the chances of conflicts (by 41% and 16%, respectively). finally, we analyzed how common it is for a single developer to create self-conflicts. we observed that all projects have self-conflicts, with a huge variation in the proportion: while some projects have only 5.46% of self-conflicts, other projects have up to 66.23%. while some attributes have a large impact on the occurrence of merge conflicts, they may not be useful as predictive attributes, since the probability of having a conflict given the value of these attributes is relatively small. nonetheless, these attributes can be used to elaborate policies and best practices to reduce the chances of merge conflicts. the adoption of recognized best practices, such as frequent commits, small changes, and continuous integration, among others, can be reinforced with attention to the number of developers involved and to conflicting changes by the same developer. as future work, we intend to increase the number of attributes and further investigate some of them by conducting a qualitative study on the programming language (what actually influences a language to have greater chances of conflicts, such as verbosity, developer freedom, among other aspects) and on self-conflicts (whether self-conflicts are evenly distributed among the project's committers or whether some committers concentrate the majority of self-conflicts). we also would like to verify our results with some of the analyzed projects' communities. finally, we intend to develop a tool that analyzes the project's history and measures these metrics from time to time to warn the project team. acknowledgements this work was partially supported by capes (88882.464250/201901), cnpq, and faperj. references accioly, p., borba, p., and cavalcanti, g. (2018). understanding semi-structured merge conflict characteristics in open-source java projects. empirical software engineering, 121:2051–2085. agrawal, r., srikant, r., et al. (1994). fast algorithms for mining association rules. in 20th international conference on very large data bases (vldb), pages 487–499, san francisco, ca, usa. anderson, t. w. and darling, d. a. (1954). a test of goodness of fit. journal of the american statistical association, 49:765–769. bird, c., zimmermann, t., and teterev, a. (2011). a theory of branches as goals and virtual teams. in 4th international workshop on cooperative and human aspects of software engineering (chase), pages 53–56, waikiki, honolulu, hi, usa. brindescu, c., ahmed, i., jensen, c., and sarma, a. (2020a). an empirical investigation into merge conflicts and their effect on software quality. empirical software engineering, 25:562–590. brindescu, c., ahmed, i., leano, r., and sarma, a. (2020b). planning for untangling: predicting the difficulty of merge conflicts. in 42nd ieee/acm international conference on software engineering (icse), pages 801–811, seoul, south korea. brun, y., holmes, r., ernst, m. d., and notkin, d. (2011).
proactive detection of collaboration conflicts. in 19th acm special interest group on software engineering (sigsoft) symposium and the 13th european conference on foundations of software engineering (esec), pages 168–178, szeged, hungary. chacon, s. and hamano, j. (2009). pro git. berkeley, ca, 1:509. costa, c., figueiredo, j., murta, l., and sarma, a. (2016). tipmerge: recommending experts for integrating changes across branches. in 24th international symposium on foundations of software engineering (fse), pages 523–534, seattle, wa, usa. costa, c., figueiredo, j. j., ghiotto, g., and murta, l. (2014). characterizing the problem of developers' assignment for merging branches. international journal of software engineering and knowledge engineering, 24:1489–1508. da silva, d. a. n., soares, d. m., and gonçalves, s. a. (2020). measuring unique changes: how do distinct changes affect the size and lifetime of pull requests? in 14th brazilian symposium on software components, architectures, and reuse (sbcars), pages 121–130, natal, brazil. dias, k., borba, p., and barreto, m. (2020). understanding predictive factors for merge conflicts. information and software technology, 121:106256. fayyad, u., piatetsky-shapiro, g., and smyth, p. (1996). from data mining to knowledge discovery in databases. ai magazine, 17:37–54. fayyad, u. m. and irani, k. b. (1992). on the handling of continuous-valued attributes in decision tree generation. machine learning, 8:87–102. ghiotto, g., murta, l., barros, m., and hoek, a. v. d. (2018). on the nature of merge conflicts: a study of 2,731 open source java projects hosted by github. ieee transactions on software engineering, 48:892–915. gousios, g. and zaidman, a. (2014). a dataset for pull-based development research. in 11th working conference on mining software repositories (msr), pages 368–371, hyderabad, india. han, j., kamber, m., and pei, j. (2012). data mining concepts and techniques (3rd edition). leßenich, o., siegmund, j., apel, s., kästner, c., and hunsen, c. (2018). indicators for merge conflicts in the wild: survey and empirical study. automated software engineering, 25:279–313. lu, h., feng, l., and han, j. (2000). beyond intratransaction association analysis: mining multidimensional intertransaction association rules. acm transactions on information systems (tois), 18:423–454. macbeth, g., razumiejczyk, e., and ledesma, r. d. (2011). cliff's delta calculator: a non-parametric effect size program for two groups of observations. universitas psychologica, 10:545–555. mann, h. b. and whitney, d. r. (1947). on a test of whether one of two random variables is stochastically larger than the other. the annals of mathematical statistics, 18:50–60. menezes, j. w., trindade, b., pimentel, j. f., moura, t., plastino, a., murta, l., and costa, c. (2020). what causes merge conflicts? in 34th brazilian symposium on software engineering (sbes), pages 203–212, natal, brazil. nagappan, n. and ball, t. (2005). use of relative code churn measures to predict system defect density. in 27th international conference on software engineering (icse), pages 284–292, st. louis, mo, usa. owhadi-kareshk, m., nadi, s., and rubin, j. (2019). predicting merge conflicts in collaborative software development. in 13th acm/ieee international symposium on empirical software engineering and measurement (esem), pages 1–11, porto de galinhas, brazil. romano, j., kromrey, j.
d., coraggio, j., and skowronek, j. (2006). appropriate statistics for ordinal level data: should we really be using t-test and cohen's d for evaluating group differences on the nsse and other surveys. in 10th annual meeting of the florida association of institutional research (fair), pages 1–3, florida, usa. sarma, a., redmiles, d. f., and hoek, a. v. d. (2011). palantir: early detection of development conflicts arising from parallel code changes. ieee transactions on software engineering, 38:889–908. vale, g., schmid, a., santos, a. r., almeida, e. s. d., and apel, s. (2020). on the relation between github communication activity and merge conflicts. empirical software engineering, 25:402–433. zimmermann, t. (2007). mining workspace updates in cvs. in 4th international workshop on mining software repositories (msr), page 11, washington, dc, usa. zimmermann, t., weisgerber, p., diehl, s., and zeller, a. (2004). mining version histories to guide software changes. in 26th international conference on software engineering (icse), pages 563–572, usa. journal of software engineering research and development, 2019, 7:1, doi: 10.5753/jserd.2019.19 this work is licensed under a creative commons attribution 4.0 international license. a taste of the software industry perception of the technical debt and its management in brazil victor machado da silva [ universidade federal do rio de janeiro | victor0machado@gmail.com ] helvio jeronimo junior [ universidade federal do rio de janeiro | jeronimohjr@gmail.com ] guilherme horta travassos [ universidade federal do rio de janeiro | ght@cos.ufrj.br ] abstract background: the technical debt (td) metaphor has been an exciting topic of investigation for the software industry and academia in recent years. despite the increasing attention of practitioners and researchers, td studies indicate that its management (tdm) is still incipient. particularly in brazilian software organizations (bsos), there is still a lack of information regarding how software practitioners perceive and manage td in their projects. objective: to characterize td and its management from the perspective of bsos, using their practitioners as proxies, and to extend the discussions presented at the 2018 ibero-american conference on software engineering. methods: a survey was performed with 62 practitioners, representing about 12 organizations and 30 software projects.
results: the analysis of 40 valid questionnaires indicates that td is still unknown to a considerable fraction of the participants, and only a small group of organizations adopts td management activities in their projects. besides, it was possible to obtain a set of technologies that can be used to support tdm activities and to make available a survey package to study td and its management. conclusions: although the results provide an initial and representative landscape of the tdm scenario in bsos, further research is needed to observe how effective and efficient tdm activities can be in different software project contexts. keywords: technical debt, software quality, survey, experimental software engineering 1 introduction software evolution is essential for the survival of a software product in the market, since the environment in which it is immersed continually changes. as argued by boehm (2008), in the face of an increasingly dynamic and competitive market, software development organizations need to support continuous and fast delivery of value to the customer in both the short and long terms. in this scenario, many software organizations introduce agility practices into their development processes to handle the frequent requirements changes and the continuous delivery demand (de frança et al. 2016). this context reflects the challenges faced by software practitioners regarding the many decisions they take in their projects over time. at the same time, software practitioners must build high-quality, low-cost, on-time, and useful software products. this working environment brings challenges to practitioners regarding decision-making, setting up a trade-off that can lead to the intentional or unintentional creation of "technical debt" in software projects over time. as argued by tom et al. (2013) and avgeriou et al. (2016), most, if not all, software projects face some td. td refers to technical decisions taken in the software development scenario involving intertemporal choices (becker et al. 2018), which influence positively (intentional and managed) or negatively (unintentional and not managed) the software project ecosystem and the quality of its software products. when td is perceived and managed in software projects, it has the potential to support deliveries of value to customers in a short time. on the other hand, in the long term, some risks to the internal software quality increase when the debt is not perceived and managed in the projects, hindering the maintenance and evolution of the software products (avgeriou et al. 2016). currently, interest in and use of the td metaphor have grown over the years (li et al. 2015). many studies have been discussing different knowledge areas of td and proposing solutions to support software engineers in achieving better results in their projects. using an ad-hoc literature review, we observed some studies discussing the concept of td and technologies to support technical debt management (tdm). as mentioned in li et al. (2015) and alves et al. (2016), only a few studies deal directly with the question of how software organizations perceive and apply the td metaphor in their working environment. also, the software development process is influenced by the country's culture, language, and beliefs (prikladnicki et al. 2007), which can influence how td emerges, is perceived, and is managed.
particularly in brazil, there is a more latent gap regarding how brazilian software organizations (bsos) perceive td and how their practitioners handle it in their projects. assuncao et al. (2015) reported that tdm is a topic of interest at brazilian federal administration departments. however, there is scarce information on whether td is adequately managed in bsos. this information is useful, since it can provide initial insights so that bsos can improve their software processes to minimize the risks that td can bring to the software project ecosystem and the quality of their software products. this context motivated us to investigate how bsos (represented by their practitioners) adopt and manage td. also, it is our interest to observe whether the perceptions of bsos' practitioners on td and its management match the findings of other td investigations. our study intends to raise the level of knowledge of td and its management in bsos. therefore, a survey was designed and conducted with software practitioners engaged in brazilian software organizations. this paper presents the results of this survey, intending to provide the following initial contributions: • to get an initial perception of the td metaphor and its management in bsos, using their engaged professionals as proxies; • to make available a survey package with empirically evaluated instruments to support the gathering and aggregation of information regarding the td perception and tdm activities, tailorable to other localities. this paper is an extension of a previous publication at cibse 2018 (silva et al. 2018b). it details the theoretical background on td regarding its concepts, classification, tdm activities, and approaches for td management. it offers a comparison between the obtained results and those of the related works. the survey's design, analysis, and discussion of results are comprehensively presented, including the answers from three new survey participants. the remainder of this paper is structured as follows: section 2 provides a background on td; section 3 summarizes the works related to our research; section 4 presents the survey design; section 5 explains the survey results; section 6 presents the discussion of the main findings, the works related to our research, and the threats to validity; and section 7 presents the final considerations. 2 theoretical background ward cunningham (1993) first coined the term "technical debt" when discussing with stakeholders the consequences of releasing a poorly written piece of code to accelerate the development process. although such code attends to the core system requirements in the current release, in case of future changes, the consequences might spread over other software areas, affecting its evolvability. since then, the use of the td metaphor spread to allow better communication with non-technical stakeholders (e.g., corporate managers, clients, among others). moreover, it has been used as a quality improvement instrument, bringing to the software development context terms such as "principal" (used to refer to the required effort to eliminate the td source) and "interest" (the additional effort needed on software maintenance due to the presence of td) (alves et al. 2016).
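a small, hypothetical illustration of these financial terms: suppose removing a shortcut (the principal) costs 16 hours, while each release shipped with the shortcut in place costs 3 extra maintenance hours (the interest). the numbers below are invented for illustration only.

# a minimal, hypothetical illustration of td "principal" and "interest"
principal = 16            # effort (hours) to remove the td item now
interest_per_release = 3  # extra maintenance effort paid per release

def total_cost(releases_until_repayment):
    return principal + interest_per_release * releases_until_repayment

print(total_cost(0))  # repay immediately: 16 hours
print(total_cost(4))  # postpone four releases: 28 hours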
although reasonably disseminated, up until 2016, there was no standard definition of the td concept, creating several inconsistencies in the technical literature (tom et al. 2013). some definitions of td over the years are "a way to characterize the gap between the current state of a software system and some hypothesized 'ideal' state in which the system is optimally successful in a particular environment" (brown et al. 2010), "any side of the current system that is considered suboptimal from a technical perspective" (ktata and lévesque 2010), and "a tradeoff between implementing some piece of software in a robust and mature way (the 'right' way) and taking a shortcut which may provide short-term benefits, but which has long-term effects that may impede evolution and maintainability" (klinger et al. 2011). this imprecision in the td definition could cause several misinterpretations, misuse of the td metaphor, and damage to the concept. tom et al. (2013) affirmed that "it is evident that the boundaries of technical debt, as reflected in academic literature, are fuzzy – they lack clarity and definition – and represent a barrier to efforts to model, quantify and manage technical debt." the lack of consensus on the td definition was brought to attention during the dagstuhl seminar 16162, "managing technical debt in software engineering" (avgeriou et al. 2016). this seminar gathered members from academia and industry to discuss many relevant points regarding the td concept. at the end of the seminar, the participants came up with a td definition: "in software-intensive systems, technical debt is a collection of design or implementation constructs that are expedient in the short term, but set up a technical context that can make future changes more costly or impossible." td is also acknowledged for being restricted to internal software quality issues, such as maintainability and evolvability. as can be observed, some differences between the definition of td and its first association with financial debt appeared over the years. the primary divergence is the optionality of repaying the td item (guo et al. 2016). however, some similarities remain: similar to financial debt, strategic, controlled decisions to postpone some tasks to obtain short-term gains, such as shortening time-to-delivery, can be decisive for a product's success (yli-huumo et al. 2016). nowadays, the definition proposed in the dagstuhl seminar is the most accepted among researchers, and it is adopted throughout this paper. this definition, though, contradicts some previous concepts of what should and should not be considered td. for instance, unfinished tasks in the development process are considered a type of non-td, as reported by li et al.'s (2015) secondary study. however, they fit the dagstuhl td definition and are considered td in this paper. in other words, td can be associated with technical decisions about the shortcuts and workarounds taken in software development. such decisions can influence a software project and the quality of its software products positively (strategic and managed) or negatively (unintentional and not managed). however, td can also cause damage to the project, since it might be incurred unintentionally throughout the software development cycle.
td items of this nature can be incurred due to many factors, such as team members' lack of knowledge, leading to source code written without following a specific programming style. therefore, it is crucial that software organizations perceive and manage td in their projects. 2.1 classification even before the academic interest in the td metaphor, the industry had already presented alternatives to classify it. mcconnell (2007) divided td into two types: unintentional debt (td that is not incurred for a strategic purpose, like a badly written piece of code created by an inexperienced programmer) and intentional debt (td usually incurred with a strategic approach, when the team or the organization decides to achieve a short-term gain at the cost of a long-term effort). an example of intentional td is the decision to develop a simplified architecture solution for the software, knowing that it might not attend to the project's future needs. martin fowler (2009) expanded the classification created by mcconnell (2007), considering that, beyond td being intentional (deliberate) or unintentional (inadvertent), it can also be reckless or prudent. the td quadrants structure these classifications, as shown in figure 1. a reckless debt (either deliberate or inadvertent) is incurred in the project without proper planning, creating unnecessary risks. on the other hand, prudent debt items receive attention from the development team, which assesses their risks and makes a plan to repay them. another perspective is to observe the artifacts in which the td item is incurred, or the td item's nature. tom et al. (2013) named this classification scheme the dimensions of td, naming five types of td. later, li et al. (2015) conducted a systematic mapping study, expanding the td dimensions into ten types. the most recent study attempting to classify td according to its nature or origin artifact, to our knowledge, is by alves et al. (2016). in this study, the authors provide a classification of td into fifteen types, like design debt (associated with violations of the principles of good object-oriented design), documentation debt (issues observed in the software documentation), and code debt (problems found in the source code that can make it harder to maintain, usually related to inadequate coding practices). figure 1. td quadrants (adapted from fowler (2009)) 2.2 td management li et al. (2015) state that tdm includes activities that prevent potential td (both intentional and unintentional) from being incurred, as well as activities that deal with the accumulated td to make it visible and controllable and to keep a balance between the cost and value of the software project.
to our knowledge, their mapping study is the most recent on tdm activities, listing eight activities and the main approaches collected from the studies: • td identification: detects td caused by technical decisions in software, either intentional or unintentional; • td measurement: evaluates the cost/benefit relationship of known td items in software or estimates the overall td; • td prioritization: adopts predefined rules to rank known td items, to support the decision-making process; • td prevention: establishes practices to avoid potential td from being incurred; • td monitoring: observes the evolution of known td items over time; • td repayment: eliminates or reduces the td impact (principal and interest) in a software system; • td representation/documentation: represents and codes td in a pre-defined standard, to address the stakeholders' concerns; • td communication: discloses the identified td to the stakeholders. while searching for technologies to support tdm (silva et al. 2018a), it was possible to observe that some studies discuss and propose different technologies, whether approaches, tools, or techniques. table 1 presents some of the leading technologies identified in the literature to support the management of td, grouped by tdm activity. 3 related works klinger et al. (2011) interviewed four software architects at ibm to obtain insights into how the organization perceives and manages td. all four architects stated that debt can be incurred unintentionally, showing up in the projects through, for example, acquisitions, new alignment requirements, or changes in the market ecosystem. they claimed that unintentional debt is usually more problematic than intentional debt. they also affirmed that the decision-making process on tdm is often informal and ad-hoc. finally, the interviewees claimed that there was a gap between executive and technical stakeholders, indicating the lack of a channel or common vocabulary to explain td to non-technical stakeholders. lim et al. (2012) conducted interviews with 35 practitioners with diverse industry experiences from the usa. the authors aimed to understand how td manifests in software projects and to determine which td types practitioners adopted in the industry. they also investigated the causes, symptoms, and effects of td, and finally, they questioned how practitioners deal with td. seventy-five percent of the interviewees were not familiar with the td metaphor. the participants described td as tradeoffs between a short-term gain and an additional long-term effort. they affirmed that the effects of td were not all negative, as the tradeoff depended on the product's value. although they wanted a way to measure td, they claimed that measuring td might not be easy, as its impact is not uniform. besides, they claimed the key to measuring td is to evaluate the cumulative effect over time. finally, the authors suggested starting to manage td in an organization by "conducting audits with the entire development team to make technical debt visible and explicit; track it using a wiki, backlog, or task board." ernst et al. (2015) executed a survey with 1,837 participants in three organizations in the united states and europe.
the authors found a “widespread agreement on high-level aspects of the technical debt metaphor, including some popular financial extensions of the metaphor.” they also observed that the project context dramatically affects how the practitioners perceive td. as they stated, only the software architecture was commonly seen as a source of td, regardless of context. sixty-five percent of the respondents in this survey report that they adopted only ad-hoc tdm practices in their projects. however, many respondents affirmed that they manage td through existing practices, such as risk processes or product backlog. forty-one percent of the participants affirmed not to use any tool for managing td, while only 16% use tools to identify td. ampatzoglou et al. (2016) conducted a study to understand how practitioners in organizations from the embedded systems domain perceive the td. they performed an exploratory case study in seven organizations from four different countries. among other research questions, the authors wanted to find what td types are more frequently occurring in embedded systems. their findings about the most frequent td types in the practitioners’ point of view coincide with the taxonomy proposed by alves et al. (2016), except regarding design debt, which is considered more relevant to researchers as it is for practitioners; and test debt and code debt, which seems to be more relevant to practitioners. the study did not identify the defect, people, process, service, and usability debts. rocha et al. (2017) surveyed with practitioners from bsos to understand how the td is dealt with in practice, at the code level only. among their research questions, they investigated which are the factors that lead developers to create td at the code level, and which practices can prevent developers from creating td at the code level. seventy-four practitioners answered the survey, from which almost 72% affirmed to have low, very low or medium knowledge about the td metaphor. the participants affirmed that developers should follow the best programming practices to help prevent the td, despite admitting they indeed contribute to creating td on their projects. among the main reasons to incur in td, the participants answered management pressure, tight schedule, developer’s inexperience, and work overload. the code review was pointed to as the most relevant practice to prevent the occurrence of td. holvitie et al. (2018) conducted a multi-national survey to observe td in practice, including practitioners from finland, brazil, and new zealand. the authors opted to focus on practitioners managing td in organizations adopting agile practices and methodologies. one hundred eighty-four practitioners answered the survey. approximately 20% of the participants had little to no knowledge on the td definition. thirty-five percent of the brazilian participants were able to provide an example of a td instance. according to the study, the six leading causes of td, selected by more than 50% of the participants, are inadequate architecture, structure, tests, and documentation, software complexity and violation of best practices or style guides. finally, most of the participants perceived refactoring, coding standards, continuous integration, and collective code ownership as having a positive effect on reducing the td in software projects. 
regarding agile software development processes and process artifacts, iteration reviews/retrospectives, the iteration backlog, daily meetings, the product backlog, iteration planning meetings, and iterations were all assigned as having a positive impact on reducing td. table 1. some technologies to support the management of td, by tdm activity: • td identification: manual code inspection, sonarqube, checkstyle, findbugs (yli-huumo et al. 2016); codevizard (zazworka et al. 2013); sonarqube, understand, cppcheck, findbugs, sloccount (ernst et al. 2015). • td documentation/representation: td template (seaman and guo 2011); td backlog/list, documentation practice, jira, wiki, td template (yli-huumo et al. 2016). • td communication: td meetings (yli-huumo et al. 2016); td board (santos et al. 2013); trello (oliveira et al. 2015). • td measurement: sonarqube, jira, wiki; td evaluation template (yli-huumo et al. 2016). • td prioritization: cost/benefit model, issue rating (yli-huumo et al. 2016). • td repayment: redesigning, refactoring, and rewriting (yli-huumo et al. 2016). • td monitoring: sonarqube, jira, wiki (yli-huumo et al. 2016); vtiger and jira (oliveira et al. 2015). • td prevention: coding standards, code reviews, definition of done (yli-huumo et al. 2016). 4 survey design 4.1 research objectives using the goal-question-metric (gqm) paradigm (van solingen et al. 2002), the objective of this study is to analyze td and its management, with the purpose of characterizing, with respect to the level of knowledge and the adopted strategies, activities, and technologies, from the point of view of software practitioners, in the context of brazilian software organizations. 4.1.1 research questions the research questions are explained as follows: • rq1: is there a consensus on the perception of td among software practitioners in bsos? it intends to determine whether the perception of td is homogeneous among professionals in bsos. if so, it can support the observation of the existence of a common perspective on td between industry and academia (a positive side effect of this survey). • rq2: do the practitioners in bsos perceive td in their software projects? before characterizing the tdm activities, it is essential to confirm that the software organizations (through their practitioners) perceive, i.e., observe, the presence of td in their projects. o rq2.1: do bsos manage their td? if td is perceived, it is essential to know whether bsos manage td in their software projects. - rq2.1.1: what tdm activities are most relevant to software projects? the goal of this question is to identify which tdm activities, among those proposed by li et al. (2015), are more relevant, or at least more considered, during software projects. - rq2.1.2: which technologies and strategies are adopted for each tdm activity? for all eight tdm activities proposed by li et al. (2015), which strategies and technologies are used to support them. even though this survey was designed to identify the most common technologies used by practitioners in bsos to support tdm activities, it is not possible to make any judgment regarding their efficiency and effectiveness. furthermore, this survey did not look for the benefits of applying such technologies in bsos. 4.2 questionnaire design the questionnaire was designed according to the guidelines presented in linåker et al.
(2015). we performed an ad-hoc literature review to gather specific information about the perception of td and tdm. concerning tdm, we organized the activities as proposed by li et al. (2015), since such activities cover the ones mentioned in different studies. moreover, we accepted them as consistent, since no disagreements were observed during the pilot trials. for each activity, we identified a set of specific strategies and technologies used to conduct it, as well as a list of possible roles. from this information, a questionnaire was designed, and specific questions for each activity were included. for instance, on td identification, a set of questions involving td classification was included, as per alves et al. (2016). regarding the questionnaire structure, it is divided into fourteen sections, described in table 2. it is composed mostly of closed-ended questions. a small number of open-ended questions was necessary to get further information from the participants. it also contains partially closed-ended questions to deal with issues related to tools and strategies for each tdm activity when the given options do not cover all possible participant answers. table 3 presents an extract of our questionnaire translated into english, with some questions on td identification. each section starts with a brief explanation of its content and specific instructions. the limesurvey platform available in the experimental software engineering (ese) group at coppe/ufrj (http://lens-ese.cos.ufrj.br/ese/) supported the questionnaire implementation and survey execution. the questionnaire was configured to ensure the participants' anonymity. a welcome message describes the survey structure and explains its importance for bsos. the participants are asked to answer the questions based on their current (or most recent) software project and organization. each set of questions related to a specific tdm activity was conditionally presented to the participants only if they had some experience with that activity, to minimize the problem of lengthy survey questionnaires. other conditional breakpoints in the questionnaire were set to end the survey when the participant is not familiar with the td concept or when the organization or the project does not apply any tdm activity. 4.2.1 characterization sections the three sections related to the participant's characterization include questions regarding their role in the projects, academic formation, working experience in software projects, and their organization's field, size, and any maturity model certificate in software processes. to assess the size of the organizations, we adopted the sebrae/ibge classification of organizations, consisting of micro (fewer than ten employees), small (between 10 and 49), medium (between 50 and 99), and large (100 or more) organizations. although this grouping does not constitute a world-level standard, it attends to the first necessity of this study, which is a means to estimate the total number of organizations represented by the participants who answered the survey. finally, the projects in which the participants work are also characterized, through their problem domain and their lifecycle model. in this last question, the agile software development method was included for simplification purposes, even though it is not characterized as a lifecycle model.
4.2.2 td perception section this section aims to gather the participants' general understanding regarding the td definition and its overall aspects. its first question regards the participant's understanding of td. it was not our purpose to inquire of participants who did not know the meaning of td, as they could provide wrong answers in the tdm sections. thus, participants without td knowledge finish the questionnaire at this question. for the participants who claim they know td, a follow-up question was designed to assess which common issues in software development should be considered td. we presented to the participants a list with items not considered td (according to the mapping study conducted by li et al. (2015)) and items considered td (obtained through an ad-hoc literature review). following the questions on the general understanding of td, the participants were asked if td was perceived in their most recent project, i.e., if they could notice any issues that could be associated with td. an affirmative answer to this question allows the participants to answer two follow-up questions: whether their organization adopts any tdm activity and whether their manager (or themselves) adopts any tdm activity, regardless of their organization adopting any. answering "yes" to either of these two questions allows the participant to answer the remaining questionnaire. table 2. questionnaire sections (section, topic, and description): • section 1, participant characterization: obtain personal information regarding the participant, such as professional experience and academic degrees. • section 2, organization characterization: gather information about the organization the participant works for or has worked for before. • section 3, project characterization: obtain information about the project considered by the participant in the survey. • section 4, td perception: collect information on the participant's knowledge regarding td, including what can be considered td. also, determine whether the organization or the project the participant works at has strategies for tdm. • section 5, tdm (general): ask the participant which tdm activities are adopted in the working project. obtain information about the responsibilities and importance associated with each activity from the participant's point of view. • sections 6-13, tdm (activities): gather information on several aspects regarding each of the tdm activities proposed in li et al. (2015). • section 14, tdm (other): provide space for the participant to describe other activities that are executed in the organization. table 3. survey – td identification section (questions and answer options): • is there a formal strategy to identify td? ( ) yes, we have a formal procedure to identify the td. ( ) no, the td identification is executed only informally. • are all the stakeholders required to apply the td identification strategy? ( ) yes, the strategy is mandatory for all stakeholders. ( ) no, the strategy is considered only a suggestion. • at what point in the project is the td identified? ( ) there is no defined period; we identify the td whenever we perceive some issue. ( ) we always identify the td at the end of each iteration/sprint. ( ) the td identification is continuous, i.e., occurs throughout the development process. • mark below all tools or techniques that are used to identify td. [ ] manual coding inspection [ ] dependency analysis [ ] checklist [ ] sonarqube/sqale [ ] checkstyle [ ] findbugs [ ] codevizard [ ] clio [ ] other (cite which)
4.2.3 td management section
the purpose of this section is to identify the adoption and relevance of tdm practices in the bsos' projects. it was clarified to the participants that by “technical debt management” they should consider all activities that organize, monitor, and control the td and its impacts on software projects. the participants were asked to select which tdm activities were conducted in their projects, based on the list of tdm activities provided by li et al. (2015). an additional option was included to give the participants space to mention other tdm activities not discussed by li et al. (2015). for each tdm activity selected by the participants, an additional question asked which roles were responsible for conducting that activity. the leading roles offered as answers were obtained from yli-huumo et al.'s study (2016), but the questionnaire also provided an open-ended question so that the participants could elaborate in case another role should be considered responsible for that activity.

4.2.4 tdm activities sections
eight sections follow in the questionnaire, asking for information regarding each of the tdm activities proposed by li et al. (2015). they are only available to the participants who selected those activities in the previous section. at the beginning of each section, the tdm activity is described, to improve the participants' knowledge and reduce the probability of misunderstanding the proposed questions. those sections follow the structure briefly presented below for each activity. all activity subsections included a question on which tools or techniques were adopted to conduct that particular activity, based mainly on a list obtained from li et al. (2015) and yli-huumo et al. (2016). a sketch of the resulting conditional flow follows the list below.
• td identification: the participants were asked about the use of any formal approach to identify td, as well as whether its use is mandatory. next, they were asked when the td was identified. a subsection was created to assess if and how the td is classified after it has been identified.
• td documentation/representation: the participants were asked whether there was a standard to follow when documenting td and whether it was mandatory for all stakeholders. then they were asked how the td items are documented or cataloged.
• td communication: the participants were only asked how the unresolved td items were communicated among the project stakeholders.
• td measurement: the participants were asked whether there was any previously defined strategy to measure td and how it was measured. they were asked which information or variables were used to measure the td items.
• td prioritization: the participants were asked how the td is prioritized. finally, they were asked which criteria are used to support the td prioritization.
• td repayment: the participants were asked whether there is any planning to repay td.
• td monitoring: like td repayment, the participants were asked how the td is monitored.
• td prevention: for this section, the participants were asked whether there are any formal practices conducted to prevent td and whether they are mandatory or optional for the stakeholders.
• other tdm activities: one last section is provided to gather information regarding other tdm activities used in the participant's software organization, presenting similar questions to the previous sections.
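the following sketch summarizes the conditional flow just described. it is a simplified model of the gating rules we configured in limesurvey, not platform code; all names and the dict-based representation are ours:

    def sections_to_present(answers: dict) -> list:
        # later sections are only reached if earlier gating questions pass
        sections = ["participant", "organization", "project", "td perception"]
        if not answers.get("knows_td"):
            return sections  # survey ends: participant not familiar with td
        if not answers.get("perceived_td"):
            return sections  # follow-up tdm questions require perceiving td
        if not (answers.get("org_adopts_tdm") or answers.get("manager_adopts_tdm")):
            return sections  # survey ends: neither organization nor project applies tdm
        sections.append("tdm (general)")
        # one section per tdm activity, shown only if the participant selected it
        for activity in answers.get("selected_activities", []):
            sections.append(f"tdm ({activity})")
        sections.append("tdm (other)")
        return sections

    print(sections_to_present({"knows_td": True, "perceived_td": True,
                               "org_adopts_tdm": True,
                               "selected_activities": ["identification", "repayment"]}))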
4.3 pilot execution
a pilot trial was conducted using the same artifacts and procedures designed for the final survey, including the survey questionnaire and the execution method, but with a small number of participants from the target population (linåker et al. 2015). seven practitioners were invited to the pilot trials. five of them work on software projects and come from the ese group at coppe/ufrj, which conducts this research. the other two participants also work on software projects but are from outside the research group. all of them have some prior experience with td and/or tdm, mostly in industry. an e-mail invitation included the main instructions and the questionnaire link. they were asked to answer the questionnaire and return their feedback regarding response time, proper understanding, completeness, and other aspects. all pilot participants answered the pilot survey within a week. the average answering time was 15.2 minutes. the relevant comments concerned usability issues, the clarity of questions, and suggestions to improve some details and definitions throughout the questionnaire. these were later discussed internally, and modifications were applied to the final questionnaire. overall, we did not observe negative comments or doubts about either the answer options or the question descriptions, suggesting that the questionnaire was good enough to use in the study.

4.4 target population and sampling
to achieve the research objectives and answer the research questions, practitioners from bsos were selected as the target audience. the sampling design adopted is accidental, a non-probabilistic type of sampling, i.e., we cannot observe randomness in the selection of units from the population. this decision can incur a threat to validity, which is further discussed in section 6.3. an invitation to answer the survey was sent to a series of renowned software development groups in the country. other invitations were sent through the linkedin professional social network. finally, the survey was disclosed to participants at three software-related events: rioinfo (practitioner oriented) and sbqs (high participation from practitioners), in rio de janeiro/brazil, and cbsoft (some practitioner participation), in fortaleza/brazil.

4.5 final revisions and survey release
after the pilot trial, the final survey was released in june 2017. a lab package with the research plan and the survey questionnaire is available in english and portuguese at https://doi.org/10.6084/m9.figshare.5923969.

5 survey results
the survey was conducted between june 2017 and april 2018. in total, 62 participants answered the survey, with 36 complete answers. four participants did not complete the survey but reached the questionnaire's section 4 (td perception), so they were included in this initial analysis, totaling 40 valid answers. the remaining 22 incomplete responses were not included in the analysis. figure 2 summarizes the survey responses.

5.1 participants' characterization
the respondents have an average work experience of 14.15 years in software projects. only four respondents reported having an incomplete undergraduate degree, while the remaining 36 hold at least an undergraduate degree. twenty-six participants reported holding a graduate degree (master's or doctorate).
regarding the bsos in which the respondents work, most (23) are from the it sector. regarding project development, most projects (35) adopt agile or incremental lifecycle models. two projects adopt the spiral model, while three adopt the waterfall model. due to the questionnaire's anonymity, it is not possible to determine the precise number of organizations represented in the survey. however, it is possible to estimate it roughly, based on the information provided by the participants in this section. we could thus estimate around 12 organizations and 30 projects included in the survey. only one organization characterized by the participants adopts the mps.br maturity model to evaluate its software processes, at level g, whereas two participants affirmed the organizations they work for have cmmi level 5 and two others have cmmi level 2.

5.2 td awareness and td perception
regarding the perception of td, from the 40 valid answers, we found that 16 respondents (40%) claimed not to be aware of the td metaphor. the 24 remaining participants (60%) were asked to select the options that best matched the td definition. two of them did not answer this question. as can be observed in table 4, seven issues out of 12 were marked by 50% or more of the 22 participants: low internal quality; poorly written code that violates code rules; "shortcuts" taken during design; the presence of known defects that were not eliminated; architectural problems; planned but unfinished or unplanned tasks; and issues associated with low external quality. regarding the td perception, from the 24 participants that were aware of the td meaning, 17 reported perceiving issues associated with the td concept in their projects, whereas four did not perceive the td occurrence, and one participant did not answer this particular question. from the 17 participants that reported perceiving td in their projects, ten answered that their organizations or the project managers adopt tdm activities. table 5 presents the distribution of these answers, grouped by organization size. table 6 presents the same results among organizations that adopt any model to evaluate their maturity level in software processes.

5.3 td management
from the eight tdm activities proposed by li et al. (2015), as shown in figure 2, only td monitoring was not marked by the participants when asked which tdm activities were conducted in their projects. td identification and td documentation are each conducted in the projects of six participants, while td prioritization, td communication, and td repayment were each marked by five participants. td measurement and td prevention are conducted in projects according to two participants each. one participant did not mention any tdm activity. no participant mentioned any tdm activity besides those proposed by li et al. (2015). table 7 presents the grouping of the results according to organization size.

5.3.1 tdm responsibilities
there was no consensus among participants on which roles should be responsible for each tdm activity. moreover, some distinction was observed between the participants' responses and the tdm framework presented in yli-huumo et al.'s study (2016). for instance, that framework states that software architects and the team leader are the roles responsible for the td measurement activity; however, in our survey, no respondent selected software architects as responsible for this activity.
on the other hand, we also identified that some roles responsible for performing tdm activities are similar to the results reported in yli-huumo et al. (2016), for example, software architect and development team performing td identification. therefore, we consider the findings concerning tdm responsibilities coherent and complementary to those presented in yli-huumo et al. (2016). table 8 presents our results compared with the ones from yli-huumo et al. (2016).

5.3.2 td identification
two out of six participants answered that there is a mandatory strategy to conduct the td identification activities, while one participant answered that there is a formal strategy, albeit not mandatory. three participants claimed to adopt only simple strategies. three out of six answers suggested that td identification was conducted continuously throughout the project. regarding td classification, one participant affirmed that the td was classified as design debt or documentation debt in the project, while one participant claimed to classify the td by the artifact that initially incurred it.

figure 2. summary of the survey responses

5.3.3 td documentation
from the total of six participants, two answered that they have a standard for documenting the td that should be followed by all stakeholders. one participant answered that his/her project has a td documentation standard, but it is not mandatory for the stakeholders. two participants answered that the td documentation is conducted only informally. when asked how the td is documented, four participants answered that they use a general task backlog, with no specific details, while one affirmed to use a specific backlog of td items. one participant did not provide any details on td documentation, despite informing that it is conducted in his/her project.

5.3.4 td communication
from the five participants that answered the td communication section, four affirmed that the td was discussed during project meetings, but with the participation of only a few of the necessary stakeholders. one participant said that the td was only discussed informally.

5.3.5 td measurement
of the two participants that answered the td measurement section, one affirmed that the td measurement was conducted informally, through the analysis of metrics and indicators based on specific information regarding the td item. the other participant indicated that there is a mandatory strategy to measure td, based on direct information, such as the person-hours needed to repay the td item or the item's loc.

5.3.6 td prioritization
regarding td prioritization, three of five participants answered that the td items were prioritized according to “guesses” or simplified estimates based on previous experience, while another used the td item's criticality to prioritize it. four participants affirmed that they tend to prioritize the td items that most impact the client, and three answered that they prioritize the td items that could cause the most impact on the project. one did not provide any details on td prioritization, despite adopting it in his/her project.
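participants reported simple heuristics (criticality, impact on the client or the project) and, as table 9 later shows, one reported cost/benefit analysis. purely to make the cost/benefit idea concrete, the sketch below is our own illustration; no participant described a formula, and all field names are hypothetical:

    def prioritize(td_items):
        # rank td items by a naive benefit/cost ratio: higher expected impact
        # per hour of repayment effort comes first; fields are illustrative only
        return sorted(td_items,
                      key=lambda item: item["impact"] / max(item["repay_hours"], 1),
                      reverse=True)

    backlog = [
        {"id": "td-1", "impact": 8, "repay_hours": 4},   # high impact, cheap to repay
        {"id": "td-2", "impact": 3, "repay_hours": 40},  # low impact, expensive
    ]
    print([item["id"] for item in prioritize(backlog)])  # ['td-1', 'td-2']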
table 4. issues related to td, according to the participants (% of participants)
low internal quality aspects, such as maintainability and reusability – 77%
poorly written code that violates code rules – 68%
"shortcuts" taken during design – 68%
presence of known defects that were not corrected – 68%
architectural problems (like modularity violation) – 55%
low external quality aspects, such as usability and efficiency – 50%
planned, but not performed, or unfinished, tasks (e.g., models, test plans, etc.) – 50%
trivial code that does not violate code rules – 45%
code smells – 45%
defects – 36%
lack of support processes to the project activities – 23%
required, but unimplemented, features – 18%

table 5. td perception, grouped by organization size
fewer than 10 employees (total: 3) – knows td? yes: 1, no: 2; perceived td? yes: 1, no: 0, no answer: 2; adopts any tdm activity? yes: 0, no: 1, no answer: 2
between 10 and 49 employees (total: 11) – knows td? yes: 5, no: 6; perceived td? yes: 3, no: 0, no answer: 8; adopts any tdm activity? yes: 3, no: 0, no answer: 8
more than 100 employees (total: 26) – knows td? yes: 18, no: 8; perceived td? yes: 13, no: 4, no answer: 9; adopts any tdm activity? yes: 7, no: 6, no answer: 13
total: 40 answers

5.3.7 td repayment
from the five participants that answered the td repayment section, two answered that the td repayment is planned according to the current project necessities, while one answered that the td repayment is planned continuously, with specific periods during the development process dedicated to this activity. one participant answered that the td is only repaid when it is no longer possible to avoid it. one participant did not provide any details on td repayment, despite informing that it is conducted in his/her project.

5.3.8 td prevention
both respondents answering the td prevention section mentioned that it is an activity conducted only by each member of the team individually.

5.3.9 technologies and strategies for tdm
table 9 presents a list of practices, techniques, and tools used in each tdm activity. the numbers in parentheses represent the number of participants answering that specific section (column “tdm activity”) and the number of participants that affirmed using that tool or technique (column “tools and techniques”). we can observe that different technologies support tdm, and there is no consensus about which one to use. most of such technologies are similar to those identified in the technical literature (see table 1).

table 6. td perception, grouped by the adoption of maturity models to evaluate software processes
mps.br level g (total: 1) – knows td? yes: 1, no: 0; perceived td? yes: 0, no: 1, no answer: 0; adopts any tdm activity? yes: 0, no: 0, no answer: 1
cmmi level 2 (total: 2) – knows td? yes: 2, no: 0; perceived td? yes: 2, no: 0, no answer: 0; adopts any tdm activity? yes: 0, no: 2, no answer: 0
cmmi level 5 (total: 2) – knows td? yes: 0, no: 2; perceived td? yes: 0, no: 0, no answer: 2; adopts any tdm activity? yes: 0, no: 0, no answer: 2
others (total: 4) – knows td? yes: 2, no: 2; perceived td? yes: 1, no: 0, no answer: 3; adopts any tdm activity? yes: 1, no: 0, no answer: 3
total: 9 answers

table 7. tdm activities conducted in the participants' projects, grouped by organization size (between 10 and 49 employees | more than 100 employees | total)
identification: 2 | 4 | 6
documentation/representation: 3 | 3 | 6
communication: 1 | 4 | 5
measurement: 0 | 2 | 2
prioritization: 2 | 3 | 5
repayment: 2 | 3 | 5
monitoring: 0 | 0 | 0
prevention: 1 | 1 | 2

6 discussion

6.1 revisiting the findings
the analysis of the survey's results, presented in section 5, allowed us to reasonably answer the rqs, which we discuss next.

6.1.1 rq1: consensus on the perception of td
we did not observe consensus in the overall td perception. each participant was asked to select which of the 12 suggested issues should be associated with the td concept, as presented in table 4. out of those options, only one issue was marked by 75% or more of the 22 respondents.
from the issues associated with the td concept by 50% or more of the participants, only one (“issues associated with low external quality”) is not considered td. only one issue associated with the td concept in the literature was marked by less than 50% of the 22 answers: “code smells” (45%). therefore, although we could not find consensus between industry and academia, we consider that there is some agreement among participants on what should be considered td, since 17 out of 22 participants identified that td should be related to internal quality issues. however, 50% of the participants believe that td should also be associated with external quality issues, which is worrisome and contradicts the definition asserted at the dagstuhl seminar (avgeriou et al. 2016). it could indicate that there is a misconception of what should be considered td, associating its definition with any issue occurring during software development. we could observe some alignment in the views on td between the participants and academia, since most of the issues marked by more than half of the participants are also in agreement with the definition indicated in the technical literature.

table 8. tdm responsibilities (our study | yli-huumo et al. (2016))
identification: team leader; software architect; development team | software architect; development team
documentation/representation: project manager; team leader; software architect; development team | software architect; development team
communication: team leader; software architect; development team | project manager; software architect; development team
measurement: team leader; development team | software architect; development team
prioritization: project manager; team leader; software architect; development team | project manager; software architect
repayment: team leader; software architect; development team | software architect; development team
prevention: project manager; team leader; software architect; development team | software architect; development team
table 9. tdm activities – technologies and strategies
td identification (6): manual code inspection (4), dependency analysis (1), checklist (2), sonarqube/sqale (3), checkstyle (1), findbugs (1)
td documentation/representation (6): td backlog (3), specific artifacts for td documentation (1), jira (1), other: trello (1)
td communication (5): discussion forums (3), specific meetings about td (1), other: gitlab (1), trello (1)
td measurement (2): manual measurement (1), sonarqube (2), jira (1)
td prioritization (5): cost/benefit analysis (1), classification of issues (3)
td repayment (5): refactoring (3), redesign (1), code rewriting (4), meetings/workshops/training (1)
td monitoring (0): n/a
td prevention (2): guidelines (2), coding standards (2), code revisions (1), retrospective meetings (1), definition of done (2)
(tool references: sonarqube – https://www.sonarqube.org; sqale – http://www.sqale.org/; findbugs – http://findbugs.sourceforge.net/; jira – https://br.atlassian.com/software/jira; trello – https://trello.com/)

we believe that despite the reasonable understanding of the td definition by some software practitioners, it is vital to better disseminate the distinction between issues related to internal quality (td) and those related to external quality (defects).

6.1.2 rq2: bsos' practitioners' perception of td
only 43% of the 40 participants claimed to perceive td in their software projects, which could be considered low given the importance of the topic. moreover, only 25% of the 40 participants adopt tdm activities, possibly indicating the existence of a severe gap in the overall product quality perspective. however, we did not assess whether other internal quality assurance methods were adopted in place of explicit td practices, which could explain the low perception of td presence in the organizations. grouping the td perception with the size of the organizations (table 5), we could observe that a higher percentage of participants from larger organizations know about td when compared to smaller organizations. most of the participants from these companies also answered that they perceive td in their projects. it could indicate that larger organizations (generally active for a longer period and having more solid processes to manage software development projects) could have a broader perspective on td and tdm. unfortunately, due to the low number of responses, we could not analyze the correlation between the adoption of maturity models to evaluate software processes and any other aspect of the study. out of the nine participants that answered that their organizations adopt maturity models, six did not know what td is or did not perceive it in their last projects. from the remaining three that did perceive td in their projects, only one conducted any activity to manage it. this gap in the results can be used to further develop the research on td in bsos.

6.1.3 rq2.1.1: most relevant tdm activities
the results of our survey show that there was no consensus on which tdm activities are more relevant to the surveyed software projects. however, almost half of the participants that answered this question mentioned that td prevention is relevant to a project. it is a possible research gap for future works, since most of the studies regarding tdm focus on td identification, measurement, and prioritization. regarding the main tdm activities conducted by the participants, our results are mostly in line with yli-huumo et al.
(2016), which indicates that td communication is the most commonly adopted activity among development teams, followed by td identification, documentation, prioritization, repayment, and prevention. the rarely managed tdm activities described in yli-huumo et al.'s study (2016) are td measurement and td monitoring, as also observed in our study. despite the number of participants indicating the importance of td prevention, only two reported performing td prevention activities.

6.1.4 rq2.1.2: technologies and strategies
as presented in section 5.3.9, a list of tools and technologies used to support tdm activities (see table 9) can be used in further studies looking for evidence of their effectiveness and efficiency in managing td.

6.2 comparison with results from related works
most of the studies previously described in section 3 have distinct populations, as they were conducted in other countries. however, we could observe that their results are coherent and complementary to the findings of our survey, as discussed in sections 5.3.1 and 5.3.9. when analyzing the results concerning the td understanding or td perception in the projects, it is possible to observe that most of the surveyed software practitioners reported having a low level of knowledge about td. regarding the management of td, we could identify that it still seems to be incipient in the software organizations surveyed in such studies. most of the studies reported that tdm activities are performed in an informal and ad-hoc way. although some strategies and technologies identified by holvitie et al. (2018) to support tdm activities are coherent and complementary to those identified in our survey (see section 5.3.9), the evidence on their effectiveness and efficiency in managing td must be further investigated. besides, as previously mentioned, some distinctions and similarities were identified regarding which roles should be responsible for each tdm activity (see section 5.3.1).

6.3 threats to validity
like any other empirical study, this research has some threats to validity. next, we report them together with some of the adopted mitigation actions, relying on the classification proposed by wohlin et al. (2012) and linåker et al. (2015). a potential internal threat comes from participants that might have misunderstood some terms and concepts of the questionnaire. there is also a construct threat of a biased survey, stemming from the researchers' perspectives and from the information collected from the technical literature, such as the tdm activities organized by li et al. (2015). to mitigate this threat, we conducted three revision cycles during the survey development with two researchers. furthermore, two pilot trials were executed, followed by a final revision by all the pilot survey participants, aiming to ensure the modifications were aligned with their perspectives. we also observed a potential threat in the way the main topic of the survey was disclosed to potential participants in the invitations. if the participants did not have previous knowledge of the topic, this could have driven them away from the survey, which could have biased the results. we recognize that this effect of the invitations could have affected the study in some way. we also observed an external validity threat concerning the representativeness and high mortality of the survey's respondents.
as part of our disclosure strategy involved presenting the research at software engineering research events, some of our results might present some bias. there is also a high rate of respondent mortality, since a substantial number of responses was discarded. only 65% of the 62 responses were valid to the point that we could obtain some information. the discarded responses refer to 22 incomplete questionnaires whose respondents did not reach the questionnaire's section 4 (td perception), so they could not be included in the analysis. the reason for incomplete questionnaires might be associated with the survey length and response time. overall, the survey has 52 questions (although no participant had to answer the complete survey) distributed over 23 pages. studies report that every additional question can reduce the response rate by 0.5%, and every additional page by 5% (linåker et al. 2015); a back-of-the-envelope illustration of this heuristic is given at the end of this section. however, since we do not have data on this possibility, we cannot formulate any elaborate conclusions. another possible reason for the low number of responses is that the concept of td is still incipient in bsos and, since the topic was explicitly mentioned in the invitations, it could have kept away practitioners that are not familiar with the term. if this is indeed the case, the results would be even more worrisome, as the percentage of practitioners that know what td is could drop drastically. considering the initial number of survey participants, only ten of them reported adopting tdm activities in their projects. however, it is essential to highlight that these participants might not be so representative among those who manage td in the bsos. on the other hand, the practitioners surveyed may be a good sample of how the td concept is perceived in the bsos, which is also an interesting result. this result may indicate that the td concept still needs to be further disseminated in the software industry. in this sense, the dissemination of the td concept and of aspects concerning its management could occur at the level of university courses or even professional training. even among the practitioners who responded to the complete survey, we could observe a level of misconception, pointing out that the td perception is not in line with the dagstuhl definition. finally, the main threat to validity is the generalization of the results. since the target sampling is non-probabilistic, it is not possible to determine a priori the population size and the expected total number of participants. therefore, the results' confidence level might be low, making it hard to generalize them to the entire population (bsos). as argued by mello et al. (2015), establishing representative samples for se surveys is considered a challenge, and the specialized literature often presents limitations regarding the interpretation of surveys' results, mainly due to the use of sampling frames established by convenience and non-probabilistic criteria for sampling from them. as previously mentioned, methodological procedures were used from the planning stage of our study until its execution, aiming to reduce the level of this threat. despite that, the conclusions can provide td research with initial indications of the level of knowledge of bsos regarding the td concept and tdm activities.
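as a back-of-the-envelope illustration of the questionnaire-length heuristic mentioned above (our reading of the guideline: we assume the per-question and per-page reductions compound multiplicatively, which linåker et al. (2015) do not specify):

    questions, pages = 52, 23                 # this survey's size
    per_question, per_page = 0.005, 0.05      # reductions from linåker et al. (2015)

    # expected fraction of the baseline response rate that survives
    retained = (1 - per_question) ** questions * (1 - per_page) ** pages
    print(f"{retained:.0%} of the baseline response rate retained")  # roughly 24%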
7 concluding remarks
this paper presented background about the td definition and the results of a survey conducted with practitioners in bsos. the results provide initial observations regarding how bsos (represented by their software practitioners) perceive and manage td in their projects. before analyzing the survey results, some observations can be made. first, we obtained a considerably low number of responses and an even lower number of complete responses. notwithstanding, the results were enough to provide an initial and representative picture of the perception of td and its management in the scenario of the bsos. regarding the td perception, our results indicate no unanimity concerning how brazilian software practitioners perceive td. regarding tdm, it was observed that only a few bsos report managing td in their software projects, indicating that tdm still seems to be incipient in bsos. four out of nine practitioners reporting tdm activities claimed that td prevention is the most critical activity in their projects, even though only two participants indicated performing it. we believe that the results of this study provide the following contributions to both industry and academia:
• to the bsos (industry): the initial results indicate that software practitioners and their organizations need to better understand the concept of td. this is necessary to achieve better results in their projects, since the perception of td and its management in this scenario is still incipient. the findings also present a list of technologies that can be used to support tdm activities, as long as software engineers evaluate their usage based on the organizations' needs at the time. moreover, the findings indicate that tdm activities usually involve distinct roles throughout the projects. in general, we consider that the bsos need to systematize some actions (e.g., training) to enable their teams to perceive and manage existing td.
• to the researchers: our results indicate that there is a need for more investigations aiming to disseminate td knowledge to practitioners in bsos, as well as to provide strategies and software technologies to support tdm in these organizations. besides, we believe the sharing of this study package can help support the development of investigations on td and its management that are more connected to software organizations' needs in brazil and other regions.
following this study, we conducted two other works that compose a research framework on td and its management. both were reported in silva et al. (2018a). the first work consisted of a quasi-systematic literature review to gather the technologies available in the technical literature to support tdm. the second work organized evidence briefings (cartaxo et al., 2016) in both english and portuguese, combining the survey and the literature review results. the evidence briefings intend to address critical points observed in the survey research, primarily regarding the practitioners' general lack of knowledge or misconceptions concerning td. they are available online at https://doi.org/10.6084/m9.figshare.7011281. overall, we believe this study offered a new perspective on td research in bsos. to the best of our knowledge, only one other survey analyzed td specifically in bsos (rocha et al.
2017), but the authors focused mainly on td located at the code level, without the broader software engineering perspective adopted in our study. moreover, our survey package provides materials that can be used by other software engineering researchers to study this topic in other organizations and software communities, facilitating a better understanding, enabling future comparisons, and providing indications to evolve tdm activities.

acknowledgements
the authors thank all the professionals that took part in this survey and the researchers that collaborated with their feedback on the pilot trials. this study was financed in part by the coordenação de aperfeiçoamento de pessoal de nível superior – brasil (capes) – finance code 001. prof. travassos is a cnpq researcher and an isern member.

references
alves nsr, mendes ts, de mendonça mg, et al. (2016) identification and management of technical debt: a systematic mapping study. inf softw technol 70:100–121. doi: 10.1016/j.infsof.2015.10.008.
ampatzoglou a, ampatzoglou a, chatzigeorgiou a, et al. (2016) the perception of technical debt in the embedded systems domain: an industrial case study. in: proc 2016 ieee 8th int work manag tech debt (mtd 2016), pp 9–16. doi: 10.1109/mtd.2016.8.
assuncao tr, rodrigues i, venson e, et al. (2015) technical debt management in the brazilian federal administration. in: 2015 6th brazilian work agil methods, pp 6–9. doi: 10.1109/wbma.2015.11.
avgeriou p, kruchten p, ozkaya i, et al. (2016) managing technical debt in software engineering (dagstuhl seminar 16162). dagstuhl reports 6:110–138. doi: 10.4230/dagrep.6.4.110.
becker c, chitchyan r, betz s, mccord c (2018) trade-off decisions across time in technical debt management. pp 85–94. doi: 10.1145/3194164.3194171.
boehm b (2008) making a software century.
brown n, cai y, guo y, et al. (2010) managing technical debt in software-reliant systems. in: proceedings of the fse/sdp workshop on the future of software engineering research (foser 2010), pp 47–51.
cartaxo b, pinto g, vieira e, soares s (2016) evidence briefings: towards a medium to transfer knowledge from systematic reviews to practitioners. pp 1–10.
cunningham w (1993) the wycash portfolio management system. acm sigplan oops messenger 4:29–30. doi: 10.1145/157710.157715.
de frança bbn, jeronimo h, travassos gh (2016) characterizing devops by hearing multiple voices. in: proc 30th brazilian symp softw eng (sbes ’16), pp 53–62. doi: 10.1145/2973839.2973845.
ernst na, bellomo s, ozkaya i, et al. (2015) measure it? manage it? ignore it? software practitioners and technical debt. in: 2015 10th joint meeting of the european software engineering conference and the acm sigsoft symposium on the foundations of software engineering (esec/fse 2015) – proceedings, pp 50–60.
fowler m (2009) technical debt quadrant. martinfowler.com. https://martinfowler.com/bliki/technicaldebtquadrant.html.
guo y, spínola ro, seaman c (2016) exploring the costs of technical debt management – a case study. empir softw eng 21:159–182. doi: 10.1007/s10664-014-9351-7.
holvitie j, licorish s, spinola r, et al. (2018) technical debt and agile software development practices and processes: an industry practitioner survey. inf softw technol 96:141–160.
klinger t, tarr p, wagstrom p, williams c (2011) an enterprise perspective on technical debt. in: proceedings – international conference on software engineering, pp 35–38.
ktata o, lévesque g (2010) designing and implementing a measurement program for scrum teams: what do agile developers really need and want? in: acm international conference proceeding series, pp 101–107.
li z, avgeriou p, liang p (2015) a systematic mapping study on technical debt and its management. j syst softw 101:193–220. doi: 10.1016/j.jss.2014.12.027.
lim e, taksande n, seaman c (2012) a balancing act: what software practitioners have to say about technical debt. ieee softw 29:22–27. doi: 10.1109/ms.2012.130.
linåker j, sulaman s, maiani r, höst m (2015) guidelines for conducting surveys in software engineering.
mcconnell s (2007) technical debt. 10x software development (blog).
oliveira f, goldman a, santos v (2015) managing technical debt in software projects using scrum: an action research. in: proceedings – 2015 agile conference (agile 2015), pp 50–59.
prikladnicki r, audy jln, damian d, de oliveira tc (2007) distributed software development: practices and challenges in different business strategies of offshoring and onshoring. in: proc int conf glob softw eng (icgse 2007), pp 262–274. doi: 10.1109/icgse.2007.19.
ribeiro lfb, de farias maf, mendonça m, spínola ro (2016) decision criteria for the payment of technical debt in software projects: a systematic mapping study. in: iceis 2016 – proceedings of the 18th international conference on enterprise information systems, pp 572–579.
rocha jc, zapalowski v, nunes i (2017) understanding technical debt at the code level from the perspective of software developers. in: proc 31st brazilian symp softw eng (sbes ’17), pp 64–73. doi: 10.1145/3131151.3131164.
santos psm, varella a, dantas c (2013) visualizing and managing technical debt in agile development: an experience report.
seaman c, guo y (2011) measuring and monitoring technical debt. adv comput 82:25–46. doi: 10.1016/b978-0-12-385512-1.00002-5.
silva vm, junior hj, travassos gh (2018a) technical debt management in brazilian software organizations: a need, an expectation, or a fact? in: brazilian symposium on software quality (sbqs), curitiba.
silva vm, junior hj, travassos gh (2018b) a taste of the software industry perception of technical debt and its management in brazil. in: av en ing softw a niv iberoam (cibse 2018), pp 1–14.
spínola ro, vetrò a, zazworka n, et al. (2013) investigating technical debt folklore: shedding some light on technical debt opinion. in: 2013 4th international workshop on managing technical debt (mtd 2013) – proceedings, pp 1–7.
tom e, aurum a, vidgen r (2013) an exploration of technical debt. j syst softw 86:1498–1516. doi: 10.1016/j.jss.2012.12.052.
van solingen r, basili v, caldiera g, rombach hd (2002) goal question metric (gqm) approach. encycl softw eng.
wohlin c, runeson p, höst m, et al. (2012) experimentation in software engineering. springer, berlin, heidelberg.
yli-huumo j, maglyas a, smolander k (2016) how do software development teams manage technical debt? an empirical study. j syst softw. doi: 10.1016/j.jss.2016.05.018.
zazworka n, spínola ro, vetro a, et al. (2013) a case study on effectively identifying technical debt. in: acm international conference proceeding series, pp 42–47.

journal of software engineering research and development, 2023, 11:1, doi: 10.5753/jserd.2023.2417
this work is licensed under a creative commons attribution 4.0 international license.
technical debt guild: managing technical debt from code up to build
thober detofeno [pontifícia universidade católica do paraná | thober@gmail.com]
andreia malucelli [pontifícia universidade católica do paraná | malu@ppgia.pucpr.br]
sheila reinehr [pontifícia universidade católica do paraná | sheila.reinehr@pucpr.br]

abstract
efficient technical debt management (tdm) requires specialized guidance so that the decisions taken are oriented to add value to the business. because it is a complex problem involving several variables, tdm requires a systemic view that considers the experiences of professionals from different specialties. guilds have been a means by which technology companies unite specialized professionals around a common interest, especially companies using the spotify methodology. this paper presents the experience of implementing a guild to support tdm activities in a software development organization, using the action research method. the project lasted three years and involved approximately 120 developers in updating about 63,300 source-code files, 2,314 test cases, 2,097 automated test scripts, and the build pipeline. the actions resulting from the tdm guild's efforts impacted the company's culture by introducing new software development practices and standards. besides, they positively influenced the quality of the artifacts delivered by the developers. this study shows that, as the company acquires maturity in tdm, the need for professionals dedicated to tdm activities increases.
keywords: technical debt, technical debt management, community of practice, technical debt guild

1 introduction
technical debt (td) is a metaphor that expresses software artifacts' immaturity and its impact on software maintenance and evolution activities. according to brown et al. (2010), this metaphor characterizes the difference between a software system's current state and its hypothetical ideal state, where the theoretical ideal state is understood as the one established by the context in which the software is inserted (brown et al., 2010). td negatively affects productivity and feasibility in software development. in many cases, developers are forced to introduce more td because of prior debts (besker et al., 2019). it is estimated that between 25% and 37% of all development time is wasted due to td, most of it in understanding or managing td (ampatzoglou et al., 2017; besker et al., 2017; martini et al., 2018). if unmanaged, td can result in significant cost overruns, serious quality problems, reduced developer morale (ghanbari et al., 2017), and a limited ability to add new features (seaman et al., 2012). it can even reach a crisis point when a vast and expensive refactoring or a complete system replacement is needed (martini et al., 2014). the efficient management of td is a little-explored area, although it seems to help quality and productivity during software development (guo et al., 2016; rios et al., 2018). works investigating aspects of td management in the software development process are isolated initiatives (rios et al., 2018). decision-making in td management is hard to standardize because, in most cases, it depends on the organization's context (guo et al., 2016). one way to face this problem is to build a team focused on solving it. this approach can be a practical way of addressing a wide range of issues and offering suggestions on processes and working methods that need improvement (connolly, 1992).
such groups can be implemented using the concepts of communities of practice (cop) (smite et al., 2019). a community of practice is a group of individuals who periodically meet due to a common interest in learning and applying what has been learned, sharing knowledge, exchanging experiences, bringing their problems, and finding solutions. one of the best-known examples of a cop is the concept used by the music streaming technology company spotify, named guild (kniberg, 2014). in a context where tdm should be incorporated into the software development process, bringing together people who have knowledge of and interest in the subject can contribute to finding solutions and generating value for the business. this article presents an experience report on establishing and using a td guild in a software development organization throughout an action research process. the paper describes the experiences, results, success factors, and challenges. the actions promoted by the td guild contributed to the identification, monitoring, prevention, prioritization, and payment activities in tdm. the guild helped align td payment efforts with the organization's goals. due to the several strategies that can be adopted and the difficulties in measuring the results, implementing a tdm process is not a trivial task. this work is expected to support other companies in the challenge of tdm using a guild approach. this study is structured as follows: section 2 presents a literature review; section 3 presents the research method; section 4 describes the context and an overview of the company in which the study was conducted; section 5 describes the three cycles of action research; section 6 presents the results, lessons learned, challenges, related work, and threats to validity. finally, section 7 concludes the paper.

2 background

2.1 guild or communities of practice (cop)
in the middle ages, guilds played an essential role in economic sustainability. a guild was formed hierarchically by masters, journeymen, and apprentices and had experienced and renowned specialists in its field of craftsmanship. these specialists were called master artisans. there was an exchange of knowledge in these guilds to make the work more efficient and productive (wolek, 1999). using these older phenomena as a reference, lave and wenger (1991) coined the term community of practice (cop). in the most current concept, approached by wenger and wenger-trayner (2015), cops are formed by people who share a concern or passion for something and engage in collective learning in a shared domain of human endeavour, learning how to do it better as they interact regularly. for wenger, mcdermott, and snyder (2002), domain, community, and practice are the three essential elements that characterize a cop. the domain builds the community and identity and corresponds to the interest area that attracts and keeps the members. the community, in turn, is the central element, composed of individuals and their interactions based on joint learning. cops stand out for managing knowledge assets in organizations, creating value for members and the organization as a competitiveness tool. they can develop new skills and generate strategic opportunities through innovation (wenger et al., 2002).
it is believed that cops are used in organizations of different natures, under other terminologies, such as learning networks, thematic groups, technology clubs, and guilds. the professional literature on how to scale up agile software development suggests cops as a possible solution for learning and knowledge sharing among individuals with similar functions, such as testers or scrum masters (larman and vodde, 2010). experience from four cops at ericsson shows that success factors include a good topic, a passionate leader, a proper schedule, decision-making authority, openness, tool support, a suitable rhythm, and cross-site participation when needed. the cops at ericsson had three leading roles: to support the agile transformation, to be part of the large-scale scrum implementation, and to support continuous improvement. cops became a central mechanism behind the success of the large-scale agile implementation in the case organization and helped mitigate some of the most pressing problems of the agile transformation (paasivaara and lassenius, 2014). for smite et al. (2019), implementing well-functioning communities is not easy. experiences from oracle corporation, the uk national health service, hewlett-packard, wipro technologies, alcatel, and daimlerchrysler suggest that cultivating a knowledge culture requires organizational attention, support, and sponsorship for cops. inspired by cops, the guilds at spotify are designed beyond formal structures and unite members with common interests, whether related to leisure (cycling, photography, or coffee consumption) or engineering (web development, back-end development, c++ engineering, or agile coaching). figure 1 presents the five types of members identified by smite et al. (2020) in the spotify guilds, based on the numbers of members registered in the communication channels and engaged in the activities. similarly to wenger et al. (2002), smite et al. (2020) identified a group of core members (sponsors and coordinators), active members, and peripheral members (passive members and subscribers); the latter group forms the majority of community members (smite et al., 2020). smite et al. (2020) also noticed that individual members' activity levels change over time for several reasons: the coordinator role rotates, some active members become passive and vice versa, and those who change specialization turn into inactive users who merely subscribe to the latest news.

figure 1. different types of members in a guild (smite et al., 2020).

some guilds arise from shared interests, while others are structured or sponsored and can even have a specific budget. maintaining a guild and generating value for the organization is a challenge.

2.2 technical debt management (tdm)
as previously stated, td represents the effects of immature artifacts in software evolution that bring short-term benefits but have to be adjusted later. the concept, whose scope was initially limited to source code and related artifacts, was expanded to consider different software development stages and work products (alves et al., 2016). rios, mendonça, and spínola (2018) provide a taxonomy with 15 types of td, as described below:
• architecture debt – "refers to the problems found in product architecture, which can affect architectural requirements. usually, architectural debt could result from sub-optimal upfront solutions or sub-optimal solutions as technologies and patterns become superseded, compromising some internal quality aspects, such as maintainability."
• automation test debt – "refers to the work involved in automating tests of previously developed functionality to support continuous integration and faster development cycles."
• build debt – "refers to issues that make the build task harder and unnecessarily time-consuming."
• code debt – "refers to the problems found in the source code (poorly written code that violates best coding practices or coding rules) that can negatively affect the legibility of the code, making it more challenging to maintain."
• defect debt – "refers to known defects, usually identified by testing activities or by the user and reported on bug tracking systems, that the development team agrees should be fixed but, due to competing priorities and limited resources, have to be deferred to a later time."
• design debt – "refers to debt discovered by analyzing the source code and identifying violations of sound object-oriented design principles."
• documentation debt – "refers to the problems found in the software project documentation."
• infrastructure debt – "refers to infrastructure issues that can delay or hinder some development activities if present in the software organization. such issues negatively affect the team's ability to produce a quality product."
• people debt – "refers to issues that can delay or hinder some development activities if present in the software organization", represented by a late hire, for example.
• process debt – "refers to inefficient processes, e.g., what the process was designed to handle may be no longer appropriate."
• requirements debt – "refers to tradeoffs made concerning what requirements the development team needs to implement or how to implement them. in other words, it refers to the distance between the optimal requirements specification and the actual system implementation."
• service debt – "refers to the inappropriate selection and substitution of web services that lead to a mismatch of the service features and applications' requirements. this kind of debt is relevant for systems with service-oriented architectures."
• test debt – "refers to issues found in testing activities that can affect the quality of those activities."
• usability debt – "refers to inappropriate usability decisions that must be adjusted later."
• versioning debt – "refers to problems in source code versioning, such as unnecessary code forks."
design, code, and architecture debts are the most studied td types, probably because several source code analysis tools help identify problems such as complex code, code smells, duplicated code, and others, which often serve as indicators of technical debt. the authors also define the debt types together with a list of situations in which td items can be found in the software (rios et al., 2018). a td item represents an instance of td; it has several causes (factors that lead to the occurrence of the td item, such as inappropriate processes, decisions, or schedule pressure) and consequences to the project. td items can cause several consequences that affect software features, usually related to cost, schedule, and quality. a td item can be associated with one or more artifacts of the software development process (rios et al., 2018); the sketch below pictures such an item as a simple record.
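the record below is our own illustration of this characterization, not a schema proposed by rios et al. (2018); all field names are hypothetical:

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class TdItem:
        # one instance of technical debt, as characterized by rios et al. (2018):
        # it has causes, consequences, and links to the affected artifacts
        identifier: str
        td_type: str                                            # one of the 15 types, e.g. "code debt"
        causes: List[str] = field(default_factory=list)         # e.g. "schedule pressure"
        consequences: List[str] = field(default_factory=list)   # e.g. "higher maintenance cost"
        artifacts: List[str] = field(default_factory=list)      # affected files, models, tests...

    item = TdItem(identifier="td-42", td_type="design debt",
                  causes=["schedule pressure"],
                  consequences=["harder maintenance"],
                  artifacts=["src/report/builder.php"])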
if td items are not managed, they can cause financial and technical problems, increase software maintenance and evolution costs, and lead to a crisis point where the entire future of the software can be compromised (martini and bosch, 2016; spínola et al., 2013; nord et al., 2012). it is not enough for teams to be aware only of what constitutes td; they must be aligned to manage td so that it adds value to the business. simply knowing about td does not necessarily result in value for the software (bavani, 2012). the td metaphor allows thinking about software quality in terms of the organization's business (tom et al., 2013). however, the decision criteria used for the payment of td can differ according to the scenarios and objectives of an organization (rios et al., 2018). a challenge for development teams is to quantify the maintenance problems of their projects to justify the investment in refactoring the td (mo et al., 2018; sharma et al., 2015). convincing arguments are needed about when and why the td should be removed. a model for tdm should foresee the contexts in which td is identified and evaluated, so that decisions can help companies and organizations take advantage of opportunities and anticipate market needs (kruchten et al., 2012). although td affects everyone involved in the project, regardless of its cause, the level of communication regarding td varies. team members generally discuss td among themselves but find it difficult to present evidence of tdm to upper-level management (codabux, 2013). tdm includes identifying, monitoring, and paying td items incurred in a system (griffith et al., 2014). rios, mendonça, and spínola (2018) describe prevention, identification, monitoring, and payment as macro activities, and documentation and communication as activities performed throughout tdm. some activities, such as identification (e.g., td detection by static source code analysis), measurement (td quantification using estimates), and payment (td resolution by techniques such as re-engineering or refactoring), receive more attention with the support of appropriate tools and approaches (li et al., 2015). the payment activity refers to the activities carried out to support decision-making about the most appropriate time to eliminate td items; at this point, the prioritization of which td item should be eliminated is made (rios et al., 2018). tdm makes it possible to decide on eliminating the td and on the most appropriate moment to do so (guo et al., 2016). decision-making criteria are the basis for prioritizing the payment of td items. tdm should be based on a rational approach to decision-making, considering planned and potential future development (schmid, 2013). guo et al. (2016) mention that aspects of managing td in a software development process have been little explored. up to the literature search conducted for this work, no studies reporting experiences of applying cops or guilds to support tdm were found.

3 research method
considering the characteristics of this study, the research method selected was action research. action research aims to enable research subjects, participants, and researchers to respond to the problems they experience with greater efficiency, based on transformative action.
the characterization of action research varies from one author to another; however, there is a set of common characteristics (dick, 2000):

• act in an existing situation to improve it and expand knowledge on the subject.
• have a cyclical nature, repeatedly executing a series of steps. the cycle varies according to the author, but it must include the stages of planning, action, and reflection.
• possess a reflexive nature, with critical reflection on the research process and the obtained results.
• be primarily qualitative, although quantifications are possible in some situations.

in coughlan and coghlan (2002), the action research cycle comprises three steps, as illustrated in figure 2:

1. a pre-step: to understand context and purpose.
2. six main steps that relate first to the data and then to the action, as follows:
• data gathering: data can be collected through observations, interviews, surveys, and reports, yielding qualitative or quantitative data.
• data feedback: the collected data is submitted to the organization for analysis through reports or feedback meetings.
• data analysis: each party contributes a critical view of the collected data, internal company issues, the conduct of the research, and its interaction with the researcher's knowledge.
• action planning: establishing what will be done and by when.
• implementation: the actions are implemented to promote the planned changes in collaboration with the stakeholders.
• evaluation: a reflection on the results, expected or not, of executing the action, aiming to improve the next cycle.
3. a meta-step of monitoring that runs through all the cycles.

each cycle leads to another, so continuous planning, implementation, and evaluation occur over time, as illustrated in figure 2.

figure 2. action research cycle (coughlan and coghlan, 2002).

our study was structured based on this approach, as illustrated in figure 3. it began with a stage of understanding the context and proceeded with three cycles of the driving phase, composed of the following steps:

• planning: data analysis was performed with those involved to establish what would be done and when.
• action: the planned activities were implemented to promote the planned changes in collaboration with those involved and responsible for the organization.
• evaluation: a reflection was performed to analyze the outcomes, aiming to improve the following cycle.

each cycle of this research was conducted as presented below:

• 1st cycle: the guild was created, and the guidelines for the scheduled and unscheduled social interactions were established. the first steps toward td identification were taken, and the teams were guided in td payment and monitoring.
• 2nd cycle: this cycle was a review of the previous one, in which the tools and td management activities were revised. the guild promoted the standardization of source code development and documentation and guided the teams in prioritizing the td.
• 3rd cycle: the review of the tools and of the td identified in the source code was maintained, and the td guild sought to identify and propose actions to pay the td in the continuous integration test artifacts.

the duration of each cycle is linked to the company's annual management cycle, which foresees periods of planning and execution of actions that impact the software development process or the teams' goals.
figure 3. timeline of the research.

4 context

this article describes the experience of a td guild's implementation and evolution in a brazilian software development company founded in 1995. it currently has more than 2,000 customers and 300,000 users worldwide, providing process improvement and compliance management solutions. corporations use its solutions in all kinds of industries: manufacturing, automotive, food and beverage, mining and metals, oil and gas, high-tech and it, energy and utilities, government and public sector, financial services, transportation and logistics, and healthcare. technically, the product is entirely web-based, with documentation and localization for more than ten languages, and is compatible with three database management systems.

the software development area combines benefits of the agile philosophy with project management. project management and planning use the scrum method defined by schwaber and sutherland (2020), divided into two-week development cycles with a quarterly release to the market. thus, the company does not have automated continuous delivery or continuous deployment; however, it has continuous integration with a standardized and automated development flow for all software development teams.

during the three years of the study, the area had, on average, 96 professionals split into 12 teams composed of professionals with the following roles: product owners (po), scrum masters (sm), developers (dev), testers, and devops. the teams vary in the number of members, the amount of source code they are responsible for, and the programming languages used. the source code repository is composed of 61% php, 30% javascript, 3% java, 2% html, 2% css, 1% json, and 1% xml. the development area is responsible for approximately 63,300 source-code files. in the second and third years of the study, there was a monthly average of 1,850 change packages committed to the repository (commits).

td concepts and tdm activities (as an approach to contribute to quality and productivity during software development) were presented to the product owners and scrum masters in an internal meeting. the area's director proposed to sponsor and support the creation of a td guild to discuss and offer tdm solutions for the company. the invitation to the td guild was extended to all professionals involved in the product's maintenance and evolution activities. each year, three or four experienced professionals were invited to become active members because they had deep knowledge of the product's architecture. in the three years in which the td guild was implemented and evolved, the sponsor and the coordinator remained the same professionals, but there were changes in the active members. figure 4 shows the number and type of members per year. in the first year, the active members were a tester, a product owner (po), a scrum master (sm), and three developers (devs). in the second year, the members were two testers, three sms, and two devs. in the third year, the members were three testers, an sm, and two devs. the guild was composed of representatives with technical and business knowledge of the product.

figure 4. td guild members.

most members of the td guild were peripheral members, who do not represent the key practitioners.
peripheral members are those with low involvement in the guild's interactions or those impacted by the guild's actions. they offered suggestions, criticisms, or encouragement for the initiatives. in the 2nd cycle, these comments were analyzed through a survey.

the td guild emerged within an organizational context, aligned with strategic objectives, and sponsored by the board. besides the exchange of experiences, learning, and best practices on tdm, the td guild was challenged to generate value for the product and add knowledge to the software development teams. the td guild's formation was based on guidelines for building cops and on the characteristics of autonomy and alignment with strategic objectives of the spotify approach presented by kniberg (2014).

the guild meetings were monthly and face-to-face, but the members met more frequently to deliberate on actions that required more speed in specific cases. the primary responsibilities of the td guild coordinator were to organize the subjects and meetings, monitor the execution of tasks, support guild members, and align the needs with the sponsor. the sponsor was responsible for evaluating the proposed actions, approving them, and providing the necessary resources to execute the tasks.

during guild meetings, each member presented the ideas and problems to which the guild should pay attention. for each action approved by the guild members and the sponsor, a task list was created with a guild member as responsible. the person in charge had the objective of carrying the theme forward, conducting in-depth studies and practical tests to evaluate the proposal's feasibility. the sponsor approved the tasks so that the person in charge could prioritize them alongside the team's other demands. the subjects and actions in progress were discussed at the beginning of the guild meetings; specific issues were often discussed in an internal communication channel or by e-mail.

5 research cycles

the beginning of the td guild was marked by discussions and alignments about its purposes, objectives, guidelines for conducting the activities, and subjects of interest to the organization and its members. at the beginning of each research cycle, guild members reviewed goals and procedures. the td guild's purpose was to study, and to help implement and monitor, tdm, with proposals and actions to improve internal quality and reduce the costs of software maintenance and evolution. to carry out its duty, the td guild developed some directives to conduct the meetings and activities aligned with the organization's expectations, as follows:

• be aligned with the company's strategy.
• have a well-defined purpose or objective.
• have autonomy to implement solutions.
• clearly communicate the problems and opportunities to the interested parties.
• the member must be a promoter of td payment actions within the teams.
• allow members of different teams to participate, considering that the member should have knowledge about the work context.
• the member influences the teams to help direct and prioritize the td refactoring tasks.
• maintain the focus on quality and productivity in software development, helping to define the actions of prioritization and payment of the td.
• guide the teams on best practices and standards of internal development.
• have periodic meetings to monitor the actions and propose changes.

5.1 first cycle

in the first year of this study, the actions of the td guild focused on two initiatives related to the php source code: (1) identify, measure, and monitor the primary technical debts in the php source code; (2) identify and propose actions to improve the php source code. based on the guidelines and the deployment of the initiatives, the td guild defined the following actions:

• deploy tools to support tdm.
• identify td in the context.
• guide teams on td payment.
• monitor td payment.

the guild contributed to disseminating tdm within the company, identifying the td most relevant to the company's objectives, and selecting the most appropriate td identification and monitoring tools. according to the company's goals and resources, the guild members' priority classification of the quality rules guided the td payment.

5.1.1 deploy tools to support tdm

managing td requires tools for implementing and sustaining actions; tools provide support and enable the automation of tdm activities. sonarqube (https://www.sonarqube.org/) and the sonar-php plugin (https://github.com/sonarsource/sonar-php) were selected to identify and monitor the td in the php source code. this choice was mainly due to: the number of quality rules available for php source code; the options for configuring the quality rules; and the possibility of developing specific quality rules. the quality rules provided by sonarqube were reviewed and updated according to the organization's context. sonarqube identified source code that did not comply with the td guild's coding standard.

5.1.2 identify td

in this action, the td guild's objective was to know in detail and select the quality rules provided by sonarqube, considering the company's goals. table 1 presents the priority levels used to select and classify the quality rules. the descriptions of the priorities were defined by the td guild, taking the organization's context into account, and were used to guide the classification of the quality rules. the rationale for using this scale is to allow easy mapping to the five-point scale used in sonarqube: blocker, critical, major, minor, and info.

table 1. priority levels.
blocker: a rule considered a bug, a system vulnerability, or a command that should not be used.
critical: an important rule with a high impact on product quality and source code standardization.
major: a less important rule with a low impact on product quality.
minor: good practice rules that should be monitored.

this action resulted in the approval of 93 quality rules that were activated in sonarqube, classified by priority and td type, as shown in table 2.

table 2. first cycle: rules classified by priority and td type.
priority: blocker 25, critical 27, major 14, minor 27.
td type: code debt 39, defect debt 18, design debt 36.

5.1.3 guide the teams in td payment

the quality rules were classified by their priority, complexity, and impact to support the teams in prioritizing and paying the td. the priority analysis ranked the quality rules considering the objectives and resources available in the research context; this analysis was used to prioritize td payment actions. the analysis of the quality rules' complexity and impact helped the teams select the source files for td payment. complexity was understood as the technical difficulty of solving a quality rule. the impact of a change was classified by the extent of the change within the system, that is, the change's potential to affect other modules or classes. some members ranked the quality rules separately, following the guild's guidelines, and the results of the classification were reviewed and aligned during the guild meetings.
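the paper does not detail the tooling behind these per-priority breakdowns. as a hedged illustration only, a small script against sonarqube's web api (the api/issues/search endpoint with a severity facet) could produce the open-item counts per severity that feed such a ranking; the server url, token, and project key below are placeholders.

```python
import requests

SONAR_URL = "https://sonar.example.com"   # placeholder server url
TOKEN = "squ_xxx"                         # placeholder api token
PROJECT = "my-php-product"                # placeholder project key

def open_issues_by_severity(project_key):
    """return a dict {severity: open-issue count} for one project."""
    resp = requests.get(
        f"{SONAR_URL}/api/issues/search",
        params={
            "componentKeys": project_key,  # restrict to one project
            "resolved": "false",           # open td items only
            "facets": "severities",        # let the server aggregate
            "ps": 1,                       # we only need the facet, not the issues
        },
        auth=(TOKEN, ""),                  # token passed as the basic-auth user
    )
    resp.raise_for_status()
    facet = next(f for f in resp.json()["facets"]
                 if f["property"] == "severities")
    return {v["val"]: v["count"] for v in facet["values"]}

if __name__ == "__main__":
    for severity, count in open_issues_by_severity(PROJECT).items():
        print(f"{severity:8s} {count}")
```

a breakdown like this could then be combined with the complexity and impact rankings described above to select the files for td payment.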
5.1.4 monitor td payment

td monitoring was intended to expose the current state and motivate the teams to pay the td. to support the teams in the periodic monitoring of td, sonarqube was configured per team, and a web portal with the values of the td classifications was made available. this action facilitated the management's follow-up on the teams' td payment initiatives.

5.2 second cycle

at the beginning of the second cycle, the guild members discussed and decided to maintain the objectives and guidelines defined in the first cycle, adding, however, the initiative to identify and propose improvement actions in the php source code most relevant to the project. thus, the td guild defined the following steps:

• deploy tools to support tdm.
• define a coding standard in php.
• define a documentation standard in php.
• identify td in the context.
• train the teams on the standards and best practices.
• evaluate the guild actions with the developers.

the actions to monitor and guide the teams were incorporated into the software development process, so that the teams had guidance on how to track and pay the td. to define the coding and documentation standards from samples of the php source code, the guild members evaluated the impacts of the product's modifications. in this way, besides assessing the impacts, it was possible to estimate the necessary effort and create practical procedures to adopt and maintain the standards.

5.2.1 deploy tools to support tdm

to support decision-making on td prioritization and payment actions, guild members developed two internal systems: one to calculate the effort needed to eliminate a team's td items and another to analyze the dependencies of each php source file. several tools were evaluated to facilitate large-scale changes, format the source code, and eliminate code smells. the guild did not approve tools that apply changes automatically without the developers' manual supervision, but it recommended two free tools (visual studio code, scite) that presented the best results. this subject will likely be discussed again by the guild.

5.2.2 define a coding standard in php

setting a coding standard for php development aimed to define rules and development standards that improve developers' ability to communicate. the ultimate goal is to have less disruption when the source code is maintained by a developer who did not create it. the option was to use the php standards recommendations (psr), which are project specifications proposed by the php framework interop group (php-fig). psr is currently the primary standard used in php development and has source code verification tools that help with automatic adaptation and source code monitoring. this project followed the recommendations of the psr-1 and psr-2 standards. sonarqube, which provides formatting rules according to the psr standard, was used for monitoring the td. the knowledge transfer was done through standard documentation and internal training for developers.
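the paper does not name the verification tools that were adopted. one common option for psr checks is php_codesniffer (phpcs), which ships with a psr2 standard; the sketch below, a hypothetical wrapper rather than the company's actual setup, shows how such a check could summarize violations, for example inside a commit hook.

```python
import json
import subprocess
import sys

def psr2_violations(path):
    """run php_codesniffer against the psr-2 standard and summarize the report."""
    proc = subprocess.run(
        ["phpcs", "--standard=PSR2", "--report=json", path],
        capture_output=True, text=True,
    )
    report = json.loads(proc.stdout)   # phpcs emits {"totals": ..., "files": ...}
    totals = report["totals"]
    return totals["errors"], totals["warnings"]

if __name__ == "__main__":
    errors, warnings = psr2_violations(sys.argv[1])
    print(f"psr-2: {errors} errors, {warnings} warnings")
    sys.exit(1 if errors else 0)   # a non-zero exit can block the commit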
5.2.3 define a php documentation standard

after implementing agile practices in the company, the developers questioned the source code's documentation, especially regarding its value to the product and the teams. according to the developers' reports, especially those of recently hired developers, it was evident that the current source code did not have a standard terminology that allowed a quick understanding. another difficulty identified by the team was finding the routines already implemented in the system. php docblock standards were adopted as the reference for documenting functions, classes, methods, and other source code elements. phpdocumentor (a tool that generates documentation from php source code) was used to create the documentation. after defining the documentation standard, a custom rule in sonarqube helped teams monitor and identify the source code that did not meet the standard. as in the previous action, the knowledge transfer was done by documenting the standard and delivering internal training to developers.

5.2.4 identify td

after defining the coding and documentation standards, it was necessary to review the rules approved in the first cycle. guild members had an improved understanding of the quality rules in this cycle. in total, 125 rules were approved and classified by priority and td type, as shown in table 3.

table 3. second cycle: rules classified by priority and td type.
priority: blocker 33, critical 19, major 34, minor 39.
td type: code debt 44, defect debt 41, design debt 37, documentation debt 3.

5.2.5 train the teams

the training was conducted to qualify and guide developers on changes in programming procedures and source code releases in php. the td guild promoted three courses for eight groups over these first two years. each training session lasted 4 hours, with one developer from each team per group and groups composed of 12 to 14 participants. in total, the guild delivered 96 training hours to almost 90 participants. the first training covered the use of sonarqube and the procedures to monitor the team's source code; the other two courses were about coding standards and documentation of the php source code. the developers who attended the training were responsible for passing the knowledge on to the other developers. the training was documented and published in the company's internal knowledge base tools.

5.2.6 evaluate the actions with the developers

at the end of the second year, a survey was applied to the development team to assess the impact generated by the td guild's actions. the objective of the survey was to extract the developers' perceptions of the actions taken by the guild. we had 83 responses out of 89 peripheral members in total. the survey had two questions: first, a closed question in likert scale format and, second, an open-ended question:

1. the actions of the td guild to improve the source code contributed to the developers' productivity or quality.
2. in your opinion, what were the impacts of the actions promoted by the td guild on projects and teams?

because the open-ended question was not mandatory, 52 responses to it were obtained from the 83 respondents. the survey indicated that approximately 94% agree that the td guild's actions have helped improve the product quality and team productivity.
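the paper does not state how the agreement figure was computed. assuming a five-point likert item where "agree" and "strongly agree" count as agreement, a tally like the following reproduces the kind of top-two-box figure reported; the response distribution below is hypothetical, chosen only to illustrate the computation.

```python
from collections import Counter

# hypothetical distribution of the 83 likert answers
responses = (["strongly agree"] * 44 + ["agree"] * 34 +
             ["neutral"] * 3 + ["disagree"] * 2)

counts = Counter(responses)
agree = counts["strongly agree"] + counts["agree"]
print(f"agreement: {agree / len(responses):.1%}")   # -> agreement: 94.0%
```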
since the answers were written in natural language, a thematic analysis approach was used to analyze the 52 responses to the open-ended question. thematic analysis is an effective method for identifying, analyzing, and reporting patterns and themes within a searched data scope (braun and clarke, 2006). analyzing and coding the answers, we identified patterns among them, and five themes emerged: compliance (31 quotes), maintainability (16 quotes), refactoring (11 quotes), understandability (8 quotes), and reusability (3 quotes).

the most cited characteristic was that the actions improved the standardization of the source code and the use of best programming practices. these actions helped improve the source code's understandability, refactoring, and maintenance (35 responses). three developers quoted source code reuse as a side benefit: the standardization of the source code documentation helped developers locate existing source code in other projects.

as reported in the open-ended question, the most significant impacts on the teams were the change in developers' behavior in development activities and code review. developers were motivated to write cleaner, standards-compliant code, and they sought to interact to improve the source code in the code review activity. in this context, much of the legacy source code had been developed under an architecture with several anomalies, such as difficulty of reuse, strong coupling, and lack of separation of responsibilities among software architecture layers. it was realized that developers understood td's impacts and were concerned with refactoring the source code; still, the pressure to deliver new features, the lack of resources, and the source code's architecture hindered td payment.

5.3 third cycle

in the first two research cycles, guild actions focused on source code artifacts. however, the guild understood that efforts to manage td should expand to identify other types of immature, incomplete, or inadequate artifacts in the software development lifecycle that cause higher costs and lower quality in the long term. in this cycle, the guild kept the actions for improving the php source code and created measures to manage test, automated test, and build technical debts; those were the most significant tds after source code.

the issues discussed by the td guild for this cycle revolved around two main questions: (i) is the current state of functional test planning, documentation, and execution optimal for this context? (ii) are build issues affecting the productivity of the teams? thus, the td guild defined the following actions:

• identify td in the context.
• review test case documentation.
• define an automated test development standard.
• monitor automated test execution.
• identify build debt.

5.3.1 identify td

the guild members understand that the sonarqube version should be updated annually and the priority of the source code quality rules reviewed. in this cycle, 189 quality rules were reviewed: 181 rules provided by sonarqube and eight rules tailored by the guild members. of these, 149 rules were approved and classified by priority and td type, as shown in table 4.

table 4. third cycle: rules classified by priority and td type.
priority: blocker 69, critical 29, major 32, minor 19.
td type: code debt 46, defect debt 26, design debt 45, documentation debt 3, vulnerability debt 29.
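a review like this can start from an export of the active rules. as a hedged sketch (not the guild's actual procedure), sonarqube's api/rules/search endpoint can list the rules activated in a quality profile so they can be tallied by severity before reclassification; the server url and profile key below are placeholders.

```python
import requests

SONAR_URL = "https://sonar.example.com"   # placeholder server url
PROFILE_KEY = "php-profile-key"           # placeholder quality profile key

def active_rules(profile_key):
    """yield every rule activated in a quality profile, page by page."""
    page = 1
    while True:
        resp = requests.get(
            f"{SONAR_URL}/api/rules/search",
            params={"activation": "true", "qprofile": profile_key,
                    "ps": 500, "p": page},
        )
        resp.raise_for_status()
        data = resp.json()
        yield from data["rules"]
        if page * 500 >= data["total"]:   # stop after the last page
            break
        page += 1

if __name__ == "__main__":
    by_severity = {}
    for rule in active_rules(PROFILE_KEY):
        by_severity[rule["severity"]] = by_severity.get(rule["severity"], 0) + 1
    print(by_severity)   # e.g. {'BLOCKER': 69, 'CRITICAL': 29, ...}
```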
5.3.2 review test case documentation

this action aimed to review the descriptions of the test cases executed manually or automatically. it was performed by 12 pos and 13 testers, who reviewed 2,314 test cases, in which 2,097 steps are performed automatically every day and 4,295 steps are performed manually on each new product release. the testlink tool (https://testlink.org/) was used to register and review the test cases; the company already used testlink to record the test cases, and the tool was suitable for this action. following the td guild's guidance, the reviewers answered the following checklist questions about their project's test cases:

• are the documented test cases suitable for the project?
• do the most critical project requirements have planned test cases?
• are the test cases updated in the software test management tool (testlink)?
• do all test cases have a title, objective, action, steps, and the expected results?
• do the test cases have desired results that can be validated?
• do the test cases have a well-defined objective?

5.3.3 define an automated test standard

the need to define a standard for developing automated test scripts was identified from the developers' demotivation in creating automated tests. the guild discussed this issue with some developers and team leaders, and they identified the following causes:

• automated test scripts with too many lines.
• outdated and redundant code.
• many failures due to outdated test execution environments, databases, and test scripts.
• lack of visibility into test automation results.

we emphasize that the tools used to develop and execute automated tests, and their integration with the product, met the company's needs. from the identified causes, the td guild developed a pilot project with a development team; this project produced 96 automated tests that served as the standard for developing new automated tests. the guild proposed and implemented the practice of code review for the automated tests and developed a checklist to support it, with the following questions:

• is the automated test documented, and does it have the test case and step references?
• does the script contain outdated code (e.g., xpath, sleep, non-standard selectors)?
• in image comparison tests, does the test correctly reference the model image?
• is the automated test independent?
• is the data kept in the test base as evidence in case of failure?
• is an evidence image of the correctly executed test generated at the end of the test run?

5.3.4 monitor automated test execution

the teams highlighted the lack of a tool to control the execution of the automated tests. the proposal implemented by the td guild was the development of a data analytics dashboard for the teams to monitor the status of the automated tests. monitoring automated tests provides all developers and stakeholders with a view of the test cycles' progress, the results achieved, the identified failures, and test execution metrics. the td guild made two metrics available to monitor test execution. the first metric presents the test execution status, grouping the data by day; from this metric, the teams can monitor the execution of automated tests. the second metric represents the number of documented test cases with automatic or manual execution; this metric helps the teams plan the resources needed to develop new automated tests.
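the paper does not describe how the dashboard was implemented. assuming the raw execution records can be exported with a day, test, and status column (a hypothetical schema), the first metric reduces to a simple aggregation:

```python
import pandas as pd

# hypothetical export of raw automated-test executions
runs = pd.DataFrame({
    "day":    ["d1", "d1", "d1", "d2", "d2", "d2"],
    "test":   ["t1", "t2", "t3", "t1", "t2", "t3"],
    "status": ["passed", "failed", "passed", "passed", "passed", "failed"],
})

# metric 1: execution status grouped by day, as in figure 5
counts = runs.groupby(["day", "status"]).size().unstack(fill_value=0)
counts["failure_rate"] = counts["failed"] / (counts["failed"] + counts["passed"])
print(counts)
```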
5.3.5 identify build debt

the company has tens of millions of lines of source code changing at a rate of 1,850 commits per month. the main advantages are that the company has a guide for standardizing development, a single source code repository, a single build system for all projects, and a single testing infrastructure. in this action, we sought to identify build times, the builds' success rate, and which services fail most in the build system. the data was extracted from the version control system; we highlight the following information, grouped by month:

• an average of 1,010 merge requests.
• an average of 3,100 requested builds.
• the build success rate, i.e., the percentage of successfully executed builds, decreased from ~60% to ~30% (presented in figure 7).
• the average build time increased from ~10 to ~35 minutes (presented in figure 8).
• of the 43 services performed in the compilation, the five services that failed the most were identified.
• in the last month of the cycle, compilation failures consumed 107,036 minutes of processing.

6 discussion

in this section, we discuss the results obtained in the payment of td, the main success factors, and the challenges faced. we also describe guidelines to support the creation of a td guild, related work, and the main threats to the validity of our work.

6.1 results

the guild was present in all tdm activities. its involvement in td's categorization and prioritization provided confidence and reliability to the teams in td payment. the classification of the quality rules by priority was performed in all three research cycles, so it was possible to evaluate the results obtained in the payment of td items during all cycles. td guild meetings were held monthly, with a pre-defined duration of one to two hours. when necessary, the guild held extra meetings, for example, to anticipate decision-making in the planning phase. it is estimated that each active member spent 8 hours per month participating in the meetings, contributing to decision-making, participating in the communication group, and supporting the implementation of actions.

6.1.1 source code debt

table 5 shows the number of td items at the beginning and end of each cycle, summarized by priority; the reduction column is the amount of td paid. the focus of this table is on the source code debts. in the first cycle, there was a reduction of approximately 62% of the total tds: 64% with blocker priority, 34% with critical, 70% with major, and 11% with minor. the developers' primary explanation for not paying 100% of the tds with blocker priority was their difficulty in prioritizing the refactoring of source code with low importance for the project; at that time, these criteria were not yet being taken into consideration. in the second cycle, there was a decrease of approximately 48% of the total tds: 67% with blocker priority, 15% with critical, 53% with major, and 13% with minor. in this cycle, the teams prioritized td payment in the files with more defects, increasing the value of product quality. in the third cycle, the reduction percentages were lower than in the previous cycles. the same guidelines for prioritizing td were followed in this cycle as in the previous ones; however, the quantity of paid td items with blocker and critical priority was higher than in the second cycle: 20% with blocker priority, 46% with critical, 28% with major, and 6% with minor.
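the percentages above, and those in table 5 below, follow directly from the start and end counts of each cycle; for example, for the blocker items of the first cycle:

```python
def reduction(start, end):
    """td items paid during a cycle and the corresponding percentage."""
    paid = start - end
    return paid, 100 * paid / start

# first cycle, blocker priority (values from table 5)
paid, pct = reduction(8_992, 3_189)
print(paid, f"{pct:.2f}%")   # -> 5803 64.54%
```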
table 5. td items payment results by cycle.

1st cycle
priority    cycle start    cycle end    reduction    % reduction
blocker           8,992        3,189        5,803         64.54%
critical         37,441       24,574       12,867         34.37%
major           476,572      139,057      337,515         70.82%
minor            64,341       56,696        7,645         11.88%
total           587,346      223,516      363,830         61.94%

2nd cycle
priority    cycle start    cycle end    reduction    % reduction
blocker           2,066          666        1,400         67.76%
critical          9,026        7,640        1,386         15.36%
major           650,533      299,642      350,891         53.94%
minor            98,664       85,712       12,952         13.13%
total           760,289      393,660      366,629         48.22%

3rd cycle
priority    cycle start    cycle end    reduction    % reduction
blocker          12,089        9,597        2,492         20.61%
critical         22,517       12,017       10,500         46.63%
major           211,440      150,305       61,135         28.91%
minor            48,037       44,984        3,053          6.35%
total           294,083      216,903       77,180         26.24%

this phenomenon was observed because the large decrease in the first two cycles was possible mainly for td items related to source code formatting. the td guild suggested using automated source code editor tools to support the payment of td items of this nature, accelerating their payment. in the third cycle, tools for source code formatting were no longer needed, and no other tool was identified that could have accelerated the payment of the td remaining in the source code. in the second and third cycles, guild members reviewed and reclassified the priorities to be more precise in paying the td most relevant to the team. thus, it is recommended to analyze the results of table 5 per research cycle.

the analysis of most of the php source code produced an unexpected finding for the guild: dead code, i.e., source code that is not executed by any of the product's routines. this subject will be dealt with in the next improvement cycle; the guild's recommendation was to record a task for each piece of possible dead code identified, to be evaluated by the teams at the beginning of each product release.

the payment of td items identified in the source code improved with each research cycle. thus, table 5 shows that teams have incorporated td prevention and payment into their development activities.

6.1.2 test debt

in the third cycle, the td guild went beyond the source code boundary to seek solutions to improve the internal quality of the product and reduce maintenance costs in other product artifacts, such as test cases, automated test scripts, and the build pipeline. the test cases were reviewed according to the guidelines passed on by the td guild. as previously stated, this action involved 12 pos and 13 testers, who reviewed 2,314 test cases and 6,342 test steps. ten professionals defined a standard for developing the automated tests, rewriting ~1,160 test scripts to comply with the new standard.

a dashboard with automated test execution metrics was developed to help the teams monitor the results. figure 5 shows the test execution status for all teams, but the dashboard also presents the data per team. this dashboard reflects the moment after the standard for developing automated tests was defined (action 5.3.3). the chart illustrates the test scripts executed daily over nine days: for example, on day 1, 1,070 test cases passed and 78 failed. the chart shows that ~6.5% of the performed tests have flaws that the responsible teams should analyze.

figure 5. test execution status per day.

figure 6 shows the percentage of automated tests for all teams, but the organizational dashboard also presents data per team. after reviewing the test cases, this data was obtained to track the number of test steps with automated and manual execution.
during the 3rd cycle, the automated test scripts corresponded to 2,106 (33.21%) test steps, while automated tests for the remaining 4,236 (66.79%) test steps were still to be created.

figure 6. percentage of automated tests.

6.1.3 build debt

in the build procedure, the td guild identified the existence of build debt: the build time, the build success rate, and the failures in the build system were not meeting the needs of the context and were causing rework for testers and developers. in this action, the guild's goal was to present quantitative evidence of the existence of build debt.

the data in the charts presented in figure 7 and figure 8 were extracted from the last 22 months and grouped by month. this period was chosen because the number of builds requested per month did not fall more than 10% below the previous six months' average of 1,037 builds; months were included while the monthly build count stayed above 933 builds. this procedure was chosen to mitigate the risk of the number of builds influencing the results. each time a build is not successfully executed, the developer needs to request the build again, wasting resources. figure 7 presents the percentage of requested builds completed successfully (e.g., if 125 of 200 requested builds were executed successfully, the build success rate is 62.5%). it can be seen in figure 7 that the percentage of builds successfully executed fluctuated until month 17; in the last five months, it dropped below the previous periods, staying at ~30% of the total builds requested.

figure 7. percentage of build success rate.

figure 8 shows the average build time per month (in minutes) over the 22 months. it is visually apparent that the average time to execute a build has increased: in the first five months, the average was ~10 minutes; in the last five months, it was over 35 minutes.

figure 8. average build time per month.

by analyzing the charts in figure 7 and figure 8, it is possible to observe the presence of build debt, as they present evidence that developer rework is increasing even without an increase in build demand. the build results are below the company's desired goal: those responsible agreed that the optimal values for the company would be a build success rate higher than 60% and an average build time per month lower than 15 minutes. the devops team will be responsible for implementing measures to address the build debt.
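the two indicators behind figures 7 and 8 could be derived from a build-log export. the sketch below assumes a hypothetical schema with one row per requested build, its outcome, and its duration in minutes; it is an illustration, not the company's actual extraction.

```python
import pandas as pd

# hypothetical export of the build system's log
builds = pd.DataFrame({
    "month":    ["m1", "m1", "m1", "m2", "m2"],
    "success":  [True, False, True, False, False],
    "duration": [9.5, 12.0, 10.1, 33.0, 40.2],   # minutes
})

per_month = builds.groupby("month").agg(
    success_rate=("success", "mean"),    # figure 7: share of builds that passed
    avg_duration=("duration", "mean"),   # figure 8: average build time (min)
)
print(per_month)
```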
6.2 success factors

this section highlights the main elements that contributed to the successful implementation and continuity of the td guild's actions. they were analyzed during retrospective meetings held by the guild team:

• sponsorship of top management: the sponsorship of the area director contributed positively to the engagement of members and teams. the members felt motivated to participate, knowing that the suggested actions had organizational support, and the teams adhered to the changes because they knew the activities were aligned with the top executive view.
• support tools: the tools used were fundamental to giving td visibility and transparency. for example, we used the data provided by sonarqube in data analytics tools for monitoring and tracking td payment actions.
• well-defined objectives and guidance: the goals and directions established in the guild's first meetings delimited the scope of subjects and tasks aligned with the organization's needs.
• qualified team: the td guild was trustworthy to the teams and stakeholders because it was composed of technically experienced reference professionals.
• alignment with the board of directors: all decisions were aligned with the company's board of directors.
• visible results: the guild's engagement was mainly due to seeing the suggested actions generate value for the organization and knowledge for the members.

6.3 main challenges

during this study period, the td guild was created and obtained recognition from the organization but, at the same time, faced several challenges:

• aligning the members' issues with the organization's needs to generate value for both is a constant challenge in the guild. it was mitigated by the early alignment of the guild's objectives and guidelines with the sponsor and members.
• in suggesting the change actions, guild members faced a complex context in which the size of the source code base and the rate at which it changes were significant. the guild had the technical skills to analyze the environment in detail and propose viable solutions to overcome this challenge. the guild also sought to communicate the purposes and expected results of the changes clearly and permanently to achieve engagement.
• the standards suggested by the guild affected individual characteristics related to software development. the developers had their own coding habits and standards before the guild started; with the guild, changes happened and required a development culture shift.
• in an organizational environment where professionals have several activities and commitments, finding time to devote to the guild's tasks was another challenge. it was mitigated by having the area director sponsor the guild: the guild tasks were prioritized and executed during regular working hours.
• for the actions that involved data analysis, it was necessary to obtain data viewing permission from teams in other areas of the organization. these accesses were granted only to a guild member responsible for extracting and disseminating the data.
• the sponsor and the teams empirically recognized that the actions promoted by the guild contributed to software development; however, it was not possible to quantitatively evaluate the results in the software's maintenance and evolution.

6.4 resulting guidelines

after analyzing the results obtained in the three research cycles, table 6 presents some guidelines to support the creation of a td guild. we split it into three parts: general recommendations, guild meetings, and guild actions. the first part covers guild planning and setup, the second presents the guidelines for the meetings, and the third shows the recommendations for the guild actions.

table 6. guidelines for building a td guild.

part i: general recommendations
context: the td guild should emerge within an organizational context, aligned with strategic objectives and the needs of the software development teams.
purpose: to improve internal quality and reduce the costs of software maintenance and evolution.
challenge: to generate value for the product and add knowledge to software development teams.
guidelines: to develop the purpose, objectives, and guidelines to conduct the meetings and the actions aligned with the organization's expectations.
invitation: the invitation to the td guild should be extended to all professionals involved in the product's maintenance and evolution activities, and should be sent by the guild sponsor.
sponsor: responsible for evaluating the proposed actions, approving them, and providing the necessary resources to execute the tasks.
coordinator: responsible for organizing the subjects and meetings, monitoring the execution of tasks, supporting guild members, and aligning the needs with the sponsor.
active member: a motivated person who participates in the meetings and leads the actions proposed by the td guild.
review: the objectives and guild needs may change over time; thus, guild members should review goals and procedures periodically. we recommend a yearly td guild review.

part ii: guild meetings
meeting schedule: guild meetings can be monthly with a duration of two hours, or biweekly with a duration of one hour.
ideas discussion: each member presents the ideas and problems to which the td guild should pay attention.
actions selection: actions are selected, and for each action a guild member is assigned as responsible for approving and implementing it.
action goal definition: the goal of each action must be aligned in the meetings.
monitoring: progress is presented and discussed during guild meetings.

part iii: guild actions
approval: the sponsor approves the actions so that the person in charge can prioritize the task alongside the team's other demands.
build/execution actions: list the action steps needed to achieve the established goals. desjardins (2011) suggests considering, for the execution of the action: ownership, action steps, responsibility, support, informed, metrics and budget, milestone date, and completion date.
monitoring: the actions in progress are discussed during guild meetings.
communication: specific issues about the actions can be discussed in an internal communication channel or by e-mail after the meeting.

6.5 related work

the td guild has the essential elements of domain, community, and practice that characterize a cop, as described by wenger, mcdermott, and snyder (2002). the guild implemented in our research project has the different types of members identified by smite et al. (2019), as previously presented in figure 4. the td guild followed the recommendations of smite et al. (2019), establishing a straightforward practice and a well-defined scope and holding regular interactions with tasks and responsibilities that showed signs of member engagement with the results. we confirmed the statement of smite et al. (2019) that the sponsor's authority and attention help achieve the guild's objectives. our study confirmed the relevance of the sponsor role: as already pointed out, the director played an essential role in sponsoring the guild.

the guild's formation followed the guidelines of several studies in the area. still, the td guild differentiates itself by supporting the deployment and continuity of tdm in software development. despite the lack of studies on strategies to implement and monitor tdm in a business context, several studies present suggestions and challenges to which a td guild can contribute:

• it considers the context in identifying and evaluating td, as kruchten et al. (2012) suggested.
• it helps the teams quantify, prioritize, and justify the payment of td; the challenges cited in the studies by sharma et al. (2015), fernández-sánchez et al. (2017), and cai and kazman (2018) were also observed in our study.
• it is applicable to providing transparent communication about the expected returns on td payments (fernández-sánchez et al., 2017).
• it involves all stakeholders in decision-making for tdm, as suggested in (fernández-sánchez et al., 2017; rios et al., 2018).

the studies that were part of the tertiary study by rios et al. (2018) did not point out strategies that help prevent td. our study achieved td prevention through source code standardization, team training, standardization of test script development, and code review for automated tests. thus, a td guild can also be used as a strategy to prevent td.

6.6 generalization and threats to validity

the guild participants were invited by the sponsor or the guild coordinator. being personally invited may have made it embarrassing to decline and may have pressured peripheral members into participating more actively in the guild; this can also be seen as a positive factor (since the director sponsored the project). for the actions proposed by the td guild that were aligned with the company's goals and approved by the sponsor, the td guild obtained the necessary resources to carry them on. because of this, the td guild can, at some point, be interpreted as a working group. this study aimed to present the results and challenges of a td guild obtained throughout three action research cycles; it does not detail the calculations, resources, strategies, and tools adopted to support tdm activities.

7 conclusion

with the results obtained, it is possible to conclude that a guild can contribute to technical debt management in an organization. the td guild was present in all tdm activities for the td identified in the source code and was responsible for preventing td by creating standards and guidelines for the teams. the guild also contributed to determining the tds most aligned with the company's objectives. td is often incurred because people are unaware of it; the guild disseminated knowledge about td and guided developers in best practices and development standards. besides, it helped deploy tools to verify and monitor the source code, making incorrect development more difficult.

in the first two years, the td guild focused on the td identified in the php source code; in the third year, the actions promoted by the td guild reached other software artifacts, such as test cases, automated test scripts, and the build pipeline. the td guild promoted actions on different td types: automation test debt, build debt, code debt, defect debt, design debt, documentation debt, and test debt. these experiences can be helpful to other professionals and provide practical knowledge for research on guilds, cops, and tdm.

setting up a guild with periodic meetings was the proposal best suited to the company's context. the continuity and maintenance of the tdm tools were handed over to two company professionals. as the company evolves in tdm, the need for a professional or a dedicated team responsible for tdm also increases. this work raises the question of the lack of a professional trained in and dedicated to tdm in organizations: the td manager.
we are now working to define and implement an incremental and evolutionary tdm process aligned with empirical evidence of use in the software industry.

acknowledgments

the authors would like to thank the company and the professionals who participated in this research.

references

alves, n., mendes, t., de mendonça, m., spínola, r., shull, f., & seaman, c. (2016). identification and management of technical debt. information and software technology, 70(c).
ampatzoglou, a., michailidis, a., sarikyriakidis, c., ampatzoglou, a., chatzigeorgiou, a., & avgeriou, p. (2018). a framework for managing interest in technical debt: an industrial validation. 2018 ieee/acm international conference on technical debt (techdebt).
bavani, r. (2012). distributed agile, agile testing, and technical debt. ieee software, 29(6), pp. 28-33. doi:10.1109/ms.2012.155
besker, t., martini, a., & bosch, j. (2017). the pricey bill of technical debt: when and by whom will it be paid? 2017 ieee international conference on software maintenance and evolution (icsme). doi:10.1109/icsme.2017.42
besker, t., martini, a., & bosch, j. (2019). software developer productivity loss due to technical debt: a replication and extension study examining developers' development work. the journal of systems and software, pp. 41-61. doi:10.1016/j.jss.2019.06.004
braun, v., & clarke, v. (2006). using thematic analysis in psychology. qualitative research in psychology, 3, pp. 77-101.
brown, n., cai, y., guo, y., kazman, r., kim, m., kruchten, p., ..., & zazworka, n. (2010). managing technical debt in software-reliant systems. foser '10: proceedings of the fse/sdp workshop on future of software engineering research, pp. 47-52.
codabux, z., williams, b., bradshaw, g., & cantor, m. (2017). an empirical assessment of technical debt practices in industry. journal of software: evolution and process. doi:10.1002/smr.1894
connolly, c. (1992). team-oriented problem solving. iee seminar on team based techniques design to manufacture.
coughlan, p., & coghlan, d. (2002). action research for operations management. international journal of operations & production management, 22, pp. 220-240. doi:10.1108/01443570210417515
desjardins, m. (2011). how to execute corporate action plans effectively. business in vancouver. archived from the original on 22 march 2014.
dick, b. (2000). a beginner's guide to action research. accessed on: sep/03/2019, available: http://www.aral.com.au/resources/guide.html
fernández-sánchez, c., garbajosa, j., yagüe, a., & pereza, j. (2017). identification and analysis of the elements required to manage technical debt by means of a systematic mapping study. journal of systems and software, 124, pp. 22-38. doi:10.1016/j.jss.2016.10.018
ghanbari, h., besker, t., martini, a., & bosch, j. (2017). looking for peace of mind? manage your (technical) debt: an exploratory field study. 2017 acm/ieee international symposium on empirical software engineering and measurement (esem). doi:10.1109/esem.2017.53
griffith, i., taffahi, h., izurieta, c., & claudio, d. (2015). a simulation study of practical methods for technical debt management in agile software development. proceedings of the winter simulation conference 2014. doi:10.1109/wsc.2014.7019961
guo, y., spínola, r., & seaman, c. (2016). exploring the costs of technical debt management: a case study. empirical software engineering, 21(1), pp. 159-182. doi:10.1007/s10664-014-9351-7
kniberg, h. (2014). spotify engineering culture. (spotify) accessed on: oct/30/2020, available: https://engineering.atspotify.com/2014/03/27/spotifyengineering-culture-part1/?fb_comment_id=278872278947916_360914170743726
kruchten, p., nord, r., & ozkaya, i. (2012). technical debt: from metaphor to theory and practice. ieee software, 29(6), pp. 18-21. doi:10.1109/ms.2012.167
larman, c., & vodde, b. (2010). practices for scaling lean & agile development: large, multisite, and offshore product development with large-scale scrum. addison-wesley professional.
lave, j., & wenger, e. (1991). situated learning: legitimate peripheral participation. cambridge university press.
martini, a., & bosch, j. (2016). an empirically developed method to aid decisions on architectural technical debt refactoring: anacondebt. 2016 ieee/acm 38th international conference on software engineering companion (icse-c).
martini, a., bosch, j., & chaudron, m. (2014). architecture technical debt: understanding causes and a qualitative model. 2014 40th euromicro conference on software engineering and advanced applications.
martini, a., fontana, f. a., biaggi, a., & roveda, r. (2018). identifying and prioritizing architectural debt through architectural smells: a case study in a large software company. springer international publishing. doi:10.1007/978-3-030-00761-4_21
mo, r., snipes, w., cai, y., ramaswamy, s., kazman, r., & naedele, m. (2018). experiences applying automated architecture analysis tool suites. acm/ieee international conference on automated software engineering (ase 2018), pp. 779-789. doi:10.1145/3238147.3240467
nord, r., ozkaya, i., kruchten, p., & gonzalez-rojas, m. (2012). in search of a metric for managing architectural technical debt. 2012 joint working ieee/ifip conference on software architecture and european conference on software architecture, pp. 20-24. doi:10.1109/wicsa-ecsa.212.17
paasivaara, m., & lassenius, c. (2014). deepening our understanding of communities of practice in large-scale agile development. doi:10.1109/agile.2014.18
rios, n., mendonça, m., & spínola, r. (2018). a tertiary study on technical debt: types, management strategies, research trends, and base information for practitioners. information and software technology, 102, pp. 117-145. doi:10.1016/j.infsof.2018.05.010
schmid, k. (2013). a formal approach to technical debt decision making. qosa '13: proceedings of the 9th international acm sigsoft conference on quality of software architectures, pp. 153-162. doi:10.1145/2465478.2465492
seaman, c., guo, y., zazworka, n., shull, f., izurieta, c., cai, y., & vetrò, a. (2012). using technical debt data in decision making: potential decision approaches. 2012 third international workshop on managing technical debt (mtd). doi:10.1109/mtd.2012.6225999
sharma, t., suryanarayana, g., & samarthyam, g. (2015). challenges to and solutions for refactoring adoption. ieee software, 32(6), pp. 44-51.
smite, d., moe, n. b., floryan, m., levinta, g., & chatzipetrou, p. (2020). spotify guilds. communications of the acm, 63(3), pp. 56-61. doi:10.1145/3343146
smite, d., moe, n. b., levinta, g., & floryan, m. (2019). spotify guilds: how to succeed with knowledge sharing in large-scale agile organizations. ieee software, 36(2), pp. 51-57. doi:10.1109/ms.2018.2886178
spínola, r., vetrò, a., zazworka, n., seaman, c., & shull, f. (2013). investigating technical debt folklore: shedding some light on technical debt opinion. 2013 4th international workshop on managing technical debt (mtd). doi:10.1109/mtd.2013.6608671
tom, e., aurum, a., & vidgen, r. (2013). an exploration of technical debt. journal of systems and software, 86(6), pp. 1498-1516. doi:10.1016/j.jss.2012.12.052
wenger, e., & wenger-trayner, b. (2015). introduction to communities of practice: a brief overview of the concept and its uses. accessed on: oct/30/2020, available: http://wenger-trayner.com/wp-content/uploads/2015/04/07-brief-introduction-to-communities-of-practice.pdf
wenger, é., mcdermott, r. a., & snyder, w. (2002). cultivating communities of practice: a guide to managing knowledge. harvard business press.
wolek, f. (1999). the managerial principles behind guild craftsmanship. 5(7). doi:10.1108/13552529910297460
cai, y., & kazman, r. (2019). dv8: automated architecture analysis tool suites. ieee/acm international conference on technical debt (techdebt), pp. 53-54. doi:10.1109/techdebt.2019.00015
li, z., avgeriou, p., & liang, p. (2015). a systematic mapping study on technical debt and its management. journal of systems and software, pp. 193-220. doi:10.1016/j.jss.2014.12.027

journal of software engineering research and development, 2021, 9:13, doi: 10.5753/jserd.2021.1942  this work is licensed under a creative commons attribution 4.0 international license.

towards to transfer the directives of communicability to software projects: qualitative studies

adriana lopes damian [ federal university of amazonas | adriana@icomp.ufam.edu.br ]
edna dias canedo [ university of brasília | ednacanedo@unb.br ]
clarisse sieckenius de souza [ pontifical catholic university of rio de janeiro | clarisse@inf.puc-rio.br ]
tayana conte [ federal university of amazonas | tayana@icomp.ufam.edu.br ]

abstract

the software artifacts developed in the early stages of the development process describe the proposed solutions for the software. for this reason, these artifacts are commonly used to support communication among members of the development team. miscommunication through software artifacts occurs because practitioners typically focus on their modeling, without reflecting on how other software development team members interpret them. in this context, we proposed the directives of communicability (dcs) to support practitioners in analyzing characteristics of the artifact's content that affect communication via artifact. we conducted preliminary studies in a controlled environment with our proposal.
however, we noticed that new studies are necessary to evaluate the dcs concerning practitioners' perceptions before transferring them to the industry. in this paper, we present two studies performed aiming to transfer the dcs to the software industry. in the first study, we evaluated the practitioners' perception about the dcs. in the second study, we evaluated the feasibility of the dcs in a software development team. the studies' results indicated that the dcs have the potential to support improvements in artifacts' content to reduce miscommunication via artifact. to facilitate the use of our proposal in the software industry, we created procedures that support the adoption of the dcs and checklists for the application of each directive in the software artifacts. we noticed positive perceptions of practitioners about the application of the dcs in software artifacts. we hope that our contribution supports software development teams that use artifacts in their projects.
keywords: communication via software artifacts, human-centered computing, semiotic engineering
1 introduction
artifacts developed in the early stages of the software development process, such as the different diagrams of the unified modeling language (uml) (freire et al., 2018; omg, 2015), assist practitioners in understanding the problem for which software was required. as the proposed solutions for software development are recorded in artifacts, these artifacts also support team communication (petre, 2013). communication is considered an important factor in software development, since miscommunication in software teams causes low productivity and software failures (käfer, 2017). miscommunication via artifact occurs, for example, when consumers (who take the information they see in the models for the development of another artifact) have interpretations different from the ones intended by the producers (who conceive the modeling of the software). as much as consumers know the modeling notation, the way the modeling has been expressed by its producer can affect these practitioners' mutual understanding. in order to mitigate miscommunication via artifact, we proposed the directives of communicability 1 (dcs), presented in lopes et al. (2019a). the dcs can support producers in reflecting on how they can create a software solution via artifacts aimed at mutual understanding among development team members. 1 communicability in this context refers to the artifact's ability to convey to its consumers the solution conceived by its producers. practitioners can use our proposal mainly in the artifacts developed in the initial stages of the development process, such as uml diagrams, mockups and others. we conducted preliminary studies to evaluate our proposal to reduce miscommunication (lopes et al., 2019a; lopes et al., 2019b). however, we noticed that new studies are necessary to evaluate the dcs concerning practitioners' perceptions before transferring them to the industry. given the context above, we conducted an exploratory study (lopes et al., 2020) to evaluate practitioners' perceptions of the dcs. fifteen practitioners participated in this study by modeling uml use cases (omg, 2015) with the support of the dcs. the results demonstrated that the uml use cases developed with the support of the dcs had few risks of miscommunication.
besides, participants' perceptions about the dcs indicate that such directives can support better communication via artifact, contributing to software quality. however, it is also important to evaluate how software engineers apply the dcs in artifacts used in software projects to identify their feasibility. this paper extends our previous work (lopes et al., 2020), presenting a study carried out to analyze communication via artifacts in a software development team. we conducted this study in a software team with fourteen practitioners that worked on a cooperation project between the university of brasilia (unb) and the brazilian army. the results of this study showed the potential of the dcs to indicate improvements in the artifact's content regarding communication via artifacts. in addition, we present proposals that make it easier for practitioners to adopt the dcs, such as procedures that direct the adoption of the dcs in software artifacts and checklists that support the employment of each directive in common scenarios of two specific artifacts. through both studies, we noticed the contribution of the dcs for: (i) fewer risks of miscommunication via artifacts, allowing better communication via artifact; and (ii) improvements in the quality of artifacts, since miscommunication caused incorrect information in software artifacts. we hope that our contribution helps software development teams reduce miscommunication via artifacts.
2 theoretical foundations and related works
this section begins by presenting both the semiotic engineering theory (de souza, 2005; de souza et al., 2016) and grice's cooperative principle (grice, 1975), which we used to understand communication via artifacts and to propose the dcs. additionally, we present works related to this type of communication.
2.1 theoretical foundations
semiotic engineering theory (de souza, 2005; de souza et al., 2016) characterizes user-system interaction as a particular case of human-mediated systems communication. systems are considered metacommunication artifacts in semiotic engineering, i.e., artifacts that communicate a message from the designer to users about how they can or should communicate with the system to do what they want. the content of the metacommunication message, or metamessage, can be paraphrased in the following template: "here is my understanding of who you are, what i've learned you want or need to do, in which preferred ways, and why. this is the system that i have therefore designed for you, and this is the way you can or should use it in order to fulfill a range of purposes that fall within this vision". semiotic engineering uses the communication space model proposed by jakobson (1960), which is structured in terms of context, sender, receiver, message, code, and channel, where: "a sender transmits a message to a receiver through a channel. the message is expressed in code and refers to a context".
based on the communication space model proposed by jakobson (1960), we can structure the communication elements via artifact in terms of the problem domain (context), how the artifact is made available (channel), the informational artifact's content (message) composed of the artifact's notations (code) for the communication between the producer (sender) and consumer (receiver) of an artifact, where: "a producer transmits the informational content of the artifact to a consumer through a channel. the informational content of the artifact is expressed by the artifact's notations and refers to the problem domain". figure 1 presents a characterization of these elements.
figure 1. communication space of jakobson (1960).
semiotic engineering proposed evaluation methods to support designer-user communication in order to understand how the user receives the metamessage. the categorization of communication failures presented by semiotic engineering comprises three categories:
• complete failures – when the intention of the communication and its effect are inconsistent;
• partial failures – when part of the intended effect of the communication is not reached; and
• temporary failures – when, in the intention of a communicative act between user and system, the user has momentary difficulty to continue talking with the system.
semiotic engineering extended its original perspective to a human-centered computing perspective, a research field that aims to understand human behavior by integrating technologies in social and cultural contexts (sebe, 2010). this contribution is related to the set of conceptual and methodological tools called signifyi (signs for your interpretation) (de souza et al., 2016). the signifyi suite helps investigate meanings in software during the development process and the communication between software producers and consumers. among them, the signifyi message tool (sfyi message) is the operational version of the metacommunication template. this operational version proposes that it can stand on its own as a powerful evaluation resource to identify communicability issues (which refer to the quality of the transmission of the solution designed by producers to consumers). de souza et al. (2016) report the use of a principle of reciprocal cooperation related to effective and efficient communication, called grice's cooperative principle (grice, 1975). this principle is expressed by four maxims. breaking one or more of these maxims may lead to a communication failure. grice's four maxims are:
quality – try to make your contribution a true one. do not say what you believe to be false and do not say something without adequate evidence. in software development, for example, the software engineer must communicate to the team only information that is related to the problem domain.
quantity – make your contribution as informative as is required. do not make your contribution more informative than is required. following the previous example, when communicating to the team, the software engineer must try to use only sufficient content to clarify the information they must develop.
relation – be relevant, that is, do not introduce points that do not come under discussion.
in the case of systems developed in different cycles, each cycle must contain only information relevant to such development.
manner – be perspicuous, avoiding obscurity of expression and ambiguity; be brief and be orderly. the software engineer must use descriptions that the team easily interprets, avoiding ambiguity.
2.2 related works
for the communication to be efficient, the sender must carefully choose an expression for the content he wishes to communicate, using a code that the receiver is able to interpret (de souza, 2005; de souza et al., 2016). in this sense, we identified works related to artifacts' comprehensibility, which refers to the receiver's interpretation of what the sender said in his communicative act. on communication via artifact, bordin and de angeli (2016) point out that software engineers stated that documentation keeps a software development team aligned, especially in scenarios of distributed teams or with the introduction of new members in the team. schoonewille et al. (2011) present a contribution related to cognitive aspects in the understanding of software design documentation. they investigated, in one study, the participants' ability to extract information from diagrams and texts (grammatically and syntactically correct). the authors noticed that self-assessment could be problematic. they observed that developers were satisfied to "fill in" information missing from the documentation without the same understanding as the documentation producers. this can cause incorrect interpretations regarding the software. nakamura et al. (2011) proposed three metrics related to the comprehensibility of uml class diagrams in the following aspects: (1) class structure, (2) package structure and (3) attributes and operations. the authors claim that the metrics help in estimating the time cost of understanding a class diagram. cruz-lemus et al. (2010) present a predictive model of comprehensibility for uml state machine diagrams, analyzing their structural complexity. the authors' goal was to reduce the impact of understanding this diagram. tilley (2009) presents a work that summarizes 15 years of research on the use of graphical notation as documentation for understanding the system. according to the author, the graphical notation can help to understand the system and support communication. however, technical 'communicators' are not usually involved in this process. still according to the author, the result is that the engineers, who have the best of intentions, do not have the necessary background to explore the resources of the graphical notation to support end users' tasks. therefore, the author reports a lesson learned: "we need to know how to talk". this highlights the importance of the producer thinking about the consumers. lange and chaudron (2006) present a work that investigated the effects of defects in uml diagrams in relation to different interpretations. they conducted two controlled experiments with a large group of students and practitioners. the two main contributions of this work are investigations on defect detection and on different interpretations caused by undetected defects. the authors state that the results are generalizable for modeling with uml diagrams.
these works deal with topics related to communication with the support of artifacts developed in the early stages of software development. schoonewille et al. (2011) and tilley (2009) show the importance of artifact producers reflecting on consumers. thus, it is important to have a proposal for artifact producers to reflect on the consumers. the contributions of the dcs can help with this, as their goal is to support communication via artifact. this can be achieved when practitioners make improvements to the artifacts to obtain a mutual understanding among team members.
3 directives of communicability
for the dcs proposal, we have appropriated the communication space of jakobson (1960) for communication via artifact, as follows: the artifact is made available with the support of a tool (the channel) with information from the problem domain (context) to support communication between artifact producers (the emitters) and consumers (the receivers). the producer, in his message, must consider how the content is expressed (the use of the code) in such artifacts. figure 2 shows such appropriation.
figure 2. appropriation of the communication space of jakobson (1960) for communication via artifact.
besides, we have appropriated semiotic engineering to define the following concepts related to communication via artifact:
• communicability of software artifacts – refers to the artifact's ability to transmit to its consumers the proposed solutions for software development.
• communicability issues in software artifacts – refer to the expressions or features of the artifact that can be directly associated with an incompatibility between the meanings associated to them by their producers and consumers.
• risks of miscommunication via artifacts – the likelihood of a communicability issue causing communication failures between producers and consumers.
• miscommunication via artifacts – incompatible interpretations by artifact consumers from the producer perspective for software modeling.
3.1 proposal of the dcs
we elaborated the dcs based on semiotic engineering (de souza, 2005; de souza et al., 2016) and grice's cooperative principle (grice, 1975). we adapted the original semiotic engineering metacommunication template as follows: "here is my understanding, as a producer of the model, of who you are, as its consumer (to whom the producer is designing the model), what i have learned about what you need to do in system development (about what should be addressed in the model). this is the solution of the system that i designed for you to carry out your activities". based on this, we created the following questions to help producers reflect on artifact consumers: (i) can the consumer understand the artifacts' content? can the consumer achieve its goals? – to support producers to reflect on whether everyone involved can understand the information in the model, such as developers and managers, or only developers; and (ii) what content should be addressed about the domain of the problem/solution of the system in the artifact? – in order to encourage the producer to reflect on the content that she wishes to be comprehended from the model, such as the tasks that a user can perform on the system. these questions are used before the use of the dcs.
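purely as an illustration of the communication elements characterized in figure 2, the sketch below (our own, with hypothetical names; it is not part of the dcs proposal) records the appropriated elements for one artifact:

```python
from dataclasses import dataclass

@dataclass
class ArtifactCommunication:
    """hypothetical record of the communication elements appropriated
    from jakobson (1960) for communication via artifact (figure 2)."""
    producer: str  # sender: who conceives the modeling
    consumer: str  # receiver: who interprets the artifact
    channel: str   # tool through which the artifact is made available
    context: str   # problem domain the artifact refers to
    code: str      # notation used to express the content (e.g., uml)
    message: str   # informational content of the artifact

# example: a use case diagram shared through a modeling tool repository
use_case_comm = ArtifactCommunication(
    producer="systems analyst",
    consumer="developer",
    channel="modeling tool repository",
    context="private lessons scheduling domain",
    code="uml use case notation",
    message="use case diagram and specifications",
)
print(use_case_comm)
```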
regarding the information related to the models' content, the dcs use the four maxims of grice's cooperative principle. the directives allow producers to reflect on the models' content before they send it to the consumer, so that there is mutual comprehension in software development teams. with this, the dcs can improve the model's ability to convey to its consumers the solution conceived by its producers. below we present each dc, based on grice's maxims:
"say the truth!" dc1: use true information. do not use information that affects the content quality in the model (maxim of quality). in the uml use case diagram, for instance, do not insert use cases that are outside the problem domain;
"say what is needed and no more than necessary" dc2: use the necessary content in the model. do not use unnecessary content in the model (maxim of quantity). analyze, for instance, the amount of information in the specification of all use cases;
"say it logically" dc3: organize the information in the model consistently (maxim of relation). for example, organize the use cases in the diagram so that they present a logical sequence for the producers;
"say it clearly" dc4: organize the information in the model clearly (maxim of manner). describe the names of the use cases so that they are easily understood and differentiated from each other.
3.2 how can software engineers apply the dcs?
we designed the dcs to be employed by software engineers in artifacts that represent aspects of the software developed from their perspectives, such as uml diagrams, bpmn diagrams, and prototypes. in the study presented in subsection 4.2, a team adopted uml use cases and prototypes to represent their software development decisions. the dcs can reduce the risks of miscommunication via these artifacts. figure 3 presents a schematic of how software engineers can apply the dcs to uml use cases.
figure 3. directives of communicability for software artifacts.
in step 1 of figure 3, the producer begins his process of reflection on the consumers of the artifact produced, based on the proposed questions. in step 2 of figure 3, the producer is able to obtain a better use of the directives in modeling, so that mutual understanding occurs. for example, consider a practitioner modeling a use case for a system that supports users in the use of medicines. by using dc2, based on the practitioner's reflection, if the producer knows that the consumer can recognize the difference between the 'reminders' and 'notices' elements that will be used in the system, there is no need to detail the difference between them. if the consumer does not know such a difference, it is important that the producer describes it. dc2 will support this producer in producing use case specifications with the amount of information needed for those elements. regarding the use of the dcs, producers can use them in digital format, available in a technical report (lopes et al., 2021), or print them to put on their workstations. we just emphasize that it would be interesting for the producers to have access to the directives during the development of the artifacts. in section 5, we present proposals that can help software engineers adopt the dcs in software projects.
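to make the four directives concrete, the sketch below shows one possible way to operationalize them as review prompts a producer can walk through before sharing an artifact; the structure and names are our own illustration, not tooling from the original proposal:

```python
# minimal sketch: the four dcs as review prompts; the producer marks
# which directives the artifact already satisfies, and the remaining
# ones point to where further reflection is needed.
DIRECTIVES = {
    "DC1": "say the truth! use only true, in-domain information (quality).",
    "DC2": "say what is needed and no more than necessary (quantity).",
    "DC3": "say it logically: organize information consistently (relation).",
    "DC4": "say it clearly: avoid ambiguity in names and descriptions (manner).",
}

def review_artifact(artifact_name: str, answers: dict) -> list:
    """return the directives the producer marked as not yet satisfied."""
    return [
        f"{artifact_name} - {dc}: {prompt}"
        for dc, prompt in DIRECTIVES.items()
        if not answers.get(dc, False)
    ]

# example: a use case diagram where quantity and manner were not checked
pending = review_artifact(
    "use case diagram (private lessons)",
    {"DC1": True, "DC2": False, "DC3": True, "DC4": False},
)
for item in pending:
    print(item)
```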
about the users of our proposal, we created the dcs to be used by both beginner and experienced software engineers, since they know the modeling notation. we emphasize that the dcs support producers in reflecting on the artifact's content to achieve a mutual understanding among the members of a software development team, not in finding modeling errors.
3.3 preliminary studies with the dcs
in lopes et al. (2019a), two software engineers, with the same level of experience in modeling, produced artifacts. one of them used the dcs and the other did not. then, 30 participants were invited to create mockups based on the artifacts produced by the software engineers. we divided the participants into two groups. the experimental group created the mockups based on the artifacts produced with the dcs and the control group created the mockups based on the artifacts developed without the dcs. we noticed that the experimental group had a lower number of miscommunications. in lopes et al. (2019b), the dcs were also analyzed to reduce the risk of miscommunication in software artifacts, such as uml class diagrams, bpmn (business process modeling and notation) diagrams (omg, 2011) and ifml (interaction flow modeling language) diagrams (brambilla and fraternali, 2014). we chose these diagrams for different communication purposes during software development. twenty-four participants, divided into two groups, produced such diagrams based on a modeling scenario. the experimental group used the dcs and the control group did not use the directives. the experimental group created artifacts with a lower number of risks of miscommunication compared to the control group. in lopes et al. (2019a) and lopes et al. (2019b), we presented studies with quantitative analyses. however, it is important to carry out qualitative studies on the dcs before transferring them to the industry. for this reason, we planned new studies that aim to analyze practitioners' perceptions about the directives. figure 4 presents a timeline of the studies carried out and our planning regarding the new studies, which aim to answer the research questions (rq) below.
rq1 – do practitioners perceive the dcs as support in improving the quality of artifacts?
rq2 – is the dc application by producers feasible in development teams?
figure 4. timeline of preliminary studies with the dcs and planning of new studies in the software industry.
4 experimental studies
this section presents the studies carried out with practitioners before transferring the dcs to the industry. in the first study (study 1), fifteen practitioners participated. they created uml use cases with the support of the dcs. our main goal in this study was to analyze the communication intention by artifact producers. after the study, the participants provided their perceptions about the dcs through a questionnaire. in the second study (study 2), we carried out a study in the context of a software project. we evaluated whether the dcs can provide support to identify risks in software artifacts that caused miscommunication. in addition, producers and consumers provided their perceptions about communication via artifacts through interviews and an online questionnaire.
4.1 study 1: evaluation of the dcs from the practitioners' perception
we conducted a first study that evaluated the perception of 15 practitioners regarding the dcs' support during artifact development (lopes et al., 2020). in this study, the participants applied the dcs in uml use cases, that is, in the use case diagram and specification. after that, we sent questionnaires to collect the practitioners' perceptions. since in this study we evaluated the practitioners' perception of the dcs during the modeling of use cases, we did not investigate the communication between producers and consumers. therefore, the researchers analyzed only the possibility of a risk of miscommunication in the use cases. in addition, we analyzed the impact on quality caused by the risks of miscommunication and the qualitative data obtained from the practitioners' answers.
4.1.1 study 1: planning
we selected 15 practitioners to produce uml use cases with the support of the dcs. all practitioners had a college degree and they were taking the fundamentals of software engineering class in a software engineering postgraduate course at the northern university center (uninorte). table 1 presents a summary of the participants' experience. regarding the participants, most of them did not work creating artifacts in software projects related to our research. however, we consider practitioners who are consumers in software projects able to participate in the study, because they can provide their perception of our proposal for communication via artifacts. accordingly, we planned training so that participants could execute the study activities. we planned this study to take place in a single day, during the morning and afternoon. in the morning, before we carried out the study, the participants received training of approximately two hours for exercising use case modeling. it is noteworthy that all participants had prior knowledge of uml use cases. in the afternoon, we reserved a laboratory for the execution of this study, which had notebooks for the participants to use. we planned to run this study in approximately three hours.
table 1: participants' experience in the software industry
1–3 years: p1 (developer), p2 (software tester/developer), p3 (software analyst), p4 (developer), p6 (process engineer), p7 (developer), p10 (developer), p12 (developer), p13 (developer)
4–8 years: p8 (software tester/developer), p9 (developer), p15 (developer)
more than 9 years: p5 (developer), p11 (developer), p14 (project manager)
in order to observe the participants' discussion regarding the development of use cases in different modeling scenarios, we randomly defined four groups. each modeling scenario had simple content, so that the participants could complete the study activities in the planned time. we present the description of the modeling scenarios for each group below:
group 1 scenario – to support students who want private lessons in basic classes such as mathematics, a system must be developed. the system should provide teachers for private lessons. additionally, evaluations of these teachers by students/other teachers should be displayed. the system should allow managing the teachers' agendas for the classes, so that students can enroll in them. thus, it is possible to include and cancel classes.
group 2 scenario – to support small events, a system must be developed.
in this system, the organizers will be able to create their accounts and, from this, register events such as birthday parties, guest lists, and gift lists. they will also be able to send invitations via e-mail, control expenses, and generate reports for both guests and expenses. the system provides communication among organizers and guests. guests may or may not confirm their presence at the event and consult the gift list.
group 3 scenario – to support sales professionals in their orders, such as delivery control and customer management (retailers and wholesalers), a system must be developed. the system will support professionals who want to computerize and innovate the service, minimizing errors and constraints from the lack of systematic control. the system should allow users to register their customers and to manage the stock of their products. after the payment record, the order is sent to the customer with the delivery invoice.
group 4 scenario – to support residents of the state of amazonas in brazil, who have difficulty accessing information on river routes for purchasing tickets, a system must be developed. the system will support passengers of different vessels, embarking/disembarking times, the vessels' capacity, number of available spaces, price, and information on river routes. concerning vessels' owners, they will be able to register the number of employees available for passenger assistance.
before the execution of the study, we planned training on the dcs to be applied in use case modeling. figure 5 summarizes the planned study activities.
figure 5. study 1 activities planned.
to analyze the defects related to the risks of miscommunication, we used the types of defects presented by granda et al. (2015). table 2 shows these defects.
table 2. types of defects (adapted from granda et al. (2015))
omission – the required information has been omitted.
incorrect fact – some information in the model contradicts the list of requirements or general knowledge of the system domain.
inconsistency – information in one part of the model is inconsistent with information in other parts of the model.
ambiguity – the information in the model is ambiguous. this can lead to different interpretations of information.
extraneous information – the information that is provided is not required in the model.
redundant – information is repeated in the model.
4.1.2 study 1: execution
we asked the participants to position themselves according to the groups defined to carry out the study activity. the participants were in the same laboratory, but the groups were far from each other. after that, we delivered to the groups the modeling scenarios and the printed dcs. the main researcher asked the participants to draw up the use case diagram together, discussing relevant aspects of the system. after that, the researcher requested each participant to specify only one use case. the participants created the use cases from the modeling scenarios. regarding the use of the dcs, the participants should, for example, create use cases in the context of the problem domain (use of dc1) and analyze the amount of information needed to understand these use cases correctly (use of dc2). during the study, a researcher took notes of the directives most used by the participants. the participants used the astah tool (https://astah.net/) to model use cases.
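as an aid for readers who want to apply the taxonomy of table 2 in their own analyses, the sketch below (our illustration; the study itself classified risks manually) encodes the defect types and tags each identified risk with one of them:

```python
from enum import Enum
from dataclasses import dataclass

class DefectType(Enum):
    """types of defects adapted from granda et al. (2015)."""
    OMISSION = "required information has been omitted"
    INCORRECT_FACT = "information contradicts requirements or domain knowledge"
    INCONSISTENCY = "information conflicts with other parts of the model"
    AMBIGUITY = "information allows different interpretations"
    EXTRANEOUS = "information provided is not required in the model"
    REDUNDANT = "information is repeated in the model"

@dataclass
class MiscommunicationRisk:
    """a risk of miscommunication observed in an artifact."""
    description: str
    defect: DefectType

# example records, mirroring the kinds of risks reported in study 1
risks = [
    MiscommunicationRisk("lack of information in business rules",
                         DefectType.OMISSION),
    MiscommunicationRisk("different standards in the use case specification",
                         DefectType.AMBIGUITY),
]
for r in risks:
    print(f"{r.description} -> {r.defect.name.lower()}")
```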
table 3 presents the four groups defined with the participants.
table 3. groups defined in this study
group 1: p4, p5, p6 and p7; group 2: p8, p9, p10 and p11; group 3: p12, p13, p14 and p15; group 4: p1, p2 and p3.
regarding the use of the dcs, the main researcher informed the participants that the directives could be applied in whatever way was most appropriate for them, such as using the directives during the modeling or after making a modeling proposal. the main researcher noticed that all groups made a modeling proposal and then applied the dcs. at the end of the study, all participants answered a post-study questionnaire to provide their perceptions about the dcs, including each participant's experience in the industry. regarding the duration of the study, it was completed ahead of our planning.
4.1.3 study 1: results
we analyzed the use cases produced by the groups regarding the risks of miscommunication, which were discussed with the other authors of this paper. the risks of miscommunication identified in the use cases of each group are shown in table 4, including their total number of occurrences and the description of each risk.
table 4: risks of miscommunication in the use cases developed by the groups (occurrences in parentheses)
group 1 – lack of relationship in the use case diagram (1); different standards in the organization of the use case specification (3); lack of information in business rules (5).
group 2 – use case specification inconsistent with the use case diagram (1); lack of relationship in the use case diagram (1); lack of information in business rules (4).
group 3 – lack of information in business rules (2); lack of steps in the main flow of the use case specification (2).
group 4 – lack of steps in the main flow of the use case specification (2); lack of information in business rules (5).
in this analysis, for example, we noticed that the participants in group 1 did not provide all the necessary information in the business rules, such as the fields in the system for a student to evaluate the teachers. the evaluation of the artifacts produced by the groups showed few risks of miscommunication compared to the number of risks of miscommunication identified in other software artifacts in a preliminary study (lopes et al., 2019b). however, such risks can cause possible miscommunication. regarding the application of the dcs by the participants, based on the researcher's notes during the study, the majority of them used the following directives: dc2 to evaluate the amount of information that should be represented, and dc3 for the logical organization of information in the use cases.
analysis of software defects related to risks of communication failure. we grouped the risks of miscommunication in table 5, which are related to the groups. the defects related to the risks are also described in this table. regarding the risks of miscommunication, we noticed a lack of information in the business rules in the four modeling groups. besides, there was a lack of information on the relationship between use cases in the diagrams produced by two groups. there was a lack of specification of steps in the main flow of the use cases of two groups. these risks would be mitigated if the participants had reflected better on the amount of information, related to dc2.
regarding the risks related to the lack of standardization of the use case specification itself and the inconsistency between the use case specifications and the use case diagram, these would be mitigated with dc4 and dc1, respectively.
table 5. defects related to the risks of miscommunication in the use cases developed by the groups
groups 1 and 2 – lack of relationship in the use case diagram: omission.
group 1 – different standards in the organization of the use case specification: ambiguity.
groups 1, 2, 3 and 4 – lack of information in business rules: omission.
group 2 – use case specification inconsistent with the use case diagram: inconsistency.
groups 3 and 4 – lack of steps in the main flow of the use case specification: omission.
regarding the risks related to the lack of information in: (i) the business rules, (ii) main flow steps in the use case specification, and (iii) relationships between use cases in the diagram, we considered them to be an 'omission' defect. different standards in the specification of use cases may allow different interpretations by consumers, which we considered an 'ambiguity' defect. finally, we considered inconsistent information between the use case diagram and the use case specification to be an 'inconsistency' defect.
analysis of the participants' perception. regarding the post-study questionnaire, the participants answered the following question: "what is your perception of the directives of communicability?" we defined this question in a general way to collect different opinions of the participants on the dcs. according to strauss and corbin (1998), researchers can use coding procedures to analyze qualitative data and achieve their research objectives. we used open coding to understand participants' perceptions.
with that, we observed the following codes:
dcs contribute to the quality of software artifacts – "the directives help to reflect on what should be developed, avoiding inconsistencies" (p5); "the directives help to understand the system, support to identify possible errors" (p8); "facilitates the identification of problems in modeling" (p13).
dcs promote the organization of information in artifacts – "the directives helps to organize and improve the information required to create a system" (p3); "dcs assist in organizing ideas together with the development team" (p10); "the directives help to organize thoughts when designing the system" (p2).
dcs support the understanding of the system – "dcs assist to obtain relevant information for the project" (p4); "the directives provide great support for the production of the software" (p7); "helps to considerably improve the general understanding of a system" (p15).
dcs can promote effective communication via artifact – "dcs are a type of roadmap for organizing ideas in communication through a logical way" (p11); "they help to think about how to communicate with colleagues" (p6); "help in communicating correctly in software development" (p12).
dcs promote the reduction of different interpretations – "the directives help reduce the multiple interpretations of the same idea, as the ideas must be conveyed so that everyone understands" (p14).
difficulties with the use of the dcs – "it is not easy to understand the directives; it required more of my mental effort" (p2); "it is not easy to apply the directives; i believe it depends on the user's experience" (p5); "directives demand time for understanding" (p6).
through the participants' perceptions, we observed that the dcs contribute to the improvement of the quality of the artifacts. such perceptions are represented by the codes 'dcs contribute to the quality of software artifacts' and 'dcs promote the organization of information in artifacts'. most of the participants' responses showed that they perceived the purpose of the dcs, as we noticed the codes 'dcs can promote effective communication via artifact' and 'dcs promote the reduction of different interpretations'. some participants also reported 'difficulties with the use of the dcs', which may be related to their reflection on whether or not they are correctly applying the main concept of each directive for what each producer wants to communicate. however, this is part of the reflection process by producers regarding their communication through artifacts.
analysis of acceptance. we applied the technology acceptance model (tam) (venkatesh and bala, 2008) to analyze the participants' perception of the dcs in the post-study questionnaire. tam is one of the most adopted models for collecting information about the decision to accept or reject technologies (marangunić et al., 2013). this model is based on two constructs:
perceived ease of use – the degree to which a user believes that a specific technology can be used with little effort.
perceived usefulness – the degree to which a user believes that using a specific technology would improve their performance at work.
the user's behavioral intention to use a technology, the intention to use, is determined by the perceived ease of use and the perceived usefulness.
the statements contained in the post-study questionnaire to assess the constructs of ease of use, usefulness, and intention to use the dcs, adapted from venkatesh and bala (2008), are presented below:
perceived ease of use – e1. my interaction with the directives of communicability is clear and understandable. e2. interacting with the directives of communicability does not require a lot of my mental effort. e3. i find the directives of communicability easy to use. e4. i find it easy to get the directives of communicability to do what i want them to do.
perceived usefulness – u1. using the directives of communicability improves my performance in understanding aspects of the software. u2. using the directives of communicability in my job has improved my productivity, since i will not have to correct information that is not understood by colleagues. u3. using the directives of communicability enhances my effectiveness in communication with the team based on the artifacts. u4. i consider the directives of communicability useful for software design.
intention to use – i1. assuming i had enough time to design software, i intend to use the directives of communicability. i2. considering that i could choose any tool, i predict that i would use the directives of communicability. i3. i plan to use the directives of communicability in my next project.
regarding the adapted tam statements, participants provided their answers on a seven-point likert scale (likert, 1932). the possible answers were "totally agree, strongly agree, partially agree, neutral, partially disagree, strongly disagree, and totally disagree". the participants answered their degree of agreement on the usefulness, ease of use, and intention to use the dcs in the production of artifacts. figure 6 summarizes the participants' answers.
figure 6. degree of participants' acceptance regarding the use of the dcs in the production of artifacts.
regarding the disagreements related to e2, e3 and e4 for ease of use, as shown in figure 6, we noticed that five participants answered this way, including p2, p5 and p6, who cited that it is not easy to employ the dcs, represented by the 'difficulties with the use of the dcs' code presented above. the other participants did not provide answers to explain why they disagreed with e2. in summary, such answers may indicate that it is important to provide material that helps in the producer's reflection based on the dcs. about the disagreement and neutral answers to the statements that measure usefulness, we noticed that p3, p6, and p11 provided such answers. however, all participants who consume information from artifacts, i.e. developers, agreed that our proposal is useful for communication via artifact. overall, most of the participants' answers showed agreement regarding ease of use, usefulness, and intention to use the dcs. with this research (lopes et al., 2020), we observed that the dcs promoted the participants' reflection on their communication to the others involved in the development of software. the dcs also made it possible to reduce the introduction of defects, because we perceived a consistent mapping between the risks of miscommunication and software defects. additionally, most of the participants' answers about the dcs were positive regarding their use. with this, it is possible to infer that practitioners from the software industry consider the directives useful.
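as an illustration of how seven-point likert answers such as these can be tallied per statement (the sketch uses invented responses; the study's actual distribution is summarized in figure 6):

```python
from collections import Counter

# hypothetical answers per tam statement; real data are in figure 6
responses = {
    "E2": ["partially agree", "neutral", "partially disagree",
           "strongly agree", "totally agree"],
    "U4": ["totally agree", "strongly agree", "strongly agree",
           "partially agree", "neutral"],
}

def summarize(statement: str) -> None:
    """print agreement/neutral/disagreement counts for one statement."""
    counts = Counter(responses[statement])
    agree = sum(v for k, v in counts.items()
                if "agree" in k and "dis" not in k)
    disagree = sum(v for k, v in counts.items() if "disagree" in k)
    neutral = counts.get("neutral", 0)
    print(f"{statement}: agree={agree}, neutral={neutral}, disagree={disagree}")

for statement in responses:
    summarize(statement)
```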
based on the results obtained in this study, we decided to carry out a feasibility study in a software development team. this study may increase the indications on the transfer of the dcs to the industry.
4.1.4 threats to validity of study 1
in all experimental studies, there are threats that can affect the validity of the results. the threats related to this study are discussed below following the classification of threats to validity presented by wohlin et al. (2012):
internal validity. training effect – it would be interesting if there were no need for training. however, the short training time allowed the dcs to be used by practitioners during the production of uml use cases. in addition, training on use case modeling also enabled participants to execute the study activities, as most of them did not work creating artifacts that represent software decisions in projects. time used for the study – despite the time reserved for use case modeling being considered long, all participants completed the study activities before the expected time.
external validity. validity of the artifacts – we carried out only modeling of uml use cases in this study. it is not possible to claim that uml use cases represent all the artifacts that support communication. besides, the use cases were modeled for four software projects. it is not possible to claim that these artifacts represent all types of software.
construct validity. indicators for miscommunication – the measures adopted to analyze miscommunication were based on the semiotic engineering theory (de souza, 2005; de souza et al., 2016), which has different methods to assess communication during the development process.
conclusion validity. there is a limitation in the representativeness of the results, a known problem in experimental studies of software engineering (fernandez et al., 2012). the results obtained in this study may not be reproduced in other software artifacts that support the understanding of members of a team. analysis of artifacts – about the risks of miscommunication in use cases, there is a threat regarding the researcher who carried out such analysis. to mitigate this threat, we added another researcher to discuss this analysis.
4.2 study 2: feasibility study
in study 1, although we perceived positive answers about the dcs, we did not carry out the study in the context of a software project. we therefore carried out another study, our second study, in a software development team. we investigated the use of the dcs in the artifacts used by the team to identify risks that caused miscommunication in the development of the bulletin system (sisbol). sisbol is a web system, with client-server architecture, following the representational state transfer (rest) standard, with the purpose of automating the process of generating official bulletins and managing the members' personal history (military personnel changes) of the brazilian army (eb). a bulletin represents an instrument by which the commander, chief or director of the eb disseminates the orders of the higher authorities and the facts that must be known by the military organizations in which the members participate.
sisbol is composed of entities associated with the military, such as qualification, graduation, subunit/division/section, military organization, function and alteration, entities associated with the bulletin structure (type of bulletin, section, part, general subject, specific subject, note) and entities associated with system users. notes are documents proposed by a competent authority to be approved by the commander, chief or director, for publication in its bulletin. the system has a certain degree of configurability, allowing the approval processing workflows for notes and bulletins to be customized for each military organization.
4.2.1 study 2: planning
we initially designed the study to analyze how the team conducted its activities and how software artifacts support communication. then, we planned the analysis of the artifacts with the support of the dcs to identify opportunities for improvement towards better communication. finally, we planned a collection of the team members' perception of the support of the artifacts. the team selected for the study was composed of 14 practitioners who developed the sisbol. table 6 shows the characterization of the team.
table 6. characterization of the team (role – years of experience)
systems analyst (product owner – po) – 20; designer – 9; developer 1 – 7; developer 2 – 20; developer 3 – 5; developer 4 – 16; developer 5 – 4; developer 6 – 4; developer 7 – 19; developer 8 – 12; developer 9 – 12; developer 10 – 3; developer 11 – 10; developer 12 – 3.
the scope of the new sisbol involves 30 functionalities, which were divided into legacy features (23) and new features (7). the development team used the agile scrum methodology. the artifacts elaboration process was collaborative and involved different project stakeholders. the team used uml use cases and prototypes as artifacts that contain the solution designed for software development. regarding the team selected, the practitioners did not create a domain model, just the use cases and mockups. about the experience of the producers, the systems analyst had twenty years of experience with uml and the designer had nine years of experience with prototypes in projects. regarding the system developed by the team, it was already in its final phase, as the team was only making corrections to some features.
4.2.2 study 2: execution
we carried out the following steps in this study:
(i) a meeting between the po and the main researcher in order to obtain an overview of the activities and the artifacts used as a means of communication;
(ii) a meeting of the main researcher with the team's producers to analyze the artifacts' content based on the dcs;
(iii) after that, we prepared an electronic questionnaire for producers to answer with their perceptions of the artifacts as a support for communication;
(iv) a meeting of the main researcher with an artifacts consumer to understand how they used the artifacts;
(v) we also sent an electronic questionnaire for consumers to answer with their perceptions of the artifacts, with questions based on the dcs.
due to the unavailability of some participants for individual meetings, this questionnaire facilitated the collection of team members' perceptions. in relation to step 2, the main researcher should be present to support the producers and to collect their perception about the artifacts based on the dcs. the material used in these steps is available in a technical report (lopes et al., 2021). regarding step 3, the participants answered the following questions on the electronic questionnaire: 1. what is your perception about this artifact as a means of communication? 2. tell us about your perception regarding the communication via artifact. about step 5, we used the following questions to collect the consumers' perceptions about the software artifacts: 1. during the software development, did you notice any information inconsistent with the team's knowledge about the software? – based on dc1; 2. about the quantity of information, is there a lack of information or excessive information? – based on dc2; 3. is all information in the artifacts relevant to software development? please, tell us your perception – based on dc3; 4. was it difficult to understand any information in the artifacts? – based on dc4. with the execution of these steps, the dcs can help practitioners to understand aspects that need improvements in the informational content of the artifacts. these improvements can lead to better communication via artifacts.
4.2.3 study 2: results
firstly, the producers analyzed the artifacts with the support of the dcs and we analyzed the types of defects related to the risks identified. after this, we analyzed the participants' answers.
analysis of software defects related to risks of communication failure. the main risks of miscommunication are in the use cases. such risks are related to the lack of updating of some information, identified with the support of dc1, and the excess of information, identified with dc2. figure 7 presents a characterization of the identified risks.
figure 7. analysis of artifacts based on the dcs.
regarding defects related to the risks of miscommunication, we have identified:
• lack of updating of some information in the use cases – inconsistency defect: the lack of updating led to inconsistent information in the artifact; and extraneous information defect: information not needed in the artifact.
• excess of information – ambiguity defect: the excess of information promotes different interpretations.
analysis of communication via team artifacts. we used open coding (strauss and corbin, 1998) to understand the team's communication via artifacts and how the dcs can support the improvement of this type of communication. we applied the coding based on the answers of producers and consumers about the artifacts' content. when analyzing the team's communication through the artifacts, we identified characteristics in the informational content that affected the communication. we noted that consumers had adopted the mockups more than the use cases to support their activities. the team's po, one of the producers of the artifacts, and the consumers mentioned: "perhaps i put more information in the mockups than necessary and it led the team to not consult the use cases" (systems analyst). "there was an excess of information in the documentation.
so many details generated several differences in the documentation for implementation and other minimal details that did not affect the system's functionality itself... with the use of the mockups, it was easier to understand the user's needs, and so the doubts that i had about the functioning of the system were resolved" (developer 11). "with the mockups, half of the system's functionalities were well defined, with only the business rules missing, which could not be modeled visually" (developer 12). the dcs indicated that consumers adopted the mockups more than the use cases as support in their activities due to the excess of information in the use cases (with the support of dc2), also cited by developer 11. additionally, there was an outdated use case (identified with the support of dc1), generating a negative impact on communication via artifact, as observed by one of the consumers: "throughout the development, i believe that the artifacts have become outdated in relation to the needs of users and the implementation of the system" (developer 4). regarding the communication via this team's artifacts, one of the producers reflected on their communication based on the dcs and believes there was a lack of another artifact to support the understanding of the user's interaction with the system (dc2): "the artifacts contain the necessary information that the team needs to understand the problem. however, there are some limitations and information that cannot be transmitted in the artifacts. for example, the 'disposable mockup' presents only an idea of what the interface with the possible fields of the system will look like, but it does not present how it will be done, or even the user's interaction with the system" (designer). with the results of this study, we noticed miscommunication via artifact identified with the support of the dcs. the dcs were able to support the producers in making improvements in the artifacts, enabling better communication via artifact.
4.2.4 threats to validity of study 2
the threats related to this study are discussed below following the classification of threats to validity presented by wohlin et al. (2012):
internal validity. the main threat to internal validity was the sharing of developers' perceptions of the artifacts. to mitigate this threat, we sent an electronic questionnaire to each participant to answer with their perception individually. however, this does not eliminate the possibility of communication between the participants.
external validity. regarding the artifacts evaluated in this study, it is not possible to state that they represent all the artifacts that support communication. additionally, these artifacts were modeled for just one software project.
construct validity. we identified the threat of participants providing answers that do not reflect reality but rather personal expectations regarding the artifacts. to mitigate this threat, we informed the participants that the study did not provide any kind of personal or project assessment, but rather an assessment of the use of artifacts in support of communication.
conclusion validity. there is a limitation in the representativeness of the results, this being a known problem in experimental studies of software engineering (fernandez et al., 2012).
the results obtained in this study may not be reproducible with other software artifacts that support the understanding of those involved in the production of systems.

4.3 lessons learned

these studies helped us to understand different aspects of the dcs from the practitioners' perception. regarding study 1, we describe our lessons learned below.

• disagreements about the ease of use of the dcs show the need to create material that supports the application of each directive – although most participants agreed that the dcs are easy to use, we noticed some disagreement about this. the dcs are general instructions that support the producers' reflection on their communication via artifacts, and there are no specific steps for that. however, to support producers in employing the directives, material that indicates some reflection points would be useful. such material can be created based on common scenarios noticed in both studies presented in our paper.
• the usefulness perceived by practitioners who act as consumers indicates that our proposal can support communication via artifacts – the usefulness perceived by practitioners who work as developers indicated that our proposal supports mutual understanding between producers and consumers, since such participants may have experienced such a scenario.

regarding study 2, we noticed that the dcs supported practitioners in the evaluation of artifacts already used by a software team. we drew the following lessons from applying our proposal in this study:

• consumers' perceptions during the evaluation of artifacts improve this type of communication – regarding the use of the dcs in the evaluation of artifacts already adopted by software teams, both producers and consumers can perform it, providing contrasting views of the communication via artifacts within a team. such practice supports continuous improvement in this type of communication.
• material that supports producers in adopting the dcs in software projects is needed – we designed the initial proposal of the dcs for application during the production of artifacts, but we noticed the potential of the directives for evaluating artifacts already used by teams. to help software engineers adopt the dcs in their projects, it would be useful to develop procedures that indicate the main steps to apply the dcs.

the next section presents the proposal of the materials prepared to support software engineers in adopting the dcs in their projects. we created such proposals based on our lessons learned.

5 proposal to support the application of the dcs in software projects

each directive aims to provide a general indication of how artifacts can be expressed by their producers regarding their communication, so that the risks of miscommunication are mitigated. regarding the adoption of the dcs to support improving the communicability of a software artifact, we observed two contexts in which artifacts are used: (1) when they are already being used by a team during the execution of a project, and (2) before being used by the team, when the project is in its initial stages. for these contexts, we created procedures to help practitioners who wish to adopt the dcs in their projects.
figure 8 presents a procedure to be followed by practitioners who wish to adopt the dcs to identify opportunities for improvement in the artifacts. this procedure is suggested for teams that started creating artifacts without the support of our proposal but would like to adopt it for those artifacts, as noted in the second study presented in this paper. its steps are:

1. communication intent – practitioners should reflect on their communication intent based on the questions: "can the consumer understand the artifacts' content? can the consumer achieve their goals?", considering consumers such as developers and testers, and "what content should be addressed about the domain of the problem/solution of the system in the artifact?", such as the tasks that a user can perform in the system.

2. use of the dcs in the artifacts' content – use of the dcs to identify risks that caused miscommunication. to facilitate the use of the dcs, we prepared a checklist, presented later in the text. at this stage, producers and consumers can carry out the evaluation for a better understanding of the necessary improvements.

3. availability of artifacts – with the improvements made, producers make the artifacts available to consumers through a suitable channel, such as e-mail or the repository used by the team, as this also affects communication via artifacts.

figure 8. use of the dcs during the execution of projects

for the second context, figure 9 shows the procedure that practitioners can adopt when using the dcs before the production of the artifacts. each step to be followed in the procedure is described below. with the dcs applied to the artifacts before their consumption, the risks that cause miscommunication can be reduced.

figure 9. adoption of the dcs in project planning

1. modeling notation – it is important for producers to reflect on the notation that will be adopted when modeling the artifacts to represent aspects of the software. additionally, it is important for producers to reflect on whether such notation is known to consumers. this step was not considered in the first context because the team already has the artifacts established to represent the solutions modeled for the software.

2. communication intent – similarly to the first context, practitioners must reflect on their communication intent based on the questions proposed for use with the dcs.

3. use of the dcs in the artifacts' content – use of the dcs to reflect on the producers' communication intent. the checklist also supports this reflection.

4. availability of artifacts – producers should reflect on the best means of communication through which artifacts should be made available to consumers, such as e-mail or a repository, as it can affect communication via artifacts.

in addition to the procedures, we also developed checklists that can facilitate the application of the dcs to the artifacts investigated in our research, namely uml use cases and mockups. table 7 presents the checklist for mockups, and table 8 presents the checklist for uml use cases.
table 7. checklist based on dcs for mockups

dc1 | is there information in the mockups that is outside the problem domain? if so, remove that information
dc1 | is there outdated information in the mockups? if so, update it
dc2 | are all requirements represented in the mockups? if not, design mockups with such information
dc2 | are all alternative paths represented in the mockups? if not, enter this information in the mockups
dc2 | in general, is the amount of information in the mockups sufficient for the team to understand the system? if not, enter the required amount of information
dc2 | is there an excess of information? if this excess is unnecessary for understanding the system, remove it from the mockups
dc3 | is the order of the screens organized in such a way that the team better understands them? if not, arrange this sequence
dc4 | are the screen names clear in relation to their purpose? if not, clarify the names of the screens
dc4 | in the mockups, are there any terms that are unknown to consumers? if so, clarify such terms
dc4 | in the mockups, is there any ambiguous information? if so, clarify this information
dc4 | is any information left to the implicit interpretation of the team? if so, reflect on whether such information should be expressed explicitly to avoid multiple interpretations

table 8. checklist based on dcs for use cases

dc1 | is there information in the use cases that is outside the problem domain? if so, remove that information
dc1 | is there outdated information in the use cases? if so, update this information
dc2 | are all relationships between use cases represented in the diagram? if not, enter such relationships
dc2 | are all use cases represented in the diagram? if not, insert such use cases
dc2 | in the use case specifications, are all the actors involved represented? if not, insert such actors
dc2 | when specifying a use case, are all flows represented? if not, enter the necessary flows
dc2 | when specifying a use case, are all business rules represented? if not, insert the necessary rules
dc2 | is there an excess of information? if this excess is unnecessary for understanding the system, remove it from the use cases
dc3 | are the use cases organized logically in the diagram? if not, organize the use cases
dc3 | are the actors organized with respect to the use cases in the diagram? if not, organize the actors in the diagram
dc3 | is the sequence of information in each use case specification logically organized? if not, organize this information
dc4 | are the names of the use cases clear with respect to their purpose? if not, clarify the names of the use cases
dc4 | are the names of the actors clear with respect to their purpose? if not, clarify the actors
dc4 | in the use case specifications, are there any terms that are unknown to consumers? if so, clarify such terms
dc4 | when specifying a use case, is there any ambiguous information? if so, clarify this information

these checklists contain questions based on common artifact scenarios that carry risks of miscommunication. however, we emphasize that the dcs help practitioners reflect on the artifacts, while the checklists support the identification of specific risks. therefore, the checklists should be used together with the dcs.

6 discussion

we carried out studies with the objective of transferring the dcs to the software industry. in the first study, conducted to answer rq1 (do practitioners perceive the dcs as support in improving the quality of artifacts?), we noticed that the directives supported the participants' reflection on communication via uml use cases. this allowed reducing possible inconsistencies in the development of the explored artifact. it was possible to obtain evidence that the dcs can contribute to improving the artifacts' quality, since the dcs supported reducing incorrect information.
this can reduce costs during software development, as defects discovered later in the software development process increase costs due to their correction. the second study aimed to understand the feasibility of the dcs to support improvements in the communicability of software artifacts used by a team, answering rq2 (is the dc application by producers feasible in development teams?). the use of the dcs revealed the main aspects that needed improvement, since they negatively affected the communication between producers and consumers of these artifacts. the results of this study demonstrated the benefit of using the dcs, as the problems identified in the informational content of the artifacts can be fixed. both studies showed evidence supporting the transfer of the dcs to the industry. in addition, such studies helped us obtain insights for the development of proposals that facilitate the adoption of the dcs in software projects.

7 final considerations

this paper presented research carried out with the aim of transferring the dcs to the software industry. we explored, in a first study, the practitioners' perception of the dcs as support for their reflection as producers and, in a second study, a specific software development team's view of the risks in software artifacts that caused miscommunication, identified with the support of the dcs. in the first study, the results showed that the dcs supported participants in reflecting on the system, reducing possible inconsistencies in the development of the explored artifact, a uml use case. the dcs also promoted the participants' reflection on their communication with the others involved in the software development. the reduction of miscommunication also reduces the introduction of defects, as a consistent mapping between risks of miscommunication and software defects was perceived. in the second study, based on the risks identified in the artifacts used by the software development team, producers made improvements to the artifacts. with that, software development teams will be able to adopt the dcs in their projects to improve communication via artifacts. additionally, most of the participants' perceptions of the dcs were positive. from the studies' results, we noticed the need to define supporting materials so that practitioners can use our proposal in their projects. we presented in this paper two procedures that facilitate the use of the dcs in software projects. besides, for the better use of each directive, we proposed checklists. we believe that practitioners interested in adopting our proposal can use them. regarding the use of the dcs in these studies, it is possible to infer that they were considered feasible for the software industry. new studies in the context of software projects can provide more evidence on the application of the dcs to support producers in their communication, aiming to reduce the risks of miscommunication. as future work, we intend to carry out an observational study with different software development teams using our proposal. in this future study, the teams will use the artifacts, process, and checklists proposed in this paper, including the evaluation of artifacts developed in the early stages of the software development process, which was not explored in our studies.
in addition, we intend to investigate software engineers' perceptions about including the dcs as part of a company's culture related to the creation of artifacts used as means of communication.

acknowledgements

we are grateful for the financial support from capes (financing code 001), cnpq (311494/2017-0 and 204081/20181/pde), and fapeam (062.00150/2020).

references

brambilla, m., & fraternali, p. (2014). interaction flow modeling language: model-driven ui engineering of web and mobile apps with ifml. morgan kaufmann.
bordin, s., & de angeli, a. (2016). focal points for a more user-centered agile development. international conference on agile software development, 3-15.
corbin, j., & strauss, a. (2014). basics of qualitative research: techniques and procedures for developing grounded theory. sage publications.
de souza, c. s. (2005). the semiotic engineering of human-computer interaction. mit press.
de souza, c. s., cerqueira, r. d. g., afonso, l. m., brandão, r. d. m., & ferreira, j. s. j. (2016). software developers as users. cham: springer international publishing.
freire, e. s. s., oliveira, g. c., & de sousa gomes, m. e. (2018). analysis of open-source case tools for supporting software modeling process with uml. in proceedings of the 17th brazilian symposium on software quality, 51-60.
granda, m. f., condori-fernández, n., vos, t. e., & pastor, o. (2015). what do we know about the defect types detected in conceptual models? in 2015 ieee 9th international conference on research challenges in information science (rcis), 88-99.
grice, h. p. (1975). logic and conversation. in speech acts, brill, 41-58.
jakobson, r. (1960). linguistics and poetics. in style in language, ma: mit press, 350-377.
käfer, v. (2017). summarizing software engineering communication artifacts from different sources. in proceedings of the 2017 11th joint meeting on foundations of software engineering, 1038-1041.
likert, r. (1932). a technique for the measurement of attitudes. archives of psychology, 144(55), 7-10.
lopes, a., oliveira, e., conte, t., & de souza, c. s. (2019a). directives of communicability: towards better communication through software models. in 2019 ieee/acm 12th international workshop on cooperative and human aspects of software engineering (chase), 45-48.
lopes, a., conte, t., & de souza, c. s. (2019b). reducing the risks of communication failures through software models. in proceedings of the 18th brazilian symposium on human factors in computing systems, 1-10.
lopes, a., conte, t., & de souza, c. s. (2020). exploring the directives of communicability for improving the quality of software artifacts. in proceedings of the xix brazilian symposium on software quality (sbqs'20), 10 pages.
lopes, a., conte, t., & de souza, c. s. (2021). directives of communicability: towards software development teams. uses research group technical report, tr-uses2021-01. https://doi.org/10.6084/m9.figshare.15057984.v2
marangunić, n., & granić, a. (2015). technology acceptance model: a literature review from 1986 to 2013. universal access in the information society, 14(1), 81-95.
omg. (2011). business process model and notation (bpmn) version 2.0. object management group, 1(4).
omg. (2015). unified modeling language (uml) version 2.5.
petre, m. (2013). uml in practice. in proceedings of the 2013 international conference on software engineering (icse 2013), 722-731.
khare, r., & taylor, r. n. (2004). extending the representational state transfer (rest) architectural style for decentralized systems. in proceedings of the 26th international conference on software engineering, 428-437.
sebe, n. (2010). human-centered computing. in handbook of ambient intelligence and smart environments, springer, boston, ma, 349-370.
schoonewille, h. h., heijstek, w., chaudron, m. r., & kühne, t. (2011). a cognitive perspective on developer comprehension of software design documentation. in proceedings of the 29th acm international conference on design of communication, 211-218.
tilley, s. (2009). documenting software systems with views vi: lessons learned from 15 years of research & practice. in proceedings of the 27th acm international conference on design of communication, 239-244.
venkatesh, v., & bala, h. (2008). technology acceptance model 3 and a research agenda on interventions. decision sciences, 39(2), 273-315.
wohlin, c., runeson, p., höst, m., ohlsson, m. c., regnell, b., & wesslén, a. (2012). experimentation in software engineering. springer science & business media.

journal of software engineering research and development, 2021, 9:9, doi: 10.5753/jserd.2021.1802  this work is licensed under a creative commons attribution 4.0 international license.

how are test smells treated in the wild? a tale of two empirical studies

nildo silva junior [ federal university of bahia | nildo.silva@ufba.br ]
luana martins [ federal university of bahia | martins.luana@ufba.br ]
larissa rocha [ federal university of bahia / state univ. of feira de santana | lrsoares@uefs.br ]
heitor costa [ federal university of lavras | heitor@ufla.br ]
ivan machado [ federal university of bahia | ivan.machado@ufba.br ]

abstract

developing test code may be a time-consuming process that requires much effort and cost, especially when done manually. in addition, during this process, developers and testers are likely to adopt bad design choices, which may lead to the introduction of the so-called test smells in the test code. as the size of test code containing test smells increases, the tests might become more complex and, as a consequence, much more challenging to understand and evolve correctly. therefore, test smells may harm test code quality and maintenance and undermine software testing activities as a whole. in this context, this study aims to understand whether software testing practitioners unintentionally insert test smells when they implement test code. we first carried out an expert survey to analyze the usage frequency of a set of test smells and then interviews to reach a deeper understanding of how practitioners deal with test smells. sixty professionals participated in the survey, and fifty professionals participated in the interviews. the results indicate that experienced professionals introduce test smells during their daily programming tasks, even when using their companies' standardized practices. additionally, although tools support test development and quality improvement, most interviewees were not aware of the concept of test smells.

keywords: test smells, survey study, interview study, mixed-method research

1 introduction

software projects, both commercial and open-source ones, commonly include a set of automated test suites as one crucial support to verify software quality (garousi and felderer, 2016).
however, creating test code may require high effort and cost (wiederseiner et al., 2010; yusifoğlu et al., 2015; garousi and felderer, 2016). automated test generation tools, such as randoop (https://randoop.github.io/randoop/), jwalk (http://staffwww.dcs.shef.ac.uk/people/a.simons/jwalk/), and evosuite (http://www.evosuite.org/), emerge as alternatives to facilitate and streamline this activity. if designed with high quality, automated testing offers benefits over manual testing, such as repeatability, predictability, and efficient test runs, requiring less effort and cost (yusifoğlu et al., 2015; garousi and küçük, 2018). therefore, tests should be concise, repeatable, robust, sufficient, necessary, clear, efficient, specific, independent, maintainable, and traceable (meszaros et al., 2003). however, the development of well-designed test code is neither straightforward nor a simple task. developers are usually under time pressure and must deal with constrained budgets, which can stimulate anti-patterns in test code, leading to the occurrence of the so-called test smells. test smells are indicators of poor implementation solutions and problems in test code design (greiler et al., 2013). the presence of test smells in test code may lead to reduced quality; consequently, the test code may not reach its expected capability of finding bugs while remaining understandable, maintainable, and so on (yusifoğlu et al., 2015; garousi and küçük, 2018). the literature reports 196 test smell types classified into the following groups (garousi and küçük, 2018): behavior, logic, design-related, issues in test steps, mock- and stub-related, association with production code, code-related, and dependencies. the literature presents studies that aimed to identify and analyze the effect of test smells on software projects in several aspects (greiler et al., 2013; garousi and felderer, 2016; van rompaey et al., 2006). in those studies, the authors introduce test smells as non-functional quality attributes within the software test code engineering process. in addition, they discuss existing test smell types and their consequences in terms of test code maintenance (garousi and felderer, 2016). some authors attempted to correlate metrics and the presence of test smells (greiler et al., 2013). however, few discussions exist in the literature about daily practices and programming styles that may contribute to inserting test smells. understanding the relationship between development practices and the introduction of test smells may support improving the activity of test creation. this study extends our previous investigation (silva junior et al., 2020), which aimed to understand whether software testing practitioners (for simplicity, hereafter "practitioners") unintentionally insert test smells. we used an expert survey with sixty practitioners from brazilian companies to analyze which practices that might introduce test smells during test creation and execution they adopt, and how often. in this extension, we sought to understand (i) how much the practitioners know about test smells and (ii) how the practitioners deal with test code quality regarding test smells. to identify whether and to what extent the practitioners know about test smells and how they deal with them, we interviewed fifty practitioners.
the results from both studies are complementary. we found that most of the interviewees did not know anything about the concept of test smells. they commonly used practices that introduced test smells, but they hardly ever removed them from the test code. we mapped which daily programming practices would be associated with each test smell for both test creation and execution. then, we asked the practitioners whether they used those practices, without the need to name the test smells. we used the interviews to complement the survey and analyze the practitioners' unit test creation, maintenance, and quality verification activities. in addition, we investigated the practitioners' knowledge about test smells and how they treat those smells during unit test creation and maintenance. our study may provide insights to understand how and which practices may introduce test smells in test code. in addition, we present the practitioners' point of view about activities related to unit test code and their beliefs about the treatment of test smells. thus, we investigated the following research questions:

rq1: do practitioners use test case design practices that might lead to the introduction of test smells? we investigated whether bad design choices may be related to test smells.

rq2: which practices are present in practitioners' daily activities that lead to introducing test smells? we investigated which test smells are associated with the practitioners' most frequent practices.

rq3: does the practitioners' experience interfere with the introduction of test smells? we investigated whether, over time, practitioners improve the activity of test creation.

rq4: how aware of test smells are the practitioners? we investigated the practitioners' knowledge of test smells.

rq5: what practices have practitioners employed to treat test smells? we investigated how the practitioners deal with test smells in their daily activities.

the remainder of this article is structured as follows: section 2 introduces the concept of test smells; section 3 details the research method applied in this study; section 4 presents the survey's design and results; section 5 presents the interview's design and results; section 6 discusses the main findings of this investigation; section 7 presents the threats to validity; section 8 discusses related work, and section 9 draws concluding remarks.

2 test smells

automated tests may generate more efficient results when compared to manually executed ones. due to their repeatability and non-human interference, automated tests might lead to reductions in time and execution effort (yusifoğlu et al., 2015; garousi and küçük, 2018). however, developing test code is not a trivial task, and automated tools may not ensure system quality because they can generate a poor design (palomba et al., 2016; virgínio et al., 2019).
in real-world practice, developers are likely to use anti-patterns during test creation and evolution, leading to errors in implementing test code (van deursen et al., 2001; bavota et al., 2012). these anti-patterns may negatively impact test code maintenance (van rompaey et al., 2006). several studies investigated different types of test smells. initially, van deursen et al. (2001) defined a catalog of 11 test smells and refactorings (to remove test smells from the test code). after that, other authors extended this catalog and analyzed the effects of the smells on production and test code (van deursen et al., 2001; meszaros et al., 2003; van rompaey et al., 2006; bavota et al., 2012; greiler et al., 2013; bavota et al., 2015; garousi and felderer, 2016; palomba et al., 2016; peruma, 2018; virgínio et al., 2019; virgínio et al., 2020). for example, garousi and küçük (2018) identified more than 190 test smells in a literature review of 166 studies. in this study, we selected 14 types of test smells frequently studied and implemented in cutting-edge test smell detection tools (van deursen et al., 2001; meszaros et al., 2003; peruma, 2018). these are described next:

• assertion roulette (ar). a test method that contains assertions without explanation. if one of those assertions fails, it is not possible to identify which one caused the problem (van deursen et al., 2001);
• conditional test logic (ctl). a test method with conditional logic (if-else or repeat instructions). tests with this structure do not guarantee that the same flow is verified, as they might not test a specific code piece (meszaros et al., 2003);
• constructor initialization (ci). a test class that presents a constructor method instead of a setup method to initialize fields (peruma, 2018);
• eager test (et). a test method that checks many object methods at the same time. this test may be hard to understand and execute (van deursen et al., 2001);
• empty test (ept). a test method that does not contain executable assertions (peruma, 2018);
• for testers only (fto). a production class has methods only used by test methods (van deursen et al., 2001);
• general fixture (gf). the fields instantiated in the setup method are not used by all test methods of a test class. the class may be hard to read and understand, and the test execution may slow down (van deursen et al., 2001);
• indirect testing (it). a test class has methods that perform tests on different objects because there are references to those objects in the test class (van deursen et al., 2001);
• magic numbers (mn). a test method contains assertions with literal numbers as test parameters (meszaros et al., 2003);
• mystery guest (mg). a test method uses an external resource, such as a file with test data. if the external file is removed, the tests may fail (van deursen et al., 2001);
• redundant print (rp). a test method contains irrelevant print statements (peruma, 2018);
• resource optimism (ro). a test method contains optimistic assumptions about the presence or absence of external resources. the test may return a positive result once, but it may fail at other times (van deursen et al., 2001);
• test code duplication (tcd). a test method has undesired duplication (van deursen et al., 2001);
• test run war (trw). a test method fails when several tests run simultaneously and access the same fixtures (van deursen et al., 2001).
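to make some of these definitions concrete, consider the short junit 4 sketch below. it is our own illustration rather than an example taken from the studies or the surveyed companies, and the ListBehaviorTest class and its test data are hypothetical; it shows how assertion roulette, conditional test logic, and magic numbers can surface in ordinary test code.

import static org.junit.Assert.assertEquals;
import static org.junit.Assert.assertTrue;

import java.util.ArrayList;
import java.util.List;

import org.junit.Test;

public class ListBehaviorTest {

    // assertion roulette (ar): several assertions without failure
    // messages; if one of them fails, the test report does not
    // indicate which expectation was broken.
    @Test
    public void testListOperations() {
        List<String> items = new ArrayList<>();
        items.add("a");
        items.add("b");
        assertEquals(2, items.size());
        assertTrue(items.contains("a"));
        assertEquals("b", items.get(1));
    }

    // conditional test logic (ctl) and magic numbers (mn): the loop
    // and the if statement hide which execution path is actually
    // verified, and the literals 5 and 3 carry no explanation.
    @Test
    public void testGrowth() {
        List<Integer> numbers = new ArrayList<>();
        for (int i = 0; i < 5; i++) {
            numbers.add(i);
        }
        if (numbers.size() > 3) {
            assertEquals(Integer.valueOf(4), numbers.get(4));
        }
    }
}

note that a smell is not necessarily a bug: both tests pass, but their structure makes failures harder to diagnose and the verified behavior harder to read.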
3 research method

we carried out two empirical studies in this investigation: a survey and an interview study (miles et al., 2014). figure 1 shows the methodological steps employed in this study (figure 1. research method overview). initially, we designed our study by defining the research questions and the suitable research methods to investigate them (fig. 1 - design). we used the survey research method to identify which programming practices the respondents (practitioners who participated in the survey) adopt that might insert test smells in the test code (fig. 1 - survey). we next applied the interview study method to identify how the interviewees (practitioners who participated in the interview) deal with test smells during test creation and execution (fig. 1 - interview). we compared the results obtained from both the survey and the interviews to contrast the adoption of practices that might lead to introducing test smells with the practitioners' knowledge about test smells from different perspectives (fig. 1 - data comparison). for the survey, we adopted the design of observation by case-control. case-control is a descriptive design used to investigate previous situations to support understanding a current phenomenon (pfleeger and kitchenham, 2001). it encompasses activities for the design, application, and analysis of a survey questionnaire. we designed the questionnaire not to require specific knowledge about test smells. we correlated each test smell to a set of programming practices, which the participants should read and analyze. section 4 details the survey study. to complement the findings of the survey questionnaire, we carried out semi-structured interviews (singer et al., 2008; gubrium et al., 2012). the interview's structure aims to capture the interviewees' perception of test smells. as we needed the interviewees to know the definition of test smells to elaborate on how they deal with them, we first introduced them to the concept of test smells. section 5 details the interview study. the survey and interview instruments were written and applied in portuguese with brazilian practitioners. finally, the data comparison summarizes the survey and interview results to answer the research questions (creswell and clark, 2018). section 6 presents the results.

4 survey study

we applied the survey research method to investigate how the respondents commonly insert test smells in the test code when designing or implementing their software projects (melegati and wang, 2020). throughout this section, we provide readers with detailed information about the research design and data analysis. all material used in the survey study, including the dataset, is publicly available at (junior et al., 2021).

4.1 design

we structured the questionnaire so that the respondents were not required to be aware of test smells beforehand; thus, we covered a larger number of potential practitioners. we correlated the concepts of test smells to commonly applied test creation and execution practices. table 1 shows examples of those practices. for instance, the practices associated with conditional test logic (ctl) involve using loops or conditions in the test code. in this case, the respondents should analyze the practices to determine whether and how often they adopt them. for ctl, the respondents should indicate how often they create tests with those structures or face them during test execution.

table 1. examples of practices related to test smells

test smell | test creation practices | test execution practices
mystery guest | i often create test cases using some configuration file (or supplementary file) as support. | a test case fails due to the unavailability of access to some configuration file.
eager test | i often create tests with a high number of parameters (number of files, database records, etc.). | i run some tests without understanding what their purpose is.
assertion roulette | i pack different test cases into one (i.e., put together tests that could be run separately). | some tests fail, and it is not possible to identify the failure cause.
for testers only | i have already created a test to validate some feature that will not be used in the production environment. | i run some tests to validate features that will not be used in the production environment.
conditional test logic | i have already created conditional or repeating tests. | i run tests with conditional or repeating structures.
empty test | i have already created an empty test with no executable statement. | i find empty tests, with no executable statement.

questionnaire instrument

the questionnaire comprises three blocks of questions. the first block characterizes the respondents (profile) and has thirteen questions to identify their age, gender, education degree, and software testing/programming skills. the second block has fourteen statements and six complementary questions (four objective and two open-ended questions). the statements describe creation practices related to test smells. we structured those statements on a five-point likert scale, where the respondents could choose one of the following answers: always, frequently, rarely, never, or not applicable. on this scale, always indicates the adoption of bad practices for test creation. for example, the "i have already created a test to validate a feature that would not be used in the production environment" statement corresponds to the for testers only test smell. therefore, the answer "always" means that the respondent usually uses that practice in her daily tasks. as a consequence, it is likely that she unintentionally inserts that test smell in the test code. we designed the six complementary questions to understand how the practitioners deal with the test creation activity. the third block has fourteen statements and one additional question. those statements describe execution practices related to test smells. like the former block, we structured those statements on a five-point likert scale. the respondents could choose one of the following answers: always, frequently, rarely, never, or not applicable, where always indicates that the respondent comes across test smells. we designed the complementary question to understand which problems the respondents deal with when executing tests. the survey was available from april 3rd, 2019, to june 3rd, 2019. appendix a includes all the questionnaire statements and questions used in this study.

pilot application

we ran a pilot survey with four practitioners to identify improvement opportunities. based on the responses, we improved the questionnaire before running the survey. it is worth mentioning that we did not include the data gathered in the pilot application in the research results.

participants

we sent invitations and one questionnaire copy (c1 - c8) to practitioners from eight brazilian companies on a convenience sampling basis.
the questionnaire's different versions served to control the number of respondents from each company. those companies have 4 to 66 practitioners who perform manual and automated tests (table 2). in addition, we also sent the questionnaire through direct messages (d1) and posted it in a facebook group dedicated to discussing software testing (g1). in total, we contacted 305 practitioners, and 60 practitioners participated in the survey (#s1 - #s60).

table 2. respondents

source | professionals | answers
c1 | 66 | 14
c2 | 30 | 1
c3 | 10 | 0
c4 | 6 | 0
c5 | 5 | 0
c6 | 4 | 4
c7 | 4 | 4
c8 | 4 | 0
d1 | 52 | 35
g1 | 124 | 2
total | 305 | 60

analysis procedure

to answer rq1, we analyzed the objective questions (statements) on test creation (second block) and execution (third block). to answer rq2, we grouped the practices by frequency to identify the most commonly used ones. the practices may be associated with test smells according to their characteristics, such as external file usage, conditional structures, and programming style. to answer rq3, we compared professional experience with the frequency of use of practices related to test smells. we used the same answer format as for rq1 but only considered test creation (second block), since during test execution respondents identify test smells instead of creating them. we analyzed the three open-ended questions through coding and continuous comparison (kitchenham et al., 2015). the objective was to understand why the respondents use practices that may insert test smells. in addition, we also intended to understand which difficulties they encounter when creating and executing tests. two researchers performed the coding task and validated it by consensus. we also associated some practices with the test code characteristics defined by meszaros et al. (2003). we employed open coding on the collected data to identify additional reasons why the respondents may use bad practices in their software testing activities. the obtained codes were peer-reviewed and changed upon agreement among the paper authors. we used coding to complement our results on the open-ended questions because they were optional.

4.2 results

we received 60 answers (out of 305 potential respondents) from three brazilian states: 40 respondents from bahia (66.7%), 19 from são paulo (31.7%), and one from paraná (1.6%). the respondents ranged from 22 to 41 years old, and their experience with quality assurance ranged from 0 to 13 years (5.16 on average). their experience as software developers also ranged from 0 to 13 years (1.67 on average). regarding gender, 35 respondents were male (65%), 19 were female (32%), and two were non-binary (3%). most of the respondents hold a degree in computer science-related courses (50 respondents - 83.3%), six respondents (10%) hold a degree in other stem (science, technology, engineering, and mathematics) courses, and four respondents (6.7%) hold a degree in other areas. most of the respondents (54 respondents - 90%) pursued higher education degrees, as follows: 40 respondents hold a bachelor's degree (66.7%), 13 respondents hold a graduate degree (21.7%), and one respondent holds a postdoc (1.6%). regarding the software testing tasks they commonly perform, (i) 26 respondents reported they create and run tests at the same rate (43.3%); (ii) 13 respondents execute tests more frequently than they create them (21.7%); and (iii) 8 respondents create tests more frequently than they execute them (13.3%).
moreover, 12 respondents only execute test cases (20%), and one respondent only creates test cases (1.7%). they perform tests on many different platforms; 35 respondents (58%) work with two or more platforms (web - 39 respondents (65%), android - 35 respondents (58%), desktop - 29 respondents (48%), and apple - 17 respondents (28%)). they also cited other platforms, such as back-end, microservices, api, mainframe, and cable tv - one respondent each (1.67%). in terms of domain, 39 respondents claimed they test mobile applications (65%), and 36 respondents test web applications (60%). we also identified the following domains: 14 respondents work with embedded systems (23.33%), 11 respondents work with cloud computing (18.33%), seven respondents test information security (11.67%), and four respondents test internet of things systems (6.67%). they also mentioned other domains: big data, retail, artificial intelligence, cable tv, bioinformatics, commercial information, desktop systems, and payment solutions - one respondent each (1.67%).

4.2.1 test creation and execution practices

we asked whether the respondents search for test duplication and whether it was a personal or a company practice. twenty-nine respondents (48.3%) answered that it was only an individual activity. eleven (18.3%) responded that it was only a company practice, and three respondents (5%) claimed that it was both a personal and a company activity. however, seventeen respondents (28.3%) do not apply this activity. checking for tests with the same objective reduces the test code duplication (tcd) test smell. in addition, we established a relationship between the test creation and execution practices and the occurrence of test smells using the collected data. figures 2 and 3 show the usage frequency of test smells during the test creation and execution activities, respectively.

figure 2. test smells frequency in test creation

during test creation, the conditional test logic (ctl) and general fixture (gf) test smells were the most reported ones. the former obtained 28 (47%) always and frequently responses, and the latter 27 (45%) in both responses (figure 2). the high rate of those responses may indicate common everyday use of practices related to ctl and gf. we also analyzed why developers create tests with bad practices (one open-ended non-mandatory question answered by 27 respondents - 45%). the main reasons were related to company or personally employed standards, limited time, and the attempt to reach better coverage and efficiency. we also asked whether they modified existing test sets when they came across tests containing any of the problematic test patterns illustrated in the survey. we found that seven respondents (11.7%) always perform test code changes, twenty-three respondents (38.3%) frequently change, sixteen respondents (26.6%) rarely change, seven respondents (11.7%) never edit test code, and seven respondents (11.7%) answered not applicable. among the reasons to modify the tests, eighteen respondents reported ambiguity reduction (30%), sixteen respondents claimed execution speed improvement (26.7%), and fourteen respondents stated adequacy
thirty­one respondents in­ dicated that some tests depended on third party resources (52%), 29 respondents reported that they were hard to under­ stand (48%), 24 respondents claimed to contain unnecessary information (40%), 24 respondents said ambiguous informa­ tion (40%), 20 respondents reported to depend on external files (33%), six respondents pointed to use an external config­ uration file (10%). one respondent presented resources limi­ tation (2%). regarding difficulties in creating test cases (one open­ ended non­mandatory question answered by 23 respondents (38%)), requirement issues were the most frequent ones, re­ ported by twelve respondents (52%). other problems were related to the difficulties in the test code reuse, lack of knowl­ edge, production code issues, code coverage, test environ­ ment problems, and time and resource limitation. the test execution questions also presented a sequence of statements about ordinary situations the developers usually face, in which respondents should answer according to the frequency. the ctl (52%) and gf (47%) test smells were also the most cited during test execution (figure 3). those test smells obtained 31 and 28 answers of always and fre­ quently frequencies, respectively. figure 3. test smells frequency in test execution. regarding difficulties in running test cases (one open­ ended non­mandatory question answered by 29 respondents ­ 48%), ten respondents reported test environment as a prob­ lem related to test execution (34%), such as test environ­ ment unavailability, demand for third­party features, and low­performance environments. the second most common problem is understanding the test purpose (28%), where eight respondents reported that tests were poorly written and with­ out a standard, allowing multiple interpretations. the lack of test maintenance was the third problem (24%), which in­ volves outdated and incomplete tests due to the system code evolution (7 respondents). table 3. answers grouped by experience range experience (in years) number of respondents total 0 ­ 2 11 143 > 2 ­ 4 12 156 > 4 ­ 6 15 195 > 6 ­ 8 5 65 > 8 ­ 10 9 117 > 10 ­ 12 4 52 > 12 ­ 14 4 52 4.2.2 professional experience although most respondents from the survey reported they create and execute tests simultaneously, our investigation presented a different scenario as the tester gets more expe­ rienced. figure 4 shows the daily activities according to the professional experience, with the following highlights: 10 re­ spondents (16.7%) with experience ranging from 4 to 6, and 5 respondents (8.3%) with 8 to 10 years of experience create and execute tests at the same proportion. eight respondents (13.4%) with less than two years of experience, six respon­ dents (10%) ranging from 2 to 4 years of experience, and four respondents (6.7%) ranging from 6 to 8 years of expe­ rience only run tests or run tests with more frequency than create. three respondents (5%) with more than 12 years of experience mostly create rather than run tests. therefore, less experienced respondents run more than creating tests, and re­ spondents with more experience create more than run tests. figure 4. testing tasks according to professional experience. we also analyzed whether the use of good practices to create tests increases as respondents become more experi­ enced. we provided the respondents with thirteen statements, with illustrative scenarios of problems with test cases. each scenario relates to a given test smell. 
the respondents had to answer how often they had experienced each scenario. table 3 shows the number of respondents grouped by experience time (in years) and the number of valid responses. figure 5 presents the frequency of test smells grouped by professional experience.

figure 5. test smells frequency in test creation according to professional experience

when we analyzed the first experience range (0-2), 71 answers (50%) from the respondents could not identify the adoption of practices related to test smells (not applicable). 9 (6%) answers pointed out that respondents always adopted some practice related to test smells, 16 (11%) answers related to frequently, 29 (20%) answers to rarely, and 18 (13%) answers to never adopted practices related to test smells. when we extended that analysis through the next experience ranges, we could not observe any increase in the never and rarely responses with professional experience, indicating that experience might not influence the adoption of practices that lead to the introduction of test smells.

5 interview study

after carrying out the survey study, we interviewed software engineers to gather further evidence on how the practitioners develop unit test code and deal with test smells in test creation and maintenance. the interview dataset, including the interview transcriptions, interviewees' profiles, and coding summary, is publicly available at (junior et al., 2021).

5.1 design

we employed a semi-structured interview approach, guided by a set of sixteen questions, as table 4 shows.

table 4. interview questions

1. how did you start working with software testing?
2. what were your learning sources about test code?
3. which programming languages do you create tests for?
4. which programming languages do you use in your current software project?
5. how is your test creation process?
6. is there any flowchart or template document that standardizes this process?
7. which support tools are used for test creation and execution?
8. how do you verify the quality of unit tests?
9. moving to the test code maintenance process, tell me how this process works inside the company.
10. what do you know about test smells?
11. how did you learn about that?
12. do you have any doubts about test smells?
13. how are test smells handled in the unit test creation process?
14. how are test smells handled in the unit test maintenance process?
15. how would it be possible to avoid the introduction of test smells during test creation?
16. do you have any questions, additional information, or suggestions to improve this interview?

interview organization

we organized the interview into three blocks:

• warm-up block (#1-3). questions about the professional background, such as the learning resources on software test code the interviewees commonly use, as well as the programming language they often use to implement test code, if any;
• technical block (#4-9). questions about how they create, maintain, and assess the quality of developed unit tests;
• test smell block (#10-15). questions about the interviewees' awareness of test smells and how they handle them in test case creation and maintenance.

the interviewees could also ask for more information or give additional information and suggestions to increase the interview quality (question #16).
unlike the survey, in the interview we employed the actual test smell term in the questions related to the concept, instead of a transitive approach through statements containing practices embedded with test smells. when the participants were not aware of the term or asked for more information on test smells, we presented the concept and two test smell samples, e.g., ctl and ept (virgínio et al., 2020). those test smells were related to the most and the least frequently used programming practices in the survey results, respectively. there were no questions about challenges or problems involved in creating and maintaining test code. the interviewees answered the questions in table 4 according to their experiences, concepts, and the information shared during the meeting. the interviewer and interviewees did not access any test code from the interviewees to analyze the presence of test smells. at the beginning of the interview, the practitioners answered a professional profile form with their academic background and professional experience. they also provided an email address to solve eventual doubts or collect more data during the data analysis. we conducted the interviews between june 3rd and june 30th. due to the pandemic period, online meeting tools, such as skype and google meet, were used upon the participants' request. we recorded the interviews with either the skype conversation recording tool or the google meet screen capture feature. additionally, we used an external voice recorder for every interview.

participants

initially, we contacted practitioners from the survey who agreed to keep contributing to the research. unlike the survey, we opted only for test code developers whose focus was creating and maintaining unit tests, including the treatment of test smells. some interviewees had participated in the survey study, as we applied the snowballing technique (kitchenham et al., 2015). next, we used linkedin to invite other potential participants, using the "unit testing" expression in the profile ability search linkedin provides to users. a total of 50 practitioners accepted the invitation (#i1 - #i50).

pilot study

we performed two pilot interviews with practitioners to measure the interview length and analyze whether it would be necessary to modify any part of the predefined instrument. as a result, there was no need to perform any changes to the instrument. the average interview length was around 30 minutes.

analysis procedure

the first author was responsible for transcribing the interviews. from the transcriptions, we performed open coding (corbin and strauss, 2014) to answer the research questions. the remaining co-authors analyzed the transcriptions to understand how the practitioners develop tests and deal with test smells. first, we analyzed and validated the coding until we reached a consensus. following that, two authors individually reviewed the proposed coding. in the end, one expert researcher reviewed the final coding.

5.2 results

the interviewees could answer the open-ended questions in different ways, according to their reality. therefore, when presenting the results, the sum of responses may exceed 100% in the quantitative analysis. the respondents' ages ranged from 20 to 48 years old, most of them from 25 to 34 years old (60%). regarding their education, six respondents completed high school (12%), 31 respondents completed undergraduate school (62%), and 13 respondents hold a graduate degree (26%).
additionally, 48 respondents either have a degree in or were studying a computer science-related course (96%), one respondent holds a degree in applied business (2%), and one respondent holds a degree in psychology (2%). the respondents worked in companies of different sizes, as follows: (i) 10 respondents worked in small companies (less than 50 employees - 20%); (ii) 5 respondents worked in medium-sized companies (from 50 to 99 employees - 10%); and (iii) 35 respondents worked in large companies (more than 99 employees - 70%). additionally, the interviewees were responsible for different tasks within the companies related to their current roles (table 5). they created unit tests for mobile, desktop, and web platforms using different programming languages (table 6). their experience in software development tasks varied from 1 to 20 years, of which more than 50% were in the 1-6 years of experience range. two of them were not working with unit test creation when we interviewed them; in such cases, they were asked to consider their previous experience.

table 5. respondents' roles

role | respondents | %
developer | 22 | 44%
software engineer | 7 | 14%
systems analyst | 7 | 14%
software architect | 5 | 10%
team leader | 3 | 6%
automation engineer | 2 | 4%
consultant | 2 | 4%
project manager | 2 | 4%
quality specialist | 2 | 4%
quality engineer | 1 | 2%
test developer | 1 | 2%
test analyst | 1 | 2%

table 6. programming languages

language | respondents | %
java | 25 | 30%
javascript | 14 | 17%
c# | 11 | 13%
typescript | 8 | 10%
python | 7 | 9%
kotlin | 5 | 6%
php | 4 | 5%
swift | 3 | 4%
ruby | 2 | 2%
c | 1 | 1%
c++ | 1 | 1%
elixir | 1 | 1%
go | 1 | 1%

for the open coding analysis, we compared and analyzed the information and grouped it into codes using sentences, paragraphs, or the entire document. for example, when we asked about the unit test creation process, interviewee #i47 answered: "when i worked only with java [...] if i know the context well, if i have deep knowledge of the context that i will develop, i like to do a little tdd [...], but unfortunately this is not something that can be 100% reality in the business, because you have n situations, n circumstances. so i cannot do tdd; at least i develop the specific feature, [...] the features, methods, etc., and then i will test it, for example, for each method that i know has a logic within that method, i do the test cases for n possibilities". from this answer, we identified the following codes: code a - tdd; code b - tld; code c - depends on personal skill. we found 159 codes. we did not consider the warm-up block answers (#1-3), as we used them to stimulate the interviewees to provide as much information as possible. we used the technical block answers (#4-9) to analyze how the interviewees created, maintained, and verified test quality, complementing and comparing with the survey's supplementary questions. we used the answers to question #10 to analyze what the interviewees knew about test smells; therefore, we could answer rq4. questions #11 and #12 complemented question #10. we used the answers to questions #13 and #14 to analyze the strategies for dealing with test smells and answer rq5. then, we analyzed the answers given to question #15 to understand how the interviewees believed it was possible to avoid introducing test smells. those questions let us better understand how they create, maintain, and verify unit test code and how they deal with and possibly avoid test smells.
5.2.1 unit test code creation and maintenance
we found that the developers usually create unit test code using test-driven development (tdd) (48%), test-last development (tld) (42%), or behavior-driven development (bdd) (16%). the choice of strategy was motivated by the project task or by the developer's knowledge of the project's programming language or architecture. for example, interviewee #i16 stated that he used tdd when he mastered the programming language; otherwise, the functional software code was created first and then tested (tld). interviewee #i25 claimed that she created unit tests according to the stories from the bdd scenario; when there was no scenario, she used tdd. the method adoption could also depend on whether the software was new or legacy. interviewee #i32 pointed out that tdd was used on new projects when possible, and that he used a bdd variation before the software code creation.

during the test code creation description, four interviewees (8%) mentioned using mocks to simulate components, and two interviewees (4%) adopted clean code practices. for instance, interviewee #i22 claimed he creates test code that is easy to read and understand, fast, and independent. interviewee #i36 uses code patterns and creates less verbose tests. additionally, four interviewees (8%) focus on test coverage. interviewee #i12 claimed that he identifies "interesting features" to test. according to interviewee #i43, the test code should cover 80% of the software code. moreover, interviewee #i10 mentioned the solid principles, and interviewee #i15 adopts the model-view-viewmodel (mvvm) project pattern as a practice during test creation.

when we asked whether there was any document that standardized unit test creation, nine interviewees (18%) indicated the use of templates or some other documentation. interviewees #i5 and #i9 mentioned a test template in their projects that the team members could adopt. interviewee #i29 claimed his team followed microsoft's official documentation, but there was no internal document. interviewee #i39 mentioned using a domain-specific language (dsl) to share project information, as follows: "on project day 0, we create and standardize an official dsl for the code. you have prerogatives, you have the test, and you have the result". in addition, some interviewees answered that there was no documented standard, but they adopted the given-when-then (gwt) pattern and the arrange-act-assert (aaa) programming practices (illustrated in the sketch below).

furthermore, the interviewees mentioned 90 different tools to create and run tests. those tools are related to (i) code development (junit: 42%, jest: 14%, and visual studio: 20%); (ii) metrics analysis (sonar tools: 18%); and (iii) continuous integration (jenkins: 10%, azure: 2%, and circle ci: 2%). after creating unit test code, test quality assessment was performed through code review (78%) by one or more developers inside the project team. this activity was usually supported by tools such as pull panda (https://pullpanda.com/). for example, interviewee #i2 claimed: "pull panda is a tool used to randomly assign one or more developers to perform the code review. [...]". furthermore, two other interviewees (#i4 and #i16) reported that they performed peer review (4%), and four interviewees claimed they commonly verify test code quality through pair programming (8%).
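to make the reported practices concrete, the following junit 5 sketch shows a test structured with the arrange-act-assert pattern and a hand-written stub standing in for a mocked component. it is illustrative only: the classes, names, and scenario are hypothetical and not taken from the interviewees' projects.

```java
// a minimal junit 5 sketch of the arrange-act-assert (aaa) structure and of a
// hand-written stub used in place of a mocked component. all names and the
// scenario are hypothetical; this is not code from the study's participants.
import static org.junit.jupiter.api.Assertions.assertEquals;

import org.junit.jupiter.api.Test;

class DiscountServiceTest {

    // hypothetical collaborator that would normally hit a database
    interface CustomerRepository {
        boolean isPremium(String customerId);
    }

    // hypothetical production code under test, inlined to keep the sketch self-contained
    static class DiscountService {
        private final CustomerRepository repository;

        DiscountService(CustomerRepository repository) {
            this.repository = repository;
        }

        double priceFor(String customerId, double basePrice) {
            return repository.isPremium(customerId) ? basePrice * 0.9 : basePrice;
        }
    }

    @Test
    void premiumCustomerReceivesTenPercentDiscount() {
        // arrange: stub the repository so the test does not depend on a real database
        CustomerRepository premiumOnly = customerId -> true;
        DiscountService service = new DiscountService(premiumOnly);

        // act: exercise exactly one behavior of the unit under test
        double price = service.priceFor("c-42", 100.0);

        // assert: a single, explicit expectation makes the failure cause obvious
        assertEquals(90.0, price, 0.001);
    }
}
```

structuring tests this way keeps them independent and easy to read, two of the qualities interviewees such as #i22 said they aim for.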
other quality verification practices identified were: test coverage (30%), metric analysis tools (24%) (e.g., the sonarqube tool), review by a continuous integration tool (16%), test execution (10%), application of programming practices (10%) (reuse, clean code, and libraries), running a mutation testing tool (6%), test validation by an external quality assurance team (2%), and static validation (2%). three interviewees reported that there were "no test quality assurance" activities, either because there were not enough tests to perform this activity or because the company did not support it.

the interviewees adopted various test maintenance types, distributed among corrective (62%), adaptive (36%), preventive (4%), and perfective (4%) maintenance. four interviewees claimed there was no test code maintenance because: (i) there was no defined maintenance process (interviewee #i22); (ii) they participated in one new project and no maintenance task was required (interviewee #i24); (iii) maintenance activities were absent because of shortage of time (interviewees #i24 and #i36); and (iv) the project environment (interviewee #i45).

5.2.2 test smells treatment
we asked the interviewees about their knowledge of test smells to understand whether they comprehended the study subject. figure 6 summarizes the results. seven interviewees (14%) demonstrated some knowledge of test smells. for example, interviewee #i2 answered: "i know a few things. i consider these as bad practices, bad choices that you make in your test code that difficult its maintenance and evolution.". twenty-three interviewees (46%) related test smells to code smells but claimed they had never heard of test smells. interviewee #i16 mentioned: "test smell, i do not know the concept. the code smell is a problem that the static test analysis tool found in the program. would test smell be that same analysis on top of the test code?". finally, twenty interviewees (40%) did not know test smells and did not relate the term to any type of smell.

we presented the definition and examples of two test smells (ctl and ept) to the interviewees who did not know about test smells or asked for more information. table 7 shows how they prevent test smells during test code creation and how they treat test smells during test code creation and maintenance. for example, during test code creation, the code review practice was the most recommended (38%), followed by tool usage (26%) and programming practices (24%). when developing the test code, the developer should follow the programming practices to prevent test smells; tools and code reviews help to check for test smell insertion at an early stage of development. two interviewees believed there were no test smells in their repositories. for example, interviewee #i39 said: "i think we do not have this problem (test smells) in the recent project because of its difficulty level, we follow a coding standard. we educate people on how we code it [...]". interviewee #i11 also said: "as i am the only one working on the project, i coded, understood, and never had this vision of test smells. i do not think i have any problem with that.".

regarding maintenance, we asked how the interviewees treated test smells during test code maintenance. the answers were similar to the previous question (table 7). for test code maintenance, code review was also the most recommended practice (28%), followed by refactoring (20%) and tool usage (18%). the sketch below illustrates the two smells presented to the interviewees and a typical refactoring for one of them.
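for readers unfamiliar with the two smells used as examples in the interviews, the following junit 5 sketch shows, on hypothetical code, what an empty test (ept) and conditional test logic (ctl) look like, together with one common way of refactoring the conditional logic into a parameterized test. it illustrates the concepts only and is not code from the study.

```java
// a minimal junit 5 sketch of the two example smells (ept and ctl) and one common
// refactoring. the code under test (square) is hypothetical and trivial on purpose;
// the parameterized variant requires the junit-jupiter-params artifact.
import static org.junit.jupiter.api.Assertions.assertEquals;

import org.junit.jupiter.api.Test;
import org.junit.jupiter.params.ParameterizedTest;
import org.junit.jupiter.params.provider.CsvSource;

class SmellExamplesTest {

    static int square(int x) {
        return x * x;
    }

    @Test
    void emptyTest() {
        // empty test (ept): no executable statement, so the test always passes
        // and silently inflates the suite's apparent reliability
    }

    @Test
    void conditionalTestLogic() {
        // conditional test logic (ctl): the loop and the if hide which inputs are
        // actually verified, so a failure does not point to a specific case
        for (int x = 0; x < 5; x++) {
            if (x % 2 == 0) {
                assertEquals(x * x, square(x));
            }
        }
    }

    // refactored alternative: a parameterized test makes every checked input
    // explicit and removes the conditional logic from the test body
    @ParameterizedTest
    @CsvSource({"0, 0", "2, 4", "4, 16"})
    void squareOfEvenNumbers(int input, int expected) {
        assertEquals(expected, square(input));
    }
}
```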
as the test code was already developed and might have test smells, they suggested using tools to help detect test smells and refactoring techniques to remove them from the test code. the code review practice can double-check the test code to treat test smells during maintenance.

we also asked the interviewees how to prevent test smells during test code creation (table 7). for test smell prevention, tool usage was the most recommended practice (44%), followed by developers' skills (28%) and code review (20%). the developers' skills relate to test development know-how: following good practices, guidelines, and coding patterns, which should help developers identify and prevent flaws when designing and implementing test code. tool usage can support developers when writing test code by identifying possible test smells. code review is a manual analysis that double-checks the test code to prevent test smells.

at the end of the interviews, the participants could either provide or ask for further information about test smells and test code quality assurance. for instance, interviewee #i29 claimed: "for me, it is a quality guarantee in terms of dependence exemption, in terms of development, cohesion, coupling, and fundamental architecture. from the moment you have unit testing or even tdd, it helps you improve the code and architecture.". interviewee #i35 demonstrated interest in our study: "i would like to know more about the study, we can talk about it later if you want, [...] i thought the term 'test smell' is complicated, at least it does not seem to be a common industry expression.".

figure 6. prior knowledge about test smells.

table 7. practices to prevent test smells or to treat them during test code creation and maintenance
# | practice | creation | maintenance | prevention
1 | code analysis | – | – | 2 (4%)
2 | code removal | – | 1 (2%) | –
3 | code reuse | – | – | 1 (2%)
4 | code review | 19 (38%) | 14 (28%) | 10 (20%)
5 | coding patterns | 4 (8%) | 5 (10%) | 8 (16%)
6 | company support | – | – | 1 (2%)
7 | culture's development | – | – | 3 (6%)
8 | developer skills | 2 (4%) | 2 (4%) | 14 (28%)
9 | documentation | – | – | 1 (2%)
10 | guidelines | – | – | 3 (6%)
11 | individual analysis | 2 (4%) | 6 (12%) | –
12 | mutant testing | 1 (2%) | 1 (2%) | –
13 | no treatment | 13 (26%) | 13 (26%) | –
14 | pair programming | 2 (4%) | 1 (2%) | 4 (8%)
15 | peer review | 1 (2%) | 1 (2%) | 1 (2%)
16 | professional experience | – | – | 6 (12%)
17 | programming practices | 12 (24%) | 8 (16%) | 11 (22%)
18 | refactoring | 5 (10%) | 10 (20%) | –
19 | tdd | – | – | 3 (6%)
20 | technical debt | 1 (2%) | 5 (10%) | –
21 | technical meeting | 1 (2%) | – | –
22 | tool usage | 13 (26%) | 9 (18%) | 21 (44%)
23 | traceability | 1 (2%) | 1 (2%) | 1 (2%)
24 | training | – | – | 8 (16%)
25 | take breaks | – | – | 1 (2%)
26 | software code improvement | – | 1 (2%) | –
27 | test smell catalog | – | – | 1 (2%)

6 discussion
this section discusses the results obtained after conducting the survey and the interview to answer the research questions. rq1, rq2, and rq3 are related to the survey, and rq4 and rq5 are related to the interview.

6.1 rq1: do practitioners use test case design practices that might lead to the introduction of test smells?
from the results, we observed that each of the 14 practices related to test smells was pointed out by at least one respondent. we analyzed those practices in the creation and maintenance of tests to identify which types of test smells the participants frequently insert in the test code. regarding test creation, we observed that every test smell presented at least three out of four possible answers (always, frequently, rarely, and never).
we classified the data into two groups: the commonly-used practices group (cpg) and the unused practices group (upg). cpg contains the test smells that mostly received always and frequently as answers, and upg those that mostly received rarely and never. we considered a test smell as belonging to one group when the difference between the combined always/frequently rate and the combined rarely/never rate was greater than 10%. for example, the empty test, for testers only, test run war, constructor initialization, resource optimism, redundant print, magic number, and indirect test smells belong to upg, which means practitioners rarely insert those smells in their testing activities. on the other hand, the respondents frequently adopt practices related to the general fixture test smell, the only member of cpg, indicating that they usually create tests with that smell. still, four test smells presented a similar pertinence frequency in both groups (less than a 10% difference); for them, there was no pattern among respondents. for instance, the eager test smell obtained 38% for cpg and 40% for upg.

in test execution, upg contains the empty test, eager test, assertion roulette, redundant print, duplicated test, test run war, for testers only, mystery guest, constructor initialization, and resource optimism test smells, which means that the respondents rarely face those smells during test execution. in contrast, the respondents frequently find practices related to two test smells, general fixture and conditional test logic, which compose the cpg group. in addition, we did not perceive a significant difference among respondents for two other test smells, indirect test and magic number, which presented a similar pertinence frequency in both groups.

we also investigated the reasons that lead the respondents to adopt the practices presented in the survey. thus, we analyzed the open-ended questions and identified 16 different tags. the most common ones were company standard, personal standard, project politics, professional experience, saving time, and improving coverage. for example, respondent #s26 reported applying company standards when creating tests that may insert smells, commonly using bad practices "to match company development standards." in another situation, respondent #s54 reported using personal standards when stating: "i group tests by modules to execute them sequentially without compromising effectiveness." this behavior suggests that participants may have misunderstood the test smells definition: when grouping tests, it is possible to insert the assertion roulette test smell and compromise test independence. a similar situation occurred with respondents #s14, #s16, #s27, #s50, and #s59.

in general, our study identified that all test smells appeared in testing activities; all of them were cited by respondents, even if rarely. practitioners adopt practices for test case design that introduce test smells. usually, those practices come from improper personal and company standards.

6.2 rq2: which practices are present in practitioners' daily activities that lead to introducing test smells?
although there are specific tools to support test automation (fraser and arcuri, 2011; smeets and simons, 2011), 62% of respondents perform more manual than automated tests.
besides, 55% have little experience with software development (less than two years of experience); still, this lack of knowledge does not influence the adoption of bad practices in the test code. according to the practices explored in the survey, we identified that the respondents usually come across: (i) the use of generic configuration data, which produces the general fixture test smell (most frequent in the test creation and execution activities; cpg); and (ii) the use of conditional or repetition structures, directly associated with the conditional test logic test smell (second most detected in the test execution activity; cpg).

the respondents indicated they usually face several problems with tests, such as poorly written tests and outdated or incomplete test procedures. according to them, when tests rely on generic configuration data, test cases are hard to understand and may cause incorrect results. moreover, the test coverage of the production code becomes unclear due to the presence of conditional logic in the tests. understanding which practices are most prevalent in the professionals' activities supports improving test quality. other identified problems are related to incomplete, outdated, or missing documentation, which may hinder the traceability, evolution, and maintenance of testing tasks.

the practices most present in practitioners' daily work that lead to test smell insertion were the use of conditional or repetition structures and of generic configuration data.

6.3 rq3: does the practitioners' experience interfere with the introduction of test smells?
in the survey study, we analyzed the respondents' experience and its influence on adopting practices that might lead to inserting test smells in their projects. as a result, we did not identify any clear cause-effect correlation. for example, the always option indicates that respondents always use harmful practices; when we analyzed the answer frequencies for this option, the usage rate did not decrease with experience. instead, we may observe from figure 5 that respondents with 8 to 10 years of experience achieved a higher usage rate for this frequency. we also identified that behavior when we analyzed the other usage frequencies. however, we could not infer that inexperienced practitioners introduce more test smells than experienced ones regarding the test creation activity. on the one hand, when testers are inexperienced programmers, they may write lower-quality tests. on the other hand, when they are more experienced, they can carry programming biases that may contain bad practices. thus, the absence of a tendency indicates no behavioral change between less and more experienced practitioners.

experienced practitioners may not produce fewer test smells than inexperienced ones.

6.4 rq4: how aware of test smells are the practitioners?
the survey results indicate that the lack of information on test smells is one reason that leads practitioners to adopt programming practices that may introduce test smells. although the test smell concept appeared in 2001 (van deursen et al., 2001), when we asked in the interview what they knew about test smells, only 14% of the interviewees demonstrated some knowledge. for example, two interviewees mentioned: "i know a little bit about test smells. if i am not mistaken, there are smells like test assertion and duplicated [...]" (#i5) and "test smell? from smells? i know the basics" (#i19).
we believe that the industry should explore this topic more through the initiatives proposed in academia (santana et al., 2020). some interviewees (46%) associated the term test smells with code smells and related test smell detection to tool usage or personal practices. for example, interviewee #i04 mentioned: "although i had never heard the term, it makes sense, because i saw everything as a code smell, but there are some strategies, some guidelines that i follow for unit tests.". this behavior may generate confusion about what the tools actually do, as when interviewee #i10 said: "one of the outputs of those software that i mentioned, sonarqube and code climate, are these test smells. they can find some of them, [...] because we can not publish a project with these types of test structures, tests with commented content, such as empty test, the test with a complexity greater than 1". conversely, there is no information about test smell analysis in the sonarqube documentation. thus, we considered that those analyses are related to code smells in test code, which is different from test smell detection.

test practitioners do not know what a test smell is. they may associate the test smell concept with code smells, but they have no information about test smell types and their refactoring.

6.5 rq5: what practices have practitioners employed to treat test smells?
commonly, the interviewees did not know what test smells are. after we explained the concepts in the interview, they could understand them and explain how they deal with test smells in their daily activities. they reported adopting a set of project activities (e.g., code review, pair programming, and technical debt) and programming practices during the test creation and maintenance processes (e.g., the clean code approach and the given-when-then (gwt) and arrange-act-assert (aaa) patterns) to either prevent or treat test smells.

the interviewees tended to develop unit tests according to their skills. professional abilities also determine the result of a code review: reviewers who have not learned about test smells or programming practices may approve a submitted package containing these issues. code review was the most reported activity to treat test smells in test creation (38%) and the most common activity performed by the interviewees (78%) during test quality verification. in this activity, one or more practitioners analyze the submitted code; in this context, the reviewer's knowledge determines whether the code is good enough to be merged into the repository.

each team adopts different strategies to perform code reviews, based on the number of reviewers, the number of approvals, and professional experience. although some interviewees reported that only experienced members review software and test code, the review may not keep test smells out of the project repository, mainly because both experienced and inexperienced practitioners adopt practices that introduce test smells.

when we asked about test smell treatment during test maintenance, some interviewees reported creating a technical debt item to refactor the test smell at another moment (interviewees #i08, #i09, #i22, #i25, and #i50). this behavior may indicate that test smell correction is not a priority. the creation of technical debt may also be the reason why test smells remain in the repository. for example, interviewee #i09 said: "there is nearly no treatment for test smells. [...]
when removing a feature from the software or its business rule is changed, the test code is commented and left there. [...] the developers hardly handle commented test codes. [...]".

the interviewees hardly addressed technical debt and failing tests because they needed to prioritize other tasks, such as software code development. with less time for testing, test smells would be introduced in the test code during test creation and maintenance and would remain in the repository as maintenance activities are postponed.

we did not know beforehand whether the practitioners had learned about test smells; thus, we adopted the test smell concepts from the literature. the validation of those concepts was out of scope. although we did not specifically ask whether the interviewees considered test smells a problem or agreed that the given examples are smells, during their answers about test smell treatment part of them described how they treat at least one of the given examples. for example, interviewee #i07 said: "despite not having worked exactly with this type of concept, sonar itself warned us about these two problems, both when the logic was very complex, with a lot of 'if', it warned us to break it in different methods, things like that. moreover, i remember that it identified comments, commented code, and sends a warning". regarding the conditional test logic smell example, #i37 said: "this specific code enters into a specific clean code case. this test may be doing more than it should". according to these comments, the interviewees consider test smells, including the given examples, as structures to fix.

practitioners adopt a set of project activities and programming practices to treat test smells. as they do not know the test smell concepts well, it is impossible to guarantee that those strategies treat test smells appropriately.

7 threats to validity
internal validity. although there are more than 100 test smells, this study only considered 14 of them. however, we selected the test smells most frequently discussed in the literature. in addition, the test smells were presented in the survey as practices. to mitigate ambiguities and text comprehension issues, we applied a pilot with four testers from different companies. we used professional social networking to reach as many respondents as possible from demographically distributed brazilian companies for the survey and interview execution.

external validity. our survey and interview respondents may not adequately represent the practices adopted by practitioners in the wider software engineering industry. although our results may not generalize, they provide an initial view of the practices adopted by testers. there is an agreement among the practitioners' responses, indicating that additional data might not reveal new insights.

construct validity. the survey did not inform the respondents that the questions referred to test smells, so that we could investigate whether practitioners unintentionally insert test smells; this prevented the respondents' partiality when identifying the practices adopted. complementarily, to investigate how the practitioners deal with test smells, we presented the concept to the interviewees who did not know this subject. after learning about test smells, the respondents were interested in finding solutions for this "problem" (test smells). we collected the open-ended question answers and performed a peer-reviewed coding process to avoid biases.
the survey and interview instruments were written in portuguese and translated into english by one author and reviewed by the others.

conclusion validity. the data analysis was an exhaustive process, which depends on the researchers' interpretation of the open-ended question answers. to prevent biases, we performed the data analysis in three steps: i) two researchers analyzed the data in pairs to discuss the identification of the codes; ii) two researchers analyzed the data individually, checking whether new codes could emerge; and iii) all researchers discussed and compiled the results from steps i and ii. additionally, to increase transparency, the raw survey and interview data are available online for other researchers to validate and replicate our study.

8 related work
bavota et al. (2015) presented a case study to investigate the impact of test smells on maintenance activities. in that study, developers and students analyzed testing code to compare whether their experience would make a difference in test smell identification. as a result, they found that the intensity of the test smells' impact differs across levels of experience: the number of impacting test smells is higher for students than for industry professionals. additionally, they found that test smells have a significantly negative impact on maintenance activities. conversely, our survey found that the practitioners' experience does not interfere with test smell introduction during test creation and execution activities. moreover, the interview revealed that the practitioners are not aware of test smells, reinforcing that experience does not drive test smell insertion in the test code.

tufano et al. (2016) proposed an interview study with 19 participants to investigate developers' perception of test smells. they performed an empirical investigation to analyze where test smells occur in the source code. the results showed that developers generally do not recognize test smells and that test smells are present since the first code commit in the repository. similarly, our interview indicated a lack of awareness among developers about the underlying concept of test smells. additionally, we did not find any study investigating how professional practices affect test smell introduction; therefore, we investigated it through a survey.

spadini et al. (2020) surveyed developers to evaluate severity thresholds for detecting test smells and to investigate the perceived impact of test smells on test suite maintainability. the developers had to classify whether a test smell instance was valid and rate each instance regarding its importance to maintainability. the evaluation of test smell instances requires knowledge about the topic; therefore, our survey presented practices that might lead to test smell insertion, and our interview provided information about test smells to level the respondents' knowledge of the topic.

in our previous work (silva junior et al., 2020), we conducted an expert survey to understand whether practitioners unintentionally insert test smells. we surveyed sixty brazilian practitioners regarding fourteen bad practices that might lead to test smell insertion during test code creation and execution. the results indicated that the practitioners' experience might not influence test smell insertion; usually, practices that lead to test smell insertion came from improper personal and company standards.
this current study complements the previous one by investigating the practitioners' knowledge about test smells and how they deal with test code quality regarding the presence of test smells. we conducted interviews with fifty brazilian practitioners to ask them about the test code creation and maintenance processes. as a result, the interviewees indicated a set of practices that might be useful to treat test smells. however, as they do not know the test smell concepts, those practices need further investigation as test smell treatments.

9 conclusion
test smells may decrease test code quality and hinder its maintenance. our study aimed to identify whether practitioners unintentionally insert test smells in the test code and how they treat them. therefore, we applied two complementary research methods: a survey and an interview study.

we surveyed sixty respondents to investigate the unintentional insertion of test smells in the test code. they evaluated a set of practices related to test smell insertion in the test code. the results indicated that the respondents adopt bad practices that might lead to inserting test smells, and that the adoption of bad practices is more related to improper company standards than to the respondents' experience with test code development.

to investigate how the practitioners treat test smells, we interviewed fifty respondents. they answered questions on how they prevent and treat test smells during test code development. the results indicated an overall lack of knowledge about test smells; for most of the interviewees, it was their first contact with this subject. however, after we explained one test smell, the respondents recognized it in their test code and identified practices they adopted to deal with it. among the recommended practices, we highlight the adoption of tools, coding patterns, programming practices, code review, and training to improve the developers' skills and expertise.

after analyzing the answers to the survey and the interview, we could identify that practitioners did not know test smells. thus, they insert different types of test smells, even the experienced ones. they have tried to treat test smells through some strategies, but as they have not learned about this subject, they keep inserting test smells in their test code, and the strategies may not be enough to avoid that. these studies are starting points for research that considers practitioners as agents in test smell treatment.

as future work, we aim to follow the grounded theory methodology (corbin and strauss, 1990) to build a common understanding of how receptive the software industry is to improving test code quality by taking test smells into consideration. we would validate the respondents' practices to prevent and treat test smells and elaborate a checklist for test code quality development and assurance with an in-depth study.

acknowledgements
we would like to thank the participants in our survey and pilot study. this research was partially funded by ines 2.0; cnpq grants 465614/2014-0 and 408356/2018-9 and fapesb grants jcb0060/2016 and bol0188/2020.

references
bavota, g., qusef, a., oliveto, r., lucia, a., and binkley, d. (2012). an empirical analysis of the distribution of unit test smells and their impact on software maintenance. in 28th ieee international conference on software maintenance (icsm).
bavota, g., qusef, a., oliveto, r., lucia, a., and binkley, d. (2015). are test smells really harmful? an empirical study. empirical software engineering, 20(4).
corbin, j. and strauss, a. (2014). basics of qualitative research: techniques and procedures for developing grounded theory. sage publications.
corbin, j. m. and strauss, a. (1990). grounded theory research: procedures, canons, and evaluative criteria. qualitative sociology, 13(1):3–21.
creswell, j. w. and clark, v. l. p. (2018). designing and conducting mixed methods research. sage publications, third edition.
fraser, g. and arcuri, a. (2011). evosuite: automatic test suite generation for object-oriented software. in 13th european conference on foundations of software engineering, esec/fse, new york, ny, usa. acm.
garousi, v. and felderer, m. (2016). developing, verifying, and maintaining high-quality automated test scripts. ieee software, 33(3).
garousi, v. and küçük, b. (2018). smells in software test code: a survey of knowledge in industry and academia. journal of systems and software, 138.
greiler, m., van deursen, a., and storey, m. (2013). automated detection of test fixture strategies and smells. in 2013 ieee sixth international conference on software testing, verification and validation.
gubrium, j. f., holstein, j. a., marvasti, a. b., and mckinney, k. d. (2012). the sage handbook of interview research: the complexity of the craft. sage publications, 2nd edition.
junior, n. s., martins, l., rocha, l., costa, h., and machado, i. (2021). how are test smells treated in the wild? a tale of two empirical studies [dataset]. available at: https://doi.org/10.5281/zenodo.4548406.
kitchenham, b. a., budgen, d., and brereton, p. (2015). evidence-based software engineering and systematic reviews, volume 4. crc press.
melegati, j. and wang, x. (2020). case survey studies in software engineering research. in proceedings of the 14th acm/ieee international symposium on empirical software engineering and measurement (esem), esem '20, new york, ny, usa. acm.
meszaros, g., smith, s. m., and andrea, j. (2003). the test automation manifesto. in maurer, f. and wells, d., editors, extreme programming and agile methods - xp/agile universe 2003. springer berlin heidelberg.
miles, m. b., huberman, a. m., and saldaña, j. (2014). qualitative data analysis. sage publications, fourth edition.
palomba, f., di nucci, d., panichella, a., oliveto, r., and de lucia, a. (2016). on the diffusion of test smells in automatically generated test code: an empirical study. in 9th international workshop on search-based software testing. acm.
peruma, a. s. a. (2018). what the smell? an empirical investigation on the distribution and severity of test smells in open source android applications. phd thesis, rochester institute of technology.
pfleeger, s. l. and kitchenham, b. a. (2001). principles of survey research: part 1: turning lemons into lemonade. acm sigsoft software engineering notes, 26(6):16–18.
santana, r., martins, l., rocha, l., virgínio, t., cruz, a., costa, h., and machado, i. (2020). raide: a tool for assertion roulette and duplicate assert identification and refactoring. in proceedings of the 34th brazilian symposium on software engineering, sbes '20, pages 374–379, new york, ny, usa. association for computing machinery.
silva junior, n., rocha, l., martins, l. a., and machado, i. (2020). a survey on test practitioners' awareness of test smells. in proceedings of the xxiii iberoamerican conference on software engineering, cibse 2020, pages 462–475. curran associates.
singer, j., sim, s. e., and lethbridge, t. c. (2008). software engineering data collection for field studies. in shull, f., singer, j., and sjøberg, d. i. k., editors, guide to advanced empirical software engineering, pages 9–34, london. springer london.
smeets, n. and simons, a. j. (2011). automated unit testing with randoop, jwalk and µjava versus manual junit testing. research report, department of computer science, university of sheffield/university of antwerp, sheffield, antwerp.
spadini, d., schvarcbacher, m., oprescu, a.-m., bruntink, m., and bacchelli, a. (2020). investigating severity thresholds for test smells. in proceedings of the 17th international conference on mining software repositories, msr.
tufano, m., palomba, f., bavota, g., di penta, m., oliveto, r., de lucia, a., and poshyvanyk, d. (2016). an empirical investigation into the nature of test smells. in 31st international conference on automated software engineering. ieee.
van deursen, a., moonen, l., van den bergh, a., and kok, g. (2001). refactoring test code. in proceedings of the 2nd international conference on extreme programming and flexible processes in software engineering (xp).
van rompaey, b., du bois, b., and demeyer, s. (2006). characterizing the relative significance of a test smell. in 22nd international conference on software maintenance, icsm'06. ieee computer society.
virgínio, t., martins, l., rocha, l., santana, r., cruz, a., costa, h., and machado, i. (2020). jnose: java test smell detector. in proceedings of the 34th brazilian symposium on software engineering, sbes '20, pages 564–569, new york, ny, usa. association for computing machinery.
virgínio, t., martins, l. a., soares, l. r., santana, r., costa, h., and machado, i. (2020). an empirical study of automatically-generated tests from the perspective of test smells. in sbes '20: 34th brazilian symposium on software engineering, pages 92–96. acm.
virgínio, t., santana, r., martins, l. a., soares, l. r., costa, h., and machado, i. (2019). on the influence of test smells on test coverage. in proceedings of the xxxiii brazilian symposium on software engineering. acm.
wiederseiner, c., jolly, s. a., garousi, v., and eskandar, m. m. (2010). an open-source tool for automated generation of black-box xunit test code and its industrial evaluation. in bottaci, l. and fraser, g., editors, testing – practice and research techniques. springer berlin heidelberg.
yusifoğlu, v. g., amannejad, y., and can, a. b. (2015). software test-code engineering: a systematic mapping. information and software technology, 58.

appendix a

block 1: respondents' profile
q1. what is your gender?
q2. what is your age?
q3. which course do you have an academic background in?
q4. what is the highest degree or level of education you have completed?
q5. in which brazilian state do you currently work?
q6. how long have you been working with software testing?
q7. how long have you been working with software development?
q8. which activity do you perform daily?
q9. what are the platforms of the projects that you have worked on?
q10. what is the application domain of the last project that you worked on?
q11. which test technique do you execute?
q12. are the tests executed more often manually or automated?
q13. how do you describe your expertise with coding?

block 2: test creation
q14. what is the source for creating the test cases for the projects in which you work?
q15. is there verification to detect duplicate tests (with the same writing, or with different writing and the same objective)? more than one option could be selected.
evaluate the following statements according to your daily activities:
q16. "i usually create test cases using some configuration file (or complementary file) as a backup."
q17. "when creating a test, i analyze whether it can be executed at the same time as others or if it should be executed in isolation, due to the availability of external resources."
q18. "i analyze the possibility of a test failing because it uses a resource that is being used at the same time by another test."
q19. "i have a habit of creating tests with a high number of parameters (number of files, database records, etc.)."
q20. "i group different test cases into one (that is, combine tests that could be run separately)."
q21. "i create tests that depend on resources that may not have their own tests for validation (e.g., a test that involves retrieving information from the database, but there is no test to validate the database search)."
q22. "i have already created a test to validate some feature that will not be used in the production environment."
q23. "i have already created a test with a high value for a specific parameter (e.g., number of records in the database, number of files in a folder) even though that makes it difficult to repeat."
q24. "i have already created a test with a conditional or repetitive structure."
q25. "i have already created an empty test, with no executable instructions."
q26. "i usually create tests using some data from a configuration file."
q27. "i usually create tests with printing or displaying of results in a redundant way, or without need."
q28. "i have already created a test considering the existence of a resource, without checking its existence or availability."
q29. "i have already changed a test after identifying one of the previous points."
q30. if you answered "always", "frequently" or "rarely" in the previous questions, why were the tests created with these standards?
q31. if you changed any tests according to the design standards above, why were they edited?
q32. what problems in the test structure have you encountered?
q33. what difficulties do you often encounter when creating test cases?

block 3: test execution
evaluate the following statements according to the frequency found in daily activities:
q34. "a test case fails due to unavailability of access to a configuration file."
q35. "repeat a test case because it previously failed due to competition with some other test case that was running at the same time."
q36. "execute tests that could be executed more quickly when modifying the contents of the configuration file."
q37. "run a test without understanding its purpose."
q38. "some test fails and it is not possible to identify the cause of the failure."
q39. "run a test that depends on an external resource that does not have a test for direct validation."
q40. "a test case fails due to unavailability of access to any external resource."
q41. "run a test with a high value for a specific parameter (e.g., number of records in the database, number of files in a folder) even if it makes it difficult to repeat."
q42. "run a test to validate a feature that will not be used in the production environment."
q43. "find a duplicate test (with the same or different writing)."
q44. "run a test with a conditional or repetitive structure."
q45. "find an empty test, with no executable instruction."
q46. "run a test with printing or display of results in a redundant way, or unnecessarily."
q47. "run a test considering the existence of a resource, without checking its existence or availability."
what difficulties do you usually encounter when running test cases?

journal of software engineering research and development, 2022, 10:9, doi: 10.5753/jserd.2022.1897. this work is licensed under a creative commons attribution 4.0 international license.

assessing the credibility of grey literature: a study with brazilian software engineering researchers
fernando kamei [ ufpe, ifal | fernando.kenji@ifal.edu.br ]
igor wiese [ utfpr | igor@utfpr.edu.br ]
gustavo pinto [ zup innovation & ufpa | gustavo.pinto@zup.com.br ]
waldemar ferreira [ unicap | waldemar.neto@unicap.br ]
márcio ribeiro [ ufal | marcio@ic.ufal.br ]
renata souza [ ufpe | rmcrs@cin.ufpe.br ]
sérgio soares [ ufpe | scbs@cin.ufpe.br ]

abstract
in recent years, the use of and investigations about grey literature (gl) increased, in particular in software engineering (se) research. however, its understanding is still scarce and sometimes controversial, for instance when interpreting gl types and assessing their credibility. this study aimed to understand the credibility aspects that se researchers consider in assessing gl and its types. to achieve this goal, we surveyed 53 se researchers (who answered that they had used gl in our previous investigation), receiving a total of 34 valid responses. our main findings show that: 1) a gl source produced or cited by a renowned source is the main credibility criterion used to assess gl; 2) most gl types tend to have a low to moderate level of control and expertise; 3) there is a positive statistical correlation between the level of control and expertise for most gl types; and 4) the different respondent profiles shared similar opinions about the credibility criteria. our investigation contributes to helping future se researchers that intend to use gl with more credibility. additionally, it shows the need for future studies to better understand the gl types in se research.

keywords: grey literature, credibility, empirical software engineering, evidence-based software engineering.

1 introduction
grey literature (gl) refers to a kind of publication that does not go through a peer-review process before its publication (petticrew and roberts, 2006). some areas of knowledge have used and investigated gl. for instance, in management, adams et al. (2016b) investigated how gl could be used with relevance for management and organization studies. in information science, schöpfel and prost (2020) investigated the term and concept of gl in scientific papers.
in software engineering (se), many researchers interpret gl as any material that was not formally peer-reviewed and published (garousi et al., 2019). in the last years, se researchers have increased their interest in investigating gl, motivated by the growth of the social media and communication channels that se practitioners use to communicate and to exchange problems and ideas (storey et al., 2017), including, for instance, code hosting websites such as github (coelho et al., 2020) and communication platforms such as slack (stray and moe, 2020).

in se, several studies investigated and recognized the importance and usefulness of gl. for instance, garousi et al. (2016) explored the benefits of gl for multivocal literature reviews, showing what secondary studies gained when gl was considered and what was missed when it was not. other studies (williams and rainer, 2017; rainer and williams, 2018) investigated the benefits and challenges of using blog content for se research and how to improve its use by selecting gl content with more credibility.

despite the increase in investigations in this field, there are some misunderstandings about gl and its diverse types (tom et al., 2013; kamei et al., 2021), and about how the set of credibility criteria investigated in previous studies (e.g., williams and rainer (2017)) could be used and interpreted for the diverse types of gl (kamei et al., 2021). according to adams et al. (2016a), the different types of gl can be classified in terms of the "shades" of grey, which group gl according to two dimensions: control and expertise. garousi et al. (2019) explained these dimensions as follows: control is the extent to which content is produced, moderated, or edited in conformance with explicit and transparent knowledge creation criteria; expertise is the extent to which we can determine the producer's authority and knowledge.

in this paper, we begin by studying the different perceptions of se researchers about gl. we then focus on studying how gl could be assessed considering its different types. for each study, we surveyed brazilian se researchers. in the first survey, which was published previously (kamei et al., 2020), we investigated how brazilian se researchers use gl, focusing on understanding which criteria they employed to assess its credibility as well as the benefits and challenges they perceived. in the second survey (the novel contribution of this paper), we focused on how the brazilian se researchers that previously used gl perceived the criteria to assess the different gl types according to control and expertise. in the following, we list our main findings (s1 refers to survey 1, s2 to survey 2):

s1: we identified the main gl sources used by brazilian se researchers;
s1: we identified several motivations to use (or to avoid) gl;
s1, s2: we identified that the main criteria employed by brazilian se researchers to assess gl credibility are that the gl source is provided by renowned authors, institutions, or companies, or is cited by a renowned source;
s2: gl is not widely used as a reference in scientific studies;
s2: we identified different interpretations when assessing gl types, showing the importance of considering each type in particular;
s2: for most gl types, we identified strong to very strong positive correlations (p-value <= 0.05) between the perceptions of the level of control and expertise;
s2: we did not find a significant correlation (p-value <= 0.05) between the perceptions of control and expertise for gl types when considering the respondents' profiles;
s2: we perceived misunderstandings about whether a source type is considered a gl type or not, mainly related to the sources most often classified as high control and high expertise.

this paper is structured as follows: section 2 presents the core concepts of this work. section 3 shows the research questions explored, with their rationales. section 4 describes the methods employed to conduct the surveys and to analyze and synthesize the collected data. section 5 summarizes the answers to the research questions (rq1–rq4) of the previous investigation (kamei et al., 2020). section 6 provides the answers to the research questions (rq5–rq6) specific to this investigation. section 7 presents the discussion of the findings, the lessons learned, and the threats to the validity of this research. section 8 describes and compares the related works. finally, section 9 presents the conclusions and future work.

2 background
grey literature (gl) has many definitions; the best known is the so-called luxembourg definition (garousi et al., 2019), approved at the third international conference on grey literature in 1997, which states: "[gl] is produced on all levels of government, academics, business, and industry in print and electronic formats, but which is not controlled by commercial publishers, i.e., where publishing is not the primary activity of the producing body."

focusing on software engineering (se) research, garousi et al. (2019) recently proposed the following definition: "grey literature can be defined as any material about se that is not formally peer-reviewed nor formally published."

those definitions delimit a broad concept of what can be considered gl, showing that it can be produced in different ways; however, this breadth may lead to misunderstandings. for this reason, adams et al. (2016a) introduced some terms to distinguish the different concepts of grey, including grey literature, grey data, and grey information. the term "grey data" describes user-generated web content (e.g., tweets, blogs, videos). the term "grey information" refers to material informally published or not published at all (e.g., meeting notes, emails, personal memories). however, the se literature hardly distinguishes these terms; similarly, we considered all forms of grey data and grey information as gl in our work.

beyond the gl types, adams et al. (2016b) classified gl according to "shades of grey". in se, garousi et al. (2019) adapted these shades into three tiers, as shown in figure 1. in this figure, the top of the pyramid holds the "traditional literature", with scientific articles from conferences and journals; the rest of the pyramid contains what we call the three tiers of gl. these tiers vary along two dimensions: control and expertise.
the first dimension runs between the extremes "low" and "high", and the second runs between the extremes "unknown" and "known". the darker the color, the less the source is moderated or edited in conformance with explicit and transparent knowledge creation criteria.

figure 1. the "shades" of grey literature, adapted from garousi et al. (2019).

recently, gl has been used and investigated in se research for many purposes. for instance, primary studies explored the gl available on several social media sources used by se practitioners: rainer and williams (2018) assessed the importance of blog posts to se research, and oliveira et al. (2021) investigated several java projects from github to evaluate developers' skills based on their source code activities. the presence of gl in secondary studies was notable in the investigations conducted by zhang et al. (2020) and kamei et al. (2021), and in the increase in studies based on grey literature reviews (glr) (e.g., raulamo-jurvanen et al. (2017) and soldani et al. (2018)) and multivocal literature reviews (mlr) (e.g., garousi et al. (2017) and saltan (2019)). explaining these types of study: a glr is a secondary study that explores the evidence looking only at gl sources, while an mlr is a secondary study that searches both gl and traditional literature.

even with this increase in interest, the use of gl is recent in se research (zhang et al., 2020; kamei et al., 2021), and there are still gaps and diverging findings about gl in se research. for instance, kamei et al. (2021) identified a lack of understanding of what is considered a gl type, and previous studies provide different criteria to assess gl credibility (kamei et al., 2020; williams and rainer, 2019).

3 research questions
in this section, we state our research questions and the rationale for their purposes.

rq1: why do brazilian se researchers use grey literature?
rationale: recently, se practitioners have relied on social media and communication channels to share and acquire knowledge (storey et al., 2017). on the one hand, some researchers try to take advantage of this in se research; for instance, rainer and williams (2018) explored the benefits and challenges of blog articles as evidence in se research. on the other hand, some concerns (e.g., lack of detail and lack of empirical methods) related to gl could make se researchers skeptical about its credibility (rainer and williams, 2019). in this broad question, we intend (i) to understand whether brazilian se researchers are using gl and, if so, (ii) what motivates them to use it, or, if not, (iii) the reasons that lead them not to use gl.

rq2: what types of grey literature are used by brazilian se researchers?
rationale: according to adams et al. (2016a), gl has many forms, from traditional mediums such as question & answer websites and blogs to more dynamic mediums such as telegram and slack. for this reason, bonato (2018) emphasized the importance of exploring the gl definition and its types for each research area. there is a lack of understanding of gl types, particularly of which ones brazilian se researchers use. this research question sought to investigate which gl sources brazilian se researchers often use; a better understanding of the gl types could guide future research in this area.

rq3: what are the criteria brazilian se researchers employ to assess grey literature credibility?
rationale: software engineering research uses gl sources, such as data provided by practitioners retrieved from several social media and communication channels. however, as gl is by nature a non-peer-reviewed source, se practitioners are free to share their thoughts using social media, for instance, without worrying about methodological concerns. thus, it is essential to assess gl sources to ensure the selected gl is appropriate for the study. answering this question will help us understand the credibility criteria that brazilian se researchers consider.

rq4: what benefits and challenges do brazilian se researchers perceive when using grey literature?
rationale: according to storey et al. (2014), the se research community has increased its interest in gl given the widespread presence of se professionals on social media and communication channels. for instance, exploring stack overflow, zahedi et al. (2020) found some trends and challenges in continuous se that researchers could explore further. in this question, we are interested in understanding the (i) benefits and (ii) challenges that researchers may face when resorting to gl. answering this question is essential to understanding the potential benefits and challenges of using gl more broadly in research.

rq5: how do se researchers prioritize a set of criteria to assess grey literature credibility?
rationale: in our first investigation (kamei et al., 2020), we provided a set of criteria used by brazilian se researchers to assess gl credibility. previous literature (williams and rainer, 2019) also identified another set of criteria. in this question, we focus on understanding the importance of those criteria for assessing gl credibility.

rq6: what is the perception of brazilian se researchers about the different types of grey literature according to the perspectives of control and expertise?
rationale: due to the diverse nature of the gl types, some studies suggested that gl needs to be assessed in different ways (garousi et al., 2019). for this reason, adams et al. (2016b) classified its types according to the shades of grey. this classification is based on two dimensions: control and expertise. control refers to the rigor with which a source is produced; expertise is the extent to which the knowledge and the producer's authority can be determined. nevertheless, this understanding and classification are still confusing. this research question sought to understand how brazilian se researchers commonly perceive the gl types according to (i) control and (ii) expertise.

4 research methods
in this work, we followed linåker et al. (2015), using a survey methodology for data collection; the data were collected from a group of people sampled from a large population. we conducted two surveys. the first (survey 1) aimed to understand brazilian se researchers' perceptions about gl. the second (survey 2) investigated only the brazilian researchers from the first survey who answered that they used gl. in the following sections, we detail the procedures used to conduct survey 1 with participants of a flagship se conference in brazil (section 4.1). then, we present the procedures used for survey 2, which focused on the researchers that have experience using gl (section 4.2). finally, we provide the methods used for the analysis of both surveys (section 4.3).
4.1 survey 1: initial investigation with the brazilian se researchers

in survey 1, we intended to gather a broad perception of the gl used by brazilian se researchers, focusing on understanding the motivations to use (or avoid) it, the types of gl used, the benefits and challenges, and the criteria used to assess its credibility.

4.1.1 survey design

we conducted our survey with participants of the 10th brazilian conference on software: practice and theory (cbsoft), the largest brazilian software conference, with many se researchers participating. it includes well-established and specialized satellite se conferences in its domain. our population comprises se researchers potentially interested in using gl in their research. we chose our sample using non-probabilistic sampling by convenience (baltes and ralph, 2021). before sending the final survey version, an experienced researcher (a ph.d. se researcher with more than 15 years of research experience) reviewed our draft. we also conducted a pilot study by randomly selecting two participants and explicitly asking for their feedback; the feedback suggested changing the order of some questions and re-writing others to make them more understandable to the target population. we obtained the contacts of all 252 participants by asking the conference's general chair whether s/he could share this information with us, which s/he gently provided.¹ we used two approaches to invite the researchers to answer our questionnaire. first, we placed posters on the event's walls and tables with a brief description of the work and the link to the online survey. second, we sent the survey to the 250 remaining participants of the event (excluding the two pilot participants). in the invitation email, we briefly introduced ourselves, presented the research's purposes, highlighted that the invitation was addressed to cbsoft participants, and included the link to the online survey. we also mentioned that participants were free to withdraw at any moment and that all stored information was confidential. the survey was open for responses from september 26th to october 11th, 2019. we received a total of 76 valid answers (a 30.4% response rate). we did not consider the pilot survey answers.

¹ in the period of this research, the brazilian general data protection law was not yet officially published.

4.1.2 survey respondents

among the survey respondents, 48.7% have a ph.d., 31.6% have a master's degree, 2.6% have a graduate specialization, 14.5% have a bachelor's degree, and 2.6% are undergraduates. among them, 72.4% are men, and 27.6% are women. table 1 presents the respondents' demographic information and whether they have used gl. this table shows that most respondents with ph.d. and master's degrees answered that they were using gl.

table 1. demographic information of the survey 1 respondents.
gender  level of course  used gl  not used gl
woman  doctorate  5  5
man  doctorate  24  3
woman  master  4  2
man  master  15  3
woman  expert  1  1
man  expert  0  0
woman  university graduate  0  2
man  university graduate  2  7
woman  technical education  0  0
man  technical education  0  0
woman  high school  1  0
man  high school  1  0

4.1.3 survey questions

our survey had 11 questions (three were required, and nine were open). we used a different question flow for those who used gl (who did not answer question 10) and for those who did not (who answered only questions 1 to 4 and questions 10 and 11). table 2 presents the questions covered in this survey.

table 2. questions covered in survey 1.
q1. what is your e-mail? (open; not required)
q2. what is your gender? (open; required)
q3. please list the highest academic degree you have received. (closed: high school, technical education, university graduate, expert, master's degree, doctorate; required)
q4. have you used grey literature? if you never used it, go to question q10. (closed: yes, no; required; rq1)
q5. what sources of grey literature did you use? (open; not required; rq2)
q6. in which conditions do you use grey literature? (open; not required; rq1)
q7. in which conditions do you not use grey literature? (open; not required; rq1)
q8. could you list any benefits in using grey literature? (open; not required; rq4)
q9. could you list any challenges in using grey literature? (open; not required; rq4)
q10. if you answered 'no' in question four, please state why you never used or avoided using grey literature. (open; not required; rq1)
q11. what would be a reliable source of grey literature for you? (open; not required; rq3)
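for readers who want to reproduce tabulations like table 1, the sketch below shows one way to do it with pandas, the kind of python tooling mentioned in section 4.3.2. the dataframe rows and column names are invented stand-ins for the raw survey export (the real data has 76 rows), so the printed counts are illustrative only.

```python
# a minimal sketch of how table 1's counts could be tabulated;
# the rows below are invented stand-ins, one per respondent.
import pandas as pd

df = pd.DataFrame({
    "gender": ["man", "woman", "man", "woman"],
    "degree": ["doctorate", "doctorate", "master", "master"],
    "used_gl": ["yes", "yes", "no", "yes"],
})

# demographics broken down by gl usage, mirroring table 1's layout
print(pd.crosstab([df["gender"], df["degree"]], df["used_gl"]))

# overall response rate reported in section 4.1.1
print(f"response rate: {76 / 250:.1%}")  # -> 30.4%
```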
4.2 survey 2: investigating brazilian se researchers that use grey literature

in this survey, we intended to conduct a follow-up investigation collecting perceptions only from the brazilian se researchers from survey 1 who answered that they had previously used gl. we focused on the perceptions of the different gl types concerning the dimensions of control and expertise.

4.2.1 survey design

using a non-probability sample by convenience (baltes and ralph, 2021), we invited by email once again the 53 researchers who participated in survey 1 and mentioned using gl. we first drafted our questionnaire and improved it through three sequential steps: 1) a pilot study with five ph.d. se researchers; 2) an assessment of the questionnaire by another specialist se researcher; and 3) feedback from a participant reporting a problem in the first hours after the survey opened. because of this problem, we closed the survey to stop receiving answers, deleted all answers previously received, and sent a new questionnaire version to the researchers. the survey was open for answers from february 10th to march 4th, 2021. we received a total of 34 valid answers (a 64.1% response rate). we did not consider the pilot survey answers.

4.2.2 survey respondents

in this survey, as we retrieved our sample from the respondents of the previous survey who answered that they had used gl, we did not ask the same demographic questions (e.g., gender, academic degree). instead, we collected information about their experience in se research and in using gl in scientific articles. the respondents' profile was composed of 76.5% professors or researchers and 23.5% students (m.sc. or ph.d.). regarding se research experience, 55.9% of the respondents had more than ten years. considering the experience using gl, 47% had conducted between 2 and 5 scientific studies using gl, although 26.5% were unable to answer.

4.2.3 survey questions

our second survey had ten questions (six were required, and four were open). table 3 presents the questions covered in this survey. before question 4, we included a video² we produced to summarize and explain the "shades of gl" according to the levels of control and expertise.

² video explaining the "shades of gl" (in portuguese): https://youtu.be/hgmkvxiapr0

table 3. questions covered in survey 2.
q1. what is your occupation? (closed: professor/researcher, student (m.sc. or ph.d.), other (open); required)
q2. how many years of experience do you have conducting se research? (closed: up to 1 year, from 1 to 3 years, from 4 to 6 years, from 7 to 9 years, 10 years or more; required)
q3. how many scientific studies have you conducted using gl as a source of evidence? (closed: i do not know, none, only one, from 2 to 5, from 6 to 10, more than 10; required)
q4. we are aware that the level of control varies from source to source. for this reason, we ask you to consider your most frequent experience with each source type in relation to the control dimension of its production. (closed: source types adapted from maro et al. (2018); levels: i did not consider it as a gl type, low control, moderate control, high control, no opinion; required; rq6)
q5. please explain what you considered to classify each source type with the control criteria presented in question 4. (open; not required; rq6)
q6. we are aware that the level of expertise varies from source to source. for this reason, we ask you to consider your most frequent experience with each source type in relation to the expertise dimension of its production. (closed: source types adapted from maro et al. (2018); levels: i did not consider it as a gl type, low expertise, moderate expertise, high expertise, no opinion; required; rq6)
q7. please explain what you considered to classify each source type with the expertise criteria presented in question 6. (open; required; rq6)
q8. considering a gl source with important information for your research, would you include a gl source if it is produced by/with: (closed; expertise criteria: be produced by a renowned author, be produced by a renowned institution, be produced by a renowned company, be cited by other renowned sources, describe the methods of collection, cites an academic reference, cites a practitioner source, presents information with rigor, presents empirical data; answers: no opinion, no, yes; required; rq5)
q9. could you cite any additional potential aspect to assess the credibility of a gl source that was not mentioned before? (open; not required; rq6)
q10. we are planning to conduct future research about quality assessment in grey literature. please, could you inform your e-mail for future contact? (open; not required)

4.3 data analysis and synthesis

in both surveys, we employed a mixed-method approach based on both qualitative (section 4.3.1) and quantitative (section 4.3.2) methods to analyze the data.
we used a qualitative approach when we were interested in questions about "what" and "how", and a quantitative analysis using descriptive statistics to discuss frequency and distribution, together with a correlation analysis between the dimensions of control and expertise for each gl type. we describe these methods in the following.

4.3.1 qualitative analysis

we used a qualitative approach based on the thematic analysis technique (braun and clarke, 2006). this process involved three se researchers with previous qualitative research experience (one ph.d. student (r1) and two ph.d. professors (r2–r3)) for both surveys. for survey 1, we performed an agreement analysis on the codes and categories generated by each researcher using the kappa statistic (viera and garrett, 2005). the kappa value was 0.749, indicating a substantial agreement level according to the kappa reference table (viera and garrett, 2005). for survey 2, we did not calculate kappa because the analysis was performed with the researchers working together.
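to illustrate the agreement analysis, the sketch below computes cohen's kappa, the statistic described by viera and garrett (2005), using scikit-learn's implementation. the two code lists are invented stand-ins: each position holds the code one researcher assigned to the same survey excerpt.

```python
# a minimal sketch of the inter-coder agreement analysis described above.
# the label lists are invented examples, not the study's actual codes.
from sklearn.metrics import cohen_kappa_score

codes_r1 = ["reliability", "practical", "reliability", "access", "practical"]
codes_r2 = ["reliability", "practical", "access", "access", "practical"]

kappa = cohen_kappa_score(codes_r1, codes_r2)
print(f"kappa = {kappa:.3f}")  # the paper reports 0.749 for survey 1
```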
figure 2 presents a general overview of the process employed.

figure 2. example of a coding process used to analyze the questionnaire answers.

in the following, we detail the procedure used to analyze all the answers of both surveys (adapted from pinto et al. (2019)), highlighting the differences between the two surveys:

1. familiarizing with data: the process starts with two independent researchers reading the answers of the survey respondents, as illustrated in figure 2-(a).

2. initial coding: then, for survey 1, two independent researchers (r1 and r2) individually analyzed and added codes. for survey 2, the researchers analyzed, discussed, and coded together (r1 and r2, shown in a dotted box). we used post-formed codes, labeling portions of text that expressed the meaning of the excerpts without any previous pre-formed code. the initial codes were temporary, since they still needed refinement; we refined the emerging codes throughout the analysis. an example of coding is presented in figure 2-(b).

3. from codes to categories: here, we already had an initial list of codes. for survey 1, two researchers conducted this process individually (r1 and r2); for survey 2, it occurred with the two researchers working together. this process begins by looking for similar codes in the data. we grouped codes with similar characteristics into broader categories. eventually, we also had to refine the identified categories, comparing and re-analyzing them in parallel, using an approach similar to axial coding (spencer, 2009). figure 2-(c) presents an example of this process.

4. categories refinement: here, we had a potential set of categories. for both surveys, in a consensus meeting between r1 and r2 (figure 2-(d)), the categories were evaluated and the disagreements of interpretation were solved for the evidence that supported or refuted the categories found. we also renamed or regrouped some categories to describe the excerpts better. in cases where disagreements remained, we invited a third researcher (a ph.d. professor) to review and solve them.

4.3.2 quantitative analysis

we based our quantitative investigation on three samples: (i) we used the answers from 76 se researchers to answer rq1; (ii) we used the answers from the 53 researchers who mentioned using gl to answer rq2, rq3, and rq4; and (iii) we used the answers from the 34 respondents of survey 2 to answer rq5 and rq6. for the descriptive statistics, we highlight that one answer from a respondent could be related to more than one category. for the investigation relating the gl types to the dimensions of control and expertise, we present boxplots to show the differences in interpretation of each gl type. we used spearman's rank correlation coefficient for the correlation analysis between the control and expertise perceptions for each gl type. to do so, we transformed the answers related to the levels of control and expertise (low, moderate, high) into a numeric scale: low = 0, moderate = 50, and high = 100. for the quantitative data analysis, we used the r language and python, the latter with the support of google colab (https://colab.research.google.com).
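to make the recoding concrete, the sketch below maps the three levels to the 0/50/100 scale and computes the quartiles that boxplots such as figures 3 and 4 summarize. the ratings list is an invented example, not survey data.

```python
# a minimal sketch of the level recoding described above, plus the
# quartile summary underlying a boxplot. the ratings are invented.
import numpy as np

levels = {"low": 0, "moderate": 50, "high": 100}
ratings = ["low", "moderate", "moderate", "high", "low", "moderate"]

scores = np.array([levels[r] for r in ratings])
q1, median, q3 = np.percentile(scores, [25, 50, 75])
print(f"1st quartile={q1}, median={median}, 3rd quartile={q3}")
```

note that, for spearman's rank correlation, the exact spacing of the scale does not matter: any order-preserving recoding of low/moderate/high yields the same coefficient.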
5 previous results

in this section, we summarize the findings of our first study to present answers to rq1–rq4. for the details of these research questions, consider reading the previous study (kamei et al., 2020). for each rq, we summarize the categories in tables with the total number of occurrences of a given category in the column "#". two critical observations are required: 1) the researchers may have reported more than one answer per question, and each answer may be grouped into different categories; and 2) some questions were not required. thus, the overall results might not reach 100% of respondents.

rq1: why do brazilian se researchers use grey literature?

in survey 1, we identified 53 se researchers using gl for research purposes. focusing on better understanding why and how se researchers are using gl or avoiding its use, we asked questions covering the motivations to use gl and the reasons to avoid it. in the following, we present a summary of (i) the motivations to use gl and (ii) the reasons to avoid or never use gl.

(i) motivations to use

table 4 presents the identified se researchers' motivations for using gl. in this table, the first column describes the motivation identified, followed by the number of respondents related to the category and the percentage relative to the total of se researchers that used gl (n=53). in the following, we briefly describe some motivations.

table 4. motivations to use gl.
motivation  #  %
to understand the problems  28  52.8%
to complement research findings  12  22.6%
to answer practical and technical questions  10  18.9%
to prepare classes  4  7.5%
to conduct government studies  1  1.9%

to understand problems was the most cited motivation to use gl: several researchers noted using gl to understand or investigate a new topic, to search for something to solve problems, or to acquire specific information to deepen their knowledge. to complement research findings was the second most cited motivation, mentioned when the knowledge gained from the traditional literature is not enough for the investigation; for instance, a researcher noted the use of gl to complement the findings of a mapping study. to answer practical and technical questions was the third most cited motivation, related to the necessity of understanding the state of the practice in se. other motivations were mentioned to a lesser extent, such as to prepare classes and to conduct government studies.

(ii) reasons to avoid/never use

even though several motivations to use gl were identified, 50.9% of the se researchers (27/53) avoid using gl as a reference or to reinforce claims in scientific studies. we also found some researchers that never used gl in any research situation (23/76 occurrences, 30.3%); we used this value to analyze the extent of each category of reasons to never use gl. of the 23 respondents that never used gl, only 15 answered the reason. table 5 summarizes the findings for this question. in the following, we briefly describe the reasons to avoid gl.

table 5. reasons to avoid/never use gl.
reason  #  %
lack of reliability  6  26%
lack of scientific value  3  13%
lack of opportunity to use  3  13%

lack of reliability was the main reason se researchers mentioned for not using gl. it is related to the lack of rigor with which gl sources are written and published, which affects their credibility. lack of scientific value was another category mentioned, where researchers were afraid that the use of gl would weaken a research paper when submitted to the peer-review process.
lack of opportunity to use was related to the nature of the research previously conducted and to the fact that gl is recent in the context of se.

summary of rq1: brazilian se researchers use gl motivated mainly to understand new topics, find information about practical and technical questions, and complement research findings. however, some researchers avoid gl, particularly as references in scientific papers, due to its lack of reliability and scientific value.

rq2: what types of grey literature are used by brazilian se researchers?

in this question, we explored the gl sources used by the 53 se researchers that mentioned using gl. table 6 lists these sources. in the following, we briefly present some of our findings.

table 6. gl sources used by se researchers.
source  #  %
q&a websites  16  30.2%
blog posts  15  28.3%
technical reports  14  26.4%
companies websites  8  15%
preprints  5  9.4%
books/book chapters  5  9.4%
software repositories  4  7.5%
videos  3  5.7%
magazine articles  3  5.7%
news articles  2  3.8%

q&a websites were the most common source mentioned, used to interact with other users, create content, post comments, and assess the content; stack overflow and quora were given as examples. blog posts were the second most common category found; researchers mentioned blogs from renowned practitioners and from companies that produce a diversity of material and content for se and software development in general. technical reports were mentioned by se researchers that used technical experience reports and surveys derived from industry and from national and international research groups. companies' websites, such as those provided by google, facebook, and thoughtworks, containing information regarding their technologies, methods, and practices, were also mentioned; some researchers said they browse these websites to find news to help decision-making about a specific technology.

summary of rq2: brazilian se researchers are using several gl sources. the most common are q&a websites, blog posts, technical reports, and companies' websites.

rq3: what are the criteria brazilian se researchers employ to assess grey literature credibility?

in this research question, we explored the answers to one open-ended question about how se researchers assess gl credibility. table 7 summarizes our findings. in the following, we briefly describe the criteria identified.

table 7. criteria to assess gl credibility.
criteria  #  %
renowned authors  15  28.3%
renowned institutions  14  26.4%
cited by a renowned source  8  15%
renowned companies  7  13.2%

renowned authors was the most cited criterion, in which se researchers considered the author's experience and reputation concerning the topic; for instance, martin fowler was cited as a notorious software engineer with much knowledge. renowned institutions was another crucial criterion, where se researchers assess whether renowned institutions or research groups provided the gl content. cited by others was a criterion mentioned by researchers that consider trustworthy a source cited by others (studies or people). renowned companies was a criterion considering it relevant when renowned software companies or portals produce the gl source.

summary of rq3: whoever produces the gl content, whether a person, an institution, or a company, being considered renowned is a significant credibility criterion.
rq4: what benefits and challenges do brazilian se researchers perceive when using grey literature?

in this research question, we explored the benefits and challenges of gl use mentioned by se researchers. table 8 summarizes the benefits and table 9 the challenges. in the following, we briefly describe some of them.

table 8. benefits of the use of gl.
benefit  #  %
easy to access and read  16  30.2%
provide practical evidence  13  24.5%
knowledge acquisition  13  24.5%
updated information  6  11.3%
advance the state of the art/practice  5  9.4%
different results from scientific studies  3  5.7%

table 9. challenges of the use of gl.
challenge  #  %
lack of reliability  34  64.2%
lack of scientific value  15  28.3%
difficult to search/find information  6  11.3%
non-structured information  6  11.3%

(i) benefits

easy to access and read was the most common benefit mentioned, mainly because most gl sources are open access, are quickly retrieved by free search engines, and their contents are usually easy to read. providing practical evidence was another essential benefit mentioned, showing that gl provides evidence from the se industry to understand the state of the practice. knowledge acquisition was mentioned as a benefit, as gl allows expanding knowledge with information different from what is usually obtained in the traditional literature. updated information was mentioned because the production of gl content happens fast compared with the traditional literature, mainly for technical content. advance the state of the art/practice was mentioned due to the importance of gl to better understand the industry and to provide evidence for finding relevant gaps in the practice. different results from scientific studies was mentioned because some researchers considered gl essential to provide additional knowledge not yet available in the research area.

(ii) challenges

lack of reliability was the main challenge the researchers perceived, with some questioning the reliability of the data retrieved from gl. lack of scientific value was the second most cited category; some researchers mentioned that they did not feel comfortable using gl as a reference in scientific works due to the research community's lack of recognition of this source. difficult to search/find information in gl sources was perceived as a challenge due to the diversity of sources: each source has its own structure and manner of providing access to the content, and it is not easy to replicate a study that used gl. non-structured information was mentioned due to the lack of a writing pattern and the large variety of formats in which gl sources are published, making it difficult to find information, for instance, using an automatic process.

summary of rq4: we found several benefits; the most common was that gl content is easy to access and read, which is important for knowledge acquisition, mainly by providing practical evidence derived from se practitioners. the most cited challenges for using gl in scientific research were the lack of reliability and scientific value.

6 results

in this section, we present answers to rq5 and rq6, both research questions answered by the investigation of survey 2.
rq5: how do se researchers prioritize a set of criteria to assess grey literature credibility?

in our second survey, we asked the 53 invited researchers to prioritize the importance of a set of criteria to assess gl credibility. these criteria were derived from our first investigation and from the williams and rainer (2019) study. we received answers from 34 se researchers. table 10 presents the result of the ranking prioritization of the credibility criteria, revealing that the essential criteria perceived by se researchers are: the gl source being provided by renowned authors, by renowned institutions, or cited by a renowned source.

table 10. prioritized criteria to assess gl credibility.
criteria  #  %
renowned authors  30  88.2%
renowned institutions  30  88.2%
cited by a renowned source  27  79.4%
cites academic source (a)  26  76.5%
presents empirical data (a)  26  76.5%
renowned companies  25  73.5%
cites practitioner source (a)  16  47.1%
rigor in presenting information (a)  12  35.3%
describes the methods of collection (a)  6  17.6%
(a) proposed in williams and rainer (2019).

we also investigated whether the se researchers have any additional criteria to assess gl credibility not mentioned in the previous survey questions. by analyzing the answers, we did not find any new criterion unrelated to the criteria presented in table 10. for instance, some researchers mentioned that a detailed description of the publication context is an important criterion; we considered that it is already contemplated by the rigor in presenting information criterion, previously mentioned by williams and rainer (2019). the author's experience with the topic was another criterion mentioned; we considered it related to the renowned authors criterion identified in our first survey.

summary of rq5: we assessed the prioritization of the credibility criteria identified in our first investigation, in addition to those identified in previous studies. we found that the criteria most valued by se researchers are the gl being produced by a renowned source, cited by a renowned authority, citing an academic source, and presenting empirical data.

rq6: what is the perception of brazilian se researchers about the different types of grey literature according to the perspective of control and expertise?

our last research question explored how the researchers perceived the different types of gl concerning the dimensions of control and expertise. these dimensions are used to classify the tiers of the "shades of gl", and each dimension could be evaluated at three levels (low, moderate, high). figure 3 presents the results of the classifications according to the level of control, and figure 4 shows the results for the level of expertise. even though we are investigating different dimensions, in some cases figures 3 and 4 interestingly presented similar behaviors. for instance, for some gl types (e.g., blog posts, forums/lists of discussions), the low level was predominant in both dimensions. we also found similarities concerning the other levels for both dimensions; for instance, some types (e.g., materials training, news articles, software repositories, and tutorials) run between low (1st quartile) and moderate (2nd quartile), although, for a diversity of cases, the median behavior varied. we also found differences. for instance, considering the level of control for cases/services descriptions and guidelines, the classifications run between low (1st quartile) and moderate (2nd quartile).
in contrast, for the level of expertise for these gl types, we found outliers at the low level (1st quartile) and at the high level (3rd quartile). other classifications caught our attention. for instance, regarding the control dimension, the opinions about magazine articles are not uniform, as we identified some outliers at both extremes (low and high); we identified a similar classification for guidelines in the expertise dimension. in addition to classifying the levels (low, moderate, and high) of the dimensions (control and expertise), we offered the researcher the options "i did not consider it a gl type" and "i have no opinion." we included these options because, even though previous studies (e.g., maro et al. (2018)) presented the gl types for se research, in our previous investigation (kamei et al., 2021) we identified different interpretations, for instance, in which some types were not considered gl. table 11 shows the results of these classifications.

table 11. the types of gl for which se researchers had no opinion regarding the level of control and expertise, or did not consider them a gl type (not gl).
type of source  no opinion (control)  no opinion (expertise)  not gl
thesis  0  1  12
patents  7  10  7
books/book chapters  2  1  6
magazine articles  1  2  3
case/serv. desc.  1  5  3
manuals  1  3  3
materials training  0  3  3
software repositories  0  3  3
blog posts  1  3  2
forums/lists  0  2  2
news articles  0  3  2
slide presentations  0  6  2
keynote speeches  0  2  2
videos  3  4  2
technical reports  3  2  2
q&a websites  1  3  1
guidelines  1  4  1
tutorials  0  4  1
white papers  2  5  1

comparing the findings presented in table 11 with the information presented in figures 3 and 4, we perceived that most of the gl types classified with high expertise and high control were also, many times, considered not a gl type (e.g., thesis, books/book chapters, and patents). moreover, we identified that patents are still unknown to several researchers.

rationale to employ the classification of each dimension (control and expertise)

we asked why the researchers employed the classifications of each gl type according to control and expertise. we identified four main reasons, summarized in table 12 and described in the following.

table 12. reasons to classify gl types according to the level of control and expertise.
reason  #  %
rigor  23  67.6%
producer reputation  14  41.2%
researcher experience  13  38.2%
peer interaction  5  14.7%

figure 3. classification of each gl source type according to the level of control. each level of control indicates: low = 0; moderate = 50; high = 100.

rigor (23/34 occurrences). researchers considered the rigor (control) of each source's production, for instance, the degree of formality present. in this regard, one researcher pointed out: "technical reports, for instance, present systematic studies with high control (of production)." this category was also related to the credibility dimension, as one researcher affirmed: "i consider that credibility is directly related to the rigor of the publication/availability of an artifact."

producer reputation (14/34 occurrences). the producer's reputation was considered an essential criterion to assess control and expertise, as one researcher pointed out: "the credibility relates to who is the author of the material and to the platform where it is conveyed." another one mentioned: "depending on the publisher, i can consider high (e.g., elsevier) or low (e.g., autonomously published book) control. the same applies to news: the credibility of the source influences the level of control regarding stricter editorial control in favor of the integrity of the information."

researcher experience (13/34 occurrences). the researchers' own experience was used to make the classification. in this regard, one researcher pointed out: "i thought of the examples for each type that i have used and classified them according to my experience in dealing with each material." another one mentioned: "i considered what i have read about grey literature."

peer interaction (5/34 occurrences).
another criterion considered for assessing gl control and expertise was the users' interactions in gl sources. in this regard, one researcher mentioned: "another point is that if i have a lot of people interacting and building the content (such as q&a websites), i consider that it has a certain control in the final knowledge presented there." another one pointed out: "in general, i consider the control to be higher when there is a peer review in some way, as in the case of theses and stack overflow."

correlation analysis between the levels of the dimensions (control and expertise) for each gl type

we conducted our analysis using correlation statistics between the two variables (control and expertise) for each gl type using the spearman coefficient, interpreted according to dancey and reidy (2004). to conduct this analysis with paired samples, we removed the answers in which a respondent answered "i did not consider it a gl type" or "i have no opinion" for at least one dimension of the same gl type. based on the results of spearman's rank correlation presented in table 13, we identified 13 gl types (13/19; 68.4%) with correlations varying from strong to very strong positive (p-value <= 0.05), indicating that when the level of control increases, the expertise tends to increase. considering the group of gl types whose correlations did not reach the 95% significance level, we identified six types; among these, 4 out of 6 (forums/lists of discussions, cases/services descriptions, keynote speeches, materials training) had moderate correlations, and for the remaining two (books/book chapters and magazine articles), we identified negligible correlations.

figure 4. classification of each gl source type according to the level of expertise. each level of expertise indicates: low = 0; moderate = 50; high = 100.

table 13. types of grey literature: control and expertise correlation test.
type of grey literature  spearman coefficient  p-value
blog post  .441*  .017
book/book chapter  .106  .607
case/soft. description  .341  .082
forum/discussion list  .337  .069
guideline  .518*  .004
keynote speeches  .305  .101
magazine article  .167  .377
manual  .620*  .000**
material training  .308  .104
news articles  .525*  .003
patent  .550*  .027
q&a websites  .656*  .000**
slide presentation  .593*  .001
soft. repository  .652*  .000**
technical report  .527*  .005
thesis  .546*  .013
tutorial  .688*  .000**
video  .671*  .000**
white paper  .769*  .000**
notes: * correlation is significant (strong) at rho >= 0.4 and p-value <= 0.05; ** the p-value is not zero (we used three decimal places).
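to make the pairing step concrete, the sketch below shows one way the per-type correlations in table 13 could be computed: answers outside the low/moderate/high scale are dropped to pair the samples, and spearman's coefficient is applied to the recoded values. the answers dict is an invented stand-in for the survey data, so the printed numbers will not match table 13.

```python
# a minimal sketch of the per-type analysis behind table 13. for each
# gl type, each respondent contributes a (control, expertise) pair;
# pairs with "no opinion" or "i did not consider it a gl type" are
# dropped before correlating. all answers below are invented.
from scipy.stats import spearmanr

levels = {"low": 0, "moderate": 50, "high": 100}
answers = {
    "blog post": [("low", "low"), ("moderate", "low"), ("no opinion", "high"),
                  ("high", "moderate"), ("low", "moderate"),
                  ("moderate", "moderate")],
}

for gl_type, pairs in answers.items():
    kept = [(c, e) for c, e in pairs if c in levels and e in levels]
    rho, p = spearmanr([levels[c] for c, _ in kept],
                       [levels[e] for _, e in kept])
    print(f"{gl_type}: rho={rho:.3f}, p={p:.3f} (n={len(kept)})")
```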
correlation analysis between the levels of the dimensions (control and expertise) and the respondent profiles

after analyzing our data, we conducted a chi-square test of independence between the respondent profiles and their inclination to answer "i did not consider it a gl type" or "i have no opinion". that is, we evaluated whether the fact that the respondent is a professor or not has any influence on not considering a source as gl or on having no opinion. table 14 presents our results.

table 14. chi-square test between respondent profiles and (i) not considered as gl, (ii) no opinion on control, and (iii) no opinion on expertise.
type of gl  i  ii  iii
blog post  .769  .526  .959
book/book chapter  .925  .959  .526
case/soft. description  .959  .959  .439
forum/discussion list  .959  .999  .959
guideline  .526  .526  .579
keynote speeches  .959  .999  .959
magazine article  .959  .526  .959
manual  .769  .526  .769
material training  .959  .999  .769
news articles  .959  .999  .769
patent  .883  .393  .726
q&a websites  .769  .526  .959
slide presentation  .959  .999  .925
soft. repository  .769  .999  .769
technical report  .959  .769  .769
thesis  .526  .999  .194
tutorial  .526  .999  .579
video  .959  .769  .579
white paper  .959  .959  .711

as we can see in table 14, we did not find a statistically significant association (p < 0.05) between the respondent profile and the inclination to have no opinion regarding the level of control and expertise, or to not consider a source as a gl type. therefore, based on our results, we did not reject any null hypothesis; i.e., either the respondent profile did not influence the answers, or our sample is not large enough to show this influence. we performed another chi-square statistical test to discover whether the respondent profiles affect the opinion on the low, moderate, or high level of control and expertise. for each factor (control or expertise) and gl type (blog posts, books/book chapters, etc.), we populated a 2x3 contingency table with the respondent profile as the row variable and the opinion (low, moderate, or high) as the column variable. table 15 presents the p-value from the chi-square statistical test for each contingency table.

table 15. chi-square test between respondent profiles and (i) expertise level and (ii) control level.
type of gl  expertise  control
blog post  .785  .100
book/book chapter  .958  .722
case/soft. description  .632  .293
forum/discussion list  .720  .557
guideline  .769  .853
keynote speeches  .185  .853
magazine article  .539  .692
manual  .496  .069
material training  .316  .690
news articles  .049  .205
patent  .651  .905
q&a websites  .567  .289
slide presentation  .478  .157
soft. repository  .387  .261
technical report  .848  .743
thesis  .746  .844
tutorial  .132  .707
video  .755  .894
white paper  .925  .752

table 15 shows the distribution of the p-values from each chi-squared test of independence. as we can see, there is no evidence that different respondent profiles have different opinions; the only exception regards the expertise attributed to news articles. the contingency table (see table 16) summarizes the comparison between the answers of professors/researchers and students for news articles. analyzing this result, we conclude that students consider news articles more credible.

table 16. contingency table from respondent profiles and the levels of expertise for news articles.
respondent profile  low  moderate  high
professors/researchers  7  1  0
students  8  13  0
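the sketch below runs the chi-square test of independence on the table 16 counts with scipy. two caveats: scipy rejects contingency tables whose expected frequencies contain zeros, so the empty "high" column is dropped, and the result depends on correction settings not stated in the text (with scipy's default yates correction for 2x2 tables, the p-value should land close to the .049 reported in table 15).

```python
# a minimal sketch of the chi-square test described above, applied to
# the table 16 counts (respondent profile vs. expertise level for news
# articles). the all-zero "high" column is dropped because scipy
# requires nonzero expected frequencies.
from scipy.stats import chi2_contingency

# rows: professors/researchers, students; columns: low, moderate
table16 = [[7, 1],
           [8, 13]]

chi2, p, dof, expected = chi2_contingency(table16)
print(f"chi2={chi2:.3f}, p={p:.3f}, dof={dof}")
```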
summary of rq6: we identified similar behaviors when considering the same gl type with respect to the two dimensions, control and expertise: most gl types ran between the low and moderate levels in both dimensions. we also identified some differences, such as cases where the median of the answers was at the low level for control and at a moderate level for expertise. the production rigor, the producer's reputation, the researcher's experience, and the possibility of peer interaction are the criteria employed by the researchers to assess a gl source. moreover, we found some misunderstandings about whether to consider some data sources as gl, mainly related to theses, patents, magazine articles, and books/book chapters. the correlation analysis between the control and expertise dimensions showed correlations varying from strong to very strong for most gl types, showing that when one dimension increases, the other tends to increase too, and the same happens when it decreases. considering the researcher profile, we did not find evidence that different researcher profiles have different opinions, except for news articles.

7 discussion

in this section, we discuss each research question, relating it to previous studies (section 7.1). then, we discuss some findings out of the scope of the rqs that caught our attention (section 7.2). we also present some advice to se researchers based on the lessons learned from this research and previous knowledge (section 7.3). finally, we discuss some threats to the validity of this work (section 7.4).

7.1 revisiting findings

in this section, we discuss our findings for each rq. even though we addressed rq1–rq4 in our previous study (kamei et al., 2020), in this work we include additional discussions and consider other related works not mentioned before.

(rq1) motivations to use or reasons to avoid gl

(i) even though our first investigation showed several motivations and benefits of using gl, our second investigation shows that most researchers avoid its use as a reference in scientific papers. (ii) we organized the motivations to use gl into five categories, three of which were similar to previous works. for instance, rainer and williams (2019) and zhang et al. (2020) also discussed the motivation to complement research findings; another related motivation was to understand problems, identified in three studies (rainer and williams, 2019; neto et al., 2019; zhang et al., 2020).

(rq2) types of grey literature used

we did not find previous primary studies focusing on this research question. we found tertiary studies that investigated the gl types most used in their selected studies; for instance, zhang et al. (2020) identified that the most common gl types in their list of selected secondary studies were (in order) technical reports, blog posts, books/book chapters, and theses. considering the types of gl used by brazilian se researchers, the most common are q&a websites (e.g., stack overflow), blog posts (e.g., from se firms, such as netflix, uber, facebook), and technical reports (e.g., from the sei). our investigation shows that most of these types are related to se practice, mainly retrieved from renowned firms or research institutions.

(rq3) criteria used to assess grey literature credibility

we found several criteria to assess gl credibility, most of them related to the gl producer being renowned (authors, institutions, and companies). these criteria caught our attention because we did not find any criterion mentioning the assessment of the gl content itself.
however, the identified challenge of lack of reliability is related to this, and previous work (williams and rainer, 2019) has investigated a set of criteria to assess gl content (e.g., rigor in presenting information, presenting empirical data, describing the methods of data collection).

(rq4) benefits and challenges of using grey literature

we identified some contradictory findings between the benefits and challenges of gl use; they are part of the trade-off between the traditional literature and the nature of gl. for instance, on the one hand, se researchers mentioned that it is easy to access and read gl content; on the other hand, they said it is difficult to search for and find information. the benefit is related to accessing the gl content without paywall restrictions and to the informal language usually employed; however, these same characteristics hinder automatic data extraction. we identified another trade-off: despite the perceived benefit of advancing the state of the art/practice, several researchers avoid the use of gl due to the challenges of lack of reliability and lack of scientific value. in part, those trade-offs are expected, showing the need for further investigations on how to improve the use of gl in se research, as we have done in this research. even though we confirmed some findings of the literature, the main benefit identified (easy to access and read) was not mentioned by previous studies (williams and rainer, 2017; rainer and williams, 2018, 2019; garousi et al., 2016). the same occurred with the challenges: for instance, the lack of scientific value was not identified in previous studies, even though it was the second most mentioned challenge in our investigation. we note that the benefits identified in this study are related to the results of our tertiary study (kamei et al., 2021). regarding the challenges, some findings from previous works (zhang et al., 2020; kamei et al., 2021) were not confirmed; for instance, the uncertain availability of gl was not identified in our investigation.

(rq5) prioritizing the criteria to assess grey literature credibility

this investigation confirmed some findings of survey 1 (kamei et al., 2020), showing that the most important credibility criteria are related to the gl source being produced by a renowned source. however, with the prioritization of criteria, some of these findings contrasted, partly because in the survey 1 results no criteria were related to assessing the gl content, while in survey 2 several se researchers considered citing academic sources and presenting empirical data important criteria. the criteria of citing academic sources, describing the collection methods, and presenting empirical data caught our attention due to the emphasis on applying scientific perspectives to assess gl sources. in our opinion, these criteria are difficult to use, as we discuss in the following: 1) according to williams (2018), online articles and blogs produced by se practitioners rarely mention academic sources; 2) gl sources are produced mainly by practitioners (kamei et al., 2021), and consultants/companies have different manners of expressing themselves than academics; and 3) most gl sources do not present empirical data; instead, they are primarily based on opinions and beliefs (rainer, 2017).
(rq6) types of grey literature vs. dimensions of control and expertise

some findings caught our attention because some gl types run between two, and sometimes three, levels of the classification of the dimensions, showing that different interpretations may occur for the same type; nevertheless, the correlation analysis showed a strong correlation between these interpretations for most of the gl types investigated. considering the respondents' profiles, differently from what we expected, our statistical analysis based on the chi-square test showed that different respondent profiles shared similar opinions about each source type being considered gl or not and concerning the levels of control and expertise. the criteria used by se researchers to classify these dimensions are mostly related to the rigor of the source, the researcher's experience, and the interaction the user is allowed to have with each gl type. still, some of them considered it challenging to classify considering only the source type, without a real example to be assessed in depth, as one researcher pointed out: "(...) the credibility will depend on who produced that content." moreover, we perceived that sources mainly produced by companies and institutions (e.g., technical reports, books/book chapters, theses) were considered to have a moderate to high level of control and expertise. in contrast, the sources commonly produced by se practitioners (e.g., forums/lists of discussions, blog posts, videos) have a low level of control and expertise. these findings caught our attention because, in the rq2 results, the most used gl sources run between the low and moderate levels. it appears that the benefits of and motivations for using gl outweigh the low level of control and expertise presented by these sources. with these findings, we reinforce the claim of garousi et al. (2019) that it is complicated to assess the dimensions of control and expertise alone: although they could point us in one direction, other essential criteria include identifying the gl's producer and content. for this reason, we advocate that se researchers use the concept of the "shades of gl" to classify and assess a gl source, because it recognizes the different perspectives on the nature of gl, although future investigations to set the limits between the tiers of the shades are essential. beyond that, we claim the importance of employing objective criteria to assess gl sources and of better supporting the gl classification according to the shades; as our findings showed, it could be essential to propose intermediate shades between each tier.

7.2 other discussions

in this section, we discuss some findings and important discussions unrelated to a specific research question. first, we discuss the relations among the researchers' perceptions of gl; second, we describe the relationship between the credibility criteria and the dimensions of credibility investigated; lastly, we discuss our findings on the perceptions of the different gl types.

perceptions of grey literature

we identified relations between the perceptions of gl, as shown in figure 5. for instance, we identified some motivations to use gl related to some of the benefits identified (dashed line) and some reasons to avoid gl related to some challenges of gl use (dotted line). in what follows, we discuss some of them.

figure 5. relationships identified between the motivations to use gl and the benefits, and between the reasons to avoid gl and the challenges.
the motivation "to complement research findings" is related to the benefit of using gl to provide "different results from scientific studies", as some respondents informed that the inclusion of gl could provide evidence not yet explored or identified in the research area. another one, "to answer practical and technical questions", is related to the benefit of "practical evidence", which is not perceived when using only the traditional literature. the reasons to avoid gl and the challenges identified are almost the same, except for the "lack of reliability" that hinders the replicability of the search for gl, which could be caused by the "non-structured information" of a gl source.

expertise criteria vs. dimensions of control and expertise

the most important criteria identified to assess gl credibility are related to the "producer reputation" and the "rigor" presented in the gl source. the first is related to the source being produced by a renowned author or institution, or being cited by a renowned source; the second to how the information is presented, for instance, whether it describes the methods used to collect the data. figure 6 presents these criteria. we also identified some relations between the credibility criteria and some reasons to classify the control and expertise dimensions, as shown in figure 6. the control dimension (dashed line) is related to "peer interaction", "producer reputation", and "rigor". for the expertise dimension (dotted line), the relations are the same as for the control dimension, with the addition of "researcher experience"; this last one is related to the researchers' own experience of using gl to assess its credibility.

figure 6. relationships identified between the grey literature expertise criteria and the dimensions of control and expertise.

gl types interpretation

in our second investigation, we found some misunderstandings in interpreting gl types (see table 11), even though those types were recognized as gl in some previous se works (e.g., maro et al. (2018), zhang et al. (2020)). in the following, we present the most common types that were not considered gl: thesis (11/34 occurrences), patents (6/34 occurrences), books/book chapters (6/34 occurrences), and magazine articles (3/34 occurrences). in this regard, one researcher pointed out: "i understand that theses and dissertations are not grey because external researchers formally assess them." we also found in previous studies some contradictions in interpreting a source type as gl or not: while hosseinzadeh et al. (2018) considered books/book chapters a gl type, the study of berg et al. (2018) did not; and while neto et al. (2019) considered theses a peer-reviewed source, rodríguez-pérez et al. (2018) classified them as a gl type. these misunderstandings were also identified in our previous investigation with secondary studies (kamei et al., 2021). in our opinion, these misunderstandings reflect on each source's classification regarding control and expertise: for most researchers, books/book chapters, technical reports, theses, and patents were not considered a gl type and were related to a high level of control and expertise (figures 3 and 4). it shows that the boundary between the peer-reviewed process and grey literature is unclear when considering only the source type.

7.3 lessons learned

with this investigation and the previous one (kamei et al., 2020), we showed how gl could contribute to se research.
however, some advice is important so that this use can be improved. for se researchers, our findings highlight points to pay attention to when searching, selecting, and using grey literature in se research: 1) explore the gl sources before using them in research, as there are several types of gl sources, to understand what evidence each gl source could provide, how it could benefit the research, and how to retrieve information from it, given the issues about the difficulty of searching; 2) it is important for researchers to be aware of a set of credibility criteria that could be used to assess gl sources, for instance, by selecting data produced by renowned sources (e.g., authors, institutions) and understanding how each credibility criterion could better fit each type of gl; 3) further criteria to improve gl credibility could be used, considering the various interpretations of gl assessment related to the control and expertise aspects; and 4) understand how to improve the search for gl using a systematic approach with methods and techniques to better deal with the content, aiming to reduce its lack of reliability.

7.4 threats to validity

this section discusses some limitations and threats to validity and what we have done to mitigate them.

construct validity: despite our efforts to improve our questionnaire, we identified two potential threats in our research: 1) specifically on the questions in which we asked the participant to classify each source type concerning the control and expertise dimensions. we mitigated this by informing the researchers that we know that control and expertise vary from source to source and by asking them to consider their most frequent experience with each data source. however, three researchers reported that assessing the dimensions of these gl types was difficult without considering the content and the producer; this difficulty may have introduced some bias. 2) we used a non-probability sample by convenience (baltes and ralph, 2021) because we intended to investigate only se researchers with previous experience in gl use; thus, we surveyed only the 53 brazilian researchers we knew had this experience.

internal validity: as our investigation relied on personal interpretation, we may have introduced biases during the data extraction and analysis. we tried to minimize them by using a paired approach with constant discussion between the researchers and by invoking a third researcher to revise the derived codes and categories.

external validity: our first investigation used a sample of se researchers from the largest se conference in brazil. this sample was representative of se research because we had a 30.4% response rate with a diversity of researchers (1/3 were women, 50% had a ph.d. in se, and 30% a master's). in our second investigation, we conducted our survey with the researchers from the first survey who mentioned they had used gl in se research; we received a 64.1% response rate. of these, almost 60% are professors or researchers with more than ten years of se research experience, and most have used gl in 2 to 5 scientific studies. nevertheless, as we focused on the brazilian se research community in both surveys, the findings may not apply to other populations. still, we used the peer-review process throughout this research, aiming to improve the external validity and draw more general conclusions.
conclusion validity: even with response rates of 30.4% and 64.1% in the two surveys, we may have lost some important information. for the first investigation, we mitigated this threat by comparing our results with previous studies conducted with different populations, showing that our results were similar. even though we reached a considerable response rate in the second investigation, our sample was small and focused only on the brazilian se researchers' perspective, which limits the generalization of the results. another threat is related to the correlation analysis between the dimensions of control and expertise for each gl type, because we did not explicitly ask the respondents about this correlation.

8 related works

this section groups the related works into studies that explored gl's credibility and quality assessment in se research. for each study presented, we show the differences concerning our work.

the grey literature review (glr) conducted by raulamo-jurvanen et al. (2017) focused on understanding how se practitioners choose a test automation tool by investigating the opinions and experiences of se practitioners expressed in gl sources. they analyzed the gl sources' credibility during the quality assessment according to the number of readers, number of shares, number of comments, and number of google hits for the titles, and by adopting backlink analysis (a reference comparable to a citation). our work differs because we provide different findings on assessing gl credibility; moreover, we also intend to understand the prioritization of a set of criteria identified in previous investigations (kamei et al., 2020; williams and rainer, 2019).

soldani et al. (2018) conducted another glr-based study, investigating the pains and gains of the use of microservices. they perceived that the traditional literature on the topic is still at an early stage even though companies work day-by-day with microservices, as witnessed by the considerable amount of gl on the subject. the authors considered a set of control-factor criteria to select gl sources: practical experience of the authors (5+ years), industrial case study, heterogeneity (presenting information about at least 5 top industrial domains), and implementation quantity (presenting detailed information). our work differs from this one because we focused on investigating and providing a set of general criteria that could be used to assess different types of gl sources.

williams and rainer conducted two studies investigating how to improve the quality and credibility assessment of blog articles in se research. the first study (williams and rainer, 2017) examined some criteria to evaluate blog articles as a source of se research evidence through two pilot studies (a systematic mapping study and preliminary analyses of blog posts); the findings showed some criteria for selecting a blog article's content (e.g., authentic, informative). the second study (williams and rainer, 2019) focused on finding credibility criteria to assess blog posts by selecting 88 candidate credibility criteria from the previous mapping study (williams and rainer, 2017). then, to gather opinions on a blog post and evaluate those credibility criteria, they surveyed 43 se researchers. some criteria were found, for instance, the presence of reasoning, reporting empirical data, and reporting data collection methods.
as discussed for the previous related works, our criteria were not focused on a specific type of gl. moreover, our identified criteria are different from williams and rainer's, and we tried to understand what each se researcher considered when assessing the different types of gl.

most recently, we conducted a tertiary study with secondary studies of se (kamei et al., 2021), presenting a critical review of gl use in secondary studies. in total, 446 studies were investigated, identifying 126 studies that searched for or included gl as a primary source. this finding showed that gl was not widely used in the analyzed studies, although gl use increased over the years. the tertiary study explored the benefits, challenges, and motivations to use or avoid gl. our work differs from this previous one because we asked the se researchers directly, unlike investigations with published studies, where these questions were not directly explored, leaving to the authors the option of including or not that information.

despite the similarity of these works to ours, there are differences in at least four points: i) we found a different set of credibility criteria: the source needed to be provided by renowned institutions, renowned companies, cited by others, and derived from academia; ii) we did not focus on a specific type of gl source; iii) we explored the experience of se researchers to understand the perspectives on the credibility of different gl types and how se researchers assess them; and iv) we investigated a set of prioritization criteria used to assess gl credibility.

9 conclusions and future works

although the use and investigation of grey literature in se research increased over the last years, they are still recent. in this work, we reported two investigations based on the brazilian se researchers' perspective to present an overview of gl source usage, potential benefits and challenges of its use, a set of criteria to assess gl credibility, and the perceptions about gl types concerning control and expertise criteria. our main findings show:

1. blogs, community websites, and technical experience/reports are the most common gl sources used by se researchers;
2. the main motivations to use gl are that its content could complement research findings by providing different results from scientific studies and answer practical and technical questions;
3. gl use is not widespread as a scientific reference due to some credibility and reliability constraints;
4. the use of the "shades of gl" can help se researchers to assess gl and interpret the different gl types, although we identified that se researchers have different interpretations of gl control and expertise;
5. the most relevant criteria used to assess gl credibility are that the gl source is provided by renowned authors, institutions, or companies, or is cited by a renowned source;
6. the most critical criteria to assess the control and expertise of a gl source are related to the producer's reputation and the rigor of the gl content presented;
7. there is a positive correlation for credibility criteria considering the dimensions of control and expertise for each gl type: when the level of control increases, the level of expertise tends to increase too;
8. we did not find significant differences between the opinions of graduate students and professors/researchers concerning the control and expertise dimensions analyzed for each gl type.
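finding 7 rests on a rank correlation between the control and expertise levels attributed to each gl type. as a minimal illustration, the python sketch below computes a spearman correlation over hypothetical ratings; the gl type names and all numbers are invented for illustration and are not the survey data.

from scipy.stats import spearmanr

# hypothetical per-gl-type mean ratings on a 1-5 scale (not the survey data)
gl_types  = ["blog", "community website", "technical report", "preprint"]
control   = [2.1, 1.8, 3.6, 4.2]   # perceived level of control
expertise = [2.5, 2.0, 3.9, 4.4]   # perceived level of expertise

rho, p = spearmanr(control, expertise)
print(f"spearman rho = {rho:.2f}, p = {p:.3f}")
# a positive rho means gl types perceived as more controlled also tend
# to be perceived as produced with more expertise, as in finding 7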
for replication purposes, all the data used in these investigations are available online at https://doi.org/10.5281/zenodo.5164714. for future work, we plan i) to expand our view by investigating other se research communities; and ii) to understand the gl credibility aspects in more depth, focusing on building an objective quality assessment instrument that covers these several gl types.

references

adams, j., hillier-brown, f. c., moore, h. j., lake, a. a., araujo-soares, v., white, m., and summerbell, c. (2016a). searching and synthesising 'grey literature' and 'grey information' in public health: critical reflections on three case studies. systematic reviews, 5(1):164.
adams, r. j., smart, p., and huff, a. s. (2016b). shades of grey: guidelines for working with the grey literature in systematic reviews for management and organizational studies. international journal of management reviews, 19(4):432–454.
baltes, s. and ralph, p. (2021). sampling in software engineering research: a critical review and guidelines.
berg, v., birkeland, j., nguyen-duc, a., pappas, i. o., and jaccheri, l. (2018). software startup engineering: a systematic mapping study. journal of systems and software, 144:255–274.
bonato, s. (2018). searching the grey literature. rowman & littlefield.
braun, v. and clarke, v. (2006). using thematic analysis in psychology. qualitative research in psychology, 3(2):77–101.
coelho, j., valente, m. t., milen, l., and silva, l. l. (2020). is this github project maintained? measuring the level of maintenance activity of open-source projects. information and software technology, 1:1–35.
dancey, c. p. and reidy, j. (2004). statistics without maths for psychology: using spss for windows. prentice-hall, inc., usa.
garousi, v., felderer, m., and hacaloğlu, t. (2017). software test maturity assessment and test process improvement: a multivocal literature review. information and software technology, 85:16–42.
garousi, v., felderer, m., and mäntylä, m. v. (2016). the need for multivocal literature reviews in software engineering: complementing systematic literature reviews with grey literature. in proceedings of the 20th international conference on evaluation and assessment in software engineering, ease '16, pages 26:1–26:6, new york, ny, usa. acm.
garousi, v., felderer, m., and mäntylä, m. v. (2019). guidelines for including grey literature and conducting multivocal literature reviews in software engineering. information and software technology, 106:101–121.
hosseinzadeh, s., rauti, s., laurén, s., mäkelä, j.-m., holvitie, j., hyrynsalmi, s., and leppänen, v. (2018). diversification and obfuscation techniques for software security: a systematic literature review. information and software technology, 104:72–93.
kamei, f., wiese, i., lima, c., polato, i., nepomuceno, v., ferreira, w., ribeiro, m., pena, c., cartaxo, b., pinto, g., and soares, s. (2021). grey literature in software engineering: a critical review. information and software technology, page 106609.
kamei, f., wiese, i., pinto, g., ribeiro, m., and soares, s. (2020). on the use of grey literature: a survey with the brazilian software engineering research community. in proceedings of the xxxiv brazilian symposium on software engineering, sbes 2020, new york, ny, usa. association for computing machinery.
linåker, j., sulaman, s., maiani de mello, r., and martin, h. (2015). guidelines for conducting surveys in software engineering. technical report, lund university.
maro, s., steghöfer, j.-p., and staron, m. (2018). software traceability in the automotive domain: challenges and solutions. journal of systems and software, 141:85–110.
neto, g. t. g., santos, w. b., endo, p. t., and fagundes, r. a. a. (2019). multivocal literature reviews in software engineering: preliminary findings from a tertiary study. in proceedings of the acm/ieee international symposium on empirical software engineering and measurement, esem '19, pages 1–6.
oliveira, j. a., viggiato, m., pinheiro, d., and figueiredo, e. (2021). mining experts from source code analysis: an empirical evaluation. journal of software engineering research and development, 9(1):1:1–1:16.
petticrew, m. and roberts, h. (2006). systematic reviews in the social sciences: a practical guide, volume 11. blackwell publishing ltd.
pinto, g., ferreira, c., souza, c., steinmacher, i., and meirelles, p. (2019). training software engineers using open-source software: the students' perspective. in proceedings of the ieee/acm 41st international conference on software engineering: software engineering education and training, icse-seet '19, pages 147–157. institute of electrical and electronics engineers (ieee).
rainer, a. (2017). using argumentation theory to analyse software practitioners' feasible evidence, inference and belief. information and software technology, 87:62–80.
rainer, a. and williams, a. (2018). using blog articles in software engineering research: benefits, challenges and case–survey method. in proceedings of the 25th australasian software engineering conference, aswec '18, pages 201–209.
rainer, a. and williams, a. (2019). using blog-like documents to investigate software practice: benefits, challenges, and research directions. journal of software: evolution and process, 31(11):e2197.
raulamo-jurvanen, p., mäntylä, m., and garousi, v. (2017). choosing the right test automation tool: a grey literature review of practitioner sources. in proceedings of the 21st international conference on evaluation and assessment in software engineering, ease '17, pages 21–30. acm.
rodríguez-pérez, g., robles, g., and gonzález-barahona, j. m. (2018). reproducibility and credibility in empirical software engineering: a case study based on a systematic literature review of the use of the szz algorithm. information and software technology, 99:164–176.
saltan, a. (2019). do we know how to price saas: a multivocal literature review. in proceedings of the 2nd acm sigsoft international workshop on software-intensive business: start-ups, platforms, and ecosystems, iwsib 2019, pages 7–12. acm.
schöpfel, j. and prost, h. (2020). how scientific papers mention grey literature: a scientometric study based on scopus data. collection and curation.
soldani, j., tamburri, d. a., and heuvel, w.-j. v. d. (2018). the pains and gains of microservices: a systematic grey literature review. journal of systems and software, 146:215–232.
spencer, d. (2009). card sorting: designing usable categories. rosenfeld media.
storey, m.-a., singer, l., cleary, b., filho, f. f., and zagalsky, a. (2014). the (r)evolution of social media in software engineering. in proceedings of the future of software engineering, fose '14. acm press.
storey, m.-a., zagalsky, a., filho, f. f., singer, l., and german, d. m. (2017). how social and communication channels shape and challenge a participatory culture in software development. ieee transactions on software engineering, 43(2):185–204.
stray, v. and moe, n. b. (2020). understanding coordination in global software engineering: a mixed-methods study on the use of meetings and slack. journal of systems and software, 170:110717.
tom, e., aurum, a., and vidgen, r. (2013). an exploration of technical debt. journal of systems and software, 86(6):1498–1516.
viera, a. j. and garrett, j. m. (2005). understanding interobserver agreement: the kappa statistic. family medicine, 37(5):360–363.
williams, a. (2018). using reasoning markers to select the more rigorous software practitioners' online content when searching for grey literature. in proceedings of the 22nd international conference on evaluation and assessment in software engineering, ease '18, pages 46–56. acm.
williams, a. and rainer, a. (2017). toward the use of blog articles as a source of evidence for software engineering research. in proceedings of the 21st international conference on evaluation and assessment in software engineering, ease '17, pages 280–285, new york, ny, usa. acm.
williams, a. and rainer, a. (2019). how do empirical software engineering researchers assess the credibility of practitioner-generated blog posts? in proceedings of the 23rd international conference on evaluation and assessment in software engineering, ease '19, pages 211–220. acm.
zahedi, m., rajapakse, r. n., and babar, m. a. (2020). mining questions asked about continuous software engineering: a case study of stack overflow. in li, j., jaccheri, l., dingsøyr, t., and chitchyan, r., editors, ease '20: evaluation and assessment in software engineering, trondheim, norway, april 15-17, 2020, pages 41–50. acm.
zhang, h., zhou, x., huang, x., huang, h., and babar, m. a. (2020). an evidence-based inquiry into the use of grey literature in software engineering. in proceedings of the 42nd international conference on software engineering, icse '20.

journal of software engineering research and development, 2023, 11:3, doi: 10.5753/jserd.2023.2581 this work is licensed under a creative commons attribution 4.0 international license.
investigating the relationship between technical debt management and software development issues

clara berenguer [ salvador university | claraberenguerledo@gmail.com ]
adriano borges [ salvador university | arborges.12@gmail.com ]
sávio freire [ federal institute of ceará and federal university of bahia | savio.freire@ifce.edu.br ]
nicolli rios [ federal university of rio de janeiro | nicolli@cos.ufrj.br ]
robert ramač [ university of novi sad | ramac.robert@uns.ac.rs ]
nebojša taušan [ university of novi sad | nebojsa.tausan@ef.uns.ac.rs ]
boris pérez [ francisco de paula santander university | br.perez41@uniandes.edu.co ]
camilo castellanos [ university of los andes | cc.castellanos87@uniandes.edu.co ]
darío correal [ university of los andes | dcorreal@uniandes.edu.co ]
alexia pacheco [ university of costa rica | alexia.pacheco@ucr.ac.cr ]
gustavo lópez [ university of costa rica | gustavo.lopezherrera@ucr.ac.cr ]
manoel mendonça [ federal university of bahia | manoel.mendonca@ufba.br ]
davide falessi [ university of rome tor vergata | d.falessi@gmail.com ]
carolyn seaman [ university of maryland baltimore county | cseaman@umbc.edu ]
vladimir mandić [ university of novi sad | vladman@uns.ac.rs ]
clemente izurieta [ montana state university and idaho national laboratories | clemente.izurieta@montana.edu ]
rodrigo spínola [ virginia commonwealth university and salvador university | spinolaro@vcu.edu ]

abstract
context: the presence of technical debt (td) brings risks to software projects. managers must continuously find a cost-benefit balance between the benefits of incurring td and the costs of its presence in a software project. much attention has been given to td related to coding issues, but other types of debt can also have impactful consequences on projects. aims: this paper seeks to elaborate on the growing need to expand td research to other areas of software development by analyzing six elements related to td management, namely: causes, effects, preventive practices, reasons for non-prevention, repayment practices, and reasons for non-repayment of td. method: we survey and analyze, quantitatively and qualitatively, the answers of 653 software industry practitioners on td to investigate how the previously mentioned elements are related to coding and non-coding issues of the software development process. results: coding issues are commonly related to the investigated elements but, indeed, they are only one part of td management. issues related to project planning and management, human factors, knowledge, quality, process, requirements, verification, validation, and test, design, architecture, and the organization are also common sources of td. we organize the results in a hump diagram and specialize it considering the point of view of practitioners that have used agile, hybrid, and traditional process models in their projects. conclusion: the hump diagram, in combination with the detailed results, provides guidance on what to expect from the presence of td and how to react to it considering several issues of software development. the results shed light on td management of software elements beyond source code related artifacts.
keywords: technical debt, technical debt management, causes of technical debt, effects of technical debt, process model

1 introduction

technical debt (td) refers to postponed tasks or immature artifacts in software projects that can bring short-term benefits (e.g., higher productivity and lower costs), but may have harmful impacts in the long run (izurieta et al. 2012). by managing td items, software teams can reduce the risks associated with these items, such as unexpected delays in system evolution or difficulty in achieving quality criteria defined for the project (rios et al. 2020).

technical debt management (tdm) is a challenging endeavor. successful tdm is about reaching a balance between the benefits of incurring td and the later impacts of its presence in a software project (lim et al. 2012, guo et al. 2016). tdm must seek to define preventive practices to avoid potential td items and the appropriate actions to repay incurred debt (li et al. 2015, ribeiro et al. 2016, freire et al. 2020a, freire et al. 2020b). tdm requires knowledge of the causes that lead software teams to incur debt items and of the effects of their presence in software projects (rios et al. 2020, besker et al. 2020). knowing the causes of td can support software teams in understanding their project context and in defining preventive practices to avoid the debt. having information on td effects can aid in the prioritization of td items to be paid off, supporting a more precise impact analysis and the identification of corrective actions to minimize possible negative consequences of td items for the project.

although td was initially associated with code-level issues, it can impact any type of software artifact and activity (alves et al. 2016, rios et al. 2018). for example, outdated requirement documentation can lead to code that does not meet user requirements. despite the growing number of studies on td, there is a clear concentration of studies investigating it from the perspective of source code and its related artifacts (zazworka et al. 2014, alves et al. 2016, rios et al. 2018). focusing solely on coding is risky, because td can affect many other software activities. but how can one identify and manage td related to different software activities?

this paper elaborates on the growing need to expand td research to other areas of software development. it analyzes six elements related to tdm: causes, effects, preventive practices, reasons for non-prevention, repayment practices, and reasons for non-repayment of td, for several types of software artifacts and activities. the paper uses a subset of the data collected by the insightd project, a globally distributed family of surveys on causes, effects, and management of td (rios et al. 2020). this data set consists of data from six countrywide replications of the survey, totaling 653 responses from software practitioners. by investigating how practitioners face td in their projects, we gain insight into the state of practice regarding tdm, which allows us to identify existing gaps in tdm theory. the data are analyzed qualitatively and quantitatively to investigate whether the above listed tdm elements are more related to coding or to non-coding issues (e.g., planning and management, requirements engineering, human factors) of software development. this paper is based on our previous work by berenguer et al.
(2021), extending it by including:
• a more comprehensive analysis of the relation between td and non-coding activities,
• specializations of the hump diagram by process model (agile, hybrid, and traditional), and
• an analysis between td, coding, and non-coding activities by process model.

our results indicate that both coding and non-coding activities are commonly affected by td, but causes, effects, preventive practices, reasons for non-prevention, and reasons for non-repayment affect non-coding activities more than coding activities. for repayment practices, we found similar behaviors between the two groups (coding and non-coding activities). considering all the investigated tdm elements, some software development issues are more commonly reported by practitioners. planning and management issues and human factors stand out, but there are also several issues related to debt items such as process, knowledge, td management, and requirements engineering issues. concerning the analysis per process model, we found that practitioners following agile, hybrid, or traditional process models shared a similar view on td elements affecting coding activities. on the other hand, practitioners who use traditional process models have a different view from those using agile and hybrid process models on td elements affecting non-coding activities. results are presented with a hump diagram that, in combination with the analyses of each of the investigated td management elements, provides guidance on what to expect from the presence of td and how to react to it considering several issues of the software development process.

in addition to this introduction, this paper has seven additional sections. section 2 presents background information on td research and related work. section 3 describes the methodology used. section 4 presents the results of this work. section 5 presents the hump diagram and its specializations by process models. section 6 summarizes the results and discusses their implications for researchers and practitioners. section 7 discusses the threats to validity. lastly, section 8 presents our concluding remarks.

2 background

td can be incurred at any time and in several artifacts throughout the software development process. as such, it has different characteristics depending on the time it was incurred and the activities it is related to, such as testing, code, build, documentation, and so on (alves et al. 2016). although td is a rising research topic, many studies focus solely on its relationship to source code. li et al. (2015) investigated studies on td and its management (tdm), carrying out a classification and thematic analysis to comprehensively understand the concept of td and present an overview of the current state of research in tdm. in their results, code debt was the most cited type among the analyzed primary studies. in alves et al. (2016), the authors also reported a focus on approaches to identify td items from source code. the authors suggested that a possible explanation for this is that there is a plethora of source code analysis tools that can be used to support the detection of td. in another study, rios et al. (2018) presented fifteen types of td. the authors also indicated that there is a concentration of studies focusing on source code, and they gave some explanations for this phenomenon.
the term td was first coined by cunningham (1992), who directly related it to source code, which may have influenced subsequent studies. furthermore, the types related to code tend to cause effects that can be felt more quickly by development teams. more recently, saraiva et al. (2021) performed a systematic mapping study to investigate the current state of the art of td tools, identifying which activities, functionalities, and types of debt are handled by the existing tools to support td management. the study identified 50 tools, 42 of which are new tools and 8 of which extend an existing one. the main td types addressed by the tools deal with source code (60%, 30/50), architectural issues (40%, 20/50), and design issues (28%, 14/50). the distribution of tools over the categories was mainly: quantifying code properties, architectural smell detection, pattern matching, cost-benefit analysis, project management, and code smell detection. the authors also reinforce that this trend is in line with the original definition of td, which is heavily defined by concepts coming from source code and related issues. lenarduzzi et al. (2021) also performed a systematic mapping study to understand which td prioritization approaches have been proposed in research and industry. the results showed that code debt (38%), architecture debt (24%), and design debt (10%) are by far the most frequently investigated types of debt when considering td prioritization, although there is scant evidence on other types like test and requirement debt. thus, the approaches mainly involve models that reduce td by acting on source code, removing or refactoring code smells, or addressing other metrics.

such a concentration of studies at the code level is a worrying scenario because other types of debt can also have impactful or even worse consequences on projects. we claim that it is necessary to go beyond the source code and investigate other facets of td. we do it from the perspective of td causes, effects, prevention, and repayment, using data collected from the insightd project, presented in the next section.

3 research method

this section presents the insightd project in which this work is contextualized, our research questions, and the data collection and analysis procedures.

3.1 the insightd project

insightd is a family of globally distributed industrial surveys, present in countries such as brazil, chile, colombia, costa rica, the united states, and serbia. it aims to investigate the causes, effects, and management of td in software projects. several results of the project have been disseminated so far, for example: the empirical design of insightd and the results of its brazilian replication on causes and effects of td (rios et al. 2020), probabilistic diagrams of causes and effects of td (rios et al. 2019), the set of causes and effects of td collected from six insightd replications (ramač et al. 2022, freire et al. 2021b), the relation between td and process models (rios et al. 2021), td prevention (freire et al. 2020a, freire et al. 2021a), and practices and impediments to repay td items (freire et al. 2020b, perez et al. 2020, freire et al. 2021a, freire et al. 2021c). other results from the project can be found at http://www.td-survey.com/publication-map/. concerning the relation between td and coding or other development issues, we investigated it in our previous work (berenguer et al.
2021). in this paper, we further investigate it by including:
• a more comprehensive analysis of the relation between td and non-coding activities, as shown in section 4,
• specializations of the hump diagram by process model (agile, hybrid, and traditional), as presented in section 5, and
• an analysis between td, coding, and non-coding activities by process model, as discussed in subsection 5.2.

3.2 research questions

in this work, we investigate whether td management elements (causes, effects, prevention, and repayment) are more related to coding issues or to other software development issues. to this end, we consider the following research questions:
• rq1: are the causes of td more related to coding issues or other software development issues?
• rq2: are the effects of td more felt in coding issues or other issues in the software development process?
• rq3: is td prevention more related to coding issues or other issues in the software development process?
• rq4: are the reasons for not preventing td more related to coding issues or other development issues?
• rq5: is td repayment more associated with coding issues or other issues in the software development process?
• rq6: are the reasons for not paying td more related to coding issues or other development issues?

3.3 data collection

this study uses a subset of the available data, covering 18 questions from the insightd questionnaire. table 1 shows these questions and reports their type and the rq they refer to. questions q1 through q8 document the characteristics of the survey respondents. more specifically, in q8, the respondents inform the process model adopted in their projects, choosing one of the following options: agile (a lightweight process that promotes iterative development, close collaboration between the development team and business side, constant communication, and tightly-knit teams); hybrid (the combination of agile methods with other non-agile techniques, for example, a detailed requirements effort followed by sprints of incremental delivery); and traditional (conventional document-driven software development methods that can be characterized by extensive planning, standardization of development stages, formalized communication, significant documentation, and design up front). more information on the closed questions' options is available in rios et al. (2020). in q13, respondents provide an example of a td item that occurred in their projects. participants discuss causes of td in q16 through q18 and effects in q20. we use the answers given to these questions for answering rq1 (q16-q18) and rq2 (q20). concerning td prevention, participants give their responses in q22 and q23, and they address td repayment in q26 and q27. the answers given to these questions are used for answering rq3-4 (q22 and q23) and rq5-6 (q26 and q27). we invited only software practitioners from the brazilian, chilean, colombian, costa rican, north american, and serbian software industries, reached through linkedin, industry-affiliated member groups, and industry partners, to answer the survey.

table 1. subset of the insightd survey's questions (adapted from rios et al. (2020)).
q1 (closed): what is the size of your company?
q2 (closed): in which country are you currently working?
q3 (closed): what is the size of the system being developed in that project? (loc)
q4 (closed): what is the total number of people of this project?
q5 (closed): what is the age of this system up to now or to when your involvement ended?
q6 (closed): to which project role are you assigned in this project?
q7 (closed): how do you rate your experience in this role?
q8 (closed): which of the following most closely describes the development process model you follow on this project?
q10 (open): in your words, how would you define td?
q13 (open): please give an example of td that had a significant impact on the project that you have chosen to tell us about.
q16 (open; rq1): what was the immediate, or precipitating, cause of the example of td you just described?
q17 (open; rq1): what other cause or factor contributed to the immediate cause you described above?
q18 (open; rq1): what other causes contributed either directly or indirectly to the occurrence of the td example?
q20 (open; rq2): considering the td item you described in question 13, what were the impacts felt in the project?
q22 (closed; rq3-4): do you think it would be possible to prevent the type of debt you described in question 13?
q23 (open; rq3-4): if yes, how? if not, why?
q26 (closed; rq5-6): has the debt item been repaid (eliminated) from the project?
q27 (open; rq5-6): if yes, how? if not, why?

3.4 data analysis procedures

the analysis procedures are divided into three steps: demographics, preparing data for analysis, and data classification and analysis.

3.4.1 demographics

we calculated the number of respondents choosing each option available in the closed questions of the survey.
we then summarized the participants' characterization.

3.4.2 preparing data for analysis

for the open-ended questions, we applied a coding process (strauss and corbin 1998). for the answers given to q16 through q18 and q20, we used the coding process described in rios et al. (2020) to identify a set of causes and effects, as well as the number of occurrences of each. to exemplify, let us consider the answers given by two respondents in q16: "poorly developed code" and "low quality code". as these answers are associated with problems in source code, they were unified under the cause sloppy code. we used the coding process described in freire et al. (2020a) to code the responses to q23. through this process, we identified practices for td prevention when q22 received a positive response; otherwise, we identified reasons for td non-prevention. an example of this process is as follows: two respondents provided the following answers in q23 when q22 had a negative answer: "requirements are always going to change during development..." and "because when the client asks for features abruptly, no matter how generalized the architecture is towards the problem, with an outlier there may be, that can mean a refactor of the code, and that could dirty the code, reducing its maintainability". as these answers are associated with requirements change requests, they were unified under the reason for td non-prevention requirements change. finally, we coded the responses to q27 using the coding procedure described in freire et al. (2020b). similarly, if q26 received a positive response, we identified td repayment practices; otherwise, we identified non-repayment reasons. for both prevention and repayment, we also obtained a list of practices and reasons and their corresponding number of occurrences. for example, two respondents provided the following answers in q27 when q26 had a positive answer: "we rewrote the offending code" and "it was fixed, code was refactored and greatly simplified". these answers were unified under the repayment practice code refactoring.

at least two researchers from each replication team participated in the coding process. the brazilian replication team created the first codified list of causes, effects, prevention practices, reasons for not preventing, repayment practices, and reasons for not repaying, which was distributed to the other replication teams in order to standardize the nomenclature used. the consistency was verified by the brazilian replication team.
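to make these steps concrete, the python sketch below unifies raw answers under codes and counts occurrences, and it previews the grouping into categories described in the next subsection; the mappings and answers are toy examples built from the quotes above, not the insightd code book.

from collections import Counter

# step 1: unify raw open-ended answers under codes (illustrative mapping)
answer_to_code = {
    "poorly developed code": "sloppy code",
    "low quality code": "sloppy code",
    "lack of prioritization of activities": "inappropriate planning",
    "deficiency in project planning (disorganization)": "inappropriate planning",
}

# step 2: group codes into higher-level categories (illustrative mapping)
code_to_category = {
    "sloppy code": "coding",
    "inappropriate planning": "planning and management",
}

raw_answers = [
    "poorly developed code",
    "low quality code",
    "lack of prioritization of activities",
]

codes = [answer_to_code[a] for a in raw_answers]
print(Counter(codes))                                # occurrences per code
print(Counter(code_to_category[c] for c in codes))   # occurrences per category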
3.4.3 data classification and analysis

we began by analyzing the codes of each td management element to determine whether they are related to coding issues or other software development issues. repayment practices such as bug fixing, code refactoring, and code reuse, for example, were classified as practices related to coding issues. however, the repayment practices prioritizing td items and updating system documentation were linked to other software development issues. this procedure was carried out independently by the first and second authors. the third (prevention and repayment) and fourth (causes and effects) authors reached an agreement, and the final classification was also reviewed by the last author.

next, we classified the td management elements related to the other software development issues using the grouping process defined by strauss and corbin (1998). the categories show the relationship between software development process issues (for example, requirements engineering issues, planning and management issues, and human factors issues) and each td management element. the names of the categories are derived from the ongoing process of grouping the td management elements around the central concern to which they are related. the causes deadline and inappropriate planning, for example, are part of the category planning and management issues, whereas the effects team demotivation and dissatisfaction of the parties involved are part of the category human factors. this procedure was carried out independently by the first and second authors. the third (prevention and repayment) and fourth (causes and effects) authors reached a consensus, and the final result was reviewed by the last author.

4 results

participants were asked to provide a definition of td (q10) and then an example of a significant td item from their professional experience (q13). as detailed in rios et al. (2020), the answers given to q13 were used as a criterion for the inclusion of participants: if they did not provide a valid example, their responses were discarded. in total, we considered the responses of 653 professionals from six countries (brazil = 107, chile = 89, colombia = 134, costa rica = 145, serbia = 79, us = 99). next, we present the characterization data of the participants, as well as the answers to the research questions posed in this study.

4.1 demographics

figure 1 presents the demographic information. half of the participants identified themselves as developers, but managers (17%), testers (7%), software architects (13%), and other roles (13%) also answered the questionnaire. the participants also described their experience level in their role.
most of them are competent (good working and background knowledge of the area of practice, 34% of the participants), followed by proficient (depth of expertise in the discipline and area of practice, 31%), expert (authoritative understanding of the discipline and deep tacit knowledge throughout the area of practice, 21%), beginner (working knowledge of key factors of practice, 12%), and novice (minimal or "textbook" knowledge without connecting it to practice, 2%). the majority of the participants worked in middle-sized companies (39%), followed by small (32%) and large (29%) ones. further, participants normally worked in teams of 5-9 people (34%), but participants working in teams of 10-20 people (22%), fewer than five people (20%), more than 30 people (16%), and 21-30 people (8%) also answered the questionnaire. concerning the process models adopted, the participants followed hybrid (45%), agile (42%), and traditional (13%) process models. regarding the systems, the respondents normally worked with systems of 10-100 kloc (35%), followed by systems of 100 kloc-1 mloc (30%), less than 10 kloc (14%), 1-10 mloc (14%), and more than 10 mloc (7%). lastly, most systems are 2-5 years old, followed by 1-2 years old (23%), less than one year old (17%), 5-10 years old (15%), and more than 10 years old (11%).

figure 1. participants' demographics.

in summary, our data set is composed of answers from practitioners with different organization and team sizes, system ages and sizes, roles, experience levels, and adopted process models. in the following subsections, we present the detailed results for each investigated td management element. we use the same structure when describing the results. for example, for the element td cause, we initially (i) present the overall result. next, we (ii) discuss the causes related to coding issues. then, we (iii) present the causes related to the other software development issues, and (iv) analyze the types of those issues (e.g., planning and management, human factors, knowledge issues).

4.2 rq1: are the causes of td more related to coding issues or other software development issues?

in total, 96 causes¹ that lead to the occurrence of td were identified, totaling 1695 citations. of this total, ~92% were related to other development issues, while only ~8% were related to code. this indicates a significant difference between the two subsets, representing a tendency for other software development issues to influence the occurrence of td items. there are 13 causes related to coding. the ten most commonly cited are presented in the second column of table 2; the complete list is available at https://bit.ly/37bopif. the causes non-adoption of good practices, sloppy code, and lack of refactoring stand out; all of them indicate issues that compromise the internal quality of the product. alternatively, we identified 83 causes related to other software development issues. the three most commonly cited (third column of table 2) reflect concerns focused on project management and planning: deadline, not effective project management, and inappropriate planning. other issues related to the team's lack of technical knowledge and experience, pressure, and processes were also commonly mentioned.
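before turning to the individual causes in table 2, a minimal sketch of the percentage split used throughout this section is shown below; the exact coding/other breakdown of the 1695 citations is an assumption chosen only to be consistent with the reported ~8%/~92%.

# minimal sketch of the coding vs. other-issues split reported for rq1;
# the breakdown of the 1695 citations is assumed, consistent with ~8%/~92%
citations = {"coding": 136, "other development issues": 1559}
total = sum(citations.values())  # 1695 citations in total, as reported

for issue, count in citations.items():
    print(f"{issue}: ~{count / total:.0%} of citations")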
table 2. the 10 most cited causes related to coding and other software development issues.
coding: 1st non-adoption of good practices (54); 2nd sloppy code (21); 3rd lack of refactoring (17); 4th external component dependency (12); 5th adoption of contour solutions as definitive (11); 6th lack of reuse practices (5); 7th lack of automated testing (5); 8th discontinued component (4); 9th concern with just back-end development (4); 10th inadequate data model (3).
other development issues: 1st deadline (169); 2nd not effective project management (98); 3rd inappropriate planning (83); 4th lack of technical knowledge (80); 5th producing more at the expense of quality (67); 6th inappropriate / poorly planned / poorly executed test (59); 7th lack of experience (58); 8th inaccurate time estimate (56); 9th lack of qualified professional (54); 10th pressure (53).

¹ some causes seem to overlap. for example, non-adoption of good practices could cover the causes lack of refactoring or lack of reuse practices. however, the cause non-adoption of good practices refers to the non-use of good practices that would facilitate the accomplishment and maintenance of activities in the project, as can be observed in the following responses from participants: "employment of bad design practices" and "lack of use of good software development practices". on the other hand, lack of refactoring refers to situations in which the team does not improve the internal structure of the code without changing its external behavior, as exemplified in "lack of code refactoring" and "there was no code refactoring at the beginning of the problem". in turn, lack of reuse practices occurs when existing software components or component knowledge are not used in the construction of new software, for example, "need to create the culture of reusability". another example of overlap encompasses the causes not effective project management and inappropriate planning: the cause not effective project management refers to inadequate management during project development, as reported in "not following planning" and "lack of understanding of managers", while the cause inappropriate planning refers to issues in project planning, for example, "lack of prioritization of activities" and "deficiency in project planning (disorganization)".

we observed that those causes were related to each other and grouped them, identifying 14 categories of causes that reflect the main concerns that practitioners have during the development of software projects:
• planning and management: refers to causes related to the project's planning and management issues. some examples are deadline, inappropriate planning, and not effective project management;
• human factors: groups causes related to people's participation in project issues. some examples are lack of experience and lack of commitment;
• knowledge issues: groups items originating from concerns around the knowledge of team members. two examples are lack of technical knowledge and lack of domain knowledge;
• requirements engineering: encompasses the causes related to requirements issues. examples are change of requirements and requirements elicitation issues;
• verification, validation, and testing: encompasses the causes related to the execution of quality assurance activities. two examples are inappropriate/poorly planned/poorly executed test and lack of code review;
• architectural issues: groups causes related to decisions made regarding software architecture. examples are inadequate technical decisions and problems in architecture;
• process issues: refers to causes related to the definition or execution of the processes used in the development of the software. two examples are lack of a well-defined process and lack of traceability of bugs;
• design issues: encompasses causes related to the design of the software. there are two causes in this category: poor design and changes in design;
• documentation: groups causes related to documentation. examples of causes in this category are nonexistent documentation and outdated/incomplete documentation;
• external factors: refers to causes associated with external factors, such as customer does not listen to the project team and structural change in the involved organizations;
• infrastructure issues: encompasses causes related to problems in the software development infrastructure, such as required infrastructure unavailable and updating existing tools;
• organizational issues: groups causes from the organizational context, such as lack of awareness of the importance of testing and refactoring and organizational misalignment;
• quality issues: refers to causes associated with the lack of quality in software artifacts; this category contains only the cause lack of quality;
• td management: encompasses causes related to the management of td items. this category has only the cause lack of perception of the importance of dealing with td.

table 3 shows the categories together with the corresponding number of causes, number of citations, and percentage of cited causes relative to the other categories. the category planning and management stood out with ~47% of citations, representing more than three times the citations of the second-ranked category. this is an indication that the causes of the occurrence of td are strongly related to project management issues. the results also highlight the importance of human factors, which occupy the second position with ~13% of citations. this result is somewhat aligned with previous work on social debt (tamburri et al. 2015, martini et al. 2019). concerns related to requirements engineering and issues related to knowledge were also commonly mentioned.

table 3. categories of causes related to other software development issues (number of causes; number of citations; ~% of cited causes).
planning and management: 22; 733; 47%
human factors: 10; 206; 13%
knowledge issues: 7; 128; 9%
requirement engineering: 7; 120; 8%
vv&t: 6; 91; 6%
architectural issues: 6; 63; 5%
process issues: 6; 54; 4%
design issues: 2; 45; 3%
documentation: 4; 37; 2%
external factors: 4; 25; 2%
organizational issues: 3; 25; 2%
infrastructure issues: 4; 15; 1%
quality issues: 1; 12; 1%
td management: 1; 1; 0.1%

4.3 rq2: are the effects of td more felt in coding issues or other issues in the software development process?

the participants reported a total of 73 td effects, totaling 980 citations. among them, ~64% are related to other development issues and ~36% are related to coding. there are 18 coding-related effects experienced by the participants. the 10 most commonly cited are presented in table 4 (second column); the full list is available at https://bit.ly/37bopif. concerns about the capacity of the team to evolve the code, rework, and the need to employ refactoring practices to improve the internal quality of the software are common. other common effects are bad code, low performance, and stopping development for debt repayment.
table 4. the 10 most cited effects related to coding and other development issues.
coding: 1st low maintainability (97); 2nd rework (86); 3rd need of refactoring (35); 4th bad code (31); 5th low performance (28); 6th stop development activities for debt repayment (14); 7th increase in the amount of maintenance activities (13); 8th difficulty in implementing the system (10); 9th low code reuse (8); 10th low reliability (7).
other development issues: 1st delivery delay (141); 2nd low external quality (78); 3rd financial loss (55); 4th increased effort (41); 5th stakeholder dissatisfaction (34); 6th team demotivation (24); 7th stress with stakeholders (23); 8th team overload (16); 9th fall in productivity (13); 10th project not completed (13).

we identified 55 effects related to other development issues. the four most commonly cited (third column of table 4) reflect concerns on project management and planning (delivery delay, increased effort, financial loss) and the external quality of the product (low external quality). issues related to human factors were also commonly cited, with emphasis on stakeholder dissatisfaction, team demotivation, and stress with stakeholders.

table 5 shows the categories of effects related to other software development issues. the category planning and management has ~47% of citations, revealing that managerial aspects of software development are commonly affected by the presence of debt items. next is the human factors category, with ~18% of the cited effects, showing that td also impacts human aspects of software development. quality issues are also a common concern. the other categories are less commonly cited.

table 5. categories of effects related to other software development issues (number of effects; number of citations; ~% of cited effects).
planning and management: 15; 297; 47%
human factors: 7; 110; 18%
quality issues: 6; 110; 18%
vv&t: 3; 23; 4%
design issues: 2; 21; 3%
knowledge issues: 8; 21; 3%
architectural issues: 4; 18; 3%
organizational issues: 3; 10; 2%
documentation: 1; 6; 1%
process issues: 2; 4; 1%
requirement engineering: 2; 4; 1%
infrastructure issues: 1; 3; 0.5%
td management: 1; 2; 0.3%

4.4 rq3: is td prevention more related to coding issues or other issues in the software development process?

the data shows a total of 89 practices to support the prevention of td items, resulting in 819 citations. of these, ~84% are related to other development issues, while only ~16% are associated with code. this result indicates a tendency for other development issues to play a key role in the prevention of td. we identified a total of 13 td prevention practices related to coding. table 6, second column, presents the 10 most cited items; the complete list is available at https://bit.ly/37bopif. adoption of good practices, using good design practices, refactoring, code review, increasing time for analysis and design, use of the most appropriate version of the technology, and appropriate reusing of code are the prevention practices most cited by the participants. the adoption of good practices and using good design practices reflect concerns that practitioners should have when carrying out their coding and design activities. the practices refactoring and code review are related to the continuous improvement of the code under development.
lastly, increasing time for analysis and design, use of the most appropriate version of the technology, and appropriate reusing of code are related to concerns that teams must have around an adequate analysis of the functionalities, the implementation of the software structure, and software reuse, respectively.

table 6. top 10 most commonly cited td prevention practices related to coding or other development issues.
coding: 1st adoption of good practices (49); 2nd using good design practices (26); 3rd refactoring (12); 4th code review (10); 5th increasing time for analysis and design (7); 6th use the appropriate version of the technology (7); 7th appropriate reusing of code (6); 8th version control (5); 9th considering technical constraints (4); 10th improving the project maintainability (4).
other development issues: 1st well-defined requirements (57); 2nd better project management (43); 3rd providing training (36); 4th follow the project planning (34); 5th improving software development process (33); 6th improve documentation (26); 7th well planned deadlines (26); 8th better project planning (24); 9th creating tests (24); 10th allocation of qualified professionals (23).

on the other hand, we found 76 prevention practices related to other development issues. table 6 (third column) shows the ten most cited. interestingly, five of them reflect different concerns across the software development process, such as management (following the project planning and better project management), the process itself (improving the software development process), the documentation (well-defined requirements), and the qualification of the team (providing training). we see in table 7 that td prevention practices are commonly related to project management issues (~34%). the results also highlight the importance of the process followed by the team, which ranks second (~12%) among the most cited categories. concerns related to requirements, vv&t, td management, and human factors were also commonly mentioned.

table 7. categories of prevention practices related to other software development issues (number of practices; number of citations; ~% of cited practices).
planning and management: 21; 232; 34%
process issues: 8; 80; 12%
requirement engineering: 5; 69; 11%
vv&t: 11; 67; 10%
td management: 7; 64; 10%
human factors: 11; 61; 9%
knowledge issues: 4; 51; 8%
documentation issues: 2; 28; 4%
architectural issues: 3; 27; 4%
organizational issues: 2; 4; 1%
infrastructure issues: 2; 3; 1%

4.5 rq4: are the reasons for not preventing td more related to coding issues or other development issues?

participants reported 25 reasons that lead to the non-prevention of td items, resulting in 63 citations. of them, ~87% are related to other development issues, while only eight (~13%) are related to coding. again, other development issues play an important role in (not) preventing td. there are only four reasons related to code that lead teams not to prevent the occurrence of debt items: lack of technical knowledge, lack of good technical solutions, lack of concern about maintainability, and continuous change of coding standards. on the other hand, we found 21 reasons related to other software development issues (the 10 most cited are presented in table 8). short deadline was the most cited.
table 8. top 10 most cited reasons for not preventing td related to other development issues.
1st short deadline (14); 2nd ineffective management (7); 3rd lack of predictability in the software development (5); 4th requirements change (5); 5th pressure for results (4); 6th documentation issues (2); 7th lack of process maturity (2); 8th lack of qualified professionals (2); 9th legacy system difficult to heal (2); 10th accepting the td (1).

table 9 shows the categories identified. planning and management once again stands out with ~38% of citations. the other categories were less commonly cited, with fewer than seven citations each. although less mentioned, the result suggests that other issues related to software development can also negatively influence teams in td prevention.

table 9. categories of reasons for td non-prevention related to other software development issues (number of reasons; number of citations; ~% of cited reasons).
planning and management: 2; 21; 38%
requirement engineering: 2; 6; 11%
coding: 1; 5; 9%
external factors: 2; 5; 9%
human factors: 4; 4; 8%
process issues: 2; 3; 6%
design issues: 1; 2; 4%
documentation issues: 1; 2; 4%
knowledge issues: 1; 2; 4%
td management: 2; 2; 4%
architectural issues: 1; 1; 2%
infrastructure issues: 1; 1; 2%
organizational issues: 1; 1; 2%

4.6 rq5: is td repayment more associated with coding issues or other issues in the software development process?

we identified 32 td repayment practices, resulting in 315 citations. of them, ~56% are related to other development issues, while ~44% are associated with code. unlike the other td management elements, these percentages differ only slightly, indicating that coding issues play a key role in td repayment initiatives. we recognized eight td repayment practices related to coding, presented in table 10. code refactoring and design refactoring are the most cited practices; both are associated with changes in the internal structure of the system without changing its external behavior. the practices solving technical issues and bug fixing focus on fixing open issues in the code. lastly, the practices using code analysis, code reviewing, and using code reuse can support teams implementing td repayment initiatives, i.e., although these practices do not repay the debt themselves, they increase the capacity for better repayment.

the remaining 24 repayment practices are related to other development issues. table 10 (third column) shows the ten most cited ones. these practices evidence several concerns in software development processes: documentation (update system documentation), organizational decisions (hiring specialized professionals), project management (increasing the project budget, monitoring and controlling project activities, negotiating deadline extension, investing effort on td repayment, and prioritizing td items), process (improving the development process and using short feedback iterations), and software quality (investing effort in testing activities).

table 10. top 10 most commonly cited td repayment practices related to coding or other development issues.
rank | coding repayment practices | # | other development issues repayment practices | #
1st | code refactoring | 80 | investing effort on td repayment activities | 33
2nd | design refactoring | 25 | investing effort on testing activities | 22
3rd | adoption of good practices | 10 | prioritizing td items | 15
4th | solving technical issues | 9 | negotiating deadline extension | 14
5th | bug fixing | 6 | update system documentation | 9
6th | using code analysis | 3 | monitoring and controlling project activities | 9
7th | code reviewing | 3 | increase the project budget | 9
8th | using code reuse | 2 | improving the development process | 8
9th | - | - | hiring specialized professionals | 8
10th | - | - | using short feedback iterations | 7

table 11 presents the categories of repayment practices. td management and planning and management stand out with ~32% and ~27% of the total citations. the categories vv&t and process issues were cited by ~13% and ~12% of the participants, respectively, while the others were less commonly reported.

table 11. categories of repayment practices related to other software development issues.

categories of repayment practices | #practices | #cited practices | ~%cited practices
td management | 4 | 56 | 32%
planning and management | 8 | 47 | 27%
vv&t | 1 | 22 | 13%
process issues | 5 | 21 | 12%
documentation | 1 | 9 | 6%
organizational issues | 1 | 8 | 5%
human factors | 1 | 6 | 4%
requirement engineering | 1 | 3 | 2%
infrastructure issues | 1 | 3 | 2%
design issues | 1 | 2 | 1%

4.7 rq6: are the reasons for not paying td more related to coding issues or other development issues?

we identified 27 reasons for not repaying td items, totaling 319 citations. of these, 99.7% are related to other development issues, and only lack of access to the component code (0.3%) is associated with code. the reasons for td non-repayment therefore arise from development issues other than coding. table 12 shows the ten most cited reasons for not repaying td; the complete list is available at https://bit.ly/37bopif. we notice that the majority of the reasons (focusing on short-term goals, lack of time, cost, lack of resources, effort, the project was discontinued, complexity of the td item, and insufficient management view about td repayment) are associated with project planning and management. the others refer to external (customer decision) and human (team overload) factors.

table 12. top 10 most cited reasons for not paying off td related to other development issues.

rank | reason | # | rank | reason | #
1st | focusing on short term goals | 69 | 6th | customer decision | 13
2nd | lack of org. interest | 48 | 7th | complexity of the td item | 12
3rd | lack of time | 41 | 8th | effort | 11
4th | cost | 34 | 9th | insufficient mgmt. view on td repayment | 10
5th | lack of resources | 19 | 10th | complexity of the project | 10

the reasons were also grouped into categories. planning and management issues stand out with ~58% of the citations, as shown in table 13, pointing out that the reasons in this category are decisive for td non-repayment. the categories organizational issues and td management were also commonly cited, by ~16% and ~11% of the participants.

table 13. categories of reasons for td non-repayment related to other software development issues.
categories of reasons | #reason | #cited reasons | ~%cited reasons
planning and management | 7 | 185 | 58%
organizational issues | 2 | 50 | 16%
td management | 7 | 34 | 11%
external factor | 1 | 13 | 5%
knowledge issues | 3 | 12 | 4%
human factors | 3 | 11 | 4%
architectural issues | 2 | 11 | 4%
vv&t | 1 | 2 | 1%

5 organizing the td management elements into hump diagrams

we represent the relationship between the investigated td management elements (causes, effects, prevention practices, reasons for td non-prevention, repayment practices, and reasons for td non-repayment) and software development issues in hump diagrams (figure 2). to plot the results for coding and for other issues in the same hump diagram, we normalized the number of citations of a specific software development issue for an element by the total number of citations for that element. for example, prevention practices have 819 citations in total, 232 of which are for the issue planning and management. thus, the hump value for planning and management issues of prevention practices is 28% (232/819 * 100). this count is slightly different from the ones used in tables 3, 5, 7, 9, 11, and 13 because now we consider coding as another software development issue.
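to make the normalization concrete, here is a minimal sketch of the hump-value computation, using only the numbers quoted above (819 citations of prevention practices in total, 232 of them for planning and management); the function name is ours, not the authors':

```python
def hump_value(issue_citations: int, element_total: int) -> float:
    """height of one issue's hump for one td management element, in %."""
    return 100 * issue_citations / element_total

# prevention practices: 819 citations in total, 232 of them for the
# planning and management issue (numbers quoted in the text above)
print(round(hump_value(232, 819)))  # -> 28
```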
5.1 using the diagram

we can read the hump diagram horizontally and vertically. horizontally, we have a broad view of the impact of each software development issue across the td management elements. for example, in figure 2, we can notice that coding plays an important role for all the analyzed td elements, but mainly for td repayment: there is a high concentration of practices related to td repayment and, at the same time, almost none of the reasons for the non-repayment of debt items is due to coding issues. we also perceive that there are many other issues we need to be aware of when dealing with td in software projects, mainly planning and management. indeed, this is even stronger when combined with td management concerns: much about the non-repayment of td can be understood by looking at these issues. human factors also call our attention, clearly indicating that td is not only about technical aspects of software development but also about team morale, satisfaction, motivation, communication, and commitment. other issues commonly found in several elements of td management are architectural issues, design issues, documentation, knowledge issues, process issues, requirement engineering, and vv&t.

figure 2. the hump diagram for td management elements and software development issues.

by reading the diagram vertically, we can observe the impact of all identified software development issues on each td management element. in figure 2, for example, we can observe that planning and management, organizational, and td management issues are decisive for the non-repayment of debt items. we also notice that the presence of debt items mainly impacts (effect) planning and management, quality issues, maintenance issues, human factors, and coding. practitioners can use the hump diagram to have a comprehensive view of how td relates to several issues of their software projects, ranging from organizational- to coding-level issues. moreover, for each td management element, they can go through the detailed results presented in section 4 and the auxiliary material to understand how to deal with them. for example, by looking at figure 2, a practitioner can see that the effects of td are commonly related to coding, human factors, maintenance, quality, and planning and management issues. if (s)he is interested in discovering more about the human factors issues, then (s)he can observe in the results and auxiliary material that team demotivation, dissatisfaction of the parties involved, and stress with stakeholders are the main concerns to be mitigated.

5.2 specializing the diagram by process models

practitioners can specialize the hump diagram for their context. to illustrate it, we organized the td management elements considering the process model used by the participants who answered the insightd questionnaire, choosing one of the following options: agile, hybrid, and traditional. figures 3, 4, and 5 present the hump diagrams for the agile, hybrid, and traditional process models, respectively. comparing them, we can notice that the diagrams for the agile and hybrid process models are only slightly different from each other, indicating that the view on the td management elements goes in the same direction for these process models. conversely, the traditional process model presents some particularities compared with the other models. for example, prevention practices are more affected by architectural, infrastructure, organizational, and requirement engineering issues in the traditional process model than in the others. reasons for td non-prevention are less affected by coding, design, documentation, human factors, knowledge, maintenance, requirement engineering, and td management in the traditional process model, while external factors and planning and management mainly affect this model.

figure 3. the hump diagram for the agile process model.
figure 4. the hump diagram for the hybrid process model.
figure 5. the hump diagram for the traditional process model.

to further understand the possible impact of different process models on the td management elements, we organized ranked lists of each td management element considering its number of citations by process model (agile, hybrid, and traditional). to verify whether there are differences between the lists, we adopted the rbo (rank-biased overlap) analysis (webber et al., 2010), which quantitatively measures how similar the ranked lists are. rbo gives a value ranging from 0 to 1: the closer this value is to 1, the greater the similarity between the lists. as rbo supports top-weighted ranked lists, the first elements of a list have more impact on the similarity index than the last ones. we can configure which elements will be compared by setting the parameter p, which, differently from the p-value of statistical testing, refers to the level of overlapping and the degree of top-weightedness. in the analysis, we chose p ranging from 0.5 (only the very first elements of a ranking are considered) to 0.9 (almost all elements are considered). the results of the comparison for each of the td management elements are presented in the following subsections.
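for readers who want to reproduce the analysis, below is a minimal sketch of rank-biased overlap in its truncated form, i.e., the agreement sum is cut off at the length of the shorter list; webber et al. (2010) additionally extrapolate the unseen tail of the lists, so this sketch slightly underestimates the published measure:

```python
def rbo(list_a, list_b, p=0.9):
    """truncated rank-biased overlap between two ranked lists.

    p controls the top-weightedness: with p = 0.5 only the very first
    ranks matter; with p = 0.9 almost all elements contribute."""
    depth = min(len(list_a), len(list_b))
    seen_a, seen_b = set(), set()
    weighted_agreement = 0.0
    for d in range(1, depth + 1):
        seen_a.add(list_a[d - 1])
        seen_b.add(list_b[d - 1])
        # agreement at depth d: size of the prefix overlap divided by d
        weighted_agreement += (p ** (d - 1)) * len(seen_a & seen_b) / d
    return (1 - p) * weighted_agreement
```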
5.2.1 comparing td causes between agile, hybrid, and traditional process models

figure 6 shows the results of the comparison between the ranked lists of causes for each process model considering (a) causes related to coding issues and (b) causes related to other software development issues.

the rbo analysis for causes related to coding (figure 6 (a)) reveals that the similarity level is about 80-90% between the three lists. it indicates that the lists are quite similar, with little variation as more causes are included, i.e., as p increases. this similarity can be perceived when we observe the top 5 ranked causes for each process model (table 14). the cause non-adoption of good practices was the most cited for all process models, while lack of refactoring, sloppy code, and adoption of contour solutions as definitive were perceived in all models, but in different positions, for example, lack of refactoring (agile: 2nd, hybrid: 4th, and traditional: 3rd) and sloppy code (agile: 3rd, hybrid and traditional: 2nd). further, we can see that the cause external component dependency is not perceived in the traditional process model, while lack of reuse practices is only perceived in this process model.

for causes related to other software development issues (figure 6 (b)), we can see that the rbo value is almost constant, with a similarity level of about 80-90% for the agile and hybrid process models. in contrast, the similarity level is about 65-80% when comparing traditional with agile/hybrid. in table 15, we can see that the cause deadline was the most cited for each process model. the agile and hybrid process models did not share the causes focus on producing more at the expense of quality and lack of experience. however, the causes inaccurate time estimate, inappropriate / poorly planned / poorly executed test, and lack of qualified professional were perceived only in the context of the traditional process model.

table 14. top 5 most cited causes related to coding issues per process model.

rank | agile | hybrid | traditional
1 | non-adoption of good practices (25) | non-adoption of good practices (23) | non-adoption of good practices (6)
2 | lack of refactoring (10) | sloppy code (8) | sloppy code (5)
3 | sloppy code (8) | external component dependency (7) | lack of refactoring (2)
4 | adoption of contour solutions as definitive (6) | lack of refactoring (5) | lack of reuse practices (2)
5 | external component dependency (4) | adoption of contour solutions as definitive (4) | adoption of contour solutions as definitive (1)

table 15. top 5 most cited causes related to other development issues per process model.

rank | agile | hybrid | traditional
1 | deadline (66) | deadline (85) | deadline (18)
2 | inappropriate planning (35) | not effective project management (53) | inaccurate time estimate (14)
3 | not effective project management (35) | inappropriate planning (38) | inappropriate / poorly planned / poorly executed test (13)
4 | lack of technical knowledge (34) | lack of technical knowledge (38) | inappropriate planning (10)
5 | focus on producing more at the expense of quality (30) | lack of experience (32) | lack of qualified professional (10)

figure 6. rbo comparing causes related to (a) coding and (b) other software development issues.

in summary, coding-related causes are perceived in the same way in the agile, hybrid, and traditional process models, while non-coding-related causes are perceived differently by those who follow traditional process models.
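as an illustration, the rbo sketch shown before section 5.2.1 can be fed with the agile and traditional top 5 causes of table 14. since only the top 5 of each list is available here, the absolute values underestimate the similarity levels reported for the full lists; the point is only to show how the comparison is set up:

```python
# top 5 causes related to coding issues, transcribed from table 14
agile = [
    "non-adoption of good practices",
    "lack of refactoring",
    "sloppy code",
    "adoption of contour solutions as definitive",
    "external component dependency",
]
traditional = [
    "non-adoption of good practices",
    "sloppy code",
    "lack of refactoring",
    "lack of reuse practices",
    "adoption of contour solutions as definitive",
]

for p in (0.5, 0.7, 0.9):
    print(f"p = {p}: rbo = {rbo(agile, traditional, p):.2f}")
```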
5.2.2 comparing td effects between agile, hybrid, and traditional process models

figure 7 shows the results of the comparison between the ranked lists of effects by process model considering (a) coding-related effects and (b) effects related to other software development issues.

the rbo analysis for effects related to coding (figure 7 (a)) reveals that the lists are quite similar, as the similarity level is about 90% between the three lists. analyzing the top 5 ranked effects of each process model (table 16), we can see this similarity. for example, the effects low maintainability and rework were the most cited for all process models, occupying the same positions in the lists. further, the effect difficulty in implementing the system is only perceived in the traditional process model, which, in turn, did not perceive the effect need for refactoring.

table 16. top 5 most cited effects related to coding issues per process model.

rank | agile | hybrid | traditional
1 | low maintainability (40) | low maintainability (43) | low maintainability (14)
2 | rework (39) | rework (35) | rework (12)
3 | need for refactoring (19) | bad code (17) | bad code (5)
4 | low performance (14) | need for refactoring (14) | low performance (4)
5 | bad code (9) | low performance (10) | difficulty in implementing the system (3)

regarding the effects related to other software development issues (figure 7 (b)), the similarity level is almost 100% for the first effects in the agile and hybrid lists. it means that these process models have the same view on the most critical effects of td, but this similarity level decreases as more effects are considered. table 17 presents the top 5 ranked effects by process model. we can see that the effect delivery delay was the most perceived effect across the process models. besides, the effects in the agile and hybrid lists are almost the same, except team demotivation and stakeholder dissatisfaction. although the effect design problems is only perceived in the context of traditional process models, the other effects (financial loss, low external quality, and team demotivation) are also present in the other two lists. in conclusion, agile, hybrid, and traditional process models are related to almost the same coding-related effects; this also applies to non-coding-related effects.

table 17. top 5 most cited effects related to other development issues per process model.

rank | agile | hybrid | traditional
1 | delivery delay (51) | delivery delay (69) | delivery delay (21)
2 | low external quality (34) | low external quality (36) | financial loss (10)
3 | financial loss (20) | financial loss (25) | low external quality (8)
4 | increased effort (18) | increased effort (20) | team demotivation (5)
5 | team demotivation (13) | stakeholder dissatisfaction (19) | design problems (3)

figure 7. rbo comparing effects related to (a) coding and (b) other software development issues.

5.2.3 comparing td preventive practices between agile, hybrid, and traditional process models

figure 8 shows the results of the comparison between the ranked lists of preventive practices by process model considering (a) preventive practices related to coding and (b) those related to other software development issues. the rbo analysis for preventive practices related to coding (figure 8 (a)) reveals that the lists are different: the similarity level is about 60-80% between the three lists. in table 18, we can see that, while the preventive practice adoption of good practices was the most used practice in all process models, the other practices were not shared by all process models.
for example, using good design practices, refactoring, and considering technical constraints are only present in the context of the agile process model, while use the most appropriate version of the technology and bug tracking are only related to the traditional process model.

table 18. top 5 preventive practices related to coding issues per process model.

rank | agile | hybrid | traditional
1 | adoption of good practices (18) | adoption of good practices (25) | adoption of good practices (6)
2 | using good design practices (13) | appropriate reusing of code (3) | increase time for analysis and design (2)
3 | refactoring (8) | code review (2) | use the most appropriate version of the technology (2)
4 | code review (7) | improving the maintainability of the project (4) | appropriate reusing of code (1)
5 | considering technical constraints (4) | increase time for analysis and design (3) | bug tracking (1)

concerning the preventive practices related to other software development issues, the similarity level is 70-80% (figure 8 (b)), indicating that these lists are also different. in table 19, we can see that the preventive practice well-defined requirement was present in all process models, but the others were not shared by all of them. for instance, well-defined architecture, creating tests, and improve documentation were only used by traditional process models. in summary, agile, hybrid, and traditional process models did not share the same view on preventive practices, regardless of whether they are related to coding or not.

table 19. top 5 most cited preventive practices related to other development issues per process model.

rank | agile | hybrid | traditional
1 | well-defined requirement (21) | well-defined requirement (26) | well-defined requirement (10)
2 | following the project planning (17) | better project management (22) | well-defined architecture (6)
3 | better project management (16) | training (18) | better project management (5)
4 | training (13) | improving software development process (17) | creating tests (5)
5 | better project planning (12) | well planned deadlines (14) | improve documentation (5)

figure 8. rbo comparing preventive practices related to (a) coding and (b) other software development issues.

5.2.4 comparing reasons for td non-prevention between agile, hybrid, and traditional process models

figure 9 (a) shows the rbo result considering the lists of coding-related reasons for td non-prevention of the agile and hybrid process models. we did not consider traditional process models because their practitioners did not mention any reason for td non-prevention. analyzing the figure, we can see that the similarity level is 10-30%, indicating that agile and hybrid did not share the same vision on reasons for td non-prevention. this low similarity level can also be perceived when we compare the lists of reasons for td non-prevention, as shown in table 20.

table 20. top 5 most cited reasons for td non-prevention related to coding issues per process model.

rank | agile | hybrid
1 | lack of technical knowledge (2) | lack of good technical solutions (2)
2 | lack of concern about maintainability (1) | continuous change of coding standards (1)
3 | - | lack of concern about maintainability (1)
4 | - | lack of technical knowledge (1)

about the reasons for td non-prevention related to other software development issues, figure 9 (b) shows that the similarity level is about 80-90% for the most cited reasons in the agile and hybrid process models.
but this value decreases, reaching about 55%, when considering the full lists of reasons. traditional process models did not share the same view on reasons for td non-prevention, as the similarity level is about 30-50%. this low similarity level can be perceived when we analyze the five most cited reasons for td non-prevention (table 21). in conclusion, agile and hybrid process models did not share the same vision on coding-related reasons for td non-prevention, but these models have the same view on the most cited non-coding-related reasons. traditional process models did not share the same non-coding-related reasons with agile and hybrid process models.

table 21. top 5 most cited reasons for td non-prevention related to other development issues per process model.

rank | agile | hybrid | traditional
1 | short deadline (7) | short deadline (5) | pressure for results (2)
2 | ineffective management (3) | ineffective management (3) | short deadline (2)
3 | lack of predictability in the software development (3) | lack of predictability in the software development (2) | ineffective management (1)
4 | requirements change (3) | legacy system difficult to heal (2) | lack of process maturity (1)
5 | architectural evolution (1) | requirements change (2) | -

figure 9. rbo comparing reasons for td non-prevention related to (a) coding and (b) other software development issues.

5.2.5 comparing td repayment practices between agile, hybrid, and traditional process models

figure 10 (a) and table 22 show the rbo results considering the lists of repayment practices related to coding for each process model. we can see that the agile, hybrid, and traditional process models share the same view on repayment practices: the similarity level varies between 80-90%.

table 22. top 5 most cited repayment practices related to coding issues per process model.

rank | agile | hybrid | traditional
1 | code refactoring (38) | code refactoring (37) | code refactoring (5)
2 | design refactoring (14) | design refactoring (7) | design refactoring (4)
3 | adoption of good practices (6) | adoption of good practices (4) | bug fixing (1)
4 | solving tech. issues (6) | bug fixing (3) | solving tech. issues (1)
5 | code reviewing (3) | solving tech. issues (2) | -

concerning the repayment practices related to other software development issues, figure 10 (b) shows the comparison for the three process models. agile and hybrid process models have used almost the same practices (similarity level of about 80-90%). on the contrary, the similarity level when comparing the traditional process model with the other two is somewhat lower, about 70-80%, for the top 5 ranked elements of their lists, as noticed in table 23.

figure 10. rbo comparing repayment practices related to (a) coding and (b) other software development issues.

table 23. top 5 most cited repayment practices related to other development issues per process model.
rank | agile | hybrid | traditional
1 | investing effort on td repayment activities (13) | investing effort on td repayment activities (16) | investing effort on td repayment activities (4)
2 | investing effort on testing activities (12) | investing effort on testing activities (7) | increasing the project budget (4)
3 | prioritizing td items (9) | negotiating deadline extension (6) | negotiating deadline extension (4)
4 | using short feedback iterations (5) | prioritizing td items (6) | investing effort on testing activities (3)
5 | implementing preventive actions for avoiding td (4) | changing project scope (4) | update system documentation (3)

practitioners using agile, hybrid, and traditional process models have shared almost the same experience on repayment practices related to coding, but this scenario is different for repayment practices related to other software development issues in the context of traditional process models.

5.2.6 comparing reasons for td non-repayment between agile, hybrid, and traditional process models

figure 11 presents the rbo result considering the lists of non-coding-related reasons for td non-repayment. we did not perform the analysis for coding-related reasons for td non-repayment because only one such reason (lack of access to the component code) was cited by the participants. analyzing the figure, we can see that the similarity level is around 80-90%, indicating that practitioners have the same view on non-coding-related reasons for td non-repayment. in table 24, we can observe that the reasons focusing on short term goals and lack of organizational interest were the most cited for explaining td non-repayment. besides, the other reasons are also very similar among the process models. in summary, practitioners using agile, hybrid, and traditional process models share the same view on non-coding-related reasons for td non-repayment.

figure 11. rbo comparing reasons for td non-repayment related to other software development issues.

table 24. top 5 most cited reasons for td non-repayment related to other development issues per process model.

rank | agile | hybrid | traditional
1 | focusing on short term goals (28) | focusing on short term goals (32) | focusing on short term goals (9)
2 | lack of organizational interest (20) | lack of organizational interest (21) | lack of organizational interest (7)
3 | lack of time (16) | lack of time (20) | cost (5)
4 | cost (13) | cost (16) | lack of time (5)
5 | effort (7) | lack of resources (13) | lack of technical knowledge (3)

6 discussion

this section presents an overview of the findings and discusses their implications for practitioners and researchers.

6.1 summary of findings

the results indicate that coding issues related to the causes, effects, prevention, non-prevention, repayment, and non-repayment of td are only a small part of the concerns that practitioners face in the presence of td. indeed, td has been more commonly associated with other software development issues. the radar graph presented in figure 12 shows the distribution of the participants' responses to each of the investigated elements concerning the categories coding issues and other software development issues. for every investigated element, most of the responses are related to other software development issues. the difference is considerably bigger for the elements causes, prevention, reasons for not preventing, and reasons for not repaying. the values for td repayment are very close between the two groups (56% vs 44%). this is an indication that, although practitioners perceive that td is ubiquitous in software development projects, they also see that its repayment is commonly related to coding issues.

figure 12. distribution of the participants' answers on the td management elements.
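figure 12 is a radar (spider) chart of these shares. a minimal sketch of how such a chart can be drawn, assuming matplotlib is available; the repayment (56% vs 44%), non-prevention (~87% vs ~13%), and non-repayment (~99.7% vs ~0.3%) splits come from the text above, while the values for causes, effects, and prevention are illustrative placeholders only, not the paper's data:

```python
import numpy as np
import matplotlib.pyplot as plt

elements = ["causes", "effects", "prevention",
            "non-prevention", "repayment", "non-repayment"]
# share of answers related to other development issues per element;
# only the last three values are reported in the text above, the
# first three are placeholders for illustration
other = [80.0, 70.0, 85.0, 87.0, 56.0, 99.7]
coding = [100.0 - v for v in other]

angles = np.linspace(0, 2 * np.pi, len(elements), endpoint=False).tolist()
angles.append(angles[0])  # repeat the first angle to close the polygon

fig, ax = plt.subplots(subplot_kw={"polar": True})
for values, label in ((other, "other development issues"),
                      (coding, "coding issues")):
    closed = values + values[:1]
    ax.plot(angles, closed, label=label)
    ax.fill(angles, closed, alpha=0.15)
ax.set_xticks(angles[:-1])
ax.set_xticklabels(elements)
ax.legend(loc="lower right")
plt.show()
```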
we organized the td management elements into categories. the category planning and management concentrated the biggest number of citations of causes, effects, preventive practices, reasons for td non-prevention, and reasons for td non-repayment. alternatively, the category td management has the biggest quantity of repayment practices citations. all identified categories of each td management element were represented in a hump diagram. by analyzing the diagram, practitioners can perceive the influence of each td management element on a specific issue associated with the software development process. these issues correspond to the categories defined in this study. besides, practitioners can specialize the diagrams following their project context. to illustrate it, we specialized the hump diagram for the agile, hybrid, and traditional process models and compared them with each other. from the comparison, we noticed that agile and hybrid process models share the same point of view on the td management elements analyzed in this work. on the other hand, practitioners who adopted traditional process models tend to have a different view on these elements. strategies defined to support td management initiatives must consider the specificities of each process model.

6.2 implications for researchers and practitioners

the hump diagram can guide practitioners, showing how each software development issue is related to each td management element. having this information, practitioners can define strategies to mitigate causes, effects, reasons for td non-prevention, or reasons for td non-repayment. also, the combined use of the hump diagram and the detailed results, presented in section 4 and available at https://bit.ly/37bopif, provides comprehensive guidance for software development teams about what to expect from the presence of td and how to react to it considering several software development issues. for example, practitioners can diagnose the causes of td by consulting the hump diagram. as the causes from the category planning and management are more common in agile software projects, if an agile team has defined preventive practices for these causes and still identifies new causes, the team can, by analyzing the diagram, focus on causes from other categories that are common in the agile process, such as human factors. practitioners can also identify preventive practices to avoid td items in their projects. suppose a traditional team has applied all preventive practices from the category planning and management (the one with the highest concentration of practices), but the team still feels the effects of td. by analyzing the hump diagram, the team can apply preventive practices from other categories, such as requirement engineering and verification, validation, and test. for researchers, our results point out the need to invest more research effort in other issues of software development. for example, complementary to understanding td at the code level, it is also necessary to investigate strategies to mitigate the managerial reasons that lead software teams to not repay debt items.
another promising topic for investigation would be the relationship between human factors of software development and td. for practitioners and researchers, the results of the rbo analyses bring to the fore the need to further investigate practitioners' perceptions of the td management elements. this investigation may reveal differences that can be used to develop methods, techniques, and tools more suited to professionals' needs. for example, our findings reveal that agile and traditional processes consider td prevention differently. before developing a td prevention strategy, researchers may investigate agile software development characteristics that influence td prevention. also, agile practitioners can learn from traditional practitioners by identifying the differences in perceptions concerning td prevention.

7 threats to validity

as in any empirical study, there are threats to validity in this work. we attempted to remove them when possible and to mitigate their effects when removal was not possible. the main threat to the validity of the conclusion is related to the coding process, as it is a creative process. to mitigate it, the analyses were carried out separately by two researchers, and consensus was reached with a third, more experienced one. also, additional procedures were adopted to seek consistency in the nomenclature used by each replication team during their coding activities. lastly, the classification of the coded td management elements into code/non-code, as well as the definition of their categories, are essentially subjective tasks. to mitigate this, we followed a rigorous analysis procedure: the classification process was always performed individually by two researchers and reviewed by at least one experienced researcher. another threat is related to the specialization of hump diagrams per process model. to this end, we relied on the participants' responses to question q8 of the insightd questionnaire, which explicitly states the definition of the three categories of processes considered in this research (agile, hybrid, and traditional). the questionnaire was designed to eliminate threats to internal validity. as discussed in rios et al. (2020), the questionnaire went through a series of validations (three internal and one external) and a pilot study to identify any issues before its execution. it is also worth mentioning that the participants could act differently from what they usually do because they were part of a study. to avoid this, we clearly explained the purpose of the study and asked participants to answer the questions based on their own experience. we also stated explicitly that the questionnaire was anonymous and that the collected data would be analyzed without considering the identity of the participants. also, participants may have misinterpreted the use of the terms prevention and repayment of td. to investigate whether this threat manifested, all responses on how participants avoided and repaid the debt items (q23 and q27) were analyzed to check for invalid answers. a high proportion of invalid responses would mean that the questions could have been misinterpreted. in the end, we did not identify any invalid response, indicating that this threat did not appear in the study.
lastly, external validity threats were reduced by targeting industry professionals and seeking participant diversity among survey respondents. in search of more generalizable results, insightd is being replicated in other countries.

8 concluding remarks

in this paper, we investigated the relation between td management elements (causes, effects, preventive practices, repayment practices, reasons for td non-prevention, and reasons for td non-repayment) and software development issues related to coding or other activities. also, we categorized these elements and organized them into hump diagrams. further, we defined a hump diagram for each process model (agile, hybrid, and traditional) to demonstrate how the diagram can be specialized by practitioners following one of their project's variables, such as process model and role. the next steps of this work include (i) investigating whether the type of debt impacts how practitioners see td management elements, (ii) developing a td management instrument encompassing the hump diagram and the detailed results, and (iii) empirically assessing this instrument in supporting td management. we also intend to investigate the main human factors associated with td.

acknowledgements

this study was financed in part by the coordenação de aperfeiçoamento de pessoal de nível superior - brasil (capes) - finance code 001 and the conselho nacional de desenvolvimento científico e tecnológico - cnpq. this research was also supported in part by funds received from the david a. wilson award for excellence in teaching and learning, which was created by the laureate international universities network to support research focused on teaching and learning.

references

alves, n.s.r., mendes, t.s., mendonça, m.g., spínola, r., shull, f., & seaman, c. (2016). identification and management of technical debt: a systematic mapping study. information and software technology, 70, 100-121. doi: https://doi.org/10.1016/j.infsof.2015.10.008.

berenguer, c., borges, a., freire, s., rios, n., tausan, n., ramac, r., pérez, b., castellanos, c., correal, d., pacheco, a., lópez, g., falessi, d., seaman, c., mandic, v., izurieta, c., & spínola, r. (2021). technical debt is not only about code and we need to be aware about it. in proceedings of the xx brazilian symposium on software quality (sbqs '21). acm, new york, ny, usa, 1-12. doi: https://doi.org/10.1145/3493244.3493285.

besker, t., ghanbari, h., martini, a., & bosch, j. (2020). the influence of technical debt on software developer morale. journal of systems and software, 167. doi: https://doi.org/10.1016/j.jss.2020.110586.

cunningham, w. (1992). the wycash portfolio management system. acm sigplan oops messenger, 4, 2 (april 1993), 29-30. doi: https://doi.org/10.1145/157710.157715.

freire, s., rios, n., mendonça, m., falessi, d., seaman, c., izurieta, c., & spínola, r. (2020a). actions and impediments for technical debt prevention: results from a global family of industrial surveys. in proceedings of the 35th acm/sigapp symposium on applied computing, brno, 1548-1555.

freire, s., rios, n., gutierrez, b., torres, d., mendonça, m., izurieta, c., seaman, c., & spínola, r. (2020b). surveying software practitioners on technical debt payment practices and reasons for not paying off debt items. in proceedings of the evaluation and assessment in software engineering. trondheim, 210-219.
freire, s., rios, n., perez, b., castellanos, c., correal, d., ramac, r., mandic, v., tausan, n., pacheco, a., lópez, g., mendonça, m., izurieta, c., falessi, d., seaman, c., & spínola, r. (2021a). pitfalls and solutions for technical debt management in agile software projects. ieee software, vol. 38, no. 6, pp. 42-49, nov.-dec. 2021. doi: 10.1109/ms.2021.3101990.

freire, s., rios, n., perez, b., castellanos, c., correal, d., ramac, r., mandic, v., tausan, n., lópez, g., pacheco, a., falessi, d., mendonça, m., izurieta, c., seaman, c., & spínola, r. (2021b). how experience impacts practitioners' perception of causes and effects of technical debt. in proceedings of the ieee/acm 13th international workshop on cooperative and human aspects of software engineering (chase). doi: 10.1109/chase52884.2021.00011.

freire, s., rios, n., pérez, b., correal, d., mendonça, m., izurieta, c., seaman, c., & spínola, r. (2021c). how do technical debt payment practices relate to the effects of the presence of debt items in software projects? in proceedings of the ieee international conference on software analysis, evolution and reengineering (saner). doi: 10.1109/saner50967.2021.00074.

guo, y., spínola, r.o., & seaman, c. (2016). exploring the costs of technical debt management - a case study. empirical software engineering, 21, 1 (february 2016), 159-182. doi: https://doi.org/10.1007/s10664-014-9351-7.

izurieta, c., vetrò, a., zazworka, n., cai, y., seaman, c., & shull, f. (2012). organizing the technical debt landscape. in proceedings of the 3rd international workshop on managing technical debt (mtd). zurich, 23-26. doi: https://doi.org/10.1109/mtd.2012.6225995.

lenarduzzi, v., besker, t., taibi, d., martini, a., & fontana, f.a. (2021). a systematic literature review on technical debt prioritization: strategies, processes, factors, and tools. journal of systems and software, 171, 110827.

li, z., avgeriou, p., & liang, p. (2015). a systematic mapping study on technical debt and its management. journal of systems and software, 101, 193-220. doi: https://doi.org/10.1016/j.jss.2014.12.027.

lim, e., taksande, n., & seaman, c. (2012). a balancing act: what software practitioners have to say about technical debt. ieee software, 29, 6 (november 2012), 22-27. doi: https://doi.org/10.1109/ms.2012.130.

martini, a., stray, v., & moe, n.b. (2019). technical-, social- and process debt in large-scale agile: an exploratory case-study. in proceedings of the international conference on agile software development (pp. 112-119). springer, cham.

ramač, r., mandić, v., taušan, n., rios, n., freire, s., pérez, b., castellanos, c., correal, d., pacheco, a., lopez, g., izurieta, c., seaman, c., & spinola, r. (2022). prevalence, common causes and effects of technical debt: results from a family of surveys with the it industry. journal of systems and software, 184, 111114. doi: https://doi.org/10.1016/j.jss.2021.111114.

ribeiro, l.f., farias, m.a.f., mendonça, m., & spínola, r.o. (2016). decision criteria for the payment of technical debt in software projects: a systematic mapping study. in proceedings of the 18th international conference on enterprise information systems (iceis). doi: https://doi.org/10.5220/0005914605720579.
rios, n., freire, s., pérez, b., castellanos, c., correal, d., mendonça, m., falessi, d., izurieta, c., seaman, c., & spínola, r. (2021). on the relationship between technical debt management and process models. ieee software.

rios, n., mendonça, m., & spínola, r. (2018). a tertiary study on technical debt: types, management strategies, research trends, and base information for practitioners. information and software technology, 102, 117-145. doi: https://doi.org/10.1016/j.infsof.2018.05.010.

rios, n., spínola, r.o., mendonça, m., & seaman, c. (2019). supporting analysis of technical debt causes and effects with cross-company probabilistic cause-effect diagrams. in proceedings of the ieee/acm international conference on technical debt (techdebt). doi: https://doi.org/10.1109/techdebt.2019.00009.

rios, n., spínola, r.o., mendonça, m., & seaman, c. (2020). the practitioners' point of view on the concept of technical debt and its causes and consequences: a design for a global family of industrial surveys and its first results from brazil. empirical software engineering, 25, 3216-3287.

saraiva, d., neto, j.g., kulesza, u., freitas, g., reboucas, r., & coelho, r. (2021). technical debt tools: a systematic mapping study. in proceedings of the 23rd international conference on enterprise information systems. doi: 10.5220/0010459100880098.

strauss, a. & corbin, j. (1998). basics of qualitative research: techniques and procedures for developing grounded theory. sage publications.

tamburri, d.a., kruchten, p., lago, p., & van vliet, h. (2015). social debt in software engineering: insights from industry. journal of internet services and applications, 6(1), 1-17.

webber, w., moffat, a., & zobel, j. (2010). a similarity measure for indefinite rankings. acm transactions on information systems, vol. 28, no. 4.

wohlin, c., runeson, p., host, m., ohlsson, m.c., regnell, b., & wesslen, a. (2012). experimentation in software engineering: an introduction. springer.

zazworka, n., vetro', a., izurieta, c., wong, s., cai, y., seaman, c., & shull, f. (2014). comparing four approaches for technical debt identification. software quality journal, 22, 403-426. doi: https://doi.org/10.1007/s11219-013-9200-8.
journal of software engineering research and development, 2021, 9:15, doi: 10.5753/jserd.2021.1944

this work is licensed under a creative commons attribution 4.0 international license.

software process improvement programs: what are the pitfalls that lead to abandonment?

regina albuquerque [ pontifícia universidade católica do paraná | regina.fabia@pucpr.br ]
gleison santos [ universidade federal do estado do rio de janeiro | gleison.santos@uniriotec.br ]
andreia malucelli [ pontifícia universidade católica do paraná | malu@ppgia.pucpr.br ]
sheila reinehr [ pontifícia universidade católica do paraná | sheila.reinehr@pucpr.br ]

abstract

while many organizations successfully embrace and experience software process improvement (spi) benefits, others abandon the effort before realizing the full potential of an spi initiative. therefore, researchers' interest has increased in understanding why software organizations that have a successful start in adopting spi abandon improvement initiatives after the assessment. thus, this work aims to investigate how the abandonment of spi programs based on maturity models occurs after the assessment. the multiple case study method was used with eight organizations. data were analyzed using grounded theory open and axial coding procedures. the results show that spi initiatives failed because of factors internal to the organizational context (people, spi project management, organizational aspects, and processes) and factors external to it (the country's economic crisis, outsourcing, governmental political influence, and external pressure from clients). as a contribution, we highlight the identification of these factors, which organizations can use to learn about their own initiatives and avoid the pitfalls that can lead to the abandonment of spi.

keywords: software and its engineering, software quality, software process improvement, abandonment of software process improvement

1 introduction

software organizations operate in a highly competitive market that demands quality and productivity (canedo et al., 2019). in this sense, software process improvement (spi) aims to offer insights into the software process as it is used within organizations and, thus, lead to the implementation of changes to achieve specific objectives, such as increasing product quality or reducing cost and development time (coleman et al., 2008). several process improvement support models have gained ground in the software industry, such as cmmi-dev (cmmi institute, 2018) and iso/iec 33020 (iso/iec, 2015). in brazil, where this research was conducted, the model resulting from the mps.br (brazilian program for software process improvement) is primarily used.
mps.br is a mobilizing, long-term program that aims to define software and service process improvement and assessment models, primarily targeting micro, small, and medium-sized enterprises, to meet business needs (softex, 2020). the mr-mps-sw (brazilian reference model for software process improvement) model is structured in seven evolving maturity levels. they combine processes, which are based on iso/iec 12207 (iso/iec, 2017) and compatible with cmmi-dev (cmmi institute, 2018), and their capabilities, which are based on iso/iec 33020 (iso/iec, 2015). the maturity levels establish thresholds of process evolution that characterize improvement stages for spi implementation in software organizations. the maturity evolution begins with level g and progresses up to level a (softex, 2020). to qualify their processes, organizations must undergo an official assessment, which is valid for three years. previous studies have reported benefits such as higher customer satisfaction, cost reduction, greater predictability of costs and deadlines, and increased productivity and quality (kalinowski et al., 2010). until april 2021, 816 assessments had been successfully completed (http://www.softex.br/). many organizations were assessed at the initial levels g (55%) and f (31%). only 14% of the assessments are associated with the upper levels (level e: 4%, level c: 9%, and level a: 1%), which signifies that, in general, progress stops at level f. that suggests that most organizations either abandon their spi programs or maintain compliance with the maturity level requirements without undertaking renewal appraisals. therefore, an important question arises: if companies achieve benefits by improving software processes, why do they abandon spi programs? our previous research has pointed to organizational, human, and process-related issues (albuquerque et al., 2018). other research studies have sought to gather further information on maintaining process practitioners' participation after the appraisal period (uskarci et al., 2017). nalepa et al. (2019) and fontana et al. (2015) have found a different way to mature for organizations that use agile methods. understanding how companies continue to improve their processes after an appraisal is relevant to the software industry, which still faces challenges posed by time and budget constraints that may hinder the continuation of spi initiatives. given this context, the aim of this study is to understand how abandonment occurs in spi programs after a successful assessment based on maturity models. to accomplish this objective, we conducted case studies in eight brazilian software companies. data were analyzed using open and axial coding procedures from grounded theory (strauss and corbin, 1998). spi managers can use the results of this research to avoid the pitfalls that can lead to abandoning an spi initiative. results from four of these organizations were published in albuquerque et al. (2020). the main contribution of the present paper is the confirmation that factors internal to the organization (human, organizational, spi project, and processes) and factors external to the organization (the country's economic crisis), when neglected, can cause the abandonment of spi. in addition, new results emerged, such as lack of external demand for evaluation (in parts of this text, especially in the transcriptions of the interviews, the term certification is used to mean evaluation), dissolution of the company, the merger of companies, and adherence to agile methodologies.
the paper is organized into seven sections besides this introduction: section 2 presents the background; section 3 describes the research method; section 4 reports the results; section 5 presents the discussion; section 6 presents threats to validity; section 7 presents the final considerations.

2 related works

software process improvement (spi) is an approach that has attracted the interest of software companies because it promises to increase quality and decrease costs and project deadlines (coleman et al., 2008). while many organizations successfully adopt and experience the benefits of spi (kalinowski et al., 2010), others abandon the effort before realizing the potential of spi benefits (albuquerque et al., 2018). therefore, there is an interest in understanding why these companies abandon their improvement initiatives. almeida et al. (2011) identified factors that can affect continued adherence to the software process in an organization, focusing on software processes assessed using mr-mps-sw as a basis. the results of their study were classified into four factors: technical factors, sociocultural factors, resources, and commitment. besides, they have shown that project management processes are challenging to maintain in the routine of companies. uskarci et al. (2017) sought to identify the problems of continuity and participation in software process improvement activities in two level 3 cmmi-dev companies in turkey. they identified higher submission rates of suggestions for improving the process when the assessment date is approaching and lower rates when the assessment is completed. besides, the employees' participation in these activities and their prospects for process improvement are highly dependent on their role within the organization. the authors identified greater involvement of employees in the quality group and process group. on the other hand, practitioners of the process are reluctant to suggest improvements to it. albuquerque et al. (2018) present a survey conducted in brazil to identify which factors (based on a systematic literature review) can lead to the maintenance or abandonment of spi programs. the interviewees comprised specialists in spi (consultants and appraisers of the cmmi-dev and mr-mps-sw models). results indicate that the continuation of spi programs is positively influenced by human factors (motivation and acceptance; support, commitment, and involvement; technical and personal competencies), the spi project itself (definition of strategies; resources; adequate external consultancy service), organizational factors (communication; goals; organizational structure; internal and external policies; return on investment and leadership), consultancy, and processes. albuquerque et al. (2019) investigated how organizations using agile methods evolved their processes after a maturity model assessment. the unit of analysis of the case study was four privately owned software organizations that had been assessed with the mr-mps-sw model and used agile methods. results showed that companies using agile methods have difficulties in implementing spi initiatives with maturity models. it was found that processes based on maturity models were partially abandoned and that project management practices are the most difficult to maintain, confirming the results found by uskarci et al. (2017).
according to anastassiu et al. (2020), resistance negatively affects spi, both in implementation and in maintenance. they conducted a qualitative study on the causes and effects of change resistance in spi initiatives and on procedures to mitigate resistance. they interviewed 21 professionals and specialists in improving software processes. the authors identified 32 causes of resistance, 16 effects, and 29 behaviors related to resistance to change. among the results, it is worth highlighting the effects that resistance creates in spi initiatives: ef01: rejection of resistant members who boycott the process; ef02: the firing of members resistant to change and/or to following the process; ef03: demotivation of the process team due to the resistance of its executors; ef04: compromised improvement project goals; ef05: use of bypass solutions; ef06: abandonment of the process; ef07: real improvements are not achieved; ef08: demotivation due to the difficulty in changing the culture; ef09: skepticism due to the difficulty in changing the culture; ef10: resignation from employment because of the difficulty in changing the culture; ef11: inappropriate attitudes (rebellious and deceitful) by some of the leaders; ef12: feeling of isolation in the organization; ef13: submission by fear by middle management and executors of the process; ef14: bad influence on new hires; ef15: one-off and non-continuous improvements; and ef16: fear of job loss. although previous studies have provided information on the post-assessment phase, they are limited in that they do not address the abandonment of spi. it is crucial for organizations interested in adopting spi to know what causes can lead to spi failure, in order to avoid or mitigate these risks. for example, almeida et al. (2011) and uskarci et al. (2017) reported results from organizations with valid official assessments. albuquerque et al. (2018) reported a survey with spi specialists, and anastassiu et al. (2020) a qualitative study with spi specialists. although these specialists' point of view is relevant, it is essential to conduct qualitative research to identify, from the organizations' point of view, how human, organizational, spi project, and process factors influence the continuity of spi initiatives. albuquerque et al. (2019) presented the difficulty of agile companies in sustaining spi programs based on maturity models. however, there is a lack of information about the challenges faced by organizations whose official assessments have expired. to understand this topic, it is essential to conduct qualitative research in different contexts and from the organizations' perspective.

3 research method

this paper addresses the following research question: rq: how does abandonment occur in software process improvement programs? to answer the question, we conducted a case study in eight software organizations. yin (2017) states that when the research aims at answering a "how" question, a case study is a method that offers the response. in case studies, the definition of propositions guides data collection and analysis. they also help to accomplish the research objective. based on the literature (albuquerque et al., 2018; almeida et al., 2011; albuquerque et al., 2019; uskarci et al., 2017), the following propositions were defined:

▪ p1. there are human factors that influence the abandonment of the spi program.
▪ p3. there are organizational factors that influence the abandonment of the spi program.
▪ p4. there are process-related factors that influence the abandonment of the spi program.

3.1 context

the unit of analysis, also called a case, is a software organization that was assessed with the mr-mps-sw model and has not carried out new assessments. an organization was considered to be abandoning spi when it reported no longer using the processes (organizations 4 and 8) or using them only partially (organizations 1, 2, 3, 5, 6, and 7). we carried out the case study in eight software organizations with different profiles, as shown in table 1. organizations of various sizes participated in this research: small (2 and 7), medium (3 and 8), large (1, 4, and 6), and micro-enterprise (5). only organization 1 is from the public sector. regarding the main activities, organizations 1, 4, and 8 maintain software products and develop custom software; organizations 2, 3, 5, and 7 maintain software products; and organization 6 develops software and offers software services.

table 1. profile of the studied companies.
org. 1: size: +300 employees; origin of capital: public; main activity: ict; participates in bidding: no; federal grant: no; maturity level: g; validity of the assessment: june 2016.
org. 2: size: +40 employees; origin of capital: private; main activity: erp product; participates in bidding: no; federal grant: yes; maturity level: f; validity of the assessment: january 2017.
org. 3: size: +80 employees; origin of capital: private; main activity: erp product; participates in bidding: yes; federal grant: no; maturity level: c; validity of the assessment: november 2018.
org. 4: size: +100 employees; origin of capital: private; main activity: custom/embedded software; participates in bidding: no; federal grant: yes; maturity level: e; validity of the assessment: may 2018.
org. 5: size: 5 employees; origin of capital: private; main activity: erp product; participates in bidding: no; federal grant: yes; maturity level: g; validity of the assessment: november 2015.
org. 6: size: +270 employees; origin of capital: private; main activity: software factory/services; participates in bidding: yes; federal grant: no; maturity level: c; validity of the assessment: january 2020.
org. 7: size: +30 employees; origin of capital: private; main activity: erp product; participates in bidding: no; federal grant: yes; maturity level: f; validity of the assessment: august 2019.
org. 8: size: +50 employees; origin of capital: private; main activity: erp product/software factory; participates in bidding: yes; federal grant: yes; maturity level: f; validity of the assessment: september 2015.

only organizations 3, 6, and 8 participate in government bids. it is worth clarifying that, in brazil, the federal government launches bids to carry out software projects, and some of them require the company to have a valid assessment compliant with a quality model or standard. therefore, a company that holds a maturity model assessment can achieve a higher score than its competitors. to encourage organizations to improve their processes, softex developed a business model that offered financial support to organizations with fewer than 100 employees.
organizations interested in implementing the reference models of the mps.br program could receive financial support from the mct (ministry of science and technology) or from sebrae (support service for micro and small companies) (softex, 2020). regarding the federal grant, organizations 2, 4, 5, 7, and 8 received this benefit. table 1 also shows the mr-mps-sw maturity level that each organization accomplished in its last assessment. the study was conducted with two level g organizations (1 and 5), three level f organizations (2, 7, and 8), two level c organizations (3 and 6), and one level e organization (4).

3.2 data collection

for data collection, we sent the organizations a letter of introduction explaining the research objectives, together with a non-disclosure agreement (nda) signed by the researchers. to obtain the vision of different software development roles, we interviewed people in management positions (sponsor, director, project manager, process improvement team, and quality assurance) and software engineers (analysts, developers, and testers). table 2 shows the participants' profiles.

table 2. profile of the participants.
org. 1: 1 sponsor; 1 spi manager; 3 project managers; 1 coordinator of project managers; 2 quality assurance analysts; 4 analysts and developers (acting in both roles).
org. 2: 1 sponsor; 1 project manager; 1 development director; 3 analysts and developers (acting in both roles).
org. 3: 1 quality assurance manager.
org. 4: 1 process manager.
org. 5: 1 sponsor.
org. 6: 1 sponsor; 1 human resources manager.
org. 7: 1 sponsor; 1 project manager; 1 quality assurance manager.
org. 8: 1 sponsor.

as shown in table 2, in some organizations, due to high turnover, only one person who took part in the spi initiative was still in the company to be interviewed. we built a semi-structured script to guide the interviews. the questionnaire consisted of two sets of questions: one to characterize the organization and the interviewee's profile, and another about spi, aiming to gather information about the challenges faced after the company's assessment and the strategies to deal with these challenges. the second part also helped to obtain information about the processes considered challenging to continue after the assessment. the following questions were used as a semi-structured interview script to guide the researcher. it is worth noticing that the questions asked in the field were broader, to allow higher data coverage and richer answers; they supported the researcher during the semi-structured interview, acting more as a checklist than as a fixed route:

part 1 characterization questions
▪ can you describe the organization in terms of business and culture?
▪ what position do you currently hold in the organization?
▪ how long have you worked in the organization?
▪ what is your academic background?

part 2 questions about spi
▪ how is top management involved, and which support is offered to the spi program?
▪ what is your perception of the involvement and support of the technical team in the spi program? how have the improvement program activities changed your development activities? are the activities easier or harder to work with?
▪ is there an ongoing investment in training? which trainings are offered?
▪ is there a specific budget for the spi project (hours, staff, infrastructure)? how is the spi project structured in terms of infrastructure (environment and tools) and staff?
▪ how are changes in the organization's development process made? who defines the process activities, and who determines how they are executed? how are the changes introduced in the projects?
▪ how did the consultant evaluate the company's previous process before defining the current process? how do you evaluate the external consultancy's performance during the implementation of the improvement model (hours of service, relationship, competence)?
▪ is the company interested in renewing or evolving its maturity level? why (not)? how is the spi program aligned with the organization's strategic planning? how are these business goals monitored in the organization?
▪ is there a software engineering process group (sepg) to lead the implementation of process improvements? what is the composition of this group? how are the activities of the sepg conducted (meetings, periodicity)? what is the degree of influence of this group on the company's other groups regarding knowledge, reputation, and relationships?
▪ how constant is the organization's project flow? how are the roles and responsibilities shared within the organization? is turnover an issue in the organization? how is it avoided?
▪ how are the improvement project goals communicated to the employees?
▪ how is day-to-day communication performed in the spi project? how are the results of the spi project communicated to the employees?
▪ how are the processes used in the organization? are they used in all areas and projects?
▪ which processes are the most challenging to maintain? why?
▪ which processes are more natural to maintain? why?
▪ are there performance indicators for the spi project?
▪ how is the return on investment (roi) of the spi project measured (for instance, product quality, customer satisfaction, market expansion, estimates, cost, and term)? how are the process activities monitored (i.e., detection of nonconformities and their solution)?

3.3 data analysis

yin (2017) guides the researcher to define the logic that links the data to the study's propositions and the criteria to interpret the results. in this research, we used the model proposed by reinehr et al. (2008), which defines points of analysis (pa): concepts, grounded in the literature review, used to evaluate whether a proposition is confirmed or not in order to answer the main research question. table 3 shows the defined research propositions and the related points of analysis. the propositions, as previously explained, are statements of what the researchers expect to find in the field study, based on the previous literature. the points of analysis are the connection between the data collected in the field and the analysis of the propositions. the theoretical basis for constructing these research elements (propositions and points of analysis) was the background presented in section 2, a systematic literature review, and the survey carried out with spi specialists presented in albuquerque et al. (2018). the categories of critical factors for spi maintenance (human, organizational, spi project, and process) were used to define the propositions.
to determine the points of analysis, we used the factors related to each category:
▪ human factors: motivation and acceptance; support, commitment, and involvement; technical competencies;
▪ organizational factors: goals; communication; organizational structure; internal and external policies; return on investment; and leadership;
▪ the spi project itself: definition of strategies; resources; appropriate external consultancy service; consultancy; and
▪ process factors: level of bureaucracy; measurement program for continuous improvement.

table 3. propositions and points of analysis.
proposition p1. there are human factors that influence the abandonment of the spi program.
pa.01: training is offered for the qualification of the employees of the company.
pa.02: there is support, commitment, and involvement of organization members.
pa.03: the technical team members are motivated and willing to carry out the process activities.
proposition p2. there are spi project factors that influence the abandonment of the improvement program.
pa.04: budget and resources are available for the spi initiative.
pa.05: there is a strategy to introduce changes in software processes.
pa.06: existence of an external consultancy with the ability and competence to implement a process compatible with the company's needs.
proposition p3. there are organizational factors that influence the abandonment of the improvement program.
pa.07: existence of a strategic plan that relates the spi program to the achievement of business goals.
pa.08: leadership is available to support continuous process improvement.
pa.09: there is an organizational structure favorable to the spi program.
pa.10: there are communication mechanisms for the dissemination of the spi project.
proposition p4. there are process-related factors that influence the abandonment of the improvement program.
pa.11: there is a non-bureaucratic process that meets the needs of the company.
pa.12: there is a measurement program for continuous process improvement.

we used grounded theory open and axial coding procedures (strauss and corbin, 1998) for the qualitative analysis because it is a systematic analysis approach, which adds value in terms of academic rigor, providing validity in terms of traceability from the coding of the initial data to the final result of the analysis (o'connor, 2012). we did not intend to create a theory, and we did not follow the iterative process of conducting interviews and then analyzing the data to guide the following interviews, as oriented by strauss and corbin (1998). we also did not reach saturation as advocated by coleman et al. (2008). all the interviews were recorded and then transcribed, and we performed the analysis after all interviews were completed. the transcripts were read (more than once by the first author) and analyzed with the support of the atlas.ti tool. the first author performed the open coding activities, that is, the microanalysis of the interviews: she analyzed each transcript line by line and created codes, merging them with existing codes as appropriate when new evidence appeared. memos were created to support the analysis (also considering the field notes). then, the codes were grouped according to their properties, forming concepts that represent categories. finally, in the axial coding stage, the categories and subcategories were related to each other. all the analyses were reviewed and discussed by the other authors.
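as a reading aid, the coding chain described above (open codes grouped into negative or positive factors, which in turn form the point-of-analysis categories tied to a proposition) can be modeled as a small data structure. the python sketch below is our illustration only, not an artifact of the study (the actual analysis was performed in atlas.ti), and all identifiers are hypothetical; it uses the pa.12 example detailed with figure 1 below:

    from dataclasses import dataclass, field

    @dataclass
    class OpenCode:
        # [a] type of finding produced by line-by-line open coding
        label: str
        quote: str  # interview excerpt supporting the code

    @dataclass
    class Factor:
        # axial-coding group of open codes: negative [nf] or positive [pf] evidence
        label: str
        polarity: str  # "nf" or "pf"
        codes: list[OpenCode] = field(default_factory=list)

    @dataclass
    class PointOfAnalysis:
        # [pa] category connecting the coded field evidence to a proposition
        pa_id: str
        proposition: str
        factors: list[Factor] = field(default_factory=list)

    # negative evidence for pa.12 (measurement program), linked to proposition p4
    pa12 = PointOfAnalysis(
        pa_id="pa.12",
        proposition="p4",
        factors=[Factor(
            label="monitoring the improvement process",
            polarity="nf",
            codes=[OpenCode(
                label="lack of qa professional to guarantee the quality of the process",
                quote="this has not been done recently, because there is no "
                      "professional to guarantee the quality",
            )],
        )],
    )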
figure 1 shows how we identified the presence or the absence of a point of analysis in the interview excerpts and related them to the research propositions. as can be seen in figure 1, we used codes that differentiate the coding stages. in open coding, codes called types of findings were identified with [a]. codes from the axial coding cycle were grouped into negative factors [nf] and positive factors [pf]. subsequently, these negative and positive factors were grouped into the categories called points of analysis [pa].

figure 1. extract of codes and citations related to pa.12 monitoring.

the example in figure 1 shows a negative factor [nf]. when the researcher asked, "how is the process monitored?", two participants answered, "no. this has not been done recently, because there is no professional to guarantee the quality" and "there is no charge for non-conformities in the process". based on these statements, the code generated was "lack of qa professional to guarantee the quality of the process". the same coding process was applied to the code "lack of control and collection of process evidence", which is contrary evidence to the code "monitoring the improvement process", which, in turn, is part of the point of analysis "pa.12_measurement program". later in the codification process, this point of analysis was related to "proposition p4. processes". during the analysis, new findings emerged from the data; these codes were called new discoveries, identified by the code nd followed by a number.

4 results

4.1 analysis of individual cases

the following sections describe the analysis of each case, listing the points of analysis (pa), the new discoveries (nd followed by a number), and the participants' quotes. in addition, we present the context of spi in the implementation and maintenance periods.

4.1.1 organization 1

implementation period. the reasons for the adoption of the maturity model were process improvement and the market. the board appointed a team to work on the spi project, providing training for a group of people who participated in the definition of the mr-mps-sw level g processes. at the beginning of the implementation, spi was disseminated through different communication means (lectures, training, e-mail, and intranet); still, only the people directly involved with the process group were better informed. no consultants were hired, since people in the organization had experience implementing maturity models (pa.06). the quality assurance team monitored the process, and non-compliances were dealt with. the main difficulties were: failure in communication (as the organization is large, some people were uninformed), insufficient training, an overload of work due to the accumulation of functions, lack of human resources, bureaucracy in the process, and resistance to changes.

interviewee: training (pa.01) and communication (pa.10). "we feel that people are doing the projects; they take the templates and come to ask: but how do i do this? will i attend the course? because we feel this … that there is still a failure in the issue of communication, because there are more than 300 people in the development area, so there are many people who are not yet having this level of information."

maintenance period. organization 1 reported no intention to evolve its maturity level because the development area remained immature in project management practices.
the training (pa.01) did not cover the whole development area. the lack of support (pa.02) from top management to demand that project coordinators use the processes led process practitioners and the quality assurance team to lose motivation (pa.03); for example, the quality assurance team stopped monitoring processes because managers did not take the corrective actions needed after quality assessments. the lack of human resources (pa.04) resulted in the outsourcing of projects. there is an active process group that defined strategies to support spi (pa.05). outsourcing (nd.01) was a new aspect that emerged during the analysis: for managers, it is difficult to make outsourced projects adhere to the process methodology. there was an attempt to mentor the outsourced company, but it did not work out due to the high turnover in third-party companies.

interviewee: outsourcing (nd.01). "[outsourcing] makes it very difficult. they [i.e., the contractors] are not manageable. it's not up to us to manage how they work, their productivity. we hire contractors (...) we don't know how the work is done, by how many people, or which process is executed. it is not a partnership. it is a contract."

resistance to change was the most prominent issue among respondents. as the company is public, its president and managers may change every four years, which favors some employees' skepticism. we were told that previous management initiatives had been discontinued (nd.02), which caused instability among older employees, who tended to show disbelief and disinterest in using the processes. despite the difficulties, the process group continued to improve the process (pa.05), through, for example: i) the creation of an agile path for product development using scrum; ii) the use of canvas in the preliminary phase to plan projects with a smaller scope; iii) the use of kanban for task execution; iv) the gamification of the standard process to improve usability and foster the dissemination of process artifacts; and v) the institutionalization of supporting tools (mantis and clarity).

there are no spi program goals aligned with the company's strategic plan (pa.07), and there is no effective leadership to support the actions of the process improvement group (pa.08). the organizational structure is not adequate due to a lack of human resources and overlapping roles (pa.09). lack of communication also contributed to the demotivation to use the process (pa.10).

interviewee: communication (pa.10). "i think we have many problems. one of the hardest is that we have a serious problem with communication."

the process meets the needs of the organization (pa.11); what hinders its use is the lack of human resources to meet the demands. process monitoring (pa.12) is not performed, and no information is collected to indicate the return on investment (roi). project management was identified as the most challenging process to maintain.

interviewee: process monitoring (pa.12). "we did [quality checklists] for a long time, but the reports we generated from non-compliance had no corrective actions, because the action is not ours."

currently, the organization seeks to improve maturity in the project management process and, for this, created a group of project managers. however, the organization has not defined whether it will undergo a new level g or f assessment in the future.

4.1.2 organization 2

implementation period. the organization implemented level g and later evolved to level f.
in both implementations, the organization received financial assistance from the federal government. a project for spi was defined, and people from the development team were made available, but no resources had dedicated time for process improvement activities. changes in the processes were communicated through lectures and by the group of key people involved in defining the processes. a consultancy was contracted for both implementations (pa.06), and satisfaction with the consultancy services was reported. two people were hired to work in quality assurance management. the main difficulties were: insufficient training, lack of resources, lack of experience in spi, and the cultural changes that affected the oldest employees, who were more resistant, for example, in the configuration management activities.

interviewee: resistance (pa.03). "the most difficult of all was the acceptance by people who had been here for a long time. the main thing, it was always this. people's acceptance. unfortunately, some people did not adapt to the process, and we had to dismiss them."

maintenance period. the appraisal of organization 2 has expired. there is no intention to evolve the maturity level because managers believe that the current level meets their needs. besides, due to the country's economic crisis (nd.03), the organization had to reduce its maintenance fees to avoid losing customers. as a result, the professionals responsible for the process quality assurance (ppqa) activities were dismissed. after the appraisal, training (pa.01) was no longer available for new employees. the country's economic crisis inhibits new investments in the spi program (pa.02), reflecting on team members' motivation (pa.03) and leading to spi abandonment.

researcher: training (pa.01). "do they have training in the process to get in?" interviewee: "no. training hasn't been done lately."

there is no employee exclusively in charge of managing the spi program (pa.04), and there is no strategy for introducing process improvement changes (pa.05). concerning the consultancy, the organization reported satisfaction with the services provided (pa.06).

interviewee: resources (pa.04). "due to not pursuing further process appraisals, the quality team was dismissed. but then we reallocated the quality activities of the project to other internal people."

there are no clearly defined goals (pa.07) nor a leading process group to foster continuous improvement in organization 2 (pa.08). although the organization is small, communication about the spi program is flawed (pa.10); for example, there is no information available on the spi program's benefits. besides, organization 2 experiences financial problems (i.e., a decreased flow of contracts), and functions overlap due to its small size (pa.09).

interviewee: strategic plan (pa.07). "last year, we started putting together the organization's strategic plan, so we have the outline of it (...) but, due to time constraints, we decided not to spend too much effort, as planning activities require."

the development teams partially use the process, not because it is considered bureaucratic (pa.11), but because there are not enough employees to execute the quality assurance (qa) process. also, no measurement program (pa.12) exists to support process follow-up.

interviewee: measurement (pa.12). "(...)
having no financial resources, we ended up dismissing the quality staff (composed of two employees)."

4.1.3 organization 3

implementation period. the organization implemented level f and evolved to level c (renewing level c once). the motivations for adopting the model were the improvement of software processes, the market, and the legal need for maturity models to participate in bids. due to the quality manager's experience in renewing level c, consultancy services (pa.06) were hired to carry out only the assessment, and the organization reported satisfaction with the services provided.

maintenance period. organization 3 intends to renew its maturity level depending on its economic recovery. the company was going through a difficult financial situation (nd.03) and therefore reduced its staff. the organization does not train its employees regularly (pa.01). however, top management supports the spi program (pa.02) because the company participates in bids. part of the team remains motivated to use the process because it automates activities (pa.03).

interviewee: involvement (pa.02). "today, i see that you can always bring improvements by sharing [experiences] with the team, because i think each one knows what can improve their own process."

after downsizing, organization 3 started using open-source tools (redmine) (pa.04). there is no process group anymore (pa.04), and the process support strategies (pa.05) are carried out by the quality manager, who has experience implementing the mr-mps-sw model.

interviewee: tools (pa.04). "so, the automation, it was fundamental to cover the lack of people."

a strategic plan is aligned with the spi program objectives (pa.07), and the communication is appropriate (pa.10). notwithstanding, organization 3's difficult economic situation restricts investments in an assessment to renew its maturity level. currently, only one person is responsible for process restructuring and monitoring (pa.09); there is no process group (pa.08).

interviewee: structure favorable to spi (pa.09). "in 2015, the quality team consisted of five people. in 2016, it was reduced to three people. currently, there is only me on the quality team."

organization 3 restructured and automated the processes using a free tool (redmine) that suits its needs (pa.11); therefore, the processes are considered easy to maintain. process monitoring is supported by redmine (pa.12).

interviewee: monitoring (pa.12). "i can't identify improvements if i don't have a minimum measurement to monitor it..."

4.1.4 organization 4

implementation period. the motivations for adopting the mr-mps-sw model were the standardization of organizational processes and the ceo's prior knowledge of the model, acquired in a graduate program in software engineering. before the maturity model implementation, some teams in the organization used some scrum practices; thus, the consultancy helped define a process that would combine the scrum practices with the maturity model. the main difficulties were: i) lack of support and employee involvement; ii) lack of a process group (sepg); iii) resistance of the agile teams; iv) an attitude of imposition by the director (who believed in the model) and, sometimes, by the consultant; v) lack of tools; vi) lack of support from team leaders; and vii) focus on the result of the assessment.

interviewee: resistance (pa.03). "there was an area of the company that questioned the process because they worked on an already agile scheme."... "what did we do?
we did a process that was a little bit tailored: some things we used a little agile, some things were a little waterfall."

maintenance period. the organization does not intend to renew or evolve its maturity level. it develops software on demand and does not participate in biddings that demand specific maturity levels; scrum currently meets its needs. although training (pa.01) and top management support (pa.02) were present after the assessment, the employees were unmotivated (pa.03): employees who had worked with scrum on their projects did not accept the new process, and new employees, who had previous experience in agile methods, also resisted the process defined from the maturity model.

interviewee: motivation (pa.03). "so as not to follow the process, she justified: i can't. i am doing this project in scrum, and there is no time to do anything because we have tight deadlines..."

interviewee: veiled resistance (pa.03). "…you saw that they resisted, said it was ok because the ceo was defining it, then they said it was going to be used. but it was always like this: 'no, because i need to put more hours in the estimate because of the model...'"

the consultancy (pa.06) took into consideration the teams that worked with scrum. however, these teams did not tell the truth to the consultant and helped define a process that would not be used after the assessment. after the assessment, the organization continued to invest in the spi program (pa.04) and hired a process manager to make the mr-mps-sw process compatible with scrum. however, he had no experience with agile methods and defined a hybrid process that was also not well accepted by the teams (pa.05). organization 4 had a strategic plan, but it did not consider processes based on maturity models (pa.07). the spi program did not have effective leadership in charge of process improvement (pa.08). concerning the organizational structure, the organization has well-defined roles, which facilitates process execution (pa.09). communication was flawed (pa.10): there was no information on spi return on investment or benefits. the process defined in the implementation phase was abandoned shortly after the official assessment (pa.11). the lack of support from project managers and the organization's agile culture were the main reasons for the failure of the spi initiative. project management was pointed out as the most challenging process to maintain, as the time estimated to perform activities increased due to process activities. the measurement process was abandoned after the appraisal (pa.12).

researcher: return on investment (pa.10). "is there information on return on investment?" interviewee: "no. we do not have."

currently, the organization uses scrum, kanban, and squads. the current ceo of the organization, who has experience in agile methods, used the following strategies to manage this software process improvement initiative: i) adapting the process with agile methodologies (pa.06), aiming to meet the needs of the business; ii) training; iii) standardization of tools (jira); iv) the creation of the organization's agile manifesto (to encourage a sense of belonging); and v) improved communication between teams.

4.1.5 organization 5

implementation period. before implementing the maturity model, the organization used extreme programming (xp) and kanban practices.
however, the organization had only descriptions of isolated procedures, which generated the need for standardization. at the time of the implementation, there were three partners, one of whom actively participated in defining the processes and attended training on the model's processes. at that time, the owners supported the spi initiative. the main difficulties were: i) lack of human resources; ii) a change of external consultancy (pa.06) (failure in the model guidelines); iii) the second consultant being located in another region of brazil (difficulties in conducting the implementation); and iv) lack of a strategic plan.

maintenance period. organization 5 has no interest in renewing or evolving its maturity level because the current process meets the business's needs. besides, with the lack of external demand for certification (nd.04), there is no need to maintain an assessment using reference models, because its customers do not require such an evaluation. after the evaluation, the organization went through economic difficulties due to the country's financial crisis (nd.03), lost the contracts of the civil engineering sector, and started developing a building automation software product. this affected the owners' motivation (pa.03) and support (pa.02) for spi; they had intended to implement the model's level e.

interviewee: country's economic crisis (nd.03). "one of our biggest customers, the civil construction company, went into crisis. so, three years ago, we lost an entire segment of civil construction…"

interviewee: disbelief and demotivation (pa.03). "i wonder why i participated in this, but why did we invent this ...?"

the organization is a micro company; therefore, communication is easy (pa.10), and there was no need to provide training in the processes (pa.01). there is a shortage of resources and time (pa.04), and there is no spi project management (pa.05) or spi-specific goals (pa.07). the organization uses redmine as a tool to support daily activities (pa.04). the dissolution of the partnership (nd.05) was the main factor that negatively influenced spi, because it affected the organizational structure (pa.09) and the leadership (pa.08) due to the loss of the partner who believed in the model. the process defined at implementation time was considered bureaucratic (pa.11) and was modified with scrum practices. the current sponsor had experience with agile methods and believes that it is more effective to give the team more decision-making power than to follow processes. project management, which was considered the most bureaucratic process, was adapted with scrum practices. in the requirements management process, user stories were used together with prototyping for requirements specification and validation. there is no measurement program for continuous improvement of the process (pa.12). currently, the organization uses a process adapted with agile methods because it meets the business's needs.

interviewee: the dissolution of the partnership (nd.05). "as the company reduced the number of employees ... because we lost a partner, we didn't have time to renew the certification." "we were in the process of making the model's e-level. but then, in this process of changing partners and getting it right, we thought it was a good idea not to do it ... we don't have to do it to get the certificate…"

interviewee: bureaucracy (pa.11).
"… we fall into a planning task, and to count within our assessment, then, we had to have, for example, an action to define the communication plan. the communication plan was written once, and no one ever read it afterward... no one else used it..." 4.1.4 organization 6 implementation period. organization 6 assessed level g, level f (renewed once), and level c (renewed once) of the mr-mps-sw model but was undecided about the second renewal of the assessment of level c due to organizational restructuring caused by the fusion of companies (nd.06). the selection of the maturity model was influenced by the sponsor, who has previous project management training. the objective was to improve the process, product quality, and market. another strong motivator was the foreign policy to support spi, promoted by the model's executive body (formed of a cooperative group and external financial support). the most serious difficulty was the organization's lack of experience with process improvement that resulted in a bureaucratic process (pa11). work overload and resistance were caused (pa.03), especially for the project manager. what helped the organization achieve positive evaluation was the experience of external consultants (pa.06) and the networking between companies promoted by the cooperative group's formation. maintenance period. after the first evaluation, senior support management continued (pa.02), made the process group (pa.08) available to make adjustments to the process, intending to reduce bureaucracy (pa.11) and increase acceptance and motivation of the organization's members (pa.03). there is a policy of continuous training (pa.01). training needs are identified, with a technical training schedule (processes, programming language, and others) and behavioral training (motivation, integration, customer service, etc.). at the end of the training, an evaluation is made by the employees. interviewee: training policy (pa.01). "we carry out a needs assessment at the beginning of the year with the managers."… "after the training, hr [human resources] needs to know the attendance list, the initial reaction assessment and three months later an assessment of the effectiveness of the training…" the organizational structure is adequate (pa.09), with human resources and infrastructure (pa.04) (crm dynamics, pro-ject), with a strategic plan with spi goals aligned to the business (pa.07). when the first c-level assessment was renewed, the organization did not use external consultancy (pa.06) because one of the process group members had experience with spi consultancy. the process was tailored to the organization's needs. the audit of the process was automated (pa.12). the awareness of the benefits is subjective because there is no measurement of the return on investment (pa.12). spi's management was carried out by the sponsor, who believed in process improvement and influenced top management with the process group's support (pa.08). the main support strategy used was to facilitate the use of the process through automation and reduction of bureaucracy (pa.05). however, the determining factor for the abandonment of spi was the fusion of companies (nd.06). the fusion resulted in a clash of organizational cultures. there were changes in the business (in addition to the software factory, it started to focus on software services). there have been changes in the development process and in the way of working. 
the new manager of the development area encouraged discussions about the agility of organizational processes and the adoption of agile methods (nd.07): scrum, squads, sprint design, and other methodologies such as design thinking. some members of the process group (pa.08) left the organization, and the process defined from the maturity model ended up abandoned.

interviewee: merger of companies (nd.06), business changes. "… there was a merge with company x ... and company x brought a new portfolio. it brought an infrastructure portfolio, so we have infrastructure projects now, safety nets, so we have safety nets projects, which is very different from building software…"

interviewee: merger of companies (nd.06), change in the way of working. "one of the points, because of the merge, and, already advancing another point, is that it ends up that the software development process has changed a lot."… "we are reformulating our way of working."… "we are in that process like this: we certified a process, and today our process is already totally rigid. we are even looking at whether it will fit in for a reevaluation."

4.1.7 organization 7

implementation period. the purposes of adopting the model were the standardization of processes, product quality, marketing, and the acquisition of public contracts (at the time, there was a requirement for evaluation using the maturity models). the spi initiative was supported by the sponsor (pa.02), who provided hours for the project manager and some members of the organization to define the processes (pa.04) and provided training on the model (pa.01). people's engagement was requested (pa.03). the organization's members had no experience with spi; what motivated the selection of the model was the formation of a group of companies that were implementing the model in the region. before the assessment, they used scrum and found the first implementation of the model more complex, with bureaucracies they were not used to (pa.11). an external consultancy was hired for both implementations of the model. however, in the second implementation, there was a conflict between the external consultant and the person responsible for the implementation in the organization. it was reported that there was a change of consultancy because the consultancy had technical competence (pa.06) but lacked soft skills, showing a very imposing posture.

interviewee: consultancy service (pa.06). "our ideas didn't match; he didn't accept the suggestion to change the process. 'no, you have to do it this way.'"... "this also made it very difficult for us, especially for me, who was in charge of this company project."

maintenance period. although the organization's members reached maturity and the processes were standardized, the sponsor has no interest in renewing the assessment (pa.02). even though it met the requirements for bids in the public sector, the organization did not achieve the goal defined in the strategic plan (pa.07): acquiring contracts in the public sector.

interviewee: external pressure from customers (nd.04). "…even because, concerning public projects, which was one of the ideals for us to have certification, that's not what happened..."… researcher: "but did they ask for certification?" interviewee: "in biddings, yes."

after the evaluation, no training was available (pa.01) due to low turnover (pa.09).
there were no human resources available (pa.04) to manage spi (pa.05), and the tools used were not adequate (pa.04). there was no process group (pa.08) to lead continuous process improvement, and the members of the organization were not motivated to continue with spi (pa.03). the process, considered bureaucratic (pa.11), was adapted to the organization's needs, and they returned to using scrum with some practices of project management and requirements management. in quality assurance management, only the quality control of the product was carried out; the other level f processes were abandoned.

interviewee: bureaucracy (pa.11). "at level g, i felt the processes were very bureaucratic, plastered ..."

the monitoring of the process stopped being done (pa.12). therefore, there was no process institutionalization and no information on the return on investment (pa.10).

interviewee: monitoring of the process (pa.12). "today, we no longer do this audit of the process."

currently, the organization uses scrum, and its members are satisfied with the reduction of bureaucracy.

4.1.8 organization 8

implementation period. the objectives for implementing the model were to improve the process and product quality and to acquire public contracts (at the time, there was a requirement for certification using the models).

interviewee: objectives of spi adoption. "we had two aspects of need. one was to improve our process, aiming for better quality."… "except that there was also a legal need for participation in public bids."

a project for spi was defined, and people were involved in the definition of the processes. consultancy services were hired, and the sponsor was satisfied with the consultancy service (pa.06). communication took place through engagement meetings and training (pa.10).

maintenance period. after the evaluation, no training was available (pa.01). support from top management declined (pa.02) due to the country's economic crisis (nd.03) and the cooling of the model evaluation requirements in public bids: the organization no longer had the commercial motivation that came from the requirements of external customers (nd.04). these two factors affected the quality assurance process, because the qa professional was not hired and, therefore, there was no monitoring of the process (pa.12). the team and the sponsor were demotivated (pa.03). the team found the process bureaucratic (pa.11); besides, there was an overload from the product quality assurance activity, which was absorbed by the team, and the sponsor thought that the documentation resulted in high costs.

interviewee: country's economic crisis (nd.03). "i think the economic problem also helps, which is a consequence of it all."… "you see, if you don't have a crisis, you have the thriving thing."… "then how to hire someone exclusive to the gqa? but how do you do it? the budget does not allow it. the difficulties do not allow…"

interviewee: lack of external demand for certification (nd.04). "the bidding processes started not to charge so much, because the tcu (federal audit court) understands that ... the biddings started to do as follows: if you have a certified development methodology, you present it. if you don't have it, we do an audit. they kind of didn't charge. they're not charging anymore..."
after the evaluation, there was no spi management (pa.04), with no availability of resources (pa.04) or support strategies (pa.05), and no process group (pa.08) to define continuous improvements to the process. the organization uses team foundation server as a support tool (pa.04). another factor was turnover (pa.09), because new employees have to learn and accept the process (pa.03).

interviewee: adequate organizational structure, turnover (pa.09). "eventually, that professional a or b who was already adhering to the process changes, and then it will hurt us even more to have management."

currently, the organization no longer uses the process defined with the maturity model and has adhered to agile methods (nd.07) due to the need to streamline the process and reduce documentation costs. in addition, the private market accepts scrum well, and the public sector started to sign contracts that allow the use of scrum. the sponsor reported satisfaction and several benefits from simplifying the process (there is no need to keep creating evidence) and from the reduced conflict with clients (there is no discussion about the project scope).

interviewee: adherence to agile methods (nd.07). "we are now more with the private [sector], but with the private [sector] we can convince to use us in the agile model."

4.2 cross-analysis

this section presents the cross-analysis of the data from the eight organizations based on the research propositions. we used three criteria to characterize the points of analysis (table 4):
▪ n (not identified): the point of analysis was not identified in the organization.
▪ p (partially identified): the point of analysis was partially identified in the organization.
▪ f (fully identified): the point of analysis was fully identified in the organization.

to assess whether a proposition is confirmed, we analyzed whether its points of analysis were not identified (n) or only partially identified (p) in the organization, which means that the critical factors for maintaining spi were neglected. the results indicate that neglecting these factors can lead to the abandonment of spi programs based on maturity models. conversely, a proposition is not confirmed for the abandonment of spi when all of its points of analysis are fully identified (f) in the organization, meaning that the organization continues to address the critical spi maintenance factors after the assessment. the following section discusses these results.

table 4. analysis of the propositions (each point of analysis is rated for organizations 1 to 8, in order; "-" indicates that the point could not be evaluated in that organization).
p1: there are human factors that influence the abandonment of the spi program.
pa.01 (training is offered for the qualification of the employees of the company): p, n, n, p, n, f, n, n.
pa.02 (there is support, commitment, and involvement of organization members): p, n, p, p, n, p, n, n.
pa.03 (the technical team members are motivated and willing to carry out the process activities): p, p, p, n, n, n, n, n.
p2: there are spi project factors that influence the abandonment of the improvement program.
pa.04 (budget and resources are available for the spi initiative): p, f, p, f, n, n, n, n.
pa.05 (there is a strategy to introduce changes in software processes): f, n, f, n, n, n, n, n.
pa.06 (existence of an external consultancy with the ability and competence to implement a process compatible with the company's needs): -, f, f, p, f, f, p, f.
p3: there are organizational factors that influence the abandonment of the improvement program.
pa.07 (existence of a strategic plan that relates the spi program to business goals achievement): f, n, f, n, n, n, f, n.
pa.08 (leadership is available to support continuous process improvement): p, n, p, n, n, n, n, n.
pa.09 (there is an organizational structure favorable to the spi program): n, n, n, f, n, f, n, n.
pa.10 (there are communication mechanisms for the dissemination of the spi program): n, n, f, n, f, n, f, n.
p4: there are process-related factors that influence the abandonment of the improvement program.
pa.11 (there is a non-bureaucratic process that meets the needs of the organization): p, n, f, f, n, f, n, n.
pa.12 (there is a program for the measurement of continuous process improvement): n, n, f, n, n, f, n, n.
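the confirmation rule above is mechanical enough to be stated as code. the following python sketch is our illustration only, not part of the study's procedure, and all identifiers are hypothetical; it applies the rule to the p1 rows transcribed from table 4:

    # ratings from table 4 for proposition p1, organizations 1..8
    P1_RATINGS = {
        "pa.01": ["p", "n", "n", "p", "n", "f", "n", "n"],
        "pa.02": ["p", "n", "p", "p", "n", "p", "n", "n"],
        "pa.03": ["p", "p", "p", "n", "n", "n", "n", "n"],
    }

    def confirmed_for(org: int, ratings: dict) -> bool:
        # a proposition is confirmed for an organization when at least one
        # applicable point of analysis was neglected ("n" or "p"); it is not
        # confirmed only when every applicable point was fully identified ("f").
        # "-" marks a point that could not be evaluated and is skipped.
        values = [row[org - 1] for row in ratings.values() if row[org - 1] != "-"]
        return any(v in ("n", "p") for v in values)

    # p1 is confirmed for all eight organizations: no column of table 4 is all "f"
    assert all(confirmed_for(org, P1_RATINGS) for org in range(1, 9))

note that this rule says nothing about gaps such as pa.06, which could not be evaluated for organization 1; that gap is why the text below treats proposition p2 as only partially confirmed.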
5 discussion

the research question guiding this work is: "how does the abandonment of software process improvement programs occur?" to answer this question, we conducted case studies in software organizations whose assessments had either expired (organizations 1, 2, 4, 5, 7, and 8) or were close to expiring (organizations 3 and 6). we identified that an organization is abandoning the improvement process when the interview participants report that the processes are no longer being used at all (organizations 4 and 8) or are only partially being used (organizations 1, 2, 3, 5, 6, and 7). from the data analysis, we identified five pitfalls to spi and their relation to the research question. we found that organizations do not set goals to pursue continuous process improvement: there is a lack of continuity in spi management and in the sponsor's interest. even after all the effort invested in implementing spi, sponsors may not be satisfied with the results, which can lead the organization to return to its previous state or to define a new way of working and improving its processes other than the maturity model.

pitfall 1: neglect of human factors.

explanation: we found that organizations do not provide sufficient training (pa.01) (organizations 1 and 4) or stopped providing training after the assessment (organizations 2, 3, 5, 7, and 8). in these organizations, the lack of training negatively affected the use of the improved process, because people do not use what they do not know. training a group of people only during the spi implementation period is not enough to ensure process understanding, and the dissemination of knowledge about process improvement is complex, especially in large organizations (organizations 1 and 4), where communication can be more difficult. top management support (pa.02) can influence the provision of investments for spi initiatives: organization 2 dismissed the quality team, and in organization 3, the quality team was reduced to a single member. as for organization 1 (public capital), the quality team stopped monitoring the process due to the lack of top management support. in organizations 5, 6, 7, and 8, senior management's support was perceived only during the implementation period. regarding motivation (pa.03), we identified its partial occurrence in organizations 1, 2, and 3, because motivation depends on key people, and some people show resistance. in organizations 4, 5, and 7, which already used agile methods before the implementation, employees were resistant and unmotivated to use the new process. organizations 6 and 8 started to adhere to agile methods (nd.07).
in organization 8, it was possible to observe the sponsor's satisfaction with the reduction of documentation costs and the greater understanding with the client regarding the project scope. besides, this change in process was well accepted by the employees (especially by the younger programmers). thus, proposition p1 is confirmed (table 4).

discussion: these results are consistent with the spi literature, which reports that training is essential for disseminating knowledge (alqadri et al., 2020) and for providing awareness of the benefits of spi (peixoto et al., 2010). the importance of top management being convinced of spi's benefits, for both the implementation and the continuity of spi, is highlighted by almeida et al. (2011). resistance and lack of motivation were present in all organizational contexts; different issues influenced them, but the lack of human resources was a common point. the resistance literature corroborates these findings when it reports that work overload discourages new work practices (narciso et al., 2014; anastassiu et al., 2020). it is worth mentioning the resistance of the agile teams in organizations 4, 5, and 7, observed at two distinct moments: a veiled resistance by the organization members during the implementation period (due to top management's interest in the success of the evaluation) and a more open resistance after the evaluation. in organization 4, the teams did not use the process, even with the consultancy's effort to involve these teams in discussions to define a process that would meet the organization's needs. this finding corroborates the research by albuquerque et al. (2019), which identified that teams from organizations that use agile methods have difficulties implementing and sustaining spi based on maturity models.

pitfall 2: neglect of factors related to the spi project.

explanation: spi project management is a critical success factor (montoni et al., 2011). however, we identified negligence in this regard. in most of the investigated organizations, during the implementation period there was a defined project with dedicated resources available (pa.04); however, after the evaluation, there was no continuity in the management of the spi project. in other organizations (for example, 2 and 7), the lack of management occurred even during the implementation period. the lack of a dedicated resource (pa.04) to manage spi negatively affects the continuous improvement of the process and the actions taken to promote people's motivation, that is, the definition of spi support strategies (pa.05). only organization 1 has a process group (pa.04) that continues to take actions (pa.05) to promote spi; however, it is difficult for a process group to keep the spi program running without senior management support (pa.02). in organizations 3 and 6, processes were automated to increase compliance (pa.05). regarding this proposition, our data were inconclusive, because the point of analysis regarding the consultancy (pa.06) could not be evaluated in all organizations; for example, organization 1 did not hire consultancy services. thus, proposition p2 is partially confirmed (table 4).
discussion: according to the spi literature (montoni et al., 2011; coleman et al., 2008; peixoto et al., 2010; almeida et al., 2011), spi initiatives are affected by the lack of human resources, resulting in work overload and, therefore, in the prioritization of activities related to the product. according to sulayman et al. (2012), the spi team needs to have the workforce available to define the processes, train the team members on these processes, and supervise their use. for this reason, having a full-time person for coordination activities is essential for the success of the spi initiative (guerrero et al., 2004).

pitfall 3: neglect of organizational factors.

explanation: there are no clearly defined goals (pa.07) or effective leadership (pa.08) from top management and project managers to foster continuous improvement. besides, there is role overlapping (pa.09), and communication is flawed (pa.10). only organization 4 had no role overlapping; however, its agile culture hindered the acceptance of the new processes. this difficulty also occurred in organizations 5 and 7, which already used agile methodologies before the implementation. we identified two new findings that affected the organizational structure and resulted in spi abandonment: the dissolution of the partnership (nd.05) and the merger of companies (nd.06). in organization 5, the dissolution of the partnership (nd.05) negatively affected the spi initiative because the organization lost its leadership, that is, the person who believed in the model; the organization then returned to agile methods, because the remaining partners believe in their value. in organization 6, the merger of companies impacted the abandonment of spi because there was a restructuring of the organizational processes: in this restructuring, a new development manager with experience in agile methods defined a new way of working with senior management support. thus, proposition p3 is confirmed (table 4).

discussion: the importance of considering organizational culture in spi initiatives was reported by alqadri et al. (2020) and shih et al. (2010). shih et al. (2010) emphasized that sepg (software engineering process group) leaders should consider culture when a new spi approach is implemented, because it may be incompatible with the existing culture. in organizations 4, 5, and 7, whose organizational cultures were used to working with agile methodologies, it was challenging to continue spi with maturity models. we identified that groups such as the process group and the quality assurance group provided the most effective support and leadership to sustain spi; our results are consistent with the research by uskarci et al. (2017). regarding the new findings, it was possible to observe the influence that the organizational structure has on spi initiatives and how decisions relate to the previous knowledge and experience of those with decision-making power: in organizations 4, 5, and 6, the choice to use agile methods was due to the previous experience of managers with decision-making power.

pitfall 4: neglect of process factors.

explanation: regarding the existence of a non-bureaucratic process (pa.11), we found that all organizations adjusted and simplified their processes after the official assessment. in organizations 1, 2, 6, and 7, the process is partially used (quality assurance and measurement are not performed). organizations 4 and 8, which have an agile culture, abandoned the processes entirely.
notably, only organization 3 (which participates in bidding processes) continued to use and monitor the processes (pa.12). however, it had not renewed its maturity level because it was experiencing financial struggles at the time of the interview. we found that some organizations abandoned spi with maturity models due to adherence to agile methodologies (nd.07), as was the case with organizations 6 and 8. these are organizations that started using agile methods after the evaluation. their sponsors reported satisfaction with these methodologies due to the reduction of bureaucracy and documentation costs. thus, proposition p4 is confirmed, as can be seen in table 4.
discussion: the results showed that abandoning the spi program does not mean abandoning the organizational processes altogether. organizations 1, 2, and 3 have adapted and simplified their processes to meet their new business needs. these results align with the spi literature, which reports that processes tend to be simplified, stabilizing in a minimum process (coleman et al., 2008). organizations 4, 5, 6, 7, and 8 have been looking for other ways to mature their processes using agile methods (fontana et al., 2015). it is worth mentioning that it is possible to implement an spi initiative combining agile methodologies and maturity models. however, in the context of this research, only organization 4 tried this tailoring and was unsuccessful due to the boycott by its agile teams.
pitfall 5 negligence with external factors
explanation: we identified external factors that impact the support of top management. we identified the negative impact of outsourcing it projects (nd.01) on organization a (a large public company). project managers reported difficulty in applying their processes to outsourced organizations. the main reason was the high turnover, which made learning difficult and hindered the use of the processes. the country's economic crisis (nd.03) has restricted investments in resources for spi. also, we found that regular changes in the state government (nd.02) demotivate process managers from adhering to the changes made by top management, because the company's board can change every four years and, therefore, potentially change the internal software process quality policies. the lack of external pressure from customers (nd.04) is another factor that discouraged some organizations that had a commercial motivation to adopt spi with maturity models, that is, the interest in participating in public biddings; however, currently in the country, this requirement is not imposed by all public bodies. organizations working in the private sector reported no requirement to use an officially evaluated process.
discussion: unlike the literature, our study identified new findings negatively influencing spi, which we call external factors. outsourcing (nd.01) undermined the use of the improvement process due to the lack of standardization of outsourced contracts. this indicates that it is vital for the organization's top management to define procedures for managing third-party contracts. regarding the regular changes in the state government (nd.02), the results show that consistency in quality policies is necessary. the frequent change in the use of software process methodologies, or in the definition of work procedures, may demotivate organization members at any organizational level.
it is quite possible that this lack of managerial constancy may demotivate members in private organizations as well; this is a point worth investigating. the country's economic crisis (nd.03) has been causing economic instability in organizations. these organizations react by cutting resources, prioritizing the staff who develop the software and dismissing the quality team. finally, the lack of external pressure from the client (nd.04) indicates that organizations that adopted spi for purely commercial reasons, and not for improving their processes, tend to be frustrated with the results, because the public sector has changed its way of acquiring software development services. thus, we formulated a new proposition, p5: there are external factors that influence the abandonment of the improvement program.
6 limitations and threats to validity
to evaluate the research quality and validity, we used the guidelines defined by yin (2017) and runeson et al. (2012) regarding quality criteria for empirical research. regarding construct validity, the propositions are based on the research carried out by albuquerque et al. (2018). propositions and analysis points were validated in a workshop held with experienced professionals in spi programs. regarding internal validity, grounded theory procedures were followed: the propositions were investigated using only the data collected from the interviews. the first author analyzed the interviews and built the networks. the other authors (professionals with experience in maturity model implementation and assessment) reviewed and analyzed quotes, codes, and categories. regarding external validity, we interviewed participants from eight different software organizations. we included organizations of various sizes, locations, and businesses. three organizations do not participate in biddings, and only one is a public company. some organizations only provided one participant for the interview (due to high turnover); still, we were careful to select participants who had been effectively involved since the maturity model implementation. as expected in in-depth qualitative research, the results cannot be broadly generalized (eisenhardt, 1989) but present relevant evidence on how abandonment occurs after valid spi appraisals. nonetheless, we plan to replicate the research in more organizations. finally, to ensure research reliability, all the research protocol and data analysis steps were defined and followed.
7 conclusion
this study aimed to understand how abandonment occurs in spi programs after successful assessments based on maturity models. results from four organizations (1, 2, 3, and 4) were published in albuquerque et al. (2020), which indicated that abandonment occurs when there is negligence regarding factors internal to the organization (human, organizational, spi project, and processes) and factors external to the organization (outsourcing nd.01, political change nd.02, and economic crisis of the country nd.03). in this paper, results from four more organizations (5, 6, 7, and 8) were presented. concerning internal factors, they all corroborated our previous research (albuquerque et al., 2020). however, new findings were identified: two organizational factors (dissolution of the company nd.05 and merger of companies nd.06) and a process factor (adherence to agile methodologies nd.07).
concerning external factors, this research confirmed the negative influence of the country's economic crisis on spi and identified a new external factor (lack of external demand for certification nd.04). another point that draws attention is that some organizations carried out management activities during the spi project only until the official assessment; after that, some of them neglected the proper management of the spi project. moreover, other organizations neglected management activities from the beginning of the spi project. considering that the literature and our experience indicate that adequate management is a critical success factor for spi projects, it is not surprising that such organizations fail to continue the spi activities carried out so far. as a contribution, we highlight the practical applicability of our results for the software industry. industry professionals can use this study's results to examine their initiatives and avoid pitfalls that can lead to abandoning spi. for example, before starting an spi initiative, evaluate the organization's business and assess whether it is the best time to invest in process improvement. evaluate whether the organizational structure is appropriate and whether there is a steady flow of ongoing projects, so as to avoid restricting investments (e.g., in training) and reducing teams such as the quality team. before starting an spi initiative, know the improvement model that will be implemented, and be aware that the results come in the long term. it is also essential to involve the development team in selecting the process improvement model and in the process definition to avoid resistance. the consultancy can only help define a valuable process for the organization; it is the development team's commitment that leads to spi success. the technical skill of the consultancy is useless without the spontaneous participation of the team members. effectively combining agile methods and maturity models requires experienced consultants to overcome the natural barriers of this integration. a balanced process can combine agile methods and model requirements in a sustainable path. as future work, we are starting to replicate this study in other software organizations that use maturity models (mps-sw and cmmi), considering different sizes, maturity levels, company capital, and organizational contexts. our goal is to deepen our understanding of the moves organizations make after the official appraisal.
acknowledgments
we thank the financial support provided by the araucária foundation (fa), agreement number 001/2017. we also thank unirio for its financial support (edital ppq-unirio 2019 and 2020).
references
albuquerque, r., fontana, r.m., malucelli, a., reinehr, s. (2019). agile methods and maturity models assessments: what's next? in: proceedings of the systems, software and services process improvement (eurospi), edinburgh, scotland, pp. 619-630.
albuquerque, r., malucelli, a., reinehr, s. (2018). software process improvement programs: what happens after official appraisal. in: proceedings of the international conference on software engineering and knowledge engineering (seke), san francisco, usa.
albuquerque, r., santos, g., malucelli, a., reinehr, s. (2020). abandonment of a software process improvement program: insights from case studies. in: proceedings of the brazilian symposium on software quality (sbqs), maranhão, brazil.
almeida, c.d.a., albuquerque, a.b., macedo, t.c. (2011). analysis of the continuity of software processes execution in software organizations assessed in mps.br using grounded theory. in: proceedings of the international conference on software engineering and knowledge engineering (seke), miami, florida, usa.
alqadri, y., budiardjo, e.k., ferdinansyah, a., rokhman, m.f. (2020). the cmmi-dev implementation factors for software quality improvement: a case of xyz corporation. in: proceedings of the 2nd asia pacific information technology conference (apit), pp. 34-40.
anastassiu, m., santos, g. (2020). resistance to change in software process improvement: an investigation of causes, effects and conducts. in: proceedings of the brazilian symposium on software quality (sbqs), maranhão, brazil.
canedo, e.d., santos, g.a. (2019). factors affecting software development productivity: an empirical study. in: proceedings of the xxxiii brazilian symposium on software engineering (sbes), brazil.
cmmi institute (2018). cmmi for development v2.0. available at: https://cmmiinstitute.com/products/cmmi/cmmi-v2-products.
cmmi institute (2019). radix: delivers results with cmmi and behavioral driven development in agile environment. published: 25 july, 2019.
coleman, g., o'connor, r. (2008). investigating software process in practice: a grounded theory perspective. journal of systems and software, v.81, issue 5, pp. 772-784.
eisenhardt, k. (1989). building theories from case study research. academy of management review, v.14, issue 4, pp. 532-550.
fontana, r.m., meyer jr., v., reinehr, s., malucelli, a. (2015). progressive outcomes: a framework for maturing in agile software development. journal of systems and software, v.102, pp. 88-108.
guerrero, f., eterovic, y. (2004). adopting the sw-cmm in a small it organization. ieee software, v.21, issue 4, pp. 29-35.
iso/iec (2015). iso/iec 33020:2015: information technology – process assessment – process measurement framework for assessment of process capability. geneva: iso.
iso/iec (2017). iso/iec/ieee 12207:2017 systems and software engineering – software life cycle processes.
kalinowski, m., weber, k., franco, n., zanetti, d., santos, g. (2014). results of 10 years of software process improvement in brazil based on the mps-sw model. in: proceedings of the international conference on the quality of information and communications technology (quatic), portugal, pp. 28-37.
montoni, m.a., rocha, a.r.c. (2011). using grounded theory to acquire knowledge about critical success factors for conducting software process improvement implementation initiatives. international journal of knowledge management, v.7, issue 3, pp. 43-60. doi: 10.4018/jkm.2011070104.
nalepa, g., fontana, r.m., reinehr, s., malucelli, a. (2019). using agile approaches to drive software process improvement initiatives. in: proceedings of the systems, software and services process improvement (eurospi), edinburgh, scotland, pp. 495-506.
narciso, h., allison, i. (2014). overcoming structural resistance in spi with change management. in: proceedings of the international conference on the quality of information and communications technology (quatic), pp. 8-17.
o'connor, r. (2012). using grounded theory coding mechanisms to analyze case study and focus group data in the context of software process research. information science reference (igi global), chapter 13, pp. 256-270. doi: 10.4018/978-1-4666-0179-6.ch013.
peixoto, d.c.c., batista, v.a., resende, r.f., isaías, c. (2010). how to welcome software process improvement and avoid resistance to change. in: proceedings of the international conference on software process (icsp), germany, pp. 138-149.
reinehr, s., pessôa, m.s.p., burnett, r.c. (2008). software product lines in the financial sector in brazil. in: proceedings of the xxviii national congress on production engineering (enegep), rio de janeiro, brazil.
runeson, p., host, m., rainer, a., regnell, b. (2012). case study research in software engineering: guidelines and examples. wiley, 256 pages.
shih, c.c., huang, s.j. (2010). exploring the relationship between organizational culture and software process improvement deployment. information & management, v.47, pp. 271-281.
society for the promotion of brazilian software excellence – softex (2020). mps general guide to software. http://www.softex.br/mpsbr.
strauss, a., corbin, j. (1998). basics of qualitative research, 2nd ed. sage publications, thousand oaks, london, new delhi, 312 pages.
sulayman, m., urquhart, c., mendes, e., seidel, s. (2012). software process improvement success factors for small and medium web companies: a qualitative study. information and software technology, v.54, pp. 479-500.
uskarci, a., demirörs, o. (2017). do staged maturity models result in organization-wide continuous process improvement? insight from employees. computer standards & interfaces, v.52, pp. 25-40.
yin, r. (2017). case study research: design and methods (applied social research methods), 6th ed. los angeles: sage publications.

journal of software engineering research and development, 2023, 11:5, doi: 10.5753/jserd.2023.2582. this work is licensed under a creative commons attribution 4.0 international license.
naming practices in object-oriented programming: an empirical study
remo gresta [ federal university of são joão del-rei | remoogg@aluno.ufsj.edu.br ]
vinicius durelli [ federal university of são joão del-rei | durelli@ufsj.edu.br ]
elder cirilo [ federal university of são joão del-rei | elder@ufsj.edu.br ]
abstract
currently, research indicates that comprehending code takes up far more developer time than writing code. given that most modern programming languages place little to no limitation on identifier names, and developers are thus allowed to choose identifier names at their own discretion, one key aspect of code comprehension is the naming of identifiers. research on naming identifiers shows that informative names are crucial to improving the readability and maintainability of programs: essentially, intention-revealing names make code easier to understand and act as a basic form of documentation. poorly named identifiers tend to hurt the comprehensibility and maintainability of software systems. however, most computer science curricula emphasize programming concepts and language syntax over naming guidelines and conventions.
consequently, programmers lack knowledge about naming practices. this article is an extension of our previous study on naming practices. previously, we set out to explore the naming practices of java programmers. to this end, we analyzed 1,421,607 identifier names (i.e., attribute, parameter, and variable names) from 40 open-source java projects and categorized these names into eight naming practices. as a follow-up study to further investigate naming practices, we examined 40 open-source c++ projects and categorized 1,181,774 identifier names according to the previously mentioned eight naming practices. we examined the occurrence and prevalence of these categories across c++ and java projects, and our results also highlight in which contexts identifiers following each naming practice tend to appear more regularly. finally, we also conducted an online survey questionnaire with 52 software developers to gain insight from the industry. all in all, we believe the results, based on the analysis of 2,603,381 identifier names, can be helpful to enhance programmers' awareness and contribute to improving educational materials and code review methods.
keywords: naming identifiers, program comprehension, mining software repositories
1 introduction
reading and comprehending source code plays a vital role in software development (allamanis et al., 2014). evidence suggests that choosing proper names for identifiers in software systems can positively impact code comprehension (lawrie et al., 2007b; fakhoury et al., 2018; oliveira et al., 2020). although giving meaningful names to identifiers is a widely accepted best practice, coming up with proper names is challenging (deissenboeck and pizka, 2006). as stated by host and ostvold (2007), even though naming is part of daily life for programmers, it entails a great deal of time and thought: names should convey to others the purpose of the code (martin, 2008) and reflect the meaning of domain concepts (marcus et al., 2004). meaningful identifier names are key to bridging the gap between intention and implementation (wainakh et al., 2021). therefore, given that poorly chosen identifier names might hinder source code comprehension (schankin et al., 2018), using meaningful identifier names is a recommended practice present in several coding style guides and conventions. according to the java language naming conventions (oracle.com/java/technologies/javase/codeconventions-namingconventions.html), names should be "short yet meaningful". in a similar fashion, the google c++ style guide (google.github.io/styleguide/cppguide.html) states that names should be "as descriptive as possible". martin (2008) argues that programmers should choose intention-revealing names as a way to avoid disinformation. he also advocates that names have to contain meaningful distinctions and be descriptive (not abbreviated). the gnu coding standards (www.gnu.org/prep/standards/) posit that programmers should not "choose terse names – instead, [they should] look for names that give useful information about the meaning of the variable". although programming communities and internationally renowned experts have proposed best practices related to naming identifiers, little is known about the extent to which programmers follow these naming practices (arnaoudova et al., 2016). we argue that, without proper guidance, programmers are more prone to resort to less than ideal naming practices such as using number series or noise words.
for example, bad naming practices can foster the sense that names such as person person1 and person person2 are intuitive and understandable. careless naming practices might hinder not only code comprehension but also overall team communication. therefore, we argue that it is crucial for software engineering researchers to learn how to support programmers by understanding how naming practices are used "in the wild" and, through this better understanding, defining naming guidelines for educational materials (charitsis et al., 2021) and code review (nyamawe et al., 2021). in our previous study (gresta et al., 2021), we set out to investigate naming practices in the context of java programs; thus, we looked only into how java programmers name attributes, parameters, and variables. this article is an extension of our previous work on naming practices in which we also investigate naming practices in the context of c++ programs. to investigate how c++ and java programmers name attributes, parameters, and variables, we carried out an empirical study in which we analyzed 1,421,607 identifier names from 40 open-source java projects and 1,181,774 identifier names from 40 open-source c++ projects. we performed repository mining to determine how often eight categories of naming practices occur within and across these projects. we also looked at how prevalent these naming practices are in certain code contexts (i.e., attribute, parameter, method, for, while, if, and switch). in this extended version, our results are based on two large samples of programs: the previous version of this study analyzed 40 open-source java programs, and the results from this extended version also include the analysis of 40 open-source c++ projects. moreover, to understand industry practices, we conducted an online survey questionnaire to gain insight from software programmers. through the survey, we gathered quantitative data on programmers' perceptions about the use and occurrence of the investigated naming practices. the online survey questionnaire ran from november 2021 to january 2022 and had 52 responses.

table 1. java programs used in our experiment. columns: project, loc, contributors, commits, then the count and percentage for each naming category (kings, median, ditto, cognome, diminutive, shorten, index), and the total number of categorized names.
aeron 108,442 86 14,409 606 6.34 450 4.71 5,205 54.46 933 9.76 1,932 20.21 114 1.19 318 3.33 9,558
androidutilcode 39,030 32 1,317 179 7.74 21 0.91 1,170 50.56 385 16.64 73 3.15 77 3.33 409 17.68 2,314
archunit 100,276 49 1,499 91 3.07 16 0.54 1,744 58.86 596 20.11 303 10.23 9 0.30 204 6.88 2,963
boofcv 650,019 14 4,520 7,483 23.19 1,696 5.26 1,573 4.87 266 0.82 880 2.73 1,354 4.20 19,017 58.93 32,269
butterknife 13,279 97 1,016 135 21.95 8 1.30 358 58.21 68 11.06 14 2.28 4 0.65 28 4.55 615
corenlp 581,374 107 16,280 2,372 9.53 831 3.34 4,281 17.20 3,864 15.52 610 2.45 1,622 6.52 11,310 45.44 24,890
dropwizard 74,215 364 5,789 53 1.85 14 0.49 1,993 69.64 343 11.98 269 9.40 29 1.01 161 5.63 2,862
dubbo 179,477 386 4,681 754 6.39 81 0.69 6,983 59.19 1,096 9.29 644 5.46 369 3.13 1,870 15.85 11,797
eventbus 8,369 20 507 4 1.33 0 0.00 195 65.00 59 19.67 23 7.67 1 0.33 18 6.00 300
fastjson 179,996 158 3,863 8,205 49.88 77 0.47 4,255 25.87 1,264 7.68 243 1.48 387 2.35 2,019 12.27 16,450
glide 76,418 129 2,583 105 2.77 22 0.58 2,442 64.47 629 16.61 194 5.12 45 1.19 351 9.27 3,788
guice 72,980 59 1,931 178 2.85 46 0.74 3,871 61.92 1,043 16.68 216 3.45 51 0.82 847 13.55 6,252
hdiv 30,631 11 1,086 106 9.72 11 1.01 573 52.52 63 5.77 177 16.22 31 2.84 130 11.92 1,091
ical4j 24,130 35 2,303 132 11.22 15 1.28 682 57.99 167 14.20 48 4.08 2 0.17 130 11.05 1,176
j2objc 1,810,274 75 5,284 5,523 10.13 866 1.59 9,302 17.06 4,750 8.71 1,276 2.34 3,978 7.30 28,827 52.87 54,522
jenkins 175,150 654 31,156 658 6.15 161 1.51 3,273 30.61 794 7.43 314 2.94 185 1.73 5,308 49.64 10,693
jtk 204,105 9 1,373 2,627 13.03 4,557 22.60 1,008 5.00 55 0.27 37 0.18 1,068 5.30 10,813 53.62 20,165
junit4 31,242 151 2,474 55 3.15 18 1.03 985 56.38 248 14.20 32 1.83 47 2.69 362 20.72 1,747
keywhiz 23,337 32 1,538 89 5.67 23 1.46 1,036 65.99 178 11.34 90 5.73 14 0.89 140 8.92 1,570
libgdx 272,510 505 14,661 49,315 47.83 21,653 21.00 11,800 11.44 1,831 1.78 2,041 1.98 2,252 2.18 14,215 13.79 103,107
litiengine 75,877 20 3,324 316 11.86 46 1.73 771 28.94 448 16.82 253 9.50 21 0.79 809 30.37 2,664
lottie-android 16,258 102 1,292 80 7.41 104 9.64 442 40.96 145 13.44 126 11.68 21 1.95 161 14.92 1,079
mockito 55,751 220 5,523 234 9.87 12 0.51 1,288 54.35 285 12.03 126 5.32 38 1.60 387 16.33 2,370
mpandroidchart 25,232 69 2,068 134 6.85 36 1.84 385 19.69 232 11.87 155 7.93 38 1.94 975 49.87 1,955
nutch 141,710 43 3,215 236 7.68 28 0.91 1,353 44.01 467 15.19 113 3.68 164 5.34 713 23.19 3,074
okhttp 48,465 235 4,848 455 16.01 39 1.37 1,902 66.92 161 5.67 126 4.43 21 0.74 138 4.86 2,842
orienteer 55,681 12 2,274 63 2.68 27 1.15 1,122 47.77 584 24.86 395 16.82 22 0.94 136 5.79 2,349
picasso 9,136 97 1,368 64 8.82 36 4.96 546 75.21 27 3.72 10 1.38 7 0.96 36 4.96 726
rest-assured 73,511 105 2,020 121 5.85 32 1.55 1,440 69.57 288 13.91 107 5.17 14 0.68 68 3.29 2,070
rest.li 523,972 89 2,617 2,158 9.26 533 2.29 10,054 43.16 4,712 20.23 3,458 14.84 237 1.02 2,143 9.20 23,295
retrofit 26,513 152 1,865 60 2.49 7 0.29 1,691 70.14 352 14.60 18 0.75 6 0.25 277 11.49 2,411
riptide 27,072 18 2,131 4 0.52 0 0.00 650 85.08 22 2.88 46 6.02 8 1.05 34 4.45 764
rxjava 468,957 277 5,877 2,371 10.25 34 0.15 4,275 18.48 573 2.48 115 0.50 373 1.61 15,387 66.53 23,128
spring-boot 343,138 804 32,096 443 2.74 95 0.59 10,868 67.24 1,354 8.38 3,002 18.57 91 0.56 309 1.91 16,162
tomcat 343,703 61 23,140 1,142 6.68 263 1.54 7,374 43.16 1,675 9.80 696 4.07 846 4.95 5,089 29.79 17,085
twelvemonkeys 99,418 42 1,334 379 8.43 123 2.73 912 20.28 808 17.96 588 13.07 327 7.27 1,361 30.26 4,498
unirest-java 15,979 43 1,603 12 1.75 1 0.15 310 45.19 58 8.45 23 3.35 22 3.21 260 37.90 686
webmagic 12,926 40 1,119 28 2.87 3 0.31 763 78.26 80 8.21 27 2.77 10 1.03 64 6.56 975
xchart 24,406 50 1,451 119 7.93 31 2.07 628 41.84 338 22.52 50 3.33 26 1.73 309 20.59 1,501
zxing 107,064 109 3,582 208 9.78 137 6.44 695 32.68 267 12.55 108 5.08 157 7.38 555 26.09 2,127
total 7,111,470 5,519 217,869 87,297 20.79 32,153 7.65 110,198 26.24 31,508 7.50 18,958 4.51 14,088 3.35 125,688 29.93 419,890
this extended version of our study makes the following contributions:
• our results show that the naming practice categories (kings, median, ditto, diminutive, cognome, shorten, index, and famed) appear in all 80 open-source projects and are prevalent in practice;
• we identified the most common names across projects. the top-3 recurrent names are value, result, and name. many single-letter names are also commonly used in projects (e.g., i, e, s, c). we also observed that the majority of common names are associated with integer or string values;
• we perceived that programmers' naming practices are context-specific. single-letter names (index and shorten) seem to be more present in conditional or loop statements (if, for, while). in contrast, identifiers with the same name as their types tend to appear in large-scope contexts (e.g., attribute);
• we noted that, in general, a project's characteristics might not impact the prevalence of one particular naming practice category: there is no representative correlation between size, number of contributors, or number of commits and the predominance of some naming practice category;
• finally, we observed that diminutive is the naming practice category most adopted by survey respondents and median is the least adopted one. this result seems to align well with our observation about the prevalence of the naming practices in 80 open-source object-oriented programs.
the remainder of this paper is organized as follows. section 2 presents the background and related work on naming practices. section 3 details how we carried out our study. section 4 outlines the results of our empirical study and provides a general discussion. section 5 describes the threats to validity. finally, section 6 presents some concluding remarks.

table 2. c++ programs used in our experiment. columns: project, loc, contributors, commits, then the count and percentage for each naming category (kings, median, ditto, cognome, diminutive, shorten, index), and the total number of categorized names.
asio 196,656 53 3,034 135 3.65 27 0.73 1,664 44.99 32 0.87 657 17.76 220 5.95 964 26.06 3,699
assimp 614,926 462 10,934 78 6.76 74 6.41 739 64.04 10 0.87 94 8.15 13 1.13 146 12.65 1,154
bitcoin 541,474 853 32,661 46 4.58 27 2.69 621 61.79 8 0.80 11 1.09 39 3.88 253 25.17 1,005
bluematter 812,822 2 5 3,972 29.20 1,350 9.92 1,893 13.91 1,560 11.47 506 3.72 685 5.03 3,639 26.75 13,605
calligra 1,602,456 263 101,573 47 3.41 2 0.15 743 53.92 137 9.94 267 19.38 14 1.02 168 12.19 1,378
chaste 587,473 25 5,384 2,954 40.46 882 12.08 673 9.22 667 9.14 470 6.44 14 0.19 1,641 22.48 7,301
citra 428,966 222 9,141 27 5.11 19 3.60 255 48.30 4 0.76 36 6.82 27 5.11 160 30.30 528
clickhouse 1,422,903 921 83,445 114 4.13 40 1.45 2,228 80.78 66 2.39 108 3.92 14 0.51 188 6.82 2,758
core 9,262,610 25 3,058 4,044 5.29 1,516 1.98 45,465 59.47 10,741 14.05 10,799 14.13 420 0.55 3,459 4.52 76,444
freecad 4,842,675 383 27,647 528 6.94 210 2.76 4,705 61.83 100 1.31 513 6.74 181 2.38 1,372 18.03 7,609
gacui 504,062 3 2,238 8 0.62 50 3.91 576 45.00 44 3.44 294 22.97 15 1.17 293 22.89 1,280
gecko-dev 28,303,180 4,910 785,724 1,116 4.57 1,548 6.34 11,737 48.11 2,567 10.52 4,805 19.69 311 1.27 2,314 9.48 24,398
godot 4,976,013 1,590 41,538 525 9.87 270 5.08 1,711 32.17 128 2.41 1,934 36.36 107 2.01 644 12.11 5,319
gromacs 1,680,900 74 20,825 89 5.03 104 5.88 994 56.16 38 2.15 250 14.12 54 3.05 241 13.62 1,770
grpc 717,441 708 50,493 76 3.40 49 2.19 799 35.75 68 3.04 842 37.67 44 1.97 357 15.97 2,235
kdenlive 205,469 94 15,645 4 0.43 0 0.00 671 72.93 66 7.17 36 3.91 34 3.70 109 11.85 920
kdevelop 338,648 245 42,650 52 4.70 3 0.27 723 65.37 61 5.52 93 8.41 10 0.90 164 14.83 1,106
krita 983,754 336 57,706 80 5.93 12 0.89 573 42.48 109 8.08 216 16.01 44 3.26 315 23.35 1,349
lammps 1,626,808 185 29,307 281 11.35 56 2.26 1,272 51.37 199 8.04 169 6.83 85 3.43 414 16.72 2,476
mediapipe 235,825 2 111 11 1.54 47 6.58 511 71.57 13 1.82 1 0.14 26 3.64 105 14.71 714
mlir 75,845 2,285 415,644 9 5.70 18 11.39 83 52.53 24 15.19 8 5.06 2 1.27 14 8.86 158
mongo 5,015,374 571 63,227 917 3.17 381 1.32 14,644 50.66 761 2.63 2,770 9.58 2,019 6.99 7,412 25.64 28,904
mysql-server 3,733,193 88 170,220 803 6.94 124 1.07 7,941 68.60 713 6.16 949 8.20 141 1.22 904 7.81 11,575
obs-studio 482,886 477 10,466 22 3.42 9 1.40 429 66.72 57 8.86 59 9.18 5 0.78 62 9.64 643
opencv 2,166,493 1,360 31,603 1,598 11.96 859 6.43 5,672 42.45 367 2.75 376 2.81 730 5.46 3,761 28.14 13,363
openoffice 6,894,647 21 7,657 3,977 5.82 1,703 2.49 39,683 58.06 9,796 14.33 9,453 13.83 335 0.49 3,397 4.97 68,344
percona-server 3,777,210 238 185,334 849 7.35 127 1.10 7,887 68.32 712 6.17 913 7.91 142 1.23 914 7.92 11,544
proxysql 121,989 90 4,680 7 1.38 12 2.37 219 43.20 10 1.97 46 9.07 37 7.30 176 34.71 507
pytorch 1,792,819 2,155 43,944 56 2.10 111 4.15 1,472 55.07 35 1.31 164 6.14 115 4.30 720 26.94 2,673
qtbase 2,714,097 783 55,238 185 4.51 89 2.17 2,403 58.54 258 6.29 229 5.58 132 3.22 809 19.71 4,105
rocksdb 497,140 628 10,766 41 1.66 52 2.10 1,494 60.36 21 0.85 34 1.37 59 2.38 774 31.27 2,475
server 1,967,124 300 195,145 22 1.59 2 0.14 874 63.01 40 2.88 172 12.40 33 2.38 244 17.59 1,387
tensorflow 3,284,592 3,068 125,560 778 5.67 747 5.45 8,108 59.13 235 1.71 279 2.03 499 3.64 3,067 22.37 13,713
terminal 360,717 313 2,855 159 3.69 49 1.14 2,640 61.20 118 2.74 311 7.21 124 2.87 913 21.16 4,314
vtk 3,690,369 352 81,218 500 7.78 216 3.36 2,167 33.74 147 2.29 1,137 17.70 503 7.83 1,753 27.29 6,423
winget-cli 305,116 317 539 64 2.56 62 2.48 1,252 50.00 65 2.60 111 4.43 312 12.46 638 25.48 2,504
xbmc 1,094,954 785 59,641 42 9.77 2 0.47 208 48.37 29 6.74 83 19.30 20 4.65 46 10.70 430
yarp 1,029,531 77 17,416 45 2.25 18 0.90 1,021 51.13 91 4.56 352 17.63 65 3.25 405 20.28 1,997
yuzu 488,099 203 20,860 30 19.61 7 4.58 76 49.67 0 0.00 6 3.92 3 1.96 31 20.26 153
zerotierone 137,784 58 5,409 34 2.05 64 3.85 975 58.70 12 0.72 62 3.73 56 3.37 458 27.57 1,661
total 99,515,040 25,525 2,830,541 24,325 7.28 10,938 3.27 177,801 53.24 30,109 9.01 39,615 11.86 7,689 2.30 43,444 13.01 333,921

2 background and related work
this section presents some background about names and related studies on naming identifiers. we introduce this section by presenting an overview of the role of names in software development.
2.1 naming
names identify classes, attributes, methods, variables, and parameters (lawrie et al., 2006). they were originally designed to be pieces of code used to represent values in memory (tofte and talpin, 1997), and they have now become the primary source of information in software development (lawrie et al., 2006; ratiu and deissenboeck, 2006): programmers rely on existing names in their code comprehension journey (takang et al., 1996). indeed, high-quality names have a significant influence on the comprehension of source code (avidan and feitelson, 2017). arnaoudova et al. (2016) have acknowledged the critical role that the source code lexicon plays in the psychological complexity of software systems and coined the expression "linguistic antipatterns" (las) to denote poor practices in the naming, documentation, and choice of identifiers that might hinder program understanding. they argue that poor practices might lead programmers to make wrong assumptions and waste time understanding source code (arnaoudova et al., 2016). deissenboeck and pizka (2006) characterized a name as being a fully spelled word or even an abbreviation. names can also be composed of two or more words, might include words that do not exist, or even be single alphabetical characters. however, the proper use of words in names is a significant issue in software development (feitelson et al., 2020). in martin's book (martin, 2008), tim ottinger drew up a series of simple rules to guide programmers on naming identifiers. according to ottinger, programmers have to focus on creating intention-revealing names (the name by itself should be capable of informing what it does). they also have to avoid using non-informative words (e.g., words with multiple meanings, words with little differentiation between themselves, or number series). ottinger also advocates that names should be pronounceable and searchable. for instance, it is impractical to discuss, in a code review session, source code composed of words that programmers cannot pronounce. coding style guides and conventions also aim to address the challenges of naming identifiers (dos santos and gerosa, 2018). however, such rules are usually hard to enforce, as discussed in martin's book clean code (martin, 2008). caprile and tonella (2000) proposed an approach for improving the meaningfulness of identifier names. the approach entails the following steps: (i) extracting identifier names; (ii) normalizing identifier names; and (iii) applying the changes to the source code.
the proposed rules for creating meaningful names aim to guarantee that each word composing a name belongs to a dictionary of standard words and complies with existing grammar. deissenboeck and pizka (2006) proposed a set of precise rules for constructing concise and consistent names. in the interest of preserving consistency, the authors advocate that a single name must represent only one concept. the rules, therefore, ensure that one concept will not be represented by multiple identifier names. in order to preserve conciseness, the rules ensure that names chosen by programmers stand for the concepts they are indeed trying to convey. more recently, feitelson et al. (2020) suggested a three-step method to help programmers systematically come up with meaningful names. the model encompasses the following steps: (i) selecting the concepts to include in the name; (ii) choosing the words to represent each concept; and (iii) creating a name from these words. the authors demonstrated that programmers could use the model to choose names that are superior (in terms of meaningfulness) to randomly chosen names.
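to make the three-step method concrete, consider a small worked sketch of our own in java (the domain, the chosen concepts, and the final name are illustrative assumptions, not an example taken from feitelson et al., 2020):

```java
// a worked sketch (ours) of the three-step naming method for a variable
// holding the largest salary observed in a department:
//   step (i):   select the concepts - maximum, salary, department
//   step (ii):  choose words for each concept - "max", "salary", "dept"
//   step (iii): compose the words into a name - maxDeptSalary
public class NamingSteps {
    static double maxDeptSalary(double[] deptSalaries) {
        double maxDeptSalary = Double.NEGATIVE_INFINITY;
        for (double salary : deptSalaries) {
            maxDeptSalary = Math.max(maxDeptSalary, salary);
        }
        return maxDeptSalary;
    }
}
```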
2.2 names in software quality
there have been many studies that examine how names affect comprehension and programmers' efficiency. avidan and feitelson (2017) conducted an experiment involving ten programmers in hopes of understanding the impact of identifier names on program comprehension. they observed that, when identifier names were changed from fully spelled words to single-letter ones, the fully spelled version was perceived as more understandable. hofmeister et al. (2017) also concluded that abbreviations and single-letter names decrease code comprehension and could indicate low-quality code, as observed by butler et al. (2010) and kawamoto and mizuno (2012). butler et al. (2010) showed that source code containing poor-quality identifier names was associated with findbugs warnings. kawamoto and mizuno (2012) also observed that concise identifier names have a substantial effect on fault-proneness in netbeans. takang et al. (1996), based on a survey conducted with 89 computer science students, concluded that the combination of identifier names and comments in the code provides a minor improvement in code comprehension. hence, improving identifier names seems to be a better option than including comments in the code. spending more time choosing meaningful identifier names can result in less work during software maintenance (lawrie et al., 2007a). low-quality names can affect code negatively by causing confusion and misinformation. the study conducted by lawrie et al. (2007a) found that the quality of identifier names improves over time and is also related to the software license: modern software systems contain more high-quality names, and proprietary ones include more abbreviations than open-source projects. moreover, a study investigating the semantic nature of identifier names in four large-scale open-source projects showed that the number of commits and contributors tended to influence the quality of names: projects with a high number of commits and contributors tend to have more identifier names drawn from a large corpus of existing words (gresta and cirilo, 2020).
3 empirical study setup
this section describes the empirical study design. we conducted an empirical study to characterize how c++ and java programmers name attributes, parameters, and variables. specifically, we analyzed 1,421,607 identifier names (i.e., attribute, parameter, and variable names) from 40 java projects and categorized these names into eight naming practice categories. afterwards, we expanded our analysis by selecting a sample of 40 c++ projects. upon analyzing this sample, we found 1,181,774 identifier names, which we then categorized according to the aforementioned eight naming practice categories. we used the results of categorizing identifier names from these two samples to provide answers to the research questions discussed in the next subsection.
3.1 goal and research questions
we set out to probe into how common eight naming practices are "in the wild" (i.e., in real-world software systems) – see section 3.2. more specifically, our goal is to contribute towards a better understanding of their prevalence in attribute, parameter, and variable naming in c++ and java. we believe a more insightful interpretation of the results of our study can be obtained from the standpoint of a researcher interested in helping programmers by defining naming guidelines for educational material and code review. our main goal is to provide answers to the following research questions (rqs):
• rq1: how prevalent are the eight naming practice categories? we set out to investigate whether identifier names in open-source projects can be categorized according to eight naming practice categories and how common these naming practices are across c++ and java projects;
• rq2: are there context-specific naming practice categories? we set out to examine if specific naming practice categories tend to occur more often in certain contexts (e.g., attribute, parameter, method, if, for, while, switch);
• rq3: do the naming practice categories carry over across different c++ and java projects? we attempt to explore the prevalence of the categories spanning multiple c++ and java projects and identify any correlation between software metrics and programmers' naming practices;
• rq4: what is the perception of software developers about the investigated naming categories? we set out to probe into programmers' perceptions regarding the use and occurrence of the eight investigated naming practices.
3.2 naming practice categories
the categories presented in this subsection are a compilation of programmers' practices reported in several studies (arnaoudova et al., 2016; beniamini et al., 2017; alsuhaibani et al., 2021) and books (martin, 2008; dileo, 2019). inspired by antipattern templates (brown et al., 1998), in order to explain the naming practice categories, we frame the discussion of each category in terms of the following elements: category name, examples, motivation (why), consequences of the naming practice, and recommendations.
3.2.1 kings
this category represents identifier names composed of numbers at the end. example: string name1 and string name2 or integer arg1 and integer arg2 represent arbitrary distinctions as number series. why: programmers often opt to employ names that fall into this category to distinguish between identifiers that appear in the same scope. consequences: names with numbers at the end, however, are not very informative and do not represent intentional naming (martin, 2008; dileo, 2019). recommendation: usually, identifiers represent different things; whenever that is the case, they should be named accordingly (martin, 2008).
3.2.2 median
this category is a variation of the kings category and comprises identifier names composed of numbers in the middle. example: the names fastuint64tobuffer and base64bytes contain numbers that might be representing 64-bit values. why: numbers in the middle, in general, are used to denote the value stored in the attribute/variable or even to provide some distinction among similar identifier names. consequences: names with numbers in the middle can potentially be harder to search for in the source code, hard to pronounce, and also very similar to other names that differ only in terms of the numbers that appear somewhere in the middle (martin, 2008). recommendations: programmers should use numbers only when necessary and surround numbers with pronounceable words (martin, 2008).
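for illustration, the following hypothetical java fragment (ours, not drawn from the studied projects) contrasts kings and median names with the intention-revealing alternatives the recommendations point to:

```java
// hypothetical fragment illustrating the kings and median categories.
import java.nio.ByteBuffer;

public class InvoiceReport {
    // kings: numbers at the end encode an arbitrary series.
    String name1;
    String name2;
    // intention-revealing alternatives for the same data:
    String customerName;
    String supplierName;

    // median: a number in the middle, here plausibly denoting a 64-bit value.
    byte[] fastUint64ToBuffer(long value) {
        return ByteBuffer.allocate(8).putLong(value).array();
    }
}
```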
3.2.3 ditto
the ditto category consists of identifier names spelled in the same way as their types. example: timezone is spelled as its type timezone, in the same way that the name object has the same name as its type (object). why: naming identifiers according to the respective type is an easy option to avoid mental mappings (which usually are associated with the problem domain concepts). consequences: this naming practice might result in names that are harder to map to their purposes when used in larger scopes, and it tends to cause misinformation when the type name changes but the identifier names do not (martin, 2008; alsuhaibani et al., 2021). recommendations: avoid using ditto-based names in very large scopes and/or in contexts in which other names can conflict with them (martin, 2008).
3.2.4 diminutive
this category encompasses identifier names that are a chunk of their respective type name. example: listener is an example of a name in this category when its associated type is named enginetestlistener. the name nfruleset ruleset is also considered a chunk of its type. why: developers usually rely on short names to avoid overloading the reader with many concepts. consequences: when used in large-scope contexts, names that fall into this category might impair code comprehension (martin, 2008). recommendations: programmers should use names that properly convey the identifier's purpose within the local context and scope (martin, 2008).
3.2.5 cognome
identifier names in this category contain the name of the respective type as an additional suffix or prefix. example: an identifier namestring includes in its name the respective type name (string). why: usually programmers resort to adding suffixes to names to help them remember the types. consequences: encoding the type into names might place an extraneous cognitive load on the programmer (martin, 2008; dileo, 2019). recommendations: give identifiers names that are meaningful without having to resort to adding type information to the names (martin, 2008).
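the three categories above can be seen side by side in the following hypothetical java declarations (the enginetestlistener type is an assumption borrowed from the diminutive example in the text):

```java
// hypothetical declarations illustrating ditto, diminutive, and cognome.
import java.util.TimeZone;

public class MeetingScheduler {
    TimeZone timeZone;                       // ditto: name spelled as its type
    EngineTestListener listener;             // diminutive: a chunk of the type name
    String streetString;                     // cognome: type name used as a suffix

    // intention-revealing alternatives suggested by the recommendations:
    TimeZone organizerZone;
    EngineTestListener testProgressListener;
    String street;
}

// assumed helper type so the fragment is self-contained.
class EngineTestListener { }
```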
3.2.6 index and shorten
these categories represent similar naming practices: naming an identifier with a single-letter word. the index category represents names with one arbitrary letter. names in the shorten category are the starting letters of their respective types. example: the names integer i and integer j fall into the index category, and person p and string s are examples of shorten names. why: single-letter names are traditionally used to identify counters in loops. consequences: single-letter names usually are not easy to locate in the source code (unsearchable) and, when employed in large scopes, can be hard to understand (martin, 2008; dileo, 2019; beniamini et al., 2017). recommendations: use single-letter names only in local and small scopes; otherwise, intent-revealing names are better (martin, 2008).
3.2.7 famed
this category includes very common names; that is, when naming becomes arbitrary, programmers need to come up with convenient defaults. famed names appear in almost every source code, potentially in similar contexts, such as in loop statements (e.g., for). example: the word i is a recurrent identifier name used in loops to denote counters. why: very popular identifiers are part of the programmer's mindset and can be quickly remembered and understood. implications: when used in an indiscriminate fashion, they may cause misinformation (martin, 2008; alsuhaibani et al., 2021). recommendations: use intent-revealing names even in short-scope contexts (martin, 2008; alsuhaibani et al., 2021).
3.3 data extraction and analysis
projects selection. our sample comprises 40 open-source java projects and 40 c++ projects hosted on github. these projects are listed in tables 1 and 2. we included widely used projects, most of which have been under development for at least five years (e.g., fastjson, jenkins, junit4, mockito, retrofit, spring-boot, tomcat, pytorch, and tensorflow). also, some projects were taken into account because they appear in a curated list of "awesome" projects (java-lang.github.io/awesome-java). tables 1 and 2 give an overview of the examined projects. as shown in these tables, our java and c++ samples cover somewhat small codebases (with less than 10k loc) and large-scale ones (with over 100k loc). overall, we selected heterogeneous java and c++ projects from a broad range of domains: e.g., software testing, game design, web application development, image manipulation, and natural language processing. the selected projects also have a reasonable number of attribute, parameter, and variable names and were developed collaboratively by diverse groups of programmers. therefore, we consider that we have selected a somewhat representative set of java and c++ projects. the java projects were collected in july 2021 from github by cloning and storing their respective repositories. in a similar fashion, we extracted the information from the selected c++ projects in january 2022. after storing the repositories, we extracted three common software metrics: (i) the total lines of code (we excluded non-functional code such as comments and white-space); (ii) the number of commits; and (iii) the number of contributors. to answer rq3, we correlated these metrics with the prevalence of the categories in projects.
names extraction. in order to extract identifier names from each project, we created a parser based on the srcml tool (collard et al., 2013). srcml (srcml.org) is a multi-language parsing tool for the analysis and manipulation of source code. srcml turns source code into a document-oriented xml format, which allows for queries using xpath. for example, the srcml format contains structural information (markup tags) about identifier declarations, associated types, and context. we extracted 2,603,381 names from the 80 collected projects. after applying the naming categorization (see section 3.2), we obtained a total of 753,811 identifier names distributed across the categories (kings, median, ditto, diminutive, cognome, index, shorten), as shown in tables 1 and 2. the experimental package is available on github (github.com/rng-lab/naming-practices-analysis). to investigate and get an overview of the elements in the famed category, we used the entire dataset extracted from both programming languages.
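as a rough sketch of this pipeline (not the authors' actual parser), the java snippet below queries a srcml document with the jdk's built-in xpath api. the file name example.xml and the exact element layout are assumptions, based on srcml producing <decl> elements with <type> and <name> children; such a file could be produced beforehand with the srcml command line (e.g., srcml Example.java -o example.xml):

```java
// a minimal sketch (ours): pairing declared identifier names with their
// types by querying a srcml xml document via xpath.
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class NameExtractor {
    public static void main(String[] args) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder().parse("example.xml");

        // local-name() sidesteps the srcml namespace; each <decl> is assumed
        // to hold a declaration whose <type> and <name> children we pair up.
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList decls = (NodeList) xpath.evaluate(
                "//*[local-name()='decl']", doc, XPathConstants.NODESET);

        for (int i = 0; i < decls.getLength(); i++) {
            String type = xpath.evaluate("./*[local-name()='type']", decls.item(i));
            String name = xpath.evaluate("./*[local-name()='name']", decls.item(i));
            System.out.println(type.trim() + " -> " + name.trim());
        }
    }
}
```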
we examined the name of each extracted identifier and the associated type to answer rq1 and rq3. therefore, for each naming practice category, we report the occurrences in the studied projects and across them. to answer rq2, we analyzed the context where identifiers were declared.
survey design and sampling. to answer rq4, we designed an online questionnaire containing fifteen closed-ended questions related to naming practices. a brief description (in portuguese) and an example accompanied these questions (see appendix a). we also included two initial questions to collect demographic information about the respondents. the respondents had to point out their experience in software development as a single choice from four options: under two, two to five, six to 10, or over ten years; and also their education level (undergraduate, graduate, postgraduate). we selected a web-based questionnaire to conduct our survey because it maximizes the number of possible respondents. google forms (www.google.com/forms) was chosen to host the questionnaire and enable data collection and pre-processing. the questionnaire was first trialed within the authors' organizations, with one of the authors registering possible observed issues. some minor adjustments were made to ensure the consistency and clarity of the questions. finally, the questionnaire link was posted to multiple websites (e.g., forums) and online groups (e.g., discord, whatsapp).
4 experimental results
in this section, we present the results of our empirical study around the rqs described in the previous sections.
4.1 rq1: how prevalent are the naming practice categories?
to answer rq1, we analyzed the categories kings, median, ditto, diminutive, cognome, shorten, and index regarding how commonly they appear in the projects in our samples. tables 1 and 2 list how common each of these categories is across the 80 investigated projects.

table 3. the top 10 names in the ditto category (name, number of repetitions, number of projects).
ditto in java programs:
url 2,421 24
list 1,464 32
file 1,444 32
method 1,044 29
context 1,042 25
object 991 29
uri 968 25
node 844 21
type 593 30
date 526 25
ditto in c++ programs:
t 1,227 34
string 1,134 18
uint8_t 564 15
args 247 22
t 231 20
std 143 19
type 141 19
handle 96 17
mode 45 16

considering the identifier names in the chosen java projects, 20.79% are composed of numbers at the end (kings), 7.65% have numbers in their middle (median), 26.24% are spelled the same as their types (ditto), 7.50% contain the whole type as a sub-part (cognome), 4.51% have in their spelling a sub-part of their respective types (diminutive), 3.35% are single-letter names composed of the first letter of their types (shorten), and 29.93% are arbitrary single-letter names (index). as for the c++ projects in our sample, only approximately 7.28% of the identifier names fall into the kings category, 53.24% of the identifiers are named according to their respective types (ditto), around 9% follow the cognome naming practice, 11.86% of the c++ identifier names are diminutive, only 2.3% belong to the shorten category, and approximately 13% of the c++ identifier names are single-letter names (index). these results indicate that the use of single-letter names (index) is a widespread naming practice in object-oriented programming. indeed, beniamini et al. (2017) have observed that single-letter names account for 9–20% of names in java programs.
as stated by them, the most commonly occurring single-letter name is i, and in some cases, j is also highly used. in addition, we observed that single-letter names representing contractions of their respective type are not so common (shorten), but are prevalent across projects (see section 4.3). programmers seem to be conscious of the implications of single-letter names (hofmeister et al., 2017) and thus avoid choosing such a naming practice: this category represents only 3.35% (14,088) of the examined java names and 2.3% (7,689) of the identifier names in c++ projects.

table 4. the most common names (famed): name, number of repetitions, number of projects, most common type, number of occurrences of that type, and number of different types.
famed in java programs:
value 16,940 40 string 3,345 598
result 12,975 39 int 1,924 887
name 11,374 40 string 10,208 116
i 11,172 39 int 9,794 139
e 10,225 40 throwable 1,851 589
index 8,224 38 int 7,184 83
key 7,696 35 string 3,187 205
s 7,442 35 string 2,771 318
c 7,337 35 int 1,468 441
t 6,989 37 throwable 1,210 336
a 6,970 34 float 739 575
b 6,511 38 int 983 486
type 6,162 40 class 1,523 315
input 6,008 37 string 565 277
p 5,256 35 int 381 443
source 5,025 37 string 765 263
n 5,010 34 int 2,930 165
request 4,719 32 request 1,489 212
context 4,437 37 context 1,042 241
id 4,216 36 string 1,523 104
famed in c++ programs:
i 5,421 40 int 2,362 151
value 3,912 40 double 427 268
x 3,856 36 double 858 250
result 3,771 40 t 448 231
index 3,106 38 int 869 88
n 3,027 37 int 729 159
ctx 2,964 22 opkernelconstruction 622 105
name 2,545 37 string 950 187
type 2,534 40 int 306 426
b 2,370 39 bool 386 219
p 2,351 37 void* 190 412
size 2,285 39 size_t 619 119
context 2,279 34 opkernelconstruction 501 133
s 2,254 35 status 427 243
len 2,101 34 uint32 463 47
node 2,093 30 node 154 286
v 1,983 38 double 118 253
data 1,832 37 void* 441 211
val 1,821 35 int 192 199
c 1,776 38 char 246 199

figure 1. naming practices distribution over java programming statements (percentage of each naming category per statement context; columns: kings, median, ditto, diminutive, cognome, index, shorten).
attr 30.84% 13.56% 29.01% 6.20% 9.01% 10.63% 0.76%
for 17.49% 10.60% 13.15% 2.38% 6.32% 45.70% 4.36%
if 7.98% 2.07% 13.28% 2.84% 6.31% 52.99% 14.53%
method 18.64% 3.46% 27.31% 5.43% 9.36% 32.78% 3.04%
param 19.53% 8.90% 29.10% 2.95% 4.81% 32.46% 2.24%
switch 12.16% 2.11% 14.32% 8.59% 4.62% 44.77% 13.42%
while 9.51% 0.91% 13.43% 2.11% 5.98% 55.79% 12.27%
according to avidan and feitelson (2017), the worst kind of name is a misleading name. the habit of choosing names that represent arbitrary sequential distinctions (kings) also turned out to be a common practice among java and c++ programmers. however, number-series naming is considered a bad practice when creating meaningful names in object-oriented programming: it is a non-informative option, which might harm code comprehension and maintainability. the use of numbers in the middle of names, although present in the studied names, does not appear to be a recurrent naming practice. we observed that the most common numbers used in the middle of names are: (i) 0, 1, 2, 3, 4, 5, and 6, usually conveying some distinction; and (ii) 8, 16, 32, and 64, denoting identifiers that might represent 8-, 16-, 32-, or 64-bit values, respectively.

figure 2. naming practices distribution over c++ programming statements (percentage of names per statement)
  statement  kings   median  ditto   diminutive  cognome  index   shorten
  attr        9.34%   9.16%  20.75%  24.27%      32.68%    3.41%   0.38%
  for        21.98%   6.39%  26.81%   4.96%       4.78%   32.42%   2.65%
  if         13.31%   3.75%  17.01%   6.42%       2.97%   44.80%  11.75%
  method     22.64%   6.14%  25.17%   8.44%       5.18%   28.61%   3.81%
  param       3.98%   1.68%  65.86%  10.54%       5.65%   10.32%   1.97%
  switch      9.59%   1.26%  20.25%   5.33%       1.26%   51.74%  10.56%
  while       8.71%   3.42%  17.17%   7.49%       4.88%   48.90%   9.44%

scenarios in which programmers choose names that are variants of their type are also common. for example, names that contain their whole type as a sub-part (cognome) account for 7.50% of the identifier names in java projects and around 9% in c++ programs. often, these identifier names follow prefix/suffix (noise word) conventions, such as streetstring, listpersons, and floatarg. noise words are redundant and should never appear in names; in general, streetstring is not better than street. short names are generally easier to comprehend, and one of the first things a programmer can do to keep identifier names short is to avoid adding unnecessary information. in contrast, names that are a sub-part of their type (diminutive) are not so common. such names are hard to search for and are not very meaningful in most contexts.

4.1.1 very common names
in feitelson et al. (2020), the authors observed that the probability of two programmers choosing the same name is low: the median probability was only 6.9%. at the same time, when a specific name is chosen, it is usually understood and often used by most programmers (avidan and feitelson, 2017; swidan et al., 2017). in fact, we observed that some names are used very frequently. the top-3 most common names in java programs are (see table 4): (i) value (16,940 occurrences); (ii) result (12,975 occurrences); and (iii) name (11,374 occurrences). it might be expected that i is a widespread name (beniamini et al., 2017), but many other single-letter names are also commonly used across java projects (e.g., e, s, c, t, a, b, p, n); most of them are among the top-10 most common names. another interesting observation is the presence of index and key among the top-10 most common names. overall, some of the common identifier names in table 4 are popular in the programmer's vocabulary: value, result, name, index, key, type, input, source, request, context, id. as for c++ programs, the three most common identifier names are (i) i (5,421 occurrences), (ii) value (3,912 occurrences), and (iii) x (3,856 occurrences).
according to our results, many of the identifier names shown in table 4 are widely used in programs written in both java and c++: value, result, name, index, type, context, i, b, n, p, and s. it turns out that value appears among the top three most used identifier names both in java and c++. java programmers seem to have a slight preference for the names result and name in comparison to c++ programmers. as mentioned, some single-letter names are widely used by programmers in both languages, with i being the most commonly used single-letter name in java and c++. further analysis of the names in table 4 and their corresponding most common types led to interesting results about programmers' rationale when programming in java and c++. as noted by beniamini et al. (2017), analyzing this link yields interesting results because it makes it possible to understand the meaning related to names frequently used by programmers, especially single-letter names. we can observe that most identifier names are associated with int variables (e.g., result, i, index, c, b, p, n) or string types (e.g., value, name, key, s, input, source, id). as shown in a survey conducted by beniamini et al. (2017), single-letter names such as i and j are understood as counter variables (integer values) and are most of the time used as loop control variables. there are other interesting findings. for example, in java programs the single-letter name e is usually correlated with error and exception (beniamini et al., 2017); our results show that e is mainly associated with the throwable type. in the same way, s is a single-letter name essentially associated with string (see table 4). however, we also found some counter-intuitive results. for instance, contrary to our expectations, we observed that in programs written in java the single-letter name b is not linked with boolean values (beniamini et al., 2017) but with integer values. additionally, the identifier name t is mainly associated with throwable, which is somewhat counter-intuitive because t is also often used to name and convey the idea of time-related constant values and variables, or variables that hold temporary values (beniamini et al., 2017). other names that seem to have meaningful associations are the following: type, which is generally associated with the class type; and context and request, which are often associated with the context and request types. our results would seem to suggest that the underlying meaning of identifier names varies a lot. for example, the name result was associated with 887 different types (see table 4). the name i, which intuitively is associated with index (int), also assumes 139 other different types; nevertheless, in most cases (9,794 out of 11,172), it is associated with integer values. the name name seems to be usually associated with the string type: 10,208 out of 11,374 occurrences are associated with string.

4.2 rq2: are there context-specific naming practice categories?
to answer rq2, we investigated the predominance of the naming practice categories in particular contexts (attribute, parameter, method, for, while, if, and switch). the results are presented in figures 1 and 2. we found that, while some naming conventions (allamanis et al., 2014) acknowledge the use of single-letter words (index and shorten) to name a local, temporary, or loop variable, this practice is much more pervasive than any other (the sketch below contrasts the two kinds of scope).
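as a minimal, hypothetical java illustration of this scope sensitivity (the orders list and the process method are invented), a single-letter index name reads fine in a tight loop but obscures intent when given a long-lived scope:

```java
import java.util.List;

class ScopeExample {
    // long-lived scope: a single-letter attribute hides what is being counted;
    // a name like retryCount would reveal the intent
    private int n;

    void processAll(List<String> orders) {
        // short-lived scope: the conventional loop counter is widely accepted
        for (int i = 0; i < orders.size(); i++) {
            process(orders.get(i));
        }
    }

    private void process(String order) { /* elided */ }
}
```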
the exception is the naming of attributes in java and c++: java programmers prioritize the ditto and kings naming practices, while c++ programmers tend to use cognome, ditto, and diminutive. surprisingly, names with numbers at the end appear 30,655 times in our study as java class attributes and only 4,066 times as class attributes in c++ projects. especially in large-scope contexts, kings names should always be avoided by programmers. in contrast, using ditto names in such a case seems to be a reasonable choice. ides (e.g., eclipse and intellij idea) usually analyze the scope and generate suggestions from the current context, and these suggestions often include information regarding the respective type.

focusing on particular contexts, we can see that programmers' practices are context-specific. for example, the use of practices that might result in meaningful names (e.g., ditto) is more common in long-scope contexts (attribute and method) than in short-scope ones (if, for, while, switch). especially in c++ projects, ditto makes up the lion's share of parameter names. java and c++ programmers seem to adopt less descriptive names in the context of switch and while statements. as shown in figures 1 and 2, index names appear more often inside contexts surrounded by if, for, switch, and while statements, where their occurrence is widely accepted (kernighan and pike, 1999; beniamini et al., 2017). however, as observed by avidan and feitelson (2017), naming a plural entity with a single-letter word may camouflage the meaning of the respective identifier: it might not be natural to interpret that the identifier stores more than one object.

table 5. spearman correlation (each cell: corr, p-value)
  category    loc, java       loc, c++        commits, java   commits, c++    committers, java  committers, c++
  kings        0.337, 0.038    0.391, 0.014    0.150, 0.365    0.199, 0.222    0.053, 0.748      0.090, 0.583
  median       0.254, 0.123    0.004, 0.978    0.054, 0.743   -0.197, 0.226   -0.081, 0.627      0.070, 0.668
  ditto       -0.517, 0.001   -0.049, 0.763   -0.216, 0.191    0.074, 0.649    0.101, 0.545     -0.041, 0.801
  diminutive  -0.021, 0.898    0.335, 0.037    0.008, 0.959    0.225, 0.166   -0.171, 0.304     -0.025, 0.875
  cognome     -0.227, 0.169    0.268, 0.098   -0.300, 0.066    0.188, 0.250   -0.178, 0.283     -0.103, 0.532
  index        0.341, 0.036   -0.330, 0.040    0.133, 0.421   -0.311, 0.054   -0.098, 0.554      0.010, 0.950
  shorten      0.387, 0.016   -0.196, 0.229    0.124, 0.453   -0.110, 0.501   -0.068, 0.681      0.128, 0.435

the predominance of kings and index as parameter names does not agree with the findings of avidan and feitelson (2017). their experiment indicated that parameter names contribute more to code comprehension than any other names (e.g., attributes or local variables). since parameters are part of the method header and the starting point of the comprehension task, programmers pay special attention to parameter names in order to better understand the method behavior (avidan and feitelson, 2017). however, every naming practice category we studied is used to name parameters, although, as observed by avidan and feitelson (2017), parameter names are often more carefully chosen by programmers.

4.3 rq3: do the naming practice categories carry over across different java and c++ projects?
in hopes of answering rq3, we analyzed the prevalence of the naming practices across multiple projects. tables 1 and 2 list the categories by project. all selected projects turned out to have problematic names, which suggests that the investigated naming practice categories are probably not uncommon.
even the most popular projects (e.g., fastjson, jenkins, junit4, mockito, retrofit, spring-boot, tomcat, tensorflow, and pytorch) contain naming practices that might result in less meaningful names. as highlighted in tables 1 and 2, ditto and index are very common naming practices; these practices are even dominant (representing more than 50% of the analyzed identifiers) in some projects. for example, ditto names are widely used in java and c++ programs, accounting for 85.08% of the identifier names in riptide (java), 80.78% in clickhouse (c++), 78.26% in webmagic (java), 72.93% in kdenlive (c++), 68.60% in mysql-server (c++), 68.32% in percona-server (c++), 65.99% in keywhiz, and 54.46% in aeron. the problem with ditto is that when the type changes, the identifier name might lose its meaning (scalabrino et al., 2017). index names appear to be more common in java programs. for instance, these identifier names account for 58.93% of all identifiers in boofcv (java) and 66.53% in rxjava (java). index names seem not to be very common in c++: they are most frequent in proxysql, rocksdb, and citra, which have around 34.7%, 34.71%, and 30.30% of their identifier names following this naming practice, respectively. in some isolated cases, a naming practice is dominant in a specific project, such as kings in fastjson (49.88%) and libgdx (47.83%). on the other hand, the naming practices cognome, diminutive, and shorten are not dominant in any specific project. shorten, in particular, seems to be a naming practice that most programmers try to avoid. as mentioned, shorten names are usually not easy to search for in the source code and, when employed in large-scope contexts, they tend to be hard to understand.

to better comprehend whether a project's characteristics may influence the prevalence of one practice, we looked at the correlation between common software metrics (lines of code, number of contributors, and number of commits) and the predominance of the naming practice categories. table 5 summarizes the spearman test results. the results show no representative correlation between the investigated project characteristics and the categories of naming practices. overall, we can observe a low correlation between the number of contributors and the prevalence of any category. one might surmise that an increase in the number of programmers would be beneficial towards removing bad naming practices; however, this does not seem to be the case. the same rationale applies to the number of commits: as a project evolves, the quality of its identifier names might evolve or decay. however, in contrast to deissenboeck and pizka (2006), who stated that identifier names are subject to decay during software evolution, our results suggest that this might not be the case. looking specifically at loc, we can observe some compelling correlations. for example, there is a negative correlation (rho -0.517) between size and the ditto category (for java programs); therefore, names spelled in the same way as their respective types tend to be considerably more common in small projects. on the other hand, large java projects tend to contain more names involving practices such as index (rho 0.341) and shorten (rho 0.387).
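for readers who want to reproduce this kind of analysis, the following self-contained java sketch computes spearman's rho as the pearson correlation over average ranks (ties sharing their mean rank); the five (loc, ditto-prevalence) pairs in main are invented for illustration, not data from this study:

```java
import java.util.Arrays;

// minimal sketch (not the scripts used in this study) of spearman's rho
public class SpearmanSketch {

    // 1-based average ranks; tied values receive the mean of their positions
    static double[] rank(double[] v) {
        int n = v.length;
        Integer[] order = new Integer[n];
        for (int i = 0; i < n; i++) order[i] = i;
        Arrays.sort(order, (a, b) -> Double.compare(v[a], v[b]));
        double[] ranks = new double[n];
        int i = 0;
        while (i < n) {
            int j = i;
            while (j + 1 < n && v[order[j + 1]] == v[order[i]]) j++;
            double mean = (i + j) / 2.0 + 1.0;  // mean rank of the tie group
            for (int k = i; k <= j; k++) ranks[order[k]] = mean;
            i = j + 1;
        }
        return ranks;
    }

    // pearson correlation of the two rank vectors = spearman's rho
    static double rho(double[] x, double[] y) {
        double[] rx = rank(x), ry = rank(y);
        int n = x.length;
        double mx = 0, my = 0;
        for (int i = 0; i < n; i++) { mx += rx[i]; my += ry[i]; }
        mx /= n; my /= n;
        double sxy = 0, sxx = 0, syy = 0;
        for (int i = 0; i < n; i++) {
            sxy += (rx[i] - mx) * (ry[i] - my);
            sxx += (rx[i] - mx) * (rx[i] - mx);
            syy += (ry[i] - my) * (ry[i] - my);
        }
        return sxy / Math.sqrt(sxx * syy);
    }

    public static void main(String[] args) {
        // invented numbers: loc and ditto prevalence (%) for five projects
        double[] loc   = { 12_000, 48_000, 150_000, 300_000, 900_000 };
        double[] ditto = { 60.1, 42.3, 30.5, 28.0, 15.2 };
        // prints -1.000 for this strictly monotone toy data; real data,
        // as in table 5, yields weaker (and mostly non-significant) values
        System.out.printf("rho = %.3f%n", rho(loc, ditto));
    }
}
```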
as shown in table 1, ditto and index are the most dominant practices across java projects. considering only these two categories, they account for 235,886 identifier names, representing 56.17% of all analyzed names in java projects. these results are consistent with the findings of beniamini et al. (2017). although code conventions and style guides may constrain identifier naming practices, programmers seem to be heavily influenced by the content assist capabilities of ides: as programmers work in the editor, content assist analyzes their code and recommends elements to complete partially entered statements. therefore, it is indispensable to provide more sophisticated and context-aware capabilities to assist programmers in naming and renaming identifiers (jiang et al., 2019; isobe and tamada, 2018; peruma et al., 2018, 2019). finally, programmers seem to prioritize single-letter names in contexts where they are widely accepted (see section 4.2).

4.4 rq4: what is the perception of software developers about the investigated naming categories?
this section presents the results of our survey with 52 programmers. we start by characterizing the respondents (section 4.4.1). next, we assess the relevance of the naming practice categories by how often they are used by programmers (section 4.4.2). we then analyze how the adoption of the naming practice categories varies according to programming statements (section 4.4.3).

4.4.1 respondents' demographics
figure 3 depicts the respondents' experience in software development and education level, with the corresponding frequencies and percentages. a total of 5.8% of the respondents have less than two years of experience, while 55.8% have more than five years of experience, suggesting that most survey respondents are experienced programmers. moreover, we seem to have collected a reasonably balanced distribution of programmers in terms of education level. as the majority of the respondents (73%, graduate or graduand) have a graduate degree or are about to obtain one, we claim that this increases our confidence in the validity of the responses.

figure 3. respondents' demographics
  (a) experience in software development: less than 2 years (5.8%); between 2 and 5 years (38.5%); between 5 and 10 years (32.7%); more than 10 years (23.1%)
  (b) education level: undergraduate (26.9%); graduate (44.2%); graduand (28.8%)

4.4.2 most commonly used naming practices
the respondents were queried about how often they choose identifier names conforming to the naming practice categories. a five-point likert scale was used to capture respondents' opinions, ranging from "never" to "very often". figure 4 shows how frequently respondents have been using each naming practice category.

figure 4. frequency with which respondents adopt each naming practice category (percentage of respondents)
  category    never   rarely  occasionally  often   very often
  kings       30.8%   48.1%   17.3%          3.8%   0.0%
  median      73.1%   19.2%    7.7%          0.0%   0.0%
  ditto       50.0%    9.6%   17.3%         15.4%   7.7%
  diminutive  11.5%   11.5%   46.2%         21.2%   9.6%
  cognome     36.5%   28.8%   19.2%         13.5%   1.9%
  index       25.0%   21.2%   26.9%         17.3%   9.6%
  shorten     40.4%   26.9%   19.2%         13.5%   0.0%
in our sample, diminutive is the most frequently used naming practice category (i.e., used "often" or "very often"), followed by index and ditto. the results for index and ditto align well with our observations about the prevalence of these naming practices in open-source object-oriented programs (see section 4.1). notably, from the survey, we can make the following observations:

• all the respondents adopt at least one naming practice category "occasionally" or "often", and 26% (13) of the respondents claim to adopt at least one naming practice "very often".
• diminutive is the naming practice category most adopted by respondents. however, as we observed, this category is not as prevalent in the analyzed object-oriented projects (see section 4.1) as claimed by the surveyed programmers.
• median is the least adopted naming practice category (see figure 4), with just 26% (14) of the respondents using it even "rarely" or "occasionally". the low use of this naming practice corroborates our observation that programmers seem to be conscious of how harmful it is in object-oriented programming.
• ditto is not a widespread naming practice among the survey respondents. only 12 out of 52 programmers (23%) indicated a tendency to write identifier names spelled in the same way as their types, which does not corroborate our previous observations about the prevalence of ditto across java and c++ projects (see section 4.3). this contrasting result suggests that programmers might not be aware of their actual use of naming practices. moreover, this might also be a sign that the naming assistant features present in modern ides do not influence the respondents.

4.4.3 most commonly used naming practices according to context
in order to specify the locations in which programmers mainly observe occurrences of the naming practice categories, the respondents were allowed to select multiple locations (attribute, method, loop, conditional, and none). they were expected to answer by remembering instances of the naming practice categories encountered in their software development work. the two most common answers were attribute and method (see figure 5). these findings share similarities with those presented in section 4.2, wherein 56% of the names occur as attributes or are declared in the context of a method. one notable exception is index: 43 out of 52 respondents indicated that this naming practice occurs mainly inside contexts surrounded by loop statements (for or while). indeed, as observed by beniamini et al. (2017), single-letter names can be used safely in a short-scope context. finally, as expected, the majority of respondents (39 out of 52) indicated that they usually do not observe median in their daily work (see figure 5).

figure 5. locations where respondents observe each naming practice category (number of respondents; multiple selections allowed)
  location     kings  median  ditto  diminutive  cognome  index  shorten
  attribute     14      4      19      27          21       4      10
  method        24     11      23      39          30       6      19
  loop           8      0      13      20          10      43      13
  conditional    4      0      12      18          12      13       8
  none          16     39      22       7          17       9      22

5 threats to validity
as with most empirical studies, ours has some practical limitations, i.e., it is subject to some threats to its validity. in this section, we present potential threats and how we tried to mitigate them.

conclusion & external validity
one potential threat is that the samples we used in our study might not be representative of the target population: our analysis took into account 40 open-source java projects and 40 c++ projects. to mitigate this threat to the conclusions and generalization of the study results, we tried to select a heterogeneous sample.
we think the impact of this threat is minimal for three reasons: (i) java and c++ are two popular programming languages (see www.tiobe.com/tiobe-index/); (ii) our sample covers both somewhat small code bases (with less than 10k loc) and large-scale ones (with over 100k loc); and (iii) we selected projects from a broad range of domains. thus, we argue that our study can be seen as an initial step towards identifying trends that java and c++ programmers follow when picking identifier names. however, given the sizes of our samples, we cannot rule out the possibility that our results do not reflect how java and c++ programmers in general name identifiers; that is, the results might not be generalizable beyond the study samples and the participants that took part in our survey.

to understand the prevalence of the naming categories across java and c++ projects, we employed a set of metrics: program size (loc), number of commits, and number of contributors. nevertheless, as with many software metrics, one potential threat is that these measurements might not be sophisticated enough for our investigation. thus, our findings might not carry over to other settings and similar programming languages. it is also worth emphasizing that context and scope seem to play an important role in determining identifier names. for instance, some of the most common identifier names listed in table 3 seem to be context-dependent, e.g., node. we surmise this is the case because programmers might want to include relevant domain information when turning concepts into names. although we tried our best to maximize heterogeneity during sample selection, we cannot rule out that the most common domains (e.g., xml file parsing) from which the programs in our sample were extracted might have an impact on variable naming.

finally, the representativeness of the survey respondents cannot be guaranteed. our target population was programmers, but we did not take any measures to verify the identity of the respondents. however, we included two initial demographic questions, which might have permitted us to filter out individuals not belonging to our target population. there might also exist other factors that bias our conclusions. one example is the environment in which the respondents worked; another is whether or not the respondents had a correct understanding of each category. to mitigate the latter, we included in the questionnaire a brief description and an example of each category. future studies can ask respondents to consider this factor and evaluate how it impacts the adoption of the naming practice categories.

construct & internal validity
a threat to the construct validity of our study comes from the number of identifier names we analyzed. it might be argued that a more significant amount of names could lead to better and more conclusive results. to mitigate this threat, we analyzed 2,603,381 identifier names in highly diverse sets of java and c++ projects. additionally, another potential threat has to do with how well the naming practices we identified reflect extant research and current industry practices. we tried to mitigate this threat by drawing from previous research, which helped us to get a better understanding of whether or not some of the naming practices we identified are indeed recurring practices. we also conducted a survey with 52 participants in order to gather programmers' perceptions about the use and occurrence of the investigated naming practices.
we tried to minimize possible construct and internal validity threats associated with the survey by disseminating it online through multiple websites and online groups, and by introducing a brief description and an example alongside each question.

6 conclusion
coming up with proper identifier names is challenging (brooks, 1983). as stated by host and ostvold (2007), even though programmers have to name identifiers on a daily basis, naming still entails a great deal of time and thought. to make matters more challenging, identifier names are pivotal for program comprehension: developers have to go over identifier names to comprehend the code that they need to update, and poorly chosen names might hinder source code comprehension (avidan and feitelson, 2017). given that it has been estimated that identifiers contribute to about 70% of a software system's code base (deissenboeck and pizka, 2006), it cannot be disputed that there is a need to define what makes up a good identifier, as well as to assist developers in naming identifiers. similarly, identifying practices that result in poor identifier names might enhance programmers' awareness and contribute to improving educational materials and code review methods.

as an initial foray into creating an approach to optimal identifier naming (i.e., how to assign the proper words to an identifier), we investigated eight naming practice categories "in the wild". the categories provide examples of naming practices from real-world software projects. we illustrated their possible consequences and also outlined their prevalence across projects and code contexts (i.e., attribute, parameter, method, for, while, if, and switch). our results, based on 2,603,381 identifier names extracted from 80 real-world java and c++ projects and on a survey, would seem to suggest the following:

• the eight categories are recurrently found in practice, but two are more common in java and c++ projects: naming an identifier with the same name as its type (ditto) and using single-letter names denoting counters (index). specifically, index and ditto are by far the most frequently occurring naming practices across java projects: index occurrences account for approximately 30% of all naming practice occurrences in the examined java projects, while ditto occurrences amount to roughly 27%. as for c++ programs, ditto is the most widely used naming practice, accounting for around 54% of all naming practice occurrences. index and diminutive are also popular among c++ coders, accounting for 13% and 11% of all naming practice occurrences, respectively. shorten seems to be the least used naming practice among both java and c++ programmers. additionally, programmers seem to be hardly influenced by ide-like features that help them to choose identifier names, since only 12 out of 52 surveyed programmers (23%) acknowledged a tendency to write identifier names spelled in the same way as their types;

• there are several very common names (e.g., value, result, and name) and recurrent single-letter names (e.g., i, e, s, c) used in practice. the lion's share of these names denote identifiers that store either integer or string values. according to our results, single-letter identifiers are more commonly used by java programmers: i, e, s, c, t, a, b, p, and n seem to be widely used. in c++ (in contrast to java), coders tend to prefer a smaller set of single-letter names (e.g., i, x, s, b, p, n, and v).
thus, differently from java, in c++ the names e, c, t, and a do not rank among the most common single-letter identifier names;

• programmers' naming practices are context-specific: single-letter names (index and shorten) seem to be more common in short-scope contexts (if, for, while), although they can also be found in large-scope contexts (e.g., attribute). results from our survey questionnaire showed that programmers acknowledge that the index naming practice occurs mainly inside contexts surrounded by loop statements;

• diminutive is the naming practice category most adopted by survey respondents, and median is the least used. all the respondents adopt at least one naming practice category "occasionally" or "often";

• we could benefit from including poor naming practices in code reviews. current practices follow extensive checklists, but none of them addresses naming issues. a more nuanced take is to consider variable names that depart from commonly used naming practices as elements that can be a source of problems.

we believe our results have the potential to inspire several future research directions. our work highlights the need for further research on how naming practices are prevalent in source code and how better names can be chosen. in this direction, an aspiring goal would be to devise tools capable of automatically evaluating and suggesting renaming opportunities during code review. similarly, code generation tools can capitalize on commonly used naming practices to generate names automatically. additionally, since our results would seem to suggest that some identifier names are context-dependent, we believe that tools (e.g., an ide-based identifier name recommendation system) can take advantage of context information during software development by constantly monitoring how programmers name identifiers, so that they can help developers new to a given project through the automated recognition of context- and project-specific naming conventions. such an automated identifier naming assistant could support developers by identifying inappropriate naming choices and making recommendations. as a result, our long-term goal is to support the identification of opportunities to rename identifiers and to understand more about programmers' naming practices. finally, as future work, we plan to perform a qualitative study on commits, code changes, and review discussions. another possible future research avenue would be to account for the role of human factors in choosing identifier names by exploring how programmer experience, team size, and mood influence naming practices throughout different software projects. although our results give practitioners and researchers alike a good glimpse into the most common options for naming identifiers in c++ and java, we did not investigate how each naming practice contributes, if at all, to improving code comprehension. therefore, future research efforts should aim to better understand how these commonly used naming practices influence readability during code comprehension.

references
allamanis, m., barr, e. t., bird, c., and sutton, c. (2014). learning natural coding conventions. in international symposium on foundations of software engineering.
alsuhaibani, r. s., newman, c. d., decker, m. j., collard, m. l., and maletic, j. i. (2021). on the naming of methods: a survey of professional developers. in international conference on software engineering.
arnaoudova, v., di penta, m., and antoniol, g. (2016). linguistic antipatterns: what they are and how developers perceive them. empirical software engineering, 21(1):104–158.
avidan, e. and feitelson, d. g. (2017). effects of variable names on comprehension: an empirical study. in 25th international conference on program comprehension.
beniamini, g., gingichashvili, s., orbach, a. k., and feitelson, d. g. (2017). meaningful identifier names: the case of single-letter variables. in international conference on program comprehension, pages 45–54.
brooks, r. (1983). towards a theory of the comprehension of computer programs. international journal of man-machine studies, 18(6):543–554.
brown, w. h., malveau, r. c., mccormick, h. w. s., and mowbray, t. j. (1998). antipatterns: refactoring software, architectures, and projects in crisis. john wiley & sons, inc., usa, 1st edition.
butler, s., wermelinger, m., yu, y., and sharp, h. (2010). exploring the influence of identifier names on code quality: an empirical study. in 2010 14th european conference on software maintenance and reengineering, pages 156–165. ieee.
caprile, b. and tonella, p. (2000). restructuring program identifier names. in icsm, pages 97–107.
charitsis, c., piech, c., and mitchell, j. (2021). assessing function names and quantifying the relationship between identifiers and their functionality to improve them. in conference on learning@scale.
collard, m. l., decker, m. j., and maletic, j. i. (2013). srcml: an infrastructure for the exploration, analysis, and manipulation of source code: a tool demonstration. in 2013 ieee international conference on software maintenance, pages 516–519. ieee.
deissenboeck, f. and pizka, m. (2006). concise and consistent naming. software quality journal, 14(3):261–282.
dileo, c. (2019). clean ruby.
dos santos, r. m. and gerosa, m. a. (2018). impacts of coding practices on readability. in international conference on program comprehension.
fakhoury, s., ma, y., arnaoudova, v., and adesope, o. (2018). the effect of poor source code lexicon and readability on developers' cognitive load. in international conference on program comprehension.
feitelson, d., mizrahi, a., noy, n., shabat, a. b., eliyahu, o., and sheffer, r. (2020). how developers choose names. ieee transactions on software engineering.
gresta, r. and cirilo, e. (2020). contextual similarity among identifier names: an empirical study. in workshop de visualização, evolução e manutenção de software, pages 49–56. sbc.
gresta, r., durelli, v., and cirilo, e. (2021). naming practices in java projects: an empirical study. in xx brazilian symposium on software quality, pages 1–10. acm.
hofmeister, j., siegmund, j., and holt, d. v. (2017). shorter identifier names take longer to comprehend. in 2017 ieee 24th international conference on software analysis, evolution and reengineering (saner), pages 217–227. ieee.
host, e. w. and ostvold, b. m. (2007). the programmer's lexicon, volume i: the verbs. in international working conference on source code analysis and manipulation.
isobe, y. and tamada, h. (2018). are identifier renaming methods secure? in international conference on software engineering, artificial intelligence, networking and parallel/distributed computing.
jiang, l., liu, h., and jiang, h. (2019). machine learning based recommendation of method names: how far are we. in international conference on automated software engineering.
kawamoto, k. and mizuno, o. (2012). predicting fault-prone modules using the length of identifiers. in 2012 fourth international workshop on empirical software engineering in practice, pages 30–34. ieee.
kernighan, b. w. and pike, r. (1999). the practice of programming. addison-wesley longman publishing co., inc.
lawrie, d., feild, h., and binkley, d. (2007a). quantifying identifier quality: an analysis of trends. empirical software engineering, 12(4):359–388.
lawrie, d., morrell, c., and feild, h. (2007b). effective identifier names for comprehension and memory. innovations syst softw eng, 3(1):303–318.
lawrie, d., morrell, c., feild, h., and binkley, d. (2006). what's in a name? a study of identifiers. in 14th ieee international conference on program comprehension.
marcus, a., sergeyev, a., rajlich, v., and maletic, j. i. (2004). an information retrieval approach to concept location in source code. in 11th working conference on reverse engineering, pages 214–223. ieee.
martin, r. c. (2008). clean code: a handbook of agile software craftsmanship.
nyamawe, a. s., bakhti, k., and sandiwarno, s. (2021). identifying rename refactoring opportunities based on feature requests. international journal of computers and applications, pages 1–9.
oliveira, d., bruno, r., madeiral, f., and castor, f. (2020). evaluating code readability and legibility: an examination of human-centric studies. in international conference on software maintenance and evolution.
peruma, a., mkaouer, m. w., decker, m. j., and newman, c. d. (2018). an empirical investigation of how and why developers rename identifiers. in 2nd international workshop on refactoring.
peruma, a., mkaouer, m. w., decker, m. j., and newman, c. d. (2019). contextualizing rename decisions using refactorings and commit messages. in international working conference on source code analysis and manipulation.
ratiu, d. and deissenboeck, f. (2006). programs are knowledge bases. in 14th ieee international conference on program comprehension (icpc'06), pages 79–83. ieee.
scalabrino, s., bavota, g., vendome, c., linares-vásquez, m., poshyvanyk, d., and oliveto, r. (2017). automatically assessing code understandability: how far are we? in international conference on automated software engineering.
schankin, a., berger, a., holt, d. v., hofmeister, j. c., riedel, t., and beigl, m. (2018). descriptive compound identifier names improve source code comprehension. in international conference on program comprehension.
swidan, a., serebrenik, a., and hermans, f. (2017). how do scratch programmers name variables and procedures? in international working conference on source code analysis and manipulation (scam), pages 51–60.
takang, a. a., grubb, p. a., and macredie, r. d. (1996). the effects of comments and identifier names on program comprehensibility: an experimental investigation. j. prog. lang., 4(3):143–167.
tofte, m. and talpin, j.-p. (1997). region-based memory management. information and computation, 132(2):109–176.
wainakh, y., rauf, m., and pradel, m. (2021). idbench: evaluating semantic representations of identifier names in source code. in international conference on software engineering.

appendix a survey questionnaire

education level
◦ undergraduate ◦ graduate ◦ graduand

experience in software development
◦ under two years ◦ two to five years ◦ six to ten years ◦ over ten years

1. how often do you choose identifier names with numbers at the end?
examples: people people1; people people2
◦ never ◦ rarely ◦ occasionally ◦ often ◦ very often
where do you usually see identifier names with numbers at the end?
▭ attributes ▭ methods ▭ loops ▭ conditionals ▭ none

2. how often do you choose identifier names with numbers in the middle?
example: char int2char
◦ never ◦ rarely ◦ occasionally ◦ often ◦ very often
where do you usually see identifier names with numbers in the middle?
▭ attributes ▭ methods ▭ loops ▭ conditionals ▭ none

3. how often do you name identifiers after their type names?
examples: string string, people people
◦ never ◦ rarely ◦ occasionally ◦ often ◦ very often
where do you usually see identifier names spelled in the same way as their types?
▭ attributes ▭ methods ▭ loops ▭ conditionals ▭ none

4. how often do you name identifiers as a chunk of their respective type name?
example: engineexecutiontestlistener listener
◦ never ◦ rarely ◦ occasionally ◦ often ◦ very often
where do you usually see identifier names that are a chunk of their respective type name?
▭ attributes ▭ methods ▭ loops ▭ conditionals ▭ none

5. how often do you include in identifier names an additional suffix or prefix that is the name of the respective type?
example: string namestring
◦ never ◦ rarely ◦ occasionally ◦ often ◦ very often
where do you usually see identifier names containing an additional suffix or prefix that is the name of the respective type?
▭ attributes ▭ methods ▭ loops ▭ conditionals ▭ none

6. how often do you choose single-letter identifier names?
example: integer j
◦ never ◦ rarely ◦ occasionally ◦ often ◦ very often
where do you usually see single-letter identifier names?
▭ attributes ▭ methods ▭ loops ▭ conditionals ▭ none

7. how often do you name identifiers with the starting letters that correspond to their respective types?
example: people p
◦ never ◦ rarely ◦ occasionally ◦ often ◦ very often
where do you usually see names which are the starting letters that correspond to their respective types?
▭ attributes ▭ methods ▭ loops ▭ conditionals ▭ none

journal of software engineering research and development, 2022, 10:12, doi: 10.5753/jserd.2022.2576  this work is licensed under a creative commons attribution 4.0 international license.

understanding and analyzing factors that affect merge conflicts from the perspective of brazilian software developers
barbara beato ribeiro [ universidade federal do estado do rio de janeiro | barbara.ribeiro@edu.unirio.br ]
catarina costa [ universidade federal do acre | catarina.costa@ufac.br ]
rodrigo pereira dos santos [ universidade federal do estado do rio de janeiro | rps@uniriotec.br ]

abstract
merge conflicts are very common in collaborative software development, which is supported mainly by the use of branches that can potentially be merged. in this context, several studies have proposed mechanisms to avoid conflicts whenever possible, and some have identified factors that lead to conflicts.
in this article, we report on an investigation of factors that can lead to conflicts or that can somehow reduce the chances of conflict from the developers' perspective. to do so, based on related work, we conducted two empirical studies with brazilian software developers to both understand and analyze factors that affect merge conflicts. firstly, we conducted survey research with 109 software developers to understand how they use branches, the occurrence of conflicts and the resolution process, and factors that can lead to or avoid conflicts. results showed that the use of branches is very common, mostly with the purpose of creating a new feature or fixing a bug. according to the participants, in most projects developers have the autonomy to create new branches, and conflicts sometimes happen. the main factors that can lead to conflicts are "the time a branch is isolated" and "lack of communication". on the other hand, the factors cited as good practices to avoid conflicts were "improve team communication" and "less branching duration". secondly, we conducted a field study based on interviews with 15 software developers to analyze those factors and better understand what leads to or avoids conflicts in a merge. finally, this work allowed us to conclude that communication with the team, checking code updates, shorter branch duration, and management are important for software developers, especially when they think about what increases and decreases merge conflicts.

keywords: version control, merge conflicts, survey research, field study, software developers

1 introduction
version control systems (vcs) allow the creation of parallel branches in a simplified way. however, there is a cost regarding merge conflicts, which are common in collaborative software development. developers usually combine work they have performed in parallel and may have changed the same parts of a specific file. although the solution is frequently present in one or both conflicting versions, this does not necessarily mean that resolving the conflict is a trivial task (ghiotto et al., 2018). conflict resolution might degrade the quality of the merged code and requires a deeper understanding of the program's structure and goals (shihab et al., 2012; brindescu et al., 2020a). the person in charge may not have all the necessary knowledge to make the best decision, or may not feel comfortable making decisions alone over source code written by other developers (shihab et al., 2012; costa et al., 2014). in some cases, it may be necessary to verify the knowledge of the developers involved in the changes made in the branches in order to choose one or more developers to resolve the conflict (costa et al., 2019). in this context, recent studies (leßenich et al., 2018; owhadi-kareshk et al., 2019; dias et al., 2020; menezes et al., 2020, 2021; vale et al., 2020) have investigated factors, indicators, and attributes that can lead to merge conflicts. such studies have found evidence that some factors can impact merge conflicts more than others. therefore, we decided to use this knowledge as a reference to verify software developers' perspectives on factors that can lead to, or help to avoid, merge conflicts. as such, based on related work, we conducted two empirical studies to both understand and analyze factors that affect merge conflicts.
firstly, we conducted survey research with 109 brazilian software developers to understand the way they use branches, the occurrence of conflicts and the resolution process, and factors that can lead to or avoid merge conflicts. the following three research questions guided our survey:

• rq1 (branches): how often are branches created in software projects?
• rq2 (merge conflicts): what factors lead to merge conflicts?
• rq3 (resolve conflicts): which practices do developers generally adopt to avoid merge conflicts?

we found that the main factors that can lead to conflicts are "the time a branch is isolated" and "lack of communication". this communication refers to the awareness of parallel changes: sometimes developers forget to communicate what they are changing, resulting in two developers changing the same functionality or something very close to it. on the other hand, the factors cited as good practices to avoid conflicts were "improve team communication" and "less branching duration". others mentioned by the participants were "divide the work among the team", "small changes", and "frequent commits". we also identified that the main reasons to create a branch are "create new features" and "bug fixes", and participants mentioned that developers create branches "frequently". secondly, we conducted a field study based on interviews with 15 brazilian software developers to analyze those factors and obtain a better understanding of what leads to or avoids merge conflicts. the following two new research questions guided our field study:

• rq4 (produce conflicts): how do the factors identified in the survey research mostly contribute to increasing merge conflicts?
• rq5 (avoid conflicts): how do the factors identified in the survey research mostly contribute to decreasing merge conflicts?

we explored the factors highlighted in the survey in depth and observed that most software developers agree with them and have been through situations that reinforce their opinions. furthermore, time of experience was mentioned, highlighting that experience can change a software developer's perception of the matter, and that the technology itself may have evolved over that time, improving the work. this article is an extended version of a conference paper (costa et al., 2021) in which we answered the first three research questions, focusing on the characterization of software developers' perceptions of factors related to merge conflicts. we complement our previous work by adding two new research questions, analyzing how developers see these factors and whether and how they contribute to increasing and/or decreasing the chances of a merge conflict occurring.

this article is organized as follows. we explain the merge conflict scenario and discuss related work in section 2. in section 3, we describe the research method. we present the studies conducted in this work, as well as their results and findings, in sections 4 and 5. discussion and implications are presented in section 6. section 7 refers to threats to validity and credibility. finally, section 8 concludes this paper with some final remarks and opportunities for future work.
2 background
in this section, we discuss the concept of merge conflicts and other works that have investigated factors or attributes that can lead to conflicts.

2.1 merge conflicts
textual or physical conflicts occur due to simultaneous modifications (e.g., addition, removal, or editing) of the same physical parts of a file (e.g., the same line) by several developers. direct conflicts are detected by a vcs and require resolution by a developer or a project team. figure 1 shows an example of a conflicting chunk detected by git, where each part of the chunk has a version of a function that sums two values in the python programming language (a sketch in this format is given at the end of this subsection). in this case, a developer in charge must choose one of the versions, since they have the same intention.

figure 1. conflict detected by vcs

ghiotto et al. (2018) verified how developers resolved conflicting chunks across 2,731 java projects. the authors found that the resolution of a conflicting chunk is frequently present in one of the versions: three quarters of the conflicting chunks were resolved by choosing one of the versions, version 1 (50%) or version 2 (25%). in some cases, a concatenation (3%), a combination (9%), or even new code (13%) was necessary. this does not necessarily mean that it is a trivial task: the person in charge must understand the conflicting intentions and generate a single version. vale et al. (2021) investigated the influence of some factors on conflict resolution time and found that the number of chunks, lines of code, conflicting chunks, developers involved, conflicting lines of code, conflicting files, and the complexity of the conflicting code influence the merge conflict resolution time. accioly et al. (2018) found that merge conflicts happened in 9.38% of their data set. the authors also mentioned that merging branches is not likely to be a simple task, since one needs to understand and merge contributions performed by different developers, probably working on different assignments (accioly et al., 2018). menezes et al. (2020) found that merge conflicts happened in 7.11% of their data set, but the rate is more than 20% in some projects. kasi and sarma (2013) analyzed a set of projects and found that merge conflict rates ranged from 7.6% to 19.3%. in the study conducted by brun et al. (2011), 17% of merge operations required human assistance to resolve a textual conflict. as conflicts can be common, their consequences can be a problem for the quality of some projects. as mentioned by brindescu et al. (2020a), this situation can affect code quality: even when developers follow an established process of peer review of code submissions, a solution of lower quality can be produced during the resolution of the merge. in fact, merge conflicts are widely discussed in the literature. some works (sarma et al., 2008; brun et al., 2011; sarma et al., 2011; guimarães and silva, 2012; estler et al., 2013) aim to prevent conflicts by monitoring workspaces and notifying developers of potential conflicts. such approaches are important initiatives, but they do not guarantee conflict-free merges, mainly due to the adoption of branches. others (cavalcanti et al., 2015; mckee et al., 2017; accioly et al., 2018; ghiotto et al., 2018) try to characterize merge conflicts in order to learn more about the topic and support initiatives that help to reduce the number of conflicts.
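for readers who have not seen one, the chunk described in figure 1 would look roughly like the sketch below; the two python function bodies are invented stand-ins for the two conflicting versions, and branch-b is a hypothetical branch name. git delimits the competing versions with the <<<<<<<, =======, and >>>>>>> markers:

```
<<<<<<< HEAD
def sum(a, b):
    return a + b
=======
def sum(first_value, second_value):
    result = first_value + second_value
    return result
>>>>>>> branch-b
```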
on the other hand, researchers (leßenich et al., 2018; owhadi-kareshk et al., 2019; dias et al., 2020; menezes et al., 2020, 2021; vale et al., 2020) have more recently started looking at factors, attributes, and indicators that can lead to or avoid conflicts.

table 1. related work: attributes investigated by each study
(columns, in order: leßenich et al., 2018; owhadi-kareshk et al., 2019; vale et al., 2020; dias et al., 2020; menezes et al., 2020; menezes et al., 2021)
  abstract syntax tree (ast) nodes changed        x  –  –  –  –  –
  changed chunks                                  x  –  x  –  –  x
  changed files                                   –  x  x  x  x  x
  changed files in both branches (intersection)   x  x  –  –  x  x
  changed lines of code                           x  –  x  x  –  x
  changes inside class declarations               x  –  –  –  –  –
  commit density                                  x  x  –  –  –  x
  commits                                         x  x  x  x  x  x
  communication measures                          –  –  x  –  –  –
  developers                                      x  x  x  x  x  x
  duration                                        x  x  x  x  x  x
  files with merge conflict                       –  –  –  x  x  x
  length of commit messages                       –  x  –  –  –  –
  merge conflict occurrence                       x  –  x  x  x  x
  modularity                                      –  –  –  x  –  –
  predefined keywords in commit messages          –  x  –  –  –  –
  programming language                            –  –  –  –  –  x
  self-conflict                                   –  –  –  –  x  x

2.2 related work
the studies (leßenich et al., 2018; owhadi-kareshk et al., 2019; dias et al., 2020; menezes et al., 2020, 2021; vale et al., 2020) that investigated factors, attributes, or indicators that may lead to conflicts analyze timing and size attributes of merge scenarios, such as commits, committers, lines of code, files, and others. these studies and the factors, attributes, or indicators they consider are summarized in table 1.

leßenich et al. (2018) investigated indicators to predict the number of merge conflicts. such indicators were inferred from a survey with 41 developers, in which developers mentioned what causes merge conflicts: formatting changes, large-scale refactoring, structural changes in long-living forks, and import statements. next, the authors conducted an empirical study with 163 open-source projects, including 21,488 merge scenarios. they investigated the correlation of some indicators (commits, files, chunks, lines of code, developers, and others) with the number of conflicts. for example, they explored commit density, with the hypothesis that "many commits within a small time span are more likely to produce conflicts than the same number of commits over longer time spans". they did not observe any strong correlation with the number of conflicts and rejected this hypothesis. in fact, they found that no indicator analyzed in the work can predict the number of merge conflicts, as suggested by the survey.

owhadi-kareshk et al. (2019) also investigated whether conflict prediction is feasible, designing a classifier for predicting merge conflicts. the authors conducted an empirical study with 744 open-source projects, including 267,657 merge scenarios, written in seven programming languages. they created and used a set of potentially predictive features for merge conflicts based on the literature on software merging. similarly to the work of leßenich et al. (2018), they also investigated commit density, with the intuition that "lots of recent activity may increase the chance of conflicting changes". moreover, they did not find a correlation between their feature sets and conflicts, but they were able to indicate merge scenarios that are not likely to have conflicts.

dias et al. (2020) investigated the effect of modularity, size, and timing of developers' contributions on merge conflicts.
the authors conducted an empirical study with 125 open-source projects, including 73,504 merge scenarios, written in two programming languages. they found that "conflict occurrence significantly increases when contributions to be merged are not modular". they also mentioned that "conflict occurrence increases when contributions to be merged have more developers, commits, and changed files" and that "contributions developed over longer periods of time are more likely associated with conflicts".

in a previous study, we also investigated size and timing attributes that can lead to conflicts (menezes et al., 2020). we conducted an empirical study with 80 open-source projects, including 182,273 merge scenarios, written in ten programming languages. we performed statistical tests and mined association rules. we found that some attributes in the branch that is being integrated (branch 2) have more influence than the same attributes in the other branch. for example, committers, commits, and changed files in branch 2 have a large impact on the occurrence of merge conflicts; timing attributes, commits in branch 1, and changed files in branch 1 have a small influence. it is relevant to mention that this work calculated the metrics (except the timing attributes) per branch; the timing attributes were calculated per merge scenario, as were the attributes in the other works described here (a sketch of how such per-branch attributes can be computed appears at the end of this section). menezes et al. (2021) verified more attributes (chunks, changed lines of code, commit density, programming language) in a second study. the attributes that presented a higher relation to the occurrence of merge conflicts were changed files, commits, and committers in branch b2 (as in the first study), and changed lines of code in b2.

vale et al. (2020) investigated the role of communication activity in the occurrence or avoidance of merge conflicts. the authors conducted an empirical study with 30 open-source projects involving 19,000 merge scenarios. they mined and linked contribution (git) and communication (github) data, quantifying the amount of github communication in merge scenarios: the communication of all active contributors (awareness-based), communication by means of pull requests and related issues (pull-request-based), and the communication mapped to artifacts changed in the merge scenario (changed-artifact-based). the authors found no significant relation between communication measures and the number of merge conflicts. they also performed a multivariate analysis using merge scenario characteristics, such as size, number of developers, and duration. against their expectations, they did not find a strong correlation between the size of merge scenario code changes and the occurrence of merge conflicts.

finally, related work investigated similar attributes, although the studies reached different results. it is worth mentioning that they used different analysis techniques, projects, and languages, and that some attributes are calculated differently as well. however, the important implication of such related studies to the present work is the possibility of gathering some knowledge and investigating the developers' perspective through a qualitative method focused on empirical studies addressing characteristics of open-source projects.
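to make these per-branch attributes concrete, here is a hedged sketch of how they can be computed with stock git commands for a merge scenario whose parents are p1 (branch 1) and p2 (branch 2); this is a generic illustration, not necessarily the tooling the cited studies used:

```
# b: the common ancestor (merge base) of the two branches
b=$(git merge-base <p1> <p2>)

git rev-list --count "$b"..<p2>          # commits in branch 2
git shortlog -sn "$b"..<p2> | wc -l      # distinct authors in branch 2
git diff --name-only "$b" <p2> | wc -l   # files changed in branch 2
```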
3 research method

based on related work identified as the first step of this work (section 2), we conducted two empirical studies to both understand and analyze factors that affect merge conflicts. firstly, we conducted survey research with 109 software developers to understand the way they use branches, the occurrence of conflicts and the resolution process, and the factors that can lead to or avoid conflicts. secondly, we conducted a field study based on interviews with 15 software developers to analyze those factors and obtain a better understanding of what contributes to increasing or decreasing merge conflicts.

we conducted the survey research with brazilian software developers. the survey aimed to collect opinions on the actions that software developers usually take when they need to create or work in branches and merge code files. the study was directed at software developers who used any vcs to coordinate changes in their projects. next, we performed a field study with 15 developers based on interviews. the field study aimed to deepen and detail the answers obtained in the survey research. these studies allowed us to organize a discussion and point out implications to researchers and practitioners in the field.

4 understanding factors that affect merge conflicts

in this section, we present details on the survey planning and execution, as well as information about the survey participants. finally, we answer our first three research questions.

4.1 planning and execution

we adopted the following steps to run the survey based on the principles presented by pfleeger and kitchenham (2001): (1) setting specific and measurable objectives, (2) planning and scheduling the survey, (3) preparing the data collection instrument, (4) validating the instrument, (5) selecting participants, (6) analyzing the data, and (7) reporting the results. we planned and constructed our questionnaire from the first three research questions presented in section 1 and based on the factors mentioned in related work (leßenich et al., 2018; owhadi-kareshk et al., 2019; dias et al., 2020; menezes et al., 2020; vale et al., 2020), mainly in the survey provided by leßenich et al. (2018). this questionnaire was divided into three sections: (1) basic information and professional experience, (2) use of branches, and (3) merge conflicts. our previous work and the survey responses in portuguese are publicly available on github (https://github.com/catarinacosta/mactool/blob/master/surveyanswerssbes2021.xlsx).

we performed a pilot with four software development practitioners aiming at validating the questionnaire and estimating the response time. based on the answers and suggestions, we adjusted and improved the questionnaire. we sent out the questionnaire to developers via email, together with some contextual information such as the research objective, the expected knowledge in version control, and the estimated time to answer (5 minutes). as we used mailing lists and asked developers to share the survey with colleagues, we cannot compute a response rate. open and closed questions were used in the survey. the questions included in the survey are:

1. age (less than 24 years old, between 25 and 34 years old, between 35 and 44 years old, between 45 and 54 years old, more than 55 years old);
2. level of education (high school, technical education, bachelor's degree, specialization degree, master's degree, phd);
3. job sector (private sector, public sector, both, self-employed);
4. experience (between 1 and 5 years, between 6 and 10 years, between 11 and 15 years, between 16 and 20 years, more than 20 years);
5. average size of the project teams (between 1 and 5 people, between 6 and 10 people, between 11 and 15 people, more than 15 people);
6. version control tools (clear case, cvs, git, jazz, mercurial, pvcs version manager, rsc, subversion, team foundation server, visual source safe, others: );
7. branch creation frequency (rarely, sometimes, frequently, very frequently, always);
8. reason for creating branches (test, bug fixes, release, new features, refactoring, others: );
9. branch creation policy (developers have autonomy to create new branches, only the project manager or the person who maintains the software, the team decides, others: );
10. conflicts frequency (rarely, sometimes, frequently, very frequently, always);
11. factors that contribute to the occurrence of conflicts (number of changed files, number of changed lines, number of commits, number of developers, branching duration, lack of communication, developer working in several branches, others: );
12. time to resolve a merge conflict (some hours (less than 24 hours), some days (1 to 6 days), one week, more than a week);
13. difficulty in resolving a merge conflict (very easy, easy, medium, difficult, very difficult);
14. practices to avoid conflicts (team communication, less branching duration, small changes, frequent commits, divide the work among the team, others: ).

we adopted the card sorting approach (spencer, 2009; zimmermann, 2016) to analyze the answers to the open-ended questions (in this questionnaire, the optional questions 6, 8, 9, 11, and 14, in which the participants could enter other data) and obtained some answers not listed in the initial survey options. to do so, we grouped similar responses to the open-ended questions into codes. the coding was performed by two researchers, who discussed the codes and categories, and was then reviewed by another researcher with 10 years of experience in qualitative studies. an example of the coding is presented in figure 2, in which the codes are first extracted and the categories emerge after checking the similarity.

figure 2. example of coding
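the grouping step of card sorting can be illustrated by the toy sketch below. it is our simplification only: the codes, keyword lists, and category names are invented for illustration, and the actual coding in this study was performed manually by the researchers, not by keyword matching.

```python
# toy illustration of the grouping step in card sorting: open-ended answers are
# mapped to codes via (invented) keyword lists, and codes are grouped into
# categories. the real coding in this study was done manually by researchers.
from collections import defaultdict

CODE_KEYWORDS = {          # hypothetical codes and matching keywords
    "outdated repository": ["sync", "update", "rebase", "out of date"],
    "code formatting": ["indent", "format", "style", "whitespace"],
    "task breakdown": ["task", "broken", "small pieces", "mapped"],
}
CATEGORIES = {             # hypothetical code -> category mapping
    "outdated repository": "process",
    "code formatting": "code style",
    "task breakdown": "planning",
}

def code_answers(answers):
    """assign each free-text answer to the first code whose keywords match."""
    grouped = defaultdict(list)
    for answer in answers:
        text = answer.lower()
        for code, keywords in CODE_KEYWORDS.items():
            if any(k in text for k in keywords):
                grouped[CATEGORIES[code]].append((code, answer))
                break
    return grouped

answers = ["not synchronizing the repositories",
           "difference in the code formatting",
           "tasks not mapped correctly"]
for category, items in code_answers(answers).items():
    print(category, "->", items)
```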
4.2 results

from the 109 brazilian software developers that answered the questionnaire, 38.5% are between 25 and 34 years old, and 33% are between 35 and 44 years old. 12.8% are less than 24 years old, 11% are between 45 and 54 years old, and only 5% are more than 55 years old. 35.8% have a bachelor's degree, 29.4% a master's degree, and 22% a specialization degree.

we asked participants where they worked and how much experience in software development they had. regarding the experience as a developer, 27.5% have between 11 and 15 years of experience, and also 27.5% have between 6 and 10 years of experience. moreover, 23.9% have between 1 and 5 years of experience, 11% have more than 20 years of experience, and 10.1% have between 16 and 20 years of experience. additionally, we asked participants about the number of people in the last project they participated in (or on average in their career). 38.5% answered that they worked in teams from 1 to 5 members, 36.7% worked in teams from 6 to 10 members, and 15.6% worked with more than 15 members. finally, 9.2% answered that they worked in teams from 11 to 15 members.

we also wanted to identify which tools developers adopt for version control. in this question, the participant was allowed to mark more than one answer. 105 (96.3%) developers marked that they have experience with git, 39 (35.8%) have experience with subversion, and 18 (16.5%) have experience with team foundation server. mercurial, cvs, and visual source safe were also mentioned. the developers were free to include other types of vcs not listed in the questionnaire, but no one answered anything different from the list. the information is shown in figure 3.

figure 3. experience with version control systems

4.2.1 rq1 (branches): how often are branches created in software projects?

we asked participants how often they create branches in software development. in the case of named branches, we believe that there is a scenario that may be more likely to conflict and be more complex to resolve. we would also like to know the reason for the creation as well as the policies adopted to do so. respondents could answer: rarely, sometimes, frequently, very frequently, or always. the prevalent answer was "always" (45.9%). developers also chose "very frequently" (19.3%) and "frequently" (15.6%). therefore, we can say branching is a very common practice among the participants. results are shown in figure 4. we also verified that developers create branches "always" in projects of private companies (64.2%) more than in government projects (26.5%).

figure 4. frequency of branching (always 45.9%, very frequently 19.3%, frequently 15.6%, sometimes 11.9%, rarely 7.3%)

we verified the main reasons for creating branches. in this question, developers were allowed to mark more than one answer. 94 (86.2%) participants answered that the main reason is "to create new features", 81 (74.3%) answered "to fix bugs", and 46 (42.2%) mentioned "refactoring". "test" (35.7%) and "release" (35.7%) were also chosen by 39 respondents each as main reasons. participants could also use the open field to write other reasons. two developers mentioned "proof of concept", and one mentioned "enable the collaboration of different people". the results are shown in table 2.

table 2. reasons for creating branches
reasons                                        #    %
new features                                   94   86.2%
bug fixes                                      81   74.3%
refactoring                                    46   42.2%
testing                                        39   35.7%
release                                        39   35.7%
reasons also mentioned                         #
proof of concept                               2
enable the collaboration of different people   1

some software developers (hereafter referred to as sd) used the open field not to add a new reason, but to explain the selected reasons:

"we usually want to implement new features and this ends up generating a new branch, (...) many times to make releases for the client we have to use a new branch." (sd48)

"refactoring is what we do most in the private company i work for." (sd59)

other developers also mentioned a different reason:

"test new features and create proofs of concept." (sd06)

"for different people to be able to participate in the collaborative development." (sd83)

moreover, we evaluated the policies adopted by the participants' projects for creating a new branch. they responded that developers have "autonomy to create new branches" (68.8%). in contrast, others answered that the "team decides when a new branch will be created" (23.9%). only 7.3% marked "only the project manager or the person who maintains the software". participants could also use the open field for other response options. four developers mentioned the "use of git flow", a set of guidelines and a tool for creating and standardizing the use and naming of branches in a project.
one developer also pointed out that "branches are automatically created by code review and pipeline systems", and another mentioned that the policy is to "not use branches". the results are shown in table 3.

table 3. policies for creating branches
policies                                                                          #    %
developers have autonomy to create new branches                                   75   68.8%
team decides when a new branch will be created                                    26   23.9%
only the project manager or the person who maintains the software                 8    7.3%
policies also mentioned                                                           #
git flow                                                                          4
automatically created (by code review and pipeline/continuous delivery systems)   1
commit to a master branch (no branch)                                             1

some developers used the open field not to add a new policy, but to explain the selected policy, as exemplified in the following:

"the developers create as many branches as they think it is necessary, but each one is responsible to constantly update the branch and integrate with the work of the others or exclude it if it does not have a well-defined purpose." (sd48)

"the team always discuss when it is really worth creating a new branch, managing new branches is difficult and if we are not in control something can be wrong." (sd87)

two developers selected the "team decides when a new branch will be created" and mentioned the "git flow strategy". two developers selected that "the developers have autonomy to create new branches", and also mentioned "git flow". as such, although the projects adopt a similar strategy and tool, some projects give more autonomy to members and others prefer to discuss each decision in depth:

"we use git flow, where both developers and managers have responsibilities when creating branches." (sd108)

"a production and a development branch, based on the concept of git flow." (sd28)

answer to rq1: branches are created frequently. developers have the autonomy to decide when to create the branch. the main reasons are to create new features and bug fixes.

4.2.2 rq2 (merge conflicts): what factors lead to merge conflicts?

we found that the use of branches is very common. however, as mentioned by shihab et al. (2012), such a level of isolation sometimes implies a cost of having to resolve integration conflicts. to measure how often merge conflicts occur, we asked developers to estimate the frequency: rarely, sometimes, frequently, very frequently, or "all the time". for 45% of the participants, conflicts occur "sometimes". the second most chosen option was "frequently" (24.8%). in turn, the third most chosen option was "rarely" (16.5%). it is important to mention that for 13.8%, conflicts occur "very frequently". this leads us to conclude that conflict occurrence commonly lies between "sometimes" and "frequently". results are shown in figure 5. we also found that conflicts are more common in government projects (52.9% of developers working for the government scored "frequently" or "very frequently") than in projects of private companies (24% of developers working for private companies selected the option "frequently" or "very frequently").

figure 5. frequency of conflict occurrence (sometimes 45.0%, frequently 24.8%, rarely 16.5%, very frequently 13.8%)
we also checked the factors that lead to the occurrence of conflicts. in this question, participants were allowed to mark more than one answer. 76 (69.7%) developers marked the option "branching duration", i.e., the time a branch is isolated. the "lack of communication" among a team's members was also chosen by 64 (58.7%), and the "number of changed files" was cited by 53 (48.6%). the "number of developers" was also chosen by 42 (38.5%). developers could also use the open field for other response options. five developers said that "not synchronizing the repositories" can lead to conflicts. three developers mentioned the "difference in the code formatting" as a reason that can lead to conflicts, and two developers mentioned "tasks not mapped correctly". results are shown in table 4.

table 4. factors that lead to conflicts
factors                                               #    %
branching duration                                    76   69.7%
lack of communication                                 64   58.7%
number of changed files                               53   48.6%
number of developers                                  42   38.5%
number of lines of code                               31   28.4%
same developers in many branches                      28   25.6%
number of commits                                     24   22.0%
factors also mentioned                                #
do not keep repositories up to date                   5
code formatting                                       3
tasks not mapped correctly                            2
coupling level of the code                            1
complex features                                      1
long time to deploy                                   1
many features in development                          1
tasks not correctly mapped/broken into small pieces   1
technical debt                                        1

some developers used the open field to explain the selected factors that can lead to conflicts and also to add more factors. they selected and commented on factors such as "time between the branch and the merge" and "lack of communication", but also mentioned that "repositories are not kept up to date" and "complex functionalities" can lead to conflicts, as exemplified next:

"generally the longer the time between the branch and the merge, the more files tend to be changed (...), resulting in greater possibilities of conflicts. another point that influences is the non-practice of constant rebase, leaving the branch out of date with respect to its origin (usually the master). complex functionality can also influence branches that take longer to merge." (sd04)

"the lack of communication is the worst of them. because if the team communicates daily, one knows what the others are up to, and conflicts are mitigated/reduced. if conflicts are not easy, you need more communication or more frequent integration (minor merge)." (sd55)

"whenever there's a conflict, it is because developers forgot to communicate what they were changing, resulting in two developers changing the same functionality or something very close." (sd16)

some developers mentioned unlisted factors, such as the difference in the "code formatting", "tasks not mapped correctly" and "technical debt". as conflicts occur in modifications in the same code region, criticism of minor issues such as code writing and style is understandable:

"lack of configuration in the editors, which change between indentation with tab/space, amount of space, line break... just opening and saving the file, and lack of style in the code, where each one writes the code in a different way, and another developer adds/removes spaces, parentheses (...) this causes a change in one line to reflect the entire file." (sd26)

"tasks not mapped correctly/broken into small enough parts correctly that lead to interfering with the same pieces of code. accumulated technical debt that requires changes in many places, for example regarding code formatting, use of depreciated techniques, etc..." (sd76)

answer to rq2: conflicts sometimes occur. the main reasons that can lead to merge conflicts are the time a branch is isolated and lack of communication.
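the "constant rebase" practice mentioned by sd04 can be sketched as follows. this is our minimal illustration, not a prescription from the study: the repository path, branch, and remote names are hypothetical.

```python
# minimal sketch of the "constant rebase" practice mentioned by sd04: regularly
# replaying a feature branch on top of its up-to-date upstream, so that parallel
# changes surface early and in small doses instead of accumulating until one
# large merge. repository path, branch, and remote names are hypothetical.
import subprocess

def keep_branch_up_to_date(repo=".", branch="feature", upstream="origin/main"):
    def git(*args):
        subprocess.run(["git", "-C", repo, *args], check=True)
    git("fetch", "origin")   # refresh the local view of the remote repository
    git("switch", branch)    # move to the feature branch
    git("rebase", upstream)  # replay the branch's commits on top of upstream
```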
4.2.3 rq3 (resolve conflicts): which practices do developers generally adopt to avoid merge conflicts?

as conflicts are common, we asked participants about the difficulty in resolving a conflict and which practices they believe may contribute to avoiding merge conflicts. to verify the time to resolve a conflict, we asked developers to estimate the duration: some hours, some days, one week, or more than a week. most of them (80.7%) answered that they spent "less than 24 hours" to resolve the conflict. some of them (17.4%) answered that they spent "some days" (1 to 6) to resolve the conflict. only 2 (1.9%) participants answered "one week". we also verified the difficulty level in resolving a merge conflict from their perspective, as greiler et al. (2022) note: "factors may impact a specific developer's experience and depends on his/her personal, team, organization, and project contexts". some developers answered "easy" (32.1%) and "medium" (32.1%), and some of them answered "very easy" (22.9%). the results about the time to resolve the conflict and the level of difficulty are shown in table 5.

table 5. time to resolve a conflict and difficulty level (counts; – means no answers)
time to resolve conflict   #    very easy   easy   medium   difficult   very difficult
less than 24 hours         88   25          31     27       5           –
some days (1-6)            19   –           4      6        9           –
one week                   2    –           –      2        –           –
more than one week         0    –           –      –        –           –

finally, we investigated practices to avoid conflicts. developers were allowed to mark more than one answer. the two most frequent answers that may contribute to reducing merge conflict occurrence were: "improve team communication", by 78 (71.5%) participants, and "less branching duration", by 75 (68.8%) participants. these factors really seem to be very important, given that "branching duration" and "lack of communication" were the most cited factors that can lead to conflicts. participants also selected "divide the work among the team" (57.7%), "small changes" (54.1%), and "frequent commits" (52.2%) as good practices to avoid conflicts. developers could use the open field for other response options. some participants informed that they do "not use new branches" and do commits directly on the main branch, "adopt code style" and "git flow tool", and always keep the "workspace branch up to date with the remote repository". results are shown in table 6.

table 6. factors to avoid conflicts
factors                                                    #    %
improve team communication                                 78   71.5%
less branching duration                                    75   68.8%
divide the work among the team                             63   57.7%
small changes                                              59   54.1%
frequent commits                                           57   52.2%
factors also mentioned                                     #
do not use branches                                        3
adopt code style tool                                      2
keep the branch up to date with the master/trunk/main      2
git flow                                                   2
adopt awareness tool                                       1
architecture patterns (more cohesion and less coupling)    1
branch by task                                             1
continuous integration                                     1
gui to interact with repository                            1
frequent deploy                                            1
feature flags                                              1
keep only experts                                          1
language syntax                                            1

moreover, some developers used the open field to explain the selected factors that can avoid conflicts or even add unlisted factors:

"always check for code updates in the master / trunk / main." (sd34)
"improve communication channels and also use other awareness tools to know what each one is changing." (sd68)

"i encourage people on my team to avoid branches as much as possible and implement techniques such as feature flags for everyone to always work at master/main. rather than dealing with conflicts, i would like 'devs' to become more experienced in trunk based development." (sd31)

"use of techniques like git flow." (sd77)

"adopt a tool that validates the code style." (sd26)

answer to rq3: developers usually take no more than some hours to resolve a conflict, since it is usually easy to do. the main factors to avoid conflicts are improving team communication and reducing the time of isolation.
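the feature-flag technique suggested by sd31 can be illustrated by the toy sketch below; it is our simplification, and the flag store, feature name, and checkout functions are invented for illustration only:

```python
# toy sketch of the feature-flag technique mentioned by sd31: incomplete code is
# merged to the main branch early but kept dormant behind a flag, so developers
# integrate often instead of isolating work in long branches. the flag store,
# feature name, and checkout functions below are hypothetical.
FLAGS = {"new_checkout_flow": False}   # in practice, toggled via configuration

def is_enabled(flag: str) -> bool:
    return FLAGS.get(flag, False)

def legacy_checkout(cart: str) -> str:
    return f"legacy checkout for {cart}"   # the path users actually exercise

def new_checkout(cart: str) -> str:
    return f"new checkout for {cart}"      # merged early, but still dormant

def checkout(cart: str) -> str:
    if is_enabled("new_checkout_flow"):
        return new_checkout(cart)
    return legacy_checkout(cart)

print(checkout("cart-42"))  # -> "legacy checkout for cart-42" until the flag flips
```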
5 analyzing factors that affect merge conflicts

the first study (survey research) was grounded (planned and constructed) on the factors for merge conflicts mentioned in solid related work, as mentioned previously, focusing on the brazilian software developers' perceptions. based on the quantitative results, we decided to deepen the understanding of how the factors contribute to increasing or decreasing merge conflicts through interviews in a qualitative study (field study). in this section, we present details on the planning and execution of this qualitative study with semi-structured interviews, and we answer the two additional research questions.

5.1 planning and execution

we grounded this study on singer et al. (2008)'s work. according to the authors, a field study aims to investigate practitioners in the context of any task or activity and, based on a specific technique, identify how they cope with their work in practice or how they solve some problems in their contexts. based on singer et al. (2008)'s recommendations, software developers with at least one year of experience were invited to answer a set of questions regarding two major aspects: how the factors identified in the survey research contribute to increasing merge conflicts and how they contribute to decreasing them. these participants were invited from the researchers' networks, considering their availability to participate in an interview session.

we planned and constructed our interview questions (iq) from the results of the first study (survey research) and focused on the last two research questions presented in section 1. as such, we took the main factors pointed out by the software developers in the survey research and designed eight questions for the interview sessions of the field study: four questions about factors that contribute to increase merge conflicts (table 4) and four questions about those that contribute to decrease merge conflicts (table 6). the goal is to understand why these factors are so important and how time of experience (years of working, studying, dealing with merge conflicts, and collaborating with other developers) influences and modifies the software developers' perceptions over the years. the eight grouped questions are listed below.

regarding factors that contribute to increase merge conflicts, we asked the following questions:

• iq1: questions on the factors
– do you agree with the table presenting the factors that can lead to merge conflicts?
– how was your experience in coping with merge conflicts early in your career?
– how about coping with merge conflicts nowadays?
• iq2: in your opinion, what makes the branching duration so important?
• iq3: questions on lack of communication
– what is your opinion on the lack of communication?
– does the lack of communication affect other factors presented in the table?
• iq4: questions on negative effects
– which factor brings more negative effects? why?
– has the time of experience changed your answer?

regarding factors that contribute to decrease merge conflicts, we asked the following questions:

• iq5: what is your opinion on the factors "lack of communication" and "branching duration" changing their positions in the table?
• iq6: questions on the team influence
– how did communication with the team influence the way to avoid these conflicts?
– have you felt any improvement over time?
• iq7: questions on past experiences
– have you been able to cite or work on a project that covered any of these factors?
– was it a successful experience or not?
• iq8: how do you see that "less branching duration" contributes to decrease merge conflicts?

a total of 15 software developers participated in the second study (hereafter referred to as fd, from field study developer identifier). all of them were brazilians and answered questions about merge conflicts and their perceptions. the goal was to deepen the understanding of how branches are adopted, as well as of conflict resolution in this context. with this in mind, an interview session of about 30 minutes was conducted with each software developer and recorded via the zoom platform. all data and information were collected anonymously and treated specifically for academic purposes, as explained to the participants in the email invitation, in the informed consent form, and in the conversation before each interview.

as mentioned above, the interviewees were contacted by email. the main selection criterion was to have used some version control system (e.g., git, subversion, version manager), which means that the interviewee has probably already faced some merge conflict. they were informed that they could withdraw at any time, they were allowed to not answer some questions (if they wanted to), and all video and sound data collected would not be made public (it was collected for the study analysis purposes only). before starting an interview, we asked about the developer's time of experience, how long he/she has been dealing with merge conflict situations (which is not necessarily linked to professional experience), and what kind of industry sector he/she works in (or has worked for). in table 7, the interviewees' time of experience is presented. from the 15 developers interviewed, 7 (46.6%) have about ten or more years of experience.

the first four questions of each interview referred to factors that contribute to increase merge conflicts. from tables 4 and 6, we could deepen this discussion by also understanding whether the interviewees agree with the factors and/or would like to add more (or correlated) factors. in turn, factors that contribute to decrease merge conflicts were covered by the last four questions.
the goal was the same as in the previous questions, but we also wanted to know whether the software developers' time of experience affects their perceptions over time. firstly, a pilot was run with four software developers to verify the interview session protocol and duration. the pilot helped us to check whether the questions were clear enough, as well as to estimate how long a session would last on average, so as to avoid stressing the interviewee and losing focus in complex questions.

5.2 results

this section presents the results obtained from the interviews in the field study and the answers to the last two research questions of our work, as mentioned in section 1. to do so, for each research question, we took as the main codes from the interviewees' answers the most frequent factors that contribute to increasing and decreasing merge conflicts, based on those reported in the first study (survey research). this strategy allowed us to get a better understanding of how those factors have an impact on practice.

table 7. interviewees' time of experience
developer   time of experience
fd01        almost 6 years
fd02        4 years
fd03        6 years
fd04        2.5 years
fd05        12 years
fd06        7 years
fd07        24 years
fd08        14 years
fd09        3 years
fd10        10 years
fd11        12 years
fd12        1 year
fd13        14 years
fd14        4 years
fd15        15 years

5.2.1 rq4 (produce conflicts): how do the factors identified in the survey research mostly contribute to increase merge conflicts?

regarding the factors that contribute to increase merge conflicts and their order of importance, most of the software developers (9) agree (fd03, fd04, fd05, fd06, fd07, fd09, fd10, fd12, and fd14) with the table presented during the interview session (table 4), and some (4) partly agree (fd01, fd02, fd08, and fd13). only two interviewees (fd11 and fd15) declared that they do not completely agree with the table with the factors. the interviewees who did not completely agree with the table mentioned that maybe a factor should be considered as more meaningful for a specific context or scenario. "number of lines of code" was highlighted by three software developers (fd03, fd10, and fd14) as something of greater importance. one interviewee reported an experience about one of the critical factors that contribute to increase merge conflicts:

"i had a lot of problems when i worked, even when the team was small (...) four people developing (...) it was like parallel editing of the same code and people did not have much experience in sharing code (what they did, what they edited...)." (fd11)

we invited the interviewees to comment a little bit more about their career in order to compare their beginning against their current perception. six software developers (fd01, fd04, fd08, fd11, fd12, and fd14) explained that they did not work with either git or other repositories early in their career. additionally, they were running academic projects that were characterized as small and without merge problems:

"[starting with academic projects] is common in our career. as far as your projects are more and more scaling, even the culture of the company where you work, you may have merges that end up being complicated to cope with." (fd14)
four developers (fd07, fd10, fd13, and fd15) with more years of experience pointed out how important the evolution of version control tools is, especially for assisting situations regarding merge conflict resolution and for detecting conflicts as well. according to one of those interviewees:

"as time goes by (...), the existing tools started to carefully address this kind of activity (merge) (...) a diff not correctly done was complex for us at the first years of research and practice in version control systems (...) there was a free tool, but it was complicated to work with it considering a lot of existing bugs..." (fd07)

team communication/behavior was mentioned by three developers (fd02, fd03, and fd06) as something noticed both early in their careers and also currently in their work. fd03 even highlighted that the size of a project, a factor mentioned directly and indirectly by more than one interviewee, also affects merge conflicts. fd06 pointed out that he noticed a programming language barrier in open source projects. on the other hand, long branch duration was referred to as important due to the changes made over time, i.e., how much code has been modified/moved in a project (fd01, fd04, fd09, fd12, and fd13). according to one of these interviewees:

"i believe that more long branches you have, more changes and modifications of code you have to cope with, implying in implementation of new features, deprecating other functions of the program and methods, and so on. as such, long branches bring bigger merge conflict problems..." (fd04)

the interviewees often expressed concern not only about what brings merge conflict problems in long branches, but also about why small branches would be more convenient. they mentioned some cases, such as a branch being outdated when compared to the main branch/another branch (fd05 and fd06), or a branch being associated with a sprint/short time interval (fd01, fd14, and fd15), as presented next:

"i believe that shorter branches (...) can decrease the number of merge problems." (fd04)

"... if you are running an agile method and stories that are better defined, broken into sub-tasks (...), for example, you do not have this problem, because (theoretically) you have a story in a certain subscope of the development of your project that will be somehow isolated." (fd01)

"...if we have a branch with a very long, very extensive time frame, (...) sometimes we cannot collect such an accurate feedback from the business area and this would be what we really need to change for production." (fd03)

communication was called "essential" by one interviewee, "important" by another, and "fundamental" by a third one. the lack of communication was related to problems not only previously described in table 4, but was also declared a behavioral problem, as exemplified next:

"(...) because it leads to merge conflicts and frequently it makes some behaviors in the software development keep happening throughout the project and this affects code, functionalities, (...) the implementation of the project as a whole." (fd02)

some developers (fd01, fd03, and fd09) mentioned that communication problems go beyond the technical aspect.
this fact is highlighted in the following fragment:

"the lack of communication will lead to conflicts, not only in git, but also any way of working. here it leads mainly (...) to cases in which you will end up messing with something that someone else was already working on..." (fd03)

moreover, the agile methodology was also cited by two interviewees as a strategy to support communication in the software development project. this is pointed out in the next fragment:

"it is clear the difference between those who use the [agile] methodology or not." (fd06)

it is worth highlighting that the factors mentioned in table 4 were also reinforced by the interviewees. one of them stated that:

"the lack of communication usually causes problems regarding the branching conflict, in merge, (...) within the development team itself (...). this is critical for the understanding the time the story is started and that you are doing a part of a whole, (...) for example, not keeping the repository updated (...) will affect the parallel editing of the same code." (fd01)

tools such as configuration management tools (fd10), project management tools (fd10), task organization tools (fd10), change tracking tools (fd05), and screen sharing tools (fd13) were cited as kinds of support for communication:

"...depending on the tool for change tracking, continuous change etc., i believe that (verbal) communication (...) helps you eliminate the problem a little bit. you can see 'someone' (...) touching exactly such and such point of the system (...) and you can verify where you can touch or not, and i believe that this impacts less on merge problems." (fd05)

communication problems can also lead to rework (fd11, fd12, fd13, and fd14), either by an added feature or by a change not communicated to the team:

"lack of communication (...) you end up having to redo what you did. you thought it was right, but it was not what was supposed to be done. this generates so much rework for the developer, stress for the manager..." (fd11)

when we asked about which factors they consider the most negative ones, nine different answers were obtained, as shown in figure 6. communication was the most mentioned factor in the opinion of six interviewees (fd02, fd04, fd06, fd09, fd11, and fd14).

figure 6. factors that negatively affect merge conflicts

"...they [conflicts in merge] occur because the distraction of several people, myself included of course. it can lead to some error that will lead to a headache until we can solve it ...." (fd09)

in this context, two problems were pointed out by two interviewees each: outdated repository (fd03 and fd07) and a long branch duration (fd08 and fd15). some interviewees selected other problems: many developers in the same branch (fd05); long implementation time (fd07); number of commits (fd08); and number of lines of code (fd12). only one of them (fd13) pointed out that it was hard to solve conflicts in the beginning of his career, but he would not see it as a problem at the present moment, but rather as something expected from the learning process over the years.

when the interviewees were asked to remember their experiences from the beginning of their career, only three of them (fd02, fd10, and fd11) believe they would have responded to the questions in the same way. the majority, 12 developers (fd01, fd03, fd04, fd05, fd06, fd07, fd08, fd09, fd12, fd14, and fd15), believe their view of what contributes to increase merge conflicts has changed over time.
as a conclusion, seven factors were perceived by the interviewees as the main problems regarding merge conflicts, but they have been rethought over time: number of modified files (fd01 and fd06); communication and duration of the branches (fd03); organization (fd04); tools used in the projects (fd07); lack of attention (fd09); and number of lines modified in the projects (fd14). when the interviewees were asked to compare their perception at the beginning of their career against their current perception, it is not clear if there is a pattern of "x-answers from the beginning changed to y-answers at the present moment". the interviewees mentioned situations they had experienced to justify the choice of a factor that affected their work early in their career versus others impacting their current projects.

answer to rq4: the interviewees mostly agree with the factors that lead to merge conflicts, presented in table 4. long branches, software development methodology, and communication problems were pointed out as some of the main factors to be considered in this context.

5.2.2 rq5 (avoid conflicts): how do the factors identified in the survey research mostly contribute to decrease merge conflicts?

when we asked the interviewees if communication should be in the first position in table 4 (factors that lead to merge conflicts) as well as in the first position of table 6 (factors that avoid merge conflicts), 10 interviewees (fd02, fd03, fd04, fd05, fd06, fd09, fd10, fd11, fd12, and fd13) agreed with the greater importance of this factor. this relevance is exemplified in the following fragment:

"...because communication is in fact very important and it impacts on several factors..." (fd03)

in addition, the interviewee fd05 emphasized that the effective encouragement and use of communication tools (e.g., email, chat, awareness-support systems, etc.) contribute to decreasing merge conflicts. the interviewee fd09 argued that good communication reduces rework, and the interviewee fd06 mentioned that the problems are more related to human aspects of the software development process. as such, no specific type of communication was recommended in the interviews. the answers referred to talking more and better before working on a project, using some computational tools, and applying agile methodology as ways to improve and keep communication frequent.

some interviewees (fd07, fd08, fd14, and fd15) believe that "branch duration" is a major factor even when the goal is to decrease merge conflicts. the interviewee fd08 mentioned that some factors in this context may be related to inexperience or "post-conflict" thinking. the interviewee fd14 raised a concern about the extent to which we notice communication around us:

"... i think it is normal to change your mind, and i think what happens is that you start thinking about the situations you have been through on the team, then you start thinking 'if that guy had talked to me, it would be less torturous to resolve the conflict'..." (fd14)

the interviewee fd01 reported that communication was highlighted because of cultural reasons. in other words, it refers to the idea of pointing out that there is a problem and that "lack of communication" would be the main factor on this subject, being similar to "pointing the finger" at someone.
by indicating communication as a strategy to decrease merge conflicts, people feel more comfortable in communicating with each other:

"(...) communication as something to avoid a problem rather than being the problem itself." (fd01)

all interviewees mentioned the communication in the team, either based on a previous project (past experience) or on the one they currently work on. in both scenarios, nine of them (fd01, fd02, fd05, fd06, fd07, fd09, fd10, fd11, and fd14) noticed improvements regarding effective communication and its positive effects. as suggestions for improving communication, some interviewees cited management, planning, and infrastructure:

"there are several factors that will influence that aspect of improving communication. i think the first one is management." (fd03)

"there is also that question related to technological limitation. if we think about the current pandemic scenario, (...) several companies have adapted their infrastructure with resources to foster and ensure good communication." (fd03)

"communication works from the moment you plan how that communication is going to be done." (fd04)

the improvements mentioned by the interviewees resulted in problem-solving (fd01), collaboration between team members (fd04), less rework (fd11), and less time doing merges (fd13). an example of this report is presented next:

"the fact that you have a person with more knowledge helps a lot who is there working on that project and who may not have enough knowledge, especially related to the business in which that project is inserted." (fd04)

the interviewee fd15 mentioned that communication would help to resolve, rather than avoid, a merge conflict. this fragment is presented as follows:

"i don't think there has been much change regarding this topic in the last few years (...) i think all of that is still a problem related to our inability to clearly record or summarize the developer's intention at the time he/she writes a particular piece of code." (fd15)

finally, it is worth mentioning that not everyone may have faced a situation in which they realized that communication would be the key factor for avoiding a merge conflict. this is indicated in the following fragment:

"i did not have maturity on this subject before [i.e., thinking about communicating to avoid conflict]. so, if this ability is improved over time, i could only have the notion of its importance to avoid merge conflicts currently..." (fd04)

from the 15 software developers who were interviewed in our field study, 12 have some experience in implementing practices to address the factors they completely or partially agree with. based on the factors listed in table 6, communication (fd01, fd02, fd03, and fd09), task division (fd01, fd02, fd11, and fd15), repository up to date (fd01 and fd05), task-based branch (fd01 and fd05), branching strategy (fd01 and fd08), more frequent commits (fd01), small changes (fd01), and short branches (fd15) were the most prominent. devops culture was also mentioned (fd01 and fd13). some interviewees also added other factors, e.g., training (fd04) and some support tools (fd05, fd07, fd08, fd09, fd12, and fd13), such as trello, discord, and vscode.
these tools were mentioned when the interviewees talked about aspects regarding communication, change history, standardization, and task division, as exemplified next:

"we standardized vscode as our ide. there were many people who used other environments." (fd13)

"... it required improvements. communication should be a frequent concern. you cannot have communication only when there is a problem, i.e., it has to be a daily target." (fd01)

another factor mentioned by some interviewees was the branching duration (fd03 and fd08), especially because of the business. this was also mentioned by the interviewee fd11:

"...branching duration is directly linked to the business. what comes in and leaves depends on the business, the owner of the company, the client (...) and there is a little margin for negotiation." (fd08)

factors that contribute to decrease merge conflicts also mentioned by the interviewees refer to continuous integration (fd01 and fd08), project management environment (fd10), and more frequent commits (fd14). branching duration was confirmed as an important factor to avoid conflicts by nine interviewees (fd04, fd05, fd06, fd07, fd10, fd11, fd13, fd14, and fd15). three others (fd02, fd09, and fd12) did not know how to evaluate it, and another three (fd01, fd03, and fd08) commented that it is not really the duration of a branch itself that prevents conflicts. in this regard, we found arguments for a shorter branching duration, such as the repository being up to date (fd04, fd07, and fd13), less time for code changing (fd05), less divergence (fd06), memory of what has been done and what is not affected (fd10), the speed of development (fd11), less chance of conflicts (fd13), and faster merges (fd14). the interviewee fd15 summarized the mentioned factors:

"the longer you are isolated in a branch, the more likely another developer will come and change the code that is in parallel with you (...). you will generally remember less and less about it. knowing how the code was before and having fewer developers working in the same code area as you are factors that help you solve (...) or avoid a merge conflict." (fd15)

answer to rq5: the interviewees mostly agree with the factors that avoid merge conflicts, presented in table 6. communication (from simple conversations to those based on computational tools), team management, and infrastructure were pointed out as some of the main factors to be considered in this context.

6 discussion and implications

in this section, we present the main findings of this research on factors that affect merge conflicts based on a quanti-qualitative method.

1) branches are very common and developers have the autonomy to create new branches: most software developers create branches frequently or all the time. the use of branches is very common according to the developers' perspective collected from the survey questionnaire, and no participant in the field study interviews mentioned not using them. only a few developers marked the option that they discuss the creation of new branches with their teams. in a large study at microsoft, shihab et al. (2012) identified that developers should be careful about branch creation, since it may lead to an increase in the likelihood of failures. they suggest aligning the branching structure with the architectural and organizational structure of their teams (shihab et al., 2012). as mentioned by bird et al. (2011), branches do not come without a price, given that a branch is normally integrated into others at some point.
2) new features and fixing bugs are the main reasons for creating branches: our results confirm the findings of other studies. zou et al. (2019) found similar results in their investigation with 2,923 projects developed on github: branches are mainly used to implement new features, conduct version iteration, and fix bugs. owhadi-kareshk et al. (2019) and vale et al. (2020) address that developers often use branches to add features or fix bugs. according to bird et al. (2011), branches are created to implement a feature, perform a maintenance exercise, do continued maintenance on a subsystem, or fix several related bugs. premraj et al. (2011) mentioned that branches help developers, architects, build managers, testers, and other people to change software artifacts. additionally, the agile methodology was cited by some software developers in the field study interviews as one of the strategies to cope with the creation of a new branch without contributing to increasing or decreasing merge conflicts.

3) branching duration and lack of communication are the main problems: based on the related work (table 1), attributes related to the branching duration are very common, but only two studies mention the branch duration as an indicator of conflict. dias et al. (2020) and menezes et al. (2020) found a relation between the duration of the merge scenario and the conflict occurrence. dias et al. (2020) mentioned that "contributions developed over longer periods of time are more likely associated with conflicts". menezes et al. (2020) found that the timing attributes have a (small) impact on the conflicts. vale et al. (2020) verified the relation between github communication and the occurrence of merge conflicts. the authors found no significant relation between communication measures and the number of merge conflicts. however, the communication recorded by the authors was based only on the communication extracted from github. so, they extracted the communication of all active contributors, the communication by means of pull requests and related issues, and the communication mapped to artifacts that have been changed in the merge scenario. the communication mentioned by the software developers who responded to the survey questionnaire regards the awareness of parallel changes. sometimes developers forget to communicate what they are changing, resulting in two developers changing the same functionality or something very close. in the field study interviews, some software developers also suggested that keeping shorter branches is the best decision to avoid merge conflict problems, especially those related to developers' communication and the lack of memory of the changes performed in the project and team over time.

4) most of the time, conflicts are not difficult: most conflicts offer no difficulty (medium or easy) and are resolved in some hours. accioly et al. (2018), ghiotto et al. (2018), and pan et al. (2021) identified the most common conflict patterns and resolutions. accioly et al. (2018) found that 84.57% of merge conflicts happen because developers modify the same lines or consecutive lines of the same method. ghiotto et al. (2018) found that conflicting chunks generally contain all the necessary information to resolve them.
pan et al. (2021) found in their study on conflict resolution that 28% of changes are of 1-2 lines for both main and forked branches, and that 39.5% of the resolution strategies involved concatenating the main and the forked branch's changes. mckee et al. (2017) performed a survey and found nine factors that developers use when attempting to determine if a conflict is difficult; the complexity of the conflicting lines of code and files, the knowledge in the area of the conflicting code, and the number of conflicting lines were the most cited. it is interesting to mention that some of these factors were used in some related work to predict conflict occurrence. brindescu et al. (2020b) also investigated the characteristics of merge conflicts that are associated with their difficulty. the authors found a subset of ten factors that can predict the difficulty of merge conflicts, including complexity, diffusion, size, and development pattern. the more experienced developers pointed to the improvement of version control tools over time as a factor that has improved conflict resolution. it is worth highlighting that the field study interviews also raised that the project's size somehow influences the resolution of merge conflicts, especially in large projects (and large teams), where the chance of merge conflicts is higher. moreover, when a developer is at the beginning of his/her career, he/she does not usually pay attention to this kind of situation, especially to the importance of communication in a project (de farias junior et al., 2022).
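the textual overlap behind these findings (edits to the same or consecutive lines) can be illustrated by the sketch below. it is our simplification, with invented line ranges; it is not the detection logic of any of the cited studies, which analyze real diffs.

```python
# illustrative sketch of why edits to the same or consecutive lines conflict in
# a line-based merge: two changed line ranges of the same file clash when they
# overlap or are directly adjacent. the ranges below are invented examples.
def ranges_conflict(a, b, adjacency=1):
    """true if line ranges a=(start, end) and b overlap or are adjacent."""
    return a[0] <= b[1] + adjacency and b[0] <= a[1] + adjacency

def conflicting_hunks(branch1_hunks, branch2_hunks):
    """pairs of changed-line ranges (same file) that a textual merge cannot
    reconcile automatically."""
    return [(h1, h2)
            for h1 in branch1_hunks
            for h2 in branch2_hunks
            if ranges_conflict(h1, h2)]

# edits to lines 10-12 and 13-15 are consecutive, hence flagged; 40-41 is safe.
print(conflicting_hunks([(10, 12), (30, 31)], [(13, 15), (40, 41)]))
```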
5) improve team communication and less branching duration can avoid conflicts: as mentioned previously, dias et al. (2020) and menezes et al. (2020) found that timing measures have an influence on conflict occurrence. so, we believe that a good practice is to pay attention to the isolation time and not postpone the merge so much. when developers are less isolated, the repositories are synchronized and people are aware of what other people are doing. software developers in the survey questionnaire mentioned the importance of knowing what parts others are working on to avoid conflicts. communication and its relation with merge conflicts are investigated mainly in studies addressing awareness. some specific studies (sarma et al., 2008; brun et al., 2011; sarma et al., 2011; guimarães and silva, 2012; estler et al., 2013) focus on the prevention of conflicts through awareness, i.e., detecting conflicts early. basically, these tools monitor workspaces and inform developers of ongoing parallel changes in other workspaces. as also mentioned by some software developers in the field study interviews, it is relevant to improve communication channels and also use awareness tools to know what each one is changing. moreover, other points related to the factors referred to having more and better conversations before starting a branch (or even a project), based on computational tools such as trello, as well as applying agile methodology as a strategy to reduce the time span.

6) qualitative analysis findings: the answers to our survey questionnaire and field study interviews show that software developers also use branches to create proofs of concept, and git flow seems to be a good strategy to coordinate the use of branches. in addition, they suggest that not keeping the repository up to date can cause problems, so developers need to bring up the changes constantly. attention regarding the code formatting is important. accioly et al. (2018) noticed that part of the merge conflicts is simply caused by changes to code indentation or consecutive line edits. regarding this problem, some software developers suggest adopting a code style tool. furthermore, as good practices to avoid conflicts, some of them also mention the option of not using branches and adopting techniques such as feature flags. they also cited always communicating with the team and checking for code updates in the master/trunk as good practices, as noticed in the field study interviews.

7 threats to validity and credibility

this work applied a quanti-qualitative method. therefore, there are two different empirical studies (survey research and field study), and each of them has specific threats and limitations. each subsection below informs their threats as well as the strategies to mitigate them.

1) survey research: a) protocol. we adopted some predefined answers to some closed questions, given that they were grounded on previous studies published in the literature (owhadi-kareshk et al., 2019; leßenich et al., 2018; dias et al., 2020; menezes et al., 2020; vale et al., 2020). moreover, we also left an open field allowing a participant to comment on different factors not listed in the question. we developed the questionnaire very carefully. as it would be our main source for all sections of this study, we discussed and took a long time to construct our questionnaire. in addition to our experience on the subject, we spent a lot of time looking at the literature and building our survey based on the pieces of evidence from these studies and some similar initiatives (condina et al., 2020; kamei et al., 2020). we also conducted a pilot with four developers and asked for feedback on the questions, and whether they were understandable and relevant to the study.

b) sample. the software developers who responded to the questionnaire were invited by email via contact lists, and they were asked to share the survey with their colleagues with experience in software development (snowballing invitation). we tried to make sure that only people with experience in the use of vcs answered the questionnaire, either in the invitation or in the survey description, or even in the question specifically referring to the use of any vcs. such an approach was important to avoid any participant with a lack of experience or knowledge.

c) context. we only had the participation of brazilian software developers in our study. results may not be generalized to the context of all software developers all over the world. some results confirmed the findings presented in related work, but others require more in-depth investigation. in addition, according to smith et al. (2013), high-quality research on the human side of software engineering requires real software developers, but getting high levels of participation remains a challenge for researchers. nonetheless, it is relevant to emphasize that our results reflect the perspective of a large group (109 participants).

2) field study: a) protocol. we used the results from the survey research as the input for the questions prepared for the interview sessions, considering the main factors that affect merge conflicts according to the brazilian software developers who answered the survey questionnaire.
the developers who were interviewed were invited by email and were requested to share the invitation with colleagues with some experience in resolving merge conflicts (snowballing invitation). only brazilian software developers participated in the field study interviews. therefore, the results may not be generalized, especially considering the interpretive validity of a qualitative study, i.e., the possibility that, even without intending to, the researcher imposes his/her own perception instead of really understanding what the interviewee meant. b) sample. our intention was to have at least 20 interviewees, based on guest et al. (2006)'s work, which reports the occurrence of saturation with around 12 interviews when "the aim is to understand common perceptions and experiences among a group of relatively homogeneous individuals". moreover, steglich et al. (2019) and greiler et al. (2022) conducted field studies with software developers considering guest et al. (2006)'s work and reinforce that the most important criterion is saturation, i.e., when new interviews with relatively homogeneous individuals no longer provide any new data or information (a minimal sketch of this criterion appears at the end of this section). for example, steglich et al. (2019) reached saturation with 11 interviews. in our study, 15 developers were able to participate in the period when the field study was run. based on the interviews, saturation was reached with 12 interviews, which is in accordance with guest et al. (2006)'s work. it is important to remark that the main goal of our field study was to collect the brazilian software developers' perceptions on merge conflicts in a qualitative setting, and not through a large-scale, quantitative study based on software repository analysis. c) context. the same concern pointed out by smith et al. (2013) is valid for the field study, i.e., "high-quality research on the human side of software engineering requires real software developers, but getting high levels of participation remains a challenge for researchers". this includes the software developers' unease about how to proceed with the interview questions, given the fear of leaking confidential information from their own projects and/or the companies they work for. this is a critical barrier in field studies, given their qualitative, in-person nature (singer et al., 2008), especially when requesting participation. nonetheless, it is important to highlight that the results of this study reflect the vision of a group of brazilian software developers, with a focus on deepening the understanding of the results from the previous survey research (smith et al., 2013).
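as a minimal illustration of the saturation criterion discussed in the sample subsection above, the sketch below reports the last interview that contributed a previously unseen code; the interview data are hypothetical and do not come from our field study.

```python
# minimal sketch of the saturation criterion: saturation is reached once
# additional interviews stop contributing codes not seen before.
def last_interview_with_new_codes(codes_per_interview):
    """return the 1-based index of the last interview that added a new code."""
    seen, last_new = set(), 0
    for i, codes in enumerate(codes_per_interview, start=1):
        if any(code not in seen for code in codes):
            last_new = i
        seen.update(codes)
    return last_new

# hypothetical codes extracted from five interviews
interviews = [{"isolation time"}, {"communication"},
              {"communication", "awareness tools"},
              {"isolation time"}, {"awareness tools"}]
print(last_interview_with_new_codes(interviews))  # 3: interviews 4-5 add nothing new
```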
8 conclusion

this research aimed to investigate factors that lead to or help to avoid merge conflicts. to do so, based on related work, we conducted two empirical studies to both understand and analyze factors that affect merge conflicts. firstly, we conducted survey research with 109 software developers to understand the adoption of branches as well as the occurrence and resolution of conflicts. the results suggest that the main factors that can lead to conflicts are "the time a branch is isolated" and "lack of communication". on the other hand, the factors cited as good practices to avoid conflicts were "improve team communication" and "less branching duration". "divide the work among the team", "small changes", and "frequent commits" were also marked many times by the participants of the survey research. communication here refers to the awareness of parallel changes, considering the importance of knowing what others are working on. we also performed a qualitative analysis to extract codes and categories from the open fields of five questions answered by the participants. we identified that git flow is a common strategy adopted to coordinate branches, along with constantly synchronizing the repository and paying attention to code formatting to avoid conflicts. next, we conducted a field study based on interviews with 15 software developers to analyze those factors and obtain a better understanding of what contributes to increasing or decreasing merge conflicts. the results show that communication with the team, checking code updates, shorter branch duration, and management (which comprises software development methodology, communication strategies, and awareness-support systems) seem to be key policies, not only for merge conflict resolution but also for decreasing conflicts. moreover, the developers' experience can change their perception of the problems faced in this context and helps to avoid or resolve merge conflicts, besides the fact that version control systems have evolved to a great extent, also being an important support on this topic. finally, this study allowed us to conclude that most of the software developers agree on the factors that lead to and the factors that avoid merge conflicts, and the underlying problems and how to resolve them are still a concern for all of them. in future work, we intend to evaluate the application of some good practices suggested in this work. we could evaluate supporting processes and tools to improve communication and reduce isolated work, among other mentioned factors. another opportunity is to perform a quantitative study based on mining software repositories in order to analyze some github projects against some findings of this work, for example, through a study on the projects' branching duration, communication tactics, and merge conflict resolution. finally, this work can be executed with software developers from other contexts (e.g., different cultures, countries, genders, etc.) to produce other indications and allow systemic analyses.

acknowledgements

we thank all the participants that answered our survey and interview. the author also thanks unirio and faperj (grant: 211.583/2019) for partial support.

references

accioly, p., borba, p., and cavalcanti, g. (2018). understanding semi-structured merge conflict characteristics in open-source java projects. empirical software engineering, 23:2051–2085.
bird, c., zimmermann, t., and teterev, a. (2011). a theory of branches as goals and virtual teams. in proceedings of the 4th international workshop on cooperative and human aspects of software engineering, pages 53–56.
brindescu, c., ahmed, i., jensen, c., and sarma, a. (2020a). an empirical investigation into merge conflicts and their effect on software quality. empirical software engineering, 25:562–590.
brindescu, c., ahmed, i., leano, r., and sarma, a. (2020b). planning for untangling: predicting the difficulty of merge conflicts. in 42nd international conference on software engineering (icse), pages 801–811.
brun, y., holmes, r., ernst, m. d., and notkin, d. (2011). proactive detection of collaboration conflicts. in 19th acm special interest group on software engineering symposium and the 13th european conference on foundations of software engineering (sigsoft), pages 168–178.
cavalcanti, g., accioly, p., and borba, p. (2015). assessing semistructured merge in version control systems: a replicated experiment. in 2015 acm/ieee international symposium on empirical software engineering and measurement (esem), pages 1–10. ieee.
condina, v., malcher, p., farias, v., santos, r., fontão, a., wiese, i., and viana, d. (2020). an exploratory study on developers opinions about influence in open source software ecosystems. in proceedings of the 34th brazilian symposium on software engineering, pages 137–146.
costa, c., figueiredo, j. j., ghiotto, g., and murta, l. (2014). characterizing the problem of developers' assignment for merging branches. international journal of software engineering and knowledge engineering, 24:1489–1508.
costa, c., figueiredo, j. j., pimentel, j. f., sarma, a., and murta, l. g. p. (2019). recommending participants for collaborative merge sessions. ieee transactions on software engineering.
costa, c., menezes, j., trindade, b., and santos, r. (2021). factors that affect merge conflicts: a software developers' perspective. in brazilian symposium on software engineering, pages 233–242.
de farias junior, i., marczak, s., dos santos, r. p., rodrigues, c., and moura, h. (2022). c2m: a maturity model for the evaluation of communication in distributed software development. empirical software engineering.
dias, k., borba, p., and barreto, m. (2020). understanding predictive factors for merge conflicts. information and software technology, 121:106256.
estler, h. c., nordio, m., furia, c. a., and meyer, b. (2013). unifying configuration management with merge conflict detection and awareness systems. in 22nd australian software engineering conference (aswec), pages 201–210.
ghiotto, g., murta, l., barros, m., and hoek, a. v. d. (2018). on the nature of merge conflicts: a study of 2,731 open source java projects hosted by github. ieee transactions on software engineering, 46:892–915.
greiler, m., storey, m.-a., and noda, a. (2022). an actionable framework for understanding and improving developer experience. ieee transactions on software engineering.
guest, g., bunce, a., and johnson, l. (2006). how many interviews are enough? field methods, 18:59–82.
guimarães, m. l. and silva, a. r. (2012). improving early detection of software merge conflicts. in 34th international conference on software engineering (icse), pages 342–352.
kamei, f., wiese, i., pinto, g., ribeiro, m., and soares, s. (2020). on the use of grey literature: a survey with the brazilian software engineering research community. in proceedings of the 34th brazilian symposium on software engineering, pages 183–192.
kasi, b. k. and sarma, a. (2013). cassandra: proactive conflict minimization through optimized task scheduling. in 35th international conference on software engineering (icse), pages 732–741.
leßenich, o., siegmund, j., apel, s., kästner, c., and hunsen, c. (2018). indicators for merge conflicts in the wild: survey and empirical study. automated software engineering, 25:279–313.
mckee, s., nelson, n., sarma, a., and dig, d. (2017). software practitioner perspectives on merge conflicts and resolutions. in 33rd ieee international conference on software maintenance and evolution (icsme), pages 467–478.
menezes, j. w., trindade, b., pimentel, j. f., moura, t., plastino, a., murta, l., and costa, c. (2020). what causes merge conflicts? in 34th brazilian symposium on software engineering (sbes), pages 203–212.
menezes, j. w., trindade, b., pimentel, j. f., plastino, a., murta, l., and costa, c. (2021). attributes that may raise the occurrence of merge conflicts. journal of software engineering, 9:14.
owhadi-kareshk, m., nadi, s., and rubin, j. (2019). predicting merge conflicts in collaborative software development. in 13th acm/ieee international symposium on empirical software engineering and measurement (esem), pages 1–11.
pan, r., le, v., nagappan, n., gulwani, s., lahiri, s., and kaufman, m. (2021). can program synthesis be used to learn merge conflict resolutions? an empirical analysis. in 2021 ieee/acm 43rd international conference on software engineering (icse), pages 785–796. ieee.
pfleeger, s. l. and kitchenham, b. a. (2001). principles of survey research: part 1: turning lemons into lemonade. acm sigsoft software engineering notes, 26:16–18.
premraj, r., tang, a., linssen, n., geraats, h., and van vliet, h. (2011). to branch or not to branch? in proceedings of the 2011 international conference on software and systems process, pages 81–90.
sarma, a., redmiles, d., and van der hoek, a. (2008). empirical evidence of the benefits of workspace awareness in software configuration management. in proceedings of the 16th acm sigsoft international symposium on foundations of software engineering, pages 113–123.
sarma, a., redmiles, d. f., and hoek, a. v. d. (2011). palantir: early detection of development conflicts arising from parallel code changes. ieee transactions on software engineering, 38:889–908.
shihab, e., bird, c., and zimmermann, t. (2012). the effect of branching strategies on software quality. in 12th acm/ieee international symposium on empirical software engineering and measurement (esem), pages 301–310.
singer, j., sim, s. e., and lethbridge, t. c. (2008). software engineering data collection for field studies, pages 9–34. springer london, london.
smith, e., loftin, r., murphy-hill, e., bird, c., and zimmermann, t. (2013). improving developer participation rates in surveys. in 2013 6th international workshop on cooperative and human aspects of software engineering (chase), pages 89–92.
spencer, d. (2009). card sorting: designing usable categories. rosenfeld media.
steglich, c., marczak, s., de souza, c. r., guerra, l. p., mosmann, l. h., figueira filho, f., and perin, m. (2019). social aspects and how they influence mseco developers. in 2019 ieee/acm 12th international workshop on cooperative and human aspects of software engineering (chase), pages 99–106.
vale, g., hunsen, c., figueiredo, e., and apel, s. (2021). challenges of resolving merge conflicts: a mining and survey study. ieee transactions on software engineering.
vale, g., schmid, a., santos, a. r., almeida, e. s. d., and apel, s. (2020). on the relation between github communication activity and merge conflicts. empirical software engineering, 25:402–433.
zimmermann, t. (2016). card-sorting: from text to themes. in perspectives on data science for software engineering, pages 137–141. elsevier.
zou, w., zhang, w., xia, x., holmes, r., and chen, z. (2019). branch use in practice: a large-scale empirical study of 2,923 projects on github. in 2019 ieee 19th international conference on software quality, reliability and security (qrs), pages 306–317. ieee.
journal of software engineering research and development, 2019, 4:1, doi: 10.5753/jserd.2020.719 this work is licensed under a creative commons attribution 4.0 international license.

editorial letter for cibse 2019 special edition

beatriz marin [ universidad diego portales, chile | beatriz.marin@mail.udp.cl ] isabel sofia brito [ instituto politécnico de beja, portugal | isabel.sofia@ipbeja.pt ]

this issue of the jserd contains seven extended and peer-reviewed papers from the xxii ibero-american conference on software engineering (cibse 2019), which was held in la habana, cuba, in april 2019. cibse was conceived as a space dedicated to the dissemination of research results and activities on software engineering in ibero-america. this conference aims to promote high-quality scientific research in ibero-american countries, supporting the researchers in this community in publishing and discussing their work. cibse is organized in three tracks: software engineering track (set), experimental software engineering latin american workshop (eselaw), and requirements engineering track (ret). cibse received 154 submissions, of which 60 papers were finally accepted. for this special issue, we selected the best papers from each track, which were extended and reviewed in two rounds. all papers were refereed by three well-known experts in the field. the selected papers are described as follows: the paper "supporting a hybrid composition of microservices: the eucaliptool platform", by pedro valderas, victoria torres, and vicente pelechano, presents a hybrid solution based on the choreography of business process pieces that are obtained from a previously defined description of the complete microservice composition. to support this solution, the eucaliptool platform is presented. the authors face the challenge of defining a hybrid solution to compose microservices that combines the benefits of the choreography and orchestration approaches. https://doi.org/10.5753/jserd.2020.457 the paper "requirements engineering base process for a quality model in cuba", by yoandy lazo alvarado, leanet tamayo oro, odannis enamorado pérez, and karine ramos, proposes a quality model for software development that contributes to raising the percentage of successful projects in cuban software development organizations, regarding the fulfillment of the agreed requirements. the solution proposal contains specific requirements and support elements (graphic and textual description of the process), divided by the three levels of maturity proposed by the model. the satisfaction of the final user was also measured by applying iadov techniques.
https://doi.org/10.5753/jserd.2020.459 the paper "towards a new template for the specification of requirements in semi-structured natural language", by raúl mazo, carlos andrés jaramillo, paola vallejo, and jhon harvey medina, addresses the problems in the specification of the requirements of a system by means of an adaptable and extensible template for specifying requirements of different domains (application systems, software product lines, cyber-physical systems, self-adapting systems). through the action research method, the authors observed that the reference template needed improvement and that it was possible to improve it. they also found that the new template could be used in industrial cases. https://doi.org/10.5753/jserd.2020.473 the paper "characterization of software testing practices: a replicated survey in costa rica", by christian quesada-lópez, erika hernandez-agüero, and marcelo jenkins, characterizes the state of the practice based on practitioners' use and perceived importance of software testing practices. to make a more in-depth analysis of the software testing practices among practitioners, the authors replicated a previous survey conducted in south america. this study shows the state of the practice in software testing in a thriving and very dynamic industry that currently employs most of the country's computer science professionals. the benefits are twofold: for academia, it provides a road map to revise the academic offer; for practitioners, it provides a first set of data to benchmark their practices. https://doi.org/10.5753/jserd.2019.472 in the paper "specifying the process model for systematic reviews: an augmented proposal", by pablo becker, luis olsina, denis peppino, and guido tebes, the proposed systematic literature review (slr) process considers with greater rigor the principles and benefits of process modeling, helping slrs to be more systematic, repeatable, and auditable for researchers and practitioners. the authors have documented the slr process specification by using process-modeling perspectives and mainly the spem language. it is a recommended flow for the slr process, since the authors are aware that in a process instantiation there might be some variation points, such as the parallelization of some tasks. https://doi.org/10.5753/jserd.2019.460 the paper "a revisited systematic literature mapping on the support of requirement patterns for the software development life cycle", by taciana n. kudo, renato f. bulcão-neto, alessandra a. macedo, and auri m. r. vincenzi, describes a revisited systematic literature mapping (slm) that identifies and analyzes research in order to demonstrate the benefits of using requirement patterns for software design, construction, testing, and maintenance. the slm protocol includes automatic search over two additional sources of information and the application of the snowballing technique, resulting in ten primary studies for analysis and synthesis. results indicate that there is still an open field for research that demonstrates, through empirical evaluation and usage in practice, the pertinence of requirement patterns in software design, construction, testing, and maintenance.
https://doi.org/10.5753/jserd.2019.458 the paper "the rocs framework to support the development of autonomous robots", by leonardo ramos, gabriel lisboa, guimarães divino, guilherme cano lopes, breno bernard nicolau de frança, leonardo montecchi, and esther luna colombini, addresses the need to organize and modularize software for the correct functioning of robotic systems, since the development of software for controlling robots is a complex and intricate task. based on the well-known ibm autonomic computing reference architecture (known as mape-k), this work defines a refined architecture following the robotics perspective. to explore the capabilities of the proposed refinement, the authors implemented the rocs (robotics and cognitive systems) framework for autonomous robots. https://doi.org/10.5753/jserd.2019.470 we would like to thank the authors, track chairs, and members of the program committee of each track at the conference for their effort and rigorous work in the review process, as well as the jserd editorial board for offering us the opportunity to prepare this special issue. enjoy the reading! beatriz marín isabel sofia brito

journal of software engineering research and development, 2019, 6:3, doi: 10.5753/jserd.2019.14 this work is licensed under a creative commons attribution 4.0 international license.

towards a more in-depth understanding of the iot paradigm and its challenges

rebeca campos motta [ universidade federal do rio de janeiro e lamih cnrs umr 8201 | rmotta@cos.ufrj.br ] valéria martins da silva [ universidade federal do rio de janeiro | vsilva@cos.ufrj.br ] guilherme horta travassos [ universidade federal do rio de janeiro | ght@cos.ufrj.br ]

abstract

the internet of things (iot) is a new technological paradigm that brings together the physical and virtual worlds to provide software systems everywhere through daily life objects. the iot can transform how we interact with the environment surrounding us, leading to a significant multidisciplinary technological shift. however, since it is a new field of research and development, there is a lack of consensus and understanding of its concepts and features, as we observed when engineering some software systems in the field. therefore, we performed investigations to characterize iot regarding its definition, characteristics, and applications, organizing the area and revealing its challenges and research opportunities, focusing on software engineering for the iot. a structured literature review of secondary studies supported the answering of three research questions: what is the "internet of things"? which characteristics can define an iot domain? which are the areas of iot application? the structured literature review led to 15 secondary studies, from which we recovered 34 definitions, discussed in light of the technical evolution, 29 characteristics, and several iot application areas. furthermore, the results include an iot characterization based on identification, sensing, and actuation capabilities, besides a discussion of the relation between iot and cyber-physical systems (cps) and other research areas and terms often associated with iot, aiming to bring clarification to the field. in this work, we offer an essential overview of the iot state-of-the-art and a characterization, presenting issues that should be addressed to contribute to its strengthening and establishment.
keywords: internet of things, systems engineering, evidence-based software engineering

1 introduction

the internet of things (iot) has emerged as a new paradigm in which software systems are no longer limited to computers, specific users' goals, and closed environments, but are spread across a great variety of different connected objects. the interaction between humans and the cyber-physical world is changing, since software can be deployed everywhere and in everything, such as cars, smartphones, clothes, and different environments (atzori, iera, and morabito, 2010; kraijak and tuwanut, 2016; datta et al., 2017; wortmann, combemale and barais, 2017; cicirelli et al., 2018), characterizing the iot domain and vision. it enables a pervasive interaction between connected things enhanced with identification, sensing, actuation, and processing capabilities, which enable them to interact with the environment. together with the benefits proposed by the iot paradigm, new challenges also arise. the constant evolution of the technology, application heterogeneity and diversity of devices, and other particularities such as a lack of division of roles, scale, and different lifecycle phases differentiate iot applications from traditional ones (patel and cassou, 2015). this can challenge the current software technologies to develop iot applications and to consolidate such a paradigm (skiba, 2013; zambonelli, 2016; larrucea et al., 2017). one of the recurrent difficulties regards the natural multidisciplinarity and novelty of iot. since iot is a modern paradigm, some fundamental points are still under discussion and involve converging topics of different research streams (motta, de oliveira, and travassos, 2018). in our previous research regarding ubiquitous (spínola, pinto and travassos, 2008; spínola and travassos, 2012) and context-aware software systems (matalonga, rodrigues and travassos, 2017; santos et al., 2017), we have identified some gaps and the need for software technologies that can also be observed in the iot domain. however, as a constant challenge in this area, the lack of a unified perception of iot, together with some experiences in engineering iot software systems, motivates this research as a starting point for further investigation and development activities at our research group. in this scenario, we performed a structured literature review of secondary studies on iot to understand the "internet of things" concept, as well as its characteristics and the application domains making use of it. therefore, this research aims to characterize the internet of things paradigm, considering the scenario of invisible and pervasive complex systems that support daily activities in the world. this review intends to answer the following questions: what is the "internet of things"? which characteristics can define an iot domain? which are the areas of iot application? the primary goal of this review is to strengthen the understanding of the iot paradigm, characterizing it based on its properties and identifying the current iot applications (the domains that are currently getting some benefit from iot) under the perspective of engineering iot software systems. we made this decision since the advancement of technologies makes society highly dependent on engineered software systems. we aim to discuss the software engineering scenario in the iot paradigm, with the results of this review being the first step of research towards understanding the engineering of iot software systems.
therefore, the intention is to promote a high-level discussion on the identified iot paradigm characteristics and give an overview of the area, aiming to promote a better perception of current development needs and opportunities. important works from the literature review supported our discussions and the answers to the research questions. there are many definitions of iot available in the technical literature, and even though they are different, they share similar points. from this diverse content, we needed to build our own understanding of the iot concept and of what the "things" represent in the iot context. besides the iot characterization, we discuss the relation among iot, cps, and other related terms to highlight some points that lead to considering some areas as building blocks for iot or, on the other hand, dependent on its evolution. the remainder of the paper is structured as follows. in the next section, the methodology is introduced, and we explain how it was applied in this study. then, in section 3, the results of the literature review are presented. these results are then further discussed in section 4, together with the validity threats. the main conclusions from the paper are summarized in section 5.

2 research methodology

the purpose of this literature review is to contribute to a more in-depth understanding of the internet of things and its challenges, identifying its definitions, characteristics, and the current areas of use.

2.1 review planning

before undertaking any literature review, it is essential to observe its necessity (budgen and brereton, 2006). therefore, we started with an ad-hoc search looking for any existent secondary studies on iot. considering the iot paradigm as a new motivating area for investigation, we decided to review the technical literature more systematically, adopting existing practices to compose our study plan. in our perspective, "secondary studies" are the studies which survey primary studies to present a bigger picture of a domain, the iot in this case. all secondary studies that meet the selection criteria should be included, even if they do not mention their research protocol. the research protocol followed the recommendations proposed by (budgen and brereton, 2006; de almeida biolchini et al., 2007) and, for the sake of space, has some of its details presented below. the research goal is gqm-based (basili, caldeira and rombach, 1994), defined as follows: to analyze the internet of things with the purpose of characterizing it regarding its definitions, characteristics, and application areas from the point of view of software engineering researchers in the context of knowledge previously organized and presented in secondary studies regarding iot available in the technical literature. from this goal, we defined the research questions (rq): (rq1) what is the "internet of things"? (rq2) which characteristics can define an iot domain? (rq3) which are the areas of iot application? with this goal, the secondary studies were searched according to the following information.

search strategy: the search strategy used scopus (https://www.scopus.com/) combined with snowballing procedures. scopus was chosen as the search engine since it indexes several databases of peer-reviewed sources, covering repositories such as ieee xplore (https://ieeexplore.ieee.org), for example, and favors the repeatability of the search results (matalonga, rodrigues and travassos, 2017; santos et al., 2017). in turn, backward and forward snowballing refers to using the reference list of cited papers or the citations to a paper to identify additional sources of data, complementing and extending the initial set of papers (wohlin, 2014). also, as far as our experience shows, the strategy of using scopus with snowballing procedures mitigates an eventual lack of content, avoids duplicated filtering work, and provides a representative set of papers for a characterization study such as this one (motta, oliveira, and travassos, 2016; motta, oliveira, and travassos, 2018).

search string: since the review focus is to retrieve information based on secondary studies, it was: title-abs-key (("*systematic literature review" or "systematic* review*" or "mapping study" or "systematic mapping" or "structured review" or "secondary study" or "literature survey" or "survey of technologies" or "driver technologies" or "review of survey*" or "technolog* review*" or "state of research") and ("internet of things" or "iot")).
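for illustration, a search string of this kind can also be executed programmatically. the sketch below assumes access to the scopus search rest api; the endpoint, parameter names, and response keys follow elsevier's public documentation but should be treated as assumptions and checked before use, the api key is a placeholder, and the query is an abridged form of the string above.

```python
# minimal sketch of running an (abridged) version of the search string
# against the scopus search api; endpoint, parameters, and response keys
# are assumptions based on elsevier's public rest api documentation.
import requests

QUERY = ('TITLE-ABS-KEY(("systematic* review*" OR "mapping study" '
         'OR "secondary study") AND ("internet of things" OR "iot"))')

resp = requests.get(
    "https://api.elsevier.com/content/search/scopus",
    params={"query": QUERY, "apiKey": "YOUR-API-KEY"},  # placeholder key
    headers={"Accept": "application/json"},
    timeout=30,
)
resp.raise_for_status()
# assumed response layout: search-results / opensearch:totalResults
total = resp.json()["search-results"]["opensearch:totalResults"]
print(f"retrieved records: {total}")
```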
selection criteria: works presented as articles shall be available on the web, retrieved from the search engine, and written in english. as the selection criteria we have:
- inclusion criteria: provide an iot definition, and provide iot properties or provide iot application areas.
- exclusion criteria: duplicate publication/self-plagiarism, or register of proceedings.

selection procedure: read the title and abstract of each retrieved study and evaluate it according to the inclusion and exclusion criteria. two distinct readers evaluated each secondary study. the acceptance of studies happened as follows (a sketch of these rules appears at the end of this subsection):
- both readers accept: the study is included.
- one reader accepts, and one is in doubt: the study is included.
- one reader accepts or is in doubt, and one reader excludes: the study is discussed.
- both readers exclude: the study is not included.

data extraction: data extraction aims to capture information from the selected articles to answer the proposed research questions. the data extraction form was proposed during the review planning and used throughout the process. the information was extracted as presented in table 1.
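the two-reader acceptance rules above form a small decision table, shown here as a minimal sketch; the vote labels are hypothetical, and the case of two readers in doubt, which the protocol does not specify, is routed to discussion as an assumption.

```python
# minimal sketch of the two-reader selection rules listed above.
def selection_decision(vote1, vote2):
    votes = {vote1, vote2}                # votes: "accept", "doubt", "exclude"
    if votes <= {"accept"}:
        return "included"                 # both readers accept
    if votes == {"accept", "doubt"}:
        return "included"                 # one accepts, one is in doubt
    if votes == {"exclude"}:
        return "not included"             # both readers exclude
    if "exclude" in votes:
        return "discussed"                # accept/doubt against exclude
    return "discussed"                    # both in doubt: unspecified, assumed discussed

print(selection_decision("accept", "doubt"))   # included
print(selection_decision("doubt", "exclude"))  # discussed
```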
2.2 review execution

the review process was executed according to the following steps:

- step 1, ad-hoc search. it is based on the researchers' experience, without any explicit or planned process in comparison to a systematic literature review. the primary objective of the ad-hoc search was to verify the need to carry out an initial literature review on the target topic and to identify control articles to guide the formulation of a search string for further searches. two researchers performed this step to identify the existence of any secondary study related to iot. the search perspective was established from the software engineering point of view for paper reading and analysis. since we identified secondary studies, we decided to review the existent articles instead of relying on primary studies. from the results of this ad-hoc search, three articles were selected as a starting point for the next step since they met the selection criteria (atzori, iera, and morabito, 2010; bandyopadhyay and sen, 2011; li, xu and zhao, 2015).
- step 2, scopus search. we organized the terms of the search string based on synonyms and similar terms. the search string was adjusted to recover the three articles which were previously selected. the total of items found was 76; the search was executed at the end of may 2017, considering the papers available in the database until this date.
- step 3, title and abstract reading. the list of 76 articles was reviewed to remove duplicates and proceedings, according to the selection criteria. the remaining articles were later read based on title and abstract and reviewed by a third researcher with more experience in the research area. 24 articles were selected for further reading, considering the title and abstract reading, following the criteria established in the research protocol.
- step 4, full reading. the two researchers read the full text of the 24 articles (12 for each, with cross-checking), considering the inclusion and exclusion criteria. seven of them met the criteria, being those finally selected.
- step 5, snowballing. it refers to using the reference list of an article or its citations to identify additional material (wohlin, 2014). in this step, we performed backward and forward snowballing sampling, tracking down references in the seven articles selected in the previous step and their citations. the total of articles was divided, and each researcher was responsible for performing the snowballing in part of the articles. nineteen articles were identified as candidates, and the reviewers cross-checked the articles to be included considering the selection criteria. this step resulted in the inclusion of five new articles.
- step 6, review update. the previous five steps were carried out between march and may 2017. the update was performed in december 2018 to cover new publications made available between 2017 and 2018. we re-executed the same string in scopus and analyzed the results following the criteria previously established. the three reviewers conducted the update, repeating steps 3 and 4 for the new scopus results and the forward snowballing (step 5) for the whole set. this step resulted in the inclusion of three new articles.

the review steps resulted in 15 articles, composing the final set: (atzori, iera and morabito, 2010; bandyopadhyay and sen, 2011; miorandi et al., 2012; gubbi et al., 2013; singh, tripathi and jara, 2014; borgia, 2014; whitmore, agarwal and da xu, 2015; li, xu and zhao, 2015; madakam, ramaswamy and tripathi, 2015; gil et al., 2016; sethi and sarangi, 2017; trappey et al., 2017; burhanuddin et al., 2017; ray, 2018; carcary et al., 2018). see the details of each step in table 2.

table 1 information extraction fields (field: description).
- reference information: authors, title, year, and venue.
- abstract: abstract.
- iot definition: verbatim, as presented in the article (definition research-based, derived, or with reference).
- iot related terms: how it is associated with other definitions (ubiquitous, context-aware, pervasive, machine-to-machine, and others).
- iot application features: characteristics of particular traits, features, properties, attributes that make iot what it is (that achieve the iot definition/concept).
- iot application areas: the areas (and their related applications) that will benefit from the full iot idea deployment.
- development strategies for iot: the development strategies used to build iot software (requirements analysis, design, and so on).
- type of study: it is expected to have only secondary studies, represented by survey, systematic literature review, and others.
- study properties: protocol, research questions, search string, selection criteria.
- challenges: open opportunities in practice or research.
- article focus: main concerns presented in the articles (architecture, security, and others).
- things: a list of the kinds of things explicitly stated in the article (coffeemaker, refrigerator, incubator, and others).
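for illustration, each selected article can be captured as one record following the fields of table 1; the sketch below is a hypothetical placeholder record, not an actual extraction from the review.

```python
# minimal sketch of one data extraction record following table 1;
# all field values are hypothetical illustrations.
record = {
    "reference_information": "author a. and author b., example title, 2016, venue",
    "abstract": "abstract text as published",
    "iot_definition": "verbatim definition quoted from the article",
    "iot_related_terms": ["ubiquitous", "machine-to-machine"],
    "iot_application_features": ["connectivity", "scalability"],
    "iot_application_areas": ["healthcare", "smart cities"],
    "development_strategies_for_iot": ["requirements analysis", "design"],
    "type_of_study": "literature review",
    "study_properties": {"protocol": None, "research_questions": []},
    "challenges": ["security", "interoperability"],
    "article_focus": ["architecture"],
    "things": ["rfid tag", "sensor"],
}
print(sorted(record))  # the extraction form fields, one per key
```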
table 2 total of articles selected at each step of the review (step: number of articles selected).
- step 1: (not reported)
- step 2: 76
- step 3: 24
- step 4: 7
- step 5: 5
- step 6: 3
- final set: 15

3 results

the dataset contains papers from 2010 to 2018. it is possible to observe a growing interest in the area over the years. the results show that most of the available publications in the technical literature were from 2015 to 2018, considering the period of the search. since it is a topic that has recently gained strength, both industry and research initiatives are still in the early stages. table 3 presents the study types considering the classification initially presented by the authors.

table 3 study types (type: studies).
- systematic literature review: (carcary et al., 2018)
- literature review: (atzori, iera and morabito, 2010; miorandi et al., 2012; singh, tripathi and jara, 2014; li, xu, and zhao, 2015; gil et al., 2016; burhanuddin et al., 2017; sethi and sarangi, 2017; ray, 2018)
- literature survey: (bandyopadhyay and sen, 2011; madakam, ramaswamy, and tripathi, 2015; whitmore, agarwal, and da xu, 2015; trappey et al., 2017)
- not defined: (gubbi et al., 2013; borgia, 2014)

despite being a current trend, our initial research did not return secondary studies conducted systematically; they neither presented the methodology followed nor the research questions that the papers intended to answer. the papers, except (whitmore, agarwal and da xu, 2015) and (carcary et al., 2018), do not present the research protocol or make explicit the study properties (research questions, search strings, databases, selection criteria, selected articles, among others). for this reason, we have not performed a quality assessment, since there is no methodology-related information to be evaluated. therefore, not performing the quality assessment represents a threat to this study's validity. from this result, it is possible to observe the need to provide research data based on a sound scientific methodology. despite the evolution and enthusiasm that new technology can provide with recent developments such as iot, the lack of scientific rigor is still one of the significant challenges to strengthening the basis of software engineering knowledge (de almeida biolchini et al., 2007). this work was conducted by following established guidelines and in a protocolled way, accounting for the strength of the evidence found and its replicability. the questions that this review seeks to answer are aligned with the objective of characterizing iot, and with this result we aim to contribute to strengthening the discussions and the evolution of the area.
from the selected papers, seven essential topics were addressed (figure 1):

- concepts: presenting discussions regarding the fundamentals, definitions, and visions behind the iot paradigm. articles: (atzori, iera and morabito, 2010; bandyopadhyay and sen, 2011; miorandi et al., 2012; gubbi et al., 2013; borgia, 2014; singh, tripathi and jara, 2014; li, xu and zhao, 2015; madakam, ramaswamy and tripathi, 2015; gil et al., 2016; trappey et al., 2017; carcary et al., 2018).
- technology: introducing enabling technologies and solutions to develop and deploy iot applications. articles: (atzori, iera and morabito, 2010; bandyopadhyay and sen, 2011; gubbi et al., 2013; borgia, 2014; whitmore, agarwal and da xu, 2015; li, xu and zhao, 2015; madakam, ramaswamy and tripathi, 2015; burhanuddin et al., 2017; sethi and sarangi, 2017; trappey et al., 2017; ray, 2018).
- applications: describing the current state of the existing solutions and the applications of different domains, as well as future possibilities to be achieved by using iot. articles: (atzori, iera, and morabito, 2010; bandyopadhyay and sen, 2011; gubbi et al., 2013; borgia, 2014; singh, tripathi, and jara, 2014; li, xu, and zhao, 2015; madakam, ramaswamy, and tripathi, 2015; whitmore, agarwal, and da xu, 2015; sethi and sarangi, 2017; trappey et al., 2017).
- open issues and challenges: presenting opportunities for research and development aiming to evolve iot. articles: (atzori, iera and morabito, 2010; bandyopadhyay and sen, 2011; miorandi et al., 2012; gubbi et al., 2013; borgia, 2014; singh, tripathi and jara, 2014; li, xu and zhao, 2015; whitmore, agarwal and da xu, 2015; burhanuddin et al., 2017; carcary et al., 2018).
- architecture: discussing possible implementations of iot based on different architecture proposals. articles: (bandyopadhyay and sen, 2011; singh, tripathi and jara, 2014; madakam, ramaswamy and tripathi, 2015; whitmore, agarwal and da xu, 2015; gil et al., 2016; sethi and sarangi, 2017; trappey et al., 2017; ray, 2018).
- characteristics: making explicit general features and requirements of iot. articles: (borgia, 2014; gil et al., 2016).
- initiatives: research organizations, industries, standardization bodies, and governments that have an interest or put some effort into iot. articles: (miorandi et al., 2012; gubbi et al., 2013; borgia, 2014; madakam, ramaswamy, and tripathi, 2015).

figure 1 most common topics in the articles.

3.1 studies overview

gil et al. (gil et al., 2016) reviewed surveys regarding iot, focusing mostly on the context-aware feature and how both topics are related. the main difference from our work is that they lack a research methodology, and their discussion revolves around the general purpose of the selected articles and context-aware iot. another work that contains an analysis of the trends and coverage of the iot literature is from whitmore et al. (whitmore, agarwal and da xu, 2015). it presents an overview of the area; however, it is not concerned with answering research questions or with describing open questions and future directions to assist researchers. it differs from our work, which concerns the characterization of iot regarding its definition and characteristics. numerous iot definitions exist in the technical literature due to different visions from the research community.
some authors (miorandi et al., 2012; gubbi et al., 2013) discuss iot as an overall vision, while (atzori, iera and morabito, 2010; bandyopadhyay and sen, 2011; borgia, 2014; singh, tripathi and jara, 2014) describe iot as realizable through particular visions or pillars. the conceptualization of iot is the focus of (miorandi et al., 2012; gubbi et al., 2013; madakam, ramaswamy, and tripathi, 2015). other topics are also presented, such as a taxonomy for iot (gubbi et al., 2013; sethi and sarangi, 2017) and iot patents (trappey et al., 2017). the works of (burhanuddin et al., 2017; ray, 2018) focus on the critical discussion of architectural issues and options to deal with the immense number of interconnected devices proposed in iot. besides, they also describe fundamental requirements along with implementation challenges and future directions. the work of (carcary et al., 2018) argues that the adoption of iot is not yet widespread and examines the existing literature on key determinants (drivers, benefits, barriers, and challenges) that influence the adoption of iot by organizations. it is important to highlight that, of the 15 selected secondary studies, none covers all the topics, showing that the researchers have distinct perspectives and concerns. however, together these studies provide a wealth of information for our research topic. the application of a sound research protocol in this work provides an improvement over the previous ones, since some do not make clear the procedures performed. besides, we offer a research protocol that can be replicated. in this work, we further improve the current state because we not only quantitatively point out the results but also provide discussions and answers to research questions grounded in data. we also would like to highlight that one can value the findings and discussions in this article since we are relying on secondary studies; in this case, the several primary studies reported in these 15 secondary studies support them with evidence.

3.2 answering the research questions

we based our analysis procedure on textual analysis, using codes to assign concepts to portions of data, identifying patterns from the similarities and differences emerging from the extracted data. two researchers conducted it, with cross-checking to achieve consensus in the analysis and to decrease potential misinterpretation and bias. a third researcher reviewed the extractions and findings. this process was performed on all the data extracted and led to the discussions of the proposed research questions, presented in the following subsections.

3.3 rq1: what is the "internet of things"?

the 15 selected papers supported the extraction of 34 different iot definitions. from the analysis of these 34 definitions, we noticed that they follow a specific pattern in their structure, concerned with explaining the involved actors, the requirements, and the consequences of the relations among actors as part of a system, although not all of these elements are present in every definition. we considered this structure not to limit our interpretation, but to support a more thorough conceptual understanding of iot and thus find an appropriate and updated definition for this work. we organized some of the definitions found in chronological order to observe how the concept has evolved.

"an intelligent infrastructure linking objects, information, and people through the computer networks, and where the rfid technology found the basis for realization." defined in 2001 by (brock, 2001), cited by (borgia, 2014).
in this 2001 definition, we can observe that the idea is to connect objects, information, and people, where both objects and people can be actors in the system. it makes clear the necessity of a network as a way to connect the actors, and the realization was limited to the rfid identification technology (finkenzeller, 2010), which represents the starting point of the iot vision.

"internet of things as a paradigm in which computing and networking capabilities are embedded in any conceivable object. we use these capabilities to query the state of the object and to change its state if possible." defined in 2005 by (itu, 2005), cited by (sethi and sarangi, 2017).

this definition from 2005 does not propose the use of any technology, like rfid, but includes the idea of expanding the original capabilities of an object through technology to perceive changes in the object's state; this is only possible by addressing objects first, making them identifiable. once that is achieved, things can communicate automatically (dunkels and vasseur, 2008). it can be considered an evolution since this kind of requirement was not previously discussed. the next definition addresses the idea:

"a world where things can automatically communicate to computers and each other providing services to the benefit of the humankind." defined in 2008 by (dunkels and vasseur, 2008), cited by (atzori, iera, and morabito, 2010; gil et al., 2016).

another definition is:

"a dynamic global network infrastructure with self-configuring capabilities based on standard and interoperable communication protocols where physical and virtual 'things' have identities, physical attributes, virtual personalities and use intelligent interfaces, and are seamlessly integrated into the information network" defined in 2009 by (gusmeroli, sundmaeker and bassi, 2015), cited by (borgia, 2014; whitmore, agarwal and da xu, 2015).

in this 2009 definition, we can see that the central concept of communication and integration remains. it leads to an effort to make things identifiable (in the network sense, not physically) and introduces requirements such as interoperability and seamless integration. this definition also details what the things in iot are: things can be virtual or physical, can have different personalities, and may use different communication protocols.

"the basic idea of this concept is the pervasive presence around us of a variety of things or objects such as radiofrequency identification (rfid) tags, sensors, actuators, mobile phones, etc. which, through unique addressing schemes, are able to interact with each other and cooperate with their neighbors to reach common goals." defined in 2010 by (atzori, iera, and morabito, 2010), cited by (miorandi et al., 2012; gubbi et al., 2013; singh, tripathi and jara, 2014).

this iot definition from 2010 is one of the most used. it can be considered broader regarding the "actors, relations among actors, requirements and what enables" structure. it presents the vast number and heterogeneity of actors that can engage in an interaction, and a requirement to achieve that through unique addressing schemes. in this case, new actors are included, and we can observe that sensing and actuation are other possible behaviors that a system can possess, differing from the initial definitions. therefore, these actors can cooperate to reach some goals.
"interconnection of sensing and actuating devices providing the ability to share information across platforms through a unified framework, developing a common operating picture for enabling innovative applications. this is achieved by seamless large-scale sensing, data analytics and information representation using cutting-edge ubiquitous sensing and cloud computing." defined in 2012 by (gubbi et al., 2013).

once more, sensing and actuation have essential roles in iot, as presented in this definition from 2012. the vast amount of data collection and sharing among actors can be a source to compose diversified, innovative applications. it makes clear the multidisciplinary nature of iot, as the integration of different disciplines for the accomplishment of successful iot systems, since there are areas that support or leverage it, such as data analytics, ubiquitous computing, and cloud computing.

"everyday objects can be equipped with identifying, sensing, networking and processing capabilities that will allow them to communicate with one another and with other devices and services over the internet to achieve some useful objective (…). every day 'things' will be equipped with tracking and sensing capabilities. when this vision is fully actualized, 'things' will also contain more sophisticated processing and networking capabilities that will enable these smart objects to understand their environments and interact with people." defined in 2015 by (whitmore, agarwal and da xu, 2015).

once everyday things can sense the environment, they become more aware of what is around them, which characterizes context-awareness. in this 2015 definition, we see again that the primary concern in iot is to leverage the connection among different things to achieve a system objective. also, the authors explain that things in the iot context are those objects equipped with identifying, sensing, networking, and processing capabilities, whereas other definitions exemplify things as being the providers of such capabilities, that is, tags, sensors, and actuators. in our understanding, things exist in the physical realm, such as sensors, actuators, and anything that is equipped with identification (tag reading), sensing, or actuation capabilities, which excludes entities in the internet domain (hosts, terminals, routers, among others). the things should also have communication, networking, and processing functionalities, varying according to the systems' requirements. as one can notice, the capabilities of the things evolved over time, as observed from the definitions presented and the examples in figure 2.

figure 2 iot evolution.

as things evolved, the understanding and discussions should also follow the changes. in the beginning, the things in iot-based systems were objects attached to electronic tags, so these systems presented the behavior of identification. subsequently, sensors and actuators composing the systems enabled the sensing and actuation behaviors, respectively. it means that an iot system may have identification, sensing, or actuation behaviors, or a combination of them. the explanation of each behavior and examples of applications can be seen in figure 3 and table 4.

figure 3 iot behaviors.

when discussing the previous definitions, it was necessary to distinguish the meaning of "identification" as referred to objects.
the reason is that an object can be identifiable in the sense of connectivity (e.g., through ip addresses) or in the sense of physical identification, when objects are tagged with electronic tags containing specific information, making it possible to identify objects through tag readers. further, it is also relevant to elucidate the meaning of "actuation", as it may bring diverse interpretations. when focusing on the iot context, the adequate meaning for "actuation" is precisely the one presented in table 4. it is divergent from actions represented by methods in the object-oriented paradigm, and it is not related to the objects' processing capabilities mentioned in the iot definition discussed previously. actuation is exclusively related to the possibility of virtually intervening in the real world by mechanical means. it is important to note this distinction in iot systems due to their capabilities, since it is possible to have different compositions of systems. in an industrial plant, for example, identification tags are attached to products and provide real-time location and status. dashboards with the data recovered from products and machines (from a sensing activity) keep managers updated along the production line, and the company is now able to monitor and control production almost automatically (actuation), including processing capabilities. it is a real-case scenario already deployed where the three behaviors and benefits of iot can be seen, such as providing more process visibility, more accurate work, and improved production effectiveness (cisco, 2014). it is interesting to structure the characteristics and applications retrieved in this review within these three behaviors because an iot system does not necessarily have to present all of them, but only one or a combination of them. it can clarify and delimit iot solutions, contributing as a guide for engineering their applications. to answer rq1 from the review results, iot can be defined as a paradigm that allows composing systems from uniquely addressable objects (things) equipped with identifying, sensing, or actuation behaviors and processing capabilities that can communicate and cooperate to reach a goal.

table 4 iot behaviors (behavior: description; example).
- identification: the primary function is to identify things, by labeling and enabling them to have an identity, then recover (through reading) and broadcast information related to the thing and its state. example: identifying patients with electronic tags (rfid) to be detected throughout hospitals using receivers (readers) placed in departments to accelerate the identification of empty beds (kannry et al., 2007). another example is the application of short-range identification technology for drug interaction and drug-allergy detection (alabdulhafith, sampangi, and sampalli, 2013). it operates by identifying patients (nfc tags integrated into their wristbands) and drugs (nfc tags integrated), each tag holding a unique id. nurses read the patient's and the drug's nfc tags using the smartphone's nfc reader. finally, the server verifies whether the patient is allergic to the drug or if there might be a potential interaction.
- sensing: the primary function is to sense environment information, requiring information aggregation, data processing, and transmission. it enables awareness, thus acting as a bridge between the physical and the digital world. example: to illustrate the capability of sensors in the real world, one interesting application is from the geophysics area, where sensors such as microphones and seismometers have been deployed for long-distance volcanic monitoring, collecting seismic and acoustic data on volcanic activity (werner-allen et al., 2006).
- actuation: mechanical interventions in the real world according to decisions based on aggregated data or even upon an actor's direct trigger; it relies on responses to the collected information to perform actions in the physical world and change the object's state. example: the control of things, robots, or even animals in the real world, as in (wark et al., 2007), where actuators are used in an attempt to prevent fighting between bulls in on-farm breeding paddocks by autonomously triggering stimuli such as audio warning signals or mild electric shocks when one bull approaches another.
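to make the composition idea in the rq1 answer concrete, the sketch below models uniquely addressable things whose behaviors can be combined within one system; the classes and the example devices are hypothetical illustrations, not elements of the reviewed studies.

```python
# minimal sketch of the three iot behaviors: a system may exhibit
# identification, sensing, actuation, or any combination of them.
from dataclasses import dataclass

@dataclass
class Thing:
    uid: str                    # uniquely addressable, as the definition requires
    identifies: bool = False    # identification behavior (e.g., rfid/nfc tag)
    senses: bool = False        # sensing behavior (e.g., seismometer)
    actuates: bool = False      # actuation behavior (e.g., valve, stimulus device)

def behaviors(things):
    present = set()
    for t in things:
        if t.identifies:
            present.add("identification")
        if t.senses:
            present.add("sensing")
        if t.actuates:
            present.add("actuation")
    return present

plant = [Thing("tag-01", identifies=True),    # product tag on the line
         Thing("sensor-07", senses=True),     # machine telemetry
         Thing("valve-02", actuates=True)]    # automatic control
print(behaviors(plant))  # all three behaviors combined, as in the plant example
```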
sensing the primary function is to sense environment information, requiring information aggregation, data processing, and transmission. enables awareness, thus acting as a bridge between the physical and digital world. to illustrate the capability of the sensor in the real world, one interesting application is from the geophysics area. sensors have been deployed for long-distance volcanic monitoring, such as microphones and seismometers, collecting seismic and acoustic data on volcanic activity (werner-allen et al., 2006). actuation mechanical interventions in the real world according to decisions based on aggregated data or even upon actors’ right trigger; relay on responses to the collected information to perform actions in the physical world and change the object state. an example is the control of things, robots or even animals in the real world as in (wark et al., 2007), where actuators are used in an attempt to prevent fighting between bulls in on-farm breeding paddocks by autonomously triggering stimuli such as audio warning signals or mild electrical when one bull approaches another. towards a more in-depth understanding of the iot paradigm and its challenges motta et al. 2019 characteristics presented in the articles or referred to the original work defining them table 6. table 5 iot characteristics. characteristics # all characteristics identified 29 characteristics not defined 20 characteristics defined 9 the lack of definitions hinders the research and understanding of the area since we cannot know the feature´s meaning or what the authors meant by that. although some characteristics such as interoperability and scalability are well defined, it is essential to establish a common understanding of the characteristics since they inspire different concepts when contextualized to distinct domains. table 6 characteristics not defined. characteristic cited by reference accuracy (borgia, 2014; burhanuddin et al., 2017) adaptability (atzori, iera, and morabito 2010; miorandi et al. 2012; gubbi et al. 2013; borgia 2014; li, xu, and zhao 2015; ray, 2018) (nami and sharifi 2007; sampigethaya, hackmann, et al. 2008; poovendran, and bushnell 2008; lee and sokolsky 2010; azimi et al. 2011; barro-torres et al. 2012; hur and kang 2012) availability (atzori, iera and morabito, 2010; bandyopadhyay and sen, 2011; gubbi et al., 2013; li, xu and zhao, 2015; madakam, ramaswamy and tripathi, 2015) (gluhak et al., 2011) connectivity (atzori, iera, and morabito 2010; bandyopadhyay and sen 2011; gubbi et al. 2013; whitmore, agarwal, and da xu 2015; gil et al. 2016; burhanuddin et al., 2017; ray, 2018; carcary et al., 2018) (weiser et al. 1999; infso d.4 2008; conti 2006; dunkels and vasseur 2008; vermesan et al. 2009) efficiency (atzori, iera, and morabito 2010; bandyopadhyay and sen 2011; miorandi et al. 2012; gubbi et al. 2013; borgia 2014; li, xu, and zhao 2015; madakam, ramaswamy, and tripathi 2015; sethi and sarangi 2017; trappey et al. 2017; burhanuddin et al., 2017) ( hackmann et al. 2008; sampigethaya, poovendran, and bushnell 2008; lee and sokolsky 2010; azimi et al. 2011; hur and kang 2012; barro-torres et al. 2012) extensibility (bandyopadhyay and sen, 2011; li, xu and zhao, 2015) flexibility (li, xu, and zhao 2015; sethi and sarangi 2017) manageability (bandyopadhyay and sen, 2011; borgia, 2014) modularity (bandyopadhyay and sen, 2011) performance (gubbi et al., 2013; li, xu and zhao, 2015) privacy (atzori, iera, and morabito 2010; bandyopadhyay and sen 2011; miorandi et al. 
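to make the composition of these behaviors concrete, the minimal sketch below models identification, sensing, and actuation as small python classes and composes them into one cooperating system, mirroring the definition above; all names (Tag, TemperatureSensor, Fan, IoTSystem) and the threshold rule are illustrative assumptions, not taken from the surveyed papers.

```python
# minimal sketch of the three iot behaviors (identification, sensing,
# actuation) composed into one system; names and values are illustrative.
from dataclasses import dataclass


@dataclass
class Tag:                       # identification: a thing carries a unique id
    unique_id: str
    info: dict


class TemperatureSensor:         # sensing: bridge from physical to digital
    def read(self) -> float:
        return 26.0              # stub; a real device would sample hardware


class Fan:                       # actuation: mechanical intervention
    def set_speed(self, level: int) -> None:
        print(f"fan speed set to {level}")


class IoTSystem:
    """composes uniquely addressable things that cooperate to reach a goal."""

    def __init__(self, tag: Tag, sensor: TemperatureSensor, fan: Fan):
        self.tag, self.sensor, self.fan = tag, sensor, fan

    def step(self) -> None:
        temperature = self.sensor.read()     # sensing behavior
        if temperature > 25.0:               # decision on collected data
            self.fan.set_speed(3)            # actuation behavior


system = IoTSystem(Tag("room-42", {"location": "lab"}), TemperatureSensor(), Fan())
system.step()
```

an iot system exhibiting only identification would keep just the tag; the point of the sketch is that each behavior can be present alone or combined, as discussed above.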
3.4 rq2: which characteristics can define an iot domain?
the 15 papers provided 263 excerpts, which were coded following the principles of open coding, as described in grounded theory (strauss and corbin, 1990), from which we identified 29 characteristics (table 5). one point of discussion is that the authors do not define all the characteristics presented in the articles, or they refer to the original work defining them (table 6).
table 5 iot characteristics (characteristic | #).
all characteristics identified | 29
characteristics not defined | 20
characteristics defined | 9
the lack of definitions hinders the research and understanding of the area, since we cannot know the characteristic's meaning or what the authors meant by it. although some characteristics, such as interoperability and scalability, are well defined, it is essential to establish a common understanding of the characteristics, since they inspire different concepts when contextualized to distinct domains.
table 6 characteristics not defined (characteristic | cited by | reference; "-" when no reference is given).
accuracy | (borgia, 2014; burhanuddin et al., 2017) | -
adaptability | (atzori, iera, and morabito 2010; miorandi et al. 2012; gubbi et al. 2013; borgia 2014; li, xu, and zhao 2015; ray, 2018) | (nami and sharifi 2007; hackmann et al. 2008; sampigethaya, poovendran, and bushnell 2008; lee and sokolsky 2010; azimi et al. 2011; barro-torres et al. 2012; hur and kang 2012)
availability | (atzori, iera and morabito, 2010; bandyopadhyay and sen, 2011; gubbi et al., 2013; li, xu and zhao, 2015; madakam, ramaswamy and tripathi, 2015) | (gluhak et al., 2011)
connectivity | (atzori, iera, and morabito 2010; bandyopadhyay and sen 2011; gubbi et al. 2013; whitmore, agarwal, and da xu 2015; gil et al. 2016; burhanuddin et al., 2017; ray, 2018; carcary et al., 2018) | (weiser et al. 1999; infso d.4 2008; conti 2006; dunkels and vasseur 2008; vermesan et al. 2009)
efficiency | (atzori, iera, and morabito 2010; bandyopadhyay and sen 2011; miorandi et al. 2012; gubbi et al. 2013; borgia 2014; li, xu, and zhao 2015; madakam, ramaswamy, and tripathi 2015; sethi and sarangi 2017; trappey et al. 2017; burhanuddin et al., 2017) | (hackmann et al. 2008; sampigethaya, poovendran, and bushnell 2008; lee and sokolsky 2010; azimi et al. 2011; hur and kang 2012; barro-torres et al. 2012)
extensibility | (bandyopadhyay and sen, 2011; li, xu and zhao, 2015) | -
flexibility | (li, xu, and zhao 2015; sethi and sarangi 2017) | -
manageability | (bandyopadhyay and sen, 2011; borgia, 2014) | -
modularity | (bandyopadhyay and sen, 2011) | -
performance | (gubbi et al., 2013; li, xu and zhao, 2015) | -
privacy | (atzori, iera, and morabito 2010; bandyopadhyay and sen 2011; miorandi et al. 2012; gubbi et al. 2013; borgia 2014; li, xu, and zhao 2015; sethi and sarangi 2017; whitmore, agarwal, and da xu 2015) | (xianrong zheng et al., 2014a)
reliability | (atzori, iera, and morabito 2010; bandyopadhyay and sen 2011; miorandi et al. 2012; gubbi et al. 2013; borgia 2014; sethi and sarangi 2017) | (koren and krishna 2007; hackmann et al. 2008; lee and sokolsky 2010; azimi et al. 2011; hur and kang 2012; barro-torres et al. 2012)
robustness | (atzori, iera and morabito, 2010; miorandi et al., 2012) | (koren and krishna, 2007)
scalability | (atzori, iera, and morabito 2010; bandyopadhyay and sen 2011; miorandi et al. 2012; gubbi et al. 2013; borgia 2014; li, xu, and zhao 2015; madakam, ramaswamy, and tripathi 2015; sethi and sarangi 2017; burhanuddin et al., 2017) | (gluhak et al., 2011)
smartness | (li, xu and zhao, 2015; ray, 2018) | -
sustainability | (borgia, 2014) | -
traceability | (atzori, iera and morabito, 2010) | -
trust | (atzori, iera, and morabito 2010; bandyopadhyay and sen 2011; borgia 2014; li, xu, and zhao 2015; sethi and sarangi 2017) | -
ubiquity | (carcary et al., 2018) | -
visibility | (atzori, iera and morabito, 2010) | -
for instance, “efficiency” is open to many interpretations even when the iot domain is in focus: it can be related to an object's data-collection efficiency, energy efficiency, security efficiency, information-processing efficiency, as well as service-adaptability efficiency. this makes it challenging to characterize iot and to develop more suitable solutions that meet all the desired characteristics, since they were not defined, only listed. for the same reason, it is not possible to infer that the authors are discussing the same issues, such as efficiency, which from the sources can be regarding cost, size, resources or energy. even with this lack of definition, the characteristics pointed out in table 5 are relevant for the characterization scenario of iot systems. in table 6, we present the characteristics pointed out by the authors (cited by) and the original references used by them (reference); some references may have been used by more than one author, and "-" marks the cases with no reference. this distinction matters because we can give more weight to the characteristics referenced by others, since more sources strengthen the results. to continue with our research, we consider only the characteristics whose definitions were made explicit (table 7); these definitions came from our interpretation of the original material and the compilation of the references cited.
table 7 defined characteristics (characteristic and definition | cited by | reference).
addressability: the ability to distinguish objects using unique ids. | (atzori, iera, and morabito 2010; bandyopadhyay and sen 2011; miorandi et al. 2012; borgia 2014) | -
unique id: unique identification is necessary for every physical object. once the object is identified, it is possible to enhance it with personalities and other information and enable control over it. | (atzori, iera, and morabito 2010; bandyopadhyay and sen 2011; miorandi et al. 2012; gubbi et al. 2013; borgia 2014; li, xu, and zhao 2015; burhanuddin et al., 2017; ray, 2018) | (atzori, iera, and morabito 2010; finkenzeller 2010; gubbi et al. 2013)
object autonomy: smart objects can have individual autonomy, not needing direct human interaction to perform established actions, while reacting to or being influenced by real/physical world events. | (atzori, iera and morabito, 2010; gubbi et al., 2013; madakam, ramaswamy and tripathi, 2015) | -
mobility: object availability across different locations. | (atzori, iera and morabito, 2010; bandyopadhyay and sen, 2011; borgia, 2014; sethi and sarangi, 2017) | (akyildiz, jiang xie and mohanty, 2004; sharma, gusain and kumar, 2013)
autonomy: refers to systems not needing direct human intervention to perform established actions such as data capture, autonomous behavior, and reaction. | (atzori, iera, and morabito 2010; miorandi et al. 2012; gubbi et al. 2013; borgia 2014; li, xu, and zhao 2015; burhanuddin et al., 2017; ray, 2018; carcary et al., 2018) | (chlamtac, conti, and liu 2003; nami and sharifi 2007; gusmeroli, sundmaeker, and bassi 2015)
context-awareness: the use of context to provide task-relevant information and/or services to a user. | (atzori, iera, and morabito 2010; miorandi et al. 2012; gubbi et al. 2013; borgia 2014; whitmore, agarwal, and da xu 2015; sethi and sarangi 2017; ray, 2018) | (abowd et al. 1999; schmidt and van laerhoven 2001; nami and sharifi 2007; o'reilly and pahlka 2009; perera et al. 2014)
heterogeneity: several services taking part in the system, which present very different capabilities from the computational and communication standpoints. | (atzori, iera, and morabito 2010; bandyopadhyay and sen 2011; miorandi et al. 2012; gubbi et al. 2013; borgia 2014; li, xu, and zhao 2015; burhanuddin et al., 2017; carcary et al., 2018) | (infso d.4 2008; gluhak et al. 2011; nuzzo and sangiovanni-vincentelli 2014)
interoperability: interoperability is of three types: network interoperability deals with communication protocols; syntactic interoperability ensures the conversion of different formats and structures; semantic interoperability deals with abstracting the meaning of data within a domain. | (atzori, iera, and morabito 2010; bandyopadhyay and sen 2011; miorandi et al. 2012; gubbi et al. 2013; borgia 2014; li, xu, and zhao 2015; madakam, ramaswamy, and tripathi 2015; sethi and sarangi 2017; burhanuddin et al., 2017; ray, 2018) | (panetto and cecil 2013; jardim-goncalves et al. 2013; chengen wang, zhuming bi, and li da xu 2014; borgia 2014)
security: to ensure the security of data, services and the entire iot system, a series of properties, such as confidentiality, integrity, authentication, authorization, non-repudiation, availability, and privacy, must be guaranteed. | (atzori, iera, and morabito 2010; bandyopadhyay and sen 2011; miorandi et al. 2012; gubbi et al. 2013; borgia 2014; li, xu, and zhao 2015; madakam, ramaswamy, and tripathi 2015; whitmore, agarwal, and da xu 2015; sethi and sarangi 2017; burhanuddin et al., 2017) | (sampigethaya, poovendran, and bushnell 2008; lee and sokolsky 2010; andreini et al. 2010; andreini et al. 2011; azimi et al. 2011; barro-torres et al. 2012; hur and kang 2012; cirani, ferrari, and veltri 2013; xianrong zheng et al. 2014b; chasaki and mansour 2015)
from the characteristics presented in table 7, we can observe that some of them are fundamental for an application to fulfill our iot definition: “a paradigm that allows composing systems from uniquely addressable objects equipped with identifying, sensing or actuation behaviors and processing capabilities that are able to communicate and cooperate to reach a goal”.
addressability, unique id, heterogeneity, interoperability, mobility, and security are the essential characteristics necessary for an application to follow the iot paradigm. from this primary setting, an iot-based software system can be engineered with identification, sensing and/or actuation capabilities. each one of them requires new characteristics. for instance, context-awareness is required to enable the sensing behavior, and autonomy is needed in the actuation behavior. table 7 represents an initial set of iot characteristics as defined in the technical literature. we wish to perform more extensive research on the characterization of the three behaviors, since new characteristics specific to each of the iot applications may also be required. having a clearer and well-defined set of characteristics can aid the development of applications with higher quality and support quality assurance and assessment.
3.5 rq3: which are the areas of iot application?
several application domains will leverage the advantages of the internet of things paradigm. all the application domains are only examples of areas that benefit from iot or are expected to do so in the future. as declared in whitmore et al., “the domain of the application areas for the iot is limited only by imagination at this point” (whitmore, agarwal and da xu, 2015). although the application scenarios were described at different levels of detail, we tried to categorize some of them into the three behaviors (table 4), as presented in table 8. atzori et al. (atzori, iera, and morabito, 2010) describe five domains: (a) transportation and logistics, (b) healthcare, (c) smart environment (home, office, plant), (d) personal/social and (e) futuristic domain (whose applications are still too complicated to implement). gubbi et al. (gubbi et al., 2013) describe (a) personal and home, (b) enterprise, (c) utilities, and (d) mobile domains. also, there is a classification of the applications into consumer (home, lifestyle, healthcare, transport) and business (manufacturing, retail, energy, transportation, agriculture, and others) (trappey et al., 2017). those domain categorizations can be seen as subparts of a categorization that grouped the applications into three major domains (borgia, 2014): (a) industrial domain, (b) smart city domain, and (c) health well-being domain. they are not isolated from each other; there is a partial overlap, since some applications are shared across the contexts. for example, the tracking of products can be a demand for both the industrial and health well-being domains.
table 8 application type (behavior | application type).
identification | touristic maps equipped with tags that allow nfc-equipped phones to browse them and automatically call web services, materials tracking to prevent left-ins during surgery (atzori, iera, and morabito, 2010); patient triage, resource management and distribution (gubbi et al., 2013); medical equipment tracking, secure access and indoor environment management, personnel tracking, bike/car/van sharing, mobile tickets, luggage management, animal tracking, fast payment, warehouse management and inventory, identification of materials and goods (borgia, 2014); verifying the authenticity of aircraft, storing health records (bandyopadhyay and sen, 2011).
sensing | patient monitoring, remote personnel monitoring (health, location), sensors built into building infrastructure to guide first responders in emergencies or disaster scenarios or to monitor structural fatigue and other maintenance, sensing of water quality, leakage, usage and distribution, air pollution and noise monitoring, support to diagnoses, video/radar/satellite surveillance, road condition monitoring, product deterioration (borgia, 2014); monitoring chronic disease using wearable vital-signs sensors in body sensor networks (bandyopadhyay and sen, 2011).
actuation | room lighting changing, alarm systems, remote switching off of electrical equipment (atzori, iera, and morabito, 2010), temperature and humidity control (gubbi et al., 2013), irrigation control (borgia, 2014), muscle stimuli for paraplegic individuals (bandyopadhyay and sen, 2011).
hybrid | buildings adjusting locally to conditions while also taking into account outdoor conditions, robot taxis that respond to real-time traffic movements of the city and are calibrated to reduce congestion at bottlenecks and to service the pick-up areas that are most frequently used (atzori, iera and morabito, 2010), water waste management (gubbi et al., 2013), parking systems, traffic management (borgia, 2014).
4 discussion
4.1 the things in iot
alongside the application areas, we also extracted the things, as we are interested in recovering which objects are currently in use under the iot paradigm. in many cases, the authors listed usage possibilities and existing solutions based on iot. forty-one different things were extracted, and figure 4 shows the ten most cited ones.
figure 4 most common things in iot.
these are everyday objects enhanced with identification, sensing and actuation capabilities. for example, sensors attached to vehicles can collect information about the roads (e.g., about traffic density or surface conditions), reporting back to the city center; and, from thing-thing interaction, a vehicle can communicate with another, enabling smart parking and faster communication of traffic problems. extracting information on things from already deployed iot applications has helped our research group to better grasp the innovative potential of this paradigm. also, the results on the real use of things and the examples of applications (such as those described in table 8) might be a contribution for practitioners working on innovative problem-solving projects, as a source of possibilities for stimulating thinking and creativity and for expanding initial ideas. the three well-established behaviors (identification, sensing, and actuation) can support different usage scenarios varying according to the kind of objects used, the data to be collected, business requirements and users' needs. for instance, a door lock with the “acting” behavior can open/close different sorts of doors in different scenarios according to rules, e.g., from authentication by electronic tag reading, eye or finger scanning, human/animal/robot proximity sensing and many other possibilities (a toy sketch of such rules follows below). even though an iot solution is taken as a massive amount of various connected objects of our everyday life, the three behaviors highlighted in this work are precisely the common basis among iot objects.
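the sketch below makes the door-lock example concrete: the same actuation behavior serves different scenarios depending on the triggering rule. event kinds, tag ids and the distance threshold are hypothetical, invented only for illustration.

```python
# illustrative rule-based actuation for the door-lock example: the lock's
# single "acting" behavior is driven by different identification and
# sensing triggers; all names and rules here are hypothetical.
AUTHORIZED_TAGS = {"badge-017", "badge-112"}


def should_open(event: dict) -> bool:
    """decide whether the lock actuates, based on the triggering rule."""
    if event["kind"] == "tag_read":          # authentication by tag reading
        return event["tag_id"] in AUTHORIZED_TAGS
    if event["kind"] == "proximity":         # proximity-sensing trigger
        return event["distance_m"] < 0.5 and event["subject"] == "robot"
    return False


print(should_open({"kind": "tag_read", "tag_id": "badge-017"}))                    # True
print(should_open({"kind": "proximity", "distance_m": 2.0, "subject": "human"}))   # False
```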
identifying and elucidating this common property is another contribution for practitioners, who can consider these three behaviors and the issues related to them when idealizing, engineering and developing iot-based systems.
4.2 iot related terms
internet of things sometimes sounds like a buzzword, so some terms seem to be synonyms or even “aliases” (madakam, ramaswamy and tripathi, 2015). however, not every term can be used interchangeably with it. from the analysis and interpretation, we categorized the related terms as presented in table 9. all the data extracted and other details can be found in the research protocol (https://goo.gl/ctyzut).
- related technology: technologies related to iot, supporting its development.
- related areas: other research areas that are frequently associated with iot because they share some similarities or are considered iot drivers.
by looking at the related terms, we argue that the iot paradigm proposal is to enable a connected world, believing that different research areas can also be enablers in a joint effort for research, development, and evolution. also, there are areas which need further research to deal with the challenges of this novel paradigm. from our understanding, iot is an umbrella combining the advances of many areas, and we discuss the points that make those areas connected to iot or the convergence points that make some topics sound like iot synonyms. the definition of the terms is out of the scope of this discussion. from table 9, we discuss only the related areas.
table 9 iot related terms (category | terms).
related technology | cloud computing, internet protocol, communication middleware, rfid, universal identifier architecture, wireless sensor networks.
related areas | ambient intelligence, context-aware systems, cyber-physical systems, human-computer interaction, industry 4.0, internet of computers, internet of objects, internet of people, intranet/extranet of things, machine-to-machine interaction, micro-electro-mechanical systems, network of things, pervasive computing, social iot, ubiquitous computing, web of things.
- ambient intelligence: a developing technology that will increasingly make our everyday environment more sensitive and responsive (madakam, ramaswamy and tripathi, 2015). according to (miorandi et al., 2012), iot may well inherit concepts and lessons learned from ambient intelligence, taking ambient intelligence to a larger scale.
- context-aware systems: considering our understanding of things as those equipped with identification and sensing capabilities, they are the bridge from the physical to the virtual realm. from identification technologies such as rfid, it is possible to get the identity and location of entities. sensors enable sensing environment information such as sound, temperature, humidity, among others (atzori, iera, and morabito, 2010). in our interpretation, these capabilities of things in iot make the field related to context-awareness, because from sensors and tag reading the environment and entities' context information can be perceived (not explicitly input to the system). then such context information can be used to provide task-relevant information and/or services to a user (abowd et al., 1999).
even though context-awareness is considered an essential aspect of iot (sethi and sarangi, 2017), it does not mean that any iot system is context-aware, unless the information gathered is used as a relevant resource for decision-making and for dynamically taking actions, such as system customization.
- cyber-physical systems (cps): cloud computing, wireless sensor networks (wsn), m2m, iot, and others are all fields that collaborate somehow to reach the broad goal of cps, that is, “to bring the cyber-world of computing and communications together with the physical world” (rajkumar et al. 2010; madakam, ramaswamy, and tripathi 2015). according to (miorandi et al., 2012), “a cyber-physical infrastructure is the result of the embedding of electronics into everyday physical objects, making them 'smart' and letting them integrate seamlessly within the global (…)”. as discussed previously, we understand that wsns are enablers for m2m and, consequently, for iot. m2m systems are the precursors of cps, as devices allow the bridge between the physical and virtual world; in the same manner, m2m is the basis for the internet of things. it leads us to interpret that iot is a form of realizing cps, which is consistent with (chen, 2012), who proposes that “cps is an evolution of m2m by the introduction of more intelligent and interactive operations, under the architecture of internet of things (iot)”.
- human-computer interaction: hci is an area that needs further research to deal with this novel iot context, where human intervention is low or even absent. it usually involves the study, planning, and design of the interaction between people and computers (madakam, ramaswamy and tripathi, 2015).
- industry 4.0: iot is described as a critical enabler for industry 4.0 (trappey et al., 2017). iot has been deployed in factories and production environments, making them more intelligent and leading toward the fourth industrial revolution.
- internet of computers: mentioned not as a synonym of iot but as an orthogonal term (gil et al., 2016). in their description, the internet of computers comprises traditional internet environments, where the leading data producers and consumers are human beings (not things).
- internet of objects: considering some of the iot definitions found in the technical literature, we can interpret “objects” and things as equivalent. for instance, “iot implies that objects in an iot can be identified uniquely in the virtual representations” (li, xu, and zhao, 2015). in addition, “[iot is] the pervasive presence around us of a variety of things or objects – such as radio-frequency identification (rfid) tags, sensors, actuators, mobile phones, etc.” (wan et al., 2013) and “a worldwide network of interconnected objects uniquely addressable, based on standard communication protocols” (atzori, iera, and morabito 2010; bandyopadhyay and sen 2011; gil et al. 2016).
- internet of people: the internet of things is not synonymous with the internet of people, as mentioned by borgia (borgia, 2014), but the author does not elaborate on that. for this reason, we searched for works addressing this subject and could not find any consensus. nevertheless, (miranda et al., 2015) explain that iot technology needs people-centric enhancements to achieve the more desirable iot scenarios, that is, scenarios which consider people's context, learning from it, reasoning and taking actions proactively. therefore, achieving those desired scenarios requires moving from the internet of things to the internet of people (iop).
some essential features of iop systems are: being social, personalized, proactive, and predictable.
- intranet/extranet of things: intranet/extranet of things and iot are not synonymous (borgia, 2014). however, as far as we know, they share a broad concept; the difference is that in an intranet/extranet the connections are restricted to limited areas, while on the internet the connections are publicly accessible.
- machine-to-machine interaction: m2m means no human intervention while devices are communicating end-to-end (madakam, ramaswamy and tripathi, 2015). it leads us to think that m2m and iot are similar, but m2m is rather a paradigm leading towards iot (atzori, iera, and morabito, 2010). m2m refers to technologies that allow both wireless and wired systems to communicate with other devices of the same ability (wan et al., 2013). unlike devices in iot, devices in m2m are meant to operate in a specific application, which means that m2m solutions do not allow the broad sharing of data or the open connection of devices to the internet (holler et al., 2014).
- micro-electro-mechanical systems: mems technology is one of the enablers for developing miniature devices capable of sensing, computing and communicating (gubbi et al., 2013). when connected, these miniature devices form a wireless sensor network and, consequently, are crucial building blocks for developing machine-to-machine systems, iot, among others.
- network of things: the network of things is similar to an intranet/extranet regarding connection restrictions. it refers to networks operating in a restricted locale, within a work environment, like an enterprise-based application. only the owners use the information collected from such networks, and the data may be released selectively (gubbi et al., 2013).
- pervasive or ubiquitous computing: these two terms are intimately connected, and some authors have used them interchangeably (satyanarayanan 2001; baldauf, dustdar, and rosenberg 2007; spínola, pinto, and travassos 2008). our interpretation of the relation between iot and ubicomp is that iot projects can be considered ubiquitous according to their adherence to ubiquity characteristics (spínola and travassos 2012). such characteristics are context-sensitivity, adaptable behavior, service omnipresence, heterogeneity of devices, experience capture, spontaneous interoperability, scalability, privacy and trust, fault tolerance, quality of service, and universal usability (spínola and travassos 2012). that is, ubiquity becomes a transversal property of iot systems as they fulfill ubiquity characteristics.
- social iot: the term social iot (siot) is mentioned as a newly proposed paradigm (atzori, iera, and morabito 2010; li, xu, and zhao 2015; gil et al. 2016; sethi and sarangi 2017). siot means that the things are now seen as “beings,” and the interconnections among them are compared to human social relations. the authors describe three main facets of a siot system: (i) the siot is navigable; (ii) a need for trustworthiness (relationship strength) is present between devices; and (iii) the models used to study human social networks are similar to those of social networks of iot devices.
- web of things: it refers to the re-use of web standards to connect and integrate iot objects into the web (atzori, iera and morabito, 2010; bandyopadhyay and sen, 2011; borgia, 2014; madakam, ramaswamy and tripathi, 2015).
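as a minimal illustration of the web-of-things idea above (reusing plain web standards to expose a thing), the sketch below serves a thing's state as json over http using only python's standard library; the port, identifier and payload are illustrative assumptions.

```python
# minimal web-of-things sketch: a thing's state exposed through plain web
# standards (http + json); the id, reading and port are illustrative.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

THING_STATE = {"id": "sensor-7", "temperature_c": 21.5}


class ThingHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # any web client (browser, curl, another thing) can now read the state
        body = json.dumps(THING_STATE).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)


if __name__ == "__main__":
    HTTPServer(("localhost", 8080), ThingHandler).serve_forever()
```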
it is possible to observe that the evolution of some areas and the collaboration among them enable the realization of the iot paradigm. once it is possible to have small devices, embedded intelligence, seamless communication, thing-thing interaction, wireless connections, and others, all of these become iot-enabling technologies. this discussion of terms related to the iot paradigm might be a contribution for further investigations, which might depend on grounded concepts and clarity about the convergence points that make other topics seem like iot synonyms. in addition, practitioners and researchers can benefit from this discussion in circumstances where there are doubts on whether iot is indeed the right term to consider for their software projects and/or future investigations.
4.3 iot challenges
to foster our discussions and research directions, one type of information extracted from the selected articles was challenges, which we understand as open opportunities in industry or academia. the data extracted were analyzed based on grounded theory procedures (strauss and corbin, 1990). the process started by retrieving the excerpts related to iot challenges (an excerpt could be a word, a phrase or a full paragraph). the 15 papers provided 38 excerpts regarding iot challenges, which were organized into seven categories (table 10). we used codes to assign concepts to portions of data, with a constant comparative analysis to identify patterns from the similarities and differences emerging from the data (a toy illustration of this coding step is sketched at the end of this section). this textual analysis was conducted by two researchers, with cross-checking to achieve consensus. the excerpts were organized into the categories, and we present each category with a definition and an example excerpt to support its comprehension. it is interesting to notice that the concerns are usually interrelated, confirming the multidisciplinary nature of iot. for example: “for technology to disappear from the consciousness of the user, the internet of things demands software architectures and pervasive communication networks to process and convey the contextual information to where it is relevant” (gubbi et al., 2013); this excerpt is coded both as an architectural issue and as a network one. another example is “central issues are making full interoperability of interconnected devices possible, providing them with an always higher degree of smartness by enabling their adaptation and autonomous behavior, while guaranteeing trust, privacy, and security” (atzori, iera and morabito, 2010), which was coded both for interoperability and security issues. providing solutions to the issues presented in the technical literature can be tricky due to the diversity of concerns, the variety of devices and the uncertainties in the area. from the findings recovered in this review, our research perspective will be directed to support the proposed definition: iot is a paradigm that allows composing systems from uniquely addressable objects (things) equipped with identifying, sensing or actuation behaviors and processing capabilities that can communicate and cooperate to reach a goal. our focus will be on the perspective of the software orchestration necessary for the composition of systems that will arise in this contemporary paradigm. despite our decision to direct the research, the article may contribute to other areas of research by providing the definitions, characteristics, and challenges presented in this section.
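the toy sketch below illustrates the coding step just described: excerpts receive one or more codes and are then grouped into categories. the two shortened excerpts echo the examples quoted in the text; the code names and data structure are ours, for illustration only.

```python
# toy illustration of the open-coding step: each excerpt is assigned one
# or more codes, then excerpts are grouped by code into categories.
excerpt_codes = {
    "demands software architectures and pervasive communication networks":
        ["architecture", "network"],
    "making full interoperability possible while guaranteeing trust, privacy, and security":
        ["interoperability", "security"],
}

categories: dict[str, list[str]] = {}
for excerpt, codes in excerpt_codes.items():
    for code in codes:                 # an excerpt may support more than one category
        categories.setdefault(code, []).append(excerpt)

for category, excerpts in categories.items():
    print(f"{category}: {len(excerpts)} excerpt(s)")
```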
4.4 threats to validity since only scopus was used as a search engine, it may be missing some relevant studies, but from our experience, we know that it can give a reasonable coverage when performing together with snowballing procedures (backward and forward) (matalonga, rodrigues, and travassos 2015; motta, oliveira, and travassos 2016). in addition, a recurrent issue in literature reviews regards inconsistent terminology and restrictive keywords. we searched for other reviews and observed the terms used to compose our search string to reduce the researchers’ bias. data extraction and interpretation biases were mitigated with crosschecking between two researchers and by having a third researcher to revise the results. all phases of this review were peer-reviewed; any doubt was discussed among the readers, to reduce selection bias. we have not performed a quality assessment regarding the research methodology of the selected studies due to the lack of information in the secondary reports. it is a threat to this study validity. towards a more in-depth understanding of the iot paradigm and its challenges motta et al. 2019 5 conclusion this work presented the research on the iot paradigm, detailing the activities performed for the literature review, and analyzing the findings and discussions to answer the following research questions: (rq1) what is “internet of things”? (rq2) which characteristics can define an iot domain? (rq3) which are the areas of iot application? as the iot concept is currently under discussion, there are still significant issues regarding its understanding that need to be clarified and established. one contribution of this work is to present an organized perspective regarding the current state-of-the-art regarding the iot paradigm. besides, it allows observing which areas of application are making use of iot (rq3). all of these findings were related and summarized to enrich the iot paradigm comprehension. from the discussion of rq1, we understand that iot is a paradigm allowing the composition of software systems from uniquely addressable objects equipped with identifying, sensing or actuation behaviors and processing capabilities that can communicate and cooperate to reach a goal. the idea of composing software systems from available components is not new, but one of the issues that set iot apart is the scale at which it can be achieved and the actors involved in these new software systems. from this, shared concerns regarding the development and evaluation of such software systems should be reframed to cover the particularities of these new types of devices. a critical step towards it is to establish what quality characteristics should be contemplated. with the second research question, we moved forward in this direction. regarding the iot characteristics (rq2), from the technical literature, we recovered 29 different attributes, from which this paper discussed nine of them with clear evidence from the sources of information. considering that the results retrieved are from secondary studies, the characteristics represented reflect more than just the 15 secondary studies, but rather the whole set of primary studies involved in them which can strengthen these results. the most commonly cited characteristics presented are efficiency, interoperability, scalability, privacy, and security that reassure the definition reached in the paper. this work is the first step towards future investigations focusing on aspects such as software development and quality control of iot. 
apart from that, the grounded concepts, properties and terms related to the iot paradigm can be a contribution to any future related research. besides, the identification and discussion of already deployed applications and of the three behaviors of things can help practitioners in the processes of idealizing, engineering and developing iot software systems. at last, it is expected that the knowledge organized and presented in this paper can contribute to stimulating discussions and future investigations on providing software technologies to promote the engineering of high-quality iot software systems.
table 10 iot challenges (category and definition | example).
architecture: issues and concerns regarding design decisions, styles and the structure of iot systems. | “finding a scalable, flexible, secure and cost-efficient architecture, able to cope with the complex iot scenario, is one of the main goals for the iot adoption.” (borgia, 2014).
data: the management of a significant amount of data, and how to recover, represent, store, interconnect, search, and organize data generated by iot from so many different users and devices. | “this new field offers many research challenges, but the main goal of this line of research is to make sense of data in any iot environment. it has been pointed out that it is always much easier to create data than to analyze them. with this in mind, new conceptual modeling, as well as new paradigms of data mining techniques will be crucial to provide value and meaning to initially empty data.” (gil et al. 2016).
interoperability: the challenge of making different systems, software, and things interact for a purpose; standards and protocols are also included as issues. | “the end goal is to have plug n' play smart objects which can be deployed in any environment with an interoperable backbone allowing them to blend with other smart objects around them.” (gubbi et al., 2013).
management: the application of management activities, such as planning, monitoring and controlling, in iot systems that raise the interaction of different things. | “from the viewpoint of the network, iot is a very complex heterogeneous network, which includes the connections among various types of networks through various communication technologies. the devices and methodologies for addressing things management is still a challenge.” (li, xu and zhao, 2015).
network: technical challenges related to communication technologies, routing, access and addressing schemes considering the different characteristics of the devices. | “designing an appropriate topology, routing, and mac layer is critical for scalability and longevity of the deployed network” (gubbi et al., 2013).
security: issues related to several aspects of ensuring data security in iot systems; for that, a series of properties, such as confidentiality, integrity, authentication, authorization, non-repudiation, availability, and privacy, should be investigated. | “security issues are central in iot as they may occur at various levels, investing technology as well as ethical and privacy issues [...] this is extremely challenging due to the iot characteristics.” (borgia, 2014).
social: concerns related to the human end-user, to understand the situation of the users and their appliances. | “for a lay person to fully benefit from the iot revolution, attractive and easy to understand visualization have to be created.” (gubbi et al., 2013).
6 declarations
abbreviations: iot: internet of things; cps: cyber-physical systems; rfid: radio-frequency identification; mems: micro-electro-mechanical systems; m2m: machine-to-machine; hci: human-computer interaction.
availability of data and materials: details of the protocol are available at https://goo.gl/ctyzut.
authors' contributions: we present a review supported by established guidelines that aims to contribute to the iot field with awareness and understanding of its concepts and features, and a characterization regarding its definition, characteristics, and applications. we answer the research questions characterizing the area, present challenges and opportunities, and offer an essential overview of the internet of things state of the art, presenting issues that should be addressed to contribute to its strengthening and establishment.
acknowledgments: the authors thank cnpq and capes for supporting this research.
funding: prof. travassos is a cnpq researcher (grant 305929/2014-3). this study was financed in part by the coordenação de aperfeiçoamento de pessoal de nível superior - brasil (capes) - finance code 001.
competing interests: the authors declare that they have no competing interests.
consent for participation and publication: not applicable.
7 references
abowd, g. d. et al. (1999) 'towards a better understanding of context and context-awareness', in computing systems, pp. 304–307. doi: 10.1007/3-540-48157-5_29.
akyildiz, i. f., jiang xie and mohanty, s. (2004) 'a survey of mobility management in next-generation all-ip-based wireless systems,' ieee wireless communications, 11(4), pp. 16–28. doi: 10.1109/mwc.2004.1325888.
alabdulhafith, m., sampangi, r. v. and sampalli, s. (2013) 'nfc-enabled smartphone application for drug interaction and drug allergy detection,' in 2013 5th international workshop on near field communication (nfc). ieee, pp. 1–6. doi: 10.1109/nfc.2013.6482450.
de almeida biolchini, j. c. et al. (2007) 'scientific research ontology to support systematic review in software engineering,' advanced engineering informatics, 21(2), pp. 133–151. doi: 10.1016/j.aei.2006.11.006.
andreini, f. et al. (2010) 'context-aware location in the internet of things,' in 2010 ieee globecom workshops. ieee, pp. 300–304. doi: 10.1109/glocomw.2010.5700330.
andreini, f. et al. (2011) 'a scalable architecture for geolocalized service access in smart cities,' in 2011 future network & mobile summit, pp. 1–8.
atzori, l., iera, a. and morabito, g. (2010) 'the internet of things: a survey,' computer networks. elsevier b.v., 54(15), pp. 2787–2805. doi: 10.1016/j.comnet.2010.05.010.
azimi, s. r. et al. (2011) 'vehicular networks for collision avoidance at intersections', sae international journal of passenger cars - mechanical systems, 4(1), paper 2011-01-0573. doi: 10.4271/2011-01-0573.
baldauf, m., dustdar, s. and rosenberg, f. (2007) 'a survey on context-aware systems,' international journal of ad hoc and ubiquitous computing, 2(4), p. 263. doi: 10.1504/ijahuc.2007.014070.
bandyopadhyay, d. and sen, j. (2011) 'internet of things: applications and challenges in technology and standardization', wireless personal communications, 58(1), pp. 49–69. doi: 10.1007/s11277-011-0288-5.
barro-torres, s. et al. (2012) 'real-time personal protective equipment monitoring system,' computer communications, 36(1), pp. 42–50. doi: 10.1016/j.comcom.2012.01.005.
basili, v. r., caldeira, g. and rombach, h. d. (1994) 'goal question metric paradigm.'
borgia, e. (2014) 'the internet of things vision: key features, applications, and open issues,' computer communications. elsevier b.v., 54, pp. 1–31. doi: 10.1016/j.comcom.2014.09.008.
brock, d. l. (2001) 'integrating the electronic product code (epc) and the global trade item number (gtin),' mit auto-id center, (february 1), pp. 1–25.
budgen, d. and brereton, p. (2006) 'performing systematic literature reviews in software engineering,' in proceedings of the 28th international conference on software engineering - icse '06. new york, new york, usa: acm press, p. 1051. doi: 10.1145/1134285.1134500.
burhanuddin, m. a. et al. (2017) 'internet of things architecture: current challenges and future direction of research,' international journal of applied engineering research, 12(21), pp. 11055–11061.
carcary, m. et al. (2018) 'exploring the determinants of iot adoption: findings from a systematic literature review,' in zdravkovic, j. et al. (eds) ceur workshop proceedings. cham: springer international publishing (lecture notes in business information processing), pp. 113–125. doi: 10.1007/978-3-319-99951-7_8.
chasaki, d. and mansour, c. (2015) 'security challenges in the internet of things,' international journal of space-based and situated computing, 5(3), p. 141. doi: 10.1504/ijssc.2015.070945.
chen, m. (2012) 'machine-to-machine communications: architectures, standards, and applications,' ksii transactions on internet and information systems, 6(2), pp. 480–497. doi: 10.3837/tiis.2012.02.002.
chengen wang, zhuming bi and li da xu (2014) 'iot and cloud computing in automation of assembly modeling systems,' ieee transactions on industrial informatics. ieee, 10(2), pp. 1426–1434. doi: 10.1109/tii.2014.2300346.
chlamtac, i., conti, m. and liu, j. j. n. (2003) 'mobile ad hoc networking: imperatives and challenges,' ad hoc networks, 1(1), pp. 13–64. doi: 10.1016/s1570-8705(03)00013-1.
cicirelli, f. et al. (2018) 'a metamodel framework for edge-based smart environments', in 2018 ieee international conference on cloud engineering (ic2e). ieee, pp. 286–291. doi: 10.1109/ic2e.2018.00067.
cirani, s., ferrari, g. and veltri, l. (2013) 'enforcing security mechanisms in the ip-based internet of things: an algorithmic overview,' algorithms. multidisciplinary digital publishing institute, 6(2), pp. 197–226. doi: 10.3390/a6020197.
cisco (2014) leading tools manufacturer transforms operations with iot. available at: http://www.cisco.com/c/dam/en_us/solutions/industries/docs/manufacturing/c36-732293-00-stanley-cs.pdf.
datta, s. k. et al. (2017) 'vehicles as connected resources: opportunities and challenges for the future,' ieee vehicular technology magazine, 12(2), pp. 26–35. doi: 10.1109/mvt.2017.2670859.
dunkels, a. and vasseur, j. (2008) the internet of things: ip for smart objects, ipso alliance white paper.
finkenzeller, k. (2010) rfid handbook: fundamentals and applications in contactless smart cards, radio frequency identification, and near-field communication. nj: wiley.
gil, d. et al. (2016) 'internet of things: a review of surveys based on context-aware intelligent services,' sensors, 16(7), p. 1069. doi: 10.3390/s16071069.
gluhak, a. et al. (2011) 'a survey on facilities for experimental internet of things research,' ieee communications magazine, 49(11), pp. 58–67. doi: 10.1109/mcom.2011.6069710.
gubbi, j. et al. (2013) 'internet of things (iot): a vision, architectural elements, and future directions,' future generation computer systems, 29(7), pp. 1645–1660. doi: 10.1016/j.future.2013.01.010.
gusmeroli, s., sundmaeker, h. and bassi, a. (2015) 'internet of things strategic research roadmap,' the cluster of european research projects, tech. rep., pp. 9–52.
hackmann, g. et al. (2008) 'a holistic approach to decentralized structural damage localization using wireless sensor networks,' in 2008 real-time systems symposium. ieee, pp. 35–46. doi: 10.1109/rtss.2008.40.
holler, j. et al. (2014) from machine-to-machine to the internet of things. elsevier. doi: 10.1016/c2012-0-03263-2.
hur, j. and kang, k. (2012) 'dependable and secure computing in medical information systems,' computer communications. elsevier b.v., 36(1), pp. 20–28. doi: 10.1016/j.comcom.2012.01.006.
ieee (2004) guide to the software engineering body of knowledge, ieee. ieee computer society press. available at: http://www.computer.org/portal/web/swebok.
infso d.4 (2008) 'networked enterprise and rfid; infso g.2 micro and nanosystems', co-operation with the working group rfid of the etp eposs, internet of things in 2020, roadmap for the future, version 1.1.
itu (2005) itu internet report 2005: the internet of things.
jardim-goncalves, r. et al. (2013) 'systematisation of interoperability body of knowledge: the foundation for enterprise interoperability as a science,' enterprise information systems. taylor & francis, 7(1), pp. 7–32. doi: 10.1080/17517575.2012.684401.
kannry, j. et al. (2007) 'small-scale testing of rfid in a hospital setting: rfid as bed trigger,' amia annual symposium proceedings, pp. 384–388. available at: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2813671&tool=pmcentrez&rendertype=abstract.
koren, i. and krishna, c. m. (2007) 'fault-tolerant systems'. elsevier. available at: https://ebookcentral.proquest.com/lib/feupebooks/reader.action?docid=294597&query=.
kraijak, s. and tuwanut, p. (2016) 'a survey on the internet of things architecture, protocols, possible applications, security, privacy, real-world implementation and future trends,' international conference on communication technology proceedings, icct, 2016-february, pp. 26–31. doi: 10.1109/icct.2015.7399787.
larrucea, x. et al. (2017) 'software engineering for the internet of things,' ieee software, 34(1), pp. 24–28. doi: 10.1109/ms.2017.28.
lee, i. and sokolsky, o. (2010) 'medical cyber-physical systems,' in proceedings of the 47th design automation conference on dac '10. new york, new york, usa: acm press, p. 743. doi: 10.1145/1837274.1837463.
li, s., xu, l. da and zhao, s. (2015) 'the internet of things: a survey,' information systems frontiers, 17(2), pp. 243–259. doi: 10.1007/s10796-014-9492-7.
madakam, s., ramaswamy, r. and tripathi, s. (2015) 'internet of things (iot): a literature review,' journal of computer and communications, 3(5), pp. 164–173. doi: 10.4236/jcc.2015.35021.
matalonga, s., rodrigues, f. and travassos, g. (2015) 'challenges in testing context-aware software systems,' in 9th workshop on systematic and automated software testing 2015. belo horizonte, brazil, pp. 51–60.
matalonga, s., rodrigues, f. and travassos, g. h. (2017) 'characterizing testing methods for context-aware software systems: results from a quasi-systematic literature review,' journal of systems and software. elsevier inc., 131, pp. 1–21. doi: 10.1016/j.jss.2017.05.048.
miorandi, d. et al. (2012) 'internet of things: vision, applications and research challenges,' ad hoc networks. elsevier b.v., 10(7), pp. 1497–1516. doi: 10.1016/j.adhoc.2012.02.016.
miranda, j. et al. (2015) 'from the internet of things to the internet of people', ieee internet computing, 19(2), pp. 40–47. doi: 10.1109/mic.2015.24.
motta, r. c., oliveira, k. m. de and travassos, g. h. (2016) 'characterizing interoperability in context-aware software systems,' in 2016 vi brazilian symposium on computing systems engineering (sbesc). ieee, pp. 203–208. doi: 10.1109/sbesc.2016.039.
motta, r. c., de oliveira, k. m. and travassos, g. h. (2018) 'on challenges in engineering iot software systems,' in proceedings of the xxxii brazilian symposium on software engineering - sbes '18. new york, new york, usa: acm press, pp. 42–51. doi: 10.1145/3266237.3266263.
nami, m. r. and sharifi, m. (2007) 'a survey of autonomic computing systems,' in intelligent information processing iii. boston, ma: springer us, pp. 101–110. doi: 10.1007/978-0-387-44641-7_11.
nuzzo, p. and sangiovanni-vincentelli, a. (2014) 'let's get physical: computer science meets systems,' in from programs to systems. the systems perspective in computing. springer, pp. 193–208. doi: 10.1007/978-3-642-54848-2_13.
o'reilly, t. and pahlka, j. (2009) 'the web squared era,' forbes, september 2009.
panetto, h. and cecil, j. (2013) 'information systems for enterprise integration, interoperability, and networking: theory and applications,' enterprise information systems. taylor & francis, 7(1), pp. 1–6. doi: 10.1080/17517575.2012.684802.
patel, p. and cassou, d. (2015) 'enabling high-level application development for the internet of things', journal of systems and software. elsevier ltd., 103, pp. 62–84. doi: 10.1016/j.jss.2015.01.027.
perera, c. et al. (2014) 'context-aware computing for the internet of things: a survey,' ieee communications surveys & tutorials, 16(1), pp. 414–454. doi: 10.1109/surv.2013.042313.00197.
rajkumar, r. et al. (2010) 'cyber-physical systems,' in proceedings of the 47th design automation conference on dac '10. new york, new york, usa: acm press, p. 731. doi: 10.1145/1837274.1837461.
ray, p. p. (2018) 'a survey on internet of things architectures,' journal of king saud university - computer and information sciences. king saud university, 30(3), pp. 291–319. doi: 10.1016/j.jksuci.2016.10.003.
sampigethaya, k., poovendran, r. and bushnell, l. (2008) 'secure operation, control, and maintenance of future e-enabled airplanes,' proceedings of the ieee, 96(12), pp. 1992–2007. doi: 10.1109/jproc.2008.2006123.
santos, i. de s. et al. (2017) 'test case design for context-aware applications: are we there yet?', information and software technology. elsevier b.v., 88, pp. 1–16. doi: 10.1016/j.infsof.2017.03.008.
satyanarayanan, m. (2001) 'pervasive computing: vision and challenges,' ieee personal communications, 8(4), pp. 10–17. doi: 10.1109/98.943998.
schmidt, a. and van laerhoven, k. (2001) 'how to build smart appliances?', ieee personal communications. ieee, 8(4), pp. 66–71. doi: 10.1109/98.944006.
sethi, p. and sarangi, s. r. (2017) 'internet of things: architectures, protocols, and applications', journal of electrical and computer engineering, 2017. doi: 10.1155/2017/9324035.
sharma, v., gusain, p. and kumar, p. (2013) 'near field communication,' setlabs briefings, 2013(cac2s), pp. 342–345.
singh, d., tripathi, g. and jara, a. j. (2014) 'a survey of internet-of-things: future vision, architecture, challenges and services,' 2014 ieee world forum on internet of things, wf-iot 2014, pp. 287–292. doi: 10.1109/wf-iot.2014.6803174.
skiba, d. j. (2013) 'the internet of things (iot),' nursing education perspectives, 34(1), pp. 63–64. doi: 10.5480/1536-5026-34.1.63.
spínola, r. o., pinto, f. c. r. and travassos, g. h. (2008) 'supporting requirements definition and quality assurance in ubiquitous software project,' in communications in computer and information science, pp. 587–603. doi: 10.1007/978-3-540-88479-8_42.
spínola, r. o. and travassos, g. h. (2012) 'towards a framework to characterize ubiquitous software projects,' information and software technology, 54(7), pp. 759–785. doi: 10.1016/j.infsof.2012.01.009.
strauss, a. and corbin, j. (1990) basics of qualitative research: techniques and procedures for developing grounded theory. newbury park: sage publications, inc.
trappey, a. j. c. et al. (2017) 'a review of essential standards and patent landscapes for the internet of things: a key enabler for industry 4.0', advanced engineering informatics. elsevier ltd, 33, pp. 208–229. doi: 10.1016/j.aei.2016.11.007.
vermesan, o. et al. (2009) 'towards the web of things: web mashups for embedded devices', workshop on mashups, enterprise mashups and lightweight composition on the web (mem 2009), pp. 1–8.
wan, j. et al. (2013) 'from machine-to-machine communications towards cyber-physical systems,' computer science and information systems, 10(3), pp. 1105–1128. doi: 10.2298/csis120326018w.
wark, t. et al. (2007) 'the design and evaluation of a mobile sensor/actuator network for autonomous animal control,' in 2007 6th international symposium on information processing in sensor networks. ieee, pp. 206–215. doi: 10.1109/ipsn.2007.4379680.
weiser, m. et al. (1999) 'the origins of ubiquitous computing research at parc,' ibm systems journal, 38(4), pp. 693–696. doi: 10.1147/sj.384.0693.
werner-allen, g. et al. (2006) 'deploying a wireless sensor network on an active volcano,' ieee internet computing, 10(2), pp. 18–25. doi: 10.1109/mic.2006.26.
whitmore, a., agarwal, a. and da xu, l. (2015) 'the internet of things—a survey of topics and trends,' information systems frontiers, 17(2), pp. 261–274. doi: 10.1007/s10796-014-9489-2.
wohlin, c. (2014) 'guidelines for snowballing in systematic literature studies and a replication in software engineering,' proceedings of the 18th international conference on evaluation and assessment in software engineering - ease '14, pp. 1–10. doi: 10.1145/2601248.2601268.
wortmann, a., combemale, b. and barais, o. (2017) 'a systematic mapping study on modeling for industry 4.0', in 2017 acm/ieee 20th international conference on model driven engineering languages and systems (models). ieee, pp. 281–291. doi: 10.1109/models.2017.14.
xianrong zheng et al. (2014a) 'cloud service negotiation in internet of things environment: a mixed approach,' ieee transactions on industrial informatics, 10(2), pp. 1506–1515. doi: 10.1109/tii.2014.2305641.
xianrong zheng et al. (2014b) 'cloudqual: a quality model for cloud services,' ieee transactions on industrial informatics. ieee, 10(2), pp. 1527–1536. doi: 10.1109/tii.2014.2306329.
zambonelli, f. (2016) 'towards a general software engineering methodology for the internet of things.' available at: http://arxiv.org/abs/1601.05569.
journal of software engineering research and development, 2022, 10:10, doi: 10.5753/jserd.2022.2554. this work is licensed under a creative commons attribution 4.0 international license.
on the use of uml in the brazilian industry: a survey
ed wilson júnior [ universidade do vale do rio dos sinos | edwjr7@edu.unisinos.br ]
kleinner farias [ universidade do vale do rio dos sinos | kleinnerfarias@unisinos.br ]
bruno da silva [ california polytechnic state university | bcdasilv@calpoly.edu ]
abstract
over the past decade, uml modeling has been used in the industry in software development tasks, such as documenting design decisions and promoting better communication between teams, as pointed out in recent studies. however, little is known about the factors, practitioners' perceptions, and practices that affect uml use in real-world projects. this article, therefore, reports exploratory research focused on investigating how uml is used in practice in the brazilian software industry. in total, 376 professionals from 210 information technology companies answered an online questionnaire about the factors affecting use, difficulty and frequency of use, perceived benefits, and contextual factors that prevent the adoption of uml models. in addition, 20 professionals participated in a semi-structured interview answering basic questions about professional experience, vision on software modeling, use of tools, and other aspects of uml. the main results show that 74% of the participants answered that they do not use uml frequently. factors such as (1) high time pressure to develop features; (2) the cost of disseminating a common model understanding among diverse audiences; and (3) the difficulty of evaluating the quality of the models affect the effective use of uml. in general, most participants know uml but do not use it frequently (or do not use it at all) in their projects. finally, this article draws some challenges, implications and research directions that can be explored in upcoming studies for promoting uml modeling in practice.
keywords: uml, unified model language, practice, industry, survey
1 introduction
uml models can play a crucial role in software development tasks such as documenting design decisions and promoting better communication within and across teams (omg, 2017). some previous studies (bucchiarone et al., 2021; fernández-sáez et al., 2018; chaudron et al., 2012) highlight that the use of uml modeling can provide benefits to the software development process, such as providing a common understanding among team members, understanding the details of design decisions, and ultimately making the process more efficient after all. however, in practice, such benefits are often overlooked or not observed. some studies (fernández-sáez et al., 2018; chaudron et al., 2012; störrle, 2017) argue that such benefits can be realized when there is a consistent and (in)formal application of modeling, where developers typically use uml throughout the project and have precise control over its use. as we can rarely find such a scenario, researchers (fernández-sáez et al., 2018; petre, 2014) have tried to draw a clear picture of uml use in real-world projects.
today, the current literature (akdur et al., 2021; fernández-sáez et al., 2018; petre, 2014; chaudron et al., 2012) lacks a broad and exploratory understanding of practitioners' perceptions of the factors that affect or even compromise the adoption of uml modeling in real-world projects. more specifically, little is known about how practitioners deal with software modeling in the context of the brazilian software development industry. previous studies (petre, 2014, 2013) have focused on collecting opinions from participants to understand which uml diagrams are most used. however, this assumes that participants' perceptions and experiences worldwide match those at the regional level (i.e., country or significant geographic region). these studies neither explore, for example, whether the project context can influence uml adoption, nor discuss practitioners' views on the perceived usefulness of uml itself.

this article investigates the state of the practice regarding the use of uml in the brazilian industry by surveying and interviewing software practitioners in that country. specifically, this work seeks to investigate (1) how practitioners use uml and (2) the relevance of its use in real-world software projects. therefore, this study surveyed 376 professionals from 210 brazilian information technology companies. we selected participants based on two criteria: (1) level of knowledge and practical experience related to software modeling; and (2) programming experience in regular projects. participants answered an online questionnaire about their experience with uml, the difficulties of adopting it, factors that affect its practical use, frequency of use, and the benefits it brings (or could bring). also, in the second phase, we interviewed 20 participants following a semi-structured interview protocol to further understand the survey results. our findings are encouraging: they help bridge the literature gap regarding the impact of organizational culture on uml use, provide an analysis of the factors that hinder uml use, and help to understand the broader landscape of uml adoption. some evidence already reported in the literature is reinforced. this study can help companies and software practitioners understand the broader landscape of uml use, thus supporting their decision-making around software practices and techniques in future projects. academia and industry can benefit from our insights on how to improve their software modeling practices or develop new tools and processes. besides, this study also benefits researchers and practitioners by providing additional empirical knowledge about practical issues concerning uml modeling in a broader view.

this article is an extended version of our previous work (júnior et al., 2021) in several ways. first, the article underwent a careful review and was significantly improved as a whole. second, the research protocol was improved by adding the list of interview questions and considering the location of the companies where the participants work. third, the number of survey participants increased from 314 to 376 (i.e., 62 new participants), new findings were generated from this sample, and the discussions regarding the six research questions were made more thorough.
in addition, this article presents additional discussions, identifies open challenges and implications, and describes the key underlying issues that need to be addressed in future investigations.

the article is structured in seven sections. section 2 discusses related work. section 3 details the adopted methodology. section 4 describes the results for each research question. section 5 brings up qualitative reflections and insights for future work. section 6 presents the main threats to the study's validity. section 7 wraps up the article and includes some ideas for future work.

2 related work

the selection of related works was performed in two steps: (1) an initial search in digital repositories, such as google scholar (https://scholar.google.com/) and scopus (https://www.scopus.com/), was done to identify articles regarding uml usage and surveys in this research field; and (2) the selected articles were filtered considering their alignment with the objective of our article (section 3.1). we selected studies from 2014 until now, as our study is based on the findings reported in (petre, 2014). after that, the nine selected studies were analyzed (section 2.1) and compared with our work to identify research opportunities (section 2.2).

2.1 analysis of related works

petre (2014). this work performed an empirical study about the use of uml in practice, which involved interviews conducted over two years with more than fifty software developers. the participants were mainly from north america and europe, but some were from brazil, india, and japan, and many had worked in more than one country. petre found that participants did not use uml universally but used it consistently in specific contexts such as embedded systems (e.g., automotive, aerospace, etc.). in addition, petre reported that uml models are not used homogeneously; on the contrary, the interviewees reported heterogeneity in the way models are used in practice. typically, interviewees assumed different roles throughout the development cycle, using uml models differently in each role. petre also reported that the way practitioners used uml diagrams depended on the problem domain faced.

ozkaya and erata (2020). this research involved 109 professionals from 34 countries, representing different profiles, positions, types of software projects, and years of experience, to understand how professionals use uml to model software architecture from different viewpoints: functional, information, concurrency, development, deployment, and operational. they found that the information and functional viewpoints are the most popular ones. moreover, the obtained results showed that most participants (88%) used uml when they needed to model system architecture from different viewpoints.

fernández-sáez et al. (2015). this study presents a survey on the use of uml in software maintenance. they surveyed 178 practitioners working on software maintenance projects in 12 different countries. their results indicate that companies can improve system maintenance by leveraging the use of uml diagrams while executing maintenance tasks; however, it would require a significant effort to update uml diagrams as source code evolves.

farias et al. (2018). we reported research findings from a shorter survey identifying uml use in practice in the brazilian industry.
two hundred and twenty-two practitioners from 140 different information technology companies answered a questionnaire concerning their experiences with uml, the difficulty in adopting it, and what should be done to increase adoption in practice. the results show that: (1) only 60 participants (28.2%) had used uml in their daily work; (2) 55.41% of the surveyed participants did not disagree with the statement that uml is the “lingua franca” in software modeling; and (3) 61.26% found that the automatic creation of uml diagrams representing a big picture of the system under development would be useful to boost uml use.

ciccozzi et al. (2019). this work carried out a systematic review that involved 63 research studies and 19 tools selected from more than 5400 initial entries. the objective was to identify, classify, and evaluate the existing solutions for uml model execution (i.e., automatically interpreting or translating models into running software). the main results of this study are: (1) there is a growing scientific interest in the execution of uml models; (2) model-level debugging is supported in very few cases; (3) only a few solutions provide evidence of industrial use, with very limited empirical assessments; and (4) the most common limitation is the coverage of the uml language.

störrle (2017). this article conducted an online survey involving 82 professionals to determine whether and to what extent they use conceptual models and for what purposes. specifically, the author sought to grasp (1) whether practitioners use uml and bpmn (business process modeling notation) for software modeling; (2) for what purposes these modeling languages are used; (3) what the different ways of using these models in practice are; and (4) how often practitioners use these modeling languages. störrle found that models are perceived to be widely used by study participants, and that uml is the leading language. störrle reported three distinct usage modes of models, of which the most frequent is informal usage for communication and cognition.

fernández-sáez et al. (2018). this study performed a case study in a multinational company's ict department and involved 31 interviews with employees who work on software maintenance projects. the study mainly focused on the use of uml in software maintenance. they found that using software modeling notations such as uml is considered beneficial for software maintenance but needs to be tailored to its context. the authors also provided a list of recommended practices that contribute to the increased effectiveness of software modeling.

ho-quang et al. (2017). the authors conducted a large-scale survey with 485 responses from contributors from 458 different open source projects. in that context, they found that collaboration was the most important motivation for using uml in open source projects, as teams use uml during communication and planning of joint implementation efforts. uml models seem to benefit new contributors' onboarding but do not seem to be a significant factor in attracting new contributors.

neto et al. (2021). this study presents an overview of the adoption of uml in it companies in são carlos (brazil) and the region through a survey of 21 questions answered by 24 participants. it also aims to compare how the language is taught in universities.
the results show significant use of uml, including in companies that adopt agile methods, and the authors suggest that uml content be preserved in the curricula of educational institutions, in an updated and optimized way that meets the trends presented by it companies. the study also points out that the opportunities in the area of modeling, given the mastery of agile methodologies and the trend of continuous acceleration of processes, are vast. one of them would be, at first, adapting uml modeling to agile methodologies without sacrificing the most valued asset in these methodologies: time.

2.2 comparative analysis and opportunities

six comparison criteria (cc) were defined to assist in identifying similarities and differences between the proposed work and the selected articles. this comparison is crucial to identify research opportunities using objective rather than subjective criteria. we describe the six comparison criteria below:

• context (cc01): studies that involved professionals in the brazilian industry.
• participant profile (cc02): studies that collected participant data for screening and profile characterization.
• specific geographic region (cc03): works that explored uml use in a specific regional scope.
• applicability of uml (cc04): studies that evaluated which factors prevent the adoption of uml in the software industry.
• interviews with participants (cc05): studies that triangulated quantitative and qualitative data.
• different domains (cc06): studies that involved software developers working in different problem domains or business segments.

table 1 compares the selected papers and summarizes whether they meet each criterion completely, partially, or not at all, thus contrasting them with our work. moreover, it highlights the similarities and differences between them. we observe that only our work fulfills all criteria. in this sense, two research opportunities were identified: (1) few studies broadly inspect the adoption of uml models from the perspective of the brazilian industry; and (2) no study produced empirical evidence from a survey and interviews at the same time. the next section outlines a methodology to explore these identified research opportunities.

table 1. comparative analysis of the selected related works, rating each of the nine studies against cc01–cc06 (● completely meets, ◐ partially meets, ○ does not meet); the proposed work is the only one that completely meets all six criteria.

3 methodology

this section presents the research methodology followed for conducting our survey. this protocol was formulated based on well-known guidelines (wohlin et al., 2012; kitchenham and pfleeger, 2008) to design and run empirical studies, as well as on our experience in carrying out previous surveys (farias et al., 2018; júnior et al., 2021). this section is organized as follows. section 3.1 introduces the main objective and research questions. section 3.2 describes the adopted experimental process. section 3.3 describes the questionnaire and interview formulated and applied in the study.
3.1 objective and research questions

the study objectives are twofold: (1) to understand the diffusion and relevance of the use of uml in the brazilian industry; and (2) to analyze at what level developers understand the benefits of uml in real-world projects. we formulated six research questions (rq) to analyze different facets of these objectives. table 2 describes the formulated rqs.

table 2. research questions investigated in this article
rq1: what factors influence the effective use of the uml? (motivation: reveal the influencing factors in a broader usage of uml models in practice; variable: usage-influencing factors)
rq2: what makes uml modeling a challenging practice? (motivation: understand the challenges practitioners face that hinder the adoption of uml modeling; variable: adoption-hindering factors)
rq3: what benefits do practitioners realize when it comes to using uml? (motivation: reveal the most commonly realized benefits when using uml modeling; variable: perceived benefits)
rq4: how often do practitioners use uml? (motivation: understand how often practitioners use uml modeling; variable: frequency of use)
rq5: how does the context of software projects limit the use of uml in organizations? (motivation: identify context factors that limit the use of uml in organizations; variable: project context)
rq6: how do practitioners view uml modeling? (motivation: reveal the practitioners' vision regarding the adoption of uml modeling; variable: practitioner view)

3.2 experimental process

figure 1 introduces the adopted experimental process, composed of three phases discussed as follows.

figure 1. experimental process

phase 1: selection of participants. participants were selected based on the following criteria: level of knowledge, practical experience related to software modeling, and programming experience in industrial software development projects. using these criteria, we sought to select participants with academic backgrounds and practical experience in the brazilian industry. this set of all possible participants represents the target population (kitchenham and pfleeger, 2008; wohlin et al., 2012). more specifically, the target population comprises practitioners working in brazil, including developers (of different seniority levels), software architects, and project managers, with academic backgrounds obtained from brazilian universities. this population represents those who are in a position to answer the questions asked and to whom the research results apply (kitchenham and pfleeger, 2008; wohlin et al., 2012). in total, 376 participants answered the questionnaire.

phase 2: application of the questionnaire and interviews. this phase focused on the application of the questionnaire and the execution of the interviews. we conducted interviews to collect additional qualitative data related to the research questions. such data is essential to triangulate the results (section 4) obtained from our questionnaire and interviews. the questionnaire (discussed in section 3.3) was sent by e-mail to the target population, totaling more than 406 people invited. in total, the study had 376 participants. we carefully selected the target population to avoid collecting data from people with inadequate profiles. we invited undergraduates, graduate students (master's and doctorate), industry professionals with a recognized academic background, and professionals identified in professional social networks, such as linkedin. the 376 participants worked in 210 companies in different brazilian regions (midwest, south, southeast, and northeast).
after completing the stages of answering and sending the questionnaire, we randomly invited 27 participants (out of 376) for a semi-structured interview (wohlin et al., 2012; farias et al., 2015). twenty participants, referred to as p1–p20 hereafter, accepted the invitation. the script was direct, starting from basic questions about professional experience, the vision of software modeling, the use of tools, and other aspects of uml. the interviews were performed and recorded using the microsoft teams software. in a further step, we triangulated the qualitative and quantitative data from the interviews and the questionnaire to explore complementary aspects of the data.

phase 3: data analysis. this phase sought to carefully analyze the data collected through the questionnaire and interviews. for this, we first analyzed the collected data (interviews and survey) separately and then compared them (triangulation). initially, we analyzed the data collected through the survey and tabulated it. then, we used those initial survey results as the basis to formulate the interview questions. therefore, the interviewees answered questions that sought to explore the results obtained through the survey more deeply, seeking consistency in the data analysis. the investigation proceeded through a dialectical process of interaction and reflection between the researchers and the participants. we manually performed the interview data analysis and went from a broad view to a more focal one without divergences. that helped us obtain complementary evidence to explain the quantitative results and then derive concrete conclusions from a chain of evidence formed by the systematic alignment of quantitative and qualitative data.

3.3 questionnaire and interviews

data were collected from interviews and an online questionnaire created in google forms (https://forms.gle/tfrwsgj7ufucpafn7). the study repository (https://github.com/edwjr/surveyquestionnaire) has more information. participants reflected on their experience with uml software modeling in practice through our semi-structured interviews. table 3 presents the list of questions used in the interview. these interviews helped us to enrich the body of qualitative data. the authors asked a list of predefined questions of all respondents; new questions were formulated based on the answers given by the participants. we chose the online survey instrument because it enabled quick application and fast distribution, thus reaching a larger number of individuals in geographically diverse locations at no additional cost. the survey questions examined research gaps in previous studies and reused the structure of the previously developed questionnaire. in addition, we based the design of the questionnaire and interview questions on the findings reported by petre (2014).

table 3. list of questions used in the interview
q1: which company do you currently work for?
q2: what is your view on software modeling?
q3: how is uml used where you work?
q4: what is the main difficulty in using uml?
q5: why do developers tend not to use uml in organizations?
q6: when is the use of uml worth it?
q7: do you use any specific software modeling tools to visualize and edit diagrams?
q8: how often do you not consult the software documentation and work directly with source code?
q9: how much effort do you put into reading uml diagrams?
q10: what improvements should be made to enhance the use of uml?
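as an aside, the kind of tabulation that underlies the percentage breakdowns reported in section 4 can be illustrated with a short script. the sketch below is not the authors' analysis code; the likert levels match the ones used in this article, but the sample answers and the handling of optional questions are hypothetical.

from collections import Counter

LIKERT = ["totally agree", "partially agree", "neutral",
          "partially disagree", "totally disagree"]

def tabulate(answers):
    """return (count, percentage) per likert level, skipping blank answers."""
    valid = [a for a in answers if a]  # some questions were optional, so blanks occur
    if not valid:
        return {}
    n = len(valid)
    counts = Counter(valid)
    return {level: (counts[level], 100.0 * counts[level] / n) for level in LIKERT}

if __name__ == "__main__":
    # hypothetical answers to one statement; three participants skipped it
    sample = (["totally agree"] * 21 + ["partially agree"] * 40 +
              ["neutral"] * 25 + ["partially disagree"] * 10 +
              ["totally disagree"] * 4 + [""] * 3)
    for level, (count, pct) in tabulate(sample).items():
        print(f"{level:18s} {count:3d} {pct:5.1f}%")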
4 results

this section presents the obtained results concerning the formulated research questions (described in section 3.1). we used histograms to provide an overview of the data collected from the responses of the 376 survey participants and the 20 interviews.

4.1 analysis of the participants' profile

table 4 summarizes the participants' profile, reporting different facets including education, undergraduate degree, job role, overall experience, professional experience with software modeling, experience with software development, and location. the 376 participants who responded to the survey came from 210 companies in brazil (at the time of data collection). as some questions were not required, the sum (n) is not necessarily equivalent to the total number of participants (376).

table 4. the profile data of the participants (n=376; counts and percentages per characteristic)
education: technical certificate 77 (20.6%); undergraduate student 117 (31.2%); graduate 138 (36.9%); specialization 22 (7.9%); master 14 (3.7%)
undergraduate degree: systems analysis 195 (51.9%); computer science 108 (28.7%); information systems 42 (11.2%); others 31 (8.2%)
position: developer 187 (50.7%); systems analyst 87 (23.6%); software architect 9 (2.4%); manager 7 (1.9%); others 79 (19.6%)
overall experience: < 2 years 138 (37.5%); 2–4 years 129 (35.1%); 5–6 years 56 (15.2%); 7–8 years 10 (2.7%); > 8 years 18 (4.9%)
professional experience with software modeling: < 2 years 227 (61.2%); 2–4 years 91 (24.5%); 5–6 years 25 (6.7%); 7–8 years 10 (2.7%); > 8 years 18 (4.9%)
professional experience with software development: < 2 years 126 (34.1%); 2–4 years 120 (32.5%); 5–6 years 54 (14.6%); 7–8 years 28 (7.6%); > 8 years 41 (11.1%)
geographical distribution of companies: northeast 3 (1%); midwest 31 (15%); south 102 (42%); southeast 13 (6%); more than one location 61 (29%)

education. the majority (68.1%) either had already graduated from college (36.9%) or were undergraduate students (31.2%), while 10.6% had already completed either a postgraduate specialization (7.9%) or a master's degree (3.7%) in the field of computing. 20.6% of the participants were “certified technicians” in the field of computing (in brazil, some schools offer high school degrees with an additional professional/technical certificate). only one participant did not earn an undergraduate degree in computing but rather in mathematics, subsequently pursuing a master's degree in applied computing. regardless of their level of education, all participants were professionals with experience in the industry.

undergraduate degree. most participants (91.8%) had an undergraduate degree in computing. in brazil, universities offer computing degrees under different names, including systems analysis (51.9%), computer science (28.7%), and information systems (11.2%). this shows our participant pool has a strong academic background, which complements the participants' practical experience.

position. considering their job roles, 50.7% were software developers, 23.6% were systems analysts, and 2.4% were software architects; managers accounted for 1.9% of the sample. thus, about 80% of the participants were in job positions directly related to software development practices.

overall experience. the experience level is diverse in our participant pool, showing a higher concentration in the 2 to 6 years range (62.5%); 7.6% had seven years or more of overall professional experience.

modeling experience. regarding modeling experience, participants were experienced, though not highly experienced, with software modeling.
a lack of experience was expected, since previous empirical studies point to low adoption of uml models in the industry. about 38% of the participants had more than two years of professional experience in software modeling, while the others reported less than two years of experience.

development experience. regarding software development, participants generally reported more years of experience than for software modeling (when software modeling is considered a separate activity). as expected, practitioners are generally more exposed to programming tasks than to modeling tasks. that is why we see more years of experience in “software development” than in “software modeling” when these are considered separate activities.

geographical distribution of companies. regarding work location, our participants came from 210 different companies located in all regions of the country except the northern region. the largest concentration was in the southern region, with 102 companies, representing 42% of the sample. the midwest and southeast regions accounted for 15% (31) and 6% (13), respectively, and the northeast region represented 1% (3). companies located in more than one region represent 29% (61).

given the participant demographics, we consider the participants' profile adequate to answer the research questions of our study for two main reasons. first, the participants came from a diverse set of companies (210), avoiding responses biased by experiences obtained in a limited set of companies. also, the large number of companies the participants came from increases the chances of participants having experience in diverse business contexts and organizational cultures, thus improving the quality of the signal we can get in the study. second, all the participants had some formal education in computing, thus increasing the chances that they had some level of training in software modeling. this reduces the risk of biased answers from participants who had never known uml or heard about software modeling before the survey. moreover, the 20 interviewed participants reported modeling experience greater than five years, and they worked in software development in areas such as education (4 participants), agribusiness (3), e-commerce (2), government (3), trading (3), product exports (2), and finance (3). that diversity of areas, experience, and knowledge enriched the discussion. for ethical and privacy reasons, we chose not to present the names of the companies where participants worked. the following sections discuss the obtained results, organized by research question.

4.2 rq1: what factors influence the effective use of the uml?

figure 2 presents the collected data concerning the uml usage-influencing factors (rq1). we explored three factors to answer rq1: (a) time pressure that leads developers not to do software modeling, focusing only on working on the code; (b) the cost of promoting a common model understanding among the involved people with different levels of education/experience; and (c) the difficulty in assessing the quality of the created models.

time. figure 2(a) indicates that 52% of the survey participants and 18 of the 20 interviewees reinforced that short development time and high demands are the main factors that influence the use of uml, since the software systems developed are getting larger and more complex every day due to the increasing demands of customers.
“currently the projects are large and with a very short delivery time, you can barely deliver 100% software, imagine a documentation that would have to be updated at every step” (p17). this also leads to complex software projects that cannot be easily managed by project stakeholders and causes software systems to be delivered late (or with budget overrun) or incorrectly developed (ozkaya and erata, 2020). consequently, practitioners end up opting for other complementary methods, such as screen prototyping, or not creating uml models at all.

cost of promoting understanding. figure 2(b) shows that most of the participants either fully agree (34%) or partially agree (34%) that the cost of promoting a common understanding among team members is a significant influencing factor on uml use. conversely, when we approached the interviewees with this question, most of them (12 out of 20) considered that the cost of promoting accurate model understanding between people with different levels of education/experience and viewpoints is low, diverging from the survey data. this divergence possibly emerged because most interviewees worked in teams where all members had the same level of experience/training, thus leading to a smoother alignment regarding model understanding. the academic skill set, i.e., where and how stakeholders learned software modeling, influences their modeling approaches and their relevant practices (akdur et al., 2017).

difficulty evaluating. figure 2(c) shows that the difficulty in evaluating the quality of uml models is another significant usage-influencing factor (21% fully agree, 40% partially agree). data from the interviews also supported the difficulty in evaluating the created models and identified it as one of the factors that affect the effective use of uml in the industry.

moreover, the results on the usage-influencing factors support previous findings (chaudron et al., 2012; fernández-sáez et al., 2015; bucchiarone et al., 2021; störrle, 2017). bucchiarone et al. (2021) advocate that stakeholders model informally to support communicative and cognitive processes using emergent and flexible graphical notations in the early stages of the software development process. störrle (2017) also indicates that informal modeling (e.g., sketching on a whiteboard) is considered more effective in promoting communication, collaboration, and understanding. however, it is worth noting that such diagrams can be scrapped or become inaccurate, since they are not maintained together with the updated source code. jackson (2019) points out that informal representations can be a good start for modeling, but they are limited, give rise to inconsistent interpretations, and cannot be analyzed mechanically. additionally, previous experimental studies such as (ho-quang et al., 2017; petre, 2014; scanniello et al., 2014) revealed some issues that challenge uml's effectiveness: for instance, uml's complex notation as a whole, the preference for other modeling approaches (e.g., informal sketches), and the fact that certain problem domains or industries are more suitable for uml modeling than others. however, professionals have developed ad hoc practices that employ uml models in reasoning and communication about design, both individually and in collaborative dialogue. on the other hand, in some scenarios and industries, models can be transformed into programs using the proper tools. in such cases, models have a longer service life and must be kept up to date.
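to make the model-to-code idea above concrete, the toy generator below turns a class-diagram-like description into source stubs. it is only a sketch: real model-driven tools are far richer, and the dictionary-based model format here is invented purely for illustration.

# a toy model-to-code generator; the MODEL format is hypothetical.
MODEL = {
    "Customer": {"attributes": ["name", "email"], "operations": ["place_order"]},
    "Order": {"attributes": ["total"], "operations": ["add_item"]},
}

def generate(model):
    """emit python class stubs from the class-diagram-like description."""
    lines = []
    for cls, spec in model.items():
        lines.append(f"class {cls}:")
        lines.append(f"    def __init__(self, {', '.join(spec['attributes'])}):")
        for attr in spec["attributes"]:
            lines.append(f"        self.{attr} = {attr}")
        for op in spec["operations"]:
            lines.append(f"    def {op}(self):")
            lines.append("        raise NotImplementedError")
        lines.append("")
    return "\n".join(lines)

print(generate(MODEL))

a generator like this is also where the maintenance problem bites: once the diagram and the generated stubs diverge, the model must be regenerated or it becomes stale, which is exactly the synchronization cost participants describe.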
it is also often observed that different teams and sub-organizations within the same company use different modeling approaches for different purposes at different stages of the software development lifecycle (heldal et al., 2016). therefore, either informal modeling or “traditional uml modeling” with automated code generation can become alternatives when time is a first-class constraint.

figure 2. usage-influencing factors (rq1)

summary of rq1: the results show that most participants indicate three points that affect the use of uml diagrams: (1) limited available time to create and maintain diagrams; (2) the high cost of promoting proper understanding among different people with different levels of education/experience and viewpoints; and (3) the difficulty in evaluating the quality of the diagrams. we understand that companies may need different modeling practices for different projects or roles within projects. practitioners should consider those three points when considering uml modeling as part of their development processes.

4.3 rq2: what makes uml modeling a challenging practice?

figure 3 shows the collected data regarding rq2. from the survey responses, we highlight three adoption-hindering challenges: (a) the company's culture, which affects the way uml is used; (b) the effort necessary to keep different uml diagrams in sync; and (c) the high effort to create and maintain the models.

company culture. figure 3(a) indicates that 56% totally agree, 30% partially agree, and 10% were neutral. from the interviews, participants pointed out that, in some organizations, there is a culture of risking and failing as a path to learn quickly and meet customer needs, even if it requires much rework, thus sometimes neglecting planning and upfront design. in addition, one of our interviewees mentioned: “i believe that the greatest difficulty is to change paradigms, especially when working with more mature teams that have grown without this modeling” (p4). although the current state of practice has reached some degree of automation in systems engineering, its tasks still require many human resources. thus, introducing process change in an organization already in operation is not easy (böhm et al., 2014). it is important to note that organizations may need different modeling approaches for different projects or even for different engineering roles within projects (akdur et al., 2021). as also described in (heldal et al., 2016), different units within the same company tend to use different modeling approaches. in addition, in the same project, different engineers may use different modeling practices, depending on their tasks and responsibilities (akdur et al., 2021).

synchronization of diagrams. figure 3(b) shows that 37% of the participants partially agree and 30% fully agree that keeping diagrams in sync is a significant challenge that hinders uml use, corroborating the majority of the interviewees (19). although collaborative tools for software modeling exist, our result reinforces the findings reported in other studies conducted with industry participants (chaudron et al., 2012; cicchetti et al., 2016; kuhn et al., 2012; liebel et al., 2018), which pointed out problems related to insufficient support for collaboration. there is a gap between uml tools and advanced solutions specialized in supporting collaboration.
in addition, the next generation of modeling tools should support round-trip engineering to synchronize related uml diagrams and source code. since modeling a software system's structural and behavioral aspects within a single model is not a trivial task, uml provides a set of diagrams to support a multi-view modeling approach. thus, different aspects of the system under development are represented by different diagrams.

high effort. figure 3(c) revealed that 41% totally agree, 38% partially agree, 13% are neutral, 7% partially disagree, and 1% totally disagree. therefore, the vast majority consider the effort invested in the creation and maintenance of uml models high, a point unanimously raised by the interviewees. “the biggest problem is the cost of keeping the diagrams as the system changes. in addition, it is still difficult to maintain a strong culture of maintenance and updating of models” (p17). another interviewee complements: “from a maintenance point of view, i think that some improvements would be necessary for the diagrams to provide a better figure of the big picture, allowing to identify more quickly relevant issues such as impact and points that can be taken into attention” (p4). in ozkaya and erata (2020), the authors mention that modeling software architectures with uml from the concurrency point of view attracts relatively little interest from professionals. one important reason could be uml's lack of support for modeling concurrency and race conditions. in addition, based on the findings of this study, most professionals are not used to planning development issues (e.g., source code organization and software construction and release processes) during modeling and design; these issues are usually deferred until implementation. interviewee 11 reports: “uml is used at the beginning of the project, more specifically the projection phase, but with the progress being left aside, it ends up being outdated, since most developers focus only on the code and management does not make large charges on its use” (p11).

in this context, fernández-sáez et al. (2015) pointed out that the modeling tool used to maintain/modify uml diagrams is an important factor when deciding whether to use uml in a software development process. there are different types of tools with different benefits: licensed tools (which imply an investment but also a return through training, customizations, etc.) vs. open tools, and uml-specific modeling tools (which check syntactic correctness) vs. general modeling tools (which are more “accessible”). uml was identified as the dominant notation in forward and lethbridge (2008). the authors found that uml modeling tools are primarily used for initial design, while uml is not widely used for code generation. the study participants seemed open to incorporating modeling into their processes. however, the difficulty of keeping models up to date with code changes is a significant depreciation factor (68% agreement on this in forward and lethbridge (2008)). the analysis performed in forward and lethbridge (2008) is particularly interesting, finding that programmers are more likely to agree that modeling tools are “heavy-weight.” given this scenario, fernández-sáez et al.
(2018) point out that it would be desirable to have a tool that creates and maintains documentation containing a mix of text and diagrams, in addition to having features that improve traceability between model and text to avoid leaving the documentation and the model out of sync. it would also be useful to have a tool that supports diagram versioning matching the system version, searching for model elements, and presenting different views of the diagrams (for different consumers of diagram information). in addition, another point we noted is that most participants are not used to putting effort into upfront planning and design (such as modeling) when they attempt to tackle coding issues.

figure 3. adoption-hindering factors (rq2)

summary of rq2: the results show that (a) organizational culture represents a significant challenge to the adoption of uml models, since the adopted engineering practices and the culture of agility sometimes leave no room for modeling; we observe that modeling in agile processes constitutes a distinct pattern of uml use. also, (b) synchronization between uml artifacts makes uml difficult to use in highly collaborative software teams, and (c) the high overall effort to develop and maintain models is hard to accommodate in current organizational cultures.

4.4 rq3: what benefits are realized when using uml?

figure 4 shows a summary of the collected data related to rq3. we asked three questions related to (a) whether using uml selectively (only a few diagrams) helps to minimize complexity and avoid problems of completeness and inconsistency between diagrams; (b) whether uml models are helpful during application integration discussions; and (c) whether uml helps to form a common system understanding among developers. figure 4(a) indicates that 39% fully agree, 39% partially agree, and 15% are neutral. figure 4(b) shows that 49% fully agree, 41% partially agree, and 7% are neutral. figure 4(c) reveals that 41% fully agree, 41% partially agree, and 11% are neutral.

all twenty interviewees agreed that using uml benefits software development, as it helps in the general understanding of the system context, thus facilitating communication in the team. “the use of this language enables the understanding and discussion of the architecture of a project by the entire team and allows representing more complex and difficult flows” (p17). “uml is a powerful language for understanding software at various levels of abstraction. when used properly it contributes to creating a better product. when used improperly (in a forced way) it ends up consuming resources and not helping much. in short, diagrams should be used as a means to understand various aspects of the software to be developed and not as the end. the goal of development is software and not diagrams” (p9). these factors are identified in ho-quang et al. (2017), where most participants (79%) found uml useful for understanding systems, improving communication between developers, guiding implementation, and managing project quality. interviewees also mentioned that uml could help with defect detection and with designing/implementing the integration of heterogeneous applications. however, inconsistent model interpretations can have serious consequences, especially when multiple and conflicting stakeholders are involved. for example, different interpretations between the development team, customers, and regulatory bodies can lead to rework, delays, and financial and legal repercussions.
this risk may be exacerbated because compliance verification is usually performed later in the software development process. consequently, any problem discovered in the compliance check (when applicable) is expensive to repair (usman et al., 2020). participants of petre (2014) reported using uml more enthusiastically, working in a more scope-focused manner, and keeping the artifacts manageable in size and suitable for avoiding synchronization and consistency issues. the interest revolves around problem-solving or decision-making to avoid undue costs. one area that deserves further research is how the use of uml is shaped by the context of the domain, an investigation that requires much more access to a variety of software industries. this context demonstrates that it is necessary to understand what actually facilitates effective software development. all this evidence highlights the need to consider the relationship of tools, including notation, both with the community of practice and with the application domain. participants reinforced that software developers are open to understanding the concepts and that, at the same time, they want to use tools that make the process effective; otherwise, they tend to discard tools that are at odds with their practices.

figure 4. perceived benefits (rq3)

summary of rq3: selectively using only a few uml diagrams helps minimize complexity and avoid problems of completeness and inconsistencies between diagrams. in the participants' view, using uml is beneficial and can help avoid issues in the project, enabling better system understanding and assisting in integration discussions.

4.5 rq4: how often is uml used?

figure 5 presents the participants' responses on the use of uml in their work. as the question was not mandatory, 365 of the 376 participants answered it. 74% answered that they do not use uml frequently, while 26% answered that they use uml quite often. this result reinforces findings in ozkaya and erata (2020), in which the authors report that 35 of the 50 subjects in the study do not use uml in practice. similarly, gorschek et al. (2014) found that practitioners do not frequently use uml; when they do, they do so informally, with minimal or no tool support, and the notation is not necessarily enforced to be uml.

figure 5. frequency of use (rq4)

the twenty interviewees stated that they did not use uml frequently. however, they acknowledged the various benefits of using it in software development. “i understand that uml has a very strong semantic power, which favors its use in the elaboration of architecture, as well as in the construction of the system” (p4). störrle (2017) pointed out the importance of understanding the ever-changing demands of the software industry, indicating organizational and software development culture differences as potential factors influencing uml use. similarly, the results of ozkaya and erata (2020) show that the majority of professionals (88%) use uml in modeling their software systems from different architectural points of view. among the architectural views (i.e., functional, information, concurrency, development, deployment, and operational), the most popular ones are the functional and information views (96–99%). the operational point of view is the least popular, ignored by 61% of participants in their software modeling with uml.
studies (kobryn, 2002; dori, 2002; thomas, 2004) argue that uml is not fulfilling the role of a “lingua franca” or standard because of issues such as size, complexity, semantics, consistency, and model transformation.

summary of rq4: the collected results show that uml modeling has low adherence in companies, although participants recognize the benefits of using uml models in software projects. these results are consistent with previous studies.

4.6 rq5: how does the context of software projects in companies limit the use of uml?

figure 6 presents the collected data associated with rq5. three project context issues that may affect uml use have been summarized: (a) uml formalism (or lack thereof): would more formalism in uml lead developers to use it more frequently? (b) whether the use of uml by practitioners depends on adapting it for a specific purpose; and (c) the fact that companies tend to develop relatively small software that undergoes continuous modification. participants indicated that the high demand for software development may end up limiting the use of uml in practice. thus, developers start to keep design decisions “in mind” (or in informal communication channels) and communicate without any formal diagram.

figure 6. context of use (rq5)

more formalism. regarding uml formalism, figure 6(a) shows that 28% are neutral, 27% partially agree, and 21% totally agree that more formalism would help uml use. of the 20 participants we interviewed, 15 consider that the high degree of formalism is a negative factor for the applicability of uml, since their processes are highly dynamic and agile, requiring a less formal and more interactive use. the project context our interviewees were involved in is usually very dynamic and agile, thus leading to constant changes in design, documentation, and uml models when they exist. more formalism in the language may lead to higher effort in producing and maintaining up-to-date models in such dynamic and agile scenarios. therefore, even though some participants seem to understand the benefits of having more formalism in modeling languages (e.g., more code generation and model transformations), most of today's projects do not have enough resources to take on the high cost of creating and maintaining semantically rich models (with a higher degree of formalism).

adaptation of use. figure 6(b) summarizes to what extent participants agree that uml use correlates with whether they can adapt it to their specific needs. the majority of the interviewees (12) pointed out that uml can be adapted to a specific purpose (e.g., a project domain, a specific section of the architecture, or a specific stakeholder's view), but this adaptation is complex due to factors such as: (1) the high cost of ensuring that documents/models are in sync with the code; (2) the difficulty in measuring the return on investment of adopting modeling practices; (3) uml use in legacy software; and (4) the fear of adopting changes in the process, especially when working with more mature teams that have grown without modeling practices. all of this leads us to believe that much research is still needed.

continuous modification. figure 6(c) summarizes data on whether participants agree that the continuously changing nature of relatively small to medium projects makes it difficult to use uml. that data also matches the interviewees' perceptions.
even when practitioners work on larger projects, they usually break them into smaller iterations (and sub-projects) where developers can get along without much modeling activity. although most study participants of petre (2014) believe that uml is a “lingua franca” in companies and have theoretical knowledge about this type of modeling, they end up not using it frequently. the results of fernández-sáez et al. (2015) revealed that software developers using uml diagrams end up experiencing difficulties reading them; therefore, most surveyed companies use the “most understandable” uml diagrams. maintainers do not always use the available documentation and often work directly with the source code; even when documentation with models is available, it is not typically used.

summary of rq5: the project context matters. depending on the project and process, more or less formalism might help uml use. also, the ability to continuously update diagrams together with continuously changing code in specific projects is another influencing factor. finally, whether it is possible to adapt modeling practices to specific project needs affects uml use.

4.7 rq6: how do practitioners view uml modeling?

figure 7 summarizes data regarding rq6. we explored three possible issues related to practitioners' views on adopting uml modeling.

not interested in modeling. figure 7(a) shows that 41% totally agree, 33% partially agree, and 13% are neutral regarding the statement that developers are not interested in modeling tasks. additionally, 13 of the 20 participants interviewed stressed that developers do like modeling and understand its importance; however, the factors discussed in rq1 and rq2 limit its adoption. in petre (2013), uml is considered “unnecessarily complex” by several participants, who reported variations in understanding and interpretation among developers, resulting in problems such as challenges with the formal semantics of the language. others noted that the complexities of the notation limited its usefulness, or required targeted use, in discussions with stakeholders (including highly technical stakeholders).

lack of modeling pattern. figure 7(b) indicates that 15% fully agree and 37% partially agree that the lack of modeling patterns and modeling guidance, in other words the open-ended nature of uml, makes it less attractive. according to the interviewees, this lack of guidance on creating models correctly and effectively prevents developers from using uml modeling. “not all project participants will understand modeling, there is no pattern. there are no people qualified to generate uml” (p5). hutchinson et al. (2011a,b) found that people use various modeling languages in projects following model-driven engineering (mde). companies using mde tend to develop domain-specific languages (dsls), which have a very product/implementation-focused notion.

general model. figure 7(c) shows that 19% fully agree and 39% partially agree that the lack of a general diagram providing a big picture of the system, with structural and behavioral elements, makes uml adoption less attractive. most of the interviewed participants (16) reinforced the difficulty of modeling structural and behavioral aspects of complex software in a single “big picture view.” fernández-sáez et al.
(2018) sought to provide a comprehensive and systematic view of the main challenges in software modeling and to understand their different categories, together with discussions of the concrete challenges professionals may face in each category. in their study, they raised eight different types of challenges, including managing the complexity of the language, extensive modeling languages, domain-specific modeling environments, developing formal modeling languages, analyzing models, separation of concerns, transforming models, and model management.

figure 7. practitioner view (rq6)

summary of rq6: most developers do not demonstrate an interest in modeling, which can be explained by crucial factors such as the absence of standard modeling guidance and the difficulty of bringing upfront design aspects into the software development lifecycle. new modeling approaches are required to facilitate modeling and bring developers closer to it, making the process simpler, more dynamic, and more motivating.

5 additional discussion

in section 5.1 we provide reflections and future directions based on the obtained results. section 5.2 discusses issues related to the adoption of continuous modeling. section 5.3 outlines some discussions on gamified software modeling as a way to enhance the adoption of uml models. section 5.4 discusses the need for new approaches to assess uml diagrams in the context of modeling training and education. section 5.5 draws implications from our findings.

5.1 summary of reflections

time constraints and lack of knowledge. the study results point to time constraints as one of the main factors that affect the use of uml. although participants recognize the importance and benefits of creating uml diagrams, the short time available in projects leads professionals not to use uml, or to use it in a limited manner. in addition, the lack of in-depth knowledge about uml diagrams is an impediment, since the cost of promoting proper understanding among people with different levels of education/experience and viewpoints is high. also, the difficulty of evaluating uml model quality is another considerable challenge.

academic vision. one factor that interviewees consistently pointed out was the impression that uml tends to be more academic than industrial/practical, and that new teaching approaches need to be adopted in academic programs that involve uml in their curriculum. software engineering education with uml needs to be accompanied by real problems from industry, which reinforces the findings of neto et al. (2021). regardless of whether uml is a dominant representation in practice, there is evidence that it plays an important role in software engineering teaching (petre, 2014). uml provides a common representation from which to direct the system design discussion and build a shared model of the problem. it provides a means of “model-based thinking” for students who do not yet have a repertoire of representations and reasoning tools. the typical use of uml in education introduces key concepts and brings attention and structure to students' exploration and practical involvement with problems and design. one can argue that the value of uml in education lies in intellectual development rather than in mirroring industry practice.

company culture and agility. from the responses we got, we identified that the culture of agility in companies conflicts with the use of uml.
preparing and maintaining uml diagrams are two manual activities requiring knowledge and time. therefore, the popularity of informal modeling (e.g., whiteboard sketches) has grown as an attempt to improve collaboration and communication effectiveness. also, informal and lower-cost models (in the sense of being more straightforward and faster to draw) are more flexible, since learning them is simpler. working with the representation of abstractions (i.e., modeling) has not proved to be a popular choice in the context of an agility culture. delivering quickly (without major planning) and considering failures a natural part of arriving at the final software product have proved to be priorities. in this context, the multi-view modeling proposed by uml finds little application space, although it is recognized as important.

selective use of diagrams and complexity. when asked about the benefits they perceived when using uml, most participants responded that using uml diagrams selectively (i.e., using only a few diagrams) helps to minimize complexity, avoids problems of inconsistency between diagrams, and helps form a common understanding between developers. this conclusion was also verified by dzidek et al. (2008). the generality and freedom that enable uml to meet this wide range of purposes are also the sources of its weakness. uml has no formal semantics, which poses a problem when people use a uml model for different purposes. because one of uml's main objectives is to communicate software design, different ways of using uml are potential causes of communication problems (lange et al., 2006).

5.2 adoption of continuous modeling

companies seek not only to streamline their processes but mainly to find continuity throughout the software development cycle (rubert and farias, 2022; chen, 2015; elazhary et al., 2021; chen, 2017; laukkanen et al., 2017; fitzgerald and stol, 2017). fitzgerald and stol (2017) argue that achieving flow and continuity throughout the software development cycle is, in the first instance, much more important than velocity. since companies increasingly prioritize continuous delivery practices (chen, 2015), to benefit from uml adoption (bucchiarone et al., 2021; chaudron et al., 2012; dzidek et al., 2008), companies must put effort into involving uml modeling practices throughout the software development cycle. however, this requires significant process changes, for example, augmenting the ci/cd pipeline, giving rise to continuous software modeling (which poses a significant challenge).

technical challenges. robust uml modeling approaches, tools, and good practices that work out of the box and are highly adaptable to companies' realities are lacking. the absence of such an approach has led to the isolated, as opposed to continuous, adoption of uml models throughout the continuous delivery pipeline. modeling tools that fill this gap can bring the already documented benefits of uml models to the reality of companies, such as improving traceability between models so that documentation and the modeling process do not fall out of sync, not to mention the potential for resource savings. when building a continuous modeling platform, different tools and technologies can be used as building blocks for the continuous delivery pipeline (chen, 2015, 2017), as sketched below. however, companies should not be locked in by such tool suppliers.
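as a concrete illustration of one such building block, the sketch below shows a ci step that flags drift between the classes declared in a plantuml file and the classes actually present in a code base. it is only a sketch under assumed conventions: the file paths are hypothetical, and a real pipeline would check far more than class names.

# a minimal model/code drift check for a ci pipeline (paths are hypothetical).
import ast
import pathlib
import re
import sys

def classes_in_diagram(puml_text):
    """collect class names declared in a plantuml class diagram."""
    return set(re.findall(r"^\s*class\s+(\w+)", puml_text, re.MULTILINE))

def classes_in_code(root):
    """collect class names defined anywhere in a python source tree."""
    found = set()
    for path in pathlib.Path(root).rglob("*.py"):
        tree = ast.parse(path.read_text(encoding="utf-8"))
        found |= {n.name for n in ast.walk(tree) if isinstance(n, ast.ClassDef)}
    return found

if __name__ == "__main__":
    diagram = classes_in_diagram(pathlib.Path("docs/model.puml").read_text(encoding="utf-8"))
    code = classes_in_code("src")
    stale = diagram - code          # modeled but no longer implemented
    undocumented = code - diagram   # implemented but never modeled
    if stale or undocumented:
        print("model/code drift detected:", sorted(stale), sorted(undocumented))
        sys.exit(1)                 # fail the build so the model gets updated

failing the build on drift is the design point: it turns model maintenance from a cultural aspiration into an enforced pipeline step, which is what distinguishes continuous modeling from the isolated adoption criticized above.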
the scientific community should propose widely accepted modeling guidelines and good practices applicable to the organizational needs companies typically experience, define open apis (software modeling as a service), and build an ecosystem of tools for building a continuous software modeling pipeline. nowadays, software development iterations are short so that newly requested features can be delivered rapidly, establishing a continuous feedback cycle. large, monolithic models therefore need to be characterized and rethought as feature-oriented uml models. modeling practices must fit iterative processes (with very short release cycles) that are typically driven by incremental feature development. rather than designing a colossal set of uml diagrams upfront, it is recommended that software design with uml follow the same iterative approach driven by incremental feature development, which may ease the adoption of software modeling in agile teams. this poses the significant challenge of implementing feature-oriented continuous modeling approaches, as well as producing empirical evidence about the advantages and disadvantages of adopting continuous uml modeling. solving this challenge will require close collaboration between researchers and practitioners and will bring the benefits of uml modeling to the reality of more companies. process challenge. xavier et al. (2019) pointed out that people still associate uml modeling with traditional process practices (e.g., rup), while uml is not explicitly integrated with agile practices. our results indicate that agile teams tend not to adopt uml modeling. one of the participants reported: "if the preparation of uml models requires, for example, three days before it is ready for use by developers, this period will be responsible for much of the sprint time, for example" (p12). it is important to highlight that agile methodologies do not prohibit the use of uml; another participant states: "we work with scrum and with some uml diagrams, but few and only in the project phase. the system is giant to meet a bank's demands, there are many requests for functionality changes and improvements on the part of the customer and we usually fit the demands into weekly sprints" (p11). there are research gaps in looking for alternatives that align business processes, agile development practices, and uml modeling. documentation and legacy monolithic systems. promoting large-system modeling practices without processes that support documentation has been a challenge for decades. there may also be a cultural tendency to assume that the status quo is the only possible path. the absence of design documentation complicates restructuring legacy monolithic systems into highly distributed systems, such as those following the microservice architecture. legacy systems typically have dozens of tightly coupled subsystems that interact to provide different services for internal and external customers within companies. fitzgerald and stol (2017) point out that, in the absence of documentation, only the tacit knowledge of the software engineers who work in the different teams can be relied on. modeling legacy systems by creating a "big picture view" is still hard to implement due to their size, usually hundreds of thousands of lines of code. continuous updates to these models can be very challenging. the multi-view modeling of uml allows updating complementary models, such as class diagrams and sequence diagrams.
this can lead to inconsistencies between such models (kretschmer et al., 2021; khelladi et al., 2019; reder and egyed, 2013). 5.3 gamification of modeling software gamification can be defined as "the use of game design elements in non-game contexts" (deterding et al., 2011; huotari and hamari, 2017; liu et al., 2017). this technique uses the philosophy, elements, and mechanics of game design in non-game environments, aiming to bring all the positive aspects they provide. the current literature recognizes the benefits of applying gamification in software engineering practice. however, how to design and use gamification in the context of modeling applied to industrial needs is still an open question. as far as we know, only a few studies on the application of gamification in software engineering practices are available, most of which are related to broader contexts (porto et al., 2020; pedreira et al., 2015; ren et al., 2020). due to the related theoretical and practical difficulties, learning to use the full potential of uml can be a complex task, which makes developers feel discouraged and less engaged over time. this scenario could lead, for example, to the development of incomplete, decontextualized, and poor-quality models. lange et al. (2006) reinforce that this issue brings potential risks of misinterpretation and miscommunication, thus reducing software quality. therefore, finding configurations that favor developer practices, generate engagement, and, consequently, increasingly effective uml models is one of the main challenges encountered in the industry today. given this scenario, gamification emerges as a possible alternative to mitigate these problems, enhancing the adoption of uml, improving the models generated by developers, and generating high-quality software. there is no clear and commonly accepted taxonomy of game elements (pedreira et al., 2015). shpakova et al. (2016) proposed a unified view of the different classifications, which summarizes gamification in three dimensions: components, mechanics, and dynamics. components are the basic building blocks of gamification. they represent the objects that users see and interact with, such as badges, levels, and points. mechanics define the game as a rules-based system, specifying how everything behaves and how the player can interact with the game. dynamics are the top level of gamification elements. they include all aspects of the game that cannot be implemented and managed directly and are related to users' emotional responses (e.g., progression, exploration). the success of gamifying a particular non-game context depends heavily on the gamification design choices for those three dimensions. several research efforts have focused on identifying the phases that make up a gamification project (mora et al., 2015; webb, 2013). however, as with the taxonomy of gamification elements, there are no commonly accepted phases; they can vary in number and in the terminology used. in software development, developers' performance in terms of productivity or quality may relate to the number of artifacts they produce and how good those artifacts are. however, while performance is often a quantitative and objective metric for assessing the impact of gamification on users' activities in other out-of-game contexts, in software development performance is usually subjective.
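the three-dimension view can be made tangible with a small sketch (ours; the element names, actions, and point values are invented) of how a gamified modeling environment could be described along the components/mechanics/dynamics split:

from dataclasses import dataclass, field

@dataclass
class GamificationDesign:
    # toy container for the three dimensions summarized by shpakova et al. (2016)
    components: list = field(default_factory=list)   # visible objects: badges, levels, points
    mechanics: dict = field(default_factory=dict)    # rules: player action -> points awarded
    dynamics: list = field(default_factory=list)     # targeted emotional responses

# hypothetical design for a gamified uml editor
design = GamificationDesign(
    components=["badges", "levels", "points", "leaderboard"],
    mechanics={"complete_class_diagram": 10, "fix_model_inconsistency": 5},
    dynamics=["progression", "exploration"],
)

def score(actions, mechanics):
    # mechanics turn recorded modeling actions into a point total
    return sum(mechanics.get(action, 0) for action in actions)

print(score(["complete_class_diagram", "fix_model_inconsistency"], design.mechanics))  # 15

note how the design choices live in the data: tuning which modeling actions earn points is what shapes engagement, while the dynamics remain emergent rather than directly implemented.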
this article conjectures that inserting gamification techniques, such as feedback, progress, and challenges, into software modeling could help mitigate the issues of adopting uml modeling. for example, the incompleteness of uml models is a critical problem (lange et al., 2006; fernández-sáez et al., 2018). gamification techniques such as challenges, points, feedback, and progress could motivate developers to create more complete models, for example in exchange for points. a ranking system could be created to order software teams by the quality of the models they create. in addition, constant feedback during model editing could foster learning and stimulate modeling. researchers can carry out empirical studies to analyze the integration between gamification and software modeling based on the factors mentioned in rq1 and rq2. that would increase the perception of benefits by practitioners (rq3) and the frequency of use (rq4). therefore, the use of gamification techniques can motivate developers, enhance the quality of the created uml models, and foster learning. 5.4 assessing and grading uml diagrams before using uml models, practitioners need to learn the structural and behavioral diagrams available in uml. also, in educational/training contexts, students (or practitioners under training) submit their diagrams for assessment and grading. university courses worldwide teach uml modeling, to some extent, as the standard language for modeling software. additionally, uml is still a well-known language when practitioners need to model software systems. moreover, universities are increasingly adopting a learning-by-doing approach and holding online classes with a high number of students. in this context, students need to practice through hands-on exercises and real-world tasks. instructors must find an efficient mechanism to fairly and equitably assess student projects and assignments. in addition, assessments must enable rapid feedback and provide learners with instructions on how to overcome their deficiencies or limitations. imagine that an instructor needs to train 120 people in geographically distributed teams. the instructor provides an exercise in which each learner needs to design 10 uml class diagrams, and must then provide feedback on the resulting 1,200 uml class diagrams (120 learners × 10 diagrams) within two days of delivery. the short time to evaluate such a high number of diagrams makes the teaching and learning process difficult. therefore, the manual assessment of uml models proves to be a very costly and subjective activity, creating friction in the practice-assessment-learning feedback loop involving students and instructors. this reality is not exclusive to universities; on the contrary, it is found anywhere the teaching-learning cycle of uml models needs to happen quickly and with a relatively high number of learners. some tools and approaches (vesin et al., 2018; bian et al., 2019; stikkolorum et al., 2019) have been proposed in recent years. for example, sdmetrics (https://www.sdmetrics.com/) presents a set of metrics for uml models but does not compute the differences between the rubric and the uml model created by the learner. the modelguru approach (http://modelguru.snotra.com.br/) goes a little beyond sdmetrics, computing students' grades using object-oriented measures of design size, coupling, and complexity. vesin et al. (2018) came up with a new integrated tool to support the evaluation of uml models produced by students. bian et al. (2019) introduced a grading process based on syntactic, semantic, and structural matching that computes grades by comparing students' models with the desired model. in a different approach, stikkolorum et al. (2019) presented an exploratory study on machine learning for grading uml diagrams. however, a streamlined approach for grading uml diagrams based on syntactic, semantic, and structural criteria is still lacking.
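to illustrate the flavor of such criteria-based grading, the sketch below (our illustration only; the rubric format and weights are invented, and real approaches such as bian et al.'s use much richer matching) reduces a rubric and a submission to sets of named elements and grades purely at the syntactic level:

def grade(rubric, submission, weights):
    # fraction of expected elements found, weighted per criterion
    total = 0.0
    for criterion, expected in rubric.items():
        found = len(expected & submission.get(criterion, set()))
        total += weights[criterion] * (found / len(expected) if expected else 1.0)
    return total

rubric = {
    "classes": {"Order", "Customer", "Invoice"},
    "associations": {"Customer-Order", "Order-Invoice"},
}
submission = {
    "classes": {"Order", "Customer"},        # Invoice is missing
    "associations": {"Customer-Order"},      # Order-Invoice is missing
}
weights = {"classes": 0.6, "associations": 0.4}

print(f"grade: {grade(rubric, submission, weights):.2f}")  # 0.60 = 0.6*(2/3) + 0.4*(1/2)

semantic matching (e.g., recognizing "Client" as "Customer") and structural matching (comparing the shape of the diagram rather than names) would have to layer on top of this, which is precisely where the open research challenge lies.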
the use of machine learning also emerged as a trend and a new avenue to be explored. lastly, we outline the need for the scientific community to explore three objectives (farias and silva, 2020): (1) provide a tool to streamline the process of managing rubrics for grading uml diagrams; (2) allow students to get faster, more objective, and itemized feedback on their submissions; and (3) ultimately, enhance the practicing-grading-learning feedback loop associated with designing uml diagrams. 5.5 practical implication when software development teams constantly change source code and revise uml models to keep them up to date, the effort engineers put in can make the difference between adopting uml models throughout the development process or not. from our findings, updating and synchronizing models with source code appears to be one of the major impediments to the broader use of uml modeling. rather than being easy and intuitive, study participants point to model update and synchronization as a highly time-consuming and error-prone process. still, the need to update and synchronize uml models attracts the spotlight as organizations increasingly adopt devops and agile practices in globally distributed development teams. therefore, updating and synchronizing (upsync) uml models with source code emerges as a critical requirement to leverage uml adoption in real-world settings. the ability to "upsync" uml models can be seen as the means by which modern development teams (adopting devops and agile practices) update uml's structural and behavioral models to accommodate new design decisions or requirements changes. we conjecture that the greater the upsync, the better the quality of the source code. this paves the way for the scientific community to propose friendly round-trip engineering approaches, in which existing uml models can be transformed into source code and then converted back, combined with the integrated development environments used by development teams. from that perspective, updating and synchronizing models helps improve the software system under maintenance. previous empirical studies (dzidek et al., 2008) have shown that using uml models improves source code quality and reduces bugs. for this, not only are robust round-trip engineering approaches needed, but also improvements that span the agile development process as a whole. for example, scrum-based development processes can have automated tasks at the end of each sprint to update and synchronize uml models, as sketched below. practical research implication: our findings highlight that the adoption of uml modeling in practice is hindered by the difficulty of updating and synchronizing models with the source code. current development processes adopt source code as the primary artifact, thus demanding that new cost-effective updating and synchronization approaches be proposed.
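a minimal sketch of such a sprint-end task (ours; the paths are hypothetical, and real round-trip engineering would rely on a dedicated reverse-engineering tool rather than this toy extraction) regenerates a plantuml class listing from python sources so that the structural model never drifts far from the code:

import ast
from pathlib import Path

def classes_to_plantuml(source_dir):
    # walk the sources and emit one plantuml class per python class,
    # plus the inheritance edges we can resolve by name
    lines = ["@startuml"]
    if source_dir.exists():
        for path in sorted(source_dir.rglob("*.py")):
            tree = ast.parse(path.read_text(encoding="utf-8"))
            for node in ast.walk(tree):
                if isinstance(node, ast.ClassDef):
                    lines.append(f"class {node.name}")
                    for base in node.bases:
                        if isinstance(base, ast.Name):
                            lines.append(f"{base.id} <|-- {node.name}")
    lines.append("@enduml")
    return "\n".join(lines)

if __name__ == "__main__":
    # hypothetical sprint-end job: overwrite the checked-in structural model
    model_path = Path("docs/models/classes.puml")
    model_path.parent.mkdir(parents=True, exist_ok=True)
    model_path.write_text(classes_to_plantuml(Path("src")), encoding="utf-8")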
although upsyncing models sounds like a promising trend, the scientific community needs to evaluate future proposed techniques and carry out empirical studies to investigate their impact on the quality of uml models and source code, as well as on practitioners' satisfaction in real-world settings. 6 threats to validity this section discusses the possible threats to the study's validity. internal validity. internal validity is related to issues that may affect the causal relationship between treatment and outcome. threats to internal validity include instrumentation and selection threats. the main points affecting our study's internal validity refer to the participants' profiles and experience. when analyzing the profile of the participants, as presented in section 4.1, around 30% of them have low (up to 4 years) general experience, low experience with software modeling, and low experience with software development. this is probably because about half of that 30% group has a lower level of education than the others or is still attending an undergraduate program. in addition, many of these participants may not yet have studied uml in their undergraduate degrees. also, the survey question offered no option corresponding to 2 to 3 years of experience. however, due to the sample size and the complementary interviews we conducted, we believe the collected data are not affected by this threat. another internal threat is linked to the random process of selecting participants for the interviews, which may have caused a potential similarity in the profile of the interviewed participants; thus, a selection bias may threaten the validity of the conclusions. although the interviewed participants work on software development in the fields of education, agribusiness, e-commerce, government, trade, product export, and finance, we recognize that the qualitative data could be further explored with greater participation of professionals from other sectors. still, the survey participants cover a wider variety of sectors. external validity. external validity concerns the ability to generalize the results beyond the actual study. although the demographic data of our sample are diversified, generalizing the results to the entire population may not be adequate. in our study, participants belonged to a geographic variety and worked in companies of different domains and sizes. however, we cannot be sure that this sample is representative of the sector in general. we understand that these threats are always present in industrial research. reliability. reliability focuses on the replicability of results by other researchers. this study has a repository with the collected data and an online form, both freely accessible. 7 conclusions and future work this article presented an exploratory survey on how practitioners have used uml modeling in the brazilian industry. in total, 376 employees from 210 information technology companies answered an online questionnaire about the factors affecting use, difficulty and frequency of use, perceived benefits, and contextual factors that prevent the adoption of uml models. in addition, we interviewed 20 randomly chosen participants from the survey pool using a semi-structured interview protocol as a follow-up investigation to triangulate with the survey data.
in summary, the results show that 74.8% of the participants do not use uml frequently. participants who reported not using uml models attributed this to factors such as continuous delivery practices, time constraints, lack of knowledge about modeling, company culture, and the ever-present difficulty of keeping the models up to date and synchronized with each other and with the source code. the results of this research reinforce some evidence already found in the literature concerning the use of uml (gorschek et al., 2014; petre, 2014). in general, most people know uml but do not use it in their projects. these results can help professionals understand where to invest to avoid increased development spending and provide a foundation to motivate software developers to design uml diagrams throughout development cycles, which would facilitate, for example, maintenance tasks. future work should focus on investigating more aspects related to uml practice, such as the possibilities of using uml in agile teams/organizations, whether teaching methodologies in academia influence the practices in the software industry, and how gamification can be applied to software modeling practices. finally, we hope that the issues outlined throughout the article will encourage other researchers to replicate our study in different circumstances and that this work represents a solid step in a more ambitious agenda to improve software engineering practices.

references

akdur, d., demirörs, o., and garousi, v. (2017). characterizing the development and usage of diagrams in embedded software systems. in 2017 43rd euromicro conference on software engineering and advanced applications (seaa), pages 167–175. ieee.
akdur, d., say, b., and demirörs, o. (2021). modeling cultures of the embedded software industry: feedback from the field. software and systems modeling, 20(2):447–467.
bian, w., alam, o., and kienzle, j. (2019). automated grading of class diagrams. in 2019 acm/ieee 22nd international conference on model driven engineering languages and systems companion (models-c), pages 700–709. ieee.
böhm, w., junker, m., vogelsang, a., teufl, s., pinger, r., and rahn, k. (2014). a formal systems engineering approach in practice: an experience report. in proceedings of the 1st international workshop on software engineering research and industrial practices, pages 34–41.
bucchiarone, a., ciccozzi, f., lambers, l., pierantonio, a., tichy, m., tisi, m., wortmann, a., and zaytsev, v. (2021). what is the future of modeling? ieee software, 38(2):119–127.
chaudron, m. r., heijstek, w., and nugroho, a. (2012). how effective is uml modeling? software & systems modeling, 11(4):571–580.
chen, l. (2015). continuous delivery: huge benefits, but challenges too. ieee software, 32(2):50–54.
chen, l. (2017). continuous delivery: overcoming adoption challenges. journal of systems and software, 128:72–86.
cicchetti, a., ciccozzi, f., and carlson, j. (2016). software evolution management: industrial practices. in me@models, pages 8–13. citeseer.
ciccozzi, f., malavolta, i., and selic, b. (2019). execution of uml models: a systematic review of research and practice. software & systems modeling, 18(3):2313–2360.
deterding, s., dixon, d., khaled, r., and nacke, l. (2011). from game design elements to gamefulness: defining "gamification". in 15th int. academic mindtrek conference: envisioning future media environments, pages 9–15.
dori, d. (2002). why significant uml change is unlikely. communications of the acm, 45(11):82–85.
dzidek, w. j., arisholm, e., and briand, l. c. (2008). a realistic empirical evaluation of the costs and benefits of uml in software maintenance. ieee transactions on software engineering, 34(3):407–432.
elazhary, o., werner, c., li, z. s., lowlind, d., ernst, n. a., and storey, m.-a. (2021). uncovering the benefits and challenges of continuous integration practices. ieee transactions on software engineering.
farias, k., garcia, a., whittle, j., von flach garcia chavez, c., and lucena, c. (2015). evaluating the effort of composing design models: a controlled experiment. software & systems modeling, 14(4):1349–1365.
farias, k., gonçales, l., bischoff, v., da silva, b. c., guimarães, e. t., and nogle, j. (2018). on the uml use in the brazilian industry: a state of the practice survey (s). in seke, pages 372–371.
farias, k. and silva, b. c. d. (2020). what's the grade of your diagram? towards a streamlined approach for grading uml diagrams. in 23rd acm/ieee international conference on model driven engineering languages and systems: companion proceedings, pages 1–2.
fernández-sáez, a. m., caivano, d., genero, m., and chaudron, m. r. (2015). on the use of uml documentation in software maintenance: results from a survey in industry. in 2015 acm/ieee 18th int. conf. on model driven engineering languages and systems (models), pages 292–301. ieee.
fernández-sáez, a. m., chaudron, m. r., and genero, m. (2018). an industrial case study on the use of uml in software maintenance and its perceived benefits and hurdles. empirical software engineering, 23(6):3281–3345.
fitzgerald, b. and stol, k.-j. (2017). continuous software engineering: a roadmap and agenda. journal of systems and software, 123:176–189.
forward, a. and lethbridge, t. c. (2008). problems and opportunities for model-centric versus code-centric software development: a survey of software professionals. in proceedings of the 2008 international workshop on models in software engineering, pages 27–32.
gorschek, t., tempero, e., and angelis, l. (2014). on the use of software design models in software development practice: an empirical investigation. journal of systems and software, 95:176–193.
heldal, r., pelliccione, p., eliasson, u., lantz, j., derehag, j., and whittle, j. (2016). descriptive vs prescriptive models in industry. in proceedings of the acm/ieee 19th international conference on model driven engineering languages and systems, pages 216–226.
ho-quang, t., hebig, r., robles, g., chaudron, m. r., and fernandez, m. a. (2017). practices and perceptions of uml use in open source projects. in 39th icse: software engineering in practice track, pages 203–212. ieee.
huotari, k. and hamari, j. (2017). a definition for gamification: anchoring gamification in the service marketing literature. electronic markets, 27(1):21–31.
hutchinson, j., rouncefield, m., and whittle, j. (2011a). model-driven engineering practices in industry. in proceedings of the 33rd international conference on software engineering, pages 633–642.
hutchinson, j., whittle, j., rouncefield, m., and kristoffersen, s. (2011b). empirical assessment of mde in industry. in proceedings of the 33rd international conference on software engineering, pages 471–480.
jackson, d. (2019). alloy: a language and tool for exploring software designs. commun. acm, 62(9):66–76.
júnior, e., farias, k., and silva, b. (2021). a survey on the use of uml in the brazilian industry. in brazilian symposium on software engineering, pages 275–284.
khelladi, d. e., kretschmer, r., and egyed, a. (2019). detecting and exploring side effects when repairing model inconsistencies. in 12th acm int. conf. on software language engineering, pages 113–126.
kitchenham, b. a. and pfleeger, s. l. (2008). personal opinion surveys. in guide to advanced empirical software engineering, pages 63–92. springer.
kobryn, c. (2002). will uml 2.0 be agile or awkward? communications of the acm, 45(1):107–110.
kretschmer, r., khelladi, d. e., lopez-herrejon, r. e., and egyed, a. (2021). consistent change propagation within models. software and systems modeling, 20(2):539–555.
kuhn, a., murphy, g. c., and thompson, c. a. (2012). an exploratory study of forces and frictions affecting large-scale model-driven development. in int. conf. on model driven engineering languages and systems, pages 352–367. springer.
lange, c. f., chaudron, m. r., and muskens, j. (2006). in practice: uml software architecture and design description. ieee software, 23(2):40–46.
laukkanen, e., itkonen, j., and lassenius, c. (2017). problems, causes and solutions when adopting continuous delivery—a systematic literature review. information and software technology, 82:55–79.
liebel, g., marko, n., tichy, m., leitner, a., and hansson, j. (2018). model-based engineering in the embedded systems domain: an industrial survey on the state-of-practice. software & systems modeling, 17(1):91–113.
liu, d., santhanam, r., and webster, j. (2017). toward meaningful engagement: a framework for design and research of gamified information systems. mis quarterly, 41(4).
mora, a., riera, d., gonzalez, c., and arnedo-moreno, j. (2015). a literature review of gamification design frameworks. in 2015 7th international conference on games and virtual worlds for serious applications (vs-games), pages 1–8. ieee.
neto, j. c., bento, l. h. t. c., oliveirajr, e., and souza, s. d. r. s. (2021). are we teaching uml according to what it companies need? a survey on the são carlos-sp region. in anais do simpósio brasileiro de educação em computação, pages 34–43. sbc.
omg (2017). uml: infrastructure specification. https://www.omg.org/spec/uml/2.5.1/pdf.
ozkaya, m. and erata, f. (2020). a survey on the practical use of uml for different software architecture viewpoints. information and software technology, 121:106275.
pedreira, o., garcía, f., brisaboa, n., and piattini, m. (2015). gamification in software engineering–a systematic mapping. information and software technology, 57:157–168.
petre, m. (2013). uml in practice. in 2013 35th international conference on software engineering (icse), pages 722–731. ieee.
petre, m. (2014). no shit or oh, shit!: responses to observations on the use of uml in professional practice. software & systems modeling, 13(4):1225–1235.
porto, d., jesus, g., ferrari, f., and fabbri, s. (2020). initiatives and challenges of using gamification in software engineering: a systematic mapping. arxiv preprint arxiv:2011.07115.
reder, a. and egyed, a. (2013). determining the cause of a design model inconsistency. ieee transac. on software engineering, 39(11):1531–1548.
ren, w., barrett, s., and das, s. (2020). toward gamification to software engineering and contribution of software engineer. in 4th int. conf. on management engineering, software engineering and service sciences, pages 1–5.
rubert, m. and farias, k. (2022). on the effects of continuous delivery on code quality: a case study in industry. computer standards & interfaces, 81:103588.
scanniello, g., gravino, c., genero, m., cruz-lemus, j., and tortora, g. (2014). on the impact of uml analysis models on source-code comprehensibility and modifiability. acm tosem, 23(2):1–26.
shpakova, a., dörfler, v., and macbryde, j. (2016). gamification and innovation: a mutually beneficial union. in bam 2016: 30th annual conference of the british academy of management.
stikkolorum, d. r., van der putten, p., sperandio, c., and chaudron, m. (2019). towards automated grading of uml class diagrams with machine learning. in bnaic/benelearn.
störrle, h. (2017). how are conceptual models used in industrial software development? a descriptive survey. in 21st int. conf. on evaluation and assessment in software engineering, pages 160–169.
thomas, d. (2004). mda: revenge of the modelers or uml utopia? ieee software, 21(3):15–17.
usman, m., felderer, m., unterkalmsteiner, m., klotins, e., mendez, d., and alégroth, e. (2020). compliance requirements in large-scale software development: an industrial case study. in int. conf. on product-focused software process improvement, pages 385–401. springer.
vesin, b., klašnja-milićević, a., mangaroska, k., ivanović, m., jolak, r., stikkolorum, d., and chaudron, m. (2018). web-based educational ecosystem for automatization of teaching process and assessment of students. in proceedings of the 8th international conference on web intelligence, mining and semantics, pages 1–9.
webb, e. n. (2013). gamification: when it works, when it doesn't. in international conference of design, user experience, and usability, pages 608–614. springer.
wohlin, c., runeson, p., höst, m., ohlsson, m. c., regnell, b., and wesslén, a. (2012). experimentation in software engineering. springer science & business media.
xavier, a., martins, f., pimentel, r., and carvalho, d. (2019). aplicação da uml no contexto das metodologias ágeis. in anais do vi encontro nacional de computação dos institutos federais. sbc.

journal of software engineering research and development, 2023, 11:8, doi: 10.5753/jserd.2023.2671 this work is licensed under a creative commons attribution 4.0 international license.
identification and management of technical debt: a systematic mapping study update maría isabel murillo [ university of costa rica | maria.murilloquintana@ucr.ac.cr ] gustavo lópez [ university of costa rica | gustavo.lopezherrera@ucr.ac.cr ] rodrigo spínola [ salvador university | rodrigo.spinola@unifacs.br ] julio guzmán [ university of costa rica | julio.guzman@ucr.ac.cr ] nicolli rios [ federal university of rio de janeiro | nicolli@cos.ufrj.br ] alexia pacheco [ university of costa rica | alexia.pacheco@ucr.ac.cr ] abstract technical debt is a concept used to describe the lack of good practices during software development, leading to several problems and costs. identification and management strategies can help reduce these difficulties. in a previous study, alves et al. (2016) analyzed the research landscape of such strategies from 2010 to 2014. this paper replicates and updates their study to explore the evolution of the technical debt identification and management research landscape over a decade, including literature from 2010 until 2022. we analyzed 117 papers from the acm digital library, ieee xplore, science direct, and springer link. newly suggested strategies include automatically identifying admitted debt in comments, commits, and source code. between 2015 and 2022, more empirical evaluations have been performed, and the general research focus has shifted to a more holistic approach. therefore, the research area has evolved and reached a new level of maturity compared to the previous results from alves et al. (2016). not only are code aspects considered for technical debt, but other aspects have also been investigated (e.g., models for the development process). keywords: technical debt management, technical debt identification, software development process. 1 introduction technical debt (td) is the consequence of taking shortcuts during the software development process, providing short-term benefits but potentially bringing more difficulties and costs in later stages (izurieta et al., 2012). when developers take these shortcuts, deficiencies may be inserted. the cost of fixing previous work increases as development continues because correcting the defects becomes more complex when technical debt is not paid in a timely manner (akbarinasaji & bener, 2016). the interest is the additional cost that may have to be assumed because of the delayed payment. on the other hand, the principal is the amount over which interest is paid; in technical debt, the principal is the original cost of fixing the software (ampatzoglou et al., 2015). when developers cannot pay the existing technical debt, bankruptcy may occur (akbarinasaji & bener, 2016). several activities can help manage debt during the software development process. management activities may include measuring, prioritizing, preventing, monitoring, documenting, communicating, and paying the debt (li et al., 2015). the purpose of performing these actions is to avoid major problems that may lead to significant consequences, such as the failure of software projects. management strategies can help determine the appropriate time to pay the debt before the interest becomes very costly. consequently, it is possible to make faster deliveries in a controlled manner (freire et al., 2020). also, strategies even allow recognizing whether the debt needs to be paid at all: there may be cases when there is no need to pay it, for example, when there is certainty that a module will not change in the future (guo et al., 2014).
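a toy calculation (ours; the numbers and the compounding assumption are purely illustrative, not taken from the cited works) helps fix this vocabulary: the principal is what fixing the debt costs now, and the interest is the extra cost accrued by deferring the fix.

def cost_of_deferring(principal, interest_rate, releases):
    # illustrative compounding model: each release the fix is deferred,
    # the repair cost grows by interest_rate
    return principal * (1 + interest_rate) ** releases

principal = 10.0      # person-days to fix the shortcut today
interest_rate = 0.15  # assumed extra rework per release while the debt remains
for releases in (0, 3, 6):
    cost = cost_of_deferring(principal, interest_rate, releases)
    print(f"cost after {releases} releases: {cost:.1f} person-days")
# prints 10.0, 15.2, and 23.1: everything above the 10.0 principal is interest,
# which timely payment would have avoided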
technical debt management is complex because there may be uncertainty during software development. also, many factors must be considered for its management, such as present and future costs, as well as the implied risks (guo et al., 2014). the identification of td comprises the activities or actions taken to detect the presence of debt in software artifacts. technical debt identification is the first step toward managing it and avoiding its possibly high costs (guo et al., 2014); in other words, td identification is essential to prevent the unwanted consequences of debt. alves et al. (2016) investigated the technical debt identification and management landscape between 2010 and 2014 by analyzing 100 primary studies. they found that strategies mainly addressed types of technical debt associated with source code. nevertheless, few empirical evaluations demonstrated the proposals' actual benefits, limitations, and applicability. alves et al. (2016) also presented an initial taxonomy of technical debt types and a list of indicators for their identification. in that study, td management was understood as the activities that follow its identification. the findings of alves et al. (2016) provided valuable contributions for both researchers and practitioners, while also characterizing the state of the art in the research area. in this paper, we update the mapping study of alves et al. (2016) to find proposals made between 2015 and 2022 about managing and identifying technical debt. additionally, this paper provides a comparison with the previous results obtained by alves et al. (2016) and an analysis of the research landscape comprising more than a decade. keeping the results updated is essential because it helps to understand the evolution of the research topic and new findings (nepomuceno & soares, 2019). we apply the same research questions, search string, search strategy, sources, and inclusion and exclusion criteria as the previous systematic mapping, making this work an update of it. we intend to answer the same research questions (without adaptations) since changing them could be considered a new mapping instead of an update (nepomuceno & soares, 2019). likewise, we considered td management as the activities following debt identification to remain conceptually consistent. the main difference in the protocol was the time-frame delimitation, since our update only included papers published after the original study's years of inclusion. we also provide a more detailed definition than the original study of what we considered "general" technical debt papers for their classification. furthermore, two original authors assisted in the update process to ensure compatibility between the concepts, procedures, and results of both systematic studies. the high-level research question we aim to answer is:

• what strategies have been proposed to identify or manage technical debt in software projects?

similarly, the complementary research questions are:

• rq1. what are the types of technical debt found in the literature?
• rq2. what are the strategies proposed to identify technical debt?
  o rq2.1. which empirical evaluations have been performed?
  o rq2.2. which artifacts and data sources have been proposed to identify technical debt?
  o rq2.3. which software visualization techniques have been proposed to identify technical debt?
• rq3. what strategies have been proposed for the management of technical debt?
  o rq3.1. which empirical evaluations have been performed?
  o rq3.2. which software visualization techniques have been proposed to manage technical debt?

we analyzed 117 primary studies and identified new proposals and indicators between 2015 and 2022. empirical evaluations of the analyzed papers include case studies, controlled experiments, and action research, but more evaluations are still required. also, we found that technical debt visualization is still an area that researchers have not extensively studied. this is a relevant finding since visualization techniques may be useful to aid decision-making about td. this paper's results benefit researchers since we provide knowledge about the state of the art and about open problems that constitute future research opportunities. it is also helpful to practitioners since we present identification, management, and visualization strategies applicable to software projects to prevent the unwanted consequences of technical debt. future research opportunities include investigating new ways to use developers' knowledge about debt (not only through commits and comments) and exploring new strategies with a less technical approach (such as incentives and td guilds). moreover, analyzing the applicability of strategies in different contexts, such as public or private organizations, is a future research opportunity. the structure of this article is as follows. section 2 describes previous literature related to this work. section 3 presents the methodology used to perform the literature review. section 4 presents the results obtained. section 5 includes the discussion, section 6 the threats to validity, and section 7 covers the conclusions. 2 related work this section presents several previous works by other authors, particularly those that have addressed the identification and management of technical debt during software development. table 1 gives an overview of each work's contributions. rios et al. performed a tertiary study to identify the state of the art regarding technical debt between 2012 and 2018 (rios et al., 2018). the authors studied the understanding of the technical debt concept and the research efforts on its identification and management. they found nine secondary studies about technical debt management and two regarding its identification. until 2018, there was little knowledge about the benefits and limitations of the proposed management strategies and indicators. another systematic mapping studied the concept of technical debt and its management activities and tools (li et al., 2015). the authors analyzed 94 studies published between 1992 and 2013 and identified activities and tools for technical debt. some of the mentioned tools are checkstyle, debtflag, sonarqube, codevizard, and findbugs. likewise, the activities include code analysis, cost categorization, calculation models, code metrics, and the portfolio approach. also, the authors proposed a classification of technical debt types and argued that more literature is needed about what should not be considered technical debt. the work of fernández-sánchez et al. consisted of a systematic mapping to identify the elements to consider for the management of technical debt, based on the literature until 2015 (fernández-sánchez et al., 2017). the authors identified the main aspects of technical debt management.
they found that the business organizational perspective has not been addressed much in the literature, while research has focused more on the technical point of view. another systematic literature review focused on technical debt in the digital government area (nielsen et al., 2020). this paper aimed to discover which fields of technical debt management are being studied and the focus of the performed research. the authors analyzed 31 papers, a third of which proposed a tool, method, technique, or model for technical debt management. the authors found several gaps, including a lack of research on the public sector and a limited abstraction level of the analyses. they conclude that technical debt management is mainly studied either in open software projects or in the private sector. macit et al. performed a systematic mapping study regarding methods for identifying architectural debt, based on the analysis of 28 papers published between 2011 and 2020 (macit et al., 2020). the authors mention that architectural debt identification has been increasingly investigated in recent years; code mining and expert opinion are common methods. alfayez et al. performed a systematic literature review on technical debt prioritization (alfayez et al., 2020). the authors aimed to identify the current prioritization approaches, the decision factors, and the artifacts on which these approaches are based. a total of 23 papers published between 1992 and 2018 were analyzed. as a result, 24 strategies were found for technical debt prioritization. these approaches mainly addressed code, general, and design technical debt. lenarduzzi et al. performed a literature review regarding strategies and tools for technical debt prioritization (lenarduzzi et al., 2021). in this study, they analyzed 44 primary studies published until 2020. code, architecture, and design were the most frequently addressed types of technical debt. the authors found a lack of consensus on the factors to consider when prioritizing and measuring technical debt. they also show a lack of validated and reliable tools for technical debt prioritization. alves et al. present a systematic mapping regarding technical debt identification and management (alves et al., 2016). in that study, the authors analyzed 100 primary studies, discussed a taxonomy of td types, presented a list of the strategies found in the literature, and created a list of indicators that can help identify technical debt.

table 1. contributions by other authors.

li et al., 2015 (technical debt management, 1992–2013):
• analyzed the technical debt concept on 94 existing research efforts.
• proposed a classification of ten technical debt types.
• identified the quality attributes compromised by technical debt.
• determined activities and tools for technical debt management.

alves et al., 2016 (technical debt identification and management, 2010–2014):
• analyzed 100 papers and determined a classification for technical debt types.
• listed strategies to identify or manage technical debt.
• determined the empirical evaluations, artifacts, and data sources cited in the literature for technical debt identification and management.

fernández-sánchez et al., 2017 (elements to manage technical debt, 2010–2015):
• provided a taxonomy of elements for technical debt management by analyzing 63 papers.
• identified the proposed methods and techniques to manage technical debt.
• analyzed technical debt management elements from the perspective of stakeholders.

rios et al., 2018 (technical debt, 2012–2018):
• studied 13 secondary studies and their td research topics.
• proposed a taxonomy of technical debt types.
• identified activities, strategies, and tools to support technical debt management.

nielsen et al., 2020 (technical debt management in digital government, 2017–2020):
• analyzed 31 papers about technical debt management research in the public sector.
• determined a research agenda for the digital government area.

alfayez et al., 2020 (technical debt prioritization, 1992–2018):
• identified approaches and decision factors for technical debt prioritization by studying 23 papers.
• analyzed the type of human involvement and artifacts needed for technical debt prioritization.

lenarduzzi et al., 2021 (technical debt prioritization, 2011–2020):
• determined the prioritization strategies for technical debt by analyzing 44 primary studies.
• analyzed factors and measures considered for technical debt prioritization.
• identified tools for technical debt prioritization.
three previous studies analyzed and proposed a classification of technical debt types (alves et al., 2016; li et al., 2015; rios et al., 2018). however, there is still no consensus on these taxonomies. this paper does not aim to provide a consensus but to find out whether new td types have been mentioned recently that should be considered for new or refined taxonomies. more recent efforts in the research area were made by nielsen et al. (2020) and lenarduzzi et al. (2021), whose studies focused on technical debt management in the digital government area and on technical debt prioritization, respectively. this paper focuses on technical debt identification and management, a related but different scope from their contributions. alves et al. (2016) and li et al. (2015) made previous efforts specifically about technical debt identification or management. however, they analyzed literature published between 1992 and 2014. our study aims to replicate and update the work of alves et al. (2016) and to integrate the obtained results by including papers published between 2015 and 2022. this delimitation is the main difference between this work and the previous contributions made by other authors. the relevance of performing this study is justified by applying the framework proposed by mendes et al. (2020) for updating systematic literature reviews:

• does the previous study still address a current question? the high-level research question of this paper is: what strategies have been proposed to identify or manage technical debt in software projects? any software may contain technical debt issues, regardless of the developing company's size or resources. the consortium for information and software quality (cisq) reports that the cost of poor software quality in the us was at least $2.41 trillion and the accumulated technical debt principal about $1.52 trillion in 2022 (consortium for information & software quality, 2022). therefore, td remains an expensive issue. identifying and managing td is still a major problem in the software development industry. thus, investigating these topics is relevant for both practitioners and researchers.

• has the previous study had good access or use?
the work of alves et al. (2016) was published in the information and software technology journal and is fully available through the science direct library. by march 2023, the paper had 589 reads and 238 citations (according to researchgate metrics). thus, the previous study has good access and use.

• has the previous study used valid methods and was it well conducted? alves et al. (2016) based their methods on the standard process for conducting systematic mapping studies by petersen et al. (2008). they provide a full explanation of the study implementation (research questions, search strategy, selection criteria, and classification scheme). the study presents a clear view of each step's outcome. hence, it provides sufficient details and data to replicate the procedures. moreover, two of the original authors participated in the update process.

• are there any new relevant studies, methods, or new information? research on technical debt is constantly being published in different venues, such as conferences and journals. for example, the international conference on technical debt (techdebt) has been held annually since 2018, two years after the publication of the previous study in 2016. consequently, there are plenty of new publications on td.

• will the inclusion of new studies/information/data change the findings, conclusions, or credibility? since the publication of the previous study in 2016, the concepts and focus of the td research area have evolved. in this paper, we discuss these changes in detail. one of the most important aspects is the increase in research efforts that address technical debt management from a more holistic perspective.

by updating the previous study by alves et al. (2016), we provide the following contributions:

• an analysis of the technical debt identification and management research landscape between 2010 and 2022;
• an analysis of the previously proposed technical debt types and the identification of new potential types recently mentioned in the literature;
• a list of the strategies or techniques for technical debt identification, management, and visualization;
• an analysis of the empirical evaluations of the proposed methods, including the artifacts, programming languages, and data sources used;
• a discussion of technical debt concepts and their evolution from 2010 to 2022.

the contributions presented herein provide insights to both practitioners and researchers regarding the most recent proposals for identifying and managing technical debt. this may help further industry application of new proposals and the finding of new research opportunities. the following section presents the methodology applied to perform this study. 3 research method this paper aims to analyze the research landscape on technical debt identification and management. this section details the search strategy, the study selection process, and the synthesis methods. 3.1 research questions in this section, we present the rationale and importance of the research questions. • rq1. what are the types of technical debt found in the literature? this question aims to determine whether the literature describes new technical debt types different from those proposed by alves et al. (2016). also, we aim to know which types of technical debt have been most studied in the literature between 2015 and 2022. this research question is important because there is still no consensus on the different technical debt types.
we aim to analyze the evolution of these concepts between 2010 and 2022 by integrating our results with those provided by alves et al. (2016). however, the intent of this paper is not to establish a consensus but to find out the td types mentioned in the literature. • rq2. what are the strategies proposed to identify technical debt? this research question aims to determine new artifacts or data sources mentioned in the literature. also, we aim to know which artifacts and data sources are the most cited. this allows us to determine trends or changes in recent years. we also aim to analyze the empirical evaluations of previously mentioned strategies, since alves et al. (2016) describe the need for more assessments to determine the applicability of such strategies. visualization techniques for technical debt identification are also crucial because they may help communication between developers and stakeholders, affecting decision-making during software development. • rq3. what strategies have been proposed for the management of technical debt? this research question aims to determine the strategies for technical debt management and how they have been empirically tested to determine their applicability. also, we aim to identify the visualization techniques proposed for technical debt management. alves et al. (2016) found few visualization strategies; this study analyzes how this specific research topic has evolved from 2010 to 2022. 3.2 search strategy we retrieved papers from the databases acm digital library, ieee xplore, science direct, and springer link. we also consulted engineering village, scopus, citeseer, and dblp, but no papers were included from these libraries. since this paper updates the previous work of alves et al. (2016), we used the same search string:

("technical debt") and ("software")

this search string was used in all the sources, restricting the results to publications between 2015 and 2022. 3.2.1 inclusion criteria we considered papers that met the following inclusion criteria:

• address the identification or management of technical debt in the context of software development;
• explain one or more strategies, techniques, or activities for identifying or managing technical debt;
• were published between 2015 and 2022, since the previous work of alves et al. (2016) included papers from earlier years.

we also considered papers that address technical debt in a general manner or focus on a specific type of debt. moreover, we included those that either provided empirical proof of their proposal or only a theoretical description. 3.2.2 exclusion criteria only the most recent paper was considered when several pieces reported the same study, and each study was considered separately when multiple studies were contained in a single paper. also, we applied the following exclusion criteria:

• papers that do not specify how to identify or manage technical debt with a strategy, activity, or technique (we therefore excluded exploratory studies of technical debt management);
• papers in progress (incomplete) or those that do not provide full-text access;
• papers published before the year 2015;
• duplicate papers;
• papers published in a language other than english.

moreover, powerpoint presentations, reports, and abstract-only papers were not considered.
3.3 study selection the study selection followed these steps:

• search: the first step of the process was to perform the search using the defined search string on the databases (acm digital library, ieee xplore, science direct, engineering village, springer link, scopus, citeseer, and dblp). as a result, we found 2517 papers in total.
• identification (filter 1): the second step was removing duplicate papers and applying the exclusion and inclusion criteria. in total, 466 duplicate studies were identified, leaving 2051 articles (without duplicates); the mechanical part of this filtering is sketched at the end of this subsection.
• screening (filter 2): the next step was screening the articles. we read each of the 2051 titles and abstracts to find those that comply with the eligibility criteria, identifying 209 studies. in this step, 1852 papers were excluded because they did not fully comply with the inclusion criteria; this is explained by the generic search string, which returned articles that are not relevant to this study.
• inclusion and analysis (filter 3): all 209 papers were read in full at this stage. after reading them, only 111 were selected using the eligibility criteria. at this stage, we also extracted data from the selected papers, as described in the following subsection (3.4 synthesis methods).
• backward snowballing: during the final stage, we reviewed the references of each of the studies. as a result, we included seven more papers, which were analyzed and combined with the results of the study selection.

one researcher performed the search, identification, screening, and inclusion for every paper. later, two other researchers were randomly assigned a set of papers each to independently review each paper and extract the data. the results were compared and discussed in case of disagreement. the process was performed in mid-march 2022.
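purely to illustrate the mechanical part of this funnel (the record structure and sample data below are invented; title/abstract eligibility was of course judged manually), the deduplication and the year and language criteria can be expressed as a simple filter:

from dataclasses import dataclass

@dataclass(frozen=True)
class Record:
    title: str
    year: int
    language: str

def filter_records(records):
    # apply the mechanical exclusion criteria: duplicates, years outside
    # 2015-2022, and non-english papers; full-text eligibility stays manual
    seen = set()
    kept = []
    for r in records:
        key = r.title.strip().lower()
        if key in seen:
            continue  # exclusion: duplicate papers
        seen.add(key)
        if not 2015 <= r.year <= 2022:
            continue  # exclusion: outside the update's time frame
        if r.language != "english":
            continue  # exclusion: language other than english
        kept.append(r)
    return kept

sample = [
    Record("managing td with tool x", 2019, "english"),
    Record("Managing TD with Tool X", 2019, "english"),  # duplicate title
    Record("early td study", 2013, "english"),           # before 2015
]
print(len(filter_records(sample)))  # 1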
3.4 synthesis methods

from each of the included papers, we extracted six categories of information. table 2 summarizes the data variables collected.
• metadata: for a demographic characterization, we collected the title, authors, type and year of publication, and digital library of each included paper. these data were extracted as explicitly found in each corresponding database. we also considered two research topics, identification or management of technical debt, and recorded the corresponding topic for each paper.
• technical debt types: we documented the type of technical debt addressed by each paper. some papers explicitly mentioned the studied type of technical debt, but in others, this was implicit. also, many papers addressed technical debt without focusing on any specific type (77 in total). therefore, we classified the types as follows:
o direct: papers that explicitly mention the name of the type + debt.
o indirect: determined from phrases in the text, such as technical debt derived from issues on the documentation (documentation debt).
o general: papers that do not focus on a specific technical debt type, directly or indirectly, and consider only the concept of technical debt, such as a technical debt management approach.
• indicators: indicators are elements that help identify technical debt items (alves et al., 2016). starting from the indicators found by alves et al. (2016), we created a list of the indicators cited by the authors of each paper and their associated types of technical debt. a new indicator was created when the previous indicators did not fit what was mentioned in a paper. we also collected data on how these indicators were empirically tested, including the artifact in which they are identified and the data sources.
• management strategies: we extracted the management strategies described in each included paper. we used the same criterion as alves et al. (2016): to be considered a management strategy, a proposal must support decision-making about technical debt items. this definition includes activities for measuring, prioritizing, preventing, monitoring, documenting, and paying the debt. each strategy and its definition were collected as mentioned in each paper.
• evaluation studies: evaluations are needed to determine the feasibility of the proposed strategies. there are several types of evaluation studies; we classified them into case studies, controlled experiments, or ethnographic studies, using the same criteria as in the previous study (alves et al., 2016). we also documented the artifact considered, the programming language used, and the data sources used in each paper that performed an empirical evaluation.
• visualization techniques: several visualization techniques help understand the potential problems of technical debt in software projects. as in alves et al. (2016), we extracted the visualization techniques for technical debt identification or management described in each included paper.

the aforementioned research method is based on the procedure performed by alves et al. (2016) in their study. in this paper, we aim to answer the same research questions by applying the same study selection and synthesis methods. however, this update's protocol has two main differences from the original study methodology. one is the time-frame delimitation: we only considered publications between 2015 and 2022, a restriction that was added to the search strategy criteria. moreover, we provide the definition of the "general" technical debt classification. the previous study refers to "type not specified", while we classify these papers as "general" technical debt to provide more clarity to the reader; however, both labels refer to the same set of papers (as described in the synthesis methods).

table 2. data collection variables and their purpose.
data collection variable | purpose
title | demographic characterization
author | demographic characterization
type of publication (workshop, conference, journal) | demographic characterization
year of publication | demographic characterization
digital library (database) | demographic characterization
research topic (identification or management) | demographic characterization
technical debt type | rq1
indicators | rq2
artifact considered (identification studies) | rq2
data source (identification studies) | rq2
management strategy (management studies) | rq3
evaluation type (if applicable: case studies, controlled experiments, ethnographic studies, action research) | rq2 and rq3
visualization techniques | rq2 and rq3
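read as a data structure, the table 2 variables amount to one flat record per included paper. the sketch below is a hypothetical rendering (field names are ours) intended only to clarify how the variables map to the research questions.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ExtractionRecord:
    """one record per included paper, mirroring the table 2 variables."""
    # metadata (demographic characterization)
    title: str
    authors: List[str]
    publication_type: str            # workshop, conference, journal
    year: int
    digital_library: str
    research_topic: str              # "identification" or "management"
    # rq1
    td_types: List[str] = field(default_factory=lambda: ["general"])
    # rq2 (identification studies)
    indicators: List[str] = field(default_factory=list)
    artifact: Optional[str] = None
    data_source: Optional[str] = None
    # rq3 (management studies)
    management_strategy: Optional[str] = None
    # rq2 and rq3
    evaluation_type: Optional[str] = None   # case study, controlled experiment, ...
    visualization_techniques: List[str] = field(default_factory=list)
```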
4 results

this section presents the integration of our results with those obtained by alves et al. (2016), whose study included 100 papers published between 2010 and 2014. our study analyzes 117 additional articles dating from 2015 to 2022 (see appendix a1). figure 1 shows the number of studies included by publication type and year.

figure 1. number of studies by year and publication type.

we searched the same databases using the same search string and applied the established selection criteria. papers were published in symposia, journals, magazines, workshops, and conferences. from 2010 to 2014, workshops and conferences were the most common publication types. between 2015 and 2022, the most common publication types were conferences and journals. the decline of workshop publications and the rise of conference publications suggest that the topic has matured over the years.

the number of publications on technical debt identification and management has been irregular during the last decade. overall, the number of articles included in conferences has been rising: in 2010 only two papers were from conferences, while this number increased to 17 in 2019. however, publications in 2020 and 2021 may have been affected by the coronavirus pandemic.

in this study, we performed the search on acm digital library, ieee xplore, science direct, springer link, engineering village, scopus, citeseer, and dblp. overall, since 2010 most papers have been published in ieee xplore and the acm digital library. however, the number of papers on springer link has increased considerably since 2015. figure 2 shows the number of studies by digital library.

4.1 technical debt types (rq1)

alves et al. (2016) proposed a taxonomy of technical debt that includes: design, architecture, documentation, test, code, defect, requirements, infrastructure, people, test automation, process, build, service, usability, and versioning debts. from 2010 to 2014, the most common technical debt types studied in the literature were design, architecture, and documentation. also, a high concentration of studies addressed test, code, and defect debt.

between 2015 and 2022, 77 studies did not focus on a particular type but addressed the topic in a general manner. in contrast, other papers focused on a specific technical debt type, such as code, design, or architecture. consequently, technical debt is increasingly studied with a holistic approach rather than as a set of distinct kinds of debt that need to be managed differently. figure 3 shows the number of papers included by type of technical debt.

of the 117 included papers (2015–2022), 34 addressed self-admitted technical debt, a concept commonly mentioned in the literature. self-admitted technical debt (satd) refers to situations in which developers are aware and admit that technical debt has been incurred; these scenarios differ from those in which no one is conscious that debt is present. when satd exists, the issues may correspond to various technical debt types, such as code, architecture, or documentation debt. for this reason, papers that addressed satd were classified into the general technical debt (td) category.

figure 2. number of included papers by digital library.

figure 3. number of included papers by technical debt type and year.

forty of the 117 papers addressed a specific technical debt type, as described by alves et al. (2016). from 2015 to 2022, architecture, code, and design debt were the most common types. however, there has been a significant reduction in studies focused on these types over the last seven years. for example, publications on architecture and code debt fell by nearly half, while design debt went from 42 publications to only five compared to the previous period (2010–2014).
we found no articles about documentation, people, build, service, or usability debt during the last seven years. these types of technical debt have not been studied as extensively as others, so they represent potential areas for further investigation in future work.

as a result of the literature review, we found two new types mentioned in the literature: security and elasticity debt. security debt refers to security issues in the software, such as vulnerabilities or exploitable weaknesses (izurieta et al., 2018). elasticity debt describes ineffective or inefficient resource provisioning resulting from the lack of dynamic adaptation to resource consumption (mera-gómez et al., 2016). these two types of technical debt have been mentioned in only a few studies; consequently, they cannot yet be considered widely accepted types of technical debt. both may be subtypes of existing types, requirements debt (security) and infrastructure debt (elasticity), and we classified the corresponding papers as such when performing the literature review.

4.2 technical debt identification (rq2)

an essential step for technical debt management is its identification. identification comprises activities or actions to detect the presence of debt in software artifacts. out of the 117 included papers, 47 addressed technical debt identification. we extracted the indicators and the type of technical debt associated with each paper.

indicators are symptoms that help identify technical debt items (alves et al., 2016). from 2010 to 2014, forty-five indicators were found and presented by alves et al. (2016). in this study, we found 11 indicators mentioned in the literature between 2015 and 2022. table 3 shows these eleven indicators together with the top 5 most common indicators presented previously (alves et al., 2016): code smells, documentation issues, software architecture issues, violation of modularity, and automatic static analysis issues. these indicators were either just mentioned or analyzed in the included papers.

the results show significant differences between the two periods. code smells were the most common indicator in previous years, while comments and commits were the most frequently mentioned during the last seven years. this is due to the considerable number of papers (34 in total) that addressed self-admitted technical debt in recent years, which used several strategies to analyze comments or commits to identify different types of technical debt, not only those related to source code. this represents a more holistic view, in which not only code issues are intended to be identified. authors have recently studied satd through natural language processing, neural networks, deep learning, and machine learning. satd may be identified by applying these approaches to commits, comments, and issue trackers, so that the detected debt can be further prioritized and managed.
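as a minimal sketch of the comment-based satd identification idea discussed above: a keyword match over source code comments is the simplest possible baseline. the cue list below is illustrative only; the surveyed papers use richer nlp and machine-learning models rather than fixed vocabularies.

```python
import re

# cue phrases commonly reported for satd in the comment-analysis literature;
# this list is illustrative, not the vocabulary of any surveyed paper
SATD_CUES = re.compile(r"\b(todo|fixme|hack|workaround|temporary|kludge)\b", re.IGNORECASE)

def flag_satd_comments(comments):
    """return the comments whose wording admits technical debt.

    a keyword match is only a weak baseline: the surveyed studies train
    nlp / deep-learning classifiers on labeled comments and commits.
    """
    return [c for c in comments if SATD_CUES.search(c)]

print(flag_satd_comments([
    "// todo: remove this hack once the new api is stable",
    "// computes the invoice total",
]))  # only the first comment is flagged
```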
4.2.1 evaluation studies

in recent years, most studies on technical debt identification have performed empirical evaluations through case studies, and the number of such studies has increased over the last seven years. a possible explanation is that the knowledge consolidated before 2015 provided the necessary foundations to perform empirical evaluations such as case studies. the execution of case studies helps provide more information about the context in which the different identification strategies are applicable. the growth in the number of case studies is relevant because multiple sources of empirical data are vital to generalize results. we also found a significant increase in the number of controlled experiments. figure 4 shows the number of papers by type of empirical evaluation.

figure 4. empirical evaluations on technical debt identification studies.

table 3. indicators organized by technical debt (td) type and period.
indicator | 2010–2014: # papers | 2010–2014: td types | 2015–2022: # papers | 2015–2022: td types
code smells | 52 | code, design | 1 | general td, architecture
documentation issues | 17 | documentation | - | -
software architecture issues | 9 | architecture | 9 | architecture
violation of modularity | 9 | architecture | - | -
automatic static analysis issues | 9 | code, design | 3 | architecture, code, general td
comments | 1 | documentation | 26 | code, requirements, general td
uncorrected known defects | 6 | defect, test | 1 | general td
immature software | - | - | 1 | general td
feature usage and maintenance costs | - | - | 1 | general td
insufficient resource provisioning | - | - | 1 | infrastructure
low external/internal quality | 1 | design | 1 | general td
software design issues | 4 | design | 1 | design

4.2.2 artifacts and data sources

we extracted the data source and the artifact considered in each paper that performed an empirical evaluation. figure 5 shows the number of studies by artifact. from 2010 to 2014, the most common artifact was source code. the obtained results show that source code remains the primary artifact used in empirical evaluations; this may be because static analysis tools can help for these purposes. however, the number of studies considering source code decreased from 58 between 2010 and 2014 to 39 from 2015 to 2022.

figure 5. number of studies by artifact considered for technical debt identification.

in recent years, researchers have started mining repositories to extract metadata about technical debt. alves et al. (2016) identified four different data sources, including cms (configuration management systems), software repositories, and bug tracking, with cms being the most used in that period. in contrast, we found six different data sources from 2015 to 2022. software repositories predominated, which makes sense since the most common artifact was source code. figure 6 shows the number of papers by the data source used.

4.2.3 visualization techniques

only two papers on technical debt identification mentioned a visualization technique: the assessment graph (shapochka & omelayenko, 2016) and the coupling probability matrix (l. xiao et al., 2016). the proposed methods are not mature because there is little validation. therefore, the visualization of technical debt is still an area that requires further investigation.

figure 6. number of papers by the data source for technical debt identification.
4.3 technical debt management (rq3)

the management of technical debt includes several activities to control debt during the software development process. these activities aim to avoid bankruptcy situations in which the debt becomes uncontrollable. out of the 117 included papers, 70 addressed technical debt management. this section presents the strategies proposed in the literature, the evaluation studies performed, and the visualization techniques mentioned.

4.3.1 strategies for managing technical debt

the first step for technical debt management is to identify its presence. then, several strategies can be used for its timely administration to reduce the impact of interest. table 4 shows the complete list of management strategies found during the literature review. the top 5 most studied strategies between 2015 and 2022 were the following:

• automated analysis of code issues: recent studies mention several tools that aid td management: sonargraph for analyzing software architecture (von zitzewitz, 2019), teamscale for analyzing software quality based on data from version control systems, issue trackers, and other tools (haas et al., 2019), sonarqube for analyzing code and obtaining several code metrics (baldassarre et al., 2020), and codescene for behavioral code analysis, which can be helpful for debt prioritization and communication with stakeholders (tornhill, 2018). these papers report code, architecture, test, and general technical debt management supported by tools that automatically identify code issues. the generated metrics or reports may be used by developers and stakeholders to prioritize, monitor, and perform the necessary management actions. one of the advantages of such tools is that they need little human intervention to measure several code issues while creating awareness that td exists.

• calculation of technical debt (td) interest: interest is the additional cost that will have to be assumed because of delayed payment. authors have proposed methods for its calculation to prioritize technical debt items according to the interest that will have to be assumed (chatzigeorgiou et al., 2015; falessi & reichel, 2015). this allows decision-making about the appropriate moment to pay the debt, depending on the acceptable costs of each scenario (a simplified numerical illustration follows this list).

• portfolio approach: in finance, a portfolio comprises the assets that an investor holds. portfolio management is carried out to decide what investments to make with the assets, considering risks and return on investment. a td portfolio approach brings financial concepts to td and considers it a potential investment whose final goal is to obtain more gains than losses (guo & seaman, 2011). the td portfolio approaches are based on financial portfolio theory and consider principal, interest, or correlations with other td items to help decision-making. some authors proposed a glossary of financial technical debt concepts (akbarinasaji & bener, 2016), while others present frameworks that consider portfolio theory (nielsen & skaarup, 2021; rindell et al., 2019). papers that consider more than interest calculation and reference portfolio theory were classified as portfolio approaches, while those that only mention interest formulas were classified as calculation of td interest.

• prioritization approach: authors suggest different methods to prioritize td items. the purpose of prioritization is to determine the order in which technical debt will be paid. the proposed strategies include ranking code smells through automated tools (alfayez & boehm, 2019; vidal et al., 2016), backlogs managed considering risks and business needs (besker et al., 2019), and approaches that focus on the business perspective (stochel et al., 2020).

• satd removal approach: authors have also suggested management strategies for self-admitted technical debt removal. for example, natural language processing can analyze source code comments and compare their evolution among different versions of each file (da maldonado et al., 2017). it is also possible to use deep neural networks to provide recommendations for satd removal (zampetti et al., 2020).
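for intuition, the interest idea can be reduced to a deliberately simplified toy model: if repaying an item now costs a principal, and postponing it adds a recurring interest cost per iteration, the item whose accumulated interest overtakes its repayment cost soonest is a natural candidate to pay first. this simplification is ours, for illustration only; the cited papers (e.g., chatzigeorgiou et al., 2015) define their own, richer estimation methods.

```python
from dataclasses import dataclass

@dataclass
class DebtItem:
    name: str
    principal: float  # estimated cost to repay the item now (e.g., person-hours)
    interest: float   # estimated recurring extra cost per iteration while unpaid

def breaking_point(item: DebtItem) -> float:
    """iterations after which accumulated interest exceeds the repayment cost."""
    return float("inf") if item.interest <= 0 else item.principal / item.interest

# items whose debt "breaks even" sooner are candidates to be paid first
items = [DebtItem("god class", 40.0, 5.0), DebtItem("stale docs", 10.0, 0.5)]
for item in sorted(items, key=breaking_point):
    print(item.name, breaking_point(item))  # god class 8.0, stale docs 20.0
```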
some of the included papers addressed strategies or techniques identified in previous years (alves et al., 2016). these proposals are the portfolio approach, options analysis, calculation of the principal and interest, and td management in database schemas. however, the number of empirical evaluations performed on each strategy is still small. overall, authors have proposed their own strategies and tested them empirically instead of validating or comparing previous proposals.

4.3.2 evaluation studies

case studies have been the most frequent type of empirical evaluation performed on technical debt management. this is true for both periods, as shown in figure 7. nevertheless, the number of such studies more than doubled from 2015 to 2022. also, ten papers presented action research and controlled experiments in recent years, adding some diversity to the types of evaluation studies.

from 2010 to 2014, few empirical studies were performed in real settings. in contrast, subsequent years show more case studies and action research in real settings. the number of these evaluations is still small for every management strategy. still, it is essential to highlight that researchers have started to acknowledge the need for empirical testing.

figure 7. number of papers by type of study.

4.3.3 visualization techniques

only four papers on technical debt management mentioned visualization techniques: dynamic graphics (pacheco et al., 2018), line charts (falessi & reichel, 2015), the portfolio matrix (plösch et al., 2018), and probabilistic cause-effect diagrams (rios et al., 2019). each technique was only mentioned once; therefore, they still require further research to determine their applicability.

5 discussion

this paper studied the technical debt identification and management research landscape from 2015 to 2022 and integrated our results with previous investigation efforts that analyzed the period 2010–2014 (alves et al., 2016). this section presents a discussion of the obtained results.

5.1 technical debt types (rq1)

technical debt as an analogy with financial debt is well known among authors in the academic literature. overall, there is a common understanding of the technical debt concept itself as taking shortcuts during software development that lead to several future costs. however, different technical debt classifications exist, and there is no clarity on which types are accepted.
alves et al. (2016) proposed a taxonomy that includes design, architecture, documentation, test, code, defect, requirements, infrastructure, people, test automation, process, build, service, usability, and versioning debts. still, other studies mention different classifications, and there is a lack of consensus on some technical debt types. to the best of our knowledge, besides the proposal of alves et al. (2016), only three other papers address technical debt types or propose a classification (li et al., 2015; rios et al., 2018; tom et al., 2013). one of these classifications resulted from a non-academic literature review and interviews with people in the software development industry (tom et al., 2013); the others were derived from a systematic mapping and a tertiary study of academic literature (alves et al., 2016; rios et al., 2018).

some types of technical debt are presented in all three studies: code, design, architecture, and test debt. in fact, we found that from 2015 to 2022, the most addressed types correspond to design, architecture, code, and test debts. we observed that authors use these terms consistently, agreeing on their general meaning. therefore, these particular types may be considered accepted technical debt types. likewise, the concept of self-admitted technical debt (satd) is overall consistent among papers and is referred to as a technical debt type.

other types are much less established in the literature. for example, between 2010 and 2022, process and people debts were only mentioned in three papers each, while usability, service, build, and versioning debts were only cited in two papers each. there is also another new concept mentioned in the literature: variability debt. it was not identified through the performed review because the papers mentioning it do not meet the acceptance criteria proposed in this study; however, it may be considered for future research. variability debt relates to the characteristics that allow software to adapt (create variants) to different needs (wolfart et al., 2021). these concepts are still not widely accepted since not much literature is available on them. in some cases, they may not even represent technical debt categories themselves but subcategories. the same may be true for security and elasticity debts, which could be subcategories of other types of debt. another relevant aspect is a position in the literature that considers defect and process debt as non-technical debt (li et al., 2015). however, this does not imply that the elements addressed by these types of technical debt lack importance.

figure 8 presents a heatmap showing the number of publications by technical debt type and year, including papers from 2010 to 2022. the lack of clarity on some technical debt types and the number of existing categories may have influenced the authors' choice when categorizing their work. still, the number of papers that presented typifications of technical debt dropped in 2016. authors may have inadvertently reached the consensus that technical debt is an issue to be managed without necessarily specifying its type. there was a turning point between 2014 and 2015, in which authors left the classification aside and began to focus their studies on technical debt management.

5.2 technical debt identification (rq2)

technical debt identification comprises actions to detect the presence of debt; it is the first step necessary for its management.
in recent years, source code has been the most common artifact for technical debt identification, since several techniques, algorithms, or tools can be implemented to detect debt automatically. however, other artifacts may be used, such as test cases. figure 9 summarizes the findings on technical debt identification between 2010 and 2022. when technical debt exists, several indicators show symptoms of its presence. identification approaches help find indicators through several artifacts and data sources for further management.

comments on source code were the most common indicator from 2015 to 2022. analyzing comments helps to identify self-admitted technical debt. the increasing number of studies on satd suggests that there is valuable information that developers themselves can provide through comments. nevertheless, it is a future research opportunity to explore how to take advantage of developers' knowledge of code issues in other ways, beyond comments or commit messages.

figure 8. heatmap showing the number of publications by technical debt type and year.

the automatic detection of technical debt yields a variety of quantitative measurements. additionally, organizations may be interested in also having qualitative measures of technical debt, which have not been much explored in the literature and constitute a future work opportunity. depending on every project's business objectives and needs, organizations may identify the measurements that can help them further manage technical debt in their contexts.

in the academic literature, the number of empirical evaluations on technical debt identification has increased in recent years, which is beneficial for both researchers and practitioners because such evaluations help discover the indicators' applicability in different contexts. however, research has concentrated on identification based on source code, while other artifacts and data sources may be further investigated.

few studies proposed visualization techniques for technical debt identification between 2010 and 2022. this is still an open issue and research opportunity. technical debt visualization is important because it may support communication between developers and stakeholders while aiding decision-making on further technical debt management and prevention.

5.3 technical debt management (rq3)

technical debt management comprises actions or activities performed to control debt once it has been identified. authors in the literature have proposed several strategies for debt management. in general, papers present new proposals and test them empirically instead of testing others previously described in the literature. however, the number of empirical evaluations, especially case studies, has increased during the last seven years, along with the number of proposals. table 4 shows the complete list of strategies found in this replication study.

from 2015 to 2022, many strategies proposed for technical debt management were supported by automatic tools applied to source code, such as sonargraph, codescene, sonarqube, and teamscale (baldassarre et al., 2020; haas et al., 2019; tornhill & ab, 2018; von zitzewitz, 2019). the measurements obtained through such tools help to prioritize and support decision-making.
other authors compared penalty and gamification techniques for technical debt using automated tools in educational contexts, showing that rewards may be a suitable option for td management (crespo et al., 2021). other papers presented novel frameworks or models for managing technical debt during the software development process with a more holistic perspective, including several process elements or phases. for example, some authors propose creating a guild, a group of people that helps address td management and guide its payment (detofeno et al., 2021). moreover, another paper proposed encouraging and rewarding incentives for developers to manage technical debt (besker et al., 2022). other authors evaluate a business prioritization approach that aligns business and technical stakeholders in prioritizing td items (reboucas de almeida, 2019), while additional research efforts report using td tickets for td management and prevention (wiese et al., 2021). nevertheless, few papers specifically address the human resources involved in software development, which is essential because it is known that people issues can also lead to technical debt (rios et al., 2020).

between 2010 and 2014, twenty-two papers on technical debt management described a visualization technique. in contrast, we only found four visualization techniques in papers about technical debt management from 2015 to 2022. this shows a significant decrease in research efforts, even when only a few studies address this topic.

although it is not part of the research questions of this paper, it is worth mentioning that there are different perspectives in the literature regarding the definition of td management. in this paper, we used the same definition of td management as alves et al. (2016) to be conceptually consistent. however, some authors consider that td management includes its recognition, analysis, monitoring, and measurement (izurieta et al., 2016), while others consider its identification, assessment, and remediation (griffith et al., 2014). furthermore, li et al. (2015) present eight activities for technical debt management: identification, measurement, prioritization, prevention, monitoring, repayment, documentation, and communication.

as the several definitions of td management suggest, there are plenty of actions that help to control debt during software development. however, the concepts mentioned by the different authors agree that the first step for managing td is to identify or recognize the presence of debt and start measuring (quantifying) it. these two activities alone are not the solution for td issues; subsequent strategies are needed to take effective actions toward td management. it may be necessary to prioritize, monitor, repay, and document debt (li et al., 2015). prioritization includes deciding the order of importance or urgency in which debt items are paid. monitoring refers to supervising several aspects related to td, such as historical costs and resolution times; it is not possible to monitor debt if metrics have not been established and measured, and the progress on debt issues cannot be tracked without monitoring. debt repayment or remediation is the resolution of a td item. documentation and communication with stakeholders may also be needed. lastly, organizations may be interested in establishing prevention actions.
technical debt is a context-dependent issue (fernández-sánchez et al., 2017). therefore, the context must be well understood to take appropriate actions for debt management. gathering and analyzing data (not only about td) may be useful for establishing a td management plan. for example, debt management may differ between an agile organization and a traditional one. also, the team size and the type of software being developed are variables that may be considered. moreover, determining the main debt issues perceived by the software developers could be a starting point. regardless of which definition of td management is used, the appropriate strategies will depend on the specific needs, issues, and objectives of the organizational context. furthermore, the selected strategies may vary or be adapted over time depending on the obtained outcomes. the following sections present the threats to validity and the conclusions of this paper.

figure 9. technical debt identification concept map.

table 4. management strategies proposed in the academic literature from 2015 to 2022.
strategy proposed | number of papers | references
automated analysis of code issues | 7 | (anderson et al., 2019; baldassarre et al., 2020; fontana et al., 2016; haas et al., 2019; lahti et al., 2021; sharma, 2019; tornhill & ab, n.d.; von zitzewitz, 2019)
calculation of td interest | 6 | (ampatzoglou, ampatzoglou, avgeriou, et al., 2015; ampatzoglou et al., 2018; chatzigeorgiou et al., 2015; falessi & reichel, 2015; kontsevoi et al., 2019; martini & bosch, 2016, 2017a)
portfolio approach | 5 | (akbarinasaji & bener, 2016; aldaeej & seaman, 2018; nielsen & skaarup, 2021; plösch et al., 2018; rindell et al., 2019)
prioritization approach | 5 | (alfayez & boehm, 2019; besker et al., 2019; de lima et al., 2022; stochel et al., 2020; vidal et al., 2016)
satd removal approach | 3 | (da maldonado et al., 2017; t. xiao et al., 2021; zampetti et al., 2020)
approach for technical debt decision making | 3 | (codabux & williams, 2016; pacheco et al., 2018; ribeiro et al., 2017)
model for td alignment with business | 3 | (reboucas de almeida, 2019; reboucas de almeida et al., 2018, 2019)
calculation of td principal | 3 | (akbarinasaji et al., 2016; kontsevoi et al., 2019; kosti et al., 2017)
process framework for managing td | 2 | (oliveira et al., 2015; ramasubbu & kemerer, 2019)
model for optimizing technical debt | 2 | (perez et al., 2019; yli-huumo et al., 2016)
strategic td management model | 2 | (ciancarini & russo, 2020; martini et al., 2016)
framework for td management | 2 | (borup et al., 2021; wiese et al., 2021)
continuous architecting framework for embedded software and agile (caffea) | 1 | (martini & bosch, 2017b)
automated identification of refactoring candidates | 1 | (tornhill, 2018)
automated refactoring | 1 | (mohan et al., 2016)
automatic identification and interactive monitoring | 1 | (fernandez-sanchez et al., 2017)
benchmarking-based model | 1 | (mera-gómez et al., 2016)
continuous/extensive testing | 1 | (trumler & paulisch, 2016)
estimation approach | 1 | (lenarduzzi et al., 2019)
linear-predictive lifecycle/incremental-predictive lifecycle application | 1 | (fairley & willshire, 2017)
maintainability model | 1 | (di biase et al., 2019)
managing td in database schemas | 1 | (albarak et al., 2020)
metric for managing architectural technical debt | 1 | (kouros et al., 2019)
model-driven development (preemptive) | 1 | (izurieta et al., 2015)
model of maintenance cost growth | 1 | (snipes & ramaswamy, 2018)
propagation model | 1 | (holvitie et al., 2016)
real options analysis | 1 | (abad & ruhe, 2015)
td enhanced backlog | 1 | (martini, 2018)
visual thinking | 1 | (chicote, 2017)
td cause-effect analysis | 1 | (rios et al., 2019)
normative process framework | 1 | (de leon-sigg et al., 2020)
td predictive model | 1 | (aversano et al., 2021)
conceptual model for holistic debt management | 1 | (malakuti & heuschkel, 2021)
automated identification of deprecation in metamodels | 1 | (iovino et al., 2020)
td management guild | 1 | (detofeno et al., 2021)
encouraging and rewarding incentives | 1 | (besker et al., 2022)

6 threats to validity

the results presented in this systematic mapping may have been affected by the following threats to validity:
• publication bias: relevant studies may not have been returned by the literature search. to address this threat, several databases were consulted and a backward snowballing process was performed to find as many studies as possible.
• search string: it is possible that some papers in the literature propose a way to manage or identify technical debt without explicitly stating that they suggest an approach for technical debt. such papers may have been left out of the search. still, our focus was on literature regarding technical debt strategies.

also, since this is an update to a previous systematic mapping study, other limitations and threats to validity include:
• consistency in integrating the results: this paper updates the previous work by alves et al. (2016). the data extraction was performed by different researchers than in the original study, and we cannot rule out that this introduced some differences in the updated results. however, our research method is based on the procedure performed by alves et al. (2016) in their study.
to address this risk, two of the original authors contributed to the elaboration of this paper and reviewed the results obtained from the data extraction to ensure there was no misunderstanding of concepts between the two sets of primary sources. lastly, we performed the paper selection process in march 2022, so the results for that year are not fully complete. these aspects are the limitations of this study.

7 conclusion

this paper explored the evolution of the technical debt identification and management research landscape over a decade. we searched for studies in eight databases and analyzed academic literature published between 2015 and 2022. by applying the defined search string and the inclusion criteria, we found 117 papers. we integrated our results with a previous study (alves et al., 2016) that analyzed literature from 2010 to 2014.

in addition to the technical debt types mentioned in the taxonomy by alves et al. (2016), there are three new terms in the literature: security, elasticity, and variability debt. the security type refers to security issues in the software, such as vulnerabilities or exploitable weaknesses (izurieta et al., 2018). elasticity debt refers to ineffective or inefficient resource provisioning resulting from the lack of dynamic adaptation to resource consumption (mera-gómez et al., 2016). lastly, variability debt comprises the lack of software characteristics that allow it to adapt (create variants) to different needs (wolfart et al., 2021).

unlike the previous mapping, most of the included papers addressed technical debt without focusing on specific types. this shows that the technical debt phenomenon is analyzed more holistically. still, the papers that focused on specific types of technical debt studied those identifiable or measurable through code. the most frequent artifacts and data sources are source code and repositories; this may be because various code and repository data analysis tools exist, while comparably rich tooling is not available for analyzing other types of debt, such as documentation, people, and infrastructure debt.

over the years, several proposals have been developed for technical debt management. however, as in the previous systematic mapping, more research is needed to validate the effectiveness of the proposals and their applicability in different contexts. another finding was that only a few studies included in the update proposed a visualization strategy; therefore, the topic of technical debt visualization continues to be a future research opportunity.

the automatic identification of debt through the analysis of comments, commits, and source code is among the main proposals found in the literature published between 2015 and 2022. several evaluations have been performed through case studies, controlled experiments, and action research. the number of evaluations has been rising through the years, which is particularly important for consolidating the knowledge gained in the research area. however, more evaluations are still required to generalize the obtained results.

the most relevant findings of this paper were the following:
• investigations on technical debt identification and management have increasingly changed their focus to a more holistic perspective, considering technical debt as a global problem during the software development process instead of analyzing it as a set of different isolated problems.
however, a significant number of investigations still focus on technical debt types closely related to source code.
• the number of empirical evaluations performed on each strategy is still small. in most cases, authors have proposed their own strategies and tested them empirically instead of testing previous proposals.
• recent research on technical debt has focused on its management, while the proposal of new types has decreased dramatically since 2016. creating new categories seems unnecessary, and authors may have inadvertently reached the consensus that technical debt is an issue to manage without specifying its type.
• overall, authors agree on the general meaning of code, design, architecture, and test debt, which suggests that these are widely accepted technical debt types.

likewise, future work possibilities include the following:
• research on how to use developers' knowledge of existing technical debt, not only through their comments or commit messages. it is an opportunity to explore this knowledge as a valuable asset for technical debt identification.
• creating tools for analyzing certain types of debt, such as documentation, people, and infrastructure debt, is a potential research opportunity since there is a lack of such tools.
• there is still a small number of proposals regarding technical debt visualization; this is a future research opportunity, particularly considering that visualization techniques can help to better communicate with stakeholders.
• few studies have explored strategies with a less technical approach, focusing on human resources, such as creating guilds, communities of practice, and rewards or incentives. therefore, performing such investigations is a future opportunity.
• there is a need to analyze which strategies are best in specific contexts (for example, public or private organizations).

the next steps in this research could address how technical debt can be used as a competitive advantage, generating value rather than bringing undesired and costly consequences.

acknowledgments

the authors thank dr. carolyn seaman for her valuable suggestions and comments. this work was partially supported by citic at the university of costa rica, grant no. 834-b4412.

appendices

a1. complete list of included papers
the complete bibliography of the 117 papers analyzed in the full-text review is available at: https://drive.google.com/file/d/1g8thuunysvuhdwbr5a_scvltxwccnta/view?usp=sharing

a2. list of included papers about technical debt identification
the complete list of included papers about technical debt identification and the artifact considered is available at: https://drive.google.com/file/d/1txn8sv6og_n59dhzkjktcmmazcl4ud3e/view?usp=sharing

references

abad, z. s. h., & ruhe, g. (2015). using real options to manage technical debt in requirements engineering. 2015 ieee 23rd international requirements engineering conference, re 2015 proceedings, 230–235. https://doi.org/10.1109/re.2015.7320428

akbarinasaji, s., & bener, a. (2016). adjusting the balance sheet by appending technical debt. proceedings 2016 ieee 8th international workshop on managing technical debt, mtd 2016, 36–39. https://doi.org/10.1109/mtd.2016.14

akbarinasaji, s., bener, a. b., & erdem, a. (2016). measuring the principal of defect debt. proceedings 5th international workshop on realizing artificial intelligence synergies in software engineering, raise 2016, 1–7. https://doi.org/10.1145/2896995.2896999
albarak, m., bahsoon, r., ozkaya, i., & nord, r. l. (2020). managing technical debt in database normalization. ieee transactions on software engineering. https://doi.org/10.1109/tse.2020.3001339

aldaeej, a., & seaman, c. (2018). from lasagna to spaghetti, a decision model to manage defect debt. proceedings international conference on software engineering, 67–71. https://doi.org/10.1145/3194164.3194177

alfayez, r., alwehaibi, w., winn, r., venson, e., & boehm, b. (2020). a systematic literature review of technical debt prioritization. proceedings 2020 ieee/acm international conference on technical debt, techdebt 2020, 10, 1–10. https://doi.org/10.1145/3387906.3388630

alfayez, r., & boehm, b. (2019). technical debt prioritization: a search-based approach. proceedings 19th ieee international conference on software quality, reliability and security, qrs 2019, 434–445. https://doi.org/10.1109/qrs.2019.00060

alves, n. s. r., mendes, t. s., de mendonça, m. g., spinola, r. o., shull, f., & seaman, c. (2016). identification and management of technical debt: a systematic mapping study. information and software technology, 70, 100–121. https://doi.org/10.1016/j.infsof.2015.10.008

ampatzoglou, a., ampatzoglou, a., avgeriou, p., & chatzigeorgiou, a. (2015). a financial approach for managing interest in technical debt. lecture notes in business information processing, 257, 117–133. https://doi.org/10.1007/978-3-319-40512-4_7

ampatzoglou, a., ampatzoglou, a., chatzigeorgiou, a., & avgeriou, p. (2015). the financial aspect of managing technical debt: a systematic literature review. information and software technology, 64, 52–73. https://doi.org/10.1016/j.infsof.2015.04.001

ampatzoglou, a., michailidis, a., sarikyriakidis, c., ampatzoglou, a., chatzigeorgiou, a., & avgeriou, p. (2018). a framework for managing interest in technical debt: an industrial validation. proceedings of the 2018 international conference on technical debt, 10. https://doi.org/10.1145/3194164

anderson, p., kot, l., gilmore, n., & vitek, d. (2019). sarif-enabled tooling to encourage gradual technical debt reduction. proceedings 2019 ieee/acm international conference on technical debt, techdebt 2019, 71–72. https://doi.org/10.1109/techdebt.2019.00024

aversano, l., bernardi, m. l., cimitile, m., & iammarino, m. (2021). technical debt predictive model through temporal convolutional network. proceedings of the international joint conference on neural networks, 2021-july. https://doi.org/10.1109/ijcnn52387.2021.9534423

baldassarre, m. t., lenarduzzi, v., romano, s., & saarimäki, n. (2020). on the diffuseness of technical debt items and accuracy of remediation time when using sonarqube. information and software technology, 128, 106377. https://doi.org/10.1016/j.infsof.2020.106377

besker, t., martini, a., & bosch, j. (2019). technical debt triage in backlog management. proceedings 2019 ieee/acm international conference on technical debt, techdebt 2019, 13–22. https://doi.org/10.1109/techdebt.2019.00010

besker, t., martini, a., & bosch, j. (2022). the use of incentives to promote technical debt management. information and software technology, 142, 106740. https://doi.org/10.1016/j.infsof.2021.106740
borup, n. b., christiansen, a. l. j., tovgaard, s. h., & persson, j. s. (2021). deliberative technical debt management: an action research study. lecture notes in business information processing, 434 lnbip, 50–65. https://doi.org/10.1007/978-3-030-91983-2_5

chatzigeorgiou, a., ampatzoglou, a., ampatzoglou, a., & amanatidis, t. (2015). estimating the breaking point for technical debt. 2015 ieee 7th international workshop on managing technical debt, mtd 2015 proceedings, 53–56. https://doi.org/10.1109/mtd.2015.7332625

chicote, m. (2017). startups and technical debt: managing technical debt with visual thinking. proceedings 2017 ieee/acm 1st international workshop on software engineering for startups, softstart 2017, 10–11. https://doi.org/10.1109/softstart.2017.6

ciancarini, p., & russo, d. (2020). the strategic technical debt management model: an empirical proposal. ifip advances in information and communication technology, 582 ifip, 131–140. https://doi.org/10.1007/978-3-030-47240-5_13

codabux, z., & williams, b. j. (2016). technical debt prioritization using predictive analytics. proceedings international conference on software engineering, 704–706. https://doi.org/10.1145/2889160.2892643

consortium for information & software quality. (2022). cost of poor software quality in the u.s.: a 2022 report. cisq. https://www.it-cisq.org/the-cost-of-poorquality-software-in-the-us-a-2022-report/

crespo, y., gonzalez-escribano, a., & piattini, m. (2021). carrot and stick approaches revisited when managing technical debt in an educational context. proceedings 2021 ieee/acm international conference on technical debt, techdebt 2021, 99–108. https://doi.org/10.1109/techdebt52882.2021.00020

da maldonado, e. s., abdalkareem, r., shihab, e., & serebrenik, a. (2017). an empirical study on the removal of self-admitted technical debt. proceedings 2017 ieee international conference on software maintenance and evolution, icsme 2017, 238–248. https://doi.org/10.1109/icsme.2017.8

de leon-sigg, m., vazquez-reyes, s., & rodriguez-avila, d. (2020). towards the use of a framework to make technical debt visible. proceedings 2020 8th edition of the international conference in software engineering research and innovation, conisoft 2020, 86–92. https://doi.org/10.1109/conisoft50191.2020.00022

de lima, b. s., garcia, r. e., & eler, d. m. (2022). toward prioritization of self-admitted technical debt: an approach to support decision to payment. software quality journal, 1–27. https://doi.org/10.1007/s11219-021-09578-7

detofeno, t., malucelli, a., & reinehr, s. (2021). technical debt guild: when experience and engagement improve technical debt management. xx brazilian symposium on software quality. https://doi.org/10.1145/3493244

di biase, m., rastogi, a., bruntink, m., & van deursen, a. (2019). the delta maintainability model: measuring maintainability of fine-grained code changes. proceedings 2019 ieee/acm international conference on technical debt, techdebt 2019, 113–122. https://doi.org/10.1109/techdebt.2019.00030

fairley, r. e., & willshire, m. j. (2017). better now than later: managing technical debt in systems development. computer, 50(5), 80–87. https://doi.org/10.1109/mc.2017.124

falessi, d., & reichel, a. (2015). towards an open-source tool for measuring and visualizing the interest of technical debt. 2015 ieee 7th international workshop on managing technical debt, mtd 2015 proceedings, 1–8. https://doi.org/10.1109/mtd.2015.7332618
fernández-sánchez, c., garbajosa, j., yagüe, a., & perez, j. (2017). identification and analysis of the elements required to manage technical debt by means of a systematic mapping study. journal of systems and software, 124, 22–38. https://doi.org/10.1016/j.jss.2016.10.018

fernandez-sanchez, c., humanes, h., garbajosa, j., & diaz, j. (2017). an open tool for assisting in technical debt management. proceedings 43rd euromicro conference on software engineering and advanced applications, seaa 2017, 400–403. https://doi.org/10.1109/seaa.2017.60

fontana, f. a., roveda, r., & zanoni, m. (2016). tool support for evaluating architectural debt of an existing system: an experience report. proceedings of the acm symposium on applied computing, 04-08-april-2016, 1347–1349. https://doi.org/10.1145/2851613.2851963

freire, s., rios, n., mendonça, m., falessi, d., seaman, c., izurieta, c., & spínola, r. o. (2020). actions and impediments for technical debt prevention: results from a global family of industrial surveys. proceedings of the acm symposium on applied computing, 1548–1555. https://doi.org/10.1145/3341105.3373912

griffith, i., taffahi, h., izurieta, c., & claudio, d. (2014). a simulation study of practical methods for technical debt management in agile software development. proceedings of the winter simulation conference 2014. https://doi.org/10.1109/wsc.2014.7019961

guo, y., & seaman, c. (2011). a portfolio approach to technical debt management.

guo, y., spínola, r. o., & seaman, c. (2014). exploring the costs of technical debt management – a case study. empirical software engineering, 21(1), 159–182. https://doi.org/10.1007/s10664-014-9351-7

haas, r., niedermayr, r., & juergens, e. (2019). teamscale: tackle technical debt and control the quality of your software. proceedings 2019 ieee/acm international conference on technical debt, techdebt 2019, 55–56. https://doi.org/10.1109/techdebt.2019.00016

holvitie, j., licorish, s. a., & leppanen, v. (2016). modelling propagation of technical debt. proceedings 42nd euromicro conference on software engineering and advanced applications, seaa 2016, 54–58. https://doi.org/10.1109/seaa.2016.53

iovino, l., di salle, a., di ruscio, d., & pierantonio, a. (2020). metamodel deprecation to manage technical debt in model co-evolution. proceedings 23rd acm/ieee international conference on model driven engineering languages and systems, models-c 2020 companion proceedings, 306–315. https://doi.org/10.1145/3417990.3419625

izurieta, c., ozkaya, i., seaman, c., kruchten, p., nord, r., snipes, w., & avgeriou, p. (2016). perspectives on managing technical debt: transition point and roadmap from dagstuhl. ceur workshop proceedings, 1771, 84–87.

izurieta, c., rice, d., kimball, k., & valentien, t. (2018). a position study to investigate technical debt associated with security weaknesses. proceedings of the 2018 international conference on technical debt. https://doi.org/10.1145/3194164

izurieta, c., rojas, g., & griffith, i. (2015). preemptive management of model driven technical debt for improving software quality. proceedings of the 11th international acm sigsoft conference on quality of software architectures. https://doi.org/10.1145/2737182
izurieta, c., vetrò, a., zazworka, n., cai, y., seaman, c., & shull, f. (2012). organizing the technical debt landscape. 2012 3rd international workshop on managing technical debt, mtd 2012 proceedings, 23–26. https://doi.org/10.1109/mtd.2012.6225995

kontsevoi, b., soroka, e., & terekhov, s. (2019). tetra, as a set of techniques and tools for calculating technical debt principal and interest. proceedings 2019 ieee/acm international conference on technical debt, techdebt 2019, 64–65. https://doi.org/10.1109/techdebt.2019.00021

kosti, m. v., ampatzoglou, a., chatzigeorgiou, a., pallas, g., stamelos, i., & angelis, l. (2017). technical debt principal assessment through structural metrics. proceedings 43rd euromicro conference on software engineering and advanced applications, seaa 2017, 329–333. https://doi.org/10.1109/seaa.2017.59

kouros, p., chaikalis, t., arvanitou, e. m., chatzigeorgiou, a., ampatzoglou, a., & amanatidis, t. (2019). jcaliper: search-based technical debt management. proceedings of the acm symposium on applied computing, part f147772, 1721–1730. https://doi.org/10.1145/3297280.3297448

lahti, j. r., tuovinen, a. p., & mikkonen, t. (2021). experiences on managing technical debt with code smells and antipatterns. proceedings 2021 ieee/acm international conference on technical debt, techdebt 2021, 36–44. https://doi.org/10.1109/techdebt52882.2021.00013

lenarduzzi, v., besker, t., taibi, d., martini, a., & arcelli fontana, f. (2021). a systematic literature review on technical debt prioritization: strategies, processes, factors, and tools. journal of systems and software, 171, 110827. https://doi.org/10.1016/j.jss.2020.110827

lenarduzzi, v., martini, a., taibi, d., & tamburri, d. a. (2019). towards surgically-precise technical debt estimation: early results and research roadmap. maltesque 2019 proceedings of the 3rd acm sigsoft international workshop on machine learning techniques for software quality evaluation, co-located with esec/fse 2019, 37–42. https://doi.org/10.1145/3340482.3342747

li, z., avgeriou, p., & liang, p. (2015). a systematic mapping study on technical debt and its management. journal of systems and software, 101, 193–220. https://doi.org/10.1016/j.jss.2014.12.027

macit, y., giray, g., & tüzün, e. (2020). methods for identifying architectural debt: a systematic mapping study. 2020 turkish national software engineering symposium, uyms 2020 proceedings. https://doi.org/10.1109/uyms50627.2020.9247070

malakuti, s., & heuschkel, j. (2021). the need for holistic technical debt management across the value stream: lessons learnt and open challenges. proceedings 2021 ieee/acm international conference on technical debt, techdebt 2021, 109–113. https://doi.org/10.1109/techdebt52882.2021.00021

martini, a. (2018). anacondebt: a tool to assess and track technical debt. proceedings of the 2018 international conference on technical debt. https://doi.org/10.1145/3194164

martini, a., besker, t., & bosch, j. (2016). the introduction of technical debt tracking in large companies. proceedings asia-pacific software engineering conference, apsec, 0, 161–168. https://doi.org/10.1109/apsec.2016.032

martini, a., & bosch, j. (2016). an empirically developed method to aid decisions on architectural technical debt refactoring: anacondebt. proceedings international conference on software engineering, 31–40. https://doi.org/10.1145/2889160.2889224
https://doi.org/10.1145/2889160.2889224 martini, a., & bosch, j. (2017a). the magnificent seven: towards a systematic estimation of technical debt interest. proceedings of the xp2017 scientific workshops. https://doi.org/10.1145/3120459 martini, a., & bosch, j. (2017b). revealing social debt with the caffea framework: an antidote to architectural debt. proceedings 2017 ieee international conference on software architecture workshops, icsaw 2017: side track proceedings, 179–181. https://doi.org/10.1109/icsaw.2017.42 mendes, e., wohlin, c., felizardo, k., & kalinowski, m. (2020). when to update systematic literature reviews in software engineering. journal of systems and software, 167, 110607. https://doi.org/10.1016/j.jss.2020.110607 mera-gómez, c., bahsoon, r., & buyya, r. (2016). elasticity debt: a debt-aware approach to reason about elasticity decisions in the cloud. proceedings of the 9th international conference on utility and cloud computing. https://doi.org/10.1145/2996890 mohan, m., greer, d., & mcmullan, p. (2016). technical debt reduction using search based automated refactoring. journal of systems and software, 120, 183–194. https://doi.org/10.1016/j.jss.2016.05.019 nepomuceno, v., & soares, s. (2019). on the need to update systematic literature reviews. information and software technology, 109, 40–42. https://doi.org/10.1016/j.infsof.2019.01.005 nielsen, m. e., østergaard madsen, c., & lungu, m. f. (2020). technical debt management: a systematic literature review and research agenda for digital government. lecture notes in computer science (including subseries lecture notes in artificial intelligence and lecture notes in bioinformatics), 12219 lncs, 121–137. https://doi.org/10.1007/978-3-03057599-1_10 nielsen, m. e., & skaarup, s. (2021). it portfolio management as a framework for managing technical debt; it portfolio management as a framework for managing technical debt. 14th international conference on theory and practice of electronic governance. https://doi.org/10.1145/3494193 oliveira, f., goldman, a., & santos, v. (2015). managing technical debt in software projects using scrum: an action research. proceedings 2015 agile conference, agile 2015, 50–59. https://doi.org/10.1109/agile.2015.7 pacheco, a., marín-raventós, g., & lópez, g. (2018). designing a technical debt visualization tool to improve stakeholder communication in the decisionmaking process: a case study. lecture notes in business information processing, 327, 15–26. https://doi.org/10.1007/978-3-319-99040-8_2 perez, b., correal, d., & astudillo, h. (2019). a proposed model-driven approach to manage architectural technical debt life cycle. proceedings 2019 ieee/acm international conference on technical debt, techdebt 2019, 73–77. https://doi.org/10.1109/techdebt.2019.00025 petersen, k., feldt, r., mujtaba, s., & mattsson, m. (2008). systematic mapping studies in software engineering. 12th international conference on evaluation and assessment in software engineering, ease 2008. https://doi.org/10.14236/ewic/ease2008.8 plösch, r., bräuer, j., saft, m., & körner, c. (2018). design debt prioritization: a design best practice-based approach. proceedings of the 2018 international conference on technical debt, 18. https://doi.org/10.1145/3194164 ramasubbu, n., & kemerer, c. f. (2019). integrating technical debt management and software quality management processes: a normative framework and field tests. ieee transactions on software engineering, 45(3), 285–300. 
https://doi.org/10.1109/tse.2017.2774832 reboucas de almeida, r. (2019). business-driven technical debt prioritization. proceedings 2019 ieee international conference on software maintenance and evolution, icsme 2019, 605–609. https://doi.org/10.1109/icsme.2019.00096 reboucas de almeida, r., kulesza, u., treude, c., cavalcanti feitosa, d., & lima, a. h. g. (2018). aligning technical debt prioritization with business objectives: a multiple-case study. proceedings 2018 ieee international conference on software maintenance and evolution, icsme 2018, 655–664. https://doi.org/10.1109/icsme.2018.00075 reboucas de almeida, r., treude, c., & kulesza, u. (2019). tracy: a business-driven technical debt prioritization framework. proceedings 2019 ieee international conference on software maintenance and evolution, icsme 2019, 181–185. https://doi.org/10.1109/icsme.2019.00028 ribeiro, l. f., alves, n. s. r., de mendonca neto, m. g., & spinola, r. o. (2017). a strategy based on multiple decision criteria to support technical debt management. proceedings 43rd euromicro conference on software engineering and advanced applications, identification and management of technical debt: a systematic study update murillo et al. 2023 seaa 2017, 334–341. https://doi.org/10.1109/seaa.2017.37 rindell, k., bernsmed, k., & gilje jaatun, m. (2019). managing security in software or: how i learned to stop worrying and manage the security technical debt. acm international conference proceeding series. https://doi.org/10.1145/3339252.3340338 rios, n., mendonça neto, m. g. de, & spínola, r. o. (2018). a tertiary study on technical debt: types, management strategies, research trends, and base information for practitioners. information and software technology, 102, 117–145. https://doi.org/10.1016/j.infsof.2018.05.010 rios, n., spinola, r. o., de mendonça neto, m. g., & seaman, c. (2019). supporting analysis of technical debt causes and effects with cross-company probabilistic cause-effect diagrams. proceedings 2019 ieee/acm international conference on technical debt, techdebt 2019, 3–12. https://doi.org/10.1109/techdebt.2019.00009 rios, n., spínola, r. o., mendonça, m., & seaman, c. (2020). the practitioners’ point of view on the concept of technical debt and its causes and consequences: a design for a global family of industrial surveys and its first results from brazil. empirical software engineering 2020 25:5, 25(5), 3216–3287. https://doi.org/10.1007/s10664-020-09832-9 shapochka, a., & omelayenko, b. (2016). practical technical debt discovery by matching patterns in assessment graph. proceedings 2016 ieee 8th international workshop on managing technical debt, mtd 2016, 32–35. https://doi.org/10.1109/mtd.2016.7 sharma, t. (2019). how deep is the mud: fathoming architecture technical debt using designite. proceedings 2019 ieee/acm international conference on technical debt, techdebt 2019, 59–60. https://doi.org/10.1109/techdebt.2019.00018 snipes, w., & ramaswamy, s. (2018). a proposed sizing model for managing 3rd party code technical debt. proceedings of the 2018 international conference on technical debt, 18. https://doi.org/10.1145/3194164 stochel, m. g., cholda, p., & wawrowski, m. r. (2020). continuous debt valuation approach (codva) for technical debt prioritization. proceedings 46th euromicro conference on software engineering and advanced applications, seaa 2020, 362–366. https://doi.org/10.1109/seaa51224.2020.00066 tom, e., aurum, a., & vidgen, r. (2013). an exploration of technical debt. 
journal of software engineering research and development, 2019, 7:8, doi: 10.5753/jserd.2019.155. this work is licensed under a creative commons attribution 4.0 international license.
on the contributions of non-technical stakeholders to describing ux requirements by using proto-persona+ eduardo pinheiro [ universidade federal de são carlos | edu.g.pinheiro@gmail.com ] larissa lopes [ universidade federal de são carlos | larii.albano@gmail.com ] tayana conte [ universidade federal do amazonas | tayana@icomp.ufam.edu.br ] luciana zaina [ universidade federal de são carlos | lzaina@ufscar.br ]

abstract
context: the requirements elicitation phase in software development investigates both functional and user experience (ux) requirements. proto-persona is a technique that encourages attention to the needs of a group of users. usually, the elaboration of proto-personas is done by software specialists and technical stakeholders without the participation of non-technical stakeholders. however, non-technical stakeholders often have deep knowledge about the target users. objective: this work aims to investigate the contribution that non-technical stakeholders bring to the specification of ux requirements when they use the proto-persona+ technique to this end. to achieve our objective, we extended the original proto-persona technique, creating proto-persona+. we also explored the construction of proto-persona+ artifacts and their use in the prototyping of solutions. method: we conducted an empirical study in two rounds, wherein we analyzed and compared the contributions of technical and non-technical stakeholders to the specification of ux requirements. in the first round, 8 non-technical and 5 technical stakeholders built the proto-personas+. in the second round, 36 software developers worked in pairs to create low fidelity prototypes using the information provided by the proto-persona+ artifacts. for the two rounds, we conducted a qualitative analysis exploring which ux requirements were described and used. results: the results revealed that although both types of stakeholders wrote details of ux requirements on the artifact, they did so from different and complementary perspectives. we could also observe that the proto-persona+ artifacts produced by both stakeholders were used in the prototyping activity. conclusion: our study indicates that non-technical stakeholders are able to contribute to the specification of ux requirements and that proto-persona+ is a suitable artifact to promote such activity. the details described by non-technical stakeholders brought new and different contributions when compared to the ones described by the technical stakeholders. from the results of the first round, we concluded that the non-technical stakeholders elicited requirements that impact accessibility and fun. by considering the findings of the second round, we concluded that the ux requirements provided by both stakeholders allowed the developers to build more comprehensive and minimalist user interface prototypes.

keywords: non-technical stakeholder, proto-personas, requirement engineering, ux requirements

1 introduction
requirements elicitation is widely discussed in software engineering. the challenges of this important area of software development include issues that range from technical aspects (e.g. use of appropriate tools) to human aspects (e.g. type of stakeholders involved in the process), sharma and pandey (2014); aranda et al. (2016); abelein et al. (2013); hadar et al. (2014).
some works have highlighted that the involvement of end-users in the elicitation process can bring important contributions to software construction and that, consequently, this affects user satisfaction with the software, berti et al. (2004); maceli and atwood (2011). additionally, the authors stated that the process of requirements elicitation can be enriched not only by the participation of end-users but also by including different stakeholders in this process. non-technical stakeholders are recognized as those who are not a part of the software team, hadar et al. (2014). these stakeholders can be professionals that have close contact with the end-users, for instance. as a result, they often have deep knowledge about the audience and the domain of the application, aranda et al. (2016).

during the elicitation process, a diversity of types of requirements can arise. non-functional software requirements, such as usability and user experience, are linked to quality-related requirements; therefore they can impact software acceptance by end-users, de la vara et al. (2011); palomares et al. (2017). nielsen and norman (2013) offer a holistic definition of user experience (ux): "user experience encompasses all aspects of the end-users interaction with the company, its services, and its products". differently, in a more pragmatic definition, garrett (2010) states that for a product to provide a good user experience, the software developers have to pay attention to what the product does and how it does it. considering both definitions above, we can affirm that the elicitation of ux requirements involves the gathering of aspects and characteristics of the end-user and the product. these requirements should assist the technical stakeholders (i.e. software experts) in designing and developing software that has good acceptance and brings value to end-users, brown et al. (2011); kashfi et al. (2017).

technical stakeholders can be supported by several techniques and methods for eliciting ux requirements. for this purpose, questionnaires, interviews, as well as techniques and methods from the human-computer interaction (hci) area can be applied, garcia et al. (2017); brown et al. (2011). personas are artifacts that have been applied to support software teams in both activities, the elicitation and the use of ux requirements, ferreira et al. (2015). the technique to create personas follows a process that analyzes end-user data. the persona artifact generated from the technique consists of a fictional character that represents a group of real users of the system and their relevant characteristics within a given software domain, gothelf (2012); grudin (2006); cooper et al. (2014). additionally, personas are useful for establishing an empathy relationship between technical stakeholders and end-users, grudin and pruitt (2002); billestrup et al. (2014). however, the application of personas can be onerous and costly to the team. by the classical definition, a persona is created by analyzing a significant amount of data regarding end-users, which requires extensive research and data collection, billestrup et al. (2014).
gothelf proposes a new approach to elaborating personas called proto-persona (also known as lean persona), gothelf (2012); gothelf and seiden (2013). rather than using the classical technique to create personas, gothelf's proposal considers the prior knowledge of stakeholders about end-users and the software domain in question. the technique to construct a proto-persona recognizes that these stakeholders are able to build a sketch of a persona with their assumptions based on their knowledge about a given domain. the technique of constructing proto-personas provides a practical way to gather the knowledge that the stakeholders have about end-users. however, the author recommends that the proto-persona artifact should be validated later by conducting end-user research, gothelf (2012).

usually, technical stakeholders work on the development of a diversity of software, which can make it difficult to obtain in-depth knowledge about different software domains. furthermore, non-technical stakeholders (i.e. those who are not a part of the software team) are the ones who have knowledge about a given domain and can provide relevant information about end-users and the aspects of their interaction with the software. considering the aforementioned discussion, we decided to study the use of proto-personas to elicit ux requirements from the perspective of non-technical stakeholders. our study focused on investigating how non-technical stakeholders contribute to the requirements elicitation activity. to do this, we selected the proto-persona technique, which produces a lean artifact that can be easily used by this type of stakeholder. the intention of this study is not to compare different persona techniques, but to collect evidence about the potential of the proto-persona technique for the purpose of eliciting requirements.

to support our study, we extended the proto-persona technique proposed by gothelf (2012) and gothelf and seiden (2013), creating proto-persona+. developers frequently report that they struggle with how to arrange the information of a persona, billestrup et al. (2014). considering the difficulties that the participants would have in handling the proto-persona technique, in our extension we included a new template and guideline questions that support the individuals who will use proto-persona+. the construction of the proto-personas is supported by the template and by the questions that guide the participants through the writing of the personas. building on gothelf's proposal, gothelf (2012), the template outlines the important points that individuals should address during the design of the proto-personas.

to conduct our study, we defined three research questions (rqs): (rq1) which ux requirements do non-technical stakeholders describe while using the proto-persona technique?; (rq2) how is the acceptance of the use of the proto-persona+ technique by these stakeholders?; and (rq3) which ux requirements presented in the proto-personas+ can support the prototyping of user interfaces?. we conducted two rounds of experimental studies. firstly, we explored the use of the technique to construct the proto-persona+ artifact with the participation of 8 non-technical stakeholders and 5 technical stakeholders; consequently, we could answer rq1 and rq2.
to respond to rq3, we invited 36 software developers to design user interface prototypes by using the proto-persona+ artifacts that were created in the first round. the participants worked in pairs and produced 18 user interface prototypes in total. from this second round, we were able to examine the use of the proto-personas+ previously developed by the different stakeholders (i.e. technical and non-technical). this paper presents these two rounds in detail and discusses the results. in this paper, we extend our previous results presented at the brazilian symposium on software engineering in 2018. in the earlier version, we discussed the results regarding rq1 and rq2. in this version, we added a new perspective of analysis (related to rq3) that enriched our findings regarding the contributions of non-technical stakeholders and the potential of proto-persona to support the elicitation of ux requirements.

the process of selecting individuals to participate in the study as non-technical stakeholders was directly related to the domain that our research focused on. our domain was defined as applications to support e-learning, and consequently, pedagogues were the non-technical stakeholders (in brazil, pedagogues are professionals who are responsible for the education of children in elementary schools; they obtain their degrees by attending a pedagogy course). as our research group has experience in the development of applications in the e-learning domain, we created a network of contacts with pedagogues (i.e., potential non-technical stakeholders), which was the key factor for our choice. the study allowed us to observe how the stakeholders described ux requirements by applying the proto-persona+ technique and how these artifacts were used to design software solutions. our main contribution is the discussion of the feasibility of introducing the non-technical stakeholder as an active agent in the specification of ux requirements through the use of the proto-persona technique. our study examines not only the construction of the artifacts but also their use in the elaboration of software.

the rest of the paper is organized as follows: section 2 presents the fundamentals and related work; the proto-persona+ artifact is presented in section 3; the domain selected for the study and the scenario we applied are explained in section 4; the details of the first round of investigation are discussed in section 5 and its results in section 6; the second round of investigation and its results are presented in section 7 and section 8, respectively; in section 9 we return to our research questions to point out the important results and make a comparison with the literature; the main limitations of our study are pointed out in section 10; and finally section 11 presents the conclusion and future work.

2 fundamentals and related work
requirements elicitation can be considered a complex task that often requires the participation of different stakeholders. these stakeholders contribute with different knowledge in this process, fernandez and wagner (2015). recently, the identification of ux requirements during requirements elicitation has become a trend, castro et al. (2008); ferreira et al. (2015, 2018a); choma et al. (2016a,b).
personas allow the production of artifacts in which ux-related issues, such as personal characteristics, needs, and restrictions of end-users, are described, cooper et al. (2014). personas are recognized as important artifacts by professionals, academics, and practitioners, billestrup et al. (2014). they can support teams during software development by providing important insights about end-users, ferreira et al. (2018b). another benefit of this technique is to place the user at the center of the development process, which keeps the teams informed about end-users' requirements. frequently, software teams hold personal assumptions about end-users' characteristics that may differ from the users' needs in real life, jansen et al. (2017). the team can predict user behavior from a more pragmatic perspective by using personas in their activities. therefore, the persona plays the role of developing the empathy of developers toward end-users, cooper et al. (2014); grudin (2006).

alves and ali (2018) applied goal-oriented requirements engineering (gore) together with the personas technique to enrich the specification of human factor requirements. gore is focused on fulfilling the demands regarding business goals. the authors stated that by including personas in the process, they could improve the specification of the users' needs in the software with more assertive and specific details. consequently, they could satisfy the needs of groups of real end-users.

gothelf (2012) proposes proto-personas as a technique in which the domain-specific knowledge that specialists have about the audience is used to describe personas. the technique runs as a series of brainstorming sessions, osborn (1979), wherein each participant (i.e., a specialist) proposes personas individually. in the next step, these initial proposals are refined by all the participants in the session until they produce a maximum of four personas that represent the target audience. afterwards, the software teams apply these sketches of personas during the software development. these sketches can be validated in future development cycles. the proto-persona technique has the main goal of capturing the knowledge of the experts and using it in the writing of the proto-persona artifact. this artifact can aid the teams in kicking off a discussion about the user in the early development phases (e.g. the design phase).

in the work of anvari et al. (2015), the traditional persona technique was used to hold the emotional characteristics of users. the authors' intention was to verify whether the developers could see the differences among the characteristics of the personas and whether these differences caused some influence during the software design. the results revealed that most participants noticed the variance in the details of those personas and reported that the artifacts helped them in designing the software.

ferreira et al. (2015) proposed pathy, a technique that adapts the empathy map to the construction of personas. the empathy map provides a different approach to building personas, wherein the focus is on establishing an empathy relationship between end-users and developers. pathy provides a set of questions that drive the software engineer through the artifact elaboration. the technique includes the specification of the user characteristics as well as of other software features. subsequently, ferreira et al. (2018b) investigated the feasibility of combining the pathy technique with user stories to support software development.
the results suggested that pathy helps the team in understanding the context of use, identifying potential software requirements, and integrating personas into the design and development process. bhattarai et al. (2016) applied the proto-persona technique in the construction of user profiles. the experience was conducted in several sessions with the participation of different developers. the findings showed that proto-persona supports teams in aligning their point of view about the software with a set of testable hypotheses about consumers or end-users. kortbeek (2016) presented an experience of using gothelf's technique to build and communicate the hypothesis of a user in an industrial context. later, in order to verify whether the hypothesis reflected information about the end-users, interviews were conducted with users who had the same characteristics found in the proto-persona.

unlike previous works, this paper presents not only the application of the proto-persona+ technique but also discusses the contribution that the non-technical stakeholder brings to the specification of ux requirements while using the proto-persona technique. to the best of our knowledge, no prior research investigated the contribution of non-technical stakeholders in elaborating personas. however, there are other works regarding the participation of non-technical stakeholders in different requirements engineering tasks, mainly in the end-user development or end-user software engineering context. berti et al. (2004) discuss how scenarios and sketches can be used to capture informal input from end-user developer stakeholders. faily (2008) presents a case study where end-user developers obtained practical benefit by adopting professional requirements engineering practices. maceli and atwood (2011) claim that people need to be involved in software design, not just as workers, but as someone who brings their entire life experience into the design. they identified some principles for participatory co-design, and they described guidelines to help to achieve these principles.

3 proto-persona+
we chose gothelf's proto-persona technique, gothelf (2012) and gothelf and seiden (2013), to conduct our study. however, in this work we made some improvements to gothelf's original proposal that resulted in a new version of proto-personas, which we named proto-persona+. proto-persona+ extends the original proto-persona by adding a set of guideline questions that aid the stakeholders in producing the proto-persona artifacts. we considered this adaptation fundamental to support non-technical stakeholders.

the main difference between the traditional persona, cooper et al. (2014), and the proto-persona is the order in which the construction steps are performed. the building of a traditional persona begins with broad demographic research about end-users. in contrast, the elaboration of a proto-persona is not driven by data collected directly from users; it is constructed based on the knowledge that specialists have of the domain, gothelf and seiden (2013). according to gothelf and seiden (2013), the design of proto-personas starts with assumptions of potential personas, and the validation of these assumptions is performed afterwards. additionally, the whole team contributes to the process of proto-persona creation by providing their premises about the end-users.
as the team members participate actively, this process becomes an effective way to create a shared understanding of the end-users' needs and characteristics. as a consequence, a feeling of empathy toward the end-users evolves among the team members. proto-persona produces a lean artifact, which is seen as one of the advantages of the technique. the artifact focuses on delivering only the relevant information about end-users, gothelf and seiden (2013).

after examining different proto-persona templates, we concluded that by joining different parts we could provide a better way to use the artifact. the mix of templates aids the stakeholders in describing ux requirements while keeping the concise format of the proto-personas. we considered two templates proposed by gothelf, wherein the information is reported in four quadrants. proposal (a) has two quadrants in which demographic information and the characterization of users (e.g. what the user looks like, the individual's name, and attributes that define the user) are described. the other two quadrants refer to attitudes (e.g. life history, routine, habits) and needs (e.g. what motivates them, what they do daily), gothelf (2012). in proposal (b), the first quadrant outlines the persona's name and its role in the software, the second describes the basic demographic information, the third informs the needs and frustrations of users about a product, and the fourth reports potential solutions that can fulfill the needs of the users, gothelf and seiden (2013). after analyzing the similarities and differences in both of gothelf's proposals, we rearranged the quadrants to give a new shape to proto-persona+. table 1 presents its objectives and their relationship with gothelf's models.

unlike other templates, proto-persona+ provides a set of guideline questions to aid the stakeholder during its elaboration. we decided to add guideline questions because professionals claim that persona is a difficult technique to handle, billestrup et al. (2014). those responsible for the creation of proto-personas+ fill the template by answering the guideline questions. however, it is not mandatory to answer all the questions to use our proposal. finally, the proto-persona+ proposal is flexible, allowing the guideline to be extended with other questions in future research. some questions could be more or less related to the domain in which the study is running. the flexibility for adapting the set of questions can improve the potential of proto-persona+ to capture relevant knowledge from the different types of stakeholders in a particular domain.

figure 1. proto-persona+: template and guideline questions

table 1. proto-persona+: purpose of the quadrants
(q1) objective: provides the persona characterization and relevant information about the individuals that impacts the software development. relation to gothelf's proposals: joins the two demographic quadrants of proposal a and quadrants 1 and 2 of proposal b.
(q2) objective: provides details of what users need to reach their objective while using the software. relation to gothelf's proposals: based on the quadrant about user needs presented in proposal a and on parts of quadrant 3 of proposal b.
(q3) objective: points out how users like to accomplish the steps to fulfill their objectives, a description that focuses on the content and on the interaction types that they prefer. relation to gothelf's proposals: based on the quadrant about attitude from proposal a and on some parts of quadrant 4 of proposal b.
(q4) objective: describes the difficulties faced by the user while interacting with the software and identifies the potential frustrations that could arise during software use. relation to gothelf's proposals: refined from quadrant 3 of proposal b.
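to make the shape of the artifact concrete, the sketch below encodes a proto-persona+ as a simple data structure. this is our illustration, not part of the original proposal: the class and field names are hypothetical, the quadrant name and guideline questions follow table 1 and the template, and the example answer is invented.

```python
from dataclasses import dataclass, field

# a minimal sketch of the proto-persona+ artifact, assuming it is useful to
# represent each quadrant as its guideline questions plus free-text answers.
# the quadrant name follows table 1; the example content is hypothetical.

@dataclass
class Quadrant:
    name: str
    guideline_questions: list[str]
    answers: dict[str, str] = field(default_factory=dict)  # question -> answer

@dataclass
class ProtoPersonaPlus:
    quadrants: list[Quadrant]

    def unanswered(self) -> list[str]:
        # answering every guideline question is not mandatory;
        # this helper only lists the questions left blank
        return [q for quad in self.quadrants
                for q in quad.guideline_questions
                if q not in quad.answers]

q1 = Quadrant("demographic data",
              ["who are they?", "what are their ages?",
               "what are their school levels?"])
q1.answers["who are they?"] = "a 10-year-old elementary school student"

persona = ProtoPersonaPlus([q1])  # q2-q4 would be built the same way
print(persona.unanswered())       # -> the two questions still unanswered
```

keeping the questions and the answers together in one record also makes it easy to see which parts of the template a stakeholder chose to fill, which matches the non-mandatory character of the guideline.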
4 study context
before starting our study, we decided to focus on a particular domain area. our research group has worked on software development to support the e-learning area; consequently, we have several contacts with non-technical stakeholders in this field. e-learning is the term that defines the use of electronic systems in the context of learning, applied both to in-class and distance courses, clark and mayer (2007). in our study, we took the m-learning area, which is a subset of the e-learning domain. m-learning applications allow the interaction of students and teachers in a learning environment through the use of mobile devices and the internet, dodero et al. (2014). although several companies in the world have demonstrated their interest in the development of applications for educational purposes, software teams often face difficulties in the m-learning domain, filho and barbosa (2013); chimalakonda and nori (2013). in addition to the common issues that arise during software development for mobile devices, filho and barbosa (2013), the m-learning domain demands close work with different domain stakeholders (e.g. teachers, government regulations, designers of learning contents) to capture the knowledge they have, dodero et al. (2014).

for our study, we used a scenario of an application within the m-learning domain. we chose a virtual museum application that would aid in the learning of history and arts. it was part of a project that the research group was developing. the scenario is described as follows: "an interactive museum is adopted by an elementary school to support the learning of students aged 9 to 11 in history and arts. the museum's collection comprises several galleries that deliver the artworks in different formats (e.g. games, videos, images, texts). access to the museum will be facilitated through a mobile application that should provide a variety of options for student interaction (e.g. speech recognition, touchscreen, and recognition of gestures) with the aim of being comprehensive to the public."

5 first round: using proto-persona+
5.1 planning
the first round of our study had the goal of answering rq1 and rq2. therefore, we analyzed whether non-technical stakeholders could describe ux requirements by using the proto-persona technique (i.e., proto-persona+). additionally, we verified the acceptance of this technique. to do this, we compared the artifacts produced by technical and non-technical stakeholders, i.e. software engineers and pedagogues, respectively, looking for evidence of ux requirements. our analysis focused on exploring qualitative data by examining the descriptions presented in the proto-persona+ artifacts. quantitative descriptive data were used only to illustrate the acceptance of the artifacts from the perspective of the participants.

the first round was conducted in five steps. before the sessions, the participants filled in (i) a profile questionnaire. then, we carried out (ii) a training session presenting the key concepts of the study to level the participants' knowledge before performing the activity.
to complement the training, (iii) a hands-on exercise was applied using an m-learning scenario that was different from the scenario of the study. next, (iv) the elaboration of the proto-personas+ was performed. finally, the participants (v) completed the questionnaire on the acceptance of using proto-persona+.

a set of artifacts to support the steps above was prepared. besides demographic information, the profile questionnaire (i) had questions to capture the participants' prior knowledge about m-learning applications. a consent form, wherein the participants agreed to the use of their data for academic purposes, was also prepared. a set of slides presenting concepts about personas and m-learning was designed to be used in the 15-minute training session (ii). for the hands-on exercise (iii), a scenario of an m-learning application was used. through this exercise, the participants could have contact with the proto-persona+ template. after performing these steps, the experiment to construct the proto-persona+ artifacts was conducted within a period of 40 minutes (iv). upon completion, the participants answered the acceptance questionnaire (v) on the proposal of proto-persona+, indicating their opinions and suggestions.

5.2 execution
the experiment was performed on two different days for the groups of technical and non-technical stakeholders. the study followed the steps that were planned and was conducted in the same physical space of a classroom at ufscar sorocaba. all the participants signed the consent form and declared to have at least some experience with e-learning software. a total of thirteen stakeholders participated: eight undergraduate students of a pedagogy course, who represented the non-technical stakeholders (i.e., pedagogues, ped), and five students of computer science courses, four bachelor's students and one postgraduate, who represented the technical stakeholders (i.e., software engineers, eng). participants built the proto-personas+ individually. each participant generated at least one and at most four artifacts. in total, 22 proto-personas+ were designed, 11 created by pedagogues and 11 by software engineers. the participants did not receive any recommendations or restrictions about the number of proto-personas they should produce. participants were encouraged to construct as many proto-personas+ as they considered appropriate to provide the characterization of the end-users in the virtual museum scenario.

5.3 analysis
a qualitative analysis was performed in two stages on the 22 artifacts produced by the participants. first, the proto-personas+ were evaluated to identify whether they reported ux requirements. then, we conducted an analysis of the results of the first stage to find out the focus of these requirements. as ux has several definitions in the literature, the researchers could have different interpretations regarding what counts as a ux description. to avoid divergent interpretations, the authors of this article decided to create an instrument to guide the data analysis. the instrument was based on a compilation of a set of ux dimensions. the works of winckler et al. (2013) and ardito et al. (2006) gave us the grounds to select and compile the ux dimensions. we selected these works as the basis of our dimensions because they discuss ux in the two areas our study focused on: mobile, with the work of winckler et al. (2013), and e-learning applications, with the work of ardito et al. (2006).
to define the dimensions, we examined the similarities between the dimensions described in the two works and selected those that attended to the particularities of our study domain. the dimensions of stimulus and value were selected from winckler et al. (2013). the work of ardito et al. (2006) presents a set of heuristics for evaluating e-learning applications and a methodology for using such heuristics. from this work, four dimensions were chosen: access, media, organization, and interaction. as a result, six dimensions were outlined and considered in our analysis. these dimensions defined the types of ux requirements that we had to search for in the proto-personas. to keep the researchers' attention on the same ux definitions, we wrote in the guide the meaning of each dimension in detail. the access dimension covers the aspects of technology and its quality for use; media specifies which media support the communication considering the e-learning context; organization focuses on how learning contents and navigation are arranged; stimulus examines the motivations that lead the participants to engage in the interaction, and encompasses impressions and opportunities for use; value explores what the use of that product brings to the students' learning; and interaction focuses on the results that each type of interaction delivers to the student.

considering the dimensions, first, four researchers searched for evidence of ux requirements on each proto-persona+. this first step was carried out by three master's students in software engineering (se) and human-computer interaction (hci) and an undergraduate student in computer science with experience in hci. after examining an artifact, the researcher had to assign labels to it. the labels indicated to what degree each ux dimension was fulfilled or not, considering the description found on the artifact. these degrees were classified into three levels (fulfilled completely, fulfilled widely, and fulfilled partially). besides, the researchers took notes to justify their rationale for assigning one or another classification for each dimension. when the researchers did not assign any degree, they did not make notes. the researchers examined each artifact from a whole perspective because the information in one quadrant of the proto-persona+ was complementary to the others. each researcher analyzed 11 artifacts: 2 researchers evaluated 5 proto-personas+ of pedagogues and 6 of software engineers, and 2 others evaluated 6 artifacts of pedagogues and 5 of engineers. in the second stage, two senior researchers in se and hci revisited the data and refined the results.

taking into account the results of the first stage, the first author of this article performed a new qualitative analysis. for this, the open coding technique was used, strauss and corbin (1998). open coding relates codes to chunks of text. these codes receive denominations that give a certain significance to the chunks of text they refer to, strauss and corbin (1998). subsequently, these codifications were revisited and grouped when patterns of information were identified. for instance, the code interface could be assigned to chunks of text that report information on the user interface. during the coding process, codes were assigned to parts of the notes written by the researchers. then, this set of codes was re-analyzed to search for patterns of information.
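as an illustration of how the labels assigned in this first stage could be aggregated, the sketch below tallies them per dimension and degree. the record layout and all sample labels are hypothetical, invented only to show the idea; the degree values follow the three levels described above.

```python
from collections import Counter

# a minimal sketch of aggregating the researchers' labels, assuming each
# label is recorded as (artifact id, ux dimension, degree). the degrees
# follow the three levels in the text; the sample data are hypothetical.
DEGREES = ("fulfilled completely", "fulfilled widely", "fulfilled partially")

labels = [
    ("ped-01", "access", "fulfilled widely"),
    ("ped-01", "stimulus", "fulfilled completely"),
    ("eng-03", "media", "fulfilled partially"),
    ("eng-03", "stimulus", "fulfilled widely"),
]

# count how often each (dimension, degree) pair was assigned
tally = Counter((dim, deg) for _, dim, deg in labels)
for (dim, deg), n in sorted(tally.items()):
    print(f"{dim:12s} {deg:22s} {n}")
```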
the results of these two steps were verified by two senior researchers in the areas of se and hci. the coding was performed using the nvivo 11 tool (http://www.qsrinternational.com/nvivo), and a total of 26 codes related to the ux dimensions were generated in this process.

5.4 threats to validity
an internal threat could be the tiredness of the participants. this could happen because the participants spent a long time concentrating on the activity of the experiment. to mitigate this, we scheduled a break between the hands-on training and the activity of proto-persona+ elaboration. an external threat refers to the use of students as participants. however, salman et al. (2015) provide evidence that there are few differences in the performance of students and practitioners when they perform an activity of which they have no previous knowledge. even with greater practical expertise, the fact that professionals do not know a new technique such as proto-persona+ allows us to compare them to students. salman et al.'s results allow us to state that our findings obtained from students using proto-persona+ can be extended to more experienced professionals who have never used the proto-persona technique. the construct threat was mitigated by the training and the hands-on exercise, when the participants had the opportunity to request clarification about the technique and the template. consequently, we considered that our sample of artifacts had good quality. additionally, all participants were prior users of e-learning applications. we handled the conclusion threat by using a common definition of ux based on dimensions. all the researchers inspected the artifacts using this guide, avoiding different interpretations of the meaning of ux. a bias could be introduced in the conclusions of the study by the fact that there were no limits on how many personas each participant could create. as a consequence, a participant could produce more personas than others, and therefore, s/he could become more representative within his/her group. however, our goal was not to verify how much information each participant offered individually. rather, our focus was to see the contributions that arose from the different types of stakeholders. besides, this study analyzes two groups with the same number of artifacts in both, which mitigates the problem of comparing unbalanced groups. nevertheless, we consider this an issue that other researchers should be aware of if they decide to run a similar study.

6 findings of the first round
the profile questionnaire showed that out of the 13 participants, 84.5% used mobile devices 5 or more days a week, 61.5% preferred to access the internet through their mobile phones, and 77% had participated in an online course in the last two years. the findings of the first round aided us in answering rq1 and rq2. we present the results in the following sections.

6.1 ux requirements
to respond to (rq1) which ux requirements do non-technical stakeholders describe while using the proto-persona technique?, we observed the codes generated from the open coding process. figure 2 presents the codes associated with the artifacts of each type of stakeholder. our analysis did not have the purpose of quantifying the occurrences of a code. rather, the qualitative analysis concentrated on exploring the evidence of ux issues that arose from the data.
in this analysis, we observed the convergence or divergence of the codes assigned to the artifacts of the different stakeholders. we see the codes that are common and those that are different considering the artifacts built by pedagogues and software engineers. we concentrate our discussion on the codes in bold, which represent the more relevant findings. both types of stakeholders described characteristics of enjoyment, stimulus, and satisfaction to highlight the importance of building an enjoyable experience that holds the student's attention during the learning process. however, by observing the codes, we can see that this objective was expressed in different ways. the pedagogue described a learning process that should be fun (see the code in bold), thereby showing the intention of organizing lessons from this perspective. on the other hand, the software engineers pointed out requirements regarding the focus on use and ease of use with the intention of avoiding user frustration. these examples are shown in figure 3. the examples highlight the parts where we see how each type of stakeholder describes a way of maintaining students' interest in learning.

the following examples show the different contributions provided by the stakeholders. in addition to focusing on distinct user information and user characteristics, each type of stakeholder provided specific user profile details. two non-technical stakeholders specified requirements for visual impairment or attention deficit that can serve users with special needs. two technical stakeholders delineated the characteristics of users who like to learn by participating in interactive spaces where they can interact with their colleagues. these examples can be seen in the two artifacts in figure 4.

figure 5 shows, per dimension and per type of stakeholder, the codes that were found in common or not in the proto-personas+. in the access dimension, we see that the pedagogues' artifacts had codes related to accessibility, which refers to the availability of hardware that would meet the special needs of each user, as well as the forms of interaction that could serve this audience. only the artifacts of the technical stakeholders provided different codes in the media dimension. while reporting the use of different types of media (e.g. video), the software engineer showed concern about media that could provide interactions; therefore, the interaction mode code was assigned to the artifact of this stakeholder. an example of this is the interaction with text on the small screen of a mobile device, which can introduce barriers for users to perform their actions. in the organization dimension, the pedagogues focused on how to structure the learning path for a given student profile. the codes assigned to this dimension were application complexity, focus on learning process, user restrictions, and student objective. additionally, from the proto-personas+ created by the software engineers, we could see the focus on building applications that could motivate the students to interact by providing different media. both types of stakeholders were concerned about stimulating the students by offering an enjoyable application (see figure 3). this can be seen in the stimulus dimension, which had the codes satisfaction and fun related to the proto-personas+ created by the pedagogues. in contrast, the software engineers considered that taking care of the aspects that bring frustration would encourage the student to continue using the application.
in the value dimension, the codes media, device, and user restrictions demonstrated the concern with enriching the user experience during the learning process. finally, in the interaction dimension, we could identify the contributions that the pedagogues made by observing their artifacts. the accessibility and application complexity codes showed that these stakeholders concentrated their attention on delivering a more personalized interaction in accordance with the users' profiles. consequently, these issues can bring stimulus and value to the ux.

6.2 acceptance of proto-persona+
to answer (rq2) how is the acceptance of the use of the proto-persona+ technique by these stakeholders?, three different analyses were performed: (i) the importance that the stakeholders perceived in the template's quadrants to perform the activity; (ii) the usage and relevance the stakeholders saw in the guideline questions to complete the quadrants; and (iii) the perception of usefulness and ease-of-use regarding proto-persona+. the participants answered the questions after finishing the elaboration of the proto-personas+. given the small size of our sample, we analyzed the data from a descriptive perspective. the results are presented in detail in the following subsections.

6.2.1 importance of quadrants
we explored the importance of the quadrants (figure 1) in relation to the description of the proto-persona+ from the perspective of the participants. for this, the participants classified each quadrant into one of the following categories: very important (vi), important (imp), unimportant (ui), or irrelevant (irr). table 2 presents these classifications in two complementary representations: the sum of the classifications for each quadrant in brackets and the percentage of the participants that chose that classification. in table 2, it can be seen that all the quadrants were almost solely classified as very important or important. the quadrant (q2) objectives and necessities was considered very important by all the stakeholders. although all the quadrants seemed to have similar importance to the stakeholders, an exception was observed for quadrant (q1) demographic data: only one software engineer (i.e., a technical stakeholder) pointed out q1 as unimportant. comparing the classifications for q1, it could be seen that the software engineers mostly pointed out this quadrant as important, while the pedagogues (i.e., non-technical stakeholders) indicated it as very important. the personas technique focuses on developing empathy between users and developers; therefore, we can conclude that the non-technical stakeholders can contribute to characterizing the end-users. in contrast, the technical stakeholders were not concerned with these aspects.

figure 2. codes assigned to the artifacts of each type of stakeholders
figure 3. two ways of working on student engagement: examples of technical and non-technical stakeholders
figure 4. definitions of different end-user profiles: examples of technical and non-technical stakeholders
table 2. degree of importance of the quadrants
             q1          q2          q3          q4
eng (5)
  vi         20% (1)     100% (5)    60% (3)     60% (3)
  imp        60% (3)     0% (0)      40% (2)     40% (2)
  ui         20% (1)     0% (0)      0% (0)      0% (0)
  irr        0% (0)      0% (0)      0% (0)      0% (0)
ped (8)
  vi         75% (6)     100% (8)    62% (5)     62% (5)
  imp        25% (2)     0% (0)      38% (3)     38% (3)
  ui         0% (0)      0% (0)      0% (0)      0% (0)
  irr        0% (0)      0% (0)      0% (0)      0% (0)
total (13)
  vi         53.8% (7)   100% (13)   61.5% (8)   61.5% (8)
  imp        38.5% (5)   0% (0)      38.5% (5)   38.5% (5)
  ui         7.7% (1)    0% (0)      0% (0)      0% (0)
  irr        0% (0)      0% (0)      0% (0)      0% (0)

6.2.2 usage and relevance of the guideline questions
we examined the participants' answers regarding their perception of the relevance and the use of the guideline questions. an open question asked the participants for suggestions to improve the proto-persona+ template. table 3 presents the results as percentages and as absolute numbers of "yes" answers. this double representation provides a more realistic overview, considering that we had a small sample of technical stakeholders; the percentages alone might not clearly indicate the differences and similarities between the two types of stakeholders.

figure 5. codes per ux dimensions and per type of stakeholders

table 3. usage and perception of relevance of the guideline questions (uses and relevance, per group)
q1 who are they? uses: eng 100% (5), ped 100% (8); relevance: eng 80% (4), ped 100% (8)
q1 what are their ages? uses: eng 100% (5), ped 100% (8); relevance: eng 100% (5), ped 100% (8)
q1 what are their school levels? uses: eng 100% (5), ped 100% (8); relevance: eng 100% (5), ped 100% (8)
q2 what do they want to accomplish? uses: eng 100% (5), ped 100% (8); relevance: eng 100% (5), ped 100% (8)
q2 what do they need to reach their objective? uses: eng 80% (4), ped 88% (7); relevance: eng 100% (5), ped 88% (7)
q3 what do they like? uses: eng 100% (5), ped 75% (6); relevance: eng 100% (5), ped 100% (8)
q3 what are they better at doing? uses: eng 80% (4), ped 38% (3); relevance: eng 80% (4), ped 50% (4)
q3 how do they like to do it? uses: eng 40% (2), ped 75% (6); relevance: eng 80% (4), ped 100% (8)
q4 what are the difficulties they can face? uses: eng 80% (4), ped 100% (8); relevance: eng 100% (5), ped 100% (8)
q4 what frustrates them? uses: eng 100% (5), ped 63% (5); relevance: eng 100% (5), ped 88% (7)
q4 what are the known issues that affect their interaction? uses: eng 80% (4), ped 100% (8); relevance: eng 80% (4), ped 100% (8)

in table 3, we can see that the software engineers used and considered relevant the q3 question what are they better at doing?. in contrast, most of the pedagogues did not show the same results. an inversion was observed for the q3 question how do they like to do it?, which was little used by the software engineers but had great application for the pedagogues. finally, the q4 question what frustrates them? presented a considerable difference in the responses; while all the software engineers used it and found it relevant, the pedagogues used it very little, although they found it to be a relevant question. these differences in perception between the two types of stakeholders restate that both have the potential to give different contributions. by exploring the stakeholders' written notes for q3, it can be observed that the questions motivate different points of view. the pedagogues focused on encouraging students to overcome their barriers. they reported the need to perform activities that helped students develop new skills, and not only to improve something at which they were already considered good. in contrast, the software engineers emphasized what the student already knows, in an attempt to stimulate such student behavior during the use of the application.
among the 13 participants, only one pedagogue and one software engineer gave suggestions through the open question; both concerned q1 (demographic data). the pedagogue suggested adding the question "do users have any deficiencies or restrictions?", which focuses on the individual characterization of users. in contrast, the software engineer suggested a more technological question: "do users have access to mobile devices?".

we also examined the number of times that each question was answered by the participants and identified the questions that were more important for the different stakeholders. table 4 indicates, for each question, which type of stakeholder used it more (uses column) and which type presented more answers pointing to its relevance (relevance column); empty cells indicate no difference between the two types. software engineers demonstrated greater interest in the use of the q3 questions. in contrast, the q4 questions were the most used by the pedagogues. the questions of quadrants q1 and q2 were answered in a similar manner by the two types of stakeholders.

table 4. preferences for each guideline question

     guideline question                                          uses   relevance
  q1 who are they?                                                -       ped
     what are their ages?                                         -       -
     what are their school levels?                                -       -
  q2 what do they want to accomplish?                             -       -
     what do they need to reach their objective?                  ped     eng
  q3 what do they like?                                           eng     -
     what are they better at doing?                               eng     eng
     how do they like to do it?                                   ped     ped
  q4 what are the difficulties they can face?                     ped     -
     what frustrates them?                                        eng     eng
     what are the known issues that affect their interaction?     ped     ped

6.2.3 perception of usefulness and ease-of-use

this analysis was based on the responses to the technology acceptance model (tam) questionnaire, conceived by davis (1989), which aims to analyze the acceptance of a certain technology by a group of participants (dias et al., 2011). we included a question regarding the ease of memorizing the technique, based on the work of steinmacher et al. (2015). table 5 lists the questions. for each question, the participants chose the option that best represented their degree of agreement. the options available were "fully agree", "largely agree", "partially agree", "partially disagree", "largely disagree", and "fully disagree".

[figure 6: perception of usefulness and ease-of-use]

table 5. questions used from the tam questionnaire

  usefulness
  u1. by using the persona technique, i was able to describe the user characteristics more quickly.
  u2. by using the persona technique, i was able to enhance my ability to describe the user characteristics.
  u3. by using the persona technique, i was able to enhance my efficiency during user characteristics description.
  u4. by using the persona technique, i was able to more effectively describe the user characteristics.
  u5. by using the persona technique, i was able to improve my perception about the good practices for describing user characteristics.
  u6. i consider the persona technique useful in describing the user characteristics.

  ease-of-use
  f1. it was easy to learn to use the persona technique.
  f2. i was able to use the technique in the way i intended to.
  f3. the orientations of use for the persona technique were easy to understand.
  f4. i understand what happened during my interaction with the persona technique.
  f5. it was easy to gain ability to use the persona technique.
  f6. the persona technique allows flexibility to describe the user profile using the quadrants.
  f7. it is easy for me to remember how to use the persona technique.
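as a small illustration of this descriptive analysis, the r fragment below tallies agreement levels for one tam question. it is a minimal sketch with hypothetical responses (the paper reports only aggregated percentages); the data frame layout, column names, and sample answers are ours, not the study's data.

  # minimal sketch with hypothetical data: tallying tam agreement levels.
  levels_ <- c("fully agree", "largely agree", "partially agree",
               "partially disagree", "largely disagree", "fully disagree")
  responses <- data.frame(
    group    = c("ped", "ped", "ped", "eng", "eng"),
    question = c("u5", "u5", "u5", "u3", "u3"),
    answer   = factor(c("partially agree", "partially agree", "fully agree",
                        "largely agree", "largely agree"), levels = levels_)
  )
  # percentage of each agreement level for question u5 among the pedagogues
  with(subset(responses, group == "ped" & question == "u5"),
       round(prop.table(table(answer)) * 100, 1))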
by observing the percentages of both types of stakeholders, it can be seen that a great number of questions was answered with "largely agree", with few exceptions. the difference in agreement perceptions was high for question f5, about the easiness to gain ability to use the technique: 60% (3 of 5) of the software engineers answered "partially agree" and 20% (1 of 5) answered "partially disagree". revisiting the notes in the proto-persona+ artifacts, we found that the software engineers struggled to describe the proto-persona+, which can explain the low "easy to gain ability" perception on this question. the pedagogues, in turn, showed a lower perception that the technique improved their efficiency in describing the audience: the majority of the pedagogues' answers to question u5 was "partially agree" (50%, 4 of 8). however, 60% (3 of 5) of the software engineers indicated that they "largely agree" that the proto-persona+ improved their efficiency in describing the users (question u3). overall, only the software engineers pointed out some degree of disagreement ("partially disagree"). moreover, "fully agree" prevailed in the pedagogues' responses, reiterating their perception that the technique was useful to describe end-users.

7 second round: using the proto-personas+ in design

7.1 planning

after exploring the creation of the proto-persona+ artifacts, we decided to investigate whether these artifacts could support developers during the prototyping of solutions. the results of this investigation helped answer (rq3) which ux requirements presented in the proto-personas+ can support the prototyping of user interfaces?. the objective of this second round was to analyze whether the information from the proto-persona+ artifacts contributed to the design of low fidelity prototypes. the participants constructed low fidelity prototypes by using the storyboard technique. a storyboard shows people's interaction with an application, often delivering a view that complements static drawings of user interfaces; it simulates the flow that users can follow from one part of the interface to another (rogers et al., 2015). in this round, our subjects were software developers.

in our study, the storyboard artifacts were drawn on paper. the participants could enrich their proposals by adding stickers around the interface elements. these stickers contained supplementary textual information, such as actions associated with buttons, navigation flow between the screens, and so on. additionally, in the stickers, the developers reported which parts of the proto-personas+ and of the scenario had aided them in making their design choices. through these textual justifications, we could analyze what information they used and from which proto-persona+ it originated.

this second round was conducted in four steps: (i) a pre-analysis of the participants' knowledge about the hci techniques that were used in the activity; (ii) a training session on hci techniques and mobile development; (iii) a hands-on exercise on prototyping a user interface by using a scenario; and (iv) the construction of the storyboards considering the proto-personas+. artifacts to support the steps were prepared.
the pre-questionnaire of participants' profiles (i) contained both personal information and questions about their knowledge of personas and prototyping techniques and of the nielsen heuristics (nielsen, 1995). the training session (ii) consisted of a two-hour class that presented the techniques required for the development of the storyboards. for the hands-on exercise (iii), a two-hour activity was planned, wherein some proto-personas+ and an example of a scenario were made available, so the participants could experience the same kind of artifact that they would use in the study. for the construction of the storyboards (iv), a consent form was distributed to the participants to indicate their agreement on the use of their data for the purpose of academic research; here, the same scenario used in the first round was applied.

7.2 execution

thirty-six undergraduate students in computer science at ufscar participated in the study, henceforth called developers. they answered the pre-questionnaire, and in the pre-analysis we were able to assess their knowledge about hci techniques (see figure 7). we noticed that 78% of the developers did not know the persona technique; 67% did not know the nielsen heuristics; and, regarding the prototyping technique, 22% did not know it and 47% knew it but had never used it. from the questionnaire results, we separated the developers into 18 pairs, balancing the pair composition based on their knowledge of the techniques.

[figure 7: participants' knowledge about hci techniques]

as noticed, the participants did not have practical knowledge about the techniques we planned to use (i.e., personas and storyboards). to mitigate this, we conducted the training session in two steps to leverage the participants' knowledge. first, a senior professor in se and hci carried out a two-hour class covering personas, storyboards, and how the nielsen heuristics could help them apply good design practices. later, on the same day, two master's students conducted a two-hour hands-on session in which the participants built a storyboard based on a new scenario and examples of proto-personas+ (these artifacts were different from those used later).

a week later, the study was conducted in a three-hour session, wherein the 18 pairs of developers constructed storyboards by using the scenario of the study (i.e., the same one that was used to construct the proto-personas). we requested the pairs to select only two proto-personas+ to support their work; this decision to limit the choice to two artifacts was taken so that the participants did not have to deal with a large diversity of user profiles. the pairs received the 22 proto-personas+ in a random order, shuffled before each presentation, so that the same artifact was not always placed in the same position in the order of presentation and the participants would not always select the same artifacts. by doing this, we avoided biases that the order of presentation could cause on the selection of the proto-personas+. the pairs built the storyboards and fixed post-it stickers to report their design decisions. the participants were instructed to explain, through the stickers, which parts of the scenario and of the proto-personas+ gave them the insights for the design. each pair generated five user interfaces on average (ranging from three to nine).
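the paper does not describe how the presentation orders were generated. as one illustrative possibility, the r sketch below builds a distinct order per pair from cyclic rotations of a random base ordering, which automatically satisfies the property reported in the threats-to-validity discussion (no artifact appearing more than twice in the same ordinal position); the variable names and the seed are ours, not the authors'.

  # minimal sketch (illustrative, not the authors' procedure): one presentation
  # order of the 22 proto-personas+ per pair, built as cyclic rotations of a
  # random base ordering.
  set.seed(42)                         # hypothetical seed, for reproducibility
  n_artifacts <- 22; n_pairs <- 18
  base <- sample(n_artifacts)          # random base ordering of artifact ids
  orders <- t(sapply(seq_len(n_pairs) - 1, function(k)
    base[(seq_len(n_artifacts) + k - 1) %% n_artifacts + 1]))
  # with rotations, each artifact occupies each ordinal position at most once,
  # which is within the "at most twice" bound reported by the authors
  max(apply(orders, 2, function(pos) max(table(pos))))   # yields 1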
7.3 analysis

we performed the analysis in two phases. the first phase examined which proto-personas+ were selected and applied by the developers in the construction of the storyboards. in the second phase, through a more in-depth analysis, we explored which parts of the proto-persona+ were used. from this second analysis, we intended to understand how the information found in the proto-personas+ aided the developers in building the solutions. we first identified the most chosen proto-personas; then, by considering the developers' notes about the use of the proto-personas, we could identify which parts of the proto-personas+ were used most.

the first phase followed the same procedure as the first round, wherein the ux dimensions were applied (see the definitions in section 5.3); differently from the first round, here the storyboards were the targets of the evaluation. twelve software engineers with different profiles attended this session: two undergraduate students in computer science from ufscar (campus sorocaba); five master's students, of which four were from the graduate program at ufscar (campus sorocaba) and one from the graduate program at unicamp; two graduates working for more than three years in software companies; and three masters in computer science. all had experience in hci and a background in computer science. none of these evaluators had participated in the previous evaluation of the proto-persona+ (i.e., the first round described in this work).

the storyboards were distributed among the evaluators. each storyboard produced by a developer pair had five low fidelity prototypes on average; therefore, the division of what each evaluator would explore considered the following factors: (i) we made a uniform distribution, so that each participant received the same number of prototypes to evaluate; and (ii) each storyboard was evaluated by two participants, a redundancy intended to enrich the analyses. however, the same pair of evaluators did not analyze the same set of storyboards. considering the ux dimensions, each evaluator examined fifteen low fidelity prototypes. as a result, the evaluators took notes justifying whether a given ux dimension was applied or not in each prototype. none of the participants had seen the proto-personas+ used to create the prototypes they were evaluating.

afterwards, we proceeded to the second phase, wherein the open coding process happened in two iterations. the first author of this article inspected the notes that the evaluators took in the first phase based on the 24 codes that had been previously generated. later, the fourth author refined the findings, and 23 new codes were generated at the end of this round, for a total of 49 codes in the two experimental rounds combined.

7.4 threats to validity

to deal with a possible bias in the developers' preference among the proto-personas+, which is an internal threat, we presented the 22 artifacts in a random order to the participants. in our arrangement, a proto-persona+ did not appear more than twice in the same ordinal position of the list. with the order changed for each group, the threat of a possible false preference was mitigated and the results became more reliable for the inferences drawn from them.
another threat to internal validity refers to the motivation of the participants during the experiment, because the workshop was applied during a compulsory course in computer science. we collected the participants' opinions about the activity at the end of the study. the feedback showed that they considered the activity important, e.g., "i found the activity very interesting" and "[the proto-persona] was useful to the achievement of my goal...", indicating that the participants felt motivated to participate in the study.

a threat to external validity was the fact that the storyboards were constructed by participants who had had no prior contact with proto-persona+ and storyboards. to mitigate this threat, we conducted a training session about the proto-persona+ and storyboard techniques and a hands-on exercise using them. regarding this same threat, we arranged the developers in pairs with complementary knowledge. similar to the first round, the subjects here were also students; salman et al. (2015) provide evidence that students and experienced professionals perform equally in activities that are new to both. although storyboarding and prototyping are widely applied techniques, in our case we changed the traditional application of both: by using a scenario and the proto-personas, we provided a method that mitigates the developers' lack of experience, because it differs from the usual prototyping.

8 findings of the second round

using the results of the second round, we answered (rq3): which ux requirements presented in the proto-personas+ can support the prototyping of user interfaces?. the details are presented in the following two subsections.

8.1 developers' preferences

firstly, we identified the proto-personas+ that the developers chose and used, considering that each pair should select only 2 of the 22 available proto-personas+. we organized this result into three groups of proto-personas: group (i) comprises the proto-personas+ that were widely used, being the most chosen; group (ii) comprises the proto-personas+ that were chosen a number of times close to the average of the distribution of choices; and group (iii) comprises the proto-personas+ that were chosen at least once by the pairs (i.e., developers). table 6 summarizes these groups and indicates the id of each proto-persona, the type of stakeholder that created it, and some features of the artifacts. one of the goals of proto-personas+ is to promote empathy between developers and users; therefore, the use of an image to represent the persona could be important.

we also obtained direct and indirect findings about the use of the artifacts. the direct analysis comprises the absolute number of references that each proto-persona+ received from the developers. the indirect analysis, in turn, comprises the authors' assessment of the preference between the two proto-personas+ chosen by each pair: considering the two artifacts, we analyzed which of them was most emphasized during the construction of the storyboard by counting the number of references to the parts of each artifact. the indirect analysis resulted in two cases: (1) equal interest, wherein both artifacts obtained the same number of references; and (2) different interest among the proto-personas+ (classifying the artifacts into primary or secondary personas).
the classification into primary and secondary personas happens when there is more than one user profile that will use the application, but one of them should be considered with higher priority for being the primary user of the application (cooper et al., 2014). a primary persona represents the user profile that is the focus of the application and therefore has its needs prioritized; a secondary persona refers to a user profile that will also use the application, but whose needs are not a priority. based on these definitions, we classified the proto-personas+ that fit case (2): the proto-persona+ with the highest number of referenced parts was classified as primary, whereas the other was classified as secondary.

table 6. proto-personas selected for the construction of the storyboards

                                              direct             indirect
  group   id   stakeholder   image   references   primary   secondary   equal
  i        9   pedagogue       x         10          5          2         3
          22   engineer        x          8          4          0         4
  ii       8   pedagogue       x          3          1          1         1
          21   engineer        x          3          0          1         2
          14   engineer                   3          0          3         0
  iii      2   pedagogue                  2          0          1         1
          20   engineer                   2          1          1         0
          18   engineer                   1          1          0         0
          11   pedagogue       x          1          0          0         1
           1   pedagogue                  1          0          1         0
           7   pedagogue       x          1          0          1         0
          19   engineer                   1          0          1         0

some relevant results can be found in table 6. all the proto-personas+ that had an image in (q1) (i.e., demographic data) were chosen at least once by the pairs of developers. additionally, of the five most chosen artifacts (i.e., groups i and ii), four had an image associated with the proto-persona. this fact reinforces the idea that persona is a technique that stimulates empathy in the developer: the use of an image to represent the target audience is a way to instigate developers to think and to associate their ideas with those of the user represented in the persona (grudin, 2006).

it can be seen that two artifacts, ids 9 and 22, obtained the highest numbers of references (group i) in both the direct and the indirect analysis. they were designed by a pedagogue and a software engineer, respectively. in the direct analysis, we observed that proto-persona+ 9 was chosen by 10 of the 18 pairs of developers, whereas 22 was chosen by 8 of the 18. considering that the ten other selected proto-personas+ were chosen by at most 3 pairs each and that the remaining 10 artifacts were not chosen by any pair, we can see a clear preference for artifacts 9 and 22 to support the construction of the storyboards. additionally, proto-personas+ 9 and 22 were classified as primary personas in most cases: examining the data, proto-persona+ 9 was used 5 times as primary, 22 was used 4 times as primary, and all other artifacts obtained at most one emphasis as a primary persona. this reaffirms the results found in the direct analysis.

to explain the preference for proto-personas+ 9 and 22, the four authors of this article conducted a qualitative analysis of the content of the quadrants of these proto-personas. the results demonstrated that both artifacts defined more clearly the users they represented: they provided rich details of who the end-user is, and these details were especially evident in quadrant 2 (objectives and needs).

fisher's exact test (fisher, 1922) was applied to analyze the existence of statistically significant differences between the proto-personas+ produced by the pedagogues and by the software engineers.
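as an illustration of the procedure, the r fragment below runs fisher.test on 2x2 contingency tables. the counts are our own reading of table 6 over the 12 selected artifacts (the paper does not show the contingency tables themselves), and they reproduce two of the p-values reported in table 7.

  # minimal sketch; the paper states only that fisher's exact test was run in r.
  # counts derived from table 6: rows = characteristic, columns = whether the
  # artifact was referenced at least 3 times (yes / no).
  creator <- matrix(c(3, 3,    # engineer-created:  3 with >=3 refs, 3 below
                      2, 4),   # pedagogue-created: 2 with >=3 refs, 4 below
                    nrow = 2, byrow = TRUE)
  image   <- matrix(c(4, 2,    # with image:    4 with >=3 refs, 2 below
                      1, 5),   # without image: 1 with >=3 refs, 5 below
                    nrow = 2, byrow = TRUE)
  fisher.test(creator)$p.value   # 1, as reported in table 7
  fisher.test(image)$p.value     # 0.2424242, as reported in table 7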
by running the same test, we also checked the influence that an image has on the choice of an artifact. fisher's exact test is recommended for small samples of categorical data and calculates the exact significance of the deviation from a null hypothesis through the p-value. the statistical analysis was conducted over pairs of characteristics of the proto-personas+, with their respective null (h0) and alternative (h1) hypotheses: for each pair, considering that a characteristic c1 could influence a result c2 (see table 7), the hypotheses were (h0) there is no influence of c1 on c2 and (h1) there is an influence of c1 on c2.

table 7. fisher's exact test results

  c1                                            c2                                                       p-value
  stakeholder that created the proto-persona+   classification of the artifact as a primary persona      1
  stakeholder that created the proto-persona+   classification of the artifact as a secondary persona    1
  stakeholder that created the proto-persona+   classification of the artifact as "equal interest"       1
  stakeholder that created the proto-persona+   number of references of the artifact in the
                                                prototypes equal to or greater than 3                    1
  presence of a representative image            number of references of the artifact in the
                                                prototypes equal to or greater than 3                    0.2424242

we ran the tests using the r software environment (https://www.r-project.org/), assuming a significance level of 0.05. table 7 shows the final p-values obtained after performing fisher's exact test. the p-values do not indicate any statistical significance that would allow rejecting the null hypothesis for any of the analyzed pairs of elements. statistically, the proto-persona+ creator (i.e., pedagogue or engineer) could not be related to how the proto-persona+ was used. similarly, the fact that a proto-persona+ presented an image did not affect the number of times that artifact was referenced in the prototypes.

finally, we explored which proto-personas+ were chosen from the perspective of who created them. table 8 maps each storyboard to the type of stakeholder who authored the artifacts used in its construction. only four pairs used proto-personas+ created solely by pedagogues, and the same number used only those built by software engineers. this confirms that the developers mostly opted to build their solutions considering proto-personas+ from both specialties. we must restate that the set of artifacts was delivered in a random order and without any indication of which of the two types of stakeholders elaborated them.

table 8. type of proto-persona selected vs. storyboards

  proto-persona+ used      storyboard ids                                  number of times
  only pedagogue           s1, s9, s12, s16                                       4
  only software engineer   s4, s10, s11, s13                                      4
  mix of both              s2, s3, s5, s6, s7, s8, s14, s15, s17, s18            10

the results showed that the combination of artifacts from different stakeholders aided the developers in most cases.

8.2 application of ux requirements

the codes that emerged in the analysis of the storyboards were related to the codes found in the analysis of the proto-personas' descriptions (see section 6).
to support our presentation of the results, we discuss the codes of the storyboards in comparison with the codes of the first round of analysis, presented in figure 5. to illustrate the discussion, the figures split the codes into three groups: group a represents the five most recurrent codes for a dimension; group c represents the codes that appear only once in that dimension; and group b represents the codes that arose more than once in a dimension, but not often enough to be among the top five codes (group a).

[figure 8: codes of the access dimension found in the storyboards]

in figure 8, it can be seen that issues regarding the physical devices and the access infrastructure were the main focus of the participants, as noted by the recurring codes hardware (a), internet (a), and characteristics of device (a). when comparing the codes found in this analysis with the ones uncovered in the proto-personas+ analysis, interaction mode (a) appeared as one of the most present codes for this dimension. this code was identified in the proto-personas+ produced by the software engineers and appeared in several ux dimensions in the previous analysis (figure 5). this result demonstrates that the concern with these forms of interaction was carried into the prototypes of the storyboards to meet users' needs. we also observed the codes universal accessibility (c) and social interaction (c), which refer to the two profiles built by the pedagogues and the software engineers, respectively, as shown in figure 4. this finding illustrates how the knowledge of different stakeholders contributed to enriching the description of end-user details.

[figure 9: codes of the media dimension found in the storyboards]

in the media dimension (see figure 9), the media of interaction image (a) and game (a) were the major codes mentioned. the code interaction mode had been found several times in the analysis of the proto-personas+ created by the software engineers; in this context, it reiterates the first-round results concerning which media could affect the users' learning process and, consequently, their user experience. considering the common points between the pedagogues and the software engineers (see figure 5), we noticed that they concentrated on focus on learning process (b), student preferences (b), and student objective (b). finally, the concern with a misleading (b) perception of how a medium works or what it stands for also emerged as a code, demonstrating how app overall organization problems (a) and frustration (c) can affect the students' learning process.

[figure 10: codes of the organization dimension found in the storyboards]

simplicity (a), easiness of use (a), navigation (a), app overall organization (a), and confusion (a) were the codes that arose in the organization dimension (see figure 10). these codes indicate that applications in this domain should not introduce complex ways of interaction and should provide a simple manner of use.
promoting stimulus (b) to user engagement and a pleasant (c) experience should also be goals of these applications while the user learns and uses them. by observing the previous results (see figure 5), we noticed that the artifacts developed by the pedagogues had the codes user restrictions (b), application complexity (b), and focus on learning process (a) assigned to them; this demonstrates the importance of providing a learning application in which users can have an easy journey.

[figure 11: codes of the stimulus dimension found in the storyboards]

in the stimulus dimension (see figure 11), the codes media (a), stimulus (a), and focus on learning process (a) stood out. looking at figure 5, it can be seen that both pedagogues and software engineers focused on the same points during the construction of the proto-persona+. regarding the fun and satisfaction codes that were assigned to the pedagogues' proto-personas+, we noticed that the low fidelity prototypes had similar codes associated with them (i.e., fun (b), curiosity (b), and enjoyment (b)); this is evidence that the developers tried to keep an exciting experience for the students. considering the codes related to the software engineers' proto-personas+, frustration (b) and student objective (b) were identified in the prototypes, which demonstrates the concern these stakeholders had with encouraging students to use the application. lastly, focusing on the app overall organization (a), the prototypes provided means by which students can customize their learning process and, consequently, improve their experience.

[figure 12: codes of the value dimension found in the storyboards]

a prevailing occurrence of the codes media (a), focus on learning process (a), and game (a) could be found in the value dimension (see figure 12). these three codes had already been found for both types of stakeholders (i.e., pedagogues and software engineers) in the results of the proto-personas' analysis (see figure 5). while exploring the proto-personas+ of both stakeholders, we saw their concerns with the learning process, the user experience, and the use of suitable channels of interaction. codes such as stimulus (b), user experience (b), satisfaction (b), fun (b), and pleasant (c) demonstrate that the developers who constructed the storyboards were able to capture such concerns. considering the code app overall organization (a), we noticed that only the proto-personas+ of software engineers had this code assigned; this provides evidence that these stakeholders were concerned with the integration of the different resources and features of the application.

[figure 13: codes of the interaction dimension found in the storyboards]

finally, in the interaction dimension, focus on learning process was the main common code. accessibility (b) and social interaction (c) are codes that were pointed out in the proto-personas+ of the pedagogues and of the software engineers, respectively, and that appeared again in the analysis of the storyboards. these codes allowed us to reaffirm the different contributions that both types of stakeholders provide to the design of solutions. by observing the differences between the two types of stakeholders, we see that interaction mode (a) and stimulus (b) were found in the proto-personas+ of software engineers and pedagogues, respectively. these codes clearly demonstrate that the software engineers were more concerned with technical aspects of the interaction, whereas the pedagogues worried about keeping students motivated to learn.
9 discussion

this study investigated the effect of the participation of non-technical stakeholders on ux requirement specification. it differs from other works, wherein non-technical stakeholders provide information only as passive participants; in our investigation, we considered these stakeholders active members during the elicitation of ux requirements by using proto-personas+. the findings showed that non-technical stakeholders brought important contributions to the elaboration of ux requirements: they could point out requirements that report ux from a perspective different from that provided by the technical stakeholders. ux requirements are strongly context-dependent, and this context changes constantly over time (kashfi et al., 2017); non-technical stakeholders are the ones who hold the knowledge about this context. from the findings, it could be noticed that, although the technical stakeholders had experience in the domain, the non-technical ones demonstrated attention to the aspects that can directly influence the acceptance of the software (hadar et al., 2014).

by looking at the steps that the technical and non-technical stakeholders followed, we can summarize the findings as follows. the main point in the first round was the preparation of the stakeholders to apply the proto-persona+ technique correctly. firstly, the non-technical stakeholders took part in a training session covering the proto-persona+ technique, its benefits, and the purposes of its use. additionally, a scenario about the domain of the application was presented to these stakeholders so that they had a clear view of the scope of the application. subsequently, a hands-on exercise on the use of the technique was run; this step allowed the participants to clarify their doubts, avoiding misunderstanding and misuse of the proto-persona+. finally, the proto-persona+ artifacts were constructed by using the template with the guideline questions, supported by the information presented in the previous steps.

in the second round, the focus moved to the use of the information described in the proto-personas+. the artifacts produced in the previous round were explored, and the information in them supported the construction of the user interface prototypes. to make good use of the information provided by the non-technical stakeholders, we carried out actions so that the participants (i.e., developers) gained the expertise to use the artifacts. first, the developers took part in a training session about the concepts of proto-personas+ and how to use these artifacts in practice. the scenario used in the previous round was presented to keep the same application scope. afterwards, a hands-on exercise was run so that the participants became acquainted with the proto-persona+ artifacts; this hands-on focused on reading the details available in the proto-persona+ and then extracting the information the developers considered relevant. an example artifact was delivered to the developers, who should read and explore it as well as ask questions to clarify their doubts. afterwards, all the proto-personas+ produced in the first round were offered to the developers, who could select the ones they considered to provide useful information for their activity of prototyping the user interfaces.
we must mention that the ux requirements that were raised are relative to a rather minimalist application in the m-learning area, which enables them to be reused within the same scope. however, the results should be explored in other e-learning applications to verify the reuse of these requirements. it is also relevant to discuss the scope of the answers to our research questions. since this is a first study about the contribution that non-technical stakeholders bring to the specification of ux requirements, we tried to understand this phenomenon by asking exploratory questions (easterbrook et al., 2008) aimed at characterizing the non-technical stakeholders' contributions. the answers to our research questions are, however, context-dependent: different stakeholders would describe different ux requirements. nevertheless, the answers to these questions result in a clearer understanding of the phenomenon, since they show that non-technical stakeholders bring a valid contribution to the specification of ux requirements.

considering (rq1) which ux requirements do non-technical stakeholders describe while using the proto-persona technique?, we could answer that the non-technical stakeholders elicited different ux requirements when compared to the technical stakeholders. exploring the artifacts that both types of stakeholders produced, we could affirm that they contributed from different perspectives. the first round showed that both types of stakeholders described the ux requirements differently; e.g., in their proto-personas, the two groups characterized differently how to keep the student using the application. while the pedagogues pointed out that students would be encouraged by enjoyable features that would bring fun to the interaction, the software engineers preferred to address student motivation by dealing with student frustration. these approaches reflect the requirements that an e-learning application should have to deliver fun in a learning space (gomes et al., 2018). another evidence of the different contributions brought by the two types of stakeholders was seen in the user profiles they described: the pedagogues suggested profiles in which accessibility issues were at the center, whereas the software engineers described profiles associated with developing work in groups. therefore, it can be inferred that the knowledge of both is complementary. our findings reaffirm the need for interdisciplinary participation of various stakeholders (fernandez and wagner, 2015).

concerning (rq2) how is the acceptance of the use of the proto-persona+ technique by these stakeholders?, we concluded that the technique proved suitable for use by both types of stakeholders, with some different perspectives in its use. the pedagogues assigned greater importance to the demographic quadrant, which is an important result for the description of end-users: this quadrant reports an individual's personal information that can contribute to building a picture of the end-users and can consequently boost the development of empathy between the developers and the audience (billestrup et al., 2014; ferreira et al., 2018b).
regarding the guideline questions, it was noticed that the two types of stakeholders demonstrated different perceptions for each question. as a result, these different perceptions can provide complementary viewpoints on the audience, thereby enriching the details about the end-user. observing the perceptions of ease-of-use and usefulness, the findings showed that the non-technical stakeholders found the technique easy to use. these results revealed that the proto-persona+ is a suitable technique to be handled by non-technical stakeholders for the purpose of eliciting ux requirements. other techniques could be considered for eliciting ux requirements; however, personas are artifacts that stimulate in-depth discussion about end-user needs, and the proto-persona+ is an adaptation of the proto-persona that aims at being easier for the stakeholders to use. by answering rq2, we could verify that the proto-persona+ was suitable to capture the particular knowledge of the different types of stakeholders. this revealed that the different types of stakeholders can contribute to describing different ux requirements.

by answering (rq3) which ux requirements presented in the proto-personas+ can support the prototyping of user interfaces?, we identified the sets of ux requirements presented in the proto-personas+ that supported the developers in the prototyping of solutions. by comparing the ux requirements present in the storyboards, we saw that the proto-personas+ of both types of stakeholders (i.e., pedagogues and software engineers) provided information that supported the developers in the design of solutions. additionally, the findings from the storyboard analysis reaffirmed that both stakeholders provided complementary information (fernandez and wagner, 2015).

10 study limitations

considering all the steps of our study, we can highlight some limitations, which we discuss below. proto-persona is an approach that focuses on providing a sketch of the representative group of people in a specific domain. through workshops, the proto-persona technique allows the participants (i.e., stakeholders) to achieve a shared understanding about the audience. one of its advantages is that the technique offers a practical way to gather the specialists' knowledge and discuss their inputs about the end-users. however, as the proto-persona is built from assumptions about the end-users, it presents some limitations regarding validation. differently from the proto-persona, the traditional persona is constructed by using data gathered from the audience. to mitigate the problem of not collecting data from real end-users, gothelf (2012) proposes that proto-persona validation should be carried out later. we did not perform this validation; it could be conducted in another study.

we can point out as another limitation the fact that this study was conducted with a specific group of stakeholders in a specific city in brazil. further studies are necessary to establish the proposed methodology as a generalized approach to capture non-technical stakeholder knowledge in other contexts. so far, our research has not compared the results of the proto-personas+ with approaches that use traditional personas, or even with no personas at all, to elicit requirements from non-technical stakeholders. therefore, we do not claim that applying proto-personas+ leads to better results than traditional persona approaches.
we also do not claim that the proto-persona+ results are better than not applying any persona approach at all; further comparative studies are needed to fully understand the effectiveness of the proto-persona approach. our results must not be generalized to all scenarios, and the particularities of our study must be considered. proto-persona construction should be seen as a tool to encourage the sharing and discussion of stakeholder knowledge. this study investigated whether the proto-persona is suitable for use by both technical and non-technical stakeholders to support ux requirement elicitation.

11 conclusions and future work

this paper presented an experimental study that aimed to explore whether non-technical stakeholders contribute to the description of ux requirements. to conduct the study, we applied the proto-persona+ technique. the results showed that the non-technical stakeholders contributed by giving details about the end-users in a view complementary to that of the technical stakeholders. considering the types of ux requirements the participants described, we noticed that the non-technical stakeholders raised different ones: fun and accessibility issues were found exclusively in the proto-personas+ created by these stakeholders. accessibility issues are fundamental to meet the needs of a wide range of end-users in the domain we explored in this study; in addition, by taking fun issues into account, these stakeholders demonstrated their concern with keeping users motivated and engaged in the application. we could conclude that, by describing these types of ux requirements, the non-technical stakeholders made an important contribution to eliciting requirements that have a great impact on the experience of the end-users.

the results of our second round revealed that the user interface prototypes produced by the developers encompassed different ux requirements in a complementary way; the prototypes presented a diversity of details about ux. we could conclude that the proto-personas+ designed by the different stakeholders allowed the developers to build more comprehensive prototypes while still providing minimalist solutions.

to sum up, our study provided two important contributions. first, our investigation brought the discussion of how a non-technical stakeholder can contribute to the elicitation of requirements that are linked to end-user characteristics; our findings revealed that the non-technical stakeholder can be a co-participant in the elicitation process and not just a provider of information. in addition, we extended the proto-persona technique by creating the proto-persona+ and showed that our proposal is suitable for including the non-technical stakeholder in the process of eliciting ux requirements. our work also presented as a contribution the structuring of a qualitative analysis that can be replicated in other studies on ux requirements.

as future work, we intend to carry out studies on the quality of the low fidelity prototypes by conducting a usability inspection on them. we also intend to evaluate the quality of the storyboards from the perspective of domain experts, who in our case are the pedagogues.

12 acknowledgements

we thank the financial support of the coordenação de aperfeiçoamento de pessoal de nível superior - brasil (capes) - finance code 001.
we also thank grant #2013/25572-7, são paulo research foundation (fapesp), and the support of cnpq (grant 311494/2017-0).

references

abelein, u., sharp, h., and paech, b. (2013). does involving users in software development really influence system success? ieee software, 30(6):17–23.
alves, c. and ali, r. (2018). a persona-based modelling for contextual requirements. in requirements engineering: foundation for software quality: 24th international working conference, refsq 2018, utrecht, the netherlands, march 19-22, 2018, proceedings, volume 10753, page 352. springer.
anvari, f., richards, d., hitchens, m., and babar, m. a. (2015). effectiveness of persona with personality traits on conceptual design. in proceedings of the 37th international conference on software engineering - volume 2, pages 263–272, florence, italy. ieee press.
aranda, a. m., dieste, o., and juristo, n. (2016). effect of domain knowledge on elicitation effectiveness: an internally replicated controlled experiment. ieee transactions on software engineering, 42(5):427–451.
ardito, c., costabile, m. f., marsico, m. d., lanzilotti, r., levialdi, s., roselli, t., and rossano, v. (2006). an approach to usability evaluation of e-learning applications. universal access in the information society, 4(3):270–283.
berti, s., paterno, f., and santoro, c. (2004). natural development of ubiquitous interfaces. communications of the acm, 47(9):63–64.
bhattarai, r., joyce, g., and dutta, s. (2016). information security application design: understanding your users. in international conference on human aspects of information security, privacy, and trust, pages 103–113. springer.
billestrup, j., stage, j., nielsen, l., and hansen, k. s. (2014). persona usage in software development: advantages and obstacles. in the seventh international conference on advances in computer-human interactions, achi, pages 359–364, barcelona, spain. citeseer.
brown, j. m., lindgaard, g., and biddle, r. (2011). collaborative events and shared artefacts: agile interaction designers and developers working toward common aims. in 2011 agile conference, pages 87–96.
castro, j. w., acuña, s. t., and juristo, n. (2008). integrating the personas technique into the requirements analysis activity. in 2008 mexican international conference on computer science, pages 104–112.
chimalakonda, s. and nori, k. v. (2013). what makes it hard to apply software product lines to educational technologies? in 4th international workshop on product line approaches in software engineering.
choma, j., zaina, l. a. m., and beraldo, d. (2016a). userx story: incorporating ux aspects into user stories elaboration. in human-computer interaction. theory, design, development and practice - 18th international conference, hci international 2016, toronto, on, canada, july 17-22, 2016. proceedings, part i, pages 131–140.
choma, j., zaina, l. a. m., and da silva, t. s. (2016b). softcoder approach: promoting software engineering academia-industry partnership using cmd, dsr and ese. j. software eng. r&d, 4:8.
clark, r. c. and mayer, r. e. (2007). e-learning and the science of instruction: proven guidelines for consumers and designers of multimedia learning. pfeiffer, 2nd edition.
cooper, a., reimann, r., and cronin, d. (2014). about face 2.0: the essentials of interaction design. john wiley & sons.
davis, f. d. (1989). perceived usefulness, perceived ease of use, and user acceptance of information technology. management information systems research center, 13(3):319–340.
de la vara, j. l., wnuk, k., svensson, r. b., sanchez, j., and regnell, b. (2011). an empirical study on the importance of quality requirements in industry. in 23rd international conference on software engineering and knowledge engineering, pages 438–443. seke.
dias, g. a., da silva, p. m., no junior, j. b. d., and de almeida, j. r. (2011). technology acceptance model (tam): avaliando a aceitação tecnológica do open journal systems (ojs). informação & sociedade, 21(2):133–149.
dodero, j. m., garcía-peñalvo, f.-j., gonzález, c., moreno-ger, p., redondo, m.-a., sarasa-cabezuelo, a., and sierra, j.-l. (2014). development of e-learning solutions: different approaches, a common mission. ieee revista iberoamericana de tecnologias del aprendizaje, 9(5):72–80.
easterbrook, s., singer, j., storey, m.-a., and damian, d. (2008). selecting empirical methods for software engineering research. in guide to advanced empirical software engineering, chapter 11, pages 285–311. springer.
faily, s. (2008). towards requirements engineering practice for professional end user developers: a case study. in proceedings of the 2008 requirements engineering education and training, pages 38–44. ieee.
fernandez, d. m. and wagner, s. (2015). naming the pain in requirements engineering: a design for a global family of surveys and first results from germany. information and software technology, 57(1):616–643.
ferreira, b., barbosa, s., and conte, t. (2018a). creating personas focused on representing potential requirements to support the design of applications. in proceedings of the 17th brazilian symposium on human factors in computing systems, page 15. acm.
ferreira, b., silva, w., barbosa, s. d. j., and conte, t. (2018b). technique for representing requirements using personas: a controlled experiment. iet software, 12(3):280–290.
ferreira, b., silva, w., jr., e. a. o., and conte, t. (2015). designing personas with empathy map. in the 27th international conference on software engineering and knowledge engineering, seke 2015, wyndham pittsburgh university center, pittsburgh, pa, usa, july 6-8, 2015, pages 501–505.
filho, n. f. d. and barbosa, e. f. (2013). a requirements catalog for mobile learning environments. in proceedings of the 28th annual acm symposium on applied computing, pages 1266–1271. acm.
fisher, r. a. (1922). on the interpretation of χ2 from contingency tables, and the calculation of p. journal of the royal statistical society, 85(1):87–94.
garcia, a., silva da silva, t., and selbach silveira, m. (2017). artifacts for agile user-centered design: a systematic mapping. in proceedings of the 50th hawaii international conference on system sciences (2017).
garrett, j. j. (2010). the elements of user experience: user-centered design for the web and beyond. new riders publishing, thousand oaks, ca, usa, 2nd edition.
gomes, t. c. s., falcão, t. p., and de azevedo restelli tedesco, p. c. (2018). exploring an approach based on digital games for teaching programming concepts to young children. international journal of child-computer interaction, 16:77–84.
gothelf, j. (2012). using proto-personas for executive alignment. uxmagazine.
gothelf, j. and seiden, j. (2013). lean ux: applying lean principles to improve user experience. o'reilly media.
grudin, j. (2006). why personas work: the psychological evidence. in the persona lifecycle, chapter 12, pages 642–663. elsevier inc.
grudin, j. and pruitt, j. (2002). personas, participatory design and product development: an infrastructure for engagement. in pdc'02, pages 144–152.
hadar, i., soffer, p., and kenzi, k. (2014). the role of domain knowledge in requirements elicitation via interviews: an exploratory study. requirements engineering, 19(2):143–159.
jansen, a., van mechelen, m., and slegers, k. (2017). personas and behavioral theories: a case study using self-determination theory to construct overweight personas. in proceedings of the 2017 chi conference on human factors in computing systems, pages 2127–2136. acm.
kashfi, p., nilsson, a., and feldt, r. (2017). integrating user experience practices into software development processes: implications of the ux characteristics. peerj computer science, 3:e130.
kortbeek, c. (2016). interaction design for internal corporate tools.
maceli, m. and atwood, m. (2011). from human crafters to human factors to human actors and back again: bridging the design time – use time divide. in end-user development. is-eud 2011. lecture notes in computer science, volume 6654, pages 76–91. springer.
nielsen, j. (1995). 10 usability heuristics for user interface design. https://www.nngroup.com/articles/ten-usability-heuristics/. online; accessed august 12, 2016.
nielsen, j. and norman, d. (2013). the definition of user experience.
osborn, a. f. (1979). applied imagination. new york: scribner.
palomares, c., quer, c., and franch, x. (2017). requirements reuse and requirement patterns: a state of the practice survey. empirical software engineering, 22(6):2719–2762.
rogers, y., sharp, h., and preece, j. (2015). interaction design: beyond human-computer interaction. john wiley & sons, united states, 4th edition.
salman, i., misirli, a. t., and juristo, n. (2015). are students representatives of professionals in software engineering experiments? in proceedings of the 37th international conference on software engineering - volume 1, pages 666–676. ieee press.
sharma, s. and pandey, s. k. (2014). requirements elicitation: issues and challenges. in 2014 international conference on computing for sustainable global development (indiacom), pages 151–155.
steinmacher, i., conte, t. u., treude, c., and gerosa, m. a. (2015). overcoming open source project entry barriers with a portal for newcomers. in icse '16: proceedings of the 38th international conference on software engineering, pages 273–284, austin, united states.
strauss, a. and corbin, j. (1998). basics of qualitative research: techniques and procedures for developing grounded theory, volume 4. thousand oaks, ca: sage, 2nd edition.
winckler, m., bach, c., and bernhaupt, r. (2013). identifying user experience dimensions for mobile incident reporting in urban contexts. ieee transactions on professional communication, 56(2):97–119.
journal of software engineering research and development, 2022, 10:5, doi: 10.5753/jserd.2021.1992  this work is licensed under a creative commons attribution 4.0 international license.. first step climbing the stairway to heaven model: results from a case study in industry paulo sérgio dos santos júnior [ federal institute of education, science and technology of espírito santo | paulo.junior@ifes.edu.br ] monalessa perini barcellos [ federal university of espírito santo | monalessa@inf.ufes.br ] rodrigo fernandes calhau [ federal institute of education, science and technology of espírito santo | calhau@ifes.edu.br ] abstract context: nowadays, software development organizations have adopted agile practices and data-driven software development aiming at a competitive advantage. moving from traditional to agile and data-driven software development requires changes in the organization's culture and structure, which may not be easy. the stairway to heaven model (sth) describes this evolution path in five stages. objective: we aimed to investigate how systems theory tools, the gut matrix, and reference ontologies can help organizations in the first transition of sth, i.e., moving from traditional to agile development. method: we performed a participative case study in a brazilian organization that develops software in partnership with a european organization. we applied systems theory tools (systemic maps and archetypes) to understand the organization and identify undesirable behaviors and their causes. then, we used gut matrices to decide which ones should be addressed first and defined strategies to change the undesirable behaviors by implementing agile practices. we also used the conceptualization provided by reference ontologies to share a common understanding of agile and help implement the strategies. results: by understanding the organization, a decision was made to implement a combination of agile and traditional practices. the implemented strategies improved software quality, project time, and cost. problems due to misunderstanding agile concepts were solved by using reference ontologies, process models, and other diagrams built based on the ontologies' conceptualization, allowing the organization to experience agile culture and foresee changes in its business model. conclusion: systems theory tools and the gut matrix help organizations move from traditional to agile development by supporting a better understanding of the organization, finding leverage points of change, and enabling the definition of strategies aligned to the organization's characteristics and priorities. reference ontologies can be useful to establish a common understanding about agile, enabling teams to be aware of and, thus, more committed to agile practices and concepts.
the use of process models and other diagrams can favor learning the conceptualization provided by the ontologies. keywords: stairway to heaven, agile, systems theory, gut matrix, ontology 1 introduction typically, fast-changing and unpredictable market needs, complex and changing customer requirements, and pressures of shorter time-to-market are challenges faced by organizations. to address these challenges, many organizations have started adopting agile development methods with the intention of enhancing the organization's ability to respond to change. in emphasizing flexibility, efficiency and speed, agile practices have led to a paradigm shift in how software is developed (williams and cockburn 2003) (olsson et al. 2012). different flavors of the agile methods have become the de facto way of working in the software industry (rodriguez et al. 2012). in allowing for more flexible ways of working with an emphasis on customer collaboration and speed of development, agile methods help organizations address many of the problems associated with traditional development (dybå and dingsøyr 2008). the adoption of agile practices has enabled organizations to shorten development cycles and increase customer collaboration. however, this has not been enough. there has been a need to learn from customers also after deployment of the software product. this requires practices that extend agile practices, such as continuous deployment (i.e., the ability to deliver software more frequently to customers and benefit from frequent customer feedback), which enables shorter feedback loops, more frequent customer feedback, and the ability to more accurately validate whether the developed functionalities correspond to customer needs and behaviors (olsson et al. 2012). therefore, organizations should evolve from traditional development towards data-driven and continuous software development. continuous software engineering (cse) aims to establish a continuous flow between software-related activities, taking into consideration the entire software life cycle. it seeks to transform discrete development practices into more iterative, flexible, and continuous alternatives, keeping the goal of building and delivering quality products according to established time and costs (fitzgerald and stol 2017). therefore, a continuous software engineering approach is based on agile and continuous practices driven by development and customer data. considering that organizations struggle with the changes to be made along the path and with the order in which to implement them, olsson et al. (2012) proposed the stairway to heaven model (sth), which describes the typical successful evolution of an organization from traditional to continuous and customer data-driven development. the model comprises five stages, where the first transition consists of moving from traditional to agile development. this transition requires a careful introduction of agile practices, a shift to small development teams, and a focus on features rather than components. in this paper, we report the experience of a brazilian organization (here called organization a for anonymity reasons) which decided to evolve from traditional to agile, continuous, and data-driven software development. for that, we have followed the sth model (olsson et al. 2012).
we selected this model because it represents in a simple way the main stages an organization should follow to move from a traditional to a continuous software engineering approach based on data-driven and agile development. moreover, sth does not prescribe the practices that should be performed at each stage; thus, there is flexibility to define and implement them according to the organization's characteristics and priorities. in this paper, our focus is on the first transition of the sth model. although an increasing number of organizations are moving from traditional to agile, implementing the changes needed for the first transition prescribed in sth is not trivial because it involves changes not only in the development process, but also in the organization's culture. moreover, there is no "one and right" way to implement agile practices in an organization because each agile practice needs to be tailored to fit the business goals, culture, environment, and other aspects of the organization. therefore, organizations should find their own way to go through the path from traditional to agile (karvonen et al. 2015). organization a has a particular characteristic that needs to be considered when defining strategies to implement agile practices: the software projects of organization a are built in partnership with a european organization (here called organization b). in this partnership, organization b is responsible for the software requirements specification process, while organization a is responsible for the design, coding, testing, and deployment processes. furthermore, organization b is responsible for the communication between organization a and the project client. both organizations a and b work in a traditional and often ad hoc manner. this way of working has brought problems, such as budget overruns, teams divided into disciplines (testers, architects, programmers, etc.), causing many intermediary delivery points in the organization and increasing delays between them, and long periods required to deploy new versions of the software products (williams and cockburn 2003) (olsson et al. 2012) (karvonen et al. 2015). organization a was in the first stage of sth and, in order to evolve, the first step was to go towards becoming an agile organization. two main challenges were faced in this context: (i) how to move from a traditional development culture to an agile culture and (ii) how to implement agile practices in an organization that shares requirement-related activities with another organization and does not have direct access to the project client. to overcome these challenges, it would be necessary to get to know the organization so that it would be possible to define suitable strategies to implement agile practices. thus, we employed an approach that combined systems theory tools (mainly systemic maps and archetypes) (meadows 2008) (sterman 2010), gut matrix (kepner and tregoe 1981) and reference ontologies (guizzardi 2007) to identify the path to implement agile practices and get into agile culture based on the organizational characteristics and context. systems theory tools were chosen because they allow understanding how different variables relate to each other in an organizational environment. thus, by using such tools, it is possible to understand how processes, practices, culture and other factors affect the software development process and the produced results. this helps identify aspects that should be addressed in improvement actions.
the first and third authors have knowledge of and experience with systems theory and saw an opportunity to apply it in organization a. gut matrix was selected because it helps prioritize actions and was already known by organization a. finally, reference ontologies were used because they have been recognized as an important instrument to deal with knowledge-related problems, supporting communication and learning (guizzardi 2007). the authors have successfully experienced the use of ontologies as knowledge artifacts in different contexts (e.g., (ruy et al. 2017), (santos et al. 2019), (fonseca et al. 2016)). they developed the scrum reference ontology (sro) (santos jr et al. 2021a), which provides knowledge that aids in the understanding of scrum in a broader software engineering context and is suitable for meeting a learning need identified in the study addressed in this paper. as main results perceived from the experience reported here, we highlight: (i) it was possible to understand the organization's behavior and identify behavior patterns and leverage points of change; (ii) strategies were defined to implement agile practices by changing undesirable behaviors and focusing on leverage points, taking the organization's characteristics into account; (iii) by implementing the strategies, organization a improved software quality, project time and cost and started to develop an agile culture; (iv) by using the conceptualization provided by reference ontologies, the team learned agile concepts and practices, which is useful to implement strategies aiming at the agile organization; and (v) a process based on systems theory to aid organizations in defining strategies to implement agile practices arose from the study. this work brings contributions to researchers and practitioners. the study can serve as an example for other organizations similar to organization a, and the process resulting from the study can be used by other organizations. moreover, the way ontologies were used to provide knowledge for the team can inspire others to make the most of this powerful instrument for knowledge structuring, representation and sharing. furthermore, researchers can reflect and provide advances on the use of systems theory to support the definition of strategies in the agile software development context. this paper extends (santos jr et al. 2020) mainly by exploring how reference ontologies were used to help the team learn about scrum concepts and practices in the case reported here. we also illustrate the roles of systems theory tools, gut matrix, and reference ontologies in the study, and present additional information about organizations a and b and a new systemic model produced during the study. the paper is organized as follows: section 2 presents the theoretical background; section 3 discusses related work; section 4 presents the study planning, execution, and results; section 5 discusses threats to validity; and section 6 presents our final considerations and future work. 2 background 2.1 stairway to heaven traditional software development is organized sequentially, handing over intermediate artifacts (e.g., requirements, designs, code) between different functional groups in the organization. this causes many handover points that lead to problems such as time delays between handovers of different groups and large amounts of resources applied to creating these intermediate artifacts that, to a large extent, are replacements of human-to-human communication (bosch 2014).
in agile software development, the notion of cross-functional, multidisciplinary teams plays a central role. these teams have the different roles necessary to take a customer's need all the way to a delivered solution. moreover, the notion of small, empowered teams, the backlog, and daily stand-up meetings and sprints guide software development through shorter cycles and help bring the software development closer to the client (bosch 2014). moving from traditional to agile development is the first transition prescribed in the stairway to heaven model (sth) (olsson et al. 2012). sth describes the evolution path organizations follow to successfully move from traditional to data-driven software development. it comprises five stages: traditional development, agile organization, continuous integration, continuous deployment, and r&d as an innovation system. in a nutshell, organizations evolving from traditional development start by experimenting with one or a few agile teams. once these teams are successful, agile practices are adopted by the organization. as the organization starts showing the benefits of working agile, system integration and verification become involved, and the organization adopts continuous integration. once it runs internally, lead customers often express an interest in receiving software functionality earlier than through the normal release cycle: they want continuous deployment of software. the final stage is where the organization collects data from its customers and uses a customer base to run frequent feature experiments to support customer data-driven software development (olsson et al. 2012). many organizations have moved from traditional to agile. there are many ways of doing that, and each organization should consider its business goals, culture, environment and other aspects to find the best way to go through the path. in the experience reported in this paper we have used systems theory tools, gut matrix and reference ontologies, which are briefly introduced in the following. 2.2 systems theory systems theory has been used in industry and academia to support the (re)design of organizations (sterman 1994) (meadows 2008) (sterman 2010). it sees an organization as a system, consisting of elements (e.g., teams, artifacts, policies) and interconnections (e.g., the relation between the development team, the software artifacts it produces and the policies that influence their production) coherently organized in a structure that produces a characteristic set of behaviors, often classified as its function or purpose (e.g., the development team produces a software product aiming to accomplish its function in the organization) (meadows 2008). in the systems theory literature, there are several tools that support understanding the different elements and behaviors of a system, such as systemic maps and archetypes (meadows 2008) (sterman 2010). a systemic map (also known as a causal loop diagram) allows representing the dynamics of a system by means of the system borders, relevant variables, their causal relationships, and feedback loops. a positive causal relationship means that two variables change in the same direction (e.g., increasing the number of bad design decisions increases software defects), while a negative causal relationship means that two variables change in opposite directions (e.g., increasing test efficacy decreases software defects). feedback loops are mechanisms that change variables of the system. there are two main types: balancing and reinforcing feedback loops. the former is an equilibrating structure in the system and is a source of stability and resistance to change. the latter compounds change in one direction with even more change.
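the polarity of a feedback loop follows a simple rule: traversing the loop, a positive product of link signs indicates a reinforcing loop, while a negative product indicates a balancing loop. the sketch below is our own illustration of this rule, not a tool used in the study; it encodes a small signed causal map (the edges echo the shifting the burden structure discussed below) and classifies each feedback loop.

```python
# a minimal causal-loop sketch (illustrative; not an artifact from the study)
from math import prod

# signed edges: (cause, effect, sign); +1 = same direction, -1 = opposite
EDGES = [
    ("defects in software artifacts", "new urgent development activities", +1),
    ("new urgent development activities", "defects in software artifacts", -1),
    ("defects in software artifacts", "software quality techniques", +1),
    ("software quality techniques", "defects in software artifacts", -1),
    ("new urgent development activities", "software quality techniques", -1),
]

def find_loops(edges):
    """enumerate simple cycles with a dfs; fine for small diagrams."""
    adj = {}
    for src, dst, sign in edges:
        adj.setdefault(src, []).append((dst, sign))
    loops = []

    def dfs(start, node, path, signs):
        for nxt, sign in adj.get(node, []):
            if nxt == start:
                loops.append((path[:], signs + [sign]))
            elif nxt not in path and nxt > start:  # canonical start avoids duplicates
                dfs(start, nxt, path + [nxt], signs + [sign])

    for node in sorted(adj):
        dfs(node, node, [node], [])
    return loops

for path, signs in find_loops(EDGES):
    kind = "reinforcing" if prod(signs) > 0 else "balancing"
    print(" -> ".join(path), f"({kind})")
```

on these five edges the sketch reports two balancing loops and one reinforcing loop, which is exactly the pattern that characterizes the shifting the burden archetype discussed next.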
one beneficial effect of using systemic maps is that they help identify archetypes. an archetype is a common structure of the system that produces a characteristic pattern of behavior. for example, the archetype shifting the burden occurs when a problem symptom is "solved" by applying a symptomatic solution, which diverts attention away from a more fundamental solution (kim 1994). the archetype fix that fails, in turn, occurs when a fix that is effective in the short term creates side effects, a "fail", for the long-term behavior of the system (kim 1994). usually, fix that fails appears inside another, more complex archetype, such as shifting the burden. each archetype has a corresponding modeling pattern. therefore, by analyzing a systemic map, it is possible to identify archetypes by looking for their modeling patterns. archetypes and systemic maps can be useful to identify problems and possible leverage points to solve them. leverage points are points in the system where a small change can lead to a large shift in behavior (meadows 2008). 2.3 gut matrix the gut matrix allows prioritizing the resolution of problems, considering that the resources to solve them are limited (kepner and tregoe 1981). the prioritization is based on: gravity (g), which describes the impact of the problem on the organization; urgency (u), referring to how much time is available to address the problem; and tendency (t), which measures the predisposition of a problem to get worse over time. 2.4 reference ontology ontologies have been recognized as important instruments to solve knowledge-related problems. an ontology is a formal, explicit specification of a shared conceptualization (studer et al. 1998). ontologies can be developed for communication purposes (reference ontologies) or for computational solutions (operational ontologies). a reference ontology is a special kind of conceptual model representing a model of consensus within a community. it is a solution-independent specification with the aim of making a clear and precise description of the domain in reality for the purposes of communication, learning and problem-solving (guizzardi 2007). in the work described in this paper, we used the scrum reference ontology (sro) (santos jr et al. 2021a), which addresses the main aspects of scrum, such as ceremonies, activities, roles, and artifacts. the first and second authors of this paper are also authors of sro. it is a reference ontology of the software engineering ontology network (seon) (ruy et al. 2016), whose specification is available at http://nemo.inf.ufes.br/en/projects/seon/. seon is an ontology network that contains several integrated ontologies describing various subdomains of the software engineering domain (e.g., software requirements, software process, software measurement, software quality assurance, software project management, etc.).
by providing a comprehensive and consistent conceptualization of the software engineering domain, seon has been successfully used to solve knowledge-related and interoperability problems in that domain (e.g., (fonseca et al. 2017) (ruy et al. 2017) (bastos et al. 2018) (santos jr et al. 2021a)). sro reuses concepts of other seon ontologies, namely: software process ontology (spo) (briguente et al. 2011), enterprise ontology (eo) (ruy et al. 2014) and reference software requirements ontology (rsro) (duarte et al. 2018). by doing that, sro connects scrum concepts to more general software engineering concepts, enabling a better understanding of scrum in a broader software development context. sro was developed by following the sabio method (falbo 2014) and was evaluated through verification and validation activities. detailed information about sro, including its conceptual models, descriptions and a study in which we used sro for semantic interoperability purposes, can be found in (santos jr et al. 2021a). 3 related work some works have reported the use of systems theory in the agile development context. for example, vidgen and wang (2009) proposed a framework based on systems theory that identifies enablers and inhibitors of agility and discusses capabilities that should be present in an agile team. gregory et al. (2016) discuss challenges to implementing agile and suggest some organizational elements that could be used to do that. considering the sth context, karvonen et al. (2015) used the bapo categories (business, architecture, process, and organization) to identify some practices for each sth step. however, they do not discuss how to understand the organization in order to establish proper strategies to implement them. considering scenarios involving more than one organization to produce software, de sousa et al. (2016) discuss agile transformation in brazilian public institutions. different from organizations a and b, which work together to produce software for the client, brazilian public institutions hire software organizations to develop software (i.e., the public institution is a client of the hired organization). moreover, different from the scenario discussed in (de sousa et al. 2016), in our study, organization a needed to develop skills, processes, and culture that enabled it to work with multicultural issues, because organization a, organization b and the clients are in different countries and have different cultures. none of the aforementioned works use systems theory tools, gut matrix, and reference ontologies to help organizations define strategies to implement agile practices, as we did in our study. some works address aspects related to developing software with distributed teams (jim et al. 2009) (prikladnicki and audy 2010). they show that there are many challenges related to communication, knowledge management, coordination and requirement management caused by differences in location, time and culture. aiming to address these issues, l'erario et al. (2020) propose a framework that provides some concepts, a structure and a flow of communication in distributed software projects. ali and lai (2018), in turn, focus on requirements communication and propose to use a requirements graph combined with a software requirement specification document to help the stakeholders in the establishment of a better understanding of software requirements.
similar to our work, the aforementioned works aim to support organizations in which the software development process is distributed. however, differently from our work, those works consider software development geographically distributed among several development teams of the same organization. as we previously discussed, our work considered two organizations working as one in the projects, with two teams in different countries and each team controlling part of the software development process. we propose to use systems theory tools, gut matrix and reference ontologies to create strategies that minimize the impact caused by culture, time, and distance and, sometimes, turn them into a competitive advantage. we believe that our work can contribute to organizations that work with geographically distributed teams by providing useful knowledge to create tailored strategies. for example, they can be inspired by our strategy to communicate requirements, which uses bdd (behavior-driven development) (wynne et al. 2017) as a protocol to specify, communicate and validate requirements. 4 case study: planning, execution, and results participative case study was selected as the research method in this study because two researchers acted as consultants in organization a and, thus, were participants in the process being observed (baskerville 1997). together with the other participants, they gathered information to understand the organization and defined strategies to implement agile practices. thus, the researchers had some control over some intervening variables. 4.1 study design 4.1.1 diagnosis organization a is a brazilian software development organization that works together with a european organization (organization b) to develop software products for european clients. it has 30 developers organized in teams managed by tech leaders. organization b elicits requirements with the clients, and organization a is in charge of developing the corresponding software. as a consequence of the increasing number of projects and team members, added to the lack of flexible processes, some problems emerged, such as late and over-budget projects, an increase in software defects, overloading of the teams due to rework on software artifacts, and communication issues among the client, organization a, and organization b. aiming to minimize these problems, in the first semester of 2019, organization a decided to implement scrum practices, but without success. according to the directors, the main difficulties were due to non-direct communication with the client and included: difficulty in defining the product backlog, selecting a product owner and carrying out scrum ceremonies that need the client's feedback. furthermore, they pointed out that agile culture demands knowledge, and its clients, business partners and developers were not prepared for it. other factors that harmed the scrum implementation were: (i) teams without self-management characteristics, (ii) difficulties in internal communication, (iii) lack of a feedback culture, (iv) lack of openness and other scrum values. moreover, organization a has had many systemic issues, such as: (a) directors have been much focused on operational and technological issues, (b) lack of management professionals, (c) focus on short-term issues instead of long-term ones, and (d) lack of focus on applying strategic and systemic thinking.
in addition, the first and third authors noticed that organization b has had a traditional culture based on linear and non-adaptive processes and methods. this scenario indicated to us that particular characteristics and the context of the organization had not been considered in the first attempt to implement agile practices. therefore, at the beginning of 2020, we proposed to use sth as a reference model to evolve organization a from traditional to data-driven software development, in a long-term process improvement program. the first step: move from traditional to agile. considering the peculiar scenario of organization a, we decided to use systems theory to understand the organization in a systemic way. then, we used gut matrix to support the prioritization of problem resolution, and reference ontologies to provide common knowledge about agile development. 4.1.2 planning the study goal was to analyze the use of systems theory tools (particularly systemic maps and archetypes), gut matrix, and reference ontologies to help define strategies to implement agile practices when the organization is moving from traditional to agile development. by strategies, we mean actions or plans established to implement agile development. aligned to this goal, the following research question was defined: are systems theory, gut matrix, and reference ontologies useful to define suitable strategies for an organization to move from traditional to agile development? the expected outcomes were: (i) a view of important aspects of the organization by means of systemic maps; (ii) prioritization of problems and causes to be addressed; (iii) strategies to address problems and implement agile practices; (iv) artifacts built based on reference ontologies that help the team learn agile concepts and practices; (v) a systems theory-based process to define strategies to move from traditional to agile. figure 1 illustrates how systems theory tools (particularly systemic maps and archetypes), gut matrix, and reference ontologies (blue circles in figure 1) were used in the study. reference ontologies and systems theory tools were used in the problem domain (represented in the yellow region in figure 1). ontologies provide the conceptual perspective, while systemic maps and archetypes afford a dynamic perspective. in other words, the former supports understanding the domain itself (agile) by providing structural knowledge, while the latter helps understand the organization in which the problems manifest and how they manifest. gut matrix, in turn, was used in the solution domain (represented in the green region in figure 1) as a means to prioritize the problems to be addressed, providing, this way, a problem-solving perspective. figure 1. overview of the approach used in this work. to be more specific, ontologies were used to provide a common conceptualization to support communication among the organizations and their employees in the software development context. systemic maps, in turn, aimed to make explicit the variables and relations present in the dynamics of the system formed by the organizations. finally, gut matrix was used to support the decision-making process that guided the solution process. the study participants who directly participated in the interviews for data collection and results evaluation were the two directors (software development director and sales director), one tech leader, and two developers. the first and third authors worked as consultants in organization a and, thus, also participated in the study.
working together with the other participants, they were responsible for creating systemic maps and gut matrices, as well as for defining the strategies to be implemented to move from traditional to agile software development. once these artifacts were created, they were validated with the team. for example, systemic maps were created based on information provided by the team. then, the team evaluated them in meetings and provided feedback so that we reached the maps shown in the next section. the second author did not interact directly with organization a. she worked as an external reviewer, evaluating the produced artifacts and helping the other authors improve them. 4.2 study execution and data collection 4.2.1 data collection data collection involved interviews, development of systemic maps and gut matrices, and definition of strategies to implement agile practices. a. initial interviews data collection started with interviews to gather general information about the organization. six interviews were conducted, four with the directors and two with the developers and the tech leader. participants were told to feel free to talk as much as they wanted to. each interview lasted about 90 minutes. the funnel questions technique was used, i.e., the interview started with general questions (e.g., "what kind of software does the organization develop?", "how is the software development process?"), and then went deeper into more specific points of each one (e.g., "tell me more about the software test activity"). the interviews were recorded, transcribed, and validated with each participant. the interviews with the directors aimed to get information about the following aspects: organizational environment, culture, rules of relationship with partners, future plans, software development process, software development issues, and agile knowledge. among the information provided by the directors, they pointed out that some problems were caused by misunderstood software requirements or a project scope that was not clearly defined. according to them, organization b did not describe requirements in a consistent and clear way. the interviews with the tech leader and developers aimed at understanding software development problems from their perspective and how familiar they were with agile methods and practices. the problems mentioned by the directors were also reported by the tech leader and developers. when asked about team organization, they pointed out that the teams were not self-organized. on the contrary, tech leaders were responsible for allocating tasks, coordinating team members, establishing deadlines, and monitoring projects. moreover, the team's knowledge of agile was limited. b. systemic maps information obtained in the interviews was used to build systemic maps. figure 2 shows a fragment of one of the developed systemic maps. the elements in blue in the figure form a modeling pattern that reveals the presence of the archetype shifting the burden. figure 2. fragment of systemic map (1). as previously said, organization b is responsible for eliciting requirements with the client, specifying them and sending them for organization a to develop the software.
the development teams of organization a often misunderstand the requirements that describe the software, component, or functionality to be developed, since organization b produces poorly specified requirements, neither adopting a technique nor following a pattern to describe them. misunderstood requirements contribute to increasing the number of defects in software artifacts, since design, code, and tests are produced based on the requirements informed by organization b. defects in software artifacts make organization a mobilize (and often overload) the development team to fix defects by performing new urgent development activities, which decrease the number of defects in software artifacts. these urgent activities are performed as fast as possible, aiming not to delay other activities. thus, they do not properly follow software quality good practices. moreover, they contribute to increasing the project cost and time (late and over-budget project). defects in software artifacts increase the need for using software quality techniques which, when used, lead to fewer defects in software artifacts. this causal relationship has a delay, since the effect of using software quality techniques can take a while to be perceived. as shown in figure 2, the archetype shifting the burden is composed of two balancing feedback loops and one reinforcing feedback loop. the balancing feedback loops (between new urgent development activities and defects in software artifacts, and between defects in software artifacts and software quality techniques) mean that the involved variables influence each other in a balanced and stable way (e.g., the higher/lower the number of defects in software artifacts, the more/fewer new urgent development activities are performed). in the reinforcing feedback loop, new urgent development activities are a symptomatic solution that leads to defects fixed through rework, a side effect, because once urgent development activities fix the defects in software artifacts, organization a feels like the problem has been solved. this, in turn, decreases the need for using software quality techniques, which is the more fundamental solution. as a result, software artifacts continue to be produced with defects, overloading the development team with new urgent development activities. shifting the burden is a complex behavior structure because the balancing and reinforcing loops move the system (organization a) in a direction (new urgent development activities) usually other than the one desired (software quality techniques). new urgent development activities contribute to increasing project cost and time (the project is late and over budget) because these activities were not initially planned in the project. when organization b does not properly define the project scope (scope poorly defined), organization a may allocate a team that is not suitable for the project, contributing to defects in software artifacts and to changes in the project team during the project. usually, when the team is changed, the new members need to get knowledge about the project. moreover, often the new members are more experienced and thus more expensive, which contributes to a late and over-budget project. to change the project team, members can be moved from one project to another, causing a deficit in other project teams. furthermore, there is a balancing loop between changes in the project team and defects in software artifacts. the former may cause the latter due to the instability inserted into the team.
the latter, in turn, contributes to the former because defects in software artifacts may lead to the need to change the team. there is a delay in this relationship because it can take a while to notice defects and the need to change the team. finally, scope poorly defined causes unrealistic deadlines, which contributes to late and over-budget projects. figure 3 illustrates another fragment of the developed systemic maps, showing variables related to different organizational levels. as observed in figure 3, organization b is responsible for the direct communication with the client, i.e., organization a depends on organization b to obtain information from the client. this causes in organization a a lack of contact with the final client, which contributes to low commitment from the team with the project's goals, since the team is not empowered and loses motivation. this, in turn, leads to non-self-organized teams, because the team members do not have the opportunity to implement values and practices to become self-organized, which keeps the team away from the final client. these three variables create a reinforcing loop that prevents the organization from having more proactive and committed teams. figure 3. fragment of the systemic map (2). non-self-organized teams and low commitment from the team with the project's goals contribute to a high involvement of the directors at the operational level, because they need to support the teams to solve problems (e.g., scope poorly defined, unrealistic deadline and late and over-budget project, shown in figure 2). as a consequence, they do not have enough time to be concerned with the tactical and strategic levels, which causes damage to organization growth, because the directors do not have time to plan and implement strategies that allow getting new clients, reducing costs, etc. the previous paragraph describes an example of the archetype fix that fails, which impacted the operational, tactical, and strategic levels of organization a. the archetype fix that fails is composed of a balancing feedback loop that is intended to achieve a particular result or fix a problem, and a reinforcing feedback loop of the unintended consequences. the balancing feedback loop occurs when there is a high involvement of the directors at the operational level trying to solve project problems because of the low commitment from the team with the project's goals. the reinforcing feedback loop, in turn, occurs when the directors do not have enough time to be concerned with the tactical and strategic levels because there is a high involvement of the directors at the operational level, resulting in damage to organization growth. this loop affects different organizational levels, from operational to strategic, and hampers the organization's evolution and growth. c. gut matrix after getting a comprehensive view of the organization and how it behaves, we reflected on the behaviors on which the strategies should focus. thus, we created a gut matrix to identify and prioritize behaviors of the system that are not fruitful, i.e., undesirable behaviors. they were identified mainly from the systemic maps. for example, from the fragment depicted in figure 2, based on the positive causal relationship between misunderstood requirements and defects in software artifacts, the following undesirable behavior was identified: software artifacts are developed based on misunderstood requirements.
from the shifting the burden archetype, we identified: software quality techniques are not often applied to build software artifacts. to complement the information provided by the systemic maps, we used information from the interviews to look for behaviors the literature points out as desirable in organizations moving to agile (e.g., self-organized teams) (leffingwel 2016). after identifying the undesirable behaviors, the study participants validated and prioritized them considering the gut dimensions. each dimension was evaluated considering values from 1 (very low) to 5 (very high). 13 undesirable behaviors were identified. table 1 shows a fragment of the gut matrix. table 1. fragment of gut matrix.

#    undesirable behavior                                                           g  u  t  gxuxt
ub1  software artifacts are developed based on misunderstood requirements           5  5  5  125
ub2  software quality techniques are not often applied to build software artifacts  5  5  4  100
ub3  projects are late and over budget                                              5  5  4  100
ub4  organization has inconsistent knowledge of agile methods                       5  5  4  100
ub5  teams are not self-organized                                                   5  4  4  80
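the prioritization behind table 1 is simply the product of the three ratings. the short sketch below is our illustration, not an artifact from the study; it recomputes the gxuxt scores of table 1 and ranks the undesirable behaviors.

```python
# gut prioritization: gravity x urgency x tendency, each rated from 1 to 5
behaviors = {
    "ub1: software artifacts are developed based on misunderstood requirements": (5, 5, 5),
    "ub2: software quality techniques are not often applied": (5, 5, 4),
    "ub3: projects are late and over budget": (5, 5, 4),
    "ub4: organization has inconsistent knowledge of agile methods": (5, 5, 4),
    "ub5: teams are not self-organized": (5, 4, 4),
}

ranked = sorted(
    ((g * u * t, name) for name, (g, u, t) in behaviors.items()),
    reverse=True,
)
for score, name in ranked:
    print(f"{score:>4}  {name}")  # ub1 scores 125 and comes first; ub5 scores 80
```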
for each undesirable behavior, we analyzed the systemic maps and the interviews and identified its causes. (ub1) software artifacts are developed based on misunderstood requirements because (c1) requirements are not satisfactorily described and (c2) there is poor communication between the client and the development team. c1 was identified directly from the systemic map. c2 was based on information about the procedure followed by organization a to communicate with the client: when there is any doubt about requirements, the contact is made mainly through email or comments on issues in the project management system, and only organization b has direct contact with the client. c1 and c2 are also causes of (ub2) software quality techniques are not often applied to build software artifacts, since the lack of well-defined requirements and of direct contact with the client impacts verification and validation activities. moreover, there is a (c3) lack of clear and objective criteria to evaluate results and (c4) large deliverables, which make it difficult to evaluate results. as can be noticed in figure 2, projects are late and over budget (ub3) mainly because of c1 and (c5) unstable scope and deadline. moreover, (c6) unsuitable team allocation and c4 also affect project cost and time: the former because low productivity impacts project time and, thus, cost; the latter because it is difficult to estimate large projects. regarding (ub4) organization has inconsistent knowledge of agile methods, some members of the organization had previous experience with agile methods in other companies, others had a previous unsuccessful experience in organization a, and others had not experienced agile methods at all. most of the members were not sure about agile concepts and practices. therefore, this undesirable behavior is caused by (c7) organization's members had different experiences with agile and (c8) agile concepts and practices are not well known by the organization. finally, teams are not self-organized (ub5) due to the (c9) traditional development culture that produces functional and hierarchical teams. after identifying the causes of undesirable behaviors, the study participants validated them. table 2 shows the identified causes and the respective undesirable behaviors. table 2. causes of undesirable behaviors.

#   cause                                                                ub1  ub2  ub3  ub4  ub5
c1  requirements are not satisfactorily described                         x    x    x
c2  poor communication between client and development team                x    x
c3  lack of clear and objective criteria to evaluate results                   x
c4  large deliverables                                                         x    x
c5  unstable scope and deadline                                                     x
c6  unsuitable team allocation                                                      x
c7  organization's members had different experiences with agile                          x
c8  agile concepts and practices are not well known by the organization                  x
c9  traditional development culture                                                           x

d. strategies the causes of undesirable behaviors and the prioritization made in the gut matrix showed us leverage points of the system, i.e., points that, if changed, could change the system behavior. therefore, we defined strategies to help organization a move towards the second stage of sth by changing leverage points of the system and thus creating new behaviors in the system in that direction. we started by defining strategies to change the undesirable behaviors at the top of the gut matrix and the causes related to more than one undesirable behavior. after we had defined the strategies, we presented them to the team in a meeting, and they provided feedback that helped us make the strategies more suitable for the organization. next, we present four strategies defined to address the causes presented in table 2. considering organization a's characteristics, mainly its partnership with organization b, the strategies combined agile and traditional practices. agile approaches bring the culture of self-organized teams, shorter development cycles, user stories, smaller deliverables, among other notions (karvonen et al. 2015) (leffingwel 2016). traditional approaches were used to complement agile practices; after all, agile methods usually do not detail how to manage some aspects of a software project, such as costs and risks. the first strategy, new procedure to communicate requirements (s1), consisted in establishing a new procedure to be followed by organizations a and b regarding requirements and communication, aiming to address c1 and c2. due to business agreements, a big change in organization b was not possible. for example, we could not change the fact that only organization b could directly contact the project client. hence, it was defined that requirements would be sent from organization b to the project tech leader, who would rewrite the requirements as user stories and validate them with organization b. by representing requirements as user stories, the project tech leader also needs to represent their acceptance criteria, which helps address c3. moreover, to properly define the acceptance criteria, the tech leader needs to obtain detailed information about the requirement, stimulating organization b to get such information from the client, which indirectly improves communication with the client. only user stories defined according to the defined template and validated with organization b proceed to the next development activities. we also suggested the use of a template based on bdd (behavior-driven development) (wynne et al. 2017) and the gherkin syntax (binamungu et al. 2020), describing business rules, acceptance criteria and scenarios, to serve as a protocol to communicate requirements among organizations a, b and the client. it is worth mentioning that we were not allowed to ask organization b to write the requirements itself following the new guidelines, because this change was beyond the partnership agreements. in this strategy, we designated organization b to play the product owner role. this way, it is not only a business partner: it represents the client's interests and has responsibilities in this context.
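to make the s1 protocol concrete, the sketch below shows a hypothetical user story whose acceptance criteria are written as a gherkin-style scenario, plus the gate that only lets template-conformant, validated stories proceed to development. the template, field names and example story are our assumptions; the paper does not publish organization a's actual template.

```python
# a hypothetical sketch of the s1 gate (field names and story are illustrative)
from dataclasses import dataclass, field

@dataclass
class Scenario:
    """one acceptance scenario in given/when/then form (gherkin style)."""
    given: str
    when: str
    then: str

    def to_gherkin(self) -> str:
        return f"given {self.given}\nwhen {self.when}\nthen {self.then}"

@dataclass
class UserStory:
    role: str
    goal: str
    benefit: str
    scenarios: list[Scenario] = field(default_factory=list)
    validated_with_org_b: bool = False

    def ready_for_development(self) -> bool:
        # the s1 gate: acceptance criteria present and validated with organization b
        return bool(self.scenarios) and self.validated_with_org_b

story = UserStory(
    role="registered user",
    goal="reset my password by email",
    benefit="i can regain access to my account",
    scenarios=[Scenario(
        given="a registered user with a confirmed email address",
        when="the user requests a password reset",
        then="a single-use reset link is emailed to the user",
    )],
)
story.validated_with_org_b = True
assert story.ready_for_development()
print(story.scenarios[0].to_gherkin())
```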
with this strategy, we also aimed to minimize the symptomatic solution (new urgent development activities) indicated in the shifting the burden archetype identified in the systemic map. according to meadows (2008), the most effective strategy for dealing with a shifting the burden structure is to employ the symptomatic solution while developing the fundamental solution. thus, it is possible to resolve the immediate problem and also work to ensure that it does not return. by improving requirements descriptions and defining clear acceptance criteria, software quality techniques (e.g., verification and validation), which are the fundamental solution identified in the shifting the burden archetype, can be properly applied. another strategy, budget and time globally and locally managed through short development cycles (s2), focused on changing the undesirable behavior ub3 (projects are late and over budget). again, to change that, organization a depended on changes in organization b. therefore, it was established that, at the beginning of a project, organizations a and b should agree on the project scope, deadline, budget and involved risks. the project characteristics (e.g., technologies, domain of interest, platform, etc.) should also be clearly established. the project team would not be allocated before this agreement. by properly aligning information about the project between organizations a and b, it would be possible to allocate a development team with skills and maturity suitable for the project. by doing that, c5 and c6 would be minimized. complementarily, it was defined to change the development process as a whole. in the organization a business model, when a project is contracted by a client, there is usually a cost and time associated with it. this prevented us from using a pure agile development process, where costs are dynamically established. as a strategy to implement tailored agile practices, it was defined that, after requirements are validated, the development team (tech leader and developers) selects the requirements to be developed in a short development cycle (i.e., a sprint), defines tasks and estimates the time and costs related to them. this information is aligned between organizations a and b. this way, organization b manages time and budget at the project level, while organization a manages time and budget in the sprint context. once a week, monitoring meetings are performed to check time and budget performance. during the sprint, meetings based on the scrum ceremonies are carried out in a flexible way. for example, if the team informs that there is nothing to report on the day, the daily meeting is not performed. meetings that depend on the client's feedback should be carried out with organization b (in the product owner role). by breaking the development process into shorter cycles, c4 is addressed, since the product is also decomposed into smaller deliverables. this strategy also contributes to treating c9, as it changes the traditional development culture. aiming to change the way teams are organized in organization a (ub5) and thus address c9, the strategy self-organized teams (s3) was defined to implement the squad and guild concepts (leffingwel 2016). a squad is a team with all the skills and tools needed to develop and release a project.
it is self-organized and can make decisions about its way of working. for example, a squad can define the project development timebox (sprint) and how to implement some practices of strategies s1 and s2 (e.g., the use of bdd and how flexible scrum ceremonies can be in the project). the members are responsible for creating and maintaining one or more projects. a squad is composed of developers and a tech leader, who is responsible for communicating with organization b, mainly regarding aspects related to budget, time, and requirements. a guild is a team responsible for defining standards and good practices to be used by all squads. a guild is composed of members with expertise in the subject of interest (e.g., a senior programmer can define good programming practices). its purpose is to record and share good practices among the squads in the organization, aiming at achieving a homogeneous level of quality in the projects. to address c7 and c8, which cause the organization to have inconsistent knowledge of agile methods (ub4), we defined agile common conceptualization (s4) as a strategy to use reference ontologies to provide a common conceptualization about the software engineering domain as a whole, and about the agile development process in particular. we used ontologies from seon (ruy et al. 2016) to extract the view relevant to understanding agile development. it contains a conceptual model fragment, axioms and textual descriptions that provide an integrated view of agile and traditional development, defining concepts in a clear, objective, and unambiguous way. we suggested the use of seon because its ontologies have been developed based on the literature and several standards, providing a consensual conceptualization. moreover, as we discussed in section 2, we have successfully used it in several interoperability and knowledge-related initiatives. the seon view used in the study focuses on the scrum reference ontology (sro) and can be seen in (santos jr et al. 2021a). to make it easier for the teams to learn and apply the conceptualization provided by the ontology, the authors created complementary artifacts that combine graphical and textual elements. we show some of the produced artifacts in section 4.3.2. table 3 summarizes the defined strategies, the leverage points (causes) addressed by them, and the main agile concepts involved. it is worth noticing that some agile concepts were indirectly addressed. for example, although we did not directly use a product backlog in s1, the set of requirements agreed with organization b works as such. similarly, in s2, when the team selects the requirements to be addressed in a development cycle, we are applying the sprint backlog notion. we decided not to use some of the original terms because organization a had had a previous bad experience trying to implement agile practices by following scrum "by the book", which did not work and provoked resistance to certain practices. thus, we tried to give some flexibility even to the practices' names, to avoid bad links with the previous experience. table 3. strategies, causes and agile concepts.
#   strategy                                                                       agile concepts                                                 causes
s1  new procedure to communicate requirements                                      user story, bdd, product owner and product backlog             c1, c2, c3
s2  budget and time globally and locally managed through short development cycles  sprint, sprint backlog, scrum meetings and small deliverables  c4, c5, c6, c9
s3  self-organized teams                                                           squad and guild                                                c9
s4  agile common conceptualization                                                 concepts related to agile software development                 c7, c8

after defining and validating the strategies with the team, they were executed by the organization in two projects under the supervision of the first and third authors. the first project started and finished during this study. the second project started before the study and was still ongoing at the time we wrote this paper. the new practices started to be used in early february 2020. about four months later, we conducted an interview to obtain feedback. at that point, one of the projects had already been concluded and the other was ongoing. 4.3 study analysis, interpretation and lessons learned in this section, we present results from the interviews that helped us answer the research question, the resulting systems theory-based process that arose from this study, and some lessons learned. 4.3.1 results to answer the research question, we carried out an interview with the software development director and the tech leader aiming to obtain their perception about the use of systems theory tools, gut matrix and reference ontologies, as well as to get information about the results obtained from the use of the defined strategies. they were interviewed together in a single session. the director said that, in his opinion, systems theory tools provided means to understand how different organizational aspects (e.g., business rules and software quality practices) are interrelated and influence each other, and how these aspects and interrelations produce desirable and undesirable behaviors. for example, he said that "the systemic maps allowed me to understand how poorly specified requirements can negatively impact different parts of the project and of the organization". moreover, according to him, "systems theory helped create strategies to change undesirable behaviors, since it provided a comprehensive understanding of the organization behavior and supported identifying causes of undesirable behaviors". for example, by knowing the impacts of poorly specified requirements, "i perceived the need to implement practices to guarantee the quality of the requirements and that development tasks should only start if the developer truly understood the requirement". regarding the gut matrix, the director stated that he found it easy to use and important to prioritize the undesirable behaviors to be changed first. according to him, using these tools "was easier and clearer when compared to ishikawa and pareto diagrams, because systemic maps allow more comprehensive and freer views and gut matrix has a simple way of prioritization." concerning reference ontologies, he reported that they were useful to create a common communication ground among project stakeholders and business partners, eliminating some misunderstandings not only about agile practices but also about software engineering in general.
for example, the director said that "by using the conceptualization provided by the ontology, the team truly understood the 'done' concept", commonly used in agile projects, in the sense that a software item (e.g., a functionality, a component) is done (i.e., ready to be delivered to the client) only if it meets all the acceptance criteria established for the user stories materialized in that software item. the tech leader commented that "by using the ontology conceptualization, it became clearer which information a requirement description should contain so that it can be properly understood." an interesting aspect pointed out by the interviewees was that the conceptualization provided by the reference ontologies was used by the development teams as a basis for quality rules in the projects (e.g., when a software item is done) and also for business rules in new business contracts (e.g., acceptance criteria need to be defined). the director and the tech leader informed us that the first project in which the strategies were implemented was considered a successful experience and served as a pilot. in similar projects, organization a used to be 30% to 50% over time and budget due to spending extra resources on new urgent development activities to fix defects. by adopting the defined strategies, the project delivered a better product (at the moment of the interview, the client had not reported any defect in the production environment). however, the project was about 15% over budget and time due to changes in the agreed requirements. this may suggest that strategies s1 and s2 need adjustments. although they seek to give some agility features to the development process, the project had its scope predefined by organization b, which established it together with the client and set cost and time considering that scope. as organization a started to develop the agreed requirements, organization b noticed that some of them needed to change to better satisfy the client's needs. although the project was late and over budget, the deviation in relation to the agreed cost and time was smaller than in similar projects that did not follow the strategies. the director pointed out that being able to show this difference to organization b, indicating the causes that contribute to increasing or decreasing it, was an important result and can even be used to motivate organization b to be more involved in the changes to improve the software development process as a whole. this would make it possible, for example, to adjust strategies s1 and s2 to make requirements elicitation and cost and time estimation more flexible. the tech leader reported that using the strategies reduced misunderstandings about software requirements among the stakeholders and enabled better management of budget and time locally, in short development cycles. moreover, according to him, in the second project adopting the strategies (the ongoing project), the development team spent only 45 hours on new urgent development activities out of a total of about 2000 hours of performed development activities. he also highlighted the use of user stories and bdd as an effective way to communicate requirements in this project. in addition, the interviewees said that a self-organization culture has been developed in the teams and that the use of squads has been very helpful. the use of guilds was still in progress.
finally, they commented that, although the proposed strategies were used to address some undesirable behaviors by applying agile practices and concepts, they felt that "changing the entire traditional culture can be a complex work", mainly because it requires changing mental models, processes and culture that also involve the organization's partners (particularly organization b) and clients.

aiming to obtain quantitative data to complement the feedback provided by the software development director and the tech leader and to help us identify the effects of the adopted strategies, we collected data from the two projects (one finished and another ongoing) where the strategies were implemented and from other projects that did not use the strategies. data was extracted from jira, which is used by organization a to support part of the software development process. considering that the strategies were applied in the projects at different moments (the first project adopted the strategies from its beginning to its end, while the second adopted the strategies when it was already ongoing), we decided to analyze them separately. first, we collected data regarding the tasks performed in the first project and in 22 other projects that did not adopt the strategies and were carried out in the same time-box as our study. the tasks were classified into development tasks, which create new features, and bug-fixing tasks, which fix problems (found by the quality assurance team or by the client) in the developed features. for each project, we calculated the percentage of effort spent on tasks dedicated to developing new features and the percentage spent on tasks performed to fix bugs. then, we calculated the median of the obtained values for the 22 projects that did not use the strategies, so that we could compare the resulting value with the project where the strategies were adopted. table 4 shows the results.

table 4. effort spent on development and bug-fixing tasks in different projects.
task | project that adopted the strategies | projects that did not adopt the strategies
development | 97.62% | 81.07%
bug-fixing | 2.38% | 18.93%

as can be observed in table 4, when compared with the other projects developed in the same time-box, the development team of the project that adopted the defined strategies spent more effort on developing new features (97.62%) than on fixing problems (2.38%). this corroborates the interviewees' perception that the proposed strategies improved product and process quality. aiming to verify changes caused by the strategies in the same project, we also collected data from the beginning of the second project (i.e., jan/2019) until the last month of our study. our purpose was to compare the effort spent on each type of task before and after applying the strategies. table 5 presents the obtained values.

table 5. effort spent on development and bug-fixing tasks before and after applying the strategies in the project.
task | before the strategies | after the strategies
development | 62.15% | 88.21%
bug-fixing | 37.85% | 11.79%

as can be noticed, after applying the strategies, there was an increase in the effort spent on developing new features and a reduction in the effort spent on fixing bugs, which is consistent with the interviewees' perception. it is worth noticing that more time was spent on the project before applying the strategies (about one year) than after that (about four months). this should be considered together with the obtained data (e.g., we do not know if the share of effort spent on each type of task may significantly change over time).
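to illustrate how the values in tables 4 and 5 can be derived, the sketch below computes the effort split per project and the median across the baseline projects. it is a minimal sketch assuming a simple export of jira task records as (project, task type, hours) tuples; the record format, project names and figures are hypothetical, not the actual data of the study.

```python
from statistics import median

# hypothetical task records exported from jira: (project, task_type, hours)
tasks = [
    ("pilot", "development", 380.0), ("pilot", "bug-fixing", 9.0),
    ("p2", "development", 250.0), ("p2", "bug-fixing", 60.0),
    ("p3", "development", 410.0), ("p3", "bug-fixing", 95.0),
]

def effort_split(records):
    """percentage of effort spent on each task type, per project."""
    totals = {}
    for project, task_type, hours in records:
        totals.setdefault(project, {}).setdefault(task_type, 0.0)
        totals[project][task_type] += hours
    return {
        project: {t: 100 * h / sum(by_type.values()) for t, h in by_type.items()}
        for project, by_type in totals.items()
    }

split = effort_split(tasks)
# median development share across the projects that did not adopt the strategies
baseline = median(v["development"] for k, v in split.items() if k != "pilot")
print(f"pilot: {split['pilot']['development']:.2f}% dev, baseline median: {baseline:.2f}% dev")
```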
4.3.2 using reference ontologies to learn scrum
although reference ontologies are a good way to structure and represent knowledge, it may not be easy for some people to capture and internalize the conceptualization represented in the ontology. thus, in the case reported in this paper, we used some complementary artifacts to help in this matter. first, we asked the team which artifacts they were used to. based on their answers, we decided to use mainly textual descriptions and process models, since the team considered them user-friendly and they were present in their daily activities. we also used other diagrams to illustrate scrum concepts and a kanban board to map scrum concepts to concepts already familiar to the team. the seon extract addressing agile aspects and connecting them to traditional aspects provided the common conceptualization and knowledge about the domain of interest. for example, the ontology makes it explicit that only deliverables (i.e., software items, such as a functionality or a component) that meet all the acceptance criteria established for the user stories they materialize can be added to the sprint deliverable (e.g., a software module) and, thus, to the project deliverable (e.g., a software product). the complementary artifacts, in turn, present the conceptualization to the team by using alternative representations. as we previously said, the seon extract used in this study focuses mainly on the scrum reference ontology (sro) and can be found in (santos jr et al. 2021a). table 6 summarizes some concepts from sro used in this study.

table 6. some concepts from sro.
concept | description
scrum project | software project that adopts scrum in its process.
sprint backlog | artifact that contains the requirements of the product to be developed in the scrum project.
planning meeting | ceremony performed in a sprint where the development team plans it.
user story | requirement artifact (i.e., a requirement recorded in some way) that describes requirements in a scrum project. it indicates a goal that the user expects to achieve by using the system and, thus, represents value for the client. a user story can be an atomic user story, when it is not decomposed into others, or an epic, when it is composed of other user stories.
acceptance criteria | requirement established for a user story that must be met when the user story is materialized. thus, it is used to verify if the user story was developed correctly and meets the client's needs.
intended scrum development task | development task planned to be performed in a sprint.
performed scrum development task | development task performed in a scrum project.
deliverable | software item that materializes user stories.
accepted deliverable | deliverable that conforms to all the acceptance criteria established for the user stories materialized by that deliverable.
not accepted deliverable | deliverable that does not conform to at least one acceptance criterion established for the user stories materialized by that deliverable.
sprint deliverable | accepted deliverable resulting from a sprint.

figure 4 illustrates the relationship between the reference ontologies and the complementary artifacts. as a result of this approach, we shortened the distance between the team and the conceptualization provided by the ontologies, improving domain understanding and communication.

figure 4. reference ontologies and complementary representation artifacts.
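the acceptance rule expressed by the sro concepts in table 6 can also be stated in code. the sketch below is an illustrative model of our own (not an artifact of the study or of seon) that encodes when a deliverable counts as accepted:

```python
from dataclasses import dataclass, field

@dataclass
class AcceptanceCriterion:
    description: str
    met: bool = False

@dataclass
class UserStory:
    title: str
    criteria: list = field(default_factory=list)

@dataclass
class Deliverable:
    name: str
    stories: list = field(default_factory=list)  # user stories it materializes

    def accepted(self) -> bool:
        # accepted deliverable: conforms to all acceptance criteria
        # established for the user stories it materializes
        return all(c.met for s in self.stories for c in s.criteria)

login = UserStory("login", [AcceptanceCriterion("rejects wrong password", met=True)])
print(Deliverable("login module", [login]).accepted())  # True
```

only deliverables for which accepted() holds would be integrated into the sprint deliverable; the others would be addressed in new tasks.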
to address behavioral aspects of scrum (e.g., activities, the flows between them and the objects they manipulate), we created process models. for that, we first mapped concepts from seon to constructs of bpmn (omg 2013), which was the modeling language used to represent the process models. put simply, we identified which bpmn constructs should be used to represent seon concepts or their instances. for example, the activity bpmn construct should be used to represent performed scrum development task, ceremony, planning meeting and other seon concepts referring to activities, tasks or processes. the actor bpmn construct, in turn, should be used to represent seon concepts referring to people or roles, such as developer and product owner. then, we represented and complemented knowledge provided by the reference ontologies by creating process models like the one illustrated in figure 5. in the process models, following the approach suggested in (guizzardi et al. 2016), we used the event construct to represent the state of affairs (i.e., a situation) caused by the execution of an activity or when a temporal constraint started or ended. the process model shown in figure 5 was used to illustrate the creation of the sprint backlog in the planning meeting ceremony, the selection of user stories to be implemented in performed scrum development tasks and materialized by deliverables, and the validation of the deliverables that, if accepted (accepted deliverable), are integrated into the sprint deliverable. if not accepted (not accepted deliverable), they must be addressed in new tasks. the highlighted terms refer to the seon concepts addressed in the process model presented in figure 5. the process complements the conceptualization provided by the ontology by making explicit some activities, the flow between them and the states of affairs resulting from the activities' execution.

figure 5. example of process model created based on seon conceptualization.

in addition to process models, we also used some diagrams to better illustrate some concepts. for example, to help the team visualize that (i) an epic is a complex user story composed of others, (ii) user stories must have acceptance criteria established for them, and (iii) in the sprint backlog, tasks are planned (i.e., intended scrum development tasks) to implement the user stories, we used the diagram shown in figure 6. organization a did not have a clear semantic distinction between epic, user story and task. many times, these concepts were treated in the same way, being considered simply an issue by the developers. this lack of conceptual distinction caused problems in project management, estimation, requirements prioritization and communication with organization b and the client. by using the conceptualization provided by seon and a simple diagram (like the one shown in figure 6), the team better understood these concepts and was able to properly use them in backlog management.

figure 6. diagram used to illustrate the relation among sprint backlog, epic, user story and intended task.
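returning to the seon-to-bpmn mapping described at the beginning of this subsection, it is essentially a lookup from ontology concepts to modeling constructs. the fragment below sketches it under our own assumptions: the activity, actor and event entries follow the text above, while the data-object entries are our guesses for artifact-like concepts, not something the study specifies.

```python
# illustrative seon-concept -> bpmn-construct mapping (names are ours)
SEON_TO_BPMN = {
    "performed scrum development task": "activity",
    "ceremony": "activity",
    "planning meeting": "activity",
    "developer": "actor",
    "product owner": "actor",
    "situation (state of affairs)": "event",
    # assumed: artifact-like concepts represented as bpmn data objects
    "sprint backlog": "data object",
    "deliverable": "data object",
}

def bpmn_construct(seon_concept: str) -> str:
    """return the bpmn construct used to represent a seon concept."""
    return SEON_TO_BPMN.get(seon_concept.lower(), "unmapped (decide case by case)")

print(bpmn_construct("Planning Meeting"))  # activity
```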
we also used a kanban board to illustrate some concepts. for example, figure 7 depicts a sketch where we explored tasks and deliverables, showing that if a card is moved to the "done" column, the deliverable produced by the corresponding task must have been evaluated (considering the acceptance criteria related to the respective user story) and accepted.

figure 7. kanban board illustration used to explore task and deliverable concepts.

to complement the created artifacts, we also created a dictionary of terms (similar to table 6) containing textual definitions of seon concepts and some constraints. the ontology and complementary artifacts were used in two workshops where the first and third authors presented the reference ontology and explained its conceptualization by using its conceptual model and the complementary artifacts.

4.3.3 systems theory-based process
an important result that arose from this study is a process that combines systems theory tools and gut matrix to aid organizations in moving from traditional to agile development. figure 8 shows the process, and we briefly explain it next (a minimal scoring sketch for the gut matrix follows the process description).

figure 8. process to aid defining strategies and implementing agile practices.

understand the organization: this consists of obtaining information to understand the organization as a whole, so that it will be possible to define strategies to implement agile practices in a way suitable for the organization, considering its culture, environment, business rules, software processes, agile experience and knowledge, people, and so on. information can be obtained by using techniques such as interviews, document analysis and observation, among others.
build a systemic view: this consists of using the information obtained in the previous step to build systemic maps to understand organization behaviors relevant in the agile development context. organization borders, relevant variables that drive organization behavior, causal relationships between them and feedback loops must be represented. archetypes describing behavior patterns must also be identified from the systemic maps.
identify leverage points: this involves analyzing systemic maps and archetypes to identify undesirable behaviors and their causes. at this point, desirable behaviors in agile organizations suggested in the literature can also be used to verify if the organization fits them. undesirable behaviors should be prioritized by using a gut matrix, so that it is possible to identify which ones represent leverage points and will be addressed in the strategies.
establish strategies: this consists of defining strategies (i.e., plans and actions) to implement agile practices focusing on the leverage points and considering the organization's culture, business rules, environment, people, etc.
implement strategies: this involves implementing the defined strategies. it is suggested to start with one or two projects. after that, if the strategies work, they can be extended to other projects and then to the entire organization.
monitor strategies: this consists of evaluating if undesirable behaviors changed as expected after the strategies' execution. the new behaviors caused by the strategies need to be evaluated and, depending on the results, the strategies can be extended to other projects, aborted or adjusted.
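as referenced above, the fragment below sketches the gut prioritization used in the identify leverage points step. in a gut matrix, each undesirable behavior receives gravity (g), urgency (u) and tendency (t) scores from 1 to 5, and the priority is the product g × u × t; the behaviors and scores below are illustrative, not the ones from the study.

```python
# gut matrix: priority = gravity * urgency * tendency, each scored 1..5
behaviors = {
    "poorly specified requirements": (5, 5, 4),
    "extra effort on urgent bug fixing": (4, 5, 4),
    "no semantic distinction between epic, story and task": (3, 4, 3),
}

ranked = sorted(behaviors.items(),
                key=lambda kv: kv[1][0] * kv[1][1] * kv[1][2],
                reverse=True)
for name, (g, u, t) in ranked:
    print(f"{g * u * t:3d}  {name}")  # highest scores are the leverage points
```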
4.3.4 lessons learned
in this section, we discuss some lessons we learned in the study. in the lessons learned, we adopt terms such as should and may instead of mandatory terms such as must, because we learned the lessons from a single case study. thus, we believe that other studies are needed to corroborate what we have learned.

systemic maps should be built with a goal in mind: since systemic maps make it possible to represent a comprehensive view of how the organization behaves, and this may involve many aspects, it is important to focus on variables relevant to the goal to be achieved from the use of the systemic maps. otherwise, the maps can become too complex and involve variables that do not provide meaningful information for the desired purpose.
the boundaries of the system should be clearly identified: to understand how external elements can influence organization behaviors, it is important to identify the organization's boundaries, as well as the elements that the organization controls and the ones controlled by external agents. this way, it will be possible to create suitable strategies considering both the organization and the external agents.
changes in leverage points may change the system as a whole: we noticed that when changes are made in leverage points, particularly the ones connected to undesirable behaviors with higher priority, the changes tend to provoke a meaningful shift in the organization's behavior as a whole, changing existing behaviors and creating others. for example, by changing the way organizations a and b deal with project scope, time and budget, there were also changes in the way organization a allocates teams and selects requirements to be implemented, and the need for changes in the partnership rules with organization b was perceived.
strategies should be integrated into the software processes: for strategies to be performed as part of the organization's daily activities, it is important that they are incorporated into the processes performed by the organization. in the study, the strategies were incorporated into the organization's software process, involving development, management, and quality assurance activities.
strategies should be gradually implemented and start in relevant projects: implementing the changes gradually and starting with one or two projects was positive, and the obtained results contributed to the organization keeping its intention of expanding the changes to other projects. we selected projects in which the teams were interested in using agile practices and that were important for the organization, so that the commitment of the team would be higher. this helped to minimize resistance to the new practices. once they experienced the benefits of following the strategies, team members became disseminators of the new practices and concepts, helping to extend the agile culture to other team members.
strategies results should be measurable: when defining the strategies, we did not define any indicator to measure their effectiveness. however, the tech leaders used some metrics in the projects (e.g., number of hours spent on new urgent development activities, budget deviation, etc.) that helped us to evaluate the strategies. thus, when defining the strategies, it is important to define the indicators to be used to evaluate them.
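as a minimal illustration of such indicators, the sketch below computes the two metrics mentioned above from the figures reported earlier in this section; the function names are ours and the cost inputs are only examples.

```python
def urgent_work_share(urgent_hours: float, total_hours: float) -> float:
    """share of effort spent on new urgent development activities (%)."""
    return 100 * urgent_hours / total_hours

def budget_deviation(actual_cost: float, agreed_cost: float) -> float:
    """deviation from the agreed budget (%); positive means over budget."""
    return 100 * (actual_cost - agreed_cost) / agreed_cost

print(urgent_work_share(45, 2000))  # ~2.25, as in the second project
print(budget_deviation(115, 100))   # 15.0, as in the first project
```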
using systems theory tools may be costly and not trivial: although systems theory tools were very useful to provide an understanding of the organization, they may be a costly choice, because they demand time, effort and knowledge of the tools and of the organization. hence, depending on the scope to be considered, it may be difficult or unfeasible to use them. other methods can be helpful in this context. considering this lesson, we created zeppelin (santos jr et al. 2021b), a diagnosis instrument that helps get a "big picture" of the organization by identifying the software practices it performs. thus, zeppelin can be used to provide initial knowledge about the organization scenario, making it possible to narrow the scope to be further investigated through systems theory tools.
representing the ontology conceptualization using process models, textual descriptions and simple diagrams can be more palatable than conceptual (structural) models: the reference ontologies of seon are represented by means of conceptual (structural) models, textual descriptions, and axioms. although the conceptual model of the seon view used in the study provides an abstract view showing all the relevant concepts and relations in a single model (santos jr et al. 2021a), we noticed that the team preferred textual descriptions and other representations to the seon conceptual model. thus, we prepared a document containing the concepts relevant to the study and their detailed descriptions, also including information about constraints and relationships. we also prepared complementary artifacts using process models and other diagrams to illustrate and complement the knowledge provided by seon. this way, the conceptualization provided by the ontologies was represented in a more palatable way for the team.
a consolidated and accessible body of knowledge may help achieve a common conceptualization: in the study, we used ontologies as a reference to establish a common conceptualization of agile development. we are very familiar with ontologies, and two of the authors are also authors of the ontologies used in the study, which were established based on the literature and standards and, thus, provide a consensual view of the domain of interest. considering organization a's needs and the participation of the authors in the study, the ontologies used were a perfect fit. however, we are aware that, for an organization not familiar with ontologies, using an ontology as the starting point to establish a common conceptualization can be challenging. we believe that organizations should use a body of knowledge suitable for their characteristics to establish a common conceptualization about the domain of interest. for example, some organizations may prefer to use textual references, such as the scrum body of knowledge (satpathy 2013).
changes involving business partners can be hard to implement and demand more flexibility and time: the way organization b works directly affects organization a. due to business arrangements, organization a does not have enough influence to make changes in organization b. it can suggest changes, but it cannot demand them. thus, it was necessary to define strategies that caused only small changes in organization b (e.g., helping to better describe requirements, allowing shared control of time and cost). by noticing improvements from the use of the proposed strategies, organization b may become more open to further changes.
squads should have autonomy to choose methods and tools: the organization can have a set of tools, techniques and methods to be adopted in the projects, and guilds can help define that. depending on the project and team characteristics, some tools, methods and techniques may suit better than others. we noticed that the squad became more self-organized when its members could choose the techniques to solve the project problems. for example, in the study, one squad decided to adopt user stories and bdd to describe requirements, while the other used the complete user story template. in both cases, information about requirements was clear and complete. however, each squad chose the technique most suitable for the project and team characteristics.
agile-related human aspects need to be developed gradually: agile culture demands some soft skills (e.g., self-organization, proactivity, empathy) (lima and porto 2019) that are not common in a traditional plan-driven environment. we observed that some members had problems materializing what it means to be self-organized, proactive and empathetic, because they were used to the command-and-control of traditional culture. we noticed that by using short-, medium- and long-term actions (what counts as short, medium and long is established by the organization), it is possible to gradually develop an agile culture. short-term actions should focus on understanding the needed skills (e.g., promoting debates about soft skills in software development) and practicing them in the projects. medium-term actions should empower the use of soft skills combined with hard skills (e.g., human-centered design (smith et al. 2012)). finally, long-term actions should institutionalize the soft and hard skills and truly change the whole organization culture.

5 threats to validity of the study results
the validity of a study denotes the trustworthiness of the results. every study has threats and limitations that should be addressed as much as possible and considered together with the results. in this section, we discuss some threats considering the classification proposed in (runeson et al. 2012). the main threat in this study is related to the researchers who conducted the study. participative case studies are biased and subjective, as their results rely on the researchers (baskerville 1997). the first and third authors acted as consultants in organization a and were responsible for conducting the interviews, creating the systemic maps and gut matrix, and defining the strategies. moreover, the authors created the complementary artifacts used to share knowledge of scrum. since the authors were very familiar with seon, they did not have difficulties creating the artifacts. other people, less familiar with seon, could have difficulties creating the artifacts or could have created different artifacts. furthermore, to create the artifacts, the authors took the team's preferences into account (the team told us that process models and diagrams were a good choice for it). the researchers' participation affects internal validity, which is concerned with the relationship between results and the applied treatment; external validity, which regards the extent to which it is possible to generalize the results from the case-specific findings to different cases; and reliability, which refers to the extent to which data and analysis depend on specific researchers.
aiming to reduce the researchers' bias, the members of organization a who participated in the study (i.e., two directors, one tech leader and two developers) took part in the activities and validated the results. moreover, another researcher (the second author), external to the organization, evaluated the data collection and analysis and was involved in discussing and reflecting on the study and its results. concerning construct validity, which is related to the constructs involved in the study, the main threat is that we did not define indicators to evaluate the results. data collection was performed through interviews, which are subjective. to minimize this threat, we used some measures collected in the projects to evaluate the new behaviors caused by the proposed strategies. however, since the measures were not previously defined, they are limited in enabling a proper evaluation of the strategies and the effects caused by them. another threat concerns the notations used to create the complementary artifacts, since the team could misunderstand the represented concepts due to different semantics assigned to the constructs. to address this threat, the authors asked the team to choose the notations to be used and the types of artifacts to be created, so that it was possible to produce artifacts consistent with the team's knowledge. in case-based research, after getting results from specific case studies, generalization can be established for similar cases. however, the aforementioned threats constrain generalization. moreover, the study involved only one organization. thus, it is not possible to generalize the results to cases without researcher intervention or to organizations not similar to organization a.

6 conclusions, future work and implications
this paper presented a case study carried out in a brazilian organization towards the first transition in the path prescribed by the stairway to heaven (sth) model (olsson et al. 2012). organization a develops software in partnership with a european organization (organization b) and does not have direct contact with clients. after an unsuccessful attempt to implement agile practices "by the book", the organization started a long-term process improvement program. to support it, we have used sth to describe the evolution path to be followed by organization a. to aid in the first transition and the move from traditional to agile, we combined systems theory tools, gut matrix and reference ontologies. in summary, systems theory tools and gut matrix were helpful to better understand the organization, find leverage points of change and define strategies aligned with the organization's characteristics and priorities. reference ontologies were useful to establish a common understanding of agile methods, enabling teams to be aware of and, thus, more committed to agile practices and concepts. by using process models, textual descriptions and other diagrams, the conceptualization provided by seon, the software engineering ontology network (ruy et al. 2016), became more palatable to the team, helping achieve a common understanding. as a result of the initiative, the organization has implemented agile practices in a flexible way, combined with some traditional practices, which is more suitable for the organization's characteristics. due to the obtained results, the organization kept its intention to continue evolving by following the sth stages. in the first transition, it was not possible to propose big changes in the way organization b works.
however, organization a expects that, considering the positive results, organization b will be more willing to be involved in the evolution path. this will be crucial in the more advanced stages, where data from the clients are needed to support decision-making and identify new opportunities. regarding human aspects, we focused mainly on soft skills related to agile culture. strategy s3 is directly related to human aspects, being responsible for implementing squads and guilds. squads promoted self-organization, trust, leadership, and other skills important in agile organizations. guilds promoted the creation of processes and an organizational culture that enabled sharing and managing knowledge at the individual, team, and organizational levels. this knowledge is valuable to the continuous improvement of organization a. by changing human aspects, s3 enabled organization a to create processes, vocabulary, and mindset, i.e., an organizational culture that supported the movement from traditional to agile. moreover, the soft skills developed by s3 supported other strategies. for example, s1 and s2 were possible because s3 developed some soft skills (e.g., effective communication, self-organization and adaptability) that supported them. as for the limitations of our approach, we highlight that it involves a lot of tacit knowledge and judgment. besides knowledge about systems thinking tools and gut matrix, it is necessary to have organizational knowledge to apply them (e.g., one must be able to properly identify problems, investigate causes, define strategies, etc.). moreover, the evaluation of our proposal was limited. we have used it only in the study reported in this paper, which involved the participation of the authors. furthermore, the evaluation was mainly based on qualitative data. thus, new studies are necessary to evaluate the proposal in other organizations and to quantitatively evaluate the effects of using it. as future work, we plan to add knowledge (e.g., by means of guidelines) to help others use our approach. we also intend to explore other systems theory tools and combine them with enterprise architecture models to connect system variables, undesirable behaviors and causes to elements of the organization architecture. concerning organization a, we plan to monitor the implemented strategies and extend them to other projects. once the new practices become solid, we plan to aid organization a in the next transitions, where continuous integration and continuous deployment are performed. concerning the use of seon, we must point out that the authors were familiar with its conceptualization. in fact, as we previously said, the first and third authors are also authors of the scrum reference ontology (santos jr et al. 2021a), the seon ontology concerning scrum that provided the central concepts explored in the study. this made it easier to create the complementary artifacts and use them to share knowledge with the team to achieve a common understanding and conceptualization in organization a. it is also worth highlighting that, although the complementary artifacts are simple artifacts, the conceptualization behind them, provided by seon, is the key point to achieve a common conceptualization and understanding.
we have applied the portion of seon used in the case reported in this paper to integrate data from different applications and provide consolidated information to support decision-making in agile organizations, as we reported in (santos jr et al. 2021a). we intend to use seon for this purpose in organization a. since the team has learned the seon conceptualization, we believe that the first step towards this goal has already been taken. finally, the contributions of this paper have implications for practice and research. regarding implications for practice, this paper promotes the use of systems thinking tools as a means to identify leverage points relevant to moving an organization from traditional to agile development. furthermore, the proposed strategies can be used by practitioners and organizations to address problems similar to the ones of organization a. in addition, we showed how ontologies can be used to create artifacts and share a common conceptualization and understanding of agile development. others can be inspired by that to solve knowledge problems in agile and other contexts. the systems theory-based process also has implications for practice, since it can be used by other organizations to help the transition from traditional to agile development. concerning implications for research, this paper introduces the combined use of systems theory tools, gut matrix and reference ontologies to support the transition from traditional to agile development. their combined use and the proposed systems theory-based process can bring new research questions to be explored in further research. moreover, the successful use of ontologies to create artifacts more palatable to practitioners can be a starting point for new research aiming to make the most of this powerful instrument of knowledge structuring and representation. using reference ontologies in industry is still a challenge. the use of operational ontologies is more common in this context, mainly due to the semantic web and to data and systems interoperability solutions. however, reference ontologies are also valuable artifacts and provide structured, common and well-founded knowledge useful for learning and communication. we believe that new research should be conducted to investigate how to make reference ontologies more palatable for industry. in the study reported in this paper, we took the first step towards that. however, other advances are needed. in this sense, we believe that new research aiming to overcome the challenges of using ontologies in industrial settings is necessary.

references
ali n, lai r (2018) requirements engineering in global software development: a survey study from the perspectives of stakeholders. j softw 13:520–532. https://doi.org/10.17706/jsw.13.10.520-532
baskerville r (1997) distinguishing action research from participative case studies. j syst inf technol 1:24–43. https://doi.org/10.1108/13287269780000733
bastos ec, barcellos mp, falbo r (2018) using semantic documentation to support software project management. j data semant 7:107–132. https://doi.org/10.1007/s13740-018-0089-z
binamungu lp, embury sm, konstantinou n (2020) characterising the quality of behaviour driven development specifications. springer international publishing
bosch j (2014) continuous software engineering: an introduction. in: continuous software engineering. springer international publishing, cham, pp 3–13
bringuente ac, falbo r, guizzardi g (2011) using a foundational ontology for reengineering a software process ontology. in: journal of information and data management (jidm), vol. 2, pp 511–526
de sousa tl, venson e, figueiredo rmdc, et al (2016) using scrum in outsourced government projects: an action research. in: 2016 49th hawaii international conference on system sciences (hicss). ieee, pp 5447–5456
duarte bb, leal castro al, falbo r, guizzardi g, guizzardi rss, souza vs (2018) ontological foundations for software requirements with a focus on requirements at runtime. in: applied ontology, vol. 13, pp 73–105
dybå t, dingsøyr t (2008) empirical studies of agile software development: a systematic review. inf softw technol 50:833–859. https://doi.org/10.1016/j.infsof.2008.01.006
falbo r (2014) sabio: systematic approach for building ontologies. in: onto.com/odise@fois
falbo r, ruy f, guizzardi g, barcellos mp, almeida jpa (2014) towards an enterprise ontology pattern language. in: proceedings of the 29th acm symposium on applied computing (acm sac 2014)
fonseca v, barcellos mp, falbo r (2017) an ontology-based approach for integrating tools supporting the software measurement process. sci comput program 135:20–44. https://doi.org/10.1016/j.scico.2016.10.004
fitzgerald b, stol k (2017) continuous software engineering: a roadmap and agenda. journal of systems and software 123:176–189. https://doi.org/10.1016/j.jss.2015.06.063
guizzardi g (2007) on ontology, ontologies, conceptualizations, modeling languages, and (meta)models. in: proceedings of the 2007 conference on databases and information systems iv: selected papers from the seventh international baltic conference db&is'2006. ios press, nld, pp 18–39
guizzardi g, guarino n, almeida jpa (2016) ontological considerations about the representation of events and endurants in business models. in: 14th international conference, bpm 2016. rio de janeiro, pp 20–36
jiménez m, piattini m, vizcaíno a (2009) challenges and improvements in distributed software development: a systematic review. advances in software engineering 2009. https://doi.org/10.1155/2009/710971
karvonen t, lwakatare le, sauvola t, et al (2015) hitting the target: practices for moving toward innovation experiment systems. in: international conference of software business (icsob 2015). springer, pp 117–131
kepner ch, tregoe bb (1981) the new rational manager. princeton research press, princeton, nj
kim dh (1994) systems archetypes i: diagnosing systemic issues and designing high-leverage interventions (toolbox reprint series). pegasus communications, cambridge, ma
l'erario a, gonçalves ja, fabri ja, et al (2020) cfdsd: a communication framework for distributed software development. j brazilian comput soc 26. https://doi.org/10.1186/s13173-020-00101-7
leffingwell d (2016) safe® 4.0 reference guide: scaled agile framework® for lean software and systems engineering
lima t, porto j (2019) análise de soft skills na visão de profissionais da engenharia de software. in: anais do iv workshop sobre aspectos sociais, humanos e econômicos de software. sbc, porto alegre, rs, brasil, pp 31–40
meadows dh (2008) thinking in systems: a primer. chelsea green publishing
olsson hh, alahyari h, bosch j (2012) climbing the stairway to heaven: a multiple-case study exploring barriers in the transition from agile development towards continuous deployment of software. in: 2012 38th euromicro conference on software engineering and advanced applications. ieee, pp 392–399
omg (2013) business process model and notation (bpmn), version 2.0.2. technical report, object management group
prikladnicki r, audy jln (2010) process models in the practice of distributed software development: a systematic review of the literature. inf softw technol 52:779–791. https://doi.org/10.1016/j.infsof.2010.03.009
rodriguez p, markkula j, oivo m, turula k (2012) survey on agile and lean usage in finnish software industry. in: proceedings of the acm-ieee international symposium on empirical software engineering and measurement. association for computing machinery, new york, ny, usa, pp 139–148
runeson p, höst m, rainer a, regnell b (2012) case study research in software engineering: guidelines and examples, 1st edn. wiley publishing
ruy f, souza e, falbo r, barcellos m (2017) software testing processes in iso standards: how to harmonize them? in: proceedings of the 16th brazilian symposium on software quality (sbqs). pp 296–310
ruy fb, falbo r, barcellos mp, et al (2016) seon: a software engineering ontology network. in: lecture notes in computer science (including subseries lecture notes in artificial intelligence and lecture notes in bioinformatics). pp 527–542
santos la, barcellos mp, falbo r, reginaldo cc, campos pmc (2019) measurement task ontology. in: 12th seminar on ontology research in brazil (ontobras 2019)
santos jr ps, barcellos mp, calhau rf (2020) am i going to heaven? in: proceedings of the 34th brazilian symposium on software engineering. acm, natal, brazil, pp 309–318
santos jr ps, barcellos mp, falbo r de a, almeida jpa (2021a) from a scrum reference ontology to the integration of applications for data-driven software development. inf softw technol 136:106570. https://doi.org/10.1016/j.infsof.2021.106570
santos jr ps, barcellos mp, ruy fb (2021b) tell me: am i going to heaven? a diagnosis instrument of continuous software engineering practices adoption. in: evaluation and assessment in software engineering (ease 2021). acm, trondheim
satpathy t (ed) (2013) a guide to the scrum body of knowledge: sbok guide. scrumstudy, a brand of vmedu, inc
schwaber k, sutherland j (2013) the scrum guide: the definitive guide to scrum: the rules of the game
smith pj, beatty r, hayes cc, et al (2012) human-centered design of decision-support systems. in: jacko ja (ed) the human computer interaction handbook, 3rd edn. crc press, boca raton, fl, pp 589–622
sterman jd (2000) business dynamics: systems thinking and modeling for a complex world. irwin/mcgraw-hill
sterman jd (1994) learning in and about complex systems. syst dyn rev 10:291–330
studer r, benjamins vr, fensel d (1998) knowledge engineering: principles and methods. data knowl eng 25:161–197. https://doi.org/10.1016/s0169-023x(97)00056-6
williams l, cockburn a (2003) agile software development: it's about feedback and change. ieee comput 36:39–43
wynne m, hellesoy a, tooke s (2017) the cucumber book: behaviour-driven development for testers and developers. pragmatic bookshelf
journal of software engineering research and development, 2023, 11:6, doi: 10.5753/jserd.2023.2646  this work is licensed under a creative commons attribution 4.0 international license.

insights from the application of exploratory tests in the daily life of distributed teams: an experience report
jarbele c. s. coutinho [ federal rural university of the semi-arid | jarbele.coutinho@ufersa.edu.br ]
wilkerson l. andrade [ federal university of campina grande | wilkerson@computacao.ufcg.edu.br ]
patrícia d. l. machado [ federal university of campina grande | patricia@computacao.ufcg.edu.br ]

abstract
the exploratory testing (et) approach has been adopted in the context of agile development due to the effectiveness of its application. given these benefits, the need arose to train agile professionals in the practical application of this type of test, to contribute to its incorporation into the daily work of teams. in this sense, the objective of this article is to investigate the contributions and limitations of adopting problem-based learning (pbl) and just-in-time teaching (jitt) in et teaching and learning, and the main aspects that favor or limit the incorporation of et into the day-to-day work of agile teams. for this, we conducted a course in remote teaching format with agile professionals from a geographically distributed software development company. at the end of the course, data were collected through an online questionnaire and examined with quantitative and qualitative analysis. then, the et activities performed by the participants in their daily work were monitored and a brainstorming session was conducted to evaluate this experience. our main findings are that (1) the collaboration between participants and the adoption of a real problem, (2) activities and resources made available before the class, and (3) the existence of specific tool support for et sessions optimized learning in the context of remote teaching. other main results refer to the planning and registration of et and the need for guidelines to guide its execution. therefore, integrating theory and practice in et is necessary for a better understanding of the effects of tests in the agile environment. additionally, it is necessary to investigate specific approaches and tools that contribute to the execution of et and, consequently, to the incorporation of this test into the daily work of teams.
keywords: software testing, exploratory testing, testing education, testing learning and teaching, active learning, jitt, just-in-time teaching, pbl, problem based learning

1 introduction
aligning theory and practice in the teaching of software engineering (se) is a persistent challenge, both in the academic context and in industry (leite et al., 2020).
providing and stimulating experiences that contribute to the technical and non-technical training of students and professionals in this area requires actions to plan the curriculum and curricular components, articulate new teaching methodologies, and include innovative pedagogical elements (cheiran et al., 2017). in this context, the teaching of software testing (st) also stands out. for cheiran et al. (2017), st is one of the areas of se that presents challenges for teaching, as it may be difficult and inefficient to teach st through expository lectures alone. additionally, the simplicity of the criteria is a factor that makes it possible for st contents to be part of non-specific subjects, such as se (paschoal and souza, 2018). moreover, st contents may be part of the training provided by companies when their employees do not know a given st practice or technique. among the existing st practices, we have exploratory testing (et). et emphasizes the responsibility and freedom of the tester to explore the system, allowing the tester to acquire knowledge of the program in parallel with the execution of the tests (costa et al., 2019; hendrickson, 2013; bach, 2003; whittaker, 2009), as there is no scripted planning or definition of test cases in test plans (hendrickson, 2013). for bach (2003), et is learning, test design, and test execution performed simultaneously. as a way to meet the need for management and measurement of et, bach (2003) proposed (1) dividing the testing activities into sessions, which would be the basic unit of work, (2) stipulating a mission for each session, and (3) adopting time metrics related to testing activities (castro, 2018), thus giving rise to the session-based test management (sbtm) approach. although the problem associated with st teaching is being discussed with greater visibility by the academic and scientific community (paschoal and de souza, 2018; garousi et al., 2017, 2020; scatalon et al., 2019; aniche et al., 2019) and is producing more specific developments (cheiran et al., 2017; de andrade et al., 2019; martinez, 2018; coutinho and bezerra, 2018; paschoal and souza, 2018; paschoal et al., 2017; queiroz et al., 2019), few studies investigate the possibilities of streamlining the teaching and application of et in practice (costa et al., 2019; ferreira costa and oliveira, 2020). adopting more dynamic strategies that bring theory and practice closer together to provide academic-professional training in the real scenario of the software industry is not a trivial task, especially when this experience is conducted with geographically distributed teams that work in an agile environment. when conducting experiences like this, some challenges emerge, such as (1) integrating the team that works in a cross-functional way, due to the adoption of agile practices; (2) creating conditions for the flow of knowledge to develop, considering the different ways in which people assimilate information; and (3) dealing with contextual challenges, such as communication, time, internet connection, among others. there is a need to investigate ways to conduct et teaching for agile teams working with distributed software development (dsd).
therefore, our research question is: how to encourage practical et learning with geographically distributed agile teams, seeking integration among members, and promoting active learning, in order to encourage their insertion in the daily work? in these circumstances, learning in a participatory way, from real problems and situations, can contribute to the evolution of learning. problem-based learning (pbl) is an active learning approach (bonwell and eison, 1991; mcconnell, 1996) that, through problem-solving, enables students to live experiences that portray the reality of the professional context in the academic environment (cheiran et al., 2017) and aims to encourage the collaborative resolution of challenges through research, reflection and the development of solutions. in an associated way, just-in-time teaching (jitt) (novak, 2011) also aims to contribute to student learning. based on activities carried out before class, jitt encourages the development of students' prior knowledge (novak, 2011; martinez, 2018) so that discussions about a given content can be further developed during class. this study aims to investigate the contributions and limitations of adopting pbl and jitt in et teaching and learning with agile dsd teams in a remote learning context, in order to encourage the incorporation of et practices into the daily lives of these teams. thus, it is expected to contribute to the mitigation of the main challenges mentioned above, faced in the execution of courses conducted in a dsd context, and to encourage the adoption of et in the st practices developed by agile teams. for this, we carried out an et course with agile professionals from a geographically distributed software development company. at the end of the course, data were collected through an online questionnaire and examined. then, the et activities performed by the participants in their daily work were monitored and a brainstorming session was held to evaluate this experience. it is worth noting that this paper is an extended version of the award-winning paper "teaching exploratory tests through pbl and jitt: an experience report in a context of distributed teams", published in the proceedings of the 35th brazilian symposium on software engineering (sbes 2021), education track (coutinho et al., 2021). in addition to this introductory section, this paper is structured as follows: section 2 discusses an overview of st teaching. section 3 describes the methodological procedure used in this study. section 4 presents the results obtained, in response to the defined research questions. section 5 discusses the perspectives, challenges, and limitations of this study, based on the results obtained. section 6 discusses the threats to the validity of this study. section 7 exposes the analysis of some related works. finally, section 8 presents the final considerations and perspectives for future work.

2 background
in this section, we discuss important aspects related to the teaching of software testing and exploratory testing. we then present and discuss some relevant concepts about two main approaches to active methodologies, pbl and jitt.

2.1 teaching software testing
st is an essential activity to guarantee the quality of software.
seeking to meet the need for teaching methods that make the learning of this activity more effective, some studies have been dedicated to investigating systematic approaches to contribute to teaching in this area of se (paschoal and de souza, 2018; garousi et al., 2017, 2020; scatalon et al., 2019; aniche et al., 2019). one of the most significant difficulties in teaching st is the need to apply the process in practice (paschoal and de souza, 2018; coutinho and bezerra, 2018). at university, the teaching of st is sometimes distributed across disciplines in the se area and does not provide an opportunity for st to be learned in depth. this aspect causes students to graduate with deficiencies in software testing skills (scatalon et al., 2019). on the other hand, the industry needs professionals with more solid education and training in testing. in practice, testing professionals (test analysts, test engineers, or testers) have been looking for options to improve the effectiveness and efficiency of testing (garousi et al., 2017), both to perform a more effective job and to find better positions in their professional careers. thus, university graduates and se professionals self-learn (self-train) st through books or online resources or by participating in industry training and obtaining certification in the st area (garousi et al., 2020), such as those provided by the international software testing qualifications board (istqb), for example.

2.2 exploratory testing
one type of testing that has become widespread in the agile environment is et. in this method, test professionals can interact with the system the way they want and explore, without restriction, its functionality (suranto, 2015). in layman's terms, it can be said that et allows professionals to learn quickly, adjust their tests, and, in the process, encounter software problems that are often not anticipated in test plans or scripts. for bach (2003), et is the learning, design, and execution of tests performed simultaneously. thus, the test professional adapts to the system being tested, creating and improving the tests based on the knowledge acquired during the exploration of the system, without the aid of instructions about the system (castro, 2018). in et, test design and execution are performed at the same time (whittaker, 2009). however, we can perceive some disadvantages in the application of this test. for instance, the lack of preparation, structure, and guidance can lead to many unproductive hours (suranto, 2015). also, the same functionality may be tested more than once while others are not tested (castro, 2018), especially when multiple testers or test teams are involved. moreover, it can be difficult to track the progress of testing professionals (suranto, 2015; castro, 2018); among others. to overcome some of these disadvantages and as a way of meeting the need for et management and measurement, bach (2003) proposed (1) dividing the testing activities into sessions, which would be the basic unit of work, (2) stipulating a mission for each session and (3) adopting time metrics related to testing activities, originating the sbtm strategy. the sbtm strategy is used to make et more effective and its goals clearer (castro, 2018). for these reasons, too, et has gained greater popularity in the agile industry (suranto, 2015; raappana et al., 2016; garousi et al., 2017), requiring testing professionals to have some knowledge, experience, and skills with et. thus, although garousi et al.
(2020) highlight that most courses provide little training on et, they also recommend more et coverage in st education. ghazi (2017) highlights that an et session should start with a document, called a charter, which succinctly describes the mission. the purpose is to ensure that the tester remains focused only on executing the session described in the charter. some guidelines are indicated for defining the mission in the charter: (i) the mission must be neither too specific nor too generic; (ii) the mission determines what is to be tested (not how the test is to be carried out); (iii) at the end of the et session, new ideas, opportunities or problems found by the tester can be used to create new missions; (iv) after completion of the mission, it is important to have an evaluation of the session in order to discuss the results found. for hendrickson (2013), the mission format should be based on the following premise: define the mission and what should be explored. the mission of an et can be defined with the estimation of test points. a test point is related to each test job performed in the et mission. each mission can contain one or several test points that must be investigated during the time of the et session. it is important to note that the test point list is dynamic, that is, new points can be added based on errors found and corrections (ghazi, 2017); and they must be tested according to risk (high, medium or low), with the points most at risk tested first.
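as a minimal sketch of these ideas, the fragment below models a charter with a mission and a risk-ordered list of test points. the structure, field names and the 90-minute time-box are our own illustrative assumptions, not a prescription from sbtm or from the authors cited above.

```python
from dataclasses import dataclass, field

RISK_ORDER = {"high": 0, "medium": 1, "low": 2}

@dataclass
class TestPoint:
    description: str
    risk: str = "medium"  # high, medium or low

@dataclass
class Charter:
    mission: str                # what is to be tested, not how
    duration_minutes: int = 90  # assumed time-box for the session
    points: list = field(default_factory=list)

    def add_point(self, description: str, risk: str = "medium"):
        # the test point list is dynamic: new points may be added
        # as errors and corrections are found during the session
        self.points.append(TestPoint(description, risk))

    def ordered_points(self):
        # points most at risk are investigated first
        return sorted(self.points, key=lambda p: RISK_ORDER[p.risk])

charter = Charter(mission="explore the checkout flow for payment failures")
charter.add_point("expired credit card", "high")
charter.add_point("empty shopping cart", "low")
for p in charter.ordered_points():
    print(p.risk, "-", p.description)
```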
2.3 active learning

as a way to streamline teaching and offer students differentiated strategies that lead to effective learning, active methodologies emerge as an alternative proposal to traditional teaching-learning approaches (bonwell and eison, 1991; mcconnell, 1996). currently, active methodologies are being adopted in teaching-learning in different areas of knowledge as a way to improve current techniques and involve students in this process (paiva et al., 2016), not limiting their learning to class time. active learning is characterized by stimulating students' autonomy and continuous participation in the learning process (bonwell and eison, 1991), through different teaching approaches such as problem-based learning (pbl), team-based learning (tbl), the flipped classroom, and just-in-time teaching (jitt), among others. some other trends in active methodologies have emerged, such as peer instruction (pi) (crouch and mazur, 2001), design thinking (brown and katz, 2011), storytelling (andrews et al., 2009), and maker culture (milne et al., 2014), for example. among these modalities of active methodologies, pbl and jitt were adopted in a complementary way during the exploratory testing (et) course. as the course was conducted remotely, pbl contributed to initiating and motivating participants to learn through real-life problems and to encouraging group work skills and autonomous learning (bonwell and eison, 1991; coutinho and bezerra, 2018; de andrade et al., 2019). jitt fostered active participation in different activities before and during classes, encouraging participants to read the material and perform online tasks. for these reasons, these active methodology modalities were selected and applied in this study. next, we describe pbl and jitt separately.

2.3.1 problem based learning (pbl)

pbl is a teaching method characterized by the use of problems to initiate and motivate the learning of concepts and to promote the skills and attitudes necessary for their solution (figuerêdo et al., 2011). in addition, pbl also aims to include the acquisition of an integrated and structured knowledge base around real-life problems, as well as promoting group work skills and autonomous learning (figuerêdo et al., 2011; de andrade et al., 2019; cheiran et al., 2017), through collaboration and ethics. pbl is considered a methodology strongly oriented to processes and accompanied by instruments that can assess its effectiveness (figuerêdo et al., 2011). therefore, the practical immersion promoted by pbl requires a teaching plan. this plan includes well-defined learning objectives, the structuring of a practical environment, the determination of roles for the subjects involved (teacher and student), and result evaluation strategies (figuerêdo et al., 2011; cheiran et al., 2017). in summary, pbl starts with the proposition of a problem and ends with the resolution of this problem. for this, some steps are indicated: (1) clarify terms that are difficult to understand; (2) list the problem(s); (3) discuss the problem(s); (4) summarize the discussion; (5) formulate learning objectives based on the problems; (6) seek information; and (7) integrate the information gathered to resolve the case. to carry out the pbl steps, a group of 10 to 12 students is indicated (with a coordinator and a secretary), along with a tutor and the definition of a script containing a description of the problem and, if necessary, a recommended bibliography or support material. pbl is suggested as a teaching-learning practice when there is a need to encourage the participation of students or professionals in the learning process, placing them as protagonists in this process and consequently removing them from the condition of mere receivers of knowledge.

2.3.2 just-in-time teaching (jitt)

jitt is a pedagogical strategy developed by novak (2011), whose essence is to connect activities inside and outside the class through warm-ups (martinez, 2018). in this approach, students are encouraged to read material about the content of the class and complete a small online task a few hours before the class takes place (martinez, 2018). this activity allows the teacher to plan the next class, or make considerations in class, according to the students' expectations or doubts (their answers). jitt also aims to encourage students to participate actively in different classroom activities, through greater control over their learning, motivation, and engagement (novak, 2011). with jitt, class time is used more effectively because less time is spent on material that students have learned from reading, and more time is spent on more difficult subjects (martinez, 2018). in summary, the development of jitt encompasses three basic stages, centered on the student: (1) a warm-up exercise, in which the student is encouraged to read support materials and answer conceptual questions; from these answers, the teacher prepares the class; (2) class discussions on the reading tasks (rt), through the re-presentation of questions and (some) answers from some students, maintaining anonymity; and (3) group activities involving the concepts worked on in the rt and in the class discussion, which can be expository, fixation exercises, among others. jitt is indicated when one wants to stimulate, in the student or professional, the construction of prior knowledge about the content that will be discussed in class, and also to create the habit of studying before class.
other benefits involve oral and written communication skills and the maximization of the effectiveness of class time, among others. jitt is mainly suggested for the execution of short courses or for content taught in a short class time.

3 methodology

this research examines the contributions of the use of the active methodologies pbl and jitt, used in association, to assist in the teaching-learning process of et, during the application of a course conducted remotely with members of geographically distributed agile teams. the research is classified as an experience report (wohlin et al., 2012), as it precisely describes the planning (section 3.1.1), the execution (section 3.1.2), and the analysis procedures (section 3.1.3), as a way to contribute relevant considerations to the st teaching area, as well as to allow the replication of this experience in other se teaching contexts. in order to learn more about et, a bibliographic research was initially conducted to understand the main approaches and tools that have been used to support the practice of et in the agile environment. this study culminated in an et course, aimed at agile professionals, to validate the practical application of the sbtm approach. to present the development phases of the experiment conducted, figure 1 illustrates the activities developed, from planning to evaluation.

3.1 study design

3.1.1 planning

the goals of this experience were defined following the guidelines of the goal question metric (gqm) paradigm. thus, we seek to analyze the pbl and jitt approaches in the teaching of exploratory testing, with the purpose of understanding their contributions, with respect to the collaboration and integration between participants of a remotely conducted course, from the researchers' point of view, in the context of geographically distributed agile teams. to achieve this goal and conduct this research, we defined the following research question (rq): "how to encourage practical et learning with geographically distributed agile teams, seeking integration among members, and promoting active learning, in order to encourage its insertion in the daily work?". thus, the rq aims to identify the main contributions and limitations of the implementation of the active learning modality pbl, associated with jitt, in an et course in remote format, regarding content learning, integration, collaboration between participants, practical activities, and other aspects inherent in solving problems based on real scenarios. to answer this rq, a course on et was planned and executed (see section 3.1.2) with an agile dsd team. at the end, an online questionnaire was applied to collect the participants' feedback on the adopted teaching-learning methodology. as shown in figure 1, the planning phase of the et course consisted of four well-defined steps, described below.

step 1. define the course plan. in this stage, we defined the course syllabus, the number of hours to be taught, the date of the course, the target audience, the objectives to achieve, the materials needed, and the classes to be produced in a detailed manner according to the adopted methodology. it is important to remark that the definition of this course plan was widely discussed, reviewed, and evaluated by two specialists in the st field.
moreover, we defined the tools to be used in the course, considering the context of remote learning, as follows: google meet, for video communication during classes; discord, for communication between participants during practical activities; google drive, for storing and sharing class materials and resources (documents, spreadsheets, and presentations); google forms, for the elaboration and availability of the evaluation questionnaire after the course; and the xray exploratory app (https://www.getxray.app/exploratory-testing-app/), for et planning and execution, only in the last practical activity. it is important to highlight the contributions of the xray exploratory app tool: (i) it has desktop and mobile versions; (ii) it is possible to integrate it with jira software, although it was not possible to apply this integration in this study; (iii) it assists in bug detection, while et sessions are recorded in video, audio, and/or screen capture format; (iv) et sessions are detailed and executed directly in the tool; (v) when closing the session, a report is automatically attached to the test run; this strategy provides quick feedback to testers. in summary, the xray exploratory app assists in the test report and in the documentation produced.

step 2. develop class materials. the classes in this course are intended to train participants on the subject of et in the agile context and to balance the level of knowledge among all agile professionals participating in the course. in this context, the content covered in the class materials was based on bach (2003); castro (2018); hendrickson (2013); whittaker (2009); crispin and gregory (2009); ghazi (2017), and on current lectures conducted by renowned experts in the field of et.

figure 1. development phases of this experience.

table 1. relation of questions addressed in the questionnaire with the purpose of each section.
section i | goal: identify the profile of the professional participating in this research and their experience in the st area. | questions: 01 to 05. | format: all questions are objective.
section ii | goal: identify the organizational procedures and practices in relation to the practice of st in the sprints of the projects developed by the agile teams, before the course is offered. | questions: 06 to 12. | format: all questions are objective, except question 12, which is subjective; questions 09 and 10 follow the response format based on the likert scale.
section iii | goal: identify the participants' perceptions about the teaching-learning obtained with the et course. | questions: 13 to 22. | format: all questions follow the response format based on the likert scale, except for question 22, which is objective.
section iv | goal: identify the contributions of the pbl and jitt approach, used in an associated way, in the et course. | questions: 23 to 41. | format: all questions follow the response format based on the likert scale, except for question 41, which is subjective.
it is important to highlight that (1) lecture notes (slides) with theoretical content and practical examples on et were prepared, and (2) a handout with a detailed synthesis of the content covered in the course was written; in addition, we selected (3) a list of tools that support et planning and execution, (4) a list of videos (tutorials and lectures) available on the web, and (5) a list of technical articles and books on et in the agile context. the class material adopted in the et course can be accessed at https://bityli.com/puehugfn.

step 3. develop practical activities. to exercise and reinforce learning about the content taught in each module of the course, examples and practical activities were prepared, based on the guidelines provided by the pbl and jitt methodologies. at this stage, the materials and resources needed to carry out these activities were defined and elaborated, for example: the selection of the web system to be tested; a guide with basic guidelines for each practical activity; templates of the test artifacts (such as charters, test points, and session reports) to optimize the time devoted to each activity; requirements artifacts (such as a system requirements specification document and a use case diagram); and an installation manual for the xray exploratory app. some of these materials and resources needed to be improved during the course, to meet the doubts and needs of the students, diagnosed in advance (i.e., before the class) through the application of the jitt methodology. it is important to highlight that it was possible to follow all the stages of pbl and jitt in full (figuerêdo et al., 2011; novak, 2011), even though the course was carried out in a remote teaching format.

step 4. elaborate the evaluation questionnaire. to collect information about the experience and learning of the participants, an online questionnaire was created (available at https://cutt.ly/ym5vek2), with objective and subjective questions. a total of 41 (forty-one) questions were included, distributed between 39 (thirty-nine) multiple-choice questions and 02 (two) open questions, whose answers were optional for the participant. the questionnaire was designed in google forms and organized into four sections. the first section aimed to briefly characterize the professional profile of the respondents. the second section sought to identify the organizational procedures and practices regarding the st practice in the sprints of the projects developed by the agile teams before the course was offered.
the third section sought to identify the respondents' perceptions about the teaching-learning obtained during the et course. finally, the fourth section aimed to identify the contributions of pbl and jitt in conducting the et course. table 1 relates the questions addressed in the questionnaire to the goal of each section (see section 3.1.1). it is important to highlight that, to answer the questionnaire, participants should: (1) have participated in all modules of the course; (2) have carried out the practical activities developed in each module; and (3) right at the beginning of the questionnaire, have agreed to a free and informed consent form (ficf) for the research. table 2 presents the structure of the course, together with the description of the topics and contents covered in the syllabus and the practical activities planned for the end of each class module. additionally, the workload defined for each module of the course is informed.

table 2. structure of the exploratory testing course.
module i | introduction (workload: 02h) | contents: 1.1 what is et? 1.1.1 et characteristics; 1.2 what is not et? 1.2.1 randomness and ad hoc testing; 1.2.2 scripted tests; 1.3 when to use et? | practical activity pa1, goal: understand the product, create hypotheses, and plan test scenarios.
module ii | et in practice (workload: 04h) | contents: 2.1 et heuristics; 2.2 et planning; 2.3 writing et cases: charters; 2.4 introduction to sbtm; 2.5 running tests based on sessions; 2.6 evaluation of a session | practical activity pa2, goal: investigate heuristics, run tests, and log failures; practical activity pa3, goal: apply task breakdown structure (tbs) metrics.
module iii | a little more about et (workload: 02h) | contents: 3.1 problems, challenges, solutions; 3.2 et good practices; 3.3 et support tools | practical activity pa4, goal: practice using the xray exploratory app tool through the execution of an et session.

3.1.2 execution

the population of this study included twelve professionals from the software development industry who work with agile methodologies in the same organization. currently, these professionals work in geographically distributed locations, due to the coronavirus disease (covid-19) pandemic, caused by the sars-cov-2 virus, which has affected the brazilian population since february 2020. for this reason, too, the et course was conducted in a completely remote teaching context. it is important to highlight that 50% of the course participants had already performed et, even without knowing the definition of the practice in detail. in general, the execution of the experience took place as internal training with agile teams of that organization and, as planned, in four virtual meetings of 02 hours each, on 06, 07, 12, and 13 april 2021. it is important to highlight that module ii was divided into two meetings, due to the extent of the content taught. at each meeting, the content was taught and participants were able to ask questions and resolve their doubts throughout the class. then, to exemplify the discussed theory, a demonstration was made with real examples. next, participants were instructed to exercise the knowledge obtained through a practical activity based on a real web system. for this, some guidance on the activity was provided. participants were distributed in teams and encouraged to interact and collaborate, through the dynamics of each activity. the resolution of a real problem also sought to encourage participants to research, reflect, and develop et relevant to the context analyzed in the activity. this strategy was based on the guidelines provided by pbl. at the end of each meeting, class materials and resources were made available to participants so that they had prior knowledge of the next content to be discussed in the course. this strategy, based on jitt, sought to encourage interaction between the teacher and the course participants, in addition to enabling more in-depth discussions during the class and anticipating feedback on the materials and resources adopted for the next meeting. to solve the proposed problem, the participants were monitored for approximately 40 to 50 minutes. the time was stipulated according to the complexity of the activity proposed in each module.
for that, the activities were defined in order to build knowledge about the execution of et sessions, and each activity involved a practice related to the content studied in the respective course module (see table 2). in each activity, a set of practices was defined to serve as guidance for the execution of the et sessions (see the course material). at the end of the course, participants were instructed to fill out an online questionnaire, whose purpose was to collect information about the experience and learning about et through the pbl and jitt practices.

3.1.3 analysis procedures

after data collection through the online questionnaire, individual reports were generated according to the objective of each section investigated in the questionnaire. it is worth noting that the information in these reports was anonymized to preserve the identity of the participants. thus, to analyze the data extracted from the content of the responses provided by the participants, a quantitative analysis was conducted (wohlin et al., 2012), mainly on the responses provided through the likert scale, with options from 1 to 5 (being: 1 totally disagree; 2 partially disagree; 3 neither agree nor disagree; 4 partially agree; 5 totally agree). in this sense, the answers were analyzed by class: disagreement, indecision, and agreement. additionally, a qualitative analysis was conducted on the answers to the subjective questions (in total, only two: questions 12 and 41), but as they were optional or complementary to the objective questions, there was little need to apply this type of analysis. thus, when necessary, we synthesized and analyzed the responses using the open, axial, and selective coding oriented by grounded theory (corbin and strauss, 2014).
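as an illustration of this grouping, the sketch below collapses 1-5 likert answers into the three analysis classes and reports percentages. it is a minimal sketch of the procedure described above; the function name and the sample answers are invented for illustration and are not data from the study.

```python
from collections import Counter

# likert options: 1 totally disagree ... 5 totally agree
CLASSES = {1: "disagreement", 2: "disagreement",
           3: "indecision",
           4: "agreement", 5: "agreement"}


def classify(responses: list[int]) -> dict[str, float]:
    """group 1-5 likert answers into the three analysis classes
    (disagreement, indecision, agreement) and return percentages."""
    counts = Counter(CLASSES[r] for r in responses)
    total = len(responses)
    return {c: round(100 * counts.get(c, 0) / total, 1)
            for c in ("disagreement", "indecision", "agreement")}


# usage with made-up answers from 12 participants (illustrative only)
print(classify([1, 2, 2, 3, 4, 4, 4, 4, 5, 5, 5, 5]))
# {'disagreement': 25.0, 'indecision': 8.3, 'agreement': 66.7}
```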
3.2 checking the use of et in practice

after the execution of the et course, the activities performed by the participants were monitored during their daily work with agile development. based on the guidelines learned in the course, et sessions were planned and executed. then, a brainstorming session was conducted with the professionals to understand the real advantages and difficulties experienced in this context.

3.2.1 brainstorming planning

brainstorming is a technique used in groups to generate innovative ideas or insights into a particular topic (bonnardel and didier, 2020). overall, brainstorming should (i) generate as many ideas as possible, (ii) extend the interpretation of ideas, (iii) present original ideas, and (iv) combine and improve existing ideas. to conduct the brainstorming, the main question, that is, the problem to be solved, and a set of activities to be followed were defined. the following question was defined: "what to do to be able to integrate exploratory tests, as a test practice, in the team's daily life?". from this main question, other specific questions were presented to guide and contribute to the generation of ideas (see table 3).

table 3. brainstorming specific questions.
1. how did the session-based testing strategy get in the way of et execution?
2. does something prevent et from being routinely practiced by the team? what actually prevents it? (process, tool, team, project, time, etc.)
3. what can we do to improve the execution of et?
4. which requirements strategy or artifact is most useful to assist in the realization of et?
5. what can be done to make these requirements clearer to the team?
6. what information is important to record/plan before performing the et, in addition to what was indicated?
7. what information is important to record during the execution of the et, in addition to what was indicated?
8. what information is important to record after performing the et, in addition to what was indicated?
9. which practices were most interesting?
10. what practices did you not find interesting?
11. what benefits for the team's day-to-day activities were observed in the course?
12. in light of what was learned, what was the most difficult thing to implement on a day-to-day basis?
13. has anything changed in the team's testing practice after the course? what has changed?
14. what do you see that would change in test practice after the course?
15. did the et course influence the incorporation of testing practices? what really influenced you?
16. is et useful as a testing practice in the context of remote work? what could be incorporated to contribute to remote work?

regarding the set of activities followed in the brainstorming, the following were planned (see figure 2):
1. activity 1. brainstorming in silence. this activity consists of generating ideas, individually, to try to solve the presented problem. thus, participants must write their ideas on post-its.
2. activity 2. sharing ideas. this activity consists of presenting the ideas that were generated and transcribed on the post-its. other participants are allowed to ask questions or add any new information or ideas.
3. activity 3. filtering ideas. the objective of this activity is to discard ideas that are not aligned with the context of the problem or that generate disagreements.
4. activity 4. first vote. in this activity, all participants must select the ideas that best solve the exposed problem. only the 6 most-voted ideas are listed for the next activity.
5. activity 5. improvement of ideas. the objective of this activity is to improve the most-voted ideas, adding important new information through more post-its, with details of artifacts, testing activities, documentation, platforms or et tools, and team organization, among others.
6. activity 6. second vote. finally, the participants vote for the second time on the most applicable idea to solve the presented problem.

figure 2. activities performed in brainstorming.
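the funnel character of activities 3 to 6 can be summarized in code. the sketch below is only an illustration under our own assumptions (the function name, the alignment filter, and the vote lists are invented); in the study, the filtering, improvement, and voting were done by the participants in discussion, not by a program.

```python
from collections import Counter


def brainstorm_funnel(ideas, is_aligned, first_votes, second_votes):
    """illustrates activities 3-6: filter out misaligned ideas, keep the
    6 most-voted ones, then pick a single winner in a second vote."""
    kept = [idea for idea in ideas if is_aligned(idea)]         # activity 3
    tally = Counter(v for v in first_votes if v in kept)        # activity 4
    shortlist = [idea for idea, _ in tally.most_common(6)]
    # activity 5 (improving the shortlisted ideas) happens in discussion
    final = Counter(v for v in second_votes if v in shortlist)  # activity 6
    return shortlist, final.most_common(1)[0][0]


# invented example: four ideas, one filtered out, two voting rounds
ideas = ["lightweight et process", "record sessions on video",
         "skip all planning", "templates with field examples"]
shortlist, winner = brainstorm_funnel(
    ideas,
    is_aligned=lambda idea: idea != "skip all planning",
    first_votes=["lightweight et process"] * 3
                + ["templates with field examples"] * 2
                + ["record sessions on video"],
    second_votes=["lightweight et process"] * 4
                 + ["templates with field examples"] * 2,
)
print(winner)  # lightweight et process
```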
3.2.2 execution of brainstorming

after the execution of the et course, the participants were led to apply et sessions in the projects they develop. in total, nine et sessions were held, divided into two specific moments of the project. then the brainstorming took place. the brainstorming happened in a completely remote context, with the participants in different locations. for this reason, the online tool lucidspark (https://lucidspark.com/pt) was adopted to facilitate the transcription of ideas and the collaboration between physically distant participants. after some initial orientations, the participants were led through the brainstorming activities. in total, the brainstorming lasted seventy-nine (79) minutes, longer than anticipated in the initial planning. figure 3 illustrates the execution of the brainstorming post-its in the lucidspark online tool. the results are discussed in section 4.2.

figure 3. execution of brainstorming.

4 results

after the experience was carried out, data were collected and analyzed. in total, the information provided by the twelve course participants was considered, as they all agreed to participate in this study, followed the discussions during classes, and performed all the practical activities provided at the end of each module. these results are discussed in section 4.1. then, the results of the brainstorming carried out with the participants, after the insertion of et with the agile teams, are discussed in section 4.2.

4.1 results of the experiment

next, sections 4.1.1 to 4.1.4 present the characterization of the participants, the most common agile st practices adopted by the participants before the course, the perception of et after the course, and the main contributions of pbl and jitt to the teaching-learning of et.

4.1.1 characterization of participants

initially, to characterize the participants' professional profiles, an analysis was made regarding each team member's attributions and professional experience. considering that the composition of agile teams is multidisciplinary, that is, each team member can perform different functions during the developed software project, we identified different attributions distributed among the team members (see table 4), among them: back-end developer and front-end developer, performed by 50% of the participants each; software engineer, 41.7%; project manager, 25%; database administrator and tester or quality analyst, 16.7% each; and software architect, scrum master, designer, mobile developer, and infrastructure engineer, 8.3% each. other attributions, such as analyst or business leader, analyst or requirements engineer, and product owner (po), among others, were not informed.

table 4. assignments of participants in agile teams.
database administrator: 2 (16.7%)
architect: 1 (8.3%)
back-end developer: 6 (50%)
front-end developer: 6 (50%)
designer or human-computer interaction specialist: 1 (8.3%)
project or product manager: 3 (25%)
scrum master: 1 (8.3%)
software engineer: 5 (41.7%)
tester or quality analyst: 2 (16.7%)
mobile developer: 1 (8.3%)
infrastructure engineer: 1 (8.3%)

regarding the level of academic education of the participants, 58.3% have completed graduation, 33.3% have a stricto sensu post-graduation at the master's level, and only 8.3% are still attending graduation. another factor observed was the professional experience of the participants:
1. working experience in the software industry: 50% of them have worked in this context between 1 and 2 years; 16.7%, between 3 and 5 years; 25%, between 6 and 10 years; and 8.3%, for more than 11 years. none of them reported little experience with software development, that is, less than 1 year of experience in the market.
2. working time with agile methodologies: 50% have worked in this context between 1 and 2 years; 25%, between 3 and 5 years; 16.7%, between 6 and 10 years; and 8.3%, for more than 11 years. none of them reported little experience (less than 1 year) or no experience with agile methodologies.
3. working time with agile st: 50% have performed tests between 1 and 2 years; 16.7%, between 3 and 5 years; and 16.7%, between 6 and 10 years. however, another 16.7% reported not working with testing at all.

4.1.2 common practices in agile st

additionally, to identify how tests are commonly conducted by agile teams, an analysis of the main st organizational practices performed in the sprints of the projects was carried out.
generally, those responsible for testing the software or the software module developed are the back-end developer (41.7%), the front-end developer (25%), the product owner (po) (16.7%), the project or product manager (41.7%), the scrum master (8.3%), the software engineer (16.7%), the tester or quality analyst (25%) and, in some cases, everyone on the team (41.7%). we also identified that tests are usually performed throughout the software lifecycle (41.7%), in some phases with more emphasis, such as during (33.3%) or after coding the software (58.3%), and during (25%) or after the software integration phase (25%). in other phases the test takes place with less intensity, such as during (8.3%) or after the software verification phase (16.7%); during (8.3%) or after the production of software documentation (8.3%); or during (25%) or after the software maintenance phase (16.7%). in agile st, the test types are categorized in quadrants (crispin and gregory, 2009). considering this categorization, we notice that the tests performed most frequently by the participants are unit tests (50%), exploratory tests (50%), component/integration tests (41.7%), functional tests (41.7%), usability tests (41.7%), performance and load tests (41.7%), simulations (33.3%), scenarios (16.7%), user acceptance tests (16.7%), alpha/beta tests (8.3%), and examples (8.3%). to assess the participants' perception of the types of tests performed on their teams, with regard to their professional activities, we posed the following questions:
• question 09: "i believe that the software testing strategies adopted so far, and reported above, have been sufficient to detect bugs in the system".
• question 10: "i believe we need to extend and improve the software testing practices used so far to try to ensure higher quality in whatever product we develop".
these statements contained multiple-choice items, according to the likert scale, which is detailed in section 3.1.3. table 5 presents the results of the answers to questions 09 and 10, using the likert scale choices as follows: (1) totally disagree, (2) partially disagree, (3) neither agree nor disagree, (4) partially agree, and (5) totally agree.

table 5. results of responses to questions 09 and 10 ((1) totally disagree; (2) partially disagree; (3) neither agree nor disagree; (4) partially agree; (5) totally agree).
question 09: (1) 8.3% | (2) 25% | (3) 8.3% | (4) 50% | (5) 8.3%
question 10: (1) 0% | (2) 0% | (3) 0% | (4) 8.3% | (5) 91.7%

regarding question 09, it was possible to observe a predominance of responses in the agreement classes, which may indicate that the participants consider the st strategies used to detect system bugs to be sufficient. in question 10, however, they unanimously agree that the st practices adopted by the teams until then need to be improved and expanded. these results indicate that although the participants consider the agile testing practices adopted by the team to be sufficient, they also perceive the need to add other st practices to try to ensure greater quality in the developed projects. to understand the main problems related to the execution of tests in the projects developed by the participants in the daily work of their teams, we also posed assertions as response options, with multiple-choice items, according to the likert scale. these assertions can be consulted in the evaluation questionnaire (see section 3.1.1) and are described below. the results of the answers obtained can be seen in table 6.
the assertions 'a' to 's', presented below, belong to question 11 of the questionnaire and comprise the participants' perception of the problems encountered in the practice of st. these assertions were represented by letters of the alphabet so as not to be confused with the individual questions or statements of the questionnaire.
• assertive a: "a weak relationship between the client and the project leader".
• assertive b: "a weak relationship between the leader and other team members".
• assertive c: "constantly changing objectives, business process and/or requirements during the sprint".
• assertive d: "lack of collaboration between test analysts and developers (programmers)".
• assertive e: "failure to communicate within the development team (programmers) of the project".
• assertive f: "software requirements are purposely expressed in general terms, omitting specific implementation details".
• assertive g: "hidden, incomplete or inconsistent requirements".
• assertive h: "sprints too short".
• assertive i: "lack of knowledge about software testing practices and techniques".
• assertive j: "lack of training on specific software testing practices and techniques".
• assertive k: "there is no time to test as it should".
• assertive l: "there is no specific professional to run the tests within the team".
• assertive m: "trainings are time-consuming and tiring".
• assertive n: "there is too much effort to plan/design the tests".
• assertive o: "there is an effort to run the tests".
• assertive p: "finding defects during production causes rework, delaying the completion of the sprint".
• assertive q: "use of traditional testing practices in the agile environment does not favor the work developed during the sprint".
• assertive r: "a programmer tests their own code or the development team tests their own project".
• assertive s: "test cases are written only for valid and expected inputs".
the assertions with the most agreement were those referring to the constant change of objectives, business process, and/or requirements during the sprint (assertive c); the existence of hidden, incomplete, or inconsistent requirements (assertive g); the lack of knowledge about software testing practices and techniques (assertive i); the lack of training in specific software testing practices and techniques (assertive j); insufficient time to perform the tests as they should be performed (assertive k); the inexistence of a specific professional, on the team, to perform the tests (assertive l); and the effort to perform the tests (assertive o). most of the problems highlighted are typical of teams that work in an agile context (alliance, 2016) and can explain, for example, the need pointed out by the participants to expand and improve the adopted st practices.

table 6. results of responses on problems in st practice ((1) totally disagree; (2) partially disagree; (3) neither agree nor disagree; (4) partially agree; (5) totally agree).
assertive a: (1) 50% | (2) 25% | (3) 16.7% | (4) 0% | (5) 8.3%
assertive b: (1) 58.3% | (2) 16.7% | (3) 8.3% | (4) 8.3% | (5) 8.3%
assertive c: (1) 16.7% | (2) 8.3% | (3) 8.3% | (4) 50% | (5) 16.7%
assertive d: (1) 58.3% | (2) 0% | (3) 16.7% | (4) 16.7% | (5) 8.3%
assertive e: (1) 58.3% | (2) 16.7% | (3) 0% | (4) 16.7% | (5) 8.3%
assertive f: (1) 25% | (2) 16.7% | (3) 16.7% | (4) 25% | (5) 16.7%
assertive g: (1) 33.3% | (2) 8.3% | (3) 0% | (4) 33.3% | (5) 25%
assertive h: (1) 33.3% | (2) 25% | (3) 16.7% | (4) 16.7% | (5) 8.3%
assertive i: (1) 0% | (2) 16.7% | (3) 25% | (4) 25% | (5) 33.3%
assertive j: (1) 0% | (2) 8.3% | (3) 25% | (4) 33.3% | (5) 33.3%
assertive k: (1) 0% | (2) 25% | (3) 8.3% | (4) 41.7% | (5) 25%
assertive l: (1) 0% | (2) 0% | (3) 8.3% | (4) 25% | (5) 66.7%
assertive m: (1) 25% | (2) 25% | (3) 16.7% | (4) 16.7% | (5) 16.7%
assertive n: (1) 25% | (2) 16.7% | (3) 25% | (4) 8.3% | (5) 25%
assertive o: (1) 25% | (2) 8.3% | (3) 16.7% | (4) 16.7% | (5) 33.3%
assertive p: (1) 50% | (2) 8.3% | (3) 8.3% | (4) 25% | (5) 8.3%
assertive q: (1) 33.3% | (2) 25% | (3) 16.7% | (4) 16.7% | (5) 8.3%
assertive r: (1) 25% | (2) 8.3% | (3) 33.3% | (4) 16.7% | (5) 16.7%
assertive s: (1) 25% | (2) 16.7% | (3) 33.3% | (4) 16.7% | (5) 8.3%

some participants reported how the st process occurs on their teams. figure 4 highlights some of these reports.

figure 4. st process executed in agile team projects, as described by the participants (question 12).

4.1.3 perception of et after the course

we also investigated the learning gained by the participants during the course by analyzing the information collected on some key topics of the et content covered. for this, we posed the following questions to the participants and asked that the answers be given according to multiple-choice options, following the likert scale.
• question 13. "i have come to understand the importance of using heuristics in exploratory testing".
• question 14. "i was able to understand that a list of heuristics to be adopted in exploratory tests helps in deciding how to test the functionality/module/system".
• question 15. "i have come to understand the usefulness and importance of test charters in exploratory tests".
• question 16. "i was able to realize that although it is not necessary to prepare a detailed test plan, simple planning helps with the execution of the exploratory test".
• question 17. "i managed to learn how to plan the exploratory test".
• question 18. "i was able to see that requirements artifacts, even if not very detailed, can contribute significantly to the planning (setup) of the session".
• question 19. "i was able to see that the test artifacts (mission charter, test point, and session report) generated while conducting the sbtm were useful for the execution of the exploratory test".
• question 20. "from the explanation about sbtm i was able to apply this approach with ease in the practical activity".
• question 21. "i was able to understand the importance of the alignment meeting between the team to register possible failures, create possible formal test cases, create new missions, register possible requirements, and register new test points".
the results associated with questions 13 to 21 demonstrate a predominant agreement on the learning of all content and practices taught during the course.
among the questions presented in this evaluation criterion, the following stood out with more emphasis: the importance of simple planning for the execution of the et (question 16); that requirements artifacts, even less detailed ones, can contribute to the session setup (question 18); the importance of the alignment meeting as a strategy to register possible failures, create possible formal test cases, create new missions, register possible requirements, and register new test points (question 21); the usefulness and importance of the et artifacts generated in conducting the sbtm for the execution of the exploratory test (question 19); and the relevance of defining heuristics in et (question 13); among other relevant questions presented in table 7.

table 7. results of answers from questions 13 to 21 ((1) totally disagree; (2) partially disagree; (3) neither agree nor disagree; (4) partially agree; (5) totally agree).
question 13: (1) 0% | (2) 0% | (3) 8.3% | (4) 33.3% | (5) 58.3%
question 14: (1) 0% | (2) 0% | (3) 25% | (4) 25% | (5) 50%
question 15: (1) 0% | (2) 0% | (3) 16.7% | (4) 33.3% | (5) 50%
question 16: (1) 0% | (2) 0% | (3) 0% | (4) 25% | (5) 75%
question 17: (1) 0% | (2) 8.3% | (3) 8.3% | (4) 50% | (5) 33.3%
question 18: (1) 0% | (2) 8.3% | (3) 8.3% | (4) 16.7% | (5) 66.7%
question 19: (1) 0% | (2) 0% | (3) 25% | (4) 16.7% | (5) 58.3%
question 20: (1) 8.3% | (2) 0% | (3) 25% | (4) 50% | (5) 16.7%
question 21: (1) 0% | (2) 8.3% | (3) 0% | (4) 25% | (5) 66.7%

the performance of the practical activities provided participants with a real experience of challenges common to et. the participants agreed that: little domain knowledge and the necessary qualities of testers in the application of et make it difficult to carry out the tests (91.7%); the absence of an et plan results in the same functionality being tested several times, while an important functionality may not be tested or a serious error may go undetected (91.7%); the lack of a definition of test cases makes it difficult to reproduce the tests performed when necessary, such as in regression tests (75%); an incorrectly interpreted output can lead to defects that may remain in the system or be eventually detected in future tests (5%); as there is no detailed test guide or plan, and no artifacts more complete than the crash report are produced, it is difficult to know what has and has not been tested (50%); and et is not suitable for real-time systems (8.3%).

4.1.4 contributions of the pbl and jitt approach

to identify the contributions of pbl and jitt in the teaching-learning process applied in the et course, an analysis of the characteristics of these methodologies was carried out. in this perspective, a set of eighteen questions (23 to 40) was presented to the participants to be analyzed and answered through multiple-choice options, also following the likert scale. the questions are listed below:
• question 23. "the scenario (web system) worked on in the practical activities represented a real scenario of software development."
• question 24. "the scenario (web system) worked on in the practical activities had a high level of complexity."
• question 25. "practicing the theoretical content with a real web system helped me to better understand the concepts of exploratory testing."
• question 26. "through practical activities with a real web system, the course made it possible to learn, autonomously and independently, the main methods and techniques of exploratory testing."
• question 27. "through practical activities with a real web system, the course made it possible to work collaboratively in groups in order to broaden the discussions in the team about the theory learned."
• question 28.
“through practical activities with a real web system, the course made it possible to work collaboratively in groups in order to deliver the project activities on time.” • question 29. “through practical activities with a real web system, the course made it possible to work collaboratively in groups in order to deliver the project activities with quality.” • question 30. “although physically separated, interacting with the team during practical activities was not difficult.” • question 31. “the use of conversation tools (such as discord) and collaboration (such as google sheets, google drive, and google docs) contributed to the team’s interaction in practical activities, decreasing the physical distance.” • question 32. “i realized that giving my opinion (feedback) about the class regarding the approach adopted, the exposed content, or the supporting material used (slides, pdf, artifacts, videos, etc. ), contributed to the organization of the class and the instructor’s practice in the next class.” • question 33. “i realized that giving my opinion (feedback) about the class regarding the approach adopted, the exposed content, or the supporting material used (slides, pdf, artifacts, videos, etc. ), helped the instructor to focus on the main difficulties that were expressed by the participants.” • question 34. “i realized that giving my opinion (feedback) about the class regarding the approach adopted, the exposed content, or the supporting material used (slides, pdf, artifacts, videos, etc. ), maximized efficiency and class time.” • question 35. “i realized that the practical activities were also aimed at stimulating my oral and written communication, through discussions with the team and elaboration of the test artifacts.” • question 36. “i realized that the practical activities were also aimed at stimulating group work skills, such as distributing the roles of each member, setting goals, understanding objectives, providing collaboration and communication, among other aspects .” • question 37. “i collaborated more with my team in practical activity 2 (investigating heuristics) and 3 (applying tbs metrics) than in practical activity 1 (creating hypotheses and planning test scenarios) because i felt more secure about the web system i was exploring, only in these activities, as i didn’t know the business scenario well before.” • question 38. “i felt more secure in carrying out the practical activities, only after the course instructor provided more specific guidance on the task, as the guidance in the support material (slides) was not clear enough .” • question 39. “i felt more motivated in practical activity 2 (investigating heuristics) and 3 (applying tbs metrics) after the course instructor made the test artifact templates available.” • question 40. “i had problems collaborating on practical activities because i couldn’t understand them.” to more accurately classify the answers given, we grouped the questions correlated to the main practices of active methodologies, in general perceived in questions 23 to 25; pbl, in questions 26 to 29, and 36; and, jitt, in questions 32 to 36. we highlight that questions 36 to 40, characterized both practices common to pbl and jitt. questions 30 and coutinho et al. 2023 31 sought to understand the participants’ perception of the dynamics of the course in the remote setting. table 8 displays the answers given to the questions. table 8. result of responses from questions 23 to 40. 
question 23: (1) 0% | (2) 0% | (3) 8.3% | (4) 16.7% | (5) 75%
question 24: (1) 25% | (2) 25% | (3) 33.3% | (4) 8.3% | (5) 8.3%
question 25: (1) 0% | (2) 8.3% | (3) 0% | (4) 25% | (5) 66.7%
question 26: (1) 8.3% | (2) 16.7% | (3) 0% | (4) 33.3% | (5) 41.7%
question 27: (1) 0% | (2) 0% | (3) 0% | (4) 25% | (5) 75%
question 28: (1) 8.3% | (2) 16.7% | (3) 16.7% | (4) 25% | (5) 33.3%
question 29: (1) 0% | (2) 0% | (3) 16.7% | (4) 16.7% | (5) 66.7%
question 30: (1) 0% | (2) 0% | (3) 8.3% | (4) 25% | (5) 66.7%
question 31: (1) 0% | (2) 0% | (3) 0% | (4) 8.3% | (5) 91.2%
question 32: (1) 0% | (2) 0% | (3) 0% | (4) 16.7% | (5) 83.3%
question 33: (1) 0% | (2) 0% | (3) 8.3% | (4) 25% | (5) 66.7%
question 34: (1) 0% | (2) 0% | (3) 33.3% | (4) 16.7% | (5) 50%
question 35: (1) 0% | (2) 8.3% | (3) 8.3% | (4) 8.3% | (5) 75%
question 36: (1) 0% | (2) 0% | (3) 8.3% | (4) 16.7% | (5) 75%
question 37: (1) 0% | (2) 16.7% | (3) 8.3% | (4) 33.3% | (5) 41.7%
question 38: (1) 0% | (2) 8.3% | (3) 16.7% | (4) 25% | (5) 50%
question 39: (1) 0% | (2) 0% | (3) 8.3% | (4) 33.3% | (5) 41.7%
question 40: (1) 25% | (2) 33.3% | (3) 0% | (4) 25% | (5) 16.7%

regarding the general practices guided by active methodologies, we investigated the participants' perception of the inclusion of real practical examples in the activities carried out in the course. it was possible to observe a predominance of agreement in the answers to questions 23 and 25, which refer, respectively, to "the scenario (web system) worked on in practical activities represented a real scenario of software development" and "practicing the theoretical content with a real web system helped to better understand the concepts of et". however, in question 24 we identified a majority of disagreement with the "high level of complexity of the scenario worked on in practical activities". the answers provided in questions 26 to 29 and 36, related to the pbl practices, which deal with the inclusion of real problems as practical activities in the teaching of content, provide evidence of the efficiency of this methodology, especially regarding: learning autonomously and independently about et (question 26); collaborative group work to expand team discussions on the theory learned (question 27) and to deliver project activities on time (question 28) and with quality (question 29); and the encouragement of group work skills, such as distributing roles to each member, setting goals, understanding objectives, providing collaboration and communication, among other aspects (question 36). about statements 32 to 36, which deal with practices common to jitt, the results provided by the participants express a majority of agreement on the contributions of the feedback given on the class based on prior access to its contents and materials. thus, from the point of view of the participants, this strategy: contributed to the organization of the class and the instructor's practice in the next class (question 32); helped the instructor to focus on the main difficulties that were expressed by the participants (question 33); and maximized efficiency and class time (question 34). other jitt characteristics observed in the questions, which also showed a predominance of agreement, were related to the objectives of the practical activities, namely: the encouragement of oral and written communication, through discussions with the team and the preparation of test artifacts (question 35), and the encouragement of group work skills (question 36). in addition, we investigated how the collaboration within the team in the practical activities stimulated the participants' learning.
agreement prevailed in questions 37, 38, and 39, which referred, respectively, to: collaborating more in the final practical activities than in the initial ones, as the participants were by then better adapted to the business scenario provided as a real example in the activity; feeling secure in performing the activities after more specific instructions from the instructor; and feeling motivated after the instructor provided templates for the test artifacts. a positive aspect was the predominant disagreement with statement 40: a large part of the participants disagreed that they had "problems in collaborating in practical activities because they could not understand them". this result can be explained by the aspects already confirmed in statements 37 to 39. finally, the benefits and difficulties of participating in the theoretical and practical activities of the course were investigated, given its implementation in a completely remote teaching context. table 9 presents the main testimonies of the participants regarding the perceived benefits and difficulties. according to the statements reported in the responses, we identified that the content, the main approaches, and the et tools were not known by some of the participants, nor was the usefulness of this test in agile methodologies; these aspects were pointed out as a benefit of the course for the work developed by the teams. regarding the reported difficulties, we found that the practices could have been conducted with products developed by the teams themselves, as a way to facilitate the understanding of the business scenario, and the course load was also considered short for the extent of the content and practices developed.

table 9. benefits and difficulties of participating in an et course in remote format.
benefits:
participant b: "the et scope and planning to deliver quality software."
participant h: "learn how to document the execution of exploratory tests."
participant i: "learning about the topic. although the team, which also works together on development, somewhat adopted what was proposed in the classes, it was clear how we could improve."
participant j: "the theoretical content and practical activities were of great importance for a more solid understanding of the et...
the requirements document should also be detailed enough to enable the planning of the et by the responsible team ... in addition, the et technique presented in the training can contribute a lot to the quality of the developed artifacts."
participant k: "through the course, i had my first contact with et, and participating in it made me learn a lot... what was seen seemed very attractive for the context of agile methodology."
participant n: "i found the use of heuristics in the tests interesting, i didn't know about it."
difficulties:
participant b: "not having any type of st course in my graduation."
participant f: "the guidelines of the support material, sometimes it was not clear what should be done, generating doubt in the group."
participant h: "differentiate what information should be placed in each field of the template provided during et planning."
participant l: "it was a little difficult to think of scenarios for an initially unknown system. i think it would be more beneficial if the practice of the course had used a system developed and well-known by the team/class."
participant o: "i had difficulty participating due to the course schedule, as it wasn't my actual work schedule."

4.2 results of et insertion with agile teams

the six activities planned for the brainstorming were conducted. there was no maximum or minimum limit on the number of ideas to be expressed by the participants. thus, participants were encouraged to present their ideas within the time interval defined for each activity (see section 3.2.1). in activity 1, 37 answers were obtained to the specific questions listed in the preparation of the brainstorming. then, in activity 2, some questions were asked in order to clarify doubts related to the exposed ideas. in activity 3, some ideas had to be grouped together and others discarded; a total of 29 ideas that were not aligned with the main brainstorming context were disregarded. it is important to highlight that not all the answers obtained were considered viable ideas to be applied, as some were repeated, complemented each other, or were outside the research context. in activity 4, 06 ideas were voted on to be included in the next phase. in activity 5, the most voted ideas were briefly discussed and improved, with the aim of informing the vote to be carried out in the next activity. finally, activity 6 resulted in a single viable idea to implement: the definition of an approach more compatible with the daily life of the team. the ideas presented in the brainstorming were related to the implementation of et in the participants' daily lives, following the application of the et course. thus, the exposition of some ideas was crucial to understanding the effectiveness and usefulness of the practices exercised and the artifacts generated. we categorized these ideas to explain what actually applies and what does not apply to the daily lives of agile teams. the following is a summary of the ideas exposed in the brainstorming that favored the incorporation of et in the daily life of agile teams:
• the registration of test points and the test report were considered important for the planning and execution of the et.
• regarding the benefits for the team's day-to-day activities, the importance of recording the et performed was highlighted, in order to make explicit the points that were tested and their pending issues.
• the registration of test points was seen as a significant contribution of the et sessions because, in practice, there was an improvement in the activity of recording the test to be done.
• considering the context of distributed work, adopting files or artifacts with permission for simultaneous collaboration, as well as online tools, contributed to the et performed remotely.
below is a summary of the ideas exposed in the brainstorming that stood out as limiting factors to the incorporation of et in the daily life of agile teams:
• the minimum and maximum time limits for executing an et session do not apply in practice, as this factor is relative to the tested functionality; the same holds for recording the complete execution of an et session.
• the unavailability of time to plan et sessions was identified as a difficulty in implementing et in practice.
• having a well-defined et process could contribute to the insertion of et as a test practice in the team's daily life.
• the minimum time reserved for an et session, in sbtm, could be smaller for small features.
• the lack of experience with et makes the professional who performs the test dedicate a significant effort to the preparation of the et session as a whole.
• the absence of a well-defined process that optimizes the preparation time of an et session limits the insertion of et in the team, as does the difficulty of organizing and committing the entire team to find a common moment to perform the et.
in a complementary way, some other ideas emerged to facilitate the insertion of et in the daily lives of teams, such as:
• include a description and examples in the attributes of each artifact of the et session, to facilitate the understanding of the artifact or to exemplify the description of a test point.
• set the test points in advance.
• a requirements document or a use case model could contribute as a base requirements artifact for the planning of et sessions.
• providing a brief description of the functionality to be tested in the et session could facilitate the planning and execution of the test.
• capturing print screens or screen recordings that contain the functionality defects identified in the et session contributes to the record of the test performed.
• after running the test, it would be interesting to record the test execution step by step, the scenario tested, and any impediments or difficulties encountered by the tester.

5 discussion of results

in this section, we discuss the results obtained and presented in section 4, in order to expand the considerations about the et course applied with agile professionals (section 5.1) and the monitoring of the incorporation of et in the daily work of these professionals (section 5.2).

5.1 overview on using pbl and jitt

to promote active learning and integration among geographically distributed participants during an et course in a remote learning format, the pbl and jitt approaches proved to be useful in stimulating hands-on learning in this context. according to the agreement of information and reports from the participants, some characteristics of pbl and jitt stood out, such as:
1. the use of a real software development scenario contributed to the practice of the et concepts covered in the course. the real scenario encouraged the participants to further investigate the possible failures of the analyzed system based on the heuristics and planned et scenarios. this strategy allowed the quick identification of some bugs implemented in the system's functionalities: in total, there were 6 bugs in three different missions (test scenarios) tested in practical activity 2, and 4 bugs in three distinct missions tested in practical activity 3. we highlight that each mission was executed in a 30-minute session. the low level of complexity of the adopted web system also contributed to the understanding of its operation, since more detailed requirements or business artifacts were available in the first practical activities.
2. autonomous learning was stimulated through the practical activities, by simulating the participants' daily situations through the exploration of the web system; studying or reading class materials and resources in advance; discussing content and activities during classes; elaborating questions about the understanding of the practical activities; and constructing the generated et artifacts.
3. collaborative work stimulated different learning styles among the participants, such as distributing the roles of each member, setting goals, understanding objectives, and providing communication.
it also contributed to the expansion of discussions in the team, with the different points of view of the participants, and to the delivery of the activities on time (although, in some cases, additional time to complete the activity was necessary) and with quality (meeting the requirements of the activity). the use of online conversation and collaboration tools contributed to the team's interaction, narrowing the physical distance.
4. additional skills were stimulated, such as reading materials, using logical reasoning to understand the features of the web system during the practical activities, holding discussions between teams, and exploring the system, among others. teamwork was also an encouraged practice, although the participants already act in this way in their daily work.
5. motivation. some clarifications about technical terms, expressions, or et artifacts were useful to keep the participants motivated in carrying out the practical activities, as were the availability of et artifact templates and the socialization of the generated artifacts at the end of each practical activity.
6. feedback provided before, during, and after classes about the content, materials, resources, and methodology used contributed to the organization and practice of the instructor in the following class; helped the instructor focus on the main difficulties expressed by the participants; and maximized the effectiveness and time of the class.
7. the examples shown, as well as the way they were presented, contributed to improving the understanding of et, as all the examples also referred to contexts of real systems. exemplifying the theory in this way helped the participants in the understanding and applicability of et, especially during the performance of the practical activities.
it is important to highlight that the likert scale helped to identify both the benefits and the limitations of the pbl and jitt approaches, through assertions that represent their main characteristics. however, the answers provided by the participants pointed to these characteristics more as benefits than as limitations to the use of these approaches in remote learning. although guidance and some clarifications were provided during the practical activities, some participants agreed on the difficulty of collaborating in the practical activities because they were unable to understand them well. perhaps this is explained by the absence of face-to-face contact to facilitate communication. in summary, it is also important to highlight that: (i) although the participants were geographically dispersed during the course, we did not address any specific dsd process.
the purpose of the course was to use strategies and tools that would make et viable in the context of isolated and remote work; (ii) the use of a specific tool (xray exploratory app) for planning, executing, and reporting bugs (with video recording, capturing and annotating screenshots, and annotations, among other aspects) in et sessions offers benefits not found in other tools that aid the execution of et; (iii) the experience of remote learning with geographically distributed participants is challenging, and factors such as stimulating participation, collaboration, and attention need to be considered for learning to actually happen; (iv) we noticed the interest and engagement of the participants when practicing the theoretical content through a real problem adopted in the activities and discussions during the classes, mainly due to the knowledge pre-constructed through prior access to the classroom materials; the quality of the answers in the exercises, as a good part of the test artifacts generated were in accordance with the criteria suggested in the description of the activities; and a motivation to use the xray exploratory app tool, due to the ease in creating, executing, and exploring the et performed; among other aspects already discussed in this section.

5.2 overview of insights in practice
the et course facilitated the participants' understanding of et concepts and practices. in order to facilitate the incorporation of this test practice in the daily lives of the teams, the participants were motivated to apply et sessions in their work context. then, a brainstorming session was conducted to evaluate the execution of the et. next, we discuss the information obtained from the brainstorming, based on the questions listed in table 3. some factors were perceived to favor the incorporation of et in the daily lives of the teams, such as:
1. planning the test points and defining the degree of importance of each one facilitated the understanding of what the test priority was at each moment.
2. documenting what was tested also contributed to the conduct of the alignment meeting with the team, considering that all the information inherent to the test execution, such as bugs, suggestions for improvements, or pending issues, was recorded.
on the other hand, some limitations were also noticed in the implementation of et sessions:
1. the artifacts adopted in the et sessions (charters, test points, and test report) proved to be complex (i.e., difficult to understand) or to contain underused fields for the test record. in this case, there is a need for guidelines or examples to clarify the effective use of each artifact.
2. there was a need for adaptations in the artifacts for planning and recording the test, for a more specific fit to the agile context in which the et was conducted.
3. the orientation of minimum and maximum time for the realization of the sessions: sometimes, the minimum time (30 minutes, indicated by the sbtm) for the session was not used because the functionality tested was very simple, and a session that could have been completed in less time had to be extended to reach the minimum time. in this case, a new orientation for situations like this needs to be considered.
4. the absence of a well-defined process or approach that is compatible with the real work context of agile teams. what is proposed in the literature, such as the sbtm, is not always applicable in its entirety in the real context of agile professionals.
5. the experience of the professional who defines the test points: when this activity is performed by professionals who are unaware of the application to be tested, there is a risk of specifying a test point in an inconsistent or incomplete way, generating gaps in the et planning and difficulties in its subsequent execution.
to actually incorporate et as a testing practice in agile teams, it is still necessary to define an approach that fits the context of these professionals, considering practical application guidelines, more specific tools that address the particularities of et planning and execution, and simple, clear, and effective artifacts to be adopted. for this reason, there is a need for an approach that fits the needs of professionals working in agile development and that goes beyond the concept presented in the literature on et.

6 limitations and threats to validity
some potential threats to the validity of this study were perceived, namely threats to internal, external, construct, and conclusion validity. for this reason, some measures were taken to minimize them. to mitigate threats to construct validity, the course material and evaluation questionnaire were iteratively planned, updated, and validated by the authors, and elaborated based on works related to the et area in the context of agile st (bach, 2003; hendrickson, 2013; suranto, 2015; whittaker, 2009; castro, 2018). to mitigate threats to internal validity and ensure the anonymity of responses, participant identification via email address was optional. this allowed the data analysis to be performed in an impersonal way. other aspects inherent to the selection of individuals and the conduct of the experiment also contributed and are detailed in sections 3.1.2 and 3.1.3. threats to external validity were attenuated by the availability of resources and teaching materials to facilitate the application of the active methodologies mentioned. thus, the results can be valid for other course participants, either in a remote or face-to-face teaching format. to mitigate threats to conclusion validity, only percentages were used to identify common patterns. complementarily, the answers from the questionnaire validation round were discarded to avoid possible errors regarding answer format and the textual expressions used in the questions, among others. we tried to reduce bias by using likert scale data. thus, all the conclusions we draw in this study are strictly traceable to the data.
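to illustrate the kind of percentage-based likert analysis mentioned above, the sketch below computes the distribution of responses to one assertion; it is a minimal sketch assuming one list of responses per assertion, and the response values shown are hypothetical examples, not the actual course data.

```python
# a minimal sketch of the percentage-based analysis of likert data described
# above; the responses below are hypothetical, not the course questionnaire data.
from collections import Counter

LIKERT = ["strongly disagree", "disagree", "neutral", "agree", "strongly agree"]

def percentages(answers):
    """return the percentage of each likert option among the given answers."""
    counts = Counter(answers)
    total = len(answers)
    return {option: 100 * counts[option] / total for option in LIKERT}

# hypothetical responses to one assertion about the pbl/jitt course
responses = ["agree", "agree", "strongly agree", "neutral", "agree", "disagree"]
for option, pct in percentages(responses).items():
    print(f"{option}: {pct:.1f}%")
```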
7 related works
although the literature shows interest in st teaching and seeks strategies to bring practical teaching closer to the real context of the software industry through approaches based on active learning (cheiran et al., 2017; de andrade et al., 2019; martinez, 2018; figuerêdo et al., 2011) (see section 7.1), there are still few studies that present results on the teaching of et (costa et al., 2019; ferreira costa and oliveira, 2020) (see section 7.2). some other studies have been dedicated to the investigation of et in the context of industry, in order to understand the impact or effectiveness of this testing practice in real projects (gebizli and sozer, 2017; mårtensson et al., 2019; pfahl et al., 2014; afzal et al., 2015) (see section 7.3). a brief discussion of these works in relation to this study is presented in section 7.4.

7.1 teaching-learning of st using active methodology
cheiran et al. (2017) present an account of two experiences on the teaching of st using pbl in an undergraduate course of se at the federal university of pampa (unipampa). in total, 51 students participated (25 in the 1st edition and 26 in the 2nd edition of the course). data collection took place through questionnaires. to analyze the collected data, statistical and content analyses were adopted. the results point to evidence of students' maturity in the context of the curricular component and the benefits and problems faced by integrating pbl and gamification elements. de andrade et al. (2019) also conducted a study on st learning using pbl practices, with students from computer science, at the university of são paulo (usp), and information systems, at the federal university of juiz de fora (ufjf). the results show that (i) classes with many students should have fewer presentations; (ii) courses with an average number of students can choose to keep weekly presentations more dynamic or to have fewer presentations; (iii) the pbl approach is not as effective for students who have less time for extra-class work. in summary, it was noticed that the successful adoption of an active approach is not directly linked to infrastructural aspects. figuerêdo et al. (2011) apply pbl to train test engineers. for this, an empirical study was carried out with two groups, each composed of five undergraduate students. each group had to test a case tool to support functional testing using et. two evaluations were made with the participants (one before and one after the execution). participants' knowledge, grades, and the number of bugs identified were evaluated. the results obtained highlight that pbl provides the engagement of participants and the acquisition of experience in scenarios that simulate real st situations. martinez (2018) describes the results of an experience with jitt-based teaching in a graduate course in st over two semesters. the approach adopted was evaluated from the perspective of students, through a survey, and of teachers, through an assessment of strengths and limitations. the results show that a large majority of students (1) believe that their learning improved when they prepared for class by reading the material in advance and (2) consider jitt to be an adequate teaching strategy for the course. teachers highlighted that students became more involved and participatory in discussions during class.

7.2 et teaching
costa et al. (2019) use gamification as a motivating strategy in the teaching and learning of et. the dynamic consisted of a practical activity to apply et in the form of a game, which refers to a "treasure hunt". an experience was carried out with students from an se discipline of an undergraduate course in computer science. the results indicate that the qualitative results converged with the quantitative results obtained, showing that gamification helped in the teaching and learning process of the students. in another work, ferreira costa and oliveira (2020) replicate the gamification strategy for teaching et discussed in costa et al. (2019) in a new experience, with a group of undergraduate students in computer science and with graduate students in a computer technician course. as a result, students achieved good overall performance. some reports highlight that gamification facilitated and significantly contributed to better performance, converging with the quantitative data obtained. this can be evidenced mainly by the fact that both "runs" of the experience (classes) reached a percentage of achievement higher than 70%.
7.3 et in the context of industry
gebizli and sozer (2017) evaluate the impact of the education and level of experience of testers on the effectiveness of et. for this, a case study was carried out with 19 industry professionals with different educational backgrounds and levels of experience. a digital tv system was tested, and the detected failures were categorized according to their severity. thus, the effectiveness of et was evaluated on two aspects: the criticality of detected faults and the efficiency in terms of the number of faults detected per unit of time. the results show that et efficiency is significantly affected by training and educational experience. mårtensson et al. (2019) conducted a study based on interviews to understand the success factors in the application of et in industry projects. for this, interviews were conducted with 20 professionals. finally, a list of key factors that enable the efficiency and effectiveness of et in large-scale software systems was presented. the nine factors identified are grouped into four themes: (i) testers' knowledge, experience, and personality; (ii) purpose and scope; (iii) ways of working; and (iv) registration and reporting. pfahl et al. (2014) investigated how software engineers understand and apply the principles of exploratory testing, as well as the advantages and difficulties they experience. for this, an online survey was carried out among estonian and finnish software developers and testers. the main results indicate that the majority of testers, developers, and test managers who use et (1) apply et to usability-, performance-, and security-critical software to a high degree; (2) use et very flexibly at all kinds of levels, activities, and phases; (3) perceive et as an approach that supports creativity during testing and is effective and efficient; and (4) feel that et is not easy to use and has few support tools. in addition, there was a perceived need for more support for et users, such as guidelines and tools. afzal et al. (2015) sought to quantify the effectiveness and efficiency of et vs. tests with documented test cases (tct). for this, four controlled experiments were carried out, with a total of 24 professionals and 46 students. manual functional tests using et and tct were performed. the number of defects identified in the 90-minute test sessions, the difficulty of detection, the severity and types of defects detected, and the number of false defect reports were measured. the results show that et found a significantly higher number of defects. however, the two testing approaches did not differ significantly in terms of the number of false defect reports.

7.4 discussion of related works
we could not find works that apply jitt and pbl to et. in general, the application of jitt or pbl in st, as reported in the literature (cheiran et al., 2017; figuerêdo et al., 2011; martinez, 2018), achieved results that converge with ours in the sense that the adoption of these methodologies provided positive gains related to motivation, engagement, collaboration, and content learning. we also emphasize that most of the works were developed in academic environments (with undergraduate students), while others were developed in practical environments (with industry professionals). generally, the types of tests investigated are different, sometimes targeting a more specific type of test and sometimes a more general context, such as defect detection only.
however, it is not always possible to identify in which development process the work was applied or which development methodology was adopted. in this way, this paper differs from the others in that it identifies and discusses the contributions of integrating the active methodologies pbl and jitt in teaching et in a remote learning course with agile professionals from the software industry who are geographically distributed. some strategies and guidelines seeking to optimize teaching-learning with pbl and jitt, as well as a discussion of some perceived challenges, were also highlighted. another differential is the monitoring of et execution in the daily agile development of industry professionals, in order to highlight the aspects that favor or limit the incorporation of et in the daily routine of agile teams.

8 conclusions
this work investigates the use of the pbl and jitt methodologies to teach et to a dsd team. based on a literature review and an evaluation of the available resources, we planned and performed a training course on et and analyzed the results obtained. while teaching st has always been challenging, under the circumstances imposed by social distancing, where each team member works remotely and in isolation, teaching such a subject becomes even more challenging. next, we followed the incorporation of et into the daily lives of the teams that participated in the course and analyzed the application of this practice in the context of agile development. through brainstorming, ideas were raised about the characteristics that favored or limited the execution of et. the use of these methodologies significantly contributed to the success of the course. they provided the grounds for adopting a real problem, assessing the students' needs with resources available before the class, adjusting the course to meet students' expectations and needs, and promoting collaboration. additionally, the existence of a support tool for et was key to optimizing remote learning. these aspects also favored the application of et by the agile teams in their projects, so that both the practices and the artifacts were put to good use in the test execution. however, even with the support given in the course, some limitations were perceived, such as the absence of more specific support for the planning and execution of et, such as guidelines and tools. as future work, we intend to (1) propose an approach that facilitates the implementation of et, considering the dsd scenario and the generation of simple and robust et artifacts, for the effective insertion of this test practice in the daily life of agile teams; and (2) validate the approach through experiments with professionals from the software development industry.

acknowledgements
the authors would like to thank the anonymous reviewers for their valuable comments. the second and third authors were supported by cnpq/brazil (processes cnpq 303773/2021-9 and 311215/2020-3).

references
afzal, w., ghazi, a. n., itkonen, j., torkar, r., andrews, a., and bhatti, k. (2015). an experiment on the effectiveness and efficiency of exploratory testing. empirical software engineering, 20:844–878.
alliance, a. (2016). agile glossary. url: https://www.agilealliance.org/agile101/agile-glossary/ accessed on august 13, 2020.
andrews, d. h., hull, t. d., and donahue, j. a. (2009). storytelling as an instructional method: descriptions and research questions. technical report, oak ridge inst for science and education tn.
aniche, m., hermans, f., and van deursen, a. (2019). pragmatic software testing education. in proceedings of the 50th acm technical symposium on computer science education, sigcse '19, pages 414–420, new york, ny, usa. acm.
bach, j. (2003). exploratory testing explained. online: http://www.satisfice.com/articles/et-article.pdf.
bonnardel, n. and didier, j. (2020). brainstorming variants to favor creative design. applied ergonomics, 83:102987.
bonwell, c. c. and eison, j. a. (1991). active learning: creating excitement in the classroom. 1991 ashe-eric higher education reports. eric, sl.
brown, t. and katz, b. (2011). change by design. journal of product innovation management, 28(3):381–383.
castro, a. k. s. d. (2018). testes exploratórios: características, problemas e soluções. b.s. thesis, universidade federal do rio grande do norte.
cheiran, j. f. p., de m. rodrigues, e., de s. carvalho, e. l., and da silva, j. a. p. s. (2017). problem-based learning to align theory and practice in software testing teaching. in proceedings of the 31st brazilian symposium on software engineering, sbes'17, pages 328–337, new york, ny, usa. acm.
corbin, j. and strauss, a. (2014). basics of qualitative research: techniques and procedures for developing grounded theory. sage publications.
costa, i., oliveira, s., cardoso, l., ramos, a., and sousa, r. (2019). uma gamificação para ensino e aprendizagem de teste exploratório de software: aplicação em um estudo experimental. xviii simpósio brasileiro de jogos e entretenimento digital (education track–short papers), 2019(1):1232–1235.
coutinho, e. f. and bezerra, c. i. m. (2018). uma avaliação inicial do jogo para o ensino de testes de software itestleaening sob a ótica de um software educativo. in congresso sobre tecnologias na educação, volume 3, pages 11–22, fortaleza, ce. sbc open library.
coutinho, j., andrade, w., and machado, p. (2021). teaching exploratory tests through pbl and jitt: an experience report in a context of distributed teams. in proceedings of the xxxv brazilian symposium on software engineering, sbes '21, pages 205–214. association for computing machinery.
crispin, l. and gregory, j. (2009). agile testing: a practical guide for testers and agile teams. pearson education, sl.
crouch, c. h. and mazur, e. (2001). peer instruction: ten years of experience and results. american journal of physics, 69(9):970–977.
de andrade, s. a. a., de oliveira neves, v., and delamaro, m. e. (2019). software testing education: dreams and challenges when bringing academia and industry closer together. in proceedings of the xxxiii brazilian symposium on software engineering, sbes 2019, pages 47–56, new york, ny, usa. acm.
ferreira costa, i. e. and oliveira, s. r. b. (2020). the use of gamification to support the teaching-learning of software exploratory testing: an experience report based on the application of a framework. in 2020 ieee frontiers in education conference (fie), pages 1–9, uppsala, sweden. ieee.
figuerêdo, c. d. o., dos santos, s. c., borba, p., and alexandre, g. (2011). using pbl to develop software test engineers. in international conference on computers and advanced technology in education, pages 305–322, cambridge, united kingdom. sn.
garousi, v., felderer, m., kuhrmann, m., and herkiloğlu, k. (2017). what industry wants from academia in software testing? hearing practitioners' opinions. in proceedings of the 21st international conference on evaluation and assessment in software engineering, ease'17, pages 65–69, new york, ny, usa. acm.
garousi, v., rainer, a., lauvås, p., and arcuri, a. (2020). software-testing education: a systematic literature mapping. journal of systems and software, 165:110570.
gebizli, c. s. and sozer, h. (2017). impact of education and experience level on the effectiveness of exploratory testing: an industrial case study. in 2017 ieee international conference on software testing, verification and validation workshops (icstw), pages 23–28.
ghazi, a. n. (2017). structuring exploratory testing through test charter design and decision support. phd thesis, blekinge tekniska högskola.
hendrickson, e. (2013). explore it!: reduce risk and increase confidence with exploratory testing. pragmatic bookshelf, sl.
leite, f. t., coutinho, j. c. s., and de sousa, r. r. (2020). an experience report about challenges of software engineering as a second cycle course. in proceedings of the 34th brazilian symposium on software engineering, sbes '20, pages 824–833, new york, ny, usa. acm.
mårtensson, t., martini, a., ståhl, d., and bosch, j. (2019). excellence in exploratory testing: success factors in large-scale industry projects. in product-focused software process improvement, pages 299–314. springer international publishing.
martinez, a. (2018). use of jitt in a graduate software testing course: an experience report. in 2018 ieee/acm 40th international conference on software engineering: software engineering education and training (icse-seet), pages 108–115, gothenburg, sweden. ieee.
mcconnell, j. j. (1996). active learning and its use in computer science. in proceedings of the 1st conference on integrating technology into computer science education, pages 52–54, barcelona, spain. acm.
milne, a., riecke, b., and antle, a. (2014). exploring maker practice: common attitudes, habits and skills from vancouver's maker community. studies, 19(21):23.
novak, g. m. (2011). just-in-time teaching. new directions for teaching and learning, 2011(128):63–73.
paiva, m. r. f., parente, j. r. f., brandão, i. r., and queiroz, a. h. b. (2016). metodologias ativas de ensino-aprendizagem: revisão integrativa. sanare-revista de políticas públicas, 15(2):145–153.
paschoal, l. n. and de souza, s. d. r. s. (2018). a survey on software testing education in brazil. in proceedings of the 17th brazilian symposium on software quality, sbqs, pages 334–343, new york, ny, usa. acm.
paschoal, l. n., silva, l., and souza, s. (2017). abordagem flipped classroom em comparação com o modelo tradicional de ensino: uma investigação empírica no âmbito de teste de software. in brazilian symposium on computers in education (simpósio brasileiro de informática na educação-sbie), page 476, recife, pe. sbc open library.
paschoal, l. n. and souza, s. r. (2018). planejamento e aplicação de flipped classroom para o ensino de teste de software. renote, 16(2):606–614.
pfahl, d., yin, h., mäntylä, m. v., and münch, j. (2014). how is exploratory testing used? a state-of-the-practice survey. in proceedings of the 8th acm/ieee international symposium on empirical software engineering and measurement, esem '14, new york, ny, usa. association for computing machinery.
queiroz, r., pinto, f., and silva, p. (2019). islandtest: jogo educativo para apoiar o processo ensino-aprendizagem de testes de software. in anais do xxvii workshop sobre educação em computação, pages 533–542, belém, pa. sbc open library.
raappana, p., saukkoriipi, s., tervonen, i., and mäntylä, m. v. (2016). the effect of team exploratory testing – experience report from f-secure. in 2016 ieee ninth international conference on software testing, verification and validation workshops (icstw), pages 295–304, chicago, il, usa. ieee.
scatalon, l. p., carver, j. c., garcia, r. e., and barbosa, e. f. (2019). software testing in introductory programming courses: a systematic mapping study. in proceedings of the 50th acm technical symposium on computer science education, sigcse '19, pages 421–427, new york, ny, usa. acm.
suranto, b. (2015). exploratory software testing in agile project. in 2015 international conference on computer, communications, and control technology (i4ct), pages 280–283, kuching, malaysia. ieee.
whittaker, j. a. (2009). exploratory software testing: tips, tricks, tours, and techniques to guide test design. pearson education, sl.
wohlin, c., runeson, p., host, m., ohlsson, m. c., regnell, b., and wesslén, a. (2012). experimentation in software engineering. springer science & business media, sl.

journal of software engineering research and development, 2019, 7:4, doi: 10.5753/jserd.2019.15 this work is licensed under a creative commons attribution 4.0 international license. on challenges in engineering iot software systems rebeca campos motta [universidade federal do rio de janeiro and lamih cnrs umr 8201 | rmotta@cos.ufrj.br] káthia marçal de oliveira [université polytechnique hauts-de-france lamih cnrs umr 8201 | kathia.oliveira@uphf.fr] guilherme horta travassos [universidade federal do rio de janeiro | ght@cos.ufrj.br] abstract contemporary software systems, such as the internet of things (iot), industry 4.0, and smart cities, represent a technological change that offers challenges for their construction, since they call into question our traditional way of developing software. they are a promising paradigm for the integration of devices and communications technologies. they lead to a shift from the classical monolithic view of development, in which stakeholders used to receive a software product at the end (as we have been doing for decades), to software systems incrementally materialized through physical objects interconnected by networks and with embedded software to support daily activities. therefore, we need to revisit the traditional way of developing software and start to consider the particularities required by these new sorts of applications. since such software systems involve different concerns, this paper presents the results of an investigation towards defining a framework to support the software systems engineering of iot applications.
to support its representation, we evolved zachman's framework as an alternative for the organization of the framework architecture. the filling of such a framework is supported by a) 14 significant concerns of iot applications, recovered from the technical literature, practitioners' workshops, and a government report; and b) seven structured facets that emerged from the iot data analysis, which together represent the engineering challenges to be faced both by researchers and practitioners towards the advancement of iot in practice. keywords: internet of things, iot, contemporary software systems engineering, empirical software engineering

1 introduction
the internet of things (iot) contributes to a new technological revolution affecting society. iot is a paradigm that allows composing systems from uniquely addressable objects (things) equipped with identifying, sensing, or acting behaviors and processing capabilities that can communicate and cooperate to reach a goal. from primary devices with simple software solutions to large-scale, high-performance software systems producing and analyzing massive amounts of data, iot is going to reach all areas of interest (jacobson et al. 2017). due to its far-reaching potential, iot can use all kinds of technologies available today and will drive the development of new software systems to solve new problems, some still unknown (atzori et al. 2010; jacobson et al. 2017). software engineering, as a discipline, has gone through constant changes since its conception. several concepts, methods, tools, and standards have been proposed to support the development of software (ieee 2004; trappey et al. 2017), and the internet has produced a significant shift in the area. it makes more explicit the need to evolve the software technologies previously proposed, to support the building of systems fitting new features. systems engineering is a research area embracing multidisciplinarity, integrating different disciplines to reach successful systems according to their purposes, including software, which is essential for iot materialization. therefore, iot leads to an era where, rather than developing software, practitioners are going to engineer systems embedding much software into the systems' parts. in this scenario, the initial problem of our research is to identify the concerns regarding the development of iot software systems, and whether the existing software technologies within the areas (facets) related to engineering such systems are enough for supporting their development. overall, this paper describes the results of investigations dealing with the road ahead on iot development. the concerns captured through observations in the technical literature, from practitioners in specific workshops, and from a national initiative regarding iot in brazil pave this road. the filling of iot facets combined with the concerns is what we call engineering challenges, capturing the knowledge necessary to support a specific activity. the conceptual framework aims to contemplate all facets involved in iot and present the recovered concerns by simplifying and organizing their presentation. the zachman proposition for information systems architecture (zachman 1987) (subsection 2.2) was borrowed and tailored to compose such a framework.
our motivation to investigate and contribute to the iot paradigm is therefore supported by its relevance (cni 2016; lu 2017) and the need for a holistic approach and a multidisciplinary view for the development of new software solutions (bauer and dey 2016; aniculaesei et al. 2018). this is reflected in a demand for technical competencies and skills held by different practitioners to engineer such software systems (desolda et al. 2017; de farias et al. 2017) and in the lack of specific software engineering methodologies to support iot (zambonelli 2016; larrucea et al. 2017; jacobson et al. 2017). some of the challenges are focused on interaction issues, whether between humans or things, which is essential for the complete establishment of the paradigm (motta et al.). in our proposal, we introduce a multifaceted conceptual framework as a step towards addressing some of these issues. this paper extends (motta et al. 2018), including more details of the studies, deepening the discussions, and providing a usage example of the proposed framework. the paper is organized as follows: section 2 presents the context of this research and the zachman framework. section 3 presents the research strategy, followed by the primary results: the iot concerns (section 4), the iot facets (section 5), and the definition of the framework (section 6) with an example of use. section 7 concludes the paper with final remarks regarding threats to validity and ongoing works.

2 conceptual background
this section starts by presenting the source of motivation for this research, the cactus project. next, it presents some basic concepts related to the zachman framework, which is the ground for the proposed conceptual framework organizing the results presented in this paper.

2.1 the cactus project
the cnpq cactus research project was performed based on the aim of understanding test strategies for quality assessment of actor-computer interaction in context-aware systems, as one of the chief characteristics of ubiquitous systems (spínola and travassos 2012; santos et al. 2017; matalonga et al. 2017). research teams from two brazilian universities (federal university of rio de janeiro and federal university of ceará) and one french university (université polytechnique hauts-de-france) worked together in the project. it started with the assumption that interaction is not limited to humans and computers in ubiquitous systems. it encompasses the interaction among different devices, such as sensors and actuators, as well as other systems, for which we consider the term actor-computer interaction more adequate. from the results achieved in the project and the technological evolution of the area, we see iot as related to ubiquitous systems, sharing some characteristics and challenges (andrade et al. 2017). this work is one of the results of this change of perspective.

2.2 zachman's framework
the zachman framework (zachman 1987) was introduced in 1987 to comprehend the scope of control within an enterprise and to provide a holistic view of the enterprise architecture that may be used as a base for its management. it is still an essential reference for enterprise architecture and is supported by many types of modeling tools and languages (goethals et al. 2006).
zachman's motivation to develop the framework was that "with increasing size and complexity of the implementation of information systems, it is necessary to use some logical construct for defining and controlling the interfaces and the integration of all of the components of the system" (zachman 1987). the framework is suitable for working with complex systems, and despite its original purpose, its use is not limited to enterprise architecture. alongside that, it has been used to assess the development process (de villiers 2001), for requirements engineering (de villiers 2001; technology 2015), for business process modeling (sousa et al. 2007), to instantiate an iec standard (panetto et al. 2007), and applied to systems of systems (bondar et al. 2017). also, zhang et al. used this framework for safety analysis in avionics systems (zhang et al. 2014). more evidence of the framework's use can be observed in different case studies (panetto et al. 2007; nogueira et al. 2013; aginsa et al. 2016), the latter claiming that "zachman's framework continues to represent a modeling tool of great utility and value since it can integrate and align the it infrastructure and business goals." from the data recovered in our research, we realize that the concepts and properties related to iot change according to the context and actors involved. this multifaceted view of iot shows once again that it is a multidisciplinary paradigm. for this reason, a representation of the concepts should be as comprehensive as possible to represent all aspects involved. the framework is primarily defined as a table crossing perspectives and interrogative questions, as presented in table 1 (zachman 1987; sowa and zachman 1992).

table 1. zachman framework with cells filled showing examples of description (sowa and zachman 1992).
perspective | what | how | where | when | who | why
planner | things important to the business | process performed | business locations of operations | events and cycles important to the business | organizations and agents important to the business | business goals and strategies
owner | semantic model | business process model | business logistic system | master schedule | workflow model | business plan
designer | logic data model | application architecture | distributed system architecture | process structure | human interface architecture | knowledge architecture
builder | physical data model | system design | technology architecture | control structure | presentation architecture | knowledge design
implementer | data definition | program | network architecture | timing definition | security architecture | knowledge definition
user | data | function | network | schedule | organization | strategy

the framework formalization and its conception were presented as a metaphor from building architecture to system architecture. the perspectives are therefore described as (sowa and zachman 1992):
• planner – it corresponds to an executive summary for a planner or investor who wants a system scope estimate, what it would cost, and how it would perform.
• owner – it relates to the enterprise business model, which constitutes the business design and shows the business entities and processes, and how they interact.
• designer – it corresponds to the system model designed by a systems analyst who must determine the data elements and functions representing business entities and processes.
• builder – it refers to the technology model, which must tailor the information system model to the details of the programming languages, i/o devices, or other technologies.
• implementer – it relates to the detailed specifications that are given to programmers who code individual modules without being concerned with the overall context or system structure.
• user – the user perspective was added in a later version and represents the view of the functioning building, or system, in its operational environment.
the framework presents six fundamental questions in the columns to outline each perspective and to support answering questions regarding:
• what: some entity (that can be real-world objects, logical or physical data types).
• how: some process.
• where: some location.
• who: some role played by a person or a computational agent.
• when: time, a subtype such as a date, or a time that is coincident with some event.
• why: some goal or subgoal that provides the reason that motivates the model for that row.
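as an illustration of this perspectives-by-questions structure, the sketch below encodes part of table 1 as a simple matrix keyed by perspective and question. the data-structure choice is ours and merely mirrors the table; it is not part of zachman's or this paper's proposal, and only two rows are shown for brevity.

```python
# a minimal sketch of the zachman matrix from table 1 as a nested mapping:
# rows are perspectives, columns are the six interrogative questions.
# illustrative only; the cell texts are taken from table 1.
QUESTIONS = ["what", "how", "where", "when", "who", "why"]

ZACHMAN = {
    "planner": ["things important to the business", "process performed",
                "business locations of operations",
                "events and cycles important to the business",
                "organizations and agents important to the business",
                "business goals and strategies"],
    "user": ["data", "function", "network", "schedule",
             "organization", "strategy"],
}

def cell(perspective, question):
    """look up the example description for one cell of the matrix."""
    return ZACHMAN[perspective][QUESTIONS.index(question)]

print(cell("planner", "why"))  # business goals and strategies
```

tailoring the framework, as done later in the paper, amounts to replacing rows or columns while keeping this cell-wise organization.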
considering the extensive use of the zachman framework for representing different domains and technologies, and its flexibility of being customized to represent the complexity of each context, we decided to take it as the basis of our work. to that end, we analyzed concerns and facets related to the multidisciplinarity of iot applications to be used as inputs of information and requirements for its first organization.

2.3 related work
in this work, we propose a holistic engineering view based on the principles of systems engineering. in the search, we came across the work of patel and cassou, who propose a development methodology and framework to support the implementation of iot applications. their approach is designed to address essential challenges (lack of division of roles, heterogeneity, scale, different lifecycle phases) that differentiate iot applications from others (patel and cassou 2015). their methodology is based on the separation of concerns: domain, functional, deployment, and platform. each concern has specific steps to guide the development, implemented in a defined process. there are some similarities to our proposal. we highlight their strategy to attack multidisciplinarity by using four concerns with a diverse set of skills performed by five different roles. however, our proposal differs from theirs because it offers a broader view of the concerns and focuses more on supporting the development team to move out of the problem domain with an action plan stepping into the solution domain. two other works, (alegre et al. 2016) and (sánchez guinea et al. 2016), are literature reviews focusing on engineering strategies to develop context-aware software systems (cass) and ubiquitous systems, respectively. in (alegre et al. 2016), the results are based on a literature review and a survey carried out with specialists in cass. it presents an extensive work in the cass area, analyzing and characterizing the concept of context as well as the interaction types and main features. the most interesting part from the perspective of our work is that they searched the literature for development techniques and methods that have been adapted from conventional systems to cass throughout the most common stages of a development process: requirements elicitation, analysis & design, implementation, and deployment & maintenance. none of the techniques presented fully meets the cass requirements, and the authors conclude the work by recommending a more holistic and unified approach for the development of cass, arguing that it should be different from the conventional software engineering approach for creating these systems (alegre et al. 2016). another work is from costa et al. (costa et al. 2017). it presents more than just the requirements and needs of an iot application, focusing on its challenges and proposing an approach to support the requirements specification of iot systems named iot requirements modeling language (iot-rml). we share some of the motivations with this work since it states that different perspectives and the heterogeneous nature of iot should be considered in the development of such software systems. their proposal comprises a domain model for the abstraction and a sysml profile for the specification. in their model, a stakeholder expresses a requirement as a proposition, and the requirement may influence or conflict with other requirements. their approach supports both functional and non-functional requirements, which is crucial in this scenario. through their solution, four requirements specification activities are supported: the elicitation of the system's requirements from the stakeholders, which generates an initial model in their tool; the analysis to identify influences and conflicts among requirements, updating the model representing them; the resolution of conflicts; and, last, the decision on a candidate solution containing the requirements to be addressed. a proof of concept is presented to illustrate the approach in the context of a smart building, focusing on employees' safety and energy efficiency. our proposal can somehow be related to the iot-rml approach (costa et al. 2017). however, we aim to address the problem understanding in the conceptual phase, which focuses on a step before the requirements specification, considering a multi-perspective and multidisciplinary strategy. another related work is from aniculaesei et al., who argue that conventional engineering methods are not adequate for providing guarantees for some of the challenges specific to autonomous systems, such as dependability, the focus of their work (aniculaesei et al. 2018). some of the main points discussed are the possibility of adaptive behavior in iot, as these systems adapt their behavior to better interact with other systems and people or to solve problems more effectively, and variations in the context: the formerly closed and valid development artifacts may not capture the changes and may be inadequate, since the environment and the system behavior can no longer be fully predicted or described in advance (aniculaesei et al. 2018). in response to these challenges and gaps, the authors propose an approach based on the notion of dependability cages. their approach deals with external risks (uncertainties in the environment) and internal risks (system changing behavior), both at development and operation time. at the moment of preparing this manuscript, we observed a lack of more concrete proposals for the materialization of the iot paradigm. we aim to address the challenges presented in (alegre et al. 2016) and (sánchez guinea et al. 2016), filling the gaps from (patel and cassou 2015), (costa et al. 2017), and (aniculaesei et al. 2018), focusing on the issue of multidisciplinarity and providing support to decision-making in the initial development phase of problem understanding.
3 research strategy
figure 1 presents our investigation strategy. it is composed of three parts and involves performing different lines of investigation and studies. the first part of our investigation regards iot concerns. it aims at presenting concerns, issues, and difficulties frequently reported regarding the development of iot applications. to recover such concerns, we collected data from different sources of information, considering a literature review (subsection 4.1), discussions with practitioners (subsection 4.2), and the reading of a brazilian government report (subsection 4.3). based on the identified concerns, it was possible to observe research gaps and the main iot development issues that need effort in their understanding and evolution. these intermediate results can be useful to researchers looking for research opportunities and to practitioners planning the construction of iot applications. the literature review also supported the identification of 29 iot definitions. from this set, we conducted a textual analysis, using coding procedures from grounded theory (see section 3.1), to assign concepts to portions of data (strauss and corbin 1990). the result was the identification of the iot facets (section 4) necessary for iot materialization, in the sense of being the set of parts composing an iot software system. we understand facets as "one side of something many-sided" (oxford dictionary), "one part of a subject, a situation that has many parts" (cambridge dictionary). these facets are the basis for tailoring the 6x6 matrix of the zachman framework (zachman 1987).
figure 1: research strategy.
the idea of investigating the facets from iot definitions came in the sense of finding a set of parts composing an iot software system. it does not try to be exhaustive because, due to its far-reaching potential, we do not know to what extent iot will meet or drive the development of new software technologies to solve new problems. we wanted to differentiate concerns from challenges. each application alone has a set of concerns that must be addressed with software technologies and other solutions for the software system development or construction. in the case of iot, we understand that it is a multidisciplinary, multidimensional, and multifaceted paradigm (gluhak et al. 2011; gubbi et al. 2013; jacobson et al. 2017). in this sense, this work presents the iot facets that must meet the concerns, this being the real engineering challenge (to fill a cell in the framework). the procedures and activities performed for each part are detailed, together with a broader discussion of the results. however, the concerns are somewhat related to the facets, and our next activity was to find a way to represent all the concepts that transparently emerged from the sources of information and that could guide the next research activities. thus, as the last step of our research and as a result of the studies, we introduce a conceptual framework to organize the challenges of engineering iot software systems (section 5). our work focuses on discussing the iot perception and its central issues from three perspectives: technical literature, practitioners, and government, using data collected in different studies.
we briefly present the studies and dive into the analyzed challenges, from which we propose a conceptual framework to support the development of such software systems by considering different and complementary facets. this paper presents the research path that led us to the framework, not detailing each study, but instead informing how they inspired a structure composed of six questions, six perspectives, and seven facets, aiming to define an engineering strategy for iot development.

3.1 grounded theory (gt)
the gt methodology comes as a mechanism to deal with and understand research data and how they relate to each other, considering the iot domain and features. we rely on these procedures to analyze the recovered data from each study. the principles and procedures of gt according to (strauss and corbin, 1998) were used to assist us in developing and analyzing the concepts in this research, as presented below:
• planning: initially, it identifies the area of interest and the process to be followed inside the gt paths. in our case, each study was planned individually, with the execution and analysis performed by the researchers.
• data collection: initially, gt resorted to interviews. however, any method can be used, like focus groups, observations, artifacts, or texts. in our case, we rely on the data extracted from the articles resulting from the literature review, the data recovered from the discussions with practitioners, and all the textual documents from the brazilian government.
• coding: at this step, the researchers should exercise their sensibility to identify significant data and use the constant comparative analysis method: through iteration, going back and forth in the generated codes, observing and comparing to find adequacy, conformity, and coherence among the codes. in our case, the qda miner lite tool (provalisresearch.com/products/qualitative-data-analysis-software/) was used to support this part. all the matching from text to code was performed by one researcher and then revised by another. the procedure followed was to review each extraction and the respective code, contributing to the constant comparison until reaching an agreement on the coding.
• reporting: writing memos, comments, and decision points during the coding phase can enhance the report. being able to narrate the process of abstraction and describe the rationale behind the codes is the last challenge to sound analysis. in our case, this article comes as the report of the result, where portions of the extracted data lead to the coding that represents concerns for iot development.
this approach has been used in software engineering research (seaman 1999; carver 2007; badreddin 2013) and was selected since gt provides reference support for the procedures and is adequate for working with a large amount of information. considering that some concepts have different meanings, this methodology is suitable to establish the similarities and differences among them. the same analysis strategy was used throughout the study.
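the coding step above can be pictured as mapping excerpts to codes and then grouping codes into emerging categories. the sketch below illustrates that bookkeeping under our own assumptions; the abbreviated excerpts are paraphrased from quotes shown later in this paper, and the function and structures are ours, not part of the authors' tooling.

```python
# a minimal sketch of open coding with constant comparison as described above:
# excerpts are assigned codes, and codes are grouped under agreed categories.
# the excerpts, codes, and the empty merge map below are illustrative examples.
from collections import defaultdict

# excerpt -> code, as one researcher might assign and another later revise
coded_excerpts = {
    "finding a scalable, flexible, secure architecture...": "architecture",
    "make sense of data in any iot environment...": "data",
    "plug n' play smart objects with an interoperable backbone...": "interoperability",
}

def group_by_category(coded, category_of):
    """group excerpts by category; category_of maps a code to its agreed label."""
    categories = defaultdict(list)
    for excerpt, code in coded.items():
        categories[category_of.get(code, code)].append(excerpt)
    return dict(categories)

# here each code is already its own category; after consensus, several codes
# could be merged by mapping them to one agreed label in category_of
print(group_by_category(coded_excerpts, category_of={}))
```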
4 iot concerns
being a multidisciplinary domain, iot covers many topics, from socio-technical to business. we conducted different studies to recover iot concerns. each study was planned considering a specific perspective on the subject. initially, we contemplate the academy perspective, recovered through a literature review. then we decided to broaden the range to represent two other perspectives, collected from practitioners and a government report, contributing to a more comprehensive representation. although they represent different visions, they discuss the same topic. thus, they become complementary, giving us a more comprehensive view of the area.

4.1 inputs
4.1.1 from the literature review
the concerns presented in the technical literature were extracted from a literature review (our first empirical study). for the review, we followed the recommendations well established in the literature (biolchini et al. 2008), focusing on secondary studies since there were already reviews on iot. the goal of this gqm-based (basili et al. 1994) review was defined as: analyze the internet of things domain with the purpose of characterization with respect to definitions, characteristics, and application areas from the point of view of software engineering researchers in the context of the available technical literature. the selected articles were secondary studies, as they rely on primary studies and survey other sources of information to present a bigger picture. table 2 presents a research protocol summary.

table 2. protocol summary.
research questions | (rq1) what is internet of things? (rq2) which characteristics define iot applications? (rq3) which are the applications for iot?
search string | population ("*systematic literature review" or "systematic* review*" or "mapping study" or "systematic mapping" or "structured review" or "secondary study" or "literature survey" or "survey of technologies" or "driver technologies" or "review of survey*" or "technolog* review*" or "state of research") and intervention ("internet of things" or "iot")
search strategy | scopus (www.scopus.com) + snowballing (backward and forward)
inclusion criteria | to provide an iot definition; or to provide iot properties; or to provide applications for iot.
exclusion criteria | not provides an iot definition; and not provides iot properties; and not provides applications for iot; and studies in duplicity; and register of proceedings.
study type | secondary studies
acceptance criteria | three distinct readers: all readers accept => paper is accepted; all readers exclude => paper is excluded; the majority accept, others in doubt => paper is accepted; else => discuss and consensus
technical report | detailed information about the planning and execution: https://goo.gl/czvvdc

the search engine was scopus since it indexes several peer-reviewed databases and is well-balanced regarding coverage and relevance. snowballing procedures can mitigate the lack of other search engines and complement the search strategy (motta et al. 2016; matalonga et al. 2017). to reduce bias, three researchers executed the review. the process was carried out between march and may 2017. the search in scopus resulted in eighty-one articles. after the execution of four trials, a selection by title and abstract according to the established criteria, and one level of backward/forward snowballing (wohlin 2014), 12 secondary studies composed the final set. the reviewers read the articles and extracted relevant information according to an extraction form. we used the form to retrieve the following information from the secondary sources: reference information, abstract, iot definition, iot related terms, iot application features, iot application domain, development strategies for iot, study type, study properties, challenges, and article focus. from the discussion of rq1, we extracted 34 iot definitions that lead us to understand that iot is a paradigm allowing the composition of software systems from uniquely addressable objects equipped with identifying, sensing, or actuation behaviors and processing capabilities that can communicate and cooperate to reach a goal. regarding rq2, we recovered 29 different attributes, nine of which are discussed with clear evidence from the sources of information. considering that the results retrieved are from secondary studies, the characteristics represented reflect more than just the selected set, but rather the whole set of primary studies involved in them, which can strengthen these results. one contribution of the review is to present an organized perspective regarding the iot state-of-the-art. besides, it allows observing which areas of application are making use of iot (rq3). all of these findings were related and summarized in a report to enrich the iot paradigm comprehension (see the link in table 2). iot related concepts, such as cyber-physical systems (khaitan and mccalley 2015) and systems of systems (nielsen et al. 2015), are also discussed in the final report of the review.
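the acceptance criteria in table 2 amount to a small decision rule over the three readers' votes. the sketch below implements that rule; the vote labels and function name are our own choices for illustration, not part of the published protocol.

```python
# a minimal sketch of the acceptance criteria from table 2: three distinct
# readers vote 'accept', 'exclude', or 'doubt' on each candidate paper.
from collections import Counter

def selection_decision(votes):
    """apply the table 2 acceptance rule to a list of three reader votes."""
    counts = Counter(votes)
    if counts["accept"] == len(votes):
        return "accepted"            # all readers accept => paper is accepted
    if counts["exclude"] == len(votes):
        return "excluded"            # all readers exclude => paper is excluded
    if counts["accept"] > len(votes) / 2 and counts["exclude"] == 0:
        return "accepted"            # majority accept, others in doubt => accepted
    return "discuss and consensus"   # any other combination is discussed

print(selection_decision(["accept", "accept", "doubt"]))    # accepted
print(selection_decision(["accept", "exclude", "doubt"]))   # discuss and consensus
```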
the data for discussions and analysis came in part from what was extracted from the form, which we treat in this section as concerns. we based our analysis procedure on textual analysis, using codes to assign concepts to portions of data, identifying patterns from similarities and differences emerging from the extracted data, based on the gt procedures (strauss and corbin 1990). it was conducted by two researchers, with cross-checking to achieve a consensus on the analysis and to decrease potential misinterpretation and bias. the 12 papers provided 38 excerpts regarding iot challenges that were organized into seven categories: architecture, data, interoperability, management, network, security, and social.

4.1.2 from practitioners
another perspective used to recover iot concerns was the practitioners' opinion. we performed qualitative studies during two scientific events in which all the participants were developers and/or researchers in the iot domain. for this reason, we considered them representative, insightful, and experienced in the topic. the following questions guided the discussions in both studies: a) regarding product quality between conventional software and iot: what is similar? what is different? what needs to be investigated? b) regarding the software technologies between conventional software and iot: what do we have that can be used directly? what do we have that needs adaptation to be used? what don't we have but need? the first event (in august 2017) was the 1st qualityiot workshop at the brazilian symposium on software quality (sbqs). the 21 participants were divided by convenience into groups to deal with the mentioned questions from the following perspectives:
• people – focused on the human end-user; the challenges and impact of this technology in our daily lives, such as social, legal, and ethical issues. group of five (5) participants.
• product – focused on the iot products that can be generated, considering the inclusion of software and "smartness" in general objects and the possibilities of new products in this scenario. group of nine (9) participants.
• process – focused on the software development process that should be included in the things, considering the big picture of organizing the things together. group of seven (7) participants.
the groups had one hour for discussion. a representative of each group wrote down the main points identified and later presented the ideas to all the workshop participants. the second event (carried out in september 2017) was a panel at the brazilian congress on software: theory and practice (cbsoft), conducted by the same moderator as the first event. in this panel, five (5) iot domain practitioners (experts from academia and industry) and the audience were motivated to discuss the same study questions. the moderator acted as the reporter in the panel discussion, gathering the central issues and producing a document reporting the notes. next, the notes from both events were collected and analyzed. besides, open coding procedures based on gt (strauss and corbin 1990) were used and allowed the identification of nine categories of iot concerns: architecture, interoperability, professionals, quality properties, requirements, scale, social, security, and testing.
4.1.3 from the governmental report

many initiatives from governments and organizations have demonstrated a growing interest in iot. in this context, the brazilian national bank of economic development (bndes) organized a study to promote economic and social development by analyzing and proposing public policies for iot. the idea is to obtain an overview of the impact of iot in brazil, understanding the country's competencies and defining initial aspirations for promoting iot in brazil, to be documented in a plan of action. the research has been conducted since the beginning of 2017, and the available material (https://goo.gl/nmfece) was used to recover iot concerns. these results were based on registered iot initiatives developed by 11 countries and the european union, initiatives developed at a global scope, and interviews with experts on the implementation of the area in brazil. it was also based on textual analysis of 28 available documents. reading the material allowed us to extract information focusing on the presented concerns and to analyze and organize them in the same way as for the two previous information sources (the literature and the practitioners). from this, seven categories of iot concerns emerged: data, interoperability, network, professionals, regulation, security, and things.

4.2 output: putting it all together

extracting the perceptions and concerns regarding iot from different points of view was essential for the strengthening and direction of our research. although there are different perspectives, they are complementary in representing the concerns involved in producing quality software systems. together, the three sources provided 14 different concerns, which must be met in favor of higher-quality iot software systems (figure 2). we present each of the 14 categories with a definition and an example from the input sources to support its comprehension:

 architecture – issues and concerns regarding design decisions, styles, and the structure of iot systems.
excerpt example: “finding a scalable, flexible, secure and cost-efficient architecture, able to cope with the complex iot scenario, is one of the main goals for the iot adoption.” (borgia 2014);
 data – the management of large amounts of data: how to recover, represent, store, interconnect, search, and organize the data generated by iot from so many different users and devices. excerpt example: “this new field offers many research challenges, but the main goal of this line of research is to make sense of data in any iot environment. it has been pointed out that it is always much easier to create data than to analyze them.” (gil et al. 2016);
 interoperability – related to the challenge of making different systems, software, and things interact for a purpose. standards and protocols are also included as issues. excerpt example: “the end goal is to have plug n' play smart objects which can be deployed in any environment with an interoperable backbone allowing them to blend with other smart objects around them.” (gubbi et al. 2013);
 management – the application of management activities, such as planning, monitoring, and controlling, to iot systems, in which the interaction of many different things raises new difficulties. excerpt example: “iot is a very complex heterogeneous network, which includes the connections among various types of networks through various communication technologies […]. addressing things management is still a challenge.” (xu et al. 2014);
 network – technical challenges related to communication technologies, routing, access, and addressing schemes, considering the different characteristics of the devices. excerpt example: “designing an appropriate topology, routing, and mac layer is critical for scalability and longevity of the deployed network” (gubbi et al. 2013);
 professionals – investing resources in the training of engineers and other professionals can create a strategic differential. the iot scenario, however, demands more than proficiency in lower-level programming languages: the professional who develops software for iot should be able to customize solutions already developed for specific demands;
 quality properties – although some specific properties such as interoperability, privacy, and security are primarily discussed, several other quality attributes are considered different in the iot domain, such as capacity (device and network), installation difficulty, responsiveness, and context awareness. non-functional requirements should be contemplated by considering what the individual sees and feels and how the things can contribute to that;
 regulation – governments are working on crucial issues that require significant investment and coordination between the public and private sectors. among regulatory issues, standardization is one of the most critical, and there is no single strategy to follow. in some cases, the creation of specific laws and institutions is necessary to regulate privacy and security issues, a topic debated today by all the countries mentioned in the report;
 requirements – considering the innovative, idea-driven nature of iot, requirements can be presented in a less structured form.
another concern is that the user can also be a developer, since the solutions reach different types of individuals and devices and new features can be attached;
 scale – developing, managing, and maintaining a large-scale software system is a concern. as the number of devices in the software system increases, along with the number of relationships, new technologies are needed to maintain a software system at the required quality level;
 security – issues related to the several aspects of ensuring data security in iot systems. for that, a series of properties, such as confidentiality, integrity, authentication, authorization, non-repudiation, availability, and privacy, should be investigated. excerpt example: “security issues are central in iot as they may occur at various levels, investing technology as well as ethical and privacy issues […] this is extremely challenging due to the iot characteristics.” (borgia 2014);
 social – concerns related to the human end-user, understanding the situation of the users and their appliances. excerpt example: “for a lay person to fully benefit from the iot revolution, attractive and easy to understand visualization have to be created.” (gubbi et al. 2013);
 testing – iot will provide unprecedented universal access to connected devices. testbeds and acceptance tests are sophisticated, and there is a greater need for other types of tests, for example, usability, integrity, security, and performance;
 things – for the devices, including their access and gateways, there are several non-functional restrictions inherent to iot that should be present in the products. these restrictions increase the total cost of the objects, for example, the need for an alternative energy supply when it is not possible to connect to the power grid.

it is interesting to notice that the concerns are usually interrelated, confirming the multidisciplinary nature of iot. for example, the excerpt “for technology to disappear from the consciousness of the user, the internet of things demands software architectures and pervasive communication networks to process and convey the contextual information to where it is relevant” (gubbi et al. 2013) is coded both for architecture and for network. another example is “central issues are making full interoperability of interconnected devices possible, providing them with an always higher degree of smartness by enabling their adaptation and autonomous behavior, while guaranteeing trust, privacy, and security.” (atzori et al. 2010), which was coded both for interoperability and for security. providing solutions to the issues presented here can be tricky due to the diversity of concerns and the variety of devices involved.

we can see that each source has its particularities, and some are consistent with its origin. it is expected that practitioners have a more technical and in-depth view, presenting more individual and software-oriented issues regarding iot software systems. the concerns with management and quality are transversal to the implementation of such software systems and can be observed from any point of view, but the practitioners have specific quality concerns, such as meeting non-functional requirements, which bring more specificity and definition to this issue. also, requirements and testing issues are still somewhat open regarding how to represent, describe, and integrate software systems.
these three aspects must be met in the software systems regardless of their scale, which in iot software systems can reach the ultra-large scale, bringing its associated problems. these three concerns are affected by one aspect that we observed in the literature review: from the characteristics extracted, we could observe that iot properties and their characterization are not explicit, nor are the characteristics that can affect the development process of such applications. unclear characteristics can impair requirements, which in turn affect testing, hindering the overall system quality. we consider that this difficulty is partially due to conceptual aspects, since iot and the related concepts are not yet established nor enclosed by a single definition, the concept being still under discussion (shang et al. 2016).

considering the increasing number of interconnected devices, the size or scale of iot software systems can grow consistently. the systems can reach a wider scale, coupled with complicated structure-controlling techniques, which brings new challenges to their design and deployment (huang et al. 2017). new solutions for architectural foundations, orchestration, and management are essential for dealing with scale issues, especially for ultra-large-scale systems such as smart cities and autonomous vehicles (roca et al. 2018).

concerning regulation, some actions are being taken by governments (https://aioti.eu/; https://ec.europa.eu/commission/priorities/digital-single-market_en) and other institutions (https://www.kiot.or.kr/main/index.nx; https://www.digicatapult.org.uk/) to form an adequate legal framework. prompt action is necessary to provide guidance and decisions regarding governance and how to operate iot applications in a lawful, ethical, socially and politically acceptable way, respecting the right to privacy and ensuring the protection of personal data (caron et al. 2016; almeida et al. 2018).

for the devices, sensors, actuators, tags, smart objects, and all the things in the internet of things, or of everything, some of the aspects that should be taken into consideration are: a) resources and energy consumption, since intelligent devices should be designed to minimize required resources as well as costs; b) deployment, since they can be deployed one-time, incrementally, or randomly, depending on the requirements of the applications; c) heterogeneity and communication, since different things interacting with others must be available, able to communicate, and accessible (madakam et al. 2015; li et al. 2015).

figure 2. iot concerns.

at the intersection between industry and literature, we have architectural and social issues. both concerns are open due to the novelty of the area, in which how to deal with them and what to expect are still being uncovered. architecture is a recurrent issue in the literature, pointed out by liao et al. (2017) as one of the priority areas for action and reported by trappey et al. (2017) to be one of the official objectives of iso/iec jtc1. in general, the status is that there is still no consolidated standard nor well-established terminology to unify advancements for architecture in iot. regarding social concerns, given that the objects, devices, and a myriad of things are likely to be connected to many others, with people as one of the actors as well (matalonga et al. 2017), it is necessary to explore the potential sociotechnical impacts of these technologies (whitmore et al. 2015).
using such devices to provide information about and for people is one of the applications. many challenges and concerns should be addressed to achieve the benefits aimed at with iot. facilitating the development requires the design of data dissemination protocols and the evolution of solutions for privacy, security, trust maintenance, and effective economic models (guo et al. 2012). as affirmed by dutton (2014), if not designed, implemented, and governed in an appropriate way, these new iot systems could undermine such core values as equality and individual choice.

at the intersection between industry and government, we have the concern with professionals, represented by the preparation of their skills and knowledge and by teams that should be multidisciplinary to meet iot premises. if requirements, testing, and other technical activities are under discussion, we need to think about the professional who will satisfy and perform such activities (yan yu et al. 2010). with the development of iot, different people, systems, and parties will have a variety of requirements; one of the abilities required is how to translate these requirements into new technologies and products. other skills are related to managing the frequency of information generated, managing the ubiquity and the actors involved in interactions, and developing and maintaining privacy and security policies (tian et al. 2018). as the area is new and is still defining the professionals and teams that will work on it, it is essential to discuss the professional and to develop the skills and knowledge necessary for this new generation of innovators, decision-makers, and engineers (kusmin et al. 2017).

connectivity, communication, network, and the multiple related concepts that enable the evolution of interconnected objects are a critical point for the materialization of iot (gubbi et al. 2013). one of the main challenges of this scenario is the vast amount of information identified, sensed, and acted upon, which must be processed mostly in real or near-real time and delivered unobtrusively and in a personalized manner, ensuring data availability and reliability, the channel between devices, and the channel between humans and devices (mihovska and sarkar 2018). many open challenges require new approaches to a quality network in this scenario; therefore, research should progress into practice to ensure the benefits for the users.

together with network concerns, we have data issues. in a world with “anytime, anyplace connectivity for anyone and connectivity for anything” (conti 2006), we can see how quickly data can be generated and how vast amounts of information are created. some of the challenges are related to the continuous and unstructured creation of connection points (devices, things); the persistence of data objects; unknown scale; and data quality (uncertainty, redundancy, ambiguity, inconsistency, incompleteness) (gil et al. 2016).

however, above these, security and interoperability concerns are at the center of all iot-related discussions. iot enables computing capabilities in the things around us, and interoperability is the attribute that enables the interaction among heterogeneous devices with the varied requirements of different applications. interoperability can range over different levels, such as technical, syntactical, semantic, and organizational, varying according to the software system's needs. complete interoperability is an open question for current software and essential for iot due to its comprehensive nature.
issues like encryption, trust, privacy, and any security-related concerns are of utmost importance, since iot systems are inserted into people's personal lives or into industry. high-coverage procedures should guarantee software system security and trustworthiness.

5 iot facets

iot leads to an era where, rather than developing software, we need to engineer software systems embracing multidisciplinarity, integrating different areas for the realization of successful products according to their purposes. it means that software is one of the iot facets which, together with others, are necessary for iot materialization. aiming at identifying the different facets that characterize this multidisciplinarity, we performed an analysis of the iot definitions identified in the literature review (section 4.1.1). this analysis was based on gt procedures (strauss and corbin, 1990). the 34 extracted iot definitions were organized in a table with one field of “code” to assign an area, topic, or discipline (named here as a facet) related to a definition excerpt. this coding process was executed by three researchers separately, using separate and independent documents. an example of the document is presented in figure 3. it is composed of three columns: a) index, with the definition number; b) definition, where each definition is presented as extracted from the paper; c) code, with the codes associated with portions of the definition, using a color scheme to help their identification.

figure 3. example of a document filled with the definitions and marked with coding.

there were two rounds of discussions, first with two, then with all three researchers, to discuss the similarities and differences in the coding, support the concepts, and reduce bias until reaching a consensus. from this analysis, we aimed to obtain a set of facets, based on the data we had so far, and to sort out the most used ones, presenting a minimal set of areas that must be considered when building an iot software system. after merging the documents, meetings were held for discussion, part of which regarded the coding granularity level. for example, network and telecommunication can all be part of a single facet called connectivity, aiming to encompass several concepts and keep the same level of abstraction. as a result of this process, we came to the consensus that an iot software system should consider seven different facets, which are defined below, including some examples related to the identified challenges and some potentially used technologies:

 connectivity – the internet is a relevant concept naturally involved in the iot paradigm. we argue that it is necessary to have available a medium by which things can connect to materialize the iot paradigm. some form of connection, a network, is essential for the development of solutions, and our idea is not to limit connectivity to the internet only but to cover other media as well.
o related challenges example: one of the concerns for connectivity is traffic management and control to deal with the enormous amount of data generated by these devices and to guarantee the quality of service (bera et al. 2017; li et al. 2018).
o related technologies example: it uses specific solutions according to the application domain and tries to re-use legacy cellular infrastructure and invest in novel communication solutions.
it is mostly based on wireless communication technologies that can be divided into short-range, long-range, and cellular-based.
 things – in this sense, it means the things themselves in iot: tags, sensors, actuators, and all hardware that can replace the computer, expanding the reach of connectivity.
o related challenges example: dealing with heterogeneity and scale (rojas et al. 2017), with distribution (geographically distributed and sometimes in inaccessible and critical regions) (chen et al. 2018), and with mobility (iot devices are not static; they tend to move between different coverage areas) (bera et al. 2017) are issues related to requirements to be covered in iot.
o related technologies example: many solutions were combined to build devices like sensors, actuators, smartphones, microcontrollers, interactables, cameras, communication and network enablers, and others. some systems treat things by giving these devices a virtual representation, enabling remote access to and control of them.
 behavior – neither the existence of things nor their intrinsic capacities is new. what iot provides is the chance of enhancements in the things, extending their behaviors. in the beginning, the things in iot systems were objects attached to electronic tags, so these systems presented the behavior of identification. subsequently, sensors and actuators composing the software systems enabled the sensing and actuation behaviors, respectively. the use of software solutions, semantic technologies, data analytics, and other areas can be necessary to enhance the behavior of things.
o related challenges example: some emergent behavior cannot be attributed to a single system but results from the interplay of some or all systems in the network. therefore, each system involved must adjust its behavior according to the common goal, which is an open issue (brings 2017).
o related technologies example: the first and most common way to treat behavior is in stages, where the more significant behaviors are constituted by the smaller ones; with this, it is possible to reduce the complexity of taking care of the behaviors. another way to manage behavior is through the use of state machines (jackson, 2015; giammarco, 2017); a small illustrative sketch is given at the end of this section.
 smartness – smartness or intelligence is related to behavior, but in terms of managing or organizing it. it refers more to the orchestration associated with things and to what level of intelligence technology can evolve their initial behavior.
o related challenges example: in many cases, what makes a system smart is not only the devices that are used and the decision-making process but the whole solution architecture as well (atabekov et al. 2015), which leads to an architectural challenge in achieving smartness.
o related technologies example: it uses actuators and decision-makers, acting according to data autonomously collected and treated to perform some activity in the environment. it uses techniques from artificial intelligence, machine learning, neural networks, and fuzzy logic to deal with the data.
 problem domain – a problem domain is the area of expertise or application that needs to be examined to solve a problem. iot software systems are developed to reach a goal for a specific purpose; at this point, we start from a goal (the problem domain) to reach a solution (the software system). focusing on a problem domain means looking only at the topics of interest and excluding everything else.
in general, it directs the objective of the solution.
o related challenges example: iot application development is a multidisciplinary process where knowledge from many concerns intersects. this development assumes that the individuals involved in the application development have similar skills, which is in apparent conflict with the varied set of skills required during the overall engineering process (patel and cassou 2015).
o related technologies example: it varies, but the majority deal with software activities related to analysis and design and with activities to understand the problem domain.
 interactivity – it refers to the involvement of actors in the exchange of information with things and the degree to which it happens. the actors engaged with iot applications are not limited to humans; therefore, beyond the sociotechnical concerns surrounding the human actors, we also have concerns with other actors, such as animals, and with thing-thing interactions. the degree to which interaction happens works together with the medium through which things can connect (connectivity), so that in addition to being connected, they can understand each other (interoperability).
o related challenges example: the wide range of heterogeneity issues introduced among different iot devices. standardization, therefore, is a must but is not enough, as no single standard can cover everything and some organizations (manufacturers, software companies) would like to follow different standards or even proprietary protocols (dalli and bri 2016).
o related technologies example: to guarantee communication: http, xmpp, tcp, udp, coap, mqtt, and others. to guarantee understanding: json, xml, owl, ssn ontology, coci, and others.
 environment – the problem and the solution are embedded in a domain, an environment, or a context. this facet seeks to represent such an environment and how the context information can influence its use.
o related challenges example: things can be created, adapted, personalized, and rely on contextual data. the integration of things with the social and natural environment can contribute to improving this contextual data and is both a challenge and a research opportunity (davoudpour et al. 2015).
o related technologies example: in general, the environments are composed of sensors and actuators to sense and change an ambient state. technologies like iot, cloud, smart objects, middlewares, wireless sensor networks, vehicular ad-hoc networks, edge computing, artificial intelligence, machine learning, and data mining can be employed in these systems.

section 6.1 presents the facets regarding an example of a specific iot application. from the data recovered in our research, we realize that the concepts and properties related to iot change according to the context and the actors involved. for this reason, the representation should be as comprehensive as possible, covering all aspects involved, which motivates the iot facets proposition.
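as mentioned in the behavior facet, state machines are one way of managing how a thing's behavior evolves. the sketch below is a minimal, hypothetical python state machine in which a thing's behavior is enhanced stepwise from identification to sensing and then actuation; the states and allowed transitions are illustrative only, not a prescription taken from the studies above.

from enum import Enum, auto

class Behavior(Enum):
    IDENTIFICATION = auto()  # tag-only object: can be uniquely identified
    SENSING = auto()         # equipped with sensors: can observe the environment
    ACTUATION = auto()       # equipped with actuators: can change the environment

# allowed behavior enhancements: behavior is extended stepwise, in stages
TRANSITIONS = {
    Behavior.IDENTIFICATION: {Behavior.SENSING},
    Behavior.SENSING: {Behavior.ACTUATION},
    Behavior.ACTUATION: set(),
}

class Thing:
    def __init__(self, thing_id: str):
        self.thing_id = thing_id              # uniquely addressable object
        self.state = Behavior.IDENTIFICATION  # every thing starts as identifiable

    def enhance(self, target: Behavior) -> None:
        # refuse enhancements the current state does not allow
        if target not in TRANSITIONS[self.state]:
            raise ValueError(f"cannot go from {self.state.name} to {target.name}")
        self.state = target

tag = Thing("shelf-item-42")
tag.enhance(Behavior.SENSING)    # attach a sensor
tag.enhance(Behavior.ACTUATION)  # attach an actuator
print(tag.thing_id, tag.state.name)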
6 defining the framework

as observed in our investigations, the iot scenario is covered by concerns (discussed in section 4.2) that are seen and treated according to facets (detailed in section 5), which leads to challenges for its development. in this context, strategic decisions are essential to the development and need to handle all factors involved in iot without prejudice to the original software life-cycle concerns with deadlines, costs, and quality levels of products and processes (pfleeger and atlee 1998; fitzgerald and stol 2017).

in our proposal, we consider both the concerns and the facets of iot development. in the latest technologies, the software is only one of the components, since further development is necessary for requirements representation, data infrastructure, network configuration, and others (tang et al. 2004). our aim regards the conceptual organization of the recovered data, which should consider the requirements of different stakeholders and the activities in the different iot facets. with such a conceptual structure, we do not aim to guide the software system development but rather to organize the concepts more explicitly and support decision-making when engineering iot software systems. with this goal in mind, we have identified the zachman framework (zachman 1987) as a structure that could support the organization of the concepts. for this reason, we tailored the framework, presented in section 2, as in figure 4.

there are several definitions for iot, but the concept refers to a paradigm that allows composing software systems from uniquely addressable objects (things) equipped with identification, sensing, or action behaviors and processing capabilities that can communicate and cooperate to reach a goal. this understanding encompasses the definitions recovered from the literature review and states the composition and characteristics of iot. the main difference from traditional software systems regards heterogeneity, scale, and the possibilities inherent to the iot paradigm. the zachman framework is generic and flexible enough to be used in different scenarios embedding different points of view, hence our choice to use it to organize the information we gathered. because of the meaning of iot, we will have the following demands when developing an iot software system:

 a paradigm that allows composing systems: iot is not just the things by themselves. it represents a more substantial aggregate consisting of several parts. this implies that there is not a single iot solution but a myriad of options that can derive from the things and other systems available. it will require some domain- and business-specific strategies.
 from uniquely addressable objects (things): things should be distinguishable using unique ids, a unique identification for every physical object. it concerns the network solutions and hardware technologies required to devise the composing parts of the iot paradigm.
 equipped with identifying, sensing or acting behaviors and processing capabilities: once the object is identified, it is possible to enhance it with personalities and other information and enable it to connect, monitor, manage, and control things. this understanding implies that, depending on the “smartness” degree required for a setting, a software solution can be more robust and involve other technical arrangements, such as artificial intelligence.
 that can communicate and cooperate: the other part of the paradigm, alongside the things, is the internet. the internet (in a broader sense) is the connection channel of the available things. together with this network solution, things should be able to communicate, interchange, and share, among other capabilities. for this, a set of characteristics, such as interoperability, should also be present in the things.
 to reach a goal: this whole scenario is set for a purpose, for a reason, motivated by something. this primary goal is what will guide the development.
this description leads us to tailor the zachman framework into a faceted scheme, in which each part represents a facet required for an iot software system (see figure 4). we argue that a solution for iot cannot be built without considering all fundamental aspects of the paradigm, requiring multidisciplinary technologies and a diverse team to meet them. we consider the iot facets to address this multidisciplinarity. they were extracted from the literature review and cover a set of dimensions that need to be present, in different degrees, in an iot software system. this initial set can be extended if needed as the research progresses, since it is limited to the set of sources dealt with in this research.

alongside the facets, we have perspectives and communication interrogatives, both evolved from the zachman framework (sowa and zachman, 1992). the perspectives were divided into control parts (business, executive, and user), which support the definition of the problem domain, and construction parts (architect, engineer, technician, and user), which specialize the facets to solve the problem. we consider the user perspective as a hybrid because the future vision is that users will have active participation in the construction of iot solutions (singh and kapoor, 2017). the framework considers all the perspectives involved in the planning, conception, building, usage, and maintenance activities of iot software systems:

 executive perspective – it focuses on the system scope and management plans and on how the system relates to a particular context.
 business perspective – it is concerned with the business models, which constitute the business design, how they relate, and how the system will be used.
 architect perspective – it translates the designed system model and determines the logic behind a system, considering the data elements, process flows, and functions that represent the business entities and processes.
 engineer perspective – it corresponds to the technology models, which must tailor the model to the details of programming languages, devices, or other required supporting technology.
 technician perspective – the developer follows detailed specifications to build modules, sometimes without being concerned with the overall context or structure of the system.
 user perspective – it concerns the functioning system in use.

from the guidelines provided in the zachman framework, we consider the questions as communication interrogatives for our context, since the answer to each question in each perspective and each facet will give us more direct information, leading an engineer closer to the solution specification. these are fundamental questions to outline each perspective:

 what – the information required for the understanding and management of a system. it begins at a high level and, as it advances through the perspectives, the data description becomes more detailed;
 how – translating abstract goals into the definition of operations using software technologies (techniques, technologies, methods, and solutions);
 where – the location of the activities; it can be a geographical distribution or something external to the system;
 who – the roles involved with the system to deal with the facet development, detailing the representation of each one as the perspectives advance;
 when – the effects of time on the system, such as life cycles, describing the transformations and states of the system;
 why – translating the motivation, goals, and strategies into what is implemented in the facet.

the perspectives in the framework should be mapped and updated according to each iot facet, since different stakeholders are concerned with each area. in the questions part, we sought to keep the original questions and adapt the definitions to be clear-cut for iot use. for instance, the “what” is the final product to be delivered by each facet, which in turn can be composed of what each perspective delivers. in software, for example, we have the model built by the software architect and the code made by the developer; they are all part of the final product, the software product. the framework structure will be the combination of these concepts filled in as in figure 4.

figure 4. a framework for engineering iot applications.

each facet aggregates different knowledge areas, such as software and network. with this simple framework structure, we can organize existing knowledge of software technologies, observing gaps that represent possible research and development opportunities. the framework can be filled with knowledge using a bottom-up approach, with studies from the technical literature, practitioners, and real cases. we aim to achieve a complete solution on a small scale, to be evolved incrementally. to allow adjustments when necessary, we made the protocols available at delfos (http://146.164.35.157/, the observatory of the engineering of contemporary software systems) to facilitate access, dissemination, re-execution, and evolution of the findings in order to keep the body of knowledge updated. the existing knowledge may or may not be enough to cover the iot paradigm demands, and this must be investigated for each facet to develop high-quality software systems. also, each facet will be responsible not only for meeting its original premises but also for covering the concerns and essential needs of iot related to that area, such as those presented in this paper. for instance, security and interoperability (the concerns common to all sources) are transversal concerns and must be addressed in the iot facets related to things, behavior, and connectivity. as we evolve the framework structure and deepen our knowledge of the iot facets, we will seek to provide software technologies to meet the concerns as well.

the use of the framework can be performed in three steps. by aligning different stakeholders' perspectives, we want to characterize the problem domain (step 1). then, using the framework structure (figure 4), we aim to extract relevant information for the project (step 2), which supports the definition of a decision-making strategy (step 3). an example of use detailing the steps is presented in the following section.

6.1 exemplifying the usage of the framework

once we have filled the framework with relevant information regarding practices and technologies, considering the different facets and perspectives, we will use it as the basis to support a development strategy. project information is used to direct and specialize the framework, presenting the concerns that should be taken into account and used in the decision-making strategy for the specific project.
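to make this structure concrete, the sketch below represents the framework as a nested mapping of facet x perspective x interrogative; the single filled cell is a hypothetical placeholder, and the empty cells stand for the knowledge gaps discussed above.

FACETS = ["connectivity", "things", "behavior", "smartness",
          "problem domain", "interactivity", "environment"]
PERSPECTIVES = ["executive", "business", "user",          # control
                "architect", "engineer", "technician"]    # construction
INTERROGATIVES = ["what", "how", "where", "who", "when", "why"]

# body of knowledge: facet -> perspective -> interrogative -> practices/technologies
body_of_knowledge = {f: {p: {q: [] for q in INTERROGATIVES} for p in PERSPECTIVES}
                     for f in FACETS}

# hypothetical entry: how an engineer addresses connectivity
body_of_knowledge["connectivity"]["engineer"]["how"].append("wi-fi module on the device board")

# empty cells point to possible research and development opportunities
gaps = [(f, p, q)
        for f in FACETS for p in PERSPECTIVES for q in INTERROGATIVES
        if not body_of_knowledge[f][p][q]]
print(f"{len(gaps)} empty cells, e.g., {gaps[0]}")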
with this proposal, the goal is not to replace defined activities that are common in the development of traditional software projects. instead, we hope to address the particularities of iot projects, since they present different and additional characteristics that can bring challenges to their engineering. this section aims to exemplify the use of the proposed framework. for this, we rely on the results of an iot software project carried out in the context of a last-year undergraduate software engineering discipline of the computer and information engineering course at the federal university of rio de janeiro. five final-period bachelor's students with previous knowledge in engineering conventional software systems formed the development team. the course is regularly offered to support the students in working on a real problem domain demand in a tool-based software engineering environment. as a case for practice, the development team should provide a software system solution to support the raising of freshwater shrimp in farms. a claim by sebrae (the brazilian service to support micro and small enterprises) motivated it: “due to the complexity of the production process, and a large number of variables that must be constantly monitored, we suggest the acquisition of software of management, which was not found on the market with enough to be indicated here. most companies that produce software can provide such a solution, provided that there is a customization of the software.”

a professor (the last author of this paper) and members of the experimental software engineering group mentored the developers. the software project was executed in the first semester of 2018, and the product (camarão iot) was deployed in july of 2018. the proposal was, therefore, to idealize and build an iot software system to support freshwater shrimp farming (carciniculture) in brazil. based on the described motivation, we present a proof of concept of a solution organized in the structure of the proposed framework. this example enables us to show the different facets' arrangements in a basic solution. because the software system had already been implemented and design decisions had been taken, we mapped the results to exemplify the use of the framework.

6.1.1 step 1 – define problem characterization

given the lack of software solutions and the market opportunity for this product, the proposal was to idealize and build an iot software system to support freshwater carciniculture in brazil. our intention with this exemplification was to take what the team accomplished and translate it into the proposed framework. in this characterization step, from the project context, the executive, business, and user perspectives proposed in the framework are used to support the identification of the different concerns and relevant information that must be considered in the solution to be developed. different roles expressed their expectations regarding the system in the 5w1h structure. the information was condensed and mapped below:

executive perspective – the owner represents this perspective and desires a solution that will enable remote and real-time business management and allow monitoring of the overall state of production. s/he wants to receive notifications of critical conditions and current status and to receive periodic reports and estimations, anywhere.
business perspective – the manager represents this perspective and wants to receive quick and easy information at any time through the adopted technologies, to modernize production, and to have greater control to meet the foreign market. s/he needs to define deadlines and demands, receive information about the water tank, consult stock and production, receive notifications of critical conditions and current status, and receive periodic reports and estimations in real time.

user perspective – different personas were established for the user perspective, representing the following roles:

 installation overseer – s/he takes care of the installation and stock, reporting back to the manager when necessary. s/he needs something that can help the work with clear and direct visual information on when and what actions to take. the system can help to check stock status, receive notifications of demands, and notify the manager about the need for purchases.
 shrimp keeper – s/he is responsible for preparing the ration and feeding the shrimp. s/he wants a system that makes the documentation of the tanks and their characteristics simpler and easier to understand, which would make the job less stressful. another point that would help in day-to-day professional life would be facilitating the feeding process to avoid repetitive strain injuries. it would be useful to receive the feeding schedule, notify the biologist about shrimp status, and visualize tank and shrimp status.
 tank keeper – s/he monitors the tank status, performs measurements, and adjusts tank conditions. s/he would like to control the tanks more accurately and more frequently, without the need to always be running between different tanks, and wants the peace of mind that the work is according to the needs of the business. s/he wants to monitor tank status, generate reports, notify critical conditions, bring the tank back to normal conditions, notify the biologist about shrimp status, check environmental conditions that can affect the tank, and visualize tank and shrimp status.
 biologist – s/he sets the conditions and is responsible for production health. s/he would like to have historical information to perform more precise analyses and to minimize the error of his/her estimates, besides being able to compare the evolution of the production and to obtain information about the shrimp in a more accurate and faster way. s/he wants to update the required production demand, update tank conditions to achieve the production demand, define and monitor shrimp health parameters, define and monitor the feeding schedule, visualize tank and shrimp status, and generate reports.

as described above, from this step with the framework structure, it is possible to contemplate the different goals for the same solution, thus enriching the initial characterization of the project. due to the full range of perspectives and goals, the team organizes and prioritizes the primary needs. from this initial part, we defined the primary needs of a system that (1) allows the clear visualization of information regarding the whole process in real time; (2) supports the feeding of the shrimp; (3) assists in estimating production; and (4) monitors the tank status (figure 5).

figure 5. system's needs.

alongside the needs presented by the control perspectives, it is necessary to identify which information indicates a match with the facets, which will support the analysis of the body of knowledge to identify relevant knowledge to engineer a solution for that context.
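as a toy illustration of this matching, the mapping below pairs the four needs from figure 5 with the facets they touch; the facet assignments are our hypothetical reading, not the project's actual characterization artifact.

from collections import defaultdict

# hypothetical mapping of the four system needs to the iot facets they touch
needs_to_facets = {
    "(1) real-time visualization of the whole process": ["connectivity", "interactivity", "smartness"],
    "(2) support the feeding of the shrimp": ["things", "behavior", "problem domain"],
    "(3) assist in estimating production": ["smartness", "problem domain"],
    "(4) monitor the tank status": ["things", "behavior", "connectivity", "environment"],
}

# invert the mapping to see which facets concentrate the most needs
facet_load = defaultdict(list)
for need, facets in needs_to_facets.items():
    for facet in facets:
        facet_load[facet].append(need)

for facet, needs in sorted(facet_load.items(), key=lambda kv: -len(kv[1])):
    print(f"{facet}: {len(needs)} need(s)")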
the problem characterization template (used in step 1) will be defined to map the identified system needs to each facet of the body of knowledge in a way that supports the identification of the relevant knowledge. the next activity of this research in this step is the design of this template. it will comprise the investigation of the concerns defining the facets. a preliminary example of the problem characterization artifact for this case study is presented in figure 6. the idea is to capture the needs using the questions and perspectives; then, we want to map them throughout the artifact, highlighting which concerns should be considered in a given context. it aims to bridge the problem to the facets.

figure 6. a preliminary view of the problem characterization for this project.

6.1.2 step 2 – analyzing the framework structure for decision making

after characterizing the problem domain with the characterization artifact, the next step is to analyze the body of knowledge. in the context of this proposal, the framework structure can be seen as a body of knowledge. this structure has been proposed at a conceptual level, but it has not yet been wholly populated for iot systems. hence, this is one of the next proposed activities, in which we plan to conduct studies to provide evidence-based findings to fill in the body of knowledge. once the body of knowledge is organized, it can be specialized to the problem context. for instance, as presented in figure 5, one of the needs refers to (4) monitor the tank status. this feature represents goals from the owner (executive perspective), the manager (business perspective), and the tank keeper and biologist (user perspective), and it can be developed through different solutions (table 3). the body of knowledge specialization should assist in the decision-making to implement the desired solution considering this feature's properties.

table 3. possible solutions to monitor tank status.
 manually – the manager defines the required shrimp production and requests a production report, communicating verbally with the biologist. the biologist sets new parameters for the tank and goes to the tank keeper to inform him. the tank keeper manually adjusts tank conditions to meet the demand; he also manually collects information for the production report and delivers the report to the manager. there is no technical support in the process.
 communication support – the manager defines the required shrimp production and requests a production report, using a communication system to inform the biologist. the biologist defines new parameters for the tank and uses the communication system to inform the tank keeper. the tank keeper manually adjusts tank conditions to meet the demand; he also manually collects information for the production report and delivers the report to the manager. there is technical support for communication in the process.
 control support – the manager defines the required shrimp production and requests a production report using a control system. the system notifies the biologist, who defines new parameters for the tank. the system notifies the tank keeper, who manually adjusts tank conditions to meet the demand; he also manually collects information for the production report and makes the report available in the system. there is technical support for control in the process.
 sensing support – the manager defines the required shrimp production and requests a production report using the system. the system notifies the biologist, who defines new parameters for the tank. the system notifies the tank keeper, who manually adjusts tank conditions to meet the demand; he automatically collects information from the sensors for the production report and makes the report available in the system. there is technical support for sensing in the process.
 actuation support – the manager defines the required shrimp production and requests a production report using the system. the system notifies the biologist, who defines new parameters for the tank. the system notifies the tank keeper, who uses the system actuators to adjust the tank conditions to meet the demand; he automatically collects information from the sensors for the production report and makes the report available in the system. there is technical support for actuation in the process.

the solutions presented are simplified and high-level, intended only to exemplify the variety of options that depend on technology to a greater or lesser degree. for example, if we choose the sensing support solution, exemplified in table 3, we can analyze which relevant knowledge from the body of knowledge should be taken into account, as shown in table 4. in order to support the decision-making that guides the choice and development of the proposed solution, the body of knowledge aims to present the practices and technologies that allow engineers to develop the chosen solution.

table 4. some examples of possible practices and technologies from the body of knowledge (sensing support).
 connectivity – bluetooth low energy, zigbee, z-wave, nfc (near field communication), rfid (radio-frequency identification), and wi-fi as enabling technologies; low-power wide-area technologies; sigfox; ingenu-rpma (random phase multiple access); 2g, 3g, 4g; software-defined networking (sdn) and network function virtualization (nfv); and others.
 things – temperature sensor ttc104, temperature sensor ds18b20, luminosity sensor ldr 5mm, rain sensor fc37, rain sensor grove, humidity and temperature sensor rht03, gravity ph sensor, and others.
 behavior – collect water ph value, water level, water turbidity, water oxygenation, water salinity, and water temperature.

6.1.3 step 3 – generate decision-making strategy

the output from the previous step (table 4) should be presented as a set of software practices and technologies, with options from the body of knowledge specialization, and will compose the strategy to support the decision-making that drives the solution. from the established problem domain (the context), the team started the solution engineering. the project was conducted with the team working together; therefore, there was no formal division of work for construction roles such as the architect, engineer, and technician perspectives. they worked as a group to achieve the expected results. for this reason, in this exemplification, we cannot represent the different construction perspectives. it is for illustration purposes and does not yet demonstrate the full framework potential, which is a crucial issue in the continuity of this research. the solution implemented for need (4), monitor the tank status, is presented in figure 7 and was implemented in a floater. the floater collects data from the water tank where it is deployed, working at each determined interval of time.
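a minimal sketch of such a periodic sensing loop is shown below, assuming a hypothetical dashboard endpoint and placeholder sensor readings; the actual floater runs on an arduino board with a wi-fi module, as described next.

import json
import time
import urllib.request

READ_INTERVAL_S = 60  # operator-adjustable sampling frequency (hypothetical value)
DASHBOARD_URL = "http://dashboard.local/api/readings"  # hypothetical endpoint

def read_sensors() -> dict:
    # placeholders for the floater's sensor reads (level, turbidity, temperature)
    return {
        "water_level_cm": 87.5,
        "turbidity_ntu": 4.2,
        "temperature_c": 27.9,
        "timestamp": time.time(),
    }

def publish(reading: dict) -> None:
    # push one reading to the dashboard over the network (wi-fi in the real floater)
    req = urllib.request.Request(
        DASHBOARD_URL,
        data=json.dumps(reading).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)

while True:
    publish(read_sensors())      # sensing and data-provision behaviors
    time.sleep(READ_INTERVAL_S)  # works at each determined interval of time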
an operator can adjust the frequency at which the dashboard updates the information received from the sensors implemented in the floater. a dashboard panel was implemented to enable the visualization of the data collected by the floater and addresses need (1), which allows the clear visualization of information regarding the whole process in real time. in this context, it is a technological arrangement for data exhibition, where the data producers are the sensors in the floater, which, through wi-fi connectivity, can share data with the dashboard to exhibit the data. the overall floater solution encompasses (figure 7):

 behaviors: sensing and data collection, to collect water level, water turbidity, and water temperature, as well as processing, to provide data for the dashboard.
 things: the water level, water turbidity, water temperature, and water salinity sensors were implemented on an arduino board that worked as the processing unit.
 interactivity: it interacts with the dashboard to provide data.
 connectivity: it supports the provision of data for the dashboard, implemented by a wi-fi module on the arduino.
 environment: the water tank was the environment set for the sensors to collect data, together with the network layer used for connectivity.

figure 7. floater solution implemented for need (4).

it is necessary to emphasize that the previously described context is only a simplified example of using the framework and does not represent its full use. it was used to illustrate, in a real-case scenario, how the different facets overlap and impact each other. it requires a multidisciplinary view of the problem and an adequate development strategy to embrace different disciplines and skills for the accomplishment of successful iot software systems. we understand that more research is needed to address the open points and to evolve the proposal in general. these represent future tasks to be conducted throughout the continuity of this research.

7 final remarks

the emergence of iot software systems brings new challenges to software engineering. to address these challenges, we should change our way of developing a software system from a monolithic structure to a broader multidisciplinary approach. this paper has presented the results obtained by analyzing data acquired through different strategies, which identified challenges in engineering iot software systems, as well as the initial results of a conceptual framework to support their development. first, we identified concerns from the technical literature, practitioners, and a government report. next, we presented the facets that compose iot software systems, derived from a qualitative study. these results can support practitioners in evaluating the risks of constructing iot applications and highlight some research opportunities for researchers. then we presented a conceptual framework, a way of summarizing the results of the executed studies and structurally presenting the multidisciplinarity of iot. this structure shall be filled with the existing knowledge of software technologies; empty cells can identify current technology gaps in engineering iot software systems. the contribution of this work is to explain a set of concerns that need to be investigated, showing that it is necessary to distinguish these new software systems from traditional ones. also, the work evolves the zachman framework to allow the necessary multi-facet representation.
7.1 threats to validity

the literature review used only scopus as a search engine, so it may be missing some relevant studies. however, from our experience, it can give reasonable coverage when combined with backward and forward snowballing procedures (matalonga et al. 2015; motta et al. 2016). data extraction and interpretation biases were mitigated with crosschecking between two researchers and by having a third researcher revise the results. all phases of this review were peer-revised; any doubt was discussed among the readers to reduce selection bias. we have not performed a quality assessment regarding the research methodology of the selected studies, due to the lack of information in the secondary reports; therefore, this is a threat to this study's validity. however, the triangulation with the data acquired from practitioners and the information extracted from the government report strengthened the representativeness of the data and reduced the researchers' bias, reinforcing the results. for both the data collected from practitioners and from the government, the interpretation of the data was supported by gt practices, which allowed consistency among the researchers and a shared understanding of the central concepts. however, other perspectives could be used for data interpretation, imposing a risk of changing the results. this represents a threat to any qualitative study and constitutes a menace that we cannot completely mitigate.

7.2 ongoing works

we foresee some scenarios for the utilization of the proposed framework. as envisioned contributions of its use, we initially expect the production of scientific research that considers knowledge essential to practitioners concerning the problem domain. such knowledge will compose the body of knowledge, which can be useful for both researchers and practitioners sharing and exchanging it. we consider that the evidence-based facets and perspectives have the potential to support the collection of the various practices and technologies that can be used in iot. we expect that the more a facet is filled in a given perspective in response to a question (for example, the answer to the “how”, in the engineer perspective, in the behavior facet), the more evidence will exist about it, which aids decision-making in practice. in turn, the lack of answers (for example, an empty cell in the body of knowledge) may represent a research opportunity for academia. in this sense, opportunities and risks are opposites, since an opportunity for researchers is a risk for practitioners. once we have filled the cells, it is possible that some will remain empty because current technologies do not meet iot needs, representing research opportunities; both practice and research can then better observe what we know and do not know regarding development, since this will allow visualizing where the engineering stands regarding iot. our next steps include filling in the facets in the manner proposed by zachman (sowa and zachman 1992). however, our primary research aims to fill the cells in the matrix. we conjecture that some of the slots will be empty or partially filled, which means the available software technologies do not support such activities in the way required for iot. therefore, they can represent research and development opportunities, which are necessary for the establishment of iot as a reality.
another conjecture is that some of the concerns can repeat themselves in different slots and different facets, which we call transversal challenges. these cross-sectional slots represent broader concerns that should cover the iot software system as a whole, for example, security and interoperability issues. we aim to investigate transversal challenges in the near future. after that, we plan to evaluate and refine this conceptual framework.

8 declarations

availability of data and materials: details of the protocol are available at https://goo.gl/czvvdc.

competing interests: the authors declare that they have no competing interests.

funding: we thank cnpq for the grant. professor travassos is a cnpq researcher. this study was financed in part by the coordenação de aperfeiçoamento de pessoal de nível superior - brasil (capes) - finance code 001.

consent for participation and publication: not applicable.

acknowledgments: not applicable.

references

agina a, matheus edward iy, shalannanda w (2016) enhanced information security management system framework design using iso 27001 and zachman framework: a study case of xyz company. in: 2016 2nd international conference on wireless and telematics (icwt). ieee, pp 62–66
alegre u, augusto jc, clark t (2016) engineering context-aware systems and applications: a survey. j syst softw 117:55–83. doi: 10.1016/j.jss.2016.02.010
almeida vaf, doneda d, moreira da costa e (2018) humane smart cities: the need for governance. ieee internet comput 22:91–95. doi: 10.1109/mic.2018.022021671
andrade rmc, carvalho rm, de araújo il, et al. (2017) what changes from ubiquitous computing to internet of things in interaction evaluation? in: international conference on distributed, ambient, and pervasive interactions. pp 3–21
aniculaesei a, grieser j, rausch a, et al. (2018) towards a holistic software systems engineering approach for dependable autonomous systems. in: proceedings of the 1st international workshop on software engineering for ai in autonomous systems (sefais '18). acm press, new york, pp 23–30
atabekov a, starosielsky m, lo dc-t, he js (2015) internet of things-based temperature tracking system. in: 2015 ieee 39th annual computer software and applications conference. ieee, pp 493–498
atzori l, iera a, morabito g (2010) the internet of things: a survey. comput netw 54:2787–2805. doi: 10.1016/j.comnet.2010.05.010
badreddin o (2013) thematic review and analysis of grounded theory application in software engineering. adv softw eng 2013:1–9. doi: 10.1155/2013/468021
basili vr, caldiera g, rombach hd (1994) goal question metric paradigm
bauer c, dey ak (2016) considering the context in the design of intelligent systems: current practices and suggestions for improvement. j syst softw 112:26–47. doi: 10.1016/j.jss.2015.10.041
bera s, misra s, vasilakos av (2017) software-defined networking for the internet of things: a survey. ieee internet things j 4:1994–2008. doi: 10.1109/jiot.2017.2746186
biolchini j, mian pg, candida a, natali c (2008) software and data technologies. springer berlin heidelberg, berlin, heidelberg
bondar s, hsu jc, pfouga a, stjepandić j (2017) agile digital transformation of system-of-systems architecture models using the zachman framework. j ind inf integr 7:33–43. doi: 10.1016/j.jii.2017.03.001
borgia e (2014) the internet of things vision: key features, applications, and open issues. comput commun 54:1–31. doi: 10.1016/j.comcom.2014.09.008
brings j (2017) verifying cyber-physical system behavior in the context of cyber-physical system-networks. in: 2017 ieee 25th international requirements engineering conference (re). ieee, pp 556–561
caron x, bosua r, maynard sb, ahmad a (2016) the internet of things (iot) and its impact on individual privacy: an australian perspective. comput law secur rev 32:4–15. doi: 10.1016/j.clsr.2015.12.001
carver j (2007) the use of grounded theory in empirical software engineering. in: empirical software engineering issues: critical assessment and future directions. springer berlin heidelberg, pp 42–42
chen g, tang j, coon jp (2018) optimal routing for multihop social-based d2d communications in the internet of things. ieee internet things j 5:1880–1889. doi: 10.1109/jiot.2018.2817024
cni, confederação nacional da indústria (2016) indústria 4.0: novo desafio para a indústria brasileira. indicadores cni 17:13
conti jp (2006) itu internet reports 2005: the internet of things. commun eng 4:20. doi: 10.1049/ce:20060603
costa b, pires pf, delicato fc (2017) specifying functional requirements and qos parameters for iot systems. in: 2017 ieee 15th intl conf on dependable, autonomic and secure computing, 15th intl conf on pervasive intelligence and computing, 3rd intl conf on big data intelligence and computing and cyber science and technology congress (dasc/picom/datacom/cyberscitech). ieee, pp 407–414
dalli a, bri s (2016) acquisition devices in the internet of things: rfid and sensors. j theor appl inf technol 90:194–200
davoudpour m, sadeghian a, rahnama h (2015) synthesizing social context for making the internet of things environments more immersive. in: 2015 6th international conference on the network of the future (nof). ieee, pp 1–5
de farias cm, brito ic, pirmez l, et al. (2017) comfit: a development environment for the internet of things. future gener comput syst 75:128–144. doi: 10.1016/j.future.2016.06.031
de villiers d (2001) using the zachman framework to assess the rational unified process. the rational edge
desolda g, ardito c, matera m (2017) empowering end users to customize their smart environments. acm trans comput-hum interact 24:1–52. doi: 10.1145/3057859
dutton wh (2014) putting things to work: social and policy challenges for the internet of things. info 16:1–21. doi: 10.1108/info-09-2013-0047
fitzgerald b, stol k-j (2017) continuous software engineering: a roadmap and agenda. j syst softw 123:176–189. doi: 10.1016/j.jss.2015.06.063
gil d, ferrández a, mora-mora h, peral j (2016) internet of things: a review of surveys based on context-aware intelligent services. sensors 16:1069. doi: 10.3390/s16071069
gluhak a, krco s, nati m, et al. (2011) a survey on facilities for experimental internet of things research. ieee commun mag 49:58–67. doi: 10.1109/mcom.2011.6069710
goethals fg, snoeck m, lemahieu w, vandenbulcke j (2006) management and enterprise architecture click: the fad(e)e framework. inf syst front 8:67–79. doi: 10.1007/s10796-006-7971-1
gubbi j, buyya r, marusic s, palaniswami m (2013) internet of things (iot): a vision, architectural elements, and future directions. future gener comput syst 29:1645–1660. doi: 10.1016/j.future.2013.01.010
guo b, yu z, zhou x, zhang d (2012) opportunistic iot: exploring the social side of the internet of things. in: proceedings of the 2012 ieee 16th international conference on computer supported cooperative work in design (cscwd). ieee, pp 925–929
huang j, duan q, xing c-c, wang h (2017) topology control for building a large-scale and energy-efficient internet of things. ieee wirel commun 24:67–73. doi: 10.1109/mwc.2017.1600193wc
ieee (2004) guide to the software engineering body of knowledge. ieee computer society press
jacobson i, spence i, ng p-w (2017) is there a single method for the internet of things? commun acm 60:46–53. doi: 10.1145/3106637
khaitan sk, mccalley jd (2015) design techniques and applications of cyberphysical systems: a survey. ieee syst j 9:350–365. doi: 10.1109/jsyst.2014.2322503
kusmin m, saar m, laanpere m, rodriguez-triana mj (2017) work in progress - smart schoolhouse as a data-driven inquiry learning space for the next generation of engineers. in: 2017 ieee global engineering education conference (educon). ieee, pp 1667–1670
larrucea x, combelles a, favaro j, taneja k (2017) software engineering for the internet of things. ieee softw 34:24–28. doi: 10.1109/ms.2017.28
li s, xu ld, zhao s (2015) the internet of things: a survey. inf syst front 17:243–259. doi: 10.1007/s10796-014-9492-7
li s, xu ld, zhao s (2018) 5g internet of things: a survey. j ind inf integr 10:1–9. doi: 10.1016/j.jii.2018.01.005
liao y, deschamps f, loures edfr, ramos lfp (2017) past, present, and future of industry 4.0: a systematic literature review and research agenda proposal. int j prod res 55:3609–3629. doi: 10.1080/00207543.2017.1308576
lu y (2017) industry 4.0: a survey on technologies, applications, and open research issues. j ind inf integr 6:1–10. doi: 10.1016/j.jii.2017.04.005
madakam s, ramaswamy r, tripathi s (2015) internet of things (iot): a literature review. j comput commun 3:164–173. doi: 10.4236/jcc.2015.35021
matalonga s, rodrigues f, travassos g (2015) challenges in testing context-aware software systems. in: 9th workshop on systematic and automated software testing. belo horizonte, brazil, pp 51–60
matalonga s, rodrigues f, travassos gh (2017) characterizing testing methods for context-aware software systems: results from a quasi-systematic literature review. j syst softw 131:1–21. doi: 10.1016/j.jss.2017.05.048
mihovska a, sarkar m (2018) new advances in the internet of things. springer international publishing, cham
motta rc, de oliveira km, travassos gh (2018) on challenges in engineering iot software systems. in: proceedings of the xxxii brazilian symposium on software engineering (sbes '18). acm press, new york, pp 42–51
motta rc, de oliveira km, travassos gh (2019) a framework to support the engineering of internet of things software systems. in: eics '19, june 18–21, 2019, valencia, spain. doi: 10.1145/3319499.3328239
motta rc, oliveira km de, travassos gh (2016) characterizing interoperability in context-aware software systems. in: 2016 vi brazilian symposium on computing systems engineering (sbesc). ieee, pp 203–208
nielsen cb, larsen pg, fitzgerald j, et al. (2015) systems of systems engineering: basic concepts, model-based techniques, and research directions. acm comput surv 48:1–41. doi: 10.1145/2794381
nogueira jm, romero d, espadas j, molina a (2013) leveraging the zachman framework implementation using the action-research methodology - a case study: aligning the enterprise architecture and the business goals. enterp inf syst 7:100–132. doi: 10.1080/17517575.2012.678387
panetto h, baïna s, morel g (2007) mapping the iec 62264 models onto the zachman framework for analyzing products information traceability: a case study. j intell manuf 18:679–698. doi: 10.1007/s10845-007-0040-x
patel p, cassou d (2015) enabling high-level application development for the internet of things. j syst softw 103:62–84. doi: 10.1016/j.jss.2015.01.027
pfleeger sl, atlee jm (1998) software engineering: theory and practice. pearson education india
roca d, milito r, nemirovsky m, valero m (2018) fog computing in the internet of things. springer international publishing, cham
rojas ra, rauch e, vidoni r, matt dt (2017) enabling connectivity of cyber-physical production systems: a conceptual framework. procedia manuf 11:822–829. doi: 10.1016/j.promfg.2017.07.184
sánchez guinea a, nain g, le traon y (2016) a systematic review on the engineering of software for ubiquitous systems. j syst softw 118:251–276. doi: 10.1016/j.jss.2016.05.024
santos i de s, andrade rm de c, rocha ls, et al. (2017) test case design for context-aware applications: are we there yet? inf softw technol 88:1–16. doi: 10.1016/j.infsof.2017.03.008
seaman cb (1999) qualitative methods in empirical studies of software engineering. ieee trans softw eng 25:557–572. doi: 10.1109/32.799955
shang x, zhang r, zhu x, zhou q (2016) design theory, modeling, and the application for the internet of things service. enterp inf syst 10:249–267. doi: 10.1080/17517575.2015.1075592
sousa p, pereira c, vendeirinho r, et al. (2007) applying the zachman framework dimensions to support business process modeling. in: digital enterprise technology. springer us, boston, ma, pp 359–366
sowa jf, zachman ja (1992) extending and formalizing the framework for information systems architecture. ibm syst j 31:590–616. doi: 10.1147/sj.313.0590
spínola ro, travassos gh (2012) towards a framework to characterize ubiquitous software projects. inf softw technol 54:759–785. doi: 10.1016/j.infsof.2012.01.009
strauss a, corbin j (1990) basics of qualitative research: techniques and procedures for developing grounded theory. sage publications, newbury park
tang a, han j, chen p (2004) a comparative analysis of architecture frameworks. in: 11th asia-pacific software engineering conference. ieee, pp 640–647
technology i (2015) requirement formalization using owl ontology-based zachman framework
tian b, yu s, chu j, li w (2018) analysis of direction on product design in the era of the internet of things. matec web conf 176:01002. doi: 10.1051/matecconf/201817601002
trappey ajc, trappey cv, hareesh govindarajan u, et al. (2017) a review of essential standards and patent landscapes for the internet of things: a key enabler for industry 4.0. adv eng inform 33:208–229. doi: 10.1016/j.aei.2016.11.007
whitmore a, agarwal a, da xu l (2015) the internet of things: a survey of topics and trends. inf syst front 17:261–274. doi: 10.1007/s10796-014-9489-2
wohlin c (2014) guidelines for snowballing in systematic literature studies and a replication in software engineering. in: proceedings of the 18th international conference on evaluation and assessment in software engineering (ease '14), pp 1–10. doi: 10.1145/2601248.2601268
xu ld, he w, li s (2014) internet of things in industries: a survey. ieee trans ind inform 10:2233–2243. doi: 10.1109/tii.2014.2300753
yu y, wang j, zhou g (2010) the exploration in the education of professionals in applied internet of things engineering. in: 2010 4th international conference on distance learning and education. ieee, pp 74–77
zachman ja (1987) a framework for information systems architecture. ibm syst j 26:276–292. doi: 10.1147/sj.263.0276
zambonelli f (2016) towards a general software engineering methodology for the internet of things
zhang c, shi x, chen d (2014) safety analysis and optimization for networked avionics system. in: 2014 ieee/aiaa 33rd digital avionics systems conference (dasc). ieee, pp 4c1-1–4c1-12

journal of software engineering research and development, 2022, 10:4, doi: 10.5753/jserd.2021.1878
this work is licensed under a creative commons attribution 4.0 international license.

investigating knowledge management in human-computer interaction design

murillo v. h. b. castro [ federal university of espírito santo | murillo.castro@aluno.ufes.br ]
simone d. costa [ federal university of espírito santo | simone.costa@ufes.br ]
monalessa p. barcellos [ federal university of espírito santo | monalessa@inf.ufes.br ]
ricardo de a. falbo [ federal university of espírito santo | falbo@inf.ufes.br ]

abstract

developing interactive systems is a challenging task. it involves concerns related to human-computer interaction (hci), such as usability and user experience. therefore, hci design must be addressed when developing such systems. hci design often involves people with different backgrounds, which makes communication and knowledge transfer a challenging issue. in this scenario, knowledge management can support the understanding of concepts from different knowledge areas and help teams learn from previous experiences. aiming at investigating how knowledge management has supported hci design and contributed to the development of interactive systems, we performed a mapping study in the literature and analyzed 15 publications reporting the use of knowledge management in hci design. following that, we conducted a survey with 39 hci design professionals to find out how knowledge has been managed in their hci design practice. in this paper, we present the studies and discuss their main findings. in summary, the results indicate that knowledge management has been used in hci design mainly to improve product quality and reduce the effort and time spent on design activities. however, there is a need for simpler and more practical knowledge-based solutions to support hci design; such approaches would be capable of reaching more hci design practitioners who could benefit from them.

keywords: hci design, mapping study, survey, knowledge management, interactive system

1 introduction

the interest in interactive systems and their impact on people's lives has promoted the study and practice of usability (carroll, 2014). usability is a key aspect of a successful interactive system and is related to user efficiency and satisfaction when interacting with the system. for an interactive system to reach high usability levels, it is necessary to take human-computer interaction (hci) design aspects into account during its development process (carroll, 2014). hci is concerned with usability and other aspects related to the interaction between users and computer systems, necessary to produce more usable software (carroll, 2014). it involves knowledge from multiple fields, such as ergonomics, cognitive science, user experience and human factors, among others (sutcliffe, 2014).
due to the diverse body of knowledge involved when designing interactive systems, interactive system development teams are frequently multidisciplinary, joining people from different backgrounds, each with their own technical language, terms and knowledge. collaboration among team members is not straightforward, since hci designers and developers, for example, look at the same problem from different perspectives, which leads to difficulties that include the lack of a shared vocabulary and harsh epistemological conflicts (neto et al., 2020). even the conceptualization of the product may be conflicting among different stakeholders, which hampers communication and knowledge transfer (carroll, 2014; rogers et al., 2011).

developing software is a knowledge-intensive task. knowledge management (km) principles and practices have been successfully applied to support knowledge capture, storage, use and transfer in the software development context in general (rus & lindvall, 2002; valaski et al., 2012). km can also help address challenges in the design of interactive systems, since it can provide support to capture and represent knowledge in an accessible and reusable way and facilitate collaboration among team members. for example, design solutions developed by an organization can be stored and related to the requirements that motivated them, the components and patterns used to build them and their evaluation results. as a result, the team can learn from previous experiences and share a common understanding of the system, producing better products and performing processes more efficiently.

considering the challenges of designing interactive systems, mainly due to the diversity of knowledge and people involved, and the potential of km to help address those challenges, we decided to investigate the use of km in hci design. although km can be used in different domains and there are general motivations for using it (e.g., knowledge structuring) and benefits (e.g., improved knowledge reuse) provided by its use, km can be applied to solve specific problems in each domain, different techniques can be used, and so on. thus, the main question that guided our investigation refers to how km has been used in the hci design domain. besides investigating general motivations and benefits observed in the use of km in the hci design domain, we also intended to identify specificities of the use of km in that domain.

first, we searched for secondary studies addressing the research topic. since we did not find any, we decided to perform a systematic mapping of the literature. we analyzed 12 different km approaches used in hci design, identified from 15 publications. in general, km has aided hci design mainly by enabling the replicability of knowledge and solutions and by improving product quality and communication. however, difficulty in generalizing knowledge, issues related to features of the supporting systems and low engagement of the team have been pointed out as challenges to implementing km in the hci design context.

after investigating the literature, we performed a survey with 39 brazilian hci design practitioners, who were asked about how knowledge has been managed in hci design practice. most participants are concerned with managing hci design knowledge and perceive that km helps them to improve product quality and reduce the effort and time spent on hci design activities.
they follow organizational or individual km practices and apply technologies such as brainstorming, mental models and electronic spreadsheets.

this paper presents our studies (the mapping study and the survey) and their main results. it extends our previous work (castro et al., 2020), in which we presented the main results of our mapping study, by adding information about the survey and presenting a more comprehensive view of the mapping results, updating the search period and providing new information (e.g., new graphs and details about the identified km approaches). the mapping and survey results are further analyzed together, providing an overview of the research and practice of km in hci design and pointing out some gaps that can be addressed in future research.

the paper is organized as follows: section 2 provides the background for the paper, addressing hci design and km; section 3 concerns the mapping study; section 4 addresses the survey; section 5 provides a consolidated view of the mapping and the survey results; and section 6 presents our final considerations.

2 background

2.1 hci design

hci design focuses on how to design a system that supports the user in achieving her goals through the interaction between her and the system (sutcliffe, 2014). it is concerned with usability and other important attributes such as user experience, accessibility and communicability. usability is the extent to which a system, product or service can be used by specified users to achieve specified goals with effectiveness, efficiency and satisfaction in a specified context of use (iso, 2019). it addresses the effort and ease of the user during the interaction, considering her cognitive, perceptive and motor skills. user experience relates to users' emotions and feelings and is essential for interaction design because it takes into account how a product behaves and is used by people in the real world (rogers et al., 2011). accessibility refers to the removal of barriers that prevent access to the interface and the interaction. finally, communicability concerns the ability of the interface to communicate the design logic to the user (de souza, 2005).

hci design is user-centered, hence it is called user-centered design (ucd) (chammas et al., 2015). ucd is based on ergonomics, usability and human factors. it focuses on the use and development of interactive systems, with an emphasis on making products usable and understandable. it puts human needs, capabilities and behavior first, then designs the system to accommodate them. its main principles are user focus (the user's characteristics, needs and objectives), observable metrics (user performance and reactions) and iterative design (repeat as often as needed) (chammas et al., 2015; iso, 2019). the term human-centered design (hcd) has been adopted in place of ucd to emphasize the impact on all stakeholders, not just on those considered users (iso, 2019). in general, ucd involves: understand and specify context of use, which aims to study the product users and intended uses; specify requirements, which aims to identify user needs and specify functional and other requirements for the product; produce design solutions, which aims to achieve the best user experience and includes the production of artifacts such as prototypes and mock-ups that will later serve as a basis for developing the system; and evaluation, when the user evaluates the results produced in the previous activities (iso, 2019).
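as a rough data-structure sketch of the reuse idea raised in the introduction, in which design solutions are stored together with the requirements that motivated them, the patterns and components used and their evaluation results, consider the following c++ fragment; all type, field and function names, as well as the sample data, are hypothetical.

    #include <iostream>
    #include <string>
    #include <vector>

    // one stored design solution and the artifacts it is related to
    struct DesignSolution {
        std::string description;                // e.g., a prototype or mock-up
        std::vector<std::string> requirements;  // requirements that motivated it
        std::vector<std::string> patterns;      // components and patterns used
        std::vector<std::string> evaluations;   // usability evaluation results
    };

    // retrieve past solutions whose requirements mention the current need, so
    // the team can learn from previous experiences before designing anew
    std::vector<DesignSolution> findSimilar(const std::vector<DesignSolution>& repo,
                                            const std::string& need) {
        std::vector<DesignSolution> hits;
        for (const auto& s : repo)
            for (const auto& r : s.requirements)
                if (r.find(need) != std::string::npos) { hits.push_back(s); break; }
        return hits;
    }

    int main() {
        std::vector<DesignSolution> repo = {
            {"checkout wizard mock-up", {"one-page checkout"}, {"wizard pattern"},
             {"task completion improved in user test"}}};
        for (const auto& s : findSimilar(repo, "checkout"))
            std::cout << "reusable solution: " << s.description << "\n";
        return 0;
    }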
hci design can be understood as a knowledge-intensive process, requiring effective mechanisms to collaboratively create and support a shared understanding about users, the system, its purposes, its context of use and the design necessary for the user to achieve her goals. therefore, hci design could take advantage of km solutions.

2.2 knowledge management

according to schneider (2009), knowledge is a human specialty stored in people's minds, acquired through experience and interaction with their environment. historically, an organization's knowledge was undocumented, being represented through the skills, experience and knowledge of its professionals, typically tacit knowledge (rus & lindvall, 2002), which made its use and access limited and difficult (o'leary, 1998). knowledge management (km) aims to transform tacit and individual knowledge into explicit and shared knowledge. by raising individual knowledge to the organizational level, km promotes knowledge propagation and learning, making knowledge accessible and reusable across the entire organization (o'leary, 1998; rus & lindvall, 2002; schneider, 2009).

knowledge helps software organizations react faster and better, supporting more accurate and precise responses, which contributes to increasing software quality and client satisfaction (schneider, 2009). when an organization implements km, its experiences and knowledge are recorded, evaluated, preserved and systematically propagated to solve problems (schneider, 2009). thus, km addresses knowledge along its evolution cycle, which consists of creating, capturing, transforming, accessing and applying knowledge (rus & lindvall, 2002; schneider, 2009). in the software process context, km works towards explicitly and systematically managing knowledge, addressing knowledge acquisition, storage, organization, evolution, retrieval and usage. among other aspects, km has been applied in the software development context to support document management, competence management, expert identification, software reuse, learning, and product and project memory (rus & lindvall, 2002). by investigating empirical studies of km in software engineering, bjørnson & dingsøyr (2008) reported that the studies' major focus has been on explicit knowledge and that there is a need to also focus on tacit knowledge.

3 systematic mapping: km in hci design according to the literature

considering the challenges involving knowledge transfer and sharing in the hci design context and the benefits of using km in the software development context, we decided to investigate the use of km in hci design through a mapping study. a mapping study is a secondary study designed to give an overview of a research area by classifying and counting contributions with respect to the categories of that classification. it makes a broad study of a specific theme and aims to identify the available evidence about that topic (petersen et al., 2015). moreover, the panorama provided by a mapping study allows identifying issues in the researched topic that could be addressed in future research. we followed the process defined in kitchenham & charters (2007), which comprises three phases: (i) planning: in this phase, the topic of interest, the study context and the object of the analysis are established.
the research protocol to be used to perform the research is defined, containing all the information necessary for a researcher to perform the research: research questions, sources to be searched, publication selection criteria, procedures for data storage and analysis, and so on. the protocol must be evaluated by experts and tested to verify its feasibility, i.e., whether the results obtained are satisfactory and whether the protocol execution is viable in terms of time and effort. once the protocol is approved, it can be used to conduct the research. (ii) conducting: in this phase, the research is performed according to the protocol. publications are selected and data are extracted, stored and quantitatively and qualitatively analyzed. (iii) reporting: in this phase, the produced research results are recorded and made available to potentially interested parties.

next, in section 3.1, we present the research protocol followed in our study. section 3.2 summarizes the mapping study results. section 3.3 discusses the results and section 3.4 regards threats to validity.

3.1 research protocol

this section presents the protocol used in the mapping study. it was defined gradually, being tested with an initial set of publications and then refined until we reached the final version, which was evaluated by another researcher, resulting in the protocol used in the study and presented in this section. the study goal was to investigate the use of km in the hci design context. to achieve this goal, we defined the research questions presented in table 1.

table 1. systematic mapping: research questions and their rationale.
rq1. when and where have publications been published? rationale: give an understanding of when and where (journal/conference/workshop) publications about km in the hci design context have been published.
rq2. which types of research have been done? rationale: investigate which type of research is reported in each selected publication. we consider the classification defined in wieringa et al. (2005). this question is useful to evaluate the maturity stage of the research topic.
rq3. why has km been used in the hci design context? rationale: understand the purposes and reasons for using km in hci design and verify whether there have been predominant motivations.
rq4. which knowledge has been managed in the hci design context? rationale: investigate which knowledge items have been managed in the hci design context, aiming to verify whether some of them have been managed more frequently and whether there has been more interest in certain hci aspects.
rq5. how is the managed knowledge related to the hci design process? rationale: understand, in the context of the hci design process, where the managed knowledge has come from and where it has been used.
rq6. how has km been implemented in the hci design context? rationale: investigate how km has been implemented in the hci context in terms of the adopted technologies.
rq7. which benefits and difficulties have been noticed when using km in the hci design context? rationale: identify the benefits and difficulties of using km in the hci design context and analyze whether there is a relation between them.

rq1 and rq2 are common systematic mapping questions that provide a general panorama of the research topic. the other questions aim to investigate why (rq3 and rq7), how (rq4 and rq6) and when (rq5) km has been used in hci design, which are important questions to provide an understanding of the research topic.
the search string adopted in the study contains two groups of terms joined with the and operator. the first group includes terms related to hci design; the general term "human-computer interaction" was used to provide wider search results. the second group includes terms related to knowledge management. within the groups, we used the or operator to allow synonyms. the following search string was used:

("human-computer interaction" or "user interface design" or "user interaction design" or "user centered design" or "human-centered design" or "ui design" or "hci design") and ("knowledge management" or "knowledge reuse" or "knowledge sharing")

to establish the string, we performed tests using different terms, logical connectors and combinations among them, selecting the string that provided better results in terms of the number of publications and their relevance (i.e., the number of publications returned by the search string and, considering a sample, the inclusion of the truly relevant ones for the study). if a new term added to the search string resulted in a much larger number of returned publications without adding new relevant ones to the study, that term was not included in the search string. in that sense, more restrictive strings excluded important publications identified during the informal literature review that preceded the study, while more comprehensive strings (e.g., those including "usability") returned too many publications out of the scope of interest.

the search was performed in four sources, namely scopus, science direct, engineering village and web of science. we selected these sources because scopus is one of the largest databases of peer-reviewed literature: it indexes papers from other important sources such as ieee and acm, providing useful tools to search, analyze and manage scientific research. complementarily, to increase coverage, we selected science direct, engineering village and web of science, which are also widely used in secondary studies recorded in the literature and in other experiences of our research group.

publications selection was performed in five steps. in preliminary selection and cataloging (s1), the search string was applied in the search mechanism of each digital library used as a source of publications (we limited the search scope to the title, abstract and keywords metadata fields). after that, in duplications removal (s2), publications indexed in more than one digital library were identified and the duplications were removed. in selection of relevant publications, 1st filter (s3), the abstracts of the selected publications were analyzed considering the following inclusion (ic) and exclusion (ec) criteria: (ic1) the publication addresses km in the hci design context; (ec1) the publication does not have an abstract; (ec2) the paper was published only as an abstract; (ec3) the publication is not written in english; (ec4) the publication is a secondary study, a tertiary study, a summary, an editorial or a tutorial. in selection of relevant publications, 2nd filter (s4), the full texts of the publications selected in s3 were read and analyzed considering the cited inclusion and exclusion criteria. in this step, to avoid study repetition, we considered another exclusion criterion: (ec5) the publication is an older version of an already selected publication.
when the full text of a publication was not available either from the brazilian portal of journals, from other internet sources or by contacting its authors, the publication was also excluded (ec6). publications that met one of the six cited exclusion criteria or that did not meet the inclusion criterion ic1 were excluded. finally, in snowballing (s5), as suggested in kitchenham & charters (2007), the references of the publications selected in s4 were analyzed by applying the first and second filters, and the ones presenting results related to the research topic were included in the study. we used the start tool (http://bit.ly/start-tool) to support publications selection.

to consolidate data, publications returned in the publication selection steps were cataloged and stored in spreadsheets. we defined an id for each publication and recorded the publication title, authors, year and vehicle of publication. data from the publications returned in s4 and s5 were extracted and organized into a data extraction table oriented to the research questions. the spreadsheets produced during the study can be found at http://bit.ly/mapping-km-in-hci-design.

the first and second authors performed publication selection and data extraction. the third and fourth authors reviewed both. once the data had been validated, the first and second authors carried out data interpretation and analysis, and again the third and fourth authors reviewed the results. discordances were discussed and resolved. quantitative data were tabulated and used in graphs and statistical analysis. finally, the four authors performed a qualitative analysis considering the findings, their relation to the research questions and the study purpose.

3.2 results

the study considered papers published until october 2020. searches were conducted for the last time in november 2020. figure 1 illustrates the process followed and the number of publications selected in each step.

figure 1. publication selection process.

in the 1st step, as a result of searching the selected sources, a total of 381 publications was returned. in the 2nd step, we eliminated duplicates, reaching 228 publications (a reduction of approximately 40%). in the 3rd step, we applied the selection criteria over the abstracts, resulting in 21 papers (a reduction of approximately 91%). at this step, we only excluded publications that were clearly unrelated to the subject of interest; in case of doubt, the paper was taken to the next step. in the 4th step, the selection criteria were applied considering the full text, resulting in 11 publications (a reduction of approximately 48%). finally, in the 5th step, we applied the snowballing technique by checking the references of the 11 selected publications and identified 4 more publications, which in total added up to 15 publications.

when analyzing the publications to identify the km approaches applied in the hci design context, we noticed that some publications addressed complementary works from the same research group. hence, we considered complementary works as a single km approach when extracting data for rqs 3, 4, 5, 6 and 7. table 2 shows the list of identified km approaches, their descriptions and the corresponding publications. two papers were grouped into one km approach and three other papers were grouped into another, so we considered a total of 12 different km approaches found in 15 publications.
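as a small arithmetic check (an illustration, not part of the original study materials), the snippet below recomputes the per-step reductions reported above from the publication counts.

    #include <cstdio>

    int main() {
        // publication counts per selection step, as reported in the text
        struct Step { const char* name; int in; int out; };
        const Step steps[] = {
            {"s2 duplications removal", 381, 228},
            {"s3 1st filter (abstracts)", 228, 21},
            {"s4 2nd filter (full text)", 21, 11},
        };
        for (const Step& s : steps) {
            double reduction = 100.0 * (s.in - s.out) / s.in;
            std::printf("%s: %d -> %d (~%.0f%% reduction)\n",
                        s.name, s.in, s.out, reduction);
        }
        std::printf("s5 snowballing adds 4: total %d publications\n", 11 + 4);
        return 0;
    }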
throughout this and the next section, we refer to the approaches using the ids listed in the table. after table 2, we present the data synthesis for each research question. further information about the selected publications, including the detailed extracted data, can be found at http://bit.ly/mapping-km-in-hci-design.

table 2. selected publications and the km approaches they describe.
#01 trading off usability and security in user interface design through mental models (mohamed et al., 2017): proposes the development of an organizational mental model through knowledge transfer and transformation, using collaborative brain power from various knowledge constellations to design.
#02 knowledge management challenges in collaborative design of a virtual call centre (sikorski et al., 2011): proposes a knowledge-based system with the following functionalities: (a) storing design primitives and formal knowledge in an online library; (b) preserving procedures and rules that proved successful in past design problems; (c) formal modeling of knowledge elements that might be applicable for usability improvements; (d) providing multiple mechanisms for knowledge acquisition, preservation, transfer and sharing.
#03 applying knowledge management in ui design process (suàrez et al., 2004): defines a process to automate the transformation of a task description into an interaction description. first, it identifies and uniformizes existing knowledge about the ui design process using knowledge classification techniques. then, the captured knowledge is represented in the form of ontologies, deriving a task metamodel and an interaction metamodel. this extracted knowledge is integrated into design by defining a transformation of task description into interaction description, using an intermediate model between them and a two-step transformation.
#04 a knowledge management tool for speech interfaces (bouwmeester, 1999): proposes a knowledge-based system to help developers of speech-driven interfaces learn from previous design solutions. these solutions are collected, made accessible and divided into categories regarding their content type. solutions with corresponding structures are clustered and compared within their own category, providing designers with a suggestion mechanism based on their desired kind of solution. there is also a ranked suggestion mechanism of design elements based on available design material and design guidelines.
#05 design knowledge reuse based on visualization of relationships between claims (wahid, 2006; wahid et al., 2004): presents a tool that aims to improve design and knowledge acquisition by exploring relationships between claims. it allows a better search and retrieval mechanism for a design knowledge repository, which is obtained by applying km strategies (generalize, classify, store, retrieve) to claims.
#06 design knowledge reuse and notification systems to support design in the development process (chewar et al., 2004; chewar & mccrickard, 2005; j. l. smith et al., 2005): presents a system connected to a design knowledge repository based on claims. it allows teams to leverage knowledge from previous design efforts by searching for reusable claims relevant to their current project and to extend the repository by updating existing claims and creating new ones.
#07 exploring knowledge processes in user-centered design process (still, 2006): proposes a conceptual framework that guides the design process based on five propositions: (1) designers and users should be actively included as actors in the process, since they both have the knowledge needed for a successful design; (2) this knowledge is context-specific; (3) there is useful knowledge that has not been articulated by either users or designers and, therefore, (4) the knowledge processes transforming tacit knowledge into explicit knowledge by users and designers are linked and should be combined; and finally, (5) the resulting knowledge obtained along the process is embedded into concepts, products or services.
#08 lessons learnt from an hci repository (wilson & borras, 1998): concerns the implementation of a knowledge repository using windows help files. it is maintained by a group within the organization that receives content updates from the team and properly inserts this new material into the repository. new versions are released from time to time and distributed as physical copies to be installed on each computer.
#09 a pattern language approach to usability knowledge management (hughes, 2006): presents a km system that used principles of use case writing and pattern languages to describe problems found in user testing sessions and the solutions that followed. patterns can be retrieved through forms with filters, text search and database queries. filters include goals and subgoals, useful respectively to show all problems related to a specific user goal (and possible solutions) and to provide insights into which interactions or devices have been problematic regardless of user goal.
#10 an expert system for usability evaluations of business-to-consumer e-commerce sites (gabriel, 2007): proposes a knowledge-based system to help with e-commerce usability evaluations. a knowledge engineer is responsible for acquiring and representing knowledge, eliciting knowledge from textual, non-live sources of expertise about design guidelines that affect the usability of 11 e-commerce elements. the elicited knowledge is consolidated and presented in the form of rules in the expert system.
#11 a framework for developing experience-based usability guidelines (henninger et al., 1995): presents a km system to manage design guidelines contextualized by usability examples. the system allows designers to describe their current problems and requirements and then search for cases with similar characteristics. they can also follow hyperlinks to more general guidelines, which in turn point to other cases, search from a list of hierarchically arranged guidelines and follow other related guidelines and cases. the system is initially seeded with organization-wide usability guidelines and is updated as new projects are developed.
#12 prototype evaluation and redesign: structuring the design space through contextual techniques (a. smith & dunckley, 2002): proposes a method based on contextual inquiry and brainstorming to identify usability issues in interface evaluations and derive proper design solutions for them. first, interface evaluation sessions are conducted with users, who share their perceptions while interacting with a high-fidelity prototype of the system. those sessions are recorded and, later, relevant comments are transcribed into usability flaws. in a second moment, brainstorming meetings are held where developers, designers and hci specialists propose design solutions to the previously identified usability flaws.

publication year and type (rq1): figure 2 shows the distribution of the 15 selected publications over the years and across publication types. papers addressing km in the hci design context have been published since 1995 in journals and conferences (no workshop publications were found). conferences have been the main forum, encompassing 73.3% of the publications (11 out of 15). four papers (26.7%) were published in journals.

figure 2. publications over the years.

the venue of each selected publication was also analyzed to investigate whether it was more related to hci, km or software engineering (se). table 3 summarizes the venues of the selected publications and indicates their main focus. figure 3 presents the distribution of venue orientation across the publications: 53.3% of the publications (8 out of 15) were published in hci venues, and the remaining publications are divided between km (26.7%) and se (20.0%) venues.

table 3. venue orientation of the selected publications.
(mohamed et al., 2017) | behavior & information technology | hci
(sikorski et al., 2011) | international conference on knowledge-based and intelligent information and engineering systems | ai
(wahid, 2006) | conference on designing interactive systems | hci
(suàrez et al., 2004) | conference on task models and diagrams | hci
(bouwmeester, 1999) | international acm sigir conference on research and development in information retrieval | information retrieval
(j. l. smith et al., 2005) | ieee international conference and workshops on engineering of computer-based systems | software engineering
(chewar et al., 2004) | international conference on computer-aided design | design
(wahid et al., 2004) | ieee international conference on information reuse and integration | data science
(chewar & mccrickard, 2005) | hawaii international conference on system sciences | information systems
(still, 2006) | european conference on knowledge management | km
(wilson & borras, 1998) | international journal of industrial ergonomics | hci
(hughes, 2006) | journal of usability studies | hci
(gabriel, 2007) | isoneworld conference | information systems
(henninger et al., 1995) | dis conference on designing interactive systems: processes, practices, methods, and techniques | hci
(a. smith & dunckley, 2002) | interacting with computers | hci

figure 3. venue orientation of the selected publications.

research type (rq2): figure 4 presents the classification of the research types (according to the classification proposed in wieringa et al. (2005)) reported in the 15 selected publications. 13 publications (86.7%) propose a solution to a problem and argue for its relevance; thus, they were classified as proposal of solution. five of them (33.3%) also present some kind of evaluation: one (6.7%) was evaluated in practice (i.e., also classified as evaluation research) and four (26.7%) investigate the characteristics of a proposed solution not yet implemented in practice (i.e., validation research). one publication (6.7%) refers exclusively to evaluation research, discussing the evaluation of km in an industrial setting, and another is a personal experience paper, reporting the experience of the authors in a particular project in industry.
figure 4. research type of the identified publications.

motivation for using km in hci design (rq3): we identified six reasons for using km in hci design, as shown in table 4. some approaches presented more than one motivation, so the total sum is greater than 12.

table 4. motivations for using km in hci design.
motivation | approaches | total
improve product quality | #01, #02, #04, #05, #06, #07, #10, #11, #12 | 9
reduce design effort | #02, #03, #08, #09, #10 | 5
reduce design time | #04, #05, #08 | 3
reduce design cost | #05, #10 | 2
improve design team performance | #06 | 1
improve hci design learning | #06 | 1

nine approaches (75%) use km to improve product quality, most of them concerning usability. these approaches aim to provide benefits related to the quality of the interactive system in terms of its interaction with users. for example, approach #11 is proposed to help developers design effective, useful and usable applications. approach #01, in turn, aims to improve the alignment between design features and users' requirements. seven approaches (58.3%) are motivated by improving one or more aspects related to the hci design process, namely effort, time and cost. among these, reducing effort stands out: five approaches (41.7%) use km to reduce design effort, mainly by not depending on internal usability experts to perform hci design activities. approach #02, for example, applied km to decrease the need for experts to support the design team with their knowledge and experience, due to the lack of knowledge to be reused. approaches #04, #05 and #08 were motivated by reducing hci design time through the reuse of previous solutions implemented for similar problems. reducing costs in the hci design process was the motivation for approaches #05 and #10, which focus on minimizing the involvement of external usability experts in the process and conducting usability evaluations more effectively. approach #06 aimed to improve design team performance by providing support for team coordination and collaboration; this approach also aimed to improve hci learning for the students involved in the project.

managed knowledge in hci design (rq4): analyzing the publications, we identified 24 different types of knowledge items managed by the km approaches, as shown in table 5 (some items are shown on the same line to save space). the most common knowledge items have been design guidelines and design solutions, each addressed by four approaches, followed by test results, addressed by three approaches. we noticed that, in the context of hci design, most km approaches have dealt with only one (#10) or two (#01, #03, #05, #06, #09, #11 and #12) different knowledge items.

table 5. managed knowledge items.
knowledge item | approaches | total
design guidelines | #04, #08, #10, #11 | 4
design solutions | #02, #04, #07, #08 | 4
test results | #02, #04, #12 | 3
claims | #05, #06 | 2
design features | #01, #12 | 2
design patterns | #09, #11 | 2
lessons learned | #04, #08 | 2
usability measures | #02, #08 | 2
claims relationships | #05 | 1
design changes | #06 | 1
design feature checklists; design methods; design processes; design standards; design templates; interface objects | #08 | 1
interaction model; task model | #03 | 1
scenarios; test scenarios | #02 | 1
user knowledge; user needs | #07 | 1
user requirements | #01 | 1
user tasks | #09 | 1

we identified four different hci aspects addressed by the identified km approaches. the main aspect is usability, which is treated in all the identified approaches.
two approaches (#03 and #08) also address ergonomics. #03 and #04 focus on particular types of design or interfaces: the former focuses on task-based design, while the latter focuses on speech-driven interfaces. figure 5 shows the hci aspects addressed in the identified km approaches; the sum exceeds 12 because some approaches address more than one aspect.

figure 5. hci aspects addressed in km approaches.

when knowledge is captured and used (rq5): table 6 shows when hci design knowledge has been captured and when it has been used along the hci design process. three approaches capture and use knowledge throughout the whole process. eight approaches (66.7%) use knowledge when producing design solutions, while a smaller number (six, 50%) capture knowledge in this activity. the behavior is the opposite in design evaluation: more approaches capture (five, 41.7%) than use (three, 25%) knowledge in this activity. only one approach (8.3%) captures knowledge during requirements specification.

table 6. capture and use of knowledge along the hci design process.
activity (iso, 2019) | knowledge capture | knowledge use
specify requirements | 1 (#01) | 0
produce design solutions | 6 (#02, #03, #04, #07, #10, #11) | 8 (#01, #02, #03, #04, #07, #09, #11, #12)
design evaluation | 5 (#02, #04, #09, #10, #12) | 3 (#02, #09, #10)
whole cycle | 3 (#05, #06, #08) | 3 (#05, #06, #08)

technologies used in km approaches (rq6): table 7 shows the technologies (systems, methods, tools, theories, etc.) used in the analyzed km approaches. the most common technologies were knowledge-based systems and knowledge repositories, each used in three approaches. for example, #04 proposes a knowledge-based system to help developers of speech-driven interfaces learn from previous design solutions, while #08 proposes the implementation of a knowledge repository using windows help files. knowledge management systems and knowledge-based analysis were used in two approaches each. a knowledge management system is proposed in #09 to describe problems detected in user test sessions and the respective solutions, and in #11 to describe design problems and requirements and then search for usability examples with similar characteristics, with hyperlinks to more general related guidelines. knowledge-based analysis, in turn, was used in #03 and #07, combined with other technologies such as ontologies and model transformation (#03) and a conceptual framework (#07). other technologies, such as brainstorming, contextual inquiry, heuristic evaluation and mental models, were used in only one km approach each.

table 7. technologies used in km approaches in the hci design context.
technology | approaches | total
knowledge-based system | #02, #04, #10 | 3
knowledge repository | #05, #06, #08 | 3
knowledge management system | #09, #11 | 2
knowledge-based analysis | #03, #07 | 2
ontology; model transformation | #03 | 1
conceptual framework | #07 | 1
contextual inquiry; brainstorming-based technique | #12 | 1
mental model; internalization awareness; observation; behavioral interviews; absorptive capacity; heuristic evaluation | #01 | 1

benefits and challenges of using km in hci design (rq7): table 8 summarizes the benefits and difficulties reported in the publications. two approaches (#04 and #10) did not report any benefit or challenge in using km in hci design. considering the 10 other approaches, it can be noticed that, in general, more benefits than difficulties were reported. the most reported benefit was enabling the replicability of domain or context knowledge.
for example, #07 reached wide applicability because of the common conceptualization proposed as a conceptual framework. on the other hand, the most reported difficulty was that knowledge is often too specific for a given context. for example, #11 states that the approach is best suited for contexts in which common customer needs are being addressed in similar application domains.

table 8. benefits and difficulties of using km in the hci design context.
benefits | approaches | total
enable replicability of domain/context knowledge | #03, #06, #07, #09, #12 | 5
improve product quality | #02, #05, #06, #12 | 4
improve communication | #01, #03, #11 | 3
increase team engagement/empowerment | #02, #06 | 2
increase organizational integration | #03, #08 | 2
reduce design effort | #03, #12 | 2
improve design conceptualization | #03, #07 | 2
promote standardization | #02 | 1
increase productivity | #11 | 1
promote organizational competitive advantage | #02 | 1
decrease implementation and maintenance effort | #08 | 1
decrease implementation and maintenance costs | #08 | 1
difficulties | approaches | total
knowledge is often context-specific | #02, #06, #09, #11 | 4
issues related to features of the km technologies | #05, #06, #09 | 3
low team engagement/empowerment | #01, #05, #08 | 3
user involvement | #07, #12 | 2
integration of the km approach into the organization | #06, #11 | 2
km implementation and maintenance effort | #08, #09 | 2
lack of consensus about hci design conceptualization | #01, #02 | 2

3.3 discussion

taking the period of the publications into account (rq1), we can notice a long-term effort regarding the use of km in hci design, since this topic has been targeted by researchers for more than 20 years. however, the low average of publications per year (0.6 since 1995) shows that the topic has not been widely addressed. we can also notice that most of the publications are from the 2000s. the low percentage of journal publications, which generally require more mature works, can be seen as reinforcing the view that research on this topic is not mature yet. besides, the results about research type (rq2) show that only 40% of the works included some kind of evaluation, with only 13% evaluating solutions in practice. this can be a sign of difficulty in applying the proposed approaches in industry, which reinforces that research on this topic is not yet mature and that there seems to be a gap between theory and practice.

concerning rq3, we can notice that using km in hci design has been motivated mainly by delivering better products to users or optimizing the hci design process in terms of effort, time and cost. improving the performance of the hci design team was also mentioned, which is consistent with the other motivations related to the hci design process, since increasing performance can contribute to decreasing effort, time and cost. by analyzing the results of the approaches that applied some validation or evaluation, we noticed that only two (#03 and #12) provided results related to the initial motivation for using km in hci design (reduce design effort and improve product quality, respectively); the other publications were more focused on validating or evaluating features or functionalities of the proposed solutions. a common concern in several publications was the need for hci design expert consultants, which can increase hci design cost and effort.
capturing and reusing knowledge contribute to retaining organizational knowledge and reducing dependence on external consultants. another concern refers to communication problems. smith and dunckley (2002) highlight that barriers to effective communication between designers, hci specialists and users, due to their differing perspectives, affect product quality. km solutions are helpful in this context.

usability has been the focus of the km initiatives in the hci context (rq4). in fact, this is not a surprise, because usability has been one of the most explored hci aspects in recent years. moreover, this property is quite comprehensive and includes other important aspects of hci design, such as learnability, memorability, efficiency, safety and satisfaction (iso, 2019). however, there are other important properties not addressed in the analyzed papers, such as user experience, communicability and accessibility. the knowledge items managed by the km approaches are quite diverse. design solutions, guidelines, test results and design patterns are some knowledge items found in different publications. despite the variety of knowledge items, we noticed that most of the approaches (66.7%) manage up to two different knowledge items. by analyzing the coverage of the approaches in terms of single or multiple projects, we found out that four approaches (#01, #03, #07 and #12) manage knowledge involved in a single project, while the other eight approaches are more extensive, accumulating knowledge from multiple projects. in order to elevate knowledge reuse to the organizational level, a km approach must comprehend multiple projects in that organization.

concerning knowledge use and capture (rq5), at first, we expected that knowledge would be captured and used in the same activity of the hci design process. however, the results showed us that the same knowledge can be produced and consumed in different parts of the hci design process. for example, there are more approaches capturing knowledge in the design evaluation activity than using it there. this reinforces the iterative characteristic of hci design, where knowledge obtained in the evaluation activity in one cycle can be used to improve the design in the next cycle.

different technologies have been used to implement km in the hci design context (rq6). the most common are system-based approaches that use software to support the km process and store knowledge. we expected this result because km systems, knowledge-based systems and knowledge repositories are widely adopted technologies in the km area. on the other hand, only two approaches use specific hci techniques, namely contextual inquiry and heuristic evaluation. this may indicate that traditional km approaches are suitable for addressing km problems in hci design (which was indeed expected) and that hci techniques can be used to address specificities of the hci design domain. earlier steps of the development of km solutions, such as knowledge analysis and modeling, are also addressed in some publications. moreover, there is also concern with later steps, like the integration of the km system into the organization. some approaches combine different technologies, which can be a sign that combining different techniques is a good strategy to build a more complete km approach in hci design (a minimal sketch of the repository style is given below).
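to make the repository style of km support concrete, the sketch below shows, in python, how hci design knowledge items could be captured and later retrieved by tags. it is only an illustration under our own assumptions: the item fields, tags and search function are hypothetical and are not taken from any of the analyzed approaches.

from dataclasses import dataclass, field

@dataclass
class KnowledgeItem:
    # hypothetical fields for an hci design knowledge item
    title: str
    kind: str       # e.g., "guideline", "design solution", "lesson learned"
    project: str    # source project, enabling single- vs. multi-project scope
    content: str
    tags: set[str] = field(default_factory=set)

class KnowledgeRepository:
    """stores items from multiple projects and retrieves them by tag."""

    def __init__(self) -> None:
        self._items: list[KnowledgeItem] = []

    def capture(self, item: KnowledgeItem) -> None:
        self._items.append(item)

    def search(self, *tags: str) -> list[KnowledgeItem]:
        # an item matches when it carries all requested tags
        wanted = set(tags)
        return [item for item in self._items if wanted <= item.tags]

repo = KnowledgeRepository()
repo.capture(KnowledgeItem(
    title="error message wording",
    kind="guideline",
    project="project-a",
    content="use plain language in error dialogs",
    tags={"usability", "dialogs"},
))
print([item.title for item in repo.search("usability")])  # ['error message wording']

the capture/search split above is the essential service that repository-style approaches provide; the actual systems in the mapped publications are, of course, richer (e.g., knowledge-based systems add inference on top of storage).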
as for the benefits and challenges of using km in the hci design context (rq7), when categorizing the findings, we noticed that several of them are benefits and challenges of using km in general. however, by analyzing the context of each km approach, we can better understand how the findings relate to hci design. for example, regarding the benefit 'improve communication', the works highlight the use of km to support communication among the different actors involved in the hci design process. in #10, communication between hci specialists, designers and users is mediated by prototypes, aiming at an agreement about the system design. in #01, km facilitates the elicitation of the user's knowledge so that the designer can apply it to the design. in #03, km reduces errors of interpretation and contextualization among the people involved in the system design. some of the identified challenges and benefits are opposites of each other. for example, there is the challenge of low team engagement on one hand and the benefit of increased team engagement on the other. we kept both because they were cited in different publications, thus reflecting different perspectives. moreover, we can see the challenge as a difficulty that, when overcome by the use of km, can be turned into a benefit. by analyzing the most cited benefits and challenges, we noticed that the generality level of the knowledge is an important question in a km approach. the most cited benefit points to knowledge replicability in a specific context/domain. the most cited challenge points to the fact that it is difficult to generalize knowledge. looking at data from rq5, we noticed that approaches handling knowledge from multiple projects reported the knowledge generalization challenge, while approaches handling knowledge in a single project reported easy replication of knowledge. thus, the generality level of knowledge should be determined by the context where the km approach will be applied. when dealing with a high diversity of knowledge and contexts, it becomes harder to produce general knowledge that can be widely used to solve specific problems and be adopted in different contexts. one way of achieving improvements in replicability is using knowledge-based analysis methods, as reported by approaches #03 and #07.

based on the panorama provided by the mapping study results, in summary, we can say that km has not been much explored in the hci context; it has been used mainly to improve software quality and hci design process efficiency; it has focused on usability; and the km approaches have been based on systems and repositories. as for benefits, km has enabled knowledge replicability and improved product quality and communication. the main difficulties have been generalizing knowledge, issues related to features of the km technologies and low team engagement.

3.4 threats to validity

as with any study, our mapping study has some limitations that must be considered together with the results. following the classification presented by petersen et al. (2015), next we discuss the main threats to the mapping study results. descriptive validity is the extent to which observations are described accurately and objectively. to reduce descriptive validity threats, a data collection form was designed to support data extraction and recording. the form made the data collection procedure more objective and could always be revisited. however, data extraction and recording still involved some subjectivity and were dependent on the researchers' decisions.
an important limitation in this sense is related to the classifications we made. we defined classification schemas for categorizing data in some research questions. some categories were based on classifications previously proposed in the literature (e.g., type of research (wieringa et al., 2005)). others were established during data extraction, based on data provided by the analyzed publications (e.g., rq4). to minimize this threat, data extraction, classification schemas and data categorization were done by the first and second authors and reviewed by the other two authors. discordances were discussed and resolved. however, determining the categories and how data fit them involves a lot of judgment. thus, different results could be obtained by other researchers.

theoretical validity is determined by the researchers' ability to capture what is intended to be captured. in this context, one threat refers to the sources. we used four digital libraries selected based on other secondary studies in software engineering. although this set of digital libraries represents a comprehensive source of publications, the exclusion of other sources may have left some valuable publications out of our analysis. acm was not included in the sources because scopus covers most of its publications. however, there are hci publications indexed by acm and not indexed by scopus, which may have affected the mapping results. to minimize this risk, we performed snowballing. another threat refers to the fact that the study focused on scientific literature and did not include other alternatives, such as grey literature, that could enhance the systematic mapping coverage. hence, extending this study with a multivocal literature review through grey literature analysis could complement and enrich the obtained results. there are also limitations related to the adopted search string. even though we have used several terms, there are still synonyms that we did not use. for example, since km is a subjective area, many publications may have addressed km aspects using other words, such as "collaboration" and "organizational learning", which were not covered by our search string. moreover, we did not include the hci and km acronyms alone (hci was combined with "design"), which could be an additional threat. however, the string includes the full terms referring to hci and km, and we believe it is probable that publications including the acronyms also include the full terms in their title, abstract or keywords. hence, our search string might have covered them anyway (an illustrative string fragment is shown at the end of this section). researcher bias in publication selection, data extraction and classification is also a threat to theoretical validity. to minimize this threat, as previously said, the steps were initially performed by the first and second authors and, to reduce subjectivity, the other two authors performed these same steps. discordances and possible biases were discussed until reaching a consensus.

finally, interpretive validity is achieved when the drawn conclusions are reasonable given the data obtained. the main threat in this context is researcher bias in data interpretation. to minimize this threat, as in the other steps, interpretation was performed by the first and second authors and reviewed by the other two. discussions were carried out until a consensus was reached. however, qualitative interpretation and analysis still involve subjectivity.
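as an illustration of the search-string threat discussed above, a hypothetical fragment of a string extended with the missing synonyms could look like the following. we stress that this is not the actual string used in our protocol; it only shows how synonyms broaden coverage:

("human-computer interaction" OR "human computer interaction" OR hci) AND design AND ("knowledge management" OR "knowledge reuse" OR "organizational learning" OR collaboration)

adding such terms increases recall at the cost of retrieving more irrelevant publications to screen, which is the usual trade-off when calibrating search strings in secondary studies.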
even though we have treated many of the identified threats, the adopted treatments involved human judgment; therefore, the threats cannot be eliminated and must be considered together with the study results.

4 survey: km in hci design practice

the systematic mapping provided information about km approaches to support hci design according to the literature. after conducting the mapping study, we performed a survey with 39 brazilian hci design practitioners to investigate km in hci design practice. a survey is an empirical investigation method usually applied after the use of some technique or tool has already taken place (pfleeger, 1994). surveys are retrospective, i.e., they capture an "instant snapshot" of a situation. questionnaires and interviews are the main instruments used to apply a survey, collecting data from a representative sample of the population. the resulting data are analyzed, aiming to draw conclusions that can be generalized to the whole population represented by that sample (mafra & travassos, 2006). in this work, we intended to reach many participants and analyze data objectively and quantitatively. thus, in our survey, we decided to use a questionnaire containing objective questions. we followed the process defined in (wohlin et al., 2012), which comprises five activities. scoping is the first step, where we scope the study problem and establish its goals. planning comes next, where the study design is determined, the instrumentation is considered and the threats to the study conduction are evaluated. operation follows from the design, consisting of collecting data, which are then analyzed and evaluated in analysis and interpretation. finally, in presentation and package, the results are communicated. next, in section 4.1 we present the survey planning and execution. section 4.2 concerns the survey results. section 4.3 discusses the results and section 4.4 presents threats to validity.

4.1 survey planning and execution

the study goal was to investigate aspects related to km in hci design practice. aligned to this goal, we defined the research questions presented in table 9, which were based on the systematic mapping research questions and results.

table 9. survey: research questions and their rationale.
id | research question | rationale
rq1 | which stakeholders have been involved in hci design practice? | identify which stakeholders have been involved in hci design practice, which helps identify different perspectives and information needs in hci design.
rq2 | which knowledge has been involved in hci design practice? | investigate which knowledge has been involved in hci design practice, particularly knowledge items (e.g., design solutions, guidelines and lessons learned) and design artifacts (e.g., wireframes, mockups and prototypes) used as sources of knowledge or produced to record useful knowledge.
rq3 | which hci design activities have demanded better km support? | investigate which hci design activities have needed better support of km (e.g., because there have not been enough knowledge resources to support their execution).
rq4 | how has km been applied in hci design practice? | investigate how km principles have been applied and identify technologies (e.g., tools, methods, etc.) that have been used to support knowledge access and storage in hci design practice.
rq5 | which benefits and difficulties have been noticed when using km in hci design practice? | identify benefits and difficulties that have been experienced by practitioners when applying km in hci design practice and verify whether practitioners have experienced more benefits or difficulties.
rq6 | which goals has the use of km in hci design practice contributed to achieving? | identify which goals the use of km in hci design has contributed to, aiming to figure out the predominant reasons for using km in hci design practice.

the participants were 39 brazilian professionals with experience in hci design of interactive software systems. the participants' profile was identified through questions regarding their current job positions, education level, knowledge of hci design and practical experience in hci design activities. most participants (79.5%) declared to play roles devoted to hci design activities (nine ux/ui designers, six ux designers, four product designers, two designers, two ux research designers, one art director, one it analyst & ux designer, one interaction designer, one lead designer, one lead ui designer, one staff product designer and one ui designer). the others (20.5%) play roles that perform some activities related to hci design (one programmer, one requirements analyst, one chief growth officer, one product owner, one it analyst, one it manager, one marketing manager and one project leader). although these roles cannot be considered hci design experts, we did not exclude these participants because they declared to have practical experience and knowledge in hci design (probably acquired in their previous job and academic experiences). moreover, even playing roles not dedicated to hci design, they are often involved in hci design in some way. eight participants (20.5%) had master's degrees, 26 (66.7%) had bachelor's degrees, and five (12.8%) had not yet finished bachelor's degree courses. all participants declared theoretical knowledge of hci design. four of them (10.3%) declared low knowledge (i.e., knowledge acquired on their own through books, videos or other materials). 16 participants (41%) declared medium knowledge, acquired mainly during courses or undergraduate research. finally, 19 participants (48.7%) declared high knowledge (i.e., they are experts or have a certification, master's or ph.d. degree related to hci design). some areas of the courses cited by participants who declared medium or high knowledge are design (46.2%), computer science (38.5%), arts (28.2%), social communication (15.4%) and user experience (7.7%). the participants were allowed to choose more than one option, hence the sum of the values is over 100%. other areas, such as anthropology, neuroscience, information science and psychology, were also mentioned by one participant each. 26 participants (66.7%) declared more than three years of experience in hci design practice, 11 participants (28.2%) declared between one and three years and two (5.1%) declared less than one year. the instrument used in the study consisted of a questionnaire composed of 10 objective questions.
however, some options were rewritten in a way that could enhance the participants' understanding (e.g., we changed "test results" to "previous design evaluation results" in rq2) and others were added based on the authors' knowledge and experience (e.g., we included forums, blogs and social networks in rq4). furthermore, most questions also allowed the participants to provide additional information in text boxes to complement their answers. for example, besides selecting goals from the list provided in the question related to rq6, the participants were also allowed to include new goals in their answers. the questionnaire is available at http://bit.ly/questionnaire-km-in-hci-design.

the procedure adopted in the study consisted of sending the invitation to participate in the study, receiving the answers, verifying them, and consolidating and analyzing the data. the invitation was posted in discussion groups on facebook, linkedin and the interaction design foundation's website (https://www.interaction-design.org). the authors also sent the invitation by email to potential participants. since the platforms did not inform how many people visualized the posts, we could not infer the percentage of invites that led to answers. before sending the invitation, we performed a pilot with three participants. considering the participants' feedback, we improved the questionnaire, aiming to ensure that the questions were clear and understandable. the invitation to participate in the study was posted on social media and sent by email on december 16th, 2020. we received answers until january 11th, 2021. we received 40 answers to the questionnaire; however, after analyzing the participants' profiles related to hci design knowledge and experience, we excluded one participant who reported low knowledge of and experience with hci design and did not answer some of the questionnaire questions. after that, each provided answer was verified and the data were consolidated and analyzed against the research questions.

4.2 results

in this section, we present the data synthesis for each research question.

stakeholders involved in hci design practice (rq1): aiming to identify stakeholders involved in hci design practice, we asked the participants to identify the stakeholders they directly interact with in their hci design practice. as can be seen in table 10, developer has been the most common stakeholder involved in hci design practice, being mentioned by 37 participants (94.9%). following that, designer, project manager, client and user were mentioned, respectively, by 34 (87.2%), 33 (84.6%), 27 (69.2%) and 26 (66.7%) participants. product owner was cited by three participants (7.7%) and others (business analyst, customer experience analyst, data analyst, hr people, product manager and scrum master) were mentioned only once each.

table 10. stakeholders involved in hci design practice.
stakeholder | number of participants | %
developer | 37 | 94.9%
designer | 34 | 87.2%
project manager | 33 | 84.6%
client | 27 | 69.2%
user | 26 | 66.7%
product owner | 3 | 7.7%
business analyst | 1 | 2.6%
customer experience analyst | 1 | 2.6%
data analyst | 1 | 2.6%
hr people | 1 | 2.6%
product manager | 1 | 2.6%
scrum master | 1 | 2.6%

knowledge involved in hci design practice (rq2): first, the participants were asked about the knowledge items they use or produce during hci design activities.
we consider as knowledge items pieces of knowledge that can be useful in hci design, such as lessons learned, standards, guidelines and patterns. figure 6 presents the results of this question. some items have been both used and produced by a high number of participants: organizational design standards (used by 34 participants, 87.2%, and produced by 26 participants, 66.7%), lessons learned (used by 34 participants, 87.2%, and produced by 24 participants, 61.5%), guidelines (used by 34 participants, 87.2%, and produced by 22 participants, 56.4%) and libraries of design components or elements (used by 32 participants, 82.1%, and produced by 23 participants, 59%). other knowledge items have also been used by many participants, but produced by a smaller number, such as examples (used by 34 participants, 87.2%, and produced by 14 participants, 35.9%), design solutions from the organization (used by 35 participants, 89.7%, and produced by 18 participants, 46.2%) and design solutions from outside the organization (used by 35 participants, 89.7%, and produced by 11 participants, 28.2%). in general, hci design practitioners have used and produced different knowledge items (11.1 and 6.6 on average, respectively).

figure 6. knowledge items used and produced in hci design practice.

the participants were also asked about design artifacts they use or produce during hci design activities. we use the term design artifact to refer to documents, models, prototypes and other items that record information about the design solution. figure 7 shows the results. user requirements, scenarios and interaction models were the most cited artifacts used during hci design. on the other hand, wireframes, functional prototypes and mockups were the most cited artifacts produced during hci design.

figure 7. design artifacts used and produced in hci design practice.

we also asked the participants to inform whether the artifacts used and produced by them sufficiently provide all the information needed to describe the hci design solution (i.e., whether the knowledge recorded in the artifacts is enough for the implementation and evaluation of the solution). 26 participants (66.7%) answered "yes" and 13 (33.3%) answered "no". eight out of the 13 participants pointed out that they missed information about personas, user research data and usability tests. these 13 participants were also asked about the ways the missing information is communicated. the results are presented in table 11. annotations and talks have been the most used ways (eight participants, 61.5%) to complement the information provided in design artifacts. seven participants (53.8%) reported the use of meetings, while one used documentation or specific tools. the participants indicated that annotations and talks have been used informally, while meetings, documentation or tools have been used systematically, following organizational practices.

table 11. ways to obtain missing information.
method | number of participants | %
annotations | 8 | 61.5%
talks | 8 | 61.5%
meetings | 7 | 53.8%
documentation or tool | 1 | 7.7%
none | 1 | 7.7%

hci design activities demanding better km support (rq3): taking the hci design activities established by iso 9241-210 (iso, 2019) as a reference, the participants were asked to judge whether the knowledge resources (e.g., knowledge items, artifacts) used by them have provided sufficient knowledge to support each activity. figure 8 presents the results.
in general, most participants consider that they have access to enough knowledge to perform hci design activities. produce design solutions has the highest number of participants (31 participants, 79.5%) reporting to have had sufficient knowledge to perform it. on the other hand, evaluate design solutions has the highest number of participants (10 participants, 25.6%) declaring that the available knowledge has not been enough. sixteen participants (41%) declared not having had sufficient knowledge to support at least one hci design activity. they pointed out that, in order to address the lack of knowledge, they have performed user research, searched for successful use cases, talked to stakeholders, and looked at the literature.

figure 8. available knowledge to support hci design activities.

how km has been applied in hci design practice (rq4): figure 9 shows the approaches that have been used to support knowledge access or storage in hci design practice. brainstorming and blogs have been the most used ways to access knowledge (28 participants, 71.8%), followed by mental models and electronic documents and spreadsheets (26 participants, 66.7%). except for blogs, these have also been the most used ways to store knowledge: brainstorming has been used by 27 participants (69.2%); mental models and electronic documents and spreadsheets by 24 (61.5%). ontologies have been the least used approach: only 7 participants (18%) have used ontologies to access knowledge and 5 participants (12.8%) have used them to store knowledge. concerning knowledge storage, social networks (6 participants, 15.4%) and forums (8 participants, 20.5%) have also not been much used. in general, the approaches shown in figure 9 have been used more to support knowledge access than to support knowledge storage.

figure 9. approaches to support knowledge access and storage in hci design.

benefits and difficulties of using km in hci design practice (rq5): 34 participants (87.2%) reported performing km practices to support hci design activities. 16 of them (41.0%) have followed institutionalized organizational practices, while 18 (46.2%) have performed km on their own initiative. these 34 participants were asked about the benefits and difficulties they have perceived in using km to support hci design. the results are summarized in table 12 and table 13.

table 12. benefits of using km in hci design practice.
benefit | number of participants | %
enable replicability of domain or context knowledge | 27 | 79.4%
promote standardization | 26 | 76.5%
improve communication | 25 | 73.5%
increase productivity | 24 | 70.6%
reduce design effort | 24 | 70.6%
improve product quality | 23 | 67.6%
improve design conceptualization | 20 | 58.8%
improve team learning | 18 | 52.9%
reduce dependency on specialists | 18 | 52.9%
increase team engagement or empowerment | 17 | 50.0%
increase organizational integration | 16 | 47.1%
reduce design cost | 16 | 47.1%
promote organizational competitive advantage | 11 | 32.4%

table 13. difficulties of using km in hci design practice.
difficulty | number of participants | %
low team engagement or empowerment | 16 | 47.1%
km implementation and maintenance effort | 15 | 44.1%
integration of the km approach into the organization | 15 | 44.1%
lack of consensus about hci design conceptualization | 14 | 41.2%
find relevant knowledge for a given context | 13 | 38.2%
low user involvement | 9 | 26.5%
issues related to features of the km technologies | 8 | 23.5%
unclear business model | 1 | 2.9%

goals to which the use of km in hci design practice has contributed (rq6): aiming to identify the predominant reasons for using km in hci design practice, the participants were asked how much km support to hci design contributes to achieving certain goals. the goals presented to them were identified in the systematic mapping as motivations to perform km in the hci design context. figure 10 shows the results.

figure 10. km contribution to goals achievement when supporting hci design.

according to the participants, the goals to which using km in hci design contributes the most are improve product quality (84.6% of the participants stated that km contributes a lot or contributes to it) and reduce effort spent on design activities (79.5% of the participants stated that km contributes a lot or contributes to it). on the other hand, the participants have seen less contribution of km in hci design to reduce the usage of financial resources in design and to reduce the dependency on specialists (43.6% of the participants stated that km contributes little or is indifferent to both of them).

4.3 discussion

in this section, we present some discussions about the results shown in the previous section. by analyzing the participants' profile, we noticed that several participants (20.5%) who had knowledge of and experience with hci design did not play a role devoted to hci design by the time of the survey execution. we believe that this reinforces the multidisciplinary nature of hci design and corroborates a recent finding from neto et al. (2020) that some professionals may choose to pursue a double background involving design and development areas. concerning stakeholders (rq1), it can be noticed that a variety of them are involved in hci design. considering that the interactions usually occur in the context of projects, the results indicate that hci design project teams have included designers, developers and project managers, and have frequently also involved clients and users. these stakeholders have different roles in hci design, and thus may have different hci design knowledge needs. for example, a developer may need to implement the design solution presented in a design artifact. for that, this artifact should present technical decisions that affect the implementation. a project manager, in turn, may need to have a broader view of several design artifacts to verify whether the implemented solution satisfies the requirements agreed with the client. hence, km approaches must consider the needs of different stakeholders to properly support hci design. moreover, it may be necessary to integrate knowledge from different sources to provide a solution that meets the needs of different stakeholders. this can be done, for example, with a knowledge management system offering multiple views, one for each different role. regarding knowledge involved in hci design (rq2), by analyzing the knowledge items used and produced in hci design practice, we can notice which knowledge has been most useful to practitioners.
most participants use knowledge items that provide design knowledge obtained from previous design experiences, such as design solutions from the organization, design solutions from outside the organization and examples. this can be a sign that new designs have been created based on previous experiences adapted to the new context. however, these knowledge items have not been much produced by the participants. this may be due to the effort required to record knowledge for future reuse. hence, it would be important to facilitate the capture, recording and retrieval of knowledge embedded in design solutions. on the other hand, two of the knowledge items produced by the highest number of participants (organizational design standards and guidelines) record general principles and practices to be followed when designing hci solutions. this may indicate that the participants have found it easier to produce knowledge independent of specific solutions. considering the relation between the number of knowledge items used and produced by the participants, the higher number of used items shows that, in general, the participants have acted more as knowledge consumers than knowledge producers. this may happen because either the participants do not have enough time to produce knowledge items, or the knowledge production is done by someone else. consulting knowledge directly helps designers in the activities they are performing at that moment. in contrast, knowledge production does not seem to be immediately useful to them, although it is important at an organizational level. we believe that approaches that promote knowledge recording and storage requiring less effort could motivate designers to act as knowledge producers. as for design artifacts, we noticed that the ones produced by more participants (wireframes, functional prototypes and mockups) represent abstractions of the design solution. hence, the creation of such artifacts is part of the design solution development. on the other hand, the artifacts used by more participants (user requirements, scenarios and interaction models) provide useful information to develop the design solution (i.e., they represent inputs to design development). one-third of the participants (33.3%) considered the artifacts used or produced by them limited to meet information needs about the design solution and reported the use of complementary ways to transfer the missing knowledge. when analyzing the three most cited ways, we observed that two of them (talks and meetings) are based on conversation between team members. this can be a sign that it may be difficult to articulate certain pieces of knowledge in artifacts. this is reinforced by the high usage of annotations, which are less formal and structured, and the low usage of documentation and tools. besides, considering that the use of more than one method of knowledge transfer is a common practice among the participants, it is likely that they prefer to have this communication redundancy as a way of reinforcing the understanding of all stakeholders about the design. therefore, we believe that the missing knowledge in hci design artifacts can be transferred, for example, by performing regular meetings and by providing means to easily attach additional annotations to design artifacts. concerning hci design activities (rq3), 'produce design solutions' was the one for which more participants (79.5%) indicated having access to enough knowledge to perform it.
this can be a sign that participants have used knowledge mainly to support the creation of design solutions. on the other hand, a high number of participants indicated that they had not had sufficient knowledge to perform the activities 'understand and specify the context of use' (23%), 'specify user requirements' (23%) and 'evaluate the design solution' (25.6%). therefore, it is necessary to identify useful knowledge to support these activities (e.g., the missing knowledge related to personas and user research data, as reported in rq2) and provide means to represent and access it in an easy way. as for the approaches to support knowledge access and storage in hci design (rq4), it can be observed that the most used approaches, such as brainstorming, mental models and electronic spreadsheets and documents, usually support both knowledge access and storage. this may suggest that they are easier and simpler to implement and use. brainstorming, for example, has the advantage that participants share and obtain knowledge at the same time. on the other hand, web-based resources, such as blogs, forums and social networks, are used more to support knowledge access than knowledge storage. probably, these resources have been used more as sources of inspiration to bring new ideas from outside the organization. in addition, the reason why these resources have been less used by practitioners to record knowledge may be a concern about not exposing organizational design knowledge on the internet. hci design knowledge must be captured, recorded and propagated in order to be raised from the individual level to the organizational level. hence, we believe that km initiatives in hci design should consider approaches such as the ones most used by practitioners to support both knowledge access and storage. concerning the benefits and difficulties of using km in hci design (rq5), most participants declared to have experienced km practices in hci design: 41.0% followed institutionalized practices and 46.2% performed km on their own initiative. this indicates that hci design professionals have been concerned with the need for practices that help manage knowledge and are seeking solutions by themselves when these are not provided by the organization. according to the participants, in general, using km to support hci design brings more benefits than difficulties. the most cited benefits were related to standardization, reuse, communication and productivity, while the most cited difficulties were related to the lack of consensus in hci design conceptualization and to the effort of implementing the km approach, engaging the team and integrating the approach into the organization. based on that, to effectively implement a km approach, it would be interesting to convince people and the organization that the additional effort in the beginning is worth the benefits obtained afterward. finally, by analyzing the goals to which the use of km in hci design has contributed (rq6), 'reduce the usage of financial resources' and 'reduce the dependency on specialists' have been considered the least impacted by the use of km in hci design. this may be because reducing costs can be a side effect of reducing time spent on design or producing better designs with fewer errors. moreover, even if experts' knowledge is transferred and managed at the organizational level, user-centered design deals with people, hence there are subjective aspects that still need to be addressed by specialists.
another point to be considered is that the majority of the survey participants were hci design experts, which could have biased their answers about the impact of using km to reduce the dependency on hci design experts. it is also important to note that 'reduce the effort spent on design activities' was the goal that participants believe to be most impacted by the use of km in hci design. by having proper knowledge resources at hand, the designer can learn from previous experiences, reuse solutions and explore more design alternatives, which can lead to designing better and more efficiently.

4.4 threats to validity

as discussed in the context of the systematic mapping, when carrying out a study, it is necessary to consider threats to the validity of its results. in this section, we discuss some threats involved in the survey using the classification presented in (wohlin et al., 2012). internal validity: it is defined as the ability of a new study to repeat the behavior of the current study with the same participants and objects. the main threat to internal validity is communication and sharing of information among participants. to address this threat, the questionnaire was made available online, so that the participants could answer it at the time they considered most appropriate. this can minimize the communication threat, since participants were not physically close during the study and did not necessarily answer it at the same time. external validity: it is related to the ability to repeat the same behavior with different groups of participants. in this sense, the limited number of participants and the fact that all of them are brazilian professionals are threats to the results. moreover, some of the participants were invited based on the authors' relationship network, which may also have influenced the answers. construct validity: it refers to the relationship between the study instruments, participants and the theory being tested. in this context, the main threat is the possibility that the participants misunderstood some questions. to address this threat, we performed a pilot that allowed us to improve and clarify questions. moreover, we provided definitions for the terms used and examples of the information that should be included in the answers, so that the participants could better understand how to respond. conclusion validity: it measures the relationship between the treatments and the results and affects the ability of the study to generate conclusions. a threat to conclusion validity refers to the subjectivity in data analysis, which may reflect the authors' point of view. in addition, the results reflect the participants' personal experience, interpretation and beliefs. hence, the answers can embed subjectivity that could not be captured through the questionnaire. these and the other threats discussed above affect the representativeness of the survey results; thus, the results must be understood as preliminary evidence and should not be generalized.

5 consolidated view of findings

in this section, we present some discussions involving the systematic mapping and survey results, aiming to provide a consolidated view of the findings from both studies. the three most cited motivations for using km found in the systematic mapping (rq3) are the same as the three goals most impacted by the use of km in hci design practice, according to the survey participants (rq6).
this shows that, in general, it is expected that the use of km in hci design can contribute to improving product quality and reducing the effort and time spent on design activities. considering the most reported benefits and difficulties of using km in hci design, the survey results provided some that were not observed in the literature. for example, most survey participants reported 'standardization' and 'productivity' as benefits and 'km implementation and maintenance effort' and 'lack of consensus about hci design conceptualization' as difficulties. this difference is not a surprise, since the mapping results showed that most proposed approaches had not been applied in industry. we believe that to achieve success in implementing knowledge management, it is important to consider hci design professionals' perspectives, pursuing the benefits and implementing strategies to overcome the difficulties. there are other differences between the mapping and survey results. for example, traditional km technologies, such as knowledge management systems, knowledge repositories and knowledge-based systems, have been the most used approaches reported in the literature, but have not been much used by hci design professionals. the reasons why they do not use those approaches may be quite diverse, including not being aware that they exist or considering them too complex. since 46.2% of the participants perform km practices on their own initiative, they have likely preferred simpler approaches that they can implement by themselves. this reinforces the gap between industry and academia perceived from the analysis of the systematic mapping results. in order to decrease this gap, km approaches to support hci design should be closer to approaches that professionals are already familiar with, which can contribute to simpler and easier implementation and use. results from both studies show that design guidelines and design solutions have been reused in hci design. organizational design standards, lessons learned and design component libraries have also been useful for hci design professionals. therefore, km approaches to support hci design should be able to handle these knowledge items, supporting their capture, storage and retrieval. as indicated by results from both studies, these knowledge items have probably been most used to support the activity 'produce design solutions'. this was the activity in which most approaches found in the literature use knowledge and for which most participants considered having sufficient knowledge support. km approaches should also provide support to other activities, such as 'understand and specify context of use', 'specify user requirements' and 'evaluate design solutions', contributing to the hci design process as a whole.

6 conclusion

in this paper, we presented an investigation about the use of knowledge management in the hci design context. to investigate the state of the art, we performed a systematic mapping. after that, we carried out a survey with 39 brazilian professionals who work on hci design. as the main result of the studies, we provided a panorama of research related to the topic and identified gaps and opportunities for improvement for organizations interested in applying km initiatives in the hci design context. we noticed that, although hci design is a favorable area in which to apply knowledge management, there have been only a few publications exploring this research topic.
due to the increasing importance of interactive systems and the diversity of interfaces that have been made available for people's use, we believe that there are many challenges and questions to be addressed in future research. for example: (i) the lack of a common conceptualization of hci design (pointed out in #01 and #02 in the mapping study and also by 35.9% of the survey participants) leads to communication problems between the different actors involved in the hci design process. we believe that the use of ontologies to establish this common conceptualization could help in this matter. however, since ontologies are not very familiar to practitioners (survey rq4 results), ontology-based km approaches in hci design should abstract the ontology away from end users (e.g., using the ontology to derive the conceptual model of a knowledge-based system). (ii) the gap between theory and practice (systematic mapping rq2 results) shows that it is necessary to take km solutions to practical hci design environments. the survey results show that hci design professionals are familiar with more robust km approaches (such as knowledge management systems), but prefer simpler ways to deal with knowledge, such as brainstorming sessions and electronic spreadsheets and documents. therefore, lightweight technologies and a divide-and-conquer strategy to reduce the complexity of the conception, implementation and evaluation of a km approach might be useful, allowing results to be provided to organizations in shorter periods of time and increasing the benefits as the approach evolves. (iii) other aspects besides usability (e.g., user experience, communicability and accessibility) should be explored in km initiatives to improve hci design. (iv) the benefits and difficulties identified in the mapping (rq7) and reported by the survey participants (rq5) indicate issues that can be investigated in future research. for example, case studies can be carried out in organizations to evaluate the use of km approaches in the hci design context. concerning related works, we did not find any other study investigating the use of km in the hci design context. a work that can be related to ours is (stephanidis & akoumianakis, 2001), consisting of a literature review about categories of computer-aided hci design tools and a proposal of a new category to address the knowledge complexity involved in hci design. however, that study focused on computational tools, not investigating how other kinds of km approaches can help in the hci design process. as future work, concerning the systematic mapping, new studies can be conducted to better understand the state of the art of km in hci design and improve the use of km in this context. for example, the results obtained in our mapping study could be compared with results from other studies investigating km use in other domains (e.g., requirements engineering). moreover, km solutions proposed in other domains can inspire new proposals to support hci design by using km. as for the survey, it can be extended to include more participants from different countries and also to investigate other aspects. considering the studies' results, which showed us a gap between hci design professionals and the approaches proposed in the literature, we have worked on the development of a tool to support km in the context of hci design of interactive systems (castro et al., 2021).
by making use of the information provided by this study, we aim to reduce the gap between academia and industry by proposing a tool able to meet the needs of hci design professionals.

references

bjørnson, f. o., & dingsøyr, t. (2008). knowledge management in software engineering: a systematic review of studied concepts, findings and research methods used. information and software technology, 50(11), 1055–1068.
bouwmeester, n. (1999). a knowledge management tool for speech interfaces (poster abstract). proceedings of the 22nd annual international acm sigir conference on research and development in information retrieval, 293–294. https://doi.org/10.1145/312624.312721
carroll, j. m. (2014). human computer interaction (hci). in m. soegaard & r. f. dam (eds.), the encyclopedia of human-computer interaction (2nd ed., pp. 21–61). the interaction design foundation.
castro, m. v. h. b., barcellos, m. p., falbo, r. de a., & costa, s. d. (2021). using ontologies to aid knowledge sharing in hci design. xx brazilian symposium on human factors in computing systems (ihc'21). https://doi.org/10.1145/3472301.3484327
castro, m. v. h. b., costa, s. d., barcellos, m. p., & falbo, r. de a. (2020). knowledge management in human-computer interaction design: a mapping study. 23rd ibero-american conference on software engineering, cibse 2020.
chammas, a., quaresma, m., & mont'alvão, c. (2015). a closer look on the user centred design. procedia manufacturing, 3, 5397–5404. https://doi.org/10.1016/j.promfg.2015.07.656
chewar, c. m., bachetti, e., mccrickard, d. s., & booker, j. e. (2004). automating a design reuse facility with critical parameters. in r. j. k. jacob, q. limbourg, & j. vanderdonckt (eds.), computer-aided design of user interfaces iv (pp. 235–246). springer netherlands.
chewar, c. m., & mccrickard, d. s. (2005). links for a human-centered science of design: integrated design knowledge environments for a software development process. proceedings of the 38th annual hawaii international conference on system sciences, 256c–256c. https://doi.org/10.1109/hicss.2005.390
de souza, c. s. (2005). the semiotic engineering of human-computer interaction (acting with technology). the mit press.
gabriel, i. j. (2007). an expert system for usability evaluations of business-to-consumer e-commerce sites. proceedings of the 6th annual isoneworld conference, las vegas, nv.
henninger, s., haynes, k., & reith, m. w. (1995). a framework for developing experience-based usability guidelines. proceedings of the 1st conference on designing interactive systems: processes, practices, methods, & techniques, 43–53. https://doi.org/10.1145/225434.225440
hughes, m. (2006). a pattern language approach to usability knowledge management. journal of usability studies, 1(2), 76–90.
iso. (2019). iso 9241-210:2019(en) ergonomics of human-system interaction, part 210: human-centred design for interactive systems. international organization for standardization.
kitchenham, b. a., & charters, s. (2007). guidelines for performing systematic literature reviews in software engineering (vol. 2).
mafra, s. n., & travassos, g. h. (2006). estudos primários e secundários apoiando a busca por evidência em engenharia de software. relatório técnico, rt-es, 687(06).
mohamed, m. a., chakraborty, j., & dehlinger, j. (2017).
trading off usability and security in user interface design through mental models. behaviour & information technology, 36(5), 493–516. https://doi.org/10.1080/0144929x.2016.1262897
neto, e. h., van amstel, f. m. c., binder, f. v., reinehr, s. dos s., & malucelli, a. (2020). trajectory and traits of devigners: a qualitative study about transdisciplinarity in a software studio. 2020 ieee 32nd conference on software engineering education and training (csee&t), 1–9.
o'leary, d. e. (1998). enterprise knowledge management. computer, 31(3), 54–61. https://doi.org/10.1109/2.660190
petersen, k., vakkalanka, s., & kuzniarz, l. (2015). guidelines for conducting systematic mapping studies in software engineering: an update. information and software technology, 64, 1–18. https://doi.org/10.1016/j.infsof.2015.03.007
pfleeger, s. l. (1994). design and analysis in software engineering: the language of case studies and formal experiments. acm sigsoft software engineering notes, 19(4), 16–20.
rogers, y., sharp, h., & preece, j. (2011). interaction design: beyond human-computer interaction (3rd ed.). john wiley & sons.
rus, i., & lindvall, m. (2002). knowledge management in software engineering. ieee software, 19(3), 26–38. https://doi.org/10.1109/ms.2002.1003450
schneider, k. (2009). experience and knowledge management in software engineering (1st ed.). springer publishing company, incorporated.
sikorski, m., garnik, i., ludwiszewski, b., & wyrwiński, j. (2011). knowledge management challenges in collaborative design of a virtual call centre. knowledge-based and intelligent information and engineering systems, 657–666.
smith, a., & dunckley, l. (2002). prototype evaluation and redesign: structuring the design space through contextual techniques. interacting with computers, 14(6), 821–843. https://doi.org/10.1016/s0953-5438(02)00031-0
smith, j. l., bohner, s. a., & mccrickard, d. s. (2005). toward introducing notification technology into distributed project teams. 12th ieee international conference and workshops on the engineering of computer-based systems (ecbs'05), 349–356. https://doi.org/10.1109/ecbs.2005.69
stephanidis, c., & akoumianakis, d. (2001). knowledge management in hci design. in w. karwowski (ed.), international encyclopedia of ergonomics and human factors (vol. 1, pp. 705–710). taylor & francis.
still, k. (2006). exploring knowledge processes in user-centered design process. the 7th european conference on knowledge management, 533.
suàrez, p. r., jùnior, b. l., & de barros, m. a. (2004). applying knowledge management in ui design process. in p. slavik & p. palanque (eds.), proceedings of the 3rd annual conference on task models and diagrams tamodia '04 (pp. 113–120). acm press. https://doi.org/10.1145/1045446.1045468
sutcliffe, a. g. (2014). requirements engineering from an hci perspective. in m. soegaard & r. f. dam (eds.), the encyclopedia of human-computer interaction (2nd ed., pp. 707–760). the interaction design foundation.
valaski, j., malucelli, a., & reinehr, s. (2012). ontologies application in organizational learning: a literature review. expert systems with applications, 39(8), 7555–7561. https://doi.org/10.1016/j.eswa.2012.01.075
wahid, s. (2006). investigating design knowledge reuse for interface development. proceedings of the 6th conference on designing interactive systems, 354–356. https://doi.org/10.1145/1142405.1142462
wahid, s., smith, j. l., berry, b., chewar, c. m., & mccrickard, d. s. (2004).
visualization of design knowledge component relationships to facilitate reuse. proceedings of the 2004 ieee international conference on information reuse and integration (iri 2004), 414–419. https://doi.org/10.1109/iri.2004.1431496
wieringa, r., maiden, n., mead, n., & rolland, c. (2005). requirements engineering paper classification and evaluation criteria: a proposal and a discussion. requirements engineering, 11(1), 102–107. https://doi.org/10.1007/s00766-005-0021-6
wilson, p., & borras, j. (1998). lessons learnt from an hci repository. international journal of industrial ergonomics, 22(4), 389–396. https://doi.org/10.1016/s0169-8141(97)00093-0
wohlin, c., runeson, p., höst, m., ohlsson, m. c., regnell, b., & wesslén, a. (2012). experimentation in software engineering. springer.

journal of software engineering research and development, 2021, 9:12, doi: 10.5753/jserd.2021.1898
this work is licensed under a creative commons attribution 4.0 international license.

development of an ontology-based approach for knowledge management in software testing: an experience report

érica ferreira de souza [ federal university of technology - paraná | ericasouza@utfpr.edu.br ]
ricardo de almeida falbo [ federal university of espírito santo ]
nandamudi l. vijaykumar [ national institute for space research | vijay.nl@inpe.br ]
katia r. felizardo [ federal university of technology - paraná | katiascannavino@utfpr.edu.br ]
giovani v. meinerz [ federal university of technology - paraná | giovanimeinerz@utfpr.edu.br ]
marcos s. specimille [ federal university of espírito santo | marcosspecimille@gmail.com ]
alexandre g. n. coelho [ federal university of espírito santo | alexandregncoelho@gmail.com ]

abstract

software development organizations are seeking to add quality to their products. testing processes are strategic elements for managing project and product quality. however, advances in technology and the emergence of increasingly critical applications make testing a complex task, and large volumes of information are generated. software testing is a knowledge-intensive process. because of this, these organizations have shown a growing interest in knowledge management (km) programs, which in turn support the improvement of testing procedures. km emerges to manage testing knowledge and, consequently, to improve software quality. however, there are only a few km solutions supporting software testing. this paper reports experiences from the development of an approach, called ontology-based testing knowledge management (ontot-km), that aims to assist in launching km initiatives in the software testing domain with the support of knowledge management systems (kmss). ontot-km provides a process guiding how to start applying km in software testing. ontot-km is based on the findings of a systematic mapping on km in software testing and the results of a survey with testing practitioners.
moreover, ontot-km considers the conceptualization established by a reference ontology on software testing (roost). as a proof of concept, ontot-km was applied to develop a kms called testing km portal (tkmp), which was evaluated in terms of usefulness, usability, and functional correctness. results show that the kms developed from ontot-km is a potential system for managing knowledge in software testing, so the approach can guide km initiatives in software testing.
keywords: knowledge management, knowledge management system, software testing, testing ontology

1 introduction
with the emergence of new technologies during the last decades, more advanced techniques have been applied in software development, in order to achieve high-quality software products (thrane, 2011). thus, more efficient techniques to qualify a software product should be incorporated in its development life cycle, ensuring a well-managed process. testing activities play an important role in assessing and achieving the quality of a software product (souza, 2014). currently, software testing is considered a process consisting of activities, techniques, resources, and tools. advances in technology and the emergence of increasingly critical applications also make testing a complex task. during software testing, large volumes of information are generated. software testing is a knowledge-intensive process, and thus it is important to provide computerized support for the tasks of acquiring, processing, analyzing, and disseminating testing knowledge in an organization (andrade et al., 2013; souza, 2014). in this context, knowledge management (km) emerges to manage testing knowledge and, consequently, to improve software quality. km can be defined as a set of organizational activities that must be performed systematically to acquire, organize, and share the different knowledge types in the organization (o'leary and studer, 2001). the adoption of km principles in software testing can help testers to promote the reuse of knowledge, to support testing processes, and even to guide management decisions in organizations (souza et al., 2015a). software testing, in general, can benefit from reusing test cases, testing techniques, lessons learned, and personal experiences, among others (li and zhang, 2012; janjic and atkinson, 2013; souza et al., 2015a). to enable the reuse of testing knowledge, software organizations should be able to capture this knowledge and make it available to be shared with their teams. however, there are only a few km solutions in the context of software testing (souza et al., 2015a). the major problems in organizations regarding software testing knowledge are the low reuse rate of knowledge and barriers to knowledge transfer. this occurs because most of the testing knowledge in organizations is not explicit, and it becomes difficult to articulate it (souza et al., 2015a). on the other hand, implementing km solutions, in general, is not an easy task. according to storey and barnett (2000), a large number of organizations are taking great interest in the idea of km, but these organizations are not familiar with how and where to start, since they lack the proper guidance to implement km.
so, an orientation on how to implement new km solutions in the organization, or even an existing solution that can be customized, becomes interesting for organizations, since it is an opportunity for continued cost reduction, quality improvement, and reduction in software delivery (rokunuzzaman and choudhury, 2011). concerning technologies for km, ontologies have been widely recognized as a key technology (herrera and martin-b, 2015). ontologies can be used for establishing a common conceptualization to be used in the knowledge management system (kms) to facilitate communication, search, storage, and representation of knowledge (o'leary and studer, 2001). however, only a few initiatives have used an ontology-based approach for km in the software testing domain (souza et al., 2015a). this paper reports our experiences in developing an approach to assist in launching km initiatives in the software testing domain with the support of kmss. in this paper, we present ontot-km, an ontology-based testing knowledge management approach. ontot-km provides a process to apply km in software testing. ontot-km considers the conceptualization established by a software testing ontology. a striking feature of ontot-km is to describe how a testing ontology can be used for guiding km initiatives in software testing. the software testing ontology used in ontot-km is a reference ontology on software testing, called roost (souza et al., 2017). roost was developed for establishing a common conceptualization about the software testing domain and can be used to serve several km-related purposes, such as defining a common vocabulary for knowledge workers regarding the testing domain, structuring knowledge repositories, annotating knowledge items, and making searching easier (souza, 2014; souza et al., 2017). lessons learned and experiences acquired in conducting this study are presented on two main fronts. firstly, the ontot-km approach is presented to help software organizations to implement an initial km solution in software testing. subsequently, a prototype of a kms was developed, called testing km portal (tkmp), both as a proof of concept of the ontot-km approach and as a way of understanding the needs of software development professionals in having a kms in software testing ready and available for customization. this research is an extension of a preliminary study published in (souza et al., 2020). the extensions of this work are essentially threefold. first, we improved several sections to provide a better understanding of the research through the inclusion of new text, extra depth in some paragraphs, and the inclusion of new figures and tables. second, we analyzed the database created from roost using data mining techniques, to present the applicability of this type of research in the search for useful knowledge in knowledge repositories. third, we improved the tkmp analysis by software engineering practitioners. we carried out an analysis separating the participants by professional position, such as professionals directly related to software development companies and professionals directly related to scientific research.
the main contributions of this research are the guidelines provided by ontot-km for guiding km initiatives in software testing. these guidelines are supported not only by roost, but also by the findings of the mapping study (souza et al., 2015a) and the results of a survey with 86 testing practitioners. ontot-km was applied to develop tkmp, which was evaluated by test leaders of real projects in which it was applied. tkmp was also evaluated by 43 practitioners in terms of usefulness, usability, and functional correctness. such evaluation was designed by applying the goal, question, metric (gqm) paradigm (basili et al., 1994) and the technology acceptance model (tam) (davis, 1993). the remainder of this study is structured as follows. section 2 presents the main research concepts. section 3 presents ontot-km. section 4 presents the application of ontot-km and the evaluation results. section 5 discusses related works. finally, in section 6, we present our final considerations.

2 background
in this section, the main concepts of this study are discussed.

2.1 software testing
software testing consists of the dynamic verification and validation (v&v) of the behavior of a program on a finite set of test cases, against the expected behavior (abran et al., 2004). testing activities are supported by a well-defined and controlled testing process (abran et al., 2004). the process consists of several activities, namely (abran et al., 2004; myers, 2004; black and mitchell, 2011; mathur, 2012): test planning, test case design, test coding, test execution, and test result analysis. in the first activity, testing is planned, covering, for instance, the test environment for the project, the scheduling of testing activities, and planning for possible undesirable outcomes. test planning is documented in a test plan. then, in test case design, the test cases to be run are designed and documented, and they are subsequently coded. during test execution, test code is run, producing results, which are then analyzed to determine whether test cases have passed or failed. the testing activities are performed at different levels. unit testing focuses on testing each program unit or component. integration testing takes place when such units are put together, aiming at ensuring that the interfaces among the components are defined and handled properly. finally, system testing regards the behavior of the entire system (abran et al., 2004; myers, 2004; black and mitchell, 2011; mathur, 2012). in addition, many testing techniques provide systematic guidelines for designing test cases, intending to make testing efforts more efficient and effective. testing techniques can be classified, among others, as (burnstein, 2003): white-box testing techniques, which are based on information about how the software has been designed and coded; black-box testing techniques, which generate test cases relying only on the input/output behavior, without the aid of the code that is under test; defect-based testing techniques, which aim at revealing categories of likely or predefined faults; and model-based testing techniques, which are based on models, such as statecharts, finite state machines, and others. one of the main characteristics of the software testing process is that it has a large intellectual capital component and can thus benefit from experiences gained in past projects (souza et al., 2015a). during software testing, large volumes of information are processed and generated.
so, it can be considered a knowledge-intensive process, making automated support necessary for acquiring, processing, analyzing, and disseminating testing knowledge for reuse. in this context, knowledge management (km) can be used (souza et al., 2015a).

2.2 knowledge management
km can be viewed as the development and leveraging of organizational knowledge to increase an organization's competitive advantage (zack and serino, 2000). in general, km formally manages the increase of knowledge in organizations in order to facilitate its access and reuse, typically by using information systems (iss) and kmss (herrera and martin-b, 2015). in particular, kmss aim at supporting organizations in knowledge management in an automated way. one issue in kmss is how to represent knowledge. one alternative is ontologies (o'leary and studer, 2001), as they are considered a key technology for km (herrera and martin-b, 2015), defining the shared vocabulary to be used in the kms and facilitating knowledge communication, integration, search, storage, and representation. in ontology-based kmss, ontologies are typically used to structure the content of knowledge items, to support knowledge search, retrieval, and personalization, to serve as a basis for knowledge gathering, integration, and organization, and to support knowledge visualization, among others. km has shown important benefits for software organizations. in souza et al. (2015a), we performed a systematic mapping (sm) looking for studies presenting km initiatives in software testing. an sm is a secondary study for obtaining an overview of a research area through the classification of the available evidence (kitchenham and charters, 2007). the main conclusions from this sm were: (i) there are few publications (only 15 studies were retrieved) addressing km initiatives in software testing; (ii) the major problems that have motivated applying km in software testing are the low knowledge reuse rate and barriers in knowledge transfer; (iii) as a consequence, knowledge reuse and organizational learning are the main purposes for managing software testing knowledge; (iv) there is a great concern with both explicit and tacit knowledge; (v) reuse of test cases is the perspective that has received more attention; (vi) kmss are used in almost all initiatives (11 of the 15 studies); and (vii) different technologies have been used to implement those kmss, such as conventional technologies (databases, intranets, and internet), yellow pages (or knowledge maps), recommendation systems, data warehouses, and ontologies. in particular, one finding drew our attention: only two studies actually used ontologies in a km initiative applied to software testing (liu et al., 2009; li and zhang, 2012). this seems to be a contradiction, since, as pointed out by staab et al. (2001), ontologies are the glue that binds km activities together, allowing a content-oriented view of km. one possible explanation for this low number of studies is the fact that developing an ontology is a hard task, especially in complex domains, as is the case of software testing (souza et al., 2015a). based on the findings of the sm, we decided to perform a systematic literature review (slr) looking for ontologies on the software testing domain in the literature (souza et al., 2013).
an slr is also a secondary study, one that uses a well-defined process to identify available evidence (kitchenham and charters, 2007). from this slr, 12 ontologies addressing this domain were identified. as the main findings, it is possible to highlight (souza et al., 2013): (i) most ontologies have limited coverage; (ii) the studies do not discuss how the ontologies were evaluated; (iii) none of the analyzed testing ontologies is truly a reference ontology, i.e., a domain ontology constructed with the main goal of describing the domain as realistically as possible; and, finally, (iv) although foundational ontologies have been recognized as an important instrument for improving the quality of conceptual models in general, and more specifically of domain ontologies, none of the analyzed ontologies is grounded in foundational ontologies. this motivated us to build roost, a reference ontology on software testing (souza et al., 2017). roost was developed for establishing a common conceptualization of the software testing domain.

2.3 roost
roost is presented very briefly here, since presenting the entire ontology is not within the scope of this paper. details of the ontology can be found in (souza et al., 2017). since the testing domain is complex, roost was developed in a modular way, comprising four modules (sub-ontologies): (i) software testing process and activities, representing the testing process and the main activities that comprise it, namely test planning, test case design, test coding, test execution, and analysis of the test results; (ii) testing artifacts, focusing on the artifacts used and produced by the testing activities; (iii) techniques for test case design, looking at testing techniques, such as black-box, white-box, defect-based, and model-based testing techniques; and (iv) software testing environment, addressing the main components of the testing environment, including test hardware resources, test software resources, and human resources. in order to develop roost, the systematic approach for building ontologies (sabio) (falbo, 2014) was adopted. the sabio method incorporates best practices commonly adopted in software engineering and ontology engineering and addresses the design and coding of operational ontologies. furthermore, roost has been developed by reusing and extending ontology patterns from the software process ontology pattern language (sp-opl) (falbo et al., 2013) and the enterprise-ontology pattern language (eopl) (falbo et al., 2014). an ontology pattern language (opl) is a network of interconnected domain-related ontology patterns that provides holistic support for solving ontology development problems for a specific domain (falbo et al., 2013). more recently, roost has been integrated into the software engineering ontology network (seon) (ruy et al., 2016). the full model of roost is available at http://dev.nemo.inf.ufes.br/seon/. given the size of roost, figure 1 presents only its testing process and activities sub-ontology. concepts reused from the software process ontology are shown in gray; specific concepts are shown in white. some of the main concepts of this sub-ontology are also presented below. more specific details of roost's testing process and activities sub-ontology can be found in (souza et al., 2017) and in the seon network.

figure 1. roost's testing process and activities sub-ontology
roost’s testing process and activities sub-ontology in this sub-ontology, the process and activity execution (pae) pattern was reused. pae concepts were extended to the testing domain, as shown in figure 1. testing process is a subtype of specific performed process, since a testing process occurs in the context of the entire software process (general performed process) of a project. a testing process, in turn, is composed of testing activities, and thus testing activity is considered a subtype of performed activity. similarly to performed activity, testing activity can be further divided into composite and simple testing activity. in pae pattern, specific performed processes are composed of two or more performed activities. a performed activity, analogously, can be simple or composite. a composite performed activity is composed of other performed activities; a simple performed activity cannot be decomposed into smaller activities (falbo et al., 2013). besides specializing concepts, relationships were also specialized from pae. for instance, in pae, there is a whole-part relationship between specific performed process and performed activity. the whole-part relationship between testing process and testing activity is a subtype of the former. whenever a roost relationship is a subtype of another relationship defined in sp-opl, the same name is used for both. regarding the testing process activities, test planning and level-based testing are composite performed activities. although not shown in figure 1, test planning involves several sub-activities, such as defining the testing process, allocating people and resources for performing its activities, analyzing risks, and so on. level-based testing comprises test case design, test coding, test execution and test result analysis, which are considered simple performed testing activities. considering the test levels, level-based testing groups simple testing activities according to the test level to which they are related. thus, level-based testing is specialized ac1http://dev.nemo.inf.ufes.br/seon/ cording to the instances of test level, a second-order type, whose instances partition level-based testing in more specific types of testing activities. in figure 1, the three most cited testing levels in the literature are made explicit: unit testing, integration testing and system testing. however, there may be others, such as regression testing. regarding testing stakeholders, the test manager is responsible for performing test planning activities. test manager also participates in test result analysis activities. test case designer participates in test planning activities, and she is in charge of performing test case design and test result analysis activities. finally, the tester is responsible for performing test coding and test execution. with respect to testing artifacts, test planning produces a test plan, which is used by level-based testing activities. test case design uses several artifacts as test case design inputs and applies testing techniques for developing test cases. during test coding, test code is produced, implementing a test case. during test execution, test cases are executed by running the code to be tested and the test code, producing test results. finally, in a test result analysis activity, test results are analyzed and the findings are reported in a test analysis report. 
3 ontot-km
given the applicability of km to improve software testing processes, we developed ontot-km for assisting companies that want to create their own solutions for km initiatives in dynamic software testing, supported by a kms. ontot-km consists of a process and a set of guidelines for implementing a kms in software testing organizations. ontot-km is supported by roost, in particular to structure the kms knowledge repository. moreover, the ontot-km guidelines are based on the findings of the sm presented in (souza et al., 2015a) and the results of a survey performed with testing practitioners presented in (souza et al., 2015b). the ontot-km process comprises the following steps: (i) diagnose the organization's testing process; (ii) establish the testing km scope; (iii) develop a testing kms; (iv) load existing knowledge items; and (v) evaluate the testing kms. figure 2 presents the ontot-km process as a uml activity diagram. as this figure shows, roost is used to support steps (i) and (iii). steps (i) and (ii), shown in another color in figure 2, are considered optional. in the following, each process step is presented, describing the main guidelines that apply.

figure 2. ontot-km process

step 1: diagnose the current state of the organization's testing process
the first step of the ontot-km process is to make a diagnosis of the current state of the organization's testing process. it refers to investigating the existing knowledge within the testing process, in order to identify knowledge assets and understand how and where testing knowledge is developed and used in the organization. this step may become optional depending on the organization's maturity level; it is an important step for organizations with low maturity. once the knowledge items are identified, organizations can then proceed to manage them. this step may be accomplished by surveys with questionnaires and/or interviews. this activity should consider the entire current state of software testing in the organization concerning, at least, the following aspects: the adopted testing process, the activities of the candidate testing processes to be targeted by the km initiative, the artifacts produced during this process, the testing techniques applied, the test levels contemplated by the process, and the test environments adopted by the organization's software projects. aspects related to km should also be investigated, such as the current km practices applied in the testing process, the organization's purpose in applying km to software testing, and problems related to testing knowledge in the organization, among others. roost can be used in this step as the common vocabulary for supporting the analysis of the current status, as well as to formulate the questions to be used in the questionnaires and/or interviews with the organization's testing practitioners. the results of these questionnaires/interviews should be used as guidelines for the next step (establish scope). for this step, we suggest asking the testing practitioners participating in the diagnosis at least the following questions:
• what are the testing activities that comprise the organization's testing process?
evaluate the answers considering the consensual activities considered in roost (test planning, test case design, test coding, test execution, and test result analysis), and consider the possibility of improving the organization's testing process by aligning it to the testing process captured by roost.
• in which activities of the testing process is km more useful? the activities pointed out in the previous question should be the ones considered here as possible answers.
• what are the testing levels at which the organization performs tests? testing activities can be performed at different levels. taking roost as a basis, consider at least the following levels: unit testing, integration testing, and system testing. however, if the organization tests software at other levels, these should be considered.
• at which testing level is km more useful? the testing levels pointed out in the previous question should be the ones considered here as possible answers.
• which resources do you consider more important to have knowledge available about when defining the testing environment? according to roost, the possible answers considered for this question are the following types of resources: hardware, software, and human.
• concerning tacit and explicit knowledge, which types of knowledge do you consider more important to manage during the software testing process? testing practitioners tend to consider both useful, but we need to evaluate which one is more important and which is easier to implement. in general, for organizations starting a km initiative in software testing, explicit knowledge is easier to handle. in particular, test cases stand out as the most important artifacts to be managed as knowledge items, as pointed out in the sm (souza et al., 2015a) and the survey (souza et al., 2015b).
• what is the purpose of applying km in software testing? what benefits can km bring to software testing? this question captures the feeling of testing practitioners regarding why and how an organization can benefit from applying km to software testing.

step 2: establish the scope of the testing km initiative
once the diagnosis of the status of the testing process has been carried out, the next step is to establish the km scope. as in the case of step 1, this step may also be considered optional if the organization already knows its needs. for the km scope, it is necessary to be familiar with the organization's needs. the organization must define the testing process activities that are to be supported and the knowledge types to be managed. a major challenge for organizations is to know which knowledge is useful, and thus identify potential knowledge items among the several knowledge assets generated in the testing process. results from step 1 should be used here. in addition, it is suggested that organizations start with small km initiatives. as a general guideline, we recommend considering the results of the survey we performed (souza et al., 2015b). from this survey, test case design and test planning were considered the most important testing activities to be supported by km practices, capturing their main outcomes, namely test cases and test plans, as knowledge items. when considering test cases as knowledge items, it is necessary to build an appropriate infrastructure that allows for the analysis, storage, and retrieval of existing test cases. this structure can be achieved with the ontot-km approach.
when reusing this knowledge item, for example, the test reuse system should be able to cover a variety of search scenarios in order to assist its users in different situations. the search engine enables searching for test cases by informed parameters, for example, test levels or testing techniques (a minimal sketch of such a parameterized search is given at the end of this step). the returned test cases can be reused in similar scenarios. according to werner (2014), reusable tests that have been written for a similar scenario are likely to help to better understand how a previously created similar system works. in addition, by reusing the knowledge contained in existing tests, developers can benefit from the knowledge that others have invested in developing them. these tests can help to gain better insights into how a particular kind of component should behave. regarding test planning, the survey led to the selection of testing techniques and the definition of the testing environment as the most important tasks to be supported by a km initiative. concerning knowledge about the testing environment, managing knowledge about human resources and software resources is pointed out as the most promising approach. regarding knowledge about human resources, this impression is corroborated by the sm (souza et al., 2015a), where yellow pages and knowledge maps appear in various initiatives. other knowledge items related to making tacit knowledge explicit can also be considered in the km scope, namely:
• lessons learned (ll): ll can be understood as knowledge acquired through experience in a particular situation. ll can be classified as best practices, errors/critiques, and success factors. lls are informal knowledge items that can be understood as ideas, facts, questions, points of view, decisions, among others. in addition, ll can also be classified as informative, success, or failure lessons. informative lls explain how to proceed in a given situation; success lessons provide examples of problems that were solved positively; and failure lessons provide examples of negative responses to attempts to solve a problem and potential ways to cope with the situation (o'leary, 1998).
• knowledge regarding discussions: discussions among the organization's members may be submitted as knowledge items. tools to support discussion among the organization's members, such as discussion forums, have been fundamental in km environments (fischer and ostwald, 2001). discussion forums become important tools for knowledge management for the following reasons: (i) very useful knowledge can be generated and captured during discussions (falbo et al., 2004), and (ii) a major challenge of km is to convert tacit knowledge into explicit knowledge (nonaka and krogh, 2009; davenport and prusak, 2000).
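as referenced above, the sketch below illustrates the kind of parameterized test-case search such a reuse system could offer. the repository layout, field names, and example records are hypothetical, chosen only to mirror the parameters mentioned in the text (test level and testing technique).

from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TestCaseItem:
    # a test case stored as a knowledge item; fields are illustrative.
    identifier: str
    test_level: str    # e.g., "unit", "integration", "system"
    technique: str     # e.g., "black-box", "white-box"
    description: str


def search_test_cases(repository: List[TestCaseItem],
                      test_level: Optional[str] = None,
                      technique: Optional[str] = None) -> List[TestCaseItem]:
    # return the test cases matching every informed parameter;
    # parameters left as None are simply not applied as filters.
    results = repository
    if test_level is not None:
        results = [tc for tc in results if tc.test_level == test_level]
    if technique is not None:
        results = [tc for tc in results if tc.technique == technique]
    return results


repo = [
    TestCaseItem("tc-001", "unit", "white-box", "boundary check on a parser"),
    TestCaseItem("tc-002", "system", "black-box", "telemetry formatting"),
]
print([tc.identifier for tc in search_test_cases(repo, test_level="unit")])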
step 3: develop a testing kms
this phase concerns developing a kms to support the km initiative and comprises the main activities for developing systems in general: requirements specification, design, implementation, and testing. requirements must be elicited and specified. functional requirements may be captured through use case models, class diagrams, and state diagrams that model the behavior of knowledge items throughout their existence in the kms. non-functional requirements should also be addressed, such as security, usability, accessibility, etc. roost is very useful in this step. roost can serve as the initial conceptual model for the kms, and thus as the basis for structuring the testing knowledge repository. specific information (attributes of the classes in the conceptual model) should be identified, taking into account the characteristics of the organization's testing artifacts and, most importantly, the information that is available in the tools used for supporting the testing process. furthermore, interoperability issues should also be analyzed. ideally, software tools that are part of the test environment should be integrated with the testing kms to act together, interacting and exchanging data to obtain the expected results. in this context, possible knowledge items identified in these tools can be automatically converted/imported to the testing kms. another key point is to define the km process activities that are to be supported by the testing kms. we recommend providing support to the following typical activities of a km process: creating knowledge items, evaluating knowledge items before making them available, searching knowledge items, assessing the usefulness of available knowledge items, and maintaining the knowledge repository. during the design of the testing kms, developers should consider the platform on which the system is to be built, and non-functional requirements should be addressed. several technologies can be used, including those commonly considered in km solutions, like content management systems, document management systems, and wikis, as well as those considered intelligent km solutions, such as knowledge-based and expert systems, reasoners, and semantic wikis. once designed, the kms should be coded and tested.

step 4: load existing knowledge items
for initially populating the knowledge repository of the testing kms, the organization should look for existing knowledge items. for instance, if the system must manage test cases, existing test cases can be imported to the testing kms. the existing knowledge items should be reengineered to ensure conformance with the knowledge repository structure. knowledge items can be registered manually in the testing kms, or mechanisms for loading and reengineering these knowledge items can be built to automate the loading process. once a knowledge repository is created and populated, data mining can be explored. the knowledge repository can contain useful hidden information (knowledge) of major relevance to the business, so mining on these data can be performed. data mining is the application of specific algorithms for extracting patterns from data; it is one of the steps of the knowledge discovery in databases (kdd) process (fayyad et al., 1996). data mining methods, such as classification, regression, clustering, summarization, association rules, and dependency modeling, are used in the identification of relevant information in large volumes of data (fayyad et al., 1996). mining data stored in large databases to discover potential information and knowledge has been a popular topic in database research. data mining is a technology to obtain information and valuable knowledge (yun et al., 2003). according to basili and rombach (1991), the quality of software development can be improved by reusing acquired experiences, rather than starting from scratch.
therefore, as a result of applying kdd, another knowledge item type can be considered: mined items. mined items can be produced by a mining process over the km database and can identify relationships that are not apparent, facilitating decision-making. furthermore, identifying behavior patterns in data stored in knowledge bases can help the organization to reuse and share the knowledge acquired in previous projects.

step 5: evaluate the testing kms
evaluation should be done to determine whether the testing kms meets the expectations. improvements can be carried out, implying a return to the previous steps. a suggestion to evaluate the testing kms is to analyze some quality characteristics, such as usefulness, usability, and functional correctness. to do that, two models can be considered: gqm (basili et al., 1994) and tam (davis, 1993). gqm is a measurement model organized into three levels. at the first level (conceptual level), the study goals should be defined. the second level (operational level) refers to a set of questions that should be defined to characterize the evaluation or the accomplishment of a specific goal. finally, at the last level (quantitative level), a set of metrics should be associated with the questions, to answer them measurably. the result of applying the gqm approach is the specification of a measurement system targeting a particular set of issues and a set of rules for interpreting the measurement data (basili et al., 1994). gqm is useful because it facilitates identifying not only the precise measures required but also the reasons why the data are being collected (park et al., 1997). tam determines the acceptance of a given technology by users, considering a two-factor analysis: usefulness and usability. when evaluating these two factors, it is possible to map the users' acceptance of a new technology. usefulness refers to how much the user perceives that a certain technology is useful to her in terms of productivity increase. according to iso/iec 25010 (iso/iec, 2011), usefulness is the "degree to which a user is satisfied with their perceived achievement of pragmatic goals, including the results of use and the consequences of use". in this standard, usefulness is part of the quality in use model. the perception of usability refers to the effort reduction that the user achieves when using a given technology instead of other alternatives (davis, 1993). in iso/iec 25010 (iso/iec, 2011), usability refers to the "degree to which a product or system can be used by specified users to achieve specified goals with effectiveness, efficiency, and satisfaction in a specified context of use". it is a quality characteristic of the product quality model, but for consistency with its established meaning, it is also defined as a subset of the quality in use model. in this work, we also decided to evaluate another quality characteristic: functional correctness, a subcharacteristic of functional suitability in the iso/iec 25010 product quality model. according to iso/iec (2011), functional correctness is the "degree to which a product or system provides the correct results with the needed degree of precision".
4 applying ontot-km
our experience in developing the ontot-km approach has two fronts. first, we introduce the approach, and then we create a prototype of a kms based on ontot-km that allows us to evaluate the approach, as well as to obtain the opinion of software professionals about having a kms in software testing ready and available for customization. in the following, the kms development based on ontot-km and all the evaluation processes are presented. to evaluate the ontot-km approach, we applied it to build a prototype of a kms for managing software testing knowledge, called testing knowledge management portal (tkmp). the resulting system was populated with data from two real projects, and different evaluations were conducted. the projects were (souza, 2014): (i) the amazon integration and cooperation for modernization of hydrological monitoring (icammh) project; and (ii) the on-board data handling (obdh) software inside the inertial systems for aerospace application (sia) project. the icammh project was a collaboration involving the brazilian aeronautics institute of technology and the brazilian national water agency, supported by the brazilian funding agency finep. the project developed a pilot system for the modernization and integration of telemetry points collected from hydrological data, as a basis for managing water resources in the amazon region. the second project is devoted to developing software for the on-board computer of the sia project, which is a computational system for on-board data handling (obdh) and attitude and orbit control (aoc) of satellites that can be adapted for future space applications at the national institute for space research (inpe). the first version of the obdh software was in the testing phase when this work was being done. the final version of this software aims at adding all the functionalities of the obdh of a satellite. its main functionalities are: (i) receiving and analyzing ground station telecommands; (ii) formatting and transmission of telemetry; (iii) data acquisition from on-board subsystems (real time and stored); (iv) housekeeping; and (v) fault detection, isolation and recovery (fdir). at the time we were carrying out this research, the icammh project had already been finalized and the sia project was in its early stage.

4.1 diagnose the current state of the organization's testing process
as the icammh project had already been finalized and the testing activities of the sia project were only in the very initial phase, it was not possible to run the diagnostic step. this step was replaced by the findings from the survey with 86 testing practitioners we performed (souza et al., 2015b). out of these 86 participants, some are also team members and leaders of the icammh and sia projects. the survey's purpose was to identify, from the point of view of testing stakeholders, the most appropriate scenario in the software testing domain for starting a km initiative. the survey presents questions that addressed aspects considered both in the conceptualization of roost and in the sm presented in (souza et al., 2015a), as shown in table 1. furthermore, managing testing knowledge is not an easy task, and thus it is better to start with a small-scale initiative. thus, firstly, it is necessary to identify essential knowledge items of a sub-topic of software testing to be dealt with in the kms. from the survey results, the following conclusions are considered: (i) the participants identified test case design and test planning as the activities in which km would be most useful; therefore, test cases and test plans are considered the most useful artifacts to be reused; (ii) explicit knowledge was considered more important than tacit knowledge.
explicit knowledge represents the objective and rational knowledge that can be documented, and thus it can be accessed by many (nonaka and takeuchi, 1997). on the other hand, tacit knowledge is the subjective and experience-based knowledge that typically remains only in people's minds (nonaka and takeuchi, 1997); (iii) among the most targeted artifacts for reuse, test cases stood out with 90.7%; and (iv) the purposes for which experts are more interested in applying km in software testing are related to improving the quality of results in software testing and reducing the cost, time, and effort spent in a software project.

4.2 establish the scope of the testing km initiative
considering the main findings of the survey, test case design was considered the software testing activity to be supported, and the test case the main artifact to be managed. all relevant information for designing test cases had to be considered in the scope of the tkmp development. thus, concepts related to test cases in roost were also considered in the scope of the initiative, namely: test case input, expected result, test result, test code, test case designer, and testing technique. besides test cases as the main artifacts to be managed, ll and knowledge regarding discussions were also considered in the scope of tkmp. these two types of knowledge items were considered in the scope of tkmp since survey participants pointed out individual experiences and communication between test team members as the types of tacit knowledge most important for generating explicit knowledge items. in addition, meetings with the project leaders from the icammh and sia projects also helped to reach this scope. still concerning tacit knowledge, we decided that tkmp should also include a yellow pages system, since survey participants pointed out human resources as the most useful resource to be managed and test case designers are in the scope of this km initiative. finally, we also decided to apply kdd for discovering useful knowledge from existing data and identifying the mined items. as presented in section 3, step 4, different mining methods can be used in the identification of relevant information in large volumes of data. in this project, for creating the mined items, the association rule method was used. the association rule method identifies patterns of behavior, i.e., sets of items that often occur jointly in the database, and models rules from these sets. association rules, when applied to a data set, allow finding rules of the type x → y, i.e., transactions of the database that contain x tend to also contain y. the association rule method was used along with the apriori algorithm (agrawal and srikant, 1994; witten et al., 2005), which is the best-known rule discovery method (agrawal and srikant, 1994).
4.3 develop a testing kms
considering the scope defined in the previous activity, tkmp was developed. the specification of the main requirements was developed, including the use case diagram and the conceptual model (class diagram). figure 3 shows a partial use case diagram describing the main functionalities of tkmp and its actors. the use cases in gray are general, in the sense that they apply to managing software engineering knowledge items of different natures. use cases in white represent testing-specific features. the developer is the main actor, representing all types of professionals involved in the software development process. the knowledge manager represents a user with specific permissions, guaranteeing access to features inherent only to a knowledge manager. next, the use cases shown in figure 3 are briefly described.
• create knowledge item: this use case allows developers to create a knowledge item.
• create discussion-related knowledge item: this use case allows developers to register a discussion-related knowledge item.
• create lesson learned: this use case allows developers to register a lesson learned.
• create mined item: this use case allows the developer to register a mined item.
• create test case: this use case allows developers to register a test case.
• include test result: this use case allows the developer to include a test result relative to a test case.
• include incident: this use case allows the developer to report an incident related to a test result.
• include issue: this use case allows the developer to register an issue related to an incident.
• change knowledge item: this use case allows the knowledge manager to change a knowledge item.
• delete knowledge item: this use case allows the knowledge manager to delete a knowledge item.
• pre-evaluate knowledge item: this use case allows the knowledge manager to pre-evaluate a knowledge item, making it available, rejecting it, or selecting experts to evaluate it.
• evaluate knowledge item: this use case allows a developer to make a detailed evaluation of a knowledge item, to support the knowledge manager in making decisions about whether the item should be approved or rejected.
• visualize knowledge item: this use case allows developers to visualize the details of a knowledge item.
• visualize test case: this use case allows developers to visualize the details of a test case.
• search knowledge item: this use case allows the developer to search for available knowledge items per informed parameters.
• search test case: this use case allows the developer to search for test cases per informed parameters.
• value knowledge item: this use case allows the developer to rate the utility of a consulted knowledge item.
• find experts: this use case allows the developer to find and select experts with a desired profile, as well as to view the profiles of the experts found. it works as a yellow pages system.

figure 3. functionalities of tkmp

table 1. relationships between the survey questions (sq) and the research questions (rq) from the mapping study and roost
sq1. in which activities of a testing process is km more useful? | roost: testing process and activities sub-ontology
sq2. in which activities of testing planning is km more useful? | roost: testing process and activities sub-ontology
sq3. a test environment consists of, among others, human resources, hardware, and software. about which of these resources is it more important to have knowledge available when defining the test environment? | roost: testing environment sub-ontology
sq4. in which testing level is km more useful? | roost: testing process and activities sub-ontology
sq5. what is the type of knowledge you consider to be more important during the software testing process? | mapping study: rq7. what are the types of knowledge items typically managed in software testing?
sq6. regarding the types of knowledge items listed below, indicate the importance of generating explicit knowledge from tacit knowledge. | mapping study: rq7
sq7. regarding testing artifacts, which are the ones you judge to be more appropriate for reuse? | mapping study: rq7; roost: testing artifacts sub-ontology
sq8. what is the purpose of applying km in software testing? | mapping study: rq6. what are the purposes of employing km in software testing?
sq9. what benefits can km bring to software testing? | mapping study: rq9. what are the main benefits and problems reported regarding applying km in software testing?

figure 4 shows a partial conceptual model of tkmp. this model focuses on knowledge items, particularly on test cases. classes in gray are derived from roost, i.e., the roost conceptual model was used as the starting point for specifying tkmp, mainly to structure its knowledge repository regarding the software testing notions. information from the software tools that compose the testing environment of the icammh project was used as the basis for identifying attributes and enumerated types, to specify tkmp in detail. these tools are testlink (http://testlink.org/), a web-based test management system, and mantisbt (http://www.mantisbt.org/), a bug tracking system.

figure 4. conceptual model of tkmp

testlink offers support for test cases, test suites, test plans, test projects, user management, and reports. mantisbt is a bug (or defect) tracking system; however, it is often configured by users to serve as a more generic issue tracking system and project management tool. in the case of the icammh project, mantisbt was customized to deal with two categories of requests: activity-related requests and defect-related requests. in the context of the icammh project, an integration scheme between testlink and mantisbt was used. testlink can integrate with mantisbt, allowing a test case to be associated with a defect-related request. thus, all incidents registered in mantisbt as defect-related requests were conditioned to the existence of a test case in testlink. the tkmp project and requirements specifications are currently available at https://cutt.ly/kybolun.

4.4 load existing knowledge items
once tkmp was developed, previously existing knowledge items from the two projects were loaded into the knowledge repository. initially, tkmp's knowledge repository was populated with 1568 test cases extracted from the icammh project. next, other test cases from the sia project were also inserted into tkmp's knowledge repository using tkmp's functionalities. in the context of the icammh project, test case related information was stored both in testlink and in mantisbt. each one of these tools has its own data repository, implemented in different ways, demanding an analysis of the structure of each one to load the data. moreover, each tool has its own terminology to represent the manipulated data, i.e., different terms are used to represent the same concept. thus, to load existing test cases, a feature was developed to connect to and get data from the repositories of mantisbt and testlink, and then to convert them into objects (instances) of the data schema of tkmp. roost was used for mapping the concepts from the involved tools. this procedure is illustrated in figure 5.

figure 5. loading existing knowledge items
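the mapping step in this procedure can be pictured as a simple term-normalization pass before import. the sketch below is hypothetical: the field names and the mapping table are invented for illustration, standing in for the roost-based mapping between testlink/mantisbt terms and the tkmp schema.

# illustrative sketch of the load-and-reengineer step (figure 5): records
# exported from different tools use different terms for the same concept,
# so an ontology-based mapping normalizes them before import into the kms.
# the field names and the mapping table below are hypothetical.
TERM_MAP = {
    # tool-specific term -> common (ontology-based) term
    "tc_importance": "importance",
    "urgency": "priority",
    "bug_state": "issue_status",
}


def reengineer(record: dict) -> dict:
    # rename tool-specific fields to the shared vocabulary; fields already
    # using the common term pass through unchanged.
    return {TERM_MAP.get(key, key): value for key, value in record.items()}


testlink_like = {"tc_importance": "medium", "urgency": "normal"}
mantis_like = {"bug_state": "resolved"}
print(reengineer(testlink_like))  # {'importance': 'medium', 'priority': 'normal'}
print(reengineer(mantis_like))    # {'issue_status': 'resolved'}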
in this step, we decided to mine the stored knowledge items, since mined items were also considered in the scope of the km initiative (see section 4.2). data mining was performed on the icammh data. to create the mined items, the association rule method was used. the apriori algorithm is one of the best-known association rule algorithms; it can work with a large number of attributes, generating various combinations among them. for the generation of the associations with the apriori algorithm, the waikato environment for knowledge analysis (weka) tool was used (witten et al., 2005). weka is a collection of machine learning algorithms for data mining tasks. a brief explanation of how this item can be generated is given below. considering only those test cases that failed, 415 records were returned from a query in the knowledge repository. table 2 presents the first 20 records returned and the 8 attributes considered in this data mining, corresponding to the classes human resource, test case, incident, and issue. after loading the data set, the apriori algorithm was executed using the weka tool. weka returns the 10 most important associations; this number can be changed in the algorithm settings. the listing in table 3 shows the associations that were found. analyzing the rules, some conclusions can be inferred. the fifth rule, for example, shows that out of 219 recorded incidents with status resolved and resolution priority normal, the importance of the test cases is medium in 210 of them. this is quite reasonable: when the importance of the test case is considered medium, an incident generated by this test case also tends to have a normal correction priority. as with all the other rules, one realizes that the presented associations are consistent. regarding the associations returned with the tkmp data, no irregularities were detected. in this case, it is concluded that the classes used to generate the associations were registered by the project members following the correct patterns. however, more classes could be incorporated into the associations to allow more analyses of the data. furthermore, other mining algorithms could be used. by using association rules combined with other mining methods, one could detect behaviors not seen by the naked eye, for example, noticing that a certain type of defect tends to appear when a certain software component is changed, or that the severity of a test case is always major when it is related to a particular module. behaviors like these could help the responsible expert in project decisions related to the tests being conducted. for the registration of a knowledge item of the mined item type in tkmp, generic information about that item was considered, given the diversity of methods and algorithms that exist in data mining. in the conceptual model of tkmp (figure 4), the mineditem class shows the attributes that are available for the registration of a mined item. the attributes are: description, algorithm, result, and analysis.
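the mining run above used weka's apriori implementation; for readers who prefer a scriptable sketch, an analogous run can be set up in python with the mlxtend library (our substitution, not the authors' tooling). the toy transactions below imitate the attribute=value style of tables 2 and 3; they are invented, not the study's data.

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# each transaction is one failed-test-case record encoded as attribute=value items
transactions = [
    ["importance=medium", "issuestatus=resolved", "priority=normal",
     "resolutionstatus=fixed"],
    ["importance=medium", "issuestatus=resolved", "priority=normal",
     "resolutionstatus=fixed"],
    ["importance=high", "issuestatus=closed", "priority=high",
     "resolutionstatus=fixed"],
]

# one-hot encode the transactions into a boolean dataframe
encoder = TransactionEncoder()
onehot = pd.DataFrame(encoder.fit(transactions).transform(transactions),
                      columns=encoder.columns_)

# mine frequent itemsets, then derive high-confidence rules, analogous to
# the "x ==> y conf:(...)" listing in table 3
frequent = apriori(onehot, min_support=0.5, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.95)
print(rules[["antecedents", "consequents", "support", "confidence"]])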
4.5 evaluate the testing kms
although tkmp is still considered a prototype, built as a proof of concept for the ontot-km approach, we decided to conduct different evaluations of this kms in order to get the impressions of software professionals about having a kms available for customization. tkmp went through a preliminary evaluation in two steps. firstly, tkmp was evaluated by the leaders of the two projects, icammh and sia. secondly, tkmp was made available on the web, and software engineering practitioners were invited to use it and then answer a questionnaire giving feedback in terms of usefulness, usability, and functional correctness.

4.5.1 evaluation with the project leaders
once tkmp's knowledge repository was populated with data from the two real projects (icammh and sia), demonstrations with the data obtained from the projects were made, and the leaders were requested to use and analyze the portal. then, we interviewed them in order to collect their opinions and/or impressions about tkmp. the interview was conducted in an unstructured manner and anonymously; this configuration allowed information to emerge more freely. we began the interview by considering three open questions to serve as a guide: "what is the perceived usefulness of tkmp?", "do you think it's easy to learn to use tkmp?", and "do you notice inconsistencies when using tkmp?". open questions allowed respondents a wide range of answers and diverse discussions about the tool. some of the leaders' comments on tkmp are presented below. the leaders of both projects stressed the importance of such a system to better support the software testing processes. positive responses were given by the leaders regarding tkmp in terms of usefulness, usability, and inconsistencies. with respect to the icammh project, the leader observed that there was always a great loss of knowledge due to the turnover rate of the team members. in her words, "a kms such as tkmp would be indeed beneficial for finding similar test cases to be reused in the design of new ones for other similar situations in different modules and future projects". with respect to the sia project, the leader's evaluation was that tkmp would be very important for dealing with critical systems. however, he pointed out that a challenge would be to change the team members' culture, because many times the team is not ready or does not accept new concepts, tools, and ideas.

table 2. attributes analyzed (first 20 records)
test case author | execution author | importance | severity | reproducibility | issue status | priority | resolution status
7 | 7 | high | minor bug | always | closed | normal | fixed
7 | 7 | high | major bug | always | closed | normal | fixed
7 | 7 | high | minor bug | always | closed | normal | fixed
7 | 7 | high | crashes the application or os | always | closed | high | fixed
7 | 7 | high | major bug | always | closed | normal | fixed
7 | 7 | high | major bug | always | closed | high | fixed
7 | 7 | high | major bug | always | closed | high | fixed
7 | 7 | high | major bug | always | closed | high | fixed
7 | 7 | high | major bug | always | closed | high | fixed
7 | 7 | high | minor bug | always | resolved | normal | fixed
7 | 7 | high | major bug | always | closed | normal | fixed
7 | 7 | high | major bug | always | resolved | high | fixed
7 | 2 | high | minor bug | have not tried | resolved | normal | fixed
7 | 7 | high | major bug | always | closed | normal | fixed
7 | 2 | high | minor bug | have not tried | closed | normal | fixed
7 | 2 | high | minor bug | have not tried | closed | normal | fixed
7 | 2 | high | minor bug | have not tried | resolved | normal | fixed
7 | 7 | high | minor bug | always | closed | normal | not a bug
7 | 2 | high | major bug | have not tried | closed | normal | fixed
7 | 2 | high | minor bug | have not tried | resolved | normal | fixed
results of the associations rule associations 1 issuestatus=resolved 236 ==> resolutionstatus=fixed 235 conf:(1) 2 importance=medium issuestatus=resolved 226 ==> resolutionstatus=fixed 225 conf:(1) 3 issuestatus=resolved priority=normal 219 ==> resolutionstatus=fixed 218 conf:(1) 4 importance=medium issuestatus=resolved priority=normal 210 ==> resolutionstatus=fixed 209 conf:(1) 5 issuestatus=resolved priority=normal 219 ==> importance=medium 210 conf:(0.96) 6 issuestatus=resolved priority=normal resolutionstatus=fixed 218 ==> importance=medium 209 conf:(0.96) 7 issuestatus=resolved 236 ==> importance=medium 226 conf:(0.96) 8 issuestatus=resolved resolutionstatus=fixed 235 ==> importance=medium 225 conf:(0.96) 9 issuestatus=resolved priority=normal 219 ==> importance=medium resolutionstatus=fixed 209 conf:(0.95) 10 issuestatus=resolved 236 ==> importance=medium resolutionstatus=fixed 225 conf:(0.95) times the team is not ready or does not accept new concepts, tools, and ideas. 4.5.2 evaluation by software engineering practitioners tkmp was also evaluated by 43 practitioners in software engineering, and it was based on gqm, tam, and functional correctness. the evaluation based on the gqm paradigm involved four steps: planning, definition, data collection, and interpretation. (i) planning and definition. at gqm’s conceptual level, measurement goals should be defined. we identified three goals for this evaluation, and from these goals, at the operational level, we defined seven questions, as table 4 shows. finally, at the quantitative level, we defined metrics associated with the questions, in order to answer them measurably. for each question, as table 5 shows, we defined five metrics, each one aiming at computing the number of participants that strongly disagree (mg.q.1), disagree (mg.q.2), neither agree nor disagree (mg.q.3), agree (mg.q.4), or strongly agree (mg.q.5) with a statement corresponding to the question. figure 6 summarizes the gqm approach we followed. table 6 presents the statements that we used to represent the questions in the questionnaire that participants answered. questions q1.1–q1.4 were used to characterize the portal usefulness, questions q2.1–q1.2 were used to collect data on the level of usability. question q3.1 was used to evaluate tkmp functional correctness. table 7 shows how to interpret the results. the lines should be read as “if <> then <>”. for example, the interpretation of question 3.1 (q3.1) is “if m1+m2 > m4+m5 then the users do not notice inconsistencies when using the tkmp”, where m1, m2, m4, m5 are the responses given by the participants (metrics). it is important to notice that m1 and m2 (see table 5) are answers that totally or partially disagree with the question. on the other hand, m4 and m5 are answers that totally or partially agree with the question. in addition to the questions created using gqm’s condevelopment of an ontology-based approach for knowledge management in software testing: an experience report souza et al. 2021 table 4. defined goals and questions g1: evaluate tkmp usefulness q1.1. what is the perceived usefulness of tkmp regarding creating software testing knowledge items? q1.2. what is the perceived usefulness of tkmp regarding searching for software testing knowledge items? q1.3. what is the perceived usefulness of tkmp regarding reusing software testing knowledge items? q1.4. what is the perceived global usefulness of tkmp? g2: evaluate tkmp usability q2.1. 
q2.1. to what extent do users recognize that it is easy to learn to use tkmp? (learnability)
q2.2. to what extent do users recognize that tkmp is appropriate for their needs? (appropriateness recognizability)
g3: functional correctness
q3.1. do users notice inconsistencies when using tkmp?

table 5. metrics used in the gqm
mg.q.1 | number of participants who strongly disagree
mg.q.2 | number of participants who disagree
mg.q.3 | number of participants who neither agree nor disagree
mg.q.4 | number of participants who agree
mg.q.5 | number of participants who strongly agree

table 6. statements used to refer to the questions
q1.1 | tkmp is useful to create software testing knowledge items.
q1.2 | tkmp is useful to search for software testing knowledge items.
q1.3 | tkmp is useful to reuse software testing knowledge items.
q1.4 | i would use or recommend the tkmp.
q2.1 | i learned to use the tkmp quickly.
q2.2 | i recognize tkmp as being suited to my tester needs.
q3.1 | i did not notice inconsistencies when using the tkmp.

in addition to the questions created at gqm's conceptual level, at the end of the questionnaire we presented three open questions to the professionals, allowing the participants to externalize their opinion about tkmp in terms of good points, bad points, and general comments.
(ii) data collection. the data used to evaluate tkmp were based on the metrics presented above. to collect the data, we asked experts in software organizations to use tkmp to perform activities to create, validate, and search for knowledge items. after using the tool, 43 participants answered a questionnaire containing the questions previously presented. considering the participants' profile, out of these 43, 8 hold doctoral degrees, 13 hold master's degrees, and 22 finished undergraduate programs. all of them are from the software engineering area, with an average of six years of experience in the area. in relation to software testing knowledge, 42.9% of the participants reported having basic knowledge, 37.2% reported having intermediate knowledge, and 23.3% considered having advanced knowledge of software testing. a summary of the responses given by the participants is shown in table 8, which presents the number of responses according to the goals, questions, and metrics used.

figure 6. gqm approach to evaluate the tkmp

(iii) interpretation. figures 7, 9 and 10 present charts that show the answers per question used in our gqm model. these answers were interpreted according to table 7.
goal 1: evaluate tkmp usefulness. figure 7 presents the chart generated from the answers related to tkmp usefulness. applying the interpretation expressions shown in table 7 to this goal, the results show that the participants considered tkmp a useful tool for managing software testing knowledge items. regarding tkmp usefulness, we also carried out an analysis separating the 43 participants by professional position: professionals directly related to software development companies (23 professionals) and professionals directly related to scientific research (22 participants). this separation by position allowed us to infer how the software industry and the academic environment view the usefulness of the investigated topic. figure 8 presents the chart generated from the answers related to tkmp usefulness by position. in general, the analysis of the metrics in this chart shows that, both for industry professionals and for researchers, tkmp is a type of tool that they would use or recommend, especially for research-related professionals (14 strongly agree).
despite the interest, industry professionals presented a lower perception of the usefulness of tkmp than academic researchers. the sm conducted by souza et al. (2015a) investigated the main problems reported in the implementation of km initiatives in software testing in organizations. the main problems mentioned were that km systems are not yet appropriate, that employees are normally reluctant to share their knowledge, and the increased workload. we believe that these problems may be related to the participants' responses.

table 7. results interpretation
01 | for q1.i, i=1 to 4: m1+m2 > m4+m5 | tkmp is not useful for managing software testing knowledge items.
02 | for q1.i, i=1 to 4: m1+m2 < m4+m5 | tkmp is useful for managing software testing knowledge items.
03 | for q1.i, i=1 to 4: m3 > m4+m5 or m1+m2 = m4+m5 | we cannot say that tkmp is useful for managing software testing knowledge items.
04 | for q2.1 and q2.2: m1+m2 > m4+m5 | tkmp cannot be easily used to manage software testing knowledge items.
05 | for q2.1 and q2.2: m1+m2 < m4+m5 | tkmp can be easily used to manage software testing knowledge items.
06 | for q2.1 and q2.2: m3 > m4+m5 or m1+m2 = m4+m5 | we cannot say that tkmp can be easily used to manage software testing knowledge items.
07 | for q3.1: m1+m2 > m4+m5 | tkmp cannot be considered functionally correct.
08 | for q3.1: m1+m2 < m4+m5 | tkmp can be considered functionally correct.
09 | for q3.1: m3 > m4+m5 or m1+m2 = m4+m5 | we cannot say whether tkmp is functionally correct or not.

table 8. results summary
goal | question | m1 | m2 | m3 | m4 | m5 | total
g1 | q1.1 | 0 | 0 | 0 | 19 | 24 | 43
g1 | q1.2 | 2 | 2 | 8 | 17 | 14 | 43
g1 | q1.3 | 0 | 1 | 7 | 16 | 19 | 43
g1 | q1.4 | 0 | 4 | 9 | 8 | 22 | 43
g2 | q2.1 | 0 | 2 | 11 | 17 | 13 | 43
g2 | q2.2 | 1 | 0 | 11 | 17 | 14 | 43
g3 | q3.1 | 3 | 5 | 14 | 14 | 7 | 43

on the other hand, in the academic area, there is considerable growth in research on km and software engineering. bjørnson and dingsøyr (2008) had already pointed out, in 2008, the growing interest in research on km in software engineering, and this interest continues to these days (menolli et al., 2015; vasanthapriyan et al., 2015; pinto et al., 2018; napoleão et al., 2021).

figure 7. questions and answers related to usefulness of tkmp

goal 2: evaluate the usability of tkmp. figure 9 presents the chart generated from the answers related to usability. the results showed that the participants considered that tkmp can be easily used to manage software testing knowledge items.

figure 8. questions and answers related to usefulness by position
figure 9. questions and answers related to usability of tkmp

goal 3: evaluate the functional correctness of tkmp. figure 10 presents the chart related to functional correctness. the results show that tkmp can be considered functionally correct. however, even though the metrics indicate that most participants did not find inconsistencies severe enough to prevent them from using tkmp, figure 10 shows that some participants did find inconsistencies. we consider this a normal result, since tkmp is still a prototype.

figure 10. question and answers related to functional correctness of tkmp.
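the decision rules of table 7 boil down to simple comparisons over the five counts of each question. as an illustration only (the class and method names are ours, not part of tkmp), the verdicts above can be derived from the table 8 counts as follows:

```java
public class GqmInterpretation {
    /** verdicts follow table 7: positive when agreement outweighs disagreement. */
    enum Verdict { POSITIVE, NEGATIVE, INCONCLUSIVE }

    // m1..m5 are the counts from strongly disagree to strongly agree (table 5)
    static Verdict interpret(int m1, int m2, int m3, int m4, int m5) {
        int disagree = m1 + m2;
        int agree = m4 + m5;
        if (m3 > agree || disagree == agree) return Verdict.INCONCLUSIVE; // rules 03/06/09
        return agree > disagree ? Verdict.POSITIVE : Verdict.NEGATIVE;    // rules 01-02/04-05/07-08
    }

    public static void main(String[] args) {
        // q3.1 counts from table 8 (3, 5, 14, 14, 7) yield POSITIVE,
        // i.e., tkmp can be considered functionally correct
        System.out.println(interpret(3, 5, 14, 14, 7));
    }
}
```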
as mentioned in the questionnaire planning, we presented three open questions so that professionals could externalize good points, bad points, and general comments about tkmp. as can be seen in figure 7 (usefulness), figure 9 (usability), and figure 10 (functional correctness), some professionals chose the options "strongly disagree" or "disagree" in the tkmp evaluation. we analyzed the open-ended responses that these practitioners wrote in order to identify improvements to the tool and, consequently, to the approach. when analyzing the responses of these 10 participants, most of the comments were related to the functional correctness analysis (figure 10). in general, we noticed that many of the observations concerned the small inconsistencies identified in the tool. two of the professionals, for example, mentioned that the search for knowledge items could be made faster. one of the professionals mentioned that when a query returns a lot of data from the database, more advanced implementation strategies can be used to optimize this process (a possible strategy is sketched at the end of this subsection). other comments for improvement were: the system would work better with images; allow access to instructional help in any part of the tool; keep all fields in the tool case sensitive; and allow sharing of information via email, as well as sending evaluations of knowledge items by email. it is worth noting that tkmp is a prototype built as a proof of concept. despite this, all suggestions for improvement will be considered in the evolution of this research. it is also possible to notice, especially in the charts of figures 7 and 9, that a considerable number of participants chose the option "neither agree nor disagree" for tkmp usefulness and usability. when analyzing the responses of these participants separately (15 participants), we did not find any pattern that justified this choice. we only noted that, concerning the level of knowledge in software testing, 10 of these participants mentioned having basic or intermediate knowledge. we cannot state it for sure, but we believe that less experience with software testing may have some influence on the answers about tkmp utility and usability.
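one common strategy for the large-result-set issue raised above is keyset pagination, which fetches one page at a time instead of materializing the whole result. the following is a minimal jdbc sketch, with a hypothetical knowledge_item table and column names, assuming a dbms that supports the limit clause; it is not tkmp's actual implementation.

```java
import java.sql.*;
import java.util.*;

public class PagedSearch {
    /** fetches one page of knowledge items after the given id, so the portal
     *  never loads the whole result set into memory at once. */
    static List<String> nextPage(Connection conn, long lastSeenId, int pageSize)
            throws SQLException {
        String sql = "select id, title from knowledge_item "
                   + "where id > ? order by id limit ?"; // hypothetical table
        List<String> titles = new ArrayList<>();
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setLong(1, lastSeenId);
            ps.setInt(2, pageSize);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    titles.add(rs.getLong("id") + ": " + rs.getString("title"));
                }
            }
        }
        return titles;
    }
}
```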
4.6 other partial applications of ontot-km
we also started to apply ontot-km in software organizations. three companies evaluated ontot-km and tkmp. first, we conducted the diagnosis and scope definition activities in these three companies by applying a questionnaire based on the survey presented in souza et al. (2015b). the respondents were the software testers responsible for the software testing activities within the companies. for privacy reasons, we do not mention the companies' names; however, some of their characteristics are: located in brazil; medium-sized software organizations; and their main products are systems for the fiscal area, such as electronic fiscal receipts and metrology, as well as customized systems to meet the needs of customers from diverse segments. the main diagnosis results for the three companies are: (i) "test case design" was the testing activity in which km is most useful; (ii) "test environment structuring" was the testing planning activity in which km is most useful; (iii) "human resource" and "software resource" are considered the resources about which it is quite important to have knowledge available at the time of setting up the test environment; (iv) explicit knowledge was considered more important than tacit knowledge; (v) "test plan" and "test case" were considered the most reusable artifacts; (vi) there is no formal instrument for km within the three companies; and (vii) "increasing the testing process efficiency" and "best test case selection" are the main expected benefits of applying km in software testing. these results were very close to those obtained in the general survey applied to the 86 participants in souza et al. (2015b). from the diagnosis results in the companies, it was possible to establish the scope for the testing km initiatives: test plan definition and test case design were considered the software testing activities to be supported first, and test cases the main knowledge item to be managed. until now, we have not conducted the remaining activities of the ontot-km approach (develop a testing kms, load existing knowledge items, and evaluate the testing kms), although the companies have shown interest in developing their own kms solutions. we intended that the companies would also use tkmp. since the diagnosis results were similar, we proposed to the three companies the use of the tkmp already developed within the research project's scope. the companies were free to upload their own data into tkmp or to register new data if they wished. the purpose was to analyze whether an already existing km tool, such as tkmp, could be customized by an organization to meet its current needs. some suggested customizations were: (i) to implement a traceability matrix between test cases and lessons learned in order to assist test coverage; (ii) to develop a repository of artifacts (a historical basis); and (iii) to turn tkmp into a plug-in that integrates with project management tools (e.g., jira, redmine). the two fronts analyzed in this study were well accepted by the three organizations that participated in the survey. the participants mentioned that it is interesting to have their own kms solution built using ontot-km; however, they mentioned this would only be possible if the company had a team to develop the system. on the other hand, it is also attractive to have a more general, open source kms available to be customized by the company. we believe that the experience gained from the evaluations of both ontot-km and tkmp gives us motivation and directions for future work; for example, we intend to consider some of the customization suggestions and enhancements to be implemented in tkmp, since there is an intention to create a robust version of the portal to be made available to the software testing community.
4.7 study limitations
there were limitations in this study. the first limitation refers to the low representativeness of the companies participating in the study (3 companies). validating an approach such as ontot-km in a real environment needs the authorization and trust of the organization to use its data and information and to allocate employees for system development.
however, we noticed an enormous barrier here. several other companies were invited to apply ontot-km; while they recognized the benefits of km in software testing, many refused to participate. the invited organizations mentioned that the idea of implementing km in the organization, even with an existing tool, could generate an increased workload. this position is in line with the results detected through the sm conducted in souza et al. (2015a): shortage of time is a potential risk to incorporating km principles in software testing, because knowledge sharing can imply increasing employee workload and costs. we intend to continue inviting software companies to participate in the research and to look for strategies that allow a company to feel safe in relation to using or building a kms, for example, allowing the company to install the system on-premises and on its own database server. a second limitation of this research concerns the sample size of the software engineering practitioners who answered the questionnaire: 43 practitioners answered it, of whom 23 are professionals directly related to software development companies. the results cannot be generalized. therefore, we intend to replicate this survey with as many software practitioners as possible in real projects in the industry. in addition, we also intend to conduct interviews with these professionals. the purpose of the interviews is to better understand the responses that professionals gave about tkmp, for example, to better understand the reasons that led professionals to choose the option "neither agree nor disagree", as can be seen in figures 7, 9 and 10. another limitation is related to step 1 of ontot-km. this step was not employed in either project (icammh and sia); for this step, we used the results of a survey, so the diagnosis step was not made exclusively for the projects in question. however, some survey participants were team members and leaders of the icammh and sia projects, and we believe that their participation may have helped to achieve a specific diagnosis for the projects under study.
5 related work
different approaches to the development of kmss can be found in the literature. dehghani and ramsin (2015) provided a review of seven methods for kms development. in general, these methods provide activities, principles, and techniques intended to apply km in organizations (r-montano et al., 2001; calabrese and orlando, 2006; chalmeta and grangel, 2008; iglesias and garijo, 2008; sarnikar and deokar, 2010; moteleb et al., 2009; amine and ahmed-nacer, 2011). some of these kms methodologies are presented below. chalmeta and grangel (2008) presented a methodology called km-iris. km-iris was defined at a general level and can be used as a guide to managing knowledge in any kind of organization. the methodology is divided into five phases: (i) analysis and identification of the target knowledge; (ii) extraction of the target knowledge; (iii) classification and representation; (iv) processing and storage, in which an operational kms is implemented; and (v) utilization and continuous improvement by using the kms. chalmeta and grangel (2008) mention that ontologies can be used in the first phase of the methodology, that is, after identifying the knowledge, this knowledge can be detailed based on an ontological classification so that it can be represented, processed, and used in a later phase.
ontologies are also suggested by chalmeta and grangel (2008) for the second phase of the methodology. ontot-km also has guidelines to identify the target knowledge, called knowledge items, and these items should be ranked. ontot-km is based on a testing ontology, and both the diagnosis of the testing environment and the development of the kms are strongly related to this ontology. in r-montano et al. (2001), a methodology to develop a kms was presented. the phases of this methodology are as follows: (i) strategic planning; (ii) modeling of logical and physical aspects by specifying the strengths and weaknesses of the organizational km process; (iii) development of the kms prototype; (iv) verification and validation of the kms through practical usage of the system; and (v) deployment and maintenance of the kms. similarly to the methodology of r-montano et al. (2001), ontot-km also proposes a planning stage, called diagnosis, as well as the generation of models to support the construction of a kms and its validation. however, in ontot-km these main activities are supported by a software testing ontology. calabrese and orlando (2006) presented a methodology that consists of 18 phases: (i) km principles and governance; (ii) organizational structure and sponsorship; (iii) requirements analysis; (iv) measurement; (v) knowledge audit; (vi) initiative scoping; (vii) prioritization; (viii) technology solution assessment; (ix) planning the development of the kms; (x) knowledge elicitation; (xi) building the kms; (xii) verification and validation of the kms; (xiii) review and update of the kms; (xiv) knowledge maintenance processes; (xv) communication and change management; (xvi) training and publishing the kms; (xvii) maintenance and support; and (xviii) measurement and reporting. in general, the methodology presented by calabrese and orlando (2006) is a detailing of the process for constructing a kms. for example, in the ontot-km process (figure 2) it is possible to notice that, after the evaluation activity of the testing kms, improvements can be made to the system by returning to previous process activities; however, we do not treat this action as an explicit activity but in the form of a relationship arrow. in the case presented by calabrese and orlando (2006), on the other hand, this action is considered in phases (xiii), (xiv), (xv), (xvii) and (xviii). in sarnikar and deokar (2010), a methodology is presented to direct the development process based on the workflows within the organization. the methodology consists of 7 design steps: (i) business process model development for the organization; (ii) knowledge intensity identification; (iii) requirements identification; (iv) knowledge sources identification; (v) knowledge reuse assessment; (vi) task-user knowledge profile development; and (vii) design of the system components to support the tasks investigated in the previous phases. differently from the ontot-km process and also from the other methodologies reviewed in dehghani and ramsin (2015), the sarnikar and deokar (2010) methodology addresses the design and construction of the kms only in its last phase. iglesias and garijo (2008) presented a methodology that is not specifically targeted at developing a kms but can be effectively used for this purpose.
iglesias and garijo (2008) proposed the mas-commonkads methodology, which extends object-oriented and knowledge engineering techniques for the conceptualization of multi-agent systems. the phases of the methodology are as follows: (i) obtain the initial view of the problem domain; (ii) discover system requirements; (iii) design the system; (iv) develop and test; and (v) operate and maintain the system. in the design phase, an initial set of agents is determined and a model is developed; the communication between the agents is expressed in an ontology. in amine and ahmed-nacer (2011), an ontology-based agile methodology is presented to develop a kms that reduces the risks of component-based development by managing the knowledge needed for component selection, update, and maintenance. the phases are as follows (the last four phases are iterative): (i) initialization, whose main objective is to gain the deepest possible understanding of the organization, and in which an initial ontology of the organization's domain can be created; (ii) domain mapping, which continuously refines the problem domain ontologies created in the initialization phase; (iii) profiles and policies identification, which specifies the authentication mechanisms and the level of system access allowed for each user; (iv) implementation and personalization of the kms; and (v) verification and validation of the kms. the phases of the methodology proposed by amine and ahmed-nacer (2011) are very similar to those of ontot-km. as with amine and ahmed-nacer (2011), we also use the resources of an ontology. however, unlike amine and ahmed-nacer (2011), the ontology we use is not created based on the organization; rather, it is an already validated domain ontology that aims at establishing a common conceptualization of the software testing domain. finally, the methodology presented by moteleb et al. (2009) aims at using practical experiences for developing kmss in small organizations. the methodology is divided into five phases: (i) sense-making, which investigates whether kms development is a conceivable solution for the organizational problems; (ii) categorization of the conceivable solutions through communication with the stakeholders; (iii) design of the system based on the solutions presented in the previous phase; (iv) specification of the appropriate technologies based on the technical, social, and organizational features of the kms; and (v) monitoring and maintenance of the kms. ontot-km also analyzes the solution for the organization in the diagnosis phase, as well as the design to construct the kms; however, as mentioned, ontot-km is supported by a software testing ontology, since this domain is the goal of ontot-km. table 9 presents a brief comparison of the related work discussed above that presented approaches to the development of systems for supporting km. to the best of our knowledge, there is no method devoted to developing a kms for supporting km in software testing. therefore, we also compared the system developed using ontot-km (tkmp) with other works addressing km in software testing. these works are among the ones selected in the mapping study on initiatives applying km in software testing presented in souza et al. (2015a); the studies retrieved in this mapping were used here as a baseline for comparison with our work. most of these studies provide automated support for managing testing knowledge by employing a kms.
in addition, the mapping results point out that test case reuse has been the major focus of these initiatives. these results are in line with the findings of the survey that guided us in the development of tkmp, namely that test cases are the main knowledge item to be managed. in janjic and atkinson (2013), an automated test recommendation approach that proactively makes test case suggestions while a tester is designing test cases was presented. the authors developed a prototype of an automated, non-intrusive test recommendation system called test tenderer. a search engine, called sentre, uses the current test case to search for reusable, semantically matching components. analogously to janjic and atkinson (2013), test case design was considered the software testing activity to be supported by tkmp. however, test tenderer addresses unit testing, while tkmp is more general. moreover, although janjic and atkinson say that sentre searches for reusable, semantically matching components, the heuristics applied are name-based searches. in tkmp, in turn, the knowledge repository is structured based on roost, which is also used as the basis for the search functionality. finally, test tenderer works non-intrusively in the background and integrates smoothly into normal working environments: the developers' normal working practices are not disturbed, and they only need to break away from the task of writing new test cases to consider already existing tests suggested by the recommendation engine. tkmp, on the other hand, does not proactively suggest test cases; testers must issue a query to retrieve similar test cases. the technologies to support km in software testing were another important question investigated by the mapping. the mapping showed that knowledge maps/yellow pages seem to have good results. a knowledge map contains information about the experiences that employees possess. in liu et al. (2009), for instance, a km model was created whose main components include a knowledge map repository; the system identifies, by means of statistics, the staff with certain knowledge, improving the culture of knowledge sharing in the enterprise. analogously, tkmp also provides a yellow pages feature.

table 9. characteristics of different approaches to the development of kmss
approach | objective | number of phases | ontology | evaluation
chalmeta and grangel (2008) | methodology for directing the process of developing and implementing a kms in any type of organization | 5 | ontologies are suggested for the steps (i) analysis and identification of the target knowledge and (ii) extraction of the target knowledge | the methodology was applied to a large textile enterprise
r-montano et al. (2001) | recommendations to develop a kms | 5 | - | -
calabrese and orlando (2006) | process for a comprehensive kms | 12 | - | a sensitivity/realism assessment using an actual configuration management application was conducted to demonstrate the utility of the process
sarnikar and deokar (2010) | a design process for kms | 7 | - | the design process was validated by demonstrating its feasibility and comparing the approach with other modeling approaches
iglesias and garijo (2008) | mas-commonkads methodology, which extends object-oriented and knowledge engineering techniques for the conceptualization of multi-agent systems | 5 | an ontology can be used in the communication between the agents | a case study was conducted in a travel agency context
amine and ahmed-nacer (2011) | implementation of a kms using component-based software engineering (cbse) | 5 | an ontology-based agile methodology was used | a case study of the application of the methodology was conducted in a software organization
moteleb et al. (2009) | use of practical experiences for developing kmss in small organizations | 5 | - | the approach was validated in practice by an inquiry into a number of problems experienced by particular organizations
ontot-km | development of an ontology-based approach for km in software testing | 5 | a reference ontology on software testing was used | a kms was developed as proof of concept and evaluated in terms of usefulness, usability, and functional correctness

li and zhang (2012) present a knowledge management model in which one of the elements is also a knowledge map. this model is based on an ontology of reusable test cases; however, this ontology has limited coverage when compared with roost.
6 conclusions
this work presented our experience in developing an approach to assist in launching km initiatives in software testing. ontot-km provides guidelines to apply km through the development of kmss, based on a software testing ontology. although there are approaches for developing kmss (dehghani and ramsin, 2015), to the best of our knowledge there is no approach devoted to developing a kms for supporting km in software testing; in this respect, ontot-km is an original contribution. the results show that the kms developed from ontot-km is a potential system for managing knowledge in software testing, so the approach can guide km initiatives in software testing. an approach like ontot-km can support different scenarios in software development companies. organizations that develop different products or product lines, for example, have a large turnover of knowledge when compared to organizations that build specific software for each client/project (matturro and silva, 2005). hence, the reuse of testing knowledge becomes more frequent in the later phases of software development. thus, a km system such as tkmp would allow searching for solutions to similar problems registered in the tool. reuse is related not only to similar test cases, but also to lessons learned, best practices, and patterns of behavior in the project that can be identified through mined items and that can be reused or at least assist in project decisions. in relation to the ontot-km evaluation, in this work we intended to evaluate both the approach and the generated kms.
now we intend to apply the diagnosis to as many software development companies as possible in order to reach a common scope for developing a more general kms. this kms will be part of an environment already maintained by this research project, called software engineering knowledge management diagnosis (seknow) (santos et al., 2019). currently, seknow supports only the analysis of km in software development organizations (the diagnosis step); however, given the evolution of the research, seknow has been undergoing adaptations to cover more activities related to km and software organizations. as future work, we also intend to extend tkmp considering other conceptualizations established by roost, and to conduct more experimental studies to confirm the results of the evaluations discussed in this paper. as mentioned earlier, we will apply the km diagnosis in software development companies that maintain projects in different domains, with agile or traditional development. the objective of the km diagnosis is to measure the organization's current state of km. a km diagnosis can help the company to understand its real needs before devoting costly efforts to km implementation, and thus better target km application initiatives at strategic points (bukowitz and williams, 2000). conducting km diagnostics in different domains of software development can show how km activities are present in environments with agile or traditional practices. for this reason, we have been conducting a synthesis on km and agile software development (asd) (napoleão et al., 2021), which will certainly be considered in the next stages of this project. just like asd, development and operations (devops) practices are also strongly related to km. devops is a methodology that combines flexibility with rigorous testing and communication routines, aiming to deliver software efficiently and quickly (mishra and otaiwi, 2020). the adoption of devops in an organization provides many benefits, including quality, but also brings challenges, for example, knowledge reuse. it is in our interest to apply the study conducted in this work in an organization that adopts devops and to measure to what extent it is possible to manage knowledge in software testing in this context.
acknowledgements
the first author would like to thank professor ricardo de almeida falbo (in memoriam) for successfully leading this work and sharing his valuable advice. the authors would like to thank the brazilian aeronautics institute of technology (ita) and the brazilian agency of research and projects financing (finep), project 5206/06 (icammh), and the sia project for providing the data, as well as the brazilian funding agency cnpq, project 432247/2018-1. all participants who used the tkmp and answered the evaluation questionnaire are also duly acknowledged.
references
abran, a., bourque, p., dupuis, j., and moore, w. (2004). guide to the software engineering body of knowledge swebok. technical report, a project of the ieee computer society professional practices committee.
agrawal, r. and srikant, r. (1994). fast algorithms for mining association rules in large databases. in 20th international conference on very large data bases, pages 487–499.
amine, m. and ahmed-nacer, m. (2011). an agile methodology for implementing knowledge management systems: a case study in component-based software engineering. software engineering applications, 5:159–170.
andrade, j., ares, j., martinez, m., pazos, j., rodriguez, s., romera, j., and suarez, s. (2013). an architectural model for software testing lesson learned systems. information and software technology, 55:18–34.
basili, v. and rombach, h. d. (1991). support for comprehensive reuse. software engineering journal, 6:303–316.
basili, v. r., caldiera, g., and rombach, h. d. (1994). goal question metric paradigm. in encyclopedia of software engineering, new york: john wiley & sons.
bjørnson, f. o. and dingsøyr, t. (2008). knowledge management in software engineering: a systematic review of studied concepts, findings and research methods used. information and software technology, 50:1055–1068.
black, r. and mitchell, j. l. (2011). advanced software testing. rocky nook, usa, 3 edition.
bukowitz, w. and williams, r. l. (2000). the knowledge management fieldbook. financial times prentice hall, great britain.
burnstein, i. (2003). practical software testing: a process-oriented approach. springer professional computing, new york, 3 edition.
calabrese, f. and orlando, c. (2006). deriving a 12-step process to create and implement a comprehensive knowledge management system. journal of information and knowledge management systems, 3(36):238–254.
chalmeta, r. and grangel, r. (2008). methodology for the implementation of knowledge management systems. journal of the american society for information science and technology, 5(59):742–755.
davenport, t. h. and prusak, l. (2000). working knowledge. harvard business school press, boston, usa, 2 edition.
davis, f. d. (1993). user acceptance of information technology: system characteristics, user perceptions and behavioral impacts. international journal of man-machine studies, 38:475–487.
dehghani, r. and ramsin, r. (2015). methodologies for developing knowledge management systems: an evaluation framework. journal of knowledge management, 19(4):682–710.
falbo, r. a. (2014). sabio: systematic approach for building ontologies. in 8th intern. conference on formal ontology in information systems.
falbo, r. a., arantes, d. o., and natali, a. c. c. (2004). integrating knowledge management and groupware in a software development environment. in international conference on practical aspects of knowledge management, pages 94–105.
falbo, r. a., barcellos, m., nardi, j., and guizzardi, g. (2013). organizing ontology design patterns as ontology pattern languages. in extended semantic web conference, montpellier.
falbo, r. a., ruy, f. b., guizzardi, g., barcellos, m. p., and almeida, j. p. a. (2014). towards an enterprise ontology pattern language. in symposium on applied computing, gyeongju.
fayyad, u., piatetsky-shapiro, g., and smyth, p. (1996). from data mining to knowledge discovery in databases. american association for artificial intelligence, pages 37–54.
fischer, g. and ostwald, j. (2001). knowledge management: problems, promises, realities, and challenges. ieee intelligent systems, 16:60–72.
herrera, r. j. g. and martin-b, m. j. (2015). a novel process-based kms success framework empowered by ontology learning technology. engineering applications of artificial intelligence, 45:295–312.
iglesias, c. and garijo, m. (2008). the agent-oriented methodology mas-commonkads. in intelligent information technologies: concepts, methodologies, tools, and applications, information science, pages 445–468.
iso/iec (2011). iso/iec 25010: systems and software engineering - systems and software quality requirements and evaluation (square) - system and software quality models.
janjic, w. and atkinson, c. (2013). utilizing software reuse experience for automated test recommendation. in international workshop on automation of software test, pages 100–106, san francisco.
kitchenham, b. and charters, s. (2007). guidelines for performing systematic literature reviews in software engineering. technical report ebse 2007-001, keele university and durham university, uk.
li, x. and zhang, w. (2012). ontology-based testing platform for reusing. in intern. conference on internet platform for reusing, pages 86–89, henan, china.
liu, y., wu, j., liu, x., and gu, g. (2009). investigation of knowledge management methods in software testing process. in inter. conference on information technology and computer science, pages 90–94, kiev.
mathur, a. p. (2012). foundations of software testing. pearson education in south asia, india, 5 edition.
matturro, g. and silva, a. (2005). a knowledge-based perspective for preparing the transition to a software product line approach. in international conference on software product lines, pages 96–101, berlin, heidelberg.
menolli, a., cunha, m. a., reinehr, s., and malucelli, a. (2015). "old" theories, "new" technologies: understanding knowledge sharing and learning in brazilian software development companies. information and software technology, 58:289–303.
mishra, a. and otaiwi, z. (2020). devops and software quality: a systematic mapping. computer science review, 38:100308.
moteleb, a., woodman, m., and critten, p. (2009). towards a practical guide for developing knowledge management systems in small organizations. in european conference on knowledge management, pages 559–570.
myers, g. j. (2004). the art of software testing. john wiley and sons, canada, 2 edition.
napoleão, b. m., souza, e. f., ruiz, g. a., felizardo, k. r., meinerz, g. v., and vijaykumar, n. l. (2021). synthesizing researches on knowledge management and agile software development using the meta-ethnography method. journal of systems and software, 178:110973.
nonaka, i. and krogh, g. (2009). tacit knowledge and knowledge conversion: controversy and advancement in organizational knowledge creation theory. organization science, 30:635–652.
nonaka, i. and takeuchi, h. (1997). the knowledge-creating company. oxford university press, oxford, usa.
o'leary, d. and studer, r. (2001). knowledge management: an interdisciplinary approach. ieee intelligent systems, 16(1).
o'leary, d. e. (1998). enterprise knowledge management. ieee computer magazine, pages 54–61.
park, r., goethert, w., and florac, w. (1997). goal-driven software measurement. handbook cmu/sei-96-hb-002.
pinto, d., oliveira, m., bortolozzi, f., matta, n., and tenório, n. (2018). investigating knowledge management in the software industry: the proof of concept's findings of a questionnaire addressed to small and medium-sized companies. in 10th international joint conference on knowledge discovery, knowledge engineering and knowledge management kmis, pages 73–82.
r-montano, b., liebowitz, j., buchwalter, j., mccaw, d., newman, b., and rebeck, k. (2001). a systems thinking framework for knowledge management. decision support systems, 31:5–16.
rokunuzzaman, m. and choudhury, k. p. (2011). economics of software reuse and market positioning for customized software solutions. journal of software, 6:31–1029.
ruy, f. b., falbo, r., barcellos, m., costa, s. d., and guizzardi, g. (2016). seon: a software engineering ontology network. in 20th inter. conference on knowledge engineering and knowledge management (ekaw), pages 527–542.
santos, v., salgado, j. g., souza, e. f., felizardo, k. r., and vijaykumar, n. l. (2019). a tool for automation of knowledge management diagnostics in software development companies. in brazilian conference on software: theory and practice (cbsoft) tools session.
sarnikar, s. and deokar, a. (2010). knowledge management systems for knowledge-intensive processes: design approach and an illustrative example. in international conference on system sciences, pages 1–10.
souza, e. f. (2014). knowledge management applied to software testing: an ontology based framework. thesis in computer science, national institute for space research (inpe), brazil.
souza, e. f., falbo, r. a., specimille, m. s., coelho, a. g. n., vijaykumar, n. l., felizardo, k. r., and meinerz, g. v. (2020). experience report on developing an ontology-based approach for knowledge management in software testing. in 19th brazilian symposium on software quality experience reports (sbqs '20), pages 1–10.
souza, e. f., falbo, r. a., and vijaykumar, n. (2017). roost: reference ontology on software testing. applied ontology, 12:59–90.
souza, e. f., falbo, r. a., and vijaykumar, n. l. (2013). ontology in software testing: a systematic literature review. in research seminar ontology of brazil (ontobras), pages 71–82, belo horizonte.
souza, e. f., falbo, r. a., and vijaykumar, n. l. (2015a). knowledge management initiatives in software testing: a mapping study. information and software technology, 57:378–391.
souza, e. f., falbo, r. a., and vijaykumar, n. l. (2015b). using lessons learned from mapping study to conduct a research project on knowledge management in software testing. in 41st euromicro conference on software engineering and advanced applications (seaa), pages 208–215, madeira, portugal.
staab, s., studer, r., schurr, h. p., and sure, y. (2001). knowledge processes and ontologies. intelligent systems, 16:26–34.
storey, j. and barnett, e. (2000). knowledge management initiatives: learning from failure. journal of knowledge management, 4:145–156.
thrane, c. (2011). quantitative models and analysis for reactive systems. thesis in applied computing, department of computer science, aalborg university, denmark.
vasanthapriyan, s., tian, j., and xiang, j. (2015). a survey on knowledge management in software engineering. in international conference on software quality, reliability and security companion (qrs-c), pages 237–244, vancouver, bc, canada.
werner, j. (2014). reuse-based test recommendation in software engineering. phd thesis, universität mannheim, mannheim. zugl. als druckausg. im verl. dr. hut, münchen erschienen.
witten, i. h., frank, e., and hall, m. a. (2005). data mining: practical machine learning tools and techniques. morgan kaufmann, san francisco, 3 edition.
yun, h., ha, d., hwang, b., and ryu, k. (2003). mining association rules on significant rare data using relative support. journal of systems and software, 67:181–191.
zack, m. and serino, m. (2000). knowledge management and collaboration technologies. in knowledge, groupware and the internet, pages 303–315, butterworth.
journal of software engineering research and development, 2022, 10:1, doi: 10.5753/jserd.2021.1973 this work is licensed under a creative commons attribution 4.0 international license. tact: an instrument to assess the organizational climate of agile teams a preliminary study eliezer dutra [ unirio and cefet/rj | eliezer.goncalves@cefet-rj.br ] patrícia lima [ unirio | patricia.lima@edu.unirio.br ] cristina cerdeiral [ univeris | cerdeiral@gmail.com ] bruna diirr [ unirio | bruna.diirr@uniriotec.br ] gleison santos [ unirio | gleison.santos@uniriotec.br ] abstract background: measuring the organizational climate of agile teams is a challenge for organizations, mainly because of the shortage of instruments specific to agile methodologies. on the other hand, finding companies willing to participate in the preliminary validation of an instrument is a challenge for organizational climate researchers. the preliminary validation allows identifying problems and improvements in the instrument. objective: we present the preliminary evaluation of tact, an instrument to assess the organizational climate of agile teams. its initial version comprises the communication, collaboration, leadership, autonomy, decision-making, and client involvement dimensions. method: we planned and executed a case study considering three development teams. we evaluated tact using open-ended questions, quantitative methods, and the tam dimensions of intention to use, perceived usefulness, and output quality. results: tact allowed classifying the organizational climate of the teams for the communication, collaboration, leadership, autonomy, decision-making, and client involvement dimensions. some items were assessed negatively or neutrally, which represent points of attention. tact captured the lack of agile ceremonies, the difficulty of the product owner in planning iterations, and the distance in leadership. in addition, the tact dimensions presented high levels of reliability. conclusions: tact captured the organizational climate of the teams adequately. the team leaders reported an intention of future use. the items that compose tact can be used by researchers investigating the influence of human factors in agile teams and by practitioners who need to design organizational climate assessments of agile teams. by using an instrument adapted to assess the organizational climate of agile teams, an organization can better identify issues and improvement actions aligned with agile values, principles, and practices. keywords: organizational climate, agile software development, human factor influence
1 introduction
several factors can influence the organizational climate of agile software development teams, such as trust, openness, respect, team engagement, a culture of action and change, innovation, leadership, communication, personality, software quality, performance, support from management, and the availability of resources for the project (acuña et al., 2008; soomro et al., 2016; grobelna and stefan, 2019; serrador et al., 2018; vishnubhotla et al., 2020).
curtis et al. (2009) propose that organizations should periodically identify each person's opinion on their working conditions. the authors recommend the organizational climate survey to learn and understand the factors influencing teams, their activities, and, consequently, the software's quality (curtis et al., 2009). the instrument used in the assessment of the organizational climate must consider the most critical factors in the domain, as the organizational climate is evaluated through the behavior, attitudes, feelings, policies, practices, and procedures that characterize life in the organization (lenberg et al., 2015; schneider et al., 2014). vishnubhotla et al. (2020) point out the need for further studies to investigate the influence of human factors on the organizational climate of agile teams. both academia and industry suggest that collaboration, communication, autonomy, decision-making, client involvement, and leadership are critical human factors that influence agile software development projects (chagas et al., 2015; dybå and dingsøyr, 2008). to assess the organizational climate of agile teams, organizations should select organizational climate instruments that measure the desired factors. many organizations may find it difficult to select instruments for copyright reasons. hiring a specialized consulting company can aid this process; however, dutra et al. (2012) report that many consulting companies do not disclose details of how their instruments were designed, their reliability, or the statistical procedures adopted for their validation. several studies have investigated the impact of human factors in agile projects (chagas et al., 2015; vishnubhotla et al., 2018), including surveys with members of agile teams (grobelna and stefan, 2019). however, the literature review we conducted did not identify studies that report the design of scales, models, or questionnaires specific to assessing the organizational climate of agile teams. some studies use generic scales/questionnaires that can be used in different business domains (acuña et al., 2008; vishnubhotla et al., 2020); other studies only present factors that exert some influence on the organizational climate of agile teams (serrador et al., 2018; soomro et al., 2016). in previous work, dutra et al. (2020) presented the initial version of tact, "an instrument to assess the organizational climate of agile teams". tact was devised and preliminarily validated for the communication, collaboration, and leadership dimensions. the instrument's dimensions showed high reliability. in the current work, we extended the initial study by adding the client involvement, autonomy, and decision-making dimensions, creating new items to measure the organizational climate of the teams considered in the previous study, and expanding the users of tact to include a third team.
moreover, we increased the literature background to show the constructs (delgado-rico et al., 2012) considered to guide the creation of tact items, and we used factor analysis to identify the most influential items for each dimension considered in the case study. this study aims to evaluate tact preliminarily for the communication, collaboration, leadership, autonomy, decision-making, and client involvement dimensions. tact was built considering the main human factors that influence agile teams, and two specialists validated the tact items with respect to agility. the data collection procedures used in the case study showed that tact evaluated the organizational climate correctly for the three teams. the quantitative analysis indicated the most influential items for each dimension in the case study. tact items showed high factor loadings, and tact showed excellent psychometric indices, for example, high inter-item spearman correlations (ρ) and a high cronbach's alpha value (> 0.8). practitioners can use tact items in their organizational climate assessments; researchers can explore new evidence of reliability and validity of the tact dimensions. the paper is organized as follows: section 2 discusses the organizational climate in agile teams; section 3 presents the design of tact; section 4 deals with the study planning; section 5 presents the results; in section 6, we discuss the results; section 7 addresses the study limitations and threats to validity; finally, section 8 presents our final considerations.
2 background
2.1 specific characteristics for the formation of the organizational climate of agile teams
the organizational climate is the meaning that employees attribute to the policies, practices, and procedures they experience, besides the behaviors they observe being rewarded, supported, and expected (schneider et al., 2014). as such, members of agile teams expect the values, practices, and adopted procedures, and even the behavior of those involved, to reflect the values, principles, and practices of the "agile philosophy" (hohl et al., 2018; beck et al., 2001). agile methods differ from traditional development methods in several aspects (dybå and dingsøyr, 2008; pmi and agile alliance, 2017). leadership, collaboration, communication, autonomy, decision-making, and client involvement are examples of factors that demand different behaviors among those involved, as they impact the adoption and use of agile methods (dybå and dingsøyr, 2008; chagas et al., 2015; noll et al., 2017; jia et al., 2016). schneider et al. (2014) claim that leadership is a crucial point in the formation of the climate in organizations. in agile development, leadership is based on the role of the servant leader (pmi and agile alliance, 2017). pmi and agile alliance (2017) argue that servant leadership is the practice of leading by service, focusing on the team members' comprehension and development, as well as on meeting their needs, in order to enable them to perform at their best. dybå and dingsøyr (2008) argue that, in traditional methodologies, the management style is based on command and control, with highly bureaucratic and formalized organizational structures, while in agile methodologies the management style must be collaborative and the structure of the organization is organic (dybå and dingsøyr, 2008). chagas (2015) reports that collaboration in agile methodologies takes place between team members and the customer.
in agile methodologies, the project is divided into small cycles, called iterations, which are planned and specified together with the client and based on the team's development capacity (pmi and agile alliance, 2017). this negotiation is based on the communication and collaboration the team maintains while executing the development tasks. a process of communication and collaboration between members of the agile team during iteration planning and the execution of development tasks positively impacts the project's success (chagas et al., 2015). unlike traditional approaches, in agile methodologies the team has the autonomy to create and change the assignment of responsibility for performing the tasks (karhatsu et al., 2010; chagas, 2015; pmi and agile alliance, 2017; noll et al., 2017). jia et al. (2016) argue that the decision-making behavior of each individual will influence the behaviors of other teammates and the project outcome. for example, each member makes a decision about effort estimation and gives user story points under these conditions; different individual decision-making behaviors will generate different results, which are pertinent to the success or failure of the project (jia et al., 2016).

dutra and santos (2020) investigated difficulties associated with organizational climate assessments. the authors identified pitfalls in (i) the non-assessment of behaviors and factors specific to the development of an organizational climate in agile teams, and (ii) the failure to explicitly consider agile roles and other management functions of the organizational structure. the authors argue that the items of assessment instruments should be detailed enough to allow respondents to think about the organizational culture and better characterize the agile behaviors depicted (dutra and santos, 2020).

2.2 organizational climate in agile teams

there are several studies on organizational climate in software development teams (soomro et al., 2016). however, many of these studies do not report characteristics of the software development process considered in the evaluated teams. in addition, the studies measured the climate using generic instruments applicable to different business domains, without considering values, principles, or practices specific to development teams. our literature review identified three studies (acuña et al., 2008; grobelna and stefan, 2019; vishnubhotla et al., 2020) that investigated the organizational climate of agile teams using climate survey instruments.

acuña et al. (2008) investigated whether the climate of software development teams has any relationship with the quality of the software product. the authors used the tci© (team climate inventory) instrument (anderson and west, 1998) to assess the climate. the experimental study was carried out with 105 students allocated in 35 teams. all teams used an adaptation of the extreme programming (xp) method to develop the same software. the authors found that the climate preferences regarding the team's vision and their perception of participatory security were significantly correlated with better software. according to the authors, it is important to track the organizational climate of teams as one of many indicators of the quality of the software to be delivered.
grobelna and stefan (2019) investigated how organizational climate factors (e.g., leadership style, autonomy, rewarding, and communication) in agile software development teams affected the regularity of work speed and the teams' efficiency. the authors prepared a questionnaire to measure the organizational climate, but the items created were not disclosed. the results confirmed that the desired organizational climate was based primarily on a positive relationship with the leader and other coworkers, commitment to work, and challenges at work. the authors argue that there is evidence that the more a team's organizational climate matches the team's preferences, the more regular the team's work speed is, and thus the more efficient the team is (grobelna and stefan, 2019).

vishnubhotla et al. (2020) investigated the association between personality traits and the climate in agile software development teams. the study was conducted with 43 members of eight agile teams. the authors used the tci© instrument (anderson and west, 1998) to assess the climate for each dimension (vision, participatory security, support for innovation, and task orientation). the study identified a statistically significant positive correlation between personality (considering the trait openness to experience) and the climate dimension support for innovation. they concluded that the results of the regression analysis suggest that more data may be needed, and that there are other human factors, in addition to personality traits, that should also be investigated in relation to the climate of agile teams.

in summary, the tci© instrument is grounded in a theoretical model to measure the vision, participatory security, support for innovation, and task orientation dimensions (anderson and west, 1998). tci© was used in acuña et al. (2008) and vishnubhotla et al. (2020) to measure factors that influence the innovation capability of software development teams. the tci© dimensions do not measure the dimensions proposed in tact. the questionnaire items elaborated by grobelna and stefan (2019) were not published. regarding the use of generic questionnaires or scales to assess the organizational climate in agile teams, dutra and santos (2020) claim that the use of assessment instruments that do not consider agile values, principles, practices, and roles in a proper context may create difficulties for the analysis of possible causes of problems and for the execution of corrective actions within organizational climate management. therefore, there is a need for specific instruments to measure the organizational climate of agile teams in the communication, collaboration, leadership, autonomy, decision-making, and client involvement dimensions.

figure 1. main steps used to build tact and to execute the case study

3 tact overview

in this section, we present the conception of the instrument to assess the organizational climate of agile teams (tact). instruments for organizational climate assessments measure behaviors, attitudes, or preferences (anderson and west, 1998; patterson et al., 2005). as such, the tact conception and evaluation are based on psychometric concepts (dima, 2018; patterson et al., 2005; graziotin et al., 2020). the tact design followed specific procedures suggested for elaborating and validating climate scales and other questionnaires in general (graziotin et al., 2020; anderson and west, 1998; bandura, 2006; dybå, 2000; gonzález-romá et al., 2009; recker, 2013; shull et al., 2008).
figure 1 shows the steps followed to define tact and to execute the case study used to evaluate it. the steps involving the definition of constructs, item design, evaluation by specialists, and pretesting are described in the next subsections. the activities used for data collection in the case study, such as the interview with the process coordinator, the documentation analysis, the survey using tact, the leaders' interviews, and the tam evaluation, are described in section 4.3. the quantitative analysis of the case study is shown in section 5.3.

3.1 conceptual definition of the construct

the first step to define the construct is a literature review (spector, 1992). the researchers should carefully read the literature about the construct, paying attention to the specific details of exactly how the construct has been described (spector, 1992). in the delineation of a construct, it is helpful to base the conceptual and scale development effort on work that already exists. for each tact dimension, we identified (i) conceptual definitions to give a general description of the construct measured, and (ii) operational definitions to understand how the construct can be assessed (delgado-rico et al., 2012; spector, 1992). an operational definition is a description of something in terms of the operations (procedures, actions, or processes) by which it could be observed and measured (vandenbos, 2017). the constructs are presented in appendix a.1.

to start step 1, we identified systematic literature reviews and other relevant sources to provide (i) theoretical and operational definitions for the investigated constructs (i.e., communication, collaboration, leadership, autonomy, decision-making, and client involvement), (ii) human factors and their influences on agile teams, and (iii) factors, models, scales, questionnaires, and items for assessing the climate of software development teams. we identified some systematic literature reviews about human factors that impact agile software development (dybå and dingsøyr, 2008; franca et al., 2011; chagas et al., 2015; vishnubhotla et al., 2018; dutra et al., 2021). the soomro et al. (2016) paper was considered because it identified studies, instruments, and factors used to assess the organizational climate of development teams. pmi and agile alliance (2017) and miller (2020) were used to standardize the names of roles, practices, and artifacts considered in agile development. we used the most influential human factors related to agile software development teams (chagas et al., 2015) to select the tact dimensions investigated in this study. the agile manifesto (beck et al., 2001) was also used in this step.

the identified literature was used (i) to make the conceptual and operational definitions of the constructs (delgado-rico et al., 2012; spector, 1992) and (ii) to capture examples of behaviors, attitudes, climate instruments, and practices and their influences. for example, a) dybå and dingsøyr (2008) showed that “the planning game activity was found to have a positive effect on collaboration within the company”, b) karhatsu et al. (2010) reported that “communication and collaboration are at the heart of agile software development. as the agile manifesto states, individuals and interactions over processes and tools and customer collaboration over contract negotiation. one aspect in communication and collaboration is customer cooperation”, and c) through soomro et al.
(2016), we identified some items (açıkgöz et al., 2014) that could be adapted to measure collaboration.

3.2 design/adaptation/selection of items

step 2 aims to propose the items that will be used to assess each dimension, adapted to the population's culture. thus, the constructs (appendix a.1) identified in step 1, the identified systematic reviews, and other relevant literature were considered. some items or questionnaires and examples of behaviors identified in the previous step had to be adapted to agile roles, practices, or artifacts. pmi and agile alliance (2017) and miller (2020) were used as references to identify the main roles and essential activities in agile software development projects. after reading the selected works, we started creating tact. for each considered dimension, namely communication, collaboration, leadership, autonomy, decision-making, and client involvement, evaluation items were selected, adapted, or created. some items from scales without any copyright were selected and translated into portuguese, e.g., “it13. team members work together as a whole”, used in açikgöz (2017) to assess collaboration between software development team members. in other cases, only the role of the person exercising the action was altered. for example, the original item “my direct supervisor listened to my ideas and concerns”, proposed in sharma and gupta (2012), was changed to item “it20. the team facilitator listens to my ideas and concerns”.

new items were also proposed to assess the organizational climate specific to agile teams. for this purpose, critical factors and/or items were selected, and the descriptions were adapted to the roles and activities performed by agile teams. for example, to assess the communication dimension, we defined the item “the team and the product owner always reach consensus on the priority of the user stories by negotiating which bug to fix or functionality to add”. this item was based on the team climate factor described in nianfang ji and jie wang (2012), “supervisors and staff communication and agreement their tasks, including what to do, to what degree, and how to do?”, and on the description presented by chagas (2015) for the communication factor, “frequent communication can be used to prioritize features, set focus on bug-fixing or include more functionality”.

on completion of step 2, 49 items had been established: 9 items to measure communication, 8 items for collaboration, 10 items for leadership, 7 items for autonomy, 8 items for decision-making, and 7 items for the client involvement dimension. the items included in the initial version of tact are shown in appendix a.2. tact also comprises a dashboard, which is shown in section 5.

3.3 evaluation by specialists

at the beginning of step 3, the tact items were analyzed by two specialists in agile software development. for each item, two questions were considered: “can it be inferred that the presented item represents a behavior related to agile software development teams?” and “do you suggest any adaptation to the item description?”. the first specialist has 10 years of experience in using such methods and 5 years as a consultant focused on the agile transformation of organizations and teams. the second specialist is a process coordinator at a large company. she has 14 years of experience in software process improvement and 4 years as the person responsible for defining and monitoring changes in agile processes. every tact item was considered related to agile software development teams.
two researchers, co-authors of this work, discussed all comments and suggestions made by the specialists. after that, some adaptations to item descriptions were made. for example, in it08, the proposed description “the team and the product owner always agree (...)” was altered to “the team and product owner always reach consensus (...)”.

3.4 pretesting

google sheets was used as the tool to develop tact. it mainly contains the form for conducting the climate survey and a dashboard with the resulting frequencies by item and dimension (figure 2). the items proposed in appendix a.2 are measured using a 5-point likert scale (1 = strongly disagree, 2 = disagree, 3 = neutral, 4 = agree, and 5 = strongly agree). in tact, the organizational climate of the team is classified as positive (values 5 and 4), neutral (value 3), or negative (values 2 and 1). to begin step 4, a pretest was performed with 3 developers to identify possible problems of interpretation of the tact items and layout. in the end, the developers reported no difficulties in answering the survey. the authors implemented a layout suggestion presented in this step. to continue the preliminary assessment of tact, a case study (yin, 2013) was performed; it is described in the next section.

4 case study planning and execution

runeson and höst (2009) claim that case studies in software engineering aim to investigate a contemporary phenomenon in a real context to understand how and why software engineering activities should be carried out. they also argue that improving the software process and the resulting products with the acquired knowledge is possible. the authors also highlight the main characteristics of a case study, namely: 1) their conclusions must reflect a clear chain of evidence, whether qualitative or quantitative, collected from various sources in a planned and consistent manner; and 2) they must add to the existing body of knowledge, based on established theory, if any, or build such theory. thus, the case study described below is proposed as a method of evaluating both the case addressed and the tact instrument (yin, 2013).

4.1 research questions

the study aims to evaluate tact preliminarily. to achieve this aim, the research questions (rq) are defined as follows:

• rq1. how is the organizational climate in the examined agile teams?
  – rq1.1. how did working from home affect the organizational climate of the teams for the analyzed dimensions?
• rq2. how do leaders perceive tact?
• rq3. which are the most influential items in each dimension for the analyzed case?

during the planning and execution of the study, teams previously allocated in the same physical environment were working from home due to the covid-19 pandemic, as described in davis et al. (2020). to investigate whether this fact could have impacted the organizational climate of the studied teams, we defined rq1.1.

4.2 description of the organization and teams

the organization analyzed in the study is a large brazilian bank with millions of customers. it has dozens of development teams, composed of employees and outsourced collaborators. each team defines its software development process and can choose traditional (structured and rup) or agile (scrum, kanban, xp) methods, among others defined by the organization. each team has the freedom to define the scenario and artifacts to be developed, as long as this is officially stated to the process sector.
regarding leadership, some teams use the role of scrum master, but in others, this role is played by the hierarchical leader of the team. when present, the role of coach facilitates the understanding and dissemination of good agile practices by the teams. during this time of working from home, the team's monitoring by the agile leader occurs through the ceremonies that continue to be performed, the monitoring of task execution, and meetings and interactions using microsoft teams and corporate skype resources. even with the change in the work routine, it was reported that tasks continue to be delivered within the established deadlines and with the required quality. three teams, named a, b, and c, were selected by convenience to participate in the case study. the teams have employees from the organization as well as outsourced members.

4.3 data collection

for the data collection, we used interviews, document analysis, and the application of tact. data collection took place between january 2020 and march 2021. the first data collection procedure was an interview with a process coordinator of the organization. the objective was to understand how the company assessed the organizational climate, which difficulties were faced when assessing the organizational climate of agile software development teams, what the development process was like, and what the composition of the agile teams was like.

the second procedure was to analyze the executive reports with the results of the last two organizational climate assessments. it is noteworthy that the assessment performed by the organization is biennial and does not consider the team to which employees are allocated, only the employees and their superintendence department. for this reason, it is not possible to understand the climate of individual teams.

the third procedure was the assessment of the organizational climate in the teams through tact. all team members were invited to participate voluntarily and anonymously in the study. the organizational climate survey was applied in three cycles called pulses. table 1 shows the dimensions applied to each team by pulse and the number of participants from each team in each pulse. pulse 1 was executed in june 2020 for teams a and b. pulse 2 was executed in february 2021 for team c. lastly, pulse 3 was executed in march 2021, and all teams participated. the numbers in the team a, team b, and team c columns represent the size of each team at the moment each pulse was executed. in the period between pulse 1 and pulse 2, some members of teams a and b were allocated to other teams due to the conclusion of the product module.

table 1. measurement cycles
pulse 1: communication, collaboration, leadership; jun/20; team a: 13, team b: 10
pulse 2: communication, collaboration, leadership; feb/21; team c: 4
pulse 3: autonomy, decision-making, client involvement; mar/21; team a: 9, team b: 5, team c: 4

in addition to the items presented in appendix a.2, three open-ended questions were introduced, including: “regarding the examined dimensions (communication, collaboration, leadership, autonomy, decision-making, and client involvement), what are the main challenges for your team at this time working from home?” and “do you have anything to add about your team's organizational climate?”. in addition, at the beginning of the instrument, we included a description with the definition of the organizational climate and the objective of the assessment.
next, we presented a consent form to comply with ethical principles, in which we informed participants that participation would be anonymous and voluntary and that they could abandon the assessment at any time without penalties.

the fourth procedure was the execution of semi-structured interviews with the leaders of the respective teams. these interviews were designed to present the results of the climate assessment and capture the leaders' perception of tact and of the team's organizational climate. to do this, they were asked questions such as: “how do you evaluate the results, by dimension, of the organizational climate assessment carried out by the team? do the results by dimension represent your perception of the team's daily life? in your opinion, was there any result that surprised you? do you believe that the items used represent expected behaviors in agility (mindset, values, principles, and practices)? otherwise, explain why the item does not represent expected behavior”. at the end of the interview, we sent the leader a link to evaluate tact through the tam (technology acceptance model) (venkatesh and davis, 2000; venkatesh and bala, 2008). the dimensions of intention to use, perceived usefulness, and output quality were used (venkatesh and davis, 2000; venkatesh and bala, 2008). in the interviews, we used a consent form to present and assure ethical aspects.

5 case study results

this section presents the results of the organizational climate assessment, thus answering the research questions.

5.1 how is the organizational climate in the examined agile teams? (rq1)

teams were allowed to answer the survey for 8 days in each pulse. we checked the data and the calculations performed by tact. in total, 22 team members participated in pulse 1: 12 out of 13 (i.e., 92.31% of members) from team a and 10 (100%) from team b. in pulse 2, 3 out of 4 (75%) members of team c answered the survey. in the last pulse, 4 out of 9 (44%) members of team a, 5 (100%) members of team b, and 4 (100%) members of team c participated in the study.

table 2 shows the frequency for each investigated dimension. the “dimension” column contains the description of the dimension. for each team, the absolute frequencies (the count of each value assigned by the members) and the relative frequencies (the percentages, in parentheses) were calculated according to the aforementioned likert scale. in table 2, we chose to count the values “strongly agree” and “agree” in the “positive” column, and “strongly disagree” and “disagree” in the “negative” column. finally, we consider the frequency of “neutral” to categorize the organizational climate as neutral. figure 2 shows the tact dashboard, which is used to present the results of the climate assessment. the climate is classified as positive, neutral, or negative to facilitate the analysis of the results by team members, leaders, and others involved.

when analyzing the results in table 2, higher frequencies can be observed in the “positive” column for teams b and c in all dimensions. considering that the 49 items represent good behaviors expected from the main roles existing in an agile team, it is possible to classify the organizational climate of teams b and c as positive, or favorable, for all dimensions. in team a, the organizational climate can be classified as (i) positive for the communication, collaboration, and leadership dimensions, and (ii) negative for the autonomy, decision-making, and client involvement dimensions.
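as an illustration of this counting scheme (the study itself used a google sheets dashboard, not the sketch below), the classification could be computed as in the following r fragment, in which the responses data frame is hypothetical:

```r
# minimal sketch of tact's counting scheme: 5-point likert answers are
# mapped to negative (1-2), neutral (3), and positive (4-5), and counts
# and percentages are aggregated, as in table 2. 'responses' holds
# hypothetical data: one row per respondent, one column per item of a
# single dimension.
responses <- data.frame(
  it01 = c(4, 5, 3, 2, 4),
  it02 = c(3, 4, 4, 1, 5),
  it03 = c(5, 5, 4, 3, 4)
)

classify <- function(x) {
  cut(x, breaks = c(0, 2, 3, 5),
      labels = c("negative", "neutral", "positive"))
}

labels   <- classify(unlist(responses))
counts   <- table(labels)
percents <- round(100 * prop.table(counts))
data.frame(climate = names(counts),
           count   = as.vector(counts),
           percent = paste0(as.vector(percents), "%"))
```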
table 2 shows that teams b and c presented a positive climate superior to that of team a in all dimensions. for example, the frequency of the communication dimension was 82 (91%) for team b, 20 (74%) for team c, and 62 (58%) for team a. neutral and negative results represent points of attention for an analysis of possible causes and impacts on the roles involved, the elements of the process, the development project, or the team's culture in general.

table 2. results of the organizational climate assessment for teams a, b, and c (negative / neutral / positive)
communication: team a 23 (21%) / 23 (21%) / 62 (58%); team b 6 (7%) / 2 (2%) / 82 (91%); team c 1 (4%) / 6 (22%) / 20 (74%)
collaboration: team a 10 (10%) / 20 (20%) / 66 (70%); team b 0 (0%) / 2 (3%) / 78 (97%); team c 0 (0%) / 2 (7%) / 22 (93%)
leadership: team a 27 (23%) / 29 (24%) / 64 (54%); team b 0 (0%) / 13 (13%) / 87 (87%); team c 0 (0%) / 5 (17%) / 25 (83%)
autonomy: team a 12 (43%) / 9 (32%) / 7 (25%); team b 2 (6%) / 2 (6%) / 31 (88%); team c 0 (0%) / 2 (7%) / 26 (93%)
decision-making: team a 9 (28%) / 16 (50%) / 7 (22%); team b 0 (0%) / 4 (10%) / 36 (90%); team c 0 (0%) / 3 (9%) / 29 (91%)
client involvement: team a 6 (22%) / 11 (39%) / 11 (39%); team b 0 (0%) / 2 (6%) / 33 (94%); team c 0 (0%) / 0 (0%) / 28 (100%)

figure 2. part of tact's dashboard (pulse 1: team a results)

5.1.1 analysis of the organizational climate of team a

among the assessed teams, team a showed the most items evaluated as negative and neutral (see table 2). thereby, its organizational climate can be considered negative for the autonomy, decision-making, and client involvement dimensions. however, we observed (i) positive evaluations in the items referring to the interaction between the team members, and (ii) negative and neutral evaluations in the interactions that involve the product owner and the leader. some points of attention were clarified in the open-ended questions and the interview with the leader. in response to the question about the challenges for the communication dimension at this time of working from home, a member of team a said that “virtual rooms, when poorly managed, end up providing a space for inopportune conversations”. this statement was also corroborated by the leader of team b, “they think they talk too much, lose focus a little bit”, mentioning the feedback obtained from the team at the previous day's daily meeting. these comments can be associated with item “it04. team members frequently talk about club, entertainment, gym, parties, sports, and films”. for item “it07. in the current project, the daily meeting allows to know project problems and team difficulties”, the leader of team a admitted the negative result: “the team decided not to hold the daily meeting during the period of working from home anymore; the difficulties are addressed by whatsapp and the virtual room at microsoft teams”. in addition, the leader of team a agreed with the team, noting the negative result for item “it02. the team keeps the list of impediments, risks, and control actions updated”: “many times i have to register the impediments myself, they don't do it”. in pulse 3, the item “it39. my team has open and effective communication” had only neutral assessments (4; 100%), reflecting a change in the team's climate for the communication dimension. team a showed a more positive climate in relation to the collaboration between the members themselves, for example, in items “it10. team members consider sharing know-how with each other” and “it12.
my team works efficiently together in the face of difficulties”. however, when collaboration involves the product owner and the leader, points of attention in item “it17. in the current project, the team, the product owner, and the team facilitator work excellently together to plan the iteration” deserve to be stressed. with the analysis of the open question “do you have anything to add about your team's organizational climate?”, it was possible to identify potential causes for the negative assessment of item it17. the members reported that “after the coordination change occurred, there was some distancing between the po and the team” and that “the team leader does not play her role”. this negative climate assessment was repeated in pulse 3. during the interview, in the analysis of it17, the leader of team a stated that “the team often wants to impose on the po what they think should be implemented in the product, they feel like they own the product”. the leader also pointed out that “the employee designated as po cannot develop stories at team speed. often, the product owner cannot approve a sprint with the business customer, as customers have other priorities, which compromises the next sprint planning”.

regarding the autonomy and decision-making dimensions for iteration planning, the leader reported: “sometimes there are demands that override all planning. we lived this recently, every time an unplanned demand arrived that passed over all other demands. this hinders the planning team's autonomy”. the leader also declared: “these past few months have been hard, a little stressful. most of the demands were out of planning”. these comments were made by the leader in the analysis of items “it34. my team has the decision authority and responsibility to plan the iteration” and “it35. my team has time to plan the changes without excessive stress or pressure”.

the climate can be characterized as negative for the decision-making dimension, considering the 9 (28%) items assessed as negative and the 16 (50%) assessed as neutral. the item “it41. the dependencies between the tasks do not hinder the fluidity of the project and do not cause major restrictions” obtained 75% negative evaluations. about decision-making, a member of team a reported: “the decision-making process is still not very participatory”. in the analysis of item it41, the leader of team a stated: “the dependencies between the tasks are getting in the way. demands have a number of tasks that impact. if the po does not approve the changes, this creates a configuration and change problem. in the company, if you put a demand into production and do not validate it with the po, the infrastructure team rolls back the demand”. with respect to client involvement, a member of team a described: “business representatives fail to fulfill their role during homologation, impacting the delivery in production of not only that specific demand but also many others, as they depend on the implementation of the first demand”.

5.1.2 analysis of the organizational climate of team b

in team b, more than 90% of the items were positively evaluated. however, some items were evaluated as negative or neutral, thus representing points of attention. about the communication dimension, a team member reported: “communication continues to flow very well, keeping productivity high and positive”.
another report showed the good climate for collaboration between team members: “when there is some difficulty in identifying an error in the tests, we share the screen, we make audio calls, we include other team members, whom we know have some more specific experience at that point, in the conversations”. the team b leader did not obtain any negative evaluations, only 3 neutral ones in the item “it25. the team facilitator gives the team helpful feedback on how to be more agile”. several praises for the performance of the agile leader during the period of working from home were registered in response to the open questions. these reports include “... his work remained close and very positive”, “... considering different points of view”, and “... moving together, even at a time of working from home”.

regarding the autonomy dimension, the team leader said: “the team autonomy is very good. the members are participatory. in the team, there is no expression ‘this is my task, or this is not my problem’”. a member of team b wrote: “team members have always been autonomous about their tasks within each user story developed”. concerning the decision-making dimension, a member of team b reported dissatisfaction: “the main challenges are when the team's decisions come up against approval from other areas”. analyzing the item “it35. my team has time to plan the changes without excessive stress or pressure”, the leader reported: “in the last few months, we had several po changes in the projects. before, the po was from it; now, by determination of the company, the po is from the business area. the new po does not ‘walk’ with the team. she does not feel part of the team. she did not want to be a po. as the po was not planning, the team had to plan it”.

the problems reported by the team b leader may have influenced the two neutral evaluations (2; 6%; see table 2) recorded in the client involvement dimension. the items “it44. in the current project, there are frequent meetings with business representatives and the team” and “it47. the current project does not have frequent requirement changes due to bad user stories definition” had neutral evaluations. analyzing items it44 and it47, the team b leader reported: “many times the team had to prioritize and refine the user stories without the participation of the po. after planning, she made several changes to the user stories and the acceptance criteria”.

5.1.3 analysis of the organizational climate of team c

regarding the communication and collaboration dimensions, the leader of team c said: “the team is new. they have only 3 months in this project. they already knew each other. we have an excellent interaction. i do not know the team personally. what gets in the way are limitations of the tool (microsoft teams) because they do not have full access. but the collaboration between them is excellent”. regarding the all-neutral (3; 100%) assessments of the item “it05. during the retrospectives, the team finds the best way to do things”, the leader reported: “we still have not managed to do the retrospective meetings formally, the team is new. the team started by resolving only incidents. we talked, but not formally at a ceremony”. regarding the 3 neutral assessments involving the iteration planning items “it34. my team has decision authority and responsibility to plan the iteration” and “it35.
my team has time to plan the changes without excessive stress or pressure”, the leader said: “they have autonomy. in the current project, they managed to negotiate changes in user stories. they had the autonomy to adjust the planning”. about the pressure on team c, the leader commented: “it should also be considered that the product under development has a fixed date (which cannot be changed) to be launched. the product impacts millions of bank customers”.

analyzing the decision-making dimension, a member of team c wrote: “decision-making is shared between the outsourced members, the members of the company, and the business representatives. we can all contribute with equal weight. working from home facilitated the engagement and collaboration between these 3 roles”. on the autonomy dimension, another member wrote: “the autonomy limits are agreed with the client”.

despite the 100% positive evaluations of the client involvement dimension, one member reported that the product owner was not allocated in the same virtual environment: “in some moments, communication with the management area is not so synchronous, as we do not have access to the same communication tool (microsoft teams), but the continuous meetings in this same tool make it easier to exchange information and questions”. the 100% positive assessment of the team in the client involvement dimension did not surprise the leader. the leader declared: “the managers praise the team a lot. in these last weeks, the managers have stayed together for up to 4 hours doing the backlog refining. i have never seen such engagement. in this project, there are many stakeholders involved. at this time of working from home, they are available to answer questions over the phone. now, we are currently holding 1-hour refinement meetings. the report used at the demonstration meeting containing the evidence was highly praised by the po. the po said: ‘i never got a homologation script with evidence that did not have errors’”.

5.1.4 how did working from home affect the organizational climate of the teams for the analyzed dimensions? (rq1.1)

team members reported some challenges that could have impacted the organizational climate as they adapted to the period of working from home. the challenges mentioned were difficulty with communication tools; infrastructure problems; difficulties in reaching the support team; managing inopportune conversations in virtual rooms; the absence of the facilitator at the ceremonies; the customer contract hindering the action of the facilitator; and other challenges already present before the period of working from home. regarding the communication dimension, members of team a reported that “working from home actually facilitated team communication” and that there has been “improved contact while working from home, we communicate more”. in team b, the statement “our team is managing to maintain a good dialogue to clarify project issues” stands out. the challenges identified for this dimension mention the network infrastructure and supporting software. in relation to the collaboration dimension, the challenges captured in the open-ended questions point to team b's collaboration difficulties with the external support team: “there have been challenges, some of which required the involvement of the support team”. one member reported a preference for working in person with the team: “...
but i believe that being in the same physical space, help and assistance would sometimes flow better”. another member stated that “the challenges are the same as they were before working from home”. regarding the leadership dimension, no issues were noticed in the performance of the leaders of teams b and c. on the other hand, members of team a reported the absence of the team's leader at ceremonies and a certain distance from the team's activities. in general, the members' responses did not indicate changes in the organizational climate due to working from home for the dimensions investigated in tact.

5.2 how do leaders perceive tact? (rq2)

during the interviews, for each analyzed dimension, the following question was asked to the leaders: “do you believe that the items used represent expected behaviors in agility (mindset, values, principles, and practices)? otherwise, explain why the item does not represent expected behavior”. regarding this question, no item was assessed as being inconsistent with agility. in the final stage of the interview, the following questions were asked: “in your opinion, what are the benefits of using this instrument?” and “how can the organizational climate assessment tool be improved?”. in relation to the first question, the leader of team a answered: “i found it interesting, you can map out what needs attention... i can notice other things, interesting... it exposes, gives you a view of what is happening. very practical, because we can focus on the point that needs attention”. the leader of team b agreed, saying: “i was able to see the positive things and the neutral points in order to try to improve... the visual formatting (graphics) was very clear. i managed to understand the results effortlessly”. the leaders did not report any suggestions to improve the instrument. after the interview, the tam (venkatesh and davis, 2000; venkatesh and bala, 2008) dimensions were used for the leaders to evaluate tact. some of the items taken into consideration in the assessment were, for example, “assuming i have access to the instrument, i intend to use it”, “using the instrument improves my performance in my job”, and “the quality of the output i get from the system is high”. considering a 7-point likert scale, most leaders' responses were the options “somewhat agree” and “strongly agree” for all items of the intention to use, perceived usefulness, and output quality dimensions.

5.3 which are the most influential items in each dimension for the analyzed case? (rq3)

due to the large number of items defined in tact, it is relevant to identify the most important items for this case study, i.e., the most influential items in each dimension. for this purpose, we performed factor analysis (fa). fa is commonly used in software engineering to analyze items that use the likert scale (sharma and gupta, 2012; klünder et al., 2020; graziotin et al., 2020). graziotin et al. (2020) assert that fa makes it possible to reduce the dimensionality of the problem space (i.e., reducing factors and/or associated items) and to explain the variance in the observed variables. in the case of analyses intended to assess a single construct, factor analysis helps identify the items that (best) represent the construct we are interested in, so that we can exclude the other items (graziotin et al., 2020). the quantitative results were processed with r (v. 4.0.2), primarily using the psych library (revelle, 2018).
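the following r sketch illustrates the kind of calls involved in the procedures detailed next; it is not the authors' actual analysis script, and the dim_items data frame is hypothetical:

```r
# sketch of the quantitative procedures for one tact dimension, using
# the psych library (revelle, 2018). 'dim_items' holds hypothetical
# data: one row per respondent, one column per item, values 1..5.
library(psych)

set.seed(42)
dim_items <- as.data.frame(replicate(4, sample(1:5, 30, replace = TRUE)))
names(dim_items) <- c("it10", "it11", "it12", "it13")

# step one: drop items without enough variation to differentiate
# respondents (e.g., ~95% of the answers in a single category).
low_var <- sapply(dim_items, function(x) max(prop.table(table(x))) >= 0.95)
dim_items <- dim_items[, !low_var, drop = FALSE]

# spearman correlation matrix for the initial visual diagnosis
rho <- cor(dim_items, method = "spearman")

# step two: sampling adequacy via the kaiser-meyer-olkin index
KMO(rho)

# polychoric correlations, as recommended for ordinal likert items
poly <- polychoric(dim_items)$rho

# parallel analysis (horn, 1965): how many factors to retain
fa.parallel(dim_items, cor = "poly", fa = "fa")

# single-factor solution: the factor loadings (lambda) of table 3
fa(poly, nfactors = 1, n.obs = nrow(dim_items))

# reliability of the dimension: cronbach's alpha
alpha(dim_items)
```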
it should be stressed that these procedures have an initial exploratory purpose and are not conclusive, as the small sample size (n = 25 in pulses 1 and 2; n = 13 in pulse 3), non-randomness, and the data distribution may have interfered with the results (dima, 2018; kyriazos, 2018). the adopted procedures were (i) the analysis of the frequency of variation of the items and of the correlation matrix, and (ii) factor analysis.

in step one, the response frequencies of all items are checked to verify whether the items have enough variation to differentiate respondents. if insufficient variation is identified (i.e., 95% of the responses in a single category for an ordinal item), the item needs to be excluded from further analysis (dima, 2018). in this case study, no items needed to be excluded. to continue step one, the item correlations (see figure 3) were plotted for an initial visual diagnosis of the items and of the structure of the tact dimensions (dima, 2018). a higher degree of association between items of the same dimension may already be visible in the correlation matrix (figure 3). negative associations between items may indicate the need for reverse item coding, while items with consistently weak associations with other items may prove to be non-scalable in later stages (dima, 2018). analyzing the spearman correlations (ρ) for the case study (figure 3), we can observe: (i) the absence of negative correlations; (ii) it04 and it32 with insignificant positive correlations, so it04 and it32 will be excluded from the next analyses; and (iii) in general, high and moderate positive correlations between items in the dimensions. the critical values of ρ are: 0.9 to 1, very high; 0.7 to 0.9, high; 0.51 to 0.7, moderate; 0.31 to 0.5, low; and 0 to 0.3, insignificant (hinkle et al., 2003).

to start the second step, we calculated the kaiser-meyer-olkin (kmo) index. the kmo index is a statistical test that suggests the proportion of the variance of the items that may be explained by a latent variable. the kmo values (see table 3) were considered appropriate for the fa in each dimension. the reference values of kmo are: < 0.5, unacceptable; 0.5 to 0.7, mediocre; 0.7 to 0.8, good; and 0.8 to 0.9, excellent (field et al., 2012). as indicated by field et al. (2012), the next analysis was conducted on the polychoric correlation matrix. we used the parallel analysis graph (horn, 1965) to investigate the plausibility of the initial model proposed in tact, i.e., the association of the items with their dimensions. figure 4 shows the parallel analysis graph (the x-axis displays the factor number and the y-axis represents the eigenvalue).

figure 3. correlation matrix of the dimensions

figure 4. parallel analysis graphic
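for reference, the cronbach's α values reported in table 3 below follow the standard textbook definition (not reproduced from the paper): for a dimension with $k$ items,

$$\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k} \sigma_i^2}{\sigma_t^2}\right),$$

where $\sigma_i^2$ is the variance of item $i$ and $\sigma_t^2$ is the variance of the sum of the $k$ items; values above 0.8 are conventionally read as high reliability.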
table 3. quantitative analysis results (items ordered by factor loading λ within each dimension)
communication (kmo = 0.75; α = 0.90): it03 (0.823), it05 (0.760), it09 (0.703), it01 (0.686), it07 (0.682), it08 (0.606), it06 (0.603), it02 (0.543)
collaboration (kmo = 0.67; α = 0.90): it12 (0.883), it15 (0.842), it10 (0.800), it13 (0.694), it11 (0.632), it14 (0.622), it17 (0.618), it16 (0.579)
leadership (kmo = 0.85; α = 0.97): it21 (0.858), it19 (0.813), it20 (0.745), it23 (0.739), it27 (0.728), it24 (0.721), it22 (0.698), it26 (0.668), it18 (0.665), it25 (0.585)
autonomy (kmo = 0.63; α = 0.95): it29 (0.827), it28 (0.702), it33 (0.680), it30 (0.610), it31 (0.553), it34 (0.514)
decision-making (kmo = 0.70; α = 0.94): it39 (0.827), it36 (0.672), it42 (0.651), it37 (0.625), it40 (0.563), it38 (0.547), it41 (0.491), it35 (0.387)
client involvement (kmo = 0.74; α = 0.94): it45 (0.792), it46 (0.746), it43 (0.731), it44 (0.731), it49 (0.720), it48 (0.452), it47 (0.337)

as per the kaiser criterion, only factors with eigenvalues greater than 1 can be retained (kaiser, 1960). the data simulated by the parallel analysis confirmed the hypothesis of retaining one factor per dimension. as shown in figure 4, all dimensions can be explained by a single factor. the fa was performed separately for each dimension to verify the most significant items. table 3 shows the quantitative results: within each dimension, the items are ordered by significance, and the factor loading (λ, in parentheses) indicates the correlation of the item with the associated dimension. regarding the small sample size, field et al. (2012) argue that if a factor has four or more loadings greater than 0.6, then it is reliable regardless of sample size.

analyzing the communication dimension (table 3), the items it03 (λ = 0.823) and it05 (λ = 0.760) have the highest factor loadings and can be considered the most significant ones. therefore, for effective communication, the team should consider empathic listening (it03) and ensure the necessary discussions on possible decisions agreed during the retrospectives (it05). for the collaboration dimension, items it12 (λ = 0.883) and it15 (λ = 0.842) have the highest factor loadings and can be considered the most relevant: it12 represents that the team should work efficiently together to solve problems, and it15 the collaboration to innovate. in the leadership dimension, item it21 is highlighted; it21 (λ = 0.858) measures the activities of the team leader in discussing the problems and impediments of the team. the facilitator's behavior of protecting the team's autonomy from external interference, it29 (λ = 0.827), has a high correlation with the other items of the autonomy dimension. for effective decision-making, the teams should have open and effective communication, it39 (λ = 0.827). lastly, for the client involvement dimension, item it45 (λ = 0.792) represents the opportunity for stakeholders to suggest changes or improvements to the software.

we calculated the reliability (see table 3) of the tact dimensions using the cronbach's α coefficient (landis and koch, 1977). the α indexes for each dimension are greater than 0.8, which implies that the reliability of tact for this case study is high (landis and koch, 1977).

6 discussion

6.1 case study

we executed a case study to assess tact preliminarily. tact has 49 items to assess the climate dimensions of communication, collaboration, leadership, autonomy, decision-making, and client involvement in agile development teams. the case study was carried out with three teams working at a bank.
the climate assessment took place during a period in which teams that had previously been allocated in the same physical space were instead required to work from home. in addition to the items established in tact, open-ended questions were used to understand the challenges faced by the members while working from home. in the end, we conducted interviews with the leaders to understand the possible causes or impacts of the items evaluated. analyzing the frequency of the responses attributed to the items by the members, the answers to the open questions, and the data from the interviews, there are signs of a positive organizational climate in teams b and c. on the other hand, there are signs of a negative organizational climate in team a. thereby, negative and neutral frequencies were observed in some items, which can represent points of attention.

communication, collaboration, autonomy, and decision-making are critical human factors in agile software development teams because members use them to plan and execute iterations, besides periodically adjusting the process or the team's behavior (chagas et al., 2015; pmi and agile alliance, 2017). regarding the communication, collaboration, autonomy, and decision-making dimensions, there were positive frequencies for the relationship between the members of each team (e.g., it03, it11, it12, it28, it31, and it38). however, negative and neutral frequencies point to possible difficulties in team a when collaboration, communication, autonomy, and decision-making involve the roles of the product owner (it33 and it49) and the facilitator (it09, it17, and it29), as well as agile ceremonies (it08 and it34) and artifacts (it02 and it08). team a abandoned or mischaracterized some agile practices while working from home, for instance, the daily meeting (it07). regarding the artifacts, the team was not communicating some impediments to the leader (it02), and both the product owner and the team were not adapting the requirements to the user story format (it08 and it16), due to the contract with the software factory, which established the requirements in another format for payment estimates. another critical factor identified in team a was the inability of the product owner to establish requirements according to the team's speed and capacity (it17). thus, although the collaboration between team members was classified as positive in team a, the relationship with the product owner and the team facilitator reflected points of attention, which can be observed in the statement of one member: “the agile methodology is being abolished in our team”.

as previously stated, leadership is one of the central elements in the formation of the organizational climate (schneider et al., 2014). the main activities of the servant leader can be summarized as (i) removing team impediments and (ii) facilitating, disseminating, and ensuring the use of agile values, practices, and rules (noll et al., 2017; pmi and agile alliance, 2017). concerning leadership, during the interview, the leader of team a clarified the negative assessment of item it19: “i follow it closely when i am called, when i am needed”. in team b, tact captured a closer relationship between the leader and the team. however, when the leader of team b analyzed the neutral points of it19, she made the following statement: “i have not been able to dedicate myself, to be the scrum master that i was [before working from home].
the agility factor has been the greatest challenge, solving impediments faster. i need to do things that i still have not managed to”. the challenges captured in the several reports did not point out new insights about working from home for the dimensions investigated. concerning the challenges, the team members reported: “the challenges are the same as those that existed before working from home”, “there are no new problems in working from home. they [the challenges] existed before”, and “the current moment of working from home has not brought any new challenges so far (...)”. it is worth noting that, according to the report by the process coordinator, the quality and performance indicators are the same as before working from home began. supporting the report of the process coordinator, serrador et al. (2018) note that it is often argued that teams allocated in the same physical space perform better and achieve greater project success; however, the authors did not identify a significant difference between local and remote teams in their study on the climate for the success of development projects (serrador et al., 2018).

6.2 preliminary evaluation of tact

the literature recommends implementing a pilot study for the initial assessment of instruments that measure behaviors, attitudes, or feelings (dybå, 2000; patterson et al., 2005; shahzad et al., 2017; recker, 2013). the pilot must use a sample with the same characteristics as the target population (anderson and west, 1998; dybå, 2000; shahzad et al., 2017; patterson et al., 2005; recker, 2013). for tact, we decided to carry out the preliminary assessment through a case study because we wanted to capture the perception of the teams' climate from different data sources. the results and analysis presented in the previous sections established a chain of evidence that allows us to infer that tact can capture the organizational climate context experienced by the teams. in the evaluation by specialists (see section 3.3), every tact item was considered related to agile software development teams. this assessment is already evidence of the content validity of tact. in the qualitative analysis (see section 5.2), the leaders confirmed that the items represent agile values, principles, and practices. through the tam dimensions (venkatesh and davis, 2000; venkatesh and bala, 2008), the leaders rated tact positively for intention to use, perceived usefulness, and output quality.

in the quantitative analysis, the correlation matrix (figure 3) revealed high and moderate positive correlations between most of the items of each dimension. only items it04 and it32 showed an insignificant correlation with the other items of their dimensions; thus, we excluded it04 and it32 from the factor analysis. development teams that talk about the subjects of it04 report positive emotions, contributing to the group's optimism (licorish and macdonell, 2014). however, when analyzing the results, team a members understood that talking about these issues would be negative behavior. this misinterpretation may have been caused by the description of the item, “it04. team members frequently talk about club (...)”. regarding it04, the leader of team c said: “perhaps the word ‘frequently’ caused the misunderstanding”. considering the quantitative analysis and the reports, we excluded it04 from tact. we did not capture reports of misinterpretation of item it32; its low correlation may be related to the sample and not to the construct. thus, we opted to keep it32 in tact.
factor analysis allowed us, based on the response patterns, to verify the proposed structure of tact, i.e., the items associated with each dimension. the parallel analysis graph (figure 4) indicated that a single factor could explain each dimension. furthermore, most tact items have high (> 0.6) factor loadings (see table 3). therefore, there is initial empirical evidence that the structure proposed in tact is acceptable. the quantitative analysis revealed high reliability of the tact dimensions (see table 3). the cronbach's α indexes for each dimension are greater than 0.8, which implies that the reliability of tact for this case study is high (landis and koch, 1977).

6.3 tact use recommendations

wagner et al. (2020) recommend that software engineering research should either adopt or develop psychometrically validated questionnaires. we extend that recommendation to companies that carry out organizational climate assessments. validating a climate instrument without selling intent is challenging because it is necessary to find companies or people willing to invest their time answering a questionnaire without anything in return. we highlight that all evidence of validity and reliability is conditioned to the date this research was conducted, i.e., the more investigations are executed using tact, the more evidence of validity and reliability there will be. thereby, the tact items can be used by researchers who want to measure the proposed constructs or investigate other possible factor structures.

the organizational climate is measured through behaviors manifested or feelings perceived by the employees. climate instruments are self-reports: only the team member knows how they are feeling. although many factors can skew a team member's view, when several individuals point in the same direction, a point of investigation is revealed. for example, an item with too many negative ratings might indicate a lack of practice, a specific problem, or a misunderstanding about the agile mindset. therefore, climate instruments only allow for a pre-diagnosis of what must be investigated and dealt with in the later stages of the organizational climate management process. organizational climate instruments measure latent variables (those that are not directly observable). the tact items represent examples of good behaviors or practices widely used in agile software development teams. therefore, team leaders, managers, or those responsible for preparing and conducting organizational climate assessments can use the tact items for a more accurate diagnosis. if a specific item has many neutral or negative evaluations, an investigation point is revealed. for example, the assessment of item “it35. my team has time to plan the changes without excessive stress or pressure” shows how the team member feels (stressed/pressured) and suggests which project activities or situations (such as iteration planning, task estimation, or abusive or unrealistic deadlines given by the po or a manager) might be the cause of that feeling. notice that the terms in the item description (for example, plan, stress, and pressure in it35) allow team members to reflect on how they are feeling about day-to-day events. to create the description of every tact item, we used generic nomenclatures for the roles and practices used in hybrid and agile processes. scrum is the most used agile methodology (digital.ai, 2020).
to create every tact item description, we used generic nomenclature for the roles and practices used in hybrid and agile processes. scrum is the most used agile methodology (digital.ai, 2020); however, we do not use the names of the roles or ceremonies from scrum, e.g., we use team facilitator, iteration, and meeting review instead of scrum master, sprint, and sprint retrospective. by doing that, we expect to reach more teams using different process configurations. on the other hand, if tact is used by a team whose software development process uses other names for roles or ceremonies, or lacks a specific role altogether, the team members can misunderstand the items. to address this limitation and threat, at the beginning of the climate survey we show the vocabulary of terms used in tact compared with scrum terms. regarding the number of items and the time interval of the application of tact, based on a previous study (dutra and santos, 2020), we claim that using many items and having a long time interval in the organizational climate survey in agile teams can hinder the assessment, diagnosis, and establishment of actions for climate management. having too many items in climate surveys and lacking control activities can also demotivate team members’ participation in new climate surveys. in that regard, we recommend that practitioners adopt one or two dimensions per cycle, performing several cycles per year. however, more critical than measuring the organizational climate is involving the team in discussions of possible actions that allow a climate change. a simple open-ended question that can help team engagement in climate management is “how to improve your team’s organizational climate?”.

7 limitations and threats to validity
the research procedures used in this study are adequate to build an organizational climate instrument, but we faced some limitations. the main one concerns the small sample size. as mentioned in section 5.3, the quantitative procedures have an initial exploratory purpose. due to the small sample size, the use of factor analysis (fa) is not possible without segregating the data; because of that, we conducted fa for each tact dimension separately. in future studies (see section 8.1), we will perform exploratory and confirmatory fa. in pulse 3, only 4 out of 9 team a members answered the survey. the number of participants can hinder team a’s organizational climate assessment because the four respondents may share the same perspective of the team’s organizational climate, while the members of team a who did not participate in pulse 3 might have another perspective. the interview with team a’s leader helped us confirm the results and deal with this limitation. recker (2013) proposes some principles for evaluating qualitative and quantitative studies. concerning reliability, a contextual description of the organization was presented, as well as direct quotes from team members and leaders that were considered to support the analysis. thus, it is possible to guarantee that individuals other than the researchers, when considering the same observations or data, will reach the same or similar conclusions (recker, 2013). from a quantitative point of view, an investigation was carried out to assess the study’s reliability using descriptive statistics, correlations, and cronbach’s α coefficient. thus, the reliability of the tact dimensions for the case study sample is high. to address possible threats to internal validity, we decided to use multiple sources of evidence. the team members assessed the organizational climate through the tact items and open-ended questions. in addition, the leaders’ perceptions were captured through interviews.
in this way, a chain of evidence was established, and the review of the evaluation results was assured (which also relates to measurement validity). regarding tact, two auditors experienced in agile methods and the leaders in the study assessed whether the item descriptions represented elements of agile values, principles, and practices. external validity concerns how much and when the results of a study can be generalized to other cases or domains (recker, 2013). to mitigate this threat, we provide detailed descriptions of the study context. however, schneider et al. (2014) claim that everything that happens in the organization changes its climate. thus, it is not possible to guarantee similar results in another cycle with the same examined teams or even with other teams of the same organization.

8 final considerations
we presented the initial version of tact (instrument to assess the organizational climate of agile teams), designed to measure the dimensions of communication, collaboration, leadership, autonomy, decision making, and client involvement. we also presented a case study to evaluate tact and measure the organizational climate of three agile teams from the same organization. data collection included tact results, interviews with team leaders, and answers to open-ended questions by the participants. the sample data revealed a positive organizational climate for all dimensions in teams b and c, and a negative climate for the autonomy, decision making, and client involvement dimensions in team a. thereby, some items assessed as negative or neutral indicated points of attention. through open-ended questions and interviews with leaders, the evaluation carried out through tact was confirmed and the points of attention were better explored. we identified the abandonment of some agile ceremonies, difficulties in planning the iteration, the inability of the product owner to keep up with the speed and capacity of the team, and even the absence of leadership. based on the statistical analysis of the data from assessing the organizational climate, there is initial evidence that the validity and reliability of the tact dimensions are high.

8.1 future works
besides the tact dimensions proposed in the present study, we are investigating new constructs: motivation, trust, learning, and knowledge. other case studies are being executed to assess the climate of the same three teams mentioned in this study and of four other teams from another organization. after finishing the case studies cycle, we will execute a survey to investigate and validate the factorial structure of all tact dimensions. we will use exploratory and confirmatory factor analysis to investigate and confirm the measured dimensions. as a result, tact dimensions and items will likely be reduced. after conducting the survey, we will have the means to create guidelines for using tact and interpreting its results. we also intend to investigate the influence of gender, team size, and team members’ experience with agile methodologies on the organizational climate. moreover, in the future, there might be some value in digging deeper into an investigation on whether the organizational climate of employees and outsourced team members differs.

acknowledgements
we thank unirio (ppq-unirio 01/2019 and 04/2020; ppinst-unirio 05/2020) for their financial support.

references
açikgöz, a. (2017). the mediating role of team collaboration between procedural justice climate and new product development performance. international journal of innovation management, 21(04):1750039.
acuña, s. t., gómez, m., and juristo, n. (2008). towards understanding the relationship between team climate and software quality – a quasi-experimental study. empirical software engineering, 13(4):401–434.
açıkgöz, a. and gunsel, a. (2016). individual creativity and team climate in software development projects: the mediating role of team decision processes. creativity and innovation management, 25(4):445–463.
açıkgöz, a., günsel, a., bayyurt, n., and kuzey, c. (2014). team climate, team cognition, team intuition, and software quality: the moderating role of project complexity. group decision and negotiation, 23(5):1145–1176.
açıkgöz, a. and i̇lhan, ö. (2015). climate and problem solving in software development teams. procedia - social and behavioral sciences, 207:502–511.
ahmed, s., ahmed, s., naseem, a., and razzaq, a. (2017). motivators and demotivators of agile software development: elicitation and analysis. international journal of advanced computer science and applications, 8(12):1–11.
ancona, d. g. and caldwell, d. f. (1992). bridging the boundary: external activity and performance in organizational teams. administrative science quarterly, 37(4):634.
anderson, n. r. and west, m. a. (1998). measuring climate for work group innovation: development and validation of the team climate inventory. journal of organizational behavior, 19(3):235–258.
annosi, m. c., martini, a., brunetta, f., and marchegiani, l. (2020). learning in an agile setting: a multilevel research study on the evolution of organizational routines. journal of business research, 110:554–566.
askarinejadamiri, z. (2016). personality requirements in requirement engineering of web development: a systematic literature review. in 2016 second international conference on web research (icwr), pages 183–188, tehran, iran. ieee.
bandura, a. (2006). summary for policymakers. in intergovernmental panel on climate change, editor, climate change 2013 - the physical science basis, pages 1–30. cambridge university press, cambridge.
beck, k., beedle, m., van bennekum, a., et al. (2001). manifesto for agile software development.
chagas, a. (2015). o impacto dos fatores humanos nos métodos ágeis (in portuguese).
chagas, a., santos, m., santana, c., and vasconcelos, a. (2015). the impact of human factors on agile projects. in 2015 agile conference, pages 87–91, national harbor, md, usa. ieee.
curtis, b., hefley, w. e., and miller, s. a. (2009). people capability maturity model (p-cmm) version 2.0, second edition. technical report, carnegie mellon university.
davis, k. g., kotowski, s. e., daniel, d., gerding, t., naylor, j., and syck, m. (2020). the home office: ergonomic lessons from the “new normal”. ergonomics in design: the quarterly of human factors applications, 28(4):4–10.
delgado-rico, e., carretero-dios, h., and ruch, w. (2012). content validity evidences in test development: an applied perspective. international journal of clinical and health psychology.
digital.ai (2020). 14th annual state of agile report. technical report, digital.ai.
dima, a. l. (2018). scale validation in applied health research: tutorial for a 6-step r-based psychometrics protocol. health psychology and behavioral medicine, 6(1):136–161.
dutra, e., diirr, b., and santos, g. (2021). human factors and their influence on software development teams – a tertiary study. in brazilian symposium on software engineering, sbes ’21, pages 442–451, new york, ny, usa. association for computing machinery.
dutra, e., lima, p., and santos, g. (2020). an instrument to assess the organizational climate of agile teams – a preliminary study. in 19th brazilian symposium on software quality, pages 1–10, são luis, brazil. acm.
dutra, e. and santos, g. (2020). organisational climate assessments of agile teams – a qualitative multiple case study. iet software, 14(7):861–870.
dutra, j. s., fischer, a. l., nakata, l. e., pereira, j. c. r., and veloso, e. f. r. (2012). the use of categories as an indicator of organizational climate in brazilian companies. revista de carreiras e pessoas, 2:145–176.
dybå, t. (2000). an instrument for measuring the key factors of success in software process improvement. empirical software engineering, 5:357–390.
dybå, t. and dingsøyr, t. (2008). empirical studies of agile software development: a systematic review. information and software technology, 50(9-10):833–859.
field, a., miles, j., and field, z. (2012). discovering statistics using r. sage publications, london, 1 edition.
franca, a., gouveia, t., santos, p., santana, c., and da silva, f. (2011). motivation in software engineering: a systematic review update. in 15th annual conference on evaluation and assessment in software engineering (ease 2011), pages 154–163, durham, uk. iet.
ganesh, m. p. (2013). climate in software development teams: role of task interdependence and procedural justice. asian academy of management journal.
ganesh, m. p. and gupta, m. (2006). study of virtualness, task interdependence, extra-role performance and team climate in indian software development teams. in proceedings of the 20th australian new zealand academy of management (anzam) conference on management: pragmatism, philosophy, priorities, central queensland university, rockhampton, 20:1–19.
gonzález-romá, v., fortes-ferreira, l., and peiró, j. m. (2009). team climate, climate strength and team performance: a longitudinal study. journal of occupational and organizational psychology, 82(3):511–536.
graziotin, d., lenberg, p., feldt, r., and wagner, s. (2020). psychometrics in behavioral software engineering: a methodological introduction with guidelines. acm trans. softw. eng. methodol., article 111, 49 pages.
grobelna, k. and stefan, t. (2019). the impact of organizational climate on the regularity of work speed of agile software development teams. entrepreneurship and management, 12(1):229–241.
hinkle, d., wiersma, w., and jurs, s. (2003). applied statistics for the behavioural sciences. houghton mifflin, boston, 5 edition.
hoda, r., kruchten, p., noble, j., and marshall, s. (2010). agility in context. acm sigplan notices, 45(10):74–88.
hohl, p., klünder, j., van bennekum, a., lockard, r., gifford, j., münch, j., stupperich, m., and schneider, k. (2018). back to the future: origins and directions of the “agile manifesto” – views of the originators. journal of software engineering research and development, 6(1):15.
horn, j. l. (1965). a rationale and test for the number of factors in factor analysis. psychometrika, 30(2):179–185.
jia, j., zhang, p., and capretz, l. f. (2016). environmental factors influencing individual decision-making behavior in software projects. in proceedings of the 9th international workshop on cooperative and human aspects of software engineering, pages 86–92, new york, ny, usa. acm.
kaiser, h. f. (1960). the application of electronic computers to factor analysis. educational and psychological measurement, 20(1):141–151.
karhatsu, h., ikonen, m., kettunen, p., fagerholm, f., and abrahamsson, p. (2010). building blocks for self-organizing software development teams: a framework model and empirical pilot study. in 2010 2nd international conference on software technology and engineering (icste 2010), proceedings.
kettunen, p. (2014). directing high-performing software teams: proposal of a capability-based assessment instrument approach. in bergsmann, j., editor, lecture notes in business information processing, pages 229–243. springer, cham.
klünder, j., karajic, d., tell, p., karras, o., münkel, c., münch, j., macdonell, s. g., hebig, r., and kuhrmann, m. (2020). determining context factors for hybrid development methods with trained models. in proceedings of the international conference on software and system processes, pages 61–70, new york, ny, usa. acm.
kyriazos, t. a. (2018). applied psychometrics: sample size and sample power considerations in factor analysis (efa, cfa) and sem in general. psychology.
landis, j. r. and koch, g. g. (1977). the measurement of observer agreement for categorical data. biometrics, 33(1):159.
lee, j.-n. (2001). the impact of knowledge sharing, organizational capability and partnership quality on is outsourcing success. information and management, 38(5):323–335.
lenberg, p., feldt, r., and wallgren, l. g. (2015). behavioral software engineering: a definition and systematic literature review. journal of systems and software, 107:15–37.
licorish, s. a. and macdonell, s. g. (2014). understanding the attitudes, knowledge sharing behaviors and task performance of core developers: a longitudinal study. information and software technology, 56(12):1578–1596.
mcavoy, j. and butler, t. (2007). the impact of the abilene paradox on double-loop learning in an agile team. information and software technology, 49(6):552–563.
miller, g. j. (2020). framework for project management in agile projects: a quantitative study.
misra, s. c., kumar, v., and kumar, u. (2009). identifying some important success factors in adopting agile software development practices. journal of systems and software.
moe, n. b., dingsøyr, t., and dybå, t. (2008). understanding self-organizing teams in agile software development. in 19th australian conference on software engineering (aswec 2008), pages 76–85. ieee.
moe, n. b. and dingsøyr, t. (2008). scrum and team effectiveness: theory and practice. in lecture notes in business information processing.
moe, n. b., dingsøyr, t., and kvangardsnes, ø. (2009). understanding shared leadership in agile development: a case study. in 2009 42nd hawaii international conference on system sciences, pages 1–10. ieee.
ji, n. and wang, j. (2012). a software project management simulation model based on team climate factors analysis. in 2012 international conference on information management, innovation management and industrial engineering, pages 304–308, sanya, china. ieee.
noll, j., razzak, m. a., bass, j. m., and beecham, s. (2017). a study of the scrum master’s role. in lecture notes in computer science (including subseries lecture notes in artificial intelligence and lecture notes in bioinformatics), pages 307–323. springer, innsbruck, austria.
patterson, m. g., west, m. a., shackleton, v. j., dawson, j. f., lawthom, r., maitlis, s., robinson, d. l., and wallace, a. m. (2005). validating the organizational climate measure: links to managerial practices, productivity and innovation. journal of organizational behavior, 26(4):379–408.
pmi and agile alliance (2017). agile practice guide. pmi, pennsylvania, 1st edition.
recker, j. (2013). scientific research in information systems. springer berlin heidelberg, berlin, heidelberg.
revelle, w. (2018). how to: use the psych package for factor analysis and data reduction. technical report, northwestern university.
runeson, p. and höst, m. (2009). guidelines for conducting and reporting case study research in software engineering. empirical software engineering, 14(2):131–164.
schneider, b. and barbera, k. m. (2014). summary and conclusion. in schneider, b. and barbera, k. m., editors, the oxford handbook of organizational climate and culture, pages 1–14. oxford university press, new york, ny, usa.
senapathi, m. and srinivasan, a. (2013). sustained agile usage. in proceedings of the 17th international conference on evaluation and assessment in software engineering - ease ’13, page 119, new york, new york, usa. acm press.
serrador, p., gemino, a., and horner, b. (2018). creating a climate for project success. journal of modern project management, may/august:38–47.
shahzad, f., xiu, g., and shahbaz, m. (2017). organizational culture and innovation performance in pakistan’s software industry. technology in society, 51:66–73.
sharma, a. and gupta, a. (2012). impact of organisational climate and demographics on project specific risks in context to indian software industry. international journal of project management, 30(2):176–187.
shull, f., singer, j., and sjøberg, d. i. (2008). guide to advanced empirical software engineering. springer london, london.
soomro, a. b., salleh, n., mendes, e., grundy, j., burch, g., and nordin, a. (2016). the effect of software engineers’ personality traits on team climate and performance: a systematic literature review. information and software technology, 73:52–65.
spector, p. (1992). summated rating scale construction: an introduction. sage publications, inc.
stewart, k. j. and gosain, s. (2006). the moderating role of development stage in free/open source software project performance. software process: improvement and practice, 11(2):177–191.
stone, r. w. and bailey, j. j. (2007). team conflict self-efficacy and outcome expectancy of business students. journal of education for business, 82(5):258–266.
vandenbos, g. r., editor (2017). apa dictionary of psychology.
venkatesh, v. and bala, h. (2008). technology acceptance model 3 and a research agenda on interventions. decision sciences, 39(2):273–315.
venkatesh, v. and davis, f. d. (2000). a theoretical extension of the technology acceptance model: four longitudinal field studies. management science, 46(2):186–204.
vishnubhotla, s. d., mendes, e., and lundberg, l. (2018). an insight into the capabilities of professionals and teams in agile software development. in proceedings of the 2018 7th international conference on software and computer applications - icsca 2018, pages 10–19, new york, new york, usa. acm press.
vishnubhotla, s. d., mendes, e., and lundberg, l. (2020). investigating the relationship between personalities and agile team climate of software professionals in a telecom company. information and software technology, 126:106335.
wagner, s., mendez, d., felderer, m., graziotin, d., and kalinowski, m. (2020). contemporary empirical methods in software engineering. springer international publishing, cham.
yin, r. k. (2013). case study research: design and methods. sage publications, los angeles, 5 edition.
zaineb, g., shaikh, b., and ahsan, a. (2012). recommended cultural and business practices for project based software organization of pakistan for supporting restructuring of functional organization for implementing agile based development framework in software projects. in 2012 international conference on information management, innovation management and industrial engineering, pages 16–20. ieee.

a appendix
a.1 constructs
construct communication
• conceptual definition
– frequent communication between project stakeholders is core to agile software development (chagas et al., 2015; chagas, 2015).
– “the perception of participatory safety could encourage team members to be open in communicating their ideas with the team, which could otherwise be risky” (ganesh, 2013).
– vishnubhotla et al. (2018) reported “the ‘insider’ voices of scrum practitioners about the soft skills they consider most valued to have by product owner and scrum master. communication skills and teamwork were most valued for both roles. besides them, customer orientation was expressed as important for program managers, whereas commitment, responsibility, interpersonal and planning skills were considered valuable for scrum masters”.
– “gap in communication between developer and customer can guarantee the success of the project while in contrast lack of communication skill causes project problems” (askarinejadamiri, 2016).
• operational definition
– communication is a capability for the team member (vishnubhotla et al., 2018).
– communication is an attribute for the team (vishnubhotla et al., 2018).
– the team has formal and informal communication (dybå and dingsøyr, 2008).
– the team discusses the project and impediments (moe and dingsøyr, 2008; pmi and agile alliance, 2017).
– the team discusses how to improve the process and the project (moe and dingsøyr, 2008; pmi and agile alliance, 2017).
construct collaboration
• conceptual definition
– “team collaboration is a set of functions and activities carried out before, during, and after teamwork to achieve team objectives” (açikgöz, 2017).
– “customer collaboration over contract negotiation” (beck et al., 2001).
– “communication and collaboration (c&c) are at the heart of agile software development. as the agile manifesto states, ‘individuals and interactions over processes and tools’ and ‘customer collaboration over contract negotiation’. one aspect in c&c is customer cooperation” (karhatsu et al., 2010).
• operational definition
– team collaboration involves communication and coordination (karhatsu et al., 2010).
– collaboration involves working as a team with i) the client (or their representative), ii) the team, and iii) other stakeholders (açıkgöz et al., 2014; chagas et al., 2015; vishnubhotla et al., 2018).
construct leadership
• conceptual definition
– the leadership (in agile projects) is based on the role of the servant leader (pmi and agile alliance, 2017).
– “team leadership plays a significant role in improving interpersonal and group processes within the team. team leaders who play the role of ‘communication integrators’ are very crucial for the success of the team. the team leader should also ensure periodically whether the members are clear with the team objectives and understand their level of agreement with those objectives” (ganesh and gupta, 2006).
– “agile software engineering adopts a leadership style that empowers the people involved in the development process” (chagas, 2015).
• operational definition
– leadership is played by a formal role (pmi and agile alliance, 2017; noll et al., 2017).
– the leader facilitates ceremonies, removes impediments, and shields the team from outside interference (pmi and agile alliance, 2017; noll et al., 2017).
– the leader is a “communication integrator” (ganesh and gupta, 2006).
construct autonomy
• conceptual definition
– “the autonomy of a team is defined as the ability to continue to operate in its own way without external interference. the role of formal authority is redesigned, so that governance and coordination appear to be the outcome of actions of networks, operating without any formal sanction” (annosi et al., 2020).
– “autonomy refers to the authority and responsibility that a team has in their work. it is a significant factor for team effectiveness. a team must have a real possibility to influence relevant matters; otherwise self-organization is more symbolic than real. on the other hand, a team should not be left completely alone. instead, while management should give a team substantial freedom, it should maintain subtle control and have regular checkpoints. three levels of autonomy are external, internal, and individual. the external refers to the degree that the people outside of a team influence the team’s decisions. moreover, it sets the decision-making boundaries for the team. meanwhile, internal autonomy defines how the work is organized inside the team. the team may have substantial power to make decisions while some individuals have none. great care should be taken to make sure that there really is internal autonomy instead of, for example, team leader autonomy. finally, individual autonomy, on its part, tells how much an individual has freedom to decide about his or her own work processes” (karhatsu et al., 2010).
• operational definition
– individual, internal, and external autonomy (karhatsu et al., 2010).
– the team plans the tasks (karhatsu et al., 2010).
– the leader protects the team (noll et al., 2017; pmi and agile alliance, 2017).
– the team has good communication with the client (moe et al., 2008).
construct decision-making
• conceptual definition
– “responding to change over following a plan” (beck et al., 2001).
– “at regular intervals, the team reflects on how to become more effective, then tunes and adjusts its behavior accordingly” (beck et al., 2001).
– “software development involves interdependent individuals working together to achieve favorable outcomes, so the decision-making behavior of each individual will influence behaviors of other teammates and the project outcome. individuals have many chances to make a decision in a development process. for example, individuals may choose a resolution to deal with a conflict. in agile development, each one makes a decision about effort estimation and gives user story points. individuals may often independently make ‘work’ or ‘shirk’ choices in teamwork. under these conditions, different individual decision-making behaviors will generate different results, which are pertinent to the success or failure of the project” (jia et al., 2016).
– “product development teams quite often experience problems, barriers and setbacks during the new product development project, which require an immediate and effective decision process to generate sufficient courses of action. decision processes refer to team members’ collective efforts to process knowledge about key task-related components, emerging issues and problems. individual creativity represents a possible contribution to the teams to deal with these difficulties. moreover, creativity-based decision processes likely allow the teams to become more proactive when dealing with emerging issues. indeed, product development teams have to think outside the box when making decisions, as well as offer practical solutions for problems that can be implemented beyond organizational constraints. such a process is characterized by the ability to understand complexity, to break through prevailing cognitive patterns, and to try new paths when old sets do not work” (açıkgöz and gunsel, 2016).
• operational definition
– task identity and significance (jia et al., 2016).
– the member perceives recognition from management and leadership (jia et al., 2016).
– the team has fast and effective communication (chagas, 2015; chagas et al., 2015).
– the team plans the project without stress or pressure (jia et al., 2016).
– the team shares decision-making (chagas, 2015; chagas et al., 2015).
– the team’s autonomy influences decision-making (chagas, 2015; chagas et al., 2015).
construct client involvement
• conceptual definition
– having a client focus is one of the main aims of an agile team (karhatsu et al., 2010).
– “customer collaboration over contract negotiation” (beck et al., 2001).
– “agile processes promote sustainable development. the sponsors, developers, and users should be able to maintain a constant pace indefinitely” (beck et al., 2001).
– “lack of client involvement is ‘the biggest problem’ because agile [requires] fairly strong client involvement” (karhatsu et al., 2010).
– “welcome changing requirements, even late in development. agile processes harness change for the customer’s competitive advantage” (beck et al., 2001).
• operational definition
– client satisfaction, collaboration, and commitment are features of client involvement (jia et al., 2016).
– a good relationship with users/clients is a motivating aspect for the team (franca et al., 2011).
– the client (or their representative) provides and elucidates requirements (dybå and dingsøyr, 2008).
– the client (or their representative) validates the software (dybå and dingsøyr, 2008).

a.2 the items of tact by dimension
table 4. items used to measure the communication dimension (# represents original items; sources in parentheses)
it01. in this team, we can freely talk to each other about difficulties we are having (stewart and gosain, 2006)
it02. the team keeps the list of impediments, risks and control actions updated # (anderson and west, 1998; miller, 2020; pmi and agile alliance, 2017)
it03. my opinion is always listened to by my team (anderson and west, 1998)
it04. team members frequently talk about club, entertainment, gym, parties, sports, and films # * (anderson and west, 1998; licorish and macdonell, 2014; shahzad et al., 2017)
it05. during the retrospectives, the team finds the best way to do things # (chagas et al., 2015; chagas, 2015; gonzález-romá et al., 2009)
it06. the team knows the skills and technical expertise of team members, and they use the skills and technical expertise appropriately and adequately # (ji and wang, 2012)
it07. in the current project, the daily meeting allows to know project problems and team difficulties # (chagas et al., 2015; dybå and dingsøyr, 2008)
it08. the team and the product owner always reach consensus on the priority of the user stories by negotiating which bug to fix or functionality to add # (chagas, 2015; ji and wang, 2012)
it09. in the current project, the team and the product owner always solve the disagreements about the iteration scope # (miller, 2020; noll et al., 2017)

table 5. items used to measure the collaboration dimension (# represents original items)
it10. team members consider sharing know-how with each other (lee, 2001)
it11. team members always help each other when there is a need (shahzad et al., 2017)
it12. my team works efficiently together in the face of difficulties (açikgöz, 2017; shahzad et al., 2017)
it13. team members work together as a whole (anderson and west, 1998)
it14. all project-related decisions are applied consistently across affected team members (anderson and west, 1998)
it15. the team collaborates to look for new ways to analyze the problems # (patterson et al., 2005; vishnubhotla et al., 2018)
it16. the team has excellent ability to design the software based on user stories # (açıkgöz et al., 2014; pmi and agile alliance, 2017)
it17. in the current project, the team, the product owner, and the team facilitator work excellently together to plan the iteration # (dybå and dingsøyr, 2008; noll et al., 2017)

table 6. items used to measure the leadership dimension (# represents original items)
it18. the team facilitator gives me helpful feedback on how to be more effective (sharma and gupta, 2012)
it19. the team facilitator eliminates barriers, encourages, and facilitates the use of agile methods # (noll et al., 2017; senapathi and srinivasan, 2013)
it20. the team facilitator listens to my ideas and concerns (sharma and gupta, 2012)
it21. the team facilitator discusses the problems of the team (açıkgöz and i̇lhan, 2015)
it22. the team facilitator protects the team from outside interference (ancona and caldwell, 1992)
it23. the team facilitator helps my team to acknowledge and solve our disagreements (stone and bailey, 2007)
it24. the team facilitator assists to understand whether the iteration objectives are clear and whether the team agrees with these objectives # (ganesh and gupta, 2006; pmi and agile alliance, 2017)
it25. the team facilitator gives the team helpful feedback on how to be more agile # (pmi and agile alliance, 2017; sharma and gupta, 2012)
it26. the team facilitator is always free to support the team when business requirements conflict with the technical reality # (noll et al., 2017; pmi and agile alliance, 2017)
it27. the team facilitator investigates and helps the team to be more effective, taking into account the team velocity and the team capacity # (chagas et al., 2015; miller, 2020; noll et al., 2017)
table 7. items used to measure the autonomy dimension (# represents original items)
it28. in the current project, i am free to choose the tasks i want to execute in the iteration # (karhatsu et al., 2010)
it29. in the current project, the team facilitator protects the team autonomy from external interferences # (karhatsu et al., 2010; moe and dingsøyr, 2008)
it30. in this organization, we have the autonomy to suggest changes to the team’s software development process # (patterson et al., 2005)
it31. in this team, we switch assignments in tasks to avoid specialization and individualism # (moe and dingsøyr, 2008; chagas, 2015)
it32. the team has autonomy to adopt technical solutions without consulting the product owner or the management # (patterson et al., 2005)
it33. my team has autonomy to communicate with the product owner and other relevant stakeholders # (moe and dingsøyr, 2008; chagas, 2015)
it34. my team has decision authority and responsibility to plan the iteration # (karhatsu et al., 2010; pmi and agile alliance, 2017)

table 8. items used to measure the decision-making dimension (# represents original items)
it35. my team has time to plan the changes without excessive stress or pressure # (jia et al., 2016; kettunen, 2014)
it36. in my team, members do not need to think equally # (chagas, 2015; mcavoy and butler, 2007)
it37. in the iteration planning, the team analyzes the technical alternatives and chooses the most appropriate one # (chagas, 2015; moe et al., 2009; pmi and agile alliance, 2017)
it38. in the retrospective, the team identifies, analyzes and selects improvement items # (jia et al., 2016; pmi and agile alliance, 2017)
it39. my team has open and effective communication # (misra et al., 2009)
it40. this organization allows the team to make their own technical decisions about the best way to develop the project # (patterson et al., 2005; chagas, 2015)
it41. the dependencies between the tasks do not hinder the fluidity of the project and do not cause major restrictions # (jia et al., 2016; pmi and agile alliance, 2017)
it42. in the current project, my work is recognized by management # (jia et al., 2016)

table 9. items used to measure the client involvement dimension (# represents original items)
it43. during the demo review, the team shows and validates the new functionalities with the right people # (ancona and caldwell, 1992; pmi and agile alliance, 2017)
it44. in the current project, there are frequent meetings with business representatives and the team (serrador et al., 2018; zaineb et al., 2012)
it45. stakeholders always have the opportunity to suggest changes or improvements to the software # (pmi and agile alliance, 2017)
it46. in the demo review, project problems and improvements are identified with stakeholders’ participation # (serrador et al., 2018; pmi and agile alliance, 2017)
it47. the current project does not have frequent requirement changes due to bad user stories definition # (sharma and gupta, 2012; ahmed et al., 2017)
it48. the current project has met or exceeded the client expectations # (misra et al., 2009; ahmed et al., 2017)
it49. the product owner is always available to explain the user stories’ details # (hoda et al., 2010; pmi and agile alliance, 2017)
journal of software engineering research and development, 2019, 8:3, doi: 10.5753/jserd.2020.473
this work is licensed under a creative commons attribution 4.0 international license.
towards a new template for the specification of requirements in semi-structured natural language
raúl mazo [ lab-sticc, ensta bretagne, brest, france. giditic, universidad eafit, medellín, colombia | raul.mazo@ensta-bretagne.fr ]
carlos andrés jaramillo [ universidad eafit, medellín, colombia | cajaramilg@eafit.edu.co ]
paola vallejo [ giditic, universidad eafit, medellín, colombia | pvallej3@eafit.edu.co ]
jhon harvey medina [ universidad eafit, medellín, colombia | jhmedinaa@eafit.edu.co ]
abstract
requirements engineering is a systematic and disciplined approach for the specification and management of software requirements; one of its objectives is to transform the requirements of the stakeholders into formal specifications in order to analyze and implement a system. these requirements are usually expressed and articulated in natural language, due to the universality and ease that natural language offers for communicating them. to facilitate the transformation processes and to improve the quality of the resulting requirements, several authors have proposed templates for writing requirements in structured natural language. however, these templates do not allow writing certain functional requirements, non-functional requirements and constraints, and they do not adapt correctly to certain types of systems such as self-adaptive, product line-based, and embedded systems. this paper (i) presents evidence of the weaknesses of the template recommended by the ireb® (international requirements engineering board), and (ii) lays the foundations, through certain improvements to the template proposed by the ireb®, for facilitating the work of requirements engineers and therefore improving the quality of the products specified with the new template. this new template was built and evaluated through two action research cycles. in each cycle we identified the problems in specifying the requirements of the corresponding industrial case with the corresponding baseline template, proposed some improvements to address these problems, and analyzed the results of using the new template to specify the requirements of each case.
thus, the resulting template was able to correctly express all the requirements of both industrial cases. despite the promising results of this new template, it is still preliminary work regarding its coverage and the quality level of the requirements that can be written with it.
keywords: requirement, requirements engineering, natural language, template, application requirement, domain requirement, self-adaptive requirement

1 introduction
the requirements are perhaps the most important basis in the construction of software products because, through them, the stakeholders of the system that is going to be implemented can achieve a common understanding of it. according to wiegers and beatty (wiegers and beatty 2013), the two most important objectives in specifying a requirement are that (i) when several people read the requirement, they reach the same interpretation; and (ii) the interpretation of each reader coincides with what the author of the requirement was trying to communicate. in this sense, pohl (pohl 2010) states that nl (natural language) is the most common way to communicate and document the requirements of a system since nl is universal and available to any individual in any field; besides, it does not require any kind of special training in the interpretation of notations or symbols as occurs when using an engineering language such as uml (unified modeling language). however, these advantages are overshadowed by the disadvantages of natural language (rupp 2007). according to mavin et al. (mavin et al. 2009), some of the problems likely to appear in requirements specifications in nl are (i) ambiguity: a word or phrase has two or more different meanings; (ii) vagueness: lack of precision, structure or detail; (iii) complexity: composite requirements that contain complex sub-clauses or several interrelated statements; (iv) omission: missing requirements, particularly the requirements to handle unwanted behavior; (v) duplication: repetition of requirements defining the same need; (vi) verbosity: use of an unnecessary number of words; (vii) implementation: statements of how the system should be built, rather than what the system should do; and (viii) untestability: requirements that cannot be proven (true or false) when the system is implemented. to reduce these problems in the specifications of the requirements of a system, several authors have defined what is known as a template, mold, pattern or boilerplate (rupp 2007). a template defines the structure that the requirements written in nl should have; that structure is flexible, so the resulting requirements have both the advantage of being in nl and the advantage of having a well-defined structure. this nl, bounded by the possibilities and restrictions of the template, is known as semi-structured natural language. the notations in semi-structured language make it possible to build requirements by following a template and assigning a similar structure to each requirement. this approach helps to avoid errors in the early stages of the development process by specifying high-quality requirements efficiently in time and cost (sophist 2014). the template proposed by rupp (rupp 2007), also known as master (mustergültige anforderungen – die sophist templates für requirements) (sophist 2014), has been accepted as a standard for the syntactic specification of system requirements. this template has been recognized as a valuable aid so that the requirements are more precise and
have a standard syntactic structure that facilitates their understanding (rupp 2007). however, anyone who has used the rupp template in real projects has realized that some requirements cannot be expressed with that structure without some degree of ambiguity or inconsistency. that is the reason this article focuses on investigating the following research question: what are the gaps that requirements engineers find when writing requirements in natural language and how can those gaps be filled? to find an answer to this research question, we have designed an experiment inspired by the action science (or action research) research method (o'brien 2001). two cycles of this method were conducted to analyze the requirements of two independent industrial projects. the first cycle of this action research method was reported in (mazo and jaramillo 2019), and the resulting template was used as input for the second cycle, which was oriented to requirements specifications for self-adaptive systems and produced an improved version of the mazo & jaramillo template, using the relax language (whittle et al. 2009) as a reference. thus, with this research we aim to analyze the rupp template in order to (i) evaluate its ability to represent industrial product requirements in a semi-structured way, and (ii) propose possible improvements to the template; from the point of view of two academics and two experienced requirements engineers in the context of two technology-based companies. this paper is an extension of our previous work that appeared at cibse’19 (mazo and jaramillo 2019). in this paper, we significantly extended and improved the conference paper. first, we extended the empirical study by evaluating our approach with one more real industrial project. second, we introduce the implementation of the resulting template in the variamos tool (mazo et al. 2015). finally, we enriched the related work in this version. the work resulting from this research is an adaptable and extensible template for specifying requirements of different domains (application systems, software product lines, cyber-physical systems, self-adaptive systems). in the future, the template will be adapted and improved to address more domains. this article is structured as follows: section 2 explains the rupp template; section 3 describes the research method used for the experiment; section 4 presents, using some examples, the most evident problems identified when using the rupp template; section 5 presents the proposed improved template; section 6 presents the preliminary evaluation of the new template; section 7 presents the threats to validity of our study; section 8 presents other specification templates for individual requirements and some related works; and section 9 describes the conclusions and future work related to this research.

2 the syntactic structure of the rupp template
as shown in figure 1, the rupp template consists of six spaces (denoted with the letters a, b, c, d, e and f) to compose the syntax of a requirement. this section briefly describes each space of the template.
figure 1. rupp template.
(a) conditions: the first space is a condition or a set of conditions, usually optional, at the beginning of the requirement. a condition can be logical, introduced by the conjunction “if”, or temporal, introduced by the conjunctions “as soon as” or “after that”.
(b) the system: the second space is the name of the system, the subsystem or the component of the system that is specified by the requirement.
(c) degree of obligation: the third space establishes the degree of obligation that the requirement can acquire. the template establishes four levels of obligation:
● the mandatory requirements, using the verb “shall”
● the recommended requirements, using the verb “should”
● the future requirements, using the modal verb “will”
● the desirable requirements, using the verb “may”
(d) functional activity: the fourth space characterizes the functional activity that the system can assume, which includes the process verb of the requirement. there are three types of activities:
● autonomous requirement of the system: indicates a functionality that the system performs independently, without the need for interaction with users.
● user interaction: indicates a functionality that the system provides to users.
● interface requirement: indicates a functionality that the system performs to react to events from other systems.
(e) object: the fifth space is the object on which the behavior specified in the requirement is performed.
(f) object details: the sixth and last space corresponds to the additional details (optional) about the object: the adjectives that qualify it or the characteristics that the object can possess.
some examples proposed by rupp (2007) for the specification of requirements with this template are the following:
● the system should check whether the guest is registered.
● after the guest has selected the function “place order”, the system shall display the menu to the guest.
● the system shall provide the guest with the ability to place his order.
● if the chef has rejected the guest's order, the system should ask the guest whether the guest would like to choose another dish.
the requirements engineering magazine (https://re-magazine.ireb.org/) presents some industrial cases in which the rupp template was used.
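to make the six spaces concrete, the sketch below composes requirement sentences from the slots described above. it is our own illustration of the template's structure, not tooling from rupp or from this paper; the class and field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class RuppRequirement:
    """a requirement sentence assembled from the six spaces (a)-(f)."""
    condition: Optional[str]       # (a) optional logical or temporal condition
    system: str                    # (b) system, subsystem, or component name
    obligation: str                # (c) shall / should / will / may
    activity: str                  # (d) process verb and activity wording
    obj: str                       # (e) object of the behavior
    details: Optional[str] = None  # (f) optional details about the object

    def render(self) -> str:
        parts = []
        if self.condition:
            parts.append(self.condition + ",")
        parts += [self.system, self.obligation, self.activity, self.obj]
        if self.details:
            parts.append(self.details)
        return " ".join(parts) + "."

# one of rupp's own examples, rebuilt from the slots
req = RuppRequirement(
    condition='after the guest has selected the function "place order"',
    system="the system",
    obligation="shall",
    activity="display",
    obj="the menu",
    details="to the guest",
)
print(req.render())
```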
3 research method
the investigation reported in this paper was carried out through the research method called action research (o'brien 2001). action research is defined as “the intervention in a social situation in order to improve this situation and learn from it” (wieringa and morali 2012) (susman and evered 1978). the action research method aims to improve practice by solving real problems and is conducted in order to investigate current phenomena in their natural context (koshy et al. 2010). we have chosen this method because it allows us to answer the research question and achieve the objective of this research through an empirical experiment in an industrial context. besides, (i) this research method can be executed at low cost since researchers play an active role in it; and (ii) the rigor of the action research method allows us to reduce the threats to the validity of the experiment. susman (susman 1983) developed a detailed model of the action research method with the five stages that must be carried out in each cycle of the process: diagnosing, action planning, taking action, evaluating and specifying learning. in the diagnosing stage, researchers identify the problem and collect the data required to carry out a detailed diagnosis. the action planning stage aims to define the different possible solutions that address the problem defined in the first step. during the taking action stage, a solution should be chosen and implemented. in the evaluating stage, researchers should analyze the data corresponding to the results of the chosen action plan. finally, during the specifying learning stage, researchers should interpret the results of the action plan execution and learn according to the success or failure of the solution. therefore, the problem is reevaluated and a new cycle begins until the problem is solved and the stakeholders are satisfied with the obtained result. to answer the research question, we carried out two cycles of the action research method as presented in figure 2. in this experiment, each cycle corresponds to the analysis of a form of specification of the requirements for two industrial projects. the experiment was carried out as follows. in the first cycle, we analyzed the requirements specification of the peopleqa system of the sqa s.a. company. peopleqa is a system for human resource management, which facilitates the self-management of employees in different corporate activities such as permissions, vacation, performance measurement, and internal relations. through the peopleqa system, we proposed the first version of the new template to specify requirements in semi-structured nl. in this cycle, three possible solutions were analyzed: prose style requirements specification (as the stakeholders expressed them), specification using the rupp template, and requirements specification using an improved version of the rupp template that we call the mazo & jaramillo template. in the second cycle, we analyzed the requirements specification of the yuke-greenhouse system of the koral company. yuke-greenhouse is a self-adaptive system for controlling irrigation, temperature, and environment in greenhouses and coffee crops in colombia. in this cycle, three possible solutions were analyzed: prose style requirements specification, specification using the mazo & jaramillo template, and requirements specification using the new improved template presented in this paper.
figure 2. research process
thanks to this strategy it was possible to improve our baseline templates (i.e., the rupp template in the first cycle and the mazo & jaramillo template in the second cycle) in the situations where this template was not adequate. 3. taking action: in this stage, we first considered the requirements that could not be fully specified using the reference templates of each cycle. for these requirements, we evaluated to what extent they could be syntactically specified using the templates found during stage 2. we performed this evaluation in order to find requirements specification reproducible patterns. every time that a reproducible pattern was identified in at least three requirements with similar conditions, this pattern was added to the new template proposed in each cycle in order to enrich them. 4. evaluating: at the end of each cycle, it was evaluated whether the proposed template allowed to specify at least 98% of the industrial case requirements corresponding to the current cycle. the main criteria to evaluate the representation of requirements is that they do not present problems of ambiguity, vagueness, complexity, omission, duplication, verbosity, non-implementation and untestability. mavin et al. (mavin et al. 2009) and (rupp 2007) give us a more detailed description of these criteria, which are considered a de facto standard in requirements engineering. 5. specifying learning: at the end of each cycle, the authors interpreted of the results obtained. then, based on 2 requirements specification 1st cycle (http://shorturl.at/cpdeo) these results they determined the strengths and limitations of the improved template produced in each cycle. the various phases and the succession of cycles are collaborative since the research process and objective have been carried out in collaboration between the authors. this is another characteristic that led us to choose action research as a research method for this work. the research process consists of two cycles, one for each industrial case we had at our disposal. although two cases are not enough to propose a generic set of extensions for the rupp template, the second case provides supplementary evidence that allowed us to re-evaluate and improve the template we reported in the previous version of the article. the use of new real cases to evaluate an engineering artifact in its early stages is welcome and usual in empirical research processes such as the one reported in this article. we, therefore, hope that this new template will be evaluated in many more cycles with new and varied industrial cases that help to collectively build the re template that the industry requires. 4 problems identified in the baseline templates 4.1 first cycle the prose style requirements specification corresponding to the peopleqa system of the sqa s.a. company was rewritten with five requirements specification templates as presented in figure 2. the use of each template corresponds to a micro-cycle into the first cycle of the action research process. at the end of these micro-cycles, we produced the first version of the mazo & jaramillo template that was then evaluated and improved in the subsequent two stages of the first cycle. the problems and gaps detected when working with the templates considered in these micro-cycles are described below. these problems and gaps were saved in a document, available online2, which contains each of the requirements of the industrial case and each of the problems encountered during the investigation. 
in particular, the first sheet presents the requirements in prose style; the second sheet presents the requirements using the rupp template; the third sheet summarizes the problems identified when using the rupp template; and the fourth and last sheet presents the requirements specified with the constructs borrowed from other templates found in the literature. for each type of problem, we define a descriptive name, a brief description, and an example to better understand the problem.

missing reasons. sometimes it is necessary to express the reason for a requirement. for example, in agile development frameworks, one of the most important aspects of specifying requirements through user stories is to state the "why" or the "for what" of the requirement (cohn 2004; beck 1999). this gives better context to whoever implements the functionality or behavior described by the requirement and allows them to better understand the level of importance or priority of the requirement. for example, the requirements "the vms (vital monitoring system) must have the ability to interact with other devices of nearby people to know their vital activity" and "if any sensor exceeds the defined tolerable limits, the home automation system must light a siren to warn the homeowner" have a "for what" of vital importance, because both requirements belong to critical systems. if a requirements specification template allows defining the reason for a requirement, developers can easily understand it, because it is explicitly stated how important it is to implement that requirement with high quality levels.

omission of quantities and ranges. sometimes the requirements refer not only to a specific object but to several objects or a range of objects of the same nature. some of the analyzed templates (e.g., the rupp template) do not explicitly allow specifying ranges or quantities of objects in the requirements. as the following example shows, omitting an amount would lead to ambiguities or inaccuracies: "the point of sale subsystem must provide the pos administrator with the ability to link between one and a maximum of 10 warehouses at a point of sale."

omission of biconditionals. some requirements demand that certain behaviors be performed only if certain conditions are met; otherwise, the behavior cannot be performed. we call this a biconditional, expressing that behavior a is performed "if and only if" condition b is fulfilled, and vice versa. for example, in the requirement "the point of sale subsystem must show the boxes if and only if they are in the active state", the "show the boxes" behavior is performed only for objects that are in a certain state, not for all objects in the domain. here there is an explicit condition that the requirement must enforce through the process verb "show", using the conditional "if and only if". consider another example: "after the vacuum has been turned on, the ivaccum system should start the cleaning cycle if and only if the vacuum's battery charge is 90% or more." in this case, the behavior of the object depends on a condition on the charge of the battery. as can be seen, these types of conditions are common when specifying requirements in industrial cases; however, some of the analyzed templates (e.g., the rupp template) do not explicitly provide a way to express this kind of specification.
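to make the contrast concrete, here is a toy sketch of the point-of-sale example (our own illustration, not from the paper): under a biconditional reading, visibility is exactly the truth of the condition, whereas a plain conditional would still allow other rules to show boxes for different reasons.

```python
from dataclasses import dataclass

@dataclass
class Box:
    state: str

boxes = [Box("active"), Box("closed"), Box("active")]

# plain conditional: showing a box when it is active does not forbid
# other rules from showing it elsewhere for different reasons.
# biconditional ("if and only if"): visibility equals the truth of the
# condition, so a box is shown exactly when active and never otherwise.
visible = [box for box in boxes if box.state == "active"]
print(len(visible))  # -> 2
```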
gap in conditionals. requirements behaviors are conditioned by different factors, which imply different interpretations depending on these conditions. for example, a requirement that specifies "while the temperature control is on, the system must balance the ambient temperature" can be interpreted differently from a requirement that specifies "if the temperature control is on, the system must balance the ambient temperature", and both can be differentiated from a requirement that specifies "as soon as the temperature control is turned on, the system must balance the ambient temperature". in all three cases, although a similar condition is used, the interpretation is different. the rupp template uses only two types of conditionals: logical conditionals and temporary conditionals (rupp 2007). however, we found other types of conditions in the rest of the templates, for example, for behaviors that are triggered by events and for behaviors that take place while the system is in a certain state.

lack of verifiability of non-functional requirements. some of the templates analyzed in the first cycle were created to specify functional requirements. thus, an explicit structure for adequately writing the measurable and finite factors that define the satisfaction level of non-functional requirements and restrictions was a recurrent weakness of the templates analyzed during the first cycle. for example, the two quality requirements "the system should be available 7x24x364 for users" and "the performance of the system must be optimal, trying to respond to users in less than two seconds" carry a measurable and finite factor to determine whether the requirement satisfies the need of the interested parties, yet the analyzed templates provide no explicit space to express it.

lack of reference to external systems or devices. when the type of system activity is an interface requirement, the syntactic structure of some of the analyzed templates does not explicitly refer to external systems or devices. for example, the requirements "the point of sale subsystem must be able to read bar codes on item labels" and "the system should be able to obtain the information of a client" follow the syntactic structure proposed by the rupp template; however, neither requirement mentions the name of the system or device with which information is exchanged, nor is it established whether the information goes to or from that device or system.

lack of concepts to write domain requirements. in some cases, the requirements do not refer to one product but to several products of the same family (mazo 2018a). product lines are based on the concept of variability management to specify, design, and intensively develop the products of the same family in a prescribed manner. although some of the analyzed templates can be used to specify requirements with different priority levels, they cannot be used to specify their variability. for example, the requirements "the product line of virtual stores must calculate the vat value of each purchase" and "the product line of virtual stores could calculate the vat value of each purchase" specify two levels of priority, but the variability of the requirements is not considered. indeed, it is not stated whether the requirement holds for all products of the product line (mandatory for all the products) or only for some of them (optional).

4.2 second cycle

the requirements specification of the yuke-greenhouse case written in prose style was rewritten with two templates.
the first template used in this second cycle is the one produced in the first cycle, and the second corresponds to the relax language (whittle et al. 2009). each rewriting of the requirements of the yuke-greenhouse case with those two artifacts corresponds to a micro-cycle within the second cycle of the action research process, as presented in figure 2. at the end of these two micro-cycles, we produced the new template, which was then evaluated and improved in the subsequent two stages of the second cycle. the problems and gaps detected when working with the artifacts considered in these micro-cycles are described below and are available online (requirements specification, 2nd cycle: http://shorturl.at/cpdeo). for each type of problem, we define a descriptive name, a brief description, and an example to better understand the problem.

lack of concepts to write requirements for self-adaptive systems. self-adaptive systems have the ability to autonomously modify their behavior at runtime in response to environmental and changing system conditions. self-adaptation is particularly necessary for applications that must execute continuously, even in adverse conditions and with changing requirements (whittle et al. 2009). self-adaptive systems typically include automotive systems, telecommunication systems, environmental monitoring, and smart home systems. the main problem faced by requirements engineers is that the typical behaviors of this type of system can vary due to environmental uncertainty, caused by multiple factors such as weather, sensor failures, unexpected conditions, and the variability of data, among others.

inability to manage uncertainty. uncertainty is one of the characteristics of self-adaptive systems; therefore, this type of requirement must ensure that the system meets the needs of the stakeholders while adapting to the conditions of the environment. thus, the satisfaction of these requirements should be defined at some level on a continuous scale given by a fuzzy function (jureta et al. 2015). the mazo & jaramillo template does not consider uncertainty for self-adaptive requirements. for example, a requirement that specifies "if the ambient temperature rises above 25 degrees, then the self-adaptive system oktupus must raise the temperature level to 30°" establishes an invariant restriction (whittle et al. 2009) that makes it difficult to adapt the system to certain environment variables.

lack of specificity in temporality. self-adaptive systems use timing functions and frequencies to adapt themselves to the environment. handling these aspects is also a weakness of the mazo & jaramillo template. consider the following requirement: "the oktupus self-adaptive system must measure the temperature of the room every hour." in this case, it would be desirable to relax the requirement so that the measurement period better adapts to changing conditions. this would imply that the system would be able to measure the temperature not only every hour but also every time there is a major change in the system.

5 proposing a new requirements specification template

considering each of the problems encountered during the execution of the action research method and exemplified in section 4, we improved the rupp template (rupp 2007) and subsequently the mazo & jaramillo template (mazo and jaramillo 2019).
the mazo & jaramillo template (cf. figure 3) was created as a result of the first action research cycle and is composed of eight spaces. each space was structured to provide a simple and robust syntactic specification covering most types of requirements in several types of systems. the yellow rectangles represent conditionals; the gray rectangles represent the family of systems, the system, or a part of it; the orange rectangles represent the degree of obligation; the green rectangles are the activities characterizing the system; the blue rectangles represent the objects (nouns), with their respective quantities and complements; and the purple rectangle describes the measurable verification criterion of the requirement. the latter is optional and is therefore represented with a dotted line.

figure 3. mazo & jaramillo template.

the improvement made to the mazo & jaramillo template is inspired by concepts from other related works found in the literature, e.g., the ears template (mavin et al. 2009), which establishes a set of syntactic rules for specifying requirements through conditional clauses that trigger functional behaviors, and the relax requirements language (whittle et al. 2009), which incorporates various types of operators to address uncertainty in the behavior of a self-adaptive system. thus, the new requirements specification template proposed in this paper is presented in figure 4; it is the result of the second action research cycle, which follows the first research cycle reported at the cibse conference (mazo and jaramillo 2019). templates for user requirements specification, such as connextra for writing user stories (davies 2001), were not considered in this article because our template is oriented to the specification of system and software requirements, while user stories are oriented to the stakeholders (wiegers and beatty 2013). templates oriented to user requirements specification are therefore beyond the scope of this article. in the remainder of this section, we describe each component of the template resulting from the two action research cycles.

5.1 conditions under which a behavior occurs

some requirements do not describe continuous behaviors, but behaviors that are performed or provided only under certain conditions, for example, logical or temporary ones, as shown below.

a. requirements with logical conditions. they describe behaviors that are triggered only when a logical condition is met (rupp 2007) or when an unexpected event occurs (mavin et al. 2009). the form is: if <condition> then (all|some systems of the <product line name>)|(the <system name>) shall|should|could <behavior>. for example: if the number of products in a warehouse reaches the defined minimum limit, then the inventory subsystem should generate a product replacement alert for that warehouse.

b. requirements guided by the state. they describe behavior that must be performed while the system is in a specific state. this condition was proposed by mavin et al. (2009). the form is: while|during <state> (all|some systems of the <product line name>)|(the <system name>) shall|should|could <behavior>. for example: while the payment of an invoice from a customer has not been confirmed, the subsystem must send a daily text message to the cell phone number registered by the customer.

c. requirements with optional elements. they describe behavior that must be performed only if a particular characteristic is included (mavin et al. 2009).
the form is: in case <feature> is included (all|some systems of the <product line name>)|(the <system name>) shall|should|could <behavior>. this condition is especially useful in domain requirements, when one wants to incorporate certain requirements depending on the characteristics provided by the product line. for example: in case the text entry action is included, all systems of the test automation framework product line shall provide the tester with the ability to enter a specific text in a form field.

d. requirements with temporary conditions. they describe behavior that must occur after another behavior occurs; that is, the behaviors occur sequentially: behavior a is done after behavior b. this condition was proposed by rupp (2007). the form is: after|before|as soon as <event> (all|some systems of the <product line name>)|(the <system name>) shall|should|could <behavior>. after means that the system must have completed a running behavior before initiating another behavior. before means that the system must initiate a behavior before another behavior takes place. as soon as means that the system does not necessarily have to have finished a running behavior before initiating another behavior. for example: after reading the products for a particular location, the inventory subsystem should provide the warehouse owner with the ability to close the product count for that location.

e. requirements with complex conditions. for requirements with more complex conditional clauses, it can be necessary to add keywords such as when, while, and where. these keywords can be integrated into more complex expressions to specify richer behaviors of the system (mavin et al. 2009), as in the following example: when a cash settlement operation is performed on a cash register, while the box is temporarily closed, the point of sale subsystem should show the amount of cash that is in the box. conditional clauses can also be structured using the boolean operators and and or, combined with not (rupp 2014). for example: if a location contains products and the option to delete a location has been selected, then the inventory subsystem should display an alert message indicating that the selected location cannot be deleted.

figure 4. new template for the specification of requirements in semi-structured natural language.

the requirements guided by the state and the requirements with optional elements were taken from ears (mavin et al. 2009), and the requirements with logical and temporal conditions were taken from the rupp template (rupp 2007); a sketch of these forms as checkable patterns follows below.
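to see the conditional forms of section 5.1 as one machine-checkable grammar, the sketch below encodes them as regular expressions and reports which form, if any, a requirement sentence matches. this is our illustrative reading of the template, not tooling from the paper; the loose stand-ins for the placeholders are assumptions.

```python
import re

# loose stand-ins for the template placeholders: the system clause and
# the modal verb; the condition and behavior are matched as free text.
SYSTEM = r"(all|some) systems of the .+?|the .+?"
MODAL = r"(shall|should|could)"

CONDITIONAL_FORMS = {
    "logical condition":   rf"^if .+? then ({SYSTEM}) {MODAL} .+",
    "guided by the state": rf"^(while|during) .+?, ({SYSTEM}) {MODAL} .+",
    "optional element":    rf"^in case .+? is included,? ({SYSTEM}) {MODAL} .+",
    "temporary condition": rf"^(after|before|as soon as) .+?, ({SYSTEM}) {MODAL} .+",
}

def classify_condition(requirement: str) -> str:
    """name the first conditional form of section 5.1 the sentence matches."""
    text = requirement.strip().lower()
    for name, pattern in CONDITIONAL_FORMS.items():
        if re.match(pattern, text):
            return name
    return "no conditional form recognized"

print(classify_condition(
    "If the number of products in a warehouse reaches the defined minimum "
    "limit, then the inventory subsystem should generate a product "
    "replacement alert for that warehouse."))
# -> logical condition
```

a fuller checker would also constrain the system and behavior clauses, which the lazy wildcards here deliberately leave open.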
5.2 family of systems, systems or parts of a system

this space in the template is reserved for the name of the product line, system, subsystem, or system component. in the case of a product line requirement, it must be specified whether the requirement is valid for all or only for some systems of the line. we completed the second space of the rupp template (cf. space b in figure 1) with the possibility of specifying product line requirements, since that template was not adequately prepared for writing them in semi-structured nl. the structure of the second space of the new template is as follows: all|some systems of the <product line name>. in some cases, we must consider certain behaviors that some systems of the product line must incorporate only if certain conditions or restrictions are met; when this happens, we use the expression: those systems of the <product line name> that <condition>. some examples of product line requirements using the improved template are: in case the action of comparing text is included, those systems of the automation framework product line that only include the option to enter text shall provide the tester with the ability to configure a text for comparison with another element. if the automation framework is web-based, all systems of the test automation product line shall provide the tester with the ability to select the type of browser where the test will be run (be it chrome, firefox, or safari).

5.3 the degree of priority

in the rupp template, this space (cf. space c in figure 1) is traditionally reserved for specifying the degree of obligation of the requirement; however, we changed the "obligation" concept to the "priority" concept in order not to confuse it with the "mandatory" concept of product lines. to define the priority of the requirements, we used the moscow technique (clegg and barker 1994), in which three degrees of priority are established: essential, recommended, and desirable.

a. essential requirements. these requirements must be implemented to achieve the success of the product or the product line. the word shall is used.

b. recommended requirements. these requirements are important, but not necessary to achieve the success of the product or the product line. the word should is used.

c. desirable requirements. these requirements are desirable, but not necessary; they could improve the user experience and customer satisfaction. the word could is used.

some examples of requirements with different degrees of priority, using the improved template, are: all systems of the test automation product line shall incorporate a click action. if a motion sensor is activated, then the oktupus system should send an instant image to the home owner's email.

5.4 the activity

the fourth space, the same as in the rupp template (space d in figure 1), specifies the characterization of the activity conducted by the system or by the systems of the corresponding product line. three types of activities can be performed:

a. autonomous activity. in this kind of activity, no user is involved, which means that the (sub)system or systems initiate and execute the behavior autonomously. the form of this type of activity is: (all|some systems of the <product line name>)|(the <system name>) shall|should|could <process verb> <object>.

b. user interaction. in this activity, the (sub)system or systems provide a user with the ability to use certain behavior that is initiated or stimulated by a user (actor) interacting with the system(s). the form of this part is: (all|some systems of the <product line name>)|(the <system name>) shall|should|could provide <whom> with the ability to <process verb> <object>, where <whom> is the actor or user that should have the ability to use the functionality. the user must be correctly characterized, without incurring the undue use of nouns lacking a reference index (rupp 2007); that is, writing just "the user" would be an error leading to ambiguity in the specification.

c. interface requirement. in this activity, the system performs a behavior that depends on another entity (which can be another system or a physical device).
this space was improved in the new template by explicitly adding the name of the external entity with which the system interacts and the direction of the relationship. the form of this type of activity is: (all|some systems of the <product line name>)|(the <system name>) shall|should|could be able to <process verb> <object>. in addition, this structure is completed with the entity with which the system interacts:

● if the behavior is executed by an external system that transmits data to the receiving system interface, then the specification is complemented by adding: from <name of the external system or device>.

● if the behavior is performed by the system and interacts with or affects another system or external device, then the specification is complemented by adding: to <name of the external system or device>.

an example of an interface requirement in this case is: the point of sale subsystem shall have the ability to read a valid credit card from a branch's dataphone.

5.5 the object or objects

this space is reserved for the object or objects that make up the system. in the new template, we have incorporated the concept of range, since the objects can be affected in different ranges. the ranges in the new template are specified as follows (see the sketch after the examples below):

a. single object: one <object>
b. a specific object: the <object>
c. each object of a set: each <object>
d. multiple objects: <x> <objects>, where x is the number of objects
e. range of objects: between <a> and <b> <objects>, where a is the lower bound and b is the upper bound
f. all objects in a set: all <objects>

two examples of requirements with ranges of objects, using the improved template, are: the inventory subsystem should provide the inventory manager with the ability to associate between 1 and 3 bar code reading guns to a cash register. as soon as the daily activity cycle ends, the oktupus system must restart all the sensors connected in the home.
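the range forms of section 5.5 map naturally onto a small formatting helper. the sketch below is our own illustration (function and argument names assumed), with deliberately naive pluralization:

```python
from typing import Optional

def object_phrase(kind: str, noun: str, a: Optional[int] = None,
                  b: Optional[int] = None) -> str:
    """render the object space (section 5.5) for one range kind.

    pluralization is naively done by appending 's', which is enough
    for an illustration but not for real use.
    """
    forms = {
        "single":   f"one {noun}",
        "specific": f"the {noun}",
        "each":     f"each {noun}",
        "multiple": f"{a} {noun}s",
        "range":    f"between {a} and {b} {noun}s",
        "all":      f"all {noun}s",
    }
    return forms[kind]

print(object_phrase("range", "bar code reading gun", a=1, b=3))
# -> between 1 and 3 bar code reading guns
```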
5.6 the complementary details

in the sixth space of the template, complementary details of the object are specified. they can be one or several adjectives, as well as a richer description of the object, focusing only on the details related to the object in question, without the risk of altering the proper meaning of the requirement specification. this template space was retained from the rupp template and was not modified by the authors.

5.7 conditionality in the object

sometimes the behavior of the requirement is conditioned by the state of an object. in the new template, we have reserved the seventh space to specify a behavior that the system must carry out if and only if the object meets a certain condition. in this case, the requirement is completed by adding the following expression: if and only if <condition on the object>. it is important to clarify that this condition is optional: it is given explicitly only if the precise object of the requirement demands specifying the condition; therefore, it is not mandatory in the specification of the requirement. here are two examples of requirements with conditionality in the object, using the improved template: if any sensor exceeds the defined tolerable limits, the oktupus system should turn on a siren, if and only if the siren is activated. the inventory subsystem could provide the warehouse manager with the ability to eliminate a purchase order, if and only if the purchase order has not been dispatched.

5.8 verification criterion (adjustment) of the requirement

in some types of requirements, especially non-functional requirements, it is necessary to establish the degree to which the requirement must be met. robertson and robertson (2013) suggested including adjustment criteria, that is, including "a quantification of the requirement that demonstrates the standard that the product must reach" as part of the specification of each requirement, functional and non-functional. the adjustment criteria describe a measurable way to assess whether each requirement has been successfully met. for this purpose, in the last space of the new template, we have added the option of establishing a measurable or observable criterion to determine the degree of verifiability of the requirement. this was done to make sure that the requirement can be verified either by a person or by a machine. this criterion is defined at the discretion of the author of the requirement and is optional, although it is recommended to always use it in quality requirements. here are two examples of non-functional requirements using the improved template: if a fault causes the system to stop, the oktupus system must restart all the sensors in less than 20 seconds. the system must provide an atm with the ability to register a sale in a cash register without presenting more than 2 different screens.
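read together, the eight spaces of sections 5.1 to 5.8 amount to a record with several optional fields. the sketch below is our own rendering of that reading (the field names and the render logic are assumptions, not an artifact of the paper):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TemplateRequirement:
    condition: Optional[str]         # space 1 (section 5.1), optional
    system: str                      # space 2: product line / system / part
    priority: str                    # space 3: shall | should | could
    activity: str                    # space 4: behavior, incl. actor/entity
    objects: str                     # space 5: object(s) with range
    details: str                     # space 6: complementary details
    object_condition: Optional[str]  # space 7: "if and only if ..." clause
    criterion: Optional[str]         # space 8: measurable verification

    def render(self) -> str:
        """assemble the requirement sentence, skipping optional spaces."""
        parts = []
        if self.condition:
            parts.append(self.condition + ",")
        parts.append(f"{self.system} {self.priority} {self.activity}")
        parts.append(f"{self.objects} {self.details}".strip())
        if self.object_condition:
            parts.append(f"if and only if {self.object_condition}")
        if self.criterion:
            parts.append(self.criterion)
        return " ".join(parts) + "."

req = TemplateRequirement(
    condition="if any sensor exceeds the defined tolerable limits",
    system="the oktupus system", priority="should",
    activity="turn on", objects="one siren", details="",
    object_condition="the siren is activated", criterion=None)
print(req.render())
```

the printed sentence reproduces, space by space, the siren example of section 5.7.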
5.9 relax requirements statements for self-adaptive systems

to deal with the problems of environmental uncertainty in self-adaptive systems (see section 4.2), especially in the requirements specification phase, whittle et al. (2009) propose a language called relax. relax incorporates several types of operators to address uncertainty in the properties of the system. usually, requirements prescribe behavior using imperatives such as "must" or "should". these imperatives define the functional behavior that a system should always provide. however, for self-adaptive systems, environmental uncertainty may mean that it is not always possible to achieve all the "must" statements. therefore, it may be necessary to make concessions among "must" statements in order to make some non-critical behaviors more flexible in favor of more critical ones (whittle et al. 2009). relax proposes a simple process to explicitly identify when a requirement should remain unchanged and mandatory and when a requirement can be temporarily relaxed under certain conditions. although relax is not a specification template, in this article we have employed several operators proposed by relax to make it easier to specify requirements for self-adaptive systems. figure 4 presents the proposed structure of the improved template for the specification of self-adaptive requirements. although this part of the template is based on relax, we also incorporated some ideas inspired by the work presented in (baresi et al. 2010). in particular, we incorporated (i) fuzzy conditions that self-adaptive systems can use to measure different environment variables (souza et al. 2011); (ii) awareness requirements (or awreqs) to specify requirements about the success or failure of other requirements, which can refer to goals, tasks, quality constraints, and domain assumptions; and (iii) constraints that must be met using certain questions in a conditional manner (ibrahim et al. 2014). these concepts completed the mazo & jaramillo template, allowing the specification of requirements of self-adaptive systems. due to the autonomous nature of the requirements for this type of system, we have implemented an equivalent template for requirements in self-adaptive systems.

this space of the template is identified in red and must be written at the end of the textual specification of the requirement. for this version of the improved template, we have omitted the activity types (see section 5.4) of interaction with users (user interaction requirements) and of interaction with other systems or devices (interface requirements); we also omit the conditionals on objects and the verification criterion, which are explicit in each of the operators used to "relax" requirements. in the remainder of this subsection, we explain each part of the red space of the improved template (a small pattern-matching sketch follows the examples below).

a. (as many|as few) as possible: a requirement must maximize or minimize a certain occurrence of something or a certain amount of objects, as many or as few as possible, thus leading to adaptation.

b. before|after|during: a requirement must be met before, during, or after a particular event; usually, these three operators go after the operators as many as possible, as few as possible, as early as possible, or as late as possible.

c. (as early|as late) as possible: a requirement specifies something that must be fulfilled as early as possible or delayed as late as possible.

d. until: a requirement must be maintained until a future event.

e. within: a requirement must be maintained for a particular time interval, expressed in units of time.

f. at least: a requirement must meet a minimum frequency or duration, with no upper bound.

g. eventually: the behavior of the requirement must occur eventually; that is, it is not completely certain or fixed that the behavior will occur, but the system must be prepared for it.

h. as close as possible to: a requirement specifies something that happens repeatedly, but the frequency can be flexible (above or below the specified frequency, but as close as possible to this value); or a requirement specifies an amount (quantity), but the exact amount can be flexible (above or below the specified amount, but as close as possible to this value).

below we present some examples from a vms (self-adaptive vital monitoring system) case using an intelligent bracelet, specified with the improved template:

● the vms system must record as many steps as possible during a user's walking activity.
● the vms system will consume as few units of energy as possible during the normal operation of the intelligent bracelet.
● the vms system must send an alert to the user, who must stop, when the physical activity levels (met, a measure of physical activity according to the world health organization: https://www.who.int/dietphysicalactivity/physical_activity_intensity/en/) are as close as possible to 2.
● if a user's vital signs levels are below the user-defined values, then the vms system must send an alert to the registered emergency phone within 2 seconds.
● the vms system should check the user's average calories consumed levels eventually.
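because the relax space is purely textual, the operators listed in items a–h can be spotted with simple pattern matching. a minimal sketch follows (the operator patterns are transcribed from the list above; everything else is our own):

```python
import re

# one pattern per relax operator family; the wildcard in the first entry
# allows the interleaved object, as in "as many STEPS as possible".
RELAX_PATTERNS = {
    "as many/as few as possible":   r"as (many|few)\b.* as possible",
    "as early/as late as possible": r"as (early|late) as possible",
    "as close as possible to":      r"as close as possible to",
    "before/after/during":          r"\b(before|after|during)\b",
    "until":                        r"\buntil\b",
    "within":                       r"\bwithin\b",
    "at least":                     r"\bat least\b",
    "eventually":                   r"\beventually\b",
}

def relax_operators_in(requirement: str) -> list:
    """return the relax operator families occurring in a sentence."""
    text = requirement.lower()
    return [name for name, pat in RELAX_PATTERNS.items()
            if re.search(pat, text)]

print(relax_operators_in(
    "the vms system must record as many steps as possible "
    "during a user's walking activity."))
# -> ['as many/as few as possible', 'before/after/during']
```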
6 preliminary evaluation of the proposed template

as mentioned in section 3, two action research cycles were performed for this research. for the preliminary evaluation of the template proposed in this article, two groups were established in each action research cycle. each group was made up of two roles: a business analyst with similar experience in requirements engineering (that is, the same number of years in the company and participation in projects comparable in subject, duration, and size); and a technical requirements reviewer.

in the first action research cycle, the analyst of each group had to specify the requirements of the peopleqa system of the sqa s.a. company with the rupp template (first group) and with the mazo & jaramillo template (second group), and the technical reviewer inspected the work of each group. the requirements specification document of the peopleqa system corresponds to 46 requirements in prose style. the first group of business analysts rewrote the requirements using the rupp template, accompanied by the authors of this paper to support them in resolving doubts. through this method, 138 requirements were identified and specified. once the specification was concluded, the technical reviewer of the first group identified 48 requirements with specification problems that do not adhere to the standard proposed by rupp, meaning that 34.8% (48 of 138) of the requirements had adherence problems with the rupp template. these problems were categorized according to seven types of criteria, as shown in table 1 (for a more detailed explanation of each of these types of problems and examples of how they may appear in a requirements specification based on the rupp template, see section 4).

the second group of business analysts focused on specifying the prose-style requirements of the peopleqa system with a new template that was built and incrementally improved through four micro-cycles (each one corresponding to one of the remaining templates used in the first action research cycle, as presented in figure 2). the results of specifying the requirements of the peopleqa system with the resulting template were encouraging: it was possible to successfully specify 98% of the requirements using the template proposed in this article. only three requirements could not be fully specified with the improved template; the other 135 achieved a specification that adheres to the improved template, without any observation by the technical reviewer. the main reason some requirements still present problems with the improved template, detected in the first cycle, is restrictive behavior: when the requirement specifies what the system should not do, instead of what it should do, the specification does not adhere properly to the template. according to wiegers and beatty (2013), these are known as negative requirements; how to specify them will be part of a later investigation.

table 1. problems identified in the requirements in the first cycle.

identified problem in the first cycle | # of requirements using the rupp template | # of requirements using the proposed template
inappropriate conditionality | 6 | 0
lack of reference to external systems or devices | 4 | 0
omission of biconditionals | 7 | 0
omission of quantities and ranges | 18 | 0
lack of verifiability of non-functional requirements | 3 | 0
missing reasons | 3 | 0
others | 7 | 3

in the second cycle, for the requirements specification of the yuke-greenhouse system of the koral company (a self-adaptive system for an intelligent nursery), 12 requirements in prose style were obtained. these prose-style requirements were represented as 46 requirements within the mazo & jaramillo template. it is important to consider that, of these 46 requirements, only 22 fulfilled the conditions to be considered self-adaptation requirements.
evidently, these 22 self-adaptation requirements also had adherence problems with the rupp template, since the mazo & jaramillo template is an extension of that template. these problems are shown in table 2.

table 2. problems identified in the requirements in the second cycle.

identified problem in the second cycle | # of requirements using the rupp template | # of requirements using the proposed template
inappropriate conditionality | 2 | 0
lack of concepts to write requirements for self-adaptive systems | 11 | 0
omission of biconditionals | 1 | 0
omission of quantities and ranges | 1 | 0
lack of specificity in temporality | 1 | 0
inability to manage uncertainty | 9 | 0

the mazo & jaramillo template was improved in the second action research cycle, which led to writing three additional requirements, thus completing 25 requirements. 13 of those 25 requirements were invariant; according to whittle et al. (2009), these are requirements that are strict in compliance and cannot be made flexible. the remaining 12 requirements had behaviors that reflected factors of uncertainty and could be "relax-ed"; therefore, relax operators were applied to this type of requirement.

the improved template has been implemented in the variamos tool (mazo 2018b) in order to facilitate the writing of domain requirements, application requirements, and self-adaptation requirements (for example, for self-adaptable cyber-physical systems). variamos is available online (www.variamos.com/variamosweb), and through the requirex option it is possible to access the forms that implement the template proposed in this paper. figure 5 illustrates an excerpt of the form for specifying domain requirements. figure 6 illustrates an excerpt of the form for specifying self-adaptive requirements. figure 7 illustrates the administration panel; this interface provides the ability to execute vital actions such as creating, editing, and deleting a requirement, in addition to generating two types of reports: a general report by category and a pdf report for each requirement. figure 8 illustrates an example of a generated requirements document.

figure 5. domain requirements form.
figure 6. self-adaptive requirements form.
figure 7. administration panel.

7 threats to validity

this section aims to show to what extent the results of the experiment are valid for those in charge of writing requirements: business analysts and requirements experts. we consider three types of threats to the validity (cook and campbell 1979) of the experiment: (i) conclusion validity, (ii) internal validity, and (iii) external validity.

7.1 validity of the conclusion

concerning the statistical power of statistical tests, the action research method used in this experiment is exploratory and qualitative, not quantitative, and there is no statistical hypothesis test; therefore, the threat of low statistical power does not apply. regarding the threat to the reliability of treatment implementation, in the experiment that served to obtain an improved requirements specification template, we applied the treatments to each of the industrial cases in a homogeneous way, starting with the rupp template, using other templates and concepts found in the scientific literature, and iteratively improving the template through our analysis. however, we cannot be sure that the results would be the same if we had started from another template.
we decided to start the experimentation with the rupp template, since it is the de facto standard and we wanted all its constructs to be present in the resulting template.

7.2 internal validity

from the point of view of history, the two cycles of the experiment were carried out over six months; in this period there were no relevant environmental, social, or personal factors that affected, in one way or another, the results of the experiment. from the point of view of maturity, the authors established a specific scope for each system and stabilized the requirements specification document for each case to prevent it from changing between cycles.

figure 8. generated requirements document example.

7.3 external validity

the interaction between selection and treatment can be a threat to the validity of the experiment; to avoid results biased by a single case, we used two industrial cases and two people to execute the treatments. this experiment involved people experienced in requirements elicitation and specification. the participants were also concerned with the specification of system requirements (programmers, testers, end-users, project managers).

8 related work

apart from the rupp template, there are other template proposals for the specification of requirements in semi-structured nl. ears (easy approach to requirements syntax) (mavin et al. 2009) is one of the templates that considers several conditional patterns from which several kinds of requirements are typified, for example, ubiquitous requirements (they do not have a precondition that triggers behaviors but are always active), event-driven requirements (they describe a behavior that occurs in the system when an event is triggered), and state-driven requirements (they describe a behavior that is active while the system is in a defined state). the ears template focuses on the conditional patterns under which the requirements are presented; additionally, ears targets the aeronautical industry. unlike it, our proposal is intended to be independent of the application domain (see section 4) and aims to support the writing of functional and non-functional requirements and restrictions.

alexander & stevens (alexander and stevens 2002) propose a template for writing functional requirements from the user's perspective, since it is more natural to formulate requirements in terms of the action of a user rather than from the perspective of the system (wiegers and beatty 2013). the structure of the template is as follows: the <user> has the ability to <action>. unlike the work proposed by alexander and stevens, our approach is oriented to the specification of requirements from the perspective of the behavior of the systems and not of the needs of the interested parties.

adv-ears (advanced ears) (majumdar et al. 2011a) proposes a syntax in semi-structured language to specify functional requirements in such a way that automated support is given for the derivation of use cases and actors in use case models. unlike adv-ears, our template covers functional and non-functional requirements, while adv-ears focuses solely on functional requirements. this syntax is an advanced version of ears (mavin et al. 2009), so some elements of adv-ears could be incorporated into our work in the future.
the cesar research project, funded by the european union's artemis program, reviewed the work on the use of templates (artemis 2010), intending to extend and apply the approach to several security-critical domains, with discussions on how to formalize the approach using ontologies. in (souza et al. 2011), the awareness requirements concept is introduced. this concept concerns other requirements and the evaluation of their success or failure at runtime, and the work emphasizes the importance of monitoring requirements at runtime to provide feedback loops. this work, like the work proposed by whittle et al. (2009), in which the relax language is proposed to write self-adaptation requirements, served as a frame of reference for this article. some other articles complement our work; for example, (tjong et al. 2006) and (denger et al. 2003) present two proposals to reduce the ambiguity of requirements employing patterns and linguistic rules. although part of our work is also to reduce the ambiguity of nl requirements, our work focuses on improving requirements specification templates based on a standard template. (arora et al. 2013) and (arora et al. 2015) provide additional insight by supporting automatic compliance checking of requirements using nl processing techniques for requirements verification, something that our work does not address and that will be part of later work on the automatic verification of requirements. in addition, (arora et al. 2013) also presents a flexible template to specify requirements that can be adapted to different styles of writing requirements. other proposals, such as (souag et al. 2018), go even further by allowing the automatic generation of non-functional requirements (security in particular) in semi-structured nl thanks to the use of two ontologies: a security ontology (souag et al. 2015) and a domain-specific ontology. in contrast, our proposal is agnostic to the type of non-functional requirements; however, it could draw on this related work in the future to facilitate the writing of requirements.

9 conclusion and future works

the rupp template has been established by the ireb as the de facto standard for the specification of individual requirements; however, when representing certain requirements at an industrial level, this template is limited. for this reason, we decided to study the research question related to (i) the gaps in the template used as a standard for writing requirements and (ii) how to fill those gaps. this study is based on the experimental research method called action research. this method allowed us to consistently use the authors' own experience in the field of requirements engineering and the information available at the industrial level and in the literature (other templates and related works). as a result of this experiment, we identified the gaps in the rupp template and, based on those gaps, we propose a more robust template that, unlike others, allows representing the quasi-totality of the requirements and constraints of two industrial cases. through this research, we could observe that the reference template must be improved and that it is possible to improve it. we also found that the new template can be used in industrial cases; however, our research is still not conclusive, considering that we only experimented with two industrial cases. nevertheless, with the empirical evidence obtained so far, we have seen improvements in time and costs, and high-quality standards in most of the resulting requirements.
some aspects remain pending and require further work, for example: restarting the experiment from a different template than rupp's and comparing the resulting template with the one reported in this article; and implementing a software tool for the automatic verification of requirements based on the improved template resulting from this research. it is also necessary to study the improved template in other cases and other types of projects, such as distributed, pervasive, cyber-physical, intelligent, and data-intensive systems. additionally, natural language patterns, standardized process verbs, and models that complement the improved template could be studied and used to complement this work. in addition, we aim to strengthen the empirical evidence regarding the advantages of the improved template, performing at least two other experiments in companies of different activity sectors to obtain more conclusive observations on the ease of use, completeness, and accuracy of the proposed template in other contexts. another line of future work consists of complementing the template proposed in this article with the treatment of negative and ubiquitous requirements, among others. further rationale about the complexity introduced by the extensions is needed and will be addressed as part of future work. in particular, we will study the following questions: how do the extensions affect the usability and comprehension of the resulting templates? and are the benefits obtained with the defined extensions significant with respect to the complexity introduced?

references

alexander, ian, and richard stevens. writing better requirements. addison-wesley, 2002.

arora, chetan, mehrdad sabetzadeh, lionel briand, and frank zimmer. "automated checking of conformance to requirements templates using natural language processing." ieee transactions on software engineering (vol. 41, no. 10), 2015: 944-968.

arora, chetan, mehrdad sabetzadeh, lionel briand, frank zimmer, and raul gnaga. "automatic checking of conformance to requirement boilerplates via text chunking: an industrial case study." acm/ieee international symposium on empirical software engineering and measurement, 2013: 35-44.

artemis. "project cesar." cesar partners. rsl reference manual. cesar consortium, 1.1 edition, 2010.

baresi, luciano, liliana pasquale, and paola spoletini. "fuzzy goals for requirement-driven adaptation." 18th ieee international requirements engineering conference (re'10), 2010: 125-134.

beck, kent. "embracing change with extreme programming." ieee computer 32, 1999: 70-77.

clegg, dai, and richard barker. case method fast-track: a rad approach. addison-wesley, 1994.

cohn, mike. user stories applied for agile software development. addison-wesley, 2004.

cook, thomas d., and d. t. campbell. quasi-experimentation: design & analysis issues for field settings. houghton mifflin, 1979.

davies, rachel. format for expressing user stories. 2001.

denger, christian, daniel m. berry, and erik kamsties. "higher quality requirements specifications through natural language patterns." proceedings of the ieee international conference on software-science, technology & engineering. ieee computer society, 2003: 80-90.

ibrahim, noraini, m. n. wan kadir, and safaai deris. "documenting requirements specifications using natural language requirements boilerplates." 8th malaysian software engineering conference (mysec), 2014: 19-24.

iso/iec/ieee. "29148 systems and software engineering (life cycle processes — requirements engineering)." 2011.
"29148 systems and software engineering (life cycle processes — requirements engineering)." 2011. jureta, ivan j., alexander borgida, neil a. ernst, and john mylopoulos. "the requirements problem for adaptive systems." acm trans. management inf. syst. 5, 2015: 17:1-17:33. koshy, elizabeth, valsa koshy, and heather waterman. action research in healthcare. sage, 2010. majumdar, dipankar, sabnam sengupta, ananya kanjilal, and swapan bhattacharya. "adv-ears: a formal requirements syntax for derivation of use case models." first international conference on advances in computing and information technology, 2011: 40-48. majumdar, dipankar, sabnam sengupta, ananya kanjilal, and swapan bhattacharya. "automated requirements modeling with adv-ears." international journal of information technology convergence and services, 2011: 5767. mavin, a., p. wilkinson, a. harwood, and m. novak. "easy approach to requirements syntax (ears)." international requirements engineering conference re, 2009: 317-322. mazo, raúl. guía para la adopción industrial de líneas de productos de software. medellín: editorial eafit, 2018. mazo raúl. software product lines, from reuse to self adaptive systems. université paris 1 panthéon sorbonne, france, habilitation à diriger des recherches (hdr), octobre 2018. mazo, raúl, and carlos jaramillo. "hacia una nueva plantilla para la especificación de requisitos en lenguaje natural semi-estructurado." in the proceedings of the requirements engineering track (ret) of cibse. la habana, cuba, 2019. mazo, raúl, juan muñoz-fernández, luisa rincón, camille salinesi, and gabriel tamura. "variamos: an extensible tool for engineering (dynamic) product lines." xix international software product line conference (splc), 2015: 374-379. o'brien, rory. an overview of the methodological approach of action research. in r. richardson (ed.), theory and practice of action research. joao pessoa: universidade federal da paraíba, 2001. presenting the new requirements specification template mazo et al. 2019 pohl, klaus. requirements engineering fundamentals, principles, and techniques. springer, 2010. robertson, suzanne, and james robertson. mastering the requirements process: getting requirements right, 3rd ed. addison-wesley, 2013. rupp, c. requirements engineering and management. hanser, 2007. sophist gmbh requirements templates the blueprint of your requirement. sophist gmbh. 2014. https://www.sophist.de. souag, amina, camille salinesi, raúl mazo, and isabelle comyn-wattiau. "a security ontology for security requirements elicitation." international symposium on engineering secure software and systems (essos), 2015: 157-177. souag, amina, raúl mazo, camille salinesi, and isabelle comyn-wattiau. "using the aman-da method to generate security requirements: a case study in the maritime domain." requirements engineering (vol 23, issue 4), 2018: 557–580. souza, v.e. silva, alexei lapouchnian, william n. robinson, and john mylopoulos. "awareness requirements for adaptive systems." 6th international symposium on software engineering for adaptive and self-managing systems (seams'11), 2011: 60-69. susman, gerald i. action research: a sociotechnical systems perspective. sage, 1983. susman, gerald i., and roger d. evered. "an assessment of the scientific merits of action research." administrative science quarterly (vol. 23), 1978: 582-603. tjong, sri fatimah, nasreddine hallam, and michael hartley. 
"improving the quality of natural language requirements specifications through natural language requirements patterns." international conference on computer and information technology (cit'06), 2006: 199-205. vassev, emil. "requirements engineering for self-adaptive systems with are and knowlang." eai endorsed transactions on self-adaptive systems, 2015. whittle, jon, pete sawyer, nelly bencomo, betty h.c. cheng, and jean-michel brunel. "relax: incorporating uncertainty into the specification of self-adaptive systems." 17th ieee international requirements engineering conference (re'09), 2009: 79-88. wiegers, karl, and joy beatty. software requirements third edition. microsoft press, 2013. wieringa, roel, and ayse morali. "technical action research as a validation method in information systems design science." design science research in information systems. advances in theory and practice. desrist 2012. lecture notes in computer science. springer, 2012. 220-238. journal of software engineering research and development, 2021, 9:1, doi: 10.5753/jserd.2021.548  this work is licensed under a creative commons attribution 4.0 international license.. mining experts from source code analysis: an empirical evaluation johnatan oliveira [ federal university of minas gerais (ufmg) | johnatan.si@dcc.ufmg.br ] markos viggiato [ university of alberta | viggiato@ualberta.ca ] denis pinheiro [ federal university of minas gerais (ufmg) | denisppinheiro@gmail.com ] eduardo figueiredo [ federal university of minas gerais (ufmg) | figueiredo@dcc.ufmg.br ] abstract modern software development increasingly depends on third­party libraries to boost productivity and quality. this development is complex and requires specialists with knowledge in several technologies, such as the nowadays libraries. such complexity turns it extremely challenging to deliver quality software, given the pressure. for this purpose, it is necessary to identify and hire qualified developers, to obtain a good team, both in open source and proprietary systems. for these reasons, enterprise and open source projects try to build teams composed of highly skilled developers in specific libraries. however, their identification may not be trivial. despite this fact, we still lack procedures to assess developers skills in widely popular libraries. in this paper, we first argue that source code activities can identify software developers’ hard skills, such as library expertise. we then evaluate a mining­ based strategy to reduce the search space to identify library experts. to achieve our goal, we selected the 9 most popular java libraries and 6 libraries for microservices (i.e., 15 libraries in total). we assessed the skills of more than 1.5 million developers in these libraries by analyzing their commits in more than 17 k java projects on github. we evaluated the results by applying two surveys with 158 developers. first, with 137 library expert candidates, they observed 63% precision for popular java libraries’ used strategy. second, we observe a precision of at least 71% for 21 library experts in microservices. these low precision values suggest space for further improvements in the evaluated strategy. keywords: library experts, software skills, expert identification, mining software repositories. 1 introduction software development has become increasingly complex, both in open­source and proprietary systems (damasiotis et al., 2017). 
such complexity makes it extremely challenging to deliver quality software on time and may hinder developers' participation in worldwide repositories of source code, such as github (viggiato et al., 2019). to contribute to open-source projects or to hire developers (in the case of a company), identifying the developer with the right skills for a good team is a hard task (garcia et al., 2007; mcculler, 2012). besides, in many cases, project managers must build teams of developers skilled in relevant libraries. decisions made during the hiring process are a well-known decisive factor in the success of a software project (tsui et al., 2016). providing a more reliable way of identifying developers' skills can help project managers make the right decision when hiring or attracting the right developers for an open-source project. the task of finding experts in specific technologies is especially complex, despite the existence of business-oriented social networks, such as linkedin, where developers write about their attributes and qualifications. this type of platform is commonly used for the online recruitment of professionals. however, the reliability and accuracy of the information provided in such media are not guaranteed (brown and vaughn, 2011). for instance, some individuals can overvalue their skills or omit some skills in a self-authored curriculum.

the most commonly used strategies to find experts have their limitations (tsui et al., 2016; constantinou and kapitsaki, 2016). for instance, the analysis of a curriculum from linkedin or in paper format can omit desirable skills. besides, developers may have difficulty expressing their qualifications (tsui et al., 2016). sometimes, the developer has a specific ability but considers it irrelevant. in another situation, the developer cites many skills but does not have expertise in the technologies mentioned (constantinou and kapitsaki, 2016). even large companies may rely on curriculum analysis, and this type of research may have inaccurate or outdated information. besides, even talent recruiters may incorrectly identify the developer's skills or identify other skills that are not the organization's focus. hiring lowly skilled software developers can lead to additional costs, effort, and resources for training them, or to expending more time and resources hiring others (constantinou and kapitsaki, 2016; sommerville, 2015). however, these costs can be reduced if companies identify with more precision the best developers for a job opening.

several software developers have used social coding platforms, such as github and bitbucket, to showcase their work, hoping that this may help them be hired for a better job. developers use these social coding platforms to demonstrate their skills and create an online profile about their projects (constantinou and kapitsaki, 2016). some contributors are even using these platforms' social aspects to infer project popularity trends and promote themselves more efficiently through specific projects and collaborations in other open-source projects (?). in some cases, profiles derived from accounts on social platforms, such as github, are considered even more reliable than a curriculum from linkedin, concerning the technical qualifications of a job candidate (constantinou and kapitsaki, 2016).
therefore, data exploitation from coding platforms is a promising way for potential employers to identify and assess several candidates in real situations (capiluppi et al., 2013).

github has been widely used in several works mainly because it provides several user-based summary statistics, such as the number of contributions in the last year, the number of forked projects, and the number of followers. for instance, some works have used this platform to identify appropriate developers for cross-project bugs (ma et al., 2017), to identify reuse opportunities (oliveira et al., 2016), and to study collaborations between projects (dabbish et al., 2012). different approaches have been used to investigate the skills of developers on github (saxena and pedanekar, 2017; mockus and herbsleb, 2002; greene and fischer, 2016). for instance, prior work conducted interviews with members of github to understand the hiring process (marlow and dabbish, 2013). we did not compare our results with other approaches because our strategy is very different from the others; therefore, our strategy complements related work by automatically reducing the search space to support the identification of library experts.

this paper is an extension of our previous work (oliveira et al., 2019), which proposed and evaluated a strategy to identify library experts from source code, named jexpert. our main goal is to reduce the search space to identify library experts. compared to the original paper, this submission adds the following new contributions.

1. we present and analyze data of all identified expert candidates by means of new boxplot charts.
2. we include a novel classification and discussion of experts in four categories.
3. we include an additional analysis of the library experts by proposing a novel heuristic to rank the top experts of each library.
4. we perform a new identification of experts in microservices libraries.
5. we conduct an additional survey to calculate the strategy's precision in identifying experts in microservices.
6. we include an additional discussion about the negative results of the evaluated metrics.

in this paper, we evaluate the feasibility of identifying software developers' hard skills, that is, library expertise, from source code analysis. we rely on github data to support the identification of developers' skills based on their contributions. from each type of developer contribution, we aim to identify essential developer skills and to evaluate the applicability and precision of the strategy. in the applicability evaluation, we performed a mining study with the top-9 most popular java libraries on github, aiming to identify experts in these libraries. in total, we analyzed more than 16 thousand projects and 1.5 million developers. in the precision evaluation, we designed and sent a survey to more than 1 thousand developers identified for these libraries. we received answers from 158 developers. as a result, we observe that it is possible to reduce the search space to identify experts from source code. we also note that the strategy provides meaningful information to recruiters, such as the history of written lines of code (loc) for each library. these details about the developers can improve the selection of candidates.
our key contributions are threefold:

• we empirically evaluate the applicability and precision of identifying library experts from source code analysis. in addition, we propose a tool to support the strategy;
• we identify 1,045 experts in the top-9 java libraries with a precision of about 63%;
• we identify 136 experts from microservices libraries with a precision of about 71%. low precision values indicate space for future research on this subject.

the remainder of this paper is organized as follows. in section 2, we describe our analysis by detailing the strategy to identify library experts, the dataset, and our research questions. section 3 presents the results of the applicability evaluation to identify library experts. section 4 shows the results of the survey with top-9 library experts. section 5 shows the results concerning a survey with library experts in microservices. section 6 gives details about a tool developed to support the strategy. section 7 presents and discusses threats to validity. related work is discussed in section 8. finally, section 9 discusses the concluding remarks and future work.

2 study settings

this section describes the protocol to evaluate the identification of library experts through an empirical study. section 2.1 presents the aims of our study and the research questions we address. section 2.2 shows the steps performed to evaluate the expert candidates. section 2.3 describes the dataset used.

2.1 goal and research questions

this study's primary goal is to evaluate the applicability and precision of a strategy to reduce the search space to identify library experts from source code analysis using software repositories. we are interested in whether the strategy can significantly reduce the search space to identify experts in a specific library. we are also concerned with assessing the relevance of the results provided by the strategy. for this purpose, we selected the 10 most popular and common java libraries among github developers. we also selected 6 popular libraries for microservices. one library was later excluded (section 2.3); therefore, we evaluate the strategy with the 9 most popular java libraries and 6 libraries of microservices.

to achieve this goal, we use the goal-question-metric (gqm) method to select measurements of source code. the gqm method proposes a top-down approach to defining measurement: goals lead to questions that are then answered with metrics (basili et al., 1994). table 1 shows the gqm with the research questions and metrics investigated in this study. as mentioned, the goal of this paper is to reduce the search space to identify library experts from source code. therefore, from this goal, we check whether it is feasible to analyze the source code to identify library experts. through rq1, we are interested in investigating the efficiency of the number of commits (metric) to indicate the level of activity of a developer in a specific library. in other words, we aim to analyze the number of commits involving a specific library performed by a developer to compute their activity level in the library. with rq2, we aim at assessing the knowledge extension based on the number of imports of a specific library: from all imports written by a developer in the source code, we investigate how many are related to the particular library. finally, the last research question (rq3) analyzes the knowledge intensity of the developers from the number of loc related to the library (metric).
in this last question, we aim to evaluate the amount of loc implemented by a developer using a specific library. for this purpose, we evaluate the ratio between the total loc and the loc related to a particular library.

table 1. the metrics analyzed following the gqm method

question | metric
rq1 – how to evaluate the level of activity of a developer in a library? | number of commits
rq2 – how to evaluate the knowledge extension of a developer in a library? | number of imports
rq3 – how to evaluate the knowledge intensity of a developer in a library? | lines of code

2.2 evaluation steps

this section describes the steps to evaluate the identification of library experts from source code. to answer the research questions presented in section 2.1, we designed a mixed-method study composed of four steps: 1) library selection, 2) dataset collection, 3) expert identification, and 4) survey application. figure 1 presents the steps of our research, which are discussed next. for library selection (section 2.3), we selected the top-10 most popular libraries in the java programming language to identify library experts. we also selected 6 libraries for microservices to favor external validity. in the dataset collection step (section 2.3), we cloned the projects that contain these libraries from github. for the identification of library experts (section 3.1), we compute the skills of developers based on three metrics: number of commits, number of imports, and lines of code. these metrics are presented in section 3.1. finally, we performed two survey studies, conducted to evaluate the precision of the strategy according to the responses of developers. sections 4.1 and 5.1 present details about the surveys.

figure 1. study steps

2.3 dataset

to create our dataset, we selected the 10 most popular and common java libraries among github developers: hibernate, selenium, hadoop, spark, struts, gwt, vaadin, primefaces, apache wicket, and javaserver faces. this selection was based on a survey provided by stack overflow in 2018 (https://insights.stackoverflow.com/survey/2018#most-popular-technologies) with answers from over 100,000 developers around the world. table 2 summarizes the definitions of each library (top-10). all definitions of the libraries were retrieved from stack overflow and their web pages. we selected java because it is one of the most popular programming languages (https://spectrum.ieee.org/static/interactive-the-top-programming-languages-2018) and there are many java projects available on github.

microservices have become increasingly popular in recent years, together with the spread of devops practices (pahl, 2015). we can see a significant increase in the use of the microservices architectural style since 2014 (klock et al., 2017), which can be verified in the service-oriented software industry, where the usage of microservices has been far superior compared to other software architecture models (alshuqayran et al., 2016). furthermore, a microservice usually runs in its own process and communicates using standardized interfaces. in practice, microservices are widely used by large web companies, such as netflix and amazon (alshuqayran et al., 2016). for these reasons, we aim to identify experts in 6 microservices libraries: apache karaf, apache spark, javaee, netflix, spring boot, and swagger. table 3 summarizes the definitions of each of these libraries; the definitions were retrieved from stack overflow and their web pages.

figure 2 illustrates the criteria for defining our dataset. to achieve more realistic results for software development, we apply the following exclusion criteria.
(1) we excluded systems with less than 1 kloc because we considered them toy examples or early-stage software projects. (2) we removed projects with no commit in the last 3 years because the developers may forget their code (krüger et al., 2018). (3) finally, we removed projects which did not contain imports related to the selected libraries. besides, we excluded all official projects of these libraries because we assume all library project developers are experts in the corresponding library. among the popular java libraries, we also removed libraries with less than 100 projects (e.g., javaserver faces), since we need a representative number of projects to evaluate our strategy. we analyze only files with the .java extension. the same process was applied to the projects of the microservices libraries. therefore, we end up analyzing 15 libraries in this study.

figure 2. steps for collecting software projects from github

table 4 shows the number of remaining projects after each step of our filtering process. the first part of the table shows the results for the top-10 java libraries, and the second part shows the results for the microservices libraries. the column #projects presents the number of projects initially selected. next, the column filtered shows the number of projects removed through the filtering step. finally, the column remained presents the number of projects analyzed for each library.

table 2. library descriptions

library | description
hibernate | a library for object-relational mapping to object-oriented models.
selenium | a test suite specifically for automating web applications.
hadoop | a library that facilitates the use of a network of many computers to solve problems involving massive amounts of data (tong et al., 2016; ye, 2017).
spark | a general-purpose distributed computing engine used for processing and analyzing large amounts of data.
struts | it helps in developing web-based applications.
gwt | it allows web developers to develop and maintain complex javascript front-end applications in java.
vaadin | it includes a set of web components, a java web library, and a set of tools and application starters. it also allows the implementation of html5 web user interfaces using java.
primefaces | a library for javaserver faces featuring over 100 components.
apache wicket | a library for creating reusable components that offers an object-oriented methodology for web development while requiring only java and html.
javaserver faces | a java view library running on the server machine which allows writing template text in client-side languages (like html, css, and javascript).

3 applicability evaluation

in this section, we describe how we evaluated the strategy in terms of its applicability, focusing on the top-9 java libraries. section 3.1 presents the steps to identify library experts, for example, the metrics and the data about classes. section 3.2 shows an overview of our data. section 3.3 presents the top-10 experts in each library selected in this study.

3.1 identification of library experts

to evaluate the strategy in terms of its applicability, we perform three steps in this study, described as follows.

step 1: extract data from source code – in this step, we obtain data from the classes created by developers in a git repository.
all data, such as added or removed loc, written imports, commits, dates, emails, and developers' names, are stored locally.

table 3. microservices library descriptions

library | description
javaee | the javaee platform is built on top of the java se platform. it provides an api and runtime environment for developing microservices and running large-scale, multi-tiered, scalable, reliable, and secure network applications.
spring boot | pivotal's solution for implementing cloud-based microservices using the well-known spring framework.
netflix | netflix oss is a set of frameworks and libraries that netflix wrote to implement microservices in distributed systems.
swagger | swagger is used to create documentation for each microservice.
karaf | an apache project referenced to support microservice implementations.
spark | a lightweight web framework that has been used to implement simple and expressive microservices.

table 4. projects selected for analysis

library | #projects | filtered | remained
hibernate | 31,134 | 26,020 | 5,114
selenium | 19,062 | 17,648 | 1,414
hadoop | 11,715 | 10,778 | 937
spark | 9,144 | 7,650 | 1,494
struts | 4,741 | 4,127 | 614
gwt | 4,086 | 2,635 | 1,451
vaadin | 3,240 | 2,625 | 615
primefaces | 1,881 | 1,401 | 480
apache wicket | 1,095 | 896 | 199
javaserver faces | 120 | 120 | –
total | 86,218 | 73,900 | 12,318
microservices:
apache karaf | 264 | 155 | 109
apache spark | 243 | 120 | 123
javaee | 321 | 190 | 131
netflix | 653 | 240 | 413
springboot | 393 | 246 | 147
swagger | 357 | 239 | 118
total | 2,231 | 1,190 | 1,041

step 2: search for imports – from the previous step, we search for specific imports related to the chosen library. the idea is to explore all files that import the name of the target library. this step is performed as follows. first, the strategy gets the files of all commits, including, for example, changes to loc in general, comments, and, mainly, the file header. second, it analyzes the header of the java files, which contains the package name, all imports needed by the class, and the class names. consequently, we detect imports through a regular expression of the form import + <target library>, for example, "import org.apache.spark"; in this example, the target library is spark. figure 3 shows an example of a file with committer data. as we can observe in figure 3, there are three attributes in this file: (1) the hash code of the commit, (2) the name of the developer, and (3) the committed source code. at the beginning of the file, there are the package name and many imports. in this part, our strategy uses a regular expression to detect whether a line contains the library we investigate. if the line contains the target library, we record the hash of the commit, the number of imports of the specific library, and the total number of imports unrelated to the target library.

figure 3. file example with commits of three developers

step 3: calculate skills – in this last step, we compute the skills of each developer. we rely on three metrics to identify library experts. each metric is calculated with respect to the commits involving a specific library; that is, whenever a commit using a library is identified, the metrics are calculated. in the following, we explain the 3 proposed metrics.

number of commits. this metric calculates the activity of each developer through the number of commits using a particular library. through this metric, we believe it is possible to measure how much a specific developer works with the library in a project.

number of imports. this metric captures the extension of knowledge in the library.
for this metric, we count all imports of the library written by a developer. repeated imports are included: if a developer wrote two identical imports, we would count 2 imports of the target library. figure 3 shows an example of repeated imports; there are four imports of apache hadoop in that figure, so we compute 4 imports for this library. likewise, if a developer wrote the following 3 identical imports, we would compute 3 imports:

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.LongWritable;

lines of code. to compute this metric, we developed a heuristic to count the amount of loc related to a specific library. first, we obtain the ratio of the changed loc to the number of all imports in the file. then, we multiply this ratio by the number of imports related to the library. our heuristic considers 3 attributes: the number of library imports, the number of imports in general, and the number of loc altered by a commit related to the library. the heuristic is computed as follows:

loc = (# of loc altered by commit / # of all imports) x # of library imports

from figure 3, it is possible to compute an example of this metric: a developer made a commit with hash code 75b70c and an import of "import org.apache.hadoop.io.IntWritable;" (line 2). computing the metric as presented above, we consider 10.67 loc related to the hadoop library.

3.2 overview of dataset

from the dataset projects, we computed all commits involving the libraries evaluated in this study and identified 1.5 million different developers who made commits. figure 4 shows the number of developers for the top-9 popular java libraries. the library with the most committing developers was selenium, with 811,884 developers. in contrast, apache wicket was the library with the fewest developers: 5,440. it is important to say that these developers made at least one commit to the respective library; however, we cannot consider them all experts, since a single use of a library may not indicate high expertise.

figure 4. number of developers by library

figures 5, 6, and 7 show an overview of the metrics computed for our dataset of popular java libraries. figure 5 presents the results for the number of commits per library, figure 6 presents an overview of the metric number of imports per library, and figure 7 shows the results for the metric lines of code per library. in general, loc (figure 7) was the metric that presented the most variation in our dataset. for instance, gwt has developers who wrote more than 130 kloc. similarly, for hibernate, it is possible to see an outlier developer who wrote more than 500 kloc. in contrast, some developers wrote less than 10 lines of code, for example, for the library primefaces.

figure 5. number of commits per library
figure 6. number of imports per library
figure 7. number of loc per library
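to make the three metrics above concrete, the following sketch shows one possible implementation of the import detection and of the loc heuristic described in section 3.1. it is a minimal illustration written for this text: the class and method names are hypothetical, and the code is not taken from the jexpert sources.

    import java.util.regex.Pattern;

    // minimal sketch of the per-commit scoring described in section 3.1;
    // names are illustrative, not taken from the jexpert sources.
    public class CommitScorer {

        // matches lines such as "import org.apache.spark.sql.SparkSession;"
        private final Pattern libraryImport;

        public CommitScorer(String libraryPackage) {
            // "import" followed by the target library package, e.g. "org.apache.spark"
            this.libraryImport = Pattern.compile("^\\s*import\\s+" + Pattern.quote(libraryPackage));
        }

        // counts imports of the target library in a file; repeated imports count,
        // as stated above for the number of imports metric.
        public int countLibraryImports(String[] fileLines) {
            int count = 0;
            for (String line : fileLines) {
                if (libraryImport.matcher(line).find()) {
                    count++;
                }
            }
            return count;
        }

        // loc heuristic of section 3.1:
        // loc = (# of loc altered by commit / # of all imports) x # of library imports
        public double libraryLoc(int locAlteredByCommit, int allImports, int libraryImports) {
            if (allImports == 0) {
                return 0.0; // no imports at all: nothing attributable to the library
            }
            return ((double) locAlteredByCommit / allImports) * libraryImports;
        }
    }

with made-up inputs, for instance, new CommitScorer("org.apache.hadoop").libraryLoc(32, 3, 1) returns 32/3 x 1 ≈ 10.67 loc, the same kind of value as the 10.67 loc computed for commit 75b70c above (whose actual input numbers come from figure 3 and are not reproduced here).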
3.3 top library experts selection

in this section, we present the applicability evaluation results to verify the feasibility of library expert identification, focusing on the top-9 popular java libraries. we analyzed 16,703 software systems mined from github and 9 libraries: hibernate, selenium, hadoop, spark, struts, gwt, vaadin, primefaces, and apache wicket. besides, we analyzed data from more than 1.5 million developers who have contributed to these projects in our dataset.

table 5 presents the results of top library experts. to obtain these results, we aimed to select the top-10 developers, but in some cases that was not possible; for instance, we obtained only the top-3 developers for the library spark. besides, we consider a developer a library expert only if this developer obtains high values in at least two metrics, for example, loc & # of commits or # of imports & loc. these developers are identified with respect to their contribution. for this, we calculate the 90th percentile of each metric and then filter out the developers with any metric below this threshold. this type of classification is common in other studies (joblin et al., 2017; ferreira et al., 2019). finally, we sort developers by loc (# of library loc). the filtering threshold was applied to remove potential false positives (i.e., developers with a high # of library loc but a low # of commits). in some cases, it resulted in fewer than 10 experts for some libraries, such as primefaces (8), spark (3), struts (6), and wicket (5).

in table 5, each developer is identified by a prefix of the library name followed by a sequence number (e.g., had (1) means the first expert developer of hadoop). the column # of library imports refers to the metric number of imports written by the developer: it counts the imports related to the specific library evaluated in this study. the column # of all imports shows the number of imports written by the developer in general; when a developer wrote an import of a specific library evaluated in this study, they also wrote imports of other libraries that were not evaluated, so this column counts all imports in relevant commits made by the developer. the column # of commits shows the results for the number of commits metric, which indicates the number of commits made by a specific developer.
table 5. top library experts

id | # of library imports | # of all imports | # of commits | # of loc altered by commit | # of library loc
gwt(1) | 1,693 | 6,836 | 49 | 637,724 | 157,938
gwt(2) | 5,108 | 5,951 | 386 | 87,303 | 74,935
gwt(3) | 4,019 | 5,451 | 452 | 75,700 | 55,813
gwt(4) | 1,677 | 1,880 | 31 | 56,535 | 50,430
gwt(5) | 2,497 | 3,714 | 74 | 54,865 | 36,886
gwt(6) | 1,564 | 6,226 | 66 | 135,574 | 34,056
gwt(7) | 2,657 | 6,167 | 71 | 71,767 | 30,920
gwt(8) | 1,732 | 1,956 | 141 | 33,272 | 29,461
gwt(9) | 2,249 | 2,558 | 105 | 31,124 | 27,364
gwt(10) | 1,432 | 3,791 | 56 | 71,264 | 26,919
had(1) | 15,739 | 32,391 | 172 | 488,882 | 237,550
had(2) | 2,083 | 3,378 | 14 | 46,215 | 28,497
had(3) | 1,024 | 27,277 | 31 | 476,220 | 17,877
had(4) | 1,303 | 2,628 | 146 | 31,440 | 15,588
had(5) | 932 | 1,518 | 93 | 16,086 | 9,876
had(6) | 625 | 1,329 | 52 | 16,788 | 7,895
had(7) | 569 | 1,843 | 55 | 19,899 | 6,143
had(8) | 242 | 599 | 18 | 13,051 | 5,272
had(9) | 493 | 617 | 18 | 6,110 | 4,882
had(10) | 322 | 973 | 12 | 11,842 | 3,918
hib(1) | 3,401 | 5,211 | 155 | 78,781 | 51,417
hib(2) | 1,719 | 2,923 | 169 | 25,963 | 15,268
hib(3) | 180 | 432 | 25 | 24,552 | 10,230
hib(4) | 552 | 1,182 | 15 | 13,612 | 6,356
hib(5) | 552 | 791 | 44 | 7,939 | 5,540
hib(6) | 535 | 756 | 51 | 5,684 | 4,022
hib(7) | 509 | 1,281 | 10 | 9,250 | 3,675
hib(8) | 458 | 898 | 50 | 7,060 | 3,600
hib(9) | 202 | 395 | 17 | 6,880 | 3,518
hib(10) | 233 | 387 | 15 | 4,617 | 2,779
pri(1) | 239 | 16,194 | 6 | 245,319 | 3,620
pri(2) | 177 | 1,286 | 6 | 11,232 | 1,545
pri(3) | 72 | 282 | 15 | 3,500 | 893
pri(4) | 37 | 144 | 12 | 2,014 | 517
pri(5) | 38 | 545 | 10 | 6,374 | 444
pri(6) | 28 | 168 | 6 | 2,538 | 423
pri(7) | 28 | 142 | 6 | 1,904 | 375
pri(8) | 27 | 102 | 10 | 1,374 | 363
sel(1) | 614 | 820 | 61 | 8,757 | 6,557
sel(2) | 1,178 | 1,763 | 116 | 9,606 | 6,418
sel(3) | 707 | 3,166 | 27 | 27,808 | 6,209
sel(4) | 287 | 1,436 | 49 | 28,355 | 5,667
sel(5) | 780 | 1,141 | 93 | 7,245 | 4,952
sel(6) | 491 | 2,229 | 73 | 22,302 | 4,912
sel(7) | 242 | 486 | 18 | 9,513 | 4,736
sel(8) | 324 | 1,027 | 27 | 14,084 | 4,443
sel(9) | 394 | 1,095 | 16 | 12,096 | 4,352
sel(10) | 178 | 417 | 16 | 9,685 | 4,134
spa(1) | 757 | 2,208 | 36 | 22,903 | 7,852
spa(2) | 280 | 1,253 | 29 | 17,940 | 4,008
spa(3) | 446 | 834 | 38 | 7,344 | 3,927
str(1) | 670 | 3,286 | 3 | 64,468 | 13,144
str(2) | 531 | 2,432 | 2 | 24,448 | 5,337
str(3) | 616 | 2,771 | 2 | 23,419 | 5,206
str(4) | 175 | 793 | 9 | 14,753 | 3,255
str(5) | 133 | 1,076 | 9 | 21,477 | 2,654
str(6) | 278 | 818 | 6 | 7,357 | 2,500
vaa(1) | 3,541 | 5,960 | 100 | 95,786 | 56,909
vaa(2) | 561 | 761 | 46 | 21,537 | 15,876
vaa(3) | 1,265 | 2,102 | 203 | 21,973 | 13,223
vaa(4) | 684 | 4,208 | 74 | 59,710 | 9,705
vaa(5) | 816 | 1,178 | 102 | 12,557 | 8,698
vaa(6) | 510 | 656 | 31 | 8,913 | 6,929
vaa(7) | 451 | 628 | 28 | 9,169 | 6,584
vaa(8) | 740 | 1,432 | 30 | 11,746 | 6,069
vaa(9) | 358 | 375 | 28 | 6,223 | 5,940
vaa(10) | 334 | 495 | 59 | 8,695 | 5,866
wic(1) | 1,428 | 1,727 | 191 | 16,991 | 14,049
wic(2) | 1,017 | 1,212 | 55 | 14,255 | 11,961
wic(3) | 494 | 543 | 56 | 10,104 | 9,192
wic(4) | 403 | 451 | 49 | 9,549 | 8,532
wic(5) | 476 | 651 | 34 | 8,439 | 6,170

the column # of loc altered by commit presents the loc changed by a developer when s/he made a commit related to the library (i.e., identified by a specific library import). finally, the last column, # of library loc, shows the results for the metric loc written by the developer related to the library, based on our heuristic.

in this paper, the developers can be classified into hard/soft committers and hard/soft coders, depending on the metric values. we consider a developer a hard committer when their value is equal to or above the 75% mark, that is, we use the 3rd quartile as a parameter. hard committers are developers who made several commits (# of commits) related to the libraries that are the subject of this study. for example, suppose developer maike made 10k commits related to library y and developer anna made 1k commits related to library y. in this context, developer maike is a hard committer in relation to developer anna.
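the 90th-percentile filtering described earlier, as well as this quartile-based hard/soft classification, are simple percentile cuts over the three metrics. the sketch below illustrates our reading of the selection rule; the record type, the nearest-rank percentile definition, and all names are assumptions made for illustration, not jexpert code.

    import java.util.Comparator;
    import java.util.List;
    import java.util.stream.Collectors;

    // illustrative sketch of the expert filtering of section 3.3; not the actual tool.
    public class ExpertSelector {

        public record DeveloperStats(String id, int commits, int libraryImports, double libraryLoc) {}

        // p-th percentile of an ascending array (nearest-rank definition).
        static double percentile(double[] sortedValues, double p) {
            if (sortedValues.length == 0) {
                return 0.0; // no data, no threshold
            }
            int index = (int) Math.ceil(p / 100.0 * sortedValues.length) - 1;
            return sortedValues[Math.max(0, index)];
        }

        // keeps developers at or above the 90th percentile in all three metrics,
        // then sorts the survivors by library loc, highest first.
        public static List<DeveloperStats> topExperts(List<DeveloperStats> devs) {
            double[] commits = devs.stream().mapToDouble(DeveloperStats::commits).sorted().toArray();
            double[] imports = devs.stream().mapToDouble(DeveloperStats::libraryImports).sorted().toArray();
            double[] locs = devs.stream().mapToDouble(DeveloperStats::libraryLoc).sorted().toArray();
            double tCommits = percentile(commits, 90);
            double tImports = percentile(imports, 90);
            double tLoc = percentile(locs, 90);
            return devs.stream()
                    .filter(d -> d.commits() >= tCommits
                            && d.libraryImports() >= tImports
                            && d.libraryLoc() >= tLoc)
                    .sorted(Comparator.comparingDouble(DeveloperStats::libraryLoc).reversed())
                    .collect(Collectors.toList());
        }
    }

the hard/soft classification that continues below would reuse the same percentile helper with the cut points 75 (3rd quartile) and 25 (1st quartile) instead of 90.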
similarly, hard coders are developers who wrote many lines of code related to the library (# of library loc). for instance, suppose developer mary wrote 8k loc in commits to library y and developer john wrote 1k loc in commits to the same library. then developer mary is considered a hard coder in relation to developer john. nevertheless, a developer can be both a hard committer and a hard coder if s/he has a high number of commits and loc related to the library. on the other hand, we classify a developer as soft using the same strategy used to classify hard developers, but with values below the 25% mark, i.e., the 1st quartile, as a parameter. we discuss the reasoning behind this classification below.

hard committers and hard coders. according to our metrics, developer gwt (3) is a hard committer and a hard coder (see table 5). this developer made more than 450 commits and wrote more than 55 kloc for this library. it can be noted that other developers are even harder committers and coders; for instance, developer had (1) made 172 commits and wrote more than 237 kloc. these are some examples of hard committers and hard coders according to the calculated metrics.

hard committers and soft coders. we now present the results for hard committers and soft coders. developers had (1) and had (4) in table 5 can be considered hard committers because they made 172 and 146 commits, respectively; the difference between them is only 26 commits. however, developer had (4) is considered a soft coder with respect to developer had (1): had (1) wrote more than 237 kloc while had (4) wrote about 15 kloc, that is, had (4) wrote only about 6% of the loc of had (1). therefore, had (4) is a hard committer and a soft coder.

soft committers and hard coders. concerning soft committers and hard coders, we can observe that developers pri (1), pri (2), sel (1), and str (1) in table 5 are soft committers because they made only a few commits. developer str (1), for instance, made only 3 commits, but s/he wrote more than 13 kloc. therefore, this developer is considered a soft committer and a hard coder.

soft committers and soft coders. as the name suggests, this category includes the developers who made fewer commits and wrote fewer lines of code compared to their peers. for instance, developers hib (9), hib (10), sel (9), and sel (10) are considered soft committers because they made fewer than 20 commits to the cited libraries. besides, these developers wrote less than 5 kloc. therefore, according to our metrics, these developers are considered soft committers and soft coders.

4 survey with top library experts

this section describes the survey applied to github developers to evaluate the strategy with respect to the top-9 popular java libraries. section 4.1 presents the details of the survey design. section 4.2 presents a summary of some relevant findings. section 4.3 presents the results for rq1 regarding the number of commits metric. section 4.4 presents the results for rq2 about the number of imports metric. section 4.5 presents the results for rq3 regarding the loc metric.

4.1 survey design

according to easterbrook et al. (2008), survey studies are used to identify the characteristics of a population and are usually associated with the application of questionnaires. besides, surveys are meant to collect data to describe, compare, or explain knowledge (pfleeger and kitchenham, 2001).
we selected the library experts with the best values in the evaluated metrics to validate them through a survey. we designed and applied a survey with the top developers identified by our strategy; we selected developers with the top-20% highest values in at least two (out of three) metrics. we created a questionnaire on google forms (https://www.google.com/forms/) with two parts: the first was composed of 5 questions about the background of the expert candidates; the second part had 5 questions about the knowledge of the expert candidates regarding the evaluated libraries. in table 6, the tag <library> stands for a specific library, for instance, hadoop. the table also shows the possible answers to the survey questions.

table 6. survey questions on the use of the libraries

id | question | answers
sq1 | how do you assess your knowledge in <library>? | ( ) 1 ( ) 2 ( ) 3 ( ) 4 ( ) 5
sq2 | how many projects have you worked on with <library>? | ( ) 1 to 5 ( ) 6 to 10 ( ) 11 to 20 ( ) more than 20 projects
sq3 | how many packages of <library> have you used? | ( ) a few ( ) a lot
sq4 | how often do your commits include <library>? | ( ) a few ( ) a lot
sq5 | how much of your code is related to <library>? | ( ) few of my code is related to <library> ( ) my code is partially related to <library> ( ) most of my code contains <library>

to obtain the email used by each developer to perform the commits in the source code, we used the git-blame tool (https://git-scm.com/docs/git-blame). the emails were collected to send the survey. we sent an email to the developers asking them to assess their knowledge of each library. for instance, the developers were invited to rank their knowledge (table 6, sq1) using a scale from 1 (one) to 5 (five), where (1) means no knowledge about the library and (5) means extensive knowledge about the library. questions are not mandatory because they may require knowledge of exceptional features of the library; therefore, participants are not forced to provide an answer when they do not remember a specific library element, such as the time of development using the library or the approximate frequency of commits that contain the library. the survey remained open for 15 days in january 2019.

in summary, we present the precision evaluation results based on a survey with expert candidates in each of the top-9 popular java libraries. the goal of this evaluation is to verify the precision of the library expert identification. we empirically selected 1,045 developers among the top-20% values in at least 2 metrics. the questionnaire was sent in january 2019; after 15 days, we had obtained 137 responses, a response rate of about 15%. we asked the 137 developers about their software development experience in general (background) and their use of the specific libraries investigated in this paper.
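the precision figures reported in the next subsection are plain proportions over the likert answers: the precision at threshold t is the share of respondents who reported a knowledge level of at least t. the small sketch below, written for this text with a hypothetical class name, reproduces that arithmetic using the gwt row of table 8 (shown below) as input.

    // illustrative arithmetic for the precision figures of section 4.2:
    // precision at threshold t = answers with likert level >= t / all answers.
    public class LikertPrecision {

        static double precisionAtLeast(int[] answersPerLevel, int threshold) {
            int total = 0;
            int atOrAbove = 0;
            for (int level = 1; level <= answersPerLevel.length; level++) {
                total += answersPerLevel[level - 1];
                if (level >= threshold) {
                    atOrAbove += answersPerLevel[level - 1];
                }
            }
            return 100.0 * atOrAbove / total;
        }

        public static void main(String[] args) {
            // gwt row of table 8: 1, 1, 4, 9, 16 answers for likert levels 1..5 (31 in total)
            int[] gwt = {1, 1, 4, 9, 16};
            System.out.printf("levels 3-5: %.0f%%%n", precisionAtLeast(gwt, 3)); // prints 94%
            System.out.printf("levels 4-5: %.0f%%%n", precisionAtLeast(gwt, 4)); // prints 81%
        }
    }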
4.2 overview

in this section, we present an overview of some relevant findings for the popular java libraries.

table 7 presents an overview of the expert candidates contacted to answer our first survey. the table has the following structure. the first column (library) indicates the name of the analyzed library. the second column (emails sent) shows the number of emails collected and sent to expert candidates. the third column (invalid emails) presents the number of invalid emails returned by the server. the fourth column (remaining emails) indicates the number of valid emails. the fifth column shows the number of answers we obtained for each library. finally, the last column shows the response rate for each library.

table 7. top 20% of library experts selected to answer the survey

library | emails sent | invalid emails | remaining emails | # answers | %
gwt | 160 | 18 | 142 | 31 | 22%
hadoop | 181 | 33 | 148 | 11 | 7%
hibernate | 155 | 10 | 145 | 16 | 11%
spark | 138 | 19 | 119 | 11 | 9%
struts | 42 | 2 | 40 | 9 | 23%
vaadin | 107 | 18 | 89 | 15 | 17%
primefaces | 30 | 1 | 29 | 9 | 31%
wicket | 23 | 2 | 21 | 8 | 38%
selenium | 209 | 31 | 178 | 27 | 15%
total | 1,045 | 134 | 911 | 137 | 15%

concerning the participants' background and the replication package, we created a web page with more details (oliveira et al., 2020). it is worth mentioning that half of the respondents graduated in computer science, and 7% hold a ph.d. degree. concerning time dedicated to software development, 47% have more than 10 years of experience, and only 2% have less than 1 year of experience. therefore, we can conclude that, in general, the participants are not novices.

our study also shows that a significant share of the expert candidates make commits related to the evaluated libraries: when writing code related to a specific library, they perform many imports of that library and write lines of code related to it. we support this claim through the metrics that evaluate the amount of loc written by a developer when performing a commit. table 8 shows the knowledge that the surveyed developers claim to have in each library. if we measure the precision of the strategy by summing levels 3, 4, and 5 of the likert-type scale, we obtain on average 88.49% precision regarding the knowledge of the developers, i.e., the identification is correct in more than 88% of the cases. on the other hand, although a score of three may represent acceptable knowledge, if we follow a more conservative criterion and only classify as library experts the developers who reported a higher (≥ 4) knowledge of the libraries, we obtain on average 63.31% precision. this way, we conclude that less than 2/3 of the expert candidates identified by the strategy have high knowledge of the evaluated libraries.

about 63% of the library experts who answered the survey have high knowledge of the evaluated libraries.

table 8. level of knowledge in each library (columns 1 to 5 are the likert-scale levels; the last two columns give the share of answers at levels 3-4-5 and 4-5)

library | 1 | 2 | 3 | 4 | 5 | total | 3-4-5 | 4-5
gwt | 1 | 1 | 4 | 9 | 16 | 31 | 94% | 81%
hadoop | 0 | 1 | 3 | 4 | 3 | 11 | 91% | 64%
hibernate | 1 | 3 | 6 | 3 | 3 | 16 | 75% | 38%
spark | 0 | 1 | 4 | 2 | 4 | 11 | 91% | 55%
struts | 2 | 2 | 1 | 4 | 0 | 9 | 56% | 44%
vaadin | 0 | 2 | 5 | 3 | 5 | 15 | 87% | 53%
primefaces | 0 | 0 | 4 | 4 | 1 | 9 | 100% | 56%
wicket | 1 | 0 | 2 | 4 | 1 | 8 | 88% | 63%
selenium | 0 | 1 | 4 | 13 | 9 | 27 | 96% | 81%

4.3 level of activity

in this section, we answer the first research question.

rq1 – how to evaluate the level of activity of a developer in a library?

to answer this research question, we asked the library experts the following question: "how often are your commits related to the library <library>?". figure 8 shows the results of this question in the first line of each chart for each library. for most libraries, the majority of the participants answered that they made "few" commits using the evaluated libraries. evaluating the results for this label, we see that, of the 137 experts, 54% made "few" commits. for instance, for the library hibernate, 87% of the developers said they made few commits related to this library. another library that deserves special attention is struts, for which 88% of the developers responded that they made few commits. regarding the label "a lot", only 39% of the polled experts said they performed many commits.
gwt was the library with the highest rate of answers for this label (62%). therefore, the results indicate that the metric number of commits needs to be combined with other metrics to achieve conclusive results about the skill of developers, and that other metrics could even be developed to capture the level of activity.

answer to rq1. a large proportion of library experts make "few" commits using the library. therefore, we conclude that the number of commits alone cannot identify library experts.

4.4 knowledge extension

in this section, we answer the second research question.

rq2 – how to evaluate the knowledge extension of a developer in a library?

regarding the number of imports as an indicator of a library expert, we asked the developers the following question: "how often do you include an import of library <library> in your commits?". figure 8 shows the results of this question in the second line of each chart for each library. we analyze the number of imports performed by the developers; the main reason for this analysis is to evaluate the feasibility of inferring the skills of the developers from the imports they write. in general, the labels "few" and "a lot" are tied or show little difference between them. for example, hibernate, spark, and primefaces are practically tied; these libraries did not show significant differences, with a difference of only 1 absolute point in some cases. in only three cases did the label "a lot" remain significantly higher: gwt (83%), vaadin (67%), and selenium (78%). of the 137 experts, 68% said that they made "a lot" of imports. however, the numbers informed by the experts indicate that this metric requires combination with other metrics to achieve better results, because 32% of the experts said they made few imports of the evaluated libraries. therefore, from the survey results, the metric number of imports, like the metric number of commits, is not able to identify library experts when applied on its own.

answer to rq2. the metric number of imports is not able to identify library experts when used alone.

4.5 knowledge intensity

in order to evaluate the metric lines of code, we present the third research question as follows.

rq3 – how to evaluate the knowledge intensity of a developer in a library?

in this research question, we analyze the developers' skill from the number of loc related to the library: we evaluate the number of loc implemented by a developer for a specific library. for this purpose, we asked the library experts the following question: "how much of your code is related to the library <library> when you perform a commit?". figure 8 shows the results of this question in the third line of each chart for each library. the libraries gwt, wicket, selenium, and hadoop, for instance, obtained 74%, 71%, 70%, and 64%, respectively, for the label "a lot". we note, however, that the label "a few" also remained at a high level in some cases, for instance, for the libraries struts (88%) and spark (55%); in fact, the library hibernate remained tied between the labels "a few" and "a lot". in general, of the 137 experts, 39% said they write "a few" loc and 61% said they write "a lot" of loc with respect to the libraries. therefore, it is possible to infer that the metric lines of code alone also does not provide indications about developer skills, although this metric achieved better precision than the metric number of commits.

answer to rq3.
according to our analysis, the metric lines of code alone cannot reliably provide indications about developers' skills. in general, our metrics, taken individually, are not sufficient to identify library experts. however, our strategy is able to reduce the search space of library experts; therefore, a company or an open-source project can select a developer from the group singled out by our strategy.

figure 8. results of the survey questions for each library: (a) gwt, (b) hadoop, (c) hibernate, (d) primefaces, (e) selenium, (f) spark, (g) struts, (h) vaadin, (i) wicket

5 survey with microservices experts

in order to favor the generalization of our findings, we conducted a second survey with developers of microservices libraries. for this, we selected libraries in this domain.

5.1 survey design

we selected the library experts for this survey in a similar way to the survey presented in section 4.1. we created a questionnaire on google forms in order to evaluate the knowledge of the developers about microservices libraries. the first question requests the login of the developer on github; this login is necessary to map the answers of the developer to our data. we ask developers about their knowledge of all six microservices libraries investigated in this survey. we request the developers to rank their knowledge of these libraries on four levels: no knowledge, low knowledge, medium knowledge, and extensive knowledge. each level of knowledge has a meaning. no knowledge: this library was never used in any project i am involved in. low knowledge: i never used this library, but it has been used in projects i am involved in. medium knowledge: i used this library in some projects before, but i do not master all its api. extensive knowledge: i used this library many times, and i know a lot of its api. table 9 shows the template of the survey.

table 9. level of knowledge in microservices libraries

library | no knowledge | low knowledge | medium knowledge | extensive knowledge
apache karaf | ( ) | ( ) | ( ) | ( )
apache spark | ( ) | ( ) | ( ) | ( )
javaee | ( ) | ( ) | ( ) | ( )
netflix | ( ) | ( ) | ( ) | ( )
spring boot | ( ) | ( ) | ( ) | ( )
swagger | ( ) | ( ) | ( ) | ( )

we selected the library experts with the best values in the evaluated metrics to validate them through the survey. we designed and applied the survey with the top developers identified by our strategy; that is, we selected developers with the top-20% highest values in at least two (out of three) metrics. therefore, we chose 136 candidate library experts in microservices. figure 9 presents an overview of the number of developers by library. the library with the most candidate experts identified was netflix, with 64, and the library with the fewest was karaf, with only 1.

figure 9. number of developers by library in microservices

table 10 presents an overview of the candidate experts contacted to answer our survey. we sent 136 emails, but 38 were returned as invalid. therefore, we sent the survey to 98 valid emails. the library with the most respondents was netflix, with 7 candidate library experts; on the other hand, the library with the fewest participants was apache karaf, with 0.
table 10. top 20% of library experts of microservices

library | emails sent | invalid emails | remaining emails | # answers | %
apache karaf | 1 | 0 | 1 | 0 | 0%
apache spark | 6 | 1 | 5 | 4 | 80%
javaee | 9 | 1 | 8 | 1 | 13%
netflix | 64 | 18 | 46 | 7 | 15%
springboot | 37 | 11 | 26 | 6 | 23%
swagger | 19 | 7 | 12 | 3 | 25%
total | 136 | 38 | 98 | 21 | 21%

5.2 results

in this section, we present the results of the survey performed with library experts in microservices. initially, we performed a pilot survey with 5 developers from netflix randomly selected among the identified candidate experts; we received 3 answers for this library. from the pilot, we did not identify any problem with our survey. then we applied the final survey to all top-20% developers with high values in at least two metrics. note that the results of the pilot survey are part of our final results.

table 11 shows the summary results of our second survey. the first column shows the name of the library. the second column shows the number of developers without knowledge of the target library. the third column indicates the number of developers with low knowledge of the target library. the fourth column shows the number of developers with medium knowledge of the target library. the fifth column shows the number of developers with extensive knowledge of the target library. finally, the last two columns show the precision for medium and extensive knowledge. the library apache karaf is not presented in table 11 because we did not obtain any response for this library.

table 11. summary results

library | no | low | medium | extensive | medium (precision) | extensive (precision)
apache spark | 2 | 1 | 1 | – | 25% | –
javaee | – | – | – | 1 | – | 100%
netflix | 1 | 1 | 2 | 3 | 29% | 43%
springboot | – | – | 1 | 5 | 17% | 83%
swagger | 1 | – | – | 2 | – | 67%
total | 4 | 2 | 4 | 11 | 19% | 52%

table 12 presents an overview of the survey applied to the developers of microservices. this table has 8 columns. the column developer identifies the developer; we omit the names of the developers to avoid exposing them. the next six columns represent the libraries investigated in the survey. finally, the last column, target library, indicates the library for which our strategy classified the developer as a library expert. in this table, we use 4 scales for developers to rank their knowledge: nk represents "no knowledge", lk represents "low knowledge", mk indicates "medium knowledge", and ek represents "extensive knowledge". table 12 shows, for instance, that developer d1 was identified by our strategy as having medium or extensive knowledge of the library spark; however, d1 answered that he/she has low knowledge of this library. d1 is an interesting case because this developer reports low knowledge of the library for which they had been recommended, but medium or extensive knowledge of all the others. netflix is also an interesting case, since only 3 (out of 8) respondents reported extensive knowledge of the library, while 5 reported extensive knowledge of javaee and 5 of spring boot. on the other hand, developer d17 was identified by our strategy as having medium or extensive knowledge of the library spring boot, and this developer marked extensive knowledge.

from the 21 developers who answered the survey, we observe that the strategy obtained a precision of 52% on average for extensive knowledge. the last column of table 11 shows, for example, a precision of 83% for the library springboot; on the other hand, for the library netflix, our strategy obtained a precision of 43%.
if we correlate the survey results with the results of the strategy considering only the developers who answered with extensive knowledge, we obtain 52% precision. however, if we consider the developers who answered the survey with medium or extensive knowledge, we obtain 71% precision.

table 12. survey results: microservices (overview)

developer | apache karaf | apache spark | javaee | netflix | spring boot | swagger | target library
d1 | mk | lk | mk | ek | ek | ek | spark
d2 | nk | nk | nk | nk | lk | nk | spark
d3 | nk | mk | ek | mk | mk | mk | spark
d4 | nk | nk | mk | nk | mk | mk | spark
d5 | nk | nk | ek | lk | ek | mk | javaee
d6 | lk | nk | ek | ek | ek | ek | netflix
d7 | nk | nk | ek | ek | ek | mk | netflix
d8 | mk | lk | ek | mk | ek | mk | netflix
d9 | lk | lk | lk | lk | ek | nk | netflix
d10 | lk | nk | lk | nk | lk | mk | netflix
d11 | nk | nk | lk | ek | lk | lk | netflix
d12 | nk | nk | ek | mk | mk | mk | netflix
d13 | lk | lk | ek | mk | ek | ek | springboot
d14 | lk | lk | mk | lk | ek | lk | springboot
d15 | lk | lk | mk | lk | mk | ek | springboot
d16 | nk | mk | mk | mk | ek | ek | springboot
d17 | nk | mk | mk | ek | ek | ek | springboot
d18 | lk | lk | ek | mk | ek | ek | springboot
d19 | lk | nk | ek | lk | ek | nk | swagger
d20 | lk | lk | ek | ek | ek | ek | swagger
d21 | lk | ek | nk | lk | lk | ek | swagger

nk = no knowledge, lk = low knowledge, mk = medium knowledge, ek = extensive knowledge.

6 tool support

we developed a prototype tool, named jexpert, to support the identification of library experts with the strategy. jexpert is written in the java programming language. it currently works with java projects, but the tool can be easily adapted to identify library experts in other programming languages. jexpert is a standalone tool and runs on windows, linux, and mac. jexpert is available on our website (oliveira et al., 2020). jexpert uses static analysis and avoids building an abstract syntax tree (ast); therefore, it reduces the response time when analyzing large systems with hundreds of source elements, such as loc, imports, packages, and classes. our goal is to support recruiters with a flexible, lightweight means to identify library experts from source code.

figure 10 presents the simplified architecture design of jexpert. initially, there are two inputs: projects and library name. in other words, jexpert receives two items as input: (i) java projects that contain the target libraries, i.e., systems from a local directory informed by the user, and (ii) the names (keywords) of the libraries that a developer wants to investigate. the module activity extractor is responsible for extracting the code elements necessary for computing the activities performed by a developer. besides, this module removes old projects, i.e., projects whose most recent commit is more than three years old, projects with less than 1 kloc, and projects without the target library.

figure 10. jexpert architecture overview

in the next step, the module developer data analyzer computes all data about each developer. this module is responsible for separating the number of commits to libraries and the changes made to the source code in general, for instance, the number of lines of code written. this module also computes the number of imports made by developers and verifies whether an import is related to the target library. the metric collector module computes the three metrics mentioned in section 3.1. finally, the list of experts is generated as output, with the expert candidates sorted according to our metrics. this list prioritizes the library experts based on a heuristic score, i.e., higher scores come first; currently, the tool returns a ".csv" file for each library.
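to summarize the module flow of figure 10, the skeleton below gives our interpretation of the jexpert architecture as a small pipeline. all names are hypothetical rather than the tool's actual api, the mining step is stubbed, and, for brevity, only the loc-based score of section 3.1 is computed.

    import java.nio.file.Path;
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // hypothetical skeleton mirroring the jexpert modules of figure 10.
    public class ExpertPipeline {

        record Commit(String developer, int libraryImports, int allImports, int changedLoc) {}
        record Candidate(String developer, double score) {}

        // activity extractor: would mine commits and discard projects that are
        // too small (< 1 kloc), inactive (> 3 years), or without the target library.
        static List<Commit> extractActivity(Path projectsDir, String library) {
            return new ArrayList<>(); // stub: the real module walks the git history
        }

        // developer data analyzer + metric collector: sum the loc heuristic of
        // section 3.1 per developer and rank candidates by the resulting score.
        static List<Candidate> rank(List<Commit> commits) {
            Map<String, Double> locByDev = new HashMap<>();
            for (Commit c : commits) {
                double loc = c.allImports() == 0 ? 0.0
                        : (double) c.changedLoc() / c.allImports() * c.libraryImports();
                locByDev.merge(c.developer(), loc, Double::sum);
            }
            return locByDev.entrySet().stream()
                    .map(e -> new Candidate(e.getKey(), e.getValue()))
                    .sorted((a, b) -> Double.compare(b.score(), a.score()))
                    .toList();
        }

        public static void main(String[] args) {
            // expects two arguments: a local directory with java projects and a library keyword
            Path projectsDir = Path.of(args[0]);
            String library = args[1]; // e.g. "org.apache.spark"
            for (Candidate c : rank(extractActivity(projectsDir, library))) {
                System.out.println(c.developer() + "," + c.score()); // csv-like output per library
            }
        }
    }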
7 threats to validity

we based our study on related work to support the evaluation of a strategy to identify library experts. regarding the assessment, we conducted a careful empirical study to assess the efficiency of the strategy on software systems hosted on github. the evaluated strategy can analyze source code from platforms that follow the git architecture. however, some threats to validity may affect our research findings. the main threats and respective treatments are discussed below based on the categories proposed by wohlin et al. (2012).

construct validity. this validity is related to whether the measurements in the study reflect real-world situations (wohlin et al., 2012). before running the strategy, we conducted a careful filtering of software systems from github repositories. however, some threats may affect the correct filtering of systems, such as human factors wrongly leading to the discard of a valid system. considering that the exclusion criteria for system selection were applied in a manual process, we may have discarded interesting systems, for instance, systems that we wrongly identified as non-java.

internal validity. this validity is related to uncontrolled aspects that may affect the strategy results (wohlin et al., 2012). to treat this possible problem, we selected a sample of 5 software systems from our dataset that contain the library hadoop, with a diversified number of loc. then, we manually identified the number of commits in the github repository, the number of imports, and the number of loc written for the specific library. we compared our manual results with the results provided by the tool and observed a loss of 5% in terms of the metrics computed through the automated process. we believe that this error rate does not invalidate our main conclusions. in addition, our strategy has the goal of reducing the search space to identify library experts; that is, we do not recommend a specific developer.

external validity. this validity is related to the possibility of generalizing our results (wohlin et al., 2012). we evaluated the strategy with a set of 16,703 software projects from github. considering that these systems may not include all existing libraries, our findings may not generalize. furthermore, we evaluated the strategy with an online survey with only 158 developers who implemented projects with the investigated libraries, and we analyzed data for only 15 java libraries. however, we chose the top libraries from the survey reported by stack overflow in 2018, with over 100,000 responses from developers around the world, and we also analyzed microservices libraries. this way, we believe these libraries represent a reasonable option to evaluate the strategy.

8 related work

the use of data from github to understand how software developers work and collaborate has become recurrent in software engineering studies (greene and fischer, 2016; singer et al., 2013; ortu et al., 2015; destefanis et al., 2016; ma et al., 2009; begel et al., 2010; moraes et al., 2010). some studies seek to understand the behavior of developers concerning the interaction with their peers (ortu et al., 2015). for example, a few studies (ortu et al., 2015, 2016) tried to understand which developers have peaceful behavior and which have aggressive behavior, and whether these developers coexist productively in software development projects (ortu et al., 2016).
similar studies also tried to understand whether there is a relationship between bug resolution time and the behavior of developers (ortu et al., 2015). also, some studies investigated developers' manners (destefanis et al., 2016) and sought to understand the emotional behavior of software developers (ortu et al., 2016).

schuler and zimmermann (2008) investigated developer expertise based on commit activities, which manifest themselves whenever developers use functionality. they present preliminary results for the eclipse project. they were able to create expertise profiles that include data about which apis a developer may be an expert in, based on their use of those apis. wu et al. (2011) proposed drex, an approach to bug assignment using k-nearest neighbor search and social network analysis. this approach works in the following way: 1) finding textually similar bug reports, 2) extracting the developers involved in their resolution, and 3) ranking the developers' expertise by analyzing their participation in resolving similar bugs. an evaluation on bug reports from the firefox oss project shows that the social network analysis of drex outperforms a purely textual approach, with a prediction accuracy of about 15%.

in closely related work, greene and fischer (2016) developed an approach to extract technical information from github developers. their work does not differentiate developers by their level of knowledge of technical skills, even though a recruiter typically has several candidates for the same job position. besides, that work only shows the profiles of users on github and does not extract other characteristics of their knowledge and skills. another limitation is that they neither provide actual data about the developers' knowledge production nor present a survey to evaluate the results. singer et al. (2013) investigated the use of profile aggregators in the evaluation of developer skills by developers and recruiters. however, these aggregators only gather skills for individual developers, and it is not clear how they support the identification of relevant developers in a large dataset.

we believe that the strategy evaluated in our study is complementary to the described related work, providing a different approach that focuses on reducing the search space to identify possible experts. our strategy is complementary to other approaches, such as cvexplorer (greene and fischer, 2016); for instance, by combining our results with cvexplorer, it is possible to select skills in a programming language and analyze the metrics shown in our paper. despite our best efforts, we did not find a similar large-scale study that evaluates a strategy able to identify library experts; hence, we cannot compare the evaluated strategy with other studies.

9 conclusion

in this paper, we evaluated a strategy to reduce the search space to identify library experts in software systems through source code analysis. we also presented a prototype tool that implements the strategy. the evaluated strategy is composed of three metrics: number of commits, number of imports, and lines of code. we assessed the strategy in two dimensions: applicability and precision. first, the applicability evaluation analyzed the feasibility of identifying library expert candidates in large datasets. second, the precision evaluation compared the results provided by the strategy with the developers' perceptions collected in a survey.
in total, we analyzed 16k software systems mined from github, 15 libraries, and a survey with 158 developers. our findings pointed out that the strategy was able to identify library experts for different libraries from the set of input software systems, with a precision of 71% on average.

there are many possible extensions to this work. for instance, we did not consider all available data in our analysis, such as the number of forks, the number of starred projects belonging to a developer, the number of followers, the number of methods, source code quality, and contributions to project discussions. besides, we did not consider the number of lines of code added and removed between versions. future work can also extend our research to evaluate the strategy on other programming languages and libraries.

references

alshuqayran, n., ali, n., and evans, r. (2016). a systematic mapping study in microservice architecture. in 9th international conference on service-oriented computing and applications (soca), pages 44–51.
basili, v., caldiera, g., and rombach, h. d. (1994). the goal question metric approach. online technical report.
begel, a., khoo, y. p., and zimmermann, t. (2010). codebook: discovering and exploiting relationships in software repositories. in 32nd international conference on software engineering (icse), pages 125–134.
brown, v. r. and vaughn, e. d. (2011). the writing on the (facebook) wall: the use of social networking sites in hiring decisions. journal of business and psychology, 26(2):219.
capiluppi, a., serebrenik, a., and singer, l. (2013). assessing technical candidates on the social web. ieee software, 30(1):45–51.
constantinou, e. and kapitsaki, g. m. (2016). identifying developers' expertise in social coding platforms. in 42nd euromicro conf. on software engineering and advanced applications (seaa), pages 63–67.
dabbish, l., stuart, c., tsay, j., and herbsleb, j. (2012). social coding in github: transparency and collaboration in an open software repository. in 12th proc. of the conf. on computer supported cooperative work (cscw), pages 1277–1286.
damasiotis, v., fitsilis, p., considine, p., and o'kane, j. (2017). analysis of software project complexity factors. in proc. of the 2017 international conf. on management engineering, software engineering and service sciences, pages 54–58.
destefanis, g., ortu, m., counsell, s., swift, s., marchesi, m., and tonelli, r. (2016). software development: do good manners matter? peerj computer science, 2(2):1–10.
easterbrook, s., singer, j., storey, m.-a., and damian, d. (2008). selecting empirical methods for software engineering research. in guide to advanced empirical software engineering, pages 285–311.
ferreira, m., mombach, t., valente, m. t., and ferreira, k. (2019). algorithms for estimating truck factors: a comparative study. software quality journal, 1(27):1–37.
garcia, v. c., lucrédio, d., alvaro, a., almeida, e. s. d., de mattos fortes, r. p., and de lemos meira, s. r. (2007). towards a maturity model for a reuse incremental adoption. in 7th brazilian symposium on software components, architectures, and reuse (sbcars), pages 61–74.
greene, g. j. and fischer, b. (2016). cvexplorer: identifying candidate developers by mining and exploring their open source contributions. in 31st int. conf. on automated software engineering (ase), pages 804–809.
joblin, m., apel, s., hunsen, c., and mauerer, w. (2017). classifying developers into core and peripheral: an empirical study on count and network metrics. in 39th international conference on software engineering (icse), pages 164–174.
klock, s., van der werf, j. m. e. m., guelen, j. p., and jansen, s. (2017). workload-based clustering of coherent feature sets in microservice architectures. in 2017 ieee international conference on software architecture (icsa), pages 11–20.
krüger, j., wiemann, j., fenske, w., saake, g., and leich, t. (2018). do you remember this source code? in 40th proc. of the international conf. on software engineering (icse), pages 764–775.
ma, d., schuler, d., zimmermann, t., and sillito, j. (2009). expert recommendation with usage expertise. in international conference on software maintenance (icsm), pages 535–538.
ma, w., chen, l., zhang, x., zhou, y., and xu, b. (2017). how do developers fix cross-project correlated bugs? a case study on the github scientific python ecosystem. in 39th international conference on software engineering (icse), pages 1–12.
marlow, j. and dabbish, l. (2013). activity traces and signals in software developer recruitment and hiring. in 16th proc. of the 2013 conf. on computer supported cooperative work (cscw), pages 145–156.
mcculler, p. (2012). how to recruit and hire great software engineers: building a crack development team. apress.
mockus, a. and herbsleb, j. d. (2002). expertise browser: a quantitative approach to identifying expertise. in 24th proc. of the international conf. on software engineering (icse), pages 503–512.
moraes, a., silva, e., da trindade, c., barbosa, y., and meira, s. (2010). recommending experts using communication history. in 2nd international workshop on recommendation systems for software engineering, pages 41–45.
oliveira, j., fernandes, e., souza, m., and figueiredo, e. (2016). a method based on naming similarity to identify reuse opportunities. in 7th brazilian symposium on information systems: information systems in the cloud computing era - volume 1, pages 41:305–41:312.
oliveira, j., pinheiro, d., and figueiredo, e. (2020). web site of the paper. https://johnatan-si.github.io/jserd2020/.
oliveira, j., viggiato, m., and figueiredo, e. (2019). how well do you know this library? mining experts from source code analysis. in 18th brazilian symposium on software quality (sbqs), pages 49–58.
ortu, m., adams, b., destefanis, g., tourani, p., marchesi, m., and tonelli, r. (2015). are bullies more productive? empirical study of affectiveness vs. issue fixing time. in 12th proc. of the working conf. on mining software repositories (msr), pages 303–313.
ortu, m., destefanis, g., counsell, s., swift, s., tonelli, r., and marchesi, m. (2016). arsonists or firefighters? affectiveness in agile software development. in 18th international conf. on agile software development (xp), pages 144–155.
pahl, c. (2015). containerization and the paas cloud. ieee cloud computing, 2(3):24–31.
pfleeger, s. l. and kitchenham, b. a. (2001). principles of survey research: part 1: turning lemons into lemonade. sigsoft softw. eng. notes, 26(6):16–18.
saxena, r. and pedanekar, n. (2017). i know what you coded last summer: mining candidate expertise from github repositories. in 17th companion of the conf. on computer supported cooperative work and social computing (cscw), pages 299–302.
schuler, d. and zimmermann, t. (2008). mining usage expertise from version archives. in proceedings of the 2008 international working conference on mining software repositories, pages 121–124.
singer, l., filho, f. f., cleary, b., treude, c., storey, m.-a., and schneider, k. (2013). mutual assessment in the social programmer ecosystem: an empirical investigation of developer profile aggregators. in 13th proc. of the conf. on computer supported cooperative work (cscw), pages 103–116.
sommerville, i. (2015). software engineering. pearson.
tong, j., ying, l., hongyan, t., and zhonghai, w. (2016). can we use programmer's knowledge? fixing parameter configuration errors in hadoop through analyzing q&a sites. in 5th ieee int. congress on big data (bigdata congress), pages 478–484.
tsui, f., karam, o., and bernal, b. (2016). essentials of software engineering. jones & bartlett learning.
viggiato, m., oliveira, j., figueiredo, e., jamshidi, p., and kästner, c. (2019). understanding similarities and differences in software development practices across domains. in 14th international conference on global software engineering (icgse), pages 74–84.
wohlin, c., runeson, p., höst, m., ohlsson, m. c., regnell, b., and wesslén, a. (2012). experimentation in software engineering. springer publishing company, incorporated.
wu, w., zhang, w., yang, y., and wang, q. (2011). drex: developer recommendation with k-nearest-neighbor search and expertise ranking. in 18th asia-pacific software engineering conference, pages 389–396.
ye, c. (2017). research on the key technology of big data service in university library. in 13th int. conf. on natural computation, fuzzy systems and knowledge discovery (icnc-fskd), pages 2573–2578.

journal of software engineering research and development, 2020, 8:5, doi: 10.5753/jserd.2020.546  this work is licensed under a creative commons attribution 4.0 international license.
software operational profile vs. test profile: towards a better software testing strategy
luiz cavamura júnior [ federal university of são carlos | luiz_cavamura@ufscar.br ]
ricardo morimoto [ federal university of são carlos | rmmorimoto@gmail.com ]
sandra fabbri [ federal university of são carlos | sfabbri@ufscar.br ]
ana c. r. paiva [ school of engineering, university of porto & inesc tec | apaiva@fe.up.pt ]
auri marcelo rizzo vincenzi [ federal university of são carlos | auri@ufscar.br ]
abstract
the software operational profile (sop) is a software specification based on how users use the software. this specification corresponds to a quantitative representation of the software that identifies its most used parts. as software reliability depends on the context in which users operate the software, the sop is used in software reliability engineering. however, there is evidence of a misalignment between the tested software parts and the sop. therefore, this paper investigates a potential misalignment between the sop and the tested software parts to obtain more evidence of this misalignment based on experimental data.
we performed a set of experimental studies (exs) to verify: a) whether there are significant variations in how users operate the software; b) whether there is a misalignment between the sop and the tested software parts; c) whether failures occur in untested sop parts in case of misalignment; and d) whether a test strategy based on amplifying the existing test set with additional, automatically generated test data can contribute to reducing the misalignment between the sop and the untested software parts. we collected data from four software systems while users were operating them and analyzed this data to reach the goals of this work. after evaluating the four systems studied, the results show that there is significant variation in how users operate software and that there is a misalignment between the sop and the tested software parts. there is also an indication of failures in the untested sop parts. although the aforementioned test strategy reduced the potential misalignment, it is not enough to avoid it, indicating the need for specific test strategies that use the sop as a test criterion. these results indicate that the sop is relevant not only to software reliability engineering but also to testing activities, regardless of the adopted testing strategy.
keywords: software quality, software testing, operational profile, test profile

1 introduction

software users provide relevant data related to the many possible ways they explore a given software feature. software is created as an expression of the creative nature of our intellect (assesc, 2012). this same creative aspect, combined with previous professional experience, allows software users to adapt to different ways of using the software when the process initially supported by the program changes (sommerville, 1995). for this reason, software functionalities are made parameterizable to meet specific and particular needs, even when they are designed to address business rules that are common to many organizations.

the software operational profile (sop) corresponds to the manner in which a given user operates the software. the sop may be quantitatively characterized by assigning a probability distribution to the software operations, showing which parts of the software users use the most (musa, 1993; gittens et al., 2004; sommerville, 1995). a given user may not reproduce the same failure identified by another one; the reason is that software can have many different operational profiles, and experienced users can adapt how they operate the software. as such, software quality depends on its operational use (cukic and bastani, 1996).

a survey by cukic and bastani (1996) states that information about the sop is considered either essential or relevant to issues related to activities inherent to software development. examples of these questions are: "which are the most used parts of the software?"; "how do users use the application?"; "what are the software usage patterns?"; and "how does test coverage correspond to the code that was indeed executed by users?". additionally, rincon (2011) analyzed a set of ten open-source systems and, in only one of them, did the available functional test set reach code coverage close to 70%. even if this level of code coverage is considered acceptable, there is a significant percentage of untested code, which may be related to critical features for the majority of software users. this fact highlights the possibility of a misalignment between the tested parts and the parts that users effectively use.
thus, there are indications of the relevance of the sop both in ensuring software quality and in evidencing a possible misalignment between the sop and the tested software parts (rincon, 2011; begel and zimmermann, 2014). this misalignment can often lead to failures when operating the software. the term misalignment refers to the potential dissonance between the tested software parts and the sop, which corresponds to the software parts most used by users. thus, it represents situations in which the sop, or parts of it, may not have been previously executed by the software test suite, indicating that the adopted test strategy may not be aligned with the users' interests in terms of software functionality. therefore, this study investigates a potential misalignment between the tested software parts and the sop. the research results, based on a set of experimental studies (exs), provide the following contributions:

1. evidence that there are significant variations in how users operate software, even when they perform the same operations, i.e., there are different software usage patterns;
2. evidence of a possible misalignment between the sop and software testing;
3. evidence that there are faults concentrated on untested parts of the software;
4. definition and introduction of the term "test profile";
5. evidence that, even when an automated test generator is used to extend an existing test set, the misalignment between the sop and the tested parts of the software improves very little.

in addition, the related-studies section briefly presents the results obtained by a systematic literature review (slr), which we carried out before the execution of the experimental studies. these results show that, to the best of our knowledge, there is no previous study with the same purpose as this one (cavamura júnior et al., 2020). we adapted the methodology proposed by mafra et al. (2006) to plan and perform the activities described in this paper.

the remainder of this paper is organized as follows: section 2 presents concepts related to the definition of the sop; section 3 describes the adopted methodology; section 4 presents the related studies identified and selected by the slr (cavamura júnior et al., 2020); section 5 describes the results of the experimental studies; section 6 presents some lessons learned from the results; section 7 presents threats to validity; lastly, section 8 describes the conclusions and future work.

2 software operational profile (sop)

the sop is a way to obtain a specification of how users operate software (musa and ehrlich, 1996; sommerville, 1995). musa (1993) proposed one of the most relevant approaches for sop registration, defining the sop as a quantitative characterization based on the way software is operated. this definition corresponds to the software operations, to which an occurrence probability is assigned. an operation corresponds to a task performed by the software, delimited by factors external to the software implementation. software operations can present different behaviors and, consequently, provide different results.
in this way, there are different possible execution paths, depending on the given input data; these different ways of execution are named execution types. figure 1 presents an example of software operations and their respective execution types.

figure 1. concepts involved in the definition of the operational profile.

the input data that characterize an execution type create a data set named input state ("is" in figure 1). the input states, associated with execution types, form the software input space. as input states characterize the execution types of an operation, the input space can be partitioned by operations, associating an input state set with each operation, named the operation domain. thus, it is possible to assign an input domain to each software operation ("id" in figure 1) that determines how the software executes the operation; i.e., the input domain elements (input states) determine the execution type of an operation. figure 1 shows: i) the input states, identified by "is1, is2, is3, ..., isn"; ii) the software input space; and iii) the input domain of each operation, identified by "idop1, idop2, ..., idopn".

although the set of operations available in a software system is finite, the execution types correspond to a set with infinitely many elements, given that the input domain can be infinite. assigning an occurrence probability to execution types is nevertheless possible, since we can partition the input domain into sub-domains. each generated sub-domain corresponds to an execution category; these categories group the execution types whose different input states produce the same behavior in an operation. figure 1 presents the execution categories, identified by "ec1, ec2, ..., ecn", which divide the input domain of each operation and group the execution types with the same behavior, as well as the relations among the concepts of operation, execution type, input state, input space, input domain, and execution category.

in his studies, musa (1993, 1994) assigns an occurrence probability to the execution categories in order to obtain a quantitative characterization of the software corresponding to the operational profile. the data used to obtain the occurrence probabilities of an operation can come from log files generated by a previous version of the software or by similar software (musa, 1993; takagi et al., 2007); developer expectations can also determine these probabilities (takagi et al., 2007).

in the context of this study, the term granularity corresponds to the level of fragmentation (be it conceptual or structural) used to assign an occurrence probability or execution frequency to the generated software fragments; it is then possible to identify the most used software parts while users operate the software, i.e., the sop. according to the object-oriented programming paradigm, subprograms correspond to the methods implemented in data structures called classes; thus, the methods in this paradigm represent the actions assigned to the operations performed by the software. as the sop is a software specification based on how users operate software (musa and ehrlich, 1996; sommerville, 1995), showing the software parts most used by users, the sop in the context of this paper corresponds to the execution frequency of the methods processed while users operate the software, thus indicating the most operated software parts.
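to make this granularity concrete, the minimal sketch below (ours; the method signatures are illustrative) represents an operational profile as a map from method signatures to execution counts, from which musa-style occurrence probabilities can be derived.

    import java.util.*;

    public class OperationalProfile {
        // execution counts per method signature, i.e., the granularity adopted in this paper
        private final Map<String, Long> executions = new HashMap<>();

        public void record(String methodSignature) {
            executions.merge(methodSignature, 1L, Long::sum);
        }

        // occurrence probability of one method relative to all recorded executions
        public double probability(String methodSignature) {
            long total = executions.values().stream().mapToLong(Long::longValue).sum();
            return total == 0 ? 0.0 : executions.getOrDefault(methodSignature, 0L) / (double) total;
        }

        public static void main(String[] args) {
            OperationalProfile sop = new OperationalProfile();
            sop.record("ReferenceManager.addEntry(Entry)");
            sop.record("ReferenceManager.addEntry(Entry)");
            sop.record("ReferenceManager.search(String)");
            System.out.printf("p(addEntry) = %.2f%n",
                    sop.probability("ReferenceManager.addEntry(Entry)")); // prints 0.67
        }
    }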
2.1 the sop and software quality

pressman (2010) defines software quality as an effective process of creating a valuable product for those who produce it and for those who will use it. thus, software quality can be subjective, in that it depends on the point of view of whoever is analyzing the software's characteristics. from the user's point of view, for example, quality software is software that meets the user's needs and is easily operated (falbo, 2005); from a developer's point of view, quality software is, e.g., software that demands less maintenance effort.

software reliability corresponds to the probability of a software operation occurring without any failure in a specified period and in a specific environment (musa, 1979; cukic and bastani, 1996). software reliability, like maintainability and efficiency (among others), is one of the attributes related to software quality and, as it depends on the context in which the software is used, it represents the user's point of view on software quality (musa, 1979; bittanti et al., 1988). since the sop represents the way software will be used by its users, and software reliability depends on the context in which users operate the software, the sop can support activities related to software reliability engineering. in this setting, the purpose of the sop is to generate test data that reproduces the way software is executed in its production environment, ensuring the validity of reliability indicators (musa and ehrlich, 1996). in the software reliability process, a usage model representing the sop is created to design test cases and perform the test activity; the elements constituting the usage model correspond to the granularity adopted to determine the sop, whose execution frequencies or occurrence probabilities identify the most used software parts. in the literature, studies using models that represent the sop in their testing techniques have classified these techniques as statistical testing, statistical use testing, reliability testing, model-based testing, use-based testing, and sop-based testing (poore et al., 2000; kashyap, 2013; sommerville, 2011; pressman, 2010; musa and ehrlich, 1996). it is worth noting that a fault that becomes apparent frequently during software operation is more significant for users than the remaining faults (takagi et al., 2007), and a defect affecting reliability for one user may never be revealed to another who has a different work routine (sommerville, 2011). the use of the sop does not guarantee the detection of all faults, but it ensures that the most used software operations are tested (ali-shahid and sulaiman, 2015).

2.2 problems related to the use of sop

although the sop can be obtained from log files recording events that occur in the operating software, from previous versions of the software, from similar software, and even from the developers' experience (musa, 1993; takagi et al., 2007), several problems related to the identification of the sop are reported in the literature. in this study, we observed that using an instrumented version of the software to collect sop data while users operate it affects the performance of the software and generates a large volume of data. according to namba et al. (2015), the effort to identify the sop depends on the complexity of the software. other kinds of problems are also reported in the literature.
thus, reports of difficulties and issues related to the sop identified in the literature are relevant and will be addressed in possible test approaches defined according to the results presented in this paper. table 1 summarizes the main challenges and problems identified.

table 1. problems related to sop.
reference | year | reported problem
(cukic and bastani, 1996) | 1996 | identifying the sop is difficult because it requires predicting software usage.
(leung, 1997) | 1997 | estimation errors and sop changes are inevitable when software is operated in a production environment.
(shukla, 2009) | 2009 | studies related to sop focus on exploring software operations; the parameters of these operations are little explored.
(sommerville, 2011) | 2011 | software reliability depends on the context in which software will be used; experienced users can constantly adapt their behavior regarding software usage.
(namba et al., 2015) | 2015 | sop identification requires a lot of effort, making this activity difficult depending on the complexity of the software.
(fukutake et al., 2015) | 2015 | the probability of use decreases when the software usage model has multiple states.
(bertolino et al., 2017) | 2017 | sop-based testing can saturate and lose effectiveness because it focuses only on the failures most likely to occur.

3 research methodology

the results presented in this paper are part of a phd project (cavamura júnior, 2017) that follows the methodology proposed by mafra et al. (2006), whose steps were instantiated in the context of the research presented in this article. this methodology extends the methodology proposed by shull et al. (2001) for introducing software processes and is shown in figure 2.

figure 2. adopted research methodology (extracted from travassos et al. (2008)).

we defined five research questions to guide our investigation in this paper:

• rq1: are there other studies with the same or similar goals whose results provide the contributions proposed in this paper?
• rq2: are there any relevant variations in how users operate software?
• rq3: is there a misalignment between the sop and the tested software parts?
• rq4: given the misalignment between the sop and the tested software parts, do failures occur in the untested sop parts?
• rq5: given the misalignment between the sop and the tested software parts, can a test strategy that includes an automated test data generator contribute to reducing the misalignment? (in the remainder of the paper, we use "test data" to refer to automatically generated inputs.)

to answer rq1, and considering the methodology presented in figure 2, the "secondary study" step included a systematic mapping study (sms) and a systematic literature review (slr) to identify studies whose contributions were similar or equivalent to the research contributions reported in this article and, thus, to evaluate its originality. the results obtained from the sms are available at http://lcvm.com.br/artigos/anexos/jserd2020/cap-3-rs-ms.pdf, and a detailed description of the slr can be found in (cavamura júnior et al., 2020). we present a brief description of the main results of both the sms and the slr in section 4. the "first draft" stage comprised the planning of the experimental studies presented in this study.
we adopted the gqm model (basili et al., 2002) to guide the planning of this research. the instantiated model for the planning phase is presented in table 2.

table 2. exploratory study planning.
stage | analyze | for the purpose of | focus | perspective | context
1 (rq1) | studies that addressed the use of sop | checking whether there is research with the same or similar purposes already answered by previous work | - | software test researchers | software application users
2 (rq2, rq3, rq4, rq5) | the sop | (a) checking if there are significant variations; (b) checking if there is a misalignment; (c) showing the occurrence of failures; (d) checking if the insertion of additional test data, generated automatically by evosuite, can contribute to reducing the misalignment | (a) the way software is operated by its users; (b) the sop and the tested software parts; (c) the sop parts not tested; (d) the sop and the tested software parts | software test researchers | software application users

the "feasibility study", "observational study", and "case study: lifecycle" stages comprised the accomplishment of a set of exs subdivided into four activities (at) associated with the research questions, called exs−at1, exs−at2, exs−at3, and exs−at4. the purpose of each activity and the research questions associated with each one are summarized in table 3.

table 3. research activities.
activity | purpose | question
sms/slr | evaluate research originality (cavamura júnior et al., 2020) | rq1
exs−at1 | check for relevant variations in how users operate the software | rq2
exs−at2 | find out, through the sop and the software's test suite, whether there is a misalignment between the sop and the tested parts of the software | rq3
exs−at3 | once the misalignment between the sop and the tested parts of the software is confirmed, check whether failures occur in the sop parts not tested | rq4
exs−at4 | check whether a test strategy based on amplifying the existing test set with automatically generated test data can contribute to reducing the misalignment between the sop and the tested parts of the software | rq5

to perform the exs activities, we instrumented four software systems, s1, s2, s3, and s4, to collect data that allowed us to identify the sop for each individual user. table 4 shows the characterization of the software used and associates each system with the exs activities. the "feasibility study" stage comprised the accomplishment of exs−at1. the "observational study" stage comprised the accomplishment of exs−at2, exs−at3, and exs−at4 based on operational profiles collected from s1 and s2. the "case study: lifecycle" stage comprised the accomplishment of exs−at2, exs−at3, and exs−at4 again, but based on operational profiles collected from s3 and s4. the "case study: industry" stage is in progress, and its results will be published in future work. once the methodology was defined, this study was planned in two stages to provide answers to the research questions; the research questions associated with these stages are shown in the "stage" column of table 2:

• stage 1: performing an sms and an slr;
• stage 2: performing the exs, composed of four activities: exs−at1, exs−at2, exs−at3, and exs−at4.
the focus of this paper is on stage 2 of table 2, i.e., the set of exs we performed to obtain evidence of the possible misalignment between the sop and the tested software parts. the other kinds of experiments were also carried out as part of the ongoing work (cavamura júnior, 2017). in section 4, we present a brief description of the main findings of the slr; an interested reader can find more information elsewhere (cavamura júnior et al., 2020). in section 5, the exs and their respective results are described.

4 related work

we conducted an sms and an slr (stage 1 of table 2) to provide the theoretical basis and evidence of the originality of this study. both the sms and the slr processes consist of the planning, conducting, and results-publishing phases (nakagawa et al., 2017). a detailed description of the sms and its results is available at http://lcvm.com.br/artigos/anexos/jserd2020/cap-3-rs-ms.pdf, and of the slr in (cavamura júnior et al., 2020). we conducted the sms to: i) verify how the distribution of primary studies related to the sop across software engineering areas is characterized; ii) acquire knowledge of the contributions provided by the use of the sop in the areas of software engineering, focusing on the software quality field; and iii) check whether the use of the sop in quality assurance activities has been a topic of interest to researchers.

the sms found 4726 studies, of which we selected 182 for data extraction. the distribution of the primary studies across software engineering areas is shown in figure 3.

figure 3. distribution of the studies in software engineering areas.

after we analyzed the extracted data, we concluded that software quality is the area most explored by studies that used the sop as a resource in their strategies, and most of these strategies are associated with software reliability. although software quality is the most approached area, we found only some studies related to software testing. this scenario evidences a gap in the software quality field, mainly in its subareas that are not associated with software reliability. therefore, the results of the sms motivated us to conduct the slr, whose purpose was to identify, analyze, and understand the studies whose contributions are similar or equivalent to the contributions of the research reported in this paper, i.e., studies that used the sop as an evaluation criterion to check whether there is a possible misalignment between the sop and the tested software parts (cavamura júnior et al., 2020).

at the end of the slr (cavamura júnior et al., 2020), as highlighted in figure 4, we observed only three studies close to ours: bertolino et al. (2017), chen et al. (2001), and amrita and yadav (2015), briefly described next. figure 4 shows the number of studies processed by the slr; the interested reader may find additional information about the complete slr protocol elsewhere (cavamura júnior et al., 2020).

figure 4. processed studies by slr.

bertolino et al. (2017) mention that testing based on the operational profile can suffer saturation and loss of effectiveness, since it focuses on the occurrence of the most likely failures; thus, to improve software reliability, the test should also focus on faults with a low probability of occurrence.
in this context, bertolino et al. (2017) present an adaptive and iterative software testing technique based on the sop. in the first iteration, the authors select test cases following traditional operational-profile-based testing, i.e., test cases are randomly selected according to the occurrence probability of each partition of the input domain of the software under test. in each subsequent iteration, the technique: a) calculates the ideal number of test cases to be selected for each partition; and b) selects, prioritizes, and executes that number of test cases. bertolino et al. (2017) compute a probability that represents how much testing a partition will contribute to program reliability and, based on this information, determine the optimal number of test cases for each partition. this calculation considers the failure rate and the occurrence probability of each partition: the failure rate is the ratio between the number of failed test cases and the number of test cases assigned to the partition, while the occurrence probability is obtained from the sop. to select and prioritize test cases, the frequency with which the program parts are exercised when running the tests is obtained from the previous iterations. as the focus of the approach of bertolino et al. (2017) is to select test cases covering portions of the program that are poorly exercised, test cases associated with the uncovered parts of the software have high priority.

software reliability can be determined by the time elapsed between detected faults. in this regard, the technique of chen et al. (2001) considers the context in which a test suite can overestimate software reliability when it is not able to detect new faults due to the use of an obsolete sop: the more redundant the test cases are with respect to the covered code, the more overestimated the reliability of the software will be. thus, the technique adjusts the time interval between failures when redundant test cases are run; chen et al. (2001) identified the redundant test cases through coverage analysis during the execution of the tests.

according to amrita and yadav (2015), researchers have approached the selection of test cases based on the sop, but the authors did not find much discussion about the infrequently used software parts. amrita and yadav (2015) propose a model that provides the flexibility to allocate test cases according to the priority defined by the sop and by the experience of the testing team; based on this information, the model selects test cases using fuzzy logic.

we observed that bertolino et al. (2017) and amrita and yadav (2015) addressed the use of the sop in the selection and prioritization of test cases, focusing on the software parts whose operation is infrequent, while chen et al. (2001) addressed the selection of test cases, using the sop to identify redundant test cases and treat them in the software reliability process, thus obtaining more accurate reliability estimates. nevertheless, the studies identified and processed by the slr did not directly investigate whether there is a misalignment between the existing test suite and the sop, which answers research question rq1. we believe the selection and prioritization activities will not be productive if the test cases are not aligned with the sop.
5 experimental studies (exs)

the studies by begel and zimmermann (2014) and rincon (2011), briefly described in section 1, provided initial evidence of a possible misalignment between the tested software parts and the sop. we performed the exs to obtain empirical data that, once analyzed, could provide answers to the research questions rq2, rq3, rq4, and rq5, thus producing more evidence, based on experimental data, about this possible misalignment. as described in section 3, we defined four activities for the exs, named exs−at1, exs−at2, exs−at3, and exs−at4. to perform these activities, we instrumented four software systems, s1, s2, s3, and s4, all implemented under the object-oriented programming paradigm, to collect data that allowed us to identify the sop of each system during its operation by users. a characterization of the software used and its association with the activities of the exs is presented in table 4.

table 4. characterization of the software used in the exs.
software | purpose | source | methods | test cases | origin of test cases | usage
s1 | provide software inspection support (crista) | closed source | 2749 | 716 | computational tool | exs−at1, exs−at2
s2 | bibliographic reference management (jabref) | open source | 7100 | 514 | community | exs−at2, exs−at3, exs−at4
s3 | process automation (developed on demand) | closed source | 869 | 351 | test team | exs−at2
s4 | case tool (argouml) | open source | 18099 | 2272 | community | exs−at2, exs−at3, exs−at4

during these activities, users had to perform tasks in a given period while operating s1, s2, s3, and s4, and we automatically collected data to obtain the operational profile of each system. in the following subsections, we describe the strategy adopted for data collection, the activities of the exs, and their results.

5.1 strategy for data collection

in each activity, we instrumented the s1, s2, s3, and s4 software systems to collect data during their operation by the participating users. we adopted aspect-oriented programming (ferrari et al., 2013; laddad, 2009; rocha, 2005), which allows us to obtain information and to manipulate specific software parts without modifying the implementation of s1, s2, and s3. for s4, we developed a monitoring tool using the javassist framework, which allows the manipulation of java bytecode; this feature allowed us to monitor the execution of s4 and collect information while participants were operating it. although the aspect-oriented paradigm makes it possible to perform the instrumentation without modifying the source code of the software, it requires the created aspects to be compiled together with the software under instrumentation. javassist was adopted to perform the instrumentation without having to compile the software to be instrumented. we defined the strategy for data collection and applied it at the subprogram level: the developed tool and the instrumentation collect information about the execution of the methods of s1, s2, s3, and s4. from this information, we obtained the execution frequency of the methods processed during the execution of s1, s2, s3, and s4 in the activities.
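the paper does not publish the source of its monitoring tool; the sketch below shows, under our own assumptions, one way to implement this kind of method-level monitoring with javassist as a java agent. the package filter and the methodcounter helper are illustrative, not the authors' actual code; such an agent would be attached with -javaagent and a manifest declaring the premain class.

    import java.lang.instrument.ClassFileTransformer;
    import java.lang.instrument.Instrumentation;
    import java.security.ProtectionDomain;
    import javassist.ClassPool;
    import javassist.CtClass;
    import javassist.CtMethod;

    public class MethodCounterAgent {

        public static void premain(String args, Instrumentation inst) {
            inst.addTransformer(new ClassFileTransformer() {
                @Override
                public byte[] transform(ClassLoader loader, String className,
                                        Class<?> beingRedefined, ProtectionDomain pd,
                                        byte[] classfile) {
                    // only instrument classes of the monitored software (filter is illustrative)
                    if (className == null || !className.startsWith("org/argouml/")) {
                        return null;
                    }
                    try {
                        CtClass cc = ClassPool.getDefault()
                                .makeClass(new java.io.ByteArrayInputStream(classfile));
                        for (CtMethod m : cc.getDeclaredMethods()) {
                            if (!m.isEmpty()) {
                                // prepend a counter update; the original method body is preserved
                                m.insertBefore(
                                    "MethodCounter.hit(\"" + m.getLongName() + "\");");
                            }
                        }
                        byte[] out = cc.toBytecode();
                        cc.detach();
                        return out;
                    } catch (Exception e) {
                        return null; // on any failure, keep the unmodified class
                    }
                }
            });
        }
    }

    // hypothetical helper referenced by the injected code; it must be on the classpath
    class MethodCounter {
        static final java.util.concurrent.ConcurrentHashMap<String, java.util.concurrent.atomic.LongAdder>
                counts = new java.util.concurrent.ConcurrentHashMap<>();
        public static void hit(String signature) {
            counts.computeIfAbsent(signature,
                    k -> new java.util.concurrent.atomic.LongAdder()).increment();
        }
    }

dumping the counts map at shutdown would yield the kind of per-method frequency data described in section 5.2.1.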
5.2 exs–at1: evaluating the variation in how software is operated by users

we performed the exs−at1 activity to evidence whether there are relevant variations in how users operate the software to carry out the same task. to measure this variation, we obtained the operational profile of each user in this activity through data coming from the instrumented s1 software. in order to reduce the risks associated with the threats to validity of the activity, 30 undergraduate students of the computer science and computer engineering courses participated; these participants had equivalent experience and knowledge. we trained the participants to make them familiar with s1 and the concepts involved in its use, and we assigned the same task to all of them: inspecting the java source code of the s1 project, named software under inspection (sui), considering the object-oriented paradigm. we set a time limit for participants to complete the task, and tasks performed within the defined time period were considered successfully completed. thus, the data obtained from all participants were used in the activity.

we stored the data collected by the instrumentation of s1 and subsequently analyzed it. through this data, we identified the sop of each participant. it is worth noting that all participants had the same goal and the same artifacts to conclude the task. in the following subsections, we describe the analysis of the collected data and the results obtained in activity exs−at1.

5.2.1 exs–at1: data analysis

we grouped the data collected in the exs−at1 activity according to the participant who originated them; that is, for each participant, we obtained and recorded information about the execution of the s1 methods, allowing us to compute the execution frequency of each method. to identify the variations in how users operate s1, we created a representation of the operational profile of s1 for each participant. each representation corresponds to a homogeneous one-dimensional data structure that records the execution frequency of each method of s1 for one participant during the execution of the task. the structure elements represent the methods implemented in s1, regardless of whether they were executed during the activity or not. thus, each structure was composed of 2749 elements, corresponding to the 2749 methods implemented in s1 (table 4). to each of these elements, we assigned the execution frequency of the method when performing the activity; for non-executed methods, we assigned the numeric value 0.

figure 5. graphical representation of the data structure.

figure 5 presents a graphical representation of the data structure corresponding to a part of the s1 profile. we show some elements (m1, m2, m3, ..., m2749), each corresponding to a method implemented in s1. the number in each cell represents the execution frequency of a given method for a given participant after concluding the activity. thus, according to figure 5, four methods (m1, m3, m2748, and m2749) were not executed during the activity, while the remaining ones (m2, m100, m101, and m102) were executed 500, 10000, 15725, and 87000 times, respectively.

as the variation in how users operate s1 depends on the processed volume, the processed volume for each participant was measured. the s1 software is a computational tool that supports the inspection of source code based on the stepwise abstraction reading technique, whose purpose is to determine the program's functionality according to the functional abstractions generated from the source code (linger et al., 1979).
the s1 software analyzes the sui and, for each class, generates a treemap visual metaphor that provides a simple way to visualize the source code. code blocks are represented by rectangles arranged hierarchically; in the tool context, these rectangles are named declarations. when a declaration is selected, the respective source code is shown so the user can inspect it and register the functional abstraction for that declaration. a functional abstraction is an annotation inserted by the s1 user that represents the pseudo-code of the selected declaration. during the operation of s1, for each inspected class, the user assigns a functional abstraction to each declaration identified by the tool in the class, indicating that the declaration was inspected. the discrepancies found during the inspection process are recorded in the tool in a similar manner, i.e., by assigning the discrepancy to the declaration. figure 6 shows the s1 user interface during a class inspection.

figure 6. s1 user interface.

s1 provided metrics that allowed us to measure the processing volume generated by each participant. in this activity, the processing volume corresponds to the number of functional abstractions attributed to each class that structurally composes the sui, as well as to the number of discrepancies found in each class. thus, it was possible to determine which classes, and how much of each class, were inspected by each participant. it should be noted that the same tool configuration parameters were applied for all participants. in an attempt to obtain homogeneity in the processing volume generated by each participant, we grouped the participants according to the generated processing volume. for each participant, we calculated an indicator representing the processing volume: the ratio between the sum of abstractions and discrepancies over all classes for that participant and the sum of declarations of all classes. for instance, the total number of inspected software declarations was 1526, and the largest number of functional abstractions and discrepancies registered by one participant was 284; for this participant, the indicator value was 0.186 (284/1526). the calculated indicator was used to classify the participants. this classification allowed us to identify 3 groups of participants with similar indicator values; in other words, participants who demanded a similar processing volume were assigned to the same group. table 5 shows the created groups.

table 5. groups of participants in activity exs−at1.
group | participants
a | p10, p11, p12, p13, p30
b | p4, p5, p6, p7, p8, p9, p24, p25, p26, p27, p28, p29
c | p1, p2, p3, p14, p15, p16, p17, p18, p19, p20, p21, p22, p23

according to table 5, 30 individuals participated in the experiment: group a comprises the data obtained from 5 participants, group b from 12 participants, and group c from 13 participants. we compared the representations of the operational profile of s1 to highlight the variations in how the users operate the software. this comparison is possible through the data structures corresponding to these representations, and we only compared representations within the same group of participants.
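the processing-volume indicator reduces to a simple ratio; the sketch below reproduces the worked example from the text (284 abstractions and discrepancies over 1526 declarations). the class name is ours.

    public class ProcessingVolume {
        // (functional abstractions + discrepancies) / total declarations in the sui
        static double indicator(int abstractionsAndDiscrepancies, int totalDeclarations) {
            return abstractionsAndDiscrepancies / (double) totalDeclarations;
        }

        public static void main(String[] args) {
            System.out.printf("%.3f%n", indicator(284, 1526)); // prints 0.186
        }
    }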
as previously described, homogeneous one-dimensional data structures were used to generate the operational profile representations of s1. the elements of these data structures represent the methods implemented in s1, and their stored values correspond to the execution frequencies. as the number of elements and their association with the methods of s1 are common to all these structures, we compared the data stored in them, that is, the execution frequency of each method of s1: each element of one data structure was compared to the corresponding element of another data structure. each representation contained in a group was compared with all other representations contained in the same group. as an example, we compared the representation of the s1 operational profile generated from the data collected for participant p10 to the ones generated for participants p11, p12, p13, and p30 (table 5).

we defined an indicator to measure the variation in the execution frequency of each method between representations. the value of this indicator ranges from 0 to 1 and represents the difference between the execution frequency of a method stored in an element of one representation and the execution frequency stored in the respective element of another representation. the indicator is calculated for each comparison made between the elements of one representation and the respective elements of another, and its value corresponds to the ratio between the difference of the compared frequencies and the highest compared frequency. figure 7 illustrates this systematic approach to comparing the representations of the operational profile of s1.

figure 7. systematic approach to measure variances.

figure 7 evidences that: a) the closer the indicator value is to 1, the higher the difference between the execution frequencies of the evaluated method; and b) the closer the indicator value is to 0, the lower that difference. indicator values equal to 0 denote methods that the participants did not execute during the activity (by the definition above, they also arise when both participants executed the method with identical frequency); indicator values equal to 1 denote methods executed by only one participant during the activity. table 6 shows the results of the comparison between the operational profiles of s1 for the participants of group a.

table 6. comparison among participants in group a.
id | p-1 | p-2 | dmf | im
01 | p10 | p11 | 59 | 0.37
02 | p10 | p12 | 42 | 0.47
03 | p10 | p13 | 39 | 0.62
04 | p10 | p30 | 77 | 0.53
05 | p11 | p12 | 45 | 0.59
06 | p11 | p13 | 80 | 0.56
07 | p11 | p30 | 68 | 0.65
08 | p12 | p13 | 73 | 0.51
09 | p12 | p30 | 57 | 0.38
10 | p13 | p30 | 92 | 0.43

the value in column "id" identifies a comparison made between two representations of the operational profile of s1. the values in columns "p-1" and "p-2" identify the participants whose collected data gave rise to the compared representations. the value in column "dmf" is the number of methods whose indicator value was equal to 1, and the values in column "im" correspond to the average of the indicators computed from the differences between the execution frequencies recorded in the representations of the operational profile (figure 7).
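assuming the definitions above, the per-method indicator and the two aggregate measures of table 6 can be computed as in the sketch below. since the text does not state whether im averages over all methods or only over methods executed by at least one participant, the sketch assumes the latter; the sample frequencies are illustrative.

    import java.util.stream.IntStream;

    public class ProfileComparison {

        // |f1 - f2| / max(f1, f2), defined as 0 when neither participant executed the method
        static double indicator(long f1, long f2) {
            long max = Math.max(f1, f2);
            return max == 0 ? 0.0 : Math.abs(f1 - f2) / (double) max;
        }

        public static void main(String[] args) {
            long[] p12 = {500, 0, 15725, 87000}; // execution frequencies, one cell per method
            long[] p13 = {500, 300, 15725, 0};

            // dmf: methods executed by exactly one of the two participants (indicator == 1)
            long dmf = IntStream.range(0, p12.length)
                    .filter(i -> indicator(p12[i], p13[i]) == 1.0)
                    .count();

            // im: mean indicator over methods executed by at least one participant
            double im = IntStream.range(0, p12.length)
                    .filter(i -> Math.max(p12[i], p13[i]) > 0)
                    .mapToDouble(i -> indicator(p12[i], p13[i]))
                    .average().orElse(0.0);

            System.out.println("dmf = " + dmf + ", im = " + im); // dmf = 2, im = 0.5
        }
    }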
as an example, the result of the comparison between the representations of the operational profile of s1 obtained from participants p12 and p13 (line 08 of table 6) indicates that 73 methods were executed by only one of the two participants. the results also indicate that, on average, the execution frequencies of the methods differ by 0.51 for the compared participants, i.e., the frequency of these methods is approximately 50% higher for one of the participants. we created a graphical representation to facilitate the visualization of the differences between the operational profiles of two participants. as an example, figure 8 illustrates the results of the comparison between the representations obtained from p12 and p13.

figure 8. differences between the p12 and p13 representations.

in the graphical representation, each array element represents a method, and the information displayed in each element is the value of the indicator that quantifies the variation between the execution frequencies of the represented method. methods whose indicator value is 1 were registered in only one of the operational profile representations of s1 (cells painted black in figure 8); methods whose indicator value is between 0.5 (inclusive) and 1 (exclusive) are painted gray, and methods whose indicator value is below 0.5 are painted white.

5.2.2 exs–at1: results

we verified significant differences in the execution frequency of the methods of s1 when the participants were operating it. the methods not executed during the activity also differed significantly between participants. the average value of the indicator used to measure the variations in the execution frequencies was 0.51 for the participants of group a; for this same group, the average number of methods whose execution was registered in only one of the compared representations was 63.2. for groups b and c, these averages were, respectively, 0.5/66.19 and 0.57/43.75. given the exs–at1 results, significant variations were verified among the representations of the operational profiles, thus answering research question rq2.

5.3 exs–at2: sop vs. test profile

we performed the exs−at2 activity to obtain evidence of a possible misalignment between the sop and the tested software parts. to verify such a misalignment, we evaluated the operational profiles of s1, s2, s3, and s4, along with their test suites. we obtained the operational profile of s1 during exs−at1, and the same procedure was also applied to s2 and s3: as stated in section 5.1, we instrumented s2 and s3 to collect data while users operated the software, and these data allowed us to identify the sop of s2 and s3. the operational profile of s4 was identified with the tool developed to monitor its execution. undergraduate students of the technology in analysis and development of systems course participated in the activity as s2 users; we trained these participants, who had equivalent experience and knowledge, to use s2. we repeated the same process with postgraduate students of the web software development course, who also had equivalent experience and knowledge, to participate in the activity as s4 users.
in addition, public servants participated in the activity as s3 users, performing their daily tasks using the software features. the task assigned to s2 users was to operate s2 to record 10 bibliographic references; the task assigned to s4 users was to operate s4 to create a class diagram from a given software requirement specification. we set a time limit for the s2, s3, and s4 users to perform their tasks; tasks performed within the defined period were considered successfully completed, so the data obtained from all participants were used in the activities. the s2, s3, and s4 users obtained similar performance and results in their respective tasks.

in addition to the data that identified the sop of s1, s2, s3, and s4, we collected data about the execution of the test suites of these systems to obtain evidence of the mismatch between the sop and the tested software parts. the same procedure used to collect the data that provided the sop was used to collect data during the execution of the test suites. these data allowed us to obtain the test profile of s1, s2, s3, and s4. in this paper, we define the term "test profile" as the set of software parts executed when the test suite runs. note that the test cases of the studied software had different origins (as shown in table 4); we established this characteristic to allow the analysis of the sop against test cases defined and created under different strategies. we compared the test profile of s1, s2, s3, and s4 to the operational profile of the respective software to verify the mismatch between the sop and the tested software parts. in the following section, we describe the analysis and the results of the data obtained from these comparisons.

5.3.1 exs–at2: data analysis

we compared the test profiles of s1, s2, s3, and s4 to the operational profiles of the respective software in an attempt to find a possible mismatch between the sop and the test profile. as already described, in the context of this paper the sop is determined by the frequency of method executions. we classified the methods implemented in s1, s2, s3, and s4 based on their processing in the sop and in the test profile. thus, four classification categories are possible (one way to encode these categories is sketched below):

• category 0: method not executed in the sop and not executed in the test profile;
• category 1: method executed in the sop (by at least 1 participant) but not executed in the test profile;
• category 2: method not executed in the sop but executed in the test profile;
• category 3: method executed in both the operational and the test profiles.

as an example, figure 9 shows a fraction of the classification table of the methods implemented in s1; in this example, the test profile (0) is compared to the operational profiles of participants 0, 1, and 2. we also classified the methods implemented in s2, s3, and s4, generating a classification table for each system. the complete tables are available at http://lcvm.com.br/artigos/anexos/jserd2020/tabelas/.

figure 9. classification of software s1 methods.

in figure 9, for each method, we assigned a classification category resulting from the comparison between the sop and the test profile of s1.
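the four categories follow directly from two booleans per method. the sketch below shows one way to encode this classification; the frequencies are illustrative, and the sop column would be aggregated over all participants.

    public class MethodClassifier {

        static int category(boolean inSop, boolean inTestProfile) {
            if (!inSop && !inTestProfile) return 0; // untouched by users and by tests
            if (inSop && !inTestProfile)  return 1; // used in the field but untested
            if (!inSop)                   return 2; // tested but never used
            return 3;                               // used and tested
        }

        public static void main(String[] args) {
            long[] sopFreq  = {0, 500, 0, 12}; // per-method frequencies over all participants
            long[] testFreq = {0, 0, 40, 7};   // per-method frequencies from the test suite run
            for (int i = 0; i < sopFreq.length; i++) {
                System.out.println("m" + (i + 1) + " -> category "
                        + category(sopFreq[i] > 0, testFreq[i] > 0));
            }
            // prints categories 0, 1, 2, and 3 for m1..m4, respectively
        }
    }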
as an example, figure 9 shows a fraction of the classification table of the methods implemented in s1; in this example, the test profile (0) is compared to the operational profiles of participants 0, 1, and 2. we also classified the methods implemented in s2, s3, and s4, generating a classification table for each software. the complete tables are available at http://lcvm.com.br/artigos/anexos/jserd2020/tabelas/. for each method in figure 9, we assigned the classification category resulting from the comparison between sop and the test profile of s1. the columns "participant op id./test profile", "cl", and "freq" refer, respectively, to: a) the operational profile obtained by a participant compared to the test profile (the line below the column title identifies the compared participant and the test profile); b) the classification category assigned to the method; and c) the difference between the execution frequencies obtained in the operational profile of the participant and in the test profile.
figure 9. classification of software s1 methods.
in figure 10 we show the results of the comparison between sop and the test profile for each evaluated software (s1, s2, s3, and s4). for each software, the following information is provided:
• op ∩ tp: number of methods processed by at least 1 participant and processed by the test profile.
• op ⊄ tp: number of methods processed by at least 1 participant and not processed by the test profile.
• tp ⊄ op: number of methods processed by the test profile and not processed by the participants.
figure 10. results.
the results show that: a) 131 out of 280 methods from s1 processed by at least 1 of the participants were not processed by the test profile, and 30 methods processed by the test profile were not processed by the participants; b) 313 out of 1308 methods from s2 processed by at least 1 of the participants were not processed by the test profile, and 1340 methods processed by the test profile were not processed by the participants; c) 203 out of 437 methods from s3 processed by at least 1 of the participants were not processed by the test profile, and 134 methods processed by the test profile were not processed by the participants; and d) 4743 out of 8910 methods from s4 processed by at least 1 of the participants were not processed by the test profile, and 1319 methods processed by the test profile were not processed by the participants.

5.3.2 exs–at2: results
for the s1, s3, and s4 software, approximately 50% of the methods processed by sop were not processed by the test profile; for s2, the methods processed by sop and not processed by the test profile correspond to approximately 25%. it is also possible to verify the occurrence of methods processed by the test profile and not processed by sop for s1, s2, s3, and s4; for s2, these correspond to approximately 30%. the results thus show a mismatch between sop and the test profile for s1, s2, s3, and s4. according to rincon (2011), only one open-source software among the ten he researched obtained code coverage between 70% and 80%. if we consider this interval acceptable, then even in the best case we are delivering software with 20% to 30% of the source code never executed during the testing phase. according to ivanković et al. (2019), the median code coverage for all google projects with successful coverage computation between 2015 and 2018 varied between 80% and 85%, i.e., between 15% and 20% of the code was uncovered. thus, even if we accept, for the misalignment between sop and the test profile, a percentage range equal to the range of uncovered code reported by rincon (2011) and ivanković et al. (2019), i.e., between 15% and 30%, the results obtained from exs–at2 for s1, s3, and s4 exceed that acceptable range with respect to the methods processed by sop and not processed by the test profile. for s2, the obtained result stays at the limit of the considered acceptable range with respect to the methods processed by the test profile and not processed by sop.
these results show that there may be a misalignment between the sop and the tested software parts, providing an answer to research question rq3.

5.4 exs–at3: failures in untested sop parts
bach et al. (2017) investigated the relationship between the coverage provided by a test suite and its effectiveness. the approach adopted by bach et al. (2017) can also be used as another strategy to obtain evidence of a possible mismatch between sop and the tested software parts, as well as of the relation between this misalignment and software faults. the approach defines two scenarios referring to the investigated hypothesis:
1. coverage does not influence the detection of future bugs;
2. a high coverage rate can reduce the volume of future bugs.
bach et al. (2017) analyzed faults identified from the failures reported by software users and related the data obtained by this analysis to the coverage provided by the test suite of the respective software. in the context of this paper, we assumed that the failures reported by software users occurred in software parts that constitute the sop, since such failures occur during the operation of the software by users. as such, the software parts modified as a result of fault corrections constitute the sop and denote the occurrence of failures in the software parts that comprise the operational profile. given these considerations, the exs–at3 activity serves to verify:
1. whether the misalignment between sop and the tested software parts is relevant to software quality (faults not processed by the test profile do occur in sop parts); or
2. whether, although there is a misalignment between sop and the tested software parts, this misalignment is irrelevant to software quality (no faults were registered in sop parts not executed by the test profile).
we verified the fault history of s2 and s4. both are open-source software, and their source code is available on a hosting platform that provides resources to manage modifications in the source code.

5.4.1 analyzing failures in the untested sop of s2
by means of pull requests, we verified the changes in s2's source code classified as bug fixes. this verification allowed us to identify the s2 methods modified to attend to a bug fix. we identified 79 methods with corrections of faults identified through failures reported by users. as we assumed, these methods compose the sop identified through data provided by the software community (bug reports), named sopsup in this section. we compared the methods comprising sopsup to the methods processed by s2's test profile, identified in exs–at2. we found that the test profile did not execute 49 out of the 79 methods constituting the sopsup, i.e., sopsup parts not covered by the test suite in which we identified faults. sopsup is based on the assumption that the methods corrected due to failures reported by the community constitute the sop, i.e., that these failures were not generated by sporadic actions of users. based on this assumption, we verified whether the sopsup methods not processed by the test profile were contained in the sop obtained from the exs–at2 participants. among these methods, 7 were found in the sop obtained from the exs–at2 participants.
these 7 methods were classified as sop methods not processed by the test profile. this indicates that, if the approach used in this activity were applied to the sop obtained from real users in a real scenario, the 7 methods contained in sopsup, i.e., methods presenting defects, would possibly be found classified as methods in sop and would remain untested. thus, the approach applied in exs–at2 can improve new releases of the test suite, since it identifies untested and faulty parts of the sop.

5.4.2 analyzing failures in the untested sop parts of s4
unlike the procedure adopted to identify the sopsup of s2, we obtained the sopsup methods of s4 from a bug report available on its official website. an error log was associated with each reported bug; by utilizing these error logs, we identified 15 methods that revealed failures during their execution. these methods comprise the sopsup of s4. as with s2, we compared the sopsup of s4 to its test profile identified in exs–at2. we found that the test profile did not execute 5 out of the 15 methods constituting the sopsup, i.e., sopsup parts not covered by the test suite in which faults were identified.

5.4.3 exs–at3: results
table 7 summarizes the data obtained from s2 and s4 about existing failures in untested sop parts.
table 7. s2 and s4 sopsup parts in which faults were identified.
software | identified methods with faults | identified methods not covered by tests
s2       | 79                             | 49
s4       | 15                             | 5
the investigations performed in exs–at2 and exs–at3 provided evidence of a mismatch between sop and the tested software parts, and that failures occur in sop parts left untested. for s2, 62.02% of the sopsup parts in which faults were identified were not covered by the test profile; for s4, the respective value was 33.33%. this evidence answers research question rq4, showing that failures may occur in sop parts not covered by the test profile.

5.5 exs–at4: attempting to decrease the misalignment between the sop and the test profile
we performed the exs–at4 activity to assess whether a test strategy based on the use of an automated test data generator can contribute to reducing the possible misalignment between sop and untested software parts. for exs–at4, we selected the s2 and s4 software because we had used them in exs–at2 and exs–at3 and because they are more representative regarding the number of implemented methods. for each selected software, we generated a test set using an automated tool, named in this section s2tctool and s4tctool for s2 and s4, respectively. the sets of existing test cases for s2 and s4 are named in this section s2tcexis and s4tcexis (table 4). we used evosuite (fraser and arcuri, 2011), a tool that automatically generates junit tests for java software. for the generation of s2tctool and s4tctool, among the coverage criteria made available by the tool, we adopted the method coverage criterion, given that, in this paper, sop is represented by the execution frequency of the implemented methods. 4322 and 2803 test cases were generated for s2 and s4, respectively. we did not use sop data in the planning and execution of the exs–at4 test strategy, treating the sop as unknown for the generation of s2tctool and s4tctool; therefore, we generated test cases for all parts of s2 and s4.
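as an illustration of this generation step, the sketch below drives evosuite from python for a list of target classes. the class names, classpath, and jar location are hypothetical, and the exact command-line options may vary across evosuite versions; the method coverage criterion is requested because, in this paper, sop is represented by method execution frequencies:

    import subprocess

    EVOSUITE_JAR = "evosuite-1.0.6.jar"  # assumed local copy of the tool
    PROJECT_CP = "build/classes"         # compiled code of the software under test
    TARGET_CLASSES = ["org.example.RefManager", "org.example.DiagramEditor"]

    for klass in TARGET_CLASSES:
        # -class selects the class under test, -projectCP points to its bytecode,
        # and -Dcriterion=METHOD requests the method coverage criterion.
        subprocess.run(
            ["java", "-jar", EVOSUITE_JAR,
             "-class", klass,
             "-projectCP", PROJECT_CP,
             "-Dcriterion=METHOD"],
            check=True,
        )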
we incorporated the s2tctool and s4tctool test cases into s2tcexis and s4tcexis, respectively, thus obtaining, for s2 and s4, an extended test set resulting from the union of these sets. we named the extended test sets of s2 and s4 s2tcext and s4tcext, respectively. in table 8 we show the coverage of s2 and s4 provided by each set of test cases, in percentage values.
table 8. s2 and s4 software coverage provided by test cases.
software | tcexis | tctool | tcext
s2       | 15%    | 27%    | 30%
s4       | 32%    | 42%    | 60%
table 8 shows that the s2tcext and s4tcext test sets increased the coverage of s2 and s4 provided by s2tcexis and s4tcexis, respectively, meaning that new parts of s2 and s4 were tested and, consequently, the s2 and s4 test profiles were extended. we named the initial test profiles obtained from s2tcexis and s4tcexis s2tpini and s4tpini, and the extended test profiles of s2 and s4 s2tpext and s4tpext, respectively. to obtain s2tpext and s4tpext, we adopted the same procedure used to identify s2tpini and s4tpini, described in section 5.3. likewise, the procedure used to compare s2tpini and s4tpini to the sop of s2 and s4 was used to compare s2tpext and s4tpext to the respective sop.

5.5.1 exs–at4: data analysis
in figures 11 and 12 we show, for s2 and s4 respectively, the results obtained from the comparison between the sop and the extended test profile. the results obtained by comparing the sop of these software systems to the initial test profiles (s2tpini and s4tpini) are presented again in figures 11 and 12 for comparison with the results obtained from s2tpext and s4tpext.
figure 11. s2tpini and s2tpext results.
figure 12. s4tpini and s4tpext results.
the categories op ∩ tp, op ⊄ tp, and tp ⊄ op shown in figures 11 and 12 were defined in section 5.3.1. in figures 11 and 12, we can see that:
1. 143 out of 1328 methods from s2 processed by at least 1 of the participants were not processed by the tpext, and 2524 methods processed by the test profile were not processed by the participants;
2. 4189 out of 8910 methods from s4 processed by at least 1 of the participants were not processed by the tpext, and 2977 methods processed by the test profile were not processed by the participants.

5.5.2 exs–at4: results
in table 9 we show the differences between the results obtained with tpini and tpext.
table 9. comparison of the results obtained by the test profiles.
        | s2: sop vs tpini | s2: sop vs tpext | variation (%) | s4: sop vs tpini | s4: sop vs tpext | variation (%)
op ∩ tp | 995              | 1185             | 19.09 (+)     | 4167             | 4721             | 13.29 (+)
op ⊄ tp | 333              | 143              | 57.05 (-)     | 4743             | 4189             | 11.68 (-)
tp ⊄ op | 1340             | 2524             | 88.35 (+)     | 1319             | 2977             | 125.7 (+)
compared to the results obtained with s2tpini and s4tpini, the test strategy adopted in exs–at4 reduced the number of methods processed by sop and not processed by the test profile (s2tpext and s4tpext), being more effective for s2. it is noteworthy, however, that s2 is less representative than s4 regarding the number of implemented methods; for s4, the adopted strategy reduced the number of methods processed by sop and not processed by the test profile (s4tpext) by approximately 10% compared to the initial test profile (s4tpini). the adopted test strategy also reduced the number of methods that constitute the sopsup of s2 and s4 and were not covered by the respective test profiles, s2tpini and s4tpini: for s2, 2 out of the 49 sopsup methods not processed by s2tpini were processed by s2tpext; for s4, 1 out of the 5 sopsup methods not processed by s4tpini was processed by s4tpext.
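the percentage variations in table 9 follow directly from the before/after counts; a minimal sketch of the computation, using the values reported above (the output matches table 9 up to rounding):

    def percent_change(before, after):
        """relative variation between the tpini and tpext counts, as in table 9."""
        sign = "+" if after > before else "-"
        return f"{abs(after - before) / before * 100:.2f} ({sign})"

    # (tpini, tpext) counts taken from figures 11 and 12
    s2 = {"op ∩ tp": (995, 1185), "op ⊄ tp": (333, 143), "tp ⊄ op": (1340, 2524)}
    s4 = {"op ∩ tp": (4167, 4721), "op ⊄ tp": (4743, 4189), "tp ⊄ op": (1319, 2977)}

    for name, rows in (("s2", s2), ("s4", s4)):
        for metric, (ini, ext) in rows.items():
            print(name, metric, percent_change(ini, ext))
    # e.g., "s2 op ⊄ tp 57.06 (-)" and "s4 op ⊄ tp 11.68 (-)"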
the adopted test strategy aimed to reduce the misalignment between sop and the test profile by enlarging the set of existing test cases of s2 and s4 with an automated tool. we did not use sop data in the planning and execution of the test strategy, treating the sop as unknown for the automatic generation of test cases; this implied generating test cases for all parts of s2 and s4, which demands time and processing, since generation costs depend on the applied criteria and parameters as well as on the size of the software for which the test cases are generated. in response to research question rq5, we observed that the test strategy reduced the misalignment; however, even generating test cases for all parts of s2 and s4 and incorporating them into the existing test sets did not avoid the misalignment between sop and the test profile of s2 and s4. in addition, the automated test generator generates only the test data and assumes the produced output is correct. as such, even having improved the coverage of sop, we still need to verify whether the resulting output corresponds to the expected output according to the software specification. thus, the data obtained from the sop are relevant and can be used in existing testing strategies or in the definition of new strategies, contributing to their effectiveness and efficiency.

6 lessons learned
first of all, we would like to make it clear that the results obtained so far are not conclusive; they are part of an ongoing work (cavamura júnior, 2017), and more experimental studies are coming. however, based on the data presented in section 5, we can provide some directions (albeit not exhaustive) on how to use the knowledge about sop in favor of software quality.
• we verified during the experimental studies that the identification of sop through instrumentation may affect software performance and produce a huge volume of data, depending on the level of fragmentation adopted. nevertheless, the information obtained about the sop can contribute to software test activities.
• high levels of coverage do not necessarily indicate that a test set is effective in detecting faults, and it is unlikely that the use of a fixed coverage value as a quality target will produce an effective test set (inozemtseva and holmes, 2014). our data indicates that a good test set is one with good coverage of the software parts related to the sop. in the occurrence of misalignment between the sop and the tested software parts, the sop can also be used as a criterion for generating test cases to improve the test suite and minimize the misalignment.
• another possible use of the sop is related to what de andrade freitas et al. (2016) called "market vulnerability", wherein each fault in software affects users differently. we should avoid, as much as possible, bothering most of our users with constant failures in the features most important from their point of view; the sop reflects these software areas. it is possible to use sop to assess the impact caused by each fault on software operability. thus, a rank of known faults can be built based on their impact on the majority of users, providing information able to assist in pricing these faults with respect to the software market (a minimal sketch of such a ranking appears after this list).
• since the sop represents the most used parts of the software, information about the sop can be used as a criterion to prioritize any other activities inherent to the software development process.
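to illustrate the ranking idea from the list above, the sketch below orders known faults by the operational weight of the methods they affect, using sop execution frequencies as a proxy for user impact. all names and numbers are hypothetical:

    # execution frequencies from the sop (method -> relative frequency)
    sop_frequency = {"search()": 0.42, "save()": 0.25, "export()": 0.03}

    # known faults mapped to the methods in which they were revealed
    faults = {
        "bug-101": ["search()"],
        "bug-102": ["export()"],
        "bug-103": ["save()", "export()"],
    }

    def impact(methods):
        # a fault affects more users when it sits in frequently operated methods
        return sum(sop_frequency.get(m, 0.0) for m in methods)

    ranking = sorted(faults, key=lambda f: impact(faults[f]), reverse=True)
    print(ranking)  # ['bug-101', 'bug-103', 'bug-102'] -> fix search() faults first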
7 threats to validity
regarding the exs activities, we considered the participants' level of knowledge a threat to validity. to minimize the risks, we selected undergraduate and postgraduate students with the equivalent experience and knowledge required to perform the activity to operate the s1, s2, and s4 software, and we conducted training on s1, s2, and s4 as well as a review of the theoretical concepts inherent to them. as s3 was developed on demand, participants already knew the processes automated by it. in exs–at2, the execution of some test cases belonging to the test sets of s1, s2, and s4 finished with errors: 0.69% of the automatically generated test cases for s1, 1.36% for s2, and 17.10% for s4. with the configuration and execution environment in conformity, we chose not to modify the implementation of the existing test cases to eliminate the execution errors. we considered these errors a threat to validity because, as a result of them, some methods may have been executed that would otherwise not be part of the test profile. in the exs–at3 activity, we assumed that failures reported by users were revealed by the software parts composing sop, i.e., that these failures did not occur in operations sporadically processed by users; we are performing a more comprehensive exs using data obtained from free software repositories. in exs–at4, the execution of some test cases automatically generated for s2 (s2tctool) and s4 (s4tctool) rendered errors: 4.2% of the automatically generated test cases for s2 and 0.53% for s4. although these errors have low representativeness, they are considered a threat to validity, since, as a result of them, some methods may have been executed that would otherwise not be part of the extended test profiles s2tpext and s4tpext. in further experiments we intend to investigate the cause of such errors and compute their impact on the test profile.

8 conclusions
this paper investigates the possible mismatch between sop and the tested software parts by introducing the term "test profile". the results provided answers to the defined research questions, stating: a) the originality of this study; b) that there are significant variations in the way software is used by users; c) that there may exist a misalignment between the sop and the test profile; d) that the existing misalignment is relevant, given the evidence that failures occur in untested sop parts; and e) that although the adopted test strategy reduced the misalignment between the sop and the test profile, it was not enough to avoid the misalignment. the answers to the research questions provide the expected contributions of this work. these contributions may motivate new research or contribute to existing research in software engineering, more specifically in the field of software quality.
the contributions also show that information about software operational profiles can contribute to the software quality activities applied in industry, since the quality of software also depends on its operational use (cukic and bastani, 1996). thus, the contributions provide evidence that sop is relevant not only to activities that determine software reliability but also to the planning and execution of the test activity, regardless of the adopted test strategy. for future research, we intend to improve software quality from the users' point of view considering the sop (cavamura júnior, 2019). we expect the proposed strategy to allow: (i) dynamically adapting an existing test suite to the sop; and (ii) using sop as a prioritization criterion that, given a set of faults, identifies the ones causing the most significant impact on users' experience when operating the software, so that such impact can be considered when pricing the faults for correction, alongside other criteria. we are investigating the use of machine learning and genetic algorithms to enable the proposed strategy. lastly, we are working on the implementation of a tool to automate the proposed strategy and to provide support for technology transfer and experimentation.

references
ali-shahid, m. m. and sulaiman, s. (2015). improving reliability using software operational profile and testing profile. in 2015 international conference on computer, communications, and control technology (i4ct), pages 384–388. ieee press.
amrita and yadav, d. k. (2015). a novel method for allocating software test cases. in 3rd international conference on recent trends in computing 2015 (icrtc-2015), volume 57, pages 131–138, delhi, india. elsevier.
assesc, f. (2012). propriedade intelectual e software - cursos de mídia eletrônica e sistema de informação.
bach, t., andrzejak, a., pannemans, r., and lo, d. (2017). the impact of coverage on bug density in a large industrial software project. in 2017 acm/ieee international symposium on empirical software engineering and measurement (esem), pages 307–313.
basili, v. r., caldiera, g., and rombach, d. h. (2002). encyclopedia of software engineering, volume 1, chapter the goal question metric approach, pages 528–532. john wiley & sons.
begel, a. and zimmermann, t. (2014). analyze this! 145 questions for data scientists in software engineering. icse 2014, pages 12–23.
bertolino, a., miranda, b., pietrantuono, r., and russo, s. (2017). adaptive coverage and operational profile-based testing for reliability improvement. in proceedings of the 39th international conference on software engineering, icse '17, pages 541–551, piscataway, nj, usa. ieee press.
bittanti, s., bolzern, p., and scattolini, r. (1988). an introduction to software reliability modelling, chapter 12, pages 43–67. springer berlin heidelberg, berlin, heidelberg.
cavamura júnior, l. (2017). impact of the use of operational profile on software engineering activities. phd thesis, computing department, federal university of são carlos, são carlos, sp, brazil. ongoing phd project (in portuguese).
cavamura júnior, l. (2019). operational profile and software testing: aligning user interest and test strategy. in 2019 12th ieee conference on software testing, validation and verification (icst), pages 492–494.
cavamura júnior, l., fabbri, s. c. p. f., and vincenzi, a. m. r. (2020).
software operational profile: investigating specific applicabilities. in proceedings of the xxiii ibero-american conference on software engineering, cibse'2020, curitiba, pr, brazil. curran associates. accepted for publication. (in portuguese).
chen, m. h., lyu, m. r., and wong, w. e. (2001). effect of code coverage on software reliability measurement. ieee transactions on reliability, 50(2):165–170.
cukic, b. and bastani, f. b. (1996). on reducing the sensitivity of software reliability to variations in the operational profile. in proceedings of the international symposium on software reliability engineering, issre, pages 45–54, white plains, ny, usa. ieee, los alamitos, ca, united states.
de andrade freitas, e. n., camilo-junior, c. g., and vincenzi, a. m. r. (2016). scout: a multi-objective method to select components in designing unit testing. in xxvii ieee international symposium on software reliability engineering, issre'2016, pages 36–46, ottawa, canada. ieee press.
falbo, r. a. (2005). engenharia de software.
ferrari, f. c., rashid, a., and maldonado, j. c. (2013). towards the practical mutation testing of aspectj programs. science of computer programming, 78(9):1639–1662.
fraser, g. and arcuri, a. (2011). evosuite: automatic test suite generation for object-oriented software. in proceedings of the 19th acm sigsoft symposium and the 13th european conference on foundations of software engineering, esec/fse '11, pages 416–419, new york, ny, usa. acm.
fukutake, h., xu, l., takagi, t., watanabe, r., and yaegashi, r. (2015). the method to create test suite based on operational profiles for combination test of status. in 2015 ieee/acis 16th international conference on software engineering, artificial intelligence, networking and parallel/distributed computing, snpd 2015 proceedings, pages 1–4, white plains, ny, usa. institute of electrical and electronics engineers inc.
gittens, m., lutfiyya, h., and bauer, m. (2004). an extended operational profile model. in proceedings international symposium on software reliability engineering, issre, pages 314–325, saint-malo, france.
inozemtseva, l. and holmes, r. (2014). coverage is not strongly correlated with test suite effectiveness. in proceedings of the 36th international conference on software engineering, icse 2014, pages 435–445, new york, ny, usa. association for computing machinery.
ivanković, m., petrović, g., just, r., and fraser, g. (2019). code coverage at google. in proceedings of the 2019 27th acm joint meeting on european software engineering conference and symposium on the foundations of software engineering, esec/fse 2019, pages 955–963, new york, ny, usa. association for computing machinery.
kashyap, a. (2013). a markov chain and likelihood-based model approach for automated test case generation, validation and prioritization: theory and application. proquest dissertations and theses, the george washington university.
laddad, r. (2009). aspectj in action: enterprise aop with spring applications. manning publications co., greenwich, ct, usa, 2nd edition.
leung, y.-w. (1997). software reliability allocation under an uncertain operational profile. journal of the operational research society, 48(4):401–411.
linger, r. c., mills, h. d., and witt, b. i. (1979). structured programming theory and practice. in the systems programming series.
mafra, s. n., barcelos, r. f., and travassos, g. (2006). applying an evidence based methodology to define new software technologies.
in xx brazilian symposium on software engineering, sbes'2006, pages 239–254, florianópolis, sc, brazil. available at: http://www.ic.uff.br/~esteban/files/sbes-prova.pdf. access on: 05/04/2020. (in portuguese).
musa, j. (1993). operational profiles in software-reliability engineering. ieee software, 10(2):14–32.
musa, j. and ehrlich, w. (1996). advances in software reliability engineering. advances in computers, 42(c):77–117.
musa, j. d. (1979). software reliability measures applied to systems engineering. in managing requirements knowledge, international workshop on (afips), volume 00, page 941. ieee.
musa, j. d. (1994). adjusting measured field failure intensity for operational profile variation. in proceedings of the international symposium on software reliability engineering, issre, pages 330–333, monterey, ca, usa. ieee, los alamitos, ca, united states.
nakagawa, e. y., scannavino, k. r. f., fabbri, s. c. p. f., and ferrari, f. c. (2017). revisão sistemática da literatura em engenharia de software: teoria e prática. elsevier brasil.
namba, y., akimoto, s., and takagi, t. (2015). overview of graphical operational profiles for generating test cases of gui software. in k., s., editor, 2015 ieee/acis 16th international conference on software engineering, artificial intelligence, networking and parallel/distributed computing, snpd 2015 proceedings, pages 1–3, white plains, ny, usa. institute of electrical and electronics engineers inc.
poore, j., walton, g., and whittaker, j. (2000). a constraint-based approach to the representation of software usage models. information and software technology, 42(12):825–833.
pressman, r. s. (2010). software engineering: a practitioner's approach. mcgraw-hill, new york, ny, 7th edition.
rincon, a. m. (2011). qualidade de conjuntos de teste de software de código aberto: uma análise baseada em critérios estruturais.
rocha, a. d. (2005). uma ferramenta baseada em aspectos para apoio ao teste funcional de programas java.
shukla, r. (2009). deriving parameter characteristics. in proceedings of the 2nd india software engineering conference, isec 2009, pages 57–63, new york, ny, usa. acm.
shull, f., carver, j., and travassos, g. h. (2001). an empirical methodology for introducing software processes. in proceedings of the 8th european software engineering conference held jointly with 9th acm sigsoft international symposium on foundations of software engineering, esec/fse-9, pages 288–296, new york, ny, usa. association for computing machinery.
sommerville, i. (1995). software engineering. addison-wesley, wokingham, england, fifth edition.
sommerville, i. (2011). software engineering. addison-wesley, harlow, england, 9th edition.
takagi, t., furukawa, z., and yamasaki, t. (2007). an overview and case study of a statistical regression testing method for software maintenance. electronics and communications in japan part ii: electronics, 90(12):23–34.
travassos, g. h., dos santos, p. s. m., mian, p. g., neto, p. g. m., and biolchini, j. (2008). an environment to support large scale experimentation in software engineering. in 13th ieee international conference on engineering of complex computer systems (iceccs 2008), pages 193–202.
journal of software engineering research and development, 2021, 9:2, doi: 10.5753/jserd.2021.742
this work is licensed under a creative commons attribution 4.0 international license.
software industry awareness on sustainable software engineering: a brazilian perspective
leila karita [ federal university of bahia | leila.karita@ufba.br ]
brunna caroline mourão [ federal university of bahia | brunna.caroline@ufba.br ]
luana almeida martins [ federal university of bahia | martins.luana@ufba.br ]
larissa rocha soares [ federal university of bahia | larissars@dcc.ufba.br ]
ivan machado [ federal university of bahia | ivan.machado@ufba.br ]
abstract
sustainable computing is a rapidly growing research topic spanning several areas of computer science. particularly, it has received increasing attention in the software engineering field in the last years, with several studies discussing the topic from a range of perspectives. however, few studies have demonstrated the awareness of software practitioners about the underlying concepts of sustainability in the software development practice. in an earlier investigation, we performed a preliminary study on the practitioners' perception under four main perspectives: economic, social, environmental, and technical. this study extended the previous survey and reached ninety-seven respondents from brazilian companies. the extension aims to expand the results in order to compare and explore the previous findings in more depth. the novel results confirmed the evidence raised in the original survey that sustainability in the context of software engineering is a new subject for practitioners. however, professionals have shown interest in the topic, and there is a general understanding that sustainability should be treated as a quality attribute. from the observed perspectives, we generated an initial theory showing that software practitioners engage with the subject of 'green in software', even if unconsciously. this study brings evidence of how the industry understands and perceives sustainability practices in the software development process.
keywords: sustainable software engineering, survey research, empirical software engineering

1 introduction
sustainability has been increasingly discussed in the software engineering (se) field (mourão et al., 2018). as more and more software applications are launched in the market as a means to make daily activities easier, there is an increased interest in understanding how such solutions might affect the environment. the impact of technology on daily lives can be seen from two antagonistic perspectives: technology as a contributor either to mitigating or to producing environmental impact.
whereas technology helps organizations address environmental issues by providing many improvements (e.g., virtual meetings and improvements in logistics), it is often responsible for environmental degradation, for instance by consuming large amounts of energy through the engineering processes used to make products (calero and piattini, 2017). developing, maintaining, and evolving energy-efficient software solutions is rather challenging (pinto and castor, 2017). the software development life-cycle is not suitable for identifying the effects of the software system on sustainability (dick et al., 2010). therefore, sustainable thinking is still a new and challenging practice for software engineers and developers. environmental sustainability has been deemed a non-functional requirement (nfr) to consider in an se process (calero and bertoa, 2013; venters et al., 2014; becker, 2014; penzenstadler et al., 2014b). however, this is not commonly employed yet. authors claim that there is still a need for more discussion on "sustainable requirements" to understand how the term has been used in the se field (venters et al., 2017). it is rather important to explicitly identify sustainability requirements and ensure they can be properly monitored and tested along the software development life-cycle. indeed, the definition of sustainable software development is not yet clear in the literature and, as a consequence, may be misunderstood. in a previous work (karita et al., 2019), we presented a survey study conducted with twenty-five software engineers from brazilian companies involved in projects from different domains. the study investigated the practitioners' perception of sustainable practices for software development and provides readers with an overview of the main discussions about software sustainability concepts. the yielded results indicate an overall lack of knowledge about the topic, in particular related to sustainable software concepts. in this study, we conducted an exploratory investigation to identify whether, and to what extent, practitioners know about the topic. therefore, we extended our previous work in two directions: (i) we extended the initial survey, which reached ninety-seven respondents; and (ii) we established an initial common understanding of the definitions of "sustainable software" based on the achieved results. from the survey, we are interested in finding further insights and evidence that may reveal the importance of promoting the sustainable software development field. this study also aims to leverage the state-of-the-practice in sustainable software development under four main perspectives: economic, social, environmental, and technical. additionally, our study deals with the practitioners' lack of knowledge about sustainable software development, extracting as much information as possible to explore the state-of-the-practice. in this sense, our earlier study allowed us to identify
that even when practitioners do not know the formal concept of sustainable software development, they can discuss its importance through transitivity, i.e., they are able to apply general concepts of sustainability in the context of software development. the survey results confirm the findings of the former study (karita et al., 2019). we found that sustainability in the context of software is a new issue for most respondents: software professionals are not concerned with sustainability throughout the software development life-cycle and have little knowledge about the environmental impacts of not using sustainable practices. however, they showed interest in the topic, and there is a general understanding that sustainability should be treated as a quality attribute. the theory showed that software engineers explore the technical and environmental dimensions in practice; that is, they may unconsciously practice 'green in software'. the remainder of this paper is organized as follows. section 2 presents the underlying concepts of green and sustainable software. section 3 presents the survey conducted to investigate the awareness of practitioners of sustainability. section 4 presents the grounded theory conducted to generate an initial common understanding of sustainability. section 5 provides an in-depth discussion of the yielded results. section 6 discusses the implications for research and practice. section 7 discusses related work. section 8 discusses the threats to validity. section 9 draws concluding remarks.

2 green and sustainable software
sustainability has been discussed in several sectors of our society. etymologically, the word sustainable comes from the latin sustare, which means "to sustain", "to support", and "to conserve". the term "sustainable development" was coined in 1987 by gro harlem brundtland, who published a book (our common future) stating: "meeting the needs of the present without compromising the ability of future generations to meet their own needs" (imperatives, 1987). according to calero and piattini (2015a), when we take a closer look at the above definition, we can observe that two fundamental pillars underpin sustainability: "the capacity of something to last a long time" and "the resources used". when addressing sustainability in the context of se, we find several definitions of sustainable se in the literature (calero and piattini, 2015a). tate (2005) defines sustainable se as development that is able to balance rapid release and long-term sustainability. according to khandelwal et al. (2017), sustainable se consists of processes and practices that help produce sustainable software and everything related to the software product, be it development or maintenance, taking environmental aspects into account. to erdelyi (2013), se can be sustainable by producing sustainable software with environmental awareness and by minimizing waste during the software development process. in general, sustainable se can be interpreted as the art of developing sustainable software through a sustainable se process. its goal is to enhance se practices with attention to the direct and indirect consumption of natural resources and energy, as well as to the consequences caused by software systems throughout their life-cycle (johann et al., 2011). sustainability in se can be expressed in several dimensions; for example, imperatives (1987) addresses the economic, social, and environmental dimensions.
the economic dimension refers to the production and consumption of resources and services. the social dimension refers to people and their living and labor conditions, such as education, health, leisure, social equity, livability, and other aspects. the environmental dimension refers to the natural resources of the planet and the way society uses them. the environmental dimension of sustainability is also called the green dimension (moraga et al., 2017; calero and piattini, 2017, 2015b; garcía-mireles et al., 2017; murugesan, 2008). the term green has been interpreted in two ways in the literature: (1) green in software, which relates to developing more environment-friendly software, i.e., the software is developed in a green manner and produces a green software product; and (2) green by software, which refers to software developed with a focus on the preservation of the environment, i.e., the software is the tool supporting sustainability goals. similarly, penzenstadler et al. (2014a) also interpret sustainable software in two ways: (1) as software code that is sustainable, agnostic of purpose; and (2) as software whose purpose is to achieve sustainability goals. likewise, dick et al. (2010) emphasize that sustainable software focuses on reducing the consumption of natural resources and energy in the sense of the environmental, social, and economic dimensions. in addition to those three dimensions, other studies also add the technical and individual dimensions to analyze software sustainability. the technical dimension addresses the long-term use of software and its evolution in a constantly changing execution environment (moraga et al., 2017; lago et al., 2015; fernandez et al., 2016). the individual dimension concerns how to maintain and create software in a way that takes into account developers' satisfaction with their work over a long period (penzenstadler and femmer, 2013; penzenstadler, 2014; becker et al., 2015). saputri and lee (2016) bring an inclusive view regarding those dimensions: they state that sustainability should be considered an integrated concept, taking each sustainability dimension into account. all of these dimensions can be analyzed from various perspectives; thus, the choice of the approach to be adopted should take into account the purpose and scope to be investigated. furthermore, other works discuss more technical aspects, such as the sustainability of the software life-cycle. according to johann et al. (2011), green and sustainable software is the enhancement of se to deal with the consumption of natural resources and energy during the entire software development life-cycle. moreover, hilty et al. (2006) and penzenstadler et al. (2014a) agree that sustainable software can be interpreted as either software developed to support sustainability goals or software code developed through a sustainable process. both interpretations converge to a software system that contributes to more sustainable living. while sustainable software aims to improve the sustainability of humankind on our planet, a sustainable process is a key requirement for developing sustainable software. thus, a sustainable process may consider environmental, technical, and economic impacts during the software life-cycle and involves the pursuit of sustainable development goals.
in summary, the literature presents several viewpoints, which different researchers describe from their own perspective and area of expertise. consequently, the same may happen in the software industry. the lack of a unanimous definition can lead to isolated contributions, and practitioners might see se sustainability from different perspectives.

3 understanding the awareness of practitioners on sustainability
this study investigates sustainability in the context of software development from an industry standpoint. in this sense, we conducted a survey to gather the opinions of professionals about sustainability without introducing participants to the topic, since doing so could skew the answers. thus, we aim to identify the awareness of software professionals on sustainability in se.

3.1 research questions
we defined six research questions (rq) to understand practitioners' familiarity, application, perceived importance, practices, models, and tools regarding sustainability. these are described next.
rq1: are the professionals familiar with the concepts of sustainability applied to the software development process? this question investigates at what level the professionals are familiar with concepts related to sustainability in the context of software.
rq2: how important is software sustainability for practitioners? this question investigates whether and at what level professionals consider sustainability an important factor in the software development process, from a personal perspective and from the software industry's perception.
rq3: what phases of the software development life-cycle (sdlc) do sustainable practices apply to? this question identifies in which sdlc phases the developers have adopted any sustainable se practices, in the general and specific scope.
rq4: what dimensions of sustainability (technical, environmental, social, and economic) have been explored in software development practice? this question aims to investigate which of these dimensions have been most exploited by industry.
rq5: what models for sustainable software development have been adopted by the software industry? this question investigates whether and which models for sustainability in software have been adopted by professionals. the purpose is to obtain models from the industry that we did not retrieve in the previous study.
rq6: what tools have been used to support sustainability in the software development process? aligned with the preceding rq, we aim to explore tools that have been adopted in practice and collaborate with sustainability, based on the intuitive knowledge of the respondents.

3.2 survey design
we designed the questionnaire to keep it as brief as possible while still enabling us to collect all relevant information. the stated questions seek to understand practitioners' motivations and knowledge regarding green practices in software development. in our previous study, we conducted a survey with twenty-five software professionals from brazilian companies. this study extends the previous one with seventy-two new respondents, professionals from several brazilian software companies based in different regions of the country; therefore, it presents the results for a total of ninety-seven respondents. this section encompasses the planning details, execution procedures, and reporting of desired and achieved results (the resulting data are publicly available at http://doi.org/10.5281/zenodo.3976312). we used the methodology proposed by kasunic (2005) and applied the research survey principles defined by kitchenham and pfleeger (2002).
figure 1 shows the adapted methodological steps employed in this extended study. steps 1 to 7 were conducted during the survey design in our previous work; these steps were reviewed and re-applied to accomplish steps 8 to 10 during this extension.
figure 1. survey design adapted from (kasunic, 2005)
target audience. to ensure valid results, we only selected professionals with knowledge of software development processes. the following criteria were considered:
1. professional experience in the se field.
2. professional role in the company: practitioners should be involved in the software development process in at least one of the following roles: project manager, project leader, system analyst, requirements analyst, system architect, business analyst, developer, tester, product owner, and/or scrum master.
questionnaire. we reviewed and applied the same instrument from the preliminary survey study, in which we specified six information groups: respondents characterization, companies characterization, research objective, company development process, difficulties encountered, and sustainability as a quality attribute. we next describe each category.
• respondents characterization: in this category, the goal was to investigate the respondent profile, with information about gender, name, age, level of education, and professional experience;
• companies characterization: this category investigates the locality, segment, size, time in operation, certifications, and level of environmental awareness (in any aspect, not necessarily regarding sdlc processes) of the companies, as well as the function performed by the respondents in the company;
• research objective: in this category, the goal was to investigate the respondent's knowledge regarding concepts related to software sustainability, as well as the importance the respondent attributes to relating sustainability to the software context;
• company development process: in this category, the goal was to investigate the software development process of the company the respondent works for and to identify whether and at what level sustainability practices have been applied;
• difficulties encountered: this category investigates the likely benefits expected regarding the application of sustainability practices in the software development processes;
• sustainability as a quality attribute: in this category, the goal was to investigate the interviewee's perception of the importance of using sustainability as a quality attribute in their projects.
we sent the replication of the reviewed questionnaire on february 1st, 2020. the extended survey instrument was hosted on google forms, like the first survey. the url to access it was sent by (1) personalized email to our contact list and (2) whatsapp messages to our contacts. we closed the survey on february 29th, 2020. a brief introduction provided basic information about the purpose of the study, the justification for the respondent's selection, and the importance of their participation; participants were also informed about the privacy policies of the study.
data analysis. this research is an exploratory and qualitative study. to achieve the objectives, we adopted the following assumptions about the instrument:
1. for closed questions that could combine multiple responses, the sum of percentages could exceed 100%.
2. for closed questions that followed the same pattern of responses, we applied a five-point likert scale, from irrelevant (1) to very important (5).
3. for the open question about the concept of sustainability in the software development process, we applied a coding strategy: two of the authors extracted the general themes of the answers and, using these themes, held discussion sessions to develop a single coding scheme. the results were collected and translated into an appropriate graphic image to facilitate understanding.
4. for the other open questions, we include excerpts from the qualitative answers to clarify the results. each excerpt is followed by a number that uniquely identifies the respondent who expressed the opinion; for example, [#1] indicates the answer of respondent number 1.

3.3 survey results
in this section, we report the results of our extended survey study.

3.3.1 respondents' demographics
this section describes the demographics of the respondents. we investigated their gender, age, and experience time to draw the profile of the observed sample. regarding gender, 63% of the respondents were men, 36% were women, and 1% identified as other. figure 2 shows the respondents' age: 1% were between 15 and 19 years old, 12% between 20 and 24, 22% between 25 and 29, 27% between 30 and 34, 21% between 35 and 39, 8% between 40 and 44, 6% between 45 and 49, and 3% were over 50. more than 50% of the respondents are concentrated in the 25 to 39-year-old range.
figure 2. distribution of respondents by age
in terms of their professional experience in software development, figure 3 shows that 28% had up to 3 years of experience, 18% had between 4 and 6 years, 14% had between 7 and 10 years, and 40% of the respondents had more than 10 years of experience in the industry.
figure 3. distribution of respondents by experience time
finally, the respondents informed the roles they play in the companies: 22% work as software developers, 20% as system analysts, 10% as requirements analysts, 9% as project leaders, 7% each as business analysts and software testers, 5% as software architects, 4% each in project management and as scrum masters and product owners, 3% as consultants, 2% in portfolio management, and 1% each as designers, data analysts, and dbas. in this question, the respondents could select more than one option.

3.3.2 companies' demographics
this section describes the demographics of the companies in which the respondents work, in terms of segment and size. the respondents work in companies of different segments. figure 4 shows that 51% of the respondents work in software factories, 21% in government companies, and 8% each in startups and consulting companies; the others add up to 12%, working in other segments. regarding company size, 87% of the respondents reported that their company is "large", that is, it has more than 99 employees.
figure 4. distribution of respondents by company segment
in terms of certifications related to the software development process, 48% of the respondents could not tell whether the company in which they work has some certification, and 24% reported that the company is not certified.
the other 28% of respondents reported that the company has some of the following certifications: capability maturity model integration (cmmi) levels 3 and 5, mps.br (brazilian software process improvement) levels c and g, iso 27001:2007 (information security management system), iso 14001 (environmental management system), and tmmi (testing maturity model integration) level 3.

3.3.3 answering the research questions
we next discuss the results based on the stated rqs.
rq1: sustainability concepts. in this question, the goal was to observe the respondents' comprehension in the general scope, with respect to the conceptual framework on sustainability, and to understand their perception of how well the companies in which they work adhere to sustainable practices. initially, seeking to observe the respondents' level of knowledge, we asked them to self-assess their level of knowledge about sustainability in the software development process. figure 5 shows that 46% had no knowledge, this being their first contact with the subject; 45% had low knowledge about the subject; 6% had medium knowledge; and 3% had high knowledge.
figure 5. distribution of respondents by knowledge level on sustainability
furthermore, we presented to the participants six concepts (definitions) of "sustainable software" taken from the literature of authors relevant to the domain. the respondents did not have access to the authors' names and could choose only one of the six options. our objective was to identify which of those concepts the respondents would be most familiar with. the results are described next, with each concept followed by the corresponding result.
definition 1: "an application that produces as little waste as possible during its development and operation". erdelyi (2013). 34% of the respondents consider this the most coherent definition.
definition 2: "software developed and used in such a way that leaves minimal negative impact on users, environment, economy and society in general". naumann et al. (2011). 28% of the respondents consider this the most coherent definition.
definition 3: "software whose impacts on the economy, society, human beings and environment, resulting from the development, deployment and use of the software is minimal and/or has a positive effect on sustainable development". dick et al. (2010). 26% of the respondents identified themselves with this definition.
definition 4: "software code being sustainable, agnostic on purpose, or the purpose of the software is to support sustainability goals, i.e., to improve the sustainability of humanity on our planet". hilty et al. (2006). 4% of the respondents selected this option.
definition 5: "software whose purpose is to support sustainability goals, that is, to improve the sustainability of humanity on our planet". dick et al. (2010). 5% of the respondents preferred this definition.
definition 6: "environment friendly software that helps improve the environment". murugesan (2008). 3% of the respondents selected this option.
we can see that definitions 2 and 3 have similar proposals, since both mention the impacts of software production according to sustainability dimensions. together, they account for 54% of the respondents' choices, which demonstrates that the public's view encompasses other dimensions, not just the technical one.
the perceived increase in the awareness and adoption of sustainable practices in the software context might be related to the recognition of sustainability as a quality goal for software systems (penzenstadler et al., 2014a). in this sense, we asked the respondents whether sustainability should or should not be considered an nfr. this question aims to understand the practitioners' perception of sustainability as an nfr. as a result, 58% of the respondents considered that sustainability should be considered an nfr, which converges with the se literature discussions. however, only 18% of the respondents were able to provide reasonable statements supporting their opinion. next, we cite the respondents' justifications. one respondent stated that it should be considered an nfr "because of the impacts on the environment and consequently people's quality of life" [#2]; another respondent stated that "in software that applies the idea of resource consumption, if requested by the customer, sustainability would be considered a non-functional requirement" [#51]; another answer was "because it demonstrates to the target audience the concern that the company has when delivering a product. it tries to solve a problem causing the least possible environmental impact, from its conception to the delivery of the solution" [#67].

rq2: sustainability importance level

in analyzing the degree to which the respondents consider that companies should give importance to the sustainability issue in the software development process, we discovered that 33% treat the issue as "important" and 44% as "very important". for another 19%, it is "neutral", and 3% see "no importance" in the subject. we observed that, for this minority, companies do not have a process to evaluate the quality of the software and its eventual sustainability; consequently, they see no added value in making the software development process sustainable. by crossing this data with the question "what do respondents understand that sustainability represents for companies?", we could see from figure 6 that most respondents (48%) see sustainability as an opportunity to gain new business. nevertheless, 18% of the respondents believe that sustainability in the software development process represents costs and expenses for companies. it is worth mentioning that the total could exceed 100%, as it was a multiple-choice question.

in a broader scope, we also asked the respondents whether their companies adopted any sustainability practices, such as proper disposal and recycling of waste and batteries, compliance with environmental legislation, and saving water, energy, and paper, among others. respondents could choose one of the following answers:

• expert: complies with all legislation and performs and encourages various practices.
• intermediate: complies with most legislation and performs various practices.
• beginner: complies with little legislation and performs some practices.
• no knowledge.
• does not comply with legislation.

figure 7 shows that 41% of the respondents could not answer whether the company in which they work adopts sustainable practices or complies with some environmental legislation. 24% consider the company to be at the beginner level, since it adopts some practices and complies with few legislation rules; 20% consider the company to be at the intermediate level, complying with several different laws and practices.
another 9% reported that their company did not comply with any legislation; and only 6% pointed out that the company complies with all laws and encourages the adoption of various practices. most respondents who stated that the company complies with all laws are from consulting companies. these results show that employees do not know at what level the company they work for stands with regard to the adoption of sustainable practices.

next, we asked whether their companies were concerned with minimizing the negative impacts that traditional development process activities could have on the environment. 34% of the respondents reported that the company had a moderate concern, neither too much nor too little. for 32% of the respondents, the company did not care about the issue at all. 19% reported that the company cared a little. for the remaining 15%, the company is concerned about the negative impacts. furthermore, the following responses point to the adoption of agile methodologies as an alternative to minimize negative impacts: "i believe that agile methodologies are more sustainable than the traditional development process because the software is delivered in parts and in constant monitoring of the client, reducing the risk of developing software which is different from what the client expects and avoiding changes or rewriting software." [#78]; "the company does not have a physical space and most of its employees work from home office; as a small number of employees is concerned about the theme, the company is not concerned, because its environmental impact is as little as possible." [#90]. regarding the data from the preliminary study, the new results showed a small variation, from 1% to 3% more or less, which indicates the same trend in the results.

figure 6. company awareness level
figure 7. distribution of respondents by company awareness level
figure 8. main difficulties in adopting sustainable practices by companies

when asked about the main barriers that hinder the adoption of sustainability actions and practices in the software development process of the corporate environment, 71% of the respondents stated that there is a lack of awareness on the companies' part. another 58% understand that companies do not consider the subject relevant. 35% of the respondents could not evaluate; 32% responded that their companies do not have qualified staff; and 21% reported difficulties in measuring likely earnings. in the view of another 21% of the respondents, bureaucracy becomes a barrier. the remaining 10% consider it a very expensive investment, as figure 8 shows. because it is a multiple-choice question, the total ratio could exceed 100%.

rq3: sustainable software development process

we asked the respondents whether they felt that companies should give importance to the sustainability issue in the software development process. 44% answered that it was "very important"; 33% considered it "important"; 19% reported being "neutral"; only 1% considered it "less important"; and 3% did not consider the topic important. in general, 77% of the respondents think that companies should give importance or a lot of importance to sustainability in the software development process. we also sought to know what respondents consider mandatory features for a software development process to be considered sustainable.
the codes obtained from this open question were mainly: reuse, code quality, sustainable good practices (using standards, green models, and metrics), agile methods, resource usage awareness, robust architecture, reduction of environmental impacts, efficient coding, maintainability, adaptability, accessibility, development standards, and optimized coding.

when asked whether the companies they worked for encouraged the adoption of sustainable practices, whether general or specific, in the software development process, 44% were unable to answer; 27% stated this was a rather common practice; while the other 29% reported that their companies do not encourage it. regarding the preliminary survey study, the data did not show a significant variation (from 1% to 5%), which allows us to observe a pattern in the results.

figure 9. sdlc phases covered by sustainable practices

in addition, we also attempted to figure out, among the companies that encourage the use of sustainable practices, which sdlc phases are covered. figure 9 shows that 40% of the companies adopt such practices in the development phase; 29% in the design phase; 19% in requirements; and 13% in the testing phase. the respondents were allowed to choose more than one sdlc phase. we also asked the respondents in which sdlc phases they could identify any deficiencies in terms of sustainability practices. 21% pointed out no deficiencies; 19% pointed out deficiencies in the development phase; 16% in the design phase; 15% in the requirements phase; and 13% in the testing phase. in relation to the preliminary study, the data varied 1% more or less, with the response pattern prevailing. the identified deficiencies were related to:

• requirements: poorly designed requirements and low depth of functionality [#60].
• design: deficiency in thinking about solutions that need less space or computational power [#45], poorly implemented code that has a high cost [#47], limited view of the relations between the modules and systems functionalities [#60].
• development: unstructured code [#57], development with high coupling, limited to functionality only [#60].

when asked what could be done to improve the deficiencies pointed out in the previous question, the respondents suggested:

• general: "maturity in software development from conception (requirement) to creation (implementation / coding) to have greater gain, less effort and higher quality (standardization usability, open architecture)" [#9], "programs to encourage the study of the theme as well as broad dissemination and availability of materials that support enrichment on the subject to professionals" [#11], "incentives and awareness" [#15], "institutionally adopting sustainable policies to raise awareness of people and business" [#19], "context of difficult change. but awareness should be the first step" [#20], "study on the subject, understand what it means and evaluate ways to get started" [#23], "it is necessary to raise awareness of sustainable development practices and metrics for incorporation into the process of the company" [#25].
• requirements: "define techniques for assessing requirements and recording in notes for evaluation and adherence to sustainability" [#7].
• design: "think of an architecture that is sufficient to fit the software design. for example, specifying computers that spend less energy but still meet the project requirements" [#24].
• development: "develop the software with the maximum possible reuse" [#24].

rq4: sustainability dimensions

we listed the sustainability concerns proposed by lago et al. (2015), without showing their related dimensions. the idea was to observe how the respondents perceived the dimensions of sustainability in their daily activities and the importance level attributed to each one. for each concern, respondents were presented with a brief description and five response options. table 1 shows that, on average, 84% of the answers considered all characteristics either "important" or "very important". the "very important" degree was attributed to the following characteristics: adaptation to changes, reusability, performance, and system quality. the degree "important" was attributed to the characteristics: longevity, software evolution, product roadmap, awareness about the use of sustainable practices, sustainable ethics, energy consumption, environmental concern, time to market, and development effort. the results show that professionals consider the technical dimension the most important one, with a mean of 93%, followed by the other dimensions: social (79%), economic (78%), and environmental (71%).

table 1. sustainability dimensions analysis (number of answers per concern, from irrelevant (1) to very important (5); blank cells omitted)
technical, longevity: 1, 1, 3, 47, 45
technical, resilience to uncertainty: 1, 33, 63
technical, performance: 1, 4, 39, 53
technical, software evolution: 1, 1, 10, 47, 38
technical, reusability: 1, 4, 5, 29, 58
technical, system quality: 1, 4, 24, 68
social, product roadmap: 1, 7, 15, 41, 33
social, awareness: 2, 4, 13, 44, 34
social, ethics: 2, 2, 16, 41, 36
environmental, energy consumption: 4, 5, 18, 40, 30
environmental, environmental concern: 2, 7, 21, 37, 30
economic, time to market: 4, 6, 16, 38, 33
economic, development effort: 1, 2, 13, 49, 32

rq5: sustainability models

in a recent literature review (mourão et al., 2018), we showed that there is not enough evidence in the literature on the use of a particular model; most of the proposed solutions are strictly academic, with no proof of effectiveness in real environments. therefore, in this question, we analyzed whether the professionals had adequate knowledge about the sustainable software engineering field and whether their companies apply any process model to support sustainability in se practices. as this study is exploratory, the purpose is not to confirm the adoption of models, but to elicit from the industry models that we had not identified beforehand. 96% of the respondents answered that they are not aware of any applied models. among the 4% of positive responses, only two respondents specified which model the company uses to support sustainability: cmmi (capability maturity model integration) and epeat (electronic product environmental assessment tool). while cmmi helps to improve processes, epeat assesses various environmental criteria across the full product life-cycle.

rq6: sustainability tools

similar to the previous question, this one explores tools that have been adopted in practice and that collaborate with sustainability, based on the intuitive knowledge of the respondents.
we analyzed whether the company adopts tools, techniques, or methods to measure sustainability, and also whether some sustainable design pattern is adopted in the software development process. 64% of the respondents stated they did not know of any or did not know how to report on their use in the company. analyzing the 36% of positive responses, we noticed that the respondents use ordinary tools, techniques, or methods to improve sustainability; however, most of them did not explicitly mention which ones they use. regarding the economic and environmental dimensions, one respondent stated that the company adopts a process for reducing expenses through the efficient use and reuse of equipment, which avoids spending on superfluous consumption resources [#4]. other respondents stated that their companies use agile methodologies for software development, such as scrum [#65, #16], which could be related to the capacity for time and cost management when developing the software. regarding the technical dimension, the responses encompass aspects of software maintenance. some respondents [#21, #55] focused on the development of reusable components that can be combined into a well-defined architecture. other respondents stated that they use tools for automated tests [#28], for continuous integration purposes (e.g., jenkins), and for applying quality metrics (e.g., sonar) [#16, #54]. this may help to minimize problems with legacy code and also to maintain quality aspects. although the other rqs revealed that the respondents consider sustainability an important topic to be addressed in the industry, this result shows that specific tools, techniques, and methods to support sustainability are still unknown to practitioners. however, some of the respondents tried to relate ordinary tools, techniques, and methods to sustainability, which may indicate that they understand the importance of sustainability when developing software.

4 reaching a common understanding of sustainable software

based on the evidence obtained from this survey, we observed that sustainability in se is still an incipient subject. to reach a common understanding of sustainable software, we applied the grounded theory (gt) method (glaser et al., 1968), whose emphasis is on the generation of new theories. gt has a set of procedures that provide a comparative data analysis, which is able to generate, in a systematic way, a theory based on the data (glaser et al., 1968). the result is a set of categories and relationships between them. we next describe the method steps applied: open coding and selective coding. following the open coding step, we identified the most relevant aspects of sustainability in se from the following open question: "4.2. how do you define 'sustainability' in the software development process?". although we found that 91% of the participants have either no or low knowledge about sustainable software, they were able to infer or assimilate the general concept. this conclusion was based on our earlier study (karita et al., 2019). therefore, we proposed an initial taxonomy with the converging points, based on the participants' common understanding, which could provide a common sense of the practitioners' perception of the topic. then, we proceeded with the data analysis and captured the codes. the theoretical saturation step was reached when no new code was identified in the data collection and analysis steps. the gt application allowed us to group these codes into categories to produce a higher abstraction level.
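as an illustration of this grouping and tallying, the following minimal python sketch shows how open codes can be counted and related to the sustainability dimensions; the coded answers and the code-to-dimension mapping below are hypothetical stand-ins, not the authors' actual coding scheme:

# a minimal sketch of the open-coding tally (illustrative data only)
from collections import Counter

# codes assigned to each answer during open coding
coded_answers = [
    ["reuse", "minimal use of resources"],
    ["reuse", "low impact"],
    ["code quality"],
]

# selective coding relates each code to one of the four dimensions
code_to_dimension = {
    "reuse": "technical",
    "code quality": "technical",
    "minimal use of resources": "environmental",
    "low impact": "environmental",
}

citations = Counter(code for answer in coded_answers for code in answer)
total = sum(citations.values())
for code, count in citations.most_common():
    dimension = code_to_dimension.get(code, "uncategorized")
    print(f"{code} ({dimension}): {100 * count / total:.0f}% of citations")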
the categories were mapped to the four sustainability dimensions (technical, social, environmental, and economic). figure 10 shows the taxonomy created for the identified codes; in the frame of each code, we mention the total amount of code citations. five new codes emerged from the extended survey: environmental awareness (7%), code optimization (2%), social welfare (1%), software evolution (1%), and added value (1%). they are shown in figure 10 at the end of each category (light green frame). the three most cited codes that define sustainability in software are: reuse (24%), minimal use of resources (22%), and low impact (13%).

figure 10. sustainable software taxonomy

we also analyzed the relationships between codes and categories using the selective coding step, as figure 11 shows. the findings show that, although a few knowledge gaps still exist on the subject, practitioners idealize that what drives sustainability in software is the adoption of reuse in the coding phase. we found this feature in various responses, such as: "it is a style of development of digital systems where the reuse of source code is prioritized to avoid rework" [#4] and "reuse of source code with the creation of components to avoid rework, minimizing the use of available resources, thus enabling greater productivity." [#9]. the second most cited feature was the minimal use of resources. this feature was explored indirectly, as a consequence of adopting the practice of reuse. according to the se community, reuse has many advantages for software development, such as increased productivity, increased software quality, and decreased delivery time. the cause/effect relationship between reuse and other characteristics can be seen in some answers, for example: "sustainability collaborates with combating waste, improving quality of life, creating more durable products, recycling, etc., and this can be applied in the software development process when practicing code reuse, for example." [#2] and "it means to focus on reusing source code to reduce the amount of effort and resources allocated during software development, which in turn can help to reduce the impact on the environment." [#24]. therefore, sustainable software for the industry is related to software produced through the adoption of reuse and development good practices. consequently, second-order results would bring benefits to the environment, such as low impact, combating waste, and energy efficiency. this concept is also known in the literature as green in software. the practitioners understand that a more sustainable way to develop software is to use practices that apply se principles while taking environmental aspects into account. this perspective, however, overlooks the fact that software development has effects that go beyond its boundaries.

5 discussion

in this section, we discuss the results in light of the collected data, based on the set of analyzed sustainability dimensions.

figure 11. the relationship of leveraged codes

• technical dimension: according to penzenstadler et al. (2014b), the technical dimension has a central interest in the requirements related to software longevity and evolution, such as non-obsolescence and quality characteristics. both requirements were only cited in the extended study.
the study confirmed that software practitioners have a narrow perception of sustainability concepts in the software development process. this is because most practitioners have targeted their perceptions of sustainable software specifically at quality attributes, such as reuse, optimization, and performance. this skewed view of sustainability covers only one of the five dimensions defined in the literature, the technical dimension, and confirms the results presented in lago et al. (2015). in terms of software development processes, we could see that the companies cannot yet be considered green companies, or companies aspiring to be sustainable, because they do not use models, processes, methods, and tools to support sustainable software development. although the professionals do not have in-depth knowledge of the subject, they could see the advantages and importance of sustainability in software development. the adoption of agile methodologies is another point of discussion. we observed that this topic is relatively new for the surveyed companies. despite the various benefits that agile software development could offer, such as the development of a minimum viable product in a short production cycle, its interaction with sustainability is a gap that needs to be explored.

• social dimension: the social dimension refers to the effects of software systems on society (e.g., product roadmap, ethics, etc.). in this sense, the study showed that the number of professionals who perceive the impact of sustainability on people's quality of life and welfare is still low. although awareness was first mentioned in this study as a social concern, similarly to the preliminary survey result, we could observe that all participants in the software production process need to develop a critical sense of the negative impacts that software production could cause on the planet. based on this understanding, the industry would have professionals engaged in providing sustainability. achieving a sustainable software development environment is possible; however, it is very important to encourage software teams to employ sustainability practices, considering existing tool, method, and process support, as well as proposing new ones. according to lago and penzenstadler (2017), conducting interviews is another way of creating awareness, since the results contribute to social sustainability. additionally, practitioners need to think about sustainability in all spheres of software development, not only from a technological perspective. much has been said about code reuse, maintainability, and efficiency, but awareness goes beyond the technical bias. the four dimensions interrelate and need to be addressed in an integrated way so that sustainability can happen in all stages of the software development process, from the customer's need to customer satisfaction. therefore, all dimensions should be better disseminated so that greater compliance can be achieved by companies and, especially, by people. in this way, we could foster conscious and sustainable software companies.

• environmental dimension: in this dimension, our purpose was to obtain evidence of how professionals perceive the impacts of software development and maintenance on the environment. according to penzenstadler et al.
(2014b), environmental sustainability could be achieved by analyzing the software development life-cycle and assessing the environmental impact it could cause. concerning legislation, some brazilian laws aimed at sustainability were mentioned in the study. however, what could be observed is that environmental issues such as waste recycling and water saving are still seen as the main factors associated with the term sustainability by these companies. despite the practitioners' low knowledge on the subject, the participants attributed high importance to this dimension. from the software perspective, this dimension is directly related to energy consumption and environmental interests (lago and penzenstadler, 2017). most professionals reported that their companies do not have quality requirements related to sustainability. this insight reinforces the need for the research community to increasingly join efforts to make sustainability a software quality requirement. through this study, it was possible to observe that the homogenization of the concepts used in this area is still uncertain. for software to be produced sustainably, software professionals must agree on the concepts inherent to this domain and its properties, so that they can have a clear and shared understanding of environmental knowledge and concern. we understand that it is important for practitioners to understand the central pillars of sustainability so that they can have a broader understanding of their likely effects.

• economic dimension: the economic dimension is one of the main concerns of companies. it is related to market requirements such as budget and cost restrictions (raturi et al., 2014; penzenstadler et al., 2014b). regarding this category, the few codes classified under it were only indirectly mentioned. this perception goes against the findings presented by lago and penzenstadler (2017). for professionals, sustainable software development creates an additional development effort, and current projects do not foresee this type of cost to implement sustainable software. we also noticed that companies do not promote sustainable development, which could encompass hiring qualified people with a good understanding of software engineering principles; with such people, there would be more time and resources to design and develop software with the expected quality associated with sustainable requirements. another aspect that permeates the economic dimension has to do with customer satisfaction (groher and weinreich, 2017). in this sense, few participants mentioned this factor. only 3 reported that sustainability is important but does not interfere with customer-facing functions; therefore, it would have to be a product obligation, a requirement on the part of the customer. in light of these discussions, we believe that companies should incorporate investments in business decisions to produce more sustainable software and implement sustainable software engineering practices. companies' lack of vision in not exploring sustainability practices in software production reduces their competitive advantage. however, for the industry, the expectation of return on these investments is still a gap. from an economic point of view, this gap makes the issue an urgent and strategic concern. in general, the extended study confirmed that the practitioners' perception of all dimensions of sustainability is still subtle.
the dimensions could be better worked on together, not just in the technical direction. therefore, these four factors need to be integrated into practice so that sustainability actually occurs within the scope of software production. the knowledge of software professionals needs to be expanded across the concerns of all dimensions, such as: knowing that software production has environmental impacts, accessing information, tools, and methods, transferring knowledge into actions, and raising awareness of these issues around them.

6 implications for research and practice

in this section, we provide readers with a synthesis of the relevant implications that emerged from the analysis of this qualitative study:

• the green se field is still incipient, and it needs to be disseminated in companies so that teams can start thinking about software development in a sustainable way;
• there is a lack of professionals' knowledge about the topic, in particular regarding how to adopt sustainable practices in the sdlc;
• although software professionals have limited or no knowledge about sustainability in the context of software development, they realize that its adoption has benefits for both company and society;
• the technical and environmental dimensions are the most relevant and most explored ones. most practitioners have targeted their perceptions of sustainable software at "green in software". they understand that a more sustainable way to develop software is to use practices that apply se principles while taking environmental aspects into account. this perspective, however, overlooks the fact that software development has effects that go beyond its boundaries. in this sense, it is important to analyze the role of software and investigate the impacts of its use on society across all dimensions of sustainability;
• since there is not enough evidence in the literature on the use of a particular tool or model (most reports are strictly academic in character, without proof of effectiveness in real environments (mourão et al., 2018)), this study showed that companies do not yet use models, processes, methods, and tools to support sustainable software development. as such, they cannot be considered green companies or companies aspiring to be sustainable.

7 related work

we next discuss recently published studies that are related to our work. a survey conducted with fifty-three software professionals in seven different companies was reported by koçak et al. (2015). the goal was to identify the perception of software professionals about the impact of energy-related software quality in order to develop an environmentally sustainable software product. through this research, the authors explored the correlation between software quality and energy efficiency using statistical analysis. the results of this study showed that there are significant negative correlations, with regard to energy efficiency, between functional adequacy and compatibility; performance efficiency and security; and reliability and compatibility. manotas et al. (2016) performed the first empirical study on how professionals think about energy when writing requirements and when designing, constructing, testing, and maintaining their software. the authors reported the findings of a quantitative, targeted survey of 464 professionals from abb, google, ibm, and microsoft. this research was motivated and supported by qualitative data from 18 detailed interviews with microsoft employees.
the study concluded that green se practitioners care and think about energy when building their applications. the results show that awareness has changed the discussion about software power consumption. regarding stimuli to awareness, the authors agree that appropriate support, such as the creation of organizational policies and knowledge banks, could help to create green software products. pang et al. (2016) conducted a survey with 122 programmers to understand their knowledge and awareness regarding software energy efficiency and consumption. the results show that the programmers' knowledge about energy consumption is consistent, and 60% of them consider energy consumption when choosing a mobile development platform; however, 80% of the programmers do not take energy consumption into account when developing software. jagroep et al. (2017) reported a multi-case study involving two commercial software products. the goal was to identify how to create and maintain awareness of an energy consumption perspective on software among stakeholders involved in the development of software products. during the study, they followed the development process of the two commercial software products and provided direct feedback to stakeholders on the effects of their development efforts, specifically on energy consumption and performance, using a power control panel. the authors defined a main research question and three sub-questions. to measure awareness, the authors constructed a survey but did not report the details of its planning, target audience, and instrument. to understand how software sustainability is currently addressed in the practice of software development projects, groher and weinreich (2017) conducted interviews with 10 software project team leaders from 9 companies in austria. the study analyzed the data using the deductive categorization method and found that professionals consider software sustainability important, but their concerns are mainly technical. organizational and economic issues are addressed, but environmental considerations are lacking. the perceived influence of various project factors on sustainability is partially diverse, suggesting that the meaning of sustainability needs to be refined for the specific context of design and application. pinto and castor (2017) conducted a survey with software developers in order to understand those professionals' perceptions of issues related to software energy consumption. the authors interviewed 62 software developers who had performed at least one commit on an open-source mobile application. the results of the study suggested that there is a lack of knowledge about how to develop energy-efficient software. in addition, the authors noted that there is a need for tools to help developers achieve this goal. in developing this work, we considered every mentioned study, since they all bring relevant information on the topic. however, we observed that these studies usually focus on particular issues, such as the correlation between sustainability and software quality attributes, or the energy use of software applications. as research in this field is incipient, it becomes important to explore software professionals' perception with broader coverage.

8 threats to validity

construct validity: during the pilot test, some respondents reported that the time needed to fill in the instrument was extensive.
as such, our survey respondents may not have answered questions adequately, preferring short answers to more detailed descriptions. to reduce this threat, we grouped the questions into specific sections to better target questions and answers. another threat was the respondents' understanding of the questions. to help ensure the understandability of the survey, we asked professionals and researchers with experience in se and in survey design to review the survey instrument and make sure the questions were clear and complete.

internal validity: an internal limitation may be the selection of companies and practitioners for the sample. we understand that both the number of companies and the number of responses obtained may not adequately represent the entire population of companies and software professionals, characterizing a threat to internal validity. however, as we decided to include only professionals from companies that work in different domains (most have offices in several brazilian states), we believe this set might be representative.

external validity: the respondents of our survey may not adequately represent all software practitioners. most respondents reported that they work as software developers, which may have skewed the results. nevertheless, we believe that the number of responses that we analyzed provides a rich source of qualitative data that could reveal promising insights.

reliability: although grounded theory offers rigorous data collection and analysis procedures, qualitative research is generally subject to researcher bias. certainly, other researchers could arrive at a different interpretation and theory after analyzing the same dataset, but we believe that at least the main insights would be preserved. in particular, the results of rq1 might be subject to this threat. to mitigate it, the qualitative analysis was performed on the codes recovered and grouped according to the correlations between sustainability dimensions and sustainability concerns proposed by penzenstadler et al. (2014b) and raturi et al. (2014). although the research results may have been influenced by interpretation, the coding process was performed by two authors working together to mitigate this threat, and disagreements in the assignment of codes were discussed until consensus was reached.

9 concluding remarks

although the se community has increased its interest in the green and sustainable se field, the software industry has not yet explored this area in an adequate fashion. consequently, green and sustainable practices are not completely known nor substantially applied by software practitioners. this study extends a previous survey designed to gather data from software practitioners from brazilian companies in this respect and provides data on the software industry's perception of sustainability in the software development process. the yielded results confirm the findings identified in the original survey. they indicate an overall lack of knowledge about the topic, in particular regarding the concepts of sustainable software, although there is a common understanding that sustainability should be treated as a quality attribute and that the interaction between sustainability and the sdlc should be supported. this study contributes to the field with an initial set of evidence.
we see it as an important step towards establishing a common understanding of how receptive the software industry is to sustainability concepts in software development practices. as future work, we aim to conduct interviews with participants of the survey in order to enrich and detail the professionals' perceptions. in addition, we plan to carry out more in-depth studies on already validated techniques and methods that could improve and compose a green checklist for software development.

acknowledgements

this research was partially funded by fapesb grants jcb0060/2016 and bol0188/2020, ines 2.0, cnpq grant 465614/2014-0, and capes finance code 001.

references

becker, c. (2014). sustainability and longevity: two sides of the same quality? mental, 20:21.
becker, c., chitchyan, r., duboc, l., easterbrook, s., penzenstadler, b., seyff, n., and venters, c. c. (2015). sustainability design and software: the karlskrona manifesto. in 2015 ieee/acm 37th ieee international conference on software engineering, volume 2, pages 467–476. ieee.
calero, c. and bertoa, m. (2013). 25010+s: a software quality model with sustainable characteristics. sustainability as an element of software quality. green in software engineering green by software engineering (gibse 2013), co-located with aosd.
calero, c. and piattini, m. (2015a). green in software engineering, volume 3. springer.
calero, c. and piattini, m. (2015b). introduction to green in software engineering, pages 3–27. springer international publishing, cham.
calero, c. and piattini, m. (2017). puzzling out software sustainability. sustainable computing: informatics and systems, 16:117–124.
dick, m., naumann, s., and kuhn, n. (2010). a model and selected instances of green and sustainable software. in what kind of information society? governance, virtuality, surveillance, sustainability, resilience, pages 248–259. springer.
erdelyi, k. (2013). special factors of development of green software supporting eco sustainability. in 2013 ieee 11th international symposium on intelligent systems and informatics (sisy), pages 337–340. ieee.
fernandez, n. c., lago, p., and calero, c. (2016). how do quality requirements contribute to software sustainability? in refsq workshop, pages 7–10.
garcía-mireles, g. a., moraga, m. á., garcía, f., calero, c., and piattini, m. (2017). interactions between environmental sustainability goals and software product quality: a mapping study. information and software technology.
glaser, b. g., strauss, a. l., and strutzel, e. (1968). the discovery of grounded theory; strategies for qualitative research. nursing research, 17(4):364.
groher, i. and weinreich, r. (2017). an interview study on sustainability concerns in software development projects. in 2017 43rd euromicro conference on software engineering and advanced applications (seaa), pages 350–358. ieee.
hilty, l. m., arnfalk, p., erdmann, l., goodman, j., lehmann, m., and wäger, p. a. (2006). the relevance of information and communication technologies for environmental sustainability – a prospective simulation study. environmental modelling & software, 21(11):1618–1629.
imperatives, s. (1987). report of the world commission on environment and development: our common future. accessed feb 10. available at http://www.un-documents.net/our-common-future.pdf.
jagroep, e., broekman, j., van der werf, j. m. e., lago, p., brinkkemper, s., blom, l., and van vliet, r. (2017). awakening awareness on energy consumption in software engineering.
in 2017 ieee/acm 39th international conference on software engineering: software engineering in society track (icse-seis), pages 76–85. ieee.
johann, t., dick, m., kern, e., and naumann, s. (2011). sustainable development, sustainable software, and sustainable software engineering: an integrated approach. in humanities, science & engineering research (shuser), 2011 international symposium on, pages 34–39. ieee.
karita, l., mourão, b. c., and machado, i. (2019). software industry awareness on green and sustainable software engineering: a state-of-the-practice survey. in proceedings of the xxxiii brazilian symposium on software engineering, pages 501–510, salvador, brasil.
kasunic, m. (2005). designing an effective survey. technical report cmu/sei-2005-hb-004, carnegie mellon university, software engineering institute, pa, usa.
khandelwal, b., khan, s., and parveen, s. (2017). cohesive analysis of sustainability of green computing in software engineering. international journal of emerging trends & technology in computer science, 6:11–16.
kitchenham, b. a. and pfleeger, s. l. (2002). principles of survey research part 2: designing a survey. acm sigsoft software engineering notes, 27(1):18–20.
koçak, s. a., alptekin, g. i., and bener, a. b. (2015). integrating environmental sustainability in software product quality. in re4susy@re, pages 17–24.
lago, p., koçak, s. a., crnkovic, i., and penzenstadler, b. (2015). framing sustainability as a property of software quality. commun. acm, 58(10):70–78.
lago, p. and penzenstadler, b. (2017). reality check for software engineering for sustainability—pragmatism required. journal of software: evolution and process, 29(2):e1856.
manotas, i., bird, c., zhang, r., shepherd, d., jaspan, c., sadowski, c., pollock, l., and clause, j. (2016). an empirical study of practitioners' perspectives on green software engineering. in 2016 ieee/acm 38th international conference on software engineering (icse), pages 237–248. ieee.
moraga, m. á., garcía-rodríguez de guzmán, i., calero, c., johann, t., me, g., münzel, h., and kindelsberger, j. (2017). greco: green code of ethics. journal of software: evolution and process, 29(2):e1850.
mourão, b. c., karita, l., and machado, i. c. (2018). green and sustainable software engineering - a systematic mapping study. in proceedings of the 17th brazilian symposium on software quality (sbqs), pages 121–130. acm.
murugesan, s. (2008). harnessing green it: principles and practices. it professional, 10(1):24–33.
naumann, s., dick, m., kern, e., and johann, t. (2011). the greensoft model: a reference model for green and sustainable software and its engineering. sustainable computing: informatics and systems, 1(4):294–304.
pang, c., hindle, a., adams, b., and hassan, a. e. (2016). what do programmers know about software energy consumption? ieee software, 33(3):83–89.
penzenstadler, b. (2014). infusing green: requirements engineering for green in and through software systems. in re4susy@re, pages 44–53.
penzenstadler, b. and femmer, h. (2013). a generic model for sustainability with process- and product-specific instances. in proceedings of the 2013 workshop on green in/by software engineering, pages 3–8. acm.
penzenstadler, b., raturi, a., richardson, d., calero, c., femmer, h., and franch, x. (2014a).
systematic mapping study on software engineering for sustainability (se4s). in proceedings of the 18th international conference on evaluation and assessment in software engineering, page 14. acm.
penzenstadler, b., raturi, a., richardson, d., and tomlinson, b. (2014b). safety, security, now sustainability: the nonfunctional requirement for the 21st century. ieee software, 31(3):40–47.
pinto, g. and castor, f. (2017). energy efficiency: a new concern for application software developers. communications of the acm, 60(12):68–75.
raturi, a., penzenstadler, b., tomlinson, b., and richardson, d. (2014). developing a sustainability non-functional requirements framework. in proceedings of the 3rd international workshop on green and sustainable software, pages 1–8. acm.
saputri, t. r. d. and lee, s.-w. (2016). incorporating sustainability design in requirements engineering process: a preliminary study. in asia pacific requirements engineering conference, pages 53–67. springer.
tate, k. (2005). sustainable software development: an agile perspective. addison-wesley professional.
venters, c., lau, l., griffiths, m., holmes, v., ward, r., jay, c., dibsdale, c., and xu, j. (2014). the blind men and the elephant: towards an empirical evaluation framework for software sustainability. journal of open research software, 2(1).
venters, c. c., seyff, n., becker, c., betz, s., chitchyan, r., duboc, l., mcintyre, d., and penzenstadler, b. (2017). characterising sustainability requirements: a new species red herring or just an odd fish? in 2017 ieee/acm 39th international conference on software engineering: software engineering in society track (icse-seis), pages 3–12. ieee.

journal of software engineering research and development, 2020, 8:8, doi: 10.5753/jserd.2020.544  this work is licensed under a creative commons attribution 4.0 international license.

how is a developer's work measured? an industrial and academic exploratory view

matheus silva ferreira [ federal university of lavras | matheus.ferreira5@estudante.ufla.br ]
luana almeida martins [ federal university of lavras | luana.martins1@estudante.ufla.br ]
paulo afonso parreira júnior [ federal university of lavras | pauloa.junior@ufla.br ]
heitor costa [ federal university of lavras | heitor@ufla.br ]

abstract

software project management is an essential practice to successfully achieve goals in software development projects and a challenging task for project managers (pms). therefore, information about the developers' work can be valuable in supporting the pms' activities. several studies address this topic and suggest different strategies for obtaining such information. given the variety of existing strategies, we need to know the state-of-the-art on the theme. this article presents the information used for supporting pms in the application of project management practices, especially with regard to risk management and people management. thus, we carried out an exploratory study using a systematic mapping study (sms).
contributions include the identification of 64 metrics, four information sources, and seven pm activities supported by the measurement of the developers' work. additionally, we interviewed four pms to collect their personal opinions on how the metrics and activities reported by our sms could help project management in practice. each pm considered a different set of metrics to support their activities, but none of them suggested new metrics (besides the 64 metrics identified in the sms). also, we present aspects for further exploring the subject, indicating themes for possible new studies in the software engineering area.

keywords: project management, knowledge of the developers' work, project manager's activities, metric

1 introduction

in software projects, the project manager (pm) is the professional who ensures proper project management. the pm's function includes selecting members for the project team and assigning roles and responsibilities as needed (de souza et al. 2015). pms must know how to assess the skills, strengths, and weaknesses of developers so that they can do their work efficiently (zuser and grechenig 2003). besides, bad people management can bring about project risks (ferreira et al. 2017). for example, team member turnover can be a high risk for a project, because some developers can centralize software source code knowledge (boehm 1991). these issues correspond to two of the pms' activities: risk management and people management (sommerville 2019). performing people management and risk management is a non-trivial task; in addition, the pm's effort is impacted by the size of the project and the size of the team (ahonen et al. 2015). in this context, the evaluation of team members by pms motivates studies in the literature (de bassi et al. 2018; feiner and andrews 2018; ferreira et al. 2017; zuser and grechenig 2003) that suggest strategies for measuring the developers' work. given the wide variety of suggested strategies, we need to know the state-of-the-art approaches in this field. this article presents an investigation of how the developers' work can be measured and how information on the developers' work can support project management (especially risk management and people management). thus, we performed an exploratory study using a systematic mapping study (sms). an sms allows one to identify, interpret, and evaluate available evidence from studies on a topic, phenomenon, or set of research questions of interest (kitchenham 2004). an sms has three phases (kitchenham and charters 2007): i) planning (we define the motivation, goals, and research protocol); ii) execution (we apply the strategy outlined in the research protocol to identify and select studies); and iii) results (we show the analysis of the information obtained from the selected studies). additionally, we interviewed four pms to collect their opinions about the sms results that can help them in practice. thus, we gained an initial understanding of how the industry measures the developers' work. the remainder of this article is organized as follows: section 2 describes the theoretical framework. section 3 presents the sms planning phase. section 4 describes the sms execution phase. section 5 presents the sms results phase. section 6 discusses the results along with the pms' opinions. section 7 describes the threats to validity. section 8 draws concluding remarks.

2 background

this section discusses risk management and people management.
2.1 risk management

in the pmbok (project management body of knowledge), risk management is an area of knowledge that aims to identify, evaluate, and monitor the positive (opportunities) and negative (threats) risks that may affect the project (pmi 2017). the most critical activities for pms when a problem emerges are to evaluate and monitor risks (sommerville 2019). this area comprises five processes (plan risk management, identify risks, perform qualitative risk analysis, perform quantitative risk analysis, and plan risk responses). there are several techniques for analyzing threats. when choosing a threat analysis technique, the pm needs to pay attention to the characteristics of the project to avoid impacts on the quality of the analysis results (tuma et al. 2018). effective risk management can directly impact project success or failure (menezes et al. 2019). risks can affect the project schedule or resources (project risks), software quality or performance (product risks), or the organization that produces or acquires the software (business risks) (sommerville 2019). for example, the departure of an experienced developer can represent:

• a project risk, because this departure affects the schedule due to the loss of human resources;
• a product risk, because the developer who substitutes an experienced developer can have different skills and project knowledge; and
• a business risk, because the developer's experience impacts the signing of contracts.

many existing risks in software projects relate to the development team. specifically, one risk addressed in the literature is the lack of technical skills of some team members (menezes et al. 2019), leading to investments in training or hiring people. another risk mentioned as a constant concern for pms is developer turnover associated with the concentration of knowledge regarding source code. in this situation, the project can fail if these developers leave the project/organization earlier than expected (ferreira et al. 2017). an alternative to mitigate this risk is to identify the people who concentrate the knowledge on source code and distribute it among all team members (developers).

2.2 people management

in the pmbok, resource management is a field of knowledge that aims to identify, acquire, and manage the resources needed for successful project completion (pmi 2017). this area comprises six processes (plan resource management, estimate activity resources, acquire resources, develop team, manage team, and control resources). in software projects, the team members play different roles. thus, pms need to consider the members' technical skills and personalities to assemble the teams. in order to correctly manage people, pms should (sommerville 2019):

• have an honest and respectful relationship with those involved in the project;
• have people who are motivated to perform their functions;
• support teamwork and maintain a relationship of trust among everyone, enabling the team to self-manage;
• select team members to optimize performance and meet the projects' technical and human requirements;
• organize the working method and team members' roles; and
• ensure effective communication between the people involved.

despite the existing recommendations, there are studies showing that pms can hardly organize high-performing teams in a systematic and repeatable way (latorre and javier 2017).
understanding people's characteristics, assigning tasks, and recognizing the work done are complex and relevant tasks for pms (zuser and grechenig 2003).

3 sms planning phase

this section describes the sms planning phase and presents the research protocol, its validation, and the data extraction procedure.

3.1 research protocol

the research protocol includes the strategies used for retrieving and selecting studies that are relevant to the topic of interest in the research (kitchenham 2004). in this protocol, we defined the research questions, the procedure used to conduct the sms, the inclusion and exclusion criteria for selecting the studies, and how to obtain and classify information. in table 1, we show the research questions and the goals for answering them. the primary research questions help to understand the measurement of the developers' work to support pms and consist of the main results of this study. the secondary research questions provide insight into the characteristics of the scientific studies found in the sms. we selected the acm, ieee, and springer repositories of scientific papers. in addition, we chose ei compendex and scopus because they index other repositories. they publish papers from the most important conferences and journals in software engineering (ambreen et al. 2018; bouchkira 2020). additionally, we elaborated a search string (table 2) that contains five parts of key terms aligned with the research questions to retrieve studies from these repositories. each part is composed of a key expression and its synonyms, as follows:

• part 1 refers to the action required to obtain the metrics on the developers' work. we defined it as measure or measurement or mensuration or dimension or evaluation or analyze or analysis or view or visualization or knowledge;
• part 2 refers to the object to be measured (the developer's work). we defined it as contribution or participation or productivity or skills or collaboration or effort or knowledge;
• part 3 refers to who is evaluated by the measurement. we defined it as developers or "software development team" or "team members";
• part 4 refers to whoever is interested in the metrics obtained from the measurement. we defined it as "software project manager" or "project manager" or "project managers" or "software project" or "project management"; and
• part 5 refers to the method or tool used to measure. we defined it as tool or framework or plugin or method or metric or factors.

only part 1 was searched in the title of the studies, because it is related to "measurement" and is relevant to the retrieved studies regarding the sms goal (we used search engine tokens). when we enlarged the search to other items (besides the title), the number of returned studies increased significantly, and these studies treat issues that are distant from the objective of this study. we searched the other parts (parts 2-5) in titles, abstracts, and keywords.

table 1. research questions
primary research questions:
q-p.1 what metrics are used by pms to measure the developers' work? goal: to identify metrics about the developers' work.
q-p.2 how are metrics applied by pms to monitor the developers' work? goal: to identify how pms extract and analyze the metrics to measure the developers' work.
q-p.3 how do metrics concerning the developers' work support project management, especially with regard to risk management and people management? goal:
table 2. search string
(measure or measurement or mensuration or dimension or evaluation or analyze or analysis or view or visualization) and (contribution or participation or productivity or skills or collaboration or effort or knowledge) and (developers or "software development team" or "team members") and ("software project manager" or "project manager" or "project managers" or "software project" or "project management") and (tool or framework or plugin or method or metric or factors)

next, we established the selection process, which defines the inclusion and exclusion criteria. for inclusion, the study should be a primary study addressing the measurement of the developers' work to support the pms' activities. for exclusion, we removed studies that (i) do not have complete scientific contributions (e.g., abstracts), (ii) are not scientific studies (e.g., standards and tables of contents), (iii) do not have complete texts, or (iv) have restricted access. subsequently, we defined a procedure involving four researchers to select the studies: researchers a and b performed the activities planned for the sms, researcher c (experienced) helped researchers a and b, and researcher d (the most experienced) supervised the work. the procedure consisted of the following stages:
- to apply the search string. researcher a applied the search string in the digital repositories and stored the retrieved studies in the mendeley reference management system (https://mendeley.com);
- to remove duplicates. researchers a and b analyzed the information from the retrieved studies in order to identify and remove duplicates, using a mendeley feature to detect them. when the same study was indexed with different keyword sets, they kept the version indexed with more keywords, since it can be considered better characterized;
- to apply the exclusion criteria. researchers a and b applied the exclusion criteria, and researcher c monitored the exclusion;
- to select potential studies. researchers a and b independently read the title, abstract, and keywords of the studies resulting from the previous stage to identify those with the potential to meet the inclusion criteria, classifying them as "with potential", "without potential", or "doubtful" (unsure about the potential). both researchers admitted as many potential studies as possible. next, they merged their classifications following two decision criteria (accept or reject) (bin ali and petersen 2014): they accepted a study if at least one of them classified it as "with potential", and they rejected a study if both classified it as "without potential" or if one classified it as "without potential" and the other as "doubtful". researcher c monitored the classification process; and
- to apply the inclusion criteria. researchers a and b read the full text of the studies resulting from the previous stage and defined five quality questions for scoring them (table 3). each question could receive the value 1 (yes), 0.5 (partly), or 0 (no); thus, the minimum score is 0 and the maximum is 5. after assigning scores individually, researchers a and b calculated the arithmetic mean for each study ((researcher a score + researcher b score) / 2), which represents the study's final score for the inclusion criteria. finally, they accepted the studies with a final score equal to or greater than 2.5 (50%) (bin ali and petersen 2014). researcher d analyzed the accepted studies to assess their relevance for the sms.

table 3. quality questions for inclusion criteria
q1 are the aims of the research explicitly defined?
q2 are the metrics explicitly reported?
q3 are the metrics related to a software activity?
q4 are the metrics clearly described or defined?
q5 are the findings related to a project management activity?
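a minimal sketch of this scoring scheme follows (the answer vectors are hypothetical):

```python
# sketch of the inclusion scoring: five quality questions (table 3), each
# answered yes (1), partly (0.5) or no (0) by two researchers; the final score
# is the mean of their totals, and studies scoring >= 2.5 (50%) are accepted.

ANSWER_VALUE = {"yes": 1.0, "partly": 0.5, "no": 0.0}

def total(answers):
    """sum one researcher's answers to q1..q5."""
    return sum(ANSWER_VALUE[a] for a in answers)

def accepted(researcher_a, researcher_b, threshold=2.5):
    final_score = (total(researcher_a) + total(researcher_b)) / 2
    return final_score >= threshold

# hypothetical study judged mostly positively by both researchers:
print(accepted(["yes", "partly", "yes", "no", "yes"],       # total 3.5
               ["yes", "yes", "partly", "no", "partly"]))   # total 3.0 -> final 3.25, accepted
```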
3.2 evaluation of the research protocol

we evaluated the research protocol before starting the execution phase (kitchenham and charters 2007) to assess the feasibility of performing the sms and to identify the changes necessary to improve the quality of the retrieved studies. the evaluation consisted of a test against a control group of primary studies that the protocol was expected to retrieve. we formed the control group with 7 studies through an ad-hoc literature review, using google scholar to search for studies related to the sms goal, and we refined the search string, applying it to the search engines and checking how many control-group studies were returned. the keywords in part 1 (table 2) are general, since a measurement process is commonly applied in studies to validate their results; therefore, we restricted part 1 to the title of the studies. consequently, the protocol retrieved six of the seven control-group studies, and we added the remaining one manually. table a1 (appendix a) lists the studies from the control group (p21, p23, p24, p30, p33, and p39; p20 is the study the protocol did not retrieve).

3.3 data classification and extraction procedure

we established a procedure to classify and extract data from the selected studies in nine categories:
- information on the publication. to collect data related to the study's publication, such as title, authors, year of publication, and publication media (journal/event), and use them to track general information on the studies;
- proposal of solution. to collect data on the solution and classify it as "overview", "method", "model", "metric", or "tool" (petersen et al. 2008). this category helps to answer q-s.1 and q-s.3;
- research methodology. to collect data on the research methodology and classify it as "evaluation research", "proposal of solution", "validation research", "opinion studies", "experience studies", or "philosophical studies" (wieringa et al. 2006). this category helps to answer q-s.2 and q-s.3;
- metrics to measure the developers' work. to collect the metrics used by pms to measure the developers' work. this category helps to answer q-p.1;
- data source on the developers' work. to identify the data sources used as input to apply the metrics. this category helps to answer q-p.2;
- method to extract data and apply metrics. to identify the method used to obtain the metrics from the data sources and apply them (e.g., mining source code repositories and collecting feedback from team members). this category helps to answer q-p.2;
- context of metrics extraction and application. to collect the metrics' context: i) type of software (proprietary or open-source) and ii) environment in which team data are collected (academic or industry). this category helps to answer q-p.2;
- method for presenting results to pms. to identify information about how the metrics are presented to pms. this category helps to answer q-p.2; and
- support for project management. to identify how the metrics of the developers' work support pms in project management, especially in risk management and people management. this category helps to answer q-p.3.

4 sms execution phase

we performed the sms execution phase between march and november 2019. first, we customized and applied the search string in the selected search engines, considering their specificities. we used the search string in the:
- single search field without filters (acm digital library);
- advanced search field without filters (scopus);
- single search field with the filter "content-type: conference publications, journals & magazines" (ieee xplore);
- single search field with the filter "controlled vocabulary: software engineering" (ei compendex); and
- single search field with the filters "discipline: computer science; subdiscipline: software engineering; content-type: article" (springer).
next, we selected the studies, with the following results (table 4):
- application of the search string. we retrieved 1,381 documents (filter 1). of these, 240 were from acm (17.4%), 433 from ieee (31.4%), 169 from scopus (12.2%), 141 from ei compendex (10.2%), and 398 from springer (28.8%);
- removal of duplicates. after removing duplicate documents, 1,305 remained (filter 2). of these, 239 were from acm (18.3%), 421 from ieee (32.3%), 115 from scopus (8.8%), 134 from ei compendex (10.3%), and 396 from springer (30.3%);
- application of the exclusion criteria. after applying the exclusion criteria, 1,205 studies remained (filter 3); at this stage, only actual studies (no other document types) remained. of these, 212 were from acm (17.6%), 415 from ieee (34.4%), 77 from scopus (6.4%), 115 from ei compendex (9.6%), and 386 from springer (32.0%);
- selection of potential studies. after reading the titles, abstracts, and keywords of the studies, 61 were retained (filter 4). of these, 7 were from acm (11.5%), 32 from ieee (52.5%), 8 from scopus (13.1%), 8 from ei compendex (13.1%), and 6 from springer (9.8%); and
- application of the inclusion criteria. after applying the inclusion criteria, we retained and read 40 studies; adding the (only) control-group study not recovered by the protocol gives 41 accepted studies (filter 5). of these, 4 were from acm (9.8%), 27 from ieee (65.9%), 6 from scopus (14.6%), 3 from ei compendex (7.3%), and 1 from springer (2.4%).
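as a sanity check on these counts (summarized in table 4 below), a small sketch verifying that, for each repository, the studies removed at each filter equal the difference between consecutive accepted counts:

```python
# consistency check over the selection funnel: accepted counts per filter are
# copied from table 4; removed/rejected counts must be their differences.

FUNNEL = {  # repository: accepted counts at filters 1..5
    "acm":          [240, 239, 212, 7, 4],
    "ieee":         [433, 421, 415, 32, 27],
    "scopus":       [169, 115, 77, 8, 6],
    "ei compendex": [141, 134, 115, 8, 3],
    "springer":     [398, 396, 386, 6, 1],
}

for repo, acc in FUNNEL.items():
    removed = [acc[i - 1] - acc[i] for i in range(1, len(acc))]
    assert all(r >= 0 for r in removed), f"{repo}: negative removal count"
    print(repo, "removed per filter:", removed)

totals = [sum(col) for col in zip(*FUNNEL.values())]
print("totals:", totals)  # [1381, 1305, 1205, 61, 41]; the 41 includes the
                          # control-group study added manually after filter 5
```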
table 4. summary of selection stages (a = accepted, r = removed/rejected)
repository | filter 1 | filter 2 (a/r) | filter 3 (a/r) | filter 4 (a/r) | filter 5 (a/r)
acm | 240 | 239/1 | 212/27 | 7/205 | 4/3
ieee | 433 | 421/12 | 415/6 | 32/383 | 27/5
scopus | 169 | 115/54 | 77/38 | 8/69 | 6/2
ei compendex | 141 | 134/7 | 115/19 | 8/107 | 3/5
springer | 398 | 396/2 | 386/10 | 6/380 | 1/5
total | 1,381 | 1,305/76 | 1,205/100 | 61/1,144 | 41*/20
* includes the study added manually from the control group

5 sms results phase

the studies resulting from the sms are presented in table a1 (appendix a), with identifiers, authors, year of publication, and repositories. when extracting the data, we observed that the selected studies were published from 2003 onward, at an average frequency of about 2 studies per year. two periods concentrated more publications on the subject (2008 and 2014 to 2016), with an average of 5.5 studies published per year (figure 1).

figure 1. annual distribution of selected studies

additionally, when analyzing the publication media, conferences and symposiums published 31 studies (75.6%), journals published 5 studies (12.2%), and workshops published 5 studies (12.2%). the international conference on software maintenance and evolution (icsme) published the most studies (5).

to answer the secondary research question q-s.1 (what type of solution is often proposed for studies in this area?), we mapped the studies regarding the proposed solutions (figure 2):
- method: defines workflows, rules, or procedures on how to perform an activity. present in 9 studies (22.0%);
- model: describes a conceptual representation with a formal abstraction of details and notations. present in 9 studies (22.0%);
- metric: describes new metrics or a measurement plan. present in 9 studies (22.0%);
- overview: describes and compares information to provide an overview of the subject. present in 4 studies (9.7%); and
- tool: describes and provides a computational tool. present in 10 studies (24.3%).
these solution types come from keywords often found in software engineering studies (petersen et al. 2008). this mapping allows us to assess where the community has been more focused, i.e., on proposing new metrics or on developing new tools to collect the existing metrics.

figure 2. proposal of solutions

the identification of the solutions considers the first solution proposed in each study; the "metric" and "tool" solutions were also found as secondary solutions in other studies. all studies used the "metric" solution. the "tool" solution was used in 4 studies to support the "metric" solution (26.7%), in 4 studies to support the "method" solution (26.7%), in 5 studies to support the "model" solution (33.3%), and in 2 studies to support the "overview" solution (13.3%). in the following items, we discuss the proposals of primary solutions:
- method was used to mine information and knowledge to support collaborative programming and resource allocation.
regarding collaborative programming, workflow analysis and the interaction among developers facilitate project understanding and development (p10, p16, p32, p37); regarding resource allocation, historical details of resource allocation and individual skills were analyzed to distribute tasks to team members according to each developer's competencies (p1, p6, p11, p36);
- model was used to analyze individual participation, investigating the collaborative software engineering context in order to share information and organize tasks and resources. the identified models evaluate the developers' performance considering the development environment and team feedback (p2, p7, p20), the building of the development team based on the activity, profile, and experience of the developers (p4, p18, p34, p41), and the assessment of how the developers' roles evolve, based on their contributions (p19, p24);
- metric was collected in source code repositories, bug tracking systems, or version control systems. from these metrics, indicators of the developers' work were derived, e.g., productivity (p8, p9, p17, p21, p22, p23, p25), collaboration (p8, p9, p21, p22, p23), experience (p8, p9, p21, p22, p26), interaction (p8, p22), and task accomplishment (p8, p9, p23, p25, p38);
- overview consists of investigating how pms understand the developers' work by comparing different factors regarding the developers' personality and activity. pms interviewed developers to identify the most appropriate profile for a task (p15, p33) and compared methods used to estimate the developers' work (p29, p35); and
- tool was used to support pm activities. automated resource allocation in software projects can help pms analyze the variables needed for allocation, and pms can examine the software development process through information extracted from software repositories (p3, p13, p14, p30, p39) and from the developers' evolution (p5, p12, p28, p30, p31, p40).
to answer the secondary research question q-s.2 (what type of research methodology is often used for studies in this area?), we mapped the studies according to the research methodology used (figure 3):
- proposal of solution: proposes a solution technique and defends its relevance with a small example or a good argumentation. 13 studies (31.7%) classified;
- validation research: investigates a solution technique within a specific context through experiments, surveys, or interviews to answer a particular research question, without requiring more formal experimental methods (e.g., hypothesis testing, controlled experiments). 16 studies (39.0%) classified;
- evaluation research: investigates the relationship among phenomena through formal experimental methods in which causal properties are studied empirically, such as case studies, field studies, and field experiments. 12 studies (29.3%) classified;
- experience studies: explain how something has been done in practice, based on the author's experience. no study classified;
- opinion studies: report the author's opinion on how things should be. no study classified; and
- philosophical studies: structure the information regarding a specific field, like a specific taxonomy or conceptual framework, resulting in a new way of looking at existing things. no study classified.
we used three levels of research maturity (high, medium, and low rigor) related to the study subject (garousi et al. 2015). the "proposal of solution" methodology has low rigor because it provides only simple examples to verify applicability. the "validation research" methodology has medium rigor because it does not include hypothesis testing or discussions on threats to validity. the "evaluation research" methodology has high rigor because it includes hypothesis testing and discussions on threats to validity.
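as a compact illustration of this maturity reading, the sketch below maps each methodology to its rigor level and tallies the study counts reported above (figure 3):

```python
# tally of studies per rigor level, using the methodology-to-rigor mapping
# described above and the study counts from figure 3.

RIGOR = {
    "proposal of solution": "low",
    "validation research": "medium",
    "evaluation research": "high",
}

STUDIES = {
    "proposal of solution": 13,
    "validation research": 16,
    "evaluation research": 12,
    "experience studies": 0,
    "opinion studies": 0,
    "philosophical studies": 0,
}

tally = {}
for methodology, count in STUDIES.items():
    level = RIGOR.get(methodology, "unclassified")
    tally[level] = tally.get(level, 0) + count

print(tally)  # {'low': 13, 'medium': 16, 'high': 12, 'unclassified': 0}
```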
given the distribution of studies across these methodologies (figure 3), there are more empirical studies (validation and evaluation research) than proposals of solution, indicating considerable research rigor in this area.

figure 3. research methodology

to answer the secondary research question q-s.3 (how is the proposed solution related to the research methodology in the included studies?), we performed an integrated analysis of the results obtained for q-s.1 and q-s.2. the number of studies by research methodology and type of primary solution is presented in figure 4.

figure 4. relationship between methodologies and proposal of solution

we can observe that many studies presented the "metric", "method", and "model" solutions under the "proposal of solution", "validation research", and "evaluation research" methodologies, i.e., these solutions were evaluated at all three maturity levels (high, medium, and low rigor). however, many intersections have zero values. for example, the studies presenting the "tool" solution conducted only evaluations with low and medium rigor, using the "proposal of solution" and "validation research" methodologies; this solution therefore still needs robust empirical studies. besides, no solutions relate to the "experience studies", "opinion studies", or "philosophical studies" methodologies, which highlights the need for such studies in this field.

to answer the primary research question q-p.1 (what metrics are used by pms to measure the developers' work?), we collected 64 metrics used to measure the developers' work (table 5) and categorized them into 6 groups (figure 5). in this categorization, we considered the similarity between the meanings and purposes of the metrics: we merged two metrics with different names when their aims and values were the same, and, when metrics with the same name measured different information, we separated them into two or more metrics and changed their names. one example is the pair collaboration (files) and collaboration (interaction): the first relates to joint work on code files, and the second to the exchange of information and help among teammates. the categories are:
- quality (qua). this group refers to the quality of the work delivered (task or source code). it covers the martin (martin 1994), ck (chidamber and kemerer 1994), size, and complexity metrics (p8, p9, p11, p14, p23, p25, p33, p39, p38). it comprises 11 metrics in 9 studies (22%);
- contribution (con). this group refers to the amount of work performed by the developers on the software artifacts (p2, p3, p4, p5, p8, p9, p10, p11, p12, p13, p14, p16, p17, p18, p19, p20, p21, p22, p23, p24, p25, p26, p27, p28, p29, p30, p32, p33, p34, p35, p37, p39, p40). it covers metrics related to the number of commits and to the number of modified/added/removed lines of code (code churn). it comprises 39 metrics in 33 studies (80.5%);
- collaborative work (cow). this group refers to information sharing and teamwork in developing the same software artifact (p3, p7, p8, p9, p10, p12, p14, p18, p21, p22, p23, p30, p39, p40). it covers metrics related to messages exchanged about bugs and to commits performed or lines of code written in the same artifact. it comprises 4 metrics in 14 studies (34.1%);
- degree of importance (imp). this group refers to technology domains, developer reputation, participation time in projects, and type of work (e.g., bug fixing) (p4, p5, p7, p8, p9, p13, p14, p15, p17, p18, p19, p20, p21, p22, p26, p27, p29, p31, p32, p33, p34, p35, p37). it comprises 20 metrics in 23 studies (56.1%);
- productivity (pro). this group refers to the number of tasks performed within a given time (p2, p8, p17, p20, p24, p25, p33, p34, p36). it relates to the number of modified/added/removed lines of code, the number of commits, or the report on tasks completed in a period. it comprises 7 metrics in 9 studies (22%); and
- behavior (beh). this group refers to the developers' behavior, such as focus, proactivity, communication skills, and interaction with the team, to identify the members' engagement and commitment to the project (p7, p8, p15, p19, p22, p33). it comprises 5 metrics in 6 studies (14.6%).
for example, the truck factor metric calculates the minimum number of developers who have to leave a project for it to be delayed (with high probability). a value of 1 indicates that one developer holds all the knowledge on the project, so development can be delayed if this developer abandons it; analogously, a value of 3 indicates that 3 developers hold all the knowledge, and development can be delayed if they abandon it. if only a single developer holds all the knowledge on the project, the hero metric flags that developer as a "hero". both metrics consider the developers' contribution towards devising the files, i.e., their level of expertise in the project (con, imp).

figure 5. categories of metrics

table 5. metrics to measure the developers' work (# | metric | interpretation | reference)
1 | % of commits | con | p22, p8
2 | % of lines of code | imp, con | p19, p35
3 | authorship by guilt | con | p35
4 | cdi | qua | p14
5 | change code documentation (cad) | con | p8, p11
6 | ck metrics | qua | p9, p38
7 | close a bug (bcl) | imp | p9, p19, p8
8 | close a bug that is then reopened (bcr) | imp | p8
9 | close a lingering thread (mct) | pro | p8
10 | code duplication | qua | p11
11 | collaboration (codechurn) | cow, con | p7
12 | collaboration (files) | cow, con | p14, p3, p10, p12, p30
13 | collaboration (interaction) | cow, beh | p21, p40
14 | comment on a bug report (bcr) | con | p19, p8
15 | commit binary files (cbf) | con | p8
16 | commit code that closes a bug (ccb) | con | p39, p8
17 | commit comment that includes a bug report number (cbn) | con, imp | p17, p19, p8
18 | commit documentation files (cdf) | con | p39, p8
19 | commit fixes code style (csf) | qua, con | p39, p8
20 | commit multiple files in a single commit (cmf) | con | p19, p22, p28, p8
21 | commit new source file or directory (cns) | con | p8
22 | commit with empty commit comment (cec) | con | p8
23 | commitment | imp, beh | p7, p33, p15
24 | commits versatility | con | p34
25 | complexity and size | qua | p9, p23, p38, p11
26 | contribution duration | con, imp | p34
27 | contribution factor | con | p8
28 | contribution start | con, imp | p34
29 | cost | pro | p2
30 | cqi | qua | p14
31 | developer fragmentation | con | p14, p22, p24
32 | degree of authorship (doa) | con | p29, p35
33 | effort on commit | pro, con | p17
34 | effort per modification | pro, con | p25, p2
35 | expected shortfall (es) | con, imp | p37
36 | expertise | imp | p18, p31
37 | expertise breadth of a developer (ebd) | con, imp | p26
38 | expertise of a developer (ed) | con, imp | p21, p26
39 | first reply to thread (mrt) | imp | p8
40 | hero | con, imp | p13
41 | knowledge at risk (kar) | con, imp | p37
42 | knowledge loss | con, imp | p32, p37
43 | last committer "takes all" | con | p35
44 | link a wiki page from documentation file (wlp) | qua | p8
45 | martin metrics | qua | p38, p11
46 | mastery of technologies | imp, con | p27, p33
47 | monthly effort | pro | p20, p2
48 | mtbc | pro | p34
49 | number of active days | con, imp | p9, p12, p20, p19, p30, p34, p29, p39
50 | number of code churn | con | p20, p34
51 | number of commits | con | p14
52 | number of lines of code (nloc) | con, cow | p16, p19, p3, p9, p14, p21, p18, p22, p23, p39, p8, p40
53 | participate in a flame war (mfw) | beh | p8, p14
54 | qcte | qua | p5, p4, p14, p35
55 | qmood metrics | qua | p38
56 | report a bug (brp) | imp | p8
57 | rework | qua | p25, p33
58 | source abandoned | con, imp | p32, p37
59 | start a new thread (mst) | con | p8
60 | start a new wiki page (wsp) | con | p8
61 | status | imp | p22
62 | task delivery | pro | p24, p33, p36
63 | truck factor | con, imp | p13, p29, p35
64 | update a wiki page (wup) | con | p22

to answer the primary research question q-p.2 (how are metrics applied by pms to monitor the developers' work?), we considered (i) the data sources used to obtain information on the developers' work (table 6), (ii) the method for extracting this information and applying the metrics (table 7), (iii) the context in which the metrics were extracted (figure 6), and (iv) the presentation of the results for pms to analyze (table 8). we identified 5 data sources: i) source code repositories (supported by version control systems) in 28 studies (68.3%); ii) bug repositories in 7 studies (17.1%); iii) activity management repositories in 4 studies (9.8%); iv) source code files in 5 studies (12.2%); and v) people (team members and/or pms) in 7 studies (17.1%). we point out that 8 studies used more than one data source (19.5%). the first four data sources provide impersonal data; only the last one offers subjective data.

table 6. data sources on the developers' work
code repositories: p3, p5, p8, p10, p9, p12, p13, p14, p16, p19, p20, p21, p22, p24, p23, p25, p26, p28, p29, p30, p31, p32, p34, p35, p37, p38, p39, p40
bug repositories: p8, p9, p17, p19, p23, p25, p39
activity management repositories: p8, p19, p22, p36
source code files: p2, p4, p11, p17, p18
people: p1, p6, p7, p15, p27, p33, p41
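since code repositories dominate this list and feed the two base metrics discussed later (number of commits and nloc), the sketch below illustrates the kind of automated extraction these studies perform over a git-managed repository. the repository path is a placeholder, and real miners must also handle merges, file renames, and author aliases:

```python
import subprocess
from collections import defaultdict

# minimal sketch: derive "number of commits" and "code churn" (added + removed
# lines) per developer from a local git repository. the path is hypothetical;
# production miners also handle merges, renames and author aliases.

def mine_repository(repo_path="/path/to/repo"):
    log = subprocess.run(
        ["git", "-C", repo_path, "log", "--numstat", "--pretty=format:@%an"],
        capture_output=True, text=True, check=True).stdout
    commits = defaultdict(int)
    churn = defaultdict(int)
    author = None
    for line in log.splitlines():
        if line.startswith("@"):          # commit header: author name
            author = line[1:]
            commits[author] += 1
        elif line.strip():                # numstat line: added, removed, file
            added, removed, _ = line.split("\t", 2)
            if added != "-":              # "-" marks binary files
                churn[author] += int(added) + int(removed)
    return commits, churn

# commits, churn = mine_repository()
# for dev in commits:
#     print(dev, commits[dev], churn[dev])
```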
table 7. strategies for data extraction and metrics application
automated tools: p2, p3, p4, p5, p8, p9, p10, p12, p11, p13, p14, p16, p17, p18, p19, p20, p26, p28, p29, p30, p31, p35, p36, p39, p40
manual process: p21, p22, p23, p24, p25, p32, p34, p37, p38
questionnaires, pms' intuition, focus groups, and interviews: p1, p6, p7, p27, p33, p15, p41

we found two strategies for extracting data and applying metrics from the impersonal data sources (table 7): i) use of automated tools (25 studies, 61%); and ii) manual process (9 studies, 22%). for the subjective data source, we identified a strategy composed of questionnaires, pms' intuition, focus groups, and interviews (7 studies, 17.1%).

table 8. strategies to present information
textual: p1, p4, p13, p14, p16, p19, p18, p22, p25, p26, p27, p29, p30, p31, p34, p35, p36, p39
graphical: p2, p9, p10, p13, p11, p14, p17, p16, p18, p19, p20, p22, p25, p26, p31, p34, p38, p40
visualization techniques: p3, p5, p7, p9, p10, p12, p16, p28, p30, p39

additionally, we analyzed the data sources to identify the context in which the metrics were applied (figure 6).

figure 6. extraction context of the metrics' values

for the extraction of impersonal data (33 studies), we analyzed the type of software from which the metrics were extracted: i) proprietary software (7 studies, 21.2%); ii) open-source software (23 studies, 69.7%); and iii) the author's own example (3 studies, 9.1%). for the extraction of subjective data (8 studies), we analyzed the contexts in which interviews, questionnaires, and focus groups were held: i) academic (3 studies, 37.5%); ii) industrial (3 studies, 37.5%); and iii) industrial and academic (2 studies, 25%). in the open-source context, the metrics were applied to a median of three software systems per study (p3, p4, p5, p8, p9, p10, p12, p13, p14, p16, p17, p18, p19, p20, p26, p29, p30, p31, p34, p35, p36, p38, p39: 23 studies). the most widely used open-source systems for data extraction were eclipse (p4, p17, p19, p31: 4 studies), jedit (p5, p9, p13, p30: 4 studies), apache (p26, p31, p36: 3 studies), and openstack (p20, p31: 2 studies). in the proprietary-software context, the studies applied the metrics to a single system each (p11, p16, p21, p23, p25, p32, p40: 7 studies). in the author's-example context, the studies used illustrative examples to show how the metric values are extracted (p22, p24, p28: 3 studies). to extract subjective data in the academic context, undergraduate students in computer science and business informatics were interviewed (p1, p15, p41: 3 studies); in the industrial context, interviews with product engineers and project managers were carried out (average of 12 people per study) (p2, p6, p7, p27, p33: 5 studies). from this classification of extraction contexts, we identified that the studies analyzed the same metrics in different contexts; therefore, the applicability of the metrics to measure the developers' work does not depend on a specific context. it is sufficient to apply the metrics to software (proprietary or open-source) for impersonal data extraction and to developers for subjective data extraction.
besides, we identified three strategies to present information to pms (32 studies, 78%) (table 8): i) textual (18 studies, 56.3%); ii) graphical (18 studies, 56.3%); and iii) visualization techniques (10 studies, 31.3%). we point out that 13 studies combined two or more strategies (40.6%).

to answer the primary research question q-p.3 (how do metrics concerning the developers' work support project management, especially with regard to risk management and people management?), we identified the project management activities supported by the information obtained by measuring the developers' work. it is worth noticing that we defined these activities using the researchers' interpretation: we analyzed the objectives and findings of the studies against the background presented in section 2. for example, lima and elias (2019) proposed a systematic approach to assign people to a specific activity according to their personality and skills; we used this kind of concept to categorize the activities. after extracting the data, we merged some categories according to the similarity of their names and definitions. in total, we found 7 activities supported by measuring the developers' work (table 9), and we highlight that 16 studies (48.5%) treated two or more activities:
- identify the skills and the profile of developers (25 studies, 61%);
- plan improvements to code quality (4 studies, 9.8%);
- improve team performance (12 studies, 29.3%);
- estimate project costs and deadlines and identify anomalies in developer performance (15 studies, 36.6%);
- understand and control knowledge distribution (9 studies, 22%);
- adjust pay (2 studies, 4.9%); and
- identify the need for investment (training or equipment) (5 studies, 12.2%).

table 9. activities supported by information about the developers' work
identify the skills and the profile of developers: p24, p1, p3, p5, p7, p8, p9, p10, p11, p12, p14, p15, p20, p21, p22, p23, p25, p27, p30, p31, p33, p34, p36, p40, p41
plan improvements to code quality: p14, p16, p22, p24
improve team performance: p1, p7, p12, p15, p16, p18, p19, p22, p23, p32, p33, p37
estimate project costs and deadlines and identify anomalies in developer performance: p1, p2, p6, p8, p9, p11, p12, p14, p17, p19, p18, p22, p28, p34, p33
understand and control knowledge distribution: p8, p13, p23, p26, p29, p32, p33, p35, p37
adjust pay: p19, p23
identify the need for investment (training or equipment): p8, p15, p17, p24, p28

6 discussion

in this section, we discuss the results obtained with the sms and present the opinions of four pms. we collected their opinions to understand to what extent the collected metrics could be used to analyze the developers' work; the opinions were collected using the interview described in appendix b. there is a wide variety of existing metrics to measure the developers' work (64 metrics). the two most commonly used metrics are nloc (number of lines of code) (12 studies) and the number of commits (9 studies). table 10 presents the 45 metrics created from these two. the other 17 metrics, unrelated to nloc and the number of commits (table 11), are associated with bugs (2 metrics), documentation (6 metrics), behavior (2 metrics), and quality (7 metrics).
table 10. metrics created from nloc and number of commits
1 % of commits; 2 % of lines of code; 3 authorship by guilt; 4 close a bug (bcl); 5 close a bug that is then reopened (bcr); 6 close a lingering thread (mct); 7 code duplication; 8 collaboration (codechurn); 9 collaboration (files); 10 comment on a bug report (bcr); 11 commit binary files (cbf); 12 commit code that closes a bug (ccb); 13 commit comment that includes a bug report number (cbn); 14 commit documentation files (cdf); 15 commit fixes code style (csf); 16 commit multiple files in a single commit (cmf); 17 commit new source file or directory (cns); 18 commit with empty commit comment (cec); 19 commits versatility; 20 contribution duration; 21 contribution factor; 22 contribution start; 23 cost; 24 developer fragmentation; 25 degree of authorship (doa); 26 effort on commit; 27 effort per modification; 28 expected shortfall (es); 29 expertise breadth of a developer (ebd); 30 expertise of a developer (ed); 31 expertise; 32 hero; 33 knowledge at risk (kar); 34 knowledge loss; 35 last committer "takes all"; 36 mastery of technologies; 37 monthly effort; 38 mtbc; 39 number of active days; 40 number of code churn; 41 participate in a flame war (mfw); 42 source abandoned; 43 status; 44 task delivery; 45 truck factor

table 11. metrics unrelated to nloc and number of commits
bugs: rework; report a bug (brp)
documentation: change code documentation (cad); first reply to thread (mrt); link a wiki page from documentation file (wlp); start a new thread (mst); start a new wiki page (wsp); update a wiki page (wup)
behavior: commitment; collaboration (interaction)
quality: cdi; ck metrics; cqi; martin metrics; qcte; qmood metrics; size

despite the variety of metrics, most of them quantify the developers' work in terms of their contributions to the development of software artifacts. hence, 34 metrics form the contribution group, which shares metrics with all the other groups except the behavior group, since the latter considers aspects such as focus and commitment rather than the work done by the developers. the pms could choose, among the 64 metrics identified in the sms, any they considered essential to measure the developers' work. as a result, they mentioned 29 metrics, 9 of which were cited by two or more pms. three pms chose the same 3 metrics (task delivery, number of commits, and rework), indicating that performance in task delivery, frequency of contribution to the artifacts, and the generation of solutions that do not need rework are seen as the most relevant inputs to measure the developers' work. another interesting point is the perception of the two pms who work in the same company (pm1 and pm4): although they have the same amount of experience in company a, they hold different opinions on measuring the developers' work; while pm1 highlighted only 1 metric, pm4 highlighted 24. this result shows the difficulty of reaching a consensus on how to measure the developers' work. none of the pms suggested a metric beyond the 64 presented in table 5. nevertheless, pm1 mentioned code smells as a possible metric, although he/she did not justify its use; the "number of code smells" could perhaps be used to verify the quality of the solutions implemented by the developer, but how to use it would still need to be defined.
besides, pm2 said, "i use the task delivery, status, and contribution duration metrics under different names (throughput, lead time, and cycle time, respectively)". one interesting observation is that the selected studies used more than one metric for measuring the developers' work. this may be due to the complexity of the task, which requires a lot of data to analyze the team members' performance as fairly as possible. the proposed organization of the metrics into groups provides an overview of the "types of information" used by pms; however, no study covered metrics from all groups.

with regard to the artifacts used to extract the metric values, we found the source code repositories (managed by version control systems) to be the most frequently used for collecting information about the developers' work. the information collected consists mainly of the execution of commits and the addition/modification/removal of lines of code. as previously mentioned, these indicate the degree of contribution to the project and are the basis for most of the metrics listed; for example, we can trace the activity performed through the commit message, such as whether the activity consisted of fixing a bug. another interesting point about nloc and the number of commits is the possibility of applying filters, which can be: i) information granularity levels (e.g., the number of commits performed by one developer in a directory, a file, or a line of code); and ii) time intervals (e.g., how much time the developer spends adding lines of code to the repository, or how much the developer has contributed to the project in recent months).
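both filter types, and the bug-fix trace over commit messages mentioned above, can be expressed directly through the repository's query interface; the sketch below uses git for illustration (the directory, time window, and keyword list are assumptions):

```python
import subprocess

# sketch of the two filter types discussed above, applied through git itself:
# a granularity filter (restrict to one directory) and a time filter (recent
# months), plus a naive bug-fix trace over commit messages. the path, window
# and keyword list are illustrative assumptions.

def filtered_commits(repo="/path/to/repo", author="alice",
                     subdir="src/core", since="6 months ago"):
    out = subprocess.run(
        ["git", "-C", repo, "log", f"--author={author}", f"--since={since}",
         "--pretty=format:%s", "--", subdir],
        capture_output=True, text=True, check=True).stdout
    messages = [m for m in out.splitlines() if m]
    bug_fixes = [m for m in messages if any(k in m.lower()
                                            for k in ("fix", "bug", "defect"))]
    return len(messages), len(bug_fixes)

# total, fixes = filtered_commits()
# print(f"commits in window: {total}, flagged as bug fixes: {fixes}")
```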
another example of information available in the source code repositories is the degree of collaboration among developers. the possibilities offered by the data obtained from source code repositories are thus numerous and diverse, in addition to supporting the two most representative metrics (number of commits and nloc); this may explain why code repositories are the data source most often used in the studies found in the sms. source code files are another data source, but one limited to questions of code complexity and quality. analyzing source code files within an integrated development environment (e.g., the eclipse ide) overcomes that limitation, since the ide can be enhanced with plugins that enrich the information; for example, we can determine how long a developer has kept a source code file open or which developer changed the source code. using bug repositories helps to characterize the developers' work regarding the detection, correction, or introduction of bugs in the software, providing information on when the developer contributed to correcting bugs or needed to rework. task repositories allow the management of project activities (including bug-related ones), which can be supported by tools such as spreadsheets or specialized systems; by analyzing the records of tasks performed by the developers, it is possible to identify information such as the history of tasks delivered by the developer, including extra data (e.g., duration) and the types of tasks. this information can be extracted manually or automatically (through integration with the task repositories). people's opinions are collected through feedback from team members and are based on the pms' intuition and knowledge about the developers, thus producing more subjective data about their work; possible information includes the type and frequency of activities, technical and personal skills, satisfaction, motivation, behavior, and personality. questionnaires, interviews, or focus groups are used to extract such information.

interestingly, combining source code repositories with other data sources is common, and they remain a valuable data source for obtaining information about the developers' work; again, the complexity of quantifying the developers' work explains this combination. such complexity can also be the reason why the use of automated tools for data extraction and metrics application is the most commonly used strategy to obtain the metrics on the developers' work (28 studies 68.3%). the automated tools include web systems, software executed via command terminals, and plugins for integrated development environments. among them, web systems offer the most value to pms, because they can be accessed from any device and operating system (through a web browser), whereas the other tools have no graphical interface and are more limited. another aspect observed in the sms, also motivated by the complexity of measuring the developers' work, is the balance among the three ways of presenting the measured information to pms (textual, 56.3%; graphical, 56.3%; and visualization techniques, 31.3%): the studies used text for uncomplicated information, graphics for grouped information, and visualization techniques when the amount of information displayed was larger. several studies (40.6%) combined these presentation forms.

we also noticed that 7 activities inherent to pms can be supported by measuring the developers' work. these activities can be related to the definitions described in the pmbok for people management, risk management, project quality management, and resource management. in the "identify the skills and the profile of developers" activity, the pm chooses the members of a project team considering the skills and professional profile required to achieve the project goals; this supports task allocation, assigning the most appropriate person to each function (people management). risk management can be supported, for example, in the following way: the pm may consider that the people available for the project do not have sufficient skills in the project technologies (risk: not achieving excellent project performance), so close monitoring of the team's work becomes necessary, as well as hiring a consultancy firm. in the interview, the pms pointed out the metrics they consider useful for the "identify the skills and the profile of developers" activity: they identified 21 metrics among the 64 found in the sms. the most frequently mentioned metric (three pms) was commitment, which provides information on the developer's behavior. other metrics mentioned by more than one pm were collaboration (interaction), mastery of technologies, contribution factor, and complexity and size (together with commitment, 5 metrics). the pms' vision is to measure how committed the developer is to the project and the team, how much the developer collaborates with colleagues, and his/her mastery of the project technologies, which enables contributions to the implementation of complex solutions for the software. besides, the other metrics considered relevant by more than one pm provide information about the complexity of the tasks performed by the developer, the contribution factor in software artifacts, the technical capacity in the project technologies, and the performance in collaborating with colleagues. in the "plan improvements to code quality" activity, pms analyze the quality of the solutions delivered by the developers.
the solutions must meet the expectations of those involved with the project (including the team, pms, and customers) and should be good enough to avoid rework. quality is a factor related to project management, and the pmbok has an area of knowledge for it, called project quality management. in the interview, the pms pointed out the metrics they consider useful for the "plan improvements to code quality" activity: they identified 17 metrics among the 64 found in the sms, and two or more pms mentioned the same 2 metrics (rework and collaboration (interaction)). the pms' views on this activity diverge significantly. in pm1's opinion, documentation is the main point, considering both the writing of software-specific documents and the documentation produced with commits in code repositories. for pm2, the quality of the generated solution and the bugs reported against the lines of code the developer created are the interesting inputs for quality checking. pm3 considers the developer's experience, and how much knowledge he/she shares with the other team members, the main way to ensure satisfactory code quality. for pm4, object-oriented code quality metrics, the quality of the implemented solution, and the developer's ability to work on various parts of the code provide the inputs to observe source code quality. in general, the metrics used relate to documentation, implementation quality, and generated rework.

the "improve team performance" activity depends on measuring the developers' work, because it is necessary to measure the team's performance (to determine the current state), plan and apply actions to leverage it (to promote improvement), and then remeasure it (to determine the new state). this activity is essential for people management, such as defining teams, assigning roles, communicating, and organizing work, thus contributing to performance improvement. in the interview, the pms pointed out the metrics they consider useful for the "improve team performance" activity: they identified 15 metrics among the 64 found in the sms, of which two or more pms mentioned the same 6 metrics (collaboration (files), collaboration (interaction), mastery of technologies, expertise, collaboration (codechurn), and close a bug that is then reopened (bcr)). the pms share a similar view on performance: the developer with technical knowledge, experience, and participation in various parts of the source code achieves the best performance. besides, performance is analyzed by looking not just at one developer but at the entire team: a developer may not produce new solutions alone but may collaborate with other developers so that the team delivers a solution as soon as possible. pm2 and pm3 also highlighted the quality of the implemented solution: the developer who delivers error-free tasks of satisfactory quality performs better, since no corrections or refactoring are necessary. as with the previous activity (related to quality), the pms considered metrics related to rework in order to understand the team's performance.

in the "estimate project costs and deadlines and identify anomalies in developer performance" activity, there is a relationship with people management, because the project budget and cost can be anticipated when pms record historical information on the developers' work.
besides, recent history can help to identify when a developer is performing differently than expected (better or worse), which can be a consequence of a change in the developer's motivation or in his/her interpersonal relationships with the team. in the interview, the pms pointed out the metrics they consider useful for the "estimate project costs and deadlines and identify anomalies in developer performance" activity: they identified 24 metrics among the 64 found in the sms, and two or more pms mentioned the same 6 metrics (cost, effort per modification, rework, commitment, task delivery, and contribution factor). the four pms interviewed considered the technical ability and the delivery history of the team allocated to the project essential for estimating costs and deadlines. additionally, they said that the developers' knowledge of, and contributions to, the project's source code should be taken into account in the estimates. pm1 pointed out that performance anomalies are usually caused by the emergence of unplanned and highly complex demands, leading to high development costs.

in the "understand and control knowledge distribution" activity, there is support for risk management, because the departure of a developer who holds most of the knowledge about the source code of a project can pose high risks. therefore, pms should monitor the knowledge distribution, act towards leveling the team's knowledge, and prevent essential people from leaving the project earlier than expected. in the interview, the pms pointed out the metrics they consider useful for the "understand and control knowledge distribution" activity: they identified 17 metrics among the 64 found in the sms, and two or more pms mentioned the same 6 metrics (commit documentation files (cdf), collaboration (files), collaboration (interaction), collaboration (codechurn), change code documentation (cad), and degree of authorship (doa)). in the pms' opinion, it is possible to identify a developer's knowledge of the system by observing his/her contributions to its documentation and code. pm3 also considers commitment, the time working on the project, and all the activities registered in the version control system. another interesting point is to measure how much the developer shares his/her knowledge; pm2, pm3, and pm4 highlighted this information. among the studies found in the sms, one addressed knowledge distribution with greater emphasis (ferreira et al. 2017): the authors calculated the doa metric for all files in the code repository in order to identify the developers with the most knowledge of the project, and this calculation feeds another metric (the truck factor). however, the pms preferred the doa metric over the truck factor: they would rather look at individual files than at the entire set of files in the repository, identifying the developers' knowledge of specific parts of the source code.
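to make the contrast concrete, the sketch below computes a per-file doa score and a greedy truck-factor estimate on top of it. the doa coefficients follow the linear model popularized in the truck-factor literature and are an assumption here (they are not taken from the mapped studies), as are the creator proxy, the 0.75 authorship cutoff, and the 50%-orphaned stop rule:

```python
from math import log

# simplified sketch: degree of authorship (doa) per (developer, file) and a
# greedy truck-factor estimate. coefficients and thresholds are assumptions
# from the truck-factor literature, not values extracted from the mapped
# studies. fa = 1 if the developer created the file, dl = the developer's
# changes to it, ac = changes by all other developers.

def doa(fa, dl, ac):
    return 3.293 + 1.098 * fa + 0.164 * dl - 0.321 * log(1 + ac)

def file_authors(changes):
    """map each file to the developers whose doa is close to the file's best."""
    authors = {}
    for f, per_dev in changes.items():
        creator = max(per_dev, key=per_dev.get)       # crude creator proxy
        total = sum(per_dev.values())
        scores = {d: doa(1 if d == creator else 0, n, total - n)
                  for d, n in per_dev.items()}
        best = max(scores.values())
        authors[f] = {d for d, s in scores.items() if s >= 0.75 * best}
    return authors

def truck_factor(changes):
    """remove top authors greedily until more than half the files are orphaned."""
    authors, gone, tf = file_authors(changes), set(), 0
    while sum(1 for devs in authors.values() if devs <= gone) <= len(authors) / 2:
        counts = {}
        for devs in authors.values():
            for d in devs - gone:
                counts[d] = counts.get(d, 0) + 1
        gone.add(max(counts, key=counts.get))
        tf += 1
    return tf

# hypothetical change counts: file -> {developer: number of changes}
print(truck_factor({"a.py": {"ana": 9, "bia": 1},
                    "b.py": {"ana": 5},
                    "c.py": {"caio": 7, "ana": 2}}))  # -> 1 (ana dominates)
```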
in the "adjust pay" activity, the work done by the development team is recognized, keeping professionals motivated and satisfied; this activity is critical when managing people. in the "identify the need for investment" activity, equipment is acquired and training is provided in order to meet project needs. in the interview, the pms pointed out the metrics they consider useful for the "adjust pay" activity: they identified 30 metrics among the 64 found in the sms, and two or more pms mentioned the same 12 metrics (commitment, expertise, rework, collaboration (files), mastery of technologies, task delivery, monthly effort, contribution factor, number of active days, cost, contribution start, and contribution duration). compared with the other activities, this one received the largest number of different metrics, and the rate of metrics chosen by more than one pm was higher; this variety may be due to the activity's sensitivity and the concern with making compensation adjustments in the most adequate way. in general, according to the pms, the developers' commitment and technical ability, the quality of the generated solutions, and the time spent on the project are the primary information. for the "identify the need for investment" activity, the pms identified 23 metrics among the 64 found in the sms, and two or more pms mentioned the same 3 metrics (close a bug that is then reopened (bcr), mastery of technologies, and effort on commit). for the pms, it is necessary to understand whether the developers are technically competent to perform their activities; hence, the selected metrics indicate the level of effort to accomplish tasks, the mastery of the project technologies, and the number of errors and the rework in the solutions delivered by the developers.

table 12 presents the metrics that the pms cited ten or more times. the pms considered them relevant, and their opinion provides a few professionals' views on the metrics found in the literature; these results therefore cannot be generalized to the entire industrial context.

table 12. metrics most selected by pms (metric | number of citations | mentioned by)
1 collaboration (interaction) | 15 | pm2: 1, 3, 5; pm3: 1, 2, 3, 4, 5, 6, 7, general; pm4: 1, 3, 5, general
2 rework | 14 | pm2: 1, 2, 3, 4, 5, 6, 7, general; pm3: 6, general; pm4: 2, 4, 6, general
3 collaboration (files) | 12 | pm2: 3, 5, 6; pm3: 3, 4, 5, 6, 7; pm4: 1, 3, 5, general
4 expertise | 12 | pm1: 3; pm2: 6; pm3: 1, 2, 3, 5, 6, 7, general; pm4: 4, 6, general
5 mastery of technologies | 12 | pm1: 3; pm2: 1, 3, 6, 7; pm3: 3, 5, general; pm4: 1, 6, 7, general
6 commitment | 11 | pm1: 6; pm2: 1, 4, 6; pm3: 1, 3, 4, 5, 6, general; pm4: 1
7 cost | 11 | pm2: 4; pm3: 1, 2, 3, 4, 6, 7, general; pm4: 4, 6, general
legend: 1 = identify the skills and the profile of developers; 2 = plan improvements to code quality; 3 = improve team performance; 4 = estimate project costs and deadlines and identify anomalies in developer performance; 5 = understand and control knowledge distribution; 6 = adjust pay; 7 = identify the need for investment (training or equipment); general = considered relevant to project management (unspecified activity).

7 threats to validity

internal validity refers to the effects of the treatments on the variables due to uncontrolled factors in the environment (wohlin et al. 2012). limitations of the search string and of the digital libraries can lead to an incomplete selection of studies; we selected five search engines and adapted the search string to achieve our goal, but other digital libraries and keywords could still be added to the search string.
another possible threat is the researchers' bias in selecting the studies and answering the research questions (e.g., the identification and grouping of metrics and the interpretation of project management activities). to mitigate this, researchers a and b discussed the results while generating the groups presented, monitored by researcher c; finally, researcher d (the most experienced researcher) evaluated the work performed and suggested improvements to ensure the impartiality and quality of this study. moreover, we carefully defined and reported the search string, the digital libraries chosen, and the inclusion and exclusion criteria to ensure the replicability of the sms.

external validity relates to whether the results can be generalized outside the experimental setting (wohlin et al. 2012). one threat to external validity concerns whether the selected studies are representative. given the amount of information collected, we argue that they are; however, we only considered studies from the formal literature, and the selection could be extended by considering the gray literature. our findings focus on evaluating the developers' work, and we currently have no intention of generalizing the results beyond this field.

construct validity represents the measurement of the concepts of cause and effect in the experiment through dependent and independent variables (wohlin et al. 2012). to ensure that the sms was impartial, comprehensive, and of high quality, four researchers took part in the definition and execution of the research protocol. the protocol used to select the studies was validated with a control group (7 studies); the protocol retrieved 6 studies from this group, and we added the remaining one manually, a consequence of our decision to restrict where the keywords are searched (part 1 only in the title; parts 2-5 in the title, abstract, and keywords of the studies). with regard to data extraction, we defined a classification scheme; however, extraction was a manual process, and we cannot claim that it was carried out mistake-free. the data extraction process required an understanding of the subject to infer non-explicit data, which made it exhausting and complicated.

conclusion validity refers to the extension of the conclusions about the relationship between the treatments and the outcomes (wohlin et al. 2012). we followed a systematic approach for conducting the sms and described all the procedures to ensure this study's replicability.

8 final remarks

project managers (pms) are professionals whose task is to successfully lead software projects by applying project management practices. aiming to support pms, several studies in the literature have proposed strategies to measure the developers' work. in this article, an exploratory study was carried out using a systematic mapping study (sms) and interviews with pms to organize the concepts related to this topic. in this context, we answered three primary research questions specific to our study field (q-p.1 what metrics are used by pms to measure the developers' work?, q-p.2 how are metrics applied by pms to monitor the developers' work?, and q-p.3 how does information about the developers' work support project management?).
additionally, we answered three secondary research questions, which are common to sms studies (q-s.1 what type of solution is often proposed for studies in this area?, q-s.2 what type of research methodology is often used for studies in this area?, and q-s.3 how is the proposed solution related to the research methodology in the included studies?). the responses were based on an analysis of the 41 studies found using the sms and on the opinions of four pms.

our contributions are:
i) identification of the studies' maturity;
ii) identification of solution proposals to investigate the study theme;
iii) identification of 64 metrics;
iv) organization of the metrics into 6 groups;
v) identification of 4 data sources to obtain information;
vi) identification of the data extraction context and metrics application;
vii) identification of 7 activities for which the pm is responsible (most of these activities are related to risk management and people management), supported by a measurement of the developers' work;
viii) the opinion of four pms on the usefulness of the 64 metrics;
ix) the opinion of four pms on how the 64 metrics relate to the 7 activities under their responsibility; and
x) characterization of the aspects to explore the subject, indicating themes for possible new studies in the area of software engineering.

the suggestions for future work include:
i) verification of the validity of the metrics found by collecting a larger number of pm opinions, allowing a quantitative analysis;
ii) assessment of the validity of the metrics found by conducting a field study;
iii) evaluation of approaches to measure the developers' work considering the industrial and proprietary software context;
iv) creation of new approaches (or advances in existing ones) that consider diversified metrics providing information about work quality, contribution by the developer, collaboration, level of importance to the project, productivity, and personal behavior; and
v) combination of three forms of presenting information (textual, graphical, and visualization techniques).

references

b. kitchenham and s. charters, "guidelines for performing systematic literature reviews in software engineering," 2007.
b. kitchenham, "procedures for performing systematic reviews," keele, uk, keele univ., vol. 33, pp. 1-26, 2004.
b. w. boehm, "software risk management: principles and practices," ieee softw., vol. 8, no. 1, pp. 32-41, 1991.
c. wohlin, p. runeson, m. höst, m. ohlsson, b. regnell, and a. wesslén, experimentation in software engineering. springer science & business media, new york, ny, 2012.
i. bouchrika, "top computer science conferences," guide2research, 2020, available at , last access may 9th, 2020.
i. sommerville, engenharia de software, 10th ed. pearson universidades, 2019.
j. a. lima and g. elias, "selection and allocation of people based on technical and personality profiles for software development projects," xlv latin american computing conference (clei), panama, 2019, pp. 1-10, doi: 10.1109/clei47609.2019.235052.
j. feiner and k. andrews, "repovis: visual overviews and full-text search in software repositories," in working conference on software visualization, 2018, pp. 1-11.
j. j. ahonen, p. savolainen, h. merikoski, and j. nevalainen, "reported project management effort, project size, and contract type," journal of systems and software, vol. 109, 2015, pp. 205-213.
j. menezes, c. gusmão, and h. moura, "risk factors in software development projects: a systematic literature review," software quality journal, vol. 27, 2019, pp. 1149-1174.
k. petersen, r. feldt, s. mujtaba, and m. mattsson, "systematic mapping studies in software engineering," in 12th international conference on evaluation and assessment in software engineering (ease), 2008.
k. tuma, ç. gül, and s. riccardo, "threat analysis of software systems: a systematic literature review," journal of systems and software, vol. 144, 2018, pp. 275-294.
m. ferreira, m. t. valente, and k. ferreira, "a comparison of three algorithms for computing truck factors," ieee/acm 25th international conference on program comprehension, 2017, pp. 207-217.
n. bin ali and k. petersen, "evaluating strategies for study selection in systematic literature studies," in acm/ieee international symposium on empirical software engineering and measurement, 2014, p. 45.
r. wieringa, n. maiden, n. mead, and c. rolland, "requirements engineering paper classification and evaluation criteria: a proposal and a discussion," requirements engineering, vol. 11, no. 1, pp. 102-107, 2006.
p. r. de bassi, g. m. p. wanderley, p. h. banali, and e. c. paraiso, "measuring developers' contribution in source code using quality metrics," in ieee international conference on computer supported cooperative work in design, 2018, pp. 39-44.
pmi, guide to the project management body of knowledge (pmbok® guide), 6th ed., 2017.
r. latorre and s. javier, "measuring social networks when forming information system project teams," journal of systems and software, vol. 134, 2017, pp. 304-323.
r. martin, "oo design quality metrics: an analysis of dependencies," in workshop pragmatic and theoretical directions in object-oriented software metrics, 1994.
s. r. chidamber and c. f. kemerer, "a metrics suite for object oriented design," ieee trans. softw. eng., vol. 20, no. 6, pp. 476-493, 1994.
t. ambreen, n. ikram, m. usman, and m. niazi, "empirical research in requirements engineering: trends and opportunities," requirements eng, vol. 23, pp. 63-95, 2018.
v. f. de souza, a. l'erario, and j. a. fabri, "model for monitoring and control of software production in distributed projects," in iberian conference on information systems and technologies, 2015, pp. 1-6.
v. garousi, y. amannejad, and a. b. can, "software testcode engineering: a systematic mapping," information and software technology, vol. 58, pp. 123-147, 2015.
w. zuser and t. grechenig, "reflecting skills and personality internally as means for team performance improvement," in conference on software engineering education and training, 2003, pp. 234-241.

appendix a

table a1. resultant studies from the sms
# | title | authors | year | repository | score
p1 | reflecting skills and personality internally as means for team performance improvement | zuser, w.; grechenig, t. | 2003 | ieee | 3.5
p2 | continuous productivity assessment and effort prediction based on bayesian analysis | yun, s.; simmons, d. | 2004 | ei compendex | 5.0
p3 | visualization of cvs repository information | xie, x.; poshyvanyk, d.; marcus, a. | 2006 | ieee | 4.5
p4 | a 3-dimensional relevance model for collaborative software engineering | omoronyia, i.; ferguson, j.; roper, m.; wood, m. | 2007 | ieee | 3.5
p5 | a visualization for software project awareness and evolution | ripley, r.; sarma, a.; van der hoek, a. | 2007 | ieee | 5.0
p6 | evaluating software project portfolio risks | costa, h.; barros, m.; travassos, g. | 2007 | ei compendex | 4.0
p7 | development of a project level performance measurement model for improving collaborative design team work | yin, y.; qin, s.; holland, r. | 2008 | ieee | 4.5
p8 | measuring developer contribution from software repository data | gousios, g.; kalliamvakou, e.; spinellis, d. | 2008 | acm | 4.0
p9 | mining individual performance indicators in collaborative development using software repositories | zhang, s.; wang, y.; xiao, j. | 2008 | ieee | 4.0
p10 | svnnat: measuring collaboration in software development networks | schwind, m.; wegmann, c. | 2008 | ieee | 3.5
p11 | case study: visual analytics in software product assessments | telea, a.; voinea, l. | 2009 | ei compendex | 4.5
p12 | using transflow to analyze open-source developers' evolution | costa, j.; santana jr., f.; de souza, c. | 2009 | scopus | 4.5
p13 | are heroes common in floss projects? | ricca, f.; marchetto, a. | 2010 | acm | 4.0
p14 | pivot: project insights and visualization toolkit | sharma, v.; kaulgud, v. | 2012 | ieee | 5.0
p15 | effect of personality type on structured tool comprehension performance | gorla, n.; chiravuri, a.; meso, p. | 2013 | springer | 4.0
p16 | extracting, identifying and visualisation of the content, users and authors in software projects | polášek, i.; uhlár, m. | 2013 | scopus | 5.0
p17 | towards understanding how developers spend their effort during maintenance activities | soh, z.; khomh, f.; guéhéneuc, y.; antoniol, g. | 2013 | ieee | 5.0
p18 | a machine learning technique for predicting the productivity of practitioners from individually developed software projects | lopez-martin, c.; chavoya, a.; meda-campana, m. | 2014 | ieee | 4.5
p19 | determining developers' expertise and role: a graph hierarchy-based approach | bhattacharya, p.; neamtiu, i.; faloutsos, m. | 2014 | ieee | 5.0
p20 | estimating development effort in free/open-source software projects by mining software repositories: a case study of openstack | robles, g.; gonzález-barahona, j. m.; cervigón, c.; capiluppi, a.; izquierdo-cortázar, d. | 2014 | acm | 5.0
p21 | extracting new metrics from version control system for the comparison of software developers | moura, m.; nascimento, h.; rosa, t. | 2014 | ieee | 4.5
p22 | influence of social and technical factors for evaluating contribution in github | tsay, j.; dabbish, l.; herbsleb, j. | 2014 | acm | 5.0
p23 | assessing developer contribution with repository mining-based metrics | lima, j.; treude, c.; filho, f.; kulesza, u. | 2015 | ieee | 4.0
p24 | contributor's performance, participation intentions, its influencers and project performance | rastogi, a. | 2015 | ieee | 4.0
p25 | identifying wasted effort in the field via developer interaction data | balogh, g.; antal, g.; beszedes, a.; vidacs, l.; gyimothy, t.; vegh, a. z. | 2015 | ieee | 4.5
p26 | niche vs. breadth: calculating expertise over time through a fine-grained analysis | da silva, j.; clua, e.; murta, l.; sarma, a. | 2015 | ieee | 5.0
p27 | proposal for a quantitative skill risk evaluation method using fault tree analysis | liu, g.; yokoyama, s. | 2015 | ieee | 4.0
p28 | teamwatch demonstration: a web-based 3d software source code visualization for education | gao, m.; liu, c. | 2015 | scopus | 4.5
p29 | a comparative study of algorithms for estimating truck factor | ferreira, m.; avelino, g.; valente, m.; ferreira, k. | 2016 | ieee | 5.0
p30 | knowledge discovery in software teams by means of evolutionary visual software analytics | gonzález-torres, a.; garcía-peñalvo, f.; therón-sánchez, r.; colomo-palacios, r. | 2016 | scopus | 5.0
p31 | open-source resume (osr): a visualization tool for presenting oss biographies of developers | jaruchotrattanasakul, t.; yang, x.; makihara, e.; fujiwara, k.; iida, h. | 2016 | ieee | 5.0
p32 | quantifying and mitigating turnover-induced knowledge loss: case studies of chrome and a project at avaya | rigby, p.; zhu, y.; donadelli, s.; mockus, a. | 2016 | scopus | 5.0
p33 | software project managers' perceptions of productivity factors: findings from a qualitative study | oliveira, e.; conte, t.; cristo, m.; mendes, e. | 2016 | acm | 4.5
p34 | using temporal and semantic developer-level information to predict maintenance activity profiles | levin, s.; yehudai, a. | 2016 | ieee | 5.0
p35 | a comparison of three algorithms for computing truck factors | ferreira, m.; valente, m.; ferreira, k. | 2017 | ieee | 5.0
p36 | collabcrew: an intelligent tool for dynamic task allocation within a software development team | samath, s.; udalagama, d.; kurukulasooriya, h.; premarathne, d.; thelijjagoda, s. | 2017 | ieee | 4.5
p37 | revisiting turnover-induced knowledge loss in software projects | nassif, m.; robillard, m. p. | 2017 | scopus | 3.0
p38 | measuring developers' contribution in source code using quality metrics | de bassi, p.; wanderley, g.; banali, p.; paraiso, e. | 2018 | ieee | 3.5
p39 | repovis: visual overviews and full-text search in software repositories | feiner, j.; andrews, k. | 2018 | ieee | 3.5
p40 | git2net: mining time-stamped co-editing networks from large git repositories | gote, c.; scholtes, i.; schweitzer, f. | 2019 | ieee | 4.5
p41 | selecting project team members through mbti method: an investigation with homophily and behavioural analysis | kollipara, p.; regalla, l.; ghosh, g.; kasturi, n. | 2019 | ieee | 3.0

appendix b

1. interview with pms

we collected the opinion of professionals who work with project management in the software industry and used a structured script containing the following items for conducting the interview (the questions, responses, and annotations are available, in portuguese, at http://doi.org/10.5281/zenodo.3965805):
- description (company, level of education, experience, and number of projects managed);
- opinion on which of the 64 metrics (table 5) is necessary to measure the developers' work;
- opinion on which of the 64 metrics (table 5) are useful to support the performance of the 7 activities listed in table 9; and
- suggestion for other metrics.

we interviewed four pms from three private companies with different characteristics. by interviewing these pms, information from professionals in different contexts and projects was collected. the companies' characteristics are:
- company a is a software factory in the brazilian market and has approximately 70 employees;
- company b is a software factory operating in brazil's education area, with around 150 employees; and
- company c is an enterprise software consultancy with approximately 12,000 employees and headquarters in various countries.

table b1 presents the description of the interviewees (pm1, pm2, pm3, and pm4), who work in companies with different characteristics. pm1 and pm4 work in the same company. pm1, pm2, and pm3 are graduates, and pm4 is a postgraduate. their experience ranges from 1 to 3 years, and they have participated in the management of 6 to 10 projects.
table b1. interviewees' description
id | company | education | experience | # projects
pm1 | a | bachelor degree | 1 year | 6
pm2 | c | bachelor degree | 1.5 years | 8
pm3 | b | bachelor degree | 3 years | 10
pm4 | a | mba | 1 year | 8

figure 7 presents the steps for collecting the pms' opinions. the researchers defined the questions and devised an electronic questionnaire. then, the pms were asked to voice their opinions: we scheduled an individual interview with each pm and recorded their answers in the electronic questionnaire. finally, we compiled the responses and included them in the discussion of the results.

figure 7. steps to collect pms' opinions

journal of software engineering research and development, 2022, 10:8, doi: 10.5753/jserd.2022.786. this work is licensed under a creative commons attribution 4.0 international license.

a survey on the practices of software testing: a look into brazilian companies

italo santos [ university of são paulo | italo.santos@usp.br ]
silvana m. melo [ university of grande dourados | silvanamelo@ufgd.edu.br ]
paulo s. l. souza [ university of são paulo | pssouza@icmc.usp.br ]
simone r. s. souza [ university of são paulo | srocio@icmc.usp.br ]

abstract
[context:] software testing is essential for all software development, and techniques and criteria have been proposed to ensure its quality in different application domains. [objective:] this survey aims at the identification of software testing practices in brazilian industries, towards an overview of the latest testing techniques, selection processes, challenges faced, and tools and metrics used by testers. [methodology:] the survey questions were carefully designed to provide relevant information to both industry and academia and were evaluated by testers to improve the quality of the survey. [results and conclusions:] our study provides insights into the current software testing practices in brazilian software companies. the results show testers select a testing technique according to the scope of the project under development; however, most companies have shown a lack of importance and priority regarding the testing activity. some of the challenges raised will foster new research topics, outlined by the needs faced by testers in practice.

keywords: survey, software testing practices, software quality assurance, brazil

1 introduction

software testing is applied in the industry to assure quality through a direct analysis of the software in execution; it provides realistic feedback on software behavior, thus complementing other techniques. beyond the apparent straightforwardness of checking a sample of runs, software testing embraces a variety of activities, techniques, and actors, and poses many complex challenges (bertolino, 2007). testing is indispensable for all software development and an integral part of software engineering, in which a running program is exercised with a sample of possible inputs to check whether it behaves as specified. the number of possible inputs can be infinite for most programs, so the challenge is to use a reasonable and cost-effective number of tests while also maximizing the test suite's fault detection capability. surveys have received much attention in research and practice as a tool to systematically analyze opinions, experiences, and expectations among the population studied (torchiano et al., 2017).
a survey is a method for the collection and summarization of evidence from a large representative sample of an overall population of interest (molléri et al., 2016). in software engineering, surveys are one of the most common research methods for empirical investigations (kasunic, 2005). the literature reports several surveys related to software testing practices in different countries, such as australia (ng et al., 2004), canada (geras et al., 2004; garousi and varma, 2010; garousi and zhi, 2013), sweden (engström and runeson, 2010), and brazil (dias-neto et al., 2017). according to wohlin et al. (2011), the collaboration between industry and academia can provide improvement and innovation in the industry and helps to guide new academic research. therefore, we conducted a survey of brazilian companies to identify software testing practices, gain knowledge, and obtain an overview of the latest testing techniques, selection processes, challenges faced, tools, and metrics used by testers. the key motivation is to collect evidence on how practitioners conduct the testing activity and how they select testing techniques for a particular project; we also raise the main challenges faced by practitioners, bringing important results so that the software testing community can engage and develop new solutions.

this study portrays the software testing scenario in brazilian technology companies and provides relevant information towards benefiting testers and fostering new research topics. the survey includes responses from 185 testing practitioners. it has detected the main trends in the software testing industry and areas of strength and weakness. as a result, brazilian software companies perform more testing at the system level, which demonstrates a concern with testing the product only in the final phase of development. the testing technique selection process is conducted by different positions in a company, and a systematic process must be defined to aid decision-making. functional and structural testing are the most common testing techniques due to their high efficiency. selenium is the most used testing tool to automate the system testing task. the proportion of testers in brazilian software industries is usually much smaller than that of developers, which may not be an ideal scenario, since a tester applies tests to check the quality of the developers' work. most companies (78%) do not develop research on the testing activity. our study has summarized the main challenges indicated by the respondents regarding the execution of testing activities (e.g., lack of importance and priority of the testing activity, lack of knowledge and training, the automation process, tools, and schedule).

the remainder of the paper is structured as follows: section 2 presents some concepts of software testing; section 3 discusses some related work on testing practices; section 4
describes the survey methodology, goal, design, execution, and sample size; section 5 reports a detailed analysis of the results; section 6 is devoted to a more in-depth discussion of our findings; section 7 addresses the threats to validity; finally, section 8 provides the conclusions.

2 background

verification, validation, and testing (vv&t) activities aim to ensure the quality of the software produced by identifying and eliminating faults present in the software product. software testing aims to detect faults and analyze whether the software runs as expected (myers et al., 2011). according to abran et al. (2004), software testing is usually performed at different levels throughout the development and maintenance processes. the categorization of levels is based on either the object of testing (target) or the purpose (objective). swebok divides the testing target into three levels (unit, integration, and system). moreover, testing is applied because of specific objectives, which are more or less explicitly stated and show varying degrees of precision. the objectives of testing include, but are not limited to, reliability measurement, identification of security vulnerabilities, usability evaluation, and software acceptance, for which different approaches can be employed (bourque et al., 2014).

for each testing level, specific testing techniques can be applied. the main testing technique categories described in swebok, according to the information used to derive test cases, are functional, structural, and fault-based testing. structural or white-box testing operates on the low-level design and the implementable code and is applied at all levels of system development (e.g., unit, system, and integration testing) (saglietti et al., 2008). functional or black-box testing is based solely on requirements and specifications; it requires no knowledge of the internal paths, structure, or implementation of the software under test (copeland, 2004). mutation testing is a commonly used fault-based technique that uses knowledge about common mistakes made by developers to generate mutant programs, which are similar to the original but contain minor syntactical changes produced by mutant operators (delamaro et al., 2017). testing techniques provide standards for the systematic design of test cases that exercise both the internal logic of software components and their input and output domains (myers et al., 2011).

vegas and basili (2005) claim that the choice of a testing technique to be adopted in a software testing project is mainly based on the knowledge of the test practitioner and often does not consider all the testing techniques available in academia. aiming at understanding the gap between academia and industry related to the knowledge and application of testing approaches, previous studies conducted surveys, which bring findings about software testing practices in real software development environments in different countries (geras et al., 2004; groves et al., 2000; dias-neto et al., 2017; garousi and zhi, 2013).

3 related work

this section discusses some work on software testing practices that has contributed to our research and to the collection of data on software testing practices in brazilian companies. geras et al. (2004) conducted a regional survey on software testing and software quality assurance techniques in the province of alberta. the results show software organizations tend to train few personnel on testing-related topics.
such a practice has a two-fold impact: it reveals trends that reduce quality and identifies their causes (e.g., lack of testing), and it exposes the difficulty organizations face in adopting methodologies (e.g., extreme programming). differently from their study, our survey includes all companies located in brazil and aims at a better understanding of the selection of testing techniques.

another survey into software testing practices was conducted by ng et al. (2004) in australian industries. it focused on five significant aspects of software testing, namely testing methodologies and techniques, automated testing tools, software testing metrics, testing standards, and software testing training and education. the results enabled the development of practices in software testing and some observations and recommendations for the future of software testing in australia, for both industry and academia. our study aims to understand topics such as the testing technique selection process and to identify challenges faced by testers.

towards an understanding of the regression testing practices in sweden, engström and runeson (2010) conducted a qualitative survey with 15 industries and validated the outcomes in an online questionnaire with 32 respondents. the main finding was that regression testing practices vary significantly among organizations and at different stages of a project. the survey also highlighted the importance and challenges of automation. our study presents findings related to general testing issues rather than specific to regression testing.

garousi and varma (2010) replicated a survey into software testing practices in the canadian province of alberta five years after the original study (geras et al., 2004) to analyze possible changes. the criteria used for the design of the questions were relevance to the industry and alignment with the testing knowledge area of the software engineering body of knowledge (swebok). the results showed almost all companies performed unit and system testing, and more organizations used observations and experts' opinions to conduct usability testing. junit and ibm rational were the most widely used testing tools, and, in comparison to the original study, more companies were devoting efforts to pre-release testing.

lee et al. (2012) surveyed companies and experts involved in software testing to identify the current practices and opportunities for improvements in software testing methods and tools (stmts). they selected companies with prominent positions in the world market and assumed their testing practices would be better than others'. they analyzed such practices, and the results showed low or limited use of stmts, difficulties caused by the lack of stmts, demand for interoperability support between methods and tools of software development and testing, and the need for guidance for the evaluation of stmts or the description of their capabilities.

garousi and zhi (2013) presented a newer version of the previous study (geras et al., 2004), which surveyed canadian software testing practices. they reused the questions from the 2004 survey and added four new ones. the results showed the importance of testing-related training had increased, and functional and unit testing were the two common testing types receiving the most attention and effort.
moreover, mutation testing had drawn attention among canadian companies, and nunit and web application testing tools had overtaken the popularity of junit and ibm rational. most canadian firms spent less than 40% of their efforts (e.g., budget and time) on testing during development.

kassab et al. (2016) surveyed software professionals to determine the use of testing practices. the participants represented 18 industries located in 9 different countries, and the main findings included characteristics of projects, practices, organizations, and practitioners related to software testing.

dias-neto et al. (2017) conducted a replicated survey in two south american countries (brazil and uruguay) to understand the perception of testing practitioners regarding the use and importance of software testing practices. the main findings include: system and regression testing are the two test types deemed most useful and important; tools to support the automation of test case generation and execution or code coverage are still poorly used in their organizations; and the documentation of test artifacts (plans, cases, procedures, results) is useful and important for software testing practitioners.

santos et al. (2020a) conducted a survey focused on mobile testing techniques. their results show that most companies focus on developing native applications, and the testing level identified as most performed is the system test. participants also indicate that the correct choice of testing techniques directly influences the final product quality, and testers do not follow a defined process to select testing techniques. the testing tools most used for automation are cucumber and selenium.

we highlight the importance of surveys for the identification of software testing practices and their efficiency regarding the results achieved. our work differs from the others because we aim to investigate the practices of testing technique selection in brazilian industries. moreover, we have listed the main challenges faced by testers while performing their activities.

4 survey methodology

a survey is a comprehensive research method for the collection of information on descriptions, comparisons, explanations, knowledge, attitudes, and behavior (fink, 2003). a survey collects quantitative data, subjective data (individuals' opinions, attitudes, and preferences), and objective data (e.g., demographic information). the research process (figure 1) follows the guidelines defined by kitchenham and pfleeger (2008).

figure 1. overview of the survey research process.

1. setting of objectives: each objective is a statement of the expected outcomes or a question to be answered by the survey;
2. survey design: since the type of survey used is a cross-sectional study, the participants are asked about their previous experiences at a particular fixed point in time;
3. development of a survey instrument: such a development involves four steps, namely (i) search for relevant literature, (ii) construction of an instrument, (iii) evaluation of the instrument, and (iv) documentation of the instrument;
4. evaluation of the survey instrument: often called pretesting, this evaluation pursues the following goals: (i) checking whether the questions are understandable, (ii) assessment of the likely response rate and the effectiveness of the follow-up procedures, (iii) evaluation of the reliability and validity of the instrument, and (iv) assurance that the data analysis techniques match the expected responses;
5. obtaining of valid data: a sample, which is a subset of a population, is defined in this phase. the answers of this group should represent the entire group. when choosing a sample to be surveyed, we must keep in mind three aspects of survey design, namely (i) avoidance of bias, (ii) appropriateness, and (iii) cost-effectiveness;
6. analysis of survey data: in this final step, all data collected are gathered and relevant information is extracted, so that the questions defined in the survey objectives can be answered.

4.1 survey goal and research questions

the approach used in this survey was based on the goal, question, metric (gqm) methodology and the template proposed in caldiera et al. (1994) and briand et al. (1996). our survey aims to identify software testing practices in brazilian industries to highlight strengths and weaknesses and foster work/research collaboration between academia and industry, towards an overview of the latest testing techniques, selection processes, challenges faced, and tools and metrics used by testers. the survey was conducted between september 2018 and december 2018 and is available online (https://github.com/italo-07/surveyteste). table 1 shows the research questions raised.

4.2 survey design and questions

our survey was based on other relevant studies (geras et al., 2004; ng et al., 2004; engström and runeson, 2010; garousi and varma, 2010; lee et al., 2012; anand et al., 2013; garousi and zhi, 2013); some of the questions designed by geras et al. (2004) were adopted, and new ones related to our specific objectives were added. testers currently working in the industry were asked to evaluate the set of questions proposed. torkar and mankefors (2003) state that feedback improves the set of questions prior to the application of a survey. the goal behind this phase was to ensure the terminology used would be familiar to a reasonable share of the audience; according to garousi and zhi (2013), the software testing terminology used in academia versus industry may occasionally be slightly different or even confusing. the feedback received from testers was used for the validation of the survey questions.

the study employs the characterization topics presented in swebok (bourque et al., 2014). therefore, the survey comprises the following categories: (i) testing levels, (ii) testing techniques (selection techniques), and (iii) software testing tools and testing process management. we have added three other categories, namely (i) respondents' personal and professional profiles, (ii) research & training, and (iii) challenges, which contain information on the respondents' profile and investigate research initiatives within companies, towards identifying challenges faced by testers. the survey questions are designed to provide relevant information to industry and academia.
we derived them from each of the eight research questions previously presented, composing a set of 34 questions, of which 7 (20%) are qualitative and 27 (80%) are quantitative multiple-choice questions. to help readers identify the research questions, we labeled the corresponding categories in the first column and the survey questions in the second column (table 2). the first category relates to the respondents' personal and professional profile (q1-q11): we are interested in understanding the level of education, undergraduate course, company name, time of experience, current position, types of software projects the companies usually test, and company size. these questions provide an overview of the scenario of software testing in brazil. the second category (q12-q13) refers to the testing levels (e.g., unit, integration, and system testing) and objectives of testing (e.g., acceptance, reliability, regression, performance, security, stress, interface, and usability testing); the survey investigates the use of testing levels by brazilian industries and the respondents' experience. the third category relates to testing techniques and selection processes (q14-q23), i.e., who decides on the testing techniques to be used in a project under test, the way the respondents select testing techniques, their experience with other testing techniques, such as mutation or concurrent testing, and what should be considered in a selection process. the fourth category details the testing tools and testing process management (q24-q29), towards a better understanding of testing tools and the identification of management practices applied to test teams. the fifth category, research & training (q30-q33), investigates whether companies develop research on software testing and provide formal training to their testing teams, and the way testers seek new knowledge to solve their technical problems. finally, the sixth category (q34) asks the respondents to list three challenges faced in their work routine.

4.3 survey application

we used google forms (https://www.google.com/forms/) to design and host our online survey, which was available for four months, from september to december 2018. according to the ethics guidelines, the respondents could remove their answers from the study at any time, and the researchers agreed on aggregating information from the survey and publishing a summary of the results. to reach the maximum number of responses, invitations were sent to (i) email lists used by the brazilian computer society (sbc, http://www.sbc.org.br/34-listas-eletronicas), which reaches professionals of the computer science community in brazil, (ii) email lists used at the university of são paulo for undergraduate and graduate students, and (iii) email lists of major brazilian software company associations, e.g., the brazilian association of software companies (abes, www.abessoftware.com.br/associados/socios/), the association of brazilian companies of information technology (assespro, http://assespro.org.br/), the brazilian association of information and communication technology companies (brasscom, https://brasscom.org.br/), porto digital (http://www.portodigital.org/home), and the association of technology companies of sancahub (http://sancahub.org/). the authors also attended scientific conferences in brazil and requested the participants to answer the survey.

4.4 population and sample size of respondents

the survey was available for four months, and 201 responses were obtained. to ensure the quality of the collected data, the respondents informed their emails for further contact towards the confirmation of their information.
they were also asked the following question after the first category of questions (q1-q11): do you have knowledge of software testing? if the answer was no, the survey ended. the purpose was to make sure the respondents fit the target audience, i.e., testers, developers who perform testing during development, and professionals who work in software quality. the survey data included responses from 185 testers in brazilian companies. table 3 shows the total number of emails sent to the companies: 1269 contacts were listed, representing brazilian technology company associations, of which 207 mail addresses failed; we believe the emails reported on the websites had not been updated, so we sent emails to only 1062 companies. since 185 respondents out of the 1062 companies listed contributed to the survey, a rough measure of the response rate equals 17.41% (185/1062).

table 1. research questions (gqm methodology)
goal: investigate commonly used testing levels. question: rq1 what are the most used testing levels in brazilian software companies? metric: set of options with the testing levels.
goal: understand the process of testing technique selection and the familiarity of testers with existing testing techniques. questions: rq2 how do you select a testing technique to test a software product? rq3 how often does a tester use structural, functional, mutation, and concurrent program testing? metrics: open questions for the collection of respondents' opinions; set of options with the software testing techniques.
goal: investigate the testing tools and technologies used and the way testing teams are managed. questions: rq4 what are the most used testing tools? rq5 how are the testing teams and processes managed? metrics: set of tool options, with the possibility of selecting more than one and inserting other tool names; set of options with the process management.
goal: find out whether companies are concerned with research and provide training, and the way testers seek new knowledge. questions: rq6 do brazilian industries develop research on software testing? rq7 how do testers seek knowledge on software testing? does their company offer any training? metrics: open questions for the collection of respondents' opinions; set of options with actions for seeking knowledge.
goal: identify research challenges. question: rq8 what challenges do testers face in their work routine? metric: open questions for the collection of respondents' opinions.

table 2. survey list of questions
respondents' personal and professional profile:
1. how old are you?
2. what is your gender?
3. what is your highest level of education?
4. if you have taken an undergraduate course, please inform its name.
5. what is your email address?
6. what is the name of the company where you work or have worked with software testing?
7. in what state is your company located?
8. what is (are) your current position(s) in your company?
9. how many years of experience do you have in software testing?
10. what types of software projects does your company develop?
11. what is the size of your company (number of employees)?
testing levels (rq1):
12. what type of tests do you work or have you worked on?
13. are you engaged in other activities in your company? if so, which ones?
testing techniques and selection process (rq2, rq3):
14. do you decide on the testing techniques to be used on the software in your company?
15. if you answered no to the previous question, please inform who decides on them.
16. in your opinion, what makes a tester not consider choosing new testing techniques?
17. how do you select a testing technique for a project?
18. in your opinion, what is important in the selection of a testing technique for a project?
19. what do you think prevents your company from adopting new techniques and methodologies?
20. are you familiar with the white-box test? if so, how often do you use it in your projects?
21. are you familiar with the black-box test? if so, how often do you use it in your projects?
22. are you familiar with the mutation test? if so, how often do you use it in your projects?
23. are you familiar with concurrent testing? if so, how often do you use it in your projects?
testing tools and process management (rq4, rq5):
24. what tools do you use to automate the testing activity?
25. in your current, or latest, project, what mechanisms are (were) used for the generation of test cases?
26. does your testing team use any measures as a guide for planning or designing tests?
27. is testing separated from the development and production environments?
28. what is (was) the proportion of testers to developers (t:d) in your current, or latest, software project?
29. does your development team(s) refer to itself as agile or traditional?
research & training (rq6, rq7):
30. is your company involved in research on software testing towards the development of new ways/techniques for testing software systems?
31. does your company offer any training to the testing team?
32. if you answered yes to the previous question, please specify the training.
33. how do you usually gain knowledge in the testing area for solving technical problems in the work environment?
challenges (rq8):
34. provide a list of the top 3 testing challenges related to your projects.

our study comprised a significant number of responses in comparison with other surveys into software testing: the average number of responses was 62, the most extensive survey (garousi and zhi, 2013) comprehended 246 responses, and the smallest (lee et al., 2012) had 14 responses. our survey was well above the average and almost approached the survey with the highest number of responses.

table 3. total of invitations sent
association | number of contacts | failed to send | emails sent
abes | 1023 | 155 | 868
assespro | 169 | 39 | 130
brasscom | 15 | 8 | 7
porto digital | 48 | 3 | 45
sancahub | 14 | 2 | 12
total | 1269 | 207 | 1062

5 analysis of the survey results

5.1 respondents' personal and professional profile

the participants' ages were divided into intervals. most respondents (40%) are between 31 and 40 years old, whereas the second most representative group (25%) is aged between 26 and 30. the other groups range between 20 and 25 years (18%), 41 and 50 years (14%), and over 50 years (3%). given this range, the profile of participants is diversified, which contributes to the strength of the answers collected. regarding gender, 72% of the respondents are male and 28% female. we intended to identify the relationship between the gender and age of respondents and how this is portrayed in the software testing scenario within brazilian industries.
figure 2 represents this distribution; we note that the majority of testers in brazilian industries are male and aged between 31 and 40 years, and that the age distribution is similar for both genders.

figure 2. respondents' age and gender distribution.

concerning the respondents' level of education, 87% had graduated and were divided into the following categories: graduate (34%), master of business administration (mba) (24%), masters (20%), and phd (9%). the other respondents are undergraduates (12%) and high school graduates (1%). figure 3 shows the distribution of women and men according to their academic background. the results show that the distribution is proportional between the genders, with a greater number of men at the postgraduate levels. a higher level of education increases the competitiveness among it companies and guarantees a higher quality of work; moreover, proper theoretical and technical training in the area prepares professionals to better perform their activities.

figure 3. relationship between gender and educational background.

concerning the undergraduate courses informed, we grouped the data according to areas (table 4). the majority (173 respondents) obtained an undergraduate degree in areas of science, technology, engineering, and mathematics (stem), either in technological courses (e.g., analysis and systems development, computer networks, information technology management) or in baccalaureate programs (e.g., information systems, computer science, computer engineering). only 10 respondents had their base education in applied social sciences courses (e.g., administration, social service, psychology).

table 4. undergraduate courses attended by the respondents.
course | percentage
exact sciences, engineering, and technology | 94%
applied social sciences | 5%
did not take an undergraduate course | 1%

we also identified the respondents according to their companies' location and grouped them into the five regions of brazil (figure 4). the region with the most significant number of participants was the southeast (59%), the second largest region of the country, composed of large technological centers located in states such as são paulo, minas gerais, and rio de janeiro. secondly, the northeast region (21%) comprises 9 states and has technological centers of great recognition, such as porto digital, located in recife. in third place comes the south region (11%), which has poles of innovation and technology, followed by the midwest region (7%) and the north region (2%).

figure 4. geographical distribution of respondents.

hence, we expected that the most significant number of respondents would come from the southeast region, due to its higher concentration of software companies.
the position held by the highest number of respondents was analyst (82 responses), followed by developer (33 responses) and tester (23 responses). other positions were manager and leader (e.g., project, qa, test, technical, it). the years of experience in the software testing industry were divided into five categories (up to 1 year, between 1 and 2 years, between 2 and 5 years, between 5 and 10 years, and above 10 years). most respondents (76%) had two or more years' experience, characterizing solid experience in software testing, and 27% had between two and five years' experience. to analyze the relationship between years of experience and the participants' industry positions, we compare these data in figure 5. the results show respondents with all levels of experience in almost all positions, although the higher positions, such as project manager and leader, concentrate respondents with more than five years of experience.

figure 5. relationship between years of experience and position.

in brazil, most companies classify it professionals as junior, senior, and expert, taking into consideration the time of experience in the company: the junior level encompasses up to 5 years' experience and, in general, the senior level comprehends up to 9 years. above 5 years' experience, the number of professionals is more significant in the categories whose level of education is associated with an mba, a masters, or a phd (figure 6). the expert level requires above 10 years' experience and well-established academic training, and expert professionals often work in management positions. figure 6 shows the relationship between years of experience and the education level of each respondent.

figure 6. relationship between years of experience and level of education.

some types of software projects require more intensive tests (figure 7). the category of web systems development (158 responses) showed the highest number of answers, followed by mobile development (105 responses). the other systems indicated were maintenance (74), erp (57), and banking (49). the participants also reported other types of projects, such as games development, virtual reality systems, quality consulting, and educational systems.

figure 7. types of software projects developed by companies.

the size of the companies the respondents work at is related to the number of employees, as shown in table 5. most respondents worked in large companies with more than 200 employees (53%). such organizations are more concerned with the quality of the software product developed and invest more resources in hiring professionals to perform testing activities. moreover, from the data collected, we identified the companies that hold the capability maturity model integration (cmmi) certification.
this information is available at the cmmi institute (https://sas.cmmiinstitute.com/pars/pars.aspx); the model describes a five-level evolutionary path of increasingly organized and systematically more mature processes.

table 5. company size (according to the number of employees).
number of employees | respondents | percentage
above 200 | 97 | 53%
between 100 and 200 | 23 | 12%
between 50 and 100 | 21 | 11%
between 10 and 50 | 29 | 16%
between 5 and 10 | 5 | 3%
between 1 and 5 | 10 | 5%

5.2 rq1. what are the most used testing levels in brazilian software companies?

the participants' responses enabled the identification of the testing levels and objectives; in this case, respondents could choose more than one option. the most performed testing level (figure 8) was the system test (136 responses), which tests the behavior of an entire system and is usually considered appropriate for the assessment of non-functional system requirements, such as security, speed, accuracy, and reliability. the second testing level was the unit test (93 responses), which checks the functioning of software elements that are separately testable. the integration test (4 responses) received the lowest number of responses.

figure 8. testing levels used by respondents.

the most used testing objective (figure 9) was the interface test (129 responses), which checks whether the components in an interface behave as expected. the acceptance test (124 responses) is applied when the product is almost finished, and the tester must check whether the system developed has met the requirements. the regression test (121 responses) is applied when the software has been updated and we need to verify whether the new changes influence the behavior of the unchanged part of the software. the usability test (117 responses), one of the most applied tests, evaluates the degree of ease of use of the software developed and contributes to learning and use by the end-user. other tests indicated were (i) the stress test, which exercises the software's maximum capacity to determine its limits, and (ii) the security test, which checks whether the software is protected from external attacks. although many respondents (132) cited the automation test as a testing objective, it is not represented in figure 9 because automation is not an objective; instead, it is a way of conducting the testing activity, as discussed by dustin et al. (1999).

figure 9. testing objectives most used by the respondents.
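to make the distinction between a unit test and a regression-style check concrete, consider the minimal sketch below; the discount function and its test cases are hypothetical examples introduced only for illustration, not material from the survey.

```python
# illustrative sketch only: discount() and its tests are hypothetical
# examples of the testing levels and objectives discussed above.

def discount(price: float, percent: float) -> float:
    """apply a percentage discount to a price."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return round(price * (1 - percent / 100), 2)

# unit level: checks one separately testable element in isolation.
def test_discount_unit():
    assert discount(100.0, 10) == 90.0

# regression-style check: re-run after every change to verify that
# unchanged behavior still holds (e.g., a 0% discount keeps the price).
def test_discount_regression():
    assert discount(59.9, 0) == 59.9
```

run with, e.g., pytest; the point is that the first test exercises a single element, while the second guards previously verified behavior against later changes.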
5.3 rq2. how do you select a testing technique to test a software product?

there are several testing techniques available, each one with different and often complementary features responsible for testing several aspects of the software. choosing the testing technique should not only be based on subjective knowledge; it should also incorporate objective knowledge, guided by elements that favor a good choice (santos et al., 2020c,d, 2019; victor and upadhyay, 2011). our study identifies whether the respondents have the autonomy to choose a testing technique to test a software project. the results show that most respondents (70%) select the testing technique to be applied in the software project; those who do not (30%) informed the position responsible for the task (figure 10). the aim is to summarize the positions within a company responsible for the testing technique selection. we identified three main positions with this responsibility: qa analyst (63%) is the main role that performs the task, followed by project manager (23%) and developer (14%). through the identification of these roles, we expect to conduct further research to propose a systematic way to select testing techniques, to ensure quality and improve the selection process.

figure 10. positions responsible for the selection of testing techniques.

factors that hamper the use of new techniques by testers (figure 11) are very short deadlines (158 responses) and lack of knowledge (131 responses). the time devoted to software testing may be very short within a project schedule compared to the other activities in the development process; therefore, testers cannot apply (or even choose) new techniques to improve testing efficiency. the cost of applying a technique (72 responses) is also an obstacle, since testing is an expensive activity in the software development process and can consume a large part of the total costs of the project. low priority (58 responses) indicates the company is not concerned with seeking new testing techniques; due to the efforts that must be devoted to the testing activity, companies always use the same techniques, which become obsolete with respect to advances in research in the area. the lack of technical documentation (50 responses) triggers an alert for researchers and companies that develop new techniques, who should provide proper documentation to encourage their use and dissemination. some testers declared satisfaction (43 responses) with the techniques used in previous projects, so they feel no need to seek new ones, and some companies do not allow (21 responses) their use.

figure 11. factors that prevent the use of new testing techniques by testers.

the study investigates the way testers select a testing technique to be applied in a software project (figure 12). the project scope (124 responses) supports decision-making regarding the selection of testing techniques: testers analyze the project to be tested and choose a technique related to its characteristics. another criterion is time (71 responses): the technique selected should be feasible according to the time defined in the schedule. risk (26 responses), related to the implementation of the technique, involves the problems of applying the testing technique to the software under test. costs (26 responses) are another factor, although testers are more worried about ensuring the best quality practice in the testing activity and often do not take costs into consideration. the complexity (20 responses) of implementing a technique is also indicated as a selection criterion. one of the least mentioned criteria was the previous experience (12 responses) of testers.

figure 12. criteria adopted for the selection of testing techniques.
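as one illustration of how such criteria could feed a more systematic selection process, the sketch below scores candidate techniques against weighted criteria; the weights, criterion scores, and candidate techniques are all hypothetical and not derived from the survey data.

```python
# hypothetical sketch: scoring candidate testing techniques against the
# criteria reported by respondents (scope fit, time, risk, cost, complexity,
# experience). all numbers below are illustrative, not survey results.

WEIGHTS = {"scope_fit": 0.44, "time": 0.25, "risk": 0.09,
           "cost": 0.09, "complexity": 0.07, "experience": 0.06}

# each candidate technique gets a 0-1 score per criterion (higher is better).
candidates = {
    "functional (black box)": {"scope_fit": 0.9, "time": 0.8, "risk": 0.7,
                               "cost": 0.7, "complexity": 0.8, "experience": 0.9},
    "mutation": {"scope_fit": 0.6, "time": 0.3, "risk": 0.6,
                 "cost": 0.4, "complexity": 0.3, "experience": 0.2},
}

def score(profile: dict) -> float:
    # weighted sum of the criterion scores for one technique.
    return sum(WEIGHTS[c] * v for c, v in profile.items())

best = max(candidates, key=lambda name: score(candidates[name]))
print(best)  # "functional (black box)" wins with these illustrative numbers
```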
we chose two well-known techniques (structural and functional testing) to investigate the familiarity of the respondents with some testing techniques (figure 13), and two alternative ones (mutation and concurrent testing) to check their demand and use in brazilian software industries.

figure 13. respondents’ familiarity with testing techniques.

according to the familiarity reported in the previous questions, the testing techniques most used during the execution of testing activities are the functional (170 responses) and structural tests (98 responses). among brazilian software industries, only a few respondents mentioned using mutation testing in some testing projects (11 responses); most of them reported they were not familiar with it (111 responses), while the second largest group indicated they were familiar with it but had never applied it (63 responses). it would be a good practice if companies attempted to apply more mutation testing in their software projects to improve the testing activity. the analysis of participants’ familiarity with concurrent software testing seeks to evaluate the relevance of the area in the evolution of the computer research field. both testing activity and research in concurrent software testing should be further investigated and spread among testers. most respondents are unfamiliar (121 responses) with this type of test. the second most indicated option revealed that, although the participants were familiar with the test, they had never applied it (40 responses). in contrast to the mutation test, some respondents indicated they used this type of testing in all their projects (3 responses), and another part of the respondents reported they used it in some of their projects (21 responses). the tendency to work with concurrent software testing has grown due to the evolution of computers; therefore, software companies should update themselves regarding this trend. brazilian software companies have shown a stronger tendency to apply tests related to the functionality of the developed software and those that ensure the internal quality of the code.

figure 14. most used testing techniques.

5.5 rq4. what are the most used testing tools?

tools and frameworks are commonly used to support the testing process (figure 15). selenium (27%), a portable framework used in web application testing that supports the creation and execution of system tests, received the most responses. the open-source framework junit (23%) supports the design of automated tests for the java language, facilitating the development and maintenance of the code, and the open-source jmeter tool (15%) was designed to load-test functional behavior and measure performance. soapui (13%) is also an open-source tool used for web application testing; among its functions are web service inspection and support for functional, load, and compliance tests. testlink (10%) is an open-source tool that facilitates the management and maintenance of test cases and documents related to the software. sonarqube (9%) continuously inspects code quality to perform automatic reviews with static code analysis and detect bugs, code smells, and security vulnerabilities; it works with a variety of programming languages. sikuli (3%) is an open-source tool that supports testing automation and automates the elements displayed on the screen.

figure 15. automation tools and frameworks.
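as an illustration of how such tools are used, the sketch below shows a minimal selenium system test in python; the url and element ids are hypothetical, and a chrome driver is assumed to be installed. it is only a sketch of the style of test these frameworks support, not a test from the surveyed companies.

  from selenium import webdriver
  from selenium.webdriver.common.by import By

  driver = webdriver.Chrome()
  try:
      driver.get("https://example.com/login")                   # hypothetical page
      driver.find_element(By.ID, "username").send_keys("tester")
      driver.find_element(By.ID, "password").send_keys("secret")
      driver.find_element(By.ID, "submit").click()
      assert "inbox" in driver.title.lower()                    # expected landing page
  finally:
      driver.quit()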
concerning the design of test cases (figure 16), user stories (122 responses) consist of a specification that captures what a user does or needs to do; they have been widely used with agile methodologies to support the definition of test cases. tester skills (120 responses) indicate respondents usually rely on skills from previous experience to design efficient tests in the projects they participate in. other categories indicated are (i) bugs reported (102 responses), which shows respondents design test cases from reported errors, (ii) code coverage (60 responses), which measures the degree to which the source code of a program is executed when a particular test suite runs, (iii) boundary value analysis (59 responses), which concerns boundary conditions, i.e., situations directly above and beneath the edges of input and output equivalence classes (see the sketch after figure 16), (iv) equivalence partitioning (46 responses), which reduces the number of test cases to a manageable level while still maintaining reasonable testing coverage, and (v) control flow graphs (24 responses), which identify execution paths through a module of program code so that test cases can be created and executed to cover those paths.

figure 16. forms of designing test cases.
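a minimal sketch of boundary value analysis in python follows; the validation rule and the 18–60 range are hypothetical, chosen only to show how such tests target the values just below, on, and just above the edges of the valid input partition.

  import pytest

  def is_valid_age(age: int) -> bool:
      # hypothetical rule: the field accepts ages from 18 to 60
      return 18 <= age <= 60

  @pytest.mark.parametrize("age,expected", [
      (17, False), (18, True), (19, True),   # lower boundary
      (59, True), (60, True), (61, False),   # upper boundary
  ])
  def test_age_boundaries(age, expected):
      assert is_valid_age(age) == expected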
the design of test cases can also consider measures to guide the activity (figure 17). priority level (116 responses) consists of prioritizing the creation of the test cases fundamental for the correct functioning of the system. code lines (37 responses) refers to tools for the automatic generation of test cases from the lines of code. function points (33 responses) are a technique that establishes a measure of size, considering the creation of test cases according to the implemented functionality of the software. complexity (20 responses) is a metric that identifies the cyclomatic complexity of the software and uses a strategy to test each path of the program independently; therefore, the number of generated test cases corresponds to the cyclomatic complexity.

figure 17. measures for the design and planning of test cases.

5.6 rq5. how are the testing teams and processes managed?

during the development and execution of testing activities, the testing team must have full autonomy to work on a version of the system that is not used by real users. testers require the whole software to be available to perform their activities without the risk of modifying its functional version. the environment refers to the use of servers or different virtual machines for development, testing, and production tasks (table 6). respondents indicated software and hardware are separated (58%), i.e., they might be divided into different virtual and physical servers as a safer option in case of equipment failure. moreover, for some respondents the separation among testing, development, and production environments does not apply (20%), which indicates no concern on the part of the testing team or the project management regarding the distribution of such environments. as a result, projects running in production can be impacted, and new tests must be applied.

table 6. characteristics of the testing environment.

  development environment          responses   percentage
  separate software and hardware   107         58%
  separate software                40          22%
  not applicable                   37          20%

an important point is the identification of the proportion of testers to developers (t:d) (figure 18). the most indicated proportion was 1 tester to 5 or fewer developers (68 responses), which highlights an inefficient scenario: five developers would be working on new system functionalities while only one tester performs testing activities on all the newly developed functionalities. a ratio of 2:1 or higher (24 replies) would be better, as the testers could split the several testing activities to ensure a better quality of their tasks and speed up the reporting of errors. some respondents indicated that there are no testers, as in the ratio of 0 testers to 5 or fewer developers (20 responses); in this scenario, developers perform the tests on their own code, which is not a good practice because, without a professional focused on ensuring software quality, the final product can be compromised.

figure 18. proportion of testers to developers (t:d).

regarding how the development teams consider themselves, agile methodology (61%) seems to be dominant and quite popular among software companies (figure 19). traditional methodology (12%) received a small number of answers, which characterizes the migration of companies that used old software processes to new projects that apply agile methodologies. other respondents indicated no specific distinction (27%), meaning their companies could use both approaches. the type of development methodology used by companies must be known, since each model directly impacts the testing activity.

figure 19. characteristics of the development team.

5.7 rq6. do brazilian industries develop research on software testing?

many companies are not involved in research (78%), which is not an ideal scenario; the lack of research investment by companies is an issue to be addressed. santos et al. (2020b) highlight some research opportunities that indicate new research directions at the intersection of testing and software ecosystems, which companies could further explore in collaboration with academia. research initiatives are essential for enabling significant advances in software testing. some respondents (22%) indicated that their companies invest in research; these companies show a concern with the development of research and prioritize the area through a specific department to conduct research in software testing. we also investigated whether company size influences the amount of research investment. figure 20 shows that company size and, consequently, investment capacity do not affect the research initiative.

figure 20. research initiatives from companies (company size vs. investment in research).

5.8 rq7. how do testers seek knowledge on software testing? does their company offer any training?

most companies (58%) offer no training to the testing team, and those that provide it (42%) aim at improving their employees’ skills.
among the training provided by companies, training on testing techniques (24 responses) is indicated as advantageous for spreading knowledge about new testing techniques among testers and improving the execution of their work activities. web courses (20 responses) are offered through online platforms, and companies pay for their employees to access them. internal training (20 replies) is provided when companies promote the sharing of internal knowledge among their employees. training related to automation and tools (20 responses) aims to improve the forms of testing automation and to find available automation tools that can contribute to the automation activity.

figure 21. type of training provided by respondents’ companies.

both it professionals and testers commonly face challenges during the execution of their tasks. to comprehend how these professionals seek new information (figure 22), we analyzed their answers. results indicate that testers usually seek new knowledge through conversations with more experienced testers (125 responses), which identifies the need for communication among employees of the same team: the most experienced ones can share their knowledge with beginners, and vice versa, which helps to increase team integration. other options indicated were reading scientific papers (106 responses) and attending courses (106 responses), through which testers acquire new content to help them solve their problems, which reinforces the importance of research and collaboration between industry and academia.

figure 22. actions for obtaining knowledge.

other options were (i) searching web forums (83 responses), which consists of sharing questions with the testers’ community to obtain help, (ii) attending conferences (66 responses) for networking and meeting other professionals in the field, and (iii) talking with researchers (64 responses) of the area to find solutions to problems. however, establishing communication between industry and academia is still not simple.

6 discussion on the survey findings

6.1 correlations

in this section, we present a triangulation constructed from the obtained results, looking for correlations among the analyzed data. some interesting findings are discussed below.

6.1.1 software testing technique used versus type of project

we analyzed the correlation between testing techniques and software projects, seeking to understand which testing techniques are commonly used to test companies’ different types of systems. the results are shown in figure 23, where we can see that, for all types of projects, functional testing is the most used. the second most used testing technique is the fault-based test, except for web systems and erp systems, where the structural test takes second place. on average, structural testing presented results similar to the fault-based test, and, lastly, the least used technique was the model-based test. despite that, all testing techniques are used at some level of intensity, independently of the software project type.

figure 23. relationship between type of software project and testing technique adopted by companies.

6.1.2 software testing technique used versus company size

we evaluated the relationship between companies’ size and the applied testing technique to understand which factors influence the choice of the testing technique (figure 24).
we can observe that functional testing is the most used technique, regardless of the company size. regarding the other testing techniques, structural testing is more used in large companies, while fault-based and model-based testing are little used in both large and small companies.

figure 24. relationship between company size and testing technique used in companies’ projects.

6.2 challenges

this study has summarized the main challenges indicated by the respondents regarding the execution of testing activities, which can contribute to the development of research in the software testing area (figure 25).

• lack of importance and priority of the testing activity (71 responses): the team involved in project management does not value the correct execution of the testing activity and often neglects it, establishing short deadlines for its development. professionals in the technological area should raise awareness of its importance, and a possible solution would be the insertion of specific disciplines in technological undergraduate courses;

• lack of knowledge and training (51 responses): respondents indicated some testers do not have the necessary knowledge to perform automation activities, since they are not familiar with programming, and companies do not provide them with specific training; the activity requires substantial expertise in programming;

• automation process (46 responses): there is a need to automate the developed tests to improve efficiency in the testing activity. automation is a challenge in testing practice because there are no clear guidelines that should be followed to help testers perform the automation process;

• tools (38 responses): many testers use automated tools to support the automation process. the respondents reported difficulty using some tools and the challenge of finding one suitable for the activity; consequently, several tools are used. the development of tools that offer a variety of functionalities for supporting the automation process and other testing activities is, therefore, fundamental;

• schedule (36 responses): facing the short deadlines dedicated to the testing activity within the project schedule is a hard task; therefore, improvements in efficiency, such as new activities and proposals, are required to optimize testing.

figure 25. challenges related to testing in software companies.
other challenges indicated were (i) cost (24 responses): new approaches must be developed to reduce the costs spent on testing activities; (ii) efficiency (15 responses): related to the difficulty of performing the testing activity and achieving the best possible results; (iii) database communication (9 responses): the generation of masses of data for use in automated tests requires improvements in the communication with the database during testing activities; (iv) documentation of testing activities (8 responses): respondents reported difficulties in finding satisfactory documents to solve their problems with tools or new testing techniques; (v) test coverage (7 responses), which is a resource-intensive task; (vi) test maintainability (6 responses), i.e., maintaining the created test cases, whose reuse in other programs is a topic to be further researched; and (vii) bdd (4 responses), whose adoption can be hampered by an incorrect implementation.

7 threats to validity

our results can potentially be affected by the defined methodology, and the threats are discussed here to guide the interpretation of the findings.

1. conclusion validity: regards the reliability and strength of the data provided by the respondents. to minimize this threat, we carefully summarized the collected data and excluded blank responses or those with incorrect information. the survey initially received 201 responses. to assure all respondents were aware of and worked with software testing, we asked a question to check whether they knew about software testing; if the answer was no, the survey was finished. 16 negative answers were given and excluded, thus leaving 185 responses to be analyzed.

2. internal validity: regards influences that can affect the study with respect to causality. to mitigate this threat, we followed the guidelines proposed by kitchenham and pfleeger (2008). prior to designing the survey, we analyzed other relevant work, reviewed similar past surveys (mentioned in section 4), reused some of their questions, and created new ones based on the specific objectives of our research.

3. construct validity: concerns whether the study measures what it intends to measure. our scenario relates to this threat with respect to whether we captured the real scenario of software testing practices in the brazilian software industry. to minimize it, we categorized the data collected in each question to extract relevant information and make assumptions about the topics addressed.

4. external validity: concerns the generalization of our results. the population and sample size of the respondents (table 3) outline the number of emails sent to the lists of companies accessed. since 185 respondents out of the 1062 companies listed contributed to the survey, a rough measure of the sample rate equals 17.42% (185/1062). our sampling rate has provided relevant results and contributed to our assumptions.

8 conclusions and future work

this paper has reported the findings of a survey on software testing practices conducted in brazilian software companies. prior to its design, we intensively studied other surveys related to the topic. our research objectives and questions were specified according to the gqm methodology. the study presents, in detail, the results collected from the analyzed data, which enabled the identification of the software testing practices most used by testers and software quality professionals in brazilian companies.
testers select a technique according to the scope of the project. the most applied techniques are functional and structural testing; the least used one is the mutation test, which was also pointed out as the technique most respondents had never applied. selenium was indicated as the most used tool in testing activities. the challenges faced by testers were discussed and have raised interesting topics to be explored more deeply. the need for surveys on software testing practices is clear, since they help software companies improve both their testing strategies and the relationship between testing and the quality of the products developed. despite some limitations, our study can support this process. as future work, we aim to portray both a broad view of the industry as a whole and a more detailed picture of some companies. the research will be developed with more qualitative aspects, involving formal interviews or focus groups, and companies will be visited to obtain more results for comparisons.

acknowledgements

the authors acknowledge fapesp (sao paulo research foundation), for the financial support under processes number 2018/101839 and 2019/06937-0, and propp/ufgd, under sigproj project number 322855.1174.8276.1103 2019.

references

abran, a., moore, j. w., bourque, p., dupuis, r., and tripp, l. l. (2004). guide to the software engineering body of knowledge: 2004 version swebok. ieee computer society.
anand, s., burke, e. k., chen, t. y., clark, j., cohen, m. b., grieskamp, w., harman, m., harrold, m. j., mcminn, p., bertolino, a., et al. (2013). an orchestrated survey of methodologies for automated software test case generation. journal of systems and software, 86(8):1978–2001.
bertolino, a. (2007). software testing research: achievements, challenges, dreams. in 2007 future of software engineering, pages 85–103. ieee computer society.
bourque, p., fairley, r. e., et al. (2014). guide to the software engineering body of knowledge (swebok (r)): version 3.0. ieee computer society press.
briand, l. c., differding, c. m., and rombach, h. d. (1996). practical guidelines for measurement-based process improvement. software process: improvement and practice, 2(4):253–280.
caldiera, g., basili, v. r., and rombach, h. d. (1994). goal question metric paradigm. encyclopedia of software engineering, 1:528–532.
copeland, l. (2004). a practitioner’s guide to software test design. artech house.
delamaro, m., jino, m., and maldonado, j. (2017). introdução ao teste de software. elsevier brasil.
dias-neto, a. c., matalonga, s., solari, m., robiolo, g., and travassos, g. h. (2017). toward the characterization of software testing practices in south america: looking at brazil and uruguay. software quality journal, 25(4):1145–1183.
dustin, e., rashka, j., and paul, j. (1999). automated software testing: introduction, management, and performance. addison-wesley longman publishing co., inc., usa.
engström, e. and runeson, p. (2010). a qualitative survey of regression testing practices. in international conference on product focused software process improvement, pages 3–16. springer.
fink, a. (2003). the survey handbook, volume 1. sage.
garousi, v. and varma, t. (2010). a replicated survey of software testing practices in the canadian province of alberta: what has changed from 2004 to 2009? journal of systems and software, 83(11):2251–2262.
garousi, v. and zhi, j. (2013). a survey of software testing practices in canada. journal of systems and software, 86(5):1354–1376.
geras, a. m., smith, m., and miller, j. (2004). a survey of software testing practices in alberta. canadian journal of electrical and computer engineering, 29(3):183–191.
groves, l., nickson, r., reeve, g., reeves, s., and utting, m. (2000). a survey of software development practices in the new zealand software industry. in software engineering conference, 2000. proceedings. 2000 australian, pages 189–201. ieee.
kassab, m., defranco, j., and laplante, p. (2016). software testing practices in industry: the state of the practice. ieee software.
kasunic, m. (2005). designing an effective survey. technical report, carnegie-mellon univ pittsburgh pa software engineering inst.
kitchenham, b. a. and pfleeger, s. l. (2008). personal opinion surveys. in guide to advanced empirical software engineering, pages 63–92. springer.
lee, j., kang, s., and lee, d. (2012). survey on software testing practices. iet software, 6(3):275–282.
molléri, j. s., petersen, k., and mendes, e. (2016). survey guidelines in software engineering: an annotated review. in proceedings of the 10th acm/ieee international symposium on empirical software engineering and measurement, page 58. acm.
myers, g. j., sandler, c., and badgett, t. (2011). the art of software testing. john wiley & sons.
ng, s., murnane, t., reed, k., grant, d., and chen, t. (2004). a preliminary survey on software testing practices in australia. in software engineering conference, 2004. proceedings. 2004 australian, pages 116–125. ieee.
saglietti, f., oster, n., and pinte, f. (2008). white and grey-box verification and validation approaches for safety- and security-critical software systems. information security technical report, 13(1):10–16.
santos, i., c filho, j. c., and souza, s. r. (2020a). a survey on the practices of mobile application testing. in 2020 xlvi latin american computing conference (clei), pages 232–241. ieee.
santos, i., coutinho, e. f., and souza, s. r. (2020b). software testing ecosystems insights and research opportunities. in proceedings of the 34th brazilian symposium on software engineering, pages 421–426.
santos, i., furlanetti, a. b., melo, s. m., de souza, p. s. l., delamaro, m. e., and souza, s. r. (2020c). contributions to improve the combined selection of concurrent software testing techniques. in proceedings of the 5th brazilian symposium on systematic and automated software testing, pages 69–78.
santos, i., melo, s. m., de souza, p. s. l., and souza, s. r. (2019). testing techniques selection: a systematic mapping study. in proceedings of the xxxiii brazilian symposium on software engineering, pages 347–356.
santos, i., melo, s. m., de souza, p. s. l., and souza, s. r. (2020d). towards a unified catalog of attributes to guide industry in software testing technique selection. in 2020 ieee international conference on software testing, verification and validation workshops (icstw), pages 398–407. ieee.
torchiano, m., fernández, d. m., travassos, g. h., and de mello, r. m. (2017). lessons learnt in conducting survey research. in 2017 ieee/acm 5th international workshop on conducting empirical studies in industry (cesi), pages 33–39. ieee.
torkar, r. and mankefors, s. (2003). a survey on testing and reuse. in software: science, technology and engineering, 2003. swste’03. proceedings. ieee international conference on, pages 164–173. ieee.
vegas, s. and basili, v. (2005). a characterisation schema for software testing techniques. empirical software engineering, 10(4):437–466.
victor, m. and upadhyay, n. (2011). selection of software testing technique: a multi criteria decision making approach. in international conference on computational science, engineering and information technology, pages 453–462. springer.
wohlin, c., aurum, a., angelis, l., phillips, l., dittrich, y., gorschek, t., grahn, h., henningsson, k., kagstrom, s., low, g., et al. (2011). the success factors powering industry-academia collaboration. ieee software, 29(2):67–73.

journal of software engineering research and development, 2020, 8:4, doi: 10.5753/jserd.2020.602  this work is licensed under a creative commons attribution 4.0 international license.

reducing the discard of mbt test cases

thomaz diniz [ federal university of campina grande | thomaz.morais@ccc.ufcg.edu.br ]
everton l. g. alves [ federal university of campina grande | everton@computacao.ufcg.edu.br ]
anderson g. f. silva [ federal university of campina grande | andersongfs@splab.ufcg.edu.br ]
wilkerson l. andrade [ federal university of campina grande | wilkerson@computacao.ufcg.edu.br ]

abstract

model-based testing (mbt) is used for generating test suites from system models. however, as software evolves, its models tend to be updated, which may lead to obsolete test cases that are often discarded. test case discard can be very costly, since essential data, such as execution history, are lost. in this paper, we investigate the use of distance functions and machine learning to help to reduce the discard of mbt tests. first, we assess the problem of managing mbt suites in the context of agile industrial projects. then, we propose two strategies to cope with this problem: (i) a strategy purely based on distance functions: an empirical study using industrial data and ten different distance functions showed that distance functions can be effective for identifying low impact edits that lead to test cases that can be updated with little effort, and that, by using this strategy, one can reduce the discard of test cases by 9.53%; and (ii) a strategy that combines machine learning with distance values: this strategy can classify the impact of edits in use case documents with accuracy above 80%; it was able to reduce the discard of test cases by 10.4% and to identify test cases that should, in fact, be discarded.

keywords: mbt, test case discard, suite evolution, agile development

1 introduction

software testing plays an important role since it helps gain confidence that the software works as expected (pressman, 2005).
moreover, testing is fundamental for reducing risks and assessing software quality (pressman, 2005). on the other hand, testing activities are known to be complex and costly: studies found that nearly 50% of a project’s budget is related to testing (kumar & mishra, 2016). in practice, a test suite can combine manually and automatically executed test cases (itkonen et al., 2009). although automation is always desired, manually executed test cases are still very important. itkonen et al. (2009) state that manual testing still plays an important role in the software industry and cannot be fully replaced by automatic testing; for instance, a tester that runs manual tests tends to better exercise a gui and find new faults. on the other hand, manual testing is often costly (harrold, 2000).

to reduce the costs related to testing, model-based testing (mbt) can be used. it is a strategy where test suites are automatically generated from specification models (e.g., use cases, uml diagrams) (dalal et al., 1999; utting & legeard, 2007). by using mbt, sound tests can be extracted before any coding, and without much effort. in agile projects, requirements are often volatile (beck & gamma, 2000; sutherland & sutherland, 2014). in this scenario, test suites are used as safety nets for avoiding feature regression. discussions on the importance of test case reuse are not new (von mayrhauser et al., 1994). in software engineering, software reuse is key for reducing development costs and improving quality, and this is also valid for testing (frakes, 1994). a test case that finds faults can be a valuable investment (myers et al., 2011), and good test cases should be stored as a reusable resource to be used in the future (cai et al., 2009). in this context, an always updated test suite is mandatory.

a recent work proposed claret, a lightweight specification artifact for enabling the use of mbt in agile projects (n. jorge et al., 2018). with claret, one can both specify requirements using use cases and generate mbt suites from them. however, a different problem has emerged: as the software evolves (e.g., bug fixes, requirement changes, refactorings), both its models and test suite need revisions. since mbt test suites are generated from requirement models, in practice, as requirements change, the requirement artifacts are updated, new test suites are generated, and the newer suites replace the old ones. therefore, test cases that were impacted by the edits, instead of being updated, are often considered obsolete and discarded (oliveira neto et al., 2016).

although one may find it easy to generate new suites, regression testing depends on a stable test suite that evolves. test case discarding implies the loss of important historical data (e.g., execution time, the links between faults and tests, fault-discovering time). test case historical data is an important tool for assessing system weaknesses and better managing a project; therefore, one should not neglect it. for instance, most defect prediction models are based on historical data (he et al., 2012). moreover, for some strategies that optimize the allocation of testing resources, historical data is key (noor & hemmati, 2015; anderson et al., 2014). by discarding test cases, and their historical data, a project may miss important information for both improving the project and guiding its future actions. moreover, in a scenario where previously detected faults guide development, missing tests can be a huge loss.
finally, test case discard and poor testing are known signs of bad management and eventually lead to software development waste (sedano et al., 2017). however, part of a test suite may turn obsolete due to low impact model updates; those test cases could be reused with little effort, consequently reducing test discards. nevertheless, manual analysis is tedious, costly, and time-consuming, which often prevents its applicability in the agile context. in this sense, there is a need for an automatic way of detecting reusable and, in fact, obsolete test cases.

distance functions map a pair of strings to a number that indicates the similarity level between the two versions (cohen et al., 2003). in a scenario where manual test cases evolve due to requirement changes, distance functions can be an interesting tool to help us classify the impact of the changes on a test case. in this paper, first, we assess and discuss the practical problem of model evolution in mbt suites. to cope with this problem, we propose and evaluate two strategies for automatically classifying model edits and tests, aiming at avoiding unnecessary test discards. the first is based on distance functions, while the second combines machine learning and distance values. this work is an extension of our previous one (diniz et al., 2019), including the following contributions:

• a study using historical data from real industrial projects that investigates the impact of model evolution on mbt suites. we found that 86% of the test cases turn obsolete between two consecutive versions of a requirement file, and those tests are often discarded. moreover, 52% of the found obsolete tests were caused by low impact syntactic edits and could become fully updated with the revision of 25% of their steps;

• an automatic strategy based on distance functions for reclassifying reusable test cases from the obsolete set. this strategy was able to reduce test case discard by 9.53%;

• an automatic strategy based on machine learning and distance functions for classifying test cases and model change impact. this strategy can classify the impact of edits in use case documents with accuracy above 80%; it was able to reduce the discard of test cases by 10.4% and to identify test cases that should, in fact, be discarded.

this paper is organized as follows. in section 2, we present a motivational example. the needed background is discussed in section 3. section 4 presents an empirical investigation for assessing the challenges of managing mbt suites during software evolution. sections 5 and 6 present the strategy for classifying model edits using distance functions and the performed evaluation, respectively. section 7 introduces the strategy that combines machine learning and distance values. section 8 presents a discussion comparing results from both strategies. in section 9, some threats to validity are discussed. finally, sections 10 and 11 present related works and the concluding remarks.

2 motivational example

suppose that ann works in a project and wants to benefit from mbt suites. her project follows an agile methodology where requirement updates are expected to be frequent. therefore, she decides to use claret (n. jorge et al., 2018), an approach for specifying requirements and generating test suites.
the following requirement was specified using claret’s dsl (listing 1): “in order to access her email inbox, the user must be registered in the system and provide a correct username and password. in case of an incorrect username or password, the system must display an error message and ask for new data.”. in claret, an ef [flow #] mark refers to a possible exception flow, and a bs [step #] mark indicates a returning point from an exception/alternative flow to the use case’s basic flow. from this specification, the following test suite can be generated: s1 = {tc1, tc2, tc3}, where tc1 = [bs:1 → bs:2 → bs:3 → bs:4], tc2 = [bs:1 → bs:2 → bs:3 → ef[1]:1 → bs:3 → bs:4], and tc3 = [bs:1 → bs:2 → bs:3 → ef[2]:1 → bs:3 → bs:4].

suppose that, in the following development cycle, the use case (listing 1) was revisited and updated due to both requirement changes and readability improvements. three edits were performed: (i) the message in line 9 was updated to “displays a successful message”; (ii) the system message in line 12 was updated to “alerts that username does not exist”; and (iii) both the description and the system message in exception 2 (line 14) were updated to “incorrect username/password combination” and “alerts that username and/or password are incorrect”, respectively. since steps from all execution flows were edited (basic, exception 1, and exception 2), ann discards s1 and generates a whole new suite. however, part of s1’s tests was not much impacted and could be reused with little or no update. for instance, only edit (iii), in fact, changed the semantics of the use case, while (i) and (ii) are updates that do not interfere with the system’s behavior. therefore, only test cases that exercise the steps changed by (iii) should in fact be discarded (tc3), whereas test cases that exercise steps changed by (i) and/or (ii) could be easily reused and/or updated (tc1 and tc2). we believe that an effective and automatic analyzer would help ann decide when to reuse or discard test cases, and therefore reduce the burden of losing important testing data.

1  systemname "email"
2  usecase "log in user" {
3    actor emailuser "email user"
4    precondition "there is an active network connection"
5    basic {
6      step 1 emailuser "launches the login screen"
7      step 2 system "presents a form with username and password fields and a submit button"
8      step 3 emailuser "fills out the fields and click on the submit button"
9      step 4 system "displays a message" ef[1,2]
10   }
11   exception 1 "user does not exist in database" {
12     step 1 system "alerts that user does not exist" bs 3
13   }
14   exception 2 "incorrect password" {
15     step 1 system "alerts that the password is incorrect" bs 3
16   }
17   postcondition "user successfully logged"
18 }

listing 1: use case specification using claret.

3 background

this section presents the mbt process, the claret notation, and the basic idea behind the two strategies used for reducing test case discard: distance functions and machine learning.

3.1 model-based testing

mbt aims to automatically generate and manage test suites from software specification models. mbt may use different model formats to achieve its goals (e.g., labeled transition systems (lts) (tretmans, 2008), uml diagrams (bouquet et al., 2007)). as mbt test suites are derived from specification artifacts, their test cases tend to reflect the system behavior (utting et al., 2012).
utting & legeard (2007) discuss a series of benefits of using mbt, such as sound test cases, high fault detection rates, and test cost reduction. regarding mbt limitations, we can list the need for well-built models, huge test suites, and a great number of obsolete test cases during software evolution. figure 1 presents an overview of the mbt process. the system models are specified through a dsl (e.g., uml), and a test generation tool is used to create the test suite. however, as the system evolves, edits must be performed on its models to keep them up-to-date. if any fault is found, the flow goes back to system development. these activities are repeated until the system is mature for release. note that previous test suites are discarded, and important historical data may be lost in this process.

figure 1. mbt process.

3.2 claret

claret (n. jorge et al., 2017, 2018) is a dsl and tool that allows the creation of use case specifications using natural language. it was designed to be the central artifact for both requirement engineering and mbt practices in agile projects. its toolset works as a syntax checker for use case description files and provides visualization mechanisms for use case revision. listing 1 presents a use case specification using claret. from the use case description in listing 1, claret generates its equivalent annotated labeled transition system (alts) model (tretmans, 2008) (figure 2). transition labels starting with [c] indicate pre or post conditions, while the ones starting with [s] and [e] are regular and exception execution steps, respectively.

figure 2. alts model of the use case from listing 1.

claret’s toolset includes a test generation tool, lts-bt (labeled transition system-based testing) (cartaxo et al., 2008). lts-bt is an mbt tool that uses lts models as input and generates test suites by searching for valid graph paths. the generated tests are reported in xml files that can be directly imported into a test management tool, testlink1. the test cases reported in section 2 were collected from lts-bt.

3.3 distance functions

distance functions are metrics for evaluating how similar, or different, two strings are (coutinho et al., 2016). distance functions have been used in different contexts (e.g., (runkler & bezdek, 2000; okuda et al., 1976; lubis et al., 2018)); for instance, coutinho et al. (2016) use distance functions for reducing mbt suites. there are several distance functions (e.g., (hamming, 1950; han et al., 2007; huang, 2008; de coster et al., 1; levenshtein, 1966)). for instance, the levenshtein function (levenshtein, 1966; kruskal, 1983) compares two strings (a and b) and calculates the number of operations required to transform a into b, and vice-versa:

lev_{a,b}(i,j) =
\begin{cases}
\max(i,j) & \text{if } \min(i,j) = 0\\
\min\{\, lev_{a,b}(i-1,j)+1,\;\; lev_{a,b}(i,j-1)+1,\;\; lev_{a,b}(i-1,j-1)+1_{(a_i \neq b_j)} \,\} & \text{otherwise}
\end{cases}

where 1_{(a_i \neq b_j)} is the indicator function, equal to 0 when a_i = b_j and equal to 1 otherwise, and lev_{a,b}(i,j) is the distance between the first i characters of a and the first j characters of b. to illustrate its use, consider two strings a = “kitten” and b = “sitting”. their levenshtein distance is three, since three operations are needed to transform a into b: (i) replacing ‘k’ with ‘s’; (ii) replacing ‘e’ with ‘i’; and (iii) inserting ‘g’ at the end. a more detailed discussion about the levenshtein and other functions, as well as open-source implementations of them, is available2.

1http://testlink.org/
2https://github.com/luozhouyang/python-string-similarity
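as an illustration, the following is a minimal python sketch of the recurrence above in its usual dynamic programming form (not the authors’ implementation); the last line shows one common way, assumed here, to normalize the distance to [0, 1] by dividing by the longer string’s length, in the spirit of the normalization used later in this paper (section 5).

  def levenshtein(a: str, b: str) -> int:
      # dp[i][j] holds the distance between a[:i] and b[:j]
      m, n = len(a), len(b)
      dp = [[0] * (n + 1) for _ in range(m + 1)]
      for i in range(m + 1):
          dp[i][0] = i                      # delete all of a[:i]
      for j in range(n + 1):
          dp[0][j] = j                      # insert all of b[:j]
      for i in range(1, m + 1):
          for j in range(1, n + 1):
              cost = 0 if a[i - 1] == b[j - 1] else 1   # indicator 1_{a_i != b_j}
              dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                             dp[i][j - 1] + 1,          # insertion
                             dp[i - 1][j - 1] + cost)   # substitution
      return dp[m][n]

  assert levenshtein("kitten", "sitting") == 3
  normalized = levenshtein("kitten", "sitting") / max(len("kitten"), len("sitting"))  # ~0.43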
3.4 machine learning

machine learning is a branch of artificial intelligence based on the idea that systems can learn from data, identify patterns, and make decisions with minimal human intervention (michie et al., 1994). by providing ways of building data-driven models, machine learning can produce accurate results and analyses (zhang & tsai, 2003). the learning process begins with observations or data (examples); it looks for patterns in the data and makes future decisions. by applying machine learning, one aims to allow computers to learn without human intervention and to adjust their actions accordingly. machine learning algorithms are often categorized as supervised or unsupervised. supervised machine learning algorithms (e.g., linear regression, logistic regression, neural networks) use labeled examples from the past to predict future events. unsupervised machine learning algorithms (e.g., k-means clustering, gaussian mixture models) are used when the training data is neither classified nor labeled; they infer a function to describe a hidden structure from unlabeled data. the use of machine learning in software engineering has grown in the past years. for instance, machine learning methods have been used for estimating development effort (srinivasan & fisher, 1995; baskeles et al., 2007), predicting software fault-proneness (gondra, 2008), fault prediction (shepperd et al., 2014), and improving code quality (malhotra & jain, 2012).
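to make the supervised setting concrete in the context of this paper, the sketch below trains a classifier that labels use case edits as low or high impact from distance values. it is an illustration under assumed data only: the feature values, labels, and the choice of scikit-learn’s logistic regression are hypothetical, not the authors’ setup, which is detailed in section 7.

  from sklearn.linear_model import LogisticRegression

  # each row holds normalized distance values for one edit
  # (e.g., levenshtein and jaccard); labels come from manual analysis
  x = [[0.05, 0.10], [0.08, 0.12], [0.72, 0.81], [0.65, 0.77]]
  y = [0, 0, 1, 1]   # 0 = low impact, 1 = high impact

  model = LogisticRegression().fit(x, y)
  print(model.predict([[0.07, 0.09]]))   # expected: [0], i.e., low impact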
4 analysing the impact of model evolution in mbt suites

to understand the impact of model evolution on mbt suites, we observed two projects (saff and bzc) from industrial partners. both systems were developed in the context of a cooperation between our research lab and two different companies, ingenico do brasil ltda and viceri solution ltda. the saff project is an information system that manages status reports of embedded devices, and bzc is a system for optimizing e-commerce logistic activities. the projects were run by two different teams; both teams applied agile practices and used claret for use case specification and generation of mbt suites.

both projects use manually executed, system-level, black-box test cases for regression purposes. in this sense, test case historical data is very important, since it can help to keep track of the system evolution and to avoid functionality regression. however, the teams reported that they often discard test cases when the related steps in the system use cases are updated in any form, which they refer to as a practical management problem. therefore, we mined the project repositories, traced each model change (use case update), and analyzed its impact on the generated suites.

table 1. summary of the artifacts used in our study.

          #use cases   #versions   #edits
  saff    13           42          415
  bzc     15           37          103
  total   28           79          518

our goal in this study was to better understand the impact of model updates on the test suites and to measure how much of a test suite is discarded. to guide this investigation, we defined the following research questions:

• rq1: how much of a test suite is discarded due to use case editions?
• rq2: what is the impact of low (syntactic) and high (semantic) model edits on a test suite?
• rq3: how much of an obsolete test case needs revision to be reused?

4.1 study procedure

for each claret file (use case model), we collected the history of its evolution in a time frame. in the context of our study, we consider a use case evolution to be any edit found between two consecutive versions of a claret file. our study worked with 28 use cases, a total of 79 versions, and an average of 5 step edits per claret file. table 1 presents a summary of the collected data. after that, we collected the test suites generated for each version of the claret files.

we extracted a change set for each pair of adjacent versions of a claret file (uc, uc’). in our analysis, we considered two kinds of edits/changes: (i) step update: any step in the base version (uc) that had its description edited in the delta version; and (ii) step removal: any step that existed in uc but not in uc’. we did not consider step additions: since our goal was to investigate reuse in a regression testing scenario, we considered suites generated using only the base version (uc); consequently, no step addition could be part of the generated tests. after that, we connected the change set to the test suites. for that, we ran a script that matched each edited step to the test cases it impacted. we say a test case is impacted by a modification if it includes at least one modified step from the change set. thus, our script clustered the tests based on oliveira neto et al. (2016)’s classification: obsolete — test cases that include updated or removed steps; these tests have different actions or system responses when compared to their previous version; and reusable — test cases that exercise only unmodified parts of the model specification; all their actions and responses remained the same when compared to their previous version. figure 3 summarizes our study procedure.

figure 3. procedure.

to observe the general impact of the edits, we measured how much of a test suite was discarded due to use case edits. being s_total the number of test cases generated from a use case (uc), s_obs the number of found obsolete test cases, and n the number of pairs of use case versions, we define the average number of obsolete test cases (aotc) by equation 1:

aotc = \left( \sum \frac{s\_obs}{s\_total} \right) \cdot \frac{1}{n} \quad (1)

then, we manually analyzed each element of the change set and classified it as low impact (syntactic edit), high impact (semantic edit), or a combination of both. for this analysis, we defined three versions of the aotc metric: aotc_syn, the average number of obsolete test cases due to low impact edits; aotc_sem, the average number of obsolete test cases due to high impact edits; and aotc_both, which considers tests with both low and highly impacted steps. finally, to investigate how much of a test case needs revision, we measured, for each test, how many steps were modified. for this analysis, we defined the ams (average modified steps) metric (equation 2), which measures the proportion of steps that need revision due to model edits. being tc_total the number of steps in a given test case, tc_c the number of steps that need revision, and n the number of test cases:

ams = \left( \sum \frac{tc\_c}{tc\_total} \right) \cdot \frac{1}{n} \quad (2)
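as an illustration of equations 1 and 2, the sketch below computes aotc and ams over made-up data (not the study’s dataset):

  def aotc(pairs):
      # pairs: one (s_obs, s_total) tuple per pair of use case versions
      return sum(obs / total for obs, total in pairs) / len(pairs)

  def ams(tests):
      # tests: one (tc_c, tc_total) tuple per obsolete test case
      return sum(c / total for c, total in tests) / len(tests)

  print(aotc([(8, 10), (9, 10)]))   # -> 0.85, i.e., 85% obsolete on average
  print(ams([(2, 8), (1, 4)]))      # -> 0.25, i.e., 25% of steps need revision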
4.2 results and discussion

our results evidence that mbt test suites can be very sensitive to any model evolution, as a great number of test cases were discarded: on average, 86% (aotc) of a suite’s tests turned obsolete between two consecutive versions of a use case file (figure 4). this result exposes one of the main difficulties of using mbt in agile projects: requirement files are very volatile. thus, since test cases are derived from these models, any model edit has a great impact on the generated tests. although a small number of test cases were reused, the teams in our study found the mbt suites useful; they mentioned that a series of unnoticed faults were detected and that the burden of creating tests was reduced. thus, we can say that mbt suites are effective, but there is still a need for solutions that reduce test case discard due to model updates.

rq1: how much of a test suite is discarded due to use case editions? on average, 86% of the test cases became obsolete between two versions of a use case model.

figure 4. reusable and obsolete test cases.

we manually analyzed each obsolete test case; figure 5 summarizes this analysis. as we can see, 52% of the cases became obsolete due to low impact (syntactic) edits in the use case models, while 21% were caused by high impact (semantic) changes, and 12% by both syntactic and semantic changes in the same model. therefore, more than half of the obsolete set refers to edits that could be easily revised and turned reusable without much effort (e.g., a step rephrasing, typo fixing).

figure 5. tests that became obsolete due to a given change type.

rq2: what is the impact of low (syntactic) and high (semantic) model edits on a test suite? 52% of the found obsolete tests were caused by low impact use case edits (syntactic changes), while 21% were due to high impact edits (semantic changes), and 12% by a combination of both.

we also investigated how much of an obsolete test case would need to be revised to avoid discarding. it is important to highlight that this analysis was based only on the number of steps that require revision; we did not measure the complexity of the revision. figure 6 shows the distribution of the found results. both medians were similar (25%): thus, a tester often needs to review 25% of a test case to turn it reusable, disregarding the impact of the model evolution. as most low impact test cases relate to basic syntactic step updates (e.g., fixing typos, rephrasing), we believe the costs of revisiting them can be minimal. for the highly impacted tests (semantic changes), it is hard to infer the costs, and, in some cases, discarding those tests can still be a valid option. however, a test case discard can be harmful and should be avoided.

rq3: how much of an obsolete test case needs revision to be reused? in general, a tester needs to revisit 25% of the steps of an obsolete test, regardless of the impact of the model edits.
figure 6. proportion of the test cases modified, by edit type (box plots of the proportion of changed steps in test cases (%), for semantic and syntactic obsolete tests).

5 distance functions to predict the impact of test case evolution

the study described in section 4 evidences the challenge of managing mbt suites during software evolution.
to test whether distance functions can help to cope with this problem, we ran a second empirical study. the goal of this study was to analyze the use of distance functions to automatically classify changes in use case documents that could impact mbt suites.

5.1 subjects and functions

for that, our study was also run in the context of the industrial projects saff and bzc. it is important to remember that both projects used agile methodologies to guide the development, and updates in the requirement artifacts were frequent. moreover, their teams used both claret (n. jorge et al., 2017), for use case specification, and lts-bt (cartaxo et al., 2008), for generating mbt suites. as our study focuses on the use of distance functions, we selected a set of ten of the most well-known functions that have been used in different contexts: hamming (hamming, 1950), lcs (han et al., 2007), cosine (huang, 2008), jaro (de coster et al., 1), jaro-winkler (de coster et al., 1), jaccard (lu et al., 2013), ngram (kondrak, 2005), levenshtein (levenshtein, 1966), osa (damerau, 1964), and sorensen dice (sørensen, 1948). to perform systematic analyses, we normalized their results so that their values range from zero to one. values near zero indicate high similarity (a small distance), while values near one indicate low similarity. we reused open-source implementations of all ten functions3,4. to customize and analyze the edits in the context of our study, we created our own tool and scripts, which were verified through a series of tests. we mined the projects' repositories and collected all use case edits. each of these edits would then impact the test cases. we call "impacted" any test case that includes steps that were updated during model maintenance. we aim to use distance functions to help us classify these edits and avoid test case discards.

3 https://github.com/luozhouyang/python-string-similarity
4 https://rosettacode.org/wiki/category:programming_tasks

to guide our investigation, we defined the following research questions:
• rq4: can distance functions be used to classify the impact of edits in use case documents?
• rq5: which distance function presents the best results for classifying edits in use case documents?

5.2 study setup and procedure

since all use case documents were claret files, we reused the data collected in the study of section 4. therefore, a total of 79 pairs of use case versions were analyzed in this study, with a total of 518 edits. table 1 summarizes the data. after that, we manually analyzed each edit and classified it as either low impact or high impact. a low impact edit refers to changes that do not alter the system behavior (a purely syntactic edit), while a high impact edit refers to changes in the system's expected behavior (a semantic edit). table 2 exemplifies this classification.

table 2. classification of edits.
version 1 | version 2 | classification
"extract data on offline mode." | "show page that requires new data." | high impact
"show page that requires new data." | "show page that requires new terminal data." | low impact
"click on edit button" | "click on the edit button" | low impact

while the edit in the first line changes the semantics of the original requirement, the next two refer to edits performed to improve readability and fix typos. during our classification, we found 399 low impact and 27 high impact edits for the saff system, and 92 low and 11 high impact edits for bzc.
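as an illustration of the normalization convention above, the following python sketch divides a classic levenshtein edit distance by the length of the longer string, so results fall in [0, 1] with zero meaning identical strings. this is our own minimal implementation for illustration, not the authors' tooling or the open-source libraries cited in the footnotes.

```python
# minimal sketch of the normalization convention used in the study:
# raw edit distance divided by the longest string length, so results
# fall in [0, 1]. function names are ours, not the authors' tooling.

def levenshtein(a: str, b: str) -> int:
    # classic dynamic-programming edit distance with rolling rows
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def normalized_distance(a: str, b: str) -> float:
    # 0.0 = identical strings, 1.0 = maximally different
    if not a and not b:
        return 0.0
    return levenshtein(a, b) / max(len(a), len(b))

# the low impact example from table 2 yields a small distance value
print(normalized_distance("click on edit button",
                          "click on the edit button"))  # ~0.17
```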
this result shows that use cases often evolve through basic description improvements, which may not justify the great number of discarded test cases in mbt suites. after that, for each edit (original and edited versions), we ran the distance functions using different configuration values and observed how they classified the edits compared to our manual validation.

5.3 metrics

to help us evaluate the results and answer our research questions, we used three of the most well-known metrics for checking binary classifications: precision, the rate of relevant instances among the retrieved ones; recall, the rate of relevant retrieved instances over the total of relevant instances; and accuracy, the rate of correct classifications over all instances. these metrics have been used in several software engineering empirical studies (e.g., (nagappan et al., 2008; hayes et al., 2005; elish & elish, 2008)). equations 3, 4 and 5 present those metrics, where tp refers to the number of cases in which a distance function classified an edit as low impact and the manual classification confirms it; tn refers to the number of matches regarding high impact edits; fp refers to cases in which the automatic classification reports a low impact edit when in fact a high impact edit was found; and fn refers to cases in which the automatic classification reports high impact when it should be low impact.

precision = tp / (tp + fp) (3)
recall = tp / (tp + fn) (4)
accuracy = (tp + tn) / (tp + tn + fp + fn) (5)

5.4 results and discussion

to answer rq4, we first divided our dataset of use case edits into two groups (low and high impact edits), according to our manual classification. then, we ran the distance functions and plotted their results. figures 7 and 8 show the box-plot visualization of this analysis considering the found low impacts (figure 7) and high impacts (figure 8). as we can see, most low impact edits in fact refer to low distance values (median lower than 0.1) for all distance functions. this result gives us evidence that low distance values can relate to low impact edits and, therefore, can be used for predicting low impact changes in mbt suites. on the other hand, we could not find a strong relationship between high impact edits and distance values. therefore, we can answer rq4 by stating that distance functions, in general, can be used to classify low impact edits.

rq4: can distance functions be used to classify the impact of edits in use case documents? low impact edits are often related to lower distance values. therefore, distance functions can be used for classifying low impact edits.

figure 7. box-plot for low impact distance values.
figure 8. box-plot for high impact distance values.

as for automatic classification, we need to define an effective impact threshold for each distance function. thus, we ran an exploratory study to find the optimal configuration for using each function. by impact threshold, we mean the distance value for classifying an edit as low or high impact. for instance, consider a defined impact threshold of x to be used with function f: when analyzing an edit from a specification document, if f provides a value lower than x, we say the edit is low impact; otherwise, it is high impact. therefore, we designed a study where, for each function, we varied the defined impact threshold and observed how it would impact precision and recall. our goal with this analysis is to identify the most effective configuration for each function. we ranged the impact threshold within [0; 1].
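as a minimal sketch, equations (3)-(5) translate directly into code; here a "positive" is an edit classified as low impact, so tp counts automatic low impact classifications confirmed by the manual one. the function names are ours, and the sketch assumes non-zero denominators.

```python
# direct transcription of equations (3)-(5) for this study's setting

def precision(tp: int, fp: int) -> float:
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    return tp / (tp + fn)

def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    return (tp + tn) / (tp + tn + fp + fn)

print(precision(95, 5), recall(95, 5), accuracy(95, 90, 5, 5))
```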
to find this optimal configuration, we consider the interception point between the precision and recall curves, since it reflects a scenario with fewer mistaken classifications (false positives and false negatives). figure 9 presents the analysis for the jaccard function. its optimal configuration is highlighted (impact threshold of 0.33): the green line refers to the precision curve, the blue line to the recall curve, and the red circle shows the point where both curves meet. figure 10 presents the analysis for the other functions.

figure 9. best impact threshold for the jaccard function.
figure 10. exploratory study for precision and recall per distance function.

table 3 presents the optimal configuration for each function and the respective precision, recall, and accuracy values. these results reinforce our evidence to answer rq4, since all functions presented accuracy values greater than 90%. moreover, we can partially answer rq5, since we have now found, considering our dataset, the best configuration for each distance function.

table 3. best configuration for each function and respective precision, recall and accuracy values.
function | impact threshold | precision | recall | accuracy
hamming | 0.91 | 94.59% | 94.79% | 90.15%
levenshtein | 0.59 | 95.22% | 95.42% | 91.31%
osa | 0.59 | 95.22% | 95.42% | 91.31%
jaro | 0.28 | 95.01% | 95.21% | 90.93%
jaro-winkler | 0.25 | 95.21% | 95.21% | 91.12%
lcs | 0.55 | 94.99% | 94.79% | 90.54%
jaccard | 0.33 | 95.22% | 95.42% | 91.31%
ngram | 0.58 | 95.41% | 95.21% | 91.31%
cosine | 0.13 | 95% | 95% | 90.73%
sørensen–dice | 0.47 | 94.99% | 94.79% | 90.54%

to complement our analysis, we investigated which function performed best. first, we ran proportion tests considering the functions both all at once and pair-to-pair. with 95% confidence, we could not find any statistical difference among the functions. this means that using a distance function for the automatic classification of edit impact is effective regardless of the chosen function (rq5). therefore, in practice, one can decide which function to use based on convenience aspects (e.g., easier to implement, faster).

rq5: which distance function presents the best results for classifying edits in use case documents? statistically, all ten distance functions performed similarly when classifying edits from use case documents.

6 case study

to reassure the conclusions presented in the previous section, and to provide a more general analysis, we ran new studies considering a different object, tcom. tcom is an industrial software system also developed in the context of our cooperation with ingenico brasil ltda. it controls the execution and manages the testing results of a series of hardware parts. it is important to highlight that a different team ran this project, but in a similar environment: claret use cases for specification and generated mbt suites. the team also reported similar problems concerning volatile requirements and frequent test case discards. first, similar to the procedure applied in section 5.2, we mined tcom's repository and collected all versions of its use case documents and their edits. table 4 summarizes the collected data from tcom.

table 4. summary of the artifacts for the tcom system.
system | #use cases | #versions | #edits
tcom | 7 | 32 | 133

then, we manually classified all edits between low and high impact to serve as validation for the automatic classification.
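the exploratory search can be sketched as follows. this is our own illustration, assuming the edits are available as (normalized distance, manual label) pairs; it sweeps candidate thresholds and keeps the one where precision and recall for the low impact class are closest, approximating the intersection of the two curves.

```python
# sweep candidate thresholds in [0, 1]; an edit is predicted "low
# impact" when its distance value falls below the threshold. the data
# layout and names are assumptions for this sketch.

def sweep_threshold(edits, steps=100):
    best = None
    for k in range(steps + 1):
        t = k / steps
        tp = sum(1 for d, label in edits if d < t and label == "low")
        fp = sum(1 for d, label in edits if d < t and label == "high")
        fn = sum(1 for d, label in edits if d >= t and label == "low")
        if tp == 0:
            continue  # no low impact predictions at this threshold yet
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        gap = abs(precision - recall)
        if best is None or gap < best[0]:
            best = (gap, t, precision, recall)
    return best  # (gap, threshold, precision, recall)

# toy data: three low impact edits with small distances, two high impact
gap, threshold, p, r = sweep_threshold(
    [(0.05, "low"), (0.10, "low"), (0.08, "low"),
     (0.42, "high"), (0.77, "high")])
print(threshold, p, r)
```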
finally, we ran all distance functions considering the optimal impact thresholds (table 3, second column) and calculated precision, recall, and accuracy for each configuration (table 5).

table 5. tcom: evaluating the use of the found impact threshold for each function and respective precision, recall and accuracy values.
function | impact threshold | precision | recall | accuracy
hamming | 0.91 | 87.59% | 94% | 84.96%
levenshtein | 0.59 | 87.85% | 94% | 85.71%
osa | 0.59 | 87.85% | 94% | 85.71%
jaro | 0.28 | 89.52% | 94.00% | 87.22%
jaro-winkler | 0.25 | 94.00% | 89.52% | 87.22%
lcs | 0.55 | 89.62% | 95% | 87.97%
jaccard | 0.33 | 89.52% | 94% | 87.22%
ngram | 0.58 | 87.85% | 94% | 85.71%
cosine | 0.13 | 88.68% | 94% | 86.47%
sørensen–dice | 0.47 | 88.68% | 94% | 86.47%

as we can see, the found impact thresholds presented high precision, recall, and accuracy values when used in a different system and context (all above 84%). this result gives us evidence that distance functions are effective for the automatic classification of edits (rq4) and that the found impact thresholds performed well for a different experimental object (rq5). in a second moment, we used this case study to evaluate how our approach (using distance functions for automatic classification) can help to reduce test discards:

• rq6: can distance functions be used for reducing the discard of mbt tests?

to answer rq6, we considered tcom's mbt test cases generated from its claret files. since all distance functions behave similarly (section 5.4), in this case study we used only levenshtein's function to automatically classify the edits and to check the impact of those edits on the tests. in a common scenario, which we want to avoid, any test case that contains an updated step would be discarded. therefore, in the context of our study, we used the following strategy: "only test cases that contain high impact edits should be discarded, while test cases with low impact edits are likely to be reused with no or little updating". the rationale behind this decision is that low impact edits often imply little to no changes to the system behavior. considering system-level black-box test suites (such as the ones from the projects used in our study), those tests should be easily reused.

table 6. example of a low impacted test case.
version 1: ... step 1: operator presses the terminal approving button. step 2: system goes back to the terminal profiling screen. ...
version 2: ... step 1: operator presses the terminal approving button. step 2: system redirects the terminal to its profiling screen. ...

following this strategy, we first applied oliveira neto et al.'s classification (oliveira neto et al., 2016), which divides tcom's tests into three sets: obsolete – test cases that include impacted steps; reusable – test cases that were not impacted by the edits; and new – test cases that include new steps. a total of 1477 mbt test cases were collected from tcom, of which 333 were found new (23%), 724 obsolete (49%), and 420 reusable (28%). this data reinforces silva et al. (2018)'s conclusions showing that, in an agile context, most of an mbt test suite becomes obsolete quite fast. in a common scenario, all "obsolete" test cases (49%) would be discarded throughout the development cycles.
to cope with this problem, we ran our automatic analysis and reclassified the 724 obsolete test cases into: low impacted – test cases that include unchanged steps and updated steps classified by our strategy as "low impact"; highly impacted – test cases that include unchanged steps and "high impact" steps; and mixed – test cases that include at least one "high impact" step and at least one "low impact" step. from this analysis, 109 test cases were low impacted. although this number seems low (15%), those test cases would be wrongly discarded when in fact they could be easily turned into reusable ones. for instance, table 6 shows a simplified version of a "low impacted" test case from tcom. as we can see, only step 2 was updated, to better phrase a system response. this was an update for improving specification readability, but it does not have any impact on the system's behavior. we believe that low impacted test cases could be easily reused with little or no effort. in our case study, most of them need small updates that could easily be done in a single updating round, or even during test case execution. from the tester's point of view, this kind of update may not be urgent and should not lead to a test case discard. the remaining test cases were classified as follows: 196 "highly impacted" (27%) and 419 "mixed" (58%). tables 7 and 8 show examples of highly impacted and mixed tests, respectively.

table 7. example of a highly impacted test case.
version 1: ... step 3: operator presses camera icon. step 4: system redirects to photo capture screen. ... step 9: operator takes a picture and presses the back button. ...
version 2: ... step 3: operator selects a testing plan. step 4: system redirects to the screen that shows the selected tests. ... step 9: operator sets a score and presses ok. ...

table 8. example of a mixed test case.
version 1: ... step 2: operator presses button cancel to mark there is no occurrence description. ... step 7: operator presses the button send. ...
version 2: ... step 2: operator presses the button cancel to mark there is no occurrence description. ... step 7: operator takes a picture of the hardware. ...

in table 7, we can see that steps 3, 4, and 9 were drastically changed, which implies a test case that requires much effort to turn into a reusable one. on the other hand, in the test in table 8, we have both an edit for fixing a typo (step 2) and an edit with a requirement change (step 7). to check whether our classification was in fact effective, we present its confusion matrix (table 9). in general, our classification was 66% effective (precision). a smaller precision was expected when compared to the classification precision from section 5, since here we consider all edits that might affect a test case, while in section 5 we analyzed and classified each edit individually. however, as we can see, our classification was highly effective for low impacted test cases, and most mistaken classifications relate to mixed tests (tests that combine low and high impact edits). those were, in fact, test cases that were affected to a great extent by different types of use case editions. back to our strategy, we believe that highly impacted or mixed classifications indicate test cases that are likely to be discarded, since they refer to tests that would require much effort to be updated, while low impacted tests can be reused with little to no effort. overall, our strategy correctly classified the test cases in 66% of the cases (precision). regarding low impacted tests, we correctly classified 63% of them.
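the bucketing rule we applied can be summarized by a small sketch. it assumes each obsolete test case carries the list of impact labels assigned to its updated steps; the function and data layout are illustrative, not the exact implementation used in the study.

```python
# our own summary of the reclassification rule: a test case is bucketed
# from the impact labels ("low"/"high") of its updated steps.

def classify_test_case(step_impacts):
    labels = set(step_impacts)
    if labels == {"low"}:
        return "low impacted"     # cheap revision, candidate for reuse
    if labels == {"high"}:
        return "highly impacted"  # likely discard
    return "mixed"                # low and high edits: inspect manually

print(classify_test_case(["low", "low"]))           # low impacted
print(classify_test_case(["high"]))                 # highly impacted
print(classify_test_case(["low", "high", "low"]))   # mixed
```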
therefore, from the 724 "obsolete" test cases, our strategy automatically inferred that 9.53% of them should not be discarded.

table 9. confusion matrix.
| predicted low | predicted high | predicted mixed | total
actual low | 69 | 3 | 37 | 109
actual high | 4 | 37 | 155 | 196
actual mixed | 21 | 27 | 371 | 419
total | 94 | 67 | 563 | 724

we believe this rate can get higher when we better analyze the mixed set. a mixed test combines low and high impact edits. however, when we manually analyzed those cases, we found several examples where, although high impact edits were found, most test case impacts were related to low impact edits. for instance, there was a test case composed of 104 execution steps where only one of those steps needed revision due to a high impact use case edit, while the number of low impact edits was seven. in a practical scenario, although we still classify it as a mixed test case, we would say the impact of the edits was quite small, which may indicate a manageable revision effort. thus, we state that mixed tests need better analysis before discarding. the same approach may also work for highly impacted tests when they relate to a low number of edits. finally, we can answer rq6 by saying that an automatic classification using distance functions can, in fact, reduce the number of discarded test cases by at least 9.53%. however, this rate tends to be higher when we consider mixed tests.

rq6: can distance functions be used for reducing the discard of mbt tests? the use of distance functions can reduce the number of discarded test cases by at least 9.53%.

7 combining machine learning and distance values

in the previous sections, we showed that distance functions alone can help to identify low impact edits that lead to test cases that can be updated with little effort. moreover, the case study in section 6 showed that this strategy can reduce test case discard by 9.53%. although very promising, we believe those results could be improved, especially regarding the classification of high impact edits. in this sense, we propose a complementary strategy that combines distance values and machine learning. to apply machine learning and avoid test case discard, we used keras (gulli & pal, 2017). keras is a high-level python neural network api that runs on top of tensorflow (abadi et al., 2016). it focuses on efficiency and productivity; therefore, it allows easy and fast prototyping. moreover, it has strong adoption in both industry and the research community (géron, 2019). keras provides two types of models: sequential, and model with the functional api. we opted to use a sequential model due to its easy configuration and effective results. a sequential model is composed of a linear stack of layers. each layer contains a series of nodes and performs calculations. a node is activated only when a certain threshold is achieved. in the context of our model, we used the rectified linear unit (relu) and softmax activation functions. both are known to be a good fit for classification problems (agarap, 2018). dense layer nodes are connected to all nodes of the next layer, while dropout layers randomly deactivate a fraction of their nodes during training, which helps to reduce overfitting. our model classifies whether two versions of a given test case step refer to a low or high impact edit. although techniques such as word2vec (rong, 2014) could be used to transform step descriptions into numeric vectors, due to previous promising results (sections 5 and 6), we opted to use a classification based on a combination of different distance values.
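a hedged sketch of a network in the spirit of this description is shown below: a keras sequential model that maps the ten distance values to a two-position probability array (high impact, low impact). the exact layer sizes and the dropout rate are our assumptions; the text only fixes the overall shape (a dense input layer, four hidden layers, relu and softmax activations).

```python
# sketch only: layer sizes and dropout rate are assumptions, not the
# paper's exact architecture.
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Dense(32, activation="relu", input_shape=(10,)),  # ten distance values in
    layers.Dense(16, activation="relu"),
    layers.Dropout(0.2),  # randomly deactivates nodes during training
    layers.Dense(16, activation="relu"),
    layers.Dense(8, activation="relu"),
    layers.Dense(2, activation="softmax"),  # [p(high impact), p(low impact)]
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```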
therefore, to be able to use the model, we first pre-process the input (two versions of a test step), run the ten functions (hamming (hamming, 1950), lcs (han et al., 2007), cosine (huang, 2008), jaro (de coster et al., 1), jaro-winkler (de coster et al., 1), jaccard (lu et al., 2013), ngram (kondrak, 2005), levenshtein (levenshtein, 1966), osa (damerau, 1964), and sorensen dice (sørensen, 1948)), and collect their distance values. those values are then provided to our model, which starts with a dense layer, is followed by four hidden layers, and returns as output a size-two probability array o. o's first position refers to the probability of a given edit being classified as high impact, while the second refers to the probability of a low impact edit. the higher of those two values is the final classification of our model. suppose two versions of the test step s (s and s'). first, our strategy runs the ten distance functions considering the pair (s; s') and generates its model input set (e.g., i = 0.67; 0.87; 0.45; 0.78; 0.34; 0.6; 0.5; 0.32; 0.7; 0.9). this set is then provided to our model, which generates the output array (e.g., o = [0.5; 0.9]). for this example, o indicates that the edits that transformed s into s' are high impact with 50% probability and low impact with 90% probability. therefore, our final classification is that the edit was low impact. for training, we created a dataset with 78 instances of edits randomly collected from both the saff and bzc projects. to avoid possible bias, we worked with a balanced training dataset (50% low impact and 50% high impact edits). moreover, we reused the manual classification discussed in sections 4 and 5 as reference answers for the model. on an intel core i3 notebook with 4gb of ram, the training phase was performed in less than 10 minutes.

7.1 model evaluation

similar to the investigations described in sections 5 and 6, we conducted an investigation to validate whether the strategy that combines machine learning with distance values is effective and promotes the reduction of test case discard. for that, we set the following research questions:

• rq7: can the combination of machine learning and distance values improve the classification of edits' impact in use case documents?
• rq8: can the combination of machine learning and distance values reduce the discard of mbt tests?

to answer rq7 and evaluate our model, we first ran it against two data sets: (i) the model edits combined from the saff and bzc projects (table 1); and (ii) the model edits collected from the tcom project (table 4). while the first provides a more comprehensive set, the second allows us to test the model in a whole new scenario. it is important to highlight that both data sets contain only real model edits performed by the teams. moreover, they contain both low and high impact edits. again, we reused our manual classification to validate the model's output. table 10 presents the results of this evaluation.

table 10. results of our model evaluation.
dataset | precision | recall | accuracy
saff+bzc | 81% | 97% | 80%
tcom | 94% | 99% | 95%

as we can see, our strategy performed well for predicting the edits' impact; for tcom especially, it provided an accuracy of 95%. these results give us evidence of our model's efficiency.

rq7: can the combination of machine learning and distance values improve the classification of edits' impact in use case documents?
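putting the pieces together, the classification step can be sketched as follows, assuming `model` and a list `distance_functions` (each mapping a pair of strings to a normalized value) from the previous sketches; these names are ours.

```python
# end-to-end sketch of classifying one step edit with the trained model
import numpy as np

def classify_edit(model, distance_functions, step_v1: str, step_v2: str) -> str:
    # i: the ten distance values for the step pair, as a 1x10 batch
    features = np.array([[f(step_v1, step_v2) for f in distance_functions]])
    probs = model.predict(features)[0]  # o: [p(high impact), p(low impact)]
    return "high impact" if probs[0] > probs[1] else "low impact"
```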
our strategy was able to classify edits with accuracy above 80%, an improvement of 7% when compared to the classification using only distance functions. this result reflects its efficiency.

to answer rq8, we considered tcom's mbt test cases generated from its claret files. we reused the manual classification from section 6 and ran our strategy to automatically reclassify the 724 obsolete test cases into: low impacted – test cases that include unchanged steps and updated steps classified by our strategy as "low impact"; highly impacted – test cases that include unchanged steps and "high impact" steps; and mixed – test cases that include at least one "high impact" step and at least one "low impact" step. from the 109 actual low impacted test cases, our strategy was able to detect 75 (69%), an increase of 6% when compared to the classification using a single distance function. those would be test cases that could easily be revised to avoid discarding, as the model changes were minimal (figure 6). table 9 presents the confusion matrix for our model classification. out of the 724 obsolete test cases (according to oliveira neto et al.'s classification (oliveira neto et al., 2016)), our model would help a tester to automatically save 10.4% from discarding. as we can see, overall, our classification was 69% effective, an increase of 3% when compared to the classification using a single distance function (table 9). although this improvement may seem low, it is important to remember that those are the actual tests that would be saved from a wrong discard. on the other hand, we can see a great improvement in the high impact classification (from 19% to 86%). this indicates that, unlike the strategy using a single distance function, our model can be of great help to automatically identify both reusable and in fact obsolete test cases. on the other hand, the classification of mixed test cases performed worse (from 88% to 61%). however, we believe that mixed test cases are the ones that require a manual inspection to check whether they are worth updating for reuse or should be discarded. it is important to highlight that our combined strategy was able to improve the performance rates for the most important classifications (low and highly impacted), which are related to major practical decisions (whether or not to discard a test case). moreover, when wrongly classifying a test case, our model often sets it as mixed, for which we recommend manual inspection. therefore, our automatic classification tends to be accurate and not misleading. finally, we can answer rq8 by saying that the combined strategy was, in fact, effective for reducing the discard of mbt tests. the rate of saved tests was 10.4%. moreover, compared to the strategy using a single distance function, it improved the detection rate by 6% for low impacted test cases and by 67% for high impact ones.

rq8: can the combination of machine learning and distance values reduce the discard of mbt tests? our combined strategy helped us to reduce the discard of test cases by 10.4%, an increase of 0.9% over the single-function strategy. in addition, it correctly identifies test cases that should in fact be discarded.

8 general discussion

in the previous sections, we proposed two different strategies for predicting the impact of model edits and avoiding test case discard: (i) a pure distance-function-based strategy; and (ii) a strategy that combines machine learning with distance values. both were evaluated in case studies with real data.
the first strategy applies a simpler analysis, which may imply lower costs. though simple, it was able to correctly identify 63% of the low impacted test cases and to rescue 9.53% of the test cases that would otherwise be discarded. however, it did not perform well when classifying highly impacted tests (19%). our second approach, though more complex (it requires a set of distance values as inputs to the model), generated better results for classifying low impacted and highly impacted test cases: 68% and 86% precision, respectively. moreover, it helped us to avoid the discard of 10.4% of the test cases. therefore, if running several distance functions for each model edit is not an issue, we recommend the use of (ii), since it is in fact the best option for automatically classifying test cases that should be reused (low impacted) or discarded (highly impacted). moreover, regarding time, our prediction model's responses were almost instantaneous. regarding mixed tests, our suggestion is to always inspect them to decide whether they are worth updating.

9 threats to validity

most of the threats to the validity of the drawn conclusions refer to the number of projects, use cases, and test cases used in our empirical studies. those numbers were limited to the artifacts created in the context of the selected projects. therefore, our results cannot be generalized beyond the three projects (saff, bzc, and tcom). however, it is important to highlight that all used artifacts are from real industrial systems from different contexts. as for conclusion validity, our studies deal with a limited data set. again, since we chose to work with real instead of artificial artifacts, the data available for analysis were limited. however, the data was validated by the team engineers and by the authors. one may argue that, since our study deals only with claret use cases and test cases, our results are not valid for other notations. however, claret resembles traditional specification formats (e.g., uml use cases). moreover, claret test cases are basically sequences of pairs of steps (user input – system response), which relate to most manual testing at the system level. regarding internal validity, we collected the change sets from the projects' repositories, and we manually classified each change according to its impact. this manual validation was performed by at least two of the authors and, when needed, the projects' members were consulted. moreover, we reused open-source implementations of the distance functions5. these implementations were also validated by the first author.

10 related work

the practical gains of regression testing are widely discussed (e.g., (aiken et al., 1991; leung & white, 1989; wong et al., 1997; ekelund & engström, 2015)). in the context of agile development, this testing strategy plays an important role by working as a safety net while changes are performed (martin, 2002). parsons et al. (2014) investigate regression testing strategies in agile development teams and identify factors that can influence the adoption and implementation of this practice. they found that the investment in automated regression testing is positive, and that tools and processes are likely to be beneficial for organizations. our strategies (distance functions alone, and distance functions combined with machine learning) are automatic ways to enable the preservation of regression test cases. ali et al. (2019) propose a test case prioritization and selection approach for improving regression testing in agile projects.
their approach prioritizes test cases by clustering the ones that frequently change. here, we see a clear example of the importance of preserving test cases. some works relate agile development to model-based testing, demonstrating the general interest in these topics. katara & kervinen (2006) introduce an approach to generate tests from use cases. tests are translated into sequences of events called action-words. this strategy requires an expert to design the test models. puolitaival (2008) presents a study on the applicability of mbt in agile development. they refer to the need for technical practitioners and specific adaptations when performing mbt activities. katara & kervinen (2006) discuss how mbt can support agile development. for that, they emphasize the need for automation so that mbt artifacts can be manageable and applied with little effort. cartaxo et al. (2008) propose a strategy/tool for generating test cases from alts models and selecting different paths. since the alts models reflect use cases written in natural language, the generated suites suffer from the problems evidenced in our study (a great number of obsolete test cases) as the model evolves. oliveira neto et al. (2016) discuss a series of problems related to keeping mbt suites updated during software evolution. to cope with this problem, they propose a test selection approach that uses test case similarity as input when collecting test cases that focus on recently applied changes. oliveira neto et al.'s approach classifies as obsolete all test cases that are impacted in any way by edits in the requirement model. however, as our study found, a great part of those tests can be little impacted and could be easily reused, avoiding the discard of testing artifacts. the test case discard problem is not restricted to claret artifacts. other similar cases are discussed in the literature (e.g., (oliveira neto et al., 2016; nogueira et al., 2007)). moreover, this problem is even greater with mbt test cases derived from artifacts that use non-controlled language (pinto et al., 2012). other works also deal with test case evolution (e.g., (katara & kervinen, 2006; pinto et al., 2012)). they discuss the problem and/or propose strategies for updating the testing code. those strategies do not apply to our context, as we work with the evolution of mbt test suites generated from use case models. distance functions have been used in different software engineering scenarios (e.g., (runkler & bezdek, 2000; okuda et al., 1976; lubis et al., 2018)). for instance, runkler & bezdek (2000) use the levenshtein function to automatically extract keywords from documents. in the context of mbt, coutinho et al. (2016) investigated the effectiveness of a series of distance functions when combined with strategies for suite reduction based on similarity. although in a different context, their results are in line with ours, where all distance functions performed in a similar way. the use of machine learning techniques in software engineering is not new. baskeles et al. (2007) propose a model for estimating development effort, aiming at overcoming problems related to budget and schedule extension. gondra (2008) uses an artificial neural network to determine the importance of software metrics for predicting fault-proneness. durelli et al. (2019) present a systematic mapping study on machine learning applied to software testing.

5 https://github.com/luozhouyang/python-string-similarity
from 48 selected primary studies, they found that machine learning has been used mainly for test case generation, refinement, and evaluation. for instance, strug & strug (2012) use a knn learner to reduce the set of mutants to be executed in mutation testing. it predicts when a test can kill certain mutants. fraser & walkinshaw (2015) propose an approach based on machine learning algorithms to evaluate test suites using behavioral coverage. it receives data from a test generation tool and predicts the behavior of the program for the given inputs. zhu et al. (2008) propose a model for estimating test execution effort based on testing data such as the number of test cases, test complexity, and knowledge of the system under test. chen et al. (2011) present a machine learning approach for selecting regression test cases. their learner clusters similar test cases based on an input function and constraints. our work differs from the others since it uses machine learning and distance functions to predict the impact of a given use case update and to avoid the discard of mbt test cases.

11 concluding remarks

in this paper, we describe a series of empirical studies run on industrial systems for evaluating the use of distance functions to automatically classify the impact of edits in use case files. our results showed that distance functions are effective in identifying low impact editions. therefore, we proposed two variations of their use: as a classification strategy in itself (section 5), and combined with a machine learning model (section 7). we also found that low impact editions often refer to test cases that can be easily updated with little or no effort. our strategies helped to identify both low impact and high impact test cases. we believe those results can help testers to better work with mbt artifacts in the context of software evolution and avoid the discard of test cases. as future work, we plan to expand our study with a broader set of systems. we also consider developing a tool that, using distance functions, can help testers to identify and update low impact test cases. finally, we plan to investigate the use of different approaches (e.g., other machine learning techniques, dictionaries) to improve our classification rates and better help testers when updating highly impacted test cases.

acknowledgements

this research was partially supported by a cooperation between ufcg and two companies, viceri solution ltda and ingenico do brasil ltda, the latter stimulated by the brazilian informatics law n. 8.248, 1991. the second and fourth authors are supported by the national council for scientific and technological development (cnpq)/brazil (processes 429250/2018-5 and 315057/2018-1). the first and third authors were supported by ufcg/cnpq and capes, respectively.

references

abadi m., et al., 2016, in 12th {usenix} symposium on operating systems design and implementation ({osdi} 16). pp 265–283
agarap a. f., 2018, arxiv preprint arxiv:1803.08375
aiken l. s., west s. g., reno r. r., 1991, multiple regression: testing and interpreting interactions. sage
ali s., hafeez y., hussain s., yang s., 2019, software quality journal, pp 1–27
anderson j., salem s., do h., 2014, in proceedings of the 11th working conference on mining software repositories. pp 142–151
baskeles b., turhan b., bener a., 2007, in 2007 22nd international symposium on computer and information sciences. pp 1–6
beck k., gamma e., 2000, extreme programming explained: embrace change. addison-wesley professional
bouquet f., grandpierre c., legeard b., peureux f., vacelet n., utting m., 2007, in proceedings of the 3rd international workshop on advances in model-based testing. pp 95–104
cai l., tong w., liu z., zhang j., 2009, in 2009 15th ieee pacific rim international symposium on dependable computing. pp 103–108
cartaxo e. g., andrade w. l., neto f. g. o., machado p. d., 2008, in proceedings of the 2008 acm symposium on applied computing. pp 1540–1544
chen s., chen z., zhao z., xu b., feng y., 2011, in 2011 fourth ieee international conference on software testing, verification and validation. pp 1–10
cohen w. w., ravikumar p., fienberg s. e., et al., 2003, in iiweb. pp 73–78
coutinho a. e. v. b., cartaxo e. g., de lima machado p. d., 2016, software quality journal, 24, 407
dalal s. r., jain a., karunanithi n., leaton j. m., lott c. m., patton g. c., horowitz b. m., 1999, in proceedings of the 1999 international conference on software engineering (ieee cat. no.99cb37002). pp 285–294, doi:10.1145/302405.302640
damerau f. j., 1964, commun. acm, 7, 171
de coster x., de groote c., destiné a., deville p., lamouline l., leruitte t., nuttin v., 1
diniz t., alves e. l., silva a. g., andrade w. l., 2019, in proceedings of the xxxiii brazilian symposium on software engineering. pp 337–346
durelli v. h., durelli r. s., borges s. s., endo a. t., eler m. m., dias d. r., guimaraes m. p., 2019, ieee transactions on reliability, 68, 1189
ekelund e. d., engström e., 2015, in 2015 ieee international conference on software maintenance and evolution (icsme). pp 449–457
elish k. o., elish m. o., 2008, journal of systems and software, 81, 649
frakes w., 1994, in proceedings of 1994 3rd international conference on software reuse. pp 2–3
fraser g., walkinshaw n., 2015, software testing, verification and reliability, 25, 749
géron a., 2019, hands-on machine learning with scikit-learn, keras, and tensorflow: concepts, tools, and techniques to build intelligent systems. o'reilly media
gondra i., 2008, journal of systems and software, 81, 186
gulli a., pal s., 2017, deep learning with keras. packt publishing ltd
hamming r. w., 1950, the bell system technical journal, 29, 147
han t. s., ko s.-k., kang j., 2007, in international workshop on machine learning and data mining in pattern recognition. pp 585–600
harrold m. j., 2000, in proceedings of the conference on the future of software engineering. pp 61–72
hayes j. h., dekhtyar a., sundaram s., 2005, in acm sigsoft software engineering notes. pp 1–5
he z., shu f., yang y., li m., wang q., 2012, automated software engineering, 19, 167
huang a., 2008, in proceedings of the sixth new zealand computer science research student conference (nzcsrsc2008), christchurch, new zealand. pp 9–56
itkonen j., mantyla m. v., lassenius c., 2009, in 2009 3rd international symposium on empirical software engineering and measurement. pp 494–497, doi:10.1109/esem.2009.5314240
katara m., kervinen a., 2006, in haifa verification conference. pp 219–234
kondrak g., 2005, in consens m. p., navarro g., eds, lecture notes in computer science vol. 3772, spire. springer, pp 115–126, http://dblp.uni-trier.de/db/conf/spire/spire2005.html#kondrak05
kruskal j. b., 1983, siam review, 25, 201
kumar d., mishra k., 2016, procedia computer science, 79, 8
leung h. k., white l., 1989, in proceedings. conference on software maintenance-1989. pp 60–69
levenshtein v. i., 1966, soviet physics doklady, 10, 707
lu j., lin c., wang w., li c., wang h., 2013, pp 373–384, doi:10.1145/2463676.2465313
lubis a. h., ikhwan a., kan p. l. e., 2018, international journal of engineering & technology, 7, 17
malhotra r., jain a., 2012, journal of information processing systems, 8, 241
martin r. c., 2002, agile software development: principles, patterns, and practices. prentice hall
michie d., spiegelhalter d. j., taylor c., et al., 1994, neural and statistical classification, 13, 1
myers g. j., sandler c., badgett t., 2011, the art of software testing. john wiley & sons
n. jorge d., machado p., l. g. alves e., andrade w., 2017, in proceedings of the 24th tools session / 8th brazilian conference on software: theory and practice, doi:10.1109/re.2018.00041
n. jorge d., machado p., l. g. alves e., andrade w., 2018, pp 336–346, doi:10.1109/re.2018.00041
nagappan n., murphy b., basili v., 2008, in 2008 acm/ieee 30th international conference on software engineering. pp 521–530
nogueira s., cartaxo e., torres d., aranha e., marques r., 2007, in 1st brazilian workshop on systematic and automated software testing.
noor t. b., hemmati h., 2015, in 2015 ieee 26th international symposium on software reliability engineering (issre). pp 58–68
okuda t., tanaka e., kasai t., 1976, ieee transactions on computers, 100, 172
oliveira neto f. g., torkar r., machado p. d., 2016, information and software technology, 80, 124
parsons d., susnjak t., lange m., 2014, software quality journal, 22, 717
pinto l. s., sinha s., orso a., 2012, in proceedings of the acm sigsoft 20th international symposium on the foundations of software engineering. p. 33
pressman r., 2005, software engineering: a practitioner's approach, 6 edn. mcgraw-hill, inc., new york, ny, usa
puolitaival o.-p., 2008, adapting model-based testing to agile context: master's thesis. vtt technical research centre of finland
rong x., 2014, arxiv preprint arxiv:1411.2738
runkler t. a., bezdek j. c., 2000, in ninth ieee international conference on fuzzy systems. fuzz-ieee 2000 (cat. no. 00ch37063). pp 636–640
sedano t., ralph p., péraire c., 2017, in 2017 ieee/acm 39th international conference on software engineering (icse). pp 130–140
shepperd m., bowes d., hall t., 2014, ieee transactions on software engineering, 40, 603
silva a. g., andrade w. l., alves e. l., 2018, in proceedings of the iii brazilian symposium on systematic and automated software testing. sast '18. acm, new york, ny, usa, pp 49–56, doi:10.1145/3266003.3266009
sørensen t., 1948, biol. skr., 5, 1
srinivasan k., fisher d., 1995, ieee transactions on software engineering, 21, 126
strug j., strug b., 2012, in ifip international conference on testing software and systems. pp 200–214
sutherland j., sutherland j., 2014, scrum: the art of doing twice the work in half the time. currency
tretmans j., 2008, in formal methods and testing. springer, pp 1–38
utting m., legeard b., 2007, practical model-based testing: a tools approach. morgan kaufmann publishers inc., san francisco, ca, usa
utting m., pretschner a., legeard b., 2012, software testing, verification and reliability, 22, 297
von mayrhauser a., mraz r., walls j., ocken p., 1994, in proceedings 1994 ieee international conference on computer design: vlsi in computers and processors. pp 484–491
wong w. e., horgan j. r., london s., agrawal h., 1997, in proceedings the eighth international symposium on software reliability engineering. pp 264–274
zhang d., tsai j. j., 2003, software quality journal, 11, 87
zhu x., zhou b., hou l., chen j., chen l., 2008, in 2008 the 9th international conference for young computer scientists. pp 1193–1198

journal of software engineering research and development, 2020, 8:9, doi: 10.5753/jserd.2020.731  this work is licensed under a creative commons attribution 4.0 international license.

extraction of test cases procedures from textual use cases: is it worth it?

erick barros dos santos* [ federal university of ceará | erickbarros@great.ufc.br ]
rossana maria de castro andrade† [ federal university of ceará | rossana@great.ufc.br ]
ismayle de sousa santos [ federal university of ceará | ismaylesantos@great.ufc.br ]
lucas simão da costa [ federal university of ceará | lucascosta@great.ufc.br ]
thaís marinho de amorim [ federal university of ceará | thaisamorim@great.ufc.br ]
bruno sabóia aragão‡ [ federal university of ceará | bruno@great.ufc.br ]
danilo reis de vasconcelos [ federal institute of ceará | danilo.reis@ifce.edu.br ]

abstract

software testing plays a major role in software quality once it assures that the software complies with its expected behavior. however, this is an expensive activity and, consequently, companies usually do not perform testing activities on software projects due to the time required. these costs may be even higher in testing processes that rely on manual test execution only, which is both time-consuming and error-prone. one strategy commonly used to mitigate these costs is to use tools to automate testing activities such as test execution, test documentation, and test case generation. this paper presents an experience report in the context of a test factory about the use of a tool that partially automates the specification of test case procedures from textual use cases. this tool automatically retrieves use cases from the requirement management system, generates the test case procedures, requires inputs from the tester, and then sends the test cases to the test management system. this paper details how this tool was used in releases of an industrial software project through a proof of concept. we also performed a feasibility study with four test analysts from different projects to gather more data regarding its efficiency to support the test case documentation.
the results indicate that the tool reduces the test specification time, and that the integration with both requirements and test management systems made our tool feasible in practice.

keywords: software testing, test generation, test factory

1 introduction

software testing has an essential role in software quality assurance, allowing the discovery of bugs beforehand over the product life cycle (myers et al., 2004). however, performing manual testing activities can be time-consuming and error-prone. beyond that, mistakes in these activities (e.g., bad test coverage or errors in testing effort estimation) may contribute to the appearance of test debts, i.e., technical debts related to software testing activities (samarthyam et al., 2017; aragão et al., 2019). aiming to reduce the mistakes and costs related to software testing, many companies have dedicated efforts to automate testing activities, such as the generation of test cases, test execution, and test reports (garousi and mäntylä, 2016). in spite of the advanced research on testing activity automation in academia, the main concern in industry is to improve the effectiveness and efficiency of the tests with the automation and use of techniques that are easy to use (garousi and felderer, 2017). besides that, many software development companies have hired test factory services. one of the advantages of a test factory is that it acts in software testing externally and independently from the development team (andrade et al., 2017).

* master researcher scholarship - sponsored by cnpq (no. 133464/2018-0).
† researcher scholarship - dt level 2, sponsored by cnpq (no. 315543/2018-3).
‡ researcher scholarship - sponsored by fundação de cultura e apoio ao ensino, pesquisa e extensão.

test factories can help to improve the quality of software by reducing the effort of testing activities from the development team. test factories also have teams that work on several domains of systems, which can be allocated to work on different testing projects on demand. software development organizations also have the benefit of outsourcing the selection of the testing team. on the other hand, test factories have to cope with challenges related to the definition of testing processes (aragão et al., 2017) and the automation of test case execution (vieira et al., 2018a). additionally, the tight deadlines of software projects can hinder the process of an external company that offers testing services. thus, it is necessary to research the automation of testing activities. regarding the automation of activities, which is the focus of this paper, the literature still has few experience reports, especially in the context of a test factory. this sort of study is important since it provides evidence that knowledge from the literature can support practitioners. this paper's main objective is to report the experience on test generation from use cases with an automated tool. in our previous work (santos et al., 2019), we presented our first experience report on using a tool for the semi-automatic generation of test procedures based on use cases. the development of this tool was based on existing work in the software testing literature. we also conducted a proof of concept in the context of a test factory to assess the benefits of the tool during the testing process and reported five lessons learned from this experience.
afterwards, we intend to expand the previous report, also focusing on the acquired experience in the automatic generation of tests. this extension consists of improving the tool's functionalities, allowing users to define their own templates to extract data from the textual use cases. another improvement in our tool was the inclusion of business rules in the test case generation process to increase the test coverage. in this context, we plan to answer the following question: "is it feasible to use a tool to generate test cases from textual use cases in the test process within a test factory?". to answer this question, we expanded the efficiency proof of concept with more data regarding real releases of an industrial software project. in this proof of concept, the specification of tests needed 65.38% less time than the manual activity. we also conducted a feasibility study with test analysts from different projects and collected their feedback. all users needed less time to complete the specification task using the tool, but they also reported the need to improve its usability. for instance, the solution generates test procedures with unnecessary extra characters. this paper is organized as follows: section 2 discusses related work. section 3 presents the methodology used for the development and proof of concept of our solution. section 4 describes the environment of the test factory, its team profile, tools, and internal processes. section 5 details the tool developed. section 6 details the proof of concept conducted with our solution in an industrial context. section 7 describes the feasibility study with users that was conducted. section 8 summarizes the lessons learned during this study. finally, section 9 concludes the paper.

2 related work

in the literature, several works deal with the generation of test cases and procedures from use cases. for example, some approaches (nogueira et al., 2019; sneed, 2018) are based on natural language processing (nlp) for the extraction of test cases, and others on the generation of intermediate models to extract the necessary information (some and cheng, 2008; massollar et al., 2012). furthermore, studies (gutiérrez et al., 2015; jorge et al., 2018; massollar et al., 2012; yue et al., 2015) in the literature have performed evaluations and experience reports in the industry about test generation tools and approaches. some and cheng (2008) offer an approach for generating test scenarios based on textual use cases, using a restricted language with tokens for preconditions, flows, steps, and conditional expressions. the first step in the approach consists of extracting information from structured texts to create a state machine called the control flow-based state machine (cfsm), in which transitions represent the steps, and states represent the actions and outputs. use cases included in another use case compose the same cfsm. at the end of the generation process, a global cfsm is generated to link all use cases, which is traversed to generate the test scenarios. the paths in the model represent scenarios that can be generated with different coverage criteria.
we use a similar concept in this paper to generate the test procedures, also requiring manual intervention to create the final tests. another similarity is that we generate the scenarios by paths in the flows, but without generating intermediate models and with simplified selection criteria. massollar et al. (2012) present an automated model-based approach for generating test cases. the approach consists of specifying the use cases using specific patterns so that they are converted into uml1 activity diagrams to represent the system's behavior. the goal of the activity diagram is twofold: to check whether the use cases have been specified correctly and to assist test model generation. this test model is the basis for the generation of procedures and test cases, in a way that the test analyst must manually identify and insert the necessary data to generate the test cases. this paper also presents an evaluation of the tool that is carried out with two software engineers and a group of students. the authors discuss the data related to the specification time and the model verification, but with low emphasis on the test generation. gutiérrez et al. (2015) present a model-based approach for test case generation which focuses on the use of metamodels to increase the generalization of the solution with different approaches. their solution uses metamodels to model use cases and test elements, thus making transformations in the models until the test cases can be obtained. this work presents three industrial use cases, one of them in an agile context, and it also summarizes the lessons learned. even with the introduction of extra models and their respective transformations, the authors reported effort reductions with the use of the proposed tools. however, not much information is provided about the approach's effort in the agile environment. yue et al. (2015) present rtcm, an approach for the textual specification of test cases through similar elements from use cases. this approach provides some predefined patterns for test specification and a tool called atoucan4test, whose primary goal is to assist the whole generation process of manual and automated test cases. to analyze the feasibility of the solution, the authors present the use of the tool in two industrial case studies in the domain of cyber-physical systems. the assessment is focused on automated test script generation, in which the authors report a significant reduction in the implementation effort. finally, the authors present the lessons learned from the process. jorge et al. (2018) propose claret, a domain-specific language that allows use case creation using structured natural language, and test cases. they also present a supporting tool that allows the specification and validation of use cases, and also converts them to labeled transition systems to create the test cases. this work describes industrial case studies in an agile environment, where the software engineers write use cases using claret and generate test cases by using the developed tool. they also present the lessons learned and results on the effectiveness of the solution. sneed (2018) reports his experience in the industry with the semi-automatic generation of tests. the generation approach consists of extracting information through natural language processing, using either requirements in plain text or use cases enriched by keywords.
the expressions of the text are compared with grammars to identify actions, states, and business rules that serve as the basis for test conditions. finally, the tester must change the test conditions to insert inputs and outputs. the author reports experience in four industrial projects, summarizing data related to effort. this work is similar to ours, mainly in the use of keywords to ease the extraction of information, though little information is provided on how these tests can be changed through the tool. additionally, the integration of the tool with other systems that support the testing process is not presented.

nogueira et al. (2019) propose an approach for the automatic generation of tests from use cases. for this purpose, the authors propose the use of a controlled natural language. the first step consists of modeling use cases through the language, which allows the declaration of system interactions, entries, and conditional expressions. after that, the specification is converted into csp models, so that the variables and data types are converted to the formalism. in the third step, the analyst specifies the testing purposes that will guide test generation. finally, the generation is performed using an lts model, where the traces represent test scenarios and the specified domain is used to create the tests. the authors reported the implementation of a tool that abstracts the formalism of the approach for testers. among the similarities with our work, it is possible to highlight the use of use cases from partners in the industry. however, the tool usage by test analysts is not presented.

the main goal of our paper is to report the experience of the automatic generation of test procedures from textual use cases. to accomplish this, we implemented a tool that fulfills the needs of a particular agile project. table 1 summarizes a comparison of our work with the related work presented in this section. it is possible to verify that most studies have focused on the generation of test cases rather than test procedures. however, these approaches impose the additional cost of formal models to increase efficacy (massollar et al., 2012; yue et al., 2015; gutiérrez et al., 2015). to analyze and extract the use cases, we used predefined structures in the use cases without restricting their specification with a syntax grammar (nogueira et al., 2019; sneed, 2018), since the latter would require changes in all use cases of the project's documentation.

table 1. comparison of related work.

work | generation goal | artifacts for generation | tool-supported | industrial evaluation
some and cheng (2008) | test scenarios | textual template and models | yes | no
massollar et al. (2012) | test procedures | textual template and models | yes | yes
gutiérrez et al. (2015) | test cases | models | yes | yes
yue et al. (2015) | test cases | textual template and models | yes | yes
jorge et al. (2018) | test cases | textual template | yes | yes
sneed (2018) | test cases | textual template | yes | yes
nogueira et al. (2019) | test cases | textual template and models | yes | no
current work | test procedures | textual template | yes | yes

the works most related to ours are those by some and cheng (2008) and massollar et al. (2012). in our tool, we use a concept similar to the scenarios presented by the aforementioned authors to create the procedures, linking the use case flows through references in steps.
the main difference is that we use a simpler representation of use cases that does not rely on models or formal languages. this may reduce efficacy during test generation; on the other hand, it offers a more practical solution with less specification effort. thus, the solution can be considered an initial approach for test automation that integrates with the other systems of testing projects. moreover, our paper discusses the results of effort metrics collected in the context of a test factory and presents feasibility study results with users.

3 research method

to guide the execution of the original study (santos et al., 2019), we used a methodology with steps based on the technology transfer model proposed by gorschek et al. (2006). this model favors cooperation between academia and industry and can be beneficial for both: it allows researchers to study relevant industry issues and validate their results in a real environment. the methodology used in this paper has five steps. in this paper, we improved the solution of step 2 and the proof of concept of step 3. we also added step 5 to perform the evaluation with users. the steps are described below.

step 1 - identifying potential improvement areas based on industry requirements: in this step, we performed observations on the test activities of a real test project (see section 4). to assist information gathering, the involved researchers asked the test team about their needs regarding the testing process. we identified improvement issues related to the execution of the test specification process. after that, test analysts of the project were interviewed to gather more details about how the test activities could be executed to reduce the effort (calculated in man-hours). as a result, we identified the requirements of an automation tool for supporting the test analysis and specification.

the requirement for the solution is that it should not cause too many modifications in the artifacts (e.g., use case and test case templates) of the process. this solution must also be as practical as possible to reduce the effort related to formal specifications. most of the solutions presented in section 2 require the introduction of additional models or the modification of the use case templates. we also could not find an automated solution that fulfills the project's needs, so a customized solution had to be implemented.

table 2. metrics used during the proof of concept.

group | metric | formula | interpretation
test effort | efficiency of the test specification | amount of specified test cases / total time | higher is better
test coverage | requirements coverage | amount of covered use case flows / total amount of use case flows | higher is better
test coverage | test cases per requirement | total amount of test cases per requirement | n/a
test effort economy | effort variance | [(real effort - estimated effort) / estimated effort] * 100 | (i) when the effort variance is positive, extra time (effort) is necessary to complete the planned work; (ii) when the effort variance is negative, less time (effort) is necessary to complete the planned work; (iii) when the effort variance is zero, the work takes exactly the estimated effort
n/a | total amount of test cases | total amount of test cases | n/a
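to make the effort variance metric from table 2 concrete, the sketch below computes it for a few illustrative values; the function name and the numbers are ours, not project data.

```typescript
// effort variance as defined in table 2:
// [(real effort - estimated effort) / estimated effort] * 100
function effortVariance(realEffort: number, estimatedEffort: number): number {
  return ((realEffort - estimatedEffort) / estimatedEffort) * 100;
}

console.log(effortVariance(12, 10)); //  20 -> extra time was needed
console.log(effortVariance(8, 10));  // -20 -> less time was needed
console.log(effortVariance(10, 10)); //   0 -> work took the estimated effort
```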
step 2 - solution design: after the previous step, we started the elaboration of a solution. the goal was to use practices and concepts from the literature that would best fit the requirements of the industrial project. in order to do so, we reviewed some solutions presented in the literature that could help in the development of a customized solution. as described in section 2, many approaches are supported by additional models to increase the effectiveness of the generated tests. however, we chose to avoid model-dependent approaches, since the objective was an easy-to-implement solution that does not require the manipulation of an additional formalism and that would not affect the sprints of the project. as the use cases of the industrial project were specified in portuguese, we also chose not to use nlp-based approaches, since the solutions found are designed for the english language. additionally, we did not intend to use a fixed syntax, to avoid impacts on the specification of use cases. the latter was necessary because the project employs a use case template in the requirements elicitation that should not be changed.

among the tools in the literature that propose test generation, specmate (freudenstein et al., 2018) is one of the current tools with the most features. although it still depends on additional models for test generation, its procedure specification and test data insertion process is straightforward and does not depend on additional models. we followed similar interactions to build the user interface of our solution. our steps to generate the test scenarios draw similarities with the work of some and cheng (2008), but with simpler coverage criteria. given the considerations mentioned above, we decided to develop a tool that would partially automate the specification process of test procedures, so that test analysts could have more control over the test specification. this tool should also receive input data to create test cases.

step 3 - performing a proof of concept: to perform an initial evaluation of the tool, a proof of concept was conducted in one sprint of the same project used as the context to build the solution; it started in june 2019 and finished in july 2019. this proof of concept was conducted by one of the researchers and assisted by the test leader of the software project. aiming to analyze the impact of the tool on the testing team's work, metrics related to the number of test cases, requirements coverage, effort, and variance were collected. these metrics are part of the test factory process (de castro andrade et al., 2017) and are based on articles available in white literature (seela and yackel) and academic work (lazic and mastorakis, 2008). table 2 summarizes the metrics used and their respective formulas. based on the pilot results, which were presented in our previous paper (santos et al., 2019), we obtained initial data about the feasibility of the tool and its benefits. we also identified some failures in the tool, which were fixed before the tool was deployed. the results of the proof of concept are presented in section 6.

step 4 - solution deployment: in this step, the elaborated solution is deployed in the project for use. in our case, we deployed the tool in our industrial project and used it during some releases of the referred project. the data collected during this usage is presented in section 6.
step 5 - carrying out a feasibility study with other professionals: after the deployment of the solution, we performed a feasibility study with professionals from the testing area of other software projects. the main objective of this evaluation was to obtain data about the efficiency of the tool with professionals from different contexts who have had experience in the specification of tests based on use cases. the results of this feasibility study are presented in section 7.

4 proof of concept environment

in this section, we present the environment in which the solution was created and the proof of concept of section 6 was conducted. we performed the proof of concept in a research, development, and innovation project concerned with the requirement elicitation and software testing of software a (the system name was omitted due to a confidentiality agreement). this software aims to manage a passive optical network. this project can be considered distributed, since the client, the development team, and the requirement/test team belong to different institutions and work in different locations. the team responsible for software a's requirements/tests uses the scrum (schwaber and beedle, 2002) framework with month-long sprints. this environment was the basis to build the tool presented in section 5. it was also used to conduct the proof of concept presented in section 6.

subsection 4.1 presents the testing team's profile. subsection 4.2 describes the tools and patterns adopted in the project. subsections 4.3 and 4.4 detail, respectively, the requirement and the test process used. these processes were elaborated based on the previous experiences (aragão et al., 2017; vieira et al., 2018b) of the great (http://www.great.ufc.br) test factory in test projects.

4.1 team profile

the test factory team involved in the software a project is composed of a test manager, one requirement analyst, two test analysts, one trainee (tester), and one researcher. among the members, only one of the analysts and the trainee executed test cases. the analyst has fourteen months of experience in requirement elicitation and eighteen months in test execution. the trainee has sixteen months of experience in requirement elicitation and test execution. both of them have fourteen months of experience in requirement elicitation and test activities in the software a project. the requirement/test team performed both the requirement and test activities, in which the use cases are the basis for the test case specification. in addition to the tests based on use cases, the team also conducted exploratory tests during the execution of the sprints. the team's deep knowledge of software a's requirements allowed the test analysts to generate more concise test documentation, thus providing more agility during the process. therefore, the analysts executed the tests based on the documents and their own experience. however, concise test documentation can also be costly to create and maintain, especially in a project with many requirement changes and a fixed release date.

4.2 tools and patterns of the internal processes

to guide the activities of the requirement and test processes (see sections 4.3 and 4.4), the testing team used the following tools: jira (https://www.atlassian.com/software/jira), for use case and task management;
confluence (https://www.atlassian.com/software/confluence), for business rules and general documentation; testlink (http://www.testlink.org), for test plan and test case management; and the browsers google chrome (https://www.google.com/chrome) and firefox (https://www.mozilla.org), for test case execution.

since the beginning of the software a project, the stakeholders decided to perform the requirement specification using well-defined templates, aiming to improve the understandability for all stakeholders. therefore, we used special symbols that ease the identification of elements in the use case steps.

figure 1. example of the patterns used for use case specification.

figure 1 shows a fictitious example of a use case to edit a registered user, where the basic flow starts at the tag [basic flow]. likewise, in step 3 of the basic flow, the input fields are identified by double quotation marks. in step 4, the clickable visual elements are written between <> symbols. the use case also has information about the use case's goal, related mockups, preconditions, and the acceptance criteria. the latter refers to the flows that should not have any critical bug so that the use case implementation can be considered "done".

4.3 requirements process

in order to organize the requirement engineering tasks, we followed a requirement process with the following activities: elicitation, analysis, specification, and validation. according to wiegers and beatty (2013), these steps are essential to requirement engineering in a software project. during the implementation of the project, the analysts performed the requirements activities in such a way that their outputs could be used as input to the sprint backlog. although the project had agile characteristics, the client requested detailed documentation for software a because of its complex features. thus, we specified the requirements through textual use cases. each activity is presented as follows.

1. elicitation: this step aims to identify the system requirements by consulting the stakeholders. in this process, the team interviews the stakeholders and elaborates usage scenarios with interface prototyping using the balsamiq tool (https://balsamiq.com/wireframes/).

2. analysis: this step is responsible for verifying the consistency, completeness, and viability of the previously elicited requirements. hence, the stakeholders prioritize the requirements, aiming to identify which ones have a faster and higher return on investment for the client and the final customer.

3. specification: in this step, the analysts document the requirements as use cases and communicate them to the team. the system business rules are also documented.

4. validation: in this step, the analysts assure that the requirements have an acceptable description and can be sent to development. in this paper, this step also involves the creation and validation of high-fidelity prototypes.

4.4 testing process

the testing process provides real feedback on the behavior of the software (bertolino, 2007), and the organization of its activities allows them to be communicated, monitored, and improved (mette and hass, 2008). processes can vary depending on the institution; still, there are generic processes (mette and hass, 2008; iso/iec 29119-2, 2013) that can be adapted for the organization's purposes.
great's test factory project (de castro andrade et al., 2017) is based on mps.br (montoni et al., 2009) and has three steps: (i) planning; (ii) elaboration; and (iii) execution. in the context of this project, the requirement analysts send the documents, specified as described in section 4.3, to the test activities, so that the specification and execution of the tests are performed before the system is released. we present a brief description of the main activities of this process as follows:

1. planning: this step consists of verifying the test goals and performing the required actions to transform the test strategy into an operational plan.

2. specification: this step aims to elaborate tests to meet the demands of the test plan. this also includes the specification of automated test scripts, when necessary.

3. execution: at last, the final step relates to executing the tests and storing the results. in this step, the test analysts must verify the test incidents. finally, the analysts generate a test report and send it to the client along with the lessons learned.

it is worth noting that, during the whole process, the test team controlled and kept track of the activities, allowing them to make some improvements in the next process execution.

5 tool for semi-automatic generation of test procedures

in our previous work (santos et al., 2019), we introduced the tool used to generate tests from use cases in the context of the software a project. in this paper, we refer to this tool as uc2proc. this tool was mainly developed by one researcher and one test analyst. regarding the improvements of the tool (see section 5.1), another test analyst was responsible for implementing additional features. the features of the uc2proc tool comprise processing structured textual use cases from jira, the generation of test procedures from the use case flows, the editing of the extracted test procedures, the addition of input data to create test cases, and, finally, sending the generated test cases to testlink. for the current study, our tool was improved based on the pilot results. all the tool features and their improvements are detailed in subsection 5.1. in subsection 5.2, we present its package diagram, and in subsection 5.3 we present some interfaces and how the tool works.

5.1 tool features

the features of uc2proc are described as follows:

(1) integration with jira. in our tool, the test analyst can search for the identifier of a use case issue from the jira system. next, regular expressions are used to extract the information from the textual use cases, processing the following elements: objective, preconditions, flows, steps, references to other flows, and data entries in the steps. this operation only works correctly if the use cases are strictly specified according to the patterns configured in the tool, as illustrated by the sketch below.
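for illustration, a minimal sketch of this kind of pattern-based extraction is shown below, assuming the template of figure 1; the regular expressions and names are ours, not the exact patterns configured in uc2proc.

```typescript
// illustrative patterns, assuming the template of figure 1
const FLOW_TAG = /^\[(basic flow|af-\d{2}|ef-\d{2})\]/i; // flow headers, e.g., [af-01]
const INPUT_FIELD = /"([^"]+)"/g;                        // input fields in double quotes
const CLICKABLE = /<([^>]+)>/g;                          // clickable elements, e.g., <save>

// extracts the data entries and clickable elements of a single step
function extractStepElements(step: string) {
  return {
    inputs: [...step.matchAll(INPUT_FIELD)].map((m) => m[1]),
    clickables: [...step.matchAll(CLICKABLE)].map((m) => m[1]),
  };
}

console.log(FLOW_TAG.test("[af-01] admin checks temporary field")); // true
console.log(extractStepElements('the admin fills the "name" field and clicks on <save>'));
// -> { inputs: ["name"], clickables: ["save"] }
```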
(2) test procedure generation. the information extracted from the textual use cases is used to generate test procedures. to achieve this, uc2proc first creates test scenarios that are composed of flow sequences to be visited in the use cases. to accomplish this task, we used an approach similar to the one presented by some and cheng (2008). then, we implemented an algorithm that generates state sequences starting in the basic flow and visits all alternative/exception flows that depart from it. thus, starting from each of the alternative/exception flows, all their respective steps are analyzed and the paths to other flows are visited. these scenarios of flow sequences are used to compile the input steps and variables that will compose the test procedures.

in the current version of our tool, we also added a new function that identifies the business rules referenced in the use case. the tool then creates two tests for each rule: one with the purpose of validating it and the other aiming at verifying the violation of the rule. the user then visualizes the reference and manually fills in the steps of the test procedure. the coverage criteria, although simple, allow generating scenarios that go through all flows and some transitions. however, a deep search over all state machine paths is not performed, as it could generate scenarios with too many flows. the intention of this functionality is to generate test procedures similar to those that the test analysts manually create in the project.

algorithm 1 details the process to generate the scenarios for each use case. lines 1 to 5 declare the necessary variables, where testscenarios is the list of scenarios with the use case flows, currentpath is an auxiliary variable, and testprocedures is the final list of test procedures. the first step is to create a scenario for the basic flow. then, the algorithm iterates over each step of the basic flow: from line 8 to line 11, a new scenario is created from the basic flow to each flow that is called in the basic flow steps. next, the lines of each alternative/exception flow are analyzed, and one scenario is created starting from it to each new flow reference. in summary, scenarios are generated by exploring a maximum of one level from each flow. after creating the test scenarios, the algorithm iterates over each scenario, creating a procedure containing the title, goal, and preconditions from the scenario. then, the algorithm iterates over each business rule from the use case and creates a test procedure containing the title of the rule and blank steps and outputs, which the user must fill in manually. at last, the algorithm returns the list of generated test procedures.

algorithm 1: generation of test procedures
result: list of test procedures
1  usecase ← use case to be processed;
2  businessrules ← list of business rules in usecase;
3  testscenarios ← ∅;
4  currentpath ← ∅;
5  testprocedures ← ∅;
6  comment: creation of scenarios
7  testscenarios ← testscenarios ∪ {basic flow of usecase};
8  for each flow referenced in the basic flow of usecase do
9      currentpath ← basic flow ∪ flow;
10     testscenarios ← testscenarios ∪ currentpath;
11 end
12 for each remaining flow in usecase do
13     currentpath ← reconstruct path from the basic flow to the current flow;
14     for each flow reference in the current flow do
15         testscenarios ← testscenarios ∪ (currentpath ∪ flow reference);
16     end
17 end
18 comment: creation of test procedures
19 for each scenario in testscenarios do
20     procedure ← create test procedure with title, goal, and preconditions from scenario;
21     for each flow reference in scenario do
22         add user steps to procedure;
23         add system steps to procedure;
24     end
25     testprocedures ← testprocedures ∪ procedure;
26 end
27 comment: creation of procedures for business rules
28 for each rule in businessrules do
29     procedure ← create new procedure with the title of rule (the steps must be created by the test analyst);
30     testprocedures ← testprocedures ∪ procedure;
31 end
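as a companion to algorithm 1, the sketch below implements its scenario-creation and procedure-compilation steps in typescript; the types and names are illustrative simplifications of ours, not the tool's actual code.

```typescript
interface Flow {
  id: string;            // e.g., "basic flow", "af-01", "ef-01"
  userSteps: string[];   // steps performed by the actor ("the admin ...")
  systemSteps: string[]; // steps answered by the system ("the system ...")
  references: string[];  // ids of the flows referenced in the steps
}

interface UseCase {
  title: string;
  flows: Flow[];          // flows[0] is assumed to be the basic flow
  businessRules: string[];
}

// scenarios are sequences of flow ids, exploring at most one level per flow
function generateScenarios(uc: UseCase): string[][] {
  const basic = uc.flows[0];
  const scenarios: string[][] = [[basic.id]];
  // lines 8-11: one scenario from the basic flow to each flow it references
  for (const ref of basic.references) scenarios.push([basic.id, ref]);
  // lines 12-17: one scenario per reference found in the remaining flows
  for (const flow of uc.flows.slice(1)) {
    for (const ref of flow.references) scenarios.push([basic.id, flow.id, ref]);
  }
  return scenarios;
}

// lines 19-31: one procedure per scenario, plus a blank one per business rule
function generateProcedures(uc: UseCase) {
  const byId = new Map(uc.flows.map((f): [string, Flow] => [f.id, f]));
  const fromScenarios = generateScenarios(uc).map((scenario) => ({
    title: `${scenario[scenario.length - 1]} - ${uc.title}`,
    actions: scenario.flatMap((id) => byId.get(id)?.userSteps ?? []),
    outputs: scenario.flatMap((id) => byId.get(id)?.systemSteps ?? []),
  }));
  const fromRules = uc.businessRules.map((rule) => ({
    title: `validates ${rule}`, // the analyst fills the steps manually
    actions: [] as string[],
    outputs: [] as string[],
  }));
  return [...fromScenarios, ...fromRules];
}
```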
(3) test procedure management. the developed tool allows the test analyst to add, edit, and delete a test procedure, as well as to manage the steps within a procedure. our tool also extracts and displays to the user the inputs listed in the steps of each procedure. these inputs can be added, edited, or removed when editing the steps, but it is not possible to automatically extract references to the screen mockups and actors of the use case.

(4) test data insertion. after generating the test procedures, the test analyst can insert the test data and generate instances of test cases with different input data.

(5) integration with testlink. after the generation of the test procedures and the addition of the data to generate test cases, the tool sends the test suite to testlink. for this, the test analyst must configure the name of the test project into which the generated test cases must be uploaded.

(6) template customization. in our previous work (santos et al., 2019), the tool was limited to a fixed use case template. in the present work, we added new functionality that allows the customization of the patterns used to detect elements of textual use cases. it is worth noting that the general structure of the use cases is fixed and must be followed; however, users can create their own regular expressions using a form in the tool. the major advantage of this functionality is that it makes the tool more customizable, allowing it to adapt to the patterns of different projects or organizations.

5.2 package diagram

the uc2proc tool was developed as a web app using the ionic framework (https://ionicframework.com/) and the javascript programming language (https://www.javascript.com/). we also used the jira and testlink apis to allow communication with these services.

figure 2. package diagram of the developed uc2proc.

figure 2 presents a uml package diagram representing an overview of the tool's architecture. this figure highlights the main modules of the tool: issue, use case, scenario, test data, test case, and testlink.

the issue module manages the issues received from the jira api and uses the defined regular expressions to extract the necessary elements from each issue. these elements are used to instantiate the objects (e.g., basic flow, business rules) used in the test generation process. the use case module receives the issue extracted by the previous module.
next, it instantiates a use case object based on the information received, in such a way that each flow contains its steps, the flow/event that triggers it, and the flows that can be reached from it. after that, the scenario module generates the usage scenarios from the flows labeled in the use case module. the generation process follows the algorithm presented in subsection 5.1. the test data module handles the manual input of test data in the test scenarios. finally, the test case module generates the test cases that will compose the test suite and uses the testlink module to send them to the test management tool.

5.3 tool's usage

this section presents an example of the use of our tool, given the use case presented in figure 1. at first, to use the tool, the user must configure the integration with the external management systems, detailed as follows (a minimal configuration sketch follows the list):

• jira: the test analyst must provide the url of the jira api, the username, and the api key of an account. this information can be obtained from the security page of the jira user account; and

• testlink: the test analyst must provide the url of the testlink api, which can be obtained from the analyst responsible for the maintenance of testlink in the organization, and the api key, which can be found on the user page in testlink.
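as a sketch, the settings above could be grouped as follows; the field names and example urls are ours, not uc2proc's actual configuration schema.

```typescript
// illustrative configuration shape for the two integrations described above
interface IntegrationConfig {
  jira: {
    apiUrl: string;   // url of the jira api
    username: string; // jira account username
    apiKey: string;   // api key from the account's security page
  };
  testlink: {
    apiUrl: string; // url of the testlink api
    apiKey: string; // api key from the user page in testlink
  };
}

const config: IntegrationConfig = {
  jira: {
    apiUrl: "https://example.atlassian.net/rest/api/2",
    username: "test.analyst",
    apiKey: "<jira-api-key>",
  },
  testlink: {
    apiUrl: "https://testlink.example.org/lib/api/xmlrpc/v1/xmlrpc.php",
    apiKey: "<testlink-api-key>",
  },
};
```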
figure 3. tool's issue searching.

once the authentication information is configured, the user starts by searching for the use cases. in order to do so, the user should type the use case issue id from jira in the "search" field. the system then displays the search results and allows the found issues to be added through the "+" button, as shown in figure 3. next, the user must click on the right arrow button, and the system runs algorithm 1. hence, it displays the generated scenarios for each flow and business rule of the use case. assuming the structure of the use case presented in figure 1, the expected result is one test procedure for each flow (basic flow, af-01, af-02, ef-01) and two additional procedures for the business rules.

figure 4. example of scenario generation.

table 3 presents the test procedures generated, with the following fields: (i) scenario, the test scenario generated; (ii) title, which is the test procedure title; and (iii) actions and outputs, which represent the steps and expected results through the use case flow (between "[]") and the step numbers. for instance, the first row of table 3 represents the basic flow scenario, which takes steps 1, 3, 4, and 6 (the steps that contain "the admin") from the use case as test actions, and steps 2 and 5 (the steps that contain "the system") as the expected results.

table 3. example of test procedures generated by the tool.

id | scenario | title | actions | outputs
1 | [basic flow] | [basic flow] - edit user | [basic flow] 1,3,4,6 | [basic flow] 2,5
2 | [basic flow],[af-01],[basic flow] | [af-01] - admin checks temporary field | [basic flow] 1; [af-01] 2,4; [basic flow] 4,6 | [basic flow] 2; [af-01] 3; [basic flow] 5
3 | [basic flow],[af-02],[basic flow] | [af-02] - admin clicks on cancel | [basic flow] 1,3; [af-02] 2; [basic flow] 6 | [basic flow] 2,5
4 | [basic flow],[ef-01] | [ef-01] - email already registered | [basic flow] 1,3; [ef-01] 2; [basic flow] 6 | [basic flow] 2; [ef-01] 4
5 | bn001.01 | validates bn001.01 | blank | blank
6 | bn003.02 | validates bn003.02 | blank | blank

figure 5. test cases exported to testlink.
figure 6. example of the testlink integration pop-up.

thus, the user can add, remove, or modify test cases or test data using the fields shown in figure 4. after editing the test procedures and generating the test cases, the user must click on the "send to testlink" button and choose the name of the project and test suite in testlink, as shown in figure 6. the tool then creates a new test suite in the chosen project and exports all current test procedures as test cases to testlink. an example of the test cases generated from the use case shown in figure 1 is depicted in figure 5.

6 proof of concept

in this section, we detail the execution of the study after the deployment of the solution described in section 3. the proof of concept was conducted in the environment presented in section 4. subsection 6.1 describes the steps used to perform the evaluation. subsection 6.2 summarizes and discusses the results related to the test effort.

6.1 proof of concept steps

we evaluated the tool in three steps: (1) selection of use cases, which produces a list of requirements without previous test cases; (2) effort estimation, where the analyst evaluates the time to complete the task based on their experience; and (3) automatic generation of tests, comprising the actual use of the developed solution. these steps were performed by one of the researchers and the test analyst of the project and are described next.

figure 7. process for the semi-automatic specification of test procedures.
figure 8. subprocess to adjust the test procedures.

1. use case selection. the first step of the proof of concept was the selection of the use cases. to prevent the bias associated with the analyst's knowledge, we selected use cases that had not been analyzed and specified before. taking into account that the textual patterns of the project were already applied to the documents, the analysts did not edit the use cases. to present results as close as possible to the original context of the software a project, we selected use cases from a real release of the software under development. considering the aforementioned conditions, the team used all the artifacts produced during the proof of concept in the real release.

2. effort estimation. the test analyst estimated the manual effort to specify the tests in minutes. the analyst calculated the effort with a metric defined by the following equation: estimated minutes = 3 * (number of flows + number of business rules) + w. the number three is a multiplier factor representing the analyst's effort, in minutes, to specify the tests for a use case with n flows and n business rules. additionally, the metric includes a weight w that adds extra time based on the analyst's perception. the cases where w equals -1 refer to flows with many repeated steps, which require a lower test specification effort; the greater the analyst's inexperience with the functionality under test, the greater the value of the weight w. the team of the software a project created the metric to fulfill its need for manual effort estimation, considering margins of error based on the analyst's opinion. the whole equation is based on recent experience gained in the project. although the metric is ad hoc and not validated in controlled experiments or case studies, the results were accurate enough in our previous evaluations (santos et al., 2019). a worked example is sketched below.
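for illustration, the sketch below applies the metric to two entries of table 4 (shown in the next subsection); the function name is ours.

```typescript
// estimated minutes = 3 * (number of flows + number of business rules) + w
function estimateMinutes(flows: number, businessRules: number, w: number): number {
  return 3 * (flows + businessRules) + w;
}

console.log(estimateMinutes(1, 0, 2));  //  5 -> row 1 of table 4 (00:05:00)
console.log(estimateMinutes(9, 1, 10)); // 40 -> row 15 of table 4 (00:40:00)
```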
3. generation of test procedures. the test analyst must perform the test case generation based on the use cases. the test analyst carries out this process using the proposed tool and testlink, aiming to compare the time required to complete the task with the effort estimation results. it is worth noting that the test analyst did not specify the test cases manually during the proof of concept. to use the tool and generate the test cases, the test analyst performed the activities following the process illustrated in figures 7 and 8. the details of each activity are presented as follows.

1. select sprint's use cases from jira: this activity receives as input the list of use cases selected from the backlog at the beginning of the proof of concept. at this point, the test analyst must select the use cases for the automatic generation of test procedures.

2. verify compliance of use cases with the organization's patterns: in this activity, the test analyst checks the compliance of the use cases from the previous activity with the organization's patterns. the focus of this activity is also to analyze whether the use cases follow the patterns configured in the automated tool. the test analyst must record any non-conformity in the corresponding jira task.

3. use case update: in this sub-process, use cases that do not follow the pattern configured in the tool must be updated by the requirements analyst. this activity receives as input the list of incorrect use cases. in the end, the requirements analyst must produce updated use cases according to the established patterns.

4. perform automatic specification of test procedures: this activity contemplates the use of the tool to perform the automatic generation of test procedures. in order to do so, the test analyst must have access to the verified use cases.
at the end of the process, a set of test procedures must be generated.

5. analyze the generated test procedures: after generating the procedures, the test analyst performs an analysis of the generated steps, assuring the correct extraction of steps and outputs from the use cases. if errors are found in the procedures, the team records the occurrences in jira.

6. adjust the set of test procedures: this sub-process consists of adjusting the set of test procedures when there are non-conformities, and the test analyst is responsible for performing the adjustments. the sub-process receives the list of test procedures for adjustment and produces as output the adjusted set.

7. generation of test cases: after analyzing the procedures and performing the necessary adjustments, the test analyst provides the test data to generate the test cases. the test analyst repeats this action until the desired coverage is obtained. the activity receives as input a set of test procedures and must produce as output a suite of test cases.

8. send test cases to testlink: the last activity of the process is to send the test cases to testlink. the activity receives the test case suite as input and automatically sends it to the testlink tool. after sending the test cases to testlink, the test analyst must perform any additional updates directly in testlink. if more test cases are generated for a test suite already present in testlink, the test analyst must also add them manually.

6.2 metrics results and discussion

during the processing of the use cases, the test analyst manually collected the times obtained at the end of each activity. hence, the only instrument used in this step was a spreadsheet with the same fields as table 4.

table 4. tool results for the selected use cases.

id | uc flows | business rules | uc steps | template correction time | adjust time | testlink correction time | weight | estimated time | actual time | test cases generated
1 | 1 | 0 | 4 | 00:01:14 | 00:00:32 | 00:00:10 | 2 | 00:05:00 | 00:01:56 | 1
2 | 3 | 0 | 11 | 00:01:11 | 00:00:44 | 00:00:43 | 3 | 00:12:00 | 00:02:38 | 3
3 | 1 | 0 | 4 | 00:01:02 | 00:00:18 | 00:00:25 | 2 | 00:05:00 | 00:01:45 | 1
4 | 3 | 1 | 32 | 00:02:41 | 00:02:28 | 00:00:56 | 3 | 00:15:00 | 00:06:05 | 5
5 | 3 | 0 | 16 | 00:01:27 | 00:00:41 | 00:00:41 | 3 | 00:12:00 | 00:02:49 | 3
6 | 6 | 2 | 34 | 00:01:45 | 00:07:21 | 00:00:33 | 2 | 00:26:00 | 00:09:39 | 6
7 | 3 | 0 | 17 | 00:03:12 | 00:00:54 | 00:00:25 | 3 | 00:12:00 | 00:04:31 | 3
8 | 3 | 0 | 18 | 00:03:45 | 00:01:01 | 00:00:45 | 3 | 00:12:00 | 00:05:31 | 3
9 | 3 | 2 | 17 | 00:02:12 | 00:07:29 | 00:00:50 | 0 | 00:15:00 | 00:10:31 | 7
10 | 3 | 2 | 16 | 00:01:26 | 00:02:35 | 00:00:50 | 0 | 00:15:00 | 00:04:51 | 7
11 | 3 | 1 | 12 | 00:02:53 | 00:02:33 | 00:00:39 | 2 | 00:14:00 | 00:06:05 | 5
12 | 4 | 3 | 25 | 00:01:31 | 00:05:30 | 00:01:42 | -1 | 00:20:00 | 00:08:43 | 13
13 | 2 | 1 | 13 | 00:04:47 | 00:01:58 | 00:00:26 | 0 | 00:09:00 | 00:07:11 | 5
14 | 2 | 1 | 10 | 00:02:08 | 00:00:40 | 00:00:26 | 0 | 00:09:00 | 00:03:14 | 2
15 | 9 | 1 | 40 | 00:02:03 | 00:01:53 | 00:00:41 | 10 | 00:40:00 | 00:04:37 | 13
16 | 8 | 3 | 54 | 00:03:58 | 00:07:00 | 00:01:06 | 2 | 00:35:00 | 00:12:04 | 11
17 | 4 | 2 | 24 | 00:02:46 | 00:01:37 | 00:01:00 | 0 | 00:18:00 | 00:05:23 | 6
18 | 6 | 1 | 27 | 00:02:00 | 00:03:06 | 00:00:44 | 4 | 00:25:00 | 00:05:50 | 9
19 | 1 | 1 | 7 | 00:00:57 | 00:01:16 | 00:00:22 | -1 | 00:05:00 | 00:02:35 | 2
20 | 5 | 5 | 30 | 00:00:43 | 00:04:52 | 00:01:11 | 0 | 00:30:00 | 00:06:46 | 10
21 | 4 | 2 | 24 | 00:02:00 | 00:09:55 | 00:01:16 | 0 | 00:18:00 | 00:13:11 | 7
22 | 9 | 3 | 47 | 00:05:02 | 00:05:07 | 00:00:30 | 4 | 00:40:00 | 00:10:39 | 12
23 | 6 | 1 | 26 | 00:01:31 | 00:04:49 | 00:00:30 | 4 | 00:25:00 | 00:06:50 | 8
24 | 8 | 3 | 47 | 00:06:00 | 00:04:00 | 00:01:21 | 2 | 00:35:00 | 00:11:21 | 12
total | 100 | 35 | 555 | 00:58:14 | 01:18:19 | 00:18:12 | 47 | 07:27:00 | 02:34:45 | 154

table 4 shows the results of the tool usage. in this table, we detail the main use case information, such as the number of flows, steps, and business rules, and the estimated modeling time of each use case. the table also presents the time spent to correct the non-conformities to the patterns in the use case, the time required to edit the test suite (e.g., the addition of business rule details), and, finally, the time necessary to adjust the test cases to be exported to testlink. in total, 24 use cases were used, distributed among five sprints, totaling 100 flows and 555 steps.

as shown in table 4, the use of the tool in the sprints yielded 154 test cases, taking, in total, 2 hours and 34 minutes to be carried out. this represents a reduction of approximately 65.38% in effort compared to the 7 hours and 27 minutes of estimated manual effort, as the check below illustrates.
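as a sanity check on the reported totals, the snippet below derives the 65.38% figure from the estimated and actual times of table 4; the helper function is ours.

```typescript
// converts an "hh:mm:ss" string to minutes
const toMinutes = (hms: string): number => {
  const [h, m, s] = hms.split(":").map(Number);
  return h * 60 + m + s / 60;
};

const estimated = toMinutes("07:27:00"); // 447 minutes
const actual = toMinutes("02:34:45");    // 154.75 minutes
console.log((((estimated - actual) / estimated) * 100).toFixed(2)); // "65.38"
```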
from the results of the columns "business rules", "uc steps", and "adjust time", it is possible to verify that, most of the time, the use cases with a higher number of business rules and steps demanded a greater correction effort. this complements the results of our previous work, which showed that the complexity of the use cases impacts the generation of tests, since the manual intervention of the analysts is necessary.

in the previous paper (santos et al., 2019), the results of the pilot study were presented using nine use cases belonging to one sprint of the software a project, accounting for 41 flows and 186 steps. in that proof of concept, the test analyst estimated the time needed for the modeling of each use case and compared it with the time spent during automation. the reduction of effort in the pilot was approximately 65%. thus, analyzing the results obtained, it is possible to identify an effort below the estimate for the generation of test cases, especially in use cases with a larger quantity of flows and steps. regarding the question raised in the title of the paper, it can be said that this generation was worthwhile in the particular context of the project. however, more research must be carried out to generalize the findings.

6.3 threats to validity

after the proof of concept execution, we identified some limitations which must be discussed. hence, we present these limitations as threats to validity, as described in wohlin et al. (2012). we discuss the following threats.

regarding external validity, which determines the generalization of results, the metric used to estimate the manual effort could be a threat to the proof of concept. the metric estimates the time required to complete the specification of the test cases, and it was created based on the recent experience of the test analysts. however, the resultant value takes into account specific characteristics related to the context of software a, e.g., extra time to adjust the test cases after the specification. nevertheless, we believe that the metric is simple enough to be adapted to other projects.

also regarding the generalization of the study, the use of patterns in the tool for data extraction can hinder the tool's usage by other teams. to mitigate this, we tried to develop the tool to read patterns as general as possible, while still fitting the needs of the current project. additionally, we also improved the tool to allow users to create custom patterns with regular expressions.

at last, the proof of concept was conducted by one of the authors, who is also the test analyst in the project. to reduce possible evaluation bias, the test analyst used the tool in a setting as similar as possible to the real context. additionally, the proof of concept was a pilot evaluation to analyze the feasibility of the tool.

7 feasibility study

the proof of concept was performed in real sprints with users trained in and accustomed to the context of the software a project. thus, to analyze the viability of the developed solution in different contexts and obtain more data about the use of the uc2proc tool, we performed a feasibility study with test analysts from another industry project in partnership with the great laboratory (http://www.great.ufc.br). this feasibility study focuses on measuring the tool's efficiency from the perspective of users who, in this case, had no previous experience with the developed solution. additionally, we also collected the users' opinions about the positive and negative points of the solution.

subsection 7.1 details the methodology adopted to conduct the feasibility study. subsection 7.2 discusses the results obtained after the feasibility study with the users. subsection 7.3 explains some limitations in the conduction of the work.

7.1 methodology

we planned the feasibility study using the decide framework proposed by preece et al. (2004). this framework aims to guide evaluations with users through a checklist with well-defined activities, ranging from defining objectives to evaluating data. we selected decide because it is easy to apply in practice, allowing the assessment to be conducted by inexperienced assessors (preece et al., 2004). the following paragraphs describe the six activities related to planning and execution.
(1) define objectives. the first item on the decide checklist concerns the definition of the objectives that should guide the feasibility study. since our focus was to analyze whether the generated solution was capable of reducing the test specification time, we defined the following objectives: (i) to evaluate the efficiency of using the tool to specify tests by the test analyst; and (ii) to discover the positive and negative perceptions of the test analyst about the use of the tool.

(2) explore questions. the second item of decide is the definition of the questions that must be answered at the end of the study. considering as our goal the analysis of the user performance with the automated tool, we elaborated the following question: was the tool able to increase the test specification speed compared to the manual specification? regarding the objective of identifying positive and negative aspects, we elaborated the following question: what are the positive and negative perceptions of the analyst during the tool usage?

(3) choose the evaluation paradigm. the third item corresponds to choosing the evaluation paradigm and techniques to answer the questions of item two. in order to identify the efficiency of the generated solution, the participants of the feasibility study performed a manual and an automatic task. the details of the activities are described as follows:

i. questionnaire for profile identification: the evaluators applied this questionnaire to identify some general characteristics of the test analysts, such as their experience with use cases and automated test specification tools. the form filled in by the users is presented in table 6 of appendix a under the name "professional profile".

ii. manual specification of tests: the first task performed by the test analyst concerns the manual specification of tests related to a use case. in order to achieve it, the participant received a document describing the use case and was instructed to perform the specification in testlink. the use case for this activity belongs to a fictional system for creating movie schedules in theaters, with a total of five flows and 24 steps. during the performance of the activity, two evaluators made notes about the comments and doubts of the participants and other considerations about the execution of the task.

iii. tool presentation: after that, the researcher presented the tool and showed an example of how to use it. the use case employed in the example is the same as the one in the manual task.

iv. semi-automatic specification: in the second task, the test analyst received a use case document of a music software that allows the creation of playlists. even though it is a different system, the use case has the same complexity (number of flows, steps, and references between flows) as the one used in the manual task. the participants were then instructed to process it with the tool and send it to testlink. the two evaluators also observed and made notes, just as in the manual task.

v. open questionnaire: finally, the analysts needed to answer an open questionnaire with questions about the positive and negative aspects of the tool. the final form is presented in table 6 of appendix a under the name "user - tool evaluation".

(4) identify practical questions. the fourth item of decide corresponds to identifying issues related to the selection of users and materials to be used.
the study population was composed of professionals working on software testing projects at the test factory of the great lab. we selected two subjects for the pilot study and four for the final evaluation. besides that, all tasks were performed in a controlled environment with the aid of a computer.

(5) decide how to deal with ethical issues. the fifth item concerns how to protect the privacy and other issues related to the participants of the feasibility study. at this point, the test analysts were asked to sign a consent form to participate in the research and were informed about the purpose of the research, the data anonymization, and how it would be conducted.

(6) data evaluation. the last item of decide is about evaluating, interpreting, and presenting the data obtained during the evaluation. the performance of the users was evaluated by comparing the execution times of the tasks performed manually and with the tool's support. to answer the question regarding the positive and negative aspects of using the tool, the two evaluators analyzed and discussed the notes collected during the execution of the tasks and the answers from the form with open questions. both results were combined to provide the final topics related to the positive and negative points of the solution. given the context of this project, it was impossible to carry out evaluations with many users, so we did not use statistical tests. instead, we present the data and discuss the times obtained and the possible reasons that led to the results. given our focus on assessing efficiency, we did not assess the correctness of the test cases produced, as coverage requirements and specification style may change across projects.

7.2 results

this subsection presents the results obtained after conducting the feasibility study. before the final evaluation, we conducted two pilot tests based on the planning of subsection 7.1 to identify possible improvements. the tests were performed with a test analyst and a trainee in test analysis. after performing the tests, the evaluators detected inconsistencies in the use cases of the tasks, which were promptly corrected. we also decided to reduce the size of the test cases to take less time from the professionals' work. finally, we made a few adjustments to the forms and other evaluation materials.

table 5. experience of the participants.

participant id | profession | experience in software testing (years) | experience with use cases (years)
user 1 | test analyst intern | 2 | 2
user 2 | test analyst | 4 | 0
user 3 | test analyst | 4 | 1
user 4 | test analyst | 5 | 2

figure 9. participants' experience in test case specification with use cases.

after the pilot tests were executed, we selected four participants from a different project. all of these participants work as test analysts; three of them were more experienced, and one was an intern. table 5 summarizes the profile of the four professionals, detailing their experience with software testing and the specification of tests based on use cases. thus, as illustrated in figure 9, most of the participants had some experience with specifying test cases in the industry, but none of them works with use cases in their current projects.

figure 10. time required in minutes to execute the manual and the tool-supported tasks.

during the performance of the tasks with and without the uc2proc tool, we collected the total times per execution.
figure 10 presents a chart comparing the total times in minutes obtained by each of the four participants, all of whom achieved a lower specification time using the tool. while the average execution time of the manual task was 28.25 minutes, the average obtained with the use of the tool was 14.50 minutes. given the reduced size of the sample of participants, we chose not to analyze the statistical significance of the differences. therefore, regarding the first question of the feasibility study (was the tool able to increase the speed of test specification when compared to the manual activity?), the results give further indications that the developed solution can increase the specification speed. however, more evaluations are needed to gather more data about its actual effectiveness.

after the activities were carried out, the participants were instructed to fill out an open questionnaire to point out the uc2proc tool's positive and negative points. it is worth mentioning that the evaluators wrote down the participants' comments during the execution of the tasks. the form used by the evaluators is available in table 6 of appendix a under the name "evaluator - tool evaluation". the two researchers who conducted the feasibility study performed a qualitative analysis of the questionnaire answers and the users' comments during the evaluation. the evaluators used the notes to complement the answers of the questionnaire about positive/negative points, so these results are presented together. to accomplish this, we grouped the most repeated and contrasting topics into the following categories about the tool: efficiency, utility, understanding, and visual acceptance. the content of these topics was then used to compose the list of positive and negative points presented below.

as positive points, all participants mentioned that they felt a reduction in time when using the uc2proc tool compared to the manual activity, which means that the analyst's work could be streamlined. half of the participants also considered the integration of the tool with systems like testlink and jira a positive point. this last comment may be related to the participants' working context, as they are used to working with this suite of systems to manage the testing activities.

the main negative point, indicated by three of the four participants, concerned the titles generated by the tool: they were not very intuitive and could be difficult to understand during execution. this may happen because the uc2proc tool creates the test case's title based on the title of the last flow in the generated sequence. for instance, a test procedure with a flow sequence composed of the basic flow, an alternative flow, and the basic flow again will have the title of the basic flow, which leads to test procedures with repeated titles. another negative point was the repetition of steps when the tool moves between different flows more than once; we observed it mainly in the transitions from the basic flow to alternative/exception flows. still regarding the steps, two participants reported that it was necessary to correct some of them because there were system responses without user actions. this occurs because each system response has its own step in the generation process, even when the responses appear in sequence. finally, most of the users seemed confused when using the tool, since the buttons were not very intuitive and the tool offered little feedback on the actions.
they also reported that some of the test procedures contained outcomes without steps; however, this is the expected behavior, taking into account that the tool produces steps with only one outcome.

7.3 limitations

subsections 7.1 and 7.2 presented, respectively, the methodology used to conduct the feasibility study and the results obtained. however, we identified some limitations in the evaluation that deserve to be discussed. the first limitation concerns the small number of participants, which makes it difficult to apply statistical tests to assess the real difference between manual and automatic task times. nevertheless, the participants had varied experience with software testing in the industry, having already worked with different types of systems, tools for test process support, and types of requirements documents. therefore, the selected group can be considered suitable for an initial evaluation outside the context of the software a project.

moreover, the participants used only fictitious system documentation because of the confidentiality of software a. however, the use cases used are similar to the original documentation of the software: we tried to create new use cases with complexities and interactions similar to those of the use cases of software a.

regarding the execution of the tasks, all participants executed the manual task before the automated one. to mitigate this threat and reduce bias, the analysts used different use cases in the two tasks.

finally, during the feasibility study, some participants pointed out problems in the tool's procedures. the main concern was the repetition of some steps, mainly during the transition from the basic flow to alternative and exception flows. in its current version, the tool can incorrectly repeat some steps of the basic flow. even so, we believe that this had no high impact on the execution of the evaluation, considering that the repeated steps could be easily excluded.

8 lessons learned

as explained in section 3, the performed study had activities related to the specification of procedures. while using the tool during the sprints, it was possible to obtain lessons regarding the solution and the circumstances of its use; they are based on the researchers' observations, the opinions of the requirements/test team members, and the analysis of the collected metrics. these lessons represent some of the challenges found while using the solution, in such a way that actions taken during the sprints were integrated into the process of using the tool. the main lessons learned during the process are listed as follows:

ll-1: the efficiency of a test case generator tool based on use cases is strongly related to adherence to the writing pattern. use case modeling is a task that demands a high degree of instruction, communication, and knowledge about the software product. being a manual activity, it is common to create textual documents susceptible to attention errors in the writing pattern. these errors vary from problems in spelling, plurals, and blanks in the markup characters (such as writing "[fa - 01]" instead of "[fa-01]") to hidden logical loops. such errors caused flaws in the tool and needed to be corrected, implying additional time in the semi-automated generation process.
therefore, one action taken was to perform a detailed inspection to determine whether the use cases followed the project template, making any necessary corrections. having a use case with the correct pattern as an example helped analysts to identify inconsistencies more quickly and, consequently, enabled the use of the tool. for teams working with poorly detailed documentation, the tool may not generate good results; however, future work could address more specific scripts for different types of projects and documentation.

ll-2: the deployment of an automation tool may not be worth the effort reduction. throughout the software a project, the number of flows and steps was used to estimate the time needed to generate the tests. in some cases, this calculation is inaccurate: some use cases have a large number of flows but could be easily modeled either manually or in an automated way. in such cases, the tool allows only a negligible efficiency gain, so the effort to adapt the use cases of a project to the presented template may not pay off, especially if the use case is straightforward. in these scenarios, the implementation of a semi-automated tool may require a great deal of manual work to adjust the use cases to a template and to fix the generated procedures, which may impact the project activities. nonetheless, for most of the use cases of software a, there were indications that the time reduction in the test specification compensated for the implementation of the tool in the test factory's context.

ll-3: textual use cases do not express all the information necessary for good test coverage. in some use cases of software a, it was not possible to automatically obtain all the information needed to generate more test cases from the use case documentation. the reason is that business rules were expressed in unstructured natural language and screen prototypes were images. this prevented the extraction of some input variables for the procedures, since the adopted use case patterns gave analysts freedom in how to specify them. while using the tool, the test analyst needed to keep consulting the other documents during the analysis process to ensure the desired coverage.

ll-4: the integration of a solution with specific process tools is an important factor for efficiency gain. during the execution of a requirements/test process, the analysts may need to interact with different support tools to facilitate the activities' performance. therefore, to facilitate the practical application of an automation tool, it becomes essential that the developed solution integrates with the other systems. for example, in this report, the analysts originally cloned the tests on testlink, but the task generated many errors due to the lack of options for the test data and interface problems. in this sense, using the tool for the specification activities and then submitting the tests to testlink helped to decrease the errors.

ll-5: generating additional tests for business rules was not advantageous in all cases. according to user reports, the implementation of the new functionality was advantageous since it signaled the business rules referenced in the use case. on the other hand, three negative points were reported about the functionality. the first one is that a good part of the business rules could be covered with only one test case, generating less useful tests.
the second point is related to use cases that were too specific and had flows detailed down to the business rules; in those cases it was necessary to remove the duplicate tests. finally, the users needed to apply some effort to complement the test cases based on business rules, since only the title was generated.

9 conclusion

this paper presented an experience report about the generation of test procedures in an industrial context. it is an extension of our previous work (santos et al., 2019), whose main goal was to analyze the feasibility of inserting a tool to automate the generation of tests based on use cases. we implemented the solution in partnership with the industry, thus enabling the generation of a product that better suits the needs of the requirements/testing team, which leads to the question of this paper: "is it feasible to use a tool to generate test cases from textual use cases in the test process at a test factory?". our previous results showed that the proposed solution positively contributed to the analysts' activities. therefore, we have extended that work through the following contributions: (i) improvements in the tool's generation of test procedures; (ii) data related to more testing cycles of software a; and (iii) a feasibility study with test analysts.

regarding the tool's improvement, we realized that only indicating the procedures of business rules to be tested might not be sufficient; in that form, we obtained a low gain in effort. therefore, the results reinforced the need for the specification of business rules in a structured manner.

when it comes to the tool usage in more releases, we concluded that the effort reduction in the test generation was maintained, as well as the relationship between the complexity of the use cases and the time spent in manual intervention during the specification process. the reduction in effort equaled 65.38% in the context of the industry software project. furthermore, the majority of the effort required was for adjusting the test procedures generated by the tool.

in addition to the proof of concept, the feasibility study provided further insight into the efficiency of the solution. although all users completed the task more quickly using the tool, they pointed out interface issues that can make the software hard to use. these evaluations also enabled the generation of one additional lesson learned regarding the generation of tests for business rules, which demanded additional effort to remove unnecessary tests. this set of lessons learned can give more information about the introduction of an automated tool in a testing process.

considering the characteristics of the software a project, the team decided to develop a simple custom solution. nonetheless, finding the right degree to which the testing process had to adapt to the insertion of new tools was challenging. regardless of the decision about using custom solutions or other available solutions, we believe that more work is needed to provide practical insights in the context of test factories, which could benefit projects in distributed scenarios.

concerning future work, we plan to research how to document business rules to increase the efficiency of generating test procedures.
in the current work, this improvement has become even more evident, considering the perceived effort necessary to update procedures with partial descriptions of business rules. the test analysts of software a also pointed out that several use cases needed corrections in their patterns. since this hinders the usage of the tool, we also plan to apply static analysis techniques to the requirements documentation. finally, we intend to make improvements in the tool based on user comments and to analyze how the test procedures can assist the generation of automated scripts for functional tests.

a instruments of the feasibility study

table 6 presents the three forms used in the evaluation: (1) professional profile, used to collect the professional profile; (2) user - tool evaluation, filled in by the participants to report the positive and negative points of the solution; and (3) evaluator - tool evaluation, used by the researchers to collect the time and general notes during the tasks.

table 6. forms of the feasibility study.

form: professional profile
1. what is your age? (open)
2. what is your position? (open)
3. how much time do you have in software testing? (open)
4. do you have previous experience on test cases based on use cases? (options: i have knowledge about the specification of tests based on use cases in the industry / i have knowledge about the specification of tests based on use cases in the academia / i have no previous knowledge about the specification of tests based on use cases)
5. how much experience do you have with use case based testing specification in the industry? (open)

form: user - tool evaluation
1. what are the positive points of the tool used in the evaluation? (open)
2. what are the negative points of the tool used in the evaluation? (open)

form: evaluator - tool evaluation
1. time to complete task (open)
2. general notes (open)

references

r. m. d. c. andrade, i. d. s. santos, v. lelli, k. m. de oliveira, and a. r. rocha. software testing process in a test factory - from ad hoc activities to an organizational standard. in iceis (2), pages 132-143, 2017.
b. aragão, i. santos, t. nogueira, l. mesquita, and r. andrade. interactive modeling of a development process based on the team's perception: an experience report. in anais do xiii simpósio brasileiro de sistemas de informação, pages 428-435. sbc, 2017.
b. s. aragão, r. m. c. andrade, i. s. santos, r. n. s. castro, v. lelli, and t. g. r. darin. testdcat: catalog of test debt subtypes and management activities. in testing software and systems, pages 279-295, cham, 2019. springer international publishing. isbn 978-3-030-31280-0.
a. bertolino. software testing research: achievements, challenges, dreams. in 2007 future of software engineering, fose '07, pages 85-103, washington, dc, usa, 2007. ieee computer society. isbn 0-7695-2829-5. url http://dx.doi.org/10.1109/fose.2007.25.
r. m. de castro andrade, i. de sousa santos, v. lelli, k. m. de oliveira, and a. r. c. da rocha. software testing process in a test factory - from ad hoc activities to an organizational standard. in iceis, 2017.
d. freudenstein, m. junker, j. radduenz, s. eder, and b. hauptmann. automated test-design from requirements - the specmate tool. in 2018 ieee/acm 5th international workshop on requirements engineering and testing (ret), pages 5-8. ieee, 2018.
v. garousi and m. felderer. living in two different worlds: a comparison of industry and academic focus areas in software testing. ieee software, (1):1-1, 2017.
v. garousi and m. v. mäntylä. when and what to automate in software testing? a multi-vocal literature review. information and software technology, 76:92-117, 2016.
t. gorschek, p. garre, s. larsson, and c. wohlin. a model for technology transfer in practice. ieee software, 23(6):88-95, 2006.
j. gutiérrez, m. escalona, and m. mejías. a model-driven approach for functional test case generation. journal of systems and software, 109:214-228, 2015.
iso/iec 29119-2. iso/iec/ieee international standard - software and systems engineering - software testing - part 2: test processes. iso/iec/ieee 29119-2:2013(e), pages 1-68, sept 2013.
d. n. jorge, p. d. machado, e. l. alves, and w. l. andrade. integrating requirements specification and model-based testing in agile development. in 2018 ieee 26th international requirements engineering conference (re), pages 336-346. ieee, 2018.
l. lazic and n. mastorakis. cost effective software test metrics. wseas transactions on computers, 7(6):599-619, 2008.
j. l. massollar, r. m. de mello, and g. h. travassos. structuring and verifying requirement specifications through activity diagrams to support the semi-automated generation of functional test procedures. in 2012 eighth international conference on the quality of information and communications technology, pages 239-244. ieee, 2012.
a. mette and j. hass. testing processes. in software testing verification and validation workshop, icstw'08, ieee international conference on, pages 321-327. ieee, 2008.
m. a. montoni, a. r. rocha, and k. c. weber. mps.br: a successful program for software process improvement in brazil. software process: improvement and practice, 14(5):289-300, 2009.
g. j. myers, t. badgett, t. m. thomas, and c. sandler. the art of software testing, volume 2. wiley online library, 2004.
s. nogueira, h. araujo, r. araujo, j. iyoda, and a. sampaio. test case generation, selection and coverage from natural language. science of computer programming, pages 84-110, 2019.
j. preece, y. rogers, and h. sharp. interaction design. apogeo editore, 2004.
g. samarthyam, m. muralidharan, and r. k. anna. understanding test debt. in trends in software testing, pages 1-17. springer, 2017.
e. b. d. santos, l. s. d. costa, b. s. aragão, i. d. s. santos, and r. m. d. c. andrade. extraction of test cases procedures from textual use cases to reduce test effort: test factory experience report. in proceedings of the xviii brazilian symposium on software quality, pages 266-275, 2019.
k. schwaber and m. beedle. agile software development with scrum, volume 1. prentice hall, upper saddle river, 2002.
s. seela and r. yackel. 64 essential testing metrics for measuring quality assurance success. url https://www.qasymphony.com/blog/64-test-metrics/.
h. m. sneed. requirement-based testing - extracting logical test cases from requirement documents. in international conference on software quality, pages 60-79. springer, 2018.
s. s. some and x. cheng. an approach for supporting system-level test scenarios generation from textual use cases. in proceedings of the 2008 acm symposium on applied computing, pages 724-729. acm, 2008.
l. s. vieira, c. g. l. barreto, e. b. dos santos, b. s. aragão, i. de sousa santos, and r. m. c. andrade. test automation in a test factory: an experience report. in anais do xiv simpósio brasileiro de sistemas de informação, pages 80-73. sbc, 2018a.
l. s. vieira, c. g. l. barreto, e. b. dos santos, b. s. aragão, i. de sousa santos, and r. m. d. c. andrade. test automation in a test factory: an experience report. in anais do xiv simpósio brasileiro de sistemas de informação, pages 80-73. sbc, 2018b.
k. wiegers and j. beatty. software requirements. pearson education, 2013.
c. wohlin, p. runeson, m. höst, m. c. ohlsson, b. regnell, and a. wesslén. experimentation in software engineering. springer science & business media, 2012.
t. yue, s. ali, and m. zhang. rtcm: a natural language based, automated, and practical test case generation framework. in proceedings of the 2015 international symposium on software testing and analysis, pages 397-408. acm, 2015.

journal of software engineering research and development, 2019, 7:6, doi: 10.5753/jserd.2019.472  this work is licensed under a creative commons attribution 4.0 international license. characterization of software testing practices: a replicated survey in costa rica christian quesada-lópez [ universidad de costa rica, universidad estatal a distancia | cristian.quesadalopez@ucr.ac.cr; cquesadal@uned.ac.cr ] erika hernandez-agüero [ universidad estatal a distancia | ehernandez@uned.ac.cr ] marcelo jenkins [ universidad de costa rica | marcelo.jenkins@ucr.ac.cr ] abstract software testing is an essential activity in software development projects for delivering high quality products. in a previous study, we reported the results of a survey of software engineering practices in the costa rican industry. to make a more in-depth analysis of the specific software testing practices among practitioners, we replicated a previous survey conducted in south america. our objective was to characterize the state of the practice based on practitioners' use and perceived importance of those practices. this survey evaluated 42 testing practices grouped in three categories: processes, activities, and tools. a total of 92 practitioners responded to the survey. the participants indicated that: (1) tasks for recording the results of tests, documentation of test procedures and cases, and re-execution of tests when the software is modified are useful and important for software testing practitioners. (2) acceptance and system testing are the two most useful and important testing types. (3) tools for recording defects and the effort to fix them (bug tracking) and the availability of a test database for reuse are useful and important. regarding the use and implementation of practices, the participants stated that (4) planning and designing of software testing before coding and evaluating the quality of test artifacts are not a regular practice. (5) there is a lack of measurement of defect density and test coverage in the industry; and (6) tools for automatic generation of test cases and for estimating testing effort are rarely used.
this study gave us a first glance at the state of the practice in software testing in a thriving and very dynamic industry that currently employs most of our computer science professionals. the benefits are twofold: for academia, it provides us with a road map to revise our academic offer, and for practitioners, it provides them with a first set of data to benchmark their practices. keywords: software testing, industry practices, survey, costa rica, replication, empirical software engineering

1 introduction

software testing is an essential activity in software development projects for delivering high quality products, but it is a costly activity in the software development life cycle (garousi and zhi, 2013). software testing represents, on average, around 35% of the total budget of a development project (dias-neto et al., 2017). testing practices play a significant role in the development process: they represent a quality assurance strategy for the identification of defects in software applications before their deployment (juristo et al., 2004).

software testing has been a focus of attention for the industry. for example, the international software testing qualifications board (istqb, https://www.istqb.org/) aims to continually improve and advance the software testing profession by defining and maintaining a body of knowledge that allows testers to be certified based on best practices, connecting the international software testing community, and encouraging research. istqb promotes the value of software testing as a profession to individuals and organizations and has performed studies to observe the perception of practitioners on testing. after the "2013 istqb effectiveness survey", in which they collected feedback on the impacts of testing certifications, in 2015 istqb conducted a worldwide survey on software testing practices with 3,281 responses from testing practitioners from 89 countries. the istqb survey reveals significant findings for the professional practice:

• the budgets assigned to testing are large, keep growing, and range between 11% and 40%.
• agile methodologies are being adopted ahead of traditional ones, which emphasizes the need for appropriate testing processes and techniques for agile.
• the segregation of duties has become a standard practice: in 84% of the cases the test team does not report to development.
• test tools for defect tracking, test execution, test automation, test management, performance testing, and test design are widely adopted.
• some level of test automation is a trending topic, with 72% adoption.
• testing requires a wide range of skills and competencies.
• there are important career paths available for testers and test managers.
• the decision of when to stop testing is mainly based on requirements coverage.
• exploratory testing is the most adopted test technique.
• performance, usability, and security are the top three non-functional testing activities.
• there are several improvement opportunities in testing practices such as test automation, test process, communication, and test techniques.

afterward, the 2017-2018 istqb worldwide software testing practices report collected more than 2,000 responses from 92 countries. it reported findings mostly in parallel
with the previous survey and revealed the following: (1) the main improvement areas in software testing were test automation, knowledge about test processes, and communication between development and testing. (2) the top five test design techniques are use case testing, exploratory testing, boundary value analysis, checklist-based, and error guessing. (3) trending topics will be test automation, agile testing, and security testing. (4) new technologies that could affect testing are security, artificial intelligence, and big data. finally, (5) the expected non-testing skills are soft skills, business and domain knowledge, and business analysis skills.

currently, there is a gap between knowledge in academia and the software testing practices used in industry (dias-neto et al., 2017). moreover, there is a knowledge deficiency for testing topics in practice activities (scatalon et al., 2018). garousi and felderer (2017) state that the level of joint industry-academia collaboration in software engineering is very low compared to the number of activities in each of the two communities. comparing the focus areas of industry and academia in software testing, results show that the two groups are talking about quite different things. as an example, practitioners talk about test automation referring to automating the test execution phase, while academics focus on automated approaches (mostly test-case generation and test oracles) (garousi and felderer, 2017). moreover, researchers tend to be more interested in theoretically challenging issues, whereas test engineers in practice are more interested in options to improve the effectiveness and efficiency of testing (garousi and felderer, 2017; garousi et al., 2017). besides, there is a wide spectrum of testing practices conducted by different software teams (garousi and zhi, 2013) and little evidence in the literature regarding the use and importance of such practices in industry (dias-neto et al., 2017). the characterization of testing practices used in industry can help professionals, researchers, and academics to better understand the challenges faced by the software engineering profession (garousi and zhi, 2013).

to characterize testing practices in the software industry, a large number of surveys have been conducted in different countries. garousi and zhi (2013) and dias-neto et al. (2017) summarized previous surveys on software testing practices. in particular, dias-neto et al. (2017) identified surveys conducted to characterize the adoption of software testing practices, tools, and methods. the earliest identified surveys to characterize aspects of the testing process were from the united states of america (beck and perkins, 1983; gelperin and hetzel, 1988; torkar and mankefors, 2003). after that, other surveys were identified in the united states (wojcicki and strooper, 2006; kassab et al., 2017; kassab, 2018). a set of replications surveying testing practices in canada was conducted from 2004 to 2017 (geras et al., 2004; garousi and varma, 2010; garousi and zhi, 2013; garousi et al., 2017), and some studies surveying testing practices in south america were conducted from 2006 to 2018 (dias-neto et al., 2006; de greca et al., 2015; dias-neto et al., 2017; robiolo et al., 2017; scatalon et al., 2018). four more surveys were conducted in australia and new zealand between 2004 and 2012 (ng et al., 2004; chan et al., 2005; sung and paynter, 2006; wojcicki and strooper, 2006; kirk and tempero, 2012).
additionally, other studies surveying different aspects related to testing practices were conducted in finland (taipale et al., 2005, 2006; kasurinen et al., 2010; pfahl et al., 2014; smolander et al., 2016; hynninen et al., 2018; raulamo-jurvanen et al., 2019), spain (fernández-sanz, 2005; fernández-sanz et al., 2009), sweden (runeson, 2006; grindal et al., 2006; engström and runeson, 2010), korea (park et al., 2008; yli-huumo et al., 2014), the netherlands (vonken et al., 2012), norway (deak et al., 2013; deak and stålhane, 2013), belgium (pérez et al., 2013), turkey (garousi et al., 2015), sri lanka (vasanthapriyan, 2018), and bangladesh (bhuiyan et al., 2018). finally, other studies surveying different aspects related to testing practices were conducted across multiple countries (chan et al., 2005; causevic et al., 2010; rafi et al., 2012; lee et al., 2012; greiler et al., 2012; pham et al., 2013; daka and fraser, 2014; kanij et al., 2014; deak, 2014; ghazi et al., 2015; kochhar et al., 2015; lima and faria, 2016; rodrigues and dias-neto, 2016; garousi et al., 2017; kochhar et al., 2019).

in costa rica, previous surveys had been conducted to characterize software engineering practices. in our previous work (quesada-lópez and jenkins, 2017, 2018), we replicated a survey based on (garousi et al., 2015, 2016) in which we identified the most common practices, methods, and tools in professional practice and their related challenges. moreover, we conducted a cross-factor correlation analysis of development and testing engineering practices versus practitioner demographics. in (aymerich et al., 2018), the authors conducted a survey on development practices based on the helena study (kuhrmann et al., 2017); they studied development approaches, practices, and methods in the industry.

to analyze the specific software testing practices among practitioners in our country, we replicated previous surveys conducted in south america (dias-neto et al., 2006; de greca et al., 2015; dias-neto et al., 2017; robiolo et al., 2017). further replications in different countries are still needed to allow the comparison of industry trends in software testing practices (garousi and zhi, 2013; dias-neto et al., 2017). the results of these surveys can support evidence on testing practices in the software engineering community (garousi and zhi, 2013).

the objective of our study was to characterize a set of software testing practices with respect to their use and importance from the point of view of practitioners of software organizations in costa rica. in this work, we replicated the surveys in (dias-neto et al., 2006; de greca et al., 2015; dias-neto et al., 2017; robiolo et al., 2017) with 92 practitioners from our country. as stated in (dias-neto et al., 2017), we were interested in understanding testing practitioners' use and perceived importance of software testing practices. in addition, we wanted to compare the results of our study with the results of the previous surveys. thus, to facilitate the comparison between the previous studies and this replication, we used the same questionnaire used in (dias-neto et al., 2017).

previously, we had researched the software engineering practices of the industry in costa rica (quesada-lópez and jenkins, 2017, 2018). in this paper, we extend our previous study on software testing practices (quesada-lópez et al., 2019) by extending the analysis performed. besides, we conducted a literature search to identify past surveys on software testing practices in the industry.
we describe the survey's planning, design, execution, analysis of the collected data, and the comparison with previous surveys conducted in brazil, uruguay, and argentina to discuss the use and importance of software testing practices. finally, to get feedback about the significance and usefulness of the survey results from the practitioners' perspective, we made two presentations of the study to different groups of professionals.

this study gave us a first glance at the state of the practice in software testing in a thriving and very dynamic industry that currently employs most of our computer science professionals. the benefits are twofold: for academia, it provides us with a road map to revise our academic offering, and for practitioners, it provides a baseline to benchmark their current practices.

the paper is structured as follows: section 2 presents the related work. section 3 describes the survey replication process. section 4 analyzes the results of the survey. finally, section 6 outlines our conclusions and future work.

2 related work

several survey studies have been conducted on the subject of software testing practices in different countries and at different scales (garousi and zhi, 2013). this section summarizes the identified past surveys on software testing practices in the industry. these studies mainly aim to characterize the state of the practice in the software testing industry, identifying trends and opportunities for improvement and training (dias-neto et al., 2017).

to identify past surveys on software testing practices in the industry, we conducted a literature search. first, we conducted an exploratory search on scopus using the search string "title-abs-key((“software”) and (“testing practices” or “quality assurance practices”) and (“survey” or “questionnaire”))". additionally, we applied the snowballing technique (wohlin, 2014) on two previously published surveys (garousi and zhi, 2013; dias-neto et al., 2017); their cited references were searched using google scholar. the inclusion criteria admitted only papers describing software testing surveys, based on titles, keywords, abstracts, and analysis. the list includes papers on software engineering practices that report results on specific software testing practices.

table 1 briefly summarizes the surveys on testing practices. the paper reference, scale and region (or target community), target audience, number of respondents, and survey goal and focus area are listed. this table was based on garousi and zhi (2013) and dias-neto et al. (2017) and updated with the surveys identified in our search. in table 1, papers reported in garousi and zhi (2013) are marked with (*), papers reported in dias-neto et al. (2017) are marked with (**), and papers in both studies are marked with (***). the following reports were excluded because their research goal and method were not comparable to the other surveys (andersson and runeson, 2002; runeson et al., 2003).

the studies attempt to identify and characterize different software testing practices, processes, tools, and methods in different contexts. many surveys have been conducted since 2006, denoting the interest in surveying the software testing industry.
in the last decade, one survey was published in 2009, four surveys were published in 2010, five surveys in 2012, the same quantity in 2013 and 2014, four surveys were published in 2015, three surveys in 2016, five surveys in 2017 and 2018, and finally, three surveys were published in 2019, as listed in table 1. the main survey goals reported were:

• to characterize the adoption of software testing practices, processes, tools, and methods in different contexts (beck and perkins, 1983; gelperin and hetzel, 1988; torkar and mankefors, 2003; geras et al., 2004; ng et al., 2004; chan et al., 2005; wojcicki and strooper, 2006; dias-neto et al., 2006; kasurinen et al., 2010; garousi and varma, 2010; kirk and tempero, 2012; garousi and zhi, 2013; pérez et al., 2013; daka and fraser, 2014; yli-huumo et al., 2014; de greca et al., 2015; garousi et al., 2015; ghazi et al., 2015; smolander et al., 2016; kassab et al., 2017; quesada-lópez and jenkins, 2017; dias-neto et al., 2017; robiolo et al., 2017; hynninen et al., 2018; vasanthapriyan, 2018).
• to characterize the strengths and issues of software testing, and the opportunities for its improvement, including the critical factors of success in different aspects of software testing (runeson, 2006; engström and runeson, 2010; causevic et al., 2010; rafi et al., 2012; lee et al., 2012; greiler et al., 2012; pfahl et al., 2014; kochhar et al., 2015; rodrigues and dias-neto, 2016; bhuiyan et al., 2018; kochhar et al., 2019).
• to analyze what factors may influence the selection of software testing practices (fernández-sanz et al., 2009; greiler et al., 2012; deak et al., 2013; pham et al., 2013; pérez et al., 2013; deak and stålhane, 2013; pfahl et al., 2014; deak, 2014; kochhar et al., 2015; lima and faria, 2016; kochhar et al., 2019; raulamo-jurvanen et al., 2019).
• to analyze software testing practices and the level of maturity in the industry (fernández-sanz, 2005; grindal et al., 2006; park et al., 2008).
• to compare practitioners' software testing practices and the state of the art (sung and paynter, 2006; causevic et al., 2010; engström and runeson, 2010; vonken et al., 2012; rafi et al., 2012; scatalon et al., 2018).
• to characterize training needs and skills needed in software testing (ng et al., 2004; chan et al., 2005; kanij et al., 2014; vasanthapriyan, 2018).
• to identify research directions in software testing (taipale et al., 2005, 2006; smolander et al., 2016; garousi et al., 2017).

several studies reported the gap between the software testing state of the art and the state of the practice (ng et al., 2004; dias-neto et al., 2006; sung and paynter, 2006; causevic et al., 2010; engström and runeson, 2010; rafi et al., 2012; lee et al., 2012; yli-huumo et al., 2014; garousi et al., 2017; scatalon et al., 2018; vasanthapriyan, 2018). software testing is still reported as a time-consuming and expensive phase in software development (beck and perkins, 1983; ng et al., 2004; dias-neto et al., 2006). the automation of software testing has continued its growth, and there are opportunities for automated software testing research (ghazi et al., 2015; hynninen et al., 2018; kochhar et al., 2019; raulamo-jurvanen et al., 2019).

3 replication process

in the following subsections, we provide details about the methodology for conducting the replication.
replication studies are beneficial to evaluate the validity of prior study findings. successful replications increase the validity and reliability of the outcomes observed in the original study and are an essential part of the experimental paradigm to produce generalizable knowledge (carver et al., 2014). combined results from a family of replications are interesting because all studies are related and can investigate related questions. the aggregation of replication results is useful for software engineers to draw conclusions and consolidate the findings (carver, 2010; juristo and gómez, 2010; carver et al., 2014). a close replication study attempts to recreate the known conditions of the original study and is very similar to it. close replications are often used to establish whether the original outcomes are repeatable (lindsay and ehrenberg, 1993).

our study is an external replication of four previously conducted surveys in south america (dias-neto et al., 2006; de greca et al., 2015; dias-neto et al., 2017; robiolo et al., 2017). dias-neto et al. (2006) analyzed the answers of 36 practitioners from 13 brazilian organizations to identify the software testing practices used by the organizations and their importance. de greca et al. (2015) replicated the original survey with 18 practitioners in argentina. dias-neto et al. (2017) conducted the same survey in brazil and uruguay with 150 practitioners; they surveyed different companies from southern brazil (56 participants), northern brazil (50 participants), and uruguay (44 participants). robiolo et al. (2017) surveyed 25 practitioners from 25 organizations of the public sector. in this study, we report the responses of 92 practitioners from costa rica. the study includes a detailed analysis of the data collected and its comparison with previous studies, in accordance with the recommendations and guidelines in (carver, 2010; carver et al., 2014). this study is descriptive (linåker et al., 2015) and is intended to compare and extend previous results (carver et al., 2014), highlighting the similarities and differences in the use and importance of testing practices in different countries. the authors of the original study did not take part in the replication process. however, in our replication, we reused the survey goal, research questions, questionnaire, and analysis procedure reported in (dias-neto et al., 2017; robiolo et al., 2017).

3.1 goal and research questions

the objective of the study, formulated using the goal, question, metric (gqm) approach (basili et al., 1994), was to characterize testing practices based on the practitioners' use and perceived importance in the context of software organizations in costa rica. the survey evaluated 42 testing practices grouped in three categories: processes, activities, and tools. we studied the following research questions:

• rq1: what are the software testing practices used by practitioners in their organizations?
• rq2: what are the most important software testing practices according to the opinion of testing practitioners?

3.2 survey design

to address the study's goal and research questions, we conducted a survey to gather the opinions of practitioners.

3.2.1 target population and sampling

the target population is the practitioners applying testing practices in software organizations in costa rica. the practitioners were sampled by convenience.
they were contacted through the university of costa rica and the state distance university, two of the most important public universities in our country. e-mail distribution lists were used to support the recruitment of participants.

3.2.2 instruments used to collect data

we applied the questionnaire designed in (dias-neto et al., 2017) to collect the information. the instrument was divided into three parts: (1) profile and demographics, (2) the use of testing processes, activities, and tools; and (3) perceived importance of testing processes, activities, and tools. the instrument evaluated 42 testing practices grouped in three categories: testing processes (practices related to the adopted test processes in the software organization), testing activities (practices concerned with the procedures performed during the software testing), and testing tools (practices concerned with tools supporting the software testing). we used the spanish version of the instrument. in order to validate the questionnaire (concepts, language, and practices), we conducted five survey pilots. table 2 details the list of questions of the instrument.

the participants were asked to fill out the job position, experience in software testing, academic degree, certifications in testing, development methodology, programming language expertise, software platform used for development, company's size, and quality team configuration. participants were asked to fill in the entire questionnaire with the 42 testing practices according to the level of use in their current organization and the perceived importance of each testing practice. dias-neto et al. (2017, 2006) defined a five-point likert scale to express the gradual increase in the level of use and importance of a testing practice, as shown in table 3. as in the previous study, each practitioner answered only one option for the level of use and importance for each software testing practice.

table 1. summary of previous surveys on software testing practices (paper reference | scale/region | target audience | number of respondents | goal/focus area). papers reported in garousi and zhi (2013) are marked with (*), papers reported in dias-neto et al. (2017) with (**), and papers in both with (***).

beck and perkins (1983) | dallas-fort worth, usa | computer users | 63 | to analyze the usage of software engineering techniques, tools, and methods, including testing and validation activities in the software life cycle (*)
gelperin and hetzel (1988) | washington, usa | not reported | not reported | to characterize major test process models and methodologies and describe some of the changes associated with testing growth (**)
torkar and mankefors (2003) | usa, sweden | software development organizations | 91 | to explain to what extent software testing had been used when reusing software components (**)
geras et al. (2004) | alberta, canada | software development organizations | 60 | to characterize test practices and software quality assurance techniques (***)
ng et al. (2004) | australia | senior software practitioners | 65 | to determine testing techniques, tools, metrics, standards, and whether training courses in software testing adequately cover the testing methodologies and skills required (**)
fernández-sanz (2005) | spain | professional practitioners | 102 | to analyze testing practices and the level of maturity in testing
taipale et al. (2005) | finland | software testing researchers | 10 | to identify research directions in software testing (**)
chan et al. (2005) | 5 countries | software testing practitioners | 34 | to characterize software testing practices and the levels of software testing education and training (**)
wojcicki and strooper (2006) | usa, australia | list at cs.oswego.edu and ibm | 35 | to analyze the state of practice of verification and validation technology, the decision process for use, and cost-effectiveness for concurrent programs (**)
runeson (2006) | sweden | software developers | 15 | to characterize the strengths and issues of unit testing (**)
grindal et al. (2006) | sweden | not reported | 12 | to characterize organizations' testing maturity (**)
sung and paynter (2006) | new zealand | software testers | 62 | to compare software testing practices with the authors' software testing framework (**)
dias-neto et al. (2006) | brazil | software developers | 36 | to characterize the state of the practice of software testing in brazil (**)
taipale et al. (2006) | finland | industry specialists | 40 | to determine the current situation and improvement needs in software testing
park et al. (2008) | korea | software professionals in the defense industry | 38 | to identify test maturity, testing practices, and characteristics of software development in the korean defense industry
fernández-sanz et al. (2009) | spain | software professionals | 127 | to analyze what factors may influence testing practices
engström and runeson (2010) | sweden | software developers | 32 | to characterize the gap between the state of the art and the practice of regression testing
kasurinen et al. (2010) | finland | software testers and test managers | 31 | to identify the state of the practice on software test automation (**)
causevic et al. (2010) | not reported | researchers | 83 | to identify obstacles between the available (state-of-the-art) and preferred (state-of-the-practice) practices by software testing practitioners (**)
garousi and varma (2010) | alberta, canada | software developers | 53 | to replicate geras et al. (2004) on software testing techniques and analyze possible changes (***)
rafi et al. (2012) | not reported | software developers | 115 | to characterize the benefits and limitations of software testing automation (**)
lee et al. (2012) | not reported | executives | 33 | to identify the current practices and opportunities for the improvement of software testing tools and methods (**)
greiler et al. (2012) | not reported | eclipsecon participants | 151 | to discover how testing is performed, why testing is performed in a certain way, and what test-related issues the community is facing (**)
kirk and tempero (2012) | new zealand | software developers | 195 | to understand what practices are used in software testing (***)
vonken et al. (2012) | netherlands | development organizations | 99 | to determine whether there is a gap between the current state of the practice and the state of the art in software engineering (*)
deak et al. (2013) | norway | computing students | 33 | to identify the interest and desire to work in software testing among engineering and computer science students (**)
deak and stålhane (2013) | norway | not reported | 23 | to characterize the factors that can influence the creation of a software testing department or the investment in software testing personnel (**)
garousi and zhi (2013) | canada | software developers | 246 | to characterize canadian testing practices (***)
pham et al. (2013) | not reported | software developers of github | 569 | to characterize how testing behavior is influenced by the peculiarities of social coding environments (**)
pérez et al. (2013) | belgium | development professionals | 63 | to assess the state of the practice in software quality and how these practices vary across companies
pfahl et al. (2014) | finland and estonia | software developers | 61 | to study how software engineers understand and apply the principles of exploratory testing, as well as the specific advantages and difficulties they experience (***)
daka and fraser (2014) | 29 countries | software developers | 246 | to characterize how software developers use unit testing techniques (**)
kanij et al. (2014) | 22 countries | software testers | 104 | to characterize skills of software testers affecting software testing (**)
deak (2014) | not reported | software testers | 26 | to characterize the impact of the development methodology on testers' motivation (**)
yli-huumo et al. (2014) | south korea | software development professionals | 34 companies | to explore software development methods and quality assurance practices used by the software industry
de greca et al. (2015) | argentina | software developers | 18 | to characterize the state of the practice in software testing in argentina, a replication of dias-neto et al. (2006) (**)
garousi et al. (2015) | turkey | software professionals | 202 | to characterize the types of software testing practices, the latest techniques, tools, and metrics used, and the challenges faced by practitioners (**)
ghazi et al. (2015) | not reported | practitioners from linkedin and yahoo groups | 27 | to explore the testing of heterogeneous systems with respect to the usage and perceived usefulness of testing techniques, from the point of view of industry practitioners
kochhar et al. (2015) | not reported | software developers in github and microsoft | 210 | to understand the common testing tools used by android developers and the challenges they face when testing their apps
lima and faria (2016) | portugal | software testing professionals | 147 | to assess the relevance of distributed and heterogeneous systems in software testing practice, the features to be tested, test automation and tools, and desired features in test automation
rodrigues and dias-neto (2016) | not reported | software testing researchers and practitioners | 33 | to evaluate the critical factors of success in the software test automation life cycle
smolander et al. (2016) | finland | software industry specialists | 55 | to understand the current situation and improvement needs in software test automation
kassab et al. (2017) | penn state great valley, usa | linkedin professionals | 67 | to examine how software professionals used testing
quesada-lópez and jenkins (2017) | costa rica | software practitioners | 278 | to characterize engineering practices, including the analysis of software testing practices, a replication of garousi et al. (2015)
dias-neto et al. (2017) | brazil and uruguay | software testing practitioners | 150 | to understand the perception of practitioners regarding the use and importance of software testing practices, a replication of dias-neto et al. (2006) and de greca et al. (2015)
robiolo et al. (2017) | argentina | software professionals in the public sector | 25 organizations | to analyze the use and importance of software testing practices, a replication of dias-neto et al. (2006), de greca et al. (2015), and dias-neto et al. (2017)
garousi et al. (2017) | canada, turkey, denmark, austria, germany | practitioners | 105 | to characterize challenges and research topics that industry wants to suggest to software testing researchers
hynninen et al. (2018) | finland | industry practitioners | 33 | to explore industry practices concerning software testing, how products are tested, and what process models are followed, a continuation study of taipale et al. (2006) and kasurinen et al. (2010)
kassab (2018) | not reported | software professionals | 72 | to discover the actual practices for software testing and quality assurance activities for software in safety-critical systems
bhuiyan et al. (2018) | bangladesh | it personnel | 47 organizations | to identify the challenges along with the practices of software quality assurance and testing
scatalon et al. (2018) | brazil | software professionals | 90 | to identify knowledge gaps in software testing between undergraduate courses and what professionals actually apply in industry after graduating
vasanthapriyan (2018) | sri lanka | software development professionals | 152 from 3 software companies | to determine software testing practices, testing methodologies and techniques, automated tools, testing metrics, testing training, and academic collaboration with the software industry
kochhar et al. (2019) | 27 countries | software practitioners | 261 | to investigate what makes good test cases and to describe characteristics of good test cases and testing practices
raulamo-jurvanen et al. (2019) | finland | testing professionals | 89 | to study how software practitioners evaluate testing tools
this study (quesada-lópez et al., 2019) | costa rica | software practitioners | 92 | to characterize the state of the practice based on practitioners' perception of the use and importance of software testing practices, a replication of dias-neto et al. (2006), de greca et al. (2015), dias-neto et al. (2017), and robiolo et al. (2017)

3.2.3 data analysis

for each testing practice, we collected the use and importance level based on the opinions of the professionals. the equations were based on dias-neto et al. (2017). first, the responses of the professionals were differentiated by assigning a weight to each participant according to their experience, academic degree, and certifications on testing (eq. 1). second, we multiplied each answer by the weight of the participant and computed the total value for a testing practice (eq. 2). finally, we obtained a normalized value for the levels of use and importance that ranges between 0% and 100% (eq. 3). we applied the following formulas:

$$w(i) = \frac{dt(i)}{md_{dt}} + \frac{tt(i)}{md_{tt}} + f(i) + g(i) + h(i) \quad (1)$$

where $w(i)$ is the total weight for participant $i$; $dt(i)$ is the number of years of experience of participant $i$ in software development; $tt(i)$ is the number of years of experience of participant $i$ in software testing; $md_{dt}$ and $md_{tt}$ are the medians of $dt$ and $tt$; $f(i)$ is the highest academic degree of participant $i$ (0: high school, 1: undergraduate, 2: specialization, 3: master, 4: ph.d.); $g(i)$ is the expertise level self-assigned by participant $i$ (0: none, 1: low, 2: medium, 3: high, 4: excellent); and $h(i)$ is the number of testing certifications reported by participant $i$.

$$t(j) = \sum_{i=1}^{n} answer(i, j) \cdot w(i) \quad (2)$$

where $t(j)$ is the total value obtained for use and importance regarding testing practice $j$, and $answer(i, j)$ is the answer value (1-5) relating to the use and importance given by participant $i$ for testing practice $j$.

$$n(j) = \frac{t(j)}{\sum_{i=1}^{n} w(i) \cdot 5} \quad (3)$$

where $n(j)$ is the normalized value for use and importance of testing practice $j$, and $\sum_{i=1}^{n} w(i) \cdot 5$ is the maximum possible value for testing practice $j$.
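as a concrete reading of eqs. 1-3, the sketch below computes the participant weights and the normalized use/importance value for a toy set of respondents. the data, variable names, and helper function are hypothetical illustrations; only the formulas come from the text above.

```python
from statistics import median

# hypothetical respondents: (years_dev, years_test, degree 0-4, expertise 0-4, certifications)
participants = [
    (10, 5, 3, 3, 1),
    (4, 2, 1, 2, 0),
    (15, 8, 2, 4, 2),
]
# hypothetical answers (1-5 likert) of each participant for one testing practice j
answers_j = [5, 3, 4]

md_dt = median(p[0] for p in participants)  # median development experience
md_tt = median(p[1] for p in participants)  # median testing experience

def weight(p):
    """eq. 1: w(i) = dt/md_dt + tt/md_tt + f + g + h."""
    dt, tt, f, g, h = p
    return dt / md_dt + tt / md_tt + f + g + h

weights = [weight(p) for p in participants]
t_j = sum(a * w for a, w in zip(answers_j, weights))  # eq. 2: weighted total
n_j = t_j / (sum(weights) * 5)                        # eq. 3: normalized to [0, 1]
print(f"normalized use/importance for practice j: {n_j:.1%}")
```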
for each testing practice, the use and importance were analyzed and compared with previous studies, and the correlation between the use and the perceived importance was evaluated. for this study, we replicated the analysis proposed in (dias-neto et al., 2017). the most used/important software testing practices, the differences between regions, and the difference between the levels of use and importance perceived by practitioners were analyzed. finally, the existence of a significant correlation between the levels of use and importance for each evaluated practice was tested.
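the correlation test just mentioned could be run, for example, with a rank correlation over the normalized use and importance values of the 42 practices. the sketch below uses spearman's rank correlation from scipy as one plausible choice; the data is invented, and the paper does not state here which specific test was applied.

```python
from scipy.stats import spearmanr

# hypothetical normalized values (eq. 3) for a handful of practices:
use        = [0.82, 0.45, 0.67, 0.30, 0.91]
importance = [0.88, 0.52, 0.70, 0.41, 0.95]

rho, p_value = spearmanr(use, importance)
print(f"spearman rho = {rho:.2f}, p = {p_value:.3f}")
# a significant positive rho would indicate that practices perceived as
# more important also tend to be the ones more used.
```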
3.3 survey execution

the electronic questionnaire was implemented using limesurvey (www.limesurvey.org) and was available on a survey server at the university of costa rica for a period of two months, from september to october 2018. participants were asked to complete the survey online. all participants were invited by e-mail to participate anonymously and voluntarily. we sent e-mail invitations directly to the professionals through the contact lists of the universities. practitioners could withdraw at any time, and only summarized and aggregated information was published. as in previous studies (quesada-lópez and jenkins, 2017, 2018), some participants left questions unanswered and others left the questionnaire without completing it. only the completed answers were considered for the analysis of results. after data pre-processing, the responses of 92 professionals were analyzed.

3.4 threats to validity

this work is subject to the threats to validity reported for this type of study, including previous replications, and the results must be interpreted carefully. we discuss the validity concerns based on the classification of wohlin et al. (2012).

3.4.1 internal validity

this threat is related to the quantity and representativeness of the sample. the practitioners were sampled by convenience, which is reported as a common practice for survey studies in software engineering (molléri et al., 2016; ghazi et al., 2017) and in the previous surveys listed in section 2. besides, the survey does not necessarily represent the entire costa rican industry. although we achieved a relatively high number of respondents compared with previous surveys (dias-neto et al., 2017; robiolo et al., 2017), it was not possible to evaluate the representativeness of the sample: we were not able to obtain a reliable estimation of the total number of practitioners in the software industry of costa rica. our participants were mainly invited through the universidad estatal a distancia and universidad de costa rica networks and partners in costa rican software development organizations. many practitioners outside our contact networks were probably not properly represented in the survey sample. moreover, we were informed that some practitioners working in transnational software companies could not answer the questionnaire because of confidentiality issues with their companies.

the testing practices lists from the original study were not modified, to allow the replication. the original practices could be outdated with respect to the current state of the art and practice, and some testing practices in costa rica's context could be missing or omitted. first, we believe that the set of practices is still representative of the testing research field (dias-neto et al., 2017). second, we conducted five survey pilots with professionals in costa rica to validate the questionnaire (concepts, language, and practices).

3.4.2 construct validity

the testing practices lists were based on a previous survey instrument (dias-neto et al., 2017, 2006). the analysis of the levels of use and importance has already been used in the evaluation of the performance of organizations. we counted the votes for each question and then performed a statistical analysis. we used the weight function based on dias-neto et al. (2017) to compare the results across studies; the weight function should be carefully analyzed to interpret the results. the analysis showed differences in the levels of use and importance of software testing practices, and the characteristics of the organizations could affect these results. we informed the participants that we would not collect any personal information, so the professionals remain anonymous.

table 2. survey questionnaire (d: demographics, p: testing processes, a: testing activities, t: testing tools).
d01: job position
d02: experience in software testing
d03: academic degree
d04: certifications in testing
d05: development methodology
d06: programming language expertise
d07: software platform used for development
d08: company's size
d09: quality team configuration
p01: documentation of test plan
p02: documentation of test procedures and cases
p03: recording the results of test execution
p04: measurement and analysis of the test coverage
p05: use of methodology or process
p06: analysis of identified defects
p07: identification and use of risks for planning and executing software tests
p08: planning/designing of testing before coding
p09: monitoring adherence to the test process
p10: re-execution of tests when the software is modified
p11: evaluation of the quality of test artifacts
p12: setting a priori criteria to stop the testing
p13: reporting evaluation of a test round
a01: definition of a responsible professional or team
a02: application of unit tests
a03: application of integration tests
a04: application of system tests
a05: application of acceptance tests
a06: application of regression tests
a07: application of exploratory tests
a08: application of performance tests
a09: application of security tests
a10: registration of the time spent on testing
a11: measurement of the effort/cost of testing
a12: storage of records (log) of the executed tests
a13: measurement of the defect density
a14: conducting training on software testing
a15: separation of testing and development activities
a16: storage of test data for future use
a17: analysis of fault patterns (trend)
a18: availability of human resources allocated full time for testing
a19: selection of test techniques according to the project's features
t01: availability of a test database for reuse
t02: use of tools for automatic execution of test procedures or cases
t03: use of tools for automatic generation of test procedures or cases
t04: use of test management tools to track and record
t05: use of tools to estimate test effort and/or schedule
t06: use of test management tools to enact activities and artifacts
t07: use of tools for recording defects and the effort to fix them (bug tracking)
t08: use of coverage measurement tools
t09: continuous integration tools for automated tests
t10: selection of test tools according to project characteristics
selection of test tools according to project characteristics d: demographics. p: testing processes. a: testing activities. t: testing tools. characterization of software testing practices: a replicated survey in costa rica quesada-lópez et al. 2019 table 3. level of use and importance. l level of use l level of importance 1 not applied: the practice is outside the scope of the organization’s software projects. 1 not important: the practice is not necessary for software projects. 2 not used: the practice is within the scope of the organization, but it is not used in any software project. 2 low value: the practice has low importance to use in software projects. 3 infrequent use: the practice is not frequently used in the organization’s software projects. 3 limited value: the practice can be adequate to use in software projects. 4 common use: the practice is used in most of the organization’s software projects. 4 significant value: the practice is recommended to use in software projects. 5 standard use: the practice is used in all organization’s software projects. 5 essential value: the practice must be used in all software projects. l: likert scale. should be carefully analyzed to interpret the results. the analysis showed differences in the levels of use and importance of software testing practices. the characteristics of the organizations could affect these results. we informed participants of the survey that we will not collect any personal information so that professionals will remain anonymous. 3.4.3 conclusion validity the analysis procedure to obtain the level of use and importance according to the characteristics of each participant was based on previous surveys (dias-neto et al., 2017, 2006). the analysis procedure is a weighted average, where the weight function is based on qualitative aspects representing each subject (dias-neto et al., 2017). the model of use and importance was based on a previous empirical evaluation of the software practices (dias-neto et al., 2006). the trade-off of using this type of analysis is that the information from the extremes can be lost (dias-neto et al., 2017). all conclusions in this study are traceable to data. 3.4.4 external validity the survey reflects the practitioners’ interpretation of importance and use. the answers could not necessarily represent the reality of testing practices and could reflect subjectivity. aspects such as self-awareness and difference of training of the participants could influence responses. the results show a correlation between the levels of use and importance. it could indicate that practitioners find those practices usable and important, but they could not distinguish between the use and importance or they see no value in the difference (diasneto et al., 2017). in this study, we analyzed correlations between testing practices and we did not intend to establish any causal relationship. 4 analysis of results 4.1 demographics of the participants in this survey, 92 complete answers were analyzed. our participants could indicate more than one job position: 54% (50) of the practitioners reported one position, 23% (21) two positions, 8% (14) reported 3 and 4 positions, and 7% (7) reported up to 7 positions. table 4 presents the quantity (q) and the percentage (%) of participants per position and company’s size (s1: less than 10 employees, s2: 10-49 employees, s3: 50-100 employees, s4: more than 100 employees). 
participants claimed to be mostly project managers (18%), analysts (17%), developers (16%), and quality managers (14%). in addition, participants reported being software engineers (9%), test analysts (8%), testers (8%), quality engineers (6%), and software architects (3%). around 36% of the participants work on quality/testing. regarding who performs testing, 32% (29) of the participants reported that both development and quality teams perform testing activities, 34% (31) reported that only quality teams perform testing, and 26% (24) reported that the development teams perform testing activities. with respect to organization size, 50% (46) of the participants work in organizations with more than 100 employees, 16% (15) in organizations with 50-100 employees, 22% (20) in organizations with 10-49 employees, and 12% (11) in organizations with less than 10 employees.

table 4. participants per position and company's size.
position       q    %   s1  s2  s3  s4
project man.   32   18   7   8   3  14
analyst        31   17   4   6   6  15
test analyst   15    8   1   4   1   9
architect       6    3               6
quality man.   14    8   1   2   2   9
test leader    10    6       1   3   6
developer      29   16   3   5   5  16
tester         15    8       2   4   9
quality eng.   11    6   1   1   2   7
software eng.  16    9   2   6   1   7
total          92  100  11  20  15  46
%                       12  22  16  50

participants reported, on average, 11.5 years of experience in the software industry and 5.5 years of experience in software quality and testing. only 20% (18) of the participants hold a software testing certification: 15% (14) are istqb certified testers, 3% (3) are certified test managers (ctm), and 1% (1) is a certified software quality engineer (csqe). regarding the level of experience in testing, 33% (30) of the participants indicated a medium level, 27% (25) a high level, 21% (19) a low level, 15% (14) an excellent level, and 4% (4) no experience in testing. finally, participants reported their academic degree: 49% (45) hold a university degree, 36% (33) a master's degree, 14% (13) a technical specialization, and only 1% (1) a ph.d. in total, 59% (54) of the practitioners claim to apply agile methodologies, 26% (24) traditional methodologies, and 15% (14) a hybrid development methodology. the most used programming languages are .net in c# and visual basic (35%), java (24%), c/c++ (11%), php (9%), and python (9%).

4.1.1 participants' influence

dias-neto et al. (2017) observed that some participants could influence the results of the testing practices with their answers (experience and academic degree, as defined in eq. 1). in this section, we analyze the influence of each participant in this survey. the distribution of participants' weights ranges from 1.20 to 15.00 (mean = 6.63, median = 6.50, s.d. = 2.92). the 25th percentile was 4.80, the 50th percentile was 6.50, and the 75th percentile was 8.17. the p-value for the shapiro-wilk test indicates that the values representing the influence (weight) of the participants were normally distributed (p > 0.05). figure 1 shows the weight distribution through a dispersion and box-plot graph.

figure 1. distribution of respondents' weight.

two outliers (experts) were identified, with weights of 14.00 and 15.00. both are project managers with 30 years of experience in the it industry and 20 years of experience in testing.
their highest academic degree is a master's degree, and the first one is a certified test manager (ctm). in our analysis, we used the answers of all participants.

4.1.2 participants among surveys

in this study, we compare the results of surveys conducted in argentina, brazil, uruguay, and costa rica. table 5 presents the percentages of the positions reported in each previous survey (dias-neto et al., 2017; robiolo et al., 2017) and in this study. we present the percentages for northern brazil (nbr, n=50), southern brazil (sbr, n=56), and uruguay (uy, n=44) (dias-neto et al., 2017), argentina (ar, n=25) (robiolo et al., 2017), and costa rica (cr, n=92). the positions (%) reported are: analysts (p1), architects (p2), developers (p3), project managers (p4), quality managers (p5), test analysts (p6), test leaders (p7), and testers (p8). in brazil and uruguay, 66% of the respondents work on quality/testing (quality manager, test leader, test analyst, and tester) and 34% on development activities (analyst, architect, developer, and project manager); in the northern brazil region 84% work on quality/testing, in the southern brazil region 59%, and in uruguay 57% (dias-neto et al., 2017). in contrast, argentina reported only 16% of the respondents working on quality/testing and 84% in other development activities (16% of positions were not reported) (robiolo et al., 2017). in costa rica, 36% of the respondents work on quality/testing, including 6% reported as quality engineers.

table 5. participants per position (%), listing each survey's reported values in the order p1–p8 (positions not reported by a survey are absent from its row).
nbr: 12, 4, 6, 47, 14, 16
sbr: 14, 2, 4, 21, 5, 38, 11, 5
uy: 7, 2, 16, 18, 14, 7, 36
ar: 16, 12, 40, 4, 8, 4
cr: 17, 3, 16, 18, 14, 8, 8

similarly, table 6 presents the percentage of respondents by company's size: less than 10 employees (s1), 10-49 (s2), 50-99 (s3), and more than 100 (s4). we can observe that, with the exception of argentina (ar), most of the answers come from professionals in organizations with more than 100 employees.

table 6. participants per company's size (%).
survey  s1  s2  s3  s4
nbr     10  14  16  60
sbr      9  30  21  39
uy       5  23  20  52
ar      36  24  16  24
cr      12  22  16  50

in the next sections, we present the analysis of the results on the use and importance of the evaluated software testing practices. first, we present the analysis of the use and perceived importance of testing practices. second, we analyze the correlation between use and perceived importance. third, we discuss the results relating use and perceived importance in terms of "more used" and "more important", "less used" and "less important", "more used" and "less important", and "less used" and "more important" practices. finally, we compare the results among replications.

4.2 analysis of the use and perceived importance of testing practices

table 7 presents a heat map with the results of the use and importance of software testing practices. the first column contains the results of our study and the other four columns the results of the previous studies. the most used and perceived important (p. i.) testing practices in processes (p), activities (a), and tools (t) are marked in green, and the least used and important ones in red: the greener the color, the more the practice is deemed used and/or important; the redder, the less the practice is considered important or implemented. we present the results of costa rica (cr), argentina (ar) (robiolo et al., 2017), northern brazil (nbr), southern brazil (sbr), and uruguay (uy) (dias-neto et al., 2017).
for each testing practice, we can observe some trends by analyzing the use and importance across the replications. in all five countries/regions, there is a set of used and important practices (p02: documentation of test procedures and cases, p03: recording the results of test execution, p10: re-execution of tests when the software is modified, a01: definition of a responsible professional or team, a03: application of integration tests, a04: application of system tests, a05: application of acceptance tests, t01: availability of a test database for reuse, and t07: use of tools for recording defects and the effort to fix them – bug tracking), and a set of less used and less important practices (p08: planning/designing of testing before coding, a10: registration of the time spent on testing, a11: measurement of the effort/cost of testing, a13: measurement of the defect density, a14: conducting training on software testing, and a17: analysis of faults patterns – trends).

4.2.1 use of testing practices

the results of the use of software testing practices per country/region are presented in table 7. by analyzing the green patterns, we can conclude that the three most used testing processes were: the recording of test case results (p03), the documentation of test procedures and cases (p02), and the re-execution of tests when the software is modified (p10). in the case of testing activities, the three most used were the application of acceptance testing (a05) and system testing (a04), and the definition of a responsible professional or team (a01). finally, the three most used testing tools were those for recording defects and the effort to fix them – bug tracking (t07), a test database for reuse (t01), and management tools to track and record the results (t04). on the other hand, the processes for planning/designing of testing before coding (p08), the evaluation of the quality of test artifacts (p11), and the measurement and analysis of the test coverage (p04) were reported as the three least used. the measurement of the defect density (a13), the analysis of faults patterns – trends (a17), and the registration of the time spent on testing (a10) were reported as the three least used activities. finally, the three least used tools were the tools for automatic generation of test procedures or cases (t03), coverage measurement tools (t08), and tools to estimate test effort and/or schedule (t05).

4.2.2 importance of testing practices

the importance perceived by the participants regarding the software testing practices per country/region is also presented in table 7. by observing the green patterns, we can conclude that the three testing processes perceived as most important were: recording the results of test cases (p03), the documentation of test procedures and cases (p02), and the re-execution of tests when the software is modified (p10). these processes were also the most used by practitioners. in the case of testing activities, the three perceived as most important were the application of acceptance testing (a05), the application of integration tests (a03), and the storage of records (logs) of the executed tests (a12). besides, system testing (a04) and the definition of a responsible professional or team (a01) were also perceived as important.
finally, the three most important testing tools were: tools for recording defects and the effort to fix them – bug tracking (t07), tools for automatic execution of test procedures or cases (t02), and a test database for reuse (t01). the management tools to track and record the results (t04) were also perceived as important. likewise, the processes for evaluating the quality of test artifacts (p11), for planning/designing of testing before coding (p08), and for reporting the evaluation of a test round (p13) were perceived as the three least important. the measurement of the defect density (a13), the application of exploratory tests (a07), and the analysis of faults patterns – trends (a17) were perceived as the three least important activities. the three tools perceived as least important were the tools to estimate test effort and/or schedule (t05), coverage measurement tools (t08), and tools for automatic generation of test procedures or cases (t03).

table 7. comparison of the use and perceived importance of testing practices (heat map): cr (n=92), ar (n=25), nbr (n=50), sbr (n=56), uy (n=44).

4.3 analysis of correlation between use and perceived importance

table 8 presents the spearman's rho correlation coefficient (rs) between the use and perceived importance of each testing practice (two-tailed test with p < 0.01). there was a positive correlation between use and perceived importance, and all correlations were statistically significant. values above 0.5 were considered highly correlated and are marked with an asterisk in table 8. a high correlation means that the participants either (1) deemed the practice useful and important, or (2) deemed the practice not useful and not important. our results show that, although there is a correlation between the values of use and perceived importance, only 18 of the 42 practices are highly correlated (p01: documentation of test plan, p02: documentation of test procedures and cases, p03: recording the results of test execution, p09: monitoring adherence to the test process, p12: setting a priori criteria to stop testing, p13: reporting evaluation of a test round, a01: definition of a responsible professional or team, a04: application of system tests, a06: application of regression tests, a07: application of exploratory tests, a10: registration of the time spent on testing, a11: measurement of the effort/cost of testing, a12: storage of records (log) of the executed tests, a13: measurement of the defect density, t01: availability of a test database for reuse, t05: use of tools to estimate test effort and/or schedule, t06: use of test management tools to enact activities and artifacts, t07: use of tools for recording defects and the effort to fix them – bug tracking). in the following section, we compare the relation between use and importance.

table 8. spearman's correlation between use and importance (*: rs above 0.5).
id   testing practice                                rs
p01  documentation of test plan                      .585 *
p02  documentation of test proc. and cases           .644 *
p03  recording the results of test execution         .556 *
p04  measurement, analysis of test coverage          .393
p05  use of methodology or process                   .492
p06  analysis of identified defects                  .400
p07  identification and use of risks                 .447
p08  plan/design tests before coding                 .372
p09  monitoring adherence to the test process        .602 *
p10  re-execution of tests when modified             .467
p11  evaluation of the quality of test artifacts     .395
p12  setting a priori criteria to stop testing       .712 *
p13  reporting evaluation of a test round            .537 *
a01  def. of a professional or team                  .516 *
a02  application of unit tests                       .448
a03  application of integration tests                .456
a04  application of system tests                     .605 *
a05  application of acceptance tests                 .472
a06  application of regression tests                 .562 *
a07  application of exploratory tests                .587 *
a08  application of performance tests                .306
a09  application of security tests                   .323
a10  registration of the time spent on testing       .565 *
a11  measurement of the effort/cost of testing       .561 *
a12  storage of records (log) of the executed tests  .585 *
a13  measurement of the defect density               .532 *
a14  conducting training on software testing         .459
a15  separation of testing and dev activities        .468
a16  storage of test data for future use             .482
a17  analysis of faults patterns (trend)             .411
a18  availability of human resources full time       .476
a19  selection of test techniques based on features  .450
t01  availability of a test database for reuse       .548 *
t02  automatic execution of test proc. or cases      .360
t03  automatic generation of test proc. or cases     .355
t04  test management tools to track and record       .453
t05  to estimate test effort and/or schedule         .542 *
t06  test management tools to enact artifacts        .545 *
t07  recording defects and the effort to fix them    .518 *
t08  use of coverage measurement tools               .479
t09  continuous integration for automated tests      .424
t10  selection of test tools based on proj. charcs.  .450

4.4 analysis between use and perceived importance

dias-neto et al. (2017) analyze the levels of use and perceived importance by dividing the 42 test practices into two equal groups. table 9 presents the "more used" and "more important", and the "less used" and "less important" testing practices according to the answers of costa rican practitioners. to classify the practices, the top 21 most used practices and the top 21 practices most perceived as important were selected. the set of "most used, most important" practices represents the good testing practices performed by costa rican practitioners. the set of "least used, least important" testing practices represents those that seem not to be relevant in the context of these organizations; furthermore, these practices could represent gaps in knowledge about their benefits or simply a lack of organizational resources to put them into practice. table 10 presents the "more used" and "less important", and the "less used" and "more important" testing practices. the set of "most used, least important" testing practices includes the practices used by software practitioners but considered not as important as other practices; in this case, other used practices could generate more value in supporting testing activities. the set of "least used, most important" testing practices are those not used by practitioners in their software organizations, but perceived as important for their professional practice.

5 discussion

the results of the use of software testing practices show that practitioners in our industry are currently implementing basic processes and tools for performing software testing but, at the same time, they are not using key metrics for assessing testing results or the quality of the testing products. this clearly represents an important area for improvement in our industry and a challenge for universities in teaching these concepts.
second, although not perceived as important by practitioners, we believe that metrics (such as defect density) and processes such as the analysis of fault patterns are key for software organizations that aspire to improve their processes and reach higher maturity levels. they may not be deemed important now, but they will gain importance as the industry matures. on the other hand, based on the analysis of the correlation between use and perceived importance, we agree with dias-neto et al. (2017) when they state that practitioners can find the practices they use daily to be important and, therefore, either they cannot distinguish between use and importance or they do not see value in the distinction. finally, based on the analysis between use and perceived importance, the set of "least used, least important" testing practices could represent gaps in knowledge about their benefits or simply a lack of organizational resources to put them into practice. these practices may point out the gaps between academia and industry and could be addressed, for example, through practitioners' training courses and software process improvement plans that show the benefits of their application. the practices in the set of "least used, most important" can be complex or expensive to implement, may have considerable training needs, or the organizations may not have the necessary tools to perform them.

5.1 comparing the results among replications

to compare the results of this survey with previous studies (dias-neto et al., 2017), we analyzed the "more used" and "more important" testing practices, and the "less used" and "less important" testing practices. table 11 presents the "more used" and "more important" testing practices for each replication. five testing practices are common to all surveys (p03: recording the results of test execution, a01: definition of a responsible professional or team, a03: application of integration tests, a04: application of system tests, a05: application of acceptance tests), and four practices are common to four surveys (p02: documentation of test procedures and cases, p10: re-execution of tests when the software is modified, a15: separation of testing and development activities, a18: availability of human resources allocated full time for testing).

table 9. use and importance similarities between testing practices.
"more used" and "more important":
p02 documentation of test procedures and cases
p03 recording the results of test execution
p05 use of methodology or process
p06 analysis of identified defects
p10 re-execution of tests when modified
a01 definition of a responsible professional or team
a02 application of unit tests
a03 application of integration tests
a04 application of system tests
a05 application of acceptance tests
a06 application of regression tests
a12 storage of records (log) of the executed tests
a15 separation of testing and dev activities
a18 availability of human resources full time
t01 availability of a test database for reuse
t04 test management tools to track and record
t06 test management tools to enact artifacts
t07 tools for bug tracking and effort to fix them
"less used" and "less important":
p04 measurement and analysis of the test coverage
p07 identification and use of risks
p08 planning/designing of testing before coding
p09 monitoring adherence to the test process
p11 evaluation of the quality of test artifacts
p13 reporting evaluation of a test round
a07 application of exploratory tests
a10 registration of the time spent on testing
a11 measurement of the effort/cost of testing
a13 measurement of the defect density
a14 conducting training on software testing
a17 analysis of faults patterns (trend)
a19 selection of test techniques based on features
t03 tools for automatic generation of test cases
t05 use of tools to estimate test effort and/or schedule
t08 use of coverage measurement tools
t09 use of continuous integration tools for tests
t10 selection of test tools according to project charcs.

table 10. use and importance similarities between testing practices.
"more used" and "less important":
p01 documentation of test plan
p12 setting a priori criteria to stop testing
a16 storage of test data for future use
"less used" and "more important":
a08 application of performance tests
a09 application of security tests
t02 automatic execution of test procedures or cases

table 12 presents the "less used" and "less important" testing practices for each replication. six testing practices are reported in four surveys (p07: identification and use of risks for planning and executing software tests, p09: monitoring adherence to the test process, a11: measurement of the effort/cost of testing, t03: use of tools for automatic generation of test procedures or cases, t05: use of tools to estimate test effort and/or schedule, t08: use of coverage measurement tools). these practices represent a gap between the software testing state of the art (academia) and the state of the practice (practitioners), given that the list of practices in the survey was defined from the academic literature. in de greca et al. (2015), no practices were classified as less used and less important. in table 11 and table 12, we only included practices of our survey and practices with more than three occurrences across replications. we found no significant differences in the perceived use and importance of practices between our survey and previous surveys. as in other countries, important practices are not being used in our software industry; this opens an interesting line of research to find out why they are not being used. our survey aggregated previously reported evidence and presented new evidence on the use and perceived importance of testing practices in the industry:

• there is a gap between the software testing state of the art and the state of the practice.
this study identified a set of testing practices classified as "less important" and "less used" (table 9), and the set of "less important" and "less used" testing practices reported in multiple replications (table 12).

• the findings support that organizations mainly use ad hoc criteria to stop testing. in dias-neto et al. (2017) and robiolo et al. (2017), the practice p12: setting a priori criteria to stop the testing is ranked low (the level of use ranked in the bottom, at the 10th (65%), 10th (63%), 12th (64%), and 7th (50%) positions, respectively). in the case of costa rica, p12 was ranked 23rd (72%). the perceived importance received totals of 77% (8th), 73% (10th), and 74% (11th) in dias-neto et al. (2017), 73% (13th) in robiolo et al. (2017), and 87% (17th) in costa rica.

• the application of unit tests (a02) is not within the three most used (71%, 79%, 78%) and important (81%, 88%, 86%) practices in any of the regions reported in dias-neto et al. (2017). however, in robiolo et al. (2017), unit testing was reported as the most important practice (93%) and also used (79%). in this study, unit testing was reported as used (79%) and important (92%). according to the findings, we cannot draw conclusions about the level of use and importance of unit tests. other testing practices, such as a03: application of integration tests, a04: application of system tests, a05: application of acceptance tests, and a06: application of regression tests, were reported as used and important in multiple replications (table 11).

• the findings indicated some level of use and importance of automated testing. however, t03: use of tools for automatic generation of test procedures or cases was reported as "less used" and "less important" in dias-neto et al. (2017), robiolo et al. (2017), and this study. besides, the testing practices t02: use of tools for automatic execution of test procedures or cases, and t09: use of continuous integration tools for automated tests were categorized as "less used". we cannot infer whether the level of use of automated testing is lower or higher than that of manual testing.

table 11. comparison of "more used" and "more important" testing practices. each ✓ marks one of the surveys reporting the practice, among: this study, robiolo et al. (2017), dias-neto et al. (2017), de greca et al. (2015), and dias-neto et al. (2006).
p02 documentation of test procedures and cases ✓✓✓✓
p03 recording the results of test execution ✓✓✓✓✓
p05 use of methodology or process ✓✓
p06 analysis of identified defects ✓
p10 re-execution of tests when the software is modified ✓✓✓✓
a01 definition of a responsible professional or team ✓✓✓✓✓
a02 application of unit tests ✓✓✓
a03 application of integration tests ✓✓✓✓✓
a04 application of system tests ✓✓✓✓✓
a05 application of acceptance tests ✓✓✓✓✓
a06 application of regression tests ✓✓✓
a12 storage of records (log) of the executed tests ✓✓
a15 separation of testing and dev activities ✓✓✓✓
a16 storage of test data for future use ✓✓✓
a18 availability of human resources allocated full time for testing ✓✓✓✓
t01 availability of a test database for reuse ✓✓
t04 test management tools to track and record ✓✓
t06 test management tools to enact activities and artifacts ✓
t07 tools for recording defects and the effort to fix them (tracking) ✓✓

table 12. comparison of "less used" and "less important" testing practices. each ✓ marks one of the surveys reporting the practice, among the same five studies.
p04 measurement and analysis of the test coverage ✓✓✓
p07 identification and use of risks ✓✓✓✓
p08 planning/designing of testing before coding ✓✓✓
p09 monitoring adherence to the test process ✓✓✓✓
p11 evaluation of the quality of test artifacts ✓✓✓
p13 reporting evaluation of a test round ✓
a07 application of exploratory tests ✓✓
a10 registration of the time spent on testing ✓✓
a11 measurement of the effort/cost of testing ✓✓✓✓
a13 measurement of the defect density ✓✓✓
a14 conducting training on software testing ✓✓
a17 analysis of faults patterns (trend) ✓✓✓
a19 selection of test techniques based on features ✓
t03 use of tools for automatic generation of test procedures or cases ✓✓✓✓
t05 use of tools to estimate test effort and/or schedule ✓✓✓✓
t08 use of coverage measurement tools ✓✓✓✓
t09 use of continuous integration tools for automated tests ✓✓
t10 selection of test tools according to project characteristics ✓✓

finally, we confirmed some similarities highlighted by dias-neto et al. (2017) regarding industrial surveys: (1) testing automation is a concern, but it has not reached full adoption in industry; (2) ad hoc criteria are reported as among the main criteria used to stop testing; (3) tools for recording defects and bug tracking are the most adopted; and (4) the most used testing levels are acceptance, integration, system, and unit testing.

5.2 getting feedback from practitioners

to get feedback about the significance and usefulness of this research from the practitioners' perspective, we gave two presentations about our study results to different groups of professionals. after the presentations, we asked them two questions: (1) do you think that the data in this presentation provide value for your professional practice? (2) what would you like to see in future presentations? for the first question, everyone who answered responded in the affirmative. they considered the results of the survey useful to keep up to date with industry trends and to improve their own software processes. one person mentioned the importance of doing an informal benchmark with this initial data. a couple of them also mentioned the importance for academia of knowing these data, both to keep curricula updated and to better define the exit profile of their graduates. for the second question, the answers varied substantially. some people would like to see presentations with specific examples or case studies on how to implement software testing practices in organizations. others would like a presentation with guidelines on how to implement some of those practices in their own organizations. others suggested presentations about software testing metrics and tools (including the measurement of testing effectiveness) and how to implement them in small and medium organizations. finally, one person suggested holding an entire workshop on software testing, with software security testing as the main topic.
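for readers who wish to reproduce the statistical analyses reported in sections 4.1.1 and 4.3, the following is a minimal sketch in python using scipy; the data shown are made up purely for illustration, and all variable names are ours:

```python
# sketch of the normality and correlation tests used in this study;
# the numbers below are fabricated examples, not survey data.
from scipy.stats import shapiro, spearmanr

# section 4.1.1: shapiro-wilk normality test on participants' weights
weights = [4.8, 6.5, 8.2, 6.6, 5.9, 7.1, 6.4, 8.0, 5.5, 7.3]
stat, p = shapiro(weights)
print(f"shapiro-wilk: w = {stat:.3f}, p = {p:.3f}")  # p > 0.05 suggests normality

# section 4.3: spearman's rho between the use and importance answers
# for one practice; values of rs above 0.5 were treated as highly
# correlated in table 8.
use = [3, 4, 5, 2, 4, 3, 5, 1, 4, 2]
importance = [4, 4, 5, 3, 5, 3, 5, 2, 4, 3]
rs, p = spearmanr(use, importance)
print(f"spearman: rs = {rs:.3f}, p = {p:.4f}")
```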
6 conclusions

this paper reported a survey study of software testing practices in the costa rican software industry and compared the results with previous studies conducted in south america. we characterized a set of testing practices with respect to their use and perceived importance from the point of view of 92 practitioners. the main software testing practices reported in this survey were the recording of the results of tests, the documentation of test procedures and cases, and the re-execution of tests when the software is modified. acceptance and system testing were the two testing types considered most useful and important. the tools for recording defects and the effort to fix them (bug tracking) and the availability of a test database for reuse were reported as useful and important. in contrast, the planning and designing of software testing before coding and the evaluation of the quality of test artifacts were not regular practices. finally, there is a lack of measurement of defect density and test coverage in the industry, and tools for the automatic generation of test cases and for estimating testing effort are rarely used.

a set of testing practices was common across different countries: the application of integration, system, and acceptance tests, the recording of test execution results, and the definition of a responsible professional or team for testing. in contrast, our results confirm that the main testing limitations are the monitoring and measurement of tests and defects, the automatic generation of test cases and procedures, and the management of test coverage and effort. these last three are clear areas for process improvement. further studies in different countries and regions should be conducted to compare industrial trends in software testing practices. we believe this work could be used by organizations, practitioners, and academics to improve the state of the practice in our software industry. for future work, it could be interesting to make a comparison using the demographic data of the participants (such as types of projects, organizations' characteristics, and others) to find out whether different demographics influence the results by country.

acknowledgements

this work was partially supported by universidad estatal a distancia comiex-19-2017 and universidad de costa rica project no. 834-b7-749. we would like to thank guilherme travassos, santiago matalonga, martín solari, arilo dias-neto, and gabriela robiolo for providing the earlier version of the questionnaire. we thank all practitioners for their participation in the survey.

references

andersson, c. and runeson, p. (2002). verification and validation in industry – a qualitative survey on the state of practice. in proceedings international symposium on empirical software engineering, pages 37–47. ieee.
aymerich, b., díaz-oreiro, i., guzmán, j. c., lópez, g., and garbanzo, d. (2018). software development practices in costa rica: a survey. in international conference on applied human factors and ergonomics, pages 122–132. springer.
basili, v., gianluigi, c., and rombach, d. (1994). the goal question metric approach. encyclopedia of software engineering, pages 528–532.
beck, l. l. and perkins, t. e. (1983). a survey of software engineering practice: tools, methods, and results. ieee transactions on software engineering, (5):541–561.
bhuiyan, s. a. r., rahim, m. s., chowdhury, a. e., and hasan, m. h. (2018). a survey of software
quality assurance and testing practices and challenges in bangladesh. international journal of computer applications, 975:8887.
carver, j. c. (2010). towards reporting guidelines for experimental replications: a proposal. in 1st international workshop on replication in empirical software engineering, pages 2–5. citeseer.
carver, j. c., juristo, n., baldassarre, m. t., and vegas, s. (2014). replications of software engineering experiments.
causevic, a., sundmark, d., and punnekkat, s. (2010). an industrial survey on contemporary aspects of software testing. in 2010 third international conference on software testing, verification and validation, pages 393–401. ieee.
chan, f., tse, t., tang, w., and chen, t. (2005). software testing education and training in hong kong. in fifth international conference on quality software (qsic'05), pages 313–316. ieee.
daka, e. and fraser, g. (2014). a survey on unit testing practices and problems. in 2014 ieee 25th international symposium on software reliability engineering, pages 201–211. ieee.
de greca, f., rossi, b. d., robiolo, g., and travassos, g. h. (2015). aplicación y valoración de la verificación y validación de software: una encuesta realizada en buenos aires. in simposio argentino de ingeniería de software (asse 2015) – jaiio 44 (rosario, 2015).
deak, a. (2014). a comparative study of testers' motivation in traditional and agile software development. in international conference on product-focused software process improvement, pages 1–16. springer.
deak, a. and stålhane, t. (2013). organization of testing activities in norwegian software companies. in 2013 ieee sixth international conference on software testing, verification and validation workshops, pages 102–107. ieee.
deak, a., stålhane, t., and cruzes, d. (2013). factors influencing the choice of a career in software testing among norwegian students. software engineering, page 796.
dias-neto, a., natali, a. c. c., rocha, a. r., and travassos, g. h. (2006). caracterização do estado da prática das atividades de teste em um cenário de desenvolvimento de software brasileiro. v simpósio brasileiro de qualidade de software, vila velha, es.
dias-neto, a. c., matalonga, s., solari, m., robiolo, g., and travassos, g. h. (2017). toward the characterization of software testing practices in south america: looking at brazil and uruguay. software quality journal, 25(4):1145–1183.
engström, e. and runeson, p. (2010). a qualitative survey of regression testing practices. in international conference on product focused software process improvement, pages 3–16. springer.
fernández-sanz, l. (2005). un sondeo sobre la práctica actual de pruebas de software en españa. reicis. revista española de innovación, calidad e ingeniería del software, 1(2).
fernández-sanz, l., villalba, m. t., hilera, j. r., and lacuesta, r. (2009). factors with negative influence on software testing practice in spain: a survey. in european conference on software process improvement, pages 1–12. springer.
garousi, v., coşkunçay, a., betin-can, a., and demirörs, o. (2015). a survey of software engineering practices in turkey. journal of systems and software, 108:148–177.
garousi, v., coşkunçay, a., demirörs, o., and yazici, a. (2016). cross-factor analysis of software engineering practices versus practitioner demographics: an exploratory study in turkey. journal of systems and software, 111:49–73.
garousi, v. and felderer, m. (2017). living in two different worlds: a comparison of industry and academic focus areas in software testing.
ieee software, (1):1–1.
garousi, v., felderer, m., kuhrmann, m., and herkiloğlu, k. (2017). what industry wants from academia in software testing?: hearing practitioners' opinions. in proceedings of the 21st international conference on evaluation and assessment in software engineering, pages 65–69. acm.
garousi, v. and varma, t. (2010). a replicated survey of software testing practices in the canadian province of alberta: what has changed from 2004 to 2009? journal of systems and software, 83(11):2251–2262.
garousi, v. and zhi, j. (2013). a survey of software testing practices in canada. journal of systems and software, 86(5):1354–1376.
gelperin, d. and hetzel, b. (1988). the growth of software testing. communications of the acm, 31(6):687–695.
geras, a. m., smith, m. r., and miller, j. (2004). a survey of software testing practices in alberta. canadian journal of electrical and computer engineering, 29(3):183–191.
ghazi, a. n., petersen, k., and börstler, j. (2015). heterogeneous systems testing techniques: an exploratory survey. in international conference on software quality, pages 67–85. springer.
ghazi, a. n., petersen, k., reddy, s. s. v. r., and nekkanti, h. (2017). survey research in software engineering: problems and strategies. arxiv preprint arxiv:1704.01090.
greiler, m., deursen, a. v., and storey, m.-a. (2012). test confessions: a study of testing practices for plug-in systems. in proceedings of the 34th international conference on software engineering, pages 244–254. ieee press.
grindal, m., offutt, j., and mellin, j. (2006). on the testing maturity of software producing organizations. in testing: academic & industrial conference – practice and research techniques (taic part'06), pages 171–180. ieee.
hynninen, t., kasurinen, j., knutas, a., and taipale, o. (2018). software testing: survey of the industry practices. in 2018 41st international convention on information and communication technology, electronics and microelectronics (mipro), pages 1449–1454. ieee.
juristo, n. and gómez, o. s. (2010). replication of software engineering experiments. in empirical software engineering and verification, pages 60–88. springer.
juristo, n., moreno, a. m., and vegas, s. (2004). reviewing 25 years of testing technique experiments. empirical software engineering, 9(1-2):7–44.
kanij, t., merkel, r., and grundy, j. (2014). a preliminary survey of factors affecting software testers. in 2014 23rd australian software engineering conference, pages 180–189. ieee.
kassab, m. (2018). testing practices of software in safety critical systems: industrial survey. in 20th international conference on enterprise information systems, iceis 2018, pages 359–367. scitepress.
kassab, m., defranco, j. f., and laplante, p. a. (2017). software testing: the state of the practice. ieee software, 34(5):46–52.
kasurinen, j., taipale, o., and smolander, k. (2010). software test automation in practice: empirical observations. advances in software engineering, 2010.
kirk, d. and tempero, e. (2012). software development practices in new zealand. in 2012 19th asia-pacific software engineering conference, volume 1, pages 386–395. ieee.
kochhar, p. s., thung, f., nagappan, n., zimmermann, t., and lo, d. (2015). understanding the test automation culture of app developers. in 2015 ieee 8th international conference on software testing, verification and validation (icst), pages 1–10. ieee.
kochhar, p.
s., xia, x., and lo, d. (2019). practitioners' views on good software testing practices. in proceedings of the 41st international conference on software engineering: software engineering in practice, pages 61–70. ieee press.
kuhrmann, m., diebold, p., münch, j., tell, p., garousi, v., felderer, m., trektere, k., mccaffery, f., linssen, o., hanser, e., et al. (2017). hybrid software and system development in practice: waterfall, scrum, and beyond. in proceedings of the 2017 international conference on software and system process, pages 30–39. acm.
lee, j., kang, s., and lee, d. (2012). survey on software testing practices. iet software, 6(3):275–282.
lima, b. and faria, j. p. (2016). a survey on testing distributed and heterogeneous systems: the state of the practice. in international conference on software technologies, pages 88–107. springer.
linåker, j., sulaman, s. m., maiani de mello, r., and höst, m. (2015). guidelines for conducting surveys in software engineering.
lindsay, r. m. and ehrenberg, a. s. (1993). the design of replicated studies. the american statistician, 47(3):217–228.
molléri, j. s., petersen, k., and mendes, e. (2016). survey guidelines in software engineering: an annotated review. in proceedings of the 10th acm/ieee international symposium on empirical software engineering and measurement, page 58. acm.
ng, s., murnane, t., reed, k., grant, d., and chen, t. (2004). a preliminary survey on software testing practices in australia. in 2004 australian software engineering conference. proceedings., pages 116–125. ieee.
park, j., ryu, h., choi, h.-j., and ryu, d.-k. (2008). a survey on software test maturity in korean defense industry. in proceedings of the 1st india software engineering conference, pages 149–150. acm.
pérez, j., mens, t., and kamseu, f. (2013). a pilot study on software quality practices in belgian industry. in 2013 17th european conference on software maintenance and reengineering, pages 395–398. ieee.
pfahl, d., yin, h., mäntylä, m. v., and münch, j. (2014). how is exploratory testing used? a state-of-the-practice survey. in proceedings of the 8th acm/ieee international symposium on empirical software engineering and measurement, page 5. acm.
pham, r., singer, l., liskin, o., figueira filho, f., and schneider, k. (2013). creating a shared understanding of testing culture on a social coding site. in proceedings of the 2013 international conference on software engineering, pages 112–121. ieee press.
quesada-lópez, c., hernandez-aguero, e., and jenkins, m. (2019). a survey of software testing practices in costa rica. in proceedings of the xxii ibero-american conference on software engineering (cibse 2019), la habana, cuba, 23-27 abril 2019, pages 107–145.
quesada-lópez, c. and jenkins, m. (2017). estudio sobre las prácticas de la ingeniería de software en costa rica: resultados preliminares. in proceedings of the xx ibero-american conference on software engineering (cibse 2017), buenos aires, argentina, 22-23 may 2017, pages 107–145.
quesada-lópez, c. and jenkins, m. (2018). factores asociados a prácticas de desarrollo y pruebas de software en costa rica: un estudio exploratorio. in proceedings of the xxi ibero-american conference on software engineering (cibse 2018), bogotá, colombia, 23-27 abril 2018, pages 107–145.
rafi, d. m., moses, k. r. k., petersen, k., and mäntylä, m. v. (2012). benefits and limitations of automated software testing: systematic literature review and practitioner survey.
in proceedings of the 7th international workshop on automation of software test, pages 36–42. ieee press.
raulamo-jurvanen, p., hosio, s., and mäntylä, m. v. (2019). practitioner evaluations on software testing tools. in proceedings of the evaluation and assessment on software engineering, pages 57–66. acm.
robiolo, g., m, m., rossi, b., and travassos, g. h. (2017). aplicación e importancia de las pruebas de software: una encuesta realizada en buenos aires en el ámbito público. in xx ibero-american conference on software engineering (cibse 2017). argentina, 22-23 may 2017.
rodrigues, a. and dias-neto, a. (2016). relevance and impact of critical factors of success in software test automation lifecycle: a survey. in proceedings of the 1st brazilian symposium on systematic and automated software testing, page 6. acm.
runeson, p. (2006). a survey of unit testing practices. ieee software, 23(4):22–29.
runeson, p., andersson, c., and höst, m. (2003). test processes in software product evolution – a qualitative survey on the state of practice. journal of software maintenance and evolution: research and practice, 15(1):41–59.
scatalon, l. p., fioravanti, m. l., prates, j. m., garcia, r. e., and barbosa, e. f. (2018). a survey on graduates' curriculum-based knowledge gaps in software testing. in 2018 ieee frontiers in education conference (fie), pages 1–8. ieee.
smolander, k., taipale, o., and kasurinen, j. (2016). software test automation in practice: empirical observations. in data structure and software engineering, pages 107–145. apple academic press.
sung, p. w.-b. and paynter, j. (2006). software testing practices in new zealand. in proceedings of the 19th annual conference of the national advisory committee on computing qualifications, pages 273–282.
taipale, o., smolander, k., and kälviäinen, h. (2005). finding and ranking research directions for software testing. in european conference on software process improvement, pages 39–48. springer.
taipale, o., smolander, k., and kälviäinen, h. (2006). a survey on software testing. 6th international spice.
torkar, r. and mankefors, s. (2003). a survey on testing and reuse. in proceedings 2003 symposium on security and privacy, pages 164–173. ieee.
vasanthapriyan, s. (2018). a study of software testing practices in sri lankan software companies. in 2018 ieee international conference on software quality, reliability and security companion (qrs-c), pages 339–344. ieee.
vonken, f., brunekreef, j., zaidman, a., and peeters, f. (2012). software engineering in the netherlands: the state of the practice. technical report series tud-serg-2012-022.
wohlin, c. (2014). guidelines for snowballing in systematic literature studies and a replication in software engineering. in proceedings of the 18th international conference on evaluation and assessment in software engineering, page 38. citeseer.
wohlin, c., runeson, p., höst, m., ohlsson, m. c., regnell, b., and wesslén, a. (2012). experimentation in software engineering. springer science & business media.
wojcicki, m. a. and strooper, p. (2006). a state-of-practice questionnaire on verification and validation for concurrent programs. in proceedings of the 2006 workshop on parallel and distributed systems: testing and debugging, pages 1–10. acm.
yli-huumo, j., taipale, o., and smolander, k. (2014). software development methods and quality assurance: special focus on south korea.
in european conference on software process improvement, pages 159–169. springer.

journal of software engineering research and development, 2019, 7:7, doi: 10.5753/jserd.2019.460  this work is licensed under a creative commons attribution 4.0 international license.

specifying the process model for systematic reviews: an augmented proposal

pablo becker [gidis_web, engineering school, universidad nacional de la pampa | beckerp@ing.unlpam.edu.ar]
luis olsina [gidis_web, engineering school, universidad nacional de la pampa | olsinal@ing.unlpam.edu.ar]
denis peppino [gidis_web, engineering school, universidad nacional de la pampa | denispeppino92@gmail.com]
guido tebes [gidis_web, engineering school, universidad nacional de la pampa | guido.tebes92@gmail.com]

abstract

context: systematic literature review (slr) is a research methodology intended to obtain evidence from scientific articles stored in digital libraries. slrs can be performed on primary and secondary studies. although there are guidelines for the slr process in software engineering, the slr process is not yet fully and rigorously specified. moreover, a lack of a clear separation of concerns between what to do (process) and how to do it (methods) can often be observed. objective: to specify the slr process in a more detailed and rigorous manner by considering different process modeling perspectives, such as the functional, behavioral, organizational, and informational perspectives. the main objective of this work is to specify the slr activities rather than their methods. method: the spem (software & systems process engineering metamodel) language is used to model the slr process from different perspectives. in addition, we illustrate aspects of the proposed process by using a recently conducted slr on software testing ontologies. results: our slr process model specifications favor a clear identification of which tasks/activities should be performed, in which order, by whom, and which artifacts are consumed and produced, as well as their inner structures. also, we explicitly specify activities related to the slr pilot test, analyzing the gains. conclusion: the proposed slr process considers with higher rigor the principles and benefits of process modeling, helping slrs to be more systematic, repeatable, and auditable for researchers and practitioners. in fact, the rigor provided by process modeling, where several perspectives are combined but can also be independently detached, provides a greater richness of expressiveness in sequences and decision flows, while representing different levels of granularity in the work definitions, such as activity, sub-activity, and task.
keywords: systematic literature review; systematic mapping; process modeling perspectives; spem; process improvement; software testing ontology

1 introduction

a systematic literature review (slr) aims at providing exhaustive evidence from the relevant literature for a set of research questions. initially, slrs were conducted in clinical medicine (rosenberg et al. 1996). since kitchenham issued a technical report about slrs in 2004 (kitchenham 2004), the use of slr in different scientific communities of software engineering (se) has become more and more frequent for gathering evidence mainly from primary studies and, to a lesser extent, from secondary studies. the output document yielded when applying the slr process to primary studies is called a secondary study, while the one yielded when applying it to secondary studies is called a tertiary study. to quote just a few examples, the authors in sepúlveda et al. (2016), tahir et al. (2016), and torrecilla-salinas et al. (2016) document secondary studies on diverse topics in se, while the authors in garousi & mäntylä (2016) and kitchenham et al. (2010b) report tertiary studies. very often researchers have reused the procedures and guidelines proposed in kitchenham (2004), which were first reviewed by biolchini et al. (2005), and later updated by kitchenham and her colleagues in 2007 (brereton et al. 2007, kitchenham & charters 2007). more recently, by conducting a slr, kitchenham & brereton (2013) evaluated and synthesized studies published by se researchers (including different types of studies, not only primary ones) that discuss their experiences in conducting slrs and their proposals to improve the slr process. even though slrs have become an established methodology in se research, and there exist guidelines that help researchers to conduct a slr, the slr process itself is not yet fully and rigorously specified. figure 1 depicts the process specification made by brereton and kitchenham et al., which was totally adopted or slightly adapted by the rest of the se community up to the present time.

figure 1. slr process proposed by kitchenham. note that the figure's style is slightly adapted from brereton et al. (2007).

this process specification shows 'what' to do through its phases and steps and in which order – or, in other words, through its processes, activities, and tasks as per becker et al. (2015). however, the process in figure 1 can be improved if we take into account the principles of process modeling proposed by curtis et al. (1992), and used, for instance, in becker et al. (2012). curtis et al. describe four perspectives (views) for modeling a process (a small illustrative sketch follows the list):

• functional: describes what activities should be carried out and what flow of artifacts (e.g., documents) is necessary to perform the activities and tasks;
• behavioral: specifies when activities should be executed, including therefore the identification of sequences, parallelisms, iterations, etc.;
• organizational: aims at showing where and who are the agents (in compliance with roles) involved in carrying out the activities; and,
• informational: focuses on the structure of the artifacts produced or consumed by the activities, on their interrelations, etc.
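to make the four perspectives concrete, the following is a minimal sketch of how a single slr activity touches all four views. the rendering in python is ours and only an analogy, since spem is a uml-based metamodel; all names are illustrative:

```python
# sketch: one slr activity seen from the four modeling perspectives.
from dataclasses import dataclass, field

@dataclass
class Artifact:       # informational view: documents and their structure
    name: str

@dataclass
class Role:           # organizational view: who performs the work
    name: str

@dataclass
class Activity:       # functional view: what is done, with which
    name: str         # consumed (inputs) and produced (outputs) artifacts
    performed_by: list
    inputs: list = field(default_factory=list)
    outputs: list = field(default_factory=list)
    successors: list = field(default_factory=list)  # behavioral view: ordering

protocol = Artifact("review protocol")
design = Activity("design review protocol",
                  performed_by=[Role("researcher")],
                  outputs=[protocol])
pilot = Activity("perform slr pilot test",
                 performed_by=[Role("researcher"), Role("reviewer")],
                 inputs=[protocol])
design.successors.append(pilot)  # the pilot follows the protocol design
```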
therefore, a full process specification considering different perspectives contributes to a clearer identification of which tasks/activities should be performed, in which order, by whom, and which artifacts are consumed and produced, as well as their inner structure. in addition to these four views, a methodological perspective is defined in olsina (1998), which specifies particularly what constructors (i.e., methods) are assigned to the activity descriptions. however, we detect that there is often a lack of a clear separation of concerns between what to do (process) and how to do it (methods). consequently, methods are sometimes included as activities in the process, such as in garousi & mäntylä (2016), as we discuss later on.

some benefits of using process modeling to strengthen process specifications in general, and the slr process in particular, are: to facilitate understanding and communication, which implies that the process model (with the richness that graphic representations provide) should be understandable for the target community; to support process improvement, since all the fundamental perspectives of the process model are identified, which benefits reutilization and the evaluation of impacts in the face of potential changes in the process; to support process management, that is, the planning, scheduling, and monitoring and control activities; to allow process automation, which can help to provide supporting tools and to improve performance; and to favor the verification and validation of the process, thus fostering consistency, repeatability and auditability in projects.

additionally, in large-scale studies, like a slr, a pilot or small-scale trial often precedes the main study in order to analyze the validity of its design. therefore, it is very useful for researchers to conduct a slr pilot to test whether aspects of the slr design (such as the search string, selection criteria and data extraction form) are suitable. however, we observe that activities related to the pilot test are not explicitly specified in current slr process models, as happens in the process representation of figure 1.

it is important to remark that the present paper is a significantly extended version of tebes et al. (2019a). in this work, we include the organizational perspective for the slr process, and new models from the functional and behavioral perspectives for some activities. furthermore, in tebes et al. (2019a) the study on software testing ontologies is illustrated just for the slr pilot test; here, we use the same study to illustrate fully the main artifacts produced throughout the slr process. on the other hand, this work differs from tebes et al. (2019b), whose focus is mainly on the analysis of the retrieved software testing ontologies rather than on the process perspectives, as we do in the next sections. summarizing, the main contribution of this work is to augment the existing slr process specifications, considering the principles and benefits of process modeling such as those described above. to this aim, we use the functional, behavioral, informational and organizational perspectives. furthermore, we specify activities related to the slr pilot test, which are often neglected in other slr process specifications. as a result, slrs can be more systematic, repeatable and auditable for researchers and practitioners.
it is worth noting that, regarding the quoted benefits of process modeling, our slr process specifications in the current work aim primarily at facilitating understanding and communication, as well as at supporting process improvement and process management. however, a thorough discussion and a detailed illustration of process modeling for fully supporting slr process automation are out of the scope of this article.

this paper is structured as follows: section 2 addresses related work. section 3 specifies the proposed slr process considering different process modeling perspectives. section 4 illustrates a practical case applied to software testing ontologies from the process modeling perspectives standpoint. section 5 discusses some benefits of our slr process. finally, section 6 presents our conclusions and outlines future work.

2 motivation and related work

one motivation for modeling the slr process arose from certain difficulties that we (all the authors of this paper) faced when carrying out a slr pilot study about software testing ontologies (tebes et al. 2018, tebes et al. 2019a). the general objective of this pilot study was to refine and improve aspects of the protocol design such as the research questions, search protocol, selection and quality criteria, and/or data extraction forms. analyzing several works about slr, we have observed at least three main issues. first, activities related to the slr pilot test are often omitted or not explicitly specified. second, some aspects of the existing slr processes are weakly specified from the point of view of the process modeling perspectives. third, there is often a lack of a clear separation of concerns between what to do (process) and how to do it (methods). next, we comment on related work on slr process specification where these issues were detected.

the first graphic representation of the slr process proposed by kitchenham (brereton et al. 2007) was outlined in 2007, taking into account previous works of the same authors (kitchenham 2004) and other contributions such as biolchini et al. (2005). it was totally adopted or slightly adapted by the rest of the se community up to the present moment. most of the works divide the process into three phases or stages: plan review, conduct review and document review. while at the phase level the same three main activities are generally preserved (for example, in sepúlveda et al. (2016), tahir et al. (2016), and torrecilla-salinas et al. (2016), to quote just a few works), at the step level (sub-activities and tasks) they differ to some extent from each other. for example, in tahir et al. (2016) three steps are modeled for phase 1: 1) necessity of slr; 2) research questions formation; and 3) review protocol formation. note that these steps differ from those shown in figure 1. moreover, in sepúlveda et al. (2016) five steps are presented for phase 1: 1) goal and need of slr; 2) define research questions; 3) define search string; 4) define inclusion and exclusion criteria; and 5) protocol validation. the same lack of consensus in naming and including steps is observed in the abovementioned works for phase 2. furthermore, in these works just a behavioral perspective is used to specify the process, so inputs, outputs and roles are not considered in these process models. although the slr pilot test activity is usually neglected, in sepúlveda et al.
(2016) the “pilot selection and extraction” step is included in phase 2. nevertheless, this pilot selection and extraction step does not iterate into (or feed back to) phase 1, which may help to improve slr design aspects, as we model in our proposed process specification in figure 2. in garousi & mäntylä (2016) and irshad et al. (2018), we observe other adaptations or variations of the process documented in brereton et al. (2007). in irshad et al. (2018) the use of two methods called backward and forward snowballing is emphasized, while in garousi & mäntylä (2016) the snowballing activity is included in the systematic review process. table 1 summarizes the analyzed features of the slr processes considered in this related work section.

on the other hand, it is important to remark that while slrs are focused on gathering and summarizing evidence from primary or secondary studies, systematic mapping (sm) studies are used to structure (categorize) a research area. according to marshall & brereton (2013), a sm is a more 'open' form of slr, which is often used to provide an overview of a research area by assessing the quantity of evidence that exists on a particular topic. in petersen et al. (2015), the authors performed a sm study of systematic maps to identify how the sm process is conducted and to identify improvement potentials in conducting the sm process. although there are differences between slrs and sms regarding the aim of the research questions, search process, search strategy requirements, quality evaluation and results (kitchenham et al. 2010a), the process followed in petersen et al. (2015) is the same as that used for slrs. therefore, we can envision that our proposed process can be used for both slr and sm studies. what can differ is the use of different methods and techniques for some activities and tasks, mainly for the analysis, since, as mentioned above, the aim and scope of both are not the same, as also analyzed in the napoleão et al. (2017) tertiary study. in summary, as an underlying hypothesis, the existing gap in the lack of standardization of the slr and sm processes currently used by the scientific communities can be minimized if we consider more appropriately the principles and benefits of process modeling enumerated in the introduction section.

table 1. papers analyzed in this related work section considering some features of slr process specifications.

paper | functional perspective | behavioral perspective | organizational perspective | informational perspective | used notation | pilot test activities
biolchini et al. 2005 | ✓ | ✓ | | (*) | uml |
brereton et al. 2007 | | ✓ | | | informal (using boxes and arrows) |
garousi & mäntylä 2016 | | ✓ | | | informal (with legend of the used elements) |
irshad et al. 2018 | | | | | (**) |
sepúlveda et al. 2016 | | ✓ | | | uml | ✓
tahir et al. 2016 | | ✓ | | | informal (using boxes and arrows) |
tebes et al. 2019a | ✓ | ✓ | | ✓ | spem | ✓
torrecilla-salinas et al. 2016 | | ✓ | | | informal (using circles, boxes and arrows) |

(*) the specification of the informational perspective is represented by a text-based work-product breakdown structure.
(**) it has no graphical representation for the followed slr process; however, the authors adopt the kitchenham & charters (2007) process and the wohlin (2014) guidelines for snowballing.
3 augmented specification of the slr process

considering that there is no generalized consensus yet on the terminology used in the process domain, we first introduce the meaning of some terms used in this work and then focus on the slr process specification. note that the terms considered below are taken from the process core ontology (processco) documented in becker et al. (2015). in this work, a process is composed of activities. in turn, an activity can be decomposed into tasks and/or into activities of a lower level of granularity called sub-activities. a task is considered an atomic element (i.e., it cannot be decomposed). besides, process, activity and task are considered work (entity) definitions, which indicate 'what' to do. every work definition (process/activity/task) consumes, and modifies and/or produces, work products. a particular work product type is the artifact (e.g., diagrams, documents, among others). additionally, methods are resources that indicate 'how' to carry out the description of a work definition. in processco, many methods may be applicable to one work description. lastly, an agent is a performer assigned to a work definition in compliance with a role. in turn, the role term is defined as a set of skills (abilities, competencies and responsibilities) that an agent ought to own in order to perform a work definition.

regarding the main aim of this section, figure 2 illustrates the proposed slr process from the behavioral perspective using spem (omg 2008). there are several process modeling languages, such as bpmn (omg 2011), spem and the uml activity diagram (omg 2017), which are the most popular in academia and industry. their notations are very similar considering different desirable features such as expressiveness (i.e., the amount of supported workflow patterns) and understandability, among others (portela et al. 2012, russel et al. 2006, white 2004). from the functional, behavioral and organizational perspectives, spem, uml and bpmn are all suitable modeling languages. however, for the informational perspective, bpmn is not a suitable language. spem allows the use of both the bpmn business process diagram and the uml activity diagram, among other diagrams, like the uml class diagram, for specifying all process perspectives.

figure 2. behavioral perspective of the proposed slr process.
figure 3. functional and behavioral perspectives of the proposed slr process.

as seen in figure 2, our proposed process, like the original process (brereton et al. 2007), has three main activities: (a1) design review, (a2) implement review and (a3) analyze and document review. in turn, these activities group sub-activities and tasks. note that for the design search protocol sub-activity the included tasks are shown as well, while they are not for the rest of the a1 sub-activities. this is done intentionally to communicate that sub-activities have tasks, while at the same time not giving all the details, in order to preserve the legibility of the diagram. as the reader can notice, our process specification has more details than other currently used models for slrs from the behavioral perspective standpoint. for example, we introduce decision nodes (diamonds in figure 2) to represent iterations (e.g.,
between validate slr design and improve slr design in a1) and to convey that some activities/tasks might not be performed (e.g., the perform slr pilot study sub-activity in a2 is optional). consequently, our process helps to indicate explicitly to researchers and practitioners that the slr process is not totally sequential. it is worth mentioning that figure 2 shows a recommended flow for the slr process; in other words, it represents the "slr-to-be" rather than the "slr-as-is" process. we are aware that in a process instantiation there might be some variation points, including the parallelization of some tasks, and so on, as we discuss to some extent later on.

furthermore, aimed at enriching the process specification, in figure 3 we consider the functional perspective jointly with the behavioral perspective. therefore, throughout the entire process, we can see the flow of activities and the work products consumed and produced in each activity/task. the functional perspective is very important for checking which documents are needed to perform a task and which documents should be produced, serving for verification purposes as well. unfortunately, the functional perspective is often neglected in current slr process proposals.

considering that a slr is a very time-consuming endeavour that can hardly be faced by just one person, usually several researchers are involved, playing different roles. the slrs with the highest quality should have input from experts in the subject being reviewed, in the different methods for search and retrieval, and in qualitative and quantitative analysis methods, among many other aspects. therefore, the organizational perspective can be used to show the different roles involved in a slr process, as represented in figure 4. among these roles, a1 includes the slr designer (or research librarian), whose agent should develop comprehensive search strategies and identify appropriate libraries. this role, in conjunction with the analysis expert designer, is also needed for the design of the data extraction form as well as for the definition of potential analysis methods. a1 also includes the slr validator role, whose agent should have expertise in conducting systematic reviews, and the domain expert role, which should be played by an agent aimed at validating the protocol and clarifying issues related to the topic under investigation. table 2 describes the responsibilities and/or capabilities required by the different roles. note that an agent can play different roles and, in turn, a role can be played by one or more agents (or even by a team). for example, in a given slr the data collector role is frequently played by several researchers, since the extract data from a sample of documents and extract data from all documents sub-activities are very time consuming and require a lot of effort. in the following sub-sections, the three main activities are described considering their sub-activities and tasks, sequences, inputs and outputs, and roles, from the functional, behavioral and organizational perspectives. additionally, to enrich the process specifications, in some cases the informational perspective is used, as illustrated later on.

table 2. definitions of roles for the slr process.

role | definition (in terms of responsibility/capability)
analysis expert designer | a researcher responsible for identifying the suitable qualitative/quantitative data analysis methods and techniques to be used. the agent that plays this role should also be capable of managing documentation and visualization techniques.
data analyzer | responsible for conducting the data analysis.
data collector | responsible for extracting data from primary or secondary studies.
domain expert | a researcher or practitioner with knowledge, skills and expertise in a particular topic or domain of interest.
expert communicator | a researcher with rhetoric and oratory skills who communicates the slr results to an intended community/audience.
slr designer | a researcher with knowledge and skills for designing and specifying slr protocols.
slr expert researcher | a researcher with knowledge and expertise in conducting slr studies.
slr performer | a researcher with knowledge and skills for retrieving documents. the agent that plays this role should be expert in using search engines and document retrieval methods and techniques.
slr validator | a researcher with expertise in slr for checking the suitability and validity of a slr design.
3.1 to design the review (a1)

the main objective of the design review (a1) activity is to design the slr protocol. to achieve this, the tasks and activities depicted in the light-blue box in figure 3 should be performed, following the represented flow and the input and output artifacts.

as shown in figure 3, the first task is to specify research questions, which consumes the “slr information need goal specification” artifact. this artifact contains the goal purpose and the statement established by the researchers, which guide the review design. then, from the “research questions”, the design search protocol activity is carried out. this includes the specify search string and identify metadata for search tasks, as well as the select digital libraries sub-activity. in turn, the latter includes the define digital libraries selection criteria and identify digital libraries tasks, as represented in figure 5. examples of digital library selection criteria can be the target language and the library domain, among others. the selection of digital libraries can determine the scope and validity of the reviewers' conclusions. as a result of the design search protocol activity, the “search protocol” is obtained, which includes a search string consisting of terms and logical operators, the metadata on which the search will be applied (e.g., title and abstract) and the selected digital libraries (e.g., ieee, acm, springer link, google scholar, among others).

from the “search protocol” and “research questions” artifacts, it is possible to execute the define selection and quality criteria sub-activity. this produces the “selection criteria” and “quality criteria” artifacts. the criteria can be indicated in a checklist with the different items to be considered. the “selection criteria” artifact documents the set of inclusion and exclusion criteria, i.e., the guidelines that determine whether an article will be considered in the review or not. reviewers should ask: is the study relevant to the review's purpose? is the study acceptable for review? to answer these questions, reviewers formulate inclusion and exclusion criteria (a minimal sketch at the end of this sub-section illustrates such criteria as executable checks). each systematic review has its own goal purpose and research questions, so its inclusion and exclusion criteria are usually unique (except for a replication).
however, inclusion and exclusion criteria typically belong to one or more of the following categories: (a) study population, (b) nature of the intervention, (c) outcome variables, (d) time period, (e) cultural and linguistic range, and (f) methodological quality (meline 2006).

figure 4. organizational perspective of the proposed slr process.
figure 5. functional and behavioral perspectives for the select digital libraries sub-activity.

note that in figure 6 the “inclusion criteria” and “exclusion criteria” artifacts are part of the “selection criteria” artifact. the “quality criteria” artifact documents features that allow evaluating the quality of the retrieved studies in a2, as well as identifying relevant or desirable aspects for the researchers. sometimes, quality criteria are used like inclusion/exclusion criteria (or to build them) because they are very important for selecting studies of high quality, from which reliable results and conclusions can be derived (kitchenham et al. 2004, kitchenham & charters 2007). in other cases, researchers do not plan to exclude any studies based on the quality criteria. using “quality criteria” as “selection criteria” is a critical decision: if the inclusion criteria are too broad, poor quality studies may be included, lowering the confidence in the final result; but if the criteria are too strict, the results are based on fewer studies and the yielded evidence may not be generalizable (lam & kennedy 2005).

as shown in figure 3, the next activity is design data extraction form. as output, the “template of the data extraction form” is yielded, whose fields are defined from the “research questions” and “quality criteria”. this template will be used in a2 to collect information about each selected article. note in figure 4 that this activity is performed by the slr designer and the analysis expert designer. the former should have the knowledge and skills to design and specify the data extraction form, while the latter should have the expertise to identify the data types required for analysis purposes (see the annotation for the design data extraction form sub-activity in figure 3).

then, all the artifacts produced up to this moment should be validated. to validate the slr design implies reviewing such documents in order to detect problems or opportunities for improvement. usually, researchers with expertise in conducting slrs perform this activity (see the slr expert researcher and slr validator definitions in table 2). as an outcome, the “slr protocol” document is obtained, which contains all the artifacts previously produced, as represented in the informational perspective in figure 6. lastly, it is worth mentioning that the “slr protocol” document may be in an approved, corrected or disapproved state. in the latter case, a list of “detected problems/suggested improvements” should also be produced. this artifact will serve as input to the improve slr design activity, which includes tasks such as correct research questions and correct search string, among others, in order to introduce changes into the “slr protocol”, i.e., to introduce corrections to improve it. once the protocol has been corrected, the validate slr design activity is performed again, aimed at checking that the corrected protocol complies with the “slr information need goal specification”. ultimately, the a1 activity ends when the “slr protocol” is approved.
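as announced above, the following minimal sketch (ours, not the authors' tooling) renders inclusion and exclusion criteria as executable checks over a bibliographic record; the checks paraphrase wp 1.3 of table 3 in section 4, and the record's field names are assumptions made for illustration:

    # a hedged sketch (ours): selection criteria as executable checks.
    REFERENCE_YEAR = 2019  # assumed reference year for "last 15 years"

    def satisfies_selection_criteria(study: dict) -> bool:
        """return True if all inclusion and no exclusion criteria hold."""
        inclusion = [
            study["year"] >= REFERENCE_YEAR - 15,       # published in the last 15 years
            study["area"] == "computer science",        # belongs to the cs area
            study["documents_testing_ontology"],        # documents a testing ontology
            study["research_based"],                    # based on research
        ]
        exclusion = [
            study["type"] in {"prologue", "summary", "review",
                              "interview", "news", "poster"},  # non-article formats
            not study["primary_study"],                        # not a primary study
            study["language"] != "english",                    # not written in english
        ]
        return all(inclusion) and not any(exclusion)

    candidate = {"year": 2017, "area": "computer science",
                 "documents_testing_ontology": True, "research_based": True,
                 "type": "article", "primary_study": True, "language": "english"}
    assert satisfies_selection_criteria(candidate)

expressing the criteria this way makes the checklist nature of the “selection criteria” artifact explicit: each item either holds or does not for a given record.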
3.2 to implement the review (a2)

the main objective of a2 is to perform the slr. the pink box in figure 3 shows the different sub-activities and tasks of a2, together with their input and output artifacts. note that for first-time cases, where a study is not a repeated or replicated one, performing a pilot test first is recommended, aimed at fitting the “slr protocol” produced in the a1 activity. note also that this concern is usually neglected or poorly specified in other existing slr/sm processes.

when the realization of a slr pilot study (a2.1) is taken into account, the first task to be enacted by the slr performer is select digital libraries for pilot study (see figure 7, which mainly emphasizes the flow of tasks and activities for the pilot study). this consists of choosing a subset of libraries (usually one or two) from the “selected digital libraries” artifact produced in a1. then, the execute search protocol task is enacted on the selected libraries, considering the “search string” and the “metadata for search”. as an outcome, a list of “pilot study retrieved documents” is produced. from this list, in the apply selection criteria activity, the articles are downloaded and filtered considering the “inclusion criteria” and “exclusion criteria”. this results in the “pilot study selected documents”.

figure 6. informational perspective for the design review (a1) activity.

from this subset of documents, the extract data from a sample of documents sub-activity is done by data collectors (see figure 8). this activity involves the select sample of documents task, which can be done randomly (kitchenham 2004). then, for each document, the extract data from sample task is performed by using the “template of the data extraction form”. note that data is extracted from only one sample, since the aim of the pilot test is just to analyze how suitable the protocol being followed is. if more than one data collector will use the forms in the final review, then it is recommended that more than one data collector participate in the pilot study data extraction. testing the forms by different data collectors can be useful for finding inconsistencies. finally, considering all the parts that integrate the “slr protocol” artifact (figure 6) as well as the “forms with pilot extracted data”, the analyze suitability of the slr design activity is performed by the slr validator and the domain expert. this analysis permits adjusting the data extraction form, in addition to other protocol aspects such as the research questions, search string and/or selection criteria. for example, a method to validate the search string is checking whether a set of known papers is recovered among the “pilot study selected documents”. when no problem is detected in the protocol, the perform slr activity (a2.2) is carried out. however, if a problem is detected or there is an opportunity for improvement, the improve slr design and validate slr design activities should be carried out again, as shown in the behavioral perspective specified in figure 7. once all the changes have been made and the “slr protocol” has been approved, the a2.2 sub-activity should be executed. notice in figure 7 that a new cycle of the pilot study could be performed, if necessary.

the perform slr sub-activity (a2.2) implies the execute search protocol task, now taking into account all the “selected digital libraries”. the slr performer must apply selection criteria on the “retrieved documents” in order to filter out those that do not meet the criteria defined in a1. as an artifact, the “selected documents” list is yielded, serving as input to the add non-retrieved relevant studies sub-activity. this activity is usually performed using a citation-based searching method, for example, the forward snowballing method (i.e., finding papers that cite the papers found by a search process) or the backward snowballing method (i.e., looking at the references of the papers found by a search process).
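the following toy sketch (ours; wohlin (2014) gives the actual snowballing guidelines) illustrates the two citation-based methods just mentioned over a small citation graph mapping each paper to the papers it references:

    # a hedged sketch (ours) of backward and forward snowballing over a toy
    # citation graph; real candidates must still pass the selection criteria.
    from typing import Dict, List, Set

    def backward_snowballing(selected: Set[str],
                             references: Dict[str, List[str]]) -> Set[str]:
        """papers referenced by the selected ones (candidates to add)."""
        return {ref for pid in selected
                for ref in references.get(pid, [])} - selected

    def forward_snowballing(selected: Set[str],
                            references: Dict[str, List[str]]) -> Set[str]:
        """papers that cite any of the selected ones (candidates to add)."""
        return {pid for pid, refs in references.items()
                if any(r in selected for r in refs)} - selected

    # toy data mirroring how zhu & huo (2005) was found in this study:
    # the selected paper pid 468 references pid 1000 (see table 6 later on)
    references = {"468": ["1000"], "1000": []}
    print(backward_snowballing({"468"}, references))  # {'1000'}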
figure 7. behavioral perspective for the perform slr pilot study (a2.1) activity.
figure 8. functional and behavioral perspectives for the extract data from a sample of documents sub-activity.

at the end, the extract data from all documents activity is done by using the “template of the data extraction form”. this activity is performed by one or more data collectors. depending on the data collector agents' experience, the available resources, and the amount of articles, among other factors, a given article can be analyzed by one or two agents. in cases where the same article is read independently by several agents (as data collectors), the extracted data should be compared and disagreements resolved by consensus among them or by an additional researcher, maybe an agent that plays the slr expert researcher role. if each document is reviewed by just one data collector agent, for example, due to time or resource constraints, it is important to ensure that some method will be used for verifying consistency. note in figure 3 that discrepancies should be recorded in the “divergencies resolution report”, as also suggested by biolchini et al. (2005). once a2 is accomplished, the “forms with extracted data” artifact is available for the a3 activity.

3.3 to analyze and document the review (a3)

a3 is a central activity in the entire slr/sm process. the main objective of this activity is to synthesize the analysis results based on the available scientific evidence in order to draw conclusions and communicate the findings. furthermore, considering that a slr should be systematic, reproducible and auditable, the continuous documentation of the followed process, applied methods and produced artifacts is a key issue. (note that there are activities additional to those specified in figure 2 and figure 3, in which the management of a slr project, specifically the project planning and scheduling, should also be taken into account, such as documenting the artifacts in all activities and controlling their changes and versions.)

figure 3 (gray box) shows that analyze slr results is the first sub-activity to be performed in a3. this implies, in turn, the design slr analysis sub-activity, which is performed by an agent that plays the analysis expert designer role, who is responsible for identifying the suitable data analysis methods and techniques to be used. as output, the “slr analysis specification” is produced. then, the implement slr analysis sub-activity should be enacted by a data analyzer, who is responsible for conducting the data analysis. the analysis is carried out looking at the “forms with extracted data” and “slr analysis specification” artifacts. in the analysis, diverse measurement, evaluation, categorization and aggregation methods, as well as visualization means (such as tables, charts and word clouds, among others), can be used in order to answer the established research questions, e.g., to address findings on the similarities and differences between the studies, among many others.
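as a simple illustration of such aggregation (ours; the authors do not prescribe any particular tooling), the sketch below counts how often each extracted term appears across the filled extraction forms, yielding frequencies that can feed a table, a chart or a word cloud such as the one shown later in figure 12; the form excerpts are hypothetical:

    # a minimal sketch (ours): aggregating terms recorded in the data
    # extraction forms into frequencies for later analysis/visualization.
    from collections import Counter
    from typing import Dict, List

    def term_frequencies(forms: List[Dict]) -> Counter:
        """count how often each term appears across all extraction forms."""
        counts = Counter()
        for form in forms:
            # "terms" mirrors the "specified concepts ..." field of the form
            counts.update(t.lower() for t in form.get("terms", []))
        return counts

    # hypothetical excerpts from two filled forms
    forms = [
        {"pid": 347, "terms": ["test case", "test suite", "test result"]},
        {"pid": 468, "terms": ["test case", "testing activity"]},
    ]
    print(term_frequencies(forms).most_common(3))
    # e.g. [('test case', 2), ('test suite', 1), ('test result', 1)]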
as a result, the “data synthesis” artifact is produced. the synthesis is usually descriptive. however, it is sometimes possible to supplement a descriptive synthesis with quantitative summaries through meta-analysis, using arithmetical and statistical techniques appropriately. finally, an expert communicator carries out the document/communicate results sub-activity. to this end, dissemination mechanisms are first established, for example, technical reports, journal and conference papers, among others. then, the documents that convey the results to the intended community are produced. in this way, the slr process concludes. all the collected evidence and summarizations might be made publicly available for auditability reasons.

4 application of the proposed slr process on primary studies

in olsina & becker (2017), a family of evaluation strategies is presented. any strategy in this approach, regardless of whether it is devoted to evaluation, testing, development or maintenance goal purposes, should integrate well-established terminology, process and method specifications. to broaden the family, we are in the process of developing software testing strategies. so first, our current aim is to build a software testing domain terminology. in this direction, in tebes et al. (2019a) a slr on software testing ontologies was designed and documented, i.e., the a1 and a2.1 activities were carried out. then, in tebes et al. (2019b), we documented the whole study, including the a1-a3 activities. the result allows us to establish the grounds for developing a top-domain software testing ontology that should be integrated into the conceptual framework already developed for the existing evaluation strategies. in order to illustrate the proposed slr process, next we introduce the rationale for our research to contextualize the slr study on software testing ontologies described later on.

4.1 rationale for the slr study on software testing ontologies

a strategy is a core resource of an organization that defines a specific course of action to follow, i.e., it specifies what to do and how to do it. consequently, strategies should integrate a process specification, a method specification, and a robust domain conceptual base (becker et al. 2015). this principle of integratedness promotes, therefore, knowing what activities are involved and how to carry them out by means of methods, in the framework of a common domain terminology. in olsina & becker (2017), a family of strategies integrating the three abovementioned capabilities to achieve evaluation purposes is discussed. the conceptual framework for this family of evaluation strategies is called c-incami v.2 (contextual-information need, characteristic model, attribute, metric and indicator) (becker et al. 2015, olsina & becker 2018). this conceptual framework was built on vocabularies or terminologies, which are structured in ontologies. figure 9 depicts the different c-incami v.2 conceptual components or modules, where the gray-shaded ones are already developed. the ontologies for non-functional requirements (nfrs), nfrs view, functional requirements (frs), business goal, project, and context are defined in olsina & becker (2018), while those for measurement and evaluation are defined in becker et al. (2015). the remaining ontologies (for testing, development and maintenance) are not built yet.
bearing in mind that there are already integrated strategies that provide support for achieving evaluation purposes, the reader can assume that strategies providing support for achieving testing purposes are feasible to develop as well. given that a strategy should integrate a well-established domain terminology, a well-specified testing strategy should also have this capability for the testing domain. a benefit of having a suitable software testing ontology is that it would minimize the heterogeneity and ambiguity problems that we currently observe in the different concepts dealing with testing methods and processes. in this direction, we conducted the slr study on software testing ontologies (tebes et al. 2019b) in order to establish the suitable top-domain testing ontology to be integrated into the c-incami v.2 conceptual framework. that is, we envision populating the testing conceptual component shown in figure 9 and linking it with the frs and nfrs components. next, we illustrate the a1-a3 activities presented in section 3 using excerpts of the tebes et al. (2019b) study. members of the gidis_web research group performed the a1 and a2.1 activities between the end of may and the beginning of august 2018, while members of the ort uruguay research group helped at this stage with the validation of the slr protocol. then, members of both research groups performed the a2.2 activity between the end of october 2018 and the beginning of march 2019.

4.2 to design the review (a1) for the software testing ontologies study

as observed in the functional and behavioral perspectives in figure 3, to start a1 the “slr information need goal specification” is required. in this case, the information need establishes that papers documenting software testing ontologies from digital libraries must be systematically analyzed. from the main goal established for this slr, two “research questions” were initially formulated, namely: (rq1) what are the existing ontologies for the software testing domain? and (rq2) what are the relevant concepts, their relationships, attributes and constraints or axioms needed to describe the software testing domain? answering rq1 will allow us to identify and analyze the different existing software testing ontologies. rq2 will let us know the terms (or concepts), their relationships, attributes or properties, and the restrictions needed to specify an ontology for the testing domain. then, the “search protocol” was designed. taking into account rq1, the following search string was initially proposed: "software testing" and ("ontology" or "conceptual base"). for this particular study, the search string was applied on the three selected metadata, namely: title, abstract and keywords. (note that the search string could also be applied to the full text.) finally, the digital libraries selected for the revision were scopus, ieee xplore, acm digital library, springer link and science direct. for the “selection and quality criteria” defined in this case study, see wp 1.3 in table 3. then, based on the research questions and the quality criteria, a set of fields for the data extraction was defined by the slr designer and the analysis expert designer (see wp 1.4 in table 3).
for example, to extract the terms, properties, relationships and axioms from each article, the “relevant concepts used to describe software testing domain” field was specified as follows (not shown in wp 1.4 of table 3):

    terms:
      termN: definition
    attributes:
      termN (attributeN.1=definition, attributeN.2=definition, ...)
    relationships:
      is_a (termX_type, termY_subtype)
      part_of (termX_whole, termY_part)
      relationM_name (termX, termY): definition
      relationZ_without_name (termX, termY): definition
    axioms or restrictions:
      axiom: definition/specification

the “data extraction form” allows obtaining homogeneity between the data extracted from each document, thus facilitating the task of analyzing them. once the produced artifacts were validated, the “slr protocol” was obtained as the result of a1. table 3 shows all the artifacts that integrate this document, which correspond to those specified in the informational perspective of figure 6.

figure 9. conceptual components (modules) and their relationships for the c-incami v.2 framework. note that nfrs stands for non-functional requirements while frs stands for functional requirements.

4.3 to implement the review (a2) for the software testing ontologies study

4.3.1 to perform the slr pilot study (a2.1)

looking at the activity flow shown in the behavioral perspective of figure 2, we conducted a pilot study to analyze the suitability of the “slr protocol”. as part of the a2.1 execution, from the “selected digital libraries” in a1 (see wp 1.2.3 in table 3), scopus was selected for this pilot test because it contains digital resources from various sources, such as elsevier, ieee xplore and acm digital library. as a result of carrying out the execute search protocol and apply selection criteria activities, 19 documents were obtained (tebes et al. 2018), which were reviewed by three data collectors of the gidis_web research group to extract data from a sample of documents (recall figure 8). once a2.1 was completed, a list of “detected problems/suggested improvements” was produced and the improve slr design activity was run (as prescribed in figure 3 through the functional and behavioral perspectives). table 4 shows the updates (highlighted in blue and underlined in the original) that the “slr protocol” underwent after the pilot study. next, some changes are described considering the “detected problems/suggested improvements” document.

in the rq1 research question, the "existing" term was replaced by the "conceptualized" term. the former term is broader than the latter, including both conceptualizations and implementations of software testing ontologies; however, our main goal is retrieving conceptualized ontologies, regardless of whether they are implemented or not. on the other side, the "relevant" term in the rq2 research question in table 3 negatively influenced the number of terms extracted by each data collector. therefore, the research question was reformulated as observed in wp 1.1 in table 4. in addition, the "relevant concepts used [...]" field in the form (see wp 1.4) was changed to "specified concepts [...]". this change made the extraction more objective and easier to interpret than with the initial design. moreover, the full reading of the articles during the pilot study allowed us to detect that ontologies of various types were presented, such as foundational ontologies, top-domain ontologies and domain ontologies.
since the final aim after executing the slr was to adopt, adapt or build a new top-domain ontology, this information turned out to be relevant. consequently, a new research question (rq3 in wp 1.1 in table 4) and the "classification of the proposed ontology" field in the “template of the data extraction form” (see wp 1.4 in table 4) were added.

table 3. “slr protocol” artifact produced in a1 for the software testing ontologies study. (note that wp stands for work product.)

research questions (wp 1.1)
rq1: what are the existing ontologies for the software testing domain?
rq2: what are the relevant concepts, their relationships, attributes and constraints or axioms needed to describe the software testing domain?

search protocol (wp 1.2)
search string (wp 1.2.1): "software testing" and ("ontology" or "conceptual base")
metadata for search (wp 1.2.2): title; abstract; keywords
selected digital libraries (wp 1.2.3): scopus, ieee xplore, acm digital library, springer link and science direct

selection and quality criteria (wp 1.3)
inclusion criteria (wp 1.3.1):
1) that the work be published in the last 15 years;
2) that the work belongs to the computer science area;
3) that the work documents a software testing ontology;
4) that the document is based on research (i.e., it is not simply a "lesson learned" or an expert opinion).
exclusion criteria (wp 1.3.2):
1) that the work be a prologue, article summary or review, interview, news, discussion, reader letter, or poster;
2) that the work is not a primary study;
3) that the work is not written in english.
quality criteria (wp 1.3.3):
1) is/are the research objective/s clearly identified?
2) is the description of the context in which the research was carried out explicit?
3) was the proposed ontology developed following a rigorous and/or formal methodology?
4) was the proposed ontology developed considering also its linking with functional and non-functional requirements concepts?

template of the data extraction form (wp 1.4)
researcher name; article title; author/s of the article; journal/congress; publication year; digital library; name of the proposed ontology; relevant concepts used to describe software testing domain; methodology used to develop the ontology; research context; research objective/s; does the proposed ontology consider its linking with functional and non-functional requirements concepts?; additional notes.

also, the search string was slightly modified (compare wp 1.2.1 in table 3 and table 4) because not all search engines take into account variations or synonyms of the words used. inclusion criterion 1 in wp 1.3.1 (table 3) is not very specific; therefore, it was modified as observed in wp 1.3.1 of table 4. the full reading of the articles also made it possible to detect that some of them were different versions (or fragments) of the same ontology; therefore, exclusion criteria 5 and 7 were added (see wp 1.3.2 in table 4). on the other hand, since the searches in scopus retrieve documents that belong to other digital libraries, exclusion criterion 6 of wp 1.3.2 was added to eliminate duplicates. finally, we also observed that some ontologies were built taking into account other terminologies, which may add a quality factor to the new proposal.
for this reason, quality criterion 5 was added in wp 1.3.3 of table 4, which implies a new field in the “template of the data extraction form” (see "terminologies or vocabularies taken into account [...]" in wp 1.4). this new quality criterion may prove to be useful information in the construction process of any ontology. the reader can check the final “template of the data extraction form” artifact in appendix a.

table 4. the new “slr protocol” version after the pilot study (a2.1) activity was performed. (note that changes are indicated in blue and underlined in the original w.r.t. the information shown in table 3.)

research questions (wp 1.1)
rq1: what are the conceptualized ontologies for the software testing domain?
rq2: what are the most frequently included concepts, their relationships, attributes and axioms needed to describe the software-testing domain?
rq3: how are existing software testing ontologies classified?

search protocol (wp 1.2)
search string (wp 1.2.1): ("software testing" or "software test") and ("ontology" or "ontologies")
metadata for search (wp 1.2.2): title; abstract; keywords
selected digital libraries (wp 1.2.3): scopus, ieee xplore, acm digital library, springer link and science direct

selection and quality criteria (wp 1.3)
inclusion criteria (wp 1.3.1):
1) that the work be published in the last 15 years (from the beginning of 2003 until november 12, 2018);
2) that the work belongs to the computer science area or to the software/system/information engineering areas;
3) that the document has the ontological conceptualization of the testing domain (i.e., it is not simply a "lesson learned or expert opinion" or just an implementation).
exclusion criteria (wp 1.3.2):
1) that the work be a prologue, article summary or review, interview, news, discussion, reader letter, poster, table of contents or short paper (a short paper is considered to be one of up to 4 pages);
2) that the work is not a primary study;
3) that the work is not written in english;
4) that the work does not document a software testing ontology;
5) that the ontology presented in the document be an earlier version than the most recent and complete one published in another retrieved document;
6) that a same document be the result of more than one bibliographic source (i.e., it is duplicated);
7) that the conceptualized ontology in the current document be a fragment of a conceptualized ontology in another retrieved document.
quality criteria (wp 1.3.3):
1) is/are the research objective/s clearly identified?
2) is the description of the context in which the research was carried out explicit?
3) was the proposed ontology developed following a rigorous and/or formal methodology?
4) was the proposed ontology developed considering also its linking with functional and non-functional requirements concepts?
5) what other terminologies of the software testing domain were taken into account to develop the proposed ontology?
template of the data extraction form (wp 1.4)
researcher name; article title; author/s of the article; journal/congress; publication year; digital library; name of the proposed ontology; specified concepts used to describe software testing domain; methodology used to develop the ontology; terminologies or vocabularies taken into account to develop the proposed ontology; classification of the proposed ontology; research context; research objective/s related to software testing ontologies; does the proposed ontology consider its linking with functional and non-functional requirements concepts?; additional notes.

4.3.2 to perform the slr (a2.2)

six agents (four researchers from gidis_web and two researchers from ort uruguay) performed the a2.2 activity. figure 10 shows the execute search protocol and apply selection criteria work definitions instantiated for the slr project on software testing ontologies. the execute search protocol task was performed by both research groups. particularly, gidis_web agents (green shading in the figure) retrieved documents from the scopus, acm and ieee xplore digital libraries while, in parallel, ort agents (orange shading) retrieved documents from the springer link and science direct digital libraries. the workload for carrying out the execute search protocol tasks was balanced, i.e., two members of each group participated. the remaining two members of gidis_web acted as domain expert and slr validator, coordinating and guiding the other researchers throughout a2.2. additionally, figure 10 shows the tasks instantiated in this slr project in order to perform the apply selection criteria sub-activity. note that these tasks are scheduled taking into account the inclusion and exclusion criteria artifacts of our project (see the wp 1.3.1 and wp 1.3.2 artifacts in table 4). it is important to remark that, from the project scheduling standpoint, the execute search protocol task in the generic process in figure 3 was, for instance, instantiated twice in figure 10, considering the actual project particularities. note also that the apply selection criteria sub-activity is shown at the task level in figure 10, considering the actual project particularities. therefore, the process models represented in section 3 should be customized for any specific project life cycle and context.

figure 10. instantiation of the tasks for the execute search protocol and apply selection criteria work definitions. note that ic and ec stand for inclusion criteria and exclusion criteria.

as a result of the execute search protocol tasks, 731 documents were retrieved. figure 11 shows the amount of “retrieved documents” per digital library, following the specification of the “search protocol” artifact (see wp 1.2 in table 4): springer link 443 (61%), scopus 181 (25%), ieee xplore 88 (12%), acm dl 18 (2%) and science direct 1 (0%).

figure 11. “retrieved documents” per digital library after performing execute search protocol.

table 5 records the number of “selected documents” produced after performing the apply selection criteria sub-activity: applying the different inclusion/exclusion criteria over the analyzed content, as presented in the table's first and second columns, yielded 10 selected primary studies. considering the initial and final states, the reduction rate reached is 98.6%.
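the selection funnel summarized in table 5 (shown next) can be checked with a few lines of arithmetic; in the following sketch of ours, each reduction percentage relates the eliminated studies to the 731 initially retrieved documents, and the printed values match the table up to rounding:

    # a small script (ours) reproducing the selection funnel of table 5.
    initial = 731
    steps = [  # (criteria applied, analyzed content, studies eliminated)
        ("ic1, ic2, ec1", "title+abstract", 209),
        ("ec6 (duplicates)", "title+abstract", 47),
        ("ec2, ec3, ec4", "full text", 456),
        ("ic3, ec5, ec7", "full text", 9),
    ]
    remaining = initial
    for criteria, content, eliminated in steps:
        remaining -= eliminated
        print(f"{criteria} on {content}: -{eliminated} -> {remaining} selected "
              f"({eliminated / initial:.1%} reduction)")
    print(f"total: {initial - remaining} eliminated "
          f"({(initial - remaining) / initial:.1%} reduction)")  # 721, 98.6%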
table 5. “selected documents” after performing the apply selection criteria sub-activity. note that ic and ec stand for inclusion criteria and exclusion criteria.

criteria | analyzed content | initial studies | eliminated | selected documents | reduction
ic1, ic2, ec1 | title+abstract | 731 | 209 | 522 | 28.5%
ec6 (duplicates) | title+abstract | 522 | 47 | 475 | 6.4%
ec2, ec3, ec4 | full text | 475 | 456 | 19 | 62.3%
ic3, ec5, ec7 | full text | 19 | 9 | 10 | 1.2%
total | | | 721 | | 98.6%

the next activity carried out was the add non-retrieved relevant studies sub-activity, as depicted in figure 3. this was performed just by gidis_web members using two different methods, namely: backward snowballing and prior knowledge of other research work. table 6 shows the “final selected documents” after enacting the add non-retrieved relevant studies sub-activity. this totaled 12 primary studies that passed all inclusion and exclusion criteria (see table 6), namely: the 10 selected documents, plus 1 document (pid 1000) retrieved by backward snowballing, plus 1 document (pid 1003), a master's thesis by asman & srikanth (2016) retrieved through prior knowledge of other research work from the researchgate network by the end of 2017.

table 6. “final selected documents” after performing add non-retrieved relevant studies.

pid | title | authors | reference | digital library | observations
118 | an ontology based approach for test scenario management | sapna p. g.; mohanty h. | sapna & mohanty (2011) | springer link |
346 | an ontology for guiding performance testing | freitas a.; vieira r. | freitas & vieira (2014) | acm | duplicated: scopus, ieee xplore, acm
347 | regression tests provenance data in the continuous software engineering context | campos h.; acácio c.; braga r.; araújo m. a. p.; david j. m. n.; campos f. | campos et al. (2017) | acm |
359 | ontology-based test modeling and partition testing of web services | bai x.; lee s.; tsai w. t.; chen y. | bai et al. (2008) | ieee xplore | duplicated: scopus, ieee xplore
366 | test case reuse based on ontology | cai l.; tong w.; liu z.; zhang j. | cai et al. (2009) | ieee xplore |
383 | an ontology-based knowledge sharing portal for software testing | vasanthapriyan s.; tian j.; zhao d.; xiong s.; xiang j. | vasanthapriyan et al. (2017b) | ieee xplore | duplicated: scopus, ieee xplore
397 | ontology-based development of testing related tools | barbosa e. f.; nakagawa e. y.; riekstin a. c.; maldonado j. c. | barbosa et al. (2008) | scopus |
424 | semi-automatic generation of a software testing lightweight ontology from a glossary based on the onto6 methodology | arnicans g.; romans d.; straujums u. | arnicans et al. (2013) | scopus |
457 | an ontology-based knowledge framework for software testing | vasanthapriyan s.; tian j.; xiang j. | vasanthapriyan et al. (2017a) | scopus | duplicated: scopus, springer link
468 | roost: reference ontology on software testing | souza é. f.; falbo r. a.; vijaykumar n. l. | souza et al. (2017) | scopus |
1000 | developing a software testing ontology in uml for a software growth environment of web-based applications | zhu h.; huo q. | zhu & huo (2005) | | retrieved from paper pid 468 using backward snowballing
1003 | a top domain ontology for software testing | asman a.; srikanth r. m. | asman & srikanth (2016) | | retrieved by prior knowledge

finally, the extract data from all documents sub-activity should be performed, which has as input the “template of the data extraction form” (appendix a) and the “final selected documents” (table 6), and as output the “forms with extracted data” artifact. appendix b shows the filled form with extracted data for the pid 347 article.
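to give a feel for the informational structure of a filled extraction form (appendix b in the original study), the following sketch of ours renders the pid 347 entry as a plain record; the field names follow wp 1.4 of table 4, the bibliographic values are taken from table 6, and everything else is a placeholder rather than the actual appendix b content:

    # an indicative sketch (ours), not the actual appendix b content:
    # one filled data extraction form represented as a plain record.
    form_pid_347 = {
        "researcher name": "<data collector>",      # placeholder
        "article title": "regression tests provenance data in the "
                         "continuous software engineering context",
        "author/s of the article": "campos h.; acácio c.; braga r.; "
                                   "araújo m. a. p.; david j. m. n.; campos f.",
        "publication year": 2017,
        "digital library": "acm",
        "name of the proposed ontology": "rteontology",
        "specified concepts used to describe software testing domain": {
            "terms": {"...": "..."},                # see appendix b of the study
            "attributes": [],
            "relationships": [],
            "axioms or restrictions": [],
        },
        "additional notes": "...",
    }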
this sub-activity must be performed in a very disciplined and rigorous way, and it is also very time consuming. the work distribution for the extract data from all documents sub-activity was as follows. two members of gidis_web, as data collectors, completely gathered the required data for the 12 documents. at random, we selected 4 out of the 12 documents, which were made available (via google drive) for data collection by two members at universidad ort uruguay, but not shared with each other while gathering the data, in order to permit a more objective consistency check later. as a result of this check on the instantiated forms of both groups, some minor issues were raised and discrepancies were resolved by consensus via video chat. (note that the data and quality extraction reliability could be evaluated. for example, in the study documented in kitchenham & brereton (2013), the authors checked the level of agreement achieved for data and quality extraction. in the case of quality extraction, the pearson correlation coefficient was computed between the values of each assessor for each paper, both for the number of appropriate questions and for the average quality score of each paper. in the case of data extraction, the agreement with respect to the study categories was assessed using the kappa statistic.)

it is worth mentioning that, thanks to looking for inconsistencies, we detected that the concepts (i.e., terms, properties, etc.) collected in the form for the rteontology (campos et al. 2017) included not only the software-testing domain-specific terms and relationships but also those related to the core or high-level ontology called prov-o (lebo et al. 2013). hence, we decided to document both in a differentiated way (by colors in the form) so that, at analysis time, only the domain-related concepts would be counted. for instance, there are 17 terms in campos et al. (2017), but just 14 are domain-specific software testing terms not taken from prov-o. a similar counting situation happened with roost (souza et al. 2017), but in this case the terms were taken from ufo (a foundational ontology built by the same research group). lastly, all the issues raised during the extract data from all documents sub-activity were documented in the “divergencies resolution record” artifact. ultimately, all the main slr artifacts for this study, including the filled forms with the extracted data for the 12 “final selected documents”, can be publicly accessed at https://goo.gl/hxy3yl.

4.4 to analyze and document the review (a3) for the software testing ontologies study

a3 includes two sub-activities. the first sub-activity, named analyze slr results, produces the “data synthesis” artifact. to produce this artifact, playing the analysis expert designer role, we designed and specified a set of direct and indirect metrics (olsina et al. 2013). below, we show just the formulas (not their whole specifications) for some indirect metrics:

%dftr = (#dftr / #tr) * 100   (1)
#rh = #txrh + #notxrh   (2)
#txrh = #tx-is_a + #tx-part_of   (3)
%dfnotxrh = (#dfnotxrh / #notxrh) * 100   (4)
%txcptualvl = (#txrh / #rh) * 100   (5)

briefly, metric formula (1) gives the proportion of defined terms (#dftr) with regard to the total number of terms (#tr) in the conceptualized ontology. metric formula (2) calculates the total number of relationships (#rh) as the sum of the taxonomic (#txrh) and non-taxonomic (#notxrh) relationships. note that the number of taxonomic relationships is, in turn, the sum of the inheritance (#tx-is_a) relationships and the whole-part (#tx-part_of) relationships; see formula (3). metric formula (4) gives the proportion of defined non-taxonomic relationships. finally, metric formula (5) calculates the percentage of the taxonomic conceptual-base level, which is very useful to determine whether a conceptual base is actually an ontology or rather a taxonomy. (note that, for brevity, we did not include the metric formula for the ratio of specified axioms.)
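formulas (1)-(5) are straightforward to transcribe into code; the following check of ours evaluates them with the campos et al. (2017) measures and reproduces the values reported in table 7 below:

    # formulas (1)-(5) transcribed into code (our illustration), evaluated
    # with the campos et al. (2017) measures reported in table 7.
    tr, df_tr = 14, 14              # terms and defined terms
    tx_is_a, tx_part_of = 14, 1     # inheritance and whole-part relationships
    notx_rh, df_notx_rh = 1, 1      # non-taxonomic relationships, defined ones

    tx_rh = tx_is_a + tx_part_of                  # formula (3) -> 15
    rh = tx_rh + notx_rh                          # formula (2) -> 16
    pct_df_tr = df_tr / tr * 100                  # formula (1) -> 100.0
    pct_df_notx_rh = df_notx_rh / notx_rh * 100   # formula (4) -> 100.0
    pct_tx_cptual_lvl = tx_rh / rh * 100          # formula (5) -> 93.75

    print(pct_df_tr, pct_df_notx_rh, pct_tx_cptual_lvl)  # 100.0 100.0 93.75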
table 7. metrics for terms (tr), attributes (at) and relationships (rh) for the campos et al. (2017) ontology. note that df stands for defined; tx for taxonomic; notx for non-taxonomic; and cptualvl for conceptual level.

pid | #tr | #dftr | %dftr | #at | #dfat | %dfat | #rh | #txrh | #tx-is_a | #tx-part_of | #notxrh | #dfnotxrh | %dfnotxrh | %txcptualvl
347 | 14  | 14    | 100%  | 0   |       |       | 16  | 15    | 14       | 1           | 1       | 1         | 100%      | 93.75%

figure 12. word cloud (https://www.nubedepalabras.es/) for the recorded terms from the 12 data extraction forms. note that attribute and relationship names are not included. also, note that word size is related to term frequency, while term colors have no meaning.

these metrics are very easy to use, considering that the “template of the data extraction form”, particularly its “specified concepts used to describe software testing domain” field, has a design that facilitates the subsequent counting (see the structure of this field in appendix a, and one example of its instantiation in appendix b). additionally, we use a table to record the measured values. for example, table 7 shows the measures’ values for the paper pid 347 (campos et al. 2017). all these values are then considered by the data analyzer to perform the analysis. moreover, in this activity we use other tools for analysis purposes, for example, a word cloud tool for rq2 (what are the most frequently included concepts, their relationships, attributes and axioms needed to describe the software testing domain?). figure 12 shows the word cloud produced from the terms retrieved from the 12 conceptual bases. all the tables, charts, word clouds and measures are included in the “data synthesis” and used by the data analyzer to perform the analysis. in figure 13, we show an excerpt of the “data synthesis” produced for the software testing ontologies study. then, using the “data synthesis” as input, a “scientific article” (tebes et al. 2019b) was elaborated by the expert communicator and the domain expert in the document/communicate results sub-activity. in tebes et al. (2019b), the full analysis is documented: all research questions are answered and other issues, such as validity threats, are considered. however, due to the page limits of a conference paper, we are currently extending it to a journal format, where page limits are less strict; this slr on software testing ontologies will then be fully documented.

5 discussion

the process proposed by kitchenham et al., which has so far been adopted or slightly adapted by other researchers (for example, in sepúlveda et al. (2016), tahir et al. (2016), and torrecilla-salinas et al. (2016)), is rather coarse-grained, being represented using only the behavioral perspective from the process modeling standpoint.
if we compare figure 1 with figure 2 (both are representations of the behavioral perspective), we can observe, on one hand, a greater richness of expressiveness in sequences and decision flows in figure 2 and, on the other hand, the possibility of representing different levels of granularity in work definitions, such as activity, sub-activity and task. (note that, to maintain the simplicity and legibility of the diagram in figure 2, we have not specified other aspects such as iterations and parallelisms, e.g., the parallelism that can be modeled between the define quality criteria and define inclusion/exclusion criteria tasks, within the define selection and quality criteria sub-activity in figure 2.) although the behavioral perspective is undoubtedly necessary, it is usually not sufficient, since it does not represent inputs and outputs for the different activities and tasks. for this reason, the functional perspective in conjunction with the behavioral perspective enhances the model expressiveness, as shown in figure 3. furthermore, the informational perspective also enriches the slr process specification and favors the understanding of the process by showing the structure of a given artifact. this informational perspective is illustrated for the “slr protocol” artifact in figure 6, which is produced by the a1 activity. additionally, the organizational perspective shown in figure 4 helps to understand that, in order to carry out an slr, it is also necessary to cover several roles that agents (both human and automated) require, with different skills, i.e., sets of capabilities, competencies and responsibilities. on the other hand, a key aspect to be highlighted in the present proposal is the modeling of the a2.1 sub-activity (perform slr pilot study), which gives feedback to the a1 activity (design review). this pilot test activity is very often neglected in the processes adapted from brereton et al. (2007). alternatively, when it is mentioned or considered, such as in brereton et al. (2007) and sepúlveda et al. (2016), it is poorly specified. for example, in sepúlveda et al. (2016), the authors include the selection and pilot extraction activity, which is represented simply as a step in the a2 activity (named phase in sepúlveda et al. (2016)), but it does not give feedback to the a1 activity (phase). modeling the feedback loop is important because it can help to improve aspects of the slr design, as we have proposed in figure 7 and illustrated in sub-section 4.3.1, particularly in the improvement of the data extraction form. additionally, notice that in our proposal we include both the validate slr design sub-activity (which is represented in a1) and the perform slr pilot study sub-activity (which clearly must be represented in a2, since it contains tasks inherent to the execution of the slr).

figure 13. fragment of the “data synthesis” produced in the analyze slr results sub-activity.

in short, the main contribution of this work is to augment the slr process currently and widely used by the se scientific communities. this is achieved by considering, on one hand, the principles and benefits of process modeling with greater rigor, by using four modeling perspectives, namely: functional, behavioral, informational and organizational.
and, on the other hand, the vision of decoupling the ‘what’ to do aspect, which is modeled by processes, activities, tasks, artifacts and behavioral aspects, from the ‘how’ to realize the description of work definitions, which is modeled by method specifications. a lack of separation of concerns between what to do (processes) and how to do it (methods) can be observed in the cited literature (and in general). although throughout sections 3 and 4 we have indicated some methods applicable to slr tasks, the emphasis of this work is not on the specification of methods. this separation of concerns can be seen, for example, in the description of the add non-retrieved relevant studies task (recall sub-section 4.3.2), which can be carried out by using two methods, namely forward snowballing and/or backward snowballing. however, in the process models related to this sub-activity (and to others), no reference is made to these methods, nor to any others.

6 conclusion

in this work, we have documented the slr process specification by using process-modeling perspectives and mainly the spem language. it is a recommended flow for the slr process, since we are aware that in a process instantiation there might be some variation points, such as the parallelization of some tasks. it is worth noting that, regarding the benefits of process modeling quoted in the introduction section, the proposed slr process specifications aim primarily at facilitating understanding and communication, as well as at supporting process improvement and process management. however, a thorough discussion and a detailed illustration of process modeling for fully supporting slr process automation are out of the scope of this article. additionally, we have highlighted the benefits and strengths of the proposed slr process model compared with others, which are specified in a more coarse-grained manner. finally, we have illustrated aspects of it by exemplifying an slr on software testing ontologies. one important contribution is the inclusion of the pilot test activity, which promotes the validation of the “slr protocol” artifact not only in the a1 activity but also in the a2.1 sub-activity. it is worthwhile to highlight that conducting a pilot study (not for a replicated study) can be very useful to improve the slr protocol and to foster, to some extent, the quality of the evidence-based outcomes. given that so far we have performed very few pilot studies, we do not have the practical evidence to state what the most effective sample size for a pilot is. in fact, kitchenham & charters (2007) do not mention the appropriate sample size either, since, as they indicate, it also depends on the available resources. in our humble opinion, we consider that in a pilot study, in addition to the sample size, it is important to select a set of documents (or the whole sample) and have the data extraction of each document performed by more than one independent data collector. having more than one filled data extraction form for the same document allows checking for potential inconsistencies in the designed artifacts. moreover, the data validator (slr expert researcher) and the domain expert are very important roles that should be played by at least one person with expertise, in order to check (jointly with the independent data collectors) the data extraction forms for inconsistencies, in addition to detecting opportunities for improvement in the template metadata for the subsequent analysis endeavor in the a3 activity.
in a nutshell, we consider that several variables may be important for making a good/effective pilot test. however, we would need more informed evidence to tackle this issue, so it is an interesting aspect that could be further investigated by the community. the proposed process model for the slr provides a good baseline for understanding the details and discussing alternatives or customizations to this process. in fact, the rigor provided by process modeling, where several perspectives are combined (e.g., functional, informational, organizational and behavioral) but can also be independently detached, provides a greater richness of expressiveness in sequences and decision flows, while representing different levels of granularity in the work definitions, such as activity, sub-activity and task. it is worth mentioning that the specified process contributes to one pillar of a well-established slr strategy, knowing beforehand that a strategy should also integrate the method specification capability. note that, for the same task, different method specifications could be applied. in consequence, the life cycle of a given slr project should organize activities and tasks considering not only the prescribed slr process but also the appropriate allocation of resources, such as methods and tools, among others, for achieving the proposed goal. there are additional activities (beyond those specified in figure 2) that project planning must also take into account, such as documenting artifacts in all activities and controlling their changes and versions. the continuous documentation and versioning of artifacts are key factors to guarantee the consistency, repeatability and auditability of slr projects. as ongoing work, we are currently developing a supporting tool for this slr process, since remembering and using all the elements that this process provides can be very time consuming and also error prone. although a variety of tools is available to assist the slr process (marshall & brereton 2013), current tools obviously do not follow our slr process or do not fit it well. consequently, we are developing a tool that can help to automate part or all of our slr process and its documentation, in addition to favoring its usefulness and wide adoption. taking into account that the main objective of the present work is to provide models that facilitate the understanding and communication of the slr process to researchers and practitioners, rather than to give support to full slr process automation, we will need to augment, for instance, some activity specifications at the task level, in addition to providing a more flexible process flow for collaborative work. on the other hand, as ongoing work, we are currently finishing the development of a suitable top-domain testing ontology to be integrated into the c-incami v.2 conceptual framework. to this end, we took into account explicit terminologies coming not only from some of the existing primary studies, but also from official and de facto international standards (e.g., iso/iec/ieee 29119-1:2013, https://www.iso.org/standard/45142.html, and istqb, https://www.istqb.org/downloads.html, respectively), which are widely adopted by professional testers. lastly, as future work, we will thoroughly check whether our slr process specifications can suitably represent the slr processes followed by other researchers.
the outcome of this work will help us to validate our process specifications in a broader way.

acknowledgments. this work and line of research are supported by the science and technology agency of argentina in the pict 2014-1224 project, at universidad nacional de la pampa.

references

arnicans, g., romans, d., & straujums, u. (2013). semi-automatic generation of a software testing lightweight ontology from a glossary based on the onto6 methodology, frontiers in artificial intelligence and applications, v.249, pp. 263-276.

asman, a., & srikanth, r. m. (2016). a top domain ontology for software testing, master thesis, jönköping university, sweden, pp. 1-74.

bai, x., lee, s., tsai, w. t., & chen, y. (2008). ontology-based test modeling and partition testing of web services, ieee international conference on web services (icws'08), pp. 465-472.

barbosa, e. f., nakagawa, e. y., riekstin, a. c., & maldonado, j. c. (2008). ontology-based development of testing related tools, 20th international conference on software engineering and knowledge engineering (seke'08), pp. 697-702.

becker, p., lew, p., & olsina, l. (2012). specifying process views for a measurement, evaluation, and improvement strategy, advances in software engineering journal, academic editor: osamu mizuno, hindawi publishing corporation, usa, v.2012, 27 pages.

becker, p., papa, f., & olsina, l. (2015). process ontology specification for enhancing the process compliance of a measurement and evaluation strategy, clei electronic journal, 18(1), pp. 1-26.

biolchini, j., mian, p. g., natali, a. c. c., & travassos, g. (2005). systematic review in software engineering, technical report rt-es 679-05, pesc, coppe/ufrj.

brereton, p., kitchenham, b., budgen, d., turner, m., & khalil, m. (2007). lessons from applying the systematic literature review process within the software engineering domain, journal of systems and software, 80(4), pp. 571-583.

cai, l., tong, w., liu, z., & zhang, j. (2009). test case reuse based on ontology, 15th ieee pacific rim international symposium on dependable computing, pp. 103-108.

campos, h., acácio, c., braga, r., araújo, m. a. p., david, j. m. n., & campos, f. (2017). regression tests provenance data in the continuous software engineering context, 2nd brazilian symposium on systematic and automated software testing (sast), paper 10, pp. 1-6.

curtis, b., kellner, m., & over, j. (1992). process modelling, communications of the acm, 35(9), pp. 75-90.

freitas, a., & vieira, r. (2014). an ontology for guiding performance testing, ieee/wic/acm international joint conferences on web intelligence (wi) and intelligent agent technologies (iat) (wi-iat'14), v.1, pp. 400-407.

garousi, v., & mäntylä, m. (2016). a systematic literature review of literature reviews in software testing, information and software technology, v.80, pp. 195-216.

irshad, m., petersen, k., & poulding, s. (2018). a systematic literature review of software requirements reuse approaches, information and software technology, v.93, pp. 223-245.

kitchenham, b. (2004). procedures for undertaking systematic reviews, joint technical report, computer science department, keele university (tr/se-0401) and national ict australia ltd. (0400011t.1).

kitchenham, b., & brereton, p. (2013). a systematic review of systematic review process research in software engineering, information and software technology, 55(12), pp. 2049-2075.

kitchenham, b., & charters, s. (2007).
guidelines for performing systematic literature reviews in software engineering, ebse technical report, software engineering group, school of computer science and mathematics, keele university, and department of computer science, university of durham, uk, v.2.3.

kitchenham, b., budgen, d., & brereton, p. (2010a). the value of mapping studies: a participant-observer case study, 4th international conference on evaluation and assessment in software engineering, british computer society, pp. 25-33.

kitchenham, b., pretorius, r., budgen, d., brereton, p., turner, m., niazi, m., & linkman, s. (2010b). systematic literature reviews in software engineering: a tertiary study, information and software technology, 52(8), pp. 792-805.

kitchenham, b. a., dyba, t., & jorgensen, m. (2004). evidence-based software engineering, 26th international conference on software engineering, pp. 273-281.

lam, r. w., & kennedy, s. h. (2005). using meta-analysis to evaluate evidence: practical tips and traps, canadian journal of psychiatry, v.50, pp. 167-174.

lebo, t., sahoo, s., mcguinness, d., belhajjame, k., cheney, j., corsar, d., garijo, d., soiland-reyes, s., zednik, s., & zhao, j. (2013). prov-o: the prov ontology. retrieved july 1st, 2019, from https://www.w3.org/tr/2013/rec-prov-o-20130430/

marshall, c., & brereton, p. (2013). tools to support systematic literature reviews in software engineering: a mapping study, acm/ieee international symposium on empirical software engineering and measurement, baltimore, md, pp. 296-299. doi: 10.1109/esem.2013.32

meline, t. (2006). selecting studies for systematic review: inclusion and exclusion criteria, contemporary issues in communication science and disorders, v.33, pp. 21-27.

napoleão, b. m., felizardo, k. r., de souza, e. f., & vijaykumar, n. (2017). practical similarities and differences between systematic literature reviews and systematic mappings: a tertiary study, 29th international conference on software engineering and knowledge engineering, pp. 85-90. doi: 10.18293/seke2017-069

olsina, l. (1998). functional view of the hypermedia process model, 5th international workshop on engineering hypertext functionality, at icse'98, kyoto, japan, pp. 1-10.

olsina, l., & becker, p. (2017). family of strategies for different evaluation purposes, xx cibse'17, caba, argentina, published by curran associates, pp. 221-234.

olsina, l., & becker, p. (2018). linking business and information need goals with functional and non-functional requirements, xxi cibse'18, bogotá, colombia, published by curran associates, pp. 381-394.

olsina, l., covella, g., & dieser, a. (2013). metrics and indicators as key organizational assets for ict security assessment, chapter 2 in "emerging trends in ict security", elsevier (morgan kaufmann), 1st edition, akhgar & arabnia (eds.), pp. 25-44. isbn: 9780124114746

omg (2008). software & systems process engineering meta-model (spem) specification, version 2.0.

omg (2011). business process model and notation (bpmn) specification, version 2.0.

omg (2017). unified modeling language (uml) specification, version 2.5.1.

petersen, k., vakkalanka, s., & kuzniarz, l. (2015). guidelines for conducting systematic mapping studies in software engineering: an update, information and software technology, v.64, pp. 1-18.

portela, c., vasconcelos, a., silva, a., sinimbú, a., silva, e., ronny, m., lira, w., & oliveira, s. (2012).
a comparative analysis between bpmn and spem modeling standards in the software processes context, journal of software engineering and applications, 5(5), pp. 330-339.

russell, n., van der aalst, w., ter hofstede, a., & wohed, p. (2006). on the suitability of uml activity diagrams for business process modelling, third asia-pacific conference on conceptual modelling (apccm), hobart, v.53, pp. 195-204.

sackett, d. l., rosenberg, w. m. c., gray, j. a. m., haynes, r. b., & richardson, w. s. (1996). evidence-based medicine: what it is and what it isn't, british medical journal, v.312, n.7023, p. 71.

sapna, p. g., & mohanty, h. (2011). an ontology based approach for test scenario management, 5th international conference on information intelligence, systems, technology and management (icistm'2011), v.141, pp. 91-100.

sepúlveda, s., cravero, a., & cachero, c. (2016). requirements modeling languages for software product lines: a systematic literature review, information and software technology, v.69, pp. 16-36.

souza, e. f., falbo, r. a., & vijaykumar, n. l. (2017). roost: reference ontology on software testing, applied ontology journal, 12(1), pp. 1-30.

tahir, t., rasool, g., & gencel, c. (2016). a systematic literature review on software measurement programs, information and software technology, v.73, pp. 101-121.

tebes, g., peppino, d., dameno, j., becker, p., & olsina, l. (2018). diseñando una revisión sistemática de literatura sobre ontologías de testing de software (in english: designing a systematic literature review on software testing ontologies), vi congreso nacional de ingeniería informática/sistemas de información (conaiisi), mar del plata, argentina, pp. 1-13.

tebes, g., peppino, d., becker, p., & olsina, l. (2019a). especificación del modelo de proceso para una revisión sistemática de literatura (in english: specifying the process model for a systematic literature review), xxii cibse'19, la habana, cuba, published by curran associates 2019, pp. 391-404. isbn 978-1-5108-8795-4

tebes, g., peppino, d., becker, p., matturro, g., solari, m., & olsina, l. (2019b). a systematic review on software testing ontologies, 12th international conference on the quality of information and communications technology, springer book, ccis v.1010, m. piattini et al. (eds.): quatic 2019, ciudad real, spain, pp. 144-160. https://doi.org/10.1007/978-3-030-29238-6_11

torrecilla-salinas, c. j., sedeño, j., escalona, m. j., & mejías, m. (2016). agile, web engineering and capability maturity model integration: a systematic literature review, information and software technology, v.71, pp. 92-107.

vasanthapriyan, s., tian, j., & xiang, j. (2017a). an ontology-based knowledge framework for software testing, communications in computer and information science, v.780, pp. 212-226.

vasanthapriyan, s., tian, j., zhao, d., xiong, s., & xiang, j. (2017b). an ontology-based knowledge sharing portal for software testing, ieee international conference on software quality, reliability and security companion (qrs-c'17), pp. 472-479.

white, s. a. (2004). process modeling notations and workflow patterns, workflow handbook 2004, pp. 265-294.

wohlin, c. (2014). guidelines for snowballing in systematic literature studies and a replication in software engineering, proceedings of the 18th international conference on evaluation and assessment in software engineering (ease'14), acm, new york, ny, usa, article 38, pp. 1-38.

zhu, h., & huo, q. (2005).
developing a software testing ontology in uml for a software growth environment of web-based applications, software evolution with uml and xml, idea group, pp. 263-295.

appendix a data extraction form (template)

researcher name: last name (surname), first name
pid:
article title:
author/s of the article: author1_surname, initials_of_the_name; author2_surname, initials_of_the_name; ...
journal/congress/other: name
publication year:
digital library: name (note: if an article was repeated, indicate the names of all the libraries in which it was found)
name of the proposed ontology: name and/or acronym
specified concepts used to describe software testing domain:
  terms:
    termn: definition
  attributes:
    termn (attributen.1 = definition, attributen.2 = definition, ...)
  relationships:
    is_a(termx_type, termy_subtype)
    part_of(termx_whole, termy_part)
    relationm_name(termx, termy): definition
    relationz_without_name(termx, termy): definition
  axioms or restrictions:
    axiom: definition/specification
methodology used to develop the ontology: name and/or acronym
terminologies or vocabularies taken into account to develop the proposed ontology:
  ontologies: name/s and/or reference/s
  taxonomies: name/s and/or reference/s
  glossaries/dictionaries: name/s and/or reference/s
classification of the proposed ontology (select one category): 1- foundational ontology (top level); 2- top-domain ontology; 3- domain ontology; 4- instance/application ontology; 5- not determined by authors
research context: description
research objective/s related to software testing ontologies: description
does the proposed ontology consider its linking with functional and non-functional requirements concepts? 1- yes; 0- no
additional notes: (this field is used for the researcher to make his/her observations or comments about something that is important to highlight)

appendix b data extraction form (filled for pid 347)

researcher name: peppino, denis h.
pid: 347
article title: regression tests provenance data in the continuous software engineering context
author/s of the article: s. campos junior, h.; paiva, c. a.; braga, r.; araújo, m. a. p.; david, j. m. n.; campos, f.
journal/congress/other: proceedings of the 2nd brazilian symposium on systematic and automated software testing (sast), pp. 10:1-10:6
publication year: 2017
digital library: acm
name of the proposed ontology: rte-ontology
specified concepts used to describe software testing domain (important: see additional note #2):
  terms:
    1. activity
    2. testingactivity: groups test execution activities.
    3. testsuiteexecution: represents one test session.
    4. testcaseexecution: represents one test case.
    5. agent
    6. person: represents a person in the environment.
    7. entity
    8. testingenvironment: groups classes of the testing environment.
    9. softwarebuild: represents a specific build of the software.
    10. artifact: groups software artifacts.
    11. testingartifact: groups artifacts related to testing.
    12. testinglog: logs generated by a test session.
    13. testingsourcecode: represents the source class of test cases.
    14. softwareartifact: represents artifacts not directly related to test.
    15. sourcecode: represents software’s source code.
    16. softwareexecutable: binaries used to execute the software.
    17. testingsuite: groups test classes.
  attributes:
  relationships:
    is_a(activity, testingactivity)
    is_a(testingactivity, testsuiteexecution)
    is_a(testingactivity, testcaseexecution)
    is_a(agent, person)
    is_a(entity, testingenvironment)
    is_a(testingenvironment, softwarebuild)
    is_a(entity, artifact)
    is_a(artifact, testingartifact)
    is_a(testingartifact, testinglog)
    is_a(testingartifact, testingsourcecode)
    is_a(artifact, softwareartifact)
    is_a(softwareartifact, sourcecode)
    is_a(softwareartifact, softwareexecutable)
    is_a(entity, testingsuite)
    composedof(testingsuite, testingsourcecode)
    covers(testingsuite, sourcecode): represents which source code is covered by a test suite.
    wasinformedby(activity, activity)
    wasassociatedwith(activity, agent)
    actedonbehalfof(agent, agent)
    used(activity, entity)
    wasderivedfrom(entity, entity)
    wasgeneratedby(entity, activity)
    wasattributedto(entity, agent)
  axioms or restrictions:
methodology used to develop the ontology:
terminologies or vocabularies taken into account to develop the proposed ontology:
  ontologies: prov-o (lebo et al. 2013)
  taxonomies:
  glossaries/dictionaries:
classification of the proposed ontology: 3- domain ontology
research context: in this paper, they propose an architecture based on the use of an ontology and a provenance model to capture and provide regression test data, supporting the continuous improvement of software testing processes.
research objective/s related to software testing ontologies: the main objective of this paper is to propose an approach capable of capturing and providing information about past executions of regression tests; data provenance is used to achieve this goal.
does the proposed ontology consider its linking with functional and non-functional requirements concepts? 0- no
additional notes:
  note 1: rte: regression test execution.
  note 2: the terms/relationships highlighted in red and underlined were not counted and not taken into account in the analysis activity because they belong to prov-o, which has core or high-level concepts.

journal of software engineering research and development, 2021, 9:10, doi: 10.5753/jserd.2021.477  this work is licensed under a creative commons attribution 4.0 international license.

a data-centric model transformation approach using model2graphframe transformations

luiz carlos camargo [ universidade federal do paraná, c3sl labs | lccamargo@inf.ufpr.br ]
marcos didonet del fabro [ universidade federal do paraná, c3sl labs | didonet@inf.ufpr.br ]

abstract

data-centric (dc) approaches are being used for data processing in several application domains, such as distributed systems, natural language processing, and others. different data processing frameworks ease the task of parallel and distributed data processing. however, few research approaches study how to execute model manipulation operations, such as model transformations, on such frameworks. in addition, it is often necessary to provide extraction of xmi-based formats into possibly distributed models. in this paper, we present a model2graphframe operation to extract a model from a modeling technical space into the apache spark framework and its supported graphframe format. it generates a graphframe from the input models, which can be used for partitioning and processing model operations. we used two model partitioning strategies: one based on sub-graphs, and one based on clustering.
the approach allows performing model analysis by applying operations on the generated graphs, as well as model transformations (mt). the proof of concept results, covering model2graphframe extraction, graphframe partitioning, graphframe connectivity, and graphframe model transformations, indicate that our model extraction can be used in various application domains, since it enables the specification of analytical expressions on graphs. furthermore, its model graph elements are used in model transformations on a scalable platform.

keywords: model extractor, data-centric approach, spark graphframes, model transformations

1 introduction

model transformations (mts) are key artifacts for existing mde (model-driven engineering) approaches, since they implement operations between models (brambilla et al., 2012). nevertheless, transforming models via parallel and/or distributed processing is still a challenging question for mde platforms. there are recent initiatives that aim to improve existing solutions by adapting the computation models, for instance, using mapreduce (dean and ghemawat, 2008) to integrate model transformation approaches with data-intensive computing models. works such as burgueno et al. (2016), pagán et al. (2015), benelallam et al. (2015) and tisi et al. (2013) aim at providing solutions for this new scenario using frameworks such as linda and mapreduce. even when adopting these frameworks, model processing is not a straightforward task, since models are semi-structured and can have self-contained or inter-contained elements, unlike flat data structures with linear space usage, such as logs, text files, and others. the need for performing complex processing on large volumes of data has led to the re-evaluation of the utilization of different kinds of data structures (raman, 2015). very large models (vlms) are composed of millions of elements. vlms are present in specific domains such as the automotive industry, civil engineering, software product lines, and the modernization of legacy systems (gómez et al., 2015). furthermore, new applications are emerging in domains such as the internet of things (iot), open data repositories, and social networks, among others, demanding intensive and scalable computing for manipulating their artifacts (ahlgren et al., 2016). there is a wide range of model transformation approaches (kahani et al., 2018), such as qvt (omg, 2016), atl, etl (kolovos et al., 2008), and viatra (varró et al., 2016), among others. however, most of these approaches adopt local and sequential execution as the strategy for transforming models, conditioning the processing of models with large numbers of elements (vlms) to the capacity of the execution environment. given the nature of models and meta-models, they can have elements that are densely interconnected. this makes the processing of transformation rules harder, mainly when executing a pattern matching step (jouault et al., 2008). moreover, distributed model transformation (mt) requires strategies for partitioning and distributing the model elements over distinct nodes while, at the same time, ensuring consistency among their elements (benelallam et al., 2018). a large part of model-based tools uses a graph-oriented data model.
these tools have been designed to help users specify and execute model-graph manipulation operations efficiently in a variety of domains (xin et al., 2013; szárnyas et al., 2014; junghanns et al., 2016; shkapsky et al., 2016; li et al., 2017; benelallam et al., 2018; tomaszek et al., 2018; azzi et al., 2018). the extraction of large semi-structured data under a graph perspective can be useful for choosing a strategy to design distributed/parallel mts, graph-data processing, and model partitioning, for analyzing model inter-connectivity, and for offering graph-structured information to different contexts. nevertheless, graph processing in the mt context requires more research, involving implicit parallelism, parallel/distributed environments, lazy evaluation, and other mechanisms for model processing. for these reasons, in this paper, we present an evaluation study on the application of a data-centric (dc) approach for model extraction and mt in the spark framework, based on graphframes (apache, 2019). we consider that mechanisms such as implicit parallelism, lazy evaluation, model partitioning, and a scalable framework can compose an approach for mt. first, we inject the input model into a dataframe, which is a format supported by apache spark. second, we implement in scala a model extraction with graph generation from the dataframe and its schema. it translates the input models from a dataframe into a graphframe, through a model2graphframe transformation, which allows us to process them. we evaluate how to query the graph elements using its native query language, and also how to specify different kinds of operations over graphframes. we focus on the partitioning of graphs from graphframes into sub-graphs, as well as the clustering of their vertices, which are used in model transformations. we provide the following contributions:

• we produce an automated mechanism for data translations between the mde technical space and the dataframe and graphframe formats, which allows the execution of different operations (including mt) over the models from the graphframe;

• we use two (semi-automated) partitioning strategies for models on graphframes, one based on the motif algorithm and another on clustering using the infomap framework. the model partitioning result is used in mt, aiming to improve the execution performance;

• to validate our approach, we implemented a proof of concept, in which we compared the partitioning strategies in mt executions on top of spark, a scalable framework.

this paper is organized into 6 sections. in section 2, we introduce the context for this work with the dataframe and graphframes apis and their data formats, as well as model transformations using graphs; in section 3, we present the specifications of our approach, including extracting, translating, partitioning, and model transformations; in section 4, we describe the proof of concept for validating our approach; in section 5, we present related work; in section 6, we conclude with future work.

2 context

in this section, we present dataframe, a distributed collection of data organized into named columns, and graphframes, a graph processing library based on dataframes, both for apache spark.
we also introduce: mt, the key artifact for existing mde approaches; the model extractor (me) for extracting model elements from different technical spaces; and the graph, a data structure composed of vertices and edges, which may be used in mt.

2.1 data structures on graphframe

apache spark (apache, 2019) is a general-purpose data processing engine providing a set of apis that allow the implementation of several types of computations, such as interactive queries, data and stream processing, and graph processing. the dataframe spark api uses distributed datasets. a dataset is a strongly-typed data structure organized in collections. the dataset api allows the definition of a distributed collection of structured data from jvm objects, and its manipulation using functional transformations such as map, flatmap, filter, and others. structurally, a dataframe is a two-dimensional labeled data structure with columns of potentially different types. each row in a dataframe is a single record, which is represented by spark as an object of type row. each dataframe contains data grouped into named columns, and keeps track of its own schema. summarizing, a dataframe is similar to a table in a relational database, with one difference: its columns allow the manipulation of multivalued attributes. a dataframe can be transformed into new dataframes using various relational operators available in its api and expressions based on sql-like functions. dataframes and datasets are (distributed) table-like collections with well defined rows and columns. each column must have the same number of rows, and each column has type information that must be consistent for every row in the collection. dataframes and datasets represent immutable and lazily evaluated plans that specify what operations to apply to data residing at a location to generate some output (chambers and zaharia, 2018). figure 1 shows an example of a dataframe. it is formed by three rows and five columns, and contains data extracted from the families model (rows with the march, sailor, and camargo families). a row can have columns with different types, such as string, integer, date, boolean, and array.

+---------+--------------------+---------+----------+-----------------+
| lastname|           daughters|   father|    mother|             sons|
+---------+--------------------+---------+----------+-----------------+
|    march|        [[, brenda]]|  [, jim]| [, cindy]|    [[, brandon]]|
|   sailor|         [[, kelly]]|[, peter]|[, jackie]|[[,david],[,dy...|
|  camargo|[[, jor], [, teste]]| [, luiz]|   [, sid]|      [[, lucas]]|
+---------+--------------------+---------+----------+-----------------+

figure 1. dataframe families

another possible way to describe elements and their relationships is the creation of graphs, due to their high expressiveness. spark provides the graphx and graphframes apis to process data in graph formats. in the graphframes api, the graphframe class is used for instantiating graphs. in figure 2, we present a simple illustrative example of a family model, using the march family elements in a graphframe instance. it can be created from vertex (nameverticesdf) and edge (roleedgesdf) dataframes. a vertex dataframe has to contain a special column named "id", which specifies a unique id for each vertex in the graph. an edge dataframe has to contain two special columns: "src" (the source vertex id of the edge) and "dst" (the destination vertex id of the edge) (chambers and zaharia, 2018; apache, 2019). a minimal construction sketch is shown below.
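the following is a minimal sketch (ours, for illustration; it is not code from the paper) of how the march family graphframe of figure 2 could be built with the dataframe and graphframes apis; it assumes the graphframes package is on the classpath.

import org.apache.spark.sql.SparkSession
import org.graphframes.GraphFrame

object MarchFamilyGraph {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("family-graph").getOrCreate()
    import spark.implicits._

    // vertex dataframe: the mandatory "id" column plus an attribute column
    val nameVerticesDF = Seq(
      (1, "march"), (2, "brenda"), (3, "jim"), (4, "cindy"), (5, "brandon")
    ).toDF("id", "name")

    // edge dataframe: mandatory "src" and "dst" columns plus an attribute column
    val roleEdgesDF = Seq(
      (1, 2, "daughter"), (1, 3, "father"), (1, 4, "mother"), (1, 5, "son")
    ).toDF("src", "dst", "role")

    val gf = GraphFrame(nameVerticesDF, roleEdgesDF)
    gf.vertices.show() // as in figure 2, left
    gf.edges.show()    // as in figure 2, right
  }
}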
the graphframe model supports user-defined attributes within each vertex and edge. the graphframes api provides the same operations as the dataframe api, such as map, select, filter, join, and others. it has a set of built-in graph algorithms, such as breadth-first search (bfs), label propagation, pagerank, and others. the graphframes and dataframe apis are based on the concept of a resilient distributed dataset (rdd), which is an immutable collection of records partitioned across a number of computers or nodes. to provide fault tolerance, each rdd is logged to construct a lineage dataset (data lineage (tang et al., 2019)). when a data partition of an rdd is lost due to a node failure, the rdd can recompute that partition with the full information on how it was generated from other rdd partitions (apache, 2019).

nameverticesdf          roleedgesdf
+---+-------+           +---+---+--------+
| id|   name|           |src|dst|    role|
+---+-------+           +---+---+--------+
|  1|  march|           |  1|  2|daughter|
|  2| brenda|           |  1|  3|  father|
|  3|    jim|           |  1|  4|  mother|
|  4|  cindy|           |  1|  5|     son|
|  5|brandon|           +---+---+--------+
+---+-------+

figure 2. march family graphframe

2.2 model transformations using graphs

a directed graph may be represented by g(v, e), where v represents the set of vertices and e the set of edges of the graph g. a sub-graph s of a graph g is a graph whose vertices v(s) are a subset of the vertices v(g), i.e., v(s) ⊆ v(g), and whose edges e(s) are a subset of the edges e(g), i.e., e(s) ⊆ e(g). extensions of this basic representation have been proposed to define the graph as a data model (junghanns et al., 2016; barquero et al., 2018). graphs are useful for modeling computational problems. they can be adopted to model relationships among objects. a graph can be used as a representation format for models, capturing abstract features of a model. in model transformation processes, graphs can be used to translate instances from one modeling language to another, since the structures of a language can be represented by a type of graph. the triple graph grammars approach (schürr, 1995) is a way to specify translators of data structures and to check their consistency. in addition to model transformation, there is a variety of graph-based algorithms used for processing graph models in different domains, such as complex network structures, network analysis, business intelligence, and others (junghanns et al., 2016; löwe, 2018). graph transformation has been widely used for expressing model transformations, since graphs are well suited to describe the underlying structures of models and meta-models. operations are implemented as model transformations solving different tasks. a transformation is a set of rules that describe how a model in the source language can be transformed into a model in the target language (rutle et al., 2012). extraction is a process that transcribes model/meta-model elements from the native source platform to the target platform (jia and jones, 2015). it is necessary mainly when the input model comes from a different technical space (e.g., the input model is in the xmi format and the transformation platform works on data collections).

3 a data-centric approach for mt

in a previous work (camargo and fabro, 2019), we presented a study on applying a data-centric language called bloom (alvaro et al., 2011) to develop model transformations.
there are three major differences between the previous study and this paper: a) we defined a specific format based on rdf (w3c, 2014) and used it in the injection/extraction operations for translating the source model into a new modeling domain; b) we implemented the rdf models as data collections and specified transformation rules, mapping the source and target meta-model and model elements as ruby classes; and c) we chose the bloom language, a data-centric declarative language, since it is based on collections (unordered sets of facts) and provides implicit parallelism. on the other hand, the use of the data-centric approach and parallel model transformations are the main similarities between these works. the approach proposed in this work is built on top of the apache spark framework, using dc aspects such as high-level programming and parallel/distributed environments, and considering that a model element is a set of data. it allows extracting models and meta-models in different formats and transforming them into a directed graph, which is assigned to a graphframe. the transformation output is the input for processing graph operations and model transformations. in order to improve the performance of transformation executions, we use two different strategies for partitioning models from the graphframe. figure 3 shows an overview of our approach. there are arrows between spark components, mainly in the spark context, which is responsible for managing all executions on the spark framework. the arrows among the approach modules (2, 3, and 4) represent the interaction between them and their outputs, forming a workflow. all the steps of the workflow are automated, except for the operation on the graph for the partitioning of models (semi-automated). we describe these steps in the next sections. the driver node controls the execution of a spark application and maintains all states of a spark cluster. it exchanges messages with the cluster manager in order to obtain physical resources and launch executors (worker nodes). an executor is the process that performs the tasks assigned by the spark driver. the executors have the responsibility to receive the tasks assigned by the driver, run them, and report back their state and results. the interaction between the worker nodes and the spark context is supported by a cluster manager, which is responsible for maintaining a cluster of machines (nodes) that will run one or more spark applications (chambers and zaharia, 2018; apache, 2019). in our approach, modules 2 and 3 are executed on the driver node. the injector module is responsible for extracting the input model to the dataframe, which is transformed into a graphframe by the model translator module. the model transformation (module 4) is executed on the worker node(s). a minimal sketch of this execution setup follows.
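as a concrete (and hedged) illustration of the driver/executor setup just described, the sketch below creates the sparksession that the driver-side modules would use; the application name and master url are hypothetical, not taken from the paper.

import org.apache.spark.sql.SparkSession

// illustrative sketch: the driver process creates the sparksession; modules 2 and 3
// (injection and translation) run as driver code, while actions on dataframes and
// graphframes are scheduled by the cluster manager onto the executors (module 4).
object MtApp {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("model2GraphFrame")            // hypothetical application name
      .master("spark://cluster-manager:7077") // hypothetical cluster manager url
      .getOrCreate()

    // ... injector, model translator, and model transformation steps go here ...

    spark.stop()
  }
}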
also, there are type and value attributes to a data­centric model transformation approach using model2graphframe transformations camargo and del fabro 2021 represent the model element properties, forming a triple. in contrast, the graphedge type has a string attribute key for identifying the elements from src and dst links, which are represented by src (source) and dst (destination) associa­ tions between graphvertex and graphedge classes. we use the graph meta­model as a schema to instan­ tiate model elements and their relationships by means of the graphvertex and graphedge classes. their properties, such as attributes and associations indicate the model ele­ ment structures. graphvertex and graphedge classes are in­ stantiated into a graphframe, and from the graphframe it is possible to specify operations and queries to manipulate them. an instance of the graph meta­model is shown in sub­ figures 5a and 5b. figure 3. an overview of data­centric approach for mt figure 4. graph meta­model a set of operations over graph elements of graphframe can be executed, such as the motif algorithm to split graph in sub­graphs, graph degree to compute the valency of a ver­ tex in a graph, queries, and others1. in addition to such exe­ cutions, the model2graphframe (m2g) output is also used as input by the model transformation module, which trans­ forms the input model elements in a directed­graph format to the target model. in the next sections, we present the steps to extract and transform models, as well as two alternatives for model par­ titioning. 3.1 extracting model elements into a dataframe the initial step consists of the extraction of the input model elements into a dataframe model. it starts when the user submits (1 in figure 3) the input model with its name, and 1the valency of a vertex of a graph is the number of edges that are incident to the vertex location (path) (figure 3) to the driver node. the injector module (2 in figure 3) assigns the input model in formats such as xmi or json to a variable (modelpath) which is read for loading the input model. next, the input model is parsed (dataframe api) and its elements are assigned to a dataframe (modeldf). all dataframe has a schema for de­ scribing the data structures, such as the input model. thus, a schema is formed according to the input data structures. list­ ing 2 shows an example of a dataframe schema. we choose to use the dataframe in this step due to their schema. it pre­ serves the input data structures, easing the translation of the input models to the graphframe through the reuse of these structures. furthermore, it is not necessary to implement a parser for loading the input model to dataframe. we use the family model excerpt from the atl zoo (eclipse, 2019) to illustrate the extraction into the dataframe and we then describe how model elements are represented in a dataframe. in spark, the operations on data are made by means of transformations and actions. a trans­ formation is formed by a set of instructions to manipulate data and an action is specified to trigger the computation on data. when it is called, it notifies the spark engine to com­ pute a result from a series of transformations (chambers and zaharia, 2018). listing 3 illustrates the extraction result from the model family (excerpt) in xmi format (listing 1) to a dataframe, where its structure is supported by dataframe schema shown in listing 2. listing 1: model families excerpt ... 
listing 2: family schema excerpt
root
 |-family: array (nullable = true)
 |  |-element: struct (containsnull = true)
 |  |  |-lastname: string (nullable = true)
 |  |  |-daughters: struct (nullable = true)
 |  |  |  |-firstname: string (nullable = true)
...

listing 3: dataframe family excerpt
+---------+--------------------+---------+----------+--------------+
| lastname|           daughters|   father|    mother|          sons|
+---------+--------------------+---------+----------+--------------+
|    march|        [[, brenda]]|  [, jim]| [, cindy]| [[, brandon]]|
|      ...|                 ...|      ...|       ...|           ...|

as shown in listing 3, the model elements are structured in a set of columns with an unspecified number of rows, since a schema defines only the column names and types of a dataframe. the rows are unspecified because the reading of the model elements is a lazily-evaluated operation (lazy evaluation (michael l., 2016)); the schema does not require the rows to be identified explicitly. although a dataframe schema can be specified manually, we opted for the schema generated by the parser during the read operation of the input model (extraction step). in this schema, the structures of the input model elements are preserved in a tree format by the translation process. listing 2 shows a translation example, where the dataframe schema is rooted at the element root and its rows are represented by the family element. the multivalued elements are represented by arrays (array) and their elements are represented by structs, which may have one or more elements, including null values (containsnull). these elements represent the leaves (e.g., lastname) and have a type (e.g., string). all elements represented in the dataframe schema have the nullable attribute assigned as true by default; this fits the way the spark framework handles dataframe columns, with the nullable attribute true or false. the columns are logical constructions that represent a value computed by means of programmatic expressions. thus, to have a real value for a column, we need to have a row, and consequently a dataframe. once the input model has been translated to a dataframe, it can be transformed according to the transformation domains of the user.
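as an illustration of the injector step, the following minimal sketch (ours; the file name and path are hypothetical) loads a json serialization of a model into a dataframe and prints the inferred schema. for xmi/xml input, an external data source such as the spark-xml package would be needed, since plain spark does not parse xml natively.

import org.apache.spark.sql.SparkSession

object Injector {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("injector").getOrCreate()

    // hypothetical path to a json serialization of the families model
    val modelPath = "data/families.json"

    // the parser infers the schema from the input structure (extraction step)
    val modelDF = spark.read.option("multiLine", true).json(modelPath)

    modelDF.printSchema() // tree-shaped schema, as in listing 2
    modelDF.show()        // tabular view, as in listing 3
  }
}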
3.2 translating the input dataframe to graphframe

in a second step, the model translator module (3 in figure 3) translates the input model, which was assigned to a dataframe, into a graphframe. we use the model elements in the dataframe as input to the model translator. in addition to the elements, the schema associated with the dataframe, which describes the model element structures, is essential for our model translator, since we use it to reproduce these element structures in a graph, assigning them to the graphframe. we created an algorithm for translating a dataframe to a graphframe conforming to the meta-model of figure 4. algorithm 1 is responsible for this translation. as input, the algorithm receives a dataframe, which is processed by combining its content and schema. algorithm 1 contains the functions model2graphframe and model2graphschema; their source code is available at https://github.com/lzcamargo/extracspk. since the modeldf dataframe contains all model elements, it is passed as a parameter to the model2graphframe function, which starts the transformation process by calling the model2graphschema function with the model elements and the dataframe schema as parameters (for simplicity's sake, we omit the specification of the model2graphschema function; see algorithm 1, line 2). it processes the model elements and their structures, together with the respective schema columns of the dataframe, in a recursive way, assigning its result to the verticesdf and edgesdf dataframes (lines 3 and 4). we use the wildcard parameters (_1 and _2) and the todf function with the respective dataframe column names ("id", "value"). thus, the first elements are separated into the verticesdf dataframe and the remaining elements into the edgesdf dataframe. both dataframes shape the vertices and edges and are assigned to the graphframe (gf, line 7) by the model2graphframe function.

algorithm 1: m2g translation algorithm
input: modeldf : dataframe
output: gf : graphframe
1: function model2graphframe(modeldf)
2:   graphdata ← model2graphschema(modeldf.collect, modeldf.schema, 0)
3:   verticesdf ← graphdata._1.todf("id", "value")
4:   edgesdf ← graphdata._2.todf("src", "dst", "key")
5:   return (verticesdf, edgesdf)
6: end function
7: gf ← model2graphframe(modeldf)

figure 5. family model elements translated to a graphframe: (a) graphframe vertices; (b) graphframe edges.

we use some family model elements (listing 2) as input to present a translation example (an execution of algorithm 1). to access the vertex and edge contents, we execute the commands gf.vertices.show() and gf.edges.show(); their outputs are represented in figures 5a and 5b. the values of the family model elements from the dataframe are instantiated into graph vertices. the model element names are assigned to graph edges as keys. the links (src and dst) among vertices and edges establish the relationships of the model elements. in figure 5 we use circles and rectangles to illustrate the model element structures and their relationships. for example, the vertices and edges marked in red demonstrate the structure of the lastname sailor element, and the blue ones denote the firstname david element. the relationship between these two elements is marked on the edge (figure 5b), where the src column value is noted in red and the value of the dst column is noted in blue. the join of these structures (the match between the id, src, and dst columns) allows us to identify that david is a son (sons) and belongs to the sailor family. thus, the model elements are structured into graphframes so that they can be queried and processed for different purposes.
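to make the "join of these structures" concrete, here is a small sketch (ours, illustrative) that recovers an element's context by joining the edge and vertex dataframes of the generated graphframe gf, and computes vertex degrees; the column names follow the meta-model of figure 4, and the "sons" key is taken from the example above.

// illustrative queries over the generated graphframe gf (vertices: id, value;
// edges: src, dst, key), assuming gf is in scope as produced by algorithm 1.

// match each "sons" edge's destination vertex to its value, recovering, e.g.,
// that the vertex holding "david" is linked as a son:
val sons = gf.edges
  .filter("key = 'sons'")
  .join(gf.vertices, gf.edges("dst") === gf.vertices("id"))
  .select("src", "dst", "value")
sons.show()

// vertex valency (number of incident edges), via the built-in degrees api:
gf.degrees.show()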
3.3 model partitioning

in this step, we present two strategies for partitioning models from the graphframe: one based on the model key-element names with the motif algorithm, and another using clustering. first we present their implementation; in the next section, we present a proof of concept using these strategies. we chose the first strategy because it allows us to use the transformation rule names with an algorithm implemented in the graphframes api itself, in this case the motif algorithm. regarding clustering, we chose it to link the model elements into clusters by means of the related vertices (src to dst) in the edges contained in the graphframe. we use the clusters as parameters for the spark framework partitions when processing the model transformations. in a graph, a motif can be defined as a pattern of interconnections of edges that occurs in a graph (milo et al., 2002). we are interested in finding patterns in a graph for a given purpose, forming sub-graphs as partitions of this graph. thus, we consider the following definition: a graph g′ = (v′, e′) is a sub-graph of a graph g = (v, e) if v′ ⊆ v and e′ ⊆ e ∩ (v′ × v′). if g′ ⊆ g and g′ contains all of the edges ⟨u, v⟩ ∈ e with u, v ∈ v′, then g′ is an induced sub-graph of g. in our context, consider a scenario with the following transformation rule names: package2schema, class2table, att2col, and family2person. from each rule name, we use its prefix (i.e., package, class, att, and family) as a parameter (key-element) in graph partitioning using the motif algorithm, particularly for the key column of the edges. this means that these prefixes are points of interest in the graph. in a graphframe, motif finding is implemented in a domain-specific language (dsl) for expressing structural queries. for example, graph.find("(a)-[e]->(b); (b)-[e2]->(a)") will search for pairs of vertices a, b connected by edges in both directions. it will return a dataframe of all the structures in the graph, with columns for each of the named elements (vertices or edges) in the motif; the returned columns will be the vertices a, b and the edges e, e2 (apache, 2019). we specify the sub-graph extraction by combining motif finding and a filter. this means that, depending on the input model, it is necessary to adjust the motif algorithm parameters and/or the filter, which characterizes the model partitioning as semi-automated. listing 4 shows the implementation in spark scala for the class elements through the tag "classes", which were mapped to the key column of the edgesdf dataframe. graph motifs are patterns that occur repeatedly in graphs and represent the relationships among the vertices. in a graphframe, motif finding uses a declarative dsl for expressing structural queries that find patterns among edges and vertices by means of the find() function. therefore, we chose it to ease the sub-graph extractions. we believe that its characteristics can generate consistent sub-graphs from key model elements (prefix rule names). line 3 of listing 4 specifies a query that searches for pairs of vertices between (a,b), (b,c), and (c,d), which are respectively connected by the edges e, ea, and eb. we also use a filter to delimit the vertex pairs, starting from an edge whose key property is equal to the tag "classes". this means that the execution of this expression returns, as motifsdf, all the structures (vertices and edges) related to the filtered property (classes) in the graph, arranged in the a, e, b, ea, c, eb, and d columns. we select the edges contained in motifsdf and assign them to the sube immutable variable (line 5). we use it as the edges for composing the subg sub-graph, whose vertices are the same as in the gf graph.
we apply the dropisolatedvertices() function to exclude the isolated vertices (i.e., vertices with degree zero, if there are any), ensuring the consistency of the links between vertices and edges in the subg sub-graph. in this case, listing 4 allows us to get all the class elements and their associated elements from the graphframe that represents a class model, producing a sub-graph. listings 11 and 12 show an example of the edges and vertices of a sub-graph (s-g) resulting from listing 4. this example and the results of the other motif specifications for the model key-elements, such as package, att, female, and male, are presented in section 4.

listing 4: motifs sub-graph extraction
1 object subgraph {
2   def main(args: Array[String]): Unit = {
3     val motifsdf = gf.find("(a)-[e]->(b); (b)-[ea]->(c); (c)-[eb]->(d)")
4       .filter("e.key = 'classes'")
5     val sube = motifsdf.select("eb.src", "eb.dst", "eb.key")
6     val subg = GraphFrame(gf.vertices, sube)
7       .dropIsolatedVertices()
8   }
9 }

now we present the use of clustering as a strategy, implementing it with infomap from the mapequation framework (bohlin et al., 2014). there are other alternatives for such an implementation, such as the k-means algorithm (macqueen, 1967), one of the most commonly used clustering algorithms; we could also adapt apache spark mllib, a machine learning (ml) library that provides various ml-based operations, including clustering. infomap is a fast stochastic and recursive search algorithm with a heuristic method, louvain (blondel et al., 2008), based on the optimization of modularity. when it is executed on the vertices and edges of a graph, neighbor nodes are joined into modules, which are subsequently joined into super-modules and so on, clustering tightly interconnected nodes into modules. infomap has been used in community partition problems (aslak et al., 2018; edler et al., 2017), for detecting communities in large networks, and to help in the analysis of complex systems. in addition, infomap operates on graph structures in the pajek format (file.net)3, which can be easily extracted from the graphframe as input to infomap. for example, listing 5 shows an excerpt of the file.net extracted from the class-0 model, and listing 6 shows the .clu output file, the clustering result, where the nodes are gathered into their respective clusters (node and cluster columns). the flow column contains the flow value of each node; these values are discarded when the .clu file is injected into a dataframe by a loading operation and used for clustering model elements. however, clustering from the graphframe using the infomap framework is a semi-automated operation, since we do not implement an integration between our approach and the infomap framework (operations on graph, figure 3).

3https://gephi.org/users/supported-graph-formats/pajek-net-format/

listing 5: class-0 file.net
*vertices 50031
0 0
1 1
2 2
...
*arcs 50030
1 2
4 5
4 6
...

listing 6: clustering nodes
# node cluster flow:
8  1 0.0457141
7  1 0.00261991
10 1 0.00261991
6  1 0.00222776
9  1 0.00222776
5  1 0.00195755
11 1 0.027326
46 1 0.00233907
...
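the paper does not show the extraction of the pajek file itself; the following is a hedged sketch of how the graphframe edges could be dumped in the .net format consumed by infomap, assuming numeric vertex ids and an edge list small enough to collect on the driver.

import java.io.PrintWriter
import org.graphframes.GraphFrame

// hedged sketch: write graphframe edges as a pajek .net file for infomap;
// assumes numeric vertex ids and a driver-sized edge list
def toPajek(gf: GraphFrame, path: String): Unit = {
  val vertexCount = gf.vertices.count()
  val arcs = gf.edges.select("src", "dst").collect()
  val out = new PrintWriter(path)
  try {
    out.println(s"*Vertices $vertexCount")
    out.println(s"*Arcs ${arcs.length}")
    arcs.foreach(r => out.println(s"${r.get(0)} ${r.get(1)}"))
  } finally out.close()
}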
later, we present the use of infomap and the model partitioning in section 4.

3.4 mt using graphframe

in the last step, we specify a set of operations and transformation rules to transform the source model in the graphframe into a target model. they are executed as parallel tasks on the worker nodes of the spark framework, through the model transformation module (4 in figure 3). the source code of the operations and transformation rules is available on4. listing 7 shows the family2person rule written in scala as a singleton object (object family2person). we separate the male and female elements into the maleedgesdf and femaleedgesdf dataframes. they contain the target values (dstm, dstf, dst) that link each last name with its first names. we use the select, join, and filter functions to select the last and first names from maleedgesdf. for each join operation, we use the filter function (lines 4, 6, 12, and 14) to ensure the accurate selection of model elements, since they are formed by relationships among edges and vertices ("dstm" === "id"). in lines 7 and 15, we use the select and concat functions to assign the last name (lastname) and the respective first names (value) as the full name (fullname column) to the malefullnamesdf dataframe.

listing 7: f2p rule
1  object family2person {
2    val malefullnamesdf = maleedgesdf
3      .select($"dstm", $"dst").join(gf.vertices)
4      .filter($"dstm" === $"id")
5      .select($"value".alias("lastname"), $"dst")
6      .join(gf.vertices).filter($"dst" === $"id")
7      .select(concat($"lastname", lit(" "), $"value")
8        as "fullname")
9
10   val femalefullnamesdf = femaleedgesdf
11     .select($"dstf", $"dst").join(gf.vertices)
12     .filter($"dstf" === $"id")
13     .select($"value".alias("lastname"), $"dst")
14     .join(gf.vertices).filter($"dst" === $"id")
15     .select(concat($"lastname", lit(" "), $"value")
16       as "fullname")
17 }

for the femalefullnamesdf dataframe (lines 10 to 16), we use the same idea applied to the malefullnamesdf dataframe. these dataframes are merged (union function) into the persondf dataframe, each one with a new gender column (withcolumn("gender")) to ensure the gender distinction among persons.

4https://github.com/lzcamargo/transformspk

next, we specify an operation using the coalesce(1) method to instantiate the transformation output in a single partition. this means that the output tasks are reduced to a single partition (a single output) as the final result of the transformation. the example in listing 8 is obtained with the write function and the root and row tags of the databricks spark-xml library, indicating that the output format was set to xml. we separate these commands (write operations on the target model) from the loading rules for better code legibility. since the target model is stored in a repository, it is possible to load the output in xml/xmi format and instantiate it back into a graphframe. listing 8 shows a portion of the persons.xml file content. it represents the family2person transformation result, using the family model presented in listing 1 as the source model.

listing 8: persons model excerpt
<persons>
  <gender>male</gender>
  <fullname>march jim</fullname>
  <gender>male</gender>
  <fullname>sailor dylan</fullname>
  <gender>female</gender>
  <fullname>march cindy</fullname>
  <gender>female</gender>
  <fullname>march brenda</fullname>
  ...
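the write operation that produces listing 8 can be sketched as follows; the rootTag and rowTag option names come from the databricks spark-xml library, while the row tag value and the output path are assumptions for illustration.

// hedged sketch of the write step behind listing 8; "person" as the row tag
// and the output path are assumptions
personDF
  .coalesce(1) // collapse the result into a single output partition
  .write
  .format("com.databricks.spark.xml")
  .option("rootTag", "persons")
  .option("rowTag", "person")
  .save("output/persons.xml")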
in this section we described our approach. in the next section, we perform a proof of concept to validate its feasibility.

4 implementation

we implemented a proof of concept (poc) (kendig, 2016) using graphframes to demonstrate the feasibility of our approach and to show its usefulness under the following aspects: the processing of model2graphframe outputs, the partitioning of graphs contained in the graphframe, the connectivity among model elements in a set of graphframes, and the execution of model transformations using the graphframes. we ran the poc on a single machine with the following software stack: ubuntu 18.04, spark 2.4, and scala 2.3. it is hosted by an intel core i5-4210u 1600 cpu with 8096 mb of ram; the processor has two cores. as input, we use both class and family models in xmi format. there are four models with the following specifications:

• class-0, a class model with no attributes or methods, only package and class elements. this kind of model is used in domain modeling and is useful to understand the ideas and concepts of the domain (larman, 2004);
• class-3, a class model with package and class elements, where each class contains from 1 to 3 methods and attributes;
• class-6, as the previous item, but each class contains from 1 to 6 methods and attributes;
• family, a model with 0 to 3 sons and daughters. its elements are self-contained in lastname elements and their attributes.

we obtained the class models, each one with 10000 classes, from5. they were created to be used as a benchmark for the class2relational transformation case studies in parallel transformations using lintra (burgueno et al., 2015)6. these models have references among their elements established by attributes. for instance, the class-0 model has 10 package elements and each package has 1000 class elements. the family model has 10000 lastname elements, which we created for this proof of concept. in this case, we consider these elements self-contained (class-0 and family). however, there are models (class-3 and class-6) that, besides self-contained elements, also contain inter-connected elements, where class elements are referenced by one or more class elements contained in other packages. attributes such as super and type establish such references. the models used in the poc differ in density (class-0, family, and class-6) and interconnectivity (class-3 and class-6) among their elements. this means that we validate our approach with respect to these model aspects. to measure the execution times in seconds, we use the system.currenttimemillis() function of the scala language, on a dedicated machine with no ui interactions. once the input model elements are extracted to a graphframe, they must be available: each model element in the graphframe vertices has to be linked to its properties through graphframe edges. we have defined three research questions to validate the poc implementation and its main aspects.

5http://atenea.lcc.uma.es/descargas/mtbenchmark/classmodels
6http://atenea.lcc.uma.es/index.php/main_page/resources/lintra

q1: how can we check whether the model2graphframe output is available for processing?

to address this question, we use the directed-graph property (dgp) to check the totals of edges and vertices in a directed graph g: |v(g)| − 1 = |e(g)|, i.e., the total number of vertices minus 1 equals the total number of edges. when this property holds for a directed graph, it is considered a simple directed graph (hochbaum, 2008). a directed graph is no longer simple if there are multiple edges or loops; in that case, the v(g) total is less than the e(g) total (|v(g)| < |e(g)|).
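the property can be checked directly on a graphframe; a minimal sketch:

import org.graphframes.GraphFrame

// minimal sketch of the dgp check: a simple directed graph in our setting
// satisfies |v(g)| - 1 = |e(g)|
def isSimpleDirectedGraph(gf: GraphFrame): Boolean =
  gf.vertices.count() - 1 == gf.edges.count()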
in addition, we execute a set of queries on the graphframe to validate the contents of the vertices and edges, using input models that contain 100 classes and 100 families. this means that we take a set of model elements contained in the graphframe and compare it with the input model elements. although the m2g outputs are directed graphs in the graphframe, we need to know whether it is achievable to use them in model transformations. to address this issue, we define question q2.

q2: is it possible to perform mt using graphframe?

we address this question in order to use graphframe in model transformations. our goal is to verify how the source models in graphframes can be transformed into target models. we specify operations and rules using methods and functions in scala for manipulating vertices and edges in the graphframe (e.g., listing 7). they are similar to transformation specifications in atl, the atlas transformation language (jouault et al., 2008), where helpers and transformation rules are the constructs used to specify the transformation functionality. finally, the last question is about the performance of mt executions using clusters.

q3: does executing model transformations using model partitioning improve performance?

we address this question in order to verify whether the execution of model transformations using model partitioning improves performance, since we adopted two partitioning strategies for this approach: partitioning the input model in the graphframe into sub-graphs, and generating clusters from the graphframe vertices. in the following sections, we present the proof of concept, the results, and the answers to the above questions, as well as further discussion.

4.1 processing model2graphframe outputs

to check the graphframe outputs with respect to the input models, we obtain the totals of vertices and edges and use the dgp to check them. columns v(g) and e(g) of table 2 show the totals of vertices and edges for the input models (model column). the number of vertices v(g) − 1 is equal to the number of edges e(g) for the class-0 and family models, demonstrating that they are simple directed graphs. however, the totals of vertices and edges for the class-3 and class-6 models indicate that they are not simple directed graphs (v(g) < e(g)). in addition, we execute queries such as the one shown below, and their results are compared to the input model elements to validate the m2g consistency. the query returns the values of class properties such as name, isabstract, and visibility from the graphframe vertices. it does not return attributes and methods, because the key-element (key) is assigned the "classes" value.

gf.edges.where($"key" === "classes")
  .select($"dst".as("dstv")).join(gf.edges)
  .filter($"dstv" === $"src").select($"dst")
  .join(gf.vertices).filter($"dst" === $"id").show()

listings 9 and 10 show excerpts of class-0 model elements and the query output; they represent an example of our validation. in this case, the relation between classes and their properties is established by the graphframe edges (src and dst columns of gf.edges), whereas the value of each property is assigned to the graphframe vertices (gf.vertices).
listing 9: class-0 model excerpt
...

public List<Employee> employeeWithHighSalaries(double salary) {
    List<Employee> res = new ArrayList<>();
    for (Employee e : employees) {
        if (e.getSalary() > salary)
            res.add(e);
    }
    return res;
}
(filter 1)

public List<Employee> employeeWithHighSalaries(double salary) {
    return employees.stream()
        .filter(e -> e.getSalary() > salary)
        .collect(Collectors.toList());
}
(filter 2)

previous research on java lambda expressions focused on their introduction via automatic techniques for refactoring legacy code to "make the code more succinct and readable" (gyori et al., 2013; dantas et al., 2018), in particular situations where one can, for instance, replace either an anonymous inner class or a loop over a collection with statements involving lambda expressions. other approaches recommend transformations that introduce lambda expressions to remove duplicated code (tsantalis et al., 2017) and to use the parallel features of java 8 properly (khatchadourian et al., 2019). also, mazinanian et al. (2017) present a comprehensive study on the adoption of java lambda expressions to understand the motivations that lead java developers to adopt the functional style of thinking in java. the authors published a large dataset with more than 100 000 real usage scenarios; we use this dataset to understand the program comprehension benefits of adopting java lambda expressions. at first glance, the use of lambda expressions, due to their conciseness, yields more succinct and readable code (gyori et al., 2013; dantas et al., 2018). however, this is not always the case, as dantas et al. (2018) produced automated refactorings for iterating over collections that developers judged less comprehensible. we aim to investigate further which scenarios benefit from the introduction of lambda expressions. to the best of our knowledge, previous research did not investigate the assumption that the use of lambda expressions actually leads to benefits in program comprehension.

3 study settings

the general goal of this research is to investigate the benefits to code comprehension of refactoring a java method to introduce a lambda expression, thus answering the research questions we present in section 3.1. to this end, we conducted the research in two phases, both using a mixed-methods approach. in the first phase, whose results we presented in previous work (lucas et al., 2019), we carried out a quantitative assessment of 66 pairs of code snippets, using state-of-the-art models for measuring software comprehension (see section 3.2). each pair corresponds to a method body before and after introducing lambda expressions. we also conducted a qualitative investigation (survey) in which 28 practitioners answered questions that also aim to compare the code before and after the introduction of lambda expressions in nine pairs of code snippets. in the second phase we mitigated some possible threats that we identified in the first study: the small number of code snippets used in the survey of the first phase, and the assessment of code snippets that might contain not only a manual program transformation but also an additional contribution to the program (e.g., a bug fix). as such, in the second phase we leveraged the existing support of program transformation tools to refactor legacy code of open source systems to introduce lambda expressions.
considering the outcomes of these program transformation tools, we again conducted a quantitative assessment (using state-of-the-art models for measuring software comprehension) of a random sample of 92 pairs of code snippets, and a survey with 182 practitioners who evaluated at least five code snippets from this sample of 92 pairs.

3.1 research questions

we investigated the following research questions in our study.

(q1) does the use of lambda expressions improve program comprehension?
(q2) does the introduction of lambda expressions reduce source code complexity?
(q3) what are the most suitable situations to refactor a code to introduce lambda expressions?
(q4) how do practitioners evaluate the effect of introducing a lambda expression into legacy code?
(q5) what is the practitioners' opinion about the recommendations from automated tools to introduce lambda expressions?

we conducted this research using an iterative approach, and after investigating a given question, new sub-questions and hypotheses emerged. for instance, we investigated whether or not the reduction in the size of a code snippet, after introducing a lambda expression, influences the perception of the participants about the quality of the transformation.

3.2 metrics of the quantitative study

we measured the complexity of a code snippet using two metrics: number of source lines of code (sloc) and cyclomatic complexity (cc). both metrics have been used in a number of studies (riaz et al., 2009; baggen et al., 2012; landman et al., 2016). in addition, we used two models to estimate and compare the readability of each pair of code snippets considered in our research. readability is one of the aspects used to assess program comprehension, and hereafter both terms (readability and program comprehension) are used interchangeably. the first model we used to estimate program comprehension is based on the work of buse and weimer (2010). it estimates the comprehensibility of a code snippet through a regression model that takes as input several characteristics, including the length of each line of code in a code snippet, the number of identifiers in a code snippet, and the length of the identifiers present in a code snippet (buse and weimer, 2010). the second model was proposed by posnett et al. (2011) and builds upon the buse and weimer model, though considering a smaller number of characteristics. based on this model, we can estimate the readability of a code snippet using eq. (1) and eq. (2), with the constant c = 8.87.

e(x) = 1 / (1 + e^(−z(x)))    (1)

z(x) = c + 0.40·l(x) − 0.033·v(x) − 1.5·h(x)    (2)

that is, in the posnett et al. model, we calculate program comprehension using three main components: the number of lines of a code snippet (l(x)), the volume of a code snippet (v(x)), and the entropy (h(x)) of a code snippet. the volume of a code snippet x is given by v(x) = N(x)·log2 n(x), where N(x) is the program length of the code snippet and n(x) is the program vocabulary. these measures are defined as follows.

• program length (N(x)) is given by N(x) = N1(x) + N2(x), where N1(x) is the number of operators and N2(x) is the number of operands of a code snippet.
• program vocabulary (n(x)) is computed using the formula n(x) = n1(x) + n2(x), where n1(x) is the number of unique operators and n2(x) is the number of unique operands of a code snippet.
the entropy of a document x (in our case, a code snippet) is given by eq. (3), where xi is a token in x, count(xi) is the number of occurrences of xi in the document x, and p(xi) is given by eq. (4). the entropy h(x) in our context estimates the degree of disorder of the source code.

h(x) = − Σ_{i=1}^{n} p(xi)·log2 p(xi)    (3)

p(xi) = count(xi) / Σ_{j=1}^{n} count(xj)    (4)

we used an existing tool2 to estimate the comprehensibility of the code snippets using the buse and weimer (2010) model. we developed our own tool to automate the computation of the comprehensibility model by posnett et al. (2011).3

2http://www.arrestedcomputing.com/readability/
3https://github.com/rbonifacio/program-comprehension-metrics
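to make eqs. (1)-(4) concrete, the sketch below computes the posnett et al. estimate in scala; tokenization and the operator/operand counts are deliberately simplified, so this is an illustration of the arithmetic rather than a faithful reimplementation of either tool.

import scala.math.{exp, log}

// hedged sketch of the posnett et al. model (eqs. (1)-(4)); real operator and
// operand counting is language-aware, here we only illustrate the arithmetic
object PosnettReadability {
  val c = 8.87
  private def log2(x: Double): Double = log(x) / log(2)

  // eqs. (3) and (4): token entropy of a snippet
  def entropy(tokens: Seq[String]): Double = {
    val total = tokens.size.toDouble
    tokens.groupBy(identity).values
      .map(g => g.size / total)  // p(xi)
      .map(p => -p * log2(p))
      .sum
  }

  // halstead volume: v(x) = N(x) * log2(n(x))
  def volume(length: Int, vocabulary: Int): Double = length * log2(vocabulary)

  // eq. (2)
  def z(lines: Int, v: Double, h: Double): Double =
    c + 0.40 * lines - 0.033 * v - 1.5 * h

  // eq. (1)
  def estimate(lines: Int, v: Double, h: Double): Double =
    1.0 / (1.0 + exp(-z(lines, v, h)))
}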
3.3 code snippets' datasets

in the first phase of this research, we used an existing tool (minerwebapp) and a dataset from a previous work (mazinanian et al., 2017) to identify code snippet candidates for our research. minerwebapp monitors the adoption of java lambda expressions in open source projects hosted on github and has been used in previous research on the adoption of lambda expressions (mazinanian et al., 2017). the goal of minerwebapp is to identify and classify the use of lambda expressions in code snippets. minerwebapp classifies the occurrences of lambda expressions into three categories:

• new method: when a new method containing lambda expressions is added to an existing class;
• new class: when a new class is added to the project, and this class contains methods with lambda expressions;
• existing method: when a lambda expression is introduced into an existing method.

the decision to use an existing tool and dataset simplified our process of collecting real usage scenarios of lambda expressions. we randomly selected 59 code snippets from the minerwebapp dataset, considering exclusively the code snippets of the third category (existing method). we also collected 29 code snippets of refactoring scenarios that we generated using rjtl (dantas et al., 2018) and submitted via pull requests to open source projects. in total, we selected 88 code snippets from 22 projects, including code snippets from the elastic search, spring framework, and eclipse foundation projects. we manually reviewed these code snippets and removed 22 pairs that clearly do not correspond to a refactoring or that already had a lambda expression in the first version of the code. this cleanup led to a final dataset with 66 pairs of code snippets from 19 projects that we considered in the first phase of the research. in table 1 we show the number of pairs of code snippets we collected from the github repositories, coming either from minerwebapp or from rjtl transformations.

table 1. selected projects in the first phase.

project | snippets from minerwebapp | snippets from rjtl
seleniumquery | 0 | 10
elasticsearch | 0 | 4
corenlp | 0 | 15
vertx-examples | 2 | 0
swagger2markup | 2 | 0
spongeapi | 4 | 0
tailor | 2 | 0
agrona | 1 | 0
rxandroidble | 2 | 0
optaplanner | 7 | 0
rxjava-android-samples | 3 | 0
kaa | 1 | 0
jersey | 4 | 0
uhabits | 2 | 0
graylog2-server | 1 | 0
fluentlenium | 1 | 0
qualitymatters | 1 | 0
jbpm | 1 | 0
spring-integration | 3 | 0

all procedures to collect and characterize the code snippets from github pages have been automated, using a crawler and additional scripts for computing source code metrics (figure 2 shows an overview of the approach). the crawler expects as input a csv file, where each line specifies the project, the url of the commit, the start and end lines of the code snippet, and the type of the refactoring (e.g., anonymous inner class to lambda expression, foreach statements to a recursive pattern using lambda expressions, and so on).

figure 2. procedures for collecting code snippets and calculating metrics (crawler → compute metrics → store).

in the second phase, we used three automated refactoring tools (the rjtl tool, netbeans ide, and intellij ide) to find opportunities and then introduce lambda expressions globally into the methods of five open-source systems (see table 2). we chose these systems because they have been used to assess the performance of lambdaficator (gyori et al., 2013), later integrated into netbeans to assist developers in migrating legacy systems towards java 8. we were also able to build and execute the test cases of these systems, before and after applying the transformations. after executing the three tools on the five systems, we generated a dataset of 1987 transformations recommending refactorings to introduce lambda expressions (table 2 shows the details).

table 2. number of refactoring recommendations each tool (rjtl, netbeans ide, and intellij ide) produced.

project | rjtl | netbeans ide | intellij ide
junit4-r4.13-rc-2 | 9 | 104 | 39
tomcat-7.0.98 | 3 | 354 | 105
fitnesse-20191110 | 4 | 319 | 70
antlrworks-1.5.1 | 89 | 316 | 118
ant-ivy-rel-2.5.0 | 23 | 389 | 45

we followed a set of steps in order to validate and create our second dataset of transformations. we first downloaded and built the last (stable) version of the systems, before executing the refactoring tools. after that, for each program transformation tool, we created a specific git branch, executed the program transformation tool, and built the system again, looking for either a compilation or a test execution failure. we checked out the files that, after applying a transformation, introduced a failure, removing spurious transformations. accordingly, we built a dataset with 1987 transformations. we then randomly selected 92 pairs of code snippets to explore in the second phase of our research. we classified this final set of 92 transformations (appendix a details the taxonomy) and computed the source code metrics and readability models. we stored the code snippets and the results of the metric calculations in a database. table 3 summarizes this final set of 92 transformations.

table 3. number of transformations grouped by type and tool.

type | rjtl | netbeans ide | intellij ide
anonymous inner class | 17 | 13 | 28
reduce | 0 | 2 | 0
chaining | 0 | 6 | 0
foreach | 0 | 9 | 0
map | 0 | 2 | 0
filter | 2 | 1 | 0
anymatch | 12 | 0 | 0

we finally investigated the situations where at least two tools recommended a refactoring in the same code snippet. considering the initial set of 1987 transformations, we found 357 cases (17.96%) of code snippets having recommendations from more than one tool. nonetheless, the recommendations are not exactly the same.
for instance, the code snippets of figure 3 present transformations recommended for the same original code (figure 3(a)), as suggested by netbeans ide, intellij ide, and rjtl. in this example, it is possible to see that intellij ide leverages the mechanism of type inference, while netbeans ide and rjtl do not. moreover, there is a slight difference in the indentation of the resulting code from the netbeans ide and rjtl recommendations. we removed this kind of duplication from our dataset of 100 code snippets, leading to the final dataset of 92 pairs of code snippets that we used in the second phase of our research.

figure 3. transformations recommended for the same code snippet, as suggested by rjtl, netbeans, and intellij.

public XJRotableToggleButton createToggleButton(String title...) {
    XJRotableToggleButton b = new XJRotableToggleButton(title);
    b.setFocusable(false);
    b.addActionListener(new ActionListener() {
        public void actionPerformed(ActionEvent e) {
            performToggleButtonAction(tag);
        }
    });
    components2toggle.put(c, b);
    return b;
}
(a) original code

public XJRotableToggleButton createToggleButton(String title...) {
    XJRotableToggleButton b = new XJRotableToggleButton(title);
    b.setFocusable(false);
    b.addActionListener((ActionEvent e) -> {
        performToggleButtonAction(tag);
    });
    components2toggle.put(c, b);
    return b;
}
(b) code after applying the netbeans transformation

public XJRotableToggleButton createToggleButton(String title...) {
    XJRotableToggleButton b = new XJRotableToggleButton(title);
    b.setFocusable(false);
    b.addActionListener(e -> performToggleButtonAction(tag));
    components2toggle.put(c, b);
    return b;
}
(c) code after applying the intellij transformation

public XJRotableToggleButton createToggleButton(String title...) {
    XJRotableToggleButton b = new XJRotableToggleButton(title);
    b.setFocusable(false);
    b.addActionListener((ActionEvent e) -> {
            performToggleButtonAction(tag);
        }
    );
    components2toggle.put(c, b);
    return b;
}
(d) code after applying the rjtl transformation

3.4 procedures of the qualitative study

regarding the qualitative study, we conducted the research using an approach based on a previous work (dos santos and gerosa, 2018). that is, we designed an online survey that allowed the participants to evaluate pairs of code snippets. in the first phase we only invited professional developers with some background in java programming, from a convenience sample of developers in our own professional network. table 7 details the characteristics of the survey participants from the first phase of our research. the survey was organized in two sections. the first section aimed to characterize the experience of the participants, while the second aimed to investigate the benefits (or drawbacks) of introducing lambda expressions into legacy code. this second section comprised the following (survey) questions.

• s1q1: do you agree that the adoption of lambda expressions on the right code snippet improves the readability of the left code snippet? this is a likert-scale question, (1) meaning strongly disagree and (5) meaning strongly agree, which focuses on the readability aspect.
• s1q2: which code do you prefer? this is a yes-or-no question, which aims to understand if the new code improves general quality attributes. the same question has been explored in a previous work (dos santos and gerosa, 2018).
• s1q3: would you like to include any additional comment to your answers?
this is an open question that allowed the participants to optionally present further details about their answers.

we first conducted a pilot with five students, to evaluate whether our online survey tool would be able to properly capture the opinion of the developers. after conducting this pilot, we implemented several adjustments to the layout and the functionality of the tool, in order to increase our confidence in the tool for the next executions of the survey. the pilot also revealed that answering all pairs of code snippets was a time-consuming activity. for this reason, we split the pairs of code snippets into two groups, and then randomly assigned the participants to answer the survey questions considering code snippets either from the first or from the second group. the participants answered the survey questions for a minimum of three and a maximum of six pairs of code snippets, randomly selected from the first or second group of code snippets.

considering the second phase of our study, we used the set of 92 randomly selected pairs of code snippets whose transformed code corresponds to a recommendation from rjtl, netbeans ide, or intellij. in this phase, the participants answered the following questions.

• s2q1: what is your opinion about the following sentences? (a) the new code is easier to comprehend, (b) the new code is more succinct and readable, (c) the intention of using a lambda expression in the new code is clear, and (d) the new code is harder to debug. respondents presented their opinion about these sentences using a likert scale, (1) meaning strongly disagree and (5) meaning strongly agree. the first three sentences are claims that motivate the adoption of lambda expressions in java programs (gyori et al., 2013); the fourth sentence came from our own experience in debugging pieces of code that use java lambda expressions.
• s2q2: how often would you perform this type of transformation? this is a likert-scale question, (1) meaning never and (5) meaning always. the goal was to evaluate how often developers would perform a specific transformation to introduce lambda expressions.
• s2q3: how important is automated support for this kind of transformation? this is a likert-scale question, (1) meaning not important at all and (5) meaning extremely important. the goal of this question was to evaluate how important the use of tools to support a specific transformation is.
• s2q4: would you perform this transformation? why? this is an open question that allowed the participants to optionally present further details about their opinion.

in the second phase of our research, we used a set of social media tools to invite developers to answer the survey. that is, we sent a message to specific communities of java developers, including communities on facebook, reddit, and telegram, and mailing lists of java developers (e.g., netbeans developers, jdk developers). we presumed that these developers have good experience with java programming. this phase had 182 participants located in 32 different countries (see table 4). the developers needed 04:23 minutes (on average) to complete the questionnaire, in which they evaluated a maximum of 5 transformations and answered a set of 7 questions regarding each pair of code snippets. in this phase, we generated a survey by randomly selecting five pairs of code snippets for each participant.
tables 5 and 6 summarize the number of participants by level of education and by professional experience, respectively.

table 4. distribution of respondents according to their location.

country | respondents | percentage
brazil | 71 | 39.01
united states | 25 | 13.74
germany | 16 | 8.79
portugal | 8 | 4.40
india | 7 | 3.85
united kingdom | 6 | 3.30
netherlands | 5 | 2.75
spain | 4 | 2.20
other countries | 40 | 21.97

table 5. characterization of the survey's participants in the second phase by level of education.

developer's degree | number of participants | percentage
some high school | 5 | 2.74%
high school graduate | 13 | 7.14%
undergraduate | 21 | 11.53%
bachelor's degree | 58 | 31.86%
master's degree | 76 | 41.75%
doctorate degree | 9 | 4.94%

table 6. characterization of the survey's participants in the second phase by developer experience.

developer's experience | number of participants | percentage
less than one year | 14 | 7.69%
between one and four years | 52 | 28.57%
between five and ten years | 48 | 26.37%
more than ten years | 68 | 37.36%

we cross-validated the results of the qualitative assessment with the results of the quantitative assessments, by correlating the estimates of program comprehension from the two models discussed in the previous section with the results of the surveys. we also explored the results of the survey considering the measurements of sloc and cc for all pairs of code snippets in the survey.

3.5 data analysis

we used exploratory data analysis (eda) to answer our first two research questions. eda is a method that allows researchers to build a broad understanding of the data, using descriptive statistics (e.g., median and mean) and graphical methods (e.g., histograms and boxplots). we also leveraged hypothesis testing to further explore the first two research questions. regarding the remaining research questions, which we addressed using surveys as the main method for data collection, we also relied on eda to consolidate the answers to the likert-scale questions (in terms of descriptive statistics and plots), while the answers to the survey's open-ended questions were quoted literally. since we collected more substantial feedback for the open-ended questions in the second phase of the research (177 answers in total), we also consolidated the answers to the second phase's open-ended question using thematic analysis (silva et al., 2016; shrestha et al., 2020). we conducted our thematic analysis in four steps. in the first step, we carried out an initial reading of the answers to the fourth question of our survey (s2q4), preparing the scene before starting the coding stage. in the second step, we performed an initial coding of each answer. next, in the third step, we analyzed the codes with the goal of finding themes (that is, groupings of related codes). finally, in the fourth step, we reviewed and merged the themes, generating a new, more comprehensive list of topics. we included a small cross-validation phase, in which two authors gave feedback on the assignments; these two authors did not contribute to the initial assignment of codes and themes to the answers.

4 results of the first phase

in this section we present the results of the first phase of our research. initially we discuss the outcomes of the quantitative assessment, which considers the models of buse and weimer (2010) and posnett et al. (2011) (section 4.1).
after that, we present the results of the qualitative assessment and compare the findings of the two studies (section 4.2).

4.1 quantitative assessment

we considered the 66 pairs of selected code snippets during the quantitative assessment. for each pair, we calculated the number of lines of code (sloc), the cyclomatic complexity (cc), and the estimated comprehensibility using the buse and weimer and posnett et al. models. we addressed two main hypotheses in order to answer our research questions.

h1: the introduction of lambda expressions improves program comprehension, according to the state-of-the-art readability models.

conversely, our first null hypothesis (h1_0) states that the introduction of lambda expressions does not change program comprehension, according to state-of-the-art readability models. we used a paired non-parametric test (the wilcoxon signed-rank test (wilcoxon, 1945)) to investigate this hypothesis, considering the comprehensibility assessments using the models of buse and weimer and posnett et al. for each pair of code, the introduction of lambda expressions might have increased, decreased, or left unchanged the comprehensibility, according to both models. as such, the wilcoxon signed-rank test tested the null hypothesis that the comprehensibility of the source code before and after the introduction of lambda expressions is identical (wilcoxon, 1945). table 8 summarizes the results, considering all pairs of code snippets. although the posnett et al. model builds upon the model of buse and weimer, our analysis revealed a lack of agreement between the results of the two models. the outcome of the test revealed that the introduction of lambda expressions actually decreases program comprehension (p-value < 0.0001) when considering the buse and weimer model. nonetheless, when we considered the posnett et al. model, we could not reject the null hypothesis, and this result suggested that the introduction of lambda expressions does not affect the comprehension of the code snippets (p-value = 0.668). due to these conflicting results, we compared both models with the results of the qualitative assessment (section 4.2).
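as a rough illustration of this procedure (not the study's actual code), the paired test is available in apache commons math; the readability arrays below are placeholders.

import org.apache.commons.math3.stat.inference.WilcoxonSignedRankTest

// hedged sketch of the paired test over before/after readability estimates;
// the values are placeholders, not the study data
val before = Array(0.29, 0.72, 0.41, 0.55, 0.63)
val after  = Array(0.50, 0.13, 0.39, 0.57, 0.48)

val test = new WilcoxonSignedRankTest()
// false selects the normal approximation instead of the exact p-value
val pValue = test.wilcoxonSignedRankTest(before, after, false)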
table 7. characterization of the survey's participants in the first phase.

id | gender | degree | experience | lambda experience | functional programming experience
1 | male | master student | no | 1-4 years | 4 years
2 | male | bsc degree | yes | 1-4 years | 2 years
3 | male | master student | yes | more than five years | 11 years
4 | male | bsc degree | yes | 1-4 years | 4 years
5 | male | master student | yes | 1-4 years | 10 years
6 | male | bsc degree | no | 5+ years | 11 years
7 | male | master student | yes | 1-4 years | 11 years
8 | male | master student | yes | more than five years | 11 years
9 | male | master student | no | no experience | 7 years
10 | male | bsc degree | yes | 1-4 years | 5 years
11 | male | bsc degree | yes | 5+ years | 5 years
12 | male | phd degree | yes | no experience | 10 years
13 | male | bsc degree | yes | 1 year | 11 years
14 | female | master student | no | no experience | 5 years
15 | male | master student | yes | no experience | 7 years
16 | female | phd degree | no | 4-5 years | 5 years
17 | male | master student | yes | 1 year | 4 years
18 | male | bsc degree | yes | 1-4 years | 2 years
19 | female | undergraduate student | no | 1 year | 1 year
20 | male | bsc degree | yes | no experience | 7 years
21 | male | master student | yes | more than five years | 11 years
22 | male | undergraduate student | yes | no experience | 1 year
23 | male | bsc degree | yes | 1 year | 1 year
24 | male | undergraduate student | yes | no experience | 1 year
25 | male | undergraduate student | yes | 1 year | 4 years
26 | male | master student | yes | 4-5 years | 5 years
27 | male | bsc degree | no | no experience | 1 year
28 | male | bsc degree | yes | no experience | 11 years

table 8. number of pairs of code snippets whose readability increased, decreased, or was unchanged after the introduction of lambda expressions.

model | increased | decreased | unchanged
buse and weimer | 13 | 44 | 9
posnett et al. | 31 | 35 | 0

h2: sloc and cc can be used to predict the benefits (or drawbacks) to program comprehension, according to the readability models considered in this research.

we investigated this hypothesis using regression models. first, we calculated the differences in the sloc (∆s) and cc (∆cc) metrics, considering the code snippets before and after the introduction of lambda expressions. we then built two regression models, one taking as response variable the difference in the buse and weimer estimate (∆bw) and one taking as response variable the difference in the posnett et al. estimate (∆p).

∆bw = b0 + b1·∆s + b2·∆cc    (5)

∆p = c0 + c1·∆s + c2·∆cc    (6)

accordingly, we unfolded h2 into two alternative hypotheses, one for each readability model. that is, the null hypotheses for h2 are as follows.

• h2.1_0: there is no relationship between ∆bw and the predictors ∆s and ∆cc.
• h2.2_0: there is no relationship between ∆p and the predictors ∆s and ∆cc.

table 9 and table 10 show the results of the regression analysis, considering the models of eq. (5) and eq. (6). considering a significance level < 0.05, we could not predict the benefits/drawbacks of introducing lambda expressions according to the buse and weimer readability model, in terms of lines of code (p-value = 0.08) and cyclomatic complexity (p-value = 0.98). this result suggested that we should not reject the null hypothesis h2.1_0, and that there is a negligible relationship between the predictors (∆s and ∆cc) and the response variable ∆bw. indeed, only 2% of the variability in ∆bw was explained by the linear regression of eq. (5) (adjusted r-squared: 0.02). similarly, the variables ∆s and ∆cc did not explain the variability in ∆p (adjusted r-squared: 0.05). nonetheless, considering the second regression model (eq. (6)), the results suggested that there is a relationship between sloc and ∆p (p-value = 0.01), though the correlation is small (ρ = −0.188 using the spearman correlation method).
in summary, the results of the regression analysis refuted our hypothesis h2: ∆s and ∆cc presented a negligible relationship with ∆bw and ∆p, and thus they could not adequately predict the variability in the response variables of eq. (5) and eq. (6).

table 9. summary of the regression model to estimate the difference in the buse and weimer estimates, using sloc and cc.

term | estimate | std. error | t value | pr(>|t|)
(intercept) | 0.0309 | 0.0128 | 2.41 | 0.0190
∆s | 0.0052 | 0.0029 | 1.77 | 0.0816
∆cc | 0.0003 | 0.0199 | 0.01 | 0.9888

table 10. summary of the regression model to estimate the difference in the posnett et al. estimates, using sloc and cc.

term | estimate | std. error | t value | pr(>|t|)
(intercept) | −0.0184 | 0.0161 | −1.14 | 0.2567
∆s | −0.0088 | 0.0037 | −2.41 | 0.0190
∆cc | 0.0099 | 0.0249 | 0.40 | 0.6937

4.2 qualitative assessment

considering the qualitative assessment, 28 participants with substantial experience in java programming evaluated between three and six pairs of code snippets each. for each pair of code snippets, these participants answered the survey questions s1q1, s1q2, and s1q3. recall that we split the code snippets into two groups, and thus each code snippet was evaluated by 14 participants. the data collection lasted 16 days and, on average, each participant spent 2:30 minutes to evaluate each pair of code snippets. we used two forms of data analysis in this assessment. first, we summarized the responses to s1q1 and s1q2 using tables and plots, which allowed us to build a broad view of the answers to the closed questions. in the second analysis, we considered the answers to the open questions literally (some of them are quoted here), to draw a broader understanding of the implications of refactoring java legacy code to introduce lambda expressions.

4.2.1 improvements on readability

the goal of the first question of our survey (do you agree that the adoption of lambda expressions on the right code snippet improves the readability of the left code snippet?) was to evaluate whether, according to the perception of java developers, the introduction of lambda expressions improves the comprehension of the code snippets. we used a likert scale to investigate this. considering the answers to all pairs of code snippets, 11.1% and 39.7% of the responses either strongly agree or agree that the introduction of lambda expressions improves the readability of the code, respectively; 24.6% of the responses were neutral, 21.4% disagree, and 3.2% strongly disagree with the s1q1 statement (see table 11). therefore, we found developers leaning towards a readability improvement after the introduction of lambda expressions.

table 11. summary of the answers to the question do you agree that the adoption of lambda expressions on the right code snippet improves the readability of the left code snippet?

s1q1 | answers | percentage | cum. percentage
strongly disagree | 4 | 3.2% | 3.2%
disagree | 27 | 21.4% | 24.6%
neutral | 31 | 24.6% | 49.2%
agree | 50 | 39.7% | 88.9%
strongly agree | 14 | 11.1% | 100.0%
total | 126 | 100.0% |

to better understand this result, we analyzed the answers for each pair of code snippets (see figure 4). transformations 1035, 1052, and 1180 present more than 60% of positive answers (i.e., introducing lambda expressions improves the readability of these code snippets).

figure 4. answers to the first question of the survey, considering the pairs of code snippets 1027, 1035, 1052, 1062, 1166, 1180, 1182, 1183, and 1192 (diverging bars per pair, from strongly disagree to strongly agree).
differently, the pair of code snippets 1182, shown in figure 5, received 79% of answers that were either neutral or negative (i.e., the introduction of lambda expressions seems to reduce the readability of this code snippet). in this particular case, a for(obj: collection) {...} statement is replaced by a collection.foreach(obj -> {...}) loop, which includes a lambda expression. most of the participants did not agree that the introduction of a lambda expression improved the readability of the source code in this situation. one of the participants stated:

"(considering the code snippet 1182) i think that replacing a normal for each by a collection.foreach() would only bring benefits when there are additional calls either to the map or filter methods, or perhaps calls to some other list processing method."

figure 5. pair of code snippets 1182.

assertEquals(numRequests, responses.size());
for (TestResponse t : responses) {
    Response r = t.getResponse();
    assertEquals(t.method, r.getRequestLine().getMethod());
    ...
}
(a)

assertEquals(numRequests, responses.size());
responses.forEach(t -> {
    Response r = t.getResponse();
    assertEquals(t.method, r.getRequestLine().getMethod());
    ...
});
(b)

figure 6 shows the pair of code snippets 1180. in this example, an instance attribute (duplicate) was first initialized using an anonymous inner class (figure 6(a)). this anonymous inner class was later replaced by a lambda expression (figure 6(b)), and 64% of the participants either agree or strongly agree that this transformation improves the readability of the code snippet. regarding this pair of code snippets, one of the participants stated that:

"here the transformation makes sense, because it eliminates the use of an anonymous inner class with a trivial method body (often used to implement the command design pattern in java)"

figure 6. pair of code snippets 1180.

private Function duplicate = new Function() {
    public String apply(String in) {
        return in + in;
    }
};
(a)

private Function duplicate = (String in) -> {
    return in + in;
};
(b)

considering all pairs of code snippets we used in the survey, only in two pairs (1166 and 1182) did we observe a tendency towards either a neutral or a disagreement opinion on whether the introduction of lambda expressions improves the readability of the code. more specifically, in these two cases, the percentage of agree and strongly agree was under 50%. both are examples of transformations that replace a regular for each statement with a collection.foreach(...) using a lambda expression.

4.2.2 source code preference

the goal of the second question of our survey (which code do you prefer?) was to understand whether the practitioners preferred the code before or after the introduction of lambda expressions. considering the nine pairs of code snippets of the survey (randomly selected from the initial population), only the pair of code snippets 1166 received more selections for the first version of the code (i.e., before the introduction of lambda expressions).
therefore, we found some evidence in this survey that the participants perceive the introduction of lambda expressions as a transformation that improves the quality of the source code. surely, this preference depends on the experience of the developers, as one of the participants stated:

"it depends on the practical knowledge of functional programming, since programmers of the 1980s and 1990s are likely to consider easier to understand code where loops, control variables, and pointers are explicit."

we used the spearman correlation test to verify whether the reduction in lines of code and the reduction in cyclomatic complexity could explain the preference of the participants for the pieces of code after the introduction of lambda expressions. we found a moderate to high correlation (0.67) between the reduction in the lines of code and the number of votes in favor of the code after the introduction of lambda expressions. therefore, in the cases where a source code transformation to introduce lambda expressions reduced the number of lines of code, it might have also improved the general quality of the code, according to the perceptions of the participants. differently, we found a weak correlation between the reduction in cyclomatic complexity and the number of choices in favor of (or against) the code snippets using lambda expressions. we can understand this result because the introduction of lambda expressions did not reduce the cyclomatic complexity in several cases.
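this kind of correlation analysis can be reproduced with apache commons math; the arrays below are illustrative placeholders, not the study data.

import org.apache.commons.math3.stat.correlation.SpearmansCorrelation

// hedged sketch: correlate the reduction in sloc with the number of votes in
// favor of the transformed snippet; values are placeholders
val slocReduction = Array(5.0, 2.0, 0.0, 7.0, 3.0)
val votesInFavor  = Array(11.0, 8.0, 4.0, 12.0, 9.0)

val rho = new SpearmansCorrelation().correlation(slocReduction, votesInFavor)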
5 results of the second phase

in this section, we replicate the process executed in the first phase, but only considering transformations suggested by automated tools. section 5.1 presents the results of the quantitative assessment, taking into account the models of buse and weimer (2010) and posnett et al. (2011). after that, we present the results of the qualitative assessments and compare them to the results of the quantitative study (section 5.2).

5.1 quantitative assessment

we considered the 92 pairs of code snippets randomly selected from the set of recommendations to introduce lambda expressions suggested by rjtl, netbeans, and intellij. for each pair, we estimated the code comprehension of the versions before and after applying the suggested transformations, using both the buse and weimer (2010) and posnett et al. (2011) models. we also calculated the sloc and cc metrics for both versions of the code snippets. to investigate h1 (the introduction of lambda expressions improves program comprehension, according to state-of-the-art readability models), we executed the wilcoxon signed-rank test considering the two models to measure code comprehension. first, we evaluated the situations where a transformation increased, decreased, or left unchanged code comprehension according to the models. after that, we executed the wilcoxon signed-rank test. table 12 summarizes the results, showing that, in most of the cases, the introduction of lambda expressions suggested by automated tools actually reduces code comprehension, according to both state-of-the-art readability models.

table 12. number of pairs of code snippets whose readability increased, decreased, or was unchanged after the introduction of lambda expressions.

model              increased   decreased   unchanged
buse and weimer           25          63           4
posnett et al.            20          67           5

the results of the wilcoxon signed-rank test suggested that the introduction of lambda expressions decreases the comprehensibility of the pairs of code snippets (p-value < 0.0001). for instance, figures 7 and 8 show pairs of snippets that have been evaluated using the readability metrics. the transformation of an anonymous inner class led to an improvement according to the buse and weimer (2010) metric: the readability of the code before the transformation according to this model is 0.29, and 0.50 after introducing a lambda expression. however, considering a transformation that replaces a for loop by a lambda expression, the metric's result worsened significantly, falling from 0.72 to 0.13 after the source code transformation.

figure 7. pair of code snippet 528. replacing an anonymous inner class.
(a)
    public synchronized String getResolverName(ModuleRevisionId mrid) {
        ModuleSettings ms = moduleSettings.getRule(mrid, new Filter() {
            public boolean accept(ModuleSettings o) {
                return o.getResolverName() != null;
            }
        });
        return ms == null ? defaultResolverName : ms.getResolverName();
    }
(b)
    public synchronized String getResolverName(ModuleRevisionId mrid) {
        ModuleSettings ms = moduleSettings.getRule(mrid,
            (ModuleSettings o) -> { return o.getResolverName() != null; });
        return ms == null ? defaultResolverName : ms.getResolverName();
    }

figure 8. pair of code snippet 499. replacing a structural for loop.
(a)
    public void rewind(int start) {
        currentTokenIndex = start;
        /* remove any consume and lookahead attribute for any token with index
         * greater than start */
        for (Integer idx : inputTokenIndexes) {
            if (idx >= start) {
                indexToConsumeAttributeMap.remove(idx);
                lookaheadTokenIndexes.remove(idx);
            }
        }
    }
(b)
    public void rewind(int start) {
        currentTokenIndex = start;
        /* remove any consume and lookahead attribute for any token with index
         * greater than start */
        inputTokenIndexes.stream().filter((idx) -> (idx >= start)).map((idx) -> {
            indexToConsumeAttributeMap.remove(idx);
            return idx;
        }).forEachOrdered((idx) -> {
            lookaheadTokenIndexes.remove(idx);
        });
    }

to investigate the h2 hypothesis (sloc and cc can be used to predict the benefits (or drawbacks) on program comprehension, according to the readability models considered in this research), we calculated the differences in the sloc (Δs) and cc (Δcc) metrics, considering the code snippets before and after the introduction of lambda expressions. accordingly, we explored the null hypotheses h2.1₀ and h2.2₀ (section 4). tables 13 and 14 summarize the results of the regression analysis considering a significance level < 0.05. after performing the regression analysis, both models led to a p-value > 0.05 w.r.t. the sloc metric. however, differently from the results of the first phase, the analyses led to a p-value < 0.05 when considering the cc metric. such results suggest that cyclomatic complexity can be used to estimate the impact on code comprehension after the introduction of lambda expressions. therefore, the results confirmed our second hypothesis with respect to the cyclomatic complexity metric, making it possible to estimate the effect on the readability metrics using the difference on the cc metric. we further detail these results in section 6.

table 13. summary of the regression model to estimate the difference on the buse and weimer estimates, using sloc and cc.

              estimate   std. error   t value   pr(>|t|)
(intercept)     0.0318       0.0220      1.44     0.1522
slocdiff        0.0038       0.0056      0.67     0.5034
ccdiff         -0.0623       0.0136     -4.59     0.0000

table 14. summary of the regression model to estimate the difference on the posnett et al. estimates, using sloc and cc.

              estimate   std. error   t value   pr(>|t|)
(intercept)     0.0100       0.0161      0.62     0.5357
slocdiff       -0.0043       0.0041     -1.05     0.2975
ccdiff         -0.0253       0.0100     -2.54     0.0130
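tables 13 and 14 can be read as fitted linear models. writing $\Delta s$ and $\Delta cc$ for the before/after differences in sloc and cc, the estimated differences in the readability scores are approximately:

$$\Delta r_{\text{buse--weimer}} \approx 0.0318 + 0.0038\,\Delta s - 0.0623\,\Delta cc$$

$$\Delta r_{\text{posnett}} \approx 0.0100 - 0.0043\,\Delta s - 0.0253\,\Delta cc$$

in both models only the coefficient of $\Delta cc$ is significant at the 0.05 level, which is what supports h2 for cyclomatic complexity.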
5.2 qualitative assessment

in the qualitative assessment, we report the results of a second survey with practitioners, to capture the perception of the developers about the impact on the readability of the code after applying transformations that introduce lambda expressions. these transformations had been recommended by automated tools only. we present the distribution of responses in the form of plots to build a broad perspective of the opinion of the respondents on every closed question. we then show the insights we got after conducting a thematic analysis of the open-ended questions, highlighting the participants' opinions with quotations and code examples.

5.2.1 the impact of introducing lambda expressions

in our second survey, our first question asked the opinion of the respondents about four sentences, which we use to understand the impact of introducing lambda expressions in the pairs of code snippets. we organize this section according to the sentences of the first question of the second survey.

the new code is easier to comprehend. the purpose of this sentence was to evaluate if the transformations to introduce lambda expressions (recommended by automated tools) improve program comprehension. contrasting with the overall claims about the benefits of introducing lambda expressions (gyori et al., 2013), we found that almost all types of transformations the automated tools suggest do not improve the readability of the programs. interestingly, except for three types of transformations (anonymous inner class to lambda, for loop to anymatch, and for loop to filter), the respondents most often did not agree that the introduction of lambda expressions makes the code easier to comprehend. actually, according to figure 9, 68% of the respondents stated that they did not agree that transformations involving the chaining of different stream operations improve program comprehension, and we observed the same trend for other typical recursive patterns (e.g., map, reduce, and for each).

figure 9. summary of the developers' answers to the sentence the new code is easier to comprehend, per transformation type (anonymous inner class, anymatch, chaining, filter, foreach, map, reduce), on a scale from strongly disagree to strongly agree.

it is worth linking these results to the answers to the open-ended question. that is, according to the participants, replacing an anonymous inner class by a lambda expression often improves program readability. figure 10 shows an example of this particular type of transformation. after introducing the lambda expression, the code is more succinct because it removes some of the boilerplate code necessary to implement anonymous inner classes.

figure 10. pair of code snippets 480. replacing an anonymous inner class with a lambda expression.
(a)
    private ThrowingRunnable evaluateWithException(Exception e) {
        return new ThrowingRunnable() {
            public void run() throws Throwable {
                statement.nextException = e;
                statement.waitDuration = 0;
                failOnTimeout.evaluate();
            }
        };
    }
(b)
    private ThrowingRunnable evaluateWithException(Exception e) {
        return () -> {
            statement.nextException = e;
            statement.waitDuration = 0;
            failOnTimeout.evaluate();
        };
    }

regarding the code snippet of figure 10, one participant stated:

"(the code on the right is…) easier to read, usually lambda also makes the code cleaner and compact."

this comment suggests that this is a situation where the introduction of a lambda expression improves program comprehension.
differently, transformations involving the chaining of stream api methods received 68% of responses as either strongly disagree or disagree, characterizing possible scenarios where the introduction of lambda expressions does not improve code comprehension. another case involved transformations of for loops into foreach statements, which had 39% of negative answers (either strongly disagree or disagree). the type of transformations with the recursive patterns map and reduce received 50% of negative responses. with respect to a transformation involving chaining, one of the respondents stated the following about the example of figure 11:

"it's a bad example ... although i use lambdas a lot, i would never use them in exactly this way."

considering the same example of code in figure 11, another participant discussed that:

"(i would) almost never (execute this transformation). transforming for loops into foreach statements with lambda expressions provides little benefit other than using a maybe slightly more concise syntax. "readability" in my mind is such a subjective criterion that it is close to useless as a metric for making any decisions: someone coming from a functional language will find a map/filter/reduce pipeline easier to "read", and someone coming from a structured programming language will naturally tend towards nested loops."

this is an example of a transformation that replaces for each statements by lambda expressions; according to the respondents, it does not improve program comprehension. based on these results, we observe that the transformations replacing an anonymous inner class with a lambda expression, replacing a for loop with the filter pattern, and replacing a for loop with the anymatch method improve code comprehension; while the transformations replacing a for loop with a for-each statement, replacing a for loop with the reduce pattern, replacing a for loop with the map pattern, and replacing a for loop with a chaining of operators often do not improve program comprehension according to the developers' opinion.

figure 11. pair of code snippet 489. replacing a loop with foreach, filter, and foreachordered.
(a)
    private void postConfigure() {
        List<Trigger> triggers = settings.getTriggers();
        for (Trigger trigger : triggers) {
            eventManager.addIvyListener(trigger, trigger.getEventFilter());
        }
        for (DependencyResolver resolver : settings.getResolvers()) {
            if (resolver instanceof BasicResolver) {
                ((BasicResolver) resolver).setEventManager(eventManager);
            }
        }
    }
(b)
    private void postConfigure() {
        List<Trigger> triggers = settings.getTriggers();
        triggers.forEach((trigger) -> {
            eventManager.addIvyListener(trigger, trigger.getEventFilter());
        });
        settings.getResolvers()
                .stream()
                .filter((resolver) -> (resolver instanceof BasicResolver))
                .forEachOrdered((resolver) -> {
                    ((BasicResolver) resolver).setEventManager(eventManager);
                });
    }
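the participants' remarks above suggest that a pipeline pays off mainly when it expresses filtering or mapping, not when it merely replaces the loop keyword. a minimal sketch of this contrast (with hypothetical names, not taken from the study's corpus):

    import java.util.List;
    import java.util.Locale;

    class ForEachVersusPipeline {

        // bare replacement: the lambda only swaps syntax for the same iteration
        static void printAll(List<String> names) {
            names.forEach(name -> System.out.println(name));
        }

        // pipeline: filtering and mapping give the functional form something to express
        static void printLongNamesUpperCased(List<String> names) {
            names.stream()
                 .filter(name -> name.length() > 3)
                 .map(name -> name.toUpperCase(Locale.ROOT))
                 .forEach(System.out::println);
        }
    }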
the new code is more succinct and readable. the purpose of this sentence was to assess whether or not the introduction of lambda expressions makes the code more succinct and improves its readability. figure 12 summarizes the results of the developers' responses to this particular sentence.

figure 12. summary of the developers' answers to the sentence the new code is more succinct and readable, per transformation type (anonymous inner class, anymatch, chaining, filter, foreach, map, reduce), on a scale from strongly disagree to strongly agree.

in this case, we found a more positive tendency: the transformations from anonymous inner classes into lambda expressions and the transformations resulting in the map, reduce, filter, and anymatch patterns lean towards positive answers (agree or strongly agree). however, the assessment revealed that two types of transformations do not improve readability: transformations involving foreach and the chaining of stream api methods received more than 49% of negative responses (strongly disagree and disagree).

the transformation in figure 13 shows a scenario that replaces a for each statement by a call to the foreach method of the stream api. although this is a straightforward situation where a developer might use a foreach, it does not improve the quality of the code, and most of the respondents considered that this particular scenario does not make the code more succinct and readable (more than 80% of the respondents were either neutral or did not agree that this transformation brings these benefits). regarding this pair of code snippets, one of the respondents clearly stated this perception:

"(this) transformation does not improve readability and makes debugging more difficult."

figure 13. pair of code snippet 504. replacing a for loop by the foreach pattern.
(a)
    public ContextConfigurator updatedWith(Properties newProperties) {
        for (String key : newProperties.stringPropertyNames()) {
            withParameter(key, newProperties.getProperty(key));
        }
        return this;
    }
(b)
    public ContextConfigurator updatedWith(Properties newProperties) {
        newProperties.stringPropertyNames().forEach((key) -> {
            withParameter(key, newProperties.getProperty(key));
        });
        return this;
    }

differently, figure 14 shows an example of a transformation that makes the code more succinct and readable, according to the opinion of the respondents. in this case, more than 80% of the answers were either neutral or leaned towards agreement that the resulting code is more succinct and readable. altogether, from these observations, we argue that transformations replacing for loops by a foreach method call and the composition of stream operations (chaining) do not improve readability or make the code more succinct. on the other hand, the other types of transformations have shown benefits regarding code readability.

figure 14. pair of code snippet 465. replacing an anonymous inner class.
(a)
    public FitNesse(FitNesseContext context) {
        this.context = context;
        RejectedExecutionHandler handler = new RejectedExecutionHandler() {
            @Override
            public void rejectedExecution(Runnable r, ThreadPoolExecutor e) {
                LOG.log(WARNING, "could not handle request. thread pool ...");
            }
        };
        //...
    }
(b)
    public FitNesse(FitNesseContext context) {
        this.context = context;
        RejectedExecutionHandler handler = (Runnable r, ThreadPoolExecutor e) -> {
            LOG.log(WARNING, "could not handle request. thread pool ...");
        };
        //...
    }
the intention of using a lambda expression in the new code is clear. the purpose of this question was to investigate whether or not developers are able to understand the motivation for using the lambda expressions introduced in the new code. figure 15 summarizes the results of the developers' responses to this question. similarly to the previous sentence, we found a more negative leaning when we considered the transformations that replace a for loop by a call to the foreach method and transformations that introduce a chaining of stream operations. the remaining types of transformations seemed to make clear the intention of using either a lambda expression instead of an anonymous inner class or a recursive pattern (e.g., filter, anymatch, map, or reduce) instead of a for loop.

figure 15. summary of the developers' answers to the sentence the intention of using a lambda expression in the new code is clear, per transformation type (anonymous inner class, anymatch, chaining, filter, foreach, map, reduce), on a scale from strongly disagree to strongly agree.

transformations introducing a call to the foreach method received 44% of negative (strongly disagree or disagree) responses, suggesting at best a neutral opinion regarding the clear intention of introducing a lambda expression. figure 16 shows an example of code that replaces a for loop by the foreach pattern, where 66% of the respondents considered the intention of the code unclear. in particular, a participant stated that:

"(i would never) perform this transformation. the for loop makes it clear and explicit that we are iterating over the elements in the collection—it is a fundamental part of the language that we all understand. the (use of) lambda expression does not."

figure 17 shows an example of a transformation that makes the intention of the code clearer. this transformation replaces a for loop by a call to the anymatch method, and 88% of the respondents assigned either a neutral or a positive answer (agree or strongly agree) with respect to the clear intention of using a lambda expression in this example. a respondent also claimed that:

"…the new code is more elegant and makes the intention of finding some occurrence where the condition is true clearer."

figure 16. pair of code snippets 502. replacing a for loop with the foreach pattern.
(a)
    protected Map cSort(List list, int col) {
        TypeAdapter a = columnBindings[col].adapter;
        Map result = new HashMap<>(list.size());
        for (Object row : list) {
            try {
                a.target = row;
                Object key = a.get();
                bin(result, key, row);
            } catch (Exception e) {
                // surplus anything with bad keys, including null
                surplus.add(row);
            }
        }
        return result;
    }
(b)
    protected Map cSort(List list, int col) {
        TypeAdapter a = columnBindings[col].adapter;
        Map result = new HashMap<>(list.size());
        list.forEach((row) -> {
            try {
                a.target = row;
                Object key = a.get();
                bin(result, key, row);
            } catch (Exception e) {
                // surplus anything with bad keys, including null
                surplus.add(row);
            }
        });
        return result;
    }
figure 17. pair of code snippet 548. replacing a for loop with the anymatch pattern.
(a)
    private static boolean isAssignableToAnyOf(Class[] typeArray, Object target) {
        for (Class type : typeArray) {
            if (type.isAssignableFrom(target.getClass())) {
                return true;
            }
        }
        return false;
    }
(b)
    private static boolean isAssignableToAnyOf(Class[] typeArray, Object target) {
        return typeArray.stream()
            .anyMatch(type -> type.isAssignableFrom(target.getClass()));
    }

altogether, from these observations, we argue that transformations replacing for loops by calls to the foreach method and the composition of stream operators (chaining) do not make clear the intention of introducing lambda expressions. on the other hand, the other types of transformations have shown benefits, making clear the intention of replacing anonymous inner classes with lambda expressions and of using the other recursive patterns (filter, anymatch, map, and reduce).

the new code is harder to debug. the goal of this sentence was to assess whether or not the introduction of lambda expressions makes the code more difficult to debug. the results in figure 18 show that practically all types of transformations present the side effect of hindering the task of debugging, apart from the transformations that replace anonymous inner classes by lambda expressions. transformations involving calls to the filter method and the chaining of stream api methods received more than 70% of negative responses, that is, respondents either agree or strongly agree that the transformations make the code harder to debug. differently, transformations that replace anonymous inner classes by lambda expressions received 53% of positive answers (respondents consider that this kind of transformation does not hinder debugging activities).

figure 18. summary of the developers' answers to the sentence the new code is harder to debug, per transformation type (anonymous inner class, anymatch, chaining, filter, foreach, map, reduce), on a scale from strongly disagree to strongly agree.

figure 19 shows an example of a transformation that introduces a foreach statement. in this case, 88.33% of the respondents were either neutral or presented a positive feeling that this transformation does not hinder debugging tasks. interestingly, one participant claimed that this transformation made the code harder to debug (due to obfuscating the types of variables), although he/she was still leaning towards considering the transformation beneficial:

"obfuscating the types of the variables used makes the code easier to change, but at the same time may make it harder to debug. i would still perform the transformation though."

figure 19. pair of code snippet 510. replacing a loop with the foreach pattern.
(a)
    public synchronized void addError(Test test, Throwable e) {
        fErrors.add(new TestFailure(test, e));
        for (TestListener each : cloneListeners()) {
            each.addError(test, e);
        }
    }
(b)
    public synchronized void addError(Test test, Throwable e) {
        fErrors.add(new TestFailure(test, e));
        cloneListeners().forEach((each) -> {
            each.addError(test, e);
        });
    }

figure 20 shows an example of a transformation that also makes the code hard to debug (more than 85% of the respondents either agree or strongly agree that this transformation hinders debugging tasks). however, in the opinion of one developer, an improvement to the transformation could actually make the resulting code easier to debug:

"yes (i would perform this transformation), in a hurry, but with a minute more time i'd extract the filter into its own function. however, the suggested refactoring is in itself valuable because it does bring out the important part. if an automated tool did this to a whole codebase, it would make debugging easier, especially for junior developers."

figure 20. pair of code snippet 491. replacing an anonymous inner class with a lambda expression.
(a)
    public File[] getConfigurationResolveReportsInCache(final String resolveId) {
        final String prefix = resolveId + "-";
        final String suffix = ".xml";
        return getResolutionCacheRoot().listFiles(new FilenameFilter() {
            public boolean accept(File dir, String name) {
                return name.startsWith(prefix) && name.endsWith(suffix);
            }
        });
    }
(b)
    public File[] getConfigurationResolveReportsInCache(final String resolveId) {
        final String prefix = resolveId + "-";
        final String suffix = ".xml";
        return getResolutionCacheRoot().listFiles((dir, name) ->
            name.startsWith(prefix) && name.endsWith(suffix));
    }
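a minimal sketch of the improvement the participant suggests for cases like figure 20: extracting the lambda body into a named predicate method, which keeps the compact call site while giving the debugger (and the reader) a named frame to break on. the helper and class names here are ours, not taken from the study:

    import java.io.File;

    class ReportLookup {

        // named predicate: a breakpoint target and a piece of documentation at once
        private static boolean isReportFor(String resolveId, String name) {
            return name.startsWith(resolveId + "-") && name.endsWith(".xml");
        }

        static File[] reportsInCache(File cacheRoot, String resolveId) {
            // the lambda only delegates, so the call site stays compact
            return cacheRoot.listFiles((dir, name) -> isReportFor(resolveId, name));
        }
    }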
in summary, from these observations, we argue that evolving a legacy code to use the stream api and lambda expressions often makes the resulting code harder to debug. this undesired side effect does not happen in the case of transformations from anonymous inner classes into lambda expressions.

5.2.2 how often would you perform this type of transformation?

the purpose of this question was to assess how often developers would perform the set of 98 transformations we explored during the survey. interestingly, besides the possible side effect of hindering debugging activities, respondents presented a positive tendency to accept 72% of the transformations in our dataset: respondents rejected 22% of the transformations and were neutral with respect to 6% of them. nonetheless, when we discarded the transformations involving anonymous inner classes, the number of transformations that the respondents would accept dropped from 72% to 44.44%, and the respondents would reject 50% of the transformations. figure 21 summarizes the responses to this question, which presents options related to frequency (from never to always).

figure 21. summary of the developers' answers to the question how often would you perform this type of transformation?, per transformation type (anonymous inner class, anymatch, chaining, filter, foreach, map, reduce), on a scale from never to always.

it is possible to observe that the respondents would not perform some of the transformations. for instance, the respondents would never or rarely replace a for loop by a call to the foreach method in 50% of the scenarios. we found a similar result when considering transformations that introduce the map recursive pattern. differently, the respondents stated they would either often or always perform transformations replacing for loops by a call to the anymatch method (61%) and inner classes by lambda expressions (60%). table 15 presents a different perspective on the answers to this question, without splitting them by the type of the transformations.

table 15. summary of the answers for the question how often would you perform this type of transformation? (s2q2)

answers      count   percentage   cum. percentage
never           55        8.65%         8.65%
rarely          92        14.5%        23.15%
sometimes      173        27.2%        50.35%
often          160        25.2%         75.5%
always         156        24.5%        100.0%
total          636       100.0%
5.2.3 how important is the automated support for this kind of transformation?

the purpose of this question was to assess the importance of using tools to perform transformations that introduce lambda expressions. figure 22 summarizes the results for this question, where the options range from not important at all to very important.

figure 22. summary of the developers' answers to the question how important is the automated support for this kind of transformation?, per transformation type (anonymous inner class, anymatch, chaining, filter, foreach, map, reduce), on a scale from not important at all to very important.

we can observe in the figure that respondents considered the support of automated tools either moderately important or very important for applying the transformation in more than 50% of the cases. this might indicate that developers prefer to perform these transformations using some code refactoring tool. however, transformations introducing the foreach recursive pattern received most of the responses between not important at all and low importance, which perhaps supports that this particular kind of transformation does not improve the source code. finally, the transformation classified as replacing a for loop with a chaining of operators received most responses in neutral (38%). based on these results, we can argue that developers consider the use of refactoring tools to introduce lambda expressions and rejuvenate java programs worthwhile. however, there is some room for improving these tools, as we discuss possible scenarios in the next section.

5.2.4 synthesis of the responses to the open-ended question

in this section we present a synthesis of the answers to the open-ended question of our second survey, using the thematic analysis procedures we detailed in section 3. we found three recurrent themes that might explain the reasons for accepting a transformation: more succinct code, easier to understand, and clear code intention. we also identified three recurrent themes that might justify why a given transformation should not be applied: small benefit, harder to understand, and wrong scenario for using a lambda expression. finally, several answers claim that the transformations could be improved (the need improvements theme, which appears in transformations marked either as accepted or rejected). several answers provided an alternative to the modified version of the code (often using a textual description, but in a few cases the participants also shared a code example using a gist, a github feature that allows developers to share code). most recommendations to improve the resulting code (i.e., the code after applying a transformation) relate to the source code format, e.g.: "no need of curly braces and semicolon on the second statement" and "i would always perform this transformation, but i would use line breaks and filters to make the code more readable". perhaps refactoring engines that introduce lambda expressions could benefit from advanced code formatting tools (e.g., the approach by parr and vinju (2016)). other possible improvements are trickier, which might indicate the need to follow a careful code review process after applying code transformations (carvalho et al., 2020). for instance, one of the participants argued that:

"[…] streams should produce collections as results, not populate them as side-effects. if we fixed that, and broke to a new line before each transformation or filter, then i think it would be ok."
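a minimal sketch of the style this participant advocates (hypothetical names): the first variant populates a list as a side effect of foreach, while the second lets the pipeline produce the collection as its result:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.stream.Collectors;

    class CollectVersusSideEffect {

        // discouraged: the terminal operation mutates a list defined outside the pipeline
        static List<String> shortNamesSideEffect(List<String> names) {
            List<String> result = new ArrayList<>();
            names.stream().filter(n -> n.length() <= 8).forEach(result::add);
            return result;
        }

        // preferred: the pipeline itself yields the collection
        static List<String> shortNamesCollected(List<String> names) {
            return names.stream()
                        .filter(n -> n.length() <= 8)
                        .collect(Collectors.toList());
        }
    }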
other possible improvements stress the use of the type inference mechanism:

"i don't think you need to specify (file file), do you? you could just say "file" and let the type get inferred [, right]? unless collectionutils.select is overloaded and takes multiple different functional types."

we found that the transformation engines of netbeans ide and rjtl do not explore the type inference mechanism in their refactoring recommendations.
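the point about type inference can be illustrated with a small sketch (a hypothetical example, not one of the surveyed snippets): since java 8, the parameter type of a lambda is inferred from the target functional interface, so the explicit declaration is usually redundant:

    import java.io.File;
    import java.util.function.Predicate;

    class ParameterInference {
        // explicit parameter type, as some generated recommendations spell it out
        Predicate<File> explicit = (File file) -> file.isHidden();

        // equivalent: the type is inferred from the Predicate<File> target type
        Predicate<File> inferred = file -> file.isHidden();

        // a method reference is often the most succinct of the three
        Predicate<File> reference = File::isHidden;
    }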
participants also suggested that the introduction of lambda expressions brings small benefits, and, as such, they would rarely change a legacy code that is working just to introduce new language constructs or idioms:

"i would not rewrite legacy code to introduce a lambda expression in this way, unless the inner code itself would have to be rewritten."

table 16. features that point to code improvements after introducing lambda expressions.

theme: more succinct code (frequency: 19; participants: p203, p285, p749). transformations to introduce lambda expressions that make the code more succinct. representative examples: "yes, perfect case for lambda, short, clear"; "yes, i would because nowadays languages have improved their syntax to provide a better and easy code to developers make their softwares, java 8 introduced lambda, where you can write less code and do more."; "i would sometimes make this change, but not always because it is only making the code more succinct".

theme: easier to understand (frequency: 15; participants: p803, p334, p337). transformations to introduce lambda expressions that make the code more comprehensible. representative examples: "yes. the new code, besides looking cleaner, is also really easier to read and comprehend."; "yes, code readability was a factor"; "easier to read, usually lambda also makes the code cleaner and compact".

theme: clear code intention (frequency: 14; participants: p803, p635, p229). transformations to introduce lambda expressions that make the code clearer. representative examples: "yes, since it looks more "straight forward", and it makes the code itself cleaner"; "i would do it because it's easier to write and the code gets cleaner."; "yes, absolutely, clearer intent, more expressive, easier to read and comprehend.".

table 17. features that point to code worsening after introducing lambda expressions.

theme: small benefit (frequency: 27; participants: p268, p138, p166). transformations to introduce lambda expressions that have little or no benefit. representative examples: "no. the benefit isn't big enough to perform the transformation."; "i believe, in this example, the transformation is a small part of the method and it does not influence positively or negatively at all the legibility of the method."; "i consider both versions to be similar".

theme: harder to understand (frequency: 5; participants: p229, p203, p583). transformations to introduce lambda expressions that make the code less comprehensible. representative examples: "this is still pretty hard to read and understand on account of a) the hard cast of the lambda to callable, which seems weird - is this necessary? isn't it at least a callable? b) why a "checkthat" method is calling "checksucceeds", which seems a little like jumping to a conclusion."; "maybe not a complex return on one line"; "i would never perform this transformation. the for loop makes it clear and explicit that we are iterating over the elements in the collection - it is a fundamental part of the language that we all understand.".

theme: wrong scenario (frequency: 5; participants: p694, p547). transformations to introduce lambda expressions that should not be done. representative examples: "since this is a void method it will, by definition, never be truly functional. splitting the original code into a map – with a side effect, no less! – and a terminal operation with foreach construct does not really improve anything in my mind."; "i tend to avoid try-catch in lambda expressions. i don't think it's bad to do so, but i personally don't do it, even if it means using an anonymous inner class.".

table 16 and table 17 summarize the frequency of the recurrent themes. as future work, our goal is to consider the answers to this open-ended question to improve the rjtl implementation. all code snippets and datasets we used in our research are available on the paper's companion website (https://waltim.github.io/jserd.html).

6 discussion

as explained in the previous section, we found conflicting results in our research. in the first phase, the models for estimating readability diverge from one another. that is, the buse and weimer (2010) model suggests that when a developer introduces a lambda expression into a java legacy method, the readability of the method decreases. differently, the model of posnett et al. (2011) suggests that the introduction of lambda expressions does not impact program comprehension in the first phase. contrasting, in the second phase, both models suggest that the introduction of lambda expressions decreases program comprehension. the main difference between the two phases is that the second one only considers transformations suggested by automated tools. perhaps manual transformations fix some problems related to readability. nonetheless, the results of the qualitative assessments with practitioners suggest that the introduction of lambda expressions improves program comprehension in particular cases. for instance, the replacement of anonymous inner classes by lambda expressions often improves readability, according to the results of our surveys. other scenarios where the introduction of lambda expressions might be positive are the replacement of for loops with simple recursive patterns like filter and anymatch. we believe that these conflicting results are partially due to the limitations of both models in identifying improvements in finer-grained transformations. considering the results of both quantitative and qualitative studies, we answer our research questions in section 6.1 and present some lessons learned in section 6.2. finally, we present some threats to the validity of our study in section 6.3.

6.1 answers to the research questions

when using a mixed-methods approach, the best scenario occurs in situations where the results of a quantitative study support the findings and explain the results of the qualitative ones (or vice-versa). considering table 18, which combines the results of the quantitative and qualitative assessment for the transformations that replace anonymous inner classes with lambda expressions, it is possible to observe differences between the outcomes of both readability models and the developers' perceptions of code comprehension.
we are in favor of the results of the qualitative study. therefore, considering our first research question (does the use of lambda expressions improve program comprehension?), our findings revealed that refactoring a legacy code to introduce lambda expressions improves program comprehension in the specific scenarios we discussed earlier.

table 18. number of code snippets whose readability increased, decreased, or was unchanged after replacing anonymous inner classes with lambda expressions suggested by the tools.

evaluator           increased   decreased   unchanged
buse and weimer            23          32           3
posnett et al.             11          43           4
developers                 51           3           4

after these results, we investigated whether the code complexity metrics (sloc and cc), independently, could predict whether a transformation of legacy code to introduce lambda expressions improves the readability of the code. to perform this investigation, we calculated the differences in the sloc (Δs) and cc (Δcc) metrics, considering the code snippets before and after the introduction of lambda expressions. after that, we ran pearson's correlation test (mukaka, 2012) to assess whether these differences correlate with possible improvements in program comprehension according to the survey respondents. we found that Δcc has no relation to the answers of developers about comprehension. on the other side, Δs presents a moderate correlation (ρ = 0.5324 and p-value < 0.05). such results revealed that the greater the reduction in lines after the introduction of lambda expressions, the better the comprehension of the code according to the developers' opinion, independently of whether the cyclomatic complexity was reduced or not. therefore, tool developers could use sloc to automatically learn good situations in which to suggest transformations that introduce lambda expressions.
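for completeness, the coefficient computed by pearson's test over the metric differences $x_i$ and the respondents' preference scores $y_i$ is the usual product-moment correlation:

$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^{2}}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^{2}}}$$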
regarding the second research question (does the introduction of lambda expressions reduce source code complexity?), after assessing the impact of introducing lambda expressions in 158 pairs of code snippets (66 from the first phase and 92 from the second phase of this research), we found that introducing lambda expressions (a) reduces the size of the code (sloc) in 70% of the cases and (b) reduces the cyclomatic complexity in 40% of the cases. only in a few cases did the introduction of lambda expressions increase sloc. we did not find any case in which a transformation increases cyclomatic complexity. considering our third research question (what are the most suitable situations to refactor code to introduce lambda expressions?), we found that replacing an anonymous inner class by a lambda expression might be considered the killer application for introducing lambda expressions in legacy java code. in addition, scenarios replacing for loops having an internal conditional with an anymatch operator often improved the readability of the code and made the intention of using the lambda expression clearer. differently, just replacing a simple for over a collection statement with a collection.forEach() did not bring any benefit, according to the participants of our surveys. we also found that the chaining of stream methods and the introduction of recursive patterns (e.g., filter and map) hinder debugging activities according to the developers.

regarding our fourth research question (how do practitioners evaluate the effect of introducing lambda expressions into a legacy code?), developers agreed that the introduction of lambda expressions improves the quality of the code (in particular when removing the boilerplate code related to anonymous inner classes), though it might introduce some challenges to debugging activities in general. developers would actually accept most of the rjtl, netbeans, and intellij transformations (72%), and they considered the existence of automated support to introduce lambda expressions, and thus rejuvenate java legacy code, worthwhile. finally, with respect to our last research question (what is the practitioners' opinion about the recommendations from automated tools to introduce lambda expressions?), the results suggested that the use of automated tools to rejuvenate java programs is promising. again, considering only recommendations from netbeans ide, rjtl, and intellij ide, developers agreed that transformations replacing anonymous inner classes by lambda expressions improve program comprehension. still, the feedback from the participants revealed several weaknesses of these tools, and thus we found some space to improve these refactoring engines, as we discuss in the next section.

6.2 lessons learned

need for reviewing comprehensibility models. the state-of-the-art models for estimating code readability could not capture the benefits of introducing lambda expressions that the participants of our survey report. we believe that a further investigation is necessary in order to understand whether these models fail to capture the benefits of fine-grained transformations similar to the introduction of lambda expressions, or whether they also fail when evaluating general transformations such as more popular refactorings. nonetheless, both models are sensitive to code formatting decisions, including the number of blank characters. similar conclusions have been reported in a recent research work by fakhoury et al. (2019).

recommendations for refactoring tools. we found that transforming an anonymous inner class into a lambda expression is the scenario that brings the most benefits for code comprehension. we also found that replacing for loops having an internal conditional by the anymatch and filter patterns improves code readability. nonetheless, we consider that it is not recommended to blindly apply automatic transformations from simple for loop statements into a collection.forEach() statement. this kind of transformation does not improve code readability. several features might also help to identify the situations where introducing a lambda expression does not improve the code. for example, according to the participants, we should avoid combining the functional and imperative styles in the same method. similarly, several transformations led to pieces of code with wrong indentation (e.g., comprising long lines or unnecessary curly braces). according to the practitioners, some recommendations decreased the readability of the code due to indentation issues.

6.3 threats to validity

there are two main threats to our work. first, our results depend on the representativeness of the code snippets used in the investigation.
although we used a sample from real scenarios that introduce lambda expressions in legacy code, this sample might not correspond to the representative population that would be needed to support our quantitative assessment. we evaluated nine pairs of code snippets in the first survey. to circumvent such a threat, we replicated the study and evaluated 92 pairs of code snippets. this number is similar to the number of code snippets evaluated in a previous study (posnett et al., 2011). the second threat is related to external validity. initially, our research participants were a relatively small group of professional developers who, despite having great experience in java, came from our own circle. during the replication of the study, we were able to significantly increase the number of participants from different locations in the world. we believe that, with this variety of participants, our results became more reliable, allowing us to generalize our findings to this population. finally, we could have used other models to estimate readability, which have been previously discussed in the literature (scalabrino et al., 2016). however, we only found an implementation of one of these models, the one by buse and weimer (2010). we also implemented the computation for an additional model by posnett et al. (2011), but it would be difficult to provide implementations for all models available in the literature.

7 final remarks

in this paper we presented the results of a mixed-methods investigation (i.e., using quantitative and qualitative methods) of the impact on code comprehension of the adoption of lambda expressions in legacy java systems. we used two state-of-the-art models for estimating code comprehension (buse and weimer, 2010; posnett et al., 2011), and found conflicting results. both models (posnett et al., 2011; buse and weimer, 2010) suggested that the introduction of lambda expressions does not improve the comprehensibility of the source code. differently, the results of the qualitative studies (surveys with practitioners) indicated that the introduction of lambda expressions in legacy code improves code comprehension in particular cases (particularly when replacing anonymous inner classes by lambda expressions). after considering these conflicting results, we argue that (a) this kind of source code transformation improves software readability for specific scenarios and (b) we need more advanced models to understand the benefits on program comprehension after applying finer-grained program transformations.

acknowledgements

we would like to thank the anonymous reviewers for their valuable comments, which helped us to improve the quality of this paper. this work was partially supported by fap-df (http://www.fap.df.gov.br/), research grant 05/2018.

references

alqaimi, a., thongtanunam, p., and treude, c. (2019). automatically generating documentation for lambda expressions in java. in proceedings of the 16th international conference on mining software repositories, msr '19, pages 310–320, piscataway, nj, usa. ieee press.

avidan, e. and feitelson, d. g. (2017). effects of variable names on comprehension: an empirical study. in scanniello, g., lo, d., and serebrenik, a., editors, proceedings of the 25th international conference on program comprehension, icpc 2017, buenos aires, argentina, may 22-23, 2017, pages 55–65. ieee computer society.
baggen, r., correia, j. p., schill, k., and visser, j. (2012). standardized code quality benchmarking for improving software maintainability. software quality journal, 20(2):287–307.

buse, r. p. l. and weimer, w. (2010). automatically documenting program changes. in pecheur, c., andrews, j., and nitto, e. d., editors, ase 2010, 25th ieee/acm international conference on automated software engineering, antwerp, belgium, september 20-24, 2010, pages 33–42. acm.

carvalho, a., luz, w. p., marcilio, d., bonifácio, r., pinto, g., and canedo, e. d. (2020). c-3pr: a bot for fixing static analysis violations via pull requests. in kontogiannis, k., khomh, f., chatzigeorgiou, a., fokaefs, m., and zhou, m., editors, 27th ieee international conference on software analysis, evolution and reengineering, saner 2020, london, on, canada, february 18-21, 2020, pages 161–171. ieee.

dantas, r., carvalho, a., marcilio, d., fantin, l., silva, u., lucas, w., and bonifácio, r. (2018). reconciling the past and the present: an empirical study on the application of source code transformations to automatically rejuvenate java programs. in oliveto, r., penta, m. d., and shepherd, d. c., editors, 25th international conference on software analysis, evolution and reengineering, saner 2018, campobasso, italy, march 20-23, 2018, pages 497–501. ieee computer society.

dos santos, r. m. and gerosa, m. a. (2018). impacts of coding practices on readability. in khomh, f., roy, c. k., and siegmund, j., editors, proceedings of the 26th conference on program comprehension, icpc 2018, gothenburg, sweden, may 27-28, 2018, pages 277–285. acm.

fakhoury, s., roy, d., hassan, s. a., and arnaoudova, v. (2019). improving source code readability: theory and practice. in proceedings of the 27th international conference on program comprehension, icpc '19, pages 2–12, piscataway, nj, usa. ieee press.

favre, j.-m., lämmel, r., schmorleiz, t., and varanovich, a. (2012). 101companies: a community project on software technologies and software languages. in furia, c. a. and nanz, s., editors, objects, models, components, patterns, pages 58–74, berlin, heidelberg. springer berlin heidelberg.

godfrey, m. w. and german, d. m. (2008). the past, present, and future of software evolution. in 2008 frontiers of software maintenance, pages 129–138.

gopstein, d., iannacone, j., yan, y., delong, l., zhuang, y., yeh, m. k.-c., and cappos, j. (2017). understanding misunderstandings in source code. in proceedings of the 2017 11th joint meeting on foundations of software engineering, esec/fse 2017, pages 129–139, new york, ny, usa. acm.

gyori, a., franklin, l., dig, d., and lahoda, j. (2013). crossing the gap from imperative to functional programming through refactoring. in proceedings of the 2013 9th joint meeting on foundations of software engineering, esec/fse 2013, pages 543–553, new york, ny, usa. acm.

khatchadourian, r., tang, y., bagherzadeh, m., and ahmed, s. (2019). safe automated refactoring for intelligent parallelization of java 8 streams. in proceedings of the 41st international conference on software engineering, icse '19, pages 619–630, piscataway, nj, usa. ieee press.

landman, d., serebrenik, a., bouwers, e., and vinju, j. j. (2016). empirical analysis of the relationship between cc and sloc in a large corpus of java methods and c functions. journal of software: evolution and process, 28(7):589–618.
lehman, m. m. and ramil, j. f. (2001). rules and tools for software evolution planning and management. annals of software engineering, 11(1):15–44.

lott, s. f. (2018). functional python programming: discover the power of functional programming, generator functions, lazy evaluation, the built-in itertools library, and monads. packt publishing ltd.

lucas, w., bonifácio, r., canedo, e. d., marcilio, d., and lima, f. (2019). does the introduction of lambda expressions improve the comprehension of java programs? in do carmo machado, i., souza, r., maciel, r. s. p., and sant'anna, c., editors, proceedings of the xxxiii brazilian symposium on software engineering, sbes 2019, salvador, brazil, september 23-27, 2019, pages 187–196. acm.

mazinanian, d., ketkar, a., tsantalis, n., and dig, d. (2017). understanding the use of lambda expressions in java. proc. acm program. lang., 1(oopsla):85:1–85:31.

mukaka, m. m. (2012). a guide to appropriate use of correlation coefficient in medical research. malawi medical journal, 24(3):69–71.

overbey, j. l. and johnson, r. e. (2009). regrowing a language: refactoring tools allow programming languages to evolve. in proceedings of the 24th acm sigplan conference on object oriented programming systems languages and applications, oopsla '09, pages 493–502, new york, ny, usa. acm.

parr, t. and vinju, j. j. (2016). towards a universal code formatter through machine learning. in van der storm, t., balland, e., and varró, d., editors, proceedings of the 2016 acm sigplan international conference on software language engineering, amsterdam, the netherlands, october 31 - november 1, 2016, pages 137–151. acm.

pennington, n. (1987). stimulus structures and mental representations in expert comprehension of computer programs. cognitive psychology, 19(3):295–341.

posnett, d., hindle, a., and devanbu, p. t. (2011). a simpler model of software readability. in van deursen, a., xie, t., and zimmermann, t., editors, proceedings of the 8th international working conference on mining software repositories, msr 2011 (co-located with icse), waikiki, honolulu, hi, usa, may 21-28, 2011, proceedings, pages 73–82. acm.

riaz, m., mendes, e., and tempero, e. (2009). a systematic review of software maintainability prediction and metrics. in 2009 3rd international symposium on empirical software engineering and measurement, pages 367–377.

scalabrino, s., linares-vásquez, m., poshyvanyk, d., and oliveto, r. (2016). improving code readability models with textual features. in 2016 ieee 24th international conference on program comprehension (icpc), pages 1–10.

shrestha, n., botta, c., barik, t., and parnin, c. (2020). here we go again: why is it difficult for developers to learn another programming language? in proceedings of the 42nd international conference on software engineering, icse.

silva, d., tsantalis, n., and valente, m. t. (2016). why we refactor? confessions of github contributors. in zimmermann, t., cleland-huang, j., and su, z., editors, proceedings of the 24th acm sigsoft international symposium on foundations of software engineering, fse 2016, seattle, wa, usa, november 13-18, 2016, pages 858–870. acm.

storey, m. d., wong, k., and müller, h. a. (2000). how do program understanding tools affect how programmers understand programs? sci. comput. program., 36(2-3):183–207.

stroustrup, b. (2013). the c++ programming language. addison-wesley professional, 4th edition.
tilley, s. r., paul, s., and smith, d. b. (1996). towards a framework for program understanding. in wpc '96, 4th workshop on program comprehension, pages 19–28.

tsantalis, n., mazinanian, d., and rostami, s. (2017). clone refactoring with lambda expressions. in 2017 ieee/acm 39th international conference on software engineering (icse), pages 60–70.

urma, r.-g., fusco, m., and mycroft, a. (2014). java 8 in action: lambdas, streams, and functional-style programming. manning publications co.

von mayrhauser, a. and vans, a. m. (1995). program comprehension during software maintenance and evolution. ieee computer, 28(8):44–55.

wilcoxon, f. (1945). individual comparisons by ranking methods. biometrics bulletin (jstor), 1(6):80–83.

a taxonomy of lambda expression transformations

this appendix introduces a simple taxonomy used to classify the lambda expression transformations. for each member of the taxonomy, we present a brief description and an example.

replacing anonymous inner class with lambda expressions

a developer might use this transformation to convert an anonymous inner class into a lambda expression. figure 23 shows an example of this transformation.

figure 23. pair of code snippet 551. replacing the anonymous inner class.
(a)
    public void runTest() {
        runBeforesThenTestThenAfters(new Runnable() {
            public void run() {
                runTestMethod();
            }
        });
    }
(b)
    public void runTest() {
        runBeforesThenTestThenAfters(() -> { runTestMethod(); });
    }

replacing a for loop with the map pattern

a developer might use this transformation to convert a for loop into the map recursive pattern of the stream api. figure 24 shows an example of this transformation.

figure 24. pair of code snippet 495. replacing a loop with the map pattern.
(a)
    public void draw(Graphics2D g) {
        for (Color c : shapes.keySet()) {
            g.setColor(c);
            g.draw(shapes.get(c));
        }
    }
(b)
    public void draw(Graphics2D g) {
        shapes.keySet().stream().map((c) -> {
            g.setColor(c);
            return c;
        }).forEachOrdered((c) -> {
            g.draw(shapes.get(c));
        });
    }

replacing a for loop with the reduce pattern

a developer might use this transformation to convert a for loop into the reduce pattern of the stream api. figure 25 shows an example of this transformation. in this example, there is a composition between a map and a reduce, though the goal is to reduce a collection of test classes into the number of test methods.

figure 25. pair of code snippet 513. replacing a loop with reduce.
(a)
    public int countTestCases() {
        int count = 0;
        for (Test each : fTests) {
            count += each.countTestCases();
        }
        return count;
    }
(b)
    public int countTestCases() {
        int count = 0;
        count = fTests.stream()
                      .map((each) -> each.countTestCases())
                      .reduce(count, Integer::sum);
        return count;
    }

replacing a for loop with a for-each statement

a developer might use this transformation to convert a for loop into a foreach statement. figure 26 shows an example of this transformation. respondents of our survey do not consider that this kind of transformation improves the quality of the code.

figure 26. pair of code snippet 500. replacing a loop with the foreach pattern.
(a)
    public List<String> getPotentialFixtureClassNames(Set<String> elements) {
        List<String> candidateClassNames = new ArrayList<>();
        if (!isFullyQualified()) {
            for (String packageName : elements) {
                addBlahAndBlahFixture(packageName + ".", candidateClassNames);
            }
        }
        addBlahAndBlahFixture("", candidateClassNames);
        return candidateClassNames;
    }
(b)
    public List<String> getPotentialFixtureClassNames(Set<String> elements) {
        List<String> candidateClassNames = new ArrayList<>();
        if (!isFullyQualified()) {
            elements.forEach((packageName) -> {
                addBlahAndBlahFixture(packageName + ".", candidateClassNames);
            });
        }
        addBlahAndBlahFixture("", candidateClassNames);
        return candidateClassNames;
    }

replacing a for loop with the filter pattern

a developer might use this transformation to convert a for loop into the filter recursive pattern of the stream api. figure 27 shows an example of this transformation. respondents in our survey consider that this type of transformation improves the quality of the code.

figure 27. pair of code snippet 547. replacing a loop with the filter recursive pattern.
(a)
    public ClassPath(List<ClassPath> paths) {
        this.elements = new ArrayList<>();
        this.separator = paths.get(0).getSeparator();
        for (ClassPath path : paths) {
            for (String element : path.getElements()) {
                if (!elements.contains(element)) {
                    elements.add(element);
                }
            }
        }
    }
(b)
    public ClassPath(List<ClassPath> paths) {
        elements = path.getElements().stream()
            .filter(e -> !elements.contains(e)).collect(Collectors.toList());
    }

replacing a for loop with the anymatch method

a developer might use this transformation to convert a for loop and a conditional if into the anymatch method. figure 28 shows an example of this transformation. respondents in our survey consider that this type of transformation improves the quality of the code.

figure 28. pair of code snippet 555. replacing a for loop with the anymatch pattern.
(a)
    private boolean isOverridenWithoutAnnotation(Method[] methods,
            Method superclazzMethod, Class annotation) {
        for (Method method : methods) {
            if (isMethodOverride(method, superclazzMethod)
                    && (method.getAnnotation(annotation) == null)) {
                return true;
            }
        }
        return false;
    }
(b)
    private boolean isOverridenWithoutAnnotation(Method[] methods,
            Method superclazzMethod, Class annotation) {
        return methods.stream().anyMatch(method -> isMethodOverride(method, superclazzMethod)
            && (method.getAnnotation(annotation) == null));
    }

replacing a for loop with a chaining of operators

a developer might use this transformation to convert a for loop into chained stream operators. figure 29 shows an example of this transformation, in which a sequence of distinct patterns (map and filter) is followed by a foreachordered statement.

figure 29. pair of code snippet 493. replacing a loop with a chain of stream operators.
(a)
    private void rememberAllOpenedDocuments() {
        final List docPath = new ArrayList();
        for (XJWindow window : XJApplication.shared().getWindows()) {
            final XJDocument document = window.getDocument();
            if (XJApplication.handlesDocument(document)) {
                docPath.add(document.getDocumentPath());
            }
        }
        AWPrefs.setAllOpenedDocuments(docPath);
    }
(b)
    private void rememberAllOpenedDocuments() {
        final List docPath = new ArrayList();
        XJApplication.shared().getWindows().stream()
            .map((window) -> window.getDocument())
            .filter((document) -> (XJApplication.handlesDocument(document)))
            .forEachOrdered((document) -> {
                docPath.add(document.getDocumentPath());
            });
        AWPrefs.setAllOpenedDocuments(docPath);
    }
replacing a for loop with the anymatch method

a developer might use this transformation to convert a for loop and a conditional if into the anymatch method. figure 28 shows an example of this transformation. respondents in our survey consider that this type of transformation improves the quality of the code.

figure 28. pair of code snippet 555. replacing a for loop with the anymatch pattern.

private boolean isOverridenWithoutAnnotation(Method[] methods, Method superClazzMethod, Class annotation) {
    for (Method method : methods) {
        if (isMethodOverride(method, superClazzMethod)
                && (method.getAnnotation(annotation) == null)) {
            return true;
        }
    }
    return false;
}
(a)

private boolean isOverridenWithoutAnnotation(Method[] methods, Method superClazzMethod, Class annotation) {
    return Arrays.stream(methods)
            .anyMatch(method -> isMethodOverride(method, superClazzMethod)
                    && (method.getAnnotation(annotation) == null));
}
(b)

replacing a for loop with a chaining of operators

a developer might use this transformation to convert a for loop into a chaining of stream operators. figure 29 shows an example of this transformation, where a sequence of distinct patterns (map and filter) is followed by a foreachordered statement.

figure 29. pair of code snippet 493. replacing loop to chain of stream operators.

private void rememberAllOpenedDocuments() {
    final List docPath = new ArrayList();
    for (XJWindow window : XJApplication.shared().getWindows()) {
        final XJDocument document = window.getDocument();
        if (XJApplication.handlesDocument(document)) {
            docPath.add(document.getDocumentPath());
        }
    }
    AWPrefs.setAllOpenedDocuments(docPath);
}
(a)

private void rememberAllOpenedDocuments() {
    final List docPath = new ArrayList();
    XJApplication.shared().getWindows().stream()
            .map((window) -> window.getDocument())
            .filter((document) -> (XJApplication.handlesDocument(document)))
            .forEachOrdered((document) -> {
                docPath.add(document.getDocumentPath());
            });
    AWPrefs.setAllOpenedDocuments(docPath);
}
(b)
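as an editorial aside, the chain in figure 29 (b) still mutates docpath inside foreachordered; the same result can be collected directly from the stream. the sketch below assumes the same xjwindow/xjdocument api as the original snippet and that getdocumentpath() returns a string; it is an illustration, not part of the snippet dataset.

private void rememberAllOpenedDocuments() {
    // illustrative, side-effect-free variant of figure 29 (b):
    // the paths are collected by the stream instead of being added
    // to a pre-created list inside forEachOrdered.
    // assumes handlesDocument is a static method of XJApplication and
    // getDocumentPath() returns a String, as suggested by (a).
    final List<String> docPath = XJApplication.shared().getWindows().stream()
            .map(XJWindow::getDocument)
            .filter(XJApplication::handlesDocument)
            .map(XJDocument::getDocumentPath)
            .collect(Collectors.toList());
    AWPrefs.setAllOpenedDocuments(docPath);
}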
journal of software engineering research and development, 2021, 9:16, doi: 10.5753/jserd.2021.804
this work is licensed under a creative commons attribution 4.0 international license.

on a preliminary theory of communication in distributed software development: a grounded theory-based research

nelson leitão júnior [ federal university of pernambuco | ngslj@cin.ufpe.br ]
ivaldir de farias junior [ pernambuco university garanhuns campus | ivaldir.farias@upe.br ]
hermano moura [ federal university of pernambuco | hermano@cin.ufpe.br ]
sabrina marczak [ pontifical catholic university of rio grande do sul | sabrina.marczak@pucrs.br ]

abstract

communication is one of the leading challenges faced by teams working in a distributed setting; yet, little has been theorized about how communication occurs in such a context. our long-term research goal is to construct a communication theory in distributed software development, aiming to propose a theoretical foundation for future academic studies on the topic of communication and a reference for industry practitioners. to achieve this goal, we are using grounded theory, including an exploratory literature review before the theory construction to confirm the research gap. in this paper, we present a further preliminary version of the communication theory, comprising six theoretical categories and 31 subcategories. the theory brings, up to now, a consolidated body of knowledge and points out the main concepts that define what communication is in distributed software teams.

keywords: distributed software development, communication in software teams, grounded theory

1 introduction

the research community on distributed software development (dsd) has been identifying effective communication as one of the leading issues in distributed teams (carmel, 1999; herbsleb et al., 2005; shah et al., 2012; aranda et al., 2010; clear and beecham, 2019) for some time, especially when considering distribution on a global scale (aranda et al., 2010), named global software engineering (gse). several organizations have been adopting dsd despite weighing its benefits against challenges such as communication breakdowns, which impose risks on the implementation of development projects and affect software process quality (cruzes et al., 2016).

given that the practice of dsd by software organizations dates from the mid-80s (aoyama, 1997) and that the first edition of the most relevant conference on the topic, the international conference on global software engineering (icgse), was held over a decade ago, we recognize that dsd is a business model that is no longer a future trend, but a reality in many software organizations. this context offers a mature scenario for better understanding communication in dsd. considering the inherently challenging nature of communication in dsd teams, due to aspects such as intensive asynchronous communication (clear and beecham, 2019), we argue that little or nothing has been presented in the dsd literature about what communication is in those teams, in the form of a specific scientific theory. moreover, we argue that the emergence of a theory in distributed software development to describe the phenomenon of communication in those teams is an essential step for supporting the future of dsd, as communication-related issues may lead to the disuse of this model. thus, we expect that this theory will be beneficial and can help to support the research effort in the dsd area: first, by offering a scientific and theoretical foundation for future interventions on the improvement of communication in those teams; second, by providing direct support for industry representatives on software engineering fronts, including risk, productivity, and people management in their dsd projects. this paper contributes towards the definition of an analytical theory, i.e., one that aims to "analyze 'what is' as opposed to explaining causality or attempting predictive generalizations" (gregor, 2006, p. 622). this theory has been developed using grounded theory from the ground up, including its self-evaluation embedded within the research process. a first preliminary version was already published (leitão júnior et al., 2019). here, we present the evolution of this theory in a more advanced and detailed version, including one new theoretical category, categories refactored from the first version, and new subcategories added to the theory structure, as well as our full descriptive content of the communication phenomenon, based on the analysis of our most recent data.

the remainder of this paper is organized as follows. section 2 discusses communication in the context of dsd. section 3 presents related work. section 4 presents the research method. section 5 presents the results from the first research step, the exploration of the study gap. section 6 presents the further preliminary version of the theory of communication in distributed software development teams, constructed via grounded theory. section 7 presents a discussion on findings, followed by section 8, which brings the threats to validity and future work.
2 communication and dsd

there are at least three different views on communication theory: communication as a one-way process of meaning construction, in which the sender attempts to construct or reconstruct the receiver's view; as a two-way process, in which two or more individuals construct meaning together; and as an omnidirectional diachronic process, focused on the development of meaning itself (van ruler, 2018). furthermore, when considering communication as a "concept", authors have presented many views on the matter, and there is no agreement on what communication or "to communicate" is (van ruler, 2018). in classical latin, communication refers to "share with", "share out", "make generally accessible", or "discuss" (glare, 1968). further in time, modern authors bound the concept of communication to the concept of "meaning" and its usage to interpret events, as proposed by littlejohn (littlejohn and foss, 1992) and rosengren (rosengren, 2000). rosengren sees communication as a process of meaning creation in psychological, social, and cultural ways, including understanding messages and solving ambiguities (rosengren, 2000). we chose rosengren's (rosengren, 2000) view on communication for this research work because we believe that communication as a "meaning creation" process relates closely to professional software development and its multiple demands.

in this context, how can we place communication in dsd teams? in dsd, communication plays a significant role in the success of projects. it allows team members to share information with stakeholders and clarify issues (farias junior, 2014), both among co-localized and remote team members. therefore, dsd teams largely depend on communication among those involved in the project, either directly or indirectly. in this context, the means of communication have a significant influence on projects in distributed environments (de farias junior et al., 2016), as challenges associated with communication increase when the media chosen to support distributed teams are not as rich as what face-to-face (in-person) communication offers (herbsleb et al., 2001).

3 related work

we highlight in this section three studies that we used as references for this one. those studies supported our methodological decisions and research planning, as follows.

3.1 gregor's work

gregor (gregor, 2006) presented a research essay on the structural nature of theory in information systems (is). the author brought to light the issues of causality, explanation, prediction, and generalization that underlie theories in is. gregor then proposed a taxonomy for the classification of theories in is according to their goals, thereby framing theories on distinct aspects of analysis, explanation, prediction, and prescription. this study served as a reference for a better understanding of the different natures of theories and grounded our decision to delimit our theory as an analytic one.
that is because our theory aims to describe a broad phenomenon by using an extensive analytic process, i.e., a grounded theory (gt) research on communication in dsd teams.

3.2 charmaz's work

charmaz (charmaz, 2014) presented in the mid-1990s her grounded theory (gt) proposal based on the original ideas from glaser and strauss (birks and mills, 2011). her gt was presented as a constructivist-focused proposal (charmaz, 2006; morse et al., 2016), with a greater focus on the voice of participants and their experiences, on the co-construction of data (breckenridge et al., 2012), and on the relativism of multiple social realities (charmaz, 2000, 2006). the gt of charmaz was our main methodological choice, as it presents a proposal in line with our epistemological orientation. furthermore, charmaz's gt brings a proposal compatible with a local context, focusing on the voice of the participants and their experiences, thereby providing the means for better understanding the communication phenomenon in dsd (see section 4).

3.3 adolph's work

adolph (adolph, 2013) performed a field study to understand how software development is managed. in this context, the author used grounded theory (gt) to propose a substantive theory. adolph brought up the discussion on the adequate moment for interacting with an extensive literature review before creating a theory via gt, due to the risk of introducing bias. adolph's work served as a reference for dealing with preexisting literature in this study. based on adolph, we performed an exploratory literature review before the theory construction and plan an extensive literature review after the theory proposal.

4 research method

this study comes from a constructivist philosophical (epistemological) stance, i.e., a philosophy that considers knowledge a result of social construction and truth as relative to its context, favors the interpretation of theoretical terms, and tends toward qualitative methods and the proposal of local theories (easterbrook and neves, 2007). from the point of view of how the problem is approached, this research is of a qualitative nature, as it considers a dynamic relationship between individuals and the real world that cannot be translated into numbers (kauark et al., 2010). furthermore, from the perspective of its objectives, this research is exploratory and descriptive: exploratory since it aims at familiarity with the problem, making it explicit (gil, 2002); descriptive since it aims to describe the characteristics of a given population or phenomenon (gil, 2002), i.e., the communication phenomenon in dsd teams. table 1 summarizes the characteristics of this research.

table 1. research characteristics
epistemology | constructivist
approach | qualitative
objectives | exploratory and descriptive
method | grounded theory

considering that the primary goal of this research is the proposal of a new theory, we have been using the gt method, since it is a research method oriented toward the generation of theories based on rigorous analysis of data (glaser and strauss, 2009). the use of gt is appropriate when there is a lack of knowledge or theory on a certain topic (glaser and strauss, 1967) and when no existing theories offer solutions (chenitz and swanson, 1986).
among the theorists at the vanguard of gt, including barney glaser, anselm strauss (glaser and strauss, 1967), and juliet corbin, and along the evolution of this research method, kathy charmaz, a former student of glaser, proposed in the 2000s a new constructivist grounded theory (cgt) approach. her approach was based on the original ideas from glaser and strauss (birks and mills, 2011) and, when compared to the original form, presents a constructivist focus (morse et al., 2016). thus, in light of our constructivist orientation, we adopted the gt school of charmaz (charmaz, 2014) in this research.

4.1 an overview of charmaz's grounded theory

the constructivist grounded theory of charmaz begins with the proposed research question, followed by the recruitment and sampling of participants, also known as "initial sampling". next come the data collection, both initial and focused coding, and the categorization process. then, based on the emerged categories, the new theory may be proposed and disseminated. in parallel with the entire process (except for the research question specification), charmaz includes the practices of memo writing, the constant comparative method, and theoretical sampling, which support the specification and evolution of theoretical categories. this process goes on in spiral cycles that continue until theoretical saturation, i.e., when fresh data no longer sparks theoretical insights. we detail the main research activities in gt, according to charmaz (charmaz, 2014), as follows:

i. research problem and opening research questions: to be defined on an initial basis;
ii. initial sampling: to provide "a point of departure, not of theoretical elaboration and refinement" (charmaz, 2014), to define the criteria and establish how data will be accessed; in the context of this research, to plan and start our data retrieval process;
iii. gathering rich data (data collection): to collect rich, detailed, and full data;
iv. initial coding: to code data, i.e., to "label bits of data according to what they indicate" (charmaz, 2014, p. 19), a close word-by-word, line-by-line, or segment-by-segment study of the data to begin conceptualizing the ideas;
v. focused coding: to separate, sort, and synthesize substantial amounts of data, i.e., to select initial codes that stand out and to identify others as the result of code comparison;
vi. memo writing: to help with the process of developing ideas by writing extended notes on codes that crystallize meanings; memos are informal analytic notes used to construct theoretical categories, specify their properties, define relationships with other categories, and identify gaps (charmaz, 2014);
vii. theoretical sampling: this kind of sampling differs from the initial sampling process, as stated by charmaz: "initial sampling in grounded theory gets you started; theoretical sampling guides where you go." (charmaz, 2014, p. 197).

4.2 research design

our research design has been performed in two sequential steps. the first step is the gap exploration literature review. as performed by adolph (adolph et al., 2012) in his gt research, we performed a non-extensive (exploratory) literature review to confirm the absence of a specific theory explaining communication in dsd teams and to sustain the choice of gt as the main methodological approach.
the use of the gt method is appropriate when there is a lack of knowledge or theory on a topic (glaser and strauss, 1967) or when no existing theories offer solutions (chenitz and swanson, 1986); see section 5 for the results of this research step. at this point, we avoided performing an extensive review prior to the theory construction, as we are aware of the discussion in the literature about using existing literature during gt research and the possible introduction of bias into the theory to come (dunne, 2011). therefore, we plan to perform a later extensive literature review as future work.

the second step is the theory construction via grounded theory, following charmaz's specification (charmaz, 2014). the gt approach is appropriate for this research for being a research method oriented toward the generation of theories (glaser and strauss, 2009) and for allowing researchers to study social interactions and people's behavior (glaser and strauss, 1967), in which the communication phenomenon resides. furthermore, charmaz's specific gt proposal fits this research context due to our constructivist philosophical (epistemological) orientation, i.e., this approach has a greater focus on the voice of participants and their experiences, on the co-construction of data (breckenridge et al., 2012), and on the relativism of multiple social realities (charmaz, 2000).

4.3 data collection

we performed the data collection process via interview sessions according to charmaz's intensive interviewing specification, i.e., open-ended interviews as one-sided conversations, gently guided by the interviewer, focusing on the interviewee's perspective, meanings, and experience (charmaz, 2014). intensive interviewing in the context of gt is essentially a technique that aims at obtaining answers to the base question of "what's happening here?" (glaser, 1978), not necessarily relying on predefined questions, as the interviewer should further develop questions on the emerging concepts. nevertheless, charmaz recommends the construction of an interview guide for new researchers, to help think about the kinds of questions that achieve the research objectives (charmaz, 2014, p. 62). therefore, as this was our first incursion into the gt method, we prepared an initial interview guide, followed by subsequent versions of this document.

table 2. characterization of the current research sample
id | sessions | ids | α | β | γ | δ | ϵ | ζ | role
a | 1 | i01 | 12 | 2 | 2 | 15 | n | l | portfolio manager
b | 1 | i02 | 20 | 11 | 11 | 40 | g | l | qa lead
c | 1 | i03 | 19 | 10 | 2 | 30 | g | l | designer
d | 2 | i04; i05 | 22 | 12 | 6 | 30 | g | m | proj. mngr. (pm)
e | 2 | i06; i09 | 20 | 14 | 10 | 15 | g | m | sw. engineer
f | 2 | i07; i08 | 30 | 15 | 13 | 20 | g | l | tech. lead
g | 1 | i10 | 15 | 6 | 6 | 10 | g | l | pm & researcher
α experience in it (in years). β experience in dsd (in years). γ experience in dsd including cultural diversity (in years). δ maximum number of individuals in a dsd project. ϵ dsd dispersion level: national (n), continental (c), or global (g). ζ organization size: large (l); medium (m); small (s); micro (m), according to european union (eu) recommendation 2003/361.

table 3. gt artifacts totals after 10 iterations
gt artifact | totals
all quotations | 1167
quotations in use | 858
initial codes | 633
focused codes | 42
memos | 48
theoretical categories | 6
we present a sample of the questions that composed the two initial versions of those guides, as follows:

i. please tell me about your work in dsd: how does it happen? how does the workflow go, and what is your role in this process?;
ii. how do you communicate with your boss or leader?;
iii. how does communication happen with your co-localized colleagues?;
iv. is communicating with your remote colleagues different from communicating with co-localized ones? if so, in which aspects is it different?;
v. how do you feel when trying to communicate with your remote colleagues?;
vi. how is communicating with remote and local colleagues from a distinct culture?;
vii. how do you feel when trying to communicate with colleagues from other cultures?

please keep in mind that we adapted those questions and created new ones during ongoing and posterior interviews, as emerging concepts demanded further data collection, i.e., as an operationalization of the theoretical sampling procedure (see item vii in section 4.1). in this context, we invited 12 dsd professionals, resulting in 5 rejections (or omissions) and the acceptance of 7 professionals, who took part in 10 interviews. as we are also practitioners and researchers in software engineering, we used the help of our professional contacts in the software industry and the linkedin social network for identifying and reaching professionals compatible with our research sample profile, i.e., professionals with relevant experience in it and dsd teams, performing as practitioners or researchers. we first attempted an informal invitation to those professionals and later proceeded with the formal invites. we achieved a response rate of 58.33%. table 2 presents the professional characterization of these individuals, with emphasis on their professional dsd experience.

4.4 data analysis

this research is ongoing, and up to the elaboration of this paper, we performed 10 iterations of gt analysis, one for each interview transcript. we are using the atlas.ti (available at atlasti.com) computer-assisted qualitative data analysis software (caqdas) tool to assist in the process of uncovering the communication phenomenon within the interview transcripts. in this context, we present in table 3 an overall view of the current number of gt artifacts, to better situate the ongoing analysis effort, based on metrics from the atlas.ti tool. we performed the analysis process by codifying, writing memos, establishing relations, and comparing data. the theory is in its second preliminary version, emerged from the data itself, grouped in the hierarchical concepts of initial codes, focused codes as subcategories, and focused codes as theoretical categories. please notice that we did not perform any procedural approach for linking categories to subcategories, e.g., axial coding. instead, we performed the link between those elements by establishing relations from the data itself, without preconceived ideas or procedures, as an emerging rather than a procedural approach for linking data, just as suggested by charmaz (charmaz, 2014, p. 148).

5 results from the gap exploration review

to confirm the absence of a specific theory that explains the phenomenon of communication in dsd teams, we performed an exploratory literature review. we used the ieee xplore, acm digital library, and google scholar knowledge bases, with the search string as follows:

(s1): "theory" and ("communication" or "communicative") and ("dsd" or "distributed software development" or "gsd" or "global software development")

next, we adapted the search string (s1) to the required syntax of each search engine, maintaining the same semantics, and executed it in the selected knowledge bases on both title and abstract fields. table 4 presents the results.

table 4. search results
source | matches
ieee xplore | 13
acm digital library | 9
google scholar | 6
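to make the semantics of (s1) concrete, the sketch below shows how a record could be checked against the string over its title and abstract fields. it is an editorial illustration of the boolean semantics only, under the assumption of a plain substring match; it does not reproduce the syntax actually submitted to any of the engines.

import java.util.List;

class S1Matcher {
    // the dsd-related alternatives of search string (s1)
    private static final List<String> DSD_TERMS = List.of(
            "dsd", "distributed software development",
            "gsd", "global software development");

    // true when the combined title and abstract satisfies (s1):
    // "theory" and ("communication" or "communicative") and (a dsd term)
    static boolean matches(String title, String abstractText) {
        String text = (title + " " + abstractText).toLowerCase();
        boolean hasTheory = text.contains("theory");
        boolean hasCommunication = text.contains("communication")
                || text.contains("communicative");
        boolean hasDsdTerm = DSD_TERMS.stream().anyMatch(text::contains);
        return hasTheory && hasCommunication && hasDsdTerm;
    }
}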
after removing duplicates, we assessed the relevance of the identified papers for this research according to the inclusion criterion (ic1): primary studies on the phenomenon of communication in dsd which propose a new theory in this context or which refer to existing ones as part of their theoretical basis, methodological approach, or research goals; and the exclusion criterion (ec1): non-english-language papers. finally, by applying both the ic1 and ec1 criteria, we selected 12 studies, and within those, eight referred to existing theories, as follows: the social network theory (travers and milgram, 1967); the social presence theory (short et al., 1976); hofstede's cultural dimension theory (hofstede, 1983); the media richness theory (daft and lengel, 1986); the media synchronicity theory (dennis and valacich, 1999); the media switching theory (robert and dennis, 2005); the open-coopetition theory (teixeira, 2014); and the socio-technical theory of coordination (herbsleb, 2016).

next, we built a mapping between all identified theories and a set of factors that influence communication in dsd (farias junior, 2014, p. 110), as an additional procedure in this research step. by performing this additional procedure, we aimed at further deepening our findings through the identification of evidence of concepts not directly addressed in the dsd communication context. those factors were based on the results of a tertiary systematic literature review on communication in dsd projects (dos santos et al., 2012), which aimed at consolidating knowledge about communication in this context. that study identified 29 factors based on 20 identified secondary studies, in the attempt to answer the research question "which factors influence communication in distributed software development projects?" (dos santos et al., 2012, p. 1). in this context, our mapping aimed at identifying whether each theory addresses each communication factor, i.e., whether the theory appears to describe or explain each respective factor, as presented in figure 1.

figure 1. mapping between identified theories and communication factors in dsd.

some dsd communication factors were addressed by one or more theories, such as the factor "cultural differences" by hofstede's cultural dimension theory and "team awareness" by the socio-technical theory of coordination. on the other hand, factors such as "language or linguistic barriers" and "definition of roles and responsibilities" were not addressed by any of the theories (figure 1). in short, we present as follows the main findings from this mapping activity:
i. none of the identified studies proposed a new and specific communication theory in dsd;
ii. none of the referred theories in the studies is specific to the dsd context;
iii. none of the referred theories addressed 14 of the 29 communication factors in dsd;
iv. none of the referred theories could address, alone, all the communication factors in dsd.

at this point, we may notice that the first and most critical finding of our exploratory review is that we identified no specific communication theories for dsd. this finding suggests a theoretical gap in the literature and supports the choice of gt as our primary research method. this situation is corroborated by the additional findings that the identified theories are not specific to dsd and that 14 factors did not appear to be part of their theoretical content. in this context, at least some of those 14 factors are closely related to the communication context of dsd teams, i.e., to the diverse and distributed nature of their members; among those, we may cite "language or linguistic barriers", "communication skills", "temporal distance", and "synchronization of work schedules". those circumstances lead us to the conclusion that a communication theory for the specific context of dsd teams will have its place in the literature.

6 a further preliminary theory of communication in distributed software development

we present as follows a further preliminary version of the theory of communication in distributed software development teams, as an evolution of its first version (leitão júnior et al., 2019). the theory includes all its theoretical categories and subcategories as the constructs of the communication phenomenon (figure 2). next, we present the descriptive content of all theoretical categories based on their memos; together with figure 2, the following content describes the communication phenomenon in dsd teams. furthermore, as previously stated, we are constructing this theory via charmaz's grounded theory (gt) proposal (see section 4), i.e., all its contents are grounded in the data itself, from all interview transcripts. therefore, we also present along with each component some of the quotations that supported its emergence as a theoretical element.

6.1 considering universal communication enablers (i.)

dsd teams include distinct human contexts, and the efficacy of communication in those teams is affected by the personality of the individuals themselves, i.e., how they are, how they present themselves in their working teams, and their natural inclination to proximity in working relations. this theoretical category describes the role of universal communication enablers in dsd teams, i.e., factors that are relevant to the communication aspect in diverse professional and everyday human contexts, including dsd teams, through its subcategories, as follows.

6.1.1 communicating face-to-face (i.a.)

traditional face-to-face (in-person) communication is part of the communication context of dsd teams. dsd team members recognize face-to-face communication as being more effective than other remote or asynchronous communication means, even when those individuals perfect remote communication with time and effort. communicating face to face benefits from the usage of spatial resources: being physically close to individuals, perceiving movements, and using spatial elements such as physical boards and adhesive notes.
face-to-face communication enables eye-to-eye contact in the communication act and supports a better understanding of the best moment for interrupting and stating other viewpoints, without talking over one another or rushing. in this context, dsd team members perform face-to-face (in-person) meetings under circumstances that include fast co-localized discussions, project milestone presentations, discussion of complex matters, feedback collection, and reviewing overall expectations with clients.

examples of quotations in this component:
i. "yeah, when you're hiding behind the computer, the chances of you're running over, of letting something pass, i think it's greater than when you're in person, looking and watching people." (translated);
ii. "i think that you, face-to-face, you have there, the exchange of looks, the person's posture, it is, you know the most assertive moment of you interrupting a reasoning and being able to talk about running over the other." (translated);
iii. "he spent two, three days together it flowed... we solved a number of things." (when questioned about face-to-face efficacy) (translated).

6.1.2 communicating informally (i.b.)

informality in the communication context of dsd teams refers to a lesser preoccupation with how the message will be received, i.e., a more relaxed and natural communication style between closer teammates. communicating informally also refers to a natural and eventually ludic way of expressing oneself. it relates to a flat hierarchy, which supports better opportunities for everyone to speak. additionally, it usually implies not documenting the discussed matters. informal communication occurs under circumstances that include fast and lightweight meetings, such as daily ones; discussions after formal communication attempts on project scope, i.e., when the operational work begins; and working with colleagues in the same role, e.g., communication between developers. informal communication is a common communication approach in dsd teams. it has the potential to positively impact the efficacy of communication, as it supports the removal of communication barriers and collaboration in those teams.

figure 2. a taxonomical representation of the further preliminary version of the theory of communication in distributed software development.

examples of quotations in this component:
i. "when work begins, communication is completely informal!" (translated);
ii. "that, that is, 80% and 75% of the communication is informal!" (translated);
iii. "i would not say more effective, i do not know exactly if it was more effective, i would say it is more, it is ... comfortable!" (when questioned about the efficacy of informal communication) (translated).

6.1.3 considering affinity (i.c.)

the nature of affinity between members is part of the communication context in dsd as a relevant factor for communication efficacy in those teams. affinity between dsd members supports informal communication, i.e., communicating without worrying too much about how the message will be received, in a synergic relation. eventually, the lack of affinity is associated with formality in the communication act. affinity also emerges with time in teamwork, e.g., through interaction with good, work-capable colleagues that support the integration of team members, even under challenging schedules.
additionally, affinity may come with the ones that bring knowledge and a sense of security to the team. thus, dsd team members will probably develop an affinity with at least specific colleagues. nevertheless, the lack of affinity leads to discomfort in the communication attempt, as a negative experience with barriers to effective communication, slower communication timing within those teams, and less availability of team members to communicate.

examples of quotations in this component:
i. "so, the difference is, that person has less resistance to listen to your ideas, has less resistance to perform their work, has less resistance to help you, has less resistance to ask for help, it's a completely different job!" (when questioned about communicating with affinity);
ii. "it's that person that you say: 'with this guy i'm safe!', because he does everything, he is not afraid to work, he comes over the weekend, he, studies, he ... is the, this is the member of your team you want!" (when questioned about working with affinity) (translated).

6.1.4 communicating remotely (i.d.)

dsd team members communicate remotely with colleagues under the circumstances of discussing professional matters with both remote and local individuals. this way of communicating is a distinct approach when compared with in-person communication. when co-localized, dsd members maintain an open channel with colleagues, which is eventually inevitable, supporting informal and non-work-related communication as part of their work lifecycle. when co-localized, dsd team members are also prone to identifying specific aspects of the everyday life of their colleagues, in a healthy communication approach. nevertheless, when communicating remotely, this day-to-day awareness of colleagues is hindered, as communication occurs in sessions, in a non-continuous approach, making it difficult to understand the organizational culture of remote dsd cells. thus, dsd team members may recognize remote communication as a colder approach due to the distance between individuals. also, remote communication may demand an extra effort to establish a proper communication channel in dsd teams, which can be seen by some as an issue to be dealt with.

examples of quotations in this component:
i. "...where you have people from all over the world, you are multicultural, then whatever, or with a colleague who is there next door, sit at the table next door or if i'm talking to someone on skype, who's on the other side." (when questioned about remote communication) (translated);
ii. "when you are communicating with, with a person remotely, there is the coldness of distance, so your interpersonal work it is twice as much, the energy you spend, if not triple is twice as much!" (regarding the effort to communicate remotely) (translated).

6.1.5 communicating formally (i.e.)

the concept of formality in dsd team communication refers to a mostly documented, or textual, communication approach, for reasons such as the need to disclose problems and the need to establish documentation with stakeholders for further reference. email-based communication can eventually be characterized as formal, as can meetings that imply some degree of documentation. furthermore, formal communication also refers to a planned communication attempt, including the usage of strategies and protocols for a communication act.
a well-prepared formal communication attempt in dsd teams tends to be clear and less prone to noise, but those benefits usually come at the price of extra effort from dsd members, mostly in writing, documenting, and working with the available communication channels. formal communication occurs under circumstances that include communicating with clients who are not in everyday contact with the team, including those from different cultural contexts. formal communication also occurs when communicating with colleagues from remote sites or different cultural contexts, with senior dsd team members, and with any stakeholders who are not in the everyday working context.

examples of quotations in this component:
i. "yeah, the, the change of... basically this is the definition of scope, let's put it this way, definition of scope, change or not, definition of scope uhh, the meeting is formal, so if there is, registration, if you have, uhh, written stories, it happens twice three times, you check with the team, uhh, you usually have two, three meetings, even when everyone is aware of what has to be done." (translated);
ii. "...30, 25% to 20% of formal communication, it only exists when you are, discussing, opening or closing a milestone." (translated);
iii. "...if i talked on the phone, i send an email confirming everything, but there has to be some kind of record, not only verbal of what happened." (translated).

6.1.6 practicing empathy (i.f.)

placing yourself in the other's perspective, i.e., acting with empathy towards others, is part of the communication context in dsd teams. dsd team members recognize that the practice of empathy supports better communication in their teams, as it is understood by some as an essential communication-oriented approach. thus, dsd members may practice empathy towards colleagues for better communication results. the practice of empathy in dsd also supports culturally diverse teams by enhancing communication in a foreign language, e.g., native speakers may place themselves in the position of foreign colleagues and recognize, as well as encourage, their effort to communicate. furthermore, the establishment of empathy between individuals comes easier when affinity takes place; nevertheless, other factors, such as the lack of team spirit, will hinder the development of empathy in dsd teams.

examples of quotations in this component:
i. "...then he said: 'just the effort of you are trying to talk to me, i already see something positive!'" (regarding the recognition from a stakeholder on trying to speak in a foreign language) (translated);
ii. "...yes, i already applied it several times, i applied it to this project, not only with the team, but with the client himself when, you know, he came with, yeah, it's some more delicate issue, i tried to understand their side, what kind of pressure he was under, the context he was working on." (when questioned about the practice of empathy) (translated);
iii. "you have to exercise this when you are in an environment of diversity, cultural, social, etc., you put yourself in the place of the other." (regarding the practice of empathy in culturally diverse teams) (translated).

6.2 including cultural diversity (ii.)

communicating in global dsd teams is a multicultural experience that eventually includes individuals from different parts of the world.
working and communicating in multicultural dsd teams can be a joyful experience, in both the professional and personal dimensions, as for some dsd members, and enthusiasts, communicating in this context is a fluent and transparent experience. for those individuals, being exposed to different ideas and schools of thought, speaking other languages, and learning about the world is a fantastic experience. nevertheless, communicating in cultural diversity can be a challenge, including the need to understand foreign languages, different availability expectations, and cultural differences. in this context, dsd organizations may include institutional directives for better dealing with cultural diversity and may even try to enlist professionals with an inclination towards cultural diversity itself. those organizations may also make an effort to identify culture-related misbehaviors of their members and try to correct those. this theoretical category describes the influence of cultural diversity on the communication context in dsd teams through its subcategories, as follows.

6.2.1 dealing with different communication styles (ii.a.)

working under the circumstances of distinct cultural contexts in dsd teams includes individuals that communicate in a particular way due to culture-related reasons. individuals may communicate more directly and objectively, i.e., speaking straight, bluntly, which may sound like rudeness to professionals who lack experience with other cultural contexts, thereby disturbing communication in dsd teams and eventually supporting a feeling of discomfort and a withdrawn behavior among dsd members. the communication timing may also vary in those teams, as individuals from specific cultural contexts may act more timidly and avoid questions about the development fronts, preferring to work right away. this behavior contrasts with that of others, who may try to exhaustively ask for details on their tasks before beginning the development itself. dsd members may also adopt specific jargon that may impact communication efficacy, even if it is not a critical issue in dsd teams.

examples of quotations in this component:
i. "so, this many times in communication generated a communication disorder, because we started to think that they were being reactive in things that we suggested, you know?" (concerning a stakeholder who communicated in what felt like a blunt approach) (translated);
ii. "there was a person there who was very blunt, it started ... he passed things on to you, i don't know what, but you thought you were doing everything wrong, you were doing everything terrible and then the person... he did not praise us, but ultimately, everything was fine, everything was correct! i don't know what, then it got kind of confusing!" (concerning a stakeholder who did not communicate well his perception of colleagues' work) (translated);
iii. "they are so sincere that they offend and this has generated such heavy conflicts, especially with me, right?" (translated).

6.2.2 dealing with distinct professional expectations (ii.b.)

individuals from different cultural contexts may also come with varying views of expected professional results and ambitions. those may include a faster ascension in the professional career and a clear plan for promotions, mostly when those individuals come from more densely populated areas.
on the other hand, individuals from other cultural contexts may be reticent about being promoted and assuming the extra responsibility that comes with it. moreover, specific dsd team members may even react emotionally to the recognition of their work in the form of a promotion, i.e., being very moved, due to possible culture-related behaviors. furthermore, expectations of professional availability to work and communicate vary under the circumstances of different cultural contexts in dsd teams. individuals from specific cultural contexts may expect an unrestricted and eventually unconcerned extra availability to work and communicate. this expectation may even be supported by the organizational culture of the dsd project clients. on the other hand, in a culturally diverse location, dsd members tend to be more respectful of what concerns the demand for increased availability of individuals, with extra attention to resting times and individual boundaries. nevertheless, when increased availability of its members is commonly required, some dsd organizations may apply extra effort to mitigate this situation. those differences in availability expectations may be poorly communicated within the team; however, with time and further discussions on this matter, this misalignment can be mitigated. thus, managing the professional expectations of team members, mostly in cultural diversity, is a necessity in the broader communication context of dsd teams.

examples of quotations in this component:
i. "so yeah, sometimes the foreigners themselves from the company, not from the client, would send an e-mail, i don't know, at midnight, because they are working anytime!" (translated);
ii. "i'm not saying it's good, it's bad, but unintentionally, people manages expectations, assuming that employees are, yeah, yeah, are [always available]! compared to..." (comparing availability expectations of individuals from a country different from the one he lives in) (translated);
iii. "uhh, the other thing is ambition, so different cultures have different ambitions!" (translated);
iv. "because there are people who are looking for promotion, so wanting to know what is going to happen for them to come up, in the company, uhh and that influences and has to be taken into consideration." (regarding colleagues from a different cultural context) (translated).

6.2.3 considering gender diversity (ii.c.)

working in dsd teams includes communicating in gender diversity as a natural element of those teams. communicating in gender diversity is not usually recognized as a problem in dsd teams, with almost no specificities or issues in the communication act. in this context, dsd team members acknowledge that communication impacts are usually associated not with gender itself, but with the individuals' experience in working and communicating in dsd. nevertheless, rare culture-related specificities may emerge in dsd teams, such as the impression that members in specific cultural contexts prioritize speaking times based on gender, e.g., women speaking after men in the circumstances of meetings. also, as a rare occurrence, individuals can have the impression of being treated as lesser by colleagues in a culturally diverse team, in part for being a woman. additionally, dsd members may express the feeling that leadership being performed mostly by men is unfortunate,
in contrast with others, who may see a more gender-diverse team as being as good as a non-diverse one, since working with men in higher numbers has been the expected scenario in dsd teams and during it studies.

examples of quotations in this component:
i. "i think it has communication impacts of different kinds, yes, but regardless of gender." (on the participant's opinion on gender matters) (translated);
ii. "but i, i would say that you would have to be a little more careful with what you say, with the jokes you take" (regarding informal communication on gender matters) (translated);
iii. "yes, there were two men and a woman on their team and, the woman always spoke last, yes, i don't know if it has to do with whether it was just a coincidence or not, but it always happened that way, you know?" (on the impression of unbalanced speaking times) (translated).

6.2.4 considering different religious beliefs (ii.d.)

working in dsd teams in cultural diversity includes communicating with members of different religious beliefs, usually without relevant impacts or specificities in the communication act. in a culturally diverse location, dsd organizations may deal with the diversity of religious beliefs by stating organizational directives for respecting the religious aspects of their employees (including dsd members). those organizations may even grant dsd members the right to ignore specific organizational rules for religion-related reasons. nevertheless, dsd members will eventually take part in religious practices or events that may interfere with or support specific behaviors in their teams. in this context, eventual communication disturbances may come in the form of absences or restrictions due to religion-related reasons. still, those situations will, in turn, be well handled by dsd leadership.

examples of quotations in this component:
i. "but, because of beliefs, values, no, just misunderstandings." (regarding communication issues on different beliefs) (translated);
ii. "there are, there are people on the team who enter... and they are entering [on a religious ritual] so they cannot eat, they have to leave later, or earlier..." (translated);
iii. "there are, there are these things, religion and culture that affects, but it is not a serious problem, we adapt ourselves!" (translated).

6.2.5 adapting schedules to cultural diversity (ii.e.)

the nature of dealing with specific and different schedules in culturally diverse dsd teams is part of the broader communication context in some dsd teams. even if not necessarily a common situation, absences on the occasion of specific cultural events and related cultural dates, e.g., grand sports events, differing expectations of the frequency and duration of vacations, and culture- or religion-related absences can eventually be misunderstood in those teams, leading to conflicts and discomfort. specific or national holidays can also impact dsd team schedules, e.g., holidays that were not included in the planned schedule: christians can be unavailable during christmas, and the chinese new year may drive the attention of some members. furthermore, in specific cultural contexts, individuals may expect to begin daily working hours much earlier than most of the members from other culturally distinct dsd sites, as well as to leave the office earlier. additionally, some religious rituals may also occur during working hours, which will also drive the attention of dsd members.
thus, dsd leaders often adopt approaches such as a custom, documented schedule for their projects in cultural diversity, to better work and support the communication in their teams.

6.2.6 considering hierarchy (ii.f.)

the concept of the hierarchy of stakeholders is part of the communication context of some dsd teams. this concept is not necessarily a cultural trace, but it has higher relevance under the circumstances of multicultural teams. communicating with individuals at different hierarchical levels may impact the communication or speaking timings for expressing opinions. thus, as a cultural trace, senior members may have a larger window to communicate than junior ones. those hierarchical differences may even prevent effective communication between individuals from specific cultural contexts. colleagues at lower hierarchical levels can eventually omit opinions, and the opinions of higher hierarchical members may be automatically accepted without further discussion. thus, lower hierarchical individuals may adopt a withdrawn behavior when trying to express themselves, being afraid of a wrong understanding by colleagues of higher hierarchical status.

examples of quotations in this component:
i. "so, of course, like this, you have ... there is this hierarchy that you respect but at the same time you will not change your opinion so now, in another culture i no longer know if i would give the same opinion or be quiet!" (translated);
ii. "uhh, there were, there were situations like that, that i didn't agree with... that i thought was not correct. but because the person was at a higher level of hierarchy and the person had more experience, i ended up doing ok, so that's it, but it brings discomfort, for sure." (translated);
iii. "because, what i see and this is a very big cultural difference, is that, you have to respect your level of... if you are more senior, you will speak more, if you are junior, you will speak less, so you have to respect that." (translated).

6.3 adopting communication practices (iii.)

dsd team members include distinct communication practices in their communication context. those individuals may communicate in checkpoints, that is, specific moments during the dsd project in which the overall status needs to be aligned, and may codify messages by documenting and publishing them to effectively disseminate new versions of software artifacts to the whole dsd team, among other practices. this theoretical category describes the communication aspects of adopting communication practices in dsd teams through its subcategories, as follows.

6.3.1 traveling to communicate (iii.a.)

traveling to communicate is a common practice in dsd teams. dsd team members travel to remote dsd cells to perform face-to-face (in-person) communication with their remote colleagues and clients.
those travels support a closer communication style and can eventually be seen as a practice to supplement remote communication. dsd members travel for working and communicating during one or more weeks, or in a smaller time frame for important in-person meetings, e.g., kickoff meetings, and for organizational events or celebrations; also, for meeting remote members in person and supporting the process of team building in those projects. working and communicating during those travels supports a better understanding of the organizational culture of remote dsd sites, such as internal meetings, the physical structure, and the organizational bureaucracy, which are usually not clear in remote communication. traveling to those remote teams also serves the purpose of learning about cultural diversity, i.e., placing dsd members in different cultural contexts for a longer period with the expectation that they will better understand the host culture.

examples of quotations in this component:
i. "every three, four weeks, we went there, physically in rio to make deliveries and do more conversations, even face-to-face conversations, to complement these, these, remote integrations." (translated);
ii. "yes, yes, many times when there will be kickoffs, from projects, or when even celebrations after the end of the project, they travel, they come here..." (translated);
iii. "...and every three or four weeks he had face-to-face communications." (translated).

6.3.2 performing on-demand meetings (iii.b.)

performing on-demand meetings, i.e., promoting meetings under the overall circumstances of project needs, is included in the communication context of dsd teams. dsd team members perform those meetings for reasons that include the definition of the scope to be developed, the understanding of project challenges, project tracking, and the refinement of the design of solutions in their teams. those meetings can be a valuable tool for evidencing lack of understanding and misunderstandings, including the identification of wrong directions in the development, thereby supporting the work progress in those teams. nevertheless, some dsd members may feel that, in some instances, on-demand meetings are not the right approach for communicating. that is because those meetings may lead to excessive discussions, mostly at the beginning of the project, and dsd organizations may eventually impose a time limit on those meetings.

examples of quotations in this component:
i. "uhh, you don't, you don't, you spend more time but then you have a 15-minute, 20-minute, 30-minute meeting, you spend up to an hour, the person is solving the problem, go ahead, go back..." (translated);
ii. "then, for example, it is, sometimes the developers, they, based on, yeah, the features they were doing, they did it as if it were a tech session, to make the alignment of what they were discussing." (translated);
iii. "so, i have several meetings with them to clarify, direct the tests they do." (translated).

6.3.3 communicating in global meetings (iii.c.)

dsd team members also communicate in global meetings, i.e., meetings that include members from all, or multiple, dsd sites. those may be segmented by a specific technical aspect and may consist of participants from various dsd projects, for sharing knowledge and obtaining feedback. dsd members perform global meetings in circumstances that include relevant project dates, such as the end of a sprint.
those meetings are justified for reasons that include presenting or sharing the completed development to all dsd teams, with a focus on the teamwork, not on the individuals by themselves. still, and as an exception, global meetings may be used as an approach for public and positive feedback to individuals who made the difference in the current development cycle or delivery. nevertheless, global meetings may be seen with a lack of appreciation by some team members. that is because those meetings may include a diverse range of matters under discussion, which may be out of the context of specific dsd sites, without any noticeable impact on the work of those teams. examples of quotations in this component: i. “at the company i think there are... designers worldwide and if i remember correctly, what happens, there is a global meeting of the leading designers of company that are, i think, 20, no, 15 no, 18 designers.” (translated); ii. “...the sprint took time, it varied, but it was usually three weeks, we did a review, a review meeting, then all the teams got together and, yes, one representative or two of each team is, presented what was done, and then it was a more global event, there were people from all over the world.” (translated); 6.3.4 adopting individuals as communication endpoints (iii.d.) communication in dsd teams may be performed by team representatives acting as communication endpoints in the circumstances of institutional meetings and development fronts with clients. those individuals may represent a dsd site, clients, a whole region of dsd cells in their respective areas of knowledge (e.g., interface design in latin america), or even the organization itself. communication endpoints may be recognized as an effective approach for communicating in dsd teams, and in this context, leaders have the potential to fulfill the requisites to become communication endpoints. those endpoints will usually be required to discuss critical and interdisciplinary matters in dsd teams. organizations may adopt communication endpoints as part of their culture or directives; nevertheless, a communication endpoint may be adopted only at the beginning of a dsd project. that is because, with time, the team members may feel the need to extinguish this role, as communication may become decentralized in their teams. examples of quotations in this component: i. “they have a central representative of the company who speaks for all locations and some of these people, as they have something to bring, something to discuss, they show up at these meetings, so it’s, like, by a mediator, like a responsible person.” (translated); ii. “i hold meetings with the client, i take questions that the, yeah, the testing team raises.” (translated). 6.3.5 adopting feedback (iii.e.) dsd members adopt feedback practices in their teams as part of their communication context. feedback in those teams may be used as an approach for dealing with failed communication attempts or with the lack of availability of specific members for communicating, as an attempt to change unwanted behaviors in those teams. dsd team members usually perform this practice with a certain degree of anonymity. in this context, dsd teams may even avoid the usage of software tools to better protect the identity of individuals.
dsd teams adopt distinct approaches for collecting feedback data, including enlisting dsd members for individual, one-to-one feedback, and using organizational surveys. those teams may eventually apply a consistent effort in tracking the welfare level of dsd members and may even include matters discussed in feedback sessions in the agendas for the meetings to come, as part of an organizational effort. feedback sessions are usually performed in the circumstances of on-demand in-person meetings, after important milestones such as postmortem meetings, or in a pre-agreed schedule. practicing feedback comes with the potential to achieve better communication results, mostly when those sessions are communication-oriented ones. examples of quotations in this component: i. “yes, uhh, no, i can’t tell you about the quality, but the concern with communication, yes!” (regarding impacts on communication quality on the adoption of feedback) (translated); ii. “we used to do feedback practices in person, we avoided them, and maybe more because of a culture, ours, we avoided making these feedbacks, yeah, remote.” (translated); iii. “yes, yes, especially when the feedback was on this topic, you know, i, i remember for example that, what we discussed a lot, is, the way people should be participating in a remote meeting.” (regarding impacts on communication quality on the adoption of feedback) (translated); 6.3.6 celebrating to communicate (iii.f.) dsd members perform the practice of celebrating together in their teams. that is a common practice in some cultural contexts and a less frequent one in others. dsd organizations support celebrations by giving up working hours of dsd members to participate in those events, and by promoting and, eventually, financing this practice according to their budget. supporting celebrations in dsd teams is justified by reasons that include supporting closer relations, dealing with the isolation of some members, and encouraging teamwork, all in a relaxed, out-of-routine situation. those celebrations have the potential to support outside-work informal communication between dsd team members, helping them to know each other better and thereby supporting the establishment of empathy between those individuals. furthermore, even when an organizational intention for supporting celebrations is not in place, those may also be promoted by the dsd members themselves. celebrations will usually occur under circumstances that include the achievement of milestones or other relevant project victories in dsd teams. examples of quotations in this component: i. “but all of this has a reason, apart from social issues, which is a little different, but apart from that, it’s basically to strengthen uhh, the connections, uhh, interpersonal, communication, uhh, empathy, sympathy...” (translated); ii. “it helps, you know, on a daily basis, we talk and have more, have a better relationship, right, at work.” (translated); iii. “but up to a limit, part was part of the project.” (regarding the organizational support on celebrations) (translated); 6.3.7 adopting daily meetings (iii.g.) daily meetings are part of the communication context in dsd teams, particularly scrum-based ones. those meetings are performed locally, remotely, and eventually globally, with all sites, including remote members, with the help of software tools.
this practice is justified by daily activity tracking, as those meetings support the visibility of the current, specific work status of individual colleagues and of the team as a whole. daily meetings in dsd teams are usually well executed, with the right frequency and a well-defined scope, and dsd members recognize those meetings as a good approach for communicating. dsd team members will usually not document those meetings, but exceptions include registering technical debts that eventually emerge. daily meetings may involve all team members, but in some cases, dsd team members may perform those with representatives from dsd sites, i.e., with an optional presence of some members. dsd members may eventually cancel specific daily meetings, for reasons that include performing team training or other specific and longer meetings. in this situation, those members will usually accumulate matters for the next daily meeting session. examples of quotations in this component: i. “so, we have, daily meetings with everyone together, of course, only those who have a reason to show up!” (translated); ii. “and with the team, we had remote meetings every day, quick meetings, also via skype, to follow the project.” (translated); iii. “that dynamic was this; daily meetings with remote teams, the client interacted extensively at any time...” (translated); 6.4 including different time zones (dtzs) (iv.) working in dsd teams includes communicating across different time zones (dtzs), which can be a challenge for dsd members, as communicating in this context can lead to problems in dsd teams. this theoretical category describes communication in different time zones in dsd teams through its subcategories, as follows. 6.4.1 synchronizing communication in dtzs (iv.a.) dsd team members eventually work with members from remote sites whose work hours may not occur at the same moment, e.g., the beginning of the workday of one dsd cell may be the middle or the end of the workday of another. communicating in dtzs can lead to the desynchronization of information and activities. thus, to mitigate this desynchronization, dsd team members do their best to communicate synchronously, by aligning and planning their overall schedules to enable meetings. nevertheless, attempts to synchronize communication efforts in this context can be hampered by the limited availability of the dsd members themselves, and attempts to work around those limitations can lead to discomfort for those individuals. thus, when synchronization is not fully possible, dsd members may still insist on communicating synchronously, by adopting strategies such as accumulating all the matters for later discussions. examples of quotations in this component: i. “so, it had this kind of impact, but we were organizing, trying to schedule moments, that is, it was a regular time for both teams, right, so, it impacted, but we were resolved.” (regarding the impacts of synchronous communication in dsd teams) (translated); ii. “...there was the question of the time zone and the question of the times was summer time in one place and it wasn’t, so we always changed this time a lot, but whenever it was possible for everyone to participate.” (regarding daily meetings in different time zones) (translated); iii. “it is, it is, complicated, it is difficult, and i often want to leave, it’s my time to leave and they are talking to me because it’s at the beginning of their working day, you know?” (translated).
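to make the schedule-alignment effort described in this subcategory concrete, the following is a minimal sketch that computes the daily window in which two dsd sites could communicate synchronously. the site time zones and the 9:00 to 17:00 working hours are illustrative assumptions for this example, not data from the interviews.

```python
# A minimal sketch, assuming hypothetical sites and 9:00-17:00 working
# hours; it computes the shared window in which two DSD cells can
# communicate synchronously on a given day.
from datetime import datetime
from zoneinfo import ZoneInfo  # standard library, Python 3.9+

def working_window(day, tz_name, start_h=9, end_h=17):
    """Return a site's working hours for `day`, expressed in UTC."""
    tz = ZoneInfo(tz_name)
    start = datetime(day.year, day.month, day.day, start_h, tzinfo=tz)
    end = datetime(day.year, day.month, day.day, end_h, tzinfo=tz)
    return start.astimezone(ZoneInfo("UTC")), end.astimezone(ZoneInfo("UTC"))

def overlap(day, tz_a, tz_b):
    """Return the synchronous communication window shared by two sites."""
    a_start, a_end = working_window(day, tz_a)
    b_start, b_end = working_window(day, tz_b)
    start, end = max(a_start, b_start), min(a_end, b_end)
    return (start, end) if start < end else None  # None: no shared hours

day = datetime(2021, 3, 1)
window = overlap(day, "America/Sao_Paulo", "Europe/Berlin")
print(window)  # a four-hour UTC window usable for daily meetings
```

note that the sketch also captures the daylight-saving issue raised in quotation ii: because the zone database handles summer time, the size of the overlap window can change during the year even when local working hours do not.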
6.4.2 communicating asynchronously in dtzs (iv.b.) working in dsd teams includes communicating asynchronously between dsd sites across dtzs as a common approach. dsd team members will usually adopt asynchronous media for communicating with individuals from remote sites, i.e., communication software channels such as e-mail, issue tracking, and chat tools. nevertheless, communicating via those tools demands an effort from dsd team members to state their working status, and the lack of this effort can lead to communication failures in dsd teams. examples of quotations in this component: i. “i remember this workflow well, yeah, the first thing that happened early in the morning was if, look, yeah, there in the development tools what it was done because, yeah, our team was well distributed...” (translated); ii. “yes, it is asynchronous, at the time i think it was the messenger that we used the most and e-mail, right?” (translated). 6.5 choosing communication media (v.) dsd project members communicate via a diverse range of options of communication media, including specific sets of software-based media, for better communication results. those options are identified by aspects such as the expected time to obtain responses and the communication speed, among others. in this context, dsd members feel that the chosen media may impact the degree or the efficacy of communication in their teams, e.g., adopting approaches such as text-based communication as an attempt to reduce noise. this theoretical category describes the communication media in dsd teams through its subcategories, as follows. 6.5.1 e-mailing (v.a.) despite its age, communicating via e-mail is a common approach in dsd teams. dsd team members recognize e-mail communication as a formal and objective approach, as well as the preferred choice of some of those individuals for asynchronous communication. additionally, when communicating via e-mail, dsd members mitigate misunderstanding issues with accents in the standard language of choice in those teams, i.e., the dsd standard language, when it is a foreign language. nevertheless, e-mail communication can be prone to misinterpretations when not correctly written. thus, e-mail users must be cautious about their recipients and try to understand their perspective. examples of quotations in this component: i. “there is communication by e-mail for more formal things...” (translated); ii. “now, when they were more common subjects that could wait for one day, it didn’t demand everybody to know about it, synchronously, that was the e-mail...” (translated). iii. “the, the differences are, are bigger, the difference in accent is bigger, so, by e-mail, uhh, yeah, it’s more interesting!” 6.5.2 chatting via software (v.b.) dsd members also use chat-based software as a choice for communicating in their teams, e.g., instant messaging (im). those tools are usually used with a high frequency in dsd, with both local and remote members. chat tools are the media of choice in circumstances that include weekly and daily meetings. furthermore, those tools are used for reasons that include communicating and sharing knowledge, project tracking, and supporting the collaborative process in dsd teams.
furthermore, even though chat is an informality-oriented communication medium, i.e., a more natural communication approach, without worrying about the form of the text itself, chat tools are also used by dsd team members as a way of documenting conversations for further reference. modern chat tools include communicating via audio and video calls, which are also adopted by dsd members as communication media. additionally, those tools include the possibility of communicating to a specific group or specialized channel, which may address a group of individuals involved in particular project events or tasks via simple short messages, e.g., a deployment activity. even though it is not as immediate as in-person communication, chat-based communication is usually well accepted in dsd teams for being a fast communication medium. examples of quotations in this component: i. “but for quick things, we have an im, business system that is only local, where we talk a lot, create, yeah, very similar to the old irc.” (translated); ii. “we create one for the release 1 of, of, such project, another, qa group, the other, the leadership group of the such team, so, we create several different channels, or direct messages.” (regarding the usage of specialized chat channels) (translated). 6.5.3 adopting video call software (v.c.) video calls are part of the communication process in dsd as a common approach in those teams. dsd members perform video calls in the circumstances of everyday (on-demand) remote face-to-face communication or specific global meetings. video calls may be demanded by dsd organizations as an institutional communication practice. additionally, this medium may include the support of visual sharing of working artifacts (e.g., sharing screens) as part of the communication attempt. the lack of face-to-face contact, in person or via video calls, has the potential to hinder communication efficacy, as the lack of eye-to-eye contact can lead to some degree of misunderstanding about how the message was received between the parties, leading to discomfort in the communication act. in this context, dsd members perform video calls for reasons that include mitigating or even replacing in-person face-to-face contacts. video calls have the potential to be an effective way of communicating, as gestures and body language may be included in the video. nevertheless, even by simulating an in-person face-to-face contact, video calls can still be considered by dsd members an impersonal way of communicating when compared with their physical counterpart. examples of quotations in this component: i. “...we did audio and video often, shared the screen, the computer... i think main point of collaboration between teams right?” (translated); ii. “i believe so, so it was often necessary to have a video call with the people there” (regarding the usage of video calls for complementing previous communication attempts) (translated); iii. “uhh, it helps, because, body language and gestures they go along with the video, right, so it’s not just how to write an email or send a message.” (regarding the usage of video calls to mitigate the lack of in-person communication) (translated) 6.5.4 adopting task boards (v.d.) dsd members adopt task boards as communication media in their teams. those individuals will eventually register development tasks in the form of adhesive notes to compose their boards.
at this point, task boards enable a spatial and explicit message exchange, without the need for further software-related steps, i.e., loading a software tool and authenticating. task boards are used for reasons that include tracking the ongoing development in dsd teams. some dsd teams will give up using physical task boards in circumstances that include a high dispersion of dsd members. nevertheless, specific software tools make the usage of task boards feasible for remote teams, including additional features such as a graphical representation of the ongoing development (which, in turn, requires updating the development process for correct results). those software-based boards will usually present their contents clearly, i.e., in an understandable way. the usage of task boards in dsd teams supports the process of sharing the understanding of the proposed activities with all team members. examples of quotations in this component: i. “yeah, we added tickets with as much detail as possible, and they are sent on, on our board there to be tested... by our team here.” (translated); ii. “look, we use post it and use the board, especially when we are going to have a meeting, to define how the new feature will be...” (translated). 6.5.5 adopting collaborative software (v.e.) collaborative software tools are part of the communication context in dsd. collaboration can be performed in these software tools through discussion in threads, which demands the constant interaction of dsd members for effective communication attempts. dsd organizations may adopt organizational or custom collaborative tools for reasons that include the need to document meetings in those teams. communicating via those tools is mostly performed in the dsd standard language, in a less formal and more direct (i.e., closer to real-time) style when compared with e-mails. still, the adoption of collaboration tools is not absolute, and some dsd projects proceed without those tools. examples of quotations in this component: i. “so, yeah, i think there are a lot of collaboration tools, you know? for collaborative work.” (translated); ii. “...we used slack for communication, there were often giant threads of discussions about something that was, lack of understanding, you know?” (translated). 6.5.6 adopting issue tracking systems (v.f.) communication in dsd teams also occurs via documented information in change requests (crs) from issue tracking systems. those tools may be adopted on an institutional level, and cr registers may contain requirements and technical details for the implementation of the proposed fix or feature and its current development state. examples of quotations in this component: i. “we use jira.” (translated); ii. “so, the product manager initially go to the customer and take the requirements and put into a excel sheet and from there, the product owner uhh ... make sense by sitting with the product manager and putting into the jira, as a card.” (translated); iii. “...i don’t remember if we were using jira or if it was mantis yet” (translated). 6.5.7 adopting physical drawing boards (v.g.) dsd team members use physical white or blackboards as a communication medium in their teams. using free boards in dsd teams is considered by dsd members a useful approach for communicating, as the message persists in the physical environment.
this approach supports exposing ideas and the evolution of their concepts in those teams. dsd teams may, for example, communicate via free drawing on those boards, not necessarily using a specific pattern of diagrams. examples of quotations in this component: i. “we, you draw a lot on the board, you don’t follow any kind of pattern, unless you have to know how to write and read, of course!” (translated); ii. “...but draw, try to organize your thinking in a logical way, with diagrams, almost, all kinds of diagrams” (translated); iii. “some people use uml, others don’t, without any kind of reinforcement to use specific methodologies.” (translated). 6.6 communicating in a dsd standard language (dsl) (vi.) dsd team members will eventually communicate in a standard or common language in their teams, a dsd standard language (dsl). using a common and distinct language in this context can be a manageable approach for some individuals, i.e., a less demanding requirement, but not a straightforward process for others, supporting communication issues. bad dsl communication leads to longer and more complicated meetings in dsd teams, as well as to the weakening of relations due to limited or poor communication between dsd members, thereby leading to negative feelings for those involved. examples of quotations in this component: i. “ok, we also have language issues, because, as the company’s head office is american, all meetings are in english, and it is, of course, the common language for most people.” (translated); ii. “...the official standard language is spanish” (translated). iii. “so, i, i, 100% of the time had to speak spanish...” (translated); 6.6.1 dealing with regional accents (vi.a.) communicating in dsd teams in the circumstances of multicultural teams includes dealing with regional accents. in this context, communicating is, for some members, a straightforward experience, but for others, a relevant challenge, with the possibility of hindering the communication act. dsd members with some degree of capability in multiple foreign languages will better understand regional accents; overall, with time and experience, dsd professionals improve their capability to understand different accents. examples of quotations in this component: i. “ok, we also have language issues, because, as the company’s head office is american, all meetings are in english, and it is, of course, the common language for most people.” (translated); ii. “...the official standard language is spanish” (translated). iii. “so, i, i, 100% of the time had to speak spanish...” (translated); 6.6.2 considering dialects (vi.b.) dialects are part of the communication context in dsd teams. those can emerge as variations of languages that include english, chinese, and indian languages, among others. those dialects include specific pronunciation styles and subsets of terms, justified by the circumstances of the distinct origins of dsd members. the presence of different dialects in dsd teams may hinder the communication act. even for a business standard language such as english, regional variations can support misunderstandings, even for native speakers.
therefore, even when choosing the english language as the dsl option in dsd teams, dsd members may need to get used to eventual dialects and try to learn about them, demanding some time for better communication effectiveness. on the other hand, some languages may be less prone to dialects, i.e., presented more uniformly in dsd teams. furthermore, previous experiences of dsd members in dealing with different languages and cultural contexts will support a better understanding of dialects in those teams. examples of quotations in this component: i. “...invariably, english may be the official language, but you’re going to have to learn multi dialects of english.” (translated); ii. “...you have to understand how to communicate with the chinese, then you fail! because there is no single chinese...” (translated). 6.6.3 struggling with a foreign language (vi.d.) dsd members tend to rationally identify their own limitations in the chosen dsl as a foreign language, as well as the limitations of their colleagues. eventually, dsd members may find themselves without the capacity to master a spoken foreign dsl. this situation can lead to withdrawal from participation in discussions during meetings, as those individuals may not understand the spoken matters. the lack of adequate capabilities in the chosen dsl may lead to frustration in the communication act. without those capabilities, dsd members may be left with the impression that they could have contributed more effectively than they did, and also with the sensation that colleagues are “giving up” on communicating due to those limitations. the lack of capabilities in the dsl may also lead to discomfort of individuals in dsd teams, as this situation can support feelings of incapability and, in fewer circumstances, inferiority. overall, limitations in the chosen dsl can lead to communication delays, noise, and misunderstandings of meanings. nevertheless, poor dsl capabilities will not necessarily make recurrent meetings, such as daily ones, unfeasible, but may add barriers to effective communication where using a native language probably would not. examples of quotations in this component: i. “...for example, when i was working on the other project that was also with dsd, he, we have a problem of being able to understand the english of some people...” (translated); ii. “but, yeah, it was certainly different than if everyone had been speaking their native language, right?” (translated). iii. “...it was not so comfortable, because each of us has a different level of english, right? and many times we could not capture everything that was said by another person...” 7 discussion regarding the study gap consolidation (see section 4.2), and based on the presented findings, no specific dsd communication theory was identified, which seems to indicate explanation gaps in dsd communication; we thus identified that a new and specific dsd communication theory has its place in the literature. as for the current state of our theory, we highlight that, when compared with its first preliminary version (leitão júnior et al., 2019), this version brings to light additional theoretical content on the aspects of the diversity of cultural expectations, on the effort to deal with different schedules in cultural diversity, on the adoption of physical drawing boards as communication media, and, mostly, on a new language-based dimension and its new components.
this version also includes some refactoring of theoretical concepts that were represented by specific components, such as the “reports” media and the “accepting cultural diversity” component, which had part of their contents revised to better represent communication-related aspects and some of their contents reallocated to other theoretical components of the theory. this further preliminary theory of communication in distributed software development teams brings with its theoretical contents the concept that communicating in dsd teams comprises distinct and multidisciplinary concepts, including cultural and behavioral elements, hence suggesting multiple research fronts on this phenomenon in distributed software development teams in a multidisciplinary strategy. in this context, and in what concerns cultural diversity in dsd teams, we can trace a parallel of those findings with ting-toomey and dorjee (ting-toomey and dorjee, 2018), who state that to communicate appropriately and effectively, individuals have to manage diverse sociocultural identity memberships adaptively, and with gurung and prater (gurung and prater, 2017), who state that dealing with cultural differences in dsd is a contemporary challenge. we also identified that dealing with a foreign language in dsd teams is still a challenge for many individuals, due to limitations in language capabilities themselves and the possibility of interaction with the different accents and dialects that may come with it. additionally, we identified that the nature of communication practices and the right choice of media for communicating also play a relevant role in those teams, as dsd stakeholders have been adopting strategies and choosing spatial and software-based tools for better communication in their teams for some time, suggesting that the construction of new methodologies or software-based tools for the improvement of communication can be well received by dsd members, as such approaches are usually well incorporated by them into their daily work environment. furthermore, when considering the factors that influence communication in dsd teams according to santos and coauthors (dos santos et al., 2012), we identified direct references between the theoretical content of the emerged theory and six factors that were not apparently addressed by any of the theories that we identified during our exploratory literature review, as follows. i. language or linguistic barriers: this factor is directly addressed by the “struggling with a foreign language” component. that is because the respective component includes theoretical contents on aspects such as the challenging nature of mastering a foreign language, using this language as the language of choice in dsd teams, and considerations on the effects of insufficient communication capabilities in this foreign language. additionally, we may trace a parallel between this factor and the “considering dialects” component, as the latter includes theoretical content on the diversity of dialects that may permeate dsd teams; ii. limited informal communication: we argue that this factor is addressed by the “communicating informally” component. this component brings to light theoretical content on the nature of informal communication in dsd teams from the perspective of a more relaxed and natural communication style in those teams.
additionally, the component also includes considerations on the circumstances of the adoption of informal communication, with a brief intersection with the hierarchical context in those teams; iii. temporal distance: this factor relates directly to the components below the “including different time zones” dimension. this dimension brings to light communication-related aspects of the interaction of dsd team members across different time zones. additionally, this dimension brings the view of communication as a challenge to be tackled by dsd members in their teams; iv. synchronization of work schedules: we may trace parallels between this factor and the effort of dsd team members to synchronize work across different time zones, as presented by the “synchronizing communication in dtzs” component. furthermore, we may also trace parallels between this factor and the “adapting schedules to cultural diversity” and “considering different religious beliefs” components, regarding the effort of dsd team members to do their best to manage fluctuations in work schedules due to cultural or religious reasons; v. distribution of tasks: we may argue some intersection between this factor and the “adopting issue tracking systems” and “adopting physical drawing boards” components, as both present theoretical content on the effort of dsd team members to distribute and better communicate tasks in their teams; vi. communication skills: we argue that this factor relates to diverse aspects of the emerged theory, but mostly to the positive outcomes in the communication context of the practice of empathy, as presented by the “practicing empathy” component, to the better understanding of the diversity of communication styles in cultural diversity, as described by the “dealing with different communication styles” component, and to the considerations on the capabilities of communicating in a foreign language, as presented by the components below the “communicating in a dsd standard language (dsl)” dimension. those findings lead us to the conclusion that the emerged theory has the potential to contribute to the dsd literature, as it brings knowledge to the communication context of dsd teams, even in its preliminary version. furthermore, we may thereby conclude that the usage of grounded theory has been a rich choice for uncovering the communication phenomenon in dsd. additionally, we propose some actions for dsd managers and other practitioners to communicate better in dsd teams, as follows. i. consider “soft” skills: the practice of empathy and a better understanding of the established affinity between dsd members will help to support a healthier communication context in dsd teams; ii. understand cultural aspects: managers need to be aware of the cultural diversity that permeates global dsd teams and consider this aspect when planning communication. people from different origins will bring different ambitions, expectations, and communication styles with them. therefore, dsd managers in this context must do their best to understand those circumstances in order to avoid future communication problems; iii. do not forget in-person communication: when possible, managers shall consider communicating in person with important stakeholders, mostly at the beginning and in critical moments of dsd projects; iv. do not overuse asynchronous communication: asynchronous communication is a trend in dsd due to the nature of this model.
still, managers and team members need to do their best to make synchronous communication feasible in their teams; v. support informal communication: informal communication has the potential to remove communication barriers in dsd teams. therefore, and without forgetting the value of formal communication, dsd members shall support informal communication for better communication results in their teams. 8 limitations and threats to validity we consider as the first limitation of this study our decision not to use procedural techniques such as “axial coding” for bringing back data, i.e., for linking categories (dimensions) to subcategories (components). we related those components by the data that they represent, a non-procedural approach exemplified by charmaz (charmaz, 2014, p. 148). nevertheless, as also stated by the author, this approach may lead to some degree of ambiguity. thus, even after exhausting the process of defining categories and relating those to subcategories, we accept that some degree of ambiguity may exist in the hierarchical representation of our categories and subcategories. still, we believe that those circumstances will not compromise the abstraction of the data itself and the robustness of the emerging theory’s theoretical content. as an overall threat to this research as a qualitative work and, mostly, as a gt research, we believe that bringing preconceived ideas from the literature into the emerging theory could potentially represent a threat, i.e., a bias aspect. at this point, we decided to plan our extensive literature review for a later stage of this research, after constructing the new theory, to mitigate bias effects on the emerged concepts, even though this is not necessarily a problem from charmaz’s perspective. furthermore, we remark the possibility of including some degree of implicit bias in the emerging theory due to our software engineering experience. we are researchers in dsd, with experience in the software industry. the first author is a researcher and practitioner in the software engineering industry with more than 20 years of experience in diverse roles, including coding, testing, requirements elicitation, and analysis. the coauthors are both lecturers and researchers with extensive experience in project management, quality, and maturity and capacity models, among other software engineering fronts. still, we believe that our experience will not pose a significant risk of bias in the emerged theory, as we strictly followed charmaz’s gt specification. regarding this topic, charmaz stated that “researchers typically hold perspectives and possess knowledge in their fields before they decide on a research topic”, as even “examining committees expect such expertise, funding agencies require it.” (charmaz, 2014, p. 306). thereby, we believe we are on the same page as charmaz, who understood this scenario as a natural and eventually a necessary one. 8.1 future work our first challenge to tackle as future work is to collect further data, continue to perform the constant comparative method, and review our codes, classes, and sub-classes until achieving “theoretical saturation” (see section 4.1). by performing this effort, we aim at evolving the current state of the theory and, thereby, concluding the application of the gt method.
next, we aim at strengthening the validity of this study by performing additional research steps. at this point, we may cite golafshani (golafshani, 2003), who defines validity in quantitative research as “whether the means of measurement are accurate and whether they are actually measuring what they are intended to measure.” the author, however, also states that this concept is viewed differently by qualitative researchers, who consider these terms inadequate in their research contexts, supporting the need for some qualifying check or measure of validity for their qualitative work, including data or method triangulation. thus, to better deal with those risks, we propose an additional, third research step as future work. this new and last step includes a later extensive literature review in the format of a systematic mapping study (sms). this new step will also have an evaluation process, based on a focus group session followed by the verification of a set of credibility criteria on the construction process of the theory itself, as an approach for validating gt studies. in this way, we are trying to characterize a triangulated method approach (gt, systematic mapping study (sms), and focus groups), which enhances the strength of qualitative studies (patton, 1990) such as ours. we detail the activities of this additional step as follows. i. systematic mapping study: as stated by kitchenham and charters (kitchenham and charters, 2007), a systematic mapping study (sms) allows the identification of evidence in a domain at a larger scale of granularity. the authors also affirm that these studies allow the identification of evidence clusters and deserts to direct the focus of future systematic literature reviews and to identify areas for future primary studies. therefore, we propose a sms to confront the emerged theory with the literature and situate it. we expect that this sms will provide a selection of studies about communication in dsd and an extensive view of communication theories used in this context, allowing the comparison of the emerged theory with other studies in an extensive approach; ii. focus group sessions: we propose one focus group session with practitioners from the software industry in dsd to verify whether the emerged theory reflects the real communication context in those teams. focus group is a technique for data collection based on group interactions, in which a researcher suggests a topic for further discussion (morgan, 1997) and collects the emerging data. focus group is a technique primarily for exploring the group’s perception instead of individuals’ ones; thus, when using this technique, a researcher must be focused on the consensus and the exceptions exposed by the group (sim, 1998); iii. theory evaluation: we propose the evaluation of the new theory by checking our theory development process against a set of evaluation criteria for gt and other scientific studies. additionally, we propose using the findings from the sms study and the feedback on the new theory collected via the focus group session. in this context, we propose using the set of criteria proposed by charmaz (charmaz, 2014), as it comes from our leading methodological choice. charmaz states that the line between process and product becomes blurred for our audiences, as other scholars will likely judge the gt process as an integral part of the product (charmaz, 2014).
the author also states that expectations for a gt study may vary (charmaz, 2014), but the following list of criteria may give some ideas to grounded theorists: credibility, originality, resonance, and usefulness. references adolph, s. (2013). reconciling perspectives: a substantive theory of how people manage the process of software development. phd thesis, faculty of graduate studies (electrical and computer engineering), university of british columbia, vancouver, canada. adolph, s., kruchten, p., and hall, w. (2012). reconciling perspectives: a grounded theory of how people manage the process of software development. journal of systems and software, 85(6):1269–1286. aoyama, m. (1997). agile software process model. in proceedings of the twenty-first annual international computer software and applications conference, pages 454–459, washington, dc, usa. ieee. aranda, g. n., vizcaíno, a., and piattini, m. (2010). analyzing and evaluating the main factors that challenge global software development. the open software engineering journal, 4:14–25. birks, m. and mills, j. (2011). essentials of grounded theory. sage, london. breckenridge, j., jones, d., elliott, i., and nicol, m. (2012). choosing a methodological path: reflections on the constructivist turn. grounded theory review, 11(1):64–71. carmel, e. (1999). global software teams: collaborating across borders and time zones. prentice hall, upper saddle river, nj, usa, 1st edition. charmaz, k. (2000). grounded theory: objectivist and constructivist methods. in denzin, n. k. and lincoln, y. s., editors, handbook of qualitative research, pages 509–535. sage publications, thousand oaks, ca. charmaz, k. (2006). constructing grounded theory: a practical guide through qualitative analysis. sage publications, london, uk. charmaz, k. (2014). constructing grounded theory. sage publications, kindle edition, rohnert park, usa, 2nd edition. chenitz, w. c. and swanson, j. m. (1986). from practice to grounded theory: qualitative research in nursing. addison-wesley, menlo park, ca. clear, t. and beecham, s. (2019). global software engineering education practice continuum. acm transactions on computing education, 19(2):7. cruzes, d. s., moe, n. b., and dybå, t. (2016). communication between developers and testers in distributed continuous agile testing. in proceedings of the ieee 11th international conference on global software engineering, pages 59–68, irvine, ca, usa. ieee. daft, r. l. and lengel, r. h. (1986). organizational information requirements, media richness and structural design. management science, 32(5):554–571. de farias junior, i., marczak, s., santos, r., and moura, h. (2016). communication in distributed software development: a preliminary maturity model. in ieee 11th international conference on global software engineering (icgse), pages 164–173, california, usa. ieee. dennis, a. and valacich, j. (1999). rethinking media richness: towards a theory of media synchronicity. in proceedings of the annual hawaii international conference on systems sciences, page 10, maui, hawaii. ieee. dos santos, a. c., de farias junior, i. h., de moura, h. p., and marczak, s. (2012). a systematic tertiary study of communication in distributed software development projects.
in proceedings of the international conference on global software engineering, pages 182–182, porto alegre, rs, brazil. ieee. dunne, c. (2011). the place of the literature review in grounded theory research. international journal of social research methodology, 14(2):111–124. easterbrook, s. and neves, b. (2007). seminar 2: epistemology & ethics. farias junior, i. h. d. (2014). c2m: a communication maturity model for distributed software development. doctoral dissertation (doctorate in computer science), cin, ufpe, recife, brazil. gil, a. c. (2002). como classificar as pesquisas. como elaborar projetos de pesquisa, 4:44–45. glare, p. g. (1968). oxford latin dictionary (entry: communication). clarendon press. glaser, b. g. (1978). theoretical sensitivity: advances in the methodology of grounded theory. the sociology press, san francisco, 1st edition. glaser, b. g. and strauss, a. l. (1967). the discovery of grounded theory: strategies for qualitative research. transaction publishers, london, uk, 1st edition. glaser, b. g. and strauss, a. l. (2009). the discovery of grounded theory: strategies for qualitative research. transaction publishers, london, uk, 7th edition. golafshani, n. (2003). understanding reliability and validity in qualitative research. the qualitative report, 8(4):597–607. gregor, s. (2006). the nature of theory in is research. mis quarterly, 30(3):611–642. gurung, a. and prater, e. (2017). a research framework for the impact of cultural differences on it outsourcing. in global sourcing of services: strategies, issues and challenges, pages 49–82. world scientific. herbsleb, j. (2016). building a socio-technical theory of coordination: why and how (outstanding research award). in proceedings of the 2016 24th acm sigsoft international symposium on foundations of software engineering, fse 2016, pages 2–10, new york, ny, usa. acm. herbsleb, j. d. and moitra, d. (2001). global software development. ieee software, 18(4):16–20. herbsleb, j., paulish, d., and bass, m. (2005). global software development at siemens: experience from nine projects. in proceedings of the international conference on software engineering, pages 524–533, saint louis, mo, usa. ieee. hofstede, g. (1983). national cultures in four dimensions: a research-based theory of cultural differences among nations. international studies of management & organization, 13(1-2):46–74. kauark, f. d. s., manhães, f. c., and medeiros, c. h. (2010). metodologia da pesquisa: um guia prático. via litterarum. kitchenham, b. and charters, s. (2007). guidelines for performing systematic literature reviews in software engineering, version 2.3. technical report ebse-2007-01, keele university and durham university. leitão júnior, n., farias junior, i., and moura, h. p. (2019). a preliminary theory of communication in distributed software development teams. journal of convergence information technology, 14(2):30–41. littlejohn, s. w. and foss, k. a. (1992). theories of human communication. waveland press, 4th edition. morgan, d. (1997). focus groups as qualitative research. sage publications, london, uk, 2nd edition. morse, j. m., stern, p. n., corbin, j., bowers, b., charmaz, k., and clarke, a. e. (2016). developing grounded theory: the second generation. routledge, new york, usa. patton, m. q. (1990). qualitative evaluation and research methods. sage publications, thousand oaks, california, usa, 2nd edition. robert, l. p. and dennis, a. r. (2005). paradox of richness: a cognitive model of media choice. ieee transactions on professional communication, 48(1):10–21.
rosengren, k. e. (2000). communication, an introduction. sage publications, london, uk. shah, y. h., raza, m., and ulhaq, s. (2012). communication issues in gsd. international journal of advanced science and technology, 40:69–76. short, j., williams, e., and christie, b. (1976). the social psychology of telecommunications. john wiley and sons ltd, london, united kingdom. sim, j. (1998). collecting and analysing qualitative data: issues raised by the focus group. journal of advanced nursing, 28(2):345–352. teixeira, j. (2014). understanding collaboration in the open-source arena: the cases of webkit and openstack. in proceedings of the international conference on evaluation and assessment in software engineering, pages 52:1–52:5, london, england, united kingdom. acm. ting-toomey, s. and dorjee, t. (2018). communicating across cultures. guilford publications, new york, ny, usa. travers, j. and milgram, s. (1967). the small world problem. psychology today, 1(1):61–67. van ruler, b. (2018). communication theory: an underrated pillar on which strategic communication rests. international journal of strategic communication, 12(4):367–381. journal of software engineering research and development, 2021, 9:3, doi: 10.5753/jserd.2021.827  this work is licensed under a creative commons attribution 4.0 international license.
an empirical study of bugs in covid-19 software projects akond rahman  [ tennessee technological university | arahman@tntech.edu ] effat farhana [ north carolina state university | efarhan@ncsu.edu ] abstract the dire consequences of the covid-19 pandemic have influenced the development of covid-19 software, i.e., software used for analysis and mitigation of covid-19. bugs in covid-19 software can be consequential, as covid-19 software projects can impact public health policy and user data privacy. the goal of this paper is to help practitioners and researchers improve the quality of covid-19 software through an empirical study of open source software projects related to covid-19. we use 129 open source covid-19 software projects hosted on github to conduct our empirical study. next, we apply qualitative analysis on 550 bug reports from the collected projects to identify bug categories. we identify 8 bug categories, which include data bugs, i.e., bugs that occur during mining and storage of covid-19 data. the identified bug categories appear for 7 categories of software projects, including (i) projects that use statistical modeling to perform predictions related to covid-19, and (ii) medical equipment software that is used to design and implement medical equipment, such as ventilators. based on our findings, we advocate for robust statistical model construction through better synergies between data science practitioners and public health experts. the existence of security bugs in user tracking software necessitates the development of tools that will detect data privacy violations and security weaknesses. keywords: bugs, covid-19, empirical study, pandemic, software quality 1 introduction the novel coronavirus disease (covid-19) is a worldwide pandemic that spreads through droplets generated from coughs or sneezes and by touching contaminated surfaces (john hopkins university, 2020). as of may 31 2020, covid-19 has caused 370,247 deaths across the world (john hopkins university, 2020). apart from causing thousands of deaths and creating long-term health repercussions for vulnerable populations, covid-19 has severely impacted the economic sector. according to a recent study (erin duffin, 2020), due to covid-19 the gross domestic product (gdp) growth will decrease from 3.0% to 2.4% worldwide. as of may 28 2020, nearly 41 million citizens reported unemployment in the usa alone (mitchell hartman, 2020). more than 3.9 billion people around the world were under some form of stay-at-home order due to covid-19 (alasdair sandford, 2020). health care professionals are at the frontline of combating covid-19. practitioners from other domains, such as software engineering, have also joined forces to analyze and mitigate the negative consequences of covid-19. for example, statistical modeling was used to build software that identifies pneumonia caused by covid-19 from lung scan images (tom simonite, 2020). the software was used in 34 chinese hospitals (tom simonite, 2020). in response to the food insecurity caused by covid-19, practitioners have created an interactive visualization software that displays free meal sites across the usa (why hunger, 2020). the creators of the software envision building a social movement to eradicate hunger and address economic inequalities. as another example, apple and google have jointly announced the creation of a software framework that will help practitioners build tools to trace the covid-19 infection status of mobile app users (apple, 2020).
the above-mentioned examples show covid-19 software, i.e., software used for analysis and mitigation of covid-19, to have near-term and long-term effects on public health and society. despite the above-mentioned advancements, covid-19 software projects are susceptible to bugs. let us consider figure 1 in this regard. figure 1 provides a snapshot of a bug report related to statistical modeling (begley, 2020a). we observe that, when implementing a statistical model, the practitioners did not consider the correlation between intensive care unit (icu) bed availability and death rate prediction. furthermore, the number of icu beds is incorrectly assumed to be 40,000 instead of 1,000. we hypothesize that systematic analysis can reveal bug categories, including statistical modeling bugs similar to figure 1. in prior work, researchers (garcia et al., 2020; rahman et al., 2020; linares-vásquez et al., 2017; catolino et al., 2019; thung et al., 2012; wan et al., 2017) have documented the importance of bug categorization. for example, for autonomous vehicle software, garcia et al. 2020 stated that categorization of bugs can help to construct bug detection and testing tools. linares-vásquez et al. 2017 stated that categorizing vulnerabilities can help android practitioners “in focusing their verification and validation activities”. according to catolino et al. 2019, “understanding the bug type represents the first and most time-consuming step to perform in the process of bug triage”. in prior work, researchers have categorized bugs for infrastructure as code (iac) (rahman et al., 2020), autonomous vehicle (garcia et al., 2020), and machine learning (thung et al., 2012; islam et al., 2019) software. however, covid-19 software is different from previously studied software in the following aspects: (i) development context: unlike previously studied software projects, covid-19 software is developed in response to a pandemic that infected 6.1 million individuals in five months (john hopkins university, 2020), and (ii) public health: unlike previously studied software projects, covid-19 software has direct implications on public health and relevant policy making for inhabitants in 188 countries. figure 1. an example of a bug report related to statistical modeling in a software project called ‘neherlab/covid19_scenarios’. in response to the pandemic, researchers have conducted studies related to modeling (dehning et al., 2020; yang and wang, 2020; tamm, 2020), biological science (jin et al., 2020; wang et al., 2020; de clercq, 2006; helms et al., 2020), social science (van bavel et al., 2020; pulido et al., 2020; evans et al., 2020; will, 2020; jarynowski et al., 2020), and policy making (corey et al., 2020; mello and wang, 2020; rourke et al., 2020; kraemer et al., 2020). however, characterization of bugs in covid-19 software remains an unexplored area. the scope of our paper is to get a systematic understanding of bugs in covid-19 software projects. in our paper, we refer to covid-19 software projects as software projects that were created to analyze and mitigate the consequences of covid-19. these projects were created in response to a global pandemic that created a worldwide impact on public health, economy, and societal activities.
Our hypothesis is that the utility of COVID-19 software projects and the urgency associated with these projects can yield (i) the manifestation of bugs unique to the COVID-19 reality, and (ii) bug resolution times that differ from those of previously studied software. Furthermore, our empirical analysis identifies what categories of bugs appear for what types of COVID-19 software projects.

The goal of this paper is to help practitioners and researchers improve the quality of COVID-19 software through an empirical study of open source software projects related to COVID-19.

We answer the following research questions:

• RQ1: What categories of open source COVID-19 software projects exist? We identify seven categories of software projects related to COVID-19: aggregation, education, medical equipment, mining, user tracking, statistical modeling, and volunteer management.
• RQ2: What categories of bugs exist in open source COVID-19 software projects? How frequently do the identified bug categories appear? What is the resolution time for the identified bug categories? We identify eight bug categories: algorithm, data, dependency, documentation, performance, security, syntax, and user interface. Except for mining and medical equipment projects, the most frequently occurring bug category for all types of COVID-19 software projects is UI.
• RQ3: How similar are the identified bug categories to those of previously studied software projects? The bug categories identified for COVID-19 software projects also appear for other software types, but the manifestation of the bugs is different for COVID-19 software projects.

Contributions: We list our contributions as follows:

• A categorization of bugs that appear in COVID-19 software projects;
• A categorization of OSS projects related to COVID-19;
• An empirical study that identifies what category of bugs appears for what category of COVID-19 software projects; and
• A comparison of bug categories for COVID-19 software projects to those of previously studied software projects.

We organize the rest of the paper as follows: we discuss related work in Section 2. We provide the methodology to answer the three research questions in Section 3 and provide the results in Section 4. We discuss our results with a summary of our findings in Section 5. We provide the limitations of our paper in Section 6. Finally, we conclude the paper in Section 7. Our constructed dataset is available as a public, citable repository (Rahman and Farhana, 2020).

Overview of the empirical study: An overview of our paper is available in Figure 2. First, we mine software projects related to COVID-19 from GitHub by applying filtering criteria based on the number of issues, number of developers, etc. Next, we apply a qualitative analysis technique called open coding (Saldana, 2015) on the README files of the collected open source software (OSS) projects to identify what categories of OSS projects related to COVID-19 exist. After characterizing the collected software projects, we again apply open coding on 550 bug reports from the collected OSS projects to identify bug categories. We also quantify the frequency and resolution time of each bug category across the identified project categories. Finally, we conduct a scoping review (Munn et al., 2018) to find the similarities in bug categories between COVID-19-related software projects and other categories of software projects.

Figure 2. An overview of our empirical study: public GitHub repositories are filtered into COVID-19 projects, followed by a characterization of the COVID-19 projects and a characterization of COVID-19 software bugs.
2 Related Work

Our paper is related to prior research that has focused on the categorization of bugs in OSS projects. Mockus et al. (2002) studied the contribution nature of the OSS Apache and Mozilla projects. They (Mockus et al., 2002) observed that contributors who submit bug reports are approximately 8.2 times higher in number than contributors who address bugs in bug reports. Ma et al. (2017) investigated Python GitHub projects that are used in the scientific domain, and observed that developers use stack traces, as well as communication with upstream developers, to identify the root causes of bugs. Zhang et al. (2019) examined bug reports for mobile and desktop software hosted on GitHub, and identified differences in how the reports are constructed. Ray et al. (2014) studied the correlations between bugs and the language in which the software is developed, and reported a modest correlation using an empirical study of 729 GitHub projects. Categorization of domain-specific OSS bugs has also been investigated: Thung et al. (2012), Garcia et al. (2020), Wan et al. (2017), Islam et al. (2019), and Rahman et al. (2020), in separate research papers, used OSS projects to classify bug categories respectively for machine learning, autonomous vehicle, blockchain, deep learning, and IaC software.

Our paper is also related to publications that have investigated the impact of COVID-19 on software development. Ralph et al. (2020) surveyed 2,225 practitioners and reported that fear related to COVID-19 affects the productivity of software practitioners. Butler and Jaffe (2020) conducted a diary study with 435 practitioners and reported that practitioners face challenges, such as having too many meetings and feeling overworked while working from home due to COVID-19. Oliveira et al. (2020) surveyed 413 practitioners from Brazil and reported that practitioners' perceived productivity increased due to fewer interruptions.

From the above-mentioned discussion, we observe that bugs in software projects related to COVID-19 are an under-explored area. While several bug categorization studies exist (Thung et al., 2012; Garcia et al., 2020; Wan et al., 2017; Islam et al., 2019; Rahman et al., 2020), no such study exists for COVID-19-related projects. The bug categorization studies for IaC, blockchain, and deep learning motivated us to derive bug categories and quantify the identified bug categories. Wan et al. (2017)'s paper on blockchain bugs motivated us to study bug resolution time for each identified bug category. In our paper, we study COVID-19 software bugs in the following manner:

• categories of bugs;
• frequency of identified bug categories;
• resolution time of identified bug categories; and
• categories of software projects.

3 Methodology

In this section we provide the methodology to answer the research questions: RQ1, RQ2, and RQ3.

3.1 Methodology for RQ1: What categories of open source COVID-19 software projects exist?

We define COVID-19 software projects as software projects used for the analysis and mitigation of COVID-19. We hypothesize that multiple categories of COVID-19 software projects exist in the OSS domain. We validate our hypothesis by systematically categorizing COVID-19 software projects. Our categorization will provide insights into how the software development community has responded to the COVID-19 pandemic. We answer RQ1 by completing the following steps:

3.1.1 Dataset collection

We conduct our empirical analysis by collecting COVID-19 software projects hosted on GitHub.
To collect these projects we use GitHub's search utility (GitHub, 2020c), where we first identified projects tagged as 'covid-19'. We use the search string 'covid-19', as it is a topic designated for COVID-19 by GitHub (GitHub, 2020a). Our assumption is that by using a GitHub-designated tag we can collect OSS projects hosted on GitHub that are related to COVID-19.

OSS projects hosted on GitHub are susceptible to quality issues, as GitHub users often host repositories for personal purposes that are not reflective of real-world software development (Munaiah et al., 2017). Upon collection of the projects, we apply a set of filtering criteria so that we can identify projects that contain sufficient data for analysis. We describe the filtering criteria below; a script-level sketch of part of this filtering follows the list.

• Criterion-1: The project must have at least 2 developers. Our assumption is that this criterion will filter out projects used for personal purposes.
• Criterion-2: The project must have at least 5 open issues. We use this filtering criterion to identify projects that are actively maintained. Our assumption is that by using this criterion we will be able to identify COVID-19 software projects that are not used for personal purposes, as well as projects that are active. Prior research (Agrawal et al., 2018) has also used the count of issues to filter OSS projects hosted on GitHub to conduct empirical studies.
• Criterion-3: The project must have at least two commits per month. Munaiah et al. (2017) used the threshold of at least two commits per month to determine which projects have enough development activity for software organizations. We use this threshold to filter out projects with little development activity.
• Criterion-4: The README of the project is written in English. READMEs of projects related to COVID-19 can be non-English. We do not include non-English projects, as the raters who perform categorization are not familiar with non-English languages, such as Spanish and Cantonese.
• Criterion-5: The project is related to COVID-19. We use the 'topic' feature of GitHub (https://github.com/topics) to search for and identify COVID-19 software projects. However, practitioners can mislabel projects using the 'topic' feature of GitHub, potentially including projects in our dataset that are not related to COVID-19. For example, from manual inspection we observe the 'rehansaeed/schema.net' project (https://github.com/rehansaeed/schema.net) to be tagged as 'covid-19', even though it is not related to COVID-19. In fact, the project is used to convert blob objects into C# classes.
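For readers who wish to reproduce this step, the sketch below illustrates how Criteria 1, 2, and 5 could be scripted against GitHub's public REST API. This is a minimal illustration under stated assumptions, not our exact pipeline: the search endpoint, the 'open_issues_count' field, and the contributors endpoint are part of GitHub's documented API, while the helper names are ours; Criterion-3 (commit cadence) and Criterion-4 (README language) require additional calls and manual inspection, so they are only noted in comments.

import requests

API = "https://api.github.com"

def covid_candidate_repos(page=1):
    """Search public repositories tagged with the GitHub-designated 'covid-19' topic."""
    response = requests.get(
        f"{API}/search/repositories",
        params={"q": "topic:covid-19", "per_page": 100, "page": page},
        headers={"Accept": "application/vnd.github+json"},
    )
    response.raise_for_status()
    return response.json()["items"]

def contributor_count(full_name):
    """Count contributors (first page only) for Criterion-1: at least 2 developers."""
    response = requests.get(f"{API}/repos/{full_name}/contributors",
                            params={"per_page": 100})
    response.raise_for_status()
    return len(response.json())

def passes_filters(repo):
    # Criterion-2: at least 5 open issues (reported directly by the search API).
    # Criterion-1: at least 2 developers.
    # Criterion-3 (>= 2 commits/month) and Criterion-4 (English README) need
    # further API calls and manual inspection, so they are omitted from this sketch.
    return repo["open_issues_count"] >= 5 and contributor_count(repo["full_name"]) >= 2

kept = [r["full_name"] for r in covid_candidate_repos() if passes_filters(r)]
print(len(kept), "candidate COVID-19 projects on this result page")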
3.1.2 Qualitative analysis of README files

We apply a qualitative analysis technique called open coding (Saldana, 2015) on the content of the README files of each of the downloaded projects from Section 3.1.1. README files describe the content of the project and give GitHub users an overview of the software project (Prana et al., 2019). We hypothesize that by systematically analyzing the content of the README files we can derive what types of software projects related to COVID-19 are developed. In open coding, a rater identifies and synthesizes patterns within unstructured text (Saldana, 2015). We select open coding because it allows us to obtain detailed information on the software project categories. We use a hypothetical example to demonstrate our process of open coding in Figure 3.

First, we collect text from the README files of each of the collected projects from Section 3.1.1. Next, we extract text snippets that describe the purpose of the software project. For example, from the raw text 'the covid-19 vulnerability index (cv19 index) is a predictive model that identifies people who are likely to have a heightened vulnerability to severe complications from covid-19' we extract the text snippet 'a predictive model', as the extracted text snippet describes the purpose of the software project. Next, from the text snippets 'a predictive model' and 'modelling estimated deaths' we generate an initial category called 'models to predict'. Two initial categories, 'models to predict' and 'models to understand', are combined into one category, 'statistical modeling', as they both indicate that the descriptions of the software projects are related to statistical modeling.

The first and second authors conduct the open coding process separately. Both authors use Excel spreadsheets to conduct the open coding process manually. The first and second authors respectively have 10 and 6 years of experience in software engineering, and both have experience in conducting open coding on software project artifacts, such as commit messages (Rahman et al., 2020) and Stack Overflow posts (Farhana et al., 2019). Upon completion of the open coding process, the first and second authors identify agreements and disagreements. Disagreements are resolved through discussion, and the agreement rate is calculated using Cohen's kappa (Cohen, 1960). During the discussion phase, both authors present their justification and recheck the category derivation based on the discussion and revisited content. The mapping determined upon discussion is considered final. One project can map to multiple categories.

3.1.3 Closed coding

We apply closed coding (Crabtree and Miller, 1999) to identify which project maps to which of the identified categories from Section 3.1.2. Closed coding is a qualitative analysis technique where a rater maps an artifact to a pre-defined category by inspecting the artifact (Crabtree and Miller, 1999). The first and second authors separately conduct closed coding on the collected README files. Both authors use Excel spreadsheets to conduct closed coding. After completing the closed coding process, the first and second authors identify agreements and disagreements. The agreement rate is recorded using Cohen's kappa (Cohen, 1960). Disagreements are resolved through discussion. During the discussion phase, both authors present their justification for disagreements. Next, based on the discussion, the authors recheck the labeling based on the justification and content analysis. The categorization determined upon discussion is considered final.

3.1.4 Rater verification

The derived categories are susceptible to the bias of the first and second authors. We mitigate this limitation by allocating an additional rater who applies closed coding to a subset of the README files.
The additional rater, who is not an author of the paper, is a fourth-year PhD candidate in the Department of Computer Science at Tennessee Technological University. The rater has 2 years of professional experience in software engineering and has conducted qualitative analysis on software artifacts, such as bug reports. We randomly allocate a set of 100 README files mined from 100 projects to the rater. The rater applies closed coding on the content of the README files to identify the mapping between each project and the identified categories. Upon completion of closed coding, we calculate Cohen's kappa (Cohen, 1960) between the rater and the first author, as well as between the rater and the second author, separately.

Figure 3. A hypothetical example to demonstrate our process of open coding to categorize COVID-19 software projects: raw text snippets from README excerpts (e.g., 'a predictive model', 'a model for understanding', 'modelling estimated deaths') are reduced to initial categories ('models to predict', 'models to understand'), which are merged into the category 'statistical modeling'.
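Both the raw agreement rate and Cohen's kappa recur throughout Sections 3 and 4. As a minimal illustration, the snippet below computes both on invented labels; 'cohen_kappa_score' from scikit-learn is a standard implementation, and the rater labels are hypothetical, not drawn from our dataset.

from sklearn.metrics import cohen_kappa_score

# Hypothetical labels assigned independently by two raters to the same
# six README files; the labels are invented for illustration only.
rater_a = ["aggregation", "mining", "education", "mining", "aggregation", "user tracking"]
rater_b = ["aggregation", "mining", "education", "aggregation", "aggregation", "user tracking"]

# Raw agreement rate: share of items on which the raters assign the same label.
agreement_rate = sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)

# Cohen's kappa corrects the raw agreement for agreement expected by chance.
kappa = cohen_kappa_score(rater_a, rater_b)

print(f"agreement rate: {agreement_rate:.1%}, Cohen's kappa: {kappa:.2f}")
# Landis and Koch (1977) read 0.61-0.80 as 'substantial' and 0.81-1.00 as 'almost perfect'.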
3.2 Methodology for RQ2: What categories of bugs exist in open source COVID-19 software projects? How frequently do the identified bug categories appear? What is the resolution time for the identified bug categories?

In this section, we present the methodology to answer RQ2. A categorization of bugs for COVID-19 software projects can inform practitioners and researchers about how software related to COVID-19 is developed and in which areas they can help. Furthermore, educators can learn about the software bugs that occur in software related to a pandemic and disseminate these findings in the classroom. The frequency of the identified bug categories can help us understand what categories of software tend to contain what types of software bugs and provide quality improvement suggestions accordingly. Quantifying the resolution time for bugs in software projects can help software engineering researchers provide actionable guidelines to practitioners. For example, Wan et al. (2017) observed that for blockchain software projects, security bugs can take longer to fix compared to other bug categories. Based on their findings, Wan et al. (2017) recommended that blockchain project maintainers adopt security analysis and repair tools to fix security bugs quickly. We provide the methodology to identify bug categories, quantify bug category frequency, and quantify bug resolution time below.

Methodology to identify bug categories: We identify bug categories using the following steps:

• Step#1-Filtering: We collect the 4,405 issue reports from the 129 projects and manually inspect each issue report. We do not rely on automated approaches, such as keyword search or bug labels, as automated approaches tend to generate false positives, which may bias research results (Herzig et al., 2013). While inspecting each issue report we use the following IEEE definition for bugs: "an imperfection that needs to be replaced or repaired" (IEEE, 2010), similar to prior work (Rahman et al., 2020). By completing this step we obtain a set of closed issue reports that correspond to bugs. We use closed reports because open bug reports are often incomplete and may not help in identifying bugs (Wan et al., 2017). The first and second authors individually inspect each issue report to identify which issue reports correspond to bugs. We record the agreement rate and Cohen's kappa (Cohen, 1960) between the first and second authors. Disagreements between the first and second authors are resolved through discussion. The process is subjective and susceptible to the bias of the first and second authors. We mitigate the bias by using an additional rater, who inspected 100 randomly selected issue reports and classified them as bug reports or non-bug reports. The additional rater is the fourth-year PhD candidate at Tennessee Technological University who is also involved in rater verification for RQ1.
• Step#2-Open coding: We apply open coding (Saldana, 2015) on the content of the collected bug reports from Step#1. Our open coding process is illustrated in Figure 4 with an example. First, we extract raw text from bug report titles and descriptions, from which we generate initial categories. Next, we merge initial categories based on their commonalities and generate categories. Similar to deriving project categories, the first and second authors separately apply the process of open coding to generate bug categories. Upon completion of the process, we quantify the agreement rate and measure Cohen's kappa (Cohen, 1960). Disagreements are resolved through discussion. The categories generated upon discussion are considered final.

Methodology to quantify bug category frequency: We apply the following steps to quantify the frequency of the identified bug categories (a code-level sketch of the two metrics follows this list):

• Step#1-Closed coding: We apply closed coding (Crabtree and Miller, 1999) to map each identified category to the bug reports that we study. The first and second authors separately apply closed coding to the collected bugs from Step#1. Upon completion, we calculate the agreement rate and Cohen's kappa (Cohen, 1960). Disagreements are resolved through discussion.
• Step#2-Metric calculation: We quantify the frequency of the identified bug categories using two metrics, 'BugPropAll' and 'BugPropCateg', computed respectively with Equations 1 and 2:

BugPropAll(x) = (# of bug reports labeled as category x / total # of bug reports) × 100%   (1)

BugPropCateg(x, y) = (# of bug reports labeled as x, of project type y / # of bug reports for project type y) × 100%   (2)

The 'BugPropAll' metric refers to the proportion of bugs across all projects, and provides a holistic overview of the frequency of the identified bug categories. The 'BugPropCateg' metric refers to the proportion of bugs for a certain project category, and provides a granular overview of bug category frequency for each of the software project types identified in Section 4.1.2.
• Step#3-Rater verification: The use of the first and second authors as raters to conduct closed coding is susceptible to rater bias. We mitigate this limitation by allocating an additional rater. We assign 250 randomly selected bug reports to the additional rater, who applies closed coding. We provide the additional rater with a document that provides definitions of each identified category with examples. Similar to our process of rater verification for project categorization, the additional rater is the fourth-year PhD candidate in the Department of Computer Science at Tennessee Technological University, who is involved in the rater verification process for identifying project categories and labeling issue reports as bug reports.
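As a minimal sketch of Equations 1 and 2, the functions below compute 'BugPropAll' and 'BugPropCateg' over bug-report records; the record structure and the field names ('categories', 'project_types') are our own illustrative assumptions, not the schema of our released dataset.

def bug_prop_all(reports, category):
    """Equation 1: percentage of all bug reports labeled with `category`."""
    labeled = sum(1 for r in reports if category in r["categories"])
    return 100.0 * labeled / len(reports)

def bug_prop_categ(reports, category, project_type):
    """Equation 2: percentage of bug reports of `project_type` labeled with `category`."""
    of_type = [r for r in reports if project_type in r["project_types"]]
    labeled = sum(1 for r in of_type if category in r["categories"])
    return 100.0 * labeled / len(of_type)

# Hypothetical records: one bug report may map to several bug categories,
# and one project may belong to several project types.
reports = [
    {"categories": {"ui"}, "project_types": {"aggregation"}},
    {"categories": {"data"}, "project_types": {"mining"}},
    {"categories": {"ui", "data"}, "project_types": {"aggregation", "mining"}},
]
print(bug_prop_all(reports, "ui"))                # 66.7% of all bug reports
print(bug_prop_categ(reports, "data", "mining"))  # 100.0% of mining bug reports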
Methodology to quantify bug resolution time: We use the opening and closing timestamps of each closed bug report in our dataset to quantify the resolution time for each bug category, similar to Wan et al. (2017). We calculate bug resolution time by computing the number of hours that elapsed between when a bug report was opened and when it was closed and not re-opened again, as per our dataset, which was downloaded on April 04, 2020. We report bug resolution time for all bug categories, as well as for bug reports that belong to certain categories of software projects.
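A minimal sketch of this computation is shown below: given the ISO-8601 'created_at' and 'closed_at' timestamps that GitHub reports for an issue, it derives the elapsed hours and a per-category median; the input records are invented for illustration only.

from datetime import datetime
from statistics import median

def resolution_hours(created_at, closed_at):
    # GitHub reports issue timestamps in ISO-8601, e.g. '2020-03-01T10:00:00Z'.
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    delta = datetime.strptime(closed_at, fmt) - datetime.strptime(created_at, fmt)
    return delta.total_seconds() / 3600.0

# Hypothetical closed bug reports: (bug category, opened, closed).
bugs = [
    ("data", "2020-03-01T10:00:00Z", "2020-03-01T18:24:00Z"),
    ("data", "2020-03-02T09:00:00Z", "2020-03-02T17:00:00Z"),
    ("ui",   "2020-03-03T12:00:00Z", "2020-03-04T00:00:00Z"),
]

per_category = {}
for category, opened, closed in bugs:
    per_category.setdefault(category, []).append(resolution_hours(opened, closed))

for category, hours in sorted(per_category.items()):
    print(category, "median:", round(median(hours), 1), "hours")  # data 8.2, ui 12.0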
3.3 Methodology for RQ3: How similar are the identified bug categories to those of previously studied software projects?

We conduct a scoping review of publications related to software bug categorization. Using a scoping review, researchers can synthesize results using a limited search (Anderson et al., 2008). According to Munn et al. (2018), "researchers may conduct scoping reviews instead of systematic reviews where the purpose of the review is to identify knowledge gaps, scope a body of literature, clarify concepts or to investigate research conduct". Unlike a systematic literature review, a scoping review is less comprehensive, and can be used as a precursor to a systematic literature review. A scoping review can be useful to collect emerging evidence, which can eventually be used to inform further research decisions (Anderson et al., 2008). For example, if a researcher is inexperienced in the domain of software fuzzing and wants to get an understanding of existing topics, such as practices and techniques to implement fuzzing, then a scoping review could be useful to that researcher.

We conduct our scoping review by identifying well-known venues where software engineering research is published. We select five conferences: the International Conference on Software Engineering (ICSE), the Symposium on the Foundations of Software Engineering (FSE), the International Conference on Automated Software Engineering (ASE), the International Conference on Mining Software Repositories (MSR), and the International Symposium on Software Testing and Analysis (ISSTA). We select these conferences because they are considered reputed venues for publishing literature related to software engineering (Emery Berger, 2021) and are sponsored by special interest groups of the Association for Computing Machinery (ACM). We select conferences because they tend to have a shorter review cycle and are more likely to include recent advances in the field of interest (Vardi, 2009). We conduct the review by applying the following steps:

• Step-1: We download all papers from 2010 to 2020 for each of the five conferences. We select papers from 2010 to 2020 to identify and synthesize state-of-the-art bug taxonomies and categories used for a wide range of software projects. Papers that studied bug categories prior to 2010 may not give us an understanding of the state of the art. Our hypothesis is that by identifying papers from the last 10 years we will get a better overview of what types of bugs appear for a wide range of software projects.
• Step-2: We read the title, abstract, and keywords to determine whether the downloaded papers are related to software bug categorization.
• Step-3: Upon completion of Step-2, one rater reads each collected paper and identifies the topics discussed in the paper of interest using qualitative analysis. For each paper, the rater determines whether the paper focuses on bug categorization. If so, the rater documents the bug categories for the reported software project.

Upon completion of the above-mentioned steps, we derive the reported bug categories for multiple software projects.

Figure 4. A hypothetical example to demonstrate our process of open coding to identify bug categories: raw text from bug report excerpts (e.g., 'fix nyc data', 'data not saved in the backend', 'district names are wrong') is reduced to initial categories ('bugs related to location data', 'bugs related to storage'), which are merged into the category 'data bugs'.

4 Results

In this section, we provide answers to the three research questions: RQ1, RQ2, and RQ3.

4.1 Answer to RQ1: What categories of open source COVID-19 software projects exist?

We answer RQ1 by first providing summary statistics of our dataset in Section 4.1.1. Next, we report the categories of the projects in Section 4.1.2.

4.1.1 Summary of dataset

Altogether we download 129 projects for analysis. Using the search feature we identify 3,276 public projects, to which we apply our filtering criteria. A complete breakdown of our filtering is available in Table 1. Attributes of the projects are available in Table 2. 'Languages' in Table 2 corresponds to the count of main programming languages of the collected projects as determined by GitHub's Linguist tool (GitHub, 2020b). Example languages include JavaScript, Python, and R. The temporal evolution of the 129 COVID-19 software projects based on creation date is available in Figure 5. We observe a sharp increase in project creation after Feb 29, 2020.

Table 1. Filtering of COVID-19 projects used in the paper.
Criteria                           GitHub
Initial                            3,276
Criterion-1 (devs >= 2)            1,287
Criterion-2 (open issues >= 5)       169
Criterion-3 (commits/month >= 2)     154
Criterion-4 (README is English)      131
Criterion-5 (actually COVID-19)      129
Final                                129

Table 2. Attributes of the studied COVID-19 projects.
Attributes    Total
Commits       38,152
Developers    2,243
Duration      12/2019-03/2020
Files         24,839
Issues        4,405
Languages     18
Releases      286
Projects      129

4.1.2 Categorization of COVID-19 software projects

We identify 7 categories of COVID-19 software projects. We describe each of the categories below in alphabetical order:

I: Aggregation: This category includes software projects that curate data related to COVID-19 and present the collected COVID-19 data in an aggregated format using visualizations. The purpose of these projects is to help users understand the spread of the COVID-19 disease over time and location. Software projects that belong to this category can be country specific, as done in 'juanmnl/covid19-monitor' (Juanmnl, 2020) and 'dsfsi/covid19za' (Marivate and Combrink, 2020), respectively for Ecuador and South Africa.
Aggregation of COVID-19 data can also be at a global level; for example, 'boogheta/coronavirus-countries' (Boogheta, 2020) is a software project that aggregates COVID-19 data across the world and allows software users to compare the reported cases on a country-by-country basis.

Figure 5. Temporal evolution of COVID-19 software projects based on their creation date (count of projects from 2019-12-22 to 2020-04-04). We observe a sharp increase in project creation after Feb 29, 2020.

II: Education: This category includes projects that provide utilities for educating people about COVID-19. A lack of knowledge related to infections and symptoms can contribute to the rapid spreading of COVID-19. The purpose of these projects is to build software where users can ask questions and obtain answers. We observe two categories of such software: first, question-and-answer websites similar to Stack Overflow (https://stackoverflow.com/), such as 'nthopinion/covid19' (Nthopinion, 2020), where users can ask questions about COVID-19 and other users answer such questions. Second, we observe bot-specific software, such as 'deepset-ai/covid-qa' (Deepset AI, 2020), which provides answers to questions related to COVID-19 automatically.

III: Medical equipment: This category includes projects that curate and maintain source code for the design and implementation of medical equipment used to treat COVID-19. The purpose of these projects is to create designs of COVID-19-related medical equipment, such as ventilators, at scale, so that the growing need for medical equipment in hospitals is satisfied. One example of such a repository is 'makers-for-life/makair' (Makers-for-Life, 2020), which states the following in its README page: "aims at helping hospitals cope with a possible shortage of professional ventilators during the outbreak. worldwide. ... we target a per-unit cost well under 500 eur, which could easily be shrunk down to 200 eur or even 100 eur per ventilator given proper economies of scale, as well as choices of cheaper on-the-shelf components". The project includes designs of the proposed ventilators as CAD files, as well as the relevant firmware available as C++ code files.

Another example is 'popsolutions/openventilator' (Popsolutions, 2020), which aims to provide cheap but reliable ventilators to treat COVID-19 in economically under-developed regions of the world. The software project originated from a Facebook group called 'Open Source COVID-19 Medical Supplies' (https://www.facebook.com/groups/opensourcecovid19medicalsupplies/), where members discussed the scarcity of ventilators and the importance of creating cheap ventilators through efficient design. In the project we notice developers creating, building, and sharing designs using OpenSCAD scripts. OpenSCAD is an open source tool to build computer-aided design (CAD) objects (https://www.openscad.org/).
IV: Mining: This category includes projects that provide APIs to mine COVID-19 data from data sources, such as the US Center for Disease Control and Prevention (CDC) (2020), the World Health Organization (WHO) (2020), and data reported by local institutions. The purpose of this category of software is to provide utilities for software developers so that they can get real-time access to COVID-19 data to build the aggregation software discussed above. Because of the nature of the pandemic, access to real-time data is pivotal for accurate aggregation and analysis. The mining tools help developers get such support. Mining software can be location specific; for example, 'dsfsi/covid19africa' (Marivate et al., 2020) is dedicated to curating and collating COVID-19-related data for African countries.

V: User tracking: This category includes software projects that collect information from users regarding their COVID-19 infection status. Tracking of user information can happen voluntarily, where the user self-reports their COVID-19 infection status. The 'enigmampc/safetrace' (Enigmampc, 2020) software is an example where users self-report their infection status as well as their location history. Tracking of user information can also be done using inference, as done in 'openmined/covid-alert' (OpenMined, 2020), where the software collects the user's location information to predict whether the user is in a location with high infection density. One utility of these projects is to identify high-risk locations so that users can have an understanding of which nearby locations should be avoided. Self-reporting software has yielded benefits for China and South Korea (Huang et al., 2020).

VI: Statistical modeling: This category includes software that uses statistical models to predict attributes related to COVID-19. The purpose of these projects is to make predictions for the future based on existing data. Example usages of statistical models include (i) predicting the death rate, as done in 'imperialcollegelondon/covid19model' (ImperialCollegeLondon, 2020), (ii) automating the process of lung segmentation from computerized tomography (CT) scans, as done in 'johof/lungmask' (Johof, 2020), (iii) predicting the impact of the COVID-19 pandemic on hospital demands, as done in 'neherlab/covid19_scenarios' (Neherlab, 2020), and (iv) predicting the presence of COVID-19 from X-ray images using deep learning, as done in 'elcronos/covid-19' (Elcronos, 2020).

VII: Volunteer management: This category includes software used to efficiently manage volunteering effort. The purpose of this software is to build platforms through which users can volunteer and participate in activities to help distressed families and communities. One example is the 'covid-volunteers' (HelpWithCovid, 2020) software (https://helpwithcovid.com/medical), which provides a web portal where users can sign up for 650 projects that include the donation of masks, personal protective equipment (PPE), and testing for COVID-19. Platforms can be global, such as 'covid-volunteers', and also regional; for example, 'applifting/pomuzeme.si' (Applifting, 2020) creates a web portal so that people inside the Czech Republic can volunteer.

4.1.3 Frequency of the identified categories

Based on project count, aggregation is the most frequent category.
Along with project count, we provide summary statistics of the projects that belong to each category in Table 3. We also observe that, on average, user tracking projects are released more frequently than other project types. We identify four software projects that belong to multiple categories. As an example, the 'soroushchehresa/awesome-coronavirus' (Soroushchehresa, 2020) project belongs to the categories aggregation, mining, and statistical modeling.

Table 3. Summary statistics of projects that belong to each category. Based on project count, 'aggregation' is the most frequent category.
Proj. categ.   Projects  Com.    Devs  Files  Iss.  Rele.
Aggregation    50        14,985  663   8,641  908   72
Mining         35        9,671   894   6,714  515   21
Stat. model.   22        7,214   429   3,464  491   38
Education      9         4,550   196   1,696  406   14
User track     9         2,020   152   2,291  119   286
Volunteer.     7         2,186   143   2,041  320   0
Med. equip.    3         859     38    790    14    63

4.1.4 Rater agreement

We report agreement rates for three steps: open coding, closed coding, and rater verification.

Open coding: After completing open coding, the first and second authors respectively identified 7 and 10 categories. The agreement rate is 70.0%, and Cohen's kappa is 0.7, indicating 'substantial' agreement (Landis and Koch, 1977). The authors disagreed on 'volunteering software related to local communities', 'education bots', and 'aggregated visualizations', the additional categories identified by the second author. Disagreements were resolved through discussion. Both authors provided justifications for their categorization. The first author pointed out that the category 'education bots' can be merged with 'education', as the category 'education' encompasses all categories of knowledge software, such as bots and web applications. The first author also pointed out that 'volunteering software related to local communities' can be merged with 'volunteer management', as the former is an extension of the category 'volunteer management'. Furthermore, the first author pointed out that 'aggregated visualizations' can be merged with 'aggregation', as 'aggregation' includes software that aggregates COVID-19 data and displays the aggregated data with visualizations. The second author was convinced by the first author's justification and updated her derived list of categories.

Closed coding: During closed coding, the first and second authors mapped each of the 129 projects to an existing category. The agreement rate is 93.8%, and Cohen's kappa is 0.92. The authors disagreed on the labeling of 8 projects, which was resolved through discussion. During the discussion phase, both authors presented their justification and rechecked the labeling based on the justification and content analysis. The categorization determined upon discussion is considered final.

Rater verification: We also measured the agreement rate between an additional rater and the authors for categorizing the README files of projects. Cohen's kappa between the additional rater and the first author for a randomly selected set of 50 README files is 0.73, indicating 'substantial' agreement (Landis and Koch, 1977). Cohen's kappa between the additional rater and the second author for a randomly selected set of 50 README files is also 0.73, indicating 'substantial' agreement (Landis and Koch, 1977). The agreement rate between the additional rater and the first and second authors is respectively 78.0% and 76.0%.

4.2 Answer to RQ2: What categories of bugs exist in open source COVID-19 software projects? How frequently do the identified bug categories appear? What is the resolution time for the identified bug categories?

We answer RQ2 by first providing a breakdown of how we obtained our bug reports in Tables 4 and 5. As shown in Table 5, the categories with the most and least bug reports are respectively aggregation and medical equipment.

Table 4. Filtering of bug reports from COVID-19 software projects.
Initial                          4,095
Criterion-1 (closed issues)      2,965
Criterion-2 (valid bug reports)    550
Final                              550

Table 5. Count of bug reports for each category of COVID-19 software projects. Aggregation-related projects have the highest number of bug reports.
Project category  Count (%)
Aggregation       220 (40%)
Mining            150 (27.3%)
Stat. model.      98 (17.8%)
Education         58 (10.5%)
Volunteer.        40 (7.3%)
User track        31 (5.6%)
Med. equip.       4 (0.7%)
One project can belong to multiple categories, which is why the per-category counts do not sum to 550. Next, we describe the identified bug categories in Section 4.2.1 by applying open coding on the collected 550 bug reports. The frequency of the identified bug categories is provided in Section 4.2.2. We provide details of rater verification in Section 4.2.3. Finally, we provide the bug resolution times in Section 4.2.4.

4.2.1 Bug categories of COVID-19 projects

We identify 8 bug categories, which we describe below alphabetically:

I: Algorithm: This category corresponds to bugs where the implementation of an algorithm does not follow the expected behavior. An algorithm is a sequence of computational steps that transform input into output (Cormen et al., 2009). We observe algorithm bugs to include two sub-categories: (i) bugs related to statistical modeling algorithms, where statistical modeling results are incorrect due to incorrect assumptions and/or implementations, and (ii) bugs related to incorrect logic implemented in the software.

Example: We provide examples for the two sub-categories:

• Statistical modeling: In a bug report titled "death rates should increase when icu's are overwhelmed" (Begley, 2020a), a practitioner describes how an incorrect assumption can result in incorrect modeling behavior. The practitioner discusses that bed space is correlated with the estimation of the fatality rate. When the bed space of hospitals is exhausted, hospitals will not be able to treat new COVID-19 patients, which could potentially increase the fatality rate. The bug report provides evidence that if the context of COVID-19 is not correctly incorporated in statistical models, those models will provide incorrect results. Incorrect statistical models can be consequential, as countries are adopting public health policies specific to COVID-19. For example, researchers have critiqued the statistical models derived by the Institute for Health Metrics and Evaluation at the University of Washington (IHME), and advised USA policymakers to use the modeling results with caution (Begley, 2020b).
• Incorrect logic: In a bug report titled "fix prefecture sorting" (Reustle, 2020), a practitioner describes a sorting bug that occurs when trying to visualize COVID-19 cases based on prefectures in Japan. A prefecture is an administrative jurisdiction in a country, similar to a state or province (Hu and Qian, 2017).
The bug occurred due to incorrect logic that did not perform sorting by prefectures.

II: Data: This category corresponds to bugs that occur during the mining and storage of COVID-19 data. As discussed in Section 4.1.2, our dataset includes projects that mine and aggregate COVID-19 data. We observe four sub-categories of data bugs: (i) storage: bugs that occur while storing data in a database, (ii) mining: bugs that occur while retrieving data from data APIs, (iii) location: bugs where location information in stored data is incorrect, and (iv) time series: bugs that correspond to missing data for a certain time period.

Example: We provide examples for each of these sub-categories below:

• Storage: In a bug report titled "temperature data not saved in the backend" (Pavel Ilin, 2020), a practitioner describes a bug where patient temperature data is entered in the front-end but not stored in the database.
• Mining: These bugs occur while COVID-19-related data is being mined. A practitioner describes a mining bug in a bug report titled "cdc children scraper is outdated" (Timoeller, 2020). The mining tool mines data related to children affected by COVID-19.
• Location: In a bug report titled "rajasthan district names are wrong", a practitioner describes that the inserted location data for an Indian state called 'Rajasthan' is wrong (Singhrajenm, 2020).
• Time series: Missing data for a project was reported in a bug report titled "data has a gap between 2020-3-11 and 2020-3-24" (Zbraniecki, 2020).

III: Dependency: This category corresponds to bugs that occur when the execution of the software depends on a software artifact that is either missing or incorrectly specified. For COVID-19 projects, an artifact can be an API or a build artifact.

Example: In a bug report titled "missing postgis" (Vaclavpavlicek, 2020), a practitioner describes that installation and execution of the software is blocked by a missing software package called 'PostGIS', which is used to store spatial and geographic measurements, such as area, distance, polygon, and perimeter, in PostgreSQL databases.

IV: Documentation: This category corresponds to bugs where incorrect and/or incomplete information is specified in release notes, maintenance notes, and documentation files, such as README files.

Example: In a bug report titled "missing code of conduct", a practitioner describes a 'code_of_conduct.md' file to be missing from the markdown documentation that describes how practitioners can contribute to the project (Mdeous, 2020).

V: Performance: This category corresponds to bugs that cause performance discrepancies in the software. Performance bugs manifest as slow responses of the web or mobile app.

Example: In a bug report titled "cluster animation slowing down the browser. it also takes much time", a practitioner describes how a performance bug related to an animation feature slows down a Firefox browser on Windows 10 (Subratappt, 2020). The performance bug was reported for a website called 'covid19india.org' (https://www.covid19india.org/), which aggregates COVID-19 data for India and displays it.

VI: Security: This category corresponds to bugs that violate the confidentiality, integrity, or availability of the software.

Example: In a bug report titled "fix password reset procedure" (Landovsky, 2020), a practitioner describes a password reset bug, where the password reset procedure ends arbitrarily after 500 login attempts.
VII: Syntax: This category corresponds to bugs related to the syntax of the programming languages used to develop the software.

Example: We notice bugs related to data types in 'neherlab/covid19_scenarios'. In the bug report titled "fix types and linting errors" (Ivan Aksamentov, 2020), a practitioner describes how linting and type checking were disabled for the project, which led to bugs related to linting and type checking.

VIII: UI: This category corresponds to bugs that involve the user interface (UI) of the software. UI bugs include navigation-related bugs on web pages, bugs related to accessibility, the display of incorrect images, links, and colors, and responsiveness.

Example: A bug report titled "accessibility fixes" (Abquirarte, 2020) describes a UI bug related to accessibility. According to the bug report, a screen reader incorrectly renders the check marks and crosses in front of the "do's and don'ts as m's and n's".

4.2.2 Frequency of identified bug categories

Based on the 'proportion of bugs across all projects (BugPropAll)' metric, we observe UI bugs to be the most frequent category, whereas documentation is the least frequent category. We provide a complete breakdown of the metric in Table 6. Data bugs have four sub-categories: storage, mining, location, and time series. The frequency for storage, mining, location, and time series is respectively 4.7%, 5.8%, 87.2%, and 2.3%. Algorithm bugs have two sub-categories: statistical modeling and wrong logic. The frequency for statistical modeling and wrong logic is respectively 42.3% and 57.7%.

Table 6. Frequency of identified bug categories. UI-related bugs are the most frequent.
Bug category    BugPropAll (%)
UI              38.2
Data            30.9
Dependency      18.9
Algorithm       7.8
Syntax          6.7
Security        2.5
Performance     1.6
Documentation   1.4

We observe bug category frequency to vary across different categories of projects. We provide the 'proportion of bugs for a certain project category (BugPropCateg)' values for each project category in Table 7. 'Agg', 'Mine', 'Sta', 'Edu', 'Trak', 'Vol', and 'Equ' respectively correspond to the seven project categories: aggregation, mining, statistical modeling, education, user tracking, volunteer management, and medical equipment. According to Table 7, except for mining and medical equipment software, the dominant bug category is UI. One possible explanation is that the analyzed software projects have UIs, which may have contributed to the frequency of UI bugs. For mining software, the dominant bug category is data bugs, i.e., bugs that occur due to the storing and processing of COVID-19 data. For medical equipment software, the dominant bug category is dependency. We also notice algorithm bugs to be the second most frequent bug category for statistical modeling software. Similar to prior work on machine learning (Thung et al., 2012), we expected algorithm bugs to be the most dominant category for statistical modeling. However, statistical modeling software also has UIs for user interaction, and the count of UI bugs may have overshadowed the count of algorithm bugs.

4.2.3 Rater agreement and verification

We report agreement rates for four steps: issue labeling, open coding, closed coding, and rater verification.

Labeling issues as bugs: While labeling the collected issue reports as bug reports and non-bug reports, the agreement rate is 96.5% and Cohen's kappa is 0.9.
Open coding to identify bug categories: The first and second authors respectively identified 9 and 10 categories. The agreement rate is 72.7%, and Cohen's kappa is 0.70, indicating 'substantial' agreement (Landis and Koch, 1977). The first author identified 'database' as a category not identified by the second author. Upon discussion, both authors agreed that 'database' is related to data storage and belongs to the data category. The second author identified two additional categories, 'public health data' and 'type errors'. After discussing the definitions of all categories, both authors agreed that 'public health data' and 'type errors' can respectively be merged with data and syntax.

Closed coding to quantify bug category frequency: During closed coding, the first and second authors mapped each bug report to an existing category. The agreement rate is 95.1% and Cohen's kappa is 0.93. The authors disagreed on the labeling of 27 bug reports, which was resolved through discussion.

Rater verification: For 250 randomly selected issue reports, we allocate an additional rater who manually identified which of the issue reports are bug reports and which are non-bug reports. Cohen's kappa between the additional rater and the first author is 0.80, indicating 'substantial' agreement (Landis and Koch, 1977). Cohen's kappa between the additional rater and the second author is 0.84, indicating 'almost perfect' agreement (Landis and Koch, 1977). The agreement rate between the additional rater and the first and second authors is respectively 89.0% and 93.0%.

We have also measured the agreement rate between an additional rater and the authors for categorizing bug reports. Cohen's kappa between the additional rater and the first author for a randomly selected set of 250 bug reports is 0.65, indicating 'substantial' agreement (Landis and Koch, 1977). Cohen's kappa between the additional rater and the second author for a randomly selected set of 250 bug reports is 0.68, indicating 'substantial' agreement (Landis and Koch, 1977). The agreement rate between the additional rater and the first and second authors is respectively 78.0% and 81.6%.

4.2.4 Resolution time of identified bug categories

We provide the bug resolution time, measured in hours, for all bug categories in Table 8. From Table 8 we observe that, based on minimum and median bug resolution times, security bugs take the longest to resolve, followed by algorithm bugs. We also observe data bugs to take as long as 548 hours to resolve. A breakdown of bug resolution time across the seven project categories is provided in Table 9.

Table 7. Bug category frequency for each identified project type. All values are presented in (%).
Bug categ.    Agg   Mine  Sta   Edu   Trak  Vol   Equ
Algorithm     6.8   6.7   22.4  3.4   0.0   2.5   0.0
Data          28.6  60.6  13.2  15.5  0.0   12.5  0.0
Dependency    16.3  18.0  18.3  24.1  9.7   27.5  75.0
Document      0.9   1.3   1.0   0.0   0.0   10.0  0.0
Performance   2.7   2.0   0.0   0.0   3.2   0.0   0.0
Security      1.8   0.0   3.0   3.4   6.4   12.5  0.0
Syntax        5.9   3.3   14.3  17.2  3.2   10.0  0.0
UI            50.0  12.0  34.7  44.8  77.4  32.5  25.0

Table 8. Resolution time of identified bug categories, measured in hours. Median resolution time is highest for security bugs.
Bug category    Min    Median  Max
Security        1.240  13.9    144.6
Algorithm       0.041  13.5    172.7
Syntax          0.004  12.1    174.2
UI              0.003  11.8    254.2
Data            0.003  8.4     548.0
Performance     0.961  7.1     104.4
Dependency      0.014  2.4     379.4
Documentation   0.013  1.4     76.8
Table 9. Resolution time of bug categories grouped by project categories, measured in hours. Median bug resolution time is highest for projects related to medical equipment software.
Project category              Min    Median  Max
Medical equipment             5.0    29.4    46.4
Volunteer management system   0.013  21.1    174.2
User tracking                 0.124  16.5    294.5
Education                     0.121  11.2    294.5
Aggregation                   0.003  8.7     379.4
Statistical modeling          0.004  7.2     168.3
Mining                        0.005  2.5     548.1
All                           0.003  7.4     548.0

The 'All' row in Table 9 shows the minimum, median, and maximum bug resolution time across all bug categories, measured in hours. In Table 9 we observe four instances where the minimum bug resolution time is less than 6 minutes (< 0.1 hours). One possible explanation is practitioners' habit of opening a bug report after they have already developed the fix for a bug (Wan et al., 2017; Thung et al., 2012). In such cases, practitioners notice the bug early, construct the fix, and then submit the bug report by opening and closing it promptly.

The median bug resolution duration for each project type and bug category is provided in Table 10. 'Agg', 'Mine', 'Sta', 'Edu', 'Trak', 'Vol', and 'Equ' respectively correspond to the seven project categories: aggregation, mining, statistical modeling, education, user tracking, volunteer management, and medical equipment. We observe the median bug resolution time to vary across bug categories as well as across project categories.

Table 10. Median bug resolution time for each bug category and each project type, measured in hours. '—' indicates categories for which no bug reports exist.
Bug cat.      Agg   Mine  Sta   Edu   Trak  Vol   Equ
Algorithm     9.8   10.8  13.9  10.1  —     13.5  —
Data          12.2  4.4   15.2  17.0  —     42.0  —
Dependency    5.6   0.1   0.3   4.5   5.3   2.9   22.4
Document      1.3   39.0  1.5   —     —     6.9   —
Performance   7.1   36.6  —     —     1.5   —     —
Security      8.1   —     3.1   84.1  13.9  20.4  —
Syntax        12.1  4.7   11.4  8.6   16.9  79.3  —
UI            8.3   2.7   13.1  16.8  18.7  21.9  46.4

4.3 Answer to RQ3: How similar are the identified bug categories to those of previously studied software projects?

We report our findings in Table 11. The 'Bug category' column presents the bug categories identified for COVID-19 software projects, whereas the 'Other software projects' column presents the software projects for which the bug category was observed according to the papers identified in our scoping review. We observe that bug categories for COVID-19 software projects are also observable for other categories of software projects, such as deep learning and autonomous vehicle software.

Table 11. Comparison of bug categories of COVID-19 software projects with those of other software project categories.
Bug category    Other software projects
Security        IaC (Rahman et al., 2020), OSS GitHub projects (Ray et al., 2014)
Algorithm       Autonomous vehicle (Garcia et al., 2020), OSS GitHub projects (Ray et al., 2014)
Syntax          IaC (Rahman et al., 2020), deep learning (Islam et al., 2019), OSS GitHub projects (Ray et al., 2014)
UI              Blockchain (Wan et al., 2017)
Data            Deep learning (Islam et al., 2019)
Performance     OSS GitHub projects (Ray et al., 2014)
Dependency      IaC (Rahman et al., 2020)
Documentation   Autonomous vehicle (Garcia et al., 2020), IaC (Rahman et al., 2020)

5 Discussion

In this section, we first provide a summary of our findings in Section 5.1. Next, we provide a discussion of the implications of our findings in Section 5.2.
5.1 Summary

Project category: Aggregation
Definition: aggregate COVID-19 data and present it using visualizations
Count: 50 out of 129 (38.7%)
Most frequent bug category: UI bugs
Median bug resolution time: 8.7 hours

Project category: Mining
Definition: mine COVID-19 data
Count: 35 out of 129 (27.1%)
Most frequent bug category: data bugs
Median bug resolution time: 2.5 hours

Project category: Statistical modeling
Definition: use statistical models to make COVID-19 predictions
Count: 22 out of 129 (17.0%)
Most frequent bug category: UI bugs
Median bug resolution time: 7.2 hours

Project category: Education
Definition: educate people about COVID-19
Count: 9 out of 129 (6.9%)
Most frequent bug category: UI bugs
Median bug resolution time: 11.2 hours

Project category: User tracking
Definition: track user data related to COVID-19
Count: 9 out of 129 (6.9%)
Most frequent bug category: UI bugs
Median bug resolution time: 16.5 hours

Project category: Volunteer management
Definition: efficiently manage volunteering effort related to COVID-19
Count: 7 out of 129 (5.4%)
Most frequent bug category: UI bugs
Median bug resolution time: 21.1 hours

Project category: Medical equipment
Definition: source code for the design and implementation of medical devices
Count: 3 out of 129 (2.3%)
Most frequent bug category: dependency bugs
Median bug resolution time: 29.4 hours

5.2 Implications

We discuss the implications of our findings below:

Security and privacy implications of user tracking software: From Table 3 we observe 9 projects to be related to user tracking. While the benefits of user tracking software have been documented for countries such as Russia and South Korea (Crowell Morning, 2020), this category of software can have negative impacts on the privacy of end-users. Data generated from user tracking software can be leveraged for marketing purposes. We make the following recommendations to preserve the privacy of user data in user tracking software:

• Policymakers should construct policies specific to COVID-19 software that collects user data.
• Practitioners who develop user tracking software should leverage existing privacy policy frameworks, such as the 'National Institute of Standards and Technology (NIST) Privacy Framework' (2020).
• privacy researchers can build tools that will automatically detect and report privacy policy violations.

evidence from table 7 shows that security bugs exist for user tracking software. we advocate that security researchers systematically investigate whether user tracking software includes security bugs. recent news articles suggest that user tracking software, such as contact tracing apps, may become more and more prevalent, as apple and google are already providing frameworks to build software that tracks user data (apple, 2020). our hypothesis is that the availability of these frameworks will facilitate rapid development and deployment of mobile apps that collect user data. security weaknesses in these apps can provide malicious users the opportunity to conduct large-scale data breaches. we notice anecdotal evidence in this regard: a researcher has identified vulnerabilities in a user tracking app that could leak user location data (greenberg, 2020). panelists at eurocrypt 2020, a cryptography research conference, discussed limitations of user tracking mobile apps for covid-19 with respect to api design, indoor location tracking, and informing users about privacy risks (eurocrypt, 2020a; eurocrypt, 2020b).

towards constructing correct statistical models: from section 4.2.1 we have observed that statistical modeling bugs exist. bugs related to statistical modeling can be consequential because policymakers enforce public health policies based on the predictions that statistical models generate. one possible explanation for buggy statistical models can be attributed to the quality of the datasets with which the statistical models are built (koerth et al., 2020). for example, fatality prediction models that are built using the 'diamond princess cruise ship dataset' may not be applicable to a specific geographic region with low population density. another possible explanation can be a lack of public-health-specific context and knowledge that hinders model builders from identifying appropriate independent variables to construct the models. incorrect estimation of hospital beds, from our discussion in section 4.2.1, is one example. other examples of independent variables related to public health include staff availability, count of known cases, and hospitalization rate (attia, 2020). according to a health expert (attia, 2020), statistical models that predicted 2.4 million us residents to die assumed a hospitalization rate of 15–20%, which in reality was 5%. based on our findings and the above-mentioned explanations, we make two recommendations:

• automated testing for covid-19 modeling: we hope to see novel research in the domain of covid-19 that will test the correctness of constructed statistical models used in forecasting in an automated manner. in recent years, we have seen research efforts that test deep learning models (tian et al., 2018; pei et al., 2017; ma et al., 2018). we expect similar research pursuits for covid-19 statistical modeling (a minimal sketch of such a check follows this list).
• better synergies between data science and public health practitioners: construction and verification of covid-19 statistical modeling should involve practitioners from public health and data science. public health practitioners within a specific locality can provide necessary context that data scientists can incorporate in their statistical models.
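to make the first recommendation concrete, the sketch below illustrates one form such an automated check could take: validating a forecast's implied hospitalization rate against an externally observed bound. the model outputs and the bound are hypothetical, chosen only to mirror the 15–20% versus 5% discrepancy discussed above; this is not a technique taken from any of the studied projects.

```python
# minimal sketch of an automated sanity test for a covid-19 forecasting model.
# all numbers below are hypothetical illustrations, not data from this study.

def implied_hospitalization_rate(predicted_hospitalizations, predicted_cases):
    """hospitalization rate implied by a model's own outputs."""
    return sum(predicted_hospitalizations) / sum(predicted_cases)

def check_model_assumptions(predicted_hospitalizations, predicted_cases,
                            plausible_max_rate=0.10):
    """flag forecasts whose implied hospitalization rate is implausible.

    the bound should come from observed data (e.g., the ~5% rate reported
    by attia (2020)), not from the model under test.
    """
    rate = implied_hospitalization_rate(predicted_hospitalizations, predicted_cases)
    if rate > plausible_max_rate:
        raise AssertionError(
            f"implied hospitalization rate {rate:.1%} exceeds "
            f"plausible bound {plausible_max_rate:.1%}")
    return rate

# a model that bakes in the criticized 15-20% assumption would be caught:
cases = [2_000] * 30             # hypothetical 30-day case forecast
hospitalizations = [360] * 30    # implies an 18% hospitalization rate
try:
    check_model_assumptions(hospitalizations, cases)
except AssertionError as err:
    print("model flagged:", err)
```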
implications for educators: our findings have implications for educators involved in teaching the following topics:

• data science: educators who teach data science can use the examples of statistical modeling bugs to highlight the value of considering the full context and related limitations that accompany statistical modeling.
• information security and privacy: user tracking software can be discussed in information security and privacy courses to demonstrate the value of protecting user data. such discussion can also include privacy policy frameworks that are already in place, such as the nist privacy framework (national institute of standard and technology, 2020).
• software engineering: our categorization of bugs related to covid-19 software development can be discussed to demonstrate that understanding and repair of bugs requires contextualization.

benchmark for practitioners and researchers: tables 6–10 can be used as a measuring stick by practitioners and researchers who are involved with covid-19 software projects. practitioners can estimate their bug resolution efforts by comparing median resolution times for bugs in their covid-19 software projects to those of tables 8, 9, and 10 (a short sketch of such a comparison appears below). compared to prior work related to blockchain and machine learning (thung et al., 2012; wan et al., 2017), median bug resolution time is lower for covid-19 software projects. we provide two possible explanations: one possible explanation can be related to the sense of urgency. practitioners may have realized that bugs in covid-19 software projects could hamper the analysis or mitigation of covid-19 and therefore need immediate attention. another possible explanation can be the limitations of our dataset. the age of our software projects does not exceed four months, and that may have biased median bug resolution time. we advocate for future research that will confirm or refute our explanations.
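the following sketch shows how a team might use table 9 as such a benchmark. the study medians are taken from table 9 above; the team's own bug data is hypothetical.

```python
# hedged illustration: compare a team's own median bug resolution time
# against the per-project-category medians reported in table 9.
from statistics import median

STUDY_MEDIANS_HOURS = {  # from table 9 of this study
    "medical equipment": 29.4, "volunteer management system": 21.1,
    "user tracking": 16.5, "education": 11.2, "aggregation": 8.7,
    "statistical modeling": 7.2, "mining": 2.5, "all": 7.4,
}

own_resolution_hours = [3.2, 11.0, 7.8, 26.5, 5.1]  # hypothetical team data
own_median = median(own_resolution_hours)
baseline = STUDY_MEDIANS_HOURS["aggregation"]  # pick the matching category

verdict = "slower" if own_median > baseline else "faster"
print(f"own median {own_median:.1f} h vs. study median {baseline:.1f} h ({verdict})")
```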
recurrence-related implications: researchers (kissler et al., 2020; chen et al., 2020) have provided evidence that supports the recurring nature of covid-19. about the recurrence of covid-19, kissler et al. (2020) stated that "a resurgence in contagion could be possible as late as 2024". we hypothesize that covid-19's recurrence will lead to more covid-19 software building. whether or not our findings hold for these newly constructed covid-19 software can be validated through a replication of our paper. we expect to observe more categories of covid-19 software projects as well as more bug categories.

5.3 differences between covid-19 software projects and other software projects

we provide the differences that we have noticed between covid-19 software projects and other software projects, which we discuss in the following subsections:

5.3.1 differences in bug manifestation

a non-covid-19 software project does not have the context of public health consequences that are associated with a covid-19 software project. we define a covid-19 software project to be a software project that is related with analyzing and mitigating the consequences of covid-19. by definition, we include software projects that directly capture the consequences related to public health, which is absent from a traditional software project. we observe empirical evidence that shows the unique context of covid-19 to yield differences in bugs and bug resolution time when compared with other software projects. let us consider the case of algorithm bugs. algorithm bugs manifest in covid-19 projects as well as in machine learning and autonomous vehicle projects. a machine learning project that uses statistical modeling can have algorithm bugs that generate erroneous predictions. for a covid-19 software project that predicts death rates, a bug related to the modeling algorithm can have serious consequences, as public health policies are derived based on these models, as occurred during the incorrect estimation of hospitalization rate (attia, 2020). as discussed in section 4.3, algorithm-related bugs also appear for autonomous vehicles, but such bugs manifest in components unique to autonomous vehicle projects, such as lane positioning and navigation, and traffic light processing.

we have observed that data bugs appear for both deep learning projects and covid-19 software projects. the difference is that for covid-19 we have the concept of location, as practitioners tend to miss important location-related data for covid-19, e.g., not being able to identify states in india that are observing an outbreak of covid-19. in the case of deep learning projects, data bugs are related with the structure and type of training data.

as another example, dependency-related bugs appear for both iac scripts and covid-19 software projects. in the case of iac, dependency-related bugs are related to an iac-related artifact, such as a puppet manifest, class, or module, upon which execution of an iac script is dependent (rahman et al., 2020). for covid-19 software projects, dependencies are related with api and build artifacts, such as maven dependencies. this difference with respect to dependent artifacts also highlights the differences between covid-19 software projects and iac-based software projects.

in short, our findings suggest that while there are commonalities in bug categories between covid-19 software projects and other software projects, the manifestation and artifacts related to the bug categories differ from other categories of software projects.

5.3.2 difference in bug resolution time

our findings indicate that median bug resolution time is lower for covid-19 software projects than that of blockchain and machine learning projects. based on our findings, we conjecture that the sense of urgency might have motivated practitioners to fix bugs in covid-19 software projects.

5.3.3 differences with existing healthcare-related software projects

our findings also demonstrate differences between covid-19 software projects and other projects related to the healthcare domain. to illustrate these differences we use janamanchi et al. (2009)'s work. janamanchi et al. (2009) studied 174 open source software projects related to the health domain and identified 11 categories of software projects that do not include the three categories of projects that we have identified for covid-19 software projects: volunteer management, user tracking, and education. the inception and spread of covid-19 have motivated software practitioners to create a wide range of software projects, such as projects related to user tracking and volunteer management, so that people are aware of the consequences and hygiene practices related to covid-19. in the context of covid-19 software projects, projects related to user tracking focus on tracking user location data emitted from smartphones to assess the proximity of individuals who might be exposed to covid-19.
software projects related to volunteer management are related with managing volunteers to address covid-19-related societal issues, such as food banking. a pandemic of this nature was not experienced by health professionals prior to 2020. existing research on software projects that belong to the health domain was not able to characterize covid-19 software projects and identify project categories unique to covid-19. janamanchi et al. (2009) did not systematically study the types of bugs that appear in health care software projects. our paper complements janamanchi et al. (2009)'s work by characterizing the bugs in healthcare-related projects that address covid-19 and the types of covid-19 software projects in which these bugs appear.

6 threats to validity

we describe the limitations of our paper as follows:

conclusion validity: we have used raters who derived the software and bug categories. both raters are authors of the paper. our derived categories are susceptible to the authors' bias. we mitigate this limitation by allocating another rater, who is not an author of the paper, to verify our ratings. our categories might not be comprehensive because our categorization for projects and bugs is limited to the dataset that we collected. the bug resolution time could be limiting, as our dataset includes projects that have a duration of four months. we use the topic 'covid-19' to identify and filter covid-19 software projects from github. any software project that is not labeled as 'covid-19' will not be included in our dataset. our datasets have a limited lifetime, as covid-19 was discovered in december 2019, and the lack of maturity in our datasets may influence our analysis. we mitigate this limitation by identifying projects using filtering criteria so that we can identify projects with sufficient development activity.

internal validity: for rq1 and rq2 we use ourselves, the authors of the paper, as raters who conduct open and closed coding on readme files and bug reports. our research is susceptible to mono-method bias, as our categorization and labeling may be influenced by the authors' implicit expectations and hypotheses about the study.

external validity: our findings are not comprehensive. we have not analyzed projects hosted outside github or private projects hosted on github. we mitigate this limitation by analyzing 129 software projects that belong to 7 categories. also, as we have used open coding to determine categories, our findings may not be identified by other raters. we mitigate this limitation by conducting rater verification, where we use a rater who is not an author of the paper.
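as a hedged illustration of the project-identification step described under conclusion validity, the sketch below queries github for repositories carrying the 'covid-19' topic and keeps those with signs of development activity. it uses the pygithub library; the token and the activity thresholds are placeholders, not the paper's exact filtering criteria.

```python
# sketch: identify candidate covid-19 projects via the github topic search
# and filter for development activity. thresholds are assumed, not the
# paper's exact criteria.
from github import Github

g = Github("YOUR_GITHUB_TOKEN")  # hypothetical token
candidates = g.search_repositories(query="topic:covid-19")

MIN_COMMITS, MIN_CONTRIBUTORS = 50, 2  # assumed activity thresholds

selected = []
for repo in candidates[:100]:  # sample; the full search spans many pages
    if (repo.get_commits().totalCount >= MIN_COMMITS
            and repo.get_contributors().totalCount >= MIN_CONTRIBUTORS):
        selected.append(repo.full_name)

print(f"kept {len(selected)} of 100 sampled repositories")
```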
7 conclusion

the covid-19 pandemic has impacted people all over the world, causing thousands of deaths. software practitioners have joined the fight in combating the spread and mitigating the dire consequences of covid-19. an understanding of covid-19 software categories and software bugs can give us clues on how the software engineering community can help even further in combating covid-19.

we conduct an empirical study with 129 covid-19 software projects hosted on github. we identify 7 categories of software projects: aggregation, mining, statistical models, education, volunteer management, user tracking, and medical equipment. by applying open coding on 550 bug reports, we identify 8 categories of bugs: algorithm, data, dependency, documentation, performance, security, syntax, and ui. we observe bug category frequency to vary with project categories, e.g., for mining projects data-related bugs are the most frequently occurring category.

our findings have implications for educators, practitioners, and researchers. educators can use our categorization of covid software projects and related bugs to educate students about the security and privacy implications of covid-19 software. privacy researchers can build tools that will check that user tracking software related to covid-19 is not leaking user data. practitioners in the data science domain can learn from our categorization of statistical modeling bugs to understand the limitations of constructed statistical models and verify the underlying assumptions that accompany them. based on our findings, we also advocate for better synergies between data scientists and public health experts so that statistical modeling bugs can be mitigated. we hope our paper will advance further research in the domain of covid-19 software.

acknowledgements

we thank the paser group at tennessee technological university for their useful feedback. we also thank farzana ahamed bhuiyan of tennessee technological university for her help as an additional rater. the research was partially supported by the national science foundation (nsf) award # 2026869.

references

abquirarte (2020). accessibility fixes. github.com/cagov/covid19/issues/137. [online; accessed 10-may-2020].
agrawal, a., rahman, a., krishna, r., sobran, a., and menzies, t. (2018). we don't need another hero?: the impact of "heroes" on software development. in proceedings of the 40th international conference on software engineering: software engineering in practice, icse-seip '18, pages 245–253, new york, ny, usa. acm.
alasdair sandford (2020). coronavirus: half of humanity now on lockdown as 90 countries call for confinement. https://www.euronews.com/2020/04/02/. [online; accessed 17-apr-2020].
anderson, s., allen, p., peckham, s., and goodwin, n. (2008). asking the right questions: scoping studies in the commissioning of research on the organisation and delivery of health services. health research policy and systems, 6(1):7.
apple (2020). privacy-preserving contact tracing. https://www.apple.com/covid19/contacttracing. [online; accessed 25-may-2020].
applifting (2020). pomuzeme.si. github.com/applifting/pomuzeme.si. [online; accessed 09-may-2020].
attia, p. (2020). comparing covid-19 to past pandemics, preparing for the future, and reasons for optimism. https://peterattiamd.com/ameshadalja/. [online; accessed 21-may-2020].
begley, s. (2020a). death rates should increase when icu's are overwhelmed. https://github.com/neherlab/covid19_scenarios/issues/7. [online; accessed 10-may-2020].
begley, s. (2020b). influential covid-19 model uses flawed methods and shouldn't guide u.s. policies, critics say. https://www.statnews.com/2020/04/17/. [online; accessed 10-may-2020].
boogheta (2020). boogheta/coronavirus-countries. https://github.com/boogheta/coronavirus-countries. [online; accessed 09-may-2020].
butler, j. l. and jaffe, s. (2020). challenges and gratitude: a diary study of software engineers working from home during covid-19 pandemic.
catolino, g., palomba, f., zaidman, a., and ferrucci, f. (2019). not all bugs are the same: understanding, characterizing, and classifying bug types. journal of systems and software, 152:165–181.
cdc (2020). cases, data, and surveillance. https://www.cdc.gov/coronavirus/2019-ncov/cases-updates/index.html. [online; accessed 09-may-2020].
chen, d., xu, w., lei, z., huang, z., liu, j., gao, z., and peng, l. (2020). recurrence of positive sars-cov-2 rna in covid-19: a case report. international journal of infectious diseases, 93:297–299.
cohen, j. (1960). a coefficient of agreement for nominal scales. educational and psychological measurement, 20(1):37–46.
corey, l., mascola, j. r., fauci, a. s., and collins, f. s. (2020). a strategic approach to covid-19 vaccine r&d. science.
cormen, t. h., leiserson, c. e., rivest, r. l., and stein, c. (2009). introduction to algorithms. mit press.
crabtree, b. f. and miller, w. l. (1999). doing qualitative research. sage publications.
crowell morning (2020). mobile applications for covid tracking & tracing – balancing the need for personal information and privacy rights in the time of coronavirus. https://www.crowell.com/newsevents/alertsnewsletters/all/. [online; accessed 20-may-2020].
de clercq, e. (2006). potential antivirals and antiviral strategies against sars coronavirus infections. expert review of anti-infective therapy, 4(2):291–302.
deepset ai (2020). deepset-ai/covid-qa. https://github.com/deepset-ai/covid-qa. [online; accessed 09-may-2020].
dehning, j., zierenberg, j., spitzner, f. p., wibral, m., neto, j. p., wilczek, m., and priesemann, v. (2020). inferring change points in the spread of covid-19 reveals the effectiveness of interventions. science.
elcronos (2020). elcronos/covid-19. https://github.com/elcronos/covid-19. [online; accessed 09-may-2020].
emery berger (2021). csrankings: computer science rankings. http://csrankings.org/#/index?all&us. [online; accessed 31-february-2021].
enigmampc (2020). safetrace. github.com/enigmampc/safetrace. [online; accessed 09-may-2020].
erin duffin (2020). impact of the coronavirus pandemic on the global economy - statistics & facts. https://www.statista.com/topics/6139/covid-19-impact-on-the-global-economy/. [online; accessed 08-may-2020].
eurocrypt (2020a). eurocrypt 2020 program. https://eurocrypt.iacr.org/2020/program.php. [online; accessed 16-may-2020].
eurocrypt (2020b). s-212 panel discussion on contact tracing. https://youtu.be/xt4p8e_y-xc. [online; accessed 16-may-2020].
evans, a. b., blackwell, j., dolan, p., fahlén, j., hoekman, r., lenneis, v., mcnarry, g., smith, m., and wilcock, l. (2020). sport in the face of the covid-19 pandemic: towards an agenda for research in the sociology of sport.
farhana, e., imtiaz, n., and rahman, a. (2019). synthesizing program execution time discrepancies in julia used for scientific software. in 2019 ieee international conference on software maintenance and evolution (icsme), pages 496–500.
garcia, j., feng, y., shen, j., almanee, s., xia, y., and chen, q. a. (2020). a comprehensive study of autonomous vehicle bugs. in proceedings of the 42nd international conference on software engineering, icse '20. to appear.
github (2020a). covid-19: github topics. https://github.com/topics/covid-19. [online; accessed 07-may-2020].
github (2020b). language savant. https://github.com/github/linguist. [online; accessed 07-may-2020].
github (2020c). search: covid-19. https://github.com/search?q=covid-19. [online; accessed 07-may-2020].
greenberg, a. (2020). india's covid-19 contact tracing app could leak patient locations. https://www.wired.com/story/india-covid-19-contract-tracing-app/. [online; accessed 23-may-2020].
helms, j., kremer, s., merdji, h., clere-jehl, r., schenck, m., kummerlen, c., collange, o., boulay, c., fafi-kremer, s., ohana, m., et al. (2020). neurologic features in severe sars-cov-2 infection. new england journal of medicine.
helpwithcovid (2020). helpwithcovid/covid-volunteers. https://github.com/helpwithcovid/covid-volunteers. [online; accessed 09-may-2020].
herzig, k., just, s., and zeller, a. (2013). it's not a bug, it's a feature: how misclassification impacts bug prediction. in proceedings of the 2013 international conference on software engineering, icse '13, pages 392–401. ieee press.
hu, f. z. and qian, j. (2017). land-based finance, fiscal autonomy and land supply for affordable housing in urban china: a prefecture-level analysis. land use policy, 69:454–460.
huang, y., sun, m., and sui, y. (2020). how digital contact tracing slowed covid-19 in east asia. https://hbr.org/2020/04/how-digital-contact-tracing-slowed-covid-19. [online; accessed 09-may-2020].
ieee (2010). ieee standard classification for software anomalies. ieee std 1044-2009 (revision of ieee std 1044-1993), pages 1–23.
imperialcollegelondon (2020). imperialcollegelondon/covid19model. https://github.com/imperialcollegelondon/covid19model. [online; accessed 09-may-2020].
islam, m. j., nguyen, g., pan, r., and rajan, h. (2019). a comprehensive study on deep learning bug characteristics. in proceedings of the 2019 27th acm joint meeting on european software engineering conference and symposium on the foundations of software engineering, esec/fse 2019, pages 510–520, new york, ny, usa. association for computing machinery.
ivan aksamentov (2020). fix types and linting errors. https://github.com/neherlab/covid19_scenarios/issues/101. [online; accessed 10-may-2020].
janamanchi, b., katsamakas, e., raghupathi, w., and gao, w. (2009). the state and profile of open source software projects in health and medical informatics. international journal of medical informatics, 78(7):457–472.
jarynowski, a., wójta-kempa, m., płatek, d., and czopek, k. (2020). attempt to understand public health relevant social dimensions of covid-19 outbreak in poland. available at ssrn 3570609.
jin, z., zhao, y., sun, y., zhang, b., wang, h., wu, y., zhu, y., zhu, c., hu, t., du, x., et al. (2020). structural basis for the inhibition of sars-cov-2 main protease by antineoplastic drug carmofur. nature structural & molecular biology, pages 1–4.
john hopkins university (2020). corona virus resource center. https://coronavirus.jhu.edu/. [online; accessed 31-may-2020].
johof (2020). johof/lungmask. https://github.com/johof/lungmask. [online; accessed 09-may-2020].
juanmnl (2020). covid19-monitor. github.com/juanmnl/covid19-monitor. [online; accessed 09-may-2020].
kissler, s. m., tedijanto, c., goldstein, e., grad, y. h., and lipsitch, m. (2020). projecting the transmission dynamics of sars-cov-2 through the postpandemic period. science.
koerth, m., bronner, l., and mithani, j. (2020). why it's so freaking hard to make a good covid-19 model. https://fivethirtyeight.com/features/why-its-so-freaking-hard-to-make/. [online; accessed 22-may-2020].
kraemer, m. u., yang, c.-h., gutierrez, b., wu, c.-h., klein, b., pigott, d. m., du plessis, l., faria, n. r., li, r., hanage, w. p., et al. (2020). the effect of human mobility and control measures on the covid-19 epidemic in china. science, 368(6490):493–497.
landis, j. r. and koch, g. g. (1977). the measurement of observer agreement for categorical data. biometrics, 33(1):159–174.
landovsky (2020). fix password reset procedure. https://github.com/applifting/pomuzeme.si/issues/99. [online; accessed 10-may-2020].
linares-vásquez, m., bavota, g., and escobar-velasquez, c. (2017). an empirical study on android-related vulnerabilities. in proceedings of the 14th international conference on mining software repositories, msr '17, pages 2–13, piscataway, nj, usa. ieee press.
ma, l., zhang, f., sun, j., xue, m., li, b., juefei-xu, f., xie, c., li, l., liu, y., zhao, j., and wang, y. (2018). deepmutation: mutation testing of deep learning systems. in 2018 ieee 29th international symposium on software reliability engineering (issre), pages 100–111.
ma, w., chen, l., zhang, x., zhou, y., and xu, b. (2017). how do developers fix cross-project correlated bugs? a case study on the github scientific python ecosystem. in proceedings of the 39th international conference on software engineering, icse '17, pages 381–392. ieee press.
makers-for-life (2020). makers-for-life/makair. https://github.com/makers-for-life/makair. [online; accessed 09-may-2020].
marivate, v. and combrink, h. m. (2020). use of available data to inform the covid-19 outbreak in south africa: a case study. data science journal, 19(1):1–7.
marivate, v., nsoesie, e., bekele, e., and open covid-19 data working group, a. (2020). coronavirus covid-19 (2019-ncov) data repository for africa.
mdeous (2020). missing code of conduct. https://github.com/reach4help/reach4help/issues/135. [online; accessed 10-may-2020].
mello, m. m. and wang, c. j. (2020). ethics and governance for digital disease surveillance. science.
mitchell hartman (2020). covid-19 jobless claims are now over 40 million. many are still waiting for unemployment benefits. https://www.marketplace.org/2020/05/28/covid-19-jobless-claims-unemployment-benefits-waiting/. [online; accessed 31-may-2020].
mockus, a., fielding, r. t., and herbsleb, j. d. (2002). two case studies of open source software development: apache and mozilla. acm trans. softw. eng. methodol., 11(3):309–346.
munaiah, n., kroh, s., cabrey, c., and nagappan, m. (2017). curating github for engineered software projects. empirical software engineering, pages 1–35.
munn, z., peters, m. d., stern, c., tufanaru, c., mcarthur, a., and aromataris, e. (2018). systematic review or scoping review? guidance for authors when choosing between a systematic or scoping review approach. bmc medical research methodology, 18(1):143.
national institute of standard and technology (2020). nist privacy framework. https://www.nist.gov/privacy-framework. [online; accessed 24-may-2020].
neherlab (2020). covid19_scenarios. github.com/neherlab/covid19_scenarios. [online; accessed 09-may-2020].
nthopinion (2020). nthopinion/covid19. https://github.com/nthopinion/covid19. [online; accessed 09-may-2020].
oliveira, e., leal, g., valente, m. t., morandini, m., prikladnicki, r., pompermaier, l., chanin, r., caldeira, c., machado, l., and de souza, c. (2020). surveying the impacts of covid-19 on the perceived productivity of brazilian software developers. in proceedings of the 34th brazilian symposium on software engineering, sbes '20, pages 586–595, new york, ny, usa. association for computing machinery.
openmined (2020). covid-alert. github.com/openmined/covid-alert. [online; accessed 09-may-2020].
paul, r., baltes, s., gianisa, a., torkar, r., kovalenko, v., marcos, k., nicole, n., yoo, s., xavier, d., tan, x., et al. (2020). pandemic programming. empirical software engineering, 25(6):4927–4961.
pavel ilin (2020). temperature data not saved in the backend. https://github.com/covid-19-electronic-health-system/corona-tracker/issues/351. [online; accessed 10-may-2020].
pei, k., cao, y., yang, j., and jana, s. (2017). deepxplore: automated whitebox testing of deep learning systems. in proceedings of the 26th symposium on operating systems principles, sosp '17, pages 1–18, new york, ny, usa. association for computing machinery.
popsolutions (2020). popsolutions/openventilator. https://github.com/popsolutions/openventilator. [online; accessed 09-may-2020].
prana, g. a., treude, c., thung, f., atapattu, t., and lo, d. (2019). categorizing the content of github readme files. empirical softw. engg., 24(3):1296–1327.
pulido, c. m., villarejo-carballido, b., redondo-sama, g., and gómez, a. (2020). covid-19 infodemic: more retweets for science-based information on coronavirus than for false information. international sociology, page 0268580920914755.
rahman, a. and farhana, e. (2020). dataset for paper - covid-19-emse. https://figshare.com/s/7044678e1d7e7feb1efb. [online; accessed 22-january-2021].
rahman, a., farhana, e., parnin, c., and williams, l. (2020). gang of eight: a defect taxonomy for infrastructure as code scripts. in proceedings of the 42nd international conference on software engineering, icse '20. to appear.
ray, b., posnett, d., filkov, v., and devanbu, p. (2014). a large scale study of programming languages and code quality in github. in proceedings of the 22nd acm sigsoft international symposium on foundations of software engineering, fse 2014, pages 155–165, new york, ny, usa. acm.
reustle (2020). fix prefecture sorting. https://github.com/reustle/covid19japan/issues/15. [online; accessed 05-mar-2021].
rourke, m., eccleston-turner, m., phelan, a., and gostin, l. (2020). policy opportunities to enhance sharing for pandemic research. science, 368(6492):716–718.
saldana, j. (2015). the coding manual for qualitative researchers. sage.
singhrajenm (2020). rajasthan district names are wrong. https://github.com/covid19india/covid19india-react/issues/321. [online; accessed 10-may-2020].
soroushchehresa (2020). soroushchehresa/awesome-coronavirus. github.com/soroushchehresa/awesome-coronavirus. [online; accessed 16-may-2020].
subratappt (2020). cluster animation slowing down the browser. it also takes much time. https://github.com/covid19india/covid19india-react/issues/497. [online; accessed 10-may-2020].
tamm, m. v. (2020). covid-19 in moscow: prognoses and scenarios. farmakoekonomika. modern pharmacoeconomic and pharmacoepidemiology, 13(1):43–51.
thung, f., wang, s., lo, d., and jiang, l. (2012). an empirical study of bugs in machine learning systems. in 2012 ieee 23rd international symposium on software reliability engineering, pages 271–280.
tian, y., pei, k., jana, s., and ray, b. (2018). deeptest: automated testing of deep-neural-network-driven autonomous cars. in proceedings of the 40th international conference on software engineering, icse '18, pages 303–314, new york, ny, usa. association for computing machinery.
timoeller (2020). cdc children scraper is outdated. https://github.com/deepset-ai/covid-qa/issues/43. [online; accessed 10-may-2020].
tom simonite (2020). software that reads ct lung scans had been used primarily to detect cancer. now it's retooled to look for signs of pneumonia caused by coronavirus. https://www.wired.com/story/chinese-hospitals-deploy-ai-help-diagnose/. [online; accessed 08-may-2020].
vaclavpavlicek (2020). missing postgis. https://github.com/applifting/pomuzeme.si/issues/164. [online; accessed 10-mar-2021].
van bavel, j. j., baicker, k., boggio, p. s., capraro, v., cichocka, a., cikara, m., crockett, m. j., crum, a. j., douglas, k. m., druckman, j. n., et al. (2020). using social and behavioural science to support covid-19 pandemic response. nature human behaviour, pages 1–12.
vardi, m. y. (2009). conferences vs. journals in computing research. communications of the acm, 52(5):5–5.
wan, z., lo, d., xia, x., and cai, l. (2017). bug characteristics in blockchain systems: a large-scale empirical study. in 2017 ieee/acm 14th international conference on mining software repositories (msr), pages 413–424.
wang, c., li, w., drabek, d., okba, n. m., van haperen, r., osterhaus, a. d., van kuppeveld, f. j., haagmans, b. l., grosveld, f., and bosch, b.-j. (2020). a human monoclonal antibody blocking sars-cov-2 infection. nature communications, 11(1):1–6.
who (2020). global research on coronavirus disease (covid-19). https://www.who.int/emergencies/diseases/novel-coronavirus-2019/global-research-on-novel-coronavirus-2019-ncov. [online; accessed 09-may-2020].
why hunger (2020). why hunger. https://whyhunger.org/map.php. [online; accessed 08-may-2020].
will, c. m. (2020). 'and breathe...'? the sociology of health and illness in covid-19 time. sociology of health & illness.
yang, c. y. and wang, j. (2020). a mathematical model for the novel coronavirus epidemic in wuhan, china. mathematical biosciences and engineering, 17(3):2708–2724.
zbraniecki (2020). data has a gap between 2020-3-11 and 2020-3-24. https://github.com/covidatlas/coronadatascraper/issues/375. [online; accessed 10-may-2020].
zhang, t., chen, j., luo, x., and li, t. (2019). bug reports for desktop software and mobile apps in github: what's the difference? ieee software, 36(1):63–71.
journal of software engineering research and development, 2023, 11:9, doi: 10.5753/jserd.2023.967  this work is licensed under a creative commons attribution 4.0 international license. software architectural practices: influences on the open source ecosystem health simone da silva amorim  [ federal institute of bahia | simone.amorim@ifba.edu.br ] john d. mcgregor [ clemson university | johnmc@clemson.edu ] eduardo santana de almeida [ federal university of bahia | esa@rise.com.br ] christina von flach garcia chavez [ federal university of bahia | flach@ufba.br ] abstract. the health state of a software ecosystem is determined by its capacity for growth and longevity. three health indicators characterize a healthy software ecosystem: robustness, productivity, and niche creation. studies focusing on understanding the causes and processes of the health state of ecosystems have largely used these indicators. researchers have intensified studies to understand how to achieve a good health state. despite the growing number of studies, there is little knowledge about the influences and actions needed to achieve health, and more specifically, about the effects of the software architecture on the ecosystem.
this article presents a study exploring seven open source ecosystems within different domains to describe the influence of architectural practices on software ecosystem health in terms of their motivations and effects. our main goal was to understand how the software architecture and related practices can contribute to a healthy ecosystem. we conducted a netnography-based study to gather practices used to create and maintain the software architecture of these ecosystems. our study brings evidence that architectural practices play a critical role in the achievement of ecosystems' health. we found fifty practices that have influenced different aspects of health indicators. we highlight the importance of five influential factors – business goals, experience, requirements, resources, and time-to-market – for motivating the adoption of such practices. these factors may also contribute to understanding different strategies used to achieve a good health state. moreover, we proposed a novel health indicator, trustworthiness, that accounts for the normal operation of a healthy software ecosystem. keywords: software ecosystems, ecosystem health, software practices, netnographic study

1 introduction

over the years, the development of large-scale software has been problematic and overpriced. most software companies spend a lot of time fixing problems, and there is not enough time to release the necessary features. releasing software components constantly suffers delays; consequently, delivering value to the customer has been a complex challenge. in trying to solve these problems, companies are using a range of techniques to manage the complexity of software development. bosch and bosch-sijtsema introduced trends guiding modern large-scale software development (bosch and bosch-sijtsema, 2010). one of these trends is to provide a software platform and open its boundaries to the surrounding community, sharing work and challenges. this community develops relevant solution elements to satisfy customer needs, composing a software ecosystem. software ecosystems are platforms that allow developers to build new capabilities, providing value based on the exchange between businesses and people. the benefits of platform technologies permit sharing decisions, risks, and profits, causing significant growth in collaboration and platform-minded reasoning (liu, 2017). several organizations follow the ecosystem strategy and have kept their success for several years, such as hadoop, amazon, and apple (satell, 2016). for instance, apple rapidly gained the smartphone market after the iphone's introduction in 2007, when this ecosystem sold 270,000 iphones in the first 30 hours they were available (west and mace, 2010). currently, apple continues to be the market leader: in the fourth quarter of 2020, its iphone sales grew almost 15%, and it was the first time revenue passed the $100 billion mark; $65.6 billion of that came from iphone sales (silverman, 2021). however, there are some cases of failure, like nokia and blackberry. nokia had a dominant position in the market in the early 2000s but failed in the several strategies it adopted to build an ecosystem around its products, leading to its downfall in 2011 with the shift from symbian to the windows phone (bouwman et al., 2014). another example of downfall was blackberry. by 2006, it was a market leader providing services such as blackberry messenger (bbm), email, and the qwerty keyboard.
however, from 2009 until 2013, blackberry started to decline, and in 2017 blackberry finally decided to end the hardware development of mobile devices (mittal, 2019). understanding the reasons for the failures and successes of these ecosystems can help address governance mechanisms more efficiently to achieve a healthy ecosystem. a metaphor defined by iansiti and levien describes a healthy ecosystem with "the growing and continuity of the software ecosystem remaining variable and productive over time" (iansiti and levien, 2002). having awareness of the health status of a software ecosystem is relevant to support decision-making in different areas such as business processes, software design, and social interactions. stakeholders can decide to enter or leave the ecosystem, or to keep their participation in it. for assessing the health of software ecosystems, iansiti and levien proposed three health indicators: robustness, productivity, and niche creation (iansiti and levien, 2002). the health evaluation of a software ecosystem is a complex activity and faces hard challenges. several factors may influence the achievement of possible health states, including the software architecture. health evaluations are often performed using the set of indicators defined by iansiti and levien (2002); however, there has been little research on the relationship between software architecture and the health state of a software ecosystem, mainly with regard to software architecture practices (da silva amorim et al., 2017). to better understand this research topic, we decided to investigate software architectural practices more deeply to know the practices used in ecosystem scenarios, what factors influence their use and determine their adoption, and how these practices can influence the health indicators. by knowing the adoption factors for a practice, governance mechanisms can be created to define the proper adoption of a practice and its consequent influence on the health of the ecosystem. moreover, defining how many architectural practices influence the health of software ecosystems will allow us to figure out the strengths and weaknesses of software development in this environment. based on proper information, guidelines for success can be created to drive efforts for other organizations. this paper addresses the problem by presenting "a netnography-based study of the architectural practices and their influences on the health of open source ecosystems". netnography is an adaptation of the ethnography methodology that considers new conditions of interactions intermediated by computers on the internet (kozinets, 2009). high-level descriptive findings explain the motivations and effects of architectural practices on the ecosystem health. our findings were derived using a netnography-based methodology adapted to the context of software ecosystems that explores the relationships around the architectural practices used. they allowed us to develop substantive results based on data that include explanations of how things are related to each other. our results describe the experience of the authors based on a systematic analysis of such practices. we observed that none of the studied ecosystems adopted all the identified practices.
besides, some of them developed singular practices in accordance with their needs. we argue in our discussion that software architecture has a key role in software health. so far, there is no research considering this influence. from the architectural practices, we can analyze possible effects on the health indicators. based on the awareness of the influence of practices, we highlight the importance of focusing not only on metrics already suggested by other studies (da silva amorim et al., 2017), but also on the investigation of proper architectural practices to evaluate the ecosystem health.

the rest of this paper is structured as follows. section 2 provides some theoretical background on software ecosystems, architecture, and health. section 3 introduces related work in accordance with our study, and section 4 presents the research methodology. section 5 describes the influence factors on the architectural practices and health indicators. section 6 discusses our findings and the relationship between practices and health indicators, and section 7 presents some threats and limitations of the study. finally, section 8 presents our conclusions.

2 software ecosystems, architecture, and health

this section provides some background on software ecosystems, software architecture, and software ecosystem health.

2.1 software ecosystems

bosch and bosch-sijtsema define software ecosystems as "a software platform, a set of internal and external developers and a community of domain experts in service to a community of users that compose relevant solution elements to satisfy their needs" (bosch and bosch-sijtsema, 2010). this idea is based on external developers building applications on top of a platform, extending their products to meet specific needs. by opening the boundaries of the platform, the organization can increase its consumer base faster. in addition, the organization allows third parties to build components to address specific segments of the market that it cannot serve in a given amount of time. therefore, the ecosystem strategy helps organizations accelerate innovation while sharing the cost of this innovation (bosch, 2009). adopting an ecosystem strategy is not a trivial task. in this scenario, new dependencies are created connecting third-party applications on top of the platform. from this moment, software evolution should be coordinated, involving all stakeholders. internal and external teams should be synchronized to serve the interests of all group participants. interfaces between the platform and products should be developed in collaboration, including all developers affected by the changes. also, external developers should validate new releases of the platform to reduce breaks in the components launched in this context (bosch and bosch-sijtsema, 2010). software ecosystem characteristics are commonly organized in three-dimensional views (campbell and ahmed, 2010; dos santos and werner, 2011). these views consider characteristics identifying aspects of business, social, and architectural scenarios.
based on our studies (da silva amorim et al., 2017), we provide our characterization of software ecosystems composed of business, community, and technical views: (i) business views define policies to guide efforts to achieve sustainable growth through proper communication, attracting and retaining developers; (ii) community views lead to challenges such as coordinating work and making decisions, and delays in performing some tasks in conjunction with third parties; (iii) technical views encompass software platforms and mechanisms for building software considering primarily the technology, methods, and tools. furthermore, there is an important classification for software ecosystems considering their openness degree. according to manikas and hansen, a software ecosystem can be proprietary (commercial), free or open source software (foss), or hybrid. the openness degree is related to the level of access to the platform resources. proprietary ecosystems control access to the source code and artifacts. information is protected, so developers should have some kind of permission to engage in the ecosystem. on the other hand, in foss ecosystems, developers have other motivations besides the financial return. for instance, they may want to gain new knowledge or improve their skills (manikas and hansen, 2013). they are often open to any participant willing to join. our work is focused on the scenario of open source ecosystems.

2.2 software ecosystem architecture

according to pelliccione, "software architecture plays a key role in managing software ecosystems and their evolution". indeed, dependencies among the platform and the applications working on top of it generate a set of architectural issues that should be considered during software development. the software architecture will express the business goals of stakeholders through technical mechanisms. especially, the architecture of the platform will define the boundaries of the evolution of the ecosystem and how, and how many, developers can work to extend its features. regarding the large number of software connections found in software ecosystems, architectural decisions have a key role, since wrong decisions can cause serious harm to the balance of the ecosystem. these decisions should consider internal and external influences attending different types of business goals. also, quality attributes affect the actions of developers, who should adapt their applications to work well in sync with the core platform (pelliccione, 2014). in brief, software architecture has a critical role in software ecosystems. some researchers study its impact and the challenges of managing the architecture correctly in this scenario (bosch, 2010; pelliccione, 2014). however, there are few studies addressing the impact of architectural practices on the health of software ecosystems.

2.3 software ecosystem health

iansiti and levien argued that a healthy ecosystem provides "durably growing opportunities for its members and for those who depend on it". they pointed out that in an ecosystem, there are a lot of connections among its members. besides, the health of individual organizations or their products depends on the health of the whole ecosystem. business decisions and strategies of operation directly influence the health of the ecosystem. the well-being, longevity, and performance of the ecosystem depend on the choices of its members.
in this work, the authors also proposed a framework for assessing different strategies adopted by participants of an ecosystem. this approach involves the result of interactions among their members. for this, they defined three indicators for ecosystem health based on biological ecosystems: robustness, productivity, and niche creation (iansiti and levien, 2002). robustness expresses the capability of the ecosystem to survive crises and disruptions. productivity denotes the ability to transform inputs into new functions or products at a low cost. last, niche creation consists of the ability to provide new competences, aggregating innovation to the ecosystem (iansiti and levien, 2002). the indicators proposed by iansiti and levien provide the basis for the majority of studies conducted for assessing the health of software ecosystems (da silva amorim et al., 2017). however, no theories support the health concepts applied so far. our research aims to understand the impacts of architectural practices on ecosystem health. we argue that "the way" software architecture is designed and maintained influences the operation of related software components. consequently, architectural practices may interfere with the performance and longevity of the ecosystem, directly affecting the ecosystem health. in particular, we aim to improve our knowledge on the influence that architectural practices used during software architecture design and evolution have on the ecosystem health. in previous work, we coined the term "architectural health of software ecosystem" to represent the weight of architectural practices on ecosystem health (amorim et al., 2017).
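the three indicators above are conceptual, and the studies cited here operationalize them through repository metrics in different ways. as a purely illustrative sketch (not a formula from iansiti and levien (2002) or from any model discussed in this paper), the python code below computes naive proxies from a local git checkout; all proxy choices are assumptions made for illustration only.

```python
# purely illustrative proxies for robustness, productivity, and niche
# creation, computed from a local git repository. none of the cited
# health models defines these exact formulas.
import subprocess
from collections import Counter
from statistics import mean

def _log(repo, fmt, *extra):
    """return one git-log field per commit for the given repository path."""
    out = subprocess.run(
        ["git", "-C", repo, "log", f"--pretty=format:{fmt}", *extra],
        capture_output=True, text=True, check=True)
    return out.stdout.splitlines()

def health_proxies(repo):
    months = _log(repo, "%ad", "--date=format:%Y-%m")  # commit month per commit
    authors = _log(repo, "%ae")                        # author email per commit
    commits_per_month = Counter(months)
    commits_per_author = Counter(authors)
    return {
        # output per unit of time as a crude productivity proxy
        "productivity": mean(commits_per_month.values()),
        # low dependence on a single contributor as a crude robustness proxy
        "robustness": 1.0 - max(commits_per_author.values()) / len(authors),
        # number of distinct contributors as a crude niche-creation proxy
        "niche_creation": len(commits_per_author),
    }

print(health_proxies("."))  # run inside any git checkout
```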
(2021) conducted a mix of methods, including a survey, repository mining, and document analysis across 18 ecosystems, to discover values and practices around breaking changes and to understand how the planning, management, and coordination of these breaking changes happen inside open source ecosystems. besides, jansen (2020) proposed a software ecosystem governance maturity model (seg-m2), describing governance practices for ecosystem coordinators. he collected the practices for the maturity model from literature studies, also applying forward and backward snowballing. then, he tested the practices in six case studies at four companies to validate and improve his model. all of these authors applied different methodologies to discover different practices with distinct goals. two authors used repository mining to collect open source practices and understand how diverse processes occur in the ecosystem. the third collected practices focused on general ecosystem governance to create a maturity model. our work used another methodology to gather specific practices related to the creation and maintenance of software architecture. by using the netnography-based approach, we extracted practices from the literal text transcriptions available on websites, also analyzing the context in which each transcription is included. sigfridsson and sheehan (2011) used qualitative methodologies to study principles and practices used by free and open source software (foss) communities. this study contributed relevant insights into the application of qualitative research, such as virtual ethnography, to open source communities. findings covered the potential problems of applying qualitative methodologies and highlighted the importance of maintaining an active relationship with the core community group. the authors also performed a case study with the pypy community to provide examples of the challenges faced by foss communities. despite the similarity to our study, our approach included only observation, without engagement in the community or the application of questionnaires. besides, the case study with pypy spanned an extensive period of time, while we used a condensed period for each ecosystem. moreover, we focused only on practices related to creating and maintaining the software architecture. however, this study provided us with insights for understanding the dynamics and development of an open source ecosystem, in particular supporting issues about the qualitative approach. concerning the health of open source ecosystems, different aspects have been raised by researchers in recent years. these aspects can influence the health of the ecosystem in different ways. dijkers et al. (2018) explored the effects of software ecosystem health on the financial performance of open source companies. they conducted a case study on two open source companies, cloudera and hortonworks, looking at the companies first individually and then comparatively. the results showed that productivity did not influence the relationship between financial performance and ecosystem health, robustness has a moderate influence on this relation, and niche creation is the main contributor to this relationship. in another study, avelino et al. (2019) investigated the abandonment and survival of open source projects.
in this study, they aimed to discover the frequency of project abandonment and survival, the differences between abandoned and surviving projects, and the motivations and difficulties faced when assuming an abandoned project. both studies pointed out important aspects that can influence the ecosystem health; however, the practices related to financial performance and project abandonment are not directly addressed. closer to the practices dimension, charleux and viseur (2019) studied managerial decisions, exploring their impacts, and the impact of community composition, on the health of open source projects. they conducted a longitudinal single case study and a qualitative study based on forty-six complementary interviews with open source community members to identify key managerial changes impacting community activity. they concluded that health depends on the business model and governance. by comparing these studies with our work, we perceived that aspects such as financial performance, abandonment of the community, and managerial decisions influence the ecosystem health, and although they are apparently not related to software architecture, we captured some practices connected with these aspects that can also influence the health indicators. last, regarding the evaluation of ecosystem health, liao et al. (2019) created an approach to measure the health of the github ecosystem. they proposed new health indicators to define the structure and resilience of the health of the github ecosystem. also, they proposed a health prediction method. this study is an example of a quantitative analysis of health and, unlike our investigation, did not consider the influence of practices on the health state. next, we found two studies measuring health through activities and/or processes (da silva amorim et al., 2017). franco-bedoya et al. (2015) introduced a model to measure the quality of open source ecosystems. based on a literature review, they collected several metrics used to measure the health of an ecosystem. afterward, they analyzed relationships among quality characteristics that can be assessed by these metrics and proposed the queso quality model, composed of quality characteristics and measures. basically, the proposed model considered health as a kind of quality attribute and extracted a set of values for all model areas. although queso is a large model, it does not address the practices and health indicators defined by iansiti and levien (2002). queso defines quality characteristics and measures, whereas our work considers, at a detailed level, how practices influence different types of health indicators. the other model was proposed by wnuk et al. (2014). they conducted an evaluation of the governance model proposed by jansen et al. (2013) and jansen and cusumano (2013). a hardware-dependent ecosystem called axis was evaluated, considering processes and practices for preserving and improving the health of software ecosystems. they analyzed governance activities to assess the degree to which some activities were performed to support the health indicators (productivity, robustness, and niche creation). governance activities in diverse areas compose this model, influencing the ecosystem health through its indicators. the model operates with a set of general practices composing these governance activities. our work is focused only on practices related to software architecture design and evolution. finally, goggins et al.
(2021) described the work performed by the linux foundation's community health analytics in open source software (chaoss) project over four years to understand how to achieve open source ecosystem health. the main strategy adopted by the group was to define metrics that provide a full understanding of open source project health over time. this group also provided tools to work with these metrics, trying to discover how healthy and sustainable an ecosystem community is. although our study has similar objectives regarding how to understand open source ecosystem health, we adopted a different strategy: we focused specifically on architectural practices and their influence on health indicators. all these studies guided us in investigating and extracting information from open source communities and the relationship between practices and ecosystem health. however, there is a lack of approaches and evidence to support these connections, especially in the software architecture setting. hence, this work is an initial effort towards understanding these relationships. 4 methodology kozinets (2009) argues that a qualitative approach is useful to explore and understand a context, and that the choice of the research method must match the nature and scope of the questions. he also claims qualitative techniques help to map new terrain in the constantly evolving internet environment. in this environment, netnography is an appropriate approach to study virtual communities whose important social interactions manifest virtually. given the nature of our research questions, aiming to understand the universe of architectural practices in open source ecosystems and their relationships with the health of these ecosystems, we chose netnography with an observational approach in an environment with ample information available. as a result, we could understand the behavior of the community, as well as its internal structure, to answer our research questions. we conducted a qualitative study on seven open source ecosystems to approach our research questions. we decided to use mixed empirical methodologies: a netnography-based approach for data collection and a grounded theory-inspired approach for data analysis (kozinets, 2009; stol et al., 2016). the netnography-based study adapts common ethnographic procedures, designed for observing the research object in the physical world, to the virtual universe mediated by computers. in conducting the netnography-based study, we followed the guidelines proposed by kozinets (2009) and adapted the steps from the general protocol of ethnography for our study: study planning, data collection and interpretation, a guarantee of ethical standards, and research representation. regarding the netnography-based approach, we performed the following steps: (i) 1st step: we defined our research questions, aiming to identify what practices were adopted by each ecosystem and their influences, and chose open source ecosystems as the focus communities of our research; (ii) 2nd step: we chose the seven ecosystems described in section 4.2; (iii) 3rd step: we joined the communities, but did not engage as members; our participation was observational, to monitor the dynamics of the work and collect data; (iv) 4th step: we performed data analysis and interpreted the results using a grounded theory-inspired approach; (v) 5th step: we reported the research results in section 5.
figure 1 shows the steps defined by kozinets (2009) and used to conduct our netnography-based study. kozinets (2009) introduced the netnography method as an observational and participatory approach, in which the researcher should be immersed for an extended time in a community or culture. the researcher should conduct several online interactions through online interviews, engagement in the community, and observation. our approach is an adaptation of the method proposed by kozinets (2009). we named the approach netnography-based because we only performed the observational part, without interacting directly with members of the communities. also, we could not do an immersion for an extended time in the seven ecosystems. concerning ethical standards, kozinets (2009) argues that a participatory netnographer should follow a short protocol, identifying themselves and asking permission to work with the community, since they are engaged in the community. in our case, since the netnography was only observational, we did not need to ask the community for permission. all information collected was published in public spaces, and access was free. in addition, we respected all copyrights, and we cited and acknowledged data from each community when published. also, we did not publish the names of members or pseudonyms that identify particular individuals. our research was conducted by characterizing data related to participants and the community, avoiding damage to community members, as well as following the guidelines defined by the netnography approach. kozinets (2009) also suggested two types of data analysis for netnography studies: analytical methods based on coding, and hermeneutic interpretation. we chose to apply the first one because it supported us in handling the whole volume of data collected during our study. the grounded theory method generates a theory that considers the context, conditions, strategies, and results from data. this method is used in qualitative research, under an inductive paradigm, to develop a theory from real-world situations (stol et al., 2016). however, our goal was not to create a theory but to provide an understanding of the influence that architectural practices exert on the health of open source ecosystems. in addition, we aimed to gather useful contributions that provide new foundations to consolidate such influences. therefore, we decided to borrow some elements from the grounded theory method, restricted to coding techniques (data analysis, constructing codes and categories, the constant comparative method, writing memos), characterizing a grounded theory-inspired approach following the guidelines for coding from charmaz (2006); saldaña (2009); stol et al. (2016). charmaz (2006) also states that a theory should emerge from data analysis, not from preconceived, deduced hypotheses. in our case, we applied grounded theory techniques to capture the connection between architectural practices and existing health indicators. the concepts of ecosystem health are preconceived, but their possible existing connections had not been investigated yet. finally, in applying the netnography study, as kozinets (2009) states, the researcher is compelled to make assumptions about cultural meanings that they do not fully understand. when the researcher is not a participant in the community, the analysis offered is purely descriptive of the content found online. in our study, some practices were already defined explicitly on the webpages, or there were data that allowed us to deduce the adoption of an existing practice.
concerning the second approach, charmaz (2006) advocates that grounded theory uses an abductive inference method, since it considers the possible theoretical explanations for the data and then forms hypotheses, searching for possible explanations until it finds the most probable one. we applied the grounded theory-inspired approach to uncover how connections between software architectures and ecosystem health happen, reasoning about experience to make theoretical conjectures jointly with their verification through additional experiences. the use of the grounded theory-inspired approach is to discover possible explanations for these connections, aiming to find the most probable explanation. through coding techniques, the approach allowed us not to generate a theory, but to clarify the relationships and influences existing between two different conceptual worlds. 4.1 research questions the goal of this study is to identify and understand possible practices used to create and maintain the software architecture of software ecosystems. by elucidating potential reasons that make architects adopt specific practices, and figuring out whether any architectural practice might influence the open source ecosystem health, we defined the following research questions: rq1. what are the factors that can influence the adoption of architectural practices in the design of open source ecosystems? rq2. how can architectural practices influence the health of the open source ecosystem? as described previously, the software architecture has a crucial role in the development and evolution of a software ecosystem. problems with the platform can cause serious damage to the ecosystem. therefore, investigating the factors that influence the practices adopted by architects, and the effects of these practices on the ecosystem health, contributes to the definition of ecosystem strategies. findings can guide decisions and concerns about the software architecture that help keep the whole ecosystem balanced. 4.2 research context our study analyzed seven open source ecosystems: gitlab, jenkins, kde, mapserver, node.js, open edx, and wordpress. in accordance with the classification proposed by manikas (manikas and hansen, 2013), gitlab, open edx, and wordpress are hybrid ecosystems that adopt the open source strategy concerning software platform development. in addition, all the ecosystems fit the definition of software ecosystems provided by bosch and bosch-sijtsema (bosch and bosch-sijtsema, 2010). they have a community with internal and external members working to create relevant solutions upon a software platform. the criteria used to choose these ecosystems were their degree of openness and the availability of documentation on their websites. this enabled us to access documents, diagrams, artifacts, and code. besides, ecosystems with diverse sizes and domains were considered to allow a large range for the generalization of practices adopted in the open source context. in addition, we observed the working environment of each community. this way, we could identify practices adopted by them and capture the dynamics in their entirety. • gitlab (https://about.gitlab.com) started as a git-based software repository manager. it was created in ukraine by dmitriy zaporozhets in 2011.
nowadays, its functions have been extended, and it supports all stages of the devops lifecycle for product, development, quality assurance, security, and operations of a software project. the code is mainly written in ruby. gitlab has a distinctive characteristic regarding its license because it comprises two software products: gitlab ce, which is open source, and gitlab ee, which is closed source. the company that owns gitlab ee manages the open source project. in our work, we consider the gitlab ce community, which follows the behavior of open source ecosystems. • jenkins (https://jenkins.io) is an automation server that helps to automate the non-human part of the software development process. it provides automation for several tasks related to building, testing, and delivering or deploying software, implementing continuous integration and contributing to implementing continuous delivery. the initial release was launched in february 2011 and came from a project originally named hudson. after a dispute with oracle, which had forked the hudson project, jenkins was the name chosen for the project through an election in the community. the code is mainly written in java. nowadays, jenkins provides various infrastructure tasks and an extensive library of over 1,300 plugins. • kde (https://kde.org) provides a platform to easily build new applications upon, not to mention an advanced graphical desktop and a set of applications for communication, work, education, and entertainment. kde was founded by matthias ettrich in 1996. it has a strong community spread across several countries, sharing experiences and contributing to strengthening one of the largest active open source ecosystem communities. the code is written mostly in c++ in a mature codebase. the kde frameworks platform allows building all kinds of applications on top of it. presently, kde has the ambition of providing a reliable, monopoly-free computing solution. • mapserver (https://mapserver.org) is a platform for publishing spatial data and interactive mapping applications to the web. it was created at the university of minnesota in 1994. written in c, mapserver renders geographic data and serves map images over the internet, providing a spatial context for these maps. this ecosystem is supported by several organizations that manage funds for the adoption of open geospatial technology. • node.js (https://nodejs.org) provides a javascript run-time environment to execute javascript code outside of a browser. it allows building dynamic web page content before the page is sent to the final browser. applying an event-driven, non-blocking i/o model, node.js is lightweight for applications running across distributed devices. the initial release was in 2009, and it is written in c++, javascript, and assembly. • open edx (https://open.edx.org) is an open source platform for massively scalable learning. as a provider of massive open online courses (mooc), the platform produces weekly learning sequences of courses. these courses include tutorial videos, online discussion forums, and textbooks. open edx was created by the massachusetts institute of technology and harvard university in may 2012, and it is mostly written in python. • wordpress (http://wordpress.org) is an online publishing platform that supports users, even those without a technical background, in quickly creating blogs, apps, or websites. they keep the site free, but also offer some paid plans with tools to improve the user experience. wordpress is built on php and mysql. it was created in 2003 by mike little and matt mullenweg. table 1 presents more information about the studied open source ecosystems in accordance with open hub (https://www.openhub.net/).
table 1. open source ecosystems studied
name | foundation | age | klocs | contributors
gitlab | gitlab inc. | 8 years | 752,409 | 2,530
jenkins | continuous delivery foundation | 8 years | 1,031,859 | 2,132
kde | kde foundation | 22 years | 57,899,444 | 5,539
mapserver | osgeo foundation | 25 years | 413,852 | 147
node.js | node.js foundation | 9 years | 6,340,475 | 2,871
open edx | edx inc. | 7 years | 1,114,609 | 727
wordpress | wordpress foundation | 16 years | 560,703 | 541
figure 1. steps to perform the netnography methodology (kozinets, 2009)
4.3 data collection in conducting data collection from the open source ecosystems, we extracted data from different online sources. the set of communication tools found in open source ecosystems, such as forums, newsgroups, blogs, github pages, and wikis, as well as some external websites connected to the ecosystem, such as stack overflow, reddit, bugzilla, and others, provides relevant information for researchers outside the community context. it is possible to observe behaviors, rules, and values that guide the community's steps. the communication leaves traces that are easily observable, recorded, and copied, so information can be widely captured and recorded. in this environment, the processes of accessing and analyzing data are facilitated (kozinets, 2009). although open source ecosystems make available several dynamic communication channels, such as internet relay chats (ircs), we kept our focus on extracting data from widely published information about work in the community, on webpages reached through links provided by the ecosystem home pages. we did not consider conversations in chat format, analyzing only official practices released by the community board. for each ecosystem, we started the research from the initial page (home page) and navigated through several weblinks. as the practices were identified, we classified them into a set of seven software architecture key areas introduced by our previous work (amorim et al., 2017). each practice should fit into one of these areas. they represent the architectural design decisions through a division into logical areas. architects should focus on these key areas to exercise their activities: architectural knowledge, external management, choice of technology, resources management, design-making, quality management, and change management. each key area is classified into one of the three software ecosystem views: community, technical, or business. the key areas are classified according to the objective of the ecosystem views. in this way, the practices of each area fit, by affinity, the objective of the ecosystem view in which the area is allocated. for example, the key area named architectural knowledge, which encompasses practices of knowledge management in the community, is allocated to the community view, and so on. table 2 presents the mapping of the key areas by ecosystem views.
table 2. mapping of key areas by views (amorim et al., 2017)
view | key areas
community | architectural knowledge, external management
business | choice of technology, resources management
technical | design-making, quality management, change management
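in effect, this classification is a fixed lookup from key area to ecosystem view, so that every catalogued practice inherits a view through its key area. a minimal sketch of the mapping in table 2 as code; the function and constant names are hypothetical and only illustrate the lookup, they are not an artifact of the study:

# hypothetical encoding of table 2: each key area belongs to exactly one view
KEY_AREA_TO_VIEW = {
    "architectural knowledge": "community",
    "external management": "community",
    "choice of technology": "business",
    "resources management": "business",
    "design-making": "technical",
    "quality management": "technical",
    "change management": "technical",
}

def view_of(key_area: str) -> str:
    """return the ecosystem view in which a key area is allocated."""
    return KEY_AREA_TO_VIEW[key_area.lower()]

print(view_of("Architectural Knowledge"))  # -> community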
• architectural knowledge. practices related to this key area support tasks to manage and share architectural knowledge with the community. • external management. external management encompasses practices that keep the diversity of contributions of the community and support the management of architectural interfaces used by third-party developers. • choice of technology. this key area is related to the choice of the best technologies to build the systems. • resources management. this key area comprises practices used to support the architectural process. • design-making. this key area encompasses practices related to taking technical decisions. • quality management. this key area is responsible for controlling all practices that satisfy previously defined quality criteria. • change management. change management controls all practices used to regulate and keep the balance of the architecture in the face of all implemented changes. based on these seven areas, we defined fourteen search topics to facilitate the search for practices on the websites. these search topics aimed to cover common topics of software development that could be included within a key area. in addition, a search topic can cover more than one key area. for example, business organization could have practices in several key areas. table 3 shows these search topics.
table 3. search topics of the netnography-based study
id | search topic
1 | business organization
2 | communication
3 | coding
4 | documentation
5 | financial
6 | meetings
7 | quality
8 | release launching
9 | resources
10 | security
11 | taking decisions
12 | training
13 | translation
14 | tests
when we identified a candidate practice that could fit into a search topic, we registered information such as the ecosystem foundation, date, url, description, type of website, and the topic of activity of this practice. these records concentrate data about the practice found on the websites. the description is exactly the clipping of the text as found on the website. for instance, regarding the mapserver ecosystem, we collected the description "psc management responsibilities: setting the overall development road map and project infrastructure (e.g. github, cvs/svn, trac/bugzilla, hosting options, etc…)". data extraction was focused on the documentation and guidelines on each web page. we also analyzed information about activities to create and manage the software architecture found in code repositories. however, we did not conduct a code analysis itself to identify practices. navigation always started on the initial web page of the ecosystem. for each search topic, we searched for pages with information related to the current search topic, guiding the navigation on the ecosystem website. we read the entire page in search of information related to the current search topic. on finding related information, we analyzed it to see whether any practice could be extracted. then, we went through the links on the page to search for more information related to the current search topic. this search process through the pages continued until a page no longer contained related information and no further links related to the search topic were found on the page. in the majority of cases, we arrived at pages where there were no more links and no way to go to another page. besides, in a few cases, we ran into cycles, clicking links that sent us back to the initial page.
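the navigation procedure just described behaves like a topic-guided crawl with cycle protection. a minimal sketch under simplifying assumptions: page_links() and page_matches_topic() are hypothetical helpers, and in the study this traversal was performed manually by a researcher rather than by a program:

from collections import deque

def collect_candidates(home_url, topic, page_links, page_matches_topic):
    """follow links from the ecosystem home page while pages stay related
    to the current search topic; a visited set guards against the cycles
    that occasionally led back to already-seen pages."""
    visited, candidates = set(), []
    queue = deque([home_url])
    while queue:
        url = queue.popleft()
        if url in visited:
            continue  # cycle: this page was already examined
        visited.add(url)
        if url != home_url and not page_matches_topic(url, topic):
            continue  # unrelated page: stop expanding its links
        if url != home_url:
            candidates.append((topic, url))  # record a candidate source
        queue.extend(page_links(url))
    return candidates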
during each search process, we focused our data collection only on information related to the current search topic, in order to capture architectural practices related to that topic. figure 2 presents the steps performed in the data collection process.
figure 2. steps to collect data
we did not record the different levels that we went through in the search for practices, since each ecosystem has a different infrastructure for building and organizing the pages on its website. besides, the level of navigation between pages varied according to the current search topic within the same ecosystem. regarding the registration process, kozinets (2009) argues that smaller or more limited investigations of online communities and cultures may employ manual coding, categorization, and classification, as well as hermeneutic interpretive analysis, to gain insights. besides, a semiautomatic (manual and computer-assisted) method can be used to organize different levels of coding and abstraction using a spreadsheet tool. our research was restricted to identifying and collecting architectural practices and their context. although we had collected a reasonable amount of data, we were able to manage the data using this semiautomatic method. therefore, we used a general spreadsheet tool to support the data collection and data analysis processes, instead of using sophisticated software packages. at the end of this process, a set of architectural practices was identified. the selection process aimed at identifying all practices related to software architecture in some way, based on the researcher's background. moreover, we included practices that are not directly related to the technical view, but which are also relevant for the business and community views. for example, the practice "(p33) define a financial board to manage the financial resources" influences the software architecture when the financial board team provides resources to support the architecture, such as hardware, software, and developers, and it also defines market guidelines that need to be met at any given time. it is closely connected to the business area, but also ensures the provision of technical operation. 4.4 data analysis from the netnography-based study, we started the process of discovering the architectural practices adopted in the ecosystems. so, the first step of the data analysis was to define a common text for practices found in the seven open source ecosystems. based on the description of each practice, we defined a practice specific and a practice catalogued. the practice specific describes the practice at a more abstract level, and the practice catalogued describes the practice in a formal way suitable for all the observed ecosystems. for example, regarding the mapserver ecosystem, we collected the description "psc management responsibilities: setting the overall development road map and project infrastructure (e.g. github, cvs/svn, trac/bugzilla, hosting options, etc…)". so, the practice specific was defined as "psc committee is responsible to define technology", and the practice catalogued was "the project leaders define specific technology that impacts into the work of project community". by analyzing each practice specific, we checked whether it fit into any existing practice catalogued that could be attributed to it; otherwise, this practice originated a new practice catalogued. this process was conducted for all practices found in all ecosystems.
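as a rough sketch, this cataloguing step can be viewed as a matching loop over practice specifics. here, fits() and generalize() are hypothetical stand-ins for the researcher's manual judgment of whether a specific matches an existing catalogued practice and for formulating a new catalogued text:

def catalogue(practices_specific, fits, generalize):
    """group each practice specific under an existing practice catalogued
    when one fits; otherwise the specific originates a new catalogued entry."""
    catalogued = {}  # catalogued text -> list of practice specifics
    for specific in practices_specific:
        for entry, members in catalogued.items():
            if fits(specific, entry):
                members.append(specific)  # attribute to an existing entry
                break
        else:
            # no existing entry fits: formulate a new practice catalogued
            catalogued[generalize(specific)] = [specific]
    return catalogued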
as a result, we catalogued fifty architectural practices. following the data analysis, we started to analyze all practices and the information about their context to answer the research questions. according to saldaña (2009), coding is the initial step before a powerful analysis and interpretation for reporting. he states that the quality of the codes used is essential to gather important information from the research history. in addition, qualitative codes allow the establishment of patterns and facilitate the development of categories from their connections. coding is a way to organize things systematically to infer some classification (saldaña, 2009). moreover, researchers should be critical about data, asking questions to identify relevant practices (charmaz, 2006). in this context, we adapted questions from charmaz (2006) to evaluate the practices: (i) what are they doing?; (ii) what are they saying?; (iii) who is doing it?; (iv) why are they doing it?; and (v) when are they doing it? during data collection, we directed our search to answer these questions and registered the information for each practice identified in the process. for example, the practice (p34) in the wordpress ecosystem had the following answers: (i) what are they doing? "provide financial resources to support meetings face-to-face"; (ii) what are they saying? "companies that sponsor wordpress community events support the wordpress open source project by helping our volunteer-organized, local events provide free or low-cost access for attendees"; (iii) who is doing it? "global community sponsors"; (iv) why are they doing it? "they believe that a casual, non-commercial, and educational event permits discussing wordpress issues face-to-face easily and strengthens the community"; and (v) when do they do it? "during the wordcamp, an annual conference for local wordpress communities". a similar practice was also observed in the other six ecosystems (see appendix a), and the answers to these questions were recorded for each ecosystem. by conducting the grounded theory-inspired data analysis of these answers jointly with the other collected data, we performed the following steps for coding: (a) identifying and labeling data using a code reflecting its meaning; (b) performing a constant comparison, searching for patterns that point to concepts; (c) grouping similar concepts in a high-level abstraction called categories, a process that continues until no new conceptual relationships emerge for the categories; (d) writing memos to describe ideas and relationships among codes, concepts, and categories; (e) developing an understanding of the studied phenomenon, in our case, the factors that motivate the adoption of the practices and the influence of the practices on the health of open source ecosystems. the steps described previously present some key components of grounded theory (charmaz, 2006; coleman and o'connor, 2007): • codes. they are words used to provide meaning to the data. they summarize and reflect the experience described. • concepts. they are ideas derived from a set of codes and organized in a high-level abstraction. • categories. they are a classification that explains ideas or processes gathered from data, expressing common patterns in various codes. • memos. they are notes describing thoughts, capturing comparisons and connections from data. they also delineate directions and issues that should be considered. figure 3 presents our research analysis process for the grounded theory-inspired approach.
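steps (a) to (c) amount to labeling data fragments and grouping them by constant comparison. a minimal sketch, where label() and same_concept() are hypothetical stand-ins for the researcher's qualitative judgment (memo writing, step (d), is omitted):

def code_and_categorize(fragments, label, same_concept):
    """group coded fragments into emergent categories by constant comparison."""
    codes = [label(f) for f in fragments]   # (a) identify and label the data
    categories = []                         # emergent groups of similar codes
    for code in codes:
        for group in categories:            # (b) constant comparison
            if same_concept(code, group[0]):
                group.append(code)
                break
        else:
            categories.append([code])       # (c) a new category emerges
    return categories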
data collection and analysis processes were conducted by one researcher. other researchers subsequently analyzed the results found and the written article, requesting clarifications and reviews when necessary. they had complete access to the entire database and information about the process. the needed adjustments were made in the review process. as seaman (1999) pointed out, "any proposition that the researcher synthesizes must be clearly and strongly supported by the data". this way, we constructed a set of propositions about factors that influence the adoption of architectural practices, based on evidence collected in the studied ecosystems. all synthesized factors were inferred using abductive logic, and categorized and labeled with their properties (stol et al., 2016). regarding the influence of architectural practices on the health indicators, the process required an additional literature review to support the findings. previously, jansen (2020) stated that the relationships between practices and health indicators, as well as the effects of one on the other, are unknown. currently, determining these relationships and effects still constitutes a considerable scientific challenge. hence, looking for a solution, we observed that stol et al. (2016) noted that the literature can support the grounded theory process, providing concepts and improving theoretical sensitivity as additional data sources. in addition, charmaz (2006) also suggests using a literature review to support the analysis work. so, we consulted the concepts, metrics, and arguments of the health indicators introduced by iansiti and levien (2002) to make explicit and rational connections between the concepts we inferred from data and the health indicators presented in this earlier study, and we developed insights to answer our research questions, allowing us to make claims from our grounded theory-inspired approach.
figure 3. research analysis process
as a result, from the codes, categories, and concepts that emerged during data analysis, and supported by concepts from the literature, we derived ideas and constructed relationships between architectural practices and health indicators, which enhanced the understanding of the reasons for adoption and the influences on the health indicators. in the same way, we used the literature, with concepts introduced by bass et al. (2012), to support the analysis of codes, categories, and concepts to find the factors that influence the adoption of architectural practices. table 4 presents the codes that emerged from the study. the codebook is available at http://doi.org/10.6084/m9.figshare.23657856. table 5 presents the categories that emerged from the study.
table 4. codes emerged from the study
documentation, reuse, changes, sharing knowledge, novelties, compatibility, design decisions, efficacy, obsoleteness, problems, patterns, marketing, security, knowledge, meeting, automated, translate, money
table 5. categories emerged from the study
budgeting, design-making, innovation, knowledge management, quality, standardization
figure 4 presents a short example of how the data analysis in the coding process was conducted. for instance, the practice specific "provide an official channel to publish all changes to the community" was observed in the kde and mapserver ecosystems. the practice catalogued that fits all ecosystems was defined as "(p43) publish widely the architectural changes for the community".
in addition, we also collected information about the context of these practices, such as "core changes in mapserver can affect existing applications" and "kde provides the kde.news as the official news channel". furthermore, from the data analysis process emerged the following codes: documentation, sharing knowledge, and design decisions. besides, the category knowledge management also emerged, jointly with the concept "everybody must be aware of changes in the architecture that will impact their work". based on this concept, our analysis reasoned about which factors can influence the adoption of the practice. for the factors, the reasoning was "the knowledge of the architects should be shared with the community to guide the work of developers", represented by the factor experience, and "everybody must know about changes to adapt their applications to the new scenarios and avoid breaks of operation", represented by business goals. for the influences on the health indicators, the reasoning was "communications of critical changes improve the interactions among the organization and third parties that have applications influenced by these changes", represented by trustworthiness, and "communicating changes in the architecture to everyone avoids breaks of applications due to lack of information", represented by robustness. table 6 presents the graphic symbols used in the graphic schemes of the examples in this study. these symbols were defined by the authors to improve the graphic representation.
table 6. semantics of shapes applied on the graphic scheme: the legend defines distinct shapes for concepts, categories, codes, practices, health indicators, influence factors, texts (reasoning), connections from data, and connections from inferences
5 findings this section presents the findings of our study, including the architectural practices identified (section 5.1), organized according to the factors that influence their adoption (section 5.2), and the rationale behind their influence on ecosystem health and its indicators (section 5.3).
figure 4. emergence of the influences from underlying concepts
5.1 architectural practices fifty architectural practices used by the studied open source ecosystems were identified during our study. table 7 presents all practices captured in our study by the netnography-based approach. table 8 presents a categorization of all architectural practices with respect to key area and ecosystem view. in addition, table 9 introduces only a small subset of the architectural practices (p1-p7) identified for each ecosystem, where a check means the practice was found in that ecosystem's environment by the researchers. for instance, the practice (p7) provide a newcomer-specific page or portal guiding their first steps, including development information, has been adopted by the seven ecosystems. p7, along with six other practices (p10, p13, p18, p27, p35, p39), was adopted by all ecosystems. five practices were used by one ecosystem only (p46, p47, p48, p49, and p50). appendix a shows the complete set of practices and the ecosystems where they were found.
table 7. architectural practices found in the netnography-based approach
p1 | create personal blogs and/or wikis to inform about the development and architectural issues
p2 | during code review, provide feedback information about architecture and good coding practices, do refactoring, and show the best way to solve problems
p3 | document apis constantly
p4 | provide internal or third-party mentoring programs to train newcomers
p5 | provide code recommendations, defining a standard in the community
p6 | keep a register of meetings available to the community to know all decisions of the meeting
p7 | provide a newcomer-specific page or portal guiding their first steps, including development information
p8 | identify and dismiss outdated information on websites
p9 | provide generation of (semi-)automated documentation filtered to up-to-date information relevant to newcomers
p10 | answer questions on the mailing list quickly
p11 | create a detailed step-by-step tutorial linking information about common problems and possible solutions
p12 | provide updated official documentation about the code's organizational structure, and how the components, modules, classes, and packages are related to each other
p13 | support the participation of the ecosystem on aggregators' sites such as stack overflow, reddit, hacker news, and so on
p14 | provide a dictionary to newcomers to facilitate their learning of the technical jargon and acronyms of the community
p15 | provide video classes or tutorials about introducing the ecosystem, installing technologies, configuring the development environment, and using the dependencies with other ecosystems
p16 | provide a manifesto explaining the requirements of an application belonging to the ecosystem
p17 | provide video-conference sessions with questions and answers about relevant topics
p18 | provide information about the translation process (how to participate and tools used)
p19 | provide documentation about translation rules that developers must follow to prepare code for other languages
p20 | keep backward compatibility for a medium or long time to allow the community to update their software
p21 | provide different levels of security access for the parts of the ecosystem in accordance with the degree of commitment and tasks in the ecosystem
p22 | point newcomers to easy tasks filtered by difficulty, skills needed, and topics
p23 | use tools to publish known cyber security vulnerabilities, for example the common vulnerabilities and exposures (cve)
p24 | keep a team to manage the registered security problems
p25 | prefer the use of written english to avoid misunderstanding
p26 | do online meetings in a timezone adequate for most of the community
p27 | provide several online meetings to discuss architectural problems by irc or email
p28 | create partnerships with third parties to solve problems of the core and their interfaces
p29 | set a code of conduct to avoid mistreatment among members
p30 | provide a message template for newcomers to use to interact with the community
p31 | the organization board defines some technologies that should be used by the whole community as tools for testing, communication, code review, bug management, and navigation
p32 | the organization board provides hardware and software resources to be used by the community
p33 | define a financial board to manage the financial resources
p34 | provide financial resources to support face-to-face meetings
p35 | provide face-to-face meetings (sprints) to accelerate the development of critical issues and solve development problems with interdependent modules
p36 | define minimal quality criteria requirements to add an application to the ecosystem (documentation, automatic tests, dependence restrictions)
p37 | define a team to test the performance and behavior of the application
p38 | use some tools to compute some quality metrics
p39 | use automatic tests to gather problems with recently added code
p40 | provide an automatic process to launch releases of applications
p41 | divide the parts of the software in layers, defining restrictions for managing dependencies among the layers
p42 | discuss with the community the critical changes to the architecture that will impact the applications
p43 | publish widely the architectural changes for the community
p44 | build the architecture based on plug-ins to facilitate the coupling of applications
p45 | provide guidelines with a set of steps to be followed by developers on how to add code to the repository
p46 | provide a virtual machine with pre-configured build environments, web-based ides, or a container management tool
p47 | inform newcomers about the technical background required; identify which specific technologies they need to know or should learn to achieve their goal of contributing to the ecosystem
p48 | tag all tasks in accordance with the degree of difficulty (easy, medium, difficult)
p49 | provide a group for gardening to care for the global state of the ecosystem
p50 | keep the list of tasks updated, informing who is working on the solution
table 8. categorization of the practices by views and key areas
view | key areas | practices
community | architectural knowledge | p1, p2, p3, p4, p5, p6, p7, p8, p9, p10, p11, p12, p13, p14, p15, p16, p17, p18, p19, p45
community | external management | p20, p21, p22, p23, p24, p25, p26, p27, p28, p29, p30, p46, p47, p48, p50
business | choice of technology | p31
business | resources management | p32, p33, p34
technical | design-making | p35
technical | quality management | p36, p37, p38, p39, p40, p49
technical | change management | p41, p42, p43, p44
table 9. a subset of architectural practices, with the number of studied ecosystems (gitlab, jenkins, kde, mapserver, node.js, open edx, and wordpress) in which each was observed
p1 | create personal blogs and/or wikis to inform about the development and architectural issues | 6
p2 | during code review, provide feedback information about architecture and good coding practices, do refactoring, and show the best way to solve problems | 6
p3 | document apis constantly | 3
p4 | provide internal or third-party mentoring programs to train newcomers | 5
p5 | provide code recommendations, defining a standard in the community | 5
p6 | keep a register of meetings available to the community to know all decisions of the meeting | 6
p7 | provide a newcomer-specific page or portal guiding their first steps, including development information | 7
5.2 factors that influence the adoption of architectural practices during architecture design, there are many influences guiding/forcing the software architecture in some direction (bass et al., 2012). these influences are diverse and depend on the environment in which the architecture will operate.
understanding these factors of adoption for a practice allows us to create governance mechanisms to define whether a practice should be adopted, influencing the final result on the health of the ecosystem. some influence factors are related to requirements, the technical environment, and the experience of the architects. for bass et al. (2012), software architecture is constrained by a large variety of sources. some influences can be implicit and others explicit; however, it is difficult to capture all properties required by the architecture. finally, there are gaps that can cause conflicts among the goals of the software architecture (bass et al., 2012). bringing this reality to the ecosystem scenario, architects should also consider demands from third parties and be aware that external business goals also suffer the impact of changes in the platform. in addition, they must take application response time into account when making necessary corrections by releasing software versions. as the costs of innovation and development are shared with the community, business strategies and development resources should be considered together. in our study, we found five relevant factors that we understand influence our set of fifty practices: business goals, experience, requirements, resources, and time-to-market. table 10 shows each factor with the practices that are influenced by it.
table 10. influence factors with practices
business goals | p4, p7, p13, p14, p16, p17, p18, p20, p25, p28, p29, p36, p38, p43, p44, p46, p47, p49
experience | p1, p2, p4, p5, p9, p11, p12, p15, p17, p19, p20, p22, p27, p31, p35, p37, p41, p42, p43, p45, p50
requirements | p3, p16, p19, p20, p27, p36, p37, p49
resources | p2, p4, p6, p7, p13, p15, p17, p21, p23, p24, p25, p28, p31, p32, p33, p34, p35, p38
time-to-market | p1, p3, p5, p6, p7, p8, p9, p10, p11, p14, p20, p22, p24, p26, p27, p30, p31, p37, p38, p39, p40, p46, p48, p50
5.2.1 business goals business goals refer to the strategic goals that the software ecosystem aims to accomplish. these goals determine positions to be achieved and drive specific tasks and deadlines leading to these positions (bass et al., 2012). business goals are very important factors guiding all activities in the ecosystem. effective goals directly impact the success or failure of the whole ecosystem. in order to build a successful architecture, architects should understand their competitors, software products, and strategies. moreover, they should know the key factors in the business environment that affect the progress of the organization (bredemeyer et al., 2000). in the ecosystem, business goals express the desires of internal and third-party developers. architects should balance the goals of third parties that behave as collaborators and, at the same time, as competitors among themselves. there are different concerns that should be aligned to promote the success of the ecosystem. our research identified eighteen practices that may have been directly influenced by business goals. for instance: • a business goal such as "provide the market needs and keep the fidelity of customers" motivates the practice (p36) define minimal quality criteria requirements to add an application to the ecosystem (documentation, automated tests, dependence restrictions). defining quality rules for applications will provide a good quality service to support such business goals.
• a business goal such as "attract and retain developers and customers in the ecosystem" induces the practice (p44) build the architecture based on plug-ins to facilitate the coupling of applications. it contributes to protecting the core from unwanted changes, avoiding damage to the code, and at the same time easily allowing the integration of several external applications. 5.2.2 experience experience describes the effect of the knowledge acquired by the software architects through their practice. design decisions are guided by the background acquired through experiences of success and/or failure (bass et al., 2012). architects are prepared to take decisions considering generic knowledge such as patterns and guidelines. however, they also consider past architectural decisions made in the architectural design of other systems. over time, they create best practices based on specific projects. so the challenge is to adapt and use these past decisions, together with generic knowledge, for new projects (weinreich and groher, 2016). bringing these challenges to the ecosystem setting, in a plural environment, many architectural decisions should be taken in conjunction with the community.
we observed that the architectural knowledge comprises the experience of various architects. architectural decisions should also consider the impact on third-party applications and share the knowledge with the community, ensuring ecosystem surveillance. we identified that the experience of architects directly influenced twenty-one practices. for instance: • the architectural practice (p2) during code review, provide feedback information about architecture and good coding practices, do refactoring, and show the best way to solve problems shares the experience of the architects during code reviews. this practice is very common and important to prepare newcomers and guarantee architectural knowledge sharing. furthermore, architectural knowledge is important to support and define tools and environmental issues. • the practice (p31) the organization board defines some technologies that should be used by the whole community as tools for testing, communication, code review, bug management, and navigation is also related to experience. the experience of architects should support the decision about the best tools to work with in the context of the software ecosystem. 5.2.3 requirements requirements are the basis for building the software architecture. in open source ecosystems, requirements are discussed with the community; virtual or physical meetings are used to decide what is important to be launched in the next version. besides, sharing the cost of innovation, external members can develop new features apart from the community. these features can be added to the platform in the future. the software architecture must address requirements, and architectural practices are adopted to allow the fulfillment of the requirements. in our research, eight practices related to this factor were found. so, the adoption of these practices is encouraged to contribute to the achievement of the goals defined by the requirements. for instance: • (p3) document apis constantly is used to reduce the impact of changes on third parties. this is because, to meet the requirements, some changes to the platform may be necessary, including to the api. the api should be documented constantly, allowing the updating of applications on top of the platform. • the practice (p16) provide a manifesto explaining the requirements of an application belonging to the ecosystem defines criteria for applications to be engaged in the ecosystem platform. as a result, all projects of the ecosystem already start with a set of predefined requirements. • (p49) provide a group for gardening to care for the global state of the ecosystem is a gardening practice adopted by the kde ecosystem to provide a team to care for important bugs, find stale review boards, and ping people to review them. the idea is to care about the general state and keep the projects alive. this way, gardening activities ensure that many requirements with problems or stopped activity can be accomplished. 5.2.4 resources resources describe the elements used to accomplish tasks. these resources encompass software, hardware, people, environment, money, and so on (bass et al., 2012). the architect also determines some dimensions of the software environment, including the infrastructure. choosing the right tools helps to quickly achieve architectural goals. this choice is based on a set of aspects such as the understanding of the domain, the business environment, costs, integration with other systems, and so on (jade, 2019). the environment of ecosystems provides a rich set of features regarding the diversity of members, and at the same time, some resources may be scarce. for example: • (p32) the organization board provides hardware and software resources to be used by the community means that the infrastructure is provided by the organization. however, some resources cannot be provided. due to the large range of possible configurations for a system, some configurations are difficult or expensive to offer and test. • the practice (p28) create partnerships with third parties to solve problems of the core and their interfaces allows partners to provide the support needed to work with some scarce resources. this way, the community continues developing innovation and aggregating value to different features. • another important practice is (p34) provide financial resources to support face-to-face meetings. face-to-face meetings promote knowledge sharing and strengthen social interactions. in many cases, these events can help to solve important issues that are hampering or harming the ecosystem. therefore, the money of the community should be applied to the benefit of the community itself, increasing members' interaction. 5.2.5 time-to-market time-to-market refers to the continuous period in which the team should build and deliver the software. this time is expressed through deadlines to conclude tasks. time pressure in development and delivery can cause technical debt. a lack of rigor, insufficient tests, and no time for proper design or careful reflection can rapidly accumulate massive amounts of debt (philippe kruchten and ozkaya, 2012). on the other hand, market forces demand cost reductions and shorter release cycles, depending on the business segment. so, reducing time-to-market can bring competitive advantages. the software architecture can contribute to reducing the time-to-market by establishing strategies such as the use of existing assets and common architectural frameworks. also, it can optimize the integration and the mechanisms of generation of the architecture (garlan and perry, 1995).
for open source ecosystems, time-to-market depends on the business scenario and community goals. the needs of the internal and external community are important variables to be considered. platform and third-party objectives must converge to allow the growth of the ecosystem. this way, the time-to-market should be defined in sync with stakeholders' demands. we found several practices contributing to reducing time-to-market:
• the practice (p7) provide a newcomer-specific page or portal guiding their first steps, including development information contributes to reducing the time-to-market because this page informs the community rules, decision-making processes, tutorials, and documentation of the code and software architecture. this facilitates a faster engagement of a member in the development of the software, so they can evolve and contribute more quickly to features that will be released in the next versions. the faster newcomers know the architecture, the faster they can make use of existing assets and common architectural frameworks, contributing to reducing the time-to-market.
• (p20) keep the backward compatibility for a medium or long time to allow the community to update their software provides time for developers to get used to a new environment and change their applications. at the same time, it allows launching releases and following market needs.
5.3 health indicators
the idea of ecosystem health and its measurement, as introduced by iansiti and levien (2002), relies on three health indicators: robustness, productivity, and niche creation. analyzing fifty practices, our study found concepts demonstrating the influence of architectural practices on the health of the seven open source ecosystems. in addition, our study also identified other types of influences not covered by the existing three health indicators. the literature in the ecosystem area has suggested the use of new indicators; some studies argue that not all aspects of health are covered by the existing ones. campbell and ahmed (2011) use the indicators from iansiti and levien (2002), but they also suggest creating one more indicator, internal characteristic, to fully assess health. hyrynsalmi et al. (2018) defend the improvement of health indicators, suggesting the use of indicators for specific domains of the ecosystem. on top of that, some practices, together with information about their contexts, suggest actions to keep the "normal functioning" of the ecosystem. for instance, "(p28) create partnerships with third parties to solve problems of the core and their interfaces" signals that the ecosystem should create partnerships with other organizations to solve problems in the core platform, create new features, and share the cost with third parties. this is one of the basic goals of an ecosystem. also, "(p23) use tools to publish known cybersecurity vulnerabilities, for example, the common vulnerabilities and exposures (cve®)" shows that ecosystem members register vulnerabilities and their possible solutions to ensure the community is aware of security issues, contributing to protecting the ecosystem from doing anything unexpected. the compilation of these data allowed us to suggest another health indicator to represent the healthiness of the ecosystems, filling this gap. so, we propose the trustworthiness indicator, which accounts for the normal operation of the ecosystem.
table 11 shows each indicator and the practices that influence it. in the following, we describe the influence of the architectural practices on each health indicator.
table 11. health indicators with practices
health indicator | practices
robustness | p2, p3, p5, p8, p9, p10, p18, p19, p20, p23, p24, p28, p33, p35, p36, p37, p38, p42, p43, p44, p49
productivity | p1, p2, p3, p5, p6, p7, p8, p9, p10, p11, p12, p14, p15, p17, p19, p22, p25, p26, p27, p28, p30, p34, p35, p37, p38, p39, p40, p41, p50
niche creation | p1, p2, p3, p4, p14, p16, p18, p25, p28, p35
trustworthiness | p1, p2, p3, p4, p5, p6, p7, p9, p10, p11, p12, p13, p14, p16, p17, p18, p19, p20, p21, p22, p23, p24, p25, p26, p27, p28, p29, p31, p32, p33, p38, p42, p43, p44, p45, p46, p47, p48, p50
5.3.1 robustness
in the context of software ecosystems, robustness refers to the capacity of the ecosystem to survive crises and disruptions. in a robust ecosystem, the connections among members and technologies persist in the face of a collapse. the ecosystem has the capacity to adapt to new situations without harming its core. in addition, most of its active resources can be used in this new phase (iansiti and levien, 2002). we observed twenty-one practices influencing this indicator. by adopting these practices, the ecosystem prevents or mitigates the effects of problems, reinforcing its robustness. for example, the practice "(p44) build the architecture based on plug-ins to facilitate the coupling of applications" contributes to isolating the core from external applications. this way, the core tends to suffer less from the impact of changes in the applications built upon it. the core platform is more protected from breaking and can remain unaffected in a crisis. figure 5 illustrates an example of how the influence on robustness emerged from the underlying concepts.
figure 5. influences on the robustness from underlying concepts
regarding another practice, "(p23) use tools to publish known cybersecurity vulnerabilities, for example, the common vulnerabilities and exposures (cve®)", we realized that publishing known vulnerabilities allows developers to be aware of the danger. also, they can protect their applications and/or help to fix the vulnerability. knowing security problems helps to withstand and recover from security collapses. another example is that "(p38) use some tools to compute some quality metrics" contributes to robustness when existing tools collect quality metrics to identify problems in advance and avoid crises.
5.3.2 productivity
productivity refers to the ability of the ecosystem to transform inputs into outputs efficiently. a productive ecosystem works efficiently, progressively reducing its costs. besides, it also creates new production techniques and delivers them to all members to improve the productivity of the whole community (iansiti and levien, 2002). the impact of practices on this indicator is clear: some practices directly influence the way work is done. this way, our research can observe and deduce the reasoning behind the influences.
for instance, practices such as "(p35) provide meetings (sprints) face-to-face to accelerate the development of critical issues and solve development problems with interdependent modules" and "(p2) during code review, provide feedback information about architecture, good practices of coding, do refactoring and show the best way to solve problems" intend to reduce the time for exchanging information among members in face-to-face meetings. also, they can train inexperienced members and keep the focus on the development of code, instead of losing time trying to learn things alone. moreover, practices such as "(p39) use automated tests to gather problems with the code recently added" and "(p40) provide an automated process to launch releases of applications" use automation for improving tests and release launching, reducing the time of these development tasks. they also avoid human errors: automated tests contribute to discovering errors early, and automated release launching contributes to reducing launching problems, ensuring that all files are included in the final package. figure 6 illustrates how the influence on productivity emerged from the underlying concepts.
figure 6. influences on the productivity from underlying concepts
5.3.3 niche creation
this indicator refers to the capacity to create opportunities, add new functions, and carry out innovation in the ecosystem. niche creation is represented by the number of new features and technologies created by the ecosystem. a good level of niche creation is expressed by new business scenarios and technologies or ideas that aggregate value for the ecosystem. diversity creates value, contributing to the innovation of the ecosystem (iansiti and levien, 2002). from the practices adopted, several business opportunities can be generated more easily and naturally. the practice "(p18) provide information about the translation process (how to participate and tools used)" contributes to the expansion of the ecosystem to new markets: translation into other languages can create opportunities for new applications supporting new needs in other markets. in addition, the practice "(p28) create partnerships with third parties to solve problems of the core and their interfaces" leads third parties to add new features to the ecosystem to support their interests. these new features will aggregate innovation, and their cost is shared with the community. figure 7 shows a part of the influence on niche creation that emerged from the underlying concepts. another example is that "(p4) provide internal or third-party mentoring programs to train newcomers" facilitates the appearance of new ideas through the reception of new developers.
figure 7. influences on the niche creation from underlying concepts
5.3.4 trustworthiness
we propose trustworthiness as a novel health indicator to address the need to express the ordinary functioning of the ecosystem, when its behavior is as expected, without unwanted surprises. we characterize the trustworthiness indicator as the likelihood of an ecosystem working as expected and doing nothing beyond what it is supposed to. this indicator expresses the level of accomplishment of the following tasks: (i) facilitating interactions among organizations and third parties; (ii) increasing the attractiveness for new users/developers; (iii) sharing the maintenance with ecosystem partners; (iv) sharing the cost of innovation; and (v) incorporating into the platform features developed by third parties.
in addition, the ecosystem should not have security issues such as the presence of faults, catastrophic consequences, unauthorized disclosure of information, and unauthorized access. these characteristics ensure that the ecosystem does not do anything it should not do. likewise, it should deliver a correct service for a given duration and within a reasonable response time. taking all of this into account, we believe that such an ecosystem presents an operation as expected, deserving of trust. figure 8 shows the features of this new health indicator.
figure 8. features of trustworthiness
the term "trustworthiness" has been used with different meanings by several authors. first, deljoo et al. introduced a computational trust model providing mechanisms for estimating trustworthiness and assessing trust. this model can support decisions to establish future relationships. their concept of trustworthiness involves taking the risk of using the system no matter what, without monitoring or controlling the environment. this concept makes members vulnerable to the actions of other members, based on the expectation that specific actions will be taken (deljoo et al., 2018). second, trustworthiness for open source ecosystems was studied by del bianco et al. they proposed a notion of trustworthiness for software products and artifacts in open source systems (oss) and identified some factors that influence this notion. the idea is to support the choice of oss products that can meet user needs. this model identifies strengths and weak points to help improve the quality of the choice of oss products (del bianco et al., 2009). next, becker et al. discussed the terminology for the term trustworthiness in software systems, presenting its definition and characteristics (becker et al., 2006). for them, the trustworthiness of a system means that the system operates as expected, despite disruptions or errors in the environment. in their study, trustworthiness is composed of other characteristics such as security, reliability, privacy, safety, and survivability. this concept is very comprehensive, including several features, and is very similar to robustness. lastly, regarding ecosystem health, franco-bedoya et al. introduce trustworthiness as "the ability to establish a trusted partnership of shared responsibility in building an overall open source ecosystem" (franco-bedoya et al., 2015). in their model, trustworthiness is a sub-characteristic of resource health. related to financial health, trustworthiness is represented by operational financial measures. the trusted partnership should create value for end products. based on these concepts, we suggest trustworthiness as a new health indicator. we analyzed the definition of software ecosystems and considered the trustworthiness of the whole ecosystem, not only of its products, but including all relationships in the ecosystem scenario. trustworthiness encompasses the normal operation of the ecosystem. applying bosch's definition of software ecosystems and their characteristics, we figured out a normal behavior for an ecosystem (bosch, 2009). figure 9 shows part of the influence on trustworthiness that emerged from the underlying concepts.
figure 9. emergence of the indicator trustworthiness from underlying concepts
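to make this characterization concrete, here is a minimal sketch (ours, not the paper's; all field and function names are illustrative) that encodes the five tasks and the security expectation as a boolean checklist. the paper derives trustworthiness qualitatively, so this is only one possible representation, not a measurement model:

```python
from dataclasses import dataclass

# illustrative checklist for the trustworthiness characterization above.
# the field names are ours; the paper defines the indicator qualitatively.
@dataclass
class TrustworthinessChecklist:
    facilitates_third_party_interaction: bool  # task (i)
    attracts_new_members: bool                 # task (ii)
    shares_maintenance_with_partners: bool     # task (iii)
    shares_innovation_cost: bool               # task (iv)
    incorporates_third_party_features: bool    # task (v)
    free_of_known_security_issues: bool        # faults, leaks, unauthorized access

    def accomplished_tasks(self) -> int:
        """count how many of the five tasks are accomplished."""
        return sum([
            self.facilitates_third_party_interaction,
            self.attracts_new_members,
            self.shares_maintenance_with_partners,
            self.shares_innovation_cost,
            self.incorporates_third_party_features,
        ])

    def operates_as_expected(self) -> bool:
        """a naive reading: all five tasks hold and no known security issues."""
        return self.accomplished_tasks() == 5 and self.free_of_known_security_issues
```

a checklist of this kind only mirrors the definition; it says nothing about the degree of influence of individual practices, which, as discussed in section 6, we could not quantify.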
in our study, during the data analysis process with the gt-inspired approach, we also investigated how the practices adopted by the ecosystems influence this novel health indicator. we found thirty-nine practices that contribute in some way to the ecosystem working as expected. for example, the practice "(p7) provide a newcomer-specific page or portal guiding their first steps, including development information" helps to increase the attractiveness for newcomers. this is because the portal facilitates the first steps for newcomers, preventing them from abandoning the attempt to become involved in the community. another practice, "(p12) provide updated official documentation about the code's organizational structure, and how the components, modules, classes, and packages are related to each other", makes it easy to share maintenance with third parties, because the documentation provides knowledge about the code infrastructure. the practice "(p28) create partnerships with third parties to solve problems of the core and their interfaces" facilitates interactions among organizations and third parties, as well as sharing the cost of innovation, since the partners aggregate valuable features to the core. in addition, the practice "(p36) define minimal quality criteria requirements to add an application to the ecosystem (documentation, automated tests, dependence restrictions)" provides a set of rules, such as automated tests and documentation, to ensure the quality of the components that will be incorporated into the ecosystem platform.
6 discussion
our initial effort was purely exploratory, aiming to know if and how architectural practices could influence ecosystem health. to answer our research questions, first, we investigated which practices were used in the design of open source ecosystems. the netnography-based approach allowed us to collect several practices in their context of use. the grounded theory-inspired approach helped us to identify the motivations for adopting these practices and the influences of their adoption. the investigation led us to a set of concepts about the reasons for using such practices. the findings of the study include a number of relevant factors and indicators for understanding the relationship between software architecture and the health of open source ecosystems. figure 10 summarizes our findings.
figure 10. summary of findings
our findings reinforce previous statements about the influence factors found in the literature. bass et al. (2012) described some factors that influence the architecture; however, we organized these influence factors considering architectural practices and related characteristics. in addition, with respect to health indicators, our work also reinforces the indicators introduced by iansiti and levien (2002). moreover, this research suggests a new health indicator: trustworthiness. in our analysis, we highlighted that some practices contributed to improving and keeping the normal operation of the ecosystems in accordance with what is expected, so we reframed this characteristic as a new health indicator. a software ecosystem needs to provide trustworthiness to the community members to achieve a good health state. however, we observed a thin line between the trustworthiness concept of becker et al. (2006) and our proposal.
they effectively treat trustworthiness as robustness; for us, these are different concepts. robustness is the ability of the ecosystem to survive disruptions, while trustworthiness considers its operation under normal conditions, representing what is expected from the ecosystem in normal scenarios. trustworthiness can also be questioned considering the business behavior of the organization boards. in some cases, the board defines rules that are not in total agreement with the whole community. as a result, third parties cannot fully trust the ecosystem, but they continue engaging in it aiming for some profit. for us, if you are engaged in an ecosystem, you are subordinated to the rules determined by the organization, even though your judgment can sometimes disagree with them. you know what to expect from them, and you trust them to do nothing against you: you believe your application will not be harmed and that, despite the rules, you will still make a profit. trustworthiness is exactly that: believing that the ecosystem plays its role as an ecosystem, without harming you. regarding the data analysis, we also observed that different influence factors can affect the same practice. one factor can determine various practices, driving the ecosystem to achieve its goals. at the same time, we realized that a practice can impact different health indicators. this is because the consequences of a practice spread their effects throughout the ecosystem. in our study, not all fifty practices are adopted by all ecosystems. in fact, none of the ecosystems has adopted the entire set of practices; however, each practice is carried out in at least one ecosystem. in addition, we also found some interesting and unique practices that were outside the scope of this study. they were found in only one ecosystem; however, they could also be adopted by others. for example, the gitlab ecosystem establishes a "hall of fame" where, for every release, team members elect a community contributor as the mvp (most valuable person) of the release; this member receives the prestigious golden fork. this practice could increase the attractiveness for new developers. in another practice, gitlab provides an automatic changelog registering all changes made; this way, everyone can be aware of the changes and analyze their impact. in addition, open edx establishes a contract for contributors about intellectual property rights, protecting itself from future problems with property rights. moreover, they have a practice of providing accessibility guidelines to ensure that any user interface is usable by everyone, regardless of any physical limitations. in summary, some ecosystems develop different practices according to their particular needs. however, our research focused on practices that we perceive to influence the software architecture; these unique practices observed during the netnography phase were not considered in our study. the overall direction of the results showed trends that could be helpful for learning how to build a good healthy state. knowing the motivations and influences of the practices on health will contribute to understanding mechanisms to make a healthy ecosystem. the relevance of our results, using such architectural practices to figure out the health state, encompasses a crucial part of the ecosystem: the software architecture. as mentioned previously, the architecture plays a key role in supporting the entire ecosystem.
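the many-to-many structure described above can be made concrete with a small sketch that cross-references excerpts of tables 10 and 11. the practice lists are taken verbatim from those tables; the helper functions and their names are ours:

```python
# illustrative cross-reference of tables 10 and 11 (excerpts only; helper
# names are ours). it shows the many-to-many structure discussed above:
# one factor drives several practices, and one practice can touch
# several health indicators.
FACTOR_TO_PRACTICES = {
    "requirements": {"p3", "p16", "p19", "p20", "p27", "p36", "p37", "p49"},
    "experience": {"p1", "p2", "p4", "p5", "p9", "p11", "p12", "p15", "p17",
                   "p19", "p20", "p22", "p27", "p31", "p35", "p37", "p41",
                   "p42", "p43", "p45", "p50"},
}

INDICATOR_TO_PRACTICES = {
    "robustness": {"p2", "p3", "p5", "p8", "p9", "p10", "p18", "p19", "p20",
                   "p23", "p24", "p28", "p33", "p35", "p36", "p37", "p38",
                   "p42", "p43", "p44", "p49"},
    "niche creation": {"p1", "p2", "p3", "p4", "p14", "p16", "p18", "p25",
                       "p28", "p35"},
}

def factors_of(practice: str) -> set[str]:
    """influence factors associated with a practice (table 10 excerpt)."""
    return {f for f, ps in FACTOR_TO_PRACTICES.items() if practice in ps}

def indicators_of(practice: str) -> set[str]:
    """health indicators a practice influences (table 11 excerpt)."""
    return {i for i, ps in INDICATOR_TO_PRACTICES.items() if practice in ps}

# p3 ("document apis constantly") maps to the requirements factor and,
# among the excerpted indicators, to both robustness and niche creation.
print(factors_of("p3"), indicators_of("p3"))
```

querying p3 returns one factor but two of the excerpted indicators, which is exactly the spreading of effects discussed in the paragraph above.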
we are aware that ecosystem health is constructed from a large set of elements; however, our research focuses only on the architectural component. the software architecture is built based on a set of factors that determine how the practices will be performed. in turn, the results show evidence that the adopted practices affect all products in the ecosystem, as well as their management, prosperity, and longevity. according to jacobson, practices can be easily disseminated and used many times to produce some results. they provide a picture of a specific aspect of software development, describing outcomes and how to achieve them (jacobson et al., 2007). in addition, they can be analyzed individually to clarify part of the impact of their use. this way, understanding the context and mechanisms of the adoption of practices could signal influences on the health state. this approach opens up ways to investigate different health scenarios and guide choices according to ecosystem strategies. our interpretation considers that the architectural practices used directly impact the health status of the ecosystem. however, further studies are needed to explain how the practices can be used to improve the health of the ecosystem. so far, it has not been possible to extract a quantitative value for the level of influence of each practice, just as we do not know the implications of interactions between practices, because a practice can also harm another practice instead of helping to improve health. the results of this study create a positive perspective for finding ways to measure influences and also for constructing an approach to guide ecosystem governance.
7 threats to validity
there are some threats to the validity of this study. in order to reduce these threats, we briefly describe some mitigation strategies for them:
• context: the findings of this study are applicable only in the context of open source ecosystems. many practices in commercial ecosystems differ from practices in open source ecosystems. expanding the study to all ecosystem types would allow all practices to be included.
• generalization: the fifty practices came from seven open source ecosystems with different domains and sizes. although mature ecosystems concentrate the majority of practices, we believe that a range of practices is also used by other ecosystems. even different ecosystems have common characteristics for which the findings are relevant.
• observation: due to the methodology used, we could not cover all practices used by the whole community. the absence of some practice in our collected data does not mean that this practice is not used by the community; it can happen when the researchers could not find evidence of the adoption of this practice. besides, the practices collected cannot represent all aspects of the architecture, as there is no ground truth based on architectural practices; the choice was made based on the background of the researchers. moreover, we cannot guarantee that community members or even the governance of the ecosystem are following the practices published on their websites. we assume that they really do what they publish on the internet. in addition, we found some drawbacks during data collection. first, we would read a web page describing a practice, but the corresponding artifact was not found.
for example, the node.js ecosystem states that it has a coding standard, but we did not find it. the inverse situation also occurred, where the practice is performed but we did not find its documentation; for example, some ecosystems translate the system, yet there is no material to guide a translation. moreover, a practice is performed with different levels of detail by different ecosystems. for example, node.js describes the translation process very simply, while kde has a comprehensive and detailed translation process and uses several supporting tools. the translation process at node.js also depends on the language group: the portuguese language group has a basic description, while the spanish language group provides an extensive one. another threat was the challenge of identifying discussions about critical changes in the software architecture when they are not explicitly published; we only found changes shared with the community. we made efforts to mitigate missing information by searching data in several links on the ecosystem websites. essentially, netnography is performed by a single researcher who engages in an online community, in a participatory role, mediated by computers, to gather research data. in our study, only a single researcher conducted data collection and analysis. when the netnographer does not have a participatory role immersed in the community, kozinets (2009) explains that they are compelled to make assumptions about meanings they do not fully understand. this is a weakness of the approach. to mitigate misinterpretation of some practices, the netnographer visited several pages in the community to clarify doubts. also, during the writing and review process, some questions could be resolved by the authors. lastly, the research focused on identifying evidence of adopted practices and their influences; it was not in the scope of the study to identify practices that, in our experience, should be adopted but are not.
• data analysis: the scope of this study was to identify practices and their influences. we could not identify the comprehensive reasons for different ecosystems adopting different sets of architectural practices. the occurrence of each practice in each project was recorded during data collection. however, we analyzed the influences of the practices in a general way for all ecosystems, without considering a particular influence for each ecosystem. moreover, we could not measure the level of influence of a practice on the health of the ecosystems. in addition, one researcher conducted the data collection and analysis process. consequently, some bias may have been introduced by the researcher, originating from her own ideas about some practice, situation, or ecosystem behavior. also, the researcher may be inclined to follow preconceived ideas about the ecosystem area, resulting in discovering a limited scope of information and/or omitting other important data concerning architectural practices and ecosystem health. to mitigate the ill effects of this bias, the researcher was extra careful in the processes of data collection, analysis, and interpretation, reviewing her conclusions. also, the conclusions in the written article were analyzed and reviewed by the other researchers, who had the complete dataset available to them.
• time: a netnographic study is performed over a long time and with prolonged involvement with the community.
our activities focused on the observation of the practices for a short time. due to the short observation period, we may not have found or understood some practices used by the community. to avoid misunderstandings, we searched several web pages to confirm the adoption of a practice.
8 conclusion
a healthy ecosystem works towards achieving and maintaining success. usually, its health state is represented by health indicators. these indicators are influenced by practices adopted by the ecosystem members, more specifically, the architectural practices used to create and maintain the software architecture of ecosystems. this study aimed at understanding the universe of software architectural practices in ecosystem scenarios, the factors that can influence their adoption, and how these architectural practices influence the health of open source ecosystems. our experiences can help other researchers to understand the influences of architectural practices on the health of open source ecosystems. in particular, the main contributions of this work are the identification and discussion of five factors influencing the adoption of architectural practices, as well as the knowledge about how these practices influence four health indicators of ecosystem health. we proposed a novel health indicator – trustworthiness – to represent a health dimension that has not been considered in previous studies. the degree of trustworthiness of the whole ecosystem indicates how much the ecosystem operates as expected regarding its ordinary behavior, including all ecosystem views. this can signal a good accomplishment of the ecosystem activities. this work is an initial step towards a health evaluation approach considering the influence of architectural practices on ecosystem health. future work includes the replication of this study for proprietary software ecosystems, rather than open source environments, highlighting similarities and differences in results in the scope of that environment. in addition, we could establish a standard for the most common practices used in software ecosystems. besides, we could clarify the explicit benefits of adopting the best practices. further investigation is also needed to determine whether these findings could be applied to build an approach to measure ecosystem health. we plan to identify qualitative and quantitative ways of understanding the weight of the software architecture on ecosystem health.
acknowledgements
this study was partially supported by the coordenação de aperfeiçoamento de pessoal de nível superior – brasil (capes) – grant 001.
references
amorim, s., mcgregor, j., almeida, e., and chavez, c. (2017). software ecosystems' architectural health: another view. in proc. of the 5th icse int. workshop on software engineering for sos and 11th workshop on distributed software development, software ecosystems and sos, sesos/wdes '17, pages 66–69.
avelino, g., constantinou, e., valente, m. t., and serebrenik, a. (2019). on the abandonment and survival of open source projects: an empirical investigation. in proceedings of the acm/ieee international symposium on empirical software engineering and measurement, esem '19, pages 1–12.
bass, l., clements, p. c., and kazman, r. (2012). software architecture in practice. addison-wesley professional, third edition.
becker, s., hasselbring, w., paul, a., boskovic, m., koziolek, h., ploski, j., dhama, a., lipskoch, h., rohr, m., winteler, d., giesecke, s., meyer, r., swaminathan, m., happe, j., muhle, m., and warns, t. (2006). trustworthy software systems: a discussion of basic concepts and terminology. sigsoft softw. eng. notes, 31(6):1–18.
bogart, c., kästner, c., herbsleb, j., and thung, f. (2021). when and how to make breaking changes: policies and practices in 18 open source software ecosystems. acm transactions on software engineering and methodology, 30(4).
bosch, j. (2009). from software product lines to software ecosystems. in proceedings of the 13th international software product line conference, splc '09, pages 111–119.
bosch, j. (2010). architecture challenges for software ecosystems. in proceedings of the fourth european conference on software architecture, ecsa '10, pages 93–95.
bosch, j. and bosch-sijtsema, p. (2010). from integration to composition: on the impact of software product lines, global development and ecosystems. the journal of systems and software, 83:67–76.
bouwman, h., carlsson, c., carlsson, j., nikou, s., sell, a., and walden, p. (2014). how nokia failed to nail the smartphone market. in proceedings of the 25th european regional conference of the international telecommunications society, its '14, pages 1–18.
bredemeyer, d., malan, r., and consulting, b. (2000). the role of the architect.
campbell, p. r. j. and ahmed, f. (2010). a three-dimensional view of software ecosystems. in proceedings of the fourth european conference on software architecture, ecsa '10, pages 81–84.
campbell, p. r. j. and ahmed, f. (2011). an assessment of mobile os-centric ecosystems. journal of theoretical and applied electronic commerce research, 6:50–62.
charleux, a. and viseur, r. (2019). exploring impacts of managerial decisions and community composition on the open source projects' health. in proceedings of the 2nd international workshop on software health, soheal '19, pages 1–8.
charmaz, k. (2006). constructing grounded theory: a practical guide through qualitative analysis. sage publications.
coleman, g. and o'connor, r. (2007). using grounded theory to understand software process improvement: a study of irish software product companies. information and software technology, 49:654–667.
da silva amorim, s., neto, f. s. s., mcgregor, j. d., de almeida, e. s., and von flach garcia chavez, c. (2017). how has the health of software ecosystems been evaluated? a systematic review. in proceedings of the 31st brazilian symposium on software engineering, sbes '17.
del bianco, v., lavazza, l., morasca, s., and taibi, d. (2009). quality of open source software: the qualipso trustworthiness model. in proceedings of the ifip international conference on open source systems, oss '09, pages 199–212.
deljoo, a., van engers, t., gommans, l., and de laat, c. (2018). the impact of competence and benevolence in a computational model of trust. in proceedings of the ifip international conference on trust management, ifiptm '18, pages 45–57.
dijkers, j., sincic, r., wasankhasit, n., and jansen, s. (2018). exploring the effect of software ecosystem health on the financial performance of the open source companies. in proceedings of the 1st international workshop on software health, soheal '18, pages 48–55.
dos santos, r. p. and werner, c. (2011). treating business dimension in software ecosystems. in proceedings of the international conference on management of emergent digital ecosystems, medes '11, pages 197–201.
franco-bedoya, o., ameller, d., costal, d., and franch, x. (2015). measuring the quality of open source software ecosystems using queso. in proc. of the 10th int. conference on software technologies (icsoft), pages 39–62.
garlan, d. and perry, d. e. (1995). introduction to the special issue on software architecture. ieee transactions on software engineering, 21(4):269–274.
goggins, s., lumbard, k., and germonprez, m. (2021). open source community health: analytical metrics and their corresponding narratives. in proceedings of the 4th international workshop on software health in projects, ecosystems and communities (soheal), soheal '21.
hyrynsalmi, s., ruohonen, j., and seppänen, m. (2018). healthy until otherwise proven: some proposals for renewing research of software ecosystem health. in proceedings of the 1st international workshop on software health (soheal), soheal '18, pages 18–24.
iansiti, m. and levien, r. (2002). keystones and dominators: framing operating and technology strategy in a business ecosystem. harvard business school, 3(61).
jacobson, i., ng, p. w., and spence, i. (2007). enough of processes - let's do practices. journal of object technology, 6(6):41–66.
jade, v. (2019). software architecture tools. iasa global. accessed: 2019-05-12.
jansen, s. (2020). a focus area maturity model for software ecosystem governance. information and software technology, 118.
jansen, s., cusumano, m., and brinkkemper, s. (2013). software ecosystems: analyzing and managing business networks in the software industry. edward elgar publishers.
jansen, s. and cusumano, m. a. (2013). defining software ecosystems: a survey of software platforms and business network governance. in jansen, s., brinkkemper, s., and cusumano, m., editors, software ecosystems: analyzing and managing business networks in the software industry, chapter 1, pages 13–28. edward elgar publishing.
jensen, c. and scacchi, w. (2004). data mining for software process discovery in open source software development communities. in proceedings of the 1st international workshop on mining software repositories, msr '04, pages 96–100.
jensen, c. and scacchi, w. (2005). experiences in discovering, modeling, and reenacting open source software development processes. in proceedings of the international software process workshop, spw '05, pages 449–462.
kozinets, r. (2009). netnography: doing ethnographic research online. sage publications.
kruchten, p., nord, r. l., and ozkaya, i. (2012). technical debt: from metaphor to theory and practice. ieee software, 29:18–21.
liao, z., yi, m., wang, y., liu, s., liu, h., zhang, y., and zhou, y. (2019). healthy or not: a way to predict ecosystem health in github. symmetry, 11(2).
liu, d. (2017). the art of building platforms. forbes. accessed: 2018-11-16.
manikas, k. and hansen, k. m. (2013). software ecosystems – a systematic literature review. journal of systems and software, 86:1294–1306.
mittal, r. (2019). blackberry ltd. marketing downfall in mobile handset industry. technical report, university of roehampton eu business school.
pelliccione, p. (2014). open architectures and software evolution: the case of software ecosystems. in proceedings of the 23rd australian software engineering conference, aswec '14, pages 66–69.
saldaña, j. (2009). the coding manual for qualitative researchers. sage publications.
satell, g. (2016). platforms are eating the world. forbes. accessed: 2018-11-16.
seaman, c. b.
(1999). qualitative methods in empirical studies of software engineering. ieee transactions on software engineering, 25(4):557–572.
sigfridsson, a. and sheehan, a. (2011). on qualitative methodologies and dispersed communities: reflections on the process of investigating an open source community. information and software technology, 53:981–993.
silverman, d. (2021). apple back on top: iphone is the best-selling smartphone globally in q4 2020. forbes. accessed: 2021-03-24.
stol, k.-j., ralph, p., and fitzgerald, b. (2016). grounded theory in software engineering research: a critical review and guidelines. in proceedings of the 38th international conference on software engineering, icse '16, pages 120–131.
weinreich, r. and groher, i. (2016). the architect's role in practice: from decision maker to knowledge manager? ieee software, 33:63–69.
west, j. and mace, m. (2010). browsing as the killer app: explaining the rapid success of apple's iphone. telecommunications policy, 34:270–286.
wnuk, k., manikas, k., runeson, p., lantz, m., weijden, o., and munir, h. (2014). evaluating the governance model of hardware-dependent software ecosystems – a case study of the axis ecosystem. in proc. of the 4th international conference on software business (icsob), pages 212–226.
a architectural practices
we identified fifty architectural practices during our study. table 12 presents the architectural practices; for each practice, we report how many of the seven studied ecosystems (gitlab, jenkins, kde, mapserver, node.js, open edx, and wordpress) adopt it.
table 12. architectural practices (adoption counts over the seven studied ecosystems)
id | architectural practice | adoption
p1 | create personal blogs and/or wikis to inform about the development and architectural issues | 6 of 7
p2 | during code review, provide feedback information about architecture, good practices of coding, do refactoring and show the best way to solve problems | 6 of 7
p3 | document apis constantly | 3 of 7
p4 | provide internal or third-party mentoring programs to train newcomers | 5 of 7
p5 | provide code recommendations, defining a standard in the community | 5 of 7
p6 | keep a register of meetings available to the community to know all decisions of the meeting | 6 of 7
p7 | provide a newcomer-specific page or portal guiding their first steps, including development information | 7 of 7
p8 | identify and dismiss outdated information on websites | 2 of 7
p9 | provide generation of (semi-)automated documentation filtered to up-to-date information relevant to newcomers | 6 of 7
p10 | answer questions on the mailing list quickly | 7 of 7
p11 | create a detailed step-by-step tutorial linking information about common problems and possible solutions | 4 of 7
p12 | provide updated official documentation about the code's organizational structure, and how the components, modules, classes, and packages are related to each other | 6 of 7
p13 | support the participation of the ecosystem on aggregators' sites such as stack overflow, reddit, hacker news, and so on | 7 of 7
p14 | provide a dictionary to newcomers to facilitate their learning of the technical jargon and acronyms of the community | 2 of 7
p15 | provide video classes or tutorials about introducing the ecosystem, installing technologies, configuring the development environment, and using the dependencies with other ecosystems | 6 of 7
p16 | provide a manifesto explaining requirements of an application belonging to the ecosystem | 2 of 7
p17 | provide video-conference sessions with questions and answers about relevant topics | 2 of 7
p18 | provide information about the translation process (how to participate and tools used) | 7 of 7
p19 | provide documentation about translation rules that developers must follow to prepare code for other languages | 6 of 7
p20 | keep the backward compatibility for a medium or long time to allow the community to update their software | 2 of 7
p21 | provide different levels of security access for the parts of the ecosystem in accordance with the degree of commitment and tasks in the ecosystem | 2 of 7
p22 | point newcomers to easy tasks filtered by difficulty, skills needed, and topics | 5 of 7
p23 | use tools to publish known cybersecurity vulnerabilities, for example, the common vulnerabilities and exposures (cve®) | 5 of 7
p24 | keep a team to manage the security problems registered | 5 of 7
p25 | prefer the use of written english to avoid misunderstanding | 2 of 7
p26 | do online meetings in a timezone adequate for most of the community | 3 of 7
p27 | provide several online meetings to discuss architectural problems by irc or email | 7 of 7
p28 | create partnerships with third parties to solve problems of the core and their interfaces | 3 of 7
p29 | set a code of conduct to avoid mistreatment among members | 4 of 7
p30 | provide a message template for newcomers to use to interact with the community | 5 of 7
p31 | the organization board defines some technologies that should be used by the whole community as tools for testing, communication, code review, bug management, and navigation | 6 of 7
p32 | the organization board provides hardware and software resources to be used by the community | 2 of 7
p33 | define a financial board to manage the financial resources | 6 of 7
p34 | provide financial resources to support face-to-face meetings | 6 of 7
p35 | provide meetings (sprints) face-to-face to accelerate the development of critical issues and solve development problems with interdependent modules | 7 of 7
p36 | define minimal quality criteria requirements to add an application to the ecosystem (documentation, automated tests, dependence restrictions) | 3 of 7
p37 | define a team to test performance and behavior of the application | 3 of 7
p38 | use some tools to compute some quality metrics | 3 of 7
p39 | use automated tests to gather problems with the code recently added | 7 of 7
p40 | provide an automated process to launch releases of applications | 4 of 7
p41 | divide the parts of the software into layers, defining restrictions for managing dependencies among the layers | 2 of 7
p42 | discuss with the community critical changes to the architecture that will impact the applications | 4 of 7
p43 | publish widely the architectural changes for the community | 5 of 7
p44 | build the architecture based on plug-ins to facilitate the coupling of applications | 5 of 7
p45 | provide guidelines with a set of steps to be followed by developers on how to add code to the repository | 2 of 7
p46 | provide a virtual machine with pre-configured build environments, web-based ides, or a container management tool | 1 of 7
p47 | inform newcomers about the technical background required, identifying which specific technologies they need to know or should learn to achieve their goal of contributing to the ecosystem | 1 of 7
p48 | tag all tasks in accordance with degree of difficulty (easy, medium, difficult) | 1 of 7
p49 | provide a group for gardening to care for the global state of the ecosystem | 1 of 7
p50 | keep the list of tasks updated, informing who is working on the solution | 1 of 7
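as a closing illustration, the sketch below checks the coverage property reported in section 6 (every practice is adopted by at least one ecosystem, and no ecosystem adopts all fifty) against an adoption matrix shaped like table 12. the per-ecosystem entries shown are hypothetical, except where the text names the adopter (kde for p49); the helper names are ours:

```python
# illustrative check of the coverage property discussed in section 6:
# every practice is adopted by at least one ecosystem, and no single
# ecosystem adopts the entire set of practices. the matrix below is
# hypothetical (per-ecosystem assignments are not reproduced here);
# only its shape mirrors table 12.
ECOSYSTEMS = ["gitlab", "jenkins", "kde", "mapserver",
              "node.js", "open edx", "wordpress"]

# adoption[practice_id] = set of ecosystems adopting it (hypothetical data)
adoption: dict[str, set[str]] = {
    "p7": set(ECOSYSTEMS),   # adopted everywhere (7 of 7 in table 12)
    "p49": {"kde"},          # gardening, reported for kde in section 5.2.3
    "p46": {"open edx"},     # hypothetical single adopter
}

def every_practice_covered(adoption: dict[str, set[str]]) -> bool:
    """true if each practice is carried out in at least one ecosystem."""
    return all(len(ecos) >= 1 for ecos in adoption.values())

def no_full_adopter(adoption: dict[str, set[str]]) -> bool:
    """true if no ecosystem adopts the entire set of practices."""
    return not any(
        all(eco in ecos for ecos in adoption.values()) for eco in ECOSYSTEMS
    )

print(every_practice_covered(adoption), no_full_adopter(adoption))
```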